JPH0795237B2 - Adaptive Multivariate Estimator

Info

Publication number: JPH0795237B2
Application number: JP62-506332A
Authority: JP
Inventors: リントムソン，デビット
Original assignee: American Telephone and Telegraph Company (アメリカンテレフォンアンドテレグラフカムパニー)
Priority date: 1987-04-03
Filing date: 1988-01-12
Publication date: 1995-10-11
Anticipated expiration: 2010-10-11
Also published as: WO1988007738A1; HK106693A; CA1338251C; ATE82426T1; SG59893G; DE3875894T2; AU1222688A; JPH0795237B1; CA1337708C; JPH01502779A; EP0308433B1; AU599459B2; EP0308433A1; DE3875894D1

Description

[Detailed Description of the Invention] [Technical Field] The present invention relates to classifying samples that represent a real-time process into groups, each group corresponding to one state of the process. In particular, the classification is performed in real time, as each sample is generated, using statistical techniques.

[Background Art and Problems] In many real-time processes, a problem arises when one attempts to estimate the present state of the process in a changing environment from present and past samples of the process. One example of such a process is the generation of speech by the human vocal tract. The sound generated by the vocal tract may have a fundamental frequency (the voiced state) or may have none (the unvoiced state). A third state exists when no sound is produced (the silence state). The problem of distinguishing among these three states is referred to as the voicing/silence decision. In low-bit-rate voice coders, degradation of voice quality is often caused by inaccurate voicing decisions. The difficulty in making these decisions correctly lies in the fact that no single speech parameter, or classifier, can reliably distinguish voiced speech from unvoiced speech. Combining multiple speech classifiers in the form of a weighted sum to make the voicing decision is well known in the art. Such a method is described in D.P. Prezas et al., "Fast and Accurate Pitch Detection Using Pattern Recognition and Adaptive Time-Domain Analysis", Proc. IEEE Int. Conf. Acoust., Speech and Signal Proc., Vol. 1, pp. 109-112, April 1986. As that paper explains, a speech frame is declared voiced if the weighted sum of the classifiers is greater than a specific threshold, and unvoiced otherwise. Mathematically, this relationship is expressed as a'X + b > 0, where "a" is a vector of weights, "X" is a vector of classifiers, and "b" is a scalar representing the threshold. The weights are chosen to maximize performance on a training set of speech in which the voicing of each frame is known. These weights form a decision rule that gives a voice coder significantly better voice quality than a rule based on a single parameter.
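The fixed decision rule a'X + b > 0 can be sketched in a few lines of Python. This is only an illustration: the function name `is_voiced_fixed` and the numeric weights, threshold, and classifier values are hypothetical and are not taken from the cited paper.

```python
def is_voiced_fixed(a, x, b):
    """Return True when the weighted sum a'x + b is positive (frame declared voiced)."""
    if len(a) != len(x):
        raise ValueError("weight and classifier vectors must have equal length")
    score = sum(ai * xi for ai, xi in zip(a, x)) + b
    return score > 0.0


# Hypothetical weights and threshold, chosen only to exercise the rule.
weights = [0.5, -0.25, 1.0]
threshold = -1.0
print(is_voiced_fixed(weights, [2.0, 1.0, 0.5], threshold))
print(is_voiced_fixed(weights, [0.5, 1.0, 0.2], threshold))
```

Because the weights and threshold are fixed after training, the rule cannot follow a change in the speech environment, which is the limitation the following paragraphs address.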

A problem with the fixed weighted sum method is that it does not perform well when the speech environment changes. Such changes may result from a telephone conversation held in a car, that is, over a mobile telephone, or perhaps from the use of a different type of telephone transmitter. The fixed weighted sum method performs poorly in a changing environment because many speech classifiers are affected by background noise, nonlinear distortion, and filtering. If voicing must be determined for speech whose characteristics differ from those of the training set, the weights will generally not give satisfactory results.

One method for adapting the fixed weighted sum method to a changing speech environment is disclosed in C.P. Cambell et al., "Voiced/Unvoiced Classification of Speech with Application to the U.S. Government LPC-10E Algorithm", IEEE International Conference on Acoustics, Speech and Signal Processing, 1986, Tokyo, Vol. 9.11.4, pp. 473-476. That paper discloses the use of several different sets of weights and thresholds, each predetermined from the same set of training data by adding a different level of white noise to the training data for each set. For each frame, the speech samples are processed with one set of weights and a threshold after that set has been selected on the basis of the signal-to-noise ratio (SNR). The range of values that the SNR can take is divided into subranges, each of which is assigned to one of the sets. For each frame, the SNR is calculated, the subrange is determined, and the frame is then classified as voiced or unvoiced. The problem with this method is that it is valid only for the training data with added white noise and cannot adapt to a wide range of speech environments and speakers. There is therefore a need for a voiced detector that can reliably determine whether speech is voiced or unvoiced in changing environments and for different speakers.
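The SNR-based selection step can be sketched as a lookup over subrange boundaries. The sketch below is an assumption-laden illustration, not the cited algorithm: the name `select_weight_set`, the boundary values in dB, and the placeholder weight sets are all hypothetical.

```python
from bisect import bisect_right


def select_weight_set(snr_db, boundaries, weight_sets):
    """Return the (a, b) pair assigned to the SNR subrange containing snr_db.

    boundaries is an ascending list of subrange edges; it splits the SNR axis
    into len(boundaries) + 1 subranges, one per entry of weight_sets.
    """
    if len(weight_sets) != len(boundaries) + 1:
        raise ValueError("need exactly one weight set per SNR subrange")
    return weight_sets[bisect_right(boundaries, snr_db)]


# Hypothetical subranges: below 10 dB, 10 to 20 dB, and 20 dB or above.
edges = [10.0, 20.0]
sets = [([0.1, 0.2], -1.0), ([0.3, 0.4], -2.0), ([0.5, 0.6], -3.0)]
a, b = select_weight_set(14.0, edges, sets)
```

Every set here is fixed in advance, so an environment that differs from white noise at the trained levels has no matching set, which is the weakness the paragraph above points out.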

[Solution] The above problems are solved, and a technical advance is achieved, by an apparatus that responds to real-time samples from a physical process by determining statistical distributions for a plurality of process states and establishing decision regions from those distributions. The decision regions are used to determine the present process state as each process sample is generated. When used to make voicing decisions, the apparatus adapts to a changing speech environment by using the state of the speech classifiers. Statistical techniques are applied to the classifiers and are used to modify the decision regions used in the voicing decision. Preferably, the apparatus estimates statistical distributions for both voiced and unvoiced frames and uses these distributions to determine the decision regions. The decision regions are then used to determine whether the present speech frame is voiced or unvoiced.

Preferably, the voiced detector calculates the probability that the present speech frame is unvoiced, the probability that the present speech frame is voiced, and the overall probability that any frame will be unvoiced. Using these three probabilities, the detector then calculates the probability distribution of unvoiced frames and the probability distribution of voiced frames. The calculation that determines the probability that the present speech frame is voiced or unvoiced is performed using a maximum likelihood statistical technique. The maximum likelihood technique responds to a weight vector and a threshold value in addition to the probabilities. In another embodiment, the weight vector and threshold value are calculated adaptively for each frame. This adaptive calculation allows the detector to adapt rapidly to a changing speech environment.

Preferably, an apparatus for determining the presence of a fundamental frequency in a speech frame has a first circuit that responds to a set of classifiers representing the speech attributes of the frame by calculating a set of statistical parameters. A second circuit responds to the set of parameters, which defines the statistical distributions, by calculating a set of weights, each associated with one of the classifiers. Finally, a third circuit responds to the calculated set of weights, the classifiers, and the set of parameters by determining the presence of a fundamental frequency in the speech frame, that is, in conventional terms, by making the unvoiced/voiced decision.

Preferably, the second circuit also calculates a threshold value and a new weight vector and communicates these values to the first circuit, which responds to them and to a new set of classifiers by determining another set of statistical parameters. This other set of statistical parameters is then used to determine the presence of a fundamental frequency for the next speech frame.

Preferably, the first circuit responds to the next set of classifiers, the new weight vector, and the threshold value by calculating the probability that the next frame is unvoiced, the probability that the next frame is voiced, and the overall probability that any frame will be unvoiced. These probabilities are then used, together with a set of values giving the average of the classifiers over past and present frames, to determine the other set of statistical parameters.

The method for making the voicing decision is performed in the following steps: estimating statistical distributions for voiced and unvoiced frames; determining, in response to the statistical distributions, decision regions representing voiced and unvoiced speech; and making the voicing decision in response to the decision regions and the present speech frame. In addition, the statistical distributions are calculated from the probability that the present speech frame is unvoiced, the probability that the present speech frame is voiced, and the overall probability that any frame will be unvoiced. These three probabilities are calculated as substeps of the step of determining the statistical distributions.

[Brief Description of the Drawings] The present invention will be readily understood from the following detailed description when read with reference to the drawings, in which: FIG. 1 is a block diagram of an apparatus using the present invention; FIG. 2 illustrates the present invention in block diagram form; FIGS. 3 and 4 illustrate in greater detail the functions performed by statistical voiced detector 103 of FIG. 2; and FIG. 5 illustrates in greater detail the functions performed by block 340 of FIG. 4.

[Detailed Description] FIG. 1 illustrates an apparatus for performing the unvoiced/voiced decision operation using, as one of its voiced detectors, the statistical voiced detector that is the subject of this invention. The apparatus of FIG. 1 uses two types of detector: a discriminant voiced detector and a statistical voiced detector. Statistical voiced detector 103 is an adaptive detector that detects changes in the speech environment and modifies the weights used to process the classifiers coming from classifier generator 101 so as to make the unvoiced/voiced decision more accurately. Discriminant voiced detector 102 is used during initial start-up or under rapidly changing speech environment conditions, when statistical voiced detector 103 has not yet fully adapted to the initial or new speech environment.

Consider now the overall operation of the apparatus of FIG. 1. Classifier generator 101 responds to each speech frame by generating classifiers, which are preferably the log of the speech energy, the log of the LPC (linear predictive coding) gain, the log area ratio of the first reflection coefficient, and the squared correlation coefficient of two speech segments, each one frame long, that are offset by one pitch period. Computing these classifiers involves digitally sampling the analog speech, forming frames of the digital samples, and processing those frames, all of which is well known in the art. Generator 101 transmits the classifiers to detectors 102 and 103 via path 106.

Detectors 102 and 103 respond to the classifiers received via path 106 by making unvoiced/voiced decisions and transmit these decisions to multiplexer 105 via paths 107 and 110, respectively. In addition, each detector determines a distance measure between voiced and unvoiced frames and transmits this distance to comparator 104 via paths 108 and 109. These distances are preferably Mahalanobis distances or other generalized distances. Comparator 104 responds to the distances received via paths 108 and 109 by controlling multiplexer 105 so that the multiplexer selects the output of the detector that is generating the largest distance.
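The comparator and multiplexer stage amounts to choosing the decision of whichever detector reports the larger voiced/unvoiced separation. A minimal sketch follows; the function name `select_decision`, the tuple layout, and the tie-breaking rule (ties go to the discriminant detector) are assumptions made for illustration, since the text does not say how ties are resolved.

```python
def select_decision(discriminant, statistical):
    """Pick the voicing decision of the detector reporting the larger distance.

    Each argument is a (voiced, distance) pair, where voiced is a bool and
    distance is that detector's separation measure. Ties fall to the
    discriminant detector (an assumption, not stated in the text).
    """
    voiced_d, distance_d = discriminant
    voiced_s, distance_s = statistical
    return voiced_s if distance_s > distance_d else voiced_d


# The statistical detector reports the larger separation, so its decision wins.
frame_is_voiced = select_decision((False, 1.5), (True, 4.0))
```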

FIG. 2 illustrates statistical voiced detector 103 in greater detail. For each speech frame, a set of classifiers, also called a vector of classifiers, is received from classifier generator 101 via path 106. Silence detector 201 responds to these classifiers by determining whether speech is present in the frame. If speech is present, detector 201 transmits a signal via path 210. If no speech is present in the frame (silence), only subtracter 207 and U/V (unvoiced/voiced) discriminator 205 operate for that frame. Whether or not speech is present, discriminator 205 makes an unvoiced/voiced decision for every frame.

Classifier averager 202 responds to the signal from detector 201 by maintaining an average of each classifier received via path 106, averaging the classifiers for the present frame with the classifiers for previous frames. If speech (not silence) is present in the frame, silence detector 201 signals statistical calculator 203, generator 206, and averager 202 via path 210.

Statistical calculator 203 calculates the statistical distributions for voiced and unvoiced frames. In particular, calculator 203 responds to the signal received via path 210 by calculating the overall probability that any frame is unvoiced and the probability that any frame is voiced. In addition, statistical calculator 203 calculates the statistical value that each classifier would have if the frame were unvoiced and the statistical value that each classifier would have if the frame were voiced. Calculator 203 also calculates the covariance matrix of the classifiers. The statistical value is preferably the mean. The calculations performed by calculator 203 are based not only on the present frame but also on previous frames. Statistical calculator 203 bases these calculations not only on the classifiers for the present frame received via path 106 and the classifier averages received via path 211, but also on the weight for each classifier and the threshold value, received from weights calculator 204 via path 213, that determine whether a frame is unvoiced or voiced.

Weights calculator 204 responds to the probabilities, the covariance matrix, and the statistical values of the classifiers for the present frame, generated by calculator 203 and received via path 212, by recalculating the vector a of weights for the classifiers and the threshold value b for the present frame. These new values of a and b are then transmitted back to statistical calculator 203 via path 213.

Weights calculator 204 also transmits the weights and the statistical values of the classifiers in both the unvoiced and voiced regions to discriminator 205 via path 214 and to generator 206 via path 208. Generator 206 responds to this information by calculating the distance measure, which is then transmitted to comparator 104 via path 109, as shown in FIG. 1.

U/V (unvoiced/voiced) discriminator 205 responds to the information transmitted via paths 214 and 215 by determining whether the frame is unvoiced or voiced, and transmits this decision to multiplexer 105 of FIG. 1 via path 110.

The operation of each block shown in FIG. 2 is now described in greater detail, in terms of vector and matrix mathematics. Averager 202, statistical calculator 203, and weights calculator 204 implement an improved EM algorithm similar to that described in N.E. Day, "Estimating the Components of a Mixture of Normal Distributions", Biometrika, Vol. 56, No. 3, pp. 463-474, 1969. Using the concept of a decaying average, classifier averager 202 calculates the average of the classifiers for the present and previous frames by computing equations (1), (2), and (3) below.

$$ n = n + 1 \quad \text{if } n < 2000 \tag{1} $$
$$ z = 1/n \tag{2} $$
$$ X_n = (1 - z)\,X_{n-1} + z\,x_n \tag{3} $$

Here x_n is a vector representing the classifiers for the present frame, and n is the number of frames processed, up to 2000. z is the decaying-average coefficient, and X_n is the average of the classifiers over the present frame and all past frames. In response to z, x_n, and X_n, statistical calculator 203 calculates the covariance matrix T by first calculating the matrix of sums of squares and products, Q_n, as follows:

$$ Q_n = (1 - z)\,Q_{n-1} + z\,x_n x_n' \tag{4} $$

Once Q_n has been calculated, T is calculated as follows:

$$ T = Q_n - X_n X_n' \tag{5} $$

The mean is then subtracted from the classifiers as follows:

$$ x_n = x_n - X_n \tag{6} $$

Next, calculator 203 determines the probability that the frame represented by the present vector x_n is unvoiced by solving equation (7). Preferably, the components of vector a are initialized as follows: the component corresponding to the log of the speech energy to 0.3918606, the component corresponding to the log of the LPC gain to −0.0520902, the component corresponding to the log area ratio of the first reflection coefficient to 0.5637082, and the component corresponding to the squared correlation coefficient to 1.361249. Scalar b is preferably initialized to −8.36454.

[Equation (7) is not reproduced in this text.]
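The recursions in equations (1) through (6) can be sketched with plain Python lists. This is an illustrative sketch under stated assumptions, not the patented implementation: the class name `DecayingStats` and the zero initial values of the mean and of Q are hypothetical choices (with z = 1 on the first frame, the initial values do not affect the result).

```python
class DecayingStats:
    """Decaying-average mean X_n and sums-of-squares-and-products matrix Q_n."""

    def __init__(self, dim, max_frames=2000):
        self.n = 0
        self.max_frames = max_frames
        self.mean = [0.0] * dim                       # X_n
        self.q = [[0.0] * dim for _ in range(dim)]    # Q_n

    def update(self, x):
        """Process one classifier vector; return (z, T, mean-subtracted x)."""
        dim = len(self.mean)
        if len(x) != dim:
            raise ValueError("classifier vector has the wrong length")
        if self.n < self.max_frames:                  # equation (1)
            self.n += 1
        z = 1.0 / self.n                              # equation (2)
        self.mean = [(1 - z) * m + z * xi             # equation (3)
                     for m, xi in zip(self.mean, x)]
        self.q = [[(1 - z) * self.q[i][j] + z * x[i] * x[j]   # equation (4)
                   for j in range(dim)] for i in range(dim)]
        t = [[self.q[i][j] - self.mean[i] * self.mean[j]      # equation (5)
              for j in range(dim)] for i in range(dim)]
        centered = [xi - m for xi, m in zip(x, self.mean)]    # equation (6)
        return z, t, centered


stats = DecayingStats(dim=2)
stats.update([1.0, 2.0])
z, t, centered = stats.update([3.0, 4.0])
```

Once n reaches 2000, z stays at 1/2000, so older frames are forgotten at a fixed rate and the statistics can track a changing environment.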

After solving equation (7), calculator 203 determines the probability that the classifiers represent a voiced frame by solving the following equation:

$$ P(v \mid x_n) = 1 - P(u \mid x_n) \tag{8} $$

Next, calculator 203 determines the overall probability that any frame will be unvoiced by solving equation (9) for p_n:

$$ p_n = (1 - z)\,p_{n-1} + z\,P(u \mid x_n) \tag{9} $$

After determining the probability that a frame will be unvoiced, calculator 203 determines two vectors, u and v, which give the mean value of each classifier for unvoiced and voiced frames, respectively. Vector u, the statistical average unvoiced vector, contains the mean value of each classifier given that the frame is unvoiced; vector v, the statistical average voiced vector, gives the mean value of each classifier given that the frame is voiced. Vector u for the present frame is obtained by computing equation (10), and vector v for the present frame by computing equation (11).

$$ u_n = (1 - z)\,u_{n-1} + z\,x_n\,P(u \mid x_n)/p_n - z\,x_n \tag{10} $$
$$ v_n = (1 - z)\,v_{n-1} + z\,x_n\,P(v \mid x_n)/(1 - p_n) - z\,x_n \tag{11} $$

Calculator 203 then transmits vectors u and v, matrix T, and probability p to weights calculator 204 via path 212. Weights calculator 204 responds to this information by calculating new values for vector a and scalar b. These new values are then transmitted back to statistical calculator 203 via path 213. This allows detector 103 to adapt rapidly to a changing environment. Even if the new values of vector a and scalar b are not transmitted back to statistical calculator 203, detector 103 will continue to adapt to a changing environment, because vectors u and v are kept up to date. As will be seen, discriminator 205 uses vectors u and v as well as vector a and scalar b to make the voicing decision. Once n is greater than, preferably, 99, vector a and scalar b are calculated as follows. Vector a is determined by solving equation (12), and scalar b is determined by solving equation (13).

[Equations (12) and (13) are not reproduced in this text.]

After calculating equations (12) and (13), weights calculator 204 transmits vectors a, u, and v to U/V discriminator 205 via path 214. If the frame contains silence, only equation (6) is calculated.
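The per-frame updates of equations (8) through (11) can be sketched as a single function. Because equation (7) is not reproduced in this text, the sketch takes the posterior P(u|x_n) as an input rather than computing it. The function name `update_class_means` and its argument names are hypothetical, and x is assumed to be the mean-subtracted classifier vector from equation (6).

```python
def update_class_means(z, x, p_unvoiced_given_x, p_prev, u_prev, v_prev):
    """Apply equations (8)-(11) for one frame and return (p_n, u_n, v_n)."""
    pu = p_unvoiced_given_x
    pv = 1.0 - pu                                     # equation (8)
    p = (1 - z) * p_prev + z * pu                     # equation (9)
    if not 0.0 < p < 1.0:
        raise ValueError("p_n must lie strictly between 0 and 1")
    u = [(1 - z) * ui + z * xi * pu / p - z * xi      # equation (10)
         for ui, xi in zip(u_prev, x)]
    v = [(1 - z) * vi + z * xi * pv / (1 - p) - z * xi  # equation (11)
         for vi, xi in zip(v_prev, x)]
    return p, u, v


p_n, u_n, v_n = update_class_means(
    z=0.5, x=[2.0], p_unvoiced_given_x=0.25,
    p_prev=0.75, u_prev=[1.0], v_prev=[3.0],
)
```

The guard on p_n is an addition for numerical safety; the text does not describe how the divisions in (10) and (11) are protected.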

In response to this information, discriminator 205 determines whether the present frame is voiced or unvoiced. If the component of vector (v_n − u_n) corresponding to power is positive, the frame is declared voiced if the following is true:

$$ a'x_n - a'(u_n + v_n)/2 > 0 \tag{14} $$

If instead the component of vector (v_n − u_n) corresponding to power is negative, the frame is declared voiced if the following is true:

$$ a'x_n - a'(u_n + v_n)/2 < 0 \tag{15} $$

Equation (14) can also be rewritten as

$$ a'x_n + b - \log[(1 - p_n)/p_n] > 0 $$

and equation (15) as

$$ a'x_n + b - \log[(1 - p_n)/p_n] < 0 $$

If neither condition is met, discriminator 205 declares the frame unvoiced. Equations (14) and (15) represent the decision regions for the voicing decision. The log term in the rewritten forms of (14) and (15) can be omitted with only a small change in performance. In this embodiment, the component corresponding to power is preferably the log of the speech energy.
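The decision regions of equations (14) and (15) can be sketched directly from the class means. The function name `declare_voiced` and the `power_index` argument, which marks the position of the log-energy classifier in the vectors, are assumptions for illustration. A zero power component satisfies neither condition, so the sketch returns unvoiced in that case.

```python
def declare_voiced(a, x, u, v, power_index=0):
    """Apply equation (14) or (15), depending on the sign of (v - u) at power_index."""
    # a'x - a'(u + v)/2, computed as a'(x - midpoint)
    score = sum(ai * (xi - (ui + vi) / 2.0)
                for ai, xi, ui, vi in zip(a, x, u, v))
    power_component = v[power_index] - u[power_index]
    if power_component > 0:
        return score > 0          # equation (14)
    if power_component < 0:
        return score < 0          # equation (15)
    return False                  # neither condition met: unvoiced


weights = [1.0, 0.5]
is_voiced = declare_voiced(weights, [0.5, 0.0], u=[-1.0, 0.0], v=[1.0, 0.0])
```

The score measures which side of the midpoint between the unvoiced and voiced means the frame falls on, along the direction given by the weight vector.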

Generator 206 responds to the information received from calculator 204 via path 214 by calculating the distance measure A as follows. First, the discriminant variable d is calculated by equation (16):

$$ d = a'x_n + b - \log[(1 - p_n)/p_n] \tag{16} $$

Advantageously, as will be apparent to those skilled in the art, different types of voicing detectors may be used to generate a value similar to d for use in the following equations; one such detector would be an autocorrelation detector. If the frame is voiced, equations (17) through (20) are solved as follows:

$$ m_1 = (1 - z)\,m_1 + z\,d \tag{17} $$
$$ s_1 = (1 - z)\,s_1 + z\,d^2 \tag{18} $$
$$ k_1 = s_1 - m_1^2 \tag{19} $$

where m_1 is the mean for voiced frames and k_1 is the variance for voiced frames. The probability P_d that discriminator 205 will declare a frame unvoiced is calculated as follows:

$$ P_d = (1 - z)\,P_d \tag{20} $$

P_d is preferably set to 0.5 initially.

If the frame is unvoiced, equations (21) through (24) are solved as follows:

$$ m_0 = (1 - z)\,m_0 + z\,d \tag{21} $$
$$ s_0 = (1 - z)\,s_0 + z\,d^2 \tag{22} $$
$$ k_0 = s_0 - m_0^2 \tag{23} $$

The probability P_d that discriminator 205 will declare a frame unvoiced is calculated as follows:

$$ P_d = (1 - z)\,P_d + z \tag{24} $$

After equations (16) through (22) have been calculated, the distance measure, or merit value, is calculated by equation (25).

[Equation (25) is not reproduced in this text.]
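The running statistics of the discriminant variable d in equations (17) through (24) can be sketched as follows. The class name `DiscriminantStats` and the zero initial values for the means and second moments are assumptions; the text gives an initial value only for P_d (0.5). The distance measure of equation (25) is not computed here because that equation is not reproduced in this text.

```python
class DiscriminantStats:
    """Decaying mean, second moment, and variance of d, kept per voicing class."""

    def __init__(self):
        self.m = [0.0, 0.0]   # m_0 (unvoiced), m_1 (voiced); initial values assumed
        self.s = [0.0, 0.0]   # s_0, s_1; initial values assumed
        self.p_d = 0.5        # probability of an unvoiced declaration

    def update(self, d, z, voiced):
        i = 1 if voiced else 0
        self.m[i] = (1 - z) * self.m[i] + z * d        # equations (17) / (21)
        self.s[i] = (1 - z) * self.s[i] + z * d * d    # equations (18) / (22)
        if voiced:
            self.p_d = (1 - z) * self.p_d              # equation (20)
        else:
            self.p_d = (1 - z) * self.p_d + z          # equation (24)

    def variance(self, voiced):
        i = 1 if voiced else 0
        return self.s[i] - self.m[i] ** 2              # equations (19) / (23)


stats = DiscriminantStats()
stats.update(d=2.0, z=0.5, voiced=True)
stats.update(d=-2.0, z=0.5, voiced=False)
```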

式（25）はホテリング（Hotelling）の２サンプルT²統
計を用いて距離尺度を計算する式（25）に対して、メリ
ット値が大きくなればなるほど分離は大きくなる。しか
しながら他のメリット値は、メリット値が小さくなれば
なるほど分離は大きくなるところに存在する。好ましい
ことに距離尺度は次式で与えられるマハラノビス距離で
あってもよい。 Equation (25) is different from Equation (25), which calculates the distance measure using Hotelling's two-sample ^T2 statistic, in that the larger the merit value, the greater the separation. However, other merit values exist where the smaller the merit value, the greater the separation. Preferably, the distance measure may be the Mahalanobis distance, given by

好ましいことに第３の方法は次式で与えられる。 Preferably, the third method is given by:

好ましくは、距離尺度を計算するために第４の方法は次
式で示される。 Preferably, a fourth method for calculating the distance measure is given by:

A² = a'(v_n − u_n). (28)
Discriminant detector 102 makes the unvoiced/voiced decision: if a'X + b > 0, it transmits information indicating a voiced frame to multiplexer 105 via path 107. If this condition is not true, detector 102 indicates an unvoiced frame. The values of vector a and scalar b used by detector 102 are preferably the same as the initial values of a and b for statistical voiced detector 103.
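The decision rule a'X + b > 0 and the fourth distance measure of equation (28) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `dot`, `is_voiced`, and `fourth_distance`, and the use of plain Python sequences for the vectors a, X, v_n, and u_n, are hypothetical choices made for the example.

```python
from typing import Sequence


def dot(p: Sequence[float], q: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have equal length")
    return sum(pi * qi for pi, qi in zip(p, q))


def is_voiced(a: Sequence[float], x: Sequence[float], b: float) -> bool:
    """Apply the rule a'X + b > 0: True means voiced, False means unvoiced."""
    return dot(a, x) + b > 0.0


def fourth_distance(a: Sequence[float],
                    v: Sequence[float],
                    u: Sequence[float]) -> float:
    """Distance measure of equation (28): A^2 = a'(v_n - u_n)."""
    diff = [vi - ui for vi, ui in zip(v, u)]
    return dot(a, diff)
```

A frame whose weighted sum lands exactly on the threshold is treated as unvoiced here, because the inequality in the text is strict.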

Detector 102 determines the distance measure in a manner similar to generator 206 by performing calculations similar to those given in equations (16) through (28).

FIGS. 3 and 4 show, in greater detail and in flowchart form, the operations performed by statistical voiced detector 103 of FIG. 2. Blocks 302 and 300 implement blocks 202 and 201 of FIG. 2, respectively. Blocks 304 through 318 implement statistical calculator 203. Blocks 320 and 322 implement weight calculator 204, and blocks 326 through 338 implement block 205 of FIG. 2. Generator 206 of FIG. 2 is implemented by block 340. Subtractor 207 is implemented by block 308 or block 324.

Block 302 calculates a vector representing the average of the classifiers for the current frame and all previous frames. Block 300 determines whether speech or silence is present in the current frame. If silence is present, block 324 subtracts from each classifier the average for that classifier before control passes to decision block 326. If speech is present, however, blocks 304 through 322 perform the statistical and weight calculations. First, the mean vector is determined in block 302. Second, the matrix of sums of squares and products is calculated in block 304. Then, in block 306, this matrix is used together with the vector X of classifier means for the current and past frames to calculate the covariance matrix T. Next, in block 308, the mean X is subtracted from the classifier vector x_n.
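The sequence of blocks 302 through 308 can be sketched as one per-frame update. The exact formulas for these blocks are not given in this passage, so the sketch assumes the same recursive smoothing form as equations (17) through (19), with smoothing factor z, and assumes the covariance is the smoothed matrix of squares and products minus the outer product of the mean. The function name `update_statistics` and the list-of-lists matrix layout are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def update_statistics(mean: Vector, sq: Matrix, x: Vector,
                      z: float) -> Tuple[Vector, Matrix, Matrix, Vector]:
    """One frame of running statistics (assumed recursive form).

    mean: previous mean vector X (block 302)
    sq:   previous matrix of squares and products (block 304)
    x:    classifier vector x_n for the current frame
    z:    smoothing factor, 0 < z <= 1 (assumption)
    """
    n = len(x)
    # Block 302: smoothed mean of each classifier.
    new_mean = [(1 - z) * m + z * xi for m, xi in zip(mean, x)]
    # Block 304: smoothed matrix of squares and cross products.
    new_sq = [[(1 - z) * sq[i][j] + z * x[i] * x[j] for j in range(n)]
              for i in range(n)]
    # Block 306: covariance T = (squares and products) - (mean)(mean)'.
    cov = [[new_sq[i][j] - new_mean[i] * new_mean[j] for j in range(n)]
           for i in range(n)]
    # Block 308: subtract the mean from the classifier vector.
    centered = [xi - mi for xi, mi in zip(x, new_mean)]
    return new_mean, new_sq, cov, centered
```

Returning the centered vector alongside the statistics mirrors the order of the flowchart, where the subtraction follows the covariance calculation.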

Next, block 310 calculates the probability that the current frame is unvoiced, using the current weight vector a, the current threshold b, and the classifier vector x_n for the current frame. After the probability that the current frame is unvoiced has been calculated, block 312 calculates the probability that the current frame is voiced. Block 314 then calculates the overall probability p_n that any frame will be unvoiced.

Blocks 316 and 318 compute two vectors, u and v. Vector u contains the statistical mean value that each classifier would have if the frame were unvoiced, while vector v contains the statistical mean value that each classifier would have if the frame were voiced. The actual classifier values for the current and previous frames cluster around either vector u or vector v. If these frames are found to be unvoiced, the vectors representing the classifiers for the previous and current frames cluster around vector u; otherwise, they cluster around vector v.

After blocks 316 and 318 are executed, control passes to decision block 320. If N is greater than 99, control passes to block 322; otherwise, control passes to block 326. Upon receiving control, block 322 calculates a new weight vector a and a new threshold value b. Vector a and value b are used in the next frame by the preceding blocks of FIG. 3. Preferably, if N is required to be greater than infinity, vector a and scalar b are never changed, and detector 103 adapts only in response to vectors v and u, as shown in blocks 326 through 328.
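The gating performed by decision block 320 can be sketched as follows. This is an illustration only: the function name `maybe_update_weights`, the `recompute` callback standing in for block 322, and the `min_frames` parameter are hypothetical. The threshold 99 comes from the text, and the text does not define N in this passage, so the sketch treats it simply as a count supplied by the caller.

```python
import math
from typing import Callable, List, Tuple

Weights = Tuple[List[float], float]  # (weight vector a, threshold b)


def maybe_update_weights(n: float, current: Weights,
                         recompute: Callable[[], Weights],
                         min_frames: float = 99) -> Weights:
    """Decision block 320: recompute a and b only when N > min_frames."""
    if n > min_frames:
        return recompute()  # block 322: new weight vector and threshold
    return current          # otherwise keep a and b and go on to block 326


# Requiring N to exceed infinity freezes a and b permanently.
frozen = maybe_update_weights(10**9, ([1.0], 0.0),
                              lambda: ([2.0], 1.0), min_frames=math.inf)
```

With `min_frames=math.inf` the comparison can never succeed, which matches the case in which the detector adapts only through vectors v and u.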

Blocks 326 through 338 implement u/v classifier 205 of FIG. 2. Block 326 determines whether the power term of vector v for the current frame is greater than or equal to the power term of vector u. If this condition is true, decision block 328 is executed. The latter decision block tests whether the frame is voiced or unvoiced. If the decision of block 328 is true, the frame is marked as voiced by block 330; otherwise, the frame is marked as unvoiced by block 332. If the power term of vector v is less than the power term of vector u, blocks 334 through 338 are executed and function in a similar manner. Finally, block 340 calculates the distance measure.

FIG. 5 shows in detail, in flowchart form, the operations performed by block 340 of FIG. 4. Decision block 501 determines whether the frame has been designated unvoiced or voiced by examining the results of blocks 330, 332, 336, or 338. If the frame is designated voiced, path 507 is selected. Block 510 calculates probability P_d, block 502 recalculates the mean m₁ for voiced frames, and block 503 recalculates the variance k₁ for voiced frames. If the frame is designated unvoiced, decision block 501 selects path 508. Block 509 recalculates probability P_d, block 504 recalculates the mean m₀ for unvoiced frames, and block 505 recalculates the variance k₀ for unvoiced frames. Finally, block 506 calculates the distance measure by performing the indicated calculations.
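The two paths of FIG. 5 correspond to the recursive updates of equations (17) through (24), and can be sketched as below. The class name `MeritState`, the function name `update`, and the zero initial values for the means and mean squares are assumptions made for the example; only the initial value P_d = 0.5 is stated in the text. The final distance calculation of block 506 is omitted because equation (25) is not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class MeritState:
    p_d: float = 0.5  # probability of an unvoiced declaration (initial value from the text)
    m1: float = 0.0   # mean of d over voiced frames (initial value assumed)
    s1: float = 0.0   # mean square of d over voiced frames (initial value assumed)
    m0: float = 0.0   # mean of d over unvoiced frames (initial value assumed)
    s0: float = 0.0   # mean square of d over unvoiced frames (initial value assumed)

    @property
    def k1(self) -> float:
        """Variance for voiced frames, equation (19)."""
        return self.s1 - self.m1 ** 2

    @property
    def k0(self) -> float:
        """Variance for unvoiced frames, equation (23)."""
        return self.s0 - self.m0 ** 2


def update(state: MeritState, d: float, voiced: bool, z: float) -> MeritState:
    """Apply path 507 (voiced) or path 508 (unvoiced) for one frame."""
    keep = 1 - z
    if voiced:
        state.m1 = keep * state.m1 + z * d       # equation (17)
        state.s1 = keep * state.s1 + z * d * d   # equation (18)
        state.p_d = keep * state.p_d             # equation (20)
    else:
        state.m0 = keep * state.m0 + z * d       # equation (21)
        state.s0 = keep * state.s0 + z * d * d   # equation (22)
        state.p_d = keep * state.p_d + z         # equation (24)
    return state
```

Storing the mean squares s₁ and s₀ and deriving the variances on demand keeps the state consistent with equations (19) and (23) without a separate update step.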

Continued from the front page
(56) References: JP 61-48898 (JP, A); JP 60-200300 (JP, A); JP 60-114900 (JP, A); IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 3, June 1976, pp. 201-212

Claims

[Claim 1] An apparatus for determining the presence of a fundamental frequency in a non-training speech signal, comprising: means for generating a digital speech signal by sampling the speech signal in response to the non-training speech signal, forming frames of the digital speech signal, and processing each frame to generate a set of classifiers defining speech attributes; first means for calculating a first set of statistical distributions in response to the set of classifiers defining speech attributes of a first frame of the frames; second means for calculating a set of weights corresponding to one of the classifiers in response to the calculated set of first statistical distributions; and third means for determining the presence of a fundamental frequency in the first frame in response to the calculated weights, the set of classifiers, and the set of first statistical distributions.

[Claim 2] The apparatus of claim 1, wherein said second means comprises: means for calculating a threshold value in response to said set of statistical distributions; and means for informing said first means of said set of weights and said threshold value to be used in calculating a second set of statistical distributions for a second frame different from said first frame.

3. The apparatus of claim 2, wherein said first means calculates a second set of statistical distributions further in response to the communicated set of weights and a second set of classifiers defining speech attributes of said second frame.

4. The apparatus of claim 1, wherein said first means comprises: means for calculating an average of the classifiers for a previous frame; and means for determining said second set of statistical distributions in response to the average of the classifiers, the communicated set of weights, and said second set of classifiers.

5. The apparatus of claim 4, wherein said first means further comprises: means for detecting the presence of speech in each frame; and means for ceasing calculation of said second set of statistical distributions when no speech is detected in said second frame.

6. The apparatus of claim 5, wherein said first means further comprises: means for calculating a probability that said second set of classifiers represents an unvoiced frame and a probability that said second set of classifiers represents a voiced frame; and means for calculating an overall probability that a frame is unvoiced.

7. The apparatus of claim 6, wherein said first means further comprises means for calculating a set of average classifiers representing unvoiced frames and a set of average classifiers representing voiced frames.

8. The apparatus of claim 7, wherein said first means further comprises means for calculating a covariance matrix between said set of average classifiers representing unvoiced frames for said second frame and said set of classifiers representing unvoiced frames for said second frame.

9. The apparatus of claim 8, wherein said second means determines said second set of statistical distributions in response to the covariance matrix, said set of average classifiers for both voiced and unvoiced frames, and said overall probability that a frame is unvoiced.

10. The apparatus of claim 9, wherein said third means determines the presence of said fundamental frequency in said second frame in response to said second set of statistical distributions and said set of average classifiers for voiced and unvoiced frames.

[Claim 11] A method for determining the presence of a fundamental frequency in a non-training speech signal, comprising the steps of: generating a digital speech signal by sampling a non-training speech signal, forming frames of the digital speech signal, and processing each frame to generate a set of classifiers defining speech attributes; a first calculation step of calculating a first set of statistical distributions in response to the set of classifiers defining speech attributes of a first frame of one of the frames; a second calculation step of calculating a set of weights corresponding to one of the classifiers in response to the calculated set of first statistical distributions; and determining the presence of a fundamental frequency in the first frame in response to the calculated weights and set of classifiers and the first set of statistical distributions.

12. The method of claim 11, wherein said second calculating step comprises: calculating a threshold value in response to said set of statistical distributions; and reporting said set of weights and said threshold value for use in calculating a second set of statistical distributions for a second frame different from said first frame.

13. The method of claim 12, wherein said first calculating step further comprises calculating a second set of statistical distributions in response to the communicated set of weights and a second set of classifiers defining speech attributes of said second frame.

14. The method of claim 1, wherein said first calculating step comprises: calculating an average of said classifiers for a previous frame; and determining said second set of statistical distributions in response to said average of said classifiers, said communicated set of weights, and said second set of classifiers.

15. The method of claim 14, wherein the first calculating step further comprises the steps of: detecting the presence of speech in each frame; and ceasing calculation of the second set of statistical distributions when no speech is detected in the second frame.

16. The method of claim 15, wherein the first calculating step further comprises: calculating the probability that the second set of classifiers represents an unvoiced frame and the probability that the second set of classifiers represents a voiced frame; and calculating the overall probability that the frame is unvoiced.

17. The method of claim 16, wherein said first calculating step further comprises the step of calculating a set of average classifiers representing unvoiced frames and a set of average classifiers representing voiced frames.

18. The method of claim 17, wherein the first calculating step further comprises calculating a covariance matrix between the set of average classifiers representing unvoiced frames relative to the second frame and the set of classifiers representing unvoiced frames relative to the second frame.

19. The method of claim 18, wherein the second calculating step determines the set of second statistical distributions in response to a covariance matrix, the set of average classifiers for both voiced and unvoiced frames, and the overall probability that a frame is unvoiced.

20. The method of claim 19, wherein the determining step comprises determining the presence of the fundamental frequency in the second frame in response to the second set of statistical distributions and the set of average classifiers for voiced and unvoiced frames.