JPH0766271B2 - Voice recognizer

Info

Publication number: JPH0766271B2
Application number: JP63104769A
Authority: JP
Inventors: 鈴木　　忠; 邦男中島
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1988-04-27
Filing date: 1988-04-27
Publication date: 1995-07-19
Anticipated expiration: 2010-07-19
Also published as: JPH01274198A

Description

[Technical Field]

The present invention relates to a speech recognition device that recognizes input speech and has a noise removal function for suppressing noise superimposed on the input speech signal.

[Conventional technology]

In speech recognition that uses spectral information, distortion of the speech spectrum caused by superimposed noise significantly degrades recognition performance. Improving robustness to noise is therefore an important issue in putting a speech recognition device to practical use. Noise-canceling microphones are often used to suppress contamination by environmental noise, but a sufficient signal-to-noise ratio (S/N) may still not be obtained, and noise may also be superimposed while the speech signal is being transmitted. Signal processing techniques that remove only the noise from a speech signal already contaminated by noise, and thereby improve the S/N, are called noise suppression, noise removal, speech enhancement, and so on, and many methods have been proposed. Recently, a method called Spectral Mapping has been proposed as a noise suppression method based on a new concept. It removes noise from the standpoint of signal detection by using a known mapping, generated with vector quantization, between a noise-superimposed signal space and a noise-free signal space. The method was proposed in Biing-Hwang Juang and L. R. Rabiner, "Signal Restoration by Spectral Mapping", 1987 IEEE International Conference on Acoustics, Speech, & Signal Processing, Volume 4, pp. 6.6.1-6.6.4, April 1987, Dallas (hereinafter, reference [1]), and is regarded as effective as a noise suppression method for speech transmission.

FIG. 2 shows a conceivable configuration of a speech recognition device incorporating a noise removal circuit based on the Spectral Mapping method. Although various recognition methods exist, the description below takes as an example a discrete word recognition device that holds word-level templates and uses DP (Dynamic Programming) matching.

In FIG. 2, 1 is an input terminal for the speech signal; 2 is the input speech signal; 3 is an analysis circuit that acoustically analyzes the input speech signal 2; 4 is a feature vector time series; 5 is a vector quantizer that vector-quantizes the feature vector time series 4 output from the analysis circuit 3 using a noise-added codebook 6; 7 is a label candidate time series composed of labels representing codewords; 8 is an inverse vector quantizer that inverse-vector-quantizes the label candidate time series 7, the output of the vector quantizer 5, using a noise-free codebook 9; 10 is a noise-free feature vector time series; and 11 is the noise removal circuit shown in reference [1], composed of the vector quantizer 5, the noise-added codebook 6, the inverse vector quantizer 8, and the noise-free codebook 9. Further, 12 is a pattern matching circuit that performs DP matching between the noise-free feature vector time series 10 and a reference pattern 15, which is output from a template memory 14 and expressed as a noise-free feature vector time series, and outputs a matching distortion 16; 13 is a recognition control circuit that designates the reference pattern used in pattern matching and outputs the recognition result; 17 is address data that the recognition control circuit 13 sends to the template memory 14 to designate a reference pattern; 18 is the recognition result; and 19 is a recognition processing circuit composed of the pattern matching circuit 12, the template memory 14, and the recognition control circuit 13.

Next, the operation will be described.

In the noise removal circuit 11, the noise-free codebook 9 consists of codewords that are feature vectors of speech with no noise superimposed. The noise-added codebook 6 consists of codewords generated by processing each codeword of the noise-free codebook 9 so that it has the same noise condition as the noise-superimposed input speech signal: for example, noise is added in the time-waveform domain so that the signal-to-noise ratio equals that of the noise-superimposed input speech signal 2, and the result is re-analyzed and converted into a noise-added feature vector. The noise-superimposed input speech signal 2 applied to the input terminal 1 is acoustically analyzed by the analysis circuit 3 and output as the feature vector time series 4, {X(n) | n = 1, 2, ..., N}, where N is the number of feature vectors. In the noise removal circuit 11, the vector quantizer 5 takes the feature vector time series 4 as input. For X(n) at an arbitrary vector number n, it computes the likelihood with respect to every codeword in the noise-added codebook 6 and takes the labels of the L codewords with the highest likelihood as label candidates {m_i(n) | i = 1, 2, ..., L}. It does this for n = 1, 2, ..., N and outputs the label candidate time series 7, {M(n) | n = 1, 2, ..., N}, where M(n) = {m_i(n) | i = 1, 2, ..., L}. L is an integer equal to or greater than 1. For an arbitrary n, the inverse vector quantizer 8 inverse-vector-quantizes the label candidates M(n) = {m_i(n) | i = 1, 2, ..., L} using the noise-free codebook 9 to obtain noise-free feature vector candidates {Z_i(n) | i = 1, 2, ..., L}, and computes Y(n) as the mean vector of these L candidates.

That is,

$$y_t(n) = \frac{1}{L} \sum_{i=1}^{L} z_{t,i}(n) \qquad \text{(Equation 1)}$$

where y_t(n) and z_{t,i}(n) are the t-th dimension components of Y(n) and Z_i(n), respectively.
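The conventional noise removal step described above (select the L best noise-added codewords, look up the noise-free codewords with the same labels, then average them) can be sketched as follows. This is only an illustration under stated assumptions: the text speaks of "likelihood" without defining it, so the sketch ranks codewords by squared Euclidean distance, and the function names `sq_dist`, `top_l_labels`, and `spectral_mapping_average` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_l_labels(x, noisy_codebook, num_candidates):
    """Labels of the num_candidates noise-added codewords closest to x, best first.

    Assumption: a smaller distance stands in for a higher likelihood.
    """
    order = sorted(range(len(noisy_codebook)),
                   key=lambda m: sq_dist(x, noisy_codebook[m]))
    return order[:num_candidates]


def spectral_mapping_average(x, noisy_codebook, clean_codebook, num_candidates):
    """Map one noisy feature vector X(n) to Y(n) by averaging L clean candidates."""
    labels = top_l_labels(x, noisy_codebook, num_candidates)   # m_i(n)
    candidates = [clean_codebook[m] for m in labels]           # Z_i(n)
    dim = len(candidates[0])
    # Component-wise mean over the L candidates gives y_t(n) for each dimension t.
    return [sum(z[t] for z in candidates) / len(candidates) for t in range(dim)]
```

With `num_candidates` set to 1 this reduces to plain vector quantization followed by a table lookup in the noise-free codebook.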

In the recognition processing circuit 19, the template memory 14 sends the reference pattern 15 designated by the address data 17 from the recognition control circuit 13, {T(k) | k = 1, 2, ..., K} (K being the number of feature vectors), to the pattern matching circuit 12. The pattern matching circuit 12 performs DP matching between this reference pattern 15, {T(k) | k = 1, 2, ..., K}, and the noise-free feature vector time series 10, {Y(n) | n = 1, 2, ..., N}, output from the noise removal circuit 11. The DP matching recurrence is given, for example, by (Equation 2).
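The formula of (Equation 2) is not reproduced in this text. As a point of reference only, a commonly used symmetric DP recurrence without a slope constraint, which is consistent with the normalization by (K + N) described in the next paragraph, would take the following form. The path weights 1, 2, 1 and the initial condition are assumptions and may differ from the formula in the original publication.

```latex
g(k,n) = \min
\begin{cases}
g(k-1,\,n) + D(k,n) \\
g(k-1,\,n-1) + 2\,D(k,n) \\
g(k,\,n-1) + D(k,n)
\end{cases},
\qquad g(1,1) = 2\,D(1,1)
```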

Here, D(k, n) is the distortion between the feature vector T(k) and the feature vector Y(n). (Equation 2) is given as one example, for DP matching without a slope constraint. The value g(K, N) obtained from the recurrence (Equation 2) is normalized by dividing it by the city-block distance (K + N) and output as the matching distortion 16. The recognition control circuit 13 sequentially changes the reference pattern designated by the address data 17 and, from the matching distortions 16 that the pattern matching circuit 12 outputs for the respective reference patterns, outputs the label of the reference pattern that minimizes the matching distortion as the recognition result 18.
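The matching and decision steps just described can be sketched as follows. The sketch assumes a symmetric recurrence with weights 1, 2, 1 (an assumption chosen to agree with the K + N normalization, not a formula taken from this text), and the names `dp_distortion` and `recognize` as well as the dictionary of templates are hypothetical.

```python
def dp_distortion(ref, seq, dist):
    """Normalized DP matching distortion g(K, N) / (K + N).

    ref  : reference pattern T(1..K), a list of feature vectors
    seq  : input time series Y(1..N), a list of feature vectors
    dist : function returning the distortion between two feature vectors
    Assumption: symmetric recurrence with weights 1, 2, 1, no slope constraint.
    """
    K, N = len(ref), len(seq)
    INF = float("inf")
    g = [[INF] * N for _ in range(K)]
    for k in range(K):
        for n in range(N):
            d = dist(ref[k], seq[n])
            if k == 0 and n == 0:
                g[k][n] = 2 * d
                continue
            best = INF
            if k > 0:
                best = min(best, g[k - 1][n] + d)
            if n > 0:
                best = min(best, g[k][n - 1] + d)
            if k > 0 and n > 0:
                best = min(best, g[k - 1][n - 1] + 2 * d)
            g[k][n] = best
    return g[K - 1][N - 1] / (K + N)


def recognize(templates, seq, dist):
    """Return the label of the reference pattern with minimum matching distortion."""
    return min(templates, key=lambda label: dp_distortion(templates[label], seq, dist))
```

Here `templates` plays the role of the template memory 14, and the loop hidden inside `min` corresponds to the recognition control circuit 13 stepping through the reference patterns.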

[Problems to be Solved by the Invention]

The mapping between the noise-superimposed signal space and the noise-free signal space, which is the foundation of noise removal by the Spectral Mapping method, is formed by applying processing equivalent to noise superposition to the noise-free codebook 9 to create the noise-added codebook 6. Actual noise, however, has statistical variance, so the difference between the noise added equivalently when the mapping was generated and the noise superimposed on the input speech signal 2 causes errors in the mapping. In addition, as the level of the noise superimposed on the input speech signal 2 rises, the spectral envelope of the input speech signal 2 becomes smoother. The phonemic characteristics contained in the feature vector time series 4 output from the analysis circuit 3 are therefore lost, and the mapping errors increase drastically. For example, if the number of label candidates L in the noise removal circuit 11 is set to 1, then whenever a mapping error caused by noise variance occurs, the distortion due to that error is reflected directly in the matching distortion 16 in pattern matching, so recognition performance falls sharply as the level of superimposed noise rises. If L is set to 2 or more to increase the number of candidates, the probability that the candidates include one produced by the correct mapping increases, but there is no way to tell which of the candidates that is. Reference [1] therefore uses an improvement that outputs the mean of the feature vectors of the multiple candidates. This averaging, however, mixes components of the feature vectors of wrong candidates into the feature vector of the candidate that should have been chosen, which causes distortion and degrades recognition performance. The method of reference [1] is thus effective in improving the perceived signal-to-noise ratio in speech transmission but is not effective when applied to speech recognition. As described above, a conventional speech recognition device with a noise removal function has the problem that the noise removal processing distorts the speech feature vectors and the recognition rate falls under high noise.
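The distortion introduced by averaging can be seen in a small numerical illustration. All values below are made up for the purpose of the example, and squared Euclidean distance is assumed as the distortion measure; neither comes from the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


template = [1.0, 0.0]            # hypothetical noise-free reference vector
candidates = [[1.0, 0.0],        # candidate produced by the correct mapping
              [0.0, 1.0]]        # candidate produced by an erroneous mapping

averaged = mean_vector(candidates)
distortion_after_averaging = sq_dist(template, averaged)
distortion_of_correct_candidate = sq_dist(template, candidates[0])
```

In this example the correct candidate matches the template exactly, while the averaged vector [0.5, 0.5] is at a distortion of 0.5 from it: the erroneous candidate has leaked into the output even though the correct one was available.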

The present invention has been made to solve the above problems, and its object is to obtain a speech recognition device in which the noise removal processing does not distort the speech feature vectors and the recognition rate does not fall even under high noise.

[Means for Solving the Problems]

A speech recognition device according to the present invention comprises: a noise-free codebook 9 whose codewords are feature vectors of speech with no noise superimposed; a noise-added codebook 6 generated by applying processing equivalent to noise superposition to each codeword of the noise-free codebook 9; vector quantization means (fuzzy vector quantizer 20) that vector-quantizes the feature vector time series of a noise-superimposed input speech signal according to the noise-added codebook 6 and outputs a plurality of label candidate time series composed of labels representing codewords; inverse vector quantization means (inverse fuzzy vector quantizer 22) that inverse-vector-quantizes the plurality of label candidate time series output from the vector quantization means (fuzzy vector quantizer 20) according to the noise-free codebook 9 and outputs a plurality of noise-free feature vector candidate time series; and multi-vector recognition processing means (multi-vector recognition processing circuit 26) that receives the plurality of noise-free feature vector candidate time series output from the inverse vector quantization means (inverse fuzzy vector quantizer 22), selects the optimal noise-free feature vector time series on the basis of the reference pattern that gives the maximum likelihood over all the noise-free feature vector candidate time series, and performs speech recognition processing.

[Action]

The vector quantization means (fuzzy vector quantizer 20) vector-quantizes the feature vector time series of the noise-superimposed input speech signal on the basis of the noise-added codebook 6 and supplies a plurality of label candidate time series to the inverse vector quantization means (inverse fuzzy vector quantizer 22). The inverse vector quantization means (inverse fuzzy vector quantizer 22) inverse-vector-quantizes the supplied label candidate time series on the basis of the noise-free codebook 9 and supplies a plurality of noise-free feature vector candidate time series to the multi-vector recognition processing means (multi-vector recognition processing circuit 26). From the supplied candidate time series, the multi-vector recognition processing means (multi-vector recognition processing circuit 26) selects the optimal noise-free feature vector time series on the basis of the reference pattern that gives the maximum likelihood over all the noise-free feature vector candidate time series, and performs speech recognition processing.

[Embodiment of the Invention]

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to one embodiment of the present invention. In FIG. 1, components corresponding to those shown in FIG. 2 are given the same reference numerals, and their description is omitted. As with the conventional example, this embodiment is described taking as an example a discrete word speech recognition device that performs recognition by DP matching against word-level templates.

In FIG. 1, 20 is a fuzzy vector quantizer serving as vector quantization means, which vector-quantizes the feature vector time series 4 output from the analysis circuit 3 using the noise-added codebook 6 and outputs a plurality of label candidate time series 21; 22 is an inverse fuzzy vector quantizer serving as inverse vector quantization means, which inverse-vector-quantizes the plurality of label candidate time series 21 using the noise-free codebook 9 and outputs a plurality of noise-free feature vector candidate time series 23; 24 is a noise-free multi-vector generation circuit composed of the noise-added codebook 6, the noise-free codebook 9, the fuzzy vector quantizer 20, and the inverse fuzzy vector quantizer 22; 25 is a multi-vector pattern matching circuit that takes the plurality of noise-free feature vector candidate time series 23 as input, performs DP matching with the reference pattern 15, and outputs the matching distortion 16; and 26 is a multi-vector recognition processing circuit serving as multi-vector recognition processing means, composed of the multi-vector pattern matching circuit 25, the template memory 14, and the recognition control circuit 13.

This speech recognition device adopts a new scheme that uses fuzzy vector quantization (Fuzzy Vector Quantization) to perform noise removal and pattern recognition on the input signal simultaneously. It comprises the noise-free multi-vector generation circuit 24, which takes the feature vectors of the noise-superimposed input speech signal as input and outputs the plurality of noise-free feature vector candidate time series 23, and the multi-vector recognition processing circuit 26, which performs recognition using those candidate time series 23 as input. As described above, the noise-free multi-vector generation circuit 24 is composed of: the noise-free codebook 9, whose codewords are feature vectors of speech with no noise superimposed; the noise-added codebook 6, generated by applying processing equivalent to noise superposition to each codeword of the noise-free codebook 9; the fuzzy vector quantizer 20, which vector-quantizes the feature vector time series 4 of the noise-superimposed input speech signal 2 according to the noise-added codebook 6 and outputs the plurality of label candidate time series 21; and the inverse fuzzy vector quantizer 22, which inverse-vector-quantizes the label candidate time series 21 output from the fuzzy vector quantizer 20 according to the noise-free codebook 9 and outputs the plurality of noise-free feature vector candidate time series 23.
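The description of the conventional device gives, as one example of "processing equivalent to noise superposition", adding noise in the time-waveform domain at the same signal-to-noise ratio as the input and then re-analyzing the result. A minimal sketch of the mixing step alone is shown below. It assumes the S/N is the ratio of mean powers expressed in dB; the names `mean_power` and `mix_at_snr` are hypothetical, and the re-analysis into feature vectors is left out because the text does not specify the analysis method.

```python
import math


def mean_power(x):
    """Mean squared amplitude of a waveform given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals snr_db, then add it.

    Assumes both waveforms have the same length and non-zero power.
    """
    target_noise_power = mean_power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [c + gain * n for c, n in zip(clean, noise)]
```

Each waveform behind a noise-free codeword would be passed through such a mixing step and then through the same analysis as the analysis circuit 3 to obtain the corresponding noise-added codeword, so that the two codebooks share their labels.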

The multi-vector recognition processing circuit 26 receives the plurality of noise-free feature vector candidate time series 23 output from the inverse fuzzy vector quantizer 22, {Z(n) | n = 1, 2, ..., N}, where Z(n) = {Z_i(n) | i = 1, 2, ..., L}. For each reference pattern and at each n, it selects, from the L noise-free feature vector candidates Z(n) = {Z_i(n) | i = 1, 2, ..., L}, the candidate that maximizes the likelihood with respect to that reference pattern as the optimal noise-free feature vector. In this way it determines as the recognized category the reference pattern that gives the maximum likelihood over the whole time series up to n = N, and at the same time determines the optimal noise-free feature vector time series.

The operation of this embodiment will now be described.

The fuzzy vector quantizer 20 takes as input the feature vector time series 4 of the input speech, {X(n) | n = 1, 2, ..., N}. For an arbitrary n, it computes the likelihood between the feature vector X(n) and every codeword in the noise-added codebook 6 and takes the labels of the L codewords with the highest likelihood as label candidates {m_i(n) | i = 1, 2, ..., L}. It does this for n = 1, 2, ..., N and outputs the result as the plurality of label candidate time series 21, {M(n) | n = 1, 2, ..., N}, where M(n) = {m_i(n) | i = 1, 2, ..., L}. For an arbitrary n, the inverse fuzzy vector quantizer 22 inverse-vector-quantizes the corresponding label candidates M(n) = {m_i(n) | i = 1, 2, ..., L} using the noise-free codebook 9 to obtain a plurality of noise-free feature vector candidates {Z_i(n) | i = 1, 2, ..., L}. It does this for n = 1, 2, ..., N and outputs the result as the plurality of noise-free feature vector candidate time series 23, {Z(n) | n = 1, 2, ..., N}, where Z(n) = {Z_i(n) | i = 1, 2, ..., L}.
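The two stages of the noise-free multi-vector generation circuit 24 can be sketched as follows. As before, squared Euclidean distance stands in for the unspecified likelihood (an assumption), and the function names `fuzzy_vq` and `inverse_fuzzy_vq` are hypothetical. The difference from the conventional circuit is that the L candidates are kept separate rather than averaged.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuzzy_vq(features, noisy_codebook, num_candidates):
    """For each X(n), return M(n): labels of the L best noise-added codewords."""
    series = []
    for x in features:
        ranked = sorted(range(len(noisy_codebook)),
                        key=lambda m: sq_dist(x, noisy_codebook[m]))
        series.append(ranked[:num_candidates])
    return series


def inverse_fuzzy_vq(label_series, clean_codebook):
    """For each M(n), return Z(n): the L noise-free codewords, kept separate."""
    return [[clean_codebook[m] for m in labels] for labels in label_series]
```

The output of `inverse_fuzzy_vq` is a list of N entries, each holding L candidate vectors, which corresponds to the plurality of noise-free feature vector candidate time series 23.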

The multi-vector pattern matching circuit 25 receives the plurality of noise-free feature vector candidate time series 23 output from the inverse fuzzy vector quantizer 22, {Z(n) | n = 1, 2, ..., N}, where Z(n) = {Z_i(n) | i = 1, 2, ..., L}, and performs DP matching (Equation 2) with the reference pattern 15, {T(k) | k = 1, 2, ..., K} (K being the number of feature vectors). Because the candidate time series 23 hold L feature vector candidates for each n, the distortion D(k, n) at arbitrary k and n is defined by the following equation.

$$D(k,n) = \min_{1 \le i \le L} d\bigl(T(k), Z_i(n)\bigr) \qquad \text{(Equation 3)}$$

Here, d(*, *) denotes the distortion between feature vectors. In this way, the feature vector best suited to the reference pattern 15 is selected from among the candidates. As in the conventional example, g(K, N) obtained from the recurrence (Equation 2) is normalized by dividing it by the city-block distance (K + N) and output as the matching distortion 16. By the principle of DP matching, minimizing the partial distortion of (Equation 3) minimizes the matching distortion of the whole pattern. The recognition control circuit 13 sequentially changes the reference pattern in the template memory 14 designated by the address data 17, evaluates the matching distortion 16 that the multi-vector pattern matching circuit 25 outputs for each reference pattern, and outputs the label of the reference pattern that minimizes the matching distortion 16 as the recognition result 18. For the reference pattern that minimizes the matching distortion 16, the vector time series optimally selected from the plurality of noise-free feature vector candidate time series 23 approximates the noise-free feature vector time series of a speech signal from which the noise has been correctly removed, and the minimum matching distortion corresponds to the matching distortion between the noise-free input speech and the noise-free template of its own category. The DP matching of (Equation 2) is given only as one example and does not limit the scope of the present invention.
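The multi-vector matching described above can be sketched as a DP in which each local distortion is the minimum over the L candidates, with a backtrace that reports which candidate was selected at each n. The sketch rests on assumptions that are not in the text: the 1, 2, 1 symmetric recurrence, the function name `multi_vector_dp`, and the rule that the selection reported for a given n is taken from the reference frame that the optimal path visits last for that n.

```python
def multi_vector_dp(ref, cand_series, dist):
    """DP matching with D(k, n) = min over i of d(T(k), Z_i(n)).

    ref         : reference pattern T(1..K)
    cand_series : Z(1..N), each entry a list of L candidate vectors
    dist        : distortion d(*, *) between two feature vectors
    Returns (normalized distortion g(K, N) / (K + N), selected index per n).
    """
    K, N = len(ref), len(cand_series)
    INF = float("inf")
    g = [[INF] * N for _ in range(K)]
    back = [[None] * N for _ in range(K)]
    choice = [[0] * N for _ in range(K)]
    for k in range(K):
        for n in range(N):
            dists = [dist(ref[k], z) for z in cand_series[n]]
            i_best = min(range(len(dists)), key=dists.__getitem__)
            d = dists[i_best]
            choice[k][n] = i_best
            if k == 0 and n == 0:
                g[k][n] = 2 * d
                continue
            steps = []
            if k > 0:
                steps.append((g[k - 1][n] + d, (k - 1, n)))
            if n > 0:
                steps.append((g[k][n - 1] + d, (k, n - 1)))
            if k > 0 and n > 0:
                steps.append((g[k - 1][n - 1] + 2 * d, (k - 1, n - 1)))
            g[k][n], back[k][n] = min(steps)
    # Walk the optimal path backwards and record the candidate chosen at each n.
    selected = [None] * N
    node = (K - 1, N - 1)
    while node is not None:
        k, n = node
        if selected[n] is None:
            selected[n] = choice[k][n]
        node = back[k][n]
    return g[K - 1][N - 1] / (K + N), selected
```

Running this against every reference pattern and keeping the one with the smallest returned distortion gives the recognition result, and the `selected` list for that pattern identifies the noise-free feature vector time series chosen from the candidates.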

Thus, the noise-free multi-vector generation circuit 24 maps the feature vector time series 4 of the noise-superimposed input speech fuzzily, with redundancy, onto a plurality of noise-free feature vector candidate time series 23 that include the correctly denoised signal. The multi-vector recognition processing circuit 26 then performs recognition by using noise-free reference patterns and maximizing the likelihood, which automatically selects the noise-free feature vector time series that minimizes the mapping distortion.

In this embodiment, therefore, the distortion due to mapping errors and the distortion caused by averaging the feature vectors of multiple candidates, which were the problems of the conventional device described above, are eliminated, and the fall in recognition rate that they caused no longer occurs.

The present invention has been described above taking as an example a discrete word speech recognition device that registers word-level templates. The recognition unit is not limited to words, however, and may be any unit such as CV (consonant-vowel), VCV (vowel-consonant-vowel), CVC (consonant-vowel-consonant), a single syllable, or a morpheme; the same effect as in the above embodiment is obtained as long as templates are registered as time series of feature vectors. The same effect is also obtained with statistical recognition methods other than template matching, such as the HMM (Hidden Markov Model): a transition probability model is used in place of the reference pattern, and for each transition probability model the noise-free feature vector that maximizes the generation probability is selected from the noise-free feature vector candidates. The invention is also applicable to continuously uttered speech, as opposed to discrete word speech, by placing in front of the device a segmentation circuit that cuts the continuous utterance into the registered speech units and feeding the cut-out speech to the device. Needless to say, the invention is not limited to implementation in dedicated hardware and can also be implemented by software processing on a general-purpose computer or a signal processor. The invention can furthermore be extended and applied to recognition devices for acoustic signals other than speech, image signals, character and figure signals, and the like.
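One way to read the HMM variant mentioned above is a Viterbi search in which the emission term for each frame is maximized over the L candidates. The sketch below is only one possible interpretation under stated assumptions: log-domain scores, a user-supplied `log_emit` function for the generation probability, no constraint on the final state, and a hypothetical function name `viterbi_max_over_candidates`. None of these details are given in the text.

```python
def viterbi_max_over_candidates(log_init, log_trans, log_emit, cand_series):
    """Best-path log score when each frame's emission is maximized over candidates.

    log_init[s]     : log initial probability of state s
    log_trans[r][s] : log transition probability from state r to state s
    log_emit(s, z)  : log generation probability of vector z in state s
    cand_series     : Z(1..N), each entry a list of L candidate vectors
    """
    num_states = len(log_init)

    def best_emit(s, cands):
        # Pick the candidate that this state generates with highest probability.
        return max(log_emit(s, z) for z in cands)

    score = [log_init[s] + best_emit(s, cand_series[0]) for s in range(num_states)]
    for cands in cand_series[1:]:
        score = [max(score[r] + log_trans[r][s] for r in range(num_states))
                 + best_emit(s, cands)
                 for s in range(num_states)]
    return max(score)
```

Evaluating this score for the model of each category and choosing the largest would play the same role as choosing the reference pattern with the minimum matching distortion in the DP matching embodiment.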

[Effect of the Invention]

As described above, the present invention comprises: vector quantization means that vector-quantizes the feature vector time series of a noise-superimposed input speech signal according to a noise-added codebook and outputs a plurality of label candidate time series; inverse vector quantization means that inverse-vector-quantizes the plurality of label candidate time series output from the vector quantization means according to the noise-free codebook and outputs a plurality of noise-free feature vector candidate time series; and multi-vector recognition processing means that receives the plurality of noise-free feature vector candidate time series output from the inverse vector quantization means, selects the optimal noise-free feature vector time series on the basis of the reference pattern that gives the maximum likelihood over all the noise-free feature vector candidate time series, and performs speech recognition processing. The optimal noise-free feature vector time series can therefore be selected from among the multiple noise-free feature vector candidates on the criterion of minimizing the distortion with respect to the reference pattern during pattern matching. As a result, the noise removal processing no longer distorts the speech feature vectors, and an extremely high recognition rate is obtained even under high noise.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to one embodiment of the present invention, and FIG. 2 is a block diagram showing the configuration of a conventional speech recognition device.

6: noise-added codebook; 9: noise-free codebook; 20: fuzzy vector quantizer (vector quantization means); 22: inverse fuzzy vector quantizer (inverse vector quantization means); 26: multi-vector recognition processing circuit (multi-vector recognition processing means).

Claims

1. A speech recognition device that has a noise removal function for suppressing noise superimposed on an input speech signal and recognizes input speech, comprising: a noise-free codebook whose codewords are feature vectors of speech with no noise superimposed; a noise-added codebook generated by applying processing equivalent to noise superposition to each codeword of the noise-free codebook; vector quantization means that vector-quantizes the feature vector time series of a noise-superimposed input speech signal according to the noise-added codebook and outputs a plurality of label candidate time series composed of labels representing codewords; inverse vector quantization means that inverse-vector-quantizes the plurality of label candidate time series output from the vector quantization means according to the noise-free codebook and outputs a plurality of noise-free feature vector candidate time series; and multi-vector recognition processing means that receives the plurality of noise-free feature vector candidate time series output from the inverse vector quantization means, selects the optimal noise-free feature vector time series on the basis of the reference pattern that gives the maximum likelihood over all the noise-free feature vector candidate time series, and performs speech recognition processing.