JPH0155476B2

Info

Publication number: JPH0155476B2
Application number: JP58180769A
Authority: JP
Inventors: Takao Irumano; Kunio Akiba; Hisanori Kanezashi
Original assignee: Computer Basic Technology Research Association Corp
Current assignee: Computer Basic Technology Research Association Corp
Priority date: 1983-09-30
Filing date: 1983-09-30
Publication date: 1989-11-24
Also published as: JPS6073696A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phonemic notation.

(Structure of the Conventional Examples and Their Problems) FIG. 1 is a diagram for explaining the method of the first conventional example, shown as a functional block diagram of an apparatus that implements the method. In the figure, 1 is a parameter extraction section, 2 is a probability density calculation section, 3 is a phoneme recognition section, 4 is a word recognition section, 5 is a phoneme standard pattern section, 6 is a word dictionary section, and 7 is a confusion matrix section. The phoneme standard pattern section 5 stores standard phoneme patterns that represent the distribution of each phoneme over the various parameters in the form of a mean value for each phoneme and a covariance matrix between the parameters; these standard phoneme patterns are created in advance through preliminary experiments or the like. The word dictionary section 6 stores a word dictionary in which all words to be recognized are written in phonemes; for example, the words "Sapporo" and "Asahikawa" are written as SAQPORO and ASAHIKAWA. The confusion matrix section 7 stores a confusion matrix giving the probability with which each phoneme used in the dictionary notation is recognized as each phoneme in actual phoneme recognition, for example, an 85% probability that A is recognized as A and a 7% probability that A is recognized as O; this confusion matrix is also created in advance through preliminary experiments or the like.

Next, the operation of the first conventional example will be explained. The parameter extraction section 1 analyzes the input speech in 10 ms frames, extracts parameters, and creates a parameter time series. Next, for each frame, the probability density calculation section 2 compares the obtained parameters with the phoneme standard patterns and calculates the probability density with which those parameter values are generated from each phoneme. The phoneme recognition section 3 then recognizes each frame as the phoneme with the maximum calculated probability density, and combines the frame-by-frame phoneme recognition results with the result of phoneme segmentation (dividing the input speech into one interval per phoneme) to create a recognized phoneme sequence. The word recognition section 4 calculates the similarity between the recognized phoneme sequence and each dictionary entry in the word dictionary section 6 using the confusion matrix, and takes the word (dictionary entry) with the maximum similarity as the recognized word. The calculation in the probability density calculation section 2 is as follows. The n parameters in one frame are expressed as a vector x. If, for a certain phoneme X, the mean of those parameters is μ_X and their covariance matrix is Σ_X, the probability density φ_X(x) of the vector x for phoneme X is given by a formula in which the distribution is assumed to be Gaussian.
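Under the Gaussian assumption stated above, the density would take the standard multivariate normal form. The expression below is a reconstruction from that assumption, writing the frame's n-parameter vector as x (a symbol chosen here for illustration), and is not a transcription of the original formula:

```latex
\phi_X(\mathbf{x}) =
  \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_X\rvert^{1/2}}
  \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\mu_X)^{\top}\,\Sigma_X^{-1}\,(\mathbf{x}-\mu_X)\right)
```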

However, in the above conventional example, the probability density values, that is, the information on how certain the phoneme recognition is, are lost at the phoneme recognition stage, and this information is not reflected in word recognition (specifically, the accuracy of the similarity calculation is low).
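The loss of information described here can be illustrated with a small sketch. The density values and the helper name `recognize_frame` are hypothetical, chosen only to show that picking the maximum-density phoneme maps a confident frame and an uncertain frame to the same label:

```python
def recognize_frame(densities):
    """Return the phoneme with the largest probability density for one frame."""
    return max(densities, key=densities.get)


# Hypothetical per-frame densities (illustrative numbers, not from the patent).
confident_frame = {"A": 0.90, "O": 0.05, "U": 0.05}
uncertain_frame = {"A": 0.36, "O": 0.34, "U": 0.30}

# Both frames collapse to the same label, so the later word-matching stage
# cannot tell how reliable each decision was.
labels = [recognize_frame(confident_frame), recognize_frame(uncertain_frame)]
print(labels)
```

Once only the labels are passed on, the word recognition stage has to fall back on average confusion statistics rather than the evidence actually observed in each frame.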

Next, a second conventional example will be described with reference to FIG. 2. In FIG. 2, the phoneme standard pattern section 14 stores phoneme standard patterns that represent the mean value of each phoneme for the various parameters and are created in advance through preliminary experiments or the like. The parameter extraction section 11 and the word dictionary section 15 are the same as those of the first conventional example shown in FIG. 1. The operation of the second conventional example is as follows. First, as in the first conventional example, the parameter extraction section 11 obtains the parameter time series of the input speech. The distance calculation section 12 then calculates, for each frame, the distance to each standard pattern. Let the N parameters in a certain frame be C₁, C₂, ..., C_N, and let the mean values of those parameters for a certain phoneme X be μ_X1, μ_X2, ..., μ_XN; the distance d_X(C₁, C₂, ..., C_N) between that frame and the standard pattern of phoneme X is then calculated from these values. Next, the word recognition section 13 determines the similarity for each dictionary entry. In this similarity calculation, phoneme segmentation is performed according to the dictionary phoneme sequence that constitutes the dictionary entry, and the likelihood l between each phoneme and the interval segmented for that phoneme is calculated according to the formula described below. The similarity is obtained as the average of the likelihoods of the phonemes in the dictionary entry.

Let the phoneme be X, let the start and end frame numbers of the interval segmented for X be N_s and N_e, and let the parameter values in the n-th frame be C_n1, C_n2, ..., C_nN; the likelihood l is then calculated by the following formula.

Here, the range of i in the summation in the denominator of the distance division differs depending on the phoneme X; for example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U. The similarity obtained in this way is calculated for each dictionary entry, and the recognized word is determined from it. In this second conventional example, because the phoneme standard pattern holds only the mean values of the parameters, the degree of variation in the parameter values is not reflected in the distance calculation, and the resulting similarity therefore has low accuracy.
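A minimal sketch of this drawback follows. The distance formula is not given in the text, so a squared Euclidean distance to the mean is assumed here; the two phonemes, their variances, and the function names are likewise hypothetical. The point is only that a mean-only distance gives identical values for two phonemes with the same mean, while a density that includes the spread does not:

```python
import math


def squared_distance(frame, mean):
    """Squared Euclidean distance to the mean (an assumed form of the distance)."""
    return sum((c - m) ** 2 for c, m in zip(frame, mean))


def gaussian_density_1d(c, mean, variance):
    """One-dimensional Gaussian density, used only to show the effect of spread."""
    return math.exp(-((c - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


frame = [1.0]
# Two hypothetical phonemes with the same mean but different spread.
narrow = {"mean": [0.0], "variance": 0.1}
wide = {"mean": [0.0], "variance": 4.0}

d_narrow = squared_distance(frame, narrow["mean"])
d_wide = squared_distance(frame, wide["mean"])
p_narrow = gaussian_density_1d(frame[0], narrow["mean"][0], narrow["variance"])
p_wide = gaussian_density_1d(frame[0], wide["mean"][0], wide["variance"])

print(d_narrow == d_wide)  # the mean-only distance cannot separate the two
print(p_narrow < p_wide)   # the density favors the phoneme with the wider spread
```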

(Object of the Invention) The present invention eliminates the drawbacks of the above conventional examples, and aims to improve the accuracy of the similarity calculation and thereby improve the word recognition rate.

(Structure of the Invention) To achieve the above object, the present invention matches the parameter time series obtained from the input speech against the phoneme standard patterns and the word dictionary, performs phoneme segmentation for each dictionary entry, calculates the probability density with which each phoneme produces the observed parameter values, and uses that probability density directly in the similarity calculation, thereby improving the accuracy of the similarity calculation.

(Description of the Embodiment) An embodiment of the present invention will be described below with reference to FIG. 3. In FIG. 3, 21 is a parameter extraction section, 22 is a probability density calculation section, 23 is a word recognition section that performs the phoneme-by-phoneme segmentation, the likelihood calculation, and the similarity calculation, 24 is a phoneme standard pattern section that stores phoneme standard patterns representing the distribution of each phoneme over the various parameters in the form of a mean value (μ_X) for each phoneme and a covariance matrix (Σ_X) between the parameters, and 25 is a word dictionary section that stores a word dictionary in which all words to be recognized are written in phonemes. The parameter extraction section 21, the phoneme standard pattern section 24, and the word dictionary section 25 are the same as the corresponding sections in FIG. 1.

The operation of this embodiment will be explained. First, the parameter extraction section 21 obtains the parameters for each frame from the input speech, and the probability density calculation section 22 then calculates the probability density with which those parameter values are obtained from each phoneme standard pattern. Up to this point the process is the same as in the first conventional example. Next, the word recognition section 23 determines the similarity for each dictionary entry. In this similarity calculation, phoneme segmentation is performed according to the dictionary phoneme sequence that constitutes the dictionary entry, the likelihood l between each phoneme and the interval segmented for that phoneme is calculated according to the formula described below, and the similarity is obtained as the average of the likelihoods of the phonemes in that dictionary entry. Let the phoneme be X, let the start and end frame numbers of the interval segmented for X be N_s and N_e, and let the vector of parameter values in the n-th frame be x_n; with the probability density φ as defined for the first conventional example, the likelihood l is defined by the following formula.

Here, the range of i in the summation in the denominator of the probability density division differs depending on the phoneme X; for example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U. The similarity obtained in this way is calculated for each dictionary entry, and the dictionary entry with the maximum similarity is taken as the recognized word.
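Read literally, the description above suggests a likelihood of the following shape. Only the ratio of densities with a phoneme-dependent denominator, the frame range N_s to N_e, and the averaging of likelihoods over the phonemes of a dictionary entry come from the text. The per-frame averaging, the set I(X) of phonemes summed in the denominator, and the symbols x_n, K, l_k and S are assumptions introduced here for illustration:

```latex
l = \frac{1}{N_e - N_s + 1} \sum_{n=N_s}^{N_e}
      \frac{\phi_X(\mathbf{x}_n)}{\sum_{i \in I(X)} \phi_i(\mathbf{x}_n)},
\qquad
S = \frac{1}{K} \sum_{k=1}^{K} l_k
```

Here l_k is the likelihood of the k-th of the K phonemes in the dictionary entry, and S is the similarity of that entry. With this normalization each per-frame term lies between 0 and 1 whenever X itself belongs to I(X).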

In this embodiment, phoneme segmentation is performed for each dictionary entry; that is, unlike the first conventional example, the segmentation is accurate, with no inserted or dropped phonemes. The likelihood calculation is then performed for each phoneme between that phoneme and its corresponding speech interval, so that the probability density with which that speech interval was uttered from that phoneme, that is, its certainty, is reflected directly in the similarity. This gives the advantage that the similarity is obtained with high accuracy.
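The text does not state how the interval boundaries are chosen when the speech is segmented along a dictionary phoneme sequence. As one possible approach, assumed here and not taken from the patent, a small dynamic program can assign every frame to exactly one phoneme of the sequence, in order, so that the total per-frame score is largest. The function name `segment` and the score values are hypothetical:

```python
def segment(frame_scores, phonemes):
    """Split frames into one contiguous interval per phoneme, in order.

    frame_scores: list of dicts mapping phoneme -> score, one dict per frame.
    phonemes: dictionary phoneme sequence, e.g. ["A", "O"].
    Returns a list of inclusive (start, end) frame indices, one per phoneme.
    """
    T, K = len(frame_scores), len(phonemes)
    if T < K:
        raise ValueError("need at least one frame per phoneme")
    NEG = float("-inf")
    # best[k][t]: best total score with frames 0..t covered by phonemes 0..k
    # and frame t assigned to phoneme k.
    best = [[NEG] * T for _ in range(K)]
    stayed = [[False] * T for _ in range(K)]  # True if frame t-1 was also phoneme k
    best[0][0] = frame_scores[0][phonemes[0]]
    for t in range(1, T):
        for k in range(min(K, t + 1)):
            stay = best[k][t - 1]
            move = best[k - 1][t - 1] if k > 0 else NEG
            if stay == NEG and move == NEG:
                continue
            stayed[k][t] = stay >= move
            best[k][t] = max(stay, move) + frame_scores[t][phonemes[k]]
    # Trace back from the last frame, which must belong to the last phoneme.
    intervals, k, end = [], K - 1, T - 1
    for t in range(T - 1, 0, -1):
        if not stayed[k][t]:
            intervals.append((t, end))
            k, end = k - 1, t - 1
    intervals.append((0, end))
    return intervals[::-1]


# Hypothetical per-frame scores for a two-phoneme entry.
scores = [
    {"A": 0.9, "O": 0.1},
    {"A": 0.8, "O": 0.2},
    {"A": 0.2, "O": 0.8},
    {"A": 0.1, "O": 0.9},
]
print(segment(scores, ["A", "O"]))
```

Because every phoneme of the entry receives at least one frame and no other phoneme can appear, a segmentation of this kind has no inserted or dropped phonemes by construction.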

(Effects of the Invention) The present invention is configured as described above and provides the following effect. The input speech is segmented according to the dictionary phoneme sequence of each dictionary entry, the likelihood between each phoneme in the dictionary and the speech interval segmented for it is calculated using the probability density directly, and the similarity is defined as the average of these likelihoods, so that a highly accurate similarity value can be obtained.
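The overall computation summarized above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the per-frame densities are taken as already computed, the segmentation intervals are supplied by hand, the likelihood is assumed to be a per-frame average of the density ratio, and all names (`likelihood`, `similarity`, `recognize`, `competitor_sets`) and numbers are hypothetical:

```python
def likelihood(frame_densities, phoneme, competitors, start, end):
    """Average over frames start..end of phi_X / sum_i phi_i (assumed form)."""
    total = 0.0
    for n in range(start, end + 1):
        densities = frame_densities[n]
        denominator = sum(densities[i] for i in competitors)
        total += densities[phoneme] / denominator
    return total / (end - start + 1)


def similarity(frame_densities, entry, competitor_sets):
    """Mean of the per-phoneme likelihoods for one dictionary entry."""
    values = [
        likelihood(frame_densities, x, competitor_sets[x], start, end)
        for x, (start, end) in zip(entry["phonemes"], entry["intervals"])
    ]
    return sum(values) / len(values)


def recognize(frame_densities, entries, competitor_sets):
    """Return the dictionary word whose similarity is largest."""
    return max(
        entries,
        key=lambda word: similarity(frame_densities, entries[word], competitor_sets),
    )


# Hypothetical densities for four frames and a toy two-phoneme inventory.
frame_densities = [
    {"A": 0.9, "O": 0.1},
    {"A": 0.8, "O": 0.2},
    {"A": 0.2, "O": 0.8},
    {"A": 0.1, "O": 0.9},
]
# Phonemes summed in the denominator for each phoneme X (assumed sets).
competitor_sets = {"A": ["A", "O"], "O": ["A", "O"]}
# Each entry carries its own segmentation of the same input speech.
entries = {
    "AO": {"phonemes": ["A", "O"], "intervals": [(0, 1), (2, 3)]},
    "OA": {"phonemes": ["O", "A"], "intervals": [(0, 1), (2, 3)]},
}

print(recognize(frame_densities, entries, competitor_sets))
```

In this toy case the entry "AO" scores far higher than "OA", because the density ratios, rather than hard phoneme labels, carry through to the similarity.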

[Brief explanation of drawings]

FIGS. 1 and 2 are functional block diagrams for explaining conventional word speech recognition methods, and FIG. 3 is a functional block diagram for explaining an embodiment of the word speech recognition method according to the present invention.
21 ... parameter extraction section, 22 ... probability density calculation section, 23 ... word recognition section, 24 ... phoneme standard pattern section, 25 ... word dictionary section.

Claims

1. In a word speech recognition method that recognizes the words of input speech using a word dictionary in which the words to be recognized are written in phonemes and standard patterns of each phoneme expressed as a distribution of that phoneme's acoustic parameters, a word speech recognition method characterized in that the input speech is matched against each dictionary entry in the word dictionary, the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that constitutes each dictionary entry, the probability density with which each segmented interval is generated from the corresponding phoneme is calculated using the standard pattern of that phoneme, and the similarity between each dictionary entry and the input speech is determined using this probability density value to recognize the word.