JPH045392B2

Info

Publication number: JPH045392B2
Application number: JP59058174A
Authority: JP
Priority date: 1984-03-28
Filing date: 1984-03-28
Publication date: 1992-01-31
Also published as: JPS60202487A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a word speech recognition method that recognizes words by matching input speech against a word dictionary written in phoneme notation.

(Configuration of the Conventional Example and Its Problems) FIG. 1 is a functional block diagram of an apparatus for carrying out both an example of a conventional word speech recognition method and an embodiment of the word speech recognition method of the present invention. The conventional example is described with reference to FIGS. 1 and 2. In FIG. 1, 1 is a parameter extraction unit that creates a time series of parameters from the input speech; 2 is a probability density calculation unit that matches the parameters against the phoneme standard patterns and calculates the probability density of each phoneme; and 3 is a word recognition unit that performs per-phoneme segmentation, likelihood calculation, word similarity calculation, and so on. 4 is a phoneme standard pattern unit that stores the phoneme standard patterns. These patterns are created in advance through preliminary experiments and represent the distribution of the various parameters for each phoneme as a mean (μ_i) for each phoneme and a covariance matrix (Σ_i) among the parameters. 5 is a word dictionary unit that stores a word dictionary in which all words to be recognized are written as symbol strings in phoneme units. In the word dictionary, for example, the words "Sapporo" (サツポロ) and "Kurume" (クルメ) are written as "SAQPORO" and "KURUME", respectively.

Next, the operation of the conventional example is described. The parameter extraction unit 1 analyzes the input speech in 10 ms frames, extracts parameters, and creates a parameter time series. The probability density calculation unit 2 matches the parameters obtained for each frame against the phoneme standard patterns and calculates the probability density of each phoneme. Next, for each dictionary entry, the word recognition unit 3 uses these parameters and the resulting probability density values to segment the input one phoneme at a time, following the dictionary phoneme sequence that makes up the entry. It calculates the likelihood l of each segmented interval according to the type of the phoneme, and obtains the similarity of the dictionary entry as the average of the likelihoods of its phonemes. Let the phoneme be X, let N_s and N_e be the frame numbers at the start and end of the interval segmented for X, and let C_n be the values of the parameters in the n-th frame. The likelihood l_X of phoneme X is then defined by the following formula.

φ_i(C_n) denotes the probability density of a phoneme i and is defined as follows.

$$\phi_i(C_n) = \frac{1}{(2\pi)^{N/2}\,|\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}\,(C_n-\mu_i)^{T}\,\Sigma_i^{-1}\,(C_n-\mu_i)\right]$$

C_n: the N parameters in the n-th frame (vector)
μ_i: mean of the parameters of phoneme i (vector)
Σ_i: covariance matrix

In the likelihood formula, the range of the summation over i in the denominator of the probability density ratio depends on what the phoneme X is. For example, when X is the phoneme A, i ranges over the five vowels A, E, I, O, and U. The word similarity L_M obtained in this way is computed for each dictionary entry according to the formula for L_M, and the dictionary entry with the largest L_M is taken as the recognized word.
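The density calculation and the normalization over competing phonemes can be sketched as follows. This is only an illustration under stated assumptions: the exact likelihood formula is not reproduced in this text, so `segment_likelihood` assumes the frame average of the target density divided by the sum of densities over the competing set, which matches the description of a ratio with a summation in the denominator but may differ from the original formula. The function names and the toy two-parameter patterns are hypothetical.

```python
import numpy as np

def gaussian_density(c, mean, cov):
    """Multivariate normal density of one frame's parameter vector c."""
    c = np.asarray(c, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = c.shape[0]                       # number of parameters N
    diff = c - mean
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
    # solve() applies the inverse covariance without forming it explicitly
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def segment_likelihood(frames, target, competitors, patterns):
    """ASSUMED form: mean over frames of phi_target / sum of phi_i over competitors."""
    ratios = []
    for c in frames:
        dens = {p: gaussian_density(c, *patterns[p]) for p in competitors}
        ratios.append(dens[target] / sum(dens.values()))
    return float(np.mean(ratios))

# Toy standard patterns as (mean, covariance); the values are illustrative only.
patterns = {
    "A": ([1.0, 0.0], np.eye(2)),
    "E": ([0.0, 1.0], np.eye(2)),
}
frames = [[0.9, 0.1], [1.1, -0.1]]
print(segment_likelihood(frames, "A", ["A", "E"], patterns))
```

For the vowel case described above, `competitors` would hold the five vowels A, E, I, O, and U.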

$$L_M = \frac{1}{NP}\sum_{j=1}^{NP} l_j$$

L_M: similarity of the M-th word in the dictionary
l_j: likelihood of phoneme j in the dictionary phoneme sequence
NP: number of phonemes in the dictionary entry

FIG. 2 shows the changes over time of the probability densities φ_K, φ_U, φ_R, φ_U, φ_M, and φ_E of the phonemes /K/, /U/, /R/, /U/, /M/, and /E/ when /KURUME/ (Kurume) is uttered. In this case, segmentation and likelihood calculation for the dictionary word /KURUME/ follow the order of the dictionary phoneme sequence /K/, /U/, /R/, /U/, /M/, /E/. The first phoneme /K/ is assigned the interval (a-b) segmented using φ_K, and l_K is calculated from φ_K according to the likelihood formula. l_U, l_R, l_U, l_M, and l_E are obtained in the same way.
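The averaging of per-phoneme likelihoods into a word similarity and the selection of the best dictionary entry can be sketched as below. The function names and the likelihood values are hypothetical and chosen only for illustration; the per-phoneme likelihoods are assumed to have been computed already.

```python
def word_similarity(phoneme_likelihoods):
    """L_M: mean of the per-phoneme likelihoods of one dictionary entry."""
    return sum(phoneme_likelihoods) / len(phoneme_likelihoods)

def recognize(likelihoods_by_word):
    """Return the dictionary entry whose similarity L_M is largest."""
    return max(likelihoods_by_word,
               key=lambda word: word_similarity(likelihoods_by_word[word]))

# Hypothetical per-phoneme likelihoods for two dictionary entries.
scores = {
    "KURUME": [0.9, 0.8, 0.7, 0.8, 0.9, 0.7],
    "SAQPORO": [0.4, 0.5, 0.3, 0.6, 0.5, 0.4, 0.5],
}
best = recognize(scores)
print(best)  # KURUME
```

Dividing by the number of phonemes keeps entries of different lengths comparable, which is why the similarity is an average and not a sum.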

FIG. 3 shows the changes over time of the probability density of each phoneme when the same word /KURUME/ is uttered by a different speaker. In FIG. 3, segmentation for the dictionary word /KURUME/ again follows the order of the dictionary phoneme sequence /K/, /U/, /R/, /U/, /M/, /E/. When the first phoneme /K/ is segmented, however, its probability density φ_K remains dominant until near the start of the second /U/ in the dictionary phoneme sequence, and in the interval of the first /U/, φ_U is small compared with φ_K. In the interval of /R/, φ_R is likewise only about as large as φ_K.

As a result, the interval of /K/, which should be (c-d), is mistaken for interval (c-e) or interval (c-f). The segmentation of the second and subsequent phonemes then goes wrong as well, and their likelihoods drop. Consequently, the conventional method had the drawback that words containing a phoneme sequence of an unvoiced consonant, a devoiced or easily slurred vowel, and a voiced consonant in succession were easily misrecognized.

(Object of the Invention) The present invention eliminates the above drawback of the conventional example. Its object is to improve the accuracy of the likelihood calculation and thereby improve the word recognition rate.

(Configuration of the Invention) To achieve the above object, when segmenting and calculating the likelihood of a devoiced or slurred vowel sandwiched between an unvoiced consonant and a voiced consonant, the present invention treats the three consecutive phonemes that contain the vowel (unvoiced consonant, vowel, voiced consonant) as a single unit for segmentation and likelihood calculation. This improves the accuracy of both the segmentation and the likelihood calculation.

(Description of the Embodiment) An embodiment of the present invention is described below with reference to FIGS. 1 and 3. In FIG. 1, the phoneme standard patterns are the same as in the conventional example. The word dictionary also writes the words to be recognized as strings of phoneme symbols, but it differs from the conventional example in that devoiced vowels and vowels that tend to be slurred are marked in advance. The parameter time series obtained by parameter extraction is the same as in the conventional example.
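One way to picture the marked dictionary is the sketch below, which merges an unvoiced consonant, a marked vowel, and a voiced consonant into a single unit. The `*` suffix used as the mark, the two consonant sets, and the function name are all assumptions made for illustration; the description states only that such vowels are marked in advance, not how.

```python
# Assumed phoneme classes; the description does not list them.
UNVOICED = {"K", "S", "T", "H", "P"}
VOICED = {"R", "M", "N", "G", "Z", "D", "B"}

def group_units(phonemes):
    """Merge (unvoiced C, marked V, voiced C) triples into one unit each."""
    units, i = [], 0
    while i < len(phonemes):
        is_triple = (
            i + 2 < len(phonemes)
            and phonemes[i] in UNVOICED
            and phonemes[i + 1].endswith("*")   # hypothetical mark on the vowel
            and phonemes[i + 2] in VOICED
        )
        if is_triple:
            units.append(phonemes[i] + phonemes[i + 1].rstrip("*") + phonemes[i + 2])
            i += 3
        else:
            units.append(phonemes[i].rstrip("*"))
            i += 1
    return units

print(group_units(["K", "U*", "R", "U", "M", "E"]))  # ['KUR', 'U', 'M', 'E']
```

Unmarked vowels are left alone, so an entry with no marked vowel yields the same phoneme sequence as in the conventional example.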

The operation of this embodiment is as follows. First, parameters are obtained for each frame from the input speech, and the parameter values are used to calculate the probability density given by each phoneme standard pattern. Up to this point the process is the same as in the conventional example. Next, for each dictionary entry, phoneme X is segmented according to the dictionary phoneme sequence that makes up the entry, and the likelihood l_X of the interval segmented for X is calculated. However, when the dictionary phoneme sequence contains a devoiced vowel or easily slurred vowel V sandwiched between an unvoiced consonant C₁ and a voiced consonant C₂, the probability density of V does not show the character of a vowel. It shows the character of the unvoiced or voiced consonant instead. Therefore, for a sequence of unvoiced consonant, devoiced or slurred vowel, and voiced consonant (C₁VC₂), the three phonemes are segmented together using the probability density values of the individual phonemes, in accordance with the phoneme types and their order, and the likelihood l_C1VC2 is calculated for the segmented interval.

In FIG. 3, in the interval (d-e) of the /U/ that follows /K/, the probability density φ_U of /U/ is almost zero. Instead, the probability density φ_K of /K/ remains dominant until near the start of the second /U/. The probability density φ_R of /R/ is about the same as φ_K in the interval of /R/. The interval (c-f) is therefore taken as the segmentation interval of the phoneme sequence /KUR/, which combines /K/, /U/, and /R/ into one unit, and the likelihood l_KUR for the three phonemes is calculated within that interval from the values of φ_K and φ_R according to the formula for the merged unit.
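The idea of scoring the merged /KUR/ interval from φ_K and φ_R alone can be illustrated with the sketch below. The formula for the merged unit is not reproduced in this text, so the combination rule used here (the larger of the two consonant densities in each frame, averaged over the interval) is purely an assumption for illustration, as are the function name and the density values.

```python
import numpy as np

def merged_likelihood(phi_c1, phi_c2, start, end):
    """ASSUMED score for a merged C1-V-C2 interval covering frames start..end.

    phi_c1, phi_c2: per-frame densities of the two consonants.
    The density of the vowel itself is not used.
    """
    seg1 = np.asarray(phi_c1[start:end + 1], dtype=float)
    seg2 = np.asarray(phi_c2[start:end + 1], dtype=float)
    # Per frame, keep whichever consonant density is larger, then average.
    return float(np.mean(np.maximum(seg1, seg2)))

# Hypothetical densities: /K/ dominates early, /R/ takes over late.
phi_K = [0.9, 0.8, 0.7, 0.4, 0.1]
phi_R = [0.0, 0.1, 0.2, 0.5, 0.8]
l_KUR = merged_likelihood(phi_K, phi_R, 0, 4)
print(l_KUR)
```

Because the vowel density is left out, a /U/ that never shows vowel character no longer lowers the score of the unit.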

In contrast to this merged calculation, the likelihood of every other, ordinary phoneme is calculated with the same formula as in the conventional example.

In this embodiment, a slurred vowel is not treated as a single vowel. Instead, the phoneme sequence of unvoiced consonant, slurred vowel, and voiced consonant is segmented and scored as one unit, which has the advantage of improving the recognition rate for words containing slurred vowels.

(Effects of the Invention) The present invention is configured as described above and provides the following effect.

When segmenting and calculating the likelihood of a slurred vowel sandwiched between an unvoiced consonant and a voiced consonant, the three consecutive phonemes that contain it (unvoiced consonant, slurred vowel, voiced consonant) are segmented together and their likelihood is calculated as a unit. This has the advantage that segmentation and likelihood calculation can be performed more accurately than with the conventional method.

[Brief explanation of drawings]

FIG. 1 is a diagram for explaining the word speech recognition method of the conventional example and of an embodiment of the present invention. FIG. 2 shows the changes over time of the probability densities φ_K, φ_U, φ_R, φ_U, φ_M, and φ_E of the phonemes /K/, /U/, /R/, /U/, /M/, and /E/ when /KURUME/ (Kurume) is uttered. FIG. 3 shows the changes over time of φ_K, φ_U, φ_R, φ_U, φ_M, and φ_E when a speaker different from that of FIG. 2 utters /KURUME/.

1: parameter extraction unit, 2: probability density calculation unit, 3: word recognition unit, 4: phoneme standard pattern unit, 5: word dictionary unit.

Claims

[Claims]

1 A word speech recognition method that recognizes the words of input speech using a word dictionary, in which each word to be recognized is written as a string of phoneme symbols, and a standard pattern for each phoneme, expressed as a distribution of the acoustic parameters of that phoneme, wherein: the input speech is matched against each dictionary entry in the word dictionary; the input speech is segmented phoneme by phoneme according to the dictionary phoneme sequence that makes up each dictionary entry; the probability density that each segmented speech interval was generated from its phoneme is calculated using the standard pattern of that phoneme; and the similarity between each dictionary entry and the input speech is calculated from these probability density values for the segmented speech intervals in order to recognize the word; the method being characterized in that, for a devoiced vowel or a slurred vowel, the three consecutive phonemes that contain it, namely an unvoiced consonant, the devoiced or slurred vowel, and a voiced consonant, are segmented together and their likelihood is calculated as a unit.