JPH0566596B2


Info

Publication number: JPH0566596B2
Application number: JP59104786A
Authority: JP
Inventors: Satoru Kabasawa; Hidekazu Tsuboka; Yoshiteru Mifune
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-05-25
Filing date: 1984-05-25
Publication date: 1993-09-22
Also published as: JPS60249197A

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a speech recognition device, and in particular to a device that recognizes speech in which syllables are uttered continuously, as in words or phrases.

(Configuration of the Prior Art and Its Problems) Speech is the most natural means by which humans produce information. If it could be used as the input means of a human-machine system, the benefit would be very great.

Conventionally, speech recognition devices based on the speaker-dependent registration method have been put into practical use. In this method, a speaker who intends to use the recognition device first utters every word to be recognized; each utterance is converted into a sequence of feature vectors and registered in a word dictionary as a standard pattern. At recognition time, the uttered speech is likewise converted into a sequence of feature vectors, its closeness to each word in the word dictionary is calculated according to a predetermined rule, and the most similar word is taken as the recognition result.

This method works well when the number of words to be recognized is small, but as the vocabulary grows to hundreds or thousands of words, the following three problems can no longer be ignored.

(1) The burden on the speaker at registration time increases significantly.

(2) The time required to calculate the similarity or distance between the speech uttered at recognition time and the standard patterns increases significantly, and the response of the recognition device becomes slow.

(3) The memory required for the word dictionary becomes very large.

One way to avoid these drawbacks is to take as the unit of recognition the monosyllable, that is, a consonant plus a vowel or a vowel alone (hereinafter written CV and V, respectively, where C denotes a consonant and V a vowel). Monosyllables are registered as standard patterns in the form of feature-vector sequences, and at recognition time the input speech, converted into a feature-vector sequence, is matched against these monosyllable standard patterns and thereby converted into a sequence of monosyllables. Japanese has at most 101 monosyllables, and they correspond to kana characters, so this method can convert (recognize) any Japanese word or sentence into a monosyllable string, and problems (1) to (3) above are all resolved.

This approach, however, raises two problems of its own: coarticulation and segmentation. Coarticulation is the phenomenon whereby, when syllables are uttered continuously, each syllable is influenced by its neighbors, so that its spectral structure changes depending on the syllables that precede and follow it. Segmentation is the division of continuously uttered speech into monosyllable units, and no definitive method for doing this reliably has yet been found. To get around these two problems, the current practice is to utter each monosyllable separately, and some devices that work this way are in practical use.

However, uttering monosyllables one at a time is unnatural and imposes strain on the speaker.

(Object of the Invention) An object of the present invention is to realize a speech recognition device that, even when the number of words or phrases to be recognized is large, is compact and inexpensive, makes registration of standard patterns simple, and allows recognition accuracy and processing speed to be improved.

(Structure of the Invention) In the present invention, syllables such as V, CV, VV, and VCV are registered in advance, and a word or phrase input by uttering such syllables continuously is recognized as a string of these syllables with the aid of a word dictionary. The device comprises: feature extraction means for converting the input speech signal into a sequence A of feature patterns; stationary point extraction means for extracting the stationary points of the input signal; vowel identification means for regarding each extracted stationary point as a vowel and identifying it, thereby obtaining an input vowel string X; vowel string identification means for matching the input vowel string X against the vowel strings $Y^n$ ($n = 1, 2, \ldots, N$) of the words or phrases to be recognized and identifying the standard vowel string $Y^{n_o}$ closest to X; vowel string correspondence determination means for determining, from the result of matching X with the identified vowel string $Y^{n_o}$, the correspondence between the vowels of X and those of $Y^{n_o}$; and determination means. For the partial sequence of the input signal that corresponds to a partial interval of X determined from this correspondence, the determination means takes the specific part $(y^{n_o}_{j_1}, \ldots, y^{n_o}_{j_2})$ ($j_1 < j_2$) of $Y^{n_o}$ corresponding to that interval, and matches the partial sequence against the standard patterns of the syllables defined by this vowel string, such as $y^{n_o}_{j_1} C y^{n_o}_{j_1+1}$, $y^{n_o}_{j_1+1} C y^{n_o}_{j_1+2}$, …, $y^{n_o}_{j_2-1} C y^{n_o}_{j_2}$ (C: consonant). It thereby identifies the syllable string corresponding to the partial feature pattern sequence, determines the word or phrase from the identified syllable string, and outputs it as the recognition result. Even when the number of words or phrases to be recognized is large, the device is compact and inexpensive, registration of standard patterns is simple, and recognition accuracy and processing speed can be improved.
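As a rough illustration of the first two means listed above (stationary point extraction and vowel identification), the following Python sketch uses the windowed-variance criterion that the embodiment gives as equations (1) to (3). Several details are assumptions made for the example and are not stated in the patent: "minimum variance" is read as a local minimum, vowels are compared with a city-block distance, and the one-channel toy signal and the `templates` dictionary are hypothetical.

```python
from typing import Dict, List, Sequence

Frame = Sequence[float]  # one feature pattern (20 filter-bank outputs in the text)


def window_variance(frames: Sequence[Frame], l: int, n: int) -> float:
    """Equation (1): squared deviation summed over channels and frames l-n..l+n."""
    window = frames[l - n : l + n + 1]
    total = 0.0
    for k in range(len(window[0])):
        mean_k = sum(f[k] for f in window) / len(window)  # equation (2)
        total += sum((f[k] - mean_k) ** 2 for f in window)
    return total


def stationary_points(frames: Sequence[Frame], n: int = 5) -> List[int]:
    """Return 0-based frame indices where the window variance has a local minimum.

    Assumption: "minimum" is read as a local minimum; on a flat run of equal
    variances only the first frame of the run is reported.
    """
    candidates = range(n, len(frames) - n)  # frames that have a full window
    var = {l: window_variance(frames, l, n) for l in candidates}
    inf = float("inf")
    return [
        l for l in candidates
        if var[l] < var.get(l - 1, inf) and var[l] <= var.get(l + 1, inf)
    ]


def identify_vowel(frame: Frame, templates: Dict[str, Frame]) -> str:
    """Pick the vowel whose standard pattern is nearest (city-block distance assumed)."""
    def dist(a: Frame, b: Frame) -> float:
        return sum(abs(p - q) for p, q in zip(a, b))
    return min(templates, key=lambda v: dist(frame, templates[v]))


# Toy one-channel signal: a steady run, a glide, then another steady run.
frames = [[0.0]] * 6 + [[2.0], [5.0], [8.0]] + [[10.0]] * 6
templates = {"A": [0.0], "U": [10.0]}  # hypothetical vowel standard patterns
points = stationary_points(frames, n=2)
vowels = [identify_vowel(frames[l], templates) for l in points]
```

Each steady run contributes one stationary point and the glide between them contributes none, so the resulting list of vowels plays the role of the input vowel string X.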

(Description of the Embodiment) In what follows, the term "word" is taken to cover "phrase" (bunsetsu) as well.

FIG. 1 is a functional block diagram showing a first embodiment of the present invention. Reference numeral 1 denotes a speech signal input terminal. Reference numeral 2 denotes a feature extraction unit that uses, for example, a 20-channel filter bank, with one frame taken to be 10 msec.
Its output therefore provides 20 values (a feature pattern) every 10 msec. That is, the input speech signal is converted into a sequence of feature patterns $A = (a_1, a_2, \ldots, a_l, \ldots, a_L)$, where $a_l = (a_{l1}, a_{l2}, \ldots, a_{l20})$ is the feature pattern obtained in the $l$-th frame and $L$ is the number of frames of the input speech.

Reference numeral 3 denotes a power calculation unit. Writing $P_l$ for the power of the $l$-th frame, it calculates for every frame

$$P_l = \sqrt{a_{l1}^2 + a_{l2}^2 + \cdots + a_{l20}^2}.$$

Reference numeral 4 denotes a speech section detection unit, which detects the start and end of the input speech signal from the pattern of variation of $P_l$. One possible method is as follows. A threshold for discriminating silence from sound is set. When the power stays at or above the threshold for at least a predetermined time (for example, 30 msec), the instant at which the threshold was exceeded is taken as the start of speech; when the power stays at or below the threshold for at least a predetermined time (for example, 300 msec), the instant at which it fell to the threshold or below is taken as the end of speech.

Reference numeral 5 denotes a vowel standard pattern storage unit, which stores in advance the feature pattern of the stationary part of each vowel. Reference numeral 6 denotes a buffer memory, which temporarily stores the feature patterns of the input speech signal from the start point to the end point detected by the speech section detection unit 4.

Reference numeral 7 denotes a stationary point detection unit, which reads the contents of the buffer memory 6 and detects stationary points. A stationary point can be detected, for example, by computing for each frame the variance of the feature patterns over several frames before and after it (for example, 5 frames) and taking the frame at which this variance is a minimum. Writing $\sigma_l^2$ for this variance at the $l$-th frame, for the feature pattern sequence $A$ above it is given by

$$\sigma_l^2 = \sum_{k=1}^{20} \sum_{q=l-N}^{l+N} \left(a_{qk} - \bar{a}_k\right)^2 \tag{1}$$

$$\bar{a}_k = \frac{1}{2N+1} \sum_{q=l-N}^{l+N} a_{qk} \tag{2}$$

$$N = 5 \tag{3}$$

Reference numeral 8 denotes a stationary point (frame) storage unit, which stores the sequence of stationary points (frames) detected by the stationary point detection unit 7. Reference numeral 9 denotes a vowel pattern comparison unit, which regards each stationary point (frame) stored in the storage unit 8 as the center frame of a vowel and calculates the distance between the feature pattern of that stationary point and the standard pattern (feature pattern) of each vowel held in the vowel standard pattern storage unit 5. (A similarity may be used instead of a distance. In what follows "distance" stands for both, so that "small distance" means "high similarity".) Reference numeral 10 denotes a vowel identification unit, which takes the vowel giving the minimum among the outputs of the vowel pattern comparison unit 9 as the vowel identification result for that stationary point (frame).

Reference numeral 11 denotes a vowel/sokuon determination result storage unit, which stores the vowel string obtained by the vowel identification unit 10 (the input vowel string) together with those silent sections detected by the speech section detection unit 4 that are judged to be sokuon (geminate consonants). A sokuon is detected from the duration of a silent section as defined above; for example, a silent section lasting 100 to 250 msec is judged to be a sokuon. The storage unit 11 also stores the vowel string identified by the vowel string identification unit 15, described later. Reference numeral 12 denotes a standard vowel string storage unit, which stores, without duplication, the vowel strings (standard vowel strings) of the words to be recognized, that is, of the words stored in the word dictionary unit 22 described later. (Hereinafter a vowel string is taken to include sokuon.) Reference numeral 13 denotes an inter-vowel distance storage unit, which stores inter-vowel distances calculated in advance from the vowel standard patterns held in the vowel standard pattern storage unit 5.

Reference numeral 14 denotes a vowel string comparison unit, which reads the input vowel string from the vowel/sokuon determination result storage unit 11 and matches it against each standard vowel string stored in the standard vowel string storage unit 12. This matching can be performed by the well-known DP matching. Let the $n$-th standard vowel string be $Y^n = (y^n_1, y^n_2, \ldots, y^n_j, \ldots, y^n_{J^n})$ and the input vowel string be $X = (x_1, x_2, \ldots, x_i, \ldots, x_I)$, where $J^n$ is the number of vowels in the standard vowel string and $I$ is the number of stationary points in the input vowel string, and let $d^n(i, j)$ be the distance between the $i$-th input vowel $x_i$ and the $j$-th standard vowel $y^n_j$. Solving the recurrence

$$g^n(i,j) = \min \begin{cases} g^n(i-3,\,j-1) + 3\,d^n(i,j) \\ g^n(i-2,\,j-1) + 2\,d^n(i,j) \\ g^n(i-1,\,j-1) + d^n(i,j) \\ g^n(i-2,\,j-2) + 2\,d^n(i,j) \\ g^n(i-1,\,j-2) + d^n(i,j) \end{cases} \tag{4}$$

with $g^n(1,1) = d^n(1,1)$, the distance $D(X, Y^n)$ between $X$ and $Y^n$ is

$$D(X, Y^n) = g^n(I, J^n). \tag{5}$$

Here $d^n(i, j)$ is obtained by reading from the inter-vowel distance storage unit 13 the distance between the $i$-th vowel $x_i$ of the input vowel string $X$ and the $j$-th vowel $y^n_j$ of the $n$-th standard vowel string $Y^n$. The distance $D(X, Y^n)$ is thus obtained and output. The vowel string comparison unit 14 also records, for each $g^n(i, j)$ given by equation (4), which of the grid points $(i-3, j-1)$, $(i-2, j-1)$, $(i-1, j-1)$, $(i-2, j-2)$, $(i-1, j-2)$ the transition came from. Once equation (5) has been evaluated, it outputs the path from the start point $(1, 1)$ to the end point $(I, J^n)$ (hereinafter the "optimal path") together with the distance $D(X, Y^n)$. Various forms have been proposed for the recurrence (4), and only one example is shown here. In this way the vowel string comparison unit 14 matches the input vowel string against all standard vowel strings $Y^n$ ($n = 1, 2, \ldots, N$) in turn and outputs the distance and optimal path for each.

Reference numeral 15 denotes a vowel string identification unit, which identifies the standard vowel string giving the minimum among the outputs of the vowel string comparison unit 14, takes it as the identified vowel string $Y^{n_o}$, and outputs $Y^{n_o}$ together with its optimal path (the identified optimal path). Reference numeral 16 denotes a vowel correspondence determination unit, which traces the identified optimal path backward from the end grid point $(I, J^{n_o})$ and determines which stationary point (that is, which vowel) of the stationary point sequence each grid point $(i^{n_o}, j^{n_o})$ on the path corresponds to.

FIG. 2 shows the grid-point transitions for the case in which "A I ZU WA KA MA TSU" is uttered, eight stationary points are detected by the stationary point detection unit 7, vowel identification of each stationary point by the vowel pattern comparison unit 9 and the vowel identification unit 10 yields the input vowel string X "AIUIUIAU", and the vowel string comparison unit 14 and the vowel string identification unit 15 yield the identified vowel string $Y^{n_o}$ "AIUAAAU". The vowel correspondence determination unit 16 traces these transitions in reverse. The grid point $(I, J^{n_o}) = (8, 7)$ is reached from grid point (7, 5), grid point (7, 5) from (5, 3), grid point (5, 3) from (2, 2), and grid point (2, 2) from (1, 1); following these transitions in order, the unit determines the correspondence between each vowel of $Y^{n_o}$ and each vowel of X. That is, the 1st vowel "A", 2nd vowel "I", 5th vowel "U", 7th vowel "A", and 8th vowel "U" of X correspond respectively to the 1st vowel "A", 2nd vowel "I", 3rd vowel "U", 5th vowel "A", and 7th vowel "U" of $Y^{n_o}$. The 3rd vowel "U" and 4th vowel "I" of X have no corresponding vowel in $Y^{n_o}$ (insertion). The 6th vowel "A" of $Y^{n_o}$ has no corresponding vowel in X (dropout). Finally, the 6th vowel "I" of X has no corresponding vowel in $Y^{n_o}$, and at the same time the 4th vowel "A" of $Y^{n_o}$ has no corresponding vowel in X (insertion and dropout occurring simultaneously).
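The vowel-string matching of equations (4) and (5), including the stored back-pointers used to recover the optimal path, can be sketched in Python as below. This is an illustrative sketch rather than the patent's implementation: the 0/1 distance `zero_one` is a hypothetical stand-in for the table held in the inter-vowel distance storage unit 13, both strings are assumed non-empty, and ties between equally good transitions are broken by the order of `STEPS`, which the patent does not specify.

```python
from typing import Callable, List, Optional, Tuple

# (di, dj, weight): g(i, j) = min over g(i - di, j - dj) + weight * d(i, j), as in (4)
STEPS = [(3, 1, 3), (2, 1, 2), (1, 1, 1), (2, 2, 2), (1, 2, 1)]
INF = float("inf")
Point = Tuple[int, int]


def zero_one(a: str, b: str) -> float:
    """Hypothetical inter-vowel distance: 0 for equal vowels, 1 otherwise."""
    return 0.0 if a == b else 1.0


def match_vowel_strings(
    x: str, y: str, d: Callable[[str, str], float]
) -> Tuple[float, List[Point]]:
    """Return D(X, Y) of equation (5) and the optimal path as 1-based grid points."""
    I, J = len(x), len(y)
    g = [[INF] * (J + 1) for _ in range(I + 1)]
    back: List[List[Optional[Point]]] = [[None] * (J + 1) for _ in range(I + 1)]
    g[1][1] = d(x[0], y[0])
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            if (i, j) == (1, 1):
                continue
            for di, dj, w in STEPS:
                pi, pj = i - di, j - dj
                if pi < 1 or pj < 1 or g[pi][pj] == INF:
                    continue  # predecessor outside the grid or unreachable
                cost = g[pi][pj] + w * d(x[i - 1], y[j - 1])
                if cost < g[i][j]:
                    g[i][j] = cost
                    back[i][j] = (pi, pj)  # remember where the transition came from
    if g[I][J] == INF:
        return INF, []  # the end point cannot be reached with the allowed steps
    path = [(I, J)]
    while path[-1] != (1, 1):
        pi, pj = path[-1]
        path.append(back[pi][pj])
    path.reverse()
    return g[I][J], path


distance, path = match_vowel_strings("AIUIUIAU", "AIUAAAU", zero_one)
```

With the 0/1 distance, the FIG. 2 strings match at distance 0, and with this tie-breaking order the recovered path passes through (1, 1), (2, 2), (5, 3), (7, 5), and (8, 7), the same grid points as in the text. A real inter-vowel distance table would generally give non-zero distances and fewer ties.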
(The case in which "insertion and dropout occur simultaneously" is distinguished, by the transitions of the optimal path, from the case in which the vowel determined as "dropped" in the identified vowel string $Y^{n_o}$ was misidentified by the vowel identification unit 10 as the vowel determined as "inserted" in the input vowel string X. For the transition from grid point (5, 3) to grid point (7, 5), a direct transition as in FIG. 2 is the case of simultaneous insertion and dropout, whereas the transition (5, 3) → (6, 4) → (7, 5) is the case in which a misidentification occurred at grid point (6, 4).) The correspondence between the vowels of X and of $Y^{n_o}$ determined in this way, together with $Y^{n_o}$ itself, is stored in the vowel/sokuon determination result storage unit 11.

Reference numeral 17 denotes a specific part determination unit. It reads from the storage unit 11 the correspondence between the vowels of X and of $Y^{n_o}$, together with $Y^{n_o}$, and determines as specific parts the sections between adjacent vowels for which X and $Y^{n_o}$ correspond correctly (that is, vowels other than dropped, inserted, or misidentified vowels). In FIG. 2, for example, the specific parts are the section from the 1st vowel "A" to the 2nd vowel "I" of the input vowel string, the section from the 2nd vowel "I" to the 5th vowel "U", the section from the 5th vowel "U" to the 7th vowel "A", and the section from the 7th vowel "A" to the 8th vowel "U". If the word-initial vowel is in error, however, the section from the beginning of the word to the first correctly identified vowel is taken as a specific part, and if the word-final vowel is in error, the section from the last correctly identified vowel to the word-final vowel is taken as a specific part.

Reference numeral 18 denotes a syllable standard pattern storage unit. Feature pattern sequences for syllables such as V, CV, VV, and VCV are uttered in advance by the speaker and registered as standard patterns: for V and CV, from the beginning of the word to the stationary part of the vowel; for VV and VCV, from the stationary part of the preceding vowel to the stationary part of the following vowel.

Reference numeral 19 denotes a syllable pattern comparison unit. For each specific part determined by the specific part determination unit 17, it reads from the buffer memory 6 the partial feature pattern sequence that starts at the frame corresponding to the preceding vowel $y^{n_o}_{j_1}$ defining the specific part and ends at the frame corresponding to the following vowel $y^{n_o}_{j_2}$. It then matches this sequence against syllable standard pattern sequences (composite syllable standard pattern sequences) formed by combining in various ways, so as to correspond to the vowel string of the specific part, the standard patterns stored in the syllable standard pattern storage unit 18 for the syllables defined by the specific part $(y^{n_o}_{j_1}, \ldots, y^{n_o}_{j_2})$ ($j_1 < j_2$), such as $y^{n_o}_{j_1} C y^{n_o}_{j_1+1}$, $y^{n_o}_{j_1+1} C y^{n_o}_{j_1+2}$, …, $y^{n_o}_{j_2-1} C y^{n_o}_{j_2}$, $y^{n_o}_{j_1} y^{n_o}_{j_1+1}$, $y^{n_o}_{j_1+1} y^{n_o}_{j_1+2}$, …, $y^{n_o}_{j_2-1} y^{n_o}_{j_2}$ (C: consonant). Examples of such composite sequences are $y^{n_o}_{j_1} C y^{n_o}_{j_2}$, $y^{n_o}_{j_1} y^{n_o}_{j_2}$, $y^{n_o}_{j_1} C_1 y^{n_o}_{j_1+1} C_2 y^{n_o}_{j_2}$, and $y^{n_o}_{j_1} C y^{n_o}_{j_1+1} y^{n_o}_{j_2}$. In FIG. 2, for example, the partial feature pattern sequence corresponding to the specific part from the 1st vowel "A" to the 2nd vowel "I" of the input vowel string is matched against the composite syllable standard pattern "A·C·I" (C: consonant).
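The bookkeeping performed by the vowel correspondence determination unit 16 and the specific part determination unit 17 can be illustrated with the short Python sketch below, which takes the FIG. 2 path as given. The function names and the dictionary layout are hypothetical, and the special handling of an erroneous word-initial or word-final vowel is left out for brevity.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # 1-based grid point (i, j)


def classify_path(x: str, y: str, path: List[Point]) -> Dict[str, list]:
    """Label every vowel of X and Y from the transitions of the optimal path."""
    result: Dict[str, list] = {
        "matched": [], "misidentified": [],
        "inserted": [], "dropped": [], "simultaneous": [],
    }
    prev = None
    for i, j in path:
        if prev is not None:
            skipped_x = list(range(prev[0] + 1, i))  # input vowels jumped over
            skipped_y = list(range(prev[1] + 1, j))  # standard vowels jumped over
            if skipped_x and skipped_y:
                # e.g. the direct transition (5, 3) -> (7, 5) in FIG. 2
                result["simultaneous"].append((skipped_x, skipped_y))
            else:
                result["inserted"].extend(skipped_x)
                result["dropped"].extend(skipped_y)
        # a grid point on the path pairs x_i with y_j; unequal vowels mean misidentification
        key = "matched" if x[i - 1] == y[j - 1] else "misidentified"
        result[key].append((i, j))
        prev = (i, j)
    return result


def specific_parts(matched: List[Point]) -> List[Tuple[int, int]]:
    """Sections of X between adjacent correctly matched vowels (1-based indices)."""
    idx = [i for i, _ in matched]
    return list(zip(idx, idx[1:]))


fig2_path = [(1, 1), (2, 2), (5, 3), (7, 5), (8, 7)]
labels = classify_path("AIUIUIAU", "AIUAAAU", fig2_path)
parts = specific_parts(labels["matched"])
```

For the FIG. 2 path this reports input vowels 3 and 4 as inserted, standard vowel 6 as dropped, and the pair (input vowel 6, standard vowel 4) as a simultaneous insertion and dropout, and it yields the four specific parts (1, 2), (2, 5), (5, 7), and (7, 8) listed in the text.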
This matching can be performed by the well-known DP matching. Let the composite syllable standard pattern sequence corresponding to the composite syllable standard pattern "A·C·I" be $R = (r_1, r_2, \ldots, r_\lambda, \ldots, r_\Lambda)$, let the partial feature pattern sequence be $A^1 = (a^1_1, a^1_2, \ldots, a^1_\tau, \ldots, a^1_T)$, and let $\delta(\tau, \lambda)$ be the distance between the $\tau$-th feature pattern $a^1_\tau$ of $A^1$ and the $\lambda$-th standard feature pattern $r_\lambda$ of $R$. Solving the recurrence for the cumulative distance, written $\gamma(\tau, \lambda)$ here,

$$\gamma(\tau,\lambda) = \min \begin{cases} \gamma(\tau-1,\,\lambda-2) + \delta(\tau,\,\lambda-1) + \delta(\tau,\lambda) \\ \gamma(\tau-1,\,\lambda-1) + \delta(\tau,\lambda) \\ \gamma(\tau-2,\,\lambda-1) + \delta(\tau,\lambda) \end{cases} \tag{6}$$

with $\gamma(1,1) = 2\,\delta(1,1)$, the distance $\Delta(A^1, R)$ between $A^1$ and $R$ is

$$\Delta(A^1, R) = \gamma(T, \Lambda). \tag{7}$$

For $a^1_\tau = (a^1_{\tau 1}, a^1_{\tau 2}, \ldots, a^1_{\tau 20})$ and $r_\lambda = (r_{\lambda 1}, r_{\lambda 2}, \ldots, r_{\lambda 20})$, the distance $\delta(\tau, \lambda)$ is generally given by

$$\delta(\tau,\lambda) = \sum_{p=1}^{20} \left| a^1_{\tau p} - r_{\lambda p} \right|. \tag{8}$$

Various forms have also been proposed for this recurrence, and only one example is shown here. In this way, distances are obtained for the composite syllable standard pattern sequences $R$ having the various consonants C that can lie between the preceding vowel "A" and the following vowel "I" of the partial feature pattern sequence (the case with no consonant is also included under "consonant C" here), and each distance is output together with the numbers of the standard syllables that make up the corresponding composite sequence. In the same way, distances to the composite syllable standard pattern sequences are obtained for the specific part between the 2nd vowel "I" and the 5th vowel "U" of the input vowel string, the specific part between the 5th vowel "U" and the 7th vowel "A", and the specific part between the 7th vowel "A" and the 8th vowel "U", and are output together with the corresponding standard syllable numbers.

Reference numeral 20 denotes a syllable identification unit, which, for each specific part, finds the minimum of the distances output from the syllable pattern comparison unit 19 and identifies the standard syllable numbers making up the composite syllable standard pattern sequence that gives this minimum (the identified syllable numbers). Reference numeral 21 denotes a syllable string storage unit, which stores the identified syllable numbers obtained by the syllable identification unit 20. Reference numeral 22 denotes a word dictionary unit, which stores the sequence of syllable numbers making up each word to be recognized. For the word "オオサカ" (Osaka), for example, it stores the sequence of numbers corresponding to the three syllables "OO", "OSA", and "AKA".

Reference numeral 23 denotes an inter-word distance calculation unit, which matches the identified syllable number string stored in the syllable string storage unit 21 against the word syllable number strings stored in the word dictionary unit 22. This may be done, for example, as follows. A position at which the identified syllable number string and the word syllable number string contain the same syllable number is scored "1", and a position at which they contain different syllable numbers is scored "0"; the scores are summed over the identified syllable number string and normalized by the number of syllables making up the word to give the inter-word distance. The inter-word distance calculation unit 23 outputs the inter-word distance and the corresponding word number. Reference numeral 24 denotes a word determination unit, which finds the minimum of the inter-word distances and outputs the word number giving this minimum as the determination result. Reference numeral 25 denotes an output terminal, from which the determination result is output.
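The syllable-level matching of equations (6) to (8) and the minimum-distance choice made by the syllable identification unit 20 can be sketched in Python as follows. The sketch transcribes recurrence (6) as it appears in the text, with the initial value 2δ(1, 1); the patent notes that other forms of the recurrence are possible. The labels `"AKI"` and `"ASI"` and the one-channel toy patterns are hypothetical (the patent refers to syllables by number), and the sequences are assumed non-empty.

```python
from typing import Dict, Sequence

Frame = Sequence[float]
INF = float("inf")


def city_block(a: Frame, r: Frame) -> float:
    """Equation (8): sum of absolute channel differences."""
    return sum(abs(p - q) for p, q in zip(a, r))


def dp_distance(seg: Sequence[Frame], ref: Sequence[Frame]) -> float:
    """Distance (7) between a partial sequence and one composite pattern via (6)."""
    T, L = len(seg), len(ref)
    g = [[INF] * (L + 1) for _ in range(T + 1)]  # 1-based grid, as in the text
    g[1][1] = 2 * city_block(seg[0], ref[0])
    for t in range(1, T + 1):
        for lam in range(1, L + 1):
            if (t, lam) == (1, 1):
                continue
            d = city_block(seg[t - 1], ref[lam - 1])
            options = [INF]  # stays INF when no predecessor is reachable
            if t >= 2 and lam >= 3:  # from (t-1, lam-2), passing (t, lam-1)
                options.append(g[t - 1][lam - 2] + city_block(seg[t - 1], ref[lam - 2]) + d)
            if t >= 2 and lam >= 2:  # from (t-1, lam-1)
                options.append(g[t - 1][lam - 1] + d)
            if t >= 3 and lam >= 2:  # from (t-2, lam-1)
                options.append(g[t - 2][lam - 1] + d)
            g[t][lam] = min(options)
    return g[T][L]


def identify(seg: Sequence[Frame], candidates: Dict[str, Sequence[Frame]]) -> str:
    """Return the label of the composite pattern with the smallest DP distance."""
    return min(candidates, key=lambda name: dp_distance(seg, candidates[name]))


segment = [[0.0], [1.0], [2.0]]
candidates = {"AKI": [[0.0], [1.0], [2.0]], "ASI": [[0.0], [5.0], [2.0]]}
best = identify(segment, candidates)
```

In this toy case the segment is identical to the `"AKI"` pattern, so its distance is 0, while the distance to `"ASI"` is 4; `identify` therefore returns `"AKI"`.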

In this embodiment, recognition is performed in units of syllables such as V, VV, CV, and VCV, but the present invention is also applicable to other speech units such as demisyllables and diphones.

(Effects of the Invention) According to the present invention, even when monosyllables are uttered in succession, stationary points are extracted and regarded as vowels, and vowel identification is performed on them. The identified vowel sequence is matched against the vowel strings constituting the words to be recognized, so that misidentifications, insertions, and omissions in the vowel string are corrected, and the input partial patterns are then matched against syllable standard patterns such as V, CV, VCV, and VV. This appropriately limits the words and syllable standard patterns that have to be compared, resulting in significant improvements in recognition rate and matching speed.
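As a rough illustration of how matching an identified vowel string against standard vowel strings can absorb misidentifications, insertions, and omissions, the sketch below uses an ordinary edit-distance alignment. This is a swapped-in, generic technique and not necessarily the matching rule of the embodiment: the uniform substitution and gap costs, the function names, and the sample vowel strings are assumptions (the embodiment keeps inter-vowel distances in a dedicated storage unit, which a real cost function would presumably consult).

```python
from typing import List, Optional, Sequence, Tuple

# One aligned pair: (index in input string, index in standard string);
# None marks an inserted or dropped vowel.
Pair = Tuple[Optional[int], Optional[int]]


def align_vowels(x: Sequence[str], y: Sequence[str],
                 sub_cost: float = 1.0, gap_cost: float = 1.0) -> Tuple[float, List[Pair]]:
    """Edit-distance alignment of input vowel string x with standard string y."""
    rows, cols = len(x) + 1, len(y) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + gap_cost
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            diag = d[i - 1][j - 1] + (0.0 if x[i - 1] == y[j - 1] else sub_cost)
            d[i][j] = min(diag, d[i - 1][j] + gap_cost, d[i][j - 1] + gap_cost)

    # Trace back from the end to recover which vowels correspond to each other.
    pairs: List[Pair] = []
    i, j = len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (
                0.0 if x[i - 1] == y[j - 1] else sub_cost):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or d[i][j] == d[i - 1][j] + gap_cost):
            pairs.append((i - 1, None))  # extra vowel in the input
            i -= 1
        else:
            pairs.append((None, j - 1))  # vowel missing from the input
            j -= 1
    pairs.reverse()
    return d[len(x)][len(y)], pairs


def closest_standard_string(x: Sequence[str], standards: Sequence[Sequence[str]]) -> int:
    """Index of the standard vowel string with the smallest alignment cost."""
    return min(range(len(standards)), key=lambda n: align_vowels(x, standards[n])[0])


# Hypothetical example: one vowel was dropped from the input.
x = ["O", "A", "A"]
standards = [["O", "O", "A", "A"], ["O", "O", "O"]]
best = closest_standard_string(x, standards)
cost, pairs = align_vowels(x, standards[best])
```

The recovered pairs play the role of the vowel correspondence: they indicate which stretch of the input lies between two matched vowels, and therefore which syllable standard patterns are worth comparing against that stretch.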

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention, and FIG. 2 is a diagram for explaining the operation of the vowel string comparison unit.
1 ... speech signal input terminal, 2 ... feature extraction unit, 3 ... power calculation unit, 4 ... speech section detection unit, 5 ... vowel standard pattern storage unit, 6 ... buffer memory, 7 ... stationary point detection unit, 8 ... stationary point (frame) storage unit, 9 ... vowel pattern comparison unit, 10 ... vowel identification unit, 11 ... vowel/geminate consonant (sokuon) determination result storage unit, 12 ... standard vowel string storage unit, 13 ... inter-vowel distance storage unit, 14 ... vowel string comparison unit, 15 ... vowel string identification unit, 16 ... vowel correspondence determination unit, 17 ... specific part determination unit, 18 ... syllable standard pattern storage unit, 19 ... syllable pattern comparison unit, 20 ... syllable identification unit, 21 ... syllable string storage unit, 22 ... word dictionary unit, 23 ... inter-word distance calculation unit, 24 ... word determination unit, 25 ... output terminal.

Claims

[Claims]

1. A speech recognition device comprising:
means for converting an input speech signal into a sequence of feature patterns $A = (a_1, a_2, \ldots, a_t, \ldots, a_L)$;
a stationary point extracting means that extracts the stationary points of the input signal;
a vowel identifying means that identifies vowels at the stationary points to obtain an input vowel string $X = (x_1, x_2, \ldots, x_i, \ldots, x_I)$ (where $I$ is the number of stationary points of the input speech);
a vowel string identifying means that matches the input vowel string $X$ obtained by the vowel identifying means against the vowel strings of the words or phrases to be recognized (hereinafter referred to as standard vowel strings) $Y^n = (y^n_1, y^n_2, \ldots, y^n_j, \ldots, y^n_{J^n})$ (where $n$ $(= 1, 2, \ldots, N)$ is the class of the standard vowel string and $J^n$ is the number of vowels in the standard vowel string $Y^n$), and identifies the standard vowel string $Y^{n_o}$ (identification result vowel string) closest to the input vowel string $X$;
a correspondence determining means that determines the correspondence between the vowels in the input vowel string $X$ and those in the identification result vowel string $Y^{n_o}$ based on the result of matching the input vowel string $X$ with the identification result vowel string $Y^{n_o}$;
means for matching the partial sequence of the feature patterns of the input signal (specific partial feature pattern sequence) $(a_{t_1}, \ldots, a_{t_2})$ $(t_1 < t_2)$ that corresponds to a partial interval $(x_{i_1}, \ldots, x_{i_2})$ $(i_1 < i_2)$ of the input vowel string $X$ determined based on the correspondence, against standard patterns corresponding to each of the syllables $y^{n_o}_{j_1} C y^{n_o}_{j_1+1}$, $y^{n_o}_{j_1+1} C y^{n_o}_{j_1+2}$, $\ldots$, $y^{n_o}_{j_2-1} C y^{n_o}_{j_2}$, $y^{n_o}_{j_1} y^{n_o}_{j_1+1}$, $y^{n_o}_{j_1+1} y^{n_o}_{j_1+2}$, $\ldots$, $y^{n_o}_{j_2-1} y^{n_o}_{j_2}$ ($C$: consonant), etc., which are defined by the vowel sequence of the specific portion $(y^{n_o}_{j_1}, \ldots, y^{n_o}_{j_2})$ $(j_1 < j_2)$ of the identification result vowel string $Y^{n_o}$ corresponding to the partial interval, thereby obtaining a syllable string corresponding to the specific partial feature pattern sequence; and
a determining means that determines a word or a phrase based on the identified syllable string thus obtained and outputs it as the recognition result.