JPH0246957B2

Info

Publication number: JPH0246957B2
Application number: JP59017263A
Authority: JP
Inventors: Mitsuhiko Kano; Yasuhiro Matsuda; Tsudoi Tezuka
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 1984-02-03
Filing date: 1984-02-03
Publication date: 1990-10-17
Also published as: JPS60164798A

Description

[Detailed description of the invention]

[Industrial Field of Application]

The present invention relates to a monosyllabic speech recognition method, and in particular to one that improves the recognition rate while retaining a simple configuration.

[Prior Art]

In recent years, speech recognition devices have been entering practical use as input devices for computers and various control devices. A speech recognition device that recognizes human speech as spoken has several advantages: it requires no special training to use and does not constrain the user's line of sight or limbs. However, many of the speech recognition devices currently in practical use are word recognition devices that perform recognition word by word; despite the above advantages, they are limited in the number of words they can handle.

Partly for this reason, monosyllabic speech recognition systems, which recognize speech syllable by syllable, have recently attracted attention. As is well known, Japanese is written with phonetic characters, that is, each syllable corresponds almost one-to-one to a kana character, so a monosyllabic speech recognition system can serve as an input device of various kinds. In particular, with the spread of Japanese word processors and workstations, many attempts have been made to use monosyllabic speech recognition as an input means for such equipment.

In monosyllabic speech recognition systems, the consonant part is usually cut out of the whole syllable in the feature parameter time series of the syllable (hereinafter simply called the pattern), and matching is performed between consonant parts. This is because syllable patterns are less distinctive from one another than word patterns and, moreover, the vowel part generally lasts far longer than the consonant part, so that slight differences in similarity between consonant parts are masked by fluctuations in the pattern of the following vowel part.

The paper by Yoshida et al., "Japanese monosyllabic speech recognition experiment" (Proceedings of the Acoustical Society of Japan, 3-2-16, 1979), shows one example of such consonant extraction. In this example, the consonant part is cut out by an algorithm that takes a predetermined one of the points where the feature vector of the syllable changes greatly as the consonant/vowel boundary.
The extracted consonant part is then matched against the registered standard consonant patterns by endpoint-free dynamic programming matching (hereinafter DP matching) to obtain consonant information.

The paper by Furui, "Monosyllable recognition and its application to large-vocabulary word speech recognition" (Transactions of the Institute of Electronics and Communication Engineers A, 65-A, 2, pp. 175-182, 1982), shows another way of cutting out the consonant part. In this example, the beginning of the syllable is cut out at a predetermined ratio of the total syllable length and compared with similarly extracted registered word-initial parts to obtain consonant information; linear matching, for example, is used.

Further, the paper by Nakagawa et al., "Large-vocabulary word speech recognition by monosyllable-unit input from unspecified speakers" (Transactions of the Institute of Electronics and Communication Engineers D, 65-D, 12, pp. 1558-1565, 1982), discloses yet another extraction method. Because this example targets unspecified speakers, the registered patterns can be fixed, and the consonant/vowel boundary point j is therefore determined in advance by visual inspection for each registered pattern. This point is estimated by taking into account, for example, the waveform of the speech signal and its formants. For an unknown input pattern, DP matching is performed between the whole-syllable registered pattern and the unknown input pattern, and the point i at which the optimal path passes through the boundary point j is taken as the consonant/vowel boundary point of the unknown input pattern. Various methods, including endpoint-free DP matching, have been proposed for obtaining the consonant information. This paper also discloses consonant extraction by an algorithm similar to that of Furui's paper.

To identify vowel information, all of the above papers adopt methods simpler than DP matching. The vowel information and the consonant information are then combined to identify the syllable.

However, the conventional configurations described above require a complicated algorithm to separate the consonant part from the vowel part. Otherwise, the fixed registered patterns must be inspected visually in advance, which demands tedious work and considerable experience. In addition, errors in identifying the consonant/vowel separation point may increase misrecognition of syllables, particularly of their consonant information. Furui's example does not have these problems, because the consonant part is cut out in a fixed manner as a ratio of the total syllable length; but because it ignores the variation of the consonant/vowel boundary point from syllable to syllable, the recognition rate can be expected to differ from syllable to syllable.

As for the final stage of recognizing a syllable, one method obtains consonant information and vowel information separately and determines the syllable from their combination. For example, the syllable "ki" is identified from the vowel information "i" and the consonant information "k". In another known method, vowel information is obtained first, the syllables having that vowel as their following vowel are taken as recognition candidates, and DP matching or the like is performed on these candidate syllables. In the latter method the final matching can fully take into account the coarticulation between consonant and vowel, so good discrimination is obtained. Such two-stage evaluation is described in JP-A-54-145409, JP-A-58-52694, and JP-A-58-59498. Its drawback is that performing the two-stage evaluation is cumbersome. As described later, one mode of the present invention can eliminate this problem.

[Problems to be Solved by the Invention]

The present invention has been made in view of the above circumstances, and its object is to provide a monosyllabic speech recognition method that obtains consonant information simply and reliably without separating the consonant from the vowel.

[Means for Solving the Problems]

To achieve this object, the present invention performs the distance calculation of DP matching between the unknown input pattern and each registered standard pattern, and identifies the consonant information of the unknown input pattern based on the minimum cumulative distance at a predetermined intermediate point between the beginning and the end of the unknown input pattern or of the registered standard pattern.

In one aspect of the invention, the minimum cumulative distance is also obtained at a second intermediate point that lies closer to the end of the word than the intermediate point used for consonant information, that is, a point that takes in more of the vowel information. Vowel information is identified from this distance, and consonant information is then identified for the candidate standard patterns whose following vowel is the identified vowel. In other words, two-stage matching may be performed. In this case the minimum cumulative distance at the consonant intermediate point is obtained as a by-product of the distance calculation performed for vowel discrimination, so a single pass of distance calculation suffices.

In another aspect of the invention, the DP matching calculation for obtaining vowel information may be performed from the end of the word up to a predetermined intermediate point. In this case the vowel of the unknown input pattern can be reliably identified with standard patterns for only the five vowels, so the amount of calculation is greatly reduced compared with referring to all syllables.

[Embodiment]

An embodiment in which the invention is applied to a speech recognition device for a specific speaker is described below with reference to the drawings.

FIG. 1 shows the embodiment as a whole. In FIG. 1, the speaker's voice is supplied to a microphone 1, converted into an audio signal, and supplied to an A/D converter 2. The A/D converter 2 has, for example, a sampling frequency of 20 kHz and a data length of 12 bits. Data from the A/D converter 2 is supplied to a feature parameter extraction unit 3, where a feature parameter time series (pattern) is formed from the data. In this example a logarithmic spectrum is used as the feature parameter, as detailed later.

Because this example targets a specific speaker, training is performed before recognition. The speaker utters a predetermined number of syllables to be identified, for example 68, into the microphone 1; these are processed in turn by the A/D converter 2 and the feature parameter extraction unit 3, and the standard pattern of each syllable is supplied to the recognition unit 4. During training the switching circuit 5 of the recognition unit 4 is set to side a, so that the patterns are registered in the storage unit 6 of the recognition unit 4. After this preparatory stage, when the speaker inputs speech syllable by syllable at a rate of, for example, 100 syllables per minute, each syllable is derived as an unknown input pattern via the feature parameter extraction unit 3 and stored in the storage unit 6 of the recognition unit 4. At this time the switching circuit 5 is set to side b. The unknown input pattern is compared with the 68 standard patterns in turn, and the best match is output through the output circuit 7 to an output device 8 such as a printer or monitor. The second and third candidates may of course be output as well.

The feature parameter extraction unit 3 of FIG. 1 forms the time series of logarithmic spectra as shown in FIGS. 2 to 6. The digital data from the A/D converter 2 is pre-emphasized, and from the pre-emphasized data the energy Ei of each time frame i (10 msec in this example) is obtained, where

Ei = 10 log10 (mean of the squared amplitude).

A normalized energy Ein is then obtained from the maximum energy Emax and the minimum energy Emin, normalizing Ei to a value from 0 to 32, for example:

Ein = 32 (Ei - Emin) / (Emax - Emin).

The boundaries between syllables are determined by setting an appropriate threshold on the time distribution of the normalized energy Ein obtained in this way.

Short-time spectral analysis is also performed on the pre-emphasized digital data. While the data is shifted by one frame of 10 msec at a time, a Winograd fast Fourier transform is executed over a 20 msec (400-point) range using a Hamming window as the time function. The resulting power spectrum is converted to logarithmic form, and the spectrum from 10 Hz to 7900 Hz is divided into 19 frequency bands. Specifically, 100 Hz and 200 Hz form one band, and likewise 300 Hz and 400 Hz, ..., 700 Hz to 7900 Hz each form a band.

The average power spectrum of the unvoiced portions is then subtracted, as background noise, from the power spectrum of the voiced portions (each syllable).

The power spectrum from which the background noise has been subtracted, that is, the feature pattern, is normalized in the time direction and then subjected to a nonlinear transformation of its frequency components. Consider a feature pattern spanning times 1 to n_F and frequency bands 1 to m, as shown in FIG. 3. Here n_F is the number of frames of each syllable and m is the number of bands; m = 19 in this example. In the figure, the logarithmic power spectrum value at each time and band is shown simply as a circle.
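The frame energy, its normalization to the range 0 to 32, and the threshold-based detection of syllable boundaries can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names, the floor used to avoid the logarithm of zero, and the concrete threshold value are hypothetical and do not come from the patent, which says only that an appropriate threshold is set.

```python
import math


def frame_energies(samples, frame_len):
    """Return Ei = 10*log10(mean squared amplitude) for each full frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # Assumption: a small floor avoids log10(0) on all-zero frames.
        energies.append(10.0 * math.log10(max(mean_sq, 1e-12)))
    return energies


def normalize_energies(energies, scale=32.0):
    """Map Ei to Ein = scale * (Ei - Emin) / (Emax - Emin)."""
    e_min, e_max = min(energies), max(energies)
    if e_max == e_min:
        # Degenerate case (constant energy): nothing to normalize.
        return [0.0 for _ in energies]
    return [scale * (e - e_min) / (e_max - e_min) for e in energies]


def speech_segments(norm_energies, threshold):
    """Return (first_frame, last_frame) pairs of runs where Ein >= threshold."""
    segments, start = [], None
    for i, e in enumerate(norm_energies):
        if e >= threshold and start is None:
            start = i
        elif e < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(norm_energies) - 1))
    return segments
```

At the 20 kHz sampling frequency mentioned above, a 10 msec frame corresponds to `frame_len=200` samples.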
To normalize in the time direction, both the standard pattern (word length T_R) and the unknown input pattern (word length T_I) are converted by linear interpolation to a predetermined pattern length Tn, as shown in FIG. 4. The pattern length Tn is determined separately by experiment; it may be chosen, for example, as 1.2 times the average syllable length of the standard patterns.

The nonlinear transformation of the frequency components is performed as shown in FIGS. 5 and 6. As shown in FIG. 5, the time-normalized feature pattern is denoted Vij.
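The conversion of a pattern to the common length Tn by linear interpolation, which yields the time-normalized pattern, can be sketched as below. This is a minimal sketch, not the patented circuit: the function name is hypothetical, a pattern is assumed to be a list of frames (each a list of band values), and aligning the first and last frames of the source with those of the result is an assumption, since the text specifies only linear interpolation.

```python
def time_normalize(pattern, target_len):
    """Resample a list of frames to target_len frames by linear interpolation
    along the time axis (first and last frames are kept as end points)."""
    n = len(pattern)
    if n == 0 or target_len <= 0:
        raise ValueError("pattern and target_len must be non-empty/positive")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(pattern[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        # Fractional position of output frame k on the source time axis.
        pos = k * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(pattern[lo], pattern[hi])])
    return out
```

Both the standard patterns and the unknown input pattern would be passed through the same function with the same `target_len`, so that they share one time axis in the later matching.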
The maximum energy of this pattern in frame i is denoted

Vi_MAX = max over j of Vij.

Vij is then converted to a value from 0 to 255 according to the following formulas, as shown in FIG. 6:

If Vij <= Vi_MAX - Vb, the converted value is 0.
If Vij > Vi_MAX - Vb, the converted value is (255 / Vb) (Vij - Vi_MAX + Vb).

Vb is determined separately by experiment. When Vb was varied over 30, 40, and 50, the optimum value was Vb = 40. The optimum value of Vb is thought to be related to the ratio of the peak speech signal to the noise power. This nonlinear transformation mitigates the adverse effect of noise power.

Next, the recognition unit 4 of FIG. 1 is described with reference also to FIGS. 7 and 8. The recognition unit 4 comprises the storage unit 6, a cumulative distance calculation unit 9, and so on. The storage unit 6 holds the input pattern P_I and the standard patterns P_Ri, and the cumulative distance calculation unit 9 executes the DP matching calculation between the input pattern P_I and each standard pattern P_Ri. The cumulative distance calculation unit 9 does not, however, need to compute the matching over the whole pattern; it suffices to compute at least up to time t_I = t_V on the time axis of the input pattern (FIG. 7). In FIG. 7, the time tc is the intermediate point for obtaining consonant information and the time t_V is the intermediate point for obtaining vowel information, as detailed below.

First, a given unknown input pattern P_I is matched against the 68 standard patterns P_Ri. As shown in FIG. 8, the DP matching calculation between the unknown input pattern P_I and the first standard pattern P_Ri is carried out, and in the course of it the minimum cumulative distance Dc at t_I = tc is obtained. There are various paths from the start point (the word-initial side in FIG. 7) to the lattice points on t_I = tc (those within the range Wc, since they are restricted by the matching window); the smallest of the cumulative distances of these paths is taken. The DP matching calculation is then continued, and the minimum cumulative distance Dv is obtained in the same way at t_I = t_V. The minimum cumulative distances Dc and Dv are stored in the storage unit 6. DP matching is then carried out in the same way between the unknown input pattern and the remaining 67 standard patterns P_Ri (i = 2 to 68) to obtain Dc and Dv for each.

Once the minimum cumulative distances Dc and Dv have been obtained for all standard patterns P_Ri, the vowel is determined from the 68 minimum cumulative distances Dv at t_I = t_V.
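The nonlinear transformation with Vb and the partial DP matching that yields Dc and Dv in one pass can be sketched as follows. This is a sketch under stated assumptions rather than the circuit of FIG. 1: the function names are hypothetical, and the city-block frame distance, the three-way step pattern, the fixed start at the first frame of both patterns, and a matching window expressed as a maximum offset from the diagonal are all assumptions, since the text does not specify them. Both patterns are assumed to have been time-normalized to the same length.

```python
def compress_dynamic_range(pattern, vb=40.0):
    """Per frame: values at most (frame max - vb) become 0; the rest are
    scaled linearly so that the frame maximum maps to 255."""
    out = []
    for frame in pattern:
        v_max = max(frame)
        out.append([0.0 if v <= v_max - vb
                    else 255.0 / vb * (v - v_max + vb)
                    for v in frame])
    return out


def frame_distance(a, b):
    """Assumed local distance: sum of absolute band differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def midpoint_distances(inp, ref, t_c, t_v, window):
    """Run DP matching from the word-initial end up to input frame t_v and
    return (Dc, Dv): the minimum cumulative distance over the reference
    frames inside the window at input frames t_c and t_v (0-based)."""
    if not 0 <= t_c <= t_v < len(inp):
        raise ValueError("need 0 <= t_c <= t_v < len(inp)")
    inf = float("inf")
    prev, d_c, d_v = None, inf, inf
    for i in range(t_v + 1):
        cur = [inf] * len(ref)
        lo, hi = max(0, i - window), min(len(ref) - 1, i + window)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                best = 0.0                      # fixed start point
            else:
                best = inf
                if prev is not None:
                    best = min(best, prev[j])   # step from (i-1, j)
                    if j > 0:
                        best = min(best, prev[j - 1])  # from (i-1, j-1)
                if j > 0:
                    best = min(best, cur[j - 1])       # from (i, j-1)
            cur[j] = frame_distance(inp[i], ref[j]) + best
        if i == t_c:
            d_c = min(cur[lo:hi + 1])
        if i == t_v:
            d_v = min(cur[lo:hi + 1])
        prev = cur
    return d_c, d_v
```

Because the loop stops at `t_v`, the frames after the vowel intermediate point are never visited, and Dc falls out of the same pass that produces Dv.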
To determine the vowel, the syllable whose minimum cumulative distance Dv is smallest is found, and the vowel of that syllable is taken as the detected vowel information. Next, for the syllables whose following vowel is the detected vowel (for example "a", "ka", "sa", ... if the vowel is "a"), the minimum cumulative distances Dc at t_I = tc are selected as candidate syllable data, and the syllable with the smallest Dc among them is output as the syllable detection result. These operations are performed by the vowel detection unit 10, the selection circuit 11, and the syllable detection unit 12 of FIG. 1.

The intermediate points tc and t_V are determined by experiment.
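The two-stage decision made from the stored distances can be sketched as follows. The function name and the dictionary-based data layout are hypothetical; the sketch assumes that Dc and Dv have already been computed for every standard pattern and that the following vowel of each registered syllable is known.

```python
def recognize(distances, vowel_of):
    """distances: {syllable: (Dc, Dv)}; vowel_of: {syllable: vowel}.
    Returns the detected vowel and the candidate syllables ranked by Dc."""
    # Stage 1: the vowel is that of the syllable with the smallest Dv.
    best_by_dv = min(distances, key=lambda s: distances[s][1])
    vowel = vowel_of[best_by_dv]
    # Stage 2: among syllables sharing that vowel, rank by Dc (smallest first).
    candidates = [s for s in distances if vowel_of[s] == vowel]
    ranked = sorted(candidates, key=lambda s: distances[s][0])
    return vowel, ranked


# Toy values for illustration only (not measured data).
distances = {"a": (9.0, 3.0), "ka": (4.0, 2.0), "sa": (6.0, 2.5), "ki": (3.0, 8.0)}
vowel_of = {"a": "a", "ka": "a", "sa": "a", "ki": "i"}
vowel, ranked = recognize(distances, vowel_of)
```

In this toy example "ki" has the smallest Dc overall but is excluded because the vowel stage selected "a"; the first entry of `ranked` is the recognition result and the following entries correspond to the second and third candidates.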
Good results were obtained with, for example, tc of about Tn x 1/3 and t_V of about Tn x 0.7.

Table 1 shows the experimental results of applying this example to 68 syllables. One standard pattern is used per syllable (single-template method). In the table, the second candidate is the syllable with the second-smallest minimum cumulative distance in the syllable detection unit 12 of FIG. 1.

[Table]

As is clear from Table 2, this example yields better recognition results than the conventional methods described above.

[Table]

As described above, this embodiment requires no consonant/vowel segmentation, so monosyllables can be recognized with a very simple configuration. Moreover, consonant information is obtained as a by-product of the DP matching calculation performed to obtain vowel information, which reduces the amount of calculation. The configuration is also very well suited to hardware implementation.

This embodiment can also improve the recognition rate. The reason is thought to be as follows. As noted above, the feature vectors of a syllable contain consonant information at the beginning of the word and vowel information over a wide range from the middle to the end. This is shown schematically in FIG. 9. By visual inspection or experience, the boundary between the two would be placed as shown by the broken line. However, the consonant influences the following vowel, and the transition region in particular is thought to contain elements important for determining the consonant. Detecting the consonant/vowel boundary and cutting out the consonant therefore makes the consonant information less certain. In this example the intermediate point tc for obtaining consonant information is determined by experiment without regard to the consonant/vowel boundary, for example as shown in FIG. 9, so better recognition is possible. In addition, the end points of the paths that can be realized for consonant determination are the group of lattice points indicated by Wc in FIG. 7. The corresponding end point on the standard pattern can be chosen within this range, so fine matching between the input pattern and the standard pattern in the consonant region is performed with greater freedom, and this is thought to improve the recognition rate.

The invention is not limited to the embodiment described above, and various modifications are possible without departing from its spirit. For example, the invention may be applied to 101 monosyllables including contracted sounds (yōon), and multiple templates may be used. Table 3 shows the recognition results for 101 monosyllables with the same configuration as the above embodiment, and Table 4 shows a comparison with the conventional method when multiple templates are used.

[Table]

[Table]

In the embodiment above, DP matching was performed from the beginning of the word even for obtaining vowel information, but DP matching for vowel information may instead be performed from the end of the word. In that case, taking 20 to 30% of the word length as the intermediate point is enough to determine the vowel. Only five standard vowel patterns, "a", "i", "u", "e", and "o", are needed, because recognition errors caused by the influence of consonants are rare and because, unlike the embodiment above, there is no need to obtain 68 or 101 items of consonant information as a by-product of the calculation for vowel information. The amount of calculation can therefore be made very small. The candidate syllables are of course narrowed down from the vowel recognition result. For identifying consonant information, 68 or 101 standard patterns are provided separately; the candidate syllables are compared for consonant information according to the invention, and identification of the input syllable is thereby completed.

In the DP matching calculation for obtaining consonant information, the start point and the intermediate point tc may be made variable. Alternatively, syllable recognition may be performed for each of several sets of start point and intermediate point, and the syllable finally decided by majority. It is of course also possible to select the intermediate point on the time axis of the registered standard pattern and obtain consonant information from the minimum cumulative distance at that point.

[Effects of the Invention]

As described above, according to the invention the distance calculation of DP matching is performed between the unknown input pattern and the registered standard patterns, and the consonant information of the unknown input pattern is obtained based on the minimum cumulative distance at a predetermined intermediate point between the beginning and the end of the unknown input pattern or of the registered standard pattern. There is therefore no need to determine the consonant/vowel boundary point, so the configuration is simplified and errors associated with determining the boundary point are eliminated. Moreover, the matching for obtaining consonant information is given a degree of freedom, so the recognition rate is improved.
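The majority-decision variant mentioned among the modifications in this section, in which recognition is run for several sets of start point and intermediate point and the results are combined, might look like the sketch below. The function name is hypothetical, and resolving ties in favor of the earliest result is an assumption; the text does not say how ties are handled.

```python
from collections import Counter


def majority_syllable(results):
    """results: one recognized syllable per (start point, intermediate point)
    setting. Returns the most frequent syllable; ties go to the earliest."""
    if not results:
        raise ValueError("at least one recognition result is required")
    counts = Counter(results)
    top = max(counts.values())
    for syllable in results:
        if counts[syllable] == top:
            return syllable
```

For instance, the results `["ka", "ta", "ka"]` from three settings would be resolved to `"ka"`.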

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing one embodiment of the invention; FIGS. 2, 3, 4, 5, and 6 are diagrams for explaining the feature parameter extraction unit 3 of the embodiment of FIG. 1; FIGS. 7 and 8 are diagrams for explaining the recognition unit 4 of the embodiment of FIG. 1; and FIG. 9 is a diagram for explaining the effect of the embodiment of FIG. 1.

1: microphone, 2: A/D converter, 3: feature parameter extraction unit, 4: recognition unit, 8: output device.

Claims

[Claims]

1. A monosyllabic speech recognition method in which a speech signal of an unknown input monosyllable is analyzed and the input feature parameter time series extracted from the speech signal is compared with registered feature parameter time series registered in advance to recognize the unknown input monosyllable, characterized in that: a distance calculation according to dynamic programming matching is performed from the word-initial side for the input feature parameter time series and each registered feature parameter time series; minimum cumulative distances are obtained at two different intermediate points, determined in advance uniformly for all syllables, between the beginning and the end of the input feature parameter time series or of the registered feature parameter time series; the vowel of the unknown input syllable is determined based on the minimum cumulative distance at the intermediate point nearer the end of the word; the candidate syllables are narrowed down based on this vowel; and, among the narrowed-down syllables, the syllable that minimizes the minimum cumulative distance at the intermediate point nearer the beginning of the word is taken as the recognition result.