JPS6131878B2


Info

Publication number: JPS6131878B2
Application number: JP53009108A
Authority: JP
Inventors: Masanori Koda; Hiromi Nagashima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1978-01-30
Filing date: 1978-01-30
Publication date: 1986-07-23
Also published as: JPS54102806A

Description

[Detailed description of the invention]

The present invention relates to a word speech recognition method for recognizing spoken words from a limited vocabulary.

One known method of word speech recognition stores the speech patterns of all words to be recognized as standard patterns, calculates the similarity between the input speech and each standard pattern, and takes the word whose standard pattern gives the maximum similarity as the recognition result. However, because this method must store each entire word as a standard pattern, the storage capacity required for the standard patterns grows in proportion to the vocabulary size, which makes the method unsuitable for large vocabularies.

For this reason, a method has been considered in which only the spectral patterns of the phonemes that make up the words are stored as standard patterns, and these are used together with a word dictionary that specifies the phoneme symbol string of each word (hereinafter called a phoneme sequence). This method consists of a phoneme recognition unit, which divides the speech into relatively short time intervals and calculates the similarity between the spectrum of each segment and the spectra of the phoneme standard patterns, and a word recognition unit, which uses those results to match the input against the phoneme sequences in the word dictionary, calculates the similarity between the entire input speech and each phoneme sequence, and takes the word whose phoneme sequence gives the maximum similarity as the recognition result.

As is well known, the lengths of the phonemes in a word change every time the word is uttered, even for the same word, and the degree of change differs from phoneme to phoneme. Differences in phoneme length between speakers are also large.
Therefore, when calculating the similarity between the input speech and a phoneme sequence, the matching must allow a certain amount of variation in the duration of each phoneme. A method that takes this into account was proposed in Koda et al., "Machine recognition system for digit speech," Transactions of the Institute of Electronics and Communication Engineers, Vol. 55-D, No. 3 (March 1972). In addition, Japanese Patent Application No. 51-81504 proposes a method that places a minimum duration and a maximum duration limit on each phoneme in the dictionary and uses dynamic programming to reduce the amount of computation required for matching. However, in these methods matching against the phoneme sequences can be performed only after the end of the speech is known, so they have the drawback of being unsuitable for so-called real-time processing, in which matching is carried out in parallel with the speech input.

In this invention, an identification symbol of "1" or "0" is added to each phoneme symbol of the phoneme sequences in the word dictionary to distinguish phoneme sections that must be associated with speech from phoneme sections that may be skipped without necessarily being associated with speech. This allows matching that permits a certain amount of variation in the duration of each phoneme of a phoneme sequence to be carried out in parallel with the speech input. The object is to realize a real-time word speech recognition method that uses phonemes as the recognition units.

FIG. 1 is a block diagram showing an embodiment of the word speech recognition method according to this invention. The speech input at input terminal 1 is divided into relatively short time intervals (called frames) of, for example, about 15 msec by the feature extraction unit 2, which extracts the feature parameters of the speech.
For each of these segments, feature parameters representing the power and the spectrum are calculated. Various feature parameters could be used to represent the spectrum; here, as an example, the autocorrelation function of the speech is used.

The phoneme standard pattern storage unit 3 stores the spectra of the phonemes needed for recognition as standard patterns, and the similarity between these phoneme standard patterns and the spectrum of each segment of the input speech is calculated in the phoneme recognition unit 4.
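The framing and autocorrelation step described above can be sketched as follows. This is only an illustration under stated assumptions: non-overlapping frames, no window function, an analysis order p = 10, and the function name `frame_features` are all hypothetical choices; the patent specifies only a frame length of about 15 msec and the use of the autocorrelation function.

```python
def frame_features(signal, sample_rate, frame_ms=15.0, p=10):
    """Split `signal` into non-overlapping frames and return, for each frame,
    a tuple (power, [v_0, ..., v_p]) of short-time autocorrelation values.

    Assumptions (not from the patent): no windowing, no frame overlap,
    biased autocorrelation estimate, incomplete final frame discarded.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if p >= frame_len:
        raise ValueError("p must be smaller than the frame length")
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        x = [float(s) for s in signal[start:start + frame_len]]
        # v_k: autocorrelation of the frame at lag k, for k = 0..p
        v = [sum(x[t] * x[t + k] for t in range(frame_len - k)) / frame_len
             for k in range(p + 1)]
        power = v[0]  # the lag-0 autocorrelation is the mean power of the frame
        features.append((power, v))
    return features
```

With a hypothetical 8 kHz sampling rate, a 15 msec frame is 120 samples, so a 400-sample signal yields three complete frames.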
Various measures can be used to express the similarity between spectra, such as the Euclidean distance between feature parameters or the squared difference of the power spectra integrated over the entire band. Here, as an example, the case is described in which the log likelihood based on the maximum likelihood spectrum estimation method (hereinafter simply called the likelihood) is used as the similarity. According to the maximum likelihood spectrum estimation method, let the autocorrelation function of the n-th segment of the input speech be

$$\{v_0^{(n)}, v_1^{(n)}, v_2^{(n)}, \ldots, v_p^{(n)}\} \quad (p \text{ is the maximum delay}) \tag{1}$$

and let the maximum likelihood spectral parameters of the standard pattern of phoneme /x/ be

$$\{A_0^{(x)}, A_1^{(x)}, A_2^{(x)}, \ldots, A_p^{(x)}\} \tag{2}$$

Then the likelihood $l(n, x)$, which indicates how likely it is that the n-th segment of the input speech is phoneme /x/, is calculated by the following equation.

Here, constant terms that depend only on [Formula] have been omitted.
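Equation (3) itself is not reproduced in this text. Purely as an illustration, the sketch below uses one well-known score from the linear prediction literature: the negative logarithm of the prediction residual energy, written in terms of the autocorrelation of the inverse-filter coefficients. Whether this matches the patent's equation (3) is an assumption, and the function names and the coefficient vector `a` are hypothetical.

```python
import math


def ml_spectral_parameters(a):
    """Autocorrelation A_0..A_p of inverse-filter coefficients a = [1, a_1, ..., a_p].

    These can be computed once per phoneme, ahead of time.
    """
    p = len(a) - 1
    return [sum(a[j] * a[j + i] for j in range(p + 1 - i)) for i in range(p + 1)]


def log_likelihood(v, A):
    """Assumed score: -log(A_0 v_0 + 2 * sum_{i>=1} A_i v_i).

    v: autocorrelation v_0..v_p of one input frame.
    A: precomputed parameters A_0..A_p of one phoneme.
    """
    residual = A[0] * v[0] + 2.0 * sum(A[i] * v[i] for i in range(1, len(A)))
    return -math.log(residual)


def residual_direct(v, a):
    """The same residual energy as the quadratic form a^T R a, R[i][j] = v[|i - j|]."""
    p = len(a) - 1
    return sum(a[i] * a[j] * v[abs(i - j)]
               for i in range(p + 1) for j in range(p + 1))
```

The point of the sketch is the cost per frame: with the parameters A stored in advance, each score needs only p + 1 multiplications and one logarithm, rather than a full quadratic form.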

The maximum likelihood spectral parameters [Formula] that represent the standard pattern of a phoneme can easily be calculated in advance from the autocorrelation function of speech corresponding to that phoneme. It is therefore more convenient for calculating the likelihood of equation (3) to store the standard patterns in the phoneme standard pattern storage unit 3 in the form of maximum likelihood spectral parameters rather than as autocorrelation functions. For each speech segment, the likelihood of equation (3) is calculated for all phonemes and stored in the likelihood storage unit 5. Using these likelihoods, the matching unit 7 matches the sequence of input speech segments against the phoneme sequences supplied from the word dictionary storage unit 6, which stores the phoneme sequences of the words to be recognized. The principle and operation are explained below.

Let the length of the input speech be N segments, let one of the phoneme sequences to be matched against the speech be

$$X: x_1 x_2 x_3 \cdots x_j \cdots x_J \tag{4}$$

and suppose that each phoneme $x_j$ is restricted by a minimum duration $d_j$ and a maximum duration $D_j$. Finding the correspondence by matching the input speech against this phoneme sequence is equivalent to determining the (J-1) phoneme change points in the input speech.

The change points [Formula] are determined so as to satisfy the following conditions. Under these conditions, the similarity L(X) between the input speech and the phoneme sequence is defined as follows. That is, for each segment of the input speech, the likelihood with respect to the standard pattern of the phoneme associated with that segment is obtained, and these likelihoods are added up over all segments. There are as many ways of adding up the likelihoods as there are combinations of the phoneme change points.
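The display equations for the conditions and for L(X) are not reproduced in this text. As a reading aid only, one formulation consistent with the surrounding description (duration limits $d_j$ and $D_j$ on each phoneme, and a sum of segment likelihoods that is maximized over the change points) would be the following. The symbols $n_j$ for the change points, with $n_0 = 0$ and $n_J = N$, are assumed notation and are not taken from the patent.

```latex
% Assumed form of the duration conditions on the change points n_1, ..., n_{J-1}
d_j \le n_j - n_{j-1} \le D_j \quad (j = 1, \ldots, J), \qquad n_0 = 0,\; n_J = N

% Assumed form of the similarity: the best total likelihood over all admissible change points
L(X) = \max_{n_1, \ldots, n_{J-1}} \; \sum_{j=1}^{J} \; \sum_{n = n_{j-1}+1}^{n_j} l(n, x_j)
```

Under this reading, a phoneme with $d_j = 0$ may receive no segments at all, in which case its inner sum is empty.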

Of the sums obtained for all combinations of the change points [Formula], the largest is taken as the similarity.

In this invention, the following procedure is used to calculate this similarity efficiently. From the phoneme sequence X, a new phoneme sequence is created by writing out, for each phoneme, the same phoneme symbol once per frame for that phoneme's maximum duration.

The resulting phoneme sequence, of length [Formula], is called the phoneme sequence X′. Within the run of symbols for each phoneme in X′, the first symbols, as many as that phoneme's minimum duration, are given the identification symbol "1", and the remaining symbols are given the identification symbol "0".
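The construction of X′ described above can be sketched as follows. The function name, the tuple layout `(symbol, d_min, d_max)`, and the durations used in the usage example are hypothetical; the patent gives no concrete dictionary format.

```python
def expand_phoneme_sequence(phonemes):
    """Build X' from X.

    phonemes: list of (symbol, d_min, d_max), one entry per phoneme of X,
              with durations counted in frames.
    Returns a list of (symbol, flag) pairs: each symbol is repeated d_max times,
    the first d_min copies flagged 1 (must be matched to speech) and the
    remaining copies flagged 0 (may be skipped).
    """
    expanded = []
    for symbol, d_min, d_max in phonemes:
        if not 0 <= d_min <= d_max:
            raise ValueError("need 0 <= d_min <= d_max for %r" % (symbol,))
        expanded += [(symbol, 1)] * d_min + [(symbol, 0)] * (d_max - d_min)
    return expanded
```

For a phoneme with minimum duration 2 and maximum duration 5, `expand_phoneme_sequence([("x", 2, 5)])` gives two symbols flagged 1 followed by three flagged 0.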
If the minimum duration $d_1$ of the first phoneme of the phoneme sequence X is 1 or more, there is only one way to associate the beginning of the input speech with the phoneme sequence X′. In the special case where $d_1$ is 0, however, there are $D_1 + 1$ possible ways of associating the beginning of the input speech with X′. If the minimum duration $d_2$ of the second phoneme is also 0, the number increases further. For this reason, before matching against the input speech, the identification symbols of the phoneme symbols in X′ must be examined from the beginning. If, as a result, the $m_0$-th phoneme symbol is the first whose identification symbol is "1", then there are $m_0$ possible ways of associating X′ with the beginning of the input speech.

Accordingly, let $S(n, m)$ denote the similarity between the portion of the input speech up to the n-th segment and the partial sequence of X′ up to the m-th symbol. Under the initial conditions

$$S(0, m) = 0 \quad (m = 0, 1, 2, \ldots, m_0 - 1) \tag{7a}$$

$$S(0, m) = C \quad (m = m_0, m_0 + 1, \ldots, M) \tag{7b}$$

$$S(n, 0) = C \quad (n = 1, 2, \ldots, N) \tag{7c}$$

(where C is a negative constant of very large magnitude), the following recurrence is calculated over the range $1 \le n \le N$, $1 \le m \le M$.

(i) When the m-th phoneme of the phoneme sequence X′ is /$x_i$/ and its identification symbol is "1":

$$S(n, m) = S(n-1, m-1) + l(n, x_i) \tag{8a}$$

(ii) When the m-th phoneme of the phoneme sequence X′ is /$x_i$/ and its identification symbol is "0":

$$S(n, m) = \max\{S(n-1, m-1) + l(n, x_i),\; S(n, m-1)\} \tag{8b}$$

The similarity L(X) of equation (6) is then obtained as $S(N, M)$.

An example of the above procedure is now described. Suppose that one phoneme /$x_i$/ of the phoneme sequence X is restricted to a minimum duration of 2 and a maximum duration of 5. In the phoneme sequence X′ it is then represented by the following sequence, in which the identification symbols are shown in parentheses.

$$X': \cdots\; x_{i-1}(0)\; x_i(1)\; x_i(1)\; x_i(0)\; x_i(0)\; x_i(0)\; x_{i+1}(1)\; \cdots$$

FIG. 2 illustrates part of the computation for this phoneme sequence X′. The horizontal axis corresponds to the segment number of the speech and the vertical axis to the position (segment number) in the phoneme sequence X′; positions $m-4$ and $m-3$ correspond to $x_i(1)$, and positions $m-2$, $m-1$ and $m$ correspond to $x_i(0)$. Suppose that for each point on column $n-1$ of FIG. 2 the similarity obtained by matching the speech up to segment $n-1$ has already been calculated. When the likelihood $l(n, x_i)$ of the n-th speech segment with respect to phoneme /$x_i$/ is then given, the similarities up to the n-th segment are calculated as follows.

First, point 200 holds the similarity $S(n-1, m-5)$ obtained when matching has proceeded up to the phoneme /$x_{i-1}$/ that precedes $x_i$, so $l(n, x_i)$ is added to it, and the similarity $S(n, m-4)$ for the case where phoneme /$x_i$/ lasts one segment is calculated at point 206. Point 201 holds the similarity $S(n-1, m-4)$ for the case where /$x_i$/ lasts one segment, so $l(n, x_i)$ is likewise added, and the similarity $S(n, m-3)$ for the case where /$x_i$/ lasts two segments is calculated at point 207. Point 202 holds the similarity $S(n-1, m-3)$ for the case where /$x_i$/ lasts two segments, so the similarity for three segments is calculated as $S(n-1, m-3) + l(n, x_i)$. Here, however, the identification symbol of /$x_i$/ is "0", so this value is compared with the two-segment similarity $S(n, m-3)$ and the larger of the two is assigned to point 208 as $S(n, m-2)$. In other words, point 208 is assigned the similarity for the case where /$x_i$/ lasts two or three segments. In the same way, point 209 is assigned the similarity for the case where /$x_i$/ lasts 2 to 4 segments, and point 210 the similarity for 2 to 5 segments.

The above is only an example of the calculation for part of the phoneme sequence X′, but it shows that the results up to segment n can easily be calculated from the results up to segment $n-1$. Therefore, by calculating equation (8) from the first segment, at the beginning of the input speech, to the N-th segment at its end, the similarity L(X) between the input speech and the phoneme sequence X is obtained as the final result $S(N, M)$ of the recurrence. FIG. 3 shows an example of matching against the phoneme sequence /nana/ of the word "7".
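The initial conditions (7a) to (7c) and the recurrence (8a)/(8b) can be sketched as below, together with a brute-force search over all admissible durations for comparison. This is a minimal sketch under assumptions: negative infinity stands in for the large negative constant C, likelihoods are supplied as one dictionary per frame, and all function names and the test durations are hypothetical.

```python
from itertools import product

NEG_INF = float("-inf")  # assumption: plays the role of the constant C


def expand(phonemes):
    """X -> X': repeat each symbol d_max times, the first d_min copies flagged 1."""
    return [(s, 1 if k < d else 0) for s, d, D in phonemes for k in range(D)]


def match_similarity(likelihood, expanded):
    """Return S(N, M) for likelihood[n-1][symbol] = l(n, symbol) and X' = expanded."""
    N, M = len(likelihood), len(expanded)
    flags = [flag for _, flag in expanded]
    # m0: 1-based position of the first symbol flagged "1" (M + 1 if there is none)
    m0 = flags.index(1) + 1 if 1 in flags else M + 1
    # Column n = 0: (7a) gives 0 below m0, (7b) gives C from m0 upward
    prev = [0.0 if m < m0 else NEG_INF for m in range(M + 1)]
    for n in range(1, N + 1):
        cur = [NEG_INF] * (M + 1)            # cur[0] is (7c): S(n, 0) = C
        for m in range(1, M + 1):
            symbol, flag = expanded[m - 1]
            step = prev[m - 1] + likelihood[n - 1][symbol]
            # (8a) for flag "1"; (8b) also allows skipping the symbol for flag "0"
            cur[m] = step if flag == 1 else max(step, cur[m - 1])
        prev = cur
    return prev[M]


def brute_force_similarity(likelihood, phonemes):
    """Best total likelihood over all durations k_j in [d_j, D_j] that sum to N."""
    N = len(likelihood)
    best = NEG_INF
    for durations in product(*[range(d, D + 1) for _, d, D in phonemes]):
        if sum(durations) != N:
            continue
        total, n = 0.0, 0
        for (symbol, _, _), k in zip(phonemes, durations):
            total += sum(likelihood[n + t][symbol] for t in range(k))
            n += k
        best = max(best, total)
    return best
```

Only the previous column is needed to compute the current one, which is what allows the computation to proceed one frame at a time as the speech arrives. The brute-force function is there only to check that the recurrence respects the minimum and maximum durations.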

Note that equation (8) need only be calculated over the range [Formula], so for each phoneme sequence in the word dictionary it is sufficient to retain intermediate results of length [Formula].

Retaining these intermediate results makes it possible to match against multiple phoneme sequences in parallel with the speech input.

The part that carries out the computation described above is the matching unit 7 of FIG. 1. The addition $S(n-1, m-1) + l(n, x_i)$ in equations (8a) and (8b) is performed by the adder 71, and the comparison for the max operation of equation (8b) is performed by the comparator 72, whose output is temporarily stored in the register 73 so that the max operation of equation (8b) can be performed efficiently. The intermediate similarities for each phoneme sequence are stored in the similarity calculation buffer 74.

First, prior to matching, the initial values of equations (7a) and (7b) are set in the similarity calculation buffer 74. The operation for the n-th speech segment is then as follows. An address designation signal is sent from the control unit 8 to the dictionary storage unit 6, and the symbol representing phoneme $x_i$ of the phoneme sequence X′ being matched is read out together with its identification symbol. The number representing $x_i$ in turn serves as the address designation signal for the likelihood storage unit 5 and is used to read the likelihood $l(n, x_i)$ for phoneme $x_i$ from the likelihood storage unit. The identification symbol serves as the control signal for the comparator 72. Meanwhile, one of the matching results up to segment $n-1$, namely $S(n-1, m-1)$, is read from the similarity calculation buffer 74. The adder 71 adds this $S(n-1, m-1)$ to the likelihood $l(n, x_i)$ read from the likelihood storage unit 5 and sends the resulting value $S(n-1, m-1) + l(n, x_i)$ to the comparator 72. If the identification symbol read out together with phoneme $x_i$ is "1", the comparator 72 outputs $S(n-1, m-1) + l(n, x_i)$ unchanged; if the identification symbol is "0", it compares this value with the preceding result $S(n, m-1)$ held in the register 73 and outputs the larger of the two. This output is stored in the similarity calculation buffer 74 as the intermediate similarity $S(n, m)$ and at the same time stored in the register 73 for use in the next calculation. Each time the phoneme sequence being matched changes, the large negative constant C must first be placed in the register 73 as the initial value of equation (7c). The above processing is repeated the required number of times for each phoneme sequence.
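The frame-by-frame operation of the matching unit can be imitated in software as below. This is a sketch under assumptions, not a description of the actual circuit: negative infinity stands in for the constant C, one buffer column is kept per dictionary word, and the variable `diag` (which saves $S(n-1, m-1)$ before the buffer entry is overwritten) is an implementation detail introduced here. The class name, method names, and dictionary layout are hypothetical.

```python
NEG_INF = float("-inf")  # assumption: stands in for the large negative constant C


class MatchingUnit:
    """Keeps one column S(n, 0..M) per dictionary word and advances it per frame."""

    def __init__(self, dictionary):
        # dictionary: {word: [(symbol, flag), ...]}, i.e. the sequence X' of each word
        self.dictionary = dictionary
        self.buffers = {}
        for word, expanded in dictionary.items():
            flags = [flag for _, flag in expanded]
            m0 = flags.index(1) + 1 if 1 in flags else len(flags) + 1
            # Initial values (7a) and (7b) for the column n = 0
            self.buffers[word] = [0.0 if m < m0 else NEG_INF
                                  for m in range(len(expanded) + 1)]

    def push_frame(self, frame_likelihood):
        """Advance every word by one frame; frame_likelihood[symbol] = l(n, symbol)."""
        for word, expanded in self.dictionary.items():
            buf = self.buffers[word]
            diag = buf[0]          # S(n-1, m-1), saved before it is overwritten
            register = NEG_INF     # (7c): S(n, 0) = C, loaded for each new sequence
            buf[0] = register
            for m, (symbol, flag) in enumerate(expanded, start=1):
                total = diag + frame_likelihood[symbol]               # role of adder 71
                out = total if flag == 1 else max(total, register)   # role of comparator 72
                diag = buf[m]      # keep S(n-1, m) for the next symbol
                buf[m] = out       # role of similarity calculation buffer 74
                register = out     # role of register 73: now holds S(n, m)

    def similarities(self):
        """Current S(n, M) for every word."""
        return {word: buf[-1] for word, buf in self.buffers.items()}
```

Because each call to `push_frame` needs only the likelihoods of the newest frame, the similarities for all dictionary words are available as soon as the end of the speech is decided.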

This processing is repeated [Formula] times and continued up to the end N of the speech. The similarity $S(N, M)$ obtained for each phoneme sequence is then stored in the similarity storage unit 9, and matching ends.

Next, the operation of the control unit 8 is described. The control unit 8 detects the beginning and end of the speech, controls the matching unit 7, controls the maximum similarity selection unit 10, and so on. First, it receives the power value for each segment from the feature extraction unit 2 and detects the beginning of the speech according to whether the power has exceeded a predetermined threshold. When the beginning is detected, it sets the initial values in the similarity calculation buffer 74 and starts matching. It continues to examine the power of each segment, and when the power falls below the threshold, it takes the preceding segment as a candidate for the end of the speech. The similarities of the matching results up to the candidate end segment are sent to the similarity storage unit 9, and the maximum similarity selection unit 10 is activated. The maximum similarity selection unit 10 selects the largest of the similarities for the respective phoneme sequences and takes the word of that phoneme sequence as the candidate recognition result. If segments with power below the threshold then continue for a certain time or longer, the candidate word is output to terminal 11 as the final recognition result. If the power rises again within that time, the interval is judged to be a silent interval within the word, and detection of the end of the speech continues.

As described above, according to this invention, adding an identification symbol of "1" or "0" to each phoneme of the phoneme sequences prepared as the word dictionary enables matching that limits the durations of the phonemes, so a high recognition rate can be expected and, at the same time, a word speech recognition apparatus capable of real-time processing can be realized. The log likelihood by the maximum likelihood spectrum estimation method, used here to express the similarity between one speech segment and a phoneme standard pattern, is only one example, and this invention is not limited to it.
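The endpoint logic described for the control unit 8 and the selection performed by unit 10 can be sketched as follows. This is an illustration under assumptions: the parameter `hangover` (the number of consecutive low-power segments that confirms the end) expresses the "certain time" in segments, and the function names and example words are hypothetical.

```python
def detect_word(powers, threshold, hangover):
    """Return (start, end) segment indices of the first utterance, or None.

    powers:    per-segment power values
    threshold: power level separating speech from silence
    hangover:  assumed number of consecutive low-power segments confirming the end
    """
    start = None
    end_candidate = None
    low_run = 0
    for n, power in enumerate(powers):
        if start is None:
            if power > threshold:
                start = n                    # beginning of the speech detected
            continue
        if power < threshold:
            if low_run == 0:
                end_candidate = n - 1        # the preceding segment is the end candidate
            low_run += 1
            if low_run >= hangover:
                return start, end_candidate  # silence lasted long enough: end confirmed
        else:
            low_run = 0                      # power rose again: silent interval in the word
            end_candidate = None
    return None                              # no confirmed end within the given segments


def pick_word(similarities):
    """Return the word whose phoneme sequence has the largest similarity."""
    return max(similarities, key=similarities.get)
```

With `powers = [0, 0, 5, 6, 0, 0, 7, 8, 0, 0, 0, 0]`, a threshold of 1 and a hangover of 3, the two low-power segments in the middle are treated as a silent interval within the word, and the utterance is taken to span segments 2 to 7.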

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing an embodiment of the word speech recognition method according to this invention, FIG. 2 is a diagram for explaining the operation of the matching unit, and FIG. 3 is a diagram showing an example of matching.

1: speech signal input terminal, 2: feature extraction unit, 3: phoneme recognition unit, 4: standard pattern storage unit, 5: likelihood storage unit, 6: word dictionary storage unit, 7: matching unit, 71: adder, 72: comparator, 73: register, 74: similarity calculation buffer, 8: control unit, 9: similarity storage unit, 10: maximum similarity selection unit, 11: recognition result output terminal.

Claims

[Claims]

1. A word speech recognition method for a word speech recognition apparatus comprising: a feature extraction unit that divides input speech into relatively short fixed time intervals and calculates, for each segment, an autocorrelation function representing the power and spectrum; a phoneme recognition unit that stores the spectral patterns of a number of phonemes as standard patterns and calculates the similarity between the autocorrelation function of each segment and the standard patterns; a word dictionary memory that stores phoneme symbol sequences as a word dictionary; a matching unit that, referring to the word dictionary memory and to the similarities calculated by the phoneme recognition unit, calculates the similarity between the input speech and each phoneme symbol sequence based on a recurrence formula of fixed form; and a maximum similarity selection unit that selects the maximum of the similarities obtained by the matching unit and outputs the word of the corresponding phoneme symbol sequence as the recognition result, the method being characterized in that: the word dictionary memory stores phoneme symbol sequences in which, for each phoneme constituting a word, the same phoneme symbol is arranged at the fixed time interval for the maximum duration of that phoneme, with the identification symbol "1" attached to as many of these phoneme symbols as the minimum duration and the identification symbol "0" attached to the rest; in the recurrence calculation for the n-th segment of input speech of length N and the m-th phoneme symbol of a phoneme symbol sequence of length M, the matching unit, when the identification symbol attached to the phoneme symbol is "1", takes as the calculation result S(n, m) for the n-th segment the value obtained by adding the similarity l(n, x) between the standard pattern of phoneme /x/ and the n-th segment of the input speech to the calculation result S(n-1, m-1) for the (n-1)-th segment, and, when the identification symbol is "0", takes as the calculation result S(n, m) for the n-th segment the larger of the value obtained by adding the similarity l(n, x) to the calculation result S(n-1, m-1) and the calculation result S(n, m-1); and the similarity between the input speech and the phoneme symbol sequence is calculated as the calculation result S(N, M) for the N-th segment.