JPH0451037B2

Info

Publication number: JPH0451037B2
Application number: JP60251360A
Authority: JP
Inventors: Katsuyuki Futayada
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1985-11-08
Filing date: 1985-11-08
Publication date: 1992-08-17
Also published as: JPS62111293A

Description

Detailed Description of the Invention

Field of Industrial Application

The present invention relates to a speech recognition method by which a machine recognizes human speech.

Prior Art

Speech recognition technology has been actively developed and commercialized in recent years, but most products are speaker-dependent: they recognize only the people who have registered their voices. A speaker-dependent device requires the user to register the words to be recognized in advance, which is a heavy burden on the user unless the device is used continuously for long periods. For this reason, research on speaker-independent recognition, which needs no voice registration and is easy to use, has recently become very active.

In general terms, speech recognition performs pattern matching between the input speech and the reference speech stored in a dictionary (both are parameterized), and outputs the dictionary entry with the highest similarity as the recognition result. There would be no problem if the input speech and the dictionary speech were physically identical, but in general even the same word is never exactly the same, because the speaker differs or the manner of speaking differs.

Physically, differences between speakers and between ways of speaking appear as differences in spectral features and differences in temporal features. The shapes of the articulatory organs (mouth, tongue, throat, etc.) differ from person to person, so the same word has a different spectral shape for a different speaker. The temporal features likewise differ depending on whether the word is spoken quickly or slowly.

Speaker-independent recognition must therefore normalize the spectrum and its temporal variation before comparing the input with the standard patterns.

As a method effective for speaker-independent speech recognition, the present applicant has already filed a patent application (Japanese Patent Application No. 60-29547) on a method that combines the time-series information of the parameters with a statistical distance measure. That method is described below.

FIG. 10 is a functional block diagram of an implementation of the speech recognition method previously proposed by the present applicant.

In the figure, 1 is an AD converter that converts the input speech into a digital signal; 2 is an acoustic analysis unit that analyzes the speech in each analysis interval (frame) and obtains spectral information; 3 is a feature parameter extraction unit that obtains the feature parameters; 4 is a speech interval detection unit that detects the start frame and the end frame; 5 is a time-axis normalization unit that expands or compresses the word length; 6 is a distance calculation unit that calculates the similarity between the input pattern and the standard patterns; and 7 is a standard pattern storage unit that stores standard patterns created in advance. The operation of this configuration is described below.

The AD converter 1 converts the input speech into a 12-bit digital signal. The sampling frequency is 8 kHz. The acoustic analysis unit 2 performs LPC analysis by the autocorrelation method on each frame (10 msec). The analysis order is 10, and the linear prediction coefficients $\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_{10}$ are obtained. The speech power $W_p$ of each frame is also obtained here. The feature parameter extraction unit 3 uses the linear prediction coefficients to obtain the LPC cepstrum coefficients $C_1$ to $C_p$ ($p$ is the truncation order) and the normalized logarithmic residual power $C_0$. LPC analysis and the extraction of LPC cepstrum coefficients are described in detail in, for example, "Linear Prediction of Speech" by J. D. Markel and A. H. Gray (Japanese translation by 鈴木久喜), so the explanation is omitted here. The feature parameter extraction unit 3 also obtains the logarithmic power $LW_p$ by the following equation.

$$LW_p = 10 \log_{10} W_p \qquad (1)$$

The speech interval detection unit 4 compares $LW_p$ obtained from Equation (1) with a threshold $\theta_s$. If frames with $LW_p > \theta_s$ continue for $l_s$ frames or more, the first of those frames is taken as the start frame $F_s$ of the speech interval. After $F_s$, $LW_p$ is compared with a threshold $\theta_e$, and when frames with $LW_p < \theta_e$ continue for $l_e$ frames or more, the first of those frames is taken as the end frame $F_e$ of the speech interval. The interval from $F_s$ to $F_e$ is thus the speech interval. To simplify the explanation, $F_s$ is regarded as frame 1 and the frames are numbered $(1, 2, \ldots, j, \ldots, J)$, where $J = F_e - F_s + 1$.
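
The threshold-and-duration rule above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the 0-based frame indices, the small floor inside the logarithm, and the fallback to the last frame when no end is found are all assumptions made for the example.

```python
import math

def log_power(frame_power):
    """Equation (1): LW = 10 * log10(W). The floor avoids log(0) (an assumption)."""
    return 10.0 * math.log10(max(frame_power, 1e-12))

def detect_endpoints(lw, theta_s, theta_e, l_s, l_e):
    """Return (F_s, F_e) as 0-based frame indices, or None if no start is found.

    lw is the list of per-frame log powers.
    """
    start = None
    run = 0
    for idx, value in enumerate(lw):
        run = run + 1 if value > theta_s else 0
        if run >= l_s:
            start = idx - l_s + 1  # first frame of the run above theta_s
            break
    if start is None:
        return None
    run = 0
    for idx in range(start + 1, len(lw)):
        run = run + 1 if lw[idx] < theta_e else 0
        if run >= l_e:
            return start, idx - l_e + 1  # first frame of the run below theta_e
    return start, len(lw) - 1  # assumption: no end found, use the last frame
```

A short dip below the threshold (shorter than `l_e` frames) does not end the interval, which is the point of the duration condition.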

The time-axis normalization unit 5 linearly expands or compresses the word length by dividing it into a length of $I$ frames. The $i$-th frame after normalization and the $j$-th frame of the input speech are related by Equation (2).

$$i = \left[ \frac{I-1}{J-1}\, j + \frac{J-I}{J-1} + 0.5 \right] \qquad (2)$$

Here $[\ ]$ denotes the largest integer not exceeding the enclosed value. In this example $I = 16$.
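
As a rough sketch of this linear normalization, the helper below evaluates Equation (2) and then resamples a word to $I$ frames. The function names are hypothetical, and picking one input frame for each output frame by applying the same rule with the roles of $i$ and $j$ swapped is an assumption of this example; the text only states the relation between $i$ and $j$.

```python
import math

def warped_index(j, J, I):
    """Equation (2): output frame i (1-based) for input frame j (1-based)."""
    if J == 1:
        return 1
    return math.floor(((I - 1) * j + (J - I)) / (J - 1) + 0.5)

def normalize_length(frames, I):
    """Resample a list of J frames to exactly I frames.

    Assumption: output frame i takes the input frame given by the same
    linear rule with the roles of i and j swapped.
    """
    J = len(frames)
    return [frames[warped_index(i, I, J) - 1] for i in range(1, I + 1)]
```

Both end frames are preserved: frame 1 maps to frame 1 and frame $J$ maps to frame $I$, whatever the input length.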

Next, the feature parameters after normalization are arranged in time order to form the time-series pattern $\boldsymbol{C}_x$. Let the feature parameters (LPC cepstrum coefficients) of the $i$-th frame be $C^{(x)}_{i,k}$ ($k = 0, 1, 2, \ldots, P$; $D$ values). Then $\boldsymbol{C}_x$ is given by

$$\boldsymbol{C}_x = \left( C^{(x)}_{1,0}, C^{(x)}_{1,1}, C^{(x)}_{1,2}, \ldots, C^{(x)}_{1,P}, \ldots, C^{(x)}_{i,0}, C^{(x)}_{i,1}, \ldots, C^{(x)}_{i,P}, \ldots, C^{(x)}_{I,0}, C^{(x)}_{I,1}, \ldots, C^{(x)}_{I,P} \right) \qquad (3)$$

That is, $\boldsymbol{C}_x$ is a vector of dimension $I \cdot (P+1)$, i.e. $I \cdot D$, where $D$ is the number of parameters per frame.

The distance calculation unit 6 calculates the similarity between the input pattern $\boldsymbol{C}_x$ and the standard pattern of each word stored in the standard pattern storage unit 7 using a statistical distance measure, and outputs the word with the smallest distance as the recognition result. Let $\boldsymbol{C}_k$ (mean vector) be the standard pattern of the $k$-th word stored in the standard pattern storage unit 7, and let $\boldsymbol{W}$ be the covariance matrix common to all target words. The Mahalanobis distance $S_k$ between the input pattern $\boldsymbol{C}_x$ and the $k$-th standard pattern is then calculated by the following equation.

$$S_k = (\boldsymbol{C}_x - \boldsymbol{C}_k)^t \, \boldsymbol{W}^{-1} \, (\boldsymbol{C}_x - \boldsymbol{C}_k) \qquad (4)$$

The superscript $t$ denotes transposition and $-1$ denotes the inverse matrix. Expanding Equation (4) gives

$$S_k = \boldsymbol{C}_x^t \boldsymbol{W}^{-1} \boldsymbol{C}_x - 2\, \boldsymbol{C}_k^t \boldsymbol{W}^{-1} \boldsymbol{C}_x + \boldsymbol{C}_k^t \boldsymbol{W}^{-1} \boldsymbol{C}_k \qquad (5)$$

The first term of Equation (5) does not depend on $k$, so it can be ignored when the distances are compared. Removing the first term and writing $D_k$ in place of $S_k$ gives the following.

$$D_k = b_k - \boldsymbol{A}_k^t \, \boldsymbol{C}_x \qquad (6)$$

where

$$\boldsymbol{A}_k = 2\, \boldsymbol{W}^{-1} \boldsymbol{C}_k \qquad (7)$$

$$b_k = \boldsymbol{C}_k^t \, \boldsymbol{W}^{-1} \, \boldsymbol{C}_k \qquad (8)$$

$D_k$ is calculated for all $k$ ($k = 1, 2, \ldots, K$), and the word that minimizes $D_k$ is taken as the recognition result. Here $K$ is the number of standard patterns stored in the standard pattern storage unit 7. In practice each standard pattern is stored as a pair $\boldsymbol{A}_k$, $b_k$, one pair for each of the $K$ words.
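
The reduction from the Mahalanobis distance of Equation (4) to the linear form of Equation (6) can be checked with a small sketch. To keep the example free of matrix inversion it assumes a diagonal shared covariance, which the text does not require; the function names and the toy numbers are likewise only for illustration.

```python
def make_standard_pattern(mean, var):
    """Equations (7) and (8), assuming a diagonal covariance.

    mean: mean vector of one word; var: diagonal of the shared covariance.
    Returns the pair (A_k, b_k).
    """
    a = [2.0 * m / v for m, v in zip(mean, var)]
    b = sum(m * m / v for m, v in zip(mean, var))
    return a, b

def linear_distance(x, a, b):
    """Equation (6): D_k = b_k - A_k^t x."""
    return b - sum(ai * xi for ai, xi in zip(a, x))

def mahalanobis(x, mean, var):
    """Equation (4) for the same diagonal covariance."""
    return sum((xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))

def recognize(x, patterns):
    """Index of the standard pattern with the smallest D_k."""
    scores = [linear_distance(x, a, b) for a, b in patterns]
    return scores.index(min(scores))
```

For every word, `mahalanobis` and `linear_distance` differ by the same input-only term, so the ranking of the words is unchanged while the per-word cost drops to one inner product and one subtraction.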

The computation required by Equation (6) is $I \cdot (P+1)$ multiply-accumulate operations and one subtraction, so a feature of the method is its very small computational load. In practice $I = 16$ and $P = 4$ are sufficient, so the number of multiply-accumulate operations is 80 per word.
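
The count quoted above follows directly from the pattern dimension. The second line, for a vocabulary of $K$ words, is a straightforward extension and is not a figure given in the text.

```latex
\begin{align*}
I \cdot (P+1) &= 16 \times (4+1) = 80 \quad \text{multiply-accumulates per word},\\
K \cdot I \cdot (P+1) &= 80K \quad \text{multiply-accumulates for a $K$-word vocabulary}.
\end{align*}
```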

Next, the method of creating the standard patterns $\boldsymbol{C}_k$ and $\boldsymbol{W}$ (which are actually converted to $\boldsymbol{A}_k$ and $b_k$) is described.

The standard patterns are created from many data samples of each word. Let $M$ be the number of samples used for each word. Equation (2) is applied to each sample so that every sample has $I$ frames. The mean vector for word $k$ is then obtained.

$$\boldsymbol{C}_k = \left( C^{(k)}_{1,0}, C^{(k)}_{1,1}, C^{(k)}_{1,2}, \ldots, C^{(k)}_{1,P}, \ldots, C^{(k)}_{i,0}, C^{(k)}_{i,1}, \ldots, C^{(k)}_{i,P}, \ldots, C^{(k)}_{I,0}, C^{(k)}_{I,1}, \ldots, C^{(k)}_{I,P} \right) \qquad (9)$$

where

$$C^{(k)}_{i,n} = \frac{1}{M} \sum_{m=1}^{M} C^{(k)}_{i,n,m} \qquad (10)$$

with $i = 1, 2, \ldots, I$ ($I$ frames) and $n = 0, 1, 2, \ldots, P$ ($D$ values). Here $C^{(k)}_{i,n,m}$ is the $n$-th order cepstrum coefficient of the $i$-th frame of the $m$-th sample of word $k$. The covariance matrix $\boldsymbol{W}^{(k)}$ of word $k$ is obtained by the same kind of procedure as the mean vector. The covariance matrix $\boldsymbol{W}$ common to all words is obtained by the following equation.

$$\boldsymbol{W} = \frac{1}{K} \left( \boldsymbol{W}^{(1)} + \boldsymbol{W}^{(2)} + \cdots + \boldsymbol{W}^{(k)} + \cdots + \boldsymbol{W}^{(K)} \right) \qquad (11)$$

$\boldsymbol{C}_k$ and $\boldsymbol{W}$ are converted to $\boldsymbol{A}_k$ and $b_k$ by Equations (7) and (8) and stored in advance in the standard pattern storage unit 7.
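
The training step of Equations (10) and (11) might look like the sketch below. Each sample is assumed to be already normalized to $I$ frames and flattened to one list of $I \cdot D$ values. Dividing the covariance by $M$ rather than $M-1$ is an assumption, since the text does not say which is used, and the function names are hypothetical.

```python
def mean_vector(samples):
    """Equation (10): element-wise mean over the M samples of one word."""
    m = len(samples)
    return [sum(col) / m for col in zip(*samples)]

def covariance(samples):
    """Covariance matrix of one word (divides by M; an assumption)."""
    m = len(samples)
    mu = mean_vector(samples)
    dim = len(mu)
    return [[sum((s[r] - mu[r]) * (s[c] - mu[c]) for s in samples) / m
             for c in range(dim)] for r in range(dim)]

def pooled_covariance(per_word_samples):
    """Equation (11): average of the K per-word covariance matrices."""
    covs = [covariance(s) for s in per_word_samples]
    k = len(covs)
    dim = len(covs[0])
    return [[sum(w[r][c] for w in covs) / k for c in range(dim)]
            for r in range(dim)]
```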

Problems to Be Solved by the Invention

The problem with this method is that it assumes the speech interval has been determined uniquely and reliably before pattern matching is performed. Real speech data contains various kinds of noise, and the utterance is often indistinct at the beginning and end of a word, so in many cases the speech interval cannot be determined accurately, or a non-speech interval is detected by mistake. When the conventional method is applied to a wrongly detected speech interval, the recognition rate naturally falls sharply.

The object of the present invention is to solve the above problem and to provide a speech recognition method with a high recognition rate that automatically extracts and recognizes speech from the input signal without requiring a speech interval detection step.

Means for Solving the Problems

To achieve this object, the present invention takes as the input signal interval a sufficiently long interval containing the speech to be recognized and the noise before and after it. A temporal reference point is placed in this input signal interval, and with the reference point as an end point two intervals are set, one of $N_1$ frames and one of $N_2$ frames ($N_1 < N_2$), which are regarded as the minimum and maximum lengths of the speech interval. For each of the $N_2 - N_1 + 1$ speech interval candidates, the candidate is expanded or compressed to a fixed length and matched against the standard pattern of each word to obtain the similarity or distance for each word. This operation is repeated while the reference point is scanned from the beginning to the end of the entire input signal interval. The similarities or distances for all speech interval candidates at all reference point positions are compared across the words, and the word with the maximum similarity or minimum distance is output as the recognition result.

Operation

The present invention performs pattern matching between the linearly time-normalized input and the standard patterns while shifting one frame at a time over the entire input signal interval, and automatically finds the word and the interval that give the maximum similarity or minimum distance. Speech interval detection is therefore unnecessary, and speech uttered in a noisy environment can be recognized with high probability.

Embodiments

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a functional block diagram of an implementation of the speech recognition method in one embodiment of the present invention.

First, the idea behind the embodiment is explained with reference to FIGS. 2 to 4. Even for the same word, the temporal length of the utterance (speech length) differs with the way it is spoken and with the speaker. In speech recognition by pattern matching, the length of the input speech is normalized to a standard length before the similarity is calculated and the speech is recognized. FIG. 2 shows how the speech length is normalized. Let $N_1$ and $N_2$ be the minimum and maximum lengths of the input speech and $I$ the standard length of the speech (standard pattern length). As FIG. 2 shows, speech of length $N$ ($N_1 \le N \le N_2$) is expanded or compressed and normalized to length $I$. In FIG. 2 the ends of the speech are aligned for the warping. As with Equation (2), a linear warping formula is used.

$$i = \left[ \frac{I-1}{N-1} \cdot n + \frac{N-I}{N-1} + 0.5 \right] \qquad (12)$$

When the similarity between an unknown input and a standard pattern is calculated, the speech length $N$ of the unknown input is warped to the standard pattern length by Equation (12); FIG. 3 illustrates this. With the input length on the horizontal axis, the standard pattern length on the vertical axis, and the ends aligned, the input speech length lies in the range $N_1$ to $N_2$, so the matching route between the input and the standard pattern is a straight line that starts at a point on the input axis within $N_1 \le N \le N_2$ and ends at $P$. All similarity calculations therefore take place inside the triangle.
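
As a quick check of Equation (12), take the standard pattern length $I = 16$ used earlier and an input length $N = 40$. The value of $N$ is chosen only for illustration and does not come from the text.

```latex
\begin{align*}
n = 1:  \quad & i = \left[ \tfrac{15}{39}\cdot 1  + \tfrac{24}{39} + 0.5 \right] = [1.5] = 1 \\
n = 20: \quad & i = \left[ \tfrac{15}{39}\cdot 20 + \tfrac{24}{39} + 0.5 \right] = [8.81] = 8 \\
n = 40: \quad & i = \left[ \tfrac{15}{39}\cdot 40 + \tfrac{24}{39} + 0.5 \right] = [16.5] = 16
\end{align*}
```

The first and last input frames land on the first and last frames of the standard pattern, and the frames in between are spread linearly.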

Suppose there is an unknown input of duration $N_U$ whose content is word $k$. The end point of the unknown input is known but the start point is unknown (so $N_U$ is also unknown). To match this unknown input against the standard pattern $S_k$ of word $k$, $N$ is varied from $N_1$ to $N_2$ one frame at a time; for each $N$ the duration is warped to $I$ with Equation (12) and the similarity between the unknown input parameters and the standard pattern is obtained. Since the standard pattern is $S_k$, the similarity should be greatest at $N = N_U$ if the utterance is accurate. The similarity to $S_k$ should also be greater than the similarity to any other standard pattern $S_{k'}$. In this way the start point of the unknown input is determined (and hence the speech length), and at the same time word $k$ is recognized.

FIG. 3 was explained on the assumption that the end point is known, but the method can be extended to the case where both ends are unknown (that is, where the speech interval is unknown). FIG. 4 illustrates this. In the figure, let $j$ be the coordinate of the end point on the horizontal axis (the time axis of the input). If the position $j$ coincides with the end of the input speech, the situation is the same as in FIG. 3; but since both end points are now assumed unknown, $j$ does not necessarily coincide with the end of the speech. However, if $j$ is scanned over a range $j_1 \le j \le j_2$ wide enough to contain the speech interval, there is always a time $j = j_0$ at which $j$ coincides with the end of the speech. The start point then lies at $j_0 - N_U$, within the range $j_0 - N_2$ to $j_0 - N_1$. Even with this scanning, if the uttered word matches the standard pattern, the similarity for start point $j_0 - N_U$ and end point $j_0$ is greater than for any other combination of $j$ and $N$. It is also greater than the similarity to any other standard pattern. The recognition result is therefore obtained and, at the same time, the start and end points of the speech are determined.

The method shown in FIG. 4 can thus cut out and recognize, from a signal in which noise and speech are mixed, the portion most similar to a standard pattern. The complicated speech interval detection procedures in general use are therefore unnecessary, and the speech interval is output as a result together with the recognized word.

As described below, the similarity is calculated from the time-series pattern of the feature parameters using a statistical distance measure (a distance based on the posterior probability).

Let $D$ be the number of feature parameters per frame; the time-series pattern of $I$ frames is then a vector of dimension $D \times I$. Let $\boldsymbol{x}_i$ be the parameter vector of the $i$-th frame of the unknown input and $\boldsymbol{a}^k_i$ the $i$-th frame component of the standard pattern of word $k$:

$$\boldsymbol{x}_i = (x_{1,i}, x_{2,i}, \ldots, x_{d,i}, \ldots, x_{D,i}) \qquad (13)$$

$$\boldsymbol{a}^k_i = (a^k_{1,i}, a^k_{2,i}, \ldots, a^k_{d,i}, \ldots, a^k_{D,i}) \qquad (14)$$

Writing the time-series patterns as $\boldsymbol{X}$ and $\boldsymbol{A}_k$ respectively,

$$\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_i, \ldots, \boldsymbol{x}_I) \qquad (15)$$

$$\boldsymbol{A}_k = (\boldsymbol{a}^k_1, \boldsymbol{a}^k_2, \ldots, \boldsymbol{a}^k_i, \ldots, \boldsymbol{a}^k_I) \qquad (16)$$

The similarity $L_k$ for word $k$ is

$$L_k = B_k - \boldsymbol{A}_k^t \cdot \boldsymbol{X} \qquad (17)$$

$$\phantom{L_k} = B_k - \sum_{i=1}^{I} (\boldsymbol{a}^k_i)^t \cdot \boldsymbol{x}_i \qquad (18)$$

$$\phantom{L_k} = B_k - \sum_{i=1}^{I} \left( \sum_{d=1}^{D} a^k_{d,i} \cdot x_{d,i} \right) \qquad (19)$$

Here $\boldsymbol{A}_k$ and $B_k$ are the standard pattern of word $k$:

$$\boldsymbol{A}_k = 2\, \boldsymbol{W}_a^{-1} (\boldsymbol{\mu}_k - \boldsymbol{\mu}_e) \qquad (20)$$

$$B_k = \boldsymbol{\mu}_k^t \, \boldsymbol{W}_a^{-1} \, \boldsymbol{\mu}_k - \boldsymbol{\mu}_e^t \, \boldsymbol{W}_a^{-1} \, \boldsymbol{\mu}_e \qquad (21)$$

where $\boldsymbol{\mu}_k$ is the mean vector of word $k$ and $\boldsymbol{\mu}_e$ is the mean vector of the ambient information of all words. $\boldsymbol{W}_a$ is a covariance matrix that can be formed from the covariance matrix $\boldsymbol{W}_k$ of each word and the covariance matrix $\boldsymbol{W}_e$ of the ambient information:

$$\boldsymbol{W}_a = \frac{\sum_{k=1}^{K} \boldsymbol{W}_k + \boldsymbol{W}_e}{K+1} \qquad (22)$$

$K$ is the number of word types.

The ambient mean vector $\boldsymbol{\mu}_e$ and ambient covariance matrix $\boldsymbol{W}_e$ are created from many samples belonging to each word, as follows. As shown in FIG. 5, a number of intervals (each $I$ frames long) are set over the speech and its surroundings, shifted one frame at a time. This is done for many samples of each word, and the mean vector $\boldsymbol{\mu}_e$ and covariance matrix $\boldsymbol{W}_e$ of the parameters in those intervals are computed.
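
The collection of ambient intervals described above can be sketched as a sliding window. The function names are hypothetical, and treating the whole recording passed in as "the speech and its surroundings" is an assumption; the text does not say how far around the word the windows extend.

```python
def ambient_windows(frames, I):
    """All length-I windows of a recording, shifted one frame at a time.

    frames: list of per-frame feature vectors (lists of D floats).
    Each window is flattened to a single list of I*D values.
    """
    return [[v for frame in frames[s:s + I] for v in frame]
            for s in range(len(frames) - I + 1)]

def mean_vector(vectors):
    """Element-wise mean of the collected windows (the ambient mean)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

The ambient covariance would be computed over the same set of windows, pooled over many samples of every word.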

Equation (17) has the same form as Equation (6), so the amount of computation required for the similarity is the same as in the conventional example. Only the formulas for creating the standard patterns differ (Equations (7) and (8) versus Equations (20) and (21)). The feature of the present invention is that the ambient information is incorporated into the standard patterns through the ambient mean vector and the ambient covariance matrix. With this, Equation (17) becomes a distance based on a pseudo posterior probability.
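
One way to see where Equations (20) and (21) come from is to subtract the Mahalanobis distance to the ambient class from the distance to word $k$. The text does not spell out this derivation, so the following is a reading offered under that assumption. Here $\boldsymbol{\mu}_k$ and $\boldsymbol{\mu}_e$ denote the word and ambient mean vectors and $\boldsymbol{W}_a$ the covariance matrix of Equation (22), taken to be symmetric.

```latex
\begin{align*}
&(\boldsymbol{X}-\boldsymbol{\mu}_k)^t \boldsymbol{W}_a^{-1} (\boldsymbol{X}-\boldsymbol{\mu}_k)
 - (\boldsymbol{X}-\boldsymbol{\mu}_e)^t \boldsymbol{W}_a^{-1} (\boldsymbol{X}-\boldsymbol{\mu}_e) \\
&\quad = \boldsymbol{\mu}_k^t \boldsymbol{W}_a^{-1} \boldsymbol{\mu}_k
 - \boldsymbol{\mu}_e^t \boldsymbol{W}_a^{-1} \boldsymbol{\mu}_e
 - 2\,(\boldsymbol{\mu}_k - \boldsymbol{\mu}_e)^t \boldsymbol{W}_a^{-1} \boldsymbol{X} \\
&\quad = B_k - \boldsymbol{A}_k^t \boldsymbol{X} = L_k
\end{align*}
```

The quadratic term in $\boldsymbol{X}$ cancels, which is why the score stays linear in the input while still measuring how much closer the segment is to word $k$ than to the ambient signal.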

In FIG. 1, 10 is an AD converter that converts the input signal into a digital signal, 11 is an acoustic analysis unit that analyzes each speech analysis interval (frame), and 12 is a feature parameter extraction unit that outputs the six low-order LPC cepstrum coefficients ($C_0$ to $C_5$) every frame (10 msec). The output of the feature parameter extraction unit 12 corresponds to the frame vector of Equation (13) (hence $D = 6$). The functions of blocks 10 to 12 are the same as those of blocks 1 to 3 in FIG. 10. Besides LPC cepstrum coefficients, possible feature parameters include autocorrelation coefficients, PARCOR coefficients, and band-pass filter outputs.

The function of each block is described below with reference to the flowchart of FIG. 6. The frame synchronization signal generator 13 generates a synchronization signal every frame. Let $j$ be the frame number, and let the similarity be calculated over an interval $j_1 \le j \le j_2$ wide enough to contain the input speech. The following operations are performed within one frame period.

The standard pattern selection unit 18 selects the items to be recognized (here, words) one at a time (let $K$ be the number of words). For the selected standard pattern, the interval candidate setting unit 15 sets the minimum speech interval length $N_1(k)$ and the maximum speech interval length $N_2(k)$ of that word. Then, for each interval length $N$ ($N_1(k) \le N \le N_2(k)$), the unknown input parameters obtained by the feature parameter extraction unit 12 are lined up over frames $j - N$ to $j$ to form a time series of input parameters. The time-axis normalization unit 14 warps this time series to $I$ frames with Equation (12), giving a parameter sequence corresponding to Equation (15). The similarity calculation unit 16 calculates the similarity $L_k(N)$ with Equation (17) between this parameter sequence and the standard pattern $\boldsymbol{A}_k$, $B_k$ selected by the standard pattern selection unit 18 from the standard pattern storage unit 17. The similarity comparison unit 20 compares $L_k(N)$ with the maximum similarity so far (the minimum distance value $L_{\min}$) held in the temporary memory 19. If $L_k(N) < L_{\min}$, $L_{\min}$ is replaced by $L_k(N)$, the current $k$ is stored as $\hat{k}$, and the temporary memory 19 is updated; if $L_k(N) \ge L_{\min}$, the contents of the temporary memory 19 are not updated.

This series of operations is performed $N_2(k) - N_1(k) + 1$ times for each standard pattern, for all $K$ standard patterns within one frame period, and is further repeated over the period of frames $j_1$ to $j_2$. The recognition result is $\hat{k}$ at the time frame $j_2$ is reached, and the similarity value at that time is $L_{\min}$. If the frame $\hat{j}$ at which the maximum similarity was obtained and the interval length $\hat{N}$ at that time are also stored in the temporary memory 19, the speech interval can be obtained from them as a result.
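
The exhaustive search of this first embodiment can be sketched as three nested loops. This is only an illustration under several assumptions: a single pair `n1`, `n2` is used for every word instead of per-word values, a candidate of length `n` covers frames `j - n + 1` to `j` (0-based), the warping picks one input frame per pattern frame, and all names are hypothetical.

```python
import math

def warp(frames, I):
    """Linearly resample a list of frames to I frames, keeping both end frames."""
    n = len(frames)
    if I == 1:
        return [frames[0]]
    return [frames[math.floor((n - 1) * (i - 1) / (I - 1) + 0.5)]
            for i in range(1, I + 1)]

def spot_word(frames, patterns, n1, n2, I):
    """Scan every end frame j and every candidate length n in [n1, n2].

    frames:   list of per-frame feature vectors (lists of floats)
    patterns: list of (A_k, B_k); A_k is a list of I per-frame weight vectors
    Returns (L_min, k_hat, j_hat, N_hat); j_hat is the 0-based end frame.
    """
    best = (float("inf"), None, None, None)
    for j in range(len(frames)):
        for k, (a, b) in enumerate(patterns):
            for n in range(n1, n2 + 1):
                start = j - n + 1
                if start < 0:
                    break  # longer candidates would start before the signal
                x = warp(frames[start:j + 1], I)
                dot = sum(ad * xd for ai, xi in zip(a, x)
                          for ad, xd in zip(ai, xi))
                score = b - dot  # Equation (17)
                if score < best[0]:
                    best = (score, k, j, n)
    return best
```

The returned `j_hat` and `N_hat` give the detected speech interval, so the endpoints come out as a by-product of recognition rather than as a prerequisite.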

As described above, as long as the interval $j_1$ to $j_2$ is taken wide enough to contain the speech, this embodiment can recognize speech without any speech interval detection step. The first embodiment shown in FIG. 1 is easy to understand, so it is useful for explaining the method, and it can of course be implemented as described. For real-time operation, however, it has the drawback that the amount of computation is too large. The cause is that Equation (17) is calculated in full for every interval set by the interval candidate setting unit 15.
The second embodiment, described next, is a more practical method with reduced computation. The principle is explained first.

認識結果を得るには類似度計算式（18）におい
て、L_kを最小とするｋ＝k^を求めればよい。すな
わち、 minL_k＝min｛B_k−_I 〓ⁱ⁼¹ （〓^k _i）^t・〓_i｝＝B_k−max｛_I 〓ⁱ⁼¹ （〓^k _i）^t・〓_i｝（式23）＝B_k−max｛_I 〓ⁱ⁼¹ l^k _i（Ｎ）｝（式24）＝B_k−maxM^k（Ｎ）（式25）ここで l^k _i（Ｎ）＝（a^k _i）^t・〓_i （式26）は、マツチングルートＮに従つて時間伸縮された
後の第ｉフレームの入力〓_iと標準パターンｋの
部分類似度である。次に時間伸縮の意味するとこ
ろを考えてみる。時間伸縮をされる前の未知入力
ベクトルを〓とすると、〓＝（〓₁，〓₂，…〓_o，…〓_N）（式27）と表わされる。ｎとｉは両方とも整数であり、
（式12）で関係づけられている。したがつて（式
15）のベクトル〓は（式27）の未知入力ベクトル
〓の中から、（式12）で関係づけられるフレーム
をＩフレーム分だけ選択して時間的順序を並べた
ものである。マツチングルートに従つて選択する
という操作を便宜上、次式で表わす。 To obtain the recognition result, k=k^, which minimizes L _k , can be found in similarity calculation formula (18). That is, minL _k = min{B _k − _I 〓 ⁱ⁼¹ (〓 ^k _i ) ^t・〓 _i } =B _k −max{ _I 〓 ⁱ⁼¹ (〓 ^k _i ) ^t・〓 _i } (Equation 23) =B _k −max{ _I 〓 ⁱ⁼¹ l ^k _i (N)} (Formula 24) =B _k −maxM ^k (N) (Formula 25) where l ^k _i (N) = (a ^k _i ) ^t・〓 _i (Formula 26) is the partial similarity between the i-th frame input 〓 _i and the standard pattern k after being time-stretched according to the matching route N. Next, let's consider the meaning of time expansion and contraction. Letting the unknown input vector before time warping be 〓, it is expressed as 〓=(〓 ₁ , 〓 ₂ , ...〓 _o , ...〓 _N ) (Equation 27). n and i are both integers,
They are related by (Equation 12). Therefore (formula
The vector 〓 in 15) is obtained by selecting I frames of frames related by (Formula 12) from among the unknown input vectors 〓 in (Formula 27) and arranging them in temporal order. For convenience, the operation of selecting according to the matching route is expressed by the following equation.

〓_i＝〓〓_i〓Ｎ（式28）そうすると部分類似度（式26）は l^k _i（Ｎ）＝（^k _i）^t・〓〓_i〓Ｎ（式29）また部分類似度の和M^k（Ｎ）は M^k（Ｎ）＝_I 〓ⁱ⁼¹ l^k _i（Ｎ）＝_I 〓ⁱ⁼¹ （〓^k _i）^t・〓_i〓Ｎ（式30）すなわち（式17）は、部分類似度l^k _i（Ｎ）が先
に求められていれば、それらを（式12）の関係に
従つてＩフレーム分だけ加えるという操作に置き
かえられる。（式12）はＮを与えれば一意にｉと
ｎの関係が求まるので、N₁≦Ｎ＜N₂の範囲であ
らかじめ計算して、テーブルなどに蓄積しておく
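ことができる。 With $\boldsymbol{x}_i$ the i-th frame of the time-normalized input and $\boldsymbol{y}$ the unknown input before time warping, the selection is written

$$\boldsymbol{x}_i = [\boldsymbol{y}_i]_N \quad \text{(Equation 28)}$$

The partial similarity (Equation 26) then becomes

$$l_i^k(N) = (\boldsymbol{a}_i^k)^t \cdot [\boldsymbol{y}_i]_N \quad \text{(Equation 29)}$$

and the sum of partial similarities $M^k(N)$ is

$$M^k(N) = \sum_{i=1}^{I} l_i^k(N) = \sum_{i=1}^{I} (\boldsymbol{a}_i^k)^t \cdot [\boldsymbol{y}_i]_N \quad \text{(Equation 30)}$$

In other words, if the partial similarities $l_i^k(N)$ have been obtained beforehand, (Equation 17) can be replaced by the operation of adding them up over I frames according to the relationship of (Equation 12). Since (Equation 12) uniquely determines the relationship between i and n once N is given, that relationship can be computed in advance over the range $N_1 \le N < N_2$ and stored in a table or the like.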
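The table lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: (Equation 12) is not reproduced in this section, so a linear mapping between pattern frame i and input frame n is assumed in its place, and the names `build_warp_table` and `similarity_sum` are hypothetical.

```python
def build_warp_table(num_pattern_frames, n_min, n_max):
    """Precompute, for every candidate length N, the input frame (0-based offset
    from the interval start) paired with each pattern frame i.
    ASSUMPTION: a linear mapping stands in for (Equation 12)."""
    last = num_pattern_frames - 1
    return {
        N: [round(i * (N - 1) / last) for i in range(num_pattern_frames)]
        for N in range(n_min, n_max + 1)
    }


def similarity_sum(partial, table, N):
    """M^k(N): add the stored partial similarities along matching route N.
    partial[n][i] holds the partial similarity of input frame n and pattern frame i."""
    return sum(partial[n][i] for i, n in enumerate(table[N]))


table = build_warp_table(num_pattern_frames=4, n_min=4, n_max=7)
print(table[4])  # [0, 1, 2, 3]
print(table[7])  # [0, 2, 4, 6]
```

Once the table exists, evaluating a route needs only I additions of values that are already stored, which is the point of the rearrangement in (Equation 30).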

次に第７図を参照してl^k _i（Ｎ）の求め方につい
て考えてみる。図において、点Ｐを標準パターン
と未知入力の終端点とし、未知入力の終端点の座
標をN₀とする。N₁，N₂は以前と同様に、音声の
最小長と最大長である。いま、未知入力の始端点
がＮの場合の類似度を求めるものとすると、マツ
チングルートは直線PNである。PN上で（式12）
を満足する、任意の一点（n′，ｉ）における部分
類似度l_i（Ｎ）は、入力のn′フレームのベクトルと
標準パターンのｉフレーム成分のベクトル〓₁の
積である。（n′，ｉ）点は、現時点ではPN上に位
置しているが、Ｐ点は時間とともにシフトするの
で、n′フレーム以前にはP′N′O上に存在していた
はずである。したがつて、Ｐ点の時点で（n′，
ｉ）の部分類似度を求めてそれを蓄積しておき、
P′の時点で使用することができる。（n′，ｉ）は
ΔPN₂N₁上の任意の点であるから、他の点につい
ても同様のことが言える。このように考えると、
各フレームにおける計算は次のように２つに分け
ることができる。 Next, with reference to FIG. 7, consider how $l_i^k(N)$ is obtained. In the figure, point P is the end point of both the standard pattern and the unknown input, and the coordinate of the end point of the unknown input is $N_0$. As before, $N_1$ and $N_2$ are the minimum and maximum speech lengths. Suppose the similarity is to be computed for the case where the start point of the unknown input is N; the matching route is then the straight line PN. The partial similarity $l_i(N)$ at an arbitrary point (n′, i) on PN that satisfies (Equation 12) is the product of the vector of input frame n′ and the vector $\boldsymbol{a}_i$ of the i-th frame component of the standard pattern. The point (n′, i) currently lies on PN, but because point P shifts with time, n′ frames earlier it must have lain on $P'N'_0$. Therefore, the partial similarity at (n′, i) can be computed and stored at the time of point P and used at the time of P′. Since (n′, i) is an arbitrary point in triangle $PN_2N_1$, the same holds for every other point. Seen this way, the calculation in each frame can be divided into the following two parts.

PN_O上での部分類似度を計算して、バツフア
に蓄積する。（積和演算）（式30）によつて計算する部分類似度和に用
いるl^k _i（Ｎ）は、それ以前のフレームで計算し
てバツフアに蓄積されていたものを取り出して
用いる。（加算演算）第８図はフレームあたりの計算方法をブロツク
図で示したものである。図において、３０はl^k _i
（N_O）を計算する積和器であり、標準パターンの
フレーム数（Ｉ）だけ用意されている。各積和器
の下部からは第ｊフレームの入力ベクトル〓
（ｊ）が入力され、左側から標準パターンが入力
される。そして（式29）に相当する計算を行な
い、l^k _i（N_O）を出力する。遅延バツフア３１は、
積和器の計算結果を１フレームの期間保存して、
次段へ伝播する。遅延バツフアの数は、１単語あ
たり、第７図のΔPN₂N₀内の点の数だけ用意され
ている。３２は加算器であり、（式30）に相当す
る計算を行なつて類似度和を求める。加算器３２
はＩ個の入力端を持ち、その各々は（式12）で規
定されるマツチングルートに従つて、遅延バツフ
アの出力端に接続されている。３３は比較器であ
り、maxM_k（Ｎ）を求める。３４は減算器であ
り、（式25）の計算を行なつて、単語ｋに対する
最小値を求める。 (1) The partial similarities on $PN_0$ are calculated and stored in a buffer (multiply-accumulate operation).
(2) The $l_i^k(N)$ used for the partial similarity sum of (Equation 30) are taken from the buffer, where they were stored after being calculated in earlier frames (addition operation).

FIG. 8 is a block diagram of the calculation performed per frame. In the figure, 30 denotes the multiply-accumulators that compute $l_i^k(N_0)$; one is provided for each of the I frames of the standard pattern. The input vector $\boldsymbol{y}(j)$ of the j-th frame enters each multiply-accumulator from below and the standard pattern enters from the left. The calculation corresponding to (Equation 29) is then performed and $l_i^k(N_0)$ is output. The delay buffers 31 hold the multiply-accumulator results for one frame period and propagate them to the next stage. The number of delay buffers provided per word equals the number of points within triangle $PN_2N_0$ in FIG. 7. Reference numeral 32 denotes an adder, which performs the calculation corresponding to (Equation 30) to obtain the similarity sum. The adder 32 has I input terminals, each connected to a delay buffer output according to the matching route defined by (Equation 12). Reference numeral 33 denotes a comparator, which finds $\max M^k(N)$. Reference numeral 34 denotes a subtractor, which evaluates (Equation 25) to find the minimum value for word k.
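The per-frame split into multiply-accumulate and addition steps can be sketched in software as below. This is an illustrative reading of FIG. 8, not the patent's circuit: the class name `WordMatcher`, the use of a bounded queue as the delay buffer, and the linear mapping standing in for (Equation 12) are all assumptions.

```python
from collections import deque


def dot(u, v):
    """Inner product of two equal-length coefficient lists."""
    return sum(p * q for p, q in zip(u, v))


class WordMatcher:
    """Streaming matcher for one standard pattern (hypothetical helper)."""

    def __init__(self, pattern, bias, n_min, n_max):
        self.pattern = pattern              # I frames, each a list of D coefficients
        self.bias = bias                    # B_k
        self.n_min, self.n_max = n_min, n_max
        last = len(pattern) - 1
        # ASSUMPTION: linear time warping stands in for (Equation 12).
        self.table = {
            N: [round(i * (N - 1) / last) for i in range(len(pattern))]
            for N in range(n_min, n_max + 1)
        }
        # Delay buffer: one row of I partial similarities per input frame.
        self.buf = deque(maxlen=n_max)

    def push(self, frame):
        """Consume one input frame; return B_k - max_N M(N), or None if too early."""
        # Step 1: the only multiply-accumulates, I inner products for the new frame.
        self.buf.append([dot(a_i, frame) for a_i in self.pattern])
        # Step 2: additions only, over stored partial similarities.
        best = None
        for N in range(self.n_min, min(self.n_max, len(self.buf)) + 1):
            start = len(self.buf) - N       # interval of length N ending at this frame
            m = sum(self.buf[start + n][i] for i, n in enumerate(self.table[N]))
            if best is None or m > best:
                best = m
        return None if best is None else self.bias - best


matcher = WordMatcher(pattern=[[1], [2], [3]], bias=10, n_min=3, n_max=4)
print([matcher.push([v]) for v in range(4)])  # [None, None, 2, -4]
```

Each call to `push` performs I inner products regardless of how many interval lengths are evaluated; the loop over N only reads and adds buffered values.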

以上、第２の実施例における方法の説明を行な
つた。第９図は第２の実施例における音声認識装
置の具現化を示す機能ブロツク図である。第９図
において、第１図と同じ番号を有するブロツクは
同一機能を有するので、説明を省略または簡略化
する。 The method of the second embodiment has been described above. FIG. 9 is a functional block diagram of a speech recognition device implementing the second embodiment. Blocks in FIG. 9 that bear the same numbers as in FIG. 1 have the same functions, so their description is omitted or abbreviated.

第９図において、AD変換部１０、音響分析部
１１、特徴パラメータ抽出部１２で入力音声をデ
イジタル化してLPC分析を行ない、特徴パラメ
ータ（LPCケプストラム係数）をフレームごと
に求める。１フレームの期間内に以下の操作を行
なう。 In FIG. 9, an AD converter 10, an acoustic analyzer 11, and a feature parameter extractor 12 digitize input audio and perform LPC analysis to obtain feature parameters (LPC cepstral coefficients) for each frame. The following operations are performed within the period of one frame.

標準パターン選択部１８は、標準パターン格納
部１７に格納されているＫ個の標準パターンを、
１つずつ選択する。部分類似度計算部２１は、入
力特徴パラメータと選択された標準パターンとの
間で（式29）の計算を行ないl^k _i（N_O）を求め、類
似度バツフア２２へ蓄積する。類似度バツフア
は、１単語あたり第７図のΔPN₂N₀内の類似度を
蓄積できる容量を持つており、時間伸縮テーブル
２４で指定されたアドレスの内容を読み出す。時
間伸縮テーブルには入力長Ｎ（N₁≦Ｎ≦N₂）の
各々に対して（式12）で規定されるｎとｉの関係
が記述されている。N₁，N₂は単語ごとに異な
り、区間候補設定部１５によつて設定される。類
似度加算部２３は、マツチングルートN₁〜N₂の
各々に対して、時間伸縮テーブル２４で指定され
たアドレスで読出される類似度バツフア２２の出
力を加算して（式30）の計算を行ない、類似度和
M^k（Ｎ）を求める。類似度比較部２０はM^k（Ｎ）
と１次記憶１９の内容を比較し、M^k（Ｎ）の方が
大きい場合のみ、１次記憶の内容をM^k（Ｎ）に置
きかえる。Ｎ＝N₂まで計算し終えると（式18）
によつてL_kを求め、１次記憶１９に蓄積されて
いる、それ以前の最小値と比較し、L_kが小さい
場合のみ１次記憶１９の内容を更新する。そし
て、標準パターン選択部１８は次の単語を選択し
て同様の操作を行なう。さらに全単語を終了する
とフレームを進める。 The standard pattern selection unit 18 selects, one at a time, the K standard patterns stored in the standard pattern storage unit 17. The partial similarity calculation unit 21 evaluates (Equation 29) between the input feature parameters and the selected standard pattern to obtain $l_i^k(N_0)$, and stores it in the similarity buffer 22. The similarity buffer has the capacity to hold, per word, the similarities within triangle $PN_2N_0$ of FIG. 7, and reads out the contents of the addresses specified by the time warping table 24. The time warping table describes the relationship between n and i defined by (Equation 12) for each input length N ($N_1 \le N \le N_2$). $N_1$ and $N_2$ differ from word to word and are set by the section candidate setting unit 15. For each of the matching routes $N_1$ to $N_2$, the similarity addition unit 23 adds the outputs of the similarity buffer 22 read from the addresses specified by the time warping table 24, thereby evaluating (Equation 30) to obtain the similarity sum $M^k(N)$. The similarity comparison unit 20 compares $M^k(N)$ with the contents of the primary memory 19 and replaces those contents with $M^k(N)$ only when $M^k(N)$ is larger. When the calculation has been completed up to $N = N_2$, $L_k$ is obtained by (Equation 18) and compared with the previous minimum value stored in the primary memory 19, whose contents are updated only when $L_k$ is smaller. The standard pattern selection unit 18 then selects the next word and the same operations are repeated. When all words have been processed, the frame is advanced.

対象とする全区間（ｊ＝j₁〜j₂）に対してこの
ような操作を行なうと、ｊ＝j₂フレームを終了し
た時点では、類似度の最小値L^とその時の単語名
k^を認識結果として求めることができる。 When this operation is performed over the entire target interval (j = $j_1$ to $j_2$), the minimum value $\hat{L}$ of the similarity measure and the corresponding word name $\hat{k}$ are available as the recognition result at the end of frame j = $j_2$.
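The role of the primary memory during this scan amounts to keeping a running minimum together with the word that produced it. A minimal sketch follows; the function name `scan` and the example scores are hypothetical and are not taken from the patent.

```python
def scan(frames_of_scores):
    """Keep the smallest L_k seen so far and the word that produced it.
    frames_of_scores: one dict per frame, mapping word name to its L_k."""
    best_word, best_score = None, float("inf")
    for scores in frames_of_scores:             # frames j = j1 .. j2
        for word, score in scores.items():      # standard patterns k = 1 .. K
            if score < best_score:              # update only when strictly smaller
                best_word, best_score = word, score
    return best_word, best_score


frames = [
    {"zero": 5.0, "go": 4.5},
    {"zero": 1.2, "go": 3.9},
    {"zero": 2.0, "go": 4.1},
]
print(scan(frames))  # ('zero', 1.2)
```

Because only one value and one name are kept, the result is ready as soon as the last frame has been processed, with no separate decision pass.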

第２の実施例では、第１の実施例に比べて、類
似度を求めるための積和演算の回数が非常に少な
くなつている。いま、単語数Ｋ＝10、標準パター
ン長Ｉ＝16、平均最小時間長N₁＝21、平均最大
時間長N₂＝40、１フレームあたりのパラメータ
数Ｄ＝６とすると、第１の実施例における積和演
算量は19800回に対し、第２の実施例では960回で
ある。 In the second embodiment, far fewer multiply-accumulate operations are needed to obtain the similarity than in the first embodiment. With the number of words K = 10, standard pattern length I = 16, average minimum time length $N_1$ = 21, average maximum time length $N_2$ = 40, and number of parameters per frame D = 6, the first embodiment requires 19,800 multiply-accumulate operations, whereas the second embodiment requires 960.
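The figure for the second embodiment can be checked with a rough count. The count assumes that each of the I multiply-accumulators performs one D-dimensional inner product per word and per frame, which is a reading of FIG. 8 rather than a breakdown stated in the text:

$$K \cdot I \cdot D = 10 \times 16 \times 6 = 960$$

Under that assumption, the 19,800 operations quoted for the first embodiment are about 20.6 times as many, close to the $N_2 - N_1 + 1 = 20$ interval candidates evaluated per frame.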

本実施例の方法を用いて、成人男女計330名が
電話機を通して発生した10数字単語を評価した結
果、平均認識率93.75％を得た。高騒音下の発声
であることを考慮すれば、この値は低いとは言え
ない。また本実施例による認識誤まりの原因を分
析した結果、誤まりのほとんどはある単語の一部
を他の単語として認識してしまうために生ずるこ
とがわかつた。たとえば／Zero／の／ro／の部
分を／go／と誤認識するのがその１例である。
このため、第２候補までを正解とすると97％以上
の認識率を得る。したがつて、他の方法を少し併
用すれば、第１候補としてさらに高い認識率が得
られることが容易に推察される。 Using the method of this embodiment, ten digit words uttered over the telephone by a total of 330 adult men and women were evaluated, giving an average recognition rate of 93.75%. Considering that the utterances were made under high noise, this value cannot be called low. Analysis of the recognition errors in this embodiment showed that most of them arise because part of one word is recognized as another word; for example, the /ro/ portion of /zero/ is misrecognized as /go/. Consequently, when candidates up to the second are counted as correct, the recognition rate is 97% or higher. It is therefore readily inferred that a still higher first-candidate recognition rate could be obtained by combining the method with a few other techniques.

発明の効果以上要するに本発明は、認識すべき音声とその
前後の騒音を含む入力信号区間に、ある時間的な
基準点を設け、基準点を端点としてそれからN₁
フレームの区間とN₂フレームの区間（N₁＜N₂）
の２区間を設定して、これらを音声区間のそれぞ
れ最小値と最大値と考えて、N₂−N₁＋１とおり
の音声区間候補のそれぞれに対して、音声区間長
を一定時間長に伸縮しながら各単語の標準パター
ンとのマツチングを行なつて各単語の類似度また
は距離を求め、この操作を基準点を全入力信号区
間の始めから終りまで走査して行ない、全ての基
準点位置の全ての音声区間候補に対する類似度ま
たは距離を各単語について比較し、類似度を最大
または距離を最小とする単語を認識結果として出
力するもので、音声区間の検出を必要とせず、騒
音と音声が混在した信号から音声に相当する部分
のみを切出して認識でき、従来は複雑なルールを
用いて音声区間の検出を行なつていたが、それで
も騒音レベルが高い場合や非定常的なノイズが混
入する場合には音声区間の検出を誤まり、したが
つて誤認識をしていたが、本発明は複雑な音声区
間検出アルゴリズムを除去することによつて、シ
ステムを簡略化し、また高騒音入力に対して安定
した認識率を確保することができ、その効果は大
きい。 Effects of the Invention: In summary, the present invention sets a temporal reference point in an input signal interval that contains the speech to be recognized together with the noise before and after it. Taking the reference point as an end point, it sets two intervals of $N_1$ frames and $N_2$ frames ($N_1 < N_2$), which are regarded as the minimum and maximum speech interval lengths. For each of the resulting $N_2 - N_1 + 1$ speech interval candidates, the interval is expanded or contracted to a fixed time length and matched against the standard pattern of each word to obtain a similarity or distance for each word. This operation is repeated while the reference point is scanned from the beginning to the end of the entire input signal interval. The similarities or distances for all speech interval candidates at all reference point positions are compared for each word, and the word giving the maximum similarity or the minimum distance is output as the recognition result. The method therefore needs no speech interval detection, and it can extract and recognize only the portion corresponding to speech from a signal in which noise and speech are mixed. Conventionally, speech intervals were detected with complex rules, yet detection still failed, and misrecognition followed, when the noise level was high or non-stationary noise was mixed in. By eliminating the complex speech interval detection algorithm, the present invention simplifies the system and secures a stable recognition rate for high-noise input, which is a significant benefit.

[Brief Description of the Drawings]

第１図は本発明の第１の実施例における音声認
識方法を具現化する機能ブロツク図、第２図乃至
第４図は同実施例の音声区間長の伸縮を説明する
概念図、第５図は同実施例の音声の標準パターン
作成時の、周囲情報の標準パターン作成法を説明
する概念図、第６図は同実施例の処理手順を説明
するフローチヤート、第７図は本発明の第２の実
施例における音声認識方法の部分類似度の求め方
を示す概念図、第８図は同実施例のフレームあた
りの計算方法を示すブロツク図、第９図は同実施
例における音声認識方法を具現化する機能ブロツ
ク図、第１０図は従来の音声認識方法を示す機能
ブロツク図である。１０……AD変換部、１１……音響分析部、１
２……特徴パラメータ抽出部、１３……フレーム
同期信号発生部、１４……時間軸正規化部、１５
……区間候補設定部、１６……類似度計算部、１
７……標準パターン格納部、１８……標準パター
ン選択部、１９……１次記憶、２０……類似度比
較部。 FIG. 1 is a functional block diagram embodying the speech recognition method of the first embodiment of the present invention; FIGS. 2 to 4 are conceptual diagrams explaining the expansion and contraction of the speech interval length in that embodiment; FIG. 5 is a conceptual diagram explaining how the standard pattern of surrounding information is created when the speech standard patterns of that embodiment are created; FIG. 6 is a flowchart explaining the processing procedure of that embodiment; FIG. 7 is a conceptual diagram showing how partial similarities are obtained in the speech recognition method of the second embodiment of the present invention; FIG. 8 is a block diagram showing the per-frame calculation method of that embodiment; FIG. 9 is a functional block diagram embodying the speech recognition method of that embodiment; and FIG. 10 is a functional block diagram showing a conventional speech recognition method.
10: AD conversion unit, 11: acoustic analysis unit, 12: feature parameter extraction unit, 13: frame synchronization signal generation unit, 14: time axis normalization unit, 15: section candidate setting unit, 16: similarity calculation unit, 17: standard pattern storage unit, 18: standard pattern selection unit, 19: primary memory, 20: similarity comparison unit.

Claims

[Claims]
1. A speech recognition method characterized in that: a standard pattern for each speech to be recognized is created in advance using data belonging to that speech, data of all speech to be recognized, and surrounding information of the data of all speech; a temporal reference point is set in an unknown input containing the speech to be recognized and its surrounding information; two intervals of time lengths $N_1$ and $N_2$ ($N_1 < N_2$) are set with the reference point as an end point, the interval from the reference point to $N_1$ being regarded as the minimum speech interval and the interval from the reference point to $N_2$ as the maximum speech interval; a plurality of speech intervals are assumed between the minimum and maximum speech intervals; each assumed speech interval is expanded or contracted to a fixed time length and matched against the standard pattern of each speech to obtain a similarity or distance for each speech; the maximum similarity or minimum distance over all standard patterns and all assumed speech intervals is stored together with the corresponding standard pattern name; the reference point is then shifted by a unit interval within the unknown input and a new maximum similarity or minimum distance is obtained in the same way; the previously stored maximum similarity or minimum distance is compared with the new one, and the larger similarity or smaller distance is stored together with the corresponding standard pattern name; this operation is repeated over a sufficiently wide range of the unknown input while the reference point is shifted by unit time; and when the reference point reaches the final point, the speech corresponding to the stored standard pattern name is taken as the recognition result.
2. The speech recognition method according to claim 1, characterized in that the correspondence between temporal positions in the speech interval and in the standard pattern, for the case where the speech interval length is expanded or contracted to the fixed time length, is obtained in advance, and that, in calculating the similarity or distance, partial similarities or distances between the unknown input and the standard pattern are obtained first and are then added, with reference to the correspondence, to give the similarity or distance between the unknown input of the assumed speech interval length and the standard pattern.
3. The speech recognition method according to claim 1, characterized in that the similarity or distance is calculated using a measure based on a posteriori probability.
4. The speech recognition method according to claim 1, characterized in that the feature parameters are any of LPC cepstral coefficients, autocorrelation coefficients, or bandpass filter outputs.
5. The speech recognition method according to claim 1, characterized in that the surrounding information is created statistically from many data samples belonging to all target words, using the speech interval combined with ₁ frames near its start and ₂ frames near its end.
6. The speech recognition method according to claim 1, characterized in that the standard pattern for a given speech n is obtained by removing the surrounding information from a standard pattern obtained statistically using data belonging to n.
7. The speech recognition method according to claim 1, characterized in that the formula for calculating the similarity is a linear discriminant function.