JPH0346838B2

Info

Publication number: JPH0346838B2
Application number: JP59179693A
Authority: JP
Inventors: Satoru Taguchi
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1984-08-29
Filing date: 1984-08-29
Publication date: 1991-07-17
Also published as: JPS6157995A

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a speech recognition device, and in particular to a compressed-DP speech recognition device that recognizes input word speech of a specific speaker by time normalization, so-called pattern matching, between a standard pattern registered at training (registration) time with its analysis frames compressed and an input speech pattern supplied each time recognition is performed.

[Prior art]

Compressed-DP speech recognition devices for a single specific speaker are now well known. Such a device analyzes multiple word utterances of the specified speaker at a predetermined analysis period, that is, analysis frame by analysis frame, and derives a standard pattern from the distribution of the extracted feature parameters. It compresses this standard pattern by the DP method, time-normalizes it against the input pattern, and, by tracing the DP path, finds the standard pattern whose vector distance to the input pattern is smallest, that is, the one giving the least distortion of the recognized speech.

Because such a compressed-DP speech recognition device uses the standard pattern in compressed form when time-normalizing it against the input pattern, it can reduce both the standard pattern memory that stores the standard patterns and the amount of processing needed for time normalization, and the hardware can be simplified correspondingly.

When the input pattern and the standard pattern are speech patterns of the same person and a normal usage environment is assumed, variation in speaking rate is the largest source of variation. The time normalization described above is intended to remove the resulting complex nonlinear expansion and contraction, which differs between vowel and consonant portions. Its purpose is to find the optimal mapping function between the input pattern and the standard pattern and so align the time axis of the standard pattern with that of the input pattern. This is usually done by applying the DP method with the vector distance, in other words the inter-pattern distance, as the evaluation measure and finding the mapping that minimizes it.

Although such a compressed-DP speech recognition device has the features described above, it uses rectangular approximation to compress the standard pattern. Its compression efficiency, that is, the ratio between the amount of computation and the distortion reduction obtained, is therefore unavoidably limited.

[Purpose of the invention]

The object of the present invention is to eliminate the drawback described above and to provide a compressed-DP speech recognition device for word speech of a specific speaker with markedly improved compression efficiency. To that end, the standard pattern is compressed by DP using optimal trapezoidal approximation, and speech recognition is performed with a DP matching method in which time normalization finds the DP path either by compressing the input pattern to correspond to the compressed standard pattern and then normalizing, or by stretching the standard pattern to match the input pattern.

[Structure of the invention]

The device of the present invention is a compressed-DP speech recognition device that recognizes the words of a specific speaker by pattern matching, through time normalization, between a standard pattern registered with its analysis frames compressed by the DP method and an input pattern of word speech by that specific speaker. It comprises standard pattern compression means that compresses the standard pattern on the basis of optimal trapezoidal approximation by the DP method, and time normalization means that either compresses the input pattern to match the standard pattern and time-normalizes it at the standard pattern length, or stretches the standard pattern to match the input pattern and time-normalizes it. In either case the time normalization is obtained by finding the DP path that minimizes the amount of distortion with respect to the standard pattern, which serves as the evaluation measure.

[Example]

The present invention will now be described in detail with reference to the drawings. FIG. 1 is a block diagram showing an embodiment of a speech recognition device according to the present invention.

The embodiment shown in FIG. 1 comprises an acoustic analyzer 1, a switch 2, a compression processor 3, a standard pattern memory 4, a pattern matcher 5, a minimum distance searcher 6, and so on.

To recognize word speech of a specific speaker, standard patterns for the words uttered by that speaker must first be stored in advance. This is done as follows.

The acoustic analyzer 1 contains an LPF (low-pass filter), an A/D converter, an LSP (line spectrum pair) analyzer, and so on. It filters the input speech with the LPF at a predetermined cutoff frequency, samples the result at a predetermined sampling frequency to convert it to digital data, and passes the data to the LSP analyzer.

The LSP analyzer also contains an LPC (linear prediction coefficient) analyzer. For each analysis frame, that is, each time frame of the preset analysis period, the LPC analyzer extracts LPC coefficients such as PARCOR (partial autocorrelation) coefficients. From these, an LSP coefficient sequence of a preset order is obtained for each analysis frame by a known technique, for example solving the higher-order equation by Newton's iterative method, and is sent to the switch 2. It is also well known that the LSP coefficients obtained in this way are parameters representing the resonance characteristics of the vocal tract: they are the line spectrum frequencies of the vocal tract filter transfer function when the glottis is hypothetically completely open and completely closed, and they are feature parameters handled in the frequency domain.

During training (registration) of the standard patterns, the switch 2 is set to the connection shown by the dotted line, so the LSP parameters for the words of the specific speaker are supplied to the compression processor 3.

The compression processor 3 uses the DP method to perform frame compression of these LSP parameters by optimal trapezoidal approximation, as follows.

It is well known that, besides optimal linear approximation, optimal rectangular approximation and more recently optimal trapezoidal approximation have come into use for frame compression in fields such as variable-frame-length linear predictive vocoders. Of these, optimal rectangular approximation is widely used as the basic compression method in speech recognition devices, because the reduction in computation expected from compression is far greater than with optimal linear approximation. By its nature, however, the degree of approximation it can achieve is limited, and the distortion is therefore much larger than with optimal linear approximation.

Optimal trapezoidal approximation, on the other hand, does not reduce computation as much as optimal rectangular approximation, but its degree of approximation is far higher. The distortion can therefore be brought close to that of optimal linear approximation, and compression efficiency improves markedly.

FIG. 2A illustrates the principle of optimal rectangular approximation, and FIG. 2B the principle of optimal trapezoidal approximation.

In FIG. 2A, a feature vector, for example a vector of LSP parameters, is extracted from the input speech a for each analysis frame. In optimal rectangular approximation, the successively supplied LSP parameter vectors are grouped K frames at a time into a new processing section. For each processing section, the optimal combination is selected of a preset maximum number M (1 < M < K) of feature parameters and the analysis frames that each of the M feature parameters should represent. The sequence of analysis frames approximated by this selection is the variable-length frame sequence of the optimal rectangular approximation shown as b in FIG. 2A.

The maximum number M of feature vectors per processing section can be set arbitrarily between 1 and K with compression efficiency in mind. The DP method then determines which combination of M feature vectors is used and which analysis frames each one represents. The DP uses the distortion of the rectangular approximation as its evaluation measure: for each processing section in turn, it finds the assignment of the M representative feature vectors to analysis frames that minimizes the distance between the rectangular-approximation feature vectors and the original feature vectors, which is easily computed.
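The DP search described above can be sketched as follows. This is a minimal illustration rather than the patent's procedure: it assumes each segment is represented by the mean of its frames (the text instead selects representative frames), it uses exactly M segments, and the names `seg_cost` and `rectangular_approx` are hypothetical.

```python
def seg_cost(frames, i, j):
    """Squared distortion when frames[i:j] are all replaced by their mean vector."""
    seg = frames[i:j]
    dim = len(seg[0])
    mean = [sum(f[d] for f in seg) / len(seg) for d in range(dim)]
    return sum((f[d] - mean[d]) ** 2 for f in seg for d in range(dim))


def rectangular_approx(frames, m):
    """Split K frames into m contiguous segments with minimum total distortion.

    Returns (boundaries, distortion); segment n covers
    frames[boundaries[n]:boundaries[n + 1]].
    """
    k = len(frames)
    if not 1 <= m <= k:
        raise ValueError("m must satisfy 1 <= m <= number of frames")
    inf = float("inf")
    # best[s][j]: least distortion covering the first j frames with s segments
    best = [[inf] * (k + 1) for _ in range(m + 1)]
    back = [[0] * (k + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for s in range(1, m + 1):
        for j in range(s, k + 1):
            for i in range(s - 1, j):  # the last segment is frames[i:j]
                cost = best[s - 1][i] + seg_cost(frames, i, j)
                if cost < best[s][j]:
                    best[s][j] = cost
                    back[s][j] = i
    # Trace the segment boundaries back from the end
    bounds = [k]
    for s in range(m, 0, -1):
        bounds.append(back[s][bounds[-1]])
    return bounds[::-1], best[m][k]
```

For a block of six one-dimensional frames `[[0.0], [0.0], [0.0], [5.0], [5.0], [9.0]]` and `m = 3`, the function returns the boundaries `[0, 3, 5, 6]` with zero distortion, since each segment is constant.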

As noted above, however, optimal rectangular approximation by DP is limited in compression efficiency. This embodiment therefore obtains an optimal trapezoidal approximation by the DP method, which greatly alleviates the problem.

Optimal trapezoidal approximation exploits the fact that transient portions, where the speech information changes rapidly, have a nearly constant duration, usually about 20 ms, and represents each transient by a duration equal to a preset fixed number of analysis frames. It is an optimal approximation that uses a trapezoidal function in place of a rectangular function, and it is inherently more accurate than optimal rectangular approximation. It is basically carried out by using the DP method to select the group of representative feature vectors that minimizes the vector-space distance to the original feature vectors, while representing the interval between selected representative feature vectors by that fixed duration, the so-called slope section. It has recently come into wide use in fields such as variable-frame-length vocoders. The present embodiment, however, does not use the so-called piecewise approximation that processes one processing section at a time. Instead, each word to be registered as a standard pattern is treated as a single processing section and the trapezoidal approximation is optimized against the total distortion, so the number of selected frames is not fixed.

FIG. 2B illustrates the principle of optimal trapezoidal approximation with these characteristics. Curve c is one word of speech by the specific speaker, and trapezoid d is the approximating trapezoid that treats the one-word speech c as a single processing section. Points P₁, P₂, P₃, P₄ and so on are the representative feature parameters, and slope sections of a preset fixed duration are placed between the variable-length frame sections (numbered 1 to 4 in the figure) that these representative feature parameters represent. Determining the optimal trapezoidal approximation amounts to using the DP method to find the trapezoid that minimizes the hatched area enclosed between trapezoid d and the one-word speech c. As FIG. 2B makes clear, the optimal trapezoidal approximation obtained in this way is far more accurate than rectangular approximation, so far fewer representative feature vectors are needed and compression efficiency improves.
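To make the trapezoid of FIG. 2B concrete, the sketch below rebuilds a frame sequence from representative vectors and measures its distortion against the original. It is only an illustration under assumptions: the slope sections are filled by linear interpolation, the distortion is a sum of squared differences standing in for the hatched area, and the function names and the `flat_lengths` and `slope_len` parameters are hypothetical. The DP search over candidate trapezoids is not shown.

```python
def trapezoid_reconstruct(reps, flat_lengths, slope_len):
    """Rebuild a frame sequence from representative vectors.

    Each representative is held for its flat section; consecutive
    representatives are joined by a slope section of slope_len frames
    filled by linear interpolation (an assumption of this sketch).
    """
    if len(reps) != len(flat_lengths):
        raise ValueError("need one flat length per representative vector")
    out = []
    for n, (rep, length) in enumerate(zip(reps, flat_lengths)):
        out.extend(list(rep) for _ in range(length))
        if n + 1 < len(reps):
            nxt = reps[n + 1]
            for t in range(1, slope_len + 1):
                w = t / (slope_len + 1)  # interpolation weight, 0 < w < 1
                out.append([(1 - w) * a + w * b for a, b in zip(rep, nxt)])
    return out


def distortion(original, approx):
    """Sum of squared differences between two equal-length frame sequences."""
    if len(original) != len(approx):
        raise ValueError("sequences must have the same number of frames")
    return sum((a - b) ** 2
               for fo, fa in zip(original, approx)
               for a, b in zip(fo, fa))
```

With representatives `[[0.0], [4.0]]`, flat lengths `[2, 2]` and a three-frame slope, the reconstruction is seven frames long: two frames at 0, a ramp through 1, 2 and 3, and two frames at 4.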

Returning to FIG. 1, the compression processor 3 applies this optimal trapezoidal approximation to the feature parameters, namely the LSP parameters, extracted from each word uttered by the specific speaker, and sends the results to the standard pattern memory 4, where they are stored as standard patterns.

With the standard patterns stored in this way, the switch 2 is switched to the recognition side. The specific speaker utters, via the input terminal 101, one of the words stored in the standard pattern memory 4; the acoustic analyzer 1 extracts its LSP parameters and supplies them to the pattern matcher 5.

The pattern matcher 5 contains a spectral distance meter, an interpolator, and so on. Applying the DP method with spectral distance as the evaluation measure, it performs time normalization at the standard pattern length between the standard pattern and the input pattern compressed to match it, as follows.

Standard patterns are read out one after another from the standard pattern memory 4. The interpolator in the pattern matcher 5 sets interpolated values between the DP-compressed representative feature vectors, and time normalization is then performed at the standard pattern length by the DP method, using as the evaluation measure the spectral distance measured by the built-in spectral distance meter.

There are two ways to time-normalize a DP-compressed standard pattern and an input pattern. One thins out the input pattern to match the compressed standard pattern and then normalizes at the standard pattern length. The other stretches the standard pattern to correspond to the input pattern by repeatedly generating the intervals between representative feature vectors. This embodiment uses the former. To align the time axes of the compressed standard pattern and the uncompressed input pattern, that is, to find the mapping function between them for time normalization, either approach serves equally well.

FIG. 3 illustrates the principle of the pattern matching process in the embodiment of FIG. 1. The description of the embodiment continues below with reference to FIG. 3.

In FIG. 3, the standard pattern 1001 is one of the standard patterns formed by the optimal trapezoidal approximation described above using the DP method, and the input pattern 1002 is the input pattern to be pattern-matched, that is, time-normalized, against the standard pattern 1001.

Consider the i-j plane shown in FIG. 3, with the standard pattern 1001 along the i direction and the input pattern 1002 along the j direction. The vertical solid lines marked with black dots are actually measured LSP parameters. Using its built-in interpolator, the pattern matcher 5 sets interpolated LSP parameters, marked with X, between these solid lines, as shown by the dotted lines.

The input pattern is obtained as LSP parameter vectors at every analysis period t₀ of the acoustic analyzer 1, drawn as solid lines perpendicular to the j direction. The intersections of the vertical and horizontal lines that make up the i-j plane are the corresponding positions at which the two patterns are to be time-normalized. The spectral distance between the LSP parameter vectors of the two patterns is computed for every pair of corresponding points, and the DP method, with this distance as the evaluation measure, finds the DP path that minimizes the distance between the two patterns; the result gives the spectral distance between them. In finding this DP path, extreme time-axis variations that do not occur in practice are excluded, and DP processing is restricted to the processing range between l₁ and l₂, usually called the matching window.
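DP matching restricted to a matching window can be sketched as below. The sketch rests on assumptions the text does not state: it uses the common symmetric step pattern (horizontal, vertical, diagonal), a plain squared Euclidean frame distance in place of the spectral distance, and a window expressed as a band of half-width `window` around the diagonal rather than the lines l₁ and l₂ of FIG. 3. The names `frame_dist` and `dp_match` are hypothetical.

```python
import math


def frame_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_match(ref, inp, window):
    """Minimum accumulated distance between ref and inp inside a band.

    Grid points (i, j) are evaluated only when j lies within `window`
    frames of the diagonal, so extreme time-axis warps are excluded.
    Returns infinity if the band leaves no admissible path.
    """
    n, m = len(ref), len(inp)
    inf = float("inf")
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        centre = i * m / n  # position of the diagonal in row i
        lo = max(1, math.ceil(centre - window))
        hi = min(m, math.floor(centre + window))
        for j in range(lo, hi + 1):
            step = min(g[i - 1][j], g[i][j - 1], g[i - 1][j - 1])
            g[i][j] = frame_dist(ref[i - 1], inp[j - 1]) + step
    return g[n][m]
```

For example, matching `[[0], [1], [2]]` against the slower utterance `[[0], [0], [1], [2], [2]]` with `window=2` gives an accumulated distance of 0, because the warp absorbs the repeated frames.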

In FIG. 3, the spectral distance indicated by the arrow is measured, for example at Q₁ on the i-j plane, between the LSP parameter vector of the standard pattern and the corresponding LSP parameter vector of the input pattern. Among these distances, the path at d, a 45-degree line, corresponds to the fixed-duration section of the optimal trapezoidal approximation, the so-called slope section; the spectral distances indicated by the straight and broken lines, including this one, are measured. The measurement is carried out at the standard pattern length for the correspondences between standard pattern and input pattern at all intersections of vertical and horizontal lines, dotted lines included, within the processing range bounded by lines l₁ and l₂. The conditions for working at the standard pattern length depend on the degree of DP compression of the standard pattern; in this embodiment, as shown in FIG. 3, all combinations of six corresponding points each are covered.

In this way the input pattern 1002 is compressed to match the standard pattern 1001 and time-normalized at the standard pattern length, which yields the DP path g. The spectral distances between this time-normalized input pattern and every standard pattern are supplied one after another to the minimum distance searcher 6.

In general, the distance between two LSP parameter spectra is given by the spectral distance D_sr of equation (1).

$$D_{sr} = \frac{1}{\pi}\int_{0}^{\pi}\{S_s(\omega) - S_r(\omega)\}^2\,d\omega \qquad (1)$$

In practice, equation (1) is usually replaced by the approximation in equation (2).

$$D_{sr} = \sum_{K=1}^{N} W_K\{P_K^{(s)} - P_K^{(r)}\}^2 \qquad (2)$$

In equations (1) and (2), s and r are the numbers of analysis frames or processing sections (blocks); $S_s(\omega)$ and $S_r(\omega)$ are the logarithmic spectra of frames or blocks s and r as functions of the frequency ω; $P_K^{(s)}$ and $P_K^{(r)}$ are the LSP parameters of analysis order K in frames or blocks s and r; and $W_K$ is the spectral sensitivity of the K-th LSP frequency.
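Equation (2) is a weighted sum of squared differences between corresponding LSP parameters, which can be written directly. The function name `lsp_distance` is hypothetical, and the weights are supplied by the caller because the text does not say how the spectral sensitivities W_K are obtained.

```python
def lsp_distance(p_s, p_r, weights):
    """Approximate spectral distance of equation (2).

    p_s, p_r: LSP parameters of frames (or blocks) s and r, orders 1..N.
    weights:  spectral sensitivity W_K for each order K (caller-supplied).
    """
    if not (len(p_s) == len(p_r) == len(weights)):
        raise ValueError("p_s, p_r and weights must have the same length")
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, p_s, p_r))
```

As a worked example, with parameters (1.0, 2.0) and (0.0, 4.0) and weights (2.0, 0.5), the distance is 2.0·1² + 0.5·2² = 4.0.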

The time normalization by the DP method described above, in other words DP pattern matching, computes on this basis the vector distance between the two patterns with the input pattern thinned out relative to the standard pattern. The computation is repeated for the input pattern against every standard pattern, and each result is supplied in turn to the minimum distance searcher 6 as spectral distance data together with the designation number of the standard pattern.

The minimum distance searcher 6 first stores in its built-in memory the spectral distance data of the input pattern for each standard pattern, compares the values, and finds the minimum. From the designation number of the standard pattern that gave the minimum spectral distance, it supplies the information for that standard pattern to the output terminal 601 as the recognition result. Speech recognition is thus performed by DP pattern matching against standard patterns compressed by optimal trapezoidal approximation.
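The final decision step amounts to picking the standard pattern with the smallest accumulated distance. A minimal sketch, assuming the distances arrive as a mapping from designation number to spectral distance (the data layout and the name `recognize` are illustrative, not taken from the patent):

```python
def recognize(distances):
    """Return the designation number of the closest standard pattern.

    distances: dict mapping designation number -> spectral distance
               between the input pattern and that standard pattern.
    """
    if not distances:
        raise ValueError("no standard patterns to compare against")
    return min(distances, key=distances.get)
```

For instance, `recognize({1: 3.2, 2: 0.7, 3: 1.5})` selects pattern 2, the one with the smallest distance.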

In the embodiment described above, LSP parameters are used as the feature parameters of the word speech stored as standard patterns. Clearly, other feature parameters can be used in the same way, for example the cepstrum, which is the inverse transform of the logarithm of the spectrum of the word speech.

This embodiment takes as its example the time normalization method that compresses the input pattern to match the standard pattern and normalizes at the standard pattern length. The same result is obtained if the compressed standard pattern is stretched to match the input pattern before DP time normalization; this is easily done by repeatedly reading out the compressed standard pattern so that it lines up in time with the input pattern.

[Effect of the invention]

As described above, the present invention concerns a speech recognition device that recognizes word speech of a specific speaker through time normalization between a standard pattern registered with its analysis frames compressed and an input pattern of word speech uttered by the specific speaker. The device holds standard patterns obtained by compressed-DP optimal trapezoidal approximation using the DP method. For time normalization of the input pattern and the standard pattern, it either compresses the input pattern to match the standard pattern and normalizes at the standard pattern length, or stretches the standard pattern to match the input pattern, in both cases by the DP method with the feature vector distance between the two patterns as the evaluation measure. This has the effect of realizing a speech recognition device that greatly improves compression efficiency and greatly reduces the memory capacity needed for standard patterns.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention; FIG. 2A is a diagram illustrating the principle of optimal rectangular approximation; FIG. 2B is a diagram illustrating the principle of optimal trapezoidal approximation; and FIG. 3 is a diagram explaining time normalization in the embodiment of FIG. 1.

1: acoustic analyzer, 2: switch, 3: compression processor, 4: standard pattern memory, 5: pattern matcher, 6: minimum distance searcher.

Claims

[Claims]

1. A compressed-DP speech recognition device that performs speech recognition of the words of a specific speaker by pattern matching, through time normalization, between a standard pattern of word speech by the specific speaker, registered with its analysis frames compressed by the dynamic programming method (hereinafter abbreviated as DP), and an input pattern of word speech by the specific speaker, the device comprising: standard pattern compression means that compresses the standard pattern on the basis of optimal trapezoidal approximation by the DP method; and time normalization means that either compresses the input pattern to match the standard pattern and time-normalizes it at the standard pattern length, or stretches the standard pattern to match the input pattern and time-normalizes it, the time normalization being obtained by finding the DP path that minimizes the amount of distortion with respect to the standard pattern, which serves as the evaluation measure.