JP2804265B2 - Voice recognition method

Info

Publication number: JP2804265B2
Application number: JP62179562A
Authority: JP
Inventors: 哲也室井
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1987-07-17
Filing date: 1987-07-17
Publication date: 1998-09-24
Anticipated expiration: 2013-09-24
Also published as: JPS6423299A

Description

Technical Field
The present invention relates to a speech recognition method and, more particularly, to a matching method used in speech recognition.

Prior Art
DP matching, one of the methods for matching speech patterns, has the drawback of requiring a large amount of computation. This is because the number of grid points at which the local distance and the cumulative distance must be computed is proportional to (number of input frames) × (number of frames in the standard pattern).

A technique called compressed DP is also known. It is effective in reducing the number of grid points, but it is difficult to decide which part of a speech pattern to compress and by how much. Control of time expansion and contraction also becomes complicated, and the compression ratio differs from word to word.

In an HMM, although probabilities are used instead of distances, the number of grid points is (number of input frames) × (number of model states), so the amount of computation is far smaller than with the grid of DP matching. Because the number of states is fixed, the amount of computation also does not change when the vocabulary changes. However, an HMM has the drawbacks that time expansion and contraction (control of state transitions) is difficult, and that a good model is hard to obtain when training data is scarce.

Object
The present invention has been made in view of the above circumstances. In particular, it aims to perform precise matching of speech patterns using a small amount of computation and a small memory for standard patterns.
Configuration
To achieve the above object, the present invention concerns a speech recognition device in which an input speech signal is converted into a time series of feature vectors by feature sequence conversion means, a standard pattern is expressed as a sequence of a fixed number of states, a feature vector representing the state and the duration of the state are stored for each state, and the time series of feature vectors of the input speech is matched against the standard pattern. The invention is characterized in that, during matching, the weight of a state transition is an amount proportional to the square of the difference between the time for which the input speech belongs to each state of the standard pattern and the duration of that state, and this weight is added to the cumulative distance used in matching.

The invention is described below with reference to an embodiment.

FIG. 1 is a block diagram for explaining an embodiment of the present invention. In the figure, 1 is a microphone, 2 is feature sequence conversion means, 3 is speech section detection means, 4 is a matching unit, 5 is the standard patterns, and 6 is a recognition result output unit. Speech input from the microphone 1 is converted into a time series of feature vectors by the feature sequence conversion means 2. Various feature vectors can be used, such as a spectrum or LPC cepstrum. This embodiment is described using a spectrum, but the invention is not limited to it. To convert a speech waveform into a spectrum pattern, a bank of 15 band-pass filters with center frequencies from 250 to 6300 Hz, spaced every 1/3 octave, may be used. The frame period may be about 10 ms.

Various methods are known for the speech section detection means 3. Since they are not directly related to the present invention, a detailed description is omitted here.

The input speech is thus obtained as a time series of the following vectors:

  x_i = (x_{1i}, x_{2i}, \ldots, x_{15i})   (2)

where I is the number of input speech frames, x_i is the feature vector of frame i of the input speech, and x_{fi} is the output of channel f in frame i of the input speech.

The standard pattern is registered as a sequence of N states, where N is the number of states, y_j is the feature vector of the j-th state of the standard pattern, y_{fj} is the value of channel f in the j-th state, and l_j is the duration (in frames) of the j-th state.

Pattern matching can use dynamic programming, hill climbing, or the like. Dynamic programming is used in the description here.

  step 1. D(1, 1) = d(1, 1)   (6)
  step 3. (distance between the standard pattern and the input speech) = D(I, N)

where i is the input frame number, j is the state number, d(i, j) is the local distance between input frame i and state j of the standard pattern, D(i, j) is the optimal cumulative distance reaching grid point (i, j), and W is the weight of a state transition.

FIG. 2 shows an example of the optimal state sequence (number of states N = 4) obtained in this way when matching the input speech against a standard pattern. The optimal state sequence is found, and the cumulative distance D(I, N) along it is taken as the matching result.

Various local distances are known, such as the city-block distance, the Euclidean distance, and the Mahalanobis distance.
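The equations for the pattern formats and for step 2 of the recursion did not survive in this text. The block below is a reconstruction from the surrounding definitions only. It is offered as an assumption rather than as the patent's own equations, and the original equation numbers are deliberately not assigned. The recursion assumes the two predecessors that the description gives for FIG. 4, with the weight W charged only on the path that changes state.

```latex
% ASSUMED reconstruction; not taken verbatim from the source.
% Input speech: I frames of 15-channel feature vectors.
X = x_1, x_2, \ldots, x_I

% Standard pattern: N states, each a feature vector with a duration.
Y = (y_1, l_1), (y_2, l_2), \ldots, (y_N, l_N), \qquad
y_j = (y_{1j}, y_{2j}, \ldots, y_{15j})

% Step 2 (assumed form), for 2 \le i \le I:
D(i, 1) = d(i, 1) + D(i-1, 1)
D(i, j) = d(i, j) + \min \{\, D(i-1, j),\; D(i-1, j-1) + W \,\},
          \qquad 2 \le j \le N
```

Under this reading, staying in state j costs only the local distance, while moving from state j−1 to state j additionally costs W.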
For example, if the city-block distance is used, d(i, j) is calculated as

  d(i, j) = \sum_{f=1}^{15} |x_{fi} - y_{fj}|   (9)

In ordinary DP matching, the standard pattern basically has the same format as the input pattern, and the matching plane is as shown in FIG. 3. For example, in word recognition with a frame period of 10 ms, the word length is about 60 frames (600 ms), so the number of grid points is I·J = 3600, where J is the number of frames in the standard pattern. In the present invention, with N = 4 states for example (FIG. 2), the number of grid points is I·N = 240, which is far fewer, so the computation time can be greatly reduced.

The state transition weight W in equation (8) is a time-control term that prevents extreme time expansion or contraction in the correspondence between the input pattern and the standard pattern.

In FIG. 4, the optimal path to grid point C(i, j) is selected from two paths: optimal path I, which reaches grid point A(i−1, j), and optimal path II, which reaches grid point B(i−1, j−1).

Let i′ be the input frame number at which state j−1 is first reached on the optimal path to grid point B.
The time for which the input belongs to state j−1 is then i − i′ frames. Let the duration of state j−1 in the standard pattern be l_{j−1}. The state transition weight W is then expressed as

  W = a \{(i - i') - l_{j-1}\}^2   (10)

where a is a constant. With the weight of equation (10), W is small only when the difference between the belonging time and the duration is small (that is, when the input pattern and the standard pattern are closely aligned in time), which makes the state transition more likely to occur. Precise time control is therefore possible.

W can also be defined by equation (11), which restricts the belonging time of state j−1 to between 1/2 and 2 times the duration of state j−1 and thereby prevents inappropriate time alignment.

The durations l_j (1 ≤ j ≤ N) used in (10) and (11) can be obtained from a single training utterance (an utterance for creating the standard pattern). To create the standard pattern more precisely when several training utterances are available, their average may be used.

Effect
As is clear from the above description, the present invention enables precise speech recognition with accurate duration control, using a small amount of computation and a small memory for standard patterns.
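The matching procedure described above (city-block local distance, cumulative distance, and the duration-based transition weight) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. The function names are hypothetical. The step 2 recursion is the assumed form with predecessors A(i−1, j) and B(i−1, j−1). `window_weight` is one guess at equation (11), which the text describes only in words. Indices are 0-based.

```python
import math


def city_block(x, y):
    """City-block local distance d(i, j) between two feature vectors."""
    return sum(abs(xf - yf) for xf, yf in zip(x, y))


def quadratic_weight(dwell, duration, a=1.0):
    """Equation (10): W = a * ((i - i') - l_{j-1})**2."""
    return a * (dwell - duration) ** 2


def window_weight(dwell, duration):
    """ASSUMED form of equation (11): no cost inside [l/2, 2l], forbidden outside."""
    return 0.0 if duration / 2 <= dwell <= 2 * duration else math.inf


def match(frames, states, durations, weight=quadratic_weight):
    """Return D(I, N) for I input frames against an N-state standard pattern."""
    n_frames, n_states = len(frames), len(states)
    # D[i][j]: best cumulative distance reaching grid point (i, j).
    D = [[math.inf] * n_states for _ in range(n_frames)]
    # entry[i][j]: frame at which state j was entered on that best path (i').
    entry = [[0] * n_states for _ in range(n_frames)]
    D[0][0] = city_block(frames[0], states[0])  # step 1

    for i in range(1, n_frames):
        for j in range(n_states):
            stay = D[i - 1][j]  # path I: from A(i-1, j), no weight
            move = math.inf     # path II: from B(i-1, j-1), plus W
            if j > 0 and D[i - 1][j - 1] < math.inf:
                dwell = i - entry[i - 1][j - 1]  # i - i'
                move = D[i - 1][j - 1] + weight(dwell, durations[j - 1])
            if min(stay, move) == math.inf:
                continue  # grid point not reachable
            local = city_block(frames[i], states[j])
            if stay <= move:
                D[i][j], entry[i][j] = local + stay, entry[i - 1][j]
            else:
                D[i][j], entry[i][j] = local + move, i
    return D[-1][-1]  # step 3
```

As a toy single-channel example, take two states with feature values 0 and 10 and durations of 3 frames each. An input of three 0-frames followed by three 10-frames matches with distance 0. An input of five 0-frames followed by one 10-frame matches with distance 4.0 when a = 1, because the first state is held two frames longer than its stored duration. Because W depends on i′, the sketch stores the entry frame next to each cumulative distance. The source does not state how the original device tracked it.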

Brief Description of the Drawings
FIG. 1 is a block diagram for explaining an embodiment of the present invention. FIG. 2 is a diagram showing an example of the optimal state sequence in matching input speech against a standard pattern. FIG. 3 is a diagram showing the calculation plane in DP matching. FIG. 4 is a diagram showing an example of the optimal path.

1: microphone, 2: feature sequence conversion means, 3: speech section detection means, 4: matching unit, 5: standard patterns, 6: recognition result output unit.

Claims

(57) [Claims] A speech recognition method for a speech recognition device in which an input speech signal is converted into a time series of feature vectors using feature sequence conversion means, a standard pattern is expressed as a sequence of states having a fixed number of states, a feature vector representing the state and the duration of the state are stored for each state, and the time series of feature vectors of the input speech is matched against the standard pattern, characterized in that, when matching is performed, the weight of a state transition is set to an amount proportional to the square of the difference between the time for which the input speech belongs to each state of the standard pattern and the duration of each state of the standard pattern, and the weight is added to the cumulative distance used in matching.