JPH0642153B2

JPH0642153B2 - Voice recognizer

Info

Publication number: JPH0642153B2
Application number: JP1331727A
Authority: JP
Inventors: Hitoshi Iwamida (岩見田 均); Shigeru Katagiri (片桐 滋); Eric McDermott (エリック・マクダーモット)
Original assignee: ATR AUDITORY VISUAL PERCEPTION; EI TEI AARU SHICHOKAKU KIKO KENKYUSHO KK
Current assignee: ATR AUDITORY VISUAL PERCEPTION; EI TEI AARU SHICHOKAKU KIKO KENKYUSHO KK
Priority date: 1989-12-20
Filing date: 1989-12-20
Publication date: 1994-06-01
Anticipated expiration: 2009-06-01
Also published as: JPH03191400A

Description

[Technical Field] The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that uses a discrete hidden Markov model (hereinafter, HMM).

[Prior Art] FIG. 5 shows the principle of a conventional speech recognition apparatus using an HMM. Referring to FIG. 5, the codebook creating means 1 finds, from a large number of speech feature vectors, a set of representative vectors that best approximates them, and supplies these representative vectors to the encoding means 3. The encoding means 3 uses the supplied set of representative vectors as a codebook to encode speech feature vectors, and supplies the result to the HMM training means 4 and the HMM recognition means 5. The HMM training means 4 trains a discrete HMM, using as training data the code time series obtained by encoding a plurality of speech feature vector time series. The HMM recognition means 5 takes the code time series obtained by encoding a speech feature vector time series as recognition data, performs recognition with the HMM trained by the HMM training means 4, and outputs the recognition result.

[Problems to be Solved by the Invention] In the speech recognition apparatus shown in FIG. 5, training increases the probability that the correct category generates a given speech feature vector, but no training is performed to reduce the probability under the incorrect categories. As a result, high speech recognition performance cannot be obtained.

Accordingly, the main object of the present invention is to provide a speech recognition apparatus using a discrete HMM that can achieve high speech recognition performance.

[Means for Solving the Problems] FIG. 1 shows the principle of the present invention, which comprises: codebook creating means 1 for finding, from a large number of speech feature vectors, a set of a plurality of representative vectors that best approximates them; codebook learning means 2 for assigning a category name to each representative vector of the codebook and sequentially updating the representative vectors so that, when a plurality of speech feature vectors are encoded, the number of cases in which the category of the representative vector used matches the category of the speech feature vector increases; encoding means 3 for using the set of representative vectors as a codebook and outputting, as a code time series, the code numbers of the representative vectors nearest in Euclidean distance to the input speech feature vectors; training means 4 for training a discrete HMM by finding, for a plurality of code time series obtained by encoding a plurality of speech feature vector time series, the transition probabilities and output probabilities that maximize their generation probability; and recognition means 5 for receiving a code time series obtained by encoding a speech feature vector time series, combining the transition probabilities and output probabilities to calculate the probability of generating the input code time series, and finding and outputting the speech with the highest generation probability.

[Operation] The speech recognition apparatus according to the present invention finds, from a large number of speech feature vectors, a set of a plurality of representative vectors that best approximates them, and assigns a category name to each representative vector. It then sequentially updates the representative vectors so that, when a plurality of speech feature vectors are encoded, the number of cases in which the category of the representative vector used matches the category of the speech feature vector increases. Using the resulting set of representative vectors as a codebook, it encodes speech feature vectors, trains a discrete HMM with the resulting code time series as training data, and performs recognition with the trained HMM.

[Embodiment of the Invention] FIG. 2 is a schematic block diagram of an embodiment of the present invention. This embodiment describes phoneme recognition for 23 Japanese phonemes. The phoneme data 11 consists of 100 phoneme samples per phoneme category, and each phoneme sample is a time series of phoneme feature vectors. A phoneme feature vector is, for example, a 16-dimensional power spectrum. The K-means clustering means 12 clusters all phoneme feature vectors of the training phoneme samples, category by category, using the K-means clustering method, and obtains 10 representative vectors per phoneme category. The total of 230 representative vectors obtained over all phoneme categories forms the codebook 13.
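The per-category codebook construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`sq_dist`, `k_means`, `build_codebook`), the random initialization, and the fixed iteration count are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(vectors, k, iterations=20, seed=0):
    """Plain K-means; returns k centroids (the representative vectors)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct training vectors picked at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                dim = len(members[0])
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids

def build_codebook(vectors_by_category, per_category=10):
    """Cluster each category separately and label every centroid with it."""
    codebook = []  # list of (category, representative_vector) pairs
    for category, vectors in sorted(vectors_by_category.items()):
        for centroid in k_means(vectors, per_category):
            codebook.append((category, centroid))
    return codebook
```

With 23 phoneme categories and `per_category=10`, `build_codebook` would return the 230 labelled representative vectors of the embodiment.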

Each representative vector of the codebook 13 is assigned the name of its phoneme category. The LVQ learning means 14 uses learning vector quantization (hereinafter, LVQ) to sequentially update the representative vectors so that, when phoneme feature vectors are encoded, the number of cases in which the category of the representative vector used matches the category of the phoneme feature vector increases.

FIG. 3 is a flowchart of the LVQ2 learning algorithm executed by the LVQ learning means 14 shown in FIG. 2. In step SP1 (steps are abbreviated SP in the figure), the representative vector m_i with the smallest Euclidean distance to the phoneme feature vector x is found, together with the representative vector m_j that has the smallest Euclidean distance among the categories other than that of m_i. In step SP2, it is determined whether the representative vectors should be updated. The condition is that the category of x does not match the category of m_i, and that the category of m_j matches the category of x. In step SP3, the representative vectors m_i and m_j are updated only when the condition of step SP2 holds.

The update is performed by the following formulas:

\[ m_i' = m_i - a(t)\,(x - m_i), \qquad m_j' = m_j + a(t)\,(x - m_j) \]

Here m_i' and m_j' denote the updated representative vectors, and a(t) is a function that decreases monotonically with time (a(t) > 0).
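Steps SP1 to SP3 and the update rule can be sketched in code as below. This is an illustrative sketch only: the function name `lvq2_step`, the codebook layout (a list of category/vector pairs), and passing the current value of a(t) as the argument `a_t` are assumptions of the example, and the schedule by which a(t) decreases is left to the caller.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def lvq2_step(codebook, x, x_category, a_t):
    """Apply one update for feature vector x; return True if vectors moved.

    codebook: list of (category, vector) pairs whose vectors are mutable lists.
    a_t: current value of the positive, monotonically decreasing a(t).
    """
    # SP1: nearest representative vector m_i over the whole codebook ...
    i = min(range(len(codebook)), key=lambda n: sq_dist(x, codebook[n][1]))
    cat_i = codebook[i][0]
    # ... and nearest vector m_j among categories other than that of m_i.
    others = [n for n in range(len(codebook)) if codebook[n][0] != cat_i]
    if not others:
        return False
    j = min(others, key=lambda n: sq_dist(x, codebook[n][1]))
    # SP2: update only if m_i has the wrong category and m_j the right one.
    if cat_i == x_category or codebook[j][0] != x_category:
        return False
    # SP3: push m_i away from x and pull m_j toward x.
    m_i, m_j = codebook[i][1], codebook[j][1]
    for d in range(len(x)):
        m_i[d] -= a_t * (x[d] - m_i[d])
        m_j[d] += a_t * (x[d] - m_j[d])
    return True
```

A training loop would call `lvq2_step` once per training feature vector and repeat the pass several times while lowering `a_t`.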

The LVQ learning means 14 shown in FIG. 2 performs the above operation for all training phoneme feature vectors, and repeats the whole pass a suitable number of times. The encoding means 15 then encodes all phoneme samples (phoneme feature vector time series) to obtain the phoneme sample code time series. Encoding is performed by outputting, for each input feature vector, the code number of the representative vector nearest to it in Euclidean distance.
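The nearest-neighbour encoding step can be sketched as follows. The function name `encode` and the use of list indices as code numbers are assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def encode(sequence, codebook_vectors):
    """Map each feature vector to the index of its nearest representative vector."""
    return [
        min(range(len(codebook_vectors)),
            key=lambda n: sq_dist(x, codebook_vectors[n]))
        for x in sequence
    ]
```

Comparing squared distances gives the same nearest vector as comparing Euclidean distances, so the square root is omitted.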

The HMM training means 16 receives the code time series of the training phoneme samples and trains each phoneme model 17 using the Baum-Welch algorithm.

FIG. 4 shows the structure of a phoneme model. Referring to FIG. 4, s denotes a state, a denotes a transition probability, and b denotes an output probability. For example, in state s_1, the probability of remaining in s_1 is a_11 and the probability of transitioning to state s_2 is a_12. The probability of outputting code k, either when the model remains in state s_1 or when it moves from state s_1 to state s_2, is b_1,k, where the subscript 1 indicates that the output accompanies a transition leaving state s_1. For each phoneme, the HMM training means 16 finds the transition probabilities a and output probabilities b that maximize the generation probability of the input phoneme sample code time series, and outputs them as the phoneme model.

The HMM recognition means 18 receives the code time series of the phoneme sample to be recognized and, using the forward-pass algorithm, multiplies transition probabilities a and output probabilities b to calculate, for every phoneme model, the probability that the model generates the input code time series. It then finds the phoneme model with the highest generation probability and outputs it as the phoneme recognition result.
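The recognition step can be sketched with a forward pass in which the output probability depends on the state being left, as in the phoneme model described above. The names `forward_probability` and `recognize`, the convention that a model starts in its first state and must end in its last state, and the dictionary of models are assumptions of this sketch, since the topology of FIG. 4 is not reproduced in the text.

```python
def forward_probability(codes, a, b):
    """Probability that a model generates the code sequence `codes`.

    a[i][j]: transition probability from state i to state j.
    b[i][k]: probability of emitting code k on a transition leaving state i.
    Assumption: the model starts in state 0 and must end in the last state.
    """
    n = len(a)
    alpha = [1.0] + [0.0] * (n - 1)
    for k in codes:
        # Sum over every state i that can reach state j while emitting code k.
        alpha = [
            sum(alpha[i] * a[i][j] * b[i][k] for i in range(n))
            for j in range(n)
        ]
    return alpha[-1]

def recognize(codes, models):
    """Return the name of the model with the highest generation probability.

    models: dict mapping a phoneme name to its (a, b) pair.
    """
    return max(models, key=lambda name: forward_probability(codes, *models[name]))
```

For long sequences the products underflow, so a practical version would work with scaled or log probabilities; that refinement is omitted here for brevity.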

[Effects of the Invention] As described above, according to the present invention, in a speech recognition apparatus using a discrete HMM, a category name is assigned to each representative vector of the codebook, and the representative vectors are sequentially updated so that, when a plurality of speech feature vectors are encoded, the number of cases in which the category of the representative vector used matches the category of the speech feature vector increases. This yields a codebook that better reflects the category boundaries, and thereby improves speech recognition performance.

[Brief description of drawings]

FIG. 1 is a block diagram for explaining the principle of the present invention. FIG. 2 is a schematic block diagram of an embodiment of the present invention. FIG. 3 is a flowchart of the LVQ2 learning algorithm used in FIG. 2. FIG. 4 shows an HMM phoneme model. FIG. 5 is a diagram for explaining the principle of a conventional HMM speech recognition apparatus.

In the figures, 1 is the codebook creating means, 2 is the codebook learning means, 3 is the encoding means, 4 is the HMM training means, 5 is the HMM recognition means, 11 is the phoneme data, 12 is the K-means clustering means, 13 is the codebook, 14 is the LVQ learning means, 15 is the encoding means, 16 is the HMM training means, 17 is the phoneme model, and 18 is the HMM recognition means.

Continuation of the front page

(72) Inventor: Eric McDermott, c/o ATR Auditory and Visual Perception Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

(56) References: Proceedings of the Acoustical Society of Japan, October 1989, 1-1-20, pp. 39-40

Claims

[Claims]

1. A speech recognition apparatus comprising: codebook creating means for finding, from a large number of speech feature vectors, a set of a plurality of representative vectors that best approximates them; encoding means for using the set of representative vectors found by the codebook creating means as a codebook and outputting, as a code time series, the code numbers of the representative vectors nearest in Euclidean distance to the input speech feature vectors; training means for training a discrete hidden Markov model by finding, for a plurality of code time series obtained by encoding a plurality of speech feature vector time series with the encoding means, the transition probabilities and output probabilities that maximize their generation probability; and recognition means for receiving a code time series obtained by encoding a speech feature vector time series with the encoding means, combining the transition probabilities and output probabilities to calculate the probability of generating the input code time series, and finding and outputting the speech with the highest generation probability; the apparatus further comprising codebook learning means for assigning a category name to each representative vector of the codebook created by the codebook creating means, and sequentially updating the representative vectors so that, when a plurality of speech feature vectors are encoded, the number of cases in which the category of the representative vector used matches the category of the speech feature vector increases.