JPH0752356B2 - Speaker adaptation method

Info

Publication number: JPH0752356B2
Application number: JP3216983A
Authority: JP
Inventors: 浩明服部; 茂樹嵯峨山
Original assignee: 株式会社エイ・ティ・アール自動翻訳電話研究所
Priority date: 1991-08-28
Filing date: 1991-08-28
Publication date: 1995-06-05
Anticipated expiration: 2010-06-05
Also published as: JPH0553599A

Description

[Detailed description of the invention]

[0001]

[Field of Industrial Application] The present invention relates to a speaker adaptation method. More particularly, it relates to a speaker adaptation method for speech recognition that uses training samples uttered by an unknown speaker to find the correspondence between the feature vectors of a standard speaker and those of the unknown speaker, and performs adaptation on the basis of that correspondence.

[0002]

[Prior Art] Conventionally, in speaker adaptation methods that time-align the speech of a standard speaker with the speech of an unknown speaker by DP matching and then find the correspondence between the feature vectors of the standard speaker and those of the unknown speaker, adaptation is performed using the correspondence obtained from the training data as is.

[0003]

[Problem to Be Solved by the Invention] However, when little training data is available, the correspondence obtained from it varies with the data, and its reliability is low. That is, with little training data, the correspondence is biased toward that small sample and depends on it.

[0004] Therefore, the main object of the present invention is to provide a speaker adaptation method that obtains a more accurate correspondence by performing smoothing that takes the continuity of the speaker space into account, so as to absorb the variation contained in the correspondence obtained from training.

[0005]

[Means for Solving the Problem] The invention according to claim 1 stores the vector-quantized speech of a standard speaker as a codebook; in response to input of feature vectors of an unknown speaker, associates the feature vectors of the unknown speaker with the feature vectors of the standard speaker using the codebook; updates the codebook by averaging the feature vectors of the input speaker associated with each feature vector of the standard speaker; and, if the association contains errors, smooths the feature vectors on the basis of a fuzzy function and stores them in the codebook.

[0006]

[0007]

[0008]

[0009]

[Operation] The speaker adaptation method according to the present invention smooths the correspondence obtained from the training samples, so that an accurate correspondence is obtained, and highly accurate speaker adaptation is achieved, even when only a small amount of training data is available.

[0010]

[Embodiment] FIG. 1 is a schematic block diagram of a speech recognition apparatus to which the present invention is applied. In FIG. 1, the speech recognition apparatus comprises an amplifier 1, a low-pass filter 2, an A/D converter 3, and a processing device 4. The amplifier 1 amplifies the input speech signal, and the low-pass filter 2 removes aliasing noise from the amplified speech signal. The A/D converter 3 converts the speech signal into a 16-bit digital signal at a sampling frequency of 12 kHz. The processing device 4 includes a computer 5, a magnetic disk 6, terminals 7, and a printer 8. Based on the digital speech signal input from the A/D converter 3, the computer 5 performs adaptation between speaker feature vectors using the method shown in FIG. 2, described later.

[0011] FIG. 2 is a flowchart for explaining the operation of one embodiment of the present invention. The standard speaker's codebook C^R is stored on the magnetic disk 6 shown in FIG. 1 and serves as the initial value of the transformed codebook C^T. When the unknown speaker's input vector sequence is input in step SP1 (steps are abbreviated SP in the figure), the input vector sequence of the unknown speaker and the code sequence of the standard speaker are aligned in step SP2 by DTW (dynamic time warping) using the transformed codebook C^T.
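The alignment in step SP2 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `dtw_align`, the Euclidean frame distance, and the simple symmetric step pattern are all assumptions, since the text states only that DTW is applied between the input vector sequence and the standard speaker's code sequence using C^T.

```python
import math

def dtw_align(inputs, code_seq, codebook):
    """Align input frames to a reference code sequence with basic DTW.

    inputs:   list of feature vectors from the unknown speaker
    code_seq: list of codebook indices for the standard speaker's utterance
    codebook: list of code vectors (playing the role of C^T)
    Returns a list of (input_index, code_index) pairs along the best path.
    """
    n, m = len(inputs), len(code_seq)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning inputs[:i] with code_seq[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(inputs[i - 1], codebook[code_seq[j - 1]])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, code_seq[j - 1]))
        best = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
        if best == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif best == cost[i - 1][j]:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path
```

Each returned pair links an input frame to the index of the code vector it was aligned with, which is the information the averaging in step SP3 needs.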

[0012] In step SP3, the transformed codebook C^T is updated from the alignment obtained in step SP2. That is, the transformed codebook C^T is obtained by averaging the input speaker's vectors associated with each code vector of the standard speaker. In doing so, the difference vectors of vectors that received no alignment are estimated using the sub-vectors of the code vectors whose alignment has already been found. More specifically, let M be the set of input vectors associated with the m-th vector C^T_m; the difference vector V_m between the mean of the input vectors x belonging to M and the vector C^R_m is obtained by Equation 1 below.

[0013]

[Equation 1]
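The difference-vector computation can be illustrated with a short sketch. It assumes Equation 1 has the form V_m = (1/N_m) Σ_{x∈M} x − C^R_m, where N_m is taken to be the number of input vectors in M; that reading of N_m is inferred from the later condition N_n = 0 and is not spelled out in the text. The function name and the `(input_index, code_index)` pair format are hypothetical.

```python
def difference_vectors(codebook_ref, inputs, pairs):
    """Compute per-code-vector difference vectors from an alignment.

    codebook_ref: list of standard-speaker code vectors (C^R)
    inputs:       list of unknown-speaker feature vectors
    pairs:        list of (input_index, code_index) alignment pairs
    Returns (diffs, counts): diffs[m] is mean(aligned inputs) - C^R_m,
    or None when no input was aligned to code vector m (counts[m] == 0).
    """
    dim = len(codebook_ref[0])
    sums = [[0.0] * dim for _ in codebook_ref]
    counts = [0] * len(codebook_ref)
    for i, m in pairs:
        counts[m] += 1
        for d in range(dim):
            sums[m][d] += inputs[i][d]

    diffs = []
    for m, code in enumerate(codebook_ref):
        if counts[m] == 0:
            diffs.append(None)  # left for the later estimation step
        else:
            diffs.append([sums[m][d] / counts[m] - code[d] for d in range(dim)])
    return diffs, counts
```

Code vectors with a count of zero are left as `None` here so that a later step can estimate them from their neighbours.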

[0014] For each vector C^R_n with N_n = 0, the fuzzy membership function μ_{n,k} with respect to the code vectors C^R_k with N_k > 0 is obtained. The difference vector V_n of the vector C^R_n is then calculated from the difference vectors V_k of the vectors C^R_k and μ_{n,k} by Equation 2 below.

[0015]

[Equation 2]
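The estimation for unaligned code vectors might look like the sketch below. Neither the form of μ_{n,k} nor Equation 2 survives in the text, so two assumptions are made and should be read as such: the membership is computed in the style of fuzzy c-means from Euclidean distances between code vectors, and V_n is the membership-weighted average of the known V_k. All function and parameter names are hypothetical.

```python
import math

def fuzzy_memberships(target, neighbours, fuzziness=2.0):
    """Fuzzy c-means style membership of `target` in each neighbour (sums to 1).

    ASSUMPTION: the patent text does not give the membership formula.
    fuzziness must be greater than 1.
    """
    dists = [math.dist(target, v) for v in neighbours]
    zero = [d == 0.0 for d in dists]
    if any(zero):  # target coincides with one or more neighbours
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    p = 2.0 / (fuzziness - 1.0)
    dmin = min(dists)
    # Scale by the smallest distance so large exponents cannot overflow.
    weights = [(dmin / d) ** p for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def interpolate_missing(codebook_ref, diffs, fuzziness=2.0):
    """Fill in diffs[n] == None as a membership-weighted mean of known diffs."""
    known = [k for k, v in enumerate(diffs) if v is not None]
    if not known:
        raise ValueError("at least one aligned code vector is required")
    dim = len(codebook_ref[0])
    filled = []
    for n, v in enumerate(diffs):
        if v is not None:
            filled.append(list(v))
            continue
        mu = fuzzy_memberships(codebook_ref[n],
                               [codebook_ref[k] for k in known], fuzziness)
        filled.append([sum(m * diffs[k][d] for m, k in zip(mu, known))
                       for d in range(dim)])
    return filled
```

Because the memberships are normalized to sum to 1, the weighted sum is already an average and needs no further division.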

[0016] All vectors C^T_n in the transformed codebook C^T are updated using the difference vectors V according to Equation 3 below.

[0017]

[Equation 3]
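As a small illustration of the update, the sketch below assumes Equation 3 simply adds each difference vector to the corresponding standard-speaker code vector, that is, C^T_n = C^R_n + V_n. This form is inferred from the definition of V as "mean of aligned inputs minus C^R" and is not confirmed by the surviving text; the function name is hypothetical.

```python
def update_codebook(codebook_ref, diffs):
    """Return the transformed codebook, assuming C^T_n = C^R_n + V_n.

    codebook_ref: list of standard-speaker code vectors (C^R)
    diffs:        list of difference vectors V_n, one per code vector
    """
    return [[c + v for c, v in zip(code, diff)]
            for code, diff in zip(codebook_ref, diffs)]
```

Under this assumption, a code vector that had aligned inputs is moved exactly onto the mean of those inputs.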

[0018] The correspondence obtained as described above is derived from a small number of words within the DTW framework; it does not necessarily represent the true correspondence between the speaker spaces, and it contains errors. On the other hand, given the continuity of a speaker space, it is natural to assume that the correspondence between different speakers' spaces is also continuous. Therefore, in step SP6, smoothing based on the fuzzy membership function is performed as follows, in order to absorb the errors contained in the DTW alignment and obtain the true correspondence. That is, in step SP5, for each vector C^R_n, the fuzzy membership function μ_{n,k} (k ≠ n) is obtained for all vectors C^R_k with N_k > 0. Next, the difference vector of the code vector C^R_n is calculated by Equation 4 below.

[0019]

[Equation 4]

[0020] Here, N_k is regarded as the reliability of the alignment and is used as a weight on the difference vector. α is a constant that brings the contribution of N_k to the same order as that of μ_{n,k}, and β is a predetermined reliability of V_n. The entire transformed codebook is updated using the smoothed difference vectors V'_n.
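One way the smoothing could be realized is sketched below. Equation 4 itself is not available, so the formula used here is only a guess that is consistent with the roles the text gives to N_k, α, and β: V'_n = (β V_n + Σ_{k≠n} μ_{n,k} α N_k V_k) / (β + Σ_{k≠n} μ_{n,k} α N_k). The fuzzy c-means style membership and all function names are likewise assumptions.

```python
import math

def memberships(dists, fuzziness):
    """Fuzzy c-means style memberships from distances (ASSUMED form); sums to 1."""
    zero = [d == 0.0 for d in dists]
    if any(zero):
        share = 1.0 / sum(zero)
        return [share if z else 0.0 for z in zero]
    p = 2.0 / (fuzziness - 1.0)
    dmin = min(dists)
    weights = [(dmin / d) ** p for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_differences(codebook_ref, diffs, counts,
                       fuzziness=2.0, alpha=1.0, beta=1.0):
    """Smooth every difference vector using its aligned neighbours.

    diffs must contain a vector for every code vector (no None entries);
    counts[k] is N_k, the number of inputs aligned to code vector k.
    beta must be positive so the denominator is never zero.
    """
    size, dim = len(codebook_ref), len(codebook_ref[0])
    smoothed = []
    for n in range(size):
        # Only neighbours that actually received aligned inputs contribute.
        others = [k for k in range(size) if k != n and counts[k] > 0]
        num = [beta * v for v in diffs[n]]
        den = beta
        if others:
            dists = [math.dist(codebook_ref[n], codebook_ref[k]) for k in others]
            for mu, k in zip(memberships(dists, fuzziness), others):
                weight = mu * alpha * counts[k]  # membership times reliability
                den += weight
                for d in range(dim):
                    num[d] += weight * diffs[k][d]
        smoothed.append([x / den for x in num])
    return smoothed
```

With this form, a larger β keeps V'_n closer to the original V_n, while larger N_k or α pulls it toward well-supported neighbours.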

[0021] By varying the fuzziness during smoothing, the extent of the space over which continuity is considered can be controlled. That is, the closer the fuzziness is to 1, the more local the space considered, and the closer it is to ∞, the more global. Therefore, more accurate speaker adaptation can be achieved by using a fuzziness close to 1 when there is a large amount of training data and the alignment is sufficiently reliable, and a large fuzziness when there is only a small amount of training data and the alignment is unreliable.
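The effect of the fuzziness can be seen numerically in a small sketch. It assumes a fuzzy c-means style membership, which the text does not specify but which matches the stated range of 1 to ∞; the function name `fcm_memberships` and the example distances are hypothetical.

```python
def fcm_memberships(dists, fuzziness):
    """Memberships from positive distances, fuzzy c-means style (ASSUMED form)."""
    p = 2.0 / (fuzziness - 1.0)
    dmin = min(dists)
    # Scale by the smallest distance so large exponents cannot overflow.
    weights = [(dmin / d) ** p for d in dists]
    total = sum(weights)
    return [w / total for w in weights]

# Two neighbouring code vectors, one at distance 1 and one at distance 2.
local_view = fcm_memberships([1.0, 2.0], fuzziness=1.1)
global_view = fcm_memberships([1.0, 2.0], fuzziness=50.0)

print("fuzziness 1.1:", local_view)   # almost all weight on the nearest neighbour
print("fuzziness 50 :", global_view)  # weight spread nearly evenly
```

With fuzziness near 1 the nearest neighbour dominates, so smoothing stays local; with a large fuzziness the memberships flatten out and distant code vectors contribute almost as much as near ones.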

[0022]

[Effects of the Invention] As described above, according to the present invention, smoothing that takes the continuity of the speaker space into account absorbs the variation contained in the correspondence obtained from training and yields a more accurate correspondence, so that highly accurate speaker adaptation can be achieved even when only a small amount of training data is available.

[Brief description of the drawing]

[FIG. 1] A schematic block diagram of one embodiment of the present invention.

[FIG. 2] A flowchart for explaining the specific operation of one embodiment of the present invention.

[Description of symbols]

1 amplifier; 2 low-pass filter; 3 A/D converter; 4 processing device; 5 computer; 6 magnetic disk; 7 terminals; 8 printer

Claims

[Claims]

[Claim 1] A speaker adaptation method characterized in that: the speech of a standard speaker is vector-quantized and stored as a codebook; in response to input of feature vectors of an unknown speaker, the feature vectors of the unknown speaker are associated with the feature vectors of the standard speaker using the codebook; the codebook is updated by averaging the feature vectors of the input speaker associated with each feature vector of the standard speaker; and, if the association contains errors, the feature vectors are smoothed on the basis of a fuzzy function and stored in the codebook.