JPH0459638B2

Info

Publication number: JPH0459638B2
Application number: JP58132508A
Authority: JP
Inventors: Masanori Myatake
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1983-07-20
Filing date: 1983-07-20
Publication date: 1992-09-22
Also published as: JPS6024595A

Description

[Detailed Description of the Invention]

(a) Industrial Field of Application

The present invention relates to a speech recognition device capable of recognizing speech.

(b) Conventional Example

A conventional speech recognition device extracts in advance, for each of a plurality of predetermined utterances, a speech pattern A = (a(1), a(2), ...) consisting of a time series of feature parameters that indicate the characteristics of the speech signal. It then compares an unknown pattern X = (x(1), x(2), ...), consisting of the time series of feature parameters of an unknown utterance, with each stored time series A = (a(1), a(2), ...), and judges the unknown utterance to be the one whose time series A minimizes the distance between the two patterns:

$$D = |A - X| = \sum_{i} |a(i) - x(i)|$$
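The conventional minimum-distance rule can be illustrated with a short sketch. The function names, the dictionary layout, and the sample values are assumptions made for illustration only; the source specifies just the distance D and the minimum-distance decision.

```python
def l1_distance(a, x):
    """Distance D = sum_i |a(i) - x(i)| between two equal-length series."""
    if len(a) != len(x):
        raise ValueError("patterns must have the same length")
    return sum(abs(ai - xi) for ai, xi in zip(a, x))


def recognize_by_distance(templates, x):
    """Return the label of the stored pattern closest to x (hypothetical helper)."""
    return min(templates, key=lambda label: l1_distance(templates[label], x))


# Illustrative values only, not taken from the source.
templates = {
    "A": [1.0, 2.0, 3.0],
    "B": [3.0, 1.0, 0.0],
}
unknown = [1.2, 1.9, 2.5]
print(recognize_by_distance(templates, unknown))  # prints "A"
```

Because every parameter contributes its full deviation to D, a single strongly varying parameter can dominate the decision, which is the weakness described in the next paragraph.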

However, the feature parameters used for this purpose are a time series of speech spectrum values, a time series of autocorrelation coefficients, or the like, and these parameters are liable to vary to some degree with the speaker's manner of utterance. As a result, even for the same utterance, a large distance can arise between the pre-registered pattern and the pattern of the unknown input speech, which leads to erroneous recognition.

(c) Object of the Invention

The present invention provides a speech recognition device that reduces the occurrence of erroneous recognition, that is, one that improves the recognition rate.

(d) Structure of the Invention

The speech recognition device of the present invention applies statistical processing to the time series of feature parameters of an unknown input speech, using a pre-stored time series of average value parameters, which indicate the average values of the feature parameters of each registered utterance, together with the time series of their standard deviations. Each feature parameter of the input speech is converted into a binary signal, "1" or "0", indicating whether its difference from the average value parameter is larger or smaller than a constant multiple of the standard deviation. The degree of similarity is obtained from the resulting time series of binary signals, and the registered utterance with the highest degree of similarity is judged to be the input speech.

(e) Embodiment

FIG. 1 shows an embodiment of the speech recognition device of the present invention. In the figure, 1 is a microphone that converts speech into an electrical speech signal, and 2 is a parameter extraction circuit that extracts, from the speech signal obtained from the microphone 1, frequency spectrum values as the feature parameters indicating the characteristics of the speech. For example, an 8-channel bandpass filter is used, and a speech pattern is obtained as a time series of eight samples for each of the frequency spectrum values f_1, f_2, f_3, ..., f_8 that result from dividing the speech band (100 to 4000 Hz) into eight. That is, with n as the filter number and t as the sample number, the feature parameter is written f_n(t), and the speech pattern F becomes

$$F = \begin{pmatrix} f_1(1) & f_1(2) & \cdots & f_1(8) \\ f_2(1) & f_2(2) & \cdots & f_2(8) \\ \vdots & \vdots & & \vdots \\ f_8(1) & f_8(2) & \cdots & f_8(8) \end{pmatrix}$$

3 is a mode selection switch that switches between the registration mode and the recognition mode. Connecting it to the Q side selects the registration mode, and connecting it to the P side selects the recognition mode.

4 is a statistical processing circuit that receives the speech patterns from the parameter extraction circuit 2 in the registration mode, when the mode selection switch 3 is connected to the Q side. Based on a plurality of speech patterns obtained by inputting the same utterance several times in succession, and on the assumption that each feature parameter f_n(t) follows a normal distribution as shown in FIG. 2, it calculates the average value pattern consisting of the average value parameters \bar{f}_n(t),

$$\bar{F} = \begin{pmatrix} \bar{f}_1(1) & \bar{f}_1(2) & \cdots & \bar{f}_1(8) \\ \bar{f}_2(1) & \bar{f}_2(2) & \cdots & \bar{f}_2(8) \\ \vdots & \vdots & & \vdots \\ \bar{f}_8(1) & \bar{f}_8(2) & \cdots & \bar{f}_8(8) \end{pmatrix}$$

and the standard deviation pattern consisting of the standard deviations \Delta f_n(t),

$$\Delta F = \begin{pmatrix} \Delta f_1(1) & \Delta f_1(2) & \cdots & \Delta f_1(8) \\ \Delta f_2(1) & \Delta f_2(2) & \cdots & \Delta f_2(8) \\ \vdots & \vdots & & \vdots \\ \Delta f_8(1) & \Delta f_8(2) & \cdots & \Delta f_8(8) \end{pmatrix}$$

5 is a memory circuit. For a plurality of different registered utterances, for example A, B, C, D, E, it consists of an average value pattern memory section 51, which stores the average value patterns \bar{A}, \bar{B}, \bar{C}, \bar{D}, \bar{E} obtained from the statistical processing circuit 4, and a standard deviation memory section 52, which stores the corresponding standard deviation patterns \Delta A, \Delta B, \Delta C, \Delta D, \Delta E.

6 is a buffer memory that, in the recognition mode with the mode selection switch 3 connected to the P side, temporarily stores the speech pattern X obtained from the parameter extraction circuit 2 for the unknown input speech X.

7 is a comparison means consisting of the following three parts:

- a subtractor 71, which subtracts from each parameter x_n(t) of the input speech pattern X in the buffer memory 6 the corresponding parameters \bar{a}_n(t), \bar{b}_n(t), ... of the average value patterns \bar{A}, \bar{B}, ... in the average value pattern memory section 51 of the memory circuit 5;
- a multiplier 72, which multiplies the standard deviations \Delta a_n(t), \Delta b_n(t), ... of the standard deviation patterns \Delta A, \Delta B, ... in the standard deviation memory section 52 of the memory circuit 5 by a constant K, for example 1 or 2;
- a comparator 73, which compares the subtracted values x_n(t) - \bar{a}_n(t), x_n(t) - \bar{b}_n(t), ... from the subtractor 71 with the multiplied values K \Delta a_n(t), K \Delta b_n(t), ... from the multiplier 72, and outputs "1" when |x_n(t) - \bar{a}_n(t)| \le K \Delta a_n(t) and "0" when |x_n(t) - \bar{a}_n(t)| > K \Delta a_n(t).

That is, when K = 1 and, for example, |x_n(t) - \bar{a}_n(t)| \le K \Delta a_n(t) holds for the registered utterance A, x_n(t) is regarded as similar to \bar{a}_n(t) with a probability of 68.3%, based on the normal distribution shown in FIG. 2, and "1" is assigned. In the opposite case "0" is assigned. The unknown pattern

$$X = \begin{pmatrix} x_1(1) & x_1(2) & \cdots & x_1(8) \\ x_2(1) & x_2(2) & \cdots & x_2(8) \\ \vdots & \vdots & & \vdots \\ x_8(1) & x_8(2) & \cdots & x_8(8) \end{pmatrix}$$

is thus converted, for each registered utterance, into a matrix pattern of binary signals \delta taking the values "1" and "0":

$$\Delta = \begin{pmatrix} \delta_{11} & \delta_{12} & \cdots & \delta_{18} \\ \delta_{21} & \delta_{22} & \cdots & \delta_{28} \\ \vdots & \vdots & & \vdots \\ \delta_{81} & \delta_{82} & \cdots & \delta_{88} \end{pmatrix}$$
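The registration statistics and the thresholding performed by the comparison means 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `register` and `binarize` are hypothetical, a small 2 x 2 pattern stands in for the 8 x 8 pattern of the embodiment, and the population standard deviation is used because the source does not say which estimator the statistical processing circuit 4 applies.

```python
import statistics


def register(utterances):
    """Return (average value pattern, standard deviation pattern) for one word.

    utterances: repeated patterns of the same word, each a list of rows
    (filter number n) of samples (sample number t), all of equal shape.
    """
    rows, cols = len(utterances[0]), len(utterances[0][0])
    mean = [[statistics.fmean([u[n][t] for u in utterances]) for t in range(cols)]
            for n in range(rows)]
    # Assumption: population standard deviation (pstdev), not the sample one.
    std = [[statistics.pstdev([u[n][t] for u in utterances]) for t in range(cols)]
           for n in range(rows)]
    return mean, std


def binarize(x, mean, std, k=1.0):
    """Binary matrix: 1 where |x - mean| <= k * std, otherwise 0."""
    return [[1 if abs(xv - m) <= k * s else 0
             for xv, m, s in zip(x_row, m_row, s_row)]
            for x_row, m_row, s_row in zip(x, mean, std)]


# Three repetitions of one word (illustrative values only).
mean_a, std_a = register([
    [[1.0, 4.0], [10.0, 0.0]],
    [[2.0, 4.0], [12.0, 0.0]],
    [[3.0, 4.0], [14.0, 3.0]],
])
x = [[2.5, 4.0], [15.0, 5.0]]
print(binarize(x, mean_a, std_a, k=1.0))  # prints [[1, 1], [0, 0]]
print(binarize(x, mean_a, std_a, k=2.0))  # prints [[1, 1], [1, 0]]
```

Raising K widens the tolerance band around each average value, so more parameters of the unknown pattern are accepted as similar.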

8 is a recognition processing circuit that, based on the matrix pattern \Delta of binary signals obtained from the comparison means 7, calculates the sum $\sum_{i} \sum_{j} \delta_{ij}$ of its 16 constituent elements, that is, the number of "1"s it contains, as the degree of similarity. If the degrees of similarity for the registered utterances A, B, C, D, E are, for example, 11, 2, 8, 7, 3, the input speech at that time is judged to have been A.
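The counting and selection step of the recognition processing circuit 8 can be sketched as below. The function names are hypothetical, and the tie-breaking behaviour (the first maximum wins) is an assumption, since the source does not address ties.

```python
def similarity(binary_matrix):
    """Degree of similarity: the number of 1 entries in the binary matrix."""
    return sum(sum(row) for row in binary_matrix)


def pick_best(similarities):
    """Return the registered word with the largest similarity.

    Assumption: on a tie, the word listed first is returned.
    """
    return max(similarities, key=similarities.get)


# Example similarities from the text for the registered words A to E.
scores = {"A": 11, "B": 2, "C": 8, "D": 7, "E": 3}
print(pick_best(scores))  # prints "A"
```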

Thus, in the registration mode, with the mode selection switch 3 connected to Q, each of a plurality of predetermined utterances is spoken several times, for example three times each, and the average value patterns \bar{A}, \bar{B}, ... and the standard deviation patterns \Delta A, \Delta B, ... of the utterances are stored in the memory circuit 5. Then, in the recognition mode, with the mode selection switch 3 connected to P, an unknown utterance is input, and its speech pattern X is converted by the comparison means 7, using the average value patterns \bar{A}, \bar{B}, ... and the standard deviation patterns \Delta A, \Delta B, ..., into a binary signal pattern from which the variation that arises when the speech is uttered has been removed. If the value of K, which determines the permissible variation of the speech, is set to about 0.5 to 2, the binary signal pattern best indicates the similarity between the unknown speech pattern and the registered speech pattern, and highly reliable pattern recognition can be performed on that basis.
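To illustrate how K sets the permissible variation, the sketch below computes, under the normal-distribution assumption of FIG. 2, the probability that a parameter falls within K standard deviations of its average value. The function name `coverage` is hypothetical; the formula erf(K / sqrt(2)) is the standard result for a normal distribution.

```python
import math


def coverage(k):
    """P(|x - mean| <= k * std) for a normally distributed x."""
    return math.erf(k / math.sqrt(2.0))


for k in (0.5, 1.0, 2.0):
    print(f"K = {k}: {coverage(k):.3f}")
```

For K = 1 this gives about 68.3%, the figure quoted in the embodiment; over the range 0.5 to 2 the value runs from about 38.3% to about 95.4%.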

In the description above, a speech pattern consisting of a time series of frequency spectrum values was used as the time series of feature parameters, but time series of various other feature parameters can also be used, such as PARCOR (partial autocorrelation) coefficients.
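As one illustration of an alternative feature parameter, the sketch below computes normalized autocorrelation coefficients for a single frame of samples. It is an assumption-laden example: the function name, the framing, and the normalization by the lag-0 energy are illustrative choices, and deriving PARCOR coefficients would require a further recursion (for instance Levinson-Durbin) that is not shown here.

```python
def autocorrelation(frame, max_lag):
    """Normalized autocorrelation r(0), ..., r(max_lag) of one frame."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        # A silent frame carries no correlation information.
        return [0.0] * (max_lag + 1)
    return [
        sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag)) / energy
        for lag in range(max_lag + 1)
    ]


# Illustrative frame only, not taken from the source.
frame = [1.0, -1.0, 1.0, -1.0]
print(autocorrelation(frame, 2))  # prints [1.0, -0.75, 0.5]
```

A sequence of such coefficient vectors, one per frame, would then take the place of the spectrum values f_n(t) in the registration and comparison steps.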

(f) Effects of the Invention

As is clear from the above description, the speech recognition device of the present invention applies statistical processing to the time series of feature parameters of an unknown input speech, using a pre-stored time series of average value parameters, which indicate the average values of the feature parameters of each registered utterance, together with the time series of their standard deviations. It converts the input into a time series of binary signals "1" and "0", obtains the degree of similarity, and judges the input to be the registered utterance for which this similarity is greatest. The variation of each feature parameter caused by the manner of utterance can therefore be removed, an optimal degree of similarity can be derived, and a large improvement in the recognition rate can be expected.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the speech recognition device of the present invention, and FIG. 2 is a normal distribution diagram. 1 is a microphone, 2 is a parameter extraction circuit, 4 is a statistical processing circuit, 5 is a memory circuit, 7 is a comparison means, and 8 is a recognition processing circuit.

Claims

[Claims]

1. Audio input means for converting speech into an electrical signal;
parameter extraction means for extracting, from the electrical signal of the input speech, feature parameters representing the characteristics of the speech;
storage means for storing a time series of average value parameters and the corresponding time series of standard deviations;
comparison means which, in response to the time series of feature parameters of an unknown input speech obtained from the parameter extraction means, reads out from the storage means the time series of average value parameters and the time series of standard deviations for each registered utterance, compares the error value between each feature parameter of the input speech and the average value parameter with the standard deviation multiplied by a constant, and, based on the result of this comparison, outputs a binary signal "1" or "0" indicating whether the error value is larger or smaller than the constant-multiplied standard deviation, thereby converting the time series of feature parameters of the unknown input speech into a time series of binary signals;
and recognition processing means which calculates the degree of similarity of each registered utterance to the unknown speech and judges the registered utterance with the highest degree of similarity to be the currently input unknown speech.