JP2954591B2

JP2954591B2 - Speech feature extraction method

Info

Publication number: JP2954591B2
Application number: JP63183905A
Authority: JP
Inventors: 憲治坂本; 耕市山口
Original assignee: Consejo Superior de Investigaciones Cientificas CSIC
Current assignee: Consejo Superior de Investigaciones Cientificas CSIC
Priority date: 1988-07-21
Filing date: 1988-07-21
Publication date: 1999-09-27
Anticipated expiration: 2014-09-27
Also published as: JPH0232400A

Description

DETAILED DESCRIPTION OF THE INVENTION <Industrial Application Field> The present invention relates to a speech feature extraction method that enables stable speech recognition with a high recognition rate, regardless of the speaker, by extracting feature amounts related to human articulation from the input speech.

<Conventional Technology> Conventionally, a speech recognition device recognizes input speech as follows. A speech feature extraction device passes the input speech through several BPFs (band-pass filters) and extracts their output values as the feature amounts of the input speech. The feature amounts of the words to be recognized are extracted in advance in this way, and several of them are stored as standard patterns in time-series form. When unknown speech is input, its feature amounts are extracted in the same way, the similarity between the feature amounts of the unknown word and those of each standard pattern is examined, and the word is identified on the basis of this similarity.

The similarity is expressed as the distance between the time-series pattern of the feature amounts of the unknown word and the standard pattern. As the distance measure, the difference between the logarithms of the output values of each BPF, or the square of that difference, summed (averaged) over all bands is often used. Cepstrum coefficients, or the ZCC (zero-crossing count) or ZCI (zero-crossing interval) of the output of each BPF, are also sometimes used as the feature amounts.
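The conventional distance measure described above can be sketched as follows. This is a minimal illustration that assumes the squared variant, strictly positive BPF outputs, and natural logarithms; the function name `log_spectral_distance` is hypothetical and does not appear in the patent.

```python
import math


def log_spectral_distance(frame_a, frame_b):
    """Mean over all bands of the squared difference of log BPF outputs.

    Assumption: every BPF output is strictly positive, so the logarithm
    is defined. The base of the logarithm is not stated in the text;
    the natural logarithm is used here.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of channels")
    total = 0.0
    for a, b in zip(frame_a, frame_b):
        diff = math.log(a) - math.log(b)
        total += diff * diff
    return total / len(frame_a)
```

Because the measure compares band outputs directly, a speaker-dependent shift of spectral energy between bands changes the distance even when the same word is spoken, which is the weakness the following section discusses.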

<Problems to be Solved by the Invention> However, the conventional speech recognition device described above uses feature amounts directly related to the spectrum of the input speech, and obtains the similarity (distance measure) between a standard pattern and the feature pattern of the input speech from, for example, the differences between the output values of each BPF. Because of physiological differences between speakers (such as differences in vocal tract length), the spectrum of the speech produced often differs even when the same word is uttered. As a result, words may be misrecognized or rejected, and high recognition performance cannot be obtained.

This happens because the output values of each BPF are used directly as the feature amounts, so they are affected by speaker-dependent spectral variation, which makes it difficult to extract the essential information needed to identify each phoneme. To identify phonemes effectively, it is therefore important to convert the output of each BPF into feature amounts related to human articulation.

Accordingly, an object of the present invention is to provide a speech feature extraction method that makes it possible to recognize speech stably and correctly without being affected by speaker-dependent spectral variation.

<Means for Solving the Problems> To achieve the above object, the invention according to claim 1 is a speech feature extraction method that performs frequency analysis of input speech and extracts speech feature amounts from the resulting spectral components. The method comprises the following steps, each based on the spectral components obtained by the frequency analysis: a first step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses frication; a second step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses voicing; a third step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses the flatness of the spectrum caused by nasalization; a fourth step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses buzz; a fifth step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses the flatness of the spectrum in the middle frequency band; and a sixth step of obtaining a discriminative feature amount representing the degree of the discriminative feature that expresses the dispersion of the spectrum. The method thereby extracts speech feature amounts related to human articulation.

<Operation> Speech is input and subjected to frequency analysis. Then, based on the spectral components obtained by the frequency analysis, the first step obtains the discriminative feature amount representing the degree of the discriminative feature that expresses frication, using a procedure specific to that discriminative feature. In the same manner, the second step obtains the discriminative feature amount representing the degree of voicing, the third step obtains the one representing the flatness of the spectrum caused by nasalization, the fourth step obtains the one representing buzz, the fifth step obtains the one representing the flatness of the spectrum in the middle frequency band, and the sixth step obtains the one representing the dispersion of the spectrum. In this way, the six kinds of feature amounts related to human articulation, namely the discriminative feature amounts, are extracted.

<Example> The present invention is described in detail below with reference to the illustrated embodiment.

FIG. 1 is a block diagram of a speech recognition apparatus according to the present invention. The speech signal input from microphone 1 is amplified by amplifier 2 and input to feature extraction unit 3. Feature extraction unit 3 consists of a bank of BPFs with a plurality of channels (hereinafter "ch."); this embodiment uses 16 ch. The center frequency of the BPF is 300 Hz for ch. 1 and 3400 Hz for ch. 16, and the center frequencies in between are set at equal intervals on the mel scale. FIG. 2 outlines the frequency characteristics of each BPF.
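As an illustration of the channel layout, the sketch below places 16 center frequencies between 300 Hz and 3400 Hz at equal intervals on the mel scale. The patent does not state which mel formula was used; the common conversion mel = 2595 log10(1 + f/700) is an assumption here, and the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Assumed mel formula; the patent text does not specify one.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(f_low=300.0, f_high=3400.0, n_channels=16):
    """Center frequencies in Hz, equally spaced on the mel scale."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_channels - 1)
    return [mel_to_hz(m_low + k * step) for k in range(n_channels)]


centers = mel_spaced_centers()
```

With this spacing the channels are packed more densely at low frequencies and spread out toward 3400 Hz.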

For the speech signal input to feature extraction unit 3 configured as described above, each BPF calculates the power as a feature amount for every unit time (frame). ZCC or ZCI may be used as the feature amount instead of the power. The feature amounts obtained from the BPFs correspond to the rough shape of the spectrum of the input speech. Hereinafter, this feature amount is denoted h_i (i = 1 to 16: ch. number). The feature amounts obtained for each BPF are input to discriminative feature extraction unit 4. Discriminative feature extraction unit 4 is the main part of the present invention; as detailed later, it extracts from the feature amounts h_i the discriminative feature amounts, which represent the degree of the discriminative features related to human articulation.

The time series of discriminative feature amounts of the input speech, obtained for each frame by discriminative feature extraction unit 4, is input to word recognition unit 5, where its similarity to the time series of discriminative feature amounts of each standard pattern stored in standard pattern storage unit 6 is calculated. The similarity is calculated by DP matching or a similar method. The input word recognized from the result of the similarity calculation is displayed on result display unit 7.
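The matching performed by word recognition unit 5 can be sketched with a basic dynamic-programming (DP) alignment. This is a generic textbook form of DP matching, not the specific variant used in the patent: the squared Euclidean frame distance, the three allowed path steps, and the names `dp_matching_distance` and `recognize` are all assumptions. Both sequences must be non-empty.

```python
def frame_distance(u, v):
    # Assumed local distance: squared Euclidean distance between two frames.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def dp_matching_distance(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(unknown, templates):
    """Return the word whose standard pattern is closest to the unknown input."""
    return min(templates, key=lambda word: dp_matching_distance(unknown, templates[word]))
```

Here each frame would be the vector of six discriminative feature amounts, and `templates` would map each word to its stored standard pattern.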

Next, the meaning of the discriminative features and the method of calculating the discriminative feature amounts that represent their degree are described in detail.

The discriminative features are features related to human articulation, and there are the following six kinds.

"strident" is a discriminative feature that expresses strong frication. It can be judged from the degree to which energy is concentrated in the high frequency band (2500 Hz).

"voiced" is a discriminative feature that expresses the presence of vocal cord vibration. It can largely be judged from the presence of energy in the 120 Hz to 750 Hz band.

"nasal" is a discriminative feature that expresses the degree of nasalization. When the breath also flows through the nasal cavity, a pole-zero pair flattens the spectrum in the first formant region, and the feature can be judged from the degree of this flattening.

"murmur/buzz" is a discriminative feature that expresses the degree of buzz. It can be judged from the degree to which energy is concentrated in the low frequency band.

"diffuse" is a discriminative feature that expresses that the spectrum is relatively flat. It can be judged from the degree of flatness of the spectrum in the middle to high frequency band (800 Hz and above).

"compact" is a discriminative feature that expresses whether a conspicuous peak exists in the middle frequency band (800 Hz to 2500 Hz). It can be judged from the degree to which energy is concentrated in the middle frequency band.

In general, a discriminative feature is expressed as a binary [+/−] value, that is, 0 or 1. In the present invention, however, it is allowed to take intermediate values between 0 and 1, so that the accuracy of word identification is improved for the continuous changes of real speech. FIG. 3 shows the correspondence between each discriminative feature and the phonemes that cause its discriminative feature amount to increase.

The method of calculating each discriminative feature amount is described below with reference to the flowcharts of FIGS. 4 to 9.

FIG. 4(a) is a flowchart for obtaining the degree of the discriminative feature "strident". The method of calculating the degree of "strident" is described in detail below with reference to this flowchart.

In step S1, the discriminative feature amount "RSTR", which represents the degree of the discriminative feature "strident", is calculated by the following equation.

Here, the numerator represents the amount of energy weighted toward the high frequency band, and the discriminative feature amount "RSTR" represents the degree to which energy is concentrated in the high frequency band. w_i is the weight coefficient for each band, and its values are as shown in FIG. 4(b). That is, the weights in the high frequency band are made large so that the concentration of energy in the high frequency band is judged efficiently. α₁ is a constant.

In step S2, it is determined whether the discriminative feature amount "RSTR" calculated in step S1 is negative. If it is negative, the process proceeds to step S3; otherwise, the operation of calculating the degree of "strident" ends.

In step S3, the value of the discriminative feature amount "RSTR" calculated in step S1 is set to 0, and the operation of calculating the degree of "strident" ends.
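Steps S1 to S3 can be sketched as below. The equation for "RSTR" is not reproduced in this text, so the form used here (weighted energy divided by total energy, minus the constant α₁) is only an assumption consistent with the description of the numerator and the constant; the weights, the value of `alpha1`, and the name `strident_degree` are hypothetical, and the real weights are those of FIG. 4(b).

```python
def strident_degree(h, weights, alpha1):
    """Clamped high-band energy concentration for one frame.

    h       : BPF outputs h_i for ch. 1..16
    weights : per-channel weights w_i, larger in the high band (hypothetical)
    alpha1  : constant offset (hypothetical value)
    """
    total = sum(h)
    if total <= 0.0:
        return 0.0  # silent frame: no energy to concentrate anywhere
    # Step S1 (assumed form): weighted energy ratio minus a constant.
    rstr = sum(w * x for w, x in zip(weights, h)) / total - alpha1
    # Steps S2 and S3: a negative value is replaced by 0.
    return rstr if rstr >= 0.0 else 0.0
```

The clamp in steps S2 and S3 keeps the feature at 0 for frames with little high-band energy, so only frames that exceed the offset contribute a graded value.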

FIG. 5 is a flowchart for obtaining the degree of the discriminative feature "voiced". The method of calculating the degree of "voiced" is described in detail below with reference to this flowchart.

In step S21, the cumulative value of the feature amounts h_i of ch. 1 to ch. 3 is obtained, and it is determined whether that value is greater than the threshold θ_v.

If it is greater, the process proceeds to step S24; otherwise, it proceeds to step S22.

This is based on the fact that the discriminative feature "voiced" can be judged from the presence of energy in the 120 Hz to 750 Hz frequency band.

In step S22, the ratio of the cumulative value of the feature amounts h_i of ch. 1 to ch. 3 obtained in step S21 to the cumulative value of the feature amounts h_i over the entire frequency band is calculated, and it is determined whether that ratio is greater than the threshold θ_rv.

If it is greater, the process proceeds to step S24; otherwise, it proceeds to step S23.

This is so that, even when the cumulative value of the feature amounts h_i of ch. 1 to ch. 3 was judged in step S21 to be smaller than the threshold θ_v, vocal cord vibration is judged to be present in the frequency band of ch. 1 to ch. 3 if its ratio to the cumulative value of the feature amounts h_i over the entire frequency band is θ_rv or more.

In step S23, vocal cord vibration is judged to be absent, the discriminative feature amount "RVOI" representing the degree of the discriminative feature "voiced" is set to 0, and the operation of calculating the degree of "voiced" ends.

In step S24, vocal cord vibration is judged to be present, the discriminative feature amount "RVOI" is set to 1, and the operation of calculating the degree of "voiced" ends.
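The decision in steps S21 to S24 can be sketched as follows. The structure follows the flowchart description; the threshold values passed in are hypothetical because the patent text does not give them, and the function name `voiced_degree` is an assumption.

```python
def voiced_degree(h, theta_v, theta_rv):
    """Return RVOI (1.0 or 0.0) for one frame of BPF outputs h_i, ch. 1..16."""
    low = sum(h[0:3])  # cumulative value of ch. 1 to ch. 3
    # Step S21: enough absolute low-band energy means voiced (step S24).
    if low > theta_v:
        return 1.0
    # Step S22: otherwise compare the low-band share of the total energy.
    total = sum(h)
    if total > 0.0 and low / total > theta_rv:
        return 1.0  # step S24
    return 0.0  # step S23: no vocal cord vibration


loud_voiced = [5.0, 5.0, 5.0] + [1.0] * 13
quiet_voiced = [1.0, 1.0, 1.0] + [0.1] * 13
unvoiced = [0.1, 0.1, 0.1] + [1.0] * 13
```

The second test lets a quietly spoken voiced frame pass even though its absolute low-band energy is below θ_v, which is the purpose stated for step S22.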

FIG. 6 is a flowchart for obtaining the degree of the discriminative feature "nasal". The method of calculating the degree of "nasal" is described in detail below with reference to this flowchart.

In step S31, "NCH" is set to 5.

In step S32, the amount of spectral unevenness "NAS1" in the low to middle frequency band and the number of calculations "NN" are calculated by the following equation.

In step S33, "NBAND", the number of ch. whose feature amount h_i is equal to or greater than the threshold θ_i (that is, the spread of the spectrum in the low to middle frequency band), is calculated by the following equation.
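The count described in step S33 can be sketched as below. The equation itself is not reproduced in this text, so the channel range is passed in explicitly; treating it as ch. 1 to "NCH" in the nasal flowchart is an assumption, and the name `count_active_bands` is hypothetical.

```python
def count_active_bands(h, thresholds, first_ch, last_ch):
    """NBAND: number of channels in [first_ch, last_ch] with h_i >= theta_i.

    Channel numbers are 1-based, as in the patent text.
    """
    return sum(
        1
        for ch in range(first_ch, last_ch + 1)
        if h[ch - 1] >= thresholds[ch - 1]
    )
```

A larger count means that more channels carry energy above their thresholds, that is, a wider spread of the spectrum over the band examined.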

In step S34, it is determined whether "NAS1" is smaller than "RMAX". If it is smaller, the process proceeds to step S35; otherwise, it proceeds to step S36.

Here, the initial value of "RMAX" is set to a large number that cannot occur inside the loop of this flowchart.

In step S35, "RMAX", the degree of flatness of the spectrum in the low to middle frequency band, is calculated by the following equation.

Here, the numerator represents the spread of the low to middle frequency band, and the denominator represents the degree of unevenness of the low to middle frequency band. α₃ is a constant.

In step S36, 1 is added to "NCH".

In step S37, it is determined whether "NCH" is greater than 7. If it is greater, the process proceeds to step S38; otherwise, the process returns to step S32, the calculation range is widened further, and "RMAX" is calculated again.

In step S38, "NAS2" and "NAS3" are calculated by the following equations in order to further calculate the degree of flatness of the spectrum in the frequency band corresponding to the first formant region.

In step S39, it is determined whether "NAS2" is greater than "NAS3". If it is greater, the process proceeds to step S40; otherwise, it proceeds to step S41.

In step S40, the discriminative feature amount "RNAS", which represents the degree of the discriminative feature "nasal", is calculated by the following equation, and the operation of calculating the degree of "nasal" ends.

Here, the second term on the right side represents the degree of flatness of the spectrum in the first formant region. α₃′ and α₃″ are constants.

In step S41, the discriminative feature amount "RNAS", which represents the degree of the discriminative feature "nasal", is calculated by the following equation, and the operation of calculating the degree of "nasal" ends.

Here, the second term on the right side represents the degree of flatness of the spectrum in the first formant region.

FIG. 7(a) is a flowchart for obtaining the degree of the discriminative feature "murmur/buzz". The method of calculating the degree of "murmur/buzz" is described in detail below with reference to this flowchart.

In step S51, the discriminative feature amount "RMUR", which represents the degree to which energy is concentrated in the low frequency band, is calculated by the following equation.

Here, w_i is the weight coefficient for each band, and its values are as shown in FIG. 7(b). That is, the weight coefficients in the low frequency band are made large so that the concentration of energy in the low frequency band is judged efficiently. α₄ is a constant.

In step S52, it is determined whether the discriminative feature amount "RMUR" calculated in step S51 is negative. If it is negative, the process proceeds to step S53; otherwise, the operation of calculating the degree of "murmur/buzz" ends.

In step S53, the discriminative feature amount "RMUR" calculated in step S51 is set to 0, and the operation of calculating the degree of "murmur/buzz" ends.

FIG. 8(a) is a flowchart for obtaining the degree of the discriminative feature "compact". The method of calculating "compact" is described below with reference to this flowchart.

In step S61, "NCH" is set to 7.

In step S62, "NCOM", which represents the amount of energy in the middle frequency band, is calculated by the following equation.

Here, w_i is a weight coefficient, and its values are as shown in FIG. 8(b). That is, the weights at the middle of the frequency range used to calculate "NCOM" are made large so that the energy in the middle frequency band is obtained efficiently.

In step S63, it is determined whether the value of "NCOM" obtained in step S62 is greater than the value of "MAXC". If it is greater, the process proceeds to step S64; otherwise, it proceeds to step S65.

Here, the initial value of "MAXC" is set to a small number that cannot occur inside the loop of this flowchart.

In step S64, "MAXC" is set to the value of "NCOM" obtained in step S62.

In step S65, 1 is added to "NCH".

In step S66, it is determined whether the value of "NCH" calculated in step S65 is greater than 9. If it is greater, the process proceeds to step S67; otherwise, the process returns to step S62, the calculation range of "NCOM" is shifted, and the next "NCOM" is calculated.

In step S67, the discriminative feature amount "RCOM", which represents the degree to which energy is concentrated in the middle frequency band, is calculated by the following equation.

That is, the numerator corresponds to the amount of energy in the middle frequency band and the denominator corresponds to the amount of energy over the entire frequency band, so the value of the discriminative feature amount "RCOM" represents the degree to which the spectrum is concentrated in the middle frequency band.
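The loop of steps S61 to S67 can be sketched as follows. The equations for "NCOM" and "RCOM" are not reproduced in this text, so several details are assumptions: the window is taken as three channels centered on "NCH", the triangular weights stand in for those of FIG. 8(b), and "RCOM" is taken as the plain ratio of "MAXC" to the total energy. The name `compact_degree` is hypothetical.

```python
def compact_degree(h, window_weights=(0.5, 1.0, 0.5), first_center=7, last_center=9):
    """Largest weighted mid-band energy relative to the total energy."""
    total = sum(h)
    if total <= 0.0:
        return 0.0  # silent frame
    half = len(window_weights) // 2
    maxc = float("-inf")  # initial MAXC: smaller than any NCOM in the loop
    # Steps S61, S65, S66: NCH runs over ch. 7, 8 and 9 (1-based).
    for nch in range(first_center, last_center + 1):
        # Step S62 (assumed form): weighted energy in a window around NCH.
        ncom = sum(w * h[nch - 1 - half + k] for k, w in enumerate(window_weights))
        # Steps S63 and S64: keep the largest NCOM seen so far.
        if ncom > maxc:
            maxc = ncom
    # Step S67 (assumed form): mid-band energy over total energy.
    return maxc / total


peaked = [0.0] * 16
peaked[7] = 4.0  # a single peak at ch. 8
flat = [1.0] * 16
```

Shifting the window over three center channels lets the measure follow a mid-band peak whose position varies slightly between utterances.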

FIG. 9 is a flowchart for obtaining the degree of the discriminative feature "diffuse". The method of calculating the degree of "diffuse" is described in detail below with reference to this flowchart.

In step S71, "NDF1", which represents the amount of spectral unevenness in the middle to high frequency band, is calculated by the following equation.

In step S72, "NBAND", the number of ch. in the middle to high frequency band whose feature amount h_i is equal to or greater than the threshold θ_i, is calculated by the following equation. This "NBAND" is a value that represents the spread of the spectrum in the middle to high frequency band.

In step S73, "NDF2", the sum of the feature amounts h_i from the BPFs of ch. 6 to ch. 16, and "NDF3", the sum of the feature amounts h_i from the BPFs of ch. 3 to ch. 16, are calculated by equations.

In step S74, the discriminative feature amount "RDIF", which represents the degree of flatness of the spectrum in the middle to high frequency band, is calculated by the following equation, and the operation of calculating the degree of "diffuse" ends.

The first term on the right side of this equation represents the degree of flatness in the middle to high frequency band, and the second term represents the degree to which no spectrum exists in the low to middle frequency band. The threshold θ_i is a threshold provided for each BPF in order to obtain the spread of the spectrum, and α₆, α₆′ and α₆″ are constants.

FIG. 10 shows an example of an input speech waveform and the time-series patterns of the discriminative feature amounts obtained from this input speech as described above. FIG. 10(a) shows the input speech waveform, and FIG. 10(b) shows the phoneme labels of that waveform. FIG. 10(c) shows the time series of the discriminative feature amount "RSTR", which represents the degree of the discriminative feature "strident" in the input speech of FIG. 10(a). FIG. 10(d) shows the time series of "RVOI" for "voiced", FIG. 10(e) the time series of "RNAS" for "nasal", FIG. 10(f) the time series of "RMUR" for "murmur/buzz", FIG. 10(g) the time series of "RCOM" for "compact", and FIG. 10(h) the time series of "RDIF" for "diffuse".

Next, the discriminative feature amounts obtained for each phoneme in FIG. 10 are examined with reference to FIG. 3. First, for the phoneme /k/, which is related to the discriminative feature “compact”, only the discriminative feature amount “RCOM” has the value “1”, so the discriminative feature amounts of the phoneme /k/ can be said to be correctly extracted. For the phoneme /a/, which is related to the discriminative feature “voiced”, the value of “RVOI” is “1”. In addition, “RCOM” takes a value between “0” and “1”, which indicates that, in relation to the preceding phoneme /k/, its place of articulation is toward the back; and in the latter half “RNAS” takes a value between “0” and “1”, which indicates that the phoneme is nasalized under the influence of the following phoneme /n/. The discriminative feature amounts of the phoneme /a/ can therefore be said to be correctly extracted. For the phoneme /n/, which is related to the discriminative features “nasal” and “murmur/buzz”, the values of “RNAS” and “RMUR” are “1”, so the discriminative feature amounts of the phoneme /n/ can be said to be correctly extracted. For the phoneme /e/, which is related to the discriminative feature “voiced”, the value of “RVOI” is “1”, and “RNAS” takes a value between “0” and “1”, which indicates that the phoneme is nasalized under the influence of the preceding phoneme /n/; the discriminative feature amounts of the phoneme /e/ can be said to be correctly extracted. Next, for the phoneme /ts/, which is related to the discriminative features “strident” and “diffuse”, the values of “RSTR” and “RDIF” are “1”, so the discriminative feature amounts of the phoneme /ts/ can be said to be correctly extracted. Finally, for the phoneme /u/, which is related to the discriminative feature “voiced”, the value of “RVOI” is “1”, so the discriminative feature amounts of the phoneme /u/ can be said to be correctly extracted.
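
The phoneme-by-phoneme check described above can be sketched in a few lines of Python. This is only an illustration: the mapping restates the phoneme/feature relations given in the text, while the function names, the threshold of 1.0 and the sample frame values are assumptions introduced here and are not taken from FIG. 3 or FIG. 10.

```python
# Illustrative sketch only. The mapping restates the relations described in
# the text; the frame values further below are hypothetical, not measured data.
RELATED_FEATURES = {
    "k": {"RCOM"},           # compact
    "a": {"RVOI"},           # voiced
    "n": {"RNAS", "RMUR"},   # nasal, murmur/buzz
    "e": {"RVOI"},           # voiced
    "ts": {"RSTR", "RDIF"},  # strident, diffuse
    "u": {"RVOI"},           # voiced
}


def saturated_features(frame, threshold=1.0):
    """Return the names of the feature amounts whose value reaches the threshold."""
    return {name for name, value in frame.items() if value >= threshold}


def matches_phoneme(frame, phoneme):
    """True if every feature related to the phoneme is saturated in the frame."""
    return RELATED_FEATURES[phoneme] <= saturated_features(frame)


# Hypothetical frame from the latter half of /a/: fully voiced, partly compact,
# partly nasalized (values between 0 and 1 are assumed for the demonstration).
frame_a = {"RSTR": 0.0, "RVOI": 1.0, "RNAS": 0.4, "RMUR": 0.0, "RCOM": 0.6, "RDIF": 0.0}

print(matches_phoneme(frame_a, "a"))  # True
print(matches_phoneme(frame_a, "n"))  # False
```

A subset test of this kind only confirms that the related features are present; it does not by itself separate phonemes that share the same related feature, such as the vowels listed above.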

In other words, the discriminative feature amounts extracted by the speech feature extraction method of the present invention are information that correctly represents the features of each phoneme in accordance with human articulatory activity. Therefore, phonemes can be correctly identified by means of these discriminative feature amounts.

As described above, the speech feature extraction method of the present invention takes the output power from each BPF as a feature amount hi and, based on this feature amount hi, calculates the discriminative feature amounts “RSTR”, “RVOI”, “RNAS”, “RMUR”, “RCOM” and “RDIF”, each of which represents the degree of a discriminative feature, that is, a feature related to human articulation. Therefore, according to the present invention, the information (discriminative feature amounts) needed to identify each phoneme can be extracted without being affected by speaker-dependent spectral variation, and speech can be recognized correctly by obtaining a similarity, by DP matching or the like, between the discriminative feature amounts of the input speech and those of the standard patterns.

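As a minimal sketch of the DP matching step mentioned above, the following Python code compares a time series of six-dimensional discriminative feature vectors against stored standard patterns. The text names DP matching but does not specify the local distance or the path constraints, so the Euclidean frame distance, the symmetric step pattern, the function names and the sample vectors are all assumptions made for illustration.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors (an assumed local distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dp_matching(seq_a, seq_b):
    """Accumulated DP-matching (dynamic time warping) distance between two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(input_seq, standard_patterns):
    """Return the label of the standard pattern with the smallest DP distance."""
    return min(standard_patterns, key=lambda w: dp_matching(input_seq, standard_patterns[w]))


# Hypothetical frames, ordered as (RSTR, RVOI, RNAS, RMUR, RCOM, RDIF).
K = (0.0, 0.0, 0.0, 0.0, 1.0, 0.0)
A = (0.0, 1.0, 0.0, 0.0, 0.5, 0.0)
N = (0.0, 0.0, 1.0, 1.0, 0.0, 0.0)
TS = (1.0, 0.0, 0.0, 0.0, 0.0, 1.0)
U = (0.0, 1.0, 0.0, 0.0, 0.0, 0.0)

standard_patterns = {"word_1": [K, A, N], "word_2": [TS, U]}
unknown = [K, K, A, N]  # same content as word_1, with the first frame stretched in time

print(recognize(unknown, standard_patterns))  # word_1
```

Because DP matching warps the time axis, the stretched input aligns with "word_1" at zero accumulated distance even though the two sequences differ in length.
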
<Effects of the Invention> As is clear from the above, the speech feature extraction method of the invention according to claim 1 obtains, from the spectral components of the input speech obtained by frequency analysis, a discriminative feature amount representing frication, a discriminative feature amount representing voicing, a discriminative feature amount representing the flatness of the spectrum due to nasalization, a discriminative feature amount representing buzziness, a discriminative feature amount representing the flatness of the spectrum in the middle frequency band, and a discriminative feature amount representing the dispersion of the spectrum, so that these six discriminative feature amounts related to human articulation can be extracted. Therefore, by using the present invention, speech can be recognized stably and correctly without being affected by speaker-dependent spectral variation, and high recognition performance can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram of an embodiment of a speech recognition apparatus according to the present invention; FIG. 2 is a diagram showing the frequency characteristics of each BPF; FIG. 3 is a diagram showing the correspondence between the discriminative features and the phonemes that cause the corresponding discriminative feature amounts to increase; FIG. 4(a) is a flowchart for obtaining the degree of the discriminative feature “strident”; FIG. 4(b) is a diagram showing an example of weighting coefficients; FIG. 5 is a flowchart for obtaining the degree of the discriminative feature “voiced”; FIG. 6 is a flowchart for obtaining the degree of the discriminative feature “nasal”; FIG. 7(a) is a flowchart for obtaining the degree of the discriminative feature “murmur/buzz”; FIG. 7(b) is a diagram showing an example of weighting coefficients; FIG. 8(a) is a flowchart for obtaining the degree of the discriminative feature “compact”; FIG. 8(b) is a diagram showing an example of weighting coefficients; FIG. 9 is a flowchart for obtaining the degree of the discriminative feature “diffuse”; and FIG. 10 is a diagram showing an example of the time-series patterns of the discriminative feature amounts obtained from an input speech waveform.
1 ... microphone, 2 ... amplifier, 3 ... feature extraction unit, 4 ... discriminative feature extraction unit, 5 ... word recognition unit, 6 ... standard pattern storage unit, 7 ... result display unit.

Continuation of the front page
(56) References: JP-A-63-65500 (JP, A); JP-A-62-17900 (JP, A); JP-A-62-7081 (JP, A); JP-A-60-194500 (JP, A); JP-A-62-102298 (JP, A); JP-A-62-102297 (JP, A); JP-A-63-85697 (JP, A); JP-A-61-240300 (JP, A); JP-B-47-11635 (JP, B1); JP-B-47-11633 (JP, B1)
(58) Fields investigated (Int. Cl.⁶, DB name): G10L 7/08; G10L 9/00 - 9/20; JICST Science and Technology Literature File

Claims

(57) [Claims]

1. A speech feature extraction method for frequency-analyzing input speech and extracting features of the speech from the obtained spectral components, the method comprising:
a first step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing frication;
a second step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing voicing;
a third step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing the flatness of the spectrum due to nasalization;
a fourth step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing buzziness;
a fifth step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing the flatness of the spectrum in the middle frequency band; and
a sixth step of obtaining, based on the spectral components obtained by the frequency analysis, a discriminative feature amount representing the degree of a discriminative feature representing the dispersion of the spectrum,
whereby feature amounts of the speech related to human articulation are extracted.