JPH0462598B2

Info

Publication number: JPH0462598B2
Application number: JP59238339A
Authority: JP
Inventors: Yasuaki Awanaka; Gichu Oota
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1984-11-14
Filing date: 1984-11-14
Publication date: 1992-10-06
Also published as: JPS61117600A

Description

[Detailed description of the invention]

[Field of application of the invention]

The present invention relates to a speech recognition device for monosyllables. The device accurately quantifies the amount of spectral change in the so-called "transition" (watari) region, which leads from the consonant region to the vowel region, and recognizes the speech by calculating the similarity to standard characteristics prepared in advance.

[Background of the invention]

Conventionally, monosyllabic speech is recognized by first dividing the input signal into a consonant region and a vowel region, obtaining either the separate spectra of the consonant and the vowel or a time series of several spectra spanning the consonant-to-vowel region, and then calculating the similarity between these spectra and a set of standard spectra.

One typical method for dividing an input signal into consonant and vowel regions is described in the Journal of the Acoustical Society of Japan, Vol. 40, No. 2, pp. 63-70 (1984): the envelope of the input signal is obtained, and shift matching against a standard envelope prepared in advance is performed in the rising region of the envelope.

The principle is explained with reference to FIG. 2. Solid line 1 in FIG. 2A represents the rising portion of the envelope of a monosyllabic signal. A standard envelope 2 (dotted line) is set in advance and shifted step by step along the time axis, and the distance between solid line 1 and dotted line 2 is calculated at each shift. The calculated distance forms solid line 3 in FIG. 2B, and its minimum point 4 serves as the reference point for detecting the consonant region. The standard envelope corresponding to point 4 is 2' in FIG. 2A, and the consonant range and the vowel range can be determined from standard envelope 2'.
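The shift matching of FIG. 2 can be illustrated with a minimal Python sketch. The distance measure is not specified in the text, so the mean absolute difference used here is an assumption, as are the function name `shift_match` and the sample values.

```python
def shift_match(envelope, template):
    """Slide `template` along `envelope` and return (best_shift, distances).

    The distance at each shift is the mean absolute difference, which is an
    assumed measure; the source only says that a distance is calculated.
    """
    n, m = len(envelope), len(template)
    if m > n:
        raise ValueError("template must not be longer than the envelope")
    distances = []
    for shift in range(n - m + 1):
        window = envelope[shift:shift + m]
        d = sum(abs(e - t) for e, t in zip(window, template)) / m
        distances.append(d)
    # The minimum of the distance curve plays the role of point 4 in FIG. 2B.
    best_shift = min(range(len(distances)), key=distances.__getitem__)
    return best_shift, distances


# Hypothetical rising edge of a monosyllable envelope and a standard envelope.
envelope = [0.0, 0.0, 0.05, 0.1, 0.3, 0.6, 0.9, 1.0, 1.0]
template = [0.1, 0.3, 0.6, 0.9]
best_shift, distances = shift_match(envelope, template)
print(best_shift)
```

Preparing several templates, as the text suggests for different speaking rates, would amount to running the same search once per template and keeping the overall minimum.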

With this method the consonant region of a monosyllable can be set approximately. Differences in speaking rate and individual differences in the speech waveform can also be handled by preparing several standard envelopes.

As an improvement on this method, Japanese Patent Laid-Open No. 59-26797 divides the signal into consonant and vowel regions by setting reference values that delimit the vowel and consonant regions near the maximum and minimum values of the envelope signal of each input utterance. FIG. 3 shows the relationship between the minimum value Ec and the maximum value Ev in a standard envelope.

Once the consonant is detected, the spectrum of the consonant region is obtained. In principle, the above methods make it easy to obtain the spectra of the consonant and vowel regions, and they work well.

However, the maximum amplitude of the speech envelope varies greatly with the loudness of the speaker's utterance and with the type of monosyllable. When the envelope amplitude is small, the consonant can no longer be detected by a simple method, or it is buried in ambient noise. This problem is explained in detail with reference to FIG. 4.

FIG. 4A shows an example of analyzing a monosyllable envelope 1 with a relatively large amplitude. The span from the time at which the envelope crosses a preset amplitude reference value (trigger level) to the time indicated by position 5, where the envelope reaches its maximum, is taken as the consonant region plus the transition region from the consonant to the vowel.

For a speech signal such as that in FIG. 4A, whose amplitude is sufficiently larger than the reference amplitude, the simple approach of starting the analysis once the signal reaches the reference amplitude allows the consonant region, the transition region that follows it, and the vowel region to be input to the signal analyzer. Let a be the analysis interval of predetermined width that starts at the time the envelope crosses the reference amplitude; frequency analysis over a analyzes the consonant region. Likewise, let c be the analysis interval of predetermined width that starts at time 5, where the envelope is at its maximum; analysis over c covers the vowel region.

In contrast, when the amplitude of the speech signal is small, as in FIG. 4B, the amplitude of the consonant region falls below the reference value and the consonant region cannot be analyzed. The analysis interval that can be set from the time the envelope crosses the reference value is b, which analyzes the transition from the consonant to the vowel region rather than the consonant itself. The corresponding transition-region analysis interval in FIG. 4A is a'.
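The effect described for FIG. 4 can be reproduced with a small Python sketch: the same envelope shape is analyzed at two loudness levels against a fixed trigger level. The function name `analysis_span`, the trigger value, and the sample data are illustrative assumptions, not values from the patent.

```python
def analysis_span(envelope, trigger_level):
    """Return (start, peak) sample indices, or None if the trigger is never reached.

    `start` is the first sample at or above the trigger level and `peak` is
    the sample with the maximum amplitude (position 5 in FIG. 4).
    """
    peak = max(range(len(envelope)), key=envelope.__getitem__)
    for i, value in enumerate(envelope):
        if value >= trigger_level:
            return i, peak
    return None


# Hypothetical envelopes: the quiet one has the same shape at a quarter of the amplitude.
loud = [0.05, 0.2, 0.35, 0.6, 0.9, 1.0, 0.8]
quiet = [v * 0.25 for v in loud]
TRIGGER = 0.2  # assumed fixed reference amplitude

loud_span = analysis_span(loud, TRIGGER)
quiet_span = analysis_span(quiet, TRIGGER)
print(loud_span, quiet_span)
```

For the loud utterance the analysis starts early in the rise, whereas for the quiet one it starts only one sample before the peak, so the consonant portion is never captured. This is the situation that motivates placing the analysis interval relative to the peak amplitude instead of a fixed level.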

Thus, when the amplitude of a monosyllable is small, either the consonant region cannot be detected or, as noted above, it is buried in ambient noise.

As a method of extracting the change in characteristics from the consonant region to the vowel region, the Journal of the Institute of Electronics and Communication Engineers, Vol. J65-A, pp. 1278-1285 (1982) describes detecting consonant groups such as "voiceless consonantal", "fricative", and "voiceless plosive" with discrimination filters on the time axis. This method requires a bank of discrimination filters in addition to the bank of frequency analysis filters, and identifying individual monosyllables requires as many filters as there are monosyllables.

[Purpose of the invention]

An object of the present invention is to eliminate the above problems in frequency analysis of the input speech signal from the consonant region to the vowel region.

[Summary of the invention]

To achieve this object, the present invention obtains the change between the spectrum in the transition region and the spectrum in the vowel region, uses this change together with the spectral shapes themselves as data for recognition, and calculates the similarity to standard characteristics prepared in advance, thereby providing a speech recognition device capable of accurate recognition.

The temporal position of the analysis interval within the transition section should preferably be the same regardless of the type of syllable. For example, when the maximum amplitude of the speech envelope is normalized to 1.0, the analysis interval of the transition region is set near an amplitude of 0.5.

An outline of the invention is given below with reference to FIGS. 5 and 6. FIG. 5 is a schematic block diagram of the signal processing circuit. The speech signal from microphone 7 is brought to a predetermined amplitude by microphone amplifier 8 and converted into an envelope signal by envelope forming circuit 9. When a signal at or above a predetermined amplitude is input, it acts as a trigger and recording into memory 10 begins. Once the signal for a predetermined time interval has been stored, the stored data are passed to envelope parameter calculation section 11 and calculation begins. The calculation yields the temporal position of the maximum amplitude in the envelope of the monosyllable, the temporal position of the transition region, the end position of the monosyllable, and so on.

The other output of microphone amplifier 8 is analyzed by frequency analysis section 12, and the result is recorded in memory 13. The trigger amplitude that starts the recording of analysis data is the same as for memory 10 and is synchronized with it, so the signals in memory 10 and memory 13 are synchronized. When recording for the predetermined time interval is complete, difference spectrum calculation section 14 performs its calculation using the results of envelope parameter calculation section 11. It calculates the difference between the transition region spectrum and the vowel region spectrum, obtains at least one further spectrum before the monosyllable ends, and calculates the similarity between that spectrum and the vowel region spectrum.

Next, the calculation in envelope parameter calculation section 11 is explained in detail with reference to FIG. 6, which shows the envelope waveform when two monosyllables are uttered in succession. When speech is uttered and the envelope reaches the reference value, recording of data into memory 10 begins. The sampling period is, for example, 15 ms, and the points where the envelope curve meets the vertical lines in FIG. 6 are the amplitude data. The intervals between data points, a₁, a₂, and so on, are the analysis intervals, that is, the temporal positions used in analysis section 12. From the data recorded in memory 10, the calculation first identifies that the envelope forms two peaks corresponding to two monosyllables, and then finds that the top of the first peak lies at the position marked as point 5. The vowel region is thereby determined to be a₆.

The first transition region corresponds to interval a₄ in FIG. 6. Among the intervals a₁ to a₆, a₄ is chosen as the one for which the mean of the amplitudes at the start and end of the interval is closest to 0.5 times the amplitude at point 5. Then, after confirming that a₁₀ is the region where the two syllables join, the position of a₈ is used as the region in which the monosyllable is ending. The analysis intervals a₄, a₆, and a₈, as temporal positions, designate which data are needed out of the series of analysis data.
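The selection rule for the transition interval can be sketched in Python as follows. Interval k here spans samples k and k+1 (zero-based), so index 3 corresponds to a₄ in the patent's one-based labelling. The function name `transition_interval` and the sample amplitudes are hypothetical.

```python
def transition_interval(samples, peak_index):
    """Return the zero-based index k of the interval [k, k+1] whose mean
    endpoint amplitude is closest to half the amplitude at the peak.

    The search covers the intervals up to and including the one that starts
    at the peak, mirroring the range a1 to a6 named in the text.
    """
    target = 0.5 * samples[peak_index]
    last = min(peak_index, len(samples) - 2)
    best_k, best_err = None, float("inf")
    for k in range(last + 1):
        mean_amp = 0.5 * (samples[k] + samples[k + 1])
        err = abs(mean_amp - target)
        if err < best_err:
            best_k, best_err = k, err
    return best_k


# Hypothetical envelope samples taken every 15 ms; the peak is at index 5,
# so the vowel interval (a6) is the one starting there.
samples = [0.05, 0.10, 0.20, 0.40, 0.62, 1.00, 0.95, 0.90]
peak_index = 5
k = transition_interval(samples, peak_index)
print(k)
```

Because the target is defined relative to the peak, scaling every sample by the same factor leaves the selected interval unchanged, which is the amplitude independence the text is aiming for.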

Intervals b₁ to b₆ form the second monosyllable. Parameter calculation for this region is carried out after re-inputting, for example, 0.36 seconds of envelope data, with b₁ as the first interval, into envelope parameter calculation section 11. The resulting analysis intervals for the transition region and the vowel region are b₁ and b₃, respectively.

[Embodiments of the invention]

The overall configuration and operation of the device of the present invention are explained below with reference to FIG. 1.

The uttered speech signal is input through microphone 7 and microphone amplifier 8 to envelope forming circuit 9 and to bandpass filter bank 15. When the amplitude of the speech envelope formed by envelope forming circuit 9 reaches or exceeds a predetermined reference amplitude, a trigger pulse conveys this to controller 24 and the controller starts operating. At the same time, the speech envelope signal is recorded in memory 10 as a digital signal via A/D converter 18. The time constant of envelope forming circuit 9 is 15 ms, and the sampling period of A/D converter 18 is likewise 15 ms. Memory 10 can hold about 8 seconds of the speech envelope signal and thus serves as a buffer memory. When memory 10 has been filled with the first 0.36 seconds of signal, envelope parameter calculation section 11 starts its calculation. A 0.36-second interval normally contains one or two monosyllables; as explained in detail later, the first syllable is extracted and its parameters are calculated. The calculated parameters are sent via controller 24 to difference spectrum calculation section 14.
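As a quick check of the figures quoted above, the following Python sketch works out how many samples the stated durations correspond to at a 15 ms sampling period. The constant names are made up for illustration; the 8-second capacity is approximate in the text, so the buffer counts are approximate as well.

```python
SAMPLE_PERIOD_MS = 15   # sampling period of A/D converter 18 (from the text)
WINDOW_MS = 360         # block handed to envelope parameter calculation section 11
BUFFER_MS = 8000        # approximate capacity of memory 10
CHANNELS = 15           # filter channels recorded per frame in memory 13

samples_per_window = WINDOW_MS // SAMPLE_PERIOD_MS   # envelope samples per 0.36 s block
samples_in_buffer = BUFFER_MS // SAMPLE_PERIOD_MS    # envelope samples in about 8 s
windows_in_buffer = BUFFER_MS // WINDOW_MS           # whole 0.36 s blocks in about 8 s
values_in_memory_13 = samples_in_buffer * CHANNELS   # spectral values in about 8 s

print(samples_per_window, samples_in_buffer, windows_in_buffer, values_in_memory_13)
# prints: 24 533 22 7995
```

Each 0.36-second block therefore holds 24 envelope samples, which is the amount of data the peak and valley search has to work with.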

Meanwhile, each filter output for the speech signal sent to bandpass filter bank 15 in analysis section 12 is converted into an envelope signal by envelope forming circuit 16. Bandpass filter bank 15 consists of 15 channels covering 200 Hz to 5 kHz, each filter having a bandwidth of 1/3 octave. The time constant of the envelope outputs is the same as that of envelope forming circuit 9, so that the outputs correspond in time to the envelope parameters calculated from the output of envelope forming circuit 9.
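The stated filter bank layout is self-consistent, as a short Python sketch shows: fifteen consecutive 1/3-octave bands starting at 200 Hz end at roughly 5 kHz. The text does not give the individual center frequencies, so the exact geometric spacing from a 200 Hz first center is an assumption, as is the function name `third_octave_bands`.

```python
def third_octave_bands(first_center_hz=200.0, channels=15):
    """Return (lower, center, upper) frequencies in Hz for consecutive
    1/3-octave bands, assuming exact geometric spacing."""
    half_band = 2.0 ** (1.0 / 6.0)  # band edges lie 1/6 octave either side of the center
    bands = []
    for k in range(channels):
        center = first_center_hz * 2.0 ** (k / 3.0)
        bands.append((center / half_band, center, center * half_band))
    return bands


bands = third_octave_bands()
for lower, center, upper in bands:
    print(f"{center:8.1f} Hz  ({lower:7.1f} to {upper:7.1f})")
```

Under this assumption the last center frequency is about 5080 Hz, and every third channel doubles in frequency (200, 400, 800 Hz and so on).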

The outputs of envelope forming circuit 16 are scanned by multiplexer 17 at a frame interval of 15 ms and recorded in memory 13, via A/D converter 19, as the output of analysis section 12. As with A/D converter 18, operation starts when a signal at or above the preset reference amplitude is input. Memory 13, like memory 10, is a buffer memory that can hold about 8 seconds of speech analysis data. When 0.36 seconds of data have been recorded in memory 13 and envelope parameter calculation section 11 has finished, difference spectrum calculation section 14 starts its calculation as instructed by controller 24. If the envelope parameter calculation has not finished, the device waits for it to finish and then, as instructed by controller 24, processes the 0.36 seconds of data in the difference spectrum calculation section. The result is recorded in pattern memory 20. When recording is complete, 0.36 seconds of envelope data beginning with the second monosyllable are input to envelope parameter calculation section 11, as instructed by controller 24.

When pattern memory 20 is filled with data, pattern matching section 21 successively calculates the similarity (distance) between the input pattern and the standard patterns stored in advance in standard pattern memory 22, and identifies which monosyllable the input signal is. The result is output from output terminal 25 via output buffer 23.

Next, the calculation in envelope parameter calculation section 11 is explained using the schematic flowchart of FIG. 7.

On the first 0.36 seconds of the uttered speech envelope signal, a calculation is performed that ignores local irregularities in the envelope shape and detects the large peaks and valleys. As noted above, a 0.36-second interval normally contains one or two monosyllable signals, and the first peak of the envelope, which corresponds to the first monosyllable, is detected from it.

The first step of the flowchart in FIG. 7 specifies that processing starts from the first datum recorded in the memory. The next step detects the data number of the peak position, where the amplitude of the first peak is at its maximum. If the envelope rises so steeply that the first datum has the maximum amplitude, the second datum is taken as the peak position of the first peak. The following step detects the position of the transition region: it lies between the first datum and the peak datum, and the data number whose value is closest to 1/2 of the maximum amplitude is searched for. The step after that searches for the data number of the region joining the first and second peaks, that is, the valley. Calling this data number n, setting I = n tells the controller that n is the first datum when the second monosyllable is processed. The next step searches for the data number marking the end region of the first peak; the data number midway between the position of the first peak and the valley (joining region) is chosen. A further step sends the results obtained so far to controller 24. When the transfer is complete and the controller has not issued an instruction to stop, the calculation for the second syllable begins with I = n. The controller issues a stop instruction when all data in memory 10 have been processed or when the calculation in difference spectrum calculation section 14 is still in progress.
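The envelope parameter calculation described for FIG. 7 might look roughly like the following Python sketch. The patent does not say how local irregularities are ignored, so the hysteresis rule used here (a peak or valley is accepted only after the envelope has moved away from it by a fraction of the peak amplitude) is an assumption, as are the function name, the `hysteresis` parameter, and the sample envelope.

```python
def envelope_parameters(env, start=0, hysteresis=0.2):
    """Find peak, transition, valley and end indices of the first syllable
    at or after `start`. Small bumps are ignored via an assumed hysteresis."""
    n = len(env)
    # Peak: running maximum, accepted once the envelope has dropped clearly below it.
    peak = start
    for i in range(start, n):
        if env[i] > env[peak]:
            peak = i
        elif env[peak] - env[i] > hysteresis * env[peak]:
            break
    if peak == start and start + 1 < n:
        peak = start + 1  # steep onset: the second datum is taken as the peak
    # Transition: datum between start and peak closest to half the peak amplitude.
    half = 0.5 * env[peak]
    transition = min(range(start, peak + 1), key=lambda k: abs(env[k] - half))
    # Valley: running minimum after the peak, accepted once the envelope rises again.
    valley = peak
    for j in range(peak, n):
        if env[j] < env[valley]:
            valley = j
        elif env[j] - env[valley] > hysteresis * env[peak]:
            break
    end = (peak + valley) // 2  # midway between the peak and the valley
    return {"peak": peak, "transition": transition, "valley": valley, "end": end}


# Hypothetical envelope containing two syllables, one sample per 15 ms.
env = [0.2, 0.35, 0.5, 0.8, 0.95, 1.0, 0.9, 0.92, 0.7,
       0.4, 0.25, 0.3, 0.6, 0.8, 0.7, 0.4, 0.2]
first = envelope_parameters(env)
second = envelope_parameters(env, start=first["valley"])  # I = n
print(first, second)
```

Restarting the search at the valley index corresponds to setting I = n for the second syllable. The small rise at index 7 in the sample data is not mistaken for a new peak because it stays within the hysteresis band.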

Next, the calculation in difference spectrum calculation section 14 is explained. The spectrum at the peak position of the envelope signal is assumed to be the vowel region spectrum. Because this spectrum is the reference for comparison, its shape is made more reliable by averaging the spectral values of the analysis interval at the peak position and of the following analysis interval (a₆ and a₇ in FIG. 6); the average is used as the reference spectrum of the vowel region. In other words, the analysis interval is 30 ms. Next, the difference between this reference spectrum and the transition region spectrum (analysis interval 15 ms) is obtained. This calculation is illustrated in FIG. 8. FIG. 8A shows the spectral shapes for a certain monosyllable: 26 is the vowel region spectrum and 27 is the transition region spectrum. Taking their difference gives spectrum 28 in FIG. 8B. The shape of spectrum 28 is a strong cue for determining which monosyllable was input.
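A minimal Python sketch of this calculation is given below. The sign of the difference is not stated in the text; transition minus vowel is assumed here. The function name `difference_spectrum` and the four-channel toy frames are also assumptions (the device itself uses 15 channels).

```python
def difference_spectrum(frames, peak_frame, transition_frame):
    """Return (vowel_reference, difference) from per-frame spectra.

    `frames` holds one list of channel levels per 15 ms frame. The vowel
    reference averages the peak frame and the frame after it (30 ms in total);
    the difference is taken as transition minus vowel (assumed sign).
    """
    vowel = [0.5 * (a + b)
             for a, b in zip(frames[peak_frame], frames[peak_frame + 1])]
    transition = frames[transition_frame]
    diff = [t - v for t, v in zip(transition, vowel)]
    return vowel, diff


# Hypothetical 4-channel spectra for four consecutive frames.
frames = [
    [4.0, 1.0, 0.0, 2.0],
    [3.0, 2.0, 1.0, 3.0],   # transition frame
    [2.0, 6.0, 4.0, 1.0],   # frame at the envelope peak
    [2.0, 8.0, 6.0, 1.0],   # following frame
]
vowel, diff = difference_spectrum(frames, peak_frame=2, transition_frame=1)
print(vowel, diff)
```

In this toy example the difference is positive in the channels where the transition frame carries energy that the vowel lacks, which is the kind of consonant cue spectrum 28 is meant to expose.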

Difference spectrum calculation section 14 performs one further calculation: the distance between the vowel region spectrum and the end region spectrum. If the distance is below a threshold, the spectral shape is judged not to have changed, and the input is judged to be a monosyllable containing one vowel. If the threshold is exceeded, the input is judged to be a monosyllable that roughly contains two vowels, such as "kya" (きや) or "kyu" (きゆ), or to be a monosyllable followed by a monosyllable of inherently small amplitude such as "n" (ん).
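The threshold decision can be sketched in Python as follows. Neither the distance measure nor the threshold value is given in the text, so the Euclidean distance and the value 2.0 are assumptions, along with the function names and the example spectra.

```python
import math


def spectral_distance(a, b):
    """Euclidean distance between two spectra (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def end_matches_vowel(vowel_spec, end_spec, threshold):
    """True when the end region spectrum is judged unchanged from the vowel."""
    return spectral_distance(vowel_spec, end_spec) < threshold


THRESHOLD = 2.0  # hypothetical value
vowel_spec = [2.0, 7.0, 5.0, 1.0]
steady_end = [2.0, 6.0, 5.0, 1.0]    # shape barely changes: one vowel
changed_end = [6.0, 2.0, 1.0, 4.0]   # shape changes: second vowel or appended "n"

steady = end_matches_vowel(vowel_spec, steady_end, THRESHOLD)
changed = end_matches_vowel(vowel_spec, changed_end, THRESHOLD)
print(steady, changed)
```

The boolean result is what later selects between the single-vowel and two-vowel candidate groups during pattern matching.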

In addition to the above calculation results, the data recorded in pattern memory 20 are summarized as follows.

(1) Spectrum of the vowel region
(2) Spectrum of the transition region
(3) Difference spectrum of (1) and (2)
(4) Spectrum of the end region, if required by the result of the similarity calculation between the end region spectrum and the vowel region spectrum

The calculation in pattern matching section 21 using these four kinds of characteristics is as follows. The recognition calculation follows the conventional pattern matching method. First, the similarity (distance) between the vowel region spectrum recorded in pattern memory 20 and the standard spectra of the five vowels "a, i, u, e, o" (あいうえお) and of "n" (ん) is calculated, and the most similar (closest) vowel is selected. Once a vowel has been selected, the transition region spectrum and the difference spectrum are compared with the consonant series of that vowel. For example, if "a" is selected, the similarity to the standard transition region spectra and the standard difference spectra prepared in advance is calculated over the series "a, ka, sa, ta, na, ...". The results for the two characteristics are then combined to select one monosyllable. If, as described above, the end region spectrum matches the vowel region spectrum, the calculation is performed against the group of monosyllables formed with a single vowel, in this example the 15 monosyllables "a, ka, sa, ta, na, ..., ba, pa". If they do not match, the calculation is performed against the group of 12 monosyllables that roughly contain two vowels, "kya, sha, cha, ..., pya" (きや, しや, ちや, ..., ぴや), and one monosyllable is selected.
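The two-stage matching can be outlined with the Python sketch below. The text says only that the results for the two characteristics are combined, so summing the two distances is an assumption, as are the Euclidean metric, the dictionary layout, the group names `one_vowel` and `two_vowel`, and the two-channel toy reference spectra.

```python
import math


def dist(a, b):
    """Euclidean distance between two spectra (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(pattern, vowel_refs, syllable_refs):
    """Return (vowel, syllable) chosen by the two-stage matching."""
    # Stage 1: pick the closest standard vowel spectrum.
    vowel = min(vowel_refs, key=lambda v: dist(pattern["vowel"], vowel_refs[v]))
    # The end-region result selects which candidate group is searched.
    group_name = "one_vowel" if pattern["end_matches_vowel"] else "two_vowel"
    group = syllable_refs[vowel][group_name]

    # Stage 2: combine the transition and difference distances (summed here).
    def score(name):
        ref = group[name]
        return (dist(pattern["transition"], ref["transition"])
                + dist(pattern["difference"], ref["difference"]))

    return vowel, min(group, key=score)


# Hypothetical two-channel reference data for illustration only.
vowel_refs = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
syllable_refs = {
    "a": {
        "one_vowel": {
            "ka": {"transition": [0.5, 0.5], "difference": [-0.5, 0.5]},
            "sa": {"transition": [0.2, 0.9], "difference": [-0.8, 0.9]},
        },
        "two_vowel": {
            "kya": {"transition": [0.4, 0.6], "difference": [-0.6, 0.6]},
        },
    },
    "i": {"one_vowel": {}, "two_vowel": {}},
}
pattern = {
    "vowel": [0.9, 0.1],
    "transition": [0.3, 0.8],
    "difference": [-0.6, 0.7],
    "end_matches_vowel": True,
}
result = recognize(pattern, vowel_refs, syllable_refs)
print(result)
```

Selecting the vowel first keeps the second stage small: only the consonant series of one vowel is searched instead of every monosyllable.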

[Effect of the invention]

According to the present invention, the spectrum of the transition region, which lies between the consonant and the vowel of a monosyllable, is obtained by a relatively simple method. Strong data for monosyllable recognition can therefore be obtained under nearly identical conditions regardless of the amplitude of the monosyllable, and recognition accuracy is improved.

In the region moving from the transition region into the vowel, the overall shape of the spectrum is that of the vowel spectrum with the influence of the consonant spectrum superimposed on it. The difference spectrum of the present invention therefore extracts only the influence of the consonant spectrum, and provides a better method than identifying the monosyllable from the raw spectral shape itself.

Furthermore, compared with the conventional method of obtaining a time series of spectra from consonant to vowel and comparing it with time-series standard spectra, pattern matching on the difference spectrum requires far less computation and is therefore economical.

[Brief explanation of drawings]

FIG. 1 is a block diagram of the recognition device of the present invention. FIG. 2 shows a conventional method of dividing consonants and vowels. FIG. 3 shows another conventional method of dividing consonants and vowels. FIG. 4 shows the problem in consonant region detection. FIG. 5 is a block diagram of the main parts of the present invention. FIG. 6 illustrates the calculation method applied to the speech envelope. FIG. 7 is a flowchart of the calculation procedure applied to the speech envelope. FIG. 8 illustrates the difference spectrum of the present invention.

9: envelope forming circuit
10, 13: memories
11: envelope parameter calculation section
14: difference spectrum calculation section
15: bandpass filter bank
16: envelope forming circuit
17: multiplexer
18, 19: A/D converters
20: pattern memory
21: pattern matching section
22: standard pattern memory
23: output buffer
24: controller

Claims

[Claims]

1. A speech recognition device that frequency-analyzes input speech to generate a plurality of spectral patterns ranging from a consonant to a vowel, and recognizes the input speech by calculating the degree of similarity with standard spectral patterns stored in advance, the device comprising: means for forming an envelope of the speech signal; means for deriving, from the amplitude data of the envelope, the temporal positions of a vowel region, defined as a predetermined time interval from the time point showing the maximum amplitude value, and of a transition region, defined as a region approximately halfway between the start point of the monosyllable and the time point showing the maximum amplitude value; means for storing the temporal positions; means for individually detecting at least two types of spectra, namely those of the vowel region and the transition region; and means for determining a difference spectrum between the vowel region spectrum and the transition region spectrum, wherein at least the two types of spectra and the difference spectrum are used as data for recognition.