JPH0289099A

JPH0289099A - Voice recognizing device

Info

Publication number: JPH0289099A
Application number: JP63240250A
Authority: JP
Inventors: Toru Ueda; 徹上田; Shin Kamiya; 伸神谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1988-09-26
Filing date: 1988-09-26
Publication date: 1990-03-29

Abstract

PURPOSE: To recognize syllables correctly regardless of changes in the speaker's speaking speed, by calculating a syllable section estimation parameter for each input speech from a feature quantity extracted from that speech, estimating an average syllable length for each input speech from the parameter, and segmenting syllable sections based on the average syllable length and the feature quantity. CONSTITUTION: When speech is input, a feature extracting part 3 extracts a feature quantity from it. An average syllable length estimating part 4 calculates the syllable section estimation parameter for each input speech from the extracted feature quantity and estimates the average syllable length of each input speech from the calculated parameter. A syllable section extracting part 5 segments syllable sections based on the estimated average syllable length and the feature quantity. Syllable sections are therefore segmented using an average syllable length estimated for each input speech, so syllables are recognized correctly regardless of the speaker's speaking speed.

Description

DETAILED DESCRIPTION OF THE INVENTION <Industrial Application Field> The present invention relates to an improvement in a speech recognition device that recognizes input speech in units of syllables.

<Conventional Technology> Speech recognition devices of the following kind are known.

A feature quantity is extracted from the input speech, syllable sections are cut out based on the extracted feature quantity and an average syllable length, and the similarity between the feature quantity of each cut-out syllable section and the feature quantities of standard patterns stored in advance in a storage unit is calculated, so that the speech is recognized syllable by syllable. The average syllable length used here is either a fixed value set and stored in advance, or a value calculated from the speech section lengths and syllable counts of a plurality of words input in the past.

<Problems to be Solved by the Invention> As described above, in the conventional speech recognition device the average syllable length used for recognizing syllables is either a fixed value set and stored in advance or a value calculated from the speech section lengths and syllable counts of a plurality of words input in the past, which gives rise to the following problem. A speaker who is accustomed to the speech recognition device speaks at a constant speed, so no problem arises. A speaker who is not accustomed to it, however, speaks at a speed that varies from utterance to utterance. The average syllable length used as described above can then differ greatly from the syllable length of the speech actually input, so that syllable sections cannot be determined correctly.

Accordingly, an object of the present invention is to provide a speech recognition device that can recognize syllables correctly without being affected by the speaker's speaking speed and the like, by estimating the average syllable length for each input speech.

<Means for Solving the Problems> To achieve the above object, the present invention provides a speech recognition device that extracts a feature quantity from input speech, cuts out syllables based on the extracted feature quantity, and recognizes syllables by calculating similarity syllable by syllable, the device being characterized by comprising: a syllable length estimation unit that calculates a syllable section estimation parameter for each input speech based on the feature quantity extracted from the input speech and estimates an average syllable length for each input speech based on that parameter; and a syllable section extraction unit that cuts out syllable sections based on the estimated average syllable length and the feature quantity.

In this speech recognition device, the syllable length estimation unit also calculates, for each input speech, a syllable section estimation parameter that combines the power change and the spectral change of the feature quantity.

Further, the syllable section estimation parameter is a parameter whose value is larger at the transition from a consonant to a vowel than elsewhere.

<Operation> When speech is input, a feature quantity is extracted from it. Based on this extracted feature quantity, the syllable length estimation unit calculates a syllable section estimation parameter for each input speech and estimates the average syllable length for each input speech from the calculated parameter. The syllable section extraction unit then cuts out syllable sections based on the estimated average syllable length and the feature quantity. Syllable sections are therefore cut out based on an average syllable length estimated for each input speech.

In this speech recognition device, the syllable length estimation unit also calculates, for each input speech, a syllable section estimation parameter that combines the power change and the spectral change of the feature quantity, and the average syllable length is estimated from this parameter. The average syllable length is therefore estimated accurately.

Further, this speech recognition device calculates, for each input speech, a syllable section estimation parameter whose value is larger at the transition from a consonant to a vowel than elsewhere, and syllable sections are cut out on that basis. The average syllable length is therefore estimated from the consonant-to-vowel transition points, and syllables are cut out more accurately.

<Example> The present invention will now be described in detail with reference to the illustrated embodiment.

FIG. 1 is a block diagram of the speech recognition device of the present invention. Reference numeral 1 denotes a microphone for capturing uttered speech, and 2 denotes an amplifier that amplifies only the speech band of the input speech.

The feature extraction unit 3 extracts, from the speech input through the microphone 1 and amplified by the amplifier 2, the feature quantities needed for cutting out syllable sections and calculating similarity (for example, the power of short sections (5 ms to 20 ms) and the outputs of a 16-band filter bank). The average syllable length estimation unit 4 calculates a syllable section estimation parameter from the feature quantities extracted by the feature extraction unit 3, as described in detail later, and estimates the average syllable length from this parameter. The syllable section extraction unit 5 cuts out syllable sections based on the feature quantities extracted by the feature extraction unit 3 and the average syllable length estimated by the average syllable length estimation unit 4.
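The short-section power mentioned above can be illustrated with a minimal sketch. It assumes a mono signal held as a list of floats and non-overlapping frames; the function name, the 10 ms default, and the mean-square definition of power are illustrative assumptions, and the 16-band filter bank is not shown.

```python
def frame_power(samples, sample_rate_hz, frame_ms=10.0):
    """Return the mean squared amplitude of each non-overlapping frame.

    Assumption: "power" is taken as mean square; the patent does not define it.
    """
    frame_len = max(1, int(sample_rate_hz * frame_ms / 1000.0))
    powers = []
    # Step through the signal one whole frame at a time; a trailing partial
    # frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers
```

With a 1 kHz sample rate and 10 ms frames, each frame covers 10 samples, so a 40-sample signal yields four power values.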

The similarity calculation unit 6 calculates the similarity between the feature quantities in each syllable section extracted by the syllable section extraction unit 5 and the feature quantities of the standard patterns registered in advance in the standard pattern memory 7, and recognizes syllables based on the result. The interface unit 9 is used for exchanges with an external device (not shown) such as a word processor. The CPU 8 controls the feature extraction unit 3, the average syllable length estimation unit 4, the syllable section extraction unit 5, the similarity calculation unit 6, and the interface unit 9 to carry out the speech recognition operation.
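The document does not say which similarity measure the similarity calculation unit 6 uses. As a rough sketch only, the following assumes fixed-length feature vectors and uses the negative squared Euclidean distance as the similarity score; the function name and the dictionary layout of the standard patterns are hypothetical.

```python
def recognize_syllable(section_features, standard_patterns):
    """Return the label of the standard pattern most similar to the section.

    Assumption: similarity = negative squared Euclidean distance between
    equal-length feature vectors. The patent leaves the measure unspecified.
    """
    best_label, best_score = None, float("-inf")
    for label, pattern in standard_patterns.items():
        score = -sum((a - b) ** 2 for a, b in zip(section_features, pattern))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A real system would also need to handle syllable sections of different durations, for example by time normalization, which this sketch leaves out.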

Here, as the syllable section estimation parameter calculated by the average syllable length estimation unit 4, a parameter that combines the power change and the spectral change of the feature quantities extracted by the feature extraction unit 3 is used (PWSP = f(POW, SPEC), where POW is the power and SPEC is the spectral change). Equation (1) shows the specific form of the syllable section estimation parameter PWSP used in this embodiment.

$$\mathrm{PWSP} = \frac{S_1}{S_2} \times \frac{P_2 - P_1}{P_1 + K} \qquad (1)$$

where

S1: spectral change in an 8-frame window at time t.

S2: spectral change in a 4-frame window at time (t-4).

P1: power at time (t-3).

P2: power at time t.

K: a constant.
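Equation (1) can be evaluated frame by frame as in the sketch below. It assumes the per-frame power and the two windowed spectral-change series have already been computed (the document does not define how spectral change is measured); the function names, the default K, the small `eps` guard against division by zero, and the zero output for the first four frames are all assumptions made for illustration.

```python
def pwsp_value(s1, s2, p1, p2, k):
    """Equation (1): PWSP = (S1 / S2) * ((P2 - P1) / (P1 + K))."""
    return (s1 / s2) * ((p2 - p1) / (p1 + k))


def pwsp_series(power, spec8, spec4, k=1.0, eps=1e-9):
    """Evaluate PWSP at every frame t.

    power[t]: power at frame t
    spec8[t]: spectral change in an 8-frame window at frame t (S1 at time t)
    spec4[t]: spectral change in a 4-frame window at frame t (S2 uses t - 4)
    """
    out = []
    for t in range(len(power)):
        if t < 4:
            # Not enough history for the (t - 4) and (t - 3) terms.
            out.append(0.0)
            continue
        s2 = max(spec4[t - 4], eps)  # assumed guard, not part of Equation (1)
        out.append(pwsp_value(spec8[t], s2, power[t - 3], power[t], k))
    return out
```

A rise in power (P2 > P1) that coincides with a large current spectral change (S1 > S2) gives a large positive value, which matches the behaviour the text attributes to PWSP.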

The syllable section estimation parameter PWSP captures points where the power is rising and the spectral change is large. It therefore often takes a large value at the transition from the consonant to the vowel of a syllable.

FIG. 2 shows how the power and the syllable section estimation parameter PWSP change when a speaker utters /tana/.

As this figure shows, when /tana/ is uttered, the syllable section estimation parameter PWSP has peaks at the beginning of the syllable /ta/ (A) and at the consonant-to-vowel transition of the syllable /na/ (B). When the PWSP obtained in this way for the input speech has two or more peaks, the interval between the PWSP peaks corresponding to a syllable (that is, the interval Q between A and B) is taken as the average syllable length of the syllable /ta/.
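The peak-interval idea can be sketched as follows. The document does not describe how peaks are detected, so the simple local-maximum test and the `threshold` parameter here are assumptions; intervals are measured in frames.

```python
def find_peaks(values, threshold):
    """Return indices of local maxima whose value is at least `threshold`."""
    peaks = []
    for t in range(1, len(values) - 1):
        if (values[t] >= threshold
                and values[t] > values[t - 1]
                and values[t] >= values[t + 1]):
            peaks.append(t)
    return peaks


def peak_intervals(peaks):
    """Return the distances between consecutive peaks (for example, A to B)."""
    return [b - a for a, b in zip(peaks, peaks[1:])]
```

For a PWSP trace with one peak at the start of /ta/ and one at the consonant-to-vowel transition of /na/, `peak_intervals` returns a single value corresponding to the interval Q.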

However, when the interval between PWSP peaks is extremely long, two or more syllables may lie between the peaks, so the average syllable length is not estimated from that interval. In this case, the average syllable length estimated for the immediately preceding syllable, for example, is used as the average syllable length of the syllable for which no estimate is made. This relies on the conventionally known fact that the interval between consonant-to-vowel transitions varies less than the interval between vowel-to-consonant transitions. That is, as shown in FIG. 2, the average syllable length Q of the syllable /ta/ estimated from PWSP is approximately equal to the average syllable length of the syllable /na/, so if no estimate is obtained for /na/, the average syllable length Q of the immediately preceding syllable /ta/ can be used as the average syllable length of /na/.
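The fallback described here can be sketched as a single pass over the peak intervals. The cutoff `max_plausible` for "extremely long" and the `initial_length` used when the very first interval is rejected are assumed parameters that the document does not specify.

```python
def per_syllable_lengths(intervals, max_plausible, initial_length):
    """Assign an average syllable length to each peak interval.

    An interval longer than `max_plausible` may span two or more syllables,
    so it is not used; the most recent accepted estimate is reused instead.
    """
    lengths = []
    previous = initial_length
    for interval in intervals:
        if interval <= max_plausible:
            previous = interval  # accept this interval as the new estimate
        lengths.append(previous)
    return lengths
```

In the /tana/ example, if no usable estimate is found for /na/, the value Q obtained for /ta/ is carried forward in exactly this way.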

When many PWSP peaks appear in a section presumed to be a single syllable section, the intervals between peaks that are plausible as a syllable length (judged, for example, from a preset number of input syllables per second) are selected and averaged to obtain the average syllable length.
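Selecting and averaging plausible intervals might look like the sketch below. It assumes that every pair of peaks is a candidate interval and that the plausibility bounds `min_len` and `max_len` are supplied by the caller; the document states neither detail, so both are illustrative choices.

```python
from itertools import combinations


def average_plausible_interval(peaks, min_len, max_len):
    """Average the peak-to-peak intervals that fall within [min_len, max_len].

    Returns None when no interval is plausible as a syllable length.
    """
    # Assumption: consider every pair of peaks, not only adjacent ones, so
    # that spurious peaks inside a syllable do not hide the true interval.
    intervals = [b - a for a, b in combinations(sorted(peaks), 2)]
    plausible = [d for d in intervals if min_len <= d <= max_len]
    if not plausible:
        return None
    return sum(plausible) / len(plausible)
```

Returning `None` when nothing qualifies leaves the caller free to fall back on an earlier estimate.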

Using a composite parameter of the power change and the spectral change extracted by the feature extraction unit 3 as the syllable section estimation parameter in this way allows the average syllable length to be estimated more accurately and stably than using a single feature quantity such as the power change or the spectral change alone.

Next, the cutting out of syllable sections by the syllable section extraction unit 5 is described using specific examples of speech input.

FIG. 3 shows the power change when /ringo/ is uttered into the microphone 1 at the optimal speaking speed. Below the power change plot, the upper row (a) shows the average syllable length of the conventional method (a preset value) and the syllable sections cut out using it, and the lower row (b) shows the average syllable length of this embodiment (the value calculated for each input speech by the average syllable length estimation unit 4 as described above) and the syllable sections cut out using it.

First, syllable boundary candidates C, D, E, F, and G are detected based on the feature quantities extracted by the feature extraction unit 3. From among the detected candidates C, D, E, F, and G, the syllable boundary positions C, F, and G are then determined based on changes in the feature quantities (such as the abrupt power changes shown in FIG. 3 and spectral changes). The syllable boundary positions determined in this way are shown by vertical solid lines. The syllable boundary candidates D and E, which could not be determined from the feature quantities alone, are shown by broken lines.

Next, based on the average syllable length calculated by the average syllable length estimation unit 4 as described above, syllable boundary positions are determined from among the candidates D and E that were not yet determined. As a result, with the preset average syllable length of the conventional method, the syllable boundary position E is determined as shown in FIG. 3(a). With the average syllable length determined for each input speech according to this embodiment, the syllable boundary position E is likewise determined as shown in FIG. 3(b).

In this case the speaker is speaking at the optimal speed, so the average syllable length preset by the conventional method and the average syllable length estimated for each input speech by this embodiment are almost the same. Moreover, both are almost the same as the actual syllable length. The syllable boundary position determined from either average syllable length is therefore the same position E.

FIG. 4 shows the power change when /mikan/ is uttered at about twice the normal speaking speed. As in FIG. 3, below the power change plot the upper row (a) shows the average syllable length of the conventional method (a preset value) and the syllable sections cut out using it, and the lower row (b) shows the average syllable length of this embodiment (the value calculated for each input speech by the average syllable length estimation unit 4) and the syllable sections cut out using it.

As in FIG. 3, syllable boundary candidates H, I, J, K, and L are detected based on the feature quantities extracted by the feature extraction unit 3. The syllable boundary positions H, I, J, and L, determined from among the detected candidates based on abrupt changes in the feature quantities, are shown by vertical solid lines. The syllable boundary candidate K, which could not be determined from the feature quantities alone, is shown by a broken line.

Next, based on the average syllable length calculated by the average syllable length estimation unit 4, the syllable boundary candidate K that was not yet determined is examined. As a result, with the preset average syllable length of the conventional method, the candidate K is not determined as a syllable boundary position, as shown in FIG. 4(a). With the average syllable length determined for each input speech according to this embodiment, the candidate K is determined as a syllable boundary position, as shown in FIG. 4(b).

In this case the speaker is speaking at about twice the normal speed, so the average syllable length preset by the conventional method is about twice the syllable length of the actual input speech. The candidate K is therefore not determined as a syllable boundary. The average syllable length estimated for each input speech by this embodiment, by contrast, is almost the same as the actual syllable length, so the candidate K is determined as a syllable boundary position.
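The document does not give the exact rule by which an undetermined candidate is accepted or rejected using the average syllable length. The sketch below assumes one plausible rule: a section between two confirmed boundaries is split only if it is at least `split_ratio` times the average syllable length, and the candidate that leaves the left part closest to one average length is accepted. The function name, `split_ratio`, and the frame-index inputs are all hypothetical.

```python
def confirm_candidates(confirmed, candidates, avg_len, split_ratio=1.5):
    """Return the undetermined candidates accepted as syllable boundaries.

    Assumed rule (not from the patent): split a section only when it is
    clearly longer than one average syllable.
    """
    boundaries = sorted(confirmed)
    accepted = []
    for left, right in zip(boundaries, boundaries[1:]):
        if right - left < split_ratio * avg_len:
            continue  # section is already about one syllable long
        inside = [c for c in candidates if left < c < right]
        if inside:
            # Pick the candidate that makes the left part closest to avg_len.
            best = min(inside, key=lambda c: abs((c - left) - avg_len))
            accepted.append(best)
    return accepted
```

With confirmed boundaries at frames 0, 10, 20, and 40 and one candidate at frame 30, an estimated average length of 10 frames accepts the candidate, while a preset length of 20 frames (twice the actual length, as in the fast-speech case) rejects it.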

Consequently, speech that the conventional method incorrectly cuts into two syllables is correctly cut into the three syllables /mi/, /ka/, and /n/ by this embodiment.

The same thing happens when the conventional method uses, as its average syllable length, an average syllable length obtained from a plurality of words input in the past.

In short, unless the average syllable length is obtained for each input speech, a change in the speaker's speaking speed inevitably produces a difference between the average syllable length and the syllable length of the speech actually input, and syllable sections may not be cut out correctly.

In the above embodiment, syllable boundary positions are determined using only the average syllable length estimated for each input speech by the average syllable length estimation unit 4.

However, the present invention is not limited to this. In recognizing actual input speech, it is sometimes difficult in practice to estimate the average syllable length for each input speech. In such cases, the syllable boundary positions may be determined using an average syllable length obtained from a plurality of words input in the past, and once it again becomes possible to estimate the average syllable length for each input speech, the average syllable length may be estimated by the average syllable length estimation unit 4.
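This fallback can be sketched as follows. It assumes the history is kept as (speech section length, syllable count) pairs and that the history-based average is total length divided by total syllable count; the function names and the use of `None` to signal a failed per-utterance estimate are illustrative assumptions.

```python
def history_average(past_words):
    """Average syllable length from past words.

    past_words: list of (speech_section_length, syllable_count) pairs.
    Assumption: average = total length / total syllable count.
    """
    total_len = sum(length for length, _ in past_words)
    total_syl = sum(count for _, count in past_words)
    return total_len / total_syl


def choose_average_length(estimated, past_words):
    """Prefer the per-utterance estimate; fall back to the history average."""
    if estimated is not None:
        return estimated
    return history_average(past_words)
```

As soon as a per-utterance estimate becomes available again, `choose_average_length` returns it in place of the history value.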

<Effects of the Invention> As is clear from the above, the speech recognition device of the present invention has a syllable length estimation unit and a syllable section extraction unit, calculates a syllable section estimation parameter for each input speech based on the feature quantity extracted from the input speech, estimates the average syllable length for each input speech based on this parameter, and cuts out syllable sections based on the estimated average syllable length and the feature quantity. It can therefore recognize syllables correctly without being affected by changes in the speaker's speaking speed.

In addition, in the speech recognition device of the present invention, the syllable length estimation unit calculates, for each input speech, a syllable section estimation parameter that combines the power change and the spectral change of the feature quantity, and estimates the average syllable length for each input speech based on this parameter, so syllable sections can be cut out accurately.

Further, the speech recognition device of the present invention calculates, for each input speech, a syllable section estimation parameter whose value is larger at the transition from a consonant to a vowel than elsewhere, so it can estimate the average syllable length more accurately and stably and recognize syllables.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of the speech recognition device of the present invention, FIG. 2 is a diagram showing an example of changes in the syllable section estimation parameter for actual input speech, and FIGS. 3 and 4 are explanatory diagrams of syllable recognition results obtained with the speech recognition device. 1: microphone, 2: amplifier, 3: feature extraction unit, 4: average syllable length estimation unit, 5: syllable section extraction unit, 6: similarity calculation unit, 7: standard pattern memory, 8: CPU, 9: interface unit.

Claims

[Claims]

(1) A speech recognition device that extracts a feature quantity from input speech, cuts out syllables based on the extracted feature quantity, and recognizes syllables by calculating similarity on a syllable basis, the device comprising: a syllable length estimation unit that calculates a syllable section estimation parameter for each input speech based on the feature quantity extracted from the input speech and estimates an average syllable length for each input speech based on the syllable section estimation parameter; and a syllable section extraction unit that cuts out syllable sections based on the estimated average syllable length and the feature quantity.

(2) The syllable length estimation unit uses the power and spectrum of the feature extracted for each input voice to calculate a parameter for estimating a syllable interval, which is a combination of power change and spectrum change, for each input voice. A speech recognition device according to claim 1 that calculates.

(3) The syllable interval estimation parameter calculated by the syllable length estimation unit, which is a combination of the power change and the spectrum change, has a value larger in the transition from a consonant to a vowel than in other parts. Speech recognition device described in section.