JPH0564800B2


Info

Publication number: JPH0564800B2
Application number: JP61216180A
Authority: JP
Inventors: Tooru Ueda; Hiroyuki Iwahashi
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1986-09-13
Filing date: 1986-09-13
Publication date: 1993-09-16
Also published as: JPS6370899A

Description

DETAILED DESCRIPTION OF THE INVENTION

<Field of Industrial Application> The present invention relates to a speech recognition device that recognizes input speech, such as Japanese, in syllable units and outputs the recognition result to an external device.

<Prior Art> In conventional speech recognition devices, the syllable sections of the syllables that serve as recognition units are extracted from input speech by detecting syllable boundaries from the change in the speech spectrum (hereinafter simply called the spectrum) over a fixed interval.

<Problems to Be Solved by the Invention> However, in the conventional speech recognition device described above, the spectrum of the input speech contains a mixture of rapidly changing and gently changing portions. It is difficult to follow both and detect syllable boundaries accurately, so speech is often misrecognized.

Accordingly, an object of the present invention is to provide a syllable recognition device that uses information from the spectrum to detect syllable boundaries accurately, regardless of whether the spectrum changes gently or rapidly.

<Means for Solving the Problems> To achieve the above object, the speech recognition device of the present invention is a speech recognition device that extracts syllable sections from input speech, calculates the similarity between the feature pattern of each extracted syllable and feature standard patterns stored in advance in a memory, and thereby recognizes the input speech in syllable units. It is characterized by comprising a stable point extraction unit that extracts a stable point of the spectral information from the change in spectral information in the input speech following an extracted syllable section, and a syllable boundary detection unit that sequentially calculates the similarity between the spectral information at the extracted stable point and the spectral information after the stable point, and compares the calculated similarity value with a predetermined value to detect the syllable boundary of the next syllable section to be extracted.

<Operation> When speech is input, the stable point extraction unit extracts a stable point on the spectral information from the change in spectral information in the input speech following the extracted syllable section. The syllable boundary detection unit then sequentially calculates the similarity between the spectral information at the extracted stable point and the spectral information after the stable point. By comparing the resulting similarity value with a predetermined value, it detects the syllable boundary of the next syllable section to be extracted, and the syllable section is extracted according to the detected boundary. Meaningful syllable sections can therefore be extracted accurately, regardless of whether the spectral information changes gently or rapidly.

<Embodiment> The present invention is described in detail below with reference to the illustrated embodiment.

In FIG. 1, 1 is a microphone for inputting speech, 2 is an amplifier that amplifies only the speech band of the speech input from the microphone 1, and 3 is a feature extraction unit that receives the output of the amplifier 2.

The feature extraction unit 3 extracts, every 8 ms, the feature parameters of a 16 ms section (hereinafter called a frame) of the speech amplified by the amplifier 2. The feature parameters comprise the one-syllable feature pattern (for example, the outputs of a 16-channel band filter) used in the similarity calculation for final speech recognition performed by the matching unit 7, and the parameters (for example, power and the first-order autocorrelation coefficient) used by the phoneme classification unit 4 for phoneme classification. The phoneme classification unit 4 uses the feature parameters to attach to each frame of the speech a label indicating the nature of the speech in that frame. The boundary detection unit 5 extracts stable points on the spectrum by a method described in detail later and detects syllable boundaries based on those stable points.
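The framing described above (a 16 ms frame taken every 8 ms) can be sketched as follows. This is only an illustration: the sampling rate of 8000 Hz and the function name `split_into_frames` are assumptions, since the patent does not state how the signal is sampled.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=16, hop_ms=8):
    """Split a sample sequence into overlapping frames.

    frame_ms and hop_ms follow the text (16 ms frames every 8 ms).
    sample_rate_hz is an assumed value, not taken from the patent.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    hop_len = sample_rate_hz * hop_ms // 1000
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

With these assumed values each frame holds 128 samples and consecutive frames overlap by half, so every sample (except near the ends) contributes to two frames.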

The syllable section extraction unit 6 extracts syllable sections from the input speech using the time series of labels obtained by the phoneme classification unit 4 and the syllable boundary information obtained by the boundary detection unit 5. The matching unit 7 then recognizes the speech by calculating the Euclidean distance, one example of a similarity measure, between the feature pattern extracted by the feature extraction unit 3 for one syllable section extracted by the syllable section extraction unit 6 and the feature standard patterns stored in advance in the feature standard pattern memory 8. The CPU 9 controls the feature extraction unit 3, the boundary detection unit 5, the syllable section extraction unit 6, and the matching unit 7, and also controls an interface 10 that outputs the recognition result from the matching unit 7 to an external device (not shown).

The speech recognition device configured as above operates as follows.

When a speaker utters speech into the microphone 1, the speech enters through the microphone 1, only its speech band is amplified by the amplifier 2, and it is sent to the feature extraction unit 3. The feature extraction unit 3 divides the speech into 16 ms frames at 8 ms intervals and extracts the feature parameters of each frame. The feature parameters comprise the one-syllable feature pattern (for example, the outputs of a 16-channel band filter) used in the similarity calculation for final speech recognition performed by the matching unit 7, and the parameters (for example, power and the first-order autocorrelation coefficient) used by the phoneme classification unit 4 for phoneme classification. The phoneme classification unit 4 uses the feature parameters obtained by the feature extraction unit 3 to attach to each frame a label indicating the nature of the speech in that frame. Four labels are used in this embodiment: vowel (symbol 'V'), fricative (symbol 'F'), buzz bar (symbol 'B'), and silence (symbol '.').

The boundary detection unit 5 extracts a stable point from the change in the spectrum obtained for each frame. It then detects the syllable boundary of the syllable to be extracted by calculating the Euclidean distance, which represents the similarity, between the spectral pattern of the frame at the stable point and the spectral patterns of the speech frames input after the stable point. FIG. 2 is a flowchart of the process from extraction of the stable point to detection of the syllable boundary. The right side of the figure is the flow for stable point extraction, and the left side is the flow for syllable boundary detection. The procedure for extracting the stable point and detecting the syllable boundary is described in detail below with reference to FIG. 2.

The variables are defined as follows:
i, j: temporary variables
N: constant giving the order of the pattern
t: frame number
ta: frame number of the stable point
PAT(i): i-th feature value of the feature pattern at the stable point
D(t): spectral change distance at frame t
SP(t)(i): i-th feature value of the input pattern at frame t
L: constant giving the length of the window used to calculate the spectral change; the window length is 2L+1
M: constant giving the length of the window used to find a stable point; the window length is 2M+1
DIS: distance between the feature pattern at the stable point and the feature pattern of the input frame
ANTFLG: boundary detection flag, set according to the above distance from the stable point

Now, when the input pattern SP(t)(i) of a frame t on the spectrum (called the current frame) is input, step S1 determines whether a stable point pattern exists, that is, whether a stable point has already been extracted and its spectral pattern captured. This test relies on the fact that a captured stable point spectral pattern is never 0 in every order. If PAT(i) = 0 for all i = 1, ..., N, no stable point pattern has been extracted yet, and the process proceeds to step S2 to search for a stable point. Otherwise a stable point has already been extracted, and the process proceeds to step S5.

Step S2 checks the stability of the current frame. Let the spectral change D(t) at the current frame t be

$$D(t) = \sum_{i=1}^{N} \bigl( \mathrm{SP}(t-L)(i) - \mathrm{SP}(t+L)(i) \bigr)^2$$

The current frame t is judged to be stable when D(t) satisfies

$$D(t) = \min_{j} D(t+j), \qquad j = -M, -M+1, \ldots, 0, \ldots, M \qquad (1)$$

and the process then proceeds to step S3. When D(t) does not satisfy equation (1), the current frame t is not stable, and the process returns to step S1 to process the next frame.
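The stability check of step S2 can be sketched in a few lines. This is an illustrative reading, assuming `sp` is a list of per-frame feature vectors and that equation (1) means D(t) is the smallest spectral change within the window of 2M+1 frames centred on t. The function names and the treatment of frames near the ends of the input are assumptions, not part of the patent.

```python
def spectral_change(sp, t, L):
    """D(t): squared Euclidean distance between frames t-L and t+L."""
    return sum((a - b) ** 2 for a, b in zip(sp[t - L], sp[t + L]))


def is_stable(sp, t, L, M):
    """True if D(t) is the minimum of D(t+j) for j = -M, ..., M."""
    # Assumption: frames without full context on both sides are never stable.
    if t - M - L < 0 or t + M + L >= len(sp):
        return False
    d_t = spectral_change(sp, t, L)
    return all(d_t <= spectral_change(sp, t + j, L) for j in range(-M, M + 1))
```

Note that deciding whether frame t is stable requires frames up to t+L+M, so a real-time implementation would lag the input by that many frames.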

In step S3, to avoid adopting as a stable point a point where the spectral change is very large, the spectral change D(t) of the stable frame t found in step S2 is compared with the set value THDIS2. If D(t) is smaller than THDIS2, the process proceeds to step S4. If D(t) is greater than or equal to THDIS2, the current frame t cannot be adopted as a stable point, and the process returns to step S1.

In step S4, the spectral feature pattern of the frame ta adopted as the stable point is set as the stable point pattern PAT(i):

$$\mathrm{PAT}(i) = \mathrm{SP}(\mathit{ta})(i), \qquad i = 1, \ldots, N$$

This completes extraction of the stable point, and the process returns to step S1.

In step S5, the distance DIS between the stable point pattern of the extracted stable point and the spectral feature pattern of the current frame t is calculated as

$$\mathrm{DIS} = \sum_{i=1}^{N} \bigl( \mathrm{PAT}(i) - \mathrm{SP}(t)(i) \bigr)^2$$

and the process proceeds to step S6.

Step S6 determines whether the distance DIS obtained in step S5 is greater than the set value THDIS1, that is, whether the similarity is low or high. If DIS is less than or equal to the set value, the stable point pattern and the feature pattern of the current frame are similar, so the current frame cannot be adopted as a syllable boundary point, and the process returns to step S1. If DIS is greater than the set value, the current frame is taken to be a syllable boundary point, and the process proceeds to step S7.

In step S7, reached when step S6 has determined that DIS > THDIS1 and a syllable boundary has been detected, the syllable boundary detection flag ANTFLG is set (ANTFLG = 1), and the process proceeds to step S8.

In step S8, because detection of the syllable boundary of the syllable to be extracted is complete, the stable point pattern PAT(i) used for boundary detection is cleared (PAT(i) = 0 for i = 1, ..., N). The process then returns to step S1 to extract the stable point and detect the syllable boundary of the next syllable.
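Steps S1 to S8 can be put together in one loop, sketched below under stated assumptions. The sketch works offline on a complete list `sp` of per-frame feature vectors, skips frames that lack the L+M frames of context on either side, and records boundary frame numbers in a list instead of setting ANTFLG. The function name and these choices are illustrative only. The threshold names `thdis1` and `thdis2` mirror THDIS1 and THDIS2, whose values the patent does not give.

```python
def detect_boundaries(sp, L, M, thdis1, thdis2):
    """Return (stable_points, boundaries) as lists of frame numbers."""
    n = len(sp[0])
    pat = [0.0] * n  # PAT(i); all zeros means no stable point is held
    stable_points, boundaries = [], []

    def change(t):
        # D(t): squared distance between frames t-L and t+L
        return sum((a - b) ** 2 for a, b in zip(sp[t - L], sp[t + L]))

    for t in range(L + M, len(sp) - L - M):
        if all(v == 0 for v in pat):                        # S1: no stable point yet
            d_t = change(t)
            is_min = all(d_t <= change(t + j)               # S2: equation (1)
                         for j in range(-M, M + 1))
            if is_min and d_t < thdis2:                     # S3: reject large changes
                pat = list(sp[t])                           # S4: PAT(i) = SP(ta)(i)
                stable_points.append(t)
        else:
            dis = sum((p - x) ** 2 for p, x in zip(pat, sp[t]))  # S5: DIS
            if dis > thdis1:                                # S6: dissimilar enough
                boundaries.append(t)                        # S7: corresponds to ANTFLG = 1
                pat = [0.0] * n                             # S8: clear PAT(i)
    return stable_points, boundaries
```

For a one-dimensional input that holds at 1.0, ramps up, and then holds at 5.0, this finds a stable point on each plateau and one boundary on the ramp, which is the behaviour the text describes for gently changing spectra.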

Once a stable point of the speech has been extracted as described above and the syllable boundary of the syllable to be extracted has been detected from it, the syllable section extraction unit 6 of FIG. 1 extracts the syllable from the time series of phoneme labels obtained by the phoneme classification unit 4 and the syllable boundary information obtained by the boundary detection unit 5, following the syllable extraction flowchart shown in FIG. 3.

The variables are defined as follows:
SEG: label output by the phoneme classification unit
FRAME: number of frames of the extracted syllable
CUTFLG: extraction completion flag
ANTFLG: syllable boundary detection flag (set by the syllable boundary detection unit)
FRMCNT: frame counter
VCNT: counter of frames carrying the vowel label 'V'
THCUT: constant (10)
'V': vowel phoneme label
'F': fricative phoneme label

Step S11 determines whether CUTFLG (the syllable extraction completion flag) is set. If it is set, the process proceeds to step S12, clears CUTFLG, and then proceeds to step S13. If it is already clear, the process proceeds directly to step S13.

Step S13 determines whether the SEG (phoneme label) of the current frame is 'V'. If it is 'V', the process proceeds to step S14. If it is not, the process proceeds to step S17.

In step S14, 1 is added to FRMCNT (the frame counter) and 1 is added to VCNT (the number of frames with the vowel phoneme label 'V'), and the process proceeds to step S15.

Step S15 determines whether ANTFLG (the syllable boundary detection flag) is set. (ANTFLG is set when detection of one syllable boundary is completed in step S7 of the stable point extraction and syllable boundary detection flowchart of FIG. 2.) If it is set to 1, the process proceeds to step S16 to extract one syllable. If it is not set, the boundary of one syllable has not yet been detected, so the process returns to step S11 and processes the next frame.

In step S16, because step S15 found ANTFLG set to 1, the boundary of one syllable has been detected. The frames up to the current frame are therefore regarded as one syllable: FRMCNT, which counts the frames of the syllable up to the current frame, is transferred to FRAME (the number of frames of the extracted syllable), FRMCNT and VCNT are cleared, and the one-syllable extraction completion flag CUTFLG is set to 1. The process then returns to step S11 to perform the next syllable extraction.

In step S17, reached when the phoneme label of the current frame is not 'V', VCNT is compared with THCUT (a constant, 10 in this embodiment). If the number of vowel-labelled frames is greater than THCUT, the frames before the current frame form a meaningful syllable and the current frame is judged to be a syllable boundary, so the process proceeds to step S18 to extract one syllable. If it is less than or equal to THCUT, the process proceeds to step S19.

In step S18, the frames up to the current frame are regarded as one syllable: FRMCNT, which counts the frames of the syllable up to the current frame, is transferred to FRAME, FRMCNT and VCNT are cleared, and the one-syllable extraction completion flag CUTFLG is set to 1. The process then returns to step S11.

Step S19 determines whether the SEG of the current frame is 'F'. If it is 'F', the process proceeds to step S21. If it is not, the process proceeds to step S20.

In step S20, reached when the SEG of the current frame is neither 'V' nor 'F', the frames up to the current frame are judged not to form a meaningful syllable. FRMCNT and VCNT are cleared, and the process returns to step S11 to perform the next syllable extraction.

In step S21, reached when the phoneme label is 'F', the syllable is judged to be still continuing, so 1 is added to FRMCNT and the process returns to step S11 to process the next frame.
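The frame-by-frame flow of steps S11 to S21 can be sketched as a small loop. This is an illustration under assumptions: the boundary flag is supplied per frame as a list of booleans (the patent does not say when ANTFLG is cleared), each extracted FRAME value is appended to a list instead of being signalled through CUTFLG, and the function name is hypothetical.

```python
THCUT = 10  # constant from the text


def extract_syllables(labels, boundary_flags, thcut=THCUT):
    """Return the frame count (FRAME) of each extracted syllable.

    labels: per-frame phoneme labels such as 'V', 'F', 'B', '.'
    boundary_flags: per-frame booleans standing in for ANTFLG (assumed input)
    """
    frmcnt = 0       # FRMCNT
    vcnt = 0         # VCNT
    extracted = []   # stands in for the FRAME / CUTFLG hand-off (S11, S12)
    for seg, antflg in zip(labels, boundary_flags):
        if seg == 'V':                     # S13
            frmcnt += 1                    # S14
            vcnt += 1
            if antflg:                     # S15
                extracted.append(frmcnt)   # S16
                frmcnt = 0
                vcnt = 0
        elif vcnt > thcut:                 # S17
            extracted.append(frmcnt)       # S18
            frmcnt = 0
            vcnt = 0
        elif seg == 'F':                   # S19
            frmcnt += 1                    # S21
        else:
            frmcnt = 0                     # S20
            vcnt = 0
    return extracted
```

A run of vowel frames is therefore cut either when the boundary detector fires inside it (S15, S16) or when a non-vowel frame follows more than THCUT vowel frames (S17, S18). Shorter runs followed by a label other than 'F' are discarded.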

When the one-syllable extraction completion flag CUTFLG is set to 1 in step S16 or step S18 of the syllable extraction flowchart of FIG. 3, the matching unit 7, on a command from the CPU 9 of FIG. 1, calculates the similarity between the feature pattern of the one syllable section extracted from the input speech by the syllable section extraction unit 6 and the feature standard patterns stored in advance in the feature standard pattern memory 8. The extracted syllable is recognized as the standard syllable with the highest similarity, and the recognition result is output to the external device via the interface 10.
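The matching step can be sketched as a nearest-template search. This is a minimal illustration, assuming each syllable section has already been reduced to a fixed-length feature vector (the patent does not say how sections of different length are compared) and that the standard patterns are held in a dictionary keyed by syllable. The function name and data layout are hypothetical. Squared Euclidean distance is used because it gives the same ranking as the Euclidean distance named in the text.

```python
def recognize(feature, templates):
    """Return the label of the standard pattern closest to `feature`.

    templates: dict mapping a syllable label to its standard feature vector
    (assumed layout, for illustration only).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Smallest distance corresponds to highest similarity.
    return min(templates, key=lambda label: sq_dist(feature, templates[label]))
```

For example, with standard patterns `{"e": [1, 0], "i": [0, 1], "o": [1, 1]}`, the input `[0.9, 0.1]` is closest to the pattern for "e".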

FIG. 4 shows an example of the stable points and syllable boundary points extracted in this embodiment. From the top, it shows the time series of phoneme classification labels, a vowel sequence obtained by a method different from that of this embodiment (for reference), and the spectral change. C denotes the syllable boundary point obtained from the conventional spectral change, and A and B denote the syllable boundary points obtained in this embodiment. As FIG. 4 shows, all the phoneme classification labels are the vowel label 'V', so FIG. 4 is an example in which the syllable boundaries are detected at step S15 of the syllable extraction flowchart of FIG. 3.

That is, a stable point P₁ is set on the spectral curve by the method described above. The distance DIS between the feature pattern of each frame and the stable point pattern is then obtained from P₁, as shown by the thick curve P₁Q₁ in the figure, and at point Q₁, DIS > THDIS1 holds and syllable boundary point A is detected. Likewise, once the next stable point P₂ is set, point Q₂ is obtained from P₂ and the next syllable boundary point B is detected, so that the three syllables "e" (え), "i" (い), and "o" (お) are extracted separately. In the conventional method of detecting syllable boundaries from spectral change, syllable boundary point C is detected from the extreme point P₃ of the spectral change, so "ei" and "o" are distinguished and extracted. However, because the spectral change between "e" and "i" is gentle, no syllable boundary point is detected between them, and they cannot be extracted as separate syllables.

In this embodiment, therefore, syllable boundaries can be detected accurately even when the spectral change is so small that they cannot be extracted by the conventional spectral-change method.

<Effects of the Invention> As is clear from the above, the syllable recognition device of the present invention is provided with a stable point extraction unit that extracts a stable point of the spectral information from the change in spectral information in the input speech following the extracted speech section, and a syllable boundary detection unit that sequentially calculates the similarity between the spectral information at the extracted stable point and the spectral information after the stable point and compares the calculated similarity value with a predetermined value to detect the syllable boundary of the next syllable section to be extracted. Syllable boundaries can therefore be detected accurately and easily whether the spectral information changes gently or rapidly.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of the speech recognition device of the present invention, FIG. 2 is a flowchart of stable point extraction and syllable boundary detection, FIG. 3 is a flowchart of syllable extraction, and FIG. 4 is a diagram showing an example of the stable points and syllable boundary points extracted in the embodiment.

3: feature extraction unit, 4: phoneme classification unit, 5: boundary detection unit, 6: syllable section extraction unit, 7: matching unit, 8: feature standard pattern memory, 9: CPU.

Claims

[Scope of Claims]

1. A speech recognition device that extracts syllable sections from input speech, calculates the similarity between the feature pattern of each extracted syllable and feature standard patterns stored in advance in a memory, and recognizes the input speech in syllable units, the device comprising: a stable point extraction unit that extracts a stable point of the speech spectrum information from changes in the speech spectrum information in the input speech following an extracted syllable section; and a syllable boundary detection unit that sequentially calculates the similarity between the speech spectrum information at the extracted stable point and the speech spectrum information after the stable point, and compares the calculated similarity value with a predetermined value to detect the syllable boundary of the next syllable section to be extracted.