JPH0458635B2

Info

Publication number: JPH0458635B2
Application number: JP61121870A
Authority: JP
Inventors: Shin Kamya; Atsuo Tanaka
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1986-05-26
Filing date: 1986-05-26
Publication date: 1992-09-18
Also published as: JPS62278597A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a method for extracting phoneme standard patterns in a speech recognition device.

[Prior art]

Speech recognition is a process that acoustically analyzes speech, extracts the linguistic features it contains, and converts them into the linguistic symbols corresponding to the speech. In principle, two kinds of method are known. The first stores standard patterns of the linguistic features contained in speech in advance, compares the speech input with these standard patterns to measure their similarity, and decides on the basis of that similarity whether the input matches a standard pattern. The second uses no standard patterns; instead, it repeatedly makes either-or decisions about phoneme symbols based on the acoustic analysis of the speech input and finally makes a recognition decision at the language level.

Of these two methods, the former, which uses standard patterns, generally gives better recognition results. For example, word recognition of speech input is performed by the method shown in FIG. 8. In FIG. 8, the speech input is acoustically analyzed in terms of two acoustic features, the frequency spectrum envelope and the excitation source obtained by correlation analysis or the like, and phoneme recognition is then performed using phoneme standard patterns created in advance. In this phoneme recognition, the acoustic features are expressed as a sequence of phoneme symbols. Word recognition is performed on this sequence using a word dictionary created in advance, and the recognized word is output in the form of its linguistic symbol.

[Problems to be solved by the invention]

As described above, when phonemes are used as the basic unit of recognition in continuous speech recognition, phoneme standard patterns must be extracted in advance from word utterances recorded for registration. Conventionally, this extraction was done by visual inspection by a person skilled in speech information processing, so it took a long time and was very inconvenient.
[Object of the invention]

An object of the present invention is to solve the above problem and to provide a method for extracting phoneme standard patterns that can extract them from word speech mechanically and quickly, without manual intervention.

[Structure of the invention]

In the present invention, a word network having a plurality of transition paths whose nodes are phoneme boundary symbols is stored in advance in a storage means for each word uttered by a plurality of speakers. A phoneme boundary symbol string and a parameter sequence of speech analysis are extracted from an input word utterance, and when the phoneme boundary symbol string of the input matches at least one transition path of the word network stored in the storage means, the parameter sequence is extracted as a phoneme standard pattern for speech recognition.

[Embodiment]

FIG. 1 is a block diagram of a phoneme standard pattern extraction device according to one embodiment of the present invention. The invention is characterized in that, when phoneme standard patterns are extracted from word speech for registration, it uses a word network whose nodes are phoneme boundary symbols detected from changes in power, changes in spectrum, and the like.

In FIG. 1, the registration word speech X(t) is first input to the speech analysis unit 1, which calculates from it the autocorrelation coefficients R(t) and their change R'(t), the power P(t) and its change P'(t), and the cepstrum coefficients c(t). Here the frame period of the speech input is, for example, 8 msec, and t denotes the t-th frame of the speech input.

FIG. 2 is a block diagram of the speech analysis unit 1 of FIG. 1. In FIG. 2, the registration word speech input X(t) is first input to the sampling circuit 11 and sampled at a predetermined sampling frequency, and the sampled values S(t) are output to the autocorrelation coefficient calculation unit 12 and the power calculation unit 13. The sampling circuit 11 of this embodiment takes 256 samples per frame. The individual sampled values are denoted

$$S(t)_i, \quad 1 \le i \le 256 \tag{1}$$

In the autocorrelation coefficient calculation unit 12, the autocorrelation coefficients $R(t)_i$ given by the following equation are calculated from the input sampled values S(t) with analysis order $n_p = 24$, following the processing flow of FIG. 3, and are then output to the linear prediction coefficient calculation unit 14 and the phoneme classification unit 2.

$$R(t)_i = \frac{1}{256} \sum_{k=1}^{256-i} S(t)_k \cdot S(t)_{k+i}, \quad 1 \le i \le 24 \tag{2}$$

Here the subscript i denotes the order of the autocorrelation coefficient R(t). The subscripts i of the linear prediction analysis coefficients $A(t)_i$ and the cepstrum coefficients $c(t)_i$ described below likewise denote the order. In the flowchart of FIG. 3, S(I) denotes the sampled value $S(t)_i$ and R(I) denotes the autocorrelation coefficient $R(t)_i$.

In the linear prediction coefficient calculation unit 14, the linear prediction analysis coefficients $A(t)_i$ are calculated from the input autocorrelation coefficients $R(t)_i$ by a known linear prediction analysis method, following the processing flow of FIG. 4, and are then output to the cepstrum coefficient calculation unit 15. In the cepstrum coefficient calculation unit 15, the cepstrum coefficients $c(t)_i$ are calculated from the input coefficients $A(t)_i$ by the following equation and are output to the phoneme extraction unit 4 and the cepstrum change calculation unit 16.

$$c(t)_i = -A(t)_i - \frac{1}{i} \sum_{k=1}^{i-1} k \cdot c(t)_k \cdot A(t)_{i-k}, \quad 1 \le i \le 24 \tag{3}$$

In equation (3), the first-order cepstrum coefficient $c(t)_1$ is given by

$$c(t)_1 = -A(t)_1 \tag{4}$$

Further, the cepstrum change calculation unit 16 calculates the change $c'(t)_i$ of the cepstrum coefficients from the input cepstrum coefficients $c(t)_i$ by the following equation and outputs it to the phoneme boundary detection unit 3.

$$c'(t)_i = \left| c(t-4)_i - c(t)_i \right| \tag{5}$$

Meanwhile, the power calculation unit 13 calculates the power P(t) from the input sampled values $S(t)_i$ by the following equation and outputs it to the phoneme classification unit 2 and the power change calculation unit 17.

$$P(t) = \frac{1}{256} \sum_{i=1}^{256} \left| S(t)_i \right|^2 \tag{6}$$

Next, the power change calculation unit 17 calculates the change in power P'(t) from the input power P(t) by the following equation and outputs it to the phoneme boundary detection unit 3.

$$P'(t) = \sum_{j=1}^{7} (j-4) \cdot P(t-7+j) \tag{7}$$

FIG. 5 is the region table used for classification in the phoneme classification unit 2 of FIG. 1. The horizontal axis X is $-\log(1 - R(t)_1)$ and the vertical axis Y is $\log P(t)$, where $R(t)_1$ is the first-order autocorrelation coefficient of the t-th frame as described above.

In FIG. 5, the region where Y is less than a predetermined boundary value $Y_1$ is a silent part (・). In the region where Y is at least $Y_1$ and at most a predetermined boundary value $Y_2$, the part where X is less than a predetermined boundary value $X_1$ is an unvoiced part (F), the part where X is at least $X_1$ and at most a predetermined boundary value $X_2$ is a vowel part (V), and the part where X exceeds $X_2$ is a nasal part (N). Further, in the region where Y exceeds $Y_2$, the region satisfying

$$Y < -m_1 (X - X_1) + Y_2 \tag{8}$$

is an unvoiced part (F), the region satisfying both

$$Y \ge -m_1 (X - X_1) + Y_2 \tag{9}$$

and

$$Y \ge m_2 (X - X_2) + Y_2 \tag{10}$$

is a vowel part (V), and the region satisfying

$$Y < m_2 (X - X_2) + Y_2 \tag{11}$$

is a nasal part (N). Here $m_1$ and $m_2$ are positive predetermined values.

From the input power P(t) and autocorrelation coefficients R(t), the phoneme classification unit 2 determines the rough character of each frame of the speech input according to FIG. 5 and outputs it to the phoneme boundary detection unit 3 in the form of a phoneme classification symbol ph(t). Table 1 shows the phoneme classification symbols ph(t) that are output and the properties they represent.

Next, the phoneme boundary detection unit 3 detects the phoneme boundary numbers bd(t) of Table 2 from the input power change P'(t), cepstrum coefficient change $c'(t)_i$, and phoneme classification symbol ph(t), according to the conditions of Table 2, and outputs them to the phoneme extraction unit 4. In Table 2, $T_1$, $T_2$ and $T_3$ are predetermined threshold values. If the interval between boundary numbers is within a predetermined threshold of $T_4$ frames, the phoneme boundary detection unit 3 outputs the phoneme boundary number bd(t) with the higher priority according to the following expression.

(high priority) > > > > > (low priority) …… (12)

FIG. 6 shows examples of the phoneme classification symbol strings ph(t) and the boundary number strings bd(t) obtained when three speakers uttered "Asahi". As described above, a single word interval can be described by a boundary symbol string bd(t) that begins with a boundary symbol and ends with a boundary symbol. Expressing the boundary symbol strings bd(t) of FIG. 6 as a word network with boundary symbols as nodes gives FIG. 7, in which the phoneme present in each interval is shown on the branch between nodes and a serial number is shown above each node. As FIG. 7 shows, a word network built for one word from multiple speakers has multiple transition paths, because the boundary symbol string bd(t) differs from speaker to speaker.

In FIG. 1, reference numeral 5 denotes the word network table (ROM). Speech data of the phoneme-extraction words uttered by many speakers are analyzed in advance, a word network like that of FIG. 7 is created for each word, and the networks are written into the word network table (ROM) 5. To store a network in the memory (ROM), a list representation like the example of Table 3 is used, in which one branch is represented by six words of node information. Table 3 shows the meaning of each word of the node information, and Table 4 shows the phoneme extraction positions on each branch and their codes.

In Table 3, the branch condition (shortest) is the minimum frame interval until a boundary symbol satisfying the branch condition arrives, and the branch condition (longest) is the maximum frame interval until such a boundary symbol arrives. The example of Table 3 means that if the boundary symbol arrives at least 5 and at most 15 frames later, the pointer branches to node number 4, and the cepstrum coefficients c(t) at the center frame of the interval connecting the current node and the destination node are extracted as the standard pattern of the phoneme /a/.

The phoneme extraction unit 4 reads the word network corresponding to each phoneme-extraction word from the word network table (ROM) 5 and receives the boundary symbol string bd(t) that the phoneme boundary detection unit 3 outputs after analyzing the registration speech input. Starting from the boundary symbol that is the first node, whenever the branch condition in the node information is satisfied, a pointer provided in the phoneme extraction unit is moved to the next node, and this operation is repeated.

Only when the pointer, driven by the input boundary symbol string bd(t), has moved through the word network stored in the word network table (ROM) 5 all the way to the boundary symbol representing the end of the word is the phoneme segmentation regarded as successful. In that case the cepstrum coefficients $c(t_0)$ at the frame corresponding to the extraction position $t_0$ given in the node information are extracted for each phoneme and stored in the phoneme standard pattern table (RAM) 6 as the standard pattern of that phoneme.

As explained above, speech data of the phoneme-extraction words uttered by many speakers are analyzed in advance, and word networks whose nodes are phoneme boundary symbols, as shown in FIG. 7, are written into the word network table (ROM) 5 in the form of the node information of Table 3, in which each branch between nodes is represented by six words. The boundary symbol string bd(t) obtained by analyzing the registration speech input X(t) is compared with the word network written in the word network table (ROM) 5. If a matching transition path exists, the phoneme segmentation is judged successful, and the cepstrum coefficients $c(t_0)$ at the frame corresponding to the extraction position $t_0$ in the node information can be extracted for each phoneme as a phoneme standard pattern.
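The per-frame analysis of equations (2) to (7) can be illustrated with a minimal Python sketch. It is not the processing flow of FIG. 3 or FIG. 4, which are not reproduced in this text. It assumes that the "known linear prediction analysis method" is the Levinson-Durbin recursion, that the predictor polynomial is written as 1 + Σ A_k z^-k (the sign convention that makes equation (3) the usual LPC-to-cepstrum recursion), and that a lag-0 autocorrelation term is computed alongside equation (2) because the recursion needs it. All function names are hypothetical.

```python
ORDER = 24  # analysis order n_p from the embodiment


def autocorrelation(frame, order=ORDER):
    """Eq. (2), plus lag 0 (assumption: needed by the recursion below)."""
    n = len(frame)
    return [sum(frame[k] * frame[k + i] for k in range(n - i)) / n
            for i in range(order + 1)]


def power(frame):
    """Eq. (6): mean of the squared sample values of one frame."""
    return sum(s * s for s in frame) / len(frame)


def levinson_durbin(r, order=ORDER):
    """Levinson-Durbin recursion (assumed method); returns a with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:          # silent frame: leave remaining coefficients at 0
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err          # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_to_cepstrum(a, order=ORDER):
    """Eqs. (3) and (4): cepstrum coefficients c_1..c_order from A_1..A_order."""
    c = [0.0] * (order + 1)
    for i in range(1, order + 1):
        acc = sum(k * c[k] * a[i - k] for k in range(1, i))
        c[i] = -a[i] - acc / i
    return c[1:]


def cepstrum_change(cep_frames, t):
    """Eq. (5): |c(t-4)_i - c(t)_i| for every order i; needs t >= 4."""
    return [abs(p - q) for p, q in zip(cep_frames[t - 4], cep_frames[t])]


def power_change(powers, t):
    """Eq. (7): sum over j = 1..7 of (j-4) * P(t-7+j); needs t >= 6 (0-based)."""
    return sum((j - 4) * powers[t - 7 + j] for j in range(1, 8))
```

With a frame of 256 samples, `lpc_to_cepstrum(levinson_durbin(autocorrelation(frame)))` yields the 24 cepstrum coefficients of that frame. The weights (j − 4) in `power_change` make it proportional to a least-squares slope of the power over seven frames.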

[Table]
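The region rules of FIG. 5, which yield the classification symbols ph(t) listed in Table 1, can be sketched in Python as follows. The numeric boundary values are placeholders chosen only for illustration, since the source calls them "predetermined" without giving them. The sketch further assumes natural logarithms, a first-order autocorrelation coefficient normalized to be below 1, and the ASCII symbol "." for the silent part.

```python
import math

# Hypothetical boundary values and slopes; the source gives no numbers.
X1, X2 = 1.0, 3.0
Y1, Y2 = 2.0, 6.0
M1, M2 = 1.0, 1.0   # m_1 and m_2 must be positive


def classify_region(x, y):
    """Map a point (X, Y) of FIG. 5 to '.', 'F', 'V' or 'N'."""
    if y < Y1:
        return "."                        # silent part
    if y <= Y2:                           # Y_1 <= Y <= Y_2: vertical boundaries
        if x < X1:
            return "F"                    # unvoiced part
        return "V" if x <= X2 else "N"    # vowel part or nasal part
    # Y > Y_2: slanted boundaries of expressions (8) to (11)
    if y < -M1 * (x - X1) + Y2:
        return "F"
    if y >= M2 * (x - X2) + Y2:
        return "V"
    return "N"


def classify_frame(r1, p):
    """Classify one frame from R(t)_1 (assumed normalized) and P(t)."""
    if p <= 0.0:
        return "."                        # log P(t) undefined: treat as silence
    # Clamp to avoid log(0) when r1 reaches 1 (an added safeguard, not in the source).
    x = -math.log(max(1.0 - r1, 1e-12))
    y = math.log(p)
    return classify_region(x, y)
```

Because expression (8) can only hold for X below X_1 and expression (11) only for X above X_2 when Y exceeds Y_2, the three upper regions do not overlap, so the order of the tests in `classify_region` does not affect the result.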

[Effect of the invention]

As detailed above, word utterances by a plurality of speakers are analyzed in advance, and a word network having a plurality of transition paths whose nodes are phoneme boundary symbols is stored in a storage means for each word. An input word utterance is analyzed to output a phoneme boundary symbol string and a parameter sequence of speech analysis, and when the input phoneme boundary symbol string matches at least one transition path of the word network stored in the storage means, the parameter sequence can be extracted as a phoneme standard pattern for speech recognition. Phoneme standard patterns can therefore be extracted from word speech mechanically and quickly, without manual intervention.
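The matching of a boundary symbol string against a word network can be sketched in Python as below. Tables 3 and 4 are not reproduced in this text, so the six fields of `Branch` are only a guess at the six words of node information, inferred from the description of the branch conditions. The depth-first search, the integer boundary symbols, and the toy network are likewise assumptions made for illustration. Only the center-frame extraction position is modeled.

```python
from typing import NamedTuple


class Branch(NamedTuple):
    """Hypothetical six-word node information for one branch."""
    symbol: int     # boundary symbol that triggers the branch
    min_gap: int    # branch condition (shortest), in frames
    max_gap: int    # branch condition (longest), in frames
    dest: int       # destination node number
    position: str   # extraction position code; only "center" is handled
    phoneme: str    # phoneme on the branch, "" if nothing is extracted


def match_network(network, start, end, events):
    """Follow the network with events = [(frame, boundary_symbol), ...].

    Returns a list of (phoneme, frame) extraction points if some transition
    path consumes every event and stops at the end node, otherwise None.
    """
    def walk(node, idx, cuts):
        if idx == len(events) - 1:
            return cuts if node == end else None
        frame, _ = events[idx]
        next_frame, next_symbol = events[idx + 1]
        gap = next_frame - frame
        for b in network.get(node, []):
            if b.symbol == next_symbol and b.min_gap <= gap <= b.max_gap:
                cut = [(b.phoneme, (frame + next_frame) // 2)] if b.phoneme else []
                found = walk(b.dest, idx + 1, cuts + cut)
                if found is not None:
                    return found
        return None

    return walk(start, 0, []) if events else None


# Toy network (not the one of FIG. 7): node 1 -> 2 -> 3, two alternatives at node 2.
TOY_NETWORK = {
    1: [Branch(2, 5, 15, 2, "center", "a")],
    2: [Branch(5, 4, 12, 3, "center", "s"), Branch(6, 4, 12, 3, "center", "s")],
}
```

Calling `match_network(TOY_NETWORK, 1, 3, [(10, 1), (20, 2), (28, 6)])` follows the second alternative at node 2 and returns the center frames of both intervals. The cepstrum vectors at those frames would then be stored as the standard patterns of the corresponding phonemes. An utterance whose boundary spacing violates a branch condition yields `None`, so nothing is extracted from it.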

[Brief description of the drawings]

FIG. 1 is a block diagram of a phoneme standard pattern extraction device according to one embodiment of the present invention. FIG. 2 is a block diagram of the speech analysis unit of FIG. 1. FIG. 3 is a flowchart showing the processing of the autocorrelation coefficient calculation unit of FIG. 2. FIG. 4 is a flowchart showing the processing of the linear prediction analysis coefficient calculation unit of FIG. 2. FIG. 5 is a diagram showing the classification regions used in the phoneme classification unit of FIG. 1. FIG. 6 is a diagram showing the phoneme classification symbol strings and boundary number strings obtained when three speakers uttered "Asahi". FIG. 7 is a diagram showing a word network in which the boundary symbol strings of FIG. 6 are expressed with boundary symbols as nodes. FIG. 8 is a block diagram showing a conventional speech recognition method.

Claims

[Scope of Claims] 1. A method for extracting a phoneme standard pattern, characterized in that: a word network having a plurality of transition paths whose nodes are phoneme boundary symbols is stored in advance in a storage means for each word uttered by a plurality of speakers; a phoneme boundary symbol string and a parameter sequence of speech analysis are extracted from an input word utterance; and, when the phoneme boundary symbol string of the input word utterance matches at least one transition path of the word network stored in the storage means, the parameter sequence is extracted as a phoneme standard pattern for speech recognition.