JPS5936758B2

JPS5936758B2 - Voice recognition method

Info

Publication number: JPS5936758B2
Application number: JP49138066A
Authority: JP
Inventors: 博平川; 敏夫杉原; 靖夫徳永
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1974-11-30
Filing date: 1974-11-30
Publication date: 1984-09-05
Also published as: JPS5162904A

Description

DETAILED DESCRIPTION OF THE INVENTION The present invention relates to a speech recognition method, and in particular to speech recognition that identifies either or both of a word (including a sentence) and the speaker who uttered it. A time-series pattern is provided in which composite standard phoneme sets are arranged in time series, each composite set consisting of at least a first standard phoneme set, which pairs a plurality of standard phonemes, and a second standard phoneme set paired in the same way. This allows correct recognition regardless of differences in the speech signal caused by slight changes in the vocal tract that depend on the circumstances in which the speaker utters the word.

To perform recognition that captures the characteristics of the speaker in particular, the present inventors previously proposed, in the invention of Japanese Patent Application No. Sho 49-41341 "Speech recognition method", the use of standard phoneme sets that pair a plurality of standard phonemes, and the use of a time-series pattern in which such standard phoneme sets are arranged in time series as the standard time-series pattern for recognition.

The present invention is a further development of the above invention. Its object is to achieve correct recognition regardless of differences in the speech signal caused by slight changes in the vocal tract that depend on the circumstances of utterance. To this end it uses a composite standard phoneme set formed by extracting, within a given time interval, the first standard phoneme set having the highest similarity and the second standard phoneme set having the next highest similarity.

The extraction of these standard phoneme sets can be carried out more accurately and more easily by using a data processing device.

To achieve the above object, the speech recognition method of the present invention recognizes either or both of a word (including a sentence) and the speaker who uttered it, and is characterized as follows. Time-series patterns are provided in which composite standard phoneme sets are arranged in time series, each consisting of at least a first standard phoneme set that pairs a plurality of standard phonemes and a second standard phoneme set paired in the same way. Means is provided for computing the similarity between the phonemes within each of the predetermined time intervals into which the unknown input speech is divided and the first and second standard phoneme sets. Recognition is performed according to which of the time-series patterns, formed by combining first and second standard phoneme sets, is most similar to the time series of phonemes within the divided time intervals of the input speech.

The method is further characterized in that, based on a plurality of composite standard phoneme sets arranged in time series, time-series patterns are provided that combine either the first or the second standard phoneme set of one composite standard phoneme set with either the first or the second standard phoneme set of another composite standard phoneme set. The similarity between the time series of phonemes within the divided time intervals of known input speech and each of these time-series patterns is computed. When the result shows that the phoneme time series of the known input speech has the same similarity, within a predetermined range, to several of the time-series patterns, the time-series pattern containing the larger number of first standard phoneme sets is taken as the standard time-series pattern for recognizing unknown input speech.

The method is further characterized in that, in this case, the relationship among the several time-series patterns having the same similarity is classified by form, and this form-based classification information is used as one item of information for recognizing unknown input speech.

The invention is described below with reference to the drawings. Figs. 1 to 4 conceptually explain the previously filed invention: Fig. 1 illustrates the sampling of the speech signal, Fig. 2 illustrates the process of extracting standard phoneme sets that pair a plurality of standard phonemes, Fig. 3 illustrates the process of obtaining similarity based on the standard phoneme sets, and Fig. 4 is a configuration diagram of the speech recognition apparatus. Fig. 5 is a processing block diagram of an embodiment of the present invention, Fig. 6 is a block diagram of an embodiment for determining the standard time-series pattern, and Figs. 7 to 12 illustrate the process of obtaining the standard time-series pattern and the form-based classification information according to the present invention.

Before describing the present invention, the invention of the previously filed Japanese Patent Application No. Sho 49-41341 is outlined.

In that invention, the speech signal is first sampled as shown in Fig. 1. Taking, for example, 30 ms as one frame, for example 10 partial autocorrelation coefficients are extracted within each frame of the illustrated speech signal 1, and the frame is shifted by, for example, 10 ms at a time. The extracted partial autocorrelation coefficients are arranged in time series, and their similarity to predetermined standard phonemes, for example the 12 phonemes |a1|, |a2|, ..., |r|, is compiled into a table as shown in Fig. 2. Fig. 2 shows that the phoneme k-parameter k' of a given frame has a similarity Sj(C) to the standard phoneme k-parameter Kj, and that k' has the highest similarity to the standard phoneme parameter K8 and the next highest similarity to the standard phoneme parameter K2. The C in Sj(C) indicates that the similarity is relative to the pronunciation of speaker C. As Fig. 2 shows, the standard phoneme j1 with the highest similarity and the standard phoneme j2 with the second highest similarity are found for each frame and combined into a standard phoneme set {j1, j2}. From these, the time-series pattern of standard phoneme sets that appears when a speaker C pronounces a given word, for example the digit "4", is formed, for example {8, 2, 48}; {2, 6, 66}; {3, 11, 78}; {1, 2, 47}. The values 48, 66, 78 and 47 inside the braces represent a feature coefficient related to the vocal cords (pitch); its explanation is omitted here. The time-series pattern of standard phoneme sets obtained in this way (for a given speaker C pronouncing a given word) is arranged along the horizontal axis of the table shown in Fig. 3, and the similarity of the phoneme time series of the unknown input speech signal to it is examined.
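
The framing and the selection of the two most similar standard phonemes described above might be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary of similarities and the sample rate are hypothetical, and the way the similarity itself is computed from the partial autocorrelation coefficients is not specified here and is left out.

```python
def frame_starts(n_samples, sample_rate, frame_ms=30, hop_ms=10):
    """Return the start index of every full frame (30 ms frames shifted by 10 ms)."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    # Only frames that fit entirely inside the signal are kept.
    return list(range(0, n_samples - frame_len + 1, hop))


def standard_phoneme_set(similarities):
    """Pair the most similar and second most similar standard phonemes.

    `similarities` maps a standard phoneme number to the similarity of the
    current frame to that phoneme (hypothetical input format).
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[0], ranked[1]


if __name__ == "__main__":
    print(frame_starts(800, 8000))
    print(standard_phoneme_set({1: 0.1, 2: 0.7, 3: 0.2, 8: 0.9}))
```

With an assumed 8 kHz sample rate, a 100 ms signal yields eight overlapping frames, and the example similarities give the set (8, 2), matching the first element of the pattern quoted in the text.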

In Fig. 3, S1(1, j) denotes the similarity, in the first time slot T, to the standard phoneme K8 designated by the number "8", and S1(2, j) denotes the similarity, in the same time slot T, to the standard phoneme K2 designated by the number "2". A process known as dynamic programming then sums, for example, the similarities enclosed by the thick black lines in the figure, and so determines how similar the unknown input speech signal is to a given word, for example "4", as pronounced by a given speaker C. For recognition, the process shown in Fig. 3 is of course carried out (number of speakers) × (number of words) times, and the candidate with the highest similarity is selected. Although the explanation is omitted here, the earlier invention also takes into account a dissimilarity computed on the pitch time series, and recognition is based on both the similarity and the dissimilarity. Fig. 4 shows the configuration used for speech recognition in the earlier invention; the configuration for the present invention does not differ in essence.
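
One way to picture the summation of similarities along a path through the table of Fig. 3 is the following sketch. It assumes a particular path constraint (each input frame either stays on the current pattern element or advances to the next one), which is a common choice for this kind of dynamic programming but is not stated in the text; the function name and the matrix layout are likewise hypothetical.

```python
def dp_similarity(sim):
    """Maximum summed similarity over a monotone path through `sim`.

    sim[t][p] is the similarity of input frame t to element p of a stored
    time-series pattern. The path starts at (0, 0), ends at the last frame
    and last pattern element, and at each frame either stays on the same
    pattern element or advances by one (assumed constraint).
    """
    n_frames, n_elems = len(sim), len(sim[0])
    neg_inf = float("-inf")
    best = [[neg_inf] * n_elems for _ in range(n_frames)]
    best[0][0] = sim[0][0]
    for t in range(1, n_frames):
        for p in range(n_elems):
            stay = best[t - 1][p]
            advance = best[t - 1][p - 1] if p > 0 else neg_inf
            prev = max(stay, advance)
            if prev > neg_inf:
                best[t][p] = prev + sim[t][p]
    # -inf means no admissible path exists (fewer frames than elements).
    return best[n_frames - 1][n_elems - 1]


if __name__ == "__main__":
    table = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
    print(dp_similarity(table))
```

In the example the best path keeps the first two frames on the first pattern element and moves to the second element for the last frame, so the summed similarity is 0.9 + 0.8 + 0.7.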

In the figure, 1 is a k-parameter and pitch extraction unit; 2, 4, 6, 8, 10, ... are stores holding the standard k-parameters for a given word as pronounced by a given speaker; 3, 5, 7, 9, 11, ... are the corresponding pitch stores; A is a similarity calculation circuit, B is a dissimilarity calculation circuit, C is a similarity-sum calculation circuit, D is a dissimilarity-sum calculation circuit, E is a circuit that takes the absolute value of the dissimilarity sum, W is a weighting circuit, F is an adder circuit, and G is a maximum detection circuit. A detailed explanation is omitted, but as described above, the weighted dissimilarity sum for the pitch time-series pattern is subtracted from the similarity sum for the phoneme time-series pattern, and the candidate with the largest resulting value is selected.
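
The combination performed by circuits E, W, F and G can be illustrated with a short sketch. The function names, the candidate dictionary and the numeric weight are assumptions made for illustration; the text does not give a value for the weight.

```python
def combined_score(similarity_sum, dissimilarity_sum, weight):
    """Similarity sum minus the weighted absolute dissimilarity sum (E, W, F)."""
    return similarity_sum - weight * abs(dissimilarity_sum)


def select_maximum(candidates, weight):
    """Return the candidate key with the largest combined score (circuit G).

    `candidates` maps a (speaker, word) key to a pair
    (similarity_sum, dissimilarity_sum); this layout is hypothetical.
    """
    return max(candidates, key=lambda k: combined_score(*candidates[k], weight))


if __name__ == "__main__":
    candidates = {("C", "4"): (2.4, -1.0), ("C", "7"): (2.5, 3.0)}
    print(select_maximum(candidates, weight=0.5))
```

Here the second candidate has the slightly larger similarity sum, but its larger pitch dissimilarity lowers its combined score, so the first candidate is selected.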

The previously filed invention has been outlined above. However, the speech signal can change because of slight changes in the vocal tract that depend on the circumstances in which the speaker utters the word, and such changes must be prevented from causing erroneous recognition.

To solve this problem, the present invention extracts, for a given time slot in which a given speaker pronounces the same word, the first standard phoneme set that occurs most frequently and the second standard phoneme set that occurs next most frequently, and prepares the standard time-series pattern from them. Fig. 5 is a processing block diagram based on the speech recognition method of the present invention. It differs from the previously filed invention in two respects: the determination of the "standard time-series pattern" shown in Fig. 5 uses composite standard phoneme sets consisting of the first and second standard phoneme sets, and form-based classification information is used. The processing inside the chain-line frame in Fig. 6 corresponds to the determination of the "standard time-series pattern" in Fig. 5. In the present invention this processing (i) determines from known input speech, as described above, the composite standard phoneme set consisting of the most frequent, first-rank standard phoneme set and the next, second-rank standard phoneme set; (ii) forms the combination sequences of these first-rank and second-rank standard phoneme sets; (iii) inputs many more known utterances and performs nonlinear matching against the combination sequences obtained earlier; and (iv) thereby determines the standard time-series pattern for the word as spoken by the speaker. The similarity to unknown input speech is then computed on the basis of this standard time-series pattern, as shown in Fig. 5. The determination of the standard time-series pattern and of the form-based classification information used in the present invention is described below following the block diagram of Fig. 6. In the present invention each speaker utters the same word many times, and a composite standard phoneme set is obtained for each frame of a specific word uttered by a specific speaker.

That is, the speech signal 1 produced when a specific speaker utters a specific word is sampled as shown in Fig. 7, in the same way as in Fig. 1. For a given frame, the first standard phoneme set (S1(1, j), S1(2, j)), which occurs most frequently over the repeated utterances, and the second standard phoneme set (S1'(1, j), S1'(2, j)), which occurs next most frequently, are extracted to obtain a composite standard phoneme set. As shown in Fig. 8, this composite standard phoneme set is written in general, for frame i, as Ai{Ei, Fi}. The composite standard phoneme set A1 of the first frame shown in Fig. 7, the composite standard phoneme set A2 of the second frame, ..., Ai, ..., An are then arranged in time series, as shown in Fig. 9. Because the same composite standard phoneme set generally occurs in adjacent frames of the time series of Fig. 9, identical adjacent composite standard phoneme sets are merged, giving a time-series pattern of composite standard phoneme sets B1, B2, ..., Bm in which adjacent sets differ from one another, as shown in Fig. 10.
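
The extraction of the two most frequent standard phoneme sets per frame, and the merging of identical adjacent composite sets, could look like the sketch below. It assumes that the repeated utterances are already aligned frame by frame, which the text does not state, and the function names and the tuple representation of a standard phoneme set are hypothetical.

```python
from collections import Counter
from itertools import groupby


def composite_sets(utterances):
    """Build one composite set (E, F) per frame from repeated utterances.

    `utterances` is a list of equal-length sequences; element i of each
    sequence is the standard phoneme set observed in frame i (assumed to be
    aligned across utterances). E is the most frequent set in the frame and
    F the next most frequent; if only one set occurs, F repeats E.
    """
    result = []
    for frame_sets in zip(*utterances):
        ranked = Counter(frame_sets).most_common(2)
        first = ranked[0][0]
        second = ranked[1][0] if len(ranked) > 1 else first
        result.append((first, second))
    return result


def merge_adjacent(sequence):
    """Collapse runs of identical adjacent elements (A1..An -> B1..Bm)."""
    return [key for key, _ in groupby(sequence)]


if __name__ == "__main__":
    utterances = [
        [(8, 2), (8, 2), (2, 6)],
        [(8, 2), (8, 2), (2, 6)],
        [(8, 3), (8, 3), (2, 5)],
    ]
    per_frame = composite_sets(utterances)
    print(merge_adjacent(per_frame))
```

In the example the first two frames yield the same composite set, so the three-frame sequence collapses to a two-element pattern.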

Fig. 11 shows the case in which the composite standard phoneme sets obtained as above are B1, B2, B3 and B4, each composite standard phoneme set being represented as consisting of {E, F}.

The time-series pattern of composite standard phoneme sets represents the sequence of composite standard phoneme sets that can be expected to appear when the speech signal of a given word pronounced by a given speaker is divided into time slots and examined. The previously filed invention considered only the time series E1→E2→E3→E4 consisting of the standard phoneme sets E1, E2, E3 and E4 shown in Fig. 11. When a speaker pronounces a word carefully, it is sufficient to use the time series E1→E2→E3→E4 as the standard time-series pattern. However, even when the same speaker pronounces the same word, the circumstances of utterance, for example a slight change in the vocal tract at that moment, mean that any of a total of 16 time-series patterns may appear: E1→E2→E3→E4, E1→E2→E3→F4, ..., F1→F2→F3→F4. In the present invention, therefore, when the same speaker pronounces the same word under various circumstances, the "nonlinear matching" shown in Fig. 6 examines which of the 16 time-series patterns obtained in the "combination sequences of standard phoneme sets" shown in the same figure occurs most often. The most frequent time-series pattern is determined and is used thereafter as the standard time-series pattern for recognizing unknown input speech. To examine the frequency of occurrence, each input utterance is of course checked to see which of the time-series patterns E1→E2→E3→E4, ..., F1→F2→F3→F4 it resembles, and the frequency distribution is obtained from the results. In some cases, however, the frequency of another time-series pattern hardly differs from that of the most frequent one. One of them is then chosen as the standard time-series pattern as follows. For each of, for example, two time-series patterns, the number of first standard phoneme sets E and the number of second standard phoneme sets F that resemble the phonemes of the input speech within a predetermined range are counted, and the time-series pattern with the larger number of first standard phoneme sets E is chosen as the standard time-series pattern. If the numbers of first standard phoneme sets E are equal, the time-series pattern that has the first standard phoneme set in the time slot carrying the accent is chosen. Even so, it is desirable to compensate for the drawback of having chosen only a single standard time-series pattern. For this purpose, the relationship between two time-series patterns with approximately the same frequency of occurrence is classified by form, and form-based classification information is prepared. The "nonlinear matching" shown in Fig. 6 is nonlinear matching carried out by the conventionally known dynamic programming (DP) method. That is, the input speech is matched against each standard sequence in a way that allows for stretching in time and is therefore little affected by it. Fig. 12 shows the forms that represent the relationship between two time-series patterns, together with the classification information.

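The enumeration of the 16 combination sequences and the choice among patterns of nearly equal frequency can be sketched as follows. This is a simplified illustration: the function names, the occurrence counts and the tolerance are hypothetical, the number of first standard phoneme sets is counted directly in the pattern, and the further tie-break on the accented time slot is not modelled.

```python
from itertools import product


def combination_patterns(composite):
    """All sequences that pick E or F from each composite set (2**n patterns)."""
    return [tuple(choice) for choice in product(*composite)]


def choose_standard(composite, counts, tolerance):
    """Pick the standard pattern from observed occurrence counts.

    Patterns whose count is within `tolerance` of the highest count are
    treated as equally frequent; among them the one containing the most
    first standard phoneme sets (E) wins, with the raw count as a
    secondary key (assumed tie-break).
    """
    top = max(counts.values())
    near = [p for p, c in counts.items() if top - c <= tolerance]

    def n_first(pattern):
        return sum(1 for chosen, (e, _) in zip(pattern, composite) if chosen == e)

    return max(near, key=lambda p: (n_first(p), counts[p]))


if __name__ == "__main__":
    composite = [("E1", "F1"), ("E2", "F2"), ("E3", "F3"), ("E4", "F4")]
    print(len(combination_patterns(composite)))
    counts = {
        ("E1", "E2", "E3", "F4"): 10,
        ("E1", "F2", "F3", "F4"): 10,
        ("F1", "F2", "F3", "F4"): 3,
    }
    print(choose_standard(composite, counts, tolerance=1))
```

With four composite sets there are 2 to the power 4, that is 16, patterns; in the example two patterns occur equally often and the one with three first standard phoneme sets is preferred over the one with a single first standard phoneme set.
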
Fig. 12A shows the forms in which the same standard phoneme set is shared in exactly one of the four positions, together with their form-based classification information G11, G12, G13 and G14. Fig. 12B shows the forms in which any two standard phoneme sets are shared, together with their form-based classification information G21, G22, G23, G24, G25 and G26. Fig. 12C shows the forms in which any three standard phoneme sets are shared, together with their form-based classification information G31, G32, G33 and G34. Needless to say, if all four standard phoneme sets are shared, the two time-series patterns are identical. There is only one form in which nothing is shared in any time slot, and it is given, for example, the classification information G00. In the present invention, the standard time-series pattern obtained by taking the composite standard phoneme sets into account as described above is stored in the standard k-parameter stores 2, 4, 6, 8, 10, ... shown in Fig. 4, as in the earlier invention.
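
The classification of two time-series patterns by the positions in which they share a standard phoneme set might be sketched as below. The first digit of the label follows the text (the number of shared positions); the numbering within each group, the label used when nothing is shared, and the function name are assumptions, since the text does not say which form receives which label.

```python
from itertools import combinations


def form_label(pattern_a, pattern_b):
    """Classify two equal-length patterns by the positions they share.

    Returns a label "G<k><index>" where k is the number of shared positions
    and index enumerates the possible position subsets in lexicographic
    order (assumed ordering). Returns "G00" when nothing is shared and None
    when the patterns are identical.
    """
    n = len(pattern_a)
    shared = tuple(i for i in range(n) if pattern_a[i] == pattern_b[i])
    k = len(shared)
    if k == n:
        return None  # identical patterns need no classification
    if k == 0:
        return "G00"
    index = list(combinations(range(n), k)).index(shared) + 1
    return f"G{k}{index}"


if __name__ == "__main__":
    standard = ("E1", "E2", "E3", "E4")
    print(form_label(standard, ("E1", "F2", "F3", "F4")))
    print(form_label(standard, ("F1", "F2", "E3", "E4")))
```

For patterns of length four this yields 4 labels for one shared position, 6 for two and 4 for three, which agrees with the counts of forms listed for Figs. 12A, 12B and 12C.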

As in the earlier invention, the recognition output is obtained from the maximum detection circuit G. The standard time-series patterns obtained by taking the composite standard phoneme sets into account are prepared for each speaker and for each word. In the present invention, however, the form-based classification information is stored together with them in the stores 2, 4, 6, 8, 10, .... When the similarity to unknown speech input is computed, the input is compared not only with the single standard time-series pattern but also, using the form-based classification information, with time-series patterns whose frequency of occurrence is close to that of the standard time-series pattern. If this comparison is carried out with the speaker fixed, words are recognized; if it is carried out with the word fixed, speakers are recognized; and if neither is fixed, the method recognizes which speaker uttered which word. As described above, introducing the concept of composite standard phoneme sets prevents misrecognition caused by differences in the circumstances of utterance, and is highly effective for speech recognition as the number of speakers and the number of words increase.
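
The three modes of use described above (speaker fixed, word fixed, neither fixed) can be illustrated with a small selection routine. The score table keyed by (speaker, word) and the function name are hypothetical; the scores stand for whatever combined similarity the matching stage produces.

```python
def recognize(scores, speaker=None, word=None):
    """Return the best (speaker, word) key, optionally fixing one of them.

    Fixing `speaker` gives word recognition, fixing `word` gives speaker
    recognition, and fixing neither identifies both at once.
    """
    candidates = {
        key: value
        for key, value in scores.items()
        if (speaker is None or key[0] == speaker)
        and (word is None or key[1] == word)
    }
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    scores = {
        ("A", "4"): 1.2, ("A", "7"): 2.0,
        ("C", "4"): 2.4, ("C", "7"): 1.1,
    }
    print(recognize(scores))                # both unknown
    print(recognize(scores, speaker="A"))   # word recognition for speaker A
    print(recognize(scores, word="7"))      # speaker recognition for word "7"
```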

The extraction of the first and second standard phoneme sets, on which the composite standard phoneme set is based, can be carried out by a data processing device.

This has the advantages that the processing is fast and that the criteria for extraction are applied uniformly.

[Brief explanation of drawings]

Figs. 1 to 4 conceptually explain the previously filed invention: Fig. 1 illustrates the sampling of the speech signal, Fig. 2 illustrates the process of extracting standard phoneme sets that pair a plurality of standard phonemes, Fig. 3 illustrates the process of obtaining similarity based on the standard phoneme sets, and Fig. 4 is a configuration diagram of the speech recognition apparatus. Fig. 5 is a processing block diagram of an embodiment of the present invention, Fig. 6 is a block diagram of an embodiment for determining the standard time-series pattern, and Figs. 7 to 12 illustrate the process of obtaining the standard time-series pattern and the form-based classification information according to the present invention.
In the figures, 1 is the k-parameter and pitch extraction unit; 2, 4, 6, 8, 10, ... are standard k-parameter stores; 3, 5, 7, 9, 11, ... are pitch stores; G is the maximum detection circuit; |a1|, |a2|, ... are standard phonemes; (S1(1, j), S2(2, j)) is the first standard phoneme set; (S'(1, j), S'(2, j)) is the second standard phoneme set; and {S1(1, j), S2(2, j), S'(1, j), S'(2, j)} is the composite standard phoneme set.

Claims

[Claims]
1. A speech recognition method for recognizing either or both of a word (including a sentence) and the speaker who uttered the word, characterized in that: time-series patterns are provided in which composite standard phoneme sets, each consisting of at least a first standard phoneme set that pairs a plurality of standard phonemes and a second standard phoneme set paired in the same way, are arranged in time series; means is provided for computing the similarity between the phonemes within each of the predetermined time intervals into which unknown input speech is divided and the first standard phoneme set and the second standard phoneme set; and the recognition is performed according to which of the time-series patterns formed by combining the first standard phoneme sets and the second standard phoneme sets is most similar to the time series of phonemes within the divided time intervals of the input speech.
2. The speech recognition method according to claim 1, characterized in that: based on a plurality of composite standard phoneme sets arranged in time series, time-series patterns are provided that are formed by combining either the first standard phoneme set or the second standard phoneme set in one composite standard phoneme set with either the first standard phoneme set or the second standard phoneme set in another composite standard phoneme set; the similarity between the time series of phonemes within the divided time intervals of known input speech and each of the time-series patterns is computed; and when, according to the result, the time series of phonemes of the known input speech has the same similarity within a predetermined range to a plurality of the time-series patterns, the time-series pattern containing the larger number of first standard phoneme sets is taken as the standard time-series pattern for recognizing unknown input speech.
3. The speech recognition method according to claim 1, characterized in that: when the time series of phonemes of known input speech has the same similarity within a predetermined range to a plurality of time-series patterns formed by combining either the first standard phoneme set or the second standard phoneme set in one composite standard phoneme set with either the first standard phoneme set or the second standard phoneme set in another composite standard phoneme set, the relationship among the plurality of time-series patterns is classified by form, and the form-based classification information is used as one item of information for recognizing unknown input speech.