JPH0756597B2 - Voice recognizer

Info

Publication number: JPH0756597B2
Application number: JP61160201A
Authority: JP
Inventors: 英生瀬川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1986-07-08
Filing date: 1986-07-08
Publication date: 1995-06-14
Anticipated expiration: 2010-06-14
Also published as: JPS6315296A

Description

[Detailed Description of the Invention] [Object of the Invention] (Industrial Field of Application) The present invention relates to a speech recognition apparatus designed to improve the accuracy with which words are recognized in continuous speech.

(Prior Art) In continuous speech recognition, the items to be recognized are input continuously, word after word, so a technique for extracting from the input the phoneme structure that makes it up is important.

In this regard, conventional continuous speech recognition divides the input speech signal into fixed analysis intervals called frames, extracts feature information such as a spectrum from each frame, assigns a phoneme label to each frame, and automatically segments the signal into phoneme intervals by merging frames that carry the same label. The frame length has been fixed at a relatively short value, for example 16 msec or 20 msec, so that phonemes of short duration such as plosive consonants can be identified.
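The merging step in this conventional scheme can be sketched as follows. This is a minimal illustration, not the patent's circuitry: the function name `merge_labels`, the tuple layout, and the toy label sequence are assumptions, and frames are assumed to be contiguous and non-overlapping.

```python
from itertools import groupby

FRAME_MS = 16  # one of the frame lengths mentioned in the text (the other is 20 msec)


def merge_labels(frame_labels, frame_ms=FRAME_MS):
    """Merge runs of identical per-frame labels into (label, start_ms, end_ms) segments."""
    segments = []
    start = 0  # index of the first frame of the current run
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start * frame_ms, (start + length) * frame_ms))
        start += length
    return segments


# Toy per-frame labels (hypothetical data).
segments = merge_labels(["a", "a", "a", "k", "i", "i"])
```

A single mislabeled frame in the middle of a run would split that run into three segments, which is the sensitivity to label changes that the following paragraphs discuss.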

However, if the frame length is set too short, labels change because of power fluctuations caused by external noise and the like, which makes the automatic segmentation described above impossible. In addition, no templates are prepared for the so-called "watari" (transition) portion, the point at which one phoneme changes into another, so frames that contain such a transition cannot be recognized.

This problem can be solved by making the frame length somewhat longer, but a longer frame length lowers the recognition rate for phonemes of short duration such as plosive consonants.

(Problems to be Solved by the Invention) As described above, in conventional continuous speech recognition, setting the frame length too short makes automatic segmentation difficult because of fluctuations in speech power and leaves the "transition" portions at phoneme change points unrecognizable, whereas setting the frame length long makes it impossible to recognize phonemes of short duration.

Accordingly, an object of the present invention is to provide a speech recognition apparatus that suppresses the influence of speech power fluctuations and "transition" portions on phoneme recognition without lowering the recognition rate for phonemes of short duration, and that therefore offers excellent recognition accuracy and permits automatic segmentation.

[Configuration of the Invention] (Means for Solving the Problems) A speech recognition apparatus according to the present invention comprises: feature extraction means for analyzing an input speech signal frame by frame, each frame having a fixed length, and extracting feature information for each frame; parallel conversion means for receiving the per-frame feature information from the feature extraction means and generating, for a plurality of frame groups that differ from one another in the number of frames, feature information on each frame group consisting of a plurality of consecutive frames, this feature information being calculated from the feature information of those consecutive frames and having the same amount of information as the per-frame feature information; phoneme recognition means for receiving the per-frame feature information extracted by the feature extraction means as the feature information of frame groups each consisting of one frame, receiving the feature information of each frame group generated by the parallel conversion means, collating the feature information of each frame group with a predetermined phoneme dictionary in parallel, and assigning a phoneme label to each frame group on the basis of the respective collation results; and word recognition means for collating the sequence of phoneme labels assigned by the phoneme recognition means with a preset word dictionary to obtain word recognition results, and for selecting, from among the plurality of phoneme labels, the phoneme labels to be used for word recognition on the basis of the collation results of the phoneme recognition means or the results of the word recognition.

Preferably, the word recognition means performs word recognition using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme label of the frame group having the highest similarity to the phoneme dictionary.

Preferably, the word recognition means outputs the word recognition result obtained using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme labels of the frame group that yields the highest similarity to the word dictionary when word recognition is performed with those labels.

Preferably, the word recognition means performs word recognition using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme label of the frame group closest to the phoneme duration predicted from the recognition result of the phoneme recognition means.

(Operation) According to the present invention, a plurality of phoneme recognitions with different frame lengths are performed in parallel at each time. Word recognition is then performed using, for example, the phoneme label of the frame group with the highest score (similarity, distance, or the like) among the phoneme labels obtained. The frame length is thus, in effect, varied adaptively according to the phoneme data, and the word recognition rate can be raised considerably compared with the conventional method in which the frame length is fixed.

(Embodiments) Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows the configuration of a continuous speech recognition apparatus according to one embodiment of the present invention.

The A/D conversion unit 1 converts a speech signal input from a speech input unit (not shown), a public telephone line, or the like into a digital signal of about 12 bits at a sampling rate of about 12 kHz, and outputs it to the feature extraction unit 2.

The feature extraction unit 2 analyzes the input digital speech signal in unit frames of fixed length, for example about 16 msec, and extracts its features. When spectrum analysis is used for feature extraction, for example, the feature extraction unit 2 can be built from a group of band-pass filters (a filter bank). The feature vectors extracted here are input to the parallel conversion unit 3.
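The figures quoted above imply roughly 192 samples per unit frame (12 kHz × 16 msec). The sketch below shows that arithmetic and a simple framing step. It is an illustration under stated assumptions: frames are taken as non-overlapping, and `frame_energy` is a stand-in feature chosen for brevity, not the filter-bank spectrum the text describes. All names are hypothetical.

```python
SAMPLE_RATE_HZ = 12_000  # "about 12 kHz" in the text
FRAME_MS = 16            # "about 16 msec" in the text
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per unit frame


def split_frames(samples, frame_len=FRAME_LEN):
    """Cut samples into consecutive non-overlapping unit frames, dropping a short tail."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def frame_energy(frame):
    """Stand-in feature (assumption): mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


frames = split_frames([0] * 400)  # 400 samples give two full frames and a dropped tail
```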

The parallel conversion unit 3 distributes the input feature vector of each unit frame over n paths. It consists of n−1 stages of delay circuits 3a, n−1 adders 3b that add the outputs of the delay circuits 3a to the input feature vector, and n−1 coefficient circuits 3c that divide the outputs of the adders 3b by the number of frames added. The n kinds of feature vectors input to the phoneme recognition unit 4 through these n paths are weighted averages taken over different numbers of frames. The fewer the frames added, the better the vector carries the feature information of short-duration phonemes such as plosive consonants; the more frames added, the more the vector carries phoneme information in which the influence of power fluctuations and "transition" portions is suppressed by the feature vectors of past, steady phonemes. These feature vectors are input to the phoneme recognition unit 4 in parallel.
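The delay, add, and divide structure described above can be sketched in software as a set of running averages. This is a minimal sketch, assuming equal weights (a plain mean over the newest k frames); the text says only "weighted average", and the function name and toy vectors are hypothetical.

```python
def parallel_convert(history, n):
    """Return n averaged vectors: path k averages the newest k frame vectors (k = 1..n).

    history: per-frame feature vectors, oldest first, with at least n entries.
    """
    running = [0.0] * len(history[-1])
    outputs = []
    for k in range(1, n + 1):
        delayed = history[-k]                                  # frame delayed by k-1 steps
        running = [s + x for s, x in zip(running, delayed)]    # adder
        outputs.append([s / k for s in running])               # coefficient circuit: 1/k
    return outputs


# Toy two-dimensional feature vectors (hypothetical data), oldest first.
paths = parallel_convert([[0.0, 6.0], [2.0, 0.0], [4.0, 0.0]], n=3)
```

Every path yields a vector of the same dimension as a single frame, which matches the requirement that the frame-group feature carry the same amount of information as the per-frame feature.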

The phoneme recognition unit 4 collates each of the n kinds of input feature vectors with the phoneme dictionary and assigns a phoneme label to each feature vector. The similarity to the phoneme dictionary can be calculated by, for example, the multiple similarity method. Instead of a similarity, the distance to the dictionary templates may be calculated. These n kinds of phoneme labels are temporarily stored in the phoneme label buffer 6 together with their scores (similarity or distance).
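The labeling step can be sketched as a nearest-template lookup. Note the substitution: the sketch uses cosine similarity against one template per phoneme as a simple stand-in, not the multiple similarity method named in the text. The dictionary contents and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here only as a stand-in similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_vector(vec, dictionary):
    """Return (label, score) for the best-matching template in a {label: template} dict."""
    scored = ((label, cosine(vec, template)) for label, template in dictionary.items())
    return max(scored, key=lambda pair: pair[1])


PHONEME_DICT = {"a": [1.0, 0.0], "i": [0.0, 1.0]}  # toy templates, not from the patent

# One (label, score) pair per path of the parallel conversion step.
labelled = [label_vector(v, PHONEME_DICT) for v in ([4.0, 0.0], [3.0, 0.0], [1.0, 3.0])]
```

If a distance were used instead of a similarity, the selection would take the minimum rather than the maximum.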

The phoneme label buffer 6 also holds feature vectors generated in the past. From these feature vectors, the phoneme label buffer 6 selects, as shown in FIG. 2, a total of 2n−1 kinds of weighted-average feature vectors extending into the past and the future around time t−n, n−1 frames earlier, and outputs them to the label sorting unit 7.

The label sorting unit 7 sorts the scores of these 2n−1 kinds of feature vectors in descending order. As a result, for example, the top-ranked phoneme label with the highest similarity is output to the word recognition unit 8, together with its score, as the phoneme label at time t−n.
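The sorting step amounts to ranking the candidate labels by score and taking the first. A minimal sketch, assuming each candidate is a `(label, score, frame_count)` tuple and that a higher score is better; the tuple layout and the values are hypothetical.

```python
def rank_labels(candidates):
    """Return (label, score, frame_count) candidates sorted by descending score."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Toy candidates for one time instant (hypothetical data).
candidates = [("p", 0.91, 1), ("a", 0.62, 2), ("a", 0.70, 3)]
ranked = rank_labels(candidates)
best_label, best_score, best_frames = ranked[0]
```

Here the one-frame candidate wins, which corresponds to the plosive case discussed later in the description.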

The word recognition unit 8 collates the input phoneme label sequence with the word dictionary 9. The collation is performed by, for example, the well-known DP (dynamic programming) matching. As shown in FIG. 3, this DP matching takes the input phoneme sequence on the horizontal axis and the standard patterns, expressed as a network, on the vertical axis and, advancing one frame at a time from the start point, selects the path that maximizes the sum of similarities to the phoneme labels making up a standard pattern. At time t, the optimal path to every node of the standard patterns is calculated; at the next frame, scores are calculated for every path permitted by the standard pattern network, and the optimal paths at time t+1 are calculated. The standard pattern with the largest similarity sum is then output as the recognition result.
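The frame-by-frame DP matching described above can be sketched as follows. This is a simplified illustration: each standard pattern is a plain left-to-right chain of phoneme labels rather than a general network, a path may only stay on a node or advance to the next one, and each input frame is given as a dict of per-label similarities. The names and toy data are assumptions.

```python
NEG = float("-inf")


def dp_match(frames, pattern):
    """Best similarity sum aligning frames to a left-to-right chain of phoneme labels.

    frames: list of {label: similarity} dicts, one per input frame.
    pattern: list of labels; each frame stays on a node or moves to the next node.
    """
    best = [NEG] * len(pattern)
    best[0] = frames[0].get(pattern[0], 0.0)      # every path starts on the first node
    for frame in frames[1:]:
        prev = best
        best = [NEG] * len(pattern)
        for j, label in enumerate(pattern):
            stay = prev[j]
            advance = prev[j - 1] if j > 0 else NEG
            source = max(stay, advance)
            if source > NEG:
                best[j] = source + frame.get(label, 0.0)
    return best[-1]                               # every path ends on the last node


def recognize(frames, word_dict):
    """Return the word whose pattern gives the largest similarity sum."""
    return max(word_dict, key=lambda word: dp_match(frames, word_dict[word]))


WORDS = {"aki": ["a", "k", "i"], "ai": ["a", "i"]}  # toy word dictionary
frames = [{"a": 0.9}, {"a": 0.8, "k": 0.1}, {"k": 0.7}, {"i": 0.9}]
result = recognize(frames, WORDS)
```

Only the previous column of scores is kept, which mirrors the description of computing the optimal paths at time t+1 from those at time t.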

With the speech recognition apparatus of this embodiment configured as above, the phoneme label of whichever of the n analysis frame lengths gives the highest similarity is adopted at each time, so the frame length is, in effect, varied adaptively according to the phoneme data.

For example, in FIG. 2, if the speech data obtained in the unit frame at time t−n has a short duration, as a plosive does, the score of the phoneme label obtained with the shortest frame length is the largest, and that label is adopted as the label of the frame.

If, on the other hand, the speech data obtained at time t−n is a "transition" portion, the change point between different phonemes, steady phoneme data lies on either side of time t−n. Taking a weighted average of several past or future frames around time t−n therefore suppresses the influence of the "transition" portion in proportion to the number of frames averaged. In this case the score of the phoneme label obtained with a long frame is large, and that label is adopted as the recognition result.

Likewise, when the speech power fluctuates, taking a weighted average with other frames suppresses the influence of the fluctuation, so the score of the phoneme label for a relatively long frame is high.

Automating segmentation is a major technical challenge in continuous speech recognition. With this apparatus, adaptively varying the analysis frame length according to the phoneme data absorbs variations in speech power and the like and prevents label changes caused by phoneme misrecognition, so the same label is likely to be assigned throughout a single phoneme interval. Merging frames that carry the same label therefore makes it easier to automate segmentation.

FIG. 4 shows the configuration of a speech recognition apparatus according to another embodiment of the present invention. Parts identical to those in FIG. 1 bear the same reference numerals, and duplicate description is omitted. This apparatus differs from the preceding one in its word recognition unit 11. The word recognition unit 11 calculates in parallel the similarity between the standard patterns in the word dictionary 9 and the phoneme labels output from the phoneme recognition unit 4 up to each of the times t−1, t−2, …, t−n. The resulting DP matching results (scores) up to each of the times t−1, t−2, …, t−n are held in the word DP matching score buffer 12.

DP matching is a method of calculating the optimal path to every node at a given time. In this embodiment, as shown in FIG. 5, n kinds of matching are performed at time t: the DP matching result at time t−1 combined with the phoneme label for one frame; the DP matching result at time t−2 combined with the phoneme label of the pattern averaged over two frames (score multiplied by 2); …; and the DP matching result at time t−n combined with the phoneme label of the pattern averaged over n frames (score multiplied by n). In this way the optimal path can be found dynamically while the optimal frame length up to time t is also found dynamically. After this matching has been repeated, the pattern with the largest similarity sum at the end point becomes the recognition result and is sent to the display unit 7.
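The n-way matching described above can be read as a dynamic program in which each step consumes between 1 and n frames and the score of a k-frame step is weighted by k. The sketch below is one possible reading under stated assumptions: the standard pattern is a left-to-right chain, `seg_score` is a hypothetical callback giving the similarity of a label to the k-frame average ending at frame t, and the toy scoring rule (strict majority of per-frame labels) merely stands in for averaging feature vectors.

```python
NEG = float("-inf")


def variable_length_dp(num_frames, pattern, seg_score, n):
    """Best similarity sum when each step may consume 1..n frames.

    seg_score(t, k, label): similarity of `label` to the average of the k frames
    ending at frame t (1-based); it is weighted by k, as in the text.
    """
    # best[t][j]: best sum over frames 1..t with the path ending on pattern node j
    best = [[NEG] * len(pattern) for _ in range(num_frames + 1)]
    for t in range(1, num_frames + 1):
        for k in range(1, min(n, t) + 1):
            for j, label in enumerate(pattern):
                if t == k:                          # this step starts at frame 1
                    prev = 0.0 if j == 0 else NEG
                else:                               # extend the result stored at t-k
                    stay = best[t - k][j]
                    advance = best[t - k][j - 1] if j > 0 else NEG
                    prev = max(stay, advance)
                if prev > NEG:
                    best[t][j] = max(best[t][j], prev + k * seg_score(t, k, label))
    return best[num_frames][-1]


FRAME_LABELS = ["a", "x", "a", "k", "i"]  # toy per-frame labels; "x" is a noisy frame


def toy_score(t, k, label):
    """Toy rule (assumption): 1.0 if `label` is a strict majority of the k frames."""
    segment = FRAME_LABELS[t - k:t]
    return 1.0 if 2 * segment.count(label) > k else 0.0


adaptive = variable_length_dp(5, ["a", "k", "i"], toy_score, n=3)
fixed = variable_length_dp(5, ["a", "k", "i"], toy_score, n=1)
```

With n = 3 the noisy second frame is absorbed into a three-frame step and the total is higher than with the fixed single-frame length (n = 1). The table `best` plays the role of the score buffer that holds the results up to times t−1 through t−n.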

In this method DP matching is performed in n ways for every frame, but because these are carried out in parallel, the computation speed does not fall.

FIG. 5 shows yet another embodiment. In this apparatus, a phoneme duration control unit 21 is newly provided in parallel with the word recognition unit 11 of the apparatus of FIG. 3. The phoneme duration control unit 21 obtains a phoneme duration from the labels that the phoneme recognition unit 4 produces by unit-frame-length analysis, and sends it to the word recognition unit 11, which performs DP matching with a frame length whose duration suits each label.

With this apparatus, when unit-frame-length analysis yields a phoneme label expected to have a short duration, such as "p" or "t", the optimal path is found using the DP matching result up to time t−1 and the phoneme label for one frame. Conversely, for a phoneme expected to have a long duration, such as a vowel, the optimal path is found using the DP matching result up to time t−n and the phoneme label for n frames. That is, non-stationary phonemes of short duration are recognized with a short frame length, and stationary phonemes of long duration are recognized with a long frame length. The durations set in the phoneme duration control unit 21 may be determined from empirically obtained duration probability distributions for each phoneme.
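The duration-based choice of frame length can be sketched as a lookup followed by a nearest-length search. This is a minimal sketch: the table `EXPECTED_MS` holds hypothetical durations (the patent gives no numbers and refers only to empirically obtained distributions), and a single expected value per phoneme stands in for a full probability distribution.

```python
FRAME_MS = 16  # unit frame length from the embodiment

# Hypothetical expected durations in msec; these values are not from the patent.
EXPECTED_MS = {"p": 15, "t": 20, "a": 90, "i": 70}


def frames_for(label, n, expected_ms=EXPECTED_MS, frame_ms=FRAME_MS):
    """Number of frames (1..n) whose total length is closest to the expected duration."""
    target = expected_ms.get(label, frame_ms)  # unknown labels fall back to one frame
    return min(range(1, n + 1), key=lambda k: abs(k * frame_ms - target))


short = frames_for("p", n=4)   # plosive: one frame
long_ = frames_for("a", n=4)   # vowel: capped at the longest available length
```

The returned frame count would then select which stored DP matching result (t−1 through t−n) the word recognition step extends.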

[Effects of the Invention] As described above, according to the present invention, analysis frames of various lengths are prepared, phoneme recognition is performed in parallel for each frame length, and the frame length giving the best recognition result is adopted, so the analysis frame length can be varied adaptively according to the input phoneme. It is therefore possible to provide a speech recognition apparatus that suppresses the influence of speech power fluctuations and "transition" portions on phoneme recognition without lowering the recognition rate for phonemes of short duration, and that consequently offers excellent recognition accuracy and permits automatic segmentation.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to one embodiment of the present invention; FIG. 2 illustrates the contents of the phoneme label buffer in that apparatus; FIG. 3 illustrates the operation of the word recognition unit in that apparatus; FIG. 4 is a block diagram showing the configuration of a speech recognition apparatus according to another embodiment of the present invention; FIG. 5 illustrates the operation of the word recognition unit in that apparatus; and FIG. 6 is a block diagram showing the configuration of a speech recognition apparatus according to yet another embodiment of the present invention.

1: A/D conversion unit; 2: feature extraction unit; 3: parallel conversion unit; 4: phoneme recognition unit; 5: phoneme dictionary; 6: phoneme label buffer; 7: label sorting unit; 8, 11: word recognition unit; 9: word dictionary; 12: word DP matching score buffer; 21: phoneme duration control unit.

Claims

1. A speech recognition apparatus comprising: feature extraction means for analyzing an input speech signal frame by frame, each frame having a fixed length, and extracting feature information for each frame; parallel conversion means for receiving the per-frame feature information from the feature extraction means and generating, for a plurality of frame groups that differ from one another in the number of frames, feature information on each frame group consisting of a plurality of consecutive frames, this feature information being calculated from the feature information of those consecutive frames and having the same amount of information as the per-frame feature information; phoneme recognition means for receiving the per-frame feature information extracted by the feature extraction means as the feature information of frame groups each consisting of one frame, receiving the feature information of each frame group generated by the parallel conversion means, collating the feature information of each frame group with a predetermined phoneme dictionary in parallel, and assigning a phoneme label to each frame group on the basis of the respective collation results; and word recognition means for collating the sequence of phoneme labels assigned by the phoneme recognition means with a preset word dictionary to obtain word recognition results, and for selecting, from among the plurality of phoneme labels, the phoneme labels to be used for word recognition on the basis of the collation results of the phoneme recognition means or the results of the word recognition.

2. The speech recognition apparatus according to claim 1, wherein the word recognition means performs word recognition using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme label of the frame group having the highest similarity to the phoneme dictionary.

3. The speech recognition apparatus according to claim 1, wherein the word recognition means outputs the word recognition result obtained using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme labels of the frame group that yields the highest similarity to the word dictionary when word recognition is performed with those labels.

4. The speech recognition apparatus according to claim 1, wherein the word recognition means performs word recognition using, from among the plurality of phoneme labels obtained in parallel by the phoneme recognition means, the phoneme label of the frame group closest to the phoneme duration predicted from the recognition result of the phoneme recognition means.