JP3523949B2

JP3523949B2 - Voice recognition device and voice recognition method

Info

Publication number: JP3523949B2
Application number: JP31062395A
Authority: JP
Inventors: 晴剛安田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1995-11-29
Filing date: 1995-11-29
Publication date: 2004-04-26
Anticipated expiration: 2015-11-29
Also published as: JPH09152888A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice recognition device and a voice recognition method that have a function for post-processing recognition results in voice recognition and that are applicable in particular to the field of word voice recognition.

[0002]

[Prior Art] An example of the basic functions of a voice recognition device is described in, for example, a paper published in the Proceedings of the Acoustical Society of Japan, 3-1-8 (October 1983). An example of a recognition method for unspecified speakers is disclosed in "Unspecified-speaker word speech recognition using fuzzy pattern matching", Proceedings of the 10th Symposium on Information Theory and Its Applications, and a method of creating a dictionary for unspecified speakers is described in "Unspecified-speaker word speech recognition", Transactions of the Institute of Electronics and Communication Engineers, Vol. J69-A, No. 1 (1986).

[0003]

[Problems to Be Solved by the Invention] The operation by which a voice recognition device collates an unknown input voice with standard patterns, which are registered and stored in advance and are used to recognize that unknown input voice, is described with reference to FIG. 2. The unknown input voice input from the microphone 1 has its input level optimized and signals in unnecessary bands removed by the preprocessing means 2, and is then converted by the feature extraction means 3 into a feature pattern needed to recognize the unknown input voice. The recognition processing means 5 compares and collates the converted feature pattern with the plurality of standard patterns registered and stored in advance in the standard pattern storage means 4, and determines, for each standard pattern, the similarity between that standard pattern and the feature pattern. Candidate words for the unknown input voice are then determined in descending order of the similarity obtained for each standard pattern. Next, based on the determined candidate words, the post-processing means 6 outputs a candidate word to the outside via the result output means 7 if a certain criterion is satisfied, and rejects the candidate word if the criterion is not satisfied.

[0004] Next, the method of creating the standard patterns is described. Many methods of creating standard patterns for voice recognition devices have been reported; the description here uses, as an example, the pattern matching method that has conventionally been used. In the specific-speaker method, the specific speaker utters each word once or several times before using the voice recognition device, and the input patterns based on these utterances are weighted-averaged to create a dictionary of input patterns. In a voice recognition device that handles unspecified speakers, input patterns based on the utterances of multiple speakers are weighted-averaged to create the dictionary of input patterns. In that case, to give the standard patterns generality, a plurality of evaluation input patterns are often input and the standard patterns are optimized so that the recognition rate is maximized. An example of such a method is disclosed in Japanese Unexamined Patent Application Publication No. H5-313687. The standard patterns created in this way are stored in the standard pattern storage means 4. There are two cases: one in which specific speakers and unspecified speakers are each recognized separately, and one in which both are recognized together.

[0005] Conventionally, a voice recognition device receives not only the voice of its user but also ambient noise from the environment in which it is used and speech unrelated to the user's intended purpose, such as private conversation. The voice input device reacts to such noise and private conversation, which has been a problem. Furthermore, when the user utters a word that has not been registered in advance, the device sometimes outputs some result even though the obtained similarity is low, because no effective rejection function is provided, and this causes harm. To prevent such harm, conventional voice recognition devices have needed a function that rejects a recognition result of low confidence and prompts the user to input again. Voice recognition devices with such a rejection function are disclosed in Japanese Unexamined Patent Application Publications No. S61-73200 and No. H1-156800. In these devices, the unknown input voice is recognized using the standard patterns, and the first-rank standard pattern, which has the largest similarity (the first-rank similarity), and the second-rank standard pattern, which has the second-largest similarity (the second-rank similarity), are obtained. The ratio of the first-rank similarity to the second-rank similarity is then calculated. If this ratio is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, the first-rank standard pattern is output as the voice recognition result; otherwise, the first-rank standard pattern is rejected.
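The conventional rejection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `conventional_reject`, its parameter names, and the representation of similarities as a dictionary keyed by pattern name are hypothetical, and the ratio is assumed to be the first-rank similarity divided by the second-rank similarity.

```python
from typing import Dict, Optional


def conventional_reject(similarities: Dict[str, float],
                        ratio_threshold: float,
                        min_similarity: float) -> Optional[str]:
    """Return the first-rank pattern name, or None if it is rejected.

    ratio_threshold plays the role of the first threshold and
    min_similarity the role of the second threshold (hypothetical names).
    """
    if len(similarities) < 2:
        raise ValueError("need at least two standard patterns")

    # Rank the standard patterns by similarity, highest first.
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    (first_name, first_sim), (_, second_sim) = ranked[0], ranked[1]

    # Assumed definition: ratio = first-rank / second-rank similarity.
    ratio = first_sim / second_sim if second_sim > 0 else float("inf")

    # Both fixed thresholds must be met, otherwise the result is rejected.
    if ratio >= ratio_threshold and first_sim >= min_similarity:
        return first_name
    return None
```

Because both thresholds are fixed constants shared by every standard pattern, the same call can accept or reject the same word depending only on how the similarity scale shifts between speakers or environments.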

[0006] In this case, however, the first threshold and the second threshold are fixed at constant values. The similarity between an unknown input voice and a standard pattern varies with the environment in which the voice is uttered, and when unspecified speakers and specific speakers are mixed, the baseline of the similarity differs between them. As a result, for some unknown input voices, a standard pattern listed as a candidate tends to be rejected even though it is the correct answer, or is hard to reject even though it is an incorrect answer. The present invention has been made to solve these problems, and its object is to provide a voice recognition device and a voice recognition method that can obtain stable recognition results for unknown input voices.

[0007]

[Means for Solving the Problems, and Operation] According to one aspect of the present invention, a voice recognition device comprises standard pattern storage means for storing standard patterns, and recognition means for recognizing an unknown input voice by comparing and collating the unknown input voice with the standard patterns. The device further comprises reference similarity storage means for storing a reference similarity provided for each of the standard patterns selected for the unknown input voice, and rejection threshold storage means for storing a rejection threshold that is generated on the basis of the reference similarity stored in the reference similarity storage means and that serves as the criterion for rejecting adoption of a standard pattern selected for the unknown input voice. When the recognition means decides, on the basis of the reference similarity, whether a standard pattern selected by the comparison and collation is to be taken as the recognition result, it further calculates the similarity between the unknown input voice and each standard pattern and obtains the first-rank standard pattern, which has the largest similarity (the first-rank similarity), and the second-rank standard pattern, which has the second-largest similarity (the second-rank similarity). For a first-rank standard pattern for which the ratio of the first-rank similarity to the second-rank similarity is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, the recognition means adopts the first-rank standard pattern if its first-rank similarity is equal to or greater than the rejection threshold.

[0008]

[0009] In such a voice recognition device, a reference similarity is attached to each standard pattern, and whether a standard pattern selected as a candidate is adopted as the voice recognition result is judged on the basis of that reference similarity. Voice recognition is therefore performed not merely according to the magnitude of the similarity between the unknown input voice and the standard patterns but also taking the reference similarity into account, so the rate of correct answers in the voice recognition results can be improved. In this way, the invention-specifying matters of the above aspect act so that stable recognition results can be obtained for input voice.

[0010] According to another aspect of the present invention, a voice recognition method stores standard patterns and recognizes an unknown input voice by comparing and collating the unknown input voice with the standard patterns. The method stores a reference similarity provided for each of the standard patterns having the highest similarity to the unknown input voice, and stores a rejection threshold that is generated on the basis of the reference similarity and is used to reject adoption of a standard pattern selected for the unknown input voice. When deciding, on the basis of the reference similarity, whether a standard pattern selected by the comparison and collation is to be taken as the recognition result, the method further calculates the similarity between the unknown input voice and each standard pattern and obtains the first-rank standard pattern, which has the largest similarity (the first-rank similarity), and the second-rank standard pattern, which has the second-largest similarity (the second-rank similarity). For a first-rank standard pattern for which the ratio of the first-rank similarity to the second-rank similarity is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, the method adopts the first-rank standard pattern if its first-rank similarity is equal to or greater than the rejection threshold.

[0011]

[0012]

[Embodiments of the Invention] A voice recognition device according to one embodiment of the present invention is described below with reference to the drawings. This voice recognition device has a post-processing function for voice recognition. When a word that has not been registered in advance, that is, an unknown input voice, is uttered to the device, the device is able to reject that unknown input voice effectively and thereby prevent misrecognition. In FIG. 1, components identical to those shown in FIG. 2 are given the same reference numerals and their description is omitted. To simplify the explanation, the following description takes the unspecified-speaker recognition scheme as an example.

In this embodiment, the output side of the feature extraction means 3 is connected to the standard pattern generation means 11 while standard patterns are being created and, once the standard patterns have been created, to the recognition processing means 5 included in the recognition means 15 while an unknown input voice is being recognized. The output side of the standard pattern generation means 11 is connected to the standard pattern storage means 4. The standard pattern generation means 11 is supplied with the voice patterns obtained when the feature extraction means 3 converts input voice for standard pattern generation into the patterns needed for recognition. For unspecified speakers, the standard pattern generation means 11 is a known device for creating standard patterns: for each word, it uses the voice patterns obtained from the input voices for standard pattern generation uttered by several people to generate one standard pattern corresponding to that word. For a specific speaker, one standard pattern per word is generated from the voice patterns obtained when one person utters the word several times.

The output side of the standard pattern storage means 4 is connected to the recognition processing means 5, and during recognition of an unknown input voice the standard patterns stored in the standard pattern storage means 4 are sent to the recognition processing means 5. The output side of the recognition processing means 5 is connected to the reference similarity generation means 12 while the reference similarity described later is being generated, and to the rejection decision means 14 included in the recognition means 15 during the recognition operation.

[0013] While the reference similarity is being generated, the voice pattern converted by the feature extraction means 3 is supplied to the reference similarity generation means 12 via the recognition processing means 5. Using a plurality of evaluation patterns, which are stored in advance in the reference similarity generation means 12 or supplied from outside, together with the voice pattern supplied from the feature extraction means 3, the reference similarity generation means 12 calculates the similarity between the voice pattern and each evaluation pattern and calculates the reference similarity from these results. The evaluation patterns are the tens to hundreds of voice patterns obtained when a group of people (for example, 100 people) different from those who spoke to create the standard patterns utter input voice for reference similarity generation for, say, the word "pencil". Specifically, the reference similarity generation means 12 calculates the similarity between the voice pattern corresponding to "pencil" supplied from the feature extraction means 3 and each of the tens to hundreds of "pencil" voice patterns. Tens to hundreds of similarities are thus obtained for the word "pencil", and the reference similarity generation means 12 applies statistical processing to them. In this embodiment the statistical processing computes the mean, but it is not limited to this; a value such as the variance may be computed instead. The value obtained in this way, for example the mean, becomes the reference similarity. A large reference similarity therefore means that many people utter voice patterns close to the voice pattern of the unknown input voice. In other words, the reference similarity expresses the degree to which a standard pattern listed as a candidate for an unknown input voice matches that voice, that is, the rate at which selecting that standard pattern is correct. The generated reference similarity is stored in the reference similarity storage means 16 in association with its standard pattern. A reference similarity is thus provided for each standard pattern.
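The reference similarity computation described above, a mean over the similarities obtained against the evaluation patterns, might look like the following sketch. The source does not define the similarity measure or the pattern representation, so `toy_similarity` and the use of plain floats as stand-ins for voice patterns are purely hypothetical; only the averaging step reflects the text.

```python
from statistics import mean
from typing import Callable, Sequence


def reference_similarity(pattern: float,
                         evaluation_patterns: Sequence[float],
                         similarity: Callable[[float, float], float]) -> float:
    """Average the similarity of `pattern` to every evaluation pattern."""
    if not evaluation_patterns:
        raise ValueError("at least one evaluation pattern is required")
    # One similarity per evaluation pattern (tens to hundreds in the text).
    scores = [similarity(pattern, e) for e in evaluation_patterns]
    # The embodiment uses the mean as the statistical processing.
    return mean(scores)


def toy_similarity(a: float, b: float) -> float:
    """Hypothetical stand-in: closer values give a higher score."""
    return 100.0 - abs(a - b)


# Scalars stand in for real feature patterns in this illustration.
pencil_reference = reference_similarity(10.0, [8.0, 10.0, 14.0], toy_similarity)
```

Swapping `mean` for another statistic would correspond to the alternative the text mentions, where a value such as the variance is used instead.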

[0014] The recognition means 15, which is connected to the feature extraction means 3 during the recognition operation, has the recognition processing means 5 and the rejection decision means 14. The recognition processing means 5 calculates the similarity between the voice pattern converted from the unknown input voice by the feature extraction means 3 and each standard pattern read from the standard pattern storage means 4. Based on the calculated similarities, it determines the first-rank standard pattern having the largest (first-rank) similarity, the second-rank standard pattern having the second-rank similarity, and so on, and sends them out in that order. The output of the reference similarity storage means 16 is connected to the recognition processing means 5 and to the rejection threshold generation means 13. The rejection threshold generation means 13 generates a rejection threshold by multiplying each reference similarity stored in the reference similarity storage means 16 by a coefficient with a value of 1 or less, and sends the rejection threshold to the rejection decision means 14 or to the rejection threshold storage means 17. The output side of the rejection threshold storage means 17 is connected to the rejection decision means 14. When rejection thresholds have been stored in the rejection threshold storage means 17, the rejection thresholds read from it are sent to the rejection decision means 14 during the voice recognition operation.

[0015] During the recognition operation, the rejection decision means 14 rejects adoption of the first-rank standard pattern supplied from the recognition processing means 5 if its first-rank similarity is less than the rejection threshold calculated from that pattern's reference similarity. If, on the other hand, the second-rank similarity of the second-rank standard pattern is equal to or greater than the rejection threshold calculated from that pattern's reference similarity, the second-rank standard pattern is adopted as the recognition result for the unknown input voice. The output of the rejection decision means 14 is connected to the result output means 7, which sends the standard pattern adopted as the recognition result to the outside.

As shown in FIG. 1, the voice recognition device of the above embodiment is provided with the standard pattern generation means 11, the reference similarity generation means 12, and the rejection threshold generation means 13. In a type of voice recognition device that does not need to change the standard patterns, reference similarities, and rejection thresholds, however, these three means are not provided. In such a device, the standard patterns, reference similarities, and rejection thresholds are stored in advance in the standard pattern storage means 4, the reference similarity storage means 16, and the rejection threshold storage means 17, respectively.

[0016] The operation of the voice recognition device configured in this way is described below. Before voice recognition is executed, the standard patterns, reference similarities, and rejection thresholds are generated. To generate a standard pattern, for example, 50 people utter "pencil", and from the resulting 50 or so voice patterns sent out by the feature extraction means 3, the "pencil" pattern that gives the highest recognition rate is generated. A standard pattern is generated for each word in this way. As described above, a reference similarity is generated for each standard pattern from the results of calculating the similarity between the evaluation patterns and the voice pattern of the unknown input voice. Each generated reference similarity is stored in the reference similarity storage means 16 in association with its standard pattern. A rejection threshold is generated for each reference similarity by multiplying the reference similarity by a coefficient. The generated rejection thresholds may be stored in the rejection threshold storage means 17 and sent from there to the rejection decision means 14, or they may be sent directly from the rejection threshold generation means 13 to the rejection decision means 14. The following description takes as an example the case in which they are sent directly from the rejection threshold generation means 13 to the rejection decision means 14.

[0017] Next, the voice recognition operation is described. The unknown input voice input from the microphone 1 has its input level optimized and unnecessary bands removed by the preprocessing means 2, and is then converted by the feature extraction means 3 into the voice pattern needed for recognition. The voice pattern is sent to the recognition processing means 5, where it is compared and collated with each standard pattern read from the standard pattern storage means 4 to obtain its similarity to each standard pattern. The recognition processing means 5 is also supplied, from the reference similarity storage means 16, with the reference similarity corresponding to each standard pattern read from the standard pattern storage means 4. According to the similarities obtained, the recognition processing means 5 determines the candidate words in descending order of similarity: the first-rank standard pattern, the second-rank standard pattern, and so on. In the comparison with the standard patterns, multiple standard patterns may be defined for the same word, for example to distinguish male and female speakers or to handle dialects.

[0018] The rejection decision means 14 is supplied by the recognition processing means 5 with the candidate standard patterns in order, for example the first-rank standard pattern, the second-rank standard pattern, and so on. It is also supplied by the rejection threshold generation means 13 with the rejection thresholds based on the reference similarities corresponding to these standard patterns. Taking the first-rank standard pattern first and then the second-rank standard pattern, the rejection decision means 14 judges whether the similarity of each standard pattern supplied from the recognition processing means 5 falls below the corresponding rejection threshold supplied from the rejection threshold generation means 13. If the similarity is less than the rejection threshold, the rejection decision means 14 regards the standard pattern having that similarity as an incorrect answer and rejects it. Conversely, if the similarity is equal to or greater than the rejection threshold, the standard pattern is regarded as the correct answer and sent to the result output means 7, which sends it to the outside.

[0019] The rejection operation described above is explained more concretely, taking as an example the case where the unknown input voice "telephone" is input. Suppose that, as a result of the comparison between the unknown input voice "telephone" and each standard pattern in the recognition processing means 5, the similarity to standard pattern A is 100, the similarity to standard pattern B is 70, and the similarity to standard pattern C is 80. The recognition processing means 5 then ranks standard pattern A as the first-rank standard pattern, standard pattern C as the second-rank standard pattern, and standard pattern B as the third-rank standard pattern. Suppose further that standard pattern A is assigned a reference similarity of 150, standard pattern B a reference similarity of 200, and standard pattern C a reference similarity of 60. If the coefficient that determines the rejection threshold is, for example, 0.8, the rejection threshold is 120 for standard pattern A, 160 for standard pattern B, and 48 for standard pattern C. In the rejection determination means 14, the similarity of standard pattern A, 100, is less than its rejection threshold of 120, so standard pattern A is rejected even though it ranks first in similarity. The second-rank standard pattern C and the third-rank standard pattern B are both regarded as having similarities equal to or greater than their respective rejection thresholds, so they are not rejected, and standard pattern C, which has the highest similarity among them, is output from the result output means 7 as the recognition result.
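The arithmetic of this worked example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the names `Candidate`, `COEFFICIENT`, and `recognize` are hypothetical and do not appear in the patent, and the rejection threshold is assumed to be the coefficient multiplied by the reference similarity, as the figures 120, 160, and 48 suggest. Note that with the figures as stated, standard pattern B (similarity 70 against a threshold of 160) also falls below its threshold in this sketch; the output, standard pattern C, is the same either way.

```python
from dataclasses import dataclass

COEFFICIENT = 0.8  # coefficient from the worked example; the name is hypothetical


@dataclass
class Candidate:
    name: str
    similarity: float            # similarity to the unknown input voice
    reference_similarity: float  # reference similarity assigned to the pattern

    @property
    def rejection_threshold(self) -> float:
        # Assumption: threshold = coefficient x reference similarity.
        return COEFFICIENT * self.reference_similarity


def recognize(candidates):
    """Return the highest-ranked candidate that is not rejected, or None."""
    ranked = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    for candidate in ranked:
        # A candidate is rejected when its similarity is below its own threshold.
        if candidate.similarity >= candidate.rejection_threshold:
            return candidate
    return None


candidates = [
    Candidate("A", similarity=100, reference_similarity=150),
    Candidate("B", similarity=70, reference_similarity=200),
    Candidate("C", similarity=80, reference_similarity=60),
]

result = recognize(candidates)
print(result.name if result else "rejected")  # prints C
```

Checking candidates in rank order means the first pattern that clears its own threshold is output, which matches the order of judgment described for the rejection determination means 14.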

[0020] Furthermore, the conventionally used rejection determination method may be applied first, before the comparison using the reference similarity described above. Conversely, the conventional rejection determination method may be applied to standard patterns whose similarity exceeds the reference similarity or the rejection threshold. That is, for example, the first-rank standard pattern and the second-rank standard pattern are obtained; then, for a first-rank standard pattern for which the ratio between the first-rank similarity (the similarity of the first-rank standard pattern) and the second-rank similarity (the similarity of the second-rank standard pattern) is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, the first-rank standard pattern may be adopted as the recognition result when its first-rank similarity or reference similarity is equal to or greater than the rejection threshold. With this configuration, the voice recognition result can be obtained with higher accuracy than when voice recognition is performed based on the reference similarity alone.
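One way to picture the combined decision is the following Python sketch. It is a hedged illustration, not the patented implementation: the function name `accept_first_rank` and its parameter names are hypothetical, the values passed for the first and second thresholds are invented for the example, and the final check compares the first-rank similarity with the per-pattern rejection threshold, which is the variant recited in the claims.

```python
def accept_first_rank(similarities, rejection_thresholds,
                      ratio_threshold, similarity_floor):
    """Return the first-rank pattern name if it passes every check, else None.

    similarities         -- dict: pattern name -> similarity to the input voice
    rejection_thresholds -- dict: pattern name -> per-pattern rejection threshold
    ratio_threshold      -- the "first threshold" (value chosen by the caller)
    similarity_floor     -- the "second threshold" (value chosen by the caller)
    """
    if len(similarities) < 2:
        raise ValueError("at least two standard patterns are required")

    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    (first, s1), (_, s2) = ranked[0], ranked[1]

    # Conventional check 1: first-rank / second-rank ratio.
    # Assumption: the ratio check is skipped when s2 is not positive.
    if s2 > 0 and s1 / s2 < ratio_threshold:
        return None
    # Conventional check 2: absolute level of the first-rank similarity.
    if s1 < similarity_floor:
        return None
    # Reference-based check: per-pattern rejection threshold.
    if s1 < rejection_thresholds[first]:
        return None
    return first


sims = {"A": 100, "B": 70, "C": 80}
thresholds = {"A": 120, "B": 160, "C": 48}
print(accept_first_rank(sims, thresholds, ratio_threshold=1.2, similarity_floor=50))
```

With these inputs pattern A passes both conventional checks (ratio 1.25, similarity 100) but fails the reference-based check (100 is less than 120), so the call returns `None`.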

[0021] The following configuration is also possible. Depending on the environment in which the voice recognition device is used, the pattern of the unknown input voice may differ from the voice samples used when the standard patterns were created. To handle such cases, voice is input in various environments when the standard patterns are created, an environment-specific standard pattern is created for each environment, and an environment-specific reference similarity, corresponding to the reference similarity described above, is generated for each standard pattern in each of these environments. A plurality of environment-specific reference similarities are thus held for each standard pattern; the environment-specific reference similarity to be used is switched in accordance with the user's instruction, and the environment-specific rejection threshold generated from it is switched accordingly. If the system that includes the voice recognition device has a function for detecting its installation environment, the type of noise, and the like, that function may be used to direct the switching of the environment-specific reference similarity and the environment-specific rejection threshold. With this configuration, an environment-specific rejection threshold can be set for each of the various environments in which the unknown input voice is uttered, so the voice recognition operation is carried out in a manner suited to the environment and an accurate voice recognition result can be obtained.
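The switching of environment-specific thresholds could be organized as a simple lookup, as in the sketch below. Everything concrete in it is an assumption made for illustration: the environment labels `"quiet"` and `"car"`, the reference similarity values, the coefficient of 0.8 borrowed from the earlier worked example, and the function names.

```python
# Hypothetical table: word -> environment -> environment-specific reference similarity.
REFERENCE_SIMILARITY = {
    "telephone": {"quiet": 150.0, "car": 110.0},
}

COEFFICIENT = 0.8  # assumed to be the same coefficient as in the worked example


def rejection_threshold(word: str, environment: str) -> float:
    """Environment-specific rejection threshold for one standard pattern."""
    return COEFFICIENT * REFERENCE_SIMILARITY[word][environment]


def is_accepted(word: str, similarity: float, environment: str) -> bool:
    """True when the similarity reaches the threshold of the selected environment."""
    return similarity >= rejection_threshold(word, environment)


# The environment would come from a user instruction or from a detection function.
for env in ("quiet", "car"):
    print(env, is_accepted("telephone", 100.0, env))
```

The same similarity of 100 is rejected under the `"quiet"` entry and accepted under the `"car"` entry, which is the effect the environment switch is meant to achieve.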

[0022] The above description concerns the unspecified-speaker case. In the specified-speaker case, for example, an evaluation mode for registered words may be provided separately from voice registration, and the reference similarity may be generated from the similarities obtained for the recognition results in that evaluation mode. A function for recalculating or updating the generated reference similarity during actual use may also be provided.
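The text does not say which calculation produces or updates the reference similarity. Purely as an assumed example, the sketch below takes the mean of the evaluation-mode similarities as the initial value and blends in new similarities with a weighted moving average during use; both choices, and the names `initial_reference` and `updated_reference`, are hypothetical.

```python
from statistics import mean


def initial_reference(evaluation_similarities):
    """Reference similarity from evaluation-mode results (assumed: the mean)."""
    return mean(evaluation_similarities)


def updated_reference(current, new_similarity, weight=0.1):
    """Blend a similarity observed during use into the reference similarity.

    Assumption: a weighted moving average; `weight` is a hypothetical
    parameter controlling how quickly the reference similarity adapts.
    """
    return (1 - weight) * current + weight * new_similarity


reference = initial_reference([140, 150, 160])
reference = updated_reference(reference, 100, weight=0.5)
print(reference)
```

A small weight keeps the reference similarity stable against a single unusual utterance, while a larger weight lets it track gradual changes in the speaker's voice.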

[0023] Although the above description uses a word-unit pattern matching method, the voice recognition device and voice recognition method of this embodiment are limited neither to word-unit methods nor to pattern matching methods, and can also be applied to recognition methods that use statistical techniques.

[0024] As described above, a reference similarity is added to each standard pattern, and the recognition result for the unknown input voice is determined not merely from the similarity to each standard pattern but also on the basis of the reference similarity. This suppresses incorrect answers that the conventional rejection function could not reject, and more reliably prevents erroneous recognition caused by non-voice noise, private conversation, and the like. In particular, because each word has its own rejection criterion, a candidate is not uniformly adopted as the recognition result merely because its similarity is high; consequently, the bias in rejected words caused by variation in the voice samples used to create the standard patterns is reduced, and the recognition result is stabilized. Moreover, when the voice samples used to create the standard pattern of a word are good, relatively few words are rejected, and in the opposite case more words are rejected, so the user can automatically be made aware of whether the standard patterns were created well.

[0025]

[Effects of the Invention] As described in detail above, according to the present invention, a reference similarity is added to each standard pattern, and whether a standard pattern selected as a candidate is adopted as the voice recognition result is determined on the basis of that reference similarity. Voice recognition is therefore performed not merely on the magnitude of the similarity between the unknown input voice and the standard pattern but also with the reference similarity taken into account, so the correct answer rate of the voice recognition result can be improved and a stable recognition result can be obtained for the unknown input voice.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a voice recognition device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of a conventional voice recognition device.

[Explanation of symbols]

2 ... preprocessing means, 3 ... feature extraction means, 4 ... standard pattern storage means, 5 ... recognition processing means, 7 ... result output means, 11 ... standard pattern generation means, 12 ... reference similarity generation means, 13 ... rejection threshold generation means, 14 ... rejection determination means.

Claims

(57) [Claims]

1. A voice recognition device comprising standard pattern storage means for storing standard patterns, and recognition means for recognizing an unknown input voice by comparing and collating the unknown input voice with the standard patterns, the voice recognition device further comprising: reference similarity storage means for storing a reference similarity provided for each of the standard patterns selected for the unknown input voice; and rejection threshold storage means for storing a rejection threshold that is generated based on the reference similarity stored in the reference similarity storage means and that is used to reject adoption of a standard pattern selected for the unknown input voice; wherein the recognition means, when determining based on the reference similarity whether the standard pattern selected in the comparison and collation is to be adopted as the recognition result, further calculates the similarity between the unknown input voice and each standard pattern, obtains a first-rank standard pattern having the highest, first-rank similarity and a second-rank standard pattern having a second-rank similarity, and, for a first-rank standard pattern for which the ratio between the first-rank similarity and the second-rank similarity is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, adopts the first-rank standard pattern when the first-rank similarity of the first-rank standard pattern is equal to or greater than the rejection threshold.

2. The voice recognition device according to claim 1, further comprising: standard pattern generation means for generating the standard pattern and sending the generated standard pattern to the standard pattern storage means; and reference similarity generation means for generating the reference similarity and sending the generated reference similarity to the reference similarity storage means.

3. The voice recognition device according to claim 1 or 2, wherein the reference similarity comprises a plurality of environment-specific reference similarities for one standard pattern; the rejection threshold storage means stores an environment-specific rejection threshold generated for each of the plurality of environment-specific reference similarities; and the recognition means selects the environment-specific rejection threshold according to the environment in which the unknown input voice is produced and uses it to determine whether the selected standard pattern is to be adopted as the recognition result.

4. The voice recognition device according to any one of claims 1 to 3, wherein the reference similarity is generated by calculating the similarities between one standard pattern, generated from a plurality of standard-pattern-generation input voices for the same word, and a plurality of evaluation patterns, generated from a plurality of reference-similarity-generation input voices that are input voices for the same word uttered by persons different from those of the standard-pattern-generation input voices, and by performing statistical processing based on the plurality of calculated similarities.

5. The voice recognition device according to claim 1, wherein the standard pattern is a standard pattern for unspecified-speaker recognition generated from a plurality of utterances by a plurality of persons.

6. A voice recognition method for recognizing an unknown input voice by storing standard patterns and comparing and collating the unknown input voice with the standard patterns, the method comprising: storing a reference similarity provided for each of the standard patterns having the highest similarity to the unknown input voice; storing a rejection threshold that is generated based on the reference similarity and that is used to reject adoption of a standard pattern selected for the unknown input voice; and, when determining based on the reference similarity whether the standard pattern selected in the comparison and collation is to be adopted as the recognition result, calculating the similarity between the unknown input voice and each standard pattern, obtaining a first-rank standard pattern having the highest, first-rank similarity and a second-rank standard pattern having a second-rank similarity, and, for a first-rank standard pattern for which the ratio between the first-rank similarity and the second-rank similarity is equal to or greater than a first threshold and the first-rank similarity is equal to or greater than a second threshold, adopting the first-rank standard pattern when the first-rank similarity of the first-rank standard pattern is equal to or greater than the rejection threshold.