JP4699016B2 - Voice recognition device

Info

Publication number: JP4699016B2
Application number: JP2004360162A
Authority: JP
Inventors: 純石井; 知弘岩▲さき▼; 洋平岡登
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2004-12-13
Filing date: 2004-12-13
Publication date: 2011-06-08
Anticipated expiration: 2024-12-13
Also published as: JP2006171111A

Description

The present invention relates to a speech recognition device that registers speech uttered by a person and outputs a recognition result when speech highly similar to the registered speech is input.

A speech recognition device is a system in which a machine automatically recognizes the human voice, and it is highly practical for applications such as operating machines by voice. Conventional speech recognition devices are described in Non-Patent Document 1 and Non-Patent Document 2. In such conventional devices, if the registered voice is inappropriate, for example because it has few syllables, has few vowels, or contains noise, voices other than the registered voice may be misrecognized. In addition, if the recognition target word is uttered under conditions different from those under which the voice was registered, it may not be recognized correctly.

Techniques related to the present application are disclosed in Patent Documents 1 to 5. In Patent Document 1, the user indicates whether each syllable identification result obtained during voice input is correct, the appearance frequency and error count of each syllable are obtained from the identification results and these indications, and the syllables to be registered or re-registered are determined from those appearance frequencies and error counts. This is described as making it possible to efficiently find the syllables that need re-registration and to re-register them with less processing.

Patent Document 2 describes providing a syllable dictionary that stores the features of a speaker's single syllables in advance. When the speaker's input voice is registered in a voice dictionary, a reading sequence created by combining single syllables from the syllable dictionary is compared with the input voice, and the input voice is registered in the voice dictionary when the two are similar to within a predetermined threshold. This is described as ensuring reliable voice registration, which has the greatest influence on recognition.

Patent Document 3 describes that, when a vocabulary item to be registered is input, it is judged unsuitable for registration if its number of syllables is equal to or less than a predetermined number, or if an already registered vocabulary item exists that has the same number of syllables and a common vowel sequence. A vocabulary dictionary with a high recognition rate is thereby created.

Patent Document 4 describes a device having a part that collects voice, a part that converts voice into a pattern, a sound reproduction unit, an instrument that plays the reproduced sound to both ears of the speaker, and a part that registers voice. By registering voice while the speaker listens to a specific sound, it performs voice registration whose recognition rate does not drop even in noise.

Patent Document 5 describes that, when multiple words are recognized continuously, voice input and echo-back are performed in units of several words. When a misrecognition occurs among the recognized words, that is, when the speaker gives a negative response to the echo-back, the echo-back is performed word by word and the speaker inputs a confirmation response for each word. This is described as allowing the speaker to input voice efficiently, keeping the recognition rate from falling as the number of words input together increases, and giving a higher recognition rate on re-input than on the first attempt.

Non-Patent Document 1: L. Rabiner and B. H. Juang, translation supervised by Sadaoki Furui, "Fundamentals of Speech Recognition" (音声認識の基礎), Vols. 1 and 2, NTT Advanced Technology, November 1995, Vol. 1 pp. 1-291, Vol. 2 pp. 1-322
Non-Patent Document 2: Sadaoki Furui, "Speech Information Processing" (音声情報処理), Morikita Publishing Co., Ltd., June 1998, pp. 79-132
Patent Document 1: JP-A-59-157699 (page 5, upper right column, line 18 to page 5, lower left column, line 7)
Patent Document 2: JP-A-63-149698 (page 2, upper right column, line 16 to page 2, lower left column, line 3; page 3, upper right column, lines 4 to 5)
Patent Document 3: JP-A-63-292197 (page 3, upper left column, line 14 to page 3, upper right column, line 6)
Patent Document 4: JP-A-2-154299 (page 2, upper left column, line 19 to page 2, upper right column, line 3; page 3, upper right column, lines 18 to 19)
Patent Document 5: JP-A-60-260095 (page 2, upper left column, lines 10 to 16; page 3, lower left column, lines 4 to 7)

Because conventional speech recognition devices are configured as described above, there is a problem that, when the registered voice is inappropriate (for example, it has few syllables, has few vowels, or contains noise), sounds other than the registered voice are misrecognized and recognition accuracy falls. There is also a problem that, when the recognition target word is uttered under conditions different from those at registration, it cannot be recognized correctly and recognition accuracy falls.

The present invention has been made to solve the problems described above. One object is to obtain a speech recognition device that improves recognition accuracy by preventing inappropriate voices from being registered. Another object is to obtain a speech recognition device that improves recognition accuracy even when the conditions at voice registration and at recognition differ.

The speech recognition device according to the present invention comprises: a voice change degree calculation unit that receives a registered voice of two or more syllables and calculates a voice change degree over the entire voice section; a voice registration determination unit that compares the calculated voice change degree with the average voice change degree of words having a predetermined number of syllables and determines whether to register the input registered voice; a registered voice change request unit that outputs a registered voice change request when the determination result is that registration is not possible; and a voice standard pattern generation unit that generates a voice standard pattern from the input registered voice when the determination result is that registration is possible.

The present invention reduces misrecognition caused by an inappropriate registered voice and thereby improves recognition accuracy.

An embodiment of the present invention is described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition device according to Embodiment 1 of the present invention. The speech recognition device includes a registration unit 100 and a matching unit 300. The registration unit 100 includes a voice change degree calculation unit 111, a voice registration determination unit 112, a registered voice change request unit 113, a voice registration switch 114, a voice standard pattern generation unit 115, and a voice standard pattern storage unit 116. The matching unit 300 includes a voice similarity calculation unit 311 and a voice collation determination unit 312.

In the registration unit 100, the voice change degree calculation unit 111 receives the registered voice 11 uttered by the user and calculates its voice change degree. The voice registration determination unit 112 determines, based on the calculated voice change degree, whether to register the input registered voice 11. The registered voice change request unit 113 outputs a registered voice change request 12 to the user when the determination result is that registration is not possible. When the determination result is that registration is possible, the voice registration determination unit 112 closes the voice registration switch 114, and the voice standard pattern generation unit 115 receives the registered voice 11, generates a voice standard pattern, and stores it in the voice standard pattern storage unit 116.

In the matching unit 300, the voice similarity calculation unit 311 receives the recognition target voice 31 uttered by the user and the voice standard pattern stored in the voice standard pattern storage unit 116, and calculates a voice similarity. The voice collation determination unit 312 outputs the recognition result 33 when the calculated voice similarity is equal to or greater than the preset threshold 32.

In Embodiment 1, the voice change degree calculation unit 111, the voice registration determination unit 112, the registered voice change request unit 113, the voice standard pattern generation unit 115, the voice similarity calculation unit 311, and the voice collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation is described.
FIG. 2 is a flowchart showing the processing flow of the speech recognition device according to Embodiment 1 of the present invention. At voice registration, the voice change degree calculation unit 111 first receives the registered voice 11 uttered by the user in step ST101 and then calculates the voice change degree in step ST102.

Here, the registered voice 11 is the voice of a word or sentence that the user utters in order to register it. At recognition time, the recognition result 33 is output when the user makes an utterance highly similar to the registered voice. The voice change degree indicates how much the voice changes within one utterance, and it takes a larger value the more different syllables the utterance contains. For example, the utterance "aiueo" (あいうえお) has a larger voice change degree than the utterance "a" (あ). As an extraction method, for example, the signal of the input registered voice 11 is analyzed in 10 ms frames and the voice change degree is obtained from the degree of change of the extracted spectral components. In speech recognition, the cepstrum is often used as an efficient representation of the spectral components of speech, and the voice change degree d using the cepstrum is obtained by the following equation (1).

[Equation (1) is not reproduced in this text.]

Here, c_t is the cepstrum of the frame at time t, ĉ is the cepstral mean over the voice section, N is the number of frames in the voice section, and T denotes transposition. The cepstrum is described in Section 2-2 of Non-Patent Document 2.
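Equation (1) itself is not reproduced in this text. Judging only from the symbols defined for it (c_t, the cepstral mean ĉ, the frame count N, and a transpose), one plausible form is the mean squared deviation of the frames from the segment mean, d = (1/N) Σ_t (c_t − ĉ)^T (c_t − ĉ). The following Python sketch computes that assumed form; the function name and the toy frame values are illustrative and do not come from the patent.

```python
from typing import Sequence


def voice_change_degree(frames: Sequence[Sequence[float]]) -> float:
    """Assumed form of d: mean squared distance of each cepstral frame
    from the cepstral mean of the voice section."""
    n = len(frames)
    if n == 0:
        raise ValueError("voice section has no frames")
    dim = len(frames[0])
    # c_hat: cepstral mean over the N frames of the voice section
    c_hat = [sum(f[k] for f in frames) / n for k in range(dim)]
    # sum over frames of (c_t - c_hat)^T (c_t - c_hat), divided by N
    total = sum(sum((f[k] - c_hat[k]) ** 2 for k in range(dim)) for f in frames)
    return total / n


# Toy cepstral frames (made-up numbers): one unchanging sound vs. a varied one.
steady = [[1.0, 0.5]] * 4
varied = [[1.0, 0.5], [-1.0, 0.2], [0.3, -0.8], [0.9, 0.9]]
d_steady = voice_change_degree(steady)
d_varied = voice_change_degree(varied)
```

Under this assumed form, an utterance whose frames never change yields d = 0, and d grows as the frames spread out around their mean, which matches the description that more distinct syllables give a larger voice change degree.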

In step ST103, the voice registration determination unit 112 receives the voice change degree d calculated by the voice change degree calculation unit 111 and determines whether to register the input registered voice 11 by checking whether d is equal to or greater than a predetermined threshold THd. The registered voice 11 is judged registerable when d is equal to or greater than THd, and not registerable when d is smaller than THd. If THd is set, for example, to the average voice change degree d of words of three or more syllables, a word of three or more syllables can be registered.
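The decision in step ST103 can be sketched as follows. This is a minimal illustration, assuming the threshold THd is taken as the plain arithmetic mean of voice change degrees measured on reference words of three or more syllables; the reference values are invented for the example.

```python
from statistics import mean


def registration_threshold(reference_degrees):
    """THd as the average voice change degree of the reference words."""
    return mean(reference_degrees)


def can_register(d, th_d):
    """Registerable when d >= THd, not registerable when d < THd."""
    return d >= th_d


# Hypothetical voice change degrees of words with three or more syllables.
reference = [4.0, 5.0, 6.0]
th_d = registration_threshold(reference)
```

With these numbers THd is 5.0, so an utterance with d = 5.0 is accepted and one with d = 4.9 is rejected.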

If the determination result of the voice registration determination unit 112 in step ST103 is that registration is not possible, the registered voice change request unit 113 outputs the registered voice change request 12 to the user in step ST104. A small voice change degree d means that the registered voice 11 has been judged to have few syllables. A voice standard pattern generated from such a registered voice 11 carries little information, so using it for recognition would increase misrecognition. The registered voice change request 12 is therefore output. For example, the speech recognition device conveys the message "Please say a word with more syllables" to the user by synthesized speech, and the process returns to step ST101 to repeat the registration process.
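The registration loop of steps ST101 to ST105 can be sketched as below. The function and parameter names are hypothetical, the voice change degree is passed in as a callable, and the stand-in used in the usage lines (number of distinct characters) is only a toy substitute for a real acoustic measure.

```python
def register(utterances, change_degree, th_d, prompt):
    """Return the first utterance accepted for registration, or None.

    utterances    : successive attempts by the user (ST101)
    change_degree : callable giving the voice change degree d (ST102)
    th_d          : threshold THd used in ST103
    prompt        : callable that delivers the change request (ST104)
    """
    for speech in utterances:
        d = change_degree(speech)
        if d >= th_d:
            # ST105: the caller would now generate the standard pattern
            return speech
        prompt("Please say a word with more syllables.")
    return None


# Toy usage: distinct-character count stands in for the real measure.
messages = []
accepted = register(["a", "aiueo"], lambda s: float(len(set(s))), 3.0, messages.append)
```

Here the first attempt is rejected and triggers one change request, and the second attempt is accepted.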

If, on the other hand, the determination result in step ST103 is that registration is possible, then in step ST105 the voice registration determination unit 112 closes the voice registration switch 114, and the voice standard pattern generation unit 115 receives the registered voice 11, generates a voice standard pattern, and stores it in the voice standard pattern storage unit 116. The generated voice standard pattern efficiently represents the features of the input registered voice 11. It is, for example, a time series of the cepstrum, the dynamic features of the cepstrum, and the dynamic features of the log power, obtained by analysis in frames of 5 to 10 ms. A voice standard pattern can also be generated by training a hidden Markov model (HMM) on that time series. The hidden Markov model is described in Section 5-4 of Non-Patent Document 2, so its explanation is omitted here.
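The patent does not say how the dynamic features are computed. As a sketch under that caveat, the example below uses a simple two-sided frame difference, which is one common choice, and concatenates cepstrum, delta cepstrum, and delta log power into one vector per frame. All names and values are illustrative.

```python
def delta(series):
    """Two-sided difference per frame; edge frames reuse their nearest neighbour."""
    n = len(series)
    out = []
    for t in range(n):
        prev = series[max(t - 1, 0)]
        nxt = series[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def standard_pattern(cepstra, log_power):
    """Per-frame vector: cepstrum + delta cepstrum + delta log power."""
    d_cep = delta(cepstra)
    d_pow = delta([[p] for p in log_power])
    return [c + dc + dp for c, dc, dp in zip(cepstra, d_cep, d_pow)]


# Three toy frames with a one-dimensional cepstrum.
pattern = standard_pattern([[0.0], [2.0], [4.0]], [1.0, 1.0, 3.0])
```

The log power itself is left out of the vector and only its delta is kept, following the feature list given in the text.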

At recognition time, the voice similarity calculation unit 311 receives the recognition target voice 31 uttered by the user in step ST106, and in step ST107 it calculates the voice similarity for the recognition target voice 31 using the voice standard pattern stored in the voice standard pattern storage unit 116. Here, the recognition target voice 31 is the voice the user utters for the speech recognition device to recognize, and the voice similarity is an index of how similar the recognition target voice 31 is to the voice standard pattern registered in the device. If the voice standard pattern is a time series of the cepstrum, the dynamic features of the cepstrum, and the dynamic features of the log power, the same time series is extracted from the recognition target voice 31, and the voice similarity can be calculated by, for example, dynamic programming. Dynamic programming is described in detail in Section 5-3 of Non-Patent Document 2, so its explanation is omitted here.
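As an illustration of a dynamic-programming similarity, the sketch below uses textbook dynamic time warping (DTW) with a Euclidean local distance. The length normalisation and the mapping from distance to a similarity score are assumptions made for this example, not details taken from the patent.

```python
import math


def dtw_distance(a, b):
    """Length-normalised DTW distance between two feature sequences."""
    if not a or not b:
        raise ValueError("both sequences need at least one frame")
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def similarity(a, b):
    """Map the distance into (0, 1]; larger means more alike."""
    return 1.0 / (1.0 + dtw_distance(a, b))


template = [[0.0], [1.0], [2.0]]
slower = [[0.0], [0.0], [1.0], [2.0]]   # same shape, first frame held longer
other = [[5.0], [5.0], [5.0]]
```

Because DTW may stretch either sequence along the time axis, the slower rendition of the same pattern still aligns with zero distance, while the unrelated sequence scores lower.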

In step ST108, the voice collation determination unit 312 determines whether the voice similarity calculated by the voice similarity calculation unit 311 is equal to or greater than the predetermined threshold 32. If it is, then in step ST109 the voice collation determination unit 312 judges that the user has uttered the voice corresponding to the voice standard pattern stored in the voice standard pattern storage unit 116 and outputs the recognition result 33. If the voice similarity is smaller than the threshold 32, the voice collation determination unit 312 returns to step ST106 without outputting the recognition result 33.

The case of a single utterance has been described here. It is also possible, at registration, to generate a voice standard pattern for each of several different utterances and, at recognition, to calculate the voice similarity of the recognition target voice 31 against each of those standard patterns and output, as the recognition result, the utterance number of the standard pattern whose voice similarity is the highest and is equal to or greater than the threshold 32.
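The multi-pattern variant can be sketched as a selection over precomputed similarities. The dictionary keyed by utterance number and the threshold value are hypothetical; how each similarity is computed is outside this sketch.

```python
def recognize(similarities, threshold):
    """Return the utterance number with the highest similarity,
    provided that similarity is at least the threshold; else None."""
    if not similarities:
        return None
    best_id = max(similarities, key=similarities.get)
    return best_id if similarities[best_id] >= threshold else None


# Hypothetical similarities of one input against three registered patterns.
scores = {1: 0.4, 2: 0.9, 3: 0.7}
result = recognize(scores, threshold=0.8)
```

Returning None corresponds to the case in which no recognition result is output and the device waits for the next utterance.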

As described above, according to Embodiment 1, the voice change degree calculation unit 111 calculates the voice change degree of the registered voice 11, and the voice registration determination unit 112 registers the input registered voice 11 only when the calculated voice change degree is equal to or greater than a predetermined threshold. This prevents voices with few syllables from being registered. Misrecognition caused by an inappropriate registered voice 11, that is, misrecognition arising from registration of a voice with few syllables, is reduced, and recognition accuracy is improved.

Embodiment 2.
FIG. 3 is a block diagram showing the configuration of a speech recognition device according to Embodiment 2 of the present invention. The speech recognition device includes a registration unit 100 and a matching unit 300. The registration unit 100 includes a syllable number extraction unit 121, a voice registration determination unit 122, a registered voice change request unit 113, a voice registration switch 114, a voice standard pattern generation unit 115, and a voice standard pattern storage unit 116. The matching unit 300 includes a voice similarity calculation unit 311 and a voice collation determination unit 312.

In the registration unit 100, the syllable number extraction unit 121 receives the registered voice 11 uttered by the user and extracts its number of syllables. The voice registration determination unit 122 determines, based on the extracted number of syllables, whether to register the input registered voice 11. The registered voice change request unit 113 outputs the registered voice change request 12 to the user when the determination result is that registration is not possible. When the determination result is that registration is possible, the voice registration determination unit 122 closes the voice registration switch 114, and the voice standard pattern generation unit 115 receives the registered voice 11, generates a voice standard pattern, and stores it in the voice standard pattern storage unit 116.

In the matching unit 300, the functions of the voice similarity calculation unit 311 and the voice collation determination unit 312 are the same as those shown in FIG. 1 of Embodiment 1.

In Embodiment 2, the syllable number extraction unit 121, the voice registration determination unit 122, the registered voice change request unit 113, the voice standard pattern generation unit 115, the voice similarity calculation unit 311, and the voice collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation is described.
FIG. 4 is a flowchart showing the processing flow of the speech recognition device according to Embodiment 2 of the present invention. At voice registration, the syllable number extraction unit 121 first receives the registered voice 11 uttered by the user in step ST201 and then extracts the number of syllables in step ST202. Here, a syllable is the unit of speech that corresponds to one hiragana character in Japanese.

The processing of the syllable number extraction unit 121 is described for the case in which syllable-unit speech recognition is used. To perform syllable-unit speech recognition, standard patterns are prepared in advance for each syllable. For example, hidden Markov models (HMMs) are used as the standard patterns, and their model parameters are trained beforehand on a large amount of speech.

FIG. 5 illustrates syllable-unit speech recognition. Let Λa denote the hidden Markov model of the syllable "a" (あ), Λi that of the syllable "i" (い), and so on. As shown in FIG. 5, syllable-unit recognition computes similarity for the registered voice 11 along a fully connected network of the syllable-unit hidden Markov models and outputs the syllable sequence with the highest similarity as the syllable-unit recognition result. The number of syllables is then obtained by counting the syllables in that result. For example, if the syllable-unit recognition result is "ohayou" (おはよう), the number of syllables is 4. In FIG. 5, black circles denote nodes and arrows denote transition directions.

In step ST203, the voice registration determination unit 122 determines whether to register the input registered voice 11 by checking whether the number of syllables extracted by the syllable number extraction unit 121 is equal to or greater than a predetermined number. The registered voice 11 is judged registerable when the extracted number of syllables is equal to or greater than the predetermined number, and not registerable when it is smaller. For example, the predetermined number is set to 3, so that voices of three or more syllables are judged registerable.
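Steps ST202 and ST203 can be sketched on the text form of a syllable recognition result. The counting rule follows the description that one hiragana character corresponds to one syllable; treating small kana such as ゃ, ゅ, ょ as part of the preceding syllable is an added assumption of this sketch, and the syllable recogniser itself is not modelled.

```python
# Small kana assumed to attach to the preceding character (sketch assumption).
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def count_syllables(kana):
    """Count hiragana characters, skipping small kana."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def can_register(kana, min_syllables=3):
    """Registerable when the syllable count reaches the predetermined number."""
    return count_syllables(kana) >= min_syllables


n_ohayou = count_syllables("おはよう")
```

With the predetermined number set to 3 as in the text, "おはよう" (4 syllables) is registerable and a single "あ" is not.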

If the determination result of the voice registration determination unit 122 in step ST203 is that registration is not possible, the registered voice change request unit 113 outputs the registered voice change request 12 to the user in step ST204. A voice standard pattern generated from a registered voice 11 with few syllables carries little information, so using it for recognition would increase misrecognition. The registered voice change request 12 is therefore output. For example, the speech recognition device conveys the message "Please say a word with more syllables" to the user by synthesized speech, and the process returns to step ST201 to repeat the registration process.

If, on the other hand, the determination result in step ST203 is that registration is possible, then in step ST205 the voice registration determination unit 122 closes the voice registration switch 114, and the voice standard pattern generation unit 115 receives the registered voice 11, generates a voice standard pattern, and stores it in the voice standard pattern storage unit 116.

At recognition time, the processing of the voice similarity calculation unit 311 in steps ST206 and ST207 and the processing of the voice collation determination unit 312 in steps ST208 and ST209 are the same as the processing shown in FIG. 2 of Embodiment 1.

As described above, according to Embodiment 2, the syllable number extraction unit 121 extracts the number of syllables of the registered voice 11, and the voice registration determination unit 122 registers the registered voice 11 only when the extracted number of syllables is equal to or greater than the predetermined number. This prevents voices with few syllables from being registered. Misrecognition caused by an inappropriate registered voice 11, that is, misrecognition arising from registration of a voice with few syllables, is reduced, and recognition accuracy is improved.

Embodiment 3.
FIG. 6 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 3 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a vowel likelihood calculation unit 131, a speech registration determination unit 132, a registered speech change request unit 113, a speech registration switch 114, a speech standard pattern generation unit 115, and a speech standard pattern storage unit 116. The collation unit 300 includes a speech similarity calculation unit 311 and a speech collation determination unit 312.

In the registration unit 100, the vowel likelihood calculation unit 131 receives the registered speech 11 uttered by the user and calculates a vowel likelihood. The speech registration determination unit 132 determines, based on the vowel likelihood calculated by the vowel likelihood calculation unit 131, whether to register the input registered speech 11. The registered speech change request unit 113 outputs a registered speech change request 12 to the user when the speech registration determination unit 132 determines that registration is not possible. When it determines that registration is possible, the speech registration determination unit 132 connects the speech registration switch 114, and the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116.

In Embodiment 3, the vowel likelihood calculation unit 131, the speech registration determination unit 132, the registered speech change request unit 113, the speech standard pattern generation unit 115, the speech similarity calculation unit 311, and the speech collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 7 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 3 of the present invention. First, at the time of speech registration, the vowel likelihood calculation unit 131 receives the registered speech 11 uttered by the user in step ST301 and calculates the vowel likelihood in step ST302. Here, a vowel is defined on p. 85 of Tamotsu Koizumi, "Introduction to Phonetics" (音声学入門), Daigaku Shorin, Tokyo, September 1996, as a sustainable resonant sound that flows out through the center of the mouth without producing noise from an obstruction in the oral cavity. In Japanese, "あ" (/a/), "い" (/i/), "う" (/u/), "え" (/e/), and "お" (/o/) are the vowels.

The method of calculating the vowel likelihood is described for the case where HMMs are used. First, an HMM for each of the vowels "あ" (/a/), "い" (/i/), "う" (/u/), "え" (/e/), and "お" (/o/) is trained in advance on a large amount of speech.

FIG. 8 is a diagram illustrating speech recognition in syllable units. Let the hidden Markov model of the syllable "あ" be Λa, that of the syllable "い" be Λi, and so on. These vowel HMMs are connected as shown in FIG. 8, and the vowel likelihood is calculated for the registered speech 11 (the likelihood calculation method is described in Chapter 5 of Non-Patent Document 2). The likelihood P obtained in this way is normalized by dividing it by the number of frames N, and the resulting value P' is the vowel likelihood calculated by the vowel likelihood calculation unit 131.
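The frame normalization P' = P / N can be illustrated with a deliberately simplified sketch. It is not the method of Non-Patent Document 2: each vowel HMM is replaced by a hypothetical single one-dimensional Gaussian, transitions between vowels are assumed to be free, and log-likelihoods are used in place of raw likelihoods. Under those assumptions the best path through the vowel loop reduces to taking the best-scoring vowel for each frame. All model values below are made up for illustration.

```python
import math

# Hypothetical (mean, variance) per vowel, standing in for trained vowel HMMs.
VOWEL_MODELS = {
    "a": (1.0, 0.25),
    "i": (2.0, 0.25),
    "u": (3.0, 0.25),
    "e": (4.0, 0.25),
    "o": (5.0, 0.25),
}


def log_gaussian(x: float, mean: float, var: float) -> float:
    """Log density of a 1-D Gaussian at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def normalized_vowel_likelihood(frames: list) -> float:
    """Return P' = P / N for a sequence of 1-D feature frames.

    P is the sum over frames of the best vowel log-likelihood, and N is
    the number of frames.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    total = 0.0
    for x in frames:
        total += max(log_gaussian(x, m, v) for m, v in VOWEL_MODELS.values())
    return total / len(frames)
```

Dividing by N keeps the score comparable between short and long utterances, which is what lets a single fixed threshold be applied in the registration decision.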

In step ST303, the speech registration determination unit 132 determines whether to register the input registered speech 11 by checking whether the vowel likelihood P' calculated by the vowel likelihood calculation unit 131 is equal to or greater than a predetermined threshold. If P' is equal to or greater than the threshold, the registered speech 11 is determined to be registerable. If P' is less than the threshold, the registered speech 11 is determined not to be registerable.

If the speech registration determination unit 132 determines in step ST303 that registration is not possible, the registered speech change request unit 113 outputs a registered speech change request 12 to the user in step ST304. A low vowel likelihood P' indicates that the proportion of vowels in the registered speech 11 is low. A speech standard pattern generated from such registered speech 11 carries little information, so speech recognition using it produces many recognition errors. This is why the registered speech change request 12 is output. The apparatus conveys the request to the user by synthesized speech, for example with the message "Please say a word with many vowels," and the process returns to step ST301 to repeat the registration process.

On the other hand, if the speech registration determination unit 132 determines in step ST303 that registration is possible, then in step ST305 the speech registration determination unit 132 connects the speech registration switch 114, and the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116.
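Steps ST301 to ST305 form a retry loop, which might be sketched as below. The callables `get_utterance`, `vowel_likelihood`, and `notify` are hypothetical stand-ins for the microphone input, the vowel likelihood calculation unit 131, and the synthesized-speech output. The `max_attempts` limit is an added assumption, since the description loops back to ST301 without stating a bound.

```python
def register_with_vowel_check(get_utterance, vowel_likelihood, threshold,
                              notify, max_attempts=5):
    """Repeat registration until an utterance passes the vowel likelihood check.

    Returns the accepted utterance, or None if every attempt was rejected.
    """
    for _ in range(max_attempts):
        speech = get_utterance()            # ST301: receive registered speech
        score = vowel_likelihood(speech)    # ST302: compute P'
        if score >= threshold:              # ST303: registerable?
            return speech                   # ST305: caller generates and stores the pattern
        notify("Please say a word with many vowels.")  # ST304: change request
    return None
```

Passing the units in as callables keeps the control flow separate from the signal processing, so the same loop could be reused with a different scoring function.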

At the time of speech recognition, the processing of the speech similarity calculation unit 311 in steps ST306 and ST307 and the processing of the speech collation determination unit 312 in steps ST308 and ST309 are the same as those shown in FIG. 2 of Embodiment 1.

As described above, according to Embodiment 3, the vowel likelihood calculation unit 131 receives the registered speech 11 and calculates the vowel likelihood, and the speech registration determination unit 132 registers only registered speech 11 whose vowel likelihood is equal to or greater than a predetermined threshold. This prevents registration of speech with a low proportion of vowels and reduces recognition errors caused by inappropriate registered speech 11, that is, errors that arise when speech with a low proportion of vowels is registered. As a result, recognition accuracy is improved.

Embodiment 4.
FIG. 9 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 4 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a speech standard pattern generation unit 115, a speech standard pattern storage unit 116, an ambient sound similarity calculation unit 141, a speech registration determination unit 142, and a registered speech change request unit 113. The collation unit 300 includes a speech similarity calculation unit 311 and a speech collation determination unit 312.

In the registration unit 100, the speech standard pattern generation unit 115 receives the registered speech 11 uttered by the user, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116. The ambient sound similarity calculation unit 141 calculates an ambient sound similarity for the ambient sound 13 using the speech standard pattern stored in the speech standard pattern storage unit 116. The speech registration determination unit 142 determines whether to register the registered speech 11 by checking whether the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is greater than a predetermined threshold. The registered speech change request unit 113 outputs a registered speech change request 12 to the user when the speech registration determination unit 142 determines that registration is not possible.

In Embodiment 4, the speech standard pattern generation unit 115, the ambient sound similarity calculation unit 141, the speech registration determination unit 142, the registered speech change request unit 113, the speech similarity calculation unit 311, and the speech collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 10 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 4 of the present invention. First, at the time of speech registration, the speech standard pattern generation unit 115 receives the registered speech 11 uttered by the user in step ST401. In step ST402, the speech standard pattern generation unit 115 generates a speech standard pattern from the registered speech 11 and stores it in the speech standard pattern storage unit 116.

In step ST403, the ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound 13 using the speech standard pattern stored in the speech standard pattern storage unit 116. Here, the ambient sound is the sound at the place where the speech recognition apparatus is installed. The processing of the ambient sound similarity calculation unit 141 is performed immediately after the registered speech 11 is uttered, or at fixed time intervals, for example every hour.

In step ST404, the speech registration determination unit 142 determines whether to register the input registered speech 11 by checking whether the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is greater than a predetermined threshold. If the ambient sound similarity is equal to or less than the threshold, the registered speech 11 is determined to be registerable. If the ambient sound similarity is greater than the threshold, the registered speech 11 is determined not to be registerable.

If the speech registration determination unit 142 determines in step ST404 that registration is not possible, the registered speech change request unit 113 outputs a registered speech change request 12 to the user in step ST405. If the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is greater than the threshold in step ST404, the speech standard pattern generated from the registered speech 11 is highly likely to cause the ambient sound 13 to be recognized by mistake. This is why the registered speech change request 12 is output. The apparatus conveys the request to the user by synthesized speech, for example with the message "Please say another word," and the process returns to step ST401 to repeat the registration process.

On the other hand, if the speech registration determination unit 142 determines in step ST404 that registration is possible, the registration process ends.
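The registration flow of steps ST401 to ST405 might be sketched as follows. Unlike Embodiments 2 and 3, the pattern is generated first and then checked against the ambient sound. The callables and the `max_attempts` limit are hypothetical; the description does not bound the number of retries and does not say how a rejected pattern is handled before the retry, so this sketch simply returns only the accepted pattern.

```python
def register_with_ambient_check(get_utterance, make_pattern, ambient_similarity,
                                threshold, notify, max_attempts=5):
    """Repeat registration until the pattern does not resemble the ambient sound.

    Returns the accepted pattern, or None if every attempt was rejected.
    """
    for _ in range(max_attempts):
        pattern = make_pattern(get_utterance())   # ST401-ST402: build the pattern
        score = ambient_similarity(pattern)       # ST403: score ambient sound against it
        if score <= threshold:                    # ST404: registerable at or below threshold
            return pattern
        notify("Please say another word.")        # ST405: change request
    return None
```

Note the direction of the comparison: here a high score is bad, because it means the ambient sound would be mistaken for the registered word.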

At the time of speech recognition, the processing of the speech similarity calculation unit 311 in steps ST406 and ST407 and the processing of the speech collation determination unit 312 in steps ST408 and ST409 are the same as those shown in FIG. 2 of Embodiment 1.

As described above, according to Embodiment 4, the ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound using the speech standard pattern generated from the input registered speech 11, and the speech registration determination unit 142 registers the input registered speech 11 only when the calculated ambient sound similarity is equal to or less than the threshold. This prevents registration of a speech standard pattern that is prone to misrecognizing the ambient sound and reduces recognition errors caused by inappropriate registered speech 11, that is, errors that arise when speech resembling the ambient sound is registered. As a result, recognition accuracy is improved.

Embodiment 5.
FIG. 11 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 5 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a speech standard pattern generation unit 115, a speech standard pattern storage unit 116, an ambient sound similarity calculation unit 141, an ambient sound reproduction determination unit 151, an ambient sound reproduction switch 152, an ambient sound reproduction unit 153, and a registered speech change request unit 113. The collation unit 300 includes a speech similarity calculation unit 311 and a speech collation determination unit 312.

In the registration unit 100, the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116. The ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound 13 using the speech standard pattern stored in the speech standard pattern storage unit 116. The ambient sound reproduction determination unit 151 determines that the ambient sound is to be reproduced when the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is equal to or greater than a predetermined threshold, and connects the ambient sound reproduction switch 152. When the ambient sound reproduction determination unit 151 determines that reproduction is required, the ambient sound reproduction unit 153 reproduces the ambient sound 13 and outputs a reproduced sound 14. When the user who has heard the reproduced sound judges it to be noise, the registered speech change request unit 113 receives a noise confirmation 15 and outputs a registered speech change request 12 to the user.

In Embodiment 5, the speech standard pattern generation unit 115, the ambient sound similarity calculation unit 141, the ambient sound reproduction determination unit 151, the ambient sound reproduction unit 153, the registered speech change request unit 113, the speech similarity calculation unit 311, and the speech collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 12 is a flowchart showing the processing of the speech recognition apparatus according to Embodiment 5 of the present invention. First, at the time of speech registration, the speech standard pattern generation unit 115 receives the user's registered speech 11 in step ST501. In step ST502, the speech standard pattern generation unit 115 generates a speech standard pattern from the registered speech 11 and stores it in the speech standard pattern storage unit 116.

In step ST503, the ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound 13 using the speech standard pattern stored in the speech standard pattern storage unit 116.

In step ST504, the ambient sound reproduction determination unit 151 determines whether to reproduce the ambient sound for the user to hear by checking whether the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is equal to or greater than a predetermined threshold. If the ambient sound similarity is equal to or greater than the threshold, it determines that the ambient sound is to be reproduced.

If it is determined in step ST504 that the ambient sound is to be reproduced, then in step ST505 the ambient sound reproduction determination unit 151 connects the ambient sound reproduction switch 152, and the ambient sound reproduction unit 153 receives the ambient sound 13 and outputs the reproduced sound 14 so that the user hears it.

In step ST506, the user listens to the reproduced sound 14, judges it to be noise if it differs from the registered utterance, and notifies the speech recognition apparatus of this judgment as a noise confirmation 15. In step ST507, the registered speech change request unit 113 receives the noise confirmation 15 and outputs a registered speech change request 12 to the user. The apparatus conveys the request to the user by synthesized speech, for example with the message "Please say another word," and the process returns to step ST501 to repeat the registration process. Because the registered speech change request 12 is output only when the user has listened to the reproduced sound 14 and judged it to be noise, the apparatus avoids changing the speech standard pattern by mistake in the case where the user happened to utter the same speech as the registered speech 11 while the ambient sound similarity calculation unit 141 was calculating the ambient sound similarity.

If the ambient sound reproduction determination unit 151 determines in step ST504 that reproduction is unnecessary, or if the reproduced sound 14 output from the ambient sound reproduction unit 153 is judged in step ST506 not to be noise, the speech registration process ends.
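The branching in steps ST504 to ST507 can be summarized in a short sketch. The callables `play_ambient` and `user_confirms_noise` are hypothetical stand-ins for the ambient sound reproduction unit 153 and the user's noise confirmation 15, and the returned strings are illustrative labels rather than anything defined in the description.

```python
def review_registration(ambient_similarity, threshold, play_ambient, user_confirms_noise):
    """Decide whether a stored pattern must be re-registered.

    Returns "done" when registration ends, or "change_request" when the
    user confirmed that the similar-sounding ambient sound was noise.
    """
    if ambient_similarity < threshold:   # ST504: reproduction unnecessary
        return "done"
    play_ambient()                       # ST505: let the user hear the ambient sound
    if user_confirms_noise():            # ST506: user judges the reproduced sound
        return "change_request"          # ST507: ask for a different word
    return "done"                        # it was the user's own voice, keep the pattern
```

The human check exists for the last branch: a high similarity alone cannot distinguish real noise from the user happening to repeat the registered word.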

At the time of speech recognition, the processing of the speech similarity calculation unit 311 in steps ST508 and ST509 and the processing of the speech collation determination unit 312 in steps ST510 and ST511 are the same as those shown in FIG. 2 of Embodiment 1.

As described above, according to Embodiment 5, the ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound using the speech standard pattern generated from the input registered speech 11, the ambient sound reproduction determination unit 151 has the ambient sound reproduced when the calculated ambient sound similarity is equal to or greater than the threshold, and changed registered speech 11 is received from the user who has confirmed that the ambient sound is noise. This prevents registration of a speech standard pattern that is prone to misrecognizing the ambient sound and reduces recognition errors caused by inappropriate registered speech 11, that is, errors that arise when speech resembling the ambient sound is registered. As a result, recognition accuracy is improved.

Furthermore, according to Embodiment 5, the registered speech change request 12 is output only when the user has listened to the reproduced sound 14 and judged it to be noise. This avoids changing the speech standard pattern by mistake in the case where the user happened to utter the same speech as the registered speech 11 while the ambient sound similarity calculation unit 141 was calculating the ambient sound similarity.

Embodiment 6.
FIG. 13 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 6 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a speech standard pattern generation unit 115, a speech standard pattern storage unit 116, an ambient sound similarity calculation unit 141, an ambient sound reproduction determination unit 151, an ambient sound reproduction switch 152, an ambient sound reproduction unit 153, a noise standard pattern generation unit 161, and a noise standard pattern storage unit 162. The collation unit 300 includes a speech similarity calculation unit 311, a noise similarity calculation unit 361, and a speech collation determination unit 362.

In the registration unit 100, the speech standard pattern generation unit 115, the speech standard pattern storage unit 116, the ambient sound similarity calculation unit 141, the ambient sound reproduction determination unit 151, the ambient sound reproduction switch 152, and the ambient sound reproduction unit 153 have the same functions as those shown in FIG. 11 of Embodiment 5. On receiving the noise confirmation 15 from the user, the noise standard pattern generation unit 161 receives the ambient sound 13, generates a noise standard pattern, and stores it in the noise standard pattern storage unit 162.

In the collation unit 300, the speech similarity calculation unit 311 receives the recognition target speech 31 uttered by the user and the speech standard pattern stored in the speech standard pattern storage unit 116, and calculates the speech similarity. The noise similarity calculation unit 361 receives the recognition target speech 31 uttered by the user and the noise standard pattern stored in the noise standard pattern storage unit 162, and calculates the noise similarity. The speech collation determination unit 362 receives the speech similarity calculated by the speech similarity calculation unit 311 and the noise similarity calculated by the noise similarity calculation unit 361, and outputs a recognition result 33 if the speech similarity is equal to or greater than the speech similarity threshold 34 and the noise similarity is equal to or less than the noise similarity threshold 35.

In Embodiment 6, the speech standard pattern generation unit 115, the ambient sound similarity calculation unit 141, the ambient sound reproduction determination unit 151, the ambient sound reproduction unit 153, the noise standard pattern generation unit 161, the speech similarity calculation unit 311, the noise similarity calculation unit 361, and the speech collation determination unit 362 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 14 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 6 of the present invention. First, at the time of speech registration, the speech standard pattern generation unit 115 receives the registered speech 11 uttered by the user in step ST601. In step ST602, the speech standard pattern generation unit 115 generates a speech standard pattern from the registered speech 11 and stores it in the speech standard pattern storage unit 116.

In step ST603, the ambient sound similarity calculation unit 141 calculates the ambient sound similarity for the ambient sound 13 using the speech standard pattern stored in the speech standard pattern storage unit 116.

In step ST604, the ambient sound reproduction determination unit 151 determines whether to reproduce the ambient sound 13 for the user to hear by checking whether the ambient sound similarity calculated by the ambient sound similarity calculation unit 141 is equal to or greater than a predetermined threshold. If the ambient sound similarity is equal to or greater than the threshold, it determines that the ambient sound is to be reproduced.

If it is determined in step ST604 that the ambient sound is to be reproduced, then in step ST605 the ambient sound reproduction determination unit 151 connects the ambient sound reproduction switch 152, and the ambient sound reproduction unit 153 receives the ambient sound 13 and outputs the reproduced sound 14 so that the user hears it.

In step ST606, the user listens to the reproduced sound 14 and judges it to be noise if it differs from the registered utterance. In step ST607, the user notifies the speech recognition apparatus of this judgment as a noise confirmation 15, and the noise standard pattern generation unit 161, based on the noise confirmation 15, receives the ambient sound 13, generates a noise standard pattern, and stores it in the noise standard pattern storage unit 162. The noise standard pattern can be generated, for example, by using an HMM and training its parameters on the ambient sound 13. By registering noise that is easily mistaken for the speech standard pattern as a noise standard pattern and calculating the noise similarity at recognition time, the apparatus can determine whether noise has been misrecognized. In addition, because the noise standard pattern is generated only when the user has listened to the reproduced sound 14 and judged it to be noise, a noise standard pattern is never generated by mistake from the user's own voice in the case where the user happened to utter the same speech as the registered speech 11 while the ambient sound similarity calculation unit 141 was calculating the ambient sound similarity.

If the determination by the ambient sound reproduction determination unit 151 in step ST604 is that reproduction is unnecessary, or if the reproduced sound 14 output by the ambient sound reproduction unit 153 is judged in step ST606 not to be noise, the registration process ends.

At the time of speech recognition, in step ST608 the speech similarity calculation unit 311 receives the recognition target speech 31, and in step ST609 it calculates the speech similarity for the recognition target speech 31 using the speech standard pattern stored in the speech standard pattern storage unit 116.

In step ST610, the noise similarity calculation unit 361 receives the recognition target speech 31 and calculates the noise similarity for the recognition target speech 31 using the noise standard pattern stored in the noise standard pattern storage unit 162.

In step ST611, the speech collation determination unit 362 receives the speech similarity calculated by the speech similarity calculation unit 311 and the noise similarity calculated by the noise similarity calculation unit 361, and determines whether the speech similarity is equal to or greater than the speech similarity threshold 34 and the noise similarity is equal to or less than the noise similarity threshold 35. If both conditions hold in step ST611, then in step ST612 the speech collation determination unit 362 determines that the user has uttered the speech corresponding to the speech standard pattern stored in the speech standard pattern storage unit 116, and outputs the recognition result 33. If, on the other hand, the speech similarity is smaller than the speech similarity threshold 34 or the noise similarity is greater than the noise similarity threshold 35 in step ST611, the speech collation determination unit 362 does not output the recognition result 33, and the process returns to step ST608.
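The two-threshold test of step ST611 can be written as a single predicate. The following is a minimal sketch that assumes both similarities are plain floating-point scores on scales chosen by the implementer; the function and parameter names are hypothetical.

```python
def collate(
    speech_similarity: float,
    noise_similarity: float,
    speech_threshold: float,  # plays the role of speech similarity threshold 34
    noise_threshold: float,   # plays the role of noise similarity threshold 35
) -> bool:
    """Return True when a recognition result should be output."""
    return (
        speech_similarity >= speech_threshold
        and noise_similarity <= noise_threshold
    )
```

An input that matches the speech standard pattern well is still rejected when it also matches the noise standard pattern too closely, which is what suppresses misrecognition of registered noise.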

The description above covers the case of a single utterance. It is also possible, at the time of speech registration, to generate a speech standard pattern for each of a plurality of different utterances and, at the time of speech recognition, to calculate the speech similarity for the recognition target speech 31 against each of these standard patterns. Among the standard patterns whose speech similarity is equal to or greater than the speech similarity threshold 34, with the noise similarity equal to or less than the noise similarity threshold 35, the utterance number of the standard pattern with the highest speech similarity is then output as the recognition result.
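The multi-utterance variant can be sketched as follows, assuming one speech similarity per registered utterance number and a single noise similarity for the input. The dictionary layout and the name `recognize` are illustrative assumptions, and how ties between equal similarities are broken is not specified in the description.

```python
from typing import Dict, Optional


def recognize(
    speech_similarities: Dict[int, float],  # utterance number -> similarity
    noise_similarity: float,
    speech_threshold: float,
    noise_threshold: float,
) -> Optional[int]:
    """Return the best-matching utterance number, or None if nothing passes."""
    if noise_similarity > noise_threshold:
        return None  # the input resembles registered noise too closely
    candidates = {
        number: similarity
        for number, similarity in speech_similarities.items()
        if similarity >= speech_threshold
    }
    if not candidates:
        return None
    return max(candidates, key=lambda number: candidates[number])
```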

As described above, according to Embodiment 6, the noise standard pattern generation unit 161 generates and registers a noise standard pattern from ambient sound that is easily misrecognized as the speech standard pattern and has been judged to be noise. At the time of speech recognition, the noise similarity calculation unit 361 calculates the noise similarity, and the speech collation determination unit 362 performs speech collation using the calculated noise similarity. This makes it possible to determine whether noise has been misrecognized, and thus improves recognition accuracy.

In addition, according to Embodiment 6, the noise standard pattern is generated only when the user has listened to the reproduced sound 14 and judged it to be noise. Consequently, even if the user happens to utter the same utterance as the registered speech 11 while the ambient sound similarity calculation unit 141 is calculating the ambient sound similarity, the user's voice is not mistakenly used to generate a noise standard pattern, which improves recognition accuracy.

Embodiment 7.
FIG. 15 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 7 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a noise standard pattern storage unit 171, a noise similarity calculation unit 172, a speech registration determination unit 173, a registered speech change request unit 113, a speech registration switch 114, a speech standard pattern generation unit 115, and a speech standard pattern storage unit 116. The collation unit 300 includes a speech similarity calculation unit 311 and a speech collation determination unit 312.

In the registration unit 100, the noise standard pattern storage unit 171 stores a noise standard pattern prepared in advance. The noise similarity calculation unit 172 receives the registered speech 11 and calculates the noise similarity for the registered speech 11 using the noise standard pattern stored in the noise standard pattern storage unit 171. The speech registration determination unit 173 determines whether to register the registered speech 11 by determining whether the noise similarity calculated by the noise similarity calculation unit 172 is equal to or greater than a predetermined threshold. If the determination result is that registration is not permitted, the registered speech change request unit 113 outputs a registered speech change request 12 to the user. If the determination result is that registration is permitted, the speech registration determination unit 173 connects the speech registration switch 114, and the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116.

In Embodiment 7, the noise similarity calculation unit 172, the speech registration determination unit 173, the registered speech change request unit 113, the speech standard pattern generation unit 115, the speech similarity calculation unit 311, and the speech collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 16 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 7 of the present invention. At the time of speech registration, in step ST701 the noise similarity calculation unit 172 receives the registered speech 11 uttered by the user, and in step ST702 it calculates the noise similarity for the registered speech 11 using the noise standard pattern stored in the noise standard pattern storage unit 171. Here, the noise standard pattern is created so that it outputs a high similarity when noise other than speech is input. For example, it is generated in advance by training a single-state mixture-distribution HMM with noise as the training data. If the standard pattern is a single-state mixture-distribution HMM, the noise similarity is calculated by the method described in Chapter 5 of Non-Patent Document 2.
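To make the idea of scoring an utterance against a single-state mixture model concrete, the following sketch computes an average per-frame log-likelihood under a diagonal-covariance Gaussian mixture. This is an assumption-laden illustration: the method of Chapter 5 of Non-Patent Document 2 is not reproduced here, the self-transition probability of the single state is ignored, and the frame-count normalization and all names are hypothetical choices.

```python
import math
from typing import Sequence, Tuple

# One mixture component: (weight, mean vector, variance vector).
Component = Tuple[float, Sequence[float], Sequence[float]]


def log_gaussian(x: Sequence[float], mean: Sequence[float],
                 var: Sequence[float]) -> float:
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def frame_log_likelihood(x: Sequence[float],
                         mixture: Sequence[Component]) -> float:
    """Log of the weighted sum of component densities (log-sum-exp form)."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in mixture]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def noise_similarity(frames: Sequence[Sequence[float]],
                     mixture: Sequence[Component]) -> float:
    """Average per-frame log-likelihood; higher means more noise-like."""
    return sum(frame_log_likelihood(x, mixture) for x in frames) / len(frames)
```

Averaging over frames keeps scores comparable across utterances of different lengths, which matters when the score is compared against a fixed threshold.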

In step ST703, the speech registration determination unit 173 determines whether to register the registered speech 11 by determining whether the noise similarity calculated by the noise similarity calculation unit 172 is equal to or greater than a predetermined threshold. If the noise similarity is equal to or greater than the threshold, speech registration is determined to be not permitted; if the noise similarity is smaller than the threshold, speech registration is determined to be permitted.

If the determination result of the speech registration determination unit 173 in step ST703 is that registration is not permitted, then in step ST704 the registered speech change request unit 113 outputs the registered speech change request 12 to the user. When the noise similarity calculated by the noise similarity calculation unit 172 is equal to or greater than the threshold, a speech standard pattern generated from the registered speech 11 is highly likely to cause noise to be misrecognized, which is why the registered speech change request 12 is output. The request from the speech recognition apparatus is conveyed to the user, for example, as the synthesized-speech message "Please say another word," after which the process returns to step ST701 and the registration process is repeated.

If, on the other hand, the determination result of the speech registration determination unit 173 in step ST703 is that registration is permitted, then in step ST705 the speech registration determination unit 173 connects the speech registration switch 114, and the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116.
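The registration loop of steps ST701 to ST705 can be sketched as follows. The sketch assumes the candidate utterances arrive as an iterable and that a caller-supplied function returns the noise similarity of each one; the names `register_first_acceptable` and `change_requests` are hypothetical, and counting the change requests is only an illustrative way to observe step ST704.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

U = TypeVar("U")


def register_first_acceptable(
    utterances: Iterable[U],
    noise_similarity: Callable[[U], float],
    threshold: float,
) -> Tuple[Optional[U], int]:
    """Return the first utterance accepted for registration and the number
    of change requests issued before it."""
    change_requests = 0
    for utterance in utterances:                       # ST701: input speech
        if noise_similarity(utterance) >= threshold:   # ST702, ST703
            change_requests += 1                       # ST704: ask for another word
            continue
        return utterance, change_requests              # ST705: register
    return None, change_requests
```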

At the time of speech recognition, the processing of the speech similarity calculation unit 311 in steps ST706 and ST707 and the processing of the speech collation determination unit 312 in steps ST708 and ST709 are the same as those shown in FIG. 2 of Embodiment 1.

The description above covers the case of a single utterance. It is also possible, at the time of speech registration, to generate a speech standard pattern for each of a plurality of different utterances and, at the time of speech recognition, to calculate the speech similarity for the recognition target speech 31 against each of these standard patterns and output, as the recognition result, the utterance number of the standard pattern that has the highest speech similarity among those whose speech similarity is equal to or greater than the threshold 32.

As described above, according to Embodiment 7, the noise similarity calculation unit 172 calculates the noise similarity for the registered speech 11 using the noise standard pattern, and the speech registration determination unit 173 uses the calculated noise similarity to determine whether to register the registered speech 11. This prevents the registration of speech standard patterns that are prone to misrecognizing noise. Misrecognition caused by an inappropriate registered speech 11, that is, misrecognition arising from the registration of a speech standard pattern that resembles noise, is thereby reduced, and recognition accuracy is improved.

Embodiment 8.
FIG. 17 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 8 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a speech standard pattern generation unit 115, a speech standard pattern storage unit 116, an alternate-use-state speech standard pattern generation unit 181, and an alternate-use-state speech standard pattern storage unit 182. The collation unit 300 includes a speech similarity calculation unit 311, an alternate-use-state speech similarity calculation unit 381, and a speech collation determination unit 382.

In the registration unit 100, the speech standard pattern generation unit 115 receives the registered speech 11, generates a speech standard pattern, and stores it in the speech standard pattern storage unit 116. The alternate-use-state speech standard pattern generation unit 181 receives the speech standard pattern stored in the speech standard pattern storage unit 116, generates an alternate-use-state speech standard pattern, and stores it in the alternate-use-state speech standard pattern storage unit 182.

In the collation unit 300, the speech similarity calculation unit 311 receives the recognition target speech 31 uttered by the user and the speech standard pattern stored in the speech standard pattern storage unit 116, and calculates the speech similarity. The alternate-use-state speech similarity calculation unit 381 receives the recognition target speech 31 uttered by the user and the alternate-use-state speech standard pattern stored in the alternate-use-state speech standard pattern storage unit 182, and calculates the alternate-use-state speech similarity. The speech collation determination unit 382 receives the speech similarity calculated by the speech similarity calculation unit 311 and the alternate-use-state speech similarity calculated by the alternate-use-state speech similarity calculation unit 381, and outputs the recognition result 33 if either the speech similarity or the alternate-use-state speech similarity is equal to or greater than the threshold 32.

In Embodiment 8, the speech standard pattern generation unit 115, the alternate-use-state speech standard pattern generation unit 181, the speech similarity calculation unit 311, the alternate-use-state speech similarity calculation unit 381, and the speech collation determination unit 382 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 18 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 8 of the present invention. At the time of speech registration, in step ST801 the speech standard pattern generation unit 115 receives the registered speech 11 uttered by the user, and in step ST802 it generates a speech standard pattern from the registered speech 11 and stores it in the speech standard pattern storage unit 116.

In step ST803, the alternate-use-state speech standard pattern generation unit 181 receives the speech standard pattern stored in the speech standard pattern storage unit 116, generates an alternate-use-state speech standard pattern, and stores it in the alternate-use-state speech standard pattern storage unit 182. Here, the alternate use state refers to the case where the recognition target speech 31 is input in a state different from the use state in which the registered speech 11 was input.

FIG. 19 shows the external appearance of a camera-equipped mobile phone. Consider, for example, a device in which the speech recognition apparatus is applied to a camera-equipped mobile phone such as that shown in FIG. 19 and the camera shutter is released by voice. The registered speech 11 is generally input with the user facing the front of the phone and speaking toward the microphone hole. When taking a self-portrait, however, the user faces the back of the phone, so the recognition target speech 31 is input from the direction opposite the microphone hole; this is an alternate use state.

Here, the method of generating the alternate-use-state speech standard pattern will be described. The difference between the input frequency characteristic when the registered speech 11 is input and the input frequency characteristic in the alternate use state is obtained in advance, and the alternate-use-state speech standard pattern is generated from the speech standard pattern using this difference.
FIG. 20 illustrates the frequency characteristics of speech input. FIG. 20(a) shows the frequency characteristic when the registered speech 11 is input, FIG. 20(b) shows the input frequency characteristic in the alternate use state, and FIG. 20(c) shows the difference between the frequency characteristics of FIG. 20(a) and FIG. 20(b). In this example, subtracting the frequency characteristic of FIG. 20(c) from the speech standard pattern of FIG. 20(a) yields the alternate-use-state speech standard pattern of FIG. 20(b).

If the speech standard pattern consists of cepstra analyzed frame by frame, the cepstrum of the alternate-use-state speech standard pattern is generated by the following equation (2).

$$c'_t = c_t - \bar{c} \qquad (2)$$

Here, $c_t$ is the cepstrum of the speech standard pattern at frame time $t$, $\bar{c}$ is the cepstrum of the difference between the frequency characteristic with which the registered speech 11 is input and the input frequency characteristic in the alternate use state, and $c'_t$ is the cepstrum of the alternate-use-state speech standard pattern at frame time $t$.
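The frame-wise generation of the alternate-use-state pattern can be sketched as below. The sketch follows the description of FIG. 20, in which the difference characteristic is subtracted from the speech standard pattern, and assumes that the pattern is a list of cepstral vectors and that the difference cepstrum has the same dimension. The function name and data layout are hypothetical.

```python
from typing import List, Sequence


def alternate_state_pattern(
    pattern: Sequence[Sequence[float]],   # c_t for each frame time t
    difference: Sequence[float],          # cepstrum of the characteristic gap
) -> List[List[float]]:
    """Subtract the difference cepstrum from every frame of the pattern."""
    result = []
    for frame in pattern:
        if len(frame) != len(difference):
            raise ValueError("frame and difference cepstrum dimensions differ")
        result.append([c - d for c, d in zip(frame, difference)])
    return result
```

Because the same vector is subtracted from every frame, the operation models a fixed, time-invariant change in the input frequency characteristic, such as speaking from behind the microphone.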

At the time of speech recognition, in step ST804 of FIG. 18 the speech similarity calculation unit 311 receives the recognition target speech 31, and in step ST805 it calculates the speech similarity for the recognition target speech 31 using the speech standard pattern stored in the speech standard pattern storage unit 116.

In step ST806, the alternate-use-state speech similarity calculation unit 381 receives the recognition target speech 31 and calculates the alternate-use-state speech similarity for the recognition target speech 31 using the alternate-use-state speech standard pattern stored in the alternate-use-state speech standard pattern storage unit 182.

In step ST807, the speech collation determination unit 382 receives the speech similarity calculated by the speech similarity calculation unit 311 and the alternate-use-state speech similarity calculated by the alternate-use-state speech similarity calculation unit 381, and determines whether one or both of them are equal to or greater than the predetermined threshold 32. If one or both of the speech similarity and the alternate-use-state speech similarity are equal to or greater than the threshold 32 in step ST807, then in step ST808 the speech collation determination unit 382 outputs the recognition result 33. If, on the other hand, both the speech similarity and the alternate-use-state speech similarity are smaller than the threshold 32 in step ST807, the speech collation determination unit 382 does not output the recognition result 33, and the process returns to step ST804.
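The decision of step ST807 accepts the input when either score clears the single threshold. A minimal sketch, with hypothetical names and assuming both similarities are on the same scale:

```python
def collate_either(
    speech_similarity: float,
    alternate_similarity: float,
    threshold: float,  # plays the role of threshold 32
) -> bool:
    """Return True when at least one similarity reaches the threshold."""
    return speech_similarity >= threshold or alternate_similarity >= threshold
```

This is equivalent to comparing the larger of the two similarities with the threshold, so an utterance spoken from behind the phone can still be accepted through the alternate-use-state pattern.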

The description above covers the case of a single utterance. It is also possible, at the time of speech registration, to generate a speech standard pattern and an alternate-use-state speech standard pattern for each of a plurality of different utterances and, at the time of speech recognition, to calculate the speech similarity and the alternate-use-state speech similarity for the recognition target speech 31 against each of these patterns. Among the patterns whose speech similarity or alternate-use-state speech similarity is equal to or greater than the threshold 32, the utterance number of the standard pattern with the highest speech similarity or alternate-use-state speech similarity is then output as the recognition result.

As described above, according to Embodiment 8, the alternate-use-state speech standard pattern generation unit 181 generates an alternate-use-state speech standard pattern for a use state different from the one in which the registered speech 11 was input, the alternate-use-state speech similarity calculation unit 381 calculates the alternate-use-state speech similarity for the recognition target speech 31 using that pattern, and the speech collation determination unit 382 outputs the recognition result 33 taking the alternate-use-state speech similarity into account as well. This reduces misrecognition caused by the similarity dropping when the frequency characteristic differs in the alternate use state, and thus improves recognition accuracy.

Embodiment 9.
FIG. 21 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 9 of the present invention. The speech recognition apparatus includes a registration unit 100 and a collation unit 300. The registration unit 100 includes a face standard pattern generation unit 191, a face standard pattern storage unit 192, a normal-use-state speech standard pattern generation unit 193, a normal-use-state speech standard pattern storage unit 194, an alternate-use-state speech standard pattern generation unit 195, and an alternate-use-state speech standard pattern storage unit 196. The collation unit 300 includes a face similarity calculation unit 391, a face determination unit 392, a speech standard pattern selection unit 393, a speech similarity calculation unit 311, and a speech collation determination unit 312.

In the registration unit 100, the face standard pattern generation unit 191 receives the registered face image 16 of the user, generates a face standard pattern, and stores it in the face standard pattern storage unit 192. The normal-use-state speech standard pattern generation unit 193 receives the normal-use-state registered speech 17 uttered by the user, generates a normal-use-state speech standard pattern, and stores it in the normal-use-state speech standard pattern storage unit 194. The alternate-use-state speech standard pattern generation unit 195 receives the alternate-use-state registered speech 18 uttered by the user, generates an alternate-use-state speech standard pattern, and stores it in the alternate-use-state speech standard pattern storage unit 196.

In the collation unit 300, the face similarity calculation unit 391 receives the recognition target image 36 and the face standard pattern stored in the face standard pattern storage unit 192, and calculates the face similarity. The face determination unit 392 determines whether the registered user appears in the image by determining whether the face similarity calculated by the face similarity calculation unit 391 is equal to or greater than a predetermined threshold. The speech standard pattern selection unit 393 receives the determination result of the face determination unit 392 and selects, as the speech standard pattern, either the normal-use-state speech standard pattern stored in the normal-use-state speech standard pattern storage unit 194 or the alternate-use-state speech standard pattern stored in the alternate-use-state speech standard pattern storage unit 196. The speech similarity calculation unit 311 receives the recognition target speech 31 uttered by the user and the speech standard pattern selected by the speech standard pattern selection unit 393, and calculates the speech similarity. The speech collation determination unit 312 receives the speech similarity calculated by the speech similarity calculation unit 311 and outputs the recognition result 33 if the speech similarity is equal to or greater than the threshold 32.
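The selection step can be sketched as follows. The mapping from the face determination to a pattern is an assumption made for illustration, based on the camera-phone example in which the user appears in the camera image only when facing the back of the phone, away from the microphone hole; the function and parameter names are hypothetical.

```python
from typing import TypeVar

P = TypeVar("P")


def select_speech_pattern(
    face_similarity: float,
    face_threshold: float,
    normal_pattern: P,
    alternate_pattern: P,
) -> P:
    """Pick the speech standard pattern according to the face determination."""
    # Assumed mapping: registered user visible to the camera means the user
    # is speaking from behind the microphone, so use the alternate pattern.
    user_in_view = face_similarity >= face_threshold
    return alternate_pattern if user_in_view else normal_pattern
```

Unlike the either-score rule of Embodiment 8, only one speech similarity has to be calculated per input here, because the image decides in advance which pattern applies.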

In Embodiment 9, the face standard pattern generation unit 191, the normal-use-state speech standard pattern generation unit 193, the alternate-use-state speech standard pattern generation unit 195, the face similarity calculation unit 391, the face determination unit 392, the speech standard pattern selection unit 393, the speech similarity calculation unit 311, and the speech collation determination unit 312 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each unit may be created and executed by a computer.

Next, the operation will be described.
FIG. 22 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 9 of the present invention. The description here takes as an example the case where the speech recognition apparatus is installed in the camera-equipped mobile phone shown in FIG. 19. At the time of speech registration, in step ST901 the face standard pattern generation unit 191 receives the registered face image 16 of the user, and in step ST902 it generates a face standard pattern from the registered face image 16 and stores it in the face standard pattern storage unit 192. Here, the registered face image 16 is an image of the user's face captured with the camera. The face standard pattern can be generated, for example, by the method described in Japanese Patent Application Laid-Open No. 2000-99722, "Human face recognition device and human face recognition method."

In step ST903, the normal-use-state speech standard pattern generation unit 193 receives the normal-use-state registered speech 17 uttered by the user, and in step ST904 it generates a normal-use-state speech standard pattern from the normal-use-state registered speech 17 and stores it in the normal-use-state speech standard pattern storage unit 194. Here, when the speech recognition apparatus is installed in a camera-equipped mobile phone such as that shown in FIG. 19, the normal-use-state registered speech 17 is registered speech uttered while facing the side of the phone that has the microphone hole (the front). It is therefore an utterance made while the user does not appear in the camera image. At the time of speech recognition, the normal-use-state speech standard pattern is the speech standard pattern used when the user is facing the side of the phone that has the microphone hole (the front).

In step ST905, the separate use state voice standard pattern generation means 195 inputs the separate use state registered voice 18 uttered by the user, and in step ST906 it generates a separate use state voice standard pattern from the separate use state registered voice 18 and stores it in the separate use state voice standard pattern storage means 196. Here, the separate use state registered voice 18 is a registered voice for the case where the positional relationship between the user and the microphone differs from that of the normal use state registered voice 17. For example, when the speech recognition apparatus is mounted on a camera-equipped mobile phone as shown in FIG. 19, it is a registered voice uttered while directly facing the surface of the mobile phone that has no microphone hole (the back surface). The utterance is therefore made in a state where the user appears in the camera image. At the time of voice recognition, the separate use state voice standard pattern is the voice standard pattern used when the user directly faces the surface that has no microphone hole (the back surface).

At the time of voice recognition, in step ST907 the face similarity calculation means 391 inputs the recognition target image 36, and in step ST908 it calculates the face similarity for the recognition target image 36 using the face standard pattern stored in the face standard pattern storage means 192. Here, the recognition target image 36 is an image input from the camera at the time of voice recognition. The face similarity is calculated, for example, by the method described in the above-mentioned Japanese Patent Application Laid-Open No. 2000-99722.

In step ST909, the face determination means 392 determines whether or not the user appears in the camera image by determining whether the face similarity calculated by the face similarity calculation means 391 is equal to or greater than a predetermined threshold. If the face similarity is equal to or greater than the threshold, it is determined that the user appears in the camera image; if the face similarity is smaller than the threshold, it is determined that the user does not appear in the camera image.

If it is determined in step ST909 that the face similarity is smaller than the predetermined threshold and the user does not appear in the camera image, then in step ST910 the voice standard pattern selection means 393 selects, as the voice standard pattern, the normal use state voice standard pattern stored in the normal use state voice standard pattern storage means 194. On the other hand, if it is determined in step ST909 that the face similarity is equal to or greater than the predetermined threshold and the user appears in the camera image, then in step ST911 the voice standard pattern selection means 393 selects, as the voice standard pattern, the separate use state voice standard pattern stored in the separate use state voice standard pattern storage means 196.
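The decision in steps ST909 to ST911 can be sketched as follows. This is a minimal illustration only: the threshold value, the `VoicePatternStore` class and the function names are assumptions made for the example and do not appear in the specification, which states only that the threshold is predetermined.

```python
from dataclasses import dataclass

# Hypothetical value; the text only says "a predetermined threshold".
FACE_THRESHOLD = 0.6


@dataclass(frozen=True)
class VoicePatternStore:
    """Illustrative stand-in for storage means 194 and 196."""
    normal_use: str    # pattern registered facing the front (microphone) surface
    separate_use: str  # pattern registered facing the back (camera) surface


def user_in_camera(face_similarity: float, threshold: float = FACE_THRESHOLD) -> bool:
    # Step ST909: the user is judged to appear in the image
    # when the face similarity is at or above the threshold.
    return face_similarity >= threshold


def select_voice_pattern(face_similarity: float, store: VoicePatternStore,
                         threshold: float = FACE_THRESHOLD) -> str:
    # Step ST911: user visible, so the user faces the back surface.
    if user_in_camera(face_similarity, threshold):
        return store.separate_use
    # Step ST910: user not visible, so the user faces the front surface.
    return store.normal_use
```

Note that the boundary case (similarity exactly equal to the threshold) selects the separate use state pattern, matching the "equal to or greater than" wording of step ST909.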

In step ST912, the voice similarity calculation means 311 inputs the recognition target voice 31, and in step ST913 it calculates the voice similarity for the recognition target voice 31 using the voice standard pattern selected by the voice standard pattern selection means 393.

In step ST914, the voice collation determination means 312 determines whether the voice similarity calculated by the voice similarity calculation means 311 is equal to or greater than the predetermined threshold 32. If the voice similarity is equal to or greater than the threshold 32, then in step ST915 the voice collation determination means 312 judges that the user has uttered the voice corresponding to the normal use state voice standard pattern stored in the normal use state voice standard pattern storage means 194 or to the separate use state voice standard pattern stored in the separate use state voice standard pattern storage means 196, and outputs the recognition result 33. On the other hand, if the voice similarity is smaller than the threshold 32 in step ST914, the voice collation determination means 312 does not output the recognition result 33, and the processing returns to step ST907.

Although the case of a single utterance has been described here, it is also possible, at the time of voice registration, to generate a normal use state voice standard pattern and a separate use state voice standard pattern for each of a plurality of different utterances, and, at the time of voice recognition, to calculate the voice similarity of the recognition target voice 31 against the plurality of normal use state voice standard patterns or the plurality of separate use state voice standard patterns, and to output, as the recognition result, the utterance number of the standard pattern whose voice similarity is equal to or greater than the threshold 32 and is the largest.
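The multiple-utterance variant amounts to a thresholded arg-max over the registered patterns. The sketch below assumes a caller-supplied `similarity` function; `best_utterance` and `toy_similarity` are hypothetical names, and the toy similarity on plain numbers merely stands in for whatever acoustic similarity measure the apparatus actually uses.

```python
from typing import Callable, Optional, Sequence, TypeVar

P = TypeVar("P")


def best_utterance(target: P, patterns: Sequence[P],
                   similarity: Callable[[P, P], float],
                   threshold: float) -> Optional[int]:
    """Return the utterance number (index) of the most similar pattern
    whose similarity is at or above the threshold, or None if none qualifies."""
    best_number: Optional[int] = None
    best_score = float("-inf")
    for number, pattern in enumerate(patterns):
        score = similarity(target, pattern)
        if score >= threshold and score > best_score:
            best_number, best_score = number, score
    return best_number


def toy_similarity(a: float, b: float) -> float:
    # Illustrative only: closer numbers are "more similar".
    return 1.0 - abs(a - b)
```

When no pattern reaches the threshold the function returns `None`, which corresponds to the case where no recognition result is output.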

As described above, according to Embodiment 9, the face similarity calculation means 391 inputs a camera image at the time of recognition and calculates the face similarity; the face determination means 392 determines from the face similarity whether or not the user appears in the camera image; the voice standard pattern selection means 393 selects either the normal use state voice standard pattern or the separate use state voice standard pattern as the voice standard pattern according to that determination result; and the voice similarity calculation means 311 calculates the voice similarity for the recognition target voice 31 using the selected voice standard pattern. This reduces misrecognition caused by the voice similarity being lowered by differences in frequency characteristics between use states, and thereby improves recognition accuracy.

Embodiment 10.
FIG. 23 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 10 of the present invention. The speech recognition apparatus includes registration means 100 and collation means 300. The registration means 100 includes normal use state voice standard pattern generation means 193, normal use state voice standard pattern storage means 194, separate use state voice standard pattern generation means 195, and separate use state voice standard pattern storage means 196. The collation means 300 includes rotation detection means 401, voice standard pattern selection means 402, voice similarity calculation means 311, and voice collation determination means 312.

In the registration means 100, the normal use state voice standard pattern generation means 193 inputs the normal use state registered voice 17 uttered by the user, generates a normal use state voice standard pattern, and stores it in the normal use state voice standard pattern storage means 194. The separate use state voice standard pattern generation means 195 inputs the separate use state registered voice 18 uttered by the user, generates a separate use state voice standard pattern, and stores it in the separate use state voice standard pattern storage means 196.

In the collation means 300, the rotation detection means 401 inputs a signal from the acceleration sensor 37 installed in the camera-equipped mobile phone and detects whether or not the speech recognition apparatus has rotated. The voice standard pattern selection means 402 inputs the detection result from the rotation detection means 401 and selects, as the voice standard pattern, either the normal use state voice standard pattern stored in the normal use state voice standard pattern storage means 194 or the separate use state voice standard pattern stored in the separate use state voice standard pattern storage means 196. The voice similarity calculation means 311 inputs the recognition target voice 31 uttered by the user and the voice standard pattern selected by the voice standard pattern selection means 402, and calculates the voice similarity. The voice collation determination means 312 outputs the recognition result 33 when the voice similarity calculated by the voice similarity calculation means 311 is equal to or greater than the threshold 32.

In Embodiment 10, the normal use state voice standard pattern generation means 193, the separate use state voice standard pattern generation means 195, the rotation detection means 401, the voice standard pattern selection means 402, the voice similarity calculation means 311, and the voice collation determination means 312 may be configured as hardware; alternatively, a speech recognition program describing the processing of each means may be created and executed by a computer.

Next, the operation will be described.
FIG. 24 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 10 of the present invention. Here, a case where the speech recognition apparatus is mounted on the camera-equipped mobile phone shown in FIG. 19 will be described as an example. First, at the time of voice registration, in step ST1001 the normal use state voice standard pattern generation means 193 inputs the normal use state registered voice 17 uttered by the user, and in step ST1002 it generates a normal use state voice standard pattern from the normal use state registered voice 17 and stores it in the normal use state voice standard pattern storage means 194. When the speech recognition apparatus is mounted on a mobile phone as shown in FIG. 19, for example, the normal use state registered voice 17 is a registered voice uttered while directly facing the surface of the mobile phone that has the microphone hole (the front surface). Accordingly, at the time of voice recognition, the normal use state voice standard pattern is the voice standard pattern used when the user directly faces the surface that has the microphone hole (the front surface).

In step ST1003, the separate use state voice standard pattern generation means 195 inputs the separate use state registered voice 18 uttered by the user, and in step ST1004 it generates a separate use state voice standard pattern from the separate use state registered voice 18 and stores it in the separate use state voice standard pattern storage means 196. Here, the separate use state registered voice 18 is a registered voice for the case where the positional relationship between the user and the microphone differs from that of the normal use state registered voice 17. For example, when the speech recognition apparatus is mounted on a camera-equipped mobile phone as shown in FIG. 19, it is a registered voice uttered while directly facing the surface of the mobile phone that has no microphone hole (the back surface). Accordingly, at the time of voice recognition, the separate use state voice standard pattern is the voice standard pattern used when the user directly faces the surface that has no microphone hole (the back surface).

At the time of voice recognition, in step ST1005 the rotation detection means 401 inputs the amount of angle change from the acceleration sensor 37, and in step ST1006 it detects whether or not the speech recognition apparatus has rotated. The apparatus is judged to have rotated when it has turned by a certain angle or more from the initial state in which the speech recognition apparatus was activated. For example, when the speech recognition apparatus is mounted on a camera-equipped mobile phone as shown in FIG. 19 and is assumed to be activated while the user directly faces the front surface, a rotation angle of 90 degrees or more is taken to mean that the user now faces the back surface, and the apparatus is judged to have rotated. The amount of angle change can be detected with the acceleration sensor 37 by, for example, the method described in Japanese Patent No. 3076124.

If it is determined in step ST1006 that the apparatus has not rotated, then in step ST1007 the voice standard pattern selection means 402 selects, as the voice standard pattern, the normal use state voice standard pattern stored in the normal use state voice standard pattern storage means 194. On the other hand, if it is determined in step ST1006 that the apparatus has rotated, then in step ST1008 the voice standard pattern selection means 402 selects, as the voice standard pattern, the separate use state voice standard pattern stored in the separate use state voice standard pattern storage means 196.
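Steps ST1005 to ST1008 can be sketched as follows, under two assumptions that the specification does not state: that the sensor reports successive angle changes in degrees, and that the rotation from the initial state is obtained by summing them. The 90-degree threshold is the example value given above; the function names are hypothetical.

```python
from typing import Iterable

# Example value from the description: 90 degrees or more counts as "rotated".
ROTATION_THRESHOLD_DEG = 90.0


def has_rotated(angle_changes_deg: Iterable[float],
                threshold_deg: float = ROTATION_THRESHOLD_DEG) -> bool:
    # Assumption: the net rotation since activation is the sum of the
    # reported angle changes; its magnitude is compared with the threshold.
    total = sum(angle_changes_deg)
    return abs(total) >= threshold_deg


def select_pattern_by_rotation(angle_changes_deg: Iterable[float],
                               normal_use: str, separate_use: str) -> str:
    # Step ST1008 if rotated (user now faces the back surface),
    # otherwise step ST1007 (user still faces the front surface).
    if has_rotated(angle_changes_deg):
        return separate_use
    return normal_use
```

Because the sum is signed, turning the phone away and back again cancels out, so the normal use state pattern is selected again.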

In step ST1009, the voice similarity calculation means 311 inputs the recognition target voice 31, and in step ST1010 it calculates the voice similarity for the recognition target voice 31 using the voice standard pattern selected by the voice standard pattern selection means 402.

In step ST1011, the voice collation determination means 312 determines whether the voice similarity calculated by the voice similarity calculation means 311 is equal to or greater than the predetermined threshold 32. If the voice similarity is determined to be equal to or greater than the threshold 32, then in step ST1012 the voice collation determination means 312 judges that the user has uttered the voice corresponding to the normal use state voice standard pattern stored in the normal use state voice standard pattern storage means 194 or to the separate use state voice standard pattern stored in the separate use state voice standard pattern storage means 196, and outputs the recognition result 33. On the other hand, if the voice similarity is determined in step ST1011 to be smaller than the threshold 32, the voice collation determination means 312 does not output the recognition result 33, and the processing returns to step ST1005.

Although the case of a single utterance has been described here, it is also possible, at the time of voice registration, to generate a normal use state voice standard pattern and a separate use state voice standard pattern for each of a plurality of different utterances, and, at the time of voice recognition, to calculate the voice similarity of the recognition target voice 31 against the plurality of normal use state voice standard patterns or the plurality of separate use state voice standard patterns, and to output, as the recognition result, the utterance number of the standard pattern whose voice similarity is equal to or greater than the threshold 32 and is the largest.

As described above, according to Embodiment 10, the voice standard pattern selection means 402 selects either the normal use state voice standard pattern or the separate use state voice standard pattern as the voice standard pattern according to whether the rotation detection means 401 has detected rotation of the speech recognition apparatus, and the voice similarity calculation means 311 calculates the voice similarity for the recognition target voice 31 using the selected voice standard pattern. This reduces misrecognition caused by the voice similarity being lowered by differences in frequency characteristics between use states, and thereby improves recognition accuracy.

Embodiment 11.
FIG. 25 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 11 of the present invention. The speech recognition apparatus includes registration means 100 and collation means 300. The registration means 100 includes face standard pattern generation means 191, face standard pattern storage means 192, voice standard pattern generation means 115, and voice standard pattern storage means 116. The collation means 300 includes face similarity calculation means 391, face determination means 392, microphone directivity setting means 411, voice similarity calculation means 311, and voice collation determination means 312.

In the registration means 100, the face standard pattern generation means 191 inputs the user's registered face image 16, generates a face standard pattern, and stores it in the face standard pattern storage means 192. The voice standard pattern generation means 115 inputs the registered voice 11 uttered by the user, generates a voice standard pattern, and stores it in the voice standard pattern storage means 116.

In the collation means 300, the face similarity calculation means 391 inputs the recognition target image 36 and the face standard pattern stored in the face standard pattern storage means 192, and calculates the face similarity. The face determination means 392 determines whether or not the user appears in the camera image by determining whether the face similarity calculated by the face similarity calculation means 391 is equal to or greater than a predetermined threshold. The microphone directivity setting means 411 inputs the determination result from the face determination means 392, sets the directivity of the microphone, and outputs a setting signal 38 for driving the microphone so that the set directivity is obtained. The voice similarity calculation means 311 inputs the recognition target voice 31 uttered by the user through the microphone whose directivity has been set, inputs the voice standard pattern stored in the voice standard pattern storage means 116, and calculates the voice similarity. The voice collation determination means 312 outputs the recognition result 33 when the voice similarity calculated by the voice similarity calculation means 311 is equal to or greater than the threshold 32.

In Embodiment 11, the face standard pattern generation means 191, the voice standard pattern generation means 115, the face similarity calculation means 391, the face determination means 392, the microphone directivity setting means 411, the voice similarity calculation means 311, and the voice collation determination means 312 may be configured as hardware; alternatively, a speech recognition program describing the processing of each means may be created and executed by a computer.

Next, the operation will be described.
FIG. 26 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 11 of the present invention. Here, a case where the speech recognition apparatus is mounted on the camera-equipped mobile phone shown in FIG. 19 will be described as an example. First, at the time of voice registration, in step ST1101 the face standard pattern generation means 191 inputs the user's registered face image 16, and in step ST1102 it generates a face standard pattern from the registered face image 16 and stores it in the face standard pattern storage means 192. Here, the registered face image 16 is obtained by capturing the user's face with the camera.

In step ST1103, the voice standard pattern generation means 115 inputs the registered voice 11 uttered by the user. In step ST1104, the voice standard pattern generation means 115 generates a voice standard pattern from the registered voice 11 and stores it in the voice standard pattern storage means 116.

At the time of voice recognition, in step ST1105 the face similarity calculation means 391 inputs the recognition target image 36, and in step ST1106 it calculates the face similarity for the recognition target image 36 using the face standard pattern stored in the face standard pattern storage means 192.

In step ST1107, the face determination means 392 determines whether or not the user appears in the camera image by determining whether the face similarity calculated by the face similarity calculation means 391 is equal to or greater than a predetermined threshold. If the face similarity is equal to or greater than the threshold, it is determined that the user appears in the camera image; if the face similarity is smaller than the threshold, it is determined that the user does not appear in the camera image.

If it is determined in step ST1107 that the face similarity is equal to or greater than the threshold and the user appears in the camera image, then in step ST1108 the microphone directivity setting means 411 sets the microphone directivity in the same direction as the camera, that is, toward the back surface, and outputs a setting signal 38 with which the set directivity is obtained. On the other hand, if it is determined in step ST1107 that the face similarity is smaller than the threshold and the user does not appear in the camera image, then in step ST1109 the microphone directivity setting means 411 sets the microphone directivity in the direction opposite to the camera, that is, toward the front surface, and outputs a setting signal 38 with which the set directivity is obtained.

Here, microphone directivity means that the microphone is highly sensitive to sound arriving from a particular direction. Accordingly, if the recognition target voice 31 is input from the direction of high sensitivity, it is picked up with greater power than the ambient noise, and the resulting high signal-to-noise ratio yields a high recognition rate. For example, in the case of a camera-equipped mobile phone as shown in FIG. 19, when the face determination means 392 determines that the user appears in the camera image, the microphone directivity is set in the direction in which the camera faces, that is, toward the back surface. On the other hand, when the face determination means 392 determines that the user does not appear in the camera image, the microphone directivity is set in the direction in which the camera does not face, that is, toward the front surface.
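The directivity decision of steps ST1108 and ST1109 reduces to a two-way mapping from the face determination result, sketched below. The `Surface` enum and the function names are hypothetical. The `snr_db` helper is not part of the described apparatus; it only shows the standard decibel form of the signal-to-noise ratio mentioned above.

```python
import math
from enum import Enum


class Surface(Enum):
    FRONT = "front"  # surface with the microphone hole
    BACK = "back"    # surface the camera faces


def directivity_for(user_in_camera: bool) -> Surface:
    # ST1108: user visible, so point the microphone the same way as the camera.
    # ST1109: user not visible, so point it the opposite way.
    return Surface.BACK if user_in_camera else Surface.FRONT


def snr_db(signal_power: float, noise_power: float) -> float:
    # Standard definition: SNR in dB = 10 * log10(P_signal / P_noise).
    return 10.0 * math.log10(signal_power / noise_power)
```

For instance, a voice picked up with 100 times the power of the ambient noise corresponds to an SNR of about 20 dB.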

In step ST1110, the voice similarity calculation means 311 inputs the recognition target voice 31 through the microphone whose directivity has been set, and in step ST1111 it calculates the voice similarity for the recognition target voice 31 using the voice standard pattern stored in the voice standard pattern storage means 116.

In step ST1112, the voice collation determination means 312 determines whether the voice similarity calculated by the voice similarity calculation means 311 is equal to or greater than the predetermined threshold 32. If the voice similarity is determined to be equal to or greater than the threshold 32, then in step ST1113 the voice collation determination means 312 judges that the user has uttered the voice corresponding to the voice standard pattern stored in the voice standard pattern storage means 116, and outputs the recognition result 33. On the other hand, if the voice similarity is determined in step ST1112 to be smaller than the threshold 32, the voice collation determination means 312 does not output the recognition result 33, and the processing returns to step ST1105.

Although the case of a single utterance has been described here, it is also possible, at the time of voice registration, to generate a voice standard pattern for each of a plurality of different utterances, and, at the time of voice recognition, to calculate the voice similarity of the recognition target voice 31 against the plurality of voice standard patterns and to output, as the recognition result, the utterance number of the standard pattern whose voice similarity is equal to or greater than the threshold 32 and is the largest.

As described above, according to Embodiment 11, the face similarity calculation means 391 inputs the recognition target image 36 and calculates the face similarity; the face determination means 392 determines from the face similarity whether or not the user appears in the camera image; the microphone directivity setting means 411 sets the directivity of the microphone according to whether or not the user appears; and the voice similarity calculation means 311 inputs the recognition target voice 31 through the microphone whose directivity has been set. This reduces misrecognition caused by the voice similarity being lowered by the drop in signal-to-noise ratio and the differences in frequency characteristics that arise between use states, and thereby improves recognition accuracy.

Embodiment 12.
FIG. 27 is a block diagram showing the configuration of a speech recognition apparatus according to Embodiment 12 of the present invention. This speech recognition apparatus comprises registration means 100 and collation means 300, and differs from the speech recognition apparatus of Embodiment 1 shown in FIG. 1 only in that the collation means 300 includes delay processing means 421; the rest of the configuration is the same. The delay processing means 421 receives the recognition result 33 output from the voice collation determination means 312 and outputs a processing start signal 39 after a fixed time has elapsed.

In Embodiment 12, the voice change degree calculation means 111, the voice registration determination means 112, the registered voice change request means 113, the voice standard pattern generation means 115, the voice similarity calculation means 311, the voice collation determination means 312, and the delay processing means 421 may be implemented in hardware. Alternatively, a speech recognition program describing the processing of each means may be created and executed by a computer.

Next, the operation will be described.
FIG. 28 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 12 of the present invention. The description here takes as an example the case in which the speech recognition apparatus is mounted on the camera-equipped mobile phone shown in FIG. 19. The processing from step ST1201 to step ST1205 at the time of voice registration and the processing from step ST1206 to step ST1209 at the time of voice recognition are the same as the processing from step ST101 to step ST109 shown in FIG. 2 of Embodiment 1.

In step ST1210, the delay processing means 421 receives the recognition result 33 from the voice collation determination means 312 and outputs the processing start signal 39 after a fixed time has elapsed. This prevents the apparatus from moving on to the next processing before the utterance has finished. For example, in a device that releases the camera shutter by speech recognition, a user taking a picture of himself or herself can avoid being photographed with the face still in mid-utterance.
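The behavior of the delay processing means in step ST1210 can be sketched with a one-shot timer from the Python standard library. This is an illustrative sketch under assumptions, not the patented implementation: the function name `delay_processing`, the callback `on_start`, and the delay value used below are hypothetical, since the description only says that the signal is output after a fixed time.

```python
import threading
from typing import Callable


def delay_processing(recognition_result: str,
                     delay_seconds: float,
                     on_start: Callable[[str], None]) -> threading.Timer:
    """Emit the processing start signal a fixed time after a recognition result.

    on_start stands in for the processing start signal 39; it is called
    with the recognition result once delay_seconds have elapsed.
    """
    timer = threading.Timer(delay_seconds, on_start, args=(recognition_result,))
    timer.start()
    return timer  # the caller may join() or cancel() the timer


if __name__ == "__main__":
    # Hypothetical usage: release the shutter 1.5 s after the keyword is recognized.
    t = delay_processing("shutter", 1.5, lambda result: print("start:", result))
    t.join()
```

In the camera example, `on_start` would release the shutter. The delay would need to be long enough to cover the remainder of the utterance; the source does not give a concrete value.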

The description above covers the case of a single utterance. It is also possible, at the time of voice registration, to generate a voice standard pattern for each of a plurality of different utterances and, at the time of voice recognition, to calculate the voice similarity of the recognition target voice 31 against each of the standard patterns and to output, as the recognition result, the utterance number of the standard pattern with the highest voice similarity.

As described above, Embodiment 12 provides the same effects as Embodiment 1. In addition, because the delay processing means 421 receives the recognition result 33 from the voice collation determination means 312 and outputs the processing start signal 39 only after a fixed time has elapsed, it prevents the problems that would arise if the apparatus moved on to the next processing while the user was still speaking.

In Embodiment 12, the delay processing means 421 is added to the configuration of Embodiment 1, but the same effect is obtained when the delay processing means 421 is added to the configuration of any of Embodiments 2 to 11.

FIG. 1 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 5 is a diagram explaining syllable-unit speech recognition in the speech recognition apparatus according to Embodiment 2 of the present invention.
FIG. 6 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 7 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 8 is a diagram explaining syllable-unit speech recognition in the speech recognition apparatus according to Embodiment 3 of the present invention.
FIG. 9 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 10 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 4 of the present invention.
FIG. 11 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 12 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 5 of the present invention.
FIG. 13 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 6 of the present invention.
FIG. 14 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 6 of the present invention.
FIG. 15 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 7 of the present invention.
FIG. 16 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 7 of the present invention.
FIG. 17 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 8 of the present invention.
FIG. 18 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 8 of the present invention.
FIG. 19 is a diagram showing the external appearance of a camera-equipped mobile phone.
FIG. 20 is a diagram explaining the frequency characteristics of voice input.
FIG. 21 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 9 of the present invention.
FIG. 22 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 9 of the present invention.
FIG. 23 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 10 of the present invention.
FIG. 24 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 10 of the present invention.
FIG. 25 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 11 of the present invention.
FIG. 26 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 11 of the present invention.
FIG. 27 is a block diagram showing the configuration of the speech recognition apparatus according to Embodiment 12 of the present invention.
FIG. 28 is a flowchart showing the flow of processing of the speech recognition apparatus according to Embodiment 12 of the present invention.

Explanation of symbols

11 registered voice, 12 registered voice change request, 13 ambient sound, 14 playback sound, 15 noise confirmation, 16 registered face image, 17 registered voice for normal use state, 18 registered voice for separate use state, 31 recognition target voice, 32 threshold, 33 recognition result, 34 voice similarity threshold, 35 noise similarity threshold, 36 recognition target image, 37 acceleration sensor, 38 setting signal, 39 processing start signal, 100 registration means, 111 voice change degree calculation means, 112 voice registration determination means, 113 registered voice change request means, 114 voice registration switch, 115 voice standard pattern generation means, 116 voice standard pattern storage means, 121 syllable number extraction means, 122 voice registration determination means, 131 vowel likelihood calculation means, 132 voice registration determination means, 141 ambient sound similarity calculation means, 142 voice registration determination means, 151 ambient sound playback determination means, 152 ambient sound playback switch, 153 ambient sound playback means, 161 noise standard pattern generation means, 162 noise standard pattern storage means, 171 noise standard pattern storage means, 172 noise similarity calculation means, 173 voice registration determination means, 181 voice standard pattern generation means for separate use state, 182 voice standard pattern storage means for separate use state, 191 face standard pattern generation means, 192 face standard pattern storage means, 193 voice standard pattern generation means for normal use state, 194 voice standard pattern storage means for normal use state, 195 voice standard pattern generation means for separate use state, 196 voice standard pattern storage means for separate use state, 300 collation means, 311 voice similarity calculation means, 312 voice collation determination means, 361 noise similarity calculation means, 362 voice collation determination means, 381 voice similarity calculation means for separate use state, 382 voice collation determination means, 391 face similarity calculation means, 392 face determination means, 393 voice standard pattern selection means, 401 rotation detection means, 402 voice standard pattern selection means, 411 microphone directivity setting means, 421 delay processing means.

Claims

1. A speech recognition apparatus comprising:
voice change degree calculation means for receiving a registered voice of two or more syllables and calculating a voice change degree over the whole voice section;
voice registration determination means for comparing the voice change degree calculated by the voice change degree calculation means with an average value of the voice change degrees of words of a predetermined syllable, and determining whether or not to register the input registered voice;
registered voice change request means for outputting a registered voice change request when the determination result of the voice registration determination means is that registration is not possible; and
voice standard pattern generation means for generating a voice standard pattern from the input registered voice when the determination result of the voice registration determination means is that registration is possible.

2. The speech recognition apparatus according to claim 1, further comprising delay processing means for receiving a recognition result at the time of speech recognition and outputting, after a predetermined time has elapsed, an operation start signal for performing the next processing.