JP6912985B2 - Speech recognition system and computer program

Info

Publication number: JP6912985B2
Application number: JP2017176219A
Authority: JP
Inventors: 諒助川; 信範工藤
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2021-08-04
Anticipated expiration: 2037-09-13
Also published as: JP2019053143A

Description

The present invention relates to speech recognition technology for recognizing a user's spoken utterances.

One known technique for recognizing a user's utterances applies to a system equipped with an audio device that outputs the sound of audio content, such as music, from a speaker. In addition to a first speech recognition unit that recognizes the user's utterances picked up by a microphone, the system is provided with a second speech recognition unit that performs speech recognition on the sound the audio device outputs to the speaker. When the recognition result of the first speech recognition unit matches that of the second speech recognition unit, the result of the first speech recognition unit is invalidated (for example, Patent Document 1).

The first speech recognition unit also performs recognition on audio-device output that has leaked into the microphone. This first technique prevents such a recognition result from being mistaken for a recognition result of the user's utterance.

Another known technique applies to a system in which the set of commands that are candidates for voice input changes depending on the previously input command. The first speech recognition unit recognizes commands in the set that is currently the candidate for voice input, while the second speech recognition unit recognizes commands in the set that was the candidate for the previous voice input, so that the user can restate a command (for example, Patent Document 2).

Patent Document 1: Japanese Registered Utility Model No. 2602342
Patent Document 2: International Publication No. 2011/016129

In a system that accepts commands by voice input, it is desirable to be able to accept voice input of many commands at each point in time. In particular, it is desirable that a command requesting processing that is urgent for the user can be accepted by voice at any time.

In general, however, the number of commands that a speech recognition unit can practically recognize is limited.
In a system with two speech recognition units, a first and a second, having the two units recognize commands from different command sets, as in the technique of Patent Document 2 described above, allows more commands to be accepted by voice at each point in time than with a single speech recognition unit. Doing so, however, makes it impossible to apply the technique of Patent Document 1 described above to prevent commands from being falsely recognized in audio-device output that has leaked into the microphone.

Accordingly, an object of the present invention is to increase, as far as possible, the number of words for which voice input can be accepted in a speech recognition system with two speech recognition units, a first and a second, while suppressing false recognition of words such as commands in audio-device output that has leaked into the microphone.

To achieve this object, the present invention provides a speech recognition system that recognizes speech uttered in a space into which a speaker radiates sound output to it from an audio source device. The system comprises: a microphone; a first speech recognition dictionary in which a plurality of words are registered; a first speech recognition unit that receives the sound picked up by the microphone and detects, as a target candidate, a word matching the input sound from among the words registered in the first speech recognition dictionary; a second speech recognition unit; a second speech recognition dictionary used by the second speech recognition unit; a recognition unit that recognizes the word uttered by the user; and a voice input reception unit that accepts voice input of the word recognized by the recognition unit while selectively setting a first recognition mode and a second recognition mode.

In the first recognition mode, the second speech recognition dictionary used by the second speech recognition unit holds the same plurality of words as the first speech recognition dictionary. The second speech recognition unit receives the sound that the audio source device outputs to the speaker and detects, as a target candidate, a word matching the input sound from among the words registered in the second speech recognition dictionary. The recognition unit recognizes a word detected as a target candidate by the first speech recognition unit as the word uttered by the user. However, during a predetermined period after the second speech recognition unit has detected a target candidate, the recognition unit refrains from recognizing that same word as the user's utterance, even if the first speech recognition unit detects it as a target candidate.

In the second recognition mode, the second speech recognition dictionary used by the second speech recognition unit holds a plurality of words that differ at least in part from those registered in the first speech recognition dictionary. The second speech recognition unit receives the sound picked up by the microphone and detects, as a target candidate, a word matching the input sound from among the words registered in the second speech recognition dictionary. The recognition unit recognizes a word as the word uttered by the user when it is detected as a target candidate by the first speech recognition unit, and likewise when it is detected as a target candidate by the second speech recognition unit.
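The behavior of the recognition unit in the two modes can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `RecognitionUnit`, the callback names, and the 2-second suppression window are all hypothetical, and the speech recognition units themselves are assumed to exist elsewhere and simply report detected target candidates with a timestamp.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

FIRST_MODE = "first"    # engine 2 listens to the audio source output
SECOND_MODE = "second"  # engine 2 listens to the microphone in parallel


@dataclass
class RecognitionUnit:
    mode: str = FIRST_MODE
    suppress_seconds: float = 2.0  # hypothetical "predetermined period"
    # word -> time until which a matching engine-1 candidate is ignored
    blocked_until: Dict[str, float] = field(default_factory=dict)

    def on_second_engine(self, word: str, now: float) -> Optional[str]:
        """Handle a target candidate reported by the second recognition unit."""
        if self.mode == FIRST_MODE:
            # The word came from the audio source, so start the suppression
            # window for it and recognize nothing.
            self.blocked_until[word] = now + self.suppress_seconds
            return None
        # Second mode: engine 2 heard the microphone, so accept the word.
        return word

    def on_first_engine(self, word: str, now: float) -> Optional[str]:
        """Handle a target candidate reported by the first recognition unit."""
        if self.mode == FIRST_MODE and now < self.blocked_until.get(word, float("-inf")):
            return None  # same word was just heard in the audio output
        return word
```

In this sketch a word is suppressed only when the same word was detected in the audio source output within the window; other words picked up by the microphone are still recognized.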

This speech recognition system has a first recognition mode, in which the second speech recognition dictionary is used to suppress false recognition, caused by the output sound of the audio source device, of words registered in the first speech recognition dictionary, and a second recognition mode, in which the second speech recognition dictionary is used in parallel with the first speech recognition dictionary. The first recognition mode can therefore be set when such false recognition is likely to occur, and the second recognition mode can be set when it is unlikely to occur, expanding the number of words that can be recognized. As a result, the number of words for which voice input can be accepted can be increased while false recognition, caused by the output sound of the audio source device, of words registered in the first speech recognition dictionary is effectively suppressed.

More specifically, such a speech recognition system may, for example, be configured so that the voice input reception unit performs a standby process in which it sets the first recognition mode as the recognition mode, sets a standby speech recognition dictionary, which is a predetermined speech recognition dictionary, as both the first and the second speech recognition dictionary, and accepts voice input of a word recognized by the recognition unit. When a word is accepted by the standby process, the voice input reception unit starts a sequence execution process that takes the accepted word as a first-layer word.

The sequence execution process sets the second recognition mode as the recognition mode and uses the first and second speech recognition dictionaries so that, at each point in time, one of them serves as a sub speech recognition dictionary set to the standby speech recognition dictionary and the other serves as a main speech recognition dictionary. The main speech recognition dictionary is initially set to a speech recognition dictionary determined by the first-layer word and is then updated to a speech recognition dictionary determined by each word of the main speech recognition dictionary that is accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words registered in the main speech recognition dictionary and recognized by the recognition unit.

The voice input reception unit starts the standby process when the sequence execution process ends. If, during the sequence execution process, it accepts voice input of a word registered in the standby speech recognition dictionary, it performs at least one of starting the sequence execution process with that word as the first-layer word and starting other processing associated with that word.
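The main/sub dictionary sequence described above can be sketched as follows. This is an illustrative sketch only: the function `run_sequence`, the tuple-keyed `tree`, and the sample words are hypothetical, recognition is reduced to exact string matching, and a standby word is assumed to restart the sequence (the text also allows it to start other processing instead). When a word appears in both dictionaries, the sketch assumes the main dictionary is checked first.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def run_sequence(
    tree: Dict[Path, List[str]],
    standby_words: List[str],
    first_layer_word: str,
    utterances: List[str],
) -> Tuple[Path, List[Tuple[str, str]]]:
    """Walk the dictionary tree while the standby words stay active.

    tree maps the path of words accepted so far to the words of the
    main dictionary for the next step (hypothetical representation).
    """
    path: Path = (first_layer_word,)
    log: List[Tuple[str, str]] = []
    for word in utterances:
        main_dictionary = tree.get(path, [])
        if not main_dictionary:
            break  # no deeper dictionary: the sequence has ended
        if word in main_dictionary:
            # Accepted from the main dictionary: descend one layer.
            path = path + (word,)
            log.append(("main", word))
        elif word in standby_words:
            # Accepted from the sub (standby) dictionary: restart the
            # sequence with this word as the new first-layer word.
            path = (word,)
            log.append(("restart", word))
        # Any other word is not in either dictionary and is ignored.
    return path, log
```

Because the sub dictionary always holds the standby words, a user in the middle of one sequence can jump straight to another top-level command without waiting for the current sequence to finish.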

Alternatively, such a speech recognition system may be configured so that the voice input reception unit performs a standby process in which it sets the first recognition mode as the recognition mode, sets a standby speech recognition dictionary, which is a predetermined speech recognition dictionary, as both the first and the second speech recognition dictionary, and accepts voice input of a word recognized by the recognition unit. When a word is accepted by the standby process, the voice input reception unit starts a sequence execution process that takes the accepted word as a first-layer word.

In this configuration, the sequence execution process sets the second recognition mode as the recognition mode. The first and second speech recognition dictionaries are initially set to mutually different speech recognition dictionaries determined by the first-layer word and are then updated to mutually different speech recognition dictionaries determined by each word accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words recognized by the recognition unit. The voice input reception unit starts the standby process when the sequence execution process ends.

Alternatively, such a speech recognition system may be configured so that the voice input reception unit performs a standby process in which it sets the first recognition mode as the recognition mode, sets a standby speech recognition dictionary, which is a predetermined speech recognition dictionary, as both the first and the second speech recognition dictionary, and accepts voice input of a word recognized by the recognition unit. If the standby process accepts voice input of a word while sound output from the audio source device is being radiated from the speaker, the voice input reception unit starts a first sequence execution process that takes the accepted word as a first-layer word. If the standby process accepts voice input of a word while no such sound is being radiated from the speaker, the voice input reception unit starts a second sequence execution process that takes the accepted word as a first-layer word.

The first sequence execution process sets the first recognition mode as the recognition mode. The first and second speech recognition dictionaries are both initially set to the same speech recognition dictionary determined by the first-layer word and are then both updated to the same speech recognition dictionary determined by each word accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words recognized by the recognition unit.

The second sequence execution process sets the second recognition mode as the recognition mode and uses the first and second speech recognition dictionaries so that, at each point in time, one of them serves as a sub speech recognition dictionary set to the standby speech recognition dictionary and the other serves as a main speech recognition dictionary. The main speech recognition dictionary is initially set to a speech recognition dictionary determined by the first-layer word and is then updated to a speech recognition dictionary determined by each word of the main speech recognition dictionary that is accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words registered in the main speech recognition dictionary and recognized by the recognition unit.

The voice input reception unit starts the standby process when the first sequence execution process ends, and likewise when the second sequence execution process ends. If, during the second sequence execution process, it accepts voice input of a word registered in the standby speech recognition dictionary, it performs at least one of starting the sequence execution process with that word as the first-layer word and starting other processing associated with that word.

Alternatively, such a speech recognition system may be configured so that the voice input reception unit performs a standby process in which it sets the first recognition mode as the recognition mode, sets a standby speech recognition dictionary, which is a predetermined speech recognition dictionary, as both the first and the second speech recognition dictionary, and accepts voice input of a word recognized by the recognition unit. If the standby process accepts voice input of a word while sound output from the audio source device is being radiated from the speaker, the voice input reception unit starts a first sequence execution process that takes the accepted word as a first-layer word. If the standby process accepts voice input of a word while no such sound is being radiated from the speaker, the voice input reception unit starts a second sequence execution process that takes the accepted word as a first-layer word.

The first sequence execution process sets the first recognition mode as the recognition mode. The first and second speech recognition dictionaries are both initially set to the same speech recognition dictionary determined by the first-layer word and are then both updated to the same speech recognition dictionary determined by each word accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words recognized by the recognition unit.

The second sequence execution process sets the second recognition mode as the recognition mode. The first and second speech recognition dictionaries are initially set to mutually different speech recognition dictionaries determined by the first-layer word and are then updated to mutually different speech recognition dictionaries determined by each word accepted by voice input. In this way the process executes a sequence that accepts, once or multiple times, voice input of words recognized by the recognition unit.

The voice input reception unit starts the standby process when the first sequence execution process ends, and likewise when the second sequence execution process ends.

Alternatively, in the speech recognition systems described above, the voice input reception unit may be configured to set the recognition mode to the first recognition mode while sound output from the audio source device is being radiated from the speaker, and to the second recognition mode while no such sound is being radiated from the speaker.
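Selecting the mode by whether audio is playing can be sketched as below. The function names, the dictionary keys, and the choice of which dictionary goes to which engine are assumptions made for illustration; the text only fixes that both dictionaries match in the first mode and differ at least in part in the second.

```python
from typing import Dict, Iterable, Set, Union

Config = Dict[str, Union[str, Set[str]]]


def configure_engines(
    audio_playing: bool,
    main_words: Iterable[str],
    standby_words: Iterable[str],
) -> Config:
    """Return a hypothetical configuration for the two recognition engines."""
    if audio_playing:
        # First recognition mode: both dictionaries hold the same words and
        # engine 2 listens to the audio source output, not the microphone.
        return {
            "mode": "first",
            "dict1": set(main_words),
            "dict2": set(main_words),
            "engine2_input": "audio_source",
        }
    # Second recognition mode: engine 2 also listens to the microphone and
    # uses a different dictionary, so both word sets are accepted at once.
    return {
        "mode": "second",
        "dict1": set(main_words),
        "dict2": set(standby_words),
        "engine2_input": "microphone",
    }


def accepted_vocabulary(config: Config) -> Set[str]:
    """Words the user can input by voice under the given configuration."""
    if config["mode"] == "first":
        return set(config["dict1"])
    return set(config["dict1"]) | set(config["dict2"])
```

The sketch makes the trade-off explicit: the first mode protects against audio leaking into the microphone at the cost of a smaller vocabulary, while the second mode accepts the union of both dictionaries.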

The speech recognition system described above may be one used for voice input in an in-vehicle system mounted on an automobile.

As described above, according to the present invention, in a speech recognition system with two speech recognition units, a first and a second, the number of words for which voice input can be accepted can be increased as far as possible while false recognition of words such as commands in audio-device output that has leaked into the microphone is suppressed.

A block diagram showing the configuration of an information processing system according to an embodiment of the present invention.
A diagram showing recognition data according to the embodiment of the present invention.
A flowchart showing the voice input control process according to the embodiment of the present invention.
A flowchart showing the voice input setting process according to the embodiment of the present invention.
A diagram showing the speech recognition method of the speech recognition engines according to the embodiment of the present invention.
A flowchart showing the audio cancel mode recognition process according to the embodiment of the present invention.
A flowchart showing the parallel recognition mode recognition process according to the embodiment of the present invention.
A diagram showing an example of transitions of the display screen and the speech recognition dictionaries according to the embodiment of the present invention.
A diagram showing an example of transitions of the display screen and the speech recognition dictionaries according to the embodiment of the present invention.
A flowchart showing another example of the voice input setting process according to the embodiment of the present invention.
A diagram showing another configuration example of the information processing system according to the embodiment of the present invention.

An embodiment of the present invention is described below, taking as an example its application to an information processing system mounted on an automobile.
FIG. 1 shows the configuration of the information processing system according to this embodiment.
As shown in the figure, the information processing system includes a data processing unit 1, a dictionary DB 2, a microphone 3, a voice input unit 4, a speaker 5, an audio source 6 such as a radio receiver or music player, a display device 7, a plurality of cameras 8, and other peripheral devices 9 such as a GPS receiver.

The voice input unit 4 performs speech recognition on the user's utterances input from the microphone 3 and outputs the recognition result to the data processing unit 1.
The audio source 6 is a sound-source device, such as a radio receiver or music player, that operates under the control of the data processing unit 1 and outputs the sound of audio content to the speaker 5 and to the voice input unit 4. The speaker 5 radiates the sound input from the audio source 6 into the vehicle cabin.

The plurality of cameras 8 include, for example, a front camera that captures the area ahead of the automobile, a back camera that captures the area behind it, and side cameras that capture the areas to its sides.
The data processing unit 1 can perform various kinds of processing while using the voice input unit 4 for voice input of commands and the like and the display device 7 for screen display.
The voice input unit 4 includes a first speech recognition engine 41, a first speech recognition dictionary 42, a second speech recognition engine 43, a second speech recognition dictionary 44, and a recognition adjustment unit 45.
Such an information processing system may be built using a computer equipped with a CPU, memory, peripheral devices, and the like. In that case, the data processing unit 1 and the voice input unit 4 described above may be realized by the CPU executing a computer program.

Next, as shown in FIG. 2, the dictionary DB 2 stores recognition data in multiple layers, from first-layer recognition data to third-layer recognition data.
The recognition data of each layer represents a speech recognition dictionary used for speech recognition, and registers a number (No.) and a word for each of a plurality of words.
A separate set of second-layer recognition data can be provided for each word registered in the first-layer recognition data, and a separate set of third-layer recognition data can be provided for each word registered in each set of second-layer recognition data. In other words, the dictionary DB 2 has a tree structure whose nodes are the recognition data of each layer.
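The tree structure of the dictionary DB 2 can be pictured with a small Python sketch. The class `RecognitionData`, the helper `build_example`, and every sample word are hypothetical and chosen only for illustration; the source specifies only that each node registers numbered words and that a word may have its own next-layer recognition data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RecognitionData:
    """One node of the tree: a dictionary of numbered words."""
    words: Dict[int, str]  # number (No.) -> word
    # word -> recognition data of the next layer, if any exists for it
    children: Dict[str, "RecognitionData"] = field(default_factory=dict)

    def next_layer(self, word: str) -> Optional["RecognitionData"]:
        """Return the next-layer recognition data for a recognized word."""
        return self.children.get(word)


def build_example() -> RecognitionData:
    """Build a three-layer example with made-up words."""
    third = RecognitionData({1: "home", 2: "office"})
    second_nav = RecognitionData(
        {1: "destination", 2: "map"}, {"destination": third}
    )
    second_music = RecognitionData({1: "play", 2: "stop"})
    return RecognitionData(
        {1: "navigation", 2: "music"},
        {"navigation": second_nav, "music": second_music},
    )
```

A word without an entry in `children` is a leaf: recognizing it leaves no further dictionary to switch to.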

Each set of second-layer recognition data registers, together with their numbers (No.), the words that become the next recognition candidates for speech recognition in the voice input unit 4 once the corresponding word of the first-layer recognition data has been returned as a recognition result by the voice input unit 4.

Likewise, each set of third-layer recognition data registers, together with their numbers (No.), the words that become the next recognition candidates for speech recognition in the voice input unit 4 once the corresponding word of the second-layer recognition data has been returned as a recognition result by the voice input unit 4.

This concludes the description of the dictionary DB 2.
Although the above describes the case where three layers of recognition data, from first-layer to third-layer recognition data, are registered in the dictionary DB 2, the recognition data registered in the dictionary DB 2 may have any number of layers equal to or greater than two.

The data processing unit 1 has various functions such as a car navigation function and a music player function and, when activated, starts executing predetermined information processing (for example, information processing in which the car navigation function displays a guide map for car navigation on the display device 7, and information processing in which the music player function outputs music from the audio source 6).

When activated, the data processing unit 1 also starts the voice input control process shown in FIG. 3.
As shown in the figure, in the voice input control process, the voice input setting process described later is started (step 302), and then the standby process is started (step 304).
In the standby process, the first layer recognition data is set as the current recognition data, and the process waits for a word of the current recognition data to be input from the voice input unit 4 as a recognition result and accepts the input of that word.

The voice input unit 4 recognizes the word uttered by the user from among the words registered in the first layer recognition data and the words registered in the recognition data set as the current recognition data, and outputs the recognized word to the data processing unit 1 as the recognition result. The operation by which the voice input unit 4 realizes this recognition is described in detail later.

When the data processing unit 1 accepts, through the standby process, the input of a word of the first layer recognition data (which is set as the current recognition data) as a recognition result (step 306), it performs processing according to that word (step 308). It also checks whether second layer recognition data corresponding to that word exists (step 310). If it exists and the standby process is being executed (step 312), the standby process is terminated (step 320), and a sequence execution process that takes the recognized word of the first layer recognition data as the first layer word is started (step 314).

On the other hand, if no second layer recognition data corresponding to the recognized word of the first layer recognition data exists (step 310), the process returns to step 306 with the standby process still running and waits for the standby process to accept the next recognition result.

In the sequence execution process started in step 314, the second layer recognition data corresponding to the first layer word is first set as the current recognition data. The process then executes a sequence that accepts, a predetermined number of times, the input of a word of the current recognition data as a recognition result. Each time a recognition result is accepted, the process performs processing according to the accepted word and updates the current recognition data, that is, sets as the current recognition data the recognition data one layer below that corresponds to the accepted word.

Once the sequence execution process has been started (step 314), the data processing unit 1 monitors for the input of a word of the first layer recognition data from the voice input unit 4 as a recognition result (step 316) and for the end of the sequence execution process (step 318).

If a word of the first layer recognition data is input as a recognition result while the sequence execution process is running (step 316), the sequence execution process is terminated (step 322), and the process returns to step 308 and operates in the same way as when the input of that word is accepted in the standby process, in which the first layer recognition data is the current recognition data.

That is, processing according to the word of the first layer recognition data accepted during the sequence execution process is performed (step 308), and if second layer recognition data corresponding to that word exists (step 310), a sequence execution process that takes that word as the first layer word is started (step 314).
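The control flow of steps 304 to 322 can be sketched roughly as the small state machine below. This is a simplified illustration under several assumptions: the class and method names are hypothetical, the dictionary is a nested `dict` mapping each word to its lower-layer words (or `None`), the sequence is taken to end when a word with no lower layer is accepted (the text only says the sequence accepts input a predetermined number of times), and a first layer word without second layer data is assumed to leave the system in standby.

```python
class VoiceInputControl:
    """Tracks which recognition data is current as recognition results arrive."""

    def __init__(self, first_layer):
        self.first_layer = first_layer  # word -> lower-layer dict, or None
        self.handled = []               # words for which processing was performed
        self.start_standby()

    def start_standby(self):            # step 304
        self.mode = "standby"
        self.current = self.first_layer

    def on_recognition_result(self, word):
        if word in self.first_layer:    # step 306 (standby) or step 316 (sequence)
            # A running sequence is abandoned here (step 322) before step 308.
            self.handled.append(word)   # step 308
            lower = self.first_layer[word]
            if lower:                   # step 310
                self.mode = "sequence"  # steps 320 and 314
                self.current = lower
            else:
                self.start_standby()    # assumed behavior, see lead-in
        elif self.mode == "sequence" and word in self.current:
            self.handled.append(word)
            lower = self.current[word]
            if lower:
                self.current = lower    # move one layer down
            else:
                self.start_standby()    # sequence ends (step 318), back to step 304


# Illustrative tree; "hypothetical word" stands in for an unspecified entry.
tree = {
    "mokutekichi settei": {"chikaku no raamen-ya": {"hitotsume ni iku": None}},
    "bakku kamera": {"hypothetical word": None},
}
control = VoiceInputControl(tree)
control.on_recognition_result("mokutekichi settei")
control.on_recognition_result("bakku kamera")  # first layer word interrupts the sequence
```

The point of the sketch is that a first layer word is always accepted, whether the system is in standby or in the middle of a sequence, which is what lets the user restart from the top at any time.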

On the other hand, when the sequence execution process ends (step 318), the process returns to step 304 and restarts the standby process.
The voice input control process that the data processing unit 1 starts upon activation has been described above.
Next, the voice input setting process, which the data processing unit 1 starts in step 302 of the voice input control process, is performed as follows.
FIG. 4 shows the procedure of this voice input setting process.
As shown in the figure, in the voice input setting process the data processing unit 1 monitors for the setting of the current recognition data by the standby process or the sequence execution process described above (step 402).
When the current recognition data is set (step 402), it determines whether the current recognition data is the first layer recognition data (step 404).
If the current recognition data is the first layer recognition data (step 404), it sets the first layer recognition data, which is the current recognition data, in both the first voice recognition dictionary 42 and the second voice recognition dictionary 44 (step 406) and sets the audio cancel mode in the recognition adjustment unit 45 as the recognition mode (step 408).

It then instructs the recognition adjustment unit 45 to start voice recognition (step 410) and returns to the monitoring of step 402.
On the other hand, if the current recognition data is not the first layer recognition data (step 404), it sets the current recognition data in the first voice recognition dictionary 42 (step 412) and sets the parallel recognition mode in the recognition adjustment unit 45 as the recognition mode (step 414). Note that in step 414 the second voice recognition dictionary 44 is not updated, and as a result the second voice recognition dictionary 44 continues to hold the first layer recognition data.
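The two branches of the voice input setting process (steps 404 to 414), each followed by the start instruction of step 410, can be sketched as follows. The `VoiceInputUnit` class, its field names, and the mode strings are hypothetical stand-ins for the first voice recognition dictionary 42, the second voice recognition dictionary 44, and the recognition adjustment unit 45.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceInputUnit:
    dictionary1: Optional[dict] = None  # first voice recognition dictionary 42
    dictionary2: Optional[dict] = None  # second voice recognition dictionary 44
    mode: str = ""                      # recognition mode held by unit 45
    start_count: int = 0                # number of start instructions received


def apply_current_recognition_data(unit, current, first_layer):
    """Run once each time the current recognition data is set (step 402)."""
    if current is first_layer:          # step 404
        unit.dictionary1 = current      # step 406: both dictionaries
        unit.dictionary2 = current
        unit.mode = "audio_cancel"      # step 408
    else:
        unit.dictionary1 = current      # step 412: dictionary 44 is left as is
        unit.mode = "parallel"          # step 414
    unit.start_count += 1               # step 410: instruct start of recognition


first_layer = {1: "mokutekichi settei", 2: "bakku kamera"}
second_layer = {1: "chikaku no raamen-ya"}
unit = VoiceInputUnit()
apply_current_recognition_data(unit, first_layer, first_layer)
apply_current_recognition_data(unit, second_layer, first_layer)
```

After the second call, `unit.dictionary1` holds the second layer data while `unit.dictionary2` still holds the first layer data, which is the state the parallel recognition mode relies on.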

It then instructs the recognition adjustment unit 45 to start voice recognition (step 410) and returns to the monitoring of step 402.
The voice input setting process performed by the data processing unit 1 has been described above.
Next, the voice recognition operation performed by the first voice recognition engine 41 and the second voice recognition engine 43 is described.
The first voice recognition engine 41 and the second voice recognition engine 43 each calculate, in parallel with the input of the recognition target voice, a score for each recognition candidate word stored in the voice recognition dictionary with respect to the recognition target voice.

That is, the first voice recognition engine 41 calculates, in parallel with the input of the recognition target voice, the score of each word stored in the first voice recognition dictionary 42 with respect to the recognition target voice, and the second voice recognition engine 43 calculates, in parallel with the input of the recognition target voice, the score of each word stored in the second voice recognition dictionary 44 with respect to the recognition target voice.

The recognition target voice of the first voice recognition engine 41 is the voice input from the microphone 3. For the second voice recognition engine 43, either the voice input from the microphone 3 or the voice input from the audio source 6 can be selected as its recognition target voice.

Here, the score of each word registered in a voice recognition dictionary with respect to the recognition target voice represents a predicted value of the magnitude of the difference between the phrase represented by the recognition target voice and the word; the larger the predicted difference, the larger the score.

More specifically, the score is calculated as follows. A predetermined initial value is first set as the score. Then, each time the sound of a voice section of the recognition target voice (for example, the voice section of one phoneme) is input, the engine determines whether that sound matches the pronunciation of the corresponding part of each word registered in the voice recognition dictionary, decreasing the score by a predetermined value if it matches and increasing the score by a predetermined value if it does not. The amount by which the score is increased or decreased for each voice section is, for example, the initial value of the score multiplied by the ratio of that voice section to all the voice sections of the word.

With this kind of voice recognition, when the recognition target voice is "aiueoka" (あいうえおか), the score calculated for the word "aiueo" (あいうえお) changes as shown in FIG. 5a, and the score calculated for the word "aiuai" (あいうあい) changes as shown in FIG. 5b. The score of a word decreases step by step while sounds of the recognition target voice that match the word are being input, and increases step by step while sounds that do not match the word are being input.

That is, as shown in FIG. 5a for example, the score between the recognition target voice "aiueoka" and the word "aiueo" decreases step by step while the sounds "aiueo" of the recognition target voice are being input, and then increases when the sound "ka" of the recognition target voice is input.

Similarly, as shown in FIG. 5b, the score between the recognition target voice "aiueoka" and the word "aiuai" decreases step by step while the sounds "aiu" of the recognition target voice are being input, and increases step by step during the subsequent period in which the sounds "eoka" are being input.
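The score update described above can be illustrated with a short Python sketch. It makes several assumptions that the text does not fix: each voice section is one kana-sized unit, all sections of a word have equal weight (so the step is the initial value divided by the number of sections in the word), sections are compared position by position, input sections beyond the end of the word count as mismatches, and the initial value is 100. The function name `score_trace` is hypothetical.

```python
from fractions import Fraction


def score_trace(segments, word, initial=Fraction(100)):
    """Return the score after each input voice section (smaller is closer)."""
    step = initial / len(word)  # equal share of the initial value per section
    score = initial
    trace = []
    for i, sound in enumerate(segments):
        # A section matches only if the word has a sound at this position
        # and that sound is the same as the input sound.
        matches = i < len(word) and word[i] == sound
        score += -step if matches else step
        trace.append(score)
    return trace


speech = ["a", "i", "u", "e", "o", "ka"]             # recognition target voice
trace_aiueo = score_trace(speech, ["a", "i", "u", "e", "o"])
trace_aiuai = score_trace(speech, ["a", "i", "u", "a", "i"])
print([int(s) for s in trace_aiueo])  # [80, 60, 40, 20, 0, 20]
print([int(s) for s in trace_aiuai])  # [80, 60, 40, 60, 80, 100]
```

Under these assumptions the first trace falls for five sections and rises on "ka", and the second falls for three sections and rises afterwards, which has the same shape as the behavior described for FIGS. 5a and 5b. The actual values in the figures are not stated in the text.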

When the score calculated as described above between the recognition target voice and any word falls to or below a preset threshold value Th, the first voice recognition engine 41 and the second voice recognition engine 43 each detect a hit of that word and notify the recognition adjustment unit 45 of the number (No.) of the hit word as hit data.

That is, in the case shown in FIG. 5a for example, the score for the word "aiueo" falls to or below the threshold value Th immediately before the "e" of the recognition target voice "aiueoka" is input, so a hit of the word "aiueo" is detected at that point.

On the other hand, in the case shown in FIG. 5b, the score for the word "aiuaio" (あいうあいお) never falls to or below the threshold value Th, so no hit of this word is detected.
Next, the operation of the recognition adjustment unit 45 of the voice input unit 4 is described.
When the data processing unit 1 instructs the recognition adjustment unit 45 to start voice recognition through the voice input setting process shown in FIG. 4, the recognition adjustment unit 45 executes the audio cancel mode recognition process if the audio cancel mode is set, and executes the parallel recognition mode recognition process if the parallel recognition mode is set.

First, the audio cancel mode recognition process performed by the recognition adjustment unit 45 when the audio cancel mode is set is described.
FIG. 6 shows the procedure of this audio cancel mode recognition process.
As shown in the figure, in this audio cancel mode recognition process, a predetermined value Th1 is set in the first voice recognition engine 41 as the threshold value Th described above, and a predetermined value Th2 is set in the second voice recognition engine 43 as the threshold value Th (step 602). Here, Th1 and Th2 are chosen so that Th2 > Th1.

Next, the recognition target voice of the second voice recognition engine 43 is set to the voice input from the audio source 6 (step 604).
The recognition adjustment unit 45 then monitors for a hit data notification from the first voice recognition engine 41 (step 606), a hit data notification from the second voice recognition engine 43 (step 608), and a timeout of a timer (step 610).

If a hit data notification from the second voice recognition engine 43 occurs (step 608), the mask flag is set (step 616), and the number (No.) indicated by the hit data notified from the second voice recognition engine 43 is set as the adjustment word number (step 618). The timer is then started with a predetermined timeout period (step 620), and the process returns to the monitoring of steps 606, 608, and 610.

On the other hand, if the timer times out (step 610), the mask flag is cleared (step 612) and the adjustment word number is cleared (step 614). The process then returns to the monitoring of steps 606, 608, and 610.

If a hit data notification from the first voice recognition engine 41 occurs (step 606), the recognition adjustment unit 45 checks whether the mask flag is set (step 622). If the mask flag is not set, the word in the first voice recognition dictionary 42 that has the number (No.) indicated by the hit data from the first voice recognition engine 41 is determined to be the recognition result, and the recognition result is output to the data processing unit 1 (step 626). The audio cancel mode recognition process then ends.

On the other hand, if the mask flag is set in step 622, the recognition adjustment unit 45 checks whether the number (No.) indicated by the hit data from the first voice recognition engine 41 matches the adjustment word number (step 624). If it matches, the process returns directly to the monitoring of steps 606, 608, and 610.

If the number (No.) indicated by the hit data from the first voice recognition engine 41 does not match the adjustment word number (step 624), the word in the first voice recognition dictionary 42 that has the number (No.) indicated by the hit data is determined to be the recognition result, and the recognition result is output to the data processing unit 1 (step 626). The audio cancel mode recognition process then ends.
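The masking logic of steps 606 to 626 can be sketched as the event handler below. It is an illustration under assumptions: the class and method names are hypothetical, the current time is passed in explicitly instead of using a real timer, the timeout period of 2.0 seconds is an arbitrary example value, and the object stays usable after returning a result, whereas the process in FIG. 6 ends at that point.

```python
class AudioCancelRecognizer:
    """Suppresses engine-1 hits that engine 2 has just seen in the audio output."""

    def __init__(self, dictionary1, mask_seconds=2.0):
        self.dictionary1 = dictionary1  # No. -> word (dictionary 42)
        self.mask_seconds = mask_seconds
        self.mask_flag = False
        self.adjust_number = None
        self.deadline = None

    def _check_timeout(self, now):                # step 610
        if self.deadline is not None and now >= self.deadline:
            self.mask_flag = False                # step 612
            self.adjust_number = None             # step 614
            self.deadline = None

    def on_engine2_hit(self, number, now):        # step 608
        self._check_timeout(now)
        self.mask_flag = True                     # step 616
        self.adjust_number = number               # step 618
        self.deadline = now + self.mask_seconds   # step 620

    def on_engine1_hit(self, number, now):        # step 606
        self._check_timeout(now)
        if self.mask_flag and number == self.adjust_number:  # steps 622, 624
            return None                           # masked: keep monitoring
        return self.dictionary1[number]           # step 626


recognizer = AudioCancelRecognizer({1: "mokutekichi settei", 2: "bakku kamera"})
recognizer.on_engine2_hit(1, now=0.0)             # word 1 heard in the audio source
masked = recognizer.on_engine1_hit(1, now=0.5)    # same word via microphone: None
```

Only the word number most recently reported by the second engine is masked; a different word recognized from the microphone during the mask period is still output.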

The audio cancel mode recognition process performed by the recognition adjustment unit 45 when the audio cancel mode is set has been described above.
With this audio cancel mode recognition process, a word for which the second voice recognition engine 43 has detected a hit in the voice input from the audio source 6 is not treated as a recognition result for a certain period thereafter, even if the first voice recognition engine 41 detects a hit of that word.

Also, when the voice output from the audio source 6 leaks into the microphone 3 and the first voice recognition engine 41 detects a hit of a word for that voice, the second voice recognition engine 43 will already have detected a hit of that word. This is because the threshold value Th is set larger for the second voice recognition engine 43 than for the first voice recognition engine 41, and because the voice of the audio source 6 input to the second voice recognition engine 43 has better voice quality.

Therefore, even if the voice output from the audio source 6 leaks into the microphone 3 and the first voice recognition engine 41 detects a hit of a word for that voice, the word is prevented from being output to the data processing unit 1 as a recognition result.

Next, the parallel recognition mode recognition process performed by the recognition adjustment unit 45 when the parallel recognition mode is set is described.
FIG. 7 shows the procedure of this parallel recognition mode recognition process.
As shown in the figure, in this parallel recognition mode recognition process, the value Th1 described above is set as the threshold value Th in both the first voice recognition engine 41 and the second voice recognition engine 43 (step 702).

The recognition target voice of the second voice recognition engine 43 is also set to the voice input from the microphone 3 (step 704).
The recognition adjustment unit 45 then monitors for a hit data notification from the first voice recognition engine 41 (step 706) and a hit data notification from the second voice recognition engine 43 (step 708).

If a hit data notification from the first voice recognition engine 41 occurs (step 706), the word in the first voice recognition dictionary 42 that has the number (No.) indicated by the hit data from the first voice recognition engine 41 is determined to be the recognition result, and the recognition result is output to the data processing unit 1 (step 710). The parallel recognition mode recognition process then ends.

On the other hand, if a hit data notification from the second voice recognition engine 43 occurs (step 708), the word in the second voice recognition dictionary 44 that has the number (No.) indicated by the hit data from the second voice recognition engine 43 is determined to be the recognition result, and the recognition result is output to the data processing unit 1 (step 712). The parallel recognition mode recognition process then ends.
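As a minimal sketch of steps 706 to 712, the function below takes hit notifications in arrival order and returns the word for the first one, looking the number up in the dictionary that belongs to the notifying engine. The function name, the `(engine, number)` tuple format, and the sample dictionary contents are assumptions made for illustration.

```python
def parallel_recognition(hits, dictionary1, dictionary2):
    """Return the word for the first hit notification, or None if there is none.

    hits: iterable of (engine, number) pairs in arrival order,
          where engine is 1 (engine 41) or 2 (engine 43).
    """
    for engine, number in hits:
        if engine == 1:                 # step 706 -> step 710
            return dictionary1[number]
        if engine == 2:                 # step 708 -> step 712
            return dictionary2[number]
    return None


second_layer_words = {1: "chikaku no raamen-ya"}                 # held in dictionary 42
first_layer_words = {1: "mokutekichi settei", 2: "bakku kamera"}  # held in dictionary 44
result = parallel_recognition([(2, 2)], second_layer_words, first_layer_words)
```

Because both engines listen to the microphone in this mode, the same word number can mean different words depending on which engine reported it, so the lookup must use the matching dictionary.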

The parallel recognition mode recognition process performed by the recognition adjustment unit 45 when the parallel recognition mode is set has been described above.
With this parallel recognition mode recognition process, voice recognition can be performed both for the words registered in the first voice recognition dictionary 42 and for the words registered in the second voice recognition dictionary 44.

An operation example of the information processing apparatus described above is shown in FIG. 8.
The activated data processing unit 1 starts information processing for displaying a guide map for car navigation on the display device 7 and information processing for outputting music from the audio source 6.
When activated, the data processing unit 1 also starts the voice input setting process and sets the first layer recognition data as the current recognition data. As a result, as shown in FIGS. 8a2 and 8a3, the first layer recognition data is set in both the first voice recognition dictionary 42 and the second voice recognition dictionary 44, and the audio cancel mode is set in the recognition adjustment unit 45.

As a result, through the audio cancel mode recognition process in the recognition adjustment unit 45 of the voice input unit 4, voice recognition of the words registered in the first voice recognition dictionary 42 is performed while erroneous recognition caused by the output voice of the audio source 6 is suppressed using the second voice recognition dictionary 44.

Next, in this state, when the user utters the word "mokutekichi settei" (もくてきちせってい, "set destination") registered in the first voice recognition dictionary 42, the recognition adjustment unit 45 determines the word "mokutekichi settei" to be the recognition result, and the word is output from the voice input unit 4 to the data processing unit 1 as the recognition result.

When the data processing unit 1 accepts the word "mokutekichi settei" ("set destination") as the recognition result, it responds by setting, as the current recognition data, the second layer recognition data corresponding to the word "mokutekichi settei" of the first layer recognition data. Then, as shown in FIG. 8b1, it changes the display screen of the display device 7 to a screen that includes a list of the words registered in the second layer recognition data set as the current recognition data.

The data processing unit 1 also sets the second layer recognition data, which has been set as the current recognition data, in the first voice recognition dictionary 42, as shown in FIG. 8b2. Meanwhile, as shown in FIG. 8b3, the second voice recognition dictionary 44 continues to hold the first layer recognition data. The data processing unit 1 then sets the parallel recognition mode in the recognition adjustment unit 45.

As a result, through the parallel recognition mode recognition process in the recognition adjustment unit 45 of the voice input unit 4, voice recognition is performed both for the words of the second layer recognition data registered in the first voice recognition dictionary 42 and for the words of the first layer recognition data registered in the second voice recognition dictionary 44.

Next, in this state, when the user utters the word "chikaku no raamen-ya" (ちかくのらーめんや, "nearby ramen shop") registered in the first voice recognition dictionary 42, the recognition adjustment unit 45 determines the word "chikaku no raamen-ya" to be the recognition result, and the word is output from the voice input unit 4 to the data processing unit 1 as the recognition result.

When the data processing unit 1 accepts the word "chikaku no raamen-ya" as the recognition result, it responds by setting, as the current recognition data, the third layer recognition data corresponding to the word "chikaku no raamen-ya" of the second layer recognition data currently set as the current recognition data. It then searches for ramen shops near the current position using its car navigation function and, as shown in FIG. 8c1, changes the display screen of the display device 7 to a screen that includes the list of words registered in the third layer recognition data set as the current recognition data. In this list, the five words from "hitotsume ni iku" (ひとつめにいく, "go to the first") to "itsutsume ni iku" (いつつめにいく, "go to the fifth") are each shown in association with one of the five ramen shops found near the current position.

The data processing unit 1 also sets the third layer recognition data, which has been set as the current recognition data, in the first voice recognition dictionary 42, as shown in FIG. 8c2. Meanwhile, as shown in FIG. 8c3, the second voice recognition dictionary 44 continues to hold the first layer recognition data. The data processing unit 1 then sets the parallel recognition mode in the recognition adjustment unit 45.

As a result, through the parallel recognition mode recognition process in the recognition adjustment unit 45 of the voice input unit 4, voice recognition is performed both for the words of the third layer recognition data registered in the first voice recognition dictionary 42 and for the words of the first layer recognition data registered in the second voice recognition dictionary 44.

In this state, when the user utters the word "hitotsume ni iku" (ひとつめにいく, "go to the first") registered in the first voice recognition dictionary 42, the recognition adjustment unit 45 determines the word "hitotsume ni iku" to be the recognition result, and the word is output from the voice input unit 4 to the data processing unit 1 as the recognition result. When the data processing unit 1 accepts the word "hitotsume ni iku" as the recognition result, it sets as the destination the ramen shop shown on the screen of FIG. 8c1 in association with the word "hitotsume ni iku", and its car navigation function starts processing for route guidance to the destination.

一方、図８ｂ１、ｂ２、ｂ３の第２階層認識データが第１音声認識辞書４２に設定されている状態において、ユーザが第２音声認識辞書４４に登録されている第１階層認識データのワード「ばっくかめら」を発話すると、認識調整部４５により、ワード「ばっくかめら」が認識結果として算定され、ワード「ばっくかめら」が認識結果として音声入力部４からデータ処理部１に出力される。 On the other hand, in the state where the second layer recognition data of FIGS. 8b1, b2, and b3 is set in the first voice recognition dictionary 42, when the user utters the word "bakku kamera" (back camera) of the first layer recognition data registered in the second voice recognition dictionary 44, the recognition adjustment unit 45 determines the word "bakku kamera" as the recognition result, and the voice input unit 4 outputs it to the data processing unit 1 as the recognition result.

ここで、本実施形態に係るデータ処理部１は、「ふろんとかめら」や「さいどかめら」や「ばっくかめら」といったワードが認識結果として入力したときに対応するカメラ８で撮影した画像を表示装置７に表示する処理を行うものであるとする。 Here, it is assumed that when a word such as "furonto kamera" (front camera), "saido kamera" (side camera), or "bakku kamera" (back camera) is input as a recognition result, the data processing unit 1 according to the present embodiment displays the image captured by the corresponding camera 8 on the display device 7.

この場合、データ処理部１は、ワード「ばっくかめら」を認識結果として受け付けたならば、第１階層認識データのワード「ばっくかめら」に対応する第２階層認識データを、現用認識データに設定する。 In this case, on accepting the word "bakku kamera" as the recognition result, the data processing unit 1 sets the second layer recognition data corresponding to the word "bakku kamera" of the first layer recognition data as the current recognition data.

また、データ処理部１は、図９ｄ１に示すように、カメラ８の一つとして備えたバックカメラで撮影した自動車後方の画像を、現用認識データに設定した第２階層認識データに登録されているワードのリストと共に表示する。なお、図９ａ１、ａ２、ａ３は図８ａ１、ａ２、ａ３と同じものであり、図９ｂ１、ｂ２、ｂ３は、図８ｂ１、ｂ２、ｂ３と同じものである。 Further, as shown in FIG. 9d1, the data processing unit 1 displays the image of the area behind the vehicle, captured by the back camera provided as one of the cameras 8, together with a list of the words registered in the second layer recognition data set as the current recognition data. Note that FIGS. 9a1, a2, and a3 are the same as FIGS. 8a1, a2, and a3, and FIGS. 9b1, b2, and b3 are the same as FIGS. 8b1, b2, and b3.

また、データ処理部１は、図９ｄ２に示すように現用認識データに設定した第２階層認識データを第１音声認識辞書４２に設定する。一方、図９ｄ３に示すように、第２音声認識辞書４４は、第１階層認識データのまま維持される。そして、データ処理部１は、認識調整部４５に、並列認識モードを設定する。 Further, the data processing unit 1 sets the second layer recognition data set in the current recognition data in the first speech recognition dictionary 42 as shown in FIG. 9d2. On the other hand, as shown in FIG. 9d3, the second speech recognition dictionary 44 is maintained as the first layer recognition data. Then, the data processing unit 1 sets the parallel recognition mode in the recognition adjustment unit 45.

さて、以上のように並列認識モードを設定しているときには、オーディオソース６の出力音声による誤認識の抑止は行われない。しかし、並列認識モードが設定されるのは、第１階層認識データ以外の階層の階層認識データが現用認識データに設定されているときであり、第１階層認識データ以外の階層の階層認識データが現用認識データに設定されるときは、データ処理部１が上述したシーケンス実行処理を行っており、ユーザが一連の階層的な音声入力を連続的に行っているときである。 By the way, when the parallel recognition mode is set as described above, the false recognition by the output voice of the audio source 6 is not suppressed. However, the parallel recognition mode is set when the layer recognition data of the layer other than the first layer recognition data is set as the active recognition data, and the layer recognition data of the layer other than the first layer recognition data is set. When it is set to the current recognition data, it is when the data processing unit 1 is performing the sequence execution process described above and the user is continuously performing a series of hierarchical voice inputs.

したがって、並列認識モードを設定してから、ユーザの発話による音声入力が行われるまでの期間は短く、この間に、第１音声認識辞書４２や第２音声認識辞書４４に設定されているワードと同じワードの音声が、オーディオソース６から出力されることは希である。 Therefore, the period from when the parallel recognition mode is set until voice input by the user's utterance occurs is short, and it is rare for the audio source 6 to output, during this period, the sound of a word identical to one set in the first voice recognition dictionary 42 or the second voice recognition dictionary 44.

したがって、第１階層認識データ以外の階層の階層認識データを現用認識データに設定しているときに、オーディオソース６の出力音声による誤認識の抑止を行わなくても実用上、支障が生じることはない。 Therefore, no practical problem arises even if misrecognition caused by the output sound of the audio source 6 is not suppressed while layer recognition data of a layer other than the first layer is set as the current recognition data.

なお、第１階層認識データを現用認識データに設定しているときには、データ処理部１は上述した待受処理を行っている状態にあり、第１階層認識データを現用認識データに設定してからユーザの発話による音声入力が発生するまでの期間は不定となる。したがって、この間に、第１音声認識辞書４２に設定されているワードと同じワードの音声がオーディオソース６から出力される可能性は小さくないので、オーディオソース６の出力音声による誤認識の第２音声認識辞書４４を用いた抑止を行うことが必要となる。 When the first layer recognition data is set as the current recognition data, the data processing unit 1 is performing the standby process described above, and the period from when the first layer recognition data is set as the current recognition data until voice input by the user's utterance occurs is indefinite. The possibility that the audio source 6 outputs, during this period, the sound of a word identical to one set in the first voice recognition dictionary 42 is therefore not small, so misrecognition caused by the output sound of the audio source 6 must be suppressed using the second voice recognition dictionary 44.
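
The suppression performed in the standby state can be sketched as follows. This is a minimal Python illustration based on the behavior recited in the claims (a word detected in the audio source output is not accepted from the microphone for a predetermined period). The class name `AudioCancelRecognizer`, its method names, and the 2-second window are assumptions made for illustration; the specification only refers to a "predetermined period".

```python
from typing import Dict, Optional


class AudioCancelRecognizer:
    """Rejects microphone candidates that the audio source itself just produced."""

    def __init__(self, suppress_window_s: float = 2.0):
        # Assumed value; the source text does not give a concrete duration.
        self.suppress_window_s = suppress_window_s
        self._last_source_hit: Dict[str, float] = {}

    def on_source_candidate(self, word: str, t: float) -> None:
        """The second engine detected `word` in the audio sent to the speaker at time t."""
        self._last_source_hit[word] = t

    def on_mic_candidate(self, word: str, t: float) -> Optional[str]:
        """The first engine detected `word` in the microphone signal at time t."""
        hit = self._last_source_hit.get(word)
        if hit is not None and 0.0 <= t - hit <= self.suppress_window_s:
            return None  # treated as speaker output that leaked into the microphone
        return word      # treated as the user's utterance
```

In this sketch a word is rejected only when the same word was detected in the speaker-bound audio shortly before; other words, and the same word after the window has elapsed, are accepted as the user's input.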

一方、以上のように第２音声認識辞書４４を第１階層認識データに維持したまま、第１音声認識辞書４２を更新して、並列認識モードを設定することにより、音声認識したワードに応じて次回認識する候補とするワードを更新しつつ、第１階層認識データのワードを常時音声認識できるようになる。 On the other hand, by updating the first voice recognition dictionary 42 and setting the parallel recognition mode while keeping the second voice recognition dictionary 44 at the first layer recognition data as described above, the words of the first layer recognition data can be recognized at all times, while the candidate words for the next recognition are updated according to the word that was recognized.
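
The dictionary handling described in this paragraph can be sketched in Python as below. The class `VoiceInputState` and the words `word_a` and `word_b` are hypothetical placeholders; only "bakku kamera" and "hitotsume ni iku" come from the text. The sketch also assumes that a word with no lower layer ends the sequence and returns dictionary 42 to the first layer words, which simplifies the embodiment.

```python
class VoiceInputState:
    """Tracks the contents of the two dictionaries during hierarchical voice input."""

    def __init__(self, first_layer, next_layer):
        self.first_layer = frozenset(first_layer)
        self.next_layer = next_layer          # word -> word set of the layer below it
        self.dict1 = set(self.first_layer)    # first voice recognition dictionary 42
        self.dict2 = set(self.first_layer)    # second voice recognition dictionary 44

    def accept(self, word):
        """Apply a recognized word; return False if neither dictionary contains it."""
        if word not in self.dict1 and word not in self.dict2:
            return False
        follow_up = self.next_layer.get(word)
        # Only dictionary 42 changes; dictionary 44 keeps the first layer words.
        self.dict1 = set(follow_up) if follow_up else set(self.first_layer)
        return True


# Placeholder hierarchy: word_a (layer 1) -> word_b (layer 2) -> "hitotsume ni iku" (layer 3).
state = VoiceInputState(
    first_layer={"bakku kamera", "word_a"},
    next_layer={"word_a": {"word_b"}, "word_b": {"hitotsume ni iku"}},
)
```

After `state.accept("word_a")`, dictionary 42 holds only the second layer words, yet "bakku kamera" is still accepted because dictionary 44 has not changed.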

また、ユーザにとって緊急を要する処理の実行を要求するコマンドを表すワードは、第１階層認識データに登録されることが多い。 In addition, words representing commands that request execution of processing that is urgent for the user are often registered in the first layer recognition data.

たとえば、本実施形態に係るデータ処理部１は、上述のように「ふろんとかめら」や「さいどかめら」や「ばっくかめら」といったワードが認識結果として入力したときに対応するカメラ８で撮影した画像を表示装置７に表示する処理を行うものであり、第１階層認識データに登録されている、これらの「ふろんとかめら」や「さいどかめら」や「ばっくかめら」といったワードは、ユーザが周囲状況確認のためにカメラ８の撮影画像の表示を指示するコマンドのワードであるので、緊急を要する処理の実行を要求するコマンドを表すワードに該当する。 For example, as described above, when a word such as "furonto kamera" (front camera), "saido kamera" (side camera), or "bakku kamera" (back camera) is input as a recognition result, the data processing unit 1 according to the present embodiment displays the image captured by the corresponding camera 8 on the display device 7. These words, which are registered in the first layer recognition data, are command words with which the user instructs display of the image captured by a camera 8 in order to check the surroundings, and therefore correspond to words representing commands that request execution of urgent processing.

したがって、本実施形態によれば、ユーザにとって緊急を要する処理の実行を要求するコマンドの音声入力を、任意の時点において受け付けることができるようになる。 Therefore, according to the present embodiment, voice input of a command requesting execution of processing that is urgent for the user can be accepted at any time.

以上、本発明の実施形態について説明した。 The embodiment of the present invention has been described above.

ところで、以上の実施形態は、図４に示した音声入力設定処理に代えて、図１０に示す音声入力設定処理を行うようにしてもよい。 In the above embodiment, the voice input setting process shown in FIG. 10 may be performed instead of the voice input setting process shown in FIG. 4.

すなわち、この音声入力設定処理では、データ処理部１は、現用認識データの設定の発生を監視する（ステップ１００２）。 That is, in this voice input setting process, the data processing unit 1 monitors for the occurrence of a setting of the current recognition data (step 1002).

そして、現用認識データの設定が発生したならば（ステップ１００２）、現用認識データの認識モードが第１階層認識データであるかどうかを調べる（ステップ１００４）。 When a setting of the current recognition data occurs (step 1002), the data processing unit 1 checks whether the current recognition data is the first layer recognition data (step 1004).

そして、現用認識データが第１階層認識データであれば（ステップ１００４）、現用認識データである第１階層認識データを第１音声認識辞書４２と第２音声認識辞書４４に設定し、（ステップ１００６）、認識モードとしてオーディオキャンセルモードを認識調整部４５に設定する（ステップ１００８）。 If the current recognition data is the first layer recognition data (step 1004), the data processing unit 1 sets the first layer recognition data, which is the current recognition data, in both the first voice recognition dictionary 42 and the second voice recognition dictionary 44 (step 1006), and sets the audio cancel mode in the recognition adjustment unit 45 as the recognition mode (step 1008).

そして、音声認識開始を認識調整部４５に指示し（ステップ１０１０）、ステップ１００２の監視に戻る。 Then, the data processing unit 1 instructs the recognition adjustment unit 45 to start voice recognition (step 1010) and returns to the monitoring of step 1002.

一方、現用認識データが１階層認識データでなければ、現在、データ処理部１が、オーディオソース６からスピーカ５に音声を出力させているかどうかを調べる（ステップ１０１２）。 On the other hand, if the current recognition data is not the first layer recognition data, the data processing unit 1 checks whether it is currently causing the audio source 6 to output sound to the speaker 5 (step 1012).

そして、オーディオソース６からスピーカ５に音声を出力させていれば（ステップ１０１２）、現用認識データを第１音声認識辞書４２と第２音声認識辞書４４に設定し、（ステップ１００６）、認識モードとしてオーディオキャンセルモードを認識調整部４５に設定する（ステップ１００８）。 Then, if the voice is output from the audio source 6 to the speaker 5 (step 1012), the current recognition data is set in the first voice recognition dictionary 42 and the second voice recognition dictionary 44 (step 1006), and the recognition mode is set. The audio cancel mode is set in the recognition adjustment unit 45 (step 1008).

そして、音声認識開始を認識調整部４５に指示し（ステップ１０１０）、ステップ１００２の監視に戻る。 Then, the data processing unit 1 instructs the recognition adjustment unit 45 to start voice recognition (step 1010) and returns to the monitoring of step 1002.

また、オーディオソース６からスピーカ５に音声を出力させていなければ（ステップ１０１２）、現用認識データを第１音声認識辞書４２に設定し、第１階層認識データを第２音声認識辞書４４に設定し、（ステップ１０１４）、認識モードとして並列認識モードを認識調整部４５に設定する（ステップ１０１６）。 If no sound is being output from the audio source 6 to the speaker 5 (step 1012), the data processing unit 1 sets the current recognition data in the first voice recognition dictionary 42, sets the first layer recognition data in the second voice recognition dictionary 44 (step 1014), and sets the parallel recognition mode in the recognition adjustment unit 45 as the recognition mode (step 1016).

そして、音声認識開始を認識調整部４５に指示し（ステップ１０１０）、ステップ１００２の監視に戻る。 Then, the data processing unit 1 instructs the recognition adjustment unit 45 to start voice recognition (step 1010) and returns to the monitoring of step 1002.

このような音声入力設定処理によれば、オーディオソース６からスピーカ５に音声が出力されているときにはオーディオソース６の出力音声による誤認識を行いつつ、オーディオソース６からスピーカ５に音声を出力させていないとき、すなわち、オーディオソース６の出力音声による誤認識が生じないときには、音声認識できるワードの数を拡大することができる。 According to such a voice input setting process, misrecognition caused by the output sound of the audio source 6 is suppressed while sound is being output from the audio source 6 to the speaker 5, and the number of words that can be recognized is increased when no sound is being output from the audio source 6 to the speaker 5, that is, when misrecognition caused by the output sound of the audio source 6 cannot occur.
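
The branch structure of the voice input setting process of FIG. 10 (steps 1004 to 1016) can be summarized in a short Python sketch. The names `Mode`, `Setting`, and `configure` are hypothetical, and the comparison `current == first_layer` stands in for the check of whether the current recognition data is the first layer recognition data; the monitoring loop of step 1002 and the start instruction of step 1010 are omitted.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AUDIO_CANCEL = "audio_cancel"
    PARALLEL = "parallel"


@dataclass(frozen=True)
class Setting:
    dict1: frozenset   # contents of the first voice recognition dictionary 42
    dict2: frozenset   # contents of the second voice recognition dictionary 44
    mode: Mode


def configure(current: frozenset, first_layer: frozenset, audio_playing: bool) -> Setting:
    """Choose dictionaries and recognition mode for newly set current recognition data."""
    # Step 1004 (first layer?) and step 1012 (audio source playing?).
    if current == first_layer or audio_playing:
        # Steps 1006 and 1008: same data in both dictionaries, audio cancel mode.
        return Setting(current, current, Mode.AUDIO_CANCEL)
    # Steps 1014 and 1016: first layer data stays in dictionary 44, parallel mode.
    return Setting(current, first_layer, Mode.PARALLEL)
```

Under these assumptions, the parallel recognition mode is chosen only when a lower layer is active and the audio source is silent; every other case falls back to the audio cancel mode.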

また、以上の実施形態では、並列認識モードのときに第１階層認識データを第２音声認識辞書４４として維持するようにしたが、並列認識モードのときに、第１音声認識辞書４２と同様に第２音声認識辞書４４の内容を切り替えるようにしてもよい。 Further, in the above embodiment, the first layer recognition data is maintained as the second speech recognition dictionary 44 in the parallel recognition mode, but in the parallel recognition mode, the same as the first speech recognition dictionary 42. The contents of the second voice recognition dictionary 44 may be switched.

すなわち、たとえば、図１１に示すように、第１階層認識データ以外の各階層の認識データには、相互に異なるワードのセットを登録した主認識データと副認識データとを含めておき、図４に示した音声入力設定処理のステップ４１２や図１０に示した音声入力設定処理のステップ１０１４において、現用認識データの主認識データを第１音声認識辞書４２に設定し、現用認識データの副認識データを第２音声認識辞書４４に設定するようにしてもよい。 That is, for example, as shown in FIG. 11, the recognition data of each layer other than the first layer recognition data may include main recognition data and sub-recognition data in which mutually different sets of words are registered. Then, in step 412 of the voice input setting process shown in FIG. 4 or in step 1014 of the voice input setting process shown in FIG. 10, the main recognition data of the current recognition data may be set in the first voice recognition dictionary 42 and the sub-recognition data of the current recognition data may be set in the second voice recognition dictionary 44.

このようにすることにより、並列認識モード認識処理によって音声認識できるワードを、より柔軟に設定することができるようになる。 By doing so, it becomes possible to more flexibly set the words that can be voice-recognized by the parallel recognition mode recognition process.
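
The main/sub variant of FIG. 11 can be illustrated with the following Python sketch. The names `LayerData` and `load_parallel` and the placeholder words are assumptions made for illustration only; the specification does not list the words held in the main and sub-recognition data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerData:
    """Recognition data of a layer other than the first, split into two word sets."""
    main: frozenset
    sub: frozenset

    def __post_init__(self):
        # The text requires mutually different word sets in the two parts.
        if self.main == self.sub:
            raise ValueError("main and sub recognition data must differ")


def load_parallel(layer: LayerData):
    """Return the contents for dictionary 42 (main) and dictionary 44 (sub)."""
    return set(layer.main), set(layer.sub)


# Placeholder words, not taken from the specification.
layer = LayerData(main=frozenset({"main_word"}), sub=frozenset({"sub_word"}))
dict42, dict44 = load_parallel(layer)
```

Compared with keeping the first layer words in dictionary 44, this arrangement lets each layer decide which additional words are recognizable alongside its main words.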

１…データ処理部、２…辞書ＤＢ、３…マイクロフォン、４…音声入力部、５…スピーカ、６…オーディオソース、７…表示装置、８…カメラ、９…周辺装置、４１…第１音声認識エンジン、４２…第１音声認識辞書、４３…第２音声認識エンジン、４４…第２音声認識辞書、４５…認識調整部。 1 ... Data processing unit, 2 ... Dictionary DB, 3 ... Microphone, 4 ... Voice input unit, 5 ... Speaker, 6 ... Audio source, 7 ... Display device, 8 ... Camera, 9 ... Peripheral device, 41 ... First voice recognition Engine, 42 ... 1st voice recognition dictionary, 43 ... 2nd voice recognition engine, 44 ... 2nd voice recognition dictionary, 45 ... recognition adjustment unit.

Claims

A voice recognition system that recognizes speech uttered in a space into which sound output from an audio source device is emitted from a speaker, the system comprising:
a microphone;
a first voice recognition dictionary in which a plurality of words are registered;
a first voice recognition unit that receives the sound picked up by the microphone and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the first voice recognition dictionary;
a second voice recognition unit;
a second voice recognition dictionary used by the second voice recognition unit;
a recognition unit that recognizes words spoken by the user; and
a voice input receiving unit that accepts voice input of a word recognized by the recognition unit while selectively setting a first recognition mode and a second recognition mode, wherein
in the first recognition mode, the second voice recognition dictionary used by the second voice recognition unit has registered in it the same plurality of words as the plurality of words registered in the first voice recognition dictionary,
in the first recognition mode, the second voice recognition unit receives the sound that the audio source device outputs to the speaker and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the second voice recognition dictionary,
in the first recognition mode, the recognition unit recognizes a word detected as a target candidate by the first voice recognition unit as a word spoken by the user, while suppressing, during a predetermined period after a target candidate is detected by the second voice recognition unit, recognition as a word spoken by the user of a word identical to that target candidate even if it is detected as a target candidate by the first voice recognition unit,
in the second recognition mode, the second voice recognition dictionary used by the second voice recognition unit has registered in it a plurality of words that differ at least in part from the plurality of words registered in the first voice recognition dictionary,
in the second recognition mode, the second voice recognition unit receives the sound picked up by the microphone and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the second voice recognition dictionary, and
in the second recognition mode, the recognition unit recognizes a word detected as a target candidate by the first voice recognition unit as a word spoken by the user, and also recognizes a word detected as a target candidate by the second voice recognition unit as a word spoken by the user.

The voice recognition system according to claim 1.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
When the voice input of the word is accepted by the standby process, the sequence execution process in which the word for which the voice input is accepted is set as the first layer word is started.
In the sequence execution process, the second recognition mode is set as the recognition mode, and at each time point, one of the first voice recognition dictionary and the second voice recognition dictionary sets the standby voice recognition dictionary. While using the first speech recognition dictionary and the second speech recognition dictionary so that the sub-speech recognition dictionary and the other become the main speech recognition dictionary, the main speech recognition dictionary determined according to the first layer word is set. While updating the voice recognition dictionary to a voice recognition dictionary determined according to the words registered in the main voice recognition dictionary that received voice input, the voice of the words registered in the main voice recognition dictionary recognized by the recognition unit. It is a process to execute a sequence that accepts input once or multiple times.
When the sequence execution process is completed, the standby process is started, and
if voice input of a word registered in the standby voice recognition dictionary is accepted during execution of the sequence execution process, the voice recognition system performs at least one of starting the sequence execution process in which the word for which the voice input was accepted is set as the first layer word, and starting other processing associated with the word for which the voice input was accepted.

The voice recognition system according to claim 1.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
When the voice input of the word is accepted by the standby process, the sequence execution process in which the word for which the voice input is accepted is set as the first layer word is started.
In the sequence execution process, the first voice recognition dictionary and the second voice recognition are set by setting the second recognition mode as the recognition mode and setting different voice recognition dictionaries determined according to the first layer word. It is a process of executing a sequence of accepting the voice input of the word recognized by the recognition unit once or a plurality of times while updating the dictionary to a different voice recognition dictionary determined according to the word received as the voice input.
A voice recognition system characterized in that the standby process is started when the sequence execution process is completed.

The voice recognition system according to claim 1.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
If the voice input of a word is accepted when the voice output from the audio source device is radiated from the speaker by the standby process, the word that has received the voice input is set as the first layer word. The first sequence execution process is started,
If the voice input of a word is accepted when the voice output from the audio source device is not radiated from the speaker by the standby process, the word that has received the voice input is set as the first layer word. The second sequence execution process is started,
In the first sequence execution process, the first voice recognition mode and the second voice recognition dictionary in which the first recognition mode is set as the recognition mode and the same voice recognition dictionary determined according to the first layer word is set are set. Is a process of executing a sequence of accepting the voice input of the word recognized by the recognition unit once or a plurality of times while updating the same voice recognition dictionary determined according to the word that received the voice input.
In the second sequence execution process, the second recognition mode is set as the recognition mode, and at each time point, one of the first voice recognition dictionary and the second voice recognition dictionary uses the standby voice recognition dictionary. While using the first speech recognition dictionary and the second speech recognition dictionary so that the set sub-speech recognition dictionary and the other become the main speech recognition dictionary, a speech recognition dictionary determined according to the first layer word was set. While updating the main voice recognition dictionary to a voice recognition dictionary determined according to the words registered in the main voice recognition dictionary that received voice input, the words registered in the main voice recognition dictionary recognized by the recognition unit. It is a process to execute a sequence that accepts the voice input of
When the first sequence execution process is completed, the standby process is started,
When the second sequence execution process is completed, the standby process is started, and
If the voice input of the word registered in the standby voice recognition dictionary is received during the execution of the second sequence execution process, the word receiving the voice input is set as the first layer word. A voice recognition system, characterized in that it starts at least one of the start of processing and the start of other processing associated with the word that received the voice input.

The voice recognition system according to claim 1.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
If the voice input of a word is accepted when the voice output from the audio source device is radiated from the speaker by the standby process, the word that has received the voice input is set as the first layer word. The first sequence execution process is started,
If the voice input of a word is accepted when the voice output from the audio source device is not radiated from the speaker by the standby process, the word that has received the voice input is set as the first layer word. The second sequence execution process is started,
In the first sequence execution process, the first voice recognition mode and the second voice recognition dictionary in which the first recognition mode is set as the recognition mode and the same voice recognition dictionary determined according to the first layer word is set are set. Is a process of executing a sequence of accepting the voice input of the word recognized by the recognition unit once or a plurality of times while updating the same voice recognition dictionary determined according to the word that received the voice input.
In the second sequence execution process, the first voice recognition dictionary and the second voice recognition dictionary in which the second recognition mode is set as the recognition mode and different voice recognition dictionaries are set according to the first layer word are set. In the process of executing a sequence of accepting the voice input of the word recognized by the recognition unit once or multiple times while updating the voice recognition dictionary to a different voice recognition dictionary determined according to the word that received the voice input. can be,
When the first sequence execution process is completed, the standby process is started, and
A voice recognition system characterized in that the standby process is started when the second sequence execution process is completed.

The voice recognition system according to claim 1.
The voice input receiving unit sets the recognition mode to the first recognition mode when the voice output from the audio source device is emitted from the speaker, and is output from the audio source device from the speaker. A voice recognition system characterized in that the recognition mode is set to the second recognition mode when no sound is emitted.

The voice recognition system according to claim 1, 2, 3, 4, 5 or 6.
The voice recognition system is a voice recognition system used for voice input in an in-vehicle system mounted on an automobile.

A computer program read and executed by a computer equipped with a microphone placed in a space into which sound output from an audio source device is emitted from a speaker, the computer program causing the computer to function as:
a first voice recognition dictionary in which a plurality of words are registered;
a first voice recognition unit that receives the sound picked up by the microphone and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the first voice recognition dictionary;
a second voice recognition unit;
a second voice recognition dictionary used by the second voice recognition unit;
a recognition unit that recognizes words spoken by the user; and
a voice input receiving unit that accepts voice input of a word recognized by the recognition unit while selectively setting a first recognition mode and a second recognition mode, wherein
in the first recognition mode, the second voice recognition dictionary used by the second voice recognition unit has registered in it the same plurality of words as the plurality of words registered in the first voice recognition dictionary,
in the first recognition mode, the second voice recognition unit receives the sound that the audio source device outputs to the speaker and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the second voice recognition dictionary,
in the first recognition mode, the recognition unit recognizes a word detected as a target candidate by the first voice recognition unit as a word spoken by the user, while suppressing, during a predetermined period after a target candidate is detected by the second voice recognition unit, recognition as a word spoken by the user of a word identical to that target candidate even if it is detected as a target candidate by the first voice recognition unit,
in the second recognition mode, the second voice recognition dictionary used by the second voice recognition unit has registered in it a plurality of words that differ at least in part from the plurality of words registered in the first voice recognition dictionary,
in the second recognition mode, the second voice recognition unit receives the sound picked up by the microphone and detects, as a target candidate, a word matching the received sound from among the plurality of words registered in the second voice recognition dictionary, and
in the second recognition mode, the recognition unit recognizes a word detected as a target candidate by the first voice recognition unit as a word spoken by the user, and also recognizes a word detected as a target candidate by the second voice recognition unit as a word spoken by the user.

The computer program according to claim 8.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
When the voice input of the word is accepted by the standby process, the sequence execution process in which the word for which the voice input is accepted is set as the first layer word is started.
In the sequence execution process, the second recognition mode is set as the recognition mode, and at each time point, one of the first voice recognition dictionary and the second voice recognition dictionary sets the standby voice recognition dictionary. While using the first speech recognition dictionary and the second speech recognition dictionary so that the sub-speech recognition dictionary and the other become the main speech recognition dictionary, the main speech recognition dictionary determined according to the first layer word is set. While updating the voice recognition dictionary to a voice recognition dictionary determined according to the words registered in the main voice recognition dictionary that received voice input, the voice of the words registered in the main voice recognition dictionary recognized by the recognition unit. It is a process to execute a sequence that accepts input once or multiple times.
When the sequence execution process is completed, the standby process is started, and
if voice input of a word registered in the standby voice recognition dictionary is accepted during execution of the sequence execution process, the computer program performs at least one of starting the sequence execution process in which the word for which the voice input was accepted is set as the first layer word, and starting other processing associated with the word for which the voice input was accepted.

The computer program according to claim 8.
The voice input reception unit is
The first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set in the first voice recognition dictionary and the second voice recognition dictionary, and the recognition unit recognizes them. Performs standby processing to accept voice input of words,
When the voice input of the word is accepted by the standby process, the sequence execution process in which the word for which the voice input is accepted is set as the first layer word is started.
In the sequence execution process, the first voice recognition dictionary and the second voice recognition are set by setting the second recognition mode as the recognition mode and setting different voice recognition dictionaries determined according to the first layer word. It is a process of executing a sequence of accepting the voice input of the word recognized by the recognition unit once or a plurality of times while updating the dictionary to a different voice recognition dictionary determined according to the word received as the voice input.
A computer program characterized in that the standby process is started when the sequence execution process is completed.

The computer program according to claim 8.
The voice input reception unit performs a standby process in which the first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set as both the first voice recognition dictionary and the second voice recognition dictionary, and voice input of a word recognized by the recognition unit is accepted,
If voice input of a word is accepted in the standby process while the sound output by the audio source device is being emitted from the speaker, a first sequence execution process is started with the word whose voice input was accepted set as the first layer word,
If voice input of a word is accepted in the standby process while the sound output by the audio source device is not being emitted from the speaker, a second sequence execution process is started with the word whose voice input was accepted set as the first layer word,
The first sequence execution process is a process of setting the first recognition mode as the recognition mode and executing a sequence that accepts, one or more times, voice input of a word recognized by the recognition unit, while the first voice recognition dictionary and the second voice recognition dictionary, both first set to the same voice recognition dictionary determined according to the first layer word, are updated to the same voice recognition dictionary determined according to the word whose voice input was accepted.
The second sequence execution process is a process of setting the second recognition mode as the recognition mode and executing a sequence that accepts, one or more times, voice input of a word registered in the main voice recognition dictionary and recognized by the recognition unit. Throughout the sequence, the first voice recognition dictionary and the second voice recognition dictionary are used so that, at each point in time, one of them is a sub voice recognition dictionary in which the standby voice recognition dictionary is set and the other is the main voice recognition dictionary. The main voice recognition dictionary is first set to a voice recognition dictionary determined according to the first layer word, and is then updated to a voice recognition dictionary determined according to the word registered in the main voice recognition dictionary whose voice input was accepted.
When the first sequence execution process is completed, the standby process is started,
When the second sequence execution process is completed, the standby process is started,
A computer program characterized in that, if voice input of a word registered in the standby voice recognition dictionary is accepted during execution of the second sequence execution process, at least one of the following is performed: starting a sequence execution process with the word whose voice input was accepted set as the first layer word, and starting other processing associated with the word whose voice input was accepted.

The computer program according to claim 8.
The voice input reception unit performs a standby process in which the first recognition mode is set as the recognition mode, the standby voice recognition dictionary, which is a predetermined voice recognition dictionary, is set as both the first voice recognition dictionary and the second voice recognition dictionary, and voice input of a word recognized by the recognition unit is accepted,
If voice input of a word is accepted in the standby process while the sound output by the audio source device is being emitted from the speaker, a first sequence execution process is started with the word whose voice input was accepted set as the first layer word,
If voice input of a word is accepted in the standby process while the sound output by the audio source device is not being emitted from the speaker, a second sequence execution process is started with the word whose voice input was accepted set as the first layer word,
The first sequence execution process is a process of setting the first recognition mode as the recognition mode and executing a sequence that accepts, one or more times, voice input of a word recognized by the recognition unit, while the first voice recognition dictionary and the second voice recognition dictionary, both first set to the same voice recognition dictionary determined according to the first layer word, are updated to the same voice recognition dictionary determined according to the word whose voice input was accepted.
The second sequence execution process is a process of setting the second recognition mode as the recognition mode and executing a sequence that accepts, one or more times, voice input of a word recognized by the recognition unit, while the first voice recognition dictionary and the second voice recognition dictionary, first set to mutually different voice recognition dictionaries determined according to the first layer word, are updated to mutually different voice recognition dictionaries determined according to the word whose voice input was accepted.
When the first sequence execution process is completed, the standby process is started,
A computer program characterized in that the standby process is started when the second sequence execution process is completed.

The computer program according to claim 8.
A computer program characterized in that the voice input reception unit sets the recognition mode to the first recognition mode when the sound output by the audio source device is being emitted from the speaker, and sets the recognition mode to the second recognition mode when the sound output by the audio source device is not being emitted from the speaker.
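The mode selection in this claim amounts to a single condition on the playback state. A minimal sketch follows, in which `RecognitionMode` and `select_recognition_mode` are hypothetical names and how the playback state is detected is left open:

```python
from enum import Enum


class RecognitionMode(Enum):
    FIRST = 1   # used while audio source output is emitted from the speaker
    SECOND = 2  # used while no audio source output is emitted


def select_recognition_mode(audio_emitted_from_speaker: bool) -> RecognitionMode:
    """Choose the recognition mode from the current playback state."""
    if audio_emitted_from_speaker:
        return RecognitionMode.FIRST
    return RecognitionMode.SECOND
```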

The computer program according to claim 8, 9, 10, 11, 12 or 13.
A computer program characterized in that the computer is a computer mounted on an automobile.