JP3700533B2

JP3700533B2 - Speech recognition apparatus and processing system

Info

Publication number: JP3700533B2
Application number: JP2000117910A
Authority: JP
Inventors: 英夫宮内; 義隆尾崎; 一郎赤堀; 教英北岡; 徹名田
Original assignee: Denso Corp
Current assignee: Denso Corp
Priority date: 2000-04-19
Filing date: 2000-04-19
Publication date: 2005-09-28
Anticipated expiration: 2020-04-19
Also published as: JP2001306088A

Description

【０００１】
【発明の属する技術分野】
本発明は、例えばナビゲーションシステムにおける目的地の設定などを音声によって入力する場合などに有効な音声認識装置及びその音声認識装置を備えた処理システムに関する。
【０００２】
【従来の技術及び発明が解決しようとする課題】
従来より、入力された音声を予め記憶されている複数の比較対象パターン候補と比較し、一致度合の高いものを認識結果とする音声認識装置が既に実用化されており、例えばナビゲーションシステムにおいて設定すべき目的地を利用者が地名を音声で入力するためなどに用いられている。特に車両用のナビゲーションシステムを運転手自身が利用する場合、音声入力であればボタン操作や画面注視が伴わないため、車両の走行中に行っても安全性が高いため有効である。
【０００３】
このような機能を満たすためには、十分詳細な地点の指定が容易にできることが望まれる。具体的には、県や市のレベルではなく、市の下の町名のレベルや、町村における大字あるいは小字といったレベルまで入力できるようにすることが好ましい。さらに、利用者が例えば「愛知県刈谷市昭和町」と設定したい場合に、「愛知県」「刈谷市」「昭和町」というように県市町というレベル毎に区切って発音しなくてはならないとすると煩わしいので、ひと続きで入力（一括入力）できるようにすることが好ましい。
【０００４】
しかしながら、このように一括入力ができることを前提とし、さらに十分詳細な地点の指定ができるようにするためには、認識できる語数を増やすことが必要であり、辞書のデータ量を増加させてしまう。例えば住所の認識についていえば、現在では大字程度のレベルまでしか辞書を用意していないのが一般的である。それを小字までの辞書を用意することでより詳細な地点の指定はできるが、辞書データが増大することにより、その辞書データを格納しておくための例えばＲＡＭなどのメモリが非常に大型化し、コストアップにもつながることとなる。
【０００５】
本発明は、このような音声認識技術において辞書に準備する単語を増やすことで認識可能な対象を増加させるという利点を追求した場合に生じ得るデメリットを極力抑制し、上記利点をより現実的に享受し易くする技術を提案することを目的とする。
【０００６】
【課題を解決するための手段及び発明の効果】
まず、請求項１に記載した県名テンプレートデータ、県別テンプレートデータ、第１の記憶手段及び第２の記憶手段について説明する。
県名テンプレートデータは、最終的な認識対象が複数の地名を階層的につなぎ合わせた住所である場合の都道府県名に対応しており、入力音声に基づいて得たマッチング用データと比較するためのデータである。また、県別テンプレートデータは、各都道府県単位で準備され、都道府県名に加えて市町村名あるいはさらに市町村よりも下位レベルの地名までを含んだ語群を格納したものである。例えば日本の場合であれば県名テンプレートデータが４７都道府県名をテンプレートデータとして持ち、４７に分割された県別テンプレートデータが準備されることとなる。
【０００７】
県名又は県別のテンプレートデータは、請求項２に示すように、辞書データ（上位階層辞書あるいは下位階層辞書）であってもよいし、請求項３に示すように、音声データであってもよい。例えば予め人がその語又は語群を発音し、それを入力して音声データとして記憶しておいてもよい。つまり、辞書を用いた認識ではなくても、利用者の発声した音声データに基づき、何らかのデータとマッチングすることで認識できるようなデータであればよい。
【０００８】
また、第１の記憶手段は音声認識処理に際して高速アクセス性が相対的に低く、第２の記憶手段は音声認識処理に際して高速アクセス性が相対的に高いものであるが、具体例としては、第１の記憶手段としてＤＶＤやＣＤ−ＲＯＭが挙げられ、第２の記憶手段としてＲＡＭなどが挙げられる。つまり、実際の音声認識処理を実行する上では、その処理時間を短くしてレスポンスを向上させる観点から通常はＲＡＭなどの第２の記憶手段に辞書を読み込むこととなる。
【０００９】
請求項１に記載の音声認識装置によれば、少なくとも県別テンプレートデータは第１の記憶手段に記憶されており、ひと続きで入力できる音声入力手段を介して入力された音声を認識する際には、まず、入力音声に基づいて得たマッチング用データと県名テンプレートデータとを比較することで、どの都道府県名が含まれているかを判定する。そして、その予備判定にて含まれているとされた都道府県名に対応する県別テンプレートデータを第２の記憶手段に読み込み、その県別テンプレートデータを用いて最終的な認識結果を得る。つまり、例えば予備判定で「愛知県」という都道府県名が含まれていることが判った場合は、愛知県用に準備された県別テンプレートデータのみを第２の記憶手段に読み込んで認識を行うことができる。
【００１０】
このようにすれば、全テンプレートデータを第２の記憶手段に読み込んでおかなくてもよい。つまり、都道府県別に準備された県別テンプレートデータを最低限１つ（場合によっては複数）読み込むだけでよく、それに対応するだけの記憶容量が第２の記憶手段にあればよい。つまり本発明は、複数の地名を階層的につなぎ合わせた住所に対して都道府県単位に県別テンプレートデータを準備するという、いわばテンプレートデータの「分割」を行い、予備判定にてどの県別テンプレートデータを用いればよいかを判定して、「真に必要な」テンプレートデータに絞ってから第２の記憶手段に読み込むようにした。したがって、テンプレートデータに準備する単語を増やすことで認識可能な対象を増加させるという利点を追求した場合であっても、その認識可能語彙をすべて第２の記憶手段に格納しておく必要がない。そのため、第２の記憶手段は相対的に容量が小さくても、一括入力に対応した適切な音声認識が実現できる。
【００１１】
なお、複数の地名を階層的につなぎ合わせた住所についての上位階層と下位の切り分けについては弾力的な適用が可能であるため、請求項４に示すようにしてもよい。つまり、住所が３階層以上の地名で構成されている場合には、県名テンプレートデータと、県別テンプレートデータとを備えるとともに、当該県別テンプレートデータを上位階層とみなして市町村よりも下位の地名レベルも区別するようにした、市町村単位で準備された市別テンプレートデータを備えるのである。このようにすることで、必要なテンプレートデータだけを読み込めばよくなり、第２の記憶手段がより小容量でも対応可能となる。
【００１２】
ところで、最終的には県別テンプレートデータまたは請求項４における市別テンプレートデータを用いて認識するために、その県別テンプレートデータ又は請求項４における市別テンプレートデータを選択する予備判定を行う。この予備判定は、県名テンプレートデータ又は請求項４における県別テンプレートデータを用いて行うのであるが、このテンプレートデータの構成には次のような工夫をしてもよい。つまり、請求項５に示すように、テンプレートデータを構成する複数種類の語または語群の後にそれ以外の語又は語群が付属した音声入力に対してもマッチング可能なワイルドカードモデルとするのである。
【００１３】
例えば、県名テンプレートデータの場合、県名の後にどのような音声にもマッチングするようにする。単に県名しか辞書データとして持たない場合には、実際の認識対象（都道府県以下の市町村や大字なども含む語群）の内の一部分しか県名がないため、全体としてのマッチング度合いが低下する。それに対して、ワイルドカードモデルの場合には、マッチング自体は認識対象全体として行えるのでそのような問題が生じない。ワイルドカードモデルとしては、後述するガーベージモデルや音節連接モデルなどがある。
【００１４】
一方、このようなワイルドカードモデルを用いるのではなく、請求項６に示すように、県名テンプレートデータは、都道府県名の後に市町村名あるいはさらに市町村よりも下位レベルの地名が付属したものであり、請求項４における県別テンプレートデータは、市町村名の後に市町村よりも下位レベルの地名が付属した冗長なテンプレートデータを用いてもよい。
【００１５】
ワイルドカードを使うと上述のような利点があるが、このワイルドカードはどのようなものにも緩やかにマッチングしてしまうので、誤認識の可能性を増やす原因ともなる。そこで、例えば県名テンプレートデータとして、県名だけでなく市町村名まで付加した冗長な状態でテンプレートデータを準備する。認識時には市町村名までマッチングするが、結果としてはいずれの県名とマッチングしたかを判定する。より長い音声でマッチングをし、またワイルドカードモデルのように緩やかにどのようなものにもマッチングするものではないため、認識率の向上が期待できる。但し、準備するテンプレートデータ量は相対的には増加する。
【００１６】
ところで、このような音声認識装置と、その音声認識装置にて認識された結果に基づいて所定の処理を実行する処理装置とを備え、処理装置が処理をする上で指定される必要のある所定のコマンドを利用者が音声にて入力できるようにした処理システムを構築することができる。この際、請求項７に示すように、コマンドを認識するためのテンプレートデータであるコマンド用テンプレートデータを第１の記憶手段から第２の記憶手段に読み込んでおくか、あるいは第２の記憶手段同様に高速アクセス性が相対的に高い読み取り専用の第３の記憶手段（例えばＲＯＭ）に予め記憶しておく。そして、次の（１）〜（４）の手順で認識を行う。
【００１７】
（１）県名テンプレートデータ又は請求項４における県別テンプレートデータを用いた認識を行って予備判定をする。
（２）コマンド用テンプレートデータを用いた認識を行うと共に、この認識と並行して予備判定の結果に対応する県別テンプレートデータ又は請求項４における市別テンプレートデータを第２の記憶手段に読み込む。
【００１８】
（３）その読み込んだ県別テンプレートデータ又は請求項４における市別テンプレートデータを用いて認識を行う。
（４）上記（２）の認識結果と（３）の認識結果の内でより確からしさが上位のものを最終的な認識結果とする。
例えば処理装置がナビゲーション装置であれば、目的地などの設定のために階層的な構成を持つ地名（住所）を音声入力することがあり、また、当然ながらナビゲーション装置の各種機能を使うためのコマンドを指示することがある。そして、このナビゲーション用のシステムを想定した場合には、上述の認識処理を実行することで、地名（住所）の入力だけでなくコマンドが入力された場合にも即座に対応でき、コマンド用テンプレートデータを用いた認識処理を別途行わなくてもよい。つまりレスポンスが向上し、利用者にとっての使い勝手が向上することとなる。
【００１９】
なお、請求項７に示した処理システムは、ナビゲーション用のシステム以外にも当然適用できるが、特にナビゲーション用のシステムに限定して考えた場合には、次のような工夫もできる。つまり、請求項８に示すように、現在地を検出する機能を持つことを前提として、次の（１）〜（４）の手順で認識を行う。
【００２０】
（１）県名テンプレートデータ又は請求項４における県別テンプレートデータを用いた認識を行って予備判定を行う。
（２）現在地検出手段にて検出された現在地に対応する県別テンプレートデータ又は請求項４における市別テンプレートデータを第２の記憶手段に読み込み、その県別テンプレートデータ又は請求項４における市別テンプレートデータを用いた認識を行うと共に、この認識と並行して予備判定の結果に対応する県別テンプレートデータ又は請求項４における市別テンプレートデータを第２の記憶手段に読み込む。
【００２１】
（３）その読み込んだ県別テンプレートデータ又は請求項４における市別テンプレートデータを用いて認識を行う。
（４）上記（２）の認識結果と（３）の認識結果の内でより確からしさが上位のものを最終的な認識結果とする。
この手法によって解決したい状況は次の通りである。つまり、ナビゲーションシステムを搭載した車両が例えば愛知県内を走行しており、同じ愛知県内である「愛知県刈谷市昭和町」を目的地として設定する場合には、「愛知県刈谷市昭和町」と音声入力するのではなく、「愛知県」を省略して「刈谷市昭和町」と音声入力する方が自然である。本手法であれば、２回目の認識において現在地に対応する下位階層辞書を用いた認識を行うため、都道府県名を省略した音声入力であっても対応できる。
【００２２】
一方、同様に現在地を検出する機能を持つことを前提としながら、相対的に認識速度の向上を図りたい場合には請求項９に示すようにしてもよい。この場合には、認識処理に先立って現在地検出手段にて検出された現在地に対応する県別テンプレートデータ又は請求項４における市別テンプレートデータを第２の記憶手段に予め読み込んでおく。そして、次の（１）、（２）の手順で認識を行う。
【００２３】
（１）県名テンプレートデータ又は請求項４における県別テンプレートデータ及び予め読み込んでおいた県別テンプレートデータ又は請求項４における市別テンプレートデータを用いた認識を行う。そして、その認識結果が、予め読み込んでおいた県別テンプレートデータ又は請求項４における市別テンプレートデータを用いて得られたものである場合には、それを最終的な認識結果として認識処理を終了する。
【００２４】
（２）一方、上記（１）の認識結果が、県名テンプレートデータ又は請求項４における県別テンプレートデータを用いたものである場合には、その認識結果に対応する県別テンプレートデータ又は請求項４における市別テンプレートデータを第２の記憶手段に読み込み、その県別テンプレートデータ又は請求項４における市別テンプレートデータを用いて得た認識結果を最終的な認識結果とする。
【００２５】
このようにすれば、使用頻度が高いと考えられる現在地を含む所定地域内の地名を認識する際には、それを認識するための県別テンプレートデータ又は請求項４における市別テンプレートデータが予め読み込んであるため、相対的に認識処理が素早くできることとなる。
【００２６】
【発明の実施の形態】
以下、本発明が適用された実施例について図面を用いて説明する。なお、本発明の実施の形態は、下記の実施例に何ら限定されることなく、本発明の技術的範囲に属する限り、種々の形態を採り得ることは言うまでもない。
【００２７】
図１は音声認識機能を持たせたナビゲーションシステム２の概略構成を示すブロック図である。本ナビゲーションシステム２は、車両に搭載されて用いられるいわゆるカーナビゲーションシステムであり、位置検出器４、データ入力器６、操作スイッチ群８、これらに接続された制御回路１０、制御回路１０に接続された外部メモリ１２、表示装置１４及びリモコンセンサ１５及び音声認識装置３０を備えている。なお制御回路１０は通常のコンピュータとして構成されており、内部には、周知のＣＰＵ、ＲＯＭ、ＲＡＭ、Ｉ／Ｏ及びこれらの構成を接続するバスラインが備えられている。
【００２８】
位置検出器４は、周知の地磁気センサ１６、ジャイロスコープ１８、距離センサ２０及び衛星からの電波に基づいて車両の位置を検出するためのＧＰＳ受信機２２を有している。これらのセンサ等１６，１８，２０，２２は各々が性質の異なる誤差を持っているため、複数のセンサにより、各々補間しながら使用するように構成されている。なお、精度によっては上述した内の一部で構成してもよく、更に、ステアリングの回転センサ、各転動輪の車輪センサ等を用いてもよい。
【００２９】
データ入力器６は、位置検出の精度向上のためのいわゆるマップマッチング用データ、地図データ及び目印データを含むナビゲーション用の各種データに加えて、音声認識装置３０において認識処理を行う際に用いる辞書データを入力するための装置である。記憶媒体としては、そのデータ量からＤＶＤを用いるのが一般的であると考えられるが、ＣＤ−ＲＯＭ等の他の媒体を用いても良い。データ記憶媒体としてＤＶＤを用いた場合には、このデータ入力器６はＤＶＤプレーヤとなる。
【００３０】
表示装置１４はカラー表示装置であり、表示装置１４の画面には、位置検出器４から入力された車両現在位置マークと、地図データ入力器６より入力された地図データと、更に地図上に表示する誘導経路や設定地点の目印等の付加データとを重ねて表示することができる。また、複数の選択肢を表示するメニュー画面やその中の選択肢を選んだ場合に、さらに複数の選択肢を表示するコマンド入力画面なども表示することができる。
【００３１】
また、本ナビゲーションシステム２は、リモートコントロール端末（以下、リモコンと称する。）１５ａを介してリモコンセンサ１５から、あるいは操作スイッチ群８により目的地の位置を入力すると、現在位置からその目的地までの最適な経路を自動的に選択して誘導経路を形成し表示する、いわゆる経路案内機能も備えている。このような自動的に最適な経路を設定する手法は、ダイクストラ法等の手法が知られている。操作スイッチ群８は、例えば、表示装置１４と一体になったタッチスイッチもしくはメカニカルなスイッチ等が用いられ、各種コマンドの入力に利用される。
【００３２】
そして、音声認識装置３０は、上記操作スイッチ群８あるいはリモコン１５ａが手動操作により各種コマンド入力のために用いられるのに対して、利用者が音声で入力することによっても同様に各種コマンドを入力できるようにするための装置である。
【００３３】
この音声認識装置３０は、音声認識部３１と、対話制御部３２と、音声合成部３３と、音声抽出部３４と、マイク３５と、スイッチ３６と、スピーカ３７と、制御部３８とを備えている。
音声認識部３１は、音声抽出部３４から入力された音声データを、対話制御部３２からの指示により入力音声の認識処理を行い、その認識結果を対話制御部３２に返す。すなわち、音声抽出部３４から取得した音声データに対し、記憶している辞書データを用いて照合を行ない、複数の比較対象パターン候補と比較して一致度の高い上位比較対象パターンを対話制御部３２へ出力する。入力音声中の単語系列の認識は、音声抽出部３４から入力された音声データを順次音響分析して音響的特徴量（例えばケプストラム）を抽出し、この音響分析によって得られた音響的特徴量時系列データを得る。そして、周知のＨＭＭ（隠れマルコフモデル）、ＤＰマッチング法あるいはニューラルネットなどによって、この時系列データをいくつかの区間に分け、各区間が辞書データとして格納されたどの単語に対応しているかを求める。
【００３４】
対話制御部３２は、音声認識部３１における認識結果や制御部３８からの指示に基づき、音声合成部３３への応答音声の出力指示、あるいは、ナビゲーションシステム自体の処理を実行する制御回路１０に対して例えばナビゲート処理のために必要な目的地やコマンドを通知して目的地の設定やコマンドを実行させるよう指示する処理を行う。このような処理の結果として、この音声認識装置３０を利用すれば、上記操作スイッチ群８あるいはリモコン１５ａを手動しなくても、音声入力によりナビゲーションシステムに対する目的地の指示などが可能となるのである。
【００３５】
なお、音声合成部３３は、波形データベース内に格納されている音声波形を用い、対話制御部３２からの応答音声の出力指示に基づく音声を合成する。この合成音声がスピーカ３７から出力されることとなる。
音声抽出部３４は、マイク３５にて取り込んだ周囲の音声をデジタルデータに変換して音声認識部３１に出力するものである。詳しくは、入力した音声の特徴量を分析するため、例えば数１０ｍｓ程度の区間のフレーム信号を一定間隔で切り出し、その入力信号が、音声の含まれている音声区間であるのか音声の含まれていない雑音区間であるのか判定する。マイク３５から入力される信号は、認識対象の音声だけでなく雑音も混在したものであるため、音声区間と雑音区間の判定を行なう。この判定方法としては従来より多くの手法が提案されており、例えば入力信号の短時間パワーを一定時間毎に抽出していき、所定の閾値以上の短時間パワーが一定以上継続したか否かによって音声区間であるか雑音区間であるかを判定する手法がよく採用されている。そして、音声区間であると判定された場合には、その入力信号が音声認識部３１に出力されることとなる。
【００３６】
また、本実施形態においては、利用者がスイッチ３６を押しながらマイク３５を介して音声を入力するという利用方法である。具体的には、制御部３８がスイッチ３６が押されたタイミングや戻されたタイミング及び押された状態が継続した時間を監視しており、スイッチ３６が押された場合には音声抽出部３４及び音声認識部３１に対して処理の実行を指示する。一方、スイッチ３６が押されていない場合にはその処理を実行させないようにしている。したがって、スイッチ３６が押されている間にマイク３５を介して入力された音声データが音声認識部３１へ出力されることとなる。
【００３７】
このような構成を有することによって、本実施例の車載ナビゲーションシステム２では、ユーザがコマンドを入力することによって、経路設定や経路案内あるいは施設検索や施設表示など各種の処理を実行することができる。
ここで、音声認識部３１と対話制御部３２についてさらに説明する。図２に示すように、音声認識部３１は照合部３１１と辞書部３１２と抽出結果記憶部３１３とを有しており、対話制御部３２は処理部３２１と入力部３２２と辞書制御部３２３とを有している。
【００３８】
音声認識部３１においては、抽出結果記憶部３１３が音声抽出部３４から出力された抽出結果を記憶しておき、照合部３１１がその記憶された抽出結果に対し、辞書部３１２内に記憶されている辞書データを用いて照合を行う。この辞書部３１２内の辞書データは固定ではなく、適宜設定・更新されるのであるが、この点は後述する。そして、照合部３１１にて辞書データと比較されて一致度が高いとされた上位の認識結果は、対話制御部３２の処理部３２１へ出力され、対話制御部３２の処理部３２１が、制御回路１０へその認識結果を出力する。
【００３９】
一方、処理部３２１は、制御回路１０に対して辞書データをＤＶＤから読み出して音声認識装置３０側へ出力する依頼（辞書読込依頼）を出すことができ、その依頼の結果として制御回路１０から送られた辞書データは、対話制御部３２の入力部３２２を介して入力される。そして、辞書制御部３２３がその辞書データを音声認識部３１の辞書部３１２に対して設定（書込）・更新する。
【００４０】
ここで辞書データについて説明する。辞書データとしては、語彙そのもののデータだけでなく、その語彙が複数の語を階層的につなぎ合わせたものである場合には、次のように分割されて準備されている。ここでは、そのように分割されて準備されている辞書データの例として地名辞書を説明する。
【００４１】
まず、上位階層辞書は、都道府県名の辞書データである。つまり、４７の都道府県（愛知県、青森県……、和歌山県）の名称に対応したキーワードを辞書データとして持つものである。そして、下位階層辞書は、都道府県別に分割して準備された県別辞書である。つまり、愛知県の県別辞書、青森県の県別辞書……、和歌山県の県別辞書というように４７の県別辞書が準備されている。この下位階層辞書は、上位階層のキーワードに下位階層のキーワードを付加した辞書データであり、例えば愛知県の県別辞書であれば、愛知県○○市××町、……、愛知県刈谷市昭和町、愛知県△△市▽▽町、……というように、必ず愛知県から始まるようにしている。他の都道府県の県別辞書も同様である。
【００４２】
なお、必要に応じて、さらに下位階層の辞書を準備してもよい。つまり、全国に存在する市町村別に市別辞書を準備してもよい。例えば愛知県刈谷市の市別辞書、愛知県大府市の市別辞書……といった具合である。日本の場合には約４０００の市町村があるといわれているので、約４０００の市別辞書が準備されることとなる。この考え方を進めれば、当然ながらさらに下位階層の辞書を準備することも可能である。例えば名古屋市には１６の区があるが、その区別に１６分割した辞書を準備してもよい。もちろん、区に限らず市町村の下位にくる大字レベルに分割した辞書を準備することも可能である。
【００４３】
そして、このように分割された辞書も含め、基本的に辞書はすべて、データ入力器６にセットされるＤＶＤなどの記録媒体に記録されている。なお、「基本的には」としたのは、音声認識部３１の辞書部３１２に常駐させておく辞書データがあってもよいからである。但し、上述した下位階層の辞書については、原則通りＤＶＤなどのデータ記憶媒体に記憶させておき、必要なときに辞書部３１２に読み込むようにする。
【００４４】
次に、本実施例のナビゲーションシステム２の動作について説明する。なお、音声認識装置３０に関係する部分が特徴であるので、ナビゲーションシステムとしての一般的な動作を簡単に説明した後、音声認識装置３０に関係する部分の動作について詳しく説明することとする。
【００４５】
ナビゲーションシステム２の電源オン後に、表示装置１４上に表示されるメニューから、ドライバーがリモコン１５ａ（操作スイッチ群８でも同様に操作できる。以後の説明においても同じ）により、案内経路を表示装置１４に表示させるために経路情報表示処理を選択した場合、あるいは、音声認識装置３０を介して希望するメニューをマイク３５を介して音声入力することで、対話制御部３２から制御回路１０へ、リモコン１５ａを介して選択されるのと同様の指示がなされた場合、次のような処理を実施する。
【００４６】
すなわち、ドライバーが表示装置１４上の地図に基づいて、音声あるいはリモコンなどの操作によって目的地を入力すると、ＧＰＳ受信機２２から得られる衛星のデータに基づき車両の現在地が求められ、目的地と現在地との間に、ダイクストラ法によりコスト計算して、現在地から目的地までの最も短距離の経路を誘導経路として求める処理が行われる。そして、表示装置１４上の道路地図に重ねて誘導経路を表示して、ドライバーに適切なルートを案内する。このような誘導経路を求める計算処理や案内処理は一般的に良く知られた処理であるので説明は省略する。
【００４７】
次に、音声認識装置３０における動作について説明する。ここでは、いくつかの動作例を挙げる。
[動作例１]
図４は、動作例１の場合の音声認識部３１及び対話制御部３２における処理を示すフローチャートである。
【００４８】
最初のステップＳ１０においては、最上位階層の辞書を設定する。具体的には、上述した県名辞書であり、データ入力器６によってＤＶＤから読み出し、それを制御回路１０、対話制御部３２を介して音声認識部３１の辞書部３１２に設定する。なお、上述したように、この県名辞書については辞書部３１２に常駐させておくことも考えられる。
【００４９】
このように音声認識の準備ができたら、続いて音声認識処理を行う（Ｓ２０）。上述したように、スイッチ３６が押されている間にマイク３５を介して入力された音声データが音声抽出部３４にて抽出されて音声認識部３１へ出力されるため、この抽出結果に対して認識処理を実行することとなる。
【００５０】
この音声認識処理がなされた後、その認識に用いたのが最下位階層の辞書であるかどうかを判断する（Ｓ３０）。Ｓ１０にて設定した県名辞書を用いた認識であれば最下位階層の辞書ではないので（Ｓ３０：ＮＯ）、Ｓ２０での認識処理の結果から選択された下位階層の辞書を設定する（Ｓ４０）。例えば、県名辞書を用いた認識で「愛知県」が選択された場合には、愛知県の県別辞書を設定する。この設定に際しては、図３に例示するように、対話制御部３２が制御回路１０へ県別辞書の読み込みを依頼する。制御回路１０はその依頼に応じ、データ入力器６によってＤＶＤから該当する県別辞書を読み出し、対話制御部３２へ送る。そして、上述したように、対話制御部３２内の辞書制御部３２３（図２参照）によってその県別辞書が音声認識部３１の辞書部３１２に設定される。
【００５１】
その後Ｓ２０へ戻り、抽出結果記憶部３１３に記憶されている抽出結果を再度用いて音声認識処理を行う。県別辞書が最下位階層の辞書であれば（Ｓ３０：ＹＥＳ）、その辞書を用いて得た認識結果を制御回路１０へ出力する（Ｓ５０）。
なお、上述したように、県別辞書のさらに下位階層の辞書として市別辞書や区別辞書、大字辞書などが準備されている場合には、Ｓ２０〜Ｓ４０のループ処理を繰り返して、最下位階層の辞書が設定された状態で認識された結果を出力すればよい。
【００５２】
このようにすれば、音声入力された地名を認識する場合に、地名に関する全辞書を辞書部３１２に読み込んでおかなくてもよく、県名辞書及び選択された都道府県に対応する県別辞書を読み込むだけでよい。このような階層的に構成される語群に対して辞書の「分割」を行い、予備判定にてどの下位階層辞書（県別辞書）を用いればよいかを判定して、「真に必要な」辞書に絞ってから辞書部３１２に読み込むようにした。したがって、辞書に準備する語彙を増やすことで認識可能な対象を増加させるという利点を追求した場合であっても、その認識可能語彙をすべて辞書部３１２に格納しておく必要がない。そのため、辞書部３１２は相対的に容量が小さくても、一括入力に対応した適切な音声認識が実現できる。
【００５３】
[動作例２]
図５は、動作例２の場合の音声認識部３１及び対話制御部３２における処理を示すフローチャートである。ここでは、実際の認識処理を開始する前に、県名辞書及びコマンド辞書が辞書部３１２に記憶されていることを前提とする。
【００５４】
最初のステップＳ１１０においては、県名辞書を設定する。予め記憶されているため、ここでは音声認識に用いる辞書として設定する。つまり、辞書部３１２にはコマンド辞書も記憶されているが、それは設定しない。そして、続くＳ１２０ではその県名辞書を用いて第１回目の音声認識処理を行い、その第１回目の認識結果から選択された県別辞書の読込を依頼する（Ｓ１３０）。
【００５５】
この辞書の読込依頼は、上記動作例１でも説明したように対話制御部３２が制御回路１０に対して行う。この依頼を受けた制御回路１０はその依頼に応じ、データ入力器６によってＤＶＤから該当する県別辞書を読み出し、対話制御部３２へ送る。そして、対話制御部３２はその県別辞書を読み込み（Ｓ１９０）、その県別辞書を音声認識部３１の辞書部３１２に設定する（Ｓ１６０）。
【００５６】
しかし、制御回路１０へ依頼をしてから県別辞書が送られてくるまでの時間がある程度必要であるので、ここでは、その間を利用して２回目の認識処理を行う。つまり、今度はコマンド辞書を音声認識に用いる辞書として設定し（Ｓ１４０）、そのコマンド辞書を用いて第２回目の音声認識処理を行うのである（Ｓ１５０）。この第２回目の認識処理が終了したら、上述したＤＶＤから読み込んだ県別辞書を音声認識に用いる辞書として設定し（Ｓ１６０）、その県別辞書を用いて第３回目の音声認識処理を行う（Ｓ１７０）。
【００５７】
このようにして得た第２回目の認識結果と第３回目の認識結果の確からしさを比較し、上位の候補（認識結果）を出力する（Ｓ１８０）。
ナビゲーションシステムを利用する際に利用者が音声入力する語彙としては、目的地などの設定のために地名（住所）があるが、当然ながらナビゲーションの各種機能を使うためのコマンドを指示することがある。したがって、本動作例のようにすれば、第２回目の音声認識処理をコマンド辞書を用いて行っているので、地名（住所）の入力だけでなくコマンドが入力された場合にも即座に対応できる。そして、この認識処理は、県名辞書を用いた予備判定にて選択された県別辞書の読み込みを行う間に実行するため、時間のロスが少なくて済む。つまり全体としてレスポンスが向上し、利用者にとっての使い勝手が向上する。
【００５８】
[動作例３]
図６は、動作例３の場合の音声認識部３１及び対話制御部３２における処理を示すフローチャートである。ここでは、実際の認識処理を開始する前に、県名辞書及び現在地の県別辞書が辞書部３１２に記憶されていることを前提とする。つまり、位置検出器４によって現在地を検出できるため、例えば本ナビゲーションシステムを搭載した車両が愛知県内を走行している場合には、愛知県の県別辞書を予めＤＶＤから読み込んで辞書部３１２に記憶させておく。
【００５９】
最初のステップＳ２１０においては、県名辞書を設定する。予め記憶されているため、ここでは音声認識に用いる辞書として設定する。つまり、辞書部３１２には現在地に対応する県別辞書も記憶されているが、それは設定しない。そして、続くＳ２２０ではその県名辞書を用いて第１回目の音声認識処理を行い、その第１回目の認識結果から選択された県別辞書の読込を依頼する（Ｓ２３０）。
【００６０】
このＳ２３０での辞書の読込依頼の結果、ＤＶＤから該当する県別辞書を読み込み（Ｓ２９０）、その県別辞書を音声認識部３１の辞書部３１２に設定する（Ｓ２６０）点については、上述の動作例２のＳ１３０，Ｓ１６０，Ｓ１９０の処理内容と同じである。そして、動作例２ではこの間を利用してコマンド辞書を用いた認識処理を行ったが、本動作例３では、予め読み込んであった現在地に対応する県別辞書を音声認識に用いる辞書として設定し（Ｓ２４０）、その県別辞書を用いて第２回目の音声認識処理を行う（Ｓ２５０）。この第２回目の認識処理が終了したら、上述したＤＶＤから読み込んだ県別辞書を音声認識に用いる辞書として設定し（Ｓ２６０）、その県別辞書を用いて第３回目の音声認識処理を行う（Ｓ２７０）。
【００６１】
このようにして得た第２回目の認識結果と第３回目の認識結果の確からしさを比較し、上位の候補（認識結果）を出力する（Ｓ２８０）。
ナビゲーションシステムを搭載した車両が例えば愛知県内を走行しており、同じ愛知県内である「愛知県刈谷市昭和町」を目的地として設定する場合には、「愛知県刈谷市昭和町」と音声入力するのではなく、「愛知県」を省略して「刈谷市昭和町」と音声入力する方が自然である。本手法であれば、２回目の認識において現在地に対応する下位階層辞書を用いた認識を行うため、都道府県名を省略した音声入力であっても対応できる。
【００６２】
[動作例４]
図７は、動作例４の場合の音声認識部３１及び対話制御部３２における処理を示すフローチャートである。動作例３の場合と同様に、実際の認識処理を開始する前に、県名辞書及び現在地の県別辞書が辞書部３１２に記憶されていることを前提とする。
【００６３】
最初のステップＳ３１０においては、県名辞書及び現在地に対応する県別辞書を音声認識に用いる辞書として設定する。そして、続くＳ３２０ではその県名辞書及び現在地対応の県別辞書を用いて第１回目の音声認識処理を行う。その第１回目の認識結果が、現在地対応の県別辞書を用いて得られたものである場合には（Ｓ３３０：ＹＥＳ）、この第１回目の認識結果を出力する（Ｓ３４０）。
【００６４】
一方、現在地対応の県別辞書ではなく、県名辞書を用いて第１回目の認識結果が得られたものである場合には（Ｓ３３０：ＮＯ）、その認識結果から選択された県別辞書の読込を依頼し（Ｓ３５０）、ＤＶＤから該当する県別辞書を読み込む（Ｓ３５５）。この場合は、上述した動作例２，３とは異なり、辞書の読込依頼から実際に読み込むまでに別に音声認識処理は実行しない。
【００６５】
そして、Ｓ３５５で読み込んだ県別辞書を音声認識に用いる辞書として設定し（Ｓ３６０）、その県別辞書を用いて第２回目の音声認識処理を行い（Ｓ３７０）、その認識結果を出力する（Ｓ３８０）。
このようにすれば、使用頻度が高いと考えられる現在地を含む県内の地名を認識する際には、それを認識するための県別辞書を用いて第１回目の音声認識処理で認識できるため、相対的に認識処理が素早くできることとなる。
【００６６】
音声認識装置３０における動作について４例挙げ、それぞれの動作例による効果などを説明したが、上位階層辞書の構成を工夫することでも以下に示すような効果を得ることができる。
［辞書構成例１］
ここでは上位階層辞書として県名辞書を例にとって考える。県名辞書は、上述したように都道府県（愛知県、青森県……、和歌山県）の名称に対応したキーワードを辞書データとして持つものであるが、これを愛知県＊、青森県＊……、和歌山県＊というように記述し、＊の部分がどのような音声入力に対してもマッチング可能なワイルドカードモデルとする。例えば、「愛知県刈谷市」という音声入力の内「刈谷市」が＊にマッチングする。単に都道府県名のキーワードしか辞書データとして持たない場合には、実際の認識対象（都道府県以下の市町村や大字なども含む語群）の内の一部分しか県名がないため、全体としてのマッチング度合いが低下する。それに対して、ワイルドカードモデルの場合には、マッチング自体は認識対象全体として行えるのでそのような問題が生じない。
【００６７】
ここで、ワイルドカードモデルについて少し補足説明する。
まず、音声認識で一般的に用いられるＨＭＭ（隠れマルコフモデル）手法について簡単に説明する。本手法は、音声を状態と遷移で表現されたマルコフモデルから生成されるものであると仮定して、生成モデルを事前に作成しておき、それと音声とを突き合わせ（マッチング）、最もよくマッチングするものを認識結果とするものである。このモデルの例としては図８に示す表現が一般的である。各状態には出力確率分布が対応しており、音声を分析した結果の特徴量（図８では簡単のために２次元で表現した）の時系列を図８（ａ）に対応する順（ａ１→ａ２→ａ３）に、図８（ｂ）の確率分布から確からしさを突き合わせていく。最終的には音声の終端までの確からしさの積（尤度と呼ばれるスコア）が最も良いものを認識結果とする。この手法では、認識対象語彙のＨＭＭを準備しておいてそれを比較することが基本となるが、大語彙の認識では事実上不可能であるので、音素や音節（これは単語の部分という意味でサブワードと呼ばれる）といった単位を設定し、それらのＨＭＭを作成しておいて、それを接続することで単語のモデルを作成する。
【００６８】
次に、ワイルドカードモデルの一例であるガーベージモデルについて説明する。図９（Ａ）に、/ａ/，/ｉ/，/ｕ/のＨＭＭの各状態に対応している確率分布の例を示した。ここでは特徴空間を２次元としている。ガーベージモデルと呼ばれる音声モデルは、特定の音節のある特徴を表現するのではなく、多くの音声をカバーできるように、大きな分散を有する分布を持つものである。こうすると、ガーベージモデルはさまざまな音声パターンに対して「広く浅く」マッチングするため、広範囲の音声に対してある程度のスコア（＝確率）を出力するが、正しい分布に比べると小さい値を出力する傾向がある。例えば図９（Ａ）中の「×」で示した音声パターンに対して、/ａ/，/ｉ/のスコアは非常に小さくなり、/ｕ/のスコアは大きくなる。一方、ガーベージモデルの場合のスコアは、/ａ/，/ｉ/のスコアと比べると大きいが、/ｕ/のスコアと比べると小さい。
【００６９】
したがって、「あいちけんＧ」（Ｇはガーベージモデル）及び「あいちけんかりやし」のテンプレートと「あいちけんかりやし」の音声をマッチングすれば、そのスコアは「あいちけんＧ」＜「あいちけんかりやし」となる可能性が高い（但し保証されているわけではない）。しかし、「あいちけんＧ（ガーベージモデル）」及び「あいちけんかすがいし」のテンプレートと「あいちけんかりやし」の音声をマッチングすれば、そのスコアは「あいちけんＧ」＜「あいちけんかすがいし」となるとは限らず、かなりの確率で逆転する。
【００７０】
続いて、ワイルドカードモデルの他の例である音節連接モデルについて説明する。
音節のＨＭＭは単語を構成する単位となるが、これを任意に接続可能としておくと、あらゆる語の発声が認識できることになる。つまり、図９（Ｂ）に示すような音節連接モデルはそのようなものである。なお、ここでは日本語の認識を前提としている。
【００７１】
これを「あいちけんＳＣＭ」（ＳＣＭは音節連接モデル）のようにワイルドカードとしておくと「愛知県刈谷市昭和町」のような発声に対してもマッチング可能である。この場合、「あいちけんＳＣＭ」のモデルは「あいちけんかりやししょうわちょう」というモデルの表現を内包しているので、スコアとしては後者以上の値を得ることができる。
【００７２】
［辞書構成例２］
上述したワイルドカードモデルを使うと上述のような利点があるが、このワイルドカードはどのようなものにも緩やかにマッチングしてしまうので、誤認識の可能性を増やす原因ともなる。そこで、例えば県名辞書を構成する場合に、県名だけでなく市町村名まで付加した冗長な状態で辞書を準備する。そして、認識時には市町村名までマッチングするが、結果としてはいずれの県名とマッチングしたかを判定する。より長い音声でマッチングをし、またワイルドカードモデルのように緩やかにどのようなものにもマッチングするものではないため、認識率の向上が期待できる。
【００７３】
以上、本発明はこのような実施例に何等限定されるものではなく、本発明の主旨を逸脱しない範囲において種々なる形態で実施し得る。
例えば、上述した実施形態では、音声認識装置３０を車両に搭載したナビゲーションシステム２に適用した例として説明したが、車載機器として用いられる場合だけではなく、例えば携帯型ナビゲーション装置として実現してもよい。
【００７４】
また、ナビゲーションではない他の処理を実行する装置に対して音声入力で各種データの設定や指示などを与える場合にでも適用はできる。
【図面の簡単な説明】
【図１】実施例としてのナビゲーションシステムの概略構成を示すブロック図である。
【図２】音声認識装置における音声認識部と対話制御部の構成を示すブロック図である。
【図３】辞書の読込依頼及びそれに対応した辞書読込の説明図である。
【図４】音声認識装置における動作例１に係る処理を示すフローチャートである。
【図５】音声認識装置における動作例２に係る処理を示すフローチャートである。
【図６】音声認識装置における動作例３に係る処理を示すフローチャートである。
【図７】音声認識装置における動作例４に係る処理を示すフローチャートである。
【図８】ＨＭＭ（隠れマルコフモデル）の説明図である。
【図９】ワイルドカードモデルの例としてのガーベージモデル及び音節連接モデルの説明図である。
【符号の説明】
２…ナビゲーションシステム
４…位置検出器
６…データ入力器
８…操作スイッチ群
１０…制御回路
１２…外部メモリ
１４…表示装置
１５…リモコンセンサ
１５ａ…リモコン
１６…地磁気センサ
１８…ジャイロスコープ
２０…距離センサ
２２…ＧＰＳ受信機
３０…音声認識装置
３１…音声認識部
３２…対話制御部
３３…音声合成部
３４…音声抽出部
３５…マイク
３６…スイッチ
３７…スピーカ
３８…制御部
３１１…照合部
３１２…辞書部
３１３…抽出結果記憶部
３２１…処理部
３２２…入力部
３２３…辞書制御部

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a voice recognition device that is effective when, for example, a destination setting or the like in a navigation system is input by voice, and a processing system including the voice recognition device.
[0002]
[Prior art and problems to be solved by the invention]
Conventionally, speech recognition apparatuses that compare input speech with a plurality of comparison target pattern candidates stored in advance, and take the candidate with a high degree of coincidence as the recognition result, have already been put into practical use. They are used, for example, so that a user can enter by voice the place name of a destination to be set in a navigation system. In particular, when the driver uses a vehicle navigation system, voice input is effective because it involves no button operation or gazing at the screen and is therefore highly safe even while the vehicle is moving.
[0003]
To serve this purpose, it is desirable that a sufficiently detailed location can be specified easily. Specifically, it is preferable to allow input not merely at the prefecture or city level but down to the level of town names within a city, or of oaza and koaza (the larger and smaller named sections of a town or village). Furthermore, if a user who wants to set, for example, “Showa-cho, Kariya City, Aichi Prefecture” had to pause between levels and say “Aichi Prefecture”, “Kariya City”, and “Showa-cho” separately, it would be tedious, so it is preferable to allow the address to be spoken in one continuous utterance (batch input).
[0004]
However, to allow sufficiently detailed locations to be specified on the premise that batch input is possible, the number of recognizable words must be increased, which increases the amount of dictionary data. For address recognition, for example, dictionaries today are generally prepared only down to about the oaza level. Preparing dictionaries down to the koaza level would allow more detailed locations to be specified, but the growth in dictionary data would make the memory that holds it, such as a RAM, very large and would also raise cost.
[0005]
An object of the present invention is to propose a technique that, in such speech recognition technology, suppresses as far as possible the disadvantages that can arise when pursuing the advantage of increasing the number of recognizable targets by adding words to the dictionary, and thereby makes that advantage easier to enjoy in practice.
[0006]
[Means for Solving the Problems and Effects of the Invention]
First, the prefecture name template data, the prefecture-specific template data, the first storage means, and the second storage means recited in claim 1 will be described.
The prefecture name template data corresponds to the prefecture names in the case where the final recognition target is an address formed by hierarchically connecting a plurality of place names, and is data to be compared with matching data obtained from the input speech. The prefecture-specific template data is prepared for each prefecture and stores a word group that includes, in addition to the prefecture name, municipality names and possibly place names at levels below the municipality. In the case of Japan, for example, the prefecture name template data holds the names of the 47 prefectures as template data, and prefecture-specific template data divided into 47 sets is prepared.
[0007]
The prefecture name template data or the prefecture-specific template data may be dictionary data (an upper layer dictionary or a lower layer dictionary) as recited in claim 2, or may be voice data as recited in claim 3. For example, a person may pronounce the word or word group in advance, and the utterance may be captured and stored as voice data. In other words, recognition need not rely on a dictionary; any data that allows recognition by matching against the voice data uttered by the user will do.
[0008]
The first storage means offers relatively slow access during speech recognition processing, and the second storage means offers relatively fast access. Specific examples of the first storage means include a DVD and a CD-ROM, and examples of the second storage means include a RAM. That is, when speech recognition is actually executed, the dictionary is normally read into the second storage means, such as a RAM, to shorten processing time and improve response.
[0009]
According to the speech recognition apparatus of claim 1, at least the prefecture-specific template data is stored in the first storage means. When recognizing speech input through voice input means that accepts continuous input, the apparatus first compares matching data obtained from the input speech with the prefecture name template data to determine which prefecture name is included. It then reads the prefecture-specific template data corresponding to the prefecture name found in this preliminary determination into the second storage means and obtains the final recognition result using that data. For example, if the preliminary determination finds that the prefecture name “Aichi Prefecture” is included, only the prefecture-specific template data prepared for Aichi Prefecture is read into the second storage means for recognition.
[0010]
In this way, not all template data needs to be read into the second storage means. It suffices to read at least one set (in some cases several) of the prefecture-specific template data prepared for each prefecture, and the second storage means only needs enough capacity for that. In other words, the present invention “divides” the template data by preparing prefecture-specific template data, prefecture by prefecture, for addresses formed by hierarchically connecting a plurality of place names. The preliminary determination decides which prefecture-specific template data to use, and only the “truly necessary” template data is then read into the second storage means. Therefore, even when pursuing the advantage of increasing the number of recognizable targets by adding words to the template data, it is not necessary to store the entire recognizable vocabulary in the second storage means. Consequently, appropriate speech recognition supporting batch input can be realized even if the second storage means has a relatively small capacity.
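The division described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the claimed apparatus: the dictionary `SLOW_STORAGE`, the romanized sample entries, and the functions `similarity`, `preliminary_determination`, and `recognize` are all hypothetical, and string similarity from Python's `difflib` stands in for acoustic matching.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the first storage means (slow, e.g. a DVD):
# one template set per prefecture, every entry starting with the prefecture name.
SLOW_STORAGE = {
    "aichiken": ["aichiken kariyashi showacho", "aichiken kasugaishi"],
    "aomoriken": ["aomoriken hirosakishi", "aomoriken aomorishi"],
}


def similarity(a: str, b: str) -> float:
    """Crude substitute for an acoustic matching score (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def preliminary_determination(utterance: str) -> str:
    """Decide which prefecture name the utterance starts with."""
    return max(SLOW_STORAGE, key=lambda name: similarity(utterance[:len(name)], name))


def recognize(utterance: str) -> tuple[str, int]:
    """Load only the selected prefecture's templates, then match against them."""
    prefecture = preliminary_determination(utterance)
    fast_memory = list(SLOW_STORAGE[prefecture])  # the only set copied to "RAM"
    result = max(fast_memory, key=lambda t: similarity(utterance, t))
    return result, len(fast_memory)
```

Calling `recognize("aichiken kariyashi showacho")` touches only the two Aichi entries, which is the point of the division: the fast memory never has to hold every prefecture's templates at once.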
[0011]
The split between upper and lower layers of an address formed by hierarchically connecting a plurality of place names can be applied flexibly, so the arrangement of claim 4 may also be adopted. That is, when the address consists of place names in three or more layers, the apparatus has, in addition to the prefecture name template data and the prefecture-specific template data, city-specific template data prepared for each municipality, which treats the prefecture-specific template data as an upper layer and also distinguishes place names below the municipality level. In this way only the necessary template data needs to be read, and the second storage means can be of even smaller capacity.
[0012]
Recognition is ultimately performed using the prefecture-specific template data, or the city-specific template data of claim 4, and a preliminary determination is made to select that data. The preliminary determination uses the prefecture name template data, or the prefecture-specific template data of claim 4, and the structure of this template data may be devised as follows. That is, as recited in claim 5, the template data is made a wild card model that can also match speech input in which other words or word groups follow the words or word groups constituting the template data.
[0013]
For example, in the case of the prefecture name template data, anything spoken after the prefecture name is allowed to match. If the dictionary data holds only prefecture names, the prefecture name accounts for only part of the actual recognition target (a word group that also includes the municipality, oaza, and so on below the prefecture), so the overall degree of matching drops. With a wild card model, by contrast, matching is performed against the recognition target as a whole, so this problem does not arise. Examples of the wild card model include a garbage model and a syllable connection model described later.
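The drop in matching degree can be illustrated with a small sketch. It assumes, purely for illustration, that string similarity on romanized text approximates the acoustic score and that the wild card absorbs everything after the prefecture name at no cost; the names `name_only_score` and `wildcard_score` are hypothetical and do not model a garbage model or a syllable connection model.

```python
from difflib import SequenceMatcher

PREFECTURE_NAMES = ["aichiken", "aomoriken", "wakayamaken"]  # sample subset


def similarity(a: str, b: str) -> float:
    """Crude substitute for an acoustic matching score (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def name_only_score(utterance: str, name: str) -> float:
    # The whole utterance is compared with the bare prefecture name,
    # so the city and town that follow drag the score down.
    return similarity(utterance, name)


def wildcard_score(utterance: str, name: str) -> float:
    # "name + *": only the leading part is compared with the name; the
    # remainder is treated as absorbed by the wild card (an assumption).
    return similarity(utterance[:len(name)], name)


utterance = "aichiken kariyashi showacho"
best = max(PREFECTURE_NAMES, key=lambda n: wildcard_score(utterance, n))
```

For this utterance the bare-name score for `aichiken` is below 0.5 even though the prefecture is spoken correctly, while the wild card variant scores it as a full match.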
[0014]
Alternatively, instead of such a wild card model, redundant template data may be used as recited in claim 6: the prefecture name template data has municipality names, and possibly place names below the municipality level, appended after the prefecture name, and the prefecture-specific template data of claim 4 has place names below the municipality level appended after the municipality name.
[0015]
A wild card has the advantage described above, but because it matches anything loosely, it can also increase the likelihood of misrecognition. Therefore, the prefecture name template data, for example, is prepared in a redundant form that includes municipality names as well as prefecture names. At recognition time, matching extends through the municipality name, but the result is used only to determine which prefecture name was matched. Because matching is performed over a longer stretch of speech and does not loosely accept anything as a wild card model does, an improved recognition rate can be expected. The amount of template data to be prepared, however, is relatively larger.
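A redundant prefecture name template set might look like the following sketch. The table `REDUNDANT_TEMPLATES`, its sample entries, and the function names are hypothetical, and string similarity again stands in for acoustic matching; the only point shown is that matching runs through the municipality name while the returned value is just the prefecture.

```python
from difflib import SequenceMatcher

# Hypothetical redundant templates: "prefecture + municipality" mapped back
# to the prefecture they belong to.
REDUNDANT_TEMPLATES = {
    "aichiken kariyashi": "aichiken",
    "aichiken kasugaishi": "aichiken",
    "aomoriken hirosakishi": "aomoriken",
}


def similarity(a: str, b: str) -> float:
    """Crude substitute for an acoustic matching score (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def preliminary_determination(utterance: str) -> str:
    """Match through the municipality name, but report only the prefecture."""
    def score(template: str) -> float:
        # Compare the template with the leading part of the utterance only.
        return similarity(utterance[:len(template)], template)

    best_template = max(REDUNDANT_TEMPLATES, key=score)
    return REDUNDANT_TEMPLATES[best_template]
```

The table grows with the number of municipalities, which corresponds to the larger template data volume noted above.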
[0016]
A processing system can be built that includes such a speech recognition apparatus and a processing apparatus that executes predetermined processing based on the recognition result, so that the user can enter by voice the commands that must be specified for the processing apparatus to operate. In this case, as recited in claim 7, command template data, which is template data for recognizing commands, is either read from the first storage means into the second storage means in advance, or stored in advance in read-only third storage means (for example, a ROM) that, like the second storage means, offers relatively fast access. Recognition is then performed by the following steps (1) to (4).
[0017]
(1) Recognition is performed using the prefecture name template data, or the prefecture-specific template data of claim 4, to make the preliminary determination.
(2) Recognition is performed using the command template data and, in parallel with this recognition, the prefecture-specific template data, or the city-specific template data of claim 4, corresponding to the result of the preliminary determination is read into the second storage means.
[0018]
(3) Recognition is performed using the prefecture-specific template data, or the city-specific template data of claim 4, that was read.
(4) Of the recognition result of (2) and the recognition result of (3), the one with the higher likelihood is taken as the final recognition result.
For example, if the processing apparatus is a navigation apparatus, a hierarchically structured place name (address) may be spoken to set a destination, and commands for using the various functions of the navigation apparatus will naturally also be given. For such a navigation system, executing the above recognition process means that a spoken command, and not only a place name (address), can be handled immediately, without running a separate recognition pass with the command template data. Response thus improves, and so does usability for the user.
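Steps (1) to (4) can be sketched as below, with a worker thread standing in for the read from the first storage means. Everything here is an assumption made for illustration: the sample data, the romanized commands, the simulated latency in `load_from_slow_storage`, and the use of string similarity as the likelihood.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

SLOW_STORAGE = {  # hypothetical first storage means
    "aichiken": ["aichiken kariyashi showacho", "aichiken kasugaishi"],
    "aomoriken": ["aomoriken hirosakishi", "aomoriken aomorishi"],
}
COMMAND_TEMPLATES = ["genzaichi", "chizu kakudai"]  # hypothetical commands


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def best_match(utterance: str, templates: list[str]) -> tuple[float, str]:
    return max((similarity(utterance, t), t) for t in templates)


def load_from_slow_storage(prefecture: str) -> list[str]:
    time.sleep(0.05)  # simulated read latency
    return SLOW_STORAGE[prefecture]


def recognize(utterance: str) -> str:
    # (1) Preliminary determination with the prefecture name templates.
    prefecture = max(SLOW_STORAGE, key=lambda n: similarity(utterance[:len(n)], n))
    with ThreadPoolExecutor(max_workers=1) as pool:
        # (2) Start the load, and recognize commands while it is in progress.
        pending = pool.submit(load_from_slow_storage, prefecture)
        command_result = best_match(utterance, COMMAND_TEMPLATES)
        # (3) Recognize with the templates that were loaded.
        address_result = best_match(utterance, pending.result())
    # (4) Keep whichever result has the higher score.
    return max(command_result, address_result)[1]
```

Because the command pass runs while the load is pending, a spoken command such as `"genzaichi"` costs no extra waiting compared with an address.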
[0019]
The processing system of claim 7 is of course applicable to systems other than navigation systems, but when restricted to navigation systems the following refinement is also possible. That is, as recited in claim 8, on the premise that the system has a function for detecting the current location, recognition is performed by the following steps (1) to (4).
[0020]
(1) Recognition is performed using the prefecture name template data, or the prefecture-specific template data of claim 4, to make the preliminary determination.
(2) The prefecture-specific template data, or the city-specific template data of claim 4, corresponding to the current location detected by the current location detection means is read into the second storage means and used for recognition. In parallel with this recognition, the prefecture-specific template data, or the city-specific template data of claim 4, corresponding to the result of the preliminary determination is read into the second storage means.
[0021]
(3) Recognition is performed using the prefecture-specific template data, or the city-specific template data of claim 4, that was read.
(4) Of the recognition result of (2) and the recognition result of (3), the one with the higher likelihood is taken as the final recognition result.
The situation this approach addresses is as follows. When a vehicle equipped with the navigation system is travelling in, say, Aichi Prefecture and a destination in the same prefecture, “Showa-cho, Kariya City, Aichi Prefecture”, is to be set, it is more natural to omit “Aichi Prefecture” and say “Showa-cho, Kariya City” than to say the full “Showa-cho, Kariya City, Aichi Prefecture”. With this approach, the second recognition uses the lower layer dictionary corresponding to the current location, so speech input that omits the prefecture name can also be handled.
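The handling of an omitted prefecture name might be sketched as follows. The source does not say how templates that begin with the prefecture name are matched against an utterance that omits it; this sketch simply assumes the prefix is stripped for the current-location pass. The data and function names are hypothetical, and the parallel load of step (2) is left out for brevity.

```python
from difflib import SequenceMatcher

SLOW_STORAGE = {  # hypothetical first storage means
    "aichiken": ["aichiken kariyashi showacho", "aichiken kasugaishi"],
    "aomoriken": ["aomoriken hirosakishi", "aomoriken aomorishi"],
}


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def recognize(utterance: str, current_prefecture: str) -> str:
    # (1) Preliminary determination with the prefecture name templates.
    guessed = max(SLOW_STORAGE, key=lambda n: similarity(utterance[:len(n)], n))
    # (2) Recognition with the current location's templates, matched here
    #     without their prefecture prefix (an assumption of this sketch).
    local = [
        (similarity(utterance, t[len(current_prefecture):].strip()), t)
        for t in SLOW_STORAGE[current_prefecture]
    ]
    # (3) Recognition with the templates selected by the preliminary determination.
    loaded = [(similarity(utterance, t), t) for t in SLOW_STORAGE[guessed]]
    # (4) The candidate with the higher score is the final result.
    return max(local + loaded)[1]
```

With the vehicle in Aichi, `recognize("kariyashi showacho", "aichiken")` resolves to the full Aichi address, while a fully spoken address in another prefecture is still resolved through step (3).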
[0022]
On the other hand, if the aim is to improve recognition speed, again on the premise that the system has a function for detecting the current location, the arrangement of claim 9 may be used. In this case, prior to the recognition process, the prefecture-specific template data (or, in the case of claim 4, the city-specific template data) corresponding to the current location detected by the current location detection means is read into the second storage means in advance. Recognition is then performed according to the following steps (1) and (2).
[0023]
(1) Recognition is performed using the prefecture name template data (or, in the case of claim 4, the prefecture-specific template data) together with the pre-loaded prefecture-specific (or city-specific) template data. If the recognition result was obtained using the pre-loaded prefecture-specific (or city-specific) template data, that result is taken as the final recognition result and the recognition process ends.
[0024]
(2) If, on the other hand, the recognition result of (1) was obtained using the prefecture name template data (or, in the case of claim 4, the prefecture-specific template data), the prefecture-specific (or city-specific) template data corresponding to that recognition result is read into the second storage means, and the recognition result obtained using it is taken as the final recognition result.
[0025]
In this way, when recognizing place names within a given area that includes the current location, which are considered to be used frequently, the prefecture-specific (or city-specific) template data has already been read in, so the recognition process can be performed relatively quickly.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments to which the present invention is applied will be described below with reference to the drawings. Needless to say, the embodiments of the present invention are not limited to the following examples, and can take various forms as long as they belong to the technical scope of the present invention.
[0027]
FIG. 1 is a block diagram showing a schematic configuration of a navigation system 2 having a voice recognition function. The navigation system 2 is a so-called car navigation system mounted on a vehicle. It includes a position detector 4, a data input device 6, an operation switch group 8, a control circuit 10 connected to these, and an external memory 12, a display device 14, a remote control sensor 15, and a voice recognition device 30 connected to the control circuit 10. The control circuit 10 is configured as an ordinary computer and includes a well-known CPU, ROM, RAM, I/O, and a bus line connecting these components.
[0028]
The position detector 4 has a known geomagnetic sensor 16, a gyroscope 18, a distance sensor 20, and a GPS receiver 22 for detecting the position of the vehicle based on radio waves from satellites. Since these sensors 16, 18, 20, and 22 have errors of different natures, they are configured so that the plural sensors complement one another. Depending on the required accuracy, only some of them may be used, and a steering rotation sensor, a wheel sensor for each rolling wheel, or the like may also be used.
[0029]
The data input device 6 is a device for inputting various navigation data, including so-called map matching data for improving the accuracy of position detection, map data, and landmark data, as well as the dictionary data used when the voice recognition device 30 performs recognition processing. A DVD is generally considered the likely storage medium because of the amount of data, but another medium such as a CD-ROM may be used. When a DVD is used as the data storage medium, the data input device 6 is a DVD player.
[0030]
The display device 14 is a color display device. On its screen, the current vehicle position mark input from the position detector 4, the map data input from the data input device 6, and additional data to be shown on the map, such as the guidance route and landmarks for set points, can be displayed superimposed. It can also display a menu screen presenting a plurality of options and, when one of those options is selected, a command input screen presenting a further plurality of options.
[0031]
The navigation system 2 also has a so-called route guidance function. When the position of the destination is input from the remote control sensor 15, via a remote control terminal (hereinafter referred to as a remote controller) 15a, or from the operation switch group 8, the optimum route from the current position to the destination is automatically selected, and a guidance route is formed and displayed. The Dijkstra method is one known technique for automatically setting such an optimum route. The operation switch group 8 consists of, for example, touch switches integrated with the display device 14 or mechanical switches, and is used for inputting various commands.
[0032]
Whereas the operation switch group 8 and the remote controller 15a are used to input various commands by manual operation, the voice recognition device 30 is a device that allows the user to input the same commands by voice.
[0033]
The voice recognition device 30 includes a voice recognition unit 31, a dialogue control unit 32, a voice synthesis unit 33, a voice extraction unit 34, a microphone 35, a switch 36, a speaker 37, and a control unit 38.
The voice recognition unit 31 performs recognition processing on the voice data input from the voice extraction unit 34 in accordance with an instruction from the dialogue control unit 32, and returns the recognition result to the dialogue control unit 32. That is, the voice data acquired from the voice extraction unit 34 is collated against the stored dictionary data, and the top-ranked comparison target patterns, those with the highest degree of coincidence among the plurality of comparison target pattern candidates, are output to the dialogue control unit 32. To recognize the word sequence in the input speech, the voice data input from the voice extraction unit 34 is acoustically analyzed in sequence to extract acoustic feature quantities (for example, the cepstrum), yielding time-series data of acoustic features. This time-series data is then divided into sections by a known technique such as an HMM (Hidden Markov Model), DP matching, or a neural network, and it is determined which of the words stored as dictionary data each section corresponds to.
[0034]
Based on the recognition result from the voice recognition unit 31 and instructions from the control unit 38, the dialogue control unit 32 instructs the voice synthesis unit 33 to output a response voice, or notifies the control circuit 10, which executes the processing of the navigation system itself, of the destination or command needed for navigation processing and instructs it to set the destination or execute the command. As a result, by using the voice recognition device 30, a destination can be specified to the navigation system by voice input without manually operating the operation switch group 8 or the remote controller 15a.
[0035]
The voice synthesizer 33 synthesizes a voice based on a response voice output instruction from the dialogue control unit 32 using a voice waveform stored in the waveform database. This synthesized voice is output from the speaker 37.
The voice extraction unit 34 converts the ambient sound captured by the microphone 35 into digital data and outputs it to the voice recognition unit 31. Specifically, in order to analyze the features of the input sound, frame signals of about several tens of milliseconds are cut out at constant intervals, and each is judged to be either a voice section containing speech or a noise section containing none. This judgment is needed because the signal input from the microphone 35 contains noise as well as the speech to be recognized. Many methods have been proposed for this judgment. A commonly used one extracts the short-time power of the input signal at regular intervals and decides between voice section and noise section according to whether short-time power at or above a predetermined threshold continues for a certain period. When a section is judged to be a voice section, the input signal is output to the voice recognition unit 31.
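The short-time power judgment described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the voice extraction unit 34: the function names, the 160-sample frame length, the 0.01 threshold, and the minimum run of 3 frames are all hypothetical values chosen for the example.

```python
def short_time_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_voice_sections(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive.

    A run of frames counts as a voice section only if its power stays at or
    above `threshold` for at least `min_frames` consecutive frames; shorter
    bursts are treated as noise.
    """
    sections = []
    run_start = None
    powers = short_time_power(samples, frame_len)
    # The -inf sentinel closes a run that lasts until the end of the signal.
    for idx, power in enumerate(powers + [float("-inf")]):
        if power >= threshold:
            if run_start is None:
                run_start = idx
        elif run_start is not None:
            if idx - run_start >= min_frames:
                sections.append((run_start, idx))
            run_start = None
    return sections
```

With these assumed parameters, a loud burst lasting a single frame is discarded as noise, while a burst lasting four frames is reported as one voice section.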
[0036]
In the present embodiment, the user inputs voice through the microphone 35 while pressing the switch 36. Specifically, the control unit 38 monitors the timing at which the switch 36 is pressed, the timing at which it is released, and how long it remains pressed, and instructs the voice recognition unit 31 to execute processing while the switch 36 is pressed. When the switch 36 is not pressed, the processing is not executed. Accordingly, voice data input via the microphone 35 while the switch 36 is being pressed is output to the voice recognition unit 31.
[0037]
By having such a configuration, in the in-vehicle navigation system 2 of the present embodiment, various processes such as route setting, route guidance, facility search, and facility display can be executed by the user inputting a command.
Here, the voice recognition unit 31 and the dialogue control unit 32 will be described further. As shown in FIG. 2, the voice recognition unit 31 includes a collation unit 311, a dictionary unit 312, and an extraction result storage unit 313, and the dialogue control unit 32 includes a processing unit 321, an input unit 322, a dictionary control unit 323, and the like.
[0038]
In the voice recognition unit 31, the extraction result storage unit 313 stores the extraction result output from the voice extraction unit 34, and the collation unit 311 collates the stored extraction result against the dictionary data held in the dictionary unit 312. The dictionary data in the dictionary unit 312 is not fixed but is set and updated as appropriate, as described later. The top-ranked recognition results, those found by the collation unit 311 to have a high degree of coincidence with the dictionary data, are output to the processing unit 321 of the dialogue control unit 32, and the processing unit 321 outputs the recognition result to the control circuit 10.
[0039]
On the other hand, the processing unit 321 can issue to the control circuit 10 a request (dictionary read request) to read dictionary data from the DVD and output it to the voice recognition device 30. The dictionary data delivered in response is received via the input unit 322 of the dialogue control unit 32, and the dictionary control unit 323 sets (writes) or updates the dictionary data in the dictionary unit 312 of the voice recognition unit 31.
[0040]
Here, the dictionary data will be described. The dictionary data comprises not only the vocabulary itself; for a vocabulary in which several words are joined hierarchically, it is divided and prepared as follows. A place name dictionary is used here as an example of dictionary data divided in this way.
[0041]
First, the upper-level dictionary is dictionary data of prefecture names. That is, it holds keywords corresponding to the names of the 47 prefectures (Aichi Prefecture, Aomori Prefecture, ..., Wakayama Prefecture) as dictionary data. The lower-level dictionaries are prefecture-specific dictionaries prepared separately for each prefecture. That is, 47 prefecture-specific dictionaries are prepared: one for Aichi Prefecture, one for Aomori Prefecture, ..., one for Wakayama Prefecture. Each lower-level dictionary is dictionary data in which lower-level keywords are appended to the higher-level keyword. For example, the prefecture-specific dictionary for Aichi Prefecture contains entries such as "Aichi Prefecture, XX City, xx Town", ..., "Aichi Prefecture, Kariya City, Showa Town", ..., "Aichi Prefecture, △△ City, ▽▽ Town". The same applies to the prefecture-specific dictionaries of the other prefectures.
[0042]
If necessary, still lower-level dictionaries may be prepared. That is, a city-specific dictionary may be prepared for each municipality in the country, for example a city-specific dictionary for Kariya City, Aichi Prefecture, one for Obu City, Aichi Prefecture, and so on. Japan is said to have about 4000 municipalities, so about 4000 city-specific dictionaries would be prepared. Taking this idea further, even lower-level dictionaries can of course be prepared. For example, Nagoya City has 16 wards, so ward-specific dictionaries divided into 16 may be prepared. It is likewise possible to prepare dictionaries divided at the Oaza (large section) level below the municipality, in the same way as for wards.
[0043]
Basically, all dictionaries including the dictionary thus divided are recorded on a recording medium such as a DVD set in the data input device 6. The reason for “basically” is that there may be dictionary data resident in the dictionary unit 312 of the voice recognition unit 31. However, the lower-level dictionary described above is stored in a data storage medium such as a DVD as a rule, and is read into the dictionary unit 312 when necessary.
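The division of roles between the storage medium and the dictionary unit 312 can be sketched as follows. This is a minimal illustration under stated assumptions: the class names `SlowMedium` and `DictionaryUnit`, the dictionary keys, and in particular the policy of evicting every non-resident dictionary before loading a new one are hypothetical and are not taken from the embodiment.

```python
class SlowMedium:
    """Stand-in for the DVD: holds every dictionary and counts reads."""

    def __init__(self, dictionaries):
        self._dictionaries = dictionaries
        self.read_count = 0

    def read(self, name):
        self.read_count += 1
        return list(self._dictionaries[name])


class DictionaryUnit:
    """Stand-in for the dictionary unit: small, fast, holds only what is needed."""

    def __init__(self, medium, resident=()):
        self._medium = medium
        self._resident = set(resident)
        # Resident dictionaries (e.g. the prefecture names) are loaded once.
        self._loaded = {name: medium.read(name) for name in resident}

    def load(self, name):
        if name not in self._loaded:
            # Assumed policy: drop non-resident dictionaries to bound memory.
            for old in [n for n in self._loaded if n not in self._resident]:
                del self._loaded[old]
            self._loaded[name] = self._medium.read(name)
        return self._loaded[name]

    def loaded_names(self):
        return sorted(self._loaded)
```

Under this assumed policy the fast memory never holds more than the resident dictionaries plus one lower-level dictionary, which is the property the divided dictionary is meant to provide.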
[0044]
Next, the operation of the navigation system 2 of the present embodiment will be described. Since the portion related to the speech recognition device 30 is characteristic, the general operation as a navigation system will be briefly described, and then the operation related to the speech recognition device 30 will be described in detail.
[0045]
After the navigation system 2 is turned on, the following processing is performed when the driver uses the remote controller 15a (or, equivalently, the operation switch group 8) to select route information display processing from the menu displayed on the display device 14, or when the driver inputs the desired menu item by voice via the microphone 35 and the voice recognition device 30, so that the dialogue control unit 32 gives the control circuit 10 the same instruction as a selection made via the remote controller 15a.
[0046]
That is, when the driver inputs a destination by voice or by operating the remote controller or the like, based on the map on the display device 14, the current location of the vehicle is obtained from the satellite data received by the GPS receiver 22, cost calculation by the Dijkstra method is performed between the current location and the destination, and the shortest route from the current location to the destination is obtained as the guidance route. The guidance route is then displayed superimposed on the road map on the display device 14 to guide the driver along an appropriate route. Such calculation processing and guidance processing for obtaining a guidance route are generally well known, and their description is therefore omitted.
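For readers unfamiliar with the Dijkstra method mentioned above, a minimal generic sketch follows. It is a textbook formulation, not the route calculation of the navigation system 2. The function name, the adjacency-dictionary graph format, and the node names used in the usage note are assumptions made for illustration.

```python
import heapq


def shortest_route(graph, start, goal):
    """Return (total_cost, [nodes]) for the cheapest route, or (inf, []) if none.

    `graph` maps each node to a dict of {neighbor: non-negative link cost}.
    """
    queue = [(0, start, [start])]
    best = {start: 0}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper way to this node was found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []
```

For a hypothetical road graph `{"current": {"A": 2, "B": 5}, "A": {"B": 1, "goal": 7}, "B": {"goal": 2}}`, the route via A and then B is cheaper than either direct link.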
[0047]
Next, the operation in the voice recognition device 30 will be described. Here, some operation examples are given.
[Operation example 1]
FIG. 4 is a flowchart illustrating processing in the voice recognition unit 31 and the dialogue control unit 32 in the case of the first operation example.
[0048]
In the first step S10, a dictionary of the highest hierarchy is set. Specifically, it is the above-mentioned prefecture name dictionary, which is read from the DVD by the data input device 6 and set in the dictionary unit 312 of the voice recognition unit 31 through the control circuit 10 and the dialogue control unit 32. As described above, the prefecture name dictionary may be resident in the dictionary unit 312.
[0049]
Once preparation for voice recognition is complete, voice recognition processing is performed (S20). As described above, the voice data input through the microphone 35 while the switch 36 is being pressed is extracted by the voice extraction unit 34 and output to the voice recognition unit 31, where recognition processing is executed.
[0050]
After the voice recognition process is performed, it is determined whether the lowest-level dictionary was used for the recognition (S30). If the recognition used the prefecture name dictionary set in S10, it was not the lowest-level dictionary (S30: NO), so the lower-level dictionary selected from the result of the recognition process in S20 is set (S40). For example, when "Aichi Prefecture" is selected in recognition using the prefecture name dictionary, the prefecture-specific dictionary for Aichi Prefecture is set. For this setting, as illustrated in FIG. 3, the dialogue control unit 32 requests the control circuit 10 to read the prefecture-specific dictionary. In response, the control circuit 10 reads the corresponding prefecture-specific dictionary from the DVD via the data input device 6 and sends it to the dialogue control unit 32. Then, as described above, the dictionary control unit 323 (see FIG. 2) in the dialogue control unit 32 sets the prefecture-specific dictionary in the dictionary unit 312 of the voice recognition unit 31.
[0051]
Thereafter, the process returns to S20, and the voice recognition process is performed again using the extraction result stored in the extraction result storage unit 313. If the prefecture-specific dictionary is the lowest-level dictionary (S30: YES), the recognition result obtained using it is output to the control circuit 10 (S50).
As described above, when city-specific, ward-specific, Oaza-level, or other dictionaries are prepared below the prefecture-specific dictionaries, the loop of S20 to S40 is repeated, and the result recognized with the lowest-level dictionary set is output.
[0052]
In this way, when recognizing a place name input by voice, the entire place name dictionary need not be read into the dictionary unit 312; it suffices to read the prefecture name dictionary and the prefecture-specific dictionary for the selected prefecture. The dictionary for such a hierarchically structured word group is divided, the preliminary determination decides which lower-level dictionary (prefecture-specific dictionary) to use, and only that narrowed-down dictionary is read into the dictionary unit 312. Therefore, even when pursuing the advantage of increasing the number of recognizable objects by increasing the vocabulary prepared in the dictionary, it is not necessary to store all the recognizable vocabulary in the dictionary unit 312. Even if the dictionary unit 312 has a relatively small capacity, appropriate voice recognition supporting batch input can be realized.
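The loop of S10 to S50 in operation example 1 can be sketched as follows. This is a toy illustration, not the embodiment's recognizer: `toy_score` replaces acoustic matching with a word-set overlap on romanized text, and the contents of `MEDIUM` and all function names are hypothetical.

```python
def toy_score(utterance, entry):
    """Jaccard overlap of word sets; an assumed stand-in for an acoustic score."""
    u, e = set(utterance.split()), set(entry.split())
    return len(u & e) / len(u | e)


def recognize(utterance, dictionary):
    """Return (best_entry, score) using the dictionary currently set."""
    return max(((entry, toy_score(utterance, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


# Hypothetical contents of the storage medium; keys name the dictionaries.
MEDIUM = {
    "prefecture_names": ["aichiken", "aomoriken", "wakayamaken"],
    "aichiken": ["aichiken kariyashi showacho", "aichiken obushi"],
    "aomoriken": ["aomoriken aomorishi"],
    "wakayamaken": ["wakayamaken wakayamashi"],
}


def recognize_hierarchically(utterance, medium, top="prefecture_names"):
    """Return (final_result, names_of_dictionaries_read)."""
    loaded = []
    name = top                                           # S10: set top dictionary
    while True:
        loaded.append(name)
        result, _ = recognize(utterance, medium[name])   # S20: recognize
        if result not in medium:                         # S30: lowest level reached
            return result, loaded                        # S50: output
        name = result                                    # S40: set lower dictionary
```

In this sketch only two of the four dictionaries are ever read for an Aichi address, which mirrors the point that the whole place name dictionary need not be held in fast memory.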
[0053]
[Operation example 2]
FIG. 5 is a flowchart showing processing in the speech recognition unit 31 and the dialogue control unit 32 in the case of the operation example 2. Here, it is assumed that the prefecture name dictionary and the command dictionary are stored in the dictionary unit 312 before the actual recognition process is started.
[0054]
In the first step S110, a prefecture name dictionary is set. Since it is stored in advance, it is set here as a dictionary used for speech recognition. That is, although the command dictionary is also stored in the dictionary unit 312, it is not set. In the subsequent S120, the first speech recognition process is performed using the prefecture name dictionary, and a request is made to read the prefecture-specific dictionary selected from the first recognition result (S130).
[0055]
The dialogue control unit 32 requests the control circuit 10 to read the dictionary as described in the first operation example. On receiving this request, the control circuit 10 reads the corresponding prefecture-specific dictionary from the DVD via the data input device 6 and sends it to the dialogue control unit 32. The dialogue control unit 32 receives the prefecture-specific dictionary (S190) and sets it in the dictionary unit 312 of the voice recognition unit 31 (S160).
[0056]
However, a certain amount of time passes between sending the request to the control circuit 10 and receiving the prefecture-specific dictionary, and here that time is used to perform the second recognition process. That is, the command dictionary is now set as the dictionary used for voice recognition (S140), and the second voice recognition process is performed using the command dictionary (S150). When the second recognition process is complete, the prefecture-specific dictionary read from the DVD is set as the dictionary used for voice recognition (S160), and the third voice recognition process is performed using it (S170).
[0057]
The certainty of the second recognition result is compared with that of the third, and the more certain candidate (recognition result) is output (S180).
The vocabulary that users input by voice when using a navigation system includes place names (addresses) for setting a destination and, of course, commands for using the various navigation functions. In this operation example the second voice recognition process uses the command dictionary, so commands as well as place names (addresses) can be handled promptly. Because this recognition runs while the prefecture-specific dictionary selected in the preliminary determination is being read, lost time is reduced. That is, overall response improves and usability for the user improves.
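The overlap between dictionary reading and command recognition in operation example 2 can be sketched as follows. This is an illustration under stated assumptions: a background thread with a short sleep stands in for the slow DVD read, `toy_score` is a word-overlap stand-in for acoustic matching, and the dictionary contents and function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def toy_score(utterance, entry):
    """Jaccard overlap of word sets; an assumed stand-in for an acoustic score."""
    u, e = set(utterance.split()), set(entry.split())
    return len(u & e) / len(u | e)


def recognize(utterance, dictionary):
    return max(((entry, toy_score(utterance, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


PREFECTURE_NAMES = ["aichiken", "aomoriken", "wakayamaken"]    # resident
COMMANDS = ["set destination", "show map", "enlarge map"]      # resident
MEDIUM = {                                                     # on the slow medium
    "aichiken": ["aichiken kariyashi showacho", "aichiken obushi"],
    "aomoriken": ["aomoriken aomorishi"],
    "wakayamaken": ["wakayamaken wakayamashi"],
}


def read_from_medium(name, delay=0.01):
    time.sleep(delay)  # simulated slow read
    return MEDIUM[name]


def recognize_address_or_command(utterance):
    prefecture, _ = recognize(utterance, PREFECTURE_NAMES)       # S110, S120
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(read_from_medium, prefecture)      # S130, S190
        command_result = recognize(utterance, COMMANDS)          # S140, S150
        prefecture_dictionary = pending.result()                 # S160
    address_result = recognize(utterance, prefecture_dictionary) # S170
    # S180: output whichever candidate scored higher.
    return max(command_result, address_result, key=lambda pair: pair[1])
```

Note that, as in the flowchart, a prefecture-specific dictionary is read even when the utterance turns out to be a command; the command result simply wins the final comparison.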
[0058]
[Operation example 3]
FIG. 6 is a flowchart showing processing in the voice recognition unit 31 and the dialogue control unit 32 in the case of operation example 3. Here, it is assumed that the prefecture name dictionary and the prefecture-specific dictionary for the current location are stored in the dictionary unit 312 before the actual recognition process starts. That is, since the current location can be detected by the position detector 4, when a vehicle equipped with the navigation system is traveling in Aichi Prefecture, for example, the prefecture-specific dictionary for Aichi Prefecture is read in advance from the DVD and stored in the dictionary unit 312.
[0059]
In the first step S210, a prefecture name dictionary is set. Since it is stored in advance, it is set here as a dictionary used for speech recognition. That is, the dictionary unit 312 stores a prefecture-specific dictionary corresponding to the current location, but it is not set. In the subsequent S220, the first speech recognition process is performed using the prefecture name dictionary, and a request is made to read the prefecture-specific dictionary selected from the first recognition result (S230).
[0060]
As a result of the dictionary read request in S230, the corresponding prefecture-specific dictionary is read from the DVD (S290) and set in the dictionary unit 312 of the voice recognition unit 31 (S260); this is the same as the processing of S130, S160, and S190 in operation example 2. In operation example 2, recognition using the command dictionary was performed during this interval. In operation example 3, the pre-loaded prefecture-specific dictionary corresponding to the current location is instead set as the dictionary used for voice recognition (S240), and the second voice recognition process is performed using it (S250). When the second recognition process is complete, the prefecture-specific dictionary read from the DVD is set as the dictionary used for voice recognition (S260), and the third voice recognition process is performed using it (S270).
[0061]
The certainty of the second recognition result is compared with that of the third, and the more certain candidate (recognition result) is output (S180).
For example, when a vehicle equipped with the navigation system is traveling in Aichi Prefecture and the user wants to set "Showa Town, Kariya City, Aichi Prefecture", in the same prefecture, as the destination, it is natural to omit "Aichi Prefecture" and say "Showa Town, Kariya City" rather than the full "Showa Town, Kariya City, Aichi Prefecture". In this method, the second recognition uses the lower-level dictionary corresponding to the current location, so voice input that omits the prefecture name can also be handled.
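The way operation example 3 copes with an omitted prefecture name can be sketched as follows. This is a toy illustration: `toy_score` is a word-overlap stand-in for acoustic matching, the dictionary contents and function names are hypothetical, and the dictionary read is done sequentially for brevity, whereas in the embodiment it overlaps with the second recognition pass.

```python
def toy_score(utterance, entry):
    """Jaccard overlap of word sets; an assumed stand-in for an acoustic score."""
    u, e = set(utterance.split()), set(entry.split())
    return len(u & e) / len(u | e)


def recognize(utterance, dictionary):
    return max(((entry, toy_score(utterance, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


PREFECTURE_NAMES = ["aomoriken", "aichiken", "wakayamaken"]
MEDIUM = {
    "aichiken": ["aichiken kariyashi showacho", "aichiken obushi"],
    "aomoriken": ["aomoriken aomorishi"],
    "wakayamaken": ["wakayamaken wakayamashi"],
}


def recognize_with_current_location(utterance, current_prefecture):
    preloaded = MEDIUM[current_prefecture]              # read before recognition
    guess, _ = recognize(utterance, PREFECTURE_NAMES)   # S210, S220
    second = recognize(utterance, preloaded)            # S240, S250
    selected = MEDIUM[guess]                            # S230, S290, S260
    third = recognize(utterance, selected)              # S270
    # Final step: keep the more certain of the second and third results.
    return max(second, third, key=lambda pair: pair[1])
```

When the utterance is "kariyashi showacho", the preliminary determination has nothing to go on and picks an arbitrary prefecture, yet the second pass against the pre-loaded Aichi dictionary still yields the intended address.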
[0062]
[Operation example 4]
FIG. 7 is a flowchart showing processing in the voice recognition unit 31 and the dialogue control unit 32 in the case of the fourth operation example. As in the case of the operation example 3, it is assumed that the prefecture name dictionary and the prefecture-specific dictionary of the current location are stored in the dictionary unit 312 before starting the actual recognition process.
[0063]
In the first step S310, the prefecture name dictionary and the prefecture-specific dictionary corresponding to the current location are set as the dictionaries used for voice recognition. In the subsequent S320, the first voice recognition process is performed using both. When the first recognition result was obtained using the prefecture-specific dictionary corresponding to the current location (S330: YES), the first recognition result is output (S340).
[0064]
On the other hand, if the first recognition result was obtained using the prefecture name dictionary rather than the prefecture-specific dictionary corresponding to the current location (S330: NO), a request is made to read the prefecture-specific dictionary selected from that recognition result (S350), and the corresponding prefecture-specific dictionary is read from the DVD (S355). In this case, unlike operation examples 2 and 3, no separate voice recognition process is executed between the dictionary read request and the actual reading.
[0065]
Then, the prefecture-specific dictionary read in S355 is set as the dictionary used for voice recognition (S360), the second voice recognition process is performed using it (S370), and the recognition result is output (S380).
In this way, a place name in the prefecture that includes the current location, which is considered to be used frequently, can be recognized in the first voice recognition process, because the prefecture-specific dictionary needed to recognize it is already set. The recognition process can therefore be performed relatively quickly.
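The fast path of operation example 4 can be sketched as follows. This is a toy illustration: `toy_score` is a word-overlap stand-in for acoustic matching, and the dictionary contents, the source tags, and the function names are hypothetical. The second return value counts reads from the slow medium during recognition, to show when the fast path applies.

```python
def toy_score(utterance, entry):
    """Jaccard overlap of word sets; an assumed stand-in for an acoustic score."""
    u, e = set(utterance.split()), set(entry.split())
    return len(u & e) / len(u | e)


def recognize(utterance, dictionary):
    return max(((entry, toy_score(utterance, entry)) for entry in dictionary),
               key=lambda pair: pair[1])


PREFECTURE_NAMES = ["aomoriken", "aichiken", "wakayamaken"]
MEDIUM = {
    "aichiken": ["aichiken kariyashi showacho", "aichiken obushi"],
    "aomoriken": ["aomoriken aomorishi"],
    "wakayamaken": ["wakayamaken wakayamashi"],
}


def recognize_fast_path(utterance, current_prefecture):
    """Return (result, number_of_medium_reads_during_recognition)."""
    reads = 0
    # S310: both dictionaries are set; each entry is tagged with its source.
    candidates = [(name, "prefecture_names") for name in PREFECTURE_NAMES]
    candidates += [(entry, "current") for entry in MEDIUM[current_prefecture]]
    best, source = max(candidates,
                       key=lambda c: toy_score(utterance, c[0]))  # S320
    if source == "current":                                       # S330: YES
        return best, reads                                        # S340
    reads += 1                                                    # S350, S355
    result, _ = recognize(utterance, MEDIUM[best])                # S360, S370
    return result, reads                                          # S380
```

Addresses inside the current prefecture finish after one pass with no read from the medium; only an address in another prefecture pays for a dictionary read and a second pass.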
[0066]
Although four examples of operations in the speech recognition apparatus 30 have been described and the effects of the respective operation examples have been described, the following effects can also be obtained by devising the configuration of the upper layer dictionary.
[Dictionary configuration example 1]
Here, a prefecture name dictionary is considered as an example of an upper-level dictionary. As described above, the prefecture name dictionary holds keywords corresponding to the prefecture names (Aichi Prefecture, Aomori Prefecture, ..., Wakayama Prefecture) as dictionary data. In this configuration example the entries instead take the form "Aichi Prefecture *", ..., "Wakayama Prefecture *", where the * part is a wildcard model that can match any voice input. For example, "Kariya City" in the voice input "Aichi Prefecture Kariya City" matches *. If the dictionary held only the prefecture name keywords, the prefecture name would cover only part of the actual recognition target (a word group extending to the municipality and Oaza below the prefecture), so the overall degree of matching would fall. With the wildcard model, matching is performed against the whole recognition target, so this problem does not arise.
[0067]
Here, a supplementary explanation of the wild card model will be given.
First, the HMM (Hidden Markov Model) method generally used in speech recognition will be briefly described. This method assumes that speech is generated from a Markov model expressed by states and transitions. The generative model is created in advance and matched against the speech, and the best-matching model is taken as the recognition result. The representation shown in FIG. 8 is a common example of this model. An output probability distribution corresponds to each state. When the time series of feature values obtained by analyzing the speech (shown in two dimensions in FIG. 8 for simplicity) arrives in the order shown in the figure (a1 → a2 → a3), the probability of each feature is obtained from the probability distributions of FIG. 8B. The final recognition result is the model with the best product of probabilities up to the end of the speech (a score called the likelihood). In this method, HMMs of the recognition target vocabulary are prepared and compared, but preparing them word by word is impractical for large-vocabulary recognition. HMMs are therefore created for phonemes or syllables (units called subwords, meaning parts of words), and word models are created by concatenating them.
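The likelihood scoring described above can be sketched with a small left-to-right HMM. This is a simplified illustration and rests on several assumptions not found in the text: features are one-dimensional rather than two-dimensional, each state has a unit-variance Gaussian output distribution, stay and advance transitions each have probability 0.5, and the best single path (Viterbi) is used instead of summing over all paths.

```python
import math


def log_gauss(x, mean, var=1.0):
    """Log density of a 1-D Gaussian output distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_log_likelihood(features, state_means,
                           log_stay=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log-likelihood of a left-to-right HMM.

    The path must start in the first state and end in the last state.
    Returns -inf if the feature sequence is too short to reach the last state.
    """
    neg_inf = float("-inf")
    n = len(state_means)
    scores = [neg_inf] * n
    scores[0] = log_gauss(features[0], state_means[0])
    for x in features[1:]:
        new_scores = []
        for s in range(n):
            stay = scores[s] + log_stay
            advance = scores[s - 1] + log_next if s > 0 else neg_inf
            new_scores.append(max(stay, advance) + log_gauss(x, state_means[s]))
        scores = new_scores
    return scores[-1]


def best_model(features, models):
    """Pick the model name with the highest log-likelihood."""
    return max(models, key=lambda name: viterbi_log_likelihood(features, models[name]))
```

Working in log probabilities turns the product of probabilities into a sum, which avoids numerical underflow on long utterances. A feature sequence that rises from about 0 to about 4 scores far better against a model with state means `[0, 2, 4]` than against one with `[4, 2, 0]`.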
[0068]
Next, a garbage model, which is one example of a wildcard model, will be described. FIG. 9A shows an example of the probability distributions corresponding to the states of the HMMs of /a/, /i/, and /u/. Here, the feature space is two-dimensional. A speech model called a garbage model does not represent the features of any specific syllable but has a distribution with a large variance so as to cover a wide range of speech. Because the garbage model thus matches "broadly and shallowly" against various voice patterns, it outputs a certain score (= probability) for a wide range of speech, but that score tends to be smaller than the one from the correct distribution. For example, for the voice pattern indicated by "x" in FIG. 9A, the scores of /a/ and /i/ are very small and the score of /u/ is large. The score of the garbage model is larger than the scores of /a/ and /i/ but smaller than the score of /u/.
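The "broad and shallow" behavior of the garbage model can be reproduced with a small numerical sketch. The means, standard deviations, and the test point below are hypothetical values chosen for illustration and are not read from FIG. 9A; each model is assumed to be a single isotropic two-dimensional Gaussian.

```python
import math


def log_density(point, mean, sigma):
    """Log density of an isotropic 2-D Gaussian with standard deviation sigma."""
    squared_distance = sum((p - m) ** 2 for p, m in zip(point, mean))
    return -math.log(2 * math.pi * sigma ** 2) - squared_distance / (2 * sigma ** 2)


# Assumed models: three narrow syllable distributions and one wide garbage one.
MODELS = {
    "/a/": ((0.0, 0.0), 1.0),
    "/i/": ((4.0, 0.0), 1.0),
    "/u/": ((0.0, 4.0), 1.0),
    "garbage": ((1.5, 1.5), 3.0),
}


def score_all(point):
    """Log-density score of the point under every model."""
    return {name: log_density(point, mean, sigma)
            for name, (mean, sigma) in MODELS.items()}
```

For a point close to the assumed /u/ mean, such as `(0.2, 3.8)`, the garbage score falls between the low scores of /a/ and /i/ and the high score of /u/, which is the ordering the paragraph describes.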
[0069]
Therefore, if the templates "Aichiken G" (G being the garbage model) and "Aichiken Kariyashi" are matched against the utterance "Aichiken Kariyashi", the scores are likely (though not guaranteed) to satisfy "Aichiken G" < "Aichiken Kariyashi". However, if the templates "Aichiken G" and "Aichiken Kashigarashi" are matched against the utterance "Aichiken Kariyashi", the relation "Aichiken G" < "Aichiken Kashigarashi" does not necessarily hold, and it is reversed with considerable probability.
[0070]
Next, a syllable concatenation model, which is another example of a wild card model, will be described.
A syllable HMM is a unit that makes up words, and if syllable HMMs are allowed to connect in any order, the utterance of any word can be recognized. That is, it is a syllable concatenation model as shown in the figure. Here, it is assumed that Japanese is recognized.
[0071]
If this is used as a wildcard, as in "Aichiken SCM" (SCM being the syllable concatenation model), utterances such as "Showa Town, Kariya City, Aichi Prefecture" can also be matched. In this case the model "Aichiken SCM" contains the model "Aichiken Kariyashi Showacho" among the sequences it can express, so it obtains a score at least as high as the latter's.
[0072]
[Dictionary configuration example 2]
The use of the wild card model has the advantages described above. However, because a wild card matches loosely with anything, it also increases the possibility of erroneous recognition. Therefore, when configuring a prefecture name dictionary, for example, the dictionary may be prepared in a redundant form in which municipality names are appended to the prefecture names. At the time of recognition, matching is performed up to the municipality name, but the result is used only to determine which prefecture name was matched. Because matching is performed against a longer portion of the voice, and because the entries do not match loosely with anything as a wild card model does, an improvement in the recognition rate can be expected.
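The redundant prefecture name dictionary can be sketched roughly as below. This is only an illustration under strong assumptions: romanized strings stand in for acoustic templates, Python's `difflib.SequenceMatcher` stands in for acoustic matching, and the dictionary entries and the function name `preliminary_prefecture` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical redundant prefecture name dictionary: each entry is
# "prefecture name + municipality name", labeled with its prefecture.
PREFECTURE_DICT = {
    "aichiken kariyashi": "aichiken",
    "aichiken nagoyashi": "aichiken",
    "akitaken akitashi": "akitaken",
    "gifuken gifushi": "gifuken",
}

def preliminary_prefecture(utterance):
    """Match up to the municipality name, but report only the prefecture."""
    def similarity(entry):
        # Compare the entry with the leading part of the utterance only,
        # since lower-level place names may follow the municipality name.
        head = utterance[:len(entry)]
        return SequenceMatcher(None, head, entry).ratio()

    best_entry = max(PREFECTURE_DICT, key=similarity)
    return PREFECTURE_DICT[best_entry]

print(preliminary_prefecture("aichiken kariyashi showacho"))
```

The matched municipality is discarded at this stage; only the prefecture is kept, and it decides which prefecture-specific dictionary is read next.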
[0073]
The present invention is not limited to the embodiments described above, and can be implemented in various forms without departing from the spirit of the present invention.
For example, in the above-described embodiment, the voice recognition device 30 is applied to the navigation system 2 mounted on a vehicle. However, it may be realized not only as a vehicle-mounted device but also, for example, as a portable navigation device.
[0074]
Also, the present invention can be applied to the case where various data settings and instructions are given by voice input to a device that executes other processing than navigation.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a navigation system as an embodiment.
FIG. 2 is a block diagram showing a configuration of a voice recognition unit and a dialogue control unit in the voice recognition device.
FIG. 3 is an explanatory diagram of a dictionary reading request and the dictionary reading corresponding thereto.
FIG. 4 is a flowchart showing processing according to an operation example 1 in the speech recognition apparatus.
FIG. 5 is a flowchart showing processing according to an operation example 2 in the speech recognition apparatus.
FIG. 6 is a flowchart showing processing according to an operation example 3 in the speech recognition apparatus.
FIG. 7 is a flowchart illustrating processing according to an operation example 4 in the speech recognition apparatus.
FIG. 8 is an explanatory diagram of an HMM (Hidden Markov Model).
FIG. 9 is an explanatory diagram of a garbage model and a syllable concatenation model as examples of a wild card model.
[Explanation of symbols]
2 ... Navigation system 4 ... Position detector
6 ... Data input device 8 ... Operation switch group
10 ... Control circuit 12 ... External memory
14 ... Display device 15 ... Remote control sensor
15a ... Remote control 16 ... Geomagnetic sensor
18 ... Gyroscope 20 ... Distance sensor
22 ... GPS receiver 30 ... Voice recognition device
31 ... Voice recognition unit 32 ... Dialog control unit
33 ... Speech synthesis unit 34 ... Speech input unit
35 ... Microphone 36 ... Switch
37 ... Speaker 38 ... Control unit
311 ... Verification unit 312 ... Dictionary unit
313 ... Extraction result storage unit 321 ... Processing unit
322 ... Input unit 323 ... Dictionary control unit

Claims

Voice input means capable of receiving voice that is input continuously,
Prefecture name template data corresponding to prefecture names, for the case where the final recognition target is an address formed by hierarchically connecting multiple place names,
Prefecture-specific template data that is prepared for each prefecture and corresponds to a group of words that includes, in addition to the prefecture name, municipality names or even place names at a level lower than the municipality,
A first storage means having relatively low high-speed accessibility during voice recognition processing, and
A second storage means having relatively high high-speed accessibility during voice recognition processing,
At least the prefecture-specific template data is stored in the first storage means,
When recognizing voice input continuously through the voice input means, the matching data obtained from the input voice is first compared with the prefecture name template data to preliminarily determine which prefecture name is included,
The prefecture-specific template data corresponding to the one prefecture name determined in the preliminary determination to be included is read into the second storage means, and the final recognition result is obtained using that prefecture-specific template data. A voice recognition device characterized by the above.

The speech recognition apparatus according to claim 1,
A speech recognition apparatus, wherein at least one of the prefecture name template data and the prefecture-specific template data is dictionary data.

The speech recognition apparatus according to claim 1,
A speech recognition apparatus, wherein at least one of the prefecture name template data and the prefecture-specific template data is speech data.

The speech recognition apparatus according to any one of claims 1 to 3,
When the address is composed of place names at three or more levels, the apparatus comprises, in addition to the prefecture name template data and the prefecture-specific template data, city-specific template data that is prepared for each municipality, has the prefecture-specific template data as its upper hierarchy, and includes place names down to a level lower than the municipality. A speech recognition apparatus characterized by the above.

The speech recognition device according to any one of claims 1 to 4,
In the prefecture name template data, and in the prefecture-specific template data in claim 4, a wild card model that also matches voice input of multiple kinds of other words or word groups is included after the word or word group constituting the template data.

The speech recognition device according to any one of claims 1 to 4,
The prefecture name template data is data in which a municipality name, or further a place name at a level lower than the municipality, follows the prefecture name, and the prefecture-specific template data in claim 4 is data in which a place name at a level lower than the municipality follows the municipality name. A voice recognition device characterized by the above.

A processing system comprising a voice recognition device according to any one of claims 1 to 6 and a processing device that executes a predetermined process based on a result recognized by the voice recognition device,
The voice input means is also used by a user to input by voice a predetermined command that needs to be specified when the processing device performs processing,
The voice recognition device
Reads command template data, which is template data for recognizing the command, into the second storage means before performing the actual voice recognition process, or has it stored in advance in a read-only third storage means whose high-speed accessibility is relatively high like that of the second storage means,
Performs the preliminary determination by performing recognition using the prefecture name template data or the prefecture-specific template data in claim 4,
Thereafter performs recognition using the command template data and, in parallel with that recognition, reads the prefecture-specific template data corresponding to the result of the preliminary determination, or the city-specific template data in claim 4, into the second storage means and performs recognition using the read prefecture-specific template data or city-specific template data in claim 4, and
Determines the final recognition result from the recognition result obtained using the command template data and the recognition result obtained using the prefecture-specific template data or the city-specific template data in claim 4. A processing system characterized by the above.

A processing system comprising a voice recognition device according to any one of claims 1 to 6 and a navigation device that executes predetermined processing based on a result recognized by the voice recognition device,
The voice input means is used at least by a user to input by voice an instruction of predetermined place-name-related data that needs to be specified when the navigation device performs navigation processing,
The navigation device includes a current location detecting means for detecting a current location,
The voice recognition device
Performs the preliminary determination by performing recognition using the prefecture name template data or the prefecture-specific template data in claim 4,
Reads the prefecture-specific template data corresponding to the current location detected by the current location detecting means, or the city-specific template data in claim 4, into the second storage means and performs recognition using that prefecture-specific template data or city-specific template data in claim 4, and, in parallel with this recognition, reads the prefecture-specific template data corresponding to the result of the preliminary determination, or the city-specific template data in claim 4, into the second storage means and performs recognition using the read prefecture-specific template data or city-specific template data in claim 4, and
Of the recognition result obtained using the prefecture-specific template data corresponding to the current location or the city-specific template data in claim 4, and the recognition result obtained using the prefecture-specific template data corresponding to the result of the preliminary determination or the city-specific template data in claim 4, uses the one with the higher degree of accuracy as the final recognition result. A processing system characterized by the above.

A processing system comprising a voice recognition device according to any one of claims 1 to 6 and a navigation device that executes predetermined processing based on a result recognized by the voice recognition device,
The voice input means is used at least by a user to input by voice an instruction of predetermined place-name-related data that needs to be specified when the navigation device performs navigation processing,
The navigation device includes a current location detecting means for detecting a current location,
The voice recognition device
Reads in advance the prefecture-specific template data corresponding to the current location detected by the current location detecting means, or the city-specific template data in claim 4, into the second storage means,
Performs recognition using the prefecture name template data or the prefecture-specific template data in claim 4 together with the pre-read prefecture-specific template data or city-specific template data in claim 4; if the recognition result was obtained using the pre-read prefecture-specific template data or city-specific template data in claim 4, that result is used as the final recognition result, whereas if the recognition result was obtained using the prefecture name template data or the prefecture-specific template data in claim 4, the prefecture-specific template data corresponding to that recognition result, or the city-specific template data in claim 4, is read into the second storage means, and the recognition result obtained using the read prefecture-specific template data or city-specific template data in claim 4 is used as the final recognition result. A processing system characterized by the above.