JP3605308B2

JP3605308B2 - Voice recognition device and recording medium

Info

Publication number: JP3605308B2
Application number: JP05141299A
Authority: JP
Inventors: 智一森尾; 俊夫赤羽
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1999-02-26
Filing date: 1999-02-26
Publication date: 2004-12-22
Anticipated expiration: 2019-02-26
Also published as: JP2000250580A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識装置及び記録媒体に関し、詳しくは、入力された音響特徴量と音響特徴量辞書との間の尤度の計算において認識精度を落とすことなく負荷を軽減できる音声認識装置及び記録媒体に関する。
【０００２】
【従来の技術】
図７に、従来の一般的な音声認識装置の構成を示しており、音声の入力端子１０１、音響特徴量計算器１０２、尤度計算器１０３、音響特徴量辞書１０４、認識辞書検索器１０５、認識語彙辞書１０６、認識結果出力端子１０７で構成されている。
【０００３】
音声信号は入力端子１０１から入力され、予め定められた時間長（例えば１０ｍｓ）のフレームに分割され、フレーム毎に入力音声の特徴量を音響特徴量計算器１０２で計算する。音声認識で用いる特徴量としては、例えばパワースペクトルの形状を表現する帯域フィルターバンク出力やケプストラムパラメータなどが用いられている。
【０００４】
フレーム毎に計算された入力音声の音響特徴量に対し、予め作成された音響特徴量辞書１０４を用いて、音響特徴量辞書１０４の全ての状態毎に尤度計算器１０３によって尤度を計算する。
ここで尤度とは、入力音声の音響特徴量が、音響特徴量辞書の各状態の音響特徴量にどれだけ似ているかを表す指標値で、音響特徴量としてケプストラムを用いた場合は、入力音声のケプストラムと音響特徴量辞書の各状態毎に保持されているケプストラムとのケプストラム距離などが用いられている。
【０００５】
音響特徴量辞書１０４は、音声を予め定められた状態（単位）で分割し、状態毎に音響特徴量を保持している。音声の単位である状態の例としては、音素（例えば、‘ａ’、‘ｔ’）や、或いは時間的に前後の音素環境を考慮した３つ組音素（例えば、‘ａ；ｋ；ａ’、‘ｔ；ａ；ｋ’）、更に一つの音素を時間的に分割した単位を用いる方法がある。状態の数は単位の選び方によって変化し、音素を単位とした場合は数十、３つ組音素を単位に取れば数百程度の個数になる。
例えば、音響特徴量辞書が２５６の要素から構成されており、発声が１秒であれば１０ｍｓの分析周期で分析すれば１００フレームの音響特徴量が算出され、結局２５６×１００の尤度表が計算される。
【０００６】
次に、入力音声の発声が終了した時点で、認識語彙辞書１０６を用いて、認識辞書検索器１０５で発声内容を検索する。認識語彙辞書とは、認識対象となる語を前述した認識の単位で記述したものである。例えば「赤」を、上記説明の中で例示した三つ組み音素の単位で表現すると、‘−；ａ；ｋ’，‘ａ；ｋ；ａ’，‘ｋ；ａ；−’と表される。ここで‘−’は、無音状態を表す。
上記で計算された各状態毎の尤度の時系列が、認識対象語彙の内、どの語がもっともらしいかを検索する手法に、隠れマルコフモデルやビタビ検索の技術が使われている（「音声・音情報のディジタル信号処理」昭晃堂，ｐ．４２−７９参照）。
【０００７】
以上で説明したように、従来技術の音声認識装置においては、入力信号の全てのフレームに対して尤度計算を行っており、そのため、計算量が多いという問題があった。この問題に対処する方法として、特開平２―２３９２９１号公報には、音声の音響特徴量の変化量を時間的に調べ、その動的な特徴量が多い時点或いは極大となる点のフレームに対してのみ尤度計算するようにした技術が開示されている。これによって音素境界位置の候補数を絞り込むと共に、尤度計算量を削減することを意図している。
【０００８】
【発明が解決しようとする課題】
高い音声認識率を得るためには、短時間に音響特徴量が変化する破裂音などの情報を精度良く分析することと、音響特徴量辞書も音素環境を考慮して多くの状態数に分割することが望ましい。分析間隔が早くなり音響特徴量辞書の状態が多いと、尤度計算量が非常に多くなり、上記のように認識装置の実現コストが高くなるという問題が生ずる。
【０００９】
この問題を解決するために提案された前記特開平２−２３９２９１号公報の技術の場合、音素の句切り目を検出することを目的とし、瞬間的に変化の大きなフレーム或いは極大点のフレームの尤度を求めているため、安定した特徴量区間での尤度が算出できないという問題がある。さらにまた、音声によっては変化量が小さかったり、極大点が現れず、音響特徴量の変化が検知できず、尤度を求めることができない。結果として上記従来技術では、変化量が小さな区間で尤度計算を省略することにより、実際の尤度からかけ離れ、音声認識精度の低下につながるという問題があった。
本発明は、このような問題に鑑みてなされたものであって、その目的とするところは、尤度計算の負荷を軽減するとともに、音声認識の精度を上げることを可能にした音声認識装置及び記録媒体を提供することにある。
【００１２】
【課題を解決するための手段】
本発明の音声認識装置は、音声信号を予め定められた時間長のフレームに分割し、音声の特徴量を計算する音響特徴量計算器と、予め定められた基準に沿って音声を複数の状態に分類し、分類された状態毎に音響特徴量を保持している音響特徴量辞書と、入力された音声の音響特徴量と音響特徴量辞書の状態毎に尤度を計算する尤度計算器と、音声認識対象語を前述の状態を使って記述した認識語彙辞書と、先に計算した尤度計算結果を入力し認識語彙辞書の中から音声認識結果を計算する認識辞書検索器とを備えるものであって、音響特徴量の時間的な変化量を計算する変化量計算器と、複数のフレームに渡って前記変化量計算器で計算された変化量を蓄積する変化量メモリーと、その中から変化量の大きいものから予め定められた数のフレーム数を選択して、選択されたフレームのみ尤度計算を実行し、選択されなかったフレームに対しては、すでに計算された尤度の値を使うように制御する尤度計算フレーム選択器を備えるものである。
【００１３】
これにより、フレーム内の計算量の最大値をある一定量に抑えて、実時間処理装置に適した構成とすることができる。
また、前記尤度計算器は、尤度計算を省略したフレームの区間に対し、直前に計算した尤度と次に計算する尤度の平均の値を計算出力することで、変化量が小さな区間で尤度計算を省略しながら尤度を近似解として求めることができる。
また、前記尤度計算器は、尤度計算を省略したフレームの区間に対し、直前に計算した尤度と次に計算する尤度の傾斜値を計算出力することで、変化量が小さな区間で尤度計算を省略しながら尤度を近似解として求めることができる。
【００１４】
また、前記尤度計算器は、最後に尤度計算したフレームの次のフレームの尤度を計算し得られた結果をその後に続く尤度計算しないフレーム期間の尤度として出力することで、変化量が少なくなった尤度を用いて音声認識の精度を上げることが可能となる。
また、前記尤度計算器は、最後に尤度計算したフレームの次のフレーム及びその次に尤度計算するフレームの直前のフレームの尤度を計算し、その間のフレームに対し、両者の平均値又は傾斜値を計算出力することで、変化量が少なくなった尤度を用いて音声認識の精度を上げることが可能となる。
また、本発明は、コンピュータを上記音声認識装置として機能させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００１５】
【発明の実施の形態】
以下、添付図面を参照しながら本発明の好適な実施の形態について詳細に説明する。なお、図７と同一機能のものは同一符号で示している。
図１は、本発明の音声認識装置の第１実施の形態を示している。図７の回路構成に対して、構成要素として新しく付け加わったのは、入力信号の特徴量の変化を計算する変化量計算器１０８と、入力信号の特徴量の変化量を元に尤度計算器の動作を制御する尤度計算制御器１０９である。
【００１６】
音声信号は入力端子１０１から入力され、予め定められた時間長（例えば１０ｍｓ）のフレームに分割される。フレーム毎に入力音声の特徴量を音響特徴量計算器１０２で計算する。特徴量としては例えばケプストラムを用いる。フレーム毎に算出された音響特徴量は、変化量計算器１０８と尤度計算器１０３に入力される。変化量計算器１０８は入力音声信号の特徴量変化を検出することを目的とし、例えば直前のフレームで計算された音響特徴量と、現在フレームの音響特徴量の差から変化量を算出する。入力音声信号の状態（例えば音韻）が変化するとスペクトル形状などの音響特徴量が変化し、特徴量の変化は大きい。しかし、定常的な母音区間などでは比較的安定した音響特徴量が継続し、音響特徴量の変化量は小さくなる。
【００１７】
音響特徴量としてケプストラムを用いた場合、変化量としては、例えば直前のフレームのケプストラムと、現在フレームのケプストラムとのケプストラム距離を用いる。また音声認識で用いる音響特徴量として、一般に動的特徴量と呼ばれる特徴量の時間的変化の情報を用いる手法がある（音響特徴としてケプストラムを用いている場合、デルタケプストラムと呼ぶ特徴量を用いている）。この場合は変化量としては、この動的特徴量の大きさを算出しても良い。変化量計算器１０８で計算された特徴量の変化量は、尤度計算制御器１０９に入力され、予め設定した判定閾値と比較する。
【００１８】
変化量計算機１０８で計算された特徴量の変化を尤度計算制御器１０９の判定閾値と比較した結果、変化量が判定閾値より大きい場合と小さい場合とに分岐する。
変化量が判定閾値より大きい場合は、入力音声信号の状態変化が大きいので、尤度計算器１０３に音響特徴量計算器１０２で計算した入力信号の特徴量に対する尤度計算を実行させる。この場合の処理は従来と同様である。
【００１９】
次に、本発明の特徴となる処理の部分について述べる。変化量が判定閾値より小さい場合は、入力音声信号の状態があまり変化してないので、尤度計算結果も大きく変化しないことが予想される。この場合は尤度計算制御器１０９は、変化量が判定閾値より大きい場合とは異なる基準に基づいて尤度の値を求める。この基準は１個以上複数あってもよい。
【００２０】
例えば、尤度計算器１０３に対して、現在フレームの尤度計算を行わず、直前に計算した尤度計算結果を用いるという基準を適用する。音声信号には、破裂子音のように音響特徴量の変化が早い区間と、母音定常部のようにあまり変化しない区間がある。このように音響特徴量の変化が小さい区間で、尤度計算を行わないように制御することで計算量を削減できる。
上述したように、音響特徴量の変化が大きい区間で尤度の計算を行い、音響特徴量の小さい区間では尤度計算を行わないという基準に基づいて尤度計算を行う尤度計算制御の様子を図６に示している。
【００２１】
図６（Ａ）は入力音声信号のエネルギーを表しており、図６（Ｂ）は特徴量の変化量の大きさを表している。図６（Ｂ）の点線で示している閾値が変化量の判定閾値で、変化量がこの閾値より大きいフレームでは尤度計算を行い、閾値より小さいフレームでは最近に計算した尤度の値を用いる。図６（Ｃ）に尤度計算の実行状況を示している。目盛りの間隔がフレーム長を表しており、ｃの印が付いたフレームは、特徴量の変化量が判定閾値より大きく、尤度計算を実行したフレームであり、ｃの印の付いてないフレームは、尤度の値は直前に計算した尤度を用いるフレームである。
【００２２】
以降の処理は図７の従来技術で説明したものと同様で、入力音声の発声が終了した時点で、尤度計算を省略して求めた尤度表と認識語彙辞書１０６を用いて、認識辞書検索器１０５で発声内容を検索する。
図２は、本発明の音声認識装置の第２実施の形態を示している。第１実施の形態の図１に対して構成要素として新しく付け加わったのは、尤度計算制御器１０９の出力を入力し、変化量が小さい区間では尤度計算を間引いたフレーム間隔で計算するように制御する間引き計算制御器２０１のみである。第１実施の形態と異なる部分について以下に説明する。
【００２３】
母音など定常的な音響特徴量が複数のフレームに渡って継続する場合、隣接するフレームの音響特徴量の変化は小さいが、複数のフレームでは変化量が蓄積して変化量が大きくなるという場合がある。この場合は、尤度計算を行わないと認識率が低下する。そこで、第２実施の形態では、間引き計算制御器２０１は、尤度計算制御器１０９の出力を入力し、変化量が大きい区間では第１実施の形態と同様の方法で尤度計算器１０３に尤度計算を実行させる。
【００２４】
一方、尤度計算制御器１０９の出力の変化量が小さい区間では、処理フレーム数をカウントし、予め定められた間引きの間隔で間引いたフレームに対しては尤度計算器１０３で尤度計算を実行し、それ以外のフレームでは、尤度計算を行わず、直前に計算した尤度計算結果を用いるように制御する。
尤度計算制御の様子を図６（Ｄ）に示す。この例では間引き率２で制御しており、図６（Ｃ）と比べると、特徴量の変化量が判定閾値より小さい区間でも、２フレームに１回の割合で尤度計算を実行するようになっている。なお、間引き率は２に限定する必要はなく任意である。
【００２５】
図３は、本発明の音声認識装置の第３実施の形態を示している。図１に対して構成要素として新しく付け加わったのは、複数フレームの特徴量の変化量を記憶する変化量メモリー３０１と、この変化量メモリー３０１から、ある選択基準で尤度計算を行うフレームを選択する尤度計算フレーム選択器３０２である。以下、第１および第２実施の形態と異なる部分について説明する。
第１および第２実施の形態では、全ての入力フレームではなく、ある基準を満たしたフレームに対してのみ尤度計算を実行しており、処理全体としては計算量が削減する。しかしながら実時間処理する装置においては、予め定めた処理単位で計算量を一定に抑える必要がある。
【００２６】
第３実施の形態では、この課題に対処するために、予め定めた数（Ｍとする）のフレーム毎に処理を行い、このＭフレームの中から尤度計算を実行するフレーム（Ｎ、但しＮ＜Ｍ）を選択し、残りのフレームは最近に計算された尤度の値を用いるように制御する。
この選択の基準の例としては、変化量計算器１０８で計算された変化量をＭフレームに渡って変化量メモリー３０１記憶し、尤度計算フレーム選択器３０２は、そのＭ個の変化量の中で大きいものからＮフレーム選択する。
このように制御することで、Ｍフレーム内の計算量の最大値をある一定量に抑えられて、実時間処理装置に適した構成とすることができる。
【００２７】
図４は、本発明の音声認識装置の第４実施の形態を示している。該図における平均尤度計算器６０１は、図１乃至図３の尤度計算器１０３の代用をするものである。入力音声の特徴の変化量が閾値を越えるフレームでは従来通りの尤度計算を行い、計算を行った結果を平均尤度計算器６０１内のバッファメモリに蓄積する。このようにして得られた直前の尤度の値をＬとし、次に変化量が閾値を越え計算した尤度Ｎとすると、Ｌを計算したフレームとＮを計算したフレームとの間のフレームに対し（Ｌ＋Ｎ）／２の尤度を当てはめる。その他の処理は、第１〜第３実施の形態に準じる。
図６（Ｅ）に尤度計算制御の結果を示す。この結果は図６（Ｃ）に適用したもので、尤度を計算するのは図６（Ｃ）と同じとなっている。尤度計算を省略したフレームはその両端で求めた尤度の平均値になっている（図では煩雑になるのを避けるため最初の尤度無計算区間のみ記入している）。
【００２８】
図５は、本発明の音声認識装置の第５実施の形態を示している。該図における傾斜尤度計算器７０１は図１乃至図３の尤度計算器１０３の代用をするものである。入力音声の特徴の変化量が閾値を越えるフレームでは従来通りの尤度計算を行い、計算を行った結果を傾斜尤度計算器７０１内のバッファメモリに蓄積する。このようにして得られた直前の尤度の値をＬとし、次に変化量が閾値を越え計算した尤度Ｎとすると、Ｌを計算したフレームとＮを計算したフレームとの間のフレームの数Ｐをカウントし、これらＰ個のフレームに対しＬ＋（Ｎ−Ｌ）×ｍ／（Ｐ＋１）（ｍは１からＰの整数）の傾斜尤度を当てはめる。その他の処理は、第１〜第３実施の形態に準じる。
【００２９】
図６（Ｆ）に尤度計算制御の結果を示す。この結果は図６（Ｃ）に適用したもので、尤度を計算するのは図６（Ｃ）と同じとなっている。尤度計算を省略したフレームはその両端で求めた尤度の傾斜配分値になっている（図では煩雑になるのを避けるため最初の尤度無計算区間のみ記入している）。
以上の第１実施の形態〜第５実施の形態のほか、以下のような実施の形態も可能である。
【００３０】
図２において、音響特徴量が閾値を越えないフレームの場合、第１実施の形態では最後に尤度計算して得られた値をこれらのフレームの尤度として用いるようにしている。音響特徴量が閾値を越えないフレームの尤度として最後に尤度計算した値の代わりに、最後に尤度計算した次のフレームの尤度を計算し、この値を用いるようにすることができる。これにより、音響特徴量が閾値を越えないフレームの尤度として特徴量の変化の大きい尤度でなく、変化量が少なく定常的になった尤度を用いることで音声認識の精度を上げることが可能となる。このようにして求めた尤度計算制御の結果を図４（Ｇ）に示している。
【００３１】
さらにまた、図４または図５において、音響特徴量が閾値を越えないフレームの場合、図６（Ｇ）の実施の形態では最後に尤度計算して得られ変化量が閾値を越えなくなった最初のフレームでも尤度の計算をしているが、次に変化量が閾値を越え尤度計算する一つ前の変化量が閾値を越えないフレームでも尤度の計算をする。この間のフレームに対し、これら閾値を越えないフレームの両端で計算した尤度の平均値あるいは傾斜値を当てはめる。なお、平均尤度値あるいは傾斜尤度値は実施例４、或いは実施例５で述べた方法に基づく。このようにして求めた尤度計算制御の結果を図６（Ｈ）に示す。
【００３２】
なお、本発明は上記実施の形態に限定されるものではない。
上記各実施の形態では入力音声の特徴量変化の小さい区間に対し単一の尤度計算制御基準を用いているが、例えば複数の基準を組み合わせても実施は可能である。
また、本発明は、コンピュータを上記音声認識装置として機能させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体であってもよく、例えば、磁気テープ、ＣＤ−ＲＯＭ、ＩＣカード、ＲＡＭカード等のいかなるタイプの記録媒体であってもよい。
【００３３】
【発明の効果】
以上、詳述したように、本発明によれば、音響特徴量の時間的な変化量を計算する変化量計算器と、計算された変化量を予め定められた閾値と比較し、閾値を越える場合と越えない場合で複数の基準を用いて尤度の値を計算出力する尤度計算制御器を備えたので、前記尤度計算制御器の制御により、入力音声の特徴量変化の大きい区間に対して尤度計算を実行し、変化の小さい場合は例えば、最近の計算された尤度の値を使うように制御し、あるいは、尤度計算を間引きするような制御が可能となり、認識率の劣化を小さく抑えて、計算量を削減することができる。
【００３４】
また、入力音声の特徴量変化の大きい区間に対して尤度を求めるだけでなく、入力音声の特徴量変化の小さい区間或いは安定区間に対しても近似的な尤度を割り当てることができるため、一貫した入力音声の認識ができる。また、入力音声の特徴量変化の大小にかかわらず尤度を出すことができるので、音声によっては変化量が小さかったり、極大点が現れ無い場合でも尤度を求めることができる。したがって、本発明によれば、変化量が小さな区間でも尤度計算を省略しながら尤度を近似解として求めることができるため、音声認識精度を低下させることなく計算量を削減することができる。
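[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the first embodiment of the speech recognition apparatus of the present invention.
[FIG. 2] A block diagram showing the second embodiment of the speech recognition apparatus of the present invention.
[FIG. 3] A block diagram showing the third embodiment of the speech recognition apparatus of the present invention.
[FIG. 4] A block diagram showing the fourth embodiment of the speech recognition apparatus of the present invention.
[FIG. 5] A block diagram showing the fifth embodiment of the speech recognition apparatus of the present invention.
[FIG. 6] (A) to (H) are explanatory diagrams of the likelihood calculation control.
[FIG. 7] A diagram illustrating an example of a conventional speech recognition apparatus.
[Explanation of Reference Numerals]
101 Input terminal
102 Acoustic feature calculator
103 Likelihood calculator
104 Acoustic feature dictionary
105 Recognition dictionary searcher
106 Recognition vocabulary dictionary
107 Recognition result output terminal
108 Change amount calculator
109 Likelihood calculation controller
201 Thinning calculation controller
301 Change amount memory
302 Likelihood calculation frame selector
601 Average likelihood calculator
701 Slope likelihood calculator
[0001]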
[Technical Field of the Invention]
The present invention relates to a speech recognition device and a recording medium, and more particularly to a speech recognition device and a recording medium capable of reducing the load of calculating the likelihood between an input acoustic feature and an acoustic feature dictionary without lowering recognition accuracy.
[0002]
[Prior art]
FIG. 7 shows the configuration of a typical conventional speech recognition apparatus. It comprises a speech input terminal 101, an acoustic feature calculator 102, a likelihood calculator 103, an acoustic feature dictionary 104, a recognition dictionary searcher 105, a recognition vocabulary dictionary 106, and a recognition result output terminal 107.
[0003]
The audio signal is input from the input terminal 101, is divided into frames of a predetermined time length (for example, 10 ms), and the feature quantity of the input speech is calculated by the acoustic feature quantity calculator 102 for each frame. As the feature amount used in the speech recognition, for example, a band filter bank output expressing a power spectrum shape, a cepstrum parameter, and the like are used.
[0004]
For the acoustic features of the input speech calculated for each frame, the likelihood calculator 103 calculates a likelihood for every state of the acoustic feature dictionary 104, which is created in advance.
Here, the likelihood is an index value indicating how similar the acoustic feature of the input speech is to the acoustic feature of each state of the acoustic feature dictionary. When the cepstrum is used as the acoustic feature, for example, the cepstrum distance between the cepstrum of the input speech and the cepstrum held for each state of the acoustic feature dictionary is used.
[0005]
The acoustic feature dictionary 104 divides speech into predetermined states (units) and holds acoustic features for each state. Examples of states that serve as units of speech include phonemes (e.g., 'a', 't'), three-phoneme units that take into account the temporally preceding and succeeding phoneme environment (e.g., 'a;k;a', 't;a;k'), and units obtained by further dividing one phoneme in time. The number of states varies depending on how the unit is chosen: several tens when phonemes are used as the unit, and several hundred when three-phoneme units are used.
For example, if the acoustic feature dictionary is composed of 256 elements and the utterance lasts 1 second, analysis at a 10 ms period yields acoustic features for 100 frames, and a 256 × 100 likelihood table is calculated as a result.
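The likelihood table described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and it assumes the score is the negated Euclidean cepstrum distance so that a larger value means "more similar". Toy sizes stand in for the 256-state, 100-frame example.

```python
import math

def cepstral_distance(a, b):
    """Euclidean distance between two cepstrum vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def likelihood_table(frames, dictionary):
    """Return table[t][s]: score of frame t against dictionary state s.

    Assumption: the score is the negated cepstrum distance, so a larger
    value means the frame is more similar to that state.
    """
    return [[-cepstral_distance(f, state) for state in dictionary]
            for f in frames]

# Toy data: 4 dictionary states and 3 frames, 2 coefficients each.
dictionary = [[float(s), 0.0] for s in range(4)]
frames = [[0.1, 0.0], [2.9, 0.0], [3.0, 0.0]]

table = likelihood_table(frames, dictionary)
# Index of the best-scoring state for each frame.
best_states = [max(range(len(row)), key=row.__getitem__) for row in table]
```

The cost is one distance evaluation per (frame, state) pair, which is the quantity the invention sets out to reduce.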
[0006]
Next, when the utterance of the input speech is completed, the recognition dictionary searcher 105 searches for the utterance content using the recognition vocabulary dictionary 106. The recognition vocabulary dictionary describes the words to be recognized in the recognition units described above. For example, the word "aka" (red), expressed in the three-phoneme units exemplified above, is written as '-;a;k', 'a;k;a', 'k;a;-'. Here, '-' represents silence.
Hidden Markov models and Viterbi search techniques are used to determine which word in the recognition vocabulary is most plausible given the time series of likelihoods calculated for each state (see "Digital Signal Processing of Speech and Sound Information", Shokodo, pp. 42-79).
[0007]
As described above, the conventional speech recognition apparatus performs the likelihood calculation for all frames of the input signal, and therefore has the problem that the amount of calculation is large. As a method for coping with this problem, Japanese Patent Laid-Open No. 2-239291 discloses a technique that examines the amount of change in the acoustic feature of the speech over time and performs the likelihood calculation only for frames at points where this dynamic feature is large or reaches a local maximum. This is intended to narrow down the number of candidate phoneme boundary positions and to reduce the amount of likelihood calculation.
[0008]
[Problems to be solved by the invention]
In order to obtain a high speech recognition rate, it is desirable to accurately analyze information such as plosives, whose acoustic features change in a short time, and to divide the acoustic feature dictionary into a large number of states in consideration of the phoneme environment. However, if the analysis interval is shortened and the number of states in the acoustic feature dictionary is large, the amount of likelihood calculation becomes extremely large, and the cost of realizing the recognition device increases, as noted above.
[0009]
The technique of Japanese Patent Laid-Open No. 2-239291, proposed to solve this problem, aims at detecting the boundaries between phonemes and obtains the likelihood only for frames with a large instantaneous change or frames at local maxima. It therefore cannot calculate the likelihood in sections where the feature amount is stable. Furthermore, depending on the speech, the amount of change may be small or no local maximum may appear, in which case the change in the acoustic feature cannot be detected and the likelihood cannot be obtained. As a result, the above related art omits the likelihood calculation in sections where the amount of change is small, so the values it uses deviate from the actual likelihood, which leads to a reduction in speech recognition accuracy.
The present invention has been made in view of these problems, and its object is to provide a speech recognition device and a recording medium that reduce the load of likelihood calculation while making it possible to increase the accuracy of speech recognition.
[0012]
[Means for Solving the Problems]
The speech recognition device of the present invention comprises: an acoustic feature calculator that divides a speech signal into frames of a predetermined time length and calculates speech features; an acoustic feature dictionary that classifies speech into a plurality of states according to a predetermined criterion and holds acoustic features for each classified state; a likelihood calculator that calculates the likelihood between the acoustic features of the input speech and each state of the acoustic feature dictionary; a recognition vocabulary dictionary that describes the words to be recognized using those states; and a recognition dictionary searcher that receives the calculated likelihoods and determines the speech recognition result from the recognition vocabulary dictionary. The device further comprises: a change amount calculator that calculates the temporal change amount of the acoustic feature; a change amount memory that stores the change amounts calculated by the change amount calculator over a plurality of frames; and a likelihood calculation frame selector that selects a predetermined number of frames from the stored change amounts in descending order of change amount, executes the likelihood calculation only for the selected frames, and controls the unselected frames to use already calculated likelihood values.
[0013]
As a result, the maximum amount of calculation within the frames can be held to a fixed amount, giving a configuration suited to real-time processing devices.
Further, for a section of frames in which the likelihood calculation is omitted, the likelihood calculator calculates and outputs the average of the likelihood calculated immediately before and the likelihood calculated next. The likelihood can thus be obtained as an approximate solution while the likelihood calculation is omitted in sections where the amount of change is small.
Further, for a section of frames in which the likelihood calculation is omitted, the likelihood calculator calculates and outputs a slope value between the likelihood calculated immediately before and the likelihood calculated next. The likelihood can thus be obtained as an approximate solution while the likelihood calculation is omitted in sections where the amount of change is small.
[0014]
Further, the likelihood calculator calculates the likelihood of the frame following the last frame for which the likelihood was calculated, and outputs the result as the likelihood for the subsequent period of frames in which the likelihood is not calculated. Using a likelihood taken after the amount of change has become small makes it possible to improve the accuracy of speech recognition.
Further, the likelihood calculator calculates the likelihood of the frame following the last frame for which the likelihood was calculated and the likelihood of the frame immediately before the next frame for which the likelihood is calculated, and outputs the average value or the slope value of the two for the frames in between. Using likelihoods taken after the amount of change has become small makes it possible to improve the accuracy of speech recognition.
Further, the present invention is a computer-readable recording medium in which a program for causing a computer to function as the voice recognition device is recorded.
[0015]
[Embodiments of the Invention]
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The components having the same functions as those in FIG. 7 are denoted by the same reference numerals.
FIG. 1 shows a first embodiment of the speech recognition apparatus of the present invention. Compared with the circuit configuration of FIG. 7, the newly added components are a change amount calculator 108, which calculates the change in the feature amount of the input signal, and a likelihood calculation controller 109, which controls the operation of the likelihood calculator based on that change amount.
[0016]
The audio signal is input from the input terminal 101 and is divided into frames having a predetermined time length (for example, 10 ms). The feature quantity of the input speech is calculated by the acoustic feature quantity calculator 102 for each frame. For example, cepstrum is used as the feature amount. The acoustic feature amount calculated for each frame is input to the change amount calculator 108 and the likelihood calculator 103. The change amount calculator 108 aims to detect a change in the feature amount of the input audio signal, and calculates a change amount from, for example, a difference between the sound feature amount calculated in the immediately preceding frame and the sound feature amount in the current frame. When the state (for example, phoneme) of the input speech signal changes, the acoustic feature such as the spectrum shape changes, and the change in the feature is large. However, in a steady vowel section or the like, a relatively stable acoustic feature value continues, and the change amount of the acoustic feature value decreases.
[0017]
When the cepstrum is used as the acoustic feature amount, the change amount is, for example, the cepstrum distance between the cepstrum of the immediately preceding frame and the cepstrum of the current frame. In addition, some methods use, as an acoustic feature for speech recognition, information on the temporal change of a feature amount, generally called a dynamic feature (when the cepstrum is used as the acoustic feature, this is the feature called the delta cepstrum). In that case, the magnitude of the dynamic feature may be calculated as the change amount. The change amount calculated by the change amount calculator 108 is input to the likelihood calculation controller 109 and compared with a preset determination threshold.
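The frame-to-frame change amount can be sketched as below. The function name is hypothetical, Euclidean distance is assumed as the cepstrum distance, and the treatment of the first frame (an infinite change amount, so that a likelihood is always calculated there) is an assumption the text does not specify.

```python
import math

def change_amounts(cepstra):
    """Change amount per frame: distance to the previous frame's cepstrum.

    Assumption: the first frame has no predecessor, so it is given an
    infinite change amount to force a likelihood calculation there.
    """
    changes = [math.inf]
    for prev, cur in zip(cepstra, cepstra[1:]):
        changes.append(math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur))))
    return changes

# Two steady frames, a jump, then steady again.
cepstra = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
changes = change_amounts(cepstra)
```

Only the frame where the cepstrum jumps produces a large change amount; the steady frames on either side produce zero.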
[0018]
The change in the feature amount calculated by the change amount calculator 108 is compared with the determination threshold of the likelihood calculation controller 109, and the process branches depending on whether the change amount is larger or smaller than the threshold.
If the change is larger than the determination threshold, the state change of the input voice signal is large, and the likelihood calculator 103 is caused to execute the likelihood calculation for the feature of the input signal calculated by the acoustic feature calculator 102. The processing in this case is the same as the conventional one.
[0019]
Next, a part of the processing which is a feature of the present invention will be described. If the amount of change is smaller than the determination threshold, the state of the input speech signal has not changed much, and it is expected that the likelihood calculation result will not change significantly. In this case, the likelihood calculation controller 109 obtains the likelihood value based on a criterion different from the case where the amount of change is larger than the determination threshold. There may be one or more criteria.
[0020]
For example, a criterion that the likelihood calculator 103 does not perform the likelihood calculation of the current frame but uses the likelihood calculation result calculated immediately before is used. The voice signal includes a section in which the change of the acoustic feature amount is fast, such as a plosive consonant, and a section, such as a vowel stationary part, which does not change much. As described above, in a section where the change in the acoustic feature amount is small, the calculation amount can be reduced by controlling so that the likelihood calculation is not performed.
FIG. 6 shows the likelihood calculation control based on the criterion described above, in which the likelihood is calculated in sections where the change in the acoustic feature amount is large and is not calculated in sections where the change is small.
[0021]
FIG. 6A shows the energy of the input speech signal, and FIG. 6B shows the magnitude of the change in the feature amount. The dotted line in FIG. 6B is the determination threshold for the change amount. The likelihood calculation is performed for frames whose change amount is larger than this threshold, and the most recently calculated likelihood value is used for frames whose change amount is smaller. FIG. 6C shows when the likelihood calculation is executed. The interval between the graduations represents the frame length. A frame marked with c is one whose change amount exceeded the determination threshold and for which the likelihood calculation was executed; a frame without the c mark uses the likelihood calculated immediately before.
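The threshold control of the first embodiment can be sketched as follows. The names are hypothetical; `compute` stands in for the likelihood calculator and returns a single number here, whereas a real system would return one value per dictionary state. Calculating at the very first frame regardless of its change amount is an assumption.

```python
def gated_likelihoods(features, changes, threshold, compute):
    """Compute likelihoods only where the change exceeds the threshold.

    `compute` is a hypothetical callback mapping one frame's features to
    its likelihood. Returns (likelihoods, computed_flags).
    """
    likelihoods, computed = [], []
    last = None
    for feat, change in zip(features, changes):
        if last is None or change > threshold:
            last = compute(feat)        # frames marked "c" in FIG. 6C
            computed.append(True)
        else:
            computed.append(False)      # reuse the most recent value
        likelihoods.append(last)
    return likelihoods, computed

calls = []

def toy_compute(x):
    """Toy likelihood that also records which frames were evaluated."""
    calls.append(x)
    return -abs(x - 5.0)

features = [1.0, 1.1, 4.0, 4.1, 4.2]
changes = [0.0, 0.1, 2.9, 0.1, 0.1]
scores, flags = gated_likelihoods(features, changes, threshold=1.0,
                                  compute=toy_compute)
```

In this toy run only two of the five frames reach the likelihood calculator; the other three inherit the preceding value.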
[0022]
Subsequent processing is the same as that described for the prior art of FIG. 7: when the utterance of the input speech is completed, the recognition dictionary searcher 105 searches for the utterance content using the recognition vocabulary dictionary 106 and the likelihood table obtained with part of the likelihood calculation omitted.
FIG. 2 shows a second embodiment of the speech recognition apparatus of the present invention. The only component newly added to FIG. 1 of the first embodiment is a thinning calculation controller 201, which receives the output of the likelihood calculation controller 109 and, in sections where the change amount is small, controls the likelihood calculation so that it is performed at thinned-out frame intervals. The parts that differ from the first embodiment are described below.
[0023]
When a steady acoustic feature such as a vowel continues over a plurality of frames, the change in the acoustic feature between adjacent frames is small, but the change can accumulate over a plurality of frames and become large. In this case, the recognition rate decreases if the likelihood calculation is not performed. Therefore, in the second embodiment, the thinning calculation controller 201 receives the output of the likelihood calculation controller 109 and, in sections where the change amount is large, causes the likelihood calculator 103 to execute the likelihood calculation in the same manner as in the first embodiment.
[0024]
On the other hand, in sections where the change amount output by the likelihood calculation controller 109 is small, the number of processed frames is counted. The likelihood calculator 103 executes the likelihood calculation for the frames picked out at a predetermined thinning interval; for the other frames, the likelihood calculation is not performed and the likelihood calculated immediately before is used.
This likelihood calculation control is shown in FIG. 6D. In this example the control uses a thinning rate of 2, so that, compared with FIG. 6C, the likelihood calculation is executed once every two frames even in sections where the change in the feature amount is smaller than the determination threshold. The thinning rate need not be limited to 2 and may be chosen freely.
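The thinning control of the second embodiment can be sketched as a schedule of frames to calculate. The function name is hypothetical, and restarting the frame counter after each above-threshold frame is an assumption; the text only says that processed frames are counted in the low-change section.

```python
def thinning_schedule(changes, threshold, interval=2):
    """Return True for each frame whose likelihood should be computed.

    Frames above the threshold are always computed. In a below-threshold
    run, a counter is kept and every `interval`-th frame is computed.
    Assumption: the counter restarts after each above-threshold frame.
    """
    schedule, count = [], 0
    for change in changes:
        if change > threshold:
            schedule.append(True)
            count = 0
        else:
            count += 1
            schedule.append(count % interval == 0)
    return schedule

changes = [2.0, 0.1, 0.1, 0.1, 0.1, 3.0, 0.1]
schedule = thinning_schedule(changes, threshold=1.0, interval=2)
```

With `interval=2`, every second frame of the quiet run is still evaluated, which bounds how far accumulated drift can go unnoticed.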
[0025]
FIG. 3 shows a third embodiment of the speech recognition apparatus of the present invention. The components newly added to FIG. 1 are a change amount memory 301, which stores the change amounts of the feature amounts of a plurality of frames, and a likelihood calculation frame selector 302, which selects from the change amount memory 301, according to a selection criterion, the frames for which the likelihood calculation is performed. The parts that differ from the first and second embodiments are described below.
In the first and second embodiments, the likelihood calculation is performed only on a frame that satisfies a certain criterion, instead of all input frames, and the amount of calculation as a whole is reduced. However, in an apparatus that performs real-time processing, it is necessary to keep the amount of calculation constant in predetermined processing units.
[0026]
In the third embodiment, to address this problem, processing is performed in blocks of a predetermined number of frames (M). From these M frames, the frames for which the likelihood calculation is executed (N of them, where N < M) are selected, and the remaining frames are controlled to use the most recently calculated likelihood value.
As an example of the selection criterion, the change amounts calculated by the change amount calculator 108 are stored in the change amount memory 301 over M frames, and the likelihood calculation frame selector 302 selects the N frames with the largest of these M change amounts.
By performing such control, the maximum value of the amount of calculation in the M frames can be suppressed to a certain fixed amount, and a configuration suitable for a real-time processing device can be obtained.
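The block-wise selection of the third embodiment can be sketched as follows. The function name is hypothetical; breaking ties in favour of the earlier frame and the handling of a final block shorter than M are assumptions not covered by the text.

```python
def select_frames(changes, m, n):
    """For each block of m frames, mark the n frames with the largest change.

    Assumptions: ties are broken in favour of the earlier frame, and a
    final block shorter than m still selects up to n frames.
    """
    selected = [False] * len(changes)
    for start in range(0, len(changes), m):
        block = range(start, min(start + m, len(changes)))
        # sorted() is stable, so equal changes keep their time order.
        ranked = sorted(block, key=lambda i: changes[i], reverse=True)
        for i in ranked[:n]:
            selected[i] = True
    return selected

changes = [0.1, 0.9, 0.2, 0.8, 0.5, 0.4, 0.3, 0.6]
selected = select_frames(changes, m=4, n=2)
```

Because exactly N frames are chosen in every full block, the per-block cost is fixed in advance regardless of how the speech behaves, unlike the threshold-based control.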
[0027]
FIG. 4 shows a fourth embodiment of the speech recognition apparatus of the present invention. The average likelihood calculator 601 in this figure takes the place of the likelihood calculator 103 in FIGS. 1 to 3. For frames in which the change in the feature of the input speech exceeds the threshold, the likelihood calculation is performed as before, and the result is stored in a buffer memory in the average likelihood calculator 601. Let L be the most recent likelihood value obtained in this way and N the likelihood calculated the next time the change amount exceeds the threshold. The likelihood (L + N) / 2 is then applied to the frames between the frame where L was calculated and the frame where N was calculated. Other processing follows the first to third embodiments.
FIG. 6E shows the result of the likelihood calculation control. This result is applied to FIG. 6C, and the calculation of the likelihood is the same as that of FIG. 6C. The frame for which the likelihood calculation is omitted has the average value of the likelihoods calculated at both ends thereof (only the first likelihood non-calculation section is shown in the figure to avoid complication).
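The averaging of the fourth embodiment can be sketched as a fill over a buffered sequence. The function name and the use of `None` for skipped frames are illustrative choices, and copying the nearest computed value into a run that lacks a computed neighbour on one side is an assumption, since the text does not cover that edge case.

```python
def fill_with_average(values):
    """Replace each run of None between two computed values by their mean.

    `values` holds a likelihood where one was computed and None elsewhere.
    Assumption: a run with no computed value on one side copies the other
    side's value (the text does not cover that edge case).
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        mean = (values[left] + values[right]) / 2   # (L + N) / 2
        for i in range(left + 1, right):
            out[i] = mean
    if known:
        for i in range(0, known[0]):
            out[i] = values[known[0]]
        for i in range(known[-1] + 1, len(values)):
            out[i] = values[known[-1]]
    return out

filled = fill_with_average([-4.0, None, None, -2.0, None])
```

Since N is only known once the change amount next exceeds the threshold, the skipped frames have to be held back until then, which is consistent with the buffer memory mentioned in the text.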
[0028]
FIG. 5 shows a fifth embodiment of the speech recognition apparatus of the present invention. The slope likelihood calculator 701 in this figure takes the place of the likelihood calculator 103 in FIGS. 1 to 3. For frames in which the change in the feature of the input speech exceeds the threshold, the likelihood calculation is performed as before, and the result is stored in a buffer memory in the slope likelihood calculator 701. Let L be the most recent likelihood value obtained in this way and N the likelihood calculated the next time the change amount exceeds the threshold. The number P of frames between the frame where L was calculated and the frame where N was calculated is counted, and the slope likelihood L + (N - L) × m / (P + 1) (where m is an integer from 1 to P) is applied to these P frames. Other processing follows the first to third embodiments.
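The slope likelihood of the fifth embodiment is a linear interpolation between the two computed values, and can be sketched as below. The function name and the use of `None` for skipped frames are illustrative; frames before the first or after the last computed value are left as `None` here, an assumption, since the formula needs both L and N.

```python
def fill_with_slope(values):
    """Fill each run of None with L + (N - L) * m / (P + 1), m = 1..P.

    `values` holds a likelihood where one was computed and None elsewhere.
    Assumption: frames before the first or after the last computed value
    are left as None, because the formula needs both L and N.
    """
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        l_val, n_val = values[left], values[right]
        p = right - left - 1                  # number of skipped frames P
        for m in range(1, p + 1):
            out[left + m] = l_val + (n_val - l_val) * m / (p + 1)
    return out

sloped = fill_with_slope([-4.0, None, None, None, 0.0])
```

Here P = 3, so the three skipped frames receive values one quarter, one half, and three quarters of the way from L to N.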
[0029]
FIG. 6F shows the result of the likelihood calculation control. This result is applied to FIG. 6C, and the calculation of the likelihood is the same as that of FIG. 6C. Frames for which the likelihood calculation has been omitted have the likelihood gradient distribution values obtained at both ends thereof (only the first likelihood non-calculation section is shown in the figure to avoid complication).
In addition to the above-described first to fifth embodiments, the following embodiments are also possible.
[0030]
In FIG. 2, for frames in which the change in the acoustic feature amount does not exceed the threshold, the first embodiment uses the most recently calculated likelihood as the likelihood of these frames. Instead of that value, the likelihood of the frame following the last frame for which the likelihood was calculated may itself be calculated and used. In this way, the likelihood assigned to below-threshold frames is not one taken where the feature amount is changing greatly but one taken after the change has become small and the feature steady, which makes it possible to improve the accuracy of speech recognition. The result of the likelihood calculation control obtained in this way is shown in FIG. 6G.
[0031]
Furthermore, in FIG. 4 or FIG. 5, for frames in which the change in the acoustic feature amount does not exceed the threshold, the embodiment of FIG. 6G already calculates the likelihood at the first frame after the change amount stops exceeding the threshold. In addition, the likelihood is also calculated at the last below-threshold frame, that is, the frame immediately before the change amount next exceeds the threshold and the likelihood is calculated. The average value or the slope value of the likelihoods calculated at these two ends of the below-threshold run is applied to the frames in between. The average likelihood value and the slope likelihood value are obtained by the methods described in the fourth and fifth embodiments. FIG. 6H shows the result of the likelihood calculation control obtained in this way.
[0032]
Note that the present invention is not limited to the above embodiment.
In each of the above embodiments, a single likelihood calculation control criterion is used for a section in which a change in the feature amount of the input speech is small. However, the present invention can be implemented by combining a plurality of criteria, for example.
Further, the present invention may be a computer-readable recording medium on which a program for causing a computer to function as the speech recognition device is recorded. The recording medium may be of any type, for example a magnetic tape, a CD-ROM, an IC card, or a RAM card.
[0033]
[Effects of the invention]
As described above in detail, the present invention provides a change amount calculator that calculates the temporal change amount of the acoustic feature, and a likelihood calculation controller that compares the calculated change amount with a predetermined threshold value and calculates and outputs the likelihood value using different criteria depending on whether or not the threshold value is exceeded. When the change is large, the likelihood calculation is executed. When the change is small, control such as using the most recently calculated likelihood value, or thinning out the likelihood calculation, becomes possible. Deterioration of the recognition accuracy can therefore be kept small while the amount of calculation is reduced.
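The basic control summarized above can be sketched as follows. The sketch assumes that the change amount is the Euclidean distance between the feature vectors of consecutive frames, which is only one possible measure. The names `change_amounts`, `control_reuse_latest`, and `score` are hypothetical; `score(frame)` stands in for the likelihood calculation against one dictionary state.

```python
import math


def change_amounts(features):
    """Distance between each frame's feature vector and the previous frame's."""
    changes = [math.inf]  # assumption: force a calculation on the first frame
    for prev, cur in zip(features, features[1:]):
        changes.append(math.dist(prev, cur))
    return changes


def control_reuse_latest(features, threshold, score):
    """Calculate when the change exceeds the threshold, else reuse the latest value."""
    out = []
    latest = None
    for frame, change in zip(features, change_amounts(features)):
        if change > threshold:
            latest = score(frame)  # large change: execute the calculation
        out.append(latest)         # small change: latest calculated value
    return out
```

With four frames of which only the first and third differ strongly from their predecessors, `score` is called twice and the remaining two frames reuse the stored value.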
[0034]
In addition, the likelihood is not only calculated for sections where the change in the feature amount of the input speech is large; an approximate likelihood can also be assigned to sections where the change is small or the speech is stable, so the input speech can be recognized consistently. Moreover, since a likelihood is obtained regardless of the magnitude of the change in the feature amount, it is obtained even for utterances in which the change is small or no maximum point appears. Therefore, according to the present invention, the likelihood is obtained as an approximate solution in sections where the amount of change is small while its calculation is omitted there, so the amount of calculation can be reduced without lowering the speech recognition accuracy.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a first embodiment of a speech recognition device of the present invention.
FIG. 2 is a block diagram showing a second embodiment of the speech recognition apparatus of the present invention.
FIG. 3 is a block diagram showing a third embodiment of the speech recognition device of the present invention.
FIG. 4 is a block diagram showing a fourth embodiment of the speech recognition device of the present invention.
FIG. 5 is a block diagram showing a fifth embodiment of the speech recognition apparatus of the present invention.
FIGS. 6A to 6H are explanatory diagrams of likelihood calculation control.
FIG. 7 is a view for explaining an embodiment of a conventional speech recognition apparatus.
[Explanation of symbols]
101 input terminal
102 acoustic feature calculator
103 likelihood calculator
104 acoustic feature dictionary
105 recognition dictionary searcher
106 recognition vocabulary dictionary
107 recognition result output terminal
108 change amount calculator
109 likelihood calculation controller
201 thinning calculation controller
301 change amount memory
302 likelihood calculation frame selector
601 average likelihood calculator
701 slope likelihood calculator

Claims

A speech recognition device comprising: an acoustic feature calculator that divides a speech signal into frames of a predetermined time length and calculates a feature amount of the speech; an acoustic feature dictionary that classifies speech into a plurality of states according to a predetermined criterion and holds an acoustic feature for each classified state; a likelihood calculator that calculates a likelihood between the acoustic feature of input speech and each state of the acoustic feature dictionary; and a recognition vocabulary dictionary in which words to be recognized are described using the states, the speech recognition device further comprising: a change amount calculator that calculates a temporal change amount of the feature amount; a change amount memory that stores the change amounts calculated by the change amount calculator over a plurality of frames; and a likelihood calculation frame selector that selects, from the plurality of frames, a predetermined number of frames in descending order of change amount, and performs control so that the likelihood calculation is performed only on the selected frames and an already calculated likelihood value is used for the non-selected frames.

The speech recognition device, wherein the likelihood calculator calculates and outputs, for a section of frames for which the likelihood calculation was omitted, the average value of the likelihood calculated immediately before and the likelihood calculated next.

The speech recognition device according to claim 1, wherein the likelihood calculator calculates and outputs, for a section of frames for which the likelihood calculation was omitted, a slope value obtained from the likelihood calculated immediately before and the likelihood calculated next.

The speech recognition device according to claim 1, wherein the likelihood calculator calculates the likelihood of the frame next to the frame at which the likelihood calculation was last performed, and outputs the result as the likelihood of the subsequent section of frames for which the likelihood calculation is not performed.

The speech recognition device according to claim 1, wherein the likelihood calculator calculates the likelihood of the frame next to the frame at which the likelihood calculation was last performed and the likelihood of the frame immediately before the frame at which the likelihood calculation is next performed, and calculates and outputs a value based on these likelihoods.

A computer-readable recording medium on which a program for causing a computer to function as the speech recognition device according to claim 1 is recorded.