JP3698511B2 - Speech recognition method

Info

Publication number: JP3698511B2
Application number: JP33059596A
Authority: JP
Inventors: 正晃伊達
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1996-12-11
Filing date: 1996-12-11
Publication date: 2005-09-21
Anticipated expiration: 2016-12-11
Also published as: JPH10171489A

Description

【０００１】
【発明の属する技術分野】
本発明は音声認識装置に用いる音声認識方法に関するものである。
【０００２】
【従来の技術】
文献名（１）：渡辺隆夫、塚田聡 “音節認識を用いたゆう度補正による未知発話のリジェクション” 電子情報通信学会論文誌 D-II Vol.J75-D-II No.12 1992 年12月 PP.2002-2009
文献名（２）：大河内正明“Hidden Markov Model に基づいた音声認識” 日本音響学会誌４２巻１２号 (1986)
音声認識装置では、高い認識精度とリアルタイム処理を実現するため、装置が受理できる単語や文法規則等をあらかじめ規定することによって、認識対象を制約して認識処理を行う。しかし、利用者が実際に装置を使用する場合には、認識対象外の発話、言い誤り、言い直し等は避けられない。そこで、ある発話に対する認識結果の信頼性が低い場合には、発話を棄却するリジェクト機能が必要になる。
【０００３】
リジェクト機能を付加するための方法として、従来、上記文献（１）に開示される方法がある。この方法では、音声を表現するモデル（一般に、音響モデルと呼ばれる）として、音素や音節などのサブワード単位のＨＭＭ（Hidden Markov Model ：隠れマルコフモデル）を用いることを前提としている。ＨＭＭを用いた音声認識方法の詳細については、上記文献（２）に開示されている。サブワードモデルを連結することによって、認識対象として規定された単語や文などの発話内容の仮説に対するモデルを構成し、各仮説に対するモデルが入力音声データを生成する確率（ゆう度）を計算する。この計算により、最大ゆう度を与えるモデルに対応する仮説を認識結果とする。
【０００４】
この認識手法にリジェクト機能を付加するためには、以上のような認識対象を制約して行うゆう度計算（認識処理）の他に、入力音声を任意の音素列あるいは音節列として認識するためのゆう度計算を行う。それぞれのゆう度計算の結果得られた最大ゆう度の差を求め、しきい値判定により入力発話のリジェクト判定を行う。
【０００５】
【発明が解決しようとする課題】
しかしながら、以上述べた従来方法では、以下に述べる問題がある。
【０００６】
（ａ）トライフォンモデル等のコンテキスト依存音素モデルは、音素コンテキストに依存した異音を表現でき、比較的高い認識精度が得られるため、音響モデルとしてよく用いられる。しかし、音響モデルとして、このトライフォンモデル等のコンテキスト依存音素（あるいは音節）モデルを用いる場合、リジェクト機能を付加すると、処理量が大幅に増加する。このため、リジェクト機能を付加することが困難であった。また、自然な発話に対する音声認識装置の頑健性を向上させるには、入力発話中の不要語や未知語に対処する必要があるが、音声認識に対する処理量と精度において十分な性能を得ることは困難であった。
【０００７】
（ｂ）音響モデルとして、音素や音節などのサブワード単位のモデルを用いない場合（例えば、単語や文節などの単位を用いる場合）、前述した従来の方法は適用できない。
【０００８】
（ｃ）入力発話の一部に不要語や未知語を含む場合、認識のための処理量の増加を抑えた状態で、不要語や未知語部分を検出し、それ以外の発話部分を精度よく認識することは困難である。上記（ａ）と同様に、音響モデルとしてコンテキスト依存音素（あるいは音節）モデルを用いる場合は、特に処理量が大幅に増加する。
【０００９】
【課題を解決するための手段】
上記課題を解決するために、本発明に係る音声認識方法は、入力された音声データを認識して処理する音声認識方法において、音響モデルを構成するＨＭＭの任意の状態間の遷移のしやすさを表す状態遷移制約情報を作成する参照テーブル作成手段と、入力音声データに対する認識結果候補と共に認識ゆう度及び局所ゆう度を算出する認識処理手段と、入力音声データの棄却判定に用いる参照ゆう度を算出する参照ゆう度算出手段とを備え、上記参照テーブル作成手段で、音響モデルを構成するＨＭＭの状態に対するクラスタリングを行い、生成された状態クラスタにおける各状態間の遷移接続に基づいて状態クラスタ間の遷移確率を算出し、ＨＭＭの各状態がどの状態クラスタに属するかを示すヘッダ情報を付加して状態遷移制約情報とすると共に、上記認識処理手段で算出した局所ゆう度及び上記参照テーブル作成手段で作成した状態遷移制約情報を用いて参照ゆう度算出手段により参照ゆう度を算出し、この参照ゆう度と上記認識処理手段で算出した認識ゆう度との比較により、入力音声データの棄却判定を行うことを特徴とする。
【００１０】
以上のように、局所ゆう度と状態遷移制約情報を用いて参照ゆう度を算出することで、参照ゆう度の算出に要する演算が加算と大小比較だけになり、棄却判定機能の付加による処理量の増加を小さくすることができる。また、参照ゆう度と認識ゆう度との比較により、入力音声データの棄却判定を行うため、認識のための処理量をほとんど増加させることなく、効率的に棄却判定を行うことが可能になる。
【００１１】
また、他の発明に係る音声認識方法は、入力された音声データを認識して処理する音声認識方法において、音響モデルを構成するＨＭＭの任意の状態間の遷移のしやすさを表す状態遷移制約情報を作成する参照テーブル作成手段と、入力音声データに対する認識結果と共に局所ゆう度及び部分仮説累積ゆう度を算出する認識処理手段と、入力音声データ中の不要語または未知語を処理するために用いる不要語仮説累積ゆう度を算出する不要語処理手段とを備え、上記認識処理手段で算出した局所ゆう度及び部分仮説累積ゆう度と、上記参照テーブル作成手段で作成した状態遷移制約情報とを用いて、不要語処理手段により不要語仮説累積ゆう度を算出し、この不要語仮説累積ゆう度を用いて入力音声データ中の不要語または未知語の部分を検出し、それ以外の部分の認識を行うことを特徴とする。
【００１２】
これにより、不要語仮説ゆう度の算出に要する演算は加算と大小比較だけになり、不要語処理の付加による処理量の増加を最小限に抑えることができる。
【００１３】
【発明の実施の形態】
次に本発明の実施形態を添付図面に基づいて説明する。
【００１４】
［第１の実施形態］
［構成及び機能］
本実施形態に係る音声認識方法に用いる音声認識装置の基本構成を図１に示す。
【００１５】
図１中の１０は音声分析部である。この音声分析部１０は、入力音声データD10 を音響特徴パラメータ時系列D11 に変換する。具体的には、ＬＰＣ（Linear Predictive Coding）分析等の分析手法を用いて、入力音声データD10 を数ｍｓ〜数十ｍｓ程度の短時間周期（以後「フレーム」と呼ぶ）毎の音響特徴パラメータに変換する。ここで音響特徴パラメータとは、音声データのスペクトル包絡情報を表現するパラメータで、例えば、ケプストラム（対数スペクトルを逆フーリエ変換した量）やその時間変化量などである。フレーム単位に得られる音響特徴パラメータを音響特徴パラメータ時系列D11 とする。音声分析部１０で変換された音響特徴パラメータ時系列D11 は、認識処理部１３に入力される。なお、上記入力音声データD10 は、マイクロフォンなどから入力された音声（アナログ信号）をディジタル信号に変換した信号である。
【００１６】
１１は音響モデルである。この音響モデル１１は、音声を表現するモデル（ＨＭＭ）の集合である。モデルの言語的な単位としては、音声の任意の構成要素（音素、音節、単語、文節など）を採用することが可能である。また、音素や音節などのサブワードを単位として採用した場合、コンテキスト独立／依存のどちらのモデルでも使用することができる。つまり、リジェクト機能を付加するために使用する音響モデルが制限されることはない。本実施形態では、例としてトライフォンモデルを使用する場合について説明する。トライフォンモデルはコンテキスト依存音素モデルで、各々の音素に対して前後の音素コンテキスト別に異なるモデルを用意するものである。
【００１７】
１２は言語モデルである。この言語モデル１２は、音声認識装置が受理可能な単語や文法規則（構文）等を規定して、認識対象を制約するモデルである。例えば図２に示すように、有限状態オートマトンを用いて受理可能な単語系列を構文ネットワークの形で記述したものである。
【００１８】
１３は認識処理部である。この認識処理部１３では、音声が音声認識装置に入力される以前に、即ち認識処理を開始する以前に、音響モデル１１及び言語モデル１２を用いて受理可能な発話内容の仮説を表現するＨＭＭネットワークを構成しておく。このＨＭＭネットワークとは、単語の音素表記や文法規則等の制約に従ってトライフォンモデルを連結して作成する、文字通りＨＭＭのネットワークである。これは、例えば図２に示したような構文ネットワークにおいて、単語の部分をトライフォンモデルの連結によって作成した単語モデル（ＨＭＭ）に置き換えたものである。このようなネットワークを構成することによって、認識処理を効率化することができる。各発話内容の仮説に対応するモデルは、ＨＭＭネットワークの一部として表現される。音声認識装置に発話が入力されると、ＨＭＭネットワークを用いて各仮説に対応するモデルが音響特徴パラメータ時系列D11 を生成する確率（ゆう度）を計算する。ＨＭＭネットワーク中で最大ゆう度を与える仮説を探索し、認識結果候補D12 とする。また、このときの最大ゆう度を対数化した最大対数ゆう度を、認識ゆう度D13 とする。ここで、各仮説に対するゆう度計算は、音響特徴パラメータ時系列D11 のフレームに同期して並列に行う。各フレームではＨＭＭの各状態に対する出力確率分布計算（当該フレームの音響特徴パラメータを出力する確率の計算）を行い、これを対数化して局所ゆう度D14 とする。認識ゆう度D13 は局所ゆう度D14 とＨＭＭの状態遷移確率を用いて、前述した従来技術文献（２）に開示されるViterbi アルゴリズム等の手段により算出する。
【００１９】
この認識処理部１３で算出された認識結果候補D12 と認識ゆう度D13 はリジェクト判定部１６に出力される。局所ゆう度D14 は参照ゆう度算出部１４に出力される。
【００２０】
参照ゆう度算出部１４では、認識処理部１３で算出される局所ゆう度D14 と、参照テーブル１５に格納されている状態遷移制約情報D15 を用いて参照ゆう度D16 を算出する。
【００２１】
参照テーブル１５は、参照ゆう度算出部１４で用いる状態遷移制約情報D15 を格納しているテーブルである。状態遷移制約情報D15 は、あらかじめ音響モデル１１を用いて作成しておく。その作成方法は後述する。
【００２２】
リジェクト判定部１６では、認識ゆう度D13 と参照ゆう度D16 を用いて認識結果候補D12 に対してリジェクト判定を行い、認識結果D17 を出力する。
【００２３】
［音声認識方法］
次に、上記構成の音声認識装置を用いた音声認識方法を説明する。
【００２４】
［第１段階］
まず、音響モデル１１を構成するすべてのトライフォンモデルを用いて、ＨＭＭの状態に対するクラスタリングを行う。クラスタリングにより生成される各クラスタを、以後「状態クラスタ」と呼ぶ。クラスタリングにおける距離尺度は各状態を表現するパラメータを用いて定義する。例えば、各状態の出力確率分布が多次元正規分布で表されている場合、多次元正規分布の平均ベクトル（あるいは分散ベクトル）を用いて以下のように定義する。
【００２５】
２つの平均ベクトル（あるいは、さらに分散ベクトルを付加したベクトル）
ｘ＝［ａ₁, ａ₂, …, ａ_n］
ｙ＝［ｂ₁, ｂ₂, …, ｂ_n］
に対して、距離尺度Ｄは
Ｄ＝（ａ₁−ｂ₁）²＋（ａ₂−ｂ₂）²＋…＋（ａ_n−ｂ_n）²
となる。
【００２６】
クラスタリング方法としては、ＬＢＧアルゴリズム等の一般的なクラスタリングアルゴリズムを用いる。ここでは、より簡易な方法を以下に示す。
【００２７】
Ｍ個のサンプル集合Ｘ＝｛ｘ₁, ｘ₂, …, ｘ_M｝をクラスタリングする場合を考える。しきい値Ｔ_hが与えられているものとする。なお、しきい値Ｔ_hの値は実験的に決定される数値である。
【００２８】
まず、任意に１個のサンプル、例えばｘ₁を取り、これをクラスタ中心ｚ₁（＝ｘ₁）とする。次にｘ_k（ｋ＝２, ３, …, Ｍ）を取り、ｚ₁とｘ_kとの距離Ｄ_1kを計算する。この計算により、Ｄ_1k≦Ｔ_hであれば、そのｘ_kはｚ₁を中心とするクラスタに属すると判定する。また、Ｄ_1k＞Ｔ_hであれば、そのｘ_kを新たなクラスタ中心ｚ₂とする。同様にして、残りのサンプルｘ_kについてｚ₁, ｚ₂との距離Ｄ_1k, Ｄ_2kを計算する。そして、この距離Ｄ_1k, Ｄ_2kのいずれかがＴ_hより小さければそのｘ_kはそのクラスタに属するものとし、そうでなければそのｘ_kを新たなクラスタ中心ｚ₃とする。
【００２９】
以上の操作を、全てのサンプル｛ｘ₁, ｘ₂, …, ｘ_M｝に対して行い、クラスタリングを終了する。
【００３０】
［第２段階］
状態クラスタ間の遷移確率を、以下のようにして算出する。
【００３１】
まず、状態クラスタ間の遷移確率を定義する。それぞれの状態クラスタに属する各状態は、トライフォンモデル上では他の状態に接続されている。例えば、図３に示すように、状態Ｓ_１は状態Ｓ_２に、状態Ｓ_２は状態Ｓ_３にそれぞれ接続されている。また、トライフォンモデルの終端状態Ｓ_３は、次に続き得るトライフォンモデルの始端状態Ｓ_４, Ｓ_５, Ｓ_６にそれぞれ接続されている。一般に、あるトライフォンモデルに対して次に続き得るトライフォンモデルは複数存在するので、トライフォンモデルの終端状態は複数の状態に接続されている。さらに、状態の接続関係には、向き（図３で示す矢印）があり、この向きは一方の状態から他方の状態への遷移方向を表している。このときの遷移の起こりやすさとして、状態遷移確率が付与されている。また、各状態には自己ループ遷移を表す接続も存在する。このようなトライフォンモデル上での状態の遷移接続を、状態クラスタに属する各状態に対して適用する。これにより、任意の状態クラスタ間に、構成要素の状態が作る遷移接続の束ができる。図４はこの様子を示した例である。図４において、状態クラスタ１に属する状態Ｓ_１は、状態クラスタ２に属する状態Ｓ_２に接続されており、トライフォンモデルにおいて状態Ｓ_１から状態Ｓ_２への遷移接続（状態遷移確率ａ_１２）が存在することを意味する。図４では状態クラスタ１に属する状態から、他の状態クラスタに属する状態への遷移接続を示した。なお、一部においては、状態クラスタ１の内部における遷移接続も示した。状態クラスタ間で同一の遷移方向を持つ遷移接続を１つの束としたものが“遷移接続の束”である。この遷移接続の束を用いて、状態クラスタ間の遷移確率を次の（１）〜（３）式により定義する。
【００３２】
【数１】

ｉ，ｊ（＝１, ２, …, Ｎ）：状態クラスタ番号
ｕ，ｖ（＝１, ２, …, Ｍ）：状態番号
Ｐ_ij：状態クラスタｉから状態クラスタｊへの遷移確率
Ｎ：状態クラスタの総数
Ｍ：状態の総数
ａ_uv：状態Ｓ_uから状態Ｓ_vへの状態遷移確率
ａ_uu：状態Ｓ_uの自己ループ遷移確率
ｒ_i：ともに状態クラスタｉに属する異なる状態間における遷移接続の個数
（自己ループ遷移接続は対象外）
ｚ_i：状態クラスタｉから他の状態クラスタへの“遷移接続の束”の個数
ｑ_u：ある状態クラスタに属する状態Ｓ_uから他の状態クラスタに属する状態への遷移接続の個数
上式において、ｆ_ij（ｉ≠ｊ）は、状態クラスタｉから状態クラスタｊへの遷移接続の束に対する状態遷移確率の総和を表している。ただし、遷移接続が存在しない状態クラスタ間においては、ｆ_ij＝０である。また、ｆ_iiは、状態クラスタｉの内部における遷移接続に対する状態遷移確率の総和を、状態クラスタｉから他の状態クラスタへの“遷移接続の束”の個数で割った値を表している。
【００３３】
以上、説明した式を用いて状態クラスタ間の遷移確率Ｐ_ijを算出する。算出した状態クラスタ間の遷移確率Ｐ_ijは、対数化して重み係数（定数）Ｗを乗じる。なお、重み係数Ｗは後述する参照ゆう度算出部１４での動作において説明する。
【００３４】
以上のようにして得られたＷ・log Ｐ_ijに、トライフォンモデルの各状態がどの状態クラスタに属するかを示すヘッダ情報を付加して、状態遷移制約情報D15 とする。
【００３５】
＜上記以外の定義方法＞
なお、状態クラスタ間の遷移確率は、上述した定義方法以外に、以下に示す定義方法によっても可能である。いずれの方法においても、状態クラスタ間の遷移接続に基づいて定義している点が共通している。
【００３６】
（ａ）状態クラスタｉから状態クラスタｊへの遷移接続の束（遷移接続の束を構成する遷移接続の個数は、１個でも構わない）が存在するか否かによってｆ_ijを定義する。
【００３７】
これは上記（１）〜（３）式において、ｆ_ij及びｆ_iiを次のように定義し直すことによって得られる。
【００３８】
1) 状態クラスタｉから状態クラスタｊへの遷移接続の束が存在するならば、ｆ_ij＝１とする。
【００３９】
2) 状態クラスタｉから状態クラスタｊへの遷移接続の束が存在しないならば、ｆ_ij＝０とする。
【００４０】
3)ｆ_ii＝１とする。
【００４１】
（ｂ）状態クラスタｉから状態クラスタｊへの遷移接続の束を構成する遷移接続の個数によりｆ_ijを定義する。
【００４２】
これは上記（１）〜（３）式において、ａ_uv及びａ_uuを
ａ_uv＝ａ_uu＝１
と定義し直すことによって得られる。
【００４３】
［参照ゆう度算出部１４の動作］
参照ゆう度算出部１４では、次の（４），（５）式を用いて参照ゆう度D16 を算出する。
【００４４】
【数２】

ここで、
ｔ＝１のとき、ｃ_uv＝１
ｔ≠１のとき、ｃ_uv＝Ｐ_ij
ただし、ｕ∈ｉ, ｖ∈ｊ
また
ｃ_uv＝０のとき、log ｃ_uv＝ＩＮＨ
とする。
【００４５】
ｔ：フレーム番号
Ｌ_G：参照ゆう度D16
Ｌ_g：対数ゆう度
Ｔ：フレーム総数
Ｗ：状態遷移制約情報に対する重み係数
ｕ：フレーム番号（ｔ−１）において、（５）式の右辺の最大値を与える状態番号
ｖ：任意の状態番号
Ｐ_ij：状態クラスタｉから状態クラスタｊへの遷移確率
ｉ, ｊ：任意の状態クラスタ番号
Ｖ_t：認識処理部１３において、フレーム番号ｔに出力確率分布計算を行う状態全体の集合
ｂ_v（ｘ_t）：状態ｖにおける音響特徴パラメータｘ_tの出力確率（密度）
ｘ_t：フレーム番号ｔにおける音響特徴パラメータ
ＩＮＨ：状態クラスタ間の遷移確率を対数化した値の下限値（定数）
（５）式において、log ｂ_v（ｘ_t）は、認識処理部１３より局所ゆう度D14 として与えられ、Ｗ・log ｃ_uvは参照テーブル１５より状態遷移制約情報D15 として与えられる。したがって、参照ゆう度算出部１４で行う演算は、加算と大小比較のみである。なお、（５）式において、状態遷移制約情報に対する重み係数Ｗは、log ｃ_uvとlog ｂ_v（ｘ_t）のＬ_g（ｔ）に寄与する割合を調節するためのパラメータ（定数）であり、定数ＩＮＨは状態クラスタ間の遷移確率を対数化した値の下限値を設定するためのパラメータ（定数）である。ともにその値は実験的に決定する。
【００４６】
なお、参照ゆう度Ｌ_Gは、任意の発話内容を表現するモデルに対する累積対数ゆう度を表す。また、Ｌ_g（ｔ）は、任意の発話内容を表現するモデルに対する各フレームにおける局所的な対数ゆう度を表す。
【００４７】
（５）式において、フレーム番号（ｔ−１）における（５）式の右辺の最大値を与える状態番号をｕとする。log ｃ_uvは状態番号ｕが何であるかによって、次フレーム番号ｔにおける（５）式の右辺の最大値を与える状態番号の候補を制約する。状態番号ｕの状態から状態番号ｖの状態への遷移の起こりやすさを制約として用いている。このような状態遷移制約によって、トライフォンモデルが有する音声の時間構造を考慮した参照ゆう度D16 の算出を可能にしている。
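The reference likelihood calculation described above can be sketched as follows. Because the images for equations (4) and (5) are not reproduced in this text, the sketch follows the symbol definitions only: L_G is taken as the sum over frames of L_g(t), and L_g(t) as the maximum over v in V_t of W · log c_uv + log b_v(x_t), with u the maximizing state of the previous frame. The function name `reference_likelihood` and the dictionary layout of the inputs are assumptions made for illustration.

```python
import math


def reference_likelihood(local_ll, header, table, default):
    """Accumulate the reference likelihood L_G over all frames.

    local_ll: list over frames; each item maps a state v to log b_v(x_t)
              (the states present in a frame play the role of V_t).
    header:   maps each state to its state-cluster number.
    table:    maps a cluster pair (i, j) to the precomputed W * log P_ij.
    default:  value used when no entry exists (W * INH, i.e. c_uv = 0).
    """
    total = 0.0
    prev_state = None  # u: best state of the previous frame
    for frame in local_ll:
        best_state, best_score = None, -math.inf
        for v, log_b in frame.items():
            if prev_state is None:
                constraint = 0.0  # t = 1: c_uv = 1, so log c_uv = 0
            else:
                pair = (header[prev_state], header[v])
                constraint = table.get(pair, default)
            score = constraint + log_b  # addition only
            if score > best_score:      # magnitude comparison only
                best_state, best_score = v, score
        total += best_score
        prev_state = best_state
    return total
```

Since the table already holds W · log P_ij, the loop uses only additions and comparisons, which matches the cost argument made in the text.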
【００４８】
［リジェクト判定部１６の動作］
リジェクト判定部１６では、次の（６）式により入力音声データD10 のリジェクト判定を行う。（６）式において、リジェクト判定のしきい値θは実験的に決定される。しきい値θの値によって、入力が認識対象である場合の認識率と、認識対象外である場合のリジェクト率が変化する。一般に、両者はトレードオフの関係にあるので、所望の性能に合わせてしきい値θの値を決定する。
【００４９】
Ｌ_M＝（Ｌ_G−Ｌ_R）／Ｔ ……（６）
Ｌ_R：認識ゆう度D13
Ｌ_G：参照ゆう度D16
Ｔ：フレーム総数
この式により得られた値Ｌ_Mとしきい値θとの大きさを比較してリジェクト判定を行う。なお、θはリジェクト判定のしきい値である。
【００５０】
Ｌ_M＞θならば、入力をリジェクトするように判定し、認識結果D17 として、入力がリジェクトされたことを示す情報を出力する。
【００５１】
Ｌ_M≦θならば、入力をリジェクトせずに、認識結果D17 として、認識結果候補D12 を出力する。
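The reject decision of equation (6) can be sketched as below. The function name `decide` and the use of `None` to signal a rejected input are assumptions made for illustration; the threshold theta is the experimentally chosen value mentioned in the text.

```python
def decide(candidate, l_r, l_g, num_frames, theta):
    """Return the recognition result D17.

    candidate:  recognition result candidate D12.
    l_r:        recognition likelihood D13 (L_R).
    l_g:        reference likelihood D16 (L_G).
    num_frames: total number of frames T.
    theta:      reject threshold.
    Returns None when the input is rejected (an assumed convention).
    """
    l_m = (l_g - l_r) / num_frames  # equation (6)
    if l_m > theta:
        return None       # L_M > theta: reject the input
    return candidate      # L_M <= theta: accept the candidate
```

Dividing by T normalizes the likelihood gap per frame, so the same theta can be applied to utterances of different lengths.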
【００５２】
［効果］
以上のように、入力発話のリジェクト判定に用いる参照ゆう度D16 を、認識ゆう度D13 の算出過程で得られる局所ゆう度D14 と、あらかじめ作成した状態遷移制約情報D15 に基づいて算出する。これにより、参照ゆう度D16 の算出に要する演算は加算と大小比較だけになり、リジェクト機能の付加による処理量の増加をきわめて小さくすることができる。また、上記の参照ゆう度D16 は、音響モデル１１が有する音声の時間構造を考慮しつつ、種々の音響的事象に対処可能な定式化を行って算出しているため、音素あるいは音節認識を用いる従来の方法と同等のリジェクト精度を得ることができる。この結果、認識対象外の発話（認識対象語以外の語、あるいは文法外の発話）や、せき、くしゃみなどの非言語音、あるいはベルなどの環境音が装置に入力された場合に、認識のための処理量をほとんど増加させることなく、効率的にリジェクト判定を行うことが可能になる。
【００５３】
換言すると、次のようになる。
【００５４】
（ａ）音響モデルとして、音素や音節などのサブワードに対するコンテキスト依存モデルを用いても、リジェクト機能の付加による処理量の増加はほとんどなく、処理機能を高い状態に維持することができる。
【００５５】
（ｂ）音響モデルとして、いかなる言語的単位（音素、音節、単語、文節など）のモデルを用いても、リジェクト機能を付加することが可能である。
【００５６】
（ｃ）音素あるいは音節認識を用いる方法（従来法）と同等のリジェクト精度を得ることができる。
【００５７】
［第２の実施形態］
上記第１の実施形態では、認識対象語以外の語や文法外の入力発話を、全体として棄却する方法について説明したが、本実施形態では、入力発話のうち不要な部分だけを部分的に棄却する方法について説明する。即ち、本実施形態では、入力発話の一部に「あのー」、「えーと」等に代表される間投詞や、「○○かな」、「○○とか」等の不要な語、あるいは「（じょ）情報」といったような言いよどみなどを含む場合に対処する方法について説明する。以後、間投詞、不要な語、言いよどみ等を不要語と呼ぶ。
【００５８】
本実施形態に係る音声認識装置の基本構成を図５に示す。図５において入力音声データD20 は、マイクロフォンなどから入力された音声（アナログ信号）をディジタル信号に変換した信号である。入力音声データD20 は、音声分析部２０で音響特徴パラメータ時系列D21 に変換され、認識処理部２３に入力される。
【００５９】
認識処理部２３では音響モデル２１、言語モデル２２及び不要語処理部２４を用いて認識処理を行い、入力音声データD20 に対する認識結果D26 を出力する。
【００６０】
また、不要語処理部２４では、認識処理部２３で算出される局所ゆう度D22 と部分仮説累積ゆう度D23 、さらに参照テーブル２５に格納されている状態遷移制約情報D24 を用いて、不要語仮説累積ゆう度D25 を算出する。不要語仮説累積ゆう度D25 は認識処理部２３に出力され、認識結果D26 の算出に用いられる。
【００６１】
以下、本実施形態に係る音声認識方法に特有の機能を中心に説明し、第１実施形態と同様の部分については、その説明を省略する。
【００６２】
［言語モデル２２］
言語モデル２２は、第１実施形態同様に、音声認識装置が受理可能な単語や文法規則（構文）等を規定して、認識対象を制約するモデルである。例えば、図６に示すように有限状態オートマトンを用いて、受理可能な単語系列を構文ネットワークの形で記述したものである。ただし、本実施形態の言語モデル２２では、入力発話中の不要語に対処するため、構文ネットワークの各ノードに、自己遷移として、不要語を表現するアークを付加している。このようにすることによって、任意の単語間において、不要語を受理することが可能になる。
【００６３】
［認識処理部２３］
認識処理部２３の機能は、上記第１の実施形態の認識処理部１３とほぼ同様である。ただし、本実施形態の認識処理部２３では、不要語の部分についてはＨＭＭによる明示的な不要語モデルを用意せず、後述する不要語処理部２４を介して各ノードに自己遷移させるようになっている。つまり、不要語処理部２４を不要語モデルとして用いる。このようなネットワークを構成することによって、認識処理及び不要語処理を効率的に行う。各々の発話内容の仮説に対応するモデルは、仮説の任意の単語間において、不要語を受理可能な形で、ＨＭＭネットワークの一部として表現される。音声認識装置に発話が入力されると、ＨＭＭネットワークを用いて、各仮説に対応するモデルが音響特徴パラメータ時系列D21 を生成する確率（ゆう度）を計算する。各ＨＭＭネットワーク中で最大ゆう度を与える仮説を探索し、認識結果D26 とする。
【００６４】
入力発話の一部に不要語が含まれる場合の認識結果D26 は次のようになる。
【００６５】
［図６の言語モデルを用いた場合］
入力発話：「それじゃあー東京の（こ）交通情報」
認識結果D26 ：「＃東京＃交通情報」（＃は不要語を表す記号）
認識結果D26 に対応する最大ゆう度を対数化した最大対数ゆう度を、以後、「認識ゆう度」と呼ぶ。ここで、各仮説に対するゆう度計算は、音響特徴パラメータ時系列D21 のフレームに同期して並列に行う。各フレームではＨＭＭの各状態に対する出力確率分布計算（当該フレームの音響特徴パラメータを出力する確率の計算）を行い、これを対数化して局所ゆう度D22 とする。
【００６６】
認識ゆう度は、局所ゆう度D22 とＨＭＭの状態遷移確率を用いて、前述した従来技術文献（２）に開示されているViterbi アルゴリズム等の手段により算出する。ただし、不要語の部分は、不要語処理部２４によりゆう度計算を行う。
【００６７】
不要語処理部２４におけるゆう度計算方法は後述する。認識ゆう度を算出する上での不要語処理部２４の扱いは、他の単語モデルと同様である。音響特徴パラメータ時系列D21 のフレーム番号１から任意のフレーム番号までの“発話内容の部分仮説”に対する累積対数ゆう度を、部分仮説累積ゆう度D23 とする。部分仮説累積ゆう度D23 は、その部分仮説の終端フレーム番号を付加して不要語処理部２４に出力される。
【００６８】
［不要語処理部２４］
不要語処理部２４では、次の（７）〜（９）式を用いて不要語仮説累積ゆう度D25 を算出する。以下では、不要語を表す発話内容の部分仮説を「不要語仮説」と、この不要語仮説に対する対数ゆう度を「不要語仮説ゆう度」と呼ぶ。不要語仮説ゆう度の算出における始端フレーム番号及び終端フレーム番号をそれぞれｔ₀、ｔ₁とし、このときの不要語仮説ゆう度をＬ_G（ｔ₁）で表す。不要語仮説累積ゆう度D25 は、フレーム番号（ｔ₀−１）における部分仮説累積ゆう度D23 と、不要語仮説ゆう度Ｌ_G（ｔ₁）との和として定義する。したがって、不要語仮説累積ゆう度D25 はフレーム番号１からフレーム番号ｔ₁（不要語仮説ゆう度の算出における終端フレーム番号）までの発話内容の部分仮説に対する累積対数ゆう度を表している。
【００６９】
次に、不要語仮説ゆう度Ｌ_G（ｔ₁）について説明する。（８）式は任意の始端フレーム番号ｔ₀に対して、異なる（Ｔ_del＋１）個の終端フレーム番号ｔ₁（＝ｔ₀＋Ｔ_min, ｔ₀＋Ｔ_min＋１, ……, ｔ₀＋Ｔ_min＋Ｔ_del）における不要語仮説ゆう度Ｌ_G（ｔ₁）を算出することを表している。
【００７０】
なお、（８）式においてＬ_g（ｔ₁）は、各フレームにおける不要語仮説に対する局所的な対数ゆう度を表す。また、補正定数Ｒは、不要語仮説ゆう度の変域を調節するためのパラメータ（定数）で、その値は実験的に決定される。不要語仮説ゆう度Ｌ_G（ｔ₁）は、［Ｌ_g（ｔ）＋Ｒ］を、始端フレーム番号ｔ₀から終端フレーム番号ｔ₁まで、累積加算したものとして定義する。不要語仮説ゆう度Ｌ_G（ｔ₁）の算出に用いる値で、（最小フレーム数−１）Ｔ_minと、最大フレーム数及び最小フレーム数の差Ｔ_delの値は、実験的に決定する。
【００７１】
また、Ｌ_g（ｔ）を定義する（９）式において、log ｂ_v（ｘ_t）は、認識処理部２３より局所ゆう度D22 として与えられ、Ｗ・log ｃ_uvは、参照テーブル２５より状態遷移制約情報D24 として与えられる。ここで、状態遷移制約情報に対する重み係数Ｗは、log ｃ_uvとlog ｂ_v（ｘ_t）のＬ_g（ｔ）に寄与する割合を調節するためのパラメータ（定数）であり、定数ＩＮＨは状態クラスタ間の遷移確率を対数化した値の下限値を設定するためのパラメータ（定数）である。ともにその値は実験的に決定する。
【００７２】
これにより、不要語処理部２４で行う演算は、加算と大小比較のみとなる。
【００７３】
次に、（９）式におけるlog ｃ_uvの働きについて説明する。
【００７４】
フレーム番号（ｔ−１）において、（９）式の右辺に最大値を与える状態番号をｕとする。log ｃ_uvは状態番号ｕが何であるかによって、次フレーム番号ｔにおいて（９）式の右辺の最大値を与える状態番号の候補を制約する。状態番号ｕの状態から状態番号ｖの状態への遷移の起こりやすさを制約として用いている。このような状態遷移制約によって、トライフォンモデルが有する音声の時間構造を考慮した不要語仮説ゆう度の算出を可能にしている。不要語処理部２４で算出されたフレーム番号ｔ₁における不要語仮説累積ゆう度D25 は、認識処理部２３に出力される。認識処理部２３では、不要語仮説に後続する発話内容の部分仮説に対するゆう度計算を、不要語仮説累積ゆう度D25 を初期値とし、フレーム番号（ｔ₁＋１）を始端フレーム番号として行う。このようにすることによって、認識処理部２３における認識ゆう度の計算は、単語仮説（単語を表す発話内容の部分仮説）に対するゆう度と不要語仮説に対するゆう度を同様に扱って行うことができる。
【００７５】
【数３】

ここで、
ｔ＝１のとき、ｃ_uv＝１
ｔ≠１のとき、ｃ_uv＝Ｐ_ij
ただし、ｕ∈ｉ, ｖ∈ｊ
また、
ｔ₀＝１のとき、Ｌ_F（ｔ₀−１）＝０
ｃ_uv＝０のとき、log ｃ_uv＝ＩＮＨ
とする。
【００７６】
ｔ：フレーム番号
Ｌ_B（ｔ₁）：フレーム番号ｔ₁における不要語仮説累積ゆう度D25
Ｌ_F（ｔ₀−１）：フレーム番号（ｔ₀−１）における部分仮説累積ゆう度D23
Ｌ_G（ｔ₁）：フレーム番号ｔ₁における不要語仮説ゆう度
ｔ₀：不要語仮説ゆう度の算出における始端フレーム番号
ｔ₁：不要語仮説ゆう度の算出における終端フレーム番号
Ｔ_min：不要語仮説ゆう度の算出に用いる（最小フレーム数−１）
Ｔ_del：不要語仮説ゆう度の算出に用いる最大フレーム数と最小フレーム数の差
Ｒ：補正定数
Ｔ：フレーム総数
Ｗ：状態遷移制約情報に対する重み係数
ｕ：フレーム番号(t-1) において、(9) 式の右辺に最大値を与える状態番号
ｖ：任意の状態番号
ｉ, ｊ：任意の状態クラスタ番号
Ｖ_t：認識処理部２３において、フレーム番号ｔに出力確率分布計算を行う状態全体の集合
ｂ_v（ｘ_t）：状態ｖにおける音響特徴パラメータｘ_tの出力確率（密度）
ｘ_t：フレーム番号ｔにおける音響特徴パラメータ
ＩＮＨ：状態クラスタ間の遷移確率を対数化した値の下限値（定数）
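The accumulation described for equations (7) and (8) can be sketched as follows. The equation images are not reproduced in this text, so the sketch relies on the prose: L_G(t1) is the running sum of L_g(t) + R from t0 to t1, and L_B(t1) = L_F(t0 - 1) + L_G(t1) for t1 = t0 + T_min, ..., t0 + T_min + T_del. The per-frame values L_g(t) are assumed to be computed beforehand as in equation (9). The function name and the clipping of t1 to the last available frame are assumptions made for illustration.

```python
def unnecessary_word_cumulative_likelihoods(frame_scores, t0, l_f_prev,
                                            t_min, t_del, r):
    """Return {t1: L_B(t1)} for every admissible end frame t1.

    frame_scores: list of L_g(t) for t = 1..T (index 0 holds frame 1).
    t0:           start frame number of the unnecessary word hypothesis.
    l_f_prev:     partial hypothesis cumulative likelihood L_F(t0 - 1);
                  pass 0.0 when t0 == 1.
    t_min:        (minimum number of frames - 1).
    t_del:        maximum minus minimum number of frames.
    r:            correction constant R.
    """
    results = {}
    acc = 0.0  # running L_G, built by addition only
    # Assumption: end frames beyond the utterance are simply skipped.
    last = min(t0 + t_min + t_del, len(frame_scores))
    for t in range(t0, last + 1):
        acc += frame_scores[t - 1] + r
        if t >= t0 + t_min:
            results[t] = l_f_prev + acc  # equation (7)
    return results
```

One pass over the frames yields all T_del + 1 end-frame candidates, because each L_G(t1) extends the previous running sum by a single term.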
［効果］
本実施形態では、入力発話の一部に不要語を含む場合に対処するため、不要語仮説ゆう度を、認識ゆう度算出過程で得られる局所ゆう度D22 と、あらかじめ作成した状態遷移制約情報D24 に基づいて算出する。これにより、不要語仮説ゆう度の算出に要する演算は加算と大小比較だけになり、不要語処理の付加による処理量の増加を非常に小さくすることができる。この不要語処理の付加による処理量の増加は、ＨＭＭによる明示的な不要語モデル（一般に、garbage モデルと呼ばれ、種々の不要語を１種類のＨＭＭでモデル化するもの）を用いる方法よりも小さく抑えることができる。
【００７７】
また、不要語仮説ゆう度は、音響モデル２１が有する音声の時間構造を考慮しつつ、種々の音響的事象に対処可能な定式化を行って算出しているため、高い不要語検出精度を得ることができる。これにより、入力発話の一部に不要語を含む場合に、認識のための処理の増加を低く抑えて効率的に不要語を検出し、不要語以外の部分の認識率を向上させることが可能になる。しかも、不要語が入力発話の先頭、末尾、任意の単語間において複数含まれている場合にも、高い精度で不要語を検出することができる。
【００７８】
換言すると、以下のようになる。
【００７９】
（ａ）音響モデルとして、音素や音節などのサブワードに対するコンテキスト依存モデルを用いても、不要語処理の付加による処理量の増加は非常に小さく、処理機能を高い状態に維持することができる。
【００８０】
（ｂ）音響モデルとして、いかなる言語的単位（音素、音節、単語、文節など）のモデルを用いても、処理機能の低下を招くことなく、不要語処理を付加することができる。
【００８１】
（ｃ） garbage モデルを用いる場合に比べて、不要語を表現するモデルの音響的分解能が高い。したがって、種々の不要語の音響的バリエーションに対処することが可能である。
【００８２】
（ｄ） garbage モデルを用いていないので、不要語を含む多量の音声データを用いてあらかじめ不要語モデルのパラメータ推定（学習）をする必要がない。このため、学習用の音声データとして用意しにくい言いよどみ等にも対処することができるようになる。
【００８３】
［変形例］
（１）第２の実施形態では、入力発話の一部に不要語が含まれる場合に対処する方法について説明したが、本発明の音声認識方法は不要語に限らず、入力発話の一部に未知語（認識対象語以外の語）が含まれる場合に対処する方法としても使用可能である。その場合、図５における言語モデル２２を、例えば図７に示すように記述する。このようにすることによって、入力発話全体を棄却するのではなく、入力発話中の未知語部分を効果的に検出し、未知語以外の部分の認識率を向上させることができる。
【００８４】
例えば、図７に示した言語モデルを用いて
「ニューヨーク観光情報」
と発声した場合を考える。“ニューヨーク”が未知語である場合、本発明により
「＠観光情報」（＠は未知語を表す記号）
という認識結果を出力することが可能になる。
【００８５】
【発明の効果】
以上、詳述したように、本発明の音声認識方法によれば、次のような効果を奏することができる。
【００８６】
（１）局所ゆう度と状態遷移制約情報を用いて参照ゆう度を算出することで、参照ゆう度の算出に要する演算を低減することができ、棄却判定機能の付加による処理量の増加を小さくすることができる。
【００８７】
（２）参照ゆう度と認識ゆう度との比較により、入力音声データの棄却判定を行うため、認識のための処理量をほとんど増加させることなく、効率的に棄却判定を行うことが可能になる。
【００８８】
（３）不要語仮説ゆう度の算出に要する演算を低減することができ、不要語処理の付加による処理量の増加を最小限に抑えることができる。
【図面の簡単な説明】
【図１】本発明の第１の実施形態に係る音声認識装置の基本構成を示すブロック図である。
【図２】第１の実施形態の言語モデル（構文ネットワーク）例を示す模式図である。
【図３】トライフォンモデルにおける状態の遷移接続の例を示す模式図である。
【図４】状態クラスタ間における遷移接続の例を示す模式図である。
【図５】本発明の第２の実施形態に係る音声認識装置の基本構成を示すブロック図である。
【図６】第２の実施形態の言語モデル（構文ネットワーク）例を示す模式図である。
【図７】本発明の変形例に係る言語モデル（構文ネットワーク）例を示す模式図である。
【符号の説明】
１０, ２０：音声分析部、１１, ２１：音響モデル、１２, ２２：言語モデル、１３, ２３：認識処理部、１４：参照ゆう度算出部、１５, ２５：参照テーブル、１６：リジェクト判定部、２４：不要語処理部、D10,D20 ：入力音声データ、D11,D21 ：音響特徴パラメータ時系列、D12 ：認識結果候補、D13 ：認識ゆう度、D14,D22 ：局所ゆう度、D15,D24 ：状態遷移制約情報、D16 ：参照ゆう度、D17,D26 ：認識結果、D23 ：部分仮説累積ゆう度、D25 ：不要語仮説累積ゆう度。
[0001]
[Technical field of the invention]
The present invention relates to a speech recognition method used in a speech recognition apparatus.
[0002]
[Prior art]
Reference (1): Takao Watanabe, Satoshi Tsukada, “Rejection of unknown utterances by likelihood correction using syllable recognition,” IEICE Transactions D-II, Vol. J75-D-II, No. 12, December 1992, pp. 2002-2009
Reference (2): Masaaki Okochi, “Speech recognition based on Hidden Markov Model,” Journal of the Acoustical Society of Japan, Vol. 42, No. 12 (1986)
To achieve high recognition accuracy and real-time processing, a speech recognition apparatus constrains the recognition target by defining in advance the words, grammar rules, and so on that it can accept. In actual use, however, utterances outside the recognition target, slips of the tongue, and restatements are unavoidable. A reject function that rejects an utterance when the reliability of its recognition result is low is therefore required.
[0003]
The method disclosed in document (1) above is a conventional way of adding a reject function. It presupposes that HMMs (Hidden Markov Models) of subword units such as phonemes or syllables are used as the models that represent speech (generally called acoustic models). Details of HMM-based speech recognition are disclosed in document (2) above. Subword models are concatenated to build a model for each hypothesis of utterance content, such as a word or sentence defined as a recognition target, and the probability (likelihood) that each hypothesis model generates the input speech data is calculated. The hypothesis whose model gives the maximum likelihood is taken as the recognition result.
[0004]
To add a reject function to this recognition method, a second likelihood calculation that recognizes the input speech as an arbitrary phoneme string or syllable string is performed in addition to the constrained likelihood calculation (recognition process) described above. The difference between the maximum likelihoods obtained from the two calculations is then compared with a threshold to decide whether to reject the input utterance.
[0005]
[Problems to be solved by the invention]
However, the conventional methods described above have the following problems.
[0006]
(a) Context-dependent phoneme models such as triphone models can represent allophones that depend on the phoneme context and give relatively high recognition accuracy, so they are often used as acoustic models. However, when such a context-dependent phoneme (or syllable) model is used as the acoustic model, adding a reject function greatly increases the processing amount, which has made it difficult to add one. In addition, improving the robustness of a speech recognition apparatus to natural utterances requires handling unnecessary words and unknown words in the input utterance, but it has been difficult to obtain sufficient performance in both processing amount and accuracy.
[0007]
(b) When the acoustic model does not use subword units such as phonemes or syllables (for example, when it uses units such as words or phrases), the conventional method described above cannot be applied.
[0008]
(c) When part of the input utterance contains unnecessary words or unknown words, it is difficult to detect those parts and accurately recognize the remaining parts of the utterance while suppressing the increase in the processing amount for recognition. As in (a) above, the processing amount increases greatly when a context-dependent phoneme (or syllable) model is used as the acoustic model.
[0009]
[Means for Solving the Problems]
To solve the above problems, the speech recognition method according to the present invention is a speech recognition method that recognizes and processes input speech data, comprising: reference table creating means for creating state transition constraint information that represents the ease of transition between arbitrary states of the HMMs constituting the acoustic model; recognition processing means for calculating a recognition likelihood and local likelihoods together with a recognition result candidate for the input speech data; and reference likelihood calculating means for calculating a reference likelihood used to decide whether to reject the input speech data. The reference table creating means clusters the states of the HMMs constituting the acoustic model, calculates the transition probabilities between state clusters based on the transition connections between the states in the generated state clusters, and adds header information indicating which state cluster each HMM state belongs to, thereby forming the state transition constraint information. The reference likelihood calculating means calculates the reference likelihood using the local likelihoods calculated by the recognition processing means and the state transition constraint information created by the reference table creating means, and the rejection decision for the input speech data is made by comparing this reference likelihood with the recognition likelihood calculated by the recognition processing means.
[0010]
As described above, because the reference likelihood is calculated from the local likelihoods and the state transition constraint information, the only operations needed to calculate it are addition and magnitude comparison, so the increase in processing amount caused by adding the rejection decision function is kept small. Moreover, because the rejection decision for the input speech data is made by comparing the reference likelihood with the recognition likelihood, it can be made efficiently with almost no increase in the processing amount for recognition.
[0011]
A speech recognition method according to another invention is a speech recognition method that recognizes and processes input speech data, comprising: reference table creating means for creating state transition constraint information that represents the ease of transition between arbitrary states of the HMMs constituting the acoustic model; recognition processing means for calculating local likelihoods and partial hypothesis cumulative likelihoods together with a recognition result for the input speech data; and unnecessary word processing means for calculating an unnecessary word hypothesis cumulative likelihood used to handle unnecessary words or unknown words in the input speech data. The unnecessary word processing means calculates the unnecessary word hypothesis cumulative likelihood using the local likelihoods and partial hypothesis cumulative likelihoods calculated by the recognition processing means and the state transition constraint information created by the reference table creating means. This cumulative likelihood is used to detect the unnecessary word or unknown word portions of the input speech data, and the remaining portions are recognized.
[0012]
As a result, the only operations needed to calculate the unnecessary word hypothesis likelihood are addition and magnitude comparison, and the increase in processing amount due to adding unnecessary word processing can be minimized.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described with reference to the accompanying drawings.
[0014]
[First Embodiment]
[Configuration and function]
FIG. 1 shows the basic configuration of a speech recognition apparatus used in the speech recognition method according to this embodiment.
[0015]
Reference numeral 10 in FIG. 1 denotes a speech analysis unit. The speech analysis unit 10 converts input speech data D10 into an acoustic feature parameter time series D11. Specifically, using an analysis method such as LPC (Linear Predictive Coding) analysis, it converts the input speech data D10 into acoustic feature parameters for each short period of several ms to several tens of ms (hereinafter called a “frame”). An acoustic feature parameter is a parameter that represents the spectral envelope of the speech data, for example the cepstrum (the inverse Fourier transform of the logarithmic spectrum) or its change over time. The acoustic feature parameters obtained frame by frame form the acoustic feature parameter time series D11, which is input to the recognition processing unit 13. The input speech data D10 is a signal obtained by converting speech (an analog signal) input from a microphone or the like into a digital signal.
[0016]
11 is an acoustic model. The acoustic model 11 is a set of models (HMM) that express voice. As a linguistic unit of the model, it is possible to employ any component of speech (phonemes, syllables, words, phrases, etc.). When subwords such as phonemes and syllables are used as a unit, both context independent and dependent models can be used. That is, the acoustic model used for adding the reject function is not limited. This embodiment demonstrates the case where a triphone model is used as an example. The triphone model is a context-dependent phoneme model, and a different model is prepared for each phoneme according to the preceding and following phoneme contexts.
[0017]
Reference numeral 12 denotes a language model. The language model 12 is a model that restricts recognition targets by defining words, grammar rules (syntax), and the like that can be accepted by the speech recognition apparatus. For example, as shown in FIG. 2, an acceptable word sequence is described in the form of a syntax network using a finite state automaton.
[0018]
Reference numeral 13 denotes a recognition processing unit. Before speech is input to the speech recognition apparatus, that is, before recognition processing starts, the recognition processing unit 13 uses the acoustic model 11 and the language model 12 to build an HMM network that represents the hypotheses of acceptable utterance content. The HMM network is, literally, a network of HMMs created by concatenating triphone models according to constraints such as the phoneme transcriptions of words and the grammar rules. For example, it is the syntax network of FIG. 2 with each word replaced by a word model (HMM) built by concatenating triphone models. Building such a network makes the recognition process efficient. The model corresponding to each utterance content hypothesis is represented as part of the HMM network. When an utterance is input to the speech recognition apparatus, the HMM network is used to calculate the probability (likelihood) that the model corresponding to each hypothesis generates the acoustic feature parameter time series D11. The hypothesis that gives the maximum likelihood in the HMM network is searched for and taken as the recognition result candidate D12, and the logarithm of that maximum likelihood (the maximum log likelihood) is taken as the recognition likelihood D13. The likelihood calculation for each hypothesis is performed in parallel, synchronized with the frames of the acoustic feature parameter time series D11. In each frame, the output probability distribution is evaluated for each HMM state (that is, the probability of outputting the acoustic feature parameter of that frame is calculated), and its logarithm is taken as the local likelihood D14. The recognition likelihood D13 is calculated from the local likelihood D14 and the HMM state transition probabilities by means such as the Viterbi algorithm disclosed in prior art document (2) above.
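As a rough illustration of how a maximum log likelihood is obtained from per-frame local likelihoods and state transition probabilities, the sketch below gives a generic textbook log-domain Viterbi recursion for one small HMM. It is not the network search performed by the recognition processing unit 13; the function name `viterbi_log_likelihood`, the list-based inputs, and the choice to take the maximum over all final states are assumptions made for illustration.

```python
import math


def viterbi_log_likelihood(local_ll, log_trans, log_init):
    """Return the maximum log likelihood over all state paths.

    local_ll[t][s]:  log output probability of state s at frame t
                     (the local likelihood).
    log_trans[r][s]: log transition probability from state r to state s.
    log_init[s]:     log initial probability of state s.
    """
    n = len(log_init)
    # Initialise with the first frame.
    score = [log_init[s] + local_ll[0][s] for s in range(n)]
    for frame in local_ll[1:]:
        # For each state keep only the best predecessor (max, not sum).
        score = [
            max(score[r] + log_trans[r][s] for r in range(n)) + frame[s]
            for s in range(n)
        ]
    return max(score)
```

Working in the log domain turns the products of probabilities into sums, which is why the local likelihood is stored as a logarithm.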
[0019]
The recognition result candidate D12 and the recognition likelihood D13 calculated by the recognition processing unit 13 are output to the rejection determination unit 16. The local likelihood D14 is output to the reference likelihood calculating unit 14.
[0020]
The reference likelihood calculating unit 14 calculates the reference likelihood D16 using the local likelihood D14 calculated by the recognition processing unit 13 and the state transition constraint information D15 stored in the reference table 15.
[0021]
The reference table 15 is a table that stores state transition constraint information D15 used by the reference likelihood calculating unit 14. The state transition constraint information D15 is created using the acoustic model 11 in advance. The creation method will be described later.
[0022]
The rejection determination unit 16 performs rejection determination on the recognition result candidate D12 using the recognition likelihood D13 and the reference likelihood D16, and outputs a recognition result D17.
[0023]
[Speech recognition method]
Next, a speech recognition method using the speech recognition apparatus having the above configuration will be described.
[0024]
[First stage]
First, the HMM states of all the triphone models constituting the acoustic model 11 are clustered. Each cluster generated by the clustering is hereinafter called a “state cluster”. The distance measure for clustering is defined using the parameters that represent each state. For example, when the output probability distribution of each state is a multidimensional normal distribution, the distance is defined as follows using the mean vector (or variance vector) of that distribution.
[0025]
Given two mean vectors (or vectors to which the variance vectors have been appended)
x = [a_1, a_2, ..., a_n]
y = [b_1, b_2, ..., b_n]
the distance measure D is
D = (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2
[0026]
As a clustering method, a general clustering algorithm such as an LBG algorithm is used. Here, a simpler method is shown below.
[0027]
Consider clustering a set of M samples X = {x_1, x_2, ..., x_M}. A threshold T_h is assumed to be given; its value is determined experimentally.
[0028]
First, take an arbitrary sample, for example x_1, and make it the cluster center z_1 (= x_1). Next, take x_k (k = 2, 3, ..., M) and calculate the distance D_1k between z_1 and x_k. If D_1k ≤ T_h, x_k is judged to belong to the cluster centered on z_1. If D_1k > T_h, x_k becomes a new cluster center z_2. In the same way, the distances D_1k and D_2k from z_1 and z_2 are calculated for each remaining sample x_k. If either distance is smaller than T_h, x_k is assigned to that cluster; otherwise x_k becomes a new cluster center z_3.
[0029]
The above operation is performed for all samples {x_1, x_2, ..., x_M}, which completes the clustering.
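The simple threshold-based clustering described above can be sketched in a single pass over the samples. The function names are hypothetical. Assigning a sample to the nearest center when several centers lie within T_h is an assumption, since the text does not say how such ties are resolved, and the comparison uses ≤ throughout.

```python
def squared_distance(x, y):
    """Distance measure D: sum of squared component differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def threshold_clustering(samples, t_h):
    """Cluster samples with a distance threshold T_h.

    Returns (centers, assignment), where assignment[k] is the index of
    the cluster that samples[k] belongs to.
    """
    centers = []
    assignment = []
    for x in samples:
        # Distance from this sample to every existing cluster center.
        distances = [squared_distance(z, x) for z in centers]
        if distances and min(distances) <= t_h:
            # Assumption: join the nearest center within the threshold.
            assignment.append(distances.index(min(distances)))
        else:
            # No center is close enough: the sample becomes a new center.
            centers.append(list(x))
            assignment.append(len(centers) - 1)
    return centers, assignment
```

In the setting of the text, each sample would be the mean vector of one HMM state, and `assignment` would supply the header information that maps states to state clusters.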
[0030]
[Second stage]
The transition probability between the state clusters is calculated as follows.
[0031]
First, the transition probability between state clusters is defined. Each state belonging to a state cluster is connected to other states in the triphone models. For example, as shown in FIG. 3, state S_1 is connected to state S_2, and state S_2 to state S_3. The final state S_3 of a triphone model is connected to the initial states S_4, S_5, and S_6 of the triphone models that can follow it. In general, several triphone models can follow a given triphone model, so the final state of a triphone model is connected to multiple states. Each connection between states has a direction (the arrows in FIG. 3), which represents the direction of transition from one state to the other, and a state transition probability is assigned to it as the ease of that transition. Each state also has a connection representing a self-loop transition. These transition connections between states in the triphone models are applied to the states belonging to each state cluster. As a result, a bundle of transition connections formed by the member states arises between arbitrary state clusters. FIG. 4 shows an example. In FIG. 4, state S_1 belonging to state cluster 1 is connected to state S_2 belonging to state cluster 2, which means that a transition connection from state S_1 to state S_2 (state transition probability a_12) exists in the triphone model. FIG. 4 shows the transition connections from states belonging to state cluster 1 to states belonging to other state clusters; some transition connections inside state cluster 1 are also shown. A “bundle of transition connections” is the set of transition connections between two state clusters that share the same transition direction, grouped into one. Using these bundles, the transition probability between state clusters is defined by the following equations (1) to (3).
[0032]
[Expression 1]

i, j (= 1, 2,..., N): state cluster number
u, v (= 1, 2, ..., M): state number
P_ij: Transition probability from state cluster i to state cluster j
N: Total number of state clusters
M: Total number of states
a_uv: State transition probability from state S_u to state S_v
a_uu: Self-loop transition probability of state S_u
r_i: Number of transition connections between different states that both belong to state cluster i
(Does not apply to self-loop transition connections)
z_i: Number of “bundles of transition connections” from state cluster i to other state clusters
q_u: Number of transition connections from a state S_u belonging to a certain state cluster to states belonging to other state clusters
Here, f_ij (i ≠ j) represents the sum of the state transition probabilities over the bundle of transition connections from state cluster i to state cluster j. Between state clusters with no transition connection, f_ij = 0. f_ii represents the value obtained by dividing the sum of the state transition probabilities for the transition connections inside state cluster i by the number of “bundles of transition connections” from state cluster i to other state clusters.
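Based only on the verbal definitions of f_ij, f_ii, and P_ij given here and in the claims, equations (1) to (3) can be sketched as follows. This is a reading under assumptions, not the drawing of the original publication: in particular, the inclusion of the self-loop terms a_uu in the internal sum, and the way r_i and q_u enter as summation limits, are assumed.

```latex
\begin{align*}
P_{ij} &= \frac{f_{ij}}{\sum_{k=1}^{N} f_{ik}} \\
f_{ij} &= \sum_{u \in i,\; v \in j} a_{uv} \qquad (i \neq j) \\
f_{ii} &= \frac{1}{z_i} \Bigl( \sum_{u \in i} a_{uu}
          + \sum_{\substack{u, v \in i \\ u \neq v}} a_{uv} \Bigr)
\end{align*}
```

Under this reading, each row of P_ij sums to 1 over j, and the division by z_i keeps the weight of staying inside state cluster i comparable to the weight of a single outgoing bundle.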
[0033]
The transition probability P_ij between state clusters is calculated using the above equations. The logarithm of the calculated P_ij is then multiplied by a weighting factor (constant) W. The weighting factor W is described in the operation of the reference likelihood calculating unit 14 below.
[0034]
Header information indicating to which state cluster each state of the triphone model belongs is added to W · log P_ij obtained as described above, and the result forms the state transition constraint information D15.
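The second stage can be sketched in Python as follows. This is a minimal sketch under assumptions: the function names, the dictionary representation of the state transitions, and the handling of a state cluster with no outgoing bundle (z_i = 0) are all hypothetical and do not appear in the original text.

```python
import math

def cluster_transition_probs(trans, cluster_of, n_clusters):
    """Return P[i][j], the transition probability from state cluster i to j.

    trans      : {(u, v): a_uv}, HMM state transition probabilities
                 (self-loops appear as (u, u)).
    cluster_of : {state: cluster index}; this doubles as the header information.
    """
    # Sum the state transition probabilities per (source cluster, target cluster).
    f = [[0.0] * n_clusters for _ in range(n_clusters)]
    for (u, v), a in trans.items():
        f[cluster_of[u]][cluster_of[v]] += a
    P = []
    for i in range(n_clusters):
        # z_i: number of bundles leaving cluster i for other clusters.
        z = sum(1 for j in range(n_clusters) if j != i and f[i][j] > 0.0)
        row = list(f[i])
        if z > 0:  # assumption: f_ii is left undivided when z_i = 0
            row[i] = f[i][i] / z
        total = sum(row)
        P.append([x / total if total > 0.0 else 0.0 for x in row])
    return P

def constraint_table(P, W, INH):
    """Return W * log P_ij, with log P_ij replaced by INH where P_ij = 0."""
    return [[W * (math.log(p) if p > 0.0 else INH) for p in row] for row in P]
```

The table returned by `constraint_table`, together with `cluster_of` as the header, corresponds to the state transition constraint information D15. Because the logarithm and the multiplication by W are done once in advance, none of this cost is incurred during recognition.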
[0035]
<Definition method other than the above>
Note that the transition probability between state clusters can be defined by the following definition method in addition to the above-described definition method. Both methods are common in that they are defined based on transition connections between state clusters.
[0036]
(A) Define f_ij according to whether or not a bundle of transition connections exists from state cluster i to state cluster j (the bundle may consist of a single transition connection).
[0037]
This definition is obtained by redefining f_ij and f_ii in the above equations (1) to (3) as follows.
[0038]
1) If a bundle of transition connections exists from state cluster i to state cluster j, f_ij = 1.
[0039]
2) If no bundle of transition connections exists from state cluster i to state cluster j, f_ij = 0.
[0040]
3) f_ii= 1.
[0041]
(B) Define f_ij according to the number of transition connections constituting the bundle of transition connections from state cluster i to state cluster j.
[0042]
This definition is obtained by setting a_uv = a_uu = 1 in the above equations (1) to (3).
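Definitions (A) and (B) can be illustrated with the following sketch. The function name, the `mode` argument, and the treatment of a cluster with no outgoing bundle are hypothetical; the resulting f values would still be normalized into P_ij in the same way as in the main definition.

```python
def bundle_weights(trans, cluster_of, n, mode="count"):
    """Return f[i][j] under the alternative definitions.

    mode="presence": definition (A), f_ij is 1 if a bundle exists, else 0, and f_ii = 1.
    mode="count"   : definition (B), equations (1) to (3) with a_uv = a_uu = 1.
    """
    # Count the transition connections per (source cluster, target cluster).
    f = [[0.0] * n for _ in range(n)]
    for (u, v) in trans:
        f[cluster_of[u]][cluster_of[v]] += 1.0
    if mode == "presence":
        return [[1.0 if (i == j or f[i][j] > 0.0) else 0.0 for j in range(n)]
                for i in range(n)]
    for i in range(n):
        # z_i: number of bundles leaving cluster i for other clusters.
        z = sum(1 for j in range(n) if j != i and f[i][j] > 0.0)
        if z > 0:  # assumption: f_ii is left undivided when z_i = 0
            f[i][i] /= z
    return f
```

Neither variant needs the trained state transition probabilities, only the topology of the connections.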
[0043]
[Operation of reference likelihood calculating unit 14]
The reference likelihood calculating unit 14 calculates the reference likelihood D16 using the following equations (4) and (5).
[0044]
[Expression 2]

Here,
c_uv = 1 when t = 1,
c_uv = P_ij when t ≠ 1 (where u ∈ i, v ∈ j),
and when c_uv = 0, log c_uv = INH.
[0045]
t: Frame number
L_G: Reference likelihood D16
L_g(t): Local log likelihood in frame number t
T: Total number of frames
W: Weight coefficient for state transition constraint information
u: State number that gives the maximum value of the right side of equation (5) in frame number (t-1)
v: Arbitrary state number
P_ij: Transition probability from state cluster i to state cluster j
i, j: Arbitrary state cluster number
V_t: Set of all states for which the output probability distribution calculation is performed for frame number t in the recognition processing unit 13
b_v(x_t): Output probability (density) of acoustic feature parameter x_t in state v
x_t: Acoustic feature parameter at frame number t
INH: Lower limit value (constant) of logarithm of transition probability between state clusters
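From the symbol definitions above and the wording of claim 2, equations (4) and (5) can be sketched as follows. This is a reconstruction from the surrounding text, not a copy of the original drawing.

```latex
\begin{align}
L_G &= \sum_{t=1}^{T} L_g(t) \tag{4} \\
L_g(t) &= \max_{v \in V_t} \bigl[ \log b_v(x_t) + W \cdot \log c_{uv} \bigr] \tag{5}
\end{align}
```

Here u is fixed by the maximization in frame (t−1), so each frame requires only one pass over the states in V_t.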
In equation (5), log b_v(x_t) is given as the local likelihood D14 from the recognition processing unit 13, and W · log c_uv is given from the reference table 15 as the state transition constraint information D15. Therefore, the only calculations performed by the reference likelihood calculating unit 14 are addition and magnitude comparison. In equation (5), the weighting factor W for the state transition constraint information is a parameter (constant) for adjusting the relative contributions of log c_uv and log b_v(x_t) to L_g(t), and the constant INH is a parameter (constant) that sets the lower limit of the logarithm of the transition probability between state clusters. Both values are determined experimentally.
[0046]
The reference likelihood L_G represents the cumulative log likelihood for a model expressing arbitrary utterance content. L_g(t) represents the local log likelihood in each frame for that model.
[0047]
In equation (5), u is the state number that gives the maximum value of the right side of equation (5) in frame number (t−1). Depending on the state number u, log c_uv constrains the candidates for the state number that gives the maximum value of the right side of equation (5) in the next frame number t. The ease of transition from the state of state number u to the state of state number v is used as the constraint. This state transition constraint makes it possible to calculate the reference likelihood D16 in consideration of the time structure of speech held by the triphone model.
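The frame-by-frame calculation described for equations (4) and (5) can be sketched as follows. The function name and the data layout (one dictionary of local likelihoods per frame, a nested list for the table) are assumptions made for illustration.

```python
def reference_likelihood(local_ll, cluster_of, table):
    """Return (L_G, list of best states per frame).

    local_ll   : list over frames of {state v: log b_v(x_t)} for the states in V_t
                 (the local likelihood D14).
    cluster_of : {state: cluster index} (header information).
    table      : table[i][j] = W * log c for cluster i -> cluster j (D15).
    """
    total, prev, path = 0.0, None, []
    for frame in local_ll:
        best_v, best = None, float("-inf")
        for v, ll in frame.items():
            # At t = 1, c_uv = 1, so the constraint term W * log c_uv is 0.
            c = 0.0 if prev is None else table[cluster_of[prev]][cluster_of[v]]
            if ll + c > best:
                best_v, best = v, ll + c
        total += best      # cumulative addition over frames
        prev = best_v      # becomes u for the next frame
        path.append(best_v)
    return total, path
```

Only additions and comparisons appear inside the loop, which matches the statement that no further output probability calculation is needed for the reference likelihood.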
[0048]
[Operation of Reject Determination Unit 16]
The reject determination unit 16 performs reject determination of the input voice data D10 according to the following equation (6). In equation (6), the threshold value θ for reject determination is experimentally determined. Depending on the value of the threshold θ, the recognition rate when the input is a recognition target and the rejection rate when the input is not a recognition target change. In general, since both are in a trade-off relationship, the value of the threshold value θ is determined in accordance with the desired performance.
[0049]
L_M = (L_G − L_R) / T   (6)
L_R: Recognition likelihood D13
L_G: Reference likelihood D16
T: Total number of frames
The value L_M obtained by this equation is compared with the threshold value θ to make the reject determination.
[0050]
If L_M > θ, the input is rejected, and information indicating that the input has been rejected is output as the recognition result D17.
[0051]
If L_M ≦ θ, the input is not rejected, and the recognition result candidate D12 is output as the recognition result D17.
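The reject determination of equation (6) reduces to a few lines. The function name and the numeric values in the usage note are hypothetical.

```python
def reject_decision(L_G, L_R, T, theta):
    """Return (rejected, L_M) for equation (6).

    L_G   : reference likelihood D16
    L_R   : recognition likelihood D13
    T     : total number of frames
    theta : experimentally determined threshold
    """
    L_M = (L_G - L_R) / T   # per-frame gap between the two likelihoods
    return L_M > theta, L_M
```

For example, with L_G = −80, L_R = −100, T = 10, and θ = 1.5, the normalized difference is 2.0 and the input is rejected; with L_G = −95 it is 0.5 and the recognition result candidate is output.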
[0052]
[effect]
As described above, the reference likelihood D16 used for the rejection determination of the input speech is calculated from the local likelihood D14, obtained in the course of calculating the recognition likelihood D13, and the state transition constraint information D15 created in advance. As a result, the only calculations required for the reference likelihood D16 are addition and magnitude comparison, and the increase in processing amount due to the addition of the reject function can be kept extremely small. In addition, because the reference likelihood D16 is formulated so that it can deal with various acoustic events while taking into account the time structure of speech held by the acoustic model 11, a rejection accuracy equivalent to that of the conventional method using phoneme or syllable recognition can be obtained. Consequently, when an utterance outside the recognition targets (a word other than the recognition target words, or an utterance outside the grammar), a non-verbal sound such as a cough or sneeze, or an environmental sound such as a bell is input to the device, the rejection determination can be performed efficiently with almost no increase in processing amount.
[0053]
In other words, it is as follows.
[0054]
(A) Even if a context-dependent model for subwords such as phonemes and syllables is used as the acoustic model, the addition of the reject function causes almost no increase in processing amount, and processing performance can be kept high.
[0055]
(B) The reject function can be added by using a model of any linguistic unit (phoneme, syllable, word, phrase, etc.) as the acoustic model.
[0056]
(C) A rejection accuracy equivalent to the method using phonemes or syllable recognition (conventional method) can be obtained.
[0057]
[Second Embodiment]
The first embodiment described a method of rejecting, as a whole, input utterances that consist of words other than the recognition target words or that fall outside the grammar. This embodiment describes a method of partially rejecting only the unnecessary portions of an input utterance. That is, it describes a method for dealing with the case where part of the input utterance contains an interjection typified by “ano” or “eto”, an unnecessary word such as “XX kana” or “XX toka”, or a hesitation such as “(jo) information”. Hereinafter, interjections, unnecessary words, hesitations, and the like are collectively called unnecessary words.
[0058]
FIG. 5 shows a basic configuration of the speech recognition apparatus according to this embodiment. In FIG. 5, input voice data D20 is a signal obtained by converting voice (analog signal) input from a microphone or the like into a digital signal. The input speech data D20 is converted into an acoustic feature parameter time series D21 by the speech analysis unit 20 and input to the recognition processing unit 23.
[0059]
The recognition processing unit 23 performs recognition processing using the acoustic model 21, the language model 22, and the unnecessary word processing unit 24, and outputs a recognition result D26 for the input speech data D20.
[0060]
The unnecessary word processing unit 24 calculates the unnecessary word hypothesis cumulative likelihood D25 using the local likelihood D22 and the partial hypothesis cumulative likelihood D23 calculated by the recognition processing unit 23 and the state transition constraint information D24 stored in the reference table 25. The unnecessary word hypothesis cumulative likelihood D25 is output to the recognition processing unit 23 and used for calculation of the recognition result D26.
[0061]
The following description focuses on functions specific to the speech recognition method according to the present embodiment, and a description of the same parts as in the first embodiment is omitted.
[0062]
[Language model 22]
As in the first embodiment, the language model 22 is a model that restricts recognition targets by defining words, grammar rules (syntax), and the like that can be accepted by the speech recognition apparatus. For example, as shown in FIG. 6, an acceptable word sequence is described in the form of a syntax network using a finite state automaton. However, in the language model 22 of this embodiment, in order to deal with unnecessary words in the input utterance, arcs representing unnecessary words are added to each node of the syntax network as self-transitions. In this way, unnecessary words can be accepted between arbitrary words.
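The idea of adding unnecessary-word arcs as self-transitions can be illustrated with a small finite state automaton. The network below is hypothetical (it is not the network of FIG. 6), and the function names are introduced only for this sketch.

```python
def add_unnecessary_word_loops(arcs, nodes, symbol="#"):
    """Return a copy of the arc list with a self-loop labelled `symbol` on every node."""
    return list(arcs) + [(n, symbol, n) for n in nodes]

def accepts(arcs, start, finals, words):
    """Check whether the word sequence is accepted by the finite state automaton."""
    current = {start}
    for w in words:
        current = {dst for (src, label, dst) in arcs if src in current and label == w}
    return bool(current & set(finals))

# Hypothetical three-node network: 0 --Tokyo--> 1 --traffic information--> 2
arcs = [(0, "Tokyo", 1), (1, "traffic information", 2)]
with_loops = add_unnecessary_word_loops(arcs, nodes=[0, 1, 2])
```

With the self-loops in place, a sequence such as `["#", "Tokyo", "#", "traffic information"]` is accepted, while the original network accepts only the sequence without unnecessary words.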
[0063]
[Recognition processing unit 23]
The function of the recognition processing unit 23 is substantially the same as that of the recognition processing unit 13 of the first embodiment. However, the recognition processing unit 23 of the present embodiment does not prepare an explicit HMM-based unnecessary word model for the unnecessary word part; instead, the self-transition at each node passes through the unnecessary word processing unit 24 described later. That is, the unnecessary word processing unit 24 is used as the unnecessary word model. Configuring the network in this way allows recognition processing and unnecessary word processing to be performed efficiently. A model corresponding to each hypothesis of utterance content is expressed as a part of the HMM network in a form that accepts unnecessary words between arbitrary words of the hypothesis. When an utterance is input to the speech recognition apparatus, the probability (likelihood) that the model corresponding to each hypothesis generates the acoustic feature parameter time series D21 is calculated using the HMM network. The hypothesis that gives the maximum likelihood is searched for in the HMM network and is output as the recognition result D26.
[0064]
The recognition result D26 when an unnecessary word is included in a part of the input utterance is as follows.
[0065]
[When using the language model of FIG. 6]
  Input utterance: "That's why Tokyo traffic information"
  Recognition result D26: "# Tokyo # traffic information" (# is a symbol indicating an unnecessary word)
  The maximum log likelihood obtained by logarithmizing the maximum likelihood corresponding to the recognition result D26 is hereinafter referred to as “recognition likelihood”. Here, the likelihood calculation for each hypothesis is performed in parallel in synchronization with the frame of the acoustic feature parameter time series D21. In each frame, the output probability distribution calculation (calculation of the probability of outputting the acoustic feature parameter of the frame) is performed for each state of the HMM, and this is logarithmized to obtain the local likelihood D22.
[0066]
The recognition likelihood is calculated by means such as the Viterbi algorithm disclosed in the above-mentioned prior art document (2) using the local likelihood D22 and the state transition probability of the HMM. However, the unnecessary word portion is subjected to likelihood calculation by the unnecessary word processing unit 24.
[0067]
A likelihood calculation method in the unnecessary word processing unit 24 will be described later. The handling of the unnecessary word processing unit 24 in calculating the recognition likelihood is the same as other word models. The cumulative log likelihood for the “partial hypothesis of utterance content” from frame number 1 to an arbitrary frame number in the acoustic feature parameter time series D21 is defined as a partial hypothesis cumulative likelihood D23. The partial hypothesis cumulative likelihood D23 is output to the unnecessary word processing unit 24 with the terminal frame number of the partial hypothesis added.
[0068]
[Unnecessary word processing unit 24]
The unnecessary word processing unit 24 calculates the unnecessary word hypothesis cumulative likelihood D25 using the following equations (7) to (9). Hereinafter, the partial hypothesis of the utterance content representing an unnecessary word is referred to as the “unnecessary word hypothesis”, and the log likelihood for the unnecessary word hypothesis is referred to as the “unnecessary word hypothesis likelihood”. The start frame number and end frame number in the calculation of the unnecessary word hypothesis likelihood are denoted t₀ and t₁, and the unnecessary word hypothesis likelihood at this time is L_G(t₁). The unnecessary word hypothesis cumulative likelihood D25 is defined as the sum of the partial hypothesis cumulative likelihood D23 at frame number (t₀−1) and the unnecessary word hypothesis likelihood L_G(t₁). Therefore, the unnecessary word hypothesis cumulative likelihood D25 represents the cumulative log likelihood for the partial hypothesis of the utterance content from frame number 1 to frame number t₁ (the end frame number in the calculation of the unnecessary word hypothesis likelihood).
[0069]
Next, the unnecessary word hypothesis likelihood L_G(t₁) is described. For an arbitrary start frame number t₀, equation (8) calculates the unnecessary word hypothesis likelihood L_G(t₁) for (T_del+1) different end frame numbers t₁ (= t₀+T_min, t₀+T_min+1, ..., t₀+T_min+T_del).
[0070]
Note that L_g(t) in equation (8) represents the local log likelihood for the unnecessary word hypothesis in each frame. The correction constant R is a parameter (constant) for adjusting the range of the unnecessary word hypothesis likelihood, and its value is determined experimentally. The unnecessary word hypothesis likelihood L_G(t₁) is defined as the cumulative sum of [L_g(t) + R] from start frame number t₀ to end frame number t₁. The values of T_min, which is (the minimum number of frames − 1) used to calculate the unnecessary word hypothesis likelihood L_G(t₁), and T_del, the difference between the maximum and minimum numbers of frames, are also determined experimentally.
[0071]
In equation (9), which defines L_g(t), log b_v(x_t) is given as the local likelihood D22 from the recognition processing unit 23, and W · log c_uv is given from the reference table 25 as the state transition constraint information D24. Here, the weighting factor W for the state transition constraint information is a parameter (constant) for adjusting the relative contributions of log c_uv and log b_v(x_t) to L_g(t), and the constant INH is a parameter (constant) that sets the lower limit of the logarithm of the transition probability between state clusters. Both values are determined experimentally.
[0072]
Consequently, the only calculations performed by the unnecessary word processing unit 24 are addition and magnitude comparison.
[0073]
Next, the function of log c_uv in equation (9) is explained.
[0074]
In frame number (t−1), u is the state number that gives the maximum value of the right side of equation (9). Depending on the state number u, log c_uv constrains the candidates for the state number that gives the maximum value of the right side of equation (9) in the next frame number t. The ease of transition from the state of state number u to the state of state number v is used as the constraint. This state transition constraint makes it possible to calculate an unnecessary word hypothesis likelihood that takes into account the time structure of speech held by the triphone model. The unnecessary word hypothesis cumulative likelihood D25 at frame number t₁ calculated by the unnecessary word processing unit 24 is output to the recognition processing unit 23. In the recognition processing unit 23, the likelihood calculation for the partial hypothesis of the utterance content following the unnecessary word hypothesis is performed with the unnecessary word hypothesis cumulative likelihood D25 as the initial value and frame number (t₁+1) as the start frame number. In this way, the recognition likelihood in the recognition processing unit 23 can be calculated by treating the likelihood for a word hypothesis (a partial hypothesis of the utterance content representing a word) and the likelihood for an unnecessary word hypothesis in the same manner.
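The accumulation described for equations (7) and (8) can be sketched as follows. It assumes that the local log likelihoods L_g(t) of equation (9) have already been computed for every frame; the function name, the 1-indexed frame convention, and the clipping of end frames to the available frames are assumptions of this sketch.

```python
def unnecessary_word_cumulative(L_g, L_F_prev, t0, T_min, T_del, R):
    """Return {t1: L_B(t1)} for t1 = t0+T_min, ..., t0+T_min+T_del.

    L_g      : list of local log likelihoods; L_g[t - 1] holds frame number t.
    L_F_prev : partial hypothesis cumulative likelihood D23 at frame (t0 - 1),
               taken as 0 when t0 = 1.
    R        : correction constant.
    """
    results, acc = {}, 0.0
    # Assumption: end frames beyond the last available frame are skipped.
    last = min(t0 + T_min + T_del, len(L_g))
    for t in range(t0, last + 1):
        acc += L_g[t - 1] + R              # running sum of [L_g(t) + R]
        if t >= t0 + T_min:
            results[t] = L_F_prev + acc    # add the partial hypothesis likelihood
    return results
```

Because the running sum is shared, the (T_del+1) hypotheses with different end frames cost one addition per frame rather than one full summation each.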
[0075]
[Equation 3]

Here,
c_uv = 1 when t = 1,
c_uv = P_ij when t ≠ 1 (where u ∈ i, v ∈ j),
L_F(t₀−1) = 0 when t₀ = 1,
and when c_uv = 0, log c_uv = INH.
[0076]
t: Frame number
L_B(t₁): Unnecessary word hypothesis cumulative likelihood D25 at frame number t₁
L_F(t₀−1): Partial hypothesis cumulative likelihood D23 at frame number (t₀−1)
L_G(t₁): Unnecessary word hypothesis likelihood at frame number t₁
t₀: Start frame number in calculating the unnecessary word hypothesis likelihood
t₁: End frame number in calculating the unnecessary word hypothesis likelihood
T_min: (Minimum number of frames − 1) used to calculate the unnecessary word hypothesis likelihood
T_del: Difference between the maximum and minimum numbers of frames used to calculate the unnecessary word hypothesis likelihood
R: Correction constant
T: Total number of frames
W: Weight coefficient for state transition constraint information
u: State number that gives the maximum value to the right side of equation (9) in frame number (t-1)
v: Arbitrary state number
i, j: Arbitrary state cluster number
V_t: Set of all states for which the output probability distribution calculation is performed for frame number t in the recognition processing unit 23
b_v(x_t): Output probability (density) of acoustic feature parameter x_t in state v
x_t: Acoustic feature parameter at frame number t
INH: Lower limit value (constant) of logarithm of transition probability between state clusters
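From the descriptions of equations (7) to (9) in the text and the symbol definitions above, the three equations can be sketched as follows. This is a reconstruction from the surrounding text, not a copy of the original drawing.

```latex
\begin{align}
L_B(t_1) &= L_F(t_0 - 1) + L_G(t_1) \tag{7} \\
L_G(t_1) &= \sum_{t=t_0}^{t_1} \bigl[ L_g(t) + R \bigr],
  \qquad t_1 = t_0 + T_{min},\; \ldots,\; t_0 + T_{min} + T_{del} \tag{8} \\
L_g(t) &= \max_{v \in V_t} \bigl[ \log b_v(x_t) + W \cdot \log c_{uv} \bigr] \tag{9}
\end{align}
```

Equation (9) has the same form as equation (5) of the first embodiment, so the same reference table lookup serves both embodiments.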
[effect]
In the present embodiment, in order to deal with input utterances that contain unnecessary words in part, the unnecessary word hypothesis likelihood is calculated from the local likelihood D22, obtained in the course of calculating the recognition likelihood, and the state transition constraint information D24 created in advance. As a result, the only calculations required for the unnecessary word hypothesis likelihood are addition and magnitude comparison, and the increase in processing amount due to the addition of unnecessary word processing can be kept extremely small. This increase can be kept smaller than with the method using an explicit HMM-based unnecessary word model (generally called a garbage model, in which various unnecessary words are modeled by one type of HMM).
[0077]
In addition, because the unnecessary word hypothesis likelihood is formulated so that it can deal with various acoustic events while taking into account the time structure of speech held by the acoustic model 21, high unnecessary word detection accuracy can be obtained. As a result, when an unnecessary word is contained in part of the input utterance, the unnecessary word can be detected efficiently while suppressing the increase in processing amount for recognition, and the recognition rate of the part other than the unnecessary word can be improved. Furthermore, even when a plurality of unnecessary words appear at the beginning or end of the input utterance or between arbitrary words, the unnecessary words can be detected with high accuracy.
[0078]
In other words, it is as follows.
[0079]
(A) Even if a context-dependent model for subwords such as phonemes and syllables is used as the acoustic model, the increase in processing amount due to the addition of unnecessary word processing is very small, and processing performance can be kept high.
[0080]
(B) Even if a model of any linguistic unit (phoneme, syllable, word, phrase, etc.) is used as the acoustic model, unnecessary word processing can be added without degrading the processing function.
[0081]
(C) The acoustic resolution of the model that expresses unnecessary words is higher than when the garbage model is used. Therefore, it is possible to cope with acoustic variations of various unnecessary words.
[0082]
(D) Since the garbage model is not used, it is not necessary to estimate (learn) the parameters of an unnecessary word model in advance using a large amount of speech data containing unnecessary words. This makes it possible to cope with hesitations, which are difficult to prepare as training speech data.
[0083]
[Modification]
(1) The second embodiment described a method for dealing with unnecessary words contained in part of an input utterance. However, the speech recognition method of the present invention is not limited to unnecessary words; it can also be used to deal with unknown words (words other than the recognition target words) contained in part of an input utterance. In that case, the language model 22 in FIG. 5 is described, for example, as shown in FIG. 7. This makes it possible to detect the unknown word part of the input utterance effectively and to improve the recognition rate of the remaining part, instead of rejecting the entire input utterance.
[0084]
For example, suppose the language model shown in FIG. 7 is used and the utterance
"New York tourist information"
is input. If “New York” is an unknown word, the recognition result
“@ tourist information” (@ is a symbol representing an unknown word)
can be output.
[0085]
[Effects of the invention]
As described above in detail, according to the speech recognition method of the present invention, the following effects can be obtained.
[0086]
(1) Because the reference likelihood is calculated using the local likelihood and the state transition constraint information, the calculation required for the reference likelihood can be reduced, and the increase in processing amount due to the addition of the rejection determination function can be kept small.
[0087]
(2) Because the rejection determination of the input speech data is performed by comparing the reference likelihood with the recognition likelihood, rejection can be determined efficiently with almost no increase in the processing amount for recognition.
[0088]
(3) It is possible to reduce the calculation required for calculating the unnecessary word hypothesis likelihood, and it is possible to minimize the increase in the processing amount due to the addition of unnecessary word processing.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a speech recognition apparatus according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating an example of a language model (syntax network) according to the first embodiment.
FIG. 3 is a schematic diagram showing an example of state transition connection in a triphone model.
FIG. 4 is a schematic diagram illustrating an example of transition connection between state clusters.
FIG. 5 is a block diagram showing a basic configuration of a speech recognition apparatus according to a second embodiment of the present invention.
FIG. 6 is a schematic diagram illustrating an example of a language model (syntax network) according to the second embodiment.
FIG. 7 is a schematic diagram showing an example language model (syntax network) according to a modification of the present invention.
[Explanation of symbols]
10, 20: Speech analysis unit, 11, 21: Acoustic model, 12, 22: Language model, 13, 23: Recognition processing unit, 14: Reference likelihood calculation unit, 15, 25: Reference table, 16: Reject determination unit, 24: Unnecessary word processing unit, D10, D20: Input speech data, D11, D21: Acoustic feature parameter time series, D12: Recognition result candidate, D13: Recognition likelihood, D14, D22: Local likelihood, D15, D24: State transition constraint information, D16: Reference likelihood, D17, D26: Recognition result, D23: Partial hypothesis cumulative likelihood, D25: Unnecessary word hypothesis cumulative likelihood.

Claims

A speech recognition method for recognizing and processing input speech data, comprising:
reference table creating means for creating state transition constraint information representing the ease of transition between arbitrary states of the HMMs constituting an acoustic model; recognition processing means for calculating a recognition likelihood and a local likelihood together with a recognition result candidate for the input speech data; and reference likelihood calculating means for calculating a reference likelihood used for rejection determination of the input speech data,
wherein the reference table creating means clusters the states of the HMMs constituting the acoustic model, calculates transition probabilities between state clusters based on the transition connections between the states in the generated state clusters, and adds header information indicating to which state cluster each state of the HMMs belongs to form the state transition constraint information, and
the reference likelihood calculating means calculates the reference likelihood using the local likelihood calculated by the recognition processing means and the state transition constraint information created by the reference table creating means, and rejection determination of the input speech data is performed by comparing the reference likelihood with the recognition likelihood calculated by the recognition processing means.

The speech recognition method according to claim 1,
wherein the reference likelihood calculating means calculates, for each frame of the input speech data, the HMM state that maximizes the weighted sum of the local likelihood and the state transition constraint information together with that maximum value, takes the maximum value as a local reference likelihood, and calculates the reference likelihood by cumulatively adding the local reference likelihoods over all frames.

A speech recognition method for recognizing and processing input speech data, comprising:
reference table creating means for creating state transition constraint information representing the ease of transition between arbitrary states of the HMMs constituting an acoustic model; recognition processing means for calculating a local likelihood and a partial hypothesis cumulative likelihood together with a recognition result for the input speech data; and unnecessary word processing means for calculating an unnecessary word hypothesis cumulative likelihood used for processing unnecessary words or unknown words in the input speech data,
wherein the unnecessary word processing means calculates the unnecessary word hypothesis cumulative likelihood using the local likelihood and the partial hypothesis cumulative likelihood calculated by the recognition processing means and the state transition constraint information created by the reference table creating means, and an unnecessary word or unknown word portion in the input speech data is detected using the unnecessary word hypothesis cumulative likelihood while the other portions are recognized.

The speech recognition method according to claim 3,
wherein the unnecessary word processing means:
calculates, for each frame of the input speech data, the HMM state that maximizes the weighted sum of the local likelihood and the state transition constraint information together with that maximum value, and takes the maximum value as a local reference likelihood;
sets the frame number following the frame number corresponding to the partial hypothesis cumulative likelihood as a start frame number, and obtains a plurality of end frame numbers by adding a plurality of different preset fixed frame counts to the start frame number;
calculates a plurality of different unnecessary word hypothesis likelihoods by cumulatively adding, from the start frame number to each end frame number, the value obtained by adding a correction constant to the local reference likelihood; and
calculates, by adding the partial hypothesis cumulative likelihood to each unnecessary word hypothesis likelihood, the unnecessary word hypothesis cumulative likelihoods at the plurality of different end frame numbers for the frame number of the partial hypothesis cumulative likelihood.

The speech recognition method according to claim 3,
wherein the likelihood calculation for the partial hypothesis of the utterance content following the unnecessary word hypothesis, performed by the recognition processing means, uses the unnecessary word hypothesis cumulative likelihood as the initial value and the frame number following the frame number of the unnecessary word hypothesis cumulative likelihood as the start frame number.

The speech recognition method according to claim 3,
wherein the reference table creating means clusters the states of the HMMs constituting the acoustic model, calculates transition probabilities between state clusters based on the transition connections between the states in the generated state clusters, and adds header information indicating to which state cluster each state of the HMMs belongs to form the state transition constraint information.

The speech recognition method according to claim 1 or 6,
A speech recognition method, wherein the transition probability between the state clusters is defined based on a state transition probability for a transition connection from a state belonging to an arbitrary state cluster to a state belonging to another state cluster.

The speech recognition method according to claim 1 or 6,
wherein, using the sum f_ij (i ≠ j) of the state transition probabilities for the transition connections from states belonging to an arbitrary state cluster i to states belonging to another state cluster j, and
the value f_ii obtained by dividing the sum of the state transition probabilities for the transition connections inside the arbitrary state cluster i by the number of bundles of transition connections from the arbitrary state cluster i to all other state clusters,
the transition probability P_ij from the state cluster i to the state cluster j is defined as the ratio of a given f_ij (including the case i = j) to the sum of f_ij (including the case i = j) over all state clusters j.

The speech recognition method according to claim 1 or 6,
A speech recognition method, wherein the transition probability between the state clusters is defined based on the number of transition connections from a state belonging to an arbitrary state cluster i to a state belonging to another state cluster j.

The speech recognition method according to claim 1 or 6,
A speech recognition method, wherein a transition probability between the state clusters is defined based on presence / absence of a transition connection from a state belonging to the arbitrary state cluster i to a state belonging to another state cluster j.