JP3756879B2 - Method for creating acoustic model, apparatus for creating acoustic model, computer program for creating acoustic model

Info

Publication number: JP3756879B2
Application number: JP2002367606A
Authority: JP
Inventors: 伸一芳澤; 清宏鹿野
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2001-12-20
Filing date: 2002-12-19
Publication date: 2006-03-15
Anticipated expiration: 2022-12-19
Also published as: JP2004004509A

Description

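【０００１】
【Technical Field of the Invention】
This invention relates to a method, an apparatus, and a computer program for creating an acoustic model used in speech recognition. More specifically, it relates to a method, an apparatus, and a computer program for creating an acoustic model adapted to the voice of the person using speech recognition and to the environment in which speech recognition is used.
【０００２】
【Prior Art】
In recent years, speech recognition technology has been expected to improve user convenience in digital information devices such as mobile phones, portable terminals, car navigation systems, personal computers, and home appliances.
【０００３】
If the acoustic model used by a speech recognition system does not suit a user, that user cannot use the system. A speech recognition system therefore needs an acoustic model adapted to the user's voice. As shown in Fig. 1, various techniques exist for adapting an acoustic model to the voice of the person using the system (speaker adaptation techniques). Fig. 1 maps these techniques against the computing power and hard disk capacity a system needs in order to implement them. For each technique it also lists "the number of sentences the user must utter for adaptation", "the variation factors the technique can handle (speaker characteristics, tone of voice)", and "recognition performance (indicated by the size of a star; the larger the star, the better the performance)".
【０００４】
Formerly, information devices had little computing power and little hard disk capacity, so only speaker adaptation techniques with low recognition performance, such as "vocal tract length normalization" and "MLLR + eigenvoice space", could be used. As computing power grew, the techniques "MLLR" and "CAT", which exploit that power to obtain high recognition performance, came into use. With these techniques, however, the user must utter a relatively large number of sentences to adapt the acoustic model. The burden on the user is therefore heavy, and the techniques are unsuited to devices whose users change frequently (for example, a TV remote control). They are also unsuited to devices with relatively little computing power, such as home appliances and mobile phones.
【０００５】
In recent years hard disks have grown in capacity and fallen in price, and speaker adaptation techniques that use a relatively large hard disk but need relatively little computing power, such as the "clustering-based method" and the "sufficient-statistics-based method", have appeared. These techniques suit devices with relatively little computing power: car navigation systems and home appliances such as TVs, whose hard disk capacity is growing, and mobile phones. Small home appliances and mobile phones cannot carry a large hard disk, but this is not a problem because they can now communicate with a large-capacity server over a network. In addition, these techniques require only a few uttered sentences (about one) to adapt the acoustic model, so the burden on the user is light and the system is usable immediately even when the user changes. However, the "clustering-based method" selects a single HMM close to the user and uses it as the adapted model, so recognition performance degrades sharply when no HMM close to the user and the usage environment exists.
【０００６】
In view of the above, the speaker adaptation technique best suited to mobile phones, home appliances, and the like is considered to be the "method using sufficient statistics" (芳澤伸一，馬場朗，松浪加奈子，米良祐一郎，山田実一，鹿野清宏，“充足統計量と話者距離を用いた音韻モデルの教師なし学習”，信学技報，SP2000-89，pp.83-88，2000). With this method, a single utterance from the user instantly yields a highly accurate adapted model (an acoustic model adapted to the user's voice).
【０００７】
Next, the procedure for creating an adapted model by the "method using sufficient statistics" is described with reference to Figs. 2 and 3.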
【０００８】
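〜 Creation of selection models and sufficient statistics (ST200) 〜
Speech data from many speakers (for example, about 300) recorded in a quiet environment is stored in advance in a speech database 310 (Fig. 3).
【０００９】
Using the speech data stored in the database 310, a selection model (here represented by a Gaussian mixture model (GMM)) and sufficient statistics (here represented by a hidden Markov model (HMM)) are created for each speaker and stored in a sufficient statistics file 320 (Fig. 3). "Sufficient statistics" are statistics sufficient to characterize the database; here they are the means, variances, and EM counts of an HMM acoustic model. The sufficient statistics are computed by one iteration of training from the speaker-independent model with the EM algorithm. The selection model is a Gaussian Mixture Model with 1 state and 64 mixture components, created without distinguishing phonemes.
【００１０】
The procedure for creating the sufficient statistics is described in detail with reference to Fig. 4.
【００１１】
＜ST201＞
First, speaker-independent sufficient statistics are created by training on the data of all speakers with the EM algorithm. The sufficient statistics are represented by a hidden Markov model whose states are each represented by a Gaussian mixture. The Gaussians of the resulting speaker-independent sufficient statistics are numbered.
【００１２】
＜ST202＞
With the speaker-independent sufficient statistics as initial values, sufficient statistics are created for each speaker by training on that speaker's data with the EM algorithm. For each Gaussian in a speaker's sufficient statistics, the number corresponding to the number assigned in the speaker-independent sufficient statistics is stored.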
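The single EM iteration of ST202 can be illustrated with a minimal sketch. It assumes one-dimensional features and a plain Gaussian mixture in place of the full HMM; the function names and the dictionary layout are hypothetical and are not taken from the patent. The point it shows is that the per-speaker statistics keep the Gaussian numbering of the speaker-independent initial model.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def one_em_pass(data, init):
    """One EM iteration from the speaker-independent model `init`.

    init: list of (weight, mean, var), one entry per numbered Gaussian.
    Returns one dict per Gaussian, carrying the same index as in `init`.
    Assumes every Gaussian receives some probability mass from `data`.
    """
    k = len(init)
    count = [0.0] * k   # EM counts (soft frame counts)
    s1 = [0.0] * k      # weighted sums of x
    s2 = [0.0] * k      # weighted sums of x^2
    for x in data:
        lik = [w * gauss_pdf(x, m, v) for (w, m, v) in init]
        total = sum(lik)
        for i in range(k):
            g = lik[i] / total          # posterior of Gaussian i for this frame
            count[i] += g
            s1[i] += g * x
            s2[i] += g * x * x
    stats = []
    for i in range(k):
        mean = s1[i] / count[i]
        var = s2[i] / count[i] - mean * mean
        stats.append({"index": i, "count": count[i], "mean": mean, "var": var})
    return stats
```

Because the posteriors are computed against the initial model only, Gaussian number i of every speaker is tied to Gaussian number i of the speaker-independent model, which is what later allows same-numbered Gaussians to be merged.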
【００１３】
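〜 Input of speech data for adaptation (ST210) 〜
The user's speech is input.
【００１４】
〜 Selection of sufficient statistics by the selection models (ST220) 〜
On the basis of the input speech and the selection models, multiple sets of sufficient statistics "close" to the user's speech (acoustic models of speakers acoustically close to the user's speech) are selected. Here, the "close" sufficient statistics are those of the speakers whose selection models give the N highest likelihoods when the input speech is fed to the selection models. This selection is performed in the adapted model creation unit 330 shown in Fig. 3 and is illustrated in Fig. 5.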
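A minimal sketch of the top-N selection in ST220 follows. It assumes one-dimensional frames and tiny GMMs rather than 64-component models; `gmm_loglik` and `select_top_n` are illustrative names, not part of the patent.

```python
import math

def gmm_loglik(frames, gmm):
    """Total log-likelihood of 1-D frames under a GMM given as [(weight, mean, var), ...]."""
    total = 0.0
    for x in frames:
        p = sum(w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                for (w, m, v) in gmm)
        total += math.log(p)
    return total

def select_top_n(frames, selection_models, n):
    """Return the n speakers whose selection models score the input speech highest.

    selection_models: dict mapping speaker name -> GMM.
    """
    ranked = sorted(selection_models.items(),
                    key=lambda item: gmm_loglik(frames, item[1]),
                    reverse=True)
    return [speaker for speaker, _ in ranked[:n]]
```

The sufficient statistics paired with the returned speakers are the ones passed on to model creation.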
【００１５】
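〜 Creation of the adapted model (ST230) 〜
An adapted model is created from the selected sufficient statistics. Specifically, across the selected sufficient statistics, Gaussians with the same number are combined by a new statistical computation (Equations 1 to 3) into a single Gaussian. This processing is performed in the adapted model creation unit 330 shown in Fig. 3 and is illustrated in Fig. 5.
【００１６】
【Equation 1】

【００１７】
【Equation 2】

【００１８】
【Equation 3】

【００１９】
Here, the mean and variance of the normal distributions in each HMM state of the adapted model are μ_i^adp (i = 1, 2, …, N_mix) and v_i^adp (i = 1, 2, …, N_mix), respectively, where N_mix is the number of mixture components. The state transition probabilities are a^adp[i][j] (i, j = 1, 2, …, N_state), where N_state is the number of states and a^adp[i][j] is the probability of a transition from state i to state j. N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) and v_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) are the means and variances of the respective acoustic models. C_mix^j (j = 1, 2, …, N_sel) and C_state^k[i][j] (k = 1, 2, …, N_sel; i, j = 1, 2, …, N_state) are the EM counts (frequencies) for the normal distributions and the EM counts for the state transitions, respectively.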
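Given the symbols defined above, the combination is presumably a count-weighted merge of the same-numbered Gaussians and a normalization of the summed transition counts. One form consistent with those definitions, offered as an assumption rather than as the original equations, is:

```latex
\begin{align}
\mu_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}}
  && (i = 1, 2, \dots, N_{mix}) \\
v_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(v_i^{j} + (\mu_i^{j})^{2}\right)}{\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - \left(\mu_i^{adp}\right)^{2}
  && (i = 1, 2, \dots, N_{mix}) \\
a^{adp}[i][j] &= \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}{\sum_{j'=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j']}
  && (i, j = 1, 2, \dots, N_{state})
\end{align}
```

The first two lines match the first and second moments of the pooled data, so the merged Gaussian is the one that a single Gaussian fit to all selected speakers' frames for that component would give; the third line turns summed transition counts into probabilities that sum to 1 over j.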
【００２０】
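〜 Recognition (ST240) 〜
The speech recognition system 300 (Fig. 3) recognizes the user's speech using the adapted model created as described above.
【００２１】
【Non-Patent Document 1】
芳澤伸一，馬場朗，松浪加奈子，米良祐一郎，山田実一，鹿野清宏，「充足統計量と話者距離を用いた音韻モデルの教師なし学習」，信学技報，SP2000-89，pp.83-88，2000
【Patent Document 1】
Japanese Unexamined Patent Publication No. 2001-255887
【Patent Document 2】
Japanese Unexamined Patent Publication No. H10-161692
【Patent Document 3】
Japanese Unexamined Patent Publication No. H5-2399
【Patent Document 4】
Japanese Unexamined Patent Publication No. H6-214592
【Patent Document 5】
Japanese Unexamined Patent Publication No. H9-258769
【Patent Document 6】
Japanese Unexamined Patent Publication No. 2002-182682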
【００２２】
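【Problem to Be Solved by the Invention】
The "method using sufficient statistics" described above approximates the positional relationship among the Gaussians of the speaker-independent (initial) sufficient statistics as equivalent to the positional relationship among the Gaussians of each speaker's sufficient statistics. That is, it assumes that when the sufficient statistics of each set of speech data are computed from the initial sufficient statistics, only the mixture weights, means, and variances are trained while the positional relationship among the Gaussians is preserved. Concretely, it assumes that, among the Gaussians of the initial sufficient statistics, the one closest (by a distribution distance such as the KL distance) to a given Gaussian of the speech data's sufficient statistics has the same number as that Gaussian. This assumption holds in a quiet environment (see Fig. 4), so the method is effective for creating adapted models "in a quiet environment". For practical use, however, the creation of adapted models "in a noisy environment" must be considered. In that case the assumption no longer holds, as shown in Fig. 6, and the accuracy of the adapted model falls.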
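The assumption can be checked mechanically. The sketch below assumes one-dimensional Gaussians and uses the closed-form KL divergence between two normal distributions; `ordering_preserved` is a hypothetical helper name, and the direction of the divergence (trained Gaussian to initial Gaussian) is a choice made here, since the text says only "a distribution distance such as the KL distance".

```python
import math

def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for 1-D Gaussians with variances v0, v1."""
    return 0.5 * (math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0)

def ordering_preserved(initial, trained):
    """True if, for every trained Gaussian number i, the nearest initial Gaussian is also number i.

    initial, trained: lists of (mean, var), indexed by Gaussian number.
    """
    for i, (m, v) in enumerate(trained):
        nearest = min(range(len(initial)),
                      key=lambda k: kl_gauss(m, v, initial[k][0], initial[k][1]))
        if nearest != i:
            return False
    return True
```

With small speaker-dependent shifts the check passes; when noise pushes a trained Gaussian closer to a differently numbered initial Gaussian, it fails, which is the situation the text attributes to noisy environments.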
【００２３】
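The object of this invention is to provide an acoustic model creation method, an acoustic model creation apparatus, and an acoustic model creation program that can prevent the accuracy of the adapted model from falling in a noisy environment.
【００２４】
【Means for Solving the Problem and Effects of the Invention】
The method according to this invention is a method for creating an acoustic model used in speech recognition and comprises the following steps (a) to (e). Step (a) groups noise-superimposed speech data on the basis of acoustic closeness. Step (b) creates, for each group obtained in step (a), sufficient statistics using the speech data contained in that group. Step (c) selects, from the groups obtained in step (a), a group acoustically close to the speech data of the person using speech recognition (the user). Step (d) selects, from the sufficient statistics of the group selected in step (c), sufficient statistics acoustically close to the user's speech data. Step (e) creates an acoustic model using the sufficient statistics selected in step (d).
【００２５】
Preferably, steps (a) and (b) are performed offline before the user uses speech recognition.
【００２６】
Preferably, step (a) groups by noise type.
【００２７】
Preferably, step (a) groups by the SNR of the noise-superimposed speech data.
【００２８】
Preferably, step (a) groups acoustically close speakers together.
【００２９】
Preferably, step (b) creates sufficient statistics for each speaker.
【００３０】
Preferably, step (b) creates sufficient statistics for each tone of voice of a speaker.
【００３１】
Preferably, step (b) creates sufficient statistics for each noise type.
【００３２】
Preferably, step (b) creates sufficient statistics for each SNR of the speech data contained in each group.
【００３３】
The apparatus according to this invention is an apparatus for creating an acoustic model used in speech recognition and comprises a storage unit, a first selection unit, a second selection unit, and a model creation unit. The storage unit stores sufficient statistics created, for each of a plurality of groups obtained by grouping noise-superimposed speech data on the basis of acoustic closeness, using the speech data contained in that group. The first selection unit selects, from the plurality of groups, a group acoustically close to the speech data of the person using speech recognition (the user). The second selection unit selects, from the sufficient statistics of the group selected by the first selection unit, sufficient statistics acoustically close to the user's speech data. The model creation unit creates an acoustic model using the sufficient statistics selected by the second selection unit.
【００３４】
Preferably, the apparatus further comprises a group creation unit and a sufficient statistics creation unit. The group creation unit groups noise-superimposed speech data on the basis of acoustic closeness. The sufficient statistics creation unit creates, for each group obtained by the group creation unit, sufficient statistics using the speech data contained in that group. The storage unit stores the sufficient statistics created by the sufficient statistics creation unit.
【００３５】
The program according to this invention is a computer program for creating an acoustic model used in speech recognition and causes a computer to function as means (a) to (d). Means (a) stores sufficient statistics created, for each of a plurality of groups obtained by grouping noise-superimposed speech data on the basis of acoustic closeness, using the speech data contained in that group. Means (b) selects, from the plurality of groups, a group acoustically close to the speech data of the person using speech recognition (the user). Means (c) selects, from the sufficient statistics of the group selected by means (b), sufficient statistics acoustically close to the user's speech data. Means (d) creates an acoustic model using the sufficient statistics selected by means (c).
【００３６】
Preferably, the program further causes the computer to function as means (e) and (f). Means (e) groups noise-superimposed speech data on the basis of acoustic closeness. Means (f) creates, for each group obtained by means (e), sufficient statistics using the speech data contained in that group. Means (a) stores the sufficient statistics created by means (f).
【００３７】
With the above method, apparatus, and program, data that are "acoustically close" across variations in noise type, SNR, speaker, and so on are grouped together, and the sufficient statistics and the adapted model (acoustic model) are created within each group. Grouping in this way makes the assumption described above hold. As a result, the fall in adapted-model accuracy in a noisy environment is prevented and a highly accurate adapted model can be created.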
【００３８】
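Another method according to this invention is a method for creating an acoustic model used in speech recognition and comprises the following steps (a) to (d). Step (a) selects, from a plurality of speech data items from a plurality of speakers, speech data acoustically close to the speech data of the person using speech recognition (the user). Step (b) superimposes, on the speech data selected in step (a), noise from the environment in which speech recognition is used. Step (c) creates sufficient statistics using the speech data on which noise was superimposed in step (b). Step (d) creates an acoustic model using the sufficient statistics created in step (c).
【００３９】
Preferably, the method further comprises steps (e) and (f). Step (e) superimposes, on the plurality of speech data items from the plurality of speakers, noise from an environment in which speech recognition is predicted to be used. Step (f) creates label information for the speech data on which noise was superimposed in step (e). Step (c) creates the sufficient statistics using the speech data on which noise was superimposed in step (b) together with, out of the label information created in step (f), the label information for the speech data selected in step (a).
【００４０】
Preferably, step (f) further creates information on the state transitions of the acoustic model for the speech data on which noise was superimposed in step (e). Step (c) creates the sufficient statistics additionally using, out of the state transition information created in step (f), the information for the speech data selected in step (a).
【００４１】
Preferably, step (e) superimposes each of a plurality of noise types on the plurality of speech data items from the plurality of speakers. Step (f) creates label information for each of the noise types. Step (c) selects, from the multiple sets of label information for the speech data selected in step (a), the label information suited to the environment in which speech recognition is used, and creates the sufficient statistics using the selected label information.
【００４２】
Another apparatus according to this invention is an apparatus for creating an acoustic model used in speech recognition and comprises a storage unit, a selection unit, a noise superimposition unit, a sufficient statistics creation unit, and a model creation unit. The storage unit stores a plurality of speech data items from a plurality of speakers. The selection unit selects, from the speech data stored in the storage unit, speech data acoustically close to the speech data of the person using speech recognition (the user). The noise superimposition unit superimposes, on the speech data selected by the selection unit, noise from the environment in which speech recognition is used. The sufficient statistics creation unit creates sufficient statistics using the speech data on which the noise superimposition unit has superimposed noise. The model creation unit creates an acoustic model using the sufficient statistics created by the sufficient statistics creation unit.
【００４３】
Another program according to this invention is a computer program for creating an acoustic model used in speech recognition and causes a computer to function as means (a) to (e). Means (a) stores a plurality of speech data items from a plurality of speakers. Means (b) selects, from the speech data stored by means (a), speech data acoustically close to the speech data of the person using speech recognition (the user). Means (c) superimposes, on the speech data selected by means (b), noise from the environment in which speech recognition is used. Means (d) creates sufficient statistics using the speech data on which noise was superimposed by means (c). Means (e) creates an acoustic model using the sufficient statistics created by means (d).
【００４４】
With the above method, apparatus, and program, processing is performed on acoustically close speech data, so a highly accurate adapted model can be created. In addition, because the sufficient statistics are computed only after acoustically close speech data has been selected, the processing for creating the sufficient statistics is fast.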
【００４５】
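The adapted model creation apparatus according to this invention is an apparatus for creating an acoustic model used in speech recognition and comprises a storage unit, a memory unit, and a model creation unit. The storage unit stores a plurality of groups formed on the basis of acoustic closeness. Each of the groups contains a plurality of sufficient statistics. The memory unit stores a group ID indicating at least one of the groups. The model creation unit selects, from the groups corresponding to the group IDs stored in the memory unit, one group acoustically close to the user's speech. The model creation unit creates an acoustic model using at least two sufficient statistics that are contained in the selected group and are acoustically close to the user's speech.
【００４６】
Preferably, the model creation unit selects at least one group acoustically close to the user's speech from the plurality of groups and stores a group ID indicating the selected group in the memory unit.
【００４７】
Preferably, the memory unit stores the group ID in association with the type of noise in the environment in which speech recognition is used.
【００４８】
Preferably, the memory unit stores the group ID in association with a user ID indicating the user.
【００４９】
Preferably, the memory unit stores the group ID in association with a device ID identifying the adapted model creation apparatus.
【００５０】
Another adapted model creation apparatus according to this invention is an apparatus for creating an acoustic model used in speech recognition and comprises a storage unit and a model creation unit. The storage unit stores a plurality of groups formed on the basis of acoustic closeness. Each of the groups contains a plurality of sufficient statistics. The model creation unit receives a group ID indicating at least one of the groups. The model creation unit selects, from the groups corresponding to the received group IDs, one group acoustically close to the user's speech. The model creation unit creates an acoustic model using at least two sufficient statistics that are contained in the selected group and are acoustically close to the user's speech.
【００５１】
Preferably, the model creation unit receives the group ID from an external storage device. The model creation unit selects at least one group acoustically close to the user's speech from the plurality of groups and stores a group ID indicating the selected group in the storage device.
【００５２】
Preferably, the storage device stores the group ID in association with the type of noise in the environment in which speech recognition is used.
【００５３】
Preferably, the storage device stores the group ID in association with a user ID indicating the user.
【００５４】
Preferably, the storage device stores the group ID in association with a device ID identifying the adapted model creation apparatus.
【００５５】
Another adapted model creation apparatus according to this invention is an apparatus for creating an acoustic model used in speech recognition and comprises a selection unit and a model creation unit. The selection unit receives a group ID indicating at least one of a plurality of groups. The groups are formed on the basis of acoustic closeness, and each contains a plurality of sufficient statistics. The selection unit selects, from the groups corresponding to the received group IDs, one group acoustically close to the user's speech. The model creation unit receives at least two sufficient statistics that are contained in the group selected by the selection unit and are acoustically close to the user's speech. The model creation unit creates an acoustic model using the received sufficient statistics.
【００５６】
Preferably, the selection unit receives the group ID from an external storage device. The selection unit selects at least one group acoustically close to the user's speech from the plurality of groups and stores a group ID indicating the selected group in the storage device.
【００５７】
Preferably, the storage device stores the group ID in association with the type of noise in the environment in which speech recognition is used.
【００５８】
Preferably, the storage device stores the group ID in association with a user ID indicating the user.
【００５９】
Preferably, the storage device stores the group ID in association with a device ID identifying the adapted model creation apparatus.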
【００６０】
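【Embodiments of the Invention】
Embodiments of this invention are described in detail below with reference to the drawings. Identical or corresponding parts in the figures carry the same reference numerals, and their description is not repeated.
【００６１】
(First Embodiment)
＜Configuration of the adapted model creation apparatus＞
Fig. 7 is a block diagram showing the overall configuration of the adapted model creation apparatus for speech recognition according to the first embodiment. The apparatus shown in Fig. 7 comprises a sufficient statistics creation unit 1, a selection model creation unit 2, a sufficient statistics storage unit 3, a selection model storage unit 4, an adapted model creation unit 5, and a group creation unit 6.
【００６２】
The group creation unit 6 groups, by "acoustic closeness", noise-superimposed speech data 84 created by superimposing noise data 82 on speech data 83 recorded in a quiet environment.
【００６３】
The sufficient statistics creation unit 1 creates sufficient statistics 71 for each group created by the group creation unit 6, using the speech data 84 grouped by that unit.
【００６４】
The sufficient statistics storage unit 3 stores the sufficient statistics 71 created by the sufficient statistics creation unit 1.
【００６５】
The selection model creation unit 2 creates selection models 73. A selection model 73 is a model for selecting, from the sufficient statistics 71 stored in the storage unit 3, sufficient statistics 72 close to the user's speech data 81.
【００６６】
The selection model storage unit 4 stores the selection models 73 created by the selection model creation unit 2.
【００６７】
The adapted model creation unit 5 uses the selection models 73 stored in the storage unit 4 to select, from the sufficient statistics 71 stored in the storage unit 3, sufficient statistics 72 "acoustically close" to the user's speech data 81, and creates an adapted model 74 from the selected sufficient statistics 72.
【００６８】
＜Procedure for creating the adapted model＞
Next, the procedure by which the apparatus configured as above creates an adapted model is described. The case where the user performs speech recognition indoors is taken as an example.
［Creation of the sufficient statistics 71 and the selection models 73］
First, the creation of the sufficient statistics 71 and the selection models 73 is described for the case where it is performed offline, before the user requests an adapted model.
【００６９】
Speech data 83 of multiple speakers is recorded in a quiet environment. Here, speech data of about 300 people is recorded.
【００７０】
Noise data 82 of the environment in which the user is expected to use speech recognition is recorded. Here, indoor noise is recorded.
【００７１】
Speech data 84 is created by superimposing the noise data 82 on the speech data 83 at the SNRs expected in the environment in which the user will use speech recognition. Here, the noise data 82 is superimposed at SNRs of 15 dB, 20 dB, and 25 dB.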
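Superimposing noise at a prescribed SNR amounts to scaling the noise before adding it. The following is a minimal sketch, assuming waveforms held as lists of floats and SNR defined as the ratio of mean powers over the whole utterance; the function names are hypothetical and the patent does not specify how the scaling is done.

```python
import math

def mean_power(x):
    """Average squared amplitude of a waveform."""
    return sum(s * s for s in x) / len(x)

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that 10*log10(P_speech / P_noise) equals snr_db.

    The noise is repeated cyclically if it is shorter than the speech.
    """
    gain = math.sqrt(mean_power(speech) / (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]

def snr_db(speech, noise):
    """SNR in dB between a clean waveform and a noise waveform."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))
```

Calling `mix_at_snr(clean, room_noise, snr)` for `snr` in `(15, 20, 25)` would yield the three data sets that later form groups A, B, and C.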
【００７２】
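The group creation unit 6 groups the created speech data 84 by "acoustic closeness". Here, as shown in Fig. 8, the data is grouped by SNR into group A (15 dB), group B (20 dB), and group C (25 dB).
【００７３】
The sufficient statistics 71 are created. As shown in Fig. 9, the sufficient statistics creation unit 1 creates speaker-independent models A to C for the respective groups created by the group creation unit 6, using the noise-superimposed speech data 84A to 84C. Next, for each group, sufficient statistics 71A to 71C are computed for each speaker by one iteration of EM training from the group's speaker-independent model, using that speaker's noise-superimposed speech data 84. Here, about 300 sets of sufficient statistics are created per group.
【００７４】
The selection models 73 are created. As an example, as shown in Fig. 10, for each group created by the group creation unit 6, selection models 73A to 73C are created for each speaker from the noise-superimposed speech data 84A to 84C as Gaussian Mixture Models (GMMs) with 1 state and 64 mixture components, without distinguishing phonemes. Here, about 300 sufficient-statistics selection models are created per group.
【００７５】
The speech data 84A to 84C (Fig. 9) used to create the sufficient statistics 71A to 71C (Fig. 9) and the selection models 73A to 73C (Fig. 10) created from that data form pairs, and the sufficient statistics close to the user's speech data are selected through the corresponding selection models.
【００７６】
The sufficient statistics storage unit 3 stores the sufficient statistics 71A to 71C created by the sufficient statistics creation unit 1. The selection model storage unit 4 stores the selection models 73A to 73C created by the selection model creation unit 2. Examples of the sufficient statistics 71 stored in the sufficient statistics storage unit 3 are shown in Figs. 11 and 16. An example of the selection models 73 stored in the selection model storage unit 4 is shown in Fig. 12. For each speaker (Mr./Ms. A to Mr./Ms. Z) in each group (A to C), the sufficient statistics and the selection model form a pair.
［Creation of the adapted model 74］
Next, the procedure by which the adapted model creation unit 5 creates the adapted model 74 is described.
【００７７】
The description uses the examples of the sufficient statistics 71 and the selection models 73 shown in Figs. 11 and 12.
【００７８】
The user requests creation of the adapted model 74.
【００７９】
Using a microphone for speech recognition or the like, the user inputs speech data 81, uttered in the environment where speech recognition is used, to the adapted model creation unit 5. Noise from that environment is superimposed on the speech data 81.
【００８０】
Here, the case is described where the user uses speech recognition indoors in an environment with an SNR of 20 dB.
【００８１】
The adapted model creation unit 5 sends the speech data 81 to the selection model storage unit 4 and inputs it to the selection models 73. That is, the speech data 81 is input to the sufficient-statistics selection models of speakers A to Z in groups A to C of Fig. 12.
【００８２】
Of the groups created by the group creation unit 6, the group "acoustically close" to the user's speech data 81 is determined.
【００８３】
The likelihood of each selection model 73 for the input speech data 81 is computed, and the models are sorted in descending order of likelihood. That is, the likelihoods of the selection models of speakers A to Z in groups A to C of Fig. 12 for the speech data 81 are computed and sorted in descending order. Fig. 13 shows an example of the selection models 73 sorted in this way.
【００８４】
The top N selection models in descending order of likelihood (100 in the example of Fig. 13) are selected, and the group (indoor-noise SNR) selected most often is determined. In the example of Fig. 12, the group selected most often is group B (indoor noise, 20 dB). That is, group B is the group "acoustically close" to the user's speech data 81.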
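The two-stage selection (group by majority among the top N, then the top L within that group) can be sketched as follows. The tuple layout and function names are assumptions for illustration; the likelihoods are taken as already computed by the selection models.

```python
from collections import Counter

def select_group(scores, n):
    """Pick the group that owns the most of the n highest-scoring selection models.

    scores: list of (group, speaker, log_likelihood) tuples.
    """
    top = sorted(scores, key=lambda s: s[2], reverse=True)[:n]
    votes = Counter(group for group, _, _ in top)
    return votes.most_common(1)[0][0]

def select_statistics(scores, group, l):
    """Within `group`, return the (group, speaker) keys of the l highest-scoring models."""
    in_group = sorted((s for s in scores if s[0] == group),
                      key=lambda s: s[2], reverse=True)
    return [(g, speaker) for g, speaker, _ in in_group[:l]]
```

Note that the second stage ranks only models inside the winning group, so statistics from a different SNR group never enter the adapted model even if a few of them scored highly in the first stage.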
【００８５】
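The adapted model 74 is created from the sufficient statistics of the group "acoustically close" to the speech data 81 (group B). From the selection models 73 of that group, the top L in descending order of likelihood (20 in the example of Fig. 14) are selected. The adapted model 74 is then created from the sufficient statistics 72 paired with the selected selection models, specifically by the following statistical computation (Equations 4 to 6). The mean and variance of the normal distributions in each HMM state of the adapted model 74 are denoted μ_i^adp (i = 1, 2, …, N_mix) and v_i^adp (i = 1, 2, …, N_mix), respectively, where N_mix is the number of mixture components. The state transition probabilities are denoted a^adp[i][j] (i, j = 1, 2, …, N_state), where N_state is the number of states and a^adp[i][j] is the probability of a transition from state i to state j.
【００８６】
【Equation 4】

【００８７】
【Equation 5】

【００８８】
【Equation 6】

【００８９】
Here, N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) and v_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) are the means and variances of the respective HMMs. C_mix^j (j = 1, 2, …, N_sel) and C_state^k[i][j] (k = 1, 2, …, N_sel; i, j = 1, 2, …, N_state) are the EM counts (frequencies) for the normal distributions and the EM counts for the state transitions, respectively.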
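As an illustration of how selected sufficient statistics might be merged, the sketch below combines same-numbered Gaussians by count-weighted moment matching and normalizes summed transition counts. This is an assumed form consistent with the symbol definitions, not a transcription of Equations 4 to 6; it uses one-dimensional features, and it keeps a separate EM count per Gaussian, which is also an assumption.

```python
def combine_gaussians(selected):
    """Merge same-numbered Gaussians across the selected models.

    selected: list over selected models; each model is a list, indexed by Gaussian
    number, of (em_count, mean, var). Returns a list of (mean_adp, var_adp).
    """
    n_mix = len(selected[0])
    adapted = []
    for i in range(n_mix):
        total = sum(model[i][0] for model in selected)
        mean = sum(model[i][0] * model[i][1] for model in selected) / total
        # Pool second moments, then subtract the squared pooled mean.
        second = sum(model[i][0] * (model[i][2] + model[i][1] ** 2)
                     for model in selected) / total
        adapted.append((mean, second - mean ** 2))
    return adapted

def combine_transitions(counts):
    """Sum state-transition EM counts over models and normalize each row.

    counts: list over selected models of N_state x N_state count matrices.
    """
    n = len(counts[0])
    summed = [[sum(c[i][j] for c in counts) for j in range(n)] for i in range(n)]
    return [[summed[i][j] / sum(summed[i]) for j in range(n)] for i in range(n)]
```

Because only counts, means, and variances are needed, this step involves no pass over audio, which is in line with the document's claim that the adapted model is obtained almost instantly once the statistics are selected.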
【００９０】
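The adapted model creation unit 5 then stands by for the user's next request to create an adapted model.
【００９１】
＜Experimental results＞
The results of a recognition experiment using the adapted model are described.
【００９２】
The conditions of the recognition experiment are as follows. The database consists of data from 306 speakers, each with about 200 uttered sentences. The data is sampled at 16 kHz with 16 bits. The features are 12-dimensional MFCCs (mel-frequency cepstral coefficients) analyzed with a window shift of 10 ms, together with delta cepstrum and delta power. CMN (cepstral mean normalization) is applied during feature extraction. A 20k language model built from newspaper articles is used. There are 46 evaluation speakers. A total of 200 sentences, 4 to 5 per speaker, were used for evaluation. Indoor noise was used as the noise type.
【００９３】
Fig. 15 shows the results of the recognition experiment. It also shows the recognition results of the prior-art technique that creates an adapted model from sufficient statistics.
【００９４】
The results in Fig. 15 show that the adapted model created by this invention performs markedly better than the one created by the prior art.
【００９５】
＜Effects＞
As described above, the first embodiment clusters (groups) "acoustically close" data and creates the selection models, the sufficient statistics, and the adapted model within each group. Clustering in this way makes the assumption described in the prior-art section hold. As a result, the fall in adapted-model accuracy in a noisy environment is prevented and a highly accurate adapted model can be created. The "acoustically close" speech data that are grouped together are the speech data lying within the range in which the assumption of the "sufficient-statistics-based method" described in the prior-art section holds. Specifically, they are speech data for which, when the sufficient statistics of each speech data item are computed from the initial sufficient statistics, only the mixture weights, means, and variances are trained while the positional relationship among the Gaussians is preserved (see Fig. 16). In other words, the number of the initial Gaussian closest (by a distribution distance such as the KL distance) to a Gaussian of a speech data item's sufficient statistics is the same as the number of that Gaussian (see Fig. 16).
【００９６】
Examples of groupings that make this assumption hold include:
・Forming a group for each noise type.
・Forming a group for each SNR.
・Creating a speech model (represented by a Gaussian mixture) from each speech data item and placing items whose inter-distribution distance, such as the KL distance, is small in the same group.
An example is shown in Fig. 17.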
【００９７】
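The first embodiment also provides the following effects.
【００９８】
Because speech data 83 recorded offline is used as the speech data for creating the adapted model 74 adapted to the noise and the speaker, the user does not need to utter a large amount of speech, and the burden on the user is light.
【００９９】
Because the sufficient statistics 71 are created from the noise-superimposed speech data 84 and the adapted model 74 is created from them, an adapted model suited to the usage environment is obtained. The adapted model can therefore be used in a noisy environment.
【０１００】
Because the sufficient statistics 71 are created offline, the adapted model 74 can be created instantly at adaptation time. The adapted model is therefore available immediately when the usage environment changes.
【０１０１】
Because sufficient statistics are created for each group formed by the group creation unit 6 and the adapted model 74 is created from them, an adapted model 74 better suited to the user's speech data 81 is obtained. More users can therefore use adapted models in a variety of noise environments.
【０１０２】
As the noise-superimposed speech data 84, recordings of speech uttered in a noisy environment may be used instead of speech data on which noise data has been superimposed by computation.
【０１０３】
The group creation unit 6 may form groups by noise type or by close speakers.
【０１０４】
Speech data from various noise environments, such as indoor noise, in-car noise, venue noise, and vacuum cleaner sound, may be used as the noise-superimposed speech data 84.
【０１０５】
The timing for creating the adapted model 74 may be determined automatically by the adapted model creation unit.
【０１０６】
The sufficient-statistics selection model 73 is not limited to a Gaussian Mixture Model.
【０１０７】
Noise from the actual usage environment may be used as the noise data 82.
【０１０８】
The adapted model creation apparatus according to the first embodiment can be implemented in hardware or in software (a computer program).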
【０１０９】
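＜Concrete product examples and grouping examples＞
A speech recognition system using the speaker adaptation technique of the first embodiment can be installed in products (information devices) such as the following: mobile phones, portable terminals (PDAs), car navigation systems, personal computers, TV remote controls, speech translation devices, pet robots, and dialogue agents (graphics). Some of these are described below together with grouping examples.
【０１１０】
［Group creation method 1］
A group is formed for each combination of noise type × SNR, and within each group sufficient statistics are stored for each variation of speaker × tone of voice.
【０１１１】
＜Device used by multiple speakers under multiple noise types (example: TV operation)＞
・Group selection method 1 (see Fig. 18)
The configuration of the system in this example is shown in Fig. 18A. The system comprises a server 1800, a digital TV system 1810, and a voice remote control 1820. The server 1800 includes the group creation unit 6, the selection model creation unit 2, and the sufficient statistics creation unit 1. As shown in Fig. 18B, the group creation unit 6 groups the noise-superimposed speech data 84 by noise type (vacuum cleaner sound, washing machine sound, etc.) × SNR (10 dB, 20 dB, etc.). For each group created by the group creation unit 6, the sufficient statistics creation unit 1 creates sufficient statistics for each speaker (speaker A, speaker B, etc.) × tone of voice (nasal voice, normal voice, fast speech, etc.). The selection model creation unit 2 creates a corresponding selection model for each set of sufficient statistics created by the sufficient statistics creation unit 1. The voice remote control 1820 includes a microphone 1821, which converts the user's utterance into predetermined speech data. The converted speech data is sent to the digital TV system 1810. The digital TV system 1810 includes a hard disk (HDD) 1811, the adapted model creation unit 5, the speech recognition system 300 (see Fig. 3), and a processing unit 1812. The selection models created by the selection model creation unit 2 of the server 1800 and the sufficient statistics created by the sufficient statistics creation unit 1 are downloaded to the HDD 1811 over a communication network. The adapted model creation unit 5 creates an adapted model using the speech data from the voice remote control 1820 and the selection models and sufficient statistics stored in the HDD 1811. The speech recognition system 300 recognizes the speech data from the voice remote control 1820 using the adapted model created by the adapted model creation unit 5. The processing unit 1812 performs various processing according to the recognition result. The system configured as above performs the following processing.
【０１１２】
［Step ST1］
The user speaks into the microphone 1821 of the voice remote control 1820. The utterance is converted into predetermined speech data and sent to the digital TV system 1810.
【０１１３】
［Step ST2］
The adapted model creation unit 5 inputs the speech data from the voice remote control 1820 into the selection models in the HDD 1811 and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１１４】
［Step ST3］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.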
【０１１５】
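・Group selection method 2 (see Figs. 19 and 20)
The configuration of the display system in this example is shown in Fig. 19A. The system comprises a server 1900, a digital TV system 1910, and a voice remote control 1920. The server 1900 includes the group creation unit 6, the selection model creation unit 2, the sufficient statistics creation unit 1, the selection model storage unit 4, and the sufficient statistics storage unit 3. As shown in Fig. 19B, the group creation unit 6 groups the noise-superimposed speech data 84 by noise type (vacuum cleaner A sound, vacuum cleaner B sound, etc.) × SNR (10 dB, 20 dB, etc.). For each group created by the group creation unit 6, the sufficient statistics creation unit 1 creates sufficient statistics for each speaker (speaker A, speaker B, etc.) × tone of voice (nasal voice, normal voice, fast speech, etc.). The selection model creation unit 2 creates a corresponding selection model for each set of sufficient statistics created by the sufficient statistics creation unit 1. The voice remote control 1920 includes the microphone 1821 and a memory 1922. The memory 1922 stores IDs indicating noise types (noise IDs) in association with IDs indicating groups (group IDs). The digital TV system 1910 includes the adapted model creation unit 5, the speech recognition system 300 (see Fig. 3), and the processing unit 1812. The adapted model creation unit 5 creates an adapted model using the speech data from the voice remote control 1920, the selection models stored in the selection model storage unit 4 of the server 1900, and the sufficient statistics stored in the sufficient statistics storage unit 3. The system configured as above performs the following processing.
【０１１６】
［Step ST1-a］
The digital TV system 1910 prompts the user to select the type of noise in the usage environment by operating the buttons of the remote control 1920. For example, it displays choices on the screen such as "1. Washing machine, 2. Vacuum cleaner, 3. Air conditioner, …". The user selects the noise type by button operation. Here, suppose the user is operating the remote control while a vacuum cleaner is running. The user selects "2. Vacuum cleaner" as the noise type.
【０１１７】
［Step ST2-a］
The user speaks into the microphone 1821 of the voice remote control 1920. The utterance is converted into predetermined speech data and sent to the digital TV system 1910.
【０１１８】
［Step ST3-a］
The adapted model creation unit 5 inputs the speech data from the voice remote control 1920 into the selection models in the selection model storage unit 4 of the server 1900 and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１１９】
［Step ST4-a］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１２０】
［Step ST5-a］
The adapted model creation unit 5 sends to the voice remote control 1920 the group ID of the group selected in step ST3-a together with the group IDs of the groups that share that group's noise type. These group IDs are stored in the memory 1922 in association with the noise ID of the noise type selected in step ST1-a. Suppose group 1 (see Fig. 19B) was selected in step ST3-a. The noise type of group 1 is "vacuum cleaner A sound", and the groups with that noise type are group 1 and group 2 (see Fig. 19B). As shown in Fig. 20, the adapted model creation unit 5 therefore sends the group IDs of group 1 and group 2 to the voice remote control 1920, where they are stored in the memory 1922 in association with the noise ID indicating the noise type "2. Vacuum cleaner" selected in step ST1-a (see Fig. 20).
【０１２１】
［Step ST1-b］
The user again operates the remote control while a vacuum cleaner is running. The user selects "2. Vacuum cleaner" as the noise type by button operation. The voice remote control 1920 sends to the digital TV system 1910 the group IDs (those of group 1 and group 2) stored in the memory 1922 in association with the selected noise type "2. Vacuum cleaner" (see Fig. 20).
【０１２２】
［Step ST2-b］
The user speaks into the microphone 1821 of the voice remote control 1920. The utterance is converted into predetermined speech data and sent to the digital TV system 1910.
【０１２３】
［Step ST3-b］
Of the selection models in the selection model storage unit 4 of the server 1900, the adapted model creation unit 5 inputs the speech data from the voice remote control 1920 only into the selection models of the groups indicated by the group IDs from the voice remote control 1920 (group 1 and group 2) and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１２４】
［Step ST4-b］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１２５】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed (for example, when the vacuum cleaner is replaced with a different model, or when speech recognition is used under a noise environment other than vacuum cleaner noise).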
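The effect of steps ST5-a and ST1-b is to narrow the set of groups that have to be scored on later adaptations. A minimal sketch of that bookkeeping is shown below; the class `RemoteMemory`, the table layout, and the function names are hypothetical stand-ins for the memory 1922 and the group table of Fig. 19B.

```python
class RemoteMemory:
    """Stand-in for the remote control's memory: noise ID -> list of group IDs."""

    def __init__(self):
        self._table = {}

    def store(self, noise_id, group_ids):
        self._table[noise_id] = list(group_ids)

    def lookup(self, noise_id):
        return self._table.get(noise_id)

def groups_with_same_noise(group_table, group_id):
    """All group IDs whose noise type equals that of `group_id` (ST5-a)."""
    noise = group_table[group_id]["noise"]
    return [g for g, info in group_table.items() if info["noise"] == noise]

def candidate_groups(memory, noise_id, group_table):
    """Groups whose selection models must be scored.

    First use of a noise ID (ST3-a): every group. Later uses (ST3-b): only cached groups.
    """
    cached = memory.lookup(noise_id)
    return cached if cached is not None else list(group_table)
```

On the first pass every selection model on the server is evaluated; afterwards only the groups cached for the chosen noise ID are evaluated, which reduces the likelihood computation roughly in proportion to the number of groups excluded.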
【０１２６】
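＜Device used by multiple speakers under multiple noise types (example: PDA operation)＞
・Group selection method 1
From the sufficient statistics stored on a server connected via a communication network, the noise type is first selected automatically from GPS position information; sufficient statistics are then selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. Specifically, the following processing is performed.
【０１２７】
The noise type is selected automatically using GPS position information (ST1). (Example: in-train noise on a station platform, construction noise at a construction site.)
【０１２８】
The user's speech is input (ST2).
【０１２９】
Within the groups of the selected noise type, the N selection models with the highest likelihoods for the user's speech are selected, and the SNR group containing the largest number of them is selected (ST3).
【０１３０】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST4).
【０１３１】
・Group selection method 2
From the sufficient statistics stored on a server connected via a communication network, the noise type is first selected automatically from the schedule book in the PDA and the current time; sufficient statistics are then selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. Specifically, the following processing is performed.
【０１３２】
The noise type is selected automatically using the schedule book and the current time (ST1). (Example: if the schedule shows travel by train from 10:00 and the current time is 10:55, in-train noise is selected.)
【０１３３】
The user's speech is input (ST2).
【０１３４】
Within the groups of the selected noise type, the N selection models with the highest likelihoods for the user's speech are selected, and the SNR group containing the largest number of them is selected (ST3).
【０１３５】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST4).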
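Step ST1 of group selection method 2 can be pictured as a lookup of the current time in the schedule. The sketch below is an assumption-laden illustration: the schedule format (start minute, end minute, noise type), the end time of the train journey, and the fallback value are all hypothetical, since the patent gives only the 10:00 and 10:55 example.

```python
def noise_from_schedule(schedule, now_minutes, default="unknown"):
    """Return the noise type of the schedule entry covering the current time.

    schedule: list of (start_minutes, end_minutes, noise_type), with times given as
    minutes since midnight. Returns `default` when no entry covers `now_minutes`.
    """
    for start, end, noise_type in schedule:
        if start <= now_minutes < end:
            return noise_type
    return default
```

For the example in the text, an entry `(600, 690, "train interior")` and a current time of 10:55 (`655` minutes) select the in-train noise groups; steps ST2 to ST4 then proceed as in method 1.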
【０１３６】
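＜Device used under a specific noise type (example: car navigation)＞
・Group selection method (see Figs. 21 and 22)
The configuration of the information retrieval system in this example is shown in Fig. 21A. The system comprises a server 2100 and a car navigation system 2110. The server 2100 includes the group creation unit 6, the selection model creation unit 2, the sufficient statistics creation unit 1, the selection model storage unit 4, the sufficient statistics storage unit 3, the adapted model creation unit 5, and a memory 2101. As shown in Fig. 21B, the group creation unit 6 groups the noise-superimposed speech data 84 by noise type (sound of car model カローレ, sound of car model マークIII, etc.) × SNR (10 dB, 20 dB, etc.). The memory 2101 stores device IDs identifying car navigation systems (for example, product numbers) in association with IDs indicating groups (group IDs). The car navigation system 2110 includes a microphone 2111, a data communication module 2112, the speech recognition system 300 (see Fig. 3), and a processing unit 2113. The system configured as above performs the following processing.
【０１３７】
［Step ST1-a］
The user speaks into the microphone 2111 of the car navigation system 2110. The utterance is converted into predetermined speech data and sent to the server 2100 by the data communication module 2112. The data communication module 2112 also sends to the server 2100 data (a device ID) indicating the product number "100" of the car navigation system 2110.
【０１３８】
［Step ST2-a］
The adapted model creation unit 5 inputs the speech data from the car navigation system 2110 into the selection models in the selection model storage unit 4 and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１３９】
［Step ST3-a］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１４０】
［Step ST4-a］
The adapted model creation unit 5 stores in the memory 2101, in association with the product number "100" from the car navigation system 2110, the group ID of the group selected in step ST2-a together with the group IDs of the groups that share that group's noise type. Suppose group 1 (see Fig. 21B) was selected in step ST2-a. The noise type of group 1 is "sound of カローレ", and the groups with that noise type are group 1 and group 2 (see Fig. 21B). As shown in Fig. 22, the adapted model creation unit 5 therefore stores the group IDs of group 1 and group 2 in the memory 2101 in association with the product number "100".
【０１４１】
［Step ST1-b］
The user again speaks into the microphone 2111 of the car navigation system 2110. The utterance is converted into predetermined speech data and sent to the server 2100 by the data communication module 2112. The data communication module 2112 also sends to the server 2100 data (a device ID) indicating the product number "100" of the car navigation system 2110.
【０１４２】
［Step ST2-b］
Of the selection models in the selection model storage unit 4, the adapted model creation unit 5 inputs the speech data from the car navigation system 2110 only into the selection models of the groups (group 1 and group 2) indicated by the group IDs stored in the memory 2101 in association with the product number "100", and computes likelihoods (see Fig. 22). It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１４３】
［Step ST3-b］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１４４】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed (for example, when the car navigation system 2110 is installed in a different car model, such as マークIII).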
【０１４５】
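［Group creation method 2］
A group is formed for each combination of noise type × SNR × set of close speakers, and within each group sufficient statistics are stored, for those close speakers, for each variation in tone of voice (nasal voice, fast speech, stammering, etc.).
【０１４６】
＜Device used by multiple speakers under multiple noise types (example: TV operation)＞
・Group selection method (see Figs. 23 and 24)
The configuration of the system in this example is shown in Fig. 23A. The system comprises a server 2300, a digital TV system 2310, and a voice remote control 2320. The server 2300 includes the group creation unit 6, the selection model creation unit 2, the sufficient statistics creation unit 1, the selection model storage unit 4, the sufficient statistics storage unit 3, the adapted model creation unit 5, and a memory 2301. As shown in Fig. 23B, the group creation unit 6 groups the noise-superimposed speech data 84 by noise type (vacuum cleaner sound, air conditioner sound, etc.) × SNR (10 dB, 20 dB, etc.) × set of close speakers. The memory 2301 stores IDs identifying users (user IDs) in association with IDs indicating groups (group IDs). The digital TV system 2310 includes a data communication module 2312, the speech recognition system 300 (see Fig. 3), and the processing unit 1812. The voice remote control 2320 includes the microphone 1821. The system configured as above performs the following processing.
【０１４７】
［Step ST1-a］
The user speaks into the microphone 1821 of the voice remote control 2320. The utterance is converted into predetermined speech data and sent to the digital TV system 2310. The user also enters self-identifying information such as a name or PIN (a user ID) by operating the buttons of the remote control 2320. The entered user ID (here "100") is sent to the digital TV system 2310. The speech data and the user ID "100" from the voice remote control 2320 are sent to the server 2300 by the data communication module 2312.
【０１４８】
［Step ST2-a］
The adapted model creation unit 5 inputs the speech data from the digital TV system 2310 into the selection models in the selection model storage unit 4 and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１４９】
［Step ST3-a］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１５０】
［Step ST4-a］
The adapted model creation unit 5 stores in the memory 2301, in association with the user ID "100" from the digital TV system 2310, the group ID of the group selected in step ST2-a together with the group IDs of the groups that share that group's close speakers. Suppose group 2 (see Fig. 23B) was selected in step ST2-a. The close speakers of group 2 are "speakers C and D", and the groups with those close speakers are group 2, group (K-1), and group K (see Fig. 23B). As shown in Fig. 24, the adapted model creation unit 5 therefore stores the group IDs of group 2, group (K-1), and group K in the memory 2301 in association with the user ID "100".
【０１５１】
［Step ST1-b］
The user again speaks into the microphone 1821 of the voice remote control 2320. The utterance is converted into predetermined speech data and sent to the digital TV system 2310. The user also enters the user ID "100" by operating the buttons of the remote control 2320. The entered user ID "100" is sent to the digital TV system 2310. The speech data and the user ID "100" from the voice remote control 2320 are sent to the server 2300 by the data communication module 2312.
【０１５２】
［Step ST2-b］
Of the selection models in the selection model storage unit 4, the adapted model creation unit 5 inputs the speech data from the digital TV system 2310 only into the selection models of the groups (group 2, group (K-1), and group K) indicated by the group IDs stored in the memory 2301 in association with the user ID "100", and computes likelihoods (see Fig. 24). It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１５３】
［Step ST3-b］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１５４】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed (for example, when the user changes).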
【０１５５】
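＜Device used by a specific speaker (example: mobile phone operation)＞
・Group selection method (see Figs. 25 and 26)
The configuration of the system in this example is shown in Fig. 25A. The system comprises a server 2500 and a mobile phone 2510. The server 2500 includes the group creation unit 6, the selection model creation unit 2, the sufficient statistics creation unit 1, the selection model storage unit 4, the sufficient statistics storage unit 3, the adapted model creation unit 5, a memory 2501, and the speech recognition system 300. As shown in Fig. 25B, the group creation unit 6 groups the noise-superimposed speech data 84 by noise type (train sound, bus sound, etc.) × SNR (10 dB, 20 dB, etc.) × set of close speakers. The memory 2501 stores device IDs identifying mobile phones (for example, product numbers) in association with IDs indicating groups (group IDs). The recognition result of the speech recognition system 300 is sent to the mobile phone 2510 over a communication network. The mobile phone 2510 includes a microphone 2511, a data communication module 2512, and a processing unit 2513. The system configured as above performs the following processing.
【０１５６】
［Step ST1-a］
The user speaks into the microphone 2511 of the mobile phone 2510. The utterance is converted into predetermined speech data and sent to the server 2500 by the data communication module 2512. The data communication module 2512 also sends to the server 2500 data (a device ID) indicating the product number "200" of the mobile phone 2510.
【０１５７】
［Step ST2-a］
The adapted model creation unit 5 inputs the speech data from the mobile phone 2510 into the selection models in the selection model storage unit 4 and computes likelihoods. It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１５８】
［Step ST3-a］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１５９】
［Step ST4-a］
The adapted model creation unit 5 stores in the memory 2501, in association with the product number "200" from the mobile phone 2510, the group ID of the group selected in step ST2-a together with the group IDs of the groups that share that group's close speakers. Suppose group 2 (see Fig. 25B) was selected in step ST2-a. The close speakers of group 2 are "speakers C and D", and the groups with those close speakers are group 2, group (K-1), and group K (see Fig. 25B). As shown in Fig. 26, the adapted model creation unit 5 therefore stores the group IDs of group 2, group (K-1), and group K in the memory 2501 in association with the product number "200".
【０１６０】
［Step ST1-b］
The user again speaks into the microphone 2511 of the mobile phone 2510. The utterance is converted into predetermined speech data and sent to the server 2500 by the data communication module 2512. The data communication module 2512 also sends to the server 2500 data (a device ID) indicating the product number "200" of the mobile phone 2510.
【０１６１】
［Step ST2-b］
Of the selection models in the selection model storage unit 4, the adapted model creation unit 5 inputs the speech data from the mobile phone 2510 only into the selection models of the groups (group 2, group (K-1), and group K) indicated by the group IDs stored in the memory 2501 in association with the product number "200", and computes likelihoods (see Fig. 26). It selects the N selection models with the highest likelihoods, and then selects the group to which the largest number of these N selection models belong.
【０１６２】
［Step ST3-b］
The adapted model creation unit 5 selects the M sufficient statistics with the highest likelihoods within the selected group and creates an adapted model from them.
【０１６３】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed (for example, when the user changes).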
【０１６４】
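［Group creation method 3］
A group is formed for each set of close speakers, and within each group sufficient statistics are stored for each combination of noise type × SNR.
【０１６５】
＜Device used by multiple speakers under multiple noise types (example: TV operation)＞
・Group selection method (see Figs. 27 and 28)
From the sufficient statistics stored in a set-top box in the home, or on a server outside the home connected via a communication network, sufficient statistics are selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. At this time, the selected group is associated with the user's speaker ID (a name, PIN, or the like). At the next adaptation, the speaker ID is entered to select the group, and adaptation is performed. Specifically, the following processing is performed.
【０１６６】
The user's speech is input (ST1-a).
【０１６７】
The N selection models with the highest likelihoods for the user's speech are selected, and the speaker group containing the largest number of them is selected (ST2-a).
【０１６８】
Within the selected group, the M sufficient statistics with the highest likelihoods (across the various noise types and SNRs) are selected and used for adaptation (ST3-a).
【０１６９】
The selected group is associated with the speaker ID (the correspondence is stored) (ST4-a).
【０１７０】
The speaker ID is entered and the group is selected (ST1-b).
【０１７１】
The user's speech is input (ST2-b).
【０１７２】
Within the selected group (the speaker group close to the user), the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST3-b).
【０１７３】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed.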
【０１７４】
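＜Device used by a specific speaker (example: mobile phone operation)＞
・Group selection method
From the sufficient statistics stored on a server outside the home connected via a communication network, sufficient statistics are selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. At this time, the selected group is associated with the ID of the device used. At the next adaptation, the group is selected automatically from the device ID, and adaptation is performed. Specifically, the following processing is performed.
【０１７５】
The user's speech is input (ST1-a).
【０１７６】
The N selection models with the highest likelihoods for the user's speech are selected, and the speaker group containing the largest number of them is selected (ST2-a).
【０１７７】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST3-a).
【０１７８】
The selected group is associated with the device ID (the correspondence is stored) (ST4-a).
【０１７９】
The user's speech is input (ST1-b).
【０１８０】
The group is selected automatically from the device ID (ST2-b).
【０１８１】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST3-b).
【０１８２】
Processing returns to (ST1-b) for each adaptation, and to (ST1-a) as needed (for example, when the user changes).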
【０１８３】
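［Group creation method 4］
For a specific noise type, a group is formed for each SNR, and within each group sufficient statistics are stored for each speaker.
【０１８４】
＜Device used under a specific noise type (example: elevator operation)＞
・Group selection method
From the sufficient statistics stored on a server installed in the elevator, sufficient statistics are selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. Specifically, the following processing is performed.
【０１８５】
The user's speech is input (ST1).
【０１８６】
The N selection models with the highest likelihoods for the user's speech are selected, and the SNR group containing the largest number of them is selected (ST2).
【０１８７】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST3).
【０１８８】
［Group creation method 5］
For a specific speaker, a group is formed for each SNR, and within each group sufficient statistics are stored for each variation in that speaker's tone of voice (nasal voice, fast speech, stammering, etc.).
【０１８９】
＜Device used by a specific speaker under a specific noise type (example: car navigation)＞
・Group selection method
From the sufficient statistics stored on a server (the car navigation unit) installed in the car, sufficient statistics are selected with the selection models (GMMs) using the user's noise-added speech, and adaptation is performed. Specifically, the following processing is performed.
【０１９０】
The user's speech is input (ST1).
【０１９１】
The N selection models with the highest likelihoods for the user's speech are selected, and the SNR group containing the largest number of them is selected (ST2).
【０１９２】
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST3).
【０１９３】
A group selection model may also be created for each group and used to select the group. (Example: when groups are created per noise type, the noise selection models serve as the group selection models; if they are built as GMMs, the noise is input to the noise selection models and the group with the highest likelihood is selected.)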
【０１９４】
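(Second Embodiment)
＜Configuration of the adapted model creation apparatus＞
Fig. 29 is a block diagram showing the overall configuration of the adapted model creation apparatus for speech processing according to the second embodiment. The apparatus shown in Fig. 29 comprises a selection model creation unit 21, a selection model storage unit 41, a sufficient statistics creation unit 11, and an adapted model creation unit 51. The selection model creation unit 21 creates selection models 75 for selecting speech data close to the user's speech data. The selection model storage unit 41 stores the selection models 75 created by the selection model creation unit 21. The sufficient statistics creation unit 11 uses the selection models 75 stored in the selection model storage unit 41 to select, from the speech data 83, speech data close to the user's speech data, and creates sufficient statistics 72 from the selected speech data with noise superimposed on it. The adapted model creation unit 51 creates an adapted model 74 from the sufficient statistics 72 created by the sufficient statistics creation unit 11.
【０１９５】
＜Adapted model creation processing＞
Next, the processing by which the apparatus configured as above creates an adapted model for speech recognition is described.
【０１９６】
［Creation of the selection models 75］
First, the creation of the selection models 75 is described for the case where it is performed offline, before the user requests an adapted model.
【０１９７】
Speech data 83 of multiple speakers is recorded in a quiet environment. Here, speech data of about 300 people is recorded.
【０１９８】
Using the speech data 83, the selection model creation unit 21 creates a selection model 75 for each speaker as a Gaussian Mixture Model with 1 state and 64 mixture components, without distinguishing phonemes.
【０１９９】
As an example, as shown in Fig. 30, the selection models 75 are created from the high-power frames of the speech data 83. This yields speech data selection models that are robust to noise.
【０２００】
The selection model storage unit 41 stores the selection models 75 created by the selection model creation unit 21. An example of the selection models 75 stored in the selection model storage unit 41 is shown in Fig. 30.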
【０２０１】
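［Creation of the sufficient statistics 72］
Next, the creation of the sufficient statistics 72 is described.
【０２０２】
The user requests creation of the adapted model 74.
【０２０３】
Using a microphone for speech recognition or the like, the user inputs noise data 85 from the environment where speech recognition is used to the sufficient statistics creation unit 11.
【０２０４】
Using a microphone for speech recognition or the like, the user also inputs speech data 81, uttered in the environment where speech recognition is used, to the sufficient statistics creation unit 11. Noise from that environment is superimposed on the speech data 81.
【０２０５】
Next, the sufficient statistics creation unit 11 inputs the speech data 81 to the selection models 75 stored in the selection model storage unit 41 and computes likelihoods. Here, the high-power frames of the speech data 81 are input to the selection models 75 shown in Fig. 30. The top L speakers by likelihood (for example, the top 20) are then selected as the speakers close to the user's speech data.
【０２０６】
The sufficient statistics creation unit 11 superimposes the noise data 85 on the speech data of the speakers close to the user, taken from the quiet-environment speech data 83, to create noise-superimposed speech data 86. The SNR is computed from the speech data 81 and the noise data 85, and the noise-superimposed speech data 86 is created at the computed SNR. An example of how the noise-superimposed speech data 86 is created is shown in Fig. 31.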
【０２０７】
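The sufficient statistics creation unit 11 creates the sufficient statistics 72 from the noise-superimposed speech data 86. An example of the sufficient statistics 72 created by the sufficient statistics creation unit 11 is shown in Fig. 32.
【０２０８】
［Creation of the adapted model 74］
Next, the processing by which the adapted model creation unit 51 creates the adapted model 74 is described.
【０２０９】
The adapted model creation unit 51 creates the adapted model 74 from the sufficient statistics 72 created by the sufficient statistics creation unit 11, specifically by the following statistical computation (Equations 7 to 9). The mean and variance of the normal distributions in each HMM state of the adapted model 74 are denoted μ_i^adp (i = 1, 2, …, N_mix) and v_i^adp (i = 1, 2, …, N_mix), respectively, where N_mix is the number of mixture components. The state transition probabilities are denoted a^adp[i][j] (i, j = 1, 2, …, N_state), where N_state is the number of states and a^adp[i][j] is the probability of a transition from state i to state j.
【０２１０】
【Equation 7】

【０２１１】
【Equation 8】

【０２１２】
【Equation 9】

【０２１３】
Here, N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) and v_i^j (i = 1, 2, …, N_mix; j = 1, 2, …, N_sel) are the means and variances of the respective HMMs. C_mix^j (j = 1, 2, …, N_sel) and C_state^k[i][j] (k = 1, 2, …, N_sel; i, j = 1, 2, …, N_state) are the EM counts (frequencies) for the normal distributions and the EM counts for the state transitions, respectively.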
【０２１４】
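The adapted model creation unit 51 then stands by for the user's next request to create an adapted model.
【０２１５】
＜Effects＞
As described above, in the second embodiment the sufficient statistics 72 are created from the speech data 86 on which the noise data 85 of the usage environment is superimposed, and the adapted model 74 is created from them, so an adapted model 74 suited to the usage environment is obtained. The adapted model can therefore be used in a variety of noise environments.
【０２１６】
In addition, because the sufficient statistics 72 are created only from noise-superimposed speech data 86 of speakers acoustically close to the user, the sufficient statistics 72 and the adapted model 74 can be created instantly. The adapted model is therefore available immediately when the usage environment changes in various ways.
【０２１７】
The noise data 85 may be input to the sufficient statistics creation unit 11 offline, before the user requests an adapted model, and the sufficient statistics 72 may be created offline.
【０２１８】
The timing for inputting the noise data 85 to the sufficient statistics creation unit 11 may be determined automatically by the sufficient statistics creation unit 11.
【０２１９】
The timing for creating the adapted model 74 may be determined automatically by the adapted model creation unit 51.
【０２２０】
The selection model 75 is not limited to a Gaussian Mixture Model.
【０２２１】
Labels corresponding to the HMM states may be stored in a database, and the stored label information may be used to create the sufficient statistics 72 of the noise-superimposed speech data 86.
【０２２２】
＜Concrete product example＞
Fig. 33 shows an example of applying the adapted model creation apparatus of the second embodiment to an actual product. The system consists of a portable terminal (PDA) that takes speech input and a server that creates the adapted model and performs recognition. The user calls a service center (the server) and gives spoken instructions following voice guidance from the center. The service center (server) receives the user's speech and the noise and creates an adapted model by the method described above. It recognizes the user's speech with the created adapted model and sends guidance (the recognition result) to the PDA.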
【０２２３】
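(Third Embodiment)
＜Configuration of the adapted model creation apparatus for speech recognition＞
Fig. 34 is a block diagram showing the overall configuration of the adapted model creation apparatus according to the third embodiment. The apparatus shown in Fig. 34 comprises a selection model creation unit 1507, a selection model storage unit 1508, a sufficient statistics creation unit 1506, an adapted model creation unit 51, a label information creation unit 1501, a label information storage unit 1502, and a memory 1512. The selection model creation unit 1507 creates selection models 1510 for selecting speech data close to the user's speech data. The selection model storage unit 1508 stores the selection models 1510 created by the selection model creation unit 1507. The label information creation unit 1501 creates label information 1504 using speech data 1505, which is obtained by superimposing predicted noise data 1503 (noise predicted for the usage environment) on the quiet-environment speech data 83 at a predicted SNR. The label information storage unit 1502 stores the label information 1504 created by the label information creation unit 1501. The sufficient statistics creation unit 1506 selects, from the speech data 83, speech data acoustically close to the user's speech data, using the selection models 1510 stored in the selection model storage unit 1508 and the user's quiet-environment speech data 1513 stored in the memory 1512. It then creates sufficient statistics 1509 using the selected speech data with noise data 85 superimposed on it, together with the label information 1504 stored in the label information storage unit 1502. The adapted model creation unit 51 creates an adapted model 1511 from the sufficient statistics 1509 created by the sufficient statistics creation unit 1506.
【０２２４】
＜Operation of the adapted model creation apparatus＞
Next, the operation of the adapted model creation apparatus configured as above is described.
【０２２５】
［Creation of the selection models 1510］
First, the creation of the selection models 1510 is described for the case where it is performed offline, before the user requests an adapted model.
【０２２６】
Speech data 83 of multiple speakers is recorded in a quiet environment. Here, speech data of about 300 people is recorded.
【０２２７】
As shown in Fig. 35, the selection model creation unit 1507 uses the speech data 83 to create a selection model 1510 for each speaker as a Gaussian Mixture Model with 1 state and 64 mixture components, without distinguishing phonemes.
【０２２８】
The selection model storage unit 1508 stores the selection models 1510 created by the selection model creation unit 1507.
【０２２９】
［Creation of the label information 1504 and the information 1514 on state transitions of the phoneme models］
The creation of the label information 1504 and the information 1514 on state transitions of the phoneme models is described for the case where it is performed offline, before the user requests an adapted model. As an example, the use of speech recognition in a car is described with reference to Figs. 36, 37, and 38. Speech recognition in a car navigation system is considered here.
【０２３０】
As shown in Fig. 36, noise data 1601 predicted for the usage environment (in-car noise data of a common car model A) is superimposed on the quiet-environment speech data 83 to create speech data 1602 with in-car noise at 10 dB. The in-car noise data 1601 of car model A is recorded beforehand while driving car model A in the city. Next, sufficient statistics 1603 for in-car noise at 10 dB are computed from the created speech data 1602 with the EM algorithm. Here, speaker-independent sufficient statistics are created using an HMM for each phoneme, and the information 1514 on state transitions of the phoneme models is the state transition probabilities of the per-phoneme HMMs. Next, as shown in Fig. 37, the noise-superimposed speech data 1602 with in-car noise at 10 dB is input, one speech data item (one utterance by one speaker) at a time, to the sufficient statistics 1603, and label information 1504 is created for each item with the Viterbi algorithm. Fig. 38 shows an example of the label information 1504. Here, the label information 1504 consists of the phoneme name and HMM state number corresponding to each frame number.
【０２３１】
The label information storage unit 1502 stores the label information 1504 and the information 1514 on state transitions of the phoneme models.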
【０２３２】
［十分統計量１５０９の作成］
次に、十分統計量１５０９の作成方法について述べる。
【０２３３】
利用者は、静かな環境における利用者の音声データ１５１３をあらかじめメモリ１５１２に記憶しておく。
【０２３４】
利用者は、適応モデル１５１１の作成を要求する。
【０２３５】
十分統計量作成部１５０６は、メモリ１５１２に記憶された静かな環境における利用者の音声データ１５１３を受信する。また、十分統計量作成部１５０６は、音声認識を利用する環境での雑音データ８５を受信する。
【０２３６】
十分統計量作成部１５０６は、静かな環境における利用者の音声データ１５１３を、選択モデル蓄積部１５０８に蓄積されている選択モデル１５１０に入力して尤度を計算する。そして、尤度の大きい上位Ｌ人（たとえば上位４０人）の話者を選択して利用者の音声データに近い話者とする。
【０２３７】
十分統計量作成部１５０６は、静かな環境における音声データ８３の中から利用者に近い話者の音声データに雑音データ８５を重畳し、雑音重畳音声データ８６を作成する。音声データ８６の作成方法の一例を図３１に示す。
【０２３８】
十分統計量作成部１５０６は、雑音重畳音声データ８６とラベル情報蓄積部１５０２に蓄積されたラベル情報１５０４と音韻モデルの状態遷移に関する情報１５１４とを用いて十分統計量１５０９を作成する。図３９に示すように、雑音重畳音声データ８６に対応する音韻名とＨＭＭの状態番号を、ラベル情報１５０４に記載された雑音重畳音声データ１５０５の音韻名とＨＭＭの状態番号と同一であるとみなす。同様に、音韻ごとのＨＭＭの状態遷移確率も同一だとみなす。すなわち、ＨＭＭの状態番号、状態遷移確率などに関する計算処理を行わない。そして、ＨＭＭの同一状態の中で、平均値、分散、混合重みなどの十分統計量の計算を行う。
【０２３９】
［適応モデル１５１１の作成］
次に、適応モデル作成部５１における適応モデル１５１１作成の方法について述べる。
【０２４０】
適応モデル作成部５１は、十分統計量作成部１５０６が作成した十分統計量１５０９を用いて適応モデル１５１１を作成する。具体的には以下の統計処理計算（数１０〜数１２）により適応モデル１５１１を作成する。適応モデル１５１１のＨＭＭの各状態における正規分布の平均、分散をそれぞれμ_i ^adp（ｉ＝１，２，…，Ｎ_mix）、ｖ_i ^adp（ｉ＝１，２，…，Ｎ_mix）とする。Ｎ_mixは混合分布数である。また、状態遷移確率をａ^adp［ｉ］［ｊ］（ｉ，ｊ＝１，２，…，Ｎ_state）とする。Ｎ_stateは状態数であり、ａ^adp［ｉ］［ｊ］は状態ｉから状態ｊへの遷移確率を表す。
【０２４１】
【数１０】

【０２４２】
【数１１】

【０２４３】
【数１２】

【０２４４】
ここで、Ｎ_selは、選択された音響モデルの数であり、μ_i ^j（ｉ＝１，２，…，Ｎ_mix，ｊ＝１，２，…，Ｎ_sel）、ｖ_i ^j（ｉ＝１，２，…，Ｎ_mix，ｊ＝１，２，…，Ｎ_sel）はそれぞれのＨＭＭの平均、分散である。Ｃ_mix ^j（ｊ＝１，２，…，Ｎ_sel）、Ｃ_state ^k［ｉ］［ｊ］（ｋ＝１，２，…，Ｎ_sel、ｉ，ｊ＝１，２，…，Ｎ_state）はそれぞれ、正規分布におけるＥＭカウント（度数）、状態遷移に関するＥＭカウントである。
【０２４５】
適応モデル作成部５１は、利用者の次の適応モデル作成の要求に備える。
【０２４６】
＜効果＞
以上説明したように第３の実施形態では、ラベル情報１５０４を用いて十分統計量１５０９を計算するため、短時間に十分統計量１５０９が作成でき短時間に適応モデル１５１１が作成できる。したがって、利用環境がさまざまに変化した場合にすぐに適応モデルが利用できる。
【０２４７】
また、利用環境に近い雑音重畳音声データ１５０５を用いてラベル情報１５０４を作成するため、短時間に精度の高い十分統計量１５０９が作成できる。したがって、利用環境がさまざまに変化した場合にすぐにより精度の高い適応モデルが利用できる。
【０２４８】
また、ラベル情報１５０４と音韻モデルの状態遷移に関する情報１５１４とを用いて十分統計量１５０９を計算するため、さらに短時間に十分統計量１５０９が作成でき短時間に適応モデル１５１１が作成できる。したがって、利用環境がさまざまに変化した場合にすぐに適応モデルが利用できる。
【０２４９】
なお、雑音データ８５を、利用者が適応モデルの獲得を要求する以前にオフラインで十分統計量作成部１５０６に入力し、十分統計量１５０９をオフラインで作成してもよい。
【０２５０】
雑音データ８５を十分統計量作成部１５０６に入力するタイミングは、十分統計量作成部１５０６が自動的に決定してもよい。
【０２５１】
適応モデル１５１１を作成するタイミングは、適応モデル作成部５１が自動的に決定してもよい。
【０２５２】
選択モデル１５１０はGaussian Mixture Modelに限らない。
【０２５３】
メモリ１５１２に記憶する音声データ１５１３は、利用環境もしくは利用環境と予測した環境における雑音が重畳していてもよい。
【０２５４】
予測雑音データ１５０３として雑音データ８５を用いてもよい。
【０２５５】
（第４の実施形態）
＜音声認識用適応モデル作成装置の構成＞
図４０は、第４の実施形態による適応モデル作成装置の全体構成を示すブロック図である。図４０に示す適応モデル作成装置は、選択モデル作成部１５０７と、選択モデル蓄積部１５０８と、十分統計量作成部２１０７と、適応モデル作成部５１と、ラベル情報作成部２１０４と、ラベル情報蓄積部２１０６と、ラベル情報選択モデル作成部２１０１と、ラベル情報選択モデル蓄積部２１０２と、メモリ１５１２とを備える。選択モデル作成部１５０７は、利用者の音声データに近い音声データを選択するための選択モデル１５１０を作成する。選択モデル蓄積部１５０８は、選択モデル作成部１５０７が作成した選択モデル１５１０を蓄積する。ラベル情報作成部２１０４は、利用環境の雑音であると予測される予測雑音データ１５０３を静かな環境における音声データ８３に予測したＳＮ比で重畳した雑音重畳音声データを用いて、２種類以上のラベル情報２１０５を作成する。ラベル情報蓄積部２１０６は、ラベル情報作成部２１０４が作成した２種類以上のラベル情報２１０５を蓄積する。ラベル情報選択モデル作成部２１０１は、利用環境の雑音であると予測される予測雑音データ１５０３を用いてラベル情報選択モデル２１０３を作成する。ラベル情報選択モデル蓄積部２１０２は、ラベル情報選択モデル作成部２１０１が作成したラベル情報選択モデル２１０３を蓄積する。十分統計量作成部２１０７は、選択モデル蓄積部１５０８が蓄積した選択モデル１５１０とメモリ１５１２に記憶した静かな環境における利用者の音声データ１５１３とを用いて音声データ８３の中から利用者の音声データに近い音声データを選択する。また、十分統計量作成部２１０７は、ラベル情報選択モデル蓄積部２１０２が蓄積したラベル情報選択モデル２１０３と利用環境の雑音データ８５とを用いて、ラベル情報蓄積部２１０６に蓄積されているラベル情報２１０５の中から利用環境に適したラベル情報を選択する。そして十分統計量作成部２１０７は、選択した音声データに雑音データ８５を重畳した音声データと、選択した利用環境に適したラベル情報２１０５とを用いて十分統計量２１０８を作成する。適応モデル作成部５１は、十分統計量作成部２１０７が作成した十分統計量２１０８を用いて適応モデル２１０９を作成する。
【０２５６】
＜音声認識用適応モデル作成装置の動作＞
次に、以上のように構成された適応モデル作成装置の動作について説明する。
【０２５７】
［選択モデル１５１０の作成］
はじめに、選択モデル１５１０の作成方法について述べる。ここでは、選択モデル１５１０の作成を、利用者が適応モデルの獲得を要求する以前にオフラインで行う場合について述べる。
【０２５８】
静かな環境において複数話者の音声データ８３を収録する。ここでは約３００人の音声データを収録する。
【０２５９】
選択モデル作成部１５０７は、図３５に示したように、音声データ８３を用いて、話者ごとに、音韻を区別することなく１状態６４混合のGaussian Mixture Modelにより、選択モデル１５１０を作成する。
【０２６０】
選択モデル蓄積部１５０８は、選択モデル作成部１５０７が作成した選択モデル１５１０を蓄積する。
【０２６１】
［ラベル情報２１０５の作成］
ラベル情報２１０５の作成方法について述べる。ここでは、ラベル情報２１０５の作成を、利用者が適応モデルの獲得を要求する以前にオフラインで行う場合について述べる。一例として、音声認識を展示会場で利用する場合について、図４１、図４２を用いて説明する。
【０２６２】
利用者の行動履歴から、音声認識を車内、展示会場、家庭内でよく利用することがわかっている。そのため、車内，展示会場，家庭内における一般的な雑音をそれぞれ収録しておく。図４１に示すように、静かな環境における音声データ８３に、利用環境であると予測した３種類の雑音データ（車内雑音データ１５０３Ａ、展示会場雑音データ１５０３Ｂ、家庭内雑音データ１５０３Ｃ）を重畳して、車内雑音１０ｄＢでの雑音重畳音声データ１５０５Ａ、展示会場雑音２０ｄＢでの雑音重畳音声データ１５０５Ｂ、家庭内雑音２０ｄＢでの雑音重畳音声データ１５０５Ｃを作成する。次に、作成した雑音重畳音声データを用いて雑音の種類ごとに十分統計量１６０３Ａ，１６０３Ｂ，１６０３ＣをＥＭアルゴリズムにより計算する。ここでは、音韻ごとにＨＭＭを用いて不特定話者の十分統計量を作成する。次に、図４２に示すように、３種類の雑音重畳音声データ１５０５Ａ，１５０５Ｂ，１５０５Ｃを音声データ（ある種類の雑音データのある話者のある発声データ）ごとに十分統計量１６０３Ａ，１６０３Ｂ，１６０３Ｃに入力し、ビタービアルゴリズムを用いてラベル情報２１０５Ａ，２１０５Ｂ，２１０５Ｃを音声データ（ある話者のある発声データ）ごとに作成する。
【０２６３】
［ラベル情報選択モデル２１０３の作成］
次に、ラベル情報選択モデル２１０３の作成方法を図４３を用いて説明する。ここでは一例として雑音の種類に対応したＧＭＭを作成する。ラベル情報２１０５の作成で用いた予測雑音データ１５０３Ａ，１５０３Ｂ，１５０３Ｃを用いてラベル情報選択モデル２１０３Ａ，２１０３Ｂ，２１０３Ｃを作成する。
【０２６４】
［十分統計量２１０８の作成］
次に、十分統計量２１０８の作成方法について述べる。
【０２６５】
利用者は、静かな環境における利用者の音声データ１５１３をあらかじめメモリ１５１２に記憶しておく。
【０２６６】
利用者は、適応モデル２１０９の作成を要求する。
【０２６７】
十分統計量作成部２１０７は、メモリ１５１２が記憶した静かな環境における利用者の音声データ１５１３を受信する。また、十分統計量作成部２１０７は、音声認識を利用する環境での雑音データ８５を受信する。
【０２６８】
十分統計量作成部２１０７は、静かな環境における利用者の音声データ１５１３を、選択モデル蓄積部１５０８に蓄積された選択モデル１５１０に入力して尤度を計算する。そして、尤度の大きい上位Ｌ人（たとえば上位４０人）の話者を選択して利用者の音声データに近い話者とする。
【０２６９】
十分統計量作成部２１０７は、静かな環境における音声データ８３の中から利用者に近い話者の音声データに雑音データ８５を重畳し、雑音重畳音声データ８６を作成する。雑音重畳音声データ８６の作成方法の一例を図３１に示す。
【０２７０】
十分統計量作成部２１０７は、ラベル情報選択モデル蓄積部２１０２に蓄積されたラベル情報選択モデル２１０３に雑音データ８５を入力して、最も大きい尤度をもつラベル情報選択モデル２１０３に対応するラベル情報２１０５をラベル情報蓄積部２１０６から取り出す。ここでは、利用環境が展示会場であるので展示会場雑音２０ｄＢのラベル情報２１０５Ｂが取り出される。
【０２７１】
十分統計量作成部２１０７は、雑音重畳音声データ８６と、ラベル情報蓄積部２１０６から取り出した展示会場雑音２０ｄＢのラベル情報２１０５Ｂとを用いて十分統計量２１０８を作成する。
【０２７２】
［適応モデル２１０９の作成］
次に、適応モデル作成部５１において適応モデル２１０９を作成する方法について述べる。
【０２７３】
適応モデル作成部５１は、十分統計量作成部２１０７が作成した十分統計量２１０８を用いて適応モデル２１０９を作成する。具体的には以下の統計処理計算（数１３〜数１５）により適応モデル２１０９を作成する。適応モデル２１０９のＨＭＭの各状態における正規分布の平均、分散をそれぞれμ_i ^adp（ｉ＝１，２，…，Ｎ_mix）、ｖ_i ^adp（ｉ＝１，２，…，Ｎ_mix）とする。Ｎ_mixは混合分布数である。また、状態遷移確率をａ^adp［ｉ］［ｊ］（ｉ，ｊ＝１，２，…，Ｎ_state）とする。Ｎ_stateは状態数であり、ａ^adp［ｉ］［ｊ］は状態ｉから状態ｊへの遷移確率を表す。
【０２７４】
【数１３】

【０２７５】
【数１４】

【０２７６】
【数１５】

【０２７７】
ここで、Ｎ_selは、選択された音響モデルの数であり、μ_i ^j（ｉ＝１，２，…，Ｎ_mix，ｊ＝１，２，…，Ｎ_sel）、ｖ_i ^j（ｉ＝１，２，…，Ｎ_mix，ｊ＝１，２，…，Ｎ_sel）はそれぞれのＨＭＭの平均、分散である。Ｃ_mix ^j（ｊ＝１，２，…，Ｎ_sel）、Ｃ_state ^k［ｉ］［ｊ］（ｋ＝１，２，…，Ｎ_sel、ｉ，ｊ＝１，２，…，Ｎ_state）はそれぞれ、正規分布におけるＥＭカウント（度数）、状態遷移に関するＥＭカウントである。
【０２７８】
適応モデル作成部５１は、利用者の次の適応モデル作成の要求に備える。
【０２７９】
＜効果＞
以上説明したように第４の実施形態では、ラベル情報選択モデル２１０３に基づいて選択した、利用環境に適したラベル情報２１０５を用いて十分統計量２１０８を計算するため、さらに精度の高い十分統計量が作成できる。したがって、利用環境がさまざまに変化した場合にすぐにより精度の高い適応モデルが利用できる。
【０２８０】
なお、雑音データ８５を、利用者が適応モデルの獲得を要求する以前にオフラインで十分統計量作成部２１０７に入力し、十分統計量２１０８をオフラインで作成してもよい。
【０２８１】
雑音データ８５を十分統計量作成部２１０７に入力するタイミングは、十分統計量作成部２１０７が自動的に決定してもよい。
【０２８２】
適応モデル２１０９を作成するタイミングは、適応モデル作成部５１が自動的に決定してもよい。
【０２８３】
選択モデル１５１０はGaussian Mixture Modelに限らない。
【０２８４】
メモリ１５１２に記憶する音声データ１５１３は、利用環境もしくは利用環境と予測した環境における雑音が重畳していてもよい。
【０２８５】
ラベル情報２１０５の種類の数とラベル情報選択モデル２１０３の数は同数であるとは限らない。
【０２８６】
予測雑音データ１５０３として雑音データ８５を用いてもよい。
【０２８７】
第２の実施形態による適応モデル作成装置はハードウェアによってもソフトウェア（コンピュータプログラム）によっても実現できる。
【図面の簡単な説明】
【図１】各種の話者適応技術を示す図である。
【図２】「十分統計量を用いた方法」によって適応モデルを作成する手順を示すフローチャートである。
【図３】「十分統計量を用いた方法」によって適応モデルを作成する手順を説明するためのブロック図である。
【図４】十分統計量の作成処理を説明するための図である。
【図５】適応モデルの作成処理を説明するための図である。
【図６】従来技術の「十分統計量を用いた方法」における課題を説明するための図である。
【図７】第１の実施形態による適応モデル作成装置の構成を示すブロック図である。
【図８】図７に示したグループ作成部におけるグループ作成処理の流れを示す図である。
【図９】図７に示した十分統計量蓄積部に蓄積される十分統計量を作成する処理の流れを示す図である。
【図１０】図７に示した選択モデル蓄積部に蓄積される選択モデルを作成する処理の流れを示す図である。
【図１１】図７に示した十分統計量蓄積部に蓄積される十分統計量の一例を示す図である。
【図１２】図７に示した選択モデル蓄積部に蓄積される選択モデルの一例を示す図である。
【図１３】図７に示した適応モデル作成部において利用者の音声に音響的に近いグループを決定する処理の流れを示す図である。
【図１４】図７に示した適応モデル作成部において利用者の音声データに近い十分統計量を決定する処理の流れを示す図である。
【図１５】認識実験の結果を示す図である。
【図１６】図７に示した十分統計量蓄積部に蓄積される十分統計量の一例を示す図である。
【図１７】グループ作成部によって作成されるグループの例を示す図である。
【図１８Ａ】具体的な商品イメージおよびグルーピング例を示す図である。
【図１８Ｂ】具体的な商品イメージおよびグルーピング例を示す図である。
【図１９Ａ】具体的な商品イメージおよびグルーピング例を示す図である。
【図１９Ｂ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２０】具体的な商品イメージおよびグルーピング例を示す図である。
【図２１Ａ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２１Ｂ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２２】具体的な商品イメージおよびグルーピング例を示す図である。
【図２３Ａ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２３Ｂ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２４】具体的な商品イメージおよびグルーピング例を示す図である。
【図２５Ａ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２５Ｂ】具体的な商品イメージおよびグルーピング例を示す図である。
【図２６】具体的な商品イメージおよびグルーピング例を示す図である。
【図２７】具体的な商品イメージおよびグルーピング例を示す図である。
【図２８】具体的な商品イメージおよびグルーピング例を示す図である。
【図２９】第２の実施形態による適応モデル作成装置の構成を示すブロック図である。
【図３０】図２９に示した選択モデル蓄積部に蓄積される選択モデルを作成する処理の流れを示す図である。
【図３１】雑音重畳音声データを作成する処理の流れを示す図である。
【図３２】図９に示した十分統計量作成部が作成する十分統計量の一例を示す図である。
【図３３】第２の実施形態による適応モデル作成装置を実際の製品に適用したイメージを示す図である。
【図３４】第３の実施形態による適応モデル作成装置の構成を示すブロック図である。
【図３５】選択モデル蓄積部に蓄積される選択モデルを作成する処理の流れを示す図である。
【図３６】ラベル情報を作成する処理の流れを示す図である。
【図３７】ラベル情報を作成する処理の流れを示す図である。
【図３８】ラベル情報蓄積部に蓄積されるラベル情報の一例を示す図である。
【図３９】十分統計量を作成する処理の流れを示す図である。
【図４０】第４の実施形態による適応モデル作成装置の構成を示すブロック図である。
【図４１】ラベル情報を作成する処理の流れを示す図である。
【図４２】ラベル情報を作成する処理の流れを示す図である。
【図４３】ラベル情報選択モデルを作成する処理の流れを示す図である。
[0001]
[Technical field to which the invention belongs]
The present invention relates to a method, an apparatus, and a computer program for creating an acoustic model used for speech recognition. More particularly, the present invention relates to a method, an apparatus, and a computer program for creating an acoustic model adapted to a voice of a person who uses voice recognition and an environment where the voice recognition is used.
[0002]
[Prior art]
In recent years, in digital information devices such as mobile phones, mobile terminals, car navigation systems, personal computers, and home appliances, it is expected to improve user convenience by using voice recognition technology.
[0003]
If the acoustic model used in a speech recognition system does not suit the user, that user cannot use the system. A speech recognition system therefore needs an acoustic model adapted to the user's voice. Various techniques for adapting an acoustic model to the voice of the person using the system (speaker adaptation techniques) exist, as shown in FIG. 1. In FIG. 1, the speaker adaptation techniques are mapped according to the computing power and hard disk capacity that a system needs in order to implement them. For each technique, FIG. 1 also shows the number of sentences the user must utter for adaptation, the variation factors the technique can handle (speaker characteristics, voice tone), and the recognition performance (indicated by the size of a star; the larger the star, the better the performance).
[0004]
Conventionally, the computing power of information devices and the capacity of the hard disks they could carry were small, so only speaker adaptation techniques with low recognition performance, such as "vocal tract length normalization" and "MLLR + eigenvoice space", could be used. As the computing power of information devices increased, the speaker adaptation techniques "MLLR" and "CAT", which exploit this computing power to obtain high recognition performance, came into use. However, these techniques require the user to utter a relatively large number of sentences to adapt the acoustic model. The burden on the user is therefore heavy, and the techniques are not suitable for information devices whose users change frequently (for example, a TV remote control). Nor are they suitable for devices with relatively little computing power, such as home appliances and mobile phones.
[0005]
In recent years, hard disk capacity has increased and prices have fallen. As a result, speaker adaptation techniques such as the "clustering-based method" and the "sufficient-statistics-based method" have appeared, which use a relatively large hard disk but need only relatively little computing power. These techniques are suitable for car navigation systems, whose installed hard disk capacity is increasing, and for devices with relatively little computing power, such as televisions and other home appliances and mobile phones. Small home appliances and mobile phones cannot carry a large-capacity hard disk, but this is no longer a problem because they can communicate with a large-capacity server over a network. In addition, these techniques require the user to utter only a small number of sentences to adapt the acoustic model (about one sentence), so the burden on the user is small and the system can be used instantly even when the user changes. However, the "clustering-based method" selects a single HMM close to the user and uses it as the adaptive model, so recognition performance deteriorates greatly when no HMM is close to the user and the usage environment.
[0006]
Considering the above, the speaker adaptation technique best suited to mobile phones, home appliances, and the like is considered to be the "method using sufficient statistics" (Shinichi Yoshizawa, Akira Baba, Kanako Matsunami, Yuichiro Mera, Miichi Yamada, Kiyohiro Shikano, "Unsupervised training of phoneme models using sufficient statistics and speaker distance", IEICE Technical Report, SP2000-89, pp. 83-88, 2000). With this method, a highly accurate adaptive model (an acoustic model adapted to the user's voice) is obtained instantly from a single utterance by the user.
[0007]
Next, the procedure for creating an adaptive model by the “method using sufficient statistics” will be described with reference to FIGS.
[0008]
-Creation of selection model and sufficient statistics (ST200)-
Voice data of various speakers (for example, about 300 people) recorded in a quiet environment is stored in advance in the voice database 310 (FIG. 3).
[0009]
Using the voice data stored in the database 310, a selection model (here represented by a Gaussian mixture model (GMM)) and sufficient statistics (here represented by a hidden Markov model (HMM)) are created for each speaker and stored in the sufficient statistics file 320 (FIG. 3). "Sufficient statistics" are statistics sufficient to represent the properties of the database; here they are the means, variances, and EM counts of the HMM acoustic model. The sufficient statistics are computed by one iteration of training with the EM algorithm, starting from a model for unspecified speakers. The selection model is created as a Gaussian mixture model with one state and 64 mixture components, without distinguishing phonemes.
[0010]
The procedure for creating sufficient statistics will be described in detail with reference to FIG.
[0011]
<ST201>
First, sufficient statistics for unspecified speakers are created. Here, they are created by training on the data of all speakers using the EM algorithm. The sufficient statistics are represented by hidden Markov models, and each state is represented by a Gaussian mixture distribution. Each Gaussian distribution in the sufficient statistics for unspecified speakers is assigned a number.
[0012]
<ST202>
Sufficient statistics for each speaker are created using the sufficient statistics for unspecified speakers as initial values. Here, they are created by training on each speaker's data using the EM algorithm. Each Gaussian distribution in a speaker's sufficient statistics stores the number corresponding to the number assigned in the sufficient statistics for unspecified speakers.
[0013]
-Input of voice data for adaptation (ST210)-
User's voice is input.
[0014]
-Selection of sufficient statistics by selection model (ST220)-
Based on the input voice and the selection models, a plurality of sufficient statistics that are "close" to the user's voice (acoustic models of speakers acoustically close to the user's voice) are selected. Here, the "close" sufficient statistics are those of the speakers corresponding to the top N selection models, ranked in descending order of the likelihood obtained by inputting the input voice into each selection model. The selection process described above is performed in the adaptive model creation unit 330 shown in FIG. This is shown in FIG.
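The selection step can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes diagonal-covariance GMMs, and the names `gmm_avg_loglik`, `select_top_n`, and the `selection_models` dictionary layout are hypothetical.

```python
import numpy as np


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a
    diagonal-covariance GMM with M components.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                             # (T, M)
    log_weighted = log_gauss + np.log(weights)
    # log-sum-exp over components for numerical stability
    peak = log_weighted.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_frame.mean())


def select_top_n(frames, selection_models, n):
    """Return the IDs of the n speakers whose selection model gives the
    highest likelihood for the user's frames (assumed data layout:
    speaker ID -> (weights, means, variances))."""
    scores = {
        speaker: gmm_avg_loglik(frames, *params)
        for speaker, params in selection_models.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

The sufficient statistics stored for the returned speaker IDs would then be passed on to the adaptive model creation step.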
[0015]
-Adaptive model creation (ST230)-
An adaptive model is created using the selected sufficient statistics. Specifically, for the selected sufficient statistics, statistical calculations (Equations 1 to 3) are carried out across the Gaussian distributions that share the same number, yielding one new Gaussian distribution for each number. The adaptive model creation process is performed in the adaptive model creation unit 330 shown in FIG. This is shown in FIG.
[0016]
[Equation 1]

[0017]
[Equation 2]

[0018]
[Equation 3]
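The images for Equations 1 to 3 do not survive in this text. Based on the symbol definitions in paragraph [0019], and assuming the usual count-weighted pooling of Gaussian statistics, they can plausibly be reconstructed as below. This is a reconstruction under those assumptions, not the published formulas verbatim.

```latex
\begin{align}
\mu_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}
                    {\sum_{j=1}^{N_{sel}} C_{mix}^{j}}
  \qquad (i = 1, 2, \ldots, N_{mix}) \tag{1} \\[4pt]
v_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\bigl(v_i^{j} + (\mu_i^{j})^2\bigr)}
                  {\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - (\mu_i^{adp})^2
  \qquad (i = 1, 2, \ldots, N_{mix}) \tag{2} \\[4pt]
a^{adp}[i][j] &= \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}
                      {\sum_{j'=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j']}
  \qquad (i, j = 1, 2, \ldots, N_{state}) \tag{3}
\end{align}
```

The summation index j' in the denominator of (3) is introduced here only to avoid clashing with the free index j.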

[0019]
Here, the mean and variance of the normal distributions in each state of the HMM of the adaptive model are denoted μ_i^adp (i = 1, 2, ..., N_mix) and v_i^adp (i = 1, 2, ..., N_mix), respectively, where N_mix is the number of mixture components. The state transition probability is denoted a^adp[i][j] (i, j = 1, 2, ..., N_state), where N_state is the number of states and a^adp[i][j] is the transition probability from state i to state j. N_sel is the number of selected acoustic models, and μ_i^j and v_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) are the mean and variance of each acoustic model. C_mix^j (j = 1, 2, ..., N_sel) and C_state^k[i][j] (k = 1, 2, ..., N_sel; i, j = 1, 2, ..., N_state) are the EM count (frequency) for the normal distribution and the EM count for the state transition, respectively.
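As an illustration of how the quantities defined above could be combined, the following sketch pools the selected models by their EM counts. It assumes the count-weighted form (means and second moments averaged with weights C_mix, transition counts summed and row-normalized); the function name and array layout are hypothetical, and per-Gaussian counts are assumed for generality.

```python
import numpy as np


def combine_sufficient_statistics(means, variances, mix_counts, trans_counts):
    """Pool N_sel selected models into one adapted model.

    means, variances: (N_sel, N_mix, D) per-model Gaussian parameters
    mix_counts:       (N_sel, N_mix)    EM counts of each Gaussian
    trans_counts:     (N_sel, N_state, N_state) EM counts of transitions
    Assumes every Gaussian and every state has a non-zero total count.
    """
    w = mix_counts[:, :, None]                      # broadcast over dimension D
    total = w.sum(axis=0)                           # (N_mix, 1)

    # Count-weighted mean of the same-numbered Gaussians.
    mu_adp = (w * means).sum(axis=0) / total

    # Count-weighted second moment minus the squared adapted mean.
    second_moment = (w * (variances + means ** 2)).sum(axis=0) / total
    var_adp = second_moment - mu_adp ** 2

    # Sum transition counts over models, then normalize each row.
    pooled = trans_counts.sum(axis=0)
    a_adp = pooled / pooled.sum(axis=1, keepdims=True)
    return mu_adp, var_adp, a_adp
```

Because only stored means, variances, and counts are combined, no pass over the original voice data is needed at adaptation time, which is what makes the method fast.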
[0020]
-Recognition (ST240)-
The speech recognition system 300 (FIG. 3) recognizes the user's speech using the adaptive model created as described above.
[0021]
[Non-Patent Document 1]
Shinichi Yoshizawa, Akira Baba, Kanako Matsunami, Yuichiro Mera, Miichi Yamada, Kiyohiro Shikano, "Unsupervised training of phoneme models using sufficient statistics and speaker distance", IEICE Technical Report, SP2000-89, pp. 83-88, 2000
[Patent Document 1]
JP 2001-255887 A
[Patent Document 2]
Japanese Patent Laid-Open No. 10-161692
[Patent Document 3]
JP-A-5-2399
[Patent Document 4]
JP-A-6-214592
[Patent Document 5]
JP-A-9-258769
[Patent Document 6]
JP 2002-182682 A
[0022]
[Problems to be solved by the invention]
In the "method using sufficient statistics" explained above, the positions of the Gaussian distributions in the sufficient statistics for unspecified speakers (the initial values) are approximated as being equivalent to the positions of the Gaussian distributions in the sufficient statistics for each speaker. In other words, it is assumed that when the sufficient statistics of each set of voice data are computed from the initial values, only the mixture weights, means, and variances are re-estimated while the positional relationship among the Gaussian distributions is preserved. Specifically, it is assumed that, among the Gaussian distributions of the initial-value sufficient statistics, the one closest (in a distribution distance such as the KL distance) to a given Gaussian distribution in the sufficient statistics of each set of voice data has the same number as that Gaussian distribution. Since this assumption holds in a quiet environment (see FIG. 4), the above method is effective for creating an adaptive model "in a quiet environment". For practical use, however, the creation of an adaptive model "in a noisy environment" must be considered. In that case, the assumption does not hold, as shown in FIG. 6, and the accuracy of the adaptive model decreases.
[0023]
An object of the present invention is to provide an acoustic model creation method, an acoustic model creation device, and an acoustic model creation program that can prevent a reduction in accuracy of an adaptive model in a noisy environment.
[0024]
[Means for Solving the Problems and Effects of the Invention]
The method according to the present invention is a method for creating an acoustic model used for speech recognition, and includes the following steps (a) to (e). In step (a), voice data on which noise is superimposed is grouped based on acoustic proximity. In step (b), sufficient statistics are created for each group obtained in step (a) using the audio data contained in the group. In step (c), a group acoustically close to the voice data of the person (user) who uses voice recognition is selected from the group obtained in step (a). In step (d), a sufficient statistic that is acoustically close to the user's voice data is selected from among the sufficient statistic for the group selected in step (c). In step (e), an acoustic model is created using the sufficient statistics selected in step (d).
[0025]
Preferably, the steps (a) and (b) are performed off-line before the time when the user uses voice recognition.
[0026]
Preferably, in the step (a), grouping is performed based on the type of noise.
[0027]
Preferably, in the step (a), grouping is performed based on the S / N ratio of audio data on which noise is superimposed.
[0028]
Preferably, in the step (a), grouping is performed so that acoustically close speakers fall in the same group.
[0029]
Preferably, in the step (b), sufficient statistics are created for each speaker.
[0030]
Preferably, in the step (b), sufficient statistics are created for each voice tone of each speaker.
[0031]
Preferably, in the step (b), sufficient statistics are created for each type of noise.
[0032]
Preferably, in the step (b), a sufficient statistic is created for each SN ratio of the audio data included in each group.
[0033]
An apparatus according to the present invention is an apparatus that creates an acoustic model used for speech recognition, and includes an accumulation unit, a first selection unit, a second selection unit, and a model creation unit. The accumulation unit accumulates, for each of a plurality of groups obtained by grouping noise-superimposed voice data based on acoustic proximity, sufficient statistics created using the voice data included in the group. The first selection unit selects, from the plurality of groups, a group acoustically close to the voice data of a person (user) who uses voice recognition. The second selection unit selects, from the sufficient statistics for the group selected by the first selection unit, sufficient statistics that are acoustically close to the user's voice data. The model creation unit creates an acoustic model using the sufficient statistics selected by the second selection unit.
[0034]
Preferably, the apparatus further includes a group creation unit and a sufficient statistic creation unit. The group creation unit groups audio data on which noise is superimposed based on acoustic proximity. The sufficient statistics creation unit creates sufficient statistics for each group obtained by the group creation unit using the audio data included in the group. The accumulation unit accumulates sufficient statistics created by the sufficient statistics creation unit.
[0035]
The program according to the present invention is a computer program for creating an acoustic model used for speech recognition, and causes a computer to function as means (a) to (d). Means (a) accumulates, for each of a plurality of groups obtained by grouping noise-superimposed voice data based on acoustic proximity, sufficient statistics created using the voice data included in the group. Means (b) selects, from the plurality of groups, a group acoustically close to the voice data of the person (user) who uses voice recognition. Means (c) selects, from the sufficient statistics for the group selected by means (b), sufficient statistics that are acoustically close to the user's voice data. Means (d) creates an acoustic model using the sufficient statistics selected by means (c).
[0036]
Preferably, the computer is further caused to function as means (e) to (f). The means (e) groups audio data on which noise is superimposed based on acoustic proximity. The means (f) creates sufficient statistics for each group obtained by the means (e) using the voice data included in the group. The means (a) accumulates sufficient statistics created by the means (f).
[0037]
In the above method, apparatus, and program, variations in noise type, S/N ratio, and speaker are grouped by acoustic proximity, and the sufficient statistics and the adaptive model (acoustic model) are created within each group. Grouping in this way allows the above assumption to hold. As a result, a reduction in the accuracy of the adaptive model in a noisy environment is prevented, and a highly accurate adaptive model can be created.
[0038]
Another method according to the present invention is a method for creating an acoustic model used for speech recognition, and includes the following steps (a) to (d). In step (a), voice data acoustically close to the voice data of a person (user) using voice recognition is selected from a plurality of voice data by a plurality of speakers. In step (b), noise in an environment where speech recognition is used is superimposed on the speech data selected in step (a). In step (c), sufficient statistics are created using the audio data on which noise is superimposed in step (b). In step (d), an acoustic model is created using the sufficient statistics created in step (c).
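Step (b) can be sketched as mixing a noise recording into a speech signal at a chosen S/N ratio. This is a minimal sketch, assuming both signals are one-dimensional sample arrays at the same sampling rate and that the S/N ratio is defined on average power; the function name `superimpose_noise` is hypothetical and not taken from the patent.

```python
import numpy as np


def superimpose_noise(speech, noise, snr_db):
    """Return speech with noise added so that the speech-to-noise
    power ratio equals snr_db (in dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Repeat or truncate the noise so it covers the whole utterance.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

The resulting noise-superimposed voice data would then be the input to step (c).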
[0039]
Preferably, the method further comprises steps (e) and (f). In step (e), noise in an environment where speech recognition is expected to be used is superimposed on the plurality of voice data from the plurality of speakers. In step (f), label information is created for the voice data on which noise was superimposed in step (e). In step (c), sufficient statistics are created using the voice data on which noise was superimposed in step (b) and, from the label information created in step (f), the label information for the voice data selected in step (a).
[0040]
Preferably, in step (f), information on the state transitions of the acoustic model is also created for the voice data on which noise was superimposed in step (e). In step (c), sufficient statistics are created by additionally using, from the state-transition information created in step (f), the information for the voice data selected in step (a).
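The benefit of reusing label information is that no alignment has to be recomputed for the newly noise-superimposed data: each frame is simply assigned to the HMM state recorded in the stored label. The sketch below illustrates this under a strong simplification, one Gaussian per state, whereas the description uses Gaussian mixtures; the function name and array layout are hypothetical.

```python
import numpy as np


def stats_from_labels(frames, state_labels, n_states):
    """Accumulate per-state counts, means, and variances from feature
    frames (T, D) using precomputed state labels (T,) of integer IDs.
    States that receive no frames keep zero mean and variance."""
    frames = np.asarray(frames, dtype=float)
    state_labels = np.asarray(state_labels, dtype=int)
    dim = frames.shape[1]

    counts = np.bincount(state_labels, minlength=n_states).astype(float)
    sums = np.zeros((n_states, dim))
    squares = np.zeros((n_states, dim))
    np.add.at(sums, state_labels, frames)          # per-state sum of frames
    np.add.at(squares, state_labels, frames ** 2)  # per-state sum of squares

    means = np.zeros((n_states, dim))
    variances = np.zeros((n_states, dim))
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen, None]
    variances[seen] = squares[seen] / counts[seen, None] - means[seen] ** 2
    return counts, means, variances
```

In the same spirit, the stored state-transition information would be reused as is rather than re-estimated.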
[0041]
Preferably, in the step (e), each of a plurality of types of noise is superimposed on the plurality of voice data from the plurality of speakers. In step (f), label information is created for each of the plurality of types of noise. In step (c), label information suited to the environment in which speech recognition is used is selected from the plural label information for the voice data selected in step (a), and sufficient statistics are created using the selected label information.
[0042]
Another apparatus according to the present invention is an apparatus for creating an acoustic model used for speech recognition, and includes an accumulation unit, a selection unit, a noise superimposing unit, a sufficient statistic creating unit, and a model creating unit. The accumulation unit stores a plurality of voice data from a plurality of speakers. The selection unit selects voice data acoustically close to voice data of a person (user) who uses voice recognition from the voice data stored in the accumulation unit. The noise superimposing unit superimposes noise in an environment where speech recognition is used on the voice data selected by the selection unit. The sufficient statistic creating unit creates sufficient statistics using the voice data on which the noise is superimposed by the noise superimposing unit. The model creating unit creates an acoustic model using the sufficient statistics created by the sufficient statistic creating unit.
[0043]
Another program according to the present invention is a computer program for creating an acoustic model used for speech recognition, and causes a computer to function as means (a) to (e). The means (a) accumulates a plurality of voice data from a plurality of speakers. The means (b) selects the sound data acoustically close to the sound data of the person (user) who uses the speech recognition from the sound data stored in the means (a). The means (c) superimposes noise in an environment where voice recognition is used on the voice data selected by the means (b). The means (d) creates sufficient statistics using the voice data on which noise is superimposed by the means (c). The means (e) creates an acoustic model using the sufficient statistics created by the means (d).
[0044]
In the above-described method, apparatus, and program, processing is performed on voice data that is acoustically close to the user's voice, so a highly accurate adaptive model can be created. Moreover, since sufficient statistics are computed only after acoustically close voice data has been selected, the process of creating sufficient statistics can be made faster.
[0045]
An adaptive model creation apparatus according to the present invention is an apparatus for creating an acoustic model used for speech recognition, and includes an accumulation unit, a storage unit, and a model creation unit. The accumulation unit accumulates a plurality of groups grouped based on acoustic proximity. Each of the plurality of groups includes a plurality of sufficient statistics. The storage unit stores a group ID indicating at least one of the plurality of groups. The model creation unit selects one group that is acoustically close to the user's voice from the groups corresponding to the group IDs stored in the storage unit. The model creation unit creates an acoustic model using at least two sufficient statistics that are acoustically close to the user's voice among the sufficient statistics included in the selected group.
[0046]
Preferably, the model creation unit selects at least one group acoustically close to the user's voice among the plurality of groups, and stores a group ID indicating the selected group in the storage unit.
[0047]
Preferably, the storage unit stores the type of noise in the environment where voice recognition is used and the group ID in association with each other.
[0048]
Preferably, the storage unit stores a user ID indicating a user and the group ID in association with each other.
[0049]
Preferably, the storage unit stores a device ID for identifying the adaptive model creation device and the group ID in association with each other.
[0050]
Another adaptive model creation apparatus according to the present invention is an apparatus for creating an acoustic model used for speech recognition, and includes an accumulation unit and a model creation unit. The accumulation unit accumulates a plurality of groups grouped based on acoustic proximity. Each of the plurality of groups includes a plurality of sufficient statistics. The model creation unit receives a group ID indicating at least one of the plurality of groups. The model creation unit selects one group that is acoustically close to the user's voice from among the groups corresponding to the received group ID. The model creation unit creates an acoustic model using at least two sufficient statistics that are acoustically close to the user's voice among the sufficient statistics included in the selected group.
[0051]
Preferably, the model creation unit receives the group ID from an external storage device. The model creation unit selects at least one group that is acoustically close to the user's voice among the plurality of groups, and stores a group ID indicating the selected group in the storage device.
[0052]
Preferably, the storage device stores the type of noise in the environment where voice recognition is used and the group ID in association with each other.
[0053]
Preferably, the storage device stores a user ID indicating a user and the group ID in association with each other.
[0054]
Preferably, the storage device stores a device ID for identifying the adaptive model creation device and the group ID in association with each other.
[0055]
Another adaptive model creation apparatus according to the present invention is an apparatus for creating an acoustic model used for speech recognition, and includes a selection unit and a model creation unit. The selection unit receives a group ID indicating at least one group of the plurality of groups. The plurality of groups are grouped based on acoustic proximity. Each of the plurality of groups includes a plurality of sufficient statistics. The selection unit selects one group that is acoustically close to the user's voice from among the groups corresponding to the received group ID. The model creation unit receives at least two sufficient statistics that are acoustically close to the user's voice among the sufficient statistics included in the group selected by the selection unit. The model creation unit creates an acoustic model using the received sufficient statistics.
[0056]
Preferably, the selection unit receives the group ID from an external storage device. The selection unit selects at least one group acoustically close to the user's voice from the plurality of groups, and stores a group ID indicating the selected group in the storage device.
[0057]
Preferably, the storage device stores the type of noise in the environment where voice recognition is used and the group ID in association with each other.
[0058]
Preferably, the storage device stores a user ID indicating a user and the group ID in association with each other.
[0059]
Preferably, the storage device stores a device ID for identifying the adaptive model creation device and the group ID in association with each other.
[0060]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals, and the description thereof will not be repeated.
[0061]
(First embodiment)
<Configuration of adaptive model creation device>
FIG. 7 is a block diagram showing the overall configuration of the adaptive model creation apparatus for speech recognition according to the first embodiment. The apparatus shown in FIG. 7 includes a sufficient statistic creation unit 1, a selection model creation unit 2, a sufficient statistic accumulation unit 3, a selection model accumulation unit 4, an adaptive model creation unit 5, and a group creation unit 6.
[0062]
The group creation unit 6 groups the noise-superimposed speech data 84, which is created by superimposing the noise data 82 on the speech data 83 recorded in a quiet environment, on the basis of “acoustic proximity”.
[0063]
The sufficient statistic creation unit 1 creates a sufficient statistic 71 for each group created by the group creation unit 6 using the audio data 84 grouped by the group creation unit 6.
[0064]
The sufficient statistics accumulation unit 3 accumulates the sufficient statistics 71 created by the sufficient statistics creation unit 1.
[0065]
The selection model creation unit 2 creates a selection model 73. The selection model 73 is a model for selecting sufficient statistics 72 close to the user's voice data 81 from the sufficient statistics 71 accumulated in the sufficient statistic accumulation unit 3.
[0066]
The selection model accumulation unit 4 accumulates the selection model 73 created by the selection model creation unit 2.
[0067]
The adaptive model creation unit 5 uses the selection model 73 accumulated in the selection model accumulation unit 4 to select, from among the sufficient statistics 71 accumulated in the sufficient statistic accumulation unit 3, the sufficient statistics 72 that are “acoustically close” to the user's voice data 81, and creates an adaptive model 74 using the selected sufficient statistics 72.
[0068]
<Adaptive model creation procedure>
Next, a procedure for creating an adaptive model by the apparatus configured as described above will be described. Here, a case where the user performs voice recognition indoors will be described as an example.
[Create sufficient statistics 71 and selection model 73]
First, a method for creating sufficient statistics 71 and selection model 73 will be described. Here, a case will be described in which the sufficient statistics 71 and the selection model 73 are created offline before the user requests acquisition of the adaptive model.
[0069]
Voice data 83 of multiple speakers is recorded in a quiet environment. Here, voice data of about 300 speakers is recorded.
[0070]
The noise data 82 of the environment where the user will use voice recognition is recorded. Here, room noise is recorded.
[0071]
The voice data 84 is created by superimposing the noise data 82 on the voice data 83 at the S/N ratios expected in the environment where the user will use voice recognition. Here, the noise data 82 is superimposed at S/N ratios of 15 dB, 20 dB, and 25 dB.
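The superposition at a given S/N ratio can be illustrated with a minimal sketch. The function name `mix_at_snr`, the representation of signals as lists of samples, and the use of average power over the whole utterance are assumptions made for illustration; the specification does not state how the noise level is computed.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so that the speech-to-noise
    power ratio of the mixture equals snr_db (in dB)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Choose gain g so that p_speech / (g^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Calling this once per target value (15, 20, and 25 dB) for each clean utterance would yield the three variants of the data 84 described above.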
[0072]
The group creation unit 6 groups the created voice data 84 by “acoustic proximity”. Here, as shown in FIG. 8, the data is grouped by S/N ratio into group A (15 dB), group B (20 dB), and group C (25 dB).
[0073]
Sufficient statistics 71 are created. As shown in FIG. 9, the sufficient statistic creation unit 1 creates speaker-independent (unspecified speaker) models A to C using the noise-superimposed speech data 84A to 84C of each group created by the group creation unit 6. Next, for each group created by the group creation unit 6, the sufficient statistics 71A to 71C are calculated for each speaker by performing one iteration of training with the EM algorithm, starting from the speaker-independent model of that group and using the noise-superimposed speech data 84 of that speaker. Here, about 300 sufficient statistics are created for each group.
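A single EM pass of this kind can be sketched as follows. For brevity the sketch assumes a diagonal-covariance Gaussian mixture without HMM state transitions, whereas the specification uses HMMs; the function names and the list-based data layout are likewise assumptions for illustration.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def accumulate_statistics(frames, weights, means, variances):
    """One E-step from the initial (speaker-independent) model.
    Returns per-component occupancy counts and first- and second-order sums."""
    n_mix, dim = len(weights), len(means[0])
    counts = [0.0] * n_mix
    first = [[0.0] * dim for _ in range(n_mix)]
    second = [[0.0] * dim for _ in range(n_mix)]
    for x in frames:
        logs = [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        peak = max(logs)  # subtract the maximum before exponentiating
        post = [math.exp(value - peak) for value in logs]
        total = sum(post)
        for i in range(n_mix):
            gamma = post[i] / total  # posterior occupancy of component i
            counts[i] += gamma
            for d in range(dim):
                first[i][d] += gamma * x[d]
                second[i][d] += gamma * x[d] * x[d]
    return counts, first, second
```

The per-speaker mean of component i is then `first[i][d] / counts[i]`, and the counts play the role of the EM counts used when statistics of several speakers are combined.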
[0074]
A selection model 73 is created. As an example, as shown in FIG. 10, for each group created by the group creation unit 6, the selection models 73A to 73C are created for each speaker as Gaussian mixture models (GMMs) with one state and 64 mixture components, without distinguishing phonemes, using the noise-superimposed speech data 84A to 84C. Here, about 300 sufficient statistic selection models are created for each group.
[0075]
The sufficient statistics 71A to 71C (FIG. 9) and the selection models 73A to 73C (FIG. 10) created from the same voice data 84A to 84C are paired. A sufficient statistic close to the user's voice data is therefore selected by way of its paired selection model.
[0076]
The sufficient statistics accumulation unit 3 accumulates sufficient statistics 71A to 71C created by the sufficient statistics creation unit 1. The selection model accumulation unit 4 accumulates the selection models 73A to 73C created by the selection model creation unit 2. An example of the sufficient statistics 71 stored in the sufficient statistics storage unit 3 is shown in FIGS. An example of the selection model 73 stored in the selection model storage unit 4 is shown in FIG. Here, the sufficient statistics of each speaker (Mr. A to Z) in each group (A to C) and the selection model are paired.
[Create Adaptive Model 74]
Next, a procedure for creating the adaptive model 74 in the adaptive model creating unit 5 will be described.
[0077]
An example of the sufficient statistics 71 and the selection model 73 will be described with reference to those shown in FIGS.
[0078]
The user requests creation of the adaptive model 74.
[0079]
Using a microphone for voice recognition or the like, the user inputs voice data 81, uttered in the environment where voice recognition is used, to the adaptive model creation unit 5. The noise of that environment is superimposed on the voice data 81.
[0080]
Here, a case where the user uses voice recognition in an environment where the SN ratio is 20 dB indoors will be described.
[0081]
The adaptive model creation unit 5 transmits the voice data 81 to the selection model storage unit 4 and inputs it to the selection model 73. That is, the audio data 81 is input to the sufficient statistic selection model of A to Z in the groups A to C in FIG.
[0082]
Among the groups created by the group creation unit 6, a group “acoustically close” to the user's voice data 81 is determined.
[0083]
The likelihoods of the selection model 73 when the audio data 81 is input to the selection model 73 are calculated and arranged in descending order of likelihood. That is, the likelihood with respect to the speech data 81 of the selection models of A to Z in the groups A to C in FIG. 12 is calculated and arranged in descending order. FIG. 13 shows an example in which the likelihood of the selection model 73 is calculated and arranged in descending order of likelihood.
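The likelihood calculation and ordering can be sketched as follows, assuming that each selection model is a diagonal-covariance GMM and that the score is the average per-frame log-likelihood. The function names and the dictionary keyed by (group, speaker) are hypothetical and chosen for illustration only.

```python
import math

def avg_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        logs = []
        for w, mean, var in zip(weights, means, variances):
            log_pdf = -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                                 for xi, m, v in zip(x, mean, var))
            logs.append(math.log(w) + log_pdf)
        peak = max(logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(value - peak) for value in logs))
    return total / len(frames)

def rank_selection_models(frames, models):
    """models maps (group, speaker) -> (weights, means, variances).
    Returns [((group, speaker), score), ...] in descending order of score."""
    scores = {key: avg_log_likelihood(frames, *gmm) for key, gmm in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The ranked list produced here is the input to the group decision and to the choice of the sufficient statistics in the following steps.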
[0084]
The top N selection models (100 in the example of FIG. 13) are selected in descending order of likelihood, and the group (S/N ratio of the room noise) to which the largest number of the selected models belong is determined. In the example of FIG. 12, this group is group B (room noise at 20 dB). That is, group B is the group that is “acoustically close” to the user's voice data 81.
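The group decision amounts to a majority vote among the top N selection models. A minimal sketch follows; the function name `select_group` and the input layout are assumptions, and the tie-breaking rule (the group whose model ranks higher wins) is not specified in the text.

```python
from collections import Counter

def select_group(ranked, n):
    """ranked: [((group, speaker), likelihood), ...] in descending likelihood order.
    Returns the group that owns the most of the top-n selection models."""
    votes = Counter(group for (group, _speaker), _score in ranked[:n])
    # most_common orders by count; groups with equal counts keep the order in
    # which they were first seen, i.e. the higher-ranked group comes first.
    return votes.most_common(1)[0][0]
```

With N = 100 and three groups of about 300 models each, the vote picks the S/N condition that dominates the best-matching models rather than relying on the single best model.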
[0085]
The adaptive model 74 is created using the sufficient statistics of the group (group B) that is “acoustically close” to the voice data 81. The top L selection models (20 in the example of FIG. 14) are selected in descending order of likelihood from the selection models 73 of that group. Then, the adaptive model 74 is created using the sufficient statistics 72 paired with the selected selection models. Specifically, the adaptive model 74 is created by the following statistical calculation (Equations 4 to 6). The mean and variance of the normal distributions in each state of the HMM of the adaptive model 74 are denoted $\mu_i^{adp}$ ($i = 1, 2, \ldots, N_{mix}$) and $V_i^{adp}$ ($i = 1, 2, \ldots, N_{mix}$), where $N_{mix}$ is the number of mixture components. The state transition probability is denoted $a^{adp}[i][j]$ ($i, j = 1, 2, \ldots, N_{state}$), where $N_{state}$ is the number of states and $a^{adp}[i][j]$ is the transition probability from state $i$ to state $j$.
[0086]
[Expression 4]

[0087]
[Equation 5]

[0088]
[Formula 6]

[0089]
Here, $N_{sel}$ is the number of selected acoustic models, and $\mu_i^{j}$ ($i = 1, 2, \ldots, N_{mix}$; $j = 1, 2, \ldots, N_{sel}$) and $V_i^{j}$ ($i = 1, 2, \ldots, N_{mix}$; $j = 1, 2, \ldots, N_{sel}$) are the mean and variance of each HMM. $C_{mix}^{j}$ ($j = 1, 2, \ldots, N_{sel}$) and $C_{state}^{k}[i][j]$ ($k = 1, 2, \ldots, N_{sel}$; $i, j = 1, 2, \ldots, N_{state}$) are the EM count (frequency) in the normal distribution and the EM count of the state transition, respectively.
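The bodies of Equations 4 to 6 are not reproduced in this text. A reconstruction consistent with the symbol definitions given here, assuming the usual count-weighted pooling of the statistics of the selected models, would read as follows; it is offered as a sketch and may differ in notation from the equations as filed.

```latex
\begin{align}
\mu_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}
                    {\sum_{j=1}^{N_{sel}} C_{mix}^{j}},
  & i &= 1, 2, \ldots, N_{mix} \tag{4} \\
V_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(V_i^{j} + (\mu_i^{j})^{2}\right)}
                  {\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - \left(\mu_i^{adp}\right)^{2},
  & i &= 1, 2, \ldots, N_{mix} \tag{5} \\
a^{adp}[i][j] &= \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}
                      {\sum_{j'=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j']},
  & i, j &= 1, 2, \ldots, N_{state} \tag{6}
\end{align}
```

Under this reading, the mean is a count-weighted average of the per-speaker means, the variance is obtained from the pooled second moment minus the squared pooled mean, and each row of the transition matrix is the pooled transition count normalized to sum to one.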
[0090]
The adaptive model creation unit 5 then waits for the user's next request to create an adaptive model.
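The combination of the selected sufficient statistics in the procedure above can be sketched in code. The sketch assumes count-weighted pooling of means, variances, and transition counts, which is one reading of the symbol definitions accompanying Equations 4 to 6; the function names and the dictionary layout are hypothetical.

```python
def combine_statistics(selected):
    """selected: list of per-speaker dicts for one Gaussian, with keys
    'count' (EM count), 'mean' and 'var' (per-dimension lists).
    Returns the pooled (mean, var) of that Gaussian."""
    total = sum(s["count"] for s in selected)
    dim = len(selected[0]["mean"])
    mean = [sum(s["count"] * s["mean"][d] for s in selected) / total
            for d in range(dim)]
    # Pooled second moment minus the squared pooled mean.
    var = [sum(s["count"] * (s["var"][d] + s["mean"][d] ** 2) for s in selected) / total
           - mean[d] ** 2
           for d in range(dim)]
    return mean, var

def combine_transitions(counts_per_speaker):
    """counts_per_speaker: list of N_state x N_state EM transition-count matrices.
    Returns row-normalized transition probabilities.
    Assumes every state has a non-zero pooled outgoing count."""
    n = len(counts_per_speaker[0])
    pooled = [[sum(c[i][j] for c in counts_per_speaker) for j in range(n)]
              for i in range(n)]
    return [[pooled[i][j] / sum(pooled[i]) for j in range(n)] for i in range(n)]
```

Because only sums and divisions over L precomputed statistics are involved, this step needs no further pass over speech data, which is in line with the instantaneous adaptation described in this embodiment.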
[0091]
<Experimental result>
The results of recognition experiments using an adaptive model are described.
[0092]
The conditions of the recognition experiment are as follows. The database consists of data from 306 speakers, and each speaker has utterance data of about 200 sentences. The data is sampled at 16 kHz with 16-bit quantization. As the feature quantities, 12-dimensional MFCCs (mel-frequency cepstral coefficients), delta cepstrum, and delta power, analyzed with a window shift length of 10 ms, are used. CMN (cepstral mean normalization) is performed during feature extraction. A language model constructed from 20k newspaper articles is used. There are 46 evaluation speakers. As the evaluation sentences, 4 to 5 sentences per speaker, 200 sentences in total, were used. Room noise was used as the noise type.
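Two of the feature-processing steps named above, CMN and the delta coefficients, can be sketched as follows. The regression window width of 2 frames and the padding by repeating edge frames are assumptions; the specification does not state these details, and the MFCC analysis itself is outside the scope of the sketch.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

def delta(frames, width=2):
    """Regression-based delta coefficients; edge frames are repeated as padding."""
    last = len(frames) - 1
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            num = sum(n * (frames[min(t + n, last)][d] - frames[max(t - n, 0)][d])
                      for n in range(1, width + 1))
            row.append(num / denom)
        out.append(row)
    return out
```

With a 10 ms window shift, one second of speech corresponds to 100 such frames.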
[0093]
FIG. 15 shows the recognition experiment results. FIG. 15 also shows the recognition result of the prior art that creates an adaptive model using sufficient statistics.
[0094]
From the results shown in FIG. 15, it can be seen that the performance of the adaptive model created by the present invention is very high compared to that of the prior art.
[0095]
<Effect>
As described above, in the first embodiment, voice data that are “acoustically close” are clustered (grouped), and the selection models, the sufficient statistics, and the adaptive model are created within each group. Clustering in this way allows the assumption described in the prior art section to hold. As a result, a reduction in the accuracy of the adaptive model in a noisy environment can be prevented, and a highly accurate adaptive model can be created. The “acoustically close” voice data grouped here is a set of voice data lying within the range in which the assumption of the “method based on sufficient statistics” described in the prior art section is satisfied. Specifically, it is a set of voice data for which, when the sufficient statistics of each voice data are calculated from the initial values of the statistics, only the mixture weights, means, and variances are learned while the positional relationship of the Gaussian distributions is maintained (see FIG. 16). In other words, for each Gaussian distribution in the sufficient statistics of each voice data, the closest Gaussian distribution among the initial values of the statistics has the same distribution number (see FIG. 16).
[0096]
Examples of groupings that can make such assumptions are:
・ Create groups for each type of noise.
・ Create groups for each SN ratio.
・ Create a speech model (represented by a Gaussian mixture distribution) from each voice data, and place those that are close in an inter-distribution distance such as the KL distance into the same group.
and so on. An example is shown in FIG.
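The third grouping criterion in the list above can be illustrated with a small sketch. For simplicity it assumes that each voice data is summarized by a single diagonal-covariance Gaussian, for which the KL divergence has a closed form, whereas the text refers to Gaussian mixtures; the symmetrization, the greedy grouping rule, and the threshold are also assumptions made for illustration.

```python
import math

def kl_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
                     for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q))

def symmetric_kl(p, q):
    """p, q: (mean, var) tuples. Symmetrized so the order of arguments does not matter."""
    return kl_gauss(*p, *q) + kl_gauss(*q, *p)

def group_by_distance(models, threshold):
    """Greedy grouping: each model joins the first group whose representative
    lies within `threshold`, otherwise it starts a new group."""
    groups = []  # list of (representative model, [member names])
    for name, model in models.items():
        for rep, members in groups:
            if symmetric_kl(model, rep) <= threshold:
                members.append(name)
                break
        else:
            groups.append((model, [name]))
    return [members for _rep, members in groups]
```

Grouping by noise type or by S/N ratio, the first two criteria in the list, needs no distance computation and can be done directly from the recording labels.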
[0097]
Further, according to the first embodiment, the following effects can also be obtained.
[0098]
Since voice data 83 recorded off-line is used as voice data for creating the adaptive model 74 adapted to noise / speaker, the user does not need to make a large amount of utterances and the burden on the user is small.
[0099]
Since the sufficient statistics 71 are created using the noise superimposed speech data 84 and the adaptive model 74 is created, an adaptive model adapted to the usage environment can be created. Therefore, the adaptive model can be used in a noisy environment.
[0100]
Since the sufficient statistics 71 are created off-line, the adaptation model 74 can be created instantaneously during adaptation. Therefore, the adaptation model can be used immediately when the usage environment changes.
[0101]
Since the sufficient statistics are created for each group created by the group creation unit 6 and the adaptive model 74 is created, the adaptive model 74 adapted to the user's voice data 81 can be created. Therefore, more users can use the adaptive model in various noise environments.
[0102]
Note that, as the noise-superimposed audio data 84, audio data uttered in a noise environment may be used instead of the audio data in which the noise data is superimposed by calculation processing.
[0103]
The group creation unit 6 may create a group for each type of noise and for each set of acoustically close speakers.
[0104]
As the noise superimposed voice data 84, voice data under various noise environments such as room noise, vehicle interior noise, venue noise, and vacuum cleaner sound may be used.
[0105]
The adaptive model creation unit 5 may determine the timing for creating the adaptive model 74 automatically.
[0106]
The sufficient statistic selection model 73 is not limited to the Gaussian Mixture Model.
[0107]
As the noise data 82, noise in the usage environment may be used.
[0108]
The adaptive model creation apparatus according to the first embodiment can be realized by hardware or software (computer program).
[0109]
<Specific product image and grouping examples>
The speech recognition system using the speaker adaptation technique according to the first embodiment can be mounted on, for example, the following products (information equipment): mobile phones, personal digital assistants (PDAs), car navigation systems, personal computers, TV remote controls, speech translation devices, pet robots, interactive agents (graphics), and the like. Some of these are described below along with grouping examples.
[0110]
[Group creation method 1]
Groups are created for each type of noise × S/N ratio, and within each group a sufficient statistic is accumulated for each speaker × variation in that speaker's tone of voice.
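The two-level organization of group creation method 1 can be pictured as a nested mapping. The following minimal sketch uses hypothetical field names (`noise`, `snr_db`, `speaker`, `tone`) for the labels attached to each utterance; the specification does not prescribe any data layout.

```python
from collections import defaultdict

def build_groups(utterances):
    """utterances: iterable of dicts with the hypothetical keys
    'noise', 'snr_db', 'speaker', 'tone'.
    Returns {(noise, snr_db): {(speaker, tone): [utterance, ...]}}."""
    groups = defaultdict(lambda: defaultdict(list))
    for utt in utterances:
        group_key = (utt["noise"], utt["snr_db"])   # one group per noise type x S/N ratio
        stat_key = (utt["speaker"], utt["tone"])    # one statistic per speaker x tone
        groups[group_key][stat_key].append(utt)
    return groups
```

One sufficient statistic and one paired selection model would then be created from each inner list.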
[0111]
<Devices used by multiple speakers under multiple noises (eg, TV operation)>
・ Group selection method 1 (see FIG. 18)
The configuration of the system according to this example is shown in FIG. 18A. This system includes a server 1800, a digital TV system 1810, and a voice remote controller 1820. The server 1800 includes a group creation unit 6, a selection model creation unit 2, and a sufficient statistic creation unit 1. As shown in FIG. 18B, the group creation unit 6 groups the noise-superimposed voice data 84 by type of noise (sound of a vacuum cleaner, sound of a washing machine, etc.) × S/N ratio (10 dB, 20 dB, etc.). For each group created by the group creation unit 6, the sufficient statistic creation unit 1 creates a sufficient statistic for each combination of speaker (speaker A, speaker B, etc.) × tone of voice (nasal voice, normal voice, fast speech, etc.). The selection model creation unit 2 creates a corresponding selection model for each of the sufficient statistics created by the sufficient statistic creation unit 1. The voice remote controller 1820 includes a microphone 1821. The microphone 1821 converts the voice uttered by the user into predetermined voice data. The voice data converted by the microphone 1821 is transmitted to the digital TV system 1810. The digital TV system 1810 includes a hard disk (HDD) 1811, an adaptive model creation unit 5, a speech recognition system 300 (see FIG. 3), and a processing unit 1812. The selection models created by the selection model creation unit 2 of the server 1800 and the sufficient statistics created by the sufficient statistic creation unit 1 are downloaded to the HDD 1811 via a communication network. The adaptive model creation unit 5 creates an adaptive model using the voice data from the voice remote controller 1820 and the selection models and sufficient statistics stored in the HDD 1811. The speech recognition system 300 recognizes the voice data from the voice remote controller 1820 using the adaptive model created by the adaptive model creation unit 5.
The processing unit 1812 performs various types of processing according to the result of recognition by the speech recognition system 300. In the system configured as described above, the following processing is performed.
[0112]
[Step ST1]
The user utters toward the microphone 1821 of the voice remote controller 1820. The voice uttered by the user is converted into predetermined voice data and transmitted to the digital TV system 1810.
[0113]
[Step ST2]
The adaptive model creation unit 5 inputs the voice data from the voice remote controller 1820 to the selection models in the HDD 1811 and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest calculated likelihoods, and then selects the group to which the largest number of those N selection models belong.
[0114]
[Step ST3]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0115]
・ Group selection method 2 (see FIGS. 19 and 20)
The configuration of the display system according to this example is shown in FIG. 19A. This system includes a server 1900, a digital TV system 1910, and a voice remote controller 1920. The server 1900 includes a group creation unit 6, a selection model creation unit 2, a sufficient statistic creation unit 1, a selection model accumulation unit 4, and a sufficient statistic accumulation unit 3. As shown in FIG. 19B, the group creation unit 6 groups the noise-superimposed voice data 84 by type of noise (sound of vacuum cleaner A, sound of vacuum cleaner B, etc.) × S/N ratio (10 dB, 20 dB, etc.). For each group created by the group creation unit 6, the sufficient statistic creation unit 1 creates a sufficient statistic for each combination of speaker (speaker A, speaker B, etc.) × tone of voice (nasal voice, normal voice, fast speech, etc.). The selection model creation unit 2 creates a corresponding selection model for each of the sufficient statistics created by the sufficient statistic creation unit 1. The voice remote controller 1920 includes a microphone 1821 and a memory 1922. The memory 1922 stores an ID indicating the type of noise (noise ID) and an ID indicating a group (group ID) in association with each other. The digital TV system 1910 includes an adaptive model creation unit 5, a speech recognition system 300 (see FIG. 3), and a processing unit 1812. The adaptive model creation unit 5 creates an adaptive model using the voice data from the voice remote controller 1920, the selection models stored in the selection model accumulation unit 4 of the server 1900, and the sufficient statistics stored in the sufficient statistic accumulation unit 3. In the system configured as described above, the following processing is performed.
[0116]
[Step ST1-a]
The digital TV system 1910 prompts the user to select the type of noise in the usage environment by operating a button on the remote controller 1920. For example, options such as “1. washing machine, 2. vacuum cleaner, 3. air conditioner, ...” are displayed on the screen. The user selects the type of noise in the usage environment by operating a button. Here, it is assumed that the user performs a remote control operation in an environment where a vacuum cleaner is in use. The user selects “2. Vacuum cleaner” as the noise type by operating a button.
[0117]
[Step ST2-a]
The user utters toward the microphone 1821 of the audio remote controller 1920. The voice uttered by the user is converted into predetermined voice data and transmitted to the digital TV system 1910.
[0118]
[Step ST3-a]
The adaptive model creation unit 5 inputs the voice data from the voice remote controller 1920 to the selection models in the selection model accumulation unit 4 of the server 1900 and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest calculated likelihoods, and then selects the group to which the largest number of those N selection models belong.
[0119]
[Step ST4-a]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0120]
[Step ST5-a]
The adaptive model creation unit 5 transmits, to the voice remote controller 1920, an ID (group ID) indicating the group selected in step ST3-a and IDs (group IDs) indicating the groups having the same noise type as that group. These group IDs are stored in the memory 1922 in association with the ID (noise ID) indicating the type of noise selected in step ST1-a. Here, it is assumed that group 1 (see FIG. 19B) is selected in step ST3-a. The noise type of group 1 is “sound of vacuum cleaner A”. The groups whose noise type is “sound of vacuum cleaner A” are group 1 and group 2 (see FIG. 19B). As shown in FIG. 20, the adaptive model creation unit 5 transmits the group IDs of the groups (group 1, group 2) whose noise type is “sound of vacuum cleaner A” to the voice remote controller 1920. These group IDs are stored in the memory 1922 in association with the noise ID indicating the noise type “2. Vacuum cleaner” selected in step ST1-a (see FIG. 20).
[0121]
[Step ST1-b]
Again, the user performs a remote control operation in an environment where the vacuum cleaner is in use. The user selects “2. Vacuum cleaner” as the noise type by operating a button. The voice remote controller 1920 transmits the group IDs (the group IDs of group 1 and group 2) stored in the memory 1922 in association with the selected noise type “2. Vacuum cleaner” to the digital TV system 1910 (see FIG. 20).
[0122]
[Step ST2-b]
The user utters toward the microphone 1821 of the audio remote controller 1920. The voice uttered by the user is converted into predetermined voice data and transmitted to the digital TV system 1910.
[0123]
[Step ST3-b]
Among the selection models in the selection model accumulation unit 4 of the server 1900, the adaptive model creation unit 5 inputs the voice data from the voice remote controller 1920 to the selection models of the groups (group 1, group 2) indicated by the group IDs received from the voice remote controller 1920, and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest calculated likelihoods, and then selects the group to which the largest number of those N selection models belong.
[0124]
[Step ST4-b]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0125]
The process returns to (ST1-b) for each adaptation. If necessary, the process returns to (ST1-a), for example when the vacuum cleaner is replaced with another type of vacuum cleaner or when voice recognition is used in a noise environment other than that of the vacuum cleaner.
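The two-phase flow of steps ST1-a to ST5-a and ST1-b to ST4-b can be sketched as a small cache keyed by noise ID. The class `GroupIdMemory` stands in for the memory 1922, and all names and the dictionary layout are hypothetical. The sketch filters likelihoods that have already been computed, whereas in the described system only the selection models of the stored groups would need to be scored on later passes.

```python
from collections import Counter

def pick_group(scores, n, allowed=None):
    """scores: {(group_id, speaker): likelihood}. Vote among the top-n models,
    optionally restricted to the group IDs in `allowed`."""
    items = [(k, v) for k, v in scores.items() if allowed is None or k[0] in allowed]
    top = sorted(items, key=lambda kv: kv[1], reverse=True)[:n]
    return Counter(group for (group, _spk), _v in top).most_common(1)[0][0]

class GroupIdMemory:
    """Stand-in for the memory 1922: noise ID -> set of group IDs."""

    def __init__(self, noise_of_group):
        self.noise_of_group = noise_of_group  # {group_id: noise type of that group}
        self.table = {}

    def adapt(self, noise_id, scores, n):
        allowed = self.table.get(noise_id)       # None on the first pass (ST1-a)
        chosen = pick_group(scores, n, allowed)  # ST3-a or ST3-b
        if allowed is None:
            # ST5-a: remember every group that shares the chosen group's noise type.
            noise = self.noise_of_group[chosen]
            self.table[noise_id] = {g for g, t in self.noise_of_group.items()
                                    if t == noise}
        return chosen
```

Restricting later passes to the stored group IDs reduces the number of selection models to evaluate; returning to (ST1-a) corresponds to deleting the entry for that noise ID.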
[0126]
<Devices used by multiple speakers under multiple noises (example: PDA operation)>
・ Group selection method 1
From the sufficient statistics stored on a server connected via a communication network, the type of noise is first selected automatically based on GPS position information. Then, based on the user's voice with the noise superimposed, sufficient statistics are selected using the selection models (GMMs), and adaptation is performed. Specifically, the following processing is performed.
[0127]
The type of noise is automatically selected using the GPS position information (ST1). (Example: train noise on a station platform, construction noise at a construction site, etc.)
[0128]
The user's voice is input (ST2).
[0129]
Among the groups of the selected noise type, the N selection models with the highest likelihood when the user's voice is input are selected, and the S/N ratio group to which the largest number of them belong is selected (ST3).
[0130]
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST4).
[0131]
・ Group selection method 2
From the sufficient statistics stored on a server connected via a communication network, the type of noise is first selected automatically based on the schedule book and time information in the PDA. Then, based on the user's voice with the noise superimposed, sufficient statistics are selected using the selection models (GMMs), and adaptation is performed. Specifically, the following processing is performed.
[0132]
The type of noise is automatically selected using the schedule book and time information (ST1). (Example: if the schedule shows travel by train from 10:00 and the current time is 10:55, train noise is selected.)
[0133]
The user's voice is input (ST2).
[0134]
Among the groups of the selected noise type, the N selection models with the highest likelihood when the user's voice is input are selected, and the S/N ratio group to which the largest number of them belong is selected (ST3).
[0135]
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and used for adaptation (ST4).
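Step ST1 of this method, choosing the noise type from the schedule book and the current time, can be sketched as an interval lookup. The schedule representation, the end times, and the fallback noise type are assumptions; the specification gives only the example of a train journey starting at 10:00.

```python
from datetime import time

def noise_from_schedule(schedule, now, default="room noise"):
    """schedule: list of (start, end, noise_type) with datetime.time bounds.
    Returns the noise type of the first entry whose interval contains `now`,
    or `default` when no entry applies."""
    for start, end, noise_type in schedule:
        if start <= now < end:
            return noise_type
    return default
```

The returned noise type then limits the groups considered in ST3 to those of that noise type.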
[0136]
<Devices used under specific noise (eg car navigation)>
・ Group selection method (see Figs. 21 and 22)
The configuration of the information search system according to this example is shown in FIG. 21A. This system includes a server 2100 and a car navigation system 2110. The server 2100 includes a group creation unit 6, a selection model creation unit 2, a sufficient statistic creation unit 1, a selection model accumulation unit 4, a sufficient statistic accumulation unit 3, an adaptive model creation unit 5, and a memory 2101. As shown in FIG. 21B, the group creation unit 6 groups the noise-superimposed voice data 84 by noise type (carole sound, mark III sound, etc.) × S/N ratio (10 dB, 20 dB, etc.). In the memory 2101, a device ID (for example, a product number) for identifying a car navigation system and an ID (group ID) indicating a group are stored in association with each other. The car navigation system 2110 includes a microphone 2111, a data communication module 2112, a speech recognition system 300 (see FIG. 3), and a processing unit 2113. In the system configured as described above, the following processing is performed.
[0137]
[Step ST1-a]
A user speaks into the microphone 2111 of the car navigation system 2110. The voice uttered by the user is converted into predetermined voice data and transmitted to the server 2100 by the data communication module 2112. The data communication module 2112 transmits data (device ID) indicating the product number “100” of the car navigation system 2110 to the server 2100.
[0138]
[Step ST2-a]
The adaptive model creation unit 5 inputs the voice data from the car navigation system 2110 to the selection models in the selection model accumulation unit 4 and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest calculated likelihoods, and then selects the group to which the largest number of those N selection models belong.
[0139]
[Step ST3-a]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0140]
[Step ST4-a]
The adaptive model creation unit 5 stores, in the memory 2101, an ID (group ID) indicating the group selected in step ST2-a and IDs (group IDs) indicating the groups having the same noise type as that group, in association with the product number “100” received from the car navigation system 2110. Here, it is assumed that group 1 (see FIG. 21B) is selected in step ST2-a. The noise type of group 1 is “carole sound”. The groups whose noise type is “carole sound” are group 1 and group 2 (see FIG. 21B). As shown in FIG. 22, the adaptive model creation unit 5 stores the group IDs of the groups (group 1, group 2) whose noise type is “carole sound” in the memory 2101 in association with the product number “100”.
[0141]
[Step ST1-b]
The user speaks again toward the microphone 2111 of the car navigation system 2110. The voice uttered by the user is converted into predetermined voice data and transmitted to the server 2100 by the data communication module 2112. The data communication module 2112 transmits data (device ID) indicating the product number “100” of the car navigation system 2110 to the server 2100.
[0142]
[Step ST2-b]
Among the selection models in the selection model accumulation unit 4, the adaptive model creation unit 5 inputs the voice data from the car navigation system 2110 to the selection models of the groups (group 1, group 2) indicated by the group IDs stored in the memory 2101 in association with the product number “100” received from the car navigation system 2110, and calculates the likelihoods (see FIG. 22). The adaptive model creation unit 5 selects the N selection models with the highest calculated likelihoods, and then selects the group to which the largest number of those N selection models belong.
[0143]
[Step ST3-b]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0144]
Return to (ST1-b) for each adaptation process. If necessary, the process returns to (ST1-a) (for example, when the car navigation system 2110 is attached to another type of car (for example, mark III)).
[0145]
[Group creation method 2]
A group is created for each type of noise x SN ratio x close speakers, and sufficient statistics for each voice tone variation (nasal voice, fast voice, stuttering voice, etc.) are accumulated in the group.
[0146]
<Devices used by multiple speakers under multiple noises (eg, TV operation)>
・ Group selection method (see Figs. 23 and 24)
The configuration of the system according to this example is shown in FIG. 23A. This system includes a server 2300, a digital TV system 2310, and a voice remote controller 2320. The server 2300 includes a group creation unit 6, a selection model creation unit 2, a sufficient statistic creation unit 1, a selection model storage unit 4, a sufficient statistic storage unit 3, an adaptive model creation unit 5, and a memory 2301. As shown in FIG. 23B, the group creation unit 6 groups the noise-superimposed voice data 84 by noise type (sound of a vacuum cleaner, sound of an air conditioner, etc.) × SNR (10 dB, 20 dB, etc.) × set of close speakers. The memory 2301 stores an ID (user ID) for identifying a user and an ID (group ID) indicating a group in association with each other. The digital TV system 2310 includes a data communication module 2312, a voice recognition system 300 (see FIG. 3), and a processing unit 1812. The voice remote controller 2320 includes a microphone 1821. In the system configured as described above, the following processing is performed.
[0147]
[Step ST1-a]
The user speaks into the microphone 1821 of the voice remote controller 2320. The voice uttered by the user is converted into predetermined voice data and transmitted to the digital TV system 2310. In addition, the user enters information (a user ID) that identifies the user, such as a name or a personal identification number, by operating a button on the remote controller 2320. The entered user ID (here, “100”) is transmitted to the digital TV system 2310. The voice data from the voice remote controller 2320 and the user ID “100” are transmitted to the server 2300 by the data communication module 2312.
[0148]
[Step ST2-a]
The adaptive model creation unit 5 inputs the voice data from the digital TV system 2310 to the selection models in the selection model storage unit 4 and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest likelihoods. It then selects the group to which the largest number of those N selection models belong.
[0149]
[Step ST3-a]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
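Steps ST2-a and ST3-a amount to a majority vote over the N best-scoring selection models followed by a top-M pick inside the winning group. The sketch below illustrates that reading under stated assumptions: the function names and data layout are hypothetical, and the tie-breaking rule (prefer the group that holds the best-scoring model) is an assumption, since the specification does not say how ties are resolved.

```python
from collections import Counter


def select_group(likelihoods, model_group, n):
    """Return the group owning most of the n highest-likelihood models.

    likelihoods: dict model_id -> likelihood of the user's utterance
    model_group: dict model_id -> group_id
    """
    top_n = sorted(likelihoods, key=likelihoods.get, reverse=True)[:n]
    votes = Counter(model_group[m] for m in top_n)
    best = max(votes.values())
    # Assumed tie-break: walk the models best-first and take the first
    # group that reached the maximum vote count.
    for m in top_n:
        if votes[model_group[m]] == best:
            return model_group[m]


def select_statistics(likelihoods, model_group, group, m):
    """Within the chosen group, keep the m models with the highest likelihood."""
    members = [k for k in likelihoods if model_group[k] == group]
    return sorted(members, key=likelihoods.get, reverse=True)[:m]


likelihoods = {"a": -10.0, "b": -12.0, "c": -11.0, "d": -30.0, "e": -13.0}
model_group = {"a": 1, "b": 2, "c": 2, "d": 1, "e": 2}
group = select_group(likelihoods, model_group, n=3)
chosen = select_statistics(likelihoods, model_group, group, m=2)
```

In this toy data the single best model belongs to group 1, but two of the three best belong to group 2, so group 2 is selected and its two best models supply the sufficient statistics.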
[0150]
[Step ST4-a]
The adaptive model creation unit 5 stores, in the memory 2301, the ID (group ID) of the group selected in step ST2-a together with the IDs (group IDs) of the groups having the same close speakers as that group, in association with the user ID “100” received from the digital TV system 2310. Here, it is assumed that group 2 (see FIG. 23B) was selected in step ST2-a. The close speakers of group 2 are “speakers C and D”. The groups whose close speakers are “speakers C and D” are group 2, group (K−1), and group K (see FIG. 23B). As shown in FIG. 24, the adaptive model creation unit 5 therefore stores the group IDs of the groups (group 2, group (K−1), group K) whose close speakers are “speakers C and D” in the memory 2301 in association with the user ID “100”.
[0151]
[Step ST1-b]
The user speaks again toward the microphone 1821 of the voice remote controller 2320. The voice uttered by the user is converted into predetermined voice data and transmitted to the digital TV system 2310. Further, the user inputs the user ID “100” by operating a button on the remote controller 2320. The input user ID “100” is transmitted to the digital TV system 2310. The voice data from the voice remote controller 2320 and the user ID “100” are transmitted to the server 2300 by the data communication module 2312.
[0152]
[Step ST2-b]
Among the selection models in the selection model storage unit 4, the adaptive model creation unit 5 inputs the voice data from the digital TV system 2310 to the selection models of the groups (group 2, group (K−1), and group K) indicated by the group IDs stored in the memory 2301 in association with the user ID “100” received from the digital TV system 2310, and calculates the likelihoods (see FIG. 24). The adaptive model creation unit 5 selects the N selection models with the highest likelihoods. It then selects the group to which the largest number of those N selection models belong.
[0153]
[Step ST3-b]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0154]
Return to (ST1-b) for each adaptation process. If necessary, the process returns to (ST1-a) (for example, when a user is replaced).
[0155]
<Devices used by specific speakers (example: operation of mobile phones)>
・ Group selection method (See Figs. 25 and 26)
The configuration of the system according to this example is shown in FIG. 25A. This system includes a server 2500 and a mobile phone 2510. The server 2500 includes a group creation unit 6, a selection model creation unit 2, a sufficient statistic creation unit 1, a selection model storage unit 4, a sufficient statistic storage unit 3, an adaptive model creation unit 5, a memory 2501, and a voice recognition system 300. As shown in FIG. 25B, the group creation unit 6 groups the noise-superimposed voice data 84 by noise type (train sound, bus sound, etc.) × SNR (10 dB, 20 dB, etc.) × set of close speakers. The memory 2501 stores a device ID (for example, a product number) for identifying a mobile phone and an ID (group ID) indicating a group in association with each other. The recognition result of the voice recognition system 300 is transmitted to the mobile phone 2510 via the communication network. The mobile phone 2510 includes a microphone 2511, a data communication module 2512, and a processing unit 2513. In the system configured as described above, the following processing is performed.
[0156]
[Step ST1-a]
A user speaks into the microphone 2511 of the mobile phone 2510. The voice uttered by the user is converted into predetermined voice data and transmitted to the server 2500 by the data communication module 2512. The data communication module 2512 transmits data (device ID) indicating the product number “200” of the mobile phone 2510 to the server 2500.
[0157]
[Step ST2-a]
The adaptive model creation unit 5 inputs the voice data from the mobile phone 2510 to the selection models in the selection model storage unit 4 and calculates the likelihoods. The adaptive model creation unit 5 selects the N selection models with the highest likelihoods. It then selects the group to which the largest number of those N selection models belong.
[0158]
[Step ST3-a]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0159]
[Step ST4-a]
The adaptive model creation unit 5 stores, in the memory 2501, the ID (group ID) of the group selected in step ST2-a together with the IDs (group IDs) of the groups having the same close speakers as that group, in association with the product number “200” received from the mobile phone 2510. Here, it is assumed that group 2 (see FIG. 25B) was selected in step ST2-a. The close speakers of group 2 are “speakers C and D”. The groups whose close speakers are “speakers C and D” are group 2, group (K−1), and group K (see FIG. 25B). As shown in FIG. 26, the adaptive model creation unit 5 therefore stores the group IDs of the groups (group 2, group (K−1), group K) whose close speakers are “speakers C and D” in the memory 2501 in association with the product number “200”.
[0160]
[Step ST1-b]
Again, the user speaks into the microphone 2511 of the mobile phone 2510. The voice uttered by the user is converted into predetermined voice data and transmitted to the server 2500 by the data communication module 2512. The data communication module 2512 transmits data (device ID) indicating the product number “200” of the mobile phone 2510 to the server 2500.
[0161]
[Step ST2-b]
Among the selection models in the selection model storage unit 4, the adaptive model creation unit 5 inputs the voice data from the mobile phone 2510 to the selection models of the groups (group 2, group (K−1), and group K) indicated by the group IDs stored in the memory 2501 in association with the product number “200” received from the mobile phone 2510, and calculates the likelihoods (see FIG. 26). The adaptive model creation unit 5 selects the N selection models with the highest likelihoods. It then selects the group to which the largest number of those N selection models belong.
[0162]
[Step ST3-b]
The adaptive model creation unit 5 selects M sufficient statistics having a large likelihood in the selected group. The adaptive model creation unit 5 creates an adaptive model using the selected M sufficient statistics.
[0163]
Return to (ST1-b) for each adaptation process. If necessary, the process returns to (ST1-a) (for example, when a user is replaced).
[0164]
[Group creation method 3]
A group is created for each close speaker, and sufficient statistics for each noise type x SN ratio are accumulated in the group.
[0165]
<Devices used by multiple speakers under multiple noises (eg, TV operation)>
・ Group selection method (see Figs. 27 and 28)
Using sufficient statistics stored in a set-top box in the home or in a server outside the home connected via a communication network, sufficient statistics are selected with the selection model (GMM) from the user's noise-superimposed voice, and adaptation is performed. At this time, the selected group is associated with the user's speaker ID (name, password, etc.). At the next adaptation, the speaker ID is entered to select the group, and adaptation is performed. Specifically, the following processing is performed.
[0166]
The user's voice is input (ST1-a).
[0167]
The user's voice is input to the selection models, the N selection models with the highest likelihoods are selected, and the speaker group to which the largest number of them belong is selected (ST2-a).
[0168]
Among the selected groups, M sufficient statistics with large likelihoods (from various noise types and SN ratios) are selected and adapted (ST3-a).
[0169]
The selected group is associated with the speaker ID, and the association is stored (ST4-a).
[0170]
Enter a speaker ID and select a group (ST1-b).
[0171]
The user's voice is input (ST2-b).
[0172]
Among the selected group (speaker group close to the user), M sufficient statistics with a high likelihood are selected and adapted (ST3-b).
[0173]
Return to (ST1-b) for each adaptation process. If necessary, return to (ST1-a).
[0174]
<Devices used by specific speakers (example: operation of mobile phones)>
・ Group selection method
Using sufficient statistics stored in a server outside the home connected via a communication network, sufficient statistics are selected with the selection model (GMM) from the user's noise-superimposed voice, and adaptation is performed. At this time, the selected group is associated with the ID of the device in use. At the next adaptation, the group is selected automatically based on the device ID. Specifically, the following processing is performed.
[0175]
The user's voice is input (ST1-a).
[0176]
The user's voice is input to the selection models, the N selection models with the highest likelihoods are selected, and the speaker group to which the largest number of them belong is selected (ST2-a).
[0177]
Among the selected groups, M sufficient statistics having a high likelihood are selected and adapted (ST3-a).
[0178]
Associate the selected group with the device ID (accumulate the correspondence) (ST4-a).
[0179]
The user's voice is input (ST1-b).
[0180]
A group is automatically selected based on the device ID (ST2-b).
[0181]
Within the selected group, the M sufficient statistics with the highest likelihoods are selected and adaptation is performed (ST3-b).
[0182]
Return to (ST1-b) for each adaptation process. If necessary, the process returns to (ST1-a) (for example, when a user is replaced).
[0183]
[Group creation method 4]
For a specific noise type, a group is created for each S / N ratio, and sufficient statistics for each speaker are accumulated in the group.
[0184]
<Equipment used under specific noise (eg, elevator operation)>
・ Group selection method
Based on the sufficient statistics stored in the server installed in the elevator, the user's voice with added noise is used to select and adapt sufficient statistics using the selection model (GMM). Specifically, the following processing is performed.
[0185]
The user's voice is input (ST1).
[0186]
The user's voice is input to the selection models, the N selection models with the highest likelihoods are selected, and the SN-ratio group to which the largest number of them belong is selected (ST2).
[0187]
Among the selected groups, M sufficient statistics having a high likelihood are selected and adapted (ST3).
[0188]
[Group creation method 5]
For a specific speaker, a group is created for each SN ratio, and sufficient statistics for each voice tone variation of that speaker (nasal voice, fast speech, stuttering voice, etc.) are accumulated in the group.
[0189]
<Specific speakers / devices used under noise (eg car navigation)>
・ Group selection method
Based on sufficient statistics stored in the server (car navigation system) installed in the car, the user's voice with added noise is used to select and adapt sufficient statistics using the selection model (GMM). Specifically, the following processing is performed.
[0190]
The user's voice is input (ST1).
[0191]
The user's voice is input to the selection models, the N selection models with the highest likelihoods are selected, and the SN-ratio group to which the largest number of them belong is selected (ST2).
[0192]
Among the selected groups, M sufficient statistics having a high likelihood are selected and adapted (ST3).
[0193]
Note that a group selection model may be created for each group and used to select the group. For example, when groups are created for each noise type, a noise selection model serves as the group selection model; if it is built as a GMM, the noise is input to the noise selection models and the group with the highest likelihood is selected.
[0194]
(Second Embodiment)
<Configuration of adaptive model creation device>
FIG. 29 is a block diagram showing the overall configuration of an adaptive model creation device for speech recognition according to the second embodiment. The device shown in FIG. 29 includes a selection model creation unit 21, a selection model storage unit 41, a sufficient statistic creation unit 11, and an adaptive model creation unit 51. The selection model creation unit 21 creates a selection model 75 for selecting voice data close to the user's voice data. The selection model storage unit 41 stores the selection model 75 created by the selection model creation unit 21. The sufficient statistic creation unit 11 uses the selection model 75 stored in the selection model storage unit 41 to select, from the voice data 83, voice data close to the user's voice data, and creates sufficient statistics 72 using voice data obtained by superimposing noise on the selected voice data. The adaptive model creation unit 51 creates an adaptive model 74 using the sufficient statistics 72 created by the sufficient statistic creation unit 11.
[0195]
<Adaptive model creation process>
Next, a process for creating an adaptive model for speech recognition by the apparatus configured as described above will be described.
[0196]
[Create selection model 75]
First, a method for creating the selection model 75 will be described. Here, a case will be described in which the selection model 75 is created offline before the user requests acquisition of the adaptive model.
[0197]
Voice data 83 of multiple speakers is recorded in a quiet environment. Here, voice data of about 300 speakers is recorded.
[0198]
Using the voice data 83, the selection model creation unit 21 creates a selection model 75 for each speaker as a 1-state, 64-mixture Gaussian Mixture Model (GMM), without distinguishing phonemes.
[0199]
As an example, as shown in FIG. 30, the selection model 75 is created using only the high-power frames of the voice data 83. With this method, a voice data selection model that is robust to noise can be created.
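The specification does not state how "high-power" frames are chosen. As a minimal sketch, assuming frames are ranked by mean squared amplitude and a fixed fraction is kept, the selection could look like this (the function names, `frame_len`, and `keep_ratio` are hypothetical):

```python
def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def high_power_frames(samples, frame_len, keep_ratio):
    """Split samples into non-overlapping frames and keep the strongest ones.

    keep_ratio is the assumed fraction of frames to retain; frames tied with
    the weakest retained frame are also kept. Time order is preserved.
    """
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        return []
    powers = [frame_power(f) for f in frames]
    n_keep = max(1, int(len(frames) * keep_ratio))
    threshold = sorted(powers, reverse=True)[n_keep - 1]
    return [f for f, p in zip(frames, powers) if p >= threshold]


samples = [0.0, 0.1, 1.0, -1.0, 0.5, 0.5, 0.0, 0.0]
kept = high_power_frames(samples, frame_len=2, keep_ratio=0.5)
```

Low-power frames are the ones most easily dominated by background noise, which is one plausible reason why discarding them makes the selection model less sensitive to the noise level.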
[0200]
The selection model storage unit 41 stores the selection model 75 created by the selection model creation unit 21. An example of the selection model 75 stored in the selection model storage unit 41 is shown in FIG.
[0201]
[Create sufficient statistics 72]
Next, a method for creating sufficient statistics 72 will be described.
[0202]
The user requests creation of the adaptive model 74.
[0203]
The user inputs noise data 85 of the environment in which the voice recognition is used to the sufficient statistic generation unit 11 by using a voice recognition microphone or the like.
[0204]
In addition, using the voice recognition microphone or the like, the user inputs the user's voice data 81, uttered in the environment in which voice recognition is used, to the sufficient statistic creation unit 11. The voice data 81 has the noise of that environment superimposed on it.
[0205]
Next, the sufficient statistic creation unit 11 inputs the voice data 81 to the selection models 75 stored in the selection model storage unit 41 and calculates the likelihoods. Here, the high-power frames of the voice data 81 are input to the selection models 75 shown in FIG. 30, and the likelihoods are calculated. Then, the top L speakers (for example, the top 20) with the highest likelihoods are selected as speakers close to the user's voice data.
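The speaker ranking described above can be sketched as follows, assuming each per-speaker selection model is a diagonal-covariance GMM and the score is the summed frame log-likelihood. The data layout and function names are illustrative assumptions, and the toy models use one mixture component rather than 64.

```python
import math


def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    gmm is a list of (weight, means, variances) tuples, one per mixture.
    """
    total = 0.0
    for x in frames:
        comps = []
        for w, mu, var in gmm:
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            comps.append(ll)
        peak = max(comps)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def closest_speakers(frames, speaker_gmms, top_l):
    """Return the top_l speaker IDs ranked by likelihood, best first."""
    scores = {s: gmm_log_likelihood(frames, g) for s, g in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_l]


speaker_gmms = {
    "spk1": [(1.0, [0.0], [1.0])],
    "spk2": [(1.0, [5.0], [1.0])],
    "spk3": [(1.0, [1.0], [1.0])],
}
frames = [[0.2], [0.1]]
nearest = closest_speakers(frames, speaker_gmms, top_l=2)
```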
[0206]
The sufficient statistic creation unit 11 superimposes the noise data 85 on the speech data of a speaker close to the user from the speech data 83 in a quiet environment, and creates the noise superimposed speech data 86. At this time, the SN ratio is calculated from the voice data 81 and the noise data 85, and the noise superimposed voice data 86 is created with the calculated SN ratio. An example of a method for creating the noise superimposed audio data 86 is shown in FIG.
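One way to realize this step is sketched below. It assumes that speech and noise are uncorrelated, so that the power of the noisy utterance is the sum of the speech power and the noise power; the specification does not give the SN ratio formula, so this estimate and all function names are assumptions.

```python
import math


def power(x):
    """Mean squared amplitude of a sample sequence."""
    return sum(s * s for s in x) / len(x)


def estimate_snr_db(noisy_speech, noise):
    """Estimate the SN ratio, assuming noisy power = speech power + noise power."""
    p_noise = power(noise)
    p_speech = max(power(noisy_speech) - p_noise, 1e-12)
    return 10.0 * math.log10(p_speech / p_noise)


def superimpose(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mixture has the given SN ratio."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    # Repeat the noise cyclically if it is shorter than the speech.
    return [c + gain * noise[i % len(noise)] for i, c in enumerate(clean)]


clean = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = superimpose(clean, noise, 20.0)
residual = [m - c for m, c in zip(mixed, clean)]
estimated = estimate_snr_db(mixed, residual)
```

In this toy example the noise is orthogonal to the speech, so the SN ratio estimated from the mixture matches the 20 dB target.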
[0207]
The sufficient statistic creating unit 11 creates the sufficient statistic 72 using the noise superimposed voice data 86. An example of the sufficient statistics 72 created by the sufficient statistics creation unit 11 is shown in FIG.
[0208]
[Create Adaptive Model 74]
Next, the creation process of the adaptive model 74 in the adaptive model creation unit 51 will be described.
[0209]
The adaptive model creation unit 51 creates an adaptive model 74 using the sufficient statistics 72 created by the sufficient statistic creation unit 11. Specifically, the adaptive model 74 is created by the following statistical calculation (Equation 7 to Equation 9). The mean and variance of the normal distributions in each state of the HMM of the adaptive model 74 are denoted μ_i^adp (i = 1, 2, ..., N_mix) and V_i^adp (i = 1, 2, ..., N_mix), where N_mix is the number of mixture components. The state transition probability is denoted a^adp[i][j] (i, j = 1, 2, ..., N_state), where N_state is the number of states and a^adp[i][j] represents the transition probability from state i to state j.
[0210]
[Equation 7]

[0211]
[Equation 8]

[0212]
[Equation 9]

[0213]
Here, N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) and V_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) are the mean and variance of each HMM. C_mix^j (j = 1, 2, ..., N_sel) and C_state^k[i][j] (k = 1, 2, ..., N_sel; i, j = 1, 2, ..., N_state) are the EM count (frequency) for the normal distribution and the EM count for the state transition, respectively.
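The equation images for Equations 7 to 9 are not reproduced in this text. The following LaTeX is a reconstruction that is consistent with the symbol definitions given here, assuming the usual count-weighted combination of sufficient statistics; it should be checked against the published drawings before being relied upon.

```latex
% Equation 7 (assumed form): count-weighted mean
\mu_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}
                   {\sum_{j=1}^{N_{sel}} C_{mix}^{j}}
\qquad (i = 1, 2, \ldots, N_{mix})

% Equation 8 (assumed form): count-weighted second moment minus squared mean
V_i^{adp} = \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(V_i^{j} + (\mu_i^{j})^{2}\right)}
                 {\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - \left(\mu_i^{adp}\right)^{2}
\qquad (i = 1, 2, \ldots, N_{mix})

% Equation 9 (assumed form): pooled transition counts, normalised per source state
a^{adp}[i][j] = \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}
                     {\sum_{j=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}
\qquad (i, j = 1, 2, \ldots, N_{state})
```

Under this reading, each selected model contributes in proportion to its EM counts, so no pass over the original voice data is needed when the adaptive model is built.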
[0214]
The adaptive model creation unit 51 then waits for the user's next request to create an adaptive model.
[0215]
<Effect>
As described above, in the second embodiment, the sufficient statistics 72 are created using the voice data 86 on which the noise data 85 of the usage environment is superimposed, and the adaptive model 74 is created from them, so that an adaptive model 74 adapted to the usage environment can be created. Therefore, the adaptive model can be used in various noise environments.
[0216]
In addition, since the sufficient statistics 72 are created using the voice data 86, in which noise is superimposed on the voice data of speakers acoustically close to the user, the sufficient statistics 72 can be created instantly and the adaptive model 74 can be created from them. Therefore, the adaptive model can be used promptly even when the usage environment changes in various ways.
[0217]
Note that the noise data 85 may be input to the sufficient statistics creating unit 11 offline before the user requests acquisition of the adaptive model, and the sufficient statistics 72 may be created offline.
[0218]
The timing at which the noise data 85 is input to the sufficient statistic generation unit 11 may be automatically determined by the sufficient statistic generation unit 11.
[0219]
The adaptive model creation unit 51 may automatically determine the timing for creating the adaptive model 74.
[0220]
The selection model 75 is not limited to the Gaussian Mixture Model.
[0221]
A label corresponding to each state of the HMM may be stored in a database, and the sufficient statistics 72 of the noise superimposed speech data 86 may be created using the stored label information.
[0222]
<Specific product image>
FIG. 33 shows an example in which the adaptive model creation device according to the second embodiment is applied to an actual product. This system consists of a portable terminal (PDA) for voice input and a server that creates the adaptive model and performs recognition. The user calls the service center (server) and gives instructions by voice in accordance with voice guidance from the center. The service center (server) receives the user's voice and noise and creates an adaptive model by the method described above. The user's voice is recognized using the created adaptive model, and guidance (the recognition result) is sent to the PDA.
[0223]
(Third embodiment)
<Configuration of adaptive model creation device for speech recognition>
FIG. 34 is a block diagram showing the overall configuration of the adaptive model creation device according to the third embodiment. The device shown in FIG. 34 includes a selection model creation unit 1507, a selection model storage unit 1508, a sufficient statistic creation unit 1506, an adaptive model creation unit 51, a label information creation unit 1501, a label information storage unit 1502, and a memory 1512. The selection model creation unit 1507 creates a selection model 1510 for selecting voice data close to the user's voice data. The selection model storage unit 1508 stores the selection model 1510 created by the selection model creation unit 1507. The label information creation unit 1501 creates label information 1504 using noise-superimposed voice data 1505, in which predicted noise data 1503 (noise predicted for the usage environment) is superimposed on the voice data 83 recorded in a quiet environment at a predicted SN ratio. The label information storage unit 1502 stores the label information 1504 created by the label information creation unit 1501. The sufficient statistic creation unit 1506 uses the selection model 1510 stored in the selection model storage unit 1508 and the user's voice data 1513 in a quiet environment stored in the memory 1512 to select, from the voice data 83, voice data close to the user's voice data. It then creates sufficient statistics 1509 using the voice data obtained by superimposing the noise data 85 on the selected voice data and the label information 1504 stored in the label information storage unit 1502. The adaptive model creation unit 51 creates an adaptive model 1511 using the sufficient statistics 1509 created by the sufficient statistic creation unit 1506.
[0224]
<Operation of adaptive model creation device>
Next, the operation of the adaptive model creation device configured as described above will be described.
[0225]
[Create selection model 1510]
First, a method for creating the selection model 1510 will be described. Here, a case where the selection model 1510 is created offline before the user requests acquisition of the adaptation model will be described.
[0226]
Voice data 83 of multiple speakers is recorded in a quiet environment. Here, voice data of about 300 speakers is recorded.
[0227]
As shown in FIG. 35, using the voice data 83, the selection model creation unit 1507 creates a selection model 1510 for each speaker as a 1-state, 64-mixture Gaussian Mixture Model (GMM), without distinguishing phonemes.
[0228]
The selection model storage unit 1508 stores the selection model 1510 created by the selection model creation unit 1507.
[0229]
[Creation of Label Information 1504 and Information 1514 on State Transition of Phoneme Model]
A method of creating the label information 1504 and the information 1514 on the state transition of the phoneme model will be described. Here, a case will be described in which the label information 1504 and the information 1514 on the state transition of the phoneme model are created offline before the user requests acquisition of the adaptive model. As an example, the case where voice recognition is used in a vehicle will be described with reference to FIGS. 36, 37, and 38. Here, speech recognition in a car navigation system is considered.
[0230]
As shown in FIG. 36, noise data predicted for the usage environment (general in-vehicle noise data 1601) is superimposed on the voice data 83 recorded in a quiet environment to create noise-superimposed voice data 1602 with in-vehicle noise at 10 dB. Here, the in-vehicle noise data 1601 of vehicle type A is data recorded in advance while vehicle type A was driven in the city. Next, sufficient statistics 1603 for in-vehicle noise at 10 dB are calculated by the EM algorithm using the created voice data 1602. Here, speaker-independent sufficient statistics are created for each phoneme using HMMs. The information 1514 on the state transition of the phoneme model is the state transition probability of the HMM for each phoneme. Next, as shown in FIG. 37, the noise-superimposed voice data 1602 with in-vehicle noise at 10 dB is input, one item of voice data at a time (one utterance of one speaker), to the sufficient statistics 1603 for in-vehicle noise at 10 dB, and the Viterbi algorithm is used to create the label information 1504 for each item of voice data (each utterance of each speaker). FIG. 38 shows an example of the label information 1504. Here, the phoneme name and the HMM state number corresponding to each frame number constitute the label information 1504.
[0231]
The label information storage unit 1502 stores the label information 1504 and the information 1514 on the state transition of the phoneme model.
[0232]
[Create sufficient statistics 1509]
Next, a method for creating sufficient statistics 1509 will be described.
[0233]
The user stores the voice data 1513 of the user in a quiet environment in the memory 1512 in advance.
[0234]
The user requests creation of the adaptive model 1511.
[0235]
The sufficient statistic generation unit 1506 receives the voice data 1513 of the user in a quiet environment stored in the memory 1512. Further, the sufficient statistic generation unit 1506 receives the noise data 85 in an environment using voice recognition.
[0236]
The sufficient statistic generation unit 1506 inputs the user's voice data 1513 in a quiet environment to the selection model 1510 stored in the selection model storage unit 1508 and calculates the likelihood. Then, the speakers of the top L people (for example, the top 40 people) having the highest likelihood are selected to be speakers close to the user's voice data.
[0237]
Sufficient statistic creation unit 1506 creates noise superimposed voice data 86 by superimposing noise data 85 on voice data of a speaker close to the user from voice data 83 in a quiet environment. An example of a method for creating the audio data 86 is shown in FIG.
[0238]
The sufficient statistic creation unit 1506 creates the sufficient statistics 1509 using the noise-superimposed voice data 86, the label information 1504 stored in the label information storage unit 1502, and the information 1514 on the state transition of the phoneme model. As shown in FIG. 39, the phoneme name and the HMM state number corresponding to each frame of the noise-superimposed voice data 86 are regarded as identical to the phoneme name and the HMM state number of the noise-superimposed voice data 1505 recorded in the label information 1504. Similarly, the HMM state transition probabilities for each phoneme are regarded as identical. That is, no calculation is performed for the HMM state numbers, state transition probabilities, and so on. Then, for frames assigned to the same HMM state, sufficient statistics such as the mean, variance, and mixture weight are calculated.
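The time saving comes from reusing a fixed frame-to-state alignment instead of re-running alignment on the newly noise-superimposed data. The sketch below illustrates the idea in a simplified form: it assumes scalar features and a single Gaussian per state (the specification also accumulates mixture weights), and the function name and data layout are hypothetical.

```python
from collections import defaultdict


def accumulate_with_labels(frames, labels):
    """Accumulate per-state statistics using precomputed label information.

    frames: list of scalar features, one per frame (scalar for brevity)
    labels: list of (phoneme, state) pairs, one per frame, taken from stored
            label information rather than from a new Viterbi pass
    Returns dict (phoneme, state) -> (count, mean, variance).
    """
    acc = defaultdict(lambda: [0, 0.0, 0.0])
    for x, key in zip(frames, labels):
        a = acc[key]
        a[0] += 1        # frame count for this state
        a[1] += x        # first-order sum
        a[2] += x * x    # second-order sum
    stats = {}
    for key, (n, s, ss) in acc.items():
        mean = s / n
        stats[key] = (n, mean, ss / n - mean * mean)
    return stats


frames = [1.0, 3.0, 10.0]
labels = [("a", 0), ("a", 0), ("a", 1)]
stats = accumulate_with_labels(frames, labels)
```

Because the labels are fixed, the accumulation is a single linear pass over the frames.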
[0239]
[Create Adaptive Model 1511]
Next, a method for creating the adaptive model 1511 in the adaptive model creating unit 51 will be described.
[0240]
The adaptive model creation unit 51 creates an adaptive model 1511 using the sufficient statistics 1509 created by the sufficient statistic creation unit 1506. Specifically, the adaptive model 1511 is created by the following statistical calculation (Expression 10 to Expression 12). The mean and variance of the normal distributions in each state of the HMM of the adaptive model 1511 are denoted μ_i^adp (i = 1, 2, ..., N_mix) and V_i^adp (i = 1, 2, ..., N_mix), where N_mix is the number of mixture components. The state transition probability is denoted a^adp[i][j] (i, j = 1, 2, ..., N_state), where N_state is the number of states and a^adp[i][j] represents the transition probability from state i to state j.
[0241]
[Expression 10]

[0242]
[Expression 11]

[0243]
[Expression 12]

[0244]
Here, N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) and V_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) are the mean and variance of each HMM. C_mix^j (j = 1, 2, ..., N_sel) and C_state^k[i][j] (k = 1, 2, ..., N_sel; i, j = 1, 2, ..., N_state) are the EM count (frequency) for the normal distribution and the EM count for the state transition, respectively.
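Since the expression images are not reproduced in this text, the following sketch shows one plausible reading of the calculation in code: each selected model's mean and variance are merged with weights equal to their EM counts, and transition counts are pooled and renormalised per source state. The function names and the scalar (one-dimensional) treatment are assumptions for illustration.

```python
def combine_gaussians(stats):
    """Count-weighted merge of one Gaussian across the selected models.

    stats: list of (count, mean, variance), one entry per selected model,
           all referring to the same mixture component.
    """
    total = sum(c for c, _, _ in stats)
    mean = sum(c * m for c, m, _ in stats) / total
    second = sum(c * (v + m * m) for c, m, v in stats) / total
    return mean, second - mean * mean


def combine_transitions(counts):
    """Pool transition counts and normalise each row into probabilities.

    counts[k][i][j] is the EM count for the transition i -> j in model k.
    Rows with no counts are left as zeros.
    """
    n = len(counts[0])
    summed = [[sum(c[i][j] for c in counts) for j in range(n)] for i in range(n)]
    return [[x / sum(row) if sum(row) else 0.0 for x in row] for row in summed]


mean, var = combine_gaussians([(1.0, 0.0, 1.0), (3.0, 4.0, 1.0)])
trans = combine_transitions([[[2, 2], [0, 4]], [[6, 0], [0, 0]]])
```

In the example, the merged variance (4.0) is larger than either input variance (1.0) because it also reflects the spread between the two model means.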
[0245]
The adaptive model creation unit 51 then waits for the user's next request to create an adaptive model.
[0246]
<Effect>
As described above, in the third embodiment, since the sufficient statistics 1509 are calculated using the label information 1504, the sufficient statistics 1509 can be created in a short time, and the adaptive model 1511 can be created in a short time. Therefore, the adaptation model can be used immediately when the usage environment changes in various ways.
[0247]
In addition, since the label information 1504 is created using the noise-superimposed speech data 1505 that is close to the usage environment, the sufficient statistics 1509 with high accuracy can be created in a short time. Therefore, an adaptive model with higher accuracy can be used immediately when the usage environment changes in various ways.
[0248]
Further, since the sufficient statistics 1509 are calculated using the label information 1504 and the information 1514 on the state transition of the phoneme model, the sufficient statistics 1509 can be created in a shorter time, and the adaptive model 1511 can be created in a shorter time. Therefore, the adaptation model can be used immediately when the usage environment changes in various ways.
[0249]
Note that the noise data 85 may be input to the sufficient statistics generation unit 1506 offline before the user requests acquisition of the adaptive model, and the sufficient statistics 1509 may be generated offline.
[0250]
The sufficient statistic generation unit 1506 may automatically determine the timing at which the noise data 85 is input to the sufficient statistic generation unit 1506.
[0251]
The adaptive model creation unit 51 may automatically determine the timing for creating the adaptive model 1511.
[0252]
The selection model 1510 is not limited to the Gaussian Mixture Model.
[0253]
The voice data 1513 stored in the memory 1512 may have noise superimposed on it, either noise of the usage environment or noise of an environment predicted to be the usage environment.
[0254]
The noise data 85 may be used as the predicted noise data 1503.
[0255]
(Fourth embodiment)
<Configuration of adaptive model creation device for speech recognition>
FIG. 40 is a block diagram showing the overall configuration of the adaptive model creation apparatus according to the fourth embodiment. The adaptive model creation apparatus shown in FIG. 40 includes a selection model creation unit 1507, a selection model storage unit 1508, a sufficient statistic creation unit 2107, an adaptive model creation unit 51, a label information creation unit 2104, a label information storage unit 2106, a label information selection model creation unit 2101, a label information selection model storage unit 2102, and a memory 1512. The selection model creation unit 1507 creates a selection model 1510 for selecting voice data close to the user's voice data. The selection model storage unit 1508 stores the selection model 1510 created by the selection model creation unit 1507. The label information creation unit 2104 creates two or more types of label information 2105 using noise-superimposed voice data, in which predicted noise data 1503 (noise predicted for the usage environment) is superimposed on the voice data 83 recorded in a quiet environment at a predicted SN ratio. The label information storage unit 2106 stores the two or more types of label information 2105 created by the label information creation unit 2104. The label information selection model creation unit 2101 creates a label information selection model 2103 using the predicted noise data 1503. The label information selection model storage unit 2102 stores the label information selection model 2103 created by the label information selection model creation unit 2101. The sufficient statistic creation unit 2107 uses the selection model 1510 stored in the selection model storage unit 1508 and the user's voice data 1513 in a quiet environment, stored in the memory 1512, to select from the voice data 83 the voice data close to the user's voice data.
Further, the sufficient statistic creation unit 2107 uses the label information selection model 2103 stored in the label information selection model storage unit 2102 and the noise data 85 of the usage environment to select, from the label information 2105 stored in the label information storage unit 2106, the label information suited to the usage environment. The sufficient statistic creation unit 2107 then creates the sufficient statistics 2108 using the voice data in which the noise data 85 is superimposed on the selected voice data, together with the selected label information 2105. The adaptive model creation unit 51 creates an adaptive model 2109 using the sufficient statistics 2108 created by the sufficient statistic creation unit 2107.
[0256]
<Operation of adaptive model creation device for speech recognition>
Next, the operation of the adaptive model creation device configured as described above will be described.
[0257]
[Create selection model 1510]
First, a method for creating the selection model 1510 will be described. Here, a case will be described in which the selection model 1510 is created offline before the user requests acquisition of the adaptive model.
[0258]
Voice data 83 of multiple speakers is recorded in a quiet environment. Here, voice data of about 300 speakers is recorded.
[0259]
As shown in FIG. 35, the selection model creation unit 1507 uses the voice data 83 to create, for each speaker, a selection model 1510 in the form of a one-state, 64-mixture Gaussian Mixture Model that does not distinguish phonemes.
[0260]
The selection model storage unit 1508 stores the selection model 1510 created by the selection model creation unit 1507.
[0261]
[Create Label Information 2105]
A method for creating the label information 2105 will be described. Here, a case will be described in which the label information 2105 is created offline before the user requests acquisition of the adaptive model. As an example, the case where voice recognition is used in an exhibition hall will be described with reference to FIGS. 41 and 42.
[0262]
It is known from the user's behavior history that voice recognition is often used in cars, exhibition halls, and homes. Therefore, typical noise in a car, an exhibition hall, and a home is recorded. As shown in FIG. 41, three types of noise data predicted for the usage environment (in-car noise data 1503A, exhibition hall noise data 1503B, and home noise data 1503C) are superimposed on the voice data 83 recorded in a quiet environment. This creates noise-superimposed voice data 1505A with in-car noise at 10 dB, noise-superimposed voice data 1505B with exhibition hall noise at 20 dB, and noise-superimposed voice data 1505C with home noise at 20 dB. Next, sufficient statistics 1603A, 1603B, and 1603C are calculated by the EM algorithm for each type of noise using the created noise-superimposed voice data. Here, sufficient statistics of unspecified speakers are created for each phoneme using the HMM. Next, as shown in FIG. 42, the three types of noise-superimposed voice data 1505A, 1505B, and 1505C are assigned to the sufficient statistics 1603A, 1603B, and 1603C for each item of voice data (utterance data of one speaker with one type of noise data), and label information 2105A, 2105B, and 2105C is created for each item of voice data (utterance data of one speaker) using the Viterbi algorithm.
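The superimposition step can be sketched as below. This is an illustrative sketch only: it assumes that the 10 dB and 20 dB figures denote the SN ratio of the mixture, that signals are one-dimensional sample arrays, and that a short noise recording is simply repeated to cover the utterance. The function name `superimpose_noise` is hypothetical.

```python
import numpy as np

def superimpose_noise(speech, noise, snr_db):
    """Mix noise into speech so that the mixture has the requested SN ratio in dB."""
    # Repeat or trim the noise so that it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that scales the noise power to speech_power / 10^(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Synthetic stand-ins for clean voice data and recorded noise (not real data).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(4000)
noisy = superimpose_noise(speech, noise, snr_db=20.0)

# Check the SN ratio that was actually achieved.
added = noisy - speech
achieved_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2))
```

Calling the same function once per noise type and SN ratio would produce one noise-superimposed data set per predicted environment.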
[0263]
[Create Label Information Selection Model 2103]
Next, a method for creating the label information selection model 2103 will be described with reference to FIG. Here, as an example, a GMM corresponding to each type of noise is created. Label information selection models 2103A, 2103B, and 2103C are created using the predicted noise data 1503A, 1503B, and 1503C that were used in creating the label information 2105.
[0264]
[Create sufficient statistics 2108]
Next, a method for creating sufficient statistics 2108 will be described.
[0265]
The user stores the voice data 1513 of the user in a quiet environment in the memory 1512 in advance.
[0266]
The user requests creation of the adaptive model 2109.
[0267]
The sufficient statistic generation unit 2107 receives the voice data 1513 of the user in the quiet environment stored in the memory 1512. Further, the sufficient statistic creating unit 2107 receives the noise data 85 in an environment using voice recognition.
[0268]
The sufficient statistic generation unit 2107 inputs the user's voice data 1513 in a quiet environment to the selection model 1510 stored in the selection model storage unit 1508 and calculates the likelihood. Then, the L speakers with the highest likelihood (for example, the top 40) are selected as speakers close to the user's voice data.
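The selection of the top L speakers can be illustrated as follows. This is a sketch under stated assumptions: each speaker's selection model is taken to be a diagonal-covariance GMM stored as a `(weights, means, variances)` tuple, and the score is the average per-frame log-likelihood. The function names and the dictionary layout are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, dim); weights: (M,); means, variances: (M, dim)
    """
    diff = frames[:, None, :] - means[None, :, :]                     # (T, M, dim)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)     # (M,)
    log_comp = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)   # (T, M)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    log_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return log_frame.mean()

def select_top_speakers(user_frames, speaker_gmms, top_l):
    """Return the ids of the top_l speakers whose GMM best explains the user's frames."""
    scores = {sid: gmm_log_likelihood(user_frames, *gmm)
              for sid, gmm in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_l]

# Toy models: one Gaussian per speaker, centered at 0, 1, and 5.
one = np.array([1.0])
speaker_gmms = {
    "spk_a": (one, np.array([[0.0]]), np.array([[1.0]])),
    "spk_b": (one, np.array([[1.0]]), np.array([[1.0]])),
    "spk_c": (one, np.array([[5.0]]), np.array([[1.0]])),
}
user_frames = np.full((10, 1), 0.9)
selected = select_top_speakers(user_frames, speaker_gmms, top_l=2)
```

With about 300 stored speakers and L = 40, the same call would return the 40 speaker ids whose voice data is then used in the following steps.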
[0269]
The sufficient statistic creation unit 2107 creates noise-superimposed voice data 86 by superimposing the noise data 85 on the voice data of the speakers close to the user, selected from the voice data 83 in a quiet environment. An example of a method for creating the noise-superimposed voice data 86 is shown in FIG.
[0270]
The sufficient statistic generation unit 2107 inputs the noise data 85 to the label information selection models 2103 stored in the label information selection model storage unit 2102, and retrieves from the label information storage unit 2106 the label information 2105 corresponding to the label information selection model 2103 with the highest likelihood. Here, since the usage environment is the exhibition hall, the label information 2105B for exhibition hall noise at 20 dB is retrieved.
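A minimal sketch of this selection is given below. For brevity it replaces each noise GMM with a single diagonal Gaussian fitted to feature frames of the predicted noise, which is a simplification and not the configuration described above. The feature frames are synthetic, and all function and key names are hypothetical.

```python
import numpy as np

def fit_noise_model(frames):
    """Fit one diagonal Gaussian (a 1-mixture stand-in for the GMM) to noise frames."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of frames under a diagonal Gaussian."""
    mean, var = model
    per_dim = -0.5 * (np.log(2.0 * np.pi * var) + (frames - mean) ** 2 / var)
    return per_dim.sum(axis=1).mean()

def select_label_info(noise_frames, selection_models, label_info):
    """Pick the label information whose selection model scores the observed noise highest."""
    best = max(selection_models,
               key=lambda name: avg_log_likelihood(noise_frames, selection_models[name]))
    return best, label_info[best]

# Synthetic feature frames standing in for the three predicted noise types.
rng = np.random.default_rng(1)
predicted = {
    "car": rng.normal(-3.0, 1.0, size=(500, 2)),
    "exhibition_hall": rng.normal(0.0, 1.0, size=(500, 2)),
    "home": rng.normal(3.0, 1.0, size=(500, 2)),
}
selection_models = {name: fit_noise_model(x) for name, x in predicted.items()}
label_info = {"car": "2105A", "exhibition_hall": "2105B", "home": "2105C"}

# Noise observed in the usage environment, drawn to resemble the exhibition hall.
observed = rng.normal(0.0, 1.0, size=(200, 2))
chosen, labels = select_label_info(observed, selection_models, label_info)
```

The mapping from model name to label information is kept as a plain dictionary here; the number of label information types and the number of selection models need not match in general.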
[0271]
The sufficient statistic creation unit 2107 creates the sufficient statistic 2108 using the noise superimposed audio data 86 and the label information 2105B of the exhibition hall noise 20 dB extracted from the label information storage unit 2106.
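One plausible reading of why label information shortens the computation is that, with a fixed frame-to-state labeling, the statistics can be accumulated in a single pass over the data. The sketch below illustrates that idea under simplifying assumptions that are not taken from the text: one Gaussian per state, a hard labeling with one state index per frame, and a hypothetical function name `accumulate_statistics`.

```python
import numpy as np

def accumulate_statistics(frames, state_labels, n_states):
    """Accumulate per-state counts, means, variances, and transition counts
    from feature frames and a fixed frame-to-state labeling."""
    dim = frames.shape[1]
    counts = np.zeros(n_states)
    sums = np.zeros((n_states, dim))
    sq_sums = np.zeros((n_states, dim))
    trans_counts = np.zeros((n_states, n_states))
    for t, s in enumerate(state_labels):
        counts[s] += 1
        sums[s] += frames[t]
        sq_sums[s] += frames[t] ** 2
        if t + 1 < len(state_labels):
            trans_counts[s, state_labels[t + 1]] += 1
    occupied = np.maximum(counts, 1.0)[:, None]   # avoid dividing by zero for unused states
    means = sums / occupied
    variances = sq_sums / occupied - means ** 2
    return counts, means, variances, trans_counts

# Toy utterance: five 1-dimensional frames labeled with two states.
frames = np.array([[1.0], [3.0], [10.0], [10.0], [14.0]])
state_labels = [0, 0, 1, 1, 1]
counts, means, variances, trans_counts = accumulate_statistics(frames, state_labels, 2)
```

The returned counts play the role of the EM counts, and the means and variances are the per-state moments that a later integration step would combine.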
[0272]
[Create Adaptive Model 2109]
Next, a method for creating the adaptive model 2109 in the adaptive model creation unit 51 will be described.
[0273]
The adaptive model creation unit 51 creates an adaptive model 2109 using the sufficient statistics 2108 created by the sufficient statistics creation unit 2107. Specifically, the adaptive model 2109 is created by the following statistical calculation (Expression 13 to Expression 15). The mean and variance of the normal distribution in each state of the HMM of the adaptive model 2109 are denoted μ_i^adp (i = 1, 2, ..., N_mix) and V_i^adp (i = 1, 2, ..., N_mix), where N_mix is the number of mixture distributions. The state transition probability is denoted a^adp[i][j] (i, j = 1, 2, ..., N_state), where N_state is the number of states and a^adp[i][j] is the transition probability from state i to state j.
[0274]
[Expression 13]

[0275]
[Expression 14]

[0276]
[Expression 15]

[0277]
Here, N_sel is the number of selected acoustic models, and μ_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) and V_i^j (i = 1, 2, ..., N_mix; j = 1, 2, ..., N_sel) are the mean and variance of each HMM. C_mix^j (j = 1, 2, ..., N_sel) and C_state^k[i][j] (k = 1, 2, ..., N_sel; i, j = 1, 2, ..., N_state) are the EM count (frequency) in the normal distribution and the EM count for the state transition, respectively.
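The expressions themselves are not reproduced in this text. Using only the symbols defined above, one form consistent with those definitions is a count-weighted combination of moments, written out below. This is an assumed reconstruction offered for orientation, not the confirmed content of Expressions 13 to 15. The summation index j' is introduced here only to keep the free index j distinct.

```latex
\begin{align}
\mu_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\,\mu_i^{j}}
                    {\sum_{j=1}^{N_{sel}} C_{mix}^{j}}
  && (i = 1, 2, \ldots, N_{mix}) \\
V_i^{adp} &= \frac{\sum_{j=1}^{N_{sel}} C_{mix}^{j}\left(V_i^{j} + (\mu_i^{j})^{2}\right)}
                  {\sum_{j=1}^{N_{sel}} C_{mix}^{j}} - \left(\mu_i^{adp}\right)^{2}
  && (i = 1, 2, \ldots, N_{mix}) \\
a^{adp}[i][j] &= \frac{\sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j]}
                      {\sum_{j'=1}^{N_{state}} \sum_{k=1}^{N_{sel}} C_{state}^{k}[i][j']}
  && (i, j = 1, 2, \ldots, N_{state})
\end{align}
```

Under this form, each selected model contributes in proportion to its EM count, and each row of the transition matrix sums to one.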
[0278]
The adaptive model creation unit 51 then waits for the user's next request to create an adaptive model.
[0279]
<Effect>
As described above, in the fourth embodiment, the sufficient statistics 2108 are calculated using the label information 2105 that is selected based on the label information selection model 2103 and is suited to the usage environment, so more accurate sufficient statistics can be created. Therefore, an adaptive model with higher accuracy can be used immediately when the usage environment changes in various ways.
[0280]
The noise data 85 may be input to the sufficient statistic generation unit 2107 offline before the user requests acquisition of the adaptive model, and the sufficient statistic 2108 may be generated offline.
[0281]
The sufficient statistic generation unit 2107 may automatically determine the timing at which the noise data 85 is input to the sufficient statistic generation unit 2107.
[0282]
The adaptive model creation unit 51 may automatically determine the timing for creating the adaptive model 2109.
[0283]
The selection model 1510 is not limited to the Gaussian Mixture Model.
[0284]
The voice data 1513 stored in the memory 1512 may have noise superimposed on it, either noise of the usage environment or noise of an environment predicted to be the usage environment.
[0285]
The number of types of label information 2105 and the number of label information selection models 2103 are not necessarily the same.
[0286]
The noise data 85 may be used as the predicted noise data 1503.
[0287]
The adaptive model creation apparatus according to the second embodiment can be realized by hardware or software (computer program).
[Brief description of the drawings]
FIG. 1 is a diagram showing various speaker adaptation techniques.
FIG. 2 is a flowchart showing a procedure for creating an adaptive model by the “method using sufficient statistics”.
FIG. 3 is a block diagram for explaining a procedure for creating an adaptive model by the “method using sufficient statistics”.
FIG. 4 is a diagram for explaining sufficient statistic creation processing.
FIG. 5 is a diagram for explaining an adaptive model creation process.
FIG. 6 is a diagram for explaining a problem in the “method using sufficient statistics” in the prior art.
FIG. 7 is a block diagram showing a configuration of an adaptive model creation device according to the first embodiment.
FIG. 8 is a diagram showing a flow of group creation processing in the group creation unit shown in FIG.
FIG. 9 is a diagram showing a flow of processing for creating sufficient statistics accumulated in the sufficient statistics accumulation unit shown in FIG. 7.
FIG. 10 is a diagram showing a flow of processing for creating a selection model stored in the selection model storage unit shown in FIG. 7.
FIG. 11 is a diagram illustrating an example of sufficient statistics accumulated in the sufficient statistics accumulation unit illustrated in FIG. 7.
FIG. 12 is a diagram illustrating an example of a selection model stored in the selection model storage unit illustrated in FIG.
FIG. 13 is a diagram showing a flow of processing for determining a group acoustically close to a user's voice in the adaptive model creating unit shown in FIG. 7.
FIG. 14 is a diagram showing a flow of processing for determining a sufficient statistic close to the user's voice data in the adaptive model creation unit shown in FIG. 7.
FIG. 15 is a diagram showing a result of a recognition experiment.
FIG. 16 is a diagram illustrating an example of sufficient statistics accumulated in the sufficient statistics accumulation unit illustrated in FIG. 7.
FIG. 17 is a diagram illustrating an example of a group created by a group creation unit.
FIG. 18A is a diagram showing a specific product image and an example of grouping.
FIG. 18B is a diagram showing a specific product image and an example of grouping.
FIG. 19A is a diagram showing a specific product image and an example of grouping.
FIG. 19B is a diagram showing a specific product image and an example of grouping.
FIG. 20 is a diagram illustrating a specific product image and a grouping example.
FIG. 21A is a diagram showing a specific product image and an example of grouping.
FIG. 21B is a diagram showing a specific product image and an example of grouping.
FIG. 22 is a diagram illustrating a specific product image and a grouping example.
FIG. 23A is a diagram showing a specific product image and an example of grouping.
FIG. 23B is a diagram showing a specific product image and an example of grouping.
FIG. 24 is a diagram illustrating a specific product image and a grouping example.
FIG. 25A is a diagram showing a specific product image and an example of grouping.
FIG. 25B is a diagram showing a specific product image and an example of grouping.
FIG. 26 is a diagram illustrating a specific product image and a grouping example.
FIG. 27 is a diagram illustrating a specific product image and a grouping example.
FIG. 28 is a diagram illustrating a specific product image and a grouping example.
FIG. 29 is a block diagram showing a configuration of an adaptive model creation device according to a second embodiment.
FIG. 30 is a diagram showing a flow of processing for creating a selection model stored in the selection model storage unit shown in FIG. 29.
FIG. 31 is a diagram showing a flow of processing for creating noise-superimposed voice data.
FIG. 32 is a diagram illustrating an example of sufficient statistics created by the sufficient statistics creating unit illustrated in FIG. 9.
FIG. 33 is a diagram showing an image in which the adaptive model creation device according to the second embodiment is applied to an actual product.
FIG. 34 is a block diagram showing a configuration of an adaptive model creation device according to a third embodiment.
FIG. 35 is a diagram showing a flow of processing for creating a selection model stored in a selection model storage unit.
FIG. 36 is a diagram showing a flow of processing for creating label information.
FIG. 37 is a diagram showing a flow of processing for creating label information.
FIG. 38 is a diagram illustrating an example of label information stored in a label information storage unit.
FIG. 39 is a diagram showing a flow of processing for creating sufficient statistics.
FIG. 40 is a block diagram showing a configuration of an adaptive model creation device according to a fourth embodiment.
FIG. 41 is a diagram showing a flow of processing for creating label information.
FIG. 42 is a diagram illustrating a flow of processing for creating label information.
FIG. 43 is a diagram showing a flow of processing for creating a label information selection model.

Claims

A method of creating an acoustic model used for speech recognition using sufficient statistics,
The sufficient statistics are represented by a hidden Markov model, and each state of the hidden Markov model is represented by a mixed Gaussian distribution,
The sufficient statistic holds, as information, an average value, a variance value, an EM count value, and an identification number assigned to the Gaussian distribution in the mixed Gaussian distribution,
The creation method is as follows:
It is a method of creating an acoustic model used for speech recognition expressed by a hidden Markov model by using two or more sufficient statistics to calculate a new integrated Gaussian distribution by statistical calculation using the mean value, variance value, and EM count value of the Gaussian distributions holding a common identification number in the different sufficient statistics,
(A) grouping voice data with superimposed noise based on acoustic proximity;
(B) for each group obtained in the step (a), creating, from the initial value of a predetermined sufficient statistic, a sufficient statistic that is a material for creating an acoustic model, using the voice data included in the group;
(C) selecting, from the groups obtained in the step (a), a group acoustically close to the voice data of a person (user) using voice recognition;
(D) selecting a plurality of sufficient statistics close to the user's voice data from among the sufficient statistics for the group selected in the step (c); and
(E) creating an acoustic model by using the sufficient statistics selected in the step (d) to calculate a new integrated Gaussian distribution by statistical calculation using the mean value, variance value, and EM count value of the Gaussian distributions holding a common identification number in the different sufficient statistics,
In step (a),
The voice data with superimposed noise is grouped such that, when a sufficient statistic of the noise-superimposed voice data is created from the initial value of the predetermined sufficient statistic using that voice data, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the noise-superimposed voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value.
A method characterized by that.

In claim 1,
Steps (a) and (b) are performed off-line prior to the time when the user uses voice recognition.

In claim 1,
In the step (a), grouping is performed based on the type of the noise.

In claim 1,
In the step (a), grouping is performed based on an S / N ratio of audio data on which the noise is superimposed.

In claim 1,
In the step (a), grouping is performed for each acoustically close speaker.

In claim 1,
In the step (b), a sufficient statistic is created for each speaker.

In claim 6,
In the step (b), a sufficient statistic is created for each tone of the speaker's voice.

In claim 1,
In the step (b), a sufficient statistic is created for each type of noise.

In claim 1,
In the step (b), a sufficient statistic is created for each S / N ratio of audio data included in each group.

A method of creating an acoustic model used for speech recognition using sufficient statistics,
The sufficient statistics are represented by a hidden Markov model, and each state of the hidden Markov model is represented by a mixed Gaussian distribution,
The sufficient statistic holds, as information, an average value, a variance value, an EM count value, and an identification number assigned to the Gaussian distribution in the mixed Gaussian distribution,
The creation method is as follows:
Using two or more sufficient statistics, a new Gaussian is integrated by statistical calculation using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in different sufficient statistics. It is a method of creating an acoustic model used for speech recognition expressed by a hidden Markov model by calculating the distribution,
(A) selecting, from a plurality of voice data of a plurality of speakers, voice data that is acoustically close to the voice data of a person (user) using voice recognition, and such that, when a sufficient statistic of the voice data is created from an initial value of a predetermined sufficient statistic using the voice data, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value;
(B) superimposing noise in an environment where voice recognition is used on the voice data selected in step (a);
(C) creating a sufficient statistic such that, when a sufficient statistic of the noise-superimposed voice data is created from a predetermined initial value of the sufficient statistic using the voice data on which noise was superimposed in the step (b), the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the noise-superimposed voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value; and
(D) creating an acoustic model by using the sufficient statistics created in the step (c) to calculate a new integrated Gaussian distribution by statistical calculation using the mean value, variance value, and EM count value of the Gaussian distributions holding a common identification number in the different sufficient statistics.

In claim 10,
Superimposing noise in an environment in which speech recognition is expected to be used on a plurality of speech data from the plurality of speakers (e);
And (f) creating label information about the audio data on which noise is superimposed by the step (e),
In step (c),
A sufficient statistic using the audio data on which noise is superimposed in step (b) and the label information on the audio data selected in step (a) among the label information created in step (f). A method characterized by creating.

In claim 11,
In step (f),
Creating information on the state transition of the acoustic model for the audio data on which noise is superimposed in the step (e),
In step (c),
A sufficient statistic is created by further using information relating to the state transition of the acoustic model for the audio data selected in step (a) among the information relating to the state transition of the acoustic model created in step (f). Feature method.

In claim 11,
In step (e),
Superimposing each of a plurality of types of noise on a plurality of voice data by the plurality of speakers,
In the step (f), label information is created for each of the plurality of types of noise,
In step (c),
Label information suitable for the environment in which speech recognition is used is selected from a plurality of label information about the voice data selected in the step (a), and sufficient statistics are created using the selected label information. A method characterized by that.

An apparatus for creating an acoustic model used for speech recognition using sufficient statistics,
The sufficient statistics are represented by a hidden Markov model, and each state of the hidden Markov model is represented by a mixed Gaussian distribution,
The sufficient statistic holds, as information, an average value, a variance value, an EM count value, and an identification number assigned to the Gaussian distribution in the mixed Gaussian distribution,
The creation device includes:
Using two or more sufficient statistics, a new Gaussian is integrated by statistical calculation using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in different sufficient statistics. It is a device that creates an acoustic model used for speech recognition expressed by a hidden Markov model by calculating the distribution,
Created using voice data included in a group from the initial value of a predetermined sufficient statistic for each of a plurality of groups obtained by grouping voice data with superimposed noise based on acoustic proximity An accumulator for accumulating sufficient statistics,
A first selection unit that selects a group acoustically close to voice data of a person (user) who uses voice recognition from the plurality of groups;
A second selection unit that selects a plurality of sufficient statistics acoustically close to the user's voice data from among the sufficient statistics for the group selected by the first selection unit;
By using a sufficient statistic selected by the second selection unit, a statistical calculation using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in the different sufficient statistic With a model creation unit that creates an acoustic model by integrating and calculating a new Gaussian distribution ,
The plurality of groups are:
Obtained by grouping the voice data with superimposed noise such that, when a sufficient statistic of the noise-superimposed voice data is created from the initial value of the predetermined sufficient statistic using that voice data, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the noise-superimposed voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value.
A device characterized by that.

In claim 14,
A group creation unit for grouping audio data on which noise is superimposed based on acoustic proximity;
A sufficient statistic creating unit that creates a sufficient statistic using the audio data included in the group for each group obtained by the group creating unit;
The storage unit
An apparatus for storing sufficient statistics created by the sufficient statistics creation unit.

An apparatus for creating an acoustic model used for speech recognition using sufficient statistics,
The sufficient statistics are represented by a hidden Markov model, and each state of the hidden Markov model is represented by a mixed Gaussian distribution,
The sufficient statistic holds, as information, an average value, a variance value, an EM count value, and an identification number assigned to the Gaussian distribution in the mixed Gaussian distribution,
The creation device includes:
Using two or more sufficient statistics, a new Gaussian is integrated by statistical calculation using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in different sufficient statistics. It is a device that creates an acoustic model used for speech recognition expressed by a hidden Markov model by calculating the distribution,
A storage unit for storing a plurality of voice data by a plurality of speakers;
A selection unit that selects, from the voice data stored in the storage unit, voice data that is acoustically close to the voice data of a person (user) using voice recognition, and such that, when a sufficient statistic of the voice data is created from an initial value of a predetermined sufficient statistic using the voice data, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value;
A noise superimposing unit that superimposes noise in an environment where speech recognition is used on the audio data selected by the selecting unit;
A sufficient statistic creation unit that creates a sufficient statistic such that, when a sufficient statistic of the noise-superimposed voice data is created from an initial value of a predetermined sufficient statistic using the voice data on which noise is superimposed by the noise superimposing unit, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the noise-superimposed voice data, that is closest to that Gaussian distribution of the initial value;
By using a sufficient statistic created by the sufficient statistic creating unit, using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in the different sufficient statistics, by statistical calculation An apparatus comprising: a model creation unit that creates an acoustic model by integrating and calculating a new Gaussian distribution .

A computer program for creating an acoustic model used for speech recognition using sufficient statistics,
The sufficient statistics are represented by a hidden Markov model, and each state of the hidden Markov model is represented by a mixed Gaussian distribution,
The sufficient statistic holds, as information, an average value, a variance value, an EM count value, and an identification number assigned to the Gaussian distribution in the mixed Gaussian distribution,
The computer program is
Using two or more sufficient statistics, a new Gaussian is integrated by statistical calculation using a mean value, a variance value, and an EM count value of a Gaussian distribution holding a common identification number in different sufficient statistics. It is a program to create an acoustic model used for speech recognition expressed by a hidden Markov model by calculating the distribution,
Computer
Created using voice data included in a group from the initial value of a predetermined sufficient statistic for each of a plurality of groups obtained by grouping voice data with superimposed noise based on acoustic proximity Means for accumulating sufficient statistics (a),
Means (b) for selecting a group acoustically close to voice data of a person (user) using voice recognition from the plurality of groups;
Means (c) for selecting a plurality of sufficient statistics acoustically close to the user's voice data from among the sufficient statistics for the group selected by said means (b);
Using the sufficient statistics selected by the means (c), integration by statistical calculation using the mean value, variance value, and EM count value of the Gaussian distribution holding a common identification number in the different sufficient statistics Means (d) for creating an acoustic model by calculating a new Gaussian distribution ,
As a program to function as
The plurality of groups are:
Obtained by grouping the voice data with superimposed noise such that, when a sufficient statistic of the noise-superimposed voice data is created from the initial value of the predetermined sufficient statistic using that voice data, the identification number of each Gaussian distribution of the initial value of the sufficient statistic is the same as the identification number of the Gaussian distribution, in the sufficient statistic of the noise-superimposed voice data, whose distance between distributions is closest to that Gaussian distribution of the initial value.
A program characterized by that.
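
The identification-number condition in this claim can be illustrated with a minimal sketch. It assumes one-dimensional Gaussians and uses a symmetric Kullback-Leibler divergence as the inter-distribution distance; the claim does not name a distance measure, so that choice, and the names `sym_kl` and `ids_consistent`, are hypothetical and serve only as an illustration.

```python
from typing import Dict, Tuple

# (mean, variance) of a one-dimensional Gaussian; real systems use vectors.
Gaussian = Tuple[float, float]


def sym_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetric KL divergence between two 1-D Gaussians (assumed distance)."""
    (m1, v1), (m2, v2) = p, q
    d2 = (m1 - m2) ** 2
    return 0.5 * ((v1 + d2) / v2 + (v2 + d2) / v1) - 1.0


def ids_consistent(initial: Dict[int, Gaussian],
                   trained: Dict[int, Gaussian]) -> bool:
    """Return True if, for every Gaussian of the initial value, the nearest
    Gaussian in the trained sufficient statistics has the same ID."""
    for gid, g0 in initial.items():
        nearest = min(trained, key=lambda k: sym_kl(g0, trained[k]))
        if nearest != gid:
            return False
    return True


initial = {0: (0.0, 1.0), 1: (5.0, 1.0)}
trained_ok = {0: (0.3, 1.1), 1: (4.6, 0.9)}    # each Gaussian stayed near its start
trained_bad = {0: (4.0, 1.0), 1: (1.0, 1.0)}   # the two Gaussians swapped roles

ok = ids_consistent(initial, trained_ok)
bad = ids_consistent(initial, trained_bad)
```

Under this reading, speech data whose trained statistics pass the check can share a group, because Gaussians with the same identification number then describe comparable acoustic regions and can later be integrated number by number.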

The program according to claim 17,
wherein the program further causes the computer to function as:
means (e) for grouping noise-superimposed speech data based on acoustic closeness; and
means (f) for creating, for each group obtained by the means (e), sufficient statistics using the speech data contained in the group,
and wherein
the means (a)
stores the sufficient statistics created by the means (f).

A computer program for creating, using sufficient statistics, an acoustic model used for speech recognition, wherein
the sufficient statistics are represented by a hidden Markov model, each state of the hidden Markov model being represented by a Gaussian mixture distribution,
the sufficient statistics hold, as information, a mean value, a variance value, an EM count value, and an identification number assigned to each Gaussian distribution in the Gaussian mixture distribution, and
the computer program is
a program that creates an acoustic model used for speech recognition and represented by a hidden Markov model, by using two or more sufficient statistics to compute a new Gaussian distribution, integrating by statistical calculation the mean values, variance values, and EM count values of Gaussian distributions that hold a common identification number in different sufficient statistics,
the program causing a computer to function as:
means (a) for storing a plurality of speech data from a plurality of speakers;
means (b) for selecting, from the speech data stored in the means (a), speech data that is acoustically close to the speech data of a person (user) who uses speech recognition and for which, when sufficient statistics of that speech data are created from a predetermined initial value of the sufficient statistics using that speech data, the identification number of each Gaussian distribution of the initial value of the sufficient statistics is the same as the identification number of the Gaussian distribution, in the sufficient statistics of that speech data, whose inter-distribution distance to that Gaussian distribution of the initial value is the smallest;
means (c) for superimposing, on the speech data selected by the means (b), noise of the environment in which speech recognition is used;
means (d) for creating sufficient statistics of the noise-superimposed speech data, from a predetermined initial value of the sufficient statistics and using the speech data on which the noise has been superimposed by the means (c), such that the identification number of each Gaussian distribution of the initial value of the sufficient statistics is the same as the identification number of the Gaussian distribution, in the sufficient statistics of the noise-superimposed speech data, whose inter-distribution distance to that Gaussian distribution of the initial value is the smallest; and
means (e) for creating an acoustic model by using the sufficient statistics created by the means (d) to compute a new Gaussian distribution, integrating by statistical calculation the mean values, variance values, and EM count values of Gaussian distributions that hold a common identification number in the different sufficient statistics.
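
The integration step recited in means (e) can be sketched as follows. This is a minimal illustration, assuming one-dimensional Gaussians and the usual count-weighted moment matching (each Gaussian weighted by its EM count); the claim only states that the mean, variance, and EM count are used, so the exact formula, and the names `Gaussian`, `SufficientStat`, and `integrate`, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gaussian:
    mean: float   # mean value (one dimension, for brevity)
    var: float    # variance value
    count: float  # EM count value


# One sufficient statistic: identification number -> Gaussian.
SufficientStat = Dict[int, Gaussian]


def integrate(stats: List[SufficientStat]) -> SufficientStat:
    """Merge Gaussians that share an identification number across statistics."""
    if len(stats) < 2:
        raise ValueError("two or more sufficient statistics are required")
    common_ids = set(stats[0]).intersection(*stats[1:])
    merged: SufficientStat = {}
    for gid in sorted(common_ids):
        total = sum(s[gid].count for s in stats)
        if total <= 0.0:
            continue  # no training evidence for this Gaussian; skip it
        # Count-weighted first and second moments.
        mean = sum(s[gid].count * s[gid].mean for s in stats) / total
        second = sum(s[gid].count * (s[gid].var + s[gid].mean ** 2)
                     for s in stats) / total
        merged[gid] = Gaussian(mean=mean, var=second - mean ** 2, count=total)
    return merged


stat_a = {0: Gaussian(mean=0.0, var=1.0, count=1.0)}
stat_b = {0: Gaussian(mean=2.0, var=1.0, count=1.0)}
result = integrate([stat_a, stat_b])
```

With these two equally weighted inputs, the merged Gaussian has mean 1.0 and variance 2.0: the spread between the two input means adds to the variance. Because the merge is a closed-form calculation over stored values, no retraining on the speech data is needed at this step.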