JP6095588B2

JP6095588B2 - Speech recognition WFST creation device, speech recognition device, speech recognition WFST creation method, speech recognition method, and program

Info

Publication number: JP6095588B2
Application number: JP2014015124A
Authority: JP
Inventors: 祥子山畠; 隆伸大庭
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2013-06-03
Filing date: 2014-01-30
Publication date: 2017-03-15
Anticipated expiration: 2034-01-30
Also published as: JP2015014774A

Description

The present invention relates to a technique for creating weighted finite-state transducers used in speech recognition and for performing speech recognition with those transducers.

In speech recognition using weighted finite-state transducers (hereinafter, WFSTs), the information required for recognition, such as the acoustic model, the dictionary, and the language model, is first converted individually into WFSTs. These WFSTs are then composed and optimized to create a WFST for speech recognition (hereinafter, a speech recognition WFST). The resulting speech recognition WFST is treated as the search space and decoded, converting the input speech signal into the character string of the recognition result. Here, optimization is a collective term for WFST optimization operations such as determinization and minimization. The acoustic score, which is the matching score between the input speech and the acoustic model, and the language score given by the language model are accumulated as weights, and the output symbol string with the highest weight becomes the speech recognition result.

In recent years, many speech recognition services have been built as server/client speech recognition systems. In such a system, the client terminal sends the speech signal uttered by the user to the server device, the server device converts the received speech signal into a text recognition result, and the server device returns that result to the client terminal. To achieve a quick response, the server device keeps a huge speech recognition WFST loaded in memory at all times. This speech recognition WFST contains the information of the acoustic model, the language model, and the word dictionary, and forms a huge network.

In such a speech recognition service, an optimal speech recognition WFST can be built for each user by adaptation, for example by updating the model and dictionary information with the speech and text accumulated for that user, which improves recognition accuracy. Updating the word dictionary is particularly effective: adding new words specific to the user (for example, the names of family members or friends) makes those words recognizable for the first time.

However, building an optimal speech recognition WFST for each user requires preparing as many huge general-purpose WFSTs as there are users. Moreover, each general-purpose WFST must stay loaded in the memory of the server device to ensure a quick response, while the computing resources of the server device are limited. This approach is therefore very difficult in practice.

For acoustic models and language models, frameworks have been proposed that perform model adaptation without rebuilding the general-purpose WFST. For example, Patent Document 1 describes a technique in which each phoneme HMM (Hidden Markov Model) of the acoustic model is trained separately from the general-purpose WFST and, during decoding, is dynamically applied to the corresponding HMM states in the network of the previously composed general-purpose WFST. With the technique of Patent Document 1, adapting the acoustic model only requires updating the HMM parameters. That is, the acoustic model optimized for each user and the general-purpose WFST are held separately, and adaptation is achieved by dynamically applying the per-user acoustic model to the general-purpose WFST that is always loaded in memory. In this way, the technique of Patent Document 1 removes the need to hold a huge general-purpose WFST for every user.

Non-Patent Document 1 describes a method for the language model in which a number of small models, each specialized for a particular topic, are prepared in advance from a large corpus, and a WFST is constructed so that the mixture weights of the language models can be changed dynamically according to the application from which the speech is input. As a result, holding only a set of mixture weights for a particular user is enough to dynamically generate a model specialized for that user without changing the structure of the general-purpose WFST.

Patent Document 1: JP 2012-113087 A

Non-Patent Document 1: Brandon Ballinger, “On-Demand Language Model Interpolation for Mobile Speech Input”, INTERSPEECH 2010, pp.1812-1815, 2010.

In the prior art, making a new word recognizable unavoidably requires recomposing the huge general-purpose WFST that serves as the base. Consequently, a speech recognition WFST that individually covers the appropriate vocabulary must be created for each user.

If only the per-user difference information of the acoustic model, the language model, and the word dictionary relative to the huge general-purpose WFST could be held compactly, and that difference information could be loaded instantly at recognition time and applied to the general-purpose WFST, then per-user model adaptation could be achieved without loading a large number of huge general-purpose WFSTs into memory.

An object of the present invention is to create a speech recognition WFST to which new words can be added without recomposing the huge general-purpose WFST, and to perform speech recognition using that speech recognition WFST.

To solve the above problem, a speech recognition WFST creation device according to one aspect of the present invention includes: a dictionary WFST creation unit that creates a dictionary WFST which takes a phoneme string as input and outputs subwords based on the input phoneme string and, among predetermined general-purpose words, any general-purpose word whose reading (phoneme string) matches the input phoneme string; a language model WFST creation unit that creates a language model WFST which assigns to each general-purpose word output by the dictionary WFST the weight of that general-purpose word, and assigns to each subword output by the dictionary WFST a weight smaller than any of the general-purpose word weights; a general-purpose WFST creation unit that creates a general-purpose WFST using an acoustic model WFST that converts an input speech signal into an acoustic environment, a phoneme WFST that converts the acoustic environment into phonemes, the dictionary WFST, and the language model WFST; and a word addition WFST creation unit that creates a word addition WFST which, when the output of the general-purpose WFST matches any of the phoneme strings representing the readings of new words, assigns the weight of the matched new word and outputs that new word, and, when the output of the general-purpose WFST matches any of the general-purpose words, maintains the weight of the matched new word and outputs that general-purpose word.

A speech recognition device according to one aspect of the present invention includes: a general-purpose WFST storage unit that stores the general-purpose WFST created by the speech recognition WFST creation device; a word addition WFST storage unit that stores the word addition WFST created by the speech recognition WFST creation device; and a speech recognition unit that converts an input speech signal into a speech recognition result using a speech recognition WFST obtained by dynamically composing the word addition WFST with the general-purpose WFST.

According to the present invention, a speech recognition WFST is obtained to which new words can be added without recomposing the huge general-purpose WFST. Because the difference information can be applied instantly when speech recognition is executed, per-user model adaptation can be achieved without reloading a large number of huge WFSTs into memory.

FIG. 1 is a diagram illustrating the functional configuration of the speech recognition WFST creation device of the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the speech recognition WFST creation method of the first embodiment.
FIG. 3 is a diagram showing a configuration example of the dictionary WFST.
FIG. 4 is a diagram showing a configuration example of the language model WFST.
FIG. 5 is a diagram showing a configuration example of the word addition WFST of the first embodiment.
FIG. 6 is a diagram illustrating the functional configuration of the speech recognition device of the first embodiment.
FIG. 7 is a diagram illustrating the processing flow of the speech recognition method of the first embodiment.
FIG. 8 is a diagram showing a configuration example of the word addition WFST of the second embodiment.
FIG. 9 is a diagram illustrating the functional configuration of the speech recognition WFST creation device of the third embodiment.
FIG. 10 is a diagram illustrating the processing flow of the speech recognition WFST creation method of the third embodiment.
FIG. 11 is a diagram illustrating the functional configuration of the speech recognition device of the third embodiment.
FIG. 12 is a diagram showing a configuration example of the word addition WFST of the fourth embodiment.
FIG. 13 is a diagram showing a configuration example of the word addition WFST of the fifth embodiment.
FIG. 14 is a diagram for explaining the pushing process.
FIG. 15 is a diagram showing a configuration example of the word addition WFST of the sixth embodiment.

In speech recognition processing with a typical speech recognition WFST, a general-purpose WFST is created in advance by composing an acoustic model WFST that converts the input speech signal into phoneme environments, a phoneme WFST that converts phoneme environments into phonemes, a dictionary WFST that converts phoneme strings into words, and a language model WFST that assigns a language score to a word string and outputs the word string. At recognition time, the previously created general-purpose WFST is loaded, speech recognition is performed on the input speech signal, and the character string of the recognition result obtained by converting that speech signal is output.

Although details are given later, the speech recognition WFST creation device and the speech recognition device of the present invention differ from a typical speech recognition WFST creation device and speech recognition device in the following points.
1. The configurations of the dictionary WFST and the language model WFST are different. Accordingly, the process for creating each WFST is different.
2. A word addition WFST is created in addition to the general-purpose WFST.
3. The word addition WFST is loaded in addition to the general-purpose WFST, and speech recognition processing is performed with both.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numeral, and duplicate description is omitted.

[First Embodiment]
<Speech recognition WFST creation device>
An example of the functional configuration of the speech recognition WFST creation device according to the first embodiment is described with reference to FIG. 1. The speech recognition WFST creation device 1 according to the first embodiment is a device that creates the speech recognition WFST of the present invention.

The speech recognition WFST creation device 1 includes a general-purpose word pronunciation dictionary storage unit 10, a phoneme list storage unit 12, a general-purpose language model storage unit 14, a dictionary WFST creation unit 16, a language model WFST creation unit 18, a phoneme WFST storage unit 20, an acoustic model WFST storage unit 22, a dictionary WFST storage unit 24, a language model WFST storage unit 26, a general-purpose WFST creation unit 28, a general-purpose WFST storage unit 30, a general-purpose word list storage unit 32, a new word reading list storage unit 34, a word addition WFST creation unit 36, and a word addition WFST storage unit 38.

The speech recognition WFST creation device 1 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (Random Access Memory, RAM), and the like. The speech recognition WFST creation device 1 executes each process under the control of the central processing unit, for example. Data input to the speech recognition WFST creation device 1 and data obtained in each process are stored, for example, in the main storage device, and are read out as needed and used in other processes. Each storage unit of the speech recognition WFST creation device 1 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or as middleware such as a relational database or a key-value store. The storage units only need to be logically separated, and may all reside on a single physical storage device.

The general-purpose word pronunciation dictionary storage unit 10 stores a general-purpose word pronunciation dictionary prepared in advance. The general-purpose word pronunciation dictionary is a list of the words contained in the general-purpose language model described later, together with their readings. Each entry associates one general-purpose word with the phoneme string representing its reading, one to one, for example "赤/aka" and "秋/aki".

The phoneme list storage unit 12 stores a phoneme list prepared in advance. The phoneme list is a list of all phonemes that appear in the readings of the general-purpose word pronunciation dictionary.
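As a minimal sketch of how these two inputs relate, the phoneme list can be derived from the pronunciation dictionary. The in-memory representation (a Python dict mapping each word to a list of phonemes) and the function name `build_phoneme_list` are assumptions made for illustration; the source does not specify a storage format.

```python
# Hypothetical in-memory form of the general-purpose word pronunciation
# dictionary: one entry per word, mapping the word to its phoneme string.
pronunciation_dict = {
    "赤": ["a", "k", "a"],
    "秋": ["a", "k", "i"],
}

def build_phoneme_list(pron_dict):
    """Collect every phoneme that appears in any reading, in sorted order."""
    phonemes = set()
    for reading in pron_dict.values():
        phonemes.update(reading)
    return sorted(phonemes)

phoneme_list = build_phoneme_list(pronunciation_dict)
```

With only the two example entries, the resulting list contains the phonemes a, i, and k.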

The general-purpose language model storage unit 14 stores a general-purpose language model prepared in advance. The general-purpose language model is a list of N-gram word strings with their N-gram probabilities and back-off coefficients. Each entry is a combination of a string of N words, its probability value, and its back-off coefficient, for example the trigram (3-gram) word string "赤い花" with trigram probability "0.01" and back-off coefficient "0.1". Here, N is an integer, and values such as 3 or 4 are used. The general-purpose language model may contain entries that have no back-off coefficient.

The general-purpose word list storage unit 32 stores a general-purpose word list prepared in advance. The general-purpose word list is a list of all words contained in the general-purpose language model.

The new word reading list storage unit 34 stores a new word reading list prepared in advance. The new word reading list is a list of new words, distinct from the general-purpose words, together with their readings. Each entry associates one new word with the phoneme string representing its reading, one to one, for example "阿川/akou" and "明夫/akio". It is assumed that "阿川" and "明夫" are not contained in the general-purpose word model.

An example of the processing flow of the speech recognition WFST creation method executed by the speech recognition WFST creation device 1 is described with reference to FIG. 2, in the order in which the steps are actually performed.

In step S16, the dictionary WFST creation unit 16 reads the general-purpose word pronunciation dictionary stored in the general-purpose word pronunciation dictionary storage unit 10 and the phoneme list stored in the phoneme list storage unit 12. It then creates a dictionary WFST that takes a phoneme string as input and outputs subwords based on the input phoneme string and, among the general-purpose words in the general-purpose word pronunciation dictionary, any general-purpose word whose reading matches the input phoneme string. A subword is a speech unit shorter than a word, such as a phoneme, a phonological unit, or a syllable.

An example of the configuration of the dictionary WFST of the present invention is described with reference to FIG. 3. The symbols attached to an arc represent the input and output when that arc is traversed. The symbol string before the ":" (colon) is the input symbol, and the symbol string after it is the output symbol. ε represents the empty symbol string. For example, when the input symbols "a,k,a" are given to the WFST shown in FIG. 3, the arcs "a:ε", "k:ε", and "a:赤" are traversed in order, and the word "赤" is output. When the input symbols "a,k,i" are given, the arcs "a:ε", "k:ε", and "i:秋" are traversed in order, and the word "秋" is output.

A conventional dictionary WFST contains only arcs that convert phoneme symbol strings into words, such as the arc chain that takes "a,k,a" as input and outputs "赤" and the arc chain that takes "a,k,i" as input and outputs "秋". The dictionary WFST of the present invention additionally contains arcs that take a phoneme as input and output that phoneme, such as the three arcs labeled "a:a", "k:k", and "i:i". As the topmost arc chain shows, it may also contain an arc chain that takes a phoneme string such as "k,i" as input and outputs a subword unit such as "ki".

The procedure by which the dictionary WFST creation unit 16 creates the dictionary WFST is described in more detail. First, the dictionary WFST creation unit 16 reads the phoneme list stored in the phoneme list storage unit 12 and, for each phoneme, creates an arc that takes that phoneme as input and outputs a phoneme or subword. Next, it reads the general-purpose word pronunciation dictionary stored in the general-purpose word pronunciation dictionary storage unit 10 and, for each word, creates arcs that take the phoneme string as input and output the word. Finally, it stores the created dictionary WFST in the dictionary WFST storage unit 24.
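The two-pass construction described here can be sketched as follows. This is an illustrative toy under stated assumptions, not the patented implementation: arcs are plain tuples, state 0 is assumed to be both the start and the final state, the word is emitted on the last arc of its chain as in the FIG. 3 example, and the names `build_dictionary_wfst` and `transduce` are hypothetical.

```python
EPS = "ε"  # empty symbol

def build_dictionary_wfst(phoneme_list, pron_dict):
    """Return arcs as (source, input, output, destination) tuples."""
    arcs = []
    next_state = 1  # state 0 is assumed to be both start and final
    # Pass 1: one phoneme-to-phoneme arc per phoneme (e.g. "a:a").
    for p in phoneme_list:
        arcs.append((0, p, p, 0))
    # Pass 2: one arc chain per word; the word is output on the last arc.
    for word, reading in pron_dict.items():
        src = 0
        for i, p in enumerate(reading):
            last = i == len(reading) - 1
            dst = 0 if last else next_state
            arcs.append((src, p, word if last else EPS, dst))
            if not last:
                next_state += 1
            src = dst
    return arcs

def transduce(arcs, phonemes, start=0, final=0):
    """Enumerate every output sequence the arcs allow for the input."""
    results = set()
    stack = [(start, 0, ())]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(phonemes):
            if state == final:
                results.add(out)
            continue
        for src, sym_in, sym_out, dst in arcs:
            if src == state and sym_in == phonemes[pos]:
                new_out = out if sym_out == EPS else out + (sym_out,)
                stack.append((dst, pos + 1, new_out))
    return results

arcs = build_dictionary_wfst(
    ["a", "i", "k", "o", "u"],
    {"赤": ["a", "k", "a"], "秋": ["a", "k", "i"]},
)
```

For the input a,k,a this toy transducer yields both the word 赤 and the raw phoneme sequence a,k,a, while for a,k,o,u (no matching general-purpose word) only the phoneme sequence passes through. That pass-through is what later allows a new word to be recovered from the phoneme output.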

In step S18, the language model WFST creation unit 18 reads the general-purpose language model stored in the general-purpose language model storage unit 14 and the phoneme list stored in the phoneme list storage unit 12. It then creates a language model WFST that assigns to each general-purpose word output by the dictionary WFST the weight of that general-purpose word, and assigns to each phoneme or subword output by the dictionary WFST a weight smaller than any of the general-purpose word weights.

An example of the configuration of the language model WFST of the present invention is described with reference to FIG. 4. As in the dictionary WFST, the symbol string before the ":" (colon) on an arc is the input symbol and the symbol string after it is the output symbol. "φ" represents the failure transition taken when no arc corresponds to the input symbol. The symbol string after the "/" (slash) indicates the weight applied when the arc is traversed. For example, "p(w₃|w₁w₂)" represents the trigram probability of the word string w₁w₂w₃, and "α(w₁w₂)" represents the back-off coefficient for the bigram w₁w₂.

A conventional language model WFST contains only arcs whose input and output are words. The language model WFST of the present invention additionally contains arcs that input and output a phoneme or subword unit as a unigram. For example, the arc labeled "a:a/δ" in FIG. 4 inputs and outputs the phoneme "a" as a unigram. A very small probability δ is given as its weight. As a result, the unigram of the phoneme "a" is output with a value equal to the product of the back-off coefficients α(w₁w₂) and α(w₂) and the probability δ.
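The product α(w₁w₂)·α(w₂)·δ follows from the usual back-off lookup: when the trigram and bigram containing the phoneme are absent, each failed lookup multiplies in the back-off coefficient of the current history. The sketch below illustrates this with made-up numbers; the table layout, the function name `backoff_prob`, and all probability values are assumptions for illustration only.

```python
# Hypothetical N-gram table: word tuple -> (probability, back-off coefficient).
# An entry may have no back-off coefficient (None).
ngrams = {
    ("w1", "w2", "w3"): (0.01, None),
    ("w1", "w2"): (0.2, 0.1),   # alpha(w1 w2) = 0.1
    ("w2",): (0.3, 0.5),        # alpha(w2)    = 0.5
    ("a",): (1e-6, None),       # phoneme unigram with small probability delta
}

def backoff_prob(table, history, word):
    """Return P(word | history), backing off to shorter histories."""
    key = history + (word,)
    if key in table:
        return table[key][0]
    if not history:
        return 0.0  # unseen unigram
    entry = table.get(history)
    # A missing back-off coefficient is treated as 1 in this sketch.
    alpha = entry[1] if entry and entry[1] is not None else 1.0
    return alpha * backoff_prob(table, history[1:], word)

p_phoneme = backoff_prob(ngrams, ("w1", "w2"), "a")
```

Here `p_phoneme` equals 0.1 × 0.5 × 10⁻⁶, that is, α(w₁w₂)·α(w₂)·δ for the example values.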

The value of the probability δ is set sufficiently small compared with the probability values in the general-purpose language model so that subword unigrams are pruned during beam search. In the example of FIG. 4, δ is desirably 1/100 or less of the trigram probability p(w₃|w₁w₂) of the word string w₁w₂w₃. In practice, these probability values are converted to a logarithmic scale, and during decoding the log-scale weight attached to an arc is added each time that arc is traversed.
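A small numeric sketch shows why such a δ causes pruning once weights are accumulated on a log scale. The beam criterion, the beam width, and the concrete probabilities below are assumptions chosen for illustration; the source only states that δ should be small enough for subword unigrams to be pruned.

```python
import math

trigram_prob = 0.01          # example probability of a word trigram
delta = trigram_prob / 100   # hypothetical choice: 1/100 of that probability

def path_log_weight(probabilities):
    """Log-scale weights are summed as each arc is traversed."""
    return sum(math.log(p) for p in probabilities)

def survives_beam(score, best_score, beam_width):
    """Simple beam criterion (assumed): keep paths within beam_width of the best."""
    return score >= best_score - beam_width

word_path = path_log_weight([trigram_prob])   # one word arc
phoneme_path = path_log_weight([delta] * 3)   # three phoneme unigram arcs
beam_width = 10.0                             # hypothetical beam width
```

With these values the word path scores about -4.6 and the three-phoneme path about -27.6, so the phoneme path falls outside the assumed beam and is discarded.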

The procedure by which the language model WFST creation unit 18 creates the language model WFST is described in more detail. First, the language model WFST creation unit 18 reads the general-purpose language model stored in the general-purpose language model storage unit 14. Next, it reads the phoneme list stored in the phoneme list storage unit 12, assigns the probability δ to each phoneme, and adds each phoneme to the general-purpose language model as a unigram entry. Based on the general-purpose language model extended with these phoneme entries, it builds the language model WFST so that it conforms to the configuration described above. Finally, it stores the created language model WFST in the language model WFST storage unit 26.
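The extension step alone (before the WFST is built) might look like the following sketch. The dictionary-based model representation, the function name `add_phoneme_unigrams`, and the sanity check on δ are assumptions added for illustration.

```python
def add_phoneme_unigrams(language_model, phoneme_list, delta):
    """Return a copy of the model with each phoneme added as a unigram of probability delta.

    language_model maps a word tuple to (probability, back-off coefficient or None).
    """
    smallest = min(prob for prob, _ in language_model.values())
    if delta >= smallest:
        # Assumed sanity check: delta must be below every existing probability.
        raise ValueError("delta must be smaller than every probability in the model")
    extended = dict(language_model)
    for phoneme in phoneme_list:
        extended[(phoneme,)] = (delta, None)
    return extended

# Hypothetical general-purpose model with placeholder words w1, w2, w3.
general_lm = {
    ("w1", "w2", "w3"): (0.01, 0.1),
    ("w3",): (0.02, 0.5),
}
extended_lm = add_phoneme_unigrams(general_lm, ["a", "i", "k"], delta=1e-4)
```

The original model is left untouched, so the same general-purpose language model can still be used elsewhere.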

In step S28, the general-purpose WFST creation unit 28 reads the phoneme WFST stored in the phoneme WFST storage unit 20, the acoustic model WFST stored in the acoustic model WFST storage unit 22, the dictionary WFST stored in the dictionary WFST storage unit 24, and the language model WFST stored in the language model WFST storage unit 26. It composes and optimizes the phoneme WFST, the acoustic model WFST, the dictionary WFST, and the language model WFST to create the general-purpose WFST, and stores the created general-purpose WFST in the general-purpose WFST storage unit 30.

Because the general-purpose WFST of the present invention is built from the dictionary WFST and the language model WFST configured as described above, it can accept output in units of phonemes or subwords. When decoding is performed with the general-purpose WFST alone, the phoneme or subword arcs carry the very small probability δ, so they are pruned during beam search and never appear in the recognition result.

In step S36, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and the new word reading list stored in the new word reading list storage unit 34. It then creates a word addition WFST that, when the phoneme string output by the general-purpose WFST matches any of the phoneme strings representing the readings of new words, assigns the weight of the matched new word and outputs that new word, and, when the output of the general-purpose WFST matches any of the general-purpose words, keeps the weight assigned by the general-purpose WFST and outputs that general-purpose word.

With reference to FIG. 5, an example of the configuration of the word addition WFST introduced in the present invention will be described. The word addition WFST includes arcs that input and output general-purpose words, and arcs that take a sequence of phonemes or subwords as input and output a new word. In this example the general-purpose words are "赤" (aka, "red") and "秋" (aki, "autumn"), and the new words are "阿川" (akou) and "明夫" (akio). The arcs that input and output general-purpose words are the top two arcs, labeled "赤:赤/1" and "秋:秋/1". These arcs are given a weight of probability 1 (0 on a logarithmic scale) so that they leave the accumulated weight unchanged. The arcs that take a phoneme or subword sequence as input and output a new word are the bottom two arc chains: one takes "a,k,o,u" as input and outputs "阿川", and the other takes "a,k,i,o" as input and outputs "明夫". Only the first arc of each chain that outputs a new word is given the weight "1/δ⁴". When a new word consists of four phonemes, the language model WFST described above applies the fourth power of the very small probability δ as a weight. Unless a value that cancels this δ⁴ is assigned, the new word is pruned and does not appear in the recognition result.

The procedure by which the word addition WFST creation unit 36 creates the word addition WFST will now be described in more detail. First, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and, for each word, generates an arc that has that word as both input and output. Next, the word addition WFST creation unit 36 reads the new word reading list stored in the new word reading list storage unit 34 and adds arcs that take each phoneme string as input and output the corresponding new word. At this time, the first arc of the arc chain corresponding to each new word is given a weight of probability "1/δ^M", where M is the number of phonemes in the new word. The word addition WFST creation unit 36 then stores the created word addition WFST in the word addition WFST storage unit 38.
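The construction procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the arc tuple layout `(source, destination, input, output, weight)`, the function name `build_word_addition_arcs`, the `<eps>` label, and the choice of a single state 0 that every word path leaves and returns to are all assumptions made for the example. Weights are kept as probabilities, and the new word is emitted on the first arc of its chain.

```python
from fractions import Fraction

EPS = "<eps>"  # hypothetical label for the empty output symbol


def build_word_addition_arcs(general_words, new_word_readings, delta):
    """Return arcs as (source, destination, input, output, weight) tuples.

    Assumed topology: every word path starts and ends at state 0.
    """
    arcs = []
    start = 0
    next_state = 1

    # General-purpose words: word in, same word out, weight 1 (no change).
    for word in general_words:
        arcs.append((start, start, word, word, Fraction(1)))

    # New words: one arc per phoneme; only the first arc carries 1/delta^M.
    for word, phonemes in new_word_readings.items():
        m = len(phonemes)
        src = start
        for i, phoneme in enumerate(phonemes):
            if i == m - 1:
                dst = start
            else:
                dst = next_state
                next_state += 1
            output = word if i == 0 else EPS
            weight = Fraction(1) / delta ** m if i == 0 else Fraction(1)
            arcs.append((src, dst, phoneme, output, weight))
            src = dst
    return arcs


arcs = build_word_addition_arcs(
    ["赤", "秋"],
    {"阿川": ["a", "k", "o", "u"], "明夫": ["a", "k", "i", "o"]},
    Fraction(1, 10),  # illustrative value of delta
)
```

With δ = 1/10 and four phonemes, the first arc of each new-word chain carries the weight 10000, which exactly cancels the δ⁴ contributed by the language model WFST.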

<Speech recognition device>
With reference to FIG. 6, an example of the functional configuration of the speech recognition device according to the first embodiment will be described. The speech recognition device 2 of the first embodiment executes speech recognition processing using the speech recognition WFST created by the speech recognition WFST creation device 1.

The speech recognition device 2 includes a general-purpose WFST storage unit 30, a word addition WFST storage unit 38, an input unit 50, a speech recognition unit 52, and an output unit 54. The speech recognition device 2 may also be configured as the same device as the speech recognition WFST creation device 1, sharing the general-purpose WFST storage unit 30 and the word addition WFST storage unit 38.

The speech recognition device 2 is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (random access memory, RAM), and the like. The speech recognition device 2 executes each process under the control of the central processing unit, for example. Data input to the speech recognition device 2 and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processes. Each storage unit of the speech recognition device 2 can be configured, for example, as a main storage device such as RAM, as an auxiliary storage device such as a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or as middleware such as a relational database or a key-value store. The storage units of the speech recognition device 2 need only be logically separated and may reside on a single physical storage device.

The general-purpose WFST storage unit 30 stores in advance the general-purpose WFST created by the speech recognition WFST creation device 1. When the speech recognition device 2 is configured as the same device as the speech recognition WFST creation device 1, the general-purpose WFST created by the speech recognition WFST creation device 1 can be used as it is. When the speech recognition device 2 is configured as a separate device, the general-purpose WFST stored in the general-purpose WFST storage unit 30 of the speech recognition WFST creation device 1 is copied, online or offline, into the general-purpose WFST storage unit 30 of the speech recognition device 2.

The word addition WFST storage unit 38 stores in advance the word addition WFST created by the speech recognition WFST creation device 1. When the speech recognition device 2 is configured as the same device as the speech recognition WFST creation device 1, the word addition WFST created by the speech recognition WFST creation device 1 can be used as it is. When the speech recognition device 2 is configured as a separate device, the word addition WFST stored in the word addition WFST storage unit 38 of the speech recognition WFST creation device 1 is copied, online or offline, into the word addition WFST storage unit 38 of the speech recognition device 2.

With reference to FIG. 7, an example of the processing flow of the speech recognition method executed by the speech recognition device 2 will be described in the order in which the steps are actually performed.

In step S50, speech data is input to the input unit 50. The input speech data is passed to the speech recognition unit 52.

In step S52, the speech recognition unit 52 reads the general-purpose WFST stored in the general-purpose WFST storage unit 30 and the word addition WFST stored in the word addition WFST storage unit 38, and converts the input speech data into a speech recognition result using a speech recognition WFST obtained by dynamically composing the word addition WFST with the general-purpose WFST.

The speech recognition unit 52 supports decoding with on-the-fly composition. On-the-fly composition is a method in which WFSTs are composed while the speech recognition process is running. Because only the required part of the search network is composed dynamically, memory consumption during recognition can be reduced. On-the-fly composition is particularly effective for language models. When a language model is converted into a WFST, it forms a network over the bigrams and trigrams of a large vocabulary, so composing it in advance yields an enormous WFST. Composing only the language model on-the-fly therefore reduces memory consumption considerably. Details of on-the-fly composition are given in, for example, 大西翼 et al., "ＷＦＳＴに基づくＴ３音声認識デコーダ" (a T3 speech recognition decoder based on WFST), Journal of the Information Processing Society of Japan, Vol. 51, No. 11, 2010.

The speech recognition unit 52 composes the general-purpose WFST and the word addition WFST incrementally at decoding time using on-the-fly composition. The incremental composition is performed based on the following equation.
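The equation itself is not reproduced in this text. A form consistent with the symbol definitions given for it, offered only as an assumed reconstruction and not as the published equation, is:

```latex
\left( H \circ C \circ L \circ G \right) \diamond L_{p}
```

In this reading, the parenthesized part corresponds to the general-purpose WFST composed and optimized in advance, and $\diamond\, L_{p}$ corresponds to the word addition WFST being composed with it at decoding time.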

Here, H denotes the acoustic model WFST, C the phoneme WFST, L the dictionary WFST, G the language model WFST, and L_p the word addition WFST. The ◇ operator denotes on-the-fly composition, and the ○ operator denotes ordinary, non-incremental composition accompanied by optimization.

The language model WFST in the general-purpose WFST can also be configured so that only its trigram portion is composed on-the-fly. In this case, the incremental composition is performed based on the following equation.

Here, G_uni denotes a language model WFST consisting only of unigrams, and G_tri/uni denotes a language model WFST whose weights are the trigram probabilities with the unigram probabilities subtracted out.

With on-the-fly composition, the general-purpose WFST and the word addition WFST do not have to be composed in advance, so each can be stored separately. At decoding time, only the general-purpose WFST is kept loaded in memory. When the user provides speech input, the word addition WFST is loaded into memory immediately and composed with the general-purpose WFST, and speech recognition is executed.

In step S54, the recognition result obtained by the speech recognition unit 52 is output externally from the output unit 54.

Because the general-purpose WFST is assumed to be a large-vocabulary model, the words specific to a user are limited to a small number, such as the names of family members and friends. The user-specific word addition WFST can therefore be kept small, which allows the speech recognition device 2 to load it into memory almost instantly.

[Second Embodiment]
In the speech recognition WFST according to the first embodiment, the probability finally assigned to a new word is the value of the back-off coefficient itself. Normally, however, a new word is given the appearance probability of an unknown word in the language model of the general-purpose WFST in addition to the back-off coefficient. The weights in the word addition WFST are therefore changed so that new words receive the unknown-word appearance probability.

The speech recognition WFST creation device and the speech recognition device according to the second embodiment differ from those of the first embodiment in the configuration of the word addition WFST and in the operation of the word addition WFST creation unit.

With reference to FIG. 8, an example of the configuration of the word addition WFST according to the second embodiment will be described. In the word addition WFST according to the second embodiment, the weight of the first arc of each new word is multiplied by the unknown-word appearance probability p(unk). With this configuration, every new word is output with the appearance probability p(unk).

The process of creating the word addition WFST is as follows. First, as in the first embodiment, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and, for each word, generates an arc that has that word as both input and output. Next, the word addition WFST creation unit 36 reads the new word reading list stored in the new word reading list storage unit 34 and adds arcs that take each phoneme string as input and output the corresponding new word. At this time, the first arc of the arc chain corresponding to each new word is given a weight of probability "(1/δ^M)*p(unk)".

With this configuration, the word addition WFST of the second embodiment can assign the unknown-word appearance probability p(unk) to every new word.
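As a worked check of why p(unk) is what remains, consider only the factors contributed by the M phoneme arcs of the language model WFST (each δ) and by the first arc of the new-word chain in the word addition WFST. This short derivation is an illustration that ignores any other factors on the path:

```latex
\underbrace{\delta^{M}}_{\text{language model WFST}}
\times
\underbrace{\frac{1}{\delta^{M}}\, p(\mathrm{unk})}_{\text{first arc of the new word}}
= p(\mathrm{unk})
```

The δ terms cancel regardless of the word length M, so every new word ends up carrying the same factor p(unk).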

[Third Embodiment]
In the first and second embodiments, the appearance probability that the word addition WFST assigns to new words is uniform. However, words differ in how often they are uttered, so recognition accuracy can be improved further by adjusting the appearance probability to match that frequency. Recognition accuracy can also be improved by adjusting, for each user, the appearance probabilities of general-purpose words as well as new words. The speech recognition WFST creation device and the speech recognition device according to the third embodiment of the present invention are configured so that the appearance probabilities of general-purpose words and new words can be adjusted for each user.

<Speech recognition WFST creation device>
With reference to FIG. 9, an example of the functional configuration of the speech recognition WFST creation device according to the third embodiment will be described. Like the speech recognition WFST creation device 1 according to the first embodiment, the speech recognition WFST creation device 3 according to the third embodiment includes a general-purpose word pronunciation dictionary storage unit 10, a phoneme list storage unit 12, a general-purpose language model storage unit 14, a dictionary WFST creation unit 16, a language model WFST creation unit 18, a phoneme WFST storage unit 20, an acoustic model WFST storage unit 22, a dictionary WFST storage unit 24, a language model WFST storage unit 26, a general-purpose WFST creation unit 28, a general-purpose WFST storage unit 30, a general-purpose word list storage unit 32, a new word reading list storage unit 34, a word addition WFST creation unit 36, and a word addition WFST storage unit 38. It further includes a personal language model storage unit 40, a personal language model WFST creation unit 42, and a personal language model WFST storage unit 44.

The personal language model storage unit 40 stores a personal language model prepared in advance. The personal language model has the same structure as the general-purpose language model: each entry is a combination of an N-gram word string, its N-gram probability, and its back-off coefficient. The personal language model can be created, for example, by computing trigram probabilities from transcriptions of an individual's speech. The word strings in the personal language model consist of words contained in the general-purpose words or the new words.

With reference to FIG. 10, an example of the processing flow of the speech recognition WFST creation method executed by the speech recognition WFST creation device 3 will be described in the order in which the steps are actually performed. The processing from step S16 to step S36 is the same as in the speech recognition WFST creation device 1 according to the first embodiment, so a detailed description is omitted.

In step S42, the personal language model WFST creation unit 42 reads the personal language model stored in the personal language model storage unit 40 and outputs a personal language model WFST. The personal language model WFST is built from the personal language model in the same way that the language model WFST is created from the general-purpose language model. The personal language model WFST creation unit 42 then stores the created personal language model WFST in the personal language model WFST storage unit 44.

<Speech recognition device>
With reference to FIG. 11, an example of the functional configuration of the speech recognition device according to the third embodiment will be described. Like the speech recognition device 2 according to the first embodiment, the speech recognition device 4 according to the third embodiment includes a general-purpose WFST storage unit 30, a word addition WFST storage unit 38, an input unit 50, a speech recognition unit 52, and an output unit 54. It further includes a personal language model WFST storage unit 44. As in the first embodiment, the speech recognition device 4 may be configured as the same device as the speech recognition WFST creation device 3, sharing the general-purpose WFST storage unit 30, the word addition WFST storage unit 38, and the personal language model WFST storage unit 44.

The speech recognition unit 52 of the third embodiment reads the general-purpose WFST stored in the general-purpose WFST storage unit 30, the word addition WFST stored in the word addition WFST storage unit 38, and the personal language model WFST stored in the personal language model WFST storage unit 44. It converts the input speech data into a speech recognition result using a speech recognition WFST obtained by dynamically composing the word addition WFST and the personal language model WFST with the general-purpose WFST. In other words, the personal language model WFST is composed on-the-fly after the speech recognition WFST used by the speech recognition unit 52 of the first embodiment. Specifically, the on-the-fly composition is performed based on the following equation.

Here, G_p denotes the personal language model WFST.

As in the first embodiment, the language model WFST can also be configured so that only its trigram portion is composed on-the-fly. In this case, the on-the-fly composition is performed based on the following equation.

With this configuration, the speech recognition device 4 of the third embodiment can perform decoding that reflects each individual's word appearance probabilities, which improves speech recognition accuracy.

[Fourth Embodiment]
In speech recognition, the uttered phonemes are matched against the words of the speech recognition WFST in chronological order, and only the set of matching words with high probability is retained as recognition hypotheses. A word that matches the uttered phonemes still drops out of the recognition hypotheses if its probability is low. In the configuration of the first embodiment, many new words can remain among the recognition hypotheses during recognition while general-purpose words drop out. New words then appear in large numbers in the recognition result as insertion errors.

This problem is explained concretely using an example in which "赤" (aka) is a general-purpose word and "阿賀/aka" is a new word. Suppose the speaker utters the phoneme string "a,k,a". When the first phoneme "a" has been input, the general-purpose word "赤" has not yet reached the end of the word, so no language probability has been assigned to it. The new word "阿賀/aka", on the other hand, receives the very small probability δ for "a" from the language model WFST, and the word addition WFST assigns to the arc "a:阿賀" the weight "1/δ³", which cancels the probability for all three phonemes. The new word "阿賀/aka" therefore carries a weight of δ*1/δ³ = 1/δ² at this point. The probability δ can take a value from 0 to 1, but when it is small, for example less than 0.1, 1/δ² becomes large. As a result, when the first phoneme "a" has been input, the general-purpose word "赤" has received no weight while the new word "阿賀/aka" has received a large one. In practice there are many general-purpose words and many new words, so "赤" may drop out of the recognition hypotheses. This problem is unlikely to arise when δ is given a relatively large value such as 0.8 or 0.9.

To solve this problem, the speech recognition WFST creation device and the speech recognition device according to the fourth embodiment differ from those of the first embodiment in the configuration of the word addition WFST and in the operation of the word addition WFST creation unit.

With reference to FIG. 12, an example of the configuration of the word addition WFST according to the fourth embodiment will be described. In the word addition WFST according to the fourth embodiment, each arc in the arc chain corresponding to a new word is given, as its weight, the reciprocal "1/δ" of the probability δ that the language model WFST assigns to a subword. For the new word "阿賀/aka", for example, the weight "1/δ" is assigned to each of the phoneme arcs "a:阿賀", "k:ε", and "a:ε". This configuration prevents a new word from receiving a large weight as soon as its first phoneme is input.

The process of creating the word addition WFST according to the fourth embodiment is as follows. First, as in the first embodiment, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and, for each word, generates an arc that has that word as both input and output. Next, the word addition WFST creation unit 36 reads the new word reading list stored in the new word reading list storage unit 34 and adds arcs that take each phoneme string as input and output the corresponding new word. At this time, each arc in the arc chain corresponding to a new word is given a weight of probability "1/δ".

With this configuration, the word addition WFST of the fourth embodiment no longer gives a new word a large weight when its first phoneme is input, which prevents new words from appearing in large numbers in the recognition result as insertion errors.
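The difference between the first and fourth embodiments can be seen by tracking the weight a new-word hypothesis has accumulated after each phoneme. The sketch below is purely illustrative: the function name `prefix_weights`, the scheme labels, and the value δ = 1/10 are assumptions, and only the factors named in the text are modeled (δ per phoneme from the language model WFST, plus the compensation from the word addition WFST).

```python
from fractions import Fraction


def prefix_weights(num_phonemes, delta, scheme):
    """Accumulated weight of a new-word hypothesis after each phoneme.

    scheme == "first":   1/delta**M on the first arc only (first embodiment)
    scheme == "per_arc": 1/delta on every arc (fourth embodiment)
    """
    total = Fraction(1)
    history = []
    for i in range(num_phonemes):
        total *= delta  # language model WFST: delta per phoneme
        if scheme == "first":
            if i == 0:
                total /= delta ** num_phonemes
        elif scheme == "per_arc":
            total /= delta
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        history.append(total)
    return history


delta = Fraction(1, 10)  # illustrative small probability
first = prefix_weights(3, delta, "first")
per_arc = prefix_weights(3, delta, "per_arc")
print(first)    # weights after "a", "k", "a" with first-arc compensation
print(per_arc)  # weights after "a", "k", "a" with per-arc compensation
```

For a three-phoneme word such as "阿賀/aka", the first scheme yields 100, 10, 1 (so 1/δ² after the first phoneme), whereas the per-arc scheme stays at 1 throughout. Both end at the same total, but only the per-arc scheme avoids the early spike that pushes general-purpose words out of the beam.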

[Fifth Embodiment]
In the second embodiment, the unknown-word appearance probability p(unk) is assigned to a new word when its first phoneme is uttered. A general-purpose word, however, receives its word probability in the dictionary WFST and the language model WFST only once its entire phoneme string has been uttered. Because the new word receives p(unk) earlier, its probability becomes lower at that point and it can drop out of the recognition hypotheses.

To solve this problem, the speech recognition WFST creation device and the speech recognition device according to the fifth embodiment differ from those of the second embodiment in the configuration of the word addition WFST and in the operation of the word addition WFST creation unit.

With reference to FIG. 13, an example of the configuration of the word addition WFST according to the fifth embodiment will be described. In the word addition WFST according to the fifth embodiment, the last arc of the arc chain corresponding to each new word is given the unknown-word appearance probability "p(unk)" as its weight. For the new word "阿川/akou", for example, the weight "p(unk)" is assigned to the last phoneme arc "u:ε". With this configuration, new words and general-purpose words receive their word probabilities at the same point in time.

The process of creating the word addition WFST according to the fifth embodiment is as follows. First, as in the second embodiment, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and, for each word, generates an arc that has that word as both input and output. Next, the word addition WFST creation unit 36 reads the new word reading list stored in the new word reading list storage unit 34 and adds arcs that take each phoneme string as input and output the corresponding new word. At this time, the last arc of the arc chain corresponding to each new word is given the unknown-word appearance probability "p(unk)" as its weight.

With this configuration, in the word addition WFST of the fifth embodiment, new words and general-purpose words receive their word probabilities at the same point in time, which prevents new words from dropping out of the recognition hypotheses.

[Sixth Embodiment]
A weighted finite state transducer may have undergone determinization/minimization and weight pushing in order to make processing more efficient. FIG. 14 shows an example of determinization/minimization and pushing. In FIG. 14, the arc chains in the upper row are the unprocessed WFST, the arc chain in the middle row is the result of determinization/minimization, and the arc chain in the lower row is the result of additionally applying pushing. Pushing is a process that, once the arcs shared between words have been merged by determinization/minimization, moves weight assigned to later arcs onto earlier arcs. Suppose there are two words, "赤/aka" (red) and "秋/aki" (autumn), where "赤/aka" has a weight of 0.8 and "秋/aki" has a weight of 0.6. Because "赤/aka" and "秋/aki" share the first two phonemes "ak", determinization/minimization can merge the shared part into a single chain of arcs. The portion of the weight that is common to both words can be assigned to the first arc without changing the overall output, so the common value 0.6 is assigned to the first phoneme "a:ε", the weight of "赤/aka" becomes the difference 0.2, and the weight of "秋/aki" becomes the difference 0. Pushing is thus the process of moving the common weight forward when multiple words are determinized and minimized.
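The pushing example can be reproduced with a small sketch. It is not a general WFST pushing algorithm: it assumes additive weights (consistent with 0.8 = 0.6 + 0.2 in the example), takes the common part of a shared prefix to be the minimum weight of the words below it, and assumes that no word is a prefix of another. The function name `push_weights` and the representation of arcs by their phoneme prefix are hypothetical.

```python
from typing import Dict, Tuple

Prefix = Tuple[str, ...]


def push_weights(words: Dict[Prefix, float]) -> Dict[Prefix, float]:
    """Merge shared prefixes and push the common weight toward the first arc.

    Each key of the result is the phoneme prefix that ends at an arc of the
    merged (prefix-tree) structure; the value is that arc's pushed weight.
    """
    # Common weight of a prefix: the smallest weight of any word sharing it.
    common: Dict[Prefix, float] = {}
    for phonemes, weight in words.items():
        for n in range(1, len(phonemes) + 1):
            prefix = phonemes[:n]
            common[prefix] = min(common.get(prefix, float("inf")), weight)
    # An arc keeps only what its parent prefix has not already accounted for.
    pushed: Dict[Prefix, float] = {}
    for prefix, value in common.items():
        parent_value = common.get(prefix[:-1], 0.0)  # root contributes 0
        pushed[prefix] = value - parent_value
    return pushed
```

For `{("a", "k", "a"): 0.8, ("a", "k", "i"): 0.6}` the sketch places 0.6 on the first arc "a", 0 on the shared arc "k", about 0.2 (up to floating-point rounding) on the final arc of "赤/aka", and 0 on the final arc of "秋/aki", so the total along each word's path is unchanged.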

In the second and fifth embodiments, the unknown-word appearance probability p(unk) is assigned to the first or the last arc of each new word, respectively. However, when the dictionary WFST and the language model WFST have undergone determinization/minimization and weight pushing, the word probability of a general-purpose word is not applied at a single point; it is distributed over the arcs of the word's phonemes so as to optimize the search of the dictionary WFST and the language model WFST. In the fifth embodiment, the probability assigned to a new word is concentrated on the last arc, whereas the probability of a general-purpose word is in practice distributed over its phonemes. The word probability is therefore applied to the general-purpose word earlier, its probability becomes lower, and as a result new words are spuriously output.

To solve this problem, in the speech recognition WFST creation device and the speech recognition device according to the sixth embodiment, the configuration of the word addition WFST and the operation of the word addition WFST creation unit differ from those of the speech recognition WFST creation device and the speech recognition device according to the fifth embodiment.

An example of the configuration of the word addition WFST according to the sixth embodiment is described with reference to FIG. 15. In the word addition WFST according to the sixth embodiment, the value obtained by dividing the unknown-word appearance probability p(unk) by the number of phonemes is assigned as a weight to each arc in the arc chain corresponding to each new word. For example, for the new word "阿川/akou", the weight "1/δ⁴ × p(unk)/4" is assigned to the first phoneme "a:阿川", and the weight "p(unk)/4" is assigned to the following phonemes "k:ε", "o:ε", and "u:ε". With this configuration, the timing at which the word probability is applied no longer differs greatly between general-purpose words and new words.

The process of creating the word addition WFST according to the sixth embodiment is described below. First, as in the second embodiment, the word addition WFST creation unit 36 reads the general-purpose word list stored in the general-purpose word list storage unit 32 and, for each word, generates an arc whose input and output are both that word. Next, the word addition WFST creation unit 36 reads the new word reading list stored in the new word reading list storage unit 34 and adds arcs that take each phoneme string as input and output the corresponding new word. At this time, the value "p(unk)/M", obtained by dividing the unknown-word appearance probability p(unk) by the number of phonemes, is assigned as a weight to each arc in the arc chain corresponding to the new word. Here, M denotes the number of phonemes of the new word.
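The per-arc rule p(unk)/M can be illustrated with the following minimal sketch. The function name `sixth_embodiment_chain` and the tuple layout (input label, output label, weight) are assumptions made for illustration. The sketch applies only the p(unk)/M rule stated above and does not model any additional factor on the first arc.

```python
from typing import List, Sequence, Tuple

EPS = "ε"  # empty output label


def sixth_embodiment_chain(
    word: str, phonemes: Sequence[str], p_unk: float
) -> List[Tuple[str, str, float]]:
    """Return (input, output, weight) arcs with p(unk)/M on every arc.

    Hypothetical helper; M is the number of phonemes of the new word.
    """
    m = len(phonemes)
    if m == 0:
        raise ValueError("a new word needs at least one phoneme")
    per_arc = p_unk / m
    # The first arc outputs the word; the remaining arcs output ε.
    return [
        (ph, word if i == 0 else EPS, per_arc)
        for i, ph in enumerate(phonemes)
    ]
```

For "阿川/akou" (M = 4), every arc in the returned chain carries the same weight p(unk)/4, in contrast to the fifth embodiment, where the whole of p(unk) sits on the last arc.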

With this configuration, in the word addition WFST of the sixth embodiment, the timing at which the word probability is applied does not differ greatly between general-purpose words and new words, which prevents new words from being spuriously output.

[Combination of Embodiments]
The configuration of the fourth embodiment can be combined with the configuration of the second, fifth, or sixth embodiment.

When the configuration of the fourth embodiment is combined with the second embodiment, the reciprocal "1/δ" of the probability δ is assigned as a weight to each arc in the arc chain corresponding to each new word, and the unknown-word appearance probability "p(unk)" is additionally assigned to the first arc. For example, for the new word "阿川/akou", the weight "p(unk)/δ" is assigned to the first phoneme "a:阿川", and the weight "1/δ" is assigned to the remaining phonemes "k:ε", "o:ε", and "u:ε".

When the configuration of the fourth embodiment is combined with the fifth embodiment, the reciprocal "1/δ" of the probability δ is assigned as a weight to each arc in the arc chain corresponding to each new word, and the unknown-word appearance probability "p(unk)" is additionally assigned to the last arc. For example, for the new word "阿川/akou", the weight "1/δ" is assigned to the first three phonemes "a:阿川", "k:ε", and "o:ε", and the weight "p(unk)/δ" is assigned to the last phoneme "u:ε".

When the configuration of the fourth embodiment is combined with the sixth embodiment, the value obtained by multiplying the reciprocal of the probability δ by the unknown-word appearance probability p(unk) and dividing the product by the number of phonemes is assigned as a weight to each arc in the arc chain corresponding to each new word. For example, for the new word "阿川/akou", the weight "p(unk)/4δ" is assigned to all of the phonemes "a:阿川", "k:ε", "o:ε", and "u:ε".
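The three combinations can be compared side by side with a short sketch that returns the per-arc weights for a new word of M phonemes. The function name `combined_weights` and the string values of its `base` parameter are hypothetical; the weight expressions themselves are the ones given in the three paragraphs above.

```python
from typing import List


def combined_weights(
    num_phonemes: int, p_unk: float, delta: float, base: str
) -> List[float]:
    """Per-arc weights when the fourth embodiment (1/δ per arc) is combined
    with the second, fifth, or sixth embodiment.

    `base` is a hypothetical selector: "second", "fifth", or "sixth".
    """
    if num_phonemes < 1 or delta <= 0:
        raise ValueError("need at least one phoneme and a positive delta")
    m = num_phonemes
    inv_delta = 1.0 / delta
    if base == "second":
        # p(unk)/δ on the first arc, 1/δ on the remaining arcs.
        return [p_unk / delta] + [inv_delta] * (m - 1)
    if base == "fifth":
        # 1/δ on all but the last arc, p(unk)/δ on the last arc.
        return [inv_delta] * (m - 1) + [p_unk / delta]
    if base == "sixth":
        # p(unk)/(M·δ) on every arc.
        return [p_unk / (m * delta)] * m
    raise ValueError(f"unknown base embodiment: {base}")
```

With M = 4, the `"sixth"` case yields p(unk)/4δ on every arc, matching the "阿川/akou" example, while the `"second"` and `"fifth"` cases differ only in whether p(unk)/δ sits on the first or the last arc.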

[Program, Recording Medium]
The present invention is not limited to the embodiments described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as otherwise required.

When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the program it has read. As another mode of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program, or it may execute the process according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented solely through execution instructions and the retrieval of results. The program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

DESCRIPTION OF SYMBOLS
1, 3 Speech recognition WFST creation device
2, 4 Speech recognition device
10 General-purpose word pronunciation dictionary storage unit
12 Phoneme list storage unit
14 General-purpose language model storage unit
16 Dictionary WFST creation unit
18 Language model WFST creation unit
20 Phoneme WFST storage unit
22 Acoustic model WFST storage unit
24 Dictionary WFST storage unit
26 Language model WFST storage unit
28 General-purpose WFST creation unit
30 General-purpose WFST storage unit
32 General-purpose word list storage unit
34 New word reading list storage unit
36 Word addition WFST creation unit
38 Word addition WFST storage unit
40 Personal language model storage unit
42 Personal language model WFST creation unit
44 Personal language model WFST storage unit
50 Input unit
52 Speech recognition unit
54 Output unit

Claims

A dictionary WFST creation unit that creates a dictionary WFST which takes a phoneme string as input and outputs a subword based on the phoneme string and, among predetermined general-purpose words, a general-purpose word whose reading is represented by a phoneme string matching the input phoneme string;
A language model WFST creation unit that creates a language model WFST which assigns the weight of the general-purpose word to each general-purpose word output from the dictionary WFST and assigns a weight δ smaller than any of the weights of the general-purpose words to each subword output from the dictionary WFST;
A general-purpose WFST creation unit that creates a general-purpose WFST using an acoustic model WFST that converts an input speech signal into an acoustic environment, a phoneme WFST that converts the acoustic environment into phonemes, the dictionary WFST, and the language model WFST; and
A word addition WFST creation unit that creates a word addition WFST which, when the output of the general-purpose WFST matches any phoneme string representing the reading of a new word, assigns the weight of the matched new word and outputs that new word, and, when the output of the general-purpose WFST matches any of the general-purpose words, outputs that general-purpose word while maintaining its weight;
A speech recognition WFST creation device including the units above.

The speech recognition WFST creation device according to claim 1,
wherein the word addition WFST creation unit creates a word addition WFST that, when the output of the general-purpose WFST matches the new word, assigns a weight obtained by multiplying the weight of the matched new word by the appearance probability of an unknown word in the language model WFST.

The speech recognition WFST creation device according to claim 2,
wherein the word addition WFST creation unit creates a word addition WFST that assigns the appearance probability of the unknown word as a weight to the last phoneme of the phoneme string representing the reading of the new word.

The speech recognition WFST creation device according to claim 2,
wherein the word addition WFST creation unit creates a word addition WFST that assigns, as a weight to each phoneme in the phoneme string representing the reading of the new word, a value obtained by dividing the appearance probability of the unknown word by the number of phonemes in the phoneme string.

The speech recognition WFST creation device according to any one of claims 1 to 4,
wherein the word addition WFST creation unit creates a word addition WFST that assigns the reciprocal of the weight δ as a weight to each phoneme in the phoneme string representing the reading of the new word.

The speech recognition WFST creation device according to any one of claims 1 to 5,
further including a personal language model WFST creation unit that creates a personal language model WFST which assigns, to at least one word included in the general-purpose words or the new words, a weight different from the weight output by the word addition WFST.

A general-purpose WFST storage unit that stores the general-purpose WFST created by the speech recognition WFST creation device according to claim 1;
A word addition WFST storage unit for storing the word addition WFST created by the speech recognition WFST creation device according to claim 1;
A speech recognition unit that converts an input speech signal into a speech recognition result using a speech recognition WFST in which the word addition WFST is dynamically synthesized with respect to the general-purpose WFST;
A speech recognition device.

The speech recognition device according to claim 7,
further including a personal language model storage unit that stores the personal language model WFST created by the speech recognition WFST creation device according to claim 6,
wherein the speech recognition unit converts the speech signal into the speech recognition result using a speech recognition WFST obtained by dynamically composing the word addition WFST and the personal language model WFST with the general-purpose WFST.

A dictionary WFST creation step in which a dictionary WFST creation unit creates a dictionary WFST which takes a phoneme string as input and outputs a subword based on the phoneme string and, among predetermined general-purpose words, a general-purpose word whose reading is represented by a phoneme string matching the input phoneme string;
A language model WFST creation step in which a language model WFST creation unit creates a language model WFST which assigns the weight of the general-purpose word to each general-purpose word output from the dictionary WFST and assigns a weight δ smaller than any of the weights of the general-purpose words to each subword output from the dictionary WFST;
A general-purpose WFST creation step in which a general-purpose WFST creation unit creates a general-purpose WFST using an acoustic model WFST that converts an input speech signal into an acoustic environment, a phoneme WFST that converts the acoustic environment into phonemes, the dictionary WFST, and the language model WFST; and
A word addition WFST creation step in which a word addition WFST creation unit creates a word addition WFST which, when the output of the general-purpose WFST matches any phoneme string representing the reading of a new word, assigns the weight of the matched new word and outputs that new word, and, when the output of the general-purpose WFST matches any of the general-purpose words, outputs that general-purpose word while maintaining its weight;
A speech recognition WFST creation method including the steps above.

The general-purpose WFST created by the speech recognition WFST creation method according to claim 9 is stored in the general-purpose WFST storage unit,
The word addition WFST created by the speech recognition WFST creation method according to claim 9 is stored in the word addition WFST storage unit,
A speech recognition step in which a speech recognition unit converts an input speech signal into a speech recognition result using a speech recognition WFST in which the word addition WFST is dynamically synthesized with respect to the general-purpose WFST;
A speech recognition method including:

A program for causing a computer to function as the speech recognition WFST creation device according to any one of claims 1 to 6 or the speech recognition device according to claim 7 or 8.