JP6876543B2

JP6876543B2 - Phoneme recognition dictionary generator and phoneme recognition device and their programs

Info

Publication number: JP6876543B2
Application number: JP2017126929A
Authority: JP
Inventors: 麻乃一木; 和穂尾上
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2017-06-29
Filing date: 2017-06-29
Publication date: 2021-05-26
Anticipated expiration: 2037-06-29
Also published as: JP2019012095A

Description

The present invention relates to a phoneme recognition dictionary generation device, and a program therefor, that generate a phoneme pronunciation dictionary and a phoneme language model used for phoneme recognition of spoken utterances, and to a phoneme recognition device, and a program therefor, that use the phoneme pronunciation dictionary and the phoneme language model.

Speech recognition normally uses a pronunciation dictionary that associates each word with its pronunciation sequence (phoneme string). The pronunciations registered in this dictionary are readings of the kind listed in a general dictionary.
However, the written reading often differs from the pronunciation actually uttered. In broadcast programs, for example, performers in information programs often pronounce words less distinctly than news announcers, whose pronunciation is accurate (close to that in the pronunciation dictionary).
A technique has therefore been disclosed that extends the pronunciation dictionary by using a statistical machine translation model to estimate, from phoneme strings that assume accurate pronunciation such as an announcer's, the words for the phoneme strings of indistinctly pronounced utterances (see Patent Document 1).

The technique of Patent Document 1 (hereinafter, the prior art) performs phoneme recognition while taking into account the dependence of the phoneme being recognized on the phonemes before and after it (context dependence).
The prior art learns, from a learning corpus, a pronunciation dictionary that treats each triphone as one word, together with a language model that gives the connection probabilities of triphones. A triphone represents a central phoneme together with the phonemes immediately before and after it. In the pronunciation of "keisatsu" (police), for example, the triphones include "(ke:) k-e:+s", "(sa) e:-s+a", and "(tsu) s-a+ts".
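The triphone notation above can be illustrated with a minimal sketch. The function name `to_triphones` and the full phoneme string `k e: s a ts u` for "keisatsu" are assumptions made for illustration; the text gives only the three triphone labels.

```python
def to_triphones(phonemes):
    """Label each phoneme as left-center+right, omitting a missing context."""
    labels = []
    for i, center in enumerate(phonemes):
        label = center
        if i > 0:
            label = phonemes[i - 1] + "-" + label   # preceding phoneme
        if i < len(phonemes) - 1:
            label = label + "+" + phonemes[i + 1]   # following phoneme
        labels.append(label)
    return labels


# Assumed phoneme string for "keisatsu" (police).
print(to_triphones(["k", "e:", "s", "a", "ts", "u"]))
```

The second to fourth labels correspond to the three triphones quoted in the text.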

The prior art then learns a statistical machine translation model from two phoneme strings: one obtained by forced phoneme alignment of a learning corpus that associates speech with its transcribed text (the standard phoneme string), and one obtained by phoneme recognition of the corpus speech using the triphone language model and pronunciation dictionary (the actual-utterance phoneme string).

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2016-161765

To learn the statistical machine translation model, the prior art described above uses a phoneme string obtained by forced phoneme alignment (standard phoneme string) and a phoneme string obtained by phoneme recognition (actual-utterance phoneme string). The quality of both phoneme strings is important for improving the accuracy of this model.
When the prior art generates the standard and actual-utterance phoneme strings from a learning corpus of accurately pronounced speech, such as that of announcers, and its transcribed text, the two phoneme strings should ideally be almost identical.
In the prior art, however, DP (dynamic programming) matching of the standard phoneme string against the actual-utterance phoneme string shows that 22.8% of the phonemes differ (the phoneme difference rate), and further improvement in phoneme recognition accuracy is desired.
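A difference rate of this kind can be sketched as follows. The text does not define how the rate is computed, so this sketch assumes an edit distance (substitutions, insertions, deletions) divided by the length of the standard phoneme string. The function names and the two example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme lists, computed by DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_difference_rate(ref, hyp):
    """Assumed definition: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference phoneme string is empty")
    return edit_distance(ref, hyp) / len(ref)


standard = "t o: ky o: n o h a sh i d e".split()   # hypothetical example
actual = "t o: ky o n o a sh i d e".split()        # hypothetical example
print(phoneme_difference_rate(standard, actual))
```

In this example one phoneme is substituted and one is dropped, so the distance is 2 over a reference of 12 phonemes.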

The present invention has been made in view of this problem. Its object is to provide a phoneme recognition dictionary generation device, and a program therefor, that generate a phoneme recognition dictionary (a phoneme pronunciation dictionary and a phoneme language model) that improves the accuracy of phoneme recognition, as well as a phoneme recognition device, and a program therefor, that use the phoneme pronunciation dictionary and the phoneme language model.

To solve the above problem, the phoneme recognition dictionary generation device according to the present invention generates a phoneme pronunciation dictionary and a phoneme language model used for phoneme recognition by using an acoustic model, a pronunciation dictionary, and a learning corpus, and comprises word-specific phoneme string generation means, phoneme string word generation means, phoneme pronunciation dictionary generation means, and phoneme language model generation means.

In this configuration, the word-specific phoneme string generation means performs speech recognition on the speech of the learning corpus based on the acoustic model and the pronunciation dictionary, and generates word-specific phoneme strings, that is, a phoneme string for each word corresponding to a headword registered in the pronunciation dictionary.
The phoneme string word generation means then converts each word-specific phoneme string into a one-word text data format to generate a phoneme string word. For example, it generates the phoneme string word by inserting a predetermined non-phoneme character (for example, "+") into the spaces between the phonemes of the word-specific phoneme string. This allows the phoneme recognition dictionary generation device to treat a phoneme string word as a single word.

The phoneme pronunciation dictionary generation means then generates the phoneme pronunciation dictionary by using each phoneme string word as a headword and the corresponding word-specific phoneme string as its pronunciation notation. In this way, the pronunciations of phoneme strings are registered in the phoneme pronunciation dictionary word by word.
Further, the phoneme language model generation means generates the phoneme language model by learning an N-gram language model over chains of phoneme string words from the list of phoneme string words generated by the phoneme string word generation means. The phoneme language model generation means thereby models the occurrence probabilities of phoneme string words so that the connection probabilities between phoneme string words can be calculated during phoneme recognition.

The phoneme recognition dictionary generation device can be operated by a phoneme recognition dictionary generation program that causes a computer to function as the word-specific phoneme string generation means, the phoneme string word generation means, the phoneme pronunciation dictionary generation means, and the phoneme language model generation means.

To solve the above problem, the phoneme recognition device according to the present invention recognizes the phonemes of speech by using an acoustic model together with the phoneme pronunciation dictionary and the phoneme language model generated by the phoneme recognition dictionary generation device, and comprises recognition means and phoneme string generation means.

In this configuration, the recognition means recognizes speech in units of phoneme string words by using the acoustic model, the phoneme pronunciation dictionary, and the phoneme language model. This allows the recognition means to recognize phoneme strings that depend on the connections between words.
The phoneme string generation means then separates each phoneme string word recognized by the recognition means, which is in the one-word text data format, into individual phonemes to generate a phoneme string. For example, it separates the phonemes by replacing the predetermined character (for example, "+") inserted between the phonemes of the word-specific phoneme string with a space.
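The separation step on the recognition side can be sketched as below. The function name `to_phoneme_string` and the recognized word list are hypothetical. The only behavior taken from the text is the replacement of "+" with a space.

```python
def to_phoneme_string(phoneme_words, sep="+"):
    """Split recognized phoneme string words back into individual phonemes."""
    phonemes = []
    for word in phoneme_words:
        phonemes.extend(word.split(sep))   # "n+o" -> ["n", "o"]
    return " ".join(phonemes)


# Hypothetical recognizer output, one phoneme string word per recognized word.
recognized = ["t+o:+ky+o:", "n+o", "h+a+sh+i", "d+e"]
print(to_phoneme_string(recognized))
```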

The phoneme recognition device can be operated by a phoneme recognition program that causes a computer to function as the recognition means and the phoneme string generation means.

The present invention provides the following advantageous effects.
According to the present invention, a phoneme pronunciation dictionary and a phoneme language model that treat phoneme strings in word units can be generated.
Using this phoneme pronunciation dictionary and phoneme language model, the connection probabilities of phonemes during phoneme recognition can be calculated by taking into account not only the dependence on the immediately preceding and following phonemes but also the dependence of phonemes within and between words. This improves the accuracy of recognizing phonemes from speech.

A block diagram showing the configuration of the phoneme recognition dictionary generation device according to the first embodiment of the present invention.
An explanatory diagram illustrating an example of word-specific phoneme string generation by the word-specific phoneme string generation means of FIG. 1, in which (a) shows an example transcription of speech in the learning corpus, (b) shows part of the pronunciation dictionary, and (c) shows an example of the generated word-specific phoneme strings.
A diagram showing an example of phoneme notation.
A diagram showing an example of the phoneme pronunciation dictionary generated by the phoneme pronunciation dictionary generation means of FIG. 1.
A diagram showing an example of the phoneme string word list generated by the phoneme string word generation means of FIG. 1.
A diagram showing an example of the phoneme language model generated by the phoneme language model generation means of FIG. 1.
A flowchart showing the operation of the phoneme recognition dictionary generation device according to the first embodiment of the present invention.
A block diagram showing the configuration of the phoneme recognition device according to the second embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
<First Embodiment>
[Configuration of the phoneme recognition dictionary generation device]
First, the configuration of the phoneme recognition dictionary generation device 1 according to the first embodiment of the present invention is described with reference to FIG. 1.

The phoneme recognition dictionary generation device 1 generates a phoneme pronunciation dictionary and a phoneme language model as dictionaries for recognizing phonemes from speech data. It generates the phoneme pronunciation dictionary 50 and the phoneme language model 60 from the learning corpus 20, the pronunciation dictionary 30, and the acoustic model 40, which are stored in the learning corpus storage device 2, the pronunciation dictionary storage device 3, and the acoustic model storage device 4, respectively.

Specifically, the phoneme recognition dictionary generation device 1 generates word-specific phoneme strings from the learning corpus by forced alignment, and converts each generated word-specific phoneme string into a one-word text data format to generate a phoneme string word. It then generates the phoneme pronunciation dictionary 50 by using each generated phoneme string word as a headword and the corresponding word-specific phoneme string as its pronunciation notation. It further learns an N-gram language model from the list of generated phoneme string words to generate the phoneme language model 60.

The learning corpus 20 is data in which a large amount of speech data (speech corpus) is associated in advance with transcribed text of that speech data (text corpus). It consists, for example, of about 1000 hours of speech by announcers, reporters, and others in news programs, information programs, and the like (speech corpus), and the text transcribed from that speech (text corpus).

The pronunciation dictionary 30 is a dictionary that gives the pronunciation notation (phoneme string) for each headword, which is a predetermined character string (here, a word).
The pronunciation dictionary 30 is a general pronunciation dictionary, for example one in which headwords (words) have been manually associated with their pronunciation notations (phoneme strings).

The acoustic model 40 is a deep neural network (DNN) acoustic model learned in advance from a large amount of speech data. For example, the DNN input is a feature vector obtained by adding temporal derivatives (Δ + ΔΔ) to 40-dimensional mel filter bank log power and concatenating (splicing) the features of 11 frames, and the DNN has eight hidden layers.
The likelihood of the acoustic features in the acoustic model 40 may instead be calculated with a hidden Markov model (HMM) or a Gaussian mixture model (GMM) acoustic model.
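The size of the DNN input described above can be worked through in a short sketch. It assumes that Δ and ΔΔ each add 40 dimensions (120 per frame), that the 11 spliced frames are the current frame plus five on each side, and that edge frames are repeated at utterance boundaries. None of these details is stated in the text, and the names `N_MEL`, `CONTEXT`, and `splice` are hypothetical.

```python
N_MEL = 40      # mel filter bank log-power dimensions (from the text)
CONTEXT = 5     # assumed: 5 frames on each side, 11 frames in total


def splice(frames, t, context=CONTEXT):
    """Concatenate frames t-context..t+context, repeating edge frames."""
    last = len(frames) - 1
    spliced = []
    for k in range(t - context, t + context + 1):
        spliced.extend(frames[min(max(k, 0), last)])  # clamp to valid range
    return spliced


# Dummy utterance: 20 frames of static + delta + delta-delta features.
frames = [[0.0] * (N_MEL * 3) for _ in range(20)]
dnn_input = splice(frames, 0)
print(len(dnn_input))  # 120 dimensions x 11 frames = 1320
```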
Hereinafter, the configuration of the phoneme recognition dictionary generation device 1 will be described in detail.

As shown in FIG. 1, the phoneme recognition dictionary generation device 1 comprises word-specific phoneme string generation means 10, phoneme string word generation means 11, phoneme pronunciation dictionary generation means 12, phoneme string word list storage means 13, and phoneme language model generation means 14.
The phoneme recognition dictionary generation device 1 is also externally connected to a phoneme pronunciation dictionary storage device 5, which stores the generated phoneme pronunciation dictionary 50, and a phoneme language model storage device 6, which stores the generated phoneme language model 60. The phoneme pronunciation dictionary storage device 5 and the phoneme language model storage device 6 may of course be provided inside the phoneme recognition dictionary generation device 1, and they may be implemented as a single storage device.

The word-specific phoneme string generation means 10 performs forced alignment of the speech (speech corpus) of the learning corpus 20 based on the pronunciation dictionary 30 and the acoustic model 40. It thereby segments the phoneme string of the speech by each word corresponding to a headword registered in the pronunciation dictionary 30 and generates word-specific phoneme strings.

The word-specific phoneme string generation means 10 extracts, from the speech of the learning corpus 20, acoustic features corresponding to the acoustic model 40 (mel-frequency cepstral coefficients, etc.). It then uses the pronunciation dictionary 30 and the acoustic model 40 to perform speech recognition with the transcribed text (text corpus) of the speech as prior knowledge, and performs forced alignment against the character strings (headwords) registered in the pronunciation dictionary 30. As shown in FIG. 2(b), a word registered in the pronunciation dictionary 30 may have multiple pronunciations. From their phoneme strings, the word-specific phoneme string generation means 10 selects the one closest to the speech (the most likely one) and generates the word-specific phoneme string.

FIG. 2 shows an example of word-specific phoneme string generation by the word-specific phoneme string generation means 10. For example, when speech data for "Sekai-ichi mijikai Tōkyō no hashi de ibento ga hirakaremashita" ("An event was held at the world's shortest bridge, in Tokyo") is input as the learning corpus 20, the word-specific phoneme string generation means 10 extracts acoustic features corresponding to the acoustic model 40.
It then performs speech recognition using the pronunciation dictionary 30 shown in FIG. 2(b) and the acoustic model 40, with the transcribed text of the learning corpus 20 shown in FIG. 2(a), "Sekai-ichi mijikai Tōkyō ...", which corresponds to the speech data, as prior knowledge.

As a result, as shown in FIG. 2(c), the word-specific phoneme string generation means 10 generates a phoneme string for each word (word-specific phoneme strings): "s_△e_△k_△a_△i_△i_△ch_△i / m_△i_△j_△i_△k_△a_△i / t_△o:_△ky_△o: / ..." (where "_△" denotes a space).
The word-specific phoneme string generation means 10 outputs the generated word-specific phoneme strings to the phoneme string word generation means 11.

The phoneme string word generation means 11 generates phoneme string words by converting each word-specific phoneme string generated by the word-specific phoneme string generation means 10 into a one-word text data format.
It converts a phoneme string whose phonemes are individually separated into the one-word text data format by inserting a predetermined non-phoneme character between the phonemes of the word-specific phoneme string.

Specifically, the phoneme string word generation means 11 replaces the spaces in a word-specific phoneme string, which contains a space between phonemes, with a predetermined non-phoneme character to form a single word text. For example, it replaces the spaces with "+", converting "s_△e_△k_△a_△i_△i_△ch_△i" into "s+e+k+a+i+i+ch+i".
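The space-to-"+" conversion can be sketched as follows, using the word-specific phoneme strings from the FIG. 2(c) example with ordinary spaces in place of "_△". The function name `to_phoneme_word` and the separator check are assumptions added for illustration.

```python
def to_phoneme_word(phoneme_string, sep="+"):
    """Join the space-separated phonemes of one word into a single token."""
    phonemes = phoneme_string.split()
    if any(sep in p for p in phonemes):
        # The separator must not be a phoneme symbol, or the word
        # could not be split back into phonemes unambiguously.
        raise ValueError("separator occurs inside a phoneme symbol")
    return sep.join(phonemes)


# Word-specific phoneme strings, with "/" marking word boundaries.
aligned = "s e k a i i ch i / m i j i k a i / t o: ky o:"
phoneme_words = [to_phoneme_word(w) for w in aligned.split("/")]
print(phoneme_words)
```

Each resulting token contains no spaces, so downstream dictionary and language model tools can treat it as one word.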

The phoneme string word generation means 11 pairs each space-separated word-specific phoneme string with its text-replaced phoneme string word and outputs the pairs sequentially to the phoneme pronunciation dictionary generation means 12. It also writes only the text-replaced phoneme string words sequentially to the phoneme string word list storage means 13.

The phoneme pronunciation dictionary generation means 12 generates the phoneme pronunciation dictionary, a pronunciation dictionary of phoneme string words in which each phoneme string is regarded as a word. As shown in FIG. 1, it comprises word-specific phoneme string registration means 120 and combination phoneme string registration means 121.

The word-specific phoneme string registration means 120 generates a phoneme pronunciation dictionary in which word-specific phoneme strings and phoneme string words are registered as pairs. It takes each phoneme string word generated by the phoneme string word generation means 11 as a headword and the word-specific phoneme string paired with it as the pronunciation of that headword, and registers them in the phoneme pronunciation dictionary 50 of the phoneme pronunciation dictionary storage device 5.

When word-specific phoneme strings with different pronunciations are input for phoneme string words that form the same headword, the word-specific phoneme string registration means 120 registers multiple pronunciations for that headword. When a word-specific phoneme string with the same pronunciation is input for a phoneme string word that forms the same headword, it does not register the entry again.
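The registration rule above can be sketched with an in-memory dictionary. The data structure (a mapping from headword to a list of pronunciations) and the function name `register` are assumptions for illustration. The text does not specify the storage format of the phoneme pronunciation dictionary 50.

```python
def register(dictionary, headword, pronunciation):
    """Add a pronunciation for a headword, skipping exact duplicates.

    Returns True if a new entry was added, False if it already existed.
    """
    pronunciations = dictionary.setdefault(headword, [])
    if pronunciation in pronunciations:
        return False                      # same headword, same pronunciation
    pronunciations.append(pronunciation)  # new or additional pronunciation
    return True


phoneme_dict = {}
pairs = [
    ("t+o:+ky+o:", "t o: ky o:"),
    ("n+o", "n o"),
    ("t+o:+ky+o:", "t o: ky o:"),   # duplicate pair, not registered again
]
for headword, pronunciation in pairs:
    register(phoneme_dict, headword, pronunciation)
print(phoneme_dict)
```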

The combination phoneme string registration means 121 registers in the phoneme pronunciation dictionary pairs consisting of a headword, which is a phoneme string composed of an arbitrary combination of phonemes and regarded as a word, and that phoneme string.
Specifically, for the phonemes shown in the example of FIG. 3, the combination phoneme string registration means 121 registers in the phoneme pronunciation dictionary 50 of the phoneme pronunciation dictionary storage device 5 the phoneme strings for every combination of all the phonemes (40 phonemes in the example of FIG. 3) up to a predetermined maximum number of phonemes (here, 4), that is, 40^1 + 40^2 + 40^3 + 40^4 combinations.

Like the phoneme string word generation means 11, the combination phoneme string registration means 121 converts each combined phoneme string into a one-word text data format. Specifically, it replaces the spaces in the combined phoneme string with a predetermined non-phoneme text (here, "+") to form a word, which it uses as the headword.
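The enumeration performed by the combination phoneme string registration means 121 can be sketched as below. The count 40^1 + 40^2 + 40^3 + 40^4 implies ordered sequences with repetition, which is what `itertools.product` produces. The function names and the three-phoneme toy inventory are hypothetical, chosen so that the full listing stays small.

```python
from itertools import product


def count_entries(n_phonemes, max_len):
    """Number of phoneme strings of length 1..max_len."""
    return sum(n_phonemes ** k for k in range(1, max_len + 1))


def combination_entries(phonemes, max_len):
    """Yield (headword, pronunciation) pairs for every phoneme sequence."""
    for k in range(1, max_len + 1):
        for combo in product(phonemes, repeat=k):
            yield "+".join(combo), " ".join(combo)


print(count_entries(40, 4))  # 40 + 1600 + 64000 + 2560000 = 2625640

# Toy inventory of 3 phonemes, up to 2 phonemes per entry: 3 + 9 = 12 entries.
dictionary_b = dict(combination_entries(["a", "k", "o:"], 2))
print(dictionary_b["k+o:"])
```

With 40 phonemes and a maximum length of 4, this part of the dictionary holds about 2.6 million entries.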

An example of the phoneme pronunciation dictionary 50 that the phoneme pronunciation dictionary generation means 12 registers in the phoneme pronunciation dictionary storage device 5 is now described with reference to FIG. 4.
As shown in FIG. 4, the phoneme pronunciation dictionary 50 consists of dictionary A, registered by the word-specific phoneme string registration means 120, and dictionary B, registered by the combination phoneme string registration means 121.
In dictionary A, the headword is the word-specific phoneme string, giving the pronunciation of a word contained in the transcription of the learning corpus 20, with its spaces replaced by "+". The pronunciation notation for that headword is the space-separated phoneme string (word-specific phoneme string).

In dictionary B, for every combination of all the phonemes up to the predetermined maximum number of phonemes, the headword is the combined phoneme string with its spaces replaced by "+", and the pronunciation notation for that headword is the space-separated phoneme string. As a result, headwords and pronunciation notations are registered in the phoneme pronunciation dictionary 50 even for phoneme combinations that do not occur in the learning corpus 20.
Returning to FIG. 1, the description of the configuration of the phoneme recognition dictionary generation device 1 continues.

The phoneme string word list storage means 13 stores the phoneme string words generated by the phoneme string word generation means 11 as a phoneme string word list. It can be implemented with a general storage device such as a semiconductor memory or a hard disk.

FIG. 5 shows an example of the phoneme string word list 130 stored in the phoneme string word list storage means 13. As shown in FIG. 5, the phoneme string word list 130 is a sequential record of the phoneme string words that the phoneme string word generation means 11 generates by replacing the spaces of the word-specific phoneme strings with "+".
The phoneme strings of the words contained in the transcription of the learning corpus 20 are written to the phoneme string word list 130 one after another, each as a single word.

The phoneme language model generation means 14 generates the phoneme language model by learning from the phoneme string word list 130 stored in the phoneme string word list storage means 13.
The phoneme language model is a probabilistic model (statistical language model) that assigns to any string of phoneme string words the probability (likelihood) that it is a sentence. It is, for example, an N-gram language model which, as expressed by equation (1), gives the conditional probability (N-gram probability) that the i-th phoneme string word w_i appears after the phoneme string word sequence w_1 w_2 ... w_(i-1). To prevent numerical overflow, the likelihood of equation (1) is preferably expressed as a logarithm (log-likelihood).
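Equation (1) itself is not reproduced in this text. For reference, the standard N-gram formulation that matches the description above is sketched below. It is an assumed reconstruction rather than the equation as printed in the publication, and the sentence length n is a symbol introduced here.

```latex
P(w_1 w_2 \cdots w_n)
  = \prod_{i=1}^{n} P(w_i \mid w_1 w_2 \cdots w_{i-1})
  \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1} \cdots w_{i-1}),
\qquad
\log P(w_1 w_2 \cdots w_n)
  \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-N+1} \cdots w_{i-1})
```

Taking logarithms turns the product of many small probabilities into a sum, which is the numerical issue the log-likelihood remark addresses.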

For example, when the word string "Tōkyō no hashi de" ("at the bridge in Tokyo") occurs in the transcription of the learning corpus, the phoneme language model generation means 14 learns the N-gram language model from the learning text "t+o:+ky+o:_△n+o_△h+a+sh+i_△d+e", which consists of the phoneme string words "t+o:+ky+o:", "n+o", "h+a+sh+i", and "d+e" generated as the phoneme string word list 130.
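Learning from such text can be sketched with a maximum-likelihood 2-gram estimate, shown below without smoothing or sentence-boundary tokens. The function name `train_bigram`, the base-10 logarithm, and the second training sentence are assumptions for illustration. A practical system would more likely use an existing language modeling toolkit.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Return {(w1, w2): log10 P(w2 | w1)} from space-separated sentences."""
    history = Counter()   # how often w1 occurs as the first word of a pair
    pairs = Counter()     # how often the pair (w1, w2) occurs
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            history[w1] += 1
            pairs[(w1, w2)] += 1
    return {(w1, w2): math.log10(c / history[w1])
            for (w1, w2), c in pairs.items()}


learning_text = [
    "t+o:+ky+o: n+o h+a+sh+i d+e",   # from the example in the text
    "t+o:+ky+o: n+o e+k+i d+e",      # hypothetical second sentence
]
model = train_bigram(learning_text)
print(model[("n+o", "h+a+sh+i")])    # log10(1/2)
```

Because each phoneme string word is a single space-free token, the counting code is identical to that for an ordinary word-level language model.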

The phoneme language model generation means 14 assigns N-gram probabilities, by a general smoothing technique, to chains of phoneme string words that do not appear in the phoneme string word list 130 used as training text. As the smoothing technique, the phoneme language model generation means 14 can use, for example, back-off smoothing. Back-off smoothing estimates the N-gram probability of a chain of phoneme string words that does not appear in the training text from the N-gram probabilities assigned to shorter chains of phoneme string words.
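The back-off idea can be sketched as a lookup rule for a 2-gram model: use the stored 2-gram log probability if the pair was seen, otherwise fall back to the 1-gram log probability of the second word plus a back-off weight for the first word. This follows the convention used in ARPA-format language models and is an assumption, since the patent does not fix a particular back-off variant. All names and numeric values below are hypothetical.

```python
def backoff_log_prob(w1, w2, bigram_lp, unigram_lp, backoff_weight):
    """Return log10 P(w2 | w1), backing off to the 1-gram when the pair is unseen."""
    if (w1, w2) in bigram_lp:
        return bigram_lp[(w1, w2)]
    # Unseen pair: back-off weight of the history plus the 1-gram log probability.
    return backoff_weight.get(w1, 0.0) + unigram_lp[w2]


# Hypothetical log10 values, chosen only to illustrate the lookup.
bigram_lp = {("n+o", "h+a+sh+i"): -0.25}
unigram_lp = {"h+a+sh+i": -1.0, "d+e": -1.25}
backoff_weight = {"n+o": -0.5}

seen = backoff_log_prob("n+o", "h+a+sh+i", bigram_lp, unigram_lp, backoff_weight)
unseen = backoff_log_prob("n+o", "d+e", bigram_lp, unigram_lp, backoff_weight)
```

Here the unseen pair ("n+o", "d+e") still receives a finite log probability (-0.5 + -1.25), so no chain is scored as impossible.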

As a result, the phoneme language model generation means 14 can assign N-gram probabilities to chains of the phoneme string words that are headwords registered in the phoneme pronunciation dictionary 50, which includes all phoneme combinations.
The phoneme language model generation means 14 writes the generated phoneme language model to the phoneme language model storage device 6, where it is stored.

FIG. 6 shows an example of the phoneme language model 60 stored in the phoneme language model storage device 6. Here, a 2-gram language model is shown as an example of the N-gram language model.
As shown in FIG. 6, the phoneme language model 60 associates an N-gram probability (log P(w_2 | w_1)) with each pair of phoneme string words w_1, w_2.

By configuring the phoneme recognition dictionary generation device 1 as described above, the device can generate a phoneme pronunciation dictionary and a phoneme language model as dictionaries for recognizing phonemes from spoken speech. When phoneme recognition is performed, the phoneme pronunciation dictionary and phoneme language model generated in this way take into account not only the dependency between neighboring phonemes but also the dependency of phoneme strings within and between words, and can thereby improve the accuracy of phoneme recognition.
The phoneme recognition dictionary generation device 1 can be operated by a program (phoneme recognition dictionary generation program) that causes a computer (not shown) to function as each of the means described above.

[Operation of the phoneme recognition dictionary generation device]
Next, the operation of the phoneme recognition dictionary generation device 1 according to the first embodiment of the present invention will be described with reference to FIG. 7 (see FIG. 1 for the configuration as appropriate).

In step S1, the word-specific phoneme string generation means 10 extracts acoustic features from the speech of the learning corpus 20 and, using the pronunciation dictionary 30 and the acoustic model 40, performs speech recognition with the transcription text of that speech as prior knowledge, generating word-specific phoneme strings that are force-aligned to the headwords registered in the pronunciation dictionary 30.

In step S2, the phoneme string word generation means 11 generates a phoneme string word by replacing the spaces between the phonemes of the word-specific phoneme string generated in step S1 with one predetermined non-phoneme text (for example, “+”). This allows the word-specific phoneme string to be handled in subsequent operations as a single word text without spaces.
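The conversion in step S2 can be sketched as below. The function name and the assumption that a word-specific phoneme string arrives as a space-separated text are illustrative choices, not details fixed by the patent.

```python
def to_phoneme_string_word(word_phoneme_string, separator="+"):
    """Join the space-separated phonemes of one word into a single token."""
    phonemes = word_phoneme_string.split()
    return separator.join(phonemes)


phoneme_string_word = to_phoneme_string_word("t o: ky o:")
```

For the phoneme string "t o: ky o:" this yields "t+o:+ky+o:", which a language-model toolkit that tokenizes on spaces will treat as one word.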

In step S3, the phoneme string word generation means 11 writes the phoneme string words generated in step S2, in order, to the phoneme string word list storage means 13. As a result, the phoneme string word list storage means 13 holds the phoneme string word list 130, in which the phoneme strings corresponding to the speech of the learning corpus 20 are converted into text word by word.

In step S4, the phoneme pronunciation dictionary generation means 12 uses the word-specific phoneme string registration means 120 to register, in the phoneme pronunciation dictionary 50 of the phoneme pronunciation dictionary storage device 5, the phoneme string word generated in step S2 as a headword and the word-specific phoneme string generated in step S1 as the pronunciation notation corresponding to that headword (see dictionary A in FIG. 4).

In step S5, the word-specific phoneme string generation means 10 determines whether all of the speech of the learning corpus 20 has been input. If input of the learning corpus 20 has not finished (No in step S5), the phoneme recognition dictionary generation device 1 returns to step S1.
If input of the learning corpus 20 has finished (Yes in step S5), the phoneme recognition dictionary generation device 1 proceeds to step S6.

In step S6, the phoneme pronunciation dictionary generation means 12 uses the combination phoneme string registration means 121 to register, in the phoneme pronunciation dictionary 50 of the phoneme pronunciation dictionary storage device 5, pairs each consisting of a headword, in which a phoneme string composed of an arbitrary combination of phonemes is regarded as a word, and that phoneme string (see dictionary B in FIG. 4). As a result, headwords and pronunciation notations can be assigned to phoneme sequences that cannot be extracted from the learning corpus 20.
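A minimal sketch of step S6 follows, assuming that "combination" means every ordered sequence of phonemes, with repetition, up to a chosen length. The phoneme inventory, the length limit, and the function name are hypothetical; the actual contents of dictionary B are defined by FIG. 4, which is not reproduced here.

```python
from itertools import product


def build_combination_entries(phonemes, max_length, separator="+"):
    """Map each headword (joined phonemes) to its pronunciation (spaced phonemes)."""
    entries = {}
    for length in range(1, max_length + 1):
        for combo in product(phonemes, repeat=length):
            entries[separator.join(combo)] = " ".join(combo)
    return entries


# Tiny hypothetical inventory; a real phoneme set would be much larger.
entries = build_combination_entries(["a", "k", "sh"], max_length=2)
```

With three phonemes and a maximum length of 2 this produces 3 + 9 = 12 entries, for example the headword "k+a" with the pronunciation "k a". The number of entries grows exponentially with the length limit, which is one reason to keep that limit small.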

In step S7, the phoneme language model generation means 14 generates the phoneme language model 60, an N-gram language model, from the phoneme string word list 130 that was written in order to the phoneme string word list storage means 13 in step S3, and stores it in the phoneme language model storage device 6.

Further, in step S8, the phoneme language model generation means 14 assigns N-gram probabilities, by a smoothing technique, to chains of phoneme string words that do not appear in the phoneme string word list 130 derived from the learning corpus, including the headwords generated from phoneme combinations and registered in the phoneme pronunciation dictionary 50. This prevents the connection probability of phoneme string words from becoming “0” when phoneme recognition is performed using the phoneme language model 60.
Through the above operation, the phoneme recognition dictionary generation device 1 generates a phoneme pronunciation dictionary and a phoneme language model as dictionaries for recognizing phonemes from speech.

<Second Embodiment>
[Phoneme recognition device]
Next, the phoneme recognition device 200 according to the second embodiment of the present invention will be described with reference to FIG. 8.

The phoneme recognition device 200 recognizes phonemes from speech data using an acoustic model together with the phoneme pronunciation dictionary and the phoneme language model generated by the phoneme recognition dictionary generation device 1. Specifically, it uses the acoustic model 40, the phoneme pronunciation dictionary 50, and the phoneme language model 60 stored in the acoustic model storage device 4, the phoneme pronunciation dictionary storage device 5, and the phoneme language model storage device 6, respectively.

The acoustic model 40 is the same as the acoustic model described with reference to FIG. 1: it models, with a deep neural network (DNN), the acoustic features of each phoneme learned in advance from a large amount of speech data.

The phoneme pronunciation dictionary 50 is the one generated by the phoneme recognition dictionary generation device 1 described with reference to FIG. 1 (see FIG. 4).
The phoneme language model 60 is the one generated by the phoneme recognition dictionary generation device 1 described with reference to FIG. 1 (see FIG. 6).

As shown in FIG. 8, the phoneme recognition device 200 includes a recognition means 201 and a phoneme string generation means 202.

The recognition means 201 recognizes a phoneme string from speech data using the acoustic model 40, the phoneme pronunciation dictionary 50, and the phoneme language model 60.
The recognition means 201 extracts acoustic features from speech data input from outside, and lists candidate phoneme string words based on the acoustic model 40 and the phoneme pronunciation dictionary 50. From among these candidates, the recognition means 201 takes as the recognition result the phoneme string word with the maximum connection probability according to the phoneme language model 60.

Specifically, for a phoneme string word sequence w_1, w_2, …, w_n, the recognition means 201 recognizes the phoneme string word sequence that maximizes the connection probability shown in the following equation (2), which is based on the probability (posterior probability) P(w_n | w_{n-1}) that w_n appears after w_{n-1}.
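Equation (2) is not reproduced in this text. The sketch below assumes that the connection probability of a sequence is the product of its 2-gram probabilities P(w_n | w_{n-1}), computed as a sum of log probabilities, and that the candidate sequences have already been listed. The function names, the table values, and the flat floor for unseen pairs are hypothetical; a real decoder would also combine acoustic scores.

```python
def connection_log_prob(words, bigram_log_prob):
    """Sum log10 P(w_n | w_{n-1}) over adjacent phoneme string words."""
    return sum(bigram_log_prob(w1, w2) for w1, w2 in zip(words, words[1:]))


def pick_best(candidates, bigram_log_prob):
    """Return the candidate sequence with the highest connection log probability."""
    return max(candidates, key=lambda ws: connection_log_prob(ws, bigram_log_prob))


# Hypothetical 2-gram log10 probabilities.
table = {
    ("t+o:+ky+o:", "n+o"): -0.25,
    ("n+o", "h+a+sh+i"): -0.5,
    ("n+o", "h+a+sh"): -2.0,
}


def lookup(w1, w2):
    # Flat floor for unseen pairs; stands in for a properly smoothed model.
    return table.get((w1, w2), -5.0)


candidates = [
    ["t+o:+ky+o:", "n+o", "h+a+sh+i"],
    ["t+o:+ky+o:", "n+o", "h+a+sh"],
]
best = pick_best(candidates, lookup)
```

The first candidate scores -0.75 and the second -2.25, so the first is selected.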

In this way, whereas general speech recognition recognizes speech in units of the words registered in a pronunciation dictionary, the recognition means 201 recognizes speech in units of phoneme string words, which are registered in the phoneme pronunciation dictionary 50 and regarded as words.
The recognition means 201 outputs the recognized phoneme string words, in order, to the phoneme string generation means 202.

The phoneme string generation means 202 generates a phoneme string from a phoneme string word, which is in the one-word text data format and was recognized by the recognition means 201.
Specifically, the phoneme string generation means 202 generates a phoneme string by replacing the predetermined non-phoneme character (here, “+”) in the phoneme string word with a space. For example, the phoneme string generation means 202 converts the phoneme string word “s+e+k+a+i+i+ch+i” into the phoneme string “s e k a i i ch i” and outputs it.
The conversion performed by the phoneme string generation means 202 is the inverse of the conversion performed by the phoneme string word generation means 11 described with reference to FIG. 1.
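The inverse conversion can be sketched as follows, assuming “+” as the separator; the function name is hypothetical.

```python
def to_phoneme_string(phoneme_string_word, separator="+"):
    """Split one phoneme string word back into space-separated phonemes."""
    return " ".join(phoneme_string_word.split(separator))


phoneme_string = to_phoneme_string("s+e+k+a+i+i+ch+i")
# Joining the phonemes with the separator again restores the original word.
round_trip = "+".join(phoneme_string.split())
```

This works only because the separator never occurs inside a phoneme symbol, which is why the text requires a character other than a phoneme.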

With the phoneme recognition device 200 configured as described above, phoneme recognition takes longer contextual dependencies into account by using the connections between words, in contrast to conventional phoneme recognition, which relied on the triphone HMMs of the acoustic model and so used only the dependency on the preceding and following phonemes as context.

As a result, the phoneme recognition device 200 can perform phoneme recognition more accurately than before. Specifically, as explained in the discussion of the problems of the prior art, the phoneme error rate of conventional phoneme recognition was 22.8%, whereas the phoneme recognition device 200 improved the phoneme error rate to 1.2%.
The phoneme recognition device 200 can be operated by a program (phoneme recognition program) that causes a computer (not shown) to function as each of the means described above.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments.
Here, the headwords of the phoneme pronunciation dictionary 50 and the units connected by the phoneme language model 60 are phoneme string words, in which the phoneme string word generation means 11 (see FIG. 1) has replaced the spaces of a word-specific phoneme string with “+”, so that the phoneme string of one word is handled as a single word.

However, the method of regarding a phoneme string as one word is not limited to this. For example, a predetermined non-phoneme character (for example, “￥”) may be appended to the end of the phoneme string of one word, or predetermined non-phoneme characters (for example, “<” and “>”) may be added before and after the phoneme string of one word.
In these cases as well, the phoneme string generation means 202 (see FIG. 8) need only perform the inverse of the conversion performed by the phoneme string word generation means 11 (see FIG. 1).

In addition, a 2-gram language model has been given here as an example of the phoneme language model 60 generated by the phoneme language model generation means 14.
However, the phoneme language model generated by the phoneme language model generation means 14 may be a 1-gram language model, a 3-gram language model, or the like, as long as it is an N-gram language model.

1 Phoneme recognition dictionary generation device
10 Word-specific phoneme string generation means
11 Phoneme string word generation means
12 Phoneme pronunciation dictionary generation means
120 Word-specific phoneme string registration means
121 Combination phoneme string registration means
13 Phoneme string word list storage means
130 Phoneme string word list
14 Phoneme language model generation means
2 Learning corpus storage device
20 Learning corpus
3 Pronunciation dictionary storage device
30 Pronunciation dictionary
4 Acoustic model storage device
40 Acoustic model
5 Phoneme pronunciation dictionary storage device
50 Phoneme pronunciation dictionary
6 Phoneme language model storage device
60 Phoneme language model

Claims

A phoneme recognition dictionary generation device that generates a phoneme pronunciation dictionary and a phoneme language model used for phoneme recognition, using an acoustic model, a pronunciation dictionary, and a learning corpus, the device comprising:
a word-specific phoneme string generation means that recognizes the speech of the learning corpus based on the acoustic model and the pronunciation dictionary, and generates word-specific phoneme strings, each being the phoneme string of a word corresponding to a headword registered in the pronunciation dictionary;
a phoneme string word generation means that generates a phoneme string word by converting the word-specific phoneme string into a one-word text data format;
a phoneme pronunciation dictionary generation means that generates the phoneme pronunciation dictionary by using the phoneme string word as a headword and the word-specific phoneme string corresponding to that phoneme string word as its pronunciation notation; and
a phoneme language model generation means that generates the phoneme language model by learning an N-gram language model, as chains of the phoneme string words, from the list of the phoneme string words generated by the phoneme string word generation means.

The phoneme recognition dictionary generation device according to claim 1, wherein the phoneme string word generation means generates the phoneme string word by inserting a predetermined character other than a phoneme between the phonemes of the word-specific phoneme string.

The phoneme recognition dictionary generation device according to claim 1 or 2, wherein the phoneme pronunciation dictionary generation means takes a phoneme string in which a predetermined number of phonemes are combined, uses it, converted into the text data format, as a headword, and registers the phoneme string corresponding to that headword in the phoneme pronunciation dictionary as its pronunciation notation.

The phoneme recognition dictionary generation device according to any one of claims 1 to 3, wherein the phoneme language model generation means assigns, by smoothing, an N-gram probability to a chain of phoneme string words generated by the phoneme string word generation means that does not exist in the list of the phoneme string words.

A phoneme recognition dictionary generation program for causing a computer to function as the phoneme recognition dictionary generation device according to any one of claims 1 to 4.

A phoneme recognition device that recognizes the phonemes of speech using an acoustic model together with the phoneme pronunciation dictionary and the phoneme language model generated by the phoneme recognition dictionary generation device according to any one of claims 1 to 4, the device comprising:
a recognition means that recognizes the speech in units of phoneme string words by means of the acoustic model, the phoneme pronunciation dictionary, and the phoneme language model; and
a phoneme string generation means that generates a phoneme string by separating the phoneme string word, which is in the one-word text data format and was recognized by the recognition means, into individual phonemes.

A phoneme recognition program for causing a computer to function as the phoneme recognition device according to claim 6.