JP6549064B2 - Speech recognition device, speech recognition method, program

Info

Publication number: JP6549064B2
Application number: JP2016112982A
Authority: JP
Inventors: 賢昭佐藤; 中村　孝; 孝中村
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2016-06-06
Filing date: 2016-06-06
Publication date: 2019-07-24
Anticipated expiration: 2036-06-06
Also published as: JP2017219637A

Description

The present invention relates to a speech recognition device, a speech recognition method, and a program.

Patent Document 1 discloses a document summarization device that can flexibly accommodate a character-count limit. The device comprises a sentence shortening device, a sentence score determination device, and a sentence selection device. The sentence shortening device shortens each sentence of an input document at a plurality of specified shortening rates and outputs the original sentences and the shortened sentences. For each original and shortened sentence, the sentence score determination device determines a sentence score from (a) a position score obtained from the shortening rate, the position at which the sentence appears, and input parameters, and (b) word scores, obtained from a word score database, that measure the weight of the words making up the sentence. Under the given character-count limit, the sentence selection device selects as the summary the combination of sentences that maximizes the sum of the sentence scores.

JP 2010-55236 A (Japanese Unexamined Patent Application Publication No. 2010-55236)

Speech recognition converts speech into text, but at present it is difficult to achieve a 100% conversion rate under all conditions, and in many cases the recognition result contains erroneous words. The recognition result may also contain redundant phrases that need not be converted into text. For example, the recognition result 「これはそうですね難しいですね」 ("this is, well, difficult, isn't it") should in some cases be shortened to 「これは難しい」 ("this is difficult") by deleting the redundant phrases.

Correcting errors in the speech recognition result and deleting unnecessary parts in this way is essential for obtaining a high-quality recognition result. Processing that makes these two improvements, and thereby improves the readability of the recognition result and its suitability for downstream language processing, is referred to here as "recognition result shaping".

Ordinary speech recognition is performed by tuning an acoustic model and a language model to suit the target speech and then decoding (real-time conversion to text using both models). These two models use only information on the plausibility of sounds, the number of languages, and word order; they cannot exploit any further information.

There is, on the other hand, research on discriminative reranking, which uses pairs of speech recognition results and correct-answer data to correct recognition results with long-range information (such as sentence-level plausibility) that an ordinary language model cannot take into account. However, discriminative reranking requires the correct-answer data for speech recognition to be created by hand, and the high cost of doing so has been a problem.

An object of the present invention is therefore to provide a speech recognition device that can correct speech recognition results without using correct-answer data.

The speech recognition device of the present invention includes a speech recognition unit, a 3-gram calculation unit, a tf-idf calculation unit, an importance calculation unit, and an unnecessary word deletion unit. N is an integer of 2 or more.

The speech recognition unit outputs the 1st- to Nth-ranked speech recognition results for the input speech data. The 3-gram calculation unit calculates the 3-gram probabilities of text data prepared in advance. The tf-idf calculation unit calculates the tf-idf of each word in the 1st-ranked speech recognition result from the tf of each word contained in the 1st- to Nth-ranked results and, out of the idf values prepared in advance from the text data, the idf of each word in the 1st-ranked result. The importance calculation unit calculates, based on the tf-idf, the NRD of each word in the 1st-ranked result and outputs a value based on the calculated NRD as the importance of each word. The unnecessary word deletion unit deletes unnecessary words from the 1st-ranked result based on the solution of an integer programming problem formulated using the confidence of each word in the 1st-ranked result, the 3-gram probabilities of three consecutive words in the 1st-ranked result, and the importance of each word in the 1st-ranked result.

According to the speech recognition device of the present invention, speech recognition results can be corrected without using correct-answer data.

FIG. 1 is a block diagram showing the configuration of the speech recognition device of Embodiment 1.
FIG. 2 is a flowchart showing the operation of the speech recognition device of Embodiment 1.
FIG. 3 is a block diagram showing the configuration of the speech recognition device of Embodiment 2.
FIG. 4 is a flowchart showing the operation of the speech recognition device of Embodiment 2.
FIG. 5 is a block diagram showing the configuration of the speech recognition device of Embodiment 3.
FIG. 6 is a flowchart showing the operation of the speech recognition device of Embodiment 3.
FIG. 7 is a block diagram showing the configuration of the speech recognition device of Embodiment 4.
FIG. 8 is a flowchart showing the operation of the speech recognition device of Embodiment 4.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference numerals, and duplicate description is omitted.

The configuration and operation of the speech recognition device 1 of Embodiment 1 are described below with reference to FIGS. 1 and 2. As shown in FIG. 1, the speech recognition device 1 of this embodiment includes a corpus storage unit 10, a speech recognition unit 11, a 3-gram calculation unit 12, a tf-idf calculation unit 13, an importance calculation unit 14, and an unnecessary word deletion unit 15. The corpus storage unit 10 is assumed to store text data prepared in advance.

The speech recognition unit 11 outputs the 1st- to Nth-ranked speech recognition results for the input speech data (S11). The 3-gram calculation unit 12 calculates the 3-gram probabilities of the text data prepared in advance (S12). The tf-idf calculation unit 13 calculates the tf-idf of each word in the 1st-ranked speech recognition result from the tf of each word contained in the 1st- to Nth-ranked results and, out of the idf values prepared in advance from the text data, the idf of each word in the 1st-ranked result (S13). The importance calculation unit 14 calculates, based on the tf-idf, the NRD (Normalized Relevance Distance) of each word in the 1st-ranked result and outputs a value based on the calculated NRD as the importance of each word (S14). The unnecessary word deletion unit 15 deletes unnecessary words from the 1st-ranked result based on the solution of an integer programming problem formulated using the confidence of each word in the 1st-ranked result, the 3-gram probabilities of three consecutive words in the 1st-ranked result, and the importance of each word in the 1st-ranked result (S15).

The operation of each component is described in detail below.
<Speech recognition unit 11>
Input: speech data (a time series of sound pressure values, in a format such as pcm or wav); the upper limit N on the number of ranks
Output: the 1st- to Nth-ranked sentences of the speech recognition result; the confidence of each word in each sentence
The input speech data is assumed to be processed as sentences, for example one sentence per utterance.

[Example of a sentence for one input utterance]
Sentence for the first utterance: 「今日はかるカレーを食べた」 (literally, "today I ate karu curry")
The speech recognition unit 11 takes the speech data as input, performs speech recognition by a general speech recognition method, and outputs the 1st- to Nth-ranked speech recognition results (S11). As noted above, N is an integer of 2 or more.

In step S11, for each sentence of the speech data (sentences are delimited based on time information), the recognition results of multiple (= N) hypotheses are output, each with a rank and a confidence for every word, as shown below. N may be specified manually; for example, N = 5.

The multiple hypotheses are a set of recognition results comprising the sentence that the speech recognition system rated most likely and the other sentences that arose as candidates during the recognition computation.

The confidence is a probability value, between 0 and 1 inclusive, that expresses how likely it is that each word of the recognition result is correct.

[Example of confidences]
Rank 1: 今日はかるカレーを食べた
Confidence: 今日 → 0.7, は → 0.5, かる → 0.4, カレー → 0.5, を → 0.7, 食べた → 0.9
Rank 2: 今日は軽いカレーを食べた
Confidence: 今日 → 0.7, は → 0.5, 軽い → 0.35, カレー → 0.5, を → 0.7, 食べた → 0.9
...
Rank N: 今日はかんカレーを食べた
Confidence: 今日 → 0.7, は → 0.5, かん → 0.2, カレー → 0.5, を → 0.7, 食べた → 0.9

<3-gram calculation unit 12>
Input: a large amount of text data
Output: 3-gram probabilities for the text data
The 3-gram calculation unit 12 uses a large amount of text data prepared in advance (not speech recognition results). In this embodiment, the text data is assumed to be stored in advance in the corpus storage unit 10. Newspaper articles, for example, can be used as the text data; one conceivable corpus consists of about 150,000 articles of about 30 sentences each.

The 3-gram calculation unit 12 calculates 3-gram probabilities over all sentences of the text data. A 3-gram probability is the probability that three words appear consecutively in text. Step S12 is explained with a concrete example. Suppose we want the 3-gram probability p(暑い | 今日, は) of the sequence (今日, は, 暑い). To calculate it, every three-word sequence of the form 「今日, は, ○○○」 is located in the text data and counted. Suppose that only the following three patterns are found, with the counts shown:
(1) 今日 は 暑い: 100
(2) 今日 は 晴れ: 95
(3) 今日 は まれ: 5
In this case, the 3-gram probability p(暑い | 今日, は) is calculated as

\[ p(\text{暑い} \mid \text{今日}, \text{は}) = \frac{100}{100 + 95 + 5} = 0.5 \]

For the words appearing in the text data, the 3-gram calculation unit 12 calculates the conditional probability p(w_k | w_i, w_j) of every possible three-word sequence w_i, w_j, w_k (i, j, and k denote arbitrary indices and also appear in the expressions below).
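The counting procedure in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `trigram_probability` and the pre-tokenized toy corpus are hypothetical, and unseen contexts are simply reported as `None` rather than handled by backoff.

```python
from collections import Counter

def trigram_probability(sentences, w_i, w_j, w_k):
    """Relative-frequency estimate of p(w_k | w_i, w_j).

    `sentences` is a list of token lists; tokenization is assumed
    to have been done elsewhere.
    """
    trigram_counts = Counter()
    context_counts = Counter()
    for tokens in sentences:
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            trigram_counts[(a, b, c)] += 1
            # Count the two-word context only when a third word follows it.
            context_counts[(a, b)] += 1
    if context_counts[(w_i, w_j)] == 0:
        return None  # unseen context: a backoff method would be needed
    return trigram_counts[(w_i, w_j, w_k)] / context_counts[(w_i, w_j)]

# Toy corpus reproducing the counts of the example: 100, 95 and 5 occurrences.
corpus = (
    [["今日", "は", "暑い"]] * 100
    + [["今日", "は", "晴れ"]] * 95
    + [["今日", "は", "まれ"]] * 5
)
p_hot = trigram_probability(corpus, "今日", "は", "暑い")
print(p_hot)  # 0.5
```

Keeping a separate context counter avoids rescanning the corpus for each query, which matters once the corpus reaches the size mentioned in the text.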

There are also cases in which a sequence w_i, w_j, w_k never appears in the text data, so that p(w_k | w_i, w_j) cannot be calculated directly. Such cases can be handled by a method called backoff, in which 2-grams or 1-grams are used in place of the 3-gram. Backoff is described, for example, in Reference Non-Patent Document 1.
(Reference Non-Patent Document 1: Kenji Kita, Jun'ichi Tsujii, "Language and Computation (4): Probabilistic Language Models" (言語と計算（４）確率的言語モデル), University of Tokyo Press, November 1999, pp. 67-69)

In addition, p(○ | start) and p(end | ○, ○) are calculated. p(○ | start) is the probability that the word ○ appears immediately after the beginning of a sentence, and p(end | ○, ○) is the probability that the sentence ends after the sequence ○, ○. Sentence beginnings and endings are determined from the line-break characters in the text data.

<tf-idf calculation unit 13>
Input: the 1st- to Nth-ranked speech recognition results; a large amount of text data
Output: the tf-idf of every word appearing in the 1st-ranked speech recognition result
tf-idf is an index calculated from two measures, tf (term frequency) and idf (inverse document frequency), and expresses the importance of a word within a text.

First, the same large text data set used in step S12 is prepared. In this embodiment, the text data stored in advance in the corpus storage unit 10 can be reused. As described above, the text data may be, for example, a corpus of newspaper articles containing about 150,000 articles of about 30 sentences each.

The calculation of idf is as follows. Let D be the number of documents in the text data (a document is a coherent unit of text, such as one newspaper article; the document boundaries are assumed to be marked in the text data in advance), and let d be the number of those documents in which the word of interest a appears. Then idf is calculated as log(D/d). The base of the logarithm may be any real number greater than 1; base 10 is used in the examples below.

For example, suppose that the total number of documents in the corpus storage unit 10 is 150,000 and that the number of documents containing each word is as follows:
Number of documents containing 「今日」: 400
Number of documents containing 「は」: 300
Number of documents containing 「カレー」: 3,000
Number of documents containing 「を」: 50,000
Number of documents containing 「食べ」: 40,000
Number of documents containing 「た」: 50,000

In this case, the tf-idf calculation unit 13 calculates the idf of each word as follows:
idf of 「今日」 = log₁₀(150000/400) = 2.57
idf of 「は」 = log₁₀(150000/300) = 2.70
idf of 「カレー」 = log₁₀(150000/3000) = 1.70
idf of 「を」 = log₁₀(150000/50000) = 0.477
idf of 「食べ」 = log₁₀(150000/40000) = 0.57
idf of 「た」 = log₁₀(150000/50000) = 0.477
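The idf values can be recomputed directly from the document counts given above. The following sketch assumes only the formula idf = log(D/d) with base 10; the function and variable names are illustrative.

```python
import math

def idf(total_documents, documents_with_word, base=10):
    """idf = log(D / d) for a word that appears in d of D documents."""
    return math.log(total_documents / documents_with_word, base)

D = 150_000
document_counts = {
    "今日": 400, "は": 300, "カレー": 3_000,
    "を": 50_000, "食べ": 40_000, "た": 50_000,
}
idf_values = {word: idf(D, d) for word, d in document_counts.items()}
for word, value in idf_values.items():
    print(f"{word}: {value:.3f}")
```

Words that occur in many documents (such as 「を」 and 「た」) receive a small idf, while words confined to few documents receive a large one.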

Next, the calculation of tf is described. In general, when the total number of words in a document is M and the word of interest a occurs A times in that document, tf is calculated as tf = A/M.

Accordingly, the tf-idf calculation unit 13 treats the set of speech recognition results from rank 1 to rank N as a single document and calculates tf for at least each word contained in the 1st-ranked result. For example, if the total number of words in the set of 1st- to Nth-ranked results is M = 1000 and the word 「カレー」, which is contained in the 1st-ranked result, occurs A = 200 times, then
tf of 「カレー」 = 200/1000 = 0.20.

The tf-idf calculation unit 13 uses the idf and tf values to calculate tf-idf as follows:
tf-idf = tf × idf
For example, the tf-idf of 「カレー」 in the example above is
tf-idf = 0.20 × 1.70 = 0.34.
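As a rough illustration of this step, the sketch below treats a whole N-best list as one document, computes tf over it, and multiplies by idf only for the words of the 1st-ranked hypothesis. The tokenized 3-best list and the idf table (including the value for 「かる」) are hypothetical inputs chosen for the example.

```python
from collections import Counter

def tf_from_nbest(nbest):
    """Term frequency over the whole N-best list, treated as one document."""
    counts = Counter(token for hypothesis in nbest for token in hypothesis)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def tfidf_for_top_hypothesis(nbest, idf_values):
    """tf-idf = tf * idf for each distinct word of the 1st-ranked hypothesis."""
    tf = tf_from_nbest(nbest)
    return {word: tf[word] * idf_values[word] for word in set(nbest[0])}

# Hypothetical tokenized 3-best list and idf table (illustrative values).
nbest = [
    ["今日", "は", "かる", "カレー", "を", "食べ", "た"],
    ["今日", "は", "軽い", "カレー", "を", "食べ", "た"],
    ["今日", "は", "かん", "カレー", "を", "食べ", "た"],
]
idf_values = {"今日": 2.57, "は": 2.70, "かる": 3.0, "カレー": 1.70,
              "を": 0.477, "食べ": 0.574, "た": 0.477}
scores = tfidf_for_top_hypothesis(nbest, idf_values)
```

Words that occur only in lower-ranked hypotheses (here 「軽い」 and 「かん」) contribute to the word total M but receive no tf-idf entry of their own.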

The tf-idf calculation unit 13 need only calculate the tf-idf of each word contained in the 1st-ranked speech recognition result. The important point is that, although calculating tf requires the 1st- to Nth-ranked results, tf-idf does not have to be calculated for every word appearing in those results.

<Importance calculation unit 14>
Input: the 1st-ranked speech recognition result; the tf-idf of the 1st-ranked speech recognition result
Output: the NRD (Normalized Relevance Distance) of each word in the 1st-ranked speech recognition result
Based on the tf-idf of the 1st-ranked speech recognition result, the importance calculation unit 14 calculates NRD, a measure of the similarity between two words in the 1st-ranked result (S14). NRD is calculated because, for example, if a word is highly similar to the other words, it can be judged to be a correctly recognized word rather than a misrecognition, and therefore not an unnecessary word that should be deleted.

To calculate NRD, f_NRD(w) for each word and f_NRD(w_1, w_2) for each pair of words are first computed. These are defined as

[Equations not reproduced in this text]

where TFIDF(w, d) denotes the tf-idf of word w in document d, and S is the total number of documents.

The importance calculation unit 14 calculates the NRD of each word in the 1st-ranked speech recognition result using, for example, the definitions above. For example, if the words in the 1st-ranked result are w_1, ..., w_M, the word consistency score of any word w_i among them is

\[ \sum_{j=1,\ j \neq i}^{M} \frac{1}{\mathrm{NRD}(w_i, w_j)} \]

that is, the sum, over all indices j = 1 to M other than i, of the reciprocal of the NRD between w_j and w_i. The higher this score, the more likely the word is judged to be a correct word and a word that is not unnecessary (that is, a necessary word).
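The reciprocal-sum score described here can be sketched as follows. Because the NRD definition itself is not reproduced in this text, the sketch takes the distance as a caller-supplied function; `toy_nrd` and its values are purely hypothetical stand-ins.

```python
def consistency_scores(words, nrd):
    """For each word w_i, sum 1 / NRD(w_i, w_j) over all j != i.

    `nrd` is a caller-supplied function returning a positive distance
    for a pair of words; how it is computed is outside this sketch.
    """
    scores = []
    for i, w_i in enumerate(words):
        total = 0.0
        for j, w_j in enumerate(words):
            if j != i:
                total += 1.0 / nrd(w_i, w_j)
        scores.append(total)
    return scores

# Hypothetical distances: "かる" is far from every other word.
def toy_nrd(a, b):
    return 2.0 if "かる" in (a, b) else 0.5

words = ["今日", "かる", "カレー", "食べ"]
scores = consistency_scores(words, toy_nrd)
print(scores)  # [4.5, 1.5, 4.5, 4.5]
```

A word that is distant from all its neighbours ends up with the lowest score, which is the property the deletion step relies on.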

<Unnecessary word deletion unit 15>
Input: the 1st-ranked speech recognition result; the confidence of each word in the 1st-ranked result; the 3-gram probabilities for the words of the 1st-ranked result; the importance of each word in the 1st-ranked result
Output: the speech recognition result with unnecessary words deleted
The unnecessary word deletion unit 15 deletes unnecessary words from the 1st-ranked speech recognition result. Suppose that the 1st-ranked result for some speech data is the word sequence w_1, w_2, ..., w_T. The deletion of unnecessary words from this sentence is formulated as an integer programming problem using the NRD-based value (the word consistency score, i.e., the importance of the word, or the degree to which it must not be deleted), the probability of how readily three words connect in sequence (the 3-gram probability), and the confidence (the degree to which the word is considered correct as a speech recognition result).

Variables are defined for the description of step S15. δ_i, α_i, β_ij, and γ_ijk each take the integer value 1 or 0. δ_i is defined for i = 1 to T; it is 1 if the word w_i is kept (not deleted) and 0 if it is deleted. α_i is defined for i = 1 to T; it is 1 if the word w_i is the first word of the sentence and 0 otherwise. β_ij is defined for every pair (i, j) satisfying 0 ≤ i < j ≤ T; it is 1 if the sentence ends immediately after the sequence w_i, w_j and 0 otherwise. γ_ijk is defined for every triple (i, j, k) satisfying 0 ≤ i < j < k ≤ T; it is 1 if w_i, w_j, w_k appear as three consecutive words in the sentence after deletion and 0 otherwise.

Using these variables, the values of δ_i, α_i, β_ij, and γ_ijk that maximize the following function are calculated (an integer programming problem).

[Objective function not reproduced in this text]

Here, Sig(w_i) is the importance of the word w_i (the NRD-based value), p(w_k | w_i, w_j) is the 3-gram probability of the three consecutive words w_i, w_j, w_k, and q(w_i) is the confidence of the word w_i. The unnecessary word deletion unit 15 computes the solution of this problem and, for each word w_i of w_1, w_2, ..., w_T, outputs the word unchanged if δ_i is 1 and does not output it if δ_i is 0 (such a word is an unnecessary word), thereby deleting the unnecessary words from the recognition result sentence.

Maximizing this evaluation function deletes words whose importance is not high. A word is also deleted when deleting it leaves the remaining words in a natural sequence.
For example, suppose that unnecessary words are deleted by this method from the speech recognition result 「今日はかるカレーを食べた」 to compress the sentence. Suppose that 「かる」 has low importance (a low NRD-based value). Then, if the sequence 「は カレー を」 that results from deleting 「かる」 is natural (has a high 3-gram probability), it should be acceptable to delete 「かる」. The expression above states this mathematically.

The unnecessary word deletion unit 15 therefore outputs the speech recognition result from which the unnecessary words have been removed using the δ_i of the solution of the maximization problem described above.
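The idea of choosing the keep/delete flags δ_i by maximizing a score can be illustrated with a small exhaustive search. The score below is an assumption made purely for illustration and is not the patent's objective function, which is not reproduced in this text: it adds importance plus confidence for each kept word and the log 3-gram probability of each run of three consecutive kept words, and it omits the sentence-start and sentence-end terms. The importance values, 3-gram probabilities, and the `unseen_prob` floor are hypothetical; the confidences are taken from the rank-1 example above.

```python
import itertools
import math

def compress(words, importance, confidence, trigram_prob, unseen_prob=0.01):
    """Exhaustively search the keep/delete flags delta_i in {0, 1}.

    Illustrative score (an assumption, not the patent's objective):
    sum over kept words of (importance + confidence), plus the sum of
    log 3-gram probabilities over consecutive kept words.
    """
    best_score, best_delta = -math.inf, None
    for delta in itertools.product((0, 1), repeat=len(words)):
        kept = [i for i, flag in enumerate(delta) if flag]
        score = sum(importance[i] + confidence[i] for i in kept)
        for a, b, c in zip(kept, kept[1:], kept[2:]):
            triple = (words[a], words[b], words[c])
            score += math.log(trigram_prob.get(triple, unseen_prob))
        if score > best_score:
            best_score, best_delta = score, delta
    return [w for w, flag in zip(words, best_delta) if flag]

words = ["今日", "は", "かる", "カレー", "を", "食べた"]
importance = [1.0, 0.8, 0.1, 1.2, 0.8, 1.5]   # hypothetical values
confidence = [0.7, 0.5, 0.4, 0.5, 0.7, 0.9]   # from the rank-1 example
trigram_prob = {                               # hypothetical values
    ("今日", "は", "カレー"): 0.5,
    ("は", "カレー", "を"): 0.5,
    ("カレー", "を", "食べた"): 0.5,
}
result = compress(words, importance, confidence, trigram_prob)
print("".join(result))  # 今日はカレーを食べた
```

Enumerating all 2^T flag assignments is only feasible for short sentences; for realistic lengths an integer programming solver would take the place of the loop.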

The configuration and operation of the speech recognition device of Embodiment 2 are described below with reference to FIGS. 3 and 4. As shown in FIG. 3, the speech recognition device 2 of this embodiment includes a corpus storage unit 10, a speech recognition unit 11, a 3-gram calculation unit 12, an importance calculation unit 24, and an unnecessary word deletion unit 15. This embodiment is the same as Embodiment 1 except that the tf-idf calculation unit 13 of the speech recognition device 1 of Embodiment 1 is omitted and the importance calculation unit 14 of Embodiment 1 is replaced by the importance calculation unit 24, so description is omitted where appropriate.

In this embodiment, word2vec is used instead of NRD when calculating the importance of a word. word2vec is a method that uses a DNN (Deep Neural Network) to convert each word of a large amount of text data into a U-dimensional real-valued vector (U is an integer of 2 or more). word2vec is described, for example, in Reference Non-Patent Document 2.
(Reference Non-Patent Document 2: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean, "Distributed Representations of Words and Phrases and their Compositionality", [online], Oct 2013, [retrieved May 30, 2016], Internet <URL:https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf>)

The dimension U is specified manually; for example, U = 100. In the following, the real-valued word2vec vector of a word w is written x(w) (a column vector). Using this, instead of the aforementioned

[Expression not reproduced in this text]

the expression

[Expression not reproduced in this text]

is used. That is, the importance calculation unit 24 converts each word w of the text data into a U-dimensional real-valued vector x(w) using a DNN and calculates, based on that vector, the importance of each word in the 1st-ranked speech recognition result (S24).

The configuration and operation of the speech recognition device of Embodiment 3 are described below with reference to FIGS. 5 and 6. The speech recognition device 3 of this embodiment is a further modification of the speech recognition device 2 of Embodiment 2. As shown in FIG. 5, the speech recognition device 3 includes a corpus storage unit 10, a speech recognition unit 11, a 3-gram calculation unit 12, an importance calculation unit 34, and an unnecessary word deletion unit 15. This embodiment is the same as Embodiment 2 except that the importance calculation unit 24 of Embodiment 2 is replaced by the importance calculation unit 34, so description is omitted where appropriate.

When calculating the importance of a word, the importance calculation unit 34 of this embodiment calculates the importance (naturalness within the sentence) based on the degree of variation of the word2vec real-valued vectors. For each of the words w_1, w_2, ..., w_V in the sentence, consider the real-valued vector x(w_i). The mean vector of these real-valued vectors,

[equation not reproduced]

is calculated. Using this mean vector,

[equation not reproduced]

is taken as an index representing the importance of the word (its naturalness within the sentence) and is used in place of

[equation not reproduced]
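As a rough illustration of scoring words by how far their vectors deviate from the sentence mean, the following Python sketch may help. It rests on assumptions that are not taken from the patent: the index used here is the negative Euclidean distance from the mean vector (the patent's own expression is not reproduced in this text), the function names `mean_vector` and `deviation_scores` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for U-dimensional word2vec vectors x(w_i).

```python
import math


def mean_vector(vectors):
    """Return the component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def deviation_scores(vectors):
    """Score each vector by its negative Euclidean distance from the mean.

    Assumption: negative distance is only one plausible measure of the
    'degree of variation'; it is chosen so that words close to the sentence
    mean receive higher (more natural) scores.
    """
    mu = mean_vector(vectors)
    return [-math.dist(v, mu) for v in vectors]


# Hypothetical stand-ins for x(w_i), using U = 3 instead of U = 100.
sentence_vectors = {
    "weather": [0.9, 0.1, 0.0],
    "today": [0.8, 0.2, 0.1],
    "sunny": [1.0, 0.0, 0.1],
    "uh": [-0.7, 0.9, 0.8],
}

scores = dict(zip(sentence_vectors,
                  deviation_scores(list(sentence_vectors.values()))))
least_natural = min(scores, key=scores.get)
print(least_natural)
```

In this toy data the filler word sits far from the other three vectors, so it receives the lowest score. In the apparatus described here, such a score would feed into the unnecessary word deletion unit 15 rather than being thresholded directly.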

The configuration and operation of the speech recognition apparatus of the fourth embodiment are described below with reference to FIGS. 7 and 8. The speech recognition apparatus 4 of this embodiment is a further modification of the speech recognition apparatus 2 of the second embodiment. As shown in FIG. 7, the speech recognition apparatus 4 of this embodiment includes a corpus storage unit 10, a speech recognition unit 11, a 3-gram calculation unit 12, an importance calculation unit 44, and an unnecessary word deletion unit 15. Except that the importance calculation unit 24 of the second embodiment is replaced by the importance calculation unit 44, this embodiment is the same as the second embodiment, so the description is omitted where appropriate.

As in the third embodiment, when calculating the importance of a word, the importance calculation unit 44 of this embodiment calculates the importance (naturalness within the sentence) based on the degree of variation of the word2vec real-valued vectors.

For each of the words w_1, w_2, ..., w_Y in the sentence, consider the real-valued vector x(w_i). Assuming that this group of real-valued vectors follows a normal distribution with a single mixture component, and using

[equation not reproduced]

the log Gaussian probability

[equation not reproduced]

is taken as the importance of the word. This is used in place of

[equation not reproduced]
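The idea of fitting a single Gaussian to the word vectors of a sentence and using each word's log probability as its importance can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's formulation: it uses a diagonal covariance with a variance floor (a full covariance estimated from a handful of words in U = 100 dimensions would be singular), the names `fit_diagonal_gaussian`, `log_gaussian`, and `var_floor` are hypothetical, and the vectors are made-up 3-dimensional stand-ins for x(w_i).

```python
import math


def fit_diagonal_gaussian(vectors, var_floor=1e-3):
    """Estimate the mean and per-dimension variance of the vectors.

    Assumption: a diagonal covariance with a small floor is used so that
    the density stays well defined when there are few words per sentence.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [
        max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
        for d in range(dim)
    ]
    return mean, var


def log_gaussian(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * s2) + (xi - m) ** 2 / s2)
        for xi, m, s2 in zip(x, mean, var)
    )


# Hypothetical stand-ins for x(w_i), using U = 3 instead of U = 100.
word_vectors = {
    "weather": [0.9, 0.1, 0.0],
    "today": [0.8, 0.2, 0.1],
    "sunny": [1.0, 0.0, 0.1],
    "uh": [-0.7, 0.9, 0.8],
}

gauss_mean, gauss_var = fit_diagonal_gaussian(list(word_vectors.values()))
log_probs = {
    w: log_gaussian(x, gauss_mean, gauss_var) for w, x in word_vectors.items()
}
lowest = min(log_probs, key=log_probs.get)
print(lowest)
```

Compared with a plain distance from the mean, the log probability scales each dimension by its variance, so a deviation along a dimension that varies little within the sentence is penalized more heavily.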

<Supplementary Note>
The apparatus of the present invention has, for example, as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) capable of reading from and writing to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity provided with such hardware resources.

The external storage device of the hardware entity stores the program necessary for realizing the functions described above, the data required for the processing of this program, and the like (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data and the like obtained by the processing of these programs are stored as appropriate in the RAM, the external storage device, and the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components described above as ... unit, ... means, and so on).

The present invention is not limited to the embodiments described above, and modifications can be made as appropriate without departing from the spirit of the present invention. Further, the processes described in the above embodiments need not be executed in chronological order as described; they may be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as necessary.

As described above, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiments are implemented by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on a computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to that program, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A speech recognition apparatus comprising, where N is an integer of 2 or more:
a speech recognition unit that outputs first- to N-th-ranked speech recognition results based on input speech data;
a 3-gram calculation unit that calculates 3-gram probabilities of text data prepared in advance;
a tf-idf calculation unit that calculates the tf-idf of each word included in the first-ranked speech recognition result, based on the tf of each word included in the first- to N-th-ranked speech recognition results and, among idf values prepared in advance based on the text data, the idf of each word included in the first-ranked speech recognition result;
an importance calculation unit that calculates the NRD of each word included in the first-ranked speech recognition result based on the tf-idf and outputs a value based on the calculated NRD as the importance of each word; and
an unnecessary word deletion unit that deletes unnecessary words included in the first-ranked speech recognition result based on the solution of an integer programming problem formulated using the reliability of each word included in the first-ranked speech recognition result, the 3-gram probability of three consecutive words included in the first-ranked speech recognition result, and the importance of each word included in the first-ranked speech recognition result.

A speech recognition apparatus comprising:
a speech recognition unit that outputs a first-ranked speech recognition result based on input speech data;
a 3-gram calculation unit that calculates 3-gram probabilities of text data prepared in advance;
an importance calculation unit that converts each word of the text data into a multi-dimensional real-valued vector using a DNN and calculates the importance of each word included in the first-ranked speech recognition result based on the multi-dimensional real-valued vector; and
an unnecessary word deletion unit that deletes unnecessary words included in the first-ranked speech recognition result based on the solution of an integer programming problem formulated using the reliability of each word included in the first-ranked speech recognition result, the 3-gram probability of three consecutive words included in the first-ranked speech recognition result, and the importance of each word included in the first-ranked speech recognition result.

The speech recognition apparatus according to claim 2, wherein
the importance calculation unit
calculates the importance based on the degree of variation of the real-valued vectors.

A speech recognition method performed by a speech recognition apparatus, comprising, where N is an integer of 2 or more:
a step of outputting first- to N-th-ranked speech recognition results based on input speech data;
a step of calculating 3-gram probabilities of text data prepared in advance;
a step of calculating the tf-idf of each word included in the first-ranked speech recognition result, based on the tf of each word included in the first- to N-th-ranked speech recognition results and, among idf values prepared in advance based on the text data, the idf of each word included in the first-ranked speech recognition result;
a step of calculating the NRD of each word included in the first-ranked speech recognition result based on the tf-idf and outputting a value based on the calculated NRD as the importance of each word; and
a step of deleting unnecessary words included in the first-ranked speech recognition result based on the solution of an integer programming problem formulated using the reliability of each word included in the first-ranked speech recognition result, the 3-gram probability of three consecutive words included in the first-ranked speech recognition result, and the importance of each word included in the first-ranked speech recognition result.

A speech recognition method performed by a speech recognition apparatus, comprising:
a step of outputting a first-ranked speech recognition result based on input speech data;
a step of calculating 3-gram probabilities of text data prepared in advance;
a step of converting each word of the text data into a multi-dimensional real-valued vector using a DNN and calculating the importance of each word included in the first-ranked speech recognition result based on the multi-dimensional real-valued vector; and
a step of deleting unnecessary words included in the first-ranked speech recognition result based on the solution of an integer programming problem formulated using the reliability of each word included in the first-ranked speech recognition result, the 3-gram probability of three consecutive words included in the first-ranked speech recognition result, and the importance of each word included in the first-ranked speech recognition result.

The speech recognition method according to claim 5, wherein
the importance is calculated based on the degree of variation of the real-valued vectors.

A program that causes a computer to function as the speech recognition apparatus according to any one of claims 1 to 3.