JP4981850B2

JP4981850B2 - Voice recognition apparatus and method, program, and recording medium

Info

Publication number: JP4981850B2
Application number: JP2009143173A
Authority: JP
Inventors: 哲小橋川; 義和山口; 太一浅見; 明夫神; 浩和政瀧; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2009-06-16
Filing date: 2009-06-16
Publication date: 2012-07-25
Anticipated expiration: 2029-06-16
Also published as: JP2011002494A

Description

The present invention relates to a speech recognition apparatus and method, a program, and a recording medium for efficiently recognizing speech data of varying sound quality.

In recent years, memory devices for recording speech data have become cheaper, making it easy to obtain large amounts of speech data. When such speech data is recognized, recognition accuracy and processing time vary greatly with the quality of the data.

FIG. 10 shows the functional configuration of a conventional speech recognition apparatus 900. The speech recognition apparatus 900 includes an A/D conversion unit 90, a feature amount analysis unit 91, a speech recognition processing unit 92, an acoustic model parameter memory 93, and a language model parameter memory 94.

The A/D conversion unit 90 converts the input analog speech signal into a discrete digital signal at a sampling frequency of, for example, 16 kHz. The feature amount analysis unit 91 takes the digitized speech signal as input and calculates a speech feature amount O_t for each frame, where one frame consists of, for example, 320 samples (20 ms). The speech feature amount O_t is calculated by, for example, mel-frequency cepstral coefficient (MFCC) analysis.
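The framing arithmetic above can be checked with a short sketch. This is only an illustration under assumptions: the function name `split_into_frames` is hypothetical, and the frames are taken as consecutive and non-overlapping, which the text does not state (practical MFCC front ends often use overlapping windows).

```python
def split_into_frames(samples, frame_len=320):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    Non-overlapping framing is an assumption made for this sketch.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


SAMPLE_RATE_HZ = 16_000  # sampling frequency given in the text
FRAME_LEN = 320          # samples per frame given in the text

one_second = [0.0] * SAMPLE_RATE_HZ           # one second of silence as dummy input
frames = split_into_frames(one_second, FRAME_LEN)
frame_ms = 1000 * FRAME_LEN / SAMPLE_RATE_HZ  # duration of one frame in milliseconds
```

With these values one second of input yields 50 frames of 20 ms each, which matches the frame length quoted in the description.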

The speech recognition processing unit 92 takes the speech feature amount O_t as input, refers to the acoustic model recorded in the acoustic model parameter memory 93 and the language model recorded in the language model parameter memory 94, and outputs a speech recognition result based on a beam search algorithm. The beam search algorithm is a search procedure that, in each frame, keeps only a predetermined number (the beam width) of speech recognition result candidates (hypotheses), taken in descending order of cumulative likelihood (the sum of the acoustic model likelihood and the language model likelihood). The beam width is chosen so that the recognition result that finally has the highest cumulative likelihood is roughly guaranteed to remain among the candidates.
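The pruning step of a beam search can be sketched as follows. This is a toy illustration, not the decoder of the apparatus: the function names are hypothetical, and each frame's score is assumed to depend only on the label chosen in that frame, whereas a real decoder scores hypotheses against acoustic and language models with history.

```python
import heapq


def prune_to_beam(hypotheses, beam_width):
    """Keep the beam_width hypotheses with the highest cumulative score.

    hypotheses: list of (cumulative_log_likelihood, label_sequence) tuples.
    The result is sorted from best to worst.
    """
    return heapq.nlargest(beam_width, hypotheses, key=lambda h: h[0])


def beam_search(frame_scores, beam_width):
    """Toy beam search over per-frame label scores.

    frame_scores: one dict per frame, mapping a label to the combined
    acoustic + language log-likelihood of that label (an assumption
    made to keep the sketch small).
    """
    beam = [(0.0, ())]
    for scores in frame_scores:
        expanded = [
            (total + score, path + (label,))
            for total, path in beam
            for label, score in scores.items()
        ]
        beam = prune_to_beam(expanded, beam_width)
    return beam[0]  # best surviving hypothesis
```

A wider beam keeps more hypotheses alive per frame, which costs time but lowers the risk of pruning the hypothesis that would have won in the end.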

Conventionally, variations in the quality of speech data have been handled by adaptively changing the acoustic model (Non-Patent Document 1). FIG. 11 shows the idea. Background noise and speech distortion contained in speech data recorded in the field are estimated, and a transformation matrix that changes the standard acoustic model on which adaptation is based is generated. The standard acoustic model is then multiplied by the transformation matrix to adapt the acoustic model to the environment of that site. By adapting the acoustic model to various environments in this way, variations in the quality of speech data have been accommodated.

Non-Patent Document 1: Hirokazu Masataki and five others, "VoiceRex: Spontaneous Speech Recognition Technology that Captures Natural Conversations with Customers," NTT Technical Journal, pp. 15-18, November 2006.

Conventionally, the usual way of dealing with variations in speech data quality has been to adapt the acoustic model as described above. In other words, the speech recognition apparatus itself is expected to cope with changes in the quality of the speech data. As a result, for excessively distorted speech data and the like, even an adapted acoustic model does not produce a sufficient difference in likelihood between hypotheses during the beam search, so search efficiency deteriorates and processing time increases. The problem is that recognition takes a long time and still does not yield a highly accurate result.

The present invention has been made in view of these problems. Its object is to provide a speech recognition apparatus that evaluates the quality of speech data as preprocessing for speech recognition and controls the recognition processing according to the evaluation result, a speech recognition apparatus that efficiently recognizes a plurality of audio files based on the same idea, and the corresponding methods, program, and recording medium.

The speech recognition apparatus of the present invention includes a feature amount analysis unit, a frame sound quality estimation unit, an average sound quality estimation unit, a speech recognition processing control unit, and a speech recognition processing unit. The feature amount analysis unit analyzes, frame by frame, the speech feature amount of the digital speech signal contained in an input audio file. The frame sound quality estimation unit refers to a GMM for each frame, calculates the GMM likelihood corresponding to the speech feature amount of that frame, and outputs it as the frame sound quality. The average sound quality estimation unit calculates a sound quality level, which is the sound quality of the audio file, from the frame sound quality of all frames of the audio file. When the sound quality level is worse than a predetermined threshold, the speech recognition processing control unit outputs a control signal containing a non-recognition instruction signal, which indicates that speech recognition is not to be performed. When the control signal contains the non-recognition instruction signal, the speech recognition processing unit does not perform speech recognition on the audio file.

A speech recognition apparatus that efficiently recognizes a plurality of audio files further includes, in addition to the functional configuration described above, an audio file control unit, an audio file processing unit, and an audio file memory. The audio file control unit takes the audio file information of the digital speech signals and the control signal as input and determines the processing order of the audio file information. The audio file processing unit records the digital speech signals in the audio file memory file by file and outputs the recorded digital speech signals to the speech recognition processing unit in that processing order.

According to the speech recognition apparatus of the present invention, the operation of the speech recognition processing unit is changed adaptively by a control signal that reflects the quality of the speech data, so processing time efficiency can be improved while speech recognition accuracy is maintained. In the speech recognition apparatus of the present invention that recognizes a plurality of audio files, the files can be processed in descending order of speech quality based on the control signal. Files whose speech quality does not reach a predetermined level can also be excluded from recognition, which improves the processing efficiency of the speech recognition process as a whole. In short, poor-quality speech data no longer becomes a bottleneck, so the efficiency of speech recognition improves.

FIG. 1 is a diagram showing a functional configuration example of the speech recognition apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the speech recognition apparatus 100.
FIG. 3 is a diagram showing a functional configuration example of the frame sound quality estimation unit 10.
FIG. 4 is a diagram showing a functional configuration example of the frame sound quality estimation unit 11.
FIG. 5 is a diagram showing how the beam search width of the control signal of the speech recognition processing control unit 30 is set.
FIG. 6 is a diagram showing a functional configuration example of the speech recognition apparatus 200 of the present invention.
FIG. 7 is a diagram showing the operation flow of the speech recognition apparatus 200.
FIG. 8 is a diagram showing a functional configuration example of the speech recognition apparatus 300 of the present invention.
FIG. 9 is a diagram showing the operation flow of the speech recognition apparatus 300.
FIG. 10 is a diagram showing the functional configuration of the conventional speech recognition apparatus 900.
FIG. 11 is a diagram showing the approach, disclosed in Non-Patent Document 1, for handling variations in the quality of speech data.

Embodiments of the present invention will be described below with reference to the drawings. Identical components in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The speech recognition apparatus 100 includes a feature amount analysis unit 91, a frame sound quality estimation unit 10, an average sound quality estimation unit 20, a speech recognition processing control unit 30, a speech recognition processing unit 92′, an acoustic model parameter memory 93, a language model parameter memory 94, and a control unit 35. The feature amount analysis unit 91, the acoustic model parameter memory 93, and the language model parameter memory 94 are the same as those of the conventional speech recognition apparatus 900. The speech recognition processing unit 92′ differs from the speech recognition processing unit 92 only in that it performs speech recognition based on the control signal output by the speech recognition processing control unit 30; its other operations are the same. As in the speech recognition apparatus 900, an A/D conversion unit 90 is provided when analog speech data is input.

The speech recognition apparatus 100 is realized by loading a predetermined program into a computer made up of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

Compared with the conventional speech recognition apparatus 900, the speech recognition apparatus 100 is new in that it includes the frame sound quality estimation unit 10, the average sound quality estimation unit 20, and the speech recognition processing control unit 30. The following description focuses on these differences.

The feature amount analysis unit 91 takes the digitized speech signal as input and calculates the speech feature amount O_t for each frame, where one frame consists of a predetermined number of samples (for example, 20 ms) (step S91). The frame sound quality estimation unit 10 evaluates the sound quality of the digital speech signal for each frame t and outputs the frame sound quality q(t) (step S10).

The average sound quality estimation unit 20 estimates the sound quality level Q_T from the frame sound quality q(t) of a plurality of frames (step S20). The subscript T is the serial number of that set of frames.

The speech recognition processing control unit 30 sets the control signal used during speech recognition based on the sound quality level Q_T (step S30). Specific examples of the control signal are described later. The speech recognition processing unit 92′ performs speech recognition based on the control signal set by the speech recognition processing control unit 30 (step S92′).

The above operations are repeated until all frames have been processed (the "No" branch of step S35). The control unit 35 controls the operation of each unit of the speech recognition apparatus 100 and the repetition. The control unit 35 need not work only frame by frame; it may also control the units so that the above operations are executed per audio file or per utterance.

According to the speech recognition apparatus 100, the speech recognition processing unit 92′ performs recognition adaptively according to the control signal set by the speech recognition processing control unit 30. That is, by setting a control signal that reflects the sound quality level Q_T of a plurality of frames, processing time efficiency can be improved while speech recognition accuracy is maintained. Next, the operation of the speech recognition apparatus 100 is described in more detail using specific configuration examples of each unit.

[Frame sound quality estimation unit]
FIG. 3 shows a functional configuration example of the frame sound quality estimation unit 10. The frame sound quality estimation unit 10 includes, for example, GMM likelihood calculation means 101 and a GMM (Gaussian Mixture Model) 102. The GMM 102 may be stored in the acoustic model parameter memory 93. The GMM likelihood calculation means 101 takes the speech feature amount O_t as input, refers to the GMM 102, and calculates the GMM likelihood corresponding to O_t, which represents the frame sound quality q(t). Because the GMM 102 is trained on, for example, all phonemes in the training data of the acoustic model, its likelihood indicates how well the speech feature amount O_t matches the acoustic model, and the GMM likelihood value q(t) can be used to evaluate the sound quality of each frame (whether or not it matches the acoustic model). A large GMM likelihood indicates good sound quality (speech recognition accuracy will be high), and a small value indicates poor sound quality (speech recognition accuracy will be low).
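As a minimal sketch of a per-frame GMM likelihood, the function below evaluates the log-likelihood of one feature vector. The diagonal covariance form and the name `gmm_log_likelihood` are assumptions for illustration; the description does not specify the covariance structure or how the GMM 102 is parameterized.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM.

    weights:   mixture weights, one per component (summing to 1)
    means:     one mean vector per component
    variances: one variance vector per component (diagonal covariance,
               which is an assumption of this sketch)
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    # log-sum-exp over the components for numerical stability
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))
```

A feature vector close to the component means scores higher than one far from them, which is the property the frame sound quality q(t) relies on.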

Instead of the GMM likelihood, the output probability value before conversion to a logarithm may be used. Furthermore, silence may be removed from the training data of the GMM 102 so that the GMM 102 serves as a speech GMM. Alternatively, the features may be matched against both the speech GMM and a pause (silence) model, and the higher of the two likelihood values used.

FIG. 4 shows a frame sound quality estimation unit 11 as another functional configuration example. The frame sound quality estimation unit 11 includes power calculation means 111, speech/non-speech section detection means 112, and S/N calculation means 113. The power calculation means 111 calculates the power of each frame from the speech feature amount O_t. The speech/non-speech section detection means 112 detects, for example, frames whose power is at or above a certain value as speech sections. Alternatively, the likelihoods of the speech GMM and the pause model described above may be compared, and sections where the speech GMM likelihood is higher may be treated as speech sections. The S/N calculation means 113 calculates the S/N ratio, which is the ratio of the power of the speech sections to the power of the non-speech sections. This S/N ratio serves as the frame sound quality q(t).
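The power-threshold variant can be sketched as follows. Several details are assumptions made for illustration: power is computed from raw samples rather than from the feature amount O_t, the ratio is expressed in dB using mean powers, and the function names and the `None` return for degenerate input are hypothetical.

```python
import math


def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def snr_db(frames, power_threshold):
    """S/N ratio in dB: mean speech-frame power over mean non-speech-frame power.

    Frames whose power is at or above power_threshold count as speech.
    Returns None when either class is empty or the noise power is zero.
    """
    powers = [frame_power(f) for f in frames]
    speech = [p for p in powers if p >= power_threshold]
    noise = [p for p in powers if p < power_threshold]
    if not speech or not noise:
        return None
    noise_power = sum(noise) / len(noise)
    if noise_power == 0.0:
        return None
    speech_power = sum(speech) / len(speech)
    return 10.0 * math.log10(speech_power / noise_power)
```

For example, a speech frame with amplitude 1.0 next to a noise frame with amplitude 0.1 gives a power ratio of about 100, that is, roughly 20 dB.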

[Average sound quality estimation unit]
The average sound quality estimation unit 20 estimates the sound quality level Q_T by averaging the frame sound quality q(t) output by the frame sound quality estimation unit 10, for example the GMM likelihood or the S/N ratio, over a plurality of frames (equation (1)).

Here, t is the frame number and T is the number of frames being averaged. The subscript T is the serial number of that set of frames.
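Equation (1) itself is not reproduced in this text. A plausible reconstruction from the description (a plain average of q(t) over the T frames), with the frames assumed to be numbered from 1 to T, is:

```latex
Q_T = \frac{1}{T} \sum_{t=1}^{T} q(t) \tag{1}
```

The exact form in the published specification may differ, for example in how the frames are indexed.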

[Speech recognition processing control unit]
The speech recognition processing control unit 30 takes the sound quality level Q_T as input and outputs a control signal. One specific example of the control signal is the beam search width N(Q_T). An example is shown in equation (2).

FIG. 5 illustrates the relationship between the sound quality level Q_T and the beam search width N(Q_T). The horizontal axis is the sound quality level Q_T, and the vertical axis is the beam search width N(Q_T).

As shown in FIG. 5, equation (2) distributes the beam search width N(Q_T), which ranges from N_min to N_max, proportionally over a predetermined range of the sound quality level Q_T (Q_min to Q_max) according to the value of Q_T. Because the proportionality coefficient is negative here, the beam search width N(Q_T) is large when the sound quality level Q_T is small, and N(Q_T) is small when Q_T is large. The relationship between the sound quality level Q_T and the beam search width N(Q_T) may of course also be expressed by a nonlinear function. When the beam search width N(Q_T) is used as the control signal, the beam search width is not limited to a count-based beam width; it may be, for example, a score beam width, a word-end score beam width, or a word-end count beam width.
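Equation (2) is not reproduced in this text. One linear form consistent with the description (N_max at Q_min, N_min at Q_max, negative slope in between) would be the following; it is a reconstruction under that assumption, not the published formula:

```latex
N(Q_T) = N_{\max} - \frac{N_{\max} - N_{\min}}{Q_{\max} - Q_{\min}} \,\bigl(Q_T - Q_{\min}\bigr),
\qquad Q_{\min} \le Q_T \le Q_{\max} \tag{2}
```

Substituting Q_T = Q_min gives N_max and substituting Q_T = Q_max gives N_min, which matches the endpoints described for FIG. 5.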

As for the range of sound quality measures such as the S/N ratio or the GMM likelihood, Q_max and Q_min may be taken, for example, as the maximum and minimum values of the sound quality distribution over the acoustic model training data. When the S/N ratio is used as the sound quality, a predetermined range such as Q_max = 30 dB and Q_min = 10 dB may be used instead. As for the beam search width, N_max may be set to, for example, 1.5 times the normally used beam width and N_min to half the normally used beam width.

When the sound quality level is extremely poor (for example, Q_T < Q_min), widening the beam search width cannot be expected to improve accuracy and only increases processing time, so the beam search width may instead be made small, for example N_min. Alternatively, a non-recognition instruction signal may be included in the control signal so that speech recognition is not performed at all.
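The control rule described above can be sketched as a small function. It is an illustration under assumptions: the linear interpolation, the clamping to N_min at and above Q_max, the function name `beam_width_control`, and the `(beam_width, skip)` return convention are all hypothetical choices, with `skip=True` standing in for the non-recognition instruction signal.

```python
def beam_width_control(q, q_min, q_max, n_min, n_max, skip_below_q_min=False):
    """Map a sound quality level q to a control signal (beam_width, skip).

    skip=True stands for the non-recognition instruction signal; in that
    case beam_width is None.
    """
    if q < q_min:
        # Extremely poor quality: widening the beam is not expected to help,
        # so either skip recognition or fall back to the narrowest beam.
        return (None, True) if skip_below_q_min else (n_min, False)
    if q >= q_max:
        # Clamping above Q_max is an assumption of this sketch.
        return (n_min, False)
    # Linear interpolation with negative slope between
    # (q_min, n_max) and (q_max, n_min).
    width = n_max - (n_max - n_min) * (q - q_min) / (q_max - q_min)
    return (round(width), False)
```

With the example S/N range of 10 dB to 30 dB and a hypothetical normal beam width of 1000 (so N_max = 1500 and N_min = 500), a file at 20 dB is assigned a width of 1000.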

[Speech recognition processing unit]
The speech recognition processing unit 92′ takes the speech feature amount O_t and the beam search width N(Q_T) of the control signal as input, refers to the acoustic model recorded in the acoustic model parameter memory 93 and the language model recorded in the language model parameter memory 94, and outputs a speech recognition result based on the beam search algorithm. The speech recognition processing unit 92′ differs from the conventional speech recognition apparatus only in that it searches for the correct answer among N(Q_T) speech recognition result candidates, where N(Q_T) is the beam search width. That is, the operation of the speech recognition processing unit changes adaptively. The beam search method itself is the same as in conventional speech recognition apparatuses, so a detailed description is omitted.

As described above, the speech recognition apparatus 100 changes the speech recognition processing adaptively according to the sound quality level Q_T. In the example shown in FIG. 5, the beam search width N(Q_T) is made large when the sound quality level Q_T is poor and small when the sound quality level is good. When the sound quality is good, likelihood differences arise between speech recognition result candidates (hypotheses), so narrowing the beam search width does not degrade recognition accuracy and processing speed can be improved. When the sound quality is poor, likelihood differences between candidates (hypotheses) are harder to obtain, so widening the beam width can improve recognition accuracy. When the sound quality is extremely poor, however, no likelihood difference arises between candidates (hypotheses) even with a wider beam search width, so processing speed can be improved by instead narrowing the beam search width or excluding the data from recognition. Processing time efficiency can therefore be improved while speech recognition accuracy is maintained.

When the idea described in Embodiment 1, adaptively changing the control signal according to the sound quality level Q_T, is applied to a speech recognition apparatus that recognizes a plurality of audio files, those files can be recognized efficiently.

FIG. 6 shows a functional configuration example of such a speech recognition apparatus 200, and FIG. 7 shows its operation flow. The speech recognition apparatus 200 differs from the speech recognition apparatus 100 in that it further includes an audio file control unit 40, an audio file processing unit 50, and an audio file memory 60. The rest of the functional configuration is the same as that of the speech recognition apparatus 100.

The audio file control unit 40 takes as input the audio file information (for example, the audio file name) of an externally supplied digital speech signal, the sound quality level Q_T of that signal, and the control signal N(Q_T), and determines the processing order of the audio files (step S40, FIG. 7). The audio file processing unit 50 records the feature amounts of the digital speech signal in the audio file memory 60 frame by frame for each audio file (step S501). The control signal is recorded at the same time. The audio file processing unit 50 then outputs the recorded feature amounts and control signals in the processing order determined by the audio file control unit 40 (step S50).

The processing from step S91, which analyzes the feature amounts, through step S501, which records the feature amounts and the control signal in the audio file memory 60 per audio file, is carried out until all input audio files have been processed (the "No" branch of step S502). The feature amounts and control signals are then output to the speech recognition processing unit 92′ frame by frame, following the processing order of the files (step S503).

The speech recognition processing unit 92′ operates in the same way as in the speech recognition apparatus 100, except that the control signal is supplied by the audio file processing unit 50. The speech recognition processing unit 92′ performs speech recognition based on the control signal (step S92′). Speech recognition is repeated until all input files have been processed (the "No" branch of step S36); the control unit 36 controls this operation.

The audio file control unit 40 determines the processing order based on the sound quality level Q_T. If the processing order is set to descending order of Q_T, the speech recognition processing unit 92′ recognizes the files starting with those of the best sound quality. Because the audio files are recognized in order of sound quality, a plurality of audio files can be recognized efficiently.

In addition, when the number or specifications of the computers performing speech recognition do not allow all files to be recognized, referring to the sound quality level Q_T makes it possible to recognize only the audio files with good sound quality.

The audio file processing unit 50 may also include sound quality range determination means 501, which determines whether the sound quality level Q_T is greater than a predetermined value Q_th and discards the audio file when Q_T is smaller than that value.
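Ordering files by sound quality and discarding those below the threshold can be sketched as follows. The function name `plan_processing_order`, the dictionary input, and the choice to keep files whose level equals Q_th are assumptions made for illustration.

```python
def plan_processing_order(file_quality, q_th=None):
    """Return file names sorted by descending sound quality level.

    file_quality: dict mapping an audio file name to its sound quality
                  level Q_T.
    q_th:         optional threshold; files with Q_T below it are discarded.
                  Keeping files exactly at the threshold is an assumption.
    """
    kept = {
        name: q for name, q in file_quality.items()
        if q_th is None or q >= q_th
    }
    return sorted(kept, key=kept.get, reverse=True)
```

For instance, with levels `{"a.wav": 12.0, "b.wav": 25.0, "c.wav": 8.0}` and a threshold of 10.0, the plan is `b.wav` followed by `a.wav`, and `c.wav` is dropped.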

Although an example has been described in which the audio file processing unit 50 records feature amounts in the audio file memory 60, the digital speech signal before feature analysis may be recorded in the audio file memory 60 instead. The sound quality range determination means 501 may, for example, use the lowest sound quality level found in the acoustic model training data as the predetermined value Q_th and select the files to discard on that basis.

The predetermined value is not limited to the lowest sound quality level of the training data. Assuming that the sound quality level of the training data follows a normal distribution, the predetermined value may be set to μ − 2σ, where μ and σ are the mean and standard deviation of the distribution of the sound quality level Q_T. The speech recognition processing unit 92′ may also be an ordinary speech recognition apparatus. In that case no control signal is needed, and the speech recognition apparatus simply performs recognition in descending order of sound quality.
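The μ − 2σ threshold can be computed with the Python standard library as sketched below. The function name `quality_threshold` and the use of the population standard deviation (`statistics.pstdev`) are assumptions; the description does not say whether a population or sample estimate is intended.

```python
import statistics


def quality_threshold(training_levels, k=2.0):
    """Threshold mu - k * sigma from sound quality levels of the training data.

    Uses the population standard deviation, which is an assumption of
    this sketch; k=2.0 corresponds to the mu - 2 sigma rule in the text.
    """
    mu = statistics.mean(training_levels)
    sigma = statistics.pstdev(training_levels)
    return mu - k * sigma
```

Under the normal distribution assumption, only a small fraction of training-like data falls below this value, so files below it are unusually poor relative to what the acoustic model was trained on.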

図８にこの発明の音声認識装置３００の機能構成例を示す。その動作フローを図９に示す。音声認識装置３００は、実施例１の音声認識装置１００の機能構成に更に、教師なし適応部８０と、適応後音響モデルパラメータメモリ９５と、第２音声認識処理部９６とを備え、音声認識装置１００で音声認識処理した音声認識結果を適応用ラベルとして学習した音響モデルを用いて音声認識処理を行うものである。 FIG. 8 shows a functional configuration example of the speech recognition apparatus 300 of the present invention. The operation flow is shown in FIG. The speech recognition apparatus 300 further includes an unsupervised adaptation unit 80, an after-adaptation acoustic model parameter memory 95, and a second speech recognition processing unit 96 in addition to the functional configuration of the speech recognition apparatus 100 of the first embodiment. The speech recognition process is performed using an acoustic model learned by using the speech recognition result obtained by performing the speech recognition process in 100 as an adaptive label.

教師なし適応部８０は、音声認識処理部９２′の出力する音声認識結果を適応用ラベルとして音響モデルパラメータメモリ９３に記録された音響モデルを学習し、適応音響モデルを生成する（ステップＳ８０、図９）。適応音響モデルは適応後音響モデルパラメータメモリ９５に記録される。 The unsupervised adaptation unit 80 learns the acoustic model recorded in the acoustic model parameter memory 93 using the speech recognition result output from the speech recognition processing unit 92 ′ as an adaptation label, and generates an adaptive acoustic model (step S80, FIG. 9). The adaptive acoustic model is recorded in the post-adaptation acoustic model parameter memory 95.

第２音声認識処理部９６は、適応後音響モデルパラメータメモリ９５と言語モデルパラメータメモリ９４とを参照して、ビーム探索アルゴリズムに基づいて音声認識結果を出力する（ステップＳ９６）。このステップＳ９６の第２音声認識処理過程は、実施例１の音声認識装置１００の処理でも良いし、一般的な音声認識装置による処理でもかまわない。なお、教師なし適応部８０に制御信号を破線で入力しているように、教師なし適応部８０が制御信号の値に応じて、音声認識処理部９２′の出力する音声認識結果を適応ラベルとするか否かを判断するようにしても良い。 The second speech recognition processing unit 96 refers to the post-adaptation acoustic model parameter memory 95 and the language model parameter memory 94, and outputs a speech recognition result based on the beam search algorithm (step S96). The process of the second speech recognition process in step S96 may be the process of the speech recognition apparatus 100 of the first embodiment or may be the process of a general speech recognition apparatus. Note that, as the control signal is input to the unsupervised adaptation unit 80 by a broken line, the unsupervised adaptation unit 80 uses the speech recognition result output from the speech recognition processing unit 92 'as the adaptive label according to the value of the control signal. Whether or not to do so may be determined.

以上述べたように、音声認識装置３００によれば、音声データの音質レベルに応じて音声認識した結果を適応用ラベルとして音響モデルを学習するので、音響モデルの精度を高めることが出来る。そして、その精度の高い音響モデルを用いた音声認識処理を行うことが可能である。また、この発明の音声認識装置１００，２００によれば、音声データの品質に応じて音声認識処理部の動作を制御信号によって変化させるので、音声認識処理の効率を向上させることが出来る。 As described above, according to the speech recognition apparatus 300, the acoustic model is learned using the result of speech recognition according to the sound quality level of the speech data as an adaptive label, so that the accuracy of the acoustic model can be improved. Then, it is possible to perform speech recognition processing using the highly accurate acoustic model. Further, according to the speech recognition apparatuses 100 and 200 of the present invention, the operation of the speech recognition processing unit is changed by the control signal according to the quality of the speech data, so that the efficiency of the speech recognition processing can be improved.
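
The way a control signal might depend on the sound quality level can be sketched in Python as below. The claims state only that recognition is skipped below a threshold and that the beam search width is larger for worse sound quality. The `ControlSignal` class, the linear interpolation, the reference level `q_best`, and the beam width values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ControlSignal:
    skip_recognition: bool        # True corresponds to a non-recognition instruction
    beam_width: Optional[float]   # None when recognition is skipped


def make_control_signal(q_level: float, q_th: float, q_best: float,
                        min_beam: float = 100.0, max_beam: float = 200.0) -> ControlSignal:
    """Map a sound quality level Q_T to a (hypothetical) control signal."""
    if q_level < q_th:
        # Below the threshold: instruct the recognizer not to process the file.
        return ControlSignal(skip_recognition=True, beam_width=None)
    span = q_best - q_th
    # Position of q_level between q_th (0.0) and q_best (1.0), clamped.
    ratio = 0.0 if span <= 0 else min(max((q_level - q_th) / span, 0.0), 1.0)
    # Worse quality gives a wider beam; better quality gives a narrower beam.
    beam = max_beam - ratio * (max_beam - min_beam)
    return ControlSignal(skip_recognition=False, beam_width=beam)


if __name__ == "__main__":
    for q in (-70.0, -60.0, -50.0, -40.0):
        print(q, make_control_signal(q, q_th=-60.0, q_best=-40.0))
```

A narrower beam for clean audio reduces search cost where pruning errors are unlikely, which is one plausible reading of the efficiency gain described above.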

なお、実施例１のフレーム音質推定部１０をＧＭＭで構成する例で説明を行ったが、フレーム毎にモノフォン尤度を計算し、そのモノフォン尤度でフレーム音質を推定するようにしても良い。つまり、入力される特徴量に対して、音響モデルに属するモノフォン全てを照合し、もっとも尤度の高い最尤モノフォンで音質を評価するようにしても良い。 In addition, although the frame sound quality estimation unit 10 according to the first embodiment has been described as an example configured with a GMM, a monophone likelihood may be calculated for each frame, and the frame sound quality may be estimated using the monophone likelihood. That is, all the monophones belonging to the acoustic model may be checked against the input feature quantity, and the sound quality may be evaluated using the maximum likelihood monophone with the highest likelihood.
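
The maximum likelihood monophone evaluation can be sketched in Python as follows. For brevity each monophone is modelled as a single diagonal-covariance Gaussian, which is an assumption; an actual acoustic model would normally use HMM states with mixture densities. The names `log_gaussian` and `frame_quality` and the toy models are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# (means, variances) of a diagonal-covariance Gaussian; a simplifying assumption.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def log_gaussian(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log-likelihood of feature vector x under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def frame_quality(x: Sequence[float], monophones: Dict[str, Gaussian]) -> Tuple[str, float]:
    """Return the maximum likelihood monophone and its log-likelihood for one frame."""
    # Score the frame against every monophone in the acoustic model.
    scores = {name: log_gaussian(x, mean, var) for name, (mean, var) in monophones.items()}
    best = max(scores, key=scores.get)
    # The best score serves as the frame sound quality.
    return best, scores[best]


if __name__ == "__main__":
    models = {"a": ([0.0, 0.0], [1.0, 1.0]), "i": ([3.0, 3.0], [1.0, 1.0])}
    print(frame_quality([0.1, -0.2], models))
```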

また、上記方法及び装置において説明した処理は、記載の順に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されるとしてもよい。 Further, the processes described for the above method and apparatus may be executed not only in time series in the order described, but also in parallel or individually, depending on the processing capability of the apparatus that executes them or as needed.

また、上記装置における処理手段をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、各装置における処理手段がコンピュータ上で実現される。 Further, when the processing means in the above apparatus is realized by a computer, the processing contents of functions that each apparatus should have are described by a program. Then, by executing this program on the computer, the processing means in each apparatus is realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）/ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disk; an MO (Magneto Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記録装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a recording device of a server computer and transferring the program from the server computer to another computer via a network.

また、各手段は、コンピュータ上で所定のプログラムを実行させることにより構成することにしてもよいし、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 Each means may be configured by executing a predetermined program on a computer, or at least a part of these processing contents may be realized by hardware.

Claims

A feature quantity analysis unit that analyzes voice feature quantities of a voice digital signal included in an input voice file in units of frames;
A frame sound quality estimation unit that evaluates the sound quality of the audio digital signal for each frame and outputs a frame sound quality that is the sound quality of the frame;
An average sound quality estimation unit that calculates a sound quality level that is the sound quality of the sound file from the frame sound quality of all frames of the sound file;
A speech recognition processing control unit that outputs a control signal for controlling an operation during speech recognition processing based on the sound quality level;
A speech recognition processing unit that performs speech recognition processing based on the control signal;
Comprising
The frame sound quality estimation unit calculates a GMM likelihood corresponding to the voice feature amount of the frame with reference to the GMM, and outputs the GMM likelihood as the frame sound quality.
The voice recognition processing control unit outputs a control signal including a non-recognition instruction signal indicating that voice recognition processing is not performed when the sound quality level is lower than a predetermined threshold,
The voice recognition processing unit does not perform voice recognition processing of the voice file when the control signal includes a non-recognition instruction signal.
A speech recognition apparatus characterized by that.

The speech recognition apparatus according to claim 1,
  The frame sound quality estimation unit uses a speech GMM obtained from GMM learning data from which silence has been removed, calculates a speech GMM likelihood that is the likelihood corresponding to the speech feature amount of the frame in the speech GMM, and outputs the speech GMM likelihood as the frame sound quality,
  or
  compares the speech feature amount of the frame with a silence model, calculates a silence likelihood that is the likelihood corresponding to the speech feature amount of the frame in the silence model, and outputs the higher of the speech GMM likelihood and the silence likelihood as the frame sound quality,
  A speech recognition apparatus characterized by that.

The speech recognition apparatus according to claim 1 or 2,
Furthermore, an audio file memory for recording a plurality of audio files,
An audio file control unit for determining the processing order of the audio file in descending order of the sound quality level, by inputting the audio file information of the audio digital signal, the sound quality level, and the control signal;
A voice file processing unit for recording the voice digital signal in the voice file memory in units of the voice file and outputting the recorded voice digital signal to the voice recognition processing unit in the processing order;
A speech recognition apparatus comprising:

A feature quantity analysis unit that analyzes voice feature quantities of a voice digital signal included in an input voice file in units of frames;
A frame sound quality estimation unit that evaluates the sound quality of the audio digital signal for each frame and outputs a frame sound quality that is the sound quality of the frame;
An average sound quality estimation unit that calculates a sound quality level that is the sound quality of the sound file from the frame sound quality of all frames of the sound file;
A speech recognition processing control unit that outputs a control signal for controlling an operation during speech recognition processing based on the sound quality level;
A speech recognition processing unit that receives the speech feature value and the control signal as input and outputs a result of speech recognition processing based on the speech feature value as an adaptive label;
An unsupervised adaptation unit that learns an acoustic model using the adaptation label as an input and generates an adaptive acoustic model;
A post-adaptive acoustic model parameter memory for recording the adaptive acoustic model;
A second speech recognition processing unit that receives the speech digital signal and performs speech recognition processing with reference to the adaptive acoustic model recorded in the post-adaptation acoustic model parameter memory;
Comprising
The frame sound quality estimation unit calculates a GMM likelihood corresponding to the voice feature amount of the frame with reference to the GMM, and outputs the GMM likelihood as the frame sound quality.
The voice recognition processing control unit outputs a control signal including a non-recognition instruction signal indicating that voice recognition processing is not performed when the sound quality level is lower than a predetermined threshold,
The voice recognition processing unit does not perform voice recognition processing of the voice file when the control signal includes a non-recognition instruction signal.
A speech recognition apparatus characterized by that.

The speech recognition apparatus according to any one of claims 1 to 4,
When the sound quality level is better than the predetermined threshold, the voice recognition processing control unit outputs a control signal that sets the beam search width of the speech recognition processing so that the beam search width is larger as the sound quality level is worse and smaller as the sound quality level is better.

The speech recognition apparatus according to claim 3,
The audio file processing unit
If the sound quality level is lower than the minimum sound quality level for the acoustic model learning data, discard the audio file, or
A speech recognition apparatus, wherein a speech file is discarded when the value of the sound quality level is smaller than a threshold μ-2σ calculated from an average μ and a standard deviation σ of a sound quality level distribution of acoustic model learning data.

A feature quantity analysis process in which a feature quantity analysis unit analyzes a voice feature quantity of a voice digital signal included in an input voice file in units of frames;
A frame sound quality estimation process in which a frame sound quality estimation unit evaluates the sound quality of the audio digital signal for each frame and outputs a frame sound quality that is the sound quality of the frame;
An average sound quality estimation process in which an average sound quality estimation unit calculates a sound quality level that is the sound quality of the sound file from the frame sound quality of all frames of the sound file;
A voice recognition processing control process in which a voice recognition processing control unit outputs a control signal for controlling an operation during the voice recognition process based on the sound quality level;
A voice recognition process in which a speech recognition processing unit performs speech recognition processing based on the control signal;
With
The frame sound quality estimation process calculates a GMM likelihood corresponding to the voice feature of the frame with reference to the GMM, and outputs the GMM likelihood as the frame sound quality.
The voice recognition processing control process outputs a control signal including a non-recognition instruction signal indicating that voice recognition processing is not performed when the sound quality level is lower than a predetermined threshold,
The voice recognition process does not perform voice recognition processing of the voice file when the control signal includes a non-recognition instruction signal.
A speech recognition method characterized by the above.

The speech recognition method according to claim 7,
And an audio file control process in which an audio file control unit determines the audio file processing order in descending order of the audio quality level by inputting the audio file information of the audio digital signal, the audio quality level, and the control signal;
A voice file processing step in which the voice file processing unit records the voice digital signal in the voice file memory in units of the voice file and outputs the recorded voice digital signal to the voice recognition processing unit in the processing order;
A speech recognition method comprising:

A feature quantity analysis process in which a feature quantity analysis unit analyzes a voice feature quantity of a voice digital signal included in an input voice file in units of frames;
A frame sound quality estimation process in which a frame sound quality estimation unit evaluates the sound quality of the audio digital signal for each frame and outputs a frame sound quality that is the sound quality of the frame;
An average sound quality estimation process in which an average sound quality estimation unit calculates a sound quality level that is the sound quality of the sound file from the frame sound quality of all frames of the sound file;
A voice recognition processing control process in which a voice recognition processing control unit outputs a control signal for controlling an operation during the voice recognition process based on the sound quality level;
A speech recognition processing step in which a speech recognition processing unit outputs the result of speech recognition processing based on the speech feature amount as an adaptation label by receiving the speech feature amount and the control signal;
An unsupervised adaptation process in which an unsupervised adaptation unit learns an acoustic model using the adaptation label as an input and generates an adaptive acoustic model;
A second speech recognition process in which a second speech recognition processing unit receives the speech digital signal and performs speech recognition processing with reference to the adaptive acoustic model recorded in the post-adaptation acoustic model parameter memory;
With
The frame sound quality estimation process calculates a GMM likelihood corresponding to the voice feature of the frame with reference to the GMM, and outputs the GMM likelihood as the frame sound quality.
The voice recognition processing control process outputs a control signal including a non-recognition instruction signal indicating that voice recognition processing is not performed when the sound quality level is lower than a predetermined threshold,
The voice recognition process does not perform voice recognition processing of the voice file when the control signal includes a non-recognition instruction signal.
A speech recognition method characterized by the above.

An apparatus program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 6.

A computer-readable recording medium on which the apparatus program according to claim 10 is recorded.