JP5609182B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP5609182B2
Application number: JP2010059791A
Authority: JP
Inventors: 秀治古明地; 隆行荒川; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-03-16
Filing date: 2010-03-16
Publication date: 2014-10-22
Anticipated expiration: 2030-03-16
Also published as: JP2011191682A

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program that recognize speech from an input signal.

The performance of a speech recognition system degrades significantly under the influence of noise. In actual operation, a noise-robust technique is therefore needed so that the desired speech can be recognized even when noise is mixed in. The degradation is caused by a mismatch between the speech data used to train the acoustic model and the input signal encountered in actual operation. Noise-robust techniques for speech recognition that reduce this mismatch fall broadly into two groups.

One is to suppress or remove the noise component from the input signal so that the input signal approaches the speech data used to train the acoustic model (hereinafter, the noise suppression method). The other is to adapt the acoustic model to the same noise environment as the input signal (hereinafter, the acoustic model adaptation method).

As a noise suppression method, the spectral subtraction method (hereinafter, the SS method) is widely used (see, for example, Patent Document 1 and Non-Patent Document 1). The SS method is a frequency-domain noise suppression method that suppresses the noise contained in the input signal by subtracting a separately estimated noise power spectrum from the power spectrum of the noisy speech signal (the input signal). Patent Document 1 describes a method of varying a suppression coefficient, a parameter that controls the amount of noise suppression, according to the signal-to-noise ratio (SNR). Techniques for further improving the accuracy of noise estimation have also been studied, as described in Non-Patent Document 2.

As acoustic model adaptation methods, the HMM synthesis method (see, for example, Non-Patent Document 1), the Jacobian method, and the Vector Taylor Series (VTS) method (see, for example, Non-Patent Document 3) are known.

The HMM synthesis method combines an HMM generated in advance from clean speech containing no noise (hereinafter, the clean HMM) with an HMM generated from estimated noise, producing an HMM that matches speech uttered in the target noise environment. One concrete form of the HMM synthesis method is the Parallel Model Combination method (hereinafter, the PMC method). In the PMC method, the features are transformed back into spectral-domain quantities, and the two HMMs are combined in the spectral domain.

The Jacobian method and the VTS method use a linear expression to approximate how each distribution constituting the clean HMM changes in the noise environment, based on the estimated noise.

Patent Document 1: JP 2000-330597 A

Non-Patent Document 1: Hiroshi Matsumoto, "Speech Recognition Techniques for Noisy Environments", 2nd Information Science and Technology Forum (FIT2003), September 2003, pp. 1-4
Non-Patent Document 2: Masanori Kato, et al., "Noise Suppression with High Speech Quality Based on Weighted Noise Estimation and MMSE STSA", Proc. IWAENC 2001, September 2001, pp. 183-186
Non-Patent Document 3: A. Acero, et al., "HMM adaptation using vector Taylor series for noisy speech recognition", Int. Conf. Spoken Language Processing (2000), 2000

However, acoustic model adaptation methods such as the HMM synthesis method, the Jacobian method, and the VTS method, and noise suppression methods such as the SS method, cannot always follow a changing noise environment and therefore cannot always perform speech recognition with high accuracy in such an environment.

For example, in the acoustic model adaptation method, the computation time required to adapt the acoustic model to noise may fail to keep up with the time scale on which the noise changes, in which case speech recognition accuracy degrades. In general, the larger the acoustic model, the greater the amount of computation needed for noise adaptation. When this computation time exceeds the time over which the noise changes, the acoustic model cannot fully adapt to the changing noise environment, and speech recognition accuracy degrades.

Noise suppression methods that suppress estimated noise through frequency-domain operations, such as the SS method, have a problem with, for example, flooring. In typical speech recognition, the input signal is converted into features such as a cepstrum, the features are compared, for example by distance, against the per-phoneme probability density functions contained in the acoustic model, and the word sequence corresponding to the input signal is searched for. The noise-suppressed signal used as input can adversely affect the logarithm operation performed during conversion into features such as the cepstrum.

Equation (1) below expresses the noise-suppressed signal in the SS method (that is, the speech signal output as the suppression result). Equation (1) is defined for each frequency band or each subband.

X = max[Y − N, α]   (1)

In Equation (1), X is the power spectrum of the speech, Y is the power spectrum of the input signal, and N is the power spectrum of the noise estimated for the input signal. max[A, B] denotes the operation that takes the larger of A and B, and α is the flooring coefficient.

For example, in a low-SNR environment or in an interval where no speech is uttered, the power spectrum Y of the input signal can be nearly equal to, or smaller than, the estimated noise power spectrum N. If the result of the subtraction were used directly as the power spectrum of the noise-suppressed signal under such conditions, the value obtained by the subsequent logarithm operation would be highly unstable or could not be obtained at all. To prevent this, the SS method introduces the flooring coefficient α of Equation (1), which keeps the value after the frequency-domain subtraction performed to suppress (remove) the noise from becoming negative or close to 0.
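As an illustration only, the flooring of Equation (1) can be sketched in Python as follows. The function name, the example spectra, and the value of alpha are assumptions made for this sketch and do not appear in the specification.

```python
import math


def spectral_subtraction(y, n, alpha):
    """Apply X = max[Y - N, alpha] independently to each frequency bin."""
    if alpha <= 0:
        raise ValueError("alpha must be positive so that log(X) is defined")
    return [max(y_k - n_k, alpha) for y_k, n_k in zip(y, n)]


# Hypothetical power spectra for one frame with four frequency bins.
Y = [4.0, 1.0, 0.5, 2.0]  # input signal
N = [1.0, 1.0, 0.8, 0.5]  # estimated noise

X = spectral_subtraction(Y, N, alpha=0.01)
# Bins where Y - N is zero or negative are raised to alpha,
# so the logarithm taken during feature extraction stays finite.
log_X = [math.log(x) for x in X]
```

In this sketch the second and third bins, where Y does not exceed N, are the ones replaced by the floor.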

However, in a changing noise environment it is difficult to set an optimal value for the flooring coefficient α, which depends on the type of noise and the SNR. If α is too large, recognition accuracy degrades; if it is too small, flooring loses its effect, so that speech recognition is performed on unstable values or the speech recognition process fails with an error. Even when a technique that improves the accuracy of the estimated noise, such as that of Non-Patent Document 2, is used, flooring is triggered by even a small error whenever the power spectrum Y of the input signal is nearly equal to or smaller than the estimated noise power spectrum N, so the above problem remains.

Accordingly, an object of the present invention is to provide a speech recognition apparatus, a speech recognition method, and a speech recognition program that can perform speech recognition with high accuracy even in a changing noise environment.

The speech recognition apparatus according to the present invention comprises: noise statistic calculation means for calculating a noise statistic from noise data estimated for a plurality of frames of an input signal; short-time fluctuation noise component calculation means for calculating, based on the noise statistic calculated by the noise statistic calculation means, a short-time fluctuation component of the noise contained in each frame of the input signal; acoustic model adaptation means for adapting an acoustic model to the noise using the noise statistic calculated by the noise statistic calculation means; noise suppression means for suppressing, for each frame of the input signal, the short-time fluctuation component of the noise contained in that frame as calculated by the short-time fluctuation noise component calculation means; and speech recognition means for performing speech recognition on the input signal suppressed by the noise suppression means, using the acoustic model noise-adapted by the acoustic model adaptation means.

The speech recognition method according to the present invention comprises: calculating a noise statistic from noise data estimated for a plurality of frames of an input signal; adapting an acoustic model to the noise using the calculated noise statistic; suppressing, for each frame of the input signal, a short-time fluctuation component of the noise calculated based on the noise statistic used for the noise-adapted acoustic model; and performing speech recognition on the input signal in which the short-time fluctuation component of the noise has been suppressed, using the noise-adapted acoustic model.

The speech recognition program according to the present invention causes a computer to execute: noise statistic calculation processing for calculating a noise statistic from noise data estimated for a plurality of frames of an input signal; short-time fluctuation noise component calculation processing for calculating, based on the noise statistic calculated by the noise statistic calculation processing, a short-time fluctuation component of the noise contained in each frame of the input signal; acoustic model adaptation processing for adapting an acoustic model to the noise using the noise statistic calculated by the noise statistic calculation processing; noise suppression processing for suppressing, for each frame of the input signal, the short-time fluctuation component of the noise contained in that frame as calculated by the short-time fluctuation noise component calculation processing; and speech recognition processing for performing speech recognition on the input signal suppressed by the noise suppression processing, using the acoustic model noise-adapted by the acoustic model adaptation processing.

According to the present invention, speech recognition can be performed with high accuracy even in a changing noise environment.

FIG. 1 is a block diagram showing a configuration example of the speech recognition apparatus of the first embodiment.
FIG. 2 is a flowchart showing an example of the overall operation of the speech recognition apparatus of the first embodiment.
FIG. 3 is a flowchart showing an example of the processing flow of the noise suppression / speech recognition processing.
FIG. 4 is a flowchart showing an example of the processing flow of the acoustic model noise adaptation processing.
FIG. 5 is an explanatory diagram showing an example of the change over time of the noise power spectrum N, the average power spectrum N_L of the noise component calculated from that noise, and the power spectrum N_S of the short-time fluctuation component of that noise.
FIG. 6 is an explanatory diagram showing another example of the change over time of the noise power spectrum N, the average power spectrum N_L of the noise component calculated from that noise, and the power spectrum N_S of the short-time fluctuation component of that noise.
FIG. 7 is an explanatory diagram showing an example of the operation timing of the acoustic model noise adaptation processing and the noise suppression processing for the input signal.
FIG. 8 is an explanatory diagram showing an example of the operation timing of the acoustic model noise adaptation processing and the noise suppression processing for the input signal.
FIG. 9 is a block diagram showing a configuration example of the speech recognition apparatus of the second embodiment.
FIG. 10 is a flowchart showing an operation example of the speech recognition apparatus of the second embodiment.
FIG. 11 is an explanatory diagram showing an example of the operation timing of the acoustic model noise adaptation processing and the noise suppression processing for the input signal in the second embodiment.
FIG. 12 is an explanatory diagram showing an example of the operation timing of the acoustic model noise adaptation processing and the noise suppression processing for the input signal in the second embodiment.
FIG. 13 is a block diagram showing a configuration example of the speech recognition apparatus of the third embodiment.
FIG. 14 is a flowchart showing an example of the processing flow of the acoustic model noise adaptation processing of the third embodiment.
FIG. 15 is a block diagram showing an outline of the present invention.

Embodiment 1.
Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the speech recognition apparatus according to the first embodiment of the present invention. The speech recognition apparatus shown in FIG. 1 includes a speech recognition unit 100 and an acoustic model adaptation unit 200.

The speech recognition unit 100 includes input signal acquisition means 101, noise estimation means 102, short-time fluctuation noise component calculation means 103, noise suppression means 104, search means 105, and acoustic model storage means 106. The acoustic model adaptation unit 200 includes estimated noise holding means 201, noise statistic calculation means 202, and acoustic model adaptation means 203.

The input signal acquisition means 101 receives a signal collected using a microphone or the like, and cuts out and acquires the time series of the input signal frame by frame. The noise estimation means 102 estimates noise from the time series of the input signal acquired by the input signal acquisition means 101.

The short-time fluctuation noise component calculation means 103 calculates the short-time fluctuation component of the noise based on the noise estimated by the noise estimation means 102 and the noise statistic calculated by the noise statistic calculation means 202 described later. Hereinafter, the short-time fluctuation component of the noise may be referred to as the short-time fluctuation noise component.

Here, the noise statistic is, for example, the mean or variance of the noise, and is a quantity that does not change over the comparatively long period from one adaptation of the acoustic model to the next. If the computation time required to adapt the acoustic model to noise is T, the statistic is a quantity that does not change over a time of T or longer. It may therefore be, for example, a mean or variance obtained from the noise data within a predetermined time interval of length T or longer. The mean and variance of the noise are not limited to those of noise components such as the power spectrum; they also include the mean and variance of the noise in the feature domain, and the means and variances of its first-order, second-order, ..., N-th-order derivatives. An example of a feature is the cepstrum. Besides the cepstrum, various features for speech recognition can be used, such as its first-order and second-order difference components and the pitch frequency.

In contrast, the short-time fluctuation component of the noise is, conceptually, the component of the noise that changes over a comparatively short time (here, a time interval shorter than the one over which the noise statistic is obtained); concretely, it is the noise component that remains after the mean, which is one of the noise statistics, is subtracted from the noise. Hereinafter, "the mean of the noise component" refers to the mean of the noise component as one of the noise statistics used to obtain the short-time fluctuation component.
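The relation between the estimated noise, its mean, and the short-time fluctuation component can be sketched as below. This is a minimal illustration that assumes the statistic is a per-bin arithmetic mean of power spectra over a window of frames; the names N_L and N_S follow the symbols used in the descriptions of the drawings, while the function names and the numbers are hypothetical.

```python
def noise_mean(noise_frames):
    """Per-bin mean N_L over the frames in the statistics window."""
    count = len(noise_frames)
    return [sum(bin_values) / count for bin_values in zip(*noise_frames)]


def short_time_component(noise_frame, mean):
    """N_S = N - N_L for one frame; individual bins may be negative."""
    return [n_k - m_k for n_k, m_k in zip(noise_frame, mean)]


# Hypothetical estimated noise power spectra: three frames, two bins each.
window = [[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]
N_L = noise_mean(window)

# Short-time fluctuation component of a newly estimated noise frame.
N_S = short_time_component([2.5, 2.0], N_L)
```

Only N_S would be suppressed in the input signal, while N_L is the part left for the adapted acoustic model to absorb.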

In the present embodiment, the mean of the noise component that the short-time fluctuation noise component calculation means 103 uses to calculate the short-time fluctuation noise component is taken from the noise statistic that was used in the noise adaptation of the acoustic model applied to that input signal. For example, when a noise-adapted acoustic model is stored in the acoustic model storage means 106 described later, the noise statistic calculation means 202 may store the information on the noise statistic to which the acoustic model was adapted in association with the model, and output that information in response to a noise statistic acquisition request from the short-time fluctuation noise component calculation means 103. In this way, the noise environment of the noise suppressed in the input signal is synchronized with the noise environment of the noise to which the acoustic model has been adapted.
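One way to keep the adapted model and the statistic it was adapted with in step is to store them as a single pair that is replaced atomically. The sketch below is an assumption about how such a store might look, not the structure used in the specification; the class name, method names, and placeholder model values are hypothetical.

```python
import threading


class AdaptedModelStore:
    """Holds an acoustic model together with the noise mean it was adapted to."""

    def __init__(self, model, noise_mean):
        self._lock = threading.Lock()
        self._entry = (model, noise_mean)

    def publish(self, model, noise_mean):
        """Called by the adaptation side once a new adaptation has finished."""
        with self._lock:
            self._entry = (model, noise_mean)

    def snapshot(self):
        """Called by the recognition side; returns a matching (model, mean) pair."""
        with self._lock:
            return self._entry


store = AdaptedModelStore("model-v0", [1.0, 1.0])

# The adaptation side runs in its own thread and publishes a new pair.
worker = threading.Thread(target=store.publish, args=("model-v1", [2.0, 3.0]))
worker.start()
worker.join()

model, noise_mean = store.snapshot()
```

Because the recognition side always reads the pair in one step, the mean it subtracts from the estimated noise is the one the current model was adapted with.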

The noise suppression means 104 performs noise suppression processing that suppresses, in the time series of the input signal, the short-time fluctuation noise component calculated by the short-time fluctuation noise component calculation means 103.

The search means 105 performs speech recognition on the input signal that has undergone noise suppression processing by the noise suppression means 104, using the acoustic model stored in the acoustic model storage means 106 described later. For example, the search means 105 may extract features from the time series of the noise-suppressed input signal, compare the distance between those features and the per-phoneme probability density functions contained in the acoustic model, search for the word sequence corresponding to the input signal, and output the search result as the speech recognition result for the input signal.

The acoustic model storage means 106 stores the acoustic model that has been noise-adapted in the acoustic model adaptation unit 200. In the present embodiment, it stores the acoustic model that the acoustic model adaptation means 203, described later, has adapted to noise based on the noise statistic. As the acoustic model information, the acoustic model storage means 106 may store, for example, the values of the parameters that define the acoustic model, such as the parameter values of an HMM or other acoustic model that represents the variation of speech as a probability density function of the features for each phoneme. Together with the acoustic model information, the acoustic model storage means 106 may also store information on the noise statistic to which the acoustic model was adapted. For example, the mean of the noise component among the noise statistics, or a value pointing to a separately held noise statistic, may be stored in association with the acoustic model.

The estimated noise holding means 201 holds the noise data estimated by the noise estimation means 102. It is assumed to be able to hold successively at least as much noise data as the noise statistic calculation means 202 needs to obtain the noise statistic.

The noise statistic calculation means 202 calculates the noise statistic using the noise data held in the estimated noise holding means 201.

The acoustic model adaptation means 203 adapts the acoustic model to noise using the noise statistic calculated by the noise statistic calculation means 202. For example, the acoustic model adaptation means 203 may add the mean and variance of the noise component given as the noise statistic, or the mean and variance of the noise calculated in the feature domain, to the parameter values of an acoustic model generated from speech containing no noise. In this way the acoustic model is adapted to the noise environment indicated by the noise statistic. As already described, the acoustic model noise-adapted by the acoustic model adaptation means 203 is stored in the acoustic model storage means 106 and used by the search means 105. When storing the acoustic model information in the acoustic model storage means 106, the acoustic model adaptation means 203 may also store the information on the noise statistic to which the acoustic model was adapted.
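The additive update mentioned above can be illustrated for a single Gaussian of the clean model. The sketch assumes that the parameters are expressed in the linear power-spectral domain and that speech and noise are independent, in which case means and variances of the sum simply add; adaptation in the cepstral domain (for example by the PMC or VTS methods) needs a nonlinear mapping that is not shown here. The Gaussian class and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """One diagonal-covariance Gaussian, one value per frequency bin."""
    mean: tuple
    var: tuple


def adapt_gaussian(clean, noise_mean, noise_var):
    """Add the noise mean and variance to the clean parameters, bin by bin."""
    return Gaussian(
        mean=tuple(m + nm for m, nm in zip(clean.mean, noise_mean)),
        var=tuple(v + nv for v, nv in zip(clean.var, noise_var)),
    )


# Hypothetical clean-speech Gaussian and noise statistics (two bins).
clean = Gaussian(mean=(4.0, 2.0), var=(1.0, 0.5))
adapted = adapt_gaussian(clean, noise_mean=(2.0, 3.0), noise_var=(0.5, 0.25))
```

A full adaptation would repeat this over every distribution of the model, which is why the computation time grows with model size.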

In the present embodiment, the input signal acquisition means 101 is realized by, for example, a sound collection device such as a microphone and a CPU or the like that operates according to a program. The noise estimation means 102, the short-time fluctuation noise component calculation means 103, the noise suppression means 104, the search means 105, the noise statistic calculation means 202, and the acoustic model adaptation means 203 are realized by, for example, a CPU or the like that operates according to a program. The acoustic model storage means 106 and the estimated noise holding means 201 are realized by, for example, a storage device.

This example shows the speech recognition unit 100 including the acoustic model storage means 106 and the acoustic model adaptation unit 200 including the estimated noise holding means 201, but these storage means may instead be provided independently, inside or outside the speech recognition apparatus, as storage shared between the speech recognition unit 100 and the acoustic model adaptation unit 200. It is also preferable that the CPU of the speech recognition apparatus, for example by implementing a multithreaded environment, runs the means constituting the speech recognition unit 100 and the means constituting the acoustic model adaptation unit 200 as separate threads. Within the speech recognition unit 100, the input signal acquisition means 101, the noise estimation means 102, and the remaining means may further be run as separate threads.

Next, the operation of the present embodiment is described. FIG. 2 is a flowchart showing an example of the overall operation of the speech recognition apparatus of the present embodiment. The operation shown in FIG. 2 starts, for example, on receipt of an event notifying that a signal (sound data) has been input. Processing of the input signal is performed frame by frame.

As shown in FIG. 2, first, in the speech recognition unit 100, the input signal acquisition means 101 cuts out and acquires the collected time-series input sound data in frames of a unit time (step S101). For example, when the input sound data is 16-bit linear PCM data with a sampling frequency of 8000 Hz, waveform data of 8000 points per second is input. The input signal acquisition means 101 may, for example, successively cut such waveform data along the time series with a frame width of 200 points (25 milliseconds) and a frame shift of 80 points (10 milliseconds), and hold the data in a predetermined storage area. At that time, it may apply a short-time discrete Fourier transform (FFT) to the cut-out waveform data to convert it into a power spectrum.
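The framing in step S101 can be sketched as follows, using the frame width and frame shift given above. The function names are hypothetical, no window function is applied because none is mentioned, and the power spectrum is computed with a naive discrete Fourier transform purely to keep the sketch dependency-free; a real implementation would use an FFT routine.

```python
import cmath

FRAME_WIDTH = 200  # 25 ms at a sampling frequency of 8000 Hz
FRAME_SHIFT = 80   # 10 ms at a sampling frequency of 8000 Hz


def split_frames(samples, width=FRAME_WIDTH, shift=FRAME_SHIFT):
    """Cut overlapping frames; a trailing partial frame is dropped."""
    last_start = len(samples) - width
    return [samples[s:s + width] for s in range(0, last_start + 1, shift)]


def power_spectrum(frame):
    """Naive O(n^2) DFT power spectrum for bins 0 .. n/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# One second of (silent) 8000 Hz input yields 98 full frames.
one_second = [0.0] * 8000
frames = split_frames(one_second)
```

With these settings consecutive frames overlap by 120 points, so each sample contributes to up to three frames.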

Next, the noise estimation means 102 estimates the noise contained in the input signal acquired by the input signal acquisition means 101, and stores the estimated noise data in the estimated noise holding means 201 (step S102). The noise may be estimated by, for example, a method that uses the average of the input signal in a non-speech interval before the target speech is uttered, or the method described in Non-Patent Document 2 (hereinafter, the weighted noise estimation method). The weighted noise estimation method consists mainly of three processes, SNR estimation, weight coefficient calculation, and averaging: it weights the input signal with a weight coefficient calculated from the estimated SNR and estimates the noise as the average of the weighted input signal (see Non-Patent Document 2).
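The simpler of the two estimation methods, averaging the input over a non-speech interval before the utterance, can be sketched as below. It assumes the caller already knows how many leading frames contain no speech; the function name and the example values are hypothetical, and the weighted noise estimation method of Non-Patent Document 2 is not reproduced here.

```python
def estimate_noise_from_leading_frames(power_frames, num_noise_frames):
    """Average the first num_noise_frames power spectra, bin by bin."""
    if not 0 < num_noise_frames <= len(power_frames):
        raise ValueError(
            "num_noise_frames must be between 1 and the number of frames"
        )
    leading = power_frames[:num_noise_frames]
    return [sum(bins) / num_noise_frames for bins in zip(*leading)]


# Hypothetical power spectra: two non-speech frames followed by speech.
power_frames = [[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]
noise_estimate = estimate_noise_from_leading_frames(power_frames, 2)
```

The speech frame is excluded from the average, so the estimate reflects only the interval assumed to be noise.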

If it is time to adapt the acoustic model (Yes in step S103), the process proceeds to step S106, and the acoustic model adaptation unit 200 performs noise adaptation of the acoustic model. The description below assumes that the acoustic model used for the input signal has already been noise-adapted and that the present time is not an adaptation timing.

When it is not an adaptation timing of the acoustic model (No in step S103), or when noise adaptation of the acoustic model used for the input signal has been completed, the speech recognition unit 100 performs the noise suppression and speech recognition processing (step S104). Although FIG. 2 shows an example in which every input signal is subject to speech recognition, the subsequent noise suppression and speech recognition processing may instead be performed only when speech recognition has been requested. Further, when the noise statistic is calculated from all noise data within a predetermined time interval that includes the input signal, for example, the noise suppression and speech recognition processing (step S104) may be held until the noise statistic has been calculated and noise adaptation of the acoustic model based on it has been completed. In that case, noise estimation is performed first for as many frames as needed, and the noise suppression and speech recognition processing is performed after the acoustic model has been noise-adapted.

FIG. 3 is a flowchart showing an example of the noise suppression and speech recognition processing (step S104 in FIG. 2). As shown in FIG. 3, the short-time fluctuation noise component calculation means 103 first acquires the statistic of the noise contained in the input signal and, based on the acquired noise statistic (more specifically, the mean of the noise component) and the noise estimated by the noise estimation means 103, calculates the short-time fluctuation noise component of the input signal (step S111 in FIG. 3). Here, the statistic of the noise contained in the input signal is the noise statistic to which the acoustic model used for recognizing that input signal has been adapted, and the value obtained from the noise statistic calculation means 202 may be used. Specifically, what is acquired is the statistic, calculated by the noise statistic calculation means 202, of the estimated noise of the input signals in the time interval that includes the input signal or in a time interval preceding it.

In step S111, for example, let N be the power spectrum of the noise estimated by the noise estimation means 103, and let N_L be the mean power spectrum of the noise component contained in the input signal as calculated by the noise statistic calculation means 202. The power spectrum N_S of the short-time fluctuation noise component of the input signal is then calculated by Equation (2) below. The short-time fluctuation noise component calculation means 103 performs the calculation of Equation (2) for each predetermined unit defined in the frequency domain, such as each frequency band or each subband.

N_S = N − N_L   ... Equation (2)

Next, the noise suppression means 104 suppresses, in the time series of the input signals to be processed, the short-time fluctuation noise component calculated by the short-time fluctuation noise component calculation means 103 (step S112). The suppression method may be, for example, the spectral subtraction method (SS method) or the Wiener filter method. Equation (3) below is an example of suppression using the SS method.

X = max[Y − N_S, α]   ... Equation (3)

Note that Y − N_S in Equation (3) (the input signal with the short-time fluctuation component of the noise suppressed) does not have all noise removed; in theory, a component corresponding to the mean of the noise component, as given by the noise statistic, remains.
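
Equations (2) and (3) can be traced per frequency band with a short sketch. The function names and the numeric values are illustrative assumptions; Y is taken to be the power spectrum of the noisy input frame and α the flooring coefficient, as in Equation (3).

```python
def short_time_noise(n, n_l):
    """Equation (2): N_S = N - N_L, evaluated per frequency band."""
    return [a - b for a, b in zip(n, n_l)]


def spectral_subtraction(y, n_s, alpha):
    """Equation (3): X = max[Y - N_S, alpha], evaluated per frequency band."""
    return [max(yi - si, alpha) for yi, si in zip(y, n_s)]


n = [5.0, 3.0, 8.0]      # estimated noise N for the current frame (made-up values)
n_l = [4.0, 4.0, 4.0]    # mean noise N_L held for the current interval (made-up values)
y = [9.0, 3.5, 3.0]      # noisy input power spectrum Y (made-up values)

n_s = short_time_noise(n, n_l)                  # [1.0, -1.0, 4.0]
x = spectral_subtraction(y, n_s, alpha=0.5)     # [8.0, 4.5, 0.5]
```

Only the third band is floored to α here. Because N_S is the deviation of N from its mean, it can be negative, as in the second band, and it is smaller in magnitude than N in these examples, which is why flooring is triggered less often than when N itself is subtracted.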

Equations (4) and (5) below express the input signal after noise suppression by the Wiener filter method. Noise suppression may also be performed using Equations (4) and (5).

X = D_t / (D_t + N_S) * Y   ... Equation (4)
D_t = γ * D_(t−1) + (1 − γ) * max[Y − N_S, α]   ... Equation (5)

In Equations (4) and (5), t denotes the frame number, α denotes the flooring coefficient, and γ > 0.9.

The noise suppression means 104 performs the calculation of Equation (3), or of Equations (4) and (5), for each predetermined unit defined in the frequency domain, such as each frequency band or each subband.
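
The recursion in Equations (4) and (5) can be sketched for a single frequency band as follows. The text does not state the initial value D_0, so this sketch assumes it equals α; γ = 0.95 is likewise only an example satisfying γ > 0.9. The function name is hypothetical, and the sketch assumes D_t + N_S stays positive.

```python
def wiener_suppress(y_frames, n_s_frames, alpha, gamma=0.95, d_init=None):
    """Apply Equations (4) and (5) to one band over successive frames.

    y_frames:   noisy input power Y for each frame t.
    n_s_frames: short-time fluctuation noise power N_S for each frame t.
    d_init:     D_0; assumed equal to alpha when not given (not specified in the text).
    """
    d_prev = alpha if d_init is None else d_init
    out = []
    for y, n_s in zip(y_frames, n_s_frames):
        d_t = gamma * d_prev + (1.0 - gamma) * max(y - n_s, alpha)  # Equation (5)
        out.append(d_t / (d_t + n_s) * y)                           # Equation (4)
        d_prev = d_t
    return out


unchanged = wiener_suppress([2.0, 2.0], [0.0, 0.0], alpha=0.1)   # N_S = 0: gain is 1
attenuated = wiener_suppress([4.0], [1.0], alpha=0.1)            # N_S > 0: gain below 1
```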

Next, the search means 105 extracts the features used for speech recognition from the signal noise-suppressed by the noise suppression means 104 (the noise-suppressed signal X) (step S113). It then compares the distances between the extracted features and the probability density function of each phoneme in the acoustic model, and searches for the word string corresponding to the input signal (step S114). An HMM is one example of the acoustic model.

In the noise suppression and speech recognition processing of this example, steps S111 to S114 are performed on every input signal whose noise has been estimated and that is subject to speech recognition. When no unprocessed input signal remains, control returns to the caller (No in step S115; the process proceeds to step S105 in FIG. 2).

In step S105, it is determined whether signal input has ended. If it has (Yes in step S105), the series of processes ends. If not, the same processing is repeated starting from input signal acquisition (the process returns to step S101).

While speech recognition is being performed on the incoming signal in this way, the acoustic model adaptation unit 200 may run the acoustic model noise adaptation processing (step S106) whenever an adaptation timing of the acoustic model arrives (Yes in step S103). The adaptation may, for example, first run a few seconds after sound collection by the microphone starts and then run at fixed intervals. Regardless of whether an adaptation timing has arrived, the acoustic model noise adaptation processing may also be performed at least once as an initial operation before speech recognition is performed on the input signal.

FIG. 4 is a flowchart showing an example of the acoustic model noise adaptation processing (step S106 in FIG. 2). As shown in FIG. 4, the noise statistic calculation means 202 first calculates the noise statistic from the noise data held in the estimated noise holding means 201 (step S121). The noise statistic calculation means 202 may, for example, calculate the noise statistic from the latest K pieces of noise data held in the estimated noise holding means 201.

As the noise statistic, the noise statistic calculation means 202 may calculate, for example, the mean of the K power spectra. It may additionally calculate the variance from that mean, or the mean and variance of N-th order derivatives. Alternatively, the noise component may be converted into features such as cepstra, and the mean and variance, or the mean and variance of N-th order derivatives, may be calculated in the feature domain. Which statistics are calculated may be decided according to the acoustic model to be noise-adapted, provided that at least the mean of the noise component is included.
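
A minimal sketch of step S121, assuming the estimated noise holding means is modeled as a list of per-frame power spectra. The function name is hypothetical, and the population variance used here is one possible choice that the text does not prescribe.

```python
def noise_statistics(noise_history, k):
    """Return the per-band mean and variance of the latest k noise power spectra."""
    if k <= 0 or not noise_history:
        raise ValueError("k must be positive and noise_history must not be empty")
    latest = noise_history[-k:]
    count = len(latest)
    bands = list(zip(*latest))                      # regroup the values band by band
    mean = [sum(b) / count for b in bands]
    var = [sum((v - m) ** 2 for v in b) / count for b, m in zip(bands, mean)]
    return mean, var


# Three stored spectra with two bands each; only the latest K = 2 are used.
mean, var = noise_statistics([[9.0, 9.0], [1.0, 2.0], [3.0, 6.0]], 2)
```

The returned mean plays the role of N_L in Equation (2).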

When the acoustic model is noise-adapted periodically, for example, the noise statistic calculation means 202 may calculate the noise statistic in association with a time interval of the input signal. For example, it may wait until noise data has been stored for all input signals in the time interval that starts at the current adaptation timing, and then use that noise data to calculate the noise statistic that is valid for that time interval. Alternatively, it may use the noise data of the immediately preceding time interval that is already held in the estimated noise holding means 201 at the adaptation timing (that is, the noise data estimated from the input signals in the preceding time interval) to calculate the noise statistic that is valid for the time interval starting at the current time. When the noise statistic has been calculated in association with a time interval of the input signal, the noise statistic calculation means 202 may output information on the corresponding time interval together with the noise statistic.

Next, the acoustic model adaptation means 206 extracts noise features from the noise statistic calculated by the noise statistic calculation means 202 and noise-adapts the acoustic model (steps S122 and S123). If the noise statistic has already been calculated in the feature domain, step S122 is omitted.

The acoustic model may be adapted using, for example, the VTS method or the PMC method. When the VTS method is used, for example, the acoustic model adaptation means 206 may noise-adapt the acoustic model by performing linear operations on the noise statistic calculated by the noise statistic calculation means 202 and the parameters of the acoustic model. The VTS method uses a linear approximation when adapting the acoustic model to a noise environment. The relationship between the acoustic model parameters and the noise statistic of the noise environment is nonlinear; by approximating it linearly, the method approximately generates, using only linear operations, the acoustic model that would be obtained by training on speech in that noise environment (see Non-Patent Document 3).
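
To illustrate the kind of linearization VTS relies on, the sketch below adapts one Gaussian component in the log-spectral domain. Several assumptions are made that the text does not specify: additive noise only (no channel term), diagonal covariances, the log-spectral rather than cepstral domain, and the mismatch function y = x + log(1 + exp(n − x)) expanded to first order around the means. The function name is hypothetical, and this is not claimed to be the procedure of Non-Patent Document 3.

```python
import math


def vts_adapt(mu_x, var_x, mu_n, var_n):
    """First-order VTS update of one diagonal Gaussian (log-spectral domain).

    mu_x, var_x: clean-speech mean and variance per band.
    mu_n, var_n: noise mean and variance per band (from the noise statistic).
    Returns the adapted (noisy-speech) mean and variance per band.
    """
    mu_y, var_y = [], []
    for mx, vx, mn, vn in zip(mu_x, var_x, mu_n, var_n):
        g = 1.0 / (1.0 + math.exp(mn - mx))              # dy/dx at the expansion point
        mu_y.append(mx + math.log1p(math.exp(mn - mx)))  # mismatch function at the means
        var_y.append(g * g * vx + (1.0 - g) ** 2 * vn)   # linearized variance
    return mu_y, var_y


# Equal speech and noise levels: the mean rises by log(2) and the Jacobian is 0.5.
mu_y, var_y = vts_adapt([0.0], [1.0], [0.0], [1.0])
```

Once the Jacobian g has been computed, the variance update is linear in the model and noise variances, which is what keeps the adaptation cheap compared with retraining on noisy speech.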

The acoustic model adaptation means 206 then stores the noise-adapted acoustic model in the acoustic model storage means 106. For example, it may store the parameter values of the acoustic model as the acoustic model information. It may also store, in association with the acoustic model information, the mean of the noise component in the noise statistic to which the acoustic model was adapted and, if necessary, information on the corresponding time interval of the input signal.

The acoustic model need not be associated with a time interval of the input signal. It is sufficient that the pieces of information are associated so that at least the mean of the noise component in the noise statistic to which the acoustic model was adapted matches the mean of the noise component remaining in the noise-suppressed signal that is input to that acoustic model. When the noise statistic has been calculated in association with a time interval of the input signal, information on the time interval of the noise data from which it was calculated may also be retained.

FIG. 5 is an explanatory diagram showing an example of how the power spectrum N of the noise (estimated noise), the mean power spectrum N_L of the noise component calculated from that noise, and the power spectrum N_S of the short-time fluctuation component of the noise (short-time fluctuation noise component) change over time. FIG. 5(a) shows an example of the change over time of the noise power spectrum N in a given frequency band or subband. FIG. 5(b) shows an example of the change over time of the mean power spectrum N_L of the noise component calculated from the noise in FIG. 5(a). FIG. 5(c) shows an example of the change over time of the power spectrum N_S of the short-time fluctuation component of the noise in FIG. 5(a). The downward white arrows on the time axis at the top of FIG. 5(a) are examples of the timings at which the acoustic model is noise-adapted.

The example in FIG. 5 is one in which the acoustic model is noise-adapted periodically and the statistic of the noise contained in each input signal of a time interval is calculated, at the end of that interval, from the noise data of the input signals in the same interval. In this example, Ft0, Ft1, Ft2, Ft3, Ft4, ... in the time series of the input signal are the noise adaptation timings of the acoustic model. The statistic of the noise contained in the input signals of the time interval Ft0-Ft1, for example, is calculated from the noise data of the input signals in that same interval Ft0-Ft1. Noise suppression for any input signal frame in Ft0-Ft1 therefore has to wait until noise estimation has finished for all input signals in Ft0-Ft1. Speech recognition additionally has to wait until adaptation of the model used for speech recognition, based on that noise statistic, is complete.

As another example, when immediate responsiveness of speech recognition is important, the statistic of the noise contained in the input signals of each time interval can be calculated at the start of that interval from the noise data of the input signals in the immediately preceding interval. Using the time intervals of FIG. 5(a) as an example, the statistic of the noise contained in the input signals of the interval Ft1-Ft2 is calculated from the noise data of the preceding interval Ft0-Ft1. In this case, for example, only noise estimation is performed on the input signals for a fixed time from the start (for example, Ft0-Ft1), and the acoustic model is noise-adapted based on that noise data. After that, noise suppression and speech recognition can be performed immediately on each frame of the input signal using the acoustic model available at that time.

FIG. 6 is an explanatory diagram showing another example of how, for the power spectrum N of the noise (estimated noise) of the input signal, the mean power spectrum N_L of the noise component calculated from that noise and the power spectrum N_S of the short-time fluctuation component of the noise (short-time fluctuation noise component) change over time. As in FIG. 5, FIG. 6(a) shows the noise power spectrum N, FIG. 6(b) shows the mean power spectrum N_L of the noise component calculated from the noise, and FIG. 6(c) shows the power spectrum N_S of the short-time fluctuation component of the noise. In the example of FIG. 6, Ft1, Ft2, Ft3, Ft4, ... in the time series of the input signal are the noise adaptation timings of the acoustic model. In this example, the statistic of the noise contained in the input signals of the time interval Ft1-Ft2, for example, is calculated from the noise data of the input signals of the preceding interval Ft0-Ft1, which is held in the estimated noise holding means 201 at the adaptation timing. Noise suppression and speech recognition for any input signal frame in Ft1-Ft2 can therefore be performed without waiting for noise estimation to finish for all input signals in that interval Ft1-Ft2.

As shown in FIGS. 5 and 6, the statistic of the noise contained in the input signal does not change within a time interval. In other words, once the noise statistic has been calculated at a noise adaptation timing of the acoustic model, it is held constant until the next noise adaptation timing.
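
The two ways of assigning the held statistic to an interval (from the interval's own noise data as in FIG. 5, or from the preceding interval's noise data as in FIG. 6) can be compared with a small sketch for a single band. The function name, the fixed interval length in frames, and the use of `None` for the first interval (where only noise estimation is performed) are assumptions for illustration.

```python
def held_noise_mean(noise, interval_len, use_previous):
    """Return N_L for every frame, held constant within each interval.

    use_previous=False: each interval uses the mean of its own frames (FIG. 5 style).
    use_previous=True:  each interval uses the mean of the preceding interval
                        (FIG. 6 style); the first interval has no statistic yet.
    """
    n_l = []
    previous_mean = None
    for start in range(0, len(noise), interval_len):
        chunk = noise[start:start + interval_len]
        own_mean = sum(chunk) / len(chunk)
        value = previous_mean if use_previous else own_mean
        n_l.extend([value] * len(chunk))
        previous_mean = own_mean
    return n_l


noise = [1.0, 3.0, 5.0, 7.0]                      # estimated noise N per frame (made up)
same_interval = held_noise_mean(noise, 2, False)  # [2.0, 2.0, 6.0, 6.0]
prev_interval = held_noise_mean(noise, 2, True)   # [None, None, 2.0, 2.0]
```

The first mode matches the current noise better but delays suppression until the interval ends; the second can process each frame immediately at the cost of using a statistic that is one interval old.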

FIGS. 7 and 8 are explanatory diagrams showing examples of noise estimated from an input signal together with examples of the operation timings of the acoustic model noise adaptation processing and of the noise suppression processing for that input signal.

FIG. 7(a) shows an example of noise estimated from an input signal. The downward white arrows on the time axis at the top of FIG. 7(a) are examples of the timings at which the acoustic model is noise-adapted; Ft0, Ft1, Ft2, Ft3, Ft4, ... in the time series of the input signal are shown as those timings. The rightward arrow at the bottom of FIG. 7(a) and its time t1 indicate that the noise data used to obtain the statistic of the noise contained in the input signals of the time interval starting at the arrow's start position is the noise data of the input signals of that same interval, and that it covers a duration of t1.

FIG. 7(b) shows an example of the noise statistic calculated for the noise in FIG. 7(a) and of the operation timing of the noise adaptation of the acoustic model. In FIG. 7(b), the downward white arrows on the time axis at the top indicate the timings at which the acoustic model adaptation unit 200 (more specifically, the noise statistic calculation means 202) starts calculating the noise statistic. In FIG. 7(b), the noise statistic calculation means 202 operates when time t1 has elapsed from the noise adaptation timing shown in FIG. 7(a). In this example, it takes time t2 from the start of the noise statistic calculation until noise adaptation of the acoustic model is complete. In FIG. 7(b), a calculated noise statistic is not used as the mean of the noise component output to the short-time fluctuation noise component calculation means 103 until it has been applied to the acoustic model, that is, until noise adaptation of the acoustic model is complete; this period is drawn as a broken line indicating that adaptation is in progress. The value drawn as a solid line is what is output to the short-time fluctuation noise component calculation means 103.

FIG. 7(c) shows an example of the operation timing of the speech recognition processing. The downward white arrows on the time axis at the top of FIG. 7(c) indicate the timings at which the speech recognition unit 100 (more specifically, the short-time fluctuation noise component calculation means 103) starts the noise suppression and speech recognition processing. In this example, once noise adaptation of the acoustic model is complete, the short-time fluctuation noise component calculation means 103 starts noise suppression on the input signals from which the noise data underlying the noise statistic used for that adaptation was obtained. That is, after noise has been estimated for an input signal, the processing waits until the acoustic model used for that input signal has been noise-adapted, acquires the noise statistic used for that acoustic model, and performs noise suppression based on it. Speech recognition is then performed on the resulting noise-suppressed signal using the acoustic model currently held.

FIG. 8 shows an example of operation timings different from those in FIG. 7. FIG. 8(a) shows an example of noise estimated from an input signal. The downward white arrows on the time axis at the top of FIG. 8(a) are examples of the timings at which the acoustic model is noise-adapted; Ft1, Ft2, Ft3, Ft4, ... in the time series of the input signal are shown as those timings. The leftward arrow at the bottom of FIG. 8(a) and its time t1 indicate that the noise data used to obtain the statistic of the noise contained in the input signals of the time interval starting at the arrow's start position is the noise data of the input signals of the immediately preceding interval, and that it covers a duration of t1.

FIG. 8(b) shows an example of the noise statistic calculated for the noise in FIG. 8(a) and of the operation timing of the noise adaptation of the acoustic model. In FIG. 8(b), the noise statistic calculation means 202 operates at the same timing as the noise adaptation timing shown in FIG. 8(a). That is, the noise statistic is calculated immediately from the noise data currently held. In other respects, FIG. 8(b) is the same as FIG. 7(b).

FIG. 8(c) shows an example of the operation timing of the speech recognition processing. In the example of FIG. 8(c), the operation timing of the noise suppression and speech recognition processing by the short-time fluctuation noise component calculation means 103 and the search means 105 is not specifically constrained. That is, if a noise-adapted acoustic model exists for the input signal currently being processed, the noise statistic used for that acoustic model is acquired and noise suppression is performed based on it. In this example, the acoustic model used for the input signals of a given time interval has been noise-adapted based on the noise statistic obtained from the noise data of an earlier time interval.

As described above, according to this embodiment, the adverse effect of the flooring coefficient α can be mitigated by suppressing the short-time fluctuation noise component, rather than the estimated noise, in the time series of the input signal. The short-time fluctuation noise component is smaller in value than the estimated noise. Comparing Equation (3) with Equation (1), Equation (3) is therefore less likely to require flooring. Consequently, compared with a noise suppression method that suppresses the estimated noise as in Equation (1), the noise suppression method of this embodiment reduces the adverse effect of the flooring coefficient.

The mismatch between the input signal and the acoustic model caused by the remaining noise component can be suppressed by noise-adapting the acoustic model based on the noise statistic. In addition, adapting the acoustic model based on the noise statistic, rather than on the estimated noise itself, lowers the frequency of the computationally expensive noise adaptation of the acoustic model, so fluctuating noise can be tracked at low computational cost.

Embodiment 2.
Next, a second embodiment of the present invention will be described with reference to the drawings. FIG. 9 is a block diagram showing a configuration example of the speech recognition apparatus of this embodiment. As shown in FIG. 9, the speech recognition apparatus of this embodiment differs from the first embodiment shown in FIG. 1 in that it further includes trigger generation means 301.

The trigger generation means 301 controls the timing of noise adaptation of the acoustic model. The trigger generation means 301 may be realized by, for example, means based on voice detection means, or an input device with which the user can deliberately generate a trigger. In this embodiment, the noise statistic calculation means 202 and the acoustic model adaptation means 203 of the acoustic model adaptation unit 200 obtain their operation start timing from the trigger generated by the trigger generation means 301.

FIG. 10 is a flowchart showing an operation example of the speech recognition apparatus of this embodiment. The operation example shown in FIG. 10 is basically the same as that of the first embodiment shown in FIG. 2, except that the timing of the acoustic model noise adaptation process (step S206) is controlled by the trigger generation means 301. That is, in this example, step S203 replaces step S103 of FIG. 2: it determines whether the trigger generation means 301 has generated a trigger, and whether the acoustic model adaptation unit 200 operates is decided by the presence or absence of that trigger. The other operations are the same as in the first embodiment, and their description is omitted.

The trigger generation means 301 may, for example, use voice detection means (not shown) to perform voice detection on the input signal obtained by the input signal acquisition means 101, and generate a trigger mainly when a silent section is detected. In this case, the noise adaptation process of the acoustic model can run during silent sections of the input signal. In other words, CPU resources can be allocated preferentially to the speech recognition process during voiced sections.
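
A silence-based trigger of this kind can be sketched as follows. This is a minimal illustration, not the voice detection means of the embodiment: it assumes a simple frame-energy threshold as the detector, and the class name `SilenceTrigger`, its parameters, and the sample frames are all hypothetical.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in samples) / len(samples)


class SilenceTrigger:
    """Fires once when `min_silent_frames` consecutive low-energy frames occur.

    Assumption: silence is detected by an energy threshold; any other
    voice detection method could be substituted.
    """

    def __init__(self, energy_threshold=0.01, min_silent_frames=3):
        self.energy_threshold = energy_threshold
        self.min_silent_frames = min_silent_frames
        self._silent_run = 0

    def update(self, samples):
        """Return True exactly when the silent run reaches the required length."""
        if frame_energy(samples) < self.energy_threshold:
            self._silent_run += 1
        else:
            self._silent_run = 0
        return self._silent_run == self.min_silent_frames


trigger = SilenceTrigger()
speech = [0.5, -0.4, 0.6, -0.5]      # made-up voiced frame
silence = [0.01, -0.02, 0.01, 0.0]   # made-up silent frame
frames = [speech, silence, silence, silence, silence, speech]
fired = [trigger.update(f) for f in frames]
```

Firing only once per silent run, rather than on every silent frame, keeps the expensive adaptation process from being restarted repeatedly within the same pause.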

Alternatively, the trigger generation means 301 may generate a trigger based on a user instruction entered via an input device (not shown). In this case, the acoustic model adaptation unit 200 can be operated at the moment the user wants to adapt the acoustic model.

FIGS. 11 and 12 are explanatory diagrams showing examples of noise estimated from an input signal, together with examples of the operation timing of the acoustic model noise adaptation process and of the noise suppression process for that input signal in this embodiment. The operation is basically the same as in the first embodiment, except that the acoustic model adaptation unit 200 is instructed to operate at arbitrary times rather than periodically.

The example shown in FIG. 11 is basically the same as the example shown in FIG. 8, except that the operation instructions to the acoustic model adaptation unit 200 are not periodic. That is, in this example, the noise statistic calculation means 202 calculates the noise statistic at the same timing at which the acoustic model adaptation unit 200 is instructed to operate (see FIG. 11(b)). The noise data used here is the noise data for the time span t1 held at that point. If a noise-adapted acoustic model exists, the short-time fluctuation noise component calculation means 103 acquires the noise statistic used for that acoustic model and performs noise suppression based on it. The search means 105 then takes the resulting suppressed signal as input and performs the speech recognition process using the currently held acoustic model (see FIG. 11(c)).

FIG. 12 shows an example in which the short-time fluctuation noise component calculation means 103, upon completion of the noise adaptation process of the acoustic model, performs the noise suppression process starting from the input signal at the time the trigger was generated (see FIG. 12(c)). The noise statistic may be calculated at the same timing as in FIG. 11. For example, when a trigger is generated at Ft1 in the time series of the input signal, the means may wait for the noise adaptation process started by that trigger to complete, acquire the noise statistic used for the acoustic model generated by that process, and perform noise suppression based on it starting from the input signal Ft1 at the time of trigger generation. Various other timing controls are also possible. For example, as in the first embodiment, the noise statistic may be calculated using data from the trigger generation onward, or the acoustic model may be applied starting from the input signal that was used to calculate the noise statistic.

As described above, according to this embodiment, the noise adaptation process of the acoustic model can be performed during silent sections of the input signal or at the user's request, which avoids, for example, the increase in computation caused by running acoustic model adaptation during an utterance. In addition, if, for example, the noise adaptation process is performed only when the user indicates that the noise environment has changed, speech recognition that tracks changing noise can be achieved more effectively with less computation. In other words, CPU efficiency improves, which leads to faster speech recognition.

Embodiment 3.
Next, a third embodiment of the present invention will be described with reference to the drawings. FIG. 13 is a block diagram showing a configuration example of the speech recognition apparatus of this embodiment. As shown in FIG. 13, the speech recognition apparatus of this embodiment differs from the first embodiment shown in FIG. 1 in that it further includes feature amount conversion means 401 and feature amount inverse conversion means 402, and in that the speech recognition unit 100 further includes feature amount conversion means 107.

The feature amount conversion means 401 and 107 each convert a noise component into a feature amount. Although FIG. 13 shows the feature amount conversion means 401 and 107 as separate means, a single feature amount conversion means may be shared. As explained in the first embodiment, the feature amount conversion means 107 may be omitted when the search means 105 has a feature amount conversion function.

The feature amount inverse conversion means 402 performs the inverse conversion from a noise feature amount to a noise component. Specifically, it converts the feature-domain noise average calculated by the noise statistic calculation means 202 into a noise component.

The feature amount conversion means 401 and 107 and the feature amount inverse conversion means 402 are realized by, for example, a CPU that operates according to a program.

Next, the operation of this embodiment will be described. FIG. 14 is a flowchart showing an example of the processing flow of the acoustic model noise adaptation process of this embodiment. As shown in FIG. 14, in this embodiment the feature amount conversion step (see step S122 in FIG. 4) is unnecessary in the acoustic model noise adaptation process. In this embodiment, the noise statistic calculation means 202 obtains the noise statistic from the feature amounts of the estimated noise held in the estimated noise holding means 201 (step S221).

Although not shown, in this embodiment, when the estimated noise data is stored in the estimated noise holding means 201 in the noise estimation and data holding step (see step S102 in FIG. 4), the data passes through the feature amount conversion means 401 and is stored as noise feature amounts. Likewise, when the average of the noise component is acquired in the short-time fluctuation noise component calculation step of the noise suppression and speech recognition process (see step S111 in FIG. 3), it is acquired via the feature amount inverse conversion means 402 so that the average is expressed as a noise component. For example, for input to the short-time fluctuation noise component calculation means 103, the feature amount inverse conversion means 402 converts the feature-domain average of the noise calculated by the noise statistic calculation means 202 into a noise component format such as a power spectrum. In other respects, this embodiment is the same as the first embodiment.

In this way, in this embodiment, the estimated noise data is stored in the estimated noise holding means 201 via the feature amount conversion means, so that the estimated noise holding means 201 holds the feature amounts of the estimated noise. The noise statistic calculation means 202 then calculates the noise statistic from the feature amounts of the estimated noise.

In general, for a single frame, the feature amount contains less data than the input signal. Therefore, according to this embodiment, the storage area of the estimated noise holding means 201 can be reduced, and the amount of computation in the noise statistic calculation means 202 can also be reduced.
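
The storage saving and the round trip through the feature domain can be sketched as follows. This is an illustration under stated assumptions only: the embodiment does not fix a particular feature here, so a hypothetical feature (the log of the mean power in each band of bins) is used, and the names `to_features` and `from_features` and all values are made up.

```python
import math


def to_features(power_spectrum, band_size=4):
    """Hypothetical feature transform: log of the mean power in each band."""
    return [
        math.log(sum(power_spectrum[i:i + band_size]) / band_size)
        for i in range(0, len(power_spectrum), band_size)
    ]


def from_features(features, band_size=4):
    """Approximate inverse: expand each band's mean power back to its bins."""
    return [math.exp(f) for f in features for _ in range(band_size)]


# Estimated noise power spectra for three frames, 8 bins each (made-up values).
noise_frames = [
    [1.0, 1.0, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0],
    [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0],
    [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
]

# Held data: 2 values per frame instead of 8.
held = [to_features(frame) for frame in noise_frames]

# Noise statistic: mean over frames, computed in the feature domain.
feature_mean = [sum(col) / len(col) for col in zip(*held)]

# Inverse conversion so the mean can be used as a power-domain noise component.
noise_mean_power = from_features(feature_mean)
```

Because the mean is taken in the log domain in this sketch, the recovered value for the first band is the geometric mean of the band powers (2.0) rather than their arithmetic mean; that is a property of the hypothetical feature chosen here, not a statement about the embodiment.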

Although this embodiment shows an example in which the feature amount conversion means 401 and 107 and the feature amount inverse conversion means 402 are added to the first embodiment, the same additions can also be made to the second embodiment.

An overview of the present invention is given below. FIG. 15 is a block diagram showing an overview of the present invention. The speech recognition apparatus shown in FIG. 15 includes noise statistic calculation means 11, short-time fluctuation noise component calculation means 12, acoustic model adaptation means 13, noise suppression means 14, and speech recognition means 15.

The noise statistic calculation means 11 calculates a noise statistic from noise data estimated for a plurality of frames of the input signal. In the above embodiments, the noise statistic calculation means 11 is shown as the noise statistic calculation means 202.

The short-time fluctuation noise component calculation means 12 calculates the short-time fluctuation component of the noise included in each frame of the input signal, based on the noise statistic calculated by the noise statistic calculation means 11. In the above embodiments, the short-time fluctuation noise component calculation means 12 is shown as the short-time fluctuation noise component calculation means 103.

The acoustic model adaptation means 13 adapts the acoustic model to noise using the noise statistic calculated by the noise statistic calculation means 11. In the above embodiments, the acoustic model adaptation means 13 is shown as the acoustic model adaptation means 203.

The noise suppression means 14 suppresses, for each frame of the input signal, the short-time fluctuation component of the noise included in that frame as calculated by the short-time fluctuation noise component calculation means 12. In the above embodiments, the noise suppression means 14 is shown as the noise suppression means 104.

The speech recognition means 15 performs speech recognition on the input signal whose noise has been suppressed by the noise suppression means 14, using the acoustic model noise-adapted by the acoustic model adaptation means 13. In the above embodiments, the speech recognition means 15 is shown as the search means 105.

With this configuration, speech recognition can be performed with high accuracy even in a changing noise environment.

The short-time fluctuation noise component calculation means may calculate the short-time fluctuation component of the noise included in each frame by subtracting, from the noise component included in each frame of the input signal, the average of the noise component indicated by the noise statistic calculated by the noise statistic calculation means.

The speech recognition apparatus according to the present invention may also include estimated noise holding means (for example, the estimated noise holding means 201) that sequentially holds the noise data estimated for each frame of the input signal, and the noise statistic calculation means may calculate the noise statistic using the noise data held in the estimated noise holding means.

The speech recognition apparatus according to the present invention may also include acoustic model storage means (for example, the acoustic model storage means 106) that stores information on the acoustic model in association with information on the noise statistic to which the acoustic model has been adapted. In that case, the short-time fluctuation noise component calculation means may calculate the short-time fluctuation component of the noise included in the frame based on the noise statistic stored in the acoustic model storage means, and the speech recognition means may perform speech recognition on each frame of the input signal using the acoustic model stored in the acoustic model storage means.
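
The point of storing the two together is that noise suppression and recognition always use the same noise statistic. The following is a minimal sketch of such a store, not the acoustic model storage means itself; the class names `AdaptedModel` and `AcousticModelStore` and the dictionary used as a model placeholder are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AdaptedModel:
    """An adapted acoustic model paired with the statistic it was adapted to."""
    model_params: dict        # placeholder for the adapted acoustic model
    noise_mean: List[float]   # noise statistic used for the adaptation


class AcousticModelStore:
    """Holds the current model and its noise statistic as one unit."""

    def __init__(self):
        self._current: Optional[AdaptedModel] = None

    def put(self, model_params, noise_mean):
        # Both fields are replaced in a single assignment, so a reader
        # never sees a new model paired with an old statistic.
        self._current = AdaptedModel(dict(model_params), list(noise_mean))

    def get(self):
        """Return the current pair, or None if no adapted model exists yet."""
        return self._current


store = AcousticModelStore()
before = store.get()
store.put({"id": 1}, [3.5, 2.0])
entry = store.get()
```

A caller would read `entry.noise_mean` to compute the short-time fluctuation component and `entry.model_params` for recognition, taking both from the same `get()` result.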

The noise statistic calculation means may calculate the noise statistic for each predetermined time interval of the input signal, and the acoustic model adaptation means may adapt the acoustic model to noise each time the noise statistic is updated. In that case, the short-time fluctuation noise component calculation means may calculate the short-time fluctuation component of the noise included in a frame based on the noise statistic calculated using the noise data of the frames in the time interval that includes that frame, and the speech recognition means may perform speech recognition on each frame of the input signal using the acoustic model that was noise-adapted using the noise statistic calculated from the noise data of the frames in the time interval that includes that frame.
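
The pairing of each frame with the statistic of its own time interval can be sketched as follows. This is a simplified illustration with one noise value per frame; the function names and the sample values are hypothetical, and the mean is assumed as the noise statistic.

```python
def interval_means(noise_per_frame, frames_per_interval):
    """Mean noise for each consecutive block of `frames_per_interval` frames."""
    means = []
    for start in range(0, len(noise_per_frame), frames_per_interval):
        block = noise_per_frame[start:start + frames_per_interval]
        means.append(sum(block) / len(block))
    return means


def fluctuation_components(noise_per_frame, frames_per_interval):
    """Per-frame noise minus the mean of the interval containing the frame."""
    means = interval_means(noise_per_frame, frames_per_interval)
    return [
        n - means[i // frames_per_interval]
        for i, n in enumerate(noise_per_frame)
    ]


# Single-bin noise estimates for six frames; the level rises after frame 3.
noise = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
means = interval_means(noise, 3)        # one statistic per interval
fluct = fluctuation_components(noise, 3)
```

The same index, `i // frames_per_interval`, would also select the acoustic model adapted with that interval's statistic, so the slow change in noise level (2.0 to 8.0 here) is handled by the model while only the small residual is subtracted from the signal.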

The speech recognition apparatus according to the present invention may also include trigger generation means (for example, the trigger generation means 301) that controls the timing at which the acoustic model is noise-adapted.

The trigger generation means may generate a trigger that causes the acoustic model adaptation means to start the acoustic model adaptation process when it determines that the input signal is in a silent section.

The trigger generation means may also generate a trigger that causes the acoustic model adaptation means to start the acoustic model adaptation process in response to an instruction from the user.

The speech recognition apparatus according to the present invention may also include feature amount conversion means (for example, the feature amount conversion means 401 and 107) for obtaining the noise statistic in the feature domain, and feature amount inverse conversion means (for example, the feature amount inverse conversion means 402) for obtaining the short-time fluctuation component of noise from the noise statistic calculated in the feature domain.

The speech recognition method according to the present invention may generate a trigger for starting the acoustic model adaptation process, calculate, in response to the trigger, the noise statistic from the noise data estimated for a plurality of frames of the input signal, and adapt the acoustic model to noise using the calculated noise statistic.

The speech recognition method according to the present invention may also, when sequentially holding the noise data estimated for each frame of the input signal, convert the noise data into feature amounts before holding it, and calculate the feature-domain noise statistic from the held noise feature amount data.

The speech recognition program according to the present invention may cause a computer to execute a process of generating a trigger that controls the timing at which the acoustic model is noise-adapted.

The speech recognition program according to the present invention may also cause a computer to execute a feature amount conversion process for obtaining the noise statistic in the feature domain, and a feature amount inverse conversion process for obtaining the short-time fluctuation component of noise from the noise statistic calculated in the feature domain.

The present invention is not limited to speech recognition and is applicable to any use in which noise in input speech is suppressed to obtain desired data.

Description of Symbols
11 Noise statistic calculation means
12 Short-time fluctuation noise component calculation means
13 Acoustic model adaptation means
14 Noise suppression means
15 Speech recognition means
101 Input signal acquisition means
102 Noise estimation means
103 Short-time fluctuation noise component calculation means
104 Noise suppression means
105 Search means
106 Acoustic model storage means
201 Estimated noise holding means
202 Noise statistic calculation means
203 Acoustic model adaptation means
301 Trigger generation means
401, 107 Feature amount conversion means
402 Feature amount inverse conversion means

Claims

A noise statistic calculating means for calculating a noise statistic from noise data estimated for a plurality of frames of the input signal;
A short-time fluctuation noise component calculation means for calculating a short-time fluctuation component of noise included in each frame of the input signal based on the noise statistics calculated by the noise statistics calculation means;
Acoustic model adaptation means for adapting the acoustic model to noise using the noise statistics calculated by the noise statistics calculation means;
Noise suppression means for suppressing, for each frame of the input signal, the short-time fluctuation component of the noise included in the frame calculated by the short-time fluctuation noise component calculation means;
A speech recognition apparatus comprising speech recognition means for performing speech recognition on an input signal suppressed by the noise suppression means using an acoustic model noise-adapted by the acoustic model adaptation means.

The speech recognition apparatus according to claim 1, wherein the short-time fluctuation noise component calculation means calculates the short-time fluctuation component of noise included in each frame of the input signal by subtracting, from the noise component included in that frame, the average of the noise component indicated by the noise statistic calculated by the noise statistic calculation means.

Estimated noise holding means for sequentially holding noise data estimated for each frame of the input signal,
The speech recognition apparatus according to claim 1, wherein the noise statistic calculation unit calculates a noise statistic using noise data held in the estimated noise holding unit.

Acoustic model storage means for storing the information of the acoustic model and the information of the statistical amount of noise adapted to the acoustic model in association with each other;
The short-time fluctuation noise component calculation means calculates a short-time fluctuation component of noise included in the frame based on a noise statistic stored in the acoustic model storage means,
The speech recognition apparatus according to any one of claims 1 to 3, wherein the speech recognition means performs speech recognition on each frame of the input signal using an acoustic model stored in the acoustic model storage means.

The noise statistic calculating means calculates a noise statistic for each predetermined time interval in the time interval of the input signal,
The acoustic model adaptation means adapts the acoustic model to noise every time the noise statistic is updated,
The short-time fluctuation noise component calculation means calculates a short-time fluctuation component of noise included in a frame based on a noise statistic calculated using noise data of frames in the time interval that includes the frame to be calculated,
The speech recognition apparatus according to any one of claims 1 to 4, wherein the speech recognition means performs speech recognition on each frame of the input signal using an acoustic model that is noise-adapted using a noise statistic calculated using noise data of frames in the time interval that includes the frame.

The speech recognition apparatus according to any one of claims 1 to 4, further comprising trigger generation means for controlling a timing at which the acoustic model is subjected to noise adaptation.

The speech recognition apparatus according to claim 6, wherein the trigger generation unit generates a trigger for causing the acoustic model adaptation unit to start the acoustic model adaptation process when it is determined that the input signal is a silent section.

The speech recognition apparatus according to claim 6, wherein the trigger generation unit generates a trigger for causing the acoustic model adaptation unit to start an acoustic model adaptation process in response to an instruction from a user.

Feature amount conversion means for obtaining a noise statistic in the feature domain;
The speech recognition apparatus according to any one of claims 1 to 8, further comprising feature amount inverse conversion means for obtaining a short-time fluctuation component of noise from the noise statistic calculated in the feature domain.

Calculate noise statistics from noise data estimated for multiple frames of the input signal,
Adapting the acoustic model to noise using the calculated noise statistic,
For each frame of the input signal, suppress the short-time fluctuation component of the noise calculated based on the noise statistics used in the noise-adapted acoustic model,
A speech recognition method, wherein speech recognition is performed on the input signal in which the short-time fluctuation component of the noise is suppressed using the noise-adapted acoustic model.

Generate a trigger to start the adaptive processing of the acoustic model,
From the noise data estimated for multiple frames of the input signal by the trigger, the noise statistics are calculated,
The speech recognition method according to claim 10, wherein an acoustic model is adapted to noise using the calculated noise statistic.

When sequentially storing the estimated noise data for each frame of the input signal, the noise data is converted into feature values and stored,
The speech recognition method according to claim 10, wherein a statistic in a noise feature amount region is calculated from the retained noise feature amount data.

A speech recognition program for causing a computer to execute:
a noise statistic calculation process for calculating a noise statistic from noise data estimated for a plurality of frames of the input signal;
a short-time fluctuation noise component calculation process for calculating a short-time fluctuation component of noise included in each frame of the input signal based on the noise statistic calculated by the noise statistic calculation process;
an acoustic model adaptation process for adapting the acoustic model to noise using the noise statistic calculated by the noise statistic calculation process;
a noise suppression process for suppressing, for each frame of the input signal, the short-time fluctuation component of the noise included in the frame calculated by the short-time fluctuation noise component calculation process; and
a speech recognition process for performing speech recognition on the input signal suppressed by the noise suppression process, using the acoustic model noise-adapted by the acoustic model adaptation process.

The speech recognition program according to claim 13, further causing the computer to execute a process of generating a trigger that controls the timing at which the acoustic model is noise-adapted.

The speech recognition program according to claim 13 or 14, further causing the computer to execute:
a feature amount conversion process for obtaining a noise statistic in the feature domain; and
a feature amount inverse conversion process for obtaining a short-time fluctuation component of noise from the noise statistic calculated in the feature domain.