JP6588936B2 - Noise suppression apparatus, method thereof, and program

Info

Publication number: JP6588936B2
Application number: JP2017056079A
Authority: JP
Inventors: 隆朗福冨; 中村　孝; 孝中村; 清彰松井
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2017-03-22
Filing date: 2017-03-22
Publication date: 2019-10-09
Anticipated expiration: 2037-03-22
Also published as: JP2018159756A

Description

The present invention relates to a noise suppression device that suppresses a noise component superimposed on an input speech signal, a method thereof, and a program.

In applications and services that involve speech input, such as speech recognition, noise superimposed on the input speech degrades accuracy. Suppressing the noise component superimposed on the input speech (noise suppression) prevents accuracy degradation in the subsequent stages. As a noise suppression technique, probabilistic-model-based methods, which represent clean speech and noise with probabilistic models, estimate the noise component superimposed on the input speech, and suppress it, are widely known to be effective (see Non-Patent Literature 1).

(Non-Patent Literature 1) Fujimoto, "A VOICE ACTIVITY DETECTION BASED ON THE ADAPTIVE INTEGRATION OF MULTIPLE SPEECH FEATURES AND A SIGNAL DECISION SCHEME", Proc. ICASSP '08, pp.4441-4444.

However, the conventional technology cannot always suppress noise accurately, for example under high-noise conditions.

An object of the present invention is to provide a noise suppression device, a method thereof, and a program that can suppress noise more accurately than the conventional technology, for example under high-noise conditions.

In order to solve the above problem, according to one aspect of the present invention, a noise suppression device includes: a feature amount calculation unit that calculates a speech feature amount O_t of speech data to be noise-suppressed; a noise estimation unit that estimates, from the speech feature amount O_t, a noise component included in the speech data to be noise-suppressed, using the posterior probability b_{j,N_t}(O_t) of each state of an observed signal model that consists of a state composed of speech and noise and a state composed of non-speech and noise and is expressed by Gaussian distributions with K mixture components; and a noise suppression unit that uses the estimated value of the noise component to obtain noise-suppressed speech data in which the noise component is suppressed from the speech data to be noise-suppressed. The posterior probability b_{j,N_t}(O_t) is a value obtained by using the posterior probability p_{j,k} of each state and each distribution, which is the output of a DNN, as the weight of the output of each Gaussian distribution, and the output layer of the DNN corresponds to each distribution in each state of the Gaussian mixture model.

In order to solve the above problem, according to another aspect of the present invention, a noise suppression method includes: a feature amount calculation step of calculating a speech feature amount O_t of speech data to be noise-suppressed; a noise estimation step of estimating, from the speech feature amount O_t, a noise component included in the speech data to be noise-suppressed, using the posterior probability b_{j,N_t}(O_t) of each state of an observed signal model that consists of a state composed of speech and noise and a state composed of non-speech and noise and is expressed by Gaussian distributions with K mixture components; and a noise suppression step of using the estimated value of the noise component to obtain noise-suppressed speech data in which the noise component is suppressed from the speech data to be noise-suppressed. The posterior probability b_{j,N_t}(O_t) is a value obtained by using the posterior probability p_{j,k} of each state and each distribution, which is the output of a DNN, as the weight of the output of each Gaussian distribution, and the output layer of the DNN corresponds to each distribution in each state of the Gaussian mixture model.

According to the present invention, noise can be suppressed more accurately than with the conventional technology, for example under high-noise conditions.

FIG. 1: Functional block diagram of the noise suppression device according to the first and second embodiments.
FIG. 2: Diagram showing an example of the processing flow of the noise suppression device according to the first and second embodiments.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the formulas, these symbols are written in their proper positions. Unless otherwise noted, processing performed on an element of a vector or matrix is applied to all elements of that vector or matrix.

<Points of first embodiment>
Assuming that the speech and non-speech feature distributions follow Gaussian mixture distributions, the posterior probability of each Gaussian distribution is learned by deep learning, and the output values of the trained DNN (deep neural network) are used as the weights of the Gaussian distributions. This allows the speech and non-speech features to be captured more accurately.

<First embodiment>
FIG. 1 is a functional block diagram of the noise suppression device according to the first embodiment, and FIG. 2 shows its processing flow.

The noise suppression device 100 receives speech data to be noise-suppressed as input, suppresses the noise included in the speech data, and obtains and outputs noise-suppressed speech data.

The noise suppression device 100 is implemented on a computer that has a CPU, a RAM, and a ROM storing a program for executing the processing described below, and is functionally configured as follows.

The noise suppression device 100 includes a speech model learning unit 101, a feature amount calculation unit 102, a noise estimation unit 103, and a noise suppression unit 104.

<Speech model learning unit 101>
In order to estimate the noise component superimposed on the input speech data, the speech and non-speech probability density functions, which can be learned in advance, are expressed by a Gaussian mixture model (GMM).

For example, prior to the noise suppression processing, the speech model learning unit 101 receives learning speech data in which a label that distinguishes speech from non-speech (hereinafter also referred to as a speech identification label) is assigned to each short-time frame. The speech model learning unit 101 calculates the speech feature amount O_{H,t} of the learning speech data. L-dimensional filter bank output values or the like can be used as the speech feature amount O_{H,t}. Letting O_{H,t,l} denote the filter bank output value of the l-th dimension, O_{H,t} = {O_{H,t,1}, O_{H,t,2}, …, O_{H,t,L}}. Here, H is an index indicating learning data, and t is an index indicating the frame time.

Using the speech feature amount O_{H,t} and the speech identification labels, the speech model learning unit 101 estimates the optimal parameter values of the GMM (for example, mean vectors and covariance matrices) by the EM algorithm or the like.

Furthermore, in the present embodiment, the speech model learning unit 101 uses the constructed GMM to also construct a DNN whose output layer corresponds to each Gaussian distribution in each state of the GMM. The DNN is matched to the number of states J and the number of mixture components K of the previously constructed GMM, so the output layer has J×K outputs. For example, j = 0, 1, …, J-1 with J = 2 (that is, j = 0, 1), where j = 0 is a label indicating non-dialogue and j = 1 is a label indicating dialogue. The DNN is a model that takes the speech feature amount as input and outputs the posterior probability of each Gaussian distribution in each state, and it is trained from the speech feature amount O_{H,t} and the posterior probabilities of the constructed GMM.
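As a rough illustration of how J×K training targets for such a DNN could be derived from the constructed GMM, the following sketch computes the posterior of every (state, distribution) pair for one feature vector. The function names, the diagonal covariances, and the joint normalization over all J×K pairs are assumptions made for illustration; the description does not specify them.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x (assumed covariance form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_posterior_targets(x, gmm):
    """Posterior of every (state j, distribution k) pair given feature vector x.

    gmm[j] is a list of K (weight, mean, var) tuples for state j.
    Returns a flat list of length J*K that sums to 1 (assumed normalization).
    """
    logs = [
        math.log(w) + log_gauss_diag(x, mean, var)
        for state in gmm
        for (w, mean, var) in state
    ]
    peak = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy GMM with J = 2 states, K = 2 distributions, L = 2 dimensions (made-up numbers).
toy_gmm = [
    [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    [(0.5, [4.0, 4.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
]
targets = gmm_posterior_targets([0.2, -0.1], toy_gmm)
```

Each such target vector, paired with its feature vector, could then serve as one training example for a network with a J×K-way output layer.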

Prior to the noise suppression processing, the speech model learning unit 101 outputs the constructed GMM and DNN to the noise estimation unit 103.

<Feature amount calculation unit 102>
The feature amount calculation unit 102 receives the speech data to be noise-suppressed as input, calculates its speech feature amount O_t (S102), and outputs it. The type of speech feature amount and the calculation method are the same as those used in the speech model learning unit 101.

<Noise estimation unit 103>
The noise estimation unit 103 receives the GMM and the DNN prior to the noise suppression processing.

The noise estimation unit 103 receives the speech feature amount O_t as input, estimates the noise component included in the speech data to be noise-suppressed (S103), and outputs it. For example, the noise component at time t is estimated using a known noise estimation method based on a probabilistic model (see Reference 1).
(Reference 1) Fujimoto, "Noise Robust Voice Activity Detection Based on Switching Kalman Filter", IEICE Trans. on Info. & Syst., Vol. E91-D, No.3, pp.467-477, March 2008.

The estimated value ^N_{t,l} of the noise component, obtained as a weighted average using the normalized probabilities at time t, is given in the following form.
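The expression itself is not reproduced in this text. One form consistent with the wording "weighted average using the normalized probabilities", written with the symbols defined in the surrounding paragraphs, would be the following; this is a reconstruction under that assumption, not the original formula.

```latex
\hat{N}_{t,l} \;=\; \sum_{j=0}^{J-1}
  \frac{b_{j,N_t}(O_t)}{\sum_{j'=0}^{J-1} b_{j',N_t}(O_t)} \, N_{t,j,l},
\qquad l = 1, 2, \ldots, L
```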

Here, l = 1, 2, …, L; ^N_t = (^N_{t,1}, ^N_{t,2}, …, ^N_{t,L}); N_{t,j} = (N_{t,j,1}, N_{t,j,2}, …, N_{t,j,L}); and N_{t,j,l} is the filter bank output value of the l-th dimension of the estimated noise component N_{t,j} at time t in state j. b_{j,N_t}(O_t) (where the subscript N_t denotes N with the subscript t) is the posterior probability of each state of the observed signal model (O_t), which consists of two states, speech and noise (j = 0) and non-speech and noise (j = 1), and is expressed by Gaussian distributions with K mixture components. It is expressed as follows.
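The expression referred to here is likewise not reproduced in this text. Given that the model is a K-component Gaussian mixture per state with mixture weights w_{S_j,k}, mean vectors μ_{O_t,j,k}, and covariance matrices Σ_{O_t,j,k}, a plausible reconstruction (an assumption, using the standard Gaussian mixture form) is:

```latex
b_{j,N_t}(O_t) \;=\; \sum_{k=1}^{K} w_{S_j,k}\,
  \mathcal{N}\!\left(O_t;\; \mu_{O_t,j,k},\; \Sigma_{O_t,j,k}\right)
```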

Here, μ_{O_t,j,k} and Σ_{O_t,j,k} (where the subscript O_t denotes O with the subscript t) are, respectively, the mean vector and the covariance matrix of the observed signal model at time t for each state j (j = 0, 1) and each normal distribution k (k = 1, 2, …, K). These values can be obtained from the GMM learned by the speech model learning unit 101.

In the known technique (see Reference 1), w_{S_j,k} (where the subscript S_j,k denotes S with the subscript j,k) is the mixture weight of a GMM with K mixture components that is learned in advance and consists of the two states of speech and non-speech. In the present embodiment, the posterior probability p_{j,k} of each state and each distribution, which is the output of the DNN constructed by the speech model learning unit 101, is used as the weight w_{S_j,k}. That is, w_{S_j,k} = p_{j,k}.
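The following minimal sketch illustrates the idea of using the DNN outputs p_{j,k} in place of fixed mixture weights when scoring each state, and then averaging per-state noise estimates with the normalized scores. It assumes diagonal covariances and the standard mixture form for b_j; all function names, the uniform fallback, and the toy numbers are hypothetical and not taken from the description.

```python
import math

def gauss_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian at x (assumed covariance form)."""
    log_p = -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )
    return math.exp(log_p)

def state_score(o_t, dnn_post_j, means_j, vars_j):
    """b_j(O_t): Gaussian outputs of state j weighted by the DNN posteriors p_{j,k}."""
    return sum(
        p * gauss_diag(o_t, m, v)
        for p, m, v in zip(dnn_post_j, means_j, vars_j)
    )

def estimate_noise(o_t, dnn_post, means, vars_, noise_by_state):
    """Weighted average of the per-state noise estimates N_{t,j}."""
    scores = [
        state_score(o_t, dnn_post[j], means[j], vars_[j])
        for j in range(len(dnn_post))
    ]
    total = sum(scores)
    if total == 0.0:
        # Fallback chosen for this sketch only: uniform weights.
        weights = [1.0 / len(scores)] * len(scores)
    else:
        weights = [s / total for s in scores]  # normalized probabilities
    dims = len(noise_by_state[0])
    return [
        sum(w * n[l] for w, n in zip(weights, noise_by_state))
        for l in range(dims)
    ]

# Toy case: J = 2, K = 1, identical Gaussians, so the weights follow p_{j,k}.
noise = estimate_noise(
    o_t=[0.0, 0.0],
    dnn_post=[[0.75], [0.25]],
    means=[[[0.0, 0.0]], [[0.0, 0.0]]],
    vars_=[[[1.0, 1.0]], [[1.0, 1.0]]],
    noise_by_state=[[1.0, 2.0], [3.0, 4.0]],
)
```

In the toy case the two states have identical Gaussians, so the result depends only on the DNN posteriors, which makes the effect of the substitution w_{S_j,k} = p_{j,k} easy to see.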

<Noise suppression unit 104>
The noise suppression unit 104 receives the speech data to be noise-suppressed and the estimated value ^N_{t,l} of the noise component as input, uses the estimated value ^N_{t,l} to obtain noise-suppressed speech data in which the noise component is suppressed from the speech data to be noise-suppressed (S104), and outputs it. For example, a filter gain is first obtained using the estimated value ^N_{t,l} of the noise component. The filter gain is then converted into an impulse response. Finally, the impulse response is convolved with the speech data to be noise-suppressed to obtain the noise-suppressed speech data, which is output.

For example, noise suppression is performed using a method based on the Wiener filter (see Reference 2).
(Reference 2) Segura, "Model-based compensation of additive noise for continuous speech recognition. experiments using AURORA II database and tasks", Proc. of Euro Speech '01, Vol.1, pp.221-224.

In general, the filter gain G_{t,l} of the Wiener filter at time t is expressed as follows, and it can be calculated by using the mean value of the speech/non-speech probabilistic model learned in advance as S_{t,l} and the estimated value ^N_{t,l} obtained by the noise estimation unit 103 as N_{t,l}.
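The gain expression is not reproduced in this text. The sketch below assumes the standard Wiener form G_{t,l} = S_{t,l} / (S_{t,l} + N_{t,l}), applied per filter bank channel to linear (non-logarithmic) values. The function name and the gain floor are illustrative assumptions, not part of the description.

```python
def wiener_gain(speech_mean, noise_est, floor=1e-3):
    """Per-channel Wiener gain G = S / (S + N), clipped to [floor, 1].

    speech_mean: S_{t,l}, e.g. the mean of the pre-trained speech model.
    noise_est:   N_{t,l}, here the estimated noise ^N_{t,l}.
    floor:       assumed lower bound that keeps the gain from reaching zero.
    """
    gains = []
    for s, n in zip(speech_mean, noise_est):
        denom = s + n
        g = s / denom if denom > 0.0 else floor
        gains.append(max(floor, min(1.0, g)))
    return gains

# Toy values for L = 2 channels.
gains = wiener_gain([3.0, 1.0], [1.0, 1.0])
```

A gain near 1 leaves a channel almost untouched, while a gain near the floor strongly attenuates a channel dominated by the estimated noise.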

After the filter gain is calculated, it is converted into an impulse response using a Mel-warped DCT or the like and convolved with the input signal waveform to obtain the noise-suppressed speech data.

<Effect>
With the above configuration, noise can be suppressed more accurately than with the conventional technology, for example under high-noise conditions.

<Modification>
In the present embodiment, the speech model learning unit 101 is provided inside the noise suppression device 100, but it may be configured as a separate device.

<Second embodiment>
The description focuses on the differences from the first embodiment.

The noise suppression device 100 includes a speech model learning unit 201, a feature amount calculation unit 102, a noise estimation unit 203, and a noise suppression unit 104. The processing performed by the speech model learning unit 201 and the noise estimation unit 203 differs from that of the first embodiment.

<Speech model learning unit 201>
The DNN in the present embodiment is constructed using, as learning data, the feature amount of the learning speech data and the feature amount of speech data consisting of the learning noise that is superimposed on the learning speech data.

First, the speech model learning unit 201 constructs a GMM in the same way as in the first embodiment. In the present embodiment, noise is additionally superimposed on the learning speech data on purpose, and the feature amount of that noise component (hereinafter also referred to as learning noise), for example an L-dimensional filter bank output as for speech, is given as an additional input feature in the learning data used to construct the DNN. In other words, the GMM is trained using the L-dimensional feature amount of the learning speech data, whereas the DNN is trained using a 2L-dimensional feature amount consisting of the feature amount of the learning speech data plus the feature amount of the learning noise. The DNN is a model that takes the speech feature amount O_{H,t} and the feature amount of the learning noise as input and outputs the posterior probability of each Gaussian distribution in each state, and it is trained from the speech feature amount O_{H,t}, the feature amount of the learning noise, and the posterior probabilities of the constructed GMM.

Prior to the noise suppression processing, the speech model learning unit 201 outputs the constructed GMM and DNN to the noise estimation unit 203.

<Noise estimation unit 203>
The noise estimation unit 203 receives the GMM and the DNN prior to the noise suppression processing.

The noise estimation unit 203 receives the speech feature amount O_t as input, estimates the noise component included in the speech data to be noise-suppressed (S203), and outputs it. In the present embodiment, the posterior probability p_{j,k} in each state and each distribution, which is the output of the DNN constructed by the speech model learning unit 201, is calculated using the feature amount of the estimated value ^N_{t-n,l} of the noise component at one or more previous times together with the speech feature amount O_t, and is used as the mixture weight w_{S_j,k}. Here, n is an integer of 1 or more. The estimated value ^N_{t,l} is obtained by the following expression.

Note that the posterior probability p_{j,k} is needed in order to obtain the estimated value ^N_{t,l}. The posterior probability p_{j,k} is calculated using, for example, the estimated value ^N_{t-1,l} at the immediately preceding time t-1. At t = 0, where no estimated value ^N_{t-1,l} is available, it is advisable to substitute a value for ^N_{t-1,l}, for example 0 or the speech feature amount O_t.
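The frame-by-frame feedback described above, with n = 1, can be sketched as follows. The DNN and the noise estimator are passed in as callables because their internals are outside this sketch; the function name `track_noise`, the `init` option, and the stub functions in the usage part are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Posteriors = List[List[float]]  # p[j][k]

def track_noise(
    frames: Sequence[Vector],
    dnn_posteriors: Callable[[Vector], Posteriors],
    estimate_noise: Callable[[Vector, Posteriors], Vector],
    init: str = "zeros",
) -> List[Vector]:
    """Run noise estimation over frames, feeding back the previous estimate."""
    estimates: List[Vector] = []
    prev: Vector = []
    for t, o_t in enumerate(frames):
        if t == 0:
            # No previous estimate exists: substitute zeros or the feature itself.
            prev = [0.0] * len(o_t) if init == "zeros" else list(o_t)
        dnn_input = list(o_t) + prev      # 2L-dimensional input
        p = dnn_posteriors(dnn_input)     # p_{j,k} used as mixture weights
        prev = estimate_noise(o_t, p)     # serves as ^N_{t-1} for the next frame
        estimates.append(prev)
    return estimates

# Usage with stand-in functions (not a real DNN or estimator).
seen_inputs = []

def stub_dnn(x):
    seen_inputs.append(list(x))
    return [[0.5], [0.5]]

def stub_estimator(o_t, p):
    return [0.5 * v for v in o_t]

result = track_noise([[2.0, 4.0], [6.0, 8.0]], stub_dnn, stub_estimator)
```

The recorded inputs show that the first frame is padded with the chosen initial value and that every later frame carries the previous estimate in its second half.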

<Effect>
This configuration provides the same effect as the first embodiment. In addition, because noise features as well as speech features are learned in advance, the speech/non-speech posterior probabilities can be calculated robustly even in noisy environments, and noise can be suppressed accurately.

<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various kinds of processing described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Furthermore, each time a program is transferred from the server computer to this computer, the computer may sequentially execute processing according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

Although each device is configured by executing a predetermined program on a computer, at least part of the processing contents may be realized by hardware.

Claims

A noise suppression device comprising:
a feature amount calculation unit that calculates a speech feature amount O_t of speech data to be noise-suppressed;
a noise estimation unit that estimates, from the speech feature amount O_t, a noise component included in the speech data to be noise-suppressed, using the posterior probability b_{j,N_t}(O_t) of each state of an observed signal model that consists of a state composed of speech and noise and a state composed of non-speech and noise and is expressed by Gaussian distributions with K mixture components; and
a noise suppression unit that uses the estimated value of the noise component to obtain noise-suppressed speech data in which the noise component is suppressed from the speech data to be noise-suppressed,
wherein the posterior probability b_{j,N_t}(O_t) is a value obtained by using the posterior probability p_{j,k} of each state and each distribution, which is the output of a DNN, as the weight of the output of each Gaussian distribution, the output layer of the DNN corresponds to each distribution in each state of the Gaussian mixture model, and the DNN takes the speech feature amount O_t as input.

The noise suppression device according to claim 1,
wherein the DNN is constructed using, as learning data, the feature amount of learning speech data and the feature amount of speech data consisting of learning noise that is superimposed on the learning speech data, and
the noise estimation unit calculates the posterior probability p_{j,k} in each state and each distribution, which is the output of the DNN, using the feature amount of the estimated value of the noise component at one or more previous times and the speech feature amount O_t.

A noise suppression method comprising:
a feature amount calculation step of calculating a speech feature amount O_t of speech data to be noise-suppressed;
a noise estimation step of estimating, from the speech feature amount O_t, a noise component included in the speech data to be noise-suppressed, using the posterior probability b_{j,N_t}(O_t) of each state of an observed signal model that consists of a state composed of speech and noise and a state composed of non-speech and noise and is expressed by Gaussian distributions with K mixture components; and
a noise suppression step of using the estimated value of the noise component to obtain noise-suppressed speech data in which the noise component is suppressed from the speech data to be noise-suppressed,
wherein the posterior probability b_{j,N_t}(O_t) is a value obtained by using the posterior probability p_{j,k} of each state and each distribution, which is the output of a DNN, as the weight of the output of each Gaussian distribution, the output layer of the DNN corresponds to each distribution in each state of the Gaussian mixture model, and the DNN takes the speech feature amount O_t as input.

The noise suppression method according to claim 3,
wherein the DNN is constructed using, as learning data, the feature amount of learning speech data and the feature amount of speech data consisting of learning noise that is superimposed on the learning speech data, and
the noise estimation step calculates the posterior probability p_{j,k} in each state and each distribution, which is the output of the DNN, using the feature amount of the estimated value of the noise component at one or more previous times and the speech feature amount O_t.

A program for causing a computer to function as the noise suppression device according to claim 1.