JP6139430B2

JP6139430B2 - Signal processing apparatus, method and program

Info

Publication number: JP6139430B2
Application number: JP2014025197A
Authority: JP
Inventors: 小川　厚徳; 木下　慶介; 堀　貴明; 中谷　智広; 中村　篤
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2014-02-13
Filing date: 2014-02-13
Publication date: 2017-05-31
Anticipated expiration: 2034-02-13
Also published as: JP2015152705A

Description

The present invention relates to a technique for processing signals such as speech signals and acoustic signals.

When an acoustic signal is picked up in an environment with noise or reverberation, the observed signal is the original signal with acoustic distortion (noise or reverberation) superimposed on it. When the acoustic signal is speech, the superimposed distortion greatly reduces intelligibility. As a result, it becomes difficult to extract the properties of the original speech signal and, for example, the recognition rate of a speech recognition system drops. Preventing this drop requires some means of removing the superimposed acoustic distortion.

For this purpose, the conventional signal processing apparatus described below has been proposed. Besides speech recognition, this apparatus can be used in, for example, hearing aids, video conferencing systems, machine control interfaces, and music information processing systems that search for or transcribe music.

[Signal processing apparatus]
FIG. 1 shows an example of the functional configuration of a conventional signal processing apparatus, whose operation is briefly described here. The apparatus includes a Fourier transform unit 101, a feature amount generation unit 102, a matching unit 103, a speech enhancement filtering unit 104, and a case model storage unit 105.

Speech containing noise and/or reverberation is input to the Fourier transform unit 101 as the input signal. The input signal is windowed with a short-time Hamming window of, for example, about 30 ms, and each windowed frame is converted into an amplitude spectrum by a discrete Fourier transform (step S1, FIG. 2). The amplitude spectrum is the magnitude of the frequency spectrum. It is supplied to the feature amount generation unit 102 and the speech enhancement filtering unit 104.
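Step S1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the apparatus's actual implementation: the function names, the symmetric form of the Hamming window, and the hop size are assumptions (the text specifies only a Hamming window of about 30 ms followed by a discrete Fourier transform).

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def split_frames(signal, frame_len, hop):
    """Cut the signal into overlapping frames; the hop size is an assumption."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def amplitude_spectrum(frame):
    """Window one frame and return |DFT| for bins 0 .. n/2."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[j] * cmath.exp(-2j * math.pi * k * j / n)
                  for j in range(n))
        spectrum.append(abs(acc))
    return spectrum
```

At a 16 kHz sampling rate, for example, a 30 ms window corresponds to `frame_len = 480`; a direct DFT like this one is quadratic in the frame length, so a real system would use an FFT instead.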

The feature amount generation unit 102 converts every amplitude spectrum output by the Fourier transform unit 101 into a feature amount, for example a mel-cepstrum (step S2, FIG. 2). Commonly used mel-cepstra have an order of at most about 10 to 20, but here a high-order mel-cepstrum (for example, about order 30 to 100) is used so that the case data are represented accurately. Feature amounts other than the mel-cepstrum may also be used. The generated feature amounts are supplied to the matching unit 103.

The case model storage unit 105 stores clean speech data corresponding to the cases, and a case model M, which is a sequence (segment) of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood for the feature amount of each frame. The clean speech data corresponding to a case are, for example, the amplitude spectra of the corresponding clean speech. FIG. 3 shows an example of a segment contained in the case model M. Each cell corresponds to the i-th time frame, and the number in each cell is the index m_i of the Gaussian distribution in the Gaussian mixture distribution g that gives the maximum likelihood. The case model is generated in advance by a case model generation apparatus and stored in the case model storage unit 105 beforehand. To generate it, a large amount of clean speech obtained from a speech corpus or the like is combined with noise and reverberation data obtained in a wide range of environments (noise waveforms and room impulse responses) to simulate observed signals in various environments, and the simulated observed signals are converted into the feature domain. The case model generation apparatus is described in detail later.

The matching unit 103 matches the feature amounts of the input signal against the case model stored in the case model storage unit 105 and searches for the segment in the case model that is closest to the input signal (step S3, FIG. 2). Information on the segment found is supplied to the speech enhancement filtering unit 104. The matching unit 103 is described in detail later.

The speech enhancement filtering unit 104 builds a speech enhancement filter from the amplitude spectrum of the clean speech corresponding to the segment found by the matching unit 103, and filters the input signal with it (step S4, FIG. 2). The clean speech amplitude spectrum corresponding to that segment is read from the case model storage unit 105. For details of the speech enhancement filtering unit 104, see, for example, Non-Patent Document 1 and Patent Document 1.

It has been reported that this signal processing apparatus can remove highly time-varying noise, which was previously difficult. Highly time-varying noise here means, as opposed to background noise, noise such as the sound of an alarm clock.

[Case model generation apparatus]
The case model generation apparatus, which generates the case model stored in the case model storage unit 105, is described here. FIG. 4 shows an example of its functional configuration. The case model generation apparatus includes a Fourier transform unit 201, a feature amount generation unit 202, a Gaussian mixture model learning unit 203, and a maximum likelihood Gaussian distribution calculation unit 204.

The functions of the units of the case model generation apparatus are realized, for example, by loading a predetermined program into a computer comprising a ROM, a RAM, a CPU, and the like, and having the CPU execute that program.

The input to the case model generation apparatus is speech data from various noise/reverberation environments, which are assumed to include speech data from a clean environment. The following processing is performed for each of these sets of speech data.
The Fourier transform unit 201 and the feature amount generation unit 202 are the same as the Fourier transform unit 101 and the feature amount generation unit 102 of FIG. 1, respectively, so redundant description is omitted.

The Gaussian mixture model learning unit 203 uses the feature amount x_i of each short-time frame i obtained by the feature amount generation unit 202 as training data and obtains a Gaussian mixture model g by ordinary maximum likelihood estimation. The Gaussian mixture model g is given by the following equation.
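The equation image is not reproduced in this text. From the definitions of w(m), g(x_i|m), and Q given next, it is presumably the usual mixture density; the form below is a reconstruction under that assumption, and the label (1) is inferred from the later equation numbers:

```latex
g(x_i) = \sum_{m=1}^{Q} w(m)\, g(x_i \mid m) \tag{1}
```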

g(x_i|m) denotes the m-th Gaussian distribution, with mean μ_m and variance Σ_m. In most cases g(x_i|m) is a multidimensional Gaussian distribution whose dimensionality equals that of the feature amount x_i; in that case the mean μ_m and the variance Σ_m are each vectors. For brevity, g(x_i|m) is simply called a Gaussian distribution here even when it is multidimensional. w(m) is the mixture weight of the m-th Gaussian distribution, and Q is the number of mixture components. Q is set to a fairly large value, for example 4096 or 8192.

The maximum likelihood Gaussian distribution calculation unit 204 finds, for each time frame i, the index m_i of the Gaussian distribution in the Gaussian mixture distribution g that gives the maximum likelihood, and obtains the time sequence of these indices as the case model M. The case model M is expressed in terms of the set of Gaussian indices m_i and the Gaussian mixture model g as shown in the following equation.
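The equation image is missing here as well. By analogy with the definition of M_{u:u+τ} given in the description of the matching unit, the case model can presumably be written as follows (a reconstruction, not the published figure):

```latex
M = \{\, g,\; m_i : i = 1, 2, \ldots, I \,\}
```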

Here, m_i is the index of the Gaussian distribution that gives the maximum likelihood for the feature amount x_i of the i-th frame; it identifies the Gaussian distribution g(x_i|m), with m = m_i, in the Gaussian mixture distribution g. I is the total number of frames in the training data. For example, for one hour of training data, I = 3.5 × 10^5. The generated case model M is stored in the case model storage unit 105 (FIG. 1). A case model is generated in this way for each set of training data from the various noise/reverberation environments.
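The index sequence can be sketched as below. This is an illustrative sketch under stated assumptions rather than the patented implementation: the Gaussians are taken to have diagonal covariance (consistent with the variance being a vector), the maximum is taken over g(x_i|m) without the mixture weight, and all function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at the vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))


def best_component(x, means, variances):
    """Index m that maximizes g(x | m); weights w(m) are not used (assumption)."""
    scores = [log_gauss_diag(x, mu, var) for mu, var in zip(means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)


def case_model_indices(features, means, variances):
    """Sequence m_1 .. m_I: one Gaussian index per training frame."""
    return [best_component(x, means, variances) for x in features]
```

Storing one integer per frame is what allows the case model to be held as a linear array of I entries alongside the Gaussian mixture model itself.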

When the environment is clean, the amplitude spectrum data output by the Fourier transform unit 201 are also stored in the case model storage unit 105 (FIG. 1).

[Specific processing of the matching unit 103]
The processing in the matching unit 103 is now described in detail. For simplicity, only the case model M of a single noise/reverberation environment is considered, and time warping is not considered when matching the input feature amount sequence against training data segments. Using the feature amounts y_t of the input signal and the case model M, the matching unit 103 searches for the training data segment closest to the input feature amount sequence and outputs the training data segment M̂^t_{u:u+τ_max} that is expected to give the clean speech sequence closest to the clean speech contained in the input signal.

Suppose the input signal consists of T time frames, and let its feature amount sequence be y = {y_t : t = 1, 2, ..., T}. Let y_{t:t+τ} be the sequence of input feature amounts from time frame t to t+τ. Further, let M_{u:u+τ} = {g, m_i : i = u, u+1, ..., u+τ} be the sequence of Gaussian distributions corresponding to the consecutive time frames from the u-th to the (u+τ)-th in the training data.

Several definitions of the distance between the input feature amount sequence y_{t:t+τ} and a segment of the training data, and several methods of searching for the training data closest to y_{t:t+τ}, are conceivable, for example the Euclidean distance. Here, the training data segment closest to the input feature amount sequence is taken to be the longest among the training data segments that match the input feature amount sequence well. That is, the training data segment M̂^t_{u:u+τ} closest to the input feature amount sequence can be found by maximizing the posterior probability given by the following equation.

Here, p(M_{u:u+τ}|y_{t:t+τ}) is the posterior probability. It has the property that, when y_{t:t+τ} and M_{u:u+τ} match relatively well, the longer τ is, the higher the posterior probability becomes. A proof of this property is given in Non-Patent Document 1. Searching for longer segments makes the matching less susceptible to noise that is present only locally in time, so relatively noise-robust matching can be expected.

The numerator term p(y_{t:t+τ}|M_{u:u+τ}) of equation (2) is the likelihood of y_{t:t+τ} for the training data segment corresponding to M_{u:u+τ}. This likelihood is calculated by the following equation.

For simplicity, adjacent frames are assumed to be independent. The first term of the denominator of equation (2) is the sum of p(y_{t:t+τ}|M_{u':u'+τ}) over every time frame u' in the training data taken as the starting point. The second term of the denominator of equation (2) is the likelihood of y_{t:t+τ} under the Gaussian mixture model g and is calculated by the following equation.
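The equation images referred to in the preceding paragraphs are not reproduced in this text. From the descriptions of the numerator, the two denominator terms, and the frame-independence assumption, they presumably take the following form. This is a reconstruction that assumes equal priors for all candidates; the labels (2) and (3) follow the numbers cited in the text, and the remaining lines are left unnumbered because their original numbers cannot be recovered.

```latex
\hat{M}^{t}_{u:u+\tau_{\max}}
  = \arg\max_{u,\,\tau}\; p\!\left(M_{u:u+\tau} \mid y_{t:t+\tau}\right)

p\!\left(M_{u:u+\tau} \mid y_{t:t+\tau}\right)
  = \frac{p\!\left(y_{t:t+\tau} \mid M_{u:u+\tau}\right)}
         {\sum_{u'} p\!\left(y_{t:t+\tau} \mid M_{u':u'+\tau}\right)
          + p\!\left(y_{t:t+\tau} \mid g\right)} \tag{2}

p\!\left(y_{t:t+\tau} \mid M_{u:u+\tau}\right)
  = \prod_{k=0}^{\tau} g\!\left(y_{t+k} \mid m_{u+k}\right) \tag{3}

p\!\left(y_{t:t+\tau} \mid g\right)
  = \prod_{k=0}^{\tau} \sum_{m=1}^{Q} w(m)\, g\!\left(y_{t+k} \mid m\right)
```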

The segment search procedure in the matching unit 103 is now described more concretely. First, the maximum segment length is limited to (τ_lim + 1) frames; for example, limiting it to 30 frames gives τ_lim = 29. Under this limit, first set τ = 0 (segment length 1) and find, according to equation (2), the length-1 segment that gives the maximum posterior probability. Next set τ = 1 (segment length 2) and find the length-2 segment that gives the maximum posterior probability. Repeat this up to τ = τ_lim, and finally select, from the candidates of different lengths found in this way, the segment that gives the maximum posterior probability. The value of τ for that segment is τ_max.
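The conventional procedure just described can be sketched as follows. This is a toy sketch, not the reference implementation: it assumes one-dimensional features, equal priors, and a posterior whose denominator is the sum over all start points plus the mixture-model likelihood, as the text describes; all names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_gmm(x, weights, means, variances):
    """Log likelihood of one frame under the whole mixture g."""
    return logsumexp([math.log(w) + log_gauss(x, mu, v)
                      for w, mu, v in zip(weights, means, variances)])


def conventional_search(y, t, indices, weights, means, variances, tau_lim):
    """Return (u, tau, log posterior) of the best segment for input frame t."""
    best = None
    for tau in range(min(tau_lim, len(y) - 1 - t) + 1):
        if len(indices) <= tau:
            break  # training data shorter than the segment length
        starts = range(len(indices) - tau)
        # log p(y_{t:t+tau} | M_{u:u+tau}) for every start point u
        seg = [sum(log_gauss(y[t + k], means[indices[u + k]],
                             variances[indices[u + k]])
                   for k in range(tau + 1)) for u in starts]
        # log p(y_{t:t+tau} | g): second term of the denominator
        bg = sum(log_gmm(y[t + k], weights, means, variances)
                 for k in range(tau + 1))
        log_den = logsumexp(seg + [bg])
        u = max(starts, key=lambda s: seg[s])
        if best is None or seg[u] - log_den > best[2]:
            best = (u, tau, seg[u] - log_den)
    return best
```

Note that the final comparison is between posteriors computed over windows of different lengths.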

This segment search in the matching unit 103 can be carried out on the case model M, which can be represented as a linear memory of I frames as shown in FIG. 3.

Non-Patent Document 1: J. Ming, R. Srinivasan, and D. Crookes, “A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise,” IEEE Trans. on Audio, Speech, and Language Processing, 19(4), pp. 822-836, 2011.

Patent Document 1: JP 2013-37174 A

In the conventional signal processing apparatus, the matching unit 103 compares segments of different lengths when searching for the segment closest to the input signal. Strictly speaking, however, segments of different lengths cannot be compared directly. For this reason, the conventional apparatus does not necessarily achieve an accurate segment search.

An object of the present invention is to provide a signal processing apparatus, method, and program that can perform a more accurate segment search than before.

A signal processing apparatus according to one aspect of the present invention includes: a case model storage unit storing segments, each being a sequence of indices of the Gaussian distributions in a Gaussian mixture distribution that give the maximum likelihood for the feature amount of each frame of a predetermined signal; and a matching unit that, taking the segments stored in the case model storage unit as candidates, searches for the segment giving the maximum posterior probability for the feature amount sequence of an input signal. With the input signal divided into two, the first part being a first-half signal and the second part a second-half signal, the posterior probability in the matching unit is expressed using a likelihood of the first-half signal evaluated on the basis of a segment whose length corresponds to the first-half signal, and a likelihood of the second-half signal evaluated on the basis of the model given by the Gaussian mixture distribution.

A highly accurate segment search can be performed.

FIG. 1 is a block diagram illustrating an example of the signal processing apparatus.
FIG. 2 is a flowchart illustrating an example of the signal processing method.
FIG. 3 is a diagram illustrating an example of a segment.
FIG. 4 is a diagram illustrating an example of the case model generation apparatus.
FIG. 5 is a diagram illustrating segment evaluation by equation (7).

Embodiments of the signal processing apparatus and method are described below with reference to the drawings.

Like the conventional signal processing apparatus, the signal processing apparatus of this embodiment includes, as illustrated in FIG. 1, a Fourier transform unit 101, a feature amount generation unit 102, a matching unit 103, a speech enhancement filtering unit 104, and a case model storage unit 105.

The description below focuses on the matching unit 103, which is where this embodiment differs from the conventional apparatus. The Fourier transform unit 101, the feature amount generation unit 102, and the speech enhancement filtering unit 104 of the signal processing apparatus according to the first embodiment are the same as the corresponding units of the conventional signal processing apparatus, so redundant description is omitted.

In the signal processing apparatus of this embodiment, the matching unit 103 searches for the segment closest to the input signal by evaluating segments of different lengths fairly, under a common frame length.

In the matching unit 103 of this embodiment, instead of equation (3), the likelihood of the input feature amount sequence y_{t:t+τ} of a fixed number of frames is calculated using both the case model M and the Gaussian mixture model g. That is, y_{t:t+τ} is divided into y_{t:t+ν} and y_{t+ν+1:t+τ} (0 ≤ ν ≤ τ), and the former is evaluated with M and the latter with g. Specifically, the likelihood of the input feature amount sequence y_{t:t+τ} is calculated as in the following equation.
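The equation image is not reproduced in this text. From the description of the split and the term definitions that follow, it presumably reads as below. This is a reconstruction; the text cites these relations as equations (4) and (5), but which line carried which number is an assumption.

```latex
p\!\left(y_{t:t+\tau} \mid M_{u:u+\nu},\, \phi_{u+\nu+1:u+\tau}\right)
  = p\!\left(y_{t:t+\nu} \mid M_{u:u+\nu}\right)\,
    p\!\left(y_{t+\nu+1:t+\tau} \mid \phi_{u+\nu+1:u+\tau}\right) \tag{4}

  = \prod_{k=0}^{\nu} g\!\left(y_{t+k} \mid m_{u+k}\right)
    \prod_{k=\nu+1}^{\tau} \sum_{m=1}^{Q} w(m)\, g\!\left(y_{t+k} \mid m\right) \tag{5}
```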

Here, p(y_{t:t+ν}|M_{u:u+ν}) is the likelihood of the input feature amount sequence y_{t:t+ν} given the case model segment M_{u:u+ν}. p(y_{t+ν+1:t+τ}|φ_{u+ν+1:u+τ}) is the likelihood of the input feature amount sequence y_{t+ν+1:t+τ} given the mixture model φ_{u+ν+1:u+τ}, where φ_{u+ν+1:u+τ} is the Gaussian mixture distribution corresponding to frames u+ν+1 through u+τ. p(y_{t:t+τ}|M_{u:u+ν}, φ_{u+ν+1:u+τ}) is the likelihood of the whole input feature amount sequence y_{t:t+τ} given both the case model segment M_{u:u+ν} and the mixture model φ_{u+ν+1:u+τ}.

y_{t:t+ν} is the part of the input feature amount sequence y_{t:t+τ} whose length corresponds to the case model segment M_{u:u+ν}; in other words, it is the input feature amount sequence for frames t through t+ν. y_{t+ν+1:t+τ} is the part of y_{t:t+τ} that extends beyond the length of the case model segment M_{u:u+ν}; in other words, it is the input feature amount sequence for frames t+ν+1 through t+τ.

In other words, equation (5) treats the input signal under evaluation as having a fixed length (here τ + 1 frames). The part of its feature amount sequence that can be evaluated with the case model is scored by the likelihood p(y_{t:t+ν}|M_{u:u+ν}), and the part y_{t+ν+1:t+τ} that cannot be evaluated with the segment M_{u:u+ν} (the part extending beyond the length of the segment) is scored by the likelihood p(y_{t+ν+1:t+τ}|φ_{u+ν+1:u+τ}) based on the mixture model g.

Put differently, with the input signal divided into two, the first part being the first-half signal and the second part the second-half signal, the likelihood that the matching unit 103 calculates according to equation (4) combines the likelihood p(y_{t:t+ν}|M_{u:u+ν}) of the first-half signal, evaluated on the basis of a segment whose length corresponds to the first-half signal, with the likelihood p(y_{t+ν+1:t+τ}|φ_{u+ν+1:u+τ}) of the second-half signal, evaluated on the basis of the model given by the Gaussian mixture distribution.

The likelihood based on the mixture model g corresponds to something like a likelihood smoothed over the whole model. By substituting this average likelihood for the part that cannot be evaluated with the case model, the input signal is evaluated fairly over an equal frame length.

Using this likelihood of y_{t:t+τ}, the matching unit 103 finds the segment M̂^t_{u:u+ν_max} that best fits y_{t:t+τ} according to the following equations (6) and (7). Here t, τ, u, ν, u', and ν' are integers.

Here, the denominator of equation (7) is the sum of p(y_{t:t+τ}|M_{u':u'+ν'}, φ_{u'+ν'+1:u'+τ}) over every starting point u' in the training data and every division point ν' of y_{t:t+τ}.
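Equations (6) and (7) are not reproduced in this text. From the description of the maximization and of the denominator, they presumably have the following form (a reconstruction that assumes equal priors over all candidates):

```latex
\hat{M}^{t}_{u:u+\nu_{\max}}
  = \arg\max_{u,\,\nu}\;
    p\!\left(M_{u:u+\nu},\, \phi_{u+\nu+1:u+\tau} \mid y_{t:t+\tau}\right) \tag{6}

p\!\left(M_{u:u+\nu},\, \phi_{u+\nu+1:u+\tau} \mid y_{t:t+\tau}\right)
  = \frac{p\!\left(y_{t:t+\tau} \mid M_{u:u+\nu},\, \phi_{u+\nu+1:u+\tau}\right)}
         {\sum_{u'} \sum_{\nu'=0}^{\tau}
          p\!\left(y_{t:t+\tau} \mid M_{u':u'+\nu'},\, \phi_{u'+\nu'+1:u'+\tau}\right)} \tag{7}
```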

As shown in equations (4) and (5) above, the posterior probability p(M_{u:u+ν}, φ_{u+ν+1:u+τ}|y_{t:t+τ}) defined by equation (7) is expressed, with the input signal divided into a first-half signal and a second-half signal, using the likelihood p(y_{t:t+ν}|M_{u:u+ν}) of the first-half signal evaluated on the basis of a segment whose length corresponds to the first-half signal, and the likelihood p(y_{t+ν+1:t+τ}|φ_{u+ν+1:u+τ}) of the second-half signal evaluated on the basis of the model given by the Gaussian mixture distribution.

As in the conventional method, the maximum segment length is limited to (τ_lim + 1) frames; for example, limiting it to 30 frames gives τ_lim = 29. FIG. 5 illustrates the segment evaluation by equation (7) under this limit. As the figure shows, in this embodiment segments of every length are evaluated fairly under the common length of (τ_lim + 1) frames. Put another way, this embodiment searches for the optimal segment length (ν_max) and the segment start point (u) simultaneously.
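The fixed-length evaluation can be sketched as follows. This is a toy sketch under stated assumptions, not the patented implementation: features are one-dimensional, all (u, ν) candidates have equal priors, the window length is passed as `tau` (set to τ_lim), and every function name is hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_gmm(x, weights, means, variances):
    """Log likelihood of one frame under the whole mixture g."""
    return logsumexp([math.log(w) + log_gauss(x, mu, v)
                      for w, mu, v in zip(weights, means, variances)])


def proposed_search(y, t, indices, weights, means, variances, tau):
    """Return (u, nu, log posterior) over a fixed window of tau + 1 frames."""
    if t + tau >= len(y):
        raise ValueError("window exceeds the input signal")
    bg = [log_gmm(y[t + k], weights, means, variances) for k in range(tau + 1)]
    # tail[n] = sum of bg[n:], the mixture-model part after the split point
    tail = [0.0] * (tau + 2)
    for k in range(tau, -1, -1):
        tail[k] = tail[k + 1] + bg[k]
    scores = {}
    for u in range(len(indices)):
        head = 0.0  # running log p(y_{t:t+nu} | M_{u:u+nu})
        for nu in range(min(tau, len(indices) - 1 - u) + 1):
            m = indices[u + nu]
            head += log_gauss(y[t + nu], means[m], variances[m])
            scores[(u, nu)] = head + tail[nu + 1]
    log_den = logsumexp(list(scores.values()))
    (u, nu), top = max(scores.items(), key=lambda kv: kv[1])
    return u, nu, top - log_den
```

Because every candidate is scored over the same τ + 1 input frames, a single normalizing sum serves all segment lengths, and the start point and split point are chosen in one maximization.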

It is shown below that the posterior probability of equation (7) according to the present invention, like that of equation (2) in the conventional method, has the property that when y_{t:t+τ} and M_{u:u+τ} match relatively well, the longer the matched segment is, the higher the posterior probability becomes. To this end, the posterior probability is compared between two cases: the case where y_{t:t+τ} is divided into y_{t:t+ν} and y_{t+ν+1:t+τ}, with the former evaluated with M and the latter with g (equation (4)), and the case where y_{t:t+τ} is divided into y_{t:t+ν-1} and y_{t+ν:t+τ}, again with the former evaluated with M and the latter with g.

As is clear from equation (7), the denominator is the same in both cases, so the ratio of the two posterior probabilities equals, from equation (4), the following likelihood ratio, in which every factor other than the one for frame t+ν cancels.

Assume now that y_{t+ν} matches m_{u+ν} well. In this case, the denominator of equation (8) can be approximated by w(m_{u+ν}) g(y_{t+ν}|m_{u+ν}), so equation (8) is approximately equal to 1/w(m_{u+ν}). Since w(m_{u+ν}) is at most 1, equation (8) is at least 1. This shows that, when y_{t:t+τ} and M_{u:u+τ} match relatively well, the longer the part evaluated with the case model is, the higher the posterior probability calculated by equation (7) becomes.
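Equation (8) is not reproduced in this text. From the two splits being compared and the approximation just described, it presumably reads as follows (a reconstruction; the approximation step holds when the component m_{u+ν} dominates the mixture sum for frame t+ν):

```latex
\frac{p\!\left(y_{t:t+\tau} \mid M_{u:u+\nu},\, \phi_{u+\nu+1:u+\tau}\right)}
     {p\!\left(y_{t:t+\tau} \mid M_{u:u+\nu-1},\, \phi_{u+\nu:u+\tau}\right)}
  = \frac{g\!\left(y_{t+\nu} \mid m_{u+\nu}\right)}
         {\sum_{m=1}^{Q} w(m)\, g\!\left(y_{t+\nu} \mid m\right)} \tag{8}

  \approx \frac{g\!\left(y_{t+\nu} \mid m_{u+\nu}\right)}
               {w(m_{u+\nu})\, g\!\left(y_{t+\nu} \mid m_{u+\nu}\right)}
  = \frac{1}{w(m_{u+\nu})} \;\ge\; 1
```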

[Modifications, etc.]
The present invention can also be extended, as described in Non-Patent Document 1, to the case where case models for a plurality of noise/reverberation environments are considered and to the case where time warping is considered during matching.

The processes described for the above signal processing apparatus and method need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the apparatus executing them or as needed.

When each unit of the signal processing device is implemented by a computer, the processing content of the functions that each unit should have is described by a program. Executing this program on the computer implements each unit on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

Each processing means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

It goes without saying that other modifications may be made as appropriate without departing from the spirit of the present invention.

101 Fourier transform unit
102 Feature quantity generation unit
103 Matching unit
104 Speech enhancement filtering unit
105 Case model storage unit

Claims

A signal processing device comprising:
a case model storage unit that stores segments, each segment being a sequence of indices of the Gaussian distributions, within a Gaussian mixture distribution, that give the maximum likelihood for the feature quantity of each frame of a predetermined signal; and
a matching unit that, using the segments stored in the case model storage unit as candidates, searches for the segment that gives the maximum posterior probability for the feature quantity sequence of an input signal,
wherein, when the input signal is divided into two parts, the earlier part being a first-half signal and the later part being a second-half signal,
the posterior probability in the matching unit is expressed by a likelihood evaluated for the first-half signal based on a segment of the length corresponding to the first-half signal and a likelihood evaluated for the second-half signal based on a model given by the Gaussian mixture distribution.

The signal processing device according to claim 1, wherein
t, τ, u, ν, u′, ν′ are integers, the feature quantity sequence of the input signal corresponding to frame t through frame t+τ is y_{t:t+τ}, the segment stored in the case model storage unit corresponding to frame u through frame u+ν is M_{u:u+ν}, the Gaussian mixture distribution corresponding to frame u+ν+1 through frame u+τ is φ_{u+ν+1:u+τ}, and the probability of y_{t:t+τ} given M_{u:u+ν} and φ_{u+ν+1:u+τ} is p(y_{t:t+τ} | M_{u:u+ν}, φ_{u+ν+1:u+τ}), and
the posterior probability is p(M_{u:u+ν}, φ_{u+ν+1:u+τ} | y_{t:t+τ}) defined below.

Signal processing device.

A signal processing method in which a case model storage unit stores segments, each segment being a sequence of indices of the Gaussian distributions, within a Gaussian mixture distribution, that give the maximum likelihood for the feature quantity of each frame of a predetermined signal, the method comprising:
a matching step of searching, using the segments stored in the case model storage unit as candidates, for the segment that gives the maximum posterior probability for the feature quantity sequence of an input signal,
wherein, when the input signal is divided into two parts, the earlier part being a first-half signal and the later part being a second-half signal,
the posterior probability in the matching step is expressed by a likelihood evaluated for the first-half signal based on a segment of the length corresponding to the first-half signal and a likelihood evaluated for the second-half signal based on a model given by the Gaussian mixture distribution.

A program for causing a computer to function as each unit of the signal processing device according to claim 1 or 2.