JP6448567B2

JP6448567B2 - Acoustic signal analyzing apparatus, acoustic signal analyzing method, and program

Info

Publication number: JP6448567B2
Application number: JP2016031801A
Authority: JP
Inventors: Hirokazu Kameoka (亀岡弘和); Naoki Murata (村田直毅)
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC; NTT Inc USA
Current assignee: University of Tokyo NUC; NTT Inc; NTT Inc USA
Priority date: 2016-02-23
Filing date: 2016-02-23
Publication date: 2019-01-09
Anticipated expiration: 2036-02-23
Also published as: JP2017152825A

Description

The present invention relates to an acoustic signal analysis device, an acoustic signal analysis method, and a program, and in particular to a device, method, and program for performing dereverberation and sound source separation using acoustic signals acquired by a plurality of microphones.

A framework that processes multichannel signals acquired by a plurality of microphones and performs tasks such as sound source separation, using the spatial information of the sound sources as a cue, is called microphone array signal processing.

In recent years, ad hoc microphone arrays, which use multichannel recordings made by everyday recording devices such as voice recorders, notebook computers, smartphones, and video cameras, have been actively studied in the field of microphone array signal processing. Ad hoc microphone arrays are attracting attention because they allow a microphone array system to be built more easily and cheaply than conventional microphone arrays, which require special equipment and wiring.

In many commercially available conventional microphone arrays, the microphones are concentrated in a small area, so the time difference of the audio signal between recording channels serves as the cue for sound source separation. With an ad hoc microphone array, by contrast, the microphones can easily be distributed over a wider area than with a conventional array, so the intensity ratio of the audio signal between channels also serves as a cue, in addition to the time difference.

In general, if the process by which reverberation and noise are superimposed on a speech signal to yield the observed signal is regarded as a forward problem, then separating and extracting only the target speech from the observed signals picked up by an ad hoc microphone array can be regarded as an inverse problem. When the noise or the room transfer system is unknown and the condition is underdetermined, that is, there are more sound sources than microphones, this inverse problem can have infinitely many solutions, so some assumption is needed to narrow them down.

Meanwhile, approaches based on a multichannel extension of non-negative matrix factorization (NMF) have recently attracted attention as sound source separation methods for underdetermined conditions (Non-Patent Documents 1 and 2).

NMF decomposes a non-negative matrix into the product of two non-negative matrices, a basis matrix and a coefficient matrix. Applying NMF to a spectrogram regarded as a non-negative matrix amounts to approximating the spectrogram by a low-rank non-negative matrix; that is, the spectrum at each time is explained as a non-negative combination of as many spectral templates as the basis matrix has columns. The multichannel extension of NMF is a multichannel sound source separation method that assumes this structure for the power spectrogram of each sound source. A multichannel extension of NMF for the overdetermined condition, in which there are more microphones than sound sources, has also been proposed (Non-Patent Document 3).
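As an illustration of the single-channel NMF described above, the following minimal sketch factorizes a toy non-negative matrix with the classic multiplicative updates for the squared Euclidean error. The function name `nmf`, the toy data, and the choice of error measure are assumptions made for this example; they are not taken from the patent, which works with a different divergence.

```python
import numpy as np

def nmf(V, n_bases, n_iter=200, eps=1e-12, seed=0):
    """Approximate a non-negative matrix V (K x L) by W @ H.

    Uses the classic multiplicative updates for the squared Euclidean error.
    W (K x n_bases) holds spectral templates, H (n_bases x L) their gains.
    """
    rng = np.random.default_rng(seed)
    K, L = V.shape
    W = rng.random((K, n_bases)) + eps
    H = rng.random((n_bases, L)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": non-negative and exactly rank 2.
rng = np.random.default_rng(1)
V = rng.random((6, 2)) @ rng.random((2, 40))
W, H = nmf(V, n_bases=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.3e}")
```

Each column of `W` plays the role of one spectral template, and each column of `W @ H` is a non-negative combination of those templates.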

Non-Patent Document 1: A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550-563, 2010.

Non-Patent Document 2: A. Ozerov, C. Févotte, R. Blouet, and J.-L. Durrieu, "Multichannel nonnegative tensor factorization with structured constraints for user-guided audio source separation," in Proc. 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), pp. 257-260, 2011.

Non-Patent Document 3: Hirokazu Kameoka, Takuya Yoshioka, Mariko Hamamura, Jonathan Le Roux, and Kunio Kashino, "Statistical model of speech signals based on composite autoregressive system with application to blind source separation," in Proc. 9th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2010), LNCS 6365, pp. 245-253, Sep. 2010.

Non-Patent Document 4: T. Yoshioka, Tomohiro Nakatani, Masato Miyoshi, and Hiroshi G. Okuno, "Blind separation and dereverberation of speech mixtures by joint optimization," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 1, pp. 69-84, 2011.

In the conventional approaches described above, constraints such as time invariance are imposed on the room transfer system, and the inverse problem is formulated under those constraints. The ad hoc microphone array framework has the advantage that an array system can be built easily, but because the position of each microphone is not fixed, it has the weakness that the relative positions of the sound sources and the microphones tend to change during recording. When they do, the above assumption about the room transfer system no longer holds, and algorithms designed under that assumption can no longer achieve high sound source separation performance.

The present invention has been made in view of the above circumstances. Its object is to provide an acoustic signal analysis device, method, and program that can accurately separate each sound source signal from observed signals in which the sounds of a plurality of sound sources are superimposed, even in a time-varying reverberant environment in which the relative positions of the sound sources and the microphones change.

To achieve the above object, an acoustic signal analysis device according to the present invention includes: a time-frequency analysis unit that receives time-series data of acoustic signals picked up by each of J microphones j and outputs an observed-signal time-frequency component y_{j,k,l} for each frequency k at each time l; an initial value setting unit that sets an initial value for each of (a) the amplitude component A_{j,i,k,n} of a time-varying steering vector representing the transfer characteristic by which sound from a sound source i is picked up at the microphone j with a delay of time n, (b) the non-negative basis spectrum W_{i,k,m,τ} of each basis m and each frequency k at each time τ corresponding to a spectrum segment formed by concatenating the spectra of a plurality of frames, and (c) the non-negative basis onset H_{i,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segment; a parameter update unit that updates the basis spectrum W_{i,k,m,τ}, the basis onset H_{i,m,τ}, and the amplitude component A_{j,i,k,n} so as to reduce, over all combinations of (j,k,l), the distance between the observed-signal time-frequency component y_{j,k,l} and the power spectrogram model X_{j,k,l} of the microphone j, which is calculated from the basis spectrum W_{i,k,m,τ}, the basis onset H_{i,m,τ}, and the amplitude component A_{j,i,k,n} for each sound source i and each time n; and an end determination unit that repeats the update by the parameter update unit until a predetermined end condition is satisfied.

An acoustic signal analysis method according to the present invention is a method performed in an acoustic signal analysis device that includes a time-frequency analysis unit, an initial value setting unit, a parameter update unit, and an end determination unit. In the method, the time-frequency analysis unit receives time-series data of acoustic signals picked up by each of J microphones j and outputs an observed-signal time-frequency component y_{j,k,l} for each frequency k at each time l. The initial value setting unit sets an initial value for each of (a) the amplitude component A_{j,i,k,n} of a time-varying steering vector representing the transfer characteristic by which sound from a sound source i is picked up at the microphone j with a delay of time n, (b) the non-negative basis spectrum W_{i,k,m,τ} of each basis m and each frequency k at each time τ corresponding to a spectrum segment formed by concatenating the spectra of a plurality of frames, and (c) the non-negative basis onset H_{i,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segment. The parameter update unit updates the basis spectrum W_{i,k,m,τ}, the basis onset H_{i,m,τ}, and the amplitude component A_{j,i,k,n} so as to reduce, over all combinations of (j,k,l), the distance between the observed-signal time-frequency component y_{j,k,l} and the power spectrogram model X_{j,k,l} of the microphone j, which is calculated from the basis spectrum W_{i,k,m,τ}, the basis onset H_{i,m,τ}, and the amplitude component A_{j,i,k,n} for each sound source i and each time n. The end determination unit repeats the update by the parameter update unit until a predetermined end condition is satisfied.

A program according to the present invention is a program for causing a computer to function as each unit of the acoustic signal analysis device described above.

As described above, the acoustic signal analysis device, method, and program of the present invention repeatedly update the basis spectrum W_{i,k,m,τ}, the basis onset H_{i,m,τ}, and the amplitude component A_{j,i,k,n} so as to reduce the distance between the observed-signal time-frequency component y_{j,k,l} and the power spectrogram model X_{j,k,l} of the microphone j. The model X_{j,k,l} is calculated from the non-negative basis spectrum W_{i,k,m,τ} of each basis m and each frequency k at each time τ corresponding to a spectrum segment, the non-negative basis onset H_{i,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segment, and the amplitude component A_{j,i,k,n} of the time-varying steering vector for each sound source i and each time n. This provides the effect that each sound source signal can be accurately separated from observed signals in which the sounds of a plurality of sound sources are superimposed.

FIG. 1 is a schematic diagram showing the configuration of an acoustic signal analysis device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the acoustic signal analysis processing routine in the acoustic signal analysis device according to the embodiment of the present invention.

FIG. 3 is a diagram showing the room environment in which the evaluation experiment of the acoustic signal analysis device was carried out.

FIG. 4 is a graph showing an example of the change in Source to Distortion Ratio (SDR) with respect to the wall reflection coefficient in the evaluation experiment of the acoustic signal analysis device.

FIG. 5 is a graph showing an example of the change in SDR with respect to disturbance of the transfer system in the evaluation experiment of the acoustic signal analysis device.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(Principle of the invention)

First, the model proposed in the present invention is described.

(Convolutive mixture model in the time-frequency domain)

If the transfer system from the sound sources to the microphone array is linear and time-invariant, and the reverberation can be assumed to fit within the window length of the time-frequency analysis, the signals obtained by the microphone array can be described as an instantaneous mixture of the sound sources.

On the other hand, when reverberation components longer than the window length cannot be ignored, the signal observed at the microphones is expressed by equation (1) using a convolutive mixture model in the time-frequency domain, as shown, for example, in Non-Patent Document 4.

Here, the variables i and k are the sound source index and the frequency index, respectively, and l and n are time-frame indices. y^_{k,l} ∈ C^J is the observed signal vector at the microphone array, j is the microphone index, and J is the number of microphones. a^_{i,k,n} is the steering vector from each sound source to the microphones and corresponds to the reflected component that arrives with a delay of n frames. s_{i,k,l} is the complex spectrogram of each sound source in the time-frequency domain. Hereinafter, "^" is appended to variables that represent vectors, matrices, or random variables.
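Equation (1) is not reproduced in this text. A form consistent with the symbol definitions above and with the usual time-frequency-domain convolutive mixture model would be the following, where a hat stands for the "^" notation; the summation ranges (all sources i, all frame delays n) are assumptions.

```latex
\hat{y}_{k,l} \;=\; \sum_{i} \sum_{n} \hat{a}_{i,k,n}\, s_{i,k,l-n}
```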

Now suppose that the positions of the sound sources or the microphones change over time. The steering vector a^_{i,k,n} then depends on the time l, and the mixing process of equation (1) is expressed with a time-varying steering vector a^_{i,k,n,l}, as in equation (2).
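The mixing process with a time-varying steering vector can be sketched numerically as follows. The sketch assumes the form y[j,k,l] = Σ_i Σ_n a[j,i,k,n,l] · s[i,k,l−n], with terms for which l−n < 0 dropped; the function name `mix_time_varying`, the array layout, and the random test data are hypothetical choices for this example, since equation (2) itself is not reproduced in this text.

```python
import numpy as np

def mix_time_varying(a, s):
    """y[j,k,l] = sum_i sum_n a[j,i,k,n,l] * s[i,k,l-n] (terms with l-n < 0 dropped).

    a: complex array (J, I, K, N, L), time-varying steering vectors
    s: complex array (I, K, L), source spectrograms
    """
    J, I, K, N, L = a.shape
    y = np.zeros((J, K, L), dtype=complex)
    for n in range(N):
        # Source spectrograms delayed by n frames, zero-filled at the start.
        s_del = np.zeros_like(s)
        s_del[:, :, n:] = s[:, :, :L - n]
        y += np.einsum('jikl,ikl->jkl', a[:, :, :, n, :], s_del)
    return y

rng = np.random.default_rng(0)
J, I, K, N, L = 2, 3, 4, 3, 10
s = rng.standard_normal((I, K, L)) + 1j * rng.standard_normal((I, K, L))
a0 = rng.standard_normal((J, I, K, N)) + 1j * rng.standard_normal((J, I, K, N))

# A steering vector that does not change with l recovers the time-invariant case.
a = np.repeat(a0[..., None], L, axis=-1)
y = mix_time_varying(a, s)
print(y.shape)
```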

Here, if the complex spectrogram of each sound source i is assumed to follow a complex Gaussian distribution, that is, s_{i,k,l} ~ N_C(0, P_{i,k,l}), the observed signal vector y^_{k,l} at the microphones follows the distribution given by equation (3). P_{i,k,l} is the power spectrogram of sound source i at time l, and a^^H_{i,k,n,l} is the Hermitian transpose of the time-varying steering vector a^_{i,k,n,l}.
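Equation (3) is not reproduced in this text. If, in addition to the Gaussian assumption above, the source components s_{i,k,l} are taken to be mutually independent across sources i and across time frames l (an assumption of this sketch), the observed vector is a sum of independent zero-mean Gaussian vectors, and a distribution consistent with the definitions would be:

```latex
\hat{y}_{k,l} \;\sim\; \mathcal{N}_{\mathbb{C}}\!\left(0,\;
  \sum_{i} \sum_{n} P_{i,k,l-n}\,
  \hat{a}_{i,k,n,l}\, \hat{a}^{\mathsf{H}}_{i,k,n,l}\right)
```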

Equation (4) expresses the time-varying steering vector a^_{i,k,n,l} decomposed into its absolute-value and argument (phase) elements.

For minor changes in the acoustic environment, such as small movements of a sound source or a microphone, a special time-varying system is introduced in which the amplitude component of the time-varying steering vector a^_{i,k,n,l} is assumed to be time-invariant and its phase component is assumed to be time-varying. Such a mixing process is referred to as a "semi-time-varying system". That is, |a^_{j,i,k,n,l}| does not depend on the time l. Therefore, writing |a^_{j,i,k,n,l}| = A_{j,i,k,n}, equation (4) can be expressed as equation (5).

Substituting equation (5) into equation (3) yields equation (6).

(Non-negative tensor double convolution model)

In an ad hoc microphone array, slight mismatches in the sampling frequency of the audio signals, caused by the array elements being asynchronous, and slight changes in the positions of the sound sources or the microphones are more likely to occur than in an ordinary microphone array. The mixing process therefore needs to be treated as a semi-time-varying system.

One solution when dealing with a semi-time-varying mixing process is to estimate the temporal variation of the time-varying steering vector a^_{i,k,n,l} online, compensate for it, and then apply known array signal processing.

Alternatively, the phase component of the time-varying steering vector a^_{i,k,n,l} can be treated as a random variable that fluctuates stochastically. Here, the latter approach is taken, and the following two conditions are imposed on the phase component of the time-varying steering vector a^_{i,k,n,l}.

(Condition 1) φ_{j,i,k,n,l} and φ_{j',i,k,n,l'} (j ≠ j' or l ≠ l') are mutually independent.

(Condition 2) φ_{j,i,k,n,l} is uniformly distributed on the interval [0, 2π).

Marginalizing over the phase components φ_{j,i,k,n,l} under (Condition 1) and (Condition 2) makes E[ψ^_{i,k,n,l} ψ^^H_{i,k,n,l}] the identity matrix, so equation (6) is expressed as equation (7).
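The identity-matrix claim can be checked numerically. The sketch below assumes that ψ^ is the vector of unit-modulus phase factors exp(jφ_j) over the J microphones (the definition in equation (4) is not reproduced in this text) and estimates E[ψ^ ψ^^H] by Monte Carlo sampling. The off-diagonal entries average to zero because the phases are independent and uniform, while each diagonal entry is |exp(jφ_j)|² = 1.

```python
import numpy as np

rng = np.random.default_rng(0)
J, trials = 4, 200_000

# Condition 1 and Condition 2: independent phases, uniform on [0, 2*pi).
phi = rng.uniform(0.0, 2.0 * np.pi, size=(trials, J))
psi = np.exp(1j * phi)  # one unit-modulus phase vector per trial

# Sample average of psi psi^H over all trials.
R = np.einsum('tj,tm->jm', psi, psi.conj()) / trials
max_dev = np.abs(R - np.eye(J)).max()
print(f"largest deviation from the identity matrix: {max_dev:.4f}")
```

With the expectation equal to the identity, the covariance of the observed vector becomes diagonal, so each microphone channel can be modeled by its own non-negative variance.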

Next, the power spectrogram P_{i,k,l} of sound source i is considered.

Let K be the maximum value of the frequency index k and L the maximum value of the time-frame index l. In the multichannel extension of NMF, the power spectrogram P^_i = (P_{i,k,l})_{K×L}, which has K rows and L columns, is expressed as the product of two non-negative matrices. This rests on the assumption that the power spectrogram P_{i,k,l} of sound source i at time l is a non-negative combination of a limited number of spectral templates.

However, speech is considered to be strongly characterized not only by the spectrum in a single time frame l but also by its dynamics, that is, its local pattern of change over time. Accordingly, rather than treating the spectrum in each time frame l as the element that makes up speech, treating a concatenation of spectra over a plurality of time frames l as the unit that makes up speech can be regarded as a better representation of what characterizes speech.

In the present embodiment, the power spectrogram P_{i,k,l} is therefore modeled by a mixture model that convolves templates of spectrogram segments, each formed by concatenating spectra over a plurality of time frames l, with activation sequences. Specifically, the known idea of non-negative matrix factor deconvolution (NMFD) is applied to the modeling of the power spectrogram P_{i,k,l}. In this case, the power spectrogram P_{i,k,l} of sound source i at time l is expressed by equation (8).

Here, W_{i,k,m,τ} is the basis spectrum of the power spectrogram P_{i,k,l} of sound source i, and H_{i,m,l} is the basis onset. The variable m is the basis index, and the variable τ is a time-frame index. When the set of values taken by τ is {0} (here {x} denotes a set that contains x as an element), the model coincides with the NMF-based mixture model.
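The NMFD-style model can be sketched for a single source as follows. The sketch assumes the form P[k,l] = Σ_m Σ_τ W[k,m,τ] · H[m,l−τ], which is the standard NMFD expression consistent with the definitions above; equation (8) itself is not reproduced in this text, and the function name `nmfd_power` and the toy sizes are hypothetical.

```python
import numpy as np

def nmfd_power(W, H):
    """P[k,l] = sum_m sum_tau W[k,m,tau] * H[m,l-tau] for one source.

    W: (K, M, T) non-negative spectrogram-segment templates
    H: (M, L)    non-negative onset (activation) sequences
    """
    K, M, T = W.shape
    L = H.shape[1]
    P = np.zeros((K, L))
    for tau in range(T):
        H_shift = np.zeros_like(H)
        H_shift[:, tau:] = H[:, :L - tau]  # H delayed by tau frames
        P += W[:, :, tau] @ H_shift
    return P

rng = np.random.default_rng(0)
K, M, T, L = 5, 3, 4, 12

# A single onset of basis 0 at frame 2 pastes that basis's whole segment into P.
W = rng.random((K, M, T))
H_impulse = np.zeros((M, L))
H_impulse[0, 2] = 1.0
P_impulse = nmfd_power(W, H_impulse)

# With a single lag (tau in {0}) the model is ordinary NMF.
W_nmf = rng.random((K, M, 1))
H = rng.random((M, L))
P_nmf = nmfd_power(W_nmf, H)
```

The impulse case shows why H is called an onset: one non-zero entry triggers a template that spans T consecutive frames.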

(Parameter estimation)

Next, parameter estimation by maximum likelihood is described. First, the negative of the log-likelihood function of equation (7) is defined as C_ML, as in equation (9). Here, Y^ is the random variable corresponding to the observed signal y_{j,k,l}, P^ is the random variable corresponding to the power spectrogram P_{i,k,l}, and A^ is the random variable corresponding to the absolute value of the time-varying steering vector a^_{i,k,n,l}. The observed signal y_{j,k,l} is an example of the observed-signal time-frequency component.

Evaluating C_ML explicitly from equation (9) yields equation (10).

Here, the symbol =^c denotes equality up to constant terms.

In other words, parameter estimation by maximum likelihood reduces to minimizing the Itakura-Saito distance between the observed power spectrogram Y_{j,k,l} of microphone j and the power spectrogram model X_{j,k,l} of microphone j obtained when equation (8) is used for the power spectrogram P_{i,k,l} of sound source i. The problem is thus to optimize the objective function of equation (11) under the constraints A_{j,i,k,n} ≥ 0, W_{i,k,m,τ} ≥ 0, and H_{i,m,τ} ≥ 0.
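Equations (10) and (11) are not reproduced in this text. Assuming that each y_{j,k,l} is zero-mean complex Gaussian with variance X_{j,k,l} and that Y_{j,k,l} = |y_{j,k,l}|^2 (both assumptions of this sketch), the negative log-likelihood and the Itakura-Saito distance take the following standard forms. They differ only by the term Σ(log Y_{j,k,l} + 1), which does not depend on the parameters, and this is why minimizing one is equivalent to minimizing the other.

```latex
C_{\mathrm{ML}} \;\overset{c}{=}\; \sum_{j,k,l}
  \left( \frac{Y_{j,k,l}}{X_{j,k,l}} + \log X_{j,k,l} \right),
\qquad
D_{\mathrm{IS}}(Y \,\|\, X) \;=\; \sum_{j,k,l}
  \left( \frac{Y_{j,k,l}}{X_{j,k,l}} - \log \frac{Y_{j,k,l}}{X_{j,k,l}} - 1 \right)
```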

Equation (11) is therefore optimized on the basis of the known auxiliary function method. Here, the optimization algorithm is derived for the β-divergence, a more general discrepancy measure that includes the Itakura-Saito distance as a special case.
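For reference, the β-divergence as it is commonly defined in the NMF literature can be sketched as below. The definition used here is the standard one, with the Itakura-Saito distance at β = 0, the generalized Kullback-Leibler divergence at β = 1, and half the squared Euclidean distance at β = 2. The function name `beta_divergence` is an assumption of this example, and the normalization in the patent may differ.

```python
import numpy as np

def beta_divergence(Y, X, beta):
    """Sum of element-wise beta-divergences d_beta(Y | X) for positive Y and X."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    if beta == 0:  # Itakura-Saito distance
        R = Y / X
        return float(np.sum(R - np.log(R) - 1.0))
    if beta == 1:  # generalized Kullback-Leibler divergence
        return float(np.sum(Y * np.log(Y / X) - Y + X))
    num = Y**beta + (beta - 1.0) * X**beta - beta * Y * X**(beta - 1.0)
    return float(np.sum(num / (beta * (beta - 1.0))))

rng = np.random.default_rng(0)
Y = rng.random((4, 6)) + 0.5
X = rng.random((4, 6)) + 0.5
d_is = beta_divergence(Y, X, 0)
d_kl = beta_divergence(Y, X, 1)
d_eu = beta_divergence(Y, X, 2)
```

The Itakura-Saito case is scale-invariant: multiplying both Y and X by the same constant leaves it unchanged, which suits spectrograms with a large dynamic range.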

The detailed derivation of the optimization algorithm is omitted, but designing an auxiliary function for the objective function of equation (11) using Jensen's inequality and the tangent-line inequality yields the multiplicative update rules of equations (13) to (15).

Here, ρ(β) is a value set according to the value of β and is defined by equation (16).
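Equation (16) is not reproduced in this text. In the published literature on auxiliary-function (majorization-minimization) algorithms for NMF with the β-divergence, the exponent applied to the multiplicative factor commonly takes the following form; whether equation (16) is identical to it is an assumption.

```latex
\rho(\beta) =
\begin{cases}
  1/(2-\beta) & (\beta < 1) \\
  1           & (1 \le \beta \le 2) \\
  1/(\beta-1) & (\beta > 2)
\end{cases}
```

Under this form, the Itakura-Saito case β = 0 gives ρ = 1/2.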

(System configuration)

Next, an embodiment of the present invention is described, taking as an example an acoustic signal analysis device that analyzes observed signals of a plurality of sound sources acquired with an ad hoc microphone array, on which disturbances to the transfer system and reverberation components are superimposed, and separates them into the sound source signal of each of the sound sources.

As shown in FIG. 1, the acoustic signal analysis device according to the embodiment of the present invention is implemented on a computer that includes a CPU, a RAM, and a ROM storing a program for executing the acoustic signal analysis processing routine described later. Functionally, it is configured as follows.

The acoustic signal analysis device 100 includes an input unit 10, a calculation unit 20, a storage unit 30, and an output unit 40.

The input unit 10 receives time-series data of observed signals y_j[l] that contain a plurality of sound sources i and on which disturbances to the transfer system and reverberation components are superimposed. The storage unit 30 stores the time-series data of the observed signals y_j[l] received by the input unit 10. The storage unit 30 also stores the results of each process described later and the initial values of the parameters used in the processing routine.

The calculation unit 20 includes a time-frequency analysis unit 21, an initial setting unit 22, a parameter update unit 23, an end determination unit 24, and a signal conversion unit 25.

The time-frequency analysis unit 21 receives, for example, the observed signal y_j[l] observed as the time-series signal of microphone j, calculates the observed power spectrogram Y_{j,k,l} of microphone j, and stores it in the storage unit 30. More specifically, the time-frequency analysis unit 21 takes the time-series data of the signal observed at microphone j as input and performs time-frequency analysis using, for example, the short-time Fourier transform (STFT) to calculate the observed power spectrogram Y_{j,k,l} of microphone j.
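A minimal sketch of this time-frequency analysis step for one microphone is shown below. The frame length, hop size, Hann window, sampling rate, and the function name `power_spectrogram` are assumptions chosen for the example; the patent does not specify them in this passage.

```python
import numpy as np

def power_spectrogram(x, frame_len=512, hop=256):
    """Return Y[k, l] = |STFT(x)[k, l]|**2 using a Hann window.

    x must contain at least frame_len samples.
    """
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[l * hop : l * hop + frame_len] * window for l in range(n_frames)],
        axis=1,
    )  # shape (frame_len, L)
    return np.abs(np.fft.rfft(frames, axis=0)) ** 2  # shape (K, L)

# One second of a 1 kHz tone sampled at 16 kHz stands in for one microphone signal.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2.0 * np.pi * 1000.0 * t)
Y = power_spectrogram(x)
peak_bin = int(Y[:, 0].argmax())
print(Y.shape, peak_bin)
```

Applying the same function to each of the J channels and stacking the results gives an array indexed as Y[j, k, l].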

The initial setting unit 22 sets the initial values of the parameters A_{j,i,k,n}, W_{i,k,m,τ}, and H_{i,m,τ} used in the processing described later. The initial values may be set to suitable values using, for example, random numbers, provided that the initial values of A_{j,i,k,n}, W_{i,k,m,τ}, and H_{i,m,τ} are all non-negative.
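One way to realize such a random non-negative initialization is sketched below. The array shapes follow the index order of the symbols in the text; the function name `init_params`, the small positive offset, and the example sizes are assumptions of this sketch. The offset keeps every entry strictly positive, which matters for multiplicative updates because an entry that starts at zero stays at zero.

```python
import numpy as np

def init_params(J, I, K, N, M, T, L, seed=0, floor=1e-3):
    """Draw strictly positive random initial values for A, W, and H."""
    rng = np.random.default_rng(seed)
    A = rng.random((J, I, K, N)) + floor  # A[j, i, k, n]
    W = rng.random((I, K, M, T)) + floor  # W[i, k, m, tau]
    H = rng.random((I, M, L)) + floor     # H[i, m, tau]
    return A, W, H

A, W, H = init_params(J=2, I=3, K=257, N=4, M=5, T=3, L=61)
```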

For every combination of (i, k, l), the parameter update unit 23 calculates the power spectrogram P_{i,k,l} of sound source i according to equation (8) above, based on W_{i,k,m,τ} and H_{i,m,τ} stored in the storage unit 30, and stores the result in the storage unit 30.

For every combination of (j, k, l), the parameter update unit 23 also calculates the power spectrogram model X_{j,k,l} of microphone j according to equation (12) above, based on P_{i,k,l} and A_{j,i,k,n} stored in the storage unit 30, and stores the result in the storage unit 30.
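
The two model computations can be sketched as follows. Equations (8) and (12) are not reproduced in this passage, so the code assumes the usual convolutive forms, P[i,k,l] = Σ_m Σ_τ W[i,k,m,τ] H[i,m,l−τ] and X[j,k,l] = Σ_i Σ_n A[j,i,k,n] P[i,k,l−n], with terms of negative time index dropped. The index convention and the function names are assumptions and should be checked against the equations themselves.

```python
import numpy as np

def source_power(W, H):
    """Assumed form of eq. (8): P[i,k,l] = sum_m sum_t W[i,k,m,t] * H[i,m,l-t]."""
    I, K, M, T = W.shape
    L = H.shape[2]
    P = np.zeros((I, K, L))
    for t in range(min(T, L)):
        # Segment frame t contributes the activations delayed by t frames.
        P[:, :, t:] += np.einsum("ikm,iml->ikl", W[:, :, :, t], H[:, :, : L - t])
    return P

def mic_model(A, P):
    """Assumed form of eq. (12): X[j,k,l] = sum_i sum_n A[j,i,k,n] * P[i,k,l-n]."""
    J, I, K, N = A.shape
    L = P.shape[2]
    X = np.zeros((J, K, L))
    for n in range(min(N, L)):
        # Delay n contributes the source power delayed by n frames.
        X[:, :, n:] += np.einsum("jik,ikl->jkl", A[:, :, :, n], P[:, :, : L - n])
    return X
```

With a single source, a single basis, W = (1, 2) along τ, and an onset only in the first frame, `source_power` returns (1, 2, 0): the two-frame template is placed at the onset position.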

For every combination of (j, i, k, n), the parameter update unit 23 also updates the amplitude component A_{j,i,k,n} of the time-varying steering vector according to equation (13), so as to reduce the objective function of equation (11) above, based on Y_{j,k,l}, X_{j,k,l}, A_{j,i,k,n}, and P_{i,k,l} stored in the storage unit 30, and stores the result in the storage unit 30. The parameter update unit 23 then uses the updated A_{j,i,k,n} to update the power spectrogram model X_{j,k,l} of microphone j according to equation (12) above, and stores it in the storage unit 30.
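
Equation (13) is not reproduced in this passage. Purely as an illustration of an update of this kind, the sketch below applies the standard multiplicative update for the generalized KL divergence to A, assuming the model X[j,k,l] = Σ_i Σ_n A[j,i,k,n] P[i,k,l−n] and assuming that the number of delays N does not exceed the number of frames L. It is a substituted textbook rule, not necessarily the rule of equation (13), and the names `mic_model` and `update_A` are hypothetical.

```python
import numpy as np

def mic_model(A, P):
    """Assumed model: X[j,k,l] = sum_i sum_n A[j,i,k,n] * P[i,k,l-n]."""
    L = P.shape[2]
    X = np.zeros((A.shape[0], A.shape[2], L))
    for n in range(A.shape[3]):
        X[:, :, n:] += np.einsum("jik,ikl->jkl", A[..., n], P[:, :, : L - n])
    return X

def update_A(A, P, Y, eps=1e-12):
    """One standard KL multiplicative update of A (assumes N <= L)."""
    ratio = Y / (mic_model(A, P) + eps)       # Y / X, element-wise
    L = P.shape[2]
    A_new = np.empty_like(A)
    for n in range(A.shape[3]):
        P_delayed = P[:, :, : L - n]          # P[i,k,l-n] for l = n .. L-1
        numer = np.einsum("jkl,ikl->jik", ratio[:, :, n:], P_delayed)
        denom = P_delayed.sum(axis=2)[None, :, :]   # broadcast over microphones
        A_new[..., n] = A[..., n] * numer / (denom + eps)
    return A_new
```

Because the update multiplies each entry by a ratio of non-negative quantities, non-negativity of A is preserved without any explicit projection.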

For every combination of (i, k, m, τ), the parameter update unit 23 also updates the basis spectrum W_{i,k,m,τ} of the power spectrogram P_{i,k,l} of sound source i according to equation (14), so as to reduce the objective function of equation (11) above, based on Y_{j,k,l}, X_{j,k,l}, A_{j,i,k,n}, W_{i,k,m,τ}, and H_{i,m,τ} stored in the storage unit 30, and stores the result in the storage unit 30. The parameter update unit 23 then uses the updated W_{i,k,m,τ} to update the power spectrogram P_{i,k,l} of sound source i and the power spectrogram model X_{j,k,l} of microphone j according to equations (8) and (12) above, and stores them in the storage unit 30.

Furthermore, for every combination of (i, m, τ), the parameter update unit 23 updates the basis onset H_{i,m,τ} of the power spectrogram P_{i,k,l} of sound source i according to equation (15), so as to reduce the objective function of equation (11) above, based on Y_{j,k,l}, X_{j,k,l}, A_{j,i,k,n}, H_{i,m,τ}, and W_{i,k,m,τ} stored in the storage unit 30, and stores the result in the storage unit 30. The parameter update unit 23 then uses the updated H_{i,m,τ} to update the power spectrogram P_{i,k,l} of sound source i and the power spectrogram model X_{j,k,l} of microphone j according to equations (8) and (12) above, and stores them in the storage unit 30.

The end determination unit 24 determines whether a predetermined end condition is satisfied. If the end condition is not satisfied, the processes of the parameter update unit 23 are repeated. If the end determination unit 24 determines that the end condition is satisfied, processing proceeds to the signal conversion unit 25.

Based on the basis spectra W_{i,k,m,τ} and the basis onsets H_{i,m,τ} of the power spectrogram P_{i,k,l} of sound source i stored in the storage unit 30, the signal conversion unit 25 generates, for each of the plurality of sound sources i, the source signal of that sound source and outputs it to the output unit 40. The output unit 40 outputs the source signal of each of the plurality of sound sources i.

As the end condition, it suffices to require that the difference between the value of the objective function of equation (11) at the (L-1)-th iteration and its value at the L-th iteration has become smaller than a predetermined threshold. Alternatively, the end condition may be that the number of iterations has reached a predetermined upper limit.
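
The two end conditions can be combined in a single loop, as in the following minimal sketch. Here `step` is a hypothetical callable that performs one full round of parameter updates and returns the current value of the objective function; the default threshold and iteration limit are placeholders, not values taken from this description.

```python
def run_until_converged(step, threshold=1e-4, max_iters=200):
    """Repeat `step` until the objective changes by less than `threshold`
    between consecutive iterations, or until `max_iters` is reached.

    Returns (number of iterations performed, last objective value).
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        current = step()
        if previous is not None and abs(previous - current) < threshold:
            return iteration, current
        previous = current
    return max_iters, previous

def make_halving_step(start=16.0):
    """Toy stand-in for one update round: the objective halves every call."""
    state = {"value": start}
    def step():
        state["value"] /= 2.0
        return state["value"]
    return step

iterations, final_value = run_until_converged(make_halving_step(), threshold=0.3)
```

The first iteration never terminates the loop, since there is no earlier objective value to compare against.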

(Operation of the acoustic signal analysis apparatus)

Next, the operation of the acoustic signal analysis apparatus 100 according to the present embodiment will be described. Time-series data of the observation signals y_j[l], acquired by an ad hoc microphone array consisting of J microphones j and containing a plurality of sound sources i on which disturbances to the transfer system and reverberation components are superimposed, is input to the acoustic signal analysis apparatus 100 and stored in the storage unit 30. The acoustic signal analysis apparatus 100 then executes the acoustic signal analysis processing routine shown in FIG. 2.

First, in step S100, the observation signal y_j[l] is read from the storage unit 30 and subjected to time-frequency analysis using the short-time Fourier transform. The observed power spectrogram Y_{j,k,l} of microphone j is calculated and stored in the storage unit 30.

Then, in step S102, initial values of A_{j,i,k,n}, W_{i,k,m,τ}, and H_{i,m,τ} are set using random numbers and stored in the storage unit 30.

Next, in step S104, the power spectrogram P_{i,k,l} of sound source i is calculated for each combination of (i, k, l) according to equation (8) above, based on W_{i,k,m,τ} and H_{i,m,τ} set in step S102, and stored in the storage unit 30. Also in step S104, the power spectrogram model X_{j,k,l} of microphone j is calculated for each combination of (j, k, l) according to equation (12) above, based on A_{j,i,k,n} set in step S102 and P_{i,k,l} calculated in this step, and stored in the storage unit 30.

In step S106, the amplitude component A_{j,i,k,n} of the time-varying steering vector is updated for each combination of (j, i, k, n) according to equation (13) above, based on Y_{j,k,l} calculated in step S100, A_{j,i,k,n} set in step S102, and P_{i,k,l} and X_{j,k,l} calculated in step S104, and stored in the storage unit 30.

In step S108, the power spectrogram model X_{j,k,l} of microphone j is updated for each combination of (j, k, l) according to equation (12), based on A_{j,i,k,n} updated in step S106 and P_{i,k,l} calculated in step S104, and stored in the storage unit 30.

In step S110, the basis spectrum W_{i,k,m,τ} of the power spectrogram P_{i,k,l} of sound source i is updated for each combination of (i, k, m, τ) according to equation (14), based on Y_{j,k,l} calculated in step S100, X_{j,k,l} updated in step S108, A_{j,i,k,n} updated in step S106, and H_{i,m,τ} and W_{i,k,m,τ} set in step S102, and stored in the storage unit 30.

In step S112, the power spectrogram P_{i,k,l} of sound source i is updated for each combination of (i, k, l) according to equation (8), based on W_{i,k,m,τ} updated in step S110 and H_{i,m,τ} set in step S102, and stored in the storage unit 30. In addition, the power spectrogram model X_{j,k,l} of microphone j is updated for each combination of (j, k, l) according to equation (12), based on P_{i,k,l} updated in this step and A_{j,i,k,n} updated in step S106, and stored in the storage unit 30.

In step S114, the basis onset H_{i,m,τ} of the power spectrogram P_{i,k,l} of sound source i is updated for each combination of (i, m, τ) according to equation (15), based on Y_{j,k,l} calculated in step S100, H_{i,m,τ} set in step S102, X_{j,k,l} updated in step S112, A_{j,i,k,n} updated in step S106, and W_{i,k,m,τ} updated in step S110, and stored in the storage unit 30.

In step S116, the power spectrogram P_{i,k,l} of sound source i is updated for each combination of (i, k, l) according to equation (8), based on W_{i,k,m,τ} updated in step S110 and H_{i,m,τ} updated in step S114, and stored in the storage unit 30. In addition, the power spectrogram model X_{j,k,l} of microphone j is updated for each combination of (j, k, l) according to equation (12), based on P_{i,k,l} updated in this step and A_{j,i,k,n} updated in step S106, and stored in the storage unit 30.

In the next step S118, the value of the objective function is calculated according to equation (11), based on Y_{j,k,l} calculated in step S100 and X_{j,k,l} updated in step S116, and stored in the storage unit 30. The value of the objective function calculated in the previous execution of step S118 is then read from the storage unit 30, and it is determined whether the difference between the value calculated in the current step S118 and the value calculated in the previous step S118 is smaller than a predetermined threshold stored in advance in the storage unit 30. If the difference is equal to or greater than the threshold, it is determined that the end condition is not satisfied, and the routine returns to step S106 and repeats steps S106 to S118.

If, on the other hand, the difference is less than the predetermined threshold, it is determined that the end condition is satisfied. In step S120, based on the basis spectra W_{i,k,m,τ} and the basis onsets H_{i,m,τ} of the power spectrogram P_{i,k,l} of sound source i stored in the storage unit 30, the source signal of each of the plurality of sound sources i is generated and output from the output unit 40, and the acoustic signal analysis processing routine ends.

(Experimental results)

Next, to demonstrate the effectiveness of the method according to the present embodiment, two sound source separation experiments were conducted under underdetermined conditions in reverberant environments.

First, to assess robustness in reverberant environments, a sound source separation experiment was conducted in environments with different reverberation strengths. Second, to assess robustness against disturbances to the transfer system, a sound source separation experiment was conducted for the case in which the microphone positions change while the observation signals are being acquired.

As the underdetermined condition, three sound sources and two microphones were placed in the room, and impulse responses were generated by the image method from the shape of the room, the positions of the sound sources, and the positions of the microphones.
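
Given one impulse response per source-microphone pair, the observation at each microphone is the sum of the source signals convolved with the corresponding impulse responses. The following minimal sketch shows this mixing step for three sources and two microphones. It does not implement the image method itself; the impulse responses are supplied by the caller, and the function name `simulate_observations` is hypothetical.

```python
import numpy as np

def simulate_observations(sources, impulse_responses):
    """Mix sources into microphone observations.

    sources:            list of I one-dimensional arrays of equal length
    impulse_responses:  impulse_responses[j][i] is the response from
                        source i to microphone j (one-dimensional array)
    Returns an array of shape (J, number of samples).
    """
    n_samples = len(sources[0])
    observations = []
    for responses_j in impulse_responses:
        y_j = np.zeros(n_samples)
        for s_i, h_ji in zip(sources, responses_j):
            # Convolve and truncate to the original signal length.
            y_j += np.convolve(s_i, h_ji)[:n_samples]
        observations.append(y_j)
    return np.stack(observations)
```

With fewer microphones than sources, the mixing cannot be undone by inverting a square mixing matrix, which is why this setting is called underdetermined.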

FIG. 3 shows the shape of the room in which the sound source separation experiments were conducted (hereinafter simply "the room"), together with the positions of the sound sources and the microphones. As shown in FIG. 3, the room is a 5 m × 10 m rectangle; the "×" marks labeled S1 to S3 indicate the positions of the sound sources, and the "●" marks labeled M1 and M2 indicate the positions of the microphones.

The sound sources used in the sound source separation experiments were 15 utterances by each of three speakers from the ATR speech dialogue database. Of sound sources 1 to 3, the speakers of sound sources 1 and 2 are female, and the speaker of sound source 3 is male.

The strength of the reverberation was varied by adjusting the reflection coefficient of the room walls. Specifically, a wall reflection coefficient of 0.5 gives a reverberation time of 60 ms, and a wall reflection coefficient of 0.8 gives a reverberation time of 210 ms.

In the sound source separation experiments, a method using the known multichannel NMF (Multichannel extensions of Non-negative Matrix Factorization: MNMF) was compared with the proposed method according to the present embodiment. Because MNMF assumes an instantaneous mixing model, its performance is expected to degrade when reverberation components extend beyond the STFT frame.

Of the 15 utterances prepared for each speaker, one was used as the signal to be separated, and the remaining 14 were used as training data for prior learning. The basis spectra were learned with NMFD for the proposed method according to the present embodiment and with NMF for MNMF, and 40 and 20 bases, respectively, were learned for each of sound sources 1 to 3. The generalized KL divergence was used as the distance measure, and the source-to-distortion ratio (SDR) was used as the evaluation metric. The STFT frame length was 32 ms and the shift length was 16 ms.
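
For reference, the generalized KL divergence used in these experiments, and the Itakura-Saito distance named in the claims, can be written for non-negative spectrograms Y and X as in the following sketch. The function names and the small constant `eps`, added to avoid division by zero and log(0), are assumptions made for this example.

```python
import numpy as np

def generalized_kl(Y, X, eps=1e-12):
    """Generalized KL divergence: sum of Y*log(Y/X) - Y + X over all entries."""
    Y = np.asarray(Y, dtype=float) + eps
    X = np.asarray(X, dtype=float) + eps
    return float(np.sum(Y * np.log(Y / X) - Y + X))

def itakura_saito(Y, X, eps=1e-12):
    """Itakura-Saito distance: sum of Y/X - log(Y/X) - 1 over all entries."""
    ratio = (np.asarray(Y, dtype=float) + eps) / (np.asarray(X, dtype=float) + eps)
    return float(np.sum(ratio - np.log(ratio) - 1.0))
```

Both measures are zero when X equals Y and positive otherwise. The Itakura-Saito distance depends only on the ratio Y/X, so it is unchanged when Y and X are scaled by the same factor, whereas the generalized KL divergence scales with them.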

FIG. 4 is a graph showing an example of how the performance of each compared method changes with the reflection coefficient of the room walls. The horizontal axis of the graph in FIG. 4 represents the wall reflection coefficient and the vertical axis represents the SDR. Graph 50 shows the result for MNMF, and graph 51 shows the result for the proposed method according to the present embodiment. The SDR plotted for each reflection coefficient is the average of the SDRs of all sound sources at all microphones.

As shown in FIG. 4, in the range of low reflection coefficients, where the instantaneous mixing model can be regarded as valid, MNMF performs better than the proposed method according to the present embodiment. As the reflection coefficient increases, however, the proposed method tends to outperform MNMF, because it adopts a mixing model that convolves spectrogram-segment templates with activation sequences and can therefore estimate the reverberation components accurately.

Next, the sound source separation experiment for the case in which the microphone positions change during acquisition of the observation signals, conducted to assess robustness against disturbances to the transfer system, will be described.

In this experiment, the observation signals used for sound source separation were generated by concatenating the observation signals acquired by the microphones M1 and M2 shown in FIG. 3 with observation signals acquired at positions shifted by Δx m from the positions of M1 and M2, respectively. In this case, a larger disturbance arises in the phase component of the transfer system than in its amplitude component. The reflection coefficient of the room walls in this experiment was set to 0.8.

FIG. 5 is a graph showing an example of how the performance of each compared method changes with the shift Δx of the observation signal acquisition position. The horizontal axis of the graph in FIG. 5 represents the shift Δx and the vertical axis represents the SDR. Graph 52 shows the result for MNMF, and graph 53 shows the result for the proposed method according to the present embodiment.

As shown in FIG. 5, graph 52, which shows the performance of MNMF, indicates that the SDR drops markedly as Δx increases, that is, as the disturbance grows. Specifically, the SDR at Δx = 0.2 falls to about 86% of the SDR at Δx = 0.

With the proposed method, by contrast, the SDR at Δx = 0.2 falls only to about 93% of the SDR at Δx = 0. The proposed method according to the present embodiment therefore suffers less degradation in separation performance under disturbances to the transfer system than MNMF does, and can be said to be more robust against such disturbances.

As described above, the proposed method according to the present invention sets a semi-time-varying model for slight changes in the acoustic analysis environment, such as small movements of a sound source or a microphone, and applies a mixing model that convolves templates of spectrogram segments, each formed by concatenating the spectra of a plurality of time frames, with activation sequences. As a result, the source signal of each sound source can be separated accurately from an observation signal in which the sounds of a plurality of sound sources are superimposed, even in a time-varying reverberant environment in which the relative positions of the sound sources and the microphones change.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the acoustic signal analysis apparatus described above contains a computer system. Where a WWW system is used, the "computer system" also includes a homepage providing environment (or display environment).

In the present specification, the embodiment has been described as one in which the program is installed in advance, but the program may also be provided stored on a computer-readable recording medium.

Description of Symbols
10 Input unit
20 Calculation unit
21 Time-frequency analysis unit
22 Initial setting unit
23 Parameter update unit
24 End determination unit
25 Signal conversion unit
30 Storage unit
40 Output unit
100 Acoustic signal analysis apparatus

Claims

An acoustic signal analysis apparatus comprising:
a time-frequency analysis unit that receives as input time-series data of acoustic signals picked up by each of J microphones j, and outputs observed signal time-frequency components y_{j,k,l} of each frequency k at each time l;
an initial value setting unit that sets an initial value for each of: amplitude components A_{j,i,k,n} of a time-varying steering vector representing the transfer characteristics with which sound from sound source i is picked up by microphone j with a delay of time n; non-negative basis spectra W_{i,k,m,τ} of each basis m and each frequency k at each time τ, corresponding to spectrum segments formed by concatenating the spectra of a plurality of frames; and non-negative basis onsets H_{i,m,τ} of each basis m and each frequency k at each time τ, corresponding to the spectrum segments;
a parameter update unit that updates the basis spectra W_{i,k,m,τ}, the basis onsets H_{i,m,τ}, and the amplitude components A_{j,i,k,n} so as to reduce, over all combinations of (j, k, l), the distance between the observed signal time-frequency components y_{j,k,l} and a power spectrogram model X_{j,k,l} of microphone j calculated based on the non-negative basis spectra W_{i,k,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segments, the non-negative basis onsets H_{i,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segments, and the amplitude components A_{j,i,k,n} for each sound source i and each time n; and
an end determination unit that causes the update by the parameter update unit to be repeated until a predetermined end condition is satisfied.

The acoustic signal analysis apparatus according to claim 1, wherein the Itakura-Saito distance is used as the measure of the distance between the observed signal time-frequency components y_{j,k,l} and the power spectrogram model X_{j,k,l} of microphone j.

An acoustic signal analysis method in an acoustic signal analysis apparatus including a time-frequency analysis unit, an initial value setting unit, a parameter update unit, and an end determination unit, wherein:
the time-frequency analysis unit receives as input time-series data of acoustic signals picked up by each of J microphones j, and outputs observed signal time-frequency components y_{j,k,l} of each frequency k at each time l;
the initial value setting unit sets an initial value for each of: amplitude components A_{j,i,k,n} of a time-varying steering vector representing the transfer characteristics with which sound from sound source i is picked up by microphone j with a delay of time n; non-negative basis spectra W_{i,k,m,τ} of each basis m and each frequency k at each time τ, corresponding to spectrum segments formed by concatenating the spectra of a plurality of frames; and non-negative basis onsets H_{i,m,τ} of each basis m and each frequency k at each time τ, corresponding to the spectrum segments;
the parameter update unit updates the basis spectra W_{i,k,m,τ}, the basis onsets H_{i,m,τ}, and the amplitude components A_{j,i,k,n} so as to reduce, over all combinations of (j, k, l), the distance between the observed signal time-frequency components y_{j,k,l} and a power spectrogram model X_{j,k,l} of microphone j calculated based on the non-negative basis spectra W_{i,k,m,τ} of each basis m and each frequency k at each time τ corresponding to the spectrum segments, the non-negative basis onsets H_{i,m,τ} of each basis m at each time τ corresponding to the spectrum segments, and the amplitude components A_{j,i,k,n} for each sound source i and each time n; and
the end determination unit causes the update by the parameter update unit to be repeated until a predetermined end condition is satisfied.

The acoustic signal analysis method according to claim 3, wherein the Itakura-Saito distance is used as the measure of the distance between the observed signal time-frequency components y_{j,k,l} and the power spectrogram model X_{j,k,l} of microphone j.

A program for causing a computer to function as each unit of the acoustic signal analysis apparatus according to claim 1 or claim 2.