JP7444243B2

JP7444243B2 - Signal processing device, signal processing method, and program

Info

Publication number: JP7444243B2
Application number: JP2022513704A
Authority: JP
Inventors: 智広中谷; 慶介木下; 林太郎池下; マークデルクロア
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2020-04-06
Filing date: 2020-04-06
Publication date: 2024-03-06
Anticipated expiration: 2040-04-06
Also published as: JPWO2021205494A1; WO2021205494A1

Description

Application of Article 30, Paragraph 2 of the Patent Act: published on October 30, 2019, on the website (https://arxiv.org/abs/1910.13707, https://arxiv.org/pdf/1910.13707.pdf).

Application of Article 30, Paragraph 2 of the Patent Act: published on March 2, 2020, as lecture number 3-1-10 (abstract on page 112, paper on pages 289-292) in the "Proceedings of the 2020 Spring Meeting of the Acoustical Society of Japan (CD-ROM)", issued by the Acoustical Society of Japan.

The present invention relates to a technique for extracting a target sound from an acoustic signal by suppressing sounds other than the target sound, other noise, and reverberation.

Non-Patent Document 1 discloses a method for suppressing noise and reverberation in a frequency-domain acoustic signal. The method first receives the acoustic signal, estimates a dereverberation filter that suppresses the reverberation of the target sound based on a weighted prediction-error power minimization criterion, and applies that filter to the acoustic signal to remove reverberation. It then receives a steering vector representing the direction of the target sound (or an estimate of it), estimates a Minimum-Power Distortionless Response (MPDR) beamformer that minimizes the power of the acoustic signal under the constraint that sound arriving at the microphones from the position of the target source is not distorted, and applies the beamformer to the dereverberated acoustic signal to further suppress noise.

Non-Patent Document 2 discloses a method that, under the assumption that the acoustic signal contains no diffuse noise and only a plurality of point sources, alternately updates the coefficients of a dereverberation block and of a source separation block so as to minimize a given objective function, thereby achieving optimal dereverberation and source separation simultaneously.

Non-Patent Document 1: Takuya Yoshioka, Nobutaka Ito, Marc Delcroix, Atsunori Ogawa, Keisuke Kinoshita, Masakiyo Fujimoto, Chengzhu Yu, Wojciech J Fabian, Miquel Espi, Takuya Higuchi, Shoko Araki, Tomohiro Nakatani, "The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices," Proc. IEEE ASRU 2015, 436-443, 2015.
Non-Patent Document 2: Takuya Yoshioka, Tomohiro Nakatani, Masato Miyoshi, and Hiroshi G. Okuno, "Blind Separation and Dereverberation of Speech Mixtures by Joint Optimization," IEEE Trans. ASLP, 19 (1), 69-84, 2011.

However, because the method of Non-Patent Document 1 optimizes the dereverberation filter and the MPDR beamformer independently, the processing is not optimal as a whole. The method of Non-Patent Document 2, in turn, cannot suppress diffuse noise in the acoustic signal.

The present invention has been made in view of these points, and its object is to perform dereverberation, diffuse noise suppression, and target source separation that are optimized as a whole.

A frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound are received. A convolutional beamformer that performs dereverberation, diffuse noise suppression, and target source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal is determined according to a probabilistic model. The estimated convolutional beamformer is then applied to the acoustic signal, and the resulting processed signal is output.

In the present invention, the use of auxiliary information representing information on the target sound makes it possible to optimize, as a whole, the convolutional beamformer that performs dereverberation, diffuse noise suppression, and target source separation. Dereverberation, diffuse noise suppression, and target source separation that are optimized as a whole can therefore be performed.

FIG. 1 is a block diagram illustrating the functional configuration of a signal processing device according to an embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of a signal processing device according to the second embodiment and its modifications.
FIG. 3 is a flow diagram for explaining the signal processing method of the second embodiment and its modifications.
FIG. 4 is a flow diagram for explaining the signal processing method of a modification of the second embodiment.
FIG. 5 is a flow diagram for explaining the signal processing method of the second embodiment and its modifications.
FIG. 6 is a block diagram illustrating RTF (Relative Transfer Function) estimation processing.
FIG. 7 is a flow diagram for explaining the signal processing method of a modification of the second embodiment.
FIG. 8 is a flow diagram for explaining the signal processing method of a modification of the second embodiment.
FIG. 9 is a block diagram illustrating the functional configuration of a signal processing device according to the third embodiment and its modifications.
FIG. 10 is a flow diagram for explaining the signal processing method of the third embodiment and its modifications.
FIG. 11 is a flow diagram for explaining the signal processing method of the third embodiment and its modifications.
FIG. 12 is a flow diagram for explaining the signal processing method of a modification of the third embodiment.
FIG. 13 is a flow diagram for explaining the signal processing method of a modification of the third embodiment.
FIG. 14 is a block diagram illustrating the hardware configuration of the signal processing device according to the embodiments.
FIG. 15 is a graph illustrating the word error rates obtained when speech recognition is performed on the processed signals obtained in the fourth embodiment and in Modifications 1 and 2 of the second embodiment.

Embodiments of the present invention will be described below with reference to the drawings.
[First embodiment]
First, a first embodiment of the present invention will be described. In the first embodiment, a signal processing device receives a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound; estimates a convolutional beamformer that performs dereverberation, diffuse noise suppression, and target source separation, based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal is determined according to a probabilistic model; and applies the estimated convolutional beamformer to the acoustic signal to obtain and output a processed signal.

<Functional configuration>
As illustrated in FIG. 1, the signal processing device 1 of this embodiment includes a convolutional beamformer estimation unit 11, a convolutional beamformer application unit 12, and a control unit 13, and executes each process under the control of the control unit 13.

<Processing>
Assume that source signals emitted from $I$ sound sources are observed by $M$ microphones in an environment with reverberation and diffuse noise, where $I$ and $M$ are integers of 1 or more satisfying $M \geq I$. The microphones observe a mixture of signals derived from the source signals (direct sound, early reflections, and late reverberation) and additive diffuse noise. The observed signals are frequency-divided by a well-known method such as the short-time Fourier transform, yielding the acoustic signal $x_{t,f}$ at each time-frequency point (the acoustic signal in the time-frequency domain). The acoustic signal $x_{t,f}$ is modeled as follows.

$$x_{t,f} = \sum_{i=1}^{I} x_{t,f}^{(i)} + n_{t,f} \tag{1}$$
$$x_{t,f}^{(i)} = d_{t,f}^{(i)} + r_{t,f}^{(i)} \tag{2}$$

Here, $t \in \{1,\dots,N\}$ is a time index corresponding to a time interval (frame), and $f \in \{1,\dots,F\}$ is a frequency index corresponding to a frequency band (frequency bin). $N$ and $F$ are positive integers; for example, $N$ is an integer of 2 or more. Hereinafter, the time interval corresponding to time index $t$ is called "time interval $t$", and the frequency band corresponding to frequency index $f$ is called "frequency band $f$". The index $i \in \{1,\dots,I\}$ corresponds to the sound source of each target sound, and the source signal emitted from sound source $i$ is called "source signal $i$".
The acoustic signal $x_{t,f} = [x_{1,t,f},\dots,x_{M,t,f}]^T \in \mathbb{C}^{M \times 1}$ is an $M$-dimensional column vector whose elements are the signals $x_{1,t,f},\dots,x_{M,t,f}$ in each frequency band $f$ obtained by frequency-dividing, in each time interval $t$, all the signals observed by the $M$ microphones. $\mathbb{C}$ denotes the set of all complex numbers, and $(\cdot)^T$ denotes the non-conjugate transpose of $(\cdot)$.
The microphone image signal $x_{t,f}^{(i)} = [x_{1,t,f}^{(i)},\dots,x_{M,t,f}^{(i)}]^T \in \mathbb{C}^{M \times 1}$ is an $M$-dimensional column vector whose elements $x_{1,t,f}^{(i)},\dots,x_{M,t,f}^{(i)}$ are the components of $x_{1,t,f},\dots,x_{M,t,f}$ that consist of the direct sound, early reflections, and late reverberation corresponding to source signal $i$. These components contain no diffuse noise. The diffuse noise $n_{t,f} = [n_{1,t,f},\dots,n_{M,t,f}]^T \in \mathbb{C}^{M \times 1}$ is an $M$-dimensional column vector whose elements are the components of $x_{1,t,f},\dots,x_{M,t,f}$ that correspond to diffuse noise.
The microphone image signal $x_{t,f}^{(i)}$ in equation (1) is further divided into two elements as shown in equation (2). The target sound $d_{t,f}^{(i)}$ is the component of $x_{t,f}^{(i)}$ corresponding to the direct sound and early reflections, and the late reverberation $r_{t,f}^{(i)}$ is the component of $x_{t,f}^{(i)}$ corresponding to late reverberation. Note that the superscript $\beta$ of symbols written in the form $\chi_\alpha^\beta$, such as $x_{t,f}^{(i)}$, $d_{t,f}^{(i)}$, and $r_{t,f}^{(i)}$, should properly be placed directly above $\alpha$ (see equation (2)), but owing to notational constraints it may be written to the upper right of $\alpha$.
In the dereverberation, diffuse noise suppression, and target source separation of this embodiment, the diffuse noise $n_{t,f}$ and the late reverberation $r_{t,f}^{(i)}$ corresponding to each sound source $i$ are suppressed from the acoustic signal $x_{t,f}$ of equation (1), and the target sound $d_{t,f}^{(i)}$ corresponding to each sound source $i$ is separated and extracted.
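As a minimal sketch of the array shapes implied by equations (1) and (2), the following NumPy snippet builds a synthetic observation from random placeholder data. The sizes, the scaling factors, the array layout, and the helper name `crandn` are illustrative assumptions and do not come from this document.

```python
import numpy as np

rng = np.random.default_rng(0)
I, M, N, F = 2, 3, 5, 4  # sources, microphones, frames, frequency bins (assumed sizes)

def crandn(*shape):
    """Hypothetical helper: complex Gaussian placeholder data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

d = crandn(I, M, N, F)        # target sound: direct sound + early reflections
r = 0.3 * crandn(I, M, N, F)  # late reverberation
n = 0.1 * crandn(M, N, F)     # diffuse noise

x_image = d + r               # equation (2): microphone image of each source
x = x_image.sum(axis=0) + n   # equation (1): observed mixture

print(x.shape)  # (3, 5, 4)
```

The processing described in this embodiment aims to recover each `d[i]` from `x` alone, with `r` and `n` suppressed.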

The processing of this embodiment will be described with reference to FIG. 1.
The frequency-divided time-series acoustic signal $x_{t,f}$ is input to the signal processing device 1 for all $t \in \{1,\dots,N\}$ and $f \in \{1,\dots,F\}$. As described above, the acoustic signal $x_{t,f}$ exemplified in this embodiment is obtained by frequency-dividing a signal obtained by observing a mixture of diffuse noise and microphone image signals based on the source signals emitted from the sound sources. Auxiliary information $s$ representing information on the target sound is also input to the signal processing device 1.
An example of the auxiliary information $s$ is information for specifying or estimating the relative transfer function (RTF) $\tilde{v}_f^{(i)}$ (see, for example, Reference 1).
Reference 1: I. Cohen, “Relative transfer function identification using speech signals,” IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, pp. 451-459, 2004.
The RTF $\tilde{v}_f^{(i)}$ is obtained by normalizing each element of the $M$-dimensional steering vector $v_f^{(i)} = [v_{1,f}^{(i)},\dots,v_{M,f}^{(i)}]^T$, which corresponds to the space from the sound source $i$ of the target sound to the $M$ microphones, by one of its elements. Equation (3) shows an example of the RTF $\tilde{v}_f^{(i)}$, in which each element of $v_f^{(i)}$ is normalized by the element $v_{1,f}^{(i)}$. This does not limit the present invention.

$$\tilde{v}_f^{(i)} = \frac{v_f^{(i)}}{v_{1,f}^{(i)}} \tag{3}$$

Note that the superscript $\alpha$ of symbols written in the form $\chi^\alpha$, such as $\tilde{v}$, should properly be placed directly above $\chi$ (see equation (3)), but owing to notational constraints it may be written to the upper right of $\chi$. Examples of information for specifying or estimating the RTF $\tilde{v}_f^{(i)}$ include a reference sound of the target sound, the time-frequency mask $\gamma_{t,f}^{(i)}$ of the sound source $i$ of the target sound, the steering vector $v_f^{(i)}$, and the RTF $\tilde{v}_f^{(i)}$ itself.
Each time-frequency mask $\gamma_{t,f}^{(i)}$ represents a value corresponding to the probability of presence, or to the presence or absence, of source signal $i$ in time interval $t$ and frequency band $f$. For example, the probability of presence of source signal $i$ in time interval $t$ and frequency band $f$, or a function value of it, may be used as the time-frequency mask $\gamma_{t,f}^{(i)}$. Alternatively, $\gamma_{t,f}^{(i)} = 1$ may be set when source signal $i$ is present in time interval $t$ and frequency band $f$, and $\gamma_{t,f}^{(i)} = 0$ when it is not. A method for estimating the time-frequency mask $\gamma_{t,f}^{(i)}$ is described, for example, in Reference 2.
Reference 2: F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, “A comprehensive study of speech separation: spectrogram vs waveform separation,” in Interspeech, 2019.
A method for estimating the RTF from a time-frequency mask is described, for example, in Non-Patent Document 1.
A method for estimating a time-frequency mask from a reference sound of the target sound is described, for example, in Reference 3.
Reference 3: K. Zmolikova, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, L. Burget, and J. Cernocky, “SpeakerBeam: Speaker aware neural network for target speaker extraction in speech mixtures,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800-814, 2019.
The auxiliary information $s$ may further include information specifying the power of the target sound. A method for estimating the power of the target sound is described, for example, in Reference 3B.
Reference 3B: Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “A Regression Approach to Speech Enhancement Based on Deep Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7-19, 2015.
The acoustic signal $x_{t,f}$ is input to the convolutional beamformer estimation unit 11 and the convolutional beamformer application unit 12, and the auxiliary information $s$ is input to the convolutional beamformer estimation unit 11.
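Two of the auxiliary quantities described above can be illustrated in a few lines: the RTF obtained by normalizing a steering vector by a reference element, and a 0/1 time-frequency mask derived from presence probabilities. This is only a sketch; the function names, the choice of the first element as the default reference, and the 0.5 threshold are assumptions made for illustration, not values given in this document.

```python
import numpy as np

def relative_transfer_function(v, ref=0):
    """Normalize a steering vector v (length M) by its ref-th element."""
    v = np.asarray(v, dtype=complex)
    if v[ref] == 0:
        raise ValueError("reference element must be non-zero")
    return v / v[ref]

def binary_mask(presence_prob, threshold=0.5):
    """Turn presence probabilities into a 0/1 mask (threshold is an assumption)."""
    return (np.asarray(presence_prob) >= threshold).astype(float)

v = np.array([2.0 + 0.0j, 1.0 + 1.0j, 0.0 - 2.0j])   # toy steering vector, M = 3
rtf = relative_transfer_function(v)                   # first element becomes 1
mask = binary_mask([[0.9, 0.2], [0.4, 0.7]])          # rows: frames, columns: bins
```

Normalizing by a reference element removes the unknown common scale and phase of the steering vector, which is why an RTF is sufficient to state a distortionless constraint relative to one microphone.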

<<Processing of convolutional beamformer estimation unit 11 (step S11)>>
The convolutional beamformer estimation unit 11 receives the acoustic signal $x_{t,f}$ and the auxiliary information $s$, and estimates a convolutional beamformer that performs dereverberation, diffuse noise suppression, and target source separation, based on the optimization criterion that the signal $y_{t,f}$ obtained by applying the convolutional beamformer to the acoustic signal $x_{t,f}$ is determined according to a probabilistic model. Here, $y_{t,f} = [y_{t,f}^{(1)},\dots,y_{t,f}^{(I)}]^T \in \mathbb{C}^{I \times 1}$ is an $I$-dimensional column vector, and $y_{t,f}^{(i)}$ for $i \in \{1,\dots,I\}$ is the estimated signal of the target sound $d_{t,f}^{(i)}$. The convolutional beamformer is expressed, for example, as follows.

$$y_{t,f} = W_0^H x_{t,f} + \sum_{\tau=\Delta}^{L-1} W_\tau^H x_{t-\tau,f} \tag{4}$$

Here, $W_\tau \in \mathbb{C}^{M \times I}$ for $\tau \in \{0, \Delta, \Delta+1, \dots, L-1\}$ is an $M \times I$ matrix whose elements are beamformer coefficients, and $(\cdot)^H$ denotes the conjugate transpose of $(\cdot)$. $\Delta$ is a positive integer representing the number of time intervals (frames) corresponding to the length of the early reflections. At least $\Delta \geq 1$ holds, and an example of $\Delta$ is a positive integer representing the time intervals corresponding to 30 to 50 ms. Because the taps between 0 and $\Delta$ are excluded, the convolutional beamformer suppresses the late reverberation rather than the early reflections.
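The application of a convolutional beamformer with taps $\tau \in \{0, \Delta, \dots, L-1\}$ to one frequency band can be sketched as below. The function name, the dictionary representation of the taps, and the zero-padding of frames before the start of the signal are assumptions made for this illustration.

```python
import numpy as np

def apply_convolutional_beamformer(x, W, delta):
    """Apply y_t = sum_tau W_tau^H x_{t-tau} for one frequency band.

    x     : (N, M) complex array, one row per frame.
    W     : dict mapping tap tau -> (M, I) coefficient matrix.
    delta : prediction delay; taps 1 .. delta-1 must not be present.
    Frames before the first one are treated as zero (an assumption).
    """
    N, M = x.shape
    I = next(iter(W.values())).shape[1]
    y = np.zeros((N, I), dtype=complex)
    for tau, W_tau in W.items():
        if 0 < tau < delta:
            raise ValueError("taps between 0 and delta are excluded")
        if tau >= N:
            continue  # this tap never sees any data
        # Row form of W_tau^H x_{t-tau} is x_{t-tau}^T conj(W_tau).
        y[tau:] += x[:N - tau] @ np.conj(W_tau)
    return y

x = np.arange(12, dtype=complex).reshape(6, 2)   # N = 6 frames, M = 2 microphones
W = {0: np.array([[1.0], [0.0]]),                # tap 0 picks microphone 1
     2: np.array([[0.0], [-0.5]])}               # tap 2 subtracts delayed microphone 2
y = apply_convolutional_beamformer(x, W, delta=2)
```

With these toy coefficients, `y[t, 0]` equals `x[t, 0] - 0.5 * x[t - 2, 1]` for `t >= 2`, which shows how the delayed taps subtract a prediction formed from past frames.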

Assume that the following holds.

[equation not reproduced]

where

[expression not reproduced]

is used to extract the estimated signal $y_{t,f}^{(i)}$ of the target sound $d_{t,f}^{(i)}$. Using equations (5) and (6), equation (4) can also be rewritten as follows.

[Equation (7) not reproduced]

Also, for $Q_f \in \mathbb{C}^{M \times I}$, $\bar{G}_f \in \mathbb{C}^{M(L-\Delta) \times M}$, and

[expression not reproduced]

assume that $W_{0,f} = Q_f$ and

[Equation (8) not reproduced]

hold. When $M \geq I$ and $\mathrm{rank}\{Q_f\} = I$, the existence of $\bar{G}_f$ satisfying equation (8) is guaranteed. Using these, equation (4) can also be rewritten as equations (9) and (10) below.

[Equations (9) and (10) not reproduced]

where the following holds.

[equation not reproduced]

Also, for $q_f^{(i)} \in \mathbb{C}^{M \times 1}$ and

[expression not reproduced]

assume that, for $i \in \{1,\dots,I\}$, $w_0^{(i)} = q_f^{(i)}$ and

[equation not reproduced]

hold. Equation (7) can then also be rewritten as equations (12) and (13) below.

[Equations (12) and (13) not reproduced]

Equation (9) can further be rewritten as equation (9') below.

[Equation (9') not reproduced]

where

[expression not reproduced]

holds, $I_M \in \mathbb{R}^{M \times M}$ denotes the $M \times M$ identity matrix, and $\mathbb{R}$ denotes the set of all real numbers. The symbol $\otimes$ denotes the Kronecker product. For $m \in \{1,\dots,M\}$,

[expression not reproduced]

is the $m$-th column vector of $\bar{G}_f$. Equation (10) can likewise be rewritten as equation (10') below.

[Equation (10') not reproduced]
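The idea behind equations (8) to (10), namely that one convolutional beamformer can be split into a dereverberation filter followed by an instantaneous beamformer, can be checked numerically. Because the equation images are not available in this text, the sketch below assumes a specific form: a dereverberation step $z_t = x_t - \bar{G}^H \bar{x}_t$, where $\bar{x}_t$ stacks the delayed frames $x_{t-\Delta},\dots,x_{t-L+1}$, followed by $y_t = Q^H z_t$, with the stacked delayed taps equal to $-\bar{G} Q$. These forms, the sizes, and the helper `crandn` are assumptions, not a transcription of the patent's equations.

```python
import numpy as np

rng = np.random.default_rng(1)
M, I, N, delta, L = 3, 2, 20, 2, 5   # assumed toy sizes
K = M * (L - delta)                  # rows of G_bar, matching C^{M(L-delta) x M}

def crandn(*shape):
    """Hypothetical helper: complex Gaussian placeholder data."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

x = crandn(N, M)      # frames of one frequency band, one row per frame
Q = crandn(M, I)      # instantaneous beamformer (equals the tap-0 matrix W_0)
G_bar = crandn(K, M)  # dereverberation (prediction) filter

# Stack delayed frames: row t holds [x_{t-delta}, ..., x_{t-L+1}], zeros before the start.
x_bar = np.zeros((N, K), dtype=complex)
for k, tau in enumerate(range(delta, L)):
    x_bar[tau:, k * M:(k + 1) * M] = x[:N - tau]

# Two-stage form: dereverberate, then beamform.
z = x - x_bar @ np.conj(G_bar)       # z_t = x_t - G_bar^H x_bar_t
y_two_stage = z @ np.conj(Q)         # y_t = Q^H z_t

# Single-filter form with W_0 = Q and stacked delayed taps W_bar = -G_bar Q.
W_bar = -G_bar @ Q
y_direct = x @ np.conj(Q) + x_bar @ np.conj(W_bar)

print(np.allclose(y_two_stage, y_direct))  # True
```

Under these assumptions the two forms are algebraically identical, so optimizing the pair ($\bar{G}$, $Q$) jointly is the same as optimizing the whole convolutional beamformer.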

The convolutional beamformer estimation unit 11 estimates the convolutional beamformer based on the optimization criterion that $y_{t,f}$ is determined according to a probabilistic model. An example of this probabilistic model is one that satisfies (a) and (b) below.
(a) For each $i \in \{1,\dots,I\}$, $y_{t,f}^{(i)}$ follows a zero-mean complex Gaussian distribution with the time-varying variance

$$\lambda_{t,f}^{(i)} = E\{|y_{t,f}^{(i)}|^2\} \tag{14}$$

where $E\{\cdot\}$ denotes the expectation. Hereinafter, $\lambda_{t,f}^{(i)}$ is called the power of $y_{t,f}^{(i)}$.
(b) For each $i \in \{1,\dots,I\}$, the convolutional beamformer does not distort the sound arriving at the microphones from sound source $i$. This constraint can be written, for example, as equation (15) or (16) below.

[Equations (15) and (16) not reproduced]

For the coefficients

[expression not reproduced]

of the convolutional beamformer of equation (7) corresponding to sound source $i$, the optimization criterion determined according to this probabilistic model is to minimize the cost function $L_{i,f}(\theta_f^{(i)})$ of equation (18) below under the constraint of equation (15) or (16).

[Equation (18) not reproduced]

For the coefficients

[expression not reproduced]

of the convolutional beamformer for all sound sources $i = 1,\dots,I$, the optimization criterion is to minimize the cost function $L_f(\Theta_f)$ of equation (20) below under the constraint that equation (15) or (16) is satisfied for all sound sources $i = 1,\dots,I$.

[Equation (20) not reproduced]
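For intuition about this kind of cost, recall that the negative log-likelihood of a zero-mean complex Gaussian with variance $\lambda_t$ is, up to an additive constant, $|y_t|^2/\lambda_t + \log \lambda_t$. The sketch below averages that quantity over frames for one source and one frequency band. It is the generic form implied by model (a); the exact normalization of the patent's cost functions is not reproduced in this text, so the averaging over frames and the function name are assumptions.

```python
import numpy as np

def gaussian_nll_cost(y, lam):
    """Average of |y_t|^2 / lam_t + log(lam_t) over frames (constants dropped)."""
    y = np.asarray(y, dtype=complex)
    lam = np.asarray(lam, dtype=float)
    if np.any(lam <= 0):
        raise ValueError("powers must be strictly positive")
    return float(np.mean(np.abs(y) ** 2 / lam + np.log(lam)))

y = np.array([1.0 + 1.0j, 2.0 + 0.0j])          # toy output frames
cost = gaussian_nll_cost(y, [2.0, 4.0])          # powers equal to |y|^2
cost_mismatched = gaussian_nll_cost(y, [4.0, 8.0])
```

For a fixed output, this cost is smallest when each power equals $|y_t|^2$, which is consistent with obtaining the power from the current output when it is not supplied as auxiliary information. For fixed powers, the cost is a weighted output power in which frames with small $\lambda_t$ count more.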

That is, the convolutional beamformer estimation unit 11 of this embodiment estimates the convolutional beamformer that minimizes the cost function $L_f(\Theta_f)$ of equation (20), and outputs information specifying the estimated convolutional beamformer (equations (4), (7), (9)(10), (9')(10'), (12)(13)). As can be seen from equations (14) and (18), this estimation requires information specifying the power $\lambda_{t,f}^{(i)} = E\{|y_{t,f}^{(i)}|^2\}$ of $y_{t,f}^{(i)}$ for $i \in \{1,\dots,I\}$. When the auxiliary information $s$ includes information specifying the power of the target sound, the power specified by the auxiliary information $s$ may be used as $\lambda_{t,f}^{(i)}$. When the auxiliary information $s$ does not include such information, $\lambda_{t,f}^{(i)} = |y_{t,f}^{(i)}|^2$ is obtained from the $y_{t,f}^{(i)}$ produced by the convolutional beamformer application unit 12, as described below. However, because the $y_{t,f}^{(i)}$ obtained by the convolutional beamformer application unit 12 depends on the convolutional beamformer estimated by the convolutional beamformer estimation unit 11, the processing of the estimation unit 11 and that of the application unit 12 must be repeated alternately until a predetermined convergence condition is satisfied.

<<Processing of convolutional beamformer application unit 12 (step S12)>>
The acoustic signal $x_{t,f}$ and the information specifying the convolutional beamformer output from the convolutional beamformer estimation unit 11 are input to the convolutional beamformer application unit 12. The convolutional beamformer application unit 12 applies the convolutional beamformer specified by that information (equations (4), (7), (9)(10), (9')(10'), (12)(13)) to the acoustic signal $x_{t,f} = [x_{1,t,f},\dots,x_{M,t,f}]^T$ to obtain and output the processed signal $y_{t,f} = [y_{t,f}^{(1)},\dots,y_{t,f}^{(I)}]^T$.

As described above, when the auxiliary information $s$ includes information specifying the power of the target sound, the signal processing device 1 outputs the processed signal $y_{t,f}$ obtained in step S12. In this case, steps S11 and S12 need not be repeated. When the auxiliary information $s$ does not include information specifying the power of the target sound, the processing of step S11 and the processing of step S12 are repeated alternately until a predetermined convergence condition is satisfied. Examples of the convergence condition are that the number of repetitions has reached a predetermined number, or that the change in the coefficients of the convolutional beamformer between successive repetitions is at most a predetermined amount. The signal processing device 1 outputs the processed signal $y_{t,f}$ obtained in step S12 when the convergence condition is satisfied. In either case, the processed signal $y_{t,f}$ is the result of applying dereverberation, diffuse noise suppression, and target source separation to the acoustic signal $x_{t,f}$. The output processed signal $y_{t,f}$ may be used as the input to other arithmetic processing, or may be converted into a time-domain acoustic signal by a well-known method such as the inverse Fourier transform.
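The alternation between steps S11 and S12 when the power is not supplied can be sketched for a single source and a single frequency band. The update used here is the standard closed-form minimizer of a power-weighted quadratic cost under one linear distortionless constraint, $\bar{w} = R^{-1}\bar{v} / (\bar{v}^H R^{-1} \bar{v})$; it is an assumption chosen for illustration, since the patent's own equations (15) to (20) are not reproduced in this text. The function names, the power floor, the initial power guess, and the tolerance are likewise hypothetical.

```python
import numpy as np

def stack_frames(x, delta, L):
    """Rows are [x_t, x_{t-delta}, ..., x_{t-L+1}]; frames before the start are zero."""
    N, M = x.shape
    taps = [0] + list(range(delta, L))
    x_bar = np.zeros((N, M * len(taps)), dtype=complex)
    for k, tau in enumerate(taps):
        x_bar[tau:, k * M:(k + 1) * M] = x[:N - tau]
    return x_bar

def estimate_and_apply(x, rtf, delta, L, max_iter=20, tol=1e-4, floor=1e-3):
    """Alternate a beamformer update (S11) and its application (S12) until convergence."""
    N, M = x.shape
    x_bar = stack_frames(x, delta, L)
    v_bar = np.zeros(x_bar.shape[1], dtype=complex)
    v_bar[:M] = rtf                               # constraint acts on the tap-0 part only
    lam = np.mean(np.abs(x) ** 2, axis=1)         # initial power guess (assumption)
    w = None
    for _ in range(max_iter):
        lam = np.maximum(lam, floor * lam.mean() + 1e-12)   # keep the weights finite
        # S11 (sketch): power-weighted covariance and constrained minimizer.
        R = (x_bar.T / lam) @ np.conj(x_bar) / N
        R_inv_v = np.linalg.solve(R, v_bar)
        w_new = R_inv_v / (np.conj(v_bar) @ R_inv_v)
        # S12: apply the beamformer, y_t = w^H x_bar_t.
        y = x_bar @ np.conj(w_new)
        converged = w is not None and np.linalg.norm(w_new - w) <= tol * np.linalg.norm(w)
        w = w_new
        if converged:
            break
        lam = np.abs(y) ** 2                      # power re-estimated from the output
    return y, w

rng = np.random.default_rng(0)
N, M = 400, 2
rtf = np.array([1.0, 0.6 - 0.3j])                 # toy RTF, first element is the reference
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
noise = 0.3 * (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M)))
x = s[:, None] * rtf[None, :] + noise
y, w = estimate_and_apply(x, rtf, delta=2, L=4)
```

The loop stops either after `max_iter` repetitions or when the relative change of the coefficients falls below `tol`, mirroring the two example convergence conditions given above. By construction the tap-0 part of `w` satisfies the distortionless constraint with respect to `rtf` at every iteration.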

<Features of this embodiment>
In this embodiment, auxiliary information $s$ representing information on the target sound is used, and the convolutional beamformer that performs dereverberation, diffuse noise suppression, and target source separation is estimated based on the optimization criterion that the signal obtained by applying the convolutional beamformer to the acoustic signal is determined according to a probabilistic model. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be achieved.

［第２実施形態］
次に本発明の第２実施形態を説明する。本実施形態では、畳み込みビームフォーマを、残響抑圧を行う残響抑圧フィルタと、拡散性雑音抑圧と目的音源分離とを行うビームフォーマとに分割して取り扱う。言い換えると、本実施形態の畳み込みビームフォーマは、残響抑圧を行う残響抑圧フィルタと、拡散性雑音抑圧と目的音源分離とを行うビームフォーマとを含む。ただし、残響抑圧フィルタとビームフォーマとの最適化処理は互いに独立しておらず、畳み込みビームフォーマ全体として最適化される。残響抑圧フィルタの例は式(9),式(9')または式(12)などであり、ビームフォーマの例は式(10),式(10')または式(13)などである。本実施形態では一例として残響抑圧フィルタとして式(9')を用い、ビームフォーマとして式(10')を用いる。また、残響抑圧フィルタの最適化には各目的音のパワー重み付き時空間共分散行列が用いられる。目的音のパワー重み付き時空間共分散行列はサイズが小さいため、小さい演算量で残響抑圧フィルタの最適化を行うことができる。以下では、これまで説明した事項との相違点を中心に説明し、既に説明した事項については同じ参照番号を引用して説明を簡略化する。 [Second embodiment]
Next, a second embodiment of the present invention will be described. In this embodiment, the convolutional beamformer is divided into a dereverberation filter that performs dereverberation and a beamformer that performs diffuse noise suppression and target sound source separation. In other words, the convolutional beamformer of this embodiment includes a dereverberation filter that performs dereverberation, and a beamformer that performs diffuse noise suppression and target sound source separation. However, the optimization processing of the dereverberation filter and the beamformer is not independent of each other, and the convolutional beamformer as a whole is optimized. Examples of a dereverberation filter include equation (9), equation (9'), or equation (12), and examples of a beamformer include equation (10), equation (10'), or equation (13). In this embodiment, as an example, equation (9') is used as the dereverberation filter, and equation (10') is used as the beamformer. Furthermore, a power-weighted spatio-temporal covariance matrix of each target sound is used to optimize the dereverberation filter. Since the power-weighted spatiotemporal covariance matrix of the target sound is small in size, the dereverberation filter can be optimized with a small amount of calculation. In the following, the explanation will focus on the differences from the matters explained so far, and the same reference numbers will be cited for the matters already explained to simplify the explanation.

＜機能構成＞
図２に例示するように、本実施形態の信号処理装置２は、時空間共分散推定部２１１、残響抑圧フィルタ推定部２１２、ビームフォーマ推定部２１３、残響抑圧フィルタ適用部２２１、ビームフォーマ適用部２２２、および制御部１３を有し、制御部１３の制御の下で各処理を実行する。ここで、時空間共分散推定部２１１、残響抑圧フィルタ推定部２１２、およびビームフォーマ推定部２１３は、畳み込みビームフォーマ推定部を構成する。残響抑圧フィルタ適用部２２１およびビームフォーマ適用部２２２は、畳み込みビームフォーマ適用部を構成する。 <Functional configuration>
As illustrated in FIG. 2, the signal processing device 2 of the present embodiment includes a spatio-temporal covariance estimation unit 211, a dereverberation filter estimation unit 212, a beamformer estimation unit 213, a dereverberation filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the dereverberation filter estimation unit 212, and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.

＜処理＞
図２を用いて本実施形態の処理を説明する。
補助情報ｓがビームフォーマ推定部２１３に入力され、音響信号ｘ_ｔ，ｆが時空間共分散行列推定部２１１、および残響抑圧フィルタ適用部２２１に入力される。本実施形態の補助情報ｓは、ＲＴＦｖ~_ｆ ^（ｉ）を特定または推定するための情報であり、目的音のパワーを特定する情報を含まない。時空間共分散行列推定部２１１は、目的音のパワー重み付き時空間共分散行列

を得て出力する。残響抑圧フィルタ推定部２１２は、目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）と、ビームフォーマを表す情報ｑ_ｆ ^（ｉ）とを受け取り、前述の最適化基準に基づいて残響抑圧フィルタを推定する。残響抑圧フィルタ適用部２２１は、残響抑圧フィルタ推定部２１２で推定された残響抑圧フィルタを音響信号ｘ_ｔ，ｆに適用して残響抑圧信号ｚ_ｔ，ｆを得て出力する（式(9')）。ビームフォーマ推定部２１３は、残響抑圧信号ｚ_ｔ，ｆと補助情報ｓとを受け取り、前述した最適化基準に基づいてビームフォーマを推定する。ビームフォーマ適用部２２２は、ビームフォーマ推定部２１３で推定されたビームフォーマを残響抑圧信号ｚ_ｔ，ｆに適用して処理信号ｙ_ｔ，ｆ ^（ｉ）を得て出力する。本実施形態では、所定の収束条件を満たすまで、畳み込みビームフォーマ推定部に含まれる時空間共分散推定部２１１、残響抑圧フィルタ推定部２１２、およびビームフォーマ推定部２１３の処理と、畳み込みビームフォーマ適用部に含まれる残響抑圧フィルタ適用部２２１およびビームフォーマ適用部２２２の処理と、を交互に繰り返す。信号処理装置２は当該収束条件を満たしたときにビームフォーマ適用部２２２で得られた処理信号ｙ_ｔ，ｆを出力する。 <Processing>
The processing of this embodiment will be explained using FIG. 2.
The auxiliary information s is input to the beamformer estimation unit 213, and the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the dereverberation filter application unit 221. The auxiliary information s of this embodiment is information for specifying or estimating the RTF \tilde{v}_f^{(i)}, and does not include information specifying the power of the target sound. The spatio-temporal covariance estimation unit 211 obtains and outputs the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound. The dereverberation filter estimation unit 212 receives the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformer, and estimates the dereverberation filter based on the aforementioned optimization criterion. The dereverberation filter application unit 221 applies the dereverberation filter estimated by the dereverberation filter estimation unit 212 to the acoustic signal x_{t,f} to obtain and output the dereverberation signal z_{t,f} (equation (9')). The beamformer estimation unit 213 receives the dereverberation signal z_{t,f} and the auxiliary information s, and estimates the beamformer based on the aforementioned optimization criterion. The beamformer application unit 222 applies the beamformer estimated by the beamformer estimation unit 213 to the dereverberation signal z_{t,f} to obtain and output the processed signal y_{t,f}^{(i)}. In this embodiment, until a predetermined convergence condition is satisfied, the processes of the spatio-temporal covariance estimation unit 211, the dereverberation filter estimation unit 212, and the beamformer estimation unit 213 included in the convolutional beamformer estimation unit, and the processes of the dereverberation filter application unit 221 and the beamformer application unit 222 included in the convolutional beamformer application unit, are alternately repeated. When the convergence condition is satisfied, the signal processing device 2 outputs the processed signal y_{t,f} obtained by the beamformer application unit 222.

以下、図３～図６を用いて本実施形態の処理を詳細に説明する。
補助情報ｓがビームフォーマ推定部２１３に入力される（ステップＳ２１３ａ）。本実施形態では、補助情報ｓとして時間周波数マスクγ_ｔ，ｆ ^（ｉ）が入力される。しかし、これは本発明を限定するものではない。また、音響信号ｘ_ｔ，ｆが、時空間共分散推定部２１１、および残響抑圧フィルタ適用部２２１に入力される（ステップＳ２２１ａ）。 The processing of this embodiment will be described in detail below using FIGS. 3 to 6.
The auxiliary information s is input to the beamformer estimation unit 213 (step S213a). In this embodiment, a time-frequency mask γ_{t,f}^{(i)} is input as the auxiliary information s. However, this does not limit the present invention. Furthermore, the acoustic signal x_{t,f} is input to the spatio-temporal covariance estimation unit 211 and the dereverberation filter application unit 221 (step S221a).

時空間共分散推定部２１１は、すべてのｉ∈｛１，…，Ｉ｝，ｔ∈｛１，…，Ｎ｝，ｆ∈｛１，…，Ｆ｝についてλ_ｔ，ｆ ^（ｉ）を初期化する。例えば、時空間共分散推定部２１１は以下のように目的音のパワーλ_ｔ，ｆ ^（ｉ）を初期化する。

ここで

はα^Ｈβαを表す。また、α←βはβをαに代入することを表す。言い換えると、α←βはαをβにすることを表す（ステップＳ２１１ａ）。 The spatio-temporal covariance estimation unit 211 initializes λ_{t,f}^{(i)} for all i∈{1,...,I}, t∈{1,...,N}, f∈{1,...,F}. For example, the spatio-temporal covariance estimation unit 211 initializes the power λ_{t,f}^{(i)} of the target sound as follows.

Here,

represents α^H β α. Further, α←β denotes assigning β to α; in other words, α←β means that α is set to β (step S211a).
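The quadratic form α^H β α used in the initialization can be illustrated with a short NumPy sketch. The initialization rule shown in `init_power` (setting every source's power to the quadratic form of the observed frame with the identity matrix as β) is an assumption made here for illustration, because the initialization equation itself is not reproduced in this text. The function names are hypothetical.

```python
import numpy as np


def weighted_sq_norm(alpha, beta):
    """Return alpha^H beta alpha as a real number (beta is assumed Hermitian)."""
    alpha = np.asarray(alpha, dtype=complex)
    # np.vdot conjugates its first argument, giving alpha^H (beta alpha).
    return float(np.real(np.vdot(alpha, beta @ alpha)))


def init_power(x, num_sources):
    """Assumed initialization: lambda[i, t] = x_t^H I_M x_t for every source i.

    x has shape (N, M): N frames of an M-channel signal at one frequency bin.
    The result has shape (num_sources, N).
    """
    x = np.asarray(x, dtype=complex)
    eye = np.eye(x.shape[1])
    lam = np.array([weighted_sq_norm(x_t, eye) for x_t in x])
    return np.tile(lam, (num_sources, 1))
```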

ビームフォーマ推定部２１３はすべてのｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝についてｑ_ｆ ^（ｉ）を初期化する。例えば、ビームフォーマ推定部２１３はＩ_Ｍのｉ番目の列をｑ_ｆ ^（ｉ）とする（ステップＳ２１３ｂ）。 The beamformer estimation unit 213 initializes q_f^{(i)} for all i∈{1,...,I}, f∈{1,...,F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M (step S213b).

残響抑圧フィルタ適用部２２１はすべてのｔ∈｛１，…，Ｎ｝，ｆ∈｛１，…，Ｆ｝についてｚ_ｔ，ｆを初期化する。例えば、残響抑圧フィルタ適用部２２１はｚ_ｔ，ｆ←ｘ_ｔ，ｆとする（ステップＳ２２１ｂ）。 The dereverberation filter application unit 221 initializes z_{t,f} for all t∈{1,...,N}, f∈{1,...,F}. For example, the dereverberation filter application unit 221 sets z_{t,f}←x_{t,f} (step S221b).

まだ処理信号ｙ_ｔ，ｆが一度も得られていないのであれば、時空間共分散推定部２１１には処理信号ｙ_ｔ，ｆは入力されない。一方、処理信号ｙ_ｔ，ｆが得られているのであれば、時空間共分散推定部２１１には更に処理信号ｙ_ｔ，ｆが入力される。時空間共分散推定部２１１は、すべてのｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝について目的音のパワー重み付き時空間共分散行列

を計算して出力する。処理信号ｙ_ｔ，ｆが一度も得られていないのであれば、この計算にはステップＳ２１１ａで得られたλ_ｔ，ｆ ^（ｉ）が用いられる。一方、既に処理信号ｙ_ｔ，ｆが得られているのであれば、この計算にはステップＳ２１１ｄで得られたλ_ｔ，ｆ ^（ｉ）が用いられる（ステップＳ２１１ｂ）。さらに時空間共分散推定部２１１は、すべてのｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝について目的音のパワー重み付き時空間共分散行列

を計算して出力する。ここでも処理信号ｙ_ｔ，ｆが一度も得られていないのであれば、この計算にはステップＳ２１１ａで得られたλ_ｔ，ｆ ^（ｉ）が用いられる。一方、既に処理信号ｙ_ｔ，ｆが得られているのであれば、この計算にはステップＳ２１１ｄで得られたλ_ｔ，ｆ ^（ｉ）が用いられる（ステップＳ２１１ｃ）。 If the processed signal y _t,f has never been obtained yet, the processed signal y _t,f is not input to the spatio-temporal covariance estimation unit 211 . On the other hand, if the processed signal y _t,f has been obtained, the processed signal y _t,f is further input to the spatio-temporal covariance estimation unit 211 . The spatio-temporal covariance estimation unit 211 calculates a power-weighted spatio-temporal covariance matrix of the target sound for all i∈{1,...,I}, f∈{1,...,F}.

The calculated matrix is output. If the processed signal y_{t,f} has never been obtained, λ_{t,f}^{(i)} obtained in step S211a is used for this calculation. On the other hand, if the processed signal y_{t,f} has already been obtained, λ_{t,f}^{(i)} obtained in step S211d is used for this calculation (step S211b). Furthermore, the spatio-temporal covariance estimation unit 211 calculates a further power-weighted spatio-temporal covariance matrix of the target sound for all i∈{1,...,I}, f∈{1,...,F}.

The calculated matrix is output. Here too, if the processed signal y_{t,f} has never been obtained, λ_{t,f}^{(i)} obtained in step S211a is used for this calculation. On the other hand, if the processed signal y_{t,f} has already been obtained, λ_{t,f}^{(i)} obtained in step S211d is used for this calculation (step S211c).
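As a rough illustration of what power-weighted spatio-temporal covariance matrices can look like, the following NumPy sketch stacks delayed frames of the acoustic signal and accumulates power-weighted outer products for one frequency bin. The exact definitions used in steps S211b and S211c are not reproduced in this text, so the formulas below, the prediction delay `delay`, the filter length `taps`, and the function names are all assumptions of this sketch, modeled on weighted-prediction-error style dereverberation.

```python
import numpy as np


def stack_past_frames(x, delay, taps):
    """Stack past frames of x (shape (N, M)) into x_bar (shape (N, M * taps)).

    Row t holds frames t-delay, t-delay-1, ..., t-delay-taps+1; frames before
    the start of the signal are treated as zero.
    """
    n_frames, n_mics = x.shape
    x_bar = np.zeros((n_frames, n_mics * taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < n_frames:
            x_bar[shift:, k * n_mics:(k + 1) * n_mics] = x[:n_frames - shift]
    return x_bar


def power_weighted_covariances(x, lam, delay=2, taps=3):
    """Assumed forms: R_bar = mean_t x_bar_t x_bar_t^H / lam_t,
    P = mean_t x_bar_t x_t^H / lam_t, for one source and one frequency bin."""
    x = np.asarray(x, dtype=complex)
    w = 1.0 / np.asarray(lam, dtype=float)           # per-frame weights, shape (N,)
    x_bar = stack_past_frames(x, delay, taps)
    R_bar = (x_bar.T * w) @ x_bar.conj() / len(w)     # (M*taps, M*taps)
    P = (x_bar.T * w) @ x.conj() / len(w)             # (M*taps, M)
    return R_bar, P
```

Under these assumptions the matrices have sizes (M·taps)×(M·taps) and (M·taps)×M, which depend only on the number of microphones and the filter length.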

音響信号ｘ_ｔ，ｆと、時空間共分散推定部２１１で得られた目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）と、ビームフォーマ推定部２１３で得られたビームフォーマを表す情報ｑ_ｆ ^（ｉ）は残響抑圧フィルタ推定部２１２に入力される。残響抑圧フィルタ推定部２１２は、これらを受け取り、前述の最適化基準に基づいて残響抑圧フィルタ（式(9')）を推定する。まず残響抑圧フィルタ推定部２１２は、

を計算する（ステップＳ２１２ａ）。次に残響抑圧フィルタ推定部２１２は、

を計算する。ただし（・）^*は（・）の複素共役を表す（ステップＳ２１２ｂ）。さらに残響抑圧フィルタ推定部２１２は、残響抑圧フィルタを特定する情報

を計算して出力する。ただし、（・）^＋は（・）のムーア-ペンローズ擬似逆行列（Moore-Penrose pseudo-inverse matrix）である（ステップＳ２１２ｃ）。 The acoustic signal x _t,f , the power-weighted spatio-temporal covariance matrices R ⁻ _x,f ⁽ⁱ⁾ and P _x,f ⁽ⁱ⁾ of the target sound obtained by the spatio-temporal covariance estimation unit 211, and the beamformer Information q _f ⁽ⁱ⁾ representing the beamformer obtained by the estimation section 213 is input to the dereverberation filter estimation section 212 . The dereverberation filter estimation unit 212 receives these and estimates a dereverberation filter (Equation (9')) based on the above-mentioned optimization criteria. First, the dereverberation filter estimation unit 212

calculates Ψ_f (step S212a). Next, the dereverberation filter estimation unit 212 calculates φ_f, where (·)^* denotes the complex conjugate of (·) (step S212b). Furthermore, the dereverberation filter estimation unit 212 calculates and outputs the information \bar{g}_f specifying the dereverberation filter, where (·)^+ denotes the Moore-Penrose pseudo-inverse of (·) (step S212c).
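One way the quantities of steps S212a to S212c could be formed is sketched below. The equations themselves are not reproduced in this text, so the expressions used here are an assumption: they are what results from minimizing the sum over sources and frames of |q^(i)H (x_t − G^H x̄_t)|² / λ_t^(i) with respect to G for fixed beamformers, using column-major vectorization. The function name and argument layout are hypothetical.

```python
import numpy as np


def estimate_dereverb_filter(R_list, P_list, q_list):
    """Solve for the stacked dereverberation filter with beamformers fixed.

    R_list[i]: (K, K) power-weighted covariance of stacked past frames.
    P_list[i]: (K, M) power-weighted cross-covariance with the current frame.
    q_list[i]: (M,) beamformer of source i.
    Returns G with shape (K, M).
    """
    K, M = P_list[0].shape
    Psi = np.zeros((K * M, K * M), dtype=complex)
    phi = np.zeros(K * M, dtype=complex)
    for R, P, q in zip(R_list, P_list, q_list):
        qq = np.outer(q, q.conj())                  # q q^H
        Psi += np.kron(qq.conj(), R)                # (q q^H)^* Kronecker R
        phi += (P @ qq).reshape(-1, order="F")      # vec(P q q^H), column-major
    g = np.linalg.pinv(Psi) @ phi                   # Moore-Penrose pseudo-inverse
    return g.reshape((K, M), order="F")
```

In this sketch Ψ has size (K·M)×(K·M) while each covariance matrix is only K×K, which is consistent with the remark that the covariance matrices are smaller than Ψ_f and φ_f. When the beamformers are the columns of the identity matrix and all sources share the same matrices, the result reduces to the ordinary solution R⁻¹P.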

残響抑圧フィルタ推定部２１２で得られたｇ^－ _ｆは残響抑圧フィルタ適用部２２１に入力される。残響抑圧フィルタ適用部２２１は、残響抑圧フィルタ推定部２１２で推定された残響抑圧フィルタを以下のように音響信号ｘ_ｔ，ｆに適用して残響抑圧信号z_t，ｆを得て出力する。

残響抑圧信号z_t，ｆは、ビームフォーマ推定部２１３およびビームフォーマ適用部２２２に送られる（ステップＳ２２１ｃ）。 g ⁻ _f obtained by the dereverberation filter estimation section 212 is input to the dereverberation filter application section 221. The dereverberation filter application unit 221 applies the dereverberation filter estimated by the dereverberation filter estimation unit 212 to the acoustic signal x _t,f as follows, to obtain and output the dereverberation signal z _t,f .

The dereverberation signal z _t,f is sent to the beamformer estimation section 213 and the beamformer application section 222 (step S221c).
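Applying a dereverberation filter of the linear-prediction type can be sketched as subtracting a filtered version of past frames from the current frame. The form z_t = x_t − G^H x̄_t used below is an assumption of this sketch (equation (9') is not reproduced in this text), and `x_bar` is assumed to hold the stacked past frames of the acoustic signal.

```python
import numpy as np


def apply_dereverb_filter(x, x_bar, G):
    """Return z with z_t = x_t - G^H x_bar_t for every frame t.

    x: (N, M) acoustic signal, x_bar: (N, K) stacked past frames,
    G: (K, M) dereverberation filter coefficients.
    """
    # Row form of G^H x_bar_t is x_bar_t^T conj(G).
    return x - x_bar @ G.conj()
```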

処理信号ｙ_ｔ，ｆが一度も得られていないのであれば、ビームフォーマ推定部２１３には、残響抑圧信号z_t，ｆ、補助情報ｓ＝γ_ｔ，ｆ ^（ｉ）、およびステップＳ２１１ａで得られたλ_ｔ，ｆ ^（ｉ）が入力される。一方、既に処理信号ｙ_ｔ，ｆが得られているのであれば、ビームフォーマ推定部２１３には、残響抑圧信号z_t，ｆ、補助情報ｓ＝γ_ｔ，ｆ ^（ｉ）、および処理信号ｙ_ｔ，ｆが入力される。ビームフォーマ推定部２１３は、これらを受け取り、前述の最適化基準に基づいてビームフォーマを推定する。
ビームフォーマ推定部２１３は、z_t，ｆおよびγ_ｔ，ｆ ^（ｉ）に基づいてＲＴＦｖ~_ｆ ^（ｉ）を得る。図６に例示するように、まずビームフォーマ推定部２１３のステアリングベクトル推定部２１３１がz_t，ｆおよびγ_ｔ，ｆ ^（ｉ）に基づいてステアリングベクトルｖ_ｆ ^（ｉ）を推定して出力する。例えば、ステアリングベクトルｖ_ｆ ^（ｉ）は、以下のように推定される。

さらにビームフォーマ推定部２１３のＲＴＦ推定部２１３２がｖ_ｆ ^（ｉ）からｖ~_ｆ ^（ｉ）を得る。例えば、ＲＴＦ推定部２１３２は式(3)に従ってｖ~_ｆ ^（ｉ）を得る（ステップＳ２１３ｃ）。
またビームフォーマ推定部２１３は、すべてのｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝について

を計算する。なお、処理信号ｙ_ｔ，ｆが一度も得られていないのであれば、この計算にはステップＳ２１１ａで得られたλ_ｔ，ｆ ^（ｉ）が用いられる。一方、既に処理信号ｙ_ｔ，ｆが得られているのであれば、この計算にはステップＳ２１１ｄで得られたλ_ｔ，ｆ ^（ｉ）が用いられる（ステップＳ２１３ｄ）。
さらにビームフォーマ推定部２１３は、すべてのｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝についてビームフォーマを特定する情報

を計算して出力する（ステップＳ２１３ｅ）。 If the processed signal y_{t,f} has never been obtained, the dereverberation signal z_{t,f}, the auxiliary information s=γ_{t,f}^{(i)}, and λ_{t,f}^{(i)} obtained in step S211a are input to the beamformer estimation unit 213. On the other hand, if the processed signal y_{t,f} has already been obtained, the dereverberation signal z_{t,f}, the auxiliary information s=γ_{t,f}^{(i)}, and the processed signal y_{t,f} are input to the beamformer estimation unit 213. The beamformer estimation unit 213 receives these and estimates the beamformer based on the aforementioned optimization criterion.
The beamformer estimation unit 213 obtains the RTF \tilde{v}_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. As illustrated in FIG. 6, first, the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates and outputs the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. For example, the steering vector v_f^{(i)} is estimated as follows.

Further, the RTF estimation unit 2132 of the beamformer estimation unit 213 obtains \tilde{v}_f^{(i)} from v_f^{(i)}. For example, the RTF estimation unit 2132 obtains \tilde{v}_f^{(i)} according to equation (3) (step S213c).
In addition, for all i∈{1,...,I}, f∈{1,...,F}, the beamformer estimation unit 213 performs the following calculation.

If the processed signal y_{t,f} has never been obtained, λ_{t,f}^{(i)} obtained in step S211a is used for this calculation. On the other hand, if the processed signal y_{t,f} has already been obtained, λ_{t,f}^{(i)} obtained in step S211d is used for this calculation (step S213d).
Furthermore, the beamformer estimation unit 213 calculates and outputs the information q_f^{(i)} specifying the beamformer for all i∈{1,...,I}, f∈{1,...,F} (step S213e).
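The flow of steps S213c to S213e (steering vector from a mask, normalization to an RTF, then a beamformer) can be sketched as follows for one source and one frequency bin. The concrete formulas are assumptions of this sketch, since the corresponding equations are not reproduced in this text: the steering vector is taken as the principal eigenvector of the mask-weighted covariance of z, the RTF is obtained by normalizing at a reference microphone `ref`, and the beamformer is a power-weighted minimum-power distortionless response filter. All function and parameter names are hypothetical.

```python
import numpy as np


def estimate_rtf(z, mask, ref=0):
    """Assumed mask-based estimate: principal eigenvector of the mask-weighted
    spatial covariance of z (shape (N, M)), normalized at microphone ref."""
    mask = np.asarray(mask, dtype=float)
    cov = (z.T * mask) @ z.conj() / max(mask.sum(), 1e-12)
    _, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    v = vecs[:, -1]                      # steering-vector estimate
    return v / v[ref]                    # relative transfer function


def estimate_beamformer(z, lam, rtf):
    """Assumed form: q = R^{-1} v / (v^H R^{-1} v), R = mean_t z_t z_t^H / lam_t."""
    lam = np.asarray(lam, dtype=float)
    R = (z.T / lam) @ z.conj() / len(lam)
    Rinv_v = np.linalg.solve(R, rtf)
    return Rinv_v / np.vdot(rtf, Rinv_v)
```

With this form the beamformer satisfies q^H ṽ = 1, that is, sound arriving along the estimated RTF is passed without distortion while the power-weighted output power is minimized.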

ビームフォーマ適用部２２２には、残響抑圧信号z_t，ｆ、およびビームフォーマを特定する情報ｑ_ｆ ^（ｉ）が入力される。ビームフォーマ適用部２２２は、以下のようにビームフォーマを残響抑圧信号z_t，ｆに適用して処理信号ｙ_ｔ，ｆ ^（ｉ）を得て出力する。

この処理はすべてのｉ∈｛１，…，Ｉ｝およびｆ∈｛１，…，Ｆ｝について行われ、ビームフォーマ適用部２２２はｙ_ｔ，ｆ＝［ｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）］^Ｔを得る（ステップＳ２２２ａ）。 The beamformer application unit 222 receives the dereverberation signal z _t,f and information q _f ⁽ⁱ⁾ specifying the beamformer. The beamformer application unit 222 applies a beamformer to the dereverberation signal z _t,f as follows to obtain and output the processed signal y _t,f ⁽ⁱ⁾ .

This process is performed for all i∈{1,...,I} and f∈{1,...,F}, and the beamformer application unit 222 obtains y_{t,f} = [y_{t,f}^{(1)},...,y_{t,f}^{(I)}]^T (step S222a).

制御部１３は前述の収束条件を充足したか否かを判定する（ステップＳ１３）。ここで収束条件を充足していない場合、時空間共分散推定部２１１およびビームフォーマ推定部２１３は、入力されたｙ_ｔ，ｆ ^（ｉ）を用いて

の計算を行って（ステップＳ２１１ｄ）、処理がステップＳ２１１ｂに戻る。これにより、時空間共分散行列推定部２１１の処理と、残響抑圧フィルタ推定部２１２の処理と、残響抑圧フィルタ適用部２２１の処理と、ビームフォーマ推定部２１３の処理と、ビームフォーマ適用部２２２の処理とが繰り返される。この繰り返しによって各値が更新される。一方、収束条件を充足している場合には、ビームフォーマ適用部２２２はｙ_ｔ，ｆ＝［ｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）］^Ｔを出力する（ステップＳ２２２ｂ）。 The control unit 13 determines whether the above-mentioned convergence condition is satisfied (step S13). If the convergence condition is not satisfied, the spatio-temporal covariance estimation unit 211 and the beamformer estimation unit 213 use the input y_{t,f}^{(i)} to calculate λ_{t,f}^{(i)} (step S211d), and the process returns to step S211b. As a result, the processing of the spatio-temporal covariance estimation unit 211, the dereverberation filter estimation unit 212, the dereverberation filter application unit 221, the beamformer estimation unit 213, and the beamformer application unit 222 is repeated. Each value is updated by this repetition. On the other hand, if the convergence condition is satisfied, the beamformer application unit 222 outputs y_{t,f} = [y_{t,f}^{(1)},...,y_{t,f}^{(I)}]^T (step S222b).
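The power update of step S211d can be illustrated with a minimal sketch. The update equation is not reproduced in this text, so the rule used here, taking the squared magnitude of the latest processed signal and flooring it to avoid division by zero in the subsequent weighting, is an assumption; `update_power` and `floor` are hypothetical names.

```python
def update_power(y, floor=1e-8):
    """Assumed update: lambda[i][t] = max(|y[i][t]|^2, floor).

    y[i][t] is the processed signal of source i at frame t for one frequency bin.
    """
    return [[max(abs(value) ** 2, floor) for value in row] for row in y]
```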

＜本実施形態の特徴＞
本実施形態では、目的音の情報を表す補助情報ｓを用い、残響抑圧と拡散性雑音抑圧と目的音源分離とを行う畳み込みビームフォーマを音響信号に適用して得られる信号が確率モデルに従って定まるという最適化基準に基づき、畳み込みビームフォーマを推定する。これにより、畳み込みビームフォーマを全体として最適化でき、より効果的な音声強調を実現できる。また、本実施形態では畳み込みビームフォーマを残響抑圧フィルタとビームフォーマとに分割し、推定の途中段階で得られる残響抑圧信号を用いてビームフォーマを推定することで、より効果的な音声強調を実現できる。さらに、残響抑圧フィルタの推定に必要な演算の大部分が目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）の演算である。ステップＳ２１１ｂ，Ｓ２１１ｃで得られる目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）のサイズはステップＳ２１２ａ，Ｓ２１２ｂで得られる行列Ψ_ｆ，φ_ｆのサイズよりも小さい。そのため、本実施形態では、残響抑圧フィルタの推定に必要な演算量を大幅に削減でき、少ない計算コストで音声強調を実現できる。 <Features of this embodiment>
In this embodiment, the convolutional beamformer is estimated using auxiliary information s, which represents information about the target sound, based on an optimization criterion under which the signal obtained by applying the convolutional beamformer, which performs dereverberation, diffuse noise suppression, and target sound source separation, to the acoustic signal follows a probabilistic model. As a result, the convolutional beamformer can be optimized as a whole, and more effective speech enhancement can be achieved. In addition, in this embodiment, the convolutional beamformer is divided into a dereverberation filter and a beamformer, and the beamformer is estimated using the dereverberation signal obtained partway through the estimation, which achieves more effective speech enhancement. Furthermore, most of the computation required to estimate the dereverberation filter is the computation of the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound. The sizes of these matrices, obtained in steps S211b and S211c, are smaller than the sizes of Ψ_f and φ_f obtained in steps S212a and S212b. Therefore, in this embodiment, the amount of computation required to estimate the dereverberation filter can be significantly reduced, and speech enhancement can be achieved at low computational cost.

［第２実施形態の変形例１］
第２実施形態では、残響抑圧フィルタ推定部２１２がビームフォーマを固定して残響抑圧フィルタ（式(9')）を推定し、ビームフォーマ推定部２１３が残響抑圧フィルタを固定してビームフォーマ（式(10')）を推定する処理を繰り返す。この処理では、残響抑圧フィルタ推定部２１２がビームフォーマを残響抑圧信号に適応してＩ次元の処理信号ｙ_ｔ，ｆ＝［ｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）］^Ｔを得、Ｉ次元の処理信号ｙ_ｔ，ｆが次の残響抑圧フィルタの推定に用いられる。しかし、Ｉ≦Ｍであるため、Ｉ次元の処理信号ｙ_ｔ，ｆはＭ次元の音響信号ｘ_ｔ，ｆよりも圧縮され、情報が失われている。この情報の損失に起因して、残響抑圧フィルタやビームフォーマが最適解ではなく、準最適解となってしまう場合がある。この問題を解決するため、本実施形態ではｉ＝１，…，Ｉのｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）に対応する目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）に加え、ｉ＝Ｉ＋１，…，Ｍに対応する非目的音のパワー重み付き時空間共分散行列も計算して残響抑圧フィルタの推定に用いる。 [Modification 1 of the second embodiment]
In the second embodiment, a process is repeated in which the dereverberation filter estimation unit 212 fixes the beamformer and estimates the dereverberation filter (equation (9')), and the beamformer estimation unit 213 fixes the dereverberation filter and estimates the beamformer (equation (10')). In this process, the dereverberation filter estimation unit 212 applies the beamformer to the dereverberation signal to obtain the I-dimensional processed signal y_{t,f} = [y_{t,f}^{(1)},...,y_{t,f}^{(I)}]^T, and the I-dimensional processed signal y_{t,f} is used to estimate the next dereverberation filter. However, since I≦M, the I-dimensional processed signal y_{t,f} is compressed relative to the M-dimensional acoustic signal x_{t,f}, and information is lost. Because of this loss of information, the dereverberation filter and the beamformer may end up as a suboptimal solution rather than the optimal solution. To solve this problem, in this embodiment, in addition to the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound corresponding to y_{t,f}^{(1)},...,y_{t,f}^{(I)} for i=1,...,I, the power-weighted spatio-temporal covariance matrices of the non-target sound corresponding to i=I+1,...,M are also calculated and used to estimate the dereverberation filter.

＜機能構成＞
図２に例示するように、本変形例の信号処理装置２’は、時空間共分散推定部２１１’、残響抑圧フィルタ推定部２１２’、ビームフォーマ推定部２１３、残響抑圧フィルタ適用部２２１、ビームフォーマ適用部２２２、および制御部１３を有し、制御部１３の制御の下で各処理を実行する。ここで、時空間共分散推定部２１１’、残響抑圧フィルタ推定部２１２’、およびビームフォーマ推定部２１３は、畳み込みビームフォーマ推定部を構成する。残響抑圧フィルタ適用部２２１およびビームフォーマ適用部２２２は、畳み込みビームフォーマ適用部を構成する。 <Functional configuration>
As illustrated in FIG. 2, the signal processing device 2' of this modification includes a spatio-temporal covariance estimation unit 211', a dereverberation filter estimation unit 212', a beamformer estimation unit 213, a dereverberation filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211', the dereverberation filter estimation unit 212', and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.

＜処理＞
図２を用いて本変形例の処理を説明する。
本変形例では、時空間共分散行列推定部２１１に代えて時空間共分散行列推定部２１１’が、目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）に加え、さらに非目的音のパワー重み付き時空間共分散行列も生成する。さらに、残響抑圧フィルタ推定部２１２に代えて残響抑圧フィルタ推定部２１２’が、目的音のパワー重み付き時空間共分散行列Ｒ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）と目的音を推定するための１≦ｉ≦Ｉに対応するビームフォーマを表す情報ｑ_ｆ ^（ｉ）とに加え、さらに非目的音のパワー重み付き時空間共分散行列と非目的音を推定するためのＩ＜ｉ≦Ｍに対応するビームフォーマを表す情報ｑ_ｆ ^（ｉ）を受け取り、前述の最適化基準に基づいて残響抑圧フィルタを推定する。ビームフォーマ推定部２１３に変えてビームフォーマ推定部２１３’が、目的音を推定するための１≦ｉ≦Ｉに対応するビームフォーマを表す情報ｑ_ｆ ^（ｉ）に加え、さらに非目的音を推定するためのＩ＜ｉ≦Ｍに対応するビームフォーマを表す情報ｑ_ｆ ^（ｉ）をも生成する。ビームフォーマ適用部２２２に代えてビームフォーマ適用部２２２’が、目的音の推定値ｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）に加えて、非目的音の推定値ｙ_ｔ，ｆ ^⊥を生成する。その他は第２実施形態と同じである。 <Processing>
The processing of this modification will be explained using FIG. 2.
In this modification, instead of the spatio-temporal covariance estimation unit 211, the spatio-temporal covariance estimation unit 211' generates, in addition to the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, power-weighted spatio-temporal covariance matrices of the non-target sound. Further, instead of the dereverberation filter estimation unit 212, the dereverberation filter estimation unit 212' receives, in addition to the power-weighted spatio-temporal covariance matrices \bar{R}_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound and the information q_f^{(i)} representing the beamformers corresponding to 1≦i≦I for estimating the target sound, the power-weighted spatio-temporal covariance matrices of the non-target sound and the information q_f^{(i)} representing the beamformers corresponding to I<i≦M for estimating the non-target sound, and estimates the dereverberation filter based on the aforementioned optimization criterion. Instead of the beamformer estimation unit 213, the beamformer estimation unit 213' generates, in addition to the information q_f^{(i)} representing the beamformers corresponding to 1≦i≦I for estimating the target sound, the information q_f^{(i)} representing the beamformers corresponding to I<i≦M for estimating the non-target sound. Instead of the beamformer application unit 222, the beamformer application unit 222' generates, in addition to the estimates y_{t,f}^{(1)},...,y_{t,f}^{(I)} of the target sound, the estimate y_{t,f}^⊥ of the non-target sound. The rest is the same as in the second embodiment.

以下、図３～図５を用いて本変形例の処理を詳細に説明する。
まず、信号処理装置２に代えて信号処理装置２’が、図３に示すステップＳ２１３ａ，Ｓ２２１ａ，Ｓ２１１ａ，Ｓ２１３ｂ，Ｓ２２１ｂ，Ｓ２１１ｂ，Ｓ２１１ｃ，Ｓ２１２ａ，Ｓ２１２ｂの処理を実行する。ただし、時空間共分散推定部２１１の処理は、時空間共分散推定部２１１に代えて時空間共分散推定部２１１’が実行する。ステップＳ２１１ａにおいて、時空間共分散推定部２１１’は、目的音のパワーλ_ｔ，ｆ ^（１），…，λ_ｔ，ｆ ^（Ｉ）に加えて、非目的音のパワーλ_ｔ，ｆ ^⊥も目的音のパワーと同様の方法で初期化する。ビームフォーマ推定部２１３の処理は、ビームフォーマ推定部２１３に代えてビームフォーマ推定部２１３’が実行する。ステップＳ２１３ｂにおいて、ビームフォーマ推定部２１３’は、すべてのｉ∈｛１，…，Ｍ｝，ｆ∈｛１，…，Ｆ｝についてｑ_ｆ ^（ｉ）を初期化する。例えば、ビームフォーマ推定部２１３はＩ_Ｍのｉ番目の列をｑ_ｆ ^（ｉ）とする。 The processing of this modification will be described in detail below using FIGS. 3 to 5.
First, instead of the signal processing device 2, the signal processing device 2' executes steps S213a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 3. However, the processing of the spatio-temporal covariance estimation unit 211 is executed by the spatio-temporal covariance estimation unit 211' instead. In step S211a, the spatio-temporal covariance estimation unit 211' initializes, in addition to the powers λ_{t,f}^{(1)},...,λ_{t,f}^{(I)} of the target sound, the power λ_{t,f}^⊥ of the non-target sound in the same manner as the power of the target sound. The processing of the beamformer estimation unit 213 is executed by the beamformer estimation unit 213' instead. In step S213b, the beamformer estimation unit 213' initializes q_f^{(i)} for all i∈{1,...,M}, f∈{1,...,F}. For example, the beamformer estimation unit 213 sets q_f^{(i)} to the i-th column of I_M.

時空間共分散推定部２１１’は、まだ、ｙ_ｔ，ｆ ^（i）が一度も得られていないのであれば、Ｓ２１１ａで得られたλ_ｔ，ｆ ^（１），…，λ_ｔ，ｆ ^（Ｉ）とλ_ｔ，ｆ ^⊥を用いる。一方、ｙ_ｔ，ｆ ^（i）が得られているのであれば、時空間共分散推定部２１１’に、ｙ_ｔ，ｆ ^（１），…，ｙ_ｔ，ｆ ^（Ｉ）とｙ_ｔ，ｆ ^⊥が入力されるので、λ_ｔ，ｆ ^（１），…，λ_ｔ，ｆ ^（Ｉ）、および、λ_ｔ，ｆ ^⊥をステップＳ２１１ｄにより得ることができる。
次に時空間共分散推定部２１１’は、ｘ_ｔ，ｆおよびλ_ｔ，ｆ ^⊥を用い、非目的音のパワー重み付き時空間共分散行列

を計算して出力する（ステップＳ２１１ｂ’）。
さらに時空間共分散推定部２１１’は、ｘ_ｔ，ｆおよびλ_ｔ，ｆ ^⊥を用い、非目的音のパワー重み付き時空間共分散行列

を計算して出力する（ステップＳ２１１ｃ’）。 If y_{t,f}^{(i)} has not yet been obtained, the spatio-temporal covariance estimation unit 211' uses λ_{t,f}^{(1)},...,λ_{t,f}^{(I)} and λ_{t,f}^⊥ obtained in step S211a. On the other hand, if y_{t,f}^{(i)} has been obtained, y_{t,f}^{(1)},...,y_{t,f}^{(I)} and y_{t,f}^⊥ are input to the spatio-temporal covariance estimation unit 211', so that λ_{t,f}^{(1)},...,λ_{t,f}^{(I)} and λ_{t,f}^⊥ can be obtained in step S211d.
Next, the spatio-temporal covariance estimation unit 211' uses x_{t,f} and λ_{t,f}^⊥ to calculate and output a power-weighted spatio-temporal covariance matrix of the non-target sound (step S211b').
Further, the spatio-temporal covariance estimation unit 211' uses x_{t,f} and λ_{t,f}^⊥ to calculate and output another power-weighted spatio-temporal covariance matrix of the non-target sound (step S211c').

残響抑圧フィルタ推定部２１２’はＲ^－ _ｘ，ｆ ^⊥およびｑ_ｆ ^（ｉ）を受け取り、

を計算する（ステップＳ２１２ａ’）。さらに残響抑圧フィルタ推定部２１２’はＰ^⊥ _ｘ，ｆおよびｑ_ｆ ^（ｉ）を受け取り、

を計算する（ステップＳ２１２ｂ’）。 The dereverberation filter estimation unit 212' receives \bar{R}_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212a'. The dereverberation filter estimation unit 212' further receives P_{x,f}^⊥ and q_f^{(i)} and performs the calculation of step S212b'.

その後、信号処理装置２に代えて信号処理装置２’が、図５に示すステップＳ２１２ｃ，Ｓ２２１ｃ，Ｓ２１３ｃ，Ｓ２１３ｄ，Ｓ２１３ｅ，Ｓ２２２ａ，Ｓ１３，Ｓ２１１ｄ，Ｓ２２２ｂの処理を実行する。ただし、第２実施形態で説明した残響抑圧フィルタ推定部２１２の処理は、残響抑圧フィルタ推定部２１２に代えて残響抑圧フィルタ推定部２１２’が実行する。ビームフォーマ推定部２１３の処理は、ビームフォーマ推定部２１３に代えてビームフォーマ推定部２１３’が実行する。ビームフォーマ適用部２２２の処理は、ビームフォーマ適用部２２２に代えてビームフォーマ適用部２２２’が実行する。
ビームフォーマ推定部２１３’は、ステップＳ２１３において、ｉ∈｛１，…，Ｉ｝，ｆ∈｛１，…，Ｆ｝についてｑ_ｆ ^（ｉ）を推定するのに加えて、ｉ∈｛Ｉ＋１，…，Ｍ｝，ｆ∈｛１，…，Ｆ｝に関するｑ_ｆ ^（ｉ）をも生成する。例えば、各ｆにおいて、ｉ∈｛１，…，Ｉ｝に対するｑ_ｆ ^（ｉ）が張る線形空間に対し、その補空間を張るベクトルとして、ｉ∈｛Ｉ＋１，…，Ｍ｝に関するｑ_ｆ ^（ｉ）を生成する。補空間を張るベクトルとしては、例えば、その補空間の正規直交規定を採用すればよいし、それ以外でもよい。 Thereafter, the signal processing device 2' instead of the signal processing device 2 executes steps S212c, S221c, S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. However, the processing of the dereverberation filter estimation section 212 described in the second embodiment is executed by the dereverberation filter estimation section 212' instead of the dereverberation filter estimation section 212. The processing of the beamformer estimation section 213 is executed by the beamformer estimation section 213' instead of the beamformer estimation section 213. The processing of the beamformer application section 222 is executed by the beamformer application section 222' instead of the beamformer application section 222.
In step S213, in addition to estimating q_f^{(i)} for i∈{1,...,I}, f∈{1,...,F}, the beamformer estimation unit 213' also generates q_f^{(i)} for i∈{I+1,...,M}, f∈{1,...,F}. For example, for each f, the q_f^{(i)} for i∈{I+1,...,M} are generated as vectors spanning the complement of the linear space spanned by the q_f^{(i)} for i∈{1,...,I}. As the vectors spanning the complement, an orthonormal basis of that complement may be adopted, for example, or other vectors may be used.
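Generating vectors that span a complement of the space spanned by the target-sound beamformers can be sketched with a singular value decomposition. This sketch takes the orthogonal complement and returns an orthonormal basis of it, which is only one of the choices the description allows; the function name and the rank threshold are assumptions made here.

```python
import numpy as np


def complement_basis(Q_target):
    """Return an orthonormal basis of the orthogonal complement of span(Q_target).

    Q_target: (M, I) matrix whose columns are the target-sound beamformers for
    one frequency bin. The result has shape (M, M - rank(Q_target)).
    """
    u, s, _ = np.linalg.svd(Q_target, full_matrices=True)
    rank = int(np.sum(s > 1e-10 * s.max())) if s.max() > 0 else 0
    # Left singular vectors beyond the rank are orthogonal to every column of Q_target.
    return u[:, rank:]
```

The columns of the returned matrix could then serve as q_f^{(i)} for i∈{I+1,...,M}.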

＜本変形例の特徴＞
本変形例では、目的音のパワー重み付き時空間共分散行列だけではなく、非目的音のパワー重み付き時空間共分散行列も計算して残響抑圧フィルタの推定に用いるため、残響抑圧フィルタの推定精度が向上する。 <Characteristics of this modification>
In this modification, not only the power-weighted spatio-temporal covariance matrices of the target sound but also those of the non-target sound are calculated and used for estimating the dereverberation filter, which improves the estimation accuracy of the dereverberation filter.

［第２実施形態の変形例２］
第２実施形態の変形例２では、いずれか一つの源信号ｉに対応する処理信号ｙ_ｔ，ｆ ^（ｉ）のみを得て出力する。本変形例の残響抑圧フィルタは式(12)のものであり、ビームフォーマは式(10')のものである。 [Modification 2 of the second embodiment]
In Modification 2 of the second embodiment, only the processed signal y_{t,f}^{(i)} corresponding to one of the source signals i is obtained and output. The dereverberation filter of this modification is that of equation (12), and the beamformer is that of equation (10').

＜機能構成＞
図２に例示するように、本変形例の信号処理装置２”は、時空間共分散推定部２１１、残響抑圧フィルタ推定部２１２”、ビームフォーマ推定部２１３、残響抑圧フィルタ適用部２２１、ビームフォーマ適用部２２２、および制御部１３を有し、制御部１３の制御の下で各処理を実行する。ここで、時空間共分散推定部２１１、残響抑圧フィルタ推定部２１２”、およびビームフォーマ推定部２１３は、畳み込みビームフォーマ推定部を構成する。残響抑圧フィルタ適用部２２１およびビームフォーマ適用部２２２は、畳み込みビームフォーマ適用部を構成する。 <Functional configuration>
As illustrated in FIG. 2, the signal processing device 2'' of this modification includes a spatio-temporal covariance estimation unit 211, a dereverberation filter estimation unit 212'', a beamformer estimation unit 213, a dereverberation filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 211, the dereverberation filter estimation unit 212'', and the beamformer estimation unit 213 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.

＜処理＞
図７および図８を用いて本変形例の処理を詳細に説明する。
まず、信号処理装置２に代えて信号処理装置２”が、図７に示すステップＳ２１３ａ，Ｓ２２１ａ，Ｓ２１１ａ，Ｓ２１１ｂ，Ｓ２１１ｃの処理を実行する。 <Processing>
The processing of this modified example will be explained in detail using FIGS. 7 and 8.
First, instead of the signal processing device 2, the signal processing device 2'' executes the processes of steps S213a, S221a, S211a, S211b, and S211c shown in FIG. 7.

次に、残響抑圧フィルタ推定部２１２に代えて残響抑圧フィルタ推定部２１２”がＲ^－ _ｘ，ｆ ^（ｉ）およびＰ_ｘ，ｆ ^（ｉ）を受け取り、残響除去フィルタに対応する情報

を計算して出力する（ステップＳ２１２ｃ”）。 Next, the dereverberation filter estimator 212'' instead of the dereverberation filter estimator 212 receives R ⁻ _x,f ⁽ⁱ⁾ and P _x,f ⁽ⁱ⁾ , and calculates information corresponding to the dereverberation filter.

The calculated information \bar{G}_f^{(i)} is output (step S212c'').

残響抑圧フィルタ適用部２１２”は、残響除去フィルタに対応する情報Ｇ^－ _ｆ ^（ｉ）と音響信号ｘ_ｔ，ｆとを受け取り、以下のように残響抑圧フィルタを音響信号ｘ_ｔ，ｆに適用して残響抑圧信号ｚ_ｔ，ｆを得て出力する（ステップＳ２２１ｃ”）。

The dereverberation filter application unit 212'' receives the information \bar{G}_f^{(i)} corresponding to the dereverberation filter and the acoustic signal x_{t,f}, applies the dereverberation filter to the acoustic signal x_{t,f}, and obtains and outputs the dereverberation signal z_{t,f} (step S221c'').

その後、信号処理装置２に代えて信号処理装置２”が、図８に示すステップＳ２１３ｃ，Ｓ２１３ｄ，Ｓ２１３ｅ，Ｓ２２２ａ，Ｓ１３，Ｓ２１１ｄ，Ｓ２２２ｂの処理を実行する。 Thereafter, instead of the signal processing device 2, the signal processing device 2'' executes the processes of steps S213c, S213d, S213e, S222a, S13, S211d, and S222b shown in FIG. 8.

［第３実施形態］
次に本発明の第３実施形態を説明する。本実施形態では、補助情報が目的音のパワーを特定する情報を含む。これにより、繰り返し処理を省略できる。 [Third embodiment]
Next, a third embodiment of the present invention will be described. In this embodiment, the auxiliary information includes information specifying the power of the target sound. This makes it possible to omit repeated processing.

<Functional configuration>
As illustrated in FIG. 9, the signal processing device 3 of this embodiment includes a spatio-temporal covariance estimation unit 311, a dereverberation filter estimation unit 212, a beamformer estimation unit 313, a dereverberation filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the dereverberation filter estimation unit 212, and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.

<Processing>
The processing is described in detail below with reference to FIGS. 10, 11, and 4.
First, auxiliary information s = {s_1, s_2} is input to the signal processing device 3. The auxiliary information s of this embodiment includes a time-frequency mask s_1 = γ_{t,f}^{(i)} and information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound. The time-frequency mask s_1 = γ_{t,f}^{(i)} is input to the beamformer estimation unit 313, and the information s_2 = λ_{t,f}^{(i)} specifying the power of the target sound is input to the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313 (step S313a).

As illustrated in FIGS. 10, 11, and 4, in place of the signal processing device 2, the signal processing device 3 executes steps S221a, S211a, S213b, S221b, S211b, S211c, S212a, S212b, S212c, S221c, S213c, S213d, S213e, S222a, and S222b. However, the processing of the spatio-temporal covariance estimation unit 211 and the processing of the beamformer estimation unit 213 described in the second embodiment are executed by the spatio-temporal covariance estimation unit 311 and the beamformer estimation unit 313, respectively. In addition, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d. As a result, the processed signal y_{t,f}^{(i)}, in which dereverberation, diffuse noise suppression, and target sound source separation have been applied to the acoustic signal x_{t,f}, is obtained without iterative processing.

<Features of this embodiment>
In this embodiment, because the power of the target sound is given to the signal processing device 3 as auxiliary information, highly accurate speech enhancement can be performed without iterative processing.
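The formulas of steps S211b and S211c are not reproduced here. As a hedged illustration of why a known target power removes the need for iteration, the sketch below assumes the common power-weighted form R̄ = Σ_t x̄_t x̄_t^H / λ_t and P = Σ_t x̄_t x_t^H / λ_t for one frequency bin. When λ is supplied as auxiliary information, both matrices follow from a single pass over the frames. The function name, the argument layout, and the power floor are hypothetical.

```python
import numpy as np

def power_weighted_covariances(x, x_bar, lam, floor=1e-10):
    """Power-weighted spatio-temporal covariances for one frequency bin.

    x     : (T, M) complex frames of the acoustic signal.
    x_bar : (T, K) stacked delayed frames (assumed to be given).
    lam   : (T,) target-sound power supplied as auxiliary information.
    floor : hypothetical lower bound that avoids division by zero.
    Returns R_bar with shape (K, K) and P with shape (K, M).
    """
    weights = 1.0 / np.maximum(lam, floor)
    weighted = x_bar * weights[:, None]
    R_bar = weighted.T @ x_bar.conj()   # sum_t x_bar_t x_bar_t^H / lam_t
    P = weighted.T @ x.conj()           # sum_t x_bar_t x_t^H / lam_t
    return R_bar, P
```

If λ is instead unknown, it typically has to be re-estimated from the current output and the covariances recomputed, which is the iteration that this embodiment avoids.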

[Modification 1 of the third embodiment]
As in Modification 1 of the second embodiment, in the third embodiment the power-weighted spatio-temporal covariance matrices of the non-target sound may also be calculated, in addition to the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound, and used for estimating the dereverberation filter.

<Functional configuration>
As illustrated in FIG. 9, the signal processing device 3' of this modification includes a spatio-temporal covariance estimation unit 311', a dereverberation filter estimation unit 212', a beamformer estimation unit 313, a dereverberation filter application unit 221, a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311', the dereverberation filter estimation unit 212', and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221 and the beamformer application unit 222 constitute a convolutional beamformer application unit.

<Processing>
The processing of this modification is described in detail below with reference to FIGS. 10, 11, and 4.
In place of the signal processing device 3, the signal processing device 3' executes steps S313a, S221a, S211a, S213b, S221b, S211b, S211c, S212a, and S212b shown in FIG. 10 as described in the third embodiment.

Next, in place of the signal processing device 2', the signal processing device 3' executes steps S211b', S211c', S212a', and S212b' shown in FIG. 4. However, the processing of the spatio-temporal covariance estimation unit 211' described in Modification 1 of the second embodiment is executed by the spatio-temporal covariance estimation unit 311'. In addition, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b' and S211c'.

In place of the signal processing device 3, the signal processing device 3' then executes steps S212c, S221c, S213c, S213d, S213e, S222a, and S222b shown in FIG. 11 as described in the third embodiment.

[Modification 2 of the third embodiment]
As in Modification 2 of the second embodiment, in the third embodiment as well, only the processed signal y_{t,f}^{(i)} corresponding to a single source signal i may be obtained.

<Functional configuration>
As illustrated in FIG. 9, the signal processing device 3'' of this modification includes a spatio-temporal covariance estimation unit 311, a dereverberation filter estimation unit 212'', a beamformer estimation unit 313, a dereverberation filter application unit 221'', a beamformer application unit 222, and a control unit 13, and executes each process under the control of the control unit 13. Here, the spatio-temporal covariance estimation unit 311, the dereverberation filter estimation unit 212'', and the beamformer estimation unit 313 constitute a convolutional beamformer estimation unit. The dereverberation filter application unit 221'' and the beamformer application unit 222 constitute a convolutional beamformer application unit.

<Processing>
The processing of this modification is described in detail below with reference to FIGS. 12 and 13.
In place of the signal processing device 3, the signal processing device 3'' executes steps S313a, S221a, S211b, and S211c shown in FIGS. 12 and 13 as described in the third embodiment. Further, in place of the dereverberation filter estimation unit 212'' of the signal processing device 2'', the dereverberation filter estimation unit 212'' of the signal processing device 3'' executes steps S212c'' and S221c'' described in Modification 2 of the second embodiment. Thereafter, in place of the signal processing device 2, the signal processing device 3'' executes steps S213c, S213d, S213e, S222a, and S222b described in the second embodiment. However, the auxiliary information s_2 = λ_{t,f}^{(i)} is used in the calculations of steps S211b, S211c, and S213d.

[Fourth embodiment]
As explained in Modification 1 of the second embodiment, the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound obtained in steps S211b and S211c are smaller than the matrices Ψ_f and φ_f obtained in steps S212a and S212b, so each of the above embodiments and modifications achieves speech enhancement at a low computational cost. Although this benefit is then lost, in steps S212a and S212b,

instead of

may be executed. Here, the following is satisfied.

<Comparative experiment>
Comparison results between the fourth embodiment and Modifications 1 and 2 of the second embodiment are illustrated below. Experiments were conducted with the following two configurations, Config-1 and Config-2. A short-time Fourier transform was used for frequency division. A Hann window was used as the window function, and the frame length and shift width were set to 32 ms and 8 ms, respectively. In addition, Δ = 4.
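The analysis settings above can be sketched in Python as follows. The sampling rate is not stated in this passage, so 16 kHz is assumed here, which gives 512-sample frames and a 128-sample shift. The function name and the use of a periodic Hann window are likewise illustrative assumptions.

```python
import numpy as np

def stft_frames(signal, fs=16000, frame_ms=32, shift_ms=8):
    """Short-time Fourier transform with a Hann window.

    signal : 1-D real array, at least one frame long.
    fs     : sampling rate in Hz (16 kHz is an assumption).
    Returns a complex array of shape (n_frames, n_bins).
    """
    frame_len = int(round(fs * frame_ms / 1000))
    hop = int(round(fs * shift_ms / 1000))
    # Periodic Hann window of length frame_len.
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    index = n[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(signal[index] * window, axis=-1)
```

Under these assumptions each frame yields 257 frequency bins, and consecutive frames overlap by 75 percent.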

FIG. 15 shows the word error rates obtained when speech recognition was performed on the processed signals produced by the fourth embodiment and by Modifications 1 and 2 of the second embodiment for the two configurations Config-1 and Config-2. The horizontal axis of FIG. 15 represents the number of iterations (#iterations), and the vertical axis represents the word error rate (WER (%)). As FIG. 15 shows, the methods of Modifications 1 and 2 of the second embodiment improve the recognition performance for speech recorded in a noisy, reverberant, multi-source environment more than the method of the fourth embodiment does.

The computation time required to process a mixed signal 9.44 s long with the method of the fourth embodiment and with the methods of Modifications 1 and 2 of the second embodiment is illustrated below for the two configurations Config-1 and Config-2.

It can be seen that the methods of Modifications 1 and 2 of the second embodiment perform dereverberation, diffuse noise suppression, and target sound source separation with less computation than the method of the fourth embodiment.

[Hardware configuration]
The signal processing devices 1, 2, 2', 2'', 3, 3', and 3'' in each embodiment are, for example, devices configured by having a general-purpose or dedicated computer execute a predetermined program, the computer including a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory). The computer may include a single processor and memory, or multiple processors and memories. The program may be installed on the computer or may be recorded in ROM or the like in advance. In addition, some or all of the processing units may be configured using electronic circuitry that implements processing functions on its own, rather than circuitry such as a CPU that implements a functional configuration by reading a program. Further, the electronic circuitry constituting one device may include a plurality of CPUs.

FIG. 6 is a block diagram illustrating the hardware configuration of the signal processing devices 1, 2, 2', 2'', 3, 3', and 3'' in each embodiment. As illustrated in FIG. 6, the signal processing devices 1, 2, 2', 2'', 3, 3', and 3'' of this example each include a CPU (Central Processing Unit) 10a, an input unit 10b, an output unit 10c, a RAM (Random Access Memory) 10d, a ROM (Read Only Memory) 10e, an auxiliary storage device 10f, and a bus 10g. The CPU 10a of this example has a control unit 10aa, an arithmetic unit 10ab, and a register 10ac, and executes various arithmetic processes according to the programs read into the register 10ac. The input unit 10b is an input terminal to which data is input, a keyboard, a mouse, a touch panel, or the like. The output unit 10c is an output terminal from which data is output, a display, a LAN card controlled by the CPU 10a that has read a predetermined program, or the like. The RAM 10d is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 10da in which a predetermined program is stored and a data area 10db in which various data are stored. The auxiliary storage device 10f is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 10fa in which a predetermined program is stored and a data area 10fb in which various data are stored. The bus 10g connects the CPU 10a, the input unit 10b, the output unit 10c, the RAM 10d, the ROM 10e, and the auxiliary storage device 10f so that they can exchange information. In accordance with the loaded OS (Operating System) program, the CPU 10a writes the program stored in the program area 10fa of the auxiliary storage device 10f to the program area 10da of the RAM 10d. Similarly, the CPU 10a writes the various data stored in the data area 10fb of the auxiliary storage device 10f to the data area 10db of the RAM 10d. The addresses on the RAM 10d at which the program and data were written are then stored in the register 10ac of the CPU 10a. The control unit 10aa of the CPU 10a sequentially reads these addresses from the register 10ac, reads the program and data from the areas of the RAM 10d indicated by the addresses, causes the arithmetic unit 10ab to sequentially execute the operations specified by the program, and stores the results in the register 10ac. With this configuration, the functional configuration of the signal processing devices 1, 2, 2', 2'', 3, 3', and 3'' is realized.

The above program can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network. As described above, a computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program. Alternatively, each time a program is transferred from the server computer to the computer, the computer may execute the process according to the received program. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In each embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be implemented in hardware.

Note that the present invention is not limited to the above-described embodiments. For example, in each of the above embodiments and modifications, the acoustic signal x_{t,f} was obtained by frequency-dividing a signal obtained by observing a mixture of diffuse noise and source signals emitted from sound sources. However, this does not limit the present invention. For example, x_{t,f} may be an acoustic signal obtained by applying some signal processing (such as filtering) to the signal obtained by frequency-dividing the observed mixed signal. Alternatively, x_{t,f} may be an acoustic signal obtained by frequency-dividing a signal obtained by applying some signal processing to the observed mixed signal. Alternatively, x_{t,f} may be an acoustic signal obtained by applying some signal processing to the observed mixed signal, frequency-dividing the result, and then applying further signal processing. Further, the various processes described above need not be executed in chronological order as described; they may be executed in parallel or individually, depending on the processing capacity of the device that executes them or as otherwise necessary.

In the second embodiment, the dereverberated signal z_{t,f} and the auxiliary information s = γ_{t,f}^{(i)} (time-frequency mask) are input to the beamformer estimation unit 213, and the steering vector estimation unit 2131 of the beamformer estimation unit 213 estimates the steering vector v_f^{(i)} based on z_{t,f} and γ_{t,f}^{(i)}. However, the auxiliary information s may include the steering vector v_f^{(i)} itself. In this case, the steering vector estimation unit 2131 can be omitted, and the RTF estimation unit 2132 of the beamformer estimation unit 213 may obtain ṽ_f^{(i)} from the v_f^{(i)} included in the auxiliary information s. Furthermore, even if the auxiliary information s includes neither the time-frequency mask γ_{t,f}^{(i)} nor the steering vector v_f^{(i)}, the steering vector v_f^{(i)} can be estimated from the dereverberated signal z_{t,f} and the auxiliary information s as long as the auxiliary information s includes a reference sound of the target sound. That is, as illustrated in FIG. 6, the time-frequency mask estimation unit 2130 of the beamformer estimation unit 213 may first receive the dereverberated signal z_{t,f} and the auxiliary information s (the reference sound of the target sound), estimate the time-frequency mask γ_{t,f}^{(i)} by the method described in Reference 3, and input it to the steering vector estimation unit 2131. The RTF ṽ_f^{(i)} itself may also be input to the beamformer estimation unit 213 as the auxiliary information s = ṽ_f^{(i)}. In short, as long as information from which the RTF ṽ_f^{(i)} can be calculated is input to the beamformer estimation unit 213 as the auxiliary information s, the beamformer estimation unit 213 can estimate the beamformer.
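The chain from mask to steering vector to RTF to beamformer can be sketched as follows for one frequency bin. This is an assumption-laden illustration rather than the patent's procedure: it takes the principal eigenvector of the mask-weighted spatial covariance as the steering vector, normalizes by a reference microphone to obtain the RTF, and forms a beamformer under the distortionless constraint w^H ṽ = 1. All function names and the choice of reference microphone are hypothetical.

```python
import numpy as np

def steering_vector_from_mask(z, mask):
    """Estimate a steering vector from masked frames.

    z    : (T, M) complex dereverberated frames.
    mask : (T,) time-frequency mask values for the target source.
    Returns the principal eigenvector of sum_t mask_t z_t z_t^H.
    """
    weights = mask / max(float(mask.sum()), 1e-10)
    cov = (z * weights[:, None]).T @ z.conj()
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1]

def relative_transfer_function(v, ref=0):
    """Normalize the steering vector by its reference-microphone element."""
    return v / v[ref]

def distortionless_beamformer(R, v_tilde):
    """Return w = R^{-1} v / (v^H R^{-1} v), which satisfies w^H v = 1."""
    numerator = np.linalg.solve(R, v_tilde)
    return numerator / (v_tilde.conj() @ numerator)

def apply_beamformer(z, w):
    """Compute y_t = w^H z_t for every frame."""
    return z @ w.conj()
```

When the auxiliary information already contains the steering vector or the RTF, the first one or two functions are simply skipped, which mirrors the omission of the corresponding estimation units described above.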

In addition, in steps S212a' and S212b', the power-weighted spatio-temporal covariance matrices R̄_{x,f}^{(i)} and P_{x,f}^{(i)} of the target sound may be used in place of the power-weighted spatio-temporal covariance matrices R̄_{x,f}^⊥ and P_{x,f}^⊥ of the non-target sound. In this case, steps S211b' and S211c' can be omitted.

It goes without saying that other changes can be made as appropriate without departing from the spirit of the present invention.

1, 2, 2', 2'', 3, 3', 3'' Signal processing device
11 Convolutional beamformer estimation unit
12 Convolutional beamformer application unit
211, 211', 311, 311' Spatio-temporal covariance estimation unit
212, 212', 212'' Dereverberation filter estimation unit
213, 213', 313 Beamformer estimation unit
221, 221'' Dereverberation filter application unit
222, 222' Beamformer application unit

Claims

A signal processing device comprising:
a convolutional beamformer estimation unit that receives a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimates a convolutional beamformer based on an optimization criterion under which a signal, obtained by applying to the acoustic signal the convolutional beamformer that performs dereverberation, diffuse noise suppression, and target sound source separation, follows a probability model satisfying a constraint condition; and
a convolutional beamformer application unit that applies the convolutional beamformer estimated by the convolutional beamformer estimation unit to the acoustic signal to obtain and output a processed signal,
wherein the acoustic signal is based on sound arriving at an observation position from a sound source,
the constraint condition is that the sound arriving at the observation position from the sound source is not distorted, and
some variables of the constraint condition are determined based on the auxiliary information.

The signal processing device according to claim 1, wherein
the convolutional beamformer includes a dereverberation filter that performs the dereverberation, and a beamformer that performs the diffuse noise suppression and the target sound source separation,
the convolutional beamformer estimation unit includes:
a spatio-temporal covariance matrix estimation unit that obtains a power-weighted spatio-temporal covariance matrix of the target sound; and
a dereverberation filter estimation unit that receives a power-weighted spatio-temporal covariance matrix of the acoustic signal and the target sound and information representing the beamformer, and estimates the dereverberation filter based on the optimization criterion,
the convolutional beamformer application unit includes a dereverberation filter application unit that applies the dereverberation filter estimated by the dereverberation filter estimation unit to the acoustic signal to obtain a dereverberated signal,
the convolutional beamformer estimation unit further includes a beamformer estimation unit that receives the dereverberated signal and the auxiliary information and estimates the beamformer based on the optimization criterion, and
the convolutional beamformer application unit further includes a beamformer application unit that applies the beamformer estimated by the beamformer estimation unit to the dereverberated signal to obtain and output the processed signal.

The signal processing device according to claim 2, wherein
the spatio-temporal covariance matrix estimation unit further obtains a power-weighted spatio-temporal covariance matrix of non-target sounds, and
the dereverberation filter estimation unit receives a power-weighted spatio-temporal covariance matrix of the acoustic signal and the target sound, the power-weighted spatio-temporal covariance matrix of the non-target sounds, and information specifying the beamformer, and estimates the dereverberation filter based on the optimization criterion.

The signal processing device according to claim 2 or 3,
wherein the processing of the spatio-temporal covariance matrix estimation unit, the processing of the dereverberation filter estimation unit, the processing of the dereverberation filter application unit, the processing of the beamformer estimation unit, and the processing of the beamformer application unit are repeated.

The signal processing device according to any one of claims 1 to 3,
The signal processing device, wherein the auxiliary information includes information specifying power of the target sound.

A signal processing method comprising:
a convolutional beamformer estimation step of receiving a frequency-divided time-series acoustic signal and auxiliary information representing information on a target sound, and estimating a convolutional beamformer based on an optimization criterion under which a signal, obtained by applying to the acoustic signal the convolutional beamformer that performs dereverberation, diffuse noise suppression, and target sound source separation, follows a probability model satisfying a constraint condition; and
a convolutional beamformer application step of applying the convolutional beamformer estimated in the convolutional beamformer estimation step to the acoustic signal to obtain and output a processed signal,
wherein the acoustic signal is based on sound arriving at an observation position from a sound source,
the constraint condition is that the sound arriving at the observation position from the sound source is not distorted, and
some variables of the constraint condition are determined based on the auxiliary information.

A program for causing a computer to function as the signal processing device according to any one of claims 1 to 5.