JP7776016B2

JP7776016B2 - Signal processing device, signal processing method, and program

Info

Publication number: JP7776016B2
Application number: JP2024541328A
Authority: JP
Inventors: 林太郎池下; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-08-17
Filing date: 2022-08-17
Publication date: 2025-11-26
Anticipated expiration: 2042-08-17
Also published as: JPWO2024038522A1; WO2024038522A1

Description

The present invention relates to a technique for estimating, with high quality, a speech signal contained in a signal recorded with a microphone.

When a speech signal is recorded with a microphone in a noisy, reverberant environment, the microphone picks up not only the desired speech component but also unwanted components such as noise, reverberation, and interfering sounds, so the quality of the speech signal contained in the recorded signal is low. Source extraction techniques have therefore been actively studied as a way to estimate that speech signal with high quality. One known approach to source extraction with multiple sensors uses a convolutional beamformer (CBF; see Non-Patent Document 1). Until now, the CBF has been optimized under the minimum-variance distortionless response (MVDR) criterion (see Non-Patent Document 1).

Non-Patent Document 1: T. Nakatani, C. Boeddeker, K. Kinoshita, R. Ikeshita, M. Delcroix and R. Haeb-Umbach, "Jointly Optimal Denoising, Dereverberation, and Source Separation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2267-2282, 2020.

However, when a CBF is designed under the MVDR criterion, the spatial information of the target sound source to be extracted (its spatial covariance matrix) is compressed into a steering vector before the CBF is designed, so not all of the spatial information of the target sound source can be used.

An object of the present invention is to provide a signal processing device, a signal processing method, and a program that can use all of the spatial information of the target sound source by introducing the MaxSNR criterion in place of the MVDR criterion.

To solve the above problem, according to one aspect of the present invention, a signal processing device includes: a second spatial covariance matrix estimation unit that estimates the spatial covariance matrix of a non-target sound source using an estimate of the space-time covariance matrix of the non-target sound source; a dereverberation filter estimation unit that estimates a dereverberation filter using the estimate of the space-time covariance matrix of the non-target sound source; a beamformer estimation unit that estimates a convolutional beamformer using an estimate of the spatial covariance matrix of the observed signal or of the target sound source, the estimate of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter; and a sound source extraction unit that performs beamforming using the observed signal and the estimated convolutional beamformer to estimate a source signal.

According to the present invention, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.

FIG. 1 is a functional block diagram of the signal processing device according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the signal processing device according to the first embodiment.
FIG. 3 is a functional block diagram of the signal processing device according to the second embodiment.
FIG. 4 is a diagram showing an example of the processing flow of the signal processing device according to the second embodiment.
FIG. 5 is a functional block diagram of the signal processing device according to the third embodiment.
FIG. 6 is a diagram showing an example of the processing flow of the signal processing device according to the third embodiment.
FIG. 7 is a diagram showing a configuration example of a computer to which the present technique is applied.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate explanations are omitted. In the following description, symbols such as "^" and "^-" used in the text should properly be written directly above the character that immediately follows them, but because of the limits of text notation they are written immediately before that character. In formulas, these symbols are written in their proper positions. Unless otherwise noted, processing described for an individual element of a vector or matrix is applied to all elements of that vector or matrix.

<Sound source extraction problem>
This embodiment addresses the sound source extraction problem: estimating, from a signal x_f,t observed with microphones, either the source signal s_f,t or the spatial image of the source signal with reverberation removed, s_f,t^image = a_f s_f,t. Here, a_f is the acoustic transfer function of the sound source. The source signal is a signal based on the sound emitted by the sound source that the microphones are meant to record (the target sound source). In this embodiment, the target sound source is a speaker (hereinafter also called the "target speaker"), the target sound is the speech uttered by the target speaker (hereinafter also called the "target speech"), and the target signal is the signal corresponding to the target speech. The invention is not limited to this case: the target sound source may be any sound source, such as a musical instrument or a playback device, and the target sound may be a sound other than speech. Sound sources other than the target sound source are also called non-target sound sources.

<Key points of the first embodiment>
The steering vector used by the MVDR CBF corresponds to the principal component of the spatial covariance matrix V_S, so the MVDR CBF cannot use all of the spatial information contained in V_S. This embodiment introduces the MaxSNR criterion as a new criterion for designing the CBF. A CBF designed under the MaxSNR criterion has the advantage that the spatial information of the target sound source (the spatial covariance matrix V_S) can be fully exploited.

First, the CBF based on the MaxSNR criterion is described. Let M be an integer of 2 or more representing the number of microphones, let L+1 be the number of taps of the CBF, and let S_+ be the set of all positive semidefinite matrices. For a set A of matrices, A^B denotes B-by-B square matrices and A^{B×C} denotes B-by-C matrices. Let ^R_N ∈ S_+^{M+ML} be the space-time covariance matrix of the non-target sound source, let V_S ∈ S_+^M be the spatial covariance matrix of the target sound source, and let O_A×B be the A-by-B zero matrix. With the quantity defined in equation (1), the MaxSNR CBF ^w is defined by equation (2).

Note that when L=0, the MaxSNR CBF reduces to the MaxSNR beamformer.
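Equations (1) and (2) are not reproduced in this text. The following is a hedged reconstruction, assuming that equation (1) zero-pads V_S to the size of ^R_N using the zero matrices O_A×B and that equation (2) maximizes the resulting signal-to-noise ratio. The symbol \hat{V}_S is introduced here for illustration only and may differ from the original notation.

```latex
\hat{V}_S =
\begin{bmatrix}
V_S & O_{M \times ML} \\
O_{ML \times M} & O_{ML \times ML}
\end{bmatrix}
\in S_+^{M+ML} \tag{1}

\hat{w}_{\mathrm{opt}} \in \operatorname*{argmax}_{\hat{w} \in \mathbb{C}^{M+ML}}
\frac{\hat{w}^H \hat{V}_S \hat{w}}{\hat{w}^H \hat{R}_N \hat{w}} \tag{2}
```

Under this reading, setting L=0 removes the zero padding, and the ratio becomes the usual MaxSNR beamformer objective with ^R_N as the noise covariance.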

The MaxSNR CBF ^w of this embodiment also has the property that, as shown in equation (3), it can be decomposed into the product of a dereverberation filter ^G and a MaxSNR beamformer w for instantaneous mixtures. In equation (3), the subscript opt denotes an optimal solution, and C is the set of all complex numbers. In other words, the MaxSNR CBF jointly optimizes the dereverberation filter ^G and the MaxSNR beamformer w.

To explain why equation (2) can be decomposed as in equation (3), ^w and ^R_N are partitioned into blocks as in equations (4) and (5), where S_++ is the set of all positive definite matrices.

The optimal solution ^w_opt of the MaxSNR CBF ^w is then obtained as in equation (6), using the quantities defined in equations (7) to (9), where I_M is the M-by-M identity matrix and A^H denotes the Hermitian transpose of A.
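Equations (3) to (9) are not reproduced in this text. The block below is one reconstruction that is consistent with the surrounding definitions: the block layout described for ^R_N, the statement that V_N is the Schur complement of ^R_N, and the later factorization (u_f w_f^H)(^G_f^H ^x_f,t), which implies that ^G maps C^M to C^{M+ML}. The exact forms in the original, in particular the symbol \bar{w} and the stacked form of \hat{G}, are assumptions.

```latex
\begin{align}
\hat{w}_{\mathrm{opt}} &= \hat{G}_{\mathrm{opt}}\, w_{\mathrm{opt}},
  \quad \hat{G} \in \mathbb{C}^{(M+ML) \times M},\ w \in \mathbb{C}^{M} \tag{3} \\
\hat{w} &= \begin{bmatrix} w \\ \bar{w} \end{bmatrix},
  \quad w \in \mathbb{C}^{M},\ \bar{w} \in \mathbb{C}^{ML} \tag{4} \\
\hat{R}_N &= \begin{bmatrix} R_N & \bar{P}_N^H \\ \bar{P}_N & \bar{R}_N \end{bmatrix},
  \quad \bar{R}_N \in S_{++}^{ML} \tag{5} \\
\hat{w}_{\mathrm{opt}} &= \hat{G}\, w_{\mathrm{opt}} \tag{6} \\
w_{\mathrm{opt}} &\in \operatorname*{argmax}_{w \in \mathbb{C}^{M}}
  \frac{w^H V_S w}{w^H V_N w} \tag{7} \\
\hat{G} &= \begin{bmatrix} I_M \\ -\bar{R}_N^{-1} \bar{P}_N \end{bmatrix} \tag{8} \\
V_N &= \hat{G}^H \hat{R}_N \hat{G}
     = R_N - \bar{P}_N^H \bar{R}_N^{-1} \bar{P}_N \tag{9}
\end{align}
```

The reasoning behind this form: the numerator of the SNR ratio depends only on the first M entries w of \hat{w}, so for a fixed w the denominator \hat{w}^H \hat{R}_N \hat{w} can be minimized over \bar{w} alone. The minimizer is \bar{w} = -\bar{R}_N^{-1} \bar{P}_N w, and the minimum value is w^H V_N w, which leaves the M-dimensional problem (7).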

Note that equation (7) can be solved as a generalized eigenvalue problem: w_opt is the eigenvector corresponding to the largest eigenvalue of

$$V_S w_{\mathrm{opt}} = \lambda_{\max} V_N w_{\mathrm{opt}}$$

where λ_max is the maximum eigenvalue.
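As an illustration of the generalized eigenvalue problem above, the following NumPy sketch finds the eigenvector for λ_max by whitening with a Cholesky factor of V_N. The function name `max_snr_beamformer` is hypothetical, and the sketch assumes V_N is positive definite; it is not taken from the patent.

```python
import numpy as np

def max_snr_beamformer(V_S, V_N):
    """Solve V_S w = lambda V_N w and return (w, lambda_max).

    V_S: (M, M) Hermitian positive semidefinite matrix.
    V_N: (M, M) Hermitian positive definite matrix (assumed).
    """
    # Cholesky factor of the noise covariance: V_N = C C^H.
    C = np.linalg.cholesky(V_N)
    C_inv = np.linalg.inv(C)
    # Whitened target covariance C^{-1} V_S C^{-H} is Hermitian.
    A = C_inv @ V_S @ C_inv.conj().T
    A = (A + A.conj().T) / 2          # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    # Map the top eigenvector back: w = C^{-H} q.
    w = C_inv.conj().T @ eigvecs[:, -1]
    return w, eigvals[-1]
```

The returned w is defined only up to a complex scale factor, which is why the scale is restored separately when a spatial image is needed.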

^G in equation (8) is the dereverberation filter based on multi-channel linear prediction (MCLP) that is used in dereverberation. V_N in equation (9) is the Schur complement of ^R_N and can be regarded as the spatial covariance matrix of the non-target sound source with reverberation removed.

<First embodiment>
FIG. 1 is a functional block diagram of the signal processing device according to the first embodiment, and FIG. 2 shows its processing flow.

The signal processing device 100 includes a first spatial covariance matrix estimation unit 110, a space-time covariance matrix estimation unit 120, a second spatial covariance matrix estimation unit 140, a dereverberation filter estimation unit 130, a beamformer estimation unit 150, a sound source extraction unit 160, and a spatial image estimation unit 170.

The signal processing device 100 receives as input an observed signal x_f,t observed with microphones, estimates the source signal s_f,t or the spatial image (with reverberation removed) of the source signal, s_f,t^image = a_f s_f,t, and outputs the result. The observed signal is, for example, an acoustic signal observed with a microphone array consisting of multiple microphones. The microphone output may be used as the input directly, an output signal stored in some storage device may be read out and used as the input, or the microphone output may be used after some processing has been applied to it. Here, f (f = 1, ..., F) denotes the frequency and t (t = 1, ..., T) denotes the frame number; the observed signal x_f,t and the source signal s_f,t are frequency-domain signals. Alternatively, a time-domain observed signal may be taken as input and converted to the frequency-domain observed signal x_f,t by a frequency-domain conversion unit (not shown), and the estimate of the source signal s_f,t may be converted to a time-domain source signal by a time-domain conversion unit (not shown) before being output. Any method may be used for the frequency-domain and time-domain conversions, for example the Fourier transform and the inverse Fourier transform.

The signal processing device 100 is, for example, a special device configured by loading a special program into a known or dedicated computer that has a central processing unit (CPU), a main memory (RAM: random access memory), and so on. The signal processing device 100 executes each process under the control of the central processing unit, for example. Data input to the signal processing device 100 and data obtained in each process are stored, for example, in the main memory, and the stored data is read out to the central processing unit as needed and used in other processes. At least part of each processing unit of the signal processing device 100 may be implemented in hardware such as an integrated circuit. Each storage unit of the signal processing device 100 can be implemented, for example, by a main memory such as RAM, or by middleware such as a relational database or a key-value store. The storage units do not have to be inside the signal processing device 100; they may be implemented by an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as flash memory, and provided outside the signal processing device 100.

Each unit is described below.

<First spatial covariance matrix estimation unit 110>
The first spatial covariance matrix estimation unit 110 estimates the spatial covariance matrix of the target sound source (S110) and outputs the estimate V_S ∈ S_+^M. Various methods can be used to estimate this matrix. For example, the unit receives the observed signal x_f,t as input, estimates from it the segments that contain sound emitted by the target sound source (hereinafter also called the target signal), and uses the estimated target signal to estimate the spatial covariance matrix of the target sound source. If the direction of the target sound source is known, the spatial covariance matrix of the target sound source may instead be approximated in advance by experiment or simulation, and the approximation used as the estimate V_S ∈ S_+^M.

<Space-time covariance matrix estimation unit 120>
The space-time covariance matrix estimation unit 120 estimates the space-time covariance matrix of the non-target sound source (S120) and outputs the estimate ^R_N ∈ S_+^{M+ML}. Various methods can be used to estimate this matrix. For example, the unit receives the observed signal x_f,t as input, estimates from it the segments that do not contain sound emitted by the target sound source (hereinafter also called the non-target signal), and uses the estimated non-target signal to estimate the space-time covariance matrix of the non-target sound source.

<Dereverberation filter estimation unit 130>
The dereverberation filter estimation unit 130 receives the estimate ^R_N of the space-time covariance matrix as input, estimates a dereverberation filter from the block matrices ^-P_N and ^-R_N contained in ^R_N (S130), and outputs the estimated dereverberation filter ^G. For example, the dereverberation filter is estimated by equation (8).

Here, the blocks of ^R_N are

$$\hat{R}_N = \begin{bmatrix} R_N & \bar{P}_N^H \\ \bar{P}_N & \bar{R}_N \end{bmatrix}$$

That is, R_N is the block of ^R_N consisting of rows 1 to M and columns 1 to M; ^-P_N is the block consisting of rows M+1 to M+ML and columns 1 to M; (^-P_N)^H is the block consisting of rows 1 to M and columns M+1 to M+ML; and ^-R_N is the block consisting of rows M+1 to M+ML and columns M+1 to M+ML.

<Second spatial covariance matrix estimation unit 140>
The second spatial covariance matrix estimation unit 140 receives the estimate ^R_N of the space-time covariance matrix as input, estimates the spatial covariance matrix of the non-target sound source from the block matrices R_N, ^-P_N, and ^-R_N contained in ^R_N (S140), and outputs the estimate V_N ∈ S_+^M. For example, the spatial covariance matrix of the non-target sound source is estimated by equation (9), that is, as the Schur complement of ^R_N.

Alternatively, the second spatial covariance matrix estimation unit 140 may receive as input the estimate ^R_N of the space-time covariance matrix and the dereverberation filter ^G estimated by the dereverberation filter estimation unit 130, and estimate the spatial covariance matrix of the non-target sound source from ^R_N and ^G by equation (9).
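The processing of units 130 and 140 can be sketched as follows. This is a minimal NumPy illustration, assuming that the dereverberation filter has the stacked form ^G = [I_M; -(^-R_N)^{-1} ^-P_N] (an assumed reading of equation (8), which is not reproduced in this text) and that V_N is the Schur complement of ^R_N with respect to ^-R_N. The function names are hypothetical.

```python
import numpy as np

def split_blocks(R_hat, M):
    """Split the (M+ML) x (M+ML) matrix R_hat into R, P_bar and R_bar."""
    R = R_hat[:M, :M]        # rows 1..M,       columns 1..M
    P_bar = R_hat[M:, :M]    # rows M+1..M+ML,  columns 1..M
    R_bar = R_hat[M:, M:]    # rows M+1..M+ML,  columns M+1..M+ML
    return R, P_bar, R_bar

def dereverb_filter(R_hat, M):
    """ASSUMED form of eq. (8): G_hat = [I_M; -R_bar^{-1} P_bar]."""
    _, P_bar, R_bar = split_blocks(R_hat, M)
    lower = -np.linalg.solve(R_bar, P_bar)   # avoids an explicit inverse
    return np.vstack([np.eye(M, dtype=R_hat.dtype), lower])

def schur_complement(R_hat, M):
    """V_N = R - P_bar^H R_bar^{-1} P_bar (Schur complement of R_hat)."""
    R, P_bar, R_bar = split_blocks(R_hat, M)
    return R - P_bar.conj().T @ np.linalg.solve(R_bar, P_bar)
```

Under these assumptions the two routes to V_N described above agree: `G.conj().T @ R_hat @ G` equals `schur_complement(R_hat, M)` up to round-off.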

<Beamformer estimation unit 150>
The beamformer estimation unit 150 receives as input the estimate V_S of the spatial covariance matrix of the target sound source, the estimate V_N of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter ^G. From V_S and V_N, the unit obtains the MaxSNR beamformer w_opt for instantaneous mixtures by equation (7).

As noted earlier, equation (7) can be solved as a generalized eigenvalue problem by taking the eigenvector corresponding to the largest eigenvalue.

From the MaxSNR beamformer w_opt for instantaneous mixtures and the estimated dereverberation filter ^G, the beamformer estimation unit 150 then estimates the convolutional beamformer by equation (3) (S150) and outputs the estimated convolutional beamformer ^w.

<Sound source extraction unit 160>
The sound source extraction unit 160 receives the observed signal x_f,t and the estimated convolutional beamformer ^w as input, performs beamforming by the following equation to estimate the source signal (S160), and outputs the estimate y_f,t.

$$y_{f,t} = \hat{w}_f^H \hat{x}_{f,t} \in \mathbb{C}$$
$$\hat{w}_f \in \mathbb{C}^{M+ML}, \quad \hat{w} = [\hat{w}_1 \mid \dots \mid \hat{w}_F]$$
$$\hat{x}_{f,t} = [x_{f,t}^T \mid x_{f,t-D-1}^T \mid \dots \mid x_{f,t-D-L}^T]^T \in \mathbb{C}^{M+ML}$$

Here, A^H denotes the Hermitian transpose of A, A^T denotes the transpose of A, Y = (y_t)_{t=1}^{T} (with T the number of frames) is the estimate of the source signal S, and D is the prediction delay.
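The stacking of delayed frames and the beamforming step above can be sketched for a single frequency bin as follows. This is an illustrative NumPy sketch, not the patent's implementation; the function names are hypothetical, and frames before the start of the recording are assumed to be zero.

```python
import numpy as np

def stack_delayed(x, D, L):
    """Build the stacked vectors x_hat_t for one frequency bin.

    x: (M, T) complex STFT frames.
    Returns X_hat of shape (M + M*L, T) whose column t is
    [x_t; x_{t-D-1}; ...; x_{t-D-L}], following the indexing in the text.
    Frames with a negative index are treated as zero (an assumption).
    """
    M, T = x.shape
    X_hat = np.zeros((M * (L + 1), T), dtype=complex)
    X_hat[:M] = x
    for l in range(1, L + 1):
        shift = D + l
        if shift < T:
            X_hat[l * M:(l + 1) * M, shift:] = x[:, :T - shift]
    return X_hat

def apply_cbf(w_hat, X_hat):
    """y_t = w_hat^H x_hat_t for every frame t; returns shape (T,)."""
    return w_hat.conj() @ X_hat
```

For example, with `w_hat` equal to the first unit vector, `apply_cbf` simply returns the first microphone channel, since no delayed frame contributes.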

<Spatial image estimation unit 170>
The scale of the convolutional beamformer ^w_f in each frequency bin f is indeterminate, but it can be restored by estimating a vector u_f that approximates the spatial image s_f,t^image through the following relation.

$$s_{f,t}^{\mathrm{image}} = a_f s_{f,t} \approx u_f y_{f,t} = (u_f w_f^H)(\hat{G}_f^H \hat{x}_{f,t}) \in \mathbb{C}^M$$

Here, ^G = [^G_1 | … | ^G_F], and the vector u_f is required to satisfy the following conditions:

(i) w_f^H u_f = 1 (distortionless constraint)
(ii) u_f ∝ V_N,f w_f (because ideally a_f ∝ V_N,f w_f holds)

Here, V_N = [V_N,1 | … | V_N,F]. These two constraints determine the vector u_f uniquely: substituting u_f = c V_N,f w_f into (i) gives c = 1 / (w_f^H V_N,f w_f), so

$$u_f = \frac{V_{N,f} w_f}{w_f^H V_{N,f} w_f} \tag{11}$$

The spatial image estimation unit 170 receives as input the estimate V_N of the spatial covariance matrix of the non-target sound source, the estimate y_f,t, and the MaxSNR beamformer w_opt for instantaneous mixtures. It obtains the vector u_f from V_N and w_opt by equation (11), approximates the spatial image s_f,t^image from y_f,t and u_f by the following equation, and outputs the approximation u_f y_f,t.

$$s_{f,t}^{\mathrm{image}} \approx u_f y_{f,t}$$
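The scale restoration performed by unit 170 can be sketched for one frequency bin as follows. The function name `spatial_image` is hypothetical; the formula for u is the one that follows from conditions (i) and (ii).

```python
import numpy as np

def spatial_image(w, V_N, y):
    """Estimate u and the spatial image for one frequency bin.

    w:   (M,)   MaxSNR beamformer for instantaneous mixtures.
    V_N: (M, M) spatial covariance estimate of the non-target source.
    y:   (T,)   beamformer output y_t.
    Returns u of shape (M,) and the image estimate u * y_t of shape (M, T).
    """
    v = V_N @ w
    # u is proportional to V_N w and normalized so that w^H u = 1.
    u = v / (w.conj() @ v)
    return u, np.outer(u, y)
```

Because w^H u = 1, applying w^H to the returned image reproduces y, which is the distortionless property stated in condition (i).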
<Effects>
With the above configuration, introducing the MaxSNR criterion makes it possible to use all of the spatial information of the target sound source.

<Key points of the second embodiment>
The MVDR CBF (the method of estimating a CBF under the MVDR criterion) requires the steering vector of the target sound source to be estimated separately in advance. As a result, its source extraction performance depends strongly on the accuracy of the steering vector estimate, and it is inconvenient to use. This embodiment resolves these problems.

To estimate the MaxSNR CBF, as equations (1) and (2) show, the estimate V_S of the spatial covariance matrix of the target sound source and the estimate ^R_N of the space-time covariance matrix of the non-target sound source must be obtained in advance.

This embodiment describes the Blind MaxSNR CBF, which removes the need to obtain these two estimates in advance. Here, "blind" means that no prior knowledge is required.

The Blind MaxSNR CBF of this embodiment estimates the MaxSNR CBF by repeatedly performing computations similar to those of the MaxSNR CBF given by equation (2) or equation (7).

The Blind MaxSNR CBF of this embodiment is defined as a local optimum of the problem in equations (20a) and (20b), using an arbitrary super-Gaussian function φ: R_≥0 → R and the Schur complement V_X of the matrix ^R_X.

In these equations, θ = (^w_f)_{f=1}^{F} is the variable, y_f,t = (^w_f)^H ^x_f,t is the estimate of the source signal, y_t = [y_1,t | … | y_F,t]^T ∈ C^F, ||A||_2 = √(A^H A) is the Euclidean norm of a vector A, and C on the right-hand side of equation (20b) is a constant that is determined adaptively and heuristically at each iteration of the algorithm that maximizes or minimizes the function.

More specifically, the MaxSNR CBF is optimized without prior knowledge by an iterative optimization that alternates between two steps: computing, based on equations (21) and (22), the space-time covariance matrix ^R_Z,f, which is interpreted as the estimate ^R_N,f of the space-time covariance matrix of the non-target sound source; and estimating the MaxSNR CBF ^w based on equations (23) to (26).

$$y_t^k = [\dots \mid y_{f,t}^k \mid \dots]^T, \quad y_{f,t}^k = (\hat{w}_f^k)^H \hat{x}_{f,t} \tag{22}$$

Here, k is the index of the iteration.
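The first of the two alternating steps can be sketched as follows. Equation (21) is not reproduced in this text, so the sketch makes a loud assumption: the covariance is a per-frame weighted average with weight 1 / (2 ||y_t||_2), which is the weight that the super-Gaussian choice φ(r) = r would give in auxiliary-function style updates. The actual weighting in the original may differ, and the function name and the `eps` floor are hypothetical.

```python
import numpy as np

def weighted_spacetime_cov(w_hat, X_hat, eps=1e-8):
    """Compute y (eq. 22) and a weighted space-time covariance R_Z.

    w_hat: (F, K) current convolutional beamformer, K = M + M*L.
    X_hat: (F, K, T) stacked observations x_hat_{f,t}.
    Returns R_Z of shape (F, K, K) and y of shape (F, T).
    """
    # eq. (22): y_{f,t} = w_hat_f^H x_hat_{f,t}
    y = np.einsum('fk,fkt->ft', w_hat.conj(), X_hat)
    # Euclidean norm of y_t taken over all frequency bins.
    r = np.sqrt(np.sum(np.abs(y) ** 2, axis=0))
    # ASSUMPTION: weight 1 / (2 r_t); eps guards against division by zero.
    weight = 1.0 / (2.0 * np.maximum(r, eps))
    T = X_hat.shape[-1]
    # R_Z[f] = (1/T) * sum_t weight_t * x_hat_{f,t} x_hat_{f,t}^H
    R_Z = np.einsum('t,fkt,flt->fkl', weight, X_hat, X_hat.conj()) / T
    return R_Z, y
```

Frames in which the current output y_t is strong receive a small weight, so R_Z is dominated by frames where the target is weak, which is consistent with interpreting it as a non-target covariance estimate.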

In addition, in each iteration of the above iterative optimization, the scale of the MaxSNR CBF ^w_f is aligned across the frequencies f = 1, …, F based on the following equation (27).

$$w_f \leftarrow (u_{f,m})^* w_f = (e_m^T u_f)^* w_f \tag{27}$$

Here, m (1 ≤ m ≤ M) is the index of the reference microphone, * denotes the complex conjugate, u_f is given by equation (11) (with V_Z,f used in place of V_N,f), and u_f,m = e_m^T u_f ∈ C is the m-th element of u_f.
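The scale alignment of equation (27) can be sketched for one frequency bin as follows. The function name `align_scale` is hypothetical, and the reference microphone index is zero-based in the code, whereas the text uses 1 ≤ m ≤ M.

```python
import numpy as np

def align_scale(w, V_Z, m):
    """Apply eq. (27): w <- conj(u_m) * w, with u = V_Z w / (w^H V_Z w).

    w:   (M,)   beamformer for one frequency bin.
    V_Z: (M, M) Hermitian positive definite matrix used in place of V_N.
    m:   zero-based index of the reference microphone.
    """
    v = V_Z @ w
    u = v / (w.conj() @ v)
    return np.conj(u[m]) * w
```

A useful consequence: if u is recomputed from the rescaled beamformer, its m-th element equals 1, so every frequency bin is normalized to the same reference microphone.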

<Second embodiment>
The description below focuses on the differences from the first embodiment.

FIG. 3 is a functional block diagram of the signal processing device according to the second embodiment, and FIG. 4 shows its processing flow.

The signal processing device 200 includes an initialization unit 201, a first spatial covariance matrix estimation unit 210, a space-time covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.

The signal processing device 200 receives as input the observed signal x_f,t observed with microphones and the index m of the reference microphone, estimates the source signal s_f,t, and outputs it. Here, f denotes the frequency and t denotes the frame number; the observed signal x_f,t and the source signal s_f,t are frequency-domain signals. Alternatively, a time-domain observed signal may be taken as input and converted to the frequency-domain observed signal x_f,t by a frequency-domain conversion unit (not shown), and the source signal s_f,t may be converted to a time-domain source signal by a time-domain conversion unit (not shown) before being output. Any method may be used for the frequency-domain and time-domain conversions, for example the Fourier transform and the inverse Fourier transform.

<Initialization unit 201>
The initialization unit 201 receives the index $m$ of the reference microphone as input, sets the initial value $\hat{w}^0 = [\hat{w}_1^0, \ldots, \hat{w}_F^0]$ of the convolutional beamformer $\hat{w}$ to be estimated according to the following equation (S201), and outputs it.

Here, $e_m$ is the unit vector corresponding to the reference microphone.
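A minimal sketch of this initialization is given below, under the assumption that each $\hat{w}_f^0$ equals the unit vector $e_m$ zero-padded to length $M+ML$; the initialization equation itself is not reproduced in this text. The function name `init_cbf` and the zero-based microphone index are illustrative assumptions.

```python
import numpy as np

def init_cbf(m, M, L, F):
    """Initial convolutional beamformer for all F frequency bins (assumed form).

    m: zero-based index of the reference microphone (assumption: the text's m
       may be one-based).
    Returns a complex array of shape (F, M + M*L) whose rows all equal e_m.
    """
    w0 = np.zeros((F, M + M * L), dtype=complex)
    w0[:, m] = 1.0  # pass the reference channel through; all delayed taps are zero
    return w0
```

With this choice, applying the initial beamformer simply returns the reference-microphone signal.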
<First spatial covariance matrix estimation unit 210>
The first spatial covariance matrix estimation unit 210 receives the observed signal $x_{f,t}$ as input, estimates the spatial covariance matrix of the observed signal $x_{f,t}$ using equations (28) to (30) (S210), and outputs the estimate $V_X$.

$$\hat{x}_{f,t} = [\, x_{f,t}^T \mid x_{f,t-D-1}^T \mid \cdots \mid x_{f,t-D-L}^T \,]^T \in \mathbb{C}^{M+ML}$$

Here, $R_{X,f}$ is the block of the estimate $\hat{R}_{X,f}$ consisting of rows 1 to $M$ and columns 1 to $M$; $\hat{\bar{P}}_{X,f}$ is the block consisting of rows $M+1$ to $M+ML$ and columns 1 to $M$; $(\hat{\bar{P}}_{X,f})^H$ is the block consisting of rows 1 to $M$ and columns $M+1$ to $M+ML$; and $\hat{\bar{R}}_{X,f}$ is the block consisting of rows $M+1$ to $M+ML$ and columns $M+1$ to $M+ML$.

$$V_X = [V_{X,1}, \ldots, V_{X,f}, \ldots, V_{X,F}]$$
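The stacked vector $\hat{x}_{f,t}$ defined above can be illustrated for a single frequency bin as follows. This is a sketch only: the function names are hypothetical, frames before the start of the recording are assumed to be zero, and the plain time average used in `sample_covariance` is an assumption rather than a reproduction of equations (28) to (30).

```python
import numpy as np

def stack_delayed_frames(x, D, L):
    """Build the stacked vectors for one frequency bin.

    x: complex array of shape (T, M) holding x_{f,t} for a fixed f.
    Returns an array of shape (T, M + M*L) whose row t is
    [x_t | x_{t-D-1} | ... | x_{t-D-L}], with unavailable frames set to zero.
    """
    T, M = x.shape
    stacked = np.zeros((T, M + M * L), dtype=complex)
    stacked[:, :M] = x
    for l in range(1, L + 1):
        delay = D + l
        if delay < T:
            stacked[delay:, l * M : (l + 1) * M] = x[: T - delay]
    return stacked

def sample_covariance(stacked):
    """Time-averaged outer product (1/T) * sum_t x_t x_t^H (assumed estimator)."""
    return stacked.T @ stacked.conj() / stacked.shape[0]
```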
<Space-time covariance matrix estimation unit 220>
The space-time covariance matrix estimation unit 220 receives, as input, the observed signal $x_{f,t}$ and either the convolutional beamformer $\hat{w}^k$ estimated in the previous iteration or its initial value $\hat{w}^0$. Based on equations (21) and (22) below, it computes the space-time covariance matrix $\hat{R}_Z = [\hat{R}_{Z,1}, \ldots, \hat{R}_{Z,f}, \ldots, \hat{R}_{Z,F}]$, which is interpreted as the estimate $\hat{R}_{N,f}$ of the space-time covariance matrix of the non-target sound source (S220), and outputs it.

$$y_t^k = [\, \cdots \mid y_{f,t}^k \mid \cdots \,]^T, \quad y_{f,t}^k = (\hat{w}_f^k)^H \hat{x}_{f,t} \tag{22}$$
When the space-time covariance matrix $\hat{R}_{Z,f}$ is computed for the first time, in other words, before the beamformer estimation unit 250 described later has estimated the convolutional beamformer $\hat{w}$, the output of the initialization unit 201 is used as the initial value $\hat{w}^0 = [\hat{w}_1^0, \ldots, \hat{w}_F^0]$ of the convolutional beamformer.

<Dereverberation filter estimation unit 230>
The dereverberation filter estimation unit 230 receives the space-time covariance matrix $\hat{R}_{Z,f}$ as input, estimates a dereverberation filter from the blocks $\hat{\bar{P}}_{Z,f}$ and $\hat{\bar{R}}_{Z,f}$ contained in $\hat{R}_{Z,f}$ (S230), and outputs the estimated dereverberation filter $\hat{G}$. For example, the dereverberation filter is estimated using equation (25).

Here,

$$\hat{R}_{Z,f} = \begin{bmatrix} R_{Z,f} & (\hat{\bar{P}}_{Z,f})^H \\ \hat{\bar{P}}_{Z,f} & \hat{\bar{R}}_{Z,f} \end{bmatrix}$$

That is, $R_{Z,f}$ is the block of the space-time covariance matrix $\hat{R}_{Z,f}$ consisting of rows 1 to $M$ and columns 1 to $M$; $\hat{\bar{P}}_{Z,f}$ is the block consisting of rows $M+1$ to $M+ML$ and columns 1 to $M$; $(\hat{\bar{P}}_{Z,f})^H$ is the block consisting of rows 1 to $M$ and columns $M+1$ to $M+ML$; and $\hat{\bar{R}}_{Z,f}$ is the block consisting of rows $M+1$ to $M+ML$ and columns $M+1$ to $M+ML$.
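The block structure described above corresponds to simple array slicing. The following sketch, with the hypothetical function name `partition_blocks`, extracts the three distinct blocks from an $(M+ML) \times (M+ML)$ Hermitian matrix; the fourth block is the conjugate transpose of `P_bar` and is therefore not returned separately.

```python
import numpy as np

def partition_blocks(R_hat, M):
    """Split an (M+ML) x (M+ML) space-time covariance matrix into its blocks."""
    R = R_hat[:M, :M]      # rows 1..M,      columns 1..M
    P_bar = R_hat[M:, :M]  # rows M+1..M+ML, columns 1..M
    R_bar = R_hat[M:, M:]  # rows M+1..M+ML, columns M+1..M+ML
    return R, P_bar, R_bar
```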

<Second spatial covariance matrix estimation unit 240>
The second spatial covariance matrix estimation unit 240 receives the space-time covariance matrix $\hat{R}_{Z,f}$ as input, estimates the spatial covariance matrix of the non-target sound source from the blocks $R_{Z,f}$, $\hat{\bar{P}}_{Z,f}$, and $\hat{\bar{R}}_{Z,f}$ contained in $\hat{R}_{Z,f}$ (S240), and outputs the estimate $V_{Z,f} \in S_+^{M+ML}$. For example, the spatial covariance matrix of the non-target sound source is estimated using equation (31).

Alternatively, the second spatial covariance matrix estimation unit 240 may receive, as input, the space-time covariance matrix $\hat{R}_{Z,f}$ and the dereverberation filter $\hat{G}$ estimated by the dereverberation filter estimation unit 230, and estimate the spatial covariance matrix of the non-target sound source from $\hat{R}_{Z,f}$ and $\hat{G}$ using equation (31).

<Beamformer estimation unit 250>
The beamformer estimation unit 250 receives, as input, the estimate $V_X = [V_{X,1}, \ldots, V_{X,F}]$ of the spatial covariance matrix of the observed signal $x_{f,t}$, the estimate $V_Z = [V_{Z,1}, \ldots, V_{Z,F}]$ of the spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter $\hat{G} = [\hat{G}_1, \ldots, \hat{G}_F]$. From the estimates $V_X$ and $V_Z$, the beamformer estimation unit 250 obtains $w_f^{k+1}$ using equation (24).

From the MaxSNR beamformer $w_f^{k+1}$ for the instantaneous mixture and the estimated dereverberation filter $\hat{G}$, the beamformer estimation unit 250 then estimates the convolutional beamformer using equation (23) (S250).
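Equations (24) and (23) are not reproduced in this text, so the following is only a hedged sketch of the two ideas involved. `max_snr_beamformer` computes the textbook MaxSNR beamformer for an instantaneous mixture, namely the principal generalized eigenvector of the pair $(V_X, V_Z)$, which may differ from the update actually given by equation (24). `compose_cbf` assumes the commonly used factorization in which the convolutional beamformer stacks $w_f$ and $-\hat{G}_f w_f$; whether equation (23) takes exactly this form is an assumption. Both function names are hypothetical.

```python
import numpy as np

def max_snr_beamformer(V_X, V_Z):
    """Return (w, ratio) where w maximizes (w^H V_X w) / (w^H V_Z w).

    V_X, V_Z: Hermitian (M, M) arrays; V_Z must be positive definite.
    """
    L = np.linalg.cholesky(V_Z)           # V_Z = L L^H
    L_inv = np.linalg.inv(L)
    A = L_inv @ V_X @ L_inv.conj().T      # whitened problem, Hermitian
    eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
    w = L_inv.conj().T @ eigvecs[:, -1]   # map the top eigenvector back
    return w, eigvals[-1]

def compose_cbf(w, G):
    """Stack w and -G w into an (M + ML)-dimensional filter (assumed form).

    w: (M,) beamformer; G: (M*L, M) dereverberation filter.
    """
    return np.concatenate([w, -G @ w])
```

Under the assumed factorization, applying the stacked filter to $\hat{x}_{f,t}$ is equivalent to dereverberating the current frame with $\hat{G}$ and then beamforming with $w$.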

The beamformer estimation unit 250 obtains a vector $u_f$ from the estimate $V_Z = [V_{Z,1}, \ldots, V_{Z,F}]$ of the spatial covariance matrix of the non-target sound source and the convolutional beamformer $\hat{w}^{k+1}$ using the following equation.

Furthermore, based on equation (29) below, the beamformer estimation unit 250 uses the $m$-th element $u_{f,m}$ of the vector $u_f$ to align the scale of the MaxSNR CBF $\hat{w}_f^{k+1}$ at each frequency $f = 1, \ldots, F$, and outputs the scale-aligned convolutional beamformer $\hat{w}^{k+1}$.

$$\hat{w}_f^{k+1} \leftarrow (u_{f,m})^* \hat{w}_f^{k+1} = (e_m^T u_f)^* \hat{w}_f^{k+1} \tag{29}$$
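The per-frequency scale alignment of equation (29) amounts to multiplying each beamformer by the complex conjugate of one element of $u_f$. A minimal sketch follows; the function name `rescale_cbf`, the array layout, and the zero-based index are assumptions, and the vectors $u_f$ are taken as given because the equation that defines them is not reproduced in this text.

```python
import numpy as np

def rescale_cbf(w_hat, u, m):
    """Apply w_hat_f <- conj(u_{f,m}) * w_hat_f for every frequency f.

    w_hat: complex array (F, M + M*L), one convolutional beamformer per frequency.
    u:     complex array (F, M), the vectors u_f.
    m:     zero-based index of the reference microphone (assumption).
    """
    return np.conj(u[:, m])[:, None] * w_hat
```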
<Sound source extraction unit 160>
The sound source extraction unit 160 receives the observed signal $x_{f,t}$ and the estimated convolutional beamformer $\hat{w}^{k+1}$ as input, performs beamforming according to the following equation to estimate the sound source signal (S160), and outputs the estimate $y_{f,t}$.

$$y_{f,t} = (\hat{w}_f^{k+1})^H \hat{x}_{f,t} \in \mathbb{C}$$
$$\hat{w}_f^{k+1} \in \mathbb{C}^{M+ML}$$
$$\hat{w}^{k+1} = [\, \hat{w}_1^{k+1} \mid \cdots \mid \hat{w}_F^{k+1} \,]$$
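The beamforming operation $y_{f,t} = (\hat{w}_f^{k+1})^H \hat{x}_{f,t}$ can be evaluated for all frequencies and frames at once. The sketch below assumes the stacked vectors are stored in an array of shape (F, T, M+ML); the function name `apply_cbf` and this layout are illustrative choices.

```python
import numpy as np

def apply_cbf(w_hat, x_hat):
    """Compute y[f, t] = (w_hat[f])^H x_hat[f, t].

    w_hat: complex array (F, K) with K = M + M*L.
    x_hat: complex array (F, T, K) of stacked observation vectors.
    Returns a complex array of shape (F, T).
    """
    return np.einsum("fk,ftk->ft", np.conj(w_hat), x_hat)
```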
<Determination unit 280>
The determination unit 280 determines whether a convergence condition is satisfied (S280). If the condition is satisfied (YES in S280), the estimate $y_{f,t}$ at that point is output as the output of the signal processing device, and the processing ends. If the condition is not satisfied (NO in S280), the determination unit 280 sends control signals to the respective units so that S220 to S160 are repeated. The estimate $y_{f,t}$ output by the sound source extraction unit 160 can be reused in the space-time covariance matrix estimation unit 220, in which case the calculation of equation (22) can be omitted. Usable convergence conditions include, for example, whether the learning has been repeated a fixed number of times (for example, several times) and whether the difference in the convolutional beamformer $\hat{w}^{k+1}$ before and after the update is at or below a predetermined threshold.
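The two example convergence conditions (a fixed iteration budget and a threshold on the change in the beamformer) can be combined in a simple control loop. In the sketch below, `update` is a hypothetical callable standing in for one pass through S220 to S160, and the default values of `max_iters` and `tol` are arbitrary assumptions.

```python
import numpy as np

def iterate_until_converged(update, w_init, max_iters=5, tol=1e-6):
    """Repeat w <- update(w) until the change is small or the budget is spent.

    Returns the final beamformer and the number of iterations performed.
    """
    w = w_init
    for k in range(1, max_iters + 1):
        w_next = update(w)
        change = np.linalg.norm(w_next - w)  # difference before/after the update
        w = w_next
        if change <= tol:
            break
    return w, k
```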

<Effects>
With this configuration, the same effects as in the first embodiment can be obtained. Furthermore, the Blind MaxSNR CBF of this embodiment is an ultra-fast method that can estimate the MaxSNR CBF with high accuracy in at most a few iterations.

In this embodiment, the estimate $y_{f,t}$ of the sound source signal $s_{f,t}$ is output. However, a spatial image estimation unit 170 may be provided to compute and output the approximation $u_f y_{f,t}$ of the spatial image $s_{f,t}^{\mathrm{image}}$, using the estimate $y_{f,t}$ obtained at the point when the convergence condition is satisfied.
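The approximation $u_f y_{f,t}$ of the spatial image is a scalar-times-vector product for each frequency and frame. A minimal sketch, with the hypothetical function name `spatial_image` and an assumed array layout:

```python
import numpy as np

def spatial_image(u, y):
    """Return the array whose entry [f, t] is the M-vector u_f * y_{f,t}.

    u: complex array (F, M); y: complex array (F, T). Output shape: (F, T, M).
    """
    return u[:, None, :] * y[:, :, None]
```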

<Key points of the third embodiment>
In this embodiment, as a by-product of the Blind MaxSNR CBF of the second embodiment, the "Iteratively Reweighted MaxSNR CBF (IR-MaxSNR CBF)" is realized. This is a method for estimating the MaxSNR CBF with high accuracy in a situation where the spatial covariance matrix $V_S$ of the target sound source is known (that is, estimated in advance) while the space-time covariance matrix $\hat{R}_N$ of the unwanted sound is unknown (that is, not estimated in advance).

When the spatial covariance matrix $V_S$ of the target sound source can be estimated with high accuracy, using that information allows the MaxSNR CBF to be estimated more accurately than with the Blind MaxSNR CBF of the second embodiment.

<Third embodiment>
The following description focuses on the differences from the second embodiment.

Figure 5 is a functional block diagram of the signal processing device according to the third embodiment, and Figure 6 shows its processing flow.

The signal processing device 300 includes an initialization unit 201, a first spatial covariance matrix estimation unit 110, a space-time covariance matrix estimation unit 220, a second spatial covariance matrix estimation unit 240, a dereverberation filter estimation unit 230, a beamformer estimation unit 250, a sound source extraction unit 160, and a determination unit 280.

This embodiment differs from the second embodiment in that it includes the first spatial covariance matrix estimation unit 110 in place of the first spatial covariance matrix estimation unit 210. The first spatial covariance matrix estimation unit 110 is as described in the first embodiment. The beamformer estimation unit 250 also differs from the second embodiment in that it uses the estimate $V_S$ of the spatial covariance matrix of the target sound source in place of the estimate $V_X$ of the spatial covariance matrix of the observed signal $x_{f,t}$. The other processing is the same as in the second embodiment.

<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above need not be executed in chronological order as described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processes described above can be carried out by loading a program that executes each step of the above method into the recording unit 2020 of the computer 2000 shown in Figure 7 and causing the control unit 2010, the input unit 2030, the output unit 2040, the display unit 2050, and so on to operate.

The program describing this processing can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing in accordance with the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or the computer may execute processing in accordance with the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing may instead be realized in hardware.

Claims

a second spatial covariance matrix estimation unit that estimates a spatial covariance matrix of the non-target sound source using an estimated value of the space-time covariance matrix of the non-target sound source;
a dereverberation filter estimator that estimates a dereverberation filter using the estimated value of the spatiotemporal covariance matrix of the non-target sound source;
a beamformer estimation unit that estimates a convolution beamformer using an estimated value of a spatial covariance matrix of an observed signal or a target sound source, an estimated value of a spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter;
a sound source extraction unit that performs beamforming processing using the observed signal and the estimated convolution beamformer to estimate a sound source signal;
Signal processing device.

2. The signal processing device of claim 1,
a first spatial covariance matrix estimation unit that estimates a section including a sound emitted from a target sound source (hereinafter also referred to as a target signal) from the observed signal and estimates a spatial covariance matrix of the target sound source using the estimated target signal;
a space-time covariance matrix estimation unit that estimates a section that does not include a sound emitted by a target sound source (hereinafter also referred to as a non-target signal) from the observed signal, and estimates a space-time covariance matrix of the non-target sound source using the estimated non-target signal,
the beamformer estimation unit estimates a convolution beamformer using an estimate of a spatial covariance matrix of the target sound source, an estimate of a spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter.
Signal processing device.

3. The signal processing device of claim 1,
a first spatial covariance matrix estimator that estimates a spatial covariance matrix of the observed signal using the observed signal;
a space-time covariance matrix estimator that estimates a space-time covariance matrix of the non-target sound source by using the observed signal and the estimated convolution beamformer,
the beamformer estimation unit estimates a convolution beamformer using an estimate of a spatial covariance matrix of the observed signals, an estimate of a spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter;
repeating the processes in the spatiotemporal covariance matrix estimator, the second spatial covariance matrix estimator, the dereverberation filter estimator, the beamformer estimator, and the sound source extractor until a convergence condition is satisfied;
Signal processing device.

4. The signal processing device of claim 1,
a first spatial covariance matrix estimation unit that estimates a section including a sound emitted from a target sound source (hereinafter also referred to as a target signal) from the observed signal and estimates a spatial covariance matrix of the target sound source using the estimated target signal;
a space-time covariance matrix estimator that estimates a space-time covariance matrix of the non-target sound source by using the observed signal and the estimated convolution beamformer,
the beamformer estimation unit estimates a convolution beamformer using an estimate of a spatial covariance matrix of the target sound source, an estimate of a spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter;
repeating the processes in the spatiotemporal covariance matrix estimator, the second spatial covariance matrix estimator, the dereverberation filter estimator, the beamformer estimator, and the sound source extractor until a convergence condition is satisfied;
Signal processing device.

a second spatial covariance matrix estimation step in which the computer estimates a spatial covariance matrix of the non-target sound source using an estimate of the space-time covariance matrix of the non-target sound source;
a dereverberation filter estimation step in which a computer estimates a dereverberation filter using the estimated spatiotemporal covariance matrix of the non-target sound source;
a beamformer estimation step in which a computer estimates a convolution beamformer using an estimated value of a spatial covariance matrix of an observed signal or a target sound source, an estimated value of a spatial covariance matrix of the non-target sound source, and the estimated dereverberation filter;
a sound source extraction step of performing beamforming processing by a computer using the observed signal and the estimated convolution beamformer to estimate a sound source signal,
Signal processing method.

A program for causing a computer to function as a signal processing device according to any one of claims 1 to 4.