JP7619564B2 - Sound collection device, sound collection program, and sound collection method

Info

Publication number: JP7619564B2
Application number: JP2021025965A
Authority: JP
Inventors: 大藤枝; 一浩片桐; 耕平西城; 哲司小川
Original assignee: Waseda University; Oki Electric Industry Co Ltd
Current assignee: Waseda University; Oki Electric Industry Co Ltd
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2025-01-22
Anticipated expiration: 2041-02-22
Also published as: JP2022127777A

Description

The present invention relates to a sound collection device, a sound collection program, and a sound collection method, and can be applied, for example, to area sound collection processing that collects sound arriving from a sound source located in a target area (hereinafter referred to as "target area sound").

A conventional area sound collection technique using multi-channel microphones is MUBASE (Multiple Beam-forming Area Sound Enhancement), described in Non-Patent Document 1. MUBASE emphasizes the area in the front direction by exploiting the fact that interfering sounds from the surroundings can be extracted from the difference between the observation signals of two microphone channels.

FIG. 7 shows an example in which MUBASE emphasizes and acquires the sound component arriving from the front direction (the direction in which the target area sound is located), using the observation signals of a microphone array MA that has two microphones Ml and Mr.

Here, when the observation signals of the microphones Ml and Mr are expressed by formulas (1) and (2), respectively (f is the frequency bin index), the difference between the observation signals can be expressed by formula (3). This difference acts as a filter that steers a null toward the front direction (hereinafter referred to as the "difference filter"), so it extracts the interfering sounds that arrive from outside the area. However, it is known that the estimated interfering sound obtained by the difference filter has less power than the actual interfering sound, and the shortfall grows toward lower frequencies. Using the estimated interfering sound obtained by the difference filter, the sound from a sound source y_f in the target area in front can be extracted by the subtraction expressed by formula (4).

In formula (4), the subtraction coefficient α_f is a hyperparameter. The width of the emphasized area changes with the value of α_f: the larger the value of α_f, the narrower the beam.
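The formula images for (1) to (4) are not reproduced in this text, so the following Python sketch only illustrates the idea described above. It assumes that the difference is taken per frequency bin as X_r − X_l and that formula (4) is a magnitude-domain subtraction |X_r| − α_f·|d| floored at zero; both the exact form and the function names are assumptions made for illustration, not the patent's definition.

```python
def difference_filter(x_l, x_r):
    """Per-bin difference d_f = X_r,f - X_l,f of two complex spectra."""
    return [r - l for l, r in zip(x_l, x_r)]


def mubase_subtract(x_r, d, alpha):
    """Magnitude-domain subtraction floored at zero (assumed form of (4))."""
    return [max(abs(r) - a * abs(e), 0.0) for r, e, a in zip(x_r, d, alpha)]


# A frontal source reaches both microphones identically, so d is zero
# and the reference magnitude passes through unchanged.
front_l = [1 + 1j, 0.5j, 2 + 0j]
front_r = list(front_l)
d_front = difference_filter(front_l, front_r)
y_front = mubase_subtract(front_r, d_front, alpha=[1.0, 1.0, 1.0])

# An off-axis source arrives with an inter-microphone phase shift, so d is
# non-zero and a larger alpha removes more of the observed magnitude.
side_l = [1 + 0j, 1 + 0j, 1 + 0j]
side_r = [1j, -1 + 0j, -1j]
d_side = difference_filter(side_l, side_r)
y_weak = mubase_subtract(side_r, d_side, alpha=[0.2, 0.2, 0.2])
y_strong = mubase_subtract(side_r, d_side, alpha=[1.0, 1.0, 1.0])
```

In this toy case the larger coefficient suppresses the off-axis source completely, while the frontal source is untouched for any coefficient, which matches the statement that a larger α_f yields a narrower beam.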

Kazuhiro Katagiri, Tokuo Yamaguchi, Takashi Yazu, and Yoong Keok Lee, "Multiple beam-forming area sound enhancement (MUBASE) and stereophonic area sound reproduction (SASR) system", SIGGRAPH Asia 2015 Emerging Technologies, 2015.

In conventional area sound collection processing using MUBASE, the optimum value of the coefficient α_f varies depending on factors such as the position of the sound source within the target area.

For example, suppose the coefficient α_f is adjusted manually in conventional area sound collection processing using MUBASE. If α_f is too large, over-subtraction occurs and the signal obtained by the sound collection processing (the signal in which the target area sound is emphasized) is distorted. If α_f is too small, the interfering sound (non-target area sound) is not suppressed sufficiently. Optimal adjustment of α_f is therefore difficult.

In view of the above problems, there is a need for a sound collection device, a sound collection program, and a sound collection method that are more robust against environmental changes related to the sound source within the target area (for example, movement of the sound source).

The sound collection device according to a first aspect of the present invention comprises: target sound extraction processing means that uses a learning model to obtain, from a first input signal from a first microphone constituting a microphone array and a difference signal that is the difference between the first input signal and a second input signal from a second microphone constituting the microphone array, a target sound enhancement signal in which the target sound component contained in the first input signal is emphasized; and learning means that obtains the learning model by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and the target sound signal.

The sound collection program according to a second aspect of the present invention causes a computer to function as: target sound extraction processing means that uses a learning model to obtain, from a first input signal from a first microphone constituting a microphone array and a difference signal that is the difference between the first input signal and a second input signal from a second microphone constituting the microphone array, a target sound enhancement signal in which the target sound component contained in the first input signal is emphasized; and learning means that obtains the learning model by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and the target sound signal.

A third aspect of the present invention is a sound collection method performed by a sound collection device that includes target sound extraction processing means and learning means. The target sound extraction processing means uses a learning model to obtain, from a first input signal from a first microphone constituting a microphone array and a difference signal that is the difference between the first input signal and a second input signal from a second microphone constituting the microphone array, a target sound enhancement signal in which the target sound component contained in the first input signal is emphasized. The learning model is obtained by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and the target sound signal.

According to the present invention, it is possible to provide sound collection processing that is more robust against environmental changes related to sound sources within the target area.

FIG. 1 is a block diagram showing the functional configuration of the first target area sound extraction unit according to the embodiment.
FIG. 2 is a block diagram showing the functional configuration of the second target area sound extraction unit according to the embodiment.
FIG. 3 is a block diagram showing the functional configuration of the sound collection device according to the embodiment.
FIG. 4 is a block diagram showing an example of the hardware configuration of the sound collection device according to the embodiment.
FIG. 5 is a diagram showing the experimental environment of the sound collection device according to the embodiment.
FIG. 6 is a diagram showing experimental results of the sound collection device according to the embodiment.
FIG. 7 is a diagram showing conventional sound collection processing using a two-channel microphone array.

(A) Main embodiment
Hereinafter, an embodiment of the sound collection device, program, and method according to the present invention will be described in detail with reference to the drawings.

(A-1) Configuration of the embodiment
FIG. 3 is a block diagram showing the functional configuration of the sound collection device 100 of this embodiment.

The sound collection device 100 performs target area sound collection processing, in which target area sound from a sound source in the target area is collected using a microphone array MA that has two microphones Mr and Ml.

The microphone array MA may be placed at any location in the space in which the target area exists. In this embodiment, to simplify the explanation, it is assumed that the microphone array MA collects sound from only one target area (one target sound source placed in the target area).

Next, the internal configuration of the sound collection device 100 will be described.

The sound collection device 100 includes a signal input unit 101, a target area sound extraction unit 102, and a signal output unit 103. The detailed processing of each functional block of the sound collection device 100 is described later.

The signal input unit 101 converts the acoustic signals (analog signals) observed by the microphones into digital signals and then into a form that the target area sound extraction unit 102 can process (in this embodiment, frequency-domain signals). Specifically, the signal input unit 101 converts each observed acoustic signal from analog to digital, further converts it from the time domain to the frequency domain (for example, by a fast Fourier transform), and supplies the result to the target area sound extraction unit 102.
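As a rough illustration of the time-to-frequency conversion performed by the signal input unit 101, the sketch below splits a digitized signal into overlapping windowed frames and transforms each frame. The frame length, hop size, Hann window, and function names are assumptions chosen for the example, and a naive DFT stands in for the fast Fourier transform mentioned in the text so that the code depends only on the Python standard library.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive discrete Fourier transform; returns bins 0..N/2 (one-sided)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n // 2 + 1)
    ]


def to_frequency_domain(samples, frame_len=8, hop=4):
    """Split samples into overlapping windowed frames and transform each."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = samples[start:start + frame_len]
        frames.append(dft([s * w for s, w in zip(chunk, window)]))
    return frames


# A sinusoid with exactly 2 cycles per 8-sample frame concentrates in bin 2.
signal = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
spectra = to_frequency_domain(signal)
peak_bin = max(range(len(spectra[0])), key=lambda f: abs(spectra[0][f]))
```

Applying the same function to the samples of each microphone would give one complex spectrum per frame and per channel, which is the kind of input the extraction stage operates on.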

Here, the observation signals (acoustic signals converted into the frequency domain) of the microphones Ml and Mr that the signal input unit 101 supplies to the target area sound extraction unit 102 are denoted X_l and X_r, respectively.

The target area sound extraction unit 102 estimates and extracts the target area sound component from the signals supplied by the signal input unit 101.

The signal output unit 103 converts the signal output by the target area sound extraction unit 102 from the frequency domain to the time domain and outputs it in a predetermined format. The format and method of signal output by the signal output unit 103 are not limited.

Next, an example of the hardware configuration of the sound collection device 100 will be described.

The sound collection device 100 may be configured entirely in hardware (for example, a dedicated chip), or partly or entirely as software (a program). For example, the sound collection device 100 may be configured by installing a program (including the sound collection program of the embodiment) on a computer that has a processor and memory.

FIG. 4 is a block diagram showing an example of the hardware configuration of the sound collection device 100.

FIG. 4 shows an example of the hardware configuration when the sound collection device 100 is implemented using software (a computer).

The sound collection device 100 shown in FIG. 4 has, as its hardware component, a computer 400 on which a program (including the sound collection program of the embodiment) is installed. The computer 400 may be dedicated to the sound collection program or may be shared with programs for other functions.

The computer 400 shown in FIG. 4 has a processor 401, a primary storage unit 402, and a secondary storage unit 403. The primary storage unit 402 is storage means that functions as the working memory of the processor 401; a fast memory such as DRAM (Dynamic Random Access Memory) can be used. The secondary storage unit 403 is storage means that records various data such as the OS (Operating System) and program data (including the data of the sound collection program according to the embodiment); a non-volatile memory such as FLASH (registered trademark) memory, an HDD, or an SSD can be used. In the computer 400 of this embodiment, when the processor 401 starts up, it reads the OS and programs (including the sound collection program according to the embodiment) recorded in the secondary storage unit 403, loads them into the primary storage unit 402, and executes them.

The specific configuration of the computer 400 is not limited to that of FIG. 4, and various configurations can be applied. For example, if the primary storage unit 402 is a non-volatile memory (for example, FLASH (registered trademark) memory), the secondary storage unit 403 may be omitted.

Next, an overview of the target area sound extraction processing performed by the target area sound extraction unit 102 will be given.

Like conventional MUBASE, the target area sound extraction processing described here is designed to extract the target area sound from the observation signals of two microphones. Conventional MUBASE processing applies formula (4) above, but the optimum coefficient α_f differs depending on the arrival angles of the target sound source and the interfering sound (non-target area sound), so setting it manually can be difficult. In conventional MUBASE processing, if the value of α_f is too large, over-subtraction can occur and distort the sound within the target area. Conversely, if the value of α_f is too small, interfering sounds from outside the area may be suppressed only slightly.

The target area sound extraction unit 102 of this embodiment is described as applying a configuration that collects the target area sound by using a deep neural network (DNN) to learn the computation corresponding to the subtraction expressed by formula (4) above (hereinafter referred to as "deep area sound collection" or "DMUBASE"). With deep area sound collection (DMUBASE), the target area sound extraction unit 102 of this embodiment can achieve highly accurate area sound collection regardless of the arrival angles of the target sound and the interfering sound.

In area sound collection processing, robustness against movement of the sound source within the target area is desirable. Because deep area sound collection (DMUBASE) learns the filter in a data-driven manner, it needs constraints that ensure this robustness.

Deep area sound collection (DMUBASE) should therefore satisfy two requirements: it can suppress interfering sounds from outside the area, and it is robust against movement of the target sound source within the area.

In light of the above, this embodiment applies, as the model architecture of the target area sound extraction unit 102, either the first target area sound extraction unit 102A shown in FIG. 1 or the second target area sound extraction unit 102B shown in FIG. 2, for example.

First, the first target area sound extraction unit 102A shown in FIG. 1 will be described.

The first target area sound extraction unit 102A has an estimation processing unit 200, a mask processing unit 210, a phase processing unit 220, and a difference extractor 230.

The estimation processing unit 200 estimates the interfering sound (non-target area sound) component contained in X_r based on the observation signals X_l and X_r from the microphones of the microphone array, and outputs a signal that holds the coefficients (filter coefficients) for suppressing that component (hereinafter referred to as the "mask signal"). For each frequency, the mask signal holds a filter coefficient (a value between 0 and 1) for suppressing the interfering sound (non-target area sound) component contained in X_r.

Specifically, the estimation processing unit 200 uses a DNN to estimate the mask signal based on the observation signal |X_r| and on |d| = |X_r − X_l|, the difference between X_l and X_r (the output of the difference filter). Because the processing described here extracts the target area sound component from the observation signal X_r of the microphone Mr, the mask signal is estimated from |X_r| and the difference filter output |X_r − X_l|. Alternatively, the observation signal |X_l| may be used as the reference, with the mask signal estimated from |X_l| and the filter output |X_l − X_r|.

The mask processing unit 210 masks (attenuates, suppresses, filters) the interfering sound (non-target area sound) component contained in |X_r| based on the mask signal (filter coefficients) supplied from the estimation processing unit 200, and outputs a signal in which the target area sound is emphasized.

The phase processing unit 220 attaches the phase (phase information) of X_r to the signal supplied from the mask processing unit 210 and outputs the result. The signal output by the phase processing unit 220 is denoted "y^" and serves as the output signal of the first target area sound extraction unit 102A. In other words, y^ is the result of extracting (emphasizing, estimating) the target area sound.
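The mask processing and phase processing steps described above can be sketched in a few lines of Python. The sketch assumes that masking is an element-wise product of the mask coefficient and |X_r| per frequency bin, and that attaching the phase means reusing the phase angle of X_r; the function names are hypothetical and not taken from the patent.

```python
import cmath


def apply_mask(mask, x_r):
    """Scale each bin magnitude |X_r| by its mask coefficient in [0, 1]."""
    return [m * abs(x) for m, x in zip(mask, x_r)]


def attach_phase(magnitudes, x_r):
    """Give each magnitude the phase of the corresponding bin of X_r."""
    return [cmath.rect(mag, cmath.phase(x)) for mag, x in zip(magnitudes, x_r)]


# Three bins: kept entirely, halved, and removed entirely.
x_r = [3 + 4j, -2 + 0j, 1j]
mask = [1.0, 0.5, 0.0]
y_hat = attach_phase(apply_mask(mask, x_r), x_r)
```

A coefficient of 1 leaves a bin unchanged, 0.5 halves its magnitude while keeping its phase, and 0 removes it.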

The difference extractor 230 obtains the difference between y^ (the estimated target area sound) output by the phase processing unit 220 and the clean target area sound (hereinafter denoted "y") that serves as the teacher label (ground-truth label) in machine learning, and feeds this difference back to the estimation processing unit 200 as the loss. The difference extractor 230 therefore functions only while the estimation processing unit 200 is being trained. If the estimation processing unit 200 no longer performs any new learning processing, the difference extractor 230 may be omitted from the first target area sound extraction unit 102A.
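The text does not state which distance the difference extractor 230 uses. As one possibility, the sketch below computes the mean squared error between the estimated spectrum y^ and the clean spectrum y over frequency bins; the choice of mean squared error and the function name are assumptions made for illustration.

```python
def spectral_loss(y_hat, y):
    """Mean squared error between estimated and clean spectra (assumed loss)."""
    return sum(abs(a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


y_clean = [1 + 0j, 2j]
loss_perfect = spectral_loss([1 + 0j, 2j], y_clean)  # estimate equals the label
loss_silent = spectral_loss([0j, 0j], y_clean)       # estimate is all zeros
```

During training, a value of this kind would be minimized with respect to the network parameters; at inference time it is not needed.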

As described above, in the first target area sound extraction unit 102A, a neural network (the estimation processing unit 200) takes as input the observation signal |X_r| and |d|, the difference filter output in which non-target area sound (interfering sound) is dominant, and estimates a mask signal for extracting the sound source (target area sound) in the target area, which corresponds to the fan-shaped region in front of the microphone array MA (see FIG. 7).

The first target area sound extraction unit 102A of this embodiment supports both an operation mode in which the estimation processing unit 200 performs learning processing (hereinafter referred to as the "learning processing mode") and an operation mode in which target area sound extraction processing (extraction of the mask signal and y^) is performed based on the supplied observation signals X_l and X_r (hereinafter referred to as the "signal processing mode"). The first target area sound extraction unit 102A may also be configured without support for the learning processing mode (for example, a configuration that already holds a learning model or obtains one from outside).

When operating in the learning processing mode, the first target area sound extraction unit 102A is supplied with a data set (hereinafter referred to as the "teacher data set") that contains samples of the observation signals (X_l, X_r) as teacher data and the clean target area sound y as the teacher label. It obtains |X_r| and |d| from the observation signals (X_l, X_r) of the teacher data set and supplies them to the estimation processing unit 200, and it feeds the loss (difference) extracted by the difference extractor 230 back to the estimation processing unit 200. The estimation processing unit 200 can thereby obtain a learning model trained (by deep learning) on the teacher data set.

Next, an example of the internal configuration of the estimation processing unit 200 will be described with reference to FIG. 1.

The example of FIG. 1 is described here as the internal configuration of the estimation processing unit 200, but various machine learning (deep learning) frameworks can be applied as long as they support the learning processing and signal processing based on the teacher data set described above.

In this example, the neural network of the estimation processing unit 200 is described as having the five-layer configuration shown in FIG. 1, but various other configurations can be applied as long as they support the learning processing and signal processing based on the teacher data set described above.

In the estimation processing unit 200 shown in FIG. 1, "FC layers 211 and 212", "FC layers 221 and 222", "FC layer 231", "FC layer 241", and "FC layer 251" are arranged in this order from the input layer. |X_r| and |d| are input to the input FC layers 211 and 212, respectively. In this neural network, only FC layer 251 uses a sigmoid activation function; all other FC layers use ReLU (Rectified Linear Unit).

In the neural network of the estimation processing unit 200 shown in FIG. 1, |X_r| and |d| are each nonlinearly transformed by FC layers 211, 212, 221, and 222, and the two resulting streams are then concatenated and input to the third layer, FC layer 231. The two subsequent FC layers 241 and 251 then perform a transformation (inverse transformation) and output the mask signal (a time-frequency mask). Because the activation function of FC layer 251 is a sigmoid, as noted above, the estimation processing unit 200 can output, for each frequency, a coefficient (filter coefficient) expressed as a value between 0 and 1.
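The five-layer structure described above can be sketched as a plain-Python forward pass. The layer widths, the random untrained weights, and the pairing of FC layers 211/221 with the |X_r| branch and 212/222 with the |d| branch are assumptions for illustration; the text only specifies the layer order, the concatenation before FC layer 231, and the activation functions.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fc(x, layer, activation):
    """Fully connected layer: activation(W x + b), with layer = (W, b)."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (untrained, for shape checking)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_params(n_bins, hidden, seed=0):
    rng = random.Random(seed)
    return {
        "fc211": init_layer(rng, n_bins, hidden),      # |X_r| branch, layer 1
        "fc212": init_layer(rng, n_bins, hidden),      # |d| branch, layer 1
        "fc221": init_layer(rng, hidden, hidden),      # |X_r| branch, layer 2
        "fc222": init_layer(rng, hidden, hidden),      # |d| branch, layer 2
        "fc231": init_layer(rng, 2 * hidden, hidden),  # after concatenation
        "fc241": init_layer(rng, hidden, hidden),
        "fc251": init_layer(rng, hidden, n_bins),      # sigmoid output layer
    }


def estimate_mask(abs_x_r, abs_d, p):
    """Forward pass: two ReLU branches, concatenation, then three FC layers."""
    h_x = fc(fc(abs_x_r, p["fc211"], relu), p["fc221"], relu)
    h_d = fc(fc(abs_d, p["fc212"], relu), p["fc222"], relu)
    h = fc(h_x + h_d, p["fc231"], relu)  # list '+' concatenates the branches
    h = fc(h, p["fc241"], relu)
    return fc(h, p["fc251"], sigmoid)


params = build_params(n_bins=5, hidden=8)
mask = estimate_mask([1.0, 0.5, 0.2, 0.0, 0.3], [0.1, 0.4, 0.2, 0.0, 0.6], params)
```

With untrained weights the mask values carry no meaning; the sketch only shows that the sigmoid output layer yields one coefficient between 0 and 1 per frequency bin.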

By configuring a neural network as shown in FIG. 1, the estimation processing unit 200 learns from data how to derive, from |d| (in which the interfering sound is dominant) and the observation signal |X_r|, a mask signal (filter coefficients) that emphasizes the target area sound whose source lies in the front direction of the microphone array MA. The DNN thus learns the subtraction processing corresponding to formula (4). In other words, with a neural network configured as shown in FIG. 1, the estimation processing unit 200 can perform area sound collection processing that is robust against movement of the sound source within the target area in front of the microphone array MA. In particular, FC layer 231, the intermediate layer in which the two inputs are combined, is the layer responsible for the subtraction processing.

Next, the second target area sound extraction unit 102B shown in FIG. 2 will be described. In FIG. 2, parts that are identical or correspond to those in FIG. 1 are given the same or corresponding reference numerals.

The following describes how the second target area sound extraction unit 102B differs from the first target area sound extraction unit 102A.

As shown in FIG. 2, the second target area sound extraction unit 102B has an estimation processing unit 300, a phase processing unit 220, and a difference extractor 230.

Whereas the estimation processing unit 200 estimates a mask signal from the microphone array observation signal |X_r| and |d|, the estimation processing unit 300 instead outputs the power spectrum obtained by estimating the target area sound y (the spectrum of a signal in which the target area sound component is emphasized; a frequency-domain signal).

The phase processing unit 220 attaches the phase (phase information) of X_r to the power spectrum supplied from the estimation processing unit 300 and outputs the result as the signal y^.

As described above, in the second target area sound extraction unit 102B, a neural network (the estimation processing unit 300) takes as input the observation signal |X_r| and |d|, the difference filter output in which non-target area sound (interfering sound) is dominant, and estimates the power spectrum of the sound source (target area sound) in the target area, which corresponds to the fan-shaped region in front of the microphone array MA (see FIG. 7).

Like the first target area sound extraction unit 102A, the second target area sound extraction unit 102B may support both the learning processing mode and the signal processing mode.

次に、推定処理部３００の内部構成の例について図２を用いて説明する。 Next, an example of the internal configuration of the estimation processing unit 300 will be described with reference to FIG. 2.

ここでは、推定処理部３００の内部構成として、図２の例を説明するが、推定処理部３００としては、上記の教師データセットに基づく学習処理と信号処理が可能であれば、種々の機械学習（ディープラーニング）の構成を適用することができる。 Here, the example of Figure 2 will be described as the internal configuration of the estimation processing unit 300, but as the estimation processing unit 300, various machine learning (deep learning) configurations can be applied as long as learning processing and signal processing based on the above teacher data set are possible.

ここでは、推定処理部３００のニューラルネットワークは、図２に示す通り、推定処理部３００のニューラルネットワークの最後段（出力層）のＦＣ層２５１がＦＣ層３５１に置き換わっている点で推定処理部２００と異なっている。推定処理部３００のＦＣ層３５１では、活性化関数がｓｉｇｍｏｉｄではなくＲｅＬｕとなっている点で推定処理部２００と異なっている。これにより、推定処理部３００のＦＣ層３５１では、パワースペクトラムを出力することができる。 As shown in FIG. 2, the neural network of the estimation processing unit 300 differs from that of the estimation processing unit 200 in that the FC layer 251 at the last stage (output layer) is replaced with an FC layer 351. The FC layer 351 uses ReLU rather than sigmoid as its activation function. Because ReLU output is non-negative and not limited to the range 0 to 1, the FC layer 351 can output a power spectrum.
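The difference between the two output layers can be sketched with a single fully connected layer. This is a hedged illustration only: the helper `fc_layer`, the weights, and the input values are hypothetical and do not reflect the layer sizes in FIG. 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def fc_layer(h, weights, bias, activation):
    """One fully connected layer: affine transform followed by an activation."""
    return activation(h @ weights + bias)

# Toy hidden activations and parameters (hypothetical values).
h = np.array([[1.0, -1.0]])
w = np.array([[2.0, 0.0],
              [0.0, 3.0]])
b = np.zeros(2)

mask_like = fc_layer(h, w, b, sigmoid)      # values in (0, 1), usable as a mask
spectrum_like = fc_layer(h, w, b, relu)     # non-negative and unbounded above
```

With these values the pre-activations are 2 and -3, so the ReLU layer returns 2 and 0 while the sigmoid layer stays strictly between 0 and 1.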

推定処理部３００では、図２に示すようなニューラルネットワークにより、妨害音が優勢の｜ｄ｜と観測音である｜Ｘ_ｒ｜からマイクロホンアレイＭＡの正面方向を音源とする目的エリア音を出力する機構を構成することで、（４）式に相当するサブトラクション処理をデータから学習することができる。 In the estimation processing unit 300, a neural network as shown in FIG. 2 forms a mechanism that outputs the target area sound, whose source lies in front of the microphone array MA, from |d|, in which the interfering sound is dominant, and |X_r|, the observed sound, so that the subtraction processing corresponding to equation (4) can be learned from data.

（Ａ－２）実施形態の動作 (A-2) Operation of the Embodiment
次に、以上のような構成を有するこの実施形態における収音装置１００の動作（実施形態に係る収音方法）を説明する。 Next, the operation of the sound collection device 100 in this embodiment, which has the configuration described above (the sound collection method according to the embodiment), will be described.

まず、収音装置１００の目的エリア音抽出部１０２が学習処理モードで動作する場合の処理について説明する。 First, we will explain the processing performed when the target area sound extraction unit 102 of the sound collection device 100 operates in the learning processing mode.

学習処理モードで動作する目的エリア音抽出部１０２に教師データセットが供給されると、目的エリア音抽出部１０２は、教師データセットの観測信号（Ｘ_ｌ、Ｘ_ｒ）から、｜Ｘ_ｒ｜と｜ｄ｜を取得してニューラルネットワークに入力して、深層エリア収音の学習処理（ニューラルネットワークにより目的エリア音を抽出する処理の学習）を行う。 When a teacher data set is supplied to the target area sound extraction unit 102 operating in the learning processing mode, the target area sound extraction unit 102 obtains |X_r| and |d| from the observed signals (X_l, X_r) of the teacher data set and inputs them to the neural network to perform the learning processing for deep area sound collection (learning to extract the target area sound with a neural network).
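Obtaining the two network inputs from the observed spectra can be sketched as below. The function name `network_inputs` and the toy spectra are hypothetical, and the sign convention d = X_l - X_r is an assumption (the magnitude |d| is the same for either sign).

```python
import numpy as np

def network_inputs(x_l, x_r):
    """Return |X_r| and |d|, where d is the difference of the two observations."""
    d = x_l - x_r                     # difference filter output (sign is an assumption)
    return np.abs(x_r), np.abs(d)

# Toy complex spectrum at microphone Mr (hypothetical values).
x_r = np.array([1.0 + 2.0j, -0.5 + 0.5j])

# A frontal source reaches both microphones identically, so d vanishes.
mag_front, d_front = network_inputs(x_r.copy(), x_r)

# A lateral source reaches Ml with a phase offset, so d is non-zero.
mag_side, d_side = network_inputs(x_r * np.exp(-1j * 0.3), x_r)
```

The toy case shows why |d| carries information dominated by sound from outside the frontal area: an idealized frontal component cancels in the difference, while off-axis components do not.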

収音装置１００に、第１の目的エリア音抽出部１０２Ａが適用される場合、第１の目的エリア音抽出部１０２Ａでは、｜Ｘ_ｒ｜と｜ｄ｜が推定処理部２００に入力される。また、このとき、第１の目的エリア音抽出部１０２Ａでは、差分抽出器２３０により位相処理部２２０から出力される信号ｙ＾と教師ラベルｙとのｌｏｓｓが抽出されて推定処理部２００にフィードバックされる。第１の目的エリア音抽出部１０２Ａでは、上記のようなフィードバックにより、深層エリア収音の学習処理（ニューラルネットワークにより目的エリア音を抽出する処理の学習）が行われる。 When the first target area sound extraction unit 102A is applied to the sound collection device 100, |X_r| and |d| are input to the estimation processing unit 200. At this time, the difference extractor 230 extracts the loss between the signal y^ output from the phase processing unit 220 and the teacher label y and feeds it back to the estimation processing unit 200. Through this feedback, the first target area sound extraction unit 102A performs the learning processing for deep area sound collection (learning to extract the target area sound with a neural network).

一方、収音装置１００に、第２の目的エリア音抽出部１０２Ｂが適用される場合、第２の目的エリア音抽出部１０２Ｂでは、｜Ｘ_ｒ｜と｜ｄ｜が推定処理部３００に入力される。また、このとき、第２の目的エリア音抽出部１０２Ｂでは、差分抽出器２３０により推定処理部３００から出力されるパワースペクトラムのｌｏｓｓが抽出されて推定処理部３００にフィードバックされる。第２の目的エリア音抽出部１０２Ｂでは、上記のようなフィードバックにより、深層エリア収音の学習処理（ニューラルネットワークにより目的エリア音を抽出する処理の学習）が行われる。 On the other hand, when the second target area sound extraction unit 102B is applied to the sound collection device 100, |X_r| and |d| are input to the estimation processing unit 300. At this time, the difference extractor 230 extracts the loss of the power spectrum output from the estimation processing unit 300 and feeds it back to the estimation processing unit 300. Through this feedback, the second target area sound extraction unit 102B performs the learning processing for deep area sound collection (learning to extract the target area sound with a neural network).

次に、収音装置１００の目的エリア音抽出部１０２が信号処理モードで動作する場合の動作について説明する。 Next, we will explain the operation of the target area sound extraction unit 102 of the sound collection device 100 when it operates in signal processing mode.

ここで、マイクロホンアレイＭＡ（マイクロホンＭｒ、Ｍｌ）から信号入力部１０１を介して、信号処理モードで動作する目的エリア音抽出部１０２に観測信号（Ｘ_ｌ、Ｘ_ｒ）が供給されたものとする。そうすると、目的エリア音抽出部１０２は、ニューラルネットワーク（推定処理部２００又は推定処理部３００）に｜Ｘ_ｒ｜と｜ｄ｜を供給し、結果としてｙ＾を取得して信号出力部１０３に供給することになる。信号出力部１０３は、ｙ＾を周波数領域から時間領域に変換して出力する。 Assume here that the observed signals (X_l, X_r) are supplied from the microphone array MA (microphones Mr, Ml) via the signal input unit 101 to the target area sound extraction unit 102 operating in the signal processing mode. The target area sound extraction unit 102 then supplies |X_r| and |d| to the neural network (the estimation processing unit 200 or the estimation processing unit 300), obtains y^ as a result, and supplies it to the signal output unit 103. The signal output unit 103 converts y^ from the frequency domain to the time domain and outputs it.

次に、発明者が、実際に収音装置１００を構築して、目的エリア音を収音する処理を行い、その品質を評価するための実験（以下、「本実験」と呼ぶ）を行った際の実験結果及びその評価結果について説明する。 Next, the inventors will explain the experimental results and evaluation results of an experiment (hereinafter referred to as "this experiment") in which they actually constructed a sound collection device 100, performed a process to collect target area sound, and evaluated its quality.

図５は、本実験の環境について示した図である。 Figure 5 shows the environment of this experiment.

図５では、マイクロホンＭｒ、Ｍｌ、目的音源、妨害音源が全て同じ平面上に存在する場合の例について示している。また、図５では、マイクロホンＭｒ、Ｍｌの位置（中心位置）を結んだ線Ｌの中点の位置（マイクロホンアレイＭＡの中心点）をＰ１と図示している。さらに、図５では、Ｐ１からみてマイクロホンＭｒの方向を０°、Ｐ１からみてマイクロホンＭｌの方向を１８０°として、目的音源及び妨害音源はＰ１からみて０°～１８０°のいずれかの角度から到来するものとする。以下では、Ｐ１から見た目的音源及び妨害音源の存在する方向を「到来角」又は「到来方向」とも呼ぶものとする。また、図５に示すように、目的音源及び妨害音源（非目的エリアの音源）の位置はＰ１から１ｍの距離の半円の線上であるものとする。 Figure 5 shows an example in which the microphones Mr and Ml, the target sound source, and the interfering sound sources all lie in the same plane. In Figure 5, the midpoint of the line L connecting the positions (center positions) of the microphones Mr and Ml, that is, the center point of the microphone array MA, is shown as P1. The direction of the microphone Mr as viewed from P1 is taken as 0° and the direction of the microphone Ml as viewed from P1 as 180°, and the sound from the target sound source and the interfering sound sources arrives from some angle between 0° and 180° as viewed from P1. Hereinafter, the direction of the target sound source or an interfering sound source as viewed from P1 is also called the "arrival angle" or "arrival direction". As shown in Figure 5, the target sound source and the interfering sound sources (sound sources in the non-target area) are located on a semicircle at a distance of 1 m from P1.

本実験では、学習処理モード（訓練時）、信号処理モード（信号処理時）のいずれの動作モードにおいても、目的音源のドライソース（信号）としてＴＩＭＩＴコーパス（以下の参考文献１参照）を用い、妨害音のドライソース（信号）として、ＴＩＭＩＴコーパス又はＤＥＭＡＮＤ（ＤｉｖｅｒｓｅＥｎｖｉｒｏｎｍｅｎｔｓＭｕｌｔｉ－ｃｈａｎｎｅｌＡｃｏｕｓｔｉｃＮｏｉｓｅＤａｔａｂａｓｅ）コーパス（以下の参考文献２参照）を用いた。 In this experiment, in both the learning processing mode (during training) and the signal processing mode (during signal processing), the TIMIT corpus (see Reference 1 below) was used as the dry source (signal) of the target sound source, and the TIMIT corpus or the DEMAND (Diverse Environments Multi-channel Acoustic Noise Database) corpus (see Reference 2 below) was used as the dry source (signal) of the interfering sound.

参考文献１ (Reference 1)：J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, V. Zue, “TIMIT acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1992.
参考文献２ (Reference 2)：J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” The Journal of the Acoustical Society of America, vol. 133, p. 3591, 05, 2013.

本実験では、図５のような音場（モデル環境）においてマイクロホンＭｌ、Ｍｒで捕捉される観測信号（音響信号）をコンピュータ上のシミュレーションにより取得し、さらに取得した観測信号を収音装置１００に入力した結果を評価した。 In this experiment, the observed signals (acoustic signals) captured by microphones Ml and Mr in a sound field (model environment) like that shown in Figure 5 were obtained by computer simulation, and the obtained observed signals were then input to the sound collection device 100 to evaluate the results.

具体的には、本実験では、ＰｙＲｏｏｍＡｃｏｕｓｔｉｃｓ（以下の参考文献３参照）を用いて、図５のような音場（モデル環境）を設定したシミュレーションを行ってインパルス応答を取得し、取得したインパルス応答を上記のドライソース（目的音源及び妨害音源のドライソース）に畳み込むことで、マイクロホンＭｌ、Ｍｒの観測信号Ｘ_ｌ、Ｘ_ｒを得た。 Specifically, in this experiment, PyRoomAcoustics (see Reference 3 below) was used to simulate a sound field (model environment) like that in FIG. 5 and obtain impulse responses, and the obtained impulse responses were convolved with the dry sources described above (the dry sources of the target sound source and the interfering sound sources) to obtain the observed signals X_l and X_r of the microphones Ml and Mr.

参考文献３ (Reference 3)：R. Scheibler, E. Bezzam, I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulations and array processing algorithms,” Proc. IEEE ICASSP, 2018.
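The convolution step can be sketched in plain NumPy. This is only an illustration of how a microphone observation is formed from dry sources and impulse responses: the function `observe` is hypothetical, and the short impulse responses below are toy placeholders, not responses produced by PyRoomAcoustics.

```python
import numpy as np

def observe(dry_sources, impulse_responses):
    """Sum each dry source convolved with its impulse response to one microphone."""
    length = max(len(s) + len(h) - 1
                 for s, h in zip(dry_sources, impulse_responses))
    mixture = np.zeros(length)
    for s, h in zip(dry_sources, impulse_responses):
        y = np.convolve(s, h)         # propagation from the source to the microphone
        mixture[:len(y)] += y
    return mixture

# Toy dry sources (hypothetical values).
target = np.array([1.0, 0.0, 0.0, 0.0])
interferer = np.array([0.0, 1.0, 0.0, 0.0])

# Toy impulse responses to a single microphone (hypothetical values).
h_target = np.array([1.0, 0.5])
h_interferer = np.array([0.0, 0.25])

x_time = observe([target, interferer], [h_target, h_interferer])
```

Repeating this for the impulse responses of each microphone gives one time-domain observation per channel.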

また、本実験のシミュレーションでは、観測信号Ｘ_ｌ、Ｘ_ｒにおけるＳＮＲがおよそ０．０［ｄＢ］となるよう調整している。なお、以下では、本実験用の音場の３Ｄ空間を（ｘ，ｙ，ｚ）の三次元の座標系で表すものとする。 In the simulation of this experiment, the SNR of the observed signals X_l and X_r was adjusted to approximately 0.0 [dB]. In the following, the 3D space of the sound field used in this experiment is expressed in a three-dimensional coordinate system (x, y, z).
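One way to bring a mixture to a chosen SNR is to rescale the interfering component before adding it to the target component. The sketch below assumes SNR is defined as the ratio of mean powers of the two components; the text does not state how the adjustment was actually performed, and the function names and signals are hypothetical.

```python
import numpy as np

def snr_db(target, noise):
    """SNR in dB, taken here as the ratio of mean powers (an assumption)."""
    return 10.0 * np.log10(np.mean(target ** 2) / np.mean(noise ** 2))

def scale_noise_to_snr(target, noise, wanted_db=0.0):
    """Rescale the noise so that the target-to-noise ratio equals wanted_db."""
    p_target = np.mean(target ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_target / (p_noise * 10.0 ** (wanted_db / 10.0)))
    return noise * gain

# Toy components (hypothetical values).
target = np.array([1.0, -1.0, 1.0, -1.0])
noise = np.array([0.1, 0.2, -0.1, -0.2])

scaled_noise = scale_noise_to_snr(target, noise, wanted_db=0.0)
mixture = target + scaled_noise
achieved_db = snr_db(target, scaled_noise)
```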

そして、本実験のシミュレーションでは、モデル環境の音場を構成する部屋の大きさは（ｘ，ｙ，ｚ）［ｍ］＝（５，３，３）とし、２ｃｈのマイクロホンＭｌ，Ｍｒの座標を、それぞれ（ｘ，ｙ，ｚ）［ｍ］＝（２．４９，１．５，１）、（ｘ，ｙ，ｚ）［ｍ］＝（２．５１，１．５，１）とした。これにより、マイクロホンＭｌ，Ｍｒの間の間隔は２［ｃｍ］となる。また、本実験のシミュレーションでは、部屋の吸音率を０．２、部屋の反射回数を３と設定した。 In the simulation of this experiment, the size of the room that constitutes the sound field of the model environment was set to (x, y, z) [m] = (5, 3, 3), and the coordinates of the 2ch microphones Ml and Mr were set to (x, y, z) [m] = (2.49, 1.5, 1) and (x, y, z) [m] = (2.51, 1.5, 1), respectively. This results in a distance of 2 [cm] between microphones Ml and Mr. In addition, in the simulation of this experiment, the sound absorption coefficient of the room was set to 0.2, and the number of reflections in the room was set to 3.
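The geometry stated above can be checked with a few lines of arithmetic. The microphone and room coordinates are taken from the text; placing the semicircle in the horizontal plane of the microphones on the +y side is an assumption made for illustration, and `source_position` is a hypothetical helper.

```python
import math

ROOM = (5.0, 3.0, 3.0)        # room size in metres, from the text
MIC_L = (2.49, 1.5, 1.0)      # microphone Ml, from the text
MIC_R = (2.51, 1.5, 1.0)      # microphone Mr, from the text

# P1 is the midpoint of the line connecting the two microphones.
P1 = tuple((a + b) / 2.0 for a, b in zip(MIC_L, MIC_R))

def source_position(angle_deg, radius=1.0):
    """Point at the given arrival angle on a circle around P1.

    0 deg points from P1 toward Mr (+x). Using the +y half-plane at the
    height of the microphones is an assumption.
    """
    t = math.radians(angle_deg)
    return (P1[0] + radius * math.cos(t), P1[1] + radius * math.sin(t), P1[2])

spacing = math.dist(MIC_L, MIC_R)     # microphone spacing in metres
front = source_position(90.0)         # frontal position at 90 deg
```

Under this assumption every point on the 1 m semicircle stays inside the 5 m x 3 m x 3 m room.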

本実験では、収音装置１００の目的エリア音抽出部１０２に、マスク推定により目的エリア音を推定する第１の目的エリア音抽出部１０２Ａを適用した場合（以下、「第１の本発明の実験モデル」とよぶ）、パワースペクトラム推定により目的エリア音を推定する第２の目的エリア音抽出部１０２Ｂを適用した場合（以下、「第２の本発明の実験モデル」と呼ぶ）、及び従来のＭＵＢＡＳＥによる目的エリア音推定を適用した場合（以下、「従来構成の実験モデル」と呼ぶ）を適用した場合のそれぞれについてシミュレーションを行った。 In this experiment, simulations were performed for the following cases: a first target area sound extraction unit 102A that estimates the target area sound by mask estimation is applied to the target area sound extraction unit 102 of the sound collection device 100 (hereinafter referred to as the "first experimental model of the present invention"), a second target area sound extraction unit 102B that estimates the target area sound by power spectrum estimation is applied (hereinafter referred to as the "second experimental model of the present invention"), and target area sound estimation using conventional MUBASE is applied (hereinafter referred to as the "experimental model of conventional configuration").

次に、本実験のシミュレーションにおける各音源の位置について説明する。 Next, we will explain the position of each sound source in the simulation of this experiment.

本実験では、学習時は目的音源の位置を９０°に固定し、テスト時には目的音源をエリア内（Ｐ１から距離１ｍで８０°～９０°の範囲内）で動かすことで、収音装置１００が上記の２つの要件を満たしているかを検証した。また、妨害音源については、学習時・テスト時共に、０°、１５°、３０°、４５°、１３５°、１５０°、１６５°、１８０°の計８か所のうちランダムに１～３か所に設置した。本実験では、このような目的音源及び妨害音原の位置変更を、コーパス上データ処理単位（例えば、単語単位）で行った。 In this experiment, the position of the target sound source was fixed at 90° during learning, and during testing, the target sound source was moved within the area (within a range of 80° to 90° at a distance of 1 m from P1) to verify whether the sound collection device 100 satisfied the above two requirements. In addition, the interfering sound source was randomly placed in 1 to 3 of a total of 8 positions, namely 0°, 15°, 30°, 45°, 135°, 150°, 165°, and 180°, during both learning and testing. In this experiment, such changes in the positions of the target sound source and the interfering sound source were performed in data processing units (e.g., word units) on the corpus.
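The source placement described above can be sketched with the standard library. The candidate angles and the 1 to 3 count come from the text; drawing the count and the test-time target angle from uniform distributions is an assumption, and the function names are hypothetical.

```python
import random

INTERFERER_ANGLES = [0, 15, 30, 45, 135, 150, 165, 180]   # degrees, from the text

def draw_interferer_angles(rng):
    """Pick 1 to 3 distinct interferer positions out of the 8 candidates."""
    count = rng.randint(1, 3)                 # uniform count is an assumption
    return sorted(rng.sample(INTERFERER_ANGLES, count))

def draw_target_angle(rng, training):
    """Target is fixed at 90 deg in training and moved within 80 to 90 deg in testing."""
    if training:
        return 90.0
    return rng.uniform(80.0, 90.0)            # uniform draw is an assumption

rng = random.Random(0)
train_layout = (draw_target_angle(rng, training=True), draw_interferer_angles(rng))
test_layout = (draw_target_angle(rng, training=False), draw_interferer_angles(rng))
```

A new layout would be drawn for each data processing unit of the corpus.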

次に、本実験のシミュレーションにおける詳細なパラメータ設定について説明する。 Next, we will explain the detailed parameter settings for the simulation of this experiment.

「従来のＭＵＢＡＳＥのモデル」を適用したシミュレーションでは、目的エリア音抽出部１０２において、差分フィルタにより非目的エリア音（妨害音）を推定する際に、低周波ほどパワーが弱いという傾向に基づき、αの値を２００／（ｆ＋０．０１）と設定した。 In a simulation using the "conventional MUBASE model," the value of α was set to 200/(f+0.01) in the target area sound extraction unit 102 when estimating non-target area sound (interference sound) using a differential filter, based on the tendency for the power to be weaker at lower frequencies.
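The frequency-dependent coefficient above is simple to tabulate. The sketch treats f as the frequency bin index, as stated earlier in the description; how α enters the subtraction of equation (4) is not reproduced here, and the function name `alpha` is hypothetical.

```python
import math

def alpha(f):
    """Coefficient 200 / (f + 0.01) for frequency bin index f."""
    return 200.0 / (f + 0.01)

# Coefficient for the first few bins: large at low frequencies, smaller higher up.
alphas = [alpha(f) for f in range(5)]
```

The offset 0.01 keeps the value finite at f = 0, and the coefficient falls as f grows, which matches the stated aim of compensating the weaker low-frequency power of the estimated interfering sound.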

また、「第１の本発明の実験モデル」及び「第２の本発明の実験モデル」の学習では、バッチサイズを３２、エポック数を２００と設定し、損失関数として平均二乗誤差を用いた。また、「第１の本発明の実験モデル」及び「第２の本発明の実験モデル」の学習では、最適化アルゴリズムにＡｄａｍ（以下の参考文献４を参照）を用い、学習率は０．００１とした。 In addition, in training the "first experimental model of the present invention" and the "second experimental model of the present invention", the batch size was set to 32, the number of epochs was set to 200, and the mean squared error was used as the loss function. In training the "first experimental model of the present invention" and the "second experimental model of the present invention", Adam (see Reference 4 below) was used as the optimization algorithm, and the learning rate was set to 0.001.

参考文献４ (Reference 4)：D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (ICLR), 2015.
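The training settings above (batch size 32, 200 epochs, mean squared error, Adam with a learning rate of 0.001) can be illustrated on a toy problem. This is a sketch only: the one-parameter linear model and its data stand in for the actual neural network, and the Adam constants b1, b2, and eps are the commonly used defaults, which the text does not specify.

```python
import numpy as np

BATCH_SIZE, EPOCHS, LEARNING_RATE = 32, 200, 0.001   # values from the text

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def adam_step(w, grad, m, v, t, lr=LEARNING_RATE, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; b1, b2, and eps are assumed defaults."""
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    m_hat = m / (1.0 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1.0 - b2 ** t)          # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy regression data standing in for the teacher data set.
x = np.linspace(-1.0, 1.0, 64)
y = 0.3 * x

w, m, v, t = 0.0, 0.0, 0.0, 0
initial_loss = mse(w * x, y)
rng = np.random.default_rng(0)
for epoch in range(EPOCHS):
    order = rng.permutation(len(x))
    for start in range(0, len(x), BATCH_SIZE):
        idx = order[start:start + BATCH_SIZE]
        grad = np.mean(2.0 * (w * x[idx] - y[idx]) * x[idx])   # d(MSE)/dw
        t += 1
        w, m, v = adam_step(w, grad, m, v, t)
final_loss = mse(w * x, y)
```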

本実験では、第１の本発明の実験モデル（マスク推定）、第２の本発明の実験モデル（パワースペクトラム推定）、及び従来構成の実験モデル（ＭＵＢＡＳＥ）の環境を構築し、それぞれについて、上記の学習処理及び信号処理（テスト処理）を行った。本実験の信号処理（テスト処理）では、３つの実験モデルのそれぞれについてＳＮＲ（Ｓｉｇｎａｌ－ｔｏ－ＮｏｉｓｅＲａｔｉｏ）とＳＴＯＩ（Ｓｈｏｒｔ－ＴｉｍｅＯｂｊｅｃｔｉｖｅＩｎｔｅｌｌｉｇｉｂｉｌｉｔｙ）の２つの指標を測定した。また、本実験のテスト処理では、それぞれの実験モデルについて、目的音源の位置を９０°で固定したパターン（以下、「目的音源固定パターン」と呼ぶ）と、目的音源を８０°～９０°の間でランダムに移動させたパターン（以下、「目的音源移動パターン」と呼ぶ）でのテスト処理を行った。図６は、本実験の結果について示した図である。 In this experiment, environments were constructed for the first experimental model of the present invention (mask estimation), the second experimental model of the present invention (power spectrum estimation), and the experimental model of conventional configuration (MUBASE), and the learning processing and signal processing (test processing) described above were performed for each. In the signal processing (test processing), two indices, SNR (Signal-to-Noise Ratio) and STOI (Short-Time Objective Intelligibility), were measured for each of the three experimental models. The test processing was performed for each experimental model in two patterns: one in which the position of the target sound source was fixed at 90° (hereinafter, the "target sound source fixed pattern") and one in which the target sound source was moved randomly between 80° and 90° (hereinafter, the "target sound source moving pattern"). Figure 6 shows the results of this experiment.

（Ａ－３）実施形態の効果 (A-3) Advantages of the Embodiment
この実施形態によれば、以下のような効果を奏することができる。 According to this embodiment, the following advantages can be obtained.

従来のＭＵＢＡＳＥを用いた構成では、所定の係数を伴うスペクトル減算によってエリア収音処理を行っていたが、この実施形態の収音装置１００では、教師データにより学習したニューラルネットワークを用いた深層エリア収音（ＤＭＵＢＡＳＥ）を行っている。特に、この実施形態の収音装置１００では、２チャンネルのマイクロホンアレイＭＡにおいて、差分フィルタの出力ｄをとることで正面方向以外から到来する妨害音（非目的エリア音）を得られることを利用し、ニューラルネットワークに差分フィルタの出力ｄ（妨害音が優勢となる情報）を観測信号と共にニューラルネットワークに入力することで、正面方向の目的エリア音が強調された出力を得ることができる。 In a conventional configuration using MUBASE, area sound collection processing was performed by spectral subtraction involving a predetermined coefficient, but the sound collection device 100 of this embodiment performs deep area sound collection (DMUBASE) using a neural network trained with teacher data. In particular, the sound collection device 100 of this embodiment utilizes the fact that in the two-channel microphone array MA, the output d of the differential filter can be taken to obtain interference sounds (non-target area sounds) arriving from directions other than the front, and by inputting the output d of the differential filter (information in which interference sounds are dominant) into the neural network together with the observation signal, an output can be obtained in which the target area sounds in the front direction are emphasized.

上記の通り、エリア収音処理では、目的エリア内の音源の動きに対して頑健であることが望ましいが、深層エリア収音（ＤＭＵＢＡＳＥ）では、データドリブンにフィルタを学習することになるため、頑健性を保証するような制約が必要となる。そして、この実施形態の収音装置１００では、単純なデータドリブン（例えば、観測信号のみ）でなく、差分フィルタの出力ｄ等の物理的な情報を補助情報に用いることで、環境変化への頑強性を向上させている。そして、図６に示すように、この実施形態の構成を再現した実験モデル（第１及び第２の本発明の実験モデル）はいずれも、目的音源固定パターン及び目的音源移動パターンの両方で、従来構成の実験モデル（ＭＵＢＡＳＥ）の精度を上回った。つまり、本発明の実験モデルは、従来よりも目的エリア音の音源の移動に対しても頑健であることが確認できた。 As described above, area sound collection should be robust to movement of the sound source within the target area; deep area sound collection (DMUBASE), however, learns the filter in a data-driven manner, so a constraint that ensures robustness is required. The sound collection device 100 of this embodiment therefore improves robustness to environmental changes by using physical information, such as the output d of the differential filter, as auxiliary information, rather than relying on a purely data-driven approach (e.g., one using only the observed signal). As shown in FIG. 6, both experimental models reproducing the configuration of this embodiment (the first and second experimental models of the present invention) exceeded the accuracy of the experimental model of conventional configuration (MUBASE) in both the target sound source fixed pattern and the target sound source moving pattern. In other words, the experimental models of the present invention were confirmed to be more robust than the conventional configuration to movement of the target area sound source as well.

（Ｂ）他の実施形態 (B) Other Embodiments
本発明は、上記の各実施形態に限定されるものではなく、以下に例示するような変形実施形態も挙げることができる。 The present invention is not limited to the embodiments described above, and modified embodiments such as those exemplified below are also possible.

（Ｂ－１）上記の実施形態において、収音装置１００は、学習処理モードと信号処理モード（テストモード）の両方に対応するものとして説明したが、予め学習モデルが保持されていれば信号処理モードだけに対応し、学習処理モードに必要な手段（学習手段）については除外した構成としてもよい。 (B-1) In the above embodiment, the sound collection device 100 has been described as being compatible with both the learning processing mode and the signal processing mode (test mode), but if a learning model is stored in advance, it may be configured to be compatible with only the signal processing mode and to exclude the means (learning means) required for the learning processing mode.

１００…収音装置、１０１…信号入力部、１０２…目的エリア音抽出部、１０３…信号出力部、１０２Ａ…第１の目的エリア音抽出部、２００…推定処理部、２１２、２２１、２２２、２３１、２４１、２５１、２１１…ＦＣ層、２１０…マスク処理部、２２０…位相処理部、２３０…差分抽出器、１０２Ｂ…第２の目的エリア音抽出部、３００…推定処理部、２１２、２２１、２２２、２３１、２４１、３５１、２１１…ＦＣ層、２３０…差分抽出器。 100...sound collection device, 101...signal input unit, 102...target area sound extraction unit, 103...signal output unit, 102A...first target area sound extraction unit, 200...estimation processing unit, 212, 221, 222, 231, 241, 251, 211...FC layers, 210...mask processing unit, 220...phase processing unit, 230...difference extractor, 102B...second target area sound extraction unit, 300...estimation processing unit, 212, 221, 222, 231, 241, 351, 211...FC layers, 230...difference extractor.

Claims

A sound collection device comprising:
a target sound extraction processing means for acquiring, using a learning model, a target sound emphasis signal in which a component of a target sound included in a first input signal is emphasized, from the first input signal from a first microphone constituting a microphone array and a difference signal which is a difference between the first input signal and a second input signal from a second microphone constituting the microphone array; and
a learning means for acquiring the learning model by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and a target sound signal.

The sound collection device according to claim 1, wherein the learning model outputs, from the first input signal and the difference signal, a mask coefficient that suppresses components of non-target sounds other than the target sound included in the first input signal, and
the target sound extraction processing means acquires the target sound emphasis signal by suppressing the non-target sound components in the first input signal using the mask coefficient.

The sound collection device according to claim 1, wherein the learning model outputs, from the first input signal and the difference signal, a target sound emphasis signal in which non-target sound components contained in the first input signal are suppressed and the target sound component is emphasized.

A sound collection program causing a computer to function as:
a target sound extraction processing means for acquiring, using a learning model, a target sound emphasis signal in which a component of a target sound included in a first input signal is emphasized, from the first input signal from a first microphone constituting a microphone array and a difference signal which is a difference between the first input signal and a second input signal from a second microphone constituting the microphone array; and
a learning means for acquiring the learning model by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and a target sound signal.

A sound collection method performed by a sound collection device, wherein
the sound collection device includes a target sound extraction processing means and a learning means,
the target sound extraction processing means acquires, using a learning model, a target sound emphasis signal in which a component of a target sound included in a first input signal is emphasized, from the first input signal from a first microphone constituting a microphone array and a difference signal which is a difference between the first input signal and a second input signal from a second microphone constituting the microphone array, and
the learning means acquires the learning model by performing learning processing using, as teacher data, data including the first input signal, the difference signal, and a target sound signal.