JP7721089B2

JP7721089B2 - Sound processing device, sound processing method and program

Info

Publication number: JP7721089B2
Application number: JP2022197160A
Authority: JP
Inventors: 克寿糸山; 一博中臺; 侑樹藤田
Original assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC; Institute of Science Tokyo
Current assignee: Honda Motor Co Ltd; Tokyo Institute of Technology NUC; Institute of Science Tokyo
Priority date: 2022-12-09
Filing date: 2022-12-09
Publication date: 2025-08-12
Anticipated expiration: 2042-12-09
Also published as: JP2024082932A

Description

The present invention relates to a sound processing device, a sound processing method, and a program.

Microphone array processing is a fundamental technology in acoustic signal processing. It is acoustic processing that uses multi-channel acoustic signals picked up by a microphone array; examples include sound source localization and sound source separation. Sound source localization is a method for estimating the direction of a sound source from multi-channel acoustic signals. Sound source separation is a method for extracting the components arriving from individual sound sources from multi-channel acoustic signals. Both are useful for identifying individual sounds, for example when speech is uttered in a noisy environment or when multiple sound sources are present. Sound source localization and sound source separation are applied in a variety of fields, including robot audition, smart speakers, teleconferencing systems, and the creation of meeting minutes.

Microphone array processing uses an acoustic transfer function that indicates the transfer characteristics of sound from a sound source to a sound receiving point. The acoustic transfer function may, for example, be calculated geometrically using a mathematical model that assumes a free sound field, or be measured in advance in a laboratory using sound sources installed in multiple directions. However, such an acoustic transfer function differs from one measured in the acoustic environment in which microphone array processing is actually used (sometimes referred to as the "usage environment" in this application). This difference can degrade the performance of sound source localization or sound source separation. To secure that performance, the acoustic transfer function could be measured in advance in the usage environment. However, the acoustic transfer function changes from the time of measurement as the acoustic environment changes, and re-measuring it requires a great deal of time and effort. Re-measurement in the usage environment is therefore not practical.

Non-Patent Document 1: Kazuhiro Nakadai, Masayuki Takigahira, Kumasuke Kawai, and Hiroshi Nakajima, "Improvement of Sound Source Localization and Separation by Continuous Online Adaptation of Transfer Functions," Japanese Society for Artificial Intelligence Second Type Research Meeting Materials, AI Challenge Research Group SIG-Challenge-058-07, <https://doi.org/10.11517/jsaisigtwo.2021.Challenge-058_07>, published December 16, 2021.

Non-Patent Document 1 describes a method for estimating an acoustic transfer function from acoustic signals acquired by a microphone array and updating it sequentially. This method makes it possible to acquire an acoustic transfer function online using any sound source, without setting up equipment for re-measurement. However, acoustic signals acquired in the usage environment are not necessarily suitable for acquiring an acoustic transfer function. For example, significant noise may be temporarily mixed into the acoustic signal. As a result, the acoustic transfer function could not always be estimated stably.

This embodiment has been made in view of the above points, and aims to provide a sound processing device, a sound processing method, and a program that can stably estimate an acoustic transfer function in the usage environment.

(1) The present application has been made to solve the above-mentioned problem, and one aspect of this embodiment is a sound processing device comprising: a storage unit that stores a first acoustic transfer function for each sound source direction; a sound source direction estimation unit that, for each frame, calculates a spatial spectrum for each sound source direction based on the frequency-domain transform coefficients of the acoustic signal of each channel and the first acoustic transfer function, and estimates the sound source direction at which the spatial spectrum is maximum as an estimated sound source direction; a representative estimated sound source direction determination unit that determines a representative estimated sound source direction, which is a representative value of the estimated sound source directions, based on the frequency distribution of the estimated sound source directions in an observation period consisting of a plurality of frames; an outlier removal unit that removes the transform coefficients of frames in which the estimated sound source direction falls outside a predetermined tolerance range from the representative estimated sound source direction; an acoustic transfer function estimation unit that estimates, as a second acoustic transfer function, a representative value of the acoustic transfer function from the sound source to the sound collection unit of the acoustic signal during the observation period, based on the transform coefficients of the acoustic signal in the remaining frames; and an acoustic transfer function update unit that updates the first acoustic transfer function for the representative estimated sound source direction using the second acoustic transfer function.

(2) Another aspect of this embodiment is the sound processing device of (1), wherein the representative estimated sound source direction determination unit may determine, as the representative estimated sound source direction, the estimated sound source direction whose frequency of occurrence is at a maximum.

(3) Another aspect of this embodiment is the sound processing device of (1), wherein the acoustic transfer function update unit may update the first acoustic transfer function to the weighted average of the first acoustic transfer function and the second acoustic transfer function.

(4) Another aspect of this embodiment is the sound processing device of (1), wherein the acoustic transfer function update unit may determine the reliability of the representative estimated sound source direction based on how frequently the estimated sound source direction falls within the tolerance range during the observation period, and may increase the ratio of the second acoustic transfer function to the first acoustic transfer function as the reliability increases.

(5) Another aspect of this embodiment is the sound processing device of (1), wherein the tolerance range may be equal to the representative estimated sound source direction itself and include no direction other than the representative estimated sound source direction.

(6) Another aspect of this embodiment is the sound processing device of (1), wherein the sound source direction estimation unit may calculate the spatial spectrum by multiplying an input vector containing the transform coefficient of each channel by the pseudo-inverse matrix of an acoustic transfer function vector containing the first acoustic transfer function of each channel.

(7) Another aspect of this embodiment may be a program for causing a computer to function as the sound processing device of (1).

(8) Another aspect of this embodiment is a sound processing method in a sound processing device including a storage unit that stores a first acoustic transfer function for each sound source direction, wherein the sound processing device executes: a sound source direction estimation step of calculating, for each frame, a spatial spectrum for each sound source direction based on the frequency-domain transform coefficients of the acoustic signal of each channel and the first acoustic transfer function, and estimating the sound source direction at which the spatial spectrum is maximum as an estimated sound source direction; a representative sound source direction determination step of determining a representative estimated sound source direction, which is a representative value of the estimated sound source directions, based on the frequency distribution of the estimated sound source directions in an observation period consisting of a plurality of frames; an outlier removal step of removing the transform coefficients of frames in which the estimated sound source direction falls outside a predetermined tolerance range from the representative estimated sound source direction; an acoustic transfer function estimation step of estimating, as a second acoustic transfer function, a representative value of the acoustic transfer function from the sound source to the sound collection unit of the acoustic signal during the observation period, based on the transform coefficients of the acoustic signal in the remaining frames; and an acoustic transfer function update step of updating the first acoustic transfer function for the representative estimated sound source direction using the second acoustic transfer function.

According to this embodiment, an acoustic transfer function can be estimated stably in a real acoustic environment.
According to the configurations (1), (7), and (8) described above, the second acoustic transfer function is calculated from the transform coefficients of acoustic signals that yield an estimated sound source direction within a predetermined range of the representative estimated sound source direction, and the calculated second acoustic transfer function is used to update the first acoustic transfer function associated with the representative estimated sound source direction. Because the second acoustic transfer function is a representative value obtained from acoustic signals that statistically yield the representative estimated sound source direction, or a direction close to it, a first acoustic transfer function with a stable correspondence to the sound source direction is obtained.

According to the configuration (2) described above, the estimated sound source direction that occurs most frequently within the observation period is taken as the representative estimated sound source direction, so the most likely estimated sound source direction is determined in a simple manner.

According to the configuration (3) described above, when the observation period changes, the update does not completely replace the first acoustic transfer function with the second acoustic transfer function; part of the previous first acoustic transfer function is retained. This avoids abrupt fluctuations in the first acoustic transfer function and thereby keeps the system stable.

According to the configuration (4) described above, the first acoustic transfer function can be updated using the second acoustic transfer function, with more weight placed on the second acoustic transfer function when the acoustic signals yield a more reliable estimated sound source direction. This improves the reliability of the updated first acoustic transfer function.
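The update in configurations (3) and (4) can be sketched as a reliability-weighted average. The following is only an illustration under stated assumptions: the function name `update_first_atf`, the linear mapping from reliability to mixing weight, and the cap `max_weight` are hypothetical and are not specified in this text.

```python
def update_first_atf(first_atf, second_atf, n_inliers, n_frames, max_weight=0.5):
    """Blend the stored (first) and newly estimated (second) transfer functions.

    reliability is the share of frames in the observation period whose
    estimated direction fell inside the tolerance range. The mixing weight
    grows linearly with it (an assumed mapping, capped at max_weight).
    """
    if n_frames <= 0:
        raise ValueError("observation period must contain at least one frame")
    reliability = n_inliers / n_frames
    weight = max_weight * reliability
    # Weighted average per channel: some of the old value always remains.
    return [(1 - weight) * h1 + weight * h2
            for h1, h2 in zip(first_atf, second_atf)]


# Hypothetical two-channel transfer functions at one frequency.
stored = [1 + 0j, 0 + 1j]
estimated = [0 + 1j, 1 + 0j]
updated = update_first_atf(stored, estimated, n_inliers=8, n_frames=10)
```

Because the weight never reaches 1 in this sketch, the stored value is never overwritten outright, which matches the stability argument given for configuration (3).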

According to the configuration (5) described above, whether to exclude the transform coefficients of an acoustic signal can be decided simply by checking whether the estimated sound source direction it yields equals the representative estimated sound source direction.

According to the configuration (6) described above, the sound source direction can be estimated from a spatial spectrum calculated by a simple matrix operation. Because this does not require many computational resources, it can be implemented economically.
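As a hedged illustration of why the operation in configuration (6) is simple: for a nonzero column vector, the Moore-Penrose pseudo-inverse reduces to a scaled conjugate transpose, so no matrix inversion is needed. Here h(θ) denotes the M-channel acoustic transfer function vector for direction θ and X the input vector. Taking the magnitude of the product as the spatial spectrum is an assumption of this sketch, not something stated in this passage.

```latex
\mathbf{h}(\theta)^{+}
  = \frac{\mathbf{h}(\theta)^{\mathsf{H}}}{\mathbf{h}(\theta)^{\mathsf{H}}\,\mathbf{h}(\theta)},
\qquad
S_{sp}(\theta) = \left|\,\mathbf{h}(\theta)^{+}\,\mathbf{X}\,\right|
```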

FIG. 1 is a schematic block diagram showing an example of the configuration of a sound processing system according to the present embodiment.
FIG. 2 is a data flowchart showing an example of acoustic transfer function adaptation processing according to the present embodiment.
FIG. 3 is a diagram illustrating an example of the configuration of a sound collection unit.
FIG. 4 is an explanatory diagram illustrating an acoustic transfer function.
FIG. 5 is a diagram illustrating a laboratory.
FIG. 6 is a diagram illustrating an example of the success rate for each observation period.
FIG. 7 is a diagram illustrating an example of the success rate for each type of first acoustic transfer function.
FIG. 8 is a schematic block diagram showing an example of the configuration of a sound processing system according to a first modified example of the present embodiment.
FIG. 9 is a schematic block diagram showing an example of the configuration of a sound processing system according to a second modified example of the present embodiment.

(First embodiment)
A first embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a schematic block diagram showing an example of the configuration of a sound processing system S1 according to this embodiment.
The sound processing system S1 includes a sound processing device 10 and a sound collection unit 20.

The sound processing device 10 stores in advance, for each sound source direction, a first acoustic transfer function that indicates the transfer characteristics of sound from a sound source. In this application, an acoustic transfer function stored in the sound processing device 10, whether in advance or temporarily, is referred to as a "first acoustic transfer function," and an acoustic transfer function estimated from the acoustic signal acquired from the sound collection unit 20 is referred to as a "second acoustic transfer function," to distinguish between the two.

The sound processing device 10 acquires multi-channel acoustic signals from the sound collection unit 20. For each frame, the sound processing device 10 calculates the frequency-domain transform coefficients of the acoustic signal of each channel, and calculates a spatial spectrum for each sound source direction based on the calculated transform coefficients and the first acoustic transfer function. The sound processing device 10 estimates the sound source direction at which the spatial spectrum has a local maximum as the estimated sound source direction (sound source localization). The sound processing device 10 accumulates the estimated sound source direction of each frame together with the acoustic signals used to estimate it, and generates a frequency distribution (histogram) of the estimated sound source directions over an observation period consisting of multiple frames. Based on this frequency distribution, the sound processing device 10 determines a representative value of the estimated sound source directions as the representative estimated sound source direction.
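The histogram step can be sketched as follows, assuming the representative value is taken as the most frequent direction (as in configuration (2)). The function name `representative_direction`, the tie-breaking rule, and the sample values are hypothetical.

```python
from collections import Counter


def representative_direction(estimated_dirs):
    """Return (mode, histogram) of per-frame direction estimates in degrees."""
    histogram = Counter(estimated_dirs)
    # Highest count wins; ties go to the smaller angle so the result is deterministic.
    mode = min(histogram, key=lambda d: (-histogram[d], d))
    return mode, histogram


# Hypothetical per-frame estimates over one observation period (degrees).
frames = [30, 30, 35, 30, 120, 30, 25, 30]
rep_dir, hist = representative_direction(frames)
```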

The sound processing device 10 identifies the frames of the acoustic signal that yield an estimated sound source direction outside a predetermined tolerance range from the representative estimated sound source direction, and removes the transform coefficients of the acoustic signal in those frames as outliers. The sound processing device 10 calculates the acoustic transfer function from the sound source to the sound collection unit 20 based on the transform coefficients that remain. The sound processing device 10 estimates a representative value of the acoustic transfer function during the observation period as the second acoustic transfer function. The sound processing device 10 then updates the first acoustic transfer function for the determined representative estimated sound source direction using the estimated second acoustic transfer function.
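A minimal sketch of the outlier removal step, assuming directions are azimuths in degrees so that the distance wraps around at 360. The helper names and the string stand-ins for per-frame coefficient vectors are hypothetical. With `tolerance_deg=0` the filter keeps only frames whose direction equals the representative direction, which corresponds to configuration (5).

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)


def remove_outlier_frames(frame_dirs, frame_coeffs, rep_dir, tolerance_deg=0):
    """Keep only the coefficients of frames whose direction is within tolerance."""
    return [coeffs for d, coeffs in zip(frame_dirs, frame_coeffs)
            if angular_distance(d, rep_dir) <= tolerance_deg]


dirs = [30, 355, 30, 120, 5]
coeffs = ["X0", "X1", "X2", "X3", "X4"]  # stand-ins for per-frame coefficient vectors
kept_exact = remove_outlier_frames(dirs, coeffs, rep_dir=30)
kept_wrap = remove_outlier_frames(dirs, coeffs, rep_dir=0, tolerance_deg=5)
```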

The sound processing device 10 may perform microphone array processing using the updated first acoustic transfer function. Microphone array processing may include sound source separation and other processing in addition to sound source localization. Sound source separation includes extracting, from the multi-channel acoustic signals acquired from the sound collection unit 20, the sound components of individual sound sources as sound source components based on the estimated sound source direction.

The sound processing device 10 may use the sound source direction estimated by sound source localization, the sound source signal indicating the sound source components extracted by sound source separation, or both, for other processing within the device itself, or may output them to another device serving as an output destination (not shown; sometimes referred to as an "output destination device" in this application). As such other processing, the sound processing device 10 may, for example, estimate the presence of an object in the estimated sound source direction. The sound processing device 10 may also perform speech recognition on the sound source components or sound source signal from a specific sound source direction (speaker) to obtain utterance text indicating the content of the utterance, or may estimate the speaker.

The sound collection unit 20 has a plurality of microphones 20-1 to 20-M and functions as a microphone array. The number of microphones M is an integer of two or more and corresponds to the number of channels. The individual microphones are placed at different positions, and each has an actuator that picks up the sound waves arriving at it. The actuator converts the arriving sound waves into an acoustic signal. The converted acoustic signal is output to the sound processing device 10 wirelessly or by wire. Each microphone corresponds to one channel of the acoustic signal.

The arrangement of the microphones may be fixed or variable, as long as their positions differ from one another. The sound collection unit 20 illustrated in FIG. 3 is configured as an eight-channel circular microphone array. In FIG. 3, individual microphones are indicated by black circles. The eight microphones are arranged at equal intervals on a circle. They are mounted on the side surface of a housing shaped as a body of revolution that is rotationally symmetric about a vertical rotation axis, and their relative positions are fixed. The sound collection unit 20 includes an output interface (not shown). The output interface aggregates the eight-channel acoustic signals recorded by the individual microphones and outputs them in parallel by wire to the sound processing device 10. The number and arrangement of microphones are not limited to this example. The number of microphones M may be two or more and seven or fewer, or nine or more. The microphones may also be arranged on a straight line, as illustrated in FIG. 4.

The sound processing device 10 may be configured as a general-purpose information and communication device such as a PC (Personal Computer) or a multi-function mobile phone, or as a dedicated device such as a measuring instrument or a monitoring device.
The following description takes as an example the case in which the sound collection unit 20 is configured separately from the sound processing device 10, but the two may instead be integrated into a single unit.

Next, an example of the functional configuration of the sound processing device 10 according to this embodiment will be described.
The sound processing device 10 includes an input/output unit 110, a control unit 120, and a storage unit 150.
The input/output unit 110 connects to other devices wirelessly or by wire so that various types of data can be input and output. The input/output unit 110 passes the M-channel acoustic signals from the sound collection unit 20 to the control unit 120 as input data. When it receives output data from the control unit 120, the input/output unit 110 can output that data to an output destination device (not shown). The output data may include information obtained by microphone array processing, for example estimated sound source direction information indicating an estimated sound source direction, or a sound source signal indicating the sound source components arriving from a sound source. The input/output unit 110 may be, for example, an input/output interface, a communication interface, or a combination thereof.

The control unit 120 executes the processing that realizes the functions of the sound processing device 10, the processing that controls those functions, and the like. The control unit 120 may be built from dedicated components, either as a whole or for individual functions, or may be configured as a computer system that includes a processor such as a CPU (Central Processing Unit) and various storage media. The processor reads a predetermined program stored in advance on a storage medium and executes the processing specified by the instructions in that program, thereby realizing the functions of the control unit 120. The functional configuration of the control unit 120 will be described later.

The storage unit 150 includes a storage medium that temporarily or permanently stores various types of data. The storage unit 150 stores data used by the control unit 120 (including parameters) and data acquired by the control unit 120 or other functional units (including input data received from outside, intermediate data produced during processing, and data generated as processing results). The storage unit 150 stores an acoustic transfer function set. The acoustic transfer function set contains a first acoustic transfer function for each sound source direction, each frequency, and each microphone (channel). In this application, a sound source direction associated with a first acoustic transfer function in the acoustic transfer function set may be referred to as a "target direction."

As the initial value H_T of the first acoustic transfer function, either a pre-measured acoustic transfer function or an acoustic transfer function calculated in advance using a predetermined geometric model may be used. Possible geometric models include a plane wave model, which assumes the propagation of a plane wave in a free sound field, and a spherical wave model, which assumes the propagation of a spherical wave from a sound source located at a predetermined distance from the sound collection unit 20. The initial acoustic transfer function set contains, for each channel and frequency, the initial values H_T(θ_1) to H_T(θ_N) of the first acoustic transfer function for each sound source direction. An initial value such as H_T(θ_1) can be calculated with the geometric model by assuming that a sound wave arrives from the corresponding sound source direction θ_1. N is the number of predetermined sound source directions. The angular interval between adjacent sound source directions directly affects the accuracy of the sound source direction estimated by sound source localization. A larger number of sound source directions is expected to improve that accuracy, but it also increases the amount of computation needed to calculate the spatial spectrum in sound source localization.
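To make the plane wave model concrete, the following sketch computes initial values H_T(θ_1) to H_T(θ_N) for a circular array at a single frequency. The array radius, the 5-degree direction grid, the speed of sound, the use of the array centre as phase reference, and all function names are assumptions chosen for illustration, not values from this text.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def circular_array_positions(num_mics, radius_m):
    """Microphone (x, y) positions equally spaced on a circle."""
    return [(radius_m * math.cos(2 * math.pi * m / num_mics),
             radius_m * math.sin(2 * math.pi * m / num_mics))
            for m in range(num_mics)]


def plane_wave_atf(mic_positions, theta_deg, freq_hz):
    """Free-field plane-wave transfer function per microphone, relative to the array centre."""
    theta = math.radians(theta_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector pointing toward the source
    atf = []
    for x, y in mic_positions:
        # A microphone closer to the source receives the wavefront earlier (negative delay).
        delay_s = -(x * ux + y * uy) / SPEED_OF_SOUND
        atf.append(cmath.exp(-2j * math.pi * freq_hz * delay_s))
    return atf


mics = circular_array_positions(num_mics=8, radius_m=0.05)
directions = range(0, 360, 5)  # N = 72 target directions (assumed grid)
initial_set = {theta: plane_wave_atf(mics, theta, freq_hz=1000.0)
               for theta in directions}
```

Halving the grid step doubles N and therefore the number of spatial spectrum evaluations per frame, which is the accuracy versus computation trade-off described above.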

Next, an example of the functional configuration of the control unit 120 according to this embodiment will be described. The control unit 120 includes a frequency analysis unit 122, a sound source direction estimation unit 124, a representative estimated sound source direction determination unit 126, an outlier removal unit 128, a representative acoustic signal determination unit 130, an acoustic transfer function estimation unit 132, and an acoustic transfer function update unit 134. Unless otherwise specified, the processing of the sound source direction estimation unit 124, the representative estimated sound source direction determination unit 126, the outlier removal unit 128, the representative acoustic signal determination unit 130, the acoustic transfer function estimation unit 132, and the acoustic transfer function update unit 134 may each be executed independently for each frequency.

The frequency analysis unit 122 receives M-channel acoustic signals from the sound collection unit 20 via the input/output unit 110. Each of the acquired M-channel acoustic signals represents a time series (waveform) of the amplitude at each sample time in the time domain. For each channel, the frequency analysis unit 122 performs frequency analysis on the time-domain acoustic signal for each frame of a predetermined length (e.g., 20 ms to 100 ms) and converts it into a transform coefficient for each frequency in the frequency domain. For each channel, the set of transform coefficients across frequencies represents a frequency spectrum. For the frequency analysis, the frequency analysis unit 122 can use a technique such as the short-time Fourier transform (STFT) or the discrete Fourier transform (DFT). The frequency analysis unit 122 outputs input signal information indicating the resulting transform coefficients to the sound source direction estimation unit 124 and the acoustic transfer function estimation unit 132.
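
The frame-wise frequency analysis can be sketched in Python as follows for a single channel. This is only an illustration: the naive DFT, the Hann window, the toy frame length of 32 samples, and the function names are assumptions, and a practical implementation would normally use an FFT routine.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft(frame):
    """Naive DFT returning the non-negative-frequency bins (0 .. n/2)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def stft(signal, frame_len, hop):
    """Split one channel into windowed frames and transform each frame."""
    window = hann(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append(dft([w * s for w, s in zip(window, chunk)]))
    return frames  # frames[k][f]: coefficient of frame k at frequency bin f


# Toy input: a sinusoid that falls exactly on bin 4 of a 32-point frame.
frame_len, hop = 32, 16
signal = [math.sin(2 * math.pi * 4 * t / frame_len) for t in range(96)]
spectra = stft(signal, frame_len, hop)
peak_bin = max(range(len(spectra[0])), key=lambda f: abs(spectra[0][f]))
```

Applying `stft` to each of the M channels and collecting the coefficients of one frame and one frequency bin across channels gives the input vector X used in the subsequent processing.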

The sound source direction estimation unit 124 refers to the acoustic transfer function set stored in the storage unit 150 and, using the transform coefficients of each channel indicated by the input signal information from the frequency analysis unit 122, calculates a spatial spectrum S_sp(θ) for each frequency. The spatial spectrum S_sp(θ) can be regarded as an index of how likely it is that a sound source exists in each target direction θ relative to the position of the sound collection unit 20. The sound source direction estimation unit 124 can calculate the spatial spectrum from the acoustic transfer function set consisting of the first acoustic transfer functions H_E, the target direction θ, and the input vector X. As exemplified in Equation (1), the sound source direction estimation unit 124 can take the direction in which the spatial spectrum is largest as the estimated sound source direction φ. A specific example of how to calculate the spatial spectrum S_sp(θ) is described later. The sound source direction estimation unit 124 outputs estimated sound source direction information indicating the estimated sound source direction to the representative estimated sound source direction determination unit 126 and the outlier removal unit 128. The sound source direction estimation unit 124 also outputs the input signal information, associated with the estimated sound source direction information, to the outlier removal unit 128.

The representative estimated sound source direction determination unit 126 receives the estimated sound source direction information from the sound source direction estimation unit 124. For each preset observation period, it generates a frequency distribution of the estimated sound source directions φ indicated by that information. An observation period is a period containing the multiple frames from which one representative estimated sound source direction is determined. More specifically, for each frame in the observation period, the representative estimated sound source direction determination unit 126 identifies the estimated sound source direction indicated by the estimated sound source direction information and increments the frame count for that direction by one, thereby counting the number of frames for each estimated sound source direction. The number of frames (frequency) counted for each estimated sound source direction in the observation period can then be used as the frequency distribution (histogram) of the estimated sound source direction.

The representative estimated sound source direction determination unit 126 determines the estimated sound source direction with the largest number of frames (the mode) within the identified section as the representative estimated sound source direction θ'. It outputs representative estimated sound source direction information indicating the determined representative estimated sound source direction θ' to the outlier removal unit 128.
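
The counting and mode selection described above can be sketched in Python as follows. The per-frame direction values are made up for illustration, and the tie-breaking behavior (first direction encountered wins) is a property of this sketch, since the description does not specify how ties are resolved.

```python
from collections import Counter


def representative_direction(estimated_directions):
    """Return (mode, histogram) of the per-frame estimated directions."""
    histogram = Counter(estimated_directions)  # direction -> number of frames
    # Ties are broken by first occurrence; the source does not specify a rule.
    mode, _count = histogram.most_common(1)[0]
    return mode, histogram


# Hypothetical per-frame estimates (degrees) over one observation period.
phi = [30, 30, 35, 30, 120, 30, 35, 30]
theta_rep, hist = representative_direction(phi)
```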

The outlier removal unit 128 receives the estimated sound source direction information and the input signal information from the sound source direction estimation unit 124, and receives the representative estimated sound source direction information from the representative estimated sound source direction determination unit 126. The outlier removal unit 128 determines, for example, whether each estimated sound source direction is equal to the representative estimated sound source direction. When the estimated sound source direction is within a predetermined range of the representative estimated sound source direction, the outlier removal unit 128 adopts the input signal information corresponding to the estimated sound source direction information indicating that direction. When the estimated sound source direction differs from the representative estimated sound source direction, the outlier removal unit 128 removes and rejects the corresponding input signal information as an outlier. The outlier removal unit 128 thereby functions as a mode filter. It outputs the adopted input signal information to the representative acoustic signal determination unit 130.

For each observation period and for each frequency, the representative acoustic signal determination unit 130 determines, as the representative transform coefficient, a representative value across frames (e.g., the average) of the transform coefficients of each channel indicated by the input signal information from the outlier removal unit 128. The representative acoustic signal determination unit 130 outputs representative input signal information indicating the determined representative transform coefficients to the acoustic transfer function estimation unit 132.
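
A minimal Python sketch of the mode filter followed by per-channel averaging is given here for one frequency bin. The two-channel coefficients and direction values are hypothetical, and the arithmetic mean is used because the description names the average as one example of a representative value.

```python
def mode_filter(input_vectors, directions, theta_rep):
    """Keep the input vectors of frames whose estimate equals theta_rep."""
    return [x for x, phi in zip(input_vectors, directions) if phi == theta_rep]


def representative_vector(vectors):
    """Per-channel mean over the retained frames (one frequency bin)."""
    count = len(vectors)
    channels = len(vectors[0])
    return [sum(v[m] for v in vectors) / count for m in range(channels)]


# Hypothetical 2-channel coefficients for four frames at one frequency.
X = [[1 + 1j, 2 + 0j], [3 - 1j, 0 + 2j], [9 + 9j, 9 + 9j], [2 + 0j, 1 + 1j]]
phi = [30, 30, 120, 30]
kept = mode_filter(X, phi, theta_rep=30)   # the 120-degree frame is rejected
x_rep = representative_vector(kept)
```

Because the representative direction is the mode of the estimates, at least one frame always survives the filter, so the average is well defined.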

The acoustic transfer function estimation unit 132 receives the representative input signal information from the representative acoustic signal determination unit 130. For each frequency, based on the representative transform coefficient of each channel indicated by the representative input signal information, the acoustic transfer function estimation unit 132 estimates the acoustic transfer function from the sound source to the microphone corresponding to that channel as the second acoustic transfer function H'. The second acoustic transfer function H' corresponds to a representative value of the acoustic transfer function estimated over the observation period. When estimating the second acoustic transfer function H', the acoustic transfer function estimation unit 132 normalizes, for example, both the amplitude and the phase of the representative transform coefficients across channels.

The acoustic transfer function estimation unit 132 can calculate the second acoustic transfer function H' according to, for example, Equation (2). In the example of Equation (2), the representative input vector X is divided by its norm |X|, which normalizes the amplitude of the representative transform coefficients. The norm may be, for example, the square root of the sum of squares. The representative input vector X is a vector whose elements are the representative transform coefficients X_m of the respective channels m at a given frequency. Each normalized amplitude is a real value from 0 to 1 inclusive. The phase of the representative transform coefficients is normalized by multiplying them by the complex conjugate of the quotient obtained by dividing the sum Σ_m X_m of the representative transform coefficients X_m over the channels by its absolute value |Σ_m X_m|. As a result of this phase normalization, the average phase across channels, weighted by the amplitude of the representative transform coefficient of each channel, becomes 0. In this embodiment, the acoustic transfer function may be a value whose amplitude and phase are expressed relative to the other channels, and need not be an absolute value. The acoustic transfer function estimation unit 132 outputs second acoustic transfer function information indicating the estimated second acoustic transfer function H' to the acoustic transfer function update unit 134.
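
Read literally, the description of Equation (2) corresponds to H' = (X / |X|) · conj(Σ_m X_m / |Σ_m X_m|). The following Python fragment is a sketch of that reading; the function name and the four-channel example values are assumptions, and the sketch does not guard against a zero norm or a zero channel sum.

```python
import math


def estimate_transfer_function(x_rep):
    """Amplitude- and phase-normalize a representative input vector."""
    norm = math.sqrt(sum(abs(x) ** 2 for x in x_rep))  # Euclidean norm |X|
    total = sum(x_rep)                                   # sum over channels
    phase_ref = (total / abs(total)).conjugate()         # unit-modulus factor
    return [x / norm * phase_ref for x in x_rep]


x_rep = [2j, 2j, 1j, 4j]  # hypothetical 4-channel representative coefficients
h2 = estimate_transfer_function(x_rep)
```

In this example |X| = 5 and the common phase of 90° is removed, so the result is the real vector (0.4, 0.4, 0.2, 0.8), which has unit norm and a channel sum lying on the positive real axis.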

The acoustic transfer function update unit 134 receives the second acoustic transfer function information from the acoustic transfer function estimation unit 132 and the representative estimated sound source direction information from the representative estimated sound source direction determination unit 126. For each frequency, the acoustic transfer function update unit 134 identifies the second acoustic transfer function H' of each channel indicated by the second acoustic transfer function information as the second acoustic transfer function corresponding to the representative estimated sound source direction indicated by the representative estimated sound source direction information. Using the identified second acoustic transfer function H', the acoustic transfer function update unit 134 updates the first acoustic transfer function H_E corresponding to the representative estimated sound source direction in the acoustic transfer function set stored in the storage unit 150.

The acoustic transfer function update unit 134 uses, for example, exponential smoothing: it takes a weighted average of the second acoustic transfer function H' at that time and the first acoustic transfer function H_E(θ') for the representative estimated sound source direction θ' to be updated, and uses the result as the newly updated first acoustic transfer function H_E(θ'). In the example of Equation (3), the weighting coefficient β by which the second acoustic transfer function H' is multiplied is a predetermined positive real value no greater than 1. The first acoustic transfer function H_E(θ') before the update is multiplied by the weighting coefficient (1-β). The weighting coefficients β and (1-β) correspond to the proportions contributed by the second acoustic transfer function H' and the first acoustic transfer function H_E(θ'), respectively. Accordingly, the larger the weighting coefficient β, the more heavily the second acoustic transfer function H' is weighted in the time average that becomes the first acoustic transfer function H_E(θ'). When the weighting coefficient β is 1, the first acoustic transfer function H_E(θ') is replaced by the second acoustic transfer function H' for each frame. That is, the larger the weighting coefficient β, the more the first acoustic transfer function H_E(θ') reflects influences contained in the second acoustic transfer function H', such as whether the sound source is emitting sound, temporary changes in the acoustic environment, and erroneous estimation of the sound source direction. The smaller the weighting coefficient β, the more temporary fluctuations in the second acoustic transfer function H' are smoothed out.
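
The weighted average (1-β)H_E + βH' can be sketched in Python as follows. The function name and the two-channel example values are hypothetical; the range check on β reflects the statement that β is a positive value no greater than 1.

```python
def update_transfer_function(h_first, h_second, beta):
    """Return (1 - beta) * H_E + beta * H' for each channel."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must satisfy 0 < beta <= 1")
    return [(1.0 - beta) * a + beta * b for a, b in zip(h_first, h_second)]


h_e = [1 + 0j, 0 + 1j]      # stored first transfer function (hypothetical)
h_new = [0 + 1j, 0 + 1j]    # second transfer function for the current period
smoothed = update_transfer_function(h_e, h_new, beta=0.25)
replaced = update_transfer_function(h_e, h_new, beta=1.0)
```

With β = 0.25 the first channel moves only a quarter of the way toward the new estimate (0.75 + 0.25j), whereas β = 1 discards the stored value entirely.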

The acoustic transfer function update unit 134 stores the new first acoustic transfer function H_E(θ') in the storage unit 150, in association with the representative estimated sound source direction θ', in place of the first acoustic transfer function H_E(θ') before the update.

The acoustic transfer function update unit 134 may set the weighting coefficient β for the second acoustic transfer function H' so that β increases as the reliability of the representative estimated sound source direction θ' increases. As the reliability of the representative estimated sound source direction θ', the acoustic transfer function update unit 134 can use the ratio of the number of frames in which the estimated sound source direction φ falls within a predetermined range of the representative estimated sound source direction θ' to the total number of frames. More specifically, the acoustic transfer function update unit 134 can set the weighting coefficient β to L/(2K), where L is the number of frames in the observation period in which the estimated sound source direction φ equals the representative estimated sound source direction θ', and K is the total number of frames in the observation period. When L = K, the weighting coefficient β takes its maximum value of 0.5.
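
As a small worked example of β = L/(2K), the following Python sketch counts the matching frames directly. The direction values and the function name are hypothetical.

```python
def reliability_weight(directions, theta_rep):
    """beta = L / (2K): L frames equal theta_rep out of K frames in total."""
    k_total = len(directions)
    l_match = sum(1 for phi in directions if phi == theta_rep)
    return l_match / (2 * k_total)


phi = [30, 30, 35, 30, 120, 30, 35, 30]
beta = reliability_weight(phi, theta_rep=30)        # L = 5, K = 8
beta_max = reliability_weight([10, 10, 10, 10], 10)  # L = K
```

Here five of eight frames agree with the representative direction, so β = 5/16 = 0.3125; when every frame agrees, β reaches 0.5.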

Note that the representative estimated sound source direction determination unit 126 may identify, as significant estimated sound source directions, those estimated sound source directions whose counted number of frames exceeds a preset lower limit, and may define a section consisting of multiple significant sound source directions that are adjacent to one another. Within the identified section, the representative estimated sound source direction determination unit 126 may determine, as the representative estimated sound source direction, a representative value of the estimated sound source directions at which the number of frames has a local maximum. Isolated, anomalous estimated sound source directions are thereby excluded and are no longer selected as the representative estimated sound source direction.

The outlier removal unit 128 may adopt the input signal information corresponding to the estimated sound source direction information indicating an estimated sound source direction when that direction is within a predetermined tolerance range (e.g., ±3° to ±5°) of the representative estimated sound source direction, and may remove and reject that input signal information as an outlier when the estimated sound source direction is outside the tolerance range. In this way, transform coefficients of acoustic signals that yield an estimated sound source direction close to the representative estimated sound source direction are also used to estimate the second acoustic transfer function. A tolerance range of 0° corresponds to the mode filter described above.
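
The tolerance-based variant can be sketched in Python as follows. Treating azimuths as circular, so that 355° is 5° away from 0°, is an assumption of this sketch; the description does not state how the difference is measured. The frame labels stand in for input vectors.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def tolerance_filter(input_vectors, directions, theta_rep, tolerance_deg):
    """Keep frames whose estimate lies within tolerance_deg of theta_rep."""
    return [
        x for x, phi in zip(input_vectors, directions)
        if angular_difference(phi, theta_rep) <= tolerance_deg
    ]


frames = ["X1", "X2", "X3", "X4", "X5"]   # placeholders for input vectors
phi = [0, 355, 5, 10, 180]
kept = tolerance_filter(frames, phi, theta_rep=0, tolerance_deg=5)
mode_only = tolerance_filter(frames, phi, theta_rep=0, tolerance_deg=0)
```

With a tolerance of 5° the frames estimated at 355° and 5° are retained alongside the exact match, and with a tolerance of 0° only the exact match remains, which is the mode filter case.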

Note that the number of sound sources emitting sound at any one time is not necessarily one; there may be two or more, or a sound source may temporarily or continuously emit no sound. Therefore, the sound source direction estimation unit 124 may output estimated sound source direction information to the representative estimated sound source direction determination unit 126 and the outlier removal unit 128 only when exactly one direction is detected in which the spatial spectrum S_sp(θ) has a local maximum and exceeds a predetermined spatial spectrum threshold; the information then indicates that single detected direction as the estimated sound source direction θ'. As described above, the acoustic transfer function update unit 134 can use the second acoustic transfer function H' to update the first acoustic transfer function H_E(θ') for the single estimated sound source direction θ' reported in the estimated sound source direction information from the sound source direction estimation unit 124. In that case, when determining the weighting coefficient β, the acoustic transfer function update unit 134 may take K as the number of frames in the observation period in which exactly one estimated sound source direction is detected, and L as the number of those frames in which that single estimated sound source direction φ equals the representative estimated sound source direction θ' (or falls within the tolerance range of it).

In other words, the sound source direction estimation unit 124 does not output estimated sound source direction information to the representative estimated sound source direction determination unit 126 and the outlier removal unit 128 either when two or more directions are detected in which the spatial spectrum S_sp(θ) has a local maximum and exceeds the predetermined spatial spectrum threshold, or when no such direction is detected. Consequently, input signal information acquired when two or more estimated sound source directions are detected, or when none is detected, is not used to determine the second acoustic transfer function. This is because, when two or more estimated sound source directions are detected, sounds arriving from multiple sound sources are superimposed at the microphones, so the ratio of the transform coefficients between channels does not give the ratio of the acoustic transfer functions for the sound source direction of any one particular sound source. When no sound source direction is detected, no significant sound arrives at the microphones from a sound source in the first place. Therefore, restricting sound source localization and the estimation and updating of the acoustic transfer function to the case in which exactly one sound source is detected suppresses degradation in the estimation accuracy of the acoustic transfer function.
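
The single-source condition can be sketched in Python as follows. The circular direction grid, the strict local-maximum test (plateaus are not counted as peaks), the threshold, and the example spectra are all assumptions made for illustration.

```python
def significant_peaks(spectrum, threshold):
    """Indices of strict local maxima above threshold on a circular grid."""
    n = len(spectrum)
    return [
        i for i in range(n)
        if spectrum[i] > threshold
        and spectrum[i] > spectrum[i - 1]        # index -1 wraps around
        and spectrum[i] > spectrum[(i + 1) % n]
    ]


def single_source_direction(spectrum, directions, threshold):
    """Return the direction only when exactly one significant peak exists."""
    peaks = significant_peaks(spectrum, threshold)
    return directions[peaks[0]] if len(peaks) == 1 else None


directions = [0, 60, 120, 180, 240, 300]
one = single_source_direction([1, 5, 2, 1, 1, 1], directions, threshold=3)
two = single_source_direction([1, 5, 2, 1, 6, 1], directions, threshold=3)
silent = single_source_direction([1, 2, 1, 1, 1, 1], directions, threshold=3)
```

Frames for which the function returns `None` (two peaks, or no peak above the threshold) would simply contribute nothing to the histogram or to the estimate of the second acoustic transfer function.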

In two-dimensional space, the sound source direction can be defined by the azimuth angle from a representative point (e.g., the center of gravity) of the sound collection unit 20. In that case, the sound source directions associated with the individual first acoustic transfer functions that make up the acoustic transfer function set can form, for example, a one-dimensional array distributed on a circle that is parallel to the horizontal plane and centered on the position of the sound collection unit 20. In three-dimensional space, the sound source direction can be defined by a pair of an azimuth angle and an elevation angle. The sound source directions can then form a two-dimensional array distributed on a sphere centered on the position of the sound collection unit 20. The acoustic transfer function set may instead include a first acoustic transfer function for each sound source position. In that case, the sound source positions form a three-dimensional distribution in three-dimensional space. A sound source position is expressed in three-dimensional coordinates relative to the position of the sound collection unit 20 and corresponds to the combination of a sound source direction and a distance from the reference position. Although this embodiment is described mainly for the case in which the sound source positions form a one-dimensional array, it is also applicable to two-dimensional and three-dimensional arrays.

When the acoustic transfer function set includes a first acoustic transfer function for each sound source position, the sound source direction estimation unit 124 can estimate the sound source position as the information to be estimated. Instead of calculating the spatial spectrum for each sound source direction, the sound source direction estimation unit 124 calculates it for each sound source position and identifies the sound source position at which the spatial spectrum has a local maximum (or its maximum). The acoustic transfer function update unit 134 treats the identified sound source position as the estimated sound source position and updates the first acoustic transfer function for that position using the second acoustic transfer function estimated by the acoustic transfer function estimation unit 132 with the method described above.

(Example of sound source localization)
Next, an example of sound source localization will be described. For sound source localization, the sound source direction estimation unit 124 can use, for example, beamforming. The sound source direction estimation unit 124 can calculate, as the estimated sound source direction, the sound source direction θ that gives a local maximum of the spatial spectrum S_sp(θ) exemplified in Equation (4). The spatial spectrum S_sp(θ) is obtained by multiplying the input vector X by the pseudo-inverse H(θ)^+ of the acoustic transfer function vector H(θ). The acoustic transfer function vector H(θ) is the vector [H_1(θ), H_2(θ), ..., H_M(θ)]^T whose elements are the transfer functions from the sound source to the microphone of each channel.
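
A hedged Python sketch of this beamforming step follows. For a column vector H(θ), the Moore-Penrose pseudo-inverse is H(θ)^H / (H(θ)^H H(θ)). Taking the magnitude of H(θ)^+ X as the spatial spectrum is an assumption of this sketch, since Equation (4) itself is not reproduced in the text. The linear-array `steering` function exists only to generate demonstration data and is not part of the described device.

```python
import cmath
import math


def pseudo_inverse_row(h):
    """Moore-Penrose pseudo-inverse of a column vector: h^H / (h^H h)."""
    energy = sum(abs(v) ** 2 for v in h)
    return [v.conjugate() / energy for v in h]


def spatial_spectrum(h_set, x):
    """S_sp(theta) = |H(theta)^+ X| for every direction (assumed form)."""
    return {
        theta: abs(sum(p * xm for p, xm in zip(pseudo_inverse_row(h), x)))
        for theta, h in h_set.items()
    }


def steering(theta_deg, num_mics=4, spacing_wavelengths=0.5):
    """Hypothetical linear-array transfer function used only for this demo."""
    phase = 2 * math.pi * spacing_wavelengths * math.cos(math.radians(theta_deg))
    return [cmath.exp(1j * phase * m) for m in range(num_mics)]


h_set = {theta: steering(theta) for theta in range(0, 181, 30)}
x = [2.0 * v for v in steering(60)]          # source at 60 degrees, gain 2
spectrum = spatial_spectrum(h_set, x)
phi = max(spectrum, key=spectrum.get)        # direction of the largest value
```

When X is a scaled copy of H(θ) for some direction, the product H(θ)^+ X returns that scale factor, and by the Cauchy-Schwarz inequality no non-parallel transfer function vector of the same norm can give a larger magnitude. This is why the largest value of the spectrum indicates the source direction in this idealized example.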

The sound source direction estimation unit 124 may use a method other than beamforming for sound source localization, for example the MUSIC (Multiple Signal Classification) method or the delay-and-sum method.

(Acoustic transfer function adaptation processing)
Next, the acoustic transfer function adaptation processing according to this embodiment will be described. FIG. 2 is a data flowchart showing an example of this processing. The following description takes as an example the case in which the outlier removal unit 128 functions as a mode filter.
(Step S102) The frequency analysis unit 122 receives the M-channel acoustic signal x from the sound collection unit 20.
(Step S104) For each channel and each frame, the frequency analysis unit 122 performs frequency analysis on the acoustic signal and converts it into an input vector X (input signal information) indicating the transform coefficients in the frequency domain. The set of per-frame input vectors [X_1, X_2, ..., X_K] in the observation period forms the input signal group Z.

(Step S106) The sound source direction estimation unit 124 refers to the acoustic transfer function set and, for each frame and each frequency, uses the transform coefficients indicated by the input vector X to calculate the sound source direction at which the spatial spectrum S_sp(θ) is largest as the estimated sound source direction φ. The set of per-frame estimated sound source directions [φ_1, φ_2, ..., φ_K] in the observation period forms the localization direction group Φ.
(Step S108) For each observation period, the representative estimated sound source direction determination unit 126 counts the frequency distribution indicating the frequency (number of frames) of each estimated sound source direction and determines the most frequent estimated sound source direction as the representative estimated sound source direction θ'.

(Step S110) Among the per-frame input vectors X, the outlier removal unit 128 removes as outliers the input vectors X'' of frames that give an estimated sound source direction φ different from the representative estimated sound source direction θ'. The set of input vectors that remain, [X_1', X_2', ..., X_L'], forms the outlier-removed input signal group Z'.

(Step S112) The representative acoustic signal determination unit 130 generates a representative input vector <X> that indicates, as the representative transform coefficients, the representative value of the transform coefficients of each channel indicated by the input vectors X' of the frames that remain in the observation period after outlier removal.
(Step S114) For each frequency, the acoustic transfer function estimation unit 132 normalizes across channels the representative transform coefficient of each channel indicated by the representative input signal information, and estimates the acoustic transfer function from the sound source to the microphone corresponding to that channel as the second acoustic transfer function H'.
(Step S116) For each frequency, the acoustic transfer function update unit 134 replaces the first acoustic transfer function H_E for the representative estimated sound source direction θ' with the weighted average (1-β)H_E + βH' of the first acoustic transfer function H_E and the second acoustic transfer function H', which becomes the new first acoustic transfer function H_E.

Note that the processing of FIG. 2 may be repeated once per observation period, or at a cycle shorter than the observation period, for example once per frame.

(Evaluation experiment)
Next, an evaluation experiment carried out to assess the effectiveness of the above embodiment will be described. The experiment was conducted in a laboratory with an acoustic environment similar to that of a typical office. The laboratory was approximately a rectangular parallelepiped measuring 7.0 m in width (x direction), 4.0 m in length (y direction), and 3.0 m in height (z direction) (see FIG. 5). Tables were placed in the center and along the periphery of the laboratory, with multiple chairs around them. A circular microphone array (see FIG. 3) was installed as the sound collection unit 20 on the central table, whose height above the floor was 0.9 m. A notebook personal computer and other items were placed on the peripheral tables.

評価実験に先立ち、音源信号を取得した。音源として日本語話し言葉コーパス（ＣＳＪ：Corpus of Spontaneous Japanese）から選択された男声を用いた。マイクロホンアレイの中央部からの距離が０．７８ｍとなる円周上に沿ってスピーカを時計回りにゆっくり移動させながら、音源信号に従って放音させた。スピーカ中央部の床面からの高さを１．０ｍとした。その状況下で、音源から到来しマイクロホンアレイで収音される音を示す８チャネルの音響信号を２０分間取得した。サンプリング周波数を１６ｋＨｚとした。その間におけるスピーカは円周上を３周した。
また、第１音響伝達関数の初期値Ｈ_Ｔとして、自由音場モデルを仮定して、目標方向に設置されたスピーカとマイクロホンアレイとの位置関係に基づいて算出した音響伝達関数を予め設定した。 Prior to the evaluation experiment, a sound source signal was acquired. A male voice selected from the Corpus of Spontaneous Japanese (CSJ) was used as the sound source. The loudspeaker was moved slowly clockwise along a circle at a distance of 0.78 m from the center of the microphone array, and sound was emitted according to the sound source signal. The height of the center of the loudspeaker from the floor was set to 1.0 m. Under these conditions, eight-channel acoustic signals representing the sound coming from the sound source and picked up by the microphone array were acquired for 20 minutes. The sampling frequency was set to 16 kHz. During this time, the loudspeaker made three revolutions around the circle.
As the initial value H _T of the first acoustic transfer function, an acoustic transfer function calculated based on the positional relationship between the speaker and the microphone array installed in the target direction was set in advance, assuming a free sound field model.

周波数分析部１２２は、周波数分析においてＳＴＦＴを実行した。ＳＴＦＴにおいて、フレーム長、シフト幅をそれぞれ、５１２点、２５６点とした。窓関数として、ハン窓（ハニング窓）を用いた。平均音圧が－２４ｄＢ以上となるフレームを有効フレームとし。有効フレームにおける音響信号を採用し、それ以外のフレームを無音区間として採用せずに、棄却した。観測期間に相当する時間が経過する都度、新たな観測期間を設定した。即ち、観測期間のシフト幅をその観測期間と同等の期間とした。 The frequency analysis unit 122 performed STFT for frequency analysis. In STFT, the frame length and shift width were set to 512 points and 256 points, respectively. The Hann window was used as the window function. Frames with an average sound pressure of -24 dB or higher were considered valid frames. The acoustic signals in valid frames were adopted, and other frames were discarded as silent intervals. Each time the time equivalent to the observation period elapsed, a new observation period was set. In other words, the shift width of the observation period was set to a period equivalent to that observation period.
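The framing and valid-frame selection described above can be sketched as follows for a single channel. The frame length (512), shift (256), Hann window, and the -24 dB threshold come from the text; the function names and the dB reference (mean-square level relative to a full-scale amplitude of 1.0) are assumptions, because the text does not state the reference level.

```python
import numpy as np

FRAME_LEN = 512   # points, from the experiment
SHIFT = 256       # points, from the experiment

def stft_frames(x, frame_len=FRAME_LEN, shift=SHIFT):
    """Split a 1-D signal into frames and return (spectra, raw frames)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // shift
    frames = np.stack([x[i * shift:i * shift + frame_len] for i in range(n_frames)])
    window = np.hanning(frame_len)          # Hann window
    return np.fft.rfft(frames * window, axis=1), frames

def valid_frame_mask(frames, threshold_db=-24.0):
    """True for frames whose average level is at or above the threshold."""
    power = np.mean(frames ** 2, axis=1)
    level_db = 10.0 * np.log10(np.maximum(power, 1e-12))  # avoid log(0)
    return level_db >= threshold_db
```

Frames for which the mask is False would be discarded as silent intervals before sound source localization.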

評価実験は、２項目の検証から構成される。第１の検証では、最頻値フィルタの長さ（観測期間に相当）と音源定位性能との関係を調べた。第１の検証では、予め複数通りの観測期間のそれぞれに対して取得した音響信号を用いて図２に例示される処理を実行して第１音響伝達関数を更新した。観測期間を、６０フレーム（０．９６秒）から６００フレーム（９．６秒）までの６０フレーム間隔の１１通りとした。 The evaluation experiment consisted of two verification items. In the first verification, the relationship between the length of the mode filter (equivalent to the observation period) and sound source localization performance was investigated. In the first verification, the first acoustic transfer function was updated by performing the process illustrated in Figure 2 using acoustic signals acquired in advance for each of several observation periods. The observation period was set to 11 different intervals of 60 frames, ranging from 60 frames (0.96 seconds) to 600 frames (9.6 seconds).

第１音響伝達関数に対する音源方向の方向分解能を５°とした。方向分解能は、個々の第１音響伝達関数に関連付けられた音源方向の間隔に相当する。処理対象とする周波数帯域の最小周波数、最大周波数を、それぞれ３００Ｈｚ、６０００Ｈｚとした。そして、更新により得られた第１音響伝達関数を含む音響伝達関数セットを用いて音源定位を実行し、推定音源方向と目標音源方向を比較した。目標音源方向は、正解となる既知の音源方向に相当する。音源定位の手法として遅延和法を用いた。 The directional resolution of the sound source direction for the first acoustic transfer function was set to 5°. The directional resolution corresponds to the spacing between the sound source directions associated with the individual first acoustic transfer functions. The minimum and maximum frequencies of the frequency band to be processed were set to 300 Hz and 6000 Hz, respectively. Sound source localization was then performed using an acoustic transfer function set including the first acoustic transfer function obtained by updating, and the estimated sound source direction was compared with the target sound source direction. The target sound source direction corresponds to the known, correct sound source direction. The delay-and-sum method was used as the sound source localization method.

音源定位性能の評価指標として、成功率を算出した。成功率は、有効フレーム数に対する成功フレーム数の比率に相当する。成功フレームは、音源定位に成功したフレームに相当する。本検証では、推定音源方向が目標音源方向から所定範囲（例えば、５°）以内であるフレームを成功フレームとして計数した。従って、成功率が高いほど音源定位性能が良好であることを意味する。本検証では、有効フレーム数は４３２２フレームとなった。 The success rate was calculated as an evaluation index for sound source localization performance. The success rate corresponds to the ratio of the number of successful frames to the number of valid frames. A successful frame corresponds to a frame in which sound source localization was successful. In this test, frames in which the estimated sound source direction was within a specified range (e.g., 5°) from the target sound source direction were counted as successful frames. Therefore, a higher success rate indicates better sound source localization performance. In this test, the number of valid frames was 4,322.
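The success rate defined above can be computed as in the following sketch. The function names are hypothetical, and two details are assumptions: the angular error is taken on the circle (so that 355° and 0° differ by 5°), and the tolerance is treated as inclusive.

```python
import numpy as np

def angular_error_deg(estimated, target):
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    diff = (np.asarray(estimated, float) - np.asarray(target, float) + 180.0) % 360.0 - 180.0
    return np.abs(diff)

def success_rate(estimated, target, tolerance_deg=5.0):
    """Ratio of successful frames to valid frames (inputs hold valid frames only)."""
    return float(np.mean(angular_error_deg(estimated, target) <= tolerance_deg))
```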

図６は、観測期間ごとの成功率を例示する図である。図６によれば、観測期間が１２０フレーム（１．９２秒に相当）となるとき成功率が９０．４２％と最高となった。本検証においてスピーカを移動していたことを鑑みても、約１．９２秒間はスピーカが静止していると仮定しても音源定位の性能を低下させずに済むことを示す。また、全体として観測期間が短いほど成功率が高い傾向がある。観測期間が長いほど移動に伴う目標音源方向の変化により外れ値の発生頻度が高くなることと、第１音響伝達関数の更新頻度が低下することが原因として考えられる。他方、観測期間が短いと代表推定音源方向を定める際に用いられる推定音源方向の分布に対する統計的信頼性が低くなる。このことは、むしろ成功率が低下する要因となる。観測期間が１２０フレームとなるとき成功率が最高となる現象は、観測期間による増加と減少との表れとみることができる。 Figure 6 illustrates the success rate for each observation period. According to Figure 6, the success rate was highest, at 90.42%, when the observation period was 120 frames (equivalent to 1.92 seconds). Even though the loudspeaker was moving in this verification, this indicates that sound source localization performance does not degrade if the loudspeaker is assumed to be stationary for approximately 1.92 seconds. Overall, the success rate tends to be higher for shorter observation periods. This is thought to be because, the longer the observation period, the more frequently outliers occur due to changes in the target sound source direction associated with the movement, and the less frequently the first acoustic transfer function is updated. On the other hand, a short observation period reduces the statistical reliability of the distribution of estimated sound source directions used to determine the representative estimated sound source direction, which in turn lowers the success rate. The peak in the success rate at an observation period of 120 frames can be seen as reflecting the balance between these two opposing effects of the observation period.

第２の検証では、信頼度と第１音響伝達関数の種類との関係を調べた。第１音響伝達関数の種類として、次の３種類のそれぞれに対して成功率を定めた。（１）本実施形態：第１音響伝達関数と第２音響伝達関数の加重和を新たな第１音響伝達関数に更新、（２）既存手法（非特許文献１に記載の手法）：フレームごとに第１音響伝達関数を更新、（３）更新なし：第１音響伝達関数を更新しない。但し、観測期間を１２０フレームとした。 In the second verification, the relationship between reliability and the type of first acoustic transfer function was investigated. A success rate was determined for each of the following three types of first acoustic transfer function: (1) This embodiment: the weighted sum of the first acoustic transfer function and the second acoustic transfer function is adopted as the new first acoustic transfer function; (2) Existing method (method described in Non-Patent Document 1): the first acoustic transfer function is updated for each frame; (3) No update: the first acoustic transfer function is not updated. Note that the observation period was set to 120 frames.

図７は、第１音響伝達関数の種類ごとの成功率を例示する図である。図７は、更新無し、既存手法、本実施形態の順に成功率を示す。成功率は、既存手法、更新無し、本実施形態の順に高くなる。成功率は、更新無しでは、８５．５９％、既存手法では、８０．６３％、本実施形態では、９０．４２％となった。この結果は、本実施形態の有効性を裏付ける。既存手法での成功率が、更新無しでの成功率よりも低くなる現象は、フレームごとに一律に第１音響伝達関数を更新するため、外れ値が生じ信頼度が低い推定音源方向が得られるケースを棄却しないことが一因と推察される。 Figure 7 is a diagram illustrating the success rate for each type of first acoustic transfer function. Figure 7 shows the success rate in the order of no update, existing method, and this embodiment. The success rate increases in the order of existing method, no update, and this embodiment. The success rate was 85.59% without updating, 80.63% with the existing method, and 90.42% with this embodiment. This result supports the effectiveness of this embodiment. One likely reason why the success rate with the existing method is lower than the success rate without updating is that the existing method updates the first acoustic transfer function uniformly for every frame, and therefore does not reject cases in which outliers occur and an estimated sound source direction with low reliability is obtained.

（変形例）
次に、本実施形態の変形例について説明する。以下の説明では、上述の実施形態との差異を主とし、特に断らない限り、上述の実施形態と同一の符号を付してその説明を援用する。図８は、本実施形態の第１変形例に係る音響処理システムＳ１の構成例を示す概略ブロック図である。本変形例に係る音響処理システムＳ１は、音源分離に応用される。音響処理システムＳ１において、音響処理装置１０の制御部１２０は、周波数分析部１２２、音源方向推定部１２４、代表推定音源方向決定部１２６、外れ値除去部１２８、代表音響信号決定部１３０、音響伝達関数推定部１３２および音響伝達関数更新部１３４の他、さらに音源分離部１３６と音源信号生成部１３８を備える。 (Modification)
Next, a modification of this embodiment will be described. The following description will mainly focus on the differences from the above-described embodiment, and unless otherwise specified, the same reference numerals as in the above-described embodiment will be used to refer to the description therein. FIG. 8 is a schematic block diagram showing an example configuration of a sound processing system S1 according to a first modification of this embodiment. The sound processing system S1 according to this modification is applied to sound source separation. In the sound processing system S1, the control unit 120 of the sound processing device 10 includes a frequency analysis unit 122, a sound source direction estimation unit 124, a representative estimated sound source direction determination unit 126, an outlier removal unit 128, a representative sound signal determination unit 130, an acoustic transfer function estimation unit 132, and an acoustic transfer function update unit 134, as well as a sound source separation unit 136 and a sound source signal generation unit 138.

周波数分析部１２２は、入力信号情報を音源方向推定部１２４、音響伝達関数推定部１３２の他、音源分離部１３６にも出力する。
音源方向推定部１２４は、推定音源方向情報を代表推定音源方向決定部１２６および外れ値除去部１２８の他、音源分離部１３６にも出力する。
音響伝達関数推定部１３２は、第２音響伝達関数情報を音響伝達関数更新部１３４の他、音源分離部１３６にも出力する。 The frequency analysis unit 122 outputs the input signal information to the sound source direction estimation unit 124 , the acoustic transfer function estimation unit 132 , and also to the sound source separation unit 136 .
The sound source direction estimating unit 124 outputs the estimated sound source direction information to the representative estimated sound source direction determining unit 126 and the outlier removing unit 128 as well as to the sound source separating unit 136 .
The acoustic transfer function estimating unit 132 outputs the second acoustic transfer function information to the acoustic transfer function updating unit 134 as well as to the sound source separating unit 136 .

音源分離部１３６には、周波数分析部１２２から入力信号情報が入力され、音源方向推定部１２４から推定音源方向情報が入力される。音源分離部１３６は、入力信号情報に示されるチャネルごとの変換係数から推定音源方向から到来する音源成分を抽出する。音源分離部１３６は、例えば、記憶部１５０に記憶された音響伝達関数セットを参照し、推定音源方向θ’に係る第１音響伝達関数Ｈ_Ｅから分離行列Ｗ（Ｈ_Ｅ，θ’）を算出する。音源分離部１３６は、式（５）に例示されるように、入力ベクトルＸに分離行列Ｗ（Ｈ_Ｅ，θ’）を乗じて、その推定音源方向θ’に存在する音源から到来する音源成分として推定される出力値を各周波数について示す出力ベクトルＹ（分離音源）を算出することができる。入力ベクトルＸは、入力信号情報に示されるチャネルごとの変換係数を要素として含むベクトルである。推定音源方向が複数個検出される場合には、音源分離部１３６は、音源（推定音源方向）ごとに出力値を定めることができる。音源分離部１３６は、各音源について周波数ごとに定めた出力値を示す出力信号情報を音源信号生成部１３８に出力する。 The sound source separation unit 136 receives input signal information from the frequency analysis unit 122 and estimated sound source direction information from the sound source direction estimation unit 124. The sound source separation unit 136 extracts sound source components arriving from the estimated sound source direction from the conversion coefficients for each channel indicated in the input signal information. The sound source separation unit 136, for example, refers to an acoustic transfer function set stored in the storage unit 150 and calculates a separation matrix W(H _E , θ') from a first acoustic transfer function H _E related to the estimated sound source direction θ'. As exemplified in Equation (5), the sound source separation unit 136 multiplies the input vector X by the separation matrix W(H _E , θ') to calculate an output vector Y (separated sound source) indicating, for each frequency, an output value estimated as a sound source component arriving from a sound source present in the estimated sound source direction θ'. The input vector X is a vector including, as elements, conversion coefficients for each channel indicated in the input signal information. When multiple estimated sound source directions are detected, the sound source separation unit 136 can determine an output value for each sound source (estimated sound source direction). 
The sound source separation unit 136 outputs output signal information indicating the output value determined for each frequency for each sound source to the sound source signal generation unit 138.

音源信号生成部１３８は、各音源について音源分離部１３６から入力される出力信号情報に示される周波数ごとの出力値を時間領域におけるサンプル時刻ごとの振幅の時系列に変換する。音源信号生成部１３８は、周波数領域における周波数ごとの出力値を振幅の時系列に変換する際、周波数分析との逆処理、例えば、逆離散フーリエ変換を用いることができる。音源信号生成部１３８は、各音源についてフレームごとに得られた振幅の時系列をフレーム間で連結して音源信号を生成することができる。音源信号生成部１３８は、生成した音源信号を出力先機器に入出力部１１０を経由して出力してもよいし、記憶部１５０に記憶してもよい。 The sound source signal generation unit 138 converts the output values for each frequency indicated in the output signal information input from the sound source separation unit 136 for each sound source into a time series of amplitude for each sample time in the time domain. When converting the output values for each frequency in the frequency domain into a time series of amplitude, the sound source signal generation unit 138 can use the inverse process of frequency analysis, for example, an inverse discrete Fourier transform. The sound source signal generation unit 138 can generate a sound source signal by linking the time series of amplitude obtained for each frame for each sound source between frames. The sound source signal generation unit 138 may output the generated sound source signal to an output destination device via the input/output unit 110, or may store it in the storage unit 150.
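The conversion back to the time domain can be sketched as an inverse real FFT followed by overlap-add. The function name is hypothetical. The sketch assumes 50% overlap and an analysis window whose shifted copies sum to one (a periodic Hann window does), so that no synthesis window is needed; the text itself only states that frames are converted by the inverse transform and concatenated.

```python
import numpy as np

def overlap_add(spec, frame_len=512, shift=256):
    """Reconstruct a time signal from per-frame spectra of one separated source.

    spec: complex array of shape (n_frames, frame_len // 2 + 1).
    """
    frames = np.fft.irfft(spec, n=frame_len, axis=1)       # inverse DFT per frame
    out = np.zeros(shift * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * shift:i * shift + frame_len] += frame      # link frames by overlap-add
    return out
```

The first and last half-frames are covered by only one window and are therefore attenuated; only the interior of the signal is reconstructed exactly under the stated assumption.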

なお、音源方向推定部１２４は、空間スペクトルＳ_ｓｐ（θ）が極大となり、所定の空間スペクトルの閾値よりも大きくなる方向を複数個検出することがある。その場合には、音源方向推定部１２４は、複数個の音源方向をそれぞれ推定音源方向として示す推定音源方向情報を音源分離部１３６に出力してもよい。このような場合には、有意な音源が複数個存在すると推定されるためである。音源方向推定部１２４における音源定位と音源分離部１３６における音源分離は、音響伝達関数更新部１３４における第１音響伝達関数の更新と同時に実行されてもよいが、必ずしも同期しなくてもよい。即ち、検出される音源が２個以上となる場合に、推定音源方向情報を代表推定音源方向決定部１２６と外れ値除去部１２８に出力せず、代表推定音源方向決定部１２６から代表推定音源方向情報が音響伝達関数更新部１３４に入力されない場合でも音源分離部１３６における音源分離の実行は許容される。 The sound source direction estimation unit 124 may detect a plurality of directions in which the spatial spectrum S _sp (θ) has a local maximum and exceeds a predetermined spatial spectrum threshold. In such a case, the sound source direction estimation unit 124 may output estimated sound source direction information indicating each of the plurality of sound source directions as an estimated sound source direction to the sound source separation unit 136. This is because in such a case, it is estimated that a plurality of significant sound sources exist. The sound source localization in the sound source direction estimation unit 124 and the sound source separation in the sound source separation unit 136 may be performed simultaneously with the update of the first acoustic transfer function in the acoustic transfer function update unit 134, but they do not necessarily have to be synchronized. In other words, even when two or more sound sources are detected, so that the estimated sound source direction information is not output to the representative estimated sound source direction determination unit 126 and the outlier removal unit 128 and no representative estimated sound source direction information is input from the representative estimated sound source direction determination unit 126 to the acoustic transfer function update unit 134, the sound source separation unit 136 is still permitted to perform sound source separation.
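Detecting several directions at which the spatial spectrum has a local maximum above a threshold can be sketched as follows. The function name is hypothetical, and the sketch assumes that the candidate directions cover the full circle at equal spacing, so that the first and last directions are treated as neighbors.

```python
import numpy as np

def detect_source_directions(spectrum, directions_deg, threshold):
    """Return directions where the spatial spectrum peaks above the threshold."""
    s = np.asarray(spectrum, dtype=float)
    prev_s = np.roll(s, 1)     # circular left neighbor
    next_s = np.roll(s, -1)    # circular right neighbor
    is_peak = (s > prev_s) & (s >= next_s) & (s > threshold)
    return np.asarray(directions_deg)[is_peak].tolist()
```

If the returned list holds two or more directions, each can be passed to sound source separation as its own estimated sound source direction.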

なお、音源分離部１３６は、音源分離の手法として、例えば、上記のビームフォーミング法を応用することができる。その場合、音源分離部１３６は、ビームフォーミング法を用いて推定された推定音源方向θ’に係る音響伝達関数ベクトルＨ（θ’）の疑似逆行列Ｈ^＋（θ’）を分離行列として採用すればよい。その他の音源分離の手法として、例えば、ＧＨＤＳＳ（Geometric-constrained High-order Decorrelation-based Source Separation, 幾何制約高次相関除去音源分離）法を用いることができる。ＧＨＤＳＳ法は、コスト関数Ｊ（Ｗ）が最小化するように分離行列Ｗを適応的に算出する過程を含む。コスト関数Ｊ（Ｗ）は、分離尖鋭度（Separation Sharpness）Ｊ_ＳＳ（Ｗ）と幾何制約度（Geometric Constraint）Ｊ_ＧＣ（Ｗ）との重み付き和となる。分離尖鋭度Ｊ_ＳＳ（Ｗ）は、ある音源の音源成分Ｙに他の音源の成分が混入する度合いを示す指標値である。幾何制約度Ｊ_ＧＣ（Ｗ）は、出力となる音源信号と音源から発されたもとの音源信号との誤差の度合いを表す指標値である。 The sound source separation unit 136 can apply, for example, the beamforming method described above as a sound source separation technique. In this case, the sound source separation unit 136 may employ, as a separation matrix, the pseudo-inverse matrix H⁺(θ′) of the acoustic transfer function vector H(θ′) associated with the estimated sound source direction θ′ estimated using the beamforming method. Other sound source separation techniques that can be used include, for example, the GHDSS (Geometric-constrained High-order Decorrelation-based Source Separation) method. The GHDSS method includes a process of adaptively calculating a separation matrix W so as to minimize a cost function J(W). The cost function J(W) is a weighted sum of the separation sharpness J _SS (W) and the geometric constraint J _GC (W). The separation sharpness J _SS (W) is an index value indicating the degree to which components of other sound sources are mixed into the sound source component Y of a given sound source. The geometric constraint J _GC (W) is an index value indicating the degree of error between the output sound source signal and the original sound source signal emitted from the sound source.
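Separation with a pseudo-inverse separation matrix can be sketched as below. The function name and array layout are assumptions. In addition, the sketch stacks the transfer function vectors of all detected directions into one matrix per frequency, which is one possible way of handling multiple sources and is not stated in the text; with a single source it reduces to the pseudo-inverse of one vector H(θ').

```python
import numpy as np

def separate(x, h):
    """Apply Y = W X with W = pinv(H) at every frequency.

    x: complex array (freqs, channels), transform coefficients of one frame.
    h: complex array (freqs, channels, sources), one transfer function
       vector per estimated sound source direction.
    Returns a complex array (freqs, sources).
    """
    w = np.linalg.pinv(h)                     # (freqs, sources, channels)
    return np.einsum('fsc,fc->fs', w, x)      # matrix-vector product per frequency
```

When the columns of H are linearly independent and the observation is noise-free, this recovers the source components exactly; GHDSS would instead adapt W iteratively.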

次に、本実施形態の第２変形例について説明する。以下の説明では、上述の実施形態ならびに変形例との差異を主とし、特に断らない限り、上述の実施形態と同一の符号を付してその説明を援用する。本変形例に係る音響処理システムＳ１は、ロボットシステム（図示せず）の一部をなす。図９は、本変形例に係る音響処理システムＳ１の構成例を示す概略ブロック図である。音響処理システムＳ１をなす音響処理装置１０と収音部２０の一方または両方は、ロボットの筐体に内蔵されてもよい。 Next, a second modified example of this embodiment will be described. The following description will mainly focus on the differences from the above-described embodiment and modified example, and unless otherwise specified, the same reference numerals as in the above-described embodiment will be used and the description thereof will be incorporated. The sound processing system S1 according to this modified example forms part of a robot system (not shown). Figure 9 is a schematic block diagram showing an example configuration of the sound processing system S1 according to this modified example. One or both of the sound processing device 10 and the sound collection unit 20 that form the sound processing system S1 may be built into the robot's housing.

音響処理装置１０において、制御部１２０は、周波数分析部１２２、音源方向推定部１２４、代表推定音源方向決定部１２６、外れ値除去部１２８、代表音響信号決定部１３０、音響伝達関数推定部１３２、音響伝達関数更新部１３４、音源分離部１３６および音源信号生成部１３８の他、動作制御部１４０を備える。即ち、音響処理システムＳ１において、音源方向推定部１２４、音源分離部１３６および音源信号生成部１３８は、ロボット聴覚（robot audition）を実現するロボット聴覚機能ブロックとして機能してもよい。 In the sound processing device 10, the control unit 120 includes a frequency analysis unit 122, a sound source direction estimation unit 124, a representative estimated sound source direction determination unit 126, an outlier removal unit 128, a representative sound signal determination unit 130, an acoustic transfer function estimation unit 132, an acoustic transfer function update unit 134, a sound source separation unit 136, and a sound source signal generation unit 138, as well as an operation control unit 140. That is, in the sound processing system S1, the sound source direction estimation unit 124, the sound source separation unit 136, and the sound source signal generation unit 138 may function as a robot audition functional block that realizes robot audition.

制御部１２０は、さらに音声認識処理部（図示せず）を備えてもよい。音声認識処理部は、個々の音源に係る音源成分に対して、公知の音声認識処理を実行して音源の種類を特定してもよい（音源同定）。音声認識処理部は、音源の種類として、人物である発話者が特定されてもよい。音源方向推定部１２４は、特定した種類の音源について、推定音源方向を示す推定音源方向情報を他の装置に通知してもよいし、特定した種類の音源について出力信号情報から変換された音源信号を他の装置に出力してもよい。
音源方向推定部１２４は、上記のように音源位置を推定可能とし、推定音源位置を示す推定音源方向情報を代表推定音源方向決定部１２６、外れ値除去部１２８および音源分離部１３６の他、動作制御部１４０に出力する。 The control unit 120 may further include a speech recognition processing unit (not shown). The speech recognition processing unit may identify the type of sound source by performing known speech recognition processing on sound source components related to each sound source (sound source identification). The speech recognition processing unit may identify a human speaker as the type of sound source. The sound source direction estimation unit 124 may notify another device of estimated sound source direction information indicating the estimated sound source direction for the identified type of sound source, or may output a sound source signal converted from output signal information for the identified type of sound source to another device.
The sound source direction estimation unit 124 is capable of estimating the sound source position as described above, and outputs estimated sound source direction information indicating the estimated sound source position to the representative estimated sound source direction determination unit 126, the outlier removal unit 128, the sound source separation unit 136, and also to the operation control unit 140.

動作制御部１４０には、音源方向推定部１２４から推定音源方向情報が入力され、音源分離部１３６から音源成分を示す出力信号情報が入力される。動作制御部１４０は、推定音源位置と音源成分の一方または両方を用いて動作機構４０の動作を制御する。動作制御部１４０は、例えば、推定音源位置と音源成分に基づいて、自己位置推定と環境地図作成を実行してもよい（ＳＬＡＭ：Simultaneous Localization and Mapping、同時定位地図作成）。動作制御部１４０は、音源同定を実行することで推定音源位置における音源となる物体（人物を含む）の存在を推定することができる。動作制御部１４０は、推定音源位置に近いほど高くなるように所定の密度関数モデルを用いて音源となる物体の存在確率を定めてもよい。動作制御部１４０は、例えば、物体ごとに存在する存在確率の空間分布を物体間で重畳して環境地図を作成することができる。動作制御部１４０は、経路計画において、物体の存在確率が所定の存在確率よりも高い領域を通過しないように進行経路を定めてもよい。進行経路は、時刻ごとの目標位置により表される。動作制御部１４０は、所定の種類の音源の推定方向をロボットの正面に相対する目標方向と定めてもよい。動作制御部１４０は、その時点における目標位置と目標方向の一方または両方を示す制御信号を動作機構４０に出力する。 The operation control unit 140 receives estimated sound source direction information from the sound source direction estimation unit 124 and output signal information indicating sound source components from the sound source separation unit 136. The operation control unit 140 controls the operation of the operating mechanism 40 using one or both of the estimated sound source position and sound source components. The operation control unit 140 may, for example, perform self-location estimation and environmental map creation based on the estimated sound source position and sound source components (SLAM: Simultaneous Localization and Mapping). The operation control unit 140 can estimate the presence of an object (including a person) that is the sound source at the estimated sound source position by performing sound source identification. The operation control unit 140 may determine the existence probability of an object that is the sound source using a predetermined density function model so that the probability increases the closer to the estimated sound source position. The operation control unit 140 can, for example, create an environmental map by superimposing the spatial distribution of the existence probability of each object between objects. 
In path planning, the operation control unit 140 may determine a travel path that does not pass through areas where the existence probability of an object is higher than a predetermined existence probability. The travel path is represented by a target position for each time. The operation control unit 140 may set the estimated direction of a predetermined type of sound source as the target direction that the front of the robot should face. The operation control unit 140 outputs a control signal indicating one or both of the target position and target direction at that time to the operation mechanism 40.

動作機構４０は、ロボットの筐体に内蔵され、動作制御部１４０から入力される制御信号に基づいてロボットの動作を制御する。動作機構４０は、動力源となるモータ（図示せず）と自部の位置と方向を検出するエンコーダ（図示せず）を備える。モータは、制御信号で指示される目標位置または目標方向に近づくようにロボットを移動させる。エンコーダは、その時点において検出した位置と方向を動作状態として示す動作情報を逐次に動作制御部１４０に出力する。 The operation mechanism 40 is built into the robot's housing and controls the robot's operation based on control signals input from the operation control unit 140. The operation mechanism 40 includes a motor (not shown) that serves as a power source and an encoder (not shown) that detects its own position and direction. The motor moves the robot so that it approaches the target position or target direction specified by the control signal. The encoder sequentially outputs, to the operation control unit 140, operation information indicating the position and direction detected at that time as the operation state.

以上に説明したように、本実施形態に係る音響処理装置１０は、第１音響伝達関数を音源方向ごとに記憶する記憶部１５０と、各フレームについて、チャネルごとの音響信号の周波数領域における変換係数と第１音響伝達関数に基づいて音源方向ごとに空間スペクトルを算出し、空間スペクトルが最大となる音源方向を推定音源方向として推定する音源方向推定部１２４と、複数フレームからなる観測期間における推定音源方向の頻度分布に基づいて推定音源方向の代表値である代表推定音源方向を定める代表推定音源方向決定部１２６と、推定音源方向が代表推定音源方向から予め定めた許容範囲の範囲外となるフレームの変換係数を除去する外れ値除去部１２８と、残されたフレームの音響信号の変換係数に基づいて、観測期間における音源から音響信号の収音部までの音響伝達関数の代表値を第２音響伝達関数として推定する音響伝達関数推定部１３２と、代表推定音源方向に対する第１音響伝達関数を、第２音響伝達関数を用いて更新する音響伝達関数更新部１３４と、を備える。
この構成によれば、代表推定音源方向から所定の範囲内の推定音源方向を与える音響信号の変換係数に基づいて第２音響伝達関数が算出され、算出された第２音響伝達関数を用いて代表推定音源方向と対応付けて第１音響伝達関数を更新することができる。統計的に代表推定音源方向、または、これに近似する推定音源方向を与える音響信号に基づいて得られる音響伝達関数の代表値が第２音響伝達関数として代表推定音源方向と対応付けて更新されるので、音源方向との対応関係が安定した第１音響伝達関数が得られる。かかる第１音響伝達関数を用いることで、オンラインで任意の音響信号を用いて、音源定位、音源分離、その他のマイクロホンアレイ処理の信頼性を向上することができる。 As described above, the sound processing device 10 according to this embodiment includes: a storage unit 150 that stores a first acoustic transfer function for each sound source direction; a sound source direction estimation unit 124 that calculates a spatial spectrum for each sound source direction for each frame based on the transform coefficients in the frequency domain of the acoustic signal for each channel and the first acoustic transfer function, and estimates the sound source direction for which the spatial spectrum is maximum as the estimated sound source direction; a representative estimated sound source direction determination unit 126 that determines a representative estimated sound source direction that is a representative value of the estimated sound source directions based on the frequency distribution of the estimated sound source directions in an observation period consisting of a plurality of frames; an outlier removal unit 128 that removes transform coefficients of frames whose estimated sound source directions fall outside a predetermined tolerance range from the representative estimated sound source direction; an acoustic transfer function estimation unit 132 that estimates, as the second acoustic transfer function, a representative value of the acoustic transfer function from the sound source to the sound collection unit for the acoustic signal in the observation period based on the transform coefficients of the acoustic signal in the remaining frames; and an acoustic transfer function update unit 134 that updates the first acoustic transfer function for the representative estimated sound source direction using the second acoustic transfer function.
With this configuration, the second acoustic transfer function is calculated based on the conversion coefficient of the acoustic signal that gives an estimated sound source direction within a predetermined range from the representative estimated sound source direction, and the first acoustic transfer function can be updated using the calculated second acoustic transfer function in association with the representative estimated sound source direction. A representative value of the acoustic transfer function obtained based on the acoustic signal that statistically gives the representative estimated sound source direction or an estimated sound source direction that is approximate thereto is updated as the second acoustic transfer function in association with the representative estimated sound source direction, thereby obtaining a first acoustic transfer function with a stable correspondence with the sound source direction. By using such a first acoustic transfer function, it is possible to improve the reliability of sound source localization, sound source separation, and other microphone array processing using any acoustic signal online.

また、音源方向推定部１２４は、頻度が極大（例えば、最大）となる推定音源方向φを代表推定音源方向θ’として定めてもよい。
この構成によれば、観測期間内での頻度が極大となる推定音源方向が代表推定音源方向として定まるため、可能性が最も高い推定音源方向が代表推定音源方向として簡素に定まる。 Furthermore, the sound source direction estimating unit 124 may determine the estimated sound source direction φ having a local maximum (for example, the largest) frequency as the representative estimated sound source direction θ′.
According to this configuration, the estimated sound source direction with the maximum frequency within the observation period is determined as the representative estimated sound source direction, so that the most likely estimated sound source direction is simply determined as the representative estimated sound source direction.
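Choosing the most frequent estimated direction within an observation period can be sketched as follows. The function name is hypothetical, directions are assumed to be quantized to the grid of the acoustic transfer function set (for example, multiples of 5°), and ties are resolved in favor of the direction that appears first, which is a choice made here and not specified in the text.

```python
from collections import Counter

def representative_direction(estimated_dirs):
    """Return (mode of the estimated directions, its frame count)."""
    counts = Counter(estimated_dirs)
    direction, count = counts.most_common(1)[0]
    return direction, count
```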

また、音響伝達関数更新部１３４は、第１音響伝達関数と第２音響伝達関数の加重平均値（例えば、βＨ’＋（１－β）Ｈ_Ｅ）を新たな第１音響伝達関数Ｈ_Ｅに更新してもよい。
この構成によれば、観測期間の変更に伴い、第１音響伝達関数は更新により第２音響伝達関数に完全に置き換わらず、その一部の成分が残される。第１音響伝達関数の急激な変動が回避されるため、システムの安定性が図られる。 Furthermore, the acoustic transfer function update unit 134 may adopt a weighted average of the first acoustic transfer function and the second acoustic transfer function (for example, βH′+(1−β)H _E ) as the new first acoustic transfer function H _E.
With this configuration, when the observation period is changed, the first acoustic transfer function is not completely replaced by the second acoustic transfer function through updating, but some of its components remain. This prevents abrupt fluctuations in the first acoustic transfer function, thereby improving system stability.

また、音響伝達関数更新部１３４は、観測期間における推定音源方向が許容範囲の範囲内となる頻度（例えば、フレーム数Ｌ）に基づいて代表推定音源方向の信頼度（例えば、Ｌ／Ｋ、Ｋは観測期間内のフレーム数）を定め、信頼度が高いほど第１音響伝達関数に対する第２音響伝達関数の比率βを高くしてもよい（例えば、Ｌ／２Ｋ）。
この構成によれば、信頼度が高い推定音源方向を与える音響信号ほど重視して第２音響伝達関数を用いて第１音響伝達関数を更新することができる。そのため、更新される第１音響伝達関数の信頼性を向上させることができる。 In addition, the acoustic transfer function update unit 134 may determine the reliability (e.g., L/K, where K is the number of frames in the observation period) of the representative estimated sound source direction based on the frequency (e.g., the number of frames L) at which the estimated sound source direction falls within an acceptable range during the observation period, and may increase the ratio β of the second acoustic transfer function to the first acoustic transfer function as the reliability increases (e.g., L/2K).
According to this configuration, the first acoustic transfer function can be updated using the second acoustic transfer function, with emphasis being placed on acoustic signals that provide more reliable estimated sound source directions, thereby improving the reliability of the updated first acoustic transfer function.
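The reliability-dependent ratio can be sketched as below, using the example values from the text (reliability L/K and β = L/2K). The function name is hypothetical.

```python
def update_ratio(n_inlier_frames, n_frames):
    """Return beta = L / (2K), where L / K is the reliability."""
    if n_frames <= 0:
        raise ValueError("the observation period must contain at least one frame")
    reliability = n_inlier_frames / n_frames   # L / K, between 0 and 1
    return reliability / 2.0                   # beta never exceeds 0.5
```

For example, if 90 of 120 frames fall within the allowable range, the reliability is 0.75 and β is 0.375, so the stored first acoustic transfer function still contributes more than half of the updated value.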

また、推定音源方向の許容範囲は代表推定音源方向と等しく、前記代表推定音源方向とは異なる方向を含まなくてもよい。
この構成によれば、推定音源方向が代表推定音源方向と等しいか否かにより、推定音源方向を与える音響信号の変換係数を簡素に排除するか否かを定めることができる。 Furthermore, the allowable range of the estimated sound source direction may consist only of the representative estimated sound source direction and need not include any direction different from the representative estimated sound source direction.
According to this configuration, it is possible to simply determine whether or not to exclude the transform coefficient of the acoustic signal that gives the estimated sound source direction, depending on whether or not the estimated sound source direction is equal to the representative estimated sound source direction.
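Outlier removal with such an allowable range can be sketched as follows. The function name and array layout are assumptions. A tolerance of 0° keeps only frames whose estimated direction equals the representative direction, and a larger tolerance gives the more general case.

```python
import numpy as np

def remove_outliers(coeffs, estimated_dirs, representative_dir, tolerance_deg=0.0):
    """Keep frames whose estimated direction lies within the allowable range.

    coeffs: array with frames along the first axis.
    Returns (kept coefficients, boolean mask of kept frames).
    """
    est = np.asarray(estimated_dirs, dtype=float)
    diff = np.abs((est - representative_dir + 180.0) % 360.0 - 180.0)  # circular distance
    keep = diff <= tolerance_deg
    return np.asarray(coeffs)[keep], keep
```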

また、音源方向推定部１２４は、チャネルごとの変換係数を含む入力ベクトルＸに、チャネルごとの第１音響伝達関数を含む音響伝達関数ベクトルの疑似逆行列Ｈ^＋を乗算して空間スペクトルを算出してもよい。
この構成によれば、簡素な行列演算により算出される空間スペクトルに基づいて音源方向を推定することができる。そのため、多くの演算資源を要しないため、経済的な実現を図ることができる。 Alternatively, the sound source direction estimator 124 may calculate the spatial spectrum by multiplying an input vector X including a conversion coefficient for each channel by a pseudo-inverse matrix H ⁺ of an acoustic transfer function vector including a first acoustic transfer function for each channel.
This configuration allows the sound source direction to be estimated based on the spatial spectrum calculated by a simple matrix operation, which does not require many computational resources and can be implemented economically.
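The pseudo-inverse based spatial spectrum can be sketched as below. The function names and array layout are hypothetical. The sketch assumes that the spatial spectrum is the power of the product H⁺X summed over frequencies, and that the transfer function vectors of all directions have comparable norms; the exact definition used in the embodiment may differ.

```python
import numpy as np

def spatial_spectrum(x, h):
    """Spatial spectrum per direction.

    x: complex array (freqs, channels), input vector X of one frame.
    h: complex array (directions, freqs, channels), first acoustic
       transfer functions for every candidate direction.
    """
    # The pseudo-inverse of a column vector h is h^H / (h^H h).
    num = np.einsum('dfc,fc->df', np.conj(h), x)
    den = np.einsum('dfc,dfc->df', np.conj(h), h).real
    return np.sum(np.abs(num / den) ** 2, axis=1)

def estimate_direction(x, h, directions_deg):
    """Direction at which the spatial spectrum is largest."""
    return directions_deg[int(np.argmax(spatial_spectrum(x, h)))]
```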

以上、図面を参照してこの発明の一実施形態について詳しく説明してきたが、具体的な構成は上述のものに限られることはなく、この発明の要旨を逸脱しない範囲内において様々な設計変更等をすることが可能である。 One embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to that described above, and various design modifications can be made without departing from the spirit of the present invention.

S1...sound processing system, 10...sound processing device, 20...sound collection unit, 40...operation mechanism, 110...input/output unit, 120...control unit, 122...frequency analysis unit, 124...sound source direction estimation unit, 126...representative estimated sound source direction determination unit, 128...outlier removal unit, 130...representative acoustic signal determination unit, 132...acoustic transfer function estimation unit, 134...acoustic transfer function update unit, 136...sound source separation unit, 138...sound source signal generation unit, 140...operation control unit, 150...storage unit

Claims

1. A sound processing device comprising:
a storage unit that stores a first acoustic transfer function for each sound source direction;
a sound source direction estimation unit that calculates, for each frame, a spatial spectrum for each sound source direction based on a frequency-domain transform coefficient of the acoustic signal of each channel and the first acoustic transfer function, and estimates the sound source direction in which the spatial spectrum is maximized as an estimated sound source direction;
a representative estimated sound source direction determination unit that determines a representative estimated sound source direction, which is a representative value of the estimated sound source directions, based on a frequency distribution of the estimated sound source directions in an observation period consisting of a plurality of frames;
an outlier removal unit that removes the transform coefficients of frames whose estimated sound source directions are outside a predetermined tolerance range from the representative estimated sound source direction;
an acoustic transfer function estimation unit that estimates, as a second acoustic transfer function, a representative value of the acoustic transfer function from a sound source to a sound collection unit for the acoustic signal during the observation period, based on the transform coefficients of the acoustic signal of the remaining frames; and
an acoustic transfer function update unit that updates the first acoustic transfer function for the representative estimated sound source direction using the second acoustic transfer function.

2. The sound processing device according to claim 1, wherein the representative estimated sound source direction determination unit determines the estimated sound source direction with the highest frequency of occurrence as the representative estimated sound source direction.

3. The sound processing device according to claim 1, wherein the acoustic transfer function update unit updates the first acoustic transfer function to a new first acoustic transfer function given by a weighted average of the first acoustic transfer function and the second acoustic transfer function.

4. The sound processing device according to claim 3, wherein the acoustic transfer function update unit determines a reliability of the representative estimated sound source direction based on the frequency with which the estimated sound source direction falls within the tolerance range during the observation period, and increases the ratio of the second acoustic transfer function to the first acoustic transfer function as the reliability increases.

5. The sound processing device according to claim 1, wherein the tolerance range is equal to the representative estimated sound source direction and does not include directions different from the representative estimated sound source direction.

6. The sound processing device according to claim 1, wherein the sound source direction estimation unit calculates the spatial spectrum by multiplying an input vector including the transform coefficient for each channel by a pseudo-inverse matrix of an acoustic transfer function vector including the first acoustic transfer function for each channel.

7. A program for causing a computer to function as the sound processing device according to claim 1.

8. A sound processing method in a sound processing device including a storage unit that stores a first acoustic transfer function for each sound source direction, the method comprising the following steps performed by the sound processing device:
a sound source direction estimating step of calculating, for each frame, a spatial spectrum for each sound source direction based on a frequency-domain transform coefficient of the acoustic signal of each channel and the first acoustic transfer function, and estimating the sound source direction in which the spatial spectrum is maximized as an estimated sound source direction;
a representative estimated sound source direction determining step of determining a representative estimated sound source direction, which is a representative value of the estimated sound source directions, based on a frequency distribution of the estimated sound source directions in an observation period consisting of a plurality of frames;
an outlier removal step of removing the transform coefficients of frames in which the estimated sound source direction is outside a predetermined tolerance range from the representative estimated sound source direction;
an acoustic transfer function estimating step of estimating, as a second acoustic transfer function, a representative value of the acoustic transfer function from a sound source to a sound collection unit for the acoustic signal during the observation period, based on the transform coefficients of the acoustic signal of the remaining frames; and
an acoustic transfer function updating step of updating the first acoustic transfer function for the representative estimated sound source direction using the second acoustic transfer function.