JP5034607B2 - Acoustic echo canceller system

Info

Publication number: JP5034607B2
Application number: JP2007090206A
Authority: JP
Inventors: 真人戸上; 伸治坂野; 俊幸松田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-11-02
Filing date: 2007-03-30
Publication date: 2012-09-26
Anticipated expiration: 2027-03-30
Also published as: JP2008141718A

Description

The present invention relates to acoustic echo cancellation for telephone conference and video conference systems that have a loudspeaker and a microphone.

Telephone conference and video conference systems place a loudspeaker and a microphone at each site and connect the sites over a network so that people in separate locations can talk by voice. In such systems, sound output from the loudspeaker leaks into the microphone. Acoustic echo cancellers have therefore long been used to remove the loudspeaker output (the acoustic echo) picked up by the microphone. If the acoustic environment of the conference room never changed, it would suffice to learn the room's sound propagation (the impulse response) once at the start and use that impulse response to remove the acoustic echo completely. However, when a participant changes seats, for example, the acoustic path of the echo changes, the learned impulse response no longer matches the actual one, and the echo can no longer be removed completely. In the worst case, the residual echo circulates through the loop and grows steadily louder, causing howling, which makes conversation impossible.

Methods have therefore been proposed that learn the impulse response sequentially and track changes in the acoustic path, with the aim of cancelling the acoustic echo completely at all times (for example, Non-Patent Document 1).

A method that cancels acoustic echo using a microphone array has also been proposed (for example, Patent Document 1). In the prior art, because echo canceller performance is insufficient, when the near-end and far-end talkers speak at the same time, the voice of the quieter talker is shut out entirely, forcing one-way (half-duplex) communication to prevent howling. One-way communication, however, makes conversation awkward.

Patent Document 1: JP-A-2005-136701
Non-Patent Document 1: Peter Heitkamper, "An Adaptation Control for Acoustic Echo Cancellers," IEEE Signal Processing Letters, Vol. 4, No. 6, June 1997.
Non-Patent Document 2: R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. Antennas and Propagation, vol. 34, no. 3, pp. 276-280, 1986.
Non-Patent Document 3: Masato Togami, Akio Amano, Hiroshi Shinjo, Ryota Kamoshida, Junichi Tamamoto, Kou Egawa, "Hearing Function of Human Symbiotic Robot 'EMIEW'," 22nd AI Challenge Study Group, pp. 59-64, 2005.

In the conventional approach that learns the impulse response sequentially to track changes in the acoustic path, learning works while sound comes only from the loudspeaker. When a talker in the conference room speaks while the loudspeaker is emitting sound, learning becomes impossible; in the worst case the impulse response estimate fails and the acoustic echo cannot be removed at all. It is therefore necessary to determine whether sound is coming only from the loudspeaker or a talker in the room is also speaking (double-talk detection).

The present invention detects situations in which sound from the loudspeaker is dominant and controls adaptation of the echo canceller accordingly. In one example configuration, a microphone array having a plurality of microphone elements makes it possible to estimate the direction of arrival of a sound source. In a more preferred aspect, the phase difference between the sounds entering the microphone elements is detected to judge whether loudspeaker sound is dominant. The judgment can be made by comparison with a threshold stored in advance. In a still more preferred aspect, an acoustic echo canceller adaptation unit extracts only the band-divided signals whose direction of arrival is the loudspeaker direction and adapts the acoustic echo canceller using those band-divided signals.

An acoustic echo canceller cancels the echo by synthesizing a replica of the loudspeaker sound (a pseudo echo) and subtracting it from the input signal.
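The subtraction described above can be sketched in the time domain as follows. This is a minimal illustration, not the patent's implementation: the function names `fir_filter` and `cancel_echo`, the toy signals, and the assumption that the echo path is a short causal FIR filter are all introduced here for the example.

```python
def fir_filter(ref, h):
    """Convolve the loudspeaker reference with impulse response h (causal FIR)."""
    return [
        sum(h[k] * ref[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(ref))
    ]


def cancel_echo(mic, ref, h_est):
    """Subtract the pseudo echo (reference filtered by h_est) from the mic signal."""
    pseudo_echo = fir_filter(ref, h_est)
    return [m - p for m, p in zip(mic, pseudo_echo)]


# Toy case (assumed values): the microphone hears only the echo,
# and the estimated impulse response equals the true one.
ref = [1.0, 0.0, -1.0, 0.5, 0.0]
h_true = [0.0, 0.5, 0.25]
mic = fir_filter(ref, h_true)
residual = cancel_echo(mic, ref, h_true)
```

When the estimate matches the true path the residual vanishes; any mismatch between the two, such as after a participant moves, leaves residual echo in the output.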

A typical configuration of the invention is a conference system comprising: a microphone that receives sound; an AD converter that digitizes the signal from the microphone; an information processing device that processes the digital signal from the AD converter and suppresses the acoustic echo component; an output interface that sends the signal from the information processing device to a network; an input interface that receives a signal from the network; a DA converter that converts the signal from the input interface to analog; and a loudspeaker that outputs the signal from the DA converter as sound. The information processing device controls the timing of its own optimization based on the state of the sound entering the microphone. The AD converter and the DA converter may be integrated into one unit.

Preferably, the information processing device is optimized at times when the sound entering the microphone comes mainly from the loudspeaker direction. This can be judged, for example, by setting a suitable threshold.

More preferably, the information processing device comprises an adaptive filter, an acoustic echo canceller adaptation unit that optimizes the adaptive filter, and an acoustic echo cancellation unit that uses the adaptive filter to suppress the acoustic echo component, that is, the loudspeaker sound mixed into the digital signal.

More preferably, the microphone is a microphone array having a plurality of microphone elements, and a separate AD converter digitizes the signal of each microphone element. The information processing device has a phase difference calculation unit that calculates, from the signals of the AD converters, the phase difference between the sounds entering the microphone elements, and a frequency distribution unit that judges from that phase difference whether the sound entering the microphone array comes from the loudspeaker.

More preferably, the information processing device has a band division unit that divides a digital signal into bands. The band division unit divides the digitized signal of each microphone element into bands, and the phase difference calculation unit calculates, for each band, the phase difference between the sounds entering the microphone elements. From the per-band phase differences output by the phase difference calculation unit, the frequency distribution unit judges whether each band-divided signal is a loudspeaker output signal or a talker signal. The acoustic echo canceller adaptation unit adapts the adaptive filter, which is used to suppress the loudspeaker sound leaking into the microphone element signals, only for bands that the frequency distribution unit judged to be loudspeaker output. The acoustic echo cancellation unit uses the adaptive filter to remove the acoustic echo component from the signal of each microphone element. The band division unit divides, for example, the range from 20 Hz to 16 kHz into 20 Hz steps. Controlling each frequency band separately in this way enables highly accurate echo cancellation.

More preferably, so that the frequency distribution unit can judge whether a signal is a loudspeaker output signal, the transfer function of sound travelling from the loudspeaker to the microphone array is measured in advance. From the measured transfer function, the per-band phase difference of the microphone array when sound is output from the loudspeaker is calculated and stored in an external storage device. When the stored phase difference and the observed phase difference between microphone elements in a given band of the band-divided signal differ by no more than a predetermined threshold, the band-divided signal is judged to be a loudspeaker output signal.

More preferably, the system has a user interface through which the user specifies in advance the number of loudspeakers and their physical positions relative to the microphone array, and an echo phase difference calculation processing unit that calculates, from the specified number and positions, the per-band phase difference of the microphone array when sound is output from the loudspeakers. The per-band phase difference is stored in an external storage device. When the stored phase difference and the observed phase difference between microphone elements in a given band of the band-divided signal differ by no more than a predetermined threshold, the band-divided signal is judged to be a loudspeaker output signal.

More preferably, the system has a sound source localization unit that uses the phase differences of the microphone array to compute a histogram of sound source directions across the bands and estimates the sound source direction from that histogram. The magnitude of the signal estimated to have arrived from the direction found by the sound source localization unit is computed and compared with the magnitude of the band-divided signals judged to be loudspeaker output, or with the magnitude of those band-divided signals after acoustic echo cancellation. If the computed magnitude is at or below a predetermined level relative to that reference, the band-divided signal is attenuated or set entirely to 0.

The echo canceller can be controlled dynamically according to conditions in the conference room.

Specific embodiments of the invention are described below with reference to the drawings. The invention is, for example, a telephone conference system using an IP network line, in which two (or more) sites connected by a network communicate using conference equipment consisting of a microphone array, a loudspeaker, and so on, enabling conversation between talkers at the sites. The two sites are referred to below as the near end and the far end.

FIG. 1 shows the hardware configuration of the invention as installed at each of the near-end and far-end sites. It consists of a microphone array 1 made up of at least two microphone elements; an A/D and D/A conversion device 2, which converts the analog sound pressure values from the microphone array into digital data and converts digital data into analog data; a central processing unit (CPU) 3 that processes the output of the conversion device 2; a memory 4, for example volatile; a hub 5 that connects to the network and exchanges data with the far end; a loudspeaker 6 that converts the D/A-converted analog data into sound pressure; and an external storage medium 7, for example nonvolatile.

The multichannel sound pressure values picked up by the microphone array 1 are sent to the AD/DA conversion device 2 and converted into multichannel digital data. The converted digital data passes through the CPU 3 and is stored in the memory 4.

Far-end audio arriving via the hub 5 passes through the CPU 3 to the AD/DA conversion device 2 and is output from the loudspeaker 6. The far-end audio output from the loudspeaker 6 is picked up by the microphone array 1 together with the voice of the near-end talker, so the digital sound pressure data stored in the memory 4 also contains the far-end audio. The CPU 3 performs echo cancellation on the digital sound pressure data in the memory 4 to suppress the far-end audio and sends only the near-end talker's voice to the far end via the hub 5. The echo cancellation uses information stored in advance in the external storage medium 7, such as data on how sound propagates from the loudspeaker to the microphone array, the number of loudspeakers, and the loudspeaker positions.

FIG. 2 shows the software configuration of the invention. Everything except the A/D conversion unit 8, which converts analog data into digital data, is executed digitally on the CPU 3. This is the most convenient arrangement provided the CPU has sufficient processing power. Alternatively, an equivalent hardware configuration may be used, with digital or analog processing.

The main functional blocks are: a band division unit 9 that converts digital data into band-divided data; a phase difference calculation unit 10 that calculates the phase difference between microphone channels of the band-divided signals; a frequency distribution unit 11 that judges, for each band of the band-divided signals, whether acoustic echo or talker speech is dominant, based on the phase difference from the phase difference calculation unit; an acoustic echo canceller adaptation unit 12 that adapts the adaptive filter used for acoustic echo cancellation; a pseudo echo generation unit 13 that synthesizes, from the loudspeaker reference signal, the acoustic echo reaching the microphone array; and an acoustic echo cancellation unit 14 that uses the adaptive filter to suppress the acoustic echo in the input signal. The multichannel analog sound pressure data picked up by the microphone array 1 is converted by the A/D conversion unit 8 into multichannel digital data x(t). The converted data is sent to the band division unit 9 and converted into multichannel band-divided data x(f,τ). Because the microphone array has a plurality of microphone elements, the A/D conversion unit 8 and the band division unit 9 may be provided in parallel, one per microphone element.

Band division uses a short-time Fourier transform, a wavelet transform, a bandpass filter bank, or the like. The band division unit divides, for example, the range from 20 Hz to 20 kHz into 20 Hz steps. τ is the frame index used in the short-time frequency analysis. The band-divided data is sent to the phase difference calculation unit 10, which calculates the phase difference between microphone channels using [Equation 1].
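As an illustration of band division followed by the inter-channel phase difference, the sketch below uses a naive DFT of one frame and takes the phase of x_i multiplied by the complex conjugate of x_j. [Equation 1] itself is not reproduced in this text, so this formula is an assumed, commonly used form rather than the patent's own definition; the names `dft` and `phase_difference` and the test tone are also assumptions.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of one frame: X[k] = sum_n frame[n] * exp(-2j*pi*k*n/N)."""
    n_samples = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n, x in enumerate(frame))
        for k in range(n_samples)
    ]


def phase_difference(x_i, x_j):
    """Phase of channel i relative to channel j in one band, in (-pi, pi]."""
    return cmath.phase(x_i * x_j.conjugate())


# Two channels carrying the same tone; channel j lags channel i by 0.5 rad.
N, BIN = 16, 2
chan_i = [math.cos(2 * math.pi * BIN * n / N) for n in range(N)]
chan_j = [math.cos(2 * math.pi * BIN * n / N - 0.5) for n in range(N)]
delta = phase_difference(dft(chan_i)[BIN], dft(chan_j)[BIN])
```

A practical system would use a windowed FFT per frame τ instead of this O(N²) DFT; the naive form is used here only to keep the example dependency-free.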

x_i(f,τ) is the band-divided data of band f for channel i, and x_j(f,τ) is likewise the band-divided data of band f for channel j. δ_{i,j}(f,τ) is the phase difference between channel i and channel j in band f. The calculated phase differences between microphone channels are sent to the frequency distribution unit 11. From the preset phase difference Sp_{i,j}(f) of the echo component travelling from the loudspeaker to the microphone array and the observed phase difference between microphone channels, the frequency distribution unit 11 calculates e_{i,j}(f,τ) as defined by [Equation 2]. If the sum of e_{i,j}(f,τ) over the indices i, j is at or below a preset threshold, band f is judged to be echo-dominant; if the sum is at or above the threshold, the band is judged to be near-end speech.
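The band classification can be sketched as below. [Equation 2] is not reproduced in this text, so the error measure used here, the absolute phase mismatch wrapped to [-π, π), is an assumption chosen for illustration, as are the names `wrapped_error` and `is_echo_band` and the dictionary layout keyed by microphone pair.

```python
import math


def wrapped_error(delta, sp):
    """Absolute mismatch between observed and stored phase difference, wrapped."""
    d = (delta - sp + math.pi) % (2 * math.pi) - math.pi
    return abs(d)


def is_echo_band(deltas, stored, threshold):
    """Judge one band as echo-dominant when the summed mismatch is small.

    deltas, stored: dicts keyed by microphone pair (i, j), values in radians.
    """
    total = sum(wrapped_error(deltas[pair], stored[pair]) for pair in deltas)
    return total <= threshold


# Stored loudspeaker phase differences for one band (assumed values).
stored = {(0, 1): 0.5, (0, 2): 1.0}
echo_like = is_echo_band({(0, 1): 0.52, (0, 2): 0.97}, stored, threshold=0.1)
talker_like = is_echo_band({(0, 1): -1.0, (0, 2): 2.0}, stored, threshold=0.1)
```

Wrapping matters because phase differences near +π and −π are physically close even though their numeric difference is almost 2π.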

Frequency components judged to be echo are sent to the acoustic echo canceller adaptation unit 12, which holds the adaptive filter settings for each divided frequency band. For the frequency bands that the frequency distribution unit 11 judged to be echo, the acoustic echo canceller adaptation unit 12 adapts the adaptive filter h_{i,τ}(f,T) according to [Equation 3], using the pseudo echo component Echo_i(f,τ) output by the pseudo echo generation unit 13.

Echo_i(f,τ) is the pseudo echo component for the microphone of channel i. h_{i,τ}(f,T) is tap T of the adaptive filter for band f of the channel-i microphone, as adapted using signals up to frame τ−1. L is the tap length of the adaptive filter. Adaptation may be performed per frequency in this way; alternatively, when the number of bands judged to be in the loudspeaker direction at time τ is at or above a predetermined threshold, adaptation by [Equation 3] may be applied to all frequency components at that time τ. Per-frequency sound source localization may be based on the MUSIC method (see Non-Patent Document 2) or the modified delay-and-sum array method (see Non-Patent Document 3). The pseudo echo generation unit 13 generates the pseudo echo component e^(f,τ) defined by [Equation 4].

d(f,τ) is the band-divided signal of the original signal output to the loudspeaker. The pseudo echo generation unit 13 also updates the echo phase difference DB from the pseudo echo according to [Equation 5].

The inter-microphone phase difference of the adaptive filter is stored in the echo phase difference DB. The acoustic echo cancellation unit 14 uses the adaptive filter adapted by the acoustic echo canceller adaptation unit 12 to generate and output the echo-suppressed digital audio data x^_i(f,τ) according to [Equation 6].
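The gated per-band adaptation and cancellation can be sketched as follows. [Equation 3], [Equation 4], and [Equation 6] are not reproduced in this text, so the sketch substitutes a standard normalized LMS (NLMS) update; the step size `mu`, the regularizer `eps`, the function names, and the simulated echo path `g` are all assumptions made for the example, not the patent's formulas.

```python
import cmath


def pseudo_echo(h, d_hist):
    """Echo estimate for one band: sum over taps T of h[T] * d(f, tau - T)."""
    return sum(h_t * d_t for h_t, d_t in zip(h, d_hist))


def process_band(x, h, d_hist, echo_dominant, mu=0.5, eps=1e-9):
    """Return (echo-suppressed sample, filter) for one band and one frame.

    The filter is updated (NLMS, an assumed rule) only when the band
    was judged echo-dominant; otherwise it is returned unchanged.
    """
    err = x - pseudo_echo(h, d_hist)
    if echo_dominant:
        norm = sum(abs(d_t) ** 2 for d_t in d_hist) + eps
        h = [h_t + mu * err * d_t.conjugate() / norm
             for h_t, d_t in zip(h, d_hist)]
    return err, h


# Simulation with an assumed 2-tap echo path g and echo-only frames.
g = [0.6 + 0.2j, -0.3j]
h = [0j, 0j]
d_seq = [cmath.rect(1.0, 0.7 * t * t) for t in range(300)]
for t in range(1, 300):
    d_hist = [d_seq[t], d_seq[t - 1]]          # d(f, tau), d(f, tau - 1)
    x = sum(g_t * d_t for g_t, d_t in zip(g, d_hist))
    err, h = process_band(x, h, d_hist, echo_dominant=True)
```

Passing `echo_dominant=False` for a band still subtracts the pseudo echo but freezes the filter, which is the point of the gating: double-talk frames cannot corrupt the learned response.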

As described above, this embodiment uses a microphone array having a plurality of microphone elements to determine, from the inter-microphone phase difference, the bands in which loudspeaker sound is dominant, and adapts the filter only in those bands. This yields a filter with high acoustic echo suppression performance. Because double talk can be detected, the acoustic echo canceller can be adapted at times when sound comes only from the loudspeaker. The system can therefore track changes in the acoustic path continuously, and because learning of the impulse response is paused when a talker in the conference room speaks while the loudspeaker is emitting sound, learning of the impulse response fails less often.

FIG. 3 is a block diagram of a system that sets the initial values of the echo phase difference DB using information from a GUI in which the number and positions of the loudspeakers are specified. It consists of a speaker number/position setting GUI 15, which specifies the number and physical positions of the loudspeakers used for the video conference, an echo phase difference calculation processing unit 16, which calculates the phase difference of the acoustic echo from the specified number and positions, and a database. These are implemented by the CPU and storage means.

The speaker number/position setting GUI 15 sets the number of loudspeakers and their positions relative to the microphone array 1. At a minimum, the GUI must specify the loudspeaker direction relative to the microphone array 1. The number and position information set in the GUI is sent to the echo phase difference calculation processing unit 16, which estimates, from the number and positions of the loudspeakers and under a far-field assumption, the phase difference Sp_{i,j}(f) of the acoustic echo between the microphones of channel i and channel j. The estimated echo phase difference is stored in the echo phase difference DB.
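Under the far-field (plane-wave) assumption, the expected phase difference for a loudspeaker direction follows from the path-length difference between two microphones. The sketch below is one way to compute it; the 2-D geometry, the speed of sound of 343 m/s, and the name `far_field_phase_difference` are assumptions for illustration, and the result is not wrapped to (-π, π].

```python
import math

SOUND_SPEED = 343.0  # m/s, assumed room-temperature value


def far_field_phase_difference(freq_hz, pos_i, pos_j, azimuth_rad):
    """Phase of microphone i relative to microphone j for a plane wave.

    pos_i, pos_j: (x, y) microphone positions in meters.
    azimuth_rad: direction from the array toward the loudspeaker.
    """
    ux, uy = math.cos(azimuth_rad), math.sin(azimuth_rad)
    # Microphone i is closer to the source by this many meters.
    path_diff = (pos_i[0] - pos_j[0]) * ux + (pos_i[1] - pos_j[1]) * uy
    return 2 * math.pi * freq_hz * path_diff / SOUND_SPEED


# Two microphones 5 cm apart on the x axis (assumed layout).
endfire = far_field_phase_difference(1000.0, (0.05, 0.0), (0.0, 0.0), 0.0)
broadside = far_field_phase_difference(1000.0, (0.05, 0.0), (0.0, 0.0), math.pi / 2)
```

With several loudspeakers, the same calculation would be repeated per loudspeaker and per band to populate the initial database entries.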

FIG. 4 shows a configuration that uses the adaptation algorithm and sound source direction of the invention to switch between the echo canceller and a voice switch. This makes it possible to realize a video conference system that does not howl even in a large conference room that exceeds the performance limits of the echo canceller. In addition to the acoustic echo cancellation unit 14, the configuration includes a sound source localization unit 17 that estimates the power of the talker's voice, a VoiceSwitch determination unit 18 that decides whether to use VoiceSwitch from the magnitude of the acoustic echo and the power of the talker's voice, and an output signal generation unit 19 that outputs the signal after acoustic echo suppression.

The A/D conversion unit, band division unit, phase difference calculation unit, frequency distribution unit, acoustic echo canceller adaptation unit, pseudo echo generation unit, and acoustic echo cancellation unit perform the same processing as in FIG. 2. The sound source localization unit 17 computes a histogram of the phase differences of the frequency components that the frequency distribution unit 11 did not classify as echo, and identifies sound source directions from the peaks of that histogram. The number of directions to identify is either fixed in advance, or every histogram peak whose count is at or above a threshold is treated as a sound source direction. The total power over all identified sound source directions is defined as the near-end talker power, which the sound source localization unit 17 outputs.

The VoiceSwitch determination unit 18 computes the acoustic echo power as the sum of the power after acoustic echo cancellation over the frequencies that the frequency distribution unit judged to be echo-dominant. If the ratio of this acoustic echo power to the near-end talker power is at or above a predetermined threshold, the frame is judged to consist mainly of acoustic echo with no active talker, and VoiceSwitch is used. If the ratio is at or below the threshold, a talker is judged to be present in the frame and VoiceSwitch is not used. When VoiceSwitch is used, the output signal generation unit 19 generates and outputs a signal with all values set to 0. When it is not used, the unit outputs the echo-cancelled signal from the acoustic echo cancellation unit.

When the echo-cancelled signal contains a large amount of residual echo, the VoiceSwitch determination unit 18 decides to use VoiceSwitch, so the signal containing residual echo is not transmitted. Transmitting a signal that contains residual echo closes the loop between the sites, and the residual echo can cause howling. Using VoiceSwitch keeps the echo from looping, but if VoiceSwitch were used all the time, the near-end and far-end talkers could not speak simultaneously. The VoiceSwitch determination unit 18 of the invention therefore uses VoiceSwitch only in frames where residual echo is present. When no residual echo occurs, the near-end and far-end talkers can speak at the same time; when residual echo does occur, switching to VoiceSwitch greatly reduces the likelihood of howling.

In this embodiment, the acoustic echo power used by the VoiceSwitch determination unit 18 is obtained from the signal after acoustic echo cancellation, but it may instead be calculated from the power of the signal before acoustic echo cancellation.
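The histogram step and the frame-level VoiceSwitch decision can be sketched as follows. This is a simplified illustration under stated assumptions: every histogram bin whose count reaches `min_count` is treated as a peak, directions are given in degrees, and the names `dominant_direction_bins` and `voice_switch_frame` and all numeric values are hypothetical.

```python
from collections import Counter


def dominant_direction_bins(directions_deg, bin_width_deg, min_count):
    """Histogram non-echo bands by direction; return bins with enough votes."""
    counts = Counter(int(d // bin_width_deg) for d in directions_deg)
    return sorted(b for b, c in counts.items() if c >= min_count)


def voice_switch_frame(cancelled, echo_flags, near_end_power, ratio_threshold):
    """Output zeros when residual echo power dominates near-end talker power.

    cancelled: echo-cancelled band values for one frame.
    echo_flags: True for bands judged echo-dominant.
    """
    echo_power = sum(abs(v) ** 2 for v, is_echo in zip(cancelled, echo_flags)
                     if is_echo)
    # Equivalent to echo_power / near_end_power >= ratio_threshold,
    # written without division so a silent near end cannot divide by zero.
    if echo_power >= ratio_threshold * near_end_power:
        return [0j] * len(cancelled)
    return list(cancelled)


bins = dominant_direction_bins([31, 33, 35, 38, 120, 122, 200], 10, 2)
frame = [1 + 0j, 0.1j, 2 + 0j]
flags = [True, False, True]
muted = voice_switch_frame(frame, flags, near_end_power=0.5, ratio_threshold=2.0)
passed = voice_switch_frame(frame, flags, near_end_power=10.0, ratio_threshold=2.0)
```

A fuller implementation would detect true local maxima in the histogram and accumulate power only over bands belonging to the identified directions.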

In the above, the decision to use VoiceSwitch compares the residual echo over all frequencies with the near-end talker power over all frequencies. Alternatively, the residual echo and the near-end talker power may be compared for each subband containing several frequency bins, and the use of VoiceSwitch switched per subband. In that case, for each subband in which VoiceSwitch is to be used, the output signal generation unit 19 outputs values replaced by 0; for each subband in which it is not, the output signal generation unit 19 outputs the signal after acoustic echo cancellation.
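The per-subband variant can be sketched as below. The function name `voice_switch_subbands`, the fixed subband width in bins, and the per-bin power inputs are assumptions for illustration; the text does not specify how subbands are laid out.

```python
def voice_switch_subbands(cancelled, echo_power, near_power, band_size,
                          ratio_threshold):
    """Zero only those subbands where residual echo dominates.

    cancelled: echo-cancelled bin values for one frame.
    echo_power, near_power: per-bin residual echo and near-end talker power.
    band_size: number of frequency bins per subband (assumed uniform).
    """
    out = list(cancelled)
    for start in range(0, len(cancelled), band_size):
        stop = min(start + band_size, len(cancelled))
        echo = sum(echo_power[start:stop])
        near = sum(near_power[start:stop])
        if echo >= ratio_threshold * near:
            out[start:stop] = [0j] * (stop - start)
    return out


frame = [1 + 0j, 1j, 2 + 0j, 3 + 0j]
result = voice_switch_subbands(frame, [4.0, 4.0, 0.1, 0.1],
                               [1.0, 1.0, 1.0, 1.0],
                               band_size=2, ratio_threshold=2.0)
```

Compared with the frame-level switch, this keeps the subbands that carry near-end speech audible while still breaking the echo loop in the subbands where residual echo dominates.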

図５にテレビ会議システムに本発明を適用した際の、システム全体図を示す。このシステムでは、計算器101上で、位相差計算部10で計算した位相差や音源方向の情報を用いて、音響エコーキャンセラの適応を制御することを特徴とするテレビ会議システムである。 FIG. 5 shows an overall view of the system when the present invention is applied to a video conference system. The video conference system is characterized in that, on the computer 101, adaptation of the acoustic echo canceller is controlled using the phase difference calculated by the phase difference calculation unit 10 and the sound source direction information.

図５は、1拠点のシステム構成を示している。テレビ会議システム100は、音響信号処理または画像処理、通信処理を計算機101で行う。計算機101にはA/DD/A装置102がつながっており、マイクロホンアレイ105で収録した音声信号は、A/DD/A装置102によってデジタルの音声信号に変換され計算機101に送られる。マイクロホンアレイ105は、複数のマイク素子を有する。 FIG. 5 shows the system configuration of one site. In the video conference system 100, the computer 101 performs acoustic signal processing, image processing, and communication processing. An A/D-D/A device 102 is connected to the computer 101; an audio signal recorded by the microphone array 105 is converted into a digital audio signal by the A/D-D/A device 102 and sent to the computer 101. The microphone array 105 has a plurality of microphone elements.

計算機101では、デジタルの音声信号に対して音響信号処理を施し、処理後の音声信号はハブ103を介して、ネットワーク上に送られる。ここで、計算機101は、図1に示されるCPU3、メモリ4及び外部記憶媒体７を備えている。 The computer 101 performs acoustic signal processing on the digital audio signal, and the processed audio signal is sent to the network via the hub 103. Here, the computer 101 includes the CPU 3, the memory 4, and the external storage medium 7 shown in FIG. 1.

外部記憶媒体７は、計算機101の内部にあっても良いし、計算機101の外部にあっても良い。そして、計算機101内のCPU3では、図２に示されるような、帯域分割部9、位相差計算部10、周波数振り分け部11、音響エコーキャンセラ適応部12、擬似エコー生成部13、音響エコーキャンセラ部14や、または後述する図9に示されるように、音声送信部201、音響エコーキャンセラ適応部204、音響エコーキャンセラ部205、音声収録部203、音声受信部207、音声再生部208を有し、これらにより音響エコーキャンセラが実現されている。 The external storage medium 7 may be inside or outside the computer 101. The CPU 3 in the computer 101 has the band dividing unit 9, phase difference calculation unit 10, frequency distribution unit 11, acoustic echo canceller adaptation unit 12, pseudo echo generation unit 13, and acoustic echo canceller unit 14 shown in FIG. 2, or, as shown in FIG. 9 described later, the audio transmission unit 201, acoustic echo canceller adaptation unit 204, acoustic echo canceller unit 205, audio recording unit 203, audio reception unit 207, and audio reproduction unit 208. These units realize the acoustic echo canceller.

ハブ103を介して、テレビ会議システム100に送られてきた他拠点の画像信号は画像表示装置104に送られ、画面に表示される。ハブ103を介して送られてきた他拠点の音声信号は、スピーカ106より出力される。 The image signal from the other site, sent to the video conference system 100 via the hub 103, is passed to the image display device 104 and displayed on the screen. The audio signal from the other site, sent via the hub 103, is output from the speaker 106.

マイクロホンアレイ105で受音される音声内には、スピーカ106からマイクロホンアレイ105まで伝わる音響エコーが含まれており、除去することが必要となる。デジタルケーブル110及びデジタルケーブル113は、USBケーブルなどを用いる。 The sound received by the microphone array 105 includes an acoustic echo transmitted from the speaker 106 to the microphone array 105 and needs to be removed. As the digital cable 110 and the digital cable 113, a USB cable or the like is used.

図6に、スピーカからマイク素子まで音が伝わる際の音響エコーモデル及び適応フィルタを用いた音響エコーキャンセラによる、従来技術の音響エコー抑圧処理の構成を示す。 FIG. 6 shows a configuration of a conventional acoustic echo suppression process using an acoustic echo model and an acoustic echo canceller using an adaptive filter when sound is transmitted from a speaker to a microphone element.

信号は全てz変換を施して表記する。受話信号d(z)は、スピーカより放出されマイク素子に部屋のインパルス応答H(z)が畳み込まれた形で到来する。インパルス応答H(z)は、スピーカからマイクまでの直接音及び壁や床、天井などからの反射（音響エコー）を含んでいる。 All signals are expressed in the z-domain. The received signal d(z) is emitted from the speaker and arrives at the microphone element convolved with the room impulse response H(z). The impulse response H(z) includes the direct sound from the speaker to the microphone and reflections (acoustic echo) from the walls, floor, ceiling, and so on.

マイク素子には、音響エコーの他、話者音声N(z)が混合する。マイク素子信号X(z)をそのまま送話すると、送話音声に音響エコーが含まれることになり、信号がループし、最悪の場合ハウリングを起こし、通話不能となる。そのため、送話音声から音響エコーのみを抑圧する必要がある。 In addition to the acoustic echo, speaker sound N (z) is mixed in the microphone element. If the microphone element signal X (z) is transmitted as it is, an acoustic echo is included in the transmitted voice, and the signal loops. In the worst case, howling occurs and the call becomes impossible. Therefore, it is necessary to suppress only the acoustic echo from the transmitted voice.

適応フィルタW(z)は適応的に部屋のインパルス応答H(z)を学習したフィルタであり、受話信号にW(z)をかけることで、擬似的な音響エコーを作ることができる。適応フィルタＷ（ｚ）の適応は、例えばＮＬＭＳ法などを用いて行う。ＮＬＭＳ法では、Ｗ（ｚ）＝Ｗ（ｚ）＋２μX(z)N’(z)*/|X(z)|^2というように、適応フィルタを更新する。 The adaptive filter W(z) is a filter that has adaptively learned the room impulse response H(z); applying W(z) to the received signal produces a pseudo acoustic echo. The adaptive filter W(z) is adapted using, for example, the NLMS method. In the NLMS method, the adaptive filter is updated as W(z) = W(z) + 2μ X(z) N'(z)* / |X(z)|^2.

W(z)=H(z)となるケースでは、マイク信号から擬似音響エコーを差し引くことで、話者音声N(z)のみを抽出することができる。 In the case of W (z) = H (z), only the speaker voice N (z) can be extracted by subtracting the pseudo acoustic echo from the microphone signal.

N(z)=0の場合、W(z)=H(z)のケースでは、送話音声は0となる。つまり、前述のＮＬＭＳ法の更新式では、W(z)は送話音声が0となるように適応的に変更される。 When N (z) = 0, the transmitted voice is 0 in the case of W (z) = H (z). That is, in the above update formula of the NLMS method, W (z) is adaptively changed so that the transmitted voice becomes zero.

しかし、N(z)が0ではない場合、W(z)は送話音声が0となるように適応的に変更されることで、逆にW(z)は適応に失敗する。そのため、N(z)が0ではない場合は、適応しないように制御することが必要となる。 However, when N (z) is not 0, W (z) is adaptively changed so that the transmitted voice becomes 0, and W (z) fails to adapt. Therefore, when N (z) is not 0, it is necessary to perform control so as not to adapt.
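
The effect of adapting while N(z) is non-zero can be illustrated with a small simulation. The sketch below is not the update rule quoted above verbatim: it uses the conventional NLMS form for a single complex coefficient, with the received (far-end) sample `d` as the regressor and the cancelled output `e` as the error, and it folds the factor 2 into the step size `mu`. The value of `h`, the noise model, and the regularizer `eps` are assumptions chosen only for the demonstration.

```python
import random


def nlms_step(w, d, x, mu=0.5, eps=1e-12):
    """One NLMS update of a single complex coefficient.

    w: current estimate of H, d: far-end (reference) sample, x: microphone sample.
    Returns (updated w, e), where e = x - w*d is the signal that would be sent.
    """
    e = x - w * d
    w_new = w + mu * d.conjugate() * e / (abs(d) ** 2 + eps)
    return w_new, e


def simulate(adapt_during_double_talk, h=0.6 - 0.3j, seed=1):
    """Return |w - h| after 200 echo-only steps followed by 200 double-talk steps."""
    rng = random.Random(seed)
    w = 0j
    for step in range(400):
        d = complex(rng.gauss(0, 1), rng.gauss(0, 1))
        double_talk = step >= 200  # near-end talker active in the second half
        n = complex(rng.gauss(0, 1), rng.gauss(0, 1)) if double_talk else 0j
        x = h * d + n  # microphone = echo + near-end speech
        w_new, _ = nlms_step(w, d, x)
        if not double_talk or adapt_during_double_talk:
            w = w_new  # otherwise the filter is frozen
    return abs(w - h)


err_gated = simulate(adapt_during_double_talk=False)
err_ungated = simulate(adapt_during_double_talk=True)
print(err_gated, err_ungated)
```

With adaptation frozen during double talk, the learned coefficient stays at the value reached in the echo-only phase; when adaptation continues, the near-end speech acts as noise on every update and pulls the coefficient away from H.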

図7に、N(z)が0でない場合には適応しないように制御する機能を備えたダブルトーク検出器を有する、本発明を適用したシステムを示す。このダブルトーク検出器は、N(z)が0であるかどうか判定し、N(z)が0に近い時のみ、適応フィルタを適応する。 FIG. 7 shows a system to which the present invention is applied, having a double-talk detector that prevents adaptation when N(z) is not 0. The double-talk detector determines whether N(z) is 0 and adapts the adaptive filter only when N(z) is close to 0.

本発明を適用したこのシステムでは、マイクロホンアレイが受音したダブルトーク検出部702が位相差計算部701で得た音源の到来方向に関する情報を利用して判定を行う点が、特徴的である。 This system to which the present invention is applied is characterized in that the double-talk detection unit 702 makes its determination using information, obtained by the phase difference calculation unit 701, on the direction of arrival of the sound received by the microphone array.

Ｎ（ｚ）が０でない場合に、音響エコーキャンセラを更新すると、適応に失敗し、フィルタが発散する危険性があることから、その危険性を回避するために、ダブルトーク検出器は必須となる。 If the acoustic echo canceller is updated when N(z) is not 0, adaptation may fail and the filter may diverge, so a double-talk detector is essential to avoid that risk.

図8は、2拠点のテレビ会議システムにおける音声ストリームの流れ及び３拠点以上の会議システムにおける音声ストリームの流れを示している。ここで、位相差計算部は、サーバーにあっても良いし、各拠点ごとのＣＰＵに存在しても良い。 FIG. 8 shows the flow of an audio stream in a video conference system at two sites and the flow of an audio stream in a conference system at three or more sites. Here, the phase difference calculation unit may be in a server or may exist in a CPU at each site.

２拠点の場合は、拠点Aのテレビ会議システムから音響エコーキャンセル後の送話信号が拠点Bのテレビ会議システムにネットワークを介して送られ、拠点Bで再生される。拠点Bの音声は拠点Aに送られて、再生される。 In the case of two sites, the transmission signal after acoustic echo cancellation is transmitted from the video conference system at site A to the video conference system at site B via the network and reproduced at site B. The audio from site B is sent to site A for playback.

また３拠点以上の場合は、サーバやＣＰＵに一旦データが集められ、それぞれの拠点に再分配されて、再生される。 In the case of three or more locations, data is once collected in a server or CPU, redistributed to each location, and reproduced.

図9に、本発明を適用された際のテレビ会議システムのブロック構成を示す。ネットワークを介して伝わる受話音声は音声受信部207受信される。受信された受話音声は音声再生部208に送られる。音声再生部208ではスピーカで受話音声を再生する。 FIG. 9 shows the block configuration of a video conference system to which the present invention is applied. The received voice arriving via the network is received by the audio reception unit 207 and passed to the audio reproduction unit 208, which plays it through the speaker.

受話音声は音響エコーキャンセラ部205に送られる。音声収録部203では、マイクロホンアレイの音声信号を収録する。収録した音声信号は音響エコーキャンセラ部205に送られる。 The received voice is sent to the acoustic echo canceller unit 205. The audio recording unit 203 records the audio signal of the microphone array. The recorded audio signal is sent to the acoustic echo canceller unit 205.

音響エコーキャンセラ部205は、音響エコーキャンセルフィルタDB211に蓄えてある音響エコーキャンセルフィルタと受話音声から擬似エコーを生成して、マイクロホンアレイの音声信号から擬似エコーを差し引く。差し引いた結果、残った誤差信号は音響エコーキャンセラ適応部204に送られる。 The acoustic echo canceller unit 205 generates a pseudo echo from the acoustic echo cancellation filter stored in the acoustic echo cancellation filter DB211 and the received voice, and subtracts the pseudo echo from the audio signal of the microphone array. As a result of the subtraction, the remaining error signal is sent to the acoustic echo canceller adaptation unit 204.

音響エコーキャンセラ適応部204では、誤差信号を0にするように音響エコーキャンセラを適応する。適応した結果を音響エコーキャンセルフィルタDB211に保存する。音響エコーキャンセル部205で出力する誤差信号は音声送信部201に送られる。 The acoustic echo canceller adaptation unit 204 adapts the acoustic echo canceller so that the error signal becomes zero. The adapted result is stored in the acoustic echo cancellation filter DB211. The error signal output from the acoustic echo cancellation unit 205 is sent to the audio transmission unit 201.

音声送信部201では誤差信号を他拠点に送信する。画像撮影部210ではカメラで画像を撮影する。撮影画像を画像送信部202に送り、他拠点に送信する。 The voice transmission unit 201 transmits an error signal to another site. The image capturing unit 210 captures an image with a camera. The captured image is sent to the image transmission unit 202 and transmitted to another site.

画像受信部209は他拠点から送られてきた画像を受信する。受信した画像は画像表示部206に送る。画像表示部206では送られてきた画像を画面に表示する。 The image receiving unit 209 receives an image sent from another site and passes it to the image display unit 206, which displays it on the screen.

図10に、テレビ会議システムの処理フローを示す。音響エコーキャンセラ適応処理S1では、スピーカより学習信号を鳴らし、音響エコーキャンセラの適応を行う。学習信号は白色雑音が望ましい。学習信号長は、数秒〜数十秒以上であることが望まれる。学習長が短い場合、音響エコーキャンセラは十分に部屋のインパルス応答を学習できない虞がある。このように、学習信号長を数秒〜数十秒以上とすることで、インパルス応答を十分に学習することが可能となる。 FIG. 10 shows a processing flow of the video conference system. In the acoustic echo canceller adaptation processing S1, a learning signal is emitted from the speaker, and the acoustic echo canceller is adapted. The learning signal is preferably white noise. The learning signal length is desired to be several seconds to several tens of seconds or more. If the learning length is short, the acoustic echo canceller may not be able to learn the room impulse response sufficiently. As described above, the impulse response can be sufficiently learned by setting the learning signal length to several seconds to several tens of seconds or more.

学習終了後、他拠点から接続要求が出ているかどうかの判定S2を行う。他拠点から接続要求が出ている場合、他拠点と接続S4を行う。 After learning is completed, a determination S2 is made as to whether or not a connection request has been issued from another site. When a connection request is issued from another site, connection S4 is performed with the other site.

他拠点より接続要求が出ていなければ、自拠点から接続要求が出たかS3判定する。自拠点からの接続要求は、ユーザーがＧＵＩを通じて出す。 If no connection request has been issued from another site, it is determined in S3 whether a connection request has been issued from the local site. The user issues a connection request from the local site through the GUI.

自拠点から接続要求が出ていれば、他拠点と接続S4を行う。自拠点から接続要求が出ていなければ、他拠点と接続をせず、他拠点から接続要求が出たかS2の判定に戻る。 If a connection request is issued from the local site, connection S4 is performed with another site. If the connection request is not issued from the own site, the connection to the other site is not made, and the process returns to the determination of S2 whether the connection request is issued from the other site.

つまり、テレビ会議システムは、自拠点か他拠点かいずれかから接続要求が出るまで、待機することになる。 In other words, the video conference system stands by until a connection request is issued from either its own base or another base.

他拠点と接続S4後、スピーカから再生S6、画像表示S7、音声収録S8、エコーキャンセルS9、音声送信S10を接続が切れるまで繰り返し行う。 After connection S4 with the other site, speaker playback S6, image display S7, audio recording S8, echo cancellation S9, and audio transmission S10 are repeated until the connection is closed.

スピーカから再生S6では他拠点から送られてくる受話音声を再生する。 In speaker playback S6, the received voice sent from the other site is played back.

画像表示S7では、他拠点から送られてくる画像をモニタ上に表示する。 In image display S7, the image sent from the other site is displayed on the monitor.

音声収録S8では、自拠点のマイクロホンアレイの音声を収録する。 In audio recording S8, audio from the microphone array at the local site is recorded.

エコーキャンセルS9では、収録したマイクロホンアレイの音声から音響エコー成分を抑圧する。 In echo cancellation S9, the acoustic echo component is suppressed from the voice of the recorded microphone array.

音声送信S10では、音響エコー成分抑圧後の音声信号を送信する。接続が切れたかS11の判定で、もし切れたと判定された場合は、他拠点から接続S13を行い、テレビ会議システムを終了する。 In audio transmission S10, the audio signal after acoustic echo suppression is transmitted. If determination S11 of whether the connection has been lost finds that it has, disconnection S13 from the other site is performed and the video conference system terminates.

接続が切れていないと判定された場合は、自拠点のユーザーからＧＵＩを通して切断要求があるかS12判定し、切断要求があれば、他拠点から切断S13し、テレビ会議システムを終了する。 If the connection is judged not to have been lost, it is determined in S12 whether the local user has issued a disconnection request through the GUI; if so, disconnection S13 from the other site is performed and the video conference system terminates.

図11に、本発明の主たる要素であるダブルトーク処理の基盤概念であるスパース性を示している。 FIG. 11 shows sparsity, which is the basic concept of double talk processing, which is the main element of the present invention.

本発明では、マイクロホンアレイからの音声信号及び擬似エコー生成に用いる受話信号は、全て短時間フーリエ変換やウェーブレット変換もしくはサブバンド処理を施されて、周波数領域信号に変換される。短時間フーリエ変換時のフレームサイズは、約50msに相当するポイント数であることが望まれる。 In the present invention, the voice signal from the microphone array and the reception signal used for pseudo echo generation are all subjected to short-time Fourier transform, wavelet transform, or subband processing to be converted into a frequency domain signal. It is desirable that the frame size at the time of the short-time Fourier transform is the number of points corresponding to about 50 ms.

例えば、サンプリングレート32kHzではフレームサイズは2048ポイントが望ましいことになる。音声は数十ｍｓ程度は定常であり、かつこのようなフレームサイズに設定することで、周波数領域で、最もスパース性が成立していると仮定でき、音響エコーキャンセラの適応処理を高精度に動作させることが可能となる。 For example, at a sampling rate of 32 kHz a frame size of 2048 points (64 ms) is desirable. Speech is stationary over several tens of milliseconds, and with a frame size of this order sparsity in the frequency domain can be assumed to hold best, which allows the adaptation of the acoustic echo canceller to operate with high accuracy.

また短時間フーリエ変換は、ハミング窓やハニング窓、またはブラックマン窓などをかけた後、行うことが望ましい。短時間フーリエ変換は、信号が分析長周期で繰り返させると仮定する。窓関数を掛けない場合、両端の値に差が生じ、短時間フーリエ変換後に、存在しない周波数が観測されてしまう。このように窓関数をかけることで、存在しない周波数成分が観測されにくくなり、周波数解析精度を向上させることが可能となる。 The short-time Fourier transform is preferably performed after applying a window such as a Hamming, Hann (Hanning), or Blackman window. The short-time Fourier transform assumes that the signal repeats with a period equal to the analysis length. Without a window function, the values at the two ends of the frame differ, and frequencies that are not actually present are observed after the transform. Applying a window function makes such spurious frequency components less likely to appear and improves the accuracy of the frequency analysis.
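
The leakage argument can be checked numerically. The sketch below, which is only an illustration, takes a tone that completes a non-integer number of cycles in the frame, so that its two ends do not match, and compares the spectrum magnitude at a distant bin with and without a Hann window. The frame length of 64 samples, the tone at 2.5 cycles per frame, and the choice of bin 20 are arbitrary assumptions.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_magnitude(signal, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(signal)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                   for i, s in enumerate(signal)))


N = 64
tone = [math.sin(2 * math.pi * 2.5 * i / N) for i in range(N)]  # 2.5 cycles per frame
windowed = [t * w for t, w in zip(tone, hann(N))]

leak_rect = dft_magnitude(tone, 20)      # no window (rectangular)
leak_hann = dft_magnitude(windowed, 20)  # Hann window applied first
print(leak_rect, leak_hann)
```

The windowed frame shows far less energy at the distant bin, which is the sense in which non-existent frequency components become harder to observe.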

フレームシフトはフレームサイズの1/4もしくは1/8程度が望ましく、フレームシフトを細かくするほど、出力音声の音質が向上する。しかしフレームシフトを細かくするほど、処理量が大きくなるため、搭載する計算機の処理速度でリアルタイム処理可能な範囲でフレームシフトを細かくする必要がある。 The frame shift is preferably about 1/4 or 1/8 of the frame size, and the finer the frame shift, the better the sound quality of the output sound. However, as the frame shift is made finer, the processing amount becomes larger. Therefore, it is necessary to make the frame shift fine within a range where real-time processing can be performed at the processing speed of the installed computer.
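
The frame-size and frame-shift guidance can be turned into a small calculation. Rounding the 50 ms target up to the next power of two is an assumption made here because it reproduces the 2048-point figure given for 32 kHz and suits FFT implementations; the function names are hypothetical.

```python
def frame_size_for(sample_rate_hz, target_ms=50.0):
    """Smallest power of two covering target_ms of audio (assumed rounding rule)."""
    samples = sample_rate_hz * target_ms / 1000.0
    size = 1
    while size < samples:
        size *= 2
    return size


def frame_shift_for(frame_size, divisor=4):
    """Hop between successive frames: frame_size/4 or frame_size/8 per the text."""
    return frame_size // divisor


size = frame_size_for(32000)  # 1600 samples, rounded up to 2048
print(size, frame_shift_for(size), frame_shift_for(size, 8))  # 2048 512 256
```

Halving the shift from 1/4 to 1/8 of the frame doubles the number of frames processed per second, which is the processing-cost trade-off noted above.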

図11では横軸を時間(フレーム番号）、縦軸を周波数（周波数ビン番号）に設定した格子を示している。 FIG. 11 shows a lattice in which the horizontal axis is set to time (frame number) and the vertical axis is set to frequency (frequency bin number).

本発明のダブルトーク処理では各時間-周波数成分毎に、その成分が音響エコー成分か非音響エコー成分かを判定し、音響エコー成分であると判定された時間-周波数についてのみ音響エコーキャンセラの適応処理を行う。 In the double-talk processing of the present invention, each time-frequency component is judged to be either an acoustic echo component or a non-acoustic-echo component, and the acoustic echo canceller is adapted only at the time-frequency points judged to be acoustic echo.

音声は時間-周波数領域で見ると、スパースな信号であり、複数の音声が同じ時間-周波数上で混合することは稀であることが知られている。 It is known that speech is a sparse signal when viewed in the time-frequency domain, and it is rare for multiple speeches to mix on the same time-frequency.

テレビ会議システムのように受話信号、送話信号の双方が音声信号である場合、スパース性から時間-周波数毎に音響エコー成分か非音響エコー成分かに振り分けることで、音響エコー成分だけを高精度に抽出することが可能となる。 When both the received signal and the transmitted signal are speech, as in a video conference system, sparsity means that sorting each time-frequency point into acoustic echo or non-acoustic-echo makes it possible to extract only the acoustic echo component with high accuracy.

図12では、本発明を適用した音響エコーキャンセラの基本構成を示している。 FIG. 12 shows a basic configuration of an acoustic echo canceller to which the present invention is applied.

まず、複数のマイクロホン素子を有するマイクロホンアレイから計算機101に入力された音声信号に対して、帯域分割部が周波数分解S101を行い、収録音声を周波数領域の信号に変換する。 First, the band dividing unit performs frequency decomposition S101 on the audio signal input to the computer 101 from the microphone array having a plurality of microphone elements, and converts the recorded audio into a frequency domain signal.

次に、位相差計算部が、収録音声の素子間位相差を計算する。 Next, the phase difference calculation unit calculates the inter-element phase difference of the recorded voice.

次に、周波数振り分け部が、位相差計算部が出力する各帯域の位相差から、帯域分割信号がどの音声であるかを判定する。すなわち、スピーカ出力信号であるか話者信号であるかを判定する。 Next, the frequency distribution unit determines which voice the band division signal is based on the phase difference of each band output from the phase difference calculation unit. That is, it is determined whether the output signal is a speaker output signal or a speaker signal.

そして、音響エコーキャンセラ部が、帯域分割信号に含まれる音声を除去するS102。 Then the acoustic echo canceller unit removes the acoustic echo contained in the band-divided signal (S102).

S102では、Ｗ（ｚ）を参照信号ｄ（ｚ）に掛け合わせて、擬似エコーＷ（ｚ）ｄ（ｚ）を生成する。マイク入力信号ｘ（ｘ）からＷ（ｚ）ｄ（ｚ）を差し引くことで、マイク入力信号から音響エコーを消去することができる。 In S102, the pseudo echo W(z)d(z) is generated by multiplying the reference signal d(z) by W(z). Subtracting W(z)d(z) from the microphone input signal x(z) removes the acoustic echo from the microphone input signal.

音響エコーキャンセラ適応部は、帯域分割信号のうち、スピーカ出力信号である成分が多い場合、ダブルトーク状態であると判定S103し、音響エコーキャンセラの適応処理S104を行なう。 When most components of the band-divided signal are speaker output signal, the acoustic echo canceller adaptation unit determines in S103 that the state is not double talk and performs the acoustic echo canceller adaptation processing S104.

音響エコーキャンセラの適応処理は、ＮＬＭＳ法などで音響エコーキャンセラのフィルタＷ（ｚ）を更新する。ＮＬＭＳ法では、Ｗ（ｚ）＝Ｗ（ｚ）＋２μX(z)N’(z)*/|X(z)|^2というように更新する。 The adaptation processing updates the acoustic echo canceller filter W(z) by the NLMS method or the like. In the NLMS method, the update is W(z) = W(z) + 2μ X(z) N'(z)* / |X(z)|^2.

ダブルトーク状態であれば、エコーキャンセラ適応S103を行わず、音響エコーキャンセラ処理を終了する。 In the double-talk state, the echo canceller adaptation S104 is not performed, and the acoustic echo canceller processing ends.

図13では、音源方向情報を用いて、スピーカ方向から到来する時間-周波数成分だけを音響エコーとみなし、適応する処理の処理フローを示している。 FIG. 13 shows a processing flow of an adaptation process that uses sound source direction information and regards only time-frequency components coming from the speaker direction as acoustic echoes.

周波数分解S201では、収録音声を時間-周波数領域信号に変換する。 In frequency decomposition S201, the recorded audio is converted into a time-frequency domain signal.

音源定位Ｓ２０２では、修正遅延和アレイ法に基づいて、音源方向推定を行う。本発明で音源方向推定に用いる修正遅延和アレイ法は、時間-周波数毎にその成分がどの方向から到来しているかを判定する。 In sound source localization S202, sound source direction estimation is performed based on the modified delay sum array method. The modified delay sum array method used for sound source direction estimation in the present invention determines from which direction the component comes from every time-frequency.

図１１に示したとおり、音声は、時間-周波数毎に見るとスパースな信号であるため、時間-周波数ごとに音響エコーが主な成分と、非音響エコーが主な成分とに分かれると仮定できる。したがって、時間-周波数毎に推定した音源定位結果より、スピーカ方向から到来している時間-周波数成分を選択すれば、その成分は音響エコーが主な成分であると考えることができる。 As shown in FIG. 11, speech is a sparse signal when viewed per time-frequency point, so it can be assumed that the time-frequency points separate into those dominated by acoustic echo and those dominated by non-echo sound. Therefore, if the time-frequency components arriving from the speaker direction are selected from the per-point localization results, those components can be regarded as dominated by acoustic echo.

修正遅延和アレイ法では、音源方向θのステアリングベクトルAθ(f）を用いて、音源定位を行う。マイク数をMとした時、Aθ(f）はM次元の複素ベクトルとなる。 In the modified delay sum array method, sound source localization is performed using the steering vector Aθ (f) in the sound source direction θ. When the number of microphones is M, Aθ (f) is an M-dimensional complex vector.

ここでｆは周波数ビン番号をあらわす。入力信号の時間-周波数表現をX(f,τ)と記載する。τは短時間フーリエ変換のフレーム番号である。X(f,τ)はM次元の複素ベクトルであり、各マイク素子のフレームτ、周波数ｆの成分を要素に持つベクトルである。 Here, f represents a frequency bin number. The time-frequency representation of the input signal is described as X (f, τ). τ is the frame number of the short-time Fourier transform. X (f, τ) is an M-dimensional complex vector, which is a vector having elements of the frame τ and the frequency f of each microphone element.

修正遅延和アレイ法では、|Aθ(f）*X(f,τ)|が最大となる仮想音源方向θをフレームτ、周波数fの音源方向と推定する。 In the modified delay sum array method, the virtual sound source direction θ that maximizes | Aθ (f) * X (f, τ) | is estimated as the sound source direction of frame τ and frequency f.
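
The per-bin direction search can be sketched as follows. The text does not define the steering vector Aθ(f), so this illustration assumes a uniform linear array, a far-field plane wave, and a sound speed of 343 m/s; it also takes the frequency in hertz rather than as a bin number. All names are hypothetical.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s, assumed


def steering_vector(theta_deg, freq_hz, num_mics, spacing_m):
    """Assumed far-field steering vector of a uniform linear array."""
    delay = spacing_m * math.sin(math.radians(theta_deg)) / SOUND_SPEED
    return [cmath.exp(-2j * math.pi * freq_hz * m * delay) for m in range(num_mics)]


def localize_bin(x, freq_hz, spacing_m, candidates_deg):
    """Return the candidate direction theta maximizing |A_theta(f)^H X(f, tau)|."""
    def score(theta):
        a = steering_vector(theta, freq_hz, len(x), spacing_m)
        return abs(sum(am.conjugate() * xm for am, xm in zip(a, x)))
    return max(candidates_deg, key=score)


# One time-frequency observation: a source at 30 degrees with arbitrary amplitude.
x = [(0.5 + 0.5j) * a for a in steering_vector(30, 1000.0, 4, 0.05)]
print(localize_bin(x, 1000.0, 0.05, range(-90, 91, 10)))  # 30
```

Running `localize_bin` independently for every bin of every frame gives the per-time-frequency direction map that the later steps operate on.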

スピーカ方向特定S203では、仮想音源方向θ毎に音源方向がθであると推定された周波数fの数もしくは、log|Aθ(f）*X(f,τ)|を積み上げて、ヒストグラムを作る。そして予め定めるスピーカー方向の範囲（例えば-30°〜30°までといったように定める）内でのヒストグラムのピークを算出し、その方向をスピーカー方向θspとする。 In speaker direction identification S203, a histogram over the virtual sound source directions θ is built by accumulating, for each θ, either the number of frequencies f whose estimated source direction is θ or the values of log|Aθ(f)*X(f,τ)|. The peak of the histogram within a predetermined range of speaker directions (for example, -30° to 30°) is then found, and that direction is taken as the speaker direction θsp.

音源方向がスピーカ方向かS204の判定では周波数fの推定音源方向θ'が|θ'-θsp|<βを満たす場合、その周波数fをスピーカ方向から到来した周波数成分とみなす。そして、音響エコー成分であるとみなし、エコーキャンセラ適応S205を行う。 In the determination of S204 whether the sound source direction is the speaker direction, if the estimated sound source direction θ ′ of the frequency f satisfies | θ′−θsp | <β, the frequency f is regarded as a frequency component arriving from the speaker direction. Then, it is regarded as an acoustic echo component, and echo canceller adaptation S205 is performed.

全周波数にて、S204及びS205を行った後、音響エコーキャンセラの適応処理を終了する。 After performing S204 and S205 at all frequencies, the adaptive processing of the acoustic echo canceller is terminated.
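
Steps S203 to S205 can be sketched as follows, using the counting variant of the histogram. The input is assumed to be a list holding one estimated direction per frequency bin for the current frame; the function names and the example values are hypothetical.

```python
from collections import Counter


def speaker_direction(bin_directions, lo_deg=-30, hi_deg=30):
    """S203: histogram peak of the per-bin directions inside the allowed range."""
    counts = Counter(t for t in bin_directions if lo_deg <= t <= hi_deg)
    if not counts:
        return None  # no bin points into the predetermined speaker range
    return max(counts, key=counts.get)


def bins_to_adapt(bin_directions, theta_sp, beta_deg):
    """S204: bins whose direction satisfies |theta' - theta_sp| < beta."""
    return [f for f, t in enumerate(bin_directions) if abs(t - theta_sp) < beta_deg]


directions = [0, 0, 10, 60, 0, -70, 10, 60, 60, 60]  # one direction per bin
theta_sp = speaker_direction(directions)
print(theta_sp, bins_to_adapt(directions, theta_sp, beta_deg=15))
```

In this example the direction 60° occurs most often overall but lies outside the predetermined range, so it is ignored when the speaker direction is chosen, and only the bins near the selected direction would be passed to the echo canceller adaptation S205.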

さて、次に、図14では、擬似エコーから算出できる音源方向に類似した情報を用いた音響エコーキャンセラ適応処理を示す。 Next, FIG. 14 shows acoustic echo canceller adaptation processing using information similar to the sound source direction that can be calculated from the pseudo echo.

周波数分解S301を行い、収録音声を周波数領域の信号に変換する。 Perform frequency decomposition S301 to convert the recorded audio into a frequency domain signal.

変換した周波数領域の信号に音響エコーフィルタを掛けて、擬似エコーを算出S302する。 An acoustic echo filter is applied to the converted frequency domain signal to calculate a pseudo echo S302.

各周波数f毎に算出した擬似エコーと入力信号との類似度を算出S303する。類似度算出処理では、擬似エコーE(f,τ)を用いる。E(f,τ)はM次元の複素ベクトル、各マイク素子のフレームτ,周波数ｆの擬似エコー成分を要素に持つ。E(f,τ)の0番目の要素をE0(f,τ)と記載する。E'(f,τ)=E(f,τ)/E0(f,τ)と定義する。さらに。E''(f,τ)=E'(f,τ)/|E'(f,τ)|とする。類似度を|E''(f,τ)*X(f,τ)|/|X(f,τ)|とする。 For each frequency f, the similarity between the calculated pseudo echo and the input signal is computed (S303). The similarity calculation uses the pseudo echo E(f,τ), an M-dimensional complex vector whose elements are the pseudo echo components of each microphone element at frame τ and frequency f. The 0th element of E(f,τ) is written E0(f,τ). Define E'(f,τ) = E(f,τ)/E0(f,τ) and, further, E''(f,τ) = E'(f,τ)/|E'(f,τ)|. The similarity is |E''(f,τ)*X(f,τ)| / |X(f,τ)|.

この類似度は音響エコー成分と入力信号の音源方向の類似度を見ており、入力信号中に音響エコー成分しか含まれていない場合は、類似度が1となる。類似度に周波数毎に異なる閾値α(f)を掛け合わせた値を最終的な類似度とする。α(f)=1/Σ|E''(f,τ)*Aθ(f)|とする。 This similarity compares the source direction of the acoustic echo component with that of the input signal; when the input signal contains only the acoustic echo component, the similarity is 1. The final similarity is the value obtained by multiplying this similarity by a threshold α(f) that differs for each frequency, where α(f) = 1/Σ|E''(f,τ)*Aθ(f)|.

類似度が予め定める閾値thを上回った場合(S304)エコーキャンセラの適応S305を行い、下回った場合は、エコーキャンセラの適応S305を行わない。 When the similarity exceeds a predetermined threshold th (S304), the echo canceller adaptation S305 is performed. When the similarity is lower than the predetermined threshold th, the echo canceller adaptation S305 is not performed.

S303〜S305を全周波数毎に行った後、音響エコーキャンセラの適応処理を終了する。 After performing S303 to S305 for every frequency, the adaptive processing of the acoustic echo canceller is terminated.
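
The similarity used in S303 and the per-frequency decision of S304 can be sketched as follows. The frequency-dependent factor α(f) is left out of this illustration, and the function names and the threshold value are hypothetical. The zero checks are an added safeguard not described in the text.

```python
import math


def norm(v):
    """Euclidean norm of a complex vector."""
    return math.sqrt(sum(abs(c) ** 2 for c in v))


def echo_similarity(pseudo_echo, mic):
    """|E''^H X| / |X| for one frequency bin of one frame.

    pseudo_echo -- E(f, tau): pseudo echo at each microphone element
    mic         -- X(f, tau): observed signal at each microphone element
    """
    e0 = pseudo_echo[0]
    mic_norm = norm(mic)
    if e0 == 0 or mic_norm == 0:
        return 0.0  # similarity is undefined; treat as "not echo"
    e_rel = [c / e0 for c in pseudo_echo]   # E' = E / E0
    e_norm = norm(e_rel)
    e_unit = [c / e_norm for c in e_rel]    # E'' = E' / |E'|
    inner = sum(c.conjugate() * x for c, x in zip(e_unit, mic))
    return abs(inner) / mic_norm


def should_adapt(pseudo_echo, mic, th):
    """S304: adapt this bin only when the similarity exceeds the threshold."""
    return echo_similarity(pseudo_echo, mic) > th


e = [1 + 1j, 2 - 1j, 0.5j]
echo_only = [(0.3 - 2j) * c for c in e]  # observation that is a scaled copy of the echo
print(echo_similarity(e, echo_only), should_adapt(e, echo_only, th=0.9))
```

Because both vectors are normalized, the measure depends only on the inter-microphone amplitude and phase pattern, not on the overall level: an observation that is a scaled copy of the pseudo echo gives a similarity of 1.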

図15では、擬似エコーから算出できる音源方向に類似した情報を用いた音響エコーキャンセラ適応処理を示す。本処理では、音響エコーキャンセラを適応するかどうかは、フレーム内の全周波数成分について同一である。 FIG. 15 shows acoustic echo canceller adaptation processing that uses information, computable from the pseudo echo, that is analogous to the sound source direction. In this processing, the decision whether to adapt the acoustic echo canceller is the same for all frequency components in a frame.

周波数分解S401を行い、収録音声を周波数領域の信号に変換する。変換した周波数領域の信号に音響エコーフィルタを掛けて、擬似エコーを算出S402する。 Frequency decomposition S401 converts the recorded audio into a frequency domain signal. An acoustic echo filter is applied to the converted frequency domain signal to calculate the pseudo echo (S402).

各周波数f毎に算出した擬似エコーと入力信号との類似度を算出S403する。 The similarity between the pseudo echo calculated for each frequency f and the input signal is calculated S403.

類似度算出処理では、擬似エコーE(f,τ)を用いる。E(f,τ)はM次元の複素ベクトル、各マイク素子のフレームτ,周波数ｆの擬似エコー成分を要素に持つ。E(f,τ)の0番目の要素をE0(f,τ)と記載する。E'(f,τ)=E(f,τ)/E0(f,τ)と定義する。さらに。E''(f,τ)=E'(f,τ)/|E'(f,τ)|とする。類似度を|E''(f,τ)*X(f,τ)|/|X(f,τ)|とする。 The similarity calculation uses the pseudo echo E(f,τ), an M-dimensional complex vector whose elements are the pseudo echo components of each microphone element at frame τ and frequency f. The 0th element of E(f,τ) is written E0(f,τ). Define E'(f,τ) = E(f,τ)/E0(f,τ) and, further, E''(f,τ) = E'(f,τ)/|E'(f,τ)|. The similarity is |E''(f,τ)*X(f,τ)| / |X(f,τ)|.

この類似度は音響エコー成分と入力信号の音源方向の類似度を見ており、入力信号中に音響エコー成分しか含まれていない場合は、類似度が1となる。類似度に周波数毎に異なる閾値α(f)を掛け合わせた値を最終的な類似度とする。α(f)=1/Σ|E''(f,τ)*Aθ(f)|とする。類似度の計算を全周波数毎に行う。そして全周波数の類似度を加算S404する。 This similarity compares the source direction of the acoustic echo component with that of the input signal; when the input signal contains only the acoustic echo component, the similarity is 1. The final similarity is the value obtained by multiplying this similarity by a threshold α(f) that differs for each frequency, where α(f) = 1/Σ|E''(f,τ)*Aθ(f)|. The similarity is calculated for every frequency, and the similarities of all frequencies are then summed (S404).

加算した類似度が予め定める閾値thを上回った場合(S405)全周波数成分に対して、エコーキャンセラの適応S406を行い、下回った場合は、エコーキャンセラの適応S406を行わずに、音響エコーキャンセラ適応処理を終了する。また適応を行う場合は、エコーキャンセラ適応S406後、音響エコーキャンセラ適応処理を終了する。 When the added similarity exceeds a predetermined threshold th (S405), the echo canceller adaptation S406 is performed for all frequency components, and when it falls below, the echo canceller adaptation S406 is not performed and the acoustic echo canceller adaptation is performed. The process ends. When adaptation is performed, the acoustic echo canceller adaptation processing is terminated after the echo canceller adaptation S406.
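
The difference between the per-frequency decision and the frame-level decision described here can be shown side by side. The similarity values and both thresholds below are made-up numbers for illustration, and the function names are hypothetical.

```python
def adapt_mask_per_bin(similarities, th):
    """Per-frequency decision: each bin is adapted on its own similarity."""
    return [s > th for s in similarities]


def adapt_mask_per_frame(similarities, th_sum):
    """Frame-level decision: sum over all bins, then adapt all bins or none."""
    decision = sum(similarities) > th_sum
    return [decision] * len(similarities)


sims = [0.9, 0.2, 0.95, 0.8]  # hypothetical per-bin similarities for one frame
print(adapt_mask_per_bin(sims, th=0.5))        # [True, False, True, True]
print(adapt_mask_per_frame(sims, th_sum=2.5))  # [True, True, True, True]
```

The frame-level rule is coarser: it also adapts the bin with low similarity whenever the frame as a whole looks like echo.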

図16では、音源方向情報を用いて、スピーカ方向から到来する時間-周波数成分だけを音響エコーとみなし、適応する処理の処理フローを示している。 FIG. 16 shows a processing flow of processing that uses sound source direction information and regards only time-frequency components coming from the speaker direction as acoustic echoes and adapts them.

本処理では音響エコーキャンセラを適応するかどうかは、フレーム内の全周波数成分について同一である。 In this processing, the decision whether to adapt the acoustic echo canceller is the same for all frequency components in a frame.

周波数分解S501では、収録音声を時間-周波数領域信号に変換する。音源定位Ｓ502では、修正遅延和アレイ法に基づいて、音源方向推定を行う。 In frequency decomposition S501, the recorded audio is converted into a time-frequency domain signal. In sound source localization S502, the sound source direction is estimated based on the modified delay-and-sum array method.

スピーカ方向特定S503では、仮想音源方向θ毎に音源方向がθであると推定された周波数fの数もしくは、log|Aθ(f）*X(f,τ)|を積み上げて、ヒストグラムを作る。そして予め定めるスピーカー方向の範囲（例えば-30°〜30°までといったように定める）内でのヒストグラムのピークを算出し、その方向をスピーカー方向θspとする。 In speaker direction identification S503, a histogram over the virtual sound source directions θ is built by accumulating, for each θ, either the number of frequencies f whose estimated source direction is θ or the values of log|Aθ(f)*X(f,τ)|. The peak of the histogram within a predetermined range of speaker directions (for example, -30° to 30°) is then found, and that direction is taken as the speaker direction θsp.

音源方向がスピーカ方向かS504の判定では周波数fの推定音源方向θ'が|θ'-θsp|<βを満たす場合、その周波数fをスピーカ方向から到来した周波数成分とみなす。そして、スピーカ方向とみなされた周波数fの数または、log|Aθ(f）*X(f,τ)|を積み上げて、周波数方向に加算したパワースペクトルを得る。全周波数で加算したパワースペクトルが予め定める以上であるかどうかS506の判定を行い、閾値以上であれば、全周波数成分についてエコーキャンセラ適応S507を行い、処理を終了する。 In determination S504 of whether the source direction is the speaker direction, a frequency f whose estimated source direction θ' satisfies |θ'-θsp| < β is regarded as a frequency component arriving from the speaker direction. Then either the number of frequencies f regarded as coming from the speaker direction, or the values of log|Aθ(f)*X(f,τ)| for those frequencies, are accumulated to obtain a power spectrum summed along the frequency axis. Determination S506 checks whether this summed power spectrum is equal to or greater than a predetermined threshold; if it is, echo canceller adaptation S507 is performed for all frequency components, and the processing ends.

図17は本音響エコーキャンセラの適応処理の効果を示した図である。自拠点の話者が発話中の音声（ダブルトーク音声）で適応したエコーキャンセルフィルタを用いて、音響エコーのみが存在する入力信号内の音響エコーを抑圧した結果である。 FIG. 17 shows the effect of the adaptation processing of this acoustic echo canceller. It is the result of suppressing the acoustic echo in an input signal containing only acoustic echo, using an echo cancellation filter that was adapted on audio recorded while the local speaker was talking (double-talk audio).

この場合、全ての入力信号が抑圧されて、無音になることが望まれる。本発明による適応制御を行った場合の結果を上段に示す。 In this case, it is desirable that all input signals be suppressed and silenced. The result of the adaptive control according to the present invention is shown in the upper part.

適応制御を行わない場合の結果を下段に示す。結果は時間-周波数毎のパワーが大きい程、明るく小さいほど暗くなるような図で示している。横軸が時間で、縦軸が周波数である。適応制御を行う場合のほうが、特に高い周波数で音響エコー抑圧後の信号パワーが小さく、音響エコー抑圧性能が高いことが分かる。 The result without adaptive control is shown in the lower part. The results are plotted so that each time-frequency point is brighter the larger its power and darker the smaller its power. The horizontal axis is time and the vertical axis is frequency. With adaptive control, the signal power after acoustic echo suppression is smaller, particularly at high frequencies, which shows that the acoustic echo suppression performance is higher.

A video conference system using the present invention may be configured so that, before connecting to another site, a signal such as white noise is played from the speaker at the local site to adapt the acoustic echo canceller in advance.

FIG. 18 shows the processing flow when the acoustic echo canceller is adapted in advance.

Before connecting to another site, acoustic echo canceller adaptation processing S601 is performed using the data of all frames during which the speaker at the local site is sounding.

This corresponds to performing the acoustic echo canceller adaptation unconditionally, without the double-talk detection of the present invention.

Next, other-site connection standby S602 is performed: the system waits until a connection request arrives from another site or a user at the local site issues a connection request to another site.

After connection to the other site, acoustic echo canceller adaptation processing S603, with adaptive control based on the double-talk detection processing of the present invention, is performed repeatedly. The process ends after the video conference system is disconnected.
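The three stages of FIG. 18 can be sketched as a simple control loop. Everything named here is hypothetical: `adapt`, `wait_for_connection`, and `speaker_dominant` are placeholder callables standing in for the filter update, the standby in S602, and the adaptation control of the invention, and the per-frame gating is a simplification of the per-frequency control described elsewhere in the text.

```python
def run_session(pre_frames, wait_for_connection, call_frames, adapt,
                speaker_dominant):
    """Run the FIG. 18 flow and return how many frames were used for adaptation."""
    adapted = 0
    # S601: before connecting, adapt on every frame while the local
    # speaker plays the training signal (no double-talk detection).
    for frame in pre_frames:
        adapt(frame)
        adapted += 1
    # S602: block until a connection request arrives or is issued.
    wait_for_connection()
    # S603: during the call, adapt only when the control logic judges
    # that the speaker output dominates the frame.
    for frame in call_frames:
        if speaker_dominant(frame):
            adapt(frame)
            adapted += 1
    return adapted
```

Passing the stages in as callables keeps the sketch independent of any particular filter or network implementation.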

FIG. 19 shows the flow of processing in which the nonlinear processing by VoiceSwitch is controlled using the similarity between the input signal and the pseudo echo.

Frequency decomposition S701 is performed to convert the recorded sound into a frequency-domain signal.

An acoustic echo filter is applied to the converted frequency-domain signal to calculate the pseudo echo (S702).

The similarity between the pseudo echo and the input signal is calculated for each frequency f (S703).

The similarity calculation uses the pseudo echo E(f,τ). E(f,τ) is an M-dimensional complex vector whose elements are the pseudo echo components of the individual microphone elements at frame τ and frequency f. The 0th element of E(f,τ) is written E0(f,τ). Define E'(f,τ)=E(f,τ)/E0(f,τ), and further E''(f,τ)=E'(f,τ)/|E'(f,τ)|. The similarity is |E''(f,τ)*X(f,τ)|/|X(f,τ)|. This similarity compares the sound source directions of the acoustic echo component and the input signal; when the input signal contains only the acoustic echo component, the similarity is 1. The value obtained by multiplying the similarity by a frequency-dependent threshold α(f) is used as the final similarity, where α(f)=1/Σ|E''(f,τ)*Aθ(f)|.
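The similarity defined above can be sketched for a single frequency bin as follows. This is an illustration under stated assumptions: the `*` in E''(f,τ)*X(f,τ) is read as the conjugate-transpose (Hermitian) inner product, which is the reading under which an echo-only input gives a similarity of 1; the sum in α(f) is taken over a caller-supplied list of steering vectors Aθ(f); and all function names are hypothetical.

```python
import math

def inner(a, b):
    """Hermitian inner product a^H b of two complex vectors."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def normalized_echo(E):
    """Return E'' = E' / |E'| where E' = E / E[0] (E[0] must be nonzero)."""
    e_prime = [e / E[0] for e in E]
    norm = math.sqrt(sum(abs(e) ** 2 for e in e_prime))
    return [e / norm for e in e_prime]

def similarity(E, X):
    """|E''^H X| / |X| for pseudo echo E and input X at one bin (X nonzero)."""
    x_norm = math.sqrt(sum(abs(x) ** 2 for x in X))
    return abs(inner(normalized_echo(E), X)) / x_norm

def alpha(E, steering_vectors):
    """alpha(f) = 1 / sum over theta of |E''^H A_theta(f)| (assumed sum range)."""
    e2 = normalized_echo(E)
    return 1.0 / sum(abs(inner(e2, a)) for a in steering_vectors)
```

Because E'' has unit norm, the similarity lies between 0 and 1: it equals 1 when X is a complex multiple of E and 0 when X is orthogonal to E.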

When the similarity exceeds a predetermined threshold th (S704) and the power of the input signal is equal to or greater than a threshold, the corresponding frequency component of the transmitted speech is set to 0. Otherwise, the signal after echo cancellation is used as the corresponding frequency component of the transmitted speech, and the process ends.
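The per-frequency VoiceSwitch decision can be sketched as below. The function and parameter names are hypothetical, and the similarities, input powers, and echo-cancelled spectrum are assumed to have been computed beforehand for the current frame.

```python
def voice_switch_per_bin(similarities, input_powers, cancelled, th, power_th):
    """Apply the per-bin VoiceSwitch rule to one frame.

    similarities[f] : final similarity for bin f
    input_powers[f] : input signal power for bin f
    cancelled[f]    : complex spectrum after echo cancellation for bin f
    """
    out = []
    for s, p, y in zip(similarities, input_powers, cancelled):
        if s > th and p >= power_th:
            out.append(0j)   # bin looks like echo and is loud: mute it
        else:
            out.append(y)    # otherwise pass the echo-cancelled value
    return out
```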

FIG. 20 shows the flow of processing in which the nonlinear processing by VoiceSwitch is controlled using the similarity between the input signal and the pseudo echo. In this flow, the decision whether to apply VoiceSwitch is common to all frequencies.

Frequency decomposition S801 is performed to convert the recorded sound into a frequency-domain signal. An acoustic echo filter is applied to the converted frequency-domain signal to calculate the pseudo echo (S802). The similarity between the pseudo echo and the input signal is calculated for each frequency f (S803).

The similarity calculation uses the pseudo echo E(f,τ). E(f,τ) is an M-dimensional complex vector whose elements are the pseudo echo components of the individual microphone elements at frame τ and frequency f. The 0th element of E(f,τ) is written E0(f,τ). Define E'(f,τ)=E(f,τ)/E0(f,τ), and further E''(f,τ)=E'(f,τ)/|E'(f,τ)|. The similarity is |E''(f,τ)*X(f,τ)|/|X(f,τ)|. This similarity compares the sound source directions of the acoustic echo component and the input signal; when the input signal contains only the acoustic echo component, the similarity is 1. The value obtained by multiplying the similarity by a frequency-dependent threshold α(f) is used as the final similarity, where α(f)=1/Σ|E''(f,τ)*Aθ(f)|. The similarities are then summed over all frequencies.

When the summed similarity exceeds a predetermined threshold th (S805) and the power of the input signal is equal to or greater than a threshold, the transmitted speech signal is set to 0. Otherwise, the signal after echo cancellation is used as the transmitted speech, and the process ends.
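The full-band variant differs from the per-frequency flow only in that one decision is made for the whole frame. A minimal sketch, with hypothetical names and with the frame-level input power assumed to be supplied by the caller:

```python
def voice_switch_full_band(similarities, input_power, cancelled, th, power_th):
    """Apply the full-band VoiceSwitch rule to one frame.

    similarities : final similarity for each frequency bin
    input_power  : power of the input signal for the frame
    cancelled    : complex spectrum after echo cancellation
    """
    total = sum(similarities)                 # sum similarity over all bins
    if total > th and input_power >= power_th:
        return [0j] * len(cancelled)          # mute the whole transmitted frame
    return list(cancelled)                    # pass the echo-cancelled frame
```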

FIG. 21 shows the flow of processing in which the nonlinear suppression coefficient applied to the residual echo after acoustic echo cancellation is controlled using the similarity between the input signal and the pseudo echo.

Frequency decomposition S901 is performed to convert the recorded sound into a frequency-domain signal.

An acoustic echo filter is applied to the converted frequency-domain signal to calculate the pseudo echo (S902).

The similarity between the pseudo echo and the input signal is calculated for each frequency f (S903).

When the similarity exceeds a predetermined threshold th (S904) and the power of the input signal is equal to or greater than a threshold, the nonlinear suppression coefficient α is set to a predetermined value α0. Otherwise, α is set to α1. The two values are fixed in advance so that α0>α1.

Let n'(f,τ) be the signal after acoustic echo cancellation S907, and let e(f,τ) be the pseudo echo component. Nonlinear suppression processing S908 outputs n''(f,τ), where n''(f,τ)=Floor(|n'(f,τ)|-|e(f,τ)|)arg(n'(f,τ)). Here Floor(x) is a function that returns x when x is 0 or more and returns 0 when x is 0 or less, and arg(x) is a function that returns the phase component of x. After the nonlinear suppression processing has been performed at all frequencies, the process ends.
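The coefficient selection and the suppression formula can be sketched together for one frequency bin. Two points are assumptions rather than statements from the text: arg(x) is treated as the unit-magnitude phase factor of x, and the coefficient α is applied as a scale on |e(f,τ)|, since the text does not show where α enters the subtraction. With alpha=1.0 the function reproduces the stated formula exactly. All names are hypothetical.

```python
import cmath

def floor0(x):
    """Floor(x) as defined in the text: x when x > 0, otherwise 0."""
    return x if x > 0.0 else 0.0

def select_alpha(sim, power, th, power_th, alpha0, alpha1):
    """Pick alpha0 when the bin looks like loud echo, else alpha1 (alpha0 > alpha1)."""
    return alpha0 if (sim > th and power >= power_th) else alpha1

def nonlinear_suppress(n_prime, e, alpha=1.0):
    """Return n'' = Floor(|n'| - alpha*|e|) with the phase of n'.

    The placement of alpha is an assumption; alpha=1.0 matches the text's formula.
    """
    magnitude = floor0(abs(n_prime) - alpha * abs(e))
    return cmath.rect(magnitude, cmath.phase(n_prime))
```

A larger α removes more of the residual, which matches the requirement α0>α1 for bins judged to be dominated by echo.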

As described above, the present invention makes it possible to control acoustic echo in telephone conference systems, video conference systems, and the like, and it is applicable to acoustic echo cancellation in the double-talk state.

FIG. 1 is a diagram showing the hardware configuration of the present invention.
FIG. 2 is a software block diagram of the present invention.
FIG. 3 is a block diagram of the specification of the number and positions of speakers through the GUI of the present invention.
FIG. 4 is a block diagram of the VoiceSwitch switching processing based on sound source localization according to the present invention.
FIG. 5 is an overall view of an apparatus in which the present invention is applied to a video conference system.
FIG. 6 is a diagram explaining a prior-art acoustic echo canceller.
FIG. 7 is a diagram explaining an acoustic echo canceller using double-talk detection processing.
FIG. 8 is a diagram showing the configuration of a two-site video conference.
FIG. 9 is a block diagram of a video conference system to which the present invention is applied.
FIG. 10 is a processing flow diagram of the video conference system to which the present invention is applied.
FIG. 11 is a diagram explaining the acoustic echo component identification method of the present invention.
FIG. 12 is a processing flow diagram of the echo canceller, double-talk determination, and echo canceller adaptation processing of the present invention.
FIG. 13 is a flow diagram of per-frequency echo canceller adaptive control using the sound source direction.
FIG. 14 is a flow diagram of per-frequency echo canceller adaptive control using information similar to the sound source direction.
FIG. 15 is a flow diagram of echo canceller adaptive control of all frequency components using information similar to the sound source direction.
FIG. 16 is a flow diagram of per-frequency echo canceller adaptive control using the sound source direction.
FIG. 17 is a diagram showing the effect of the present invention.
FIG. 18 is a processing flow diagram for the case where adaptation processing is performed before the start of a video conference.
FIG. 19 is a flow diagram of per-frequency VoiceSwitch control using information similar to the sound source direction.
FIG. 20 is a flow diagram of all-frequency VoiceSwitch control using information similar to the sound source direction.
FIG. 21 is a flow diagram of suppression coefficient control for per-frequency nonlinear acoustic echo suppression processing using information similar to the sound source direction.

Explanation of symbols

1 ... Microphone array, 2 ... A/D converter and D/A converter, 3 ... Central processing unit, 4 ... Memory, 5 ... Hub, 6 ... Speaker, 7 ... External storage medium, 8 ... A/D conversion unit, 9 ... Band division unit, 10 ... Phase difference calculation unit, 11 ... Frequency distribution unit, 12 ... Acoustic echo canceller adaptation unit, 13 ... Pseudo echo generation unit, 14 ... Acoustic echo cancellation unit, 15 ... Speaker number/position setting GUI, 16 ... Echo phase difference calculation processing unit, 17 ... Sound source localization unit, 18 ... VoiceSwitch determination unit, 19 ... Output signal generation unit,
100 ... Video conference system, 101 ... Computer, 102 ... A/D and D/A device, 103 ... Hub, 104 ... Image display device, 105 ... Microphone array, 106 ... Speaker, 107 ... Camera, 108 ... Audio cable, 109 ... Audio cable, 110 ... Digital cable, 111 ... Monitor cable, 112 ... LAN cable, 113 ... Digital cable, 201 ... Audio transmission unit, 202 ... Image transmission unit, 203 ... Audio recording unit, 204 ... Acoustic echo canceller adaptation unit, 205 ... Acoustic echo canceller unit, 206 ... Image display unit, 207 ... Audio reception unit, 208 ... Audio playback unit, 209 ... Image reception unit, 210 ... Image capture unit, 211 ... Acoustic echo canceller filter DB, 701 ... Phase difference calculation unit, 702 ... Double-talk detection unit,
S1 ... Acoustic echo canceller adaptation processing, S2 ... Determination of whether a connection request has come from another site, S3 ... Determination of whether a connection request has been issued from the local site, S4 ... Connection to another site, S6 ... Playback from the speaker, S7 ... Image display, S8 ... Audio recording, S9 ... Echo cancellation, S10 ... Audio transmission, S11 ... Determination of whether the connection has been lost, S12 ... Determination of whether there is a disconnection request from the local site, S13 ... Disconnection from the other site,
S101 ... Frequency decomposition, S102 ... Echo canceller execution, S103 ... Determination of whether the state is double-talk, S104 ... Echo canceller adaptation,
S201 ... Frequency decomposition, S202 ... Sound source localization, S203 ... Speaker direction identification, S204 ... Determination of whether the sound source direction is the speaker direction, S205 ... Echo canceller adaptation,
S301 ... Frequency decomposition, S302 ... Pseudo echo calculation, S303 ... Calculation of similarity with the input signal, S304 ... Determination of whether the similarity is equal to or greater than the threshold, S305 ... Echo canceller adaptation,
S401 ... Frequency decomposition, S402 ... Pseudo echo calculation, S403 ... Calculation of similarity with the input signal, S404 ... Similarity summation processing, S405 ... Determination of whether the summed similarity is equal to or greater than the threshold, S406 ... All-frequency echo canceller adaptation,
S501 ... Frequency decomposition, S502 ... Sound source localization, S503 ... Speaker direction identification, S504 ... Determination of whether the sound source direction is the speaker direction, S505 ... Power spectrum summation processing, S506 ... Determination of whether the speaker-direction spectrum is equal to or greater than the threshold, S507 ... Echo canceller adaptation,
S601 ... All-frame adaptation processing, S602 ... Other-site connection standby, S603 ... Acoustic echo canceller adaptation processing with adaptive control,
S701 ... Frequency decomposition, S702 ... Pseudo echo calculation, S703 ... Calculation of similarity with the input signal, S704 ... Determination of whether the similarity is equal to or greater than the threshold, S705 ... Determination of whether the power is equal to or greater than the threshold, S706 ... VoiceSwitch determination,
S801 ... Frequency decomposition, S802 ... Pseudo echo calculation, S803 ... Calculation of similarity with the input signal, S804 ... Similarity summation processing, S805 ... Determination of whether the summed similarity is equal to or greater than the threshold, S806 ... Determination of whether the input power is equal to or greater than the threshold, S807 ... Full-band VoiceSwitch determination,
S901 ... Frequency decomposition, S902 ... Pseudo echo calculation, S903 ... Calculation of similarity with the input signal, S904 ... Determination of whether the similarity is equal to or greater than the threshold, S905 ... Determination of whether the power is equal to or greater than the threshold, S906 ... Nonlinear suppression coefficient adjustment, S907 ... Acoustic echo canceller, S908 ... Nonlinear suppression.

Claims

A microphone for voice input,
An AD converter for digitally converting the signal from the microphone;
An information processing apparatus that processes a digital signal from the AD converter and suppresses an acoustic echo component;
An output interface for sending a signal from the information processing apparatus to a network;
An input interface for receiving signals from the network;
A DA converter for analog conversion of a signal from the input interface;
A speaker for outputting a signal from the DA converter as sound;
The microphone is a microphone array having a plurality of microphone elements,
The AD converter is a plurality of AD converters for digitally converting a signal for each microphone element,
The information processing apparatus includes: a band division unit that divides the digital signal into bands; a phase difference calculation unit that calculates a phase difference between the sounds input to the plurality of microphone elements based on signals from the plurality of AD converters; and a frequency distribution unit that determines, from the phase difference output by the phase difference calculation unit, whether the sound input to the microphone array is sound from the speaker,
The band division unit divides into bands each digital signal digitally converted for each microphone element,
The phase difference calculation unit calculates, for each of the divided bands, a phase difference between the sounds input to the plurality of microphone elements,
The frequency distribution unit determines, from the phase difference for each band output by the phase difference calculation unit, whether the band division signal is a speaker output signal or a talker signal,
The acoustic echo canceller adaptation unit adapts the adaptive filter, which is used to suppress the speaker sound mixed into the signals of the microphone elements, only for bands in which the speaker output signal is determined to be dominant,
The acoustic echo canceller unit removes an acoustic echo component from the signal of each microphone element using the adaptive filter.

The acoustic echo canceller system according to claim 1, wherein, in order to determine whether a signal is a speaker output signal, a transfer function of sound traveling from the speaker to the microphone array is measured in advance, the phase difference for each band of the microphone array when sound is output from the speaker is calculated from the measured transfer function and stored in an external storage device, and the frequency distribution unit determines that the band division signal is a speaker output signal when the stored phase difference and the phase difference between microphone elements for each band of the band division signal differ by no more than a predetermined threshold value.

The acoustic echo canceller system according to claim 1, further comprising: a user interface through which a user specifies in advance the number of speakers and their physical positions relative to the microphone array; an echo phase difference calculation processing unit that calculates, from the number and physical positions of the speakers specified through the user interface, the phase difference for each band of the microphone array when sound is output from the speakers, and stores the phase difference for each band in an external storage device; and a frequency distribution unit that determines that the band division signal is a speaker output signal when the stored phase difference and the phase difference between microphone elements for each band of the band division signal differ by no more than a predetermined threshold value.

The acoustic echo canceller system according to claim 1, further comprising a sound source localization unit that uses the phase differences of the microphone array to calculate a histogram of sound source directions over the bands and estimates the sound source direction from the histogram, wherein the magnitude of the band division signals estimated to have arrived from the sound source direction calculated by the sound source localization unit is calculated and compared with the magnitude of the band division signals determined to be speaker output signals, or with the magnitude after acoustic echo cancellation of the band division signals determined to be speaker output signals, and when the calculated magnitude is equal to or smaller than a predetermined magnitude relative to the compared magnitude, the magnitudes of all band division signals are set to zero.

Furthermore,
Means for estimating the sound source direction;
Means for identifying the speaker direction;
Means for comparing the sound source direction and the speaker direction;
The acoustic echo canceller system according to claim 1, wherein, when the sound source direction is considered to correspond to the speaker direction, the sound source is regarded as an acoustic echo component and the echo canceller is adapted.

Furthermore,
Means for applying an acoustic echo filter to the converted frequency domain signal to calculate a pseudo echo;
Means for calculating the similarity between the pseudo echo calculated for each frequency and the input signal,
The acoustic echo canceller system according to claim 1, wherein the echo canceller is adapted when the similarity is larger than a predetermined threshold.

Furthermore,
Means for applying an acoustic echo filter to the converted frequency domain signal to calculate a pseudo echo;
Means for calculating the similarity between the pseudo echo calculated for each frequency and the input signal;
Means for adding the similarity of all frequencies,
The acoustic echo canceller system according to claim 1, wherein the echo canceller is adapted when the summed similarity is larger than a predetermined threshold value.

Furthermore,
Means for estimating the sound source direction;
Means for identifying the speaker direction;
Means for comparing the sound source direction and the speaker direction;
If the sound source direction is considered to correspond to the speaker direction, means for determining that the sound source is an acoustic echo component;
Means for adding a sound source regarded as a speaker direction in the frequency direction to obtain a power spectrum;
The acoustic echo canceller system according to claim 1, wherein the echo canceller is adapted when the power spectrum summed over all frequencies is larger than a predetermined threshold.