JP4448424B2

JP4448424B2 - Voice switch method, apparatus for implementing the method, program, and recording medium therefor

Info

Publication number: JP4448424B2
Application number: JP2004309640A
Authority: JP
Inventors: 暁江村; 末廣島内
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2004-10-25
Filing date: 2004-10-25
Publication date: 2010-04-07
Anticipated expiration: 2024-10-25
Also published as: JP2006121590A

Description

The present invention relates to a multi-channel echo suppression method, an apparatus therefor, a program therefor, and a recording medium therefor, which are applied to, for example, a teleconferencing system having a multi-channel sound reproduction system and which suppress the acoustic echoes that cause howling and impair listening.

With the recent increase in the capacity of digital networks, multi-channel hands-free (loudspeaker) teleconferencing systems, in which several people can easily participate and which provide a more natural conversation environment, have been studied. In such a system, the received speech is reproduced from the loudspeakers and picked up by the microphones, producing acoustic echo; if this is transmitted as is, it causes problems such as impaired conversation and discomfort. Furthermore, when the loop gain of the closed loop formed together with the far-end hands-free communication system is larger than 1, howling occurs and conversation becomes impossible. To solve these problems, a voice switch supporting multi-channel hands-free calls is proposed in Japanese Patent Application Laid-Open No. 2004-147069 (Patent Document 1).

A teleconferencing system consisting of an M (≧2) channel reproduction system and a 2-channel sound collection system suppresses acoustic echo with the configuration shown in FIG. 1. The received signal from each receiving terminal 1_m (m = 1, …, M) is sent as a reproduction signal to the corresponding loudspeaker 2_m (m = 1, …, M), reproduced as an acoustic signal, and reaches each microphone 3_n (n = 1, …, N) through M acoustic echo paths. The voice switch consists of a transmission determination unit 5, transmitted speech power estimation units 6_1 and 6_2, a variable loss unit 7 for attenuating the received signal, and a variable loss unit 8 for attenuating the signal collected by the microphone. Each of the transmitted speech power estimation units 6_1 and 6_2 estimates, from the M-channel loudspeaker reproduction signals and one channel of collected signal, the signal power of the transmitted speech contained in the collected signal. The transmission determination unit 5 detects the presence or absence of transmitted speech from the transmitted speech signal powers of the two channels. When transmitted speech is judged to be present, only the received signal is attenuated by the receive-side variable loss unit 7 before being sent to the loudspeakers as the reproduction signal.
When transmitted speech is judged to be absent, only the transmitted signal sent from the transmitting terminal 4 is attenuated by the variable loss unit 8. This reduces the echo and lowers the loop gain of the closed loop formed together with the far end, thereby preventing howling. When the sound collection system has N channels, N transmitted speech power estimation units 6 are arranged in parallel.

In each transmitted speech power estimation unit 6, the TF conversion units 61_m (m = 1, …, M) divide the time-domain reproduction signals x_1(k), …, x_M(k) (where k is a variable denoting time) into frames of length 2L samples, one frame every L samples, and convert them into the frequency domain to obtain the spectra X_1(j, f), …, X_M(j, f) (where j is a variable denoting frame time). The TF conversion unit 62 converts the time-domain collected signal y(k) into the frequency domain to obtain the spectrum Y(j, f). FIG. 9 shows the relationship between the sample time k and the frame time j when a frame is taken every L samples. The echo component ratio estimation unit 63 obtains, for each frequency component, the ratio of the echo component in the collected signal, and the signal power calculation unit obtains the signal power of the non-echo component contained in the collected signal.

FIG. 2 shows the configuration of the echo component ratio estimation unit 63. The correlation removal unit 631 obtains, from the spectra X_1(j, f), …, X_M(j, f) of the multi-channel reproduction signals, the mutually uncorrelated multi-channel spectra X_1(j, f), X_{2(1)}(j, f), …, X_{M(M−1)}(j, f). The correlation removal unit 632 obtains the spectra Y_{(m−1)}(j, f) (m = 2, …, M) by removing from the collected-signal spectrum Y(j, f) the components correlated with the first to (m−1)-th channel reproduction signals. In the coherence calculation unit 633, the coherence calculation unit 633_1 obtains the coherence γ_{1y}^2(j, f) between the first-channel reproduction signal X_1(j, f) and the collected signal Y(j, f), and each coherence calculation unit 633_m obtains the coherence γ_{my(m−1)}^2(j, f) between the m-th channel reproduction signal X_{m(m−1)}(j, f) and Y_{(m−1)}(j, f) (m = 2, …, M). The echo component ratio calculation unit 634 obtains the echo component ratio γ^2(j, f) from these coherences.

FIG. 3 shows the flow of echo component ratio estimation.
Patent Document 1: JP 2004-147069 A

In the conventional method described above, the collected signal is divided into frames of a fixed length, converted into the frequency domain by FFT, and transmitted after transmission detection and variable loss processing. Because the transmitted speech signal is buffered for one frame length and processed before being sent, there is an algorithmic delay (processing delay) determined by the frame length, regardless of the processing capability of the hardware. When this delay is large, conversation over the system becomes very difficult, so the frame length must be shortened to limit the processing delay.
However, an echo component that is delayed by more than the frame length between reproduction from the loudspeaker and pickup by the microphone is then treated as a non-echo component. Consequently, when the frame length is set much shorter than the reverberation time (about 300 ms in an ordinary room), the echo component ratio is underestimated or its estimate fluctuates, so the estimation performance of the echo component ratio degrades, and the transmission detection performance degrades with it.

The present invention proposes a method of estimating the ratio of the echo component contained in the short-time spectrum Y(j, f) of the collected signal using not only the short-time spectra X_1(j, f), …, X_M(j, f) obtained from the current multi-channel reproduction signal frame but also the short-time spectra obtained from past reproduction signal frames.
The present invention further proposes a method in which the current and past frames of the multi-channel reproduction signals are divided into a main component, consisting of the first-channel reproduction signal of the current frame, and sub-components, consisting of the other frames with their correlation with the main component removed. The ratio of the main-component echo in the collected signal is obtained, the ratio of the sub-component echo in the collected signal from which the correlation with the main component has been removed is obtained, and the echo component ratio of the multi-channel reproduction signals in the collected signal is estimated from these two ratios.

With this method, past signal frames can be incorporated into the estimation of the echo component ratio. Even when the frame length is set much shorter than the reverberation time, degradation of the echo component ratio estimate is avoided, and degradation of the echo suppression performance can be prevented.

Embodiments of the present invention are described below with reference to the drawings. Corresponding parts in the drawings are given the same reference numerals, and duplicate description is omitted.
[First Embodiment]
The invention is described for the case of an M (≧2) channel reproduction system and an N (≧1) channel sound collection system. The N channels of the sound collection system are handled by arranging N transmitted speech power estimation units, each with M inputs and one output, in parallel. In this invention, the echo component ratio estimation unit 63 in the transmitted speech power estimation unit 6 of FIG. 1, whose internal configuration is shown in FIG. 2, is replaced with the echo component ratio estimation unit 66, whose internal configuration is shown in FIG. 4.

In the following, the frame length is 2L samples, the shift length is L samples, and the frame time is j. The signal frame at frame time j consists of the signal samples at sample times k = jL−2L+1 to jL. The relationship between the sample time k and the frame time j is as shown in FIG. 9. As the spectra obtained from past reproduction signal frames, the example below uses the short-time spectra X_1(j−1, f), …, X_M(j−1, f) of one frame earlier.
In each TF conversion unit 61_m (m = 1, …, M) of FIG. 1, the time-domain reproduction signal x_m(k) of the channel is framed every L samples into a signal vector of length 2L and converted into the short-time spectrum X_m(j, f) using an FFT.
In this processing, each signal may be windowed with a Hanning window or the like before the frequency conversion.
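The framing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a plain DFT stands in for the FFT so that the block needs nothing beyond the standard library, and the periodic Hann window is one possible reading of "a Hanning window or the like".

```python
import cmath
import math

def frames(x, L):
    """Yield (j, frame): frame j holds samples k = jL-2L+1 .. jL (1-based k)."""
    j = 2  # first frame time for which a full 2L-sample frame exists
    while j * L <= len(x):
        # 1-based sample k maps to 0-based index k-1
        yield j, x[j * L - 2 * L : j * L]
        j += 1

def hann(n):
    """Periodic Hann window of length n (assumed window choice)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def dft(frame):
    """Direct DFT; an FFT would give the same values faster."""
    n = len(frame)
    return [
        sum(frame[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]

def short_time_spectra(x, L, window=True):
    """Map frame time j to the short-time spectrum of frame j."""
    w = hann(2 * L) if window else [1.0] * (2 * L)
    return {j: dft([s * wi for s, wi in zip(fr, w)]) for j, fr in frames(x, L)}
```

With L = 2 and eight samples, the frames are j = 2, 3, 4, each overlapping its predecessor by L samples, which matches the shift length given in the text.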

Likewise, the TF conversion unit 62 converts the collected signal y(k) into the frequency domain to obtain its short-time spectrum Y(j, f).

In this processing as well, the signal may be windowed before the frequency conversion.
The echo component ratio estimation unit 66, whose internal configuration is shown in FIG. 4, obtains, for each frequency component, the ratio of the echo component contained in the collected signal from the frequency-domain multi-channel reproduction signals X_m(j, f) and the frequency-domain collected signal Y(j, f) through steps F1 to F7 below. FIG. 5 shows the flow for estimating the echo component ratio.

Step F1
The short-time spectra X_1(j, f), …, X_M(j, f) of the multi-channel reproduction signals obtained from the current frame are stored in the storage unit 661a1 in the correlation removal unit 661 of FIG. 4.
Step F2
The correlation removal unit 661b1 removes the component correlated with X_1(j, f) from each of the short-time spectra X_2(j, f), …, X_M(j, f) of the multi-channel reproduction signals to obtain the spectra X_{2(1)}(j, f), …, X_{M(1)}(j, f) (m = 2, …, M), which form part of the sub-components of the multi-channel reproduction signal spectrum.
In this calculation, ε[ ] denotes taking an average. One example of the averaging process uses the result of the previous frame and a smoothing constant β that takes a value between 0 and 1.
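The equations for step F2 are not reproduced in this text. As a hedged sketch only, the class below removes the correlated component with the usual projection X_m − (ε[X_1* X_m] / ε[|X_1|^2]) X_1 and implements ε[ ] as first-order recursive smoothing with the constant β. Both the projection form and the exact recursion are assumptions consistent with the description, and the class name, `eps` guard, and default β are hypothetical. One instance would be kept per channel and per frequency bin.

```python
class Decorrelator:
    """Removes from `target` the part correlated with `ref` (one frequency bin)."""

    def __init__(self, beta=0.9, eps=1e-12):
        self.beta = beta      # smoothing constant, 0 <= beta <= 1
        self.eps = eps        # guards against division by zero (assumption)
        self.cross = 0j       # running average of conj(ref) * target
        self.power = 0.0      # running average of |ref|^2

    def update(self, ref, target):
        b = self.beta
        # Recursive average: previous result weighted by beta, new term by 1 - beta.
        self.cross = b * self.cross + (1 - b) * (ref.conjugate() * target)
        self.power = b * self.power + (1 - b) * abs(ref) ** 2
        # Subtract the projection of target onto ref.
        return target - (self.cross / (self.power + self.eps)) * ref
```

The same object can serve steps F3 and F4 by passing a past-frame spectrum or the collected-signal spectrum Y(j, f) as `target`, with X_1(j, f) as `ref`.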

Step F3
The correlation removal unit 661b2 takes the spectra X_1(j−1, f), …, X_M(j−1, f) of the multi-channel reproduction signals of one frame earlier, stored in the storage unit 661a2, and removes their correlation with X_1(j, f) to obtain the spectra X_{1(1)}(j−1, f), …, X_{M(1)}(j−1, f) (m = 1, …, M), which also form part of the sub-components of the multi-channel reproduction signal spectrum.
When the short-time spectra X_1(j−n, f), …, X_M(j−n, f) of n frames earlier are also used for the echo component ratio estimation, the results of the same calculation are likewise included in the sub-components of the multi-channel reproduction signal spectrum.

Step F4
The correlation removal unit 662 obtains the spectrum Y_{(1)}(j, f) by removing the component correlated with X_1(j, f) from the short-time spectrum Y(j, f) of the collected signal of the current frame.

Step F5
The coherence calculation unit 663_1 obtains the coherence between the short-time spectrum X_1(j, f) of the first-channel reproduction signal of the current frame, which is the main component of the multi-channel reproduction signal spectrum, and the spectrum Y(j, f) of the current collected signal.
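The coherence equation itself is missing from this text. A minimal sketch, assuming the standard magnitude-squared coherence |ε[X_1* Y]|^2 / (ε[|X_1|^2] ε[|Y|^2]) with ε[ ] realized by recursive smoothing, is shown below. The class name, the `eps` guard, and the default β are hypothetical, and one instance is assumed per frequency bin.

```python
class Coherence:
    """Running magnitude-squared coherence between two spectra in one bin."""

    def __init__(self, beta=0.9, eps=1e-12):
        self.beta = beta
        self.eps = eps     # avoids 0/0 before any signal has arrived (assumption)
        self.sxy = 0j      # averaged cross-spectrum conj(x) * y
        self.sxx = 0.0     # averaged power of x
        self.syy = 0.0     # averaged power of y

    def update(self, x, y):
        b = self.beta
        self.sxy = b * self.sxy + (1 - b) * (x.conjugate() * y)
        self.sxx = b * self.sxx + (1 - b) * abs(x) ** 2
        self.syy = b * self.syy + (1 - b) * abs(y) ** 2
        return abs(self.sxy) ** 2 / (self.sxx * self.syy + self.eps)
```

Note that after a single frame the value is always close to 1, because one observation is trivially coherent with itself; the estimate only becomes meaningful once several frames have been averaged.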

Step F6
The sub-component echo ratio calculation unit 663_2 first obtains the echo component Ŷ_{(1)}(j, f) contained in the decorrelated collected-signal spectrum Y_{(1)}(j, f). Among the linear sums of the sub-components X_{2(1)}(j, f), …, X_{M(1)}(j, f), X_{1(1)}(j−1, f), …, X_{M(1)}(j−1, f) of the multi-channel reproduction signal short-time spectrum, the echo component Ŷ_{(1)}(j, f) is the one that minimizes the error from the decorrelated collected-signal spectrum,
|Y_{(1)}(j, f) − Ŷ_{(1)}(j, f)|^2.
The echo ratio of the sub-components is then obtained for each frequency component.
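The equations of step F6 are not reproduced in this text. As a hedged reconstruction only, the minimization described above is an ordinary least-squares problem. The symbols U_i, a, R, and p below are notation introduced here for illustration, and the final expression for the sub-component echo ratio (estimated echo power divided by the power of the decorrelated collected signal) is an assumed form, not a quotation of the patent.

```latex
\begin{aligned}
\mathbf{u}(j,f) &= \big[X_{2(1)}(j,f),\dots,X_{M(1)}(j,f),\,X_{1(1)}(j-1,f),\dots,X_{M(1)}(j-1,f)\big]^{T},\\
\hat{Y}_{(1)}(j,f) &= \mathbf{a}(f)^{T}\,\mathbf{u}(j,f),\\
\mathbf{a}(f) &= \arg\min_{\mathbf{a}}\ \varepsilon\!\left[\big|Y_{(1)}(j,f)-\mathbf{a}^{T}\mathbf{u}(j,f)\big|^{2}\right]
 = \mathbf{R}^{-1}\mathbf{p},\\
\mathbf{R} &= \varepsilon\!\left[\mathbf{u}^{*}\mathbf{u}^{T}\right],\qquad
\mathbf{p} = \varepsilon\!\left[\mathbf{u}^{*}\,Y_{(1)}\right],\\
\gamma_{2}^{2}(j,f) &= \frac{\varepsilon\!\left[|\hat{Y}_{(1)}(j,f)|^{2}\right]}{\varepsilon\!\left[|Y_{(1)}(j,f)|^{2}\right]}
 = \frac{\mathbf{p}^{H}\mathbf{R}^{-1}\mathbf{p}}{\varepsilon\!\left[|Y_{(1)}(j,f)|^{2}\right]}.
\end{aligned}
```

With one past frame, the vector u has 2M − 1 entries: M − 1 from the current frame and M from the previous frame.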

Step F7
The echo component ratio calculation unit 664 obtains the ratio of the echo component in the collected-signal spectrum Y(j, f) from the ratios obtained in steps F5 and F6.
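The combining equation is not reproduced in this text. One plausible form, offered purely as an assumption, follows from the definitions: if the main-component echo accounts for a fraction γ_1^2 of the collected-signal power, the decorrelated remainder carries 1 − γ_1^2 of it, and the sub-component echo accounts for a fraction γ_2^2 of that remainder. The function name below is hypothetical.

```python
def total_echo_ratio(gamma1_sq, gamma2_sq):
    """Assumed combination: main echo share plus sub echo share of the remainder."""
    return gamma1_sq + (1.0 - gamma1_sq) * gamma2_sq

# Per-frequency use: one ratio per bin f.
ratios = [total_echo_ratio(g1, g2) for g1, g2 in [(0.5, 0.5), (0.0, 0.2)]]
```

Under this form the result stays within 0 to 1 whenever both inputs do, and it reduces to γ_1^2 when the past frames and other channels explain nothing further.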

The power calculation unit 64 of FIG. 1 first calculates the non-echo signal power for each frequency component from the echo component ratio, and then sums over the frequency f to obtain the non-echo signal power P_YI.
The transmission determination unit 5 compares the non-echo signal power P_YI with a threshold P_th to determine the presence or absence of transmission. When transmission is judged to be present, the receive-side variable loss unit 7 attenuates the received signal. When transmission is judged to be absent, the transmit-side variable loss unit 8 attenuates the collected signal. The threshold P_th may be set, for example, to −15 dB relative to the rated input level of the microphone.
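The power calculation and decision can be sketched as follows. The sum mirrors the wording of claim 1 (the product of one minus the echo component ratio and the collected-signal power, summed over frequency); any normalization by the frame length is omitted because its equation is not reproduced here. The function names and the strict "greater than" comparison are assumptions.

```python
def non_echo_power(Y, gamma_sq):
    """Sum over frequency of (1 - echo ratio) times collected-signal power."""
    return sum((1.0 - g) * abs(y) ** 2 for y, g in zip(Y, gamma_sq))

def attenuated_side(p_non_echo, p_th):
    """Return which path the voice switch attenuates for this frame."""
    # Transmission present: attenuate the received signal (variable loss unit 7).
    # Transmission absent: attenuate the collected signal (variable loss unit 8).
    return "receive" if p_non_echo > p_th else "send"
```

For a spectrum Y = [1, 2j] with echo ratios [0.5, 0.75], the non-echo power is 0.5 · 1 + 0.25 · 4 = 1.5.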

Furthermore, in an M-loudspeaker, N-microphone configuration, the N transmitted signal power estimation units 6 estimate, for each microphone, the power of the transmitted signal contained in the collected signal, and the presence or absence of transmission is determined from the information for the N channels. One example of the determination method is as follows.
(1) For each channel, the transmitted signal power is compared with the threshold P_th.
(2) When the number of channels whose signal power is larger than the threshold P_th exceeds a preset threshold N_th, transmission is judged to be present.
(3) Otherwise, transmission is judged to be absent.
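Rules (1) to (3) amount to a simple count-and-compare, sketched below. The function and parameter names are hypothetical.

```python
def transmission_present(powers, p_th, n_th):
    """powers: estimated transmitted-signal power for each of the N channels."""
    # (1) compare each channel with P_th, (2) count the channels above it,
    # (3) transmission is present only if the count exceeds N_th.
    above = sum(1 for p in powers if p > p_th)
    return above > n_th
```

With N_th = 1, at least two microphones must exceed P_th before the switch treats the frame as transmission.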

In order to apply the attenuation to the same signal frame that was used for the transmission determination, a delay element corresponding to the delay of the transmission determination may be inserted before the transmit-side variable loss unit 8.
When processing of the current frame is finished, the reproduction signal information stored in the current-frame storage unit 661a1 is finally transferred to and stored in the past-frame storage unit 661a2.
Alternatively, instead of distinguishing the current-frame storage unit 661a1 from the past-frame storage unit 661a2 within the storage unit 661a and transferring the information at the end of each series of processing as above, the most recent information stored in a single storage unit 661a may be treated as the current information. As shown in FIG. 6, it is also possible to use the spectrum of the input reproduction signal directly as the current reproduction signal spectrum, rather than reading it from the storage unit.
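The single-storage variant described above can be sketched with a bounded queue: the newest entry is the current frame and older entries are the past frames, so no explicit transfer step is needed. The class and method names are hypothetical, and the choice of `collections.deque` is an implementation assumption.

```python
from collections import deque

class SpectrumStore:
    """Keeps the current frame's spectra plus the last n_past frames."""

    def __init__(self, n_past=1):
        # The oldest entry is dropped automatically when a new frame arrives.
        self._frames = deque(maxlen=n_past + 1)

    def push(self, spectra):
        """Store the spectra X_1..X_M of the newest frame."""
        self._frames.append(spectra)

    @property
    def current(self):
        return self._frames[-1]

    def past(self, n=1):
        """Spectra from n frames earlier; raises IndexError if not yet stored."""
        return self._frames[-1 - n]
```

Increasing `n_past` corresponds to using spectra from n frames earlier, as mentioned in step F3.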

[Second Embodiment]
This embodiment combines the voice switch method with an acoustic echo cancellation method using an adaptive filter; a configuration example is shown in FIG. 7.
The M-channel received signals that have passed through the receive-side variable loss units 7_m (m = 1, …, M) are reproduced as acoustic signals by the loudspeakers 2_m and reach the microphone 3 through the acoustic echo paths. At the same time, they are input to the predicted echo generation unit 91 of the acoustic echo cancellation unit 9. The subtractor 92 subtracts the predicted echo signal from the signal y(k) collected by the microphone 3. The residual signal is fed back to the echo path estimation unit 93 and, at the same time, transmitted to the far end through the transmit-side variable loss unit 8. The transmission determination unit 5 and the transmitted speech power estimation unit 6 determine the presence or absence of transmission as in the first embodiment and control the receive-side and transmit-side variable loss units.
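The text specifies only "an adaptive filter" and does not name the adaptation algorithm. The sketch below uses time-domain NLMS purely as a stand-in to illustrate the roles of the predicted echo generation unit 91, the subtractor 92, and the echo path estimation unit 93. The class name, step size `mu`, and regularization `eps` are all assumptions.

```python
class NlmsEchoCanceller:
    """Multi-channel NLMS sketch; NLMS is an assumed choice, not from the source."""

    def __init__(self, channels, taps, mu=0.5, eps=1e-8):
        self.h = [[0.0] * taps for _ in range(channels)]  # estimated echo paths
        self.x = [[0.0] * taps for _ in range(channels)]  # recent reproduction samples
        self.mu = mu
        self.eps = eps

    def process(self, x_now, y):
        """x_now: one new reproduction sample per channel; y: microphone sample."""
        for hist, s in zip(self.x, x_now):
            hist.insert(0, s)   # newest sample first
            hist.pop()          # drop the oldest
        # Predicted echo (unit 91): sum over channels and taps.
        y_hat = sum(h * s for hc, xc in zip(self.h, self.x) for h, s in zip(hc, xc))
        e = y - y_hat           # residual after the subtractor (unit 92)
        # Echo path update (unit 93), normalized by the input power.
        norm = sum(s * s for xc in self.x for s in xc) + self.eps
        g = self.mu * e / norm
        for hc, xc in zip(self.h, self.x):
            for i, s in enumerate(xc):
                hc[i] += g * s
        return e
```

The returned residual is the signal that would then pass through the transmit-side variable loss unit 8.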

In this configuration, the signal that is transmitted has undergone echo cancellation. Therefore, even in a double-talk situation in which received speech and transmitted speech overlap, the received echo component contained in the collected signal can be greatly reduced, and the quality of the hands-free call can be improved.
Although FIG. 7 illustrates the case of an M (≧2) channel reproduction system and a 2-channel sound collection system, the voice switch and echo cancellation by an adaptive filter can be combined with the same configuration and processing when the sound collection system has N (≧3) channels.

[Modification]
This is a modification of the second embodiment; its configuration is shown in FIG. 8. In the second embodiment, the input to the TF conversion unit 62 of the transmitted speech power estimation unit is the collected signal, whereas in this modification the signal after echo cancellation by the adaptive filter (the residual signal) is used as the input to the TF conversion unit 62.
In this configuration as well, the transmitted signal is a signal that has passed through the variable loss unit after echo cancellation, so an improvement in hands-free call quality can be expected.

FIG. 1: Configuration of an M-channel voice switch.
FIG. 2: Configuration of the conventional echo component ratio estimation unit.
FIG. 3: Flow of the conventional echo component ratio estimation.
FIG. 4: Configuration of the echo component ratio estimation unit of the first embodiment.
FIG. 5: Flow of the echo component ratio estimation of the first embodiment.
FIG. 6: Configuration of a modification of the echo component ratio estimation unit of the first embodiment.
FIG. 7: M-input, 2-output channel voice switch of the second embodiment.
FIG. 8: M-input, 2-output channel voice switch of the modification.
FIG. 9: Relationship between the sample time k of a signal and the frame time j.

Claims

In a method of determining the presence or absence of transmission from reproduction signals of a plurality of channels (M channels) and a collected signal of at least one channel, and attenuating the received signal or the transmitted signal,
the short-time spectrum of the first-channel reproduction signal of the current frame is taken as a main component,
for the reproduction signals of the second to M-th channels of the current frame and the reproduction signals of the first to M-th channels of at least one past frame, the correlation with the short-time spectrum serving as the main component is removed from each of their short-time spectra to obtain a plurality of short-time spectra constituting sub-components,
the ratio of the main-component echo in the short-time spectrum of the collected signal is obtained,
the ratio of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed is obtained,
the echo component ratio in the short-time spectrum of the collected signal is estimated for each frequency from the above two ratios, and
a non-echo signal power corresponding to the sum over frequency of the product of the value obtained by subtracting the echo component ratio estimated for each frequency from 1 and the collected-signal power for each frequency is obtained, and the presence or absence of transmission is determined by comparing the non-echo signal power with a preset threshold.
A voice switch method characterized by the above.

The method of claim 1, wherein
the echo component ratio γ^2(f) in the short-time spectrum of the collected signal is obtained from the ratio γ_1^2(f) of the main-component echo in the short-time spectrum of the collected signal and the ratio γ_2^2(f) of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed.
A voice switch method characterized by the above.

The method of claim 2, wherein
the echo component Ŷ_{(1)}(f) contained in the short-time spectrum Y_{(1)}(f) of the collected signal from which the correlation with the main component has been removed is obtained as the linear sum that minimizes |Y_{(1)}(f) − Ŷ_{(1)}(f)|^2, and
the ratio γ_2^2(f) of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed is then obtained.
A voice switch method characterized by the above.

The method of any one of claims 1 to 3, wherein
the non-echo signal power obtained from each of a plurality of collected signals is compared with a threshold for each channel, and transmission is determined to be present when the number of channels exceeding the threshold exceeds a certain number.
A voice switch method characterized by the above.

The method of claim 4, wherein
the non-echo signal power P_YI is obtained from the short-time spectrum Y(f) of the collected signal, the echo component ratio γ^2(f) in the short-time spectrum of the collected signal, and the frame length L.
A voice switch method characterized by the above.

The method of any one of claims 1 to 5, wherein
a signal obtained by subtracting, from the collected signal, the predicted value of the echo predicted from the reproduction signals is used as the transmitted signal.
A voice switch method characterized by the above.

The method of any one of claims 1 to 6, wherein
the residual signal between the predicted value of the echo predicted from the reproduction signals and the signal obtained from the sound collection unit is used as the collected signal and the transmitted signal.
A voice switch method characterized by the above.

Means for receiving reproduction signals of a plurality of channels (M channels) and a collected signal of at least one channel;
means for obtaining, as a main component, the short-time spectrum of the first-channel reproduction signal of the current frame;
means for removing, for the reproduction signals of the second to M-th channels of the current frame and the reproduction signals of the first to M-th channels of at least one past frame, the correlation with the short-time spectrum serving as the main component from each of their short-time spectra, to obtain a plurality of short-time spectra constituting sub-components;
means for obtaining the ratio of the main-component echo in the short-time spectrum of the collected signal;
means for obtaining the ratio of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed;
means for estimating, for each frequency, the echo component ratio in the short-time spectrum of the collected signal from the two ratios;
means for obtaining a non-echo signal power corresponding to the sum over frequency of the product of the value obtained by subtracting the echo component ratio estimated for each frequency from 1 and the collected-signal power for each frequency, and for determining the presence or absence of transmission by comparing the non-echo signal power with a preset threshold; and
loss means for attenuating the received signal or the transmitted signal depending on the presence or absence of transmission.
A voice switch comprising the above means.

The apparatus of claim 8, comprising,
as the means for estimating the echo component ratio γ^2(f) in the short-time spectrum of the collected signal, means for obtaining that ratio from the ratio γ_1^2(f) of the main-component echo in the short-time spectrum of the collected signal and the ratio γ_2^2(f) of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed.
A voice switch comprising the above means.

The apparatus of claim 9, comprising,
as the means for obtaining the ratio γ_2^2(f) of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed, means for obtaining the echo component Ŷ_{(1)}(f) contained in the short-time spectrum Y_{(1)}(f) of the collected signal from which the correlation with the main component has been removed as the linear sum that minimizes
|Y_{(1)}(f) − Ŷ_{(1)}(f)|^2,
and for then obtaining the ratio γ_2^2(f) of the sub-component echo in the short-time spectrum of the collected signal from which the correlation with the main component has been removed.
A voice switch comprising the above means.

The apparatus of any one of claims 8 to 10, comprising:
means for setting a threshold; and
means for determining the presence or absence of transmission by comparing the threshold with the non-echo signal power.
A voice switch comprising the above means.

The apparatus of claim 11, comprising:
means for comparing the non-echo signal power obtained from each of a plurality of collected signals with the threshold for each channel; and
means for determining that transmission is present when the number of channels exceeding the threshold exceeds a certain number.
A voice switch comprising the above means.

The apparatus of claim 11, comprising
means for obtaining the non-echo signal power P_YI from the short-time spectrum Y(f) of the collected signal, the echo component ratio γ^2(f) in the short-time spectrum of the collected signal, and the frame length L.
A voice switch comprising the above means.

The apparatus of any one of claims 8 to 13, comprising
means for using, as the transmitted signal, a signal obtained by subtracting from the collected signal the predicted value of the echo predicted from the reproduction signals.
A voice switch comprising the above means.

The apparatus of any one of claims 8 to 13, comprising
means for using, as the collected signal and the transmitted signal, the residual signal between the predicted value of the echo predicted from the reproduction signals and the signal obtained from the sound collection unit.
A voice switch comprising the above means.

A voice switch program for causing a computer to execute the voice switch method according to any one of claims 1 to 7.

A computer-readable recording medium on which the voice switch program according to claim 16 is recorded.