JP4194749B2 - Channel gain correction system and noise reduction method in voice communication

Info

Publication number: JP4194749B2
Application number: JP2000509079A
Authority: JP
Inventors: マウロ、アンソニー・ピー
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 1997-09-02
Filing date: 1997-09-30
Publication date: 2008-12-10
Anticipated expiration: 2017-09-30
Also published as: EP1010169B1; DE69736198T2; EP1010169A1; DE69736198D1; JP2003526109A

Description

【０００１】
（技術分野）
本発明は音声処理に関する。より特定的には本発明は、音声処理に用いられる雑音抑制システムとその方法に関する。
【０００２】
（背景技術）
デジタル技法による音声の送信は、特にセルラー電話や個人通信システム（ＰＣＳ）などの応用分野で広く用いられるようになっている。これがまた、音声処理技法の改良に対する興味を生じた。改良がなされている１つの領域は雑音抑制技法の開発である。
【０００３】
音声通信システムにおける雑音抑制は一般的に、環境的背景雑音を所望の音声信号からフィルタリングすることによって所望のオーディオ信号の全体的な品質を改良する目的に適うものである。この音声向上プロセスは、飛行機や、移動中の車両や、やかましい工場などの異常に高レベルの周辺背景雑音を有する環境においては特に必要である。
【０００４】
１つの雑音抑制技法はスペクトル減算、すなわちスペクトルの利得を修正する技法である。この方式を用いると、入力オーディオ信号は複数の周波数チャネルに分割され、これによって、特定の周波数チャネルがその雑音エネルギー含有量に従って減衰される。各周波数チャネルに対する背景雑音の推定値を利用して、そのチャネルでの音声の信号対雑音比（ＳＮＲ）を発生し、このＳＮＲ比を用いて各チャネルの利得係数を計算する。次に、この利得係数によって特定のチャネルの減衰量を決定する。減衰したチャネルは再合成されて、雑音を抑制した出力信号を生成する。
【０００５】
比較的高い背景雑音環境を伴う特殊な応用分野では、大抵の雑音抑制技法がかなりの性能限界を示す。このような応用分野の１例として、セルラーモバイル通信システムに対する車両スピーカフォンというオプションがある。このスピーカフォンオプションは、自動車のドライバにハンドフリーの動作を可能とするものである。ハンドフリーマイクロフォンは一般的には、面頬(visor)の上方に取り付けられたりして、使用者からかなり隔たったところに置かれる。この遠隔のマイクロフォンでは、道路と風などの雑音条件のため、ランドエンド(land-end)側に対して悪いＳＮＲが提供される。ランドエンドで受信された音声は通常は理解可能であるが、このような背景雑音レベルに対して連続して曝されると、聴取者の疲労を増すことがしばしばある。
【０００６】
雑音抑制システムが適切に機能するためには、音声のＳＮＲを正確に決定することが重要である。しかしながら、現在入手可能な雑音検出器の限界のために、音声信号のＳＮＲを正確に決定するのは困難である。スペクトル減算技法は、音声が不在の期間中に背景雑音推定値を更新するものである。音声が不在のときに、測定されたスペクトルエネルギーは雑音によるものであり、このため、測定されたスペクトルエネルギーに基づいて雑音推定値が更新される。したがって、音声の存在期間と不在期間を区別して、ＳＮＲを計算するための正確な雑音エネルギーを得ることが重要である。
【０００７】
音声検出のある例示技法では、音声計量(metric)計算機を用いて雑音更新値判断を実行している。音声計量とは、チャネルエネルギーの全体的な音声状特徴の測定値である。最初に、生(raw)のＳＮＲ推定値を用いて音声計量表を割り出し、これによって、各チャネルに対する音声計量値を得る。個々のチャネル音声計量値は合計されてエネルギーパラメータとなり、これを背景雑音更新しきい値と比較する。この音声計量合計値がこのしきい値以上であれば、その信号は音声を包含していると言われる。音声計量合計値がしきい値未満であれば、入力フレームは雑音と見なされて、背景雑音更新が実行される。しかしながら、高背景雑音や突然背景雑音や増加雑音発生源の場合、ＳＮＲ測定値は大きな値となり、この結果、音声計量値が高くなり、このため雑音推定更新値が無効となる。
【０００８】
音声計量計算機技法を洗練させた技法では、チャネルエネルギーの偏差が測定される。この方法では、雑音はある時間にわたって一定のスペクトルエネルギーを示し、一方、音声はある時間にわたって可変のスペクトルエネルギーを示すものと仮定される。したがって、チャネルエネルギーは時間に対して積分され、これによって、チャネルエネルギーの偏差がかなり大きくなると音声が検出され、一方、チャネルエネルギーの偏差がほとんどなければ雑音が検出される。チャネルエネルギー偏差を測定する音声検出器は、雑音レベルの突然の増加を検出する。しかしながら、チャネルエネルギー偏差方法は、入力音声信号が一定のエネルギーの信号である場合は不正確な結果をもたらす。さらに、増加雑音発生源の場合、入力エネルギーが変化すると、エネルギー偏差が大きくなり、このため、雑音推定更新値がたとえ必要な場合でも無効となってしまう。
【０００９】
正確な音声検出器に加えて、雑音抑制システムは適切にチャネル利得を調整しなければならない。チャネル利得は、音声品質を犠牲にすることなく雑音が抑制されるように調整すべきである。チャネル利得を調整する１つの方法では、全体雑音推定値と音声信号のＳＮＲの関数として利得を計算する。一般に、全体雑音推定値が増すと、所与のＳＮＲに対する利得係数が減少する。利得係数が低いということは、減衰係数が高いことを示す。この技法は、全体雑音推定値が非常に高い場合に、最小の利得値を課して、チャネル利得の過剰減衰を防止するものである。強度にクランプした(clamped)最小利得値を用いることによって、雑音抑制と音声品質との兼ね合いが導き出される。クランプが比較的低い場合、雑音抑制は向上するが、音声品質は劣化する。クランプが比較的高ければ、雑音抑制は劣化するが音声品質は改善する。
【００１０】
改良型の雑音抑制システムを提供するために、音声検出とチャネル利得計算のための現在の技法の限界を指摘する必要がある。これらの問題と欠陥は以下に示すように本発明によって解決される。
【００１１】
（発明の開示）
本発明は、音声処理システムで用いられる雑音抑制のためのシステムと方法である。本発明の目的は、入力信号中に音声が存在することを決定する音声検出器を提供することである。音声の信号対雑音比（ＳＮＲ）を正確に決定するには信頼性の高い音声検出器が必要である。音声が不在であると判断されると、入力信号はその全体が雑音信号であると仮定されて、雑音エネルギーが測定される。次に、雑音エネルギーを用いてＳＮＲを決定する。本発明の別の目的は、雑音を抑制するための改良型の利得測定エレメントを提供することである。
【００１２】
本発明によれば、雑音抑制システムは、音声が入力信号のフレーム中に存在するか否か判断する音声検出器を備えている。音声の存否の判断は、入力信号中の音声のＳＮＲ尺度に基づいて行われる。ＳＮＲ推定器は、エネルギー推定器が発生した信号エネルギー推定値と雑音エネルギー推定器が発生した雑音エネルギー推定値とに基づいてＳＮＲを推定する。音声の存否判断はまた、入力信号の符号化レートに基づいている。可変速通信システムにおいては、各入力フレームは、入力フレームの内容に基づいて、所定のレート集合から選択された符号化レート(encording rate)を割り当てられる。一般に、このレートは音声のアクティビティ(activity)のレベルによって異なるため、音声を包含しているフレームには高レートが割り当てられ、一方、音声を包含していないフレームには低レートが割り当てられる。さらに、音声存否判断は、入力信号の特徴を記述している１つ以上のモード尺度に基づくこともある。音声が入力フレーム中に存在しないと判断された場合、雑音エネルギー推定器は雑音エネルギー推定値を更新する。
【００１３】
チャネル利得推定器は、入力信号のフレームに対する利得を決定する。音声がフレーム中に存在しない場合、利得は所定の最小値に設定される。存在する場合は、利得はフレームの周波数の内容に基づいて決定される。ある好ましい実施形態では、利得係数は事前定義された集合を成す周波数チャネルの各々に対して決定される。各チャネルに対して、利得はそのチャネル上の音声のＳＮＲに従って決定される。チャネル毎に、利得はそのチャネルが存在する周波数バンドの特徴に適した関数を用いて定義される。一般的には、事前定義された周波数バンドに対して、利得はＳＮＲが増すと共に自身も線形に増加するように設定される。加えて、各周波数バンドに対する最小利得は、環境的特徴に基づいて調整することも可能であり得る。例えば、ユーザー選択可能な最小利得が実現され得る。チャネルＳＮＲは、エネルギー推定器が発生したチャネルエネルギー推定値と雑音エネルギー推定器が発生したチャネルエネルギー推定値とに基づいている。利得係数を用いて、様々なチャネル上の信号の利得を調整し、利得調整されたチャネルは合成されて、雑音抑制された出力信号を生成する。
【００１４】
（発明を実施するための最良の形態）
本発明の特徴、目的及び利点は、全体にわたって同様の参照符号が同様のエレメントを示す図面を参照して以下に記述する詳細な説明から明らかであろう。
【００１５】
音声通信システムにおいては、通常は雑音抑制器を用いて、好ましくない環境的背景雑音を抑制する。大抵の雑音抑制器は、１つ以上の周波数バンド中の入力データ信号の背景雑音特徴を推定し、この推定値の平均値をこの入力信号から減算するように動作する。平均の背景雑音の推定値は音声が不在の期間中に更新される。雑音抑制器は、正しく動作するには、背景雑音レベルを正確に決定する必要がある。加えて、雑音の抑制レベルを入力信号の音声と雑音との特徴に基づいて正しく調整しなければならない。これらの要件は本発明の雑音抑制システムによって処理される。
【００１６】
本発明が実現されている例示の音声処理システム１００を図１に示す。システム１００はマイクロフォン１０２と、Ａ／Ｄコンバータ１０４と、音声プロセッサ１０６と、送信機１１０と、アンテナ１１２と、を備えている。マイクロフォン１０２は、図１に示す他のエレメントと共にセルラー電話中に配置してもよい。代替例としては、マイクロフォン１０２は、セルラー通信システムの車両スピーカフォンオプションであるハンドフリーマイクロフォンであってもよい。車両スピーカフォンのアセンブリは時としてカーキットと呼ばれる。マイクロフォン１０２がカーキットの１部である場合、雑音抑制機能は特に重要である。ハンドフリーマイクロフォンは一般的に使用者からある程度の距離のところに位置するので、受信された音響信号は、道路と風という条件のため悪いＳＮＲを持つ傾向がある。
【００１７】
図１を引き続き参照すると、音声及び／又は背景雑音を含む入力オーディオ信号がマイクロフォン１０２によって受信される。入力オーディオ信号はマイクロフォン１０２によって、項目ｓ(t)で表される電気音響信号に変換される。この電気音響信号は、Ａ／Ｄコンバータ１０４によってアナログ信号からパルス符号変調（ＰＣＭ）サンプルに変換してもよい。ある例示実施形態では、ＰＣＭサンプルはＡ／Ｄコンバータ１０４から６４kbpsのレートで出力され、図１に示すように信号s(n)として表される。デジタル信号s(n)は、雑音抑制器１０８を他のエレメントと共に備えている音声プロセッサ１０６に受信される。雑音抑制器１０８は本発明に従って信号s(n)中の雑音を抑制する。カーキット(carkit)応用品の中では、雑音抑制器１０８は背景環境雑音のレベルを測定して、信号の利得を調整して、このような環境雑音の影響を軽減する。雑音抑制器１０８に加えて、音声プロセッサ１０６は一般的にはボイスコーダ、すなわちボコーダ（図示せず）を備えているが、このボコーダは、人間の音声の発生のモデルに関連するパラメータを抽出することによって音声を圧縮する。音声プロセッサ１０６はまた、エコーキャンセラ（図示せず）を備えているが、これは、スピーカ（図示せず）とマイクロフォン１０２間のフィードバックに起因する音響エコーを解消するものである。
【００１８】
音声プロセッサ１０６による処理に続いて、信号は送信機１１０に出力されるが、送信機１１０は、符号分割多重アクセス方式（ＣＤＭＡ）や、時分割多重アクセス方式（ＴＤＭＡ）や、周波数分割多重アクセス方式（ＦＤＭＡ）などの所定の方式に従って変調を実行する。本例示の実施形態では、送信機１１０は、本発明の譲受人に譲受され、参考としてここに組み込まれる「衛星又は地上中継器を用いた拡散スペクトル多重アクセス通信システム」（ＳＰＲＥＡＤＳＰＥＣＴＲＵＭＭＵＴＩＰＬＥＡＣＣＥＳＳＣＯＭＭＵＮＩＣＡＴＩＯＮＳＹＳＴＥＭＵＳＩＮＧＳＡＴＥＬＬＩＴＥＯＲＴＥＲＲＥＳＴＲＩＡＬＲＥＰＥＡＴＥＲＳ）という題名の米国特許第４，９０１，３０７号に述べるようなＣＤＭＡ形式に従って信号を変調する。すると、送信機１１０は変調された信号を上方変換して増幅し、変調された信号はアンテナ１１２から送信される。
【００１９】
雑音抑制器１０８は、図１のシステム１００とは異なった音声処理システムとして実現してもよいことを認識すべきである。例えば、雑音抑制器１０８を、音声メールオプションを有する電子メール応用例で利用してもよい。このような応用例中では、図１の送信機１１０とアンテナ１１２とは必要ではない。その代わりに、雑音抑制された信号が音声プロセッサ１０６によってフォーマッティングされて、電子メールネットワーク上で送信される。
【００２０】
雑音抑制器１０８のある例示実施形態を図２に示す。入力オーディオ信号は図２に示すように事前プロセッサ２０２によって受信される。事前プロセッサ(preprocessor)２０２は、事前エンファシス(preemphasis)とフレーム発生を実行することによって雑音抑制するように入力信号を作成する。事前エンファシスは、信号の高周波数音声成分を強調することによって音声信号の出力スペクトル密度を再分布させる。実質的には高域パスフィルタリング(a high pass filtering)機能を実行することによって、事前エンファシス処理は重要な音声成分を強調して、周波数ドメイン(domain)中にあるこれらの成分のＳＮＲを向上させる。事前プロセッサ２０２はまた、入力信号のサンプルからフレームを発生する。ある好ましい実施形態では、８０サンプル／フレームの１０ｍｓフレームを発生する。これらのフレームはサンプルをオーバーラップさせて処理精度を高めることがある。これらのフレームは、入力信号のサンプルをウインドウ処理(windowing)し、ゼロパッディングする(zeropadding)ことによって発生させてもよい。プリプロセスされた(preprocessed)信号は変換エレメント２０４に出力される。ある好ましい実施形態では、変換エレメント２０４は、入力信号の各フレームに対して１２８ポイントの高速フーリエ変換（ＦＦＴ）を発生する。しかしながら、代替スキームを用いて入力信号の周波数成分を分析してもよいことを理解すべきである。
【００２１】
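The transformed components are provided to channel energy estimator 206a, which generates an energy estimate for each of N channels of the transformed signal. For each channel, one technique for updating the channel energy estimate smooths the current channel energy against the channel energy of the previous frames as follows:
E_u(t) = αE_ch + (1 - α)E_u(t-1)    (1)
where the updated estimate E_u(t) is defined as a function of the current channel energy E_ch and the previous channel energy estimate E_u(t-1). An exemplary embodiment sets α = 0.55.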
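The smoothing of equation (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the function names and the zero initial estimate are assumptions, and only α = 0.55 comes from the text.

```python
def smooth_energy(current, previous, alpha=0.55):
    """Equation (1): E_u(t) = alpha * E_ch + (1 - alpha) * E_u(t-1)."""
    return alpha * current + (1.0 - alpha) * previous


def smooth_sequence(energies, initial=0.0, alpha=0.55):
    """Run equation (1) over successive frames of one channel.

    `initial` (the estimate before the first frame) is an assumption;
    the text does not say how the estimate is initialised.
    """
    estimate = initial
    history = []
    for energy in energies:
        estimate = smooth_energy(energy, estimate, alpha)
        history.append(estimate)
    return history
```

With α = 0.55 the current frame carries slightly more weight than the accumulated history, so the estimate tracks changes in channel energy within a few frames.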
【００２２】
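In a preferred embodiment, energy estimates are determined for a low frequency channel and a high frequency channel, so that N = 2. The low frequency channel corresponds to the frequency range from 250 to 2250 Hz, and the high frequency channel corresponds to the range from 2250 to 3500 Hz. The current channel energy of the low frequency channel may be determined by summing the energies of the FFT points corresponding to 250 to 2250 Hz, and that of the high frequency channel by summing the energies of the FFT points corresponding to 2250 to 3500 Hz.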
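As a rough illustration of the two-channel energy computation, the sketch below sums squared FFT magnitudes over each band. It assumes an 8 kHz sampling rate (implied by 80 samples per 10 ms frame) and a 128-point FFT, giving 62.5 Hz per bin. Treating each band as half-open, so that the 2250 Hz bin counts toward the high channel, is an assumption, as are the function names.

```python
SAMPLE_RATE_HZ = 8000  # assumption: 80 samples per 10 ms frame


def band_energy(spectrum, low_hz, high_hz, sample_rate=SAMPLE_RATE_HZ):
    """Sum |X[k]|^2 over FFT bins whose frequency lies in [low_hz, high_hz)."""
    n = len(spectrum)
    bin_hz = sample_rate / n
    total = 0.0
    # Only bins 0..n/2 are inspected; the upper half mirrors them for real input.
    for k in range(n // 2 + 1):
        freq = k * bin_hz
        if low_hz <= freq < high_hz:
            total += abs(spectrum[k]) ** 2
    return total


def two_channel_energies(spectrum):
    """Return (low, high) channel energies for the N = 2 embodiment."""
    low = band_energy(spectrum, 250.0, 2250.0)
    high = band_energy(spectrum, 2250.0, 3500.0)
    return low, high
```

For a 128-point spectrum this assigns bins 4 to 35 to the low channel and bins 36 to 55 to the high channel.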
【００２３】
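These energy estimates are provided to speech detector 208, which determines whether speech is present in the received audio signal. SNR estimator 210a of speech detector 208 receives the energy estimates. SNR estimator 210a determines the signal-to-noise ratio (SNR) of the speech in each of the N channels based on the channel energy estimates and the channel noise energy estimates. The channel noise energy estimates are provided by noise energy estimator 214a and generally correspond to the estimated noise energy smoothed over the previous frames that do not contain speech.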
【００２４】
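Speech detector 208 also comprises rate decision element 212, which selects the data rate of the input signal from a predetermined set of data rates. In certain communication systems, data is encoded so that the data rate may vary from one frame to another. This is known as a variable rate communication system. A voice coder that encodes data according to a variable rate scheme is generally called a variable rate vocoder. An exemplary embodiment of a variable rate vocoder is described in U.S. Pat. No. 5,414,796, entitled "VARIABLE RATE VOCODER", assigned to the assignee of the present invention and incorporated herein by reference. Using a variable rate communication channel eliminates unnecessary transmissions when there is no useful speech to be transmitted. Algorithms are used within the vocoder to generate a varying number of information bits in each frame in accordance with variations in speech activity. For example, a vocoder with a set of four rates may produce 20 millisecond data frames containing 16, 40, 80, or 171 information bits, depending on the activity of the speaker. It is preferable to transmit each data frame in a fixed amount of time by varying the transmission rate of the communications.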
【００２５】
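Because the rate of a frame depends on the speech activity during that time frame, determining the rate provides information on whether speech is present. In a system using variable rates, a determination that a frame should be encoded at the highest rate generally indicates the presence of speech, while a determination that a frame should be encoded at the lowest rate generally indicates the absence of speech. Intermediate rates typically indicate transitions between the presence and the absence of speech.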
【００２６】
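Rate decision element 212 may implement any of a number of rate decision algorithms. One such rate decision algorithm is disclosed in copending U.S. patent application Ser. No. 08/286,842, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING", assigned to the assignee of the present invention and incorporated herein by reference. This technique provides a set of rate decision criteria referred to as mode measures. The first mode measure is the target matching signal-to-noise ratio (TMSNR) from the previous encoding frame, which indicates how well the encoding model is performing by comparing the synthesized speech signal with the input speech signal. The second mode measure is the normalized autocorrelation function (NACF), which measures the periodicity of the speech frame. The third mode measure is the zero crossings (ZC) parameter, which measures the high frequency content of the input speech frame. The fourth measure, the prediction gain differential (PGD), determines whether the encoder is maintaining its prediction efficiency. The fifth measure is the energy differential (ED), which compares the energy in the current frame with an average frame energy. Using these mode measures, the rate determination logic selects an encoding rate for the input frame.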
【００２７】
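Although rate decision element 212 is shown in FIG. 2 as an element included in noise suppressor 108, it should be understood that the rate information may instead be provided to noise suppressor 108 by another component of speech processor 106 (FIG. 1). For example, speech processor 106 may comprise a variable rate vocoder (not shown) that determines the encoding rate for each frame of the input signal. Instead of having noise suppressor 108 perform the rate determination independently, the rate information may be provided to noise suppressor 108 by the variable rate vocoder.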
【００２８】
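It should also be understood that, instead of using the rate decision to determine the presence of speech, speech detector 208 may use a subset of the mode measures that contribute to the rate decision. For example, rate decision element 212 may be replaced by an NACF element (not shown), which, as described above, measures the periodicity of the speech frame. The NACF is evaluated in accordance with the following relationship: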
【数１】

【００２９】
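where N is the number of samples of the speech frame, and t₁ and t₂ are the boundaries within the T samples over which the NACF is evaluated. The NACF is evaluated based on the formant residual signal e(n). Formant frequencies are the resonance frequencies of speech. A short-term filter is used to filter the speech signal to obtain the formant frequencies. The residual signal obtained after filtering by the short-term filter is the formant residual signal, and it contains the long-term speech information of the signal, such as the pitch.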
【００３０】
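The NACF mode measure is suitable for determining the presence of speech because the periodicity of a signal containing voiced speech differs from that of a signal that does not contain voiced speech. Voiced speech tends to be characterized by periodic components. When voiced speech is not present, the signal generally has no periodic components. The NACF measure can therefore be a good indicator for use by speech detector 208.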
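The NACF equation itself is not reproduced in this text, so the following Python sketch uses one common normalized autocorrelation form as an assumption: the cross term between the frame and its lagged copy, divided by the mean of their energies, maximized over lags between t1 and t2. The function name, argument layout, and exact normalization are illustrative and may differ from the patent's equation.

```python
def nacf(residual, frame_len, t1, t2):
    """Maximum normalized autocorrelation of the last `frame_len` samples.

    `residual` holds the formant residual e(n) and must include at least
    `t2` samples of history before the frame. The normalization below is
    an assumed form, not taken from the source text.
    """
    start = len(residual) - frame_len
    if frame_len <= 0 or not 0 < t1 <= t2 <= start:
        raise ValueError("need at least t2 samples of history before the frame")
    frame_energy = sum(residual[start + n] ** 2 for n in range(frame_len))
    best = 0.0
    found = False
    for lag in range(t1, t2 + 1):
        cross = sum(residual[start + n] * residual[start + n - lag]
                    for n in range(frame_len))
        lagged_energy = sum(residual[start + n - lag] ** 2
                            for n in range(frame_len))
        denom = 0.5 * (frame_energy + lagged_energy)
        if denom > 0.0:
            value = cross / denom
            if not found or value > best:
                best, found = value, True
    return best  # 0.0 when the signal is all zeros
```

A strongly periodic residual yields a value near 1 at the lag equal to its period, while an aperiodic residual yields a value closer to 0, which is the behavior the paragraph above relies on.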
【００３１】
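Speech detector 208 may use a measure such as the NACF instead of the rate decision in situations where generating the rate decision is not practical. For example, if the rate decision is not available from a variable rate vocoder and noise suppressor 108 does not have the processing power to generate its own rate decision, a mode measure such as the NACF provides a desirable alternative. This may be the case in a car kit application, where processing power is generally limited.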
【００３２】
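In addition, it should be understood that speech detector 208 may make its determination regarding the presence of speech based on the rate decision alone, the mode measure(s) alone, or the SNR estimates alone. Although additional measures should improve the accuracy of the determination, any one of these measures alone may provide adequate results.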
【００３３】
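The rate decision (or the mode measure(s)) and the SNR estimates generated by SNR estimator 210a are provided to speech decision element 216. Speech decision element 216 generates a decision, based on its inputs, on whether speech is present in the input signal. The decision on the presence of speech determines whether the noise energy estimate is updated. The noise energy estimate is used by SNR estimator 210a to determine the SNR of the speech in the input signal. That SNR is in turn used to compute the level of attenuation of the input signal for noise suppression. If speech is determined to be present, speech decision element 216 opens switch 218a, preventing noise energy estimator 214a from updating the noise energy estimate. If speech is determined not to be present, the input signal is presumed to be noise, and speech decision element 216 closes switch 218a, causing noise energy estimator 214a to update the noise estimate. Although shown in FIG. 2 as switch 218a, it should be understood that an enable signal provided by speech decision element 216 to noise energy estimator 214a may perform the same function.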
【００３４】
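In a preferred embodiment in which two channel SNRs are evaluated, speech decision element 216 generates the noise update decision based on the following procedure:
if (rate == min)
    if ((chsnr1 > T1) or (chsnr2 > T2))
        if (ratecount > T3)
            update noise estimate
        else
            ratecount++
    else
        update noise estimate
        ratecount = 0
else
    ratecount = 0
The channel SNR estimates provided by SNR estimator 210a are denoted chsnr1 and chsnr2. The rate of the input signal, provided by rate decision element 212, is denoted rate. A counter, ratecount, keeps track of the number of frames based on certain conditions described below.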
【００３５】
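Speech decision element 216 determines that speech is not present, and that the noise estimate should be updated, if the rate is the minimum rate of the variable rates, either chsnr1 is greater than threshold T1 or chsnr2 is greater than threshold T2, and ratecount is greater than threshold T3. If the rate is the minimum, and either chsnr1 is greater than T1 or chsnr2 is greater than T2, but ratecount is not greater than T3, ratecount is incremented by one and the noise estimate is not updated. The counter ratecount detects a sudden increase in the noise level, or an increasing noise source, by counting the number of frames that have the minimum rate yet have high energy in at least one of the channels. The counter, which provides an indication that a high-SNR signal does not contain speech, is set to count until speech is detected in the signal. A preferred embodiment sets T1 = T2 = 5 dB and T3 = 100 frames, where 10 ms frames are evaluated.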
【００３６】
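If the rate is the minimum, chsnr1 is less than T1, and chsnr2 is less than T2, speech decision element 216 determines that speech is not present and that the noise estimate should be updated. In addition, ratecount is reset to zero.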
【００３７】
If the rate is not the minimum, speech decision element 216 determines that the frame contains speech and that the noise estimate should not be updated, and ratecount is reset to zero.
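The rate-based noise update decision described in the preceding paragraphs can be sketched as a small state machine. The class and method names are hypothetical; the default thresholds (5 dB, 5 dB, 100 frames) are the preferred-embodiment values from the text, and treating the boundary cases (SNR exactly equal to a threshold) as "low SNR" is an assumption.

```python
class RateBasedNoiseUpdate:
    """Decide, frame by frame, whether the noise estimate should be updated."""

    def __init__(self, t1_db=5.0, t2_db=5.0, t3_frames=100):
        self.t1_db = t1_db
        self.t2_db = t2_db
        self.t3_frames = t3_frames
        self.rate_count = 0  # minimum-rate frames seen with high channel SNR

    def should_update(self, rate_is_minimum, chsnr1, chsnr2):
        """Return True when the frame is treated as noise and should update."""
        if not rate_is_minimum:
            # Frame is taken to contain speech: no update, counter reset.
            self.rate_count = 0
            return False
        if chsnr1 > self.t1_db or chsnr2 > self.t2_db:
            # Minimum rate but high SNR: possibly a jump in the noise level.
            if self.rate_count > self.t3_frames:
                return True
            self.rate_count += 1
            return False
        # Minimum rate and low SNR on both channels: plain background noise.
        self.rate_count = 0
        return True
```

The counter is what lets the estimator recover after a sudden rise in background noise: once enough consecutive minimum-rate frames show high SNR, the high SNR is attributed to a stale noise estimate and updating resumes.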
【００３８】
Recall that, instead of using the rate measure to determine the presence of speech, a mode measure such as the NACF measure may be utilized. Speech decision element 216 may use the NACF measure to determine the presence of speech, in which case the noise update decision is made in accordance with the following procedure:
if (pitchPresent == FALSE)
    if ((chsnr1 > TH1) or (chsnr2 > TH2))
        if (pitchCount > TH3)
            update noise estimate
        else
            pitchCount++
    else
        update noise estimate
        pitchCount = 0
else
    pitchCount = 0
where pitchPresent is defined as follows:
if (NACF > TT1)
    pitchPresent = TRUE
    NACFcount = 0
elseif (TT2 <= NACF <= TT1)
    if (NACFcount > TT3)
        pitchPresent = TRUE
    else
        pitchPresent = FALSE
    NACFcount++
else
    pitchPresent = FALSE
    NACFcount = 0
Again, the channel SNR estimates provided by SNR estimator 210a are denoted chsnr1 and chsnr2. The NACF element (not shown) generates pitchPresent, a measure indicating the presence of pitch, as defined above. A counter, pitchCount, keeps track of the number of frames based on certain conditions described below.
【００３９】
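The measure pitchPresent determines that pitch is present if the NACF is greater than threshold TT1. Pitch is also determined to be present if the NACF remains in the mid range (TT2 ≤ NACF ≤ TT1) for more than TT3 frames. A counter, NACFcount, keeps track of the number of frames for which TT2 ≤ NACF ≤ TT1.
【００４０】
In a preferred embodiment, TT1 = 0.6, TT2 = 0.4, and TT3 = 8 frames, where 10 ms frames are evaluated.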
【００４１】
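Speech decision element 216 determines that speech is not present, and that the noise estimate should be updated, if the pitchPresent measure indicates that pitch is not present (pitchPresent = FALSE), either chsnr1 is greater than threshold TH1 or chsnr2 is greater than threshold TH2, and pitchCount is greater than threshold TH3. If pitchPresent = FALSE, and either chsnr1 is greater than TH1 or chsnr2 is greater than TH2, but pitchCount is not greater than TH3, pitchCount is incremented by one and the noise estimate is not updated. The counter pitchCount is used to detect a sudden increase in the noise level or an increasing noise source. In a preferred embodiment, TH1 = TH2 = 5 dB and TH3 = 100 frames, where 10 ms frames are evaluated.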
【００４２】
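If pitchPresent indicates that pitch is not present, chsnr1 is less than TH1, and chsnr2 is less than TH2, speech decision element 216 determines that speech is not present and that the noise estimate should be updated. In addition, pitchCount is reset to zero.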
【００４３】
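If pitchPresent indicates that pitch is present (pitchPresent = TRUE), speech decision element 216 determines that the frame contains speech and that the noise estimate should not be updated. In addition, pitchCount is reset to zero.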
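The NACF-based variant can be sketched the same way, under stated assumptions. The class and attribute names are hypothetical, the defaults are the preferred-embodiment thresholds from the text, and incrementing the NACF counter on every mid-range frame is an interpretation of the description rather than something the flattened listing settles.

```python
class PitchBasedNoiseUpdate:
    """Noise update decision driven by the NACF instead of the coding rate."""

    def __init__(self, tt1=0.6, tt2=0.4, tt3=8, th1_db=5.0, th2_db=5.0, th3=100):
        self.tt1, self.tt2, self.tt3 = tt1, tt2, tt3
        self.th1_db, self.th2_db, self.th3 = th1_db, th2_db, th3
        self.nacf_count = 0   # consecutive frames with mid-range NACF
        self.pitch_count = 0  # pitch-free frames seen with high channel SNR

    def pitch_present(self, nacf):
        if nacf > self.tt1:
            self.nacf_count = 0
            return True
        if self.tt2 <= nacf <= self.tt1:
            # Mid range: pitch is declared only after more than tt3 such frames.
            present = self.nacf_count > self.tt3
            self.nacf_count += 1
            return present
        self.nacf_count = 0
        return False

    def should_update(self, nacf, chsnr1, chsnr2):
        """Return True when the noise estimate should be updated."""
        if self.pitch_present(nacf):
            self.pitch_count = 0
            return False
        if chsnr1 > self.th1_db or chsnr2 > self.th2_db:
            if self.pitch_count > self.th3:
                return True
            self.pitch_count += 1
            return False
        self.pitch_count = 0
        return True
```

One instance would be kept per call and fed once per 10 ms frame, so that both counters carry state from frame to frame.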
【００４４】
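When speech is determined not to be present, switch 218a is closed and noise energy estimator 214a updates the noise estimate. Noise energy estimator 214a generally generates a noise energy estimate for each of the N channels of the input signal. Since no speech is present, the energy is presumed to be wholly contributed by noise. For each channel, the noise energy update is estimated as the current channel energy smoothed against the channel energy of the previous frames that do not contain speech. For example, the updated estimate may be obtained from the following relationship:
E_n(t) = βE_ch + (1 - β)E_n(t-1)    (3)
where the updated estimate E_n(t) is defined as a function of the current channel energy E_ch and the previous channel noise energy estimate E_n(t-1). An exemplary embodiment sets β = 0.1. The updated channel noise energy estimates are provided to SNR estimator 210a. These channel noise energy estimates are used to obtain the updated channel SNR estimates for the next frame of the input signal.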
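A minimal sketch of equation (3), together with a channel SNR computed from the two estimates, is given below. The text does not state the SNR formula; expressing it as ten times the base-10 logarithm of the energy ratio is an assumption, as are the function names and the small floor that guards against division by zero.

```python
import math


def update_noise_energy(current, previous_noise, beta=0.1):
    """Equation (3): E_n(t) = beta * E_ch + (1 - beta) * E_n(t-1)."""
    return beta * current + (1.0 - beta) * previous_noise


def channel_snr_db(channel_energy, noise_energy, floor=1e-12):
    """Channel SNR in dB from energy estimates (assumed definition)."""
    signal = max(channel_energy, floor)
    noise = max(noise_energy, floor)
    return 10.0 * math.log10(signal / noise)
```

With β = 0.1 the noise estimate moves slowly, so a single mislabelled frame changes it by at most a tenth of the difference between the frame energy and the running estimate.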
【００４５】
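The decision on the presence of speech is also provided to channel gain estimator 220. Channel gain estimator 220 determines the gain, and thus the level of noise suppression, for the frame of the input signal. If speech decision element 216 determines that speech is not present, the gain for the frame is set to a predetermined minimum gain level. Otherwise, the gain is determined as a function of frequency. In a preferred embodiment, the gain is computed according to the graph shown in FIG. 3. Although presented as a graph in FIG. 3, it should be understood that the function shown in FIG. 3 may be implemented as a look-up table in channel gain estimator 220.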
【００４６】
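As can be seen from FIG. 3, the preferred embodiment of the present invention defines a separate gain curve for each of L frequency bands. Three bands (L = 3) are shown in FIG. 3, but L may be any number greater than or equal to one. Thus, the gain factor for a channel in the low band may be determined using the low band curve, the gain factor for a channel in the mid band using the mid band curve, and the gain factor for a channel in the high band using the high band curve.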
【００４７】
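Although noise suppression may be performed using only a single gain curve for the input signal (L = 1), using multiple bands has been found to cause less degradation of voice quality. For environmental noise such as road and wind noise, the energy of the noise signal is greater at lower frequencies and generally decreases as frequency increases.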
【００４８】
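In FIG. 3, a line equation with a fixed slope and y-intercept is used to determine the gain factor for each band. The determination of the gain factors can be described by the following relationships:
gain[low band] (dB) = slope1 * SNR + low band y-intercept;    (4)
gain[mid band] (dB) = slope2 * SNR + mid band y-intercept;    (5)
gain[high band] (dB) = slope3 * SNR + high band y-intercept;    (6)
A preferred embodiment designates the low band as 125 to 375 Hz, the mid band as 375 to 2625 Hz, and the high band as 2625 to 4000 Hz. The slopes and y-intercepts are determined experimentally. The preferred embodiment uses the same slope of 0.39 for each of the three bands, although a different slope may be used for each frequency band. Further, the low band y-intercept is set at -17 dB, the mid band y-intercept at -13 dB, and the high band y-intercept at -13 dB.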
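Equations (4) to (6) with the preferred-embodiment constants can be sketched as a table lookup. The function and table names are hypothetical. The text does not give a value for the predetermined minimum gain, so `min_gain_db` is left to the caller, and assigning each shared band edge to the higher band is an assumption.

```python
# (low edge Hz, high edge Hz, slope, y-intercept dB) per equations (4)-(6)
GAIN_CURVES = (
    (125.0, 375.0, 0.39, -17.0),    # low band
    (375.0, 2625.0, 0.39, -13.0),   # mid band
    (2625.0, 4000.0, 0.39, -13.0),  # high band
)


def channel_gain_db(freq_hz, snr_db, speech_present, min_gain_db):
    """Gain in dB for a channel centred at freq_hz.

    When no speech is present the predetermined minimum gain is returned;
    otherwise the line equation of the band containing freq_hz is applied.
    """
    if not speech_present:
        return min_gain_db
    top_edge = GAIN_CURVES[-1][1]
    for low, high, slope, intercept in GAIN_CURVES:
        in_band = low <= freq_hz < high
        at_top = high == top_edge and freq_hz == top_edge
        if in_band or at_top:
            return slope * snr_db + intercept
    raise ValueError("frequency outside the 125 to 4000 Hz gain bands")
```

For example, a mid band channel at 20 dB SNR gets 0.39 * 20 - 13 = -5.2 dB, while the same SNR in the low band gets -9.2 dB, reflecting the heavier suppression applied where road and wind noise concentrate.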
【００４９】
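An optional feature would provide the user of the device comprising the noise suppressor with a selection of the desired y-intercept. Thus, more noise suppression (a lower y-intercept) may be chosen at the expense of some voice degradation. Alternatively, the y-intercept may be variable as a function of some measure determined by noise suppressor 108. For example, more noise suppression (a lower y-intercept) may be desired when excessive noise energy is detected over a predetermined period of time. Alternatively, less noise suppression (a higher y-intercept) may be desired when a condition such as babble is detected. During a babble condition, background speakers are present, and less noise suppression may be warranted to prevent cutting out of the main speaker. Another optional feature would provide selectable slopes for the gain curves. Further, curves other than the lines described by equations (4) to (6) may be found more suitable for determining the gain factors under certain circumstances.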
【００５０】
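For each frame containing speech, a gain factor is determined for each of M frequency channels of the input signal, where M is the predetermined number of channels to be evaluated. A preferred embodiment evaluates sixteen channels (M = 16). Referring again to FIG. 3, the gain factors for channels whose frequency components fall within the low band are determined using the low band curve, those for channels within the mid band using the mid band curve, and those for channels within the high band using the high band curve.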
【００５１】
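For each channel evaluated, the channel SNR is used to derive the gain factor from the appropriate curve. In FIG. 2, the channel SNRs are shown to be evaluated by channel energy estimator 206b, noise energy estimator 214b, and SNR estimator 210b. For each frame of the input signal, channel energy estimator 206b generates energy estimates for each of the M channels of the transformed input signal and provides them to SNR estimator 210b. The channel energy estimates may be updated using the relationship of equation (1) above. If speech decision element 216 determines that no speech is present in the input signal, switch 218b is closed and noise energy estimator 214b updates the channel noise energy estimates. For each of the M channels, the updated noise energy estimate is based on the channel energy estimate determined by channel energy estimator 206b. The updated estimates may be evaluated using the relationship of equation (3) above. The channel noise estimates are provided to SNR estimator 210b. SNR estimator 210b thus determines the channel SNR estimates for each frame of speech based on the channel energy estimates for that frame and the channel noise energy estimates provided by noise energy estimator 214b.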
【００５２】
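Those skilled in the art will recognize that channel energy estimator 206a, noise energy estimator 214a, switch 218a, and SNR estimator 210a perform functions similar to those of channel energy estimator 206b, noise energy estimator 214b, switch 218b, and SNR estimator 210b, respectively. Thus, although shown as separate processing elements in FIG. 2, channel energy estimators 206a and 206b may be combined into one processing element, as may noise energy estimators 214a and 214b, switches 218a and 218b, and SNR estimators 210a and 210b. As a combined element, the channel energy estimator would determine channel energy estimates for both the N channels used for speech detection and the M channels used for determining the channel gain factors. Note that N = M is possible. Likewise, the noise energy estimator and the SNR estimator would operate on both the N channels and the M channels. The SNR estimator then provides the N SNR estimates to speech decision element 216 and the M SNR estimates to channel gain estimator 220.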
【００５３】
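The channel gain factors are provided by channel gain estimator 220 to gain adjuster 224. Gain adjuster 224 receives the FFT-transformed input signal from transform element 204. The gain of the transformed signal is adjusted in accordance with the channel gain factors. For example, in the embodiment described above in which M = 16, the transformed (FFT) points belonging to a particular one of the sixteen channels are adjusted based on the corresponding channel gain factor.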
【００５４】
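The gain-adjusted signal generated by gain adjuster 224 is then provided to inverse transform element 226, which in a preferred embodiment generates the inverse fast Fourier transform (IFFT) of the signal. If the frames of input were formed with overlapped samples, post processing element 228 adjusts the output signal for the overlap. Post processing element 228 also performs de-emphasis if the signal underwent pre-emphasis. De-emphasis attenuates the frequency components that were emphasized during pre-emphasis. The pre-emphasis/de-emphasis process effectively contributes to noise suppression by reducing the noise components lying outside the range of the processed frequency components.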
【００５５】
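The various processing blocks of the noise suppressor shown in FIG. 2 may be configured in a digital signal processor (DSP) or an application specific integrated circuit (ASIC). The description of the functionality of the present invention would enable one of ordinary skill in the art to implement the present invention in a DSP or an ASIC without undue experimentation.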
【００５６】
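Referring now to FIG. 4, a flow chart is shown illustrating some of the steps involved in the processing described with reference to FIGS. 2 and 3. Although shown as consecutive steps, those skilled in the art will recognize that the order of some of the steps may be interchanged.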
【００５７】
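The process begins at step 402. At step 404, transform element 204 transforms the input audio signal into a transformed signal, generally an FFT signal. At step 406, SNR estimator 210b determines the speech SNR for M channels of the input signal based on the channel energy estimates provided by channel energy estimator 206b and the channel noise energy estimates provided by noise energy estimator 214b. At step 408, channel gain estimator 220 determines the gain factors for the M channels of the input signal based on the frequency of the channels. Channel gain estimator 220 sets the gain at a minimum level if the frame of the input signal is found to contain no speech. Otherwise, a gain factor is determined for each of the M channels based on a predetermined function. For example, referring to FIG. 3, a function defined by line equations with fixed slopes and y-intercepts may be used, each line equation defining the gain for a predetermined frequency band. At step 410, gain adjuster 224 adjusts the gain of the M channels of the transformed signal using the M gain factors. At step 412, inverse transform element 226 inverse transforms the gain-adjusted transformed signal, producing the noise-suppressed audio signal.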
【００５８】
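At step 414, SNR estimator 210a determines the speech SNR for N channels of the input signal based on the channel energy estimates provided by channel energy estimator 206a and the channel noise energy estimates provided by noise energy estimator 214a. At step 416, rate decision element 212 determines the encoding rate for the input signal through analysis of the input signal. Alternatively, one or more mode measures, such as the NACF, may be determined. At step 418, speech decision element 216 determines whether speech is present in the input signal based on the SNR provided by SNR estimator 210a and the rate and/or mode measure(s) provided by rate decision element 212. At decision block 420, if speech is determined not to be present, the input signal is assumed to be entirely noise, and at step 422 the noise estimate is updated by noise energy estimator 214a. Noise energy estimator 214a updates the noise estimate based on the channel energy determined by channel energy estimator 206a. Whether or not speech is detected, the procedure continues with the processing of the next frame of the input signal.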
【００５９】
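The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of the inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.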
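[Brief description of the drawings]
[FIG. 1] A block diagram of a communication system utilizing a noise suppressor.
[FIG. 2] A block diagram of a noise suppressor in accordance with the present invention.
[FIG. 3] A graph of frequency-based gain factors for achieving noise suppression in accordance with the present invention.
[FIG. 4] A flow chart showing an exemplary embodiment of the processing steps involved in noise suppression as implemented by the processing elements of FIG. 2.
[0001]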
(Technical field)
The present invention relates to audio processing. More particularly, the present invention relates to a noise suppression system and method used for speech processing.
[0002]
(Background technology)
The transmission of voice by digital techniques has become widely used particularly in application fields such as cellular telephones and personal communication systems (PCS). This has also generated interest in improving speech processing techniques. One area that has been improved is the development of noise suppression techniques.
[0003]
Noise suppression in a voice communication system generally serves the purpose of improving the overall quality of a desired audio signal by filtering environmental background noise from the desired voice signal. This voice enhancement process is particularly necessary in environments with unusually high levels of ambient background noise, such as airplanes, moving vehicles, and noisy factories.
[0004]
One noise suppression technique is spectral subtraction, i.e., a technique that modifies the spectral gain. With this scheme, the input audio signal is divided into a plurality of frequency channels, and particular frequency channels are attenuated according to their noise energy content. A background noise estimate for each frequency channel is used to generate a speech signal-to-noise ratio (SNR) for that channel, and the SNR is used to calculate a gain factor for each channel. The gain factor then determines the amount of attenuation for the particular channel. The attenuated channels are recombined to produce the noise-suppressed output signal.
[0005]
In specialized applications involving relatively high background noise environments, most noise suppression techniques exhibit significant performance limitations. One example of such an application is the vehicle speakerphone option of a cellular mobile communication system. The speakerphone option provides hands-free operation for the driver of the automobile. The hands-free microphone is typically located at some distance from the user, for example mounted overhead on the visor. The distant microphone delivers a poor SNR to the land-end party because of road and wind noise. Although the speech received at the land end is usually intelligible, continuous exposure to such background noise levels often increases listener fatigue.
[0006]
In order for the noise suppression system to function properly, it is important to accurately determine the SNR of the speech. However, due to the limitations of currently available noise detectors, it is difficult to accurately determine the SNR of a speech signal. Spectral subtraction techniques update the background noise estimate during periods when speech is absent. In the absence of speech, the measured spectral energy is due to noise, so the noise estimate is updated based on the measured spectral energy. Therefore, it is important to obtain accurate noise energy for calculating the SNR by distinguishing between the presence period and the absence period of speech.
[0007]
One exemplary technique for speech detection performs the noise update decision using a voice metric calculator. A voice metric is a measurement of the overall voice-like characteristics of the channel energy. First, raw SNR estimates are used to index a voice metric table to obtain a voice metric value for each channel. The individual channel voice metric values are summed to form an energy parameter, which is compared with a background noise update threshold. If the voice metric sum meets or exceeds the threshold, the signal is said to contain speech. If the voice metric sum is below the threshold, the input frame is deemed to be noise and a background noise update is performed. However, in the case of high background noise, a sudden noise burst, or an increasing noise source, the SNR measurements become large, producing high voice metric values and thereby blocking the noise estimate update.
[0008]
A refinement of the voice metric calculator technique measures the channel energy deviation. This method assumes that noise exhibits constant spectral energy over time, whereas speech exhibits variable spectral energy over time. The channel energy is therefore integrated over time; speech is detected when the channel energy deviation is substantial, and noise is detected when there is little channel energy deviation. A speech detector that measures channel energy deviation detects a sudden increase in the noise level. However, the channel energy deviation method gives inaccurate results when the input speech signal has constant energy. Furthermore, in the case of an increasing noise source, the changing input energy produces a large energy deviation, which blocks the noise estimate update even when an update is needed.
[0009]
In addition to an accurate speech detector, the noise suppression system must adjust the channel gain appropriately. The channel gain should be adjusted so that noise is suppressed without sacrificing voice quality. One way to adjust the channel gain is to calculate the gain as a function of the overall noise estimate and the SNR of the speech signal. In general, as the overall noise estimate increases, the gain factor for a given SNR decreases. A low gain factor indicates a high attenuation factor. This technique imposes a minimum gain value to prevent over-attenuation of the channel when the overall noise estimate is very high. Using a hard-clamped minimum gain value introduces a tradeoff between noise suppression and voice quality. When the clamp is relatively low, noise suppression improves but voice quality degrades. When the clamp is relatively high, noise suppression degrades but voice quality improves.
[0010]
In order to provide an improved noise suppression system, it is necessary to point out the limitations of current techniques for speech detection and channel gain calculation. These problems and defects are solved by the present invention as described below.
[0011]
    (Disclosure of the Invention)
The present invention is a system and method for noise suppression used in speech processing systems. An object of the present invention is to provide a speech detector that determines the presence of speech in an input signal. A reliable speech detector is needed to determine the signal-to-noise ratio (SNR) of the speech accurately. When speech is determined to be absent, the input signal is assumed to consist entirely of noise, and the noise energy is measured. The noise energy is then used to determine the SNR. Another object of the present invention is to provide an improved gain measurement element for suppressing noise.
[0012]
In accordance with the present invention, the noise suppression system comprises a speech detector that determines whether speech is present in a frame of the input signal. The determination is based on an SNR measure of the speech in the input signal. An SNR estimator estimates the SNR based on a signal energy estimate generated by an energy estimator and a noise energy estimate generated by a noise energy estimator. The determination is also based on the encoding rate of the input signal. In a variable rate communication system, each input frame is assigned an encoding rate selected from a predetermined set of rates based on the content of the input frame. Because the rate generally depends on the level of speech activity, frames containing speech are assigned a high rate, whereas frames not containing speech are assigned a low rate. Further, the determination may be based on one or more mode measures that describe characteristics of the input signal. If speech is determined not to be present in the input frame, the noise energy estimator updates the noise energy estimate.
[0013]
A channel gain estimator determines the gain for the frame of the input signal. If speech is not present in the frame, the gain is set to a predetermined minimum. If speech is present, the gain is determined based on the frequency content of the frame. In a preferred embodiment, a gain factor is determined for each of a predefined set of frequency channels. For each channel, the gain is determined in accordance with the SNR of the speech in that channel and is defined using a function suited to the characteristics of the frequency band in which the channel lies. Typically, for a predefined frequency band, the gain is set to increase linearly with increasing SNR. In addition, the minimum gain for each frequency band may be adjustable based on environmental characteristics; for example, a user-selectable minimum gain may be implemented. The channel SNR is based on a channel energy estimate generated by the energy estimator and a channel noise energy estimate generated by the noise energy estimator. The gain factors are used to adjust the gain of the signal in the various channels, and the gain-adjusted channels are combined to produce the noise-suppressed output signal.
[0014]
(Best Mode for Carrying Out the Invention)
The features, objects and advantages of the present invention will become apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference characters identify like elements throughout.
[0015]
In a voice communication system, a noise suppressor is usually used to suppress undesirable environmental background noise. Most noise suppressors operate to estimate the background noise characteristics of the input data signal in one or more frequency bands and subtract the average of this estimate from this input signal. The average background noise estimate is updated during the absence of speech. A noise suppressor must accurately determine the background noise level in order to operate correctly. In addition, the noise suppression level must be adjusted correctly based on the characteristics of the voice and noise of the input signal. These requirements are handled by the noise suppression system of the present invention.
[0016]
An exemplary speech processing system 100 in which the present invention is implemented is shown in FIG. 1. System 100 comprises a microphone 102, an A/D converter 104, a speech processor 106, a transmitter 110, and an antenna 112. Microphone 102 may be located in a cellular telephone together with the other elements shown in FIG. 1. Alternatively, microphone 102 may be the hands-free microphone of the vehicle speakerphone option of a cellular communication system. The vehicle speakerphone assembly is sometimes referred to as a car kit. Noise suppression is particularly important when microphone 102 is part of a car kit: because the hands-free microphone is generally positioned at some distance from the user, the received acoustic signal tends to have a poor SNR owing to road and wind noise.
[0017]
Still referring to FIG. 1, an input audio signal comprising speech and/or background noise is received by microphone 102. Microphone 102 transduces the input audio signal into an electro-acoustic signal denoted s(t). The electro-acoustic signal may be converted from an analog signal into pulse code modulated (PCM) samples by A/D converter 104. In an exemplary embodiment, PCM samples are output by A/D converter 104 at 64 kbps and are denoted as signal s(n), as shown in FIG. 1. Digital signal s(n) is received by speech processor 106, which comprises, among other elements, noise suppressor 108. Noise suppressor 108 suppresses the noise in signal s(n) in accordance with the present invention. In a car kit application, noise suppressor 108 measures the level of background environmental noise and adjusts the gain of the signal to mitigate the effects of that noise. In addition to noise suppressor 108, speech processor 106 generally comprises a voice coder, or vocoder (not shown), which compresses speech by extracting parameters that relate to a model of human speech generation. Speech processor 106 also comprises an echo canceller (not shown), which eliminates the acoustic echo resulting from feedback between a speaker (not shown) and microphone 102.
[0018]
Following processing by speech processor 106, the signal is provided to transmitter 110, which performs modulation in accordance with a predetermined format such as code division multiple access (CDMA), time division multiple access (TDMA), or frequency division multiple access (FDMA). In the exemplary embodiment, transmitter 110 modulates the signal in accordance with a CDMA format as described in U.S. Pat. No. 4,901,307, entitled "SPREAD SPECTRUM MULTIPLE ACCESS COMMUNICATION SYSTEM USING SATELLITE OR TERRESTRIAL REPEATERS", which is assigned to the assignee of the present invention and incorporated herein by reference. Transmitter 110 then upconverts and amplifies the modulated signal, which is transmitted through antenna 112.
[0019]
  It should be appreciated that the noise suppressor 108 may be embodied in speech processing systems other than the system 100 of FIG. 1. For example, the noise suppressor 108 may be used in an e-mail application that offers a voice mail option. In such an application, the transmitter 110 and antenna 112 of FIG. 1 are not necessary. Instead, the noise-suppressed signal is formatted by the speech processor 106 for transmission over the e-mail network.
[0020]
  One exemplary embodiment of the noise suppressor 108 is shown in FIG. 2. The input audio signal is received by the preprocessor 202, as shown in FIG. 2. The preprocessor 202 prepares the input signal for noise suppression by performing preemphasis and frame generation. Preemphasis redistributes the power spectral density of the speech signal by emphasizing the high frequency speech components of the signal. By performing what is essentially a high-pass filtering function, preemphasis emphasizes the important speech components and thereby improves the SNR of these components in the frequency domain. The preprocessor 202 also generates frames from the samples of the input signal. In a preferred embodiment, 10 ms frames of 80 samples per frame are generated. The frames may have overlapping samples to improve processing accuracy, and may be generated by windowing and zero padding the samples of the input signal. The preprocessed signal is output to the transform element 204. In a preferred embodiment, the transform element 204 generates a 128-point Fast Fourier Transform (FFT) for each frame of the input signal. It should be understood, however, that alternative schemes may be used to analyze the frequency components of the input signal.
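The frame generation described above can be illustrated with a minimal Python sketch. It assumes non-overlapping frames and, by default, a rectangular window; the function name `make_frames` and the optional `window` argument are hypothetical and are not taken from the specification, which also permits overlapping frames.

```python
def make_frames(samples, frame_len=80, fft_len=128, window=None):
    """Cut samples into 80-sample (10 ms at 8 kHz) frames, zero-padded to 128 points.

    Assumptions (not from the specification): frames do not overlap, and the
    window defaults to rectangular (all ones).
    """
    if window is None:
        window = [1.0] * frame_len
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        windowed = [float(x) * w for x, w in zip(chunk, window)]
        # Zero padding brings the frame up to the FFT length.
        frames.append(windowed + [0.0] * (fft_len - frame_len))
    return frames
```

Each returned frame has 128 entries, so it can be passed directly to a 128-point transform.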
[0021]
  The transformed components are provided to the channel energy estimator 206a, which generates an energy estimate for each of N channels of the transformed signal. For each channel, one technique for updating the channel energy estimate smooths the current channel energy against the channel energy estimate of the previous frame, as follows:
        E_u(t) = α E_ch + (1 - α) E_u(t-1)    (1)
  Here, the updated estimate E_u(t) is defined as a function of the current channel energy E_ch and the previous channel energy estimate E_u(t-1). The exemplary embodiment sets α = 0.55.
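As a minimal sketch, equation (1) can be written as a one-line first-order smoothing step. The function name `smooth_channel_energy` is hypothetical; only the relation itself and the value α = 0.55 come from the description.

```python
def smooth_channel_energy(current_energy, previous_estimate, alpha=0.55):
    """Equation (1): E_u(t) = alpha * E_ch + (1 - alpha) * E_u(t-1)."""
    return alpha * current_energy + (1.0 - alpha) * previous_estimate


# Feeding a constant channel energy moves the estimate toward that energy.
estimate = 0.0
for _ in range(3):
    estimate = smooth_channel_energy(10.0, estimate)
```

With α = 0.55 the estimate weights the current frame slightly more heavily than the history, so it tracks changes in channel energy within a few frames.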
[0022]
In a preferred embodiment, N = 2, so that an energy estimate is determined for a low frequency channel and for a high frequency channel. The low frequency channel corresponds to the frequency range 250-2250 Hz, and the high frequency channel corresponds to the frequency range 2250-3500 Hz. The current channel energy of the low frequency channel may be determined by summing the energy of the FFT points corresponding to 250-2250 Hz, and the current channel energy of the high frequency channel may be determined by summing the energy of the FFT points corresponding to 2250-3500 Hz.
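The summation of FFT point energies can be sketched as follows. The sketch assumes an 8 kHz sampling rate (implied by 80 samples per 10 ms frame), so a 128-point FFT has a bin spacing of 62.5 Hz. Treating each band as half-open (lower edge included, upper edge excluded) is an assumption made here so that the 2250 Hz bin is not counted twice; the name `band_energy` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # 80 samples per 10 ms frame
FFT_LEN = 128


def band_energy(spectrum, lo_hz, hi_hz, sample_rate=SAMPLE_RATE_HZ, fft_len=FFT_LEN):
    """Sum |X[k]|^2 over the FFT bins whose frequency lies in [lo_hz, hi_hz)."""
    bin_hz = sample_rate / fft_len
    total = 0.0
    for k in range(fft_len // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        if lo_hz <= k * bin_hz < hi_hz:
            total += abs(spectrum[k]) ** 2
    return total


# A flat spectrum of unit magnitude makes the bin counts visible.
flat = [1 + 0j] * FFT_LEN
low = band_energy(flat, 250, 2250)    # bins 4..35
high = band_energy(flat, 2250, 3500)  # bins 36..55
```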
[0023]
  These energy estimates are provided to the speech detector 208, which determines whether speech is present in the received audio signal. The SNR estimator 210a of the speech detector 208 receives the energy estimates. The SNR estimator 210a determines the signal-to-noise ratio (SNR) of the speech in each of the N channels based on both the channel energy estimates and the channel noise energy estimates. The channel noise energy estimates are provided by the noise energy estimator 214a and generally correspond to the estimated noise energy smoothed over previous frames that do not contain speech.
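The description does not give the SNR formula itself. One common choice, assumed here purely for illustration, is the ratio of channel energy to noise energy expressed in decibels, which is consistent with the dB thresholds used later in the description. The name `channel_snr_db` and the `floor` guard against division by zero are hypothetical.

```python
import math


def channel_snr_db(channel_energy, noise_energy, floor=1e-12):
    """Assumed form: SNR = 10 * log10(E_ch / E_n), with a small floor on both terms."""
    return 10.0 * math.log10(max(channel_energy, floor) / max(noise_energy, floor))
```

For example, a channel energy 100 times the noise energy gives 20 dB under this assumed definition.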
[0024]
  The speech detector 208 also comprises a rate determination element 212, which selects the data rate of the input signal from a predetermined set of data rates. In some communication systems, data is encoded so that the data rate may vary from one frame to another. This is known as a variable rate communication system. A voice coder that encodes data based on a variable rate scheme is typically called a variable rate vocoder. An exemplary embodiment of a variable rate vocoder is described in U.S. Pat. No. 5,414,796, entitled "VARIABLE RATE VOCODER," assigned to the assignee of the present invention and incorporated herein by reference. The use of a variable rate communication channel eliminates unnecessary transmissions when there is no useful speech to be transmitted. Algorithms are used within the vocoder to generate a varying number of information bits in each frame in accordance with variations in speech activity. For example, a vocoder with a set of four rates may produce 20 millisecond data frames containing 16, 40, 80, or 171 information bits, depending on the activity of the speaker. It is desirable to transmit each data frame in a fixed amount of time by varying the transmission rate of the communication.
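As a worked example of the arithmetic behind the four-rate case, the information bit rate for each frame size follows from the 20 ms frame duration (50 frames per second). The names below are hypothetical, and the figures count information bits only, not any overhead bits a real system may add.

```python
FRAME_MS = 20
FRAMES_PER_SECOND = 1000 // FRAME_MS  # 50 frames per second


def info_bit_rate(bits_per_frame):
    """Information bits per second for a given number of bits per 20 ms frame."""
    return bits_per_frame * FRAMES_PER_SECOND


rates_bps = [info_bit_rate(bits) for bits in (16, 40, 80, 171)]
```

The lowest rate therefore carries 800 information bits per second and the highest 8550.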
[0025]
  Because the rate of a frame depends on the speech activity during that time frame, determining the rate provides information on whether speech is present. In a system that uses variable rates, a determination that a frame should be encoded at the highest rate generally indicates the presence of speech, while a determination that a frame should be encoded at the lowest rate generally indicates the absence of speech. Intermediate rates generally indicate transitions between the presence and the absence of speech.
[0026]
  The rate determination element 212 may implement any of a number of rate determination algorithms. One such rate determination algorithm is disclosed in copending U.S. patent application Ser. No. 08/286,842, entitled "METHOD AND APPARATUS FOR PERFORMING REDUCED RATE VARIABLE RATE VOCODING," assigned to the assignee of the present invention and incorporated herein by reference. This technique provides a set of rate determination criteria referred to as mode measures. The first mode measure is the target matching signal-to-noise ratio (TMSNR) from the previous encoded frame, which provides information on how well the encoding model is performing by comparing the synthesized speech signal with the input speech signal. The second mode measure is the normalized autocorrelation function (NACF), which measures the periodicity of the speech frame. The third mode measure is the zero crossings (ZC) parameter, which measures the high frequency content of the input speech frame. The fourth measure, the prediction gain differential (PGD), determines whether the encoder is maintaining its prediction efficiency. The fifth measure is the energy differential (ED), which compares the energy in the current frame with the average frame energy. Using these mode measures, the rate determination logic selects the encoding rate for the input frame.
[0027]
Although the rate determination element 212 is shown in FIG. 2 as being included in the noise suppressor 108, it should be understood that the rate information may instead be provided to the noise suppressor 108 by another component of the speech processor 106 (FIG. 1). For example, the speech processor 106 may comprise a variable rate vocoder (not shown) that determines the encoding rate for each frame of the input signal. Instead of having the noise suppressor 108 determine the rate independently, the rate information may be provided to the noise suppressor 108 by the variable rate vocoder.
[0028]
It should also be understood that instead of determining the rate to determine the presence of speech, the speech detector 208 may use a subset of mode measures that contribute to rate determination. For example, a NACF element (not shown) may be used in place of the rate determination element 212, which measures the periodicity of a speech frame, as already described. NACF is evaluated according to the following relation:
[Expression 1]

[0029]
Here, N is the number of samples in the speech frame, and t₁ and t₂ are the boundaries within the T samples over which the NACF is evaluated. The NACF is evaluated based on the formant residual signal e(n). Formant frequencies are the resonance frequencies of speech. A short-term filter is used to filter the speech signal to obtain the formant frequencies. The residual signal obtained after filtering by the short-term filter is the formant residual signal, which contains the long-term speech information of the signal, such as the pitch.
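The NACF relation itself is not reproduced in this text, so the following Python sketch rests on an assumption: it takes the NACF to be the largest value, over lags T from t₁ to t₂, of the cross-correlation of e(n) with e(n-T) normalized by the average of the two frame energies. This is one common form of normalized autocorrelation and may differ from the relation in the original. The function name `nacf` and the buffer layout are hypothetical.

```python
import math


def nacf(residual, frame_len, t1, t2):
    """Peak normalized autocorrelation of the last frame_len samples over lags t1..t2.

    residual holds earlier samples followed by the current frame, so that
    e(n - T) is available for every lag T up to t2. The normalization used
    here is an assumption, not taken from the specification.
    """
    start = len(residual) - frame_len
    if start < t2:
        raise ValueError("need at least t2 samples of history before the frame")
    best = 0.0  # returned when the residual carries no energy
    found = False
    for lag in range(t1, t2 + 1):
        cross = energy_now = energy_lagged = 0.0
        for n in range(start, len(residual)):
            cross += residual[n] * residual[n - lag]
            energy_now += residual[n] ** 2
            energy_lagged += residual[n - lag] ** 2
        norm = 0.5 * (energy_now + energy_lagged)
        if norm > 0.0:
            value = cross / norm
            if not found or value > best:
                best, found = value, True
    return best


# A residual with a 40-sample period correlates almost perfectly at lag 40.
periodic = [math.sin(2 * math.pi * n / 40) for n in range(240)]
peak = nacf(periodic, frame_len=80, t1=20, t2=120)
```

Under this normalization the value cannot exceed 1, and a strongly periodic residual yields a peak close to 1.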
[0030]
The NACF mode measure is well suited to determining the presence of speech because the periodicity of a signal that contains voiced speech differs from that of a signal that does not. Voiced speech tends to be characterized by periodic components. When voiced speech is not present, the signal generally has no periodic components. Thus, the NACF measure is a good indicator for use by the speech detector 208.
[0031]
The speech detector 208 may use a measure such as NACF instead of the rate determination result in situations where generating a rate determination result is not practical. For example, if a rate determination result is not available from a variable rate vocoder and the noise suppressor 108 does not have the processing power to generate its own rate determination result, a mode measure such as NACF provides a desirable alternative. This may be the case, for example, in car kit applications, where processing power is generally limited.
[0032]
In addition, it should be understood that the speech detector 208 may base its decision regarding the presence of speech on the rate determination result alone, on the mode measures alone, or on the SNR estimates alone. Although additional measures should improve the accuracy of the decision, any one of these measures alone may give adequate results.
[0033]
The rate determination result (or the mode measures) and the SNR estimates generated by the SNR estimator 210a are provided to the speech decision element 216. Based on these inputs, the speech decision element 216 decides whether speech is present in the input signal. The decision on the presence of speech determines whether the noise energy estimate is updated. The noise energy estimate is used by the SNR estimator 210a to determine the SNR of the speech in the input signal. This SNR is in turn used to compute the level of attenuation applied to the input signal for noise suppression. If it is determined that speech is present, the speech decision element 216 opens the switch 218a to prevent the noise energy estimator 214a from updating the noise energy estimate. If it is determined that speech is not present, the input signal is presumed to be noise, and the speech decision element 216 closes the switch 218a, causing the noise energy estimator 214a to update the noise estimate. Although shown as the switch 218a in FIG. 2, it should be understood that an enable signal provided by the speech decision element 216 to the noise energy estimator 214a may perform the same function.
[0034]
In a preferred embodiment where two channel SNRs are evaluated, the speech decision element 216 generates the noise update decision based on the following procedure:

The channel SNR estimates provided by the SNR estimator 210a are denoted chsnr1 and chsnr2. The rate of the input signal, provided by the rate determination element 212, is denoted rate. A counter, rate count, tracks the number of frames that satisfy certain conditions, as described below.
[0035]
  The speech decision element 216 determines that speech is not present, and that the noise estimate should be updated, if the rate is the minimum rate of the variable rates, chsnr1 is greater than threshold T1 or chsnr2 is greater than threshold T2, and the rate count is greater than threshold T3. If the rate is the minimum, and chsnr1 is greater than T1 or chsnr2 is greater than T2, but the rate count is not greater than T3, then the rate count is incremented by one but the noise estimate is not updated. The counter, rate count, counts the number of frames that have the minimum rate yet also have high energy in at least one of the channels, and thereby detects a sudden increase in the noise level or an increasing noise source. The counter provides an indication that a high-SNR signal does not contain speech, and is set to count until speech is detected in the signal. A preferred embodiment, in which 10 ms frames are evaluated, sets T1 = T2 = 5 dB and T3 = 100 frames.
[0036]
If the rate is minimum, chsnr1 is less than T1, and chsnr2 is less than T2, speech decision element 216 determines that there is no speech and therefore the noise estimate should be updated. In addition, the rate count is reset to zero.
[0037]
If the rate is not the minimum value, the speech decision element 216 determines that the frame contains speech and therefore the noise estimate should not be updated, and the rate count is reset to zero.
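The rate-based procedure of the preceding paragraphs can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class and function names are hypothetical, and the 100-frame value is read as the counter threshold T3 (the 5 dB value applying to T1 and T2).

```python
from dataclasses import dataclass

T1_DB = 5.0      # SNR threshold for channel 1
T2_DB = 5.0      # SNR threshold for channel 2
T3_FRAMES = 100  # counter threshold, read here as T3


@dataclass
class RateDecisionState:
    rate_count: int = 0


def should_update_noise(is_min_rate, chsnr1, chsnr2, state):
    """Return True when the noise estimate should be updated for this frame."""
    if not is_min_rate:
        # The frame is taken to contain speech: no update, counter reset.
        state.rate_count = 0
        return False
    if chsnr1 > T1_DB or chsnr2 > T2_DB:
        # Minimum rate but high SNR: wait until the condition has persisted.
        if state.rate_count > T3_FRAMES:
            return True
        state.rate_count += 1
        return False
    # Minimum rate and low SNR in both channels: plain noise frame.
    state.rate_count = 0
    return True
```

With these values, a sudden jump in noise level at the minimum rate is accepted as noise only after the high-SNR condition has held for just over one second of 10 ms frames.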
[0038]
  It should be recalled that instead of using rate measures to determine the presence of speech, mode measures such as NACF measures can be used. The speech decision element 216 may utilize the NACF measure to determine the presence of speech, and therefore the noise update decision is performed according to the following procedure:
  if (pitchPresent == FALSE)
      if ((chsnr1 > TH1) or (chsnr2 > TH2))
          if (pitchCount > TH3)
              update noise estimate
          else
              pitchCount++
      else
          update noise estimate
          pitchCount = 0
  else
      pitchCount = 0
where pitchPresent is defined as:
  if (NACF > TT1)
      pitchPresent = TRUE
      NACFcount = 0
  else if (TT2 ≤ NACF ≤ TT1)
      if (NACFcount > TT3)
          pitchPresent = TRUE
      else
          pitchPresent = FALSE
          NACFcount++
  else
      pitchPresent = FALSE
      NACFcount = 0
  Again, the channel SNR estimates provided by the SNR estimator 210a are denoted chsnr1 and chsnr2. A NACF element (not shown) generates pitchPresent, a measure indicating the presence of pitch, as defined above. A counter, pitchCount, tracks the number of frames that satisfy certain conditions, as described below.
[0039]
The measure pitchPresent indicates that pitch is present if the NACF is greater than threshold TT1. Pitch is also determined to be present if the NACF falls within the intermediate range (TT2 ≤ NACF ≤ TT1) for a number of frames greater than threshold TT3. The counter, NACFcount, tracks the number of frames for which the following condition holds:
    TT2 ≤ NACF ≤ TT1
[0040]
In a preferred embodiment, in which 10 ms frames are evaluated, TT1 = 0.6, TT2 = 0.4, and TT3 = 8 frames.
[0041]
  The speech decision element 216 determines that speech is not present, and that the noise estimate should therefore be updated, if the pitchPresent measure indicates that no pitch is present (pitchPresent = FALSE), chsnr1 is greater than threshold TH1 or chsnr2 is greater than threshold TH2, and pitchCount is greater than threshold TH3. If pitchPresent = FALSE, and chsnr1 is greater than TH1 or chsnr2 is greater than TH2, but pitchCount is not greater than TH3, then pitchCount is incremented by one but the noise estimate is not updated. The counter, pitchCount, is used to detect a sudden increase in the noise level or an increasing noise source. A preferred embodiment, in which 10 ms frames are evaluated, sets TH1 = TH2 = 5 dB and TH3 = 100 frames.
[0042]
If pitchPresent indicates that no pitch is present, chsnr1 is less than TH1, and chsnr2 is less than TH2, the speech decision element 216 determines that speech is not present and that the noise estimate should therefore be updated. In addition, pitchCount is reset to zero.
[0043]
If pitchPresent indicates that pitch is present (pitchPresent = TRUE), the speech decision element 216 determines that the frame contains speech and that the noise estimate should therefore not be updated. In addition, pitchCount is reset to zero.
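The NACF-based procedure can likewise be sketched as runnable Python. The names `PitchState`, `pitch_present`, and `should_update_noise_nacf` are hypothetical, and the values TH1 = TH2 = 5 dB and TH3 = 100 frames reflect one reading of the thresholds given in the description.

```python
from dataclasses import dataclass

TT1, TT2, TT3 = 0.6, 0.4, 8       # NACF thresholds and frame count
TH1_DB, TH2_DB = 5.0, 5.0          # channel SNR thresholds
TH3_FRAMES = 100                   # counter threshold, read here as TH3


@dataclass
class PitchState:
    nacf_count: int = 0
    pitch_count: int = 0


def pitch_present(nacf_value, state):
    """Pitch is present for a high NACF, or a mid-range NACF that has persisted."""
    if nacf_value > TT1:
        state.nacf_count = 0
        return True
    if nacf_value >= TT2:
        if state.nacf_count > TT3:
            return True
        state.nacf_count += 1
        return False
    state.nacf_count = 0
    return False


def should_update_noise_nacf(nacf_value, chsnr1, chsnr2, state):
    """Return True when the noise estimate should be updated for this frame."""
    if pitch_present(nacf_value, state):
        state.pitch_count = 0
        return False
    if chsnr1 > TH1_DB or chsnr2 > TH2_DB:
        if state.pitch_count > TH3_FRAMES:
            return True
        state.pitch_count += 1
        return False
    state.pitch_count = 0
    return True
```

Keeping both counters in one state object lets the caller process frames one at a time without global variables.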
[0044]
If it is determined that speech is not present, the switch 218a is closed and the noise energy estimator 214a updates the noise estimate. The noise energy estimator 214a generally generates a noise energy estimate for each of the N channels of the input signal. Since no speech is present, all of the energy is presumed to be due to noise. For each channel, the updated noise energy is estimated as the current channel energy smoothed against the noise energy estimate of previous frames that do not contain speech. For example, the updated estimate may be obtained from the following relation:
E_n(t) = β E_ch + (1 - β) E_n(t-1)    (3)
Here, the updated estimate E_n(t) is defined as a function of the current channel energy E_ch and the previous channel noise energy estimate E_n(t-1). In an exemplary embodiment, β = 0.1. The updated channel noise energy estimates are provided to the SNR estimator 210a, where they are used to obtain updated channel SNR estimates for the next frame of the input signal.
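A minimal sketch of the gated update of equation (3) follows. The boolean argument stands in for the switch 218a (or an equivalent enable signal); the function name `update_noise_energy` is hypothetical.

```python
def update_noise_energy(noise_energy, channel_energy, speech_present, beta=0.1):
    """Equation (3) per channel, applied only when no speech is present.

    noise_energy and channel_energy are per-channel lists of equal length.
    """
    if speech_present:
        # Switch open: the previous noise estimates are kept unchanged.
        return list(noise_energy)
    return [beta * e_ch + (1.0 - beta) * e_n
            for e_ch, e_n in zip(channel_energy, noise_energy)]
```

The small value β = 0.1 makes the noise estimate change slowly, so a single atypical frame has limited influence.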
[0045]
  The decision regarding the presence of speech is also provided to the channel gain estimator 220. The channel gain estimator 220 determines the gain, and thus the level of noise suppression, for the frame of the input signal. If the speech decision element 216 determines that speech is not present, the gain for the frame is set to a predetermined minimum gain level. Otherwise, the gain is determined as a function of frequency. In a preferred embodiment, the gain is computed based on the graph shown in FIG. 3. Although shown in graphical form in FIG. 3, it should be understood that the function shown in FIG. 3 may be implemented as a look-up table in the channel gain estimator 220.
[0046]
In FIG. 3, it can be seen that the preferred embodiment of the present invention defines a separate gain curve for each of L frequency bands. Although three bands (L = 3) are shown in FIG. 3, L may be any number greater than or equal to one. Thus, the gain factor for a channel in the low band is determined using the low band curve, the gain factor for a channel in the mid band is determined using the mid band curve, and the gain factor for a channel in the high band is determined using the high band curve.
[0047]
Noise suppression may be performed using only one gain curve (L = 1) for the input signal, but it has been found that using multiple bands results in less degradation of voice quality. For environmental noise such as road and wind noise, the energy of the noise signal is higher at lower frequencies and generally decreases as the frequency increases.
[0048]
  In FIG. 3, a linear equation with a fixed slope and y-intercept is used to determine the gain factor for each band. The determination of the gain factor can be described by the following relations:
  Gain[low band] (dB) = slope1 * SNR + low band y-intercept    (4)
  Gain[mid band] (dB) = slope2 * SNR + mid band y-intercept    (5)
  Gain[high band] (dB) = slope3 * SNR + high band y-intercept    (6)
  The preferred embodiment designates the low band as 125-375 Hz, the mid band as 375-2625 Hz, and the high band as 2625-4000 Hz. The slopes and y-intercepts are determined experimentally. The preferred embodiment uses the same slope, 0.39, for each of the three bands, although a different slope may be used for each frequency band. Further, the low band y-intercept is set to -17 dB, the mid band y-intercept is set to -13 dB, and the high band y-intercept is set to -13 dB.
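Equations (4) to (6) with the preferred-embodiment values can be sketched as a small lookup. The function name `gain_db` is hypothetical, and assigning a frequency that falls exactly on a band edge to the higher band is an assumption, since the description does not say which band owns the edge.

```python
SLOPE = 0.39  # same slope for all three bands in the preferred embodiment

# (upper band edge in Hz, y-intercept in dB) for the low, mid, and high bands
BANDS = ((375.0, -17.0), (2625.0, -13.0), (4000.0, -13.0))


def gain_db(freq_hz, snr_db, slope=SLOPE):
    """Gain in dB for a channel at freq_hz with the given SNR, per equations (4)-(6)."""
    if not 125.0 <= freq_hz <= 4000.0:
        raise ValueError("frequency outside the 125-4000 Hz range of the gain curves")
    for upper_hz, intercept_db in BANDS:
        # Band-edge frequencies go to the higher band; 4000 Hz stays in the high band.
        if freq_hz < upper_hz or upper_hz == BANDS[-1][0]:
            return slope * snr_db + intercept_db
```

For example, a low band channel at 10 dB SNR receives about -13.1 dB of gain, while a mid band channel at the same SNR receives about -9.1 dB, so the low band is attenuated more heavily.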
[0049]
  An optional feature would allow the user of a device comprising the noise suppressor to select the desired y-intercept. Thus, more noise suppression (a lower y-intercept) may be selected at the expense of some speech degradation. Alternatively, the y-intercept may be variable as a function of a measure determined by the noise suppressor 108. For example, if excessive noise energy is detected over a given period, more noise suppression (a lower y-intercept) may be desirable. Alternatively, if a condition such as babble is detected, less noise suppression (a higher y-intercept) may be desirable. During a babble condition, background speakers may be present, and less noise suppression may be warranted to prevent cut-out of the main speaker. Another optional feature would provide a selectable slope for the gain curves. Furthermore, it may be found that curves other than the straight lines described by equations (4)-(6) are more suitable for determining the gain factor under certain circumstances.
[0050]
For each frame containing speech, a gain factor is determined for each of the M frequency channels of the input signal, where M is a predetermined number of channels to be evaluated. In the preferred embodiment, 16 channels (M = 16) are evaluated. Referring again to FIG. 3, the gain factor for channels having frequency components within the low band range is determined using a low band curve. The gain factor for channels having frequency components within the midband is determined using the midband curve. The gain factor for channels having frequency components in the high band range is determined using a high band curve.
[0051]
For each channel evaluated, the channel SNR is used to derive a gain factor based on the appropriate curve. In FIG. 2, the channel SNR is shown as being evaluated by the channel energy estimator 206b, the noise energy estimator 214b, and the SNR estimator 210b. For each frame of the input signal, the channel energy estimator 206b generates an energy estimate for each of the M channels of the transformed input signal and provides the energy estimates to the SNR estimator 210b. The channel energy estimates may be updated using the relation of equation (1) above. If the speech decision element 216 determines that speech is not present in the input signal, the switch 218b is closed and the noise energy estimator 214b updates the channel noise energy estimates. For each of the M channels, the updated noise energy estimate is based on the channel energy estimate determined by the channel energy estimator 206b. The updated estimate may be evaluated using the relation of equation (3) above. The channel noise energy estimates are provided to the SNR estimator 210b. Thus, the SNR estimator 210b determines a channel SNR estimate for each speech frame based on the channel energy estimate for that frame and the channel noise energy estimate provided by the noise energy estimator 214b.
[0052]
A person skilled in the art will appreciate that the channel energy estimator 206a, noise energy estimator 214a, switch 218a, and SNR estimator 210a perform functions similar to those of the channel energy estimator 206b, noise energy estimator 214b, switch 218b, and SNR estimator 210b, respectively. Thus, although shown as separate processing elements in FIG. 2, the channel energy estimators 206a and 206b may be combined as one processing element, the noise energy estimators 214a and 214b may be combined as one processing element, the switches 218a and 218b may be combined as one processing element, and the SNR estimators 210a and 210b may be combined as one processing element. As a combined element, the channel energy estimator would determine channel energy estimates for both the N channels used for speech detection and the M channels used to determine the channel gain factors. Note that N = M is possible. Similarly, the combined noise energy estimator and SNR estimator would operate on both the N and the M channels. The SNR estimator would then provide N SNR estimates to the speech decision element 216 and M SNR estimates to the channel gain estimator 220.
[0053]
The channel gain factors are provided by the channel gain estimator 220 to the gain adjuster 224. The gain adjuster 224 also receives the FFT-transformed input signal from the transform element 204. The gain of the transformed signal is adjusted according to the channel gain factors. For example, in the above embodiment where M = 16, the transformed (FFT) points belonging to a particular one of the 16 channels are adjusted based on the gain factor for that channel.
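The gain adjustment can be sketched as a scaling of the FFT points that belong to each channel. The description does not give the bin boundaries of the M channels, so `channel_bins` below is a hypothetical list of (first bin, last bin) pairs. Converting a dB gain to a scale factor on the spectrum with 10^(dB/20) is also an assumption; a design that applies the gain to power would use 10^(dB/10).

```python
def apply_channel_gains(spectrum, channel_bins, gains_db):
    """Scale the FFT points of each channel by that channel's gain.

    channel_bins: hypothetical (first_bin, last_bin) pairs within 0..len/2.
    gains_db: one gain in dB per channel, assumed to act on amplitude.
    """
    out = list(spectrum)
    n = len(spectrum)
    for (first, last), g_db in zip(channel_bins, gains_db):
        scale = 10.0 ** (g_db / 20.0)
        for k in range(first, last + 1):
            out[k] *= scale
            mirror = (n - k) % n  # matching negative-frequency bin of a real signal
            if mirror != k:
                out[mirror] *= scale
    return out
```

Scaling the mirrored bin together with each positive-frequency bin keeps the spectrum conjugate-symmetric, so the inverse transform of a real input frame remains real.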
[0054]
  The gain-adjusted signal generated by the gain adjuster 224 is then provided to the inverse transform element 226, which in a preferred embodiment generates the inverse fast Fourier transform (IFFT) of the signal. If the input frames were formed from overlapping samples, the post-processing element 228 adjusts the output signal for the overlap. The post-processing element 228 also performs deemphasis if the signal underwent preemphasis. Deemphasis attenuates the frequency components that were emphasized during preemphasis. The preemphasis/deemphasis process effectively contributes to noise suppression by reducing the noise components that lie outside the range of the processed frequency components.
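The relationship between preemphasis and deemphasis can be illustrated with a pair of first-order filters. The description does not specify the filters, so the first-order form and the coefficient 0.95 are assumptions chosen for illustration, and the function names are hypothetical.

```python
def preemphasize(samples, coeff=0.95):
    """Assumed first-order high-pass: y[n] = x[n] - coeff * x[n-1]."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - coeff * prev)
        prev = x
    return out


def deemphasize(samples, coeff=0.95):
    """Inverse of preemphasize: y[n] = x[n] + coeff * y[n-1]."""
    out = []
    prev = 0.0
    for x in samples:
        prev = x + coeff * prev
        out.append(prev)
    return out


original = [1.0, -2.0, 3.0, 0.5]
restored = deemphasize(preemphasize(original))
```

With no processing in between, deemphasis restores the original samples; in the noise suppressor, the gain adjustment sits between the two filters.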
[0055]
  The various processing blocks of the noise suppressor shown in FIG. 2 may be implemented in a digital signal processor (DSP) or an application specific integrated circuit (ASIC). The description of the functionality of the present invention would enable a person skilled in the art to implement the invention in a DSP or an ASIC without undue experimentation.
[0056]
Referring now to FIG. 4, a flowchart illustrating some of the steps included in the processing described in connection with FIGS. 2 and 3 is shown. Although shown as sequential steps, one skilled in the art will recognize that the order of some of the steps can be interchanged.
[0057]
  The process begins at step 402. In step 404, the transform element 204 transforms the input audio signal into a transformed signal, i.e., an FFT signal. In step 406, the SNR estimator 210b determines the speech SNR for the M channels of the input signal based on the channel energy estimates provided by the channel energy estimator 206b and the channel noise energy estimates provided by the noise energy estimator 214b. In step 408, the channel gain estimator 220 determines gain factors for the M channels of the input signal based on the frequencies of the channels. The channel gain estimator 220 sets the gain to a minimum level if it is found that no speech is present in the frame of the input signal. Otherwise, a gain factor is determined for each of the M channels based on a predetermined function. For example, as shown in FIG. 3, a function defined by linear equations having a slope and a y-intercept may be used, where each linear equation defines the gain for a given frequency band. In step 410, the gain adjuster 224 uses the M gain factors to adjust the gain of the M channels of the transformed signal. In step 412, the inverse transform element 226 inverse transforms the gain-adjusted transformed signal to produce a noise-suppressed audio signal.
[0058]
  In step 414, the SNR estimator 210a determines the speech SNR for the N channels of the input signal based on the channel energy estimates provided by the channel energy estimator 206a and the channel noise energy estimates provided by the noise energy estimator 214a. In step 416, the rate determination element 212 determines the encoding rate for the input signal through analysis of the input signal. Alternatively, one or more mode measures, such as NACF, may be determined. In step 418, the speech decision element 216 determines whether speech is present in the input signal based on the SNR provided by the SNR estimator 210a and the rate and/or mode measures provided by the rate determination element 212. If it is determined at decision block 420 that speech is not present, the input signal is presumed to consist entirely of noise, and the noise energy estimator 214a updates the noise estimate at step 422. The noise energy estimator 214a updates the noise estimate based on the channel energy determined by the channel energy estimator 206a. Whether or not speech is detected, the procedure continues with the next frame of the input signal.
[0059]
The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
[Brief description of the drawings]
FIG. 1 is a block diagram of a communication system that uses a noise suppressor.
FIG. 2 is a diagram of a noise suppressor according to the present invention.
FIG. 3 is a graph of frequency-based gain factors for realizing noise suppression according to the present invention.
FIG. 4 is a flowchart illustrating an exemplary embodiment of the processing steps involved in noise suppression as realized by the processing elements of FIG. 2.

Claims

  A noise suppressor (108) for suppressing background noise in an audio signal, comprising:
  a signal-to-noise ratio (SNR) estimator (210b) that forms channel SNR estimates for a first predetermined set of a plurality of frequency channels of the audio signal;
  a gain estimator (220) that forms a gain factor for each of the frequency channels based on a corresponding one of the plurality of channel SNR estimates, wherein the gain factor is derived using a gain function that defines the gain factor as an increasing function of SNR;
  a gain adjuster (224) that adjusts a gain level for each of the frequency channels based on the corresponding gain factor; and
  a speech detector (208) that determines the presence of speech in the audio signal,
  wherein the speech detector (208) comprises a speech decision element (216) that determines the presence of speech according to SNR estimates for a second predetermined set of a plurality of frequency channels of the audio signal and either:
    a) an encoding rate from a set of encoding rates for the audio signal, or
    b) at least one mode measure characterizing the audio signal.

The noise suppressor (108) of claim 1, wherein the speech detector (208) further comprises a signal-to-noise ratio (SNR) estimator (210a) that forms the SNR estimates for the second predetermined set of frequency channels of the audio signal.

The noise suppressor (108) of claim 2, wherein the speech detector (208) further comprises a rate determination element (212) that determines the encoding rate from the set of encoding rates for the audio signal.

The noise suppressor (108) of claim 2, wherein the speech detector (208) further comprises a mode measure element that determines the at least one mode measure.

The noise suppressor (108) of any of claims 1-4, wherein the mode measure comprises a normalized autocorrelation function (NACF) measure.

The noise suppressor (108) of any of the preceding claims, wherein the gain function is frequency dependent.

The noise suppressor (108) of any preceding claim, wherein the gain function is implemented as a look-up table.

The noise suppressor (108) of any of claims 1 to 7, wherein the gain function is a linear function having a slope and a y-intercept.

The noise suppressor (108) of claim 8, wherein the y-intercept is user selectable.

The noise suppressor (108) of claim 8, wherein the y-intercept is adjustable based on measured noise characteristics of the audio signal.

The noise suppressor (108) of claim 8, wherein the slope is user selectable.

The noise suppressor (108) of claim 8, wherein the slope is adjustable based on measured noise characteristics of the audio signal.

The noise suppressor (108) of claim 1, further comprising a noise energy estimator (214b) that forms an updated channel noise energy estimate for each of the frequency channels when the speech detector determines that no speech is present in the audio signal, wherein the channel noise energy estimates are provided to the SNR estimator for forming the channel SNR estimates.

The noise suppressor (108) of any of claims 1 to 13, wherein a minimum gain factor is determined for each of the frequency channels if the speech detector determines that speech is not present.

The noise suppressor (108) of any of claims 1 to 14, wherein the gain estimator (220) comprises:
means for determining a gain factor for each of the frequency channels if the means for determining the presence of speech determines that speech is present, wherein a gain function is defined for each of a set of a plurality of frequency bands and, for each of the frequency channels, the channel gain factor is determined based on the gain function of the frequency band whose range includes the frequency channel, the gain factor being defined to increase with increasing SNR.

The noise suppressor (108) of any of the preceding claims, further comprising:
Means for converting the audio signal into a frequency representation of the audio signal; and
Means for inversely transforming the gain-adjusted frequency representation to form an audio signal with reduced noise.
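
The transform, gain adjustment, and inverse transform can be sketched on a single frame as follows. For self-containment this uses a direct O(N²) discrete Fourier transform in pure Python; a practical system would more likely use an FFT with windowing and overlap, none of which is shown. The function names and the one-gain-per-bin interface are assumptions for illustration.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def suppress_frame(frame, bin_gains_db):
    """Apply per-bin gains (in dB) in the frequency domain and transform back.

    `bin_gains_db` holds one gain for each bin 0..N/2; bins k and N-k share a
    gain so the spectrum stays conjugate-symmetric and the output stays real.
    """
    n = len(frame)
    if len(bin_gains_db) != n // 2 + 1:
        raise ValueError("expected one gain per bin from 0 to N/2")
    spectrum = dft(frame)
    for k in range(n):
        mirror = min(k, n - k)
        spectrum[k] *= 10.0 ** (bin_gains_db[mirror] / 20.0)
    return idft(spectrum)
```

In a channelized design, each channel's gain would be copied to all bins belonging to that channel before calling `suppress_frame`.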

A method for suppressing background noise of an audio signal, comprising the following steps:
    Forming a plurality of channel SNR estimates for a first predetermined set of frequency channels of the audio signal;
    Forming a gain factor for each of the frequency channels based on a corresponding one of the plurality of channel SNR estimates, wherein the gain factor is derived using a gain function that defines the gain factor as an increasing function of SNR;
    Adjusting the gain level of each of the frequency channels based on the corresponding gain factor; and
    Determining the presence of speech in the audio signal, wherein the presence of speech is determined according to the SNR for a second predetermined set of frequency channels of the audio signal and either:
        a) an encoding rate from a set of a plurality of encoding rates for the audio signal, or
        b) at least one mode measure characterizing the audio signal.
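
The speech-presence determination in the final step can be sketched as a decision that combines an SNR test on a chosen set of channels with a second criterion, either an encoding rate or a mode measure. How the two criteria are combined is not stated in the claim; the logical OR used here, the thresholds, the rate numbering, and the function name are all assumptions made for illustration.

```python
def detect_speech(snr_db, detect_channels, snr_threshold_db=6.0,
                  encoding_rate=None, lowest_rate=1,
                  mode_measure=None, mode_threshold=0.5):
    """Return True if speech is judged present in the current frame.

    snr_db          -- per-channel SNR estimates in dB
    detect_channels -- indices of the second predetermined set of channels
    encoding_rate   -- optional rate chosen by a variable-rate coder
                       (hypothetical numbering, higher means more bits)
    mode_measure    -- optional mode measure such as a NACF value
    """
    snr_vote = any(snr_db[c] > snr_threshold_db for c in detect_channels)
    if encoding_rate is not None:
        second_vote = encoding_rate > lowest_rate
    elif mode_measure is not None:
        second_vote = mode_measure > mode_threshold
    else:
        raise ValueError("supply an encoding rate or a mode measure")
    # Assumed combination rule: either criterion alone is enough.
    return snr_vote or second_vote
```

The result would then decide whether gains are computed from the gain function or whether the noise estimate is updated instead.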

The method of claim 17, further comprising forming the SNR estimate for the second predetermined set of frequency channels of the audio signal.

The method of claim 18, further comprising determining the encoding rate from the set of encoding rates for the audio signal.

The method of claim 18, further comprising determining said at least one mode measure.

A method according to any of claims 17 to 20, wherein the gain function is frequency dependent.

The method of any of claims 17 to 21, wherein the gain function is implemented as a lookup table.

A method according to any of claims 17-22, wherein each of the gain functions is a linear function having a slope and a y-intercept.

The method of claim 23, wherein said y-intercept is user selectable.

The method of claim 23, wherein said y-intercept can be adjusted based on measured noise characteristics in the audio signal.

The method of claim 23, wherein the slope is user selectable.

The method of claim 23, wherein the slope can be adjusted based on measured noise characteristics in the audio signal.

The method according to any one of claims 17 to 27, further comprising:
Forming an updated channel noise energy estimate for each of the frequency channels if it is determined that speech is not present in the audio signal, wherein the updated channel noise energy estimates are used to form said channel SNR estimates.

The method according to any one of claims 17 to 28, further comprising:
Converting the audio signal into a frequency representation of the audio signal; and
Inverse transforming the gain-adjusted frequency representation to form a noise-suppressed audio signal.

The method according to any one of claims 17 to 29, wherein the step of forming the gain factor comprises:
Determining a gain factor for each of the frequency channels if it is determined that speech is present in the audio signal, wherein a gain function is defined for each of a set of frequency bands, the channel gain factor for each of the plurality of frequency channels is determined based on the gain function for the frequency band whose range includes that frequency channel, and the gain factor is defined to increase with increasing SNR.

The method of any of claims 17-30, wherein the mode measure comprises a normalized autocorrelation function (NACF) measure.

The method of any of claims 17-31, further comprising determining a minimum gain factor for each of the frequency channels if it is determined that no speech is present in the audio signal.