JP7724678B2

JP7724678B2 - Howling prevention circuit, microphone device and electronic device

Info

Publication number: JP7724678B2
Application number: JP2021171752A
Authority: JP
Inventors: 二郎國分
Original assignee: Alinco Inc
Current assignee: Alinco Inc
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2025-08-18
Anticipated expiration: 2041-10-20
Also published as: JP2023061676A

Description

The present invention relates to a howling prevention circuit for preventing howling, a microphone device including the howling prevention circuit, and an electronic device including the howling prevention circuit.

FIG. 8 is a block diagram showing a configuration example of, and the problems with, a conventional public address device 110A.

As shown in FIG. 8, when a public address device (or a conference device, a communication device, or the like) 110A combines a microphone and a speaker, and the audio signal of a user speaking into the microphone is amplified and output from the speaker, a loop is formed through the microphone, the audio amplification unit, and the speaker. The audio signal of the sound that returns to the microphone is amplified repeatedly within this loop, and howling occurs. In a conference device or a communication device, howling can also arise from partial leakage in the hybrid circuit for separating transmission and reception (two-wire/four-wire converter), or when the sound is output from the speaker of the receiver of another communication device.
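The repeated amplification in the loop can be illustrated with a minimal sketch. The model below is an assumption made for illustration and is not part of the patent: it treats the microphone, amplifier, speaker, and acoustic return path as a single multiplier `loop_gain` applied once per pass around the loop.

```python
def round_trip_levels(initial_level, loop_gain, passes):
    """Return the signal level after each pass around the
    microphone -> amplifier -> speaker -> microphone loop.

    loop_gain is a hypothetical combined gain of the whole loop.
    """
    levels = []
    level = initial_level
    for _ in range(passes):
        level *= loop_gain  # one more trip around the loop
        levels.append(level)
    return levels


# A loop gain above 1 grows without bound (howling);
# a loop gain below 1 dies away.
growing = round_trip_levels(1.0, 2.0, 4)
decaying = round_trip_levels(1.0, 0.5, 4)
```

Under this simplified model, anything that keeps the returning sound from re-entering the loop has the same effect as holding the loop gain below 1.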

For example, Patent Document 1 discloses a loudspeaker telephone that allows a call to be made with a speaker and a microphone, without a handset. This conventional loudspeaker telephone includes an echo cancellation circuit, in particular to prevent the howling that occurs when sound emitted from the speaker reflects off the walls of a room and enters the microphone.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H01-198155

The howling described above can be prevented by passing the input audio signal through a limiter that restricts it to a certain level or below, so that the gain of the returning audio signal is held at a fixed level. However, the limiter changes the sound quality or the volume, so the sound quality of the original speech, which is not howling, also changes.

The following methods of removing howling use a first audio signal input from the microphone and a second audio signal output to the speaker.
(1) For example, an inverted copy of the first audio signal from the microphone is added to the second audio signal to cancel it. Alternatively,
(2) the first audio signal is converted into digital data, and only the first audio signal from the microphone is removed from the second audio signal.

These methods require howling removal control at both the audio input section and the audio output section, which complicates the system configuration and makes the product difficult to miniaturize.

An object of the present invention is to solve the above problems and to provide a howling prevention circuit that can prevent howling with high accuracy using a simpler configuration than the conventional examples, a microphone device including the howling prevention circuit, and an electronic device including the howling prevention circuit.

A howling prevention circuit according to one aspect of the present invention is
a howling prevention circuit that prevents howling that occurs when at least part of the sound of an audio signal from a microphone, which converts input sound into the audio signal, is input back into the microphone, the circuit comprising:
a noise cancellation unit that removes noise from the audio signal from the microphone and outputs only the speech signal,
wherein the noise cancellation unit performs noise cancellation processing using a deep learning model unit that is trained using feature parameters of human speech and that determines whether the input audio signal is in a non-speech period containing noise, and
the noise cancellation unit includes an audio signal processing unit that, based on the determination by the deep learning model unit, performs the noise cancellation processing so that non-speech periods containing noise in the input audio signal are not passed, and outputs the audio signal after the noise cancellation processing.

Therefore, the howling prevention circuit and related devices according to the present invention can prevent howling with high accuracy using a simpler configuration than the conventional examples.

FIG. 1 is a block diagram showing a configuration example of a public address device 110 according to Embodiment 1.
FIG. 2 is a block diagram showing a detailed configuration example of the noise cancellation unit 102 in FIG. 1.
FIG. 3 is a block diagram showing a detailed configuration example of the deep learning model unit 35 in FIG. 2.
FIG. 4 is a block diagram showing a configuration example of a public address system 113 according to Embodiment 2.
FIG. 5 is a block diagram showing a configuration example of a conference device 120 according to Embodiment 3.
FIG. 6 is a block diagram showing a configuration example of a wireless communication device 130 according to Embodiment 4.
FIG. 7 is a block diagram showing a configuration example and an operation example of the public address device 110 in FIG. 1.
FIG. 8 is a block diagram showing a configuration example of, and the problems with, a conventional public address device 110A.

Embodiments and modifications of the present invention are described below with reference to the drawings. The same or similar components are denoted by the same reference numerals.

(Embodiment 1)
FIG. 1 is a block diagram showing a configuration example of a public address device 110 according to Embodiment 1.

In FIG. 1, the public address device 110 includes a microphone 101, a noise cancellation unit 102, an audio signal amplification unit 103, and a speaker 104.

In the public address device 110, sound input to the microphone 101 is converted into an electrical signal and then input to the noise cancellation unit 102. The noise cancellation unit 102 uses the deep learning model unit 35 (FIGS. 2 and 3), described later, to distinguish speech periods from non-speech periods containing noise, performs noise cancellation processing so that the non-speech periods are not passed, thereby removing noise other than speech, and outputs the processed audio signal to the audio signal amplification unit 103. The audio signal amplification unit 103 amplifies the input audio signal and outputs it to the speaker 104, and the speaker 104 converts the input audio signal into sound and outputs it.

FIG. 2 is a block diagram showing a detailed configuration example of the noise cancellation unit 102 in FIG. 1.

The configuration and operation of the noise cancellation unit 102 in FIG. 1, which uses the deep learning model unit 35, are described below with reference to FIG. 2.

Here, the term "phoneme" means a unit of sound that distinguishes one word from another in a particular language; "vibration rate" means the number of transitions between 0 and 1 in the digitized vibration data per second; and "vibration count value (VC)" means the sum of the values of the digitized vibration data within each frame. "Vibration pattern" means the data distribution of the sums of vibration count values calculated for each predetermined number of frames along the time axis. The deep learning model unit 35 performs noise cancellation processing by taking into account the differences between vibration patterns, that is, between the data distributions of the sums of vibration count values (VS values). The vibration rate is similar to the vibration count value: the higher the vibration rate, the larger the vibration count value.

Both the amplitude and the vibration rate of an audio signal are observable. One feature of the noise cancellation unit 102 is that it detects audio events according to the amplitude and vibration rate of the audio signal. Another feature is that it distinguishes speech from non-speech or silence by measuring the sum of the vibration count values of the digitized vibration data over a predefined number of frames. A further feature is that it classifies the incoming stream of audio signal data into different phonemes according to their vibration patterns. Yet another feature is that it correctly identifies the first activation phoneme in the incoming audio signal data stream so as to trigger downstream processing units, thereby saving computational cost, such as the power consumption of the computing system that includes those processing units.

In FIG. 2, the noise cancellation unit 102 performs noise cancellation processing using audio event detection and includes an audio pre-processing unit 38, an AD converter 39, and an audio signal processing unit 30. The audio pre-processing unit 38 applies audio signal pre-processing to the analog audio signal, including high-pass filtering, low-pass filtering, amplification, or a combination of these, and outputs the processed analog audio signal to the AD converter 39. In other words, the audio pre-processing unit 38 passes only the part of the audio signal from the microphone 101 that lies within a predetermined level range and a predetermined bandwidth of human speech signals. The AD converter 39 then converts the analog audio signal into a digital audio signal in accordance with a predetermined reference voltage Vref and an allowable voltage Vadm (< Vref), and outputs it to the input interface 36 of the audio signal processing unit 30.

In this embodiment, the AD converter 39 uses the allowable voltage Vadm, which is smaller than the reference voltage Vref, in combination with Vref to form a first threshold voltage Vth1 (= Vref + Vadm) and a second threshold voltage Vth2 (= Vref - Vadm). Based on these thresholds, the AD converter 39 does not perform AD conversion on noise at or above Vth1 or at or below Vth2, and performs AD conversion on the audio signal between them, which removes noise and interference from the input analog audio signal. For example, with Vref = 1.0 V and Vadm = 0.01 V, the number of vibrations in the vibration data is small in a quiet environment and large in a speech environment. In this embodiment, "frame size" means the number of sampling points corresponding to the digitized vibration data within each frame, and "phoneme window Tw" means the time over which the audio feature values of each phoneme are collected. In a preferred embodiment, the duration Tf of each frame is, for example, 0.1 to 1 millisecond (ms), and the phoneme window Tw is, for example, about 0.3 seconds. In a further preferred embodiment, the number of sampling points corresponding to the digitized vibration data within each frame is, for example, in the range of 1 to 16.

Because most audio signals are stable over short periods, short-term analysis is normally used when analyzing audio signals. For example, assuming that the sampling frequency fs used by the AD converter 39 is 16000 Hz and that the duration Tf of each frame is 1 ms, the frame size is fs × 1/1000 = 16 sampling points.
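The frame-size arithmetic can be checked with a short sketch. The function names are illustrative only and do not appear in the patent. Durations are passed in microseconds so that the calculation stays in integers. The second function applies the same idea to the phoneme window, taking Tw as exactly 0.3 s for the sake of the example (the text says "about 0.3 seconds").

```python
def frame_size(sampling_rate_hz, frame_duration_us):
    """Number of sampling points in one frame of length Tf."""
    return sampling_rate_hz * frame_duration_us // 1_000_000


def frames_per_window(window_us, frame_duration_us):
    """Number of frames of length Tf that fit in a phoneme window Tw."""
    return window_us // frame_duration_us


# fs = 16000 Hz and Tf = 1 ms, as in the example in the text.
points = frame_size(16_000, 1_000)

# Tw = 0.3 s (assumed exact here) with Tf = 1 ms.
frames = frames_per_window(300_000, 1_000)
```

With these values a frame holds 16 sampling points and a phoneme window holds 300 frames, which is consistent with the frame index range j = 0 to 299 used in the training description.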

In FIG. 2, the audio signal processing unit 30 is implemented, for example, as a computer device and includes:
(1) a CPU (Central Processing Unit) 31 that executes predetermined audio signal processing such as noise cancellation;
(2) a ROM (Read Only Memory) 32 that stores the operating system that performs the basic processing of the CPU 31, the program for the audio signal processing, and the data needed to execute that program;
(3) a RAM (Random Access Memory) 33 that stores data being processed while the operating system and the audio signal processing program are running;
(4) a non-volatile EEPROM (Electrically Erasable Programmable Read-Only Memory) 34 that stores the setting data, described later, needed to execute the audio signal processing;
(5) a deep learning model unit 35 that is implemented, for example, as a neural network, is trained by deep learning on human speech signal data, and removes noise from the input audio signal data so as to extract and output substantially only the speech signal;
(6) an input interface 36 that applies predetermined signal conversion processing to the audio signal data input from the AD converter 39, to convert it to the signal specification of the subsequent stage, and outputs it to the CPU 31; and
(7) an output interface 37 that applies predetermined signal conversion processing to the audio signal data from which the deep learning model unit 35 has removed noise, to convert it to the signal specification of the subsequent stage, and outputs it to the audio signal amplification unit 103.

The EEPROM 34 stores, for example, the series of vibration count values VC, the sums of vibration count values VS, VSf, and VSp (described later), and the audio feature values of all feature vectors. The EEPROM 34 may instead be a storage device such as an external memory. The vibration count values VC of x frames are added to obtain the sum of vibration count values VS for the current frame at time Tj; the x frames include the current frame. In one embodiment, the CPU 31 adds the vibration count value VC of the current frame at time Tj to the sum VSp of the vibration count values of the immediately preceding (x - 1) frames to obtain the sum VS (= VC + VSp) of the vibration count values of the x frames at time Tj.

In a modification, the CPU 31 adds the vibration count value VC of the current frame at time Tj, the sum VSf of the vibration count values of the immediately following y frames, and the sum VSp of the vibration count values of the immediately preceding (x - y - 1) frames to obtain the sum VS (= VC + VSf + VSp) of the vibration count values of the x frames at time Tj, where y is zero or greater. The CPU 31 stores the values of VS, VSf, and VSp in the EEPROM 34. In a preferred embodiment, the duration (x × Tf) of the x frames (the phoneme window Tw) is about 0.3 seconds. In a further preferred embodiment, the number of sampling points corresponding to the digitized vibration data of the x frames is in the range of x to 16x.
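The two sums described above can be sketched as follows. This is only an illustration under stated assumptions: the digitized vibration data is taken to be a flat list of 0/1 values, frames are non-overlapping, and the function names are hypothetical. Setting `y = 0` gives the first form, VS = VC + VSp; `y > 0` gives the modified form, VS = VC + VSf + VSp.

```python
def vibration_counts(bits, frame_size):
    """VC per frame: the sum of the digitized vibration data in each frame.

    Assumes non-overlapping frames; a trailing partial frame is dropped.
    """
    last_start = len(bits) - frame_size
    return [sum(bits[i:i + frame_size])
            for i in range(0, last_start + 1, frame_size)]


def vibration_sum(vc, j, x, y=0):
    """VS at frame j over x frames: the current frame, the y following
    frames (VSf), and the (x - y - 1) preceding frames (VSp)."""
    start = j - (x - y - 1)
    end = j + y
    if start < 0 or end >= len(vc):
        raise IndexError("window of x frames does not fit around frame j")
    vs_p = sum(vc[start:j])          # preceding frames
    vs_f = sum(vc[j + 1:end + 1])    # following frames
    return vc[j] + vs_f + vs_p


bits = [0, 1, 1, 0,  1, 1, 1, 1,  0, 0, 0, 0,  1, 0, 0, 0]
vc = vibration_counts(bits, frame_size=4)      # one VC per frame
vs_trailing = vibration_sum(vc, j=2, x=3)      # VC + VSp
vs_centered = vibration_sum(vc, j=2, x=3, y=1) # VC + VSf + VSp
```

In a streaming implementation the sum would more likely be updated incrementally from the stored VSp and VSf values than recomputed from scratch, which matches the text's note that these values are kept in the EEPROM 34.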

Generally, in audio signal data, the vibration patterns of the vibration count values VC are similar for the same phoneme, whereas the vibration patterns of the VS values are completely different for different phonemes. The vibration pattern of the vibration count values VC can therefore be used to distinguish phonemes. In particular, the cry of a chicken or a cat, for example, differs completely from human speech in the frequency distribution of the vibration count values VC, and it is known that most vibration count values VC of human speech are distributed at 40 or below.

In the training phase, the CPU 31 of the audio signal processing unit 30 first executes a predetermined audio signal data collection method multiple times to collect multiple feature vectors for multiple phonemes, and attaches the corresponding labels to the feature vectors to form multiple labeled training examples. The labeled training examples for different phonemes, including the activation phoneme, are then used to train the deep learning model unit 35. Finally, the trained deep learning model unit 35 (which constitutes a prediction model for the audio signal data) is created to classify whether the incoming stream of audio signal data contains the activation phoneme. When a predetermined phoneme is designated as the activation phoneme of the audio signal processing unit 30, the deep learning model unit 35 is trained with multiple labeled training examples for different phonemes that include at least the designated phoneme.

That is, in the training phase, the deep learning model unit 35 is trained with the set of labeled training examples so that it recognizes the predetermined activation phoneme based on the three audio feature values of each frame of the labeled training examples (for example, (VSj, TDj, TGj)), for j = 0 to 299. At the end of the training phase, the trained deep learning model unit 35 provides a trained score corresponding to the activation phoneme, and this score is then used as the reference for classifying the incoming stream of audio signal data at run time. VSj, TDj, and TGj are defined as follows.
(1) VSj: the sum of the vibration count values (VS value) of frame j;
(2) TDj: the time period of non-zero sums of vibration count values (VS values) in frame j; and
(3) TGj: the time gap between non-zero sums of vibration count values (VS values) in frame j.

Various machine learning techniques related to supervised learning can be used to train the deep learning model unit 35, for example, the support vector machine (SVM) method, the random forest method, and the convolutional neural network method. In supervised learning, a function (here, the deep learning model unit 35) is created from multiple labeled training examples, each consisting of an input feature vector and a labeled output. Once trained, the deep learning model unit 35 can be applied to new unlabeled examples to generate a corresponding score or predicted value.

FIG. 3 is a block diagram showing a detailed configuration example of the deep learning model unit 35 in FIG. 2.

The deep learning model unit 35 is implemented, for example, using a neural network, as shown in FIG. 3. The neural network includes one input layer 41, at least one and preferably multiple intermediate layers 42, and one output layer 43. The input layer 41 has three input neurons 51, 52, and 53, which correspond to the three audio feature values of each frame of the feature vector (that is, VSj, TDj, and TGj). The intermediate layers 42 consist of neurons 61 to 74, each of which has weight coefficients associated with the input neurons 51, 52, and 53 and its own bias coefficient. By changing the weight coefficients and bias coefficients of the neurons 61 to 74 in the intermediate layers 42 over the cycles of the training phase, the neural network can be trained to report a predicted value for a given type of input. The output layer 43 includes one output neuron 81, which provides one predicted value corresponding to a phoneme (specifically, a value indicating whether the input is in a speech period or in a non-speech period containing noise).
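The forward pass of a network with this shape can be sketched as below. This is a minimal illustration, not the patented model: it assumes a single intermediate layer holding all 14 neurons (61 to 74), ReLU activations in that layer, a sigmoid on the output neuron, and random untrained weights. None of these choices is stated in the source, and the feature values in the usage line are made up.

```python
import math
import random


def make_network(n_inputs=3, n_hidden=14, seed=0):
    """Build random weights and biases (untrained, for illustration only)."""
    rng = random.Random(seed)
    hidden = [([rng.uniform(-1, 1) for _ in range(n_inputs)],
               rng.uniform(-1, 1))
              for _ in range(n_hidden)]
    output = ([rng.uniform(-1, 1) for _ in range(n_hidden)],
              rng.uniform(-1, 1))
    return hidden, output


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(network, features):
    """Map one frame's (VS, TD, TG) features to a score between 0 and 1."""
    hidden, (out_weights, out_bias) = network
    # Intermediate layer: weighted sum plus bias, then ReLU (assumed).
    activations = [
        max(0.0, sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in hidden
    ]
    # Single output neuron: weighted sum plus bias, then sigmoid (assumed).
    z = sum(w * a for w, a in zip(out_weights, activations)) + out_bias
    return sigmoid(z)


network = make_network()
score = predict(network, (12.0, 0.2, 0.1))  # hypothetical feature values
```

Training would adjust the weights and biases so that the score separates speech frames from non-speech frames; the sketch only shows how three inputs flow through 14 intermediate neurons to one output.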

As described above, in the noise cancellation unit, the deep learning model unit 35 is trained using feature parameters of human speech and determines whether the input audio signal is in a non-speech period containing noise. Based on that determination, the CPU 31 of the audio signal processing unit 30 performs noise cancellation processing so that non-speech periods containing noise in the input audio signal are not passed, and outputs the audio signal after the noise cancellation processing. The deep learning model unit 35 is implemented by the neural network in FIG. 3, which takes feature parameters of human speech as input and outputs the result of determining whether the input audio signal is in a non-speech period containing noise.
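The gating step can be sketched as follows. The source says only that non-speech periods are "not passed"; replacing them with zeros (silence) is an assumption made here, as are the frame layout and the function name. The per-frame decisions stand in for the output of the deep learning model unit 35.

```python
def gate_frames(frames, is_speech):
    """Pass frames judged to be speech; replace the rest with silence.

    frames    -- list of frames, each a list of samples
    is_speech -- one boolean per frame (hypothetical model decisions)
    """
    if len(frames) != len(is_speech):
        raise ValueError("need exactly one decision per frame")
    return [list(frame) if keep else [0] * len(frame)
            for frame, keep in zip(frames, is_speech)]


frames = [[3, -2, 5], [9, 9, 9], [1, 0, -1]]
decisions = [True, False, True]  # the middle frame is judged non-speech
gated = gate_frames(frames, decisions)
```

Because a returning howling tone would be judged non-speech in the same way as ambient noise, blocking such frames keeps it from being amplified again.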

An operation example of the public address device 110 configured as described above is explained below with reference to FIGS. 1 and 7.

FIG. 7 is a block diagram showing a configuration example and an operation example of the public address device 110 in FIG. 1.

In FIG. 7, the audio signal input from the microphone 101 is passed through the noise cancellation unit 102, which uses the deep learning model unit 35 (FIG. 3). The original purpose of this unit is to reduce the non-speech signal of ambient noise entering the microphone 101 and to extract the audio signal of the target speech. However, it can equally reduce the non-speech signal of the returning sound that arises in howling, so the repeated amplification of the returning audio signal is avoided. As a result, only the audio signal of the target speech from the microphone 101 is extracted. In addition to reducing the ambient noise entering the microphone 101, the audio signal output from the speaker 104 shows no change in sound quality even under howling conditions, and the small-scale system configuration makes it possible to remove howling in compact products.

The inventors built a prototype of the public address device 110 in FIG. 1 and conducted experiments in which howling was induced. The experiments confirmed that the noise cancellation unit 102 of the public address device 110 according to this embodiment prevents howling effectively and with high accuracy.

As described above, the noise cancellation unit 102 using the deep learning model unit 35 provides a howling prevention circuit that can prevent howling with high accuracy using a simpler configuration than the conventional examples. Furthermore, including the noise cancellation unit 102 in the public address device 110 yields a public address device that prevents howling effectively and with high accuracy.

(Embodiment 2)
FIG. 4 is a block diagram showing a configuration example of a public address system 113 according to Embodiment 2.

In FIG. 4, the public address system 113 consists of a microphone device 111 and a public address device 112 connected by an audio signal cable 105.

The microphone device 111 includes a microphone 101 and a noise cancellation unit 102 powered by a DC power supply 102B, which is a secondary battery such as a lithium battery. The configuration and operation of the noise cancellation unit 102 are the same as those of the noise cancellation unit 102 according to Embodiment 1. Power for the noise cancellation unit 102 is not limited to the DC power supply 102B; it may be supplied by an AC adapter that rectifies and smooths an AC voltage, or as a DC voltage from the main body of the public address device 112.

The public address device 112 includes an audio signal amplification unit 103 and a speaker 104, which operate in the same way as in Embodiment 1 in FIG. 1.

As described above, the noise cancellation unit 102 using the deep learning model unit 35 can prevent howling with high accuracy using a simpler configuration than the conventional examples. Furthermore, including the noise cancellation unit 102 in the microphone device 111 yields a public address system 113 that prevents howling effectively and with high accuracy.

In FIG. 4, the microphone device 111 is sometimes called, for example, an "optional microphone" or an "external microphone." The public address device 112 may be a wireless communication device or a wired communication device.

(Embodiment 3)
FIG. 5 is a block diagram showing a configuration example of a conference device 120 according to Embodiment 3.

In FIG. 5, in the conference device 120, sound input to the microphone 101-1 is converted into an electrical signal and then input to the noise cancellation unit 102-1. Likewise, sound input to the microphone 101-2 is converted into an electrical signal and then input to the noise cancellation unit 102-2. Each of the noise cancellation units 102-1 and 102-2 uses the deep learning model unit 35 (FIGS. 2 and 3) to distinguish speech periods from non-speech periods containing noise, performs noise cancellation processing so that the non-speech periods are not passed, thereby removing noise other than speech, and outputs the processed audio signal to the adder 121. The adder 121 adds the two input audio signals and outputs the resulting combined audio signal to the communication interface 123 via the hybrid circuit for separating transmission and reception (two-wire/four-wire converter) 122.

The communication interface 123 is, for example, a USB (Universal Serial Bus) interface, and is connected via the communication cable 124 to, for example, a personal computer (PC) 125 to transmit and receive USB interface signals. In this embodiment, the communication interface 123 transmits the combined audio signal acquired by the conference device 120 to the other party's personal computer (not shown), which is connected to the personal computer 125 via a predetermined network such as the Internet, and receives the other party's audio signal. The received audio signal of the other party passes through the hybrid circuit 122 and the audio signal amplification unit 103 and is output as sound from the speaker 104.

In a conference system using the conference device 120 configured as described above, the following howling paths, for example, are conceivable.
(1) Owing to partial leakage in the hybrid circuit 122, the audio signal of the user of the conference device 120 input to the microphones 101-1 and 101-2 passes from the adder 121 through the hybrid circuit 122 and the audio signal amplification unit 103 and is output from the speaker 104, and this sound returns to the microphones 101-1 and 101-2.
(2) The audio signal of the user of the conference device 120 input to the microphones 101-1 and 101-2 is output as sound from the other party's speaker via the communication interface 123, the personal computer 125, and the other party's personal computer. This sound is picked up by the other party's microphone, travels in the reverse direction through the personal computer 125, the communication interface 123, the hybrid circuit 122, and the audio signal amplification unit 103, is output from the speaker 104, and returns to the microphones 101-1 and 101-2. Alternatively, the user's audio signal may come back because of partial leakage in the other party's hybrid circuit.
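Both paths form a closed loop, so whether howling builds up depends on the gain accumulated in one trip around that loop. The sketch below is a deliberately simplified model under stated assumptions: the whole loop is reduced to a single scalar gain per round trip (`loop_gain`, a hypothetical parameter), delay and frequency dependence are ignored, no new sound enters the microphone, and the noise cancellation unit is modelled as a gate that either passes or blocks the circulating signal.

```python
from typing import List


def simulate_loop(loop_gain: float, gate_open: List[bool],
                  initial_level: float = 0.001) -> List[float]:
    """Return the circulating signal level after each round trip.

    loop_gain  -- hypothetical overall gain of one trip around the loop
    gate_open  -- gate_open[i] is True if the noise cancellation unit
                  passes the signal on round trip i, False if it blocks it
    """
    level = initial_level
    history = []
    for is_open in gate_open:
        # An open gate lets the sound be re-amplified; a closed gate breaks the loop.
        level = level * loop_gain if is_open else 0.0
        history.append(level)
    return history


# Loop gain above 1 with the gate always open: the level grows on every trip.
growing = simulate_loop(1.5, [True] * 10)

# Same gain, but the gate closes on the third trip: circulation stops.
gated = simulate_loop(1.5, [True, True, False, True, True])

print(growing)
print(gated)
```

In this model a loop gain below 1 decays on its own; blocking the non-voice period matters when the loop gain exceeds 1, because a single blocked trip is enough to stop the build-up.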

In this embodiment, however, the noise cancellation units 102-1 and 102-2 using the deep learning model unit 35 prevent howling with high accuracy and with a simpler configuration than the conventional examples. Furthermore, providing the noise cancellation units 102-1 and 102-2 in the conference device 120 makes it possible to realize a conference system that prevents howling effectively and with high accuracy.

The above embodiment includes two microphones 101-1 and 101-2 and two noise cancellation units 102-1 and 102-2, but the present disclosure is not limited to this, and a plurality of microphones 101 and a plurality of noise cancellation units 102 may be provided.

In the above embodiment, one noise cancellation unit 102-1, 102-2 is provided for each of the microphones 101-1, 101-2, but the present invention is not limited to this. The two audio signals from the two microphones 101-1 and 101-2 may first be added, and a single noise cancellation unit 102 may then perform the noise cancellation processing using the deep learning model unit 35.
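The difference between the two arrangements can be illustrated with a one-frame sketch. As an assumption made only for illustration, a mean-energy threshold (`is_voice`, hypothetical) stands in for the deep learning model unit 35, and the sample values are invented. Under this simplified gate model, a single unit placed after the adder passes whatever noise the quiet microphone contributes whenever the summed frame is judged to be voice, while per-microphone units block it.

```python
def is_voice(frame, threshold=0.01):
    # Hypothetical stand-in for deep learning model unit 35 (assumption).
    return sum(s * s for s in frame) / len(frame) > threshold


def gate(frame):
    # Pass voice frames unchanged; replace non-voice frames with silence.
    return list(frame) if is_voice(frame) else [0.0] * len(frame)


def add(a, b):
    # Sample-wise sum, as performed by the adder.
    return [x + y for x, y in zip(a, b)]


voice = [0.5, -0.5, 0.5, -0.5]       # microphone 101-1: user speaking
noise = [0.05, 0.05, -0.05, -0.05]   # microphone 101-2: background noise only

per_mic = add(gate(voice), gate(noise))  # units 102-1 and 102-2, then the adder
single = gate(add(voice, noise))         # adder first, then one unit 102

print(per_mic)
print(single)
```

The single-unit variant needs only one model evaluation per frame, which is the simplification the paragraph above describes; the sketch shows the kind of trade-off that may come with it in this toy model.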

(Embodiment 4)
FIG. 6 is a block diagram showing a configuration example of the wireless communication device 130 according to Embodiment 4.

In FIG. 6, the wireless communication device 130 includes a microphone 101, a noise cancellation unit 102, an audio signal amplification unit 103A, a modulation transmission unit 131, a transmission antenna 132, a reception antenna 133, a reception demodulation unit 134, an audio signal amplification unit 103, and a speaker 104.

In the wireless communication device 130 of FIG. 6, sound input to the microphone 101 is converted into an electrical signal and then input to the noise cancellation unit 102. The noise cancellation unit 102 uses the deep learning model unit 35 (FIGS. 2 and 3) to distinguish voice periods from non-voice periods containing noise, performs noise cancellation processing so that the non-voice periods are not passed, thereby removing noise other than voice, and outputs the processed audio signal to the modulation transmission unit 131 via the audio signal amplification unit 103A. The modulation transmission unit 131 modulates a carrier wave with the input audio signal using a predetermined modulation scheme to generate a modulated radio signal, and transmits it via the transmission antenna 132. Meanwhile, the reception demodulation unit 134 receives a modulated radio signal from the other party's wireless communication device via the reception antenna 133, applies low-noise amplification, frequency conversion, intermediate-frequency amplification, and the like to the received signal, demodulates it into an audio signal using a predetermined demodulation scheme, and outputs the audio signal to the speaker 104 via the audio signal amplification unit 103.
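The modulate-then-demodulate step can be sketched in a few lines. The patent specifies only a "predetermined" modulation scheme, so amplitude modulation with coherent demodulation is chosen here purely as an assumption for illustration. The sample rate `FS`, carrier frequency `FC`, modulation depth, and function names are likewise hypothetical, and the low-noise amplification, frequency conversion, and intermediate-frequency amplification stages are omitted.

```python
import math

FS = 8000           # sample rate in Hz (assumed)
FC = 1000           # carrier frequency in Hz (assumed; real radios use far higher carriers)
PERIOD = FS // FC   # samples per carrier cycle


def modulate(audio, depth=0.5):
    """Amplitude-modulate the carrier with audio samples in the range -1..1."""
    return [(1.0 + depth * a) * math.cos(2 * math.pi * FC * n / FS)
            for n, a in enumerate(audio)]


def demodulate(rf, depth=0.5):
    """Coherent demodulation: mix with the carrier, then average over one carrier cycle."""
    mixed = [2.0 * s * math.cos(2 * math.pi * FC * n / FS)
             for n, s in enumerate(rf)]
    out = []
    for n in range(len(mixed) - PERIOD + 1):
        # Averaging over a whole carrier cycle removes the double-frequency term.
        avg = sum(mixed[n:n + PERIOD]) / PERIOD
        out.append((avg - 1.0) / depth)
    return out


# A constant-level test signal standing in for the noise-cancelled audio.
audio = [0.4] * 32
recovered = demodulate(modulate(audio))
print(max(abs(r - 0.4) for r in recovered))
```

The averaging filter recovers a constant input almost exactly; for real speech it acts as a crude low-pass filter, so the recovered signal is only an approximation of the input.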

In a wireless communication system using the wireless communication device 130 configured as described above, the following howling path, for example, is conceivable.
(1) The audio signal of the user of the wireless communication device 130 input to the microphone 101 is modulated and wirelessly transmitted by the modulation transmission unit 131, and is output as sound from the speaker of the other party's wireless communication device. This sound may be picked up by the other party's microphone, travel in the reverse direction through the reception demodulation unit 134 and the audio signal amplification unit 103 of the wireless communication device 130, be output from the speaker 104, and return to the microphone 101.

In this embodiment, however, the noise cancellation unit 102 using the deep learning model unit 35 prevents howling with high accuracy and with a simpler configuration than the conventional examples. Furthermore, providing the noise cancellation unit 102 in the wireless communication device 130 makes it possible to realize a wireless communication system that prevents howling effectively and with high accuracy.

The above embodiment includes the modulation transmission unit 131 and the reception demodulation unit 134, but the present invention is not limited to this. The reception demodulation unit 134 may be a separate device, and the wireless communication device 130 need only include at least the modulation transmission unit 131.

The above embodiment describes the wireless communication device 130, but the present invention is not limited to this and can also be applied, in place of the wireless communication device 130, to other communication devices such as wired communication devices, telephones, and smartphones.

As described in detail above, according to the howling prevention circuit of the present invention, the noise cancellation unit using the deep learning model unit prevents howling with high accuracy and with a simpler configuration than the conventional examples. Furthermore, providing the noise cancellation unit in a public address device, communication device, conference device, telephone, smartphone, or computer makes it possible to realize an audio processing system that prevents howling effectively and with high accuracy.

30 Audio signal processing unit
31 CPU
32 ROM
33 RAM
34 EEPROM
35 Deep learning model unit
36 Input interface
37 Output interface
38 Audio signal pre-processing unit
39 AD converter
41 Input layer
42 Intermediate layer
43 Output layer
51 to 81 Neurons
101, 101-1, 101-2 Microphone
102, 102-1, 102-2 Noise cancellation unit
102B DC power supply
103, 103A Audio signal amplification unit
104 Speaker
105 Audio signal cable
110, 110A Public address device
111 Microphone device
112 Public address device
113 Public address system
120 Conference device
121 Adder
122 Hybrid circuit
123 Communication interface
124 Communication cable
125 Personal computer
130 Wireless communication device
131 Modulation transmission unit
132 Transmission antenna
133 Reception antenna
134 Reception demodulation unit

Claims

1. A howling prevention circuit for preventing howling that occurs when at least part of the sound of an audio signal from a microphone, which converts input sound into an audio signal, is input to the microphone, the howling prevention circuit comprising:
a noise cancellation unit that removes noise from the audio signal from the microphone and outputs only the voice signal, wherein
the noise cancellation unit uses a deep learning model unit that determines whether an input audio signal is a non-voice period containing noise, and performs noise cancellation processing based on the result of the determination;
during learning, the deep learning model unit uses, as a learning input, a feature parameter of human voice, which is a data distribution of a sum of vibration frequencies calculated for each predetermined number of frames along the time axis of the audio signal, and learns the result of the determination as to whether the audio signal is the non-voice period containing noise, based on a vibration pattern representing the data distribution of the sum of vibration frequencies;
during operation after learning, the deep learning model unit outputs the result of the determination when a feature parameter of human voice relating to an audio signal converted from input sound is input; and
the noise cancellation unit includes an audio signal processing unit that, when the result of the determination by the deep learning model unit is the non-voice period, performs noise cancellation processing on the audio signal converted from the input sound so as not to pass the non-voice period containing noise, and outputs the audio signal after the noise cancellation processing.

2. The howling prevention circuit according to claim 1, wherein the deep learning model unit is configured by a predetermined neural network.

3. The howling prevention circuit according to claim 1 or 2, wherein the noise cancellation unit further includes an audio signal pre-processing unit that is provided in a stage preceding the audio signal processing unit and that passes only those audio signals from the microphone that are within a predetermined level range of human voice signals and have a predetermined bandwidth.

4. A microphone device comprising the howling prevention circuit according to any one of claims 1 to 3.

5. An electronic device comprising the howling prevention circuit according to any one of claims 1 to 3.

6. The electronic device according to claim 5, wherein the electronic device is a public address device, a communication device, a conference device, a telephone, a smartphone, or a computer.