JP7653311B2

JP7653311B2 - Wireless communication device and wireless communication system

Info

Publication number: JP7653311B2
Application number: JP2021102671A
Authority: JP
Inventors: 二郎國分; 敬浩山之口; 裕希高橋
Original assignee: Alinco Inc
Current assignee: Alinco Inc
Priority date: 2021-06-21
Filing date: 2021-06-21
Publication date: 2025-03-28
Anticipated expiration: 2041-06-21
Also published as: JP2023001754A

Description

The present invention relates to a wireless communication device for use in, for example, a specified low-power radio station, and to a wireless communication system including a plurality of such wireless communication devices.

For example, FIGS. 1 and 4 of Patent Document 1 disclose the use of a noise cancellation circuit that performs noise cancellation processing on an audio signal from the microphone of a wireless communication device.

Patent Document 1: JP 2004-165962 A (FIGS. 1 and 4)

Patent Document 1 discloses only that its noise cancellation circuit "removes noise from the audio signal input from the microphone." In the prior art, however, so-called phase-inversion noise cancellation circuits are widely used; these invert a noise signal picked up outside the microphone and add it to the audio signal input to the microphone. A phase-inversion noise cancellation circuit has the problem that, when the ambient noise is loud, the level of the audio signal is suppressed even while a person is speaking, so noise cancellation cannot be performed in a way that outputs the human speech effectively.

An object of the present invention is to solve the above problem and to provide a wireless communication device for wireless voice calls that can perform noise cancellation so as to output human speech more effectively than the prior art, and a wireless communication system including a plurality of such wireless communication devices.

A wireless communication device according to one aspect of the present invention is a wireless communication device including a modulation transmission unit that modulates a radio carrier wave in accordance with an input audio signal and transmits a radio signal, the wireless communication device comprising:
a first noise cancellation unit that performs audio signal processing so as to cancel noise from the input audio signal and outputs the processed audio signal to the modulation transmission unit,
wherein the first noise cancellation unit performs noise cancellation processing using a deep learning model unit that is trained using characteristic parameters of human voice and determines, from the demodulated audio signal, whether or not a period is a non-voice period containing noise.

Therefore, the wireless communication device according to the present invention includes a noise cancellation circuit using a deep learning model unit in place of a phase-inversion noise cancellation circuit, so that a wireless communication device for wireless voice calls can perform noise cancellation that outputs human speech more effectively than the prior art.

FIG. 1 is a block diagram showing a configuration example of a radio 1 according to an embodiment.
FIG. 2 is a block diagram showing a configuration example of the noise cancellation unit 13 in FIG. 1.
FIG. 3 is a block diagram showing a configuration example of the deep learning model unit 35 in FIG. 2.

Embodiments and modifications of the present invention are described below with reference to the drawings. Identical or similar components are denoted by the same reference numerals.

FIG. 1 is a block diagram showing a configuration example of the radio 1 according to the embodiment. In FIG. 1, the radio 1 is an example of a wireless communication device and comprises a receiving antenna 11, a reception demodulation unit 12, a noise cancellation unit 13, an audio signal amplifier 14, a speaker 15, a control unit 20, an operation unit 21 including a PTT (Push To Talk) key 21A, a microphone 22, an audio signal amplifier 23, a noise cancellation unit 24, a modulation transmission unit 25, and a transmitting antenna 26.

The radio 1 according to the embodiment is an example of a wireless communication device of a specified low-power radio station for, for example, a specified low-power radio communication system. In this embodiment, the radio 1 is characterized in that its transmitting section includes, in place of a phase-inversion noise cancellation circuit, a noise cancellation unit 24 that is particularly effective for noise reduction in, for example, FM (frequency modulation) or PM (phase modulation) demodulation. The receiving section of the radio 1 further includes a noise cancellation unit 13 that is likewise particularly effective for noise reduction. A plurality of radios 1 constitute a wireless communication system.

In FIG. 1, a radio signal received by the receiving antenna 11 is input to the reception demodulation unit 12. The reception demodulation unit 12 applies low-noise amplification, frequency down-conversion, intermediate-frequency amplification, and the like to the received radio signal, demodulates it into an audio signal by a predetermined demodulation method such as FM (frequency modulation) or PM (phase modulation) demodulation, and outputs the audio signal to the noise cancellation unit 13. Using a deep learning model unit 35 (FIG. 3) trained by deep learning on human voice, the noise cancellation unit 13 cancels noise by passing the demodulated audio signal only during voice periods, and then outputs the processed audio signal to the speaker 15 via the audio signal amplifier 14.

The microphone 22 converts input sound into an audio signal and outputs it to the modulation transmission unit 25 via the audio signal amplifier 23 and the noise cancellation unit 24, which has the same configuration as the noise cancellation unit 13. When the PTT key 21A is turned on, the control unit 20 activates the modulation transmission unit 25. The modulation transmission unit 25 modulates a radio carrier wave in accordance with the input audio signal by a predetermined modulation method, up-converts and power-amplifies the resulting radio signal, and transmits it from the transmitting antenna 26.

This embodiment has been described for a radio 1 operating in a simultaneous call mode in which the transmission and reception frequencies differ, but the present invention is not limited to this. When the radio 1 uses the same frequency for transmission and reception, the control unit 20 stops the operation of the reception demodulation unit 12 when the PTT key 21A is turned on. In such a radio 1, which does not transmit and receive simultaneously, the noise cancellation unit 24 may be operated in place of the noise cancellation unit 13 during reception to perform noise cancellation on the received signal.

Next, the configuration and operation of the noise cancellation units 13 and 24 in FIG. 1, which use the deep learning model unit 35, are described with reference to FIG. 2.

FIG. 2 is a block diagram showing a configuration example of the noise cancellation units 13 and 24 in FIG. 1.

Here, the term "phoneme" means a unit of sound that distinguishes one word from another in a particular language; the term "vibration rate" means the number of transitions between 0 and 1 in the digitized vibration data per second; and the term "vibration count (VC)" means the sum of the values of the digitized vibration data within each frame. A "vibration pattern" means the data distribution of the sums of vibration counts calculated for each predetermined number of frames along the time axis. The deep learning model unit 35 performs noise cancellation processing by taking into account differences between vibration patterns, that is, differences in the data distribution of the sums of vibration counts (VS values). The vibration rate is similar to the vibration count: the higher the vibration rate, the larger the vibration count.
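The two definitions above can be illustrated with a minimal Python sketch. It assumes the digitized vibration data is available as a sequence of 0/1 values; the function names `vibration_rate` and `vibration_counts` are hypothetical and do not appear in the patent.

```python
from typing import List, Sequence


def vibration_rate(bits: Sequence[int], duration_s: float) -> float:
    """Transitions between 0 and 1 per second (assumed reading of 'vibration rate')."""
    transitions = sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    return transitions / duration_s


def vibration_counts(bits: Sequence[int], frame_size: int) -> List[int]:
    """Vibration count VC per frame: the sum of the data values in each complete frame."""
    n_frames = len(bits) // frame_size
    return [sum(bits[i * frame_size:(i + 1) * frame_size]) for i in range(n_frames)]


if __name__ == "__main__":
    data = [0, 1, 1, 0, 0, 0, 1, 0]   # illustrative values only
    print(vibration_rate(data, duration_s=0.5))
    print(vibration_counts(data, frame_size=4))
```

Trailing samples that do not fill a complete frame are dropped in this sketch; a real implementation might buffer them for the next frame instead.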

Both the amplitude and the vibration rate of an audio signal are observable. One feature of the noise cancellation units 13 and 24 is that they detect voice events according to the amplitude and vibration rate of the audio signal. A second feature is that they distinguish voice from non-voice or silence by summing the vibration counts of the digitized vibration data over a predefined number of frames. A third feature is that they classify the incoming stream of audio signal data into different phonemes according to its vibration patterns. A fourth feature is that they correctly identify the first activation phoneme in the incoming audio signal data stream so as to trigger a downstream processing unit, thereby saving computational cost, such as power consumption, in the computing system that includes the processing unit.

In FIG. 2, the noise cancellation units 13 and 24 perform noise cancellation using voice event detection and each comprise an audio signal pre-processing unit 38, an AD converter 39, and an audio signal processing unit 30. The audio signal pre-processing unit 38 applies pre-processing, including high-pass filtering, low-pass filtering, amplification, or a combination of these, to the analog audio signal and outputs the processed analog audio signal to the AD converter 39. In other words, the audio signal pre-processing unit 38 passes only the portion of the audio signal from the microphone 22 that lies within a predetermined level range and a predetermined bandwidth of human voice signals. The AD converter 39 then converts the analog audio signal into a digital audio signal in accordance with a predetermined reference voltage Vref and allowable voltage Vadm (< Vref), and outputs it to the input interface 36 of the audio signal processing unit 30.

In this embodiment, the allowable voltage Vadm, which is smaller than the reference voltage Vref, is combined with the reference voltage Vref in the AD converter 39 to form a first threshold voltage Vth1 (= Vref + Vadm) and a second threshold voltage Vth2 (= Vref - Vadm). Based on these thresholds, the AD converter 39 does not perform AD conversion on noise at or above the first threshold voltage Vth1 or at or below the second threshold voltage Vth2, and performs AD conversion only on the audio signal between them, which removes noise and interference from the incoming analog audio signal. For example, with Vref = 1.0 V and Vadm = 0.01 V, the vibration data shows few vibrations in a quiet environment and many vibrations in an environment containing voice. In this embodiment, the "frame size" means the number of sampling points corresponding to the digitized vibration data within each frame, and the "phoneme window Tw" means the time over which the audio features of each phoneme are collected. In a preferred embodiment, the duration Tf of each frame is, for example, 0.1 to 1 millisecond (ms), and the phoneme window Tw is, for example, about 0.3 seconds. In a further preferred embodiment, the number of sampling points corresponding to the digitized vibration data within each frame is, for example, in the range of 1 to 16.
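As a rough illustration of the two threshold voltages, the following Python sketch computes Vth1 and Vth2 from Vref and Vadm and checks whether a sample voltage lies in the band that, according to the description above, is subjected to AD conversion. The function names are hypothetical, and how the converter actually encodes the signal is not specified here, so this only models the band test.

```python
from typing import Tuple


def thresholds(v_ref: float, v_adm: float) -> Tuple[float, float]:
    """Return (Vth1, Vth2) = (Vref + Vadm, Vref - Vadm); requires 0 <= Vadm < Vref."""
    if not 0 <= v_adm < v_ref:
        raise ValueError("v_adm must satisfy 0 <= v_adm < v_ref")
    return v_ref + v_adm, v_ref - v_adm


def in_conversion_band(v: float, v_ref: float, v_adm: float) -> bool:
    """True when Vth2 < v < Vth1, the band the description says is AD converted."""
    v_th1, v_th2 = thresholds(v_ref, v_adm)
    return v_th2 < v < v_th1


if __name__ == "__main__":
    print(thresholds(1.0, 0.01))                 # example values from the description
    print(in_conversion_band(1.005, 1.0, 0.01))
    print(in_conversion_band(1.5, 1.0, 0.01))
```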

Because most audio signals are stationary over short periods, audio signals are usually analyzed by short-term analysis. For example, assuming that the sampling frequency fs used by the AD converter 39 is 16,000 Hz and the duration Tf of each frame is 1 ms, the frame size is fs × 1/1000 = 16 sample points.
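The arithmetic can be checked with a short Python sketch. It uses the example values from the description (fs = 16,000, Tf = 1 ms, and a phoneme window Tw of about 0.3 s); the helper names are hypothetical.

```python
def frame_size(fs_hz: int, tf_ms: float) -> int:
    """Number of sampling points in one frame: fs x Tf."""
    return round(fs_hz * tf_ms / 1000)


def frames_per_window(tw_s: float, tf_ms: float) -> int:
    """Number of frames x that fit in one phoneme window: Tw / Tf."""
    return round(tw_s * 1000 / tf_ms)


if __name__ == "__main__":
    print(frame_size(16000, 1.0))        # sample points per 1 ms frame
    print(frames_per_window(0.3, 1.0))   # frames in a 0.3 s phoneme window
```

With these values a phoneme window spans 300 frames, which matches the frame index range j = 0 to 299 used in the training description.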

In FIG. 2, the audio signal processing unit 30 is implemented by, for example, a computer device and comprises:
(1) a CPU (Central Processing Unit) 31 that executes predetermined audio signal processing such as noise cancellation;
(2) a ROM (Read Only Memory) 32 that stores an operating system for the basic processing of the CPU 31, the program for the audio signal processing, and the data needed to execute that program;
(3) a RAM (Random Access Memory) 33 that stores data being processed while the operating system for the basic processing of the CPU 31 and the audio signal processing program are executed;
(4) a non-volatile EEPROM (Electrically Erasable Programmable Read-Only Memory) 34 that stores the setting data, described later, needed to execute the audio signal processing;
(5) a deep learning model unit 35 that is implemented by, for example, a neural network trained by deep learning on human voice signal data, and that removes noise from the input audio signal data so as to extract and output substantially only the voice signal;
(6) an input interface 36 that applies predetermined signal conversion to the audio signal data input from the AD converter 39 to match the signal specifications of the subsequent stage and outputs the result to the CPU 31; and
(7) an output interface 37 that applies predetermined signal conversion to the audio signal data from which noise has been removed by the deep learning model unit 35 to match the signal specifications of the subsequent stage and outputs the result to the radio 1 via a terminal T12, an audio line L2, and the like.

The EEPROM 34 stores, for example, a series of vibration counts VC, the sums of vibration counts VS, VSf, and VSp (described later), and the audio feature values of all feature vectors. The EEPROM 34 may instead be a storage device such as an external memory. The voice event detection method applied in the audio signal processing unit 30 is executed by the CPU 31 at run time to capture voice events. In the following, voice event detection is performed assuming fs = 16,000, Tf = 1 ms, and Tw = 0.3 s.

Specifically, the CPU 31 sums the vibration data values of the current frame being processed (that is, within 1 ms) to obtain the vibration count VC, and then stores the VC value of the current frame at time Tj in the EEPROM 34. The vibration counts VC of x frames, which include the current frame, are then added to obtain the sum of vibration counts VS for the current frame at time Tj. In one embodiment, the CPU 31 adds the vibration count VC of the current frame at time Tj to the sum VSp of the vibration counts of the (x-1) immediately preceding frames to obtain the sum VS (= VC + VSp) of the vibration counts of the x frames at time Tj.
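The relation VS = VC + VSp can be sketched as a running window sum in Python. This is only an illustrative sketch, not the patented implementation: the generator name `running_vs` is hypothetical, and during the first (x-1) frames it simply sums whatever preceding frames exist.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def running_vs(vc_stream: Iterable[int], x: int) -> Iterator[int]:
    """Yield VS = VC + VSp per frame, where VSp sums up to (x-1) preceding frames."""
    if x < 1:
        raise ValueError("x must be at least 1")
    previous: Deque[int] = deque(maxlen=x - 1)  # VC values of the preceding frames
    vsp = 0                                     # running sum of `previous`
    for vc in vc_stream:
        yield vc + vsp
        if x > 1:
            if len(previous) == previous.maxlen:
                vsp -= previous[0]              # oldest frame leaves the window
            previous.append(vc)
            vsp += vc


if __name__ == "__main__":
    print(list(running_vs([1, 2, 3, 4, 5], x=3)))
```

Keeping VSp as a running total means each new frame costs one addition and one subtraction rather than a full re-summation of x frames.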

In a modification, the CPU 31 adds the vibration count VC of the current frame at time Tj, the sum VSf of the vibration counts of the y immediately following frames, and the sum VSp of the vibration counts of the (x-y-1) immediately preceding frames to obtain the sum VS (= VC + VSf + VSp) of the vibration counts of the x frames at time Tj, where y is zero or greater. The CPU 31 stores the values of VS, VSf, and VSp in the EEPROM 34. In a preferred embodiment, the duration (x × Tf) of the x frames (the phoneme window Tw) is about 0.3 seconds. In a further preferred embodiment, the number of sampling points corresponding to the digitized vibration data of the x frames is in the range of x to 16x.
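The modification, in which the window also covers y following frames, might be sketched as follows in Python. The function name `windowed_vs` is hypothetical, and the sketch assumes that all vibration counts are already buffered in a list and that frames outside the list are simply ignored at the edges.

```python
from typing import Sequence


def windowed_vs(vc: Sequence[int], j: int, x: int, y: int) -> int:
    """VS = VC + VSf + VSp for frame j, using y following and (x-y-1) preceding frames."""
    if not 0 <= y < x:
        raise ValueError("y must satisfy 0 <= y < x")
    n_prev = x - y - 1
    vsp = sum(vc[max(0, j - n_prev):j])   # sum over the preceding frames
    vsf = sum(vc[j + 1:j + 1 + y])        # sum over the following frames
    return vc[j] + vsf + vsp


if __name__ == "__main__":
    counts = [1, 2, 3, 4, 5, 6]           # illustrative VC values
    print(windowed_vs(counts, j=3, x=4, y=1))
    print(windowed_vs(counts, j=3, x=3, y=0))
```

Setting y = 0 reduces this to the preceding-frames-only case VS = VC + VSp.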

In general, for audio signal data, the vibration patterns of the vibration counts VC are similar for the same phoneme, whereas the vibration patterns of the VS values differ completely between different phonemes. The vibration patterns of the vibration counts VC can therefore be used to distinguish phonemes. In particular, it is known that the cry of a chicken or a cat, for example, differs completely from human voice in the frequency distribution of the vibration counts VC, and that most vibration counts VC of human voice are distributed at 40 or below.

In the training phase, the CPU 31 of the audio signal processing unit 30 first executes a predetermined audio signal data collection method multiple times to collect a plurality of feature vectors for a plurality of phonemes, and attaches the corresponding labels to the feature vectors to form a plurality of labeled training examples. The labeled training examples for different phonemes, including the activation phoneme, are then used to train the deep learning model unit 35. Finally, the trained deep learning model unit 35, which constitutes a predictive model of the audio signal data, is created to classify whether the incoming stream of audio signal data contains the activation phoneme. When a predetermined phoneme is designated as the activation phoneme of the audio signal processing unit 30, the deep learning model unit 35 is trained on a plurality of labeled training examples for different phonemes that include at least the designated phoneme.

That is, in the training phase, the deep learning model unit 35 is trained on the set of labeled training examples so that it recognizes the predetermined activation phoneme based on three audio feature values (for example, (VSj, TDj, TGj)) of each frame of the labeled training examples, for j = 0 to 299. At the end of the training phase, the trained deep learning model unit 35 provides a trained score corresponding to the activation phoneme, and this score is then used as the reference for classifying the incoming stream of audio signal data at run time. VSj, TDj, and TGj are defined as follows:
(1) VSj: the sum of vibration counts (VS value) of frame j;
(2) TDj: the time duration of the non-zero sums of vibration counts (VS values) in frame j; and
(3) TGj: the time gap between the non-zero sums of vibration counts (VS values) in frame j.

Various machine learning techniques for supervised learning can be used to train the deep learning model unit 35, for example, the support vector machine (SVM) method, the random forest method, or the convolutional neural network method. In supervised learning, a function (here, the deep learning model unit 35) is built from a plurality of labeled training examples, each consisting of an input feature vector and a labeled output. Once trained, the deep learning model unit 35 can be applied to new unlabeled examples to generate corresponding scores or predicted values.

FIG. 3 is a block diagram showing a detailed configuration example of the deep learning model unit 35 in FIG. 2.

The deep learning model unit 35 is implemented using, for example, a neural network as shown in FIG. 3. The neural network includes one input layer 41, at least one and preferably a plurality of intermediate layers 42, and one output layer 43. The input layer 41 has three input neurons 51, 52, and 53, which correspond to the three audio feature values (that is, VSj, TDj, and TGj) of each frame of the feature vector. The intermediate layers 42 consist of neurons 61 to 74, each of which has weight coefficients associated with the input neurons 51, 52, and 53 and a bias coefficient. By changing the weight and bias coefficients of the neurons 61 to 74 of the intermediate layers 42 over the cycles of the training phase, the neural network can be trained to report a predicted value for a given type of input. The output layer 43 includes one output neuron 81 that provides one predicted value corresponding to a phoneme (specifically, a value indicating whether the period is a voice period or a non-voice period containing noise).
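A forward pass through a network of this shape can be sketched in plain Python. Several details here are assumptions made only for illustration and are not stated in the description: the 14 intermediate neurons are split into two layers of seven, the activation function is a sigmoid, and the weights are random rather than trained. All names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights per neuron, biases)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function (assumed activation)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_network(sizes: Sequence[int] = (3, 7, 7, 1), seed: int = 0) -> List[Layer]:
    """Untrained network: 3 inputs, two assumed hidden layers of 7, 1 output."""
    rng = random.Random(seed)
    layers: List[Layer] = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def predict(features: Sequence[float], layers: List[Layer]) -> float:
    """Map one frame's (VSj, TDj, TGj) to a score in (0, 1)."""
    activations = list(features)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations[0]


if __name__ == "__main__":
    net = build_network()
    score = predict((12.0, 0.5, 0.1), net)  # illustrative feature values
    print("voice" if score >= 0.5 else "non-voice", score)
```

In a trained model the weights and biases would come from the training phase, and the 0.5 decision threshold used in the usage example is likewise an assumption.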

As described above, in the noise cancellation units 13 and 24, the deep learning model unit 35 is trained using characteristic parameters of human voice and determines, from the input audio signal, whether or not a period is a non-voice period containing noise. Based on this determination, the CPU 31 of the audio signal processing unit 30 performs noise cancellation processing so that non-voice periods containing noise in the input audio signal are not passed, and outputs the audio signal after the noise cancellation processing. The deep learning model unit 35 is implemented by the neural network of FIG. 3, which takes the characteristic parameters of human voice as input and outputs the result of determining whether or not a period of the input audio signal is a non-voice period containing noise.
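The gating step can be illustrated with a short Python sketch. It assumes that the audio has already been split into frames and that a per-frame non-voice decision is available; replacing blocked frames with zeros is one possible reading of "not passed" and is an assumption, as is the function name `gate_frames`.

```python
from typing import List, Sequence


def gate_frames(frames: Sequence[Sequence[int]],
                is_non_voice: Sequence[bool]) -> List[List[int]]:
    """Pass voice frames unchanged and silence frames judged to be non-voice."""
    if len(frames) != len(is_non_voice):
        raise ValueError("one decision is required per frame")
    return [
        [0] * len(frame) if non_voice else list(frame)
        for frame, non_voice in zip(frames, is_non_voice)
    ]


if __name__ == "__main__":
    frames = [[3, -2], [7, 8], [1, 1]]     # illustrative sample values
    decisions = [False, True, False]       # True = non-voice period with noise
    print(gate_frames(frames, decisions))
```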

As described above, in this embodiment, the transmitting section of the radio 1 includes, in place of a phase-inversion noise cancellation circuit, the noise cancellation unit 24, which is particularly effective for noise reduction in, for example, FM (frequency modulation) or PM (phase modulation) demodulation. As a result, a wireless communication device for wireless voice calls can perform noise cancellation that outputs human speech more effectively than the prior art. Providing the noise cancellation unit 24 on the transmitting side also allows noise to be removed effectively from the audio signal in the circuits and devices that follow the transmitting side (for example, a wireless relay device).

In this embodiment, the receiving section of the radio 1 further includes the noise cancellation unit 13, which is particularly effective for noise reduction. As a result, a wireless communication device for wireless voice calls can perform noise cancellation that outputs human speech more effectively than the prior art. When FM or PM is used as the modulation method, passing the demodulated audio signal through the noise cancellation unit 13 before it is output to the speaker 15 eliminates the white noise that appears as the signal approaches a no-signal state, so the received audio remains clear even as the radio approaches the limit of its communication range. In addition, when the noise cancellation unit 13 using the deep learning model unit 35 is used in place of a conventional noise squelch circuit, the output of noise in the no-signal state is stopped (or reduced) without also stopping the output of an audio signal that could otherwise be received. The received audio signal can therefore be output right up to the limit of the communication range, and the communication range can be extended compared with a noise squelch circuit.

(Modification)
In the above embodiment, the radio 1 includes the noise cancellation unit 24. However, the present invention is not limited to this, and the radio 1 need not include the noise cancellation unit 24.

As described in detail above, in the wireless communication device according to the present invention, noise cancellation is performed, in place of a phase-inversion noise cancellation circuit, by the noise cancellation unit 13 using the deep learning model unit 35 on the transmitting side of the radio 1, or additionally on the receiving side, so that a wireless communication device for wireless voice calls can perform noise cancellation that outputs human speech more effectively than the prior art.

REFERENCE SIGNS LIST
1 Radio
11 Receiving antenna
12 Reception demodulation unit
13 Noise cancellation unit
14 Audio signal amplifier
15 Speaker
20 Control unit
21 Operation unit
21A PTT key
22 Microphone
23 Audio signal amplifier
24 Noise cancellation unit
25 Modulation transmission unit
26 Transmitting antenna
30 Audio signal processing unit
31 CPU
32 ROM
33 RAM
34 EEPROM
35 Deep learning model unit
36 Input interface
37 Output interface
38 Audio signal pre-processing unit
39 AD converter
41 Input layer
42 Intermediate layer
43 Output layer
51 to 81 Neurons

Claims

1. A wireless communication device comprising:
a modulation transmission unit that modulates a wireless carrier wave in accordance with an input audio signal and transmits a wireless signal; and
a first noise cancellation unit that performs audio signal processing so as to cancel noise from the input audio signal and outputs the processed audio signal to the modulation transmission unit,
wherein the first noise cancellation unit performs noise cancellation processing using a deep learning model unit that is trained using characteristic parameters of human voice, the characteristic parameters being vibration frequencies calculated for each predetermined number of frames along the time axis of the input audio signal, and that determines whether the input audio signal is in a non-voice period containing noise based on a vibration pattern representing a data distribution of the sum of the vibration frequencies.

2. The wireless communication device according to claim 1, wherein the modulation transmission unit modulates a wireless carrier wave according to an input audio signal using a frequency modulation method or a phase modulation method.

3. The wireless communication device according to claim 1 or 2, wherein the first noise cancellation unit includes an audio signal processing unit that performs noise cancellation processing, based on the determination of the deep learning model unit, so as not to pass non-voice periods containing noise in the input audio signal, and outputs the audio signal after the noise cancellation processing.

4. The wireless communication device according to claim 3, wherein the first noise cancellation unit further includes an audio signal pre-processing unit that is provided upstream of the audio signal processing unit and passes only the portion of the input audio signal that falls within a predetermined level range and a predetermined bandwidth of human audio signals.

5. The wireless communication device according to any one of claims 1 to 4, wherein the deep learning model unit is configured with a predetermined neural network that receives characteristic parameters of human voice as input and outputs a determination result indicating whether the input audio signal is in a non-voice period containing noise.

6. The wireless communication device according to claim 1 or 2, further comprising:
a receiving and demodulating unit that demodulates a received wireless signal into an audio signal; and
a second noise cancellation unit that performs audio signal processing so as to cancel noise from the demodulated audio signal and outputs the processed audio signal,
wherein the second noise cancellation unit performs noise cancellation processing using a deep learning model unit that is trained using characteristic parameters of human voice and that determines whether the demodulated audio signal is in a non-voice period containing noise.

7. The wireless communication device according to claim 6, wherein the receiving and demodulating unit demodulates the received wireless signal using a frequency modulation method or a phase modulation method.

8. The wireless communication device according to claim 6 or 7, wherein the second noise cancellation unit includes an audio signal processing unit that performs noise cancellation processing, based on the determination of the deep learning model unit, so as to pass the demodulated audio signal only during voice periods, and outputs the audio signal after the noise cancellation processing.

9. The wireless communication device according to claim 8, wherein the second noise cancellation unit further includes an audio signal pre-processing unit that is provided upstream of the audio signal processing unit and passes only the portion of the input audio signal that falls within a predetermined level range and a predetermined bandwidth of human audio signals.

10. The wireless communication device according to any one of claims 6 to 9, wherein the deep learning model unit is configured with a predetermined neural network that receives characteristic parameters of human voice as input and outputs a determination result indicating whether the input audio signal is in a non-voice period containing noise.

11. The wireless communication device according to any one of claims 6 to 10, wherein the wireless communication device does not perform transmission and reception operations simultaneously, and operates the first noise cancellation unit in place of the second noise cancellation unit during reception.

12. The wireless communication device according to any one of claims 1 to 11, wherein the wireless communication device is a specific low-power radio station that is a wireless communication device for a specific low-power wireless communication system.

13. A wireless communication system including a plurality of wireless communication devices according to any one of claims 1 to 12.