JP7772107B2

JP7772107B2 - Signal processing device, signal processing method, and signal processing program

Info

Publication number: JP7772107B2
Application number: JP2023579992A
Authority: JP
Inventors: 直之加茂; 林太郎池下; 慶介木下; 智広中谷
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-02-10
Filing date: 2022-02-10
Publication date: 2025-11-18
Anticipated expiration: 2042-02-10
Also published as: JPWO2023152915A1; WO2023152915A1

Description

The present invention relates to a signal processing device, a signal processing method, and a signal processing program.

Techniques exist for removing reverberation components from speech or music signals recorded with a distant microphone. Reverberation is the signal component that reaches the microphone later than the original signal, for example because the original signal is reflected off walls, floors, and ceilings.

Signals containing reverberation degrade performance in speech recognition, noise removal by signal processing, sound source separation, and similar tasks, so removing reverberation from the signal beforehand avoids this degradation. Reverberation removal can also be applied to, for example, improving hearing aid performance and automatic music transcription.

One technique for removing reverberation components is WPE (Weighted Prediction Error). WPE assumes an autoregressive model of reverberation and predicts the current reverberation component from past observed signals. It then estimates an inverse filter that cancels the predicted reverberation component and performs dereverberation with that filter. WPE can remove reverberation, but when the number of sound sources is greater than or equal to the number of microphones, it is known from the MINT theorem that no causal inverse filter (an inverse filter that uses only past signals) can exist.
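The prediction step described above can be sketched for a single microphone and a single frequency bin. This is a minimal illustration, not the formulation of any particular WPE implementation: the function names, the prediction delay, the number of taps, the iteration count, and the power floor are all assumptions chosen for the example.

```python
import numpy as np

def delayed_taps(x, delay, taps):
    """Row t holds the past frames x[t-delay], ..., x[t-delay-taps+1] (zeros before t=0)."""
    T = len(x)
    X = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            X[shift:, k] = x[:T - shift]
    return X

def wpe_one_bin(x, delay=2, taps=5, iters=3, floor=1e-3):
    """Iteratively estimate a prediction filter g and return d = x - X g."""
    x = np.asarray(x, dtype=complex)
    X = delayed_taps(x, delay, taps)
    d = x.copy()
    g = np.zeros(taps, dtype=complex)
    for _ in range(iters):
        # Time-varying power of the current estimate, floored for stability.
        lam = np.maximum(np.abs(d) ** 2, floor)
        Xw = X / lam[:, None]
        # Weighted normal equations: (X^H W X) g = X^H W x, with W = diag(1 / lam).
        R = Xw.conj().T @ X + 1e-6 * np.eye(taps)
        p = Xw.conj().T @ x
        g = np.linalg.solve(R, p)
        # Subtract the predicted late reverberation from the observation.
        d = x - X @ g
    return d, g
```

The delay keeps the most recent frames out of the predictor so that the direct sound is not predicted away along with the reverberation.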

SwitchingWPE (see Non-Patent Document 1) partially solves this problem. SwitchingWPE is an improved version of WPE that achieves dereverberation by switching among multiple WPE filters for each time-frequency bin of the signal. The parameter that selects the time-frequency bins to which each WPE filter is applied is called the Switch.

Non-Patent Document 1: Rintaro Ikeshita, et al., "Blind Signal Dereverberation Based on Mixture of Weighted Prediction Error Models", IEEE Signal Processing Letters, vol. 28, 2021, 399.

In SwitchingWPE, the Switch is optimized under a weighted power minimization criterion (maximum likelihood criterion), so the optimized Switch is not necessarily optimal under other evaluation criteria (for example, speech recognition rate or a signal distortion measure). For example, a Switch optimized by SwitchingWPE is not necessarily the optimal Switch for speech recognition. As a result, the speech recognition rate for signals dereverberated by SwitchingWPE may not be high.

The object of the present invention is therefore to solve this problem and to improve the reverberation removal performance of SwitchingWPE according to the intended purpose.

To solve this problem, the present invention comprises: a SwitchingWPE having a plurality of WPE filters that remove reverberation components from an observed signal, and a Switch for switching among the plurality of WPE filters for each time frequency of the observed signal; a reception unit that receives input of an evaluation criterion for the signal after reverberation components have been removed by the SwitchingWPE; a learning unit that uses a dereverberation training dataset to train a model that outputs an estimate of the Switch such that the signal from which reverberation components have been removed by the SwitchingWPE is optimized under the evaluation criterion; a Switch setting unit that sets, in the SwitchingWPE, the Switch estimated by the trained model for an observed signal; and a filter setting unit that calculates the optimal WPE filter for the set Switch and sets it in the SwitchingWPE, wherein the SwitchingWPE removes the reverberation components of the input signal using the set Switch and the set WPE filter.

According to the present invention, the reverberation removal performance of SwitchingWPE can be improved according to the intended purpose.

FIG. 1 is a diagram illustrating an overview of SwitchingWPE.
FIG. 2 is a diagram illustrating an overview of the signal processing device.
FIG. 3 is a diagram illustrating an overview of the signal processing device when the evaluation criterion is SDR.
FIG. 4 is a diagram illustrating a configuration example of the signal processing device.
FIG. 5 is a diagram illustrating an example of the processing procedure of the signal processing device.
FIG. 6 is a diagram showing the evaluation results of the reverberation removal performance of the signal processing device.
FIG. 7 is a diagram showing a configuration example of a computer that executes the signal processing program.

Hereinafter, a mode (embodiment) for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.

[SwitchingWPE]
First, an overview of SwitchingWPE, the basic technology used by the signal processing device of this embodiment, will be described with reference to FIG. 1. SwitchingWPE clusters the time-frequency points of an observed signal (for example, a speech signal) and uses the Switch to select among multiple WPE filters at each time-frequency point, thereby dereverberating the observed signal. A WPE filter is calculated for each Switch. The Switch and the WPE filters of SwitchingWPE are optimized alternately under the weighted power minimization criterion (maximum likelihood criterion).
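The switching mechanism can be illustrated for one frequency bin as follows. This is a simplified sketch under stated assumptions: the Switch is represented as one integer state per frame, the filters are given, and the Switch update picks the state with the smallest residual power per frame, which stands in for the weighted power minimization step. The function names and array layout are hypothetical.

```python
import numpy as np

def apply_switching_filters(x, X, G, switch):
    """Dereverberate one bin with the filter selected by the Switch at each frame.

    x: (T,) observed frames, X: (T, K) delayed past frames,
    G: (C, K) one prediction filter per Switch state, switch: (T,) ints in [0, C).
    """
    return x - np.einsum("tk,tk->t", X, G[switch])

def update_switch_min_power(x, X, G):
    """Assign each frame to the state whose filter leaves the least residual power."""
    residuals = x[None, :] - G @ X.T      # shape (C, T)
    return np.argmin(np.abs(residuals) ** 2, axis=0)
```

Alternating such a Switch update with a per-state filter update gives the flavor of the alternating optimization described above.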

[Overview]
Next, an overview of the signal processing device of this embodiment will be described with reference to FIG. 2. Using a model such as a DNN (Deep Neural Network), the signal processing device estimates, from the observed signal, the Switch that is optimal for dereverberation with SwitchingWPE. The signal processing device then dereverberates the observed signal by SwitchingWPE with the estimated Switch.

For example, the signal processing device uses a dereverberation training dataset to train a Switch estimation model that, on receiving an observed signal, outputs an estimate of the optimal Switch for that signal. The training dataset consists of input signals paired with the signals obtained by removing the reverberation components from those input signals (the ground-truth dereverberated signals).

Before training the Switch estimation model, the signal processing device accepts input of the evaluation criterion for the signal after dereverberation by SwitchingWPE, for example SDR (signal-to-distortion ratio), scale-invariant SDR, an intelligibility measure such as STOI (Short-Time Objective Intelligibility), cepstral distance, or WER (word error rate) in ASR (automatic speech recognition).
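Two of the criteria listed above can be written in a few lines. The sketch below uses simple energy-ratio definitions of SDR and scale-invariant SDR on real-valued time-domain signals; these definitions and the function names are assumptions made for illustration, and evaluation toolkits may define SDR differently.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Signal-to-distortion ratio in dB: reference energy over error energy."""
    noise = estimate - reference
    return 10 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB: project the estimate onto the reference first."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference            # part of the estimate explained by the reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```

Because the projection absorbs any gain applied to the estimate, `si_sdr_db` is unchanged when the estimate is rescaled, whereas `sdr_db` is not.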

The signal processing device then uses the dereverberation training dataset to train a Switch estimation model that estimates a Switch such that the signal after reverberation removal by SwitchingWPE (the dereverberated signal) is optimized under the above evaluation criterion.

For example, consider the case where the evaluation criterion for the dereverberated signal is SDR and the Switch estimation model is implemented by a DNN. In this case, the signal processing device optimizes the DNN so as to maximize the SDR between the dereverberated signal that SwitchingWPE outputs for an input signal in the training dataset and the ground-truth dereverberated signal for that input signal in the training dataset (see FIG. 3).

The signal processing device then inputs the observed signal into the optimized DNN and obtains an estimate of the Switch optimized for that observed signal. Using the estimated Switch, the signal processing device dereverberates the observed signal by SwitchingWPE. In this way, the signal processing device can improve the reverberation removal performance of SwitchingWPE according to the purpose (evaluation criterion).

[Configuration example]
Next, a configuration example of the signal processing device 10 will be described with reference to FIG. 4. The signal processing device 10 includes an input/output unit 11, a storage unit 12, and a control unit 13.

The input/output unit 11 is an interface that handles the input and output of various kinds of information. For example, the input/output unit 11 accepts input of the observed signal to be dereverberated and of the evaluation criterion for the dereverberated signal. The input/output unit 11 also outputs, for example, the dereverberated signal.

The storage unit 12 stores data that the control unit 13 refers to when executing various processes. For example, the storage unit 12 stores the evaluation criterion for the dereverberated signal, the dereverberation training dataset, and the Switch estimation model (the parameters of the Switch estimation model) used to estimate the optimal Switch.

The Switch estimation model takes the observed signal supplied to SwitchingWPE 131 as input and outputs an estimate of the optimal Switch for SwitchingWPE 131. The Switch estimation model is implemented, for example, by a DNN, and is trained by the learning unit 133.

The control unit 13 controls the signal processing device 10 as a whole. The control unit 13 includes SwitchingWPE 131, a reception unit 132, a learning unit 133, a Switch setting unit 134, and a filter setting unit 135.

SwitchingWPE 131 dereverberates the observed signal using the Switch and a plurality of WPE filters. The Switch is a parameter for switching among the plurality of WPE filters at each time-frequency point of the observed signal. The WPE filters remove the reverberation components of the observed signal.

The reception unit 132 receives input of the evaluation criterion for the dereverberated signal. The evaluation criterion is, for example, SDR, scale-invariant SDR, an intelligibility measure such as STOI, cepstral distance, or WER in ASR. The evaluation criterion received by the reception unit 132 may be any one of these criteria or a combination of several of them.

The learning unit 133 uses the dereverberation training dataset to train a Switch estimation model that takes an observed signal as input and outputs an estimate of the Switch such that the signal from which reverberation components have been removed by SwitchingWPE 131 is optimized under the evaluation criterion received by the reception unit 132.

For example, when the evaluation criterion is SDR, scale-invariant SDR, STOI, or cepstral distance, the learning unit 133 uses the dereverberated signal (ground-truth signal) for each input signal in the training dataset as the ground-truth data. Using this ground-truth data, the learning unit 133 trains a Switch estimation model that estimates the Switch of SwitchingWPE 131 so that the result of evaluating, under the evaluation criterion, the dereverberated signal that SwitchingWPE 131 outputs for each input signal in the training dataset is optimized.

When the evaluation criterion is based on ASR, the learning unit 133 uses the ground-truth text (a transcription of the input speech) as the ground-truth data instead of the ground-truth signal. In this case, the learning unit 133 feeds the dereverberated signal output by SwitchingWPE 131 into the ASR system and trains the Switch estimation model so that the recognition result matches the ground-truth text as closely as possible (that is, so that ASR accuracy improves). For example, the learning unit 133 trains the Switch estimation model so that the WER of the ASR recognition result becomes as small as possible.
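For reference, the WER mentioned above is the word-level edit distance between the recognition result and the ground-truth text, divided by the number of words in the ground-truth text. The following sketch computes it with the standard dynamic-programming recurrence; the function name and whitespace tokenization are assumptions made for the example, and it only shows how the score is obtained, not how it is used in training.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion of a reference word
                           cur[j - 1] + 1,           # insertion of a hypothesis word
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```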

The Switch setting unit 134 sets, in SwitchingWPE 131, the Switch output by the trained Switch estimation model for the observed signal. The filter setting unit 135 calculates the optimal WPE filter for the Switch set by the Switch setting unit 134 and sets it in SwitchingWPE 131. The optimal WPE filter may be calculated, for example, in the same way as the WPE filter in conventional SwitchingWPE.
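One way to picture the filter calculation for a given Switch is a weighted least-squares solve per Switch state, restricted to the frames assigned to that state. The sketch below assumes a single frequency bin, a hard Switch (one integer state per frame), and a given per-frame power estimate `lam`; the function name, argument layout, and regularization constant are hypothetical and are not taken from the patent or from Non-Patent Document 1.

```python
import numpy as np

def filters_for_switch(x, X, switch, num_states, lam, reg=1e-6):
    """Return one prediction filter per Switch state for a single frequency bin.

    x: (T,) observed frames, X: (T, K) delayed past frames,
    switch: (T,) ints in [0, num_states), lam: (T,) positive power estimates.
    """
    K = X.shape[1]
    G = np.zeros((num_states, K), dtype=complex)
    for c in range(num_states):
        w = (switch == c) / lam            # weight is zero for frames outside state c
        Xw = X * w[:, None]
        R = Xw.conj().T @ X + reg * np.eye(K)   # weighted correlation matrix
        p = Xw.conj().T @ x                     # weighted cross-correlation vector
        G[c] = np.linalg.solve(R, p)
    return G
```

A state with no assigned frames ends up with a zero filter, because the right-hand side is zero and only the regularization term remains on the left.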

SwitchingWPE 131 then removes the reverberation components of the input observed signal using the Switch set by the Switch setting unit 134 and the WPE filter set by the filter setting unit 135.

In this way, the signal processing device 10 can improve the reverberation removal performance of SwitchingWPE 131 according to the intended purpose.

[Example of processing procedure]
Next, an example of the processing procedure of the signal processing device 10 will be described with reference to FIG. 5. First, the reception unit 132 of the signal processing device 10 receives input of the evaluation criterion for the signal after reverberation removal by SwitchingWPE 131 (S1).

Next, the learning unit 133 uses the dereverberation training dataset to train a Switch estimation model that takes a signal as input and outputs an estimate of the Switch such that the signal from which reverberation components have been removed by SwitchingWPE 131 is optimized under the evaluation criterion received in S1 (S2: model training).

After S2, the signal processing device 10 accepts input of an observed signal (S3). The Switch setting unit 134 then sets, in SwitchingWPE 131, the Switch estimated for the observed signal input in S3 by the Switch estimation model trained in S2 (S4: setting the Switch). The filter setting unit 135 then calculates the optimal WPE filter for the set Switch and sets it in SwitchingWPE 131 (S5: setting the WPE filter).

SwitchingWPE 131 then removes the reverberation components of the input observed signal using the Switch set by the Switch setting unit 134 and the WPE filter set by the filter setting unit 135 (S6).
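Steps S3 to S6 can be strung together for one frequency bin as in the sketch below. Everything here is illustrative: `placeholder_switch` is a hypothetical stand-in for the trained Switch estimation model (it merely buckets frames by magnitude), the power estimate is taken from the observed signal as a simplifying assumption, and the delay, tap count, and number of Switch states are arbitrary defaults.

```python
import numpy as np

def delayed_taps(x, delay, taps):
    """Row t holds x[t-delay], ..., x[t-delay-taps+1] (zeros before t=0)."""
    T = len(x)
    X = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            X[shift:, k] = x[:T - shift]
    return X

def placeholder_switch(x, num_states):
    """Hypothetical stand-in for the trained model: bucket frames by magnitude rank."""
    ranks = np.argsort(np.argsort(np.abs(x)))
    return (ranks * num_states // len(x)).astype(int)

def dereverberate_bin(x, estimate_switch, num_states=3, delay=2, taps=4, reg=1e-6):
    x = np.asarray(x, dtype=complex)               # S3: observed signal (one bin)
    X = delayed_taps(x, delay, taps)
    switch = estimate_switch(x, num_states)        # S4: set the estimated Switch
    lam = np.maximum(np.abs(x) ** 2, 1e-3)         # assumption: power from the observation
    G = np.zeros((num_states, taps), dtype=complex)
    for c in range(num_states):                    # S5: one filter per Switch state
        Xw = X * ((switch == c) / lam)[:, None]
        G[c] = np.linalg.solve(Xw.conj().T @ X + reg * np.eye(taps), Xw.conj().T @ x)
    return x - np.einsum("tk,tk->t", X, G[switch])  # S6: remove predicted reverberation
```

In the device described here, the callable passed as `estimate_switch` would be the Switch estimation model trained in S2 rather than this placeholder.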

[Evaluation results]
Next, the evaluation results of the reverberation removal performance of the signal processing device 10 will be described with reference to FIG. 6. Here, reverberation removal by the signal processing device 10 was evaluated on speech data containing reverberation components created by simulation.

The signal processing device 10 used a dereverberation training dataset to train a DNN that estimates the Switch such that the signal dereverberated by SwitchingWPE 131 is optimized with SDR as the evaluation criterion. The speech data under evaluation was recorded with one microphone. The number of Switches used by SwitchingWPE 131 was set to three. The baselines for comparison were the observed signal (no processing), WPE, and SwitchingWPE.

As shown in FIG. 6, the speech data dereverberated by the signal processing device 10, using SwitchingWPE with the Switch estimated by the DNN, had a higher SDR than with WPE or SwitchingWPE. The speech data dereverberated by the signal processing device 10 also had a lower word error rate than with WPE or SwitchingWPE.

This confirms that the signal processing device 10 can improve the reverberation removal performance of SwitchingWPE 131 according to the intended purpose.

[System configuration, etc.]
The components of each illustrated unit are functional and conceptual, and do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to what is illustrated, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program executed by that CPU, or realized as hardware using wired logic.

Of the processes described in the above embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
The signal processing device 10 described above can be implemented by installing a program (signal processing program), as packaged software or online software, on a desired computer. For example, by causing an information processing device to execute the program, the information processing device can be made to function as the signal processing device 10. The information processing device referred to here includes smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and terminals such as PDAs (Personal Digital Assistants).

FIG. 7 is a diagram showing an example of a computer that executes the signal processing program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process executed by the signal processing device 10 is implemented as a program module 1093 in which computer-executable code is written. The program module 1093 is stored, for example, on the hard disk drive 1090. For example, a program module 1093 for executing processes equivalent to the functional configuration of the signal processing device 10 is stored on the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The data used in the processing of the above embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 are not limited to being stored on the hard disk drive 1090; they may be stored, for example, on a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored on another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), etc.) and read from that computer by the CPU 1020 via the network interface 1070.

10 Signal processing device
11 Input/output unit
12 Storage unit
13 Control unit
131 SwitchingWPE
132 Reception unit
133 Learning unit
134 Switch setting unit
135 Filter setting unit

Claims

A signal processing device comprising:
a SwitchingWPE having a plurality of WPE filters that remove reverberation components of an observed signal and a Switch for switching among the plurality of WPE filters for each time frequency of the observed signal;
a reception unit that receives an input of an evaluation criterion for the signal after the reverberation components have been removed by the SwitchingWPE;
a learning unit that uses a dereverberation training dataset to train a model that estimates the Switch such that the signal from which reverberation components have been removed by the SwitchingWPE is optimized under the evaluation criterion;
a Switch setting unit that sets, in the SwitchingWPE, the Switch estimated by the trained model for an observed signal; and
a filter setting unit that calculates an optimal WPE filter for the set Switch and sets it in the SwitchingWPE,
wherein the SwitchingWPE removes the reverberation components of the observed signal using the set Switch and the set WPE filter.

The signal processing device according to claim 1, wherein the evaluation criterion is at least one of a signal-to-distortion ratio, a scale-invariant signal-to-distortion ratio, intelligibility, a cepstral distance, and a word error rate in automatic speech recognition.

The signal processing device according to claim 1, wherein the dereverberation training dataset includes an input signal and, as ground-truth data for the input signal, a signal obtained by removing reverberation components from the input signal, and
the learning unit trains the model so that the evaluation result under the evaluation criterion, obtained using the dereverberated signal output by the SwitchingWPE for the input signal and the ground-truth data, is optimized.

The signal processing device according to claim 1, wherein the model estimates the Switch using a deep neural network (DNN).

A signal processing method performed by a signal processing device, comprising:
a step of receiving an input of an evaluation criterion for the signal after reverberation components have been removed by a SwitchingWPE having a plurality of WPE filters that remove reverberation components of an observed signal and a Switch for switching among the plurality of WPE filters for each time frequency of the observed signal;
a step of training, using a dereverberation training dataset, a model that outputs an estimate of the Switch such that the signal from which reverberation components have been removed by the SwitchingWPE is optimized under the evaluation criterion;
a step of setting, in the SwitchingWPE, the Switch estimated by the trained model for an observed signal;
a step of calculating an optimal WPE filter for the set Switch and setting it in the SwitchingWPE; and
a step of removing the reverberation components of the observed signal using the SwitchingWPE in which the Switch and the WPE filter are set.

a step of receiving an input of an evaluation criterion for the signal after reverberation components have been removed by a SwitchingWPE having a plurality of WPE filters that remove reverberation components of an observed signal and a Switch for switching among the plurality of WPE filters for each time frequency of the observed signal;
a step of training, using a dereverberation training dataset, a model that outputs an estimate of the Switch such that the signal from which reverberation components have been removed by the SwitchingWPE is optimized under the evaluation criterion;
a step of setting, in the SwitchingWPE, the Switch estimated by the trained model for an observed signal;
a step of calculating an optimal WPE filter for the set Switch and setting it in the SwitchingWPE; and
a step of removing the reverberation components of the observed signal using the SwitchingWPE in which the Switch and the WPE filter are set.