JP7775899B2

JP7775899B2 - Time window generation device, method and program

Info

Publication number: JP7775899B2
Application number: JP2023578328A
Authority: JP
Inventors: 伸村田; 洋平脇阪; 記良鎌土; 翔一郎齊藤
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-02-07
Filing date: 2022-02-07
Publication date: 2025-11-26
Anticipated expiration: 2042-02-07
Also published as: WO2023148955A1; JPWO2023148955A1

Description

The present invention relates to technology for processing sound signals such as speech.

Methods based on the short-time Fourier transform are used to analyze acoustic signals in the time-frequency domain in real time. In these methods, a time window is used to cut out a signal of a fixed length and treat it as a periodic signal. The shape of a time window determines its frequency resolution and dynamic range, and the two are in a trade-off relationship. Consequently, improving the separation of adjacent frequency components can cause small components to be overlooked.
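The trade-off described above can be made concrete by comparing two standard windows. The following is a minimal sketch, not part of the patent: the helper name `window_spectrum_stats`, the frame length of 64 and the zero-padded FFT size are all assumptions chosen for illustration. It measures the main-lobe width (a proxy for frequency resolution) and the peak side-lobe level (a proxy for dynamic range) of a window.

```python
import numpy as np

def window_spectrum_stats(window, n_fft=8192):
    """Return (main-lobe width in DFT bins of the frame, peak side-lobe level in dB)."""
    spectrum = np.abs(np.fft.rfft(window, n_fft))
    spectrum_db = 20 * np.log10(spectrum / spectrum.max() + 1e-12)
    # Walk down the main lobe until the first local minimum (the first null).
    k = 1
    while k < len(spectrum_db) - 1 and spectrum_db[k + 1] < spectrum_db[k]:
        k += 1
    # Convert the null position from zero-padded bins to bins of the original frame.
    main_lobe_bins = 2 * k * len(window) / n_fft
    # Everything beyond the first null belongs to the side lobes.
    peak_sidelobe_db = spectrum_db[k:].max()
    return main_lobe_bins, peak_sidelobe_db

rect_width, rect_sidelobe = window_spectrum_stats(np.ones(64))
hann_width, hann_sidelobe = window_spectrum_stats(np.hanning(64))
print(f"rectangular: width={rect_width:.2f} bins, side lobe={rect_sidelobe:.1f} dB")
print(f"Hann:        width={hann_width:.2f} bins, side lobe={hann_sidelobe:.1f} dB")
```

The rectangular window has the narrower main lobe but much higher side lobes, while the Hann window trades a wider main lobe for lower side lobes, which is the relationship the paragraph above refers to.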

Because no single time window is effective for all signals, a time window suited to the situation must be used.

In conventional acoustic signal processing systems, either a specific window function is used in a fixed manner, or several window functions prepared in advance are switched between (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Yuma Koizumi, "Sound source enhancement and phase control based on deep learning," Journal of the Acoustical Society of Japan, Vol. 75, No. 3, pp. 156-163, 2019

However, there has conventionally been no technology for generating an appropriate time window.

An object of the present invention is to provide a time window generation device, method, and program that generate an appropriate time window.

A time window generation device according to one aspect of the present invention includes: a signal extraction unit that generates an extracted signal by cutting out a sound signal to a predetermined length; an analysis window generation unit that generates an analysis window using the extracted signal and an analysis window model determined by analysis window model parameters; a synthesis window generation unit that generates a synthesis window using at least the extracted signal, or using the analysis window; a frequency domain transform unit that generates a frequency domain signal by transforming the extracted signal into the frequency domain using the analysis window; a signal processing unit that performs predetermined processing on the frequency domain signal to generate a processed frequency domain signal; a time domain transform unit that generates a time domain signal by transforming the processed frequency domain signal into the time domain using the synthesis window; and a learning unit that learns at least the analysis window model parameters using the time domain signal and ground truth data corresponding to the time domain signal. The predetermined processing is speech signal enhancement processing. The ground truth data is a noise-free original sound signal.

An appropriate time window can be generated.

FIG. 1 is a diagram showing an example of the functional configuration of the time window generation device.
FIG. 2 is a diagram showing an example of the processing procedure of the time window generation method.
FIG. 3 is a diagram showing an example of the functional configuration of a computer.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

[Time window generation device and method]
As shown in FIG. 1, the time window generation device includes, for example, a signal extraction unit 1, an analysis window generation unit 2, a synthesis window generation unit 3, a frequency domain transform unit 4, a signal processing unit 5, a time domain transform unit 6, and a learning unit 7.

The time window generation method is realized, for example, by the components of the time window generation device performing the processing of steps S1 to S7 described below and shown in FIG. 2.

Here, a time window means at least one of an analysis window and a synthesis window.

Each component of the time window generation device is described below.

<Signal extraction unit 1>
A sound signal is input to the signal extraction unit 1.

The signal extraction unit 1 generates an extracted signal by cutting out the sound signal to a predetermined length (step S1).

The generated extracted signal is output to the analysis window generation unit 2 and the synthesis window generation unit 3.
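Step S1 can be sketched as follows. The text only specifies a predetermined length; the overlapping frames and the `hop_length` parameter are assumptions made here because short-time analysis usually advances by less than one frame. The function name `extract_frames` is hypothetical.

```python
import numpy as np

def extract_frames(signal, frame_length, hop_length):
    """Cut a 1-D sound signal into frames of a predetermined length (step S1).

    hop_length (the shift between consecutive frames) is an assumption;
    the source only states that the length of each extracted signal is fixed.
    """
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    # Row m holds the sample indices of frame m.
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return signal[idx]

frames = extract_frames(np.arange(10.0), frame_length=4, hop_length=2)
print(frames)
```

Each row of `frames` is one extracted signal that would be passed on to the window generation units.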

<Analysis window generation unit 2>
The extracted signal is input to the analysis window generation unit 2. The analysis window model parameters generated by the learning unit 7, described later, are also input to the analysis window generation unit 2.

The analysis window generation unit 2 generates an analysis window using the extracted signal and an analysis window model determined by the analysis window model parameters (step S2). The generated analysis window is output to the frequency domain transform unit 4.

The analysis window model parameters are, for example, those generated by the learning unit 7, described later. On the first pass of the analysis window generation unit 2, when no analysis window model parameters generated by the learning unit 7 exist yet, the analysis window generation unit 2 uses predetermined analysis window model parameters. In this case, the analysis window generation unit 2 may generate and output a predetermined analysis window.
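The source does not specify the form of the analysis window model. As one purely illustrative possibility, the sketch below assumes a tiny linear model that maps two features of the extracted signal (log energy and zero-crossing rate) to softmax weights over three basis windows. The feature choice, the basis windows, the parameter names `W` and `b`, and the function name are all hypothetical. The fallback branch mirrors the first-pass behaviour described above, where predetermined parameters yield a predetermined window.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate_analysis_window(frame, params=None):
    """Generate an analysis window from a frame and model parameters (step S2)."""
    n = len(frame)
    # Assumed basis: rectangular, Hann and Hamming windows, shape (3, n).
    basis = np.stack([np.ones(n), np.hanning(n), np.hamming(n)])
    if params is None:
        # No learned parameters yet: predetermined parameters that select Hann.
        params = {"W": np.zeros((3, 2)), "b": np.array([-20.0, 20.0, -20.0])}
    # Assumed features of the extracted signal: log energy and zero-crossing rate.
    log_energy = np.log(np.mean(frame ** 2) + 1e-8)
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    features = np.array([log_energy, zero_crossing_rate])
    weights = softmax(params["W"] @ features + params["b"])
    # The window is a signal-dependent convex combination of the basis windows.
    return weights @ basis

frame = np.sin(2 * np.pi * np.arange(32) / 8)
window = generate_analysis_window(frame)  # first pass: effectively a Hann window
```

In a real system the model would more plausibly be a neural network, as the variations section mentions deep learning; the linear model here only keeps the example short.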

<Synthesis window generation unit 3>
The extracted signal is input to the synthesis window generation unit 3. The synthesis window model parameters generated by the learning unit 7, described later, are also input to the synthesis window generation unit 3.

The synthesis window generation unit 3 generates a synthesis window using the extracted signal and a synthesis window model determined by the synthesis window model parameters (step S3). The generated synthesis window is output to the time domain transform unit 6.

The synthesis window model parameters are, for example, those generated by the learning unit 7, described later. On the first pass of the synthesis window generation unit 3, when no synthesis window model parameters generated by the learning unit 7 exist yet, the synthesis window generation unit 3 uses predetermined synthesis window model parameters. In this case, the synthesis window generation unit 3 may generate and output a predetermined synthesis window.

<Frequency domain transform unit 4>
The extracted signal and the analysis window are input to the frequency domain transform unit 4.

The frequency domain transform unit 4 generates a frequency domain signal by transforming the extracted signal into the frequency domain using the analysis window (step S4). The generated frequency domain signal is output to the signal processing unit 5.

The frequency domain transform unit 4 performs the transformation to the frequency domain using a technique such as the short-time Fourier transform.
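Step S4 amounts to multiplying each extracted signal by the analysis window and taking a discrete Fourier transform. A minimal sketch, assuming real-valued frames stacked row-wise and using NumPy's `np.fft.rfft`; the function name `to_frequency_domain` is hypothetical:

```python
import numpy as np

def to_frequency_domain(frames, analysis_window):
    """Apply the analysis window to each frame and transform it (step S4)."""
    # frames: (n_frames, frame_length); result: (n_frames, frame_length // 2 + 1)
    return np.fft.rfft(frames * analysis_window, axis=-1)

n = np.arange(64)
frame = np.cos(2 * np.pi * 8 * n / 64)            # 8 cycles per frame
spec = to_frequency_domain(frame[None, :], np.hanning(64))
print(np.argmax(np.abs(spec[0])))                 # strongest bin
```

For this test tone the strongest bin is bin 8, with the neighbouring bins also excited because of the main-lobe width of the window.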

<Signal processing unit 5>
The frequency domain signal is input to the signal processing unit 5.

The signal processing unit 5 performs predetermined processing on the frequency domain signal to generate a processed frequency domain signal (step S5). The generated processed frequency domain signal is output to the time domain transform unit 6.

Examples of the predetermined processing are at least one of processing that enhances a predetermined signal such as a voice (in other words, speech signal enhancement processing, or noise suppression processing) and classification processing that classifies noise or the like.

When the predetermined processing is speech signal enhancement processing, the signal processing unit 5 estimates a speech enhancement filter, for example, using the frequency domain signal. The signal processing unit 5 then multiplies the frequency domain signal by the estimated speech enhancement filter to generate a frequency domain signal in which the speech is enhanced. This frequency domain signal is an example of the processed frequency domain signal.
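The estimate-then-multiply structure of step S5 can be sketched as below. The source does not say how the filter is estimated; this example substitutes a simple spectral-subtraction-style gain that assumes a known per-bin noise power estimate `noise_psd`. That estimate, the gain floor and the function names are all assumptions for illustration.

```python
import numpy as np

def estimate_enhancement_filter(spec, noise_psd, floor=0.05):
    """Estimate a real-valued gain per time-frequency bin (assumed gain rule)."""
    power = np.abs(spec) ** 2
    gain = 1.0 - noise_psd / np.maximum(power, 1e-12)
    # Clip so that noise-dominated bins are attenuated rather than inverted.
    return np.clip(gain, floor, 1.0)

def enhance(spec, noise_psd):
    """Multiply the frequency domain signal by the estimated filter (step S5)."""
    return estimate_enhancement_filter(spec, noise_psd) * spec

spec = np.array([[10.0 + 0j, 0.1 + 0j]])   # one strong bin, one weak bin
noise_psd = np.array([1.0, 1.0])
processed = enhance(spec, noise_psd)       # the weak bin is pushed down to the floor gain
```

The output of `enhance` plays the role of the processed frequency domain signal.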

The predetermined processing may include noise classification processing. In this case, the signal processing unit 5 uses the frequency domain signal to estimate the noise contained in the frequency domain signal. The noise label obtained as the estimation result is output to the learning unit 7.

<Time domain transform unit 6>
The processed frequency domain signal and the synthesis window are input to the time domain transform unit 6.

The time domain transform unit 6 generates a time domain signal by transforming the processed frequency domain signal into the time domain using the synthesis window (step S6). The generated time domain signal is output to the learning unit 7. The generated time domain signal may also be output from the time window generation device as the result of the predetermined processing by the signal processing unit 5.

The time domain transform unit 6 performs the transformation to the time domain using a technique such as the inverse short-time Fourier transform.
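An inverse short-time Fourier transform with a synthesis window is commonly implemented as an inverse FFT per frame followed by windowed overlap-add. The sketch below assumes that structure; `hop_length` and the function names are assumptions, since the source does not fix the frame shift.

```python
import numpy as np

def overlap_add(frames_time, synthesis_window, hop_length):
    """Window each time-domain frame and sum the frames at their positions."""
    n_frames, frame_length = frames_time.shape
    out = np.zeros(hop_length * (n_frames - 1) + frame_length)
    for m in range(n_frames):
        start = m * hop_length
        out[start:start + frame_length] += synthesis_window * frames_time[m]
    return out

def to_time_domain(processed_spec, synthesis_window, hop_length):
    """Inverse-transform each frame and overlap-add the results (step S6)."""
    frames_time = np.fft.irfft(processed_spec, n=len(synthesis_window), axis=-1)
    return overlap_add(frames_time, synthesis_window, hop_length)

# Round trip with square-root periodic Hann windows at 50% overlap.
N, H = 16, 8
w = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N))
x = np.random.default_rng(0).standard_normal(64)
frames = np.stack([x[i:i + N] for i in range(0, len(x) - N + 1, H)])
y = to_time_domain(np.fft.rfft(frames * w, axis=-1), w, H)
```

With this window pair and no processing in between, `y` matches `x` everywhere except the first and last `H` samples, which are covered by only one frame.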

<Learning unit 7>
The time domain signal is input to the learning unit 7. Ground truth data corresponding to the time domain signal is also input to the learning unit 7.

The learning unit 7 learns the analysis window parameters and the synthesis window parameters using the time domain signal and the ground truth data corresponding to the time domain signal (step S7).

The analysis window parameters and the synthesis window parameters are learned by, for example, gradient descent.

When the predetermined processing in the signal processing unit 5 is speech signal enhancement processing, the noise-free original sound signal is the ground truth data. In this case, the analysis window parameters and the synthesis window parameters are learned, using a gradient method or the like, so as to reduce the mean squared error obtained by squaring and averaging the difference between the time domain signal and the original sound signal.
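The learning step can be illustrated end to end with a toy setup. This is a sketch under several assumptions, none of which come from the source: the analysis window samples themselves are treated as the learnable parameters `theta`, the synthesis window and a low-pass mask standing in for the enhancement filter are held fixed, and a central finite-difference gradient replaces the automatic differentiation a real implementation would likely use.

```python
import numpy as np

N, H = 16, 8  # assumed frame length and hop

def pipeline(noisy, analysis_window, synthesis_window, mask):
    """Steps S1, S4, S5 and S6 with a fixed mask as the 'predetermined processing'."""
    starts = range(0, len(noisy) - N + 1, H)
    frames = np.stack([noisy[s:s + N] for s in starts])
    spec = np.fft.rfft(frames * analysis_window, axis=-1)
    frames_out = np.fft.irfft(mask * spec, n=N, axis=-1) * synthesis_window
    out = np.zeros(len(noisy))
    for m, s in enumerate(starts):
        out[s:s + N] += frames_out[m]
    return out

def numerical_gradient(f, theta, eps=1e-4):
    """Central finite differences, a stand-in for automatic differentiation."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def learn_analysis_window(noisy, clean, synthesis_window, mask, theta0,
                          lr=0.2, n_steps=50):
    """Gradient descent on the mean squared error against the clean signal (step S7)."""
    def loss(theta):
        return np.mean((pipeline(noisy, theta, synthesis_window, mask) - clean) ** 2)
    theta = theta0.copy()
    history = [loss(theta)]
    for _ in range(n_steps):
        theta = theta - lr * numerical_gradient(loss, theta)
        history.append(loss(theta))
    return theta, history

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(256) / 16)
noisy = clean + 0.3 * rng.standard_normal(256)
synthesis = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N) / N)  # periodic Hann
mask = (np.arange(N // 2 + 1) <= 2).astype(float)             # keep the lowest bins
theta, history = learn_analysis_window(noisy, clean, synthesis, mask, np.ones(N))
print(history[0], history[-1])
```

Because the error is measured after the mask and the inverse transform, the learned window is shaped by the downstream processing rather than by a generic window criterion.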

When the predetermined processing in the signal processing unit 5 includes noise classification processing, the true noise label is input to the learning unit 7 as ground truth data. The noise label estimated by the predetermined processing in the signal processing unit 5 is also input to the learning unit 7. In this case, the learning unit 7 may learn the analysis window parameters and the synthesis window parameters using the true noise label and the estimated noise label in addition to the time domain signal and the ground truth data corresponding to the time domain signal.
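One plausible way to use both kinds of ground truth is a weighted sum of a reconstruction loss and a classification loss. The source does not state how the two are combined, so the cross-entropy term, the `weight` value and the function names below are assumptions.

```python
import numpy as np

def cross_entropy(noise_probs, true_label):
    """Negative log-probability assigned to the true noise class."""
    return -np.log(noise_probs[true_label] + 1e-12)

def combined_loss(time_signal, clean, noise_probs, true_label, weight=0.1):
    """Mean squared error plus a weighted noise-classification loss (assumed form)."""
    mse = np.mean((time_signal - clean) ** 2)
    return mse + weight * cross_entropy(noise_probs, true_label)

# Example: a unit reconstruction error and an undecided two-class noise estimate.
loss = combined_loss(np.ones(4), np.zeros(4), np.array([0.5, 0.5]), true_label=0)
```

Minimizing such a loss would push the windows toward shapes that help both the enhancement and the noise classification.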

The processing of steps S1 to S7 described above may be repeated as appropriate.

The learning unit 7 learns the analysis window parameters and the synthesis window parameters based on the time domain signal, which is a signal that has undergone the predetermined processing in the signal processing unit 5 and has been transformed into the time domain. The analysis window parameters and the synthesis window parameters can therefore be said to be learned with the predetermined processing of the signal processing unit 5, which is downstream of the analysis window generation unit 2 and the synthesis window generation unit 3, taken into account. Learning the analysis window parameters and the synthesis window parameters with the downstream processing taken into account in this way makes it possible to generate time windows more appropriately than before.

[Variations]
Embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments. It goes without saying that design changes made as appropriate without departing from the spirit of the present invention are included in the present invention.

The various kinds of processing described in the embodiments may be executed not only in time series in the order described, but also in parallel or individually, depending on the processing capacity of the device executing the processing or as needed.

For example, data may be exchanged between the components of the time window generation device directly or via a storage unit that is not shown.

The synthesis window can be generated from the analysis window. The synthesis window generation unit 3 may therefore generate the synthesis window using the analysis window generated by the analysis window generation unit 2. In other words, the synthesis window generation unit 3 need only generate the synthesis window using at least the extracted signal, or using the analysis window.
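The source does not say how the synthesis window is derived from the analysis window. One well-known construction, used here as an assumed example, divides the analysis window by the sum of its squared, hop-shifted copies so that analysis followed by synthesis and overlap-add reconstructs the signal. The function name and `hop_length` are hypothetical, and the frame length is assumed to be a multiple of the hop.

```python
import numpy as np

def synthesis_from_analysis(analysis_window, hop_length):
    """Derive a synthesis window that gives perfect reconstruction under overlap-add."""
    n = len(analysis_window)
    assert n % hop_length == 0, "frame length must be a multiple of the hop"
    # Energy of all analysis-window samples that overlap at each position within a hop.
    energy = (analysis_window ** 2).reshape(-1, hop_length).sum(axis=0)
    return analysis_window / np.tile(energy, n // hop_length)

analysis = np.hamming(16)
synthesis = synthesis_from_analysis(analysis, hop_length=4)
# The overlapped products of the two windows sum to one at every position.
print((analysis * synthesis).reshape(-1, 4).sum(axis=0))
```

With such a rule only the analysis window needs to be learned, which matches the case where the synthesis window model parameters are not learned.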

In this case, the learning unit 7 does not need to learn the synthesis window model parameters. In other words, the learning unit 7 need only learn at least the analysis window model parameters using the time domain signal and the ground truth data corresponding to the time domain signal.

Not only the processing in the analysis window generation unit 2 and the synthesis window generation unit 3 but also the predetermined processing in the signal processing unit 5 may be implemented with deep learning.

That is, the predetermined processing in the signal processing unit 5 may be processing that generates the processed frequency domain signal using the frequency domain signal and a model determined by model parameters. In this case, the learning unit 7 may further learn the model parameters using at least the time domain signal and the ground truth data corresponding to the time domain signal.

The predetermined processing in the signal processing unit 5 may also include processing that estimates a noise label using the frequency domain signal and a model determined by noise estimation model parameters. In this case, the learning unit 7 may further learn the noise estimation model parameters using, in addition to the time domain signal and the ground truth data corresponding to the time domain signal, the true noise label input to the learning unit 7 and the noise label estimated by the signal processing unit 5.

[Program and recording medium]
The processing of each unit of each device described above may be realized by a computer, in which case the processing content of the functions that each device should have is described by a program. By loading this program into the storage unit 1020 of the computer 1000 shown in FIG. 3 and causing the arithmetic processing unit 1010, the input unit 1030, the output unit 1040, the display unit 1060, and so on to operate, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a non-transitory recording medium, specifically a magnetic recording device, an optical disk, or the like.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program, for example, first stores the program recorded on the portable recording medium, or the program transferred from the server computer, in the auxiliary recording unit 1050, which is its own non-transitory storage device. When executing the processing, the computer loads the program stored in the auxiliary recording unit 1050 into the storage unit 1020 and executes processing in accordance with the loaded program. As another form of execution, the computer may load the program directly from the portable recording medium into the storage unit 1020 and execute processing in accordance with it, or, each time a program is transferred to the computer from the server computer, the computer may successively execute processing in accordance with the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is used for processing by an electronic computer and is equivalent to a program (such as data that is not a direct instruction to the computer but has the property of defining the processing of the computer).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized in hardware. For example, the signal extraction unit 1, the analysis window generation unit 2, the synthesis window generation unit 3, the frequency domain transform unit 4, the signal processing unit 5, the time domain transform unit 6, and the learning unit 7 may be configured as processing circuits.

It goes without saying that other changes can be made as appropriate without departing from the spirit of the present invention.

Claims

1. A time window generation device comprising:
a signal extraction unit that generates an extracted signal by cutting out a sound signal to a predetermined length;
an analysis window generation unit that generates an analysis window using the extracted signal and an analysis window model determined by analysis window model parameters;
a synthesis window generation unit that generates a synthesis window using at least the extracted signal, or using the analysis window;
a frequency domain transform unit that generates a frequency domain signal by transforming the extracted signal into the frequency domain using the analysis window;
a signal processing unit that performs predetermined processing on the frequency domain signal to generate a processed frequency domain signal;
a time domain transform unit that generates a time domain signal by transforming the processed frequency domain signal into the time domain using the synthesis window; and
a learning unit that learns at least the analysis window model parameters using the time domain signal and ground truth data corresponding to the time domain signal,
wherein the predetermined processing is speech signal enhancement processing,
the ground truth data is a noise-free original sound signal,
the synthesis window generation unit generates the synthesis window using the extracted signal and a synthesis window model determined by synthesis window model parameters, and
the learning unit further learns the synthesis window model parameters using the time domain signal and the ground truth data corresponding to the time domain signal.

2. The time window generation device according to claim 1, wherein
the predetermined processing is processing that generates the processed frequency domain signal using the frequency domain signal and a model determined by model parameters, and
the learning unit further learns the model parameters using at least the time domain signal and the ground truth data corresponding to the time domain signal.

3. A time window generation method comprising:
a signal extraction step in which a signal extraction unit generates an extracted signal by cutting out a sound signal to a predetermined length;
an analysis window generation step in which an analysis window generation unit generates an analysis window using the extracted signal and an analysis window model determined by analysis window model parameters;
a synthesis window generation step in which a synthesis window generation unit generates a synthesis window using at least the extracted signal, or using the analysis window;
a frequency domain transform step in which a frequency domain transform unit generates a frequency domain signal by transforming the extracted signal into the frequency domain using the analysis window;
a signal processing step in which a signal processing unit performs predetermined processing on the frequency domain signal to generate a processed frequency domain signal;
a time domain transform step in which a time domain transform unit generates a time domain signal by transforming the processed frequency domain signal into the time domain using the synthesis window; and
a learning step in which a learning unit learns at least the analysis window model parameters using the time domain signal and ground truth data corresponding to the time domain signal,
wherein the predetermined processing is speech signal enhancement processing,
the ground truth data is a noise-free original sound signal,
in the synthesis window generation step, the synthesis window is generated using the extracted signal and a synthesis window model determined by synthesis window model parameters, and
in the learning step, the synthesis window model parameters are further learned using the time domain signal and the ground truth data corresponding to the time domain signal.

4. A program for causing a computer to function as each unit of the time window generation device according to claim 1 or 2.