JP7658212B2

JP7658212B2 - Signal analysis device, signal analysis method, and signal analysis program

Info

Publication number: JP7658212B2
Application number: JP2021130718A
Authority: JP
Inventors: 弘和亀岡; 莉李
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-08-10
Filing date: 2021-08-10
Publication date: 2025-04-08
Anticipated expiration: 2041-08-10
Also published as: JP2023025457A

Description

The disclosed technology relates to a signal analysis device, a signal analysis method, and a signal analysis program.

Blind source separation (BSS) is a technique that separates and extracts individual source signals from the observed mixed signals alone, without prior information such as knowledge of the sound sources or the transfer functions between the sources and the microphones. Under overdetermined conditions, where the number of microphones is greater than or equal to the number of sources, independent component analysis (ICA), which estimates a separation filter so as to maximize the independence between source signals, is known to be effective, and many methods extending its principle have been proposed. Among these, methods formulated in the time-frequency domain have the advantage that they can exploit various assumptions that hold in that domain about the sources and about the frequency response of the microphone array. For example, independent low-rank matrix analysis (ILRMA), described in Non-Patent Document 1, is based on the assumption that the power spectrogram of each source signal can be modeled as the product of two non-negative matrices (a low-rank non-negative matrix). The separation performance of this method is therefore inevitably limited for sources that do not follow this assumption.

In recent years, attempts have been made to improve separation accuracy by introducing deep neural networks (DNNs) into signal-processing-based methods such as ICA. The multichannel variational autoencoder (MVAE) method described in Non-Patent Document 2 pre-trains a generative model of source spectrograms represented by a conditional VAE (CVAE) and, at separation time, estimates the CVAE decoder inputs together with the separation matrix. It achieves particularly high separation accuracy among DNN-based methods. Because the parameters are updated so that the likelihood function increases at every iteration, convergence to a stationary point of the likelihood function is guaranteed. However, the decoder inputs are updated by backpropagation, which incurs a high computational cost.

The FastMVAE method described in Non-Patent Document 3 was proposed to reduce the computational cost of the MVAE method. Using a VAE with an auxiliary classifier (ACVAE), it trains, together with the decoder that serves as the generative model of source spectrograms, a classifier distribution and an encoder distribution that approximate the distribution of the source class and the posterior distribution of the latent variable. The trained classifier and encoder are then used to predict the decoder inputs that maximize the posterior distribution. While this speeds up the source separation algorithm relative to the MVAE method, separation performance tends to degrade when the test conditions do not match the training conditions, for example with unseen speakers or long reverberation.

Non-Patent Document 1: Daichi Kitamura, Nobutaka Ono, Hiroshi Sawada, Hirokazu Kameoka, and Hiroshi Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1626-1641, Sep. 2016.

Non-Patent Document 2: Hirokazu Kameoka, Li Li, Shota Inoue, and Shoji Makino, “Supervised Determined Source Separation with Multichannel Variational Autoencoder,” Neural Computation, vol. 31, no. 9, pp. 1891-1914, Sep. 2019.

Non-Patent Document 3: Li Li, Hirokazu Kameoka, Shota Inoue, and Shoji Makino, “FastMVAE: A Fast Optimization Algorithm for the Multichannel Variational Autoencoder Method,” IEEE Access, vol. 8, pp. 228740-228753, Dec. 2020.

In the MVAE method, the parameters are updated so that the log-likelihood increases at every iteration, which guarantees convergence to a stationary point of the log-likelihood. However, updating the parameters of the speech generative model by backpropagation incurs a very high computational cost.

In contrast, the FastMVAE method of Non-Patent Document 3 greatly speeds up the source separation algorithm by predicting the updated values of these parameters with an encoder and a classifier trained in advance together with the decoder. However, the outputs of the encoder and the classifier in the FastMVAE method only approximate the update in the direction of steepest ascent of the log-likelihood with respect to these parameters, and it has been confirmed experimentally that the FastMVAE method falls short of the MVAE method in separation accuracy.

The disclosed technology has been made in view of the above points, and aims to provide a signal analysis device, method, and program that can accurately separate each constituent sound from a mixed signal in which the constituent sounds are mixed, while keeping the computational cost low.

A first aspect of the present disclosure is a signal analysis device including a learning unit and a parameter estimation unit. The learning unit, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of that constituent sound, trains an encoder that takes a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates the variance of the spectrogram of the sound. The parameter estimation unit takes as input an observed signal in which the constituent sounds are mixed, and estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using: the latent vector sequence estimated by the trained encoder for each constituent sound separated by the separation matrix; the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix; the spectrogram of each constituent sound, calculated from the scale parameter and the variance of the spectrogram of that constituent sound generated by the trained decoder; the scale parameter of the spectrogram of each constituent sound; the separation matrix for separating, in the time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds; and the signals obtained by separating the observed signal into the individual constituent sounds.

A second aspect of the present disclosure is a signal analysis method. In this method, a learning unit, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of that constituent sound, trains an encoder that takes a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates the variance of the spectrogram of the sound. A parameter estimation unit then takes as input an observed signal in which the constituent sounds are mixed, and estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using: the latent vector sequence estimated by the trained encoder for each constituent sound separated by the separation matrix; the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix; the spectrogram of each constituent sound, calculated from the scale parameter and the variance of the spectrogram of that constituent sound generated by the trained decoder; the scale parameter of the spectrogram of each constituent sound; the separation matrix for separating, in the time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds; and the signals obtained by separating the observed signal into the individual constituent sounds.

A third aspect of the present disclosure is a program for causing a computer to function as the signal analysis device of the first aspect.

The disclosed technology has the effect that each constituent sound can be accurately separated from a mixed signal in which the constituent sounds are mixed, while keeping the computational cost low.

FIG. 1 is a conceptual diagram for explaining the configurations of the teacher encoder and the teacher decoder according to the present embodiment.
FIG. 2 is a conceptual diagram for explaining the configurations of the encoder, the classifier, and the decoder according to the present embodiment.
FIG. 3 is a diagram showing a configuration example of the encoder and the classifier and a configuration example of the decoder according to the present embodiment.
FIG. 4 is a schematic block diagram of an example of a computer that functions as the signal analysis device of the present embodiment.
FIG. 5 is a block diagram showing the configuration of the signal analysis device of the present embodiment.
FIG. 6 is a flowchart showing the learning processing routine in the signal analysis device of the present embodiment.
FIG. 7 is a flowchart showing the signal analysis processing routine in the signal analysis device of the present embodiment.
FIG. 8 is a diagram showing the arrangement of microphones and sound sources in an experimental example.
FIG. 9 is a diagram showing the computation time per iteration for the method of the present embodiment and for conventional methods.

An example of an embodiment of the disclosed technology is described below with reference to the drawings. Identical or equivalent components and parts are given the same reference symbols in each drawing. The dimensional ratios in the drawings are exaggerated for convenience of explanation and may differ from the actual ratios.

<Outline of this embodiment>
First, an overview of this embodiment is given.

In this embodiment, the encoder and the classifier of the FastMVAE method are merged into a single multitask neural network (NN), which yields a further speed-up. In addition, knowledge distillation (KD) is applied so that the output distributions of this multitask NN and of the decoder are as close as possible to the output distributions of the encoder and decoder obtained by pre-training in the MVAE method. This makes each NN behave similarly to the parameter updates of the speech generative model in the MVAE method and brings the separation accuracy close to the high accuracy of the MVAE method. With these ideas, source separation that is faster and more accurate than the conventional FastMVAE method is achieved, even for unseen speakers.

<Principle of this embodiment>
<Formulation of the multichannel source separation problem under overdetermined conditions>
Consider the case where signals arriving from $J$ sound sources are observed by $I$ microphones. Let $x_i(f,n)$ and $s_j(f,n)$ denote the complex spectrograms of the signal observed at microphone $i$ and of the signal of source $j$, respectively, and define the vectors having these as elements by

$$\mathbf{x}(f,n) = [x_1(f,n), \ldots, x_I(f,n)]^{\mathsf T} \in \mathbb{C}^{I} \tag{1}$$

$$\mathbf{s}(f,n) = [s_1(f,n), \ldots, s_J(f,n)]^{\mathsf T} \in \mathbb{C}^{J} \tag{2}$$

Here we consider the overdetermined condition with $I = J$, that is, as many microphones as sources. $(\cdot)^{\mathsf T}$ denotes transposition, and $f$ and $n$ are the frequency and time indices, respectively.

Under the condition $I = J$, the instantaneous separation system

$$\mathbf{s}(f,n) = \mathbf{W}^{\mathsf H}(f)\,\mathbf{x}(f,n) \tag{3}$$

$$\mathbf{W}(f) = [\mathbf{w}_1(f), \ldots, \mathbf{w}_I(f)] \tag{4}$$

can be assumed as the relation between the vector $\mathbf{s}(f,n)$ of complex spectrograms of the source signals and the vector $\mathbf{x}(f,n)$ of observed signals. Here $\mathbf{W}^{\mathsf H}(f)$ denotes the separation matrix and $(\cdot)^{\mathsf H}$ the Hermitian transpose.
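The separation system above amounts to one small matrix product per frequency bin. The following NumPy sketch illustrates this under stated assumptions: the array layout `(F, N, I)` for the spectrograms, the function name `separate`, and the synthetic mixing matrices `A` are all hypothetical choices made for illustration, and the separation matrix is taken to act as $\mathbf{s}(f,n) = \mathbf{W}^{\mathsf H}(f)\mathbf{x}(f,n)$ with the columns of `W[f]` playing the role of the per-source filters.

```python
import numpy as np

def separate(X, W):
    """Apply s(f, n) = W(f)^H x(f, n) in every frequency bin.

    X: complex array, shape (F, N, I), observed STFT at I microphones
    W: complex array, shape (F, I, I), column j of W[f] is the filter w_j(f)
    Returns a complex array of shape (F, N, I) holding the separated signals.
    """
    # y[f, n, j] = sum_i conj(W[f, i, j]) * X[f, n, i] = w_j(f)^H x(f, n)
    return np.einsum("fij,fni->fnj", W.conj(), X)

# Toy check: mix two sources with a known (hypothetical) matrix A(f), then undo it.
rng = np.random.default_rng(0)
F, N, I = 4, 50, 2
S = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
A = rng.standard_normal((F, I, I)) + 1j * rng.standard_normal((F, I, I))
X = np.einsum("fij,fnj->fni", A, S)               # x(f, n) = A(f) s(f, n)
W = np.linalg.inv(A).conj().transpose(0, 2, 1)    # choose W so that W^H = A^{-1}
S_hat = separate(X, W)
```

In blind separation the mixing matrices are of course unknown; the toy check only confirms that, when $\mathbf{W}^{\mathsf H}(f)$ equals the inverse mixing matrix, the product recovers the sources.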

Under the above assumption of an instantaneous mixing system, suppose further that the complex spectrogram $s_j(f,n)$ of source $j$ is a random variable following a zero-mean complex normal distribution with variance $v_j(f,n) = \mathbb{E}[|s_j(f,n)|^2]$,

$$s_j(f,n) \sim \mathcal{N}_{\mathbb C}\bigl(s_j(f,n) \mid 0,\, v_j(f,n)\bigr). \tag{5}$$

Then, when the source signals $s_j(f,n)$ and $s_{j'}(f,n)$, $j \neq j'$, are statistically independent, the source signal vector $\mathbf{s}(f,n)$ follows

$$\mathbf{s}(f,n) \sim \mathcal{N}_{\mathbb C}\bigl(\mathbf{s}(f,n) \mid \mathbf{0},\, \mathbf{V}(f,n)\bigr). \tag{6}$$

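Here $\mathbf{V}(f,n)$ is the diagonal matrix whose diagonal elements are $v_1(f,n), \ldots, v_I(f,n)$. From equations (3) and (6), the observed signal $\mathbf{x}(f,n)$ follows

$$\mathbf{x}(f,n) \sim \mathcal{N}_{\mathbb C}\bigl(\mathbf{x}(f,n) \mid \mathbf{0},\, (\mathbf{W}^{\mathsf H}(f))^{-1}\,\mathbf{V}(f,n)\,\mathbf{W}(f)^{-1}\bigr). \tag{7}$$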

Therefore, given the observed signals $\mathcal{X} = \{\mathbf{x}(f,n)\}_{f,n}$, the log-likelihood function of the separation matrices $\mathcal{W} = \{\mathbf{W}(f)\}_{f}$ and the power spectrograms of the sources $\mathcal{V} = \{v_j(f,n)\}_{j,f,n}$ is

$$\log p(\mathcal{X} \mid \mathcal{W}, \mathcal{V}) \overset{c}{=} 2N \sum_{f} \log \bigl|\det \mathbf{W}^{\mathsf H}(f)\bigr| - \sum_{f,n,j} \left( \log v_j(f,n) + \frac{|\mathbf{w}_j^{\mathsf H}(f)\,\mathbf{x}(f,n)|^2}{v_j(f,n)} \right). \tag{8}$$

Here $\overset{c}{=}$ denotes equality restricted to the terms that depend on the parameters, and $N$ is the number of time frames. When the source power spectrograms $v_j(f,n)$ are unconstrained, equation (8) decomposes into a separate term for each frequency $f$, so the indices of the separated signals obtained with the $\mathcal{W}$ estimated from equation (8) are subject to a permutation ambiguity across frequencies. When $v_j(f,n)$ has a structural constraint along the frequency direction, exploiting that constraint leads to approaches that solve permutation alignment and source separation simultaneously. ILRMA and the MVAE method are examples.
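The permutation ambiguity described above can be checked numerically. The sketch below assumes the usual form of the determined-BSS log-likelihood under the local Gaussian model, namely $2N\sum_f \log|\det \mathbf{W}^{\mathsf H}(f)|$ minus the sum over $(f,n,j)$ of $\log v_j + |\mathbf{w}_j^{\mathsf H}\mathbf{x}|^2 / v_j$; the function name `log_likelihood`, the array shapes, and the random test data are illustrative assumptions rather than anything taken from the patent.

```python
import numpy as np

def log_likelihood(X, W, V):
    """Log-likelihood of (W, V) given X, up to an additive constant.

    X: (F, N, I) complex observations
    W: (F, I, I) complex, column j of W[f] is w_j(f)
    V: (F, N, I) positive source variances v_j(f, n)
    """
    F, N, I = X.shape
    Y = np.einsum("fij,fni->fnj", W.conj(), X)        # y_j(f, n) = w_j(f)^H x(f, n)
    WH = W.conj().transpose(0, 2, 1)                  # W(f)^H for every f
    _, logabsdet = np.linalg.slogdet(WH)              # log|det W(f)^H| per bin
    return 2.0 * N * np.sum(logabsdet) - np.sum(np.log(V) + np.abs(Y) ** 2 / V)

rng = np.random.default_rng(0)
F, N, I = 3, 20, 2
X = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
W = rng.standard_normal((F, I, I)) + 1j * rng.standard_normal((F, I, I))
V = rng.uniform(0.5, 2.0, size=(F, N, I))
ll = log_likelihood(X, W, V)

# Swap the two outputs in frequency bin 0 only (filters and variances together).
W_swapped, V_swapped = W.copy(), V.copy()
W_swapped[0] = W[0][:, ::-1]
V_swapped[0] = V[0][:, ::-1]
ll_swapped = log_likelihood(X, W_swapped, V_swapped)
```

With unconstrained variances the two values coincide, so the likelihood alone cannot tell which output in bin 0 belongs to which source. A source model that ties the variances together across frequency removes this freedom.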

<Prior art 1: MVAE method>
In the MVAE method, the decoder distribution of a CVAE that takes the source class label as an auxiliary input is used as the generative model of the complex spectrogram of each source. Let $S = \{s(f,n)\}_{f,n}$ be the complex spectrogram of a source signal, and let the one-hot vector $c$ be the corresponding source class label. FIG. 1 shows a conceptual diagram of the CVAE. The CVAE trains the NN parameters $\phi$ and $\theta$ of the encoder and decoder so that the encoder distribution $q^{*}_{\phi}(z \mid S, c)$ and the decoder distribution $p^{*}_{\theta}(S \mid z, c)$ are consistent, that is, so that $q^{*}_{\phi}(z \mid S, c)$ matches as closely as possible the posterior distribution $p^{*}_{\theta}(z \mid S, c) \propto p^{*}_{\theta}(S \mid z, c)\,p(z)$ derived from $p^{*}_{\theta}(S \mid z, c)$. Here the decoder distribution of the CVAE is set to a probabilistic model of the same form as the local Gaussian source model of equation (5):

$$p^{*}_{\theta}(S \mid z, c, g) = \prod_{f,n} \mathcal{N}_{\mathbb C}\bigl(s(f,n) \mid 0,\, v(f,n)\bigr) \tag{9}$$

$$v(f,n) = g \cdot \sigma^{*2}_{\theta}(f,n;\, z, c) \tag{10}$$

where the variance $\sigma^{*2}_{\theta}(f,n;\, z, c)$ is the output of the decoder network and $g$ is a variable representing the scale of the power spectrogram.

On the other hand, as in an ordinary CVAE, the encoder distribution $q^{*}_{\phi}(z \mid S, c)$ is assumed to be a normal distribution,

$$q^{*}_{\phi}(z \mid S, c) = \mathcal{N}\bigl(z \mid \mu^{*}_{\phi}(S, c),\, \mathrm{diag}(\sigma^{*2}_{\phi}(S, c))\bigr). \tag{11}$$

Here $\mu^{*}_{\phi}(S, c)$ and $\sigma^{*2}_{\phi}(S, c)$ are the outputs of the encoder. The CVAE parameters $\theta$ and $\phi$ are trained, using training samples $\{S_m, c_m\}_{m=1}^{M}$ of complex spectrograms of source signals of various classes, so as to maximize

$$\mathcal{J}(\phi, \theta) = \mathbb{E}_{(S,c) \sim p_D(S,c)}\Bigl[\, \mathbb{E}_{z \sim q^{*}_{\phi}(z \mid S, c)}\bigl[\log p^{*}_{\theta}(S \mid z, c)\bigr] - \mathrm{KL}\bigl[q^{*}_{\phi}(z \mid S, c) \,\|\, p(z)\bigr] \Bigr]. \tag{12}$$

Here $\mathbb{E}_{(S,c) \sim p_D(S,c)}[\cdot]$ denotes the sample mean over the training samples, and $\mathrm{KL}[\cdot \,\|\, \cdot]$ is the Kullback-Leibler (KL) divergence. The decoder distribution $p^{*}_{\theta}(S \mid z, c, g)$ trained in this way is called the CVAE source model. The CVAE source model is a generative model capable of representing the complex spectrograms of the various classes of sources contained in the training samples. The variable $c$ can be regarded as adjusting the categorical features of the source class, and $z$ as adjusting the variation within a class.
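For a single training pair, the CVAE criterion combines a reconstruction log-likelihood under the zero-mean complex Gaussian decoder with a KL penalty on the encoder distribution. The sketch below is a minimal illustration under several assumptions that are not stated in the text: the prior $p(z)$ is taken to be a standard normal, the inner expectation is approximated by one reparameterised sample, and `toy_encoder` and `toy_decoder` are hypothetical stand-ins for the trained networks.

```python
import numpy as np

def complex_gaussian_loglik(S, var):
    """log prod_{f,n} N_C(s(f,n) | 0, var(f,n)) for a zero-mean circular complex Gaussian."""
    return -np.sum(np.log(np.pi * var) + np.abs(S) ** 2 / var)

def kl_to_standard_normal(mu, sigma2):
    """Closed-form KL[ N(mu, diag(sigma2)) || N(0, I) ]."""
    return 0.5 * np.sum(mu ** 2 + sigma2 - 1.0 - np.log(sigma2))

def elbo_sample(S, c, encoder, decoder, rng):
    """One-sample estimate of the per-example criterion (reconstruction minus KL)."""
    mu, sigma2 = encoder(S, c)
    z = mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)   # reparameterised draw
    var = decoder(z, c)                                        # decoder variance per (f, n)
    return complex_gaussian_loglik(S, var) - kl_to_standard_normal(mu, sigma2)

# Hypothetical stand-ins for trained networks, used only to exercise the functions.
def toy_encoder(S, c):
    return np.zeros(3), np.ones(3)

def toy_decoder(z, c):
    return np.ones((2, 2))

rng = np.random.default_rng(0)
S = np.zeros((2, 2), dtype=complex)
c = np.array([1.0, 0.0])
value = elbo_sample(S, c, toy_encoder, toy_decoder, rng)
```

Averaging `elbo_sample` over the training set would give a Monte Carlo estimate of the criterion; in practice the networks would be trained by gradient ascent in an automatic-differentiation framework rather than in plain NumPy.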

By expressing the generative model of the complex spectrogram $S_j = \{s_j(f,n)\}_{f,n}$ of source $j$ by the decoder distribution with inputs $z_j$, $c_j$, and $g_j$, the likelihood function of the source model parameters reduces to a likelihood function of the same form as equation (8). Therefore, a stationary point of equation (8) can be searched for by iteratively updating the separation matrices $\mathcal{W}$, the CVAE source model parameters $\Psi = \{z_j, c_j\}_j$, and the scale parameters $G = \{g_j\}_j$ so that the likelihood function of equation (8) increases. To update $\mathcal{W}$ so as to increase equation (8), iterative projection (IP),

$$\mathbf{w}_j(f) \leftarrow \bigl(\mathbf{W}^{\mathsf H}(f)\,\boldsymbol{\Sigma}_j(f)\bigr)^{-1} \mathbf{e}_j \tag{13}$$

$$\mathbf{w}_j(f) \leftarrow \frac{\mathbf{w}_j(f)}{\sqrt{\mathbf{w}_j^{\mathsf H}(f)\,\boldsymbol{\Sigma}_j(f)\,\mathbf{w}_j(f)}} \tag{14}$$

can be used, as in ILRMA.

Here

$$\boldsymbol{\Sigma}_j(f) = \frac{1}{N} \sum_{n} \frac{\mathbf{x}(f,n)\,\mathbf{x}^{\mathsf H}(f,n)}{v_j(f,n)},$$

and $\mathbf{e}_j$ is the $j$-th column vector of the $I \times I$ identity matrix. The update of $\Psi$ that increases equation (8) can be performed by backpropagation, and the update of $G$ by

$$g_j \leftarrow \frac{1}{FN} \sum_{f,n} \frac{|y_j(f,n)|^2}{\sigma^{*2}_{\theta}(f,n;\, z_j, c_j)}, \tag{15}$$

where $y_j(f,n) = \mathbf{w}_j^{\mathsf H}(f)\,\mathbf{x}(f,n)$ and $F$ is the number of frequency bins. Equation (15) is the update that maximizes equation (8) with $\mathcal{W}$ and $\Psi$ held fixed. From the above, the inference process of the MVAE method can be summarized as follows.

1. Train $\theta$ and $\phi$ using equation (12) as the criterion.
2. Initialize the separation matrices $\mathbf{W}(f)$ to the identity matrix and initialize $\Psi$.
3. Repeat steps a to c below for each $j$.
(Step a) Update $\{\mathbf{w}_j(f)\}_{j,f}$ using equations (13) and (14).
(Step b) Update $\Psi_j = \{z_j, c_j\}$ by backpropagation.
(Step c) Update $g_j$ using equation (15).
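Steps a and c of the procedure above have closed forms, which the following NumPy sketch illustrates. It assumes the standard iterative-projection updates used in ILRMA (solve $\mathbf{W}^{\mathsf H}\boldsymbol{\Sigma}_j \mathbf{w} = \mathbf{e}_j$, then normalise by $\sqrt{\mathbf{w}^{\mathsf H}\boldsymbol{\Sigma}_j\mathbf{w}}$) and the scale update obtained by maximizing the Gaussian likelihood with respect to $g_j$. The function names, array shapes, and random test data are hypothetical, and step b (the backpropagation update of the decoder inputs) is deliberately left out.

```python
import numpy as np

def ip_update(X, W, V):
    """One sweep of iterative projection over all sources j.

    X: (F, N, I) complex observations
    W: (F, I, I) complex, column j of W[f] is w_j(f)
    V: (F, N, I) positive variances v_j(f, n)
    """
    F, N, I = X.shape
    W = W.copy()
    eye = np.eye(I)
    for j in range(I):
        # Sigma_j(f) = (1/N) sum_n x(f,n) x(f,n)^H / v_j(f,n)
        Sigma = np.einsum("fni,fnk->fik", X / V[:, :, j:j + 1], X.conj()) / N
        WH = W.conj().transpose(0, 2, 1)
        e_j = np.broadcast_to(eye[:, j], (F, I))[..., None]      # (F, I, 1)
        w = np.linalg.solve(WH @ Sigma, e_j)[..., 0]             # (W^H Sigma_j)^{-1} e_j
        norm = np.sqrt(np.einsum("fi,fik,fk->f", w.conj(), Sigma, w).real)
        W[:, :, j] = w / norm[:, None]                           # enforce w^H Sigma_j w = 1
    return W

def scale_update(Y, sigma2):
    """g_j = (1 / (F N)) sum_{f,n} |y_j(f,n)|^2 / sigma2_j(f,n)."""
    return np.mean(np.abs(Y) ** 2 / sigma2, axis=(0, 1))

rng = np.random.default_rng(0)
F, N, I = 3, 200, 2
X = rng.standard_normal((F, N, I)) + 1j * rng.standard_normal((F, N, I))
V = rng.uniform(0.5, 2.0, size=(F, N, I))            # stands in for g_j * decoder variance
W0 = np.tile(np.eye(I, dtype=complex), (F, 1, 1))    # identity initialisation
W1 = ip_update(X, W0, V)
Y1 = np.einsum("fij,fni->fnj", W1.conj(), X)         # separated signals after the sweep
g = scale_update(Y1, V)
```

Because the normalisation step fixes $\mathbf{w}_j^{\mathsf H}\boldsymbol{\Sigma}_j\mathbf{w}_j = 1$ in every bin, evaluating the scale update against the same variances returns 1 for every source in this toy run. In the actual algorithm the variances change between iterations as the source model parameters are updated.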

<Prior art 2: FastMVAE method>
In the MVAE method, the parameters are updated so that the log-likelihood increases at every iteration, which guarantees convergence to a stationary point of the log-likelihood. However, updating by backpropagation the parameters $z_j$ and $c_j$ that maximize $p_{\theta}(z_j, c_j \mid S_j)$ incurs a very high computational cost. In the FastMVAE method of Non-Patent Document 3, the posterior distribution $p_{\theta}(z, c \mid S)$ is factorized into the product of two conditional distributions, $p_{\theta}(z \mid S, c)\,p_{\theta}(c \mid S)$, and distributions $q^{*}_{\phi}(z \mid S, c)$ and $r^{*}_{\psi}(c \mid S)$ that approximate each factor are represented by NNs and trained in advance. The parameter search by backpropagation in the MVAE method can then be replaced by forward passes through these NNs, which enables fast inference. However, the outputs of the encoder $q^{*}_{\phi}(z \mid S, c)$ and the classifier $r^{*}_{\psi}(c \mid S)$ in the FastMVAE method only approximate the update in the direction of steepest ascent of the log-likelihood with respect to these parameters, and it has been confirmed experimentally that the FastMVAE method falls short of the MVAE method in separation accuracy.

<Method of the present embodiment>
The FastMVAE2 method used in this embodiment first assumes that the latent variable $z$ and the attribute class $c$ of the source are conditionally independent. This corresponds to assuming that, given a spectrogram $S$, the speaker information $c$ and the information $z$ about the utterance content are independent. In other words, it differs from the conventional method in assuming that the posterior probability $p_{\theta}(z, c \mid S)$ can be written as $p_{\theta}(z \mid S)\,p_{\theta}(c \mid S)$. If approximations of these two conditional distributions are available, the parameter search can be performed quickly by forward passes through NNs, as in the FastMVAE method.

<ChimeraACVAE source model>
ACVAE is an extension of the CVAE that was originally proposed for voice conversion. To strengthen the influence of the input class label $c$ on the decoder output, it trains the encoder and decoder with the mutual information $I(c, S \mid z)$ between the decoder output and the class label $c$ as a regularization term. Directly optimizing a criterion that includes $I(c, S \mid z)$ is not easy. However, as in CVAE training, a variational lower bound can be introduced, and raising the criterion that combines this lower bound with $\mathcal{J}(\phi, \theta)$ indirectly increases the original criterion. $I(c, S \mid z)$ is given by the sum of the expectation of $\log p(c \mid S)$ and a constant, and replacing $p(c \mid S)$ with a suitable auxiliary distribution $r(c \mid S)$ yields a lower bound on $I(c, S \mid z)$. By modeling this auxiliary distribution $r(c \mid S)$ with an NN with parameters $\psi$, $\psi$ can be trained together with $\phi$ and $\theta$ using this lower bound as the criterion. The auxiliary distribution represented by the NN with parameters $\psi$ is written $r_{\psi}(c \mid S)$ and is called the classifier.

In contrast, the "ChimeraACVAE" of this embodiment is a model in which the encoder and the classifier of the ACVAE are represented as a single multitask NN. That is, it is a model that infers the distributions $q^{+}_{\phi}(z \mid S)$ and $r^{+}_{\psi}(c \mid S)$ of $z$ and $c$ simultaneously from the spectrogram $S$. FIG. 2 shows a conceptual diagram of ChimeraACVAE.

Because ChimeraACVAE is structured to extract the latent variable $z$ from the input spectrogram alone, it avoids inference errors in $z$ caused by estimation errors in the class label $c$. In addition, because it can be described with a more compact network structure than the conventional ACVAE model, it is expected to enable faster inference.
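The idea of a single multitask network can be sketched as one shared trunk followed by three heads, so that one forward pass yields the mean and variance of the latent distribution and the class posterior. The toy below is only an illustration under loud assumptions: it uses random dense weights instead of the convolutional layers of the actual model, pools over time for the class head, and every name (`MultitaskEncoder`, `W_trunk`, the dimensions) is hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class MultitaskEncoder:
    """Toy stand-in for a unified encoder/classifier: one trunk, three heads."""

    def __init__(self, n_freq, n_hidden, n_latent, n_classes, rng):
        scale = 0.1
        self.W_trunk = scale * rng.standard_normal((n_freq, n_hidden))
        self.W_mu = scale * rng.standard_normal((n_hidden, n_latent))
        self.W_logvar = scale * rng.standard_normal((n_hidden, n_latent))
        self.W_cls = scale * rng.standard_normal((n_hidden, n_classes))

    def __call__(self, S_pow):
        # S_pow: (N, F) power spectrogram, one row per time frame
        h = silu(np.log(S_pow + 1e-8) @ self.W_trunk)    # features shared by all heads
        mu = h @ self.W_mu                                # per-frame mean of q(z | S)
        sigma2 = np.exp(h @ self.W_logvar)                # per-frame variance of q(z | S)
        c_prob = softmax(h.mean(axis=0) @ self.W_cls)     # r(c | S), one vector per signal
        return mu, sigma2, c_prob

rng = np.random.default_rng(0)
net = MultitaskEncoder(n_freq=16, n_hidden=8, n_latent=4, n_classes=3, rng=rng)
S_pow = rng.uniform(0.1, 1.0, size=(10, 16))
mu, sigma2, c_prob = net(S_pow)
```

Note that the latent heads never see `c_prob`: the latent variable is computed from the spectrogram alone, which mirrors the point made above that class-label errors cannot propagate into the estimate of $z$.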

The criterion for training ChimeraACVAE, that is, the objective function to be maximized with respect to the NN parameters $\theta$, $\phi$, and $\psi$, includes the sum of the CVAE training criterion

(16)

and the mutual information

(17)

In addition, since the labeled training samples $\{S_m, c_m\}_{m=1}^{M}$ can also be used for training, the negative cross-entropy between the training data $S_m$ and the corresponding class label $c_m$

(18)

can also be included in the training criterion. Up to this point, apart from the model structure, the method is the same as the conventional ACVAE.

However, although an ACVAE trained with the above criteria can perform highly accurate inference when the test conditions match the training conditions, when they do not match the estimated latent variables tend to deviate from the assumed distribution, and the generalization ability of the model has not been sufficient.

Therefore, to improve the generalization ability of the model, the training of ChimeraACVAE uses the following criteria and knowledge distillation in addition to the criteria above. In ChimeraACVAE, the spectrogram $S$ can be reconstructed using the estimated class information. Because this process is also used at inference time, training the model with a criterion that evaluates the accuracy of the spectrogram $S$ reconstructed by the same process is expected to improve accuracy at inference time. The reconstruction criterion of equation (19) and the class identification criterion of equation (20), both to be maximized, are therefore also included in the training criterion.

(19)

(20)

Alternatively, to simplify implementation, the approximations

(21)

(22)

may be used instead of these criteria, where

[formula omitted in source]

知識蒸留（ＫｎｏｗｌｅｄｇｅＤｉｓｔｉｌｌａｔｉｏｎ：ＫＤ）は事前に大量のデータで学習した大きなＮＮを教師用モデルとし、その知識を軽量または別のＮＮ構造を持つ生徒モデルに継承させるための方法論であり、汎化能力の高い生徒モデルが得られることが知られている。ここで、未知話者に対しても高い分離精度を実現できるＣＶＡＥモデルを教師用モデルとし、ＣＶＡＥで学習した潜在変数の分布ｑ^＊ _φ（ｚ｜Ｓ，ｃ）とスペクトログラムの生成モデルｐ^＊ _θ（Ｓ｜ｚ，ｃ）の知識を生徒モデルであるＣｈｉｍｅｒａＡＣＶＡＥに継承させることを考える。具体的には、ＣＶＡＥで推論した潜在変数の分布ｑ^＊ _φ（ｚ｜Ｓ，ｃ）と、デコーダで出力した分散σ^＊ _φ ^２を用いた正規分布Ｎ（０、ｄｉａｇ（σ^＊ _θ ^２（ｚ，ｃ）））をそれぞれ生徒モデルの出力分布ｑ^＋ _φ（ｚ｜Ｓ）と、デコーダ出力σ^＋ _φ ^２を用いた正規分布の事前分布とし、生徒モデルの出力が事前分布に近づくよう学習させる。ただし、教師用モデルと生徒モデルの分布の乖離度を、ＫＬダイバージェンスを用いて測り、式（２３）～（２５）に示すように、知識蒸留規準とする。 Knowledge Distillation (KD) is a methodology for using a large NN trained with a large amount of data in advance as a teacher model and inheriting the knowledge to a student model having a lightweight or different NN structure, and it is known that a student model with high generalization ability can be obtained. Here, we consider using a CVAE model that can achieve high separation accuracy even for unknown speakers as a teacher model, and inheriting the knowledge of the distribution of latent variables q ^* _φ (z|S,c) and the spectrogram generation model p ^* _θ (S|z,c) learned by the CVAE to the student model, ChimeraACVAE. Specifically, the distribution of latent variables inferred by CVAE q ^* _φ (z|S,c) and the normal distribution N(0, diag(σ ^* ^θ2 (z,c))) using the variance σ ^* _φ2 output by the decoder are _set as the output distribution q ⁺ _φ (z|S) of the ^student model and the prior distribution of the normal distribution using the decoder output σ ⁺ _φ2 ^, respectively, and the student model is trained so that its output approaches the prior distribution. However, the degree of divergence between the distributions of the teacher model and the student model is measured using the KL divergence and used as the knowledge distillation criterion as shown in equations (23) to (25).

(23)

(24)

(25)
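Equations (23) to (25) are not reproduced in this text, so the following is only a minimal sketch of their common building block: the closed-form KL divergence between two diagonal Gaussians. The function names, the argument order, and the direction of the divergence (which distribution plays the role of p and which of q) are assumptions made for illustration, not the form used in the equations above.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for real diagonal Gaussians p = N(mu_p, diag(var_p)), q = N(mu_q, diag(var_q))."""
    mu_p, var_p, mu_q, var_q = (np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q))
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )

def kl_zero_mean(var_p, var_q):
    """Special case for two zero-mean Gaussians, e.g. N(0, diag(sigma^2)) decoder outputs."""
    zeros = np.zeros_like(np.asarray(var_p, dtype=float))
    return kl_diag_gaussians(zeros, var_p, zeros, var_q)

# Hypothetical encoder outputs (mean, variance) of a teacher and a student
teacher_mu, teacher_var = np.array([0.0, 1.0]), np.array([1.0, 0.5])
student_mu, student_var = np.array([0.1, 0.8]), np.array([1.2, 0.6])
encoder_term = kl_diag_gaussians(student_mu, student_var, teacher_mu, teacher_var)
```

The divergence is zero only when the two distributions coincide, so driving it down pulls the student's outputs toward the teacher's. For circular complex Gaussians, as commonly used for spectrogram models, the per-element expression has the same shape without the factor 0.5.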

From the above, the criterion to be maximized when training ChimeraACVAE is

(26)

where each λ is a non-negative weighting coefficient for the corresponding criterion. Figure 2 shows a conceptual diagram of ChimeraACVAE training with knowledge distillation.

Figure 3 shows an example of the network structure of ChimeraACVAE. Each layer of the encoder and the classifier consists of a convolutional layer, Layer Normalization (LN), and a Sigmoid Linear Unit (SiLU); each layer of the decoder consists of a deconvolutional layer, LN, and SiLU. Using LN avoids the mismatch between how normalization is computed during training and during inference. Like the Gated Linear Unit (GLU) used in the CVAE source model, SiLU is a data-driven activation function that uses a gate to control the information passed between layers, but it needs only half as many parameters as GLU.
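The difference in parameter count between SiLU and GLU can be illustrated with a small NumPy sketch. It uses plain linear layers instead of convolutions for brevity, and all sizes and variable names are arbitrary assumptions. GLU splits its input channels into a value half and a gate half, so the preceding layer must produce twice as many channels as SiLU needs for the same output width.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def silu(h):
    """SiLU: the pre-activation gates itself, h * sigmoid(h)."""
    return h * sigmoid(h)

def glu(h):
    """GLU: split the last axis in half; one half gates the other."""
    a, b = np.split(h, 2, axis=-1)
    return a * sigmoid(b)

def layer_norm(h, eps=1e-5):
    """Normalize each sample over its last axis; no batch statistics are involved,
    so the computation is identical during training and inference."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                  # 4 frames, 8 input channels
d_out = 16                                       # desired output width
w_silu = rng.standard_normal((8, d_out))         # layer feeding SiLU
w_glu = rng.standard_normal((8, 2 * d_out))      # layer feeding GLU: twice the outputs

y_silu = silu(layer_norm(x @ w_silu))
y_glu = glu(layer_norm(x @ w_glu))
```

Both outputs have the same shape, but `w_glu` holds twice as many weights as `w_silu`.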

<FastMVAE2 method: fast inference algorithm>
Using the encoder and classifier trained with ChimeraACVAE, the step of the conventional MVAE method that maximizes p_θ(z_j, c_j | S_j) can be replaced with forward passes of q^+_φ(z_j | S_j) and r^+_ψ(c_j | S_j). This yields the following algorithm, which is called the FastMVAE2 method.

1. Train θ, φ, and ψ using equation (26) as the training criterion.
2. Initialize W to the identity matrix.
3. Repeat steps a to c below for each j.
(Step a) Update {w_j(f)}_{j,f} using equations (13) and (14).
(Step b) Take the spectrogram separated with W as input, and update z_j and c_j to the mean of the Gaussian distribution output by the encoder and the output value (a continuous-valued vector) of the classifier, respectively.
(Step c) Update g_j using equation (15).
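Steps 2 and 3 above can be sketched as follows. Equations (13) to (15) are not reproduced in this section, so the sketch assumes the iterative-projection update commonly used in ICA-based separation for step a, and the average ratio of separated power to decoder variance for step c. The functions `stub_encoder`, `stub_classifier`, and `stub_decoder` are hypothetical placeholders standing in for the trained networks of step 1, and the all-ones initial variances are likewise an assumption.

```python
import numpy as np

EPS = 1e-8

def separate(W, X):
    """y(f,n) = W(f) x(f,n).  W: (F, J, J), X: (F, N, J) -> (F, N, J)."""
    return np.einsum("fjk,fnk->fnj", W, X)

def ip_update(W, X, V):
    """Assumed form of step a: iterative projection. V: (J, F, N) source variances."""
    F, N, J = X.shape
    for j in range(J):
        Xw = X / V[j][:, :, None]
        Sigma = np.einsum("fnk,fnl->fkl", Xw, X.conj()) / N      # weighted covariance
        e_j = np.tile(np.eye(J)[:, j][None, :, None], (F, 1, 1))
        w = np.linalg.solve(W @ Sigma, e_j)[..., 0]              # (W Sigma)^{-1} e_j
        scale = np.sqrt(np.real(np.einsum("fk,fkl,fl->f", w.conj(), Sigma, w)))
        W[:, j, :] = (w / np.maximum(scale, EPS)[:, None]).conj()  # row j holds w_j^H
    return W

def stub_encoder(S, n_bands=4):
    """Placeholder encoder: returns a 'mean' z of shape (n_bands, N)."""
    F, N = S.shape
    return np.log(S.reshape(n_bands, F // n_bands, N).mean(axis=1) + EPS)

def stub_classifier(S, n_classes=3):
    """Placeholder classifier: returns a continuous-valued probability vector."""
    logits = np.log(S.mean() + EPS) * np.arange(n_classes)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def stub_decoder(z, c, F):
    """Placeholder decoder: positive variance of shape (F, N); ignores c."""
    return np.repeat(np.exp(z), F // z.shape[0], axis=0) + EPS

def fastmvae2_sketch(X, n_iter=10):
    F, N, J = X.shape
    W = np.tile(np.eye(J, dtype=complex), (F, 1, 1))   # step 2: identity initialization
    V = np.ones((J, F, N))                              # assumed initial variances
    g = np.ones(J)
    for _ in range(n_iter):                             # step 3
        W = ip_update(W, X, V)                          # step a
        P = np.abs(separate(W, X)) ** 2
        for j in range(J):
            S_j = P[:, :, j]
            z_j, c_j = stub_encoder(S_j), stub_classifier(S_j)   # step b: forward passes
            sigma2 = stub_decoder(z_j, c_j, F)
            g[j] = np.mean(S_j / sigma2)                # step c (assumed form)
            V[j] = g[j] * sigma2
    return W, separate(W, X), g

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 50, 2)) + 1j * rng.standard_normal((16, 50, 2))
W, Y, g = fastmvae2_sketch(X, n_iter=3)
```

The point of the sketch is the control flow: step b is a single forward pass per source, with no inner gradient-based search over z_j and c_j.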

＜本実施形態に係る信号解析装置の構成＞
図４は、本実施形態の信号解析装置１００のハードウェア構成を示すブロック図である。 <Configuration of the signal analysis device according to this embodiment>
FIG. 4 is a block diagram showing a hardware configuration of the signal analyzing device 100 of the present embodiment.

図４に示すように、信号解析装置１００は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）１１、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）１２、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）１３、ストレージ１４、入力部１５、表示部１６及び通信インタフェース（Ｉ／Ｆ）１７を有する。各構成は、バス１９を介して相互に通信可能に接続されている。 As shown in FIG. 4, the signal analysis device 100 has a CPU (Central Processing Unit) 11, a ROM (Read Only Memory) 12, a RAM (Random Access Memory) 13, a storage 14, an input unit 15, a display unit 16, and a communication interface (I/F) 17. Each component is connected to each other via a bus 19 so as to be able to communicate with each other.

ＣＰＵ１１は、中央演算処理ユニットであり、各種プログラムを実行したり、各部を制御したりする。すなわち、ＣＰＵ１１は、ＲＯＭ１２又はストレージ１４からプログラムを読み出し、ＲＡＭ１３を作業領域としてプログラムを実行する。ＣＰＵ１１は、ＲＯＭ１２又はストレージ１４に記憶されているプログラムに従って、上記各構成の制御及び各種の演算処理を行う。本実施形態では、ＲＯＭ１２又はストレージ１４には、学習処理を実行するための学習プログラム、及び信号解析処理を実行するための信号解析プログラムが格納されている。学習プログラム及び信号解析プログラムは、１つのプログラムであっても良いし、複数のプログラム又はモジュールで構成されるプログラム群であっても良い。 The CPU 11 is a central processing unit that executes various programs and controls each part. That is, the CPU 11 reads a program from the ROM 12 or storage 14, and executes the program using the RAM 13 as a working area. The CPU 11 controls each of the above components and performs various calculation processes according to the program stored in the ROM 12 or storage 14. In this embodiment, the ROM 12 or storage 14 stores a learning program for executing the learning process, and a signal analysis program for executing the signal analysis process. The learning program and the signal analysis program may be a single program, or may be a group of programs consisting of multiple programs or modules.

ＲＯＭ１２は、各種プログラム及び各種データを格納する。ＲＡＭ１３は、作業領域として一時的にプログラム又はデータを記憶する。ストレージ１４は、ＨＤＤ（ＨａｒｄＤｉｓｋＤｒｉｖｅ）又はＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）により構成され、オペレーティングシステムを含む各種プログラム、及び各種データを格納する。 The ROM 12 stores various programs and various data. The RAM 13 temporarily stores programs or data as a working area. The storage 14 is composed of a HDD (Hard Disk Drive) or SSD (Solid State Drive) and stores various programs including the operating system and various data.

入力部１５は、マウス等のポインティングデバイス、及びキーボードを含み、各種の入力を行うために使用される。 The input unit 15 includes a pointing device such as a mouse and a keyboard, and is used to perform various input operations.

入力部１５は、学習データとして、複数の構成音の各々について、当該構成音の信号の時系列データ及び当該構成音の信号の属性を示す属性クラスを受け付ける。また、入力部１５は、解析対象データとして、複数の構成音が混じっている混合信号（以後、観測信号）の時系列データを受け付ける。なお、構成音の信号の属性を示す属性クラスは、人手で与えておけばよい。また、構成音の信号の属性とは、例えば、性別、大人／子供、話者ＩＤなどである。 The input unit 15 receives, as learning data, time series data of the signal of each of the multiple constituent sounds and an attribute class indicating the attribute of the signal of the multiple constituent sounds. The input unit 15 also receives, as data to be analyzed, time series data of a mixed signal (hereinafter, observed signal) in which multiple constituent sounds are mixed. Note that the attribute class indicating the attribute of the signal of the constituent sounds may be manually provided. Furthermore, examples of the attributes of the signal of the constituent sounds include gender, adult/child, speaker ID, etc.

表示部１６は、例えば、液晶ディスプレイであり、各種の情報を表示する。表示部１６は、タッチパネル方式を採用して、入力部１５として機能しても良い。 The display unit 16 is, for example, a liquid crystal display, and displays various information. The display unit 16 may also function as the input unit 15 by adopting a touch panel system.

通信インタフェース１７は、他の機器と通信するためのインタフェースであり、例えば、イーサネット（登録商標）、ＦＤＤＩ、Ｗｉ－Ｆｉ（登録商標）等の規格が用いられる。 The communication interface 17 is an interface for communicating with other devices, and uses standards such as Ethernet (registered trademark), FDDI, and Wi-Fi (registered trademark).

次に、信号解析装置１００の機能構成について説明する。図５は、信号解析装置１００の機能構成の例を示すブロック図である。 Next, the functional configuration of the signal analysis device 100 will be described. FIG. 5 is a block diagram showing an example of the functional configuration of the signal analysis device 100.

信号解析装置１００は、機能的には、図５に示すように、時間周波数展開部２４と、教師学習部３０と、学習部３２と、音源信号モデル記憶部３４と、パラメータ推定部３６と、出力部３８と、を含んで構成されている。 As shown in FIG. 5, the signal analysis device 100 is functionally configured to include a time-frequency expansion unit 24, a teacher learning unit 30, a learning unit 32, a sound source signal model storage unit 34, a parameter estimation unit 36, and an output unit 38.

時間周波数展開部２４は、構成音毎に、当該構成音の信号の時系列データに基づいて、各時刻のスペクトルを表すパワースペクトログラムを計算する。また、時間周波数展開部２４は、観測信号の時系列データに基づいて、各時刻のスペクトルを表すパワースペクトログラムを計算する。なお、本実施の形態においては、短時間フーリエ変換やウェーブレット変換などの時間周波数展開を行う。 For each constituent sound, the time-frequency expansion unit 24 calculates a power spectrogram representing the spectrum at each time based on the time series data of the signal of the constituent sound. The time-frequency expansion unit 24 also calculates a power spectrogram representing the spectrum at each time based on the time series data of the observed signal. In this embodiment, time-frequency expansion such as short-time Fourier transform and wavelet transform is performed.
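As an illustration of the time-frequency expansion, the following minimal sketch computes a power spectrogram with a short-time Fourier transform in NumPy. The frame length, hop size, Hann window, and test signal are assumptions chosen for the example; the embodiment does not specify them.

```python
import numpy as np

def stft(x, frame_len=1024, hop=256):
    """Short-time Fourier transform; returns a complex array of shape (F, N).
    Assumes len(x) >= frame_len."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T

def power_spectrogram(x, frame_len=1024, hop=256):
    """Squared magnitude of the STFT: the spectrum at each time frame."""
    return np.abs(stft(x, frame_len, hop)) ** 2

fs = 16000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs                       # one second of samples
tone = np.sin(2 * np.pi * 1000.0 * t)        # 1 kHz test tone
P = power_spectrogram(tone)
```

Each column of `P` is the power spectrum of one frame. An amplitude spectrogram, as in the modification described later, would be `np.abs(stft(x))` without the square.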

教師学習部３０は、学習データとして入力された各構成音についてのスペクトログラム及び属性クラスに基づいて、音のスペクトログラム及び属性クラスを入力として潜在ベクトル系列を推定する教師用エンコーダ、並びに潜在ベクトル系列及び属性クラスを入力として音のスペクトログラムの分散を生成する教師用デコーダを学習し、音源信号モデル記憶部３４に格納する。 The teacher learning unit 30 learns a teacher encoder that estimates a latent vector sequence using the spectrogram and attribute class of the sound as input, based on the spectrogram and attribute class of each constituent sound input as learning data, and a teacher decoder that generates the variance of the sound spectrogram using the latent vector sequence and attribute class as input, and stores them in the sound source signal model storage unit 34.

具体的には、教師学習部３０は、構成音毎に、教師用デコーダによって生成されたパワースペクトログラムと、元の構成音の信号におけるパワースペクトログラムとの誤差、並びに、教師用エンコーダによって推定された潜在ベクトル系列と、元の構成音の信号における潜在ベクトル系列との距離を用いて表される、上記式（１２）の目的関数の値を最大化するように、教師用エンコーダ及び教師用デコーダを学習し、音源信号モデル記憶部３４に格納する。ここで、教師用エンコーダ及び教師用デコーダの各々は、畳み込みネットワーク又は再帰型ネットワークを用いて構成される。 Specifically, the teacher learning unit 30 learns the teacher encoder and teacher decoder so as to maximize the value of the objective function of the above formula (12), which is expressed for each constituent sound using the error between the power spectrogram generated by the teacher decoder and the power spectrogram in the signal of the original constituent sound, and the distance between the latent vector sequence estimated by the teacher encoder and the latent vector sequence in the signal of the original constituent sound, and stores the trained results in the sound source signal model storage unit 34. Here, each of the teacher encoder and teacher decoder is configured using a convolutional network or a recurrent network.

学習部３２は、学習データとして入力された各構成音についてのスペクトログラム及び属性クラスに基づいて、音のスペクトログラムを入力として潜在ベクトル系列を推定するエンコーダと、音のスペクトログラムを入力として属性クラスを識別する識別器と、潜在ベクトル系列及び属性クラスを入力として音のスペクトログラムの分散を生成するデコーダと、を学習する。 The learning unit 32 learns an encoder that estimates a latent vector sequence using the spectrogram of the sound as input based on the spectrogram and attribute class for each constituent sound input as learning data, a classifier that identifies the attribute class using the spectrogram of the sound as input, and a decoder that generates the variance of the sound spectrogram using the latent vector sequence and attribute class as input.

Specifically, the learning unit 32 trains the encoder, the classifier, and the decoder so as to maximize the criterion of formula (26) above, and stores them in the sound source signal model storage unit 34. This criterion includes: a learning criterion for evaluating the output of the encoder and the output of the decoder; the mutual information between the decoder output and the attribute class; a reconstruction criterion for evaluating the spectrogram generated using the output of the decoder when the encoder output and the classifier output are its inputs; a class identification criterion for evaluating the output of the classifier when that generated spectrogram is its input; and a knowledge distillation criterion for bringing the encoder output into correspondence with the output of the trained teacher encoder and the decoder output into correspondence with the output of the trained teacher decoder. The encoder and the classifier form a single neural network and share some of their layers. Each of the encoder, the classifier, and the decoder is configured using a convolutional network or a recurrent network.

パラメータ推定部３６は、観測信号のパワースペクトログラムに基づいて、各構成音が混合された観測信号を入力として、学習されたエンコーダによって分離行列により分離された各構成音について推定される潜在ベクトル系列、学習された識別器によって分離行列により分離された各構成音について識別される属性クラス、各構成音についての、前記学習されたデコーダによって生成される、構成音のスペクトログラムの分散と、スケールパラメータとから算出される、構成音のスペクトログラム、各構成音のスペクトログラムのスケールパラメータ、時間周波数領域で各構成音が混合された混合音を各構成音に分離するための分離行列、及び前記観測信号を各構成音に分離した信号を用いて表される上記式（８）式の目的関数を最適化するように、分離行列と、スケールパラメータとを推定する。 The parameter estimation unit 36 estimates the separation matrix and the scale parameter based on the power spectrogram of the observed signal, using the observed signal in which each component sound is mixed as input, so as to optimize the objective function of the above formula (8) expressed using the latent vector sequence estimated for each component sound separated by the separation matrix by the trained encoder, the attribute class identified for each component sound separated by the separation matrix by the trained classifier, the spectrogram of each component sound calculated from the variance of the spectrogram of the component sound and the scale parameter generated by the trained decoder for each component sound, the scale parameter of the spectrogram of each component sound, the separation matrix for separating the mixed sound in which each component sound is mixed in the time-frequency domain into each component sound, and the signal in which the observed signal is separated into each component sound.
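Formula (8) itself is not reproduced in this section. As a hedged illustration, the sketch below evaluates the log-likelihood that MVAE-type methods typically use under a local Gaussian source model, up to an additive constant. Here the variance of source j is v_j(f,n) = g_j σ_j²(f,n), the product of the scale parameter and the decoder output. The array layout and the function name `log_likelihood` are assumptions.

```python
import numpy as np

def log_likelihood(W, X, V):
    """Assumed objective, up to an additive constant:
        2 N sum_f log|det W(f)| - sum_{j,f,n} [ log v_j(f,n) + |y_j(f,n)|^2 / v_j(f,n) ]
    W: (F, J, J) separation matrices, X: (F, N, J) observed STFT,
    V: (J, F, N) source variances g_j * sigma_j^2(f, n)."""
    F, N, J = X.shape
    Y = np.einsum("fjk,fnk->fnj", W, X)            # separated signals y(f,n) = W(f) x(f,n)
    P = np.transpose(np.abs(Y) ** 2, (2, 0, 1))    # separated power, reordered to (J, F, N)
    _, logabsdet = np.linalg.slogdet(W)            # log|det W(f)| for each frequency
    return 2.0 * N * np.sum(logabsdet) - np.sum(np.log(V) + P / V)

# Tiny check: identity separation, unit-power observations, unit variances
F, N, J = 2, 3, 2
W_id = np.tile(np.eye(J, dtype=complex), (F, 1, 1))
X_ones = np.ones((F, N, J), dtype=complex)
value = log_likelihood(W_id, X_ones, np.ones((J, F, N)))
```

The log-determinant term keeps the separation matrices from collapsing toward zero, while the second term rewards separated signals whose power matches the variances produced by the decoder and scale parameters.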

具体的には、パラメータ推定部３６は、初期値設定部４０、分離行列更新部４２、潜在変数クラス更新部４４、スケールパラメータ更新部４６、及び収束判定部４８を備えている。 Specifically, the parameter estimation unit 36 includes an initial value setting unit 40, a separation matrix update unit 42, a latent variable class update unit 44, a scale parameter update unit 46, and a convergence determination unit 48.

The initial value setting unit 40 sets initial values for the separation matrix, the latent vector sequence of each constituent sound, the attribute class of each constituent sound, and the scale parameter of each constituent sound.

分離行列更新部４２は、観測信号のパワースペクトログラムと、前回更新された、又は初期値が設定された、各構成音の潜在ベクトル系列、各構成音の属性クラス、各構成音のスケールパラメータ、及び分離行列とに基づいて、上記式（８）に示す目的関数を大きくするように、上記式（１３）、（１４）に従って、分離行列を更新する。 The separation matrix update unit 42 updates the separation matrix according to the above formulas (13) and (14) based on the power spectrogram of the observed signal, and the previously updated or initialized latent vector series of each constituent sound, the attribute class of each constituent sound, the scale parameter of each constituent sound, and the separation matrix, so as to increase the objective function shown in the above formula (8).

The latent variable class update unit 44 updates the latent vector sequence of each constituent sound to the one obtained from the mean of the Gaussian distribution output by the encoder when the power spectrogram of that constituent sound, obtained using the observed signal and the separation matrix, is its input. It also updates the attribute class of each constituent sound using the output of the classifier for the same power spectrogram.

スケールパラメータ更新部４６は、観測信号のパワースペクトログラムと、更新された、各構成音の潜在ベクトル系列、各構成音の属性クラス、各構成音のスケールパラメータ、及び分離行列とに基づいて、上記式（８）に示す目的関数を大きくするように、上記式（１５）に従って、スケールパラメータを更新する。 The scale parameter update unit 46 updates the scale parameters according to the above formula (15) based on the power spectrogram of the observed signal, the updated latent vector series of each constituent sound, the attribute class of each constituent sound, the scale parameters of each constituent sound, and the separation matrix, so as to increase the objective function shown in the above formula (8).

収束判定部４８は、収束条件を満たすか否かを判定し、収束条件を満たすまで、分離行列更新部４２における更新処理と、潜在変数クラス更新部４４における更新処理と、スケールパラメータ更新部４６における更新処理とを繰り返させる。 The convergence determination unit 48 determines whether the convergence condition is satisfied, and repeats the update process in the separation matrix update unit 42, the update process in the latent variable class update unit 44, and the update process in the scale parameter update unit 46 until the convergence condition is satisfied.

収束条件としては、例えば、繰り返し回数が、上限回数に到達したことを用いることができる。あるいは、収束条件として、上記式（８）の目的関数の値と前回の目的関数の値との差分が、予め定められた閾値以下であることを用いることができる。 The convergence condition can be, for example, that the number of iterations has reached an upper limit. Alternatively, the convergence condition can be that the difference between the value of the objective function in formula (8) above and the value of the previous objective function is equal to or less than a predetermined threshold.
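The two convergence conditions can be combined in a single loop, as in the minimal sketch below. The callables `step` and `objective`, the default limit of 60 iterations, and the threshold value are hypothetical; the toy problem at the end only exercises the control flow and has no relation to source separation.

```python
def run_until_converged(step, objective, max_iter=60, tol=1e-6):
    """Call step() repeatedly until the objective changes by at most tol
    or max_iter iterations have been performed.
    Returns (number of iterations performed, final objective value)."""
    prev = objective()
    for it in range(1, max_iter + 1):
        step()
        cur = objective()
        if abs(cur - prev) <= tol:      # difference from the previous value is small
            return it, cur
        prev = cur
    return max_iter, prev               # upper limit on the iteration count reached

# Toy example: each step halves the distance between x and 3
state = {"x": 0.0}

def toy_step():
    state["x"] += 0.5 * (3.0 - state["x"])

def toy_objective():
    return -(state["x"] - 3.0) ** 2

n_iter, final_value = run_until_converged(toy_step, toy_objective)
```

In the device, `step` would correspond to one pass through the three update units and `objective` to the value of formula (8).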

出力部３８は、パラメータ推定部３６において取得した、各構成音の潜在ベクトル系列、各構成音の属性クラス、及び各構成音のスケールパラメータに基づいて、デコーダを用いて生成される各構成音のパワースペクトログラムを求め、各構成音のパワースペクトログラムから、各構成音の信号を生成して出力する。 The output unit 38 obtains a power spectrogram of each constituent sound generated using a decoder based on the latent vector series of each constituent sound, the attribute class of each constituent sound, and the scale parameter of each constituent sound acquired by the parameter estimation unit 36, and generates and outputs a signal of each constituent sound from the power spectrogram of each constituent sound.

＜本実施形態に係る信号解析装置の作用＞
次に、本実施形態に係る信号解析装置１００の作用について説明する。 <Function of the signal analysis device according to the present embodiment>
Next, the operation of the signal analyzing device 100 according to this embodiment will be described.

Figure 6 is a flowchart showing the flow of the learning process performed by the signal analysis device 100. The learning process is performed by the CPU 11 reading the learning program from the ROM 12 or the storage 14, loading it into the RAM 13, and executing it. As learning data, the signal analysis device 100 receives, for each of a plurality of constituent sounds, time-series data of the signal of that constituent sound and an attribute class indicating the attribute of that signal.

まず、ステップＳ１００において、ＣＰＵ１１が、時間周波数展開部２４として、構成音毎に、当該構成音の信号の時系列データに基づいて、各時刻のスペクトルを表すパワースペクトログラムを計算する。 First, in step S100, the CPU 11, functioning as the time-frequency expansion unit 24, calculates a power spectrogram representing the spectrum at each time for each constituent sound based on the time series data of the signal of that constituent sound.

次のステップＳ１０２では、ＣＰＵ１１が、教師学習部３０として、学習データとして入力された各構成音についてのスペクトログラム及び属性クラスに基づいて、音のスペクトログラム及び属性クラスを入力として潜在ベクトル系列を推定する教師用エンコーダ、及び潜在ベクトル系列及び属性クラスを入力として音のスペクトログラムの分散を生成する教師用デコーダを学習する。 In the next step S102, the CPU 11, as the teacher learning unit 30, learns a teacher encoder that estimates a latent vector sequence using the spectrogram and attribute class of the sound as input, based on the spectrogram and attribute class for each constituent sound input as learning data, and a teacher decoder that generates the variance of the sound spectrogram using the latent vector sequence and attribute class as input.

ステップＳ１０４では、ＣＰＵ１１が、学習部３２として、学習データとして入力された各構成音についてのスペクトログラム及び属性クラスに基づいて、音のスペクトログラムを入力として潜在ベクトル系列を推定するエンコーダと、音のスペクトログラムを入力として属性クラスを識別する識別器と、潜在ベクトル系列及び属性クラスを入力として音のスペクトログラムの分散を生成するデコーダと、を学習し、学習したエンコーダ、識別器、及びデコーダのパラメータを、音源信号モデル記憶部３４に格納する。 In step S104, the CPU 11, as the learning unit 32, learns an encoder that estimates a latent vector sequence using the spectrogram of the sound as input based on the spectrogram and attribute class for each constituent sound input as learning data, a discriminator that identifies the attribute class using the spectrogram of the sound as input, and a decoder that generates the variance of the sound spectrogram using the latent vector sequence and attribute class as input, and stores the trained parameters of the encoder, discriminator, and decoder in the sound source signal model storage unit 34.

図７は、信号解析装置１００による信号解析処理の流れを示すフローチャートである。ＣＰＵ１１がＲＯＭ１２又はストレージ１４から信号解析プログラムを読み出して、ＲＡＭ１３に展開して実行することにより、信号解析処理が行なわれる。また、信号解析装置１００に、各構成音が混在した観測信号の時系列データが入力される。 Figure 7 is a flowchart showing the flow of signal analysis processing by the signal analysis device 100. The CPU 11 reads out a signal analysis program from the ROM 12 or storage 14, deploys it in the RAM 13, and executes it to perform signal analysis processing. In addition, time-series data of an observed signal in which each component sound is mixed is input to the signal analysis device 100.

まず、ステップＳ１２０において、ＣＰＵ１１が、時間周波数展開部２４として、観測信号の時系列データに基づいて、各時刻のスペクトルを表すパワースペクトログラムを計算する。 First, in step S120, the CPU 11, functioning as the time-frequency expansion unit 24, calculates a power spectrogram representing the spectrum at each time based on the time series data of the observed signal.

In step S122, the CPU 11, as the initial value setting unit 40, sets initial values for the separation matrix, the latent vector sequence of each constituent sound, the attribute class of each constituent sound, and the scale parameter of each constituent sound.

ステップＳ１２４では、ＣＰＵ１１が、分離行列更新部４２として、上記ステップＳ１２０で計算された観測信号のパワースペクトログラムと、前回更新された、又は初期値が設定された、各構成音の潜在ベクトル系列、各構成音の属性クラス、各構成音のスケールパラメータ、及び分離行列とに基づいて、上記式（８）に示す目的関数を大きくするように、上記式（１３）、（１４）に従って、分離行列を更新する。 In step S124, the CPU 11, as the separation matrix update unit 42, updates the separation matrix in accordance with the above formulas (13) and (14) based on the power spectrogram of the observed signal calculated in the above step S120, and the previously updated or initialized latent vector series of each constituent sound, the attribute class of each constituent sound, the scale parameter of each constituent sound, and the separation matrix, so as to increase the objective function shown in the above formula (8).

In step S126, the CPU 11, as the latent variable class update unit 44, updates the latent vector sequence of each constituent sound to the one obtained from the mean of the Gaussian distribution output by the encoder when the power spectrogram of that constituent sound, obtained using the observed signal and the separation matrix, is its input. It also updates the attribute class of each constituent sound using the output of the classifier for the same power spectrogram.

ステップＳ１２８では、ＣＰＵ１１が、スケールパラメータ更新部４６として、上記ステップＳ１２０で計算された観測信号のパワースペクトログラムと、更新された、各構成音の潜在ベクトル系列、各構成音の属性クラス、各構成音のスケールパラメータ、及び分離行列とに基づいて、上記式（８）に示す目的関数を大きくするように、上記式（１５）に従って、スケールパラメータを更新する。 In step S128, the CPU 11, as the scale parameter update unit 46, updates the scale parameters according to the above formula (15) based on the power spectrogram of the observed signal calculated in the above step S120, the updated latent vector series of each constituent sound, the attribute class of each constituent sound, the scale parameters of each constituent sound, and the separation matrix, so as to increase the objective function shown in the above formula (8).

次に、ステップＳ１３０では、収束条件を満たすか否かを判定する。収束条件を満たした場合には、ステップＳ１３２へ移行し、収束条件を満たしていない場合には、ステップＳ１２４へ移行し、ステップＳ１２４～ステップＳ１２８の処理を繰り返す。 Next, in step S130, it is determined whether the convergence condition is satisfied. If the convergence condition is satisfied, the process proceeds to step S132. If the convergence condition is not satisfied, the process proceeds to step S124, and the processes of steps S124 to S128 are repeated.

In step S132, a power spectrogram of each constituent sound is generated using the decoder, based on the latent vector sequence, attribute class, and scale parameter of each constituent sound as finally updated in steps S124 to S128 above. A signal of each constituent sound is then generated from its power spectrogram and output from the output unit 38, and the signal analysis process ends.

＜実験結果＞
本実施形態の手法による音源分離性能を検証するため、ＶｏｉｃｅＣｏｎｖｅｒｓｉｏｎＣｈａｌｌｅｎｇｅ（ＶＣＣ）２０１８音声データベースを用いた話者依存の分離実験とＷＳＪ０音声データベースを用いた任意話者の分離実験を行った。比較対象は、非特許文献１に記載のＩＬＲＭＡ、非特許文献２に記載のＭＶＡＥ法、非特許文献３に記載のＦａｓｔＭＶＡＥ法とし、評価規準としてｓｏｕｒｃｅ－ｔｏｄｉｓｔｏｒｔｉｏｎｓｒａｔｉｏ（ＳＤＲ）、ｓｏｕｒｃｅ－ｔｏ－ｉｎｔｅｒｆｅｒｅｎｃｅｓｒａｔｉｏ（ＳＩＲ）とｓｏｕｒｃｅｓ－ｔｏ－ａｒｔｉｆａｃｔｓｒａｔｉｏ（ＳＡＲ）を用いた。すべての手法においては分離行列Ｗ（ｆ）を単位行列に初期化し、６０回更新を行った。 <Experimental Results>
To verify the source separation performance of the method of this embodiment, a speaker-dependent separation experiment using the Voice Conversion Challenge (VCC) 2018 speech database and a speaker-independent separation experiment using the WSJ0 speech database were conducted. The methods compared were ILRMA described in Non-Patent Document 1, the MVAE method described in Non-Patent Document 2, and the FastMVAE method described in Non-Patent Document 3. The source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR) were used as evaluation criteria. For all methods, the separation matrix W(f) was initialized to the identity matrix and updated 60 times.

The number of bases for ILRMA was set to 2. Table 1 shows the number of parameters of each model.

With ChimeraACVAE, the number of parameters was reduced to 40% of that of ACVAE. Table 2 shows the experimental results.

いずれの条件においても、本実施形態の手法（ＦａｓｔＭＶＡＥ２法）がＩＬＲＭＡとＦａｓｔＭＶＡＥ法より高い分離性能を示し、ＭＶＡＥ法との差を大幅に縮めた。 Under all conditions, the method of this embodiment (FastMVAE2 method) showed higher separation performance than the ILRMA and FastMVAE methods, significantly narrowing the gap with the MVAE method.

To evaluate the separation performance and computation time of each method with more than two sources, mixed signals with {2, 3, 6, 9} sources were created from utterances of 18 speakers in the WSJ0 speech database. The impulse responses were generated by the image method, with the wall reflection coefficient set to 0.2. Figure 8 shows the arrangement of the microphones and sound sources. Ten mixed signals were created for each condition. All processing was run on an Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz and a Tesla V100 GPU. Table 3 shows the average SDR for each condition.

また、図９に各手法の反復ごとの計算時間を示す。本実施形態の手法（ＦａｓｔＭＶＡＥ２、ＦａｓｔＭＶＡＥ２＿ＣＰＵ、ＦａｓｔＭＶＡＥ２＿ＧＰＵ）において性能改善が確認できた。また、本実施形態の手法は３音源以下の場合にＩＬＲＭＡと同等の計算時間で分離を実現でき、３音源以上の場合にＩＬＲＭＡより短い計算時間で分離を実現できることを確認した。 Figure 9 also shows the calculation time for each iteration of each method. Performance improvements were confirmed in the methods of this embodiment (FastMVAE2, FastMVAE2_CPU, FastMVAE2_GPU). It was also confirmed that the method of this embodiment can achieve separation in the same calculation time as ILRMA when there are three or fewer sound sources, and can achieve separation in a shorter calculation time than ILRMA when there are three or more sound sources.

以上説明したように、本実施形態に係る信号解析装置は、各構成音についてのスペクトログラム及び属性クラスに基づいて、音のスペクトログラムを入力として潜在ベクトル系列を推定するエンコーダと、音のスペクトログラムを入力として前記音の属性を示す属性クラスを識別する識別器と、潜在ベクトル系列及び属性クラスを入力として前記音のスペクトログラムの分散を生成するデコーダと、を学習する。そして、信号解析装置は、各構成音が混合された観測信号を入力として、学習されたエンコーダによって分離行列により分離された各構成音について推定される潜在ベクトル系列、学習された識別器によって分離行列により分離された各構成音について識別される属性クラス、各構成音についての、学習されたデコーダによって生成される、構成音のスペクトログラムの分散と、スケールパラメータとから算出される、構成音のスペクトログラム、各構成音のスペクトログラムのスケールパラメータ、時間周波数領域で各構成音が混合された混合音を各構成音に分離するための分離行列、及び観測信号を各構成音に分離した信号を用いて表される目的関数を最適化するように、分離行列と、スケールパラメータとを推定する。これにより、計算コストを抑えて、各構成音が混合した混合信号から、各構成音を精度よく分離することができる。 As described above, the signal analysis device according to the present embodiment learns an encoder that estimates a latent vector sequence using a spectrogram of a sound as an input based on the spectrogram and attribute class for each constituent sound, a classifier that identifies an attribute class indicating the attribute of the sound using the spectrogram of the sound as an input, and a decoder that generates the variance of the spectrogram of the sound using the latent vector sequence and the attribute class as an input. Then, the signal analysis device estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using an observation signal in which each constituent sound is mixed, the latent vector sequence estimated for each constituent sound separated by a separation matrix by the learned encoder, the attribute class identified for each constituent sound separated by a separation matrix by the learned classifier, the variance of the spectrogram of the constituent sound generated by the learned decoder for each constituent sound, the scale parameter of the spectrogram of each constituent sound, the separation matrix for separating the mixed sound in which each constituent sound is mixed into each constituent sound in the time-frequency domain, and the signal obtained by separating the observation signal into each constituent sound. 
This reduces computational costs and enables accurate separation of each component sound from a mixed signal in which the components are mixed.

＜変形例＞
なお、本発明は、上述した実施形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 <Modification>
The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the spirit and scope of the present invention.

例えば、観測信号のパワースペクトログラムや構成音のパワースペクトログラムを計算する場合を例に説明したが、これに限定されるものではなく、観測信号の振幅スペクトログラムや構成音の振幅スペクトログラムを計算するようにしてもよい。この場合には、学習部３２は、各構成音についての振幅スペクトログラム及び属性クラスに基づいて、音の振幅スペクトログラムを入力として潜在ベクトル系列を推定するエンコーダと、音の振幅スペクトログラムを入力として属性クラスを識別する識別器と、潜在ベクトル系列及び属性クラスを入力として音の振幅スペクトログラムの分散を生成するデコーダと、を学習する。また、パラメータ推定部３６は、観測信号を入力として、学習されたエンコーダによって推定される潜在ベクトル系列、学習された識別器によって識別される属性クラス、各構成音についての、学習されたデコーダによって生成される、構成音の振幅スペクトログラムの分散と、スケールパラメータとから算出される、構成音の振幅スペクトログラム、各構成音の振幅スペクトログラムのスケールパラメータ、時間周波数領域で各構成音が混合された混合音を各構成音に分離するための分離行列、及び観測信号を各構成音に分離した信号を用いて表される目的関数を最適化するように、分離行列と、スケールパラメータとを推定する。 For example, the case where the power spectrogram of the observed signal and the power spectrogram of the constituent sounds are calculated has been described as an example, but the present invention is not limited to this, and the amplitude spectrogram of the observed signal and the amplitude spectrogram of the constituent sounds may be calculated. In this case, the learning unit 32 learns an encoder that estimates a latent vector sequence using the amplitude spectrogram of the sound as input based on the amplitude spectrogram and attribute class of each constituent sound, a discriminator that identifies the attribute class using the amplitude spectrogram of the sound as input, and a decoder that generates the variance of the amplitude spectrogram of the sound using the latent vector sequence and the attribute class as input. 
In addition, the parameter estimation unit 36 estimates the separation matrix and scale parameters so as to optimize an objective function represented by an input of the observation signal, the latent vector sequence estimated by the trained encoder, the attribute class identified by the trained classifier, the amplitude spectrograms of the constituent sounds calculated from the variances of the amplitude spectrograms of the constituent sounds and the scale parameters generated by the trained decoder for each constituent sound, the scale parameters of the amplitude spectrograms of each constituent sound, a separation matrix for separating a mixed sound in which the constituent sounds are mixed in the time-frequency domain into each constituent sound, and a signal obtained by separating the observation signal into each constituent sound.

The order in which the parameters are updated is arbitrary and is not limited to the order described in the above embodiments.
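As a rough sketch of what an order-independent update loop can look like, the following treats each update as a callable and applies the callables in whatever order the list specifies. The function `estimate_parameters`, the dictionary layout, the tolerance test, and the toy objective are all hypothetical stand-ins; the actual updates of the separation matrix, the latent vector sequence and attribute class, and the scale parameter are those described in the embodiments.

```python
def estimate_parameters(params, update_steps, objective, tol=1e-6, max_iter=100):
    """Alternate the given update steps until the objective stops changing.

    params       : dict of current estimates (hypothetical layout)
    update_steps : list of callables params -> params, applied in list order
    objective    : callable params -> float, the quantity to be increased
    """
    previous = objective(params)
    for iteration in range(1, max_iter + 1):
        for step in update_steps:
            params = step(params)
        current = objective(params)
        # Convergence test (an assumption): small relative change in the objective.
        if abs(current - previous) <= tol * max(1.0, abs(previous)):
            return params, iteration
        previous = current
    return params, max_iter


# Toy stand-ins for two coupled parameter updates (not the patent's updates).
def toy_objective(p):
    return -(p["w"] - p["g"] / 2) ** 2 - (p["g"] - 1 - p["w"] / 2) ** 2


def toy_update_w(p):
    return {**p, "w": p["g"] / 2}  # maximizes the toy objective's first term


def toy_update_g(p):
    return {**p, "g": 1 + p["w"] / 2}  # maximizes the toy objective's second term


start = {"w": 0.0, "g": 0.0}
forward, _ = estimate_parameters(start, [toy_update_w, toy_update_g],
                                 toy_objective, tol=1e-12)
reverse, _ = estimate_parameters(start, [toy_update_g, toy_update_w],
                                 toy_objective, tol=1e-12)
```

In this toy problem both orderings approach the same stationary point, which is the property the sentence above relies on; for the real objective the ordering may still affect how quickly the iteration converges.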

The various processes that the CPU executes by reading software (programs) in the above embodiments may instead be executed by processors other than a CPU. Examples of such processors include a PLD (Programmable Logic Device) whose circuit configuration can be changed after manufacture, such as an FPGA (Field-Programmable Gate Array), and a dedicated electric circuit, which is a processor having a circuit configuration designed specifically to execute particular processing, such as an ASIC (Application Specific Integrated Circuit). The learning process and the signal analysis process may be executed by one of these processors, or by a combination of two or more processors of the same or different types (for example, a plurality of FPGAs, or a combination of a CPU and an FPGA). More specifically, the hardware structure of these processors is an electric circuit that combines circuit elements such as semiconductor elements.

In the above embodiments, the learning program and the signal analysis program are described as being stored (installed) in advance in the storage 14, but the technology is not limited to this. The programs may be provided in a form stored on a non-transitory storage medium such as a CD-ROM (Compact Disk Read Only Memory), a DVD-ROM (Digital Versatile Disk Read Only Memory), or a USB (Universal Serial Bus) memory. The programs may also be downloaded from an external device via a network.

The following additional notes are further disclosed with respect to the above embodiments.

(Additional Note 1)
A signal analysis device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor is configured to:
learn, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
taking as input an observed signal in which the constituent sounds are mixed, estimate a separation matrix and a scale parameter so as to optimize an objective function expressed using: the latent vector sequence estimated by the trained encoder for each constituent sound separated by the separation matrix; the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix; the spectrogram of each constituent sound, calculated from the scale parameter and the variance of the spectrogram generated by the trained decoder for that constituent sound; the scale parameter of the spectrogram of each constituent sound; the separation matrix for separating, in the time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds; and the signals obtained by separating the observed signal into the constituent sounds.

(Additional Note 2)
A non-transitory storage medium storing a program executable by a computer to perform a signal analysis process,
wherein the signal analysis process comprises:
learning, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
taking as input an observed signal in which the constituent sounds are mixed, estimating a separation matrix and a scale parameter so as to optimize an objective function expressed using: the latent vector sequence estimated by the trained encoder for each constituent sound separated by the separation matrix; the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix; the spectrogram of each constituent sound, calculated from the scale parameter and the variance of the spectrogram generated by the trained decoder for that constituent sound; the scale parameter of the spectrogram of each constituent sound; the separation matrix for separating, in the time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds; and the signals obtained by separating the observed signal into the constituent sounds.
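The role of the scale parameter in the notes above can be made concrete with a small sketch. It rests on two assumptions that are not restated in this passage: that the spectrogram of a constituent sound is modelled as the product g * sigma2(f, n) of a scale parameter g and the decoder-generated variance, and that each separated coefficient y(f, n) follows a zero-mean complex Gaussian with that variance. Under those assumptions the scale that maximizes the likelihood has a closed form. The names `source_log_likelihood` and `update_scale` are hypothetical, and the update rule actually used in the embodiments may differ.

```python
import math


def source_log_likelihood(y, sigma2, g):
    """Per-source log-likelihood up to constants (assumed Gaussian model).

    y      : frames x bins of complex separated coefficients
    sigma2 : frames x bins of positive decoder variances
    g      : positive scale parameter
    """
    total = 0.0
    for y_frame, s_frame in zip(y, sigma2):
        for coeff, var in zip(y_frame, s_frame):
            v = g * var  # modelled power at this time-frequency point
            total -= math.log(v) + abs(coeff) ** 2 / v
    return total


def update_scale(y, sigma2):
    """Closed-form maximizer of source_log_likelihood with respect to g."""
    ratios = [abs(coeff) ** 2 / var
              for y_frame, s_frame in zip(y, sigma2)
              for coeff, var in zip(y_frame, s_frame)]
    return sum(ratios) / len(ratios)
```

Setting the derivative of the log-likelihood with respect to g to zero gives g equal to the mean of |y(f, n)|^2 / sigma2(f, n), which is what `update_scale` returns. Any term of the full objective that depends only on the separation matrix does not affect this step.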

11 CPU
14 Storage
15 Input unit
16 Display unit
24 Time-frequency expansion unit
30 Teacher learning unit
32 Learning unit
34 Sound source signal model storage unit
36 Parameter estimation unit
38 Output unit
40 Initial value setting unit
42 Separation matrix update unit
44 Latent variable class update unit
46 Scale parameter update unit
48 Convergence determination unit
100 Signal analysis device

Claims

1. A signal analysis device comprising:
a learning unit that learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound;
a parameter estimation unit that estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating an observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound; and
a teacher learning unit that learns, based on the spectrogram and the attribute class of each constituent sound, a teacher encoder that takes the spectrogram and the attribute class of a sound as input and estimates a latent vector sequence, and a teacher decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound,
wherein the learning unit learns the encoder and the decoder so that the output of the encoder corresponds to the output of the trained teacher encoder and the output of the decoder corresponds to the output of the trained teacher decoder.

2. A signal analysis device comprising:
a learning unit that learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
a parameter estimation unit that estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating an observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound,
wherein the encoder and the classifier form an integrated neural network in which the encoder and the classifier share some layers.

3. A signal analysis device comprising:
a learning unit that learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
a parameter estimation unit that estimates a separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating an observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound,
wherein the learning unit trains the encoder, the classifier, and the decoder so as to optimize a criterion including:
a learning criterion for evaluating the output of the encoder and the output of the decoder;
mutual information between the output of the decoder and the attribute class;
a reconstruction criterion for evaluating the spectrogram generated using the output of the decoder that receives the output of the encoder and the output of the classifier as input; and
a class identification criterion for evaluating the output of the classifier that receives, as input, the spectrogram generated using the output of the decoder that receives the output of the encoder and the output of the classifier as input.

4. The signal analysis device according to any one of claims 1 to 3, wherein the parameter estimation unit comprises:
an initial value setting unit that sets initial values for the separation matrix, the latent vector sequence, the attribute class, and the scale parameter;
a separation matrix update unit that updates the separation matrix so as to increase the objective function, based on a spectrogram of the observed signal and on the latent vector sequence, the attribute class, the scale parameter, and the separation matrix as previously updated or as initially set;
a latent variable class update unit that updates the latent vector sequence using the output of the encoder that receives, as input, the spectrogram of each constituent sound obtained from the observed signal and the separation matrix, and updates the attribute class using the output of the classifier that receives the same spectrogram as input;
a scale parameter update unit that updates the scale parameter so as to increase the objective function, based on the spectrogram of the observed signal and on the updated latent vector sequence, attribute class, scale parameter, and separation matrix; and
a convergence determination unit that determines whether a predetermined convergence condition is satisfied and causes the update by the separation matrix update unit, the update by the latent variable class update unit, and the update by the scale parameter update unit to be repeated until the convergence condition is satisfied.

5. A signal analysis method in which:
a learning unit learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound;
a parameter estimation unit receives as input an observed signal in which constituent sounds are mixed, the latent vector sequence estimated by the trained encoder for each constituent sound separated by a separation matrix for separating, in a time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds, and the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix, and estimates the separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating the observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound;
a teacher learning unit further learns, based on the spectrogram and the attribute class of each constituent sound, a teacher encoder that takes the spectrogram and the attribute class of a sound as input and estimates a latent vector sequence, and a teacher decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
the learning unit learns the encoder and the decoder so that the output of the encoder corresponds to the output of the trained teacher encoder and the output of the decoder corresponds to the output of the trained teacher decoder.

6. A signal analysis method in which:
a learning unit learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
a parameter estimation unit receives as input an observed signal in which constituent sounds are mixed, the latent vector sequence estimated by the trained encoder for each constituent sound separated by a separation matrix for separating, in a time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds, and the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix, and estimates the separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating the observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound,
wherein the encoder and the classifier form an integrated neural network in which the encoder and the classifier share some layers.

7. A signal analysis method in which:
a learning unit learns, based on a spectrogram of each constituent sound and an attribute class indicating an attribute of the constituent sound, an encoder that takes only a spectrogram of a sound as input and estimates a latent vector sequence, a classifier that takes the spectrogram of the sound as input and identifies an attribute class indicating an attribute of the sound, and a decoder that takes the latent vector sequence and the attribute class as input and generates a variance of the spectrogram of the sound; and
a parameter estimation unit receives as input an observed signal in which constituent sounds are mixed, the latent vector sequence estimated by the trained encoder for each constituent sound separated by a separation matrix for separating, in a time-frequency domain, a mixed sound in which the constituent sounds are mixed into the individual constituent sounds, and the attribute class identified by the trained classifier for each constituent sound separated by the separation matrix, and estimates the separation matrix and a scale parameter so as to optimize an objective function expressed using a spectrogram of each constituent sound, the separation matrix, and a signal obtained by separating the observed signal into the constituent sounds, the spectrogram of each constituent sound being calculated from the scale parameter of that spectrogram and the variance of the spectrogram generated by the trained decoder for that constituent sound,
wherein the learning unit trains the encoder, the classifier, and the decoder so as to optimize a criterion including:
a learning criterion for evaluating the output of the encoder and the output of the decoder;
mutual information between the output of the decoder and the attribute class;
a reconstruction criterion for evaluating the spectrogram generated using the output of the decoder that receives the output of the encoder and the output of the classifier as input; and
a class identification criterion for evaluating the output of the classifier that receives, as input, the spectrogram generated using the output of the decoder that receives the output of the encoder and the output of the classifier as input.

8. A signal analysis program for causing a computer to function as each unit of the signal analysis device according to any one of claims 1 to 4.