JP7243760B2

JP7243760B2 - Audio feature compensator, method and program

Info

Publication number: JP7243760B2
Application number: JP2021096366A
Authority: JP
Inventors: Qiongqiong Wang; Koji Okabe; Takafumi Koshinaka
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2018-03-05
Filing date: 2021-06-09
Publication date: 2023-03-22
Anticipated expiration: 2038-03-05
Also published as: JP6897879B2; JP2021140188A; WO2019171415A1; JP2021510846A

Description

The present invention relates to a feature compensation apparatus, a feature compensation method, and a program for compensating feature vectors of utterances and speech into robust ones.

Speaker recognition is the recognition of a person from their voice. No two people's voices sound the same, because the shape of the vocal tract, the size of the larynx, and other parts of the voice production organs differ between individuals. Given this uniqueness of the human voice, speaker recognition is increasingly applied to telephone-based services, such as telephone banking, in which evidence of unauthorized access needs to be detected.

Speaker recognition systems can be divided into text-dependent and text-independent systems. In a text-dependent system, the recognition phrases are fixed or known in advance. In a text-independent system, there is no constraint on the words a speaker may use. Text-independent recognition has a wider range of applications and is by far the more challenging of the two tasks; it has improved consistently over the past decades.

In text-independent speaker recognition applications, the reference utterance (what is spoken in training) and the test utterance (what is spoken in actual use) may have completely different content, so the recognition system must take this phonetic mismatch into account. Performance depends heavily on the length of the speech. When a user speaks for a long time, typically one minute or more, most phonemes are assumed to be covered, and recognition accuracy is high even if the speech content differs. For short speech, however, the speaker feature vector of the utterance extracted by statistical methods is too unreliable for accurate recognition, so speaker recognition performance degrades.

In real speaker verification applications, often only short speech segments are observed during testing. In general, short speech segments of less than 10 seconds are common. It is therefore important to restore speaker feature vectors in order to improve text-independent speaker recognition with short utterances.

Patent Document 1 discloses a technique that uses a denoising autoencoder (DAE) to restore the speaker feature vector of short speech containing limited phonetic information.

As shown in FIG. 23, the DAE-based feature compensator described in Patent Document 1 first estimates the degree of acoustic diversity of an input utterance as posterior probabilities based on a speech model. Both the degree of acoustic diversity and the recognition feature vector are then provided to an input layer 401. In this specification, a "feature vector" means a set of numerical values (specific data) representing an object. The DAE-based transformation, which includes the input layer 401, one or more hidden layers 402, and an output layer 403, can produce a restored recognition feature vector at the output layer with the help of supervised training using pairs of long and short speech segments.

Non-Patent Document 1 discloses MFCCs (Mel-Frequency Cepstrum Coefficients) as acoustic features.

Patent Document 1: U.S. Patent Application Publication No. 2016/0098993

Non-Patent Document 1: Najim Dehak, Patrick J. Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-End Factor Analysis for Speaker Verification", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, May 2011

However, in Patent Document 1, only mean squared error minimization is used for DAE optimization. Such an objective function is too simple to achieve accurate performance. Moreover, with such a simple objective function, good results cannot be obtained unless the short speech is restricted to being a part of the long speech. In practice, only long speech can be used to train such a network (the short speech is cut from it), so the information in the speakers' existing short speech is wasted. The system also requires a sufficient number of speakers, each with multiple long speech recordings, for training, which may not be realistic for every application.

In view of the situation described above, an object of the present invention is to provide robust feature compensation for short speech.

An exemplary aspect of the speech feature compensation apparatus includes: training means for training a generator and a discriminator of a GAN using a first feature extracted from a short speech segment of a given speaker and a second feature extracted from a speech segment of the same speaker that is longer than the short speech segment, and for outputting trained parameters; and generation means for generating a restored feature, using the trained parameters, based on a feature extracted from input short speech.

An exemplary aspect of the speech processing method includes: training a generator and a discriminator of a GAN using a first feature extracted from a short speech segment of a given speaker and a second feature extracted from a speech segment of the same speaker that is longer than the short speech segment; outputting trained parameters; and generating a restored feature, using the trained parameters, based on a feature extracted from input short speech.

An exemplary aspect of the speech processing program causes a computer to execute: a process of training a generator and a discriminator of a GAN using a first feature extracted from a short speech segment of a given speaker and a second feature extracted from a speech segment of the same speaker that is longer than the short speech segment, and outputting trained parameters; and a process of generating a restored feature, using the trained parameters, based on a feature extracted from input short speech.

According to the present invention, the speech feature compensation apparatus, the speech feature compensation method, and the program can provide robust feature compensation for short speech.

FIG. 1 is a block diagram of the robust feature compensator of the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the contents of the short speech data storage unit.
FIG. 3 is a diagram showing an example of the contents of the long speech data storage unit.
FIG. 4 is a diagram showing an example of the contents of the generator parameter storage unit.
FIG. 5 is a diagram showing the concept of the NN architecture in the first embodiment.
FIG. 6 is a flowchart showing the operation of the robust feature compensator of the first embodiment.
FIG. 7 is a flowchart showing the operation of the training phase of the robust feature compensator of the first embodiment.
FIG. 8 is a flowchart showing the operation of the robust feature compensation phase of the robust feature compensator of the first embodiment.
FIG. 9 is a block diagram of the robust feature compensator of the second embodiment of the present invention.
FIG. 10 is a diagram showing the concept of the NN architecture in the second embodiment.
FIG. 11 is a flowchart showing the operation of the robust feature compensator of the second embodiment.
FIG. 12 is a flowchart showing the operation of the training phase of the robust feature compensator of the second embodiment.
FIG. 13 is a flowchart showing the operation of the robust feature compensation phase of the robust feature compensator of the second embodiment.
FIG. 14 is a block diagram of the robust feature compensator of the third embodiment of the present invention.
FIG. 15 is a diagram showing the concept of the NN architecture in the third embodiment.
FIG. 16 is a flowchart showing the operation of the robust feature compensator of the third embodiment.
FIG. 17 is a flowchart showing the operation of the training phase of the robust feature compensator of the third embodiment.
FIG. 18 is a flowchart showing the operation of the robust feature compensation phase of the robust feature compensator of the third embodiment.
FIG. 19 is a diagram showing a computer configuration usable in embodiments of the present invention.
FIG. 20 is a diagram showing a computer configuration usable in embodiments of the present invention.
FIG. 21 is a block diagram showing the main part of the speech feature compensation apparatus.
FIG. 22 is a block diagram showing another aspect of the speech feature compensation apparatus.
FIG. 23 is a block diagram showing the feature compensator disclosed in Patent Document 1.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The following detailed description is merely exemplary and is not intended to limit the invention or its application and uses. Furthermore, there is no intention to be bound by any theory presented in the preceding background of the invention or in the following detailed description.

Those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and clarity and are not necessarily drawn to scale. For example, the dimensions of some elements in figures illustrating an integrated circuit architecture may be exaggerated relative to other elements to help improve understanding of this and other embodiments.

In practical speaker recognition applications, text-independent speaker recognition is often used, and short speech segments (less than 10 seconds) are observed. In such cases the phonetic mismatch must be taken into account, because an unbalanced phonetic distribution makes the speaker feature vectors extracted from short speech less reliable. Performance degrades as the segment length decreases. A speaker feature restoration method is therefore needed to improve text-independent speaker recognition with short utterances.

In view of the above, the following embodiments use a Generative Adversarial Network (GAN), which includes a generator and a discriminator that improve each other during an iterative training process. The generator produces robust feature vectors for short speech through compensation.

First Embodiment
The robust feature compensator of the first embodiment can use a generator to provide a robust feature vector for a short speech segment from the raw feature vector of the short speech. That is, in this embodiment, the generator of a GAN trained with short speech and long speech can generate robust feature vectors even from short speech. The duration of long speech is longer than the duration of short speech.

<Configuration of Robust Feature Compensator>
The first embodiment of the present invention describes a robust feature compensator for feature restoration that uses the generator of a GAN.

FIG. 1 is a block diagram showing a robust feature compensator 100 of the first embodiment. The robust feature compensator 100 includes a training unit 100A and a feature restoration unit 100B.

The training unit 100A includes a short speech data storage unit 101, a long speech data storage unit 102, feature extraction units 103a and 103b, a noise storage unit 104, a generator/discriminator training unit 105, and a generator parameter storage unit 106. The feature restoration unit 100B includes a feature extraction unit 103c, a generator 107, and a generated feature storage unit 108. The feature extraction units 103a, 103b, and 103c have the same function.

The short speech data storage unit 101 stores short speech recordings with speaker labels, as shown in FIG. 2.

The long speech data storage unit 102 stores long speech recordings with speaker labels, as shown in FIG. 3. The long speech data storage unit 102 contains at least one long speech recording for each speaker who has a short speech recording in the short speech data storage unit 101.
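The coverage requirement between the two storage units can be checked mechanically. The following is a minimal sketch, assuming each storage unit is represented as a mapping from speaker label to a list of recording names; that layout, the function name, and the file names are illustrative assumptions and are not taken from FIG. 2 or FIG. 3.

```python
from typing import Dict, List, Set


def speakers_missing_long_recordings(
    short_store: Dict[str, List[str]],
    long_store: Dict[str, List[str]],
) -> Set[str]:
    """Return speakers that have short recordings but no long recording."""
    return {
        speaker
        for speaker, recordings in short_store.items()
        if recordings and not long_store.get(speaker)
    }


# Hypothetical contents of the two storage units (speaker label -> recordings).
short_store = {
    "spk_a": ["a_short_1.wav", "a_short_2.wav"],
    "spk_b": ["b_short_1.wav"],
}
long_store = {"spk_a": ["a_long_1.wav"]}

# "spk_b" violates the requirement: it has short speech but no long speech.
missing = speakers_missing_long_recordings(short_store, long_store)
```

An empty result would indicate that every speaker with short speech also has at least one long recording available for training.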

The noise storage unit 104 stores random vectors representing noise.

The generator parameter storage unit 106 stores generator parameters, as shown in FIG. 4. As can be seen from FIG. 4, the generator includes an encoder and a decoder. The parameters of both the encoder and the decoder are stored in the generator parameter storage unit 106.

The feature extraction unit 103a extracts feature vectors from the short speech data in the short speech data storage unit 101. The feature extraction unit 103b extracts feature vectors from the long speech data in the long speech data storage unit 102. A feature vector is an individually measurable property of an observation. The feature vector is, for example, an i-vector, that is, a fixed-dimensional feature vector extracted from acoustic features such as the MFCCs described in Non-Patent Document 1.

The generator/discriminator training unit 105 receives feature vectors of short speech segments from the feature extraction unit 103a, feature vectors of long speech segments from the feature extraction unit 103b, and noise from the noise storage unit 104. The generator/discriminator training unit 105 iteratively trains a generator and a discriminator (not shown in FIG. 1) so that the discriminator determines whether a feature vector is "true" (extracted from long speech) or "false" (generated from a feature vector of short speech), as well as the speaker label to which the feature vector belongs. Each of the generator and the discriminator includes an input layer, one or more hidden layers, and an output layer.

In training, for the "true" case, the received long-speech feature vector is given to the input layer of the discriminator. For the "false" case, the received short-speech feature vector is given to the input layer of the generator, and the output layer of the generator serves as the input layer of the discriminator. In addition, the "true/false" label and the speaker label are given to the output layer of the discriminator. Details of these layers are described later. After training, the generator/discriminator training unit 105 stores the generator parameters in the generator parameter storage unit 106.

In the feature restoration unit 100B, the feature extraction unit 103c extracts a feature vector from short speech. Together with the feature vector, the generator 107 receives the noise stored in the noise storage unit 104 and the generator parameters stored in the generator parameter storage unit 106. The generator 107 then generates a robust restored feature.

FIG. 5 shows the concept of the architecture of the generator and the discriminator. The generator has two neural networks (NNs), an encoder NN and a decoder NN. The discriminator has one NN. Each NN includes three types of layers: an input layer, a hidden layer, and an output layer. The hidden layer may consist of multiple layers. There are linear transformations and/or activation functions (transfer functions) at least between the input layer and the hidden layer, and between the hidden layer and the output layer. The input layer of the encoder NN is the feature vector of a short speech recording. The output layer of the encoder NN is the speaker factors (a feature vector). The input layer of the decoder NN is the sum or concatenation of the noise and the speaker factors from the output layer of the encoder NN. The output layer of the decoder NN is the restored feature vector. For the discriminator, the input layer is either the feature vector of long speech or the restored feature vector output by the decoder NN. The output of the discriminator is "true/false" and the speaker label.
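The data flow of FIG. 5 can be sketched as a forward pass through three small networks. The following is a minimal illustration in plain Python, assuming one hidden layer per NN, tanh activations, random untrained weights, the additive (rather than concatenated) noise variant, and toy dimensions; all of these choices and all names are assumptions made for illustration, not details given in the description.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def make_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Create a randomly initialised (untrained) fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp(x: Vector, layers: List[Layer]) -> Vector:
    """Apply each layer in turn: linear transformation, then tanh activation."""
    for weights, bias in layers:
        x = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, bias)]
    return x


def softmax(v: Vector) -> Vector:
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sizes (assumed): feature dim, hidden dim, speaker-factor dim, speakers.
FEAT_DIM, HIDDEN_DIM, FACTOR_DIM, N_SPEAKERS = 4, 5, 3, 3
rng = random.Random(0)
encoder = [make_layer(rng, FEAT_DIM, HIDDEN_DIM), make_layer(rng, HIDDEN_DIM, FACTOR_DIM)]
decoder = [make_layer(rng, FACTOR_DIM, HIDDEN_DIM), make_layer(rng, HIDDEN_DIM, FEAT_DIM)]
discriminator = [make_layer(rng, FEAT_DIM, HIDDEN_DIM),
                 make_layer(rng, HIDDEN_DIM, 2 + N_SPEAKERS)]


def generate(short_feature: Vector, noise: Vector) -> Vector:
    """Generator: encoder NN, add noise to the speaker factors, decoder NN."""
    speaker_factors = mlp(short_feature, encoder)
    decoder_input = [s + z for s, z in zip(speaker_factors, noise)]
    return mlp(decoder_input, decoder)


def discriminate(feature: Vector) -> Tuple[Vector, Vector]:
    """Discriminator: returns (true/false posterior, speaker posterior)."""
    out = mlp(feature, discriminator)
    return softmax(out[:2]), softmax(out[2:])


short_feature = [0.2, -0.1, 0.4, 0.0]
noise = [rng.gauss(0.0, 0.1) for _ in range(FACTOR_DIM)]
restored = generate(short_feature, noise)
true_false, speaker_posterior = discriminate(restored)
```

With the concatenation variant, the first decoder layer would instead take an input of size equal to the speaker-factor dimension plus the noise dimension.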

In the training unit 100A, the following are given: the input layer of the encoder NN (feature vectors of short speech recordings), part of the input layer of the decoder NN (noise), one of the two types of input layer for the discriminator (feature vectors of long speech recordings), and the output layer of the discriminator ("true/false" and speaker labels). As a result, the hidden-layer parameters of the three NNs (encoder, decoder, and discriminator), the output layer of the encoder NN (speaker factors), and the output layer of the decoder NN (restored feature vectors) are determined. For example, the numbers of layers in the encoder, decoder, and discriminator are 15, 15, and 16, respectively.

In the evaluation part of the training unit 100A, the encoder parameters, the decoder parameters, the input layer of the encoder NN (the feature vector of short speech), and part of the input layer of the decoder NN (noise) are given. As a result, the output layer of the decoder NN (the restored feature vector) is determined.

In the discriminator, the output layer consists of (2 + n) neurons, where n is the number of speakers in the training data and the 2 corresponds to "true/false". In the training unit 100A, each neuron can take the value "1" or "0", corresponding to "true/false" and to "true speaker label/false speaker label".
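As an illustration of the (2 + n)-neuron output layer, the training target for one feature vector can be built as a 0/1 vector. The sketch below assumes that the two "true/false" neurons come first and the n speaker neurons follow as a one-hot code; the description does not fix this ordering, and the function name is hypothetical.

```python
from typing import List


def discriminator_target(is_true: bool, speaker_index: int, n_speakers: int) -> List[int]:
    """Build a (2 + n)-element 0/1 target for the discriminator output layer.

    Assumed layout: [true, false, speaker_0, ..., speaker_{n-1}].
    """
    if not 0 <= speaker_index < n_speakers:
        raise ValueError("speaker_index must be in range(n_speakers)")
    true_false = [1, 0] if is_true else [0, 1]
    speaker_one_hot = [1 if i == speaker_index else 0 for i in range(n_speakers)]
    return true_false + speaker_one_hot


# A long-speech feature vector ("true") from the third of four speakers.
target_long = discriminator_target(True, 2, 4)      # [1, 0, 0, 0, 1, 0]
# A restored feature vector ("false") from the first of four speakers.
target_restored = discriminator_target(False, 0, 4)  # [0, 1, 1, 0, 0, 0]
```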

In the training unit 100A, the generator (encoder and decoder) and the discriminator are trained iteratively against each other. In each iteration, the generator parameters are updated once while the discriminator parameters are held fixed, and then the discriminator parameters are updated once while the generator parameters are held fixed. For this purpose, a wide range of optimization techniques can be applied, for example gradient descent with backpropagation, to minimize a predefined cost function such as cross-entropy or mean squared error.

For example, the objective functions can be expressed as follows.
For the generator:
[equation not reproduced in the source text]

For the discriminator:
[equation not reproduced in the source text]

Value (a) is the objective for the generator, and value (b) is the objective for the discriminator. A is the feature vector of the given short speech. B is the feature vector of the given long speech. Element (c) is noise that models variations other than the speaker. G(A, z) is the feature vector generated by the generator. Element (d) is the speaker classification result, that is, the speaker posterior probabilities. N^d is the total number of speakers in the training set. Element (e) is the "true/false" classification of the feature vector. Element (f) is the i-th element of D^d. Operators (g) and (h) are the expectation and mean squared error operators, respectively. Constant (i) is a predefined constant. y^d is the true speaker ID (ground truth).

The objective functions can also be expressed as follows.
For the generator:
[equation not reproduced in the source text]

For the discriminator:
[equation not reproduced in the source text]
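Because the equations themselves are not reproduced in this text, the following sketch only illustrates how the ingredients named in the definitions (a "true/false" classification term, a speaker classification term against the true speaker ID, a mean squared error term, and a predefined constant) might be combined into a generator objective. The specific combination, the function names, and the index convention are assumptions and should not be read as the formulas of this specification.

```python
import math
from typing import List


def cross_entropy(posterior: List[float], target_index: int) -> float:
    """Negative log-probability assigned to the target class."""
    return -math.log(max(posterior[target_index], 1e-12))


def mean_squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def generator_loss(
    true_false_posterior: List[float],  # discriminator output for G(A, z); index 0 = "true" (assumed)
    speaker_posterior: List[float],     # discriminator speaker posterior for G(A, z)
    speaker_index: int,                 # true speaker ID
    restored: List[float],              # G(A, z)
    long_feature: List[float],          # B
    weight: float,                      # predefined constant (value is an assumption)
) -> float:
    """Assumed form: adversarial term + speaker term + weight * MSE(G(A, z), B)."""
    adversarial = cross_entropy(true_false_posterior, 0)
    speaker = cross_entropy(speaker_posterior, speaker_index)
    reconstruction = mean_squared_error(restored, long_feature)
    return adversarial + speaker + weight * reconstruction


loss = generator_loss(
    true_false_posterior=[0.5, 0.5],
    speaker_posterior=[0.25, 0.75],
    speaker_index=1,
    restored=[1.0, 2.0],
    long_feature=[1.0, 4.0],
    weight=0.5,
)
```

In this toy call the loss decreases when the discriminator rates the restored vector as "true", when it assigns the correct speaker, and when the restored vector moves closer to the long-speech vector.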

<Operation of Robust Feature Compensator>
Next, the operation of the robust feature compensator 100 will be described with reference to the drawings.

The overall operation of the robust feature compensator 100 will be described with reference to FIG. 6. FIG. 6 covers the operations of both the training unit 100A and the feature restoration unit 100B. This is only an example, however; the training and feature restoration operations may be executed back to back, or a time interval may be inserted between them.

In step A01 (training part), the generator/discriminator training unit 105 iteratively trains the generator and the discriminator together, based on short speech and long speech from the same speakers stored in the short speech data storage unit 101 and the long speech data storage unit 102, respectively. Specifically, in each iteration, the discriminator parameters are first held fixed and the generator parameters are updated using the objective function. Then the generator parameters are held fixed and the discriminator parameters are updated using the objective function. Within an iteration, the order in which the generator parameters and the discriminator parameters are updated can be changed. For training, a wide range of optimization techniques can be applied, for example gradient descent with backpropagation, to minimize a predefined cost function such as cross-entropy or mean squared error. The objective function used to update the generator drives the generator toward producing restored feature vectors that the discriminator cannot tell apart from true ones. Conversely, the objective function used to update the discriminator drives the discriminator toward identifying the generated feature vectors.
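The alternating schedule of step A01 can be sketched as a short loop. In this sketch the two update steps are passed in as callables and are stand-ins only (here they merely record that they ran); a real update would compute gradients of the respective objective while the other network's parameters are held fixed. The function name and the `generator_first` flag are hypothetical.

```python
from typing import Callable, List


def alternate_training(
    num_iterations: int,
    update_generator: Callable[[int], None],
    update_discriminator: Callable[[int], None],
    generator_first: bool = True,
) -> None:
    """Run one generator update and one discriminator update per iteration.

    Each callable is expected to change only its own network's parameters,
    treating the other network as fixed. The order within an iteration is
    selectable, since the description allows it to be changed.
    """
    for iteration in range(num_iterations):
        if generator_first:
            update_generator(iteration)
            update_discriminator(iteration)
        else:
            update_discriminator(iteration)
            update_generator(iteration)


# Stand-in updates that only log the order in which they are called.
log: List[str] = []
alternate_training(
    2,
    update_generator=lambda i: log.append(f"G{i}"),
    update_discriminator=lambda i: log.append(f"D{i}"),
)
# log is now ["G0", "D0", "G1", "D1"]
```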

In step A02 (feature restoration part), the generator 107 uses the generator parameters stored in the generator parameter storage unit 106 to generate, at its output layer, a restored feature vector from a given short speech utterance.

FIG. 7 is a flowchart showing that the generator and the discriminator are trained together using feature vectors of short speech and feature vectors of long speech, along with noise. FIG. 7 corresponds to the training part in FIG. 6.

First, in step B01, at the start of the training part, the feature extraction unit 103a reads short speech data with speaker labels from the short speech data storage unit 101.

In step B02, the feature extraction unit 103a extracts feature vectors from the short speech.

In step B03, the feature extraction unit 103b reads long speech data with speaker labels from the long speech data storage unit 102.

In step B04, the feature extraction unit 103b extracts feature vectors from the long speech.

In step B05, the generator/discriminator training unit 105 reads the noise data stored in the noise storage unit 104.

In step B06, the generator/discriminator training unit 105 trains the generator and the discriminator together, using the speaker-labeled feature vectors of short speech sent from the feature extraction unit 103a, the speaker-labeled feature vectors of long speech sent from the feature extraction unit 103b, and the noise.

In step B07, as a result of the training, the generator/discriminator training unit 105 produces generator parameters and discriminator parameters, and stores the generator parameters in the generator parameter storage unit 106.

The order of steps B01 to B02 and steps B03 to B04 is not limited to that shown in FIG. 7; the two pairs can be swapped.

FIG. 8 is a flowchart showing the operation of the feature restoration unit 100B.

First, in step C01, the feature extraction unit 103c reads short speech data provided via an external device (not shown in FIG. 1).

In step C02, the feature extraction unit 103c extracts a feature vector from the given short speech data.

In step C03, the generator 107 reads the noise data stored in the noise storage unit 104.

In step C04, the generator 107 reads the generator parameters from the generator parameter storage unit 106.

In step C05, the generator 107 restores the feature vector of the short speech and produces a robust feature vector.

Note that the order of steps C03 and C04 can be swapped.
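Steps C01 to C05 can be sketched as a small pipeline function. The feature extractor and the generator below are deliberately trivial stand-ins (a per-dimension mean over frames, and a scale-plus-noise mapping) chosen only so the sketch runs; they are not the i-vector extraction or the trained encoder/decoder of this embodiment, and every name and data layout here is an assumption.

```python
from typing import Callable, Dict, List

Vector = List[float]


def restore_feature(
    short_speech: List[Vector],
    extract_features: Callable[[List[Vector]], Vector],
    noise_store: List[Vector],
    generator_parameter_store: Dict[str, float],
    run_generator: Callable[[Vector, Vector, Dict[str, float]], Vector],
) -> Vector:
    feature = extract_features(short_speech)       # C01-C02: read input, extract feature
    noise = noise_store[0]                         # C03: read noise
    parameters = generator_parameter_store         # C04: read generator parameters
    return run_generator(feature, noise, parameters)  # C05: produce restored feature


def mean_over_frames(frames: List[Vector]) -> Vector:
    """Stand-in extractor: average each dimension over all frames."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def toy_generator(feature: Vector, noise: Vector, parameters: Dict[str, float]) -> Vector:
    """Stand-in generator: scale the feature and add noise."""
    return [parameters["scale"] * f + z for f, z in zip(feature, noise)]


restored = restore_feature(
    short_speech=[[1.0, 3.0], [3.0, 5.0]],
    extract_features=mean_over_frames,
    noise_store=[[0.5, -0.5]],
    generator_parameter_store={"scale": 2.0},
    run_generator=toy_generator,
)
# restored == [4.5, 7.5]
```

Since steps C03 and C04 are independent reads, swapping the two corresponding lines in `restore_feature` would not change the result.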

Effects of First Embodiment
As described above, the first embodiment can improve the robustness of feature vectors of short speech. The reason is that joint training of the generator and the discriminator improves the performance of each, and the relationship between the feature vectors of long speech and those of short speech is learned during training. As a result, such an NN can generate feature vectors for short speech that are as robust as the features of long speech.

第２の実施形態 Second Embodiment
第２の実施形態のロバストな特徴補償装置は、エンコーダを使用して、短時間音声の未加工の特徴から短い音声区間に対してロバストな特徴を提供することができる。すなわち、この実施形態では、エンコーダ（短時間音声と長時間音声でトレーニングされたＧＡＮの生成器の一部）は、短時間音声に対してロバストな話者特徴ベクトルを生成することができる。 The robust feature compensator of the second embodiment can use an encoder to provide, from the raw features of short-time speech, features that are robust for short speech segments. That is, in this embodiment, the encoder (part of the generator of a GAN trained on short-time and long-time speech) can generate speaker feature vectors that are robust for short-time speech.

＜ロバストな特徴補償装置の構成＞ <Configuration of Robust Feature Compensator>
本発明の第２の実施形態では、ＧＡＮの生成器および識別器を使用する話者特徴抽出のためのロバストな特徴補償装置が説明される。 In the second embodiment of the present invention, a robust feature compensator for speaker feature extraction using the generator and discriminator of a GAN is described.

図９は、第２の実施形態のロバストな特徴補償装置２００のブロック図を示す。ロバストな特徴補償装置２００は、トレーニング部２００Ａと話者特徴抽出部２００Ｂとを含む。 FIG. 9 shows a block diagram of the robust feature compensator 200 of the second embodiment. The robust feature compensator 200 includes a training section 200A and a speaker feature extraction section 200B.

トレーニング部２００Ａは、短時間音声データ記憶部２０１、長時間音声データ記憶部２０２、特徴抽出部２０３ａ，２０３ｂ、ノイズ記憶部２０４、生成器・識別器トレーニング部２０５、およびエンコーダパラメータ記憶部２０６を含む。話者特徴抽出部２００Ｂは、特徴抽出部２０３ｃ、生成手段としてのエンコード部２０７、および生成特徴記憶部２０８を備える。特徴抽出部２０３ａ，２０３ｂ，２０３ｃは、同様の機能を有する。 The training unit 200A includes a short-time speech data storage unit 201, a long-time speech data storage unit 202, feature extraction units 203a and 203b, a noise storage unit 204, a generator/discriminator training unit 205, and an encoder parameter storage unit 206. The speaker feature extraction unit 200B includes a feature extraction unit 203c, an encoding unit 207 serving as generation means, and a generated feature storage unit 208. The feature extraction units 203a, 203b, and 203c have similar functions.

短時間音声データ記憶部２０１は、図２に示すように、話者ラベルを有する短時間音声記録を格納する。 The short-time speech data storage unit 201 stores short-time speech recordings with speaker labels, as shown in FIG. 2.

長時間音声データ記憶部２０２は、図３に示すように、話者ラベルを有する長時間音声記録を記憶する。長時間音声データ記憶部２０２は、短時間音声データ記憶部２０１に短時間音声記録が含まれる各話者についての少なくとも１つの長時間音声記録を含む。 The long-time speech data storage unit 202 stores long-time speech recordings with speaker labels, as shown in FIG. 3. The long-time speech data storage unit 202 contains at least one long-time speech recording for each speaker who has a short-time speech recording in the short-time speech data storage unit 201.
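
The requirement that every speaker in the short-time store also has at least one long-time recording can be checked when the training pairs are assembled. The following Python sketch illustrates one way to do so; the record layout `(speaker_label, recording_id)` and the function name `pair_by_speaker` are assumptions made for this example and do not come from the patent.

```python
from collections import defaultdict


def pair_by_speaker(short_records, long_records):
    """Pair each short recording with every long recording of the same speaker.

    Each record is a (speaker_label, recording_id) tuple (an assumed layout).
    """
    long_by_speaker = defaultdict(list)
    for speaker, rec_id in long_records:
        long_by_speaker[speaker].append(rec_id)

    pairs = []
    for speaker, short_id in short_records:
        if speaker not in long_by_speaker:
            # The text requires at least one long recording per such speaker.
            raise ValueError(f"no long recording for speaker {speaker!r}")
        for long_id in long_by_speaker[speaker]:
            pairs.append((speaker, short_id, long_id))
    return pairs


pairs = pair_by_speaker(
    short_records=[("A", "s1"), ("B", "s2")],
    long_records=[("A", "l1"), ("A", "l2"), ("B", "l3")],
)
```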

ノイズ記憶部２０４は、ノイズを表すランダムなベクトルを記憶する。 The noise storage unit 204 stores random vectors representing noise.

エンコーダパラメータ記憶部２０６は、エンコーダパラメータを格納する。各エンコーダパラメータは、生成器・識別器トレーニング部２０５の結果の一部である。生成器（図９において図示せず）は、図４から理解されうる第１の実施形態と同様に、エンコーダとデコーダとで構成されている。 The encoder parameter storage unit 206 stores the encoder parameters. The encoder parameters are part of the training result produced by the generator/discriminator training unit 205. The generator (not shown in FIG. 9) consists of an encoder and a decoder, as in the first embodiment, which can be understood from FIG. 4.

特徴抽出部２０３ａは、短時間音声データ記憶部２０１の短時間音声から特徴を抽出する。特徴抽出部２０３ｂは、長時間音声データ記憶部２０２の長時間音声から特徴を抽出する。特徴は、個別に測定可能な観測値の特性である。特徴は、たとえば、ｉ－ｖｅｃｔｏｒすなわちＭＦＣＣなどの音響特徴から抽出された固定次元の特徴ベクトルである。 The feature extraction unit 203a extracts features from the short-time speech in the short-time speech data storage unit 201. The feature extraction unit 203b extracts features from the long-time speech in the long-time speech data storage unit 202. A feature is an individually measurable property of an observation. A feature is, for example, an i-vector, that is, a fixed-dimensional feature vector extracted from acoustic features such as MFCCs.

生成器・識別器トレーニング部２０５は、特徴抽出部２０３ａから短時間音声の特徴ベクトルを受け取り、特徴抽出部２０３ｂから長時間音声の特徴ベクトルを受け取り、ノイズ記憶部２０４からノイズを受け取る。生成器・識別器トレーニング部２０５は、真（特徴ベクトルは長時間音声から抽出される。）または偽（特徴ベクトルは短時間音声からの特徴ベクトルを基に生成される。）、および特徴ベクトルが属している話者ラベルを決定するために、生成器と識別器（図９において図示せず）とを繰り返しトレーニングする。トレーニングの詳細は、第１の実施形態において示されている。トレーニングの後、生成器・識別器トレーニング部２０５は、生成器パラメータおよび識別器パラメータを出力し、それらをエンコーダパラメータ記憶部２０６に格納する。 The generator/discriminator training unit 205 receives the feature vectors of short-time speech from the feature extraction unit 203a, the feature vectors of long-time speech from the feature extraction unit 203b, and noise from the noise storage unit 204. The generator/discriminator training unit 205 iteratively trains the generator and the discriminator (not shown in FIG. 9) so that the discriminator determines whether a feature vector is true (extracted from long-time speech) or false (generated from a feature vector of short-time speech), and determines the speaker label to which the feature vector belongs. The details of training are given in the first embodiment. After training, the generator/discriminator training unit 205 outputs the generator parameters and the discriminator parameters and stores them in the encoder parameter storage unit 206.

話者特徴抽出部２００Ｂにおいて、特徴抽出部２０３ｃは、短時間音声から特徴ベクトルを抽出する。エンコード部２０７は、特徴ベクトルとともに、ノイズ記憶部２０４に記憶されているノイズおよびエンコーダパラメータ記憶部２０６に記憶されているエンコーダパラメータを受け取る。エンコード部２０７は、ロバストな話者特徴をコード化（encode）する。 In the speaker feature extraction section 200B, the feature extraction section 203c extracts a feature vector from the short-time speech. The encoding unit 207 receives the noise stored in the noise storage unit 204 and the encoder parameters stored in the encoder parameter storage unit 206 together with the feature vector. Encoder 207 encodes the robust speaker features.

図１０には、第２の実施形態の生成器および識別器のアーキテクチャの概念が示されている。生成器は、２つのＮＮ（エンコーダＮＮとデコーダＮＮ）を有し、識別器は、１つのＮＮを有する。各ＮＮは、入力層、隠れ層、出力層の３種類のレイヤを含む。隠れ層は、複数層を含んでもよい。少なくとも入力層と隠れ層の間、および隠れ層と出力層の間には、線形変換および／または活性化関数（伝達関数）がある。エンコーダＮＮの入力層は、短時間音声の特徴ベクトルである。エンコーダＮＮの出力層は話者係数である。デコーダの入力層は、ノイズとエンコーダＮＮの出力層の話者係数との加算または連結である。デコーダの出力層は、復元された特徴ベクトルである。識別器の場合、入力層は、長時間音声の特徴ベクトルまたはデコーダＮＮの出力である復元された特徴ベクトルである。識別器の出力は、「真／偽」および話者ラベルである。 FIG. 10 shows the architectural concept of the generator and discriminator of the second embodiment. The generator has two NNs (encoder NN and decoder NN) and the discriminator has one NN. Each NN includes three types of layers: input layer, hidden layer, and output layer. The hidden layer may include multiple layers. At least between the input and hidden layers, and between the hidden and output layers, there are linear transformations and/or activation functions (transfer functions). The input layer of the encoder NN is the feature vector of short-time speech. The output layer of the encoder NN is the speaker coefficients. The input layer of the decoder is the addition or concatenation of noise and speaker coefficients in the output layer of the encoder NN. The output layer of the decoder is the reconstructed feature vector. For the discriminator, the input layer is the feature vector of long-term speech or the recovered feature vector that is the output of the decoder NN. The outputs of the classifier are "true/false" and speaker labels.
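
The layer wiring described for FIG. 10 can be illustrated with a toy forward pass in plain Python. This is a minimal sketch under stated assumptions, not the patented network: the layer sizes, the random initial weights, the `tanh` activation, and every function name are chosen only for illustration, and concatenation is used where the text allows either addition or concatenation.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, x, activation=None):
    """Linear transformation followed by an optional activation function."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


FEAT_DIM, HID_DIM, SPK_DIM, N_SPEAKERS = 8, 6, 4, 3  # toy sizes (assumed)
rng = random.Random(0)

encoder = [make_layer(FEAT_DIM, HID_DIM, rng), make_layer(HID_DIM, SPK_DIM, rng)]
# The decoder input is the concatenation of speaker coefficients and noise.
decoder = [make_layer(2 * SPK_DIM, HID_DIM, rng), make_layer(HID_DIM, FEAT_DIM, rng)]
disc_hidden = make_layer(FEAT_DIM, HID_DIM, rng)
disc_true_false = make_layer(HID_DIM, 1, rng)
disc_speaker = make_layer(HID_DIM, N_SPEAKERS, rng)


def encode(short_feat):
    h = dense(encoder[0], short_feat, math.tanh)
    return dense(encoder[1], h)  # speaker coefficients


def decode(speaker_coeffs, noise):
    h = dense(decoder[0], speaker_coeffs + noise, math.tanh)  # list concatenation
    return dense(decoder[1], h)  # restored feature vector


def discriminate(feat):
    h = dense(disc_hidden, feat, math.tanh)
    p_true = sigmoid(dense(disc_true_false, h)[0])  # "true/false" output
    p_speaker = softmax(dense(disc_speaker, h))     # speaker label output
    return p_true, p_speaker


short_feat = [rng.gauss(0, 1) for _ in range(FEAT_DIM)]
noise = [rng.gauss(0, 1) for _ in range(SPK_DIM)]
coeffs = encode(short_feat)
restored = decode(coeffs, noise)
p_true, p_speaker = discriminate(restored)
```

In the second embodiment only `encode` would be needed at extraction time, since the speaker coefficients at the encoder output are the result.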

第２実施形態のトレーニング部２００Ａは、上述した第１の実施形態のトレーニング部と同様である。 The training unit 200A of the second embodiment is the same as the training unit of the first embodiment described above.

評価部では、エンコーダパラメータとエンコーダＮＮの入力層（短時間音声の特徴ベクトル）が提供され、その結果、エンコーダＮＮの出力層（話者係数）が得られる。 In the evaluation unit, the encoder parameters and the input layer of the encoder NN (short-term speech feature vectors) are provided, resulting in the output layer of the encoder NN (speaker coefficients).

＜ロバストな特徴補償装置の動作＞ <Operation of Robust Feature Compensator>
次に、ロバストな特徴補償装置２００の動作を、図面を参照して説明する。 Next, the operation of the robust feature compensator 200 will be described with reference to the drawings.

ロバストな特徴補償装置２００の全体の動作を、図１１を参照して説明する。図１１は、トレーニング部２００Ａおよび話者特徴抽出部２００Ｂの動作を含む。ただし、これは例であって、トレーニングと話者特徴抽出の操作を連続して実行したり、時間間隔を挿入したりすることができる。 The overall operation of the robust feature compensator 200 will be described with reference to FIG. 11. FIG. 11 includes the operations of the training unit 200A and the speaker feature extraction unit 200B. However, this is only an example: the training and speaker feature extraction operations can be performed back to back, or a time interval can be inserted between them.

ステップＤ０１（トレーニング部）において、生成器・識別器トレーニング部２０５は、それぞれ、短時間音声データ記憶部２０１と長時間音声データ記憶部２０２とのそれぞれに記憶された同じ話者からの短時間音声および長時間音声に基づいて、生成器および識別器をともに繰り返しトレーニングする。詳しくは、各反復で、最初に識別器パラメータが固定され、目的関数を使用して生成器パラメータが更新される。次に、生成器パラメータが固定され、識別器パラメータが目的関数を使用して更新される。反復において、生成器パラメータと識別器パラメータとを更新する順序は変更可能である。トレーニングのために、交差エントロピーとしての、事前定義されたコスト関数を最小にするバックプロパゲーションとして知られる最急降下法、平均二乗誤差など、様々な最適化手法を適用できる。生成器の更新に使用される目的関数は、識別器が識別できない復元された特徴ベクトルを生成できるように生成器を更新できる。一方、識別器の更新における目的関数は、生成された特徴ベクトルを識別できるように識別器を更新できる。 In step D01 (training part), the generator/discriminator training unit 205 iteratively trains the generator and the discriminator together, based on short-time speech and long-time speech from the same speaker stored in the short-time speech data storage unit 201 and the long-time speech data storage unit 202, respectively. Specifically, in each iteration, the discriminator parameters are first fixed and the generator parameters are updated using an objective function. Next, the generator parameters are fixed and the discriminator parameters are updated using an objective function. Within an iteration, the order in which the generator parameters and the discriminator parameters are updated can be changed. For training, various optimization techniques can be applied, such as steepest descent, known as backpropagation, which minimizes a predefined cost function such as cross-entropy or mean squared error. The objective function used for updating the generator updates the generator so that it can generate restored feature vectors that the discriminator cannot discriminate. Conversely, the objective function used for updating the discriminator updates the discriminator so that it can discriminate the generated feature vectors.
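
The alternating update described in step D01 can be illustrated with a deliberately tiny Python example. It is a sketch under strong assumptions rather than the patented training procedure: features are scalars, the generator and discriminator each have two parameters, the speaker-label term of the objective is omitted, and gradients are estimated by finite differences instead of backpropagation. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def generate(g, short_feat):
    """Toy generator: maps a short-speech feature to a restored feature."""
    return g[0] * short_feat + g[1]


def discriminate(d, feat):
    """Toy discriminator: probability that `feat` came from long speech."""
    return sigmoid(d[0] * feat + d[1])


def generator_loss(g, d, short_feat):
    # The generator wants its output to be judged "true".
    return -math.log(discriminate(d, generate(g, short_feat)))


def discriminator_loss(g, d, short_feat, long_feat):
    # The discriminator wants long speech judged "true" and generated output "false".
    real = -math.log(discriminate(d, long_feat))
    fake = -math.log(1.0 - discriminate(d, generate(g, short_feat)))
    return real + fake


def descend(loss, params, lr=0.1, eps=1e-6):
    """One gradient-descent step using central finite differences."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        down = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((loss(up) - loss(down)) / (2 * eps))
    return [p - lr * grad for p, grad in zip(params, grads)]


def train(g, d, pairs, iterations):
    for _ in range(iterations):
        for short_feat, long_feat in pairs:
            # Discriminator parameters fixed, generator parameters updated.
            g = descend(lambda p: generator_loss(p, d, short_feat), g)
            # Generator parameters fixed, discriminator parameters updated.
            d = descend(lambda p: discriminator_loss(g, p, short_feat, long_feat), d)
    return g, d


g0, d0 = [0.5, 0.0], [1.0, 0.0]
pairs = [(1.0, 2.0)]  # (short feature, long feature) from the same speaker
g1, d1 = train(g0, d0, pairs, iterations=1)
```

Swapping the two `descend` calls inside the loop gives the alternative update order that the text permits.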

ステップＤ０２（話者特徴抽出部）では、エンコード部２０７は、エンコーダパラメータ記憶部２０６に記憶されているエンコーダパラメータを用いて、エンコーダの出力層において、与えられた短時間発話からロバストな話者特徴ベクトルをコード化する。 In step D02 (speaker feature extraction part), the encoding unit 207 uses the encoder parameters stored in the encoder parameter storage unit 206 to encode, at the output layer of the encoder, a robust speaker feature vector from the given short-time utterance.

図１２は、生成器および識別器が、ノイズとともに短時間音声の特徴ベクトルおよび長時間音声の特徴ベクトルを使用してともにトレーニングされることを示すフローチャートである。図１２は、図１１のトレーニング部を示す。 FIG. 12 is a flowchart showing that the generator and the discriminator are trained together using the feature vectors of short-time speech and of long-time speech along with noise. FIG. 12 shows the training part of FIG. 11.

まず、ステップＥ０１において、特徴抽出部２０３ａは、トレーニング部の始めとして、話者ラベル付きの短時間音声データを短時間音声データ記憶部２０１から読み出す。 First, in step E01, the feature extraction unit 203a reads short-time speech data with a speaker label from the short-time speech data storage unit 201 as the beginning of the training section.

ステップＥ０２では、特徴抽出部２０３ａは、さらに、短時間音声から特徴ベクトルを抽出する。 At step E02, the feature extraction unit 203a further extracts a feature vector from the short-time speech.

ステップＥ０３では、特徴抽出部２０３ｂは、話者ラベル付き長時間音声データを長時間音声データ記憶部２０２から読み出す。 In step E03, the feature extraction unit 203b reads long-time speech data with speaker labels from the long-time speech data storage unit 202.

ステップＥ０４では、特徴抽出部２０３ｂは、さらに、長時間音声から特徴ベクトルを抽出する。 At step E04, the feature extraction unit 203b further extracts feature vectors from the long-time speech.

ステップＥ０５では、生成器・識別器トレーニング部２０５は、ノイズ記憶部２０４に記憶されているノイズデータを読み出す。 In step E05, the generator/discriminator training unit 205 reads the noise data stored in the noise storage unit 204.

ステップＥ０６において、生成器・識別器トレーニング部２０５は、特徴抽出部２０３ａから送信された話者ラベル付きの短時間音声の特徴ベクトルおよび特徴抽出部２０３ｂから送信された話者ラベル付きの長時間音声の特徴ベクトル、ならびにノイズを使用して、生成器および識別器をともにトレーニングする。 In step E06, the generator/discriminator training unit 205 trains the generator and the discriminator together, using the feature vectors of short-time speech with speaker labels sent from the feature extraction unit 203a, the feature vectors of long-time speech with speaker labels sent from the feature extraction unit 203b, and the noise.

ステップＥ０７では、トレーニングの結果として、生成器・識別器トレーニング部２０５は、生成器および識別器をトレーニングし、エンコーダ（生成器の一部）のパラメータをエンコーダパラメータ記憶部２０６に格納する。 In step E07, as a result of training the generator and the discriminator, the generator/discriminator training unit 205 stores the parameters of the encoder (a part of the generator) in the encoder parameter storage unit 206.
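
Step E07 keeps only the encoder part of the trained generator. One way to picture this is a parameter dictionary that is split by key prefix before it is saved, as in the Python sketch below. The `encoder.`/`decoder.` key naming, the JSON file format, and the function names are assumptions made for illustration; the patent does not specify how parameters are laid out or serialized.

```python
import json
import os
import tempfile


def split_generator_parameters(generator_params):
    """Separate encoder and decoder weights by key prefix (naming is assumed)."""
    encoder = {k: v for k, v in generator_params.items() if k.startswith("encoder.")}
    decoder = {k: v for k, v in generator_params.items() if k.startswith("decoder.")}
    return encoder, decoder


def store_encoder_parameters(generator_params, path):
    """Persist only the encoder part, as step E07 describes."""
    encoder, _ = split_generator_parameters(generator_params)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(encoder, f)


def load_encoder_parameters(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy trained generator parameters (values are placeholders).
trained = {
    "encoder.w": [[0.1, 0.2], [0.3, 0.4]],
    "encoder.b": [0.0, 0.0],
    "decoder.w": [[0.5, 0.6], [0.7, 0.8]],
    "decoder.b": [0.0, 0.0],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoder_params.json")
    store_encoder_parameters(trained, path)
    loaded = load_encoder_parameters(path)
```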

Ｅ０１～Ｅ０２とＥ０３～Ｅ０４の順序は、図１２に示した形式に限らず、入れ替えることができる。 The order of E01 to E02 and E03 to E04 is not limited to the format shown in FIG. 12, and can be changed.

図１３は、話者特徴抽出部２００Ｂを示すフローチャートである。 FIG. 13 is a flow chart showing the speaker feature extraction unit 200B.

まず、特徴抽出部２０３ｃは、ステップＦ０１において、外部装置（図９において図示せず）を介して提供される短時間音声データを読み取る。 First, in step F01, the feature extraction unit 203c reads short-time speech data provided via an external device (not shown in FIG. 9).

ステップＦ０２では、特徴抽出部２０３ｃは、与えられた短時間音声データから特徴ベクトルを抽出する。 At step F02, the feature extraction unit 203c extracts a feature vector from the given short-time speech data.

ステップＦ０３では、エンコード部２０７は、ノイズ記憶部２０４に記憶されているノイズデータを読み出す。 In step F03, the encoding unit 207 reads the noise data stored in the noise storage unit 204.

ステップＦ０４では、エンコード部２０７は、エンコーダパラメータ記憶部２０６からエンコーダパラメータを読み出す。 In step F04, the encoding unit 207 reads the encoder parameters from the encoder parameter storage unit 206.

ステップＦ０５では、エンコード部２０７は、短時間音声の特徴ベクトルをコード化し、ロバストな話者特徴ベクトルを抽出する。 At step F05, the encoding unit 207 encodes the feature vector of short-time speech and extracts a robust speaker feature vector.

なお、Ｆ０３とＦ０４の順序を入れ替えることができる。 Note that the order of F03 and F04 can be switched.

第２の実施形態の効果 Effects of the second embodiment

上述したように、第２の実施形態は、短時間音声の特徴ベクトルのロバスト性を改善することができる。第１の実施形態では、ロバストな特徴の復元が行われる。同じトレーニング構造で、エンコーダの出力層でロバストな話者特徴ベクトルを同時に生成できる。話者特徴ベクトルの使用は、話者検証アプリケーションにとってより直接的である。 As described above, the second embodiment can improve the robustness of feature vectors of short-time speech. In the first embodiment, robust features are restored. With the same training structure, robust speaker feature vectors can be produced at the same time at the output layer of the encoder. Using speaker feature vectors is more direct for speaker verification applications.
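
To illustrate why speaker feature vectors are convenient for verification, the Python sketch below compares two such vectors with cosine similarity and a threshold. Both the scoring method and the threshold value are assumptions made for this example; the patent does not state how speaker feature vectors are scored.

```python
import math


def cosine_score(enrolled, test):
    """Cosine similarity between two speaker feature vectors."""
    dot = sum(a * b for a, b in zip(enrolled, test))
    norm = math.sqrt(sum(a * a for a in enrolled)) * math.sqrt(sum(b * b for b in test))
    if norm == 0.0:
        raise ValueError("speaker vectors must be non-zero")
    return dot / norm


def same_speaker(enrolled, test, threshold=0.5):
    """Accept the claim when the score reaches the (assumed) threshold."""
    return cosine_score(enrolled, test) >= threshold


accepted = same_speaker([1.0, 2.0], [2.0, 4.0])   # same direction
rejected = same_speaker([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors
```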

第３の実施形態 Third Embodiment
第３の実施形態のロバストな特徴補償装置は、生成器および識別器を使用して、短時間音声の未加工の特徴から、識別器の最後の層で生成されるボトルネック特徴ベクトルを使用して、短い音声区間にロバストな特徴を提供できる。すなわち、この実施形態では、短時間音声および長時間音声でトレーニングされたＧＡＮの生成器および識別器は、短時間音声に対してロバストなボトルネック特徴を生成することができる。 The robust feature compensator of the third embodiment uses a generator and a discriminator to provide, from the raw features of short-time speech, features that are robust for short speech segments, in the form of bottleneck feature vectors produced in the last layer of the discriminator. That is, in this embodiment, the generator and discriminator of a GAN trained on short-time and long-time speech can generate bottleneck features that are robust for short-time speech.

＜ロバストな特徴補償装置の構成＞ <Configuration of Robust Feature Compensator>
本発明の第３の実施形態では、ＧＡＮの生成器のエンコーダを使用するボトルネック特徴抽出のためのロバストな特徴補償装置が説明される。 In the third embodiment of the present invention, a robust feature compensator for bottleneck feature extraction using the encoder of a GAN generator is described.

図１４は、第３の実施形態のロバストな特徴補償装置３００のブロック図を示す。ロバストな特徴補償装置３００は、トレーニング部３００Ａと、ボトルネック特徴抽出部３００Ｂとを含む。 FIG. 14 shows a block diagram of the robust feature compensator 300 of the third embodiment. The robust feature compensator 300 includes a training section 300A and a bottleneck feature extraction section 300B.

トレーニング部３００Ａは、短時間音声データ記憶部３０１、長時間音声データ記憶部３０２、特徴抽出部３０３ａ，３０３ｂ，３０３ｃ、ノイズ記憶部３０４、生成器・識別器トレーニング部３０５、生成器パラメータ記憶部３０６、および識別器を含む。ボトルネック特徴抽出部３００Ｂは、特徴抽出部３０３ｃ、生成器３０８、およびボトルネック特徴記憶部３０９を含む。特徴抽出部３０３ａ，３０３ｂ，３０３ｃは、同様の機能を有する。 The training unit 300A includes a short-time speech data storage unit 301, a long-time speech data storage unit 302, feature extraction units 303a, 303b, and 303c, a noise storage unit 304, a generator/discriminator training unit 305, a generator parameter storage unit 306, and a discriminator. The bottleneck feature extraction unit 300B includes a feature extraction unit 303c, a generator 308, and a bottleneck feature storage unit 309. The feature extraction units 303a, 303b, and 303c have similar functions.

短時間音声データ記憶部３０１は、図２に示すように、話者ラベルを有する短時間音声記録を格納する。 The short-time speech data storage unit 301 stores short-time speech recordings with speaker labels, as shown in FIG. 2.

長時間音声データ記憶部３０２は、図３に示すように、話者ラベルを有する長時間音声記録を記憶する。長時間音声データ記憶部３０２は、短時間音声データ記憶部３０１に短時間音声記録を有する各話者の少なくとも１つの長時間音声記録を含む。 The long-time speech data storage unit 302 stores long-time speech recordings with speaker labels, as shown in FIG. 3. The long-time speech data storage unit 302 contains at least one long-time speech recording for each speaker who has a short-time speech recording in the short-time speech data storage unit 301.

ノイズ記憶部３０４は、ノイズを表すランダムなベクトルを記憶する。 The noise storage unit 304 stores random vectors representing noise.

生成器パラメータ記憶部３０６は、生成器パラメータを記憶する。生成器（図１４において図示せず）は、図４から理解されうる第１の実施形態と同様のエンコーダおよびデコーダからなる。したがって、エンコーダおよびデコーダの両方のパラメータは、生成器パラメータ記憶部３０６に記憶される。 The generator parameter storage unit 306 stores the generator parameters. The generator (not shown in FIG. 14) consists of an encoder and a decoder, as in the first embodiment, which can be understood from FIG. 4. Therefore, the parameters of both the encoder and the decoder are stored in the generator parameter storage unit 306.

識別器パラメータ記憶部３０７は、識別器（図１４において図示せず）のパラメータを記憶する。 The discriminator parameter storage unit 307 stores parameters of the discriminator (not shown in FIG. 14).

特徴抽出部３０３ａは、短時間音声データ記憶部３０１における短時間音声から特徴を抽出する。特徴抽出部３０３ｂは、長時間音声データ記憶部３０２における長時間音声から特徴を抽出する。特徴は、たとえば、ｉ－ｖｅｃｔｏｒすなわちＭＦＣＣなどの音響特徴から抽出された固定次元の特徴ベクトルである。 The feature extraction unit 303a extracts features from the short-time speech in the short-time speech data storage unit 301. The feature extraction unit 303b extracts features from the long-time speech in the long-time speech data storage unit 302. A feature is, for example, an i-vector, that is, a fixed-dimensional feature vector extracted from acoustic features such as MFCCs.

生成器・識別器トレーニング部３０５は、特徴抽出部３０３ａから短時間音声の特徴ベクトルを受け取り、特徴抽出部３０３ｂから長時間音声の特徴ベクトルを受け取り、ノイズ記憶部３０４からのノイズを受け取る。生成器・識別器トレーニング部３０５は、真（特徴ベクトルは長時間音声から抽出される。）または偽（特徴ベクトルは短時間音声からの特徴ベクトルを基に生成される。）、および特徴ベクトルが属している話者ラベルを決定するために、生成器と識別器とを繰り返しトレーニングする。トレーニングの詳細は、第１の実施形態において示されている。トレーニングの後、生成器・識別器トレーニング部３０５は、生成器パラメータおよび識別器パラメータを出力し、それらを生成器パラメータ記憶部３０６および識別器パラメータ記憶部３０７に格納する。 The generator/discriminator training unit 305 receives the feature vectors of short-time speech from the feature extraction unit 303a, the feature vectors of long-time speech from the feature extraction unit 303b, and noise from the noise storage unit 304. The generator/discriminator training unit 305 iteratively trains the generator and the discriminator so that the discriminator determines whether a feature vector is true (extracted from long-time speech) or false (generated from a feature vector of short-time speech), and determines the speaker label to which the feature vector belongs. The details of training are given in the first embodiment. After training, the generator/discriminator training unit 305 outputs the generator parameters and the discriminator parameters and stores them in the generator parameter storage unit 306 and the discriminator parameter storage unit 307, respectively.

ボトルネック特徴抽出部３００Ｂにおいて、特徴抽出部３０３ｃは、短時間音声から特徴ベクトルを抽出する。生成器３０８は、特徴ベクトルとともに、ノイズ記憶部３０４に記憶されているノイズおよび生成器パラメータ記憶部３０６に記憶されている生成器パラメータを受け取る。生成器３０８は、話者係数を表す１つ以上のロバストなボトルネック特徴を生成する。 In the bottleneck feature extraction section 300B, the feature extraction section 303c extracts a feature vector from the short-time speech. Generator 308 receives the noise stored in noise store 304 and the generator parameters stored in generator parameter store 306 along with the feature vector. Generator 308 generates one or more robust bottleneck features representing speaker coefficients.

図１５には、第３の実施形態の生成器および識別器のアーキテクチャの概念が示されている。生成器は、２つのＮＮ（エンコーダＮＮとデコーダＮＮ）を有し、識別器は、１つのＮＮを有する。各ＮＮは、入力層、隠れ層、出力層の３種類のレイヤを含む。隠れ層は、複数層を含んでもよい。少なくとも入力層と隠れ層の間、および隠れ層と出力層の間には、線形変換および／または活性化関数（伝達関数）がある。エンコーダＮＮの入力層は、短時間音声の特徴ベクトルである。エンコーダＮＮの出力層は話者係数である。デコーダの入力層は、ノイズとエンコーダＮＮの出力層の話者係数との加算または連結である。デコーダの出力層は、復元された特徴ベクトルである。識別器の場合、入力層は、長時間音声の特徴ベクトルまたはデコーダＮＮの出力である復元された特徴ベクトルである。識別器の出力は、トレーニングにおける「真／偽」および話者ラベルであり、評価部では、元の出力層が破棄され、その前の最後の層が出力層として使用される。 FIG. 15 shows the architectural concept of the generator and discriminator of the third embodiment. The generator has two NNs (an encoder NN and a decoder NN), and the discriminator has one NN. Each NN includes three types of layers: an input layer, hidden layers, and an output layer. There may be multiple hidden layers. At least between the input layer and the hidden layer, and between the hidden layer and the output layer, there is a linear transformation and/or an activation function (transfer function). The input layer of the encoder NN is the feature vector of short-time speech. The output layer of the encoder NN is the speaker coefficients. The input layer of the decoder is the addition or concatenation of the noise and the speaker coefficients from the output layer of the encoder NN. The output layer of the decoder is the restored feature vector. For the discriminator, the input layer is either the feature vector of long-time speech or the restored feature vector output by the decoder NN. In training, the outputs of the discriminator are "true/false" and the speaker label; in the evaluation part, the original output layer is discarded and the last layer before it is used as the output layer.
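
The idea of discarding the discriminator's output layer at evaluation time and reading the last hidden layer as the bottleneck feature can be sketched in Python as follows. This is a toy illustration under assumptions: the weights are placeholder values, only the "true/false" head is modeled (the speaker-label head is omitted), and the names `DISC` and `discriminator_forward` are hypothetical.

```python
import math


def dense(weights, biases, x, activation=None):
    """Linear transformation followed by an optional activation function."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


# Toy discriminator parameters (values assumed for illustration only).
DISC = {
    "hidden": ([[0.5, -0.5, 0.0], [0.0, 0.5, 0.5]], [0.0, 0.0]),  # 3 inputs -> 2 hidden units
    "output": ([[1.0, -1.0]], [0.0]),                             # "true/false" head
}


def discriminator_forward(feat, keep_output_layer=True):
    hidden = dense(*DISC["hidden"], feat, activation=math.tanh)
    if not keep_output_layer:
        # Evaluation: the last hidden layer is returned as the bottleneck feature.
        return hidden
    logit = dense(*DISC["output"], hidden)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # training: probability of "true"


restored_feat = [1.0, 1.0, 1.0]  # stands in for the decoder output
bottleneck = discriminator_forward(restored_feat, keep_output_layer=False)
p_true = discriminator_forward(restored_feat)
```

The same trained hidden-layer weights serve both modes; only the final head is skipped when bottleneck features are extracted.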

第３の実施形態のトレーニング部は、第１の実施形態のトレーニング部と同様である。 The training section of the third embodiment is similar to the training section of the first embodiment.

評価部では、エンコーダパラメータ、デコーダパラメータ、識別器パラメータ、エンコーダＮＮの入力層（短時間音声の特徴ベクトル）、デコーダＮＮ（ノイズ）の入力層の一部が設けられ、その結果、識別器ＮＮ（ボトルネック特徴ベクトル）の出力層が得られる。 In the evaluation part, the encoder parameters, the decoder parameters, the discriminator parameters, the input layer of the encoder NN (the feature vector of short-time speech), and part of the input layer of the decoder NN (the noise) are provided; as a result, the output layer of the discriminator NN (the bottleneck feature vector) is obtained.

＜ロバストな特徴補償装置の動作＞ <Operation of Robust Feature Compensator>
次に、ロバストな特徴補償装置３００の動作を、図面を参照して説明する。 Next, the operation of the robust feature compensator 300 will be described with reference to the drawings.

図１６を参照して、ロバストな特徴補償装置３００の全体の動作を説明する。図１６は、トレーニング部３００Ａおよびボトルネック特徴抽出部３００Ｂの動作を含む。ただし、これは例であり、トレーニングと特徴復元の操作を連続して実行したり、時間間隔を挿入したりできる。 The overall operation of the robust feature compensator 300 will be described with reference to FIG. 16. FIG. 16 includes the operations of the training unit 300A and the bottleneck feature extraction unit 300B. However, this is only an example: the training and feature restoration operations can be performed back to back, or a time interval can be inserted between them.

ステップＧ０１（トレーニング部）において、生成器・識別器トレーニング部３０５は、短時間音声データ記憶部３０１と長時間音声データ記憶部３０２とのそれぞれに記憶された同じ話者からの短時間音声および長時間音声に基づいて、生成器および識別器をともに繰り返しトレーニングする。詳しくは、各反復で、最初に識別器のパラメータが固定され、目的関数を使用して生成器パラメータが更新される。次に、生成器パラメータが固定され、識別器パラメータが目的関数を使用して更新される。反復において、生成器パラメータと識別器パラメータとを更新する順序は変更可能である。トレーニングのために、交差エントロピーとしての、事前定義されたコスト関数を最小にするバックプロパゲーションとして知られる最急降下法、平均二乗誤差など、様々な最適化手法を適用できる。生成器の更新に使用される目的関数は、識別器が識別できない復元された特徴ベクトルを生成できるように生成器を更新できる。一方、識別器の更新における目的関数は、生成された特徴ベクトルを識別できるように識別器を更新できる。 In step G01 (training part), the generator/discriminator training unit 305 iteratively trains the generator and the discriminator together, based on short-time speech and long-time speech from the same speaker stored in the short-time speech data storage unit 301 and the long-time speech data storage unit 302, respectively. Specifically, in each iteration, the discriminator parameters are first fixed and the generator parameters are updated using an objective function. Next, the generator parameters are fixed and the discriminator parameters are updated using an objective function. Within an iteration, the order in which the generator parameters and the discriminator parameters are updated can be changed. For training, various optimization techniques can be applied, such as steepest descent, known as backpropagation, which minimizes a predefined cost function such as cross-entropy or mean squared error. The objective function used for updating the generator updates the generator so that it can generate restored feature vectors that the discriminator cannot discriminate. Conversely, the objective function used for updating the discriminator updates the discriminator so that it can discriminate the generated feature vectors.

ステップＧ０２（ボトルネック特徴抽出部）では、生成器３０８は、生成器パラメータ記憶部３０６に記憶されている生成器パラメータを用いて、出力層において、与えられた短時間音声発話から復元特徴ベクトルを生成し、識別器に入力する。生成器３０８は、最終の隠れ層をロバストなボトルネック特徴として抽出する。 In step G02 (bottleneck feature extraction part), the generator 308 uses the generator parameters stored in the generator parameter storage unit 306 to generate, at its output layer, a restored feature vector from the given short-time speech utterance, and inputs it to the discriminator. The generator 308 extracts the last hidden layer as the robust bottleneck feature.

図１７は、生成器および識別器が、ノイズとともに短時間音声の特徴ベクトルおよび長時間音声の特徴ベクトルを使用してともにトレーニングされることを示すフローチャートである。図１７は、図１６のトレーニング部を示す。 FIG. 17 is a flowchart showing that the generator and the discriminator are trained together using the feature vectors of short-time speech and of long-time speech along with noise. FIG. 17 shows the training part of FIG. 16.

まず、ステップＨ０１において、特徴抽出部３０３ａは、トレーニング部の始めとして、話者ラベル付きの短時間音声データを短時間音声データ記憶部３０１から読み出す。 First, in step H01, the feature extraction unit 303a reads short-time speech data with a speaker label from the short-time speech data storage unit 301 as the beginning of the training section.

ステップＨ０２では、特徴抽出部３０３ａは、さらに、短時間音声データから特徴ベクトルを抽出する。 In step H02, the feature extraction unit 303a further extracts feature vectors from the short-time speech data.

ステップＨ０３では、特徴抽出部３０３ｂは、話者ラベル付き長時間音声データを長時間音声データ記憶部３０２から読み出す。 In step H03, the feature extraction unit 303b reads long-time speech data with speaker labels from the long-time speech data storage unit 302.

ステップＨ０４では、特徴抽出部３０３ｂは、さらに、長時間音声データから特徴ベクトルを抽出する。 At step H04, the feature extraction unit 303b further extracts feature vectors from the long-time speech data.

ステップＨ０５では、生成器・識別器トレーニング部３０５は、ノイズ記憶部３０４に記憶されているノイズデータを読み取る。 In step H05, the generator/discriminator training unit 305 reads the noise data stored in the noise storage unit 304.

ステップＨ０６では、生成器・識別器トレーニング部３０５は、特徴抽出部３０３ａから送信された話者ラベル付きの短時間音声の特徴ベクトルおよび特徴抽出部３０３ｂから送信された話者ラベル付きの長時間音声の特徴ベクトル、ならびにノイズを使用して、生成器および識別器をともにトレーニングする。 In step H06, the generator/discriminator training unit 305 trains the generator and the discriminator together, using the feature vectors of short-time speech with speaker labels sent from the feature extraction unit 303a, the feature vectors of long-time speech with speaker labels sent from the feature extraction unit 303b, and the noise.

ステップＨ０７では、トレーニングの結果として、生成器・識別器トレーニング部３０５は、生成器パラメータおよび識別器パラメータを生成し、それらを、生成器パラメータ記憶部３０６および識別器パラメータ記憶部３０７に格納する。 In step H07, generator/discriminator training section 305 generates a generator parameter and a discriminator parameter as a result of training, and stores them in generator parameter storage section 306 and discriminator parameter storage section 307, respectively.

Ｈ０１～Ｈ０２とＨ０３～Ｈ０４の順序は、図１７に示した形式に限らず、入れ替えることができる。 The order of H01-H02 and H03-H04 is not limited to the format shown in FIG. 17, and can be changed.

図１８は、ボトルネック特徴抽出部３００Ｂを示すフローチャートである。 FIG. 18 is a flow chart showing the bottleneck feature extraction unit 300B.

まず、ステップＩ０１において、特徴抽出部３０３ｃは、外部装置（図１４において図示せず）から提供される短時間音声データを読み取る。 First, in step I01, the feature extraction unit 303c reads short-time speech data provided from an external device (not shown in FIG. 14).

ステップＩ０２では、特徴抽出部３０３ｃは、与えられた短時間音声データから特徴ベクトルを抽出する。 At step I02, the feature extractor 303c extracts a feature vector from the given short-time speech data.

ステップＩ０３では、生成器３０８は、ノイズ記憶部３０４に記憶されているノイズデータを読み取る。 In step I03, the generator 308 reads the noise data stored in the noise storage unit 304.

ステップＩ０４では、生成器３０８は、生成器パラメータ記憶部３０６から生成器パラメータを読み取る。 In step I04, the generator 308 reads the generator parameters from the generator parameter storage unit 306.

ステップＩ０５では、生成器３０８は、識別器パラメータ記憶部３０７から識別器パラメータを読み取る。 In step I05, the generator 308 reads the discriminator parameters from the discriminator parameter storage unit 307.

なお、I０３～I０５の順序を入れ替えることができる。 Note that the order of I03 to I05 can be changed.

ステップＩ０６で、生成器３０８は、識別器ＮＮの最終層で生成されたボトルネック特徴を抽出する。 At step I06, the generator 308 extracts bottleneck features generated in the final layer of the discriminator NN.

第３の実施形態の効果 Effects of the third embodiment

以上に説明したように、第３の実施形態は、短時間音声の特徴ベクトルのロバスト性を向上させることができる。その結果、そのようなＮＮは、短時間音声の特徴ベクトルを、長時間音声の特徴と同程度にロバストに生成できる。第１の実施形態では、ロバストな特徴の復元が行われる。同じトレーニング構造を使用すると、識別器の出力層にロバストなボトルネック特徴を同時に生成できる（元の出力層「真／偽」と話者ラベルは、トレーニング部の後で破棄される）。 As described above, the third embodiment can improve the robustness of feature vectors of short-time speech. As a result, such an NN can generate feature vectors for short-time speech that are as robust as those of long-time speech. In the first embodiment, robust features are restored. With the same training structure, robust bottleneck features can be produced at the same time at the output layer of the discriminator (the original output layer, with its "true/false" decision and speaker labels, is discarded after the training part).

なお、すべての実施形態において、訓練での識別器の出力層における話者ラベルは、感情認識、言語認識などのための特徴補償の使用のために、感情ラベル、言語ラベルなどに置き換えることができる。同様に、エンコーダの出力層は、感情特徴ベクトルまたは言語特徴ベクトルを表すために変更可能である。 Note that in all embodiments, the speaker labels at the output layer of the discriminator during training can be replaced with emotion labels, language labels, and so on, so that feature compensation can be used for emotion recognition, language recognition, and the like. Similarly, the output layer of the encoder can be changed to represent emotion feature vectors or language feature vectors.

第４の実施形態 Fourth Embodiment
第４の実施形態のロバストな特徴補償装置を図１９に示す。ＧＡＮに基づく音声特徴補償装置５００は、同じ話者からの少なくとも１つの短時間音声の特徴ベクトルと少なくとも１つの長時間音声の特徴ベクトルとに基づいて、生成器および識別器パラメータを生成するようにＧＡＮモデルをトレーニングする生成器・識別器トレーニング部５０１と、短時間音声ベクトルと生成器パラメータと識別器パラメータとに基づいて、短時間音声の特徴ベクトルを補償するロバストな特徴補償部５０２とを含む。 FIG. 19 shows the robust feature compensator of the fourth embodiment. The GAN-based speech feature compensator 500 includes a generator/discriminator training unit 501, which trains a GAN model to generate generator and discriminator parameters based on at least one short-time speech feature vector and at least one long-time speech feature vector from the same speaker, and a robust feature compensation unit 502, which compensates the feature vector of short-time speech based on the short-time speech vector, the generator parameters, and the discriminator parameters.

音声特徴補償装置５００は、短時間音声に対してロバストな特徴補償を提供することができる。その理由は、短時間音声の特徴ベクトルと長時間音声の特徴ベクトルと間の関係を学習するために、短時間音声の特徴ベクトルと長時間音声の特徴ベクトルを使用して、生成器と識別器とが共同でトレーニングされ、お互いのパフォーマンスを反復的に改善するためである。 The speech feature compensator 500 can provide robust feature compensation for short-time speech. The reason is that the generator and the discriminator are trained jointly, using the feature vectors of short-time speech and of long-time speech to learn the relationship between the two, and thereby iteratively improve each other's performance.

＜情報処理装置の構成＞ <Configuration of Information Processing Device>
図２０は、本発明の実施形態のロバストな特徴補償装置を実現可能な情報処理装置９００（コンピュータ）の構成を例示する。すなわち、図２０は、上記の実施形態における各機能を実現可能なハードウェア環境を表す図１、図９、図１４、図１９に示された装置を実現可能なコンピュータ（情報処理装置）の構成を示す。 FIG. 20 illustrates the configuration of an information processing device 900 (computer) capable of implementing the robust feature compensator of the embodiments of the present invention. That is, FIG. 20 shows the configuration of a computer (information processing device) capable of implementing the devices shown in FIGS. 1, 9, 14, and 19, and represents a hardware environment in which each function of the above embodiments can be realized.

図２０に示す情報処理装置９００は、以下の要素を含む。
－ＣＰＵ（中央処理装置）９０１;
－ＲＯＭ（Read Only Memory）９０２;
－ＲＡＭ（Random Access Memory）９０３;
－ハードディスク９０４（記憶装置）;
－外部デバイスとの通信インタフェース９０５;
－ＣＤ－ＲＯＭ（Compact Disc Read Only Memory）などの記憶媒体９０７に格納されたデータを読み書きできるリーダ／ライタ９０８;
－入出力インタフェース９０９
The information processing device 900 shown in FIG. 20 includes the following elements.
- CPU (Central Processing Unit) 901;
- ROM (Read Only Memory) 902;
- RAM (Random Access Memory) 903;
- hard disk 904 (storage device);
- a communication interface 905 with external devices;
- a reader/writer 908 capable of reading and writing data stored in a storage medium 907 such as a CD-ROM (Compact Disc Read Only Memory);
- input/output interface 909;

情報処理装置９００は、バス９０６（通信線）を介してこれらが接続された一般的なコンピュータである。 The information processing device 900 is a general-purpose computer in which these elements are connected via a bus 906 (communication line).

The present invention described in the exemplary embodiments above is realized by supplying the information processing device 900 shown in FIG. 20 with a computer program capable of implementing the functions described in the block diagrams (FIGS. 1 and 9) or the flowcharts (FIGS. 6 to 8, FIGS. 11 to 13, and FIGS. 16 to 18) referred to in the description of each embodiment, and by having the CPU 901 in that hardware read, interpret, and execute the computer program. The computer program supplied to the device may be stored in a volatile readable and writable memory (RAM 903) or in a non-volatile storage device such as the hard disk 904.

In the above case, general procedures can be used to supply the computer program to such hardware. These procedures include, for example, installing the computer program on the device via any of various storage media 907 such as a CD-ROM, or downloading the program from an external source via a communication line such as the Internet. In these cases, the present invention can be regarded as being constituted by the code forming such a computer program, or by a storage medium storing that code.

It should be clear that the processes, techniques, and methodologies described and illustrated herein are not limited to or tied to any particular device. They can be implemented using a combination of components. Various types of general-purpose devices may also be used in accordance with the teachings herein. The present invention has been described using several specific examples. However, these are merely examples and are not limiting. For example, the described software can be implemented in various languages such as C/C++, Java (registered trademark), MATLAB (registered trademark), and Python. Furthermore, other implementations of the technique of the present invention will be apparent to those skilled in the art.

FIG. 21 is a block diagram showing the main part of the speech feature compensation device according to the present invention. As shown in FIG. 21, the speech feature compensation device 10 includes training means 11, feature extraction means 12, and generating means 13. The training means 11 (realized in the embodiments by the generator-discriminator training units 105, 205, and 305) trains a generator 21 and a discriminator 22 of a GAN (Generative Adversarial Network) and outputs the trained parameters of the GAN. For this training it uses a first feature vector extracted from a short speech segment and a second feature vector extracted from a long speech segment, which is longer than the short speech segment and comes from the same speaker as the short speech. The feature extraction means 12 (realized in the embodiments by the feature extraction units 103c, 203c, and 303c) extracts a feature vector from an input short-time speech. The generating means 13 (realized in the embodiments by the generators 107 and 308 or the encoding unit 207) uses the trained parameters to generate a robust feature vector based on the extracted feature vector.

As shown in FIG. 22, the generator 21 may include an encoder 211 that receives the first feature vector and outputs a feature vector, and a decoder 212 that outputs a restored feature vector. In that case, trained parameters are output for at least the encoder, and the generating means 13 may include an encoding unit that generates the robust feature vector by encoding the feature vector of the input short-time speech using the trained parameters.
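The encoder-decoder arrangement of FIG. 22 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the encoder is assumed to be a single linear layer with a tanh activation, the decoder a single linear layer, and all parameter values and vector sizes are invented for the example. The function names `encode`, `decode`, and `generator_forward` are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Apply an affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def encode(enc_params, feature_vec):
    """Encoder (cf. encoder 211): maps an input feature vector to a compact code."""
    W, b = enc_params
    return [math.tanh(v) for v in linear(W, b, feature_vec)]


def decode(dec_params, code):
    """Decoder (cf. decoder 212): maps the code back to a restored feature vector."""
    W, b = dec_params
    return linear(W, b, code)


def generator_forward(enc_params, dec_params, feature_vec):
    """Full generator pass as used during training: encode, then decode."""
    code = encode(enc_params, feature_vec)
    return code, decode(dec_params, code)


# Hypothetical "trained" parameters: 4-dimensional input, 2-dimensional code.
enc_params = ([[0.5, -0.2, 0.1, 0.0], [0.0, 0.3, -0.4, 0.2]], [0.0, 0.1])
dec_params = ([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.5]], [0.0, 0.0, 0.0, 0.0])

short_feature = [0.2, -0.1, 0.4, 0.3]  # made-up short-speech feature vector

# Training-time view: both the code and the restored vector are produced.
code, restored = generator_forward(enc_params, dec_params, short_feature)

# Inference-time view: only the encoder parameters are needed.
robust_feature = encode(enc_params, short_feature)
```

The sketch shows why it can suffice to store only the encoder parameters after training. The decoder is needed to produce the restored vector while the GAN is being trained, but the robust feature vector at inference time is the encoder output alone.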

100, 200, 300 Robust feature compensation device
101, 201, 301 Short-time speech data storage unit
102, 202, 302 Long-time speech data storage unit
103a, 203a, 303a Feature extraction unit
103b, 203b, 303b Feature extraction unit
103c, 203c, 303c Feature extraction unit
104, 204, 304 Noise storage unit
105, 205, 305 Generator-discriminator training unit
106 Generator parameter storage unit
206 Encoder parameter storage unit
306 Generator parameter storage unit
107 Generator
207 Encoding unit
307 Discriminator parameter storage unit
108, 208 Generated feature storage unit
308 Generator
309 Bottleneck feature storage unit

Claims

1. A speech feature compensator, comprising:
training means for training a generator and a discriminator of a GAN (Generative Adversarial Network) and outputting trained parameters; and
generating means for generating a restored feature amount based on a feature amount extracted from an input short-time speech, using the trained parameters.

2. The speech feature compensator according to claim 1, wherein the generating means generates a restored feature amount corresponding to the feature amount extracted from the input short-time speech.

3. A speech feature compensation method, comprising:
training a generator and a discriminator of a GAN (Generative Adversarial Network) and outputting trained parameters; and
generating a restored feature amount based on a feature amount extracted from an input short-time speech, using the trained parameters.

4. The speech feature compensation method according to claim 3, further comprising generating a restored feature amount corresponding to the feature amount extracted from the input short-time speech.

5. A speech feature compensation program for causing a computer to execute:
a process of training a generator and a discriminator of a GAN (Generative Adversarial Network) and outputting trained parameters; and
a process of generating a restored feature amount based on a feature amount extracted from an input short-time speech, using the trained parameters.