JP7680893B2

JP7680893B2 - Speech processing training program, speech processing training device, speech processing training method, speech processing program, speech processing device, and speech processing method

Info

Publication number: JP7680893B2
Application number: JP2021106955A
Authority: JP
Inventors: 圭阿久澤; 弘太郎大西; 啓介滝口; 浩輝豆谷; 紘一郎森
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2021-06-28
Filing date: 2021-06-28
Publication date: 2025-05-21
Anticipated expiration: 2041-06-28
Also published as: JP2023005191A

Description

The present invention relates to a speech processing training program, a speech processing training device, a speech processing training method, a speech processing program, a speech processing device, and a speech processing method.

Speech processing devices have been developed that convert speech uttered by an arbitrary speaker into speech having the voice quality of another speaker. For example, a technique has been disclosed that applies CycleGAN, an image conversion technique, to voice conversion (Non-Patent Document 1).

Takuhiro Kaneko and Hirokazu Kameoka, Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks, arXiv:1711.11293, Nov. 2017 (EUSIPCO 2018), http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc/

A speech processing device that synthesizes and outputs the speech of another speaker from the speech of an original speaker is required to make the voice quality and phrasing of the synthesized speech as natural as possible. However, with conventional training methods for speech processing devices, the synthesized speech could not always be made sufficiently natural.

One aspect of the present invention is a speech processing training program that causes a computer to function as a speech processing training device comprising: an acoustic feature extractor that converts speech into input acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts the input acoustic features and the speaker features into latent representations; and a speech decoder including a variational autoencoder that has two or more sampling layers and generates acoustic features using at least the latent representations and the speaker features, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

Here, it is preferable that, in the speech decoder, the layers to which the speaker features are input are limited among the two or more sampling layers.

It is also preferable that, among the two or more sampling layers, the speech decoder does not input the speaker features to layers preceding a predetermined layer, and inputs the speaker features to layers following the predetermined layer.

It is also preferable that, among the two or more sampling layers, the speech decoder samples from a posterior distribution in layers preceding the predetermined layer, and samples from a prior distribution in layers following the predetermined layer.

It is also preferable that the speech decoder inputs the speaker features to a conditional instance normalization layer.

Another aspect of the present invention is a speech processing program that causes a computer to function as a speech processing device comprising: an acoustic feature extractor that converts speech into acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts, into latent representations, source acoustic features obtained by converting the speech of a source speaker in the acoustic feature extractor and source speaker features obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder that has two or more sampling layers and generates target acoustic features using at least the latent representations and target speaker features obtained by converting the speaker label of a target speaker in the speaker encoder; and a vocoder that converts the target acoustic features generated by the speech decoder into speech.

Here, the speech encoder, the speech decoder, and the speaker encoder are preferably ones that have been trained so as to reduce the distance between the acoustic features input to the speech encoder and the acoustic features generated by the speech decoder.

It is also preferable that, in the speech decoder, the layers to which the target speaker features are input are limited among the two or more sampling layers.

It is also preferable that, among the two or more sampling layers, the speech decoder does not input the target speaker features to layers preceding a predetermined layer, and inputs the target speaker features to layers following the predetermined layer.

It is also preferable that the speech decoder inputs the target speaker features to a conditional instance normalization layer.

Another aspect of the present invention is a speech processing training device comprising: an acoustic feature extractor that converts speech into input acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts the acoustic features and the speaker features into latent representations; and a speech decoder including a variational autoencoder that has two or more sampling layers and generates acoustic features using at least the latent representations and the speaker features, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

Another aspect of the present invention is a speech processing device comprising: an acoustic feature extractor that converts speech into acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts, into latent representations, source acoustic features obtained by converting the speech of a source speaker in the acoustic feature extractor and source speaker features obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder that has two or more sampling layers and generates target acoustic features using at least the latent representations and target speaker features obtained by converting the speaker label of a target speaker in the speaker encoder; and a vocoder that converts the target acoustic features generated by the speech decoder into speech.

Another aspect of the present invention is a speech processing training method for a speech processing training device that comprises: an acoustic feature extractor that converts speech into input acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts the acoustic features and the speaker features into latent representations; and a speech decoder including a variational autoencoder that has two or more sampling layers and generates acoustic features using at least the latent representations and the speaker features, wherein the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and the output acoustic features generated by the speech decoder.

Another aspect of the present invention is a speech processing method that converts the speech of a source speaker into the speech of a target speaker using a speech processing device comprising: an acoustic feature extractor that converts speech into acoustic features; a speaker encoder that converts a speaker label of the speech into speaker features; a speech encoder including a variational autoencoder that has two or more sampling layers and converts, into latent representations, source acoustic features obtained by converting the speech of the source speaker in the acoustic feature extractor and source speaker features obtained by converting the speaker label of the source speaker in the speaker encoder; a speech decoder including a variational autoencoder that has two or more sampling layers and generates target acoustic features using at least the latent representations and target speaker features obtained by converting the speaker label of the target speaker in the speaker encoder; and a vocoder that converts the target acoustic features generated by the speech decoder into speech.

According to the present invention, it is possible to provide a speech processing training program, a speech processing training device, a speech processing training method, a speech processing program, a speech processing device, and a speech processing method that appropriately convert speech uttered by an arbitrary speaker into speech uttered by a target speaker. Other objects of the embodiments of the present invention will become apparent by referring to this specification as a whole.

FIG. 1 is a diagram showing the configuration of a speech processing device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of a speech processing training device according to the embodiment of the present invention.
FIG. 3 is a diagram showing the configuration of a variational auto-encoder.
FIG. 4 is a diagram showing the configuration of a nouveau variational auto-encoder.
FIG. 5 is a diagram showing the configuration of the neural network in each layer of the variational auto-encoder according to the embodiment of the present invention.
FIG. 6 is a diagram for explaining the speech training process according to the embodiment of the present invention.
FIG. 7 is a functional block diagram showing the configuration of a speech training device according to the embodiment of the present invention.
FIG. 8 is a diagram for explaining the speech processing according to the embodiment of the present invention.

As shown in FIG. 1, the speech processing device 100 according to the embodiment of the present invention includes a processing unit 10, a storage unit 12, an input unit 14, an output unit 16, and a communication unit 18. The processing unit 10 includes a means for performing arithmetic processing, such as a CPU. The processing unit 10 executes a speech processing training program stored in the storage unit 12 to carry out the training for speech processing in the present embodiment. The processing unit 10 also executes a speech processing program stored in the storage unit 12 to realize the functions related to speech processing in the present embodiment. The storage unit 12 includes a storage means such as a semiconductor memory or a memory card. The storage unit 12 is connected to the processing unit 10 so as to be accessible, and stores the speech processing training program, the speech processing program, and the information required for their processing. The input unit 14 includes a means for inputting information. The input unit 14 includes, for example, a keyboard, a touch panel, buttons, and the like that receive information input from a user. The input unit 14 also includes a speech input means that receives the speech of an arbitrary speaker and of a predetermined target speaker. The speech input means may include, for example, a microphone and an amplifier circuit. The output unit 16 includes a means for outputting a user interface (UI) screen for receiving input information from an administrator, as well as processing results. The output unit 16 includes, for example, a display that presents images. The output unit 16 also includes a speech output means that outputs the synthesized speech generated by the speech processing device 100. The speech output means may include, for example, a speaker and an amplifier. The communication unit 18 includes an interface for exchanging information with an external terminal (not shown) via a network 102. Communication by the communication unit 18 may be wired or wireless. The speech information subjected to speech processing may be acquired from the external terminal via the communication unit 18.

The speech processing device 100 performs speech processing that converts speech uttered by an arbitrary speaker into speech with the voice quality of a predetermined speaker (target speaker). The speech processing device 100 also functions as a speech processing training device that performs the training for this speech processing.

[Speech training process]
FIG. 2 is a functional block diagram showing the configuration of the speech processing device 100 during speech processing training. The speech processing device 100 functions as a speech analysis unit 20, a speaker encoder 22, a speech encoder 24, a speech decoder 26, and a learner 28. Specifically, by executing the speech processing training program, the speech processing device 100 functions as a speech processing training device that realizes the speech training method described below.

The speech analysis unit 20 functions as an acoustic feature extractor that acquires speech data and extracts acoustic features from it. That is, the processing unit 10 of the speech processing device 100 functions as the speech analysis unit 20. The speech data may be acquired by converting a speaker's speech into speech data using the microphone of the input unit 14. Speech data recorded in advance on an external computer or the like may also be received via the communication unit 18. The acquired speech data is stored in the storage unit 12.

The speech data acquisition process is performed on speech uttered by arbitrary speakers. In the speech training process, the speech encoder 24 and the speech decoder 26 are trained using speech from a large number of speakers. The speech obtained from each speaker does not need to have the same content.

The speech analysis unit 20 also performs the speech analysis required for speech processing. For example, the speech analysis unit 20 performs cepstrum analysis of the speech based on the frequency characteristics of the input speech, and obtains acoustic features such as mel-frequency cepstrum coefficients (MFCC), which include information on the spectral envelope (information indicating, for example, the thickness of the voice) and on the fine structure, and the fundamental frequency and resonance frequencies of the speech (information indicating, for example, the pitch and hoarseness of the voice). An acoustic feature can be represented, for example, as a point in an (80×T)-dimensional Euclidean space, where T is the length of the speech segment. Specifically, the speech analysis unit 20 generates and outputs an acoustic feature x_i from speech uttered by the speaker whose speaker ID (speaker label) is i. The acoustic features extracted by the speech analysis unit 20 are input to the speech encoder 24 and the learner 28.

The speaker encoder 22 converts the ID of the speaker of the speech input to the speech analysis unit 20 into a speaker feature that can be used for speech processing, and outputs it. The speaker encoder 22 can be configured to include an embedding module that converts the speaker ID into the speaker feature. For example, for the speaker whose speaker ID is i, the speaker encoder 22 generates and outputs a speaker feature y_i. The speaker features generated by the speaker encoder 22 are input to the speech encoder 24 and the speech decoder 26.
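The embedding module can be pictured as a lookup table with one vector per speaker ID. The following is only a minimal sketch under assumptions: the class name `SpeakerEmbedding`, the table size, the feature dimension, and the random initialization are all hypothetical and are not specified in this document.

```python
import random


class SpeakerEmbedding:
    """Hypothetical lookup table mapping a speaker ID i to a vector y_i."""

    def __init__(self, num_speakers, dim, seed=0):
        rng = random.Random(seed)
        # One row per speaker; in a real system these rows would be
        # trainable parameters updated together with the other networks.
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                      for _ in range(num_speakers)]

    def __call__(self, speaker_id):
        # Return the speaker feature y_i for speaker ID i.
        return self.table[speaker_id]


speaker_encoder = SpeakerEmbedding(num_speakers=4, dim=8)
y_2 = speaker_encoder(2)
```

The same speaker ID always yields the same vector, so the feature identifies the speaker rather than the content of any particular utterance.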

The training of the speech processing device 100 uses a set of pairs (x_i, y_i), each consisting of an acoustic feature x_i and a speaker feature y_i obtained from speech uttered by a plurality of speakers.

The speech encoder 24 receives the acoustic features and the speaker features as input and converts them into latent representations. The speech decoder 26 receives the latent representations obtained by the speech encoder 24 and the speaker features as input and converts them into acoustic features. The latent representations express the linguistic features of the input speech data.

As shown in FIG. 2, the speech encoder 24 receives the acoustic feature x_i from the speech analysis unit 20 and converts it into a latent representation z, and the speech decoder 26 reconstructs an acoustic feature x_i^ from the latent representation z. The speech encoder 24 and the speech decoder 26 are trained so that the output acoustic feature x_i^ restores the input acoustic feature x_i.

In this embodiment, the speech encoder 24 and the speech decoder 26 are configured as a variational auto-encoder (VAE). As shown in FIG. 3, the variational auto-encoder generates a latent representation by sampling based on a probability distribution. The probability distribution is assumed to be a normal distribution defined by a mean μ and a variance σ. The variational auto-encoder is a combination of an encoder that, for an input X, generates a latent representation z by sampling based on the mean μ and the variance σ, and a decoder that generates an output X^ from the latent representation z. In the variational auto-encoder, the speaker encoder 22, the speech encoder 24, and the speech decoder 26 are trained so that the reconstruction error (reconstruction distance) E between the input X and the output X^ becomes small.
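The sampling step can be illustrated with a short sketch. It assumes the common reparameterized form z = μ + σ·ε with ε drawn from a standard normal distribution, and it treats `sigma` as a standard deviation; both are assumptions made for illustration, since this document only states that sampling is based on μ and σ. The function name `sample_latent` is hypothetical.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw each element of z from a normal distribution around mu.

    Assumption: sigma is used as a standard deviation, and the draw is
    written as z = mu + sigma * eps with eps ~ N(0, 1).
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]

# With zero spread the sample collapses onto the mean.
z_deterministic = sample_latent(mu, [0.0, 0.0, 0.0], rng)

# With non-zero spread each call yields a different latent representation.
z_random = sample_latent(mu, [1.0, 1.0, 1.0], rng)
```

Writing the draw this way keeps the randomness in ε, which is the usual reason the form is chosen when μ and σ are produced by a trainable encoder.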

A typical variational auto-encoder is composed of a single-layer neural network, but in this embodiment it is preferable to use, as shown in FIG. 4, a nouveau variational auto-encoder (NVAE) composed of a neural network with two or more layers. In other words, the nouveau variational auto-encoder includes a variational autoencoder having two or more sampling layers. For example, in the speech processing device 100, it is preferable to configure each of the speech encoder 24 and the speech decoder 26 as a neural network with n = 35 layers.

As shown in FIG. 5, each layer of the nouveau variational auto-encoder in the speech encoder 24 and the speech decoder 26 is configured by combining a conditional instance normalization layer (CIN layer), a convolution layer (CONV layer), and a squeeze-and-excitation layer (SE layer). The CIN layer is provided in place of the batch normalization layer (BN layer) of a general nouveau variational auto-encoder. The CIN layer is a type of normalization layer that performs normalization with parameters set differently for each style. In this embodiment, the CIN layer takes the speaker feature as one of its inputs and performs normalization conditioned on that speaker feature. The Swish activation function is the activation function expressed as f(x) = x / (1 + e^{-βx}). The convolution layer applies a convolution operation to its input and outputs the result to the next layer. The SE layer adaptively applies attention to its input based on the relationships between channels and outputs weighted features.
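As a rough illustration of the CIN layer and the Swish function, the sketch below normalizes each channel of one utterance over time and then applies a per-channel scale and shift. It assumes that the scale and shift values have already been derived from the speaker feature; how they are derived, the argument names `scale` and `shift`, and the small constant `eps` are hypothetical and not taken from this document.

```python
import math


def swish(x, beta=1.0):
    """Swish activation f(x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + math.exp(-beta * x))


def conditional_instance_norm(channels, scale, shift, eps=1e-5):
    """Normalize each channel over time, then scale and shift it.

    channels: list of C lists, each holding the T values of one channel.
    scale, shift: length-C lists, assumed to be computed from the
    speaker feature elsewhere (not shown here).
    """
    out = []
    for values, g, b in zip(channels, scale, shift):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        out.append([g * (v - mean) / math.sqrt(var + eps) + b
                    for v in values])
    return out


features = [[1.0, 2.0, 3.0], [10.0, 10.0, 10.0]]
normed = conditional_instance_norm(features, scale=[2.0, 1.0],
                                   shift=[0.5, -1.0])
activated = [[swish(v) for v in channel] for channel in normed]
```

Because the statistics are computed per utterance and per channel, two speakers with different `scale` and `shift` values produce differently styled outputs from the same normalized content.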

The speech training process in the speech processing device 100 will be described with reference to FIG. 6. The figure shows an example in which the speech encoder 24 and the speech decoder 26 are each configured as a neural network with n layers. The number of layers n can be, for example, 35. Each layer is configured by combining the conditional instance normalization layer (CIN layer), the convolution layer (CONV layer), and the squeeze-and-excitation layer (SE layer) shown in FIG. 5. The latent representation output from layer k of the speech encoder 24 (where k is a layer number from 1 to n) is denoted by h_k, and the latent representation output from layer k of the speech decoder 26 is denoted by z_k.

In the speech encoder 24, the acoustic feature x_i and the speaker feature y_i are input to layer n, and the latent representation h_n is output. In the next layer n-1, the latent representation h_n output from the preceding layer n and the speaker feature y_i are input, and the latent representation h_{n-1} is output. Similarly, in layer k, the latent representation h_{k+1} output from the preceding layer k+1 and the speaker feature y_i are input, and the latent representation h_k is output. In layer 1, the final stage, the latent representation h_2 output from the preceding layer 2 and the speaker feature y_i are input, and the latent representation h_1 is output. The latent representation z_1 of layer 1, the first stage of the speech decoder 26, is sampled from the latent representation h_1. In this way, it is preferable for the speech encoder 24 to include the speaker feature y_i in the input of all layers 1 to n.
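The order in which the encoder layers are visited can be sketched as a simple loop from layer n down to layer 1. The layer body used here is a hypothetical toy transform standing in for the CIN, CONV, and SE blocks; only the data flow (each layer receives the previous output together with y_i) follows the description above.

```python
def make_layer(k):
    """Return a toy stand-in for encoder layer k (hypothetical)."""
    def layer(h_prev, y):
        # A real layer would apply the CIN, CONV, and SE blocks.
        return [0.5 * h + 0.1 * k * s for h, s in zip(h_prev, y)]
    return layer


def encode(x, y, n):
    """Run layers n, n-1, ..., 1 and return {k: h_k} for every layer."""
    layers = {k: make_layer(k) for k in range(1, n + 1)}
    h = {}
    current = x  # layer n receives the acoustic feature x_i
    for k in range(n, 0, -1):
        # Every layer also receives the speaker feature y_i.
        current = layers[k](current, y)
        h[k] = current
    return h


h = encode(x=[1.0, 2.0], y=[1.0, 0.0], n=3)
```

Keeping every h_k, rather than only h_1, matters because the decoder refers to the encoder output of the matching layer when it samples.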

In the speech decoder 26, the latent representation z_1 is input to layer 1, the first stage, and the latent representation z_2 is output. The latent representation z_k in layer k of the speech decoder 26 can be obtained by sampling from the prior distribution p(z_k | z_{k-1}, h_k, y_i), which is based on the latent representation z_{k-1} of the preceding layer k-1 of the speech decoder 26, the latent representation h_k of the k-th layer of the speech encoder 24, and the speaker feature y_i. The latent representation z_k can also be obtained by sampling from the posterior distribution p(z_k | z_{k-1}, z_{k-2}, ..., z_1, h_k), which is based on the latent representations z_{k-1}, z_{k-2}, ..., z_1 of the preceding layers k-1, k-2, ..., 1 of the speech decoder 26 and the latent representation h_k of the k-th layer of the speech encoder 24. Here, the distribution p(a | b) is a likelihood function indicating how likely a is to be output given b as a precondition.

In the speech training process, sampling from the speech encoder 24 is performed across the layers of the speech decoder 26, from those close to its output to those far from it. That is, as shown in FIG. 6, it is preferable to sample from the latent representation h_k of the k-th layer of the speech encoder 24 in all layers 1 to n. It is also preferable not to include the speaker feature y_i in the input when sampling from the posterior distribution.

That is, it is preferable that the speech decoder 26 includes the speaker feature y_i in the input only in layers close to the output, and does not include it in layers far from the output. In this case, it is preferable that layers that sample from the prior distribution, without sampling from the speech encoder 24, include the speaker feature y_i in their input, and that layers that sample from the posterior distribution, using the speech encoder 24, do not.

Note that, in layers where the speaker feature y_i is not included in the sampling, y_i is not input to the Conditional Instance Normalization layer (CIN layer).
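As an illustration of the CIN layer mentioned above, the following is a minimal Python sketch of conditional instance normalization, not the implementation of this embodiment. It assumes features laid out as channels by frames, and it assumes that the per-channel scale `gamma` and shift `beta` have already been derived from the speaker feature y_i by some mapping that the text does not specify. The function name and its arguments are hypothetical.

```python
import math
from typing import List, Sequence


def conditional_instance_norm(
    features: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
    eps: float = 1e-5,
) -> List[List[float]]:
    """Normalize each channel over its frames, then apply a
    speaker-dependent scale and shift.

    features: one list of frame values per channel (assumed layout).
    gamma, beta: per-channel parameters, assumed to come from the
    speaker feature y_i via a mapping not described in the text.
    """
    normalized = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        inv_std = 1.0 / math.sqrt(var + eps)
        # Instance statistics remove the channel's own mean and scale;
        # gamma and beta then re-impose speaker-specific statistics.
        normalized.append([g * (v - mean) * inv_std + b for v in channel])
    return normalized


if __name__ == "__main__":
    print(conditional_instance_norm([[1.0, 2.0, 3.0]], gamma=[2.0], beta=[0.5]))
```

Under this sketch, a layer that excludes the speaker feature would simply not receive speaker-dependent `gamma` and `beta`.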

In such a configuration, the learning device 28 adjusts the various parameters (such as the weight coefficient or bias of each neuron) of the neural networks of each layer included in the speaker encoder 22, the speech encoder 24, and the speech decoder 26 so as to reduce the error (distance) between the acoustic feature x_i input to the speech encoder 24 and the reconstructed acoustic feature x̂_i output from the speech decoder 26.

Here, the layer of the speech decoder 26 that forms the boundary between the layers sampling from the prior distribution, which takes the speaker feature y_i into account, and the layers sampling from the posterior distribution, which does not, may be set as appropriate so that the error (distance) between the acoustic feature x_i input to the speech encoder 24 and the reconstructed acoustic feature x̂_i output from the speech decoder 26 becomes small.
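The boundary described above can be pictured as a per-layer plan. The following Python sketch is only an illustration under stated assumptions: the parameter `first_prior_layer` (the first decoder layer that samples from the prior distribution) and the `LayerPlan` record are hypothetical names, and layer 1 is omitted because z_1 is sampled from h_1 rather than from either distribution discussed here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LayerPlan:
    layer: int
    distribution: str           # "posterior" or "prior"
    uses_encoder_latent: bool   # whether h_k is an input
    uses_speaker_feature: bool  # whether the speaker feature is an input


def plan_decoder_layers(n: int, first_prior_layer: int) -> List[LayerPlan]:
    """Describe decoder layers 2..n for a given boundary layer.

    Layers before first_prior_layer sample from the posterior
    (encoder latent h_k, no speaker feature); layers from
    first_prior_layer onward sample from the prior
    (speaker feature, no encoder latent).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    if not 2 <= first_prior_layer <= n + 1:
        raise ValueError("first_prior_layer must lie in 2..n+1")
    plans = []
    for k in range(2, n + 1):
        is_prior = k >= first_prior_layer
        plans.append(
            LayerPlan(
                layer=k,
                distribution="prior" if is_prior else "posterior",
                uses_encoder_latent=not is_prior,
                uses_speaker_feature=is_prior,
            )
        )
    return plans


if __name__ == "__main__":
    for plan in plan_decoder_layers(n=5, first_prior_layer=4):
        print(plan)
```

Setting `first_prior_layer` to n + 1 in this sketch corresponds to sampling from the posterior distribution at every layer; the boundary value itself would be tuned as the text describes.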

As described above, the speech encoder 24 and the speech decoder 26 are trained so that the speech represented by the acoustic feature x̂_i reconstructed in the speech decoder 26 approaches the speech represented by the acoustic feature x_i input to the speech encoder 24.

[Speech processing]
FIG. 7 is a functional block diagram showing the configuration of the speech processing device 100 during speech processing, in which speech uttered by a source speaker is converted so that it sounds as if uttered by a target speaker. The speech processing device 100 functions as a speech analysis unit 20, a speaker encoder 22, a speech encoder 24, a speech decoder 26, and a vocoder 30. Specifically, the speech processing device 100 implements the following speech processing by executing a speech processing program.

The speech analysis unit 20 acquires speech data of speech uttered by the source speaker and performs the speech analysis required for speech processing. The acoustic features extracted by the speech analysis unit 20 are input to the speech encoder 24.

The speaker encoder 22 converts the IDs of the source speaker and the target speaker into speaker features usable for speech processing and outputs them. When the source speaker ID is s, the speaker encoder 22 generates a source speaker feature y_s and outputs it to the speech encoder 24. When the target speaker ID is t, the speaker encoder 22 generates a target speaker feature y_t and outputs it to the speech decoder 26.

The speech encoder 24 receives the acoustic features obtained from the source speaker's speech and the source speaker feature, and converts them into a latent representation. The speech decoder 26 receives the latent representation obtained by the speech encoder 24 and the target speaker feature, and reconstructs acoustic features from them.

The speech processing in the speech processing device 100 will be described with reference to FIG. 8. The speech processing is performed using the speech encoder 24 and the speech decoder 26 trained in the speech training process described above.

In the speech encoder 24, the acoustic feature x_s obtained from the source speaker's speech and the source speaker feature y_s are input to layer n, and the latent representation h_n is output. Thereafter, as in training, layer k receives the latent representation h_{k+1} output from the preceding layer k+1 together with the source speaker feature y_s, and outputs the latent representation h_k. Layer 1, the final stage, receives the latent representation h_2 output from the preceding layer 2 together with the source speaker feature y_s, and outputs the latent representation h_1. The latent representation z_1 of layer 1, the first stage of the speech decoder 26, is sampled from this latent representation h_1.

In the speech decoder 26, the latent representation z_1 is input to layer 1, the first stage, and the latent representation z_2 is output. In layers far from the output of the speech decoder 26, the target speaker feature y_t is not included in the input, and sampling is performed from the posterior distribution p(z_k | z_{k-1}, z_{k-2}, ..., z_1, h_k) conditioned on the latent representations z_{k-1}, z_{k-2}, ..., z_1 of the earlier layers k-1, k-2, ..., 1 of the speech decoder 26 and the latent representation h_k of the k-th layer of the speech encoder 24. In layers close to the output of the speech decoder 26, sampling from the speech encoder 24 is not performed; instead, sampling is performed from the prior distribution p(z_k | z_{k-1}, y_t) conditioned on the latent representation z_{k-1} of the immediately preceding layer k-1 and the target speaker feature y_t. FIG. 8 shows an example in which sampling from the prior distribution is performed in layers n-1 and n of the speech decoder 26. In this case, it is preferable that the input for sampling from the prior distribution include the target speaker feature y_t rather than the source speaker feature y_s.
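The layer-by-layer sampling at conversion time can be sketched as follows. This is a toy illustration under loud assumptions, not the trained networks of the embodiment: each distribution is modeled as a Gaussian with a fixed standard deviation, its mean is a plain average of the conditioning vectors, the posterior is simplified to depend only on z_{k-1} and h_k rather than on all earlier latents, and the names `decode_latents` and `first_prior_layer` are hypothetical.

```python
import random
from typing import Dict, List, Sequence

Vector = List[float]


def gaussian_sample(mean: Sequence[float], std: float, rng: random.Random) -> Vector:
    """Draw one sample per dimension around the given mean."""
    return [rng.gauss(m, std) for m in mean]


def decode_latents(
    h: Dict[int, Vector],
    y_t: Vector,
    n: int,
    first_prior_layer: int,
    rng: random.Random,
    std: float = 0.1,
) -> Dict[int, Vector]:
    """Return z_1..z_n for encoder latents h[1..n] and target feature y_t."""
    # z_1 is sampled from the encoder's final latent h_1.
    z = {1: gaussian_sample(h[1], std, rng)}
    for k in range(2, n + 1):
        if k < first_prior_layer:
            # Far from the output: posterior, conditioned on the previous
            # decoder latent and the encoder latent h_k, no speaker feature.
            mean = [0.5 * (a + b) for a, b in zip(z[k - 1], h[k])]
        else:
            # Close to the output: prior, conditioned on the previous
            # decoder latent and the target speaker feature y_t only.
            mean = [0.5 * (a + b) for a, b in zip(z[k - 1], y_t)]
        z[k] = gaussian_sample(mean, std, rng)
    return z


if __name__ == "__main__":
    rng = random.Random(0)
    h = {1: [1.0], 2: [3.0], 3: [100.0], 4: [100.0]}
    print(decode_latents(h, y_t=[0.0], n=4, first_prior_layer=3, rng=rng))
```

With n = 4 and `first_prior_layer` = 3, layers 3 and 4 ignore the encoder latents, which mirrors the FIG. 8 example where layers n-1 and n sample from the prior distribution.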

Through the speech processing in the speech encoder 24 and the speech decoder 26, layer n, the final stage of the speech decoder 26, constructs and outputs an acoustic feature x_t in which the acoustic feature x_s obtained from the source speaker's speech is adapted to the target speaker's speech.

The vocoder 30 converts the acoustic feature x_t output from the speech decoder 26 into speech data and outputs it. The vocoder 30 can convert the acoustic feature x_t into speech data by performing the inverse of the process by which the speech analysis unit 20 extracts acoustic features from speech data.
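The overall conversion flow, from speech analysis through to the vocoder, can be summarized in a short Python sketch. Every callable here is a hypothetical stand-in supplied by the caller; none of the names or signatures come from the embodiment, and the sketch only shows the order in which the components are connected.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def convert_voice(
    waveform: Sequence[float],
    source_id: str,
    target_id: str,
    analyze: Callable[[Sequence[float]], Vector],
    speaker_encoder: Callable[[str], Vector],
    speech_encoder: Callable[[Vector, Vector], Vector],
    speech_decoder: Callable[[Vector, Vector], Vector],
    vocoder: Callable[[Vector], Vector],
) -> Vector:
    """Wire the five components in the order described in the text."""
    x_s = analyze(waveform)             # speech analysis unit 20
    y_s = speaker_encoder(source_id)    # source speaker feature
    y_t = speaker_encoder(target_id)    # target speaker feature
    latent = speech_encoder(x_s, y_s)   # speech encoder 24 uses y_s
    x_t = speech_decoder(latent, y_t)   # speech decoder 26 uses y_t
    return vocoder(x_t)                 # vocoder 30: features to speech


if __name__ == "__main__":
    # Trivial stand-ins, used only to show the data flow.
    result = convert_voice(
        [0.1, 0.2], "s", "t",
        analyze=lambda w: [v * 10 for v in w],
        speaker_encoder=lambda sid: [1.0] if sid == "s" else [2.0],
        speech_encoder=lambda x, y: [v + y[0] for v in x],
        speech_decoder=lambda h, y: [v * y[0] for v in h],
        vocoder=lambda x: [v / 10 for v in x],
    )
    print(result)
```

The point of the sketch is that the source speaker feature enters only on the encoder side and the target speaker feature only on the decoder side.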

As described above, this embodiment can provide a speech processing device, a speech processing program, and a speech processing method, as well as a speech processing training device, a speech processing training program, and a speech processing training method, that appropriately convert speech uttered by an arbitrary speaker into the sound quality of speech uttered by a target speaker. In other words, the speech processing device 100 including the trained speech encoder 24 and speech decoder 26 can realize speech processing that converts speech uttered by a source speaker into speech that sounds as if it were uttered by a target speaker.

In particular, by applying the Nouveau Variational Auto-Encoder (NVAE) to the speech encoder 24 and the speech decoder 26, the source speaker's speech can be converted into speech that sounds more naturally like the target speaker's than was previously possible.

10 processing unit, 12 memory unit, 14 input unit, 16 output unit, 18 communication unit, 20 speech analysis unit, 22 speaker encoder, 24 speech encoder, 26 speech decoder, 28 learning device, 30 vocoder, 100 speech processing device, 102 network.

Claims

1. A speech processing training program that causes a computer to function as a speech processing training device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having n sampling layers (where n is an integer equal to or greater than 2) that converts the input acoustic features and the speaker features into latent representations; and
a speech decoder including a variational autoencoder having n sampling layers that generates acoustic features using at least a latent representation and the speaker features,
wherein, in sampling layer n, the first stage of the speech encoder, the input acoustic features and the speaker features are input and a latent representation h_n is output;
in each sampling layer (m-1) (where m is an integer from n down to 2) from the second stage onward in the speech encoder, the latent representation h_m output from the preceding sampling layer m of the speech encoder and the speaker features are input and a latent representation h_{m-1} is output;
in sampling layer 1, the first stage of the speech decoder, the latent representation h_1 output from layer 1, the final stage of the speech encoder, is input and a latent representation Z_1 is output;
in each sampling layer k (where k is an integer from 2 to n) from the second stage onward in the speech decoder, the latent representation Z_{k-1} output from the preceding sampling layer (k-1) of the speech decoder and the latent representation h_k output from layer k of the speech encoder are input and a latent representation Z_k is output; and
the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and output acoustic features generated from the latent representation Z_n output from the speech decoder.

2. The speech processing training program according to claim 1,
wherein the sampling layers of the speech decoder to which the speaker features are input are limited.

3. The speech processing training program according to claim 2,
wherein the speech decoder does not input the speaker features to sampling layers preceding a predetermined layer, and inputs the speaker features to sampling layers following the predetermined layer.

4. The speech processing training program according to claim 3,
wherein the speech decoder samples from a posterior distribution in sampling layers preceding the predetermined layer, and samples from a prior distribution in sampling layers following the predetermined layer.

5. The speech processing training program according to any one of claims 1 to 4,
wherein the speech decoder inputs the speaker features to a conditional instance normalization layer.

6. A speech processing program that causes a computer to function as:
an acoustic feature extractor that converts speech into acoustic features;
the speaker encoder, the speech encoder, and the speech decoder trained by the speech processing training program according to claim 1; and
a vocoder that converts acoustic features into speech,
the program causing the computer to:
input source acoustic features, obtained by converting speech of a source speaker in the acoustic feature extractor, and source speaker features, obtained by converting a speaker label of the source speaker in the speaker encoder, into the speech encoder to convert them into latent representations;
input the latent representations and target speaker features, obtained by converting a speaker label of a target speaker in the speaker encoder, into the speech decoder to generate target acoustic features; and
input the target acoustic features generated by the speech decoder into the vocoder to convert the target acoustic features into speech.

7. A speech processing training device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having n sampling layers (where n is an integer equal to or greater than 2) that converts the input acoustic features and the speaker features into latent representations; and
a speech decoder including a variational autoencoder having n sampling layers that generates acoustic features using at least a latent representation and the speaker features,
wherein, in sampling layer n, the first stage of the speech encoder, the input acoustic features and the speaker features are input and a latent representation h_n is output;
in each sampling layer (m-1) (where m is an integer from n down to 2) from the second stage onward in the speech encoder, the latent representation h_m output from the preceding sampling layer m of the speech encoder and the speaker features are input and a latent representation h_{m-1} is output;
in sampling layer 1, the first stage of the speech decoder, the latent representation h_1 output from layer 1, the final stage of the speech encoder, is input and a latent representation Z_1 is output;
in each sampling layer k (where k is an integer from 2 to n) from the second stage onward in the speech decoder, the latent representation Z_{k-1} output from the preceding sampling layer (k-1) of the speech decoder and the latent representation h_k output from layer k of the speech encoder are input and a latent representation Z_k is output; and
the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and output acoustic features generated from the latent representation Z_n output from the speech decoder.

8. A speech processing device comprising:
an acoustic feature extractor that converts speech into acoustic features;
the speaker encoder, the speech encoder, and the speech decoder trained by the speech processing training device according to claim 7; and
a vocoder that converts acoustic features into speech,
wherein source acoustic features, obtained by converting speech of a source speaker in the acoustic feature extractor, and source speaker features, obtained by converting a speaker label of the source speaker in the speaker encoder, are input into the speech encoder and converted into latent representations;
the latent representations and target speaker features, obtained by converting a speaker label of a target speaker in the speaker encoder, are input into the speech decoder to generate target acoustic features; and
the target acoustic features generated by the speech decoder are input into the vocoder to convert the target acoustic features into speech.

9. A speech processing training method for use in a speech processing training device comprising:
an acoustic feature extractor that converts speech into input acoustic features;
a speaker encoder that converts a speaker label of the speech into speaker features;
a speech encoder including a variational autoencoder having n sampling layers (where n is an integer equal to or greater than 2) that converts the input acoustic features and the speaker features into latent representations; and
a speech decoder including a variational autoencoder having n sampling layers that generates acoustic features using at least a latent representation and the speaker features,
wherein, in sampling layer n, the first stage of the speech encoder, the input acoustic features and the speaker features are input and a latent representation h_n is output;
in each sampling layer (m-1) (where m is an integer from n down to 2) from the second stage onward in the speech encoder, the latent representation h_m output from the preceding sampling layer m of the speech encoder and the speaker features are input and a latent representation h_{m-1} is output;
in sampling layer 1, the first stage of the speech decoder, the latent representation h_1 output from layer 1, the final stage of the speech encoder, is input and a latent representation Z_1 is output;
in each sampling layer k (where k is an integer from 2 to n) from the second stage onward in the speech decoder, the latent representation Z_{k-1} output from the preceding sampling layer (k-1) of the speech decoder and the latent representation h_k output from layer k of the speech encoder are input and a latent representation Z_k is output; and
the speech encoder, the speech decoder, and the speaker encoder are trained so as to reduce the distance between the input acoustic features input to the speech encoder and output acoustic features generated from the latent representation Z_n output from the speech decoder.

10. A speech processing method for use in a speech processing device comprising:
an acoustic feature extractor that converts speech into acoustic features;
the speaker encoder, the speech encoder, and the speech decoder trained by the speech processing training method according to claim 9; and
a vocoder that converts acoustic features into speech,
the method comprising:
inputting source acoustic features, obtained by converting speech of a source speaker in the acoustic feature extractor, and source speaker features, obtained by converting a speaker label of the source speaker in the speaker encoder, into the speech encoder to convert them into latent representations;
inputting the latent representations and target speaker features, obtained by converting a speaker label of a target speaker in the speaker encoder, into the speech decoder to generate target acoustic features; and
inputting the target acoustic features generated by the speech decoder into the vocoder to convert the target acoustic features into speech.