JP7709646B2 - Speech synthesis device, speech synthesis method, and program

Info

Publication number: JP7709646B2
Application number: JP2023567286A
Authority: JP
Inventors: 裕紀金川
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2025-07-17
Anticipated expiration: 2041-12-13
Also published as: WO2023112095A1; JPWO2023112095A1

Description

The present disclosure relates to a speech synthesis device, a speech synthesis method, and a program.

In speech synthesis, the module that converts acoustic features, such as the spectrum and the pitch that represents the height of the voice, into a speech waveform is called a vocoder. There are two main ways to implement a vocoder. The first is based on signal processing; STRAIGHT (Non-Patent Document 1) and WORLD (Non-Patent Document 2) are well-known examples. Because these methods express the conversion from acoustic features to a speech waveform with a mathematical model, they require no training and run fast, but the quality of analysis-resynthesized speech is inferior to that of natural speech. The second is based on neural networks (neural vocoders), of which WaveNet is a representative example (Patent Document 1). A neural vocoder can synthesize speech of a quality comparable to natural speech, but because it is based on a huge convolutional neural network (CNN), it is computationally expensive, runs slower than a signal-processing vocoder, and is difficult to operate in real time.

Real-time operation on a CPU therefore requires reducing the amount of computation. One of the main approaches is WaveRNN (Patent Document 2), which replaces the huge CNN used in WaveNet with a small recurrent neural network (RNN). LPCNet (Non-Patent Document 3) goes further by introducing linear predictive analysis (LPC), a technique from signal processing, into the waveform generation process, which enables speech synthesis with a deep neural network (DNN) even smaller than that of WaveRNN. Thus, both WaveRNN and LPCNet use RNNs to realize small-scale speech synthesis DNNs.

Non-Patent Document 1: Hideki Kawahara, Ikuyo Masuda-Katsuse and Alain de Cheveigne, "Restructuring speech representations using a pitch-adaptive time frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, pp. 187-207, 1999.
Non-Patent Document 2: Masanori Morise, Fumiya Yokomori, Kenji Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE transactions on information and systems, vol. E99-D, no. 7, pp. 1877-1884, 2016.
Non-Patent Document 3: Jean-Marc Valin and Jan Skoglund, "LPCNET: Improving Neural Speech Synthesis through Linear Prediction," Proc. ICASSP, 2019, pp. 5891-5895.

Patent Document 1: WO-A-2018/048934
Patent Document 2: WO-A-2019/155054

However, in autoregressive models such as RNNs, each predicted speech waveform value is used to predict the speech waveform value at the next time, so the error relative to the learning phase grows as the sequence to be predicted becomes longer. Eventually waveform generation can fail, which makes the utterance unclear and, in the worst case, produces silence. Failure can be avoided by initializing the state variables of the RNN at fixed timings, but this discards the time-series information accumulated up to that time and causes a discontinuity. Initialization during a voiced section in particular reduces the naturalness of the speech.

To detect failures in waveform generation, it is also conceivable to run signal-processing waveform generation, which is lower in quality but does not fail, alongside the neural vocoder and to compare the results. However, because the vocoder must then be run in two different ways, this significantly reduces the operating speed of waveform generation.

The present invention has been made in view of the above points, and aims to prevent a decrease in the naturalness of speech without significantly reducing the operating speed of waveform generation.

To solve the above problem, the invention of claim 1 is a speech synthesis device that generates a speech waveform in a learning phase, comprising: a waveform generation unit that obtains a predicted value of the speech waveform at the next time based on the combined speech waveform and acoustic features and on the state of a recurrent neural network; a failure flag conversion unit that obtains a failure flag, which indicates whether the speech waveform at each time has failed, based on the speech waveform, the predicted value of the speech waveform, and a threshold for the failure flag; a failure detection unit that obtains a predicted value of the failure flag based on the state sequence of the recurrent neural network and a failure detection model; a failure flag error calculation unit that calculates the error between the failure flag and the predicted value of the failure flag; and a failure detection model learning unit that obtains a learned failure detection model based on the error and the failure detection model.

As described above, the present invention has the effect of preventing a decrease in the naturalness of speech without significantly reducing the operating speed of waveform generation.

FIG. 1 is a schematic diagram of a communication system according to the present embodiment.
FIG. 2 is a hardware configuration diagram of a speech synthesis device and a communication terminal according to the present embodiment.
FIG. 3 is a functional configuration diagram of the speech synthesis device according to the first embodiment in the learning phase.
FIG. 4 is a functional configuration diagram of the speech synthesis device according to the first embodiment in the inference phase.
FIG. 5 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the first embodiment.
FIG. 6 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the first embodiment.
FIG. 7 is a functional configuration diagram of the speech synthesis device according to the second embodiment in the inference phase.
FIG. 8 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the second embodiment.
FIG. 9 is a functional configuration diagram of the speech synthesis device according to the third embodiment in the learning phase.
FIG. 10 is a functional configuration diagram of the speech synthesis device according to the third embodiment in the inference phase.
FIG. 11 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the third embodiment.
FIG. 12 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the third embodiment.
FIG. 13 is a functional configuration diagram of the speech synthesis device according to the fourth embodiment in the learning phase.
FIG. 14 is a functional configuration diagram of the speech synthesis device according to the fourth embodiment in the inference phase.
FIG. 15 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the fourth embodiment.
FIG. 16 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the fourth embodiment.

Embodiments of the present invention are described below with reference to the drawings.

[System configuration of the embodiment]
First, an outline of the configuration of a communication system 1 according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a schematic diagram of the communication system according to the present embodiment.

As shown in FIG. 1, the communication system 1 of this embodiment is composed of a speech synthesis device 3 and a communication terminal 5. The communication terminal 5 is managed and used by a user Y.

The speech synthesis device 3 and the communication terminal 5 can communicate via a communication network 100 such as the Internet. The connection to the communication network 100 may be either wireless or wired.

The speech synthesis device 3 is composed of one or more computers. When the speech synthesis device 3 is composed of multiple computers, it may be referred to as either a "speech synthesis device" or a "speech synthesis system."

The speech synthesis device 3 is a computer that generates speech waveforms for speech synthesis using failure detection technology.

The communication terminal 5 is a computer. FIG. 1 shows a notebook computer as an example, but the communication terminal 5 is not limited to a notebook computer and may be a desktop computer, a smartphone, or a tablet terminal. In FIG. 1, the user Y is operating the communication terminal 5.

[Hardware configuration of the speech synthesis device and communication terminal]
Next, the hardware configuration of the speech synthesis device 3 and the communication terminal 5 will be described with reference to FIG. 2. FIG. 2 is a hardware configuration diagram of the speech synthesis device and the communication terminal according to this embodiment. The hardware configuration of the speech synthesis device and the communication terminal is common to the first to fourth embodiments described later.

As shown in FIG. 2, the speech synthesis device 3 has a processor 301, a memory 302, an auxiliary storage device 303, a connection device 304, a communication device 305, and a drive device 306. The hardware components of the speech synthesis device 3 are connected to one another via a bus 307.

The processor 301 serves as a control unit that controls the entire speech synthesis device 3, and has various computing devices such as a CPU (Central Processing Unit). The processor 301 reads various programs onto the memory 302 and executes them. The processor 301 may also include a GPGPU (general-purpose computing on graphics processing units).

The memory 302 has main storage devices such as a ROM (Read Only Memory) and a RAM (Random Access Memory). By executing the various programs read onto the memory 302, the processor 301 realizes the various functional units described later.

The auxiliary storage device 303 stores various programs and the various information used when those programs are executed by the processor 301 (such as the failure detection model 30a and the learned failure detection model 30b described later).

The connection device 304 connects external devices (for example, a display device 310 and an operation device 311) to the speech synthesis device 3.

The communication device 305 sends and receives various information to and from other devices.

The drive device 306 is a device into which a recording medium 330 is set. The recording medium 330 here includes media that record information optically, electrically, or magnetically, such as a CD-ROM (Compact Disc Read-Only Memory), a flexible disk, or a magneto-optical disk. The recording medium 330 may also include semiconductor memory that records information electrically, such as a ROM or a flash memory.

The various programs installed in the auxiliary storage device 303 are installed, for example, by setting a distributed recording medium 330 in the drive device 306 and having the drive device 306 read the programs recorded on it. Alternatively, the programs may be installed by downloading them from a network via the communication device 305.

FIG. 2 also shows the hardware configuration of the communication terminal 5. Because each component is the same except that the reference numerals are in the 500s instead of the 300s, its description is omitted.

● First Embodiment
A first embodiment will be described with reference to FIGS. 3 to 6.

[Functional configuration of the speech synthesis device]
The functional configuration of the speech synthesis device according to the first embodiment will be described with reference to FIG. 3 and FIG. 4.

<Functional configuration of the speech synthesis device in the learning phase>
FIG. 3 is a functional configuration diagram of the speech synthesis device according to the first embodiment in the learning phase. As shown in FIG. 3, the speech synthesis device 3 has an input unit 31, a waveform generation unit 32, a failure flag conversion unit 33, a failure detection unit 34, a failure flag error calculation unit 35, and a failure detection model learning unit 36.

Of these, the input unit 31 combines the input speech waveform and acoustic features.

The waveform generation unit 32 obtains a predicted value of the speech waveform at the next time based on the combined speech waveform and acoustic features and on the state of the recurrent neural network.

The failure flag conversion unit 33 obtains a failure flag, which indicates whether the speech waveform at each time has failed, based on the speech waveform, the predicted value of the speech waveform, and the failure flag threshold.

The failure detection unit 34 obtains a predicted value of the failure flag based on the state sequence of the recurrent neural network (RNN) and the failure detection model 30a.

The failure flag error calculation unit 35 calculates the error between the failure flag and the predicted value of the failure flag.

The failure detection model learning unit 36 obtains a learned failure detection model 30b based on the error and the failure detection model.

Each of these functional components is described in detail below.

<Functional configuration of the speech synthesis device in the inference phase>
FIG. 4 is a functional configuration diagram of the speech synthesis device according to the first embodiment in the inference phase. As shown in FIG. 4, the speech synthesis device 3 has an input unit 31, a waveform generation unit 32, a failure flag conversion unit 33, a failure detection unit 34, and a state initialization unit 37. Functional components that are the same as in the learning phase are given the same reference numerals, and their description is omitted.

When the failure flag (predicted value) indicates "failed," that is, when a failure is predicted, the state initialization unit 37 initializes the state of the RNN based on the initial value of the RNN state. This functional component is described in detail below.

[Processing or operation of the speech synthesis device]
Next, the processing or operation of the speech synthesis device according to the first embodiment will be described with reference to FIG. 5 and FIG. 6.

<Processing or operation of the speech synthesis device in the learning phase>
FIG. 5 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the first embodiment.

First, as shown in FIG. 5, the input unit 31 combines the speech waveform x of the learning data at time t with the acoustic features corresponding to that waveform, and inputs the result to the waveform generation unit 32 (S11).

Next, the waveform generation unit 32 obtains the speech waveform (predicted value) at the next time based on the combined speech waveform and acoustic features and on the state h of the RNN (S12). Here, spectral information such as a spectrogram or mel-cepstrum and prosodic information such as the fundamental frequency or pitch width are used as the acoustic features.
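Steps S11 and S12 can be pictured with a toy recurrence. The sketch below is only an illustration under stated assumptions: it uses a plain Elman-style RNN cell with a scalar output, which is far simpler than the WaveRNN or LPCNet architectures cited above, and the names `rnn_step`, `W_in`, `W_rec`, and `w_out` are hypothetical and do not come from the patent.

```python
import math
from typing import List, Sequence, Tuple


def rnn_step(
    x_t: float,
    features_t: Sequence[float],
    h_prev: Sequence[float],
    W_in: Sequence[Sequence[float]],
    W_rec: Sequence[Sequence[float]],
    w_out: Sequence[float],
) -> Tuple[float, List[float]]:
    """One toy step: combine inputs (S11), then predict the next sample (S12)."""
    # S11: combine the current waveform sample with its acoustic features.
    u = [x_t] + list(features_t)
    # S12: update the RNN state from the combined input and the previous state.
    h = [
        math.tanh(
            sum(W_in[i][j] * u[j] for j in range(len(u)))
            + sum(W_rec[i][k] * h_prev[k] for k in range(len(h_prev)))
        )
        for i in range(len(h_prev))
    ]
    # Predicted waveform value at the next time, squashed into (-1, 1).
    x_next = math.tanh(sum(w * hi for w, hi in zip(w_out, h)))
    return x_next, h


# Hypothetical sizes: 2 acoustic features per sample, 2 state units.
W_in = [[0.5, 0.1, -0.2], [-0.3, 0.4, 0.2]]
W_rec = [[0.1, 0.0], [0.0, 0.1]]
w_out = [0.7, -0.4]
x_next, h_next = rnn_step(0.3, [1.0, 2.0], [0.0, 0.0], W_in, W_rec, w_out)
```

The returned state `h_next` is the quantity that the failure detection unit later inspects, which is why the step returns it alongside the predicted sample.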

Then, once the above process (S12) has been executed for times t = 1, ..., T, the failure flag conversion unit 33 acquires the resulting speech waveform (predicted value) together with the speech waveform x and the failure flag threshold f, and obtains the failure flag based on these (S13).

Here, the failure flag is a binary flag indicating whether the speech waveform at each time t has failed. In the process (S13), the failure flag conversion unit 33 compares x with the speech waveform (predicted value), and flags the waveform as failed if their difference exceeds the failure flag threshold f and as not failed otherwise. Because it is often difficult to assign the flag by directly comparing the amplitude values of x and the predicted value, both may first be converted to power or a spectrum and then compared. This is because, when the predicted value has failed, its power or spectrum differs markedly from that of x, so the difference appears clearly. The mean squared error or the mean absolute error can be used to calculate the difference.
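As a rough illustration of the labeling in S13, the following sketch converts both waveforms to per-frame power and compares the squared difference with the threshold f. It is a minimal sketch under assumptions of its own: the patent describes a flag per time t, whereas this example assigns one flag per non-overlapping frame, and the names `frame_power`, `failure_flags`, and `frame_len` are hypothetical.

```python
from typing import List, Sequence


def frame_power(x: Sequence[float], frame_len: int) -> List[float]:
    """Mean power of each non-overlapping frame (a trailing partial frame is dropped)."""
    n_frames = len(x) // frame_len
    return [
        sum(v * v for v in x[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def failure_flags(
    x: Sequence[float], x_hat: Sequence[float], f: float, frame_len: int = 4
) -> List[int]:
    """Return one binary flag per frame: 1 = failed, 0 = not failed."""
    if len(x) != len(x_hat):
        raise ValueError("x and x_hat must have the same length")
    p_ref = frame_power(x, frame_len)
    p_hat = frame_power(x_hat, frame_len)
    # Squared error between the per-frame powers, compared with threshold f.
    return [1 if (a - b) ** 2 > f else 0 for a, b in zip(p_ref, p_hat)]


x = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
# Hypothetical prediction whose second frame collapses to silence.
x_hat = [0.5, -0.4, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0]
flags = failure_flags(x, x_hat, f=0.01)
```

With these toy values the first frame stays below the threshold and the silent second frame exceeds it, which mirrors the silence failure mode described earlier. The mean absolute error could be substituted by replacing the squared difference with `abs(a - b)`.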

At the same time as the failure flag conversion unit 33 acquires the speech waveform (predicted value), the failure detection unit 34 acquires the state sequence h of the RNN from the waveform generation unit 32, further acquires the failure detection model 30a, and predicts the failure flag based on these, thereby obtaining the failure flag (predicted value) (S14). When a statistical model is used as the failure detection model, a probability value of whether or not a failure has occurred is obtained. The failure detection model 30a is not limited to a DNN; other classification models such as logistic regression or a support vector machine may be used.
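Since the text allows logistic regression as the failure detection model, S14 can be sketched as a linear score over the RNN state passed through a sigmoid. This is only one possible instance, not the patent's model: the weights, bias, state values, and the 0.5 decision threshold below are hypothetical.

```python
import math
from typing import Sequence


def predict_failure_probability(
    h: Sequence[float], w: Sequence[float], b: float
) -> float:
    """Logistic regression: sigmoid(w . h + b), read as the probability of failure."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


w = [2.0, -1.0, 0.5]     # hypothetical model weights
b = -0.5                 # hypothetical bias
h_t = [0.1, 0.3, -0.2]   # hypothetical RNN state at one time step
p_fail = predict_failure_probability(h_t, w, b)
# Turn the probability into a binary predicted flag (assumed cut-off of 0.5).
flag_pred = 1 if p_fail >= 0.5 else 0
```

Because the detector only reads the state that the vocoder has already computed, it adds little work per sample compared with running a second vocoder.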

Next, the failure flag error calculation unit 35 acquires the failure flag and the failure flag (predicted value), and calculates the error between them (S15). Because the task of the failure detection model 30a is a classification problem of whether or not a failure has occurred, cross-entropy or the like can be used as the error function when a DNN is adopted as the statistical model.
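The error in S15 can be illustrated with the binary cross-entropy between the reference flags and the predicted probabilities. The function name and the clipping constant `eps` are assumptions added for numerical safety, not details from the patent.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    flags: Sequence[int], probs: Sequence[float], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between 0/1 flags and predicted failure probabilities."""
    if len(flags) != len(probs) or len(flags) == 0:
        raise ValueError("flags and probs must be non-empty and equally long")
    total = 0.0
    for y, p in zip(flags, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(flags)


loss_good = binary_cross_entropy([0, 1, 0], [0.1, 0.9, 0.2])
loss_bad = binary_cross_entropy([0, 1, 0], [0.9, 0.1, 0.8])
```

Predictions that agree with the flags give a smaller loss than predictions that contradict them, which is the signal the learning step minimizes.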

Next, the failure detection model learning unit 36 obtains the learned failure detection model 30b based on the error calculated by the failure flag error calculation unit 35 and on the failure detection model 30a (S16). This process (S16) is achieved by updating the parameters of the failure detection model 30a so as to minimize the error; for DNNs, error backpropagation is generally used. Repeating the procedure up to this point over all of the learning data improves the prediction accuracy of the failure detection model 30a. This completes the learning phase.
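The parameter update in S16 can be sketched for the logistic-regression case, where the gradient of the cross-entropy with respect to the score is simply the predicted probability minus the flag. This is an assumed stand-in for DNN backpropagation, and the toy states, flags, learning rate, and epoch count are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_detector(
    states: Sequence[Sequence[float]],
    flags: Sequence[int],
    lr: float = 0.5,
    epochs: int = 200,
) -> Tuple[List[float], float]:
    """Fit logistic-regression weights by stochastic gradient descent on cross-entropy."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):  # repeat over all of the learning data
        for h, y in zip(states, flags):
            p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)
            g = p - y  # gradient of the cross-entropy with respect to the score
            w = [wi - lr * g * hi for wi, hi in zip(w, h)]
            b -= lr * g
    return w, b


# Hypothetical RNN states paired with reference failure flags.
states = [[0.1, 0.2], [0.0, 0.3], [2.0, 1.5], [1.8, 2.2]]
flags = [0, 0, 1, 1]
w, b = train_detector(states, flags)
preds = [
    1 if sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) >= 0.5 else 0
    for h in states
]
```

The returned pair `(w, b)` plays the role of the learned failure detection model 30b in this simplified setting.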

<Processing or operation of the speech synthesis device in the inference phase>
FIG. 6 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the first embodiment.

First, as shown in FIG. 6, in the same manner as in the above process (S11), the input unit 31 combines the speech waveform x of the learning data at time t with the acoustic features corresponding to that waveform, and inputs the result to the waveform generation unit 32 (S111).

Next, in the same manner as in the above process (S12), the waveform generation unit 32 obtains the speech waveform (predicted value) at the next time based on the combined speech waveform and acoustic features and on the state h of the RNN (S112).

Next, the failure detection unit 34 acquires the state h of the RNN from the waveform generation unit 32, further acquires the learned failure detection model 30b, and predicts the failure flag based on these, thereby obtaining the failure flag (predicted value) (S113).

Next, when the failure flag (predicted value) indicates "failed," that is, when a failure is predicted, the state initialization unit 37 initializes the state h of the RNN based on the initial value of the RNN state (S114).

This completes the inference phase.
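The inference flow S112 to S114 amounts to a generation loop that resets the RNN state whenever the detector predicts a failure. The sketch below shows only that control flow; `toy_step` and `toy_detector` are hypothetical stand-ins for the waveform generation unit and the learned failure detection model, chosen so that the behavior is easy to follow.

```python
from typing import Callable, List, Sequence, Tuple


def generate(
    n_steps: int,
    step: Callable[[float, List[float]], Tuple[float, List[float]]],
    detect_failure: Callable[[List[float]], bool],
    h_init: Sequence[float],
    x_init: float = 0.0,
) -> Tuple[List[float], List[int]]:
    """Autoregressive generation with a state reset when a failure is predicted."""
    x, h = x_init, list(h_init)
    waveform: List[float] = []
    reset_times: List[int] = []
    for t in range(n_steps):
        x, h = step(x, h)          # S112: predict the next sample, update the state
        if detect_failure(h):      # S113: predict the failure flag from the state
            h = list(h_init)       # S114: re-initialize the RNN state
            reset_times.append(t)
        waveform.append(x)
    return waveform, reset_times


def toy_step(x: float, h: List[float]) -> Tuple[float, List[float]]:
    h_next = [h[0] + 1.0]          # toy state that drifts upward at every step
    return 0.1 * h_next[0], h_next


def toy_detector(h: List[float]) -> bool:
    return h[0] >= 3.0             # toy rule: a large state value means "failed"


waveform, reset_times = generate(7, toy_step, toy_detector, h_init=[0.0])
```

Resets here happen only when the detector fires rather than at fixed timings, which is the difference from periodic initialization described in the background.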

<Main effects of the first embodiment>
As described above, this embodiment prevents a decrease in the naturalness of speech without significantly reducing the operating speed of waveform generation. Specifically, a failure in waveform generation can be detected using only information obtained from the operation of the neural vocoder. Furthermore, by initializing the state variables of the RNN at the time a failure is detected, the problems of the utterance becoming unclear or silent can be avoided.

● Second Embodiment
Next, a second embodiment will be described with reference to FIGS. 7 and 8.

[Functional configuration of the speech synthesis device]
The functional configuration of the speech synthesis device 3 according to this embodiment in the learning phase is the same as that of the speech synthesis device 3 according to the first embodiment in the learning phase, so its description is omitted.

<Functional configuration of the speech synthesis device in the inference phase>
In addition to the components of the speech synthesis device 3 according to the first embodiment, the speech synthesis device 3 according to the second embodiment has a speech waveform buffering unit 41, a speech waveform averaging unit 42, and a speech waveform selection unit 48.

Of these, the speech waveform buffering unit 41 is built in the memory 302 and accumulates speech waveforms (predicted values).

The speech waveform averaging unit 42 obtains an averaged speech waveform based on the immediately preceding speech waveforms (predicted values) accumulated in the buffering unit 41.

The speech waveform selection unit 48 outputs a speech waveform (predicted value) for predicting the speech waveform at the next time, based on the predicted value of the failure flag, the speech waveform (predicted value), and the averaged speech waveform.

[Processing or operation of the speech synthesis device]
Next, the processing or operation of the speech synthesis device according to the second embodiment will be described with reference to FIG. 8. The learning phase of the speech synthesis device 3 according to this embodiment is the same as in the first embodiment, and only part of the inference phase differs, so the description of the learning phase is omitted.

<Processing or operation of the speech synthesis device in the inference phase>
FIG. 8 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the second embodiment. Processes S121 to S124 are the same as processes S111 to S114 of the first embodiment, so their description is omitted.

In the process (S122), each time the waveform generation unit 32 obtains the speech waveform (predicted value) at the next time, it inputs that speech waveform (predicted value) to the speech waveform buffering unit 41, which accumulates it (S125).

Next, the speech waveform averaging unit 42 acquires the immediately preceding speech waveforms (predicted values) accumulated in the buffering unit 41 and obtains an averaged speech waveform based on them (S126). Here, the speech waveform averaging unit 42 can use a simple moving average, which averages the most recent N samples, or a weighted moving average or exponential moving average, which gives more weight to the most recent speech waveform.
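The three averaging options named for S126 can be compared on a small buffer. In this sketch the buffer is a fixed-length `collections.deque`, and the linear weights of the weighted average and the smoothing factor `alpha` are assumed choices, since the patent does not specify them.

```python
from collections import deque
from typing import Iterable


def simple_moving_average(buf: Iterable[float]) -> float:
    """Plain mean of the buffered samples."""
    values = list(buf)
    return sum(values) / len(values)


def weighted_moving_average(buf: Iterable[float]) -> float:
    """Linearly increasing weights, so the newest sample counts most."""
    values = list(buf)
    weights = range(1, len(values) + 1)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def exponential_moving_average(buf: Iterable[float], alpha: float = 0.5) -> float:
    """Recursive smoothing from oldest to newest sample with factor alpha."""
    values = list(buf)
    avg = values[0]
    for v in values[1:]:
        avg = alpha * v + (1.0 - alpha) * avg
    return avg


# Buffer holding the most recent N = 4 predicted samples (oldest first).
buffer = deque(maxlen=4)
for sample in [0.0, 1.0, 2.0, 3.0, 4.0]:
    buffer.append(sample)  # S125: accumulate; the oldest sample is dropped

sma = simple_moving_average(buffer)       # 2.5
wma = weighted_moving_average(buffer)     # 3.0
ema = exponential_moving_average(buffer)  # 3.125
```

On this rising sequence the weighted and exponential averages sit closer to the newest sample than the simple average does, which is the property the text attributes to them.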

Next, the speech waveform selection unit 48 outputs, based on the failure flag (predicted value), the averaged speech waveform, and the speech waveform (predicted value), the speech waveform (predicted value) to be used for predicting the speech waveform at the next time (S127). In this case, when the failure flag predicted by the failure detection unit 34 at time t+1 indicates a failure, the speech waveform selection unit 48 selects and outputs the averaged speech waveform instead of the speech waveform (predicted value) obtained by the waveform generation unit 32. The inference phase ends in this manner.
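The selection step S127 can be pictured with a minimal sketch. The names `select_waveform` and `run_feedback`, the buffer size, and the choice to average only the samples buffered before the current step are assumptions made for illustration; the description does not fix these details.

```python
from collections import deque


def select_waveform(failure_flag, predicted, averaged):
    """Return the sample fed back for the next prediction step (S127)."""
    return averaged if failure_flag else predicted


def run_feedback(predictions, failure_flags, buffer_size=3):
    """Replay predictions with their failure flags and collect the selected samples."""
    buffer = deque(maxlen=buffer_size)
    selected = []
    for predicted, failed in zip(predictions, failure_flags):
        # Assumption: average the samples buffered before this step;
        # fall back to the prediction itself while the buffer is empty.
        averaged = sum(buffer) / len(buffer) if buffer else predicted
        selected.append(select_waveform(failed, predicted, averaged))
        buffer.append(predicted)  # S125: accumulate the predicted sample
    return selected


# The fourth sample (9.0) is flagged as failed and is replaced by the buffered average.
out = run_feedback([0.1, 0.2, 0.3, 9.0, 0.4], [False, False, False, True, False])
```

In this toy run the runaway sample is swapped for a value close to its recent neighbors, which is the continuity effect the second embodiment aims for.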

<Main effects of the second embodiment>
As described above, in addition to the effects of the first embodiment, this embodiment eliminates the speech discontinuity that arises as a side effect of initializing the RNN state h. The state h accumulates information on the speech waveform generated up to that time, and initializing it discards that information, so the continuity of the speech before and after the initialization is also lost. To eliminate this discontinuity, this embodiment buffers in advance the several samples of the speech waveform immediately before the initialization. As the speech waveform at the time of initialization, the average of the buffered speech is used instead of the value predicted by the waveform generation unit 32. The state h can thus be initialized while continuity with the immediately preceding speech is preserved, so failure is prevented without significant degradation in quality.

● Third Embodiment
Next, a third embodiment will be described with reference to Figs. 9 to 12.

[Functional configuration of the speech synthesis device]
The speech synthesis device 3 according to this embodiment has a statistics calculation unit 49 in addition to the functional configuration, in the learning phase and the inference phase, of the speech synthesis device 3 according to the first embodiment.

The statistics calculation unit 49 obtains statistics of the RNN state. This functional configuration is described in detail below.

[Processing or operation of the speech synthesis device]
Next, the processing or operation of the speech synthesis device according to the third embodiment will be described with reference to Fig. 11.

<Processing or operation in the learning phase of the speech synthesis device>
Fig. 11 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the third embodiment. Processes S31, S32, S33, S34, S35, and S36 correspond to processes S11, S12, S13, S14, S15, and S16 of the first embodiment, respectively, and are largely the same, so only the differences are described.

In this embodiment, as in the first embodiment, after process S31 the waveform generation unit 32 obtains the speech waveform (predicted value) at the next time (S32). At each such prediction, the waveform generation unit 32 also obtains the RNN state.

In this embodiment, the statistics calculation unit 49 obtains the RNN state from the waveform generation unit 32 and obtains statistics of that state. The statistics calculation unit 49 executes this process for times t = 1, ..., T to acquire the statistics of the RNN state (S32-1). To ensure the continuity of the speech waveform, the RNN state h is usually a vector with more than 100 dimensions. If it were used directly as the input to the failure detection unit 34, the amount of computation would grow in proportion to the number of dimensions, so the statistics calculation unit 49 converts the state h into a low-dimensional feature. Specifically, the statistics calculation unit 49 uses a vector that concatenates the mean, standard deviation, maximum, minimum, and similar statistics of the RNN state. Alternatively, the statistics calculation unit 49 may compress the state into a low-dimensional vector by principal component analysis or linear discriminant analysis.
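As a rough sketch of this dimensionality reduction, a state vector can be collapsed into four summary statistics. The function name `state_statistics`, the use of the population standard deviation, and the synthetic 128-dimensional state are assumptions for illustration only.

```python
import statistics


def state_statistics(h):
    """Compress a high-dimensional RNN state into [mean, std, max, min]."""
    return [
        statistics.fmean(h),
        statistics.pstdev(h),  # population standard deviation (an assumption)
        max(h),
        min(h),
    ]


# A stand-in 128-dimensional state vector (values are purely illustrative).
h_t = [((i * 37) % 128) / 64.0 - 1.0 for i in range(128)]
features = state_statistics(h_t)
```

Whatever the size of `h_t`, the detector then receives a fixed four-element input, so its cost no longer scales with the state dimension.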

The statistics calculation unit 49 then inputs the statistics of the RNN state to the failure detection unit 34. The subsequent flow, from predicting the failure flag onward, is the same as in the first embodiment, so its description is omitted. The learning phase ends in this manner.

<Processing or operation in the inference phase of the speech synthesis device>
Fig. 12 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the third embodiment. Processes S131, S132, S133, and S134 correspond to processes S111, S112, S113, and S114 of the first embodiment, respectively, and are largely the same, so only the differences are described.

In this embodiment, as in the first embodiment, after process S131 the waveform generation unit 32 obtains the speech waveform (predicted value) at the next time (S132). At each such prediction, the waveform generation unit 32 also obtains the RNN state.

As in the learning phase, the statistics of the RNN state are then input to the failure detection unit 34. The subsequent flow is the same as in the first embodiment, so its description is omitted. The inference phase ends in this manner.

<Main effects of the third embodiment>
As described above, this embodiment provides the following effects in addition to those of the first embodiment. If the RNN state h is used as is in the failure detection processing described in the second embodiment, the large number of dimensions of the state h makes the amount of computation large and impairs the operating speed of waveform generation. This embodiment reduces the amount of computation required for failure detection and thereby improves the operating speed of waveform generation including failure detection. It can also be combined with the second embodiment, which enables waveform generation that preserves speech continuity while reducing the amount of computation required for failure detection.

● Fourth Embodiment
Next, a fourth embodiment will be described with reference to Figs. 13 to 16.

[Functional configuration of the speech synthesis device]
The functional configuration of the speech synthesis device according to the fourth embodiment will be described with reference to Fig. 13 and Fig. 14.

<Functional configuration of the speech synthesis device in the learning phase>
In the speech synthesis device 3 according to this embodiment, the failure flag conversion unit 33, the failure detection unit 34, the failure flag error calculation unit 35, and the failure detection model learning unit 36 of the speech synthesis device 3 of the first embodiment are replaced by a failure detection index conversion unit 43, a failure detection index prediction unit 44, a failure detection index error calculation unit 45, and a failure detection index prediction model learning unit 46, respectively.

Of these, the failure detection index conversion unit 43 obtains a failure detection index based on the speech waveform and the speech waveform (predicted value).

The failure detection index prediction unit 44 obtains a failure detection index (predicted value) based on the RNN state sequence and the failure detection index prediction model 40a.

The failure detection index error calculation unit 45 calculates the error between the failure detection index and the failure detection index (predicted value).

The failure detection index prediction model learning unit 46 obtains a learned failure detection index prediction model 40b based on the error and the failure detection index prediction model 40a.

<Functional configuration of the speech synthesis device in the inference phase>
In the speech synthesis device 3 according to this embodiment, the failure detection unit 34 of the speech synthesis device 3 of the first embodiment is replaced by the failure detection index prediction unit 44. In addition, the state initialization unit 37 receives the failure detection index (predicted value) instead of the failure flag (predicted value), and also receives the failure flag threshold f.

[Processing or operation of the speech synthesis device]
Next, the processing or operation of the speech synthesis device according to the fourth embodiment will be described with reference to Fig. 15 and Fig. 16.

<Processing or operation in the learning phase of the speech synthesis device>
Fig. 15 is a flowchart showing the processing or operation in the learning phase of the speech synthesis device according to the fourth embodiment. Processes S41 and S42 correspond to processes S11 and S12 of the first embodiment, respectively, so only the differences are described.

In this embodiment, process S42 is executed for times t = 1, ..., T, so that the failure detection index conversion unit 43 acquires the speech waveform (predicted value) together with the speech waveform and, based on these, obtains the failure detection index (S43).

Here, the failure detection index is the difference between the speech waveform x, which is used to generate the failure flag in the first embodiment, and its predicted value: an error calculated from the amplitude values of the speech waveform, from the power computed from those amplitude values, or from the spectrum.
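One way to picture the three kinds of index mentioned here (amplitude, power, spectrum) is the sketch below. The exact formulas are not given in this description, so the mean absolute differences, the frame length of 16, and all function names are assumptions chosen for illustration.

```python
import cmath
import math


def amplitude_error(x, x_hat):
    """Mean absolute difference between the sample amplitudes."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)


def power(x):
    """Mean squared amplitude of a frame."""
    return sum(a * a for a in x) / len(x)


def power_error(x, x_hat):
    """Absolute difference between the frame powers."""
    return abs(power(x) - power(x_hat))


def magnitude_spectrum(x):
    """Naive DFT magnitudes; adequate for a short illustrative frame."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_error(x, x_hat):
    """Mean absolute difference between the magnitude spectra."""
    s, s_hat = magnitude_spectrum(x), magnitude_spectrum(x_hat)
    return sum(abs(a - b) for a, b in zip(s, s_hat)) / len(s)


# One 16-sample frame of a sine wave and two hypothetical predictions of it.
frame = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
good = [0.95 * v for v in frame]  # close prediction
bad = [3.0 * v for v in frame]    # prediction with runaway amplitude
```

All three measures are continuous values that grow as the prediction drifts away from the reference waveform, which is what lets them serve as a regression target.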

When the failure detection index conversion unit 43 acquires the speech waveform (predicted value), the failure detection index prediction unit 44 simultaneously acquires the RNN state sequence from the waveform generation unit 32, further acquires the failure detection index prediction model 40a, and obtains the failure detection index (predicted value) based on these (S44).

As in the third embodiment, the failure detection index prediction unit 44 may instead acquire the statistics of the RNN state via the statistics calculation unit 49.

Next, the failure detection index error calculation unit 45 acquires the failure detection index and the failure detection index (predicted value) and calculates the error between them (S45). Because both are continuous values, the mean squared error or the mean absolute error can be used to calculate the error, as in process S14 of the first embodiment.
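The two error measures named above are standard and can be written directly. The numeric sequences below are made-up values used only to show the calculation.

```python
def mean_squared_error(target, predicted):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def mean_absolute_error(target, predicted):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


index = [0.0, 0.1, 2.0, 0.2]       # hypothetical failure detection index per time step
index_pred = [0.1, 0.1, 1.5, 0.0]  # hypothetical model predictions
mse = mean_squared_error(index, index_pred)
mae = mean_absolute_error(index, index_pred)
```

The squared error penalizes the large miss at the third time step more heavily than the absolute error does, which may matter when rare large index values are the ones of interest.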

Next, the failure detection index prediction model learning unit 46 obtains the learned failure detection index prediction model 40b based on the error calculated by the failure detection index error calculation unit 45 and on the failure detection index prediction model 40a (S46). Process S46 is achieved by updating the parameters of the failure detection index prediction model 40a so as to minimize the error; for a DNN, error backpropagation is generally used. Repeating the procedure up to this point over all of the training data improves the prediction accuracy of the failure detection index prediction model 40a. The learning phase ends in this manner.
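The parameter update in S46 can be illustrated with a deliberately simplified stand-in. Instead of a DNN trained by backpropagation, the sketch fits a linear predictor by plain gradient descent on the mean squared error; the model form, learning rate, epoch count, and toy data are all assumptions and are not taken from this description.

```python
def predict(weights, bias, features):
    """Linear stand-in for the failure detection index prediction model."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def train(samples, targets, lr=0.1, epochs=500):
    """Fit the linear predictor by gradient descent on the mean squared error."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for feats, target in zip(samples, targets):
            err = predict(weights, bias, feats) - target
            for i, f in enumerate(feats):
                grad_w[i] += 2.0 * err * f / n
            grad_b += 2.0 * err / n
        # Move the parameters against the gradient to reduce the error.
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Toy data generated by index = 2 * feature + 0.5 (purely illustrative).
xs = [[0.0], [0.5], [1.0], [1.5]]
ys = [0.5, 1.5, 2.5, 3.5]
w, b = train(xs, ys)
```

The same loop structure (predict, measure the error, update the parameters, repeat over the data) carries over to a DNN, where the gradients would come from backpropagation rather than a closed-form expression.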

<Processing or operation in the inference phase of the speech synthesis device>
Fig. 16 is a flowchart showing the processing or operation in the inference phase of the speech synthesis device according to the fourth embodiment. Processes S141 and S142 correspond to processes S111 and S112 of the first embodiment, respectively, so only the differences are described.

In this embodiment, the failure detection index conversion unit 43 obtains the RNN state from the waveform generation unit 32, further obtains the learned failure detection index prediction model 40b, and predicts the failure detection index based on these, thereby obtaining the failure detection index (predicted value) (S143).

Next, when the failure detection index (predicted value) is greater than the threshold f, the state initialization unit 37 regards the waveform generation as having failed and initializes the RNN state based on the initial value of the RNN state.
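The threshold-triggered reset can be sketched as a small generation loop. Here `toy_step` and `toy_index` are hypothetical stand-ins for the waveform generation unit and the learned index prediction model, and the threshold value is arbitrary; only the control flow reflects the text above.

```python
def generate(steps, step_fn, index_fn, initial_state, threshold):
    """Generate samples, resetting the state when the predicted index exceeds the threshold."""
    state, sample = list(initial_state), 0.0
    waveform, resets = [], []
    for t in range(steps):
        sample, state = step_fn(sample, state)
        if index_fn(state) > threshold:
            state = list(initial_state)  # re-initialize the RNN state
            resets.append(t)
        waveform.append(sample)
    return waveform, resets


def toy_step(sample, state):
    """Stand-in for waveform generation: the state doubles at every step."""
    new_state = [s * 2.0 for s in state]
    return new_state[0] * 0.1, new_state


def toy_index(state):
    """Stand-in for the learned failure detection index prediction model."""
    return max(abs(s) for s in state)


waveform, resets = generate(6, toy_step, toy_index, [1.0], threshold=5.0)
```

In this toy run the state is reset at steps 2 and 5, each time the predicted index crosses the threshold, and generation continues from the initial state afterwards.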

<Main effects of the fourth embodiment>
In the failure detection processing of the first to third embodiments, tuning the accuracy of the failure detection model, which is trained as a discriminative model, always requires retraining, for example after changing the failure flag threshold f used in the learning phase or changing learning conditions such as the hyperparameters of the failure detection model. Considerable effort is therefore needed to tune toward a model that is considered optimal. In addition, because the failure detection model 30a is trained specifically for the waveform generation unit 32, whenever the model used in the waveform generation unit 32 changes, the failure detection model 30a must be re-tuned accordingly.

In contrast, this embodiment does not train a discriminative model that predicts a discrete failure flag; it trains a generative model that predicts a continuous failure detection index. Specifically, the speech synthesis device 3 of the fourth embodiment does not predict the failure flag directly from a statistical model but predicts the value that serves as its index, and detects failure indirectly according to whether that value exceeds the threshold f. Only the threshold f then needs tuning, without retraining the model used for failure detection, which reduces the above problems of the first to third embodiments.
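The practical consequence is that one set of predicted index values can be re-thresholded as often as needed. The sketch below uses made-up predictions and candidate thresholds to show this; `detect_failures` is a hypothetical helper name.

```python
def detect_failures(predicted_index, threshold):
    """Flag every time step whose predicted index exceeds the threshold."""
    return [value > threshold for value in predicted_index]


# Predicted index values from an already trained model (values are made up).
predicted_index = [0.1, 0.4, 1.2, 0.3, 2.5, 0.2]

# The same predictions are reused for every candidate threshold f;
# the prediction model itself is never retrained.
flags_by_threshold = {f: detect_failures(predicted_index, f) for f in (0.5, 1.0, 2.0)}
counts = {f: sum(flags) for f, flags in flags_by_threshold.items()}
```

Raising f from 1.0 to 2.0 drops one detection without touching the model, whereas a detector trained directly on flags derived from f would have to be retrained for each candidate.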

This embodiment can also be combined with the second or third embodiment, which enables waveform generation that preserves speech continuity while reducing the amount of computation required for failure detection.

[Supplement]
The present invention is not limited to the embodiments described above, and may have the following configurations or processes (operations).

The speech synthesis device 3 can be realized by a computer and a program, and this program can be recorded on a (non-transitory) recording medium or provided via the communication network 100.

REFERENCE SIGNS LIST
1 Communication system
3 Speech synthesis device
5 Communication terminal
30a Failure detection model
30b Learned failure detection model
31 Input unit
32 Waveform generation unit
33 Failure flag conversion unit
34 Failure detection unit
35 Failure flag error calculation unit
36 Failure detection model learning unit
37 State initialization unit
40a Failure detection index prediction model
40b Learned failure detection index prediction model
41 Speech waveform buffering unit
42 Speech waveform averaging unit
43 Failure detection index conversion unit
44 Failure detection index prediction unit
45 Failure detection index error calculation unit
46 Failure detection index prediction model learning unit

Claims

1. A speech synthesis device that generates a speech waveform in a learning phase, comprising:
a waveform generation unit that obtains a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a failure flag conversion unit that obtains a failure flag, indicating whether the speech waveform at each time has failed, based on the speech waveform, the predicted value of the speech waveform, and a threshold for the failure flag;
a failure detection unit that obtains a predicted value of the failure flag based on a state sequence of the recurrent neural network and a failure detection model;
a failure flag error calculation unit that calculates an error between the failure flag and the predicted value of the failure flag; and
a failure detection model learning unit that obtains a learned failure detection model based on the error and the failure detection model.

2. The speech synthesis device according to claim 1, further comprising
a statistics calculation unit that obtains statistics of the state of the recurrent neural network based on the state of the recurrent neural network,
wherein the failure detection unit obtains the predicted value of the failure flag based on the failure detection model and on the statistics of the state of the recurrent neural network, which are used in place of the state sequence of the recurrent neural network.

3. A speech synthesis device that generates a speech waveform in a learning phase, comprising:
a waveform generation unit that obtains a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a failure detection index conversion unit that obtains a failure detection index based on a difference between the speech waveform and the predicted value of the speech waveform;
a failure detection index prediction unit that obtains a predicted value of the failure detection index based on a state sequence of the recurrent neural network and a failure detection index prediction model;
a failure detection index error calculation unit that calculates an error between the failure detection index and the predicted value of the failure detection index; and
a failure detection index prediction model learning unit that obtains a learned failure detection model based on the error and the failure detection index prediction model.

4. A speech synthesis device that generates a speech waveform in an inference phase, comprising:
a waveform generation unit that obtains a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a failure detection unit that obtains a predicted value of a failure flag, indicating whether the speech waveform at each time has failed, based on the state of the recurrent neural network and a learned failure detection model; and
a state initialization unit that, when the predicted value of the failure flag indicates a failure, initializes the state of the recurrent neural network based on an initial value of the state of the recurrent neural network.

5. The speech synthesis device according to claim 4, further comprising:
a speech waveform buffering unit that accumulates the predicted value of the speech waveform;
a speech waveform averaging unit that obtains an averaged speech waveform based on the immediately preceding predicted values of the speech waveform accumulated in the buffering unit; and
a speech waveform selection unit that outputs the predicted value of the speech waveform to be used for predicting the speech waveform at a next time, based on the predicted value of the failure flag, the predicted value of the speech waveform, and the averaged speech waveform.

6. The speech synthesis device according to claim 4, further comprising
a statistics calculation unit that obtains statistics of the state of the recurrent neural network based on the state of the recurrent neural network,
wherein the failure detection unit obtains the predicted value of the failure flag based on the learned failure detection model and on the statistics of the state of the recurrent neural network, which are used in place of the state sequence of the recurrent neural network.

7. A speech synthesis device that generates a speech waveform in an inference phase, comprising:
a waveform generation unit that obtains a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a failure detection index prediction unit that obtains a predicted value of a failure detection index based on the state of the recurrent neural network and a learned failure detection index prediction model; and
a state initialization unit that, when the predicted value of the failure detection index is greater than a threshold, initializes the state of the recurrent neural network based on an initial value of the state of the recurrent neural network.

8. A speech synthesis method executed by a speech synthesis device that generates a speech waveform in a learning phase, wherein the speech synthesis device executes:
a waveform generation process of obtaining a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a conversion process of obtaining a failure flag, indicating whether the speech waveform at each time has failed, based on the speech waveform, the predicted value of the speech waveform, and a threshold for the failure flag;
a failure detection process of obtaining a predicted value of the failure flag based on a state sequence of the recurrent neural network and a failure detection model;
a failure flag error calculation process of calculating an error between the failure flag and the predicted value of the failure flag; and
a failure detection model learning process of obtaining a learned failure detection model based on the error and the failure detection model.

9. A speech synthesis method executed by a speech synthesis device that generates a speech waveform in an inference phase, wherein the speech synthesis device executes:
a waveform generation process of obtaining a predicted value of the speech waveform at a next time based on a combined speech waveform and acoustic feature and on a state of a recurrent neural network;
a failure detection process of obtaining a predicted value of a failure flag, indicating whether the speech waveform at each time has failed, based on the state of the recurrent neural network and a learned failure detection model; and
a state initialization process of initializing, when the predicted value of the failure flag indicates a failure, the state of the recurrent neural network based on an initial value of the state of the recurrent neural network.

10. A program for causing a computer to execute the method according to claim 8 or 9.