JP4816144B2 - Speech synthesis apparatus, speech synthesis method, and program

Info

Publication number: JP4816144B2
Application number: JP2006056732A
Authority: JP
Inventors: 勝彦佐藤
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2006-03-02
Filing date: 2006-03-02
Publication date: 2011-11-16
Anticipated expiration: 2026-03-02
Also published as: JP2007233181A

Description

The present invention relates to a speech synthesizer, a speech synthesis method, and a program for synthesizing speech from a given text string.

Hidden Markov models (HMMs) are used in various forms in techniques for synthesizing speech from text strings.

For example, the technique of Patent Document 1 extracts LSP (Line Spectrum Pair) coefficients from speech data and models each phoneme with an HMM. The HMM corresponding to a given character string is then selected and driven to output LSP coefficients, and speech is synthesized from the output LSP coefficients.

The speech synthesizer of Patent Document 1 drives the HMM at a constant output frame period to output LSP coefficients. In that case, synthesizing smooth speech requires shortening the output frame period at which LSP coefficients are output from the HMM. Doing so increases the processing load on the speech synthesizer and lowers its processing speed.

Patent Document 1: JP 2002-62890 A

The present invention has been made in view of the above problem, and an object thereof is to provide a speech synthesizer, a speech synthesis method, and a program that synthesize high-quality speech.

In order to achieve the above object, a speech synthesizer according to a first aspect of the present invention comprises:
storage means for storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
phoneme HMM data conversion means for generating phoneme labels from given text data and, by referring to the information stored in the storage means, converting the generated phoneme labels into the corresponding phoneme HMM data;
period setting means for setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted by the phoneme HMM data conversion means; and
LSP coefficient output means for outputting LSP coefficients, at the period set by the period setting means, from the phoneme HMM data converted by the phoneme HMM data conversion means,
wherein the period setting means
determines the magnitude of the variance of the LSP coefficients for each state position of the phoneme HMM data,
sets the period for outputting LSP coefficients to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
sets the period for outputting LSP coefficients to a second period shorter than the first period when the variance is greater than the first threshold, and
sets the period for outputting LSP coefficients to a third period longer than the first period when the variance is smaller than the second threshold.

A speech synthesis method according to a second aspect of the present invention comprises:
a storage step of storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
a phoneme HMM data conversion step of generating phoneme labels from given text data and, by referring to the information stored in the storage step, converting the generated phoneme labels into the corresponding phoneme HMM data;
a period setting step of setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted in the phoneme HMM data conversion step; and
an LSP coefficient output step of outputting LSP coefficients, at the period set in the period setting step, from the phoneme HMM data converted in the phoneme HMM data conversion step,
wherein, in the period setting step,
the magnitude of the variance of the LSP coefficients is determined for each state position of the phoneme HMM data,
the period for outputting LSP coefficients is set to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
the period for outputting LSP coefficients is set to a second period shorter than the first period when the variance is greater than the first threshold, and
the period for outputting LSP coefficients is set to a third period longer than the first period when the variance is smaller than the second threshold.

A computer program according to a third aspect of the present invention causes a computer to function as:
storage means for storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
phoneme HMM data conversion means for generating phoneme labels from given text data and, by referring to the information stored in the storage means, converting the generated phoneme labels into the corresponding phoneme HMM data;
period setting means for setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted by the phoneme HMM data conversion means; and
LSP coefficient output means for outputting LSP coefficients, at the period set by the period setting means, from the phoneme HMM data converted by the phoneme HMM data conversion means,
wherein the period setting means
determines the magnitude of the variance of the LSP coefficients for each state position of the phoneme HMM data,
sets the period for outputting LSP coefficients to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
sets the period for outputting LSP coefficients to a second period shorter than the first period when the variance is greater than the first threshold, and
sets the period for outputting LSP coefficients to a third period longer than the first period when the variance is smaller than the second threshold.

According to the present invention, appropriately setting the period at which LSP coefficients are output for each state position of the phoneme HMM data makes it possible to synthesize high-quality speech while maintaining processing speed.

A speech synthesizer 100 according to an embodiment of the present invention will be described with reference to the drawings.
The speech synthesizer 100 is a device that, when given an arbitrary text string, synthesizes and outputs speech for that text string.

As shown in FIG. 1, the speech synthesizer 100 includes an input conversion unit 10, a speech synthesis dictionary 20, a phoneme HMM sequence conversion unit 30, a parameter generation unit 40, an excitation sound source generation unit 50, an LSP coefficient interpolation unit 60, and an LSP synthesis filter 70.

The input conversion unit 10 receives text string data input by the user. It then converts the input text string data into phoneme label sequence data, which is a sequence of labels in phoneme units.

The speech synthesis dictionary 20 is used when converting phoneme label sequence data into phoneme HMM sequence data. The speech synthesis dictionary 20 stores phoneme HMM data for LSP coefficients and phoneme HMM data for pitch. Each item of phoneme HMM data is created by training on LSP coefficients extracted from a large amount of speech data and on the phoneme label sequence data corresponding to that speech data.

As shown in FIG. 5(a), the phoneme HMM data for LSP coefficients and the phoneme HMM data for pitch each have five states. LSP coefficients and pitch data are output at each of the state positions S1 to S3. S0 is the initial state and S4 is the end state, and neither outputs LSP coefficients or pitch data. A mean and a variance are held as parameters for each state.

An LSP coefficient is a feature vector that represents the characteristics of speech and is used as a parameter for synthesizing speech.

The phoneme HMM sequence conversion unit 30 receives the phoneme label sequence data from the input conversion unit 10. Referring to the speech synthesis dictionary 20, it converts the received phoneme label sequence data into phoneme HMM sequence data for LSP coefficients and phoneme HMM sequence data for pitch. Phoneme HMM sequence data is a sequence, as in FIG. 5(b), formed by concatenating the phoneme HMM data shown in FIG. 5(a).
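The five-state structure and the dictionary lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the names `HMMState`, `make_phoneme_hmm`, and `to_hmm_sequence` are hypothetical, the variance is simplified to one scalar per state, and the mean values are placeholders rather than trained data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class HMMState:
    """One state position of a phoneme HMM (hypothetical layout)."""
    name: str                               # "S0" .. "S4"
    mean: Optional[Sequence[float]] = None  # None for the non-emitting S0 and S4
    variance: Optional[float] = None        # scalar variance is an assumption

    @property
    def emits(self) -> bool:
        return self.mean is not None


def make_phoneme_hmm(means, variances) -> List[HMMState]:
    """Build the five-state model: S0 (initial), S1-S3 (emitting), S4 (end)."""
    if len(means) != 3 or len(variances) != 3:
        raise ValueError("exactly three emitting states (S1-S3) are expected")
    inner = [HMMState(f"S{i + 1}", tuple(m), v)
             for i, (m, v) in enumerate(zip(means, variances))]
    return [HMMState("S0"), *inner, HMMState("S4")]


def to_hmm_sequence(labels: Sequence[str],
                    dictionary: Dict[str, List[HMMState]]) -> List[List[HMMState]]:
    """Look up each phoneme label and join the models in label order."""
    sequence = []
    for label in labels:
        if label not in dictionary:
            raise KeyError(f"no phoneme HMM stored for label {label!r}")
        sequence.append(dictionary[label])
    return sequence


# Placeholder dictionary with two phonemes and 2-dimensional means.
dictionary = {
    "a": make_phoneme_hmm([[0.1, 0.2]] * 3, [0.01, 0.02, 0.03]),
    "k": make_phoneme_hmm([[0.3, 0.4]] * 3, [0.2, 0.2, 0.2]),
}
hmm_sequence = to_hmm_sequence(["k", "a"], dictionary)
```

Keeping S0 and S4 in the list as non-emitting entries mirrors FIG. 5(a) as described in the text; an implementation could equally drop them and store only S1 to S3.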

The parameter generation unit 40 receives the phoneme HMM sequence data for LSP coefficients from the phoneme HMM sequence conversion unit 30 and generates LSP coefficient series data as parameters for synthesizing speech. As the line graph in the lower part of FIG. 6 shows, LSP coefficient series data consists of LSP coefficients that change over time (white circles), arranged at a predetermined period and joined together. To simplify the figure, each LSP coefficient is assumed here to be a five-dimensional feature vector.

The parameter generation unit 40 also receives the phoneme HMM sequence data for pitch and generates pitch sequence data, as shown in FIG. 7, as a parameter for synthesizing speech.

The parameter generation unit 40 generates the parameters so that the likelihood with respect to each item of phoneme HMM data in the phoneme HMM sequence data is maximized.
The parameter that maximizes the likelihood for each item of phoneme HMM data is obtained by solving the following equation.

Here, P is the probability that the parameter O produced at state position Q is observed from the phoneme HMM data λ (the likelihood of λ at Q with respect to O), and C is the parameter that maximizes P.
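The equation itself does not appear in this text. Judging only from the symbol definitions given here (P, O, Q, λ, and C), it presumably takes the following form; this is a reconstruction from those definitions, not the published formula.

```latex
C = \operatorname*{arg\,max}_{O} \; P(O \mid Q, \lambda)
```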

Generating the parameters that maximize the likelihood reduces the variation in the LSP coefficient series data and the pitch sequence data, suppresses discontinuous changes, and makes it possible to synthesize speech that is closer to actual utterances.

The period (frame period) at which the parameter generation unit 40 outputs LSP coefficients is set for each state position of the phoneme HMM data.
Let FPRD be the frame period at which the LSP synthesis filter 70, described later, synthesizes speech. The parameter generation unit 40 normally outputs LSP coefficients at a predetermined frame period PRD that is longer than FPRD.

The parameter generation unit 40 switches the frame period according to the magnitude of the variance at each state position. When the variance at a state position is smaller than a first predetermined value, the frame period is switched to twice the length of PRD. When the variance at a state position is larger than a second predetermined value, the frame period is switched to FPRD, the shortest period that can be set.

At a state position with a small variance, the output varies little even when parameters are output at a longer period than usual, so the data is unlikely to become discontinuous. Setting the frame period to twice the normal period therefore improves processing speed.
At a state position with a large variance, the data becomes discontinuous unless parameters are output at a short period. The frame period is therefore set to FPRD to suppress discontinuities.

The excitation sound source generation unit 50 receives time-series pitch sequence data, as in FIG. 7, from the parameter generation unit 40 and generates excitation sound source data from it.

The LSP coefficient interpolation unit 60 receives the LSP coefficient series data from the parameter generation unit 40. It fills in the gaps between the received coefficients, as indicated by the black circles in FIG. 6, to generate LSP coefficient series data at the frame period FPRD. The interpolation is linear and uses adjacent LSP coefficients in the LSP coefficient series data.
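A minimal sketch of this interpolation step, assuming each LSP coefficient arrives as a `(time, vector)` pair whose times are integer multiples of FPRD. The function name `interpolate_lsp` and the data layout are illustrative choices, not taken from the patent.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[int, Sequence[float]]  # (output time, LSP coefficient vector)


def interpolate_lsp(frames: Sequence[Frame], fprd: int) -> List[Tuple[int, List[float]]]:
    """Linearly interpolate between adjacent frames so that one frame exists every fprd."""
    if fprd <= 0:
        raise ValueError("fprd must be positive")
    if not frames:
        return []
    out: List[Tuple[int, List[float]]] = []
    for (t0, v0), (t1, v1) in zip(frames, frames[1:]):
        gap = t1 - t0
        if gap <= 0 or gap % fprd != 0:
            raise ValueError("frame times must increase in multiples of fprd")
        steps = gap // fprd
        for k in range(steps):
            w = k / steps  # 0 at the left frame, approaching 1 at the right frame
            out.append((t0 + k * fprd,
                        [(1 - w) * a + w * b for a, b in zip(v0, v1)]))
    # The last original frame is copied through unchanged.
    last_time, last_vector = frames[-1]
    out.append((last_time, list(last_vector)))
    return out


# Sparse frames: a gap of 4 (for example 2 x PRD) followed by a gap of 2 (PRD).
sparse = [(0, [0.0, 1.0]), (4, [4.0, 5.0]), (6, [6.0, 5.0])]
dense = interpolate_lsp(sparse, fprd=1)
```

Because the gap between neighbouring frames may differ from state to state, the number of interpolated points is computed per pair rather than fixed in advance.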

The LSP synthesis filter 70 receives the excitation sound source data from the excitation sound source generation unit 50 and the LSP coefficient series data at the frame period FPRD from the LSP coefficient interpolation unit 60. It combines them to generate synthesized speech and outputs the result.

Next, the speech synthesis processing performed by the speech synthesizer 100 configured as above will be described with reference to FIG. 2.

First, the input conversion unit 10 of the speech synthesizer 100 accepts text string data input by the user (step S11).

On accepting the text string data, the input conversion unit 10 converts it into phoneme label sequence data (step S12) and passes the converted phoneme label sequence data to the phoneme HMM sequence conversion unit 30.

Next, the phoneme HMM sequence conversion unit 30 receives the phoneme label sequence data converted in step S12 and, referring to the speech synthesis dictionary 20, converts it into phoneme HMM sequence data for LSP coefficients and phoneme HMM sequence data for pitch (step S13). It then passes both sets of phoneme HMM sequence data to the parameter generation unit 40.

On receiving the phoneme HMM sequence data for LSP coefficients and the phoneme HMM sequence data for pitch, the parameter generation unit 40 executes the parameter generation process shown in FIG. 3 (step S14).

In the parameter generation process (step S14), the parameter generation unit 40 generates pitch sequence data, as shown in FIG. 7, from the received phoneme HMM sequence data for pitch (step S21).

At the same time, it executes the LSP coefficient series data generation process (step S22) to generate LSP coefficient series data from the received phoneme HMM sequence data for LSP coefficients.

FIG. 4 shows the operation of the LSP coefficient series data generation process (step S22).

In the LSP coefficient series data generation process (step S22), the parameter generation unit 40 outputs LSP coefficients for all state positions S1 to S3 of each item of phoneme HMM data λi in the phoneme HMM sequence data {λi | 1 ≤ i ≤ N} for LSP coefficients, as shown in the upper part of FIG. 6. It thereby generates LSP coefficient series data (white circles), as shown in the lower part of FIG. 6.

First, the frame period at which LSP coefficients are output is determined. To do so, the parameter generation unit 40 determines whether the variance at a given state position is smaller than a predetermined threshold V1 (step S31).
The threshold V1 is set to a value below which the variance yields stable output parameters.

If the variance is determined to be smaller than the threshold V1 (step S31; Yes), the frame period at that state position is set to twice the normal output period PRD (step S32).

If the variance is determined not to be smaller than the threshold V1 (step S31; No), the parameter generation unit 40 determines whether the variance is larger than a predetermined threshold V2 (step S33).
The threshold V2 is set to a value at or above which the variance yields output parameters that vary widely.

If the variance is determined to be larger than the threshold V2 (step S33; Yes), the frame period at that state position is set to FPRD, the shortest period that can be set (step S34).

If the variance is determined not to be larger than the threshold V2 (step S33; No), the frame period is not changed and remains the normal output period PRD.

Once the frame period has been determined in steps S31 to S34, the LSP coefficients that maximize the likelihood with respect to the phoneme HMM data are output at that frame period (step S35).

Repeating steps S31 to S35 for all state positions S1 to S3 of each item of phoneme HMM data λi in the phoneme HMM sequence data {λi | 1 ≤ i ≤ N} yields LSP coefficient series data in which the LSP coefficients are output at a frame period suited to each state position.
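The decision in steps S31 to S34 can be sketched as a small function. The names `select_frame_period` and `periods_for_states` and the numeric values are illustrative assumptions, and the check that V1 does not exceed V2 is inferred from the flow rather than stated in the embodiment. Note that the embodiment's V1 is the lower threshold, whereas the claims call the upper threshold the first threshold.

```python
from typing import List, Sequence


def select_frame_period(variance: float, v1: float, v2: float,
                        prd: int, fprd: int) -> int:
    """Return the output frame period for one state position."""
    if v1 > v2:
        raise ValueError("expected V1 <= V2 (assumption inferred from the flow)")
    if variance < v1:   # step S31: Yes -> step S32
        return 2 * prd
    if variance > v2:   # step S33: Yes -> step S34
        return fprd
    return prd          # step S33: No -> keep the normal period


def periods_for_states(variances: Sequence[float], v1: float, v2: float,
                       prd: int, fprd: int) -> List[int]:
    """Apply the decision to every emitting state position in order."""
    return [select_frame_period(v, v1, v2, prd, fprd) for v in variances]


# Placeholder values: PRD is four times FPRD, thresholds chosen arbitrarily.
periods = periods_for_states([0.001, 0.05, 0.5], v1=0.01, v2=0.1, prd=4, fprd=1)
```

With these placeholder values the three states receive the doubled period, the normal period, and FPRD respectively. A variance exactly equal to V1 or V2 falls through to the normal period PRD, matching the strict comparisons in steps S31 and S33.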

In the example of FIG. 6, the LSP coefficients at state positions S1 and S3 of the phoneme HMM data λi are output at the normal frame period PRD, so the frame periods PRD_{λi,S1} and PRD_{λi,S3} equal PRD. At state position S2 the variance is sufficiently small, so the frame period PRD_{λi,S2} is twice PRD. The frame period between state positions is set to FPRD.

When the LSP coefficient series data generation process (step S22) and the parameter generation process (step S14) have finished, processing returns to the speech synthesis processing shown in FIG. 2. The parameter generation unit 40 passes the generated pitch sequence data to the excitation sound source generation unit 50 and the generated LSP coefficient series data to the LSP coefficient interpolation unit 60.

On receiving the pitch sequence data, the excitation sound source generation unit 50 generates excitation sound source data from it (step S15) and passes the generated excitation sound source data to the LSP synthesis filter 70.

On receiving the LSP coefficient series data, the LSP coefficient interpolation unit 60 linearly interpolates it using adjacent LSP coefficients to generate LSP coefficient series data at the frame period FPRD (step S16; black circles in FIG. 6). It then passes the generated LSP coefficient series data at the frame period FPRD to the LSP synthesis filter 70.

On receiving the excitation sound source data and the LSP coefficient series data at the frame period FPRD, the LSP synthesis filter 70 synthesizes speech from the two (step S17).

The LSP synthesis filter 70 then outputs the synthesized speech (step S18).

As described above, the speech synthesizer 100 can synthesize and output high-quality speech from given text string data while preventing a drop in processing speed.

The present invention is not limited to the above embodiment, and various modifications and applications are possible.

In the above embodiment, the LSP coefficient series data is interpolated by linear interpolation, but the invention is not limited to this. For example, the interpolation method may be switched for each state position in order to synthesize higher-quality speech.

For example, the speech synthesizer may further include an LSP coefficient correction unit 80 that checks the stability of the LSP coefficient series data and, when the data is found to be unstable, corrects it to normal data, further improving the quality of the synthesized speech. The correction by the LSP coefficient correction unit 80 may be applied to the LSP coefficient series data after interpolation by the LSP coefficient interpolation unit 60, as shown in FIG. 8, or to the LSP coefficient series data before interpolation, as shown in FIG. 9.
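The text does not say how stability is judged or how the correction is done. One common criterion, used here purely as an assumption, is that the LSP coefficients of a frame, expressed as angular frequencies, must be strictly increasing and lie strictly between 0 and π. The sketch below checks that condition and repairs a frame by sorting and enforcing a minimum spacing; `is_stable`, `correct_lsp`, and `min_gap` are hypothetical names.

```python
import math
from typing import List, Sequence


def is_stable(lsp: Sequence[float]) -> bool:
    """True if 0 < w1 < w2 < ... < wp < pi (assumed stability criterion)."""
    previous = 0.0
    for w in lsp:
        if not (previous < w < math.pi):
            return False
        previous = w
    return True


def correct_lsp(lsp: Sequence[float], min_gap: float = 0.01) -> List[float]:
    """Sort the coefficients and enforce a minimum spacing inside (0, pi)."""
    p = len(lsp)
    if min_gap <= 0 or min_gap * (p + 1) >= math.pi:
        raise ValueError("min_gap must be positive and small enough to fit p values")
    fixed = sorted(lsp)
    # Forward pass: push each value at least min_gap above its predecessor (or 0).
    lower = 0.0
    for i in range(p):
        fixed[i] = max(fixed[i], lower + min_gap)
        lower = fixed[i]
    # Backward pass: pull each value at least min_gap below its successor (or pi).
    upper = math.pi
    for i in reversed(range(p)):
        fixed[i] = min(fixed[i], upper - min_gap)
        upper = fixed[i]
    return fixed


unstable = [0.5, 0.4, 1.0, 3.5, 2.0]  # out of order and one value above pi
repaired = correct_lsp(unstable)
```

A frame that already satisfies the criterion passes through unchanged, so the same routine could be applied either before or after interpolation, as in the two variants of FIG. 8 and FIG. 9.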

The speech synthesizer 100 is also not limited to being a dedicated device as in the above embodiment.
For example, a program may cause a computer to function as the speech synthesizer 100, or a program may be loaded into a DSP (Digital Signal Processor) or the like to perform the operations of the speech synthesizer 100.

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the speech synthesis processing.
FIG. 3 is a flowchart showing the operation of the parameter generation process of FIG. 2.
FIG. 4 is a flowchart showing the operation of the LSP coefficient series data generation process of FIG. 3.
FIG. 5(a) is a diagram showing an example of phoneme HMM data, and FIG. 5(b) is a diagram showing an example of phoneme HMM sequence data.
FIG. 6 is a diagram showing an example of LSP coefficient series data.
FIG. 7 is a diagram showing an example of pitch sequence data.
FIG. 8 is a block diagram showing the configuration of a first modification of the speech synthesizer.
FIG. 9 is a block diagram showing the configuration of a second modification of the speech synthesizer.

Explanation of symbols

10: input conversion unit, 20: speech synthesis dictionary, 30: phoneme HMM sequence conversion unit, 40: parameter generation unit, 50: excitation sound source generation unit, 60: LSP coefficient interpolation unit, 70: LSP synthesis filter, 80: LSP coefficient correction unit, 100: speech synthesizer

Claims

1. A speech synthesizer comprising:
storage means for storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
phoneme HMM data conversion means for generating phoneme labels from given text data and, by referring to the information stored in the storage means, converting the generated phoneme labels into the corresponding phoneme HMM data;
period setting means for setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted by the phoneme HMM data conversion means; and
LSP coefficient output means for outputting LSP coefficients, at the period set by the period setting means, from the phoneme HMM data converted by the phoneme HMM data conversion means,
wherein the period setting means
determines the magnitude of the variance of the LSP coefficients for each state position of the phoneme HMM data,
sets the period for outputting LSP coefficients to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
sets the period for outputting LSP coefficients to a second period shorter than the first period when the variance is greater than the first threshold, and
sets the period for outputting LSP coefficients to a third period longer than the first period when the variance is smaller than the second threshold.

2. The speech synthesizer according to claim 1, further comprising means for interpolating between the LSP coefficients that the LSP coefficient output means outputs at the period set by the period setting means, using LSP coefficients adjacent in the time series.

3. The speech synthesizer according to claim 1 or 2, further comprising means for correcting the LSP coefficients output by the LSP coefficient output means so as to increase their stability when the stability of those LSP coefficients is determined to be low.

4. A speech synthesis method comprising:
a storage step of storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
a phoneme HMM data conversion step of generating phoneme labels from given text data and, by referring to the information stored in the storage step, converting the generated phoneme labels into the corresponding phoneme HMM data;
a period setting step of setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted in the phoneme HMM data conversion step; and
an LSP coefficient output step of outputting LSP coefficients, at the period set in the period setting step, from the phoneme HMM data converted in the phoneme HMM data conversion step,
wherein, in the period setting step,
the magnitude of the variance of the LSP coefficients is determined for each state position of the phoneme HMM data,
the period for outputting LSP coefficients is set to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
the period for outputting LSP coefficients is set to a second period shorter than the first period when the variance is greater than the first threshold, and
the period for outputting LSP coefficients is set to a third period longer than the first period when the variance is smaller than the second threshold.

5. A computer program causing a computer to function as:
storage means for storing phoneme HMM (Hidden Markov Model) data for generating LSP (Line Spectrum Pair) coefficients, which are parameters for synthesizing speech, in association with phoneme labels;
phoneme HMM data conversion means for generating phoneme labels from given text data and, by referring to the information stored in the storage means, converting the generated phoneme labels into the corresponding phoneme HMM data;
period setting means for setting, for each state position, the period at which LSP coefficients are output from the phoneme HMM data converted by the phoneme HMM data conversion means; and
LSP coefficient output means for outputting LSP coefficients, at the period set by the period setting means, from the phoneme HMM data converted by the phoneme HMM data conversion means,
wherein the period setting means
determines the magnitude of the variance of the LSP coefficients for each state position of the phoneme HMM data,
sets the period for outputting LSP coefficients to a first period when the variance is less than or equal to a first threshold and greater than or equal to a second threshold that is smaller than the first threshold,
sets the period for outputting LSP coefficients to a second period shorter than the first period when the variance is greater than the first threshold, and
sets the period for outputting LSP coefficients to a third period longer than the first period when the variance is smaller than the second threshold.