JP4775236B2

JP4775236B2 - Speech synthesizer

Info

Publication number: JP4775236B2
Application number: JP2006315275A
Authority: JP
Inventors: 勉兼安
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2006-11-22
Filing date: 2006-11-22
Publication date: 2011-09-21
Anticipated expiration: 2026-11-22
Also published as: JP2008129382A

Description

The present invention relates to a speech synthesizer that outputs, together with synthesized speech, information representing the quality of that synthesized speech.

Conventionally, there is a technique (Patent Document 1) whose stated object is “to provide a new Japanese text-to-speech synthesis method that uses not only phoneme units but also diphone units whose boundaries lie at the centers of phonemes, and to provide a Japanese text-to-speech synthesis method that can synthesize more natural speech and make more effective use of the corpus than conventional methods using only phoneme units.” As its means, that technique states that “in a Japanese text-to-speech synthesis method, for waveform concatenation in vowel-to-vowel sequences, speech synthesis units are selected by considering both concatenation at the vowel boundaries and concatenation at the vowel centers.”
In the above technique, a measure (cost) matched to perceptual characteristics is used when selecting phoneme units. Techniques that evaluate the quality of synthesized speech with a predetermined cost function or the like are thus publicly known.
Patent Document 1: JP 2003-208188 A (abstract, FIG. 2)

In general, users want higher-quality synthesized speech. However, there are many different measures of synthesized speech quality, and they are hard for an ordinary user to interpret. The quality of synthesized speech could be computed with a cost function such as that of the above prior art and presented to the user, but the user would then have to work out what level of quality the computed value represents, so it remains hard to judge whether the quality is good or bad.
There has therefore been a demand for a speech synthesizer that lets even ordinary users judge the quality of synthesized speech more intuitively.

A speech synthesizer according to the present invention comprises:
an input unit for inputting input text to be converted into speech;
a speech synthesis unit that performs speech synthesis according to the content of the input text and calculates, using a predetermined arithmetic expression, the quality of the speech synthesized for the entire input text;
a response database storing response messages associated in advance with each quality level of synthesized speech; and
a voice response unit that reads a response message from the response database and outputs it as speech,
wherein, when the speech synthesis unit completes speech synthesis, the voice response unit reads the response message corresponding to the quality of that speech from the response database and outputs it as speech.

According to the speech synthesizer of the present invention, the quality of the synthesized speech is output as a spoken response message, so the user can learn the quality of the synthesized speech by ear and judge more intuitively whether it is good or bad.

Embodiment 1.
FIG. 1 is a functional block diagram of the speech synthesizer 100 according to Embodiment 1 of the present invention.
The speech synthesizer 100 includes a speech synthesis unit 110, a recommendation level selection unit 120, a voice response unit 130, a speech DB 140, and a response DB 150.
The speech synthesis unit 110 receives input text to be read aloud as synthesized speech and outputs speech synthesized using the data stored in the speech DB 140. A corpus-based method is used for the synthesis. The speech synthesis unit 110 also calculates the quality of the synthesized speech by the method described later and outputs it to the recommendation level selection unit 120.
The recommendation level selection unit 120 receives the information representing the quality of the synthesized speech from the speech synthesis unit 110, accesses the response DB 150, and reads out the corresponding response message.
The voice response unit 130 outputs the response message read out by the recommendation level selection unit 120 as speech, thereby notifying the user of the quality of the synthesized speech by voice.
The speech DB 140 stores the data that the speech synthesis unit 110 needs to perform speech synthesis.
The response DB 150 is described later with reference to FIG. 2.

The speech synthesizer 100 also includes, as needed, an input unit such as a network interface for receiving the input text.
The synthesized speech output by the speech synthesis unit 110 is output in a form such as a waveform signal or its sampled data.

Note that the “speech synthesis unit” in Embodiment 1 is constituted by the speech synthesis unit 110 together with the recommendation level selection unit 120.

The quality of the synthesized speech calculated by the speech synthesis unit 110 is a value obtained by evaluating sound quality with a cost function that relates physical quantities of speech to perception; this value arises in the course of generating synthesized speech by corpus-based speech synthesis. The cost function is assumed to be predetermined.
Alternatively, the quality of the synthesized speech can be calculated from criteria such as the following.
(1) The amount of processing time required to generate the synthesized speech.
(2) The input text is divided into morae and analyzed; if it contains morae that degrade sound quality, the sound quality is rated low.
(3) A feature related to the speaking rate of the synthesized speech, during or after generation, for example the formant transition speed.
(4) The difference between the mel-cepstrum of the generated synthesized speech and the mel-cepstrum obtained by prosody estimation. Besides the mel-cepstrum, phoneme duration, pitch, LPC coefficients, and the like may be used.
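As an illustration of criterion (4), the following minimal sketch computes a difference value between two mel-cepstrum sequences. The description does not specify how the difference is measured, so the frame-wise Euclidean distance averaged over frames, the function name `mean_cepstral_distance`, and the assumption that both sequences are already time-aligned are all hypothetical choices made only for this example.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one frame of mel-cepstral coefficients


def mean_cepstral_distance(synthesized: Sequence[Frame],
                           estimated: Sequence[Frame]) -> float:
    """Average Euclidean distance between paired frames.

    The metric is an assumption; the source only says "difference value".
    Both sequences are assumed to be time-aligned and of equal length.
    """
    if len(synthesized) != len(estimated) or not synthesized:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for syn, est in zip(synthesized, estimated):
        if len(syn) != len(est):
            raise ValueError("frames must have the same number of coefficients")
        total += math.sqrt(sum((s - e) ** 2 for s, e in zip(syn, est)))
    return total / len(synthesized)
```

Under this sketch a larger value means the generated speech deviates more from the estimate, so it could be used in the same sense as a cost value, where smaller is better.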

FIG. 2 shows the structure and example data of the response message table 151 (not shown) stored in the response DB 150.
The response message table 151 has a “synthesized speech quality” column and a “response message” column.
The “synthesized speech quality” column stores thresholds for the value representing the quality of the synthesized speech calculated by the speech synthesis unit 110. In the example data of FIG. 2, the stored values correspond to the “cost value” that the speech synthesis unit 110 calculates with the cost function described above, and a smaller value is taken to mean better-quality synthesized speech.
The “response message” column stores the response message corresponding to the quality (= cost value) of the synthesized speech generated by the speech synthesis unit 110. In the example data of FIG. 2, if the cost value is 0.10, for instance, the response message is “That's a voice I'd recommend.”
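The threshold lookup described for the response message table 151 can be sketched as follows. The contents of FIG. 2 are not reproduced in the text, so every threshold and every message other than the pairing of cost value 0.10 with a "recommended" message is invented for illustration, as is the assumption that each threshold is an inclusive upper bound on the cost value.

```python
import math
from bisect import bisect_left

# Hypothetical rows: (upper cost threshold, response message).
# Only the 0.10 -> "recommended" pairing comes from the description;
# the thresholds and the other two messages are made up for this sketch.
RESPONSE_TABLE = [
    (0.20, "That's a voice I'd recommend."),
    (0.50, "That voice is so-so."),
    (math.inf, "I can't recommend that voice."),
]
_THRESHOLDS = [threshold for threshold, _ in RESPONSE_TABLE]


def select_response(cost: float) -> str:
    """Return the message of the first row whose threshold is >= cost."""
    if math.isnan(cost):
        raise ValueError("cost must be a number")
    # bisect_left finds the first threshold that is not smaller than cost.
    return RESPONSE_TABLE[bisect_left(_THRESHOLDS, cost)][1]
```

Keeping the rows sorted by threshold and ending with an infinite bound means every cost value maps to exactly one message, which is one simple way to realize a lookup keyed on the cost value.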

Next, the detailed operation of the speech synthesizer 100 is described step by step.
(1) Input of the input text
The speech synthesis unit 110 accepts the input text to be read aloud. An input interface is provided on the speech synthesizer 100 as needed for this purpose. Concrete examples include a network interface such as a LAN interface, and direct input through an operation panel on the exterior of the speech synthesizer 100.

(2) Execution and output of speech synthesis
The speech synthesis unit 110 generates synthesized speech that reads the input text aloud, using the data needed for corpus-based speech synthesis that is stored in the speech DB 140, such as a prosody model database, an acoustic model database, and speech files.
The synthesized speech may be output as sampled speech waveform data, or directly as sound through an audio output device such as a speaker. Alternatively, the electrical signal corresponding to the speech waveform may itself be output.

(3) Calculation of the cost value
When generating the synthesized speech, the speech synthesis unit 110 calculates its quality with the cost function described above and outputs the result to the recommendation level selection unit 120. The output may take place after generation of the synthesized speech has completely finished, or sequentially during synthesis. In the latter case, the recommendation level selection unit 120 may, for example, compute the total of the cost values.
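The sequential variant of step (3), in which partial cost values arrive during synthesis and are totalled on the receiving side, might look like the sketch below. The class name `CostAccumulator` and the idea that each partial value belongs to one synthesized segment are assumptions for illustration; the description only says that the total may be computed.

```python
from typing import Iterable


class CostAccumulator:
    """Hypothetical helper on the recommendation level selection side."""

    def __init__(self) -> None:
        self.total = 0.0   # running sum of the partial cost values
        self.count = 0     # number of partial values received so far

    def add(self, partial_cost: float) -> None:
        """Record one partial cost value emitted during synthesis."""
        self.total += partial_cost
        self.count += 1


def total_cost(partial_costs: Iterable[float]) -> float:
    """Sum the partial cost values in the order they are produced."""
    accumulator = CostAccumulator()
    for cost in partial_costs:
        accumulator.add(cost)
    return accumulator.total
```

Because `total_cost` accepts any iterable, it works equally well with a generator that yields values while synthesis is still running and with a list produced after synthesis has finished.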

(4) Selection of the response message
The recommendation level selection unit 120 searches the response message table 151 using the cost value received from the speech synthesis unit 110 as the key. It then reads the “response message” column of the matching row and outputs it to the voice response unit 130.
The data stored in the “response message” column may be only the text of the message, or may be the audio file itself in which the message is read aloud.

(5) Output of the response message
The voice response unit 130 outputs the content of the response message received from the recommendation level selection unit 120 as sound through a speaker or the like.
If the data stored in the “response message” column is only the text of the message, the voice response unit 130 generates and outputs synthesized speech that reads that text aloud. If it is the audio file itself, the voice response unit 130 plays back that file.
The response message may be output as sound either before the speech synthesis unit 110 outputs the synthesized speech or after that output has completely finished. If the speech synthesis unit 110 outputs the synthesized speech as sampled waveform data, the response message may be output as sound at the same time as that data, because in that case the synthesized speech and the voice response do not overlap as audible output.
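The two storage options for the response message, text only or a ready-made audio file, lead to two output paths in step (5). The sketch below shows one way to dispatch between them. The `ResponseMessage` type and the injected `synthesize`, `load`, and `play` callables are hypothetical stand-ins for the device's synthesis, file access, and speaker output; none of them is named in the description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ResponseMessage:
    text: Optional[str] = None        # set when the table stores message text
    audio_path: Optional[str] = None  # set when the table stores an audio file


def output_response(msg: ResponseMessage,
                    synthesize: Callable[[str], bytes],
                    load: Callable[[str], bytes],
                    play: Callable[[bytes], None]) -> str:
    """Play the response and report which path was taken."""
    if msg.audio_path is not None:
        # Stored audio file: play it back as it is.
        play(load(msg.audio_path))
        return "played_file"
    if msg.text is not None:
        # Stored text: synthesize speech that reads it aloud, then play it.
        play(synthesize(msg.text))
        return "synthesized_text"
    raise ValueError("response message has neither text nor audio")
```

Passing the audio functions in as parameters keeps the sketch independent of any particular audio library and makes the dispatch easy to exercise with simple stand-ins.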

Although Embodiment 1 has been described with the speech synthesis unit 110 using a corpus-based method, the invention is not limited to this; speech synthesis may instead be performed by a rule-based synthesis method or a recording-and-editing method.

As described above, according to Embodiment 1, the quality of the synthesized speech is output as a spoken response message, so the user can learn the quality by ear and judge more intuitively whether it is good or bad.
In addition, receiving the response message by ear appeals more directly to the user's sensibility than a mere numerical display, which also has the effect of increasing interactivity.

Embodiment 2.
Embodiment 1 described a configuration in which the cost value is calculated by a predetermined cost function from various parameters obtained when speech synthesis is executed.
Embodiment 2 of the present invention describes a speech synthesizer that, when specific keywords are contained in the input text, corrects the cost value before selecting the response message.

FIG. 3 is a functional block diagram of the speech synthesizer 100 according to Embodiment 2.
The speech synthesizer 100 according to Embodiment 2 includes a keyword DB 160. The rest of the configuration is the same as in FIG. 1 described for Embodiment 1, so the same reference numerals are used and the description is omitted.

The keyword DB 160 holds a list of arbitrary keywords, stored for example in table form. The list is stored in the keyword DB 160 by the manufacturer of the speech synthesizer 100 at the time of manufacture, or by an administrator through configuration.

Next, the operation of the speech synthesizer 100 according to Embodiment 2 is described.
(1) Input of the input text to (2) Execution and output of speech synthesis
These steps operate as in Embodiment 1, so their description is omitted.

(3) Calculation of the cost value
When generating the synthesized speech, the speech synthesis unit 110 calculates its quality with the cost function described above and then corrects the result by referring to the keyword DB 160.
One conceivable correction method is to count how many of the keywords held in the keyword DB 160 appear in the input text and to correct the cost value downward when they appear more frequently. Alternatively, each keyword may be given a weight so that particular keywords lower the cost value more strongly.
The calculated and corrected cost value is output to the recommendation level selection unit 120.
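A minimal sketch of the keyword-based correction follows, combining the occurrence count with per-keyword weights. The description leaves the correction rule open, so the linear formula (cost minus the sum of weight times occurrences), the clamp at zero, and the function name `corrected_cost` are assumptions chosen only to make the idea concrete.

```python
from typing import Mapping


def corrected_cost(cost: float,
                   text: str,
                   keyword_weights: Mapping[str, float]) -> float:
    """Lower the cost for each keyword occurrence found in the input text.

    keyword_weights maps a keyword to the (hypothetical) amount by which
    one occurrence reduces the cost value.
    """
    reduction = sum(
        weight * text.count(keyword)  # non-overlapping substring matches
        for keyword, weight in keyword_weights.items()
        if keyword                    # skip empty keys, which match everywhere
    )
    # Clamping at zero is an assumption so the cost never becomes negative.
    return max(0.0, cost - reduction)
```

For example, with weights of 0.125 for one keyword and 0.0625 for another, a text containing the first keyword twice and the second once would have its cost lowered by 0.3125 under this sketch.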

(4) Selection of the response message to (5) Output of the response message
These steps operate as in Embodiment 1, so their description is omitted.

As described above, according to Embodiment 2, the calculated cost value can be corrected according to the contents of the keyword list, so the cost calculation can be biased depending on what the keyword list holds.
That is, the text input to the speech synthesis unit 110 is ordinarily entirely ad hoc, but the device can give the appearance that the quality of the synthesized speech improves only when particular keywords are input, which steers the text input to the speech synthesis unit 110 in a direction determined by the keyword list.
Of course, calculating a false cost value is undesirable from the standpoint of honesty, so when the input text matches the keyword list, the actual quality of the generated synthesized speech should be adjusted accordingly as well.
This function can be applied, for example, to effectively discourage the input of words that are undesirable as synthesized speech output, by making the quality of the synthesized speech extremely poor when such words are entered as input text.

Embodiment 3.
In Embodiments 1 and 2, the voice response unit 130 outputs a voice response using the content of the response message stored in the response DB 150, and that voice response was assumed to be an impersonal, mechanical voice generated with no particular relation to the synthesized speech.
Embodiment 3 of the present invention describes a speech synthesizer in which the synthesized speech and the voice response are related to each other.

FIG. 4 is a functional block diagram of the speech synthesizer 100 according to Embodiment 3.
The speech synthesizer 100 of FIG. 4 has the same components as in FIG. 1 described for Embodiment 1, but the input/output relationships between the units differ from FIG. 1. The details are given in the operation description below.

Next, the detailed operation of the speech synthesizer 100 according to Embodiment 3 is described step by step.
(1) Input of the input text to (3) Calculation of the cost value
These steps operate as in Embodiment 1, so their description is omitted.

(4) Selection of the response message
The recommendation level selection unit 120 searches the response message table 151 using the cost value received from the speech synthesis unit 110 as the key. It then reads the “response message” column of the matching row and outputs it to the speech synthesis unit 110.
In this embodiment, the “response message” column of the response message table 151 is assumed to store only the text of the response message.

(5) Speech synthesis of the response message
The speech synthesis unit 110 generates synthesized speech that reads aloud the content of the response message received from the recommendation level selection unit 120, using the same processing as in step (2).
The generated synthesized speech is output to the voice response unit 130.

(6) Output of the response message
The voice response unit 130 outputs the synthesized speech received from the speech synthesis unit 110 as sound through a speaker or the like.

As described above, according to Embodiment 3, the response message indicating the quality of the synthesized speech is spoken by the same speaker, or in the same tone, as the synthesized speech. This makes the quality of the synthesized speech easier for the user to grasp intuitively and provides a more interactive speech synthesizer.
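The data flow of Embodiment 3, in which the same synthesis path produces both the requested speech and the spoken response message, can be sketched as follows. The function name `run_embodiment3`, the injected callables, and the choice to play the response after the main speech are assumptions for illustration; the description does not fix the playback order for this embodiment.

```python
from typing import Callable, Tuple

# Hypothetical synthesizer signature: text in, (audio, cost value) out.
Synthesize = Callable[[str], Tuple[bytes, float]]


def run_embodiment3(input_text: str,
                    synthesize: Synthesize,
                    select_response: Callable[[float], str],
                    play: Callable[[bytes], None]) -> None:
    """Steps (1) to (6): synthesize, rate, then speak the rating."""
    speech, cost = synthesize(input_text)          # steps (1) to (3)
    message_text = select_response(cost)           # step (4): table lookup
    response_speech, _ = synthesize(message_text)  # step (5): same synthesizer
    play(speech)                                   # assumed order: speech first
    play(response_speech)                          # step (6)
```

Reusing one `synthesize` callable for both texts is what gives the response the same speaker and tone as the synthesized speech in this sketch.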

In Embodiments 1 to 3 above, the speech quality of the response message may also be varied according to the quality of the synthesized speech. For example, in Embodiment 3, when high-quality synthesized speech with a low cost value is output, the quality of the response message may also be raised, or a response message with emotional expression may be output. The sense of interactivity given to the user then increases accordingly, strengthening the user's feeling of unity with the synthesized speech.

Also, in Embodiments 1 to 3 above, the response message need not be spoken by a single speaker; response messages by different speakers may be output for each cost value threshold.

FIG. 1 is a functional block diagram of the speech synthesizer 100 according to Embodiment 1.
FIG. 2 shows the structure and example data of the response message table 151 stored in the response DB 150.
FIG. 3 is a functional block diagram of the speech synthesizer 100 according to Embodiment 2.
FIG. 4 is a functional block diagram of the speech synthesizer 100 according to Embodiment 3.

Explanation of symbols

100 speech synthesizer, 110 speech synthesis unit, 120 recommendation level selection unit, 130 voice response unit, 140 speech DB, 150 response DB, 151 response message table, 160 keyword DB.

Claims

A speech synthesizer comprising:
an input unit for inputting input text to be converted into speech;
a speech synthesis unit that performs speech synthesis according to the content of the input text and calculates, using a predetermined arithmetic expression, the quality of the speech synthesized for the entire input text;
a response database storing response messages associated in advance with each quality level of synthesized speech; and
a voice response unit that reads a response message from the response database and outputs it as speech,
wherein, when the speech synthesis unit completes speech synthesis, the voice response unit reads the response message corresponding to the quality of that speech from the response database and outputs it as speech.

The speech synthesizer according to claim 1, wherein the response database stores the response messages in association with three or more evaluation ranges of the speech quality level.

The speech synthesizer according to claim 1 or 2, further comprising a keyword table storing one or more predetermined keywords,
wherein the speech synthesis unit refers to the keyword table when calculating the quality of the synthesized speech and, when the content of the input text is included in the keyword table, corrects the quality of the synthesized speech according to a predetermined rule.

The speech synthesizer according to any one of claims 1 to 3, wherein
the response database stores only the text of the response messages,
the speech synthesis unit, when the voice response unit is to output the response message as speech, reads the text of the response message and performs speech synthesis on it, and
the voice response unit outputs the response message as speech using that synthesized speech.