JP3485586B2 - Voice synthesis method

Info

Publication number: JP3485586B2
Application number: JP25838792A
Authority: JP
Inventors: 誠橋本; 徹北村
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1991-09-30
Filing date: 1992-09-28
Publication date: 2004-01-13
Anticipated expiration: 2019-01-13
Also published as: JPH05224690A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a rule-based speech synthesis method, and in particular to a method for generating the pitch pattern of synthesized speech, which strongly affects the naturalness of the phonemes and accent of the speech.

[0002]

[Prior Art] In recent years, rule-based speech synthesis has been actively studied as an important human-interface technology. Rule-based speech synthesis determines part-of-speech information from a character string or the like by morphological analysis and determines the reading of each word by matching against a word dictionary. It then derives the accent type, accent combination, and phrases of the words from those readings, determines a pitch pattern from this information, and generates speech data by connecting speech units (for example, PARCOR coefficients or LSP coefficients) that correspond to the readings of the words. In other words, the speech data consists of a PARCOR coefficient sequence together with the corresponding pitch pattern and amplitude information. Of these, the pitch pattern is regarded as having a large effect on how natural the synthesized speech sounds.

[0003] One known conventional approach to pitch pattern generation uses a point pitch model: after the phrase component and accent component are determined from the structure of the sentence, the pitch at the center of gravity of each mora is estimated and the pitch pattern is generated by linear interpolation (Transactions of the Institute of Electronics and Communication Engineers of Japan, Vol. J63-D, No. 9, pp. 715-722, 1980.9). Another known method uses a neural network to estimate, for each phrase, the pitch frequencies of the first mora, the mora at which the pitch frequency peaks, and the last mora (Speech Study Group material SP89-111, 1990.1). All of these conventional methods uniquely estimate the absolute value of the pitch frequency for each mora and take no account of the connection to the neighboring moras, that is, the amount of change in pitch frequency. As a result, the amount of change in the pitch pattern between moras is not stable with these methods.
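By way of illustration only, the linear interpolation step of the point pitch model described above might be sketched as follows. The function name `interpolate_pitch`, the time axis in seconds, and the sample values are assumptions introduced here; none of them comes from the cited references.

```python
from typing import List, Tuple


def interpolate_pitch(points: List[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate pitch (Hz) at time t.

    points: (time, pitch) pairs at mora centers of gravity, sorted by
    strictly increasing time. Outside the covered range the nearest end
    value is held (an assumption of this sketch).
    """
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
    raise ValueError("points must be sorted by time")


# Hypothetical center-of-gravity points: (seconds, Hz).
points = [(0.10, 120.0), (0.25, 150.0), (0.40, 130.0)]
midway = interpolate_pitch(points, 0.175)  # halfway between the first two points
```

Because each point is estimated as an absolute value, nothing in such a scheme constrains the step from one point to the next, which is the instability the paragraph above describes.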

[0004]

[Problems to Be Solved by the Invention] As described above, although the amount of change in pitch frequency matters more to the naturalness of synthesized speech than its absolute value, none of the conventional methods estimates the variation in pitch frequency for each mora.

[0005] The present invention has been made to solve this problem. In rule-based speech synthesis, it generates the pitch pattern taking into account the amount of change in pitch frequency, which strongly affects the naturalness of synthesized speech, in order to improve that naturalness.

[0006]

[Means for Solving the Problems] The pitch pattern generation method of the present invention comprises: a pitch difference estimation process that estimates, for each mora of an arbitrary character string, the difference between the pitch frequency at the center of gravity of that mora and the pitch frequency at the center of gravity of the mora preceding it; and a pitch pattern generation process that forms a pitch pattern from the pitch differences obtained by the pitch difference estimation process. The pitch pattern generation process is characterized in that it obtains the pitch difference between the pitch frequency at the center of gravity of each mora and the pitch frequency at the center of gravity of the preceding mora from a correspondence between (a) information on at least the position of the mora within the accent phrase containing it, the position of that accent phrase within the divided phrase, the number of moras in that accent phrase, the accent type of that accent phrase, and the accent type of the accent phrase preceding that accent phrase, and (b) the difference in pitch frequency between the centers of gravity of adjacent moras, and generates the pitch pattern on the basis of the pitch differences so obtained.

[0007] Further, the speech synthesis method of the present invention comprises: a pitch pattern generation process that generates a pitch for each mora of an arbitrary input character string; a speech unit memory that stores speech units consisting of the speech parameters needed for speech synthesis; and a speech unit connection process that connects the speech units needed for the character string to generate speech data. The pitch pattern generation process is characterized in that it obtains the pitch difference between the pitch frequency at the center of gravity of the mora in question and the pitch frequency at the center of gravity of the preceding mora on the basis of a correspondence between (a) information on at least the position of the mora within the accent phrase containing it, the position of that accent phrase within the divided phrase, the number of moras in that accent phrase, the accent type of that accent phrase, and the accent type of the accent phrase preceding that accent phrase, and (b) the difference in pitch frequency between the centers of gravity of adjacent moras, and generates the pitch pattern on the basis of the pitch difference so obtained.

[0008] Also, the speech synthesis method of the present invention comprises: a pitch pattern generation process that generates a pitch for each mora of an arbitrary input character string; a speech unit memory that stores speech units consisting of the speech parameters needed for speech synthesis; and a speech unit connection process that connects the speech units needed for the character string to generate speech data. The pitch pattern generation process is characterized in that it obtains the pitch difference between the pitch frequency at the center of gravity of the mora in question and the pitch frequency at the center of gravity of the preceding mora on the basis of a correspondence between (a) information on at least the position of the mora within the accent phrase containing it, the position of that accent phrase within the divided phrase, the number of moras in that accent phrase, the accent type of that accent phrase, and the accent type of the accent phrase preceding that accent phrase, and (b) the difference in pitch frequency between the centers of gravity of adjacent moras, and generates the pitch pattern on the basis of the pitch difference so obtained.

[0009]

[Operation] In the speech synthesis method of the present invention, first, for each mora of an arbitrary character string, the difference between the pitch frequency at the center of gravity of that mora and the pitch frequency at the center of gravity of the preceding mora is estimated. This difference is obtained from a correspondence between information on at least the position of the mora within the accent phrase containing it, the position of that accent phrase within the divided phrase, the number of moras in that accent phrase, the accent type of that accent phrase, and the accent type of the accent phrase preceding that accent phrase, and the difference in pitch frequency between the centers of gravity of adjacent moras. The estimation is performed using, for example, a correspondence table that records the correspondence between the above items of information and the pitch frequency difference, or a neural network trained so that, when the above items of information are supplied to its input layer, its output layer outputs the pitch difference between moras.

[0010] Next, a pitch pattern can be generated using the pitch differences estimated in this way for each pair of adjacent moras.

[0011]

[Embodiments] First, a rule-based synthesis apparatus that uses the speech synthesis method of the present invention is described.

[Embodiment 1] FIG. 1 is a block diagram showing an embodiment of a rule-based synthesis apparatus that uses the speech synthesis method of the present invention. In FIG. 1, 1 is a character code symbol string input section for entering the character string to be synthesized by rule; 2 is a morphological analysis section that divides the character string into words and determines part-of-speech information; 3 is a reading determination section that determines the reading of each word; 4 is a word dictionary that stores word readings; 5 is an accent determination section that determines the accent based on the word readings; 6 is an accent dictionary that stores the accent of each word; and 7 is a phrase determination section that determines the phrases of the character string. Here, a phrase means a breath group, such as the span from the start of a sentence to a comma, from one comma to the next, from a comma to a period, from one breath to the next, or from one pause to the next.

[0012] 8 is a pitch pattern generation section that generates the pitch pattern of the character string; 9 is a pitch difference estimation section that estimates the difference between the pitch frequency at the center of gravity of a mora and the pitch frequency at the center of gravity of the preceding mora; 10 is a speech unit connection section that connects speech units; 11 is a speech unit table that stores speech units; 12 is a D/A conversion section; and 13 is a speaker.

[0013] FIG. 2 shows the result of morphological analysis of the input character string.

[0014] FIG. 3 shows the result of determining readings for the morphological analysis result of the input character string.

[0015] FIG. 4 shows the input character string in units of accent phrases. In FIG. 4, 41 is the fifth mora of the input character string, 42 is its first accent phrase, and 43 is its second accent phrase.

[0016] FIG. 5 shows the pitch pattern of the input character string.

[0017] FIG. 6 shows the pitch difference estimation section (9) using a correspondence table. In FIG. 6, 60 is an example of the mora position within the accent phrase containing the mora, 61 is an example of the position of the accent phrase containing the mora, 62 is an example of the number of moras in that accent phrase, 63 is an example of the accent type of that accent phrase, 64 is an example of the accent type of the accent phrase preceding that accent phrase, and 65 is an example of the pitch frequency difference between the mora and the preceding mora.

[0018] The processing of this embodiment is now described using the character string 「道を尋ねる」 (michi o tazuneru, "ask the way"), a single phrase consisting of two accent phrases.

[0019] The character string entered from the character code symbol string input section (1) is divided into words by the morphological analysis section (2), and the part of speech of each word is determined. FIG. 2 shows the result of dividing the example string 「道を尋ねる」 into words and assigning a part of speech to each word.

[0020] Once the parts of speech are determined, the result is passed to the reading determination section (3), where the reading of each word is determined by matching against the word dictionary (4). FIG. 3 shows the result of determining readings for the example string 「道を尋ねる」.

[0021] Once the readings are determined, the result is passed to the accent determination section (5). The accent of each word is determined by matching against the accent dictionary (6), accent combination is performed by rule to form accent phrases, and the accent of each accent phrase is determined. As a result, the example string 「みちをたずねる」 is divided into two accent phrases, as shown in FIG. 4: the first accent phrase 「みちを」 (42) and the second accent phrase 「たずねる」 (43).

[0022] After the accents are determined, the phrase determination section (7) determines the phrases. In the example string the whole string forms a single phrase, but a string such as 「こうばんまでいって、みちをたずねた」 ("went to the police box and asked the way") would be divided into two phrases, 「こうばんまでいって」 and 「みちをたずねた」.

[0023] Next, the pitch differences are estimated. This embodiment describes the case of using the correspondence table of FIG. 6 to estimate the difference in center-of-gravity pitch frequency between the fourth mora 「た」 (ta) and the fifth mora 「ず」 (zu) of the string 「みちをたずねる」.

[0024] Here, the correspondence table lists the pitch difference between a mora and its preceding mora, expressed as a natural logarithm, against the following five parameters:

1) the position of the mora within the accent phrase containing it
2) the position of that accent phrase within the divided phrase
3) the number of moras in that accent phrase
4) the accent type of that accent phrase
5) the accent type of the accent phrase preceding that accent phrase

Such a table is created, for example, by using a large number of sentences and recording the pitch differences between moras under conditions in which the accent type and the number of moras take a variety of values.
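As a hedged sketch of how such a table might be compiled, the code below averages the natural-log pitch difference over all observations that share the same five parameters. Averaging, the function name `build_table`, and the sample frequencies are assumptions made for illustration; the text says only that the differences are recorded.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Key order: (mora position, accent phrase position, mora count,
#             accent type, preceding accent type)
Key = Tuple[int, int, int, int, int]


def build_table(observations: Iterable[Tuple[Key, float, float]]) -> Dict[Key, float]:
    """Average ln(f_current) - ln(f_previous) per parameter combination.

    observations: (key, previous-mora pitch in Hz, current-mora pitch in Hz).
    """
    sums: Dict[Key, float] = defaultdict(float)
    counts: Dict[Key, int] = defaultdict(int)
    for key, f_prev, f_cur in observations:
        sums[key] += math.log(f_cur) - math.log(f_prev)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up measurements for two parameter combinations.
observations = [
    ((2, 2, 4, 3, 0), 100.0, 110.0),
    ((2, 2, 4, 3, 0), 100.0, 121.0),
    ((1, 1, 3, 0, 1), 125.0, 120.0),
]
table = build_table(observations)
```

Working in the log domain makes the stored value a frequency ratio, so the same entry applies whatever the absolute pitch of the speaker.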

[0025] In the example string, the mora in question, the fifth mora 「ず」, is the second mora (41) of the second accent phrase (43), which has four moras and accent type 3. In the correspondence table this is expressed by the five parameters "2, 2, 4, 3, 0": mora position "2" (60) within the accent phrase containing the mora, accent phrase position "2" (61), mora count "4" (62) of that accent phrase, accent type "3" (63) of that accent phrase, and accent type "0" (64) of the preceding accent phrase (42). From the correspondence table, the difference in center-of-gravity pitch frequency between the mora 「ず」 and the preceding mora 「た」 is therefore estimated as "+0.147" (65) in natural logarithm.
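The lookup in this example can be sketched as a dictionary keyed by the five parameters. Only the single entry quoted above is filled in, and the names `PITCH_DIFF_TABLE` and `estimate_pitch_diff` are hypothetical. Reading "+0.147 in natural logarithm" as the difference of the natural logarithms of the two pitch frequencies is an interpretation; under it, the pitch rises by a factor of about e^0.147 ≈ 1.158 from 「た」 to 「ず」.

```python
import math
from typing import Dict, Tuple

# Key order: (mora position, accent phrase position, mora count,
#             accent type, preceding accent type)
Key = Tuple[int, int, int, int, int]

# Only the entry quoted in the text; a real table would hold many rows.
PITCH_DIFF_TABLE: Dict[Key, float] = {(2, 2, 4, 3, 0): 0.147}


def estimate_pitch_diff(table: Dict[Key, float], key: Key) -> float:
    """Return the natural-log pitch difference recorded for this key."""
    if key not in table:
        raise KeyError(f"no table entry for parameters {key}")
    return table[key]


diff = estimate_pitch_diff(PITCH_DIFF_TABLE, (2, 2, 4, 3, 0))
ratio = math.exp(diff)  # frequency ratio under the log-difference reading
```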

[0026] Because this method estimates the difference in pitch frequency at the vowel center of gravity between the mora of interest and the mora preceding it, the question arises of how to handle the estimate between the first mora of the first accent phrase (the first mora of the sentence) and a mora preceding it.

[0027] Accordingly, when estimating the difference in vowel center-of-gravity pitch frequency between the first mora of the first accent phrase and the mora preceding it, this embodiment treats the accent type of the accent phrase preceding the first mora of the first accent phrase as type 1, and uses this value to obtain the pitch frequency difference at the vowel center of gravity for the first mora of the first accent phrase.

[0028] Type 1 is adopted as the accent type of the preceding accent phrase because the pitch frequency falls over the latter part of a type-1 accent phrase. As a result, the transition from the mora preceding the first mora of the first accent phrase to that first mora causes no sense of incongruity and can be regarded as natural speech.
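As an illustrative sketch only, the five parameters for every mora of a single phrase could be assembled as below, with the preceding accent type fixed at 1 for the first accent phrase as described. The function name `mora_keys` is hypothetical, and applying the type-1 value to every mora of the first accent phrase (not only its first mora) is an assumption of this sketch.

```python
from typing import List, Tuple

# Key order: (mora position, accent phrase position, mora count,
#             accent type, preceding accent type)
Key = Tuple[int, int, int, int, int]


def mora_keys(accent_phrases: List[Tuple[int, int]]) -> List[Key]:
    """Build one five-parameter key per mora.

    accent_phrases: (mora count, accent type) for each accent phrase of
    one phrase, in order.
    """
    keys: List[Key] = []
    prev_type = 1  # no real predecessor exists, so type 1 is assumed
    for phrase_pos, (mora_count, accent_type) in enumerate(accent_phrases, start=1):
        for mora_pos in range(1, mora_count + 1):
            keys.append((mora_pos, phrase_pos, mora_count, accent_type, prev_type))
        prev_type = accent_type
    return keys


# 「みちを」: 3 moras, type 0; 「たずねる」: 4 moras, type 3.
keys = mora_keys([(3, 0), (4, 3)])
```

With these inputs the fifth key is (2, 2, 4, 3, 0), the parameter set given in the text for the fifth mora 「ず」.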

[0029] In this way, for each mora of 「みちをたずねる」, the difference between its center-of-gravity pitch frequency and that of the preceding mora is estimated, in natural logarithm and in order from the first mora, as "−0.061, 0.396, −0.224, −0.300, 0.147, −0.142, −0.320".

[0030] The pitch pattern generation section (8) estimates the pitch frequency at the center of gravity of each mora from the preset pitch frequencies at the start and end of the speech interval and the values estimated by the pitch difference estimation section (9), and generates a point pitch pattern such as that shown in FIG. 5.
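A minimal sketch of this step, under stated assumptions, accumulates the seven estimated log differences from a start frequency. The 120 Hz start value and the name `pitch_points` are hypothetical, and treating the preset start frequency as the reference for the first difference is an assumption. How the preset end frequency enters the calculation is not spelled out in the text, so the sketch leaves it out.

```python
import math
from typing import List

# The seven natural-log differences listed for 「みちをたずねる」.
LOG_DIFFS = [-0.061, 0.396, -0.224, -0.300, 0.147, -0.142, -0.320]


def pitch_points(start_hz: float, log_diffs: List[float]) -> List[float]:
    """Accumulate log differences into a pitch (Hz) per mora."""
    points = []
    log_f = math.log(start_hz)
    for diff in log_diffs:
        log_f += diff  # step from the previous mora to this one
        points.append(math.exp(log_f))
    return points


contour = pitch_points(120.0, LOG_DIFFS)  # 120 Hz is a made-up start value
```

Each consecutive pair in `contour` differs by exactly the estimated ratio, so the mora-to-mora change is what the estimator controls, and the absolute values follow from the starting point.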

[0031] Once the pitch pattern is generated, the speech unit connection section (10) selects the speech units needed for the sentence from the speech unit table (11), which stores in advance speech units such as CVC (consonant + vowel + consonant) units (for example, PARCOR coefficients or LSP coefficients), and connects them to create speech data in the form of a digital signal. The speech data is converted to an analog signal by the D/A conversion section (12) and output from the speaker (13) as synthesized speech.

[0032] In the embodiment above, the pitch pattern of a character string consisting of one phrase was generated from a correspondence table built on the five parameters shown in FIG. 6. The table may instead be built by adopting, in place of some of these five parameters or in addition to them, parameters relating to linguistic information, for example: whether the mora of interest is unvoiced; whether it is accompanied by an unvoiced consonant; whether it is a moraic nasal (撥音); whether it is a palatalized sound (拗音); whether it is accompanied by a voiced consonant; whether its consonant is a fricative; whether its consonant is a semivowel; whether its consonant is a nasal; whether its consonant is an affricate; whether its consonant is a plosive; the part of speech of the word containing it; or whether the accent phrase containing it is emphasized.

[0033] Also, while the embodiment above generated the pitch pattern of a character string consisting of one phrase, the pitch pattern of a character string consisting of multiple phrases can also be generated by using a correspondence table based on the five parameters of FIG. 6, the linguistic parameters described above, the phrase position, the number of phrases, and the like.

[Embodiment 2] Next, an embodiment in which a neural network is used in the pitch difference estimation section (9) of Embodiment 1 is described.

[0034] FIG. 7 shows the pitch difference estimation section (9) using a neural network. In FIG. 7, 71 is the input layer, 72 is the intermediate layer, and 73 is the output layer.

[0035] Processing other than that of the pitch difference estimation section (9) is the same as in Embodiment 1, so only the processing in the pitch difference estimation section is described below.

[0036] This embodiment describes the case of using a neural network to estimate the difference in center-of-gravity pitch frequency between the fourth mora 「た」 and the fifth mora 「ず」 of the example string 「みちをたずねる」, expressed by its reading.

[0037] In the neural network used by the pitch difference estimation section (9), the following five parameters are supplied to the input layer (71):

1) the position of the mora within the accent phrase containing it
2) the position of that accent phrase within the divided phrase
3) the number of moras in that accent phrase
4) the accent type of that accent phrase
5) the accent type of the accent phrase preceding that accent phrase

[0038] The neural network is assumed to have been trained so that the output layer (73) outputs the difference between the pitch frequency at the center of gravity of the mora and the pitch frequency at the center of gravity of the preceding mora.

[0039] Here, the mora in question, the fifth mora 「ず」, is the second mora (41) of the second accent phrase (43), which has four moras and accent type 3. The input parameters to the input layer of the neural network are therefore the five values "2, 2, 4, 3, 0": mora position "2" within the accent phrase containing the mora, accent phrase position "2", mora count "4" of that accent phrase, accent type "3" of that accent phrase, and accent type "0" of the preceding accent phrase (42).

[0040] The neural network has been trained to output a certain value at the output layer (73) when the five items "2, 2, 4, 3, 0" are supplied to the input layer (71), and each unit of the network is weighted according to the coefficients determined by this training. As a result, the output layer (73) estimates and outputs, for example, "+0.147" in natural logarithm as the difference in center-of-gravity pitch frequency between the mora 「ず」 and the preceding mora 「た」.
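For illustration, the forward pass of such a network might look like the sketch below: five inputs, one intermediate layer, and a single output. The sigmoid activation, the linear output unit, the three intermediate units, and all weight values are assumptions; the weights are placeholders rather than trained coefficients, so the output is not expected to equal 0.147.

```python
import math
from typing import List


def forward(x: List[float], w_hidden: List[List[float]], b_hidden: List[float],
            w_out: List[float], b_out: float) -> float:
    """One intermediate layer of sigmoid units, then a linear output unit."""
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hidden.append(1.0 / (1.0 + math.exp(-z)))
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Placeholder weights (NOT trained values): 5 inputs, 3 intermediate units.
w_hidden = [[0.1, -0.2, 0.05, 0.3, -0.1],
            [-0.3, 0.1, 0.2, -0.05, 0.4],
            [0.2, 0.2, -0.1, 0.1, 0.0]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.5, -0.4, 0.3]
b_out = 0.0

x = [2.0, 2.0, 4.0, 3.0, 0.0]  # the five parameters for the fifth mora
estimate = forward(x, w_hidden, b_hidden, w_out, b_out)
```

Compared with a table, a network of this kind can return a value for parameter combinations that never appeared in the training data, at the cost of a training step.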

[0041] In this embodiment, as in the first embodiment, when the neural network is used to estimate the difference in center-of-gravity pitch frequency between the first mora of the first accent phrase and its preceding mora, the network is, for example, trained with the accent type of the preceding accent phrase set to "1" (information that can be regarded as other than type 0), and at estimation time the accent type of the preceding accent phrase among the input parameters is likewise set to "1" (the input used during training).

[0042] In this way, for each mora of 「みちをたずねる」, the difference between its center-of-gravity pitch frequency and that of the preceding mora is estimated, in natural logarithm and in order from the first mora, as "−0.061, 0.396, −0.224, −0.300, 0.147, −0.142, −0.320".

[0043] In the second embodiment above, the neural network has five units in the input layer and one intermediate layer, but the number of units and the number of layers are not limited to these.

[0044] In the second embodiment above, the pitch pattern of a character string consisting of one phrase was generated using a neural network trained on the five input parameters. A neural network may instead be used that was trained by adopting, in place of some of these five parameters or in addition to them, parameters relating to linguistic information, for example: whether the mora of interest is unvoiced; whether it is accompanied by an unvoiced consonant; whether it is a moraic nasal (撥音); whether it is a palatalized sound (拗音); whether it is accompanied by a voiced consonant; whether its consonant is a fricative; whether its consonant is a semivowel; whether its consonant is a nasal; whether its consonant is an affricate; whether its consonant is a plosive; the part of speech of the word containing it; or whether the accent phrase containing it is emphasized.

[0045] Furthermore, while the second embodiment above generated the pitch pattern of a character string consisting of one phrase, the pitch pattern of a character string consisting of multiple phrases can also be generated by using a neural network trained with the linguistic parameters described above, the phrase position, the number of phrases, and the like.

[0046]

[Effects of the Invention] As described above, according to the present invention, the difference between the pitch frequency at the center of gravity of each mora and the pitch frequency at the center of gravity of its preceding mora is estimated, so that the pitch pattern is generated from the variation in pitch frequency. This improves the naturalness of the synthesized speech.
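The step from estimated pitch differences to a pitch pattern can be pictured as a running sum, as in the minimal sketch below. The function name, the starting pitch for the first mora, and the numeric values are assumptions chosen for illustration only; they are not taken from the specification.

```python
from itertools import accumulate


def pitch_pattern(start_hz, diffs_hz):
    """Pitch at each mora's center of gravity.

    The first mora is placed at start_hz (an assumed starting value); each
    later mora adds its estimated difference from the preceding mora.
    """
    return list(accumulate(diffs_hz, initial=start_hz))


# Hypothetical differences for the second through fifth morae.
diffs = [30.0, 10.0, -20.0, -15.0]
pattern = pitch_pattern(120.0, diffs)
print(pattern)  # [120.0, 150.0, 160.0, 140.0, 125.0]
```

Because each value depends only on its predecessor plus a difference, the pattern follows the rise and fall of pitch between morae rather than a set of independently chosen absolute values.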

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of rule-based synthesis using the present invention.

FIG. 2 is a diagram showing the result of morphological analysis of an input character string.

FIG. 3 is a diagram showing the readings determined for the morphological analysis result of an input character string.

FIG. 4 is a diagram showing an input character string in units of accent phrases.

FIG. 5 is a diagram showing the pitch pattern of an input character string.

FIG. 6 is a diagram showing the correspondence table used in the pitch difference estimation unit (9).

FIG. 7 is a configuration diagram of the neural network used in the pitch difference estimation unit (9).

[Explanation of symbols]

1 Character code symbol string input unit
2 Morphological analysis unit
3 Reading determination unit
4 Word dictionary
5 Accent determination unit
6 Accent dictionary
7 Phrase determination unit
8 Pitch pattern generation unit
9 Pitch difference estimation unit
10 Segment connection unit
11 Segment table
12 DA converter
13 Speaker
41 Fifth mora "zu" of the input character string
42 First accent phrase of the input character string
43 Second accent phrase of the input character string
71 Input layer of the neural network
72 Hidden layer of the neural network
73 Output layer of the neural network

Continuation of front page (58) Fields searched (Int.Cl.⁷, DB name) G10L 13/06 - 13/08; JICST file (JOIS)

Claims

(57) [Claims]

1. A pitch pattern generation method comprising a pitch difference estimation process for estimating, for each mora, the difference between the pitch frequency at the center of gravity of that mora and the pitch frequency at the center of gravity of its preceding mora, and a pitch pattern generation process for forming a pitch pattern from the pitch difference obtained by the pitch difference estimation process, characterized in that the pitch pattern generation process obtains the pitch difference between the pitch frequency at the center of gravity of each mora and the pitch frequency at the center of gravity of its preceding mora from a correspondence relationship between, on the one hand, information on at least the mora position of the mora within the accent phrase containing the mora, the position within the divided phrase of the accent phrase containing the mora, the number of morae in the accent phrase containing the mora, the accent type of the accent phrase containing the mora, and the accent type of the accent phrase preceding the accent phrase containing the mora, and, on the other hand, the difference in pitch frequency between the centers of gravity of adjacent morae, and generates a pitch pattern based on the obtained pitch difference.

2. A speech synthesis method comprising a pitch pattern generation process for generating a pitch for each mora of an arbitrary input character string, a speech segment memory for storing speech segments consisting of the speech parameters necessary for speech synthesis, and a speech segment connection process for connecting the speech segments required for the character string to generate speech data, characterized in that the pitch pattern generation process obtains the pitch difference between the pitch frequency at the center of gravity of each mora and the pitch frequency at the center of gravity of its preceding mora based on a correspondence relationship between, on the one hand, information on at least the mora position of the mora within the accent phrase containing the mora, the position within the divided phrase of the accent phrase containing the mora, the number of morae in the accent phrase containing the mora, the accent type of the accent phrase containing the mora, and the accent type of the accent phrase preceding the accent phrase containing the mora, and, on the other hand, the difference in pitch frequency between the centers of gravity of adjacent morae, and generates a pitch pattern based on the obtained pitch difference.