JP3286353B2 - Voice synthesis method

Info

Publication number: JP3286353B2
Application number: JP25838892A
Authority: JP
Inventors: 誠橋本; 正典宮武
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1992-05-20
Filing date: 1992-09-28
Publication date: 2002-05-27
Anticipated expiration: 2017-05-27
Also published as: JPH0635492A

Abstract

PURPOSE: To improve the naturalness of synthesized speech by generating a pitch pattern that takes into account the amount of change in the pitch frequency at the center of gravity between two adjacent morae. CONSTITUTION: The method comprises a pitch difference estimating process 9 that estimates the difference in pitch frequency at the center of gravity between two adjacent morae from parameter information on the accent type and parameter information on the number of morae; a pitch pattern generating process 8 that generates the pitch pattern of an arbitrary character string from the estimated differences in pitch frequency; and a speech unit connecting process 10 that generates speech data by reading the speech units required for the utterance out of a speech unit memory 11, which stores speech units consisting of the speech parameters required for speech synthesis. In the pitch difference estimating process 9, if the mora is at the head of a delimited range, the difference between a fixed pitch frequency and the pitch frequency at the center of gravity of that mora is estimated.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a method for generating the pitch pattern of synthesized speech, which strongly affects the naturalness of the phonology and accent of the speech.

[0002]

[Prior Art] In recent years, rule-based speech synthesis has been actively researched as an important human-interface technology. Rule-based speech synthesis determines part-of-speech information from a character string or the like by morphological analysis, determines the reading of each word by lookup in a word dictionary, and then derives the accent type of each word, accent combination, phrases, and so on according to that reading. The pitch pattern is determined from this information, and speech data is generated by concatenating speech units (for example, PARCOR coefficients or LSP coefficients) corresponding to the readings of the words.

[0003] That is, the speech data consists of a sequence of PARCOR coefficients together with the corresponding pitch pattern and amplitude information. Of these, the pitch pattern is regarded as having a large influence on how natural the synthesized speech sounds.

[0004] A known conventional method of pitch pattern generation uses a point pitch model: after the phrase component and accent component are determined from the sentence structure, the pitch at the center of gravity of each mora is estimated, and the pitch pattern is generated by linear interpolation (Transactions of the Institute of Electronics and Communication Engineers of Japan, Vol. J63-D, No. 9, pp. 715-722, September 1980).
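The linear interpolation step of a point pitch model can be illustrated with a minimal sketch. This is not taken from the cited paper; the function name `interpolate_pitch`, the time unit, and the choice to hold the pitch constant outside the first and last points are all assumptions made for illustration.

```python
def interpolate_pitch(points, t):
    """Linearly interpolate pitch at time t.

    points: list of (time, pitch_hz) pairs, one per mora center of gravity,
            sorted by time (hypothetical representation).
    """
    # Assumption: outside the covered range the nearest point's pitch is held.
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]
    # Find the pair of neighbouring points that brackets t.
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
```

Calling `interpolate_pitch([(0, 100.0), (100, 120.0)], 50)` returns the midpoint pitch between the two points.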

[0005] Also known is a method that uses a neural network to estimate, for each phrase, the pitch frequency values of the first mora, the mora at which the pitch frequency peaks, and the last mora (Speech Research Group technical report SP89-111, January 1990).

[0006] Each of these conventional pitch pattern generation methods directly estimates a single absolute pitch frequency value for each mora, and none takes into account the transition between two adjacent morae (the amount of change in pitch frequency).

[0007] As a result, with these methods the amount of change in the pitch pattern between morae is not stable, and the output sounds unnatural compared with human speech.

[0008]

[Problem to Be Solved by the Invention] Thus, although the amount of change in pitch frequency at the center of gravity between two adjacent morae matters more to the naturalness of rule-synthesized speech than the absolute pitch frequency of each mora, the conventional methods do not estimate the amount of change in pitch frequency between morae.

[0009] The present invention has been made to solve this problem. In rule-based speech synthesis, it aims to improve the naturalness of the synthesized speech by generating the pitch pattern in consideration of the amount of change in pitch frequency, which strongly affects that naturalness.

[0010]

[Means for Solving the Problem] The present invention comprises a pitch difference estimation unit and a pitch pattern generation unit. The pitch difference estimation unit estimates the difference in pitch frequency between two adjacent morae using either (a) a correspondence table that stores the differences in pitch frequency between the morae of a plurality of sentences each consisting of m accent phrases, indexed by at least the position of the accent phrase in the sentence, accent type parameter information, and/or mora count parameter information, or (b) a neural network trained so that, when accent type parameter information or mora count parameter information is supplied to its input layer, its output layer outputs the difference in pitch frequency at the center of gravity between two adjacent morae. The pitch pattern generation unit forms a pitch pattern from the differences in pitch frequency estimated by the pitch difference estimation unit.

[0011] When the mora of interest is located at the head of a delimited range, the pitch difference estimation unit assumes a fixed pitch frequency as a virtual preceding mora and estimates the difference in pitch frequency at the center of gravity relative to it. In addition, when the accent phrase containing the mora of interest is located at the head of the delimited range, the unit estimates the difference in pitch frequency by treating the accent type of the preceding accent phrase as either the kifuku (undulating) type or the heiban (flat) type.

[0012] Furthermore, when the pitch difference estimation unit estimates the difference in pitch frequency at the center of gravity between two adjacent morae on the basis of the differences in pitch frequency between the morae of sentences consisting of m accent phrases, and the sentence to be estimated consists of n (> m) accent phrases, the unit treats the position in the sentence of each of the (m+1)-th through n-th accent phrases as m when estimating the differences in pitch frequency at the center of gravity between the morae of those accent phrases.

[0013]

[Operation] In the speech synthesis method of the present invention, when the differences in pitch frequency at the center of gravity between the morae of an arbitrary character string are estimated from at least accent type parameter information or mora count parameter information, the pitch difference estimation for the first mora of the character string is based on the difference between a fixed pitch frequency and the pitch frequency at the center of gravity of that first mora.

[0014] Also, in the speech synthesis method of the present invention, when the differences in pitch frequency at the center of gravity between the morae of an arbitrary character string are estimated from at least accent type parameter information or mora count parameter information, and the accent phrase containing the first mora of the character string is located at the head of a delimited range, the pitch difference estimation is performed with the accent type of the preceding accent phrase taken to be either the kifuku type or the heiban type.

[0015] Here, according to the Meikai Nihongo Akusento Jiten (明解日本語アクセント辞典), 2nd edition, edited by Kazue Akinaga, Sanseido, an accent phrase that contains a transition from a high-pitched mora to a low-pitched mora is called the kifuku (undulating) type, and an accent phrase that contains no such transition is called the heiban (flat) type.

[0016] Further, when the position of the accent phrase in the sentence is also used as parameter information and the differences in pitch frequency at the center of gravity between two adjacent morae are estimated on the basis of the differences in pitch frequency between the morae of sentences consisting of m accent phrases, the differences in pitch frequency for a sentence consisting of n (> m) accent phrases are estimated by treating the position in the sentence of each of the (m+1)-th through n-th accent phrases as m.

[0017]

[Embodiment] An embodiment of the present invention will be described with reference to FIGS. 1 to 9.

[0018] FIG. 1 is a block diagram showing an embodiment of the speech synthesis method of the present invention. Reference numeral 1 denotes a character code string input unit for inputting the character string to be synthesized by rule; 2, a morphological analysis unit that divides the character string into words and determines part-of-speech information; 3, a reading determination unit that determines the reading of each word; 4, a word dictionary storing word readings; 5, an accent determination unit that determines the accent based on the reading of each word; 6, an accent dictionary storing the accent of each word; and 7, a phrase determination unit that determines the phrases of the character string. Here, a phrase means a breath group, such as the span from the beginning of a sentence to a comma, from a comma to a comma, from a comma to a period, from one breath to the next, or from one pause to the next.

[0019] Reference numeral 8 denotes a pitch pattern generation unit that generates the pitch pattern of the character string; 9, a pitch difference estimation unit that estimates, using a neural network or a correspondence table, the difference between the pitch frequency at the vowel center of gravity of a mora of the character string and the pitch frequency at the vowel center of gravity of the mora preceding it; 10, a unit connection unit that concatenates speech units; 11, a unit table storing speech units; 12, a D/A converter; and 13, a loudspeaker.

[0020] FIG. 2 shows the result of morphological analysis of the input character string.

[0021] FIG. 3 shows the result of reading determination applied to the morphological analysis result of the input character string.

[0022] FIG. 4 shows the input character string divided into accent phrases. Reference numeral 41 denotes the second mora of the second accent phrase of the input character string, 42 the first accent phrase, and 43 the second accent phrase.

[0023] FIG. 5 shows the pitch pattern of the input character string. Here, an accent type in which the accent falls on the N-th mora of an accent phrase is called "type N". Accordingly, in the character string "michi o tazuneru" (みちをたずねる), the first accent phrase "michi o" has no accent and is therefore type 0, while the second accent phrase "tazuneru" has its accent on the third mora "ne" and is therefore type 3.

[0024] FIG. 6 is a correspondence table, built from a number of short one-phrase sentences each consisting of two accent phrases, that associates the difference in pitch frequency at the vowel center of gravity between two adjacent morae with each combination of the five items of parameter information.

[0025] FIG. 7 schematically shows the pitch difference estimation unit 9 implemented with a neural network. Reference numeral 71 denotes an input layer consisting of one layer of 5 units, 72 an intermediate layer consisting of three layers of 10 units each, and 73 an output layer consisting of one layer of 1 unit. Nonlinear processing by a sigmoid function is applied in the intermediate layer 72.
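The layer sizes described for FIG. 7 can be sketched as a small feed-forward network. This is only an illustration under stated assumptions: the text gives the unit counts and says the sigmoid is applied in the intermediate layer, but the linear output unit, the bias terms, the weight initialization range, and the names `init_network` and `forward` are assumptions introduced here.

```python
import math
import random

# 5 input units, three intermediate layers of 10 units, 1 output unit (FIG. 7).
LAYER_SIZES = [5, 10, 10, 10, 1]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(sizes, seed=0):
    """Return one (weights, biases) pair per layer transition.

    Assumption: small uniform random weights and zero biases.
    """
    rng = random.Random(seed)
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(network, features):
    """Map the five parameters to one estimated pitch difference."""
    activations = list(features)
    for index, (weights, biases) in enumerate(network):
        sums = [sum(w * a for w, a in zip(row, activations)) + bias
                for row, bias in zip(weights, biases)]
        is_output_layer = index == len(network) - 1
        # Sigmoid in the intermediate layers; linear output is an assumption.
        activations = sums if is_output_layer else [sigmoid(s) for s in sums]
    return activations[0]
```

With untrained weights the returned value is meaningless; the sketch only shows how five parameters flow through the three intermediate layers to a single output.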

[0026] FIG. 8 shows an example of the pitch pattern of training data used to train the neural network of the pitch difference estimation unit 9.

[0027] FIG. 9 shows the pitch pattern obtained when the neural network was trained on the differences in pitch frequency at the vowel center of gravity between the morae of one-phrase character strings consisting of two accent phrases, and a one-phrase character string consisting of three accent phrases was then input to the character code string input unit 1 and synthesized by rule.

[0028] First, the pitch difference estimation unit 9 of FIG. 7 is trained: training data derived from a variety of character strings, each a short one-phrase sentence consisting of two accent phrases, is input to the input layer 71.

[0029] Specifically, consider the one-phrase character string "taifuu ga kuru" (たいふうがくる) shown in FIG. 8, which consists of two accent phrases. The following describes how the neural network is trained on the difference in pitch frequency at the vowel center of gravity between the seventh mora "ru", taken as the mora of interest, and the sixth mora "ku" immediately before it.

[0030] In the neural network used by the pitch difference estimation unit 9 of this embodiment, the following five items of parameter information are input to the input layer 71:
1) the position in the sentence of the accent phrase containing the mora of interest;
2) the number of morae in the accent phrase containing the mora of interest;
3) the position of the mora of interest within its accent phrase;
4) the accent type of the accent phrase containing the mora of interest;
5) the accent type of the accent phrase preceding the accent phrase containing the mora of interest.

[0031] The neural network is trained so that, when training data composed of these five items of parameter information is input to the input layer 71, the output layer 73 outputs the difference between the pitch frequency at the vowel center of gravity of the mora of interest and that of the preceding mora.

[0032] Consider again the one-phrase character string "taifuu ga kuru", consisting of two accent phrases. The seventh mora "ru", the mora of interest, is the second mora of the second accent phrase, which has 2 morae and accent type 1. The parameter information supplied to the input layer 71 of the neural network is therefore: position in the sentence of the accent phrase containing the mora of interest, "2"; number of morae in that accent phrase, "2"; position of the mora of interest within the accent phrase, "2"; accent type of that accent phrase, "1"; and accent type of the preceding accent phrase, "3". The parameter information is thus "2, 2, 2, 1, 3".

[0033] Because this method estimates the difference in pitch frequency at the vowel center of gravity between the mora of interest and the mora preceding it, two questions arise: how to estimate the difference for the first mora of the first accent phrase (the first mora of the sentence), "ta", and how to treat the accent type of the preceding accent phrase when the mora of interest belongs to the first accent phrase.

[0034] In this embodiment, therefore, when estimating the difference in pitch frequency at the vowel center of gravity between the first mora "ta" of the first accent phrase and its preceding mora, the preceding mora is given the average pitch frequency of the first morae of those first accent phrases in the training data whose accent type is other than type 1, and the accent type of the accent phrase preceding the first accent phrase is taken to be type 1, which belongs to the kifuku type. The difference between this average value and the pitch frequency at the vowel center of gravity of the first mora of the first accent phrase is then obtained.

[0035] The average pitch frequency of the first morae of first accent phrases other than type 1 in the training data is adopted as the preceding mora because that average is empirically a low value. Type 1 is adopted as the accent type of the preceding accent phrase because the pitch frequency falls in the latter part of a type 1 accent phrase.

[0036] As a result, the transition from the assumed preceding mora to the first mora of the first accent phrase is free of unnaturalness and can be regarded as a natural utterance.

[0037] Thus, when the mora of interest is the first mora of the first accent phrase, the "accent type of the accent phrase preceding the accent phrase containing the mora of interest" input to the input layer 71 of the neural network of the pitch difference estimation unit 9 is set to type 1.

[0038] Alternatively, the preceding mora for the first mora of the first accent phrase (the first mora of the sentence) could be given, for example, the average pitch frequency of the first morae of only the type 1 first accent phrases in the training data. Because that average tends empirically to be high, the accent type of the preceding accent phrase is then set to, for example, the heiban type, in which the pitch falls little in the latter part, so that the transition from the assumed preceding mora to the first mora of the first accent phrase is again free of unnaturalness.

[0039] Accordingly, when the first mora "ta" of the character string "taifuu ga kuru" is the mora of interest, the five items of parameter information input to the input layer 71 are: position in the sentence of the accent phrase containing the mora of interest, "1"; number of morae in that accent phrase, "5"; position of the mora of interest within the accent phrase, "1"; accent type of that accent phrase, "3"; and accent type of the preceding accent phrase, "1"; that is, "1, 5, 1, 3, 1". Inputting these five parameters "1, 5, 1, 3, 1" to the input layer 71 of the neural network yields an estimate of the difference in pitch frequency at the vowel center of gravity between the mora of interest "ta" and the preceding mora.

[0040] The neural network is then trained as follows. Training data from a number of character strings with different accent types and mora counts, each expressed by the five parameters above, is input to the input layer 71, and the connection strengths (weights) between the units of the neural network are updated successively by error backpropagation. Training is repeated until the output of the network converges close to the desired difference in pitch frequency, that is, the pitch difference obtained from the pitch pattern of natural speech.
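A minimal sketch of error backpropagation for a network of this shape is given below. It is an illustration only: the squared-error loss, the linear output unit, the learning rate, the weight initialization, and the training target used in the usage note are assumptions, since the text does not specify them.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init(sizes, rng):
    # One (weights, biases) pair per layer transition.
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(net, x):
    # Returns the activations of every layer, the input included.
    acts = [list(x)]
    for i, (W, b) in enumerate(net):
        z = [sum(w * a for w, a in zip(row, acts[-1])) + bias
             for row, bias in zip(W, b)]
        acts.append(z if i == len(net) - 1 else [sigmoid(v) for v in z])
    return acts

def train_step(net, x, target, lr):
    """One backpropagation update; returns the loss before the update."""
    acts = forward(net, x)
    # Loss 0.5 * (y - target)^2 with a linear output unit.
    delta = [acts[-1][0] - target]
    for i in reversed(range(len(net))):
        W, b = net[i]
        a_prev = acts[i]
        if i > 0:
            # Error for the layer below, computed before W is changed.
            # a * (1 - a) is the derivative of the sigmoid.
            prev_delta = [sum(W[j][k] * delta[j] for j in range(len(W)))
                          * a_prev[k] * (1.0 - a_prev[k])
                          for k in range(len(a_prev))]
        for j in range(len(W)):
            for k in range(len(a_prev)):
                W[j][k] -= lr * delta[j] * a_prev[k]
            b[j] -= lr * delta[j]
        if i > 0:
            delta = prev_delta
    return 0.5 * (acts[-1][0] - target) ** 2
```

As a usage example, `net = init([5, 10, 10, 10, 1], random.Random(0))` followed by repeated calls to `train_step(net, (2, 2, 2, 1, 3), -0.2, 0.05)` drives the output for that parameter tuple toward the target; the value -0.2 is a made-up target, not a measurement from the patent.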

[0041] This determines the connection strengths between the units of the neural network.

[0042] The following describes the processing performed when the pitch difference estimation unit 9, built from the neural network whose final connection strengths between units have been determined, is used to synthesize by rule the untrained character string "michi o tazuneru" (道を尋ねる, "ask the way").

[0043] The character string input from the character code string input unit 1 is first divided into words by the morphological analysis unit 2, and the part of speech of each word is then determined as shown in FIG. 2.

[0044] Once the morphological analysis unit 2 has determined the parts of speech, the part-of-speech data is sent to the reading determination unit 3, where the reading of each word is determined by lookup in the word dictionary 4, as shown in FIG. 3.

[0045] Once the reading determination unit 3 has determined the readings, the word data is sent to the accent determination unit 5. The accent of each word is determined by lookup in the accent dictionary 6, accent combination is performed by rule to form accent phrases, and the accent of each accent phrase is determined. As a result, the character string "michi o tazuneru" is divided into two accent phrases, the first accent phrase "michi o" 42 and the second accent phrase "tazuneru" 43, as shown in FIG. 4.

[0046] After the accent phrases and accents have been determined, the phrase determination unit 7 determines the phrases. The character string of this embodiment forms a single phrase as a whole. By contrast, a character string such as "kouban made itte michi o tazuneru" (こうばんまでいってみちをたずねる) would be divided into two phrases, "kouban made itte" and "michi o tazuneru".

[0047] Next, the pitch difference estimation unit 9 estimates the pitch differences. For the character string of this embodiment, the fifth mora "zu", taken as the mora of interest, is the second mora 41 of the second accent phrase 43, which has 4 morae and accent type 3. The five items of parameter information input to the input layer 71 are therefore: position in the sentence of the accent phrase containing the mora of interest, "2"; number of morae in that accent phrase, "4"; position of the mora of interest within the accent phrase, "2"; accent type of that accent phrase, "3"; and accent type of the preceding accent phrase, "0"; that is, "2, 4, 2, 3, 0". Inputting these five parameters "2, 4, 2, 3, 0" to the input layer 71 of the neural network gives an estimate of "+0.147", expressed as a natural logarithm, for the difference in pitch frequency at the vowel center of gravity between the mora of interest "zu" and the preceding mora "ta".

[0048] In this way, for each mora of "michi o tazuneru", the difference between the pitch frequency at the vowel center of gravity of the mora of interest and that of the preceding mora is estimated, expressed as a natural logarithm and in order from the first mora, as "−0.061, 0.396, −0.224, −0.300, 0.147, −0.142, −0.320".

[0049] The pitch pattern generation unit 8 then estimates the differences in pitch frequency at the vowel center of gravity of each mora from the preset pitch frequencies at the start and end of the speech interval and the values estimated by the pitch difference estimation unit 9, and the point pitch pattern shown in FIG. 5 is generated.
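One way to turn the estimated differences into pitch values at each mora can be sketched as follows. This is an illustration under assumptions, not the generation procedure of the pitch pattern generation unit 8: it assumes each value is ln f(i) minus ln f(i−1), it uses only a starting pitch and ignores the preset end-of-interval pitch mentioned in the text, and the starting value of 120 Hz and the function name `pitch_points` are hypothetical.

```python
import math

# Estimated differences for "michi o tazuneru", first mora to last,
# as listed in the description.
LOG_PITCH_DIFFS = [-0.061, 0.396, -0.224, -0.300, 0.147, -0.142, -0.320]

def pitch_points(log_diffs, virtual_start_hz):
    """Accumulate log-pitch differences into one pitch value per mora.

    virtual_start_hz stands in for the pitch assumed before the first mora
    (hypothetical value supplied by the caller).
    """
    points = []
    log_f = math.log(virtual_start_hz)
    for diff in log_diffs:
        log_f += diff                    # ln f(i) = ln f(i-1) + diff(i)
        points.append(math.exp(log_f))   # back to Hz
    return points

points = pitch_points(LOG_PITCH_DIFFS, 120.0)
```

Working in the log domain means each difference acts as a ratio between neighbouring morae, so the same set of differences yields the same contour shape whatever starting pitch is chosen.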

[0050] Once the point pitch pattern has been generated, the unit connection unit 10 selects the speech units needed to utter the input character string from the unit table 11, which stores speech units such as CVC (consonant + vowel + consonant) units in advance (for example, as PARCOR coefficients or LSP coefficients), and concatenates them to produce speech data in the form of a digital signal. The speech data is converted to an analog signal by the D/A converter 12 and output from the loudspeaker 13 as rule-synthesized speech.

[0051] In the above, the neural network of the pitch difference estimation unit 9 was trained on character strings consisting of two accent phrases, and the character string actually synthesized by rule also consisted of two accent phrases. The following describes the case where the neural network is trained on two accent phrases as above and a character string consisting of three accent phrases is synthesized by rule.

[0052] FIG. 9 shows the estimated pitch pattern and the pitch pattern of natural speech when a character string consisting of three accent phrases, for example 「ゆかいななかまがあつまる」 ("yukai na nakama ga atsumaru", "cheerful friends gather"), is input to the character code symbol string input unit 1.

[0053] For the third accent phrase 「あつまる」 ("atsumaru"), among the five items of parameter information input to the neural network of the pitch difference estimation unit 9, the "position in the sentence of the accent phrase containing the target mora" is actually "3", but in this embodiment it is set to "2".

[0054] The reason is as follows. When the pitch frequency of human speech is observed, its contour has an upwardly convex, mountain-like shape, and experiments by the present inventors confirmed that setting the "position in the sentence of the accent phrase containing the target mora" to "2" gives a result closer to natural speech. For this reason, the position was set to "2".

[0055] Thus, the pitch frequency differences between the moras of a sentence made up of an untrained number of accent phrases can be estimated as follows. Provided the sentence lies within one phrase, suppose a neural network trained on the differences in vowel center-of-gravity pitch frequency between the moras of sentences with m accent phrases is used to estimate those differences for a sentence with n (> m) accent phrases. In that case, for each mora of the (m+1)-th through n-th accent phrases, the "position in the sentence of the accent phrase containing the target mora" is set to m for the estimation. This makes it possible to realize rule-synthesized speech that is close to natural speech.
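The position substitution described above amounts to capping the accent phrase position at m before it is fed to the estimator. A minimal sketch, with the hypothetical helper name `effective_phrase_position`:

```python
def effective_phrase_position(actual_position, trained_phrase_count):
    """Position passed to the estimator, capped at m (the trained phrase count)."""
    return min(actual_position, trained_phrase_count)

# Estimator trained with m = 2 accent phrases, sentence with n = 3 accent phrases.
positions = [effective_phrase_position(p, 2) for p in (1, 2, 3)]
```

With m = 2 the third accent phrase is presented to the estimator as if it were the second, as in the 「あつまる」 example.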

[0056] Furthermore, as the same figure shows, the estimated pitch frequency pattern of the third accent phrase differs from the pitch frequency of natural speech. For each mora of the (m+1)-th through n-th accent phrases, the estimate obtained with the "position in the sentence of the accent phrase containing the target mora" set to m can then be corrected by multiplying it by a coefficient, which brings the result still closer to natural speech.
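The coefficient correction can be sketched as a scaling applied only to accent phrases beyond the trained count. The text does not give the coefficient, so it is left as a parameter here; the function name `corrected_difference` is hypothetical.

```python
def corrected_difference(estimate, actual_position, trained_phrase_count, coefficient):
    """Scale the estimated log-pitch difference for phrases (m+1)..n only.

    `coefficient` is not specified in the text; the caller must supply it.
    """
    if actual_position > trained_phrase_count:
        return estimate * coefficient
    return estimate
```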

[0057] In the above embodiment, the pitch difference estimation unit 9, composed of a neural network, was trained on a plurality of short sentences while the connection strengths between units were updated, and the neural network storing the finally determined connection strengths was used to estimate the differences in vowel center-of-gravity pitch frequency between moras. The invention is not limited to this, however. It goes without saying that the five items of parameter information input to the neural network and the corresponding differences in vowel center-of-gravity pitch frequency between moras may instead be stored in a correspondence table as shown in FIG. 6.

[0058] In this case, taking the fifth mora 「ず」 ("zu") of the character string 「みちをたずねる」 ("michi o tazuneru") as before, the five items of parameter information are "2, 4, 2, 3, 0" (60). From these, the difference in vowel center-of-gravity pitch frequency between the target mora 「ず」 and the preceding mora 「た」 ("ta") is, in natural logarithm, "+0.147" (61), the same value as with the neural network.
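A correspondence table of this kind can be sketched as a dictionary keyed by the five parameter values. Only the single entry quoted in the text is filled in; the names `CORRESPONDENCE_TABLE` and `lookup_pitch_difference` are hypothetical, and the meaning of each of the five positions is taken as given.

```python
# Keys: the five parameter values. Values: natural-log pitch differences.
# Only the entry quoted in the text is included.
CORRESPONDENCE_TABLE = {
    (2, 4, 2, 3, 0): 0.147,
}

def lookup_pitch_difference(params, table=CORRESPONDENCE_TABLE):
    """Return the stored difference for a five-parameter key (KeyError if absent)."""
    return table[tuple(params)]
```

A full table would need one entry per parameter combination seen in the training data; how unseen combinations are handled is not described in the text.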

[0059] In this embodiment, the input layer 71 of the neural network is a single layer of 5 units and the intermediate layer 72 consists of three layers of 10 units each, but the numbers of units and layers are not limited to these.
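The layer sizes above can be illustrated with a small pure-Python forward pass. This is only a sketch of the shape: the single output unit, the sigmoid hidden activation, the linear output, and the random untrained weights are all assumptions not stated in the text.

```python
import math
import random

# 5 input units, three hidden layers of 10 units, and an assumed single output unit.
LAYER_SIZES = [5, 10, 10, 10, 1]

def init_weights(sizes, seed=0):
    """Random weights per layer; each unit holds n_in weights plus a trailing bias."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes, sizes[1:])
    ]

def forward(weights, inputs):
    """Propagate inputs through the layers (sigmoid hidden units, linear output)."""
    activations = list(inputs)
    for layer_index, layer in enumerate(weights):
        is_output = layer_index == len(weights) - 1
        next_activations = []
        for unit in layer:
            # zip stops at the last input, so unit[-1] is used only as the bias.
            z = unit[-1] + sum(w * a for w, a in zip(unit, activations))
            next_activations.append(z if is_output else 1.0 / (1.0 + math.exp(-z)))
        activations = next_activations
    return activations

weights = init_weights(LAYER_SIZES)
output = forward(weights, [2, 4, 2, 3, 0])  # untrained, so the value is meaningless
```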

[0060] In the above embodiment, the pitch frequency difference was estimated with attention to the vowel centers of gravity of two adjacent moras. The invention is not limited to this, however, and it goes without saying that the difference in pitch frequency at the center of gravity of each mora may be estimated instead.

[0061] Furthermore, in this embodiment, short sentences consisting of two accent phrases were used as training data to train the neural network and to create the correspondence table. The invention is not limited to this, and it goes without saying that sentences of three or more accent phrases may be used for training the neural network and creating the correspondence table.

[0062] In the above embodiment, the pitch pattern of a character string consisting of one phrase was generated using a neural network trained with five input parameters, or a correspondence table built from those five parameters. In place of some of these five parameters, or in addition to them, parameters relating to linguistic information may be adopted for training the neural network or building the correspondence table. Examples include whether the target mora is unvoiced, whether the target mora is a moraic nasal, whether the target mora is a palatalized (contracted) sound, whether the target mora has a voiced consonant, whether the consonant of the target mora is a fricative, a semivowel, a nasal, an affricate, or a plosive, what the part of speech of the word containing the target mora is, and whether the accent phrase containing the target mora is emphasized.
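One way such linguistic parameters could be appended to the five base parameters is as 0/1 flags. The sketch below covers only a few of the listed features, and the flag names and the function `build_input_vector` are hypothetical; the text does not specify an encoding.

```python
# Hypothetical names for a subset of the linguistic features listed in the text.
LINGUISTIC_FLAGS = (
    "is_unvoiced",
    "is_moraic_nasal",
    "is_palatalized",
    "has_voiced_consonant",
    "is_emphasized",
)

def build_input_vector(base_params, flags):
    """Append one 0/1 value per linguistic flag to the five base parameters."""
    return list(base_params) + [1 if flags.get(name, False) else 0
                                for name in LINGUISTIC_FLAGS]
```

For example, `build_input_vector([2, 4, 2, 3, 0], {"is_unvoiced": True})` yields a ten-element vector whose sixth element is 1.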

[0063] It is also possible to generate the pitch pattern of a character string consisting of a plurality of phrases by means of a neural network or a correspondence table trained using the five parameters, the above parameters relating to linguistic information, and parameters such as the phrase position or the number of phrases.

【００６４】[0064]

[Effects of the Invention] As described above, according to the present invention, when the difference in pitch frequency at the center of gravity between the moras of a character string is estimated and the target mora is located at the head of a divided range, a constant pitch frequency is assumed as the mora preceding the target mora, and the difference in center-of-gravity pitch frequency from this assumed mora is estimated. In addition, when the accent phrase containing the target mora is located at the head of a divided range, the pitch frequency difference is estimated with the accent type of the preceding accent phrase taken to be the undulating type or the flat type. Rule-synthesized speech can thereby be brought closer to natural speech.

[Brief description of the drawings]

FIG. 1 is a block diagram showing one embodiment of rule-based speech synthesis using the present invention.

FIG. 2 is a diagram showing the result of morphological analysis of an input character string.

FIG. 3 is a diagram showing the result of determining the reading of the morphological analysis result of the input character string.

FIG. 4 is a diagram showing the input character string in units of accent phrases.

FIG. 5 is a diagram showing the pitch pattern of the input character string.

FIG. 6 is a correspondence table used in the pitch difference estimation unit 9.

FIG. 7 is a configuration diagram of the neural network used in the pitch difference estimation unit 9.

FIG. 8 is an example of the pitch pattern of training data learned by the neural network of the pitch difference estimation unit 9.

FIG. 9 is the pitch pattern obtained when a sentence of three accent phrases is synthesized by rule using a neural network trained on training data of two accent phrases.

[Explanation of symbols]

1 Character code symbol string input unit
2 Morphological analysis unit
3 Reading determination unit
5 Accent determination unit
6 Accent dictionary
7 Phrase determination unit
8 Pitch pattern generation unit
9 Pitch difference estimation unit
10 Segment connection unit
11 Segment table
71 Input layer of the neural network
72 Intermediate layer of the neural network
73 Output layer of the neural network

Continuation of the front page
(56) References: JP-A-59-81697 (JP, A); JP-A-59-192293 (JP, A); JP-A-61-70597 (JP, A); JP-A-62-262100 (JP, A); JP-A-63-85797 (JP, A); JP-A-1-238697 (JP, A); JP-A-4-46396 (JP, A)
(58) Fields investigated (Int. Cl.⁷, DB name): G10L 13/06 - 13/08; G10L 21/04; G06F 15/18, 17/27

Claims

(57) [Claims]

1. A speech synthesis method comprising: a pitch difference estimation process for estimating the difference in pitch frequency at the center of gravity between two adjacent moras in accordance with at least accent type parameter information or parameter information on the number of moras; a pitch pattern generation process for generating a pitch pattern of an arbitrary character string on the basis of the difference in pitch frequency; and a speech segment connection process for generating speech data by reading out, from a speech segment memory storing speech segments composed of speech parameters necessary for speech synthesis, the speech segments necessary for uttering the character string and connecting them, wherein, when a certain mora is located at the head of a divided range, the pitch difference estimation process estimates the difference between a constant pitch frequency and the pitch frequency at the center of gravity of that mora.

2. The speech synthesis method according to claim 1, wherein the pitch difference estimation process is performed on the basis of a neural network trained to output, from its output layer, the difference in pitch frequency at the center of gravity between two adjacent moras when at least accent type parameter information or parameter information on the number of moras is input to its input layer, and the neural network has been trained to output, when a certain mora is located at the head of a divided range, the difference between a constant pitch frequency and the pitch frequency at the center of gravity of that mora.

3. The speech synthesis method according to claim 1, wherein the pitch difference estimation process estimates the difference in pitch frequency using a correspondence table that stores the difference in pitch frequency at the center of gravity between two adjacent moras in accordance with at least accent type parameter information or parameter information on the number of moras, and the correspondence table stores, for the case where a certain mora is located at the head of a divided range, the difference between a constant pitch frequency and the pitch frequency at the center of gravity of that mora.

4. A speech synthesis method comprising: a pitch difference estimation process for estimating the difference in pitch frequency at the center of gravity between two adjacent moras in accordance with at least parameter information on the accent type of the accent phrase preceding the accent phrase containing a certain mora, or parameter information on the number of moras; a pitch pattern generation process for generating a pitch pattern of an arbitrary character string on the basis of the difference in pitch frequency; and a speech segment connection process for generating speech data by reading out, from a speech segment memory storing speech segments composed of speech parameters necessary for speech synthesis, the speech segments necessary for uttering the character string and connecting them, wherein, when the accent phrase containing a certain mora is located at the head of a divided range, the accent type of the accent phrase preceding that accent phrase is taken to be the undulating type or the flat type.

5. The speech synthesis method according to claim 4, wherein the pitch difference estimation process is performed on the basis of a neural network trained to output, from its output layer, the difference in pitch frequency at the center of gravity between two adjacent moras when at least parameter information on the accent type of the accent phrase preceding the accent phrase containing a certain mora, or parameter information on the number of moras, is input to its input layer, and the neural network has been trained to output the difference in pitch frequency with the accent type of the preceding accent phrase taken to be the undulating type or the flat type when the accent phrase containing a certain mora is located at the head of a divided range.

6. The speech synthesis method according to claim 4, wherein the pitch difference estimation process estimates the difference in pitch frequency at the center of gravity between two adjacent moras using a correspondence table that stores that difference in accordance with at least parameter information on the accent type of the accent phrase preceding the accent phrase containing a certain mora, or parameter information on the number of moras, and the correspondence table stores, for the case where the accent phrase containing a certain mora is located at the head of a divided range, the difference in pitch frequency obtained with the accent type of the preceding accent phrase taken to be the undulating type or the flat type.

7. The speech synthesis method according to claim 1, wherein the divided range is a sentence.

8. The speech synthesis method according to claim 1, wherein the divided range is a phrase.

9. The speech synthesis method according to claim 1, wherein the divided range is an accent phrase.

10. The speech synthesis method according to claim 1, wherein the divided range is a word.

11. A speech synthesis method comprising: a pitch difference estimation process for estimating the difference in pitch frequency at the center of gravity between two adjacent moras on the basis of the pitch frequency differences between the moras of sentences composed of m accent phrases, in accordance with at least parameter information on the position of an accent phrase in the sentence, the accent type, or the number of moras; a pitch pattern generation process for generating a pitch pattern of an arbitrary character string on the basis of the difference in pitch frequency; and a speech segment connection process for generating speech data by reading out, from a speech segment memory storing speech segments composed of speech parameters necessary for speech synthesis, the speech segments necessary for uttering the character string and connecting them, wherein, when the pitch frequency differences of a sentence composed of n (> m) accent phrases are estimated, the position in the sentence of each of the (m+1)-th through n-th accent phrases is taken to be m, and the difference in pitch frequency at the center of gravity between the moras of the (m+1)-th through n-th accent phrases is estimated.

12. The speech synthesis method according to claim 11, wherein the pitch difference estimation process is performed on the basis of a neural network trained to output, from its output layer, the difference in pitch frequency at the center of gravity between two adjacent moras when, for each mora of a sentence composed of m accent phrases, at least parameter information on the position of the accent phrase in the sentence, the accent type, or the number of moras is input to its input layer.

13. The speech synthesis method according to claim 11, wherein the pitch difference estimation process is performed on the basis of a correspondence table that stores, for each mora of a sentence composed of m accent phrases, at least parameter information on the position of the accent phrase in the sentence, the accent type, or the number of moras.