JP3421964B2

JP3421964B2 - Articulatory parameter control speech synthesis method, its apparatus and program recording medium

Info

Publication number: JP3421964B2
Application number: JP00414298A
Authority: JP
Inventors: 剛岡留; 時彦鏑木; 雅彰誉田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1998-01-12
Filing date: 1998-01-12
Publication date: 2003-06-30
Anticipated expiration: 2018-01-12
Also published as: JPH11202897A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method, an apparatus, and a program recording medium that generate, from a given text and with high accuracy, a time series (that is, a motion trajectory) of the positions (articulatory parameters) of the articulators involved in uttering the corresponding symbol string, such as the lips, tongue, and jaw, and that perform high-quality speech synthesis based on that trajectory information.

[0002]

[Prior Art] Speech synthesis methods that combine a sound source model with vocal tract characteristics represent the vocal tract either (1) by a vocal tract cross-sectional area function or (2) by directly using the structure and movement of the articulators (see, for example, Furui, "Acoustic and Speech Engineering," Kindai Kagaku Sha, 1992).

[0003] The method that uses a vocal tract cross-sectional area function has the drawback that the function itself is difficult to identify. The method that directly uses the structure and movement of the articulators takes the previously identified features of each phoneme in a given phoneme string and generates the entire motion trajectory under a specific motion criterion. In that case, the features of each phoneme are extracted and identified from measured articulatory movement data, and the features of the vocal tract shape for each phoneme are used (Saltzman, E. L. and Munhall, K. G., "A dynamical approach to gestural patterning in speech production," Ecological Psychology 1, 333-382, 1989). That is, for example, the shapes of the tongue and lips are given as constraints. Kaburagi and Honda (Kaburagi, T. and Honda, M., "A model of articulator trajectory formation based on the motor tasks of vocal-tract shapes," Journal of the Acoustical Society of America, 99, 3154-3170, 1996) also use features of the vocal tract shape for each phoneme.

[0004] Furthermore, Kaburagi and Honda (cited above) took a weighted function of the change in force and the kinetic energy as a cost function and generated the motion trajectory that minimizes it. However, because the above prior art gives the features of each phoneme in the form of vocal tract shapes and connects them smoothly, the error between the generated trajectory and the observed trajectory becomes especially large for trajectories that in reality change abruptly.

[0005]

[Problem to Be Solved by the Invention] The phonemic identity of human speech is determined by the trajectories of the articulators. The problem to be solved by this invention is to generate, for a given phoneme symbol string, accurate motion trajectories of the articulators involved in uttering that string.

[0006]

[Means for Solving the Problem] In a method of synthesizing speech for a given text using articulatory parameters and a sound source generator, this invention is characterized in particular by how it generates accurate articulatory parameters, that is, motion trajectories. The trajectory generation method comprises a first step of generating the utterance time of each phoneme in the phoneme symbol string, that is, the time intervals between adjacent phonemes; a second step of generating the position, velocity, and acceleration of each point of the articulators at the utterance time of each phoneme in the string; and a third step of generating a smooth trajectory using the generated utterance times and the positions, velocities, and accelerations of each point of the articulators as constraints.

[0007] The third step uses the utterance time of each phoneme in the phoneme symbol string and the position, velocity, and acceleration of each point of the articulators as constraints, and generates the motion trajectory of the articulators so that the time integral of the jerk (the derivative of acceleration) of each point is minimized (Flash and Hogan, "The coordination of arm movements: an experimentally confirmed mathematical model," J. Neurosci., Vol. 5, 1688-1703, 1985).

[0008] The utterance time of each phoneme in the first step, and the position, velocity, and acceleration of each point of the articulators at the utterance of each phoneme in the second step, are taken from a database built from articulatory trajectory data measured with a magnetic sensor system at 11 points on the articulators (the upper lip, the lower lip, four equally spaced points on the tongue from the tip to the dorsum, and so on). The database consists of the following. 1. For each single phoneme, with no context considered at all, the mean (or median) of the position, velocity, and acceleration of each point of the articulators, together with the mean (or median) of the intervals to the preceding and following utterance times (referred to as the single-phoneme database).

[0009] 2. For each two-phoneme set, the mean (or median) of the position, velocity, and acceleration of each point of the articulators at the two phonemes that make up the set, together with the mean (or median) of the utterance interval between the two phonemes (referred to as the two-phoneme-set database). 3. For each three-phoneme set, the mean (or median) of the position, velocity, and acceleration of each point of the articulators at the three phonemes that make up the set, together with the mean (or median) of the utterance intervals between the two pairs of adjacent phonemes in the set (referred to as the three-phoneme-set database). Either all three of these pooled databases are prepared, or only the single-phoneme database and the two-phoneme-set database. The database may also be built using means for positions and time intervals and medians for velocities and accelerations.
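As a rough illustration of how a three-phoneme-set database of this kind might be pooled, the following Python sketch groups measured samples by three-phoneme set and stores the mean position, the median velocity and acceleration, and the mean intervals. The function name `build_triphone_db`, the record layout, and the restriction to one coordinate of one articulator point are assumptions made for the example, not details given in this description.

```python
from collections import defaultdict
from statistics import mean, median


def build_triphone_db(observations):
    """Pool measured samples into a three-phoneme-set database (illustrative).

    observations: iterable of (triphone, states, intervals), where
      triphone  is a 3-tuple of phoneme symbols,
      states    is a list of three (position, velocity, acceleration) tuples,
                one per phoneme, for a single coordinate of a single point,
      intervals is a 2-tuple of utterance intervals in msec.
    """
    grouped = defaultdict(list)
    for triphone, states, intervals in observations:
        grouped[triphone].append((states, intervals))

    db = {}
    for triphone, samples in grouped.items():
        slot_states = []
        for slot in range(3):
            # Mean for position, median for velocity and acceleration,
            # following the variant mentioned at the end of the paragraph.
            pos = mean(s[0][slot][0] for s in samples)
            vel = median(s[0][slot][1] for s in samples)
            acc = median(s[0][slot][2] for s in samples)
            slot_states.append((pos, vel, acc))
        gaps = tuple(mean(s[1][k] for s in samples) for k in range(2))
        db[triphone] = {"states": slot_states, "intervals": gaps}
    return db
```

A plain dictionary is used here because it gives hashed, constant-time lookup by three-phoneme set; a real database would hold one such record per coordinate of each of the 11 points.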

[0010]

[Embodiments of the Invention] Fig. 1 shows the functional configuration of an embodiment of the apparatus of this invention, and Fig. 2 shows the processing procedure of an embodiment of the method of this invention. Text entered at input terminal 11 is converted into a phoneme symbol string by phoneme conversion unit 12 with reference to phoneme conversion table 13. This phoneme symbol string is supplied to phoneme interval determination unit (timing determination unit) 14 and to motion state determination unit 15.

[0011] Phoneme interval determination unit 14 refers to articulator state database 16, generates the utterance time of each phoneme in the input phoneme symbol string, and determines the time intervals between adjacent phonemes. Motion state determination unit 15 refers to articulator state database 16 and generates the position, velocity, and acceleration of each point of the articulators (11 points in this example) at the utterance of each phoneme in the input phoneme symbol string. Articulator state database 16 consists of the single-phoneme database, the two-phoneme-set database, and the three-phoneme-set database described above.

[0012] The utterance time of each phoneme in the input phoneme symbol string and the position, velocity, and acceleration of each point of the articulators, output by phoneme interval determination unit 14 and motion state determination unit 15, are input to motion trajectory determination unit 17. Using these as constraints, unit 17 generates the motion trajectory of the articulators so that the time integral of the jerk (the derivative of acceleration) of each point is minimized. For each unit-time interval of the 11 motion trajectories thus obtained, speech synthesis unit 19 refers to speech waveform database 18, which records, for example, the speech waveforms corresponding to combinations of the 11 point trajectories over a unit time, and outputs the resulting speech waveform sequence, that is, the synthesized speech, to output terminal 21.

[0013] Next, an example is given in which this apparatus is applied to the utterance of a specific sentence. The example uses a three-phoneme-set database that holds, for each three-phoneme set, the mean position of each point of the articulators at each of the three phonemes making up the set, the median velocity and acceleration, and the mean utterance interval between the two pairs of adjacent phonemes in the set. The computer processing steps and the data flow are described for the sentence "Kanojo wa te no konda gochisou o tsukurimashita" ("She made an elaborate feast"). The input text is converted by phoneme conversion unit 12 into the phoneme symbol string shown in Fig. 3A.

[0014] Here, < denotes the start of the utterance and > denotes its end. Phoneme interval determination unit 14 scans the given phoneme symbol string one phoneme at a time, starting from the utterance start symbol <, and determines the time interval between each pair of adjacent phonemes by referring to database 16 for each three-phoneme set. For example, the time interval between a and n in kanojo is determined to be 72 msec, the average of the a-n time interval information in the three-phoneme sets kan and ano in database 16. In this way, the time intervals between phonemes shown in Fig. 3B are determined for the phoneme symbol string of Fig. 3A. The unit is msec.
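The interval determination described above can be sketched as follows. This is a minimal Python illustration, assuming the database is a dictionary keyed by three-phoneme set whose `"intervals"` entry holds the two adjacent-phoneme intervals in msec; the function name `interval_between` and the stored values 70.0 and 74.0 are hypothetical, chosen only so that their average reproduces the 72 msec of the kanojo example.

```python
def interval_between(db, phonemes, i):
    """Interval (msec) between phonemes[i] and phonemes[i + 1].

    Averages the interval information of the three-phoneme sets that
    contain the pair: p[i-1] p[i] p[i+1] and p[i] p[i+1] p[i+2].
    """
    candidates = []
    if i - 1 >= 0:
        entry = db.get(tuple(phonemes[i - 1:i + 2]))
        if entry is not None:
            candidates.append(entry["intervals"][1])  # pair is slots 1-2
    if i + 2 < len(phonemes):
        entry = db.get(tuple(phonemes[i:i + 3]))
        if entry is not None:
            candidates.append(entry["intervals"][0])  # pair is slots 0-1
    if not candidates:
        raise KeyError("no three-phoneme set covers this pair")
    return sum(candidates) / len(candidates)


# Hypothetical database entries for the sets kan and ano.
example_db = {
    ("k", "a", "n"): {"intervals": (55.0, 70.0)},
    ("a", "n", "o"): {"intervals": (74.0, 60.0)},
}
a_n_interval = interval_between(example_db, list("<kanojo>"), 2)  # 72.0
```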

[0015] When a three-phoneme set that is not in the three-phoneme-set database appears in the phoneme string, that set is split into two two-phoneme sets, and the time interval between the phonemes is obtained from the two-phoneme-set information in the two-phoneme-set database. If the two-phoneme set in question is not in the two-phoneme-set database either, the corresponding entry among the intervals to the preceding and following utterance times in the single-phoneme database is referred to.
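The fallback order described above (three-phoneme set, then two-phoneme set, then single phoneme) might look like the following Python sketch. The dictionary layouts, the function name `interval_with_backoff`, and the choice of the "following" interval of the left phoneme at the last stage are assumptions for illustration; the description does not specify which single-phoneme entry is the corresponding one.

```python
def interval_with_backoff(tri_db, di_db, mono_db, phonemes, i):
    """Interval (msec) between phonemes[i] and phonemes[i + 1] with fallback.

    tri_db : three-phoneme set -> (gap between 1st-2nd, gap between 2nd-3rd)
    di_db  : two-phoneme set   -> gap
    mono_db: phoneme           -> (gap to previous, gap to next)
    """
    left, right = phonemes[i], phonemes[i + 1]

    # 1. Average over the three-phoneme sets that contain the pair.
    found = []
    if i >= 1 and tuple(phonemes[i - 1:i + 2]) in tri_db:
        found.append(tri_db[tuple(phonemes[i - 1:i + 2])][1])
    if i + 2 < len(phonemes) and tuple(phonemes[i:i + 3]) in tri_db:
        found.append(tri_db[tuple(phonemes[i:i + 3])][0])
    if found:
        return sum(found) / len(found)

    # 2. Fall back to the two-phoneme set.
    if (left, right) in di_db:
        return di_db[(left, right)]

    # 3. Fall back to the single phoneme (assumed: its gap to the next phoneme).
    return mono_db[left][1]
```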

[0016] Motion state determination unit 15 determines the position, velocity, and acceleration at each phoneme for the 11 points of the articulators, taking the preceding and following context into account through consecutive three-phoneme sets. That is, for each p_i in the input phoneme symbol string p_1 p_2 ... p_n, it extracts from database 16 the position, velocity, and acceleration (p, v, a) of p_i in the set p_{i-2} p_{i-1} p_i, the position, velocity, and acceleration (p', v', a') of p_i in the set p_{i-1} p_i p_{i+1}, and the position, velocity, and acceleration (p'', v'', a'') of p_i in the set p_i p_{i+1} p_{i+2}, and takes their weighted sum, for example ((p, v, a) + 4·(p', v', a') + (p'', v'', a''))/6, as the motion state of phoneme p_i. Part of an example of the vertical position, velocity, and acceleration of the tongue tip at the utterance of each phoneme determined in this way is shown in Fig. 3C. For example, "n: 182.2, 15.0, 800.0" indicates that, at the utterance of phoneme n, the vertical position (y) of the tongue tip is 182.2 mm, its velocity is 15.0 mm/msec, and its acceleration is 800.0 mm/msec². These values are the weighted average (with weights of, for example, 1:4:1) of the vertical position, velocity, and acceleration at the utterance of "n" in each of the three three-phoneme sets "kan", "ano", and "noj" in the three-phoneme-set database.
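The 1:4:1 weighting above amounts to a small weighted average over three (position, velocity, acceleration) triples. The Python sketch below illustrates it; the function name `blend_states` and the three input triples in the usage comment are hypothetical values, chosen only so that the result matches the "n: 182.2, 15.0, 800.0" example.

```python
def blend_states(states, weights=(1, 4, 1)):
    """Weighted average of (position, velocity, acceleration) triples.

    states : triples for p_i taken from the sets p_{i-2} p_{i-1} p_i,
             p_{i-1} p_i p_{i+1}, and p_i p_{i+1} p_{i+2}, in that order.
    weights: one weight per triple; the default is the 1:4:1 example.
    """
    total = sum(weights)
    return tuple(
        sum(w * s[k] for w, s in zip(weights, states)) / total
        for k in range(3)
    )


# Hypothetical entries for "n" in the sets kan, ano, and noj.
n_state = blend_states([
    (181.0, 12.0, 700.0),
    (182.3, 15.0, 800.0),
    (183.0, 18.0, 900.0),
])
# n_state is approximately (182.2, 15.0, 800.0)
```

Because the weights are passed as a parameter, the same function also covers cases where fewer sets are available, for example `weights=(2, 1)` with two triples.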

[0017] Motion state determination unit 15 computes, in the manner described above, the position, velocity, and acceleration at the utterance of each phoneme in both the horizontal (x) and vertical (y) directions, not only for the vertical direction of the tongue tip but for all 11 measured points of the articulators. In this motion state determination too, when a three-phoneme set that is not in the three-phoneme-set database appears in the phoneme string, that set is split into two two-phoneme sets, and the position, velocity, and acceleration at the utterance of each phoneme are computed from the two-phoneme-set information in the two-phoneme-set database. If the two-phoneme set is not in the two-phoneme-set database either, the information in the single-phoneme database is used. Writing the 1:4:1 weighted sum above as p_{i-2} p_{i-1} p_i (1), p_{i-1} p_i p_{i+1} (4), p_i p_{i+1} p_{i+2} (1), the weights are changed as follows. When only two of the sets are in the three-phoneme-set database: p_{i-1} p_i p_{i+1} (2), p_i p_{i+1} p_{i+2} (1); or p_{i-2} p_{i-1} p_i (1), p_{i-1} p_i p_{i+1} (2). When there is only one set: p_{i-1} p_i p_{i+1} (1); or else p_{i-2} p_{i-1} p_i (1), p_i p_{i+1} p_{i+2} (1). When the three-phoneme-set database has no symmetric three-phoneme set, the two-phoneme-set database is used as well: p_{i-1} p_i (1), p_i p_{i+1} p_{i+2} (2); or p_{i-2} p_{i-1} p_i (2), p_i p_{i+1} (1); or p_{i-1} p_i (1), p_i p_{i+1} (1); or p_{i-2} p_{i-1} p_i (1) alone; or p_i p_{i+1} p_{i+2} (1) alone; or p_{i-1} p_i (1) alone; or p_i p_{i+1} (1) alone.

[0018] Motion trajectory determination unit 17 takes as constraints the phoneme time intervals obtained earlier and the horizontal and vertical position, velocity, and acceleration of each of the 11 points of the articulators at the utterance of each phoneme, and finds, for each of the 11 points, the trajectory that minimizes the time integral of the squared jerk given by the following expression.

$$\frac{1}{2}\int_{0}^{t_f}\left[\left(\frac{d^3x}{dt^3}\right)^2+\left(\frac{d^3y}{dt^3}\right)^2\right]dt \qquad (1)$$

Here, (x, y) are the coordinates of a point on the articulators, the time span [0, t_f] is divided at t = t_0, t_1, t_2, ..., t_n = t_f, and at each t_i (i = 0, 1, ..., n) the position, velocity, and acceleration of the point, x_i, x'_i, x''_i, y_i, y'_i, y''_i, are given. In general, for a cost function L[t, x, x', ..., d^n x/dt^n], the function x(t) that minimizes the time integral of L from T_1 to T_2 satisfies the Euler-Poisson equation. Solving this differential equation gives

$$x(t)=a_0+a_1t+a_2t^2+a_3t^3+a_4t^4+a_5t^5$$

$$y(t)=b_0+b_1t+b_2t^2+b_3t^3+b_4t^4+b_5t^5$$

Therefore, by giving the values x(T_1), x'(T_1), x''(T_1), x(T_2), x'(T_2), x''(T_2), y(T_1), y'(T_1), y''(T_1), y(T_2), y'(T_2), y''(T_2) as constraints, the coefficients a_0, ..., a_5, b_0, ..., b_5 can be determined uniquely. In this way, the trajectory that satisfies the values x, x', x'', y, y', y'' given at each time t_i, i = 0, ..., n, in the time span [0, t_f] and that minimizes L on each interval [t_i, t_{i+1}], i = 0, ..., n-1, is uniquely determined on each interval. The trajectory over the whole span [0, t_f] is obtained by joining these segments.
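For the integrand of expression (1), the Euler-Poisson equation reduces to d⁶x/dt⁶ = 0, which is why each coordinate is a fifth-order polynomial. The Python sketch below solves the six boundary conditions on one interval in closed form, using a local time s measured from the start of the interval. The function names `min_jerk_coeffs` and `evaluate` are illustrative and do not come from this description.

```python
from math import factorial


def min_jerk_coeffs(x0, v0, a0, x1, v1, a1, T):
    """Coefficients c0..c5 of x(s) = sum(c_k * s**k) for s in [0, T].

    The polynomial matches position, velocity, and acceleration
    (x0, v0, a0) at s = 0 and (x1, v1, a1) at s = T.
    """
    d = x1 - x0
    c0, c1, c2 = x0, v0, a0 / 2.0
    c3 = (20 * d - (8 * v1 + 12 * v0) * T - (3 * a0 - a1) * T**2) / (2 * T**3)
    c4 = (-30 * d + (14 * v1 + 16 * v0) * T + (3 * a0 - 2 * a1) * T**2) / (2 * T**4)
    c5 = (12 * d - 6 * (v1 + v0) * T - (a0 - a1) * T**2) / (2 * T**5)
    return (c0, c1, c2, c3, c4, c5)


def evaluate(coeffs, s, order=0):
    """Value of the order-th derivative of the polynomial at local time s."""
    return sum(
        c * (factorial(k) // factorial(k - order)) * s ** (k - order)
        for k, c in enumerate(coeffs)
        if k >= order
    )


# Rest-to-rest movement from 0 to 1 over unit time gives the familiar
# minimum-jerk profile 10 s^3 - 15 s^4 + 6 s^5.
unit_profile = min_jerk_coeffs(0, 0, 0, 1, 0, 0, 1)
```

The same function is applied separately to the x and y coordinates of each point, since the two terms of expression (1) are independent.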

[0019] The trajectory that minimizes the time integral of the jerk of each point is used because Flash and Hogan (1985) found that, in the task of moving the hand from one point to another, the trajectory that minimizes the time integral of the jerk of the hand agrees with the observed trajectory. That case concerns the motion of a link system with two degrees of freedom, for which the minimum-jerk model determines the link motion uniquely. For a system with three or more degrees of freedom, indeterminacy remains.

[0020] Therefore, as described above, a per-point jerk model is used, in which each point of the articulators follows the trajectory that independently minimizes the time integral of its own jerk, subject to the constraints that contiguous regions always remain contiguous and that the parts of the system for which the rigid-body assumption holds always preserve that assumption. In other words, the position, velocity, and acceleration are set for each phoneme, the time intervals between phonemes are set, and with these used as constraints the motion trajectory of each articulator is expressed, as described above, by a fifth-order polynomial in time.
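Joining the per-interval polynomials into one trajectory can be sketched as below for a single coordinate of a single point. This is a Python illustration under stated assumptions: each knot is a tuple (time, position, velocity, acceleration), and the names `_quintic` and `make_trajectory` as well as the knot values in the usage comment are hypothetical.

```python
from bisect import bisect_right


def _quintic(x0, v0, a0, x1, v1, a1, T):
    """Quintic coefficients matching (x, v, a) at local times 0 and T."""
    d = x1 - x0
    return (
        x0,
        v0,
        a0 / 2.0,
        (20 * d - (8 * v1 + 12 * v0) * T - (3 * a0 - a1) * T**2) / (2 * T**3),
        (-30 * d + (14 * v1 + 16 * v0) * T + (3 * a0 - 2 * a1) * T**2) / (2 * T**4),
        (12 * d - 6 * (v1 + v0) * T - (a0 - a1) * T**2) / (2 * T**5),
    )


def make_trajectory(knots):
    """Return x(t) built from one quintic per interval between knots.

    knots: list of (t, x, v, a) sorted by t, one entry per phoneme.
    """
    times = [k[0] for k in knots]
    segments = [
        _quintic(*knots[i][1:], *knots[i + 1][1:], times[i + 1] - times[i])
        for i in range(len(knots) - 1)
    ]

    def x_of_t(t):
        # Pick the interval containing t, clamping to the first and last.
        i = min(max(bisect_right(times, t) - 1, 0), len(segments) - 1)
        s = t - times[i]
        return sum(c * s**k for k, c in enumerate(segments[i]))

    return x_of_t


# Three hypothetical knots, 72 msec and 78 msec apart.
trajectory = make_trajectory([
    (0.0, 0.0, 0.0, 0.0),
    (72.0, 5.0, 0.0, 0.0),
    (150.0, 2.0, 0.0, 0.0),
])
```

Because adjacent segments share the same position, velocity, and acceleration at each knot, the joined trajectory is continuous up to the second derivative.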

[0021]

[Effects of the Invention] Experiments have confirmed that when the same person reads the same sentence aloud several times, the average error among the articulator trajectories is 1.50 mm to 2.00 mm. With this invention, when the three-phoneme-set database is used, the average error between the articulator trajectories of the synthesized speech (predicted trajectories) and the observed trajectories ranges from 1.50 mm to at most 1.99 mm. This can be regarded as very high estimation accuracy. When the two-phoneme-set database is used, the average error ranges from 1.60 mm to at most 2.20 mm, and when the single-phoneme database is used, it ranges from 1.90 mm to at most 2.60 mm. These are not particularly poor estimation accuracies either.
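The description does not define how the average error is computed. One plausible reading, offered here only as an assumption, is the mean Euclidean distance between predicted and observed positions sampled at the same instants. A minimal Python sketch with the hypothetical name `mean_trajectory_error`:

```python
from math import hypot


def mean_trajectory_error(predicted, observed):
    """Mean Euclidean distance between two equally sampled trajectories.

    predicted, observed: lists of (x, y) positions in mm taken at the
    same time instants for one articulator point.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("trajectories must be non-empty and equally long")
    distances = [
        hypot(px - ox, py - oy)
        for (px, py), (ox, oy) in zip(predicted, observed)
    ]
    return sum(distances) / len(distances)
```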

[0022] Fig. 3D shows, for the sentence "Kono hon wa, tadaima shinagire desu" ("This book is currently out of stock"), the vertical trajectories of the observation points on the articulators as predicted by the per-point minimum-jerk model (solid lines) together with the observed trajectories (broken lines). The figure shows that the trajectories predicted by the per-point minimum-jerk model agree well, qualitatively, with the observed trajectories. In this invention, for a given phoneme string, the utterance time of each phoneme is computed in constant-order time using hashing (an algorithm for picking one item out of many), and the position, velocity, and acceleration of each point of the articulators at the utterance of each phoneme are likewise computed in constant-order time using hashing. With these utterance times and the positions, velocities, and accelerations as constraints, the minimum-jerk trajectory of each point on the articulators is a piecewise fifth-order polynomial in time whose coefficients are obtained by linear calculation. As a result, the trajectory of each point on the articulators can be generated in real time for a given phoneme string.

[0023] By using this invention it is possible, for example, to reduce the amplitude of the trajectories of all points of the articulators by a fixed ratio, that is, to leave x in the computed position coordinates of each point's trajectory unchanged while gradually shrinking the trajectory in the y direction over time at a fixed rate, so that each point gradually moves from its pre-control position toward another articulator. It is also possible to stretch or compress the time intervals between phonemes, for example shortening them gradually so that the speech shifts from slow to fast. A kind of speech morphing can thus be performed easily.
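The two manipulations mentioned above can be sketched in Python as follows. The linear ramps, the function names `scale_intervals` and `shrink_vertical`, and the `baseline` parameter are assumptions made for the illustration; the description states only that the change is gradual.

```python
def _ramp(k, n, start, end):
    """Scale factor for item k of n, moving linearly from start to end."""
    return start if n <= 1 else start + (end - start) * k / (n - 1)


def scale_intervals(intervals, start=1.0, end=0.5):
    """Gradually stretch or compress inter-phoneme intervals (msec)."""
    n = len(intervals)
    return [d * _ramp(k, n, start, end) for k, d in enumerate(intervals)]


def shrink_vertical(samples, final_ratio, baseline=0.0):
    """Gradually shrink the y amplitude of a trajectory, leaving x unchanged.

    samples    : list of (x, y) positions in time order.
    final_ratio: y scale reached at the last sample (1.0 at the first).
    baseline   : y value about which the amplitude is scaled (assumed 0).
    """
    n = len(samples)
    return [
        (x, baseline + (y - baseline) * _ramp(k, n, 1.0, final_ratio))
        for k, (x, y) in enumerate(samples)
    ]
```

For instance, `scale_intervals([100.0, 100.0, 100.0])` shortens the intervals step by step, which corresponds to moving from slow to fast speech before the trajectory is regenerated from the new knot times.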

[Brief description of drawings]

[Fig. 1] A block diagram showing the functional configuration of an embodiment of the speech synthesis apparatus according to this invention.

[Fig. 2] A flowchart showing the main part of the method according to this invention.

[Fig. 3] A shows an example of the output of the phoneme interval determination unit, B shows an example of the output of motion state determination unit 15, and D shows an example of a predicted trajectory (solid line) determined by motion trajectory determination unit 17 and an observed trajectory (broken line).

Continuation of front page. (56) References: JP-A-7-5897 (JP, A); JP-A-2-234285 (JP, A). (58) Fields searched (Int. Cl.7, DB name): G10L 19/00

Claims

(57) [Claims]

1. An articulatory parameter control speech synthesis method in which a given text is converted into a phoneme symbol string, a time series (motion trajectory) of the positions (articulatory parameters) of the articulators involved in uttering the text is generated from the phoneme symbol string, and speech is synthesized by controlling the spectral characteristics of the speech using the articulatory parameters, the method comprising: a first step of generating, from the phoneme symbol string, the utterance time of each phoneme in the string; a second step of generating the position, velocity, and acceleration of each point of the articulators at the utterance of each phoneme in the string; and a third step of generating a trajectory using, as constraints, the generated utterance time of each phoneme in the string and the position, velocity, and acceleration of each point of the articulators.

2. The articulatory parameter control speech synthesis method according to claim 1, wherein the third step is a step of performing a computation that minimizes $\tfrac{1}{2}\int_{t_1}^{t_2}\{(d^3x/dt^3)^2+(d^3y/dt^3)^2\}\,dt$, where (x, y) are the position coordinates of each point of the articulators, using as constraints the position, velocity, and acceleration obtained in the second step for the phoneme symbol p_i at the utterance time t_i (i = 0, 1, ..., n-1) obtained in the first step.

3. The articulatory parameter control speech synthesis method according to claim 2, wherein the minimizing computation uses the position x_i, y_i, the velocity v_{xi}, v_{yi}, and the acceleration a_{xi}, a_{yi} of p_i obtained in the second step, together with the position x_{i+1}, y_{i+1}, the velocity v_{xi+1}, v_{yi+1}, and the acceleration a_{xi+1}, a_{yi+1} of p_{i+1}, to obtain by linear calculation the coefficients of piecewise fifth-order polynomials x(t) and y(t) in time, and x(t) and y(t) are taken as the trajectory in the interval [t_1, t_2].

4. The articulatory parameter control speech synthesis method according to claim 2 or 3, wherein a database is prepared that stores, for each three-phoneme set and for each measured position of the articulators, the position, velocity, and acceleration of the three phonemes forming the set and the utterance time intervals between its pairs of adjacent phonemes, and in the first step the three-phoneme sets p_{i-1} p_i p_{i+1} and p_i p_{i+1} p_{i+2} are selected from the database, the average of their time intervals for the two phonemes p_i p_{i+1} is taken as the interval between the utterance times of the phonemes p_i and p_{i+1}, and the utterance times of the phoneme symbols are obtained from these intervals.

5. The articulatory parameter control speech synthesis method according to claim 4, wherein in the second step, for a phoneme symbol p_i in the phoneme symbol string, the three-phoneme sets p_{i-2} p_{i-1} p_i, p_{i-1} p_i p_{i+1}, and p_i p_{i+1} p_{i+2} are selected from the database, and the corresponding data of the three selected sets are weighted and averaged to obtain each position, velocity, and acceleration.

6. The articulatory parameter control speech synthesis method according to claim 2, wherein a database is prepared that stores, for each two-phoneme set and for each measured position of the articulators, the position and velocity of the two phonemes forming the set and the utterance time interval between the two phonemes, and in the first step the two-phoneme set of the phonemes p_i and p_{i+1} is selected from the database, the time interval of that set is taken as the interval between the utterance times of the phonemes p_i and p_{i+1}, and the utterance times of the phoneme symbols are obtained from these intervals.

7. The articulatory parameter control speech synthesis method according to claim 6, wherein in the second step, for a phoneme symbol p_i in the phoneme symbol string, the two-phoneme sets p_{i-1} p_i and p_i p_{i+1} are selected from the database, and the corresponding selected data are averaged to obtain each position, velocity, and acceleration.

8. An articulatory parameter control speech synthesis apparatus that converts a given text into a phoneme symbol string, generates from the phoneme symbol string a time series (motion trajectory) of the positions (articulatory parameters) of the articulators involved in uttering the text, and synthesizes speech by controlling the spectral characteristics of the speech using the articulatory parameters, the apparatus comprising: a database that stores, for each phoneme, measured data of the position, velocity, and acceleration of each point of the articulators at utterance and of the utterance time intervals; phoneme interval determining means that receives the phoneme symbol string and determines the time interval between adjacent phonemes for each phoneme symbol by referring to the database; motion state determining means that receives the phoneme symbol string and determines the position, velocity, and acceleration for each phoneme symbol by referring to the database; and motion trajectory determining means that, for each point of the articulators, uses the position, velocity, and acceleration as constraints and finds the trajectory that minimizes $\tfrac{1}{2}\int_{0}^{t_f}\{(d^3x/dt^3)^2+(d^3y/dt^3)^2\}\,dt$, where (x, y) are the position coordinates, over the determined inter-phoneme time interval (0 to t_f).

9. A computer-readable recording medium for an apparatus that converts input text into a phoneme symbol string, generates from the phoneme symbol string a time series (motion trajectory) of the positions (articulatory parameters) of the articulatory organs involved in uttering the text, and synthesizes speech by controlling the spectral characteristics of the speech using the articulatory parameters, the apparatus being provided with a database that stores, for each phoneme, measured data of the position, velocity, acceleration, and utterance time interval of each point of the articulatory organs at the time of utterance, the recording medium recording a program that generates the articulatory parameters, wherein the program performs: an inter-phoneme determination process that determines the time interval between adjacent phonemes for each phoneme symbol of the phoneme symbol string by referring to the database; a motion state determination process that determines the position, velocity, and acceleration of each phoneme symbol of the phoneme symbol string by referring to the database; and a motion trajectory determination process that, for each point, takes the position, velocity, and acceleration determined in the motion state determination process as constraint conditions and obtains the trajectory that minimizes $\frac{1}{2}\int_{0}^{t_f}\left\{\left(\frac{d^3x}{dt^3}\right)^2+\left(\frac{d^3y}{dt^3}\right)^2\right\}dt$ over the inter-phoneme time interval (0 to $t_f$) determined in the inter-phoneme determination process, where $(x, y)$ are the position coordinates.

10. The recording medium according to claim 9, wherein the motion trajectory determination process, for each pair of phoneme symbols $p_i$ and $p_{i+1}$ ($i = 1, 2, \ldots$) in the phoneme symbol string $p_1, p_2, \ldots$, uses the positions $x_i$, $y_i$, $x_{i+1}$, $y_{i+1}$, the velocities $v_{xi}$, $v_{yi}$, $v_{xi+1}$, $v_{yi+1}$, and the accelerations $a_{xi}$, $a_{yi}$, $a_{xi+1}$, $a_{yi+1}$ obtained in the motion state determination process to obtain, by linear calculation, the coefficients of piecewise quintic polynomials $x(t)$, $y(t)$ in time, and takes $x(t)$ and $y(t)$ as the minimizing trajectory over the interval $[t_i, t_{i+1}]$.
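The quintic fit in claim 10 can be sketched as below. A quintic arises because the Euler-Lagrange condition for the squared-jerk integral is that the sixth derivative of the coordinate vanishes, so six boundary values (position, velocity, acceleration at both ends) fix six coefficients through a linear system. The sketch uses the closed-form solution of that system rather than a general solver; the function names and the `pos`/`vel`/`acc` state layout are assumptions made for illustration and do not come from the patent.

```python
def quintic_coeffs(x0, v0, a0, x1, v1, a1, T):
    """Coefficients c0..c5 of x(tau) = sum(c_k * tau**k) on [0, T] that match
    position, velocity, and acceleration at tau = 0 and tau = T."""
    d = x1 - x0
    c0, c1, c2 = x0, v0, a0 / 2.0  # fixed directly by the start conditions
    # Closed-form solution of the remaining 3x3 linear system (end conditions).
    c3 = (20 * d - (8 * v1 + 12 * v0) * T - (3 * a0 - a1) * T**2) / (2 * T**3)
    c4 = (-30 * d + (14 * v1 + 16 * v0) * T + (3 * a0 - 2 * a1) * T**2) / (2 * T**4)
    c5 = (12 * d - 6 * (v1 + v0) * T - (a0 - a1) * T**2) / (2 * T**5)
    return [c0, c1, c2, c3, c4, c5]


def evaluate(coeffs, tau):
    """Evaluate the polynomial at local time tau."""
    return sum(c * tau**k for k, c in enumerate(coeffs))


def segment_trajectory(state_i, state_next, t_i, t_next):
    """Return a function t -> (x, y) valid on [t_i, t_next].
    Each state is a dict with 2-tuples under 'pos', 'vel', 'acc' (assumed layout)."""
    T = t_next - t_i
    cx = quintic_coeffs(state_i["pos"][0], state_i["vel"][0], state_i["acc"][0],
                        state_next["pos"][0], state_next["vel"][0],
                        state_next["acc"][0], T)
    cy = quintic_coeffs(state_i["pos"][1], state_i["vel"][1], state_i["acc"][1],
                        state_next["pos"][1], state_next["vel"][1],
                        state_next["acc"][1], T)
    return lambda t: (evaluate(cx, t - t_i), evaluate(cy, t - t_i))
```

Calling `segment_trajectory` once per adjacent phoneme pair and concatenating the results gives a piecewise trajectory that is continuous in position, velocity, and acceleration at every phoneme time point, because neighbouring segments share the same boundary state.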

11. The recording medium according to claim 9 or 10, wherein the database stores, for each three-phoneme set, the position, velocity, and acceleration of each of the three phonemes and the utterance time intervals between the phonemes, and wherein the motion state determination process, for a phoneme symbol $p_i$, selects the three-phoneme sets $p_{i-2} p_{i-1} p_i$, $p_{i-1} p_i p_{i+1}$, and $p_i p_{i+1} p_{i+2}$ from the database and takes a weighted average of the corresponding data of the three selected sets to obtain the position, velocity, and acceleration.
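The weighted averaging in claim 11 might look like the sketch below. The claim does not state the weights, so the default `(0.25, 0.5, 0.25)` is purely an assumed example, as are the database layout (`states` holding one entry per phoneme of the set) and the choice to renormalize over whichever sets fit inside the string near its ends.

```python
def triphone_state(phonemes, i, db, weights=(0.25, 0.5, 0.25)):
    """Weighted average of the data for p_i taken from the three-phoneme sets
    (p_{i-2} p_{i-1} p_i), (p_{i-1} p_i p_{i+1}), (p_i p_{i+1} p_{i+2}).
    The weights and the db layout are illustrative assumptions."""
    # (start index of the set in the string, slot of p_i inside that set)
    candidates = [(i - 2, 2), (i - 1, 1), (i, 0)]
    sums = {"pos": [0.0, 0.0], "vel": [0.0, 0.0], "acc": [0.0, 0.0]}
    total = 0.0
    for (start, slot), w in zip(candidates, weights):
        if start < 0 or start + 3 > len(phonemes):
            continue  # this context runs off the string (assumed handling)
        entry = db[tuple(phonemes[start:start + 3])]["states"][slot]
        for name in sums:
            sums[name][0] += w * entry[name][0]
            sums[name][1] += w * entry[name][1]
        total += w
    if total == 0.0:
        raise ValueError("no three-phoneme context available for this index")
    # Renormalize so that missing contexts do not shrink the result.
    return {name: (v[0] / total, v[1] / total) for name, v in sums.items()}
```

With all three contexts present and weights summing to one, the renormalization is a no-op and the result is the plain weighted mean.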

12. The recording medium, wherein the inter-phoneme determination process selects the three-phoneme sets $p_{i-1} p_i p_{i+1}$ and $p_i p_{i+1} p_{i+2}$ from the database and averages the time intervals between the phonemes $p_i$ and $p_{i+1}$ in the selected sets to obtain the inter-phoneme time interval for $p_i p_{i+1}$.

13. The recording medium according to claim 11 or 12, wherein the database stores, in addition to the various three-phoneme sets, the position, velocity, and acceleration of each of the two phonemes and the inter-phoneme time interval for each two-phoneme set, and wherein, when no corresponding three-phoneme set exists in the inter-phoneme determination process and the motion state determination process, the corresponding two-phoneme set is selected from the database and used.
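The interval averaging over three-phoneme sets, combined with the two-phoneme fallback of claim 13, can be sketched as follows. The database layouts and the function name are assumptions, and so is the fallback policy: here the two-phoneme set is consulted only when neither three-phoneme set is found, which is one possible reading of the claim.

```python
def inter_phoneme_interval(phonemes, i, triphone_db, diphone_db):
    """Time interval between p_i and p_{i+1}.
    triphone_db: (a, b, c) -> {"intervals": (a_to_b, b_to_c), ...}  (assumed)
    diphone_db:  (a, b)    -> {"interval": a_to_b, ...}             (assumed)"""
    found = []
    # (start index of the set, which of its two intervals spans p_i -> p_{i+1})
    for start, slot in ((i - 1, 1), (i, 0)):
        if start < 0 or start + 3 > len(phonemes):
            continue  # context runs off the string
        entry = triphone_db.get(tuple(phonemes[start:start + 3]))
        if entry is not None:
            found.append(entry["intervals"][slot])
    if found:
        return sum(found) / len(found)  # average over the sets that exist
    # Fallback: no three-phoneme set available, use the two-phoneme set.
    return diphone_db[(phonemes[i], phonemes[i + 1])]["interval"]
```

A stricter policy, such as falling back whenever either three-phoneme set is missing, would only change the `if found:` test.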