JP4536464B2

JP4536464B2 - Speech synthesis apparatus and method

Info

Publication number: JP4536464B2
Application number: JP2004260782A
Authority: JP
Inventors: 定男廣谷; 岳美持田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2004-09-08
Filing date: 2004-09-08
Publication date: 2010-09-01
Anticipated expiration: 2024-09-08
Also published as: JP2006078641A

Description

The present invention relates to a speech synthesis apparatus and method that generate stimulus speech based on a probabilistic dynamic model of articulatory movement trajectories at at least some positions of a plurality of articulators, and that output, as the stimulus speech, the relationship between the articulator positions, changes in those positions, and the resulting amount of change in formant frequency.

Stimulus speech for auditory psychophysical experiments that measure the discrimination threshold of the formant frequencies of vowels, or of syllables containing consonants, has conventionally been created as follows.
First, a speech signal of an isolated vowel uttered by a human is recorded, and the three lowest formant frequencies, their bandwidths, and the fundamental frequency are measured from the signal. Next, a reference stimulus is created by setting these values as the parameters of a Klatt formant synthesizer, together with a stimulus in which the formant frequency under test is shifted either upward or downward from the measured value. The discrimination threshold of the formant frequency is then measured with a psychophysical procedure such as the method of constant stimuli or the transformed up-down method (see, for example, Non-Patent Document 1).

However, in the technique disclosed in Non-Patent Document 1, the measured formant frequencies, bandwidths, and fundamental frequency do not explicitly take into account the constraints of the human vocal organs or of articulatory movements.
As a result, a stimulus created by shifting the formant frequency under test up or down by a fixed amount does not necessarily lie within the range that a human can actually produce, so the experiment may be using stimuli that do not sound like speech. Moreover, when only isolated vowels or single syllables containing consonants are measured, it is difficult to create stimulus speech that reflects the influence of the surrounding phonemic context, such as coarticulation.

Meanwhile, one method of generating a speech signal from articulatory movements is based on searching an articulatory-acoustic pair codebook (see, for example, Patent Document 1 and Non-Patent Document 2). Methods for building a dynamic model of articulatory movements and generating the corresponding speech signal include one based on an HMM (Hidden Markov Model) speech production model and an articulatory-to-acoustic mapping (see, for example, Patent Document 2, Patent Document 3, and Non-Patent Document 3).
Patent Document 1: Japanese Patent No. 3412798
Patent Document 2: JP 2003-241776 A
Patent Document 3: JP 2003-271186 A
Non-Patent Document 1: D. Kewley-Port and C. S. Watson, "Formant-frequency discrimination for isolated English vowels," J. Acoust. Soc. Am., vol. 95, no. 1, 1994.
Non-Patent Document 2: Kaburagi, Honda, and Tsumura, "A study of speech synthesis from articulatory movements based on a search of a phoneme-labeled articulatory-acoustic pair codebook," Journal of the Acoustical Society of Japan, vol. 54, no. 3, 1998.
Non-Patent Document 3: S. Hiroya and M. Honda, "Estimation of articulatory movements from speech acoustics using an HMM-based speech production model," IEEE Trans. Speech and Audio Processing, vol. 12, no. 2, 2004.

Conventionally, given an articulatory position, the amount of change in formant frequency with respect to a change in that position has been calculated by moving the position of each articulator in every possible direction, without considering the constraints of speech production. Studies of the effect of such formant frequency changes therefore yield only qualitative results, which makes it difficult to use the relationship between a change in articulatory position and the amount of change in formant frequency as stimulus speech.
In addition, the movements of the actual articulators are correlated with one another, so "moving the position of an articulator in every possible direction," as done in previous studies, does not adequately account for the dynamic behavior of articulatory movement.

The present invention has been made in view of the above circumstances. Because human speech is produced under both static and dynamic constraints of the articulators, an object of the invention is to provide a speech synthesis apparatus and method that make it possible to use the relationship between a change in articulatory position and the amount of change in formant frequency as stimulus speech created under the constraints of articulatory movement.

To solve the above problem, the present invention provides a speech synthesis apparatus that generates and outputs stimulus speech, comprising:
means for learning a state sequence of articulatory parameter vectors using a probabilistic dynamic model of articulatory movement trajectories;
means for interpolating between the mean articulatory positions of the states in the probabilistic dynamic model of articulatory movement trajectories;
means for calculating a formant frequency, a bandwidth, and a fundamental frequency for each articulatory position obtained by the interpolation, with reference to an articulatory-acoustic pair codebook that stores pairs of speech parameter vectors and articulatory parameter vectors;
means for calculating, for each articulatory position obtained by the interpolation and the calculated formant frequency, the amount of change in formant frequency with respect to a change in articulatory position, by dividing a linear regression coefficient, computed from the formant frequency values calculated for 10 points of articulatory position, by the mean squared distance of the articulatory parameter vector that results from the articulatory position changing by those 10 points in the direction of movement;
means for generating and storing a correspondence table that associates, for each articulatory position, the formant frequency, the bandwidth, the fundamental frequency, and the amount of change in formant frequency; and
means for referring to the correspondence table, selecting by exhaustive search the data set (formant frequency, bandwidth, fundamental frequency, and amount of change in formant frequency) that corresponds to an input formant frequency, bandwidth, and fundamental frequency, and generating the stimulus speech based on the selected data set.

Further, in the present invention, the means for interpolating generates the probabilistic dynamic model of articulatory movement trajectories using a hidden Markov model.

Further, in the present invention, the means for interpolating selects the direction of movement at each articulatory position based on the state transitions of the hidden Markov model.

The present invention also provides a speech synthesis method in which a processing unit generates and outputs stimulus speech, the processing unit executing the steps of:
learning a state sequence of articulatory parameter vectors using a probabilistic dynamic model of articulatory movement trajectories;
interpolating between the mean articulatory positions of the states in the probabilistic dynamic model of articulatory movement trajectories;
calculating a formant frequency, a bandwidth, and a fundamental frequency for each interpolated articulatory position, with reference to an articulatory-acoustic pair codebook that stores pairs of speech parameter vectors and articulatory parameter vectors;
calculating, for each articulatory position obtained by the interpolation and the calculated formant frequency, the amount of change in formant frequency with respect to a change in articulatory position, by dividing a linear regression coefficient, computed from the formant frequency values calculated for 10 points of articulatory position, by the mean squared distance of the articulatory parameter vector that results from the articulatory position changing by those 10 points in the direction of movement;
creating and storing a correspondence table that associates, for each articulatory position, the formant frequency, the bandwidth, the fundamental frequency, and the amount of change in formant frequency; and
referring to the correspondence table, selecting by exhaustive search the data set (formant frequency, bandwidth, fundamental frequency, and amount of change in formant frequency) that corresponds to an input formant frequency, bandwidth, and fundamental frequency, and generating the stimulus speech based on the selected data set.

According to the present invention, stimulus speech is created by taking into account a probabilistic dynamic model of the articulatory movement trajectories of the articulators (jaw, tongue, lips, soft palate, larynx, and so on) during continuous speech. The mean articulatory positions of the model are interpolated, the formant frequencies, their bandwidths, and the fundamental frequency are determined from each interpolated position, and the amount of change in formant frequency with respect to a change in articulatory position is calculated for each position. This makes it possible to use the relationship between a change in articulator position and the amount of change in formant frequency as stimulus speech, and thus to realize precise speech synthesis that reflects the dynamic behavior of the human articulators and their movements.
Furthermore, by generating the probabilistic dynamic model of articulatory movement trajectories with an HMM (hidden Markov model) and selecting the direction of movement at each articulatory position based on the HMM state transitions, the connection relationships between words can be expressed indirectly, enabling even more precise speech synthesis.

FIG. 1 is a functional block diagram showing the internal configuration of the speech synthesis apparatus according to the present invention.
As shown in FIG. 1, the speech synthesis apparatus comprises a speech parameter storage unit 1, an articulatory parameter storage unit 2, a phoneme sequence storage unit 3, an articulatory-acoustic codebook generation unit 4, an HMM creation unit 5, an interpolated articulatory position generation unit 6, a formant frequency generation unit 7, a bandwidth generation unit 8, a fundamental frequency generation unit 9, an AFS (Articulatory Formant Sensitivity) generation unit 10, a correspondence table generation unit 11, a parameter selection unit 12, and a speech generation unit 13.

The articulatory-acoustic codebook generation unit 4 generates pairs of the speech parameter vectors stored in the speech parameter storage unit 1 and the articulatory parameter vectors stored in the articulatory parameter storage unit 2, and supplies them to the formant frequency generation unit 7, the bandwidth generation unit 8, and the fundamental frequency generation unit 9.
The HMM creation unit 5 provides the probabilistic dynamic model of the articulatory movement trajectories of the human articulators (jaw, tongue, lips, soft palate, larynx, and so on) on which the stimulus speech is based. In this embodiment, an HMM (hidden Markov model) is used as that model. In addition to the articulatory parameter vectors stored in the articulatory parameter storage unit 2, the HMM creation unit 5 also receives the uttered phoneme sequences stored in the phoneme sequence storage unit 3. Details are described later.

The interpolated articulatory position generation unit 6 interpolates between the mean articulatory positions of the HMM states. It selects the direction of movement at each articulatory position based on the HMM state transitions and performs the interpolation for each selected direction. The articulatory position data obtained by this interpolation are supplied to the formant frequency generation unit 7, the bandwidth generation unit 8, and the fundamental frequency generation unit 9.
For each articulatory position obtained by the interpolation, the formant frequency generation unit 7, the bandwidth generation unit 8, and the fundamental frequency generation unit 9 calculate the formant frequency, bandwidth, and fundamental frequency, respectively, by referring to the codebook of speech and articulatory parameter vector pairs held by the articulatory-acoustic codebook generation unit 4. The logic of this calculation is disclosed in detail in Non-Patent Document 2.

The AFS generation unit 10 calculates, for each articulatory position and the formant frequency output by the formant frequency generation unit 7, the amount of change in formant frequency with respect to a change in articulatory position (AFS). The calculated AFS is supplied to the correspondence table generation unit 11. The correspondence table generation unit 11 also receives the formant frequency, bandwidth, and fundamental frequency from the formant frequency generation unit 7, the bandwidth generation unit 8, and the fundamental frequency generation unit 9, respectively; it combines these values into entries and stores them in a storage device (not shown).
The parameter selection unit 12 performs an exhaustive search of the stored correspondence table using, for example, the formant frequency, bandwidth, and fundamental frequency selected by the user, finds the matching data set, and supplies it as the stimulus to the speech generation unit 13, which is implemented as a formant synthesizer. The speech generation unit 13 takes the data set obtained from the correspondence table as input and outputs synthesized speech.

In FIG. 1, the speech parameter storage unit 1, the articulatory parameter storage unit 2, the phoneme sequence storage unit 3, the codebook generated by the articulatory-acoustic codebook generation unit 4, and the correspondence table generated by the correspondence table generation unit 11 are each allocated to and stored in a predetermined area of a storage device (not shown). The functions of the HMM creation unit 5, the interpolated articulatory position generation unit 6, the formant frequency generation unit 7, the bandwidth generation unit 8, the fundamental frequency generation unit 9, the AFS (Articulatory Formant Sensitivity) generation unit 10, the correspondence table generation unit 11, the parameter selection unit 12, and the speech generation unit 13 are each realized by the processing unit of a computer and its peripheral LSIs sequentially reading and executing a program.

FIG. 2 is a flowchart used to explain the operation of the speech synthesis apparatus according to the present invention.
The operation of the embodiment shown in FIG. 1 is described in detail below with reference to the flowchart of FIG. 2.

Here, the stimulus speech is a steady-state vowel produced by the speech generation unit 13 (a formant synthesizer) with the formant frequencies, their bandwidths, and the fundamental frequency as parameters. These parameters are obtained from an articulatory-acoustic codebook generated from actually observed data of human speech production.

Specifically, the articulatory-acoustic pair codebook is created by the articulatory-acoustic codebook generation unit 4 from speech signals and articulatory movements observed simultaneously with a magnetic sensor system.
The data used here came from a single speaker who uttered three-vowel sequences such as /aui/ 540 times. The speech signal was framed at a rate of 250 frames per second with a window length of 25 ms, and the four lowest formant frequencies, their bandwidths, and the fundamental frequency obtained from each frame form the speech parameter vector y (speech parameter storage unit 1). The articulatory signals were likewise measured at 250 samples per second. The horizontal and vertical signals at nine points in total (the lower jaw, the upper and lower lips, four points on the tongue, the soft palate, and the larynx) form the articulatory parameter vector x (articulatory parameter storage unit 2) (S21, S22). The codebook is also given the uttered phoneme sequences (phoneme sequence storage unit 3).
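The framing described above (250 frames per second, 25 ms windows) can be illustrated with a short sketch. The audio sampling rate is not stated in the source, so the 16 kHz value below is purely an assumption, and `frame_bounds` is a hypothetical helper name.

```python
def frame_bounds(n_samples, sample_rate, frame_rate=250, window_ms=25):
    """Return (start, end) sample indices of each analysis frame.

    frame_rate and window_ms follow the text (250 frames/s, 25 ms).
    sample_rate is an assumption supplied by the caller.
    """
    hop = sample_rate // frame_rate            # samples between frame starts
    win = sample_rate * window_ms // 1000      # samples per window
    return [(s, s + win) for s in range(0, n_samples - win + 1, hop)]


# One second of audio at an assumed 16 kHz sampling rate.
frames = frame_bounds(n_samples=16000, sample_rate=16000)
```

With a 4 ms hop and a 25 ms window, consecutive frames overlap heavily, which keeps the speech parameters aligned frame by frame with the 250 Hz articulatory samples.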

Using the articulatory parameter vectors x obtained above, the HMM creation unit 5 statistically learns the dynamic behavior of the articulatory parameters. An HMM is used as the statistical method (S23: creation of an HMM based on articulatory movement). The HMM is built per phoneme pair, taking the following phoneme into account, and has three states with a single-mixture Gaussian distribution in a left-to-right topology without skips.
The HMM λ is trained so as to maximize the output probability of the articulatory parameter vectors, P(x|λ) = Σ_q P(x|q,λ) P(q|λ). Here q is an HMM state sequence, and the output probability P(x|q,λ) of the articulatory parameter vectors for a given state sequence is assumed to be Gaussian.
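The three-state left-to-right topology without skips can be sketched as a transition matrix in which each state may only stay where it is or advance to the next state. This is a minimal illustration, not the trained model: `stay_prob` is a placeholder that training would replace with estimated values, and treating the last state as absorbing is a simplification.

```python
def left_to_right_transitions(n_states=3, stay_prob=0.5):
    """Build a left-to-right, no-skip transition matrix.

    stay_prob is a hypothetical initial value; in a real system the
    transition probabilities are estimated from data.
    """
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            a[i][i] = stay_prob            # self-loop
            a[i][i + 1] = 1.0 - stay_prob  # advance to the next state only
        else:
            a[i][i] = 1.0                  # simplification: final state absorbs
    return a


transitions = left_to_right_transitions()
```

Because every entry above the first superdiagonal is zero, a state sequence can never jump over a state, which is what "without skips" means here.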

Next, for each phoneme-pair model of the HMM λ obtained above, the interpolated articulatory position generation unit 6 creates a sequence of articulatory positions by linearly interpolating at 100 points between the mean articulatory positions x_{i,0} and x_{j,0} of states i and j (S24). The result is denoted x_{i,j,n} (i, j: states; n = 1 to 100).
This means that the direction of movement at a given articulatory position is defined by the HMM state transitions. FIG. 4 shows an example of the directions of movement of articulatory positions on the midsagittal plane as determined by the HMM.
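The interpolation step can be sketched as follows. The source specifies 100 linearly interpolated points between two state means but not whether the end points are included; the sketch assumes they are, and `interpolate_positions` is a hypothetical name.

```python
def interpolate_positions(mean_i, mean_j, n_points=100):
    """Linearly interpolate between two mean articulatory positions.

    Assumption: the n_points samples are evenly spaced and include both
    end points (the source does not say whether they do).
    """
    path = []
    for n in range(n_points):
        t = n / (n_points - 1)  # 0.0 at state i, 1.0 at state j
        path.append([a + t * (b - a) for a, b in zip(mean_i, mean_j)])
    return path


# Toy 2-dimensional example; the real vectors have 18 components
# (9 sensor points, horizontal and vertical).
path = interpolate_positions([0.0, 10.0], [99.0, -89.0])
```

Each element of `path` plays the role of one x_{i,j,n}, and the order of the elements encodes the direction of movement from state i to state j.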

Then, for each linearly interpolated articulatory position x_{i,j,n} output by the interpolated articulatory position generation unit 6, the formant frequency generation unit 7, the bandwidth generation unit 8, and the fundamental frequency generation unit 9 calculate the formant frequencies, bandwidths, and fundamental frequency y_{i,j,n}, respectively, based on the articulatory-acoustic pair codebook generated by the articulatory-acoustic codebook generation unit 4 (S25).
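As a rough illustration of a codebook lookup, the sketch below returns the speech parameters paired with the stored articulatory vector closest to the query. This nearest-neighbor rule is a deliberate simplification and an assumption; the actual procedure is the one described in Non-Patent Document 2, which is not reproduced here. The function name and the toy codebook values are hypothetical.

```python
import math


def lookup_acoustics(codebook, x):
    """Return the speech parameters of the closest articulatory entry.

    codebook: list of (articulatory_vector, speech_parameter_vector) pairs.
    Nearest-neighbor matching by Euclidean distance is an assumption made
    for illustration only.
    """
    _, speech_params = min(codebook, key=lambda pair: math.dist(pair[0], x))
    return speech_params


# Toy codebook with made-up values: (x, y) pairs.
codebook = [
    ((0.0, 0.0), (700.0, 1200.0)),
    ((1.0, 1.0), (300.0, 2200.0)),
]
y = lookup_acoustics(codebook, (0.9, 0.8))
```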

FIG. 3 conceptually shows the procedure for interpolating between the mean articulatory positions of the states in the dynamic model of articulatory movement, and the procedure for determining the formant frequencies, their bandwidths, and the fundamental frequency from the interpolated articulatory positions.

Next, the AFS generation unit 10 uses the interpolated articulatory positions and formant frequencies to calculate the amount of change in formant frequency with respect to a change in the articulatory gesture, that is, in the position of the articulators (S26). For the first or second formant frequency at every articulatory position, a linear regression coefficient is determined from the formant frequency values at the 10 surrounding points. This coefficient is then divided by the amount of change in the articulatory gesture, namely the mean squared distance over the parameters (jaw, tongue, and so on) that results from the articulatory position changing by those 10 points. The quotient is defined as the amount of change in formant frequency with respect to a change in articulatory gesture (AFS). FIG. 5 shows an example of the AFS.
The correspondence table generation unit 11 then repeats this calculation for all articulatory positions and creates and stores a correspondence table that associates each articulatory position with its formant frequency, bandwidth, fundamental frequency, and AFS (S27).
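The AFS computation can be sketched under one possible reading of the text: a least-squares slope of the formant frequency over a window spanning 10 interpolation steps, divided by the mean squared distance between the articulatory vectors at the two ends of that window. The exact window placement and the handling of the ends of the sequence are assumptions, and all function names are hypothetical.

```python
def regression_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den


def mean_squared_distance(x_a, x_b):
    """Mean over components of the squared difference of two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x_a, x_b)) / len(x_a)


def afs(positions, formants, n, span=10):
    """AFS at index n: formant slope divided by articulatory change.

    Assumption: the window covers `span` steps, centered on n where
    possible and shifted inward near the ends of the sequence.
    """
    if len(positions) <= span:
        raise ValueError("sequence is shorter than the window")
    lo = max(0, min(n - span // 2, len(positions) - 1 - span))
    hi = lo + span
    gesture_change = mean_squared_distance(positions[lo], positions[hi])
    if gesture_change == 0:
        raise ValueError("articulatory position does not change in window")
    return regression_slope(formants[lo:hi + 1]) / gesture_change


# Toy data: one coordinate moves 1 unit per step, F1 rises 2 Hz per step.
positions = [[float(k), 0.0] for k in range(100)]
f1 = [500.0 + 2.0 * k for k in range(100)]
value = afs(positions, f1, n=50)
```

A large AFS means a small articulatory movement produces a large formant shift at that position; a small AFS means the formant is insensitive to movement there.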

The parameter selection unit 12 then refers to the correspondence table created by the correspondence table generation unit 11 to present stimulus speech that respects the constraints of speech production.
For example, to present stimuli that have the same formant frequencies and bandwidths but different fundamental frequencies, the matching data sets are found by an exhaustive search of the correspondence table and used as the stimuli. It is also possible to check, based on the constraints on articulatory position, whether a stimulus with a shifted formant frequency could actually be produced by a human.
FIG. 6 shows an example of data sets of articulatory positions on the midsagittal plane that produce the same formant frequencies and bandwidths with a high (127 Hz) and a low (114 Hz) fundamental frequency.
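The exhaustive search of the correspondence table can be sketched as a linear scan. The table layout, the matching tolerance `tol_hz`, and the formant and bandwidth values below are all hypothetical; only the two fundamental frequencies (127 Hz and 114 Hz) are taken from the FIG. 6 example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TableEntry:
    position_id: int                  # index of the articulatory position
    formants_hz: Tuple[float, ...]
    bandwidths_hz: Tuple[float, ...]
    f0_hz: float
    afs: float


def _close(a, b, tol):
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def full_search(table: List[TableEntry], formants_hz, bandwidths_hz, f0_hz,
                tol_hz=5.0) -> List[TableEntry]:
    """Scan every entry and keep those matching the requested parameters.

    tol_hz is an assumed matching tolerance; the source does not state how
    a "corresponding" entry is decided.
    """
    return [
        e for e in table
        if _close(e.formants_hz, formants_hz, tol_hz)
        and _close(e.bandwidths_hz, bandwidths_hz, tol_hz)
        and abs(e.f0_hz - f0_hz) <= tol_hz
    ]


# Made-up entries: same formants and bandwidths, different F0.
table = [
    TableEntry(0, (700.0, 1200.0), (60.0, 90.0), 127.0, 0.04),
    TableEntry(1, (700.0, 1200.0), (60.0, 90.0), 114.0, 0.03),
    TableEntry(2, (300.0, 2200.0), (50.0, 80.0), 127.0, 0.10),
]
high = full_search(table, (700.0, 1200.0), (60.0, 90.0), 127.0)
low = full_search(table, (700.0, 1200.0), (60.0, 90.0), 114.0)
```

Because every entry in the table was derived from interpolated articulatory positions, any stimulus selected this way corresponds to a producible articulation by construction.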

Finally, the speech generation unit 13 creates the stimulus speech with the formant synthesizer, using the formant frequencies, bandwidths, and fundamental frequency selected by the parameter selection unit 12 as parameters (S28, S29).
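To make the role of the three parameter types concrete, the sketch below drives a cascade of second-order digital resonators (the classic Klatt-style resonator recurrence) with an impulse train at the fundamental frequency. It is not the synthesizer used in the embodiment, it omits the natural power contour mentioned in the text, and the sampling rate and parameter values are assumptions.

```python
import math


def resonator(signal, freq_hz, bw_hz, sample_rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def synthesize_vowel(formants_hz, bandwidths_hz, f0_hz,
                     duration_s=0.2, sample_rate=16000):
    """Steady vowel: impulse train at f0 filtered by cascaded resonators.

    sample_rate is an assumption; duration_s defaults to the 200 ms
    stimulus length given in the text.
    """
    n = round(duration_s * sample_rate)
    period = max(1, round(sample_rate / f0_hz))
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for f, bw in zip(formants_hz, bandwidths_hz):
        signal = resonator(signal, f, bw, sample_rate)
    return signal


# Made-up formant and bandwidth values; F0 from the FIG. 6 example.
samples = synthesize_vowel((700.0, 1200.0, 2500.0), (60.0, 90.0, 120.0), 127.0)
```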

In this embodiment, the dynamic model of articulatory movement was created, and the formant sensitivity to articulatory change was learned, from 540 utterances of three-vowel sequences such as /aiu/ spoken by one Japanese male speaker. Praat was used as the formant synthesizer. Each stimulus was 200 ms long, and the power of the speech was given a natural time course.
Using stimulus speech that reflects the dynamic behavior of the speech organs according to the present invention, a correlation was found between the formant frequency discrimination threshold and the amount of change in formant frequency with respect to the articulatory gesture (AFS). This correlation was not observed with stimuli outside the range that a human can actually produce, which indicates that conventional stimuli that ignore the constraints of speech production are inadequate as stimuli for auditory psychophysical experiments. FIG. 7 shows the relationship between the AFS and the formant frequency discrimination threshold according to the present invention.

As described above, according to the present invention, stimulus speech is created by taking into account a probabilistic dynamic model (here, an HMM) of the articulatory movement trajectories of the articulators (jaw, tongue, lips, soft palate, larynx, and so on) during continuous speech. The mean articulatory positions of the model are interpolated, the formant frequencies, their bandwidths, and the fundamental frequency are determined from each interpolated position, and the amount of change in formant frequency with respect to a change in articulatory position is calculated for each position. This makes it possible to use the relationship between a change in articulatory gesture and the amount of change in formant frequency as stimulus speech, and thus to realize precise speech synthesis that reflects the dynamic behavior of the human articulators and their movements.

FIG. 1 is a block diagram showing the internal configuration of the speech synthesis apparatus of the present invention, broken down by function.
FIG. 2 is a flowchart referred to in explaining the operation of the speech synthesis apparatus of the present invention.
FIG. 3 is a conceptual diagram referred to in explaining the operation of the embodiment of the present invention.
FIG. 4 is a graph showing experimental results obtained with the embodiment of the present invention.
FIG. 5 is a graph showing experimental results obtained with the embodiment of the present invention.
FIG. 6 is a graph showing experimental results obtained with the embodiment of the present invention.
FIG. 7 is a graph showing experimental results obtained with the embodiment of the present invention.

Explanation of symbols

1 ... Speech parameter storage unit, 2 ... Articulatory parameter storage unit, 3 ... Phoneme sequence storage unit, 4 ... Articulatory-acoustic codebook generation unit, 5 ... HMM model creation unit, 6 ... Interpolated articulation position generation unit, 7 ... Formant frequency generation unit, 8 ... Bandwidth generation unit, 9 ... Fundamental frequency generation unit, 10 ... AFS generation unit, 11 ... Correspondence table generation unit, 12 ... Parameter selection unit, 13 ... Speech generation unit

Claims

A speech synthesis apparatus that generates and outputs stimulus speech, comprising:
means for learning a state sequence of articulatory parameter vectors using a probabilistic dynamic model of articulatory motion trajectories;
means for performing an interpolation calculation between the average articulation positions of the states in the probabilistic dynamic model of the articulatory motion trajectories;
means for calculating, for each articulation position obtained by the interpolation calculation, a formant frequency, its bandwidth, and a fundamental frequency with reference to an articulatory-acoustic pair codebook in which speech parameter vectors and articulatory parameter vectors are stored in pairs;
means for calculating, from each articulation position obtained by the interpolation calculation and the calculated formant frequency, the amount of change of the formant frequency with respect to a change in the articulation position, by varying each articulation position obtained by the interpolation calculation over 10 points in its movement direction and dividing the linear regression coefficient computed from the formant frequency values calculated at those 10 articulation positions by the mean square distance of the articulatory parameter vectors resulting from the change;
means for generating and storing a correspondence table that pairs, for each articulation position, the formant frequency, the bandwidth, the fundamental frequency, and the amount of change of the formant frequency; and
means for referring to the correspondence table, selecting by full search the data set of formant frequency, bandwidth, fundamental frequency, and amount of change of the formant frequency that corresponds to an input formant frequency, bandwidth, fundamental frequency, and amount of change of the formant frequency, and generating the stimulus speech based on the selected data set.

The speech synthesis apparatus according to claim 1, wherein the means for performing the interpolation calculation generates the probabilistic dynamic model of the articulatory motion trajectories using a hidden Markov model.

The speech synthesis apparatus according to claim 1 or 2, wherein the means for performing the interpolation calculation selects the movement direction at each articulation position based on the state transitions of the hidden Markov model.

A speech synthesis method in which an arithmetic unit generates and outputs stimulus speech, the arithmetic unit executing the steps of:
learning a state sequence of articulatory parameter vectors using a probabilistic dynamic model of articulatory motion trajectories;
performing an interpolation calculation between the average articulation positions of the states in the probabilistic dynamic model of the articulatory motion trajectories;
calculating, for each articulation position obtained by the interpolation calculation, a formant frequency, its bandwidth, and a fundamental frequency with reference to an articulatory-acoustic pair codebook in which speech parameter vectors and articulatory parameter vectors are stored in pairs;
calculating, from each articulation position obtained by the interpolation calculation and the calculated formant frequency, the amount of change of the formant frequency with respect to a change in the articulation position, by varying each articulation position obtained by the interpolation calculation over 10 points in its movement direction and dividing the linear regression coefficient computed from the formant frequency values calculated at those 10 articulation positions by the mean square distance of the articulatory parameter vectors resulting from the change;
generating and storing a correspondence table that pairs, for each articulation position, the formant frequency, the bandwidth, the fundamental frequency, and the amount of change of the formant frequency; and
referring to the correspondence table, selecting by full search the data set of formant frequency, bandwidth, fundamental frequency, and amount of change of the formant frequency that corresponds to an input formant frequency, bandwidth, fundamental frequency, and amount of change of the formant frequency, and generating the stimulus speech based on the selected data set.
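The correspondence-table lookup recited in the last step above can be pictured with the following minimal sketch. The claims state only that the data set is selected by full search; the `Entry` type, the field names, the weighted squared-error distance, and the sample values are all assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    """One row of a hypothetical correspondence table."""
    formant_hz: float
    bandwidth_hz: float
    f0_hz: float
    afs: float


def full_search(
    table: Sequence[Entry],
    target: Entry,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> Entry:
    """Examine every row and return the one closest to `target`.

    The weighted squared-error distance is an assumption; the source does not
    specify the matching criterion.
    """
    if not table:
        raise ValueError("correspondence table is empty")

    def distance(e: Entry) -> float:
        diffs = (
            e.formant_hz - target.formant_hz,
            e.bandwidth_hz - target.bandwidth_hz,
            e.f0_hz - target.f0_hz,
            e.afs - target.afs,
        )
        return sum(w * d * d for w, d in zip(weights, diffs))

    # min() visits every entry, which is what makes this an exhaustive search.
    return min(table, key=distance)


# Made-up sample values, for illustration only.
table = [
    Entry(500.0, 60.0, 120.0, 10.0),
    Entry(700.0, 80.0, 125.0, 25.0),
    Entry(300.0, 50.0, 110.0, 5.0),
]
selected = full_search(table, Entry(690.0, 75.0, 124.0, 24.0))
```

The selected entry would then supply the parameters passed to the formant synthesizer that produces the stimulus speech.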