JPS5855519B2

JPS5855519B2 - speech synthesizer

Info

Publication number: JPS5855519B2
Application number: JP54132181A
Authority: JP
Inventors: 洋一東倉; 芳典匂坂
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1979-10-13
Filing date: 1979-10-13
Publication date: 1983-12-09
Also published as: JPS5655999A

Description

[Detailed Description of the Invention]

The present invention relates to a speech synthesis device that synthesizes a speech signal using at least two features of the speech signal, namely the spectral envelope characteristic and the speech fundamental frequency. In particular, it aims to obtain high-quality synthesized speech when the speech fundamental frequency is given artificially.

A conventional speech synthesis device of this type will be described with reference to FIG. 1.

Sound source information, such as the speech amplitude L, the fundamental period (the so-called pitch) T, and a voiced/unvoiced signal that distinguishes voiced from unvoiced sounds, is stored in storage device 11. At fixed time intervals, sound source signal control device 12 reads out from storage device 11 the set of sound source signals corresponding to the speech signal to be synthesized.

The fundamental period T that has been read out is supplied to pulse generator 13, the voiced/unvoiced signal to switching device 14, and the speech amplitude L to variable gain amplifier 15.

Pulse generator 13 generates a pulse at every input fundamental period T. Switching device 14 selects the output of pulse generator 13 when the input voiced/unvoiced signal indicates a voiced sound, selects the output of white noise generator 16 when it indicates an unvoiced sound, and supplies the selected signal to amplifier 15.

Amplifier 15 amplifies the output of switching device 14 according to the amplitude L.

The amplifier output is input to speech synthesis filter 17, which is a digital filter.

Meanwhile, parameters representing the spectral envelope of the speech signal are stored in storage device 18.

For example, PARCOR coefficients {k_i} can be used as these parameters (reference: Acoustical Society of Japan, ed., Acoustic Engineering Course 7, "Speech", by Kazuo Nakata, pp. 93-97).

These parameters are read out at fixed time intervals by parameter control device 19 and sent to filter 17, and the characteristics of speech synthesis filter 17 are controlled according to them.

The output of filter 17 is converted into an analog signal by digital-to-analog converter 21 and sent out from output terminal 22 as the synthesized speech signal.
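The signal path of FIG. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit of the patent: the function names, the uniform random noise source, and the sign convention of the lattice recursion are choices made for the example. Only the sequence (pulse or noise selection, gain, then an all-pole filter controlled by PARCOR coefficients) is taken from the description.

```python
import random


def pulse_train(period, n):
    """Pulse generator 13: one unit pulse every `period` samples."""
    return [1.0 if t % period == 0 else 0.0 for t in range(n)]


def white_noise(n, seed=0):
    """White noise generator 16 (uniform noise is an assumption)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def lattice_synthesis(excitation, k):
    """All-pole lattice filter driven by reflection (PARCOR) coefficients k.

    Assumed convention: a single coefficient k1 gives y[n] = e[n] - k1*y[n-1].
    """
    p = len(k)
    b = [0.0] * (p + 1)          # backward errors from the previous sample
    out = []
    for e in excitation:
        f = e                    # forward error at the top stage
        new_b = [0.0] * (p + 1)
        for i in range(p, 0, -1):
            f = f - k[i - 1] * b[i - 1]
            new_b[i] = k[i - 1] * f + b[i - 1]
        new_b[0] = f
        b = new_b
        out.append(f)
    return out


def synthesize_frame(period, voiced, amplitude, k, n):
    """Switching device 14, amplifier 15 and synthesis filter 17 in sequence."""
    source = pulse_train(period, n) if voiced else white_noise(n)
    scaled = [amplitude * s for s in source]
    return lattice_synthesis(scaled, k)
```

With a single coefficient k = [-0.5], a period of four samples and an amplitude of 2.0, the first four output samples are 2.0, 1.0, 0.5 and 0.25, the decaying impulse response of a one-pole filter.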

In such a conventional speech synthesis device, sound source signal control device 12 and parameter control device 19 operated independently of each other.

As a result, a mismatch arose between the sound source signal and the spectral envelope, and in some cases the intended spectral envelope could not be realized.

For example, FIG. 2A shows the spectral envelope of speech described by ten PARCOR coefficients {k_i} (i = 1 to 10), together with the corresponding Fourier spectrum.

This spectral envelope is determined so as to optimally approximate the Fourier spectrum, which exhibits the harmonic structure of the speech (reference: Transactions of the Institute of Electronics and Communication Engineers of Japan, Vol. 53-A, No. 1, 1970, Itakura and Saito, "Estimation of speech spectral density and formant frequencies by a statistical method").

Therefore, if a sound source signal whose fundamental frequency matches this harmonic structure is applied to digital filter 17, which reproduces the spectral envelope characteristic, the sound source signal and the spectral envelope are matched, and the spectral envelope of the resulting synthesized speech is close to that of the original speech.

The mismatch is not a problem when the sound source information and the spectral envelope information are extracted by analyzing speech and are then recombined to reproduce the original speech.

It becomes a serious problem, however, when only the sound source information is changed independently, for example when, as shown in FIG. 1, the sound source information in storage device 11 is generated independently of the spectral envelope information in storage device 18.

Suppose that the spectral envelope parameters corresponding to the spectral envelope of FIG. 2A are given to digital filter 17, but the filter is driven by a sound source signal whose harmonic structure differs from that of FIG. 2A (fundamental frequency F0 = 350 Hz), for example a signal with fundamental frequency F0 = 230 Hz. The spectral envelope and Fourier spectrum of the synthesized speech are then as shown in FIG. 2B.

As is clear from a comparison of FIGS. 2A and 2B, the distinct peak 23 in the low-frequency region of FIG. 2A is lost in FIG. 2B, and the intended spectral envelope is not obtained.

Consequently, synthesized speech with a spectral envelope like that of FIG. 2B suffers large spectral distortion and degraded quality.
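The loss of the peak can be illustrated numerically. The sketch below models the low-frequency peak as a single two-pole resonance and evaluates its gain at the harmonics of the excitation. The resonance model, the 8 kHz sampling rate and the function names are assumptions made for the illustration; the values 350 Hz, 20 Hz and 230 Hz are those used in the description's example.

```python
import cmath
import math


def resonance_gain(freq_hz, formant_hz, bandwidth_hz, fs_hz=8000.0):
    """Gain of a single two-pole resonance, evaluated at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)      # pole radius
    theta = 2.0 * math.pi * formant_hz / fs_hz         # pole angle
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs_hz)
    a = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv ** 2
    return 1.0 / abs(a)


def harmonic_gains(f0_hz, formant_hz, bandwidth_hz, count=3):
    """Gain seen by the first `count` harmonics of the excitation."""
    return [resonance_gain(n * f0_hz, formant_hz, bandwidth_hz)
            for n in range(1, count + 1)]
```

Under these assumptions, a harmonic at 350 Hz sits on the peak, whereas the harmonics at 230 Hz and 460 Hz of a 230 Hz excitation fall on either side of it and are passed with roughly an order of magnitude less gain, so the synthesized spectrum no longer shows the peak.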

Thus, when parameters representing the spectral envelope characteristic obtained from an actual speech signal were used to synthesize speech whose fundamental period (pitch) differs from that of the original signal, for example higher or lower speech, the quality of the synthesized speech deteriorated considerably.

The present invention provides a speech synthesis device that removes the quality degradation of synthesized speech caused by the mismatch between the sound source signal and the spectral envelope, and that makes it possible to realize a spectral envelope perceptually close to the intended one.

According to the present invention, the parameters representing the spectral envelope characteristic are converted so that the given speech fundamental frequency and a peak frequency of the spectral envelope of the speech signal to be obtained, for example the first formant frequency, are correlated.

FIG. 3 shows an example of a speech synthesis device according to the present invention; parts corresponding to those in FIG. 1 are given the same reference numerals.

In this invention a matching device 24 is provided. Matching device 24 is supplied with the fundamental period T from control device 12 and with the parameters from control device 19.

Matching device 24 converts the parameters so that the spectral envelope they describe matches the harmonic structure of the fundamental frequency F0 (= 1/T).

That is, the parameters from control device 19 are converted so that the peak frequency of the spectral envelope of the speech signal to be obtained is correlated with the fundamental frequency F0 = 1/T from control device 12.

For example, when the parameters stored in storage device 18 are PARCOR coefficients {k_i}, the low-frequency peak of the spectral envelope of the original speech is extracted from these coefficients as a frequency F1 and a bandwidth B1. This peak generally lies at 500 Hz or below and is usually the first formant.

This extraction can be performed by a known method.

The difference |F1 - nF0| (n is a positive integer) is computed, and if it is less than or equal to B1/2, the parameters read from storage device 18 are used as they are.

If |F1 - nF0| exceeds B1/2, however, a frequency F1' = F1 + ΔF1 is considered, obtained by adding a small frequency ΔF1 to the first formant frequency F1. Among the values of F1' for which |F1' - nF0| is less than or equal to B1/2, the one with the smallest ΔF1 is chosen.
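The selection rule for F1' can be written as a short function. This is a hedged sketch rather than the patent's implementation: the function name is hypothetical, and checking only the two harmonics that bracket F1 is an assumption that suffices because the nearest harmonic always requires the smallest shift.

```python
import math


def match_first_formant(f1, b1, f0):
    """Return F1' such that |F1' - n*F0| <= B1/2 with the smallest shift from F1.

    f1: first formant frequency, b1: its bandwidth, f0: given fundamental (Hz).
    """
    lower = max(1, math.floor(f1 / f0))     # harmonic number at or below F1
    best = None
    for n in (lower, lower + 1):            # the two harmonics bracketing F1
        harmonic = n * f0
        diff = f1 - harmonic
        if abs(diff) <= b1 / 2:
            return f1                       # already matched: keep F1 unchanged
        # Move F1 to the edge of the band around this harmonic nearest to F1.
        shifted = harmonic + math.copysign(b1 / 2, diff)
        if best is None or abs(shifted - f1) < abs(best - f1):
            best = shifted
    return best
```

With F1 = 350 Hz, B1 = 20 Hz and F0 = 230 Hz, the values used in the description's example, the function returns 450.0, which is 2F0 - B1/2.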

This F1' is taken as the first formant frequency, the other formant frequencies are left unchanged, and PARCOR coefficients {k_i'} are computed from them.

This computation can be performed by a known method.
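The text does not say which known method is meant. One common route, shown here purely as an assumption, is to build an all-pole polynomial from the formant frequencies and bandwidths and then apply the step-down recursion to obtain reflection coefficients. The sampling rate, the function names and the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p are choices made for this sketch; PARCOR sign conventions differ between texts.

```python
import math


def formants_to_polynomial(formants, fs_hz):
    """Build [1, a1, ..., ap] from a list of (frequency_hz, bandwidth_hz)."""
    a = [1.0]
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / fs_hz)        # pole radius from bandwidth
        theta = 2.0 * math.pi * freq / fs_hz       # pole angle from frequency
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        product = [0.0] * (len(a) + 2)
        for i, ai in enumerate(a):                 # polynomial multiplication
            for j, sj in enumerate(section):
                product[i + j] += ai * sj
        a = product
    return a


def polynomial_to_reflection(a):
    """Step-down recursion: polynomial [1, a1, ..., ap] -> [k1, ..., kp]."""
    a = list(a)
    p = len(a) - 1
    k = [0.0] * p
    for i in range(p, 0, -1):
        ki = a[i]                                  # k_i is the last coefficient
        k[i - 1] = ki
        denom = 1.0 - ki * ki
        # Reduce the polynomial from order i to order i - 1.
        a = [1.0] + [(a[j] - ki * a[i - j]) / denom for j in range(1, i)]
    return k
```

For a single formant the result has two coefficients, with k2 equal to the squared pole radius. Each formant contributes two coefficients, so five formants would yield the ten coefficients mentioned for FIG. 2A.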

The computed PARCOR coefficients are used as the parameters that control the characteristics of speech synthesis filter 17.

Such parameter conversion is not limited to PARCOR coefficients. It can be applied in the same way to linear prediction coefficients derived by linear predictive analysis of speech, to log area ratios, and to line spectrum parameters.

Since line spectrum parameters are themselves pseudo-formants, however, it suffices to obtain the value corresponding to F1' from the first-order parameter and the given fundamental frequency F0, and to use it as the first-order line spectrum parameter.

For example, suppose that the spectral envelope and Fourier spectrum of the original speech signal are as shown in FIG. 2A, with first formant frequency F1 = 350 Hz and bandwidth B1 = 20 Hz. When 230 Hz is given as the fundamental frequency F0 of the synthesized speech, the spectral envelope of the synthesized speech obtained with the conventional device of FIG. 1 contains large spectral distortion, as shown in FIG. 2B.

In this invention, however, matching device 24 obtains F1' = 2F0 - (B1/2); the text gives this as 460 Hz, although with F0 = 230 Hz and B1 = 20 Hz the expression evaluates to 450 Hz. The parameters are converted so that this F1' becomes the first formant frequency, filter 17 is controlled by the converted parameters, and the spectral envelope of the resulting synthesized speech is as shown in FIG. 4.

In FIG. 4 a spectral peak 25, that is, a first formant, appears at low frequency, and the envelope is close to that of FIG. 2A, although the frequency of the first formant differs.

Speech of good quality is therefore obtained at the desired pitch.

As explained above, when speech is synthesized with the speech synthesis device of this invention, the spectral envelope is modified depending on the given fundamental frequency F0. Spectral peaks such as the first formant F1 are therefore not lost, and high-quality synthesized speech with distinct spectral peaks can be obtained.

Because of this modification of the spectral envelope, the first formant frequency F1' of the synthesized speech differs from the first formant frequency F1 of the original speech from which the parameters were obtained. However, the observations of natural speech in FIGS. 5 and 6 show that formant frequencies do vary with the fundamental frequency. In this respect, the synthesized speech obtained with the device of this invention preserves a relationship between formant frequency and fundamental frequency close to that of natural speech.

FIG. 5 shows the formant distribution of Japanese vowels and FIG. 6 that of English vowels; in both, the formant frequencies change as the fundamental frequency changes.

As described above, the synthesized speech obtained with the device of this invention is superior to conventional synthesized speech in both naturalness and intelligibility, and the improvement in synthesized speech quality is substantial.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of a conventional speech synthesis device. FIG. 2A is a graph of the spectral envelope and Fourier spectrum obtained by analyzing natural speech. FIG. 2B is a graph of the spectral envelope and Fourier spectrum of speech synthesized with the same spectral envelope as in FIG. 2A after the pitch of the natural speech was changed. FIG. 3 is a block diagram of an embodiment of the speech synthesis device according to this invention. FIG. 4 is a graph of the spectral envelope and Fourier spectrum of synthesized speech obtained with the device of this invention. FIGS. 5 and 6 are diagrams showing the relationship between fundamental frequency and formants in natural speech.

11: storage device storing sound source information, 12: sound source information control device, 13: pulse generator, 14: switching device, 15: variable gain amplifier, 16: white noise generator, 17: speech synthesis filter, 18: storage device storing parameters representing the spectral envelope, 19: parameter control device, 21: D/A converter, 22: output terminal, 24: matching device.

Claims

1. A speech synthesis device in which the filter characteristics of a speech synthesis filter are set by parameters representing the spectral envelope characteristic of a speech signal, the speech synthesis filter is driven by a pulse signal of a frequency equal to a given speech fundamental frequency, and the speech signal is obtained from the filter output, characterized in that means is provided for converting the parameters representing the corresponding spectral envelope characteristic so that the given speech fundamental frequency and a peak frequency of the spectral envelope of the speech signal to be obtained are correlated.