JPS5948396B2

JPS5948396B2 - Fragment editing type speech synthesis device

Info

Publication number: JPS5948396B2
Application number: JP52008529A
Authority: JP
Inventors: 勝信伏木田; 和雄落合
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1977-01-27
Filing date: 1977-01-27
Publication date: 1984-11-26
Also published as: JPS5393704A

Abstract

PURPOSE: To prevent large discontinuities in the frequency spectrum shape of synthesized speech, and thereby improve sound quality, by using interpolated segment waveforms together with segment waveforms extracted from natural speech waveforms.

Description

[Detailed description of the invention] The present invention relates to a segment editing type speech synthesis device.

A segment editing type speech synthesis method is known in which various speech waveforms, each about one pitch period long, are extracted in advance from natural speech waveforms, and these segment waveforms are edited and synthesized according to control data, such as segment numbers and pitch, prepared in advance as synthesis data. In this method, when the segment waveforms are extracted from a limited set of natural speech waveforms such as monosyllables, the available set may not contain enough segment waveforms for the frequency spectrum shape of the segment waveform sequence used in synthesis to change sufficiently smoothly. This produces large discontinuities in the frequency spectrum shape of the synthesized speech and degrades the sound quality. Moreover, extracting from natural speech waveforms the segment waveforms whose frequency spectrum shape would fill in such discontinuities has the drawback that it is extremely difficult to select natural speech with the required frequency spectrum shape. An object of the present invention is to provide a segment editing type speech synthesis device with improved sound quality, which prevents large discontinuities in the frequency spectrum shape, such as the formant frequencies, of speech synthesized from segment waveforms extracted from natural speech waveforms such as monosyllables and words.

The speech synthesis device of the present invention comprises a speech segment interpolation device and means for editing and synthesizing a speech waveform using both the speech segment waveforms generated by that interpolation device and the speech segment waveforms extracted from natural speech. The speech segment interpolation device includes means for calculating parameter values representing the frequency spectrum shape of a speech segment waveform extracted from natural speech, means for interpolating the parameter values calculated from two different speech segment waveforms, and means for synthesizing a speech segment waveform from the interpolated parameter values.

A feature of the present invention is that, in addition to segment waveforms extracted from natural speech waveforms such as monosyllables, it also uses segment waveforms (here called interpolated segments) generated from values obtained by interpolating parameter values that represent the frequency spectrum shape, such as the formants, of the segment waveforms extracted from those natural speech waveforms.

Consequently, with the present invention no large discontinuities arise in the spectrum of the synthesized speech, for example in its formants, and synthesized speech of relatively good sound quality is obtained. It is known that the frequency spectrum of a speech waveform contains frequency components called formants, which have greater energy than the neighbouring frequency components. It is also known that sound quality deteriorates when the formant frequencies change abruptly by a large amount, for example within the steady part of a vowel. Therefore, as shown in FIG. 1, a suitable interpolated segment waveform is interpolated segment 106, which smoothly connects the formant frequencies of segment waveforms 105 and 107 extracted from the natural speech to be interpolated (monosyllables in the figure).

In the present invention, the formant frequencies and bandwidths are extracted from each of the two segment waveforms at the ends of the span to be interpolated, and the interpolated segment waveform is generated from values obtained by (linearly) interpolating the extracted formant frequencies and bandwidths. A known method of extracting formant frequencies and bandwidths uses the linear prediction coefficients obtained by solving a system of simultaneous linear equations whose coefficients are the autocorrelation values of the speech segment waveform. Conversely, a method of calculating linear prediction coefficients from formant frequencies and bandwidths is also known. It is described in detail in the following journal article, so its explanation is omitted here: "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave", The Journal of the Acoustical Society of America, Vol. 50, No. 2 (Part 2), pp. 637-655, 1971.

Let the pairs of formant frequency (F) and bandwidth (b) extracted from the speech segment waveforms at the two ends (A, B) be (FAn, bAn), n = 1, 2, ..., N, and (FBn, bBn), n = 1, 2, ..., N, respectively.
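The step of obtaining linear prediction coefficients from autocorrelation values can be sketched as follows. This is a minimal illustration only: it uses the Levinson-Durbin recursion, which is one standard way to solve such a system and is not named in this text, and the function names `autocorrelation` and `levinson_durbin` are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation values r[0..max_lag] of the samples x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Levinson-Durbin recursion.

    Returns (a, err), where a[k-1] is the coefficient a_k of the predictor
    x[t] ~ sum_k a_k * x[t-k], and err is the final prediction error energy.
    Assumption: r[0] > 0 and the recursion does not hit a zero error.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err              # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err


# Example: autocorrelation values of a first-order process.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

Extracting formant frequencies and bandwidths from these coefficients additionally requires finding the roots of the prediction polynomial, which is not shown here.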

It is known that an N of about 4 to 6 is sufficient. Let the pairs of formant frequency and bandwidth obtained by (linearly) interpolating the formant frequencies and bandwidths at the two ends be (Fcn, bcn), n = 1, 2, ..., N. Then (Fcn, bcn) can be given, for example, by equation (1). By choosing l and m appropriately in equation (1) and calculating Fcn and bcn, the formant frequencies and bandwidths of the interpolated segment are obtained.
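Equation (1) itself is not reproduced in this text, so the following is only a sketch of one plausible linear interpolation with two weights l and m. The weighting form `(m * A + l * B) / (l + m)` and the function name `interpolate_formants` are assumptions made for illustration, not the patent's stated formula.

```python
def interpolate_formants(formants_a, formants_b, l, m):
    """Linearly interpolate lists of (frequency, bandwidth) pairs.

    Assumed form (not taken from the source): for each formant n,
        value_c = (m * value_a + l * value_b) / (l + m)
    so l = 0 returns end A and m = 0 returns end B.
    """
    if len(formants_a) != len(formants_b):
        raise ValueError("both ends must have the same number of formants")
    total = l + m
    return [((m * fa + l * fb) / total, (m * ba + l * bb) / total)
            for (fa, ba), (fb, bb) in zip(formants_a, formants_b)]


# Example: one third of the way from end A to end B, with N = 2 formants.
end_a = [(500.0, 60.0), (1500.0, 90.0)]
end_b = [(800.0, 90.0), (1200.0, 120.0)]
interpolated = interpolate_formants(end_a, end_b, l=1, m=2)
```

Stepping l from 1 to some count while m decreases correspondingly would yield a sequence of interpolated segments between the two natural segments.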

The interpolated segment waveform is given as the impulse response waveform of a recursive filter controlled by the linear prediction coefficients calculated from Fcn and bcn. As parameters representing the frequency spectrum shape, it is clearly also possible to use, instead of the formant frequencies and bandwidths described above, the autocorrelation coefficients or linear prediction coefficients that are calculated when extracting the formant frequencies (or pole frequencies) and bandwidths from the natural speech waveform.
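The generation of an interpolated segment as the impulse response of a recursive filter can be sketched as below. The sketch assumes the common mapping of each formant to a conjugate pole pair with radius exp(-pi * b / fs) and angle 2 * pi * F / fs. The sampling frequency `fs` is not stated in this text and is a hypothetical value, as are the function names.

```python
import math


def formants_to_lpc(formants, fs):
    """Build the all-pole denominator [1, c1, ..., c_2N] from (F, b) pairs.

    Each formant contributes the second-order factor
        1 - 2 r cos(theta) z^-1 + r^2 z^-2
    with r = exp(-pi * b / fs) and theta = 2 * pi * F / fs (assumed mapping).
    """
    poly = [1.0]
    for freq, bw in formants:
        r = math.exp(-math.pi * bw / fs)
        theta = 2.0 * math.pi * freq / fs
        section = [1.0, -2.0 * r * math.cos(theta), r * r]
        product = [0.0] * (len(poly) + 2)
        for i, p in enumerate(poly):          # polynomial multiplication
            for j, s in enumerate(section):
                product[i + j] += p * s
        poly = product
    return poly


def impulse_response(denominator, length):
    """Response of the recursive filter 1 / A(z) to a unit impulse."""
    y = []
    for t in range(length):
        acc = 1.0 if t == 0 else 0.0          # unit impulse input
        for k in range(1, len(denominator)):
            if t - k >= 0:
                acc -= denominator[k] * y[t - k]
        y.append(acc)
    return y


# Example with hypothetical values: two formants, fs = 8000 Hz, 80 samples.
den = formants_to_lpc([(600.0, 70.0), (1400.0, 100.0)], fs=8000.0)
segment = impulse_response(den, 80)
```

The predictor coefficients in the usual sign convention are the negatives of `den[1:]`.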

However, when autocorrelation coefficients or linear prediction coefficients are interpolated instead, the formant frequencies of the interpolated segment are not necessarily a smooth interpolation of the formant frequencies of the segments extracted from natural speech at the two ends. In FIG. 1, reference numeral 101 denotes the time axis, 102 the frequency axis, 103 the trajectory of the first formant, and 104 the trajectory of the second formant.
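The difference between interpolating filter coefficients and interpolating formant frequencies can be illustrated with a single resonance. This is a simplified sketch, not an analysis from the source: it assumes the same pole mapping as a standard second-order resonator, a hypothetical sampling frequency of 8000 Hz, and hypothetical function names.

```python
import math


def resonator_coeffs(freq, bw, fs):
    """Denominator coefficients (c1, c2) of 1 + c1 z^-1 + c2 z^-2."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2.0 * math.pi * freq / fs
    return (-2.0 * r * math.cos(theta), r * r)


def resonance_frequency(c1, c2, fs):
    """Recover the resonance frequency from a complex-pole second-order section."""
    r = math.sqrt(c2)
    theta = math.acos(-c1 / (2.0 * r))
    return theta * fs / (2.0 * math.pi)


fs = 8000.0                                   # hypothetical sampling frequency
a1, a2 = resonator_coeffs(500.0, 80.0, fs)    # end A
b1, b2 = resonator_coeffs(1500.0, 80.0, fs)   # end B

# Midpoint taken in the coefficient domain.
coeff_mid = resonance_frequency((a1 + b1) / 2.0, (a2 + b2) / 2.0, fs)

# Midpoint taken in the formant domain.
formant_mid = (500.0 + 1500.0) / 2.0
```

Here `coeff_mid` comes out noticeably above `formant_mid`, because averaging the coefficients averages the cosines of the pole angles rather than the angles themselves.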

Next, the present invention will be described in detail with reference to the drawings.

FIG. 2 is a block diagram showing one embodiment of the present invention. First, a speech waveform such as a monosyllable is input from speech waveform input terminal 201 to speech segment extraction section 206, and speech segment waveforms are generated according to control data supplied from control circuit 202 via speech segment extraction section control data transmission line 203.

The speech segment waveforms generated in speech segment extraction section 206 are output to interpolated segment generation section 207 and, at the same time, stored via switch 208 in segment waveform storage circuit 214 within synthesis section 210. Interpolated segment generation section 207 calculates interpolated segment waveforms according to interpolated segment generation data supplied from control circuit 202 via interpolated segment generation section control data transmission line 204, and stores them via switch 209 in segment waveform storage circuit 214 within synthesis section 210.

After these operations are complete, a speech waveform is synthesized according to output command data supplied from control circuit 202 to synthesis section 210 via output command data transmission line 205. Synthesis section control circuit 211 within synthesis section 210 generates synthesis data storage circuit control data and editing/synthesis circuit control data according to the output command data supplied via output command data transmission line 205, and controls synthesis data storage circuit 215 and editing/synthesis circuit 216 via synthesis data storage circuit control data transmission line 212 and editing/synthesis circuit control data transmission line 213, respectively.

According to the synthesis data storage circuit control data, synthesis data storage circuit 215 outputs the segment number data stored in it in advance to segment waveform storage circuit 214 and, at the same time, outputs the pitch and amplitude data stored in it in advance to editing/synthesis circuit 216. According to the segment number data, segment waveform storage circuit 214 outputs segment waveforms to editing/synthesis circuit 216. These consist of the speech segment waveforms extracted directly from the natural speech waveform and supplied via switch 208, and the interpolated segment waveforms supplied via switch 209. Editing/synthesis circuit 216 edits and synthesizes the segment waveforms output from segment waveform storage circuit 214, according to the editing/synthesis circuit control data supplied via editing/synthesis circuit control data transmission line 213 and the pitch and amplitude data output from synthesis data storage circuit 215, to generate a synthesized waveform, which it outputs from synthesized waveform output terminal 217. Next, the method of generating the interpolated segment waveform is described with reference to FIG. 3.
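The data flow of the synthesis section described above can be sketched as follows. The text does not specify how editing/synthesis circuit 216 combines segments, so the overlap-add placement used here is an assumption, and the names `segment_store`, `synthesis_data`, and `edit_and_synthesize` are hypothetical.

```python
def edit_and_synthesize(segment_store, synthesis_data):
    """Concatenate stored segments under pitch and amplitude control.

    segment_store:  dict mapping segment number -> list of samples; it holds
                    natural and interpolated segments alike (cf. circuit 214).
    synthesis_data: list of (segment_number, pitch_period_in_samples,
                    amplitude) entries (cf. circuit 215).
    Assumption: each segment starts one pitch period after the previous one,
    and overlapping samples are summed.
    """
    out = []
    position = 0
    for number, pitch_period, amplitude in synthesis_data:
        segment = segment_store[number]
        end = position + len(segment)
        if end > len(out):
            out.extend([0.0] * (end - len(out)))   # grow the output buffer
        for i, sample in enumerate(segment):
            out[position + i] += amplitude * sample
        position += pitch_period
    return out


# Example with toy data: segment 0 is "natural", segment 1 "interpolated".
store = {0: [1.0, 0.5, 0.25], 1: [2.0, 1.0]}
waveform = edit_and_synthesize(store, [(0, 2, 1.0), (1, 2, 0.5)])
```

Because both kinds of segment sit in the same store and are addressed by segment number, the editing step does not need to distinguish natural segments from interpolated ones.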

FIG. 3 is a block diagram showing in detail interpolated segment waveform generation section 207 of FIG. 2. First, the two segment waveforms extracted from natural speech that are to be interpolated are input from segment waveform input terminal 306 to formant extraction circuit 307.

Formant extraction circuit 307 extracts the formant frequencies and bandwidths from the segment waveforms extracted from natural speech, according to control data supplied from control circuit 301 via formant extraction circuit control data transmission line 302, and outputs them to interpolation circuit 308. Interpolation circuit 308 interpolates the formant frequencies and bandwidths output from formant extraction circuit 307, according to control data supplied from control circuit 301 via interpolation circuit control data transmission line 303, calculates the interpolated formant frequencies (Fcn in equation (1)) and the interpolated bandwidths (bcn in equation (1)), and outputs them to prediction coefficient calculation circuit 309. Prediction coefficient calculation circuit 309 calculates linear prediction coefficients from the interpolated formant frequencies and bandwidths output from interpolation circuit 308, according to control data supplied from control circuit 301 via prediction coefficient calculation circuit control data transmission line 304, and outputs them to linear filter 310.

Meanwhile, sound source waveform generation circuit 311 generates an impulse waveform as the sound source waveform, according to control data supplied from control circuit 301 via sound source waveform generation circuit control data transmission line 305, and outputs it to linear filter 310. Linear filter 310 is controlled by the linear prediction coefficients output from prediction coefficient calculation circuit 309. It calculates the interpolated segment waveform using the impulse waveform output from sound source waveform generation circuit 311 as the sound source, and outputs the interpolated segment waveform from interpolated segment waveform output terminal 312.

[Brief explanation of the drawing]

FIG. 1 is a diagram for explaining the principle of the present invention, FIG. 2 is a block diagram showing one embodiment of the present invention, and FIG. 3 is a block diagram showing in detail interpolated segment waveform generation section 207 of FIG. 2. In FIG. 2, 201 is the speech waveform input terminal, 202 is the control circuit, 203 is the speech segment extraction section control data transmission line, 204 is the interpolated segment generation section control data transmission line, 205 is the output command data transmission line, 206 is the speech segment extraction section, 207 is the interpolated segment generation section, 208 is a switch, 209 is a switch, 210 is the synthesis section, 211 is the synthesis section control circuit, 212 is the synthesis data storage circuit control data transmission line, 213 is the editing/synthesis circuit control data transmission line, 214 is the segment waveform storage circuit, 215 is the synthesis data storage circuit, 216 is the editing/synthesis circuit, and 217 is the synthesized waveform output terminal.

Claims

[Claims]

1. A segment editing type speech synthesis device that edits and synthesizes a plurality of speech segment waveforms, each of pitch-period length, extracted in advance from natural speech waveforms, characterized in that speech segment waveforms generated by a speech segment interpolation device are used together with the speech segment waveforms extracted from the natural speech, the speech segment interpolation device including: means for calculating parameter values representing the frequency spectrum shape of a speech segment waveform extracted from the natural speech; means for interpolating the parameter values representing the frequency spectrum shape calculated from two different speech segment waveforms; and means for synthesizing a speech segment waveform from the interpolated parameter values representing the frequency spectrum shape.