JP3532059B2

JP3532059B2 - Speech synthesis method and speech synthesis device

Info

Publication number: JP3532059B2
Application number: JP05752197A
Authority: JP
Inventors: 幸雄田部井
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1997-03-12
Filing date: 1997-03-12
Publication date: 2004-05-31
Anticipated expiration: 2017-03-12
Also published as: JPH10254495A

Abstract

PROBLEM TO BE SOLVED: To enable pitch marks with little pitch fluctuation to be set by relatively simple processing, and thereby to achieve a high-quality speech synthesis method, by preparing speech waveform segments in advance and windowing and overlap-adding them, shifted by one pitch period at a time, about a local minimum point in each segment. SOLUTION: When a sentence containing a mixture of kanji and kana is input to a text analysis unit 101, the unit performs morphological analysis with reference to a word dictionary 102, determines the reading, accent, and intonation of the sentence, and outputs phonetic symbols with prosodic symbols (an intermediate language). A parameter generation unit 103 sets a pitch frequency pattern, phoneme durations, and the like. A speech synthesis unit 104 performs the speech synthesis processing: it selects segments from a segment dictionary 105, a windowing unit 106 applies to each segment a time window centered on the pitch mark, and speech is synthesized by the PSOLA method (a synthesis method that overlap-adds speech waveforms while shifting the pitch mark positions in accordance with the synthesis pitch period).

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech synthesis method and a speech synthesis device that synthesize arbitrary speech by rule, and more particularly to a speech synthesis method and a speech synthesis device that obtain synthesized speech by concatenating speech waveforms.

[0002]

[Prior Art] A conventional text-to-speech conversion device, that is, a device that converts text into speech and outputs it, is generally known to consist of a text analysis unit, a parameter generation unit, and a speech synthesis unit. The text analysis unit receives a sentence containing a mixture of kanji and kana, performs morphological analysis with reference to a word dictionary, determines the reading, accent, and intonation, and outputs phonetic symbols with prosodic symbols (an intermediate language). The parameter generation unit sets the pitch frequency pattern, phoneme durations, and the like. The speech synthesis unit performs the speech synthesis processing. Linear prediction and similar methods were formerly used for this processing, but these methods degrade information. Specifically, because vocal tract information and sound source information, which are inherently interrelated, were handled separately, and because of constraints imposed by modeling the speech production process, degradation of sound quality was unavoidable. For this reason, methods have come into use in recent years that do not separate vocal tract information from sound source information and that use the original speech waveform as it is, without artificial modeling, to obtain high-quality synthesized speech with little degradation.

[0003] A known method that uses the speech waveform as it is is described in F. J. Charpentier and M. G. Stella, "Diphone Synthesis Using an Overlap-Add Technique for Speech Waveforms Concatenation," Proc. Int. Conf. ASSP, Tokyo, 1986, pp. 2015-2018. In this method, pitch marks (reference points) are attached to the speech waveform in advance, the waveform is cut out around each pitch mark position, and at synthesis time the cut-out waveforms are overlap-added while the pitch mark positions are shifted by one synthesis pitch period at a time. The method is known as PSOLA (Pitch-Synchronous Overlap Add method).
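The overlap-add step described in this paragraph can be sketched as follows. This is a minimal illustration, not code from the cited reference or from this specification: the function name `overlap_add`, its parameters, and the toy triangular segment are all assumptions, and the segments are taken to be already cut out and windowed.

```python
def overlap_add(segments, marks, synth_period, n_out):
    """Overlap-add segments so that their pitch marks land one synthesis period apart.

    segments     : list of sample lists (assumed already cut out and windowed)
    marks        : index of the pitch mark inside each segment
    synth_period : synthesis pitch period in samples (hypothetical parameter)
    n_out        : length of the output buffer in samples
    """
    out = [0.0] * n_out
    pos = marks[0]  # output position of the first pitch mark
    for seg, mark in zip(segments, marks):
        start = pos - mark  # align this segment's pitch mark with pos
        for k, sample in enumerate(seg):
            j = start + k
            if 0 <= j < n_out:
                out[j] += sample
        pos += synth_period  # the next pitch mark is one period later
    return out


# Two identical toy segments, pitch mark at index 2, placed 3 samples apart.
seg = [0.0, 0.5, 1.0, 0.5, 0.0]
y = overlap_add([seg, seg], marks=[2, 2], synth_period=3, n_out=9)
```

Choosing a `synth_period` larger than the analysis period spreads the segments apart (lower pitch); a smaller value packs them together (higher pitch).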

[0004] FIG. 2, taken from the above reference, is a schematic diagram of the PSOLA method, in which speech waveforms are overlap-added while the pitch is changed. The diagram shows an example in which the pitch period at synthesis is longer (the pitch is lower) than at analysis (when the segments were created).

[0005] Because the PSOLA method allows the pitch to be changed as needed, it has been widely used as the speech synthesis unit in text-to-speech conversion. The method requires a pitch mark to be attached at a specific position in every pitch period of the speech waveform, and the following positions have been proposed for the pitch mark.

[0006] (1) The peak of the speech waveform is used as the pitch mark position, as in, for example, the "speech pitch conversion method" described in Japanese Unexamined Patent Application Publication No. H04-372999.

[0007] In this case, energy is concentrated at the local peak position of the speech waveform, so this position is considered suitable for preserving the spectrum of the cut-out waveform.

[0008] (2) The peak of the short-time power is used as the pitch mark position, as in, for example, Hisashi Kawai, Norio Higuchi, Tohru Shimizu, and Seiichi Yamamoto, "A Study of a Waveform-Segment-Concatenation Speech Synthesis System," IEICE Technical Report SP93-9 (1993-05), The Institute of Electronics, Information and Communication Engineers.

[0009] In this case too, as in (1) above, energy is concentrated at the local peak position of the short-time power of the speech waveform, so this position is considered suitable for preserving the spectrum of the cut-out waveform.

[0010] (3) The peak after pitch filtering is used as the pitch mark position, as in, for example, the "speech synthesis method and device" described in Japanese Unexamined Patent Application Publication No. H07-72897.

[0011] The peak after pitch filtering is the peak of the vocal cord excitation waveform for one pitch period and, according to that publication, is reported to represent the pitch interval well.

[0012] (4) The point delayed by 15% from the impulse driving point is used as the pitch mark position, as in, for example, Yasuhiko Arai, Hirofumi Nishimura, Hiroko Yoshida, and Toshimitsu Minowa, "A Study of Pitch Waveform Extraction Position," IEICE Technical Report SP95-8 (1995-05), The Institute of Electronics, Information and Communication Engineers.

[0013] According to this reference, spectral distortion is reported to be minimized at this position.

[0014] (5) The glottal closure point is used as the pitch mark position, as in, for example, Masaharu Sakamoto, Takashi Saito, Kazuhiro Suzuki, Yasuhide Hashimoto, and Mei Kobayashi, "On a Japanese Text-to-Speech Synthesis System Using the Waveform Overlap-Add Method," IEICE Technical Report SP95-6 (1995-05), The Institute of Electronics, Information and Communication Engineers.

[0015] The glottal closure point in this reference is considered to be equivalent to the impulse driving point (the excitation point of a one-pitch waveform). A dynamic wavelet transform is used to extract the glottal closure point stably.

[0016]

[Problems to Be Solved by the Invention] However, the conventional pitch mark positions described above have the following problems.

[0017] When the peak after pitch filtering is used as the pitch mark position as in (3), the applicant's experiments show that this peak is offset from the peak position of the waveform, that the offset causes large pitch fluctuation, and that the resulting speech has a rumbling quality. Using the peak of the speech waveform as in (1) gave comparatively better results. When the peak of the short-time power is used as in (2), maximum and minimum values are evaluated on an equal footing, so pitch fluctuation can occur for some speakers. When the point delayed by 15% from the impulse driving point is used as in (4), a large amount of processing is needed to identify the position, which delays processing, and the 15% delay point is not necessarily optimal for every speaker or every type of phoneme. When the glottal closure point is used as in (5), the dynamic wavelet transform used to extract it requires a large amount of processing and, as in (4), delays processing.

[0018] The present invention has been made in view of the above problems, and its object is to enable pitch marks with little pitch fluctuation to be set by relatively simple processing, thereby realizing a high-quality speech synthesis method and speech synthesis device.

[0019]

[Means for Solving the Problems] To solve the above problems, the speech synthesis method according to the first invention is characterized in that speech synthesis segments are created in advance by a step of detecting the local minimum point immediately before a peak of a speech signal and a step of cutting out the speech signal centered on the detected local minimum point, and in that the segments are windowed and overlap-added while being shifted by one pitch period at a time, with the local minimum point in each segment as the center of overlap.

[0020] As described above, because the local minimum point immediately before the peak of the speech signal is used as the center point of the cut-out speech signal, the center point for overlap-adding can be set easily by simple processing, and spectral distortion can also be reduced. As a result, the audible rumbling was reduced.
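The two segment-creation steps of the first invention (detect the local minimum immediately before a peak, then cut the signal out centered on it) can be sketched as below. This is only an illustration under stated assumptions: the specification does not define how "the peak" is chosen, so the sketch takes it to be the largest sample of a non-empty frame, and the names `min_before_peak` and `cut_centered` and the zero-padding at the edges are hypothetical.

```python
def min_before_peak(frame):
    """Index of the local minimum immediately preceding the largest sample.

    Assumption: "the peak" is the frame's global maximum (frame must be non-empty).
    """
    peak = max(range(len(frame)), key=lambda k: frame[k])
    xd = peak
    # Walk backwards from the peak while the waveform keeps falling.
    while xd > 0 and frame[xd - 1] <= frame[xd]:
        xd -= 1
    return xd


def cut_centered(signal, center, half_len):
    """Cut 2*half_len+1 samples centered on `center`, zero-padding past the edges."""
    return [signal[j] if 0 <= j < len(signal) else 0.0
            for j in range(center - half_len, center + half_len + 1)]


frame = [0.2, 0.1, -0.3, -0.8, 0.4, 1.0, 0.3, -0.2]
xd = min_before_peak(frame)          # valley just before the peak at index 5
segment = cut_centered(frame, xd, 2)  # five samples with xd in the middle
```

If the peak is the first sample of the frame, the walk stops immediately and the function returns index 0; a fuller implementation would need a rule for that case.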

[0021] In addition, handling the speech segments in units of a fixed length and performing frame processing makes the speech waveform data easier to control during speech synthesis.

[0022] The speech synthesis method according to the second invention is characterized in that speech synthesis segments are created in advance by a step of detecting the pitch period of a speech signal, a step of detecting the local minimum point immediately before a peak of the speech signal, a step of cutting out the speech signal centered on the detected local minimum point, and a step of applying to the cut-out speech signal a window whose length is a constant multiple of the pitch period, and in that the segments are overlap-added while being shifted by one pitch period at a time, with the local minimum point in each segment as the center of overlap.

[0023] As described above, because the segments are windowed in advance on the basis of the detected pitch period, no windowing is needed at synthesis time. As a result, the amount of processing during speech synthesis can be greatly reduced, which allows the processing device to be simplified or the processing to be sped up.
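Windowing at segment-creation time can be sketched as follows. The specification only says that the window length is a constant multiple of the pitch period; the Hann window shape, the default multiple `c=2`, the odd window length (so that the mark sits exactly at the center), and the names `hann` and `prewindow` are assumptions made for this illustration.

```python
import math


def hann(n):
    """Symmetric Hann window of n samples (n >= 2). The window shape is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def prewindow(signal, mark, pitch_period, c=2):
    """Cut a segment centered on `mark` and window it over about c * pitch_period samples."""
    half = (c * pitch_period) // 2
    w = hann(2 * half + 1)  # odd length keeps the mark exactly at the center
    seg = []
    for k, j in enumerate(range(mark - half, mark + half + 1)):
        x = signal[j] if 0 <= j < len(signal) else 0.0  # zero-pad past the edges
        seg.append(x * w[k])
    return seg


# Constant toy signal, pitch period of 4 samples, mark at sample 10.
seg = prewindow([1.0] * 20, mark=10, pitch_period=4)
```

Segments stored this way can be overlap-added directly at synthesis time, which is the saving the paragraph above describes.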

[0024] The speech synthesis method according to the third invention is characterized in that speech synthesis segments are created in advance by a step of inverting the polarity of a speech signal as appropriate so that the polarity of the speech signal as a whole is consistent, a step of detecting the local minimum point immediately before a peak of the speech signal, and a step of cutting out the speech signal centered on the detected local minimum point, and in that the segments are windowed and overlap-added while being shifted by one pitch period at a time, with the local minimum point in each segment as the center of overlap.

[0025] With this configuration, a change in phase caused by, for example, a change in the configuration of the analog system can be corrected digitally.
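One possible form of the polarity-matching step is sketched below. The specification says only that the polarity is inverted "as appropriate"; the criterion used here (flip the signal when its largest excursion is negative) and the name `align_polarity` are assumptions for illustration, not the applicant's stated rule.

```python
def align_polarity(signal):
    """Return a copy of the signal, sign-flipped if its largest excursion is negative.

    The decision rule is an assumed criterion; the specification does not give one.
    """
    if not signal:
        return []
    if -min(signal) > max(signal):
        return [-x for x in signal]
    return list(signal)


flipped = align_polarity([0.1, -0.9, 0.2])   # dominant excursion is negative
kept = align_polarity([0.1, 0.9, -0.2])      # already positive-dominant
```

Applying such a rule to every recording before pitch-mark detection would make "the peak" refer to the same side of the waveform regardless of how the analog recording chain was wired.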

[0026] The speech synthesis method according to the fourth invention is characterized in that speech synthesis segments are created in advance by a step of detecting the pitch period of a speech signal, a step of inverting the polarity of the speech signal as appropriate so that the polarity of the speech signal as a whole is consistent, a step of detecting the local minimum point immediately before a peak of the speech signal, a step of cutting out the speech signal centered on the detected local minimum point, and a step of applying to the cut-out speech signal a window whose length is a constant multiple of the pitch period, and in that the segments are overlap-added while being shifted by one pitch period at a time, with the local minimum point in each segment as the center of overlap.

[0027] With this configuration, because the segments are windowed in advance, the amount of processing during speech synthesis can be greatly reduced.

[0028] In addition, because a function for inverting the polarity of the speech waveform is provided, a change in phase caused by, for example, a change in the configuration of the analog system can be corrected digitally.

[0029] The speech synthesis device according to the fifth invention is characterized by comprising: local minimum point detection means for detecting the local minimum point immediately before a peak of a speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; speech synthesis segment storage means for storing the speech synthesis segments cut out by the speech signal cut-out means; and a speech synthesis unit that windows and overlap-adds the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by one pitch period at a time, with the local minimum point of each segment as the center of overlap.

[0030] As described above, because the local minimum point immediately before the peak of the speech signal, detected by the local minimum point detection means, is used as the center point of the speech signal cut out by the speech signal cut-out means, the center point for overlap-adding in the speech synthesis unit can be set easily by simple processing, and spectral distortion can also be reduced. As a result, the audible rumbling was reduced. In addition, handling the speech segments in units of a fixed length and performing frame processing makes the speech waveform data easier to control during speech synthesis.

[0031] The speech synthesis device according to the sixth invention is characterized by comprising: pitch period detection means for detecting the pitch period of a speech signal; local minimum point detection means for detecting the local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; windowing means for applying to the speech signal cut out by the speech signal cut-out means a window whose length is a constant multiple of the pitch period; speech synthesis segment storage means for storing the speech synthesis segments windowed by the windowing means; and a speech synthesis unit that overlap-adds the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by one pitch period at a time, with the local minimum point of each segment as the center of overlap.

[0032] As described above, because the windowing means windows the segments in advance on the basis of the pitch period obtained by the pitch period detection means, no windowing is needed at synthesis time. As a result, the amount of processing during speech synthesis can be greatly reduced, which allows the processing device to be simplified or the processing to be sped up.

[0033] The speech synthesis device according to the seventh invention is characterized by comprising: speech signal inversion means for inverting the polarity of a speech signal as appropriate so that the polarity of the speech signal as a whole is consistent; local minimum point detection means for detecting the local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; speech synthesis segment storage means for storing the speech synthesis segments cut out by the speech signal cut-out means; and a speech synthesis unit that windows and overlap-adds the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by one pitch period at a time, with the local minimum point of each segment as the center of overlap.

[0034] With this configuration, the speech signal inversion means inverts the polarity of the speech signal as appropriate so that the polarity of the speech signal as a whole is consistent, which allows a change in phase caused by, for example, a change in the configuration of the analog system to be corrected digitally.

[0035] The speech synthesis device according to the eighth invention is characterized by comprising: pitch period detection means for detecting the pitch period of a speech signal; speech signal inversion means for inverting the polarity of the speech signal as appropriate so that the polarity of the speech signal as a whole is consistent; local minimum point detection means for detecting the local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; windowing means for applying to the speech signal cut out by the speech signal cut-out means a window whose length is a constant multiple of the pitch period; speech synthesis segment storage means for storing the speech synthesis segments windowed by the windowing means; and a speech synthesis unit that overlap-adds the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by one pitch period at a time, with the local minimum point of each segment as the center of overlap.

[0036] With this configuration, because the windowing means windows the segments in advance, the amount of processing during speech synthesis can be greatly reduced.

[0037] In addition, because speech signal inversion means is provided that inverts the polarity of the speech waveform as appropriate so that the polarity of the speech signal as a whole is consistent, a change in phase caused by, for example, a change in the configuration of the analog system can be corrected digitally.

[0038] The speech synthesis device according to the ninth invention is characterized by comprising: an inverting amplifier that inverts and amplifies a speech signal; a non-inverting amplifier that amplifies the speech signal without inversion; a selector that selects between the speech signal from the inverting amplifier and the speech signal from the non-inverting amplifier; an AD converter that converts the speech signal selected by the selector into digital values; storage means for storing the data converted by the AD converter; speech signal reading means for sequentially reading the speech signal stored in the storage means; local minimum point detection means for detecting the local minimum point immediately before a peak of the speech signal read by the speech signal reading means; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; speech synthesis segment storage means for storing the speech synthesis segments cut out by the speech signal cut-out means; and a segment concatenation synthesis unit that windows and overlap-adds a speech synthesis segment selected from the speech synthesis segment storage means while shifting by one pitch period at a time, with the local minimum point of the segment as the center of overlap.

[0039] With this configuration, the inverting amplifier or non-inverting amplifier and the selector select, as appropriate, between the inverted and amplified speech signal and the non-inverted and amplified speech signal, thereby inverting the polarity of the speech waveform as appropriate so that the polarity of the speech signal as a whole is consistent. This allows a change in phase caused by, for example, a change in the configuration of the analog system to be corrected digitally.

[0040] In addition, because the local minimum point immediately before the peak of the speech signal, detected by the local minimum point detection means, is used as the center point of the speech signal cut out by the speech signal cut-out means, the center point for overlap-adding can be set easily by simple processing, and spectral distortion can also be reduced. As a result, the audible rumbling was reduced. Furthermore, handling the speech segments in units of a fixed length and performing frame processing makes the speech waveform data easier to control during speech synthesis.

[0041] The speech synthesis device according to the tenth invention is characterized by comprising: an inverting amplifier that inverts and amplifies a speech signal; a non-inverting amplifier that amplifies the speech signal without inversion; a selector that selects between the speech signal from the inverting amplifier and the speech signal from the non-inverting amplifier; an AD converter that converts the speech signal selected by the selector into digital values; storage means for storing the data converted by the AD converter; speech signal reading means for sequentially reading the speech signal stored in the storage means; pitch period detection means for detecting the pitch period of the speech signal read by the speech signal reading means; window length calculation means for multiplying the pitch period detected by the pitch period detection means by a constant; local minimum point detection means for detecting the local minimum point immediately before a peak of the speech signal read by the speech signal reading means; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detection means; windowing means for applying to the speech signal cut out by the speech signal cut-out means a window of the length calculated by the window length calculation means; speech synthesis segment storage means for storing the speech synthesis segments windowed by the windowing means; and a segment concatenation synthesis unit that overlap-adds a speech synthesis segment selected from the speech synthesis segment storage means while shifting by one pitch period at a time, with the local minimum point of the segment as the center of overlap.

[0042] With this configuration, a change in phase caused by, for example, a change in the configuration of the analog system can be corrected digitally. In addition, because the windowing means windows the segments in advance on the basis of the pitch period obtained by the pitch period detection means, no windowing is needed at synthesis time. As a result, the amount of processing during speech synthesis can be greatly reduced, which allows the processing device to be simplified or the processing to be sped up.

[0043]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the accompanying drawings.

[0044] [First Embodiment] The speech synthesis method and speech synthesis device according to the first embodiment are described below. FIG. 1 is a block diagram showing the configuration of the speech synthesis device according to the first embodiment.

[0045] When a sentence containing a mixture of kanji and kana is input, the text analysis unit 101 performs morphological analysis with reference to the word dictionary 102, determines the reading, accent, and intonation of the sentence, and outputs phonetic symbols with prosodic symbols (an intermediate language). The parameter generation unit 103 sets the pitch frequency pattern, phoneme durations, and the like. The text analysis unit 101, the word dictionary 102, and the parameter generation unit 103 do not differ from their conventional counterparts.

[0046] The speech synthesis unit 104 performs the speech synthesis processing. Specifically, it selects a segment from the segment dictionary 105, the windowing unit 106 applies to the segment a time window of length Tp1 (described below) centered on the pitch mark, and speech is synthesized by the PSOLA method.

【００４７】ここで、時間窓長Tp1は、分析時のピッチ
周期をTpa、合成時のピッチ周期をTpsとした場合、 Tp1=C0×min(Tpa,Tps) のように設定する。なお、C0は2.0程度の値である。Here, the time window length Tp1 is set as Tp1 = C0 × min (Tpa, Tps), where Tpa is the pitch period during analysis and Tps is the pitch period during synthesis. Note that C0 has a value of about 2.0.
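
The window-length rule above is a one-line calculation. A minimal Python sketch follows; the function name and the choice of milliseconds as the unit are assumptions made for illustration, and only the formula Tp1 = C0 × min(Tpa, Tps) and the value C0 ≈ 2.0 come from the text.

```python
def time_window_length(tpa: float, tps: float, c0: float = 2.0) -> float:
    """Return Tp1 = C0 * min(Tpa, Tps).

    tpa: pitch period at analysis time, tps: pitch period at synthesis time.
    Both must use the same unit (milliseconds here, an assumed choice).
    """
    if tpa <= 0 or tps <= 0:
        raise ValueError("pitch periods must be positive")
    return c0 * min(tpa, tps)


# Analysis period 8 ms, synthesis period 5 ms: the shorter period governs.
tp1 = time_window_length(8.0, 5.0)  # 2.0 * 5.0 = 10.0
```

Taking the minimum keeps the window from spanning more than about two periods of whichever pitch is shorter.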

【００４８】素片辞書１０５は素片を書き込んだ辞書で
ある。素片は素片作成部１０７にて作成される。素片作
成部１０７は、本発明の主要部分であり、図３のフロー
チャートに示す処理機能を有する。The segment dictionary 105 is a dictionary in which segments are written. The segments are created by the segment creating unit 107. The segment creating unit 107 is the main part of the present invention and has the processing functions shown in the flowchart of FIG. 3.

【００４９】この素片作成部１０７での処理を図３に従
って説明する。データディスクなどを備えた音声信号入
力部１０８によって、音声信号が素片作成部１０７に入
力されると、まず、ステップＳ２０１で、入力された音
声信号データを分析フレームに分割する。この分析フレ
ームは一定長さの区間に区切られた音声信号データのこ
とで、本実施例では、１フレーム長が３２ｍ秒で、８ｍ
秒ずらして次のフレームに移るように区切られている。
ここでは、総フレーム数をＮとする。The processing in the segment creating unit 107 will be described with reference to FIG. 3. When a speech signal is input to the segment creating unit 107 from the speech signal input unit 108, which includes a data disk or the like, the input speech signal data is first divided into analysis frames in step S201. An analysis frame is a fixed-length section of the speech signal data. In this embodiment, each frame is 32 ms long and successive frames are offset by 8 ms. The total number of frames is denoted by N.
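
The framing in step S201 can be sketched as below. The 32 ms frame length and 8 ms shift are from the text. The function name, the 16 kHz sampling rate in the example, and the decision to drop a trailing partial frame are assumptions, since the text does not state them.

```python
def split_into_frames(num_samples, sample_rate, frame_ms=32.0, shift_ms=8.0):
    """Return (start, end) sample-index pairs for the analysis frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    # Assumed policy: a trailing partial frame is discarded.
    while start + frame_len <= num_samples:
        frames.append((start, start + frame_len))
        start += shift
    return frames


# One second of audio at an assumed 16 kHz sampling rate.
frames = split_into_frames(num_samples=16000, sample_rate=16000)
n_frames = len(frames)  # this is the N of the text
```

With these settings each frame is 512 samples long and overlaps its successor by 384 samples.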

【００５０】ステップＳ２０２では、処理を行う分析フ
レームのフレーム番号ｉを初期化する。ステップＳ２０
３では、第ｉフレームにおける、ピーク直前の極小値を
与える時間軸の座標xdを検出する。この座標xdの検出例
を図１１に示す。なお、図１１に示す音声波形は、ア
（／ａ／）と発声したときの音声波形で、マーク「＊」
の位置が本実施形態に係るピッチマーク位置（ピーク直
前の極小値）である。このピッチマーク位置の検出は容
易に行うことができる。即ち、各分析フレーム中のピー
ク点は容易に特定でき、その直前の極小点も容易に特定
できる。この極小点がピッチマーク位置であるため、ピ
ッチマーク位置を容易に検出することができる。In step S202, the frame number i of the analysis frame to be processed is initialized. In step S203, the time-axis coordinate xd of the local minimum immediately preceding the peak in the i-th frame is detected. An example of detecting this coordinate xd is shown in FIG. 11. The speech waveform in FIG. 11 is that of the vowel /a/, and the positions marked "*" are the pitch mark positions of this embodiment (the local minima immediately preceding the peaks). These pitch mark positions are easy to detect: the peak in each analysis frame is easily identified, and so is the local minimum immediately before it. Because that local minimum is the pitch mark position, the pitch mark position can be detected easily.
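
One way to read step S203 is sketched below. It assumes that "the peak" means the largest sample in the frame and that "the local minimum immediately preceding the peak" is found by walking back down the rising slope that leads to it. Both readings, and the function name, are assumptions. The text only says that the peak and the minimum before it are easy to identify.

```python
def pitch_mark_before_peak(frame):
    """Return the index xd of the local minimum just before the largest peak."""
    # Peak: position of the largest sample (first one if there are ties).
    peak = max(range(len(frame)), key=lambda n: frame[n])
    xd = peak
    # Walk backwards while the waveform keeps falling towards the left.
    while xd > 0 and frame[xd - 1] <= frame[xd]:
        xd -= 1
    return xd


frame = [0, 2, 1, -3, -1, 5, 2, 0, 1]
xd = pitch_mark_before_peak(frame)  # -> 3 (the sample with value -3)
```

If the waveform is still rising at the start of the frame, the walk stops at index 0, so a practical implementation would probably need to look at samples before the frame boundary as well.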

【００５１】次いで、図３中のステップＳ２０４で、座
標xdの前後にそれぞれＬ分の音声データを切り出し、座
標xdが中央に位置するようにセンタリングする。なお、
ここではＬ分を１２ｍ秒に設定した。これは、本発明者
の予備実験により、男性で最長のピッチ周期に余裕を持
たせた値である。Next, in step S204 in FIG. 3, speech data of length L is cut out on each side of the coordinate xd, so that xd lies at the center of the cut-out data. Here L is set to 12 ms, a value that the inventor's preliminary experiments showed to leave a margin over the longest male pitch period.
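
The cut-out in step S204 might look like the following sketch. The value L = 12 ms is from the text. The function name, the zero padding used when the cut-out range runs past either end of the signal, and the 1 kHz sampling rate in the example are assumptions.

```python
def cut_segment(signal, xd, sample_rate, l_ms=12.0):
    """Cut L ms on each side of xd; xd ends up at the centre of the result."""
    half = int(round(sample_rate * l_ms / 1000.0))
    segment = []
    for n in range(xd - half, xd + half + 1):
        # Assumed policy: samples outside the signal are treated as zero.
        segment.append(signal[n] if 0 <= n < len(signal) else 0.0)
    return segment  # length 2 * half + 1, pitch mark at index `half`


signal = [float(n) for n in range(100)]
seg = cut_segment(signal, xd=50, sample_rate=1000)  # half = 12 samples
centre_value = seg[len(seg) // 2]                   # the sample at xd
```

Because every segment has the same fixed length and its pitch mark at a known index, later stages can address segments without storing a separate pitch mark position.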

【００５２】ステップＳ２０５では、第ｉフレームにお
ける素片として、ステップＳ２０４で切り出した音声デ
ータをデータディスク等の記憶媒体に、素片辞書１０５
として順次書き込みを行う。ステップＳ２０６では、全
分析フレームについて素片の書き込みが終了したか否か
の判定を行う。この書き込みが終了していなければ、ス
テップＳ２０７でフレーム番号を更新してステップＳ２
０３に戻り、ステップＳ２０３からステップＳ２０５ま
での処理を継続する。ステップＳ２０６で全分析フレー
ムの処理が終了したと判定した場合は、素片辞書１０５
のデータディスクのクローズ処理等（図示せず）を行っ
て素片作成部１０７の動作を終了する。In step S205, the speech data cut out in step S204 is written, as the segment for the i-th frame, to a storage medium such as a data disk, building up the segment dictionary 105. In step S206, it is determined whether segments have been written for all analysis frames. If not, the frame number is updated in step S207 and processing returns to step S203, so that steps S203 to S205 are repeated. If step S206 determines that all analysis frames have been processed, the data disk holding the segment dictionary 105 is closed (not shown) and the operation of the segment creating unit 107 ends.
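
Steps S201 to S207 can be put together as a single loop. The sketch below is illustrative only: a Python list stands in for the segment dictionary 105 on disk, and the peak search, the zero padding and all names are assumptions rather than details given in the text.

```python
def build_segment_dictionary(signal, sample_rate,
                             frame_ms=32.0, shift_ms=8.0, l_ms=12.0):
    """Return one fixed-length segment per analysis frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    half = int(round(sample_rate * l_ms / 1000.0))
    dictionary = []                                   # stands in for dictionary 105
    i = 0                                             # S202: initialise frame number
    while i * shift + frame_len <= len(signal):       # S201 / S206: frames remain?
        start = i * shift
        frame = signal[start:start + frame_len]
        # S203: local minimum just before the largest sample of the frame.
        xd = max(range(frame_len), key=lambda n: frame[n])
        while xd > 0 and frame[xd - 1] <= frame[xd]:
            xd -= 1
        xd += start                                   # back to a global index
        # S204: cut L on each side of xd (zero padded at the signal edges).
        segment = [signal[n] if 0 <= n < len(signal) else 0.0
                   for n in range(xd - half, xd + half + 1)]
        dictionary.append(segment)                    # S205: write the segment
        i += 1                                        # S207: next frame
    return dictionary
```

The sketch does not separate voiced from unvoiced frames and does not handle a minimum that falls before the start of the frame; both would need attention in real use.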

【００５３】以上の処理によって作成された素片が書き
込まれた素片辞書１０５内から、対象となる素片が適宜
選択され、窓掛け部１０６にて窓掛けが行われて、音声
合成部１０４で音声合成処理が行われる。The required segments are selected as appropriate from the segment dictionary 105, into which the segments created by the above processing have been written; the windowing unit 106 windows them, and the speech synthesis unit 104 performs the speech synthesis processing.

【００５４】なお、ピーク直前の極小値検出は、音声信
号の有声部分に対してのみ行われるものとする。無声音
部分は、音声データをそのまま使用する。以下に述べる
他の実施形態においても同様である。It is assumed that the local minimum value detection just before the peak is performed only on the voiced part of the voice signal. The voice data is used as it is for the unvoiced part. The same applies to other embodiments described below.

【００５５】［効果］各分析フレーム中のピーク直前の
極小値をピッチマークとしているので、簡易な処理によ
ってピッチマークを設定することができ、スペクトル歪
みも小さくすることができる。この結果、本発明者の実
験によれば、聴感上ゴロゴロした音が減少した。[Effect] Because the local minimum immediately preceding the peak in each analysis frame is used as the pitch mark, the pitch mark can be set by simple processing and spectral distortion is kept small. As a result, according to the inventor's experiments, the perceived rumbling in the synthesized speech was reduced.

【００５６】また、１２ｍ秒の固定長を単位として音声
素片を扱い、フレーム処理を行っているので、音声合成
時において、音声波形データを制御しやすいという効果
もある。In addition, because speech segments are handled in fixed-length units of 12 ms and processed frame by frame, the speech waveform data is easy to control during speech synthesis.

【００５７】［第２の実施形態］次に、本発明の第２の
実施形態について説明する。[Second Embodiment] Next, a second embodiment of the present invention will be described.

【００５８】図４は第２の実施形態に係る音声合成装置
の構成を示すブロック図である。FIG. 4 is a block diagram showing the configuration of the speech synthesizer according to the second embodiment.

【００５９】本実施形態に係る音声合成装置において、
テキスト解析部１０１、単語辞書１０２、パラメータ生
成部１０３及び音声信号入力部１０８は、前記第１の実
施形態に係る音声合成装置と同様である。In the speech synthesizer according to this embodiment, the text analysis unit 101, the word dictionary 102, the parameter generation unit 103 and the speech signal input unit 108 are the same as those of the speech synthesizer according to the first embodiment.

【００６０】本実施形態の音声合成方法は、ピッチマー
クを各分析フレーム中のピーク直前の極小点に設定する
点で前記第１の実施形態に係る音声合成方法と同様であ
る。そして、本実施形態の音声合成方法の特徴は、素片
作成部３０１において素片にあらかじめ窓掛けを行う点
にある。The voice synthesizing method of this embodiment is similar to the voice synthesizing method of the first embodiment in that the pitch mark is set to the minimum point immediately before the peak in each analysis frame. The feature of the speech synthesis method of the present embodiment is that the segment creation unit 301 performs windowing on the segment in advance.

【００６１】素片作成部３０１は、ピッチマーク算出部
３０２と窓掛け部３０３により構成されている。この素
片作成部３０１は、図５に示す処理機能を備えている。
この素片作成部３０１での処理を以下に説明する。The segment creating unit 301 consists of a pitch mark calculating unit 302 and a windowing unit 303, and has the processing functions shown in FIG. 5. The processing in the segment creating unit 301 is described below.

【００６２】音声信号入力部１０８から音声信号データ
が入力されると、まずステップＳ４０１で音声信号デー
タが分析フレームに分割される。この分析フレームは、
前記第１の実施形態と同様に、１フレーム長が３２ｍ秒
で、８ｍ秒ずらして次のフレームに移るように設定され
ている。総フレーム数はＮである。When speech signal data is input from the speech signal input unit 108, it is first divided into analysis frames in step S401. As in the first embodiment, each analysis frame is 32 ms long and successive frames are offset by 8 ms. The total number of frames is N.

【００６３】ステップＳ４０２では、処理を行うフレー
ム番号ｉを初期化する。ステップＳ４０３では、第ｉフ
レームにおける音声のピッチ周期Tpを検出する。このピ
ッチ周期Tpを検出する方法には、簡易な手法として波形
のピーク間隔を検出する方法等が考えられるが、本実施
形態ではケプストラム法を用いている。これは、より精
密にピッチ周期を算出するためである。このケプストラ
ム法では、図６に示す処理工程でピッチ周期Tpを検出す
る。まず、ステップＳ５０１で時間波形を入力し、ステ
ップＳ５０２で窓掛けを行う。次いで、窓掛けを行った
時間波形に対してステップＳ５０３で離散フーリエ変換
（ＤＦＴ）を施し、ステップＳ５０４でその実部と虚部
の二乗和の平方根を対数変換する。その後、ステップＳ
５０５で逆フーリエ変換（ＩＤＦＴ）を施し、ステップ
Ｓ５０６でケプストラム成分を得て出力する。このよう
に、ケプストラム法は、畳み込み演算を加法的な演算に
変換するものである。音声の有声音信号は音源成分を声
道情報で畳み込んだものであるため、ケプストラム法は
両者の分離に適している。入力信号が音声の有声音信号
の場合、ピッチ周期をT0とすれば、音源成分は高ケフレ
ンシイ（長時間領域）のT0の近傍として現れ、声道成分
は低ケフレンシイ（短時間領域）の成分として現れる。
ケプストラムからピッチ周期を求めるには、高ケフレン
シイ部のピークを求めて、時間原点からこのピークまで
の時間を測定すればよい。In step S402, the frame number i to be processed is initialized. In step S403, the pitch period Tp of the speech in the i-th frame is detected. A simple way to detect the pitch period Tp would be to measure the interval between waveform peaks, but this embodiment uses the cepstrum method in order to calculate the pitch period more precisely. In the cepstrum method, the pitch period Tp is detected by the processing steps shown in FIG. 6. First, a time waveform is input in step S501 and windowed in step S502. Next, the windowed time waveform is subjected to a discrete Fourier transform (DFT) in step S503, and in step S504 the square root of the sum of the squares of its real and imaginary parts (the magnitude) is converted to a logarithm. An inverse discrete Fourier transform (IDFT) is then applied in step S505, and the resulting cepstrum is output in step S506. The cepstrum method thus turns a convolution into an addition. Because a voiced speech signal is the convolution of a sound source component with the vocal tract response, the cepstrum method is well suited to separating the two. When the input is a voiced speech signal with pitch period T0, the sound source component appears at high quefrency (the long-time region) in the vicinity of T0, and the vocal tract component appears at low quefrency (the short-time region). To obtain the pitch period from the cepstrum, it suffices to find the peak in the high-quefrency region and measure the time from the origin to that peak.
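
Steps S501 to S506 and the subsequent peak search can be sketched in plain Python as follows. The Hamming window, the naive O(n²) DFT, the small constant added before the logarithm, and the 70 to 400 Hz search range are all assumptions chosen for the illustration. The text specifies only the sequence window, DFT, log magnitude, IDFT, and peak picking at high quefrency.

```python
import cmath
import math


def cepstrum_pitch_period(frame, sample_rate, f_min=70.0, f_max=400.0):
    """Estimate the pitch period (in seconds) of one voiced frame."""
    n = len(frame)
    # S502: window the time waveform (Hamming window, an assumed choice).
    windowed = [frame[t] * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t in range(n)]
    # S503: discrete Fourier transform (naive form, fine for short frames).
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) for k in range(n)]
    # S504: logarithm of sqrt(re^2 + im^2); the offset avoids log(0).
    log_mag = [math.log(abs(c) + 1e-12) for c in spectrum]
    # S505 / S506: inverse DFT gives the (real) cepstrum.
    cepstrum = [(sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n)
                     for k in range(n)) / n).real for q in range(n)]
    # Peak search restricted to the high-quefrency region.
    q_lo = int(sample_rate / f_max)
    q_hi = min(int(sample_rate / f_min), n // 2)
    q_peak = max(range(q_lo, q_hi + 1), key=lambda q: cepstrum[q])
    return q_peak / sample_rate
```

Restricting the search to quefrencies above `q_lo` is what separates the sound source peak from the vocal tract component, which sits near the origin. A production implementation would normally use an FFT instead of the direct sums shown here.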

【００６４】次に、図５中のステップＳ４０４で、第ｉ
フレームにおける、ピーク直前の極小値を与える時間軸
の座標xdを検出する。この座標xdの検出に関しては前記
第１の実施形態と同様である（図１１参照）。ステップ
Ｓ４０５では、座標xdの前後それぞれＬ分の音声データ
を切り出し、座標xdが中央に位置するようにセンタリン
グする。Ｌ分の長さは、前記第１実施形態と同様であ
る。Next, in step S404 in FIG. 5, the time-axis coordinate xd of the local minimum immediately preceding the peak in the i-th frame is detected. The coordinate xd is detected in the same way as in the first embodiment (see FIG. 11). In step S405, speech data of length L is cut out on each side of the coordinate xd, so that xd lies at the center. The length L is the same as in the first embodiment.

【００６５】次いで、ステップＳ４０６において、前記
ステップＳ４０３で求めたピッチ周期Tpを定数C1倍し、
ステップＳ４０７で極小点xdを中心に前後それぞれC1×
Tpの長さの時間窓を掛ける。この定数C1として本実施例
にいては、1.0程度の値を用いる。なお、定数C1として
は、本発明者の実験によれば、1.0より小さい値が望ま
しい。これは、定数C1が1.0より小さいことで、隣接す
るピッチの影響を抑制して、雑音を減少することができ
るためである。Next, in step S406, the pitch period Tp obtained in step S403 is multiplied by a constant C1, and in step S407 a time window extending C1×Tp on each side of the local minimum xd is applied. In this embodiment a value of about 1.0 is used for C1. According to the inventor's experiments, a value smaller than 1.0 is preferable, because it suppresses the influence of adjacent pitch periods and so reduces noise.
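
Steps S406 and S407 can be illustrated as below. The text gives the window extent (C1×Tp on each side of xd) but not its shape, so the Hann (raised-cosine) window used here is an assumption, as are the function name and the expression of Tp in samples.

```python
import math


def prewindow_segment(segment, tp_samples, c1=1.0):
    """Window a centred segment over C1 * Tp samples on each side of its centre."""
    centre = len(segment) // 2        # xd was centred when the segment was cut
    half = c1 * tp_samples            # S406: half-width of the window, C1 * Tp
    out = []
    for n, s in enumerate(segment):
        d = abs(n - centre)
        # S407: Hann taper (assumed shape), zero outside +/- C1 * Tp.
        w = 0.5 * (1.0 + math.cos(math.pi * d / half)) if d < half else 0.0
        out.append(s * w)
    return out


windowed = prewindow_segment([1.0] * 25, tp_samples=8)  # centre sample kept at 1.0
```

Choosing C1 below 1.0 narrows the taper so that less of the neighbouring pitch periods survives in each stored segment, which matches the motivation given in the text.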

【００６６】次いで、ステップＳ４０８で、第ｉフレー
ムにおける素片として、窓掛けした音声データをデータ
ディスク等の記憶媒体に、素片辞書３０５として順次書
き込みを行う。次いでステップＳ４０９で、全フレーム
の処理を終了したか否かの判定を行い、終了していなけ
れば、ステップＳ４１０でフレーム番号を更新してステ
ップＳ４０３に戻り、前記ステップＳ４０３からステッ
プＳ４０８までの処理を継続する。ステップＳ４０９で
全フレームの処理が終了したと判定したときは、前記デ
ータディスク等の記憶媒体のクローズ処理等（図示せ
ず）を行って素片作成部３０１の動作を終了する。Next, in step S408, the windowed speech data is written, as the segment for the i-th frame, to a storage medium such as a data disk, building up the segment dictionary 305. In step S409, it is determined whether all frames have been processed. If not, the frame number is updated in step S410 and processing returns to step S403, so that steps S403 to S408 are repeated. When step S409 determines that all frames have been processed, the storage medium such as the data disk is closed (not shown) and the operation of the segment creating unit 301 ends.

【００６７】以上の処理によって作成された素片が書き
込まれた素片辞書３０５内から、対象となる素片が適宜
選択され、音声合成部３０４で音声合成処理が行われ
る。The target segment is appropriately selected from the segment dictionary 305 in which the segment created by the above process is written, and the voice synthesis unit 304 performs the voice synthesis process.

【００６８】［効果］本実施形態では、素片作成部３０
１において精度良くピッチ周期Tpを求め、それに基づい
て予め素片辞書３０５に書き込む素片に窓掛けしておく
ので、音声合成時に窓掛け処理をする必要がなくなる。
即ち、第１の実施形態において音声合成処理時に必要で
あった１ピッチ毎の窓掛け処理（乗算）が不要となり、
ただ重ね合わせを実行するだけで済むので、音声合成処
理時の処理量を大幅に減少させることができる。[Effect] In this embodiment, the segment creating unit 301 obtains the pitch period Tp accurately and, based on it, windows the segments in advance before they are written to the segment dictionary 305, so no windowing is needed during speech synthesis. That is, the per-pitch windowing (multiplication) that the first embodiment required during synthesis becomes unnecessary and only the overlap-add remains, so the amount of processing at synthesis time can be reduced significantly.

【００６９】この結果、本実施形態の音声合成方法を用
いた音声合成装置においては、ＤＳＰ等の高度な演算プ
ロセッサを使用することなく、通常のＣＰＵで実現する
ことが可能になる。また、同一の演算プロセッサを使用
する場合には、音声合成処理の大幅な高速化を図ること
ができる。As a result, the voice synthesizing apparatus using the voice synthesizing method of the present embodiment can be realized by a normal CPU without using an advanced arithmetic processor such as DSP. Further, when the same arithmetic processor is used, it is possible to significantly speed up the voice synthesis process.

【００７０】また、定数C1を1.0より小さい値にするこ
とで、隣接するピッチの影響を抑制することができ、雑
音を減少させることができる。By setting the constant C1 to a value smaller than 1.0, it is possible to suppress the influence of adjacent pitches and reduce noise.

【００７１】［第３の実施形態］次に、本発明の第３の
実施形態について説明する。[Third Embodiment] Next, a third embodiment of the present invention will be described.

【００７２】本実施形態の音声合成方法に用いる音声合
成装置の全体構成は、前記第１の実施形態の音声合成装
置とほぼ同様である。そして、本実施形態の特徴は、素
片作成部（１０７）での処理において音声波形の位相反
転を制御する機能を持たせた点と、反転制御部を設け
て、前記素片作成部で位相反転処理をさせるか否かを制
御できるようにした点になる。The overall configuration of the speech synthesizer used in the speech synthesis method of this embodiment is almost the same as that of the first embodiment. This embodiment is characterized in that the processing in the segment creating unit (107) includes a function for inverting the phase of the speech waveform, and in that an inversion control unit is provided so that whether the segment creating unit performs this phase inversion can be controlled.

【００７３】まず、素片作成部の動作を図７（Ａ）のフ
ローチャートに基づいて説明する。First, the operation of the segment creating unit will be described with reference to the flowchart of FIG. 7(A).

【００７４】音声信号データが入力されると、ステップ
Ｓ６０１で音声信号データが分析フレームに分割され、
ステップＳ６０２で処理を行うフレーム番号ｉが初期化
される。これらの処理は前記第１の実施形態と同様であ
る。When the voice signal data is input, the voice signal data is divided into analysis frames in step S601.
In step S602, the frame number i to be processed is initialized. These processes are the same as those in the first embodiment.

【００７５】ステップＳ６０３では、共有メモリに格納
されている反転フラグを調べ、反転フラグが１であれ
ば、ステップＳ６０４で音声波形の正負を反転する。反
転フラグが０であれば、音声波形の反転は行わず、ステ
ップＳ６０５に飛ぶ。In step S603, the inversion flag stored in the shared memory is checked. If the inversion flag is 1, the positive / negative of the voice waveform is inverted in step S604. If the inversion flag is 0, the voice waveform is not inverted, and the process jumps to step S605.

【００７６】次に、ステップＳ６０５で、第ｉフレーム
における、ピーク直前の極小値を与える時間軸の座標xd
を検出する。この座標xdの検出に関しては前記第１の実
施形態と同様である（図１１参照）。ステップＳ６０６
では、座標xdの前後それぞれＬ分の音声データを切り出
し、座標xdが中央に位置するようにセンタリングする。
Ｌ分の長さは、前記第１実施形態と同様である。次い
で、ステップＳ６０７で、第ｉフレームにおける素片と
して、ステップＳ６０６で切り出した音声データを素片
辞書１０５に順次書き込む。Next, in step S605, the time-axis coordinate xd of the local minimum immediately preceding the peak in the i-th frame is detected, in the same way as in the first embodiment (see FIG. 11). In step S606, speech data of length L is cut out on each side of the coordinate xd, so that xd lies at the center. The length L is the same as in the first embodiment. Then, in step S607, the speech data cut out in step S606 is written to the segment dictionary 105 as the segment for the i-th frame.

【００７７】次いでステップＳ６０８で、全フレームの
処理を終了したか否かの判定を行い、終了していなけれ
ば、ステップＳ６０９でフレーム番号を更新してステッ
プＳ６０３に戻り、このステップＳ６０３からステップ
Ｓ６０７までの処理を継続する。ステップＳ６０８で全
フレームの処理が終了したと判定したときは、前記デー
タディスク等の記憶媒体のクローズ処理等（図示せず）
を行って素片作成部の動作を終了する。Next, in step S608, it is determined whether all frames have been processed. If not, the frame number is updated in step S609 and processing returns to step S603, so that steps S603 to S607 are repeated. When step S608 determines that all frames have been processed, the storage medium such as the data disk is closed (not shown) and the operation of the segment creating unit ends.

【００７８】以上の処理によって作成された素片が書き
込まれた素片辞書１０５内から、対象となる素片が適宜
選択され、音声合成部１０４で音声合成処理が行われ
る。The target speech segment is appropriately selected from the speech segment dictionary 105 in which the speech segment created by the above processing is written, and the speech synthesis unit 104 performs speech synthesis processing.

【００７９】次に、反転制御部の動作を説明する。この
反転制御部は、キーボード等からの作業者による指示に
基づいて、前記素片作成部での音声波形の反転処理を制
御するもので、図７（Ｂ）のフローチャートに示す処理
機能を有している。この反転制御部での動作を以下に説
明する。Next, the operation of the inversion control unit will be described. The inversion control unit controls the waveform inversion performed in the segment creating unit according to an instruction entered by the operator from a keyboard or the like, and has the processing functions shown in the flowchart of FIG. 7(B). Its operation is described below.

【００８０】まず、ステップＳ６１０で、キーボード等
から入力された作業者の意思を確認する。即ち、作業者
が音声信号の位相の反転を指示しているか否かを判定す
る。反転指示の場合には、ステップＳ６１１により、前
記共有メモリ（前記素片作成部のステップＳ６０３で調
べる共有メモリ）上の反転フラグを１に設定する。非反
転指示の場合には、ステップＳ６１２により、共有メモ
リ上の反転フラグを０に設定する。First, in step S610, the operator's choice entered from the keyboard or the like is checked; that is, it is determined whether the operator has instructed inversion of the phase of the speech signal. If inversion is instructed, step S611 sets the inversion flag in the shared memory (the same shared memory that is checked in step S603 of the segment creating unit) to 1. If no inversion is instructed, step S612 sets the inversion flag in the shared memory to 0.

【００８１】この共有メモリ上の反転フラグの設定に基
づいて、前記素片作成部のステップＳ６０３での判断が
なされる。Based on the setting of the inversion flag on the shared memory, the determination in step S603 of the segment creating unit is made.
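
The interplay between the inversion control unit (steps S610 to S612) and the segment creating unit (steps S603 and S604) can be sketched as below. A module-level dictionary stands in for the shared memory, and the function names are hypothetical. Only the flag values 0 and 1 and the sign flip are taken from the text.

```python
shared_memory = {"invert_flag": 0}  # stands in for the shared memory


def inversion_control(operator_wants_inversion: bool) -> None:
    """S610 to S612: store the operator's choice as the inversion flag."""
    shared_memory["invert_flag"] = 1 if operator_wants_inversion else 0


def apply_inversion(frame):
    """S603 and S604: flip the sign of every sample when the flag is 1."""
    if shared_memory["invert_flag"] == 1:
        return [-s for s in frame]
    return list(frame)


inversion_control(True)
flipped = apply_inversion([1.0, -2.0, 0.5])  # -> [-1.0, 2.0, -0.5]
```

Keeping the flag outside the segment creation routine lets the same routine serve recordings of either polarity without modification.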

【００８２】なお、反転制御部の実行は、アナログ系が
一定なら、最初に行っておくのが望ましい。音声信号を
収録した環境が、他と一部分相違するような場合には、
図１の音声信号入力部１０８と前記共有メモリ上に設定
する反転フラグとを対応させて表を作成し、これに基づ
いて反転フラグを共有メモリに記憶するように構成して
もよい。If the analog system is fixed, the inversion control unit should preferably be run once at the beginning. If the recording environment of some speech signals differs from that of the others, a table associating each speech signal input unit 108 of FIG. 1 with the inversion flag to be set in the shared memory may be created, and the inversion flag stored in the shared memory according to that table.

【００８３】［効果］マイクや、マイクで拾った音声信
号を増幅するアンプ等のアナログ系を変えた場合など、
アナログ系がもとの構成と違った場合には、位相が反転
してしまうことがある。この場合は、音声データの正負
が逆転してしまうので、極小値を検出したつもりが極大
値を検出してしまうことがある。[Effect] When the analog system differs from the original configuration, for example because the microphone or the amplifier that amplifies the microphone signal has been changed, the phase of the signal may be inverted. The sign of the speech data is then reversed, so a detection intended to find a local minimum may find a local maximum instead.

【００８４】本実施形態によれば、このアナログ系の構
成の変化等による位相の変化をディジタル的に補正する
ことができるようになる。この結果、単一の音声合成装
置で、アナログ系の違いに対応することができるように
なる。According to this embodiment, it becomes possible to digitally correct the phase change due to the change in the configuration of the analog system. As a result, a single voice synthesizer can handle the difference in analog system.

【００８５】［第４の実施形態］次に、本発明の第４の
実施形態について説明する。[Fourth Embodiment] Next, a fourth embodiment of the present invention will be described.

【００８６】本実施形態の音声合成装置の全体構成は、
前記第２及び第３の実施形態に係る音声合成装置とほぼ
同様である。第２の実施形態との比較における本実施形
態の特徴は、素片作成部の処理において音声波形の位相
反転を制御する機能を持たせた点と、反転制御部を設け
て前記素片作成部で位相反転処理をさせるか否かを制御
できるようにした点になる。第３の実施形態との比較に
おける本実施形態の特徴は、素片作成部において素片に
あらかじめ窓掛けを行う点にある。The overall configuration of the speech synthesizer of this embodiment is almost the same as that of the second and third embodiments. Compared with the second embodiment, this embodiment is characterized in that the processing in the segment creating unit includes a function for inverting the phase of the speech waveform, and in that an inversion control unit is provided so that whether the segment creating unit performs this phase inversion can be controlled. Compared with the third embodiment, it is characterized in that the segment creating unit windows the segments in advance.

【００８７】まず、素片作成部の動作を図８（Ａ）のフ
ローチャートに基づいて説明する。First, the operation of the segment creating unit will be described with reference to the flowchart of FIG. 8(A).

【００８８】音声信号データが入力されると、ステップ
Ｓ７０１で音声信号データが分析フレームに分割され、
ステップＳ７０２で処理を行うフレーム番号ｉが初期化
される。これらの処理は前記第１の実施形態と同様であ
る。ステップＳ７０３では、第ｉフレームにおける音声
のピッチ周期Tpを検出する。このピッチ周期Tpを検出す
る方法としては、前記第２の実施形態と同様にケプスト
ラム法を用いる。When the voice signal data is input, the voice signal data is divided into analysis frames in step S701,
In step S702, the frame number i to be processed is initialized. These processes are the same as those in the first embodiment. In step S703, the pitch period Tp of the voice in the i-th frame is detected. As a method for detecting the pitch period Tp, the cepstrum method is used as in the second embodiment.

【００８９】ステップＳ７０４では、共有メモリに格納
されている反転フラグを調べ、反転フラグが１であれ
ば、ステップＳ７０５で音声波形の正負を反転する。反
転フラグが０であれば、音声波形の反転は行わず、ステ
ップＳ７０６に飛ぶ。In step S704, the inversion flag stored in the shared memory is checked. If the inversion flag is 1, the positive / negative of the voice waveform is inverted in step S705. If the inversion flag is 0, the voice waveform is not inverted, and the process jumps to step S706.

【００９０】次に、ステップＳ７０６で、第ｉフレーム
における、ピーク直前の極小値を与える時間軸の座標xd
を検出する。この座標xdの検出に関しては前記第１の実
施形態と同様である（図１１参照）。ステップＳ７０７
では、座標xdの前後それぞれＬ分の音声データを切り出
し、座標xdが中央に位置するようにセンタリングする。
Ｌ分の長さは、前記第１実施形態と同様である。次い
で、ステップＳ７０８において、前記ステップＳ７０３
で求めたピッチ周期Tpを定数C1倍し、ステップＳ７０９
で極小点xdを中心に前後それぞれC1×Tpの長さの時間窓
を掛ける。Next, in step S706, the time-axis coordinate xd of the local minimum immediately preceding the peak in the i-th frame is detected, in the same way as in the first embodiment (see FIG. 11). In step S707, speech data of length L is cut out on each side of the coordinate xd, so that xd lies at the center. The length L is the same as in the first embodiment. Then, in step S708, the pitch period Tp obtained in step S703 is multiplied by the constant C1, and in step S709 a time window extending C1×Tp on each side of the local minimum xd is applied.

【００９１】次いで、ステップＳ７１０で、第ｉフレー
ムにおける素片として、窓掛けした音声データをデータ
ディスク等の記憶媒体に、素片辞書（３０５）として順
次書き込みを行う。次いでステップＳ７１１で、全フレ
ームの処理を終了したか否かの判定を行い、終了してい
なければ、ステップＳ７１２でフレーム番号を更新して
ステップＳ７０３に戻り、前記ステップＳ７０３からス
テップＳ７１０までの処理を継続する。ステップＳ７１
１で全フレームの処理が終了したと判定したときは、前
記データディスク等の記憶媒体のクローズ処理等（図示
せず）を行って素片作成部の動作を終了する。Next, in step S710, the windowed speech data is written, as the segment for the i-th frame, to a storage medium such as a data disk, building up the segment dictionary (305). In step S711, it is determined whether all frames have been processed. If not, the frame number is updated in step S712 and processing returns to step S703, so that steps S703 to S710 are repeated. When step S711 determines that all frames have been processed, the storage medium such as the data disk is closed (not shown) and the operation of the segment creating unit ends.

【００９２】以上の処理によって作成された素片が書き
込まれた素片辞書内から、対象となる素片が適宜選択さ
れ、音声合成部で音声合成処理が行われる。The target segment is appropriately selected from the segment dictionary in which the segment created by the above process is written, and the speech synthesis unit performs the speech synthesis process.

【００９３】次に、反転制御部の動作を説明する。この
反転制御部は、前記第３の実施形態における反転制御部
と同様であり、図８（Ｂ）のフローチャートに示す処理
機能を有している。この反転制御部での動作を以下に説
明する。Next, the operation of the inversion controller will be described. This inversion control unit is similar to the inversion control unit in the third embodiment and has the processing function shown in the flowchart of FIG. 8 (B). The operation of this inversion control unit will be described below.

【００９４】まず、ステップＳ７２０で、キーボード等
から入力された作業者の意思を確認する。即ち、作業者
が音声信号の位相の反転を指示しているか否かを判定す
る。反転指示の場合には、ステップＳ７２１により、前
記共有メモリ（前記素片作成部のステップＳ７０４で調
べる共有メモリ）上の反転フラグを１に設定する。非反
転指示の場合には、ステップＳ７２２により、共有メモ
リ上の反転フラグを０に設定する。この共有メモリ上の
反転フラグの設定に基づいて、前記素片作成部のステッ
プＳ７０４での判断がなされる。First, in step S720, the operator's choice entered from the keyboard or the like is checked; that is, it is determined whether the operator has instructed inversion of the phase of the speech signal. If inversion is instructed, step S721 sets the inversion flag in the shared memory (the same shared memory that is checked in step S704 of the segment creating unit) to 1. If no inversion is instructed, step S722 sets the inversion flag in the shared memory to 0. The determination in step S704 of the segment creating unit is made according to this inversion flag.

【００９５】なお、反転制御部の実行は、第３の実施形
態における反転制御部の場合と同様に、アナログ系が一
定なら、最初に行っておくのが望ましい。音声信号を収
録した環境が、他と一部分相違するような場合には、音
声信号入力部と前記共有メモリ上に設定する反転フラグ
とを対応させて表を作成し、これに基づいて反転フラグ
を共有メモリに記憶するように構成してもよい。The inversion control unit is preferably executed first if the analog system is constant, as in the case of the inversion control unit in the third embodiment. If the environment in which the audio signal is recorded is partially different from the others, a table is created by associating the audio signal input section with the inversion flag set on the shared memory, and the inversion flag is set based on this table. It may be configured to be stored in the shared memory.

【００９６】［効果］第４の実施形態によれば、素片作
成部において、ピッチ周期検出部を設けて、予め素片辞
書に書き込む素片に窓掛けしておくので、音声合成時に
窓掛け処理をする必要がなくなる。即ち、１ピッチ毎の
窓掛け処理（乗算）が不要となり、音声合成処理時の処
理量を大幅に減少させることができる。[Effect] According to the fourth embodiment, the segment creating unit includes a pitch period detecting unit and windows the segments in advance before they are written to the segment dictionary, so no windowing is needed during speech synthesis. That is, the per-pitch windowing (multiplication) becomes unnecessary, and the amount of processing at synthesis time can be reduced significantly.

【００９７】また、音声波形の正負を反転させる機能を
持たせたので、アナログ系の構成の変化等による位相の
変化をディジタル的に補正することができるようにな
る。この結果、単一の音声合成装置で、アナログ系の違
いに対応することができるようになる。Further, since the function of inverting the positive / negative of the voice waveform is provided, it becomes possible to digitally correct the phase change due to the change of the configuration of the analog system. As a result, a single voice synthesizer can handle the difference in analog system.

【００９８】［第５の実施形態］次に、本発明の第５の
実施形態について説明する。図９に第５の実施形態に係
る音声合成装置の構成を示す。[Fifth Embodiment] Next, a fifth embodiment of the present invention will be described. FIG. 9 shows the configuration of a speech synthesizer according to the fifth embodiment.

【００９９】図中の音声入力端子８００に入力された音
声信号は、１の経路として、反転増幅器８０２に入力さ
れて位相が反転され、セレクタ８０３に入力される。他
の経路は、非反転増幅器８０１を介して（音声信号の位
相を反転せずに）、セレクタ８０３に入力される。セレ
クタ８０３では、反転増幅器８０２を通した音声信号と
非反転増幅器８０１を通した音声信号のうち、一方が選
択されてＡＤ変換器８０４に入力される。入力された音
声信号は、このＡＤ変換器８０４でディジタル信号に変
換され、記憶媒体８０５に記憶される。The speech signal applied to the speech input terminal 800 in the figure follows two paths. On one path it is input to the inverting amplifier 802, where its phase is inverted, and then to the selector 803. On the other path it is input to the selector 803 through the non-inverting amplifier 801, without phase inversion. The selector 803 selects either the signal from the inverting amplifier 802 or the signal from the non-inverting amplifier 801 and passes it to the AD converter 804, which converts it to a digital signal that is stored in the storage medium 805.

【０１００】音声信号読み出し回路８０６では、記憶媒
体８０５中に記憶された音声データを読み出し、極小値
検出回路８０７で、ピーク直前の極小値を検出する。こ
の極小値検出回路８０７での極小値検出処理は、前記第
１の実施形態における素片作成部１０７のステップＳ２
０３（図３参照）での極小値検出処理と同様である（図
１１参照）。The speech signal reading circuit 806 reads the speech data stored in the storage medium 805, and the local minimum detection circuit 807 detects the local minimum immediately preceding the peak. This detection is the same as the local minimum detection in step S203 of the segment creating unit 107 in the first embodiment (see FIG. 3 and FIG. 11).

【０１０１】音声切り出し回路８０８では、ピーク直前
の極小値の前後それぞれＬ分の音声データを切り出し、
この極小値が中央に位置するようセンタリングする。Ｌ
分は前記第１の実施形態と同様の１２ｍ秒とした。この
音声切り出し回路８０８での音声切り出し処理は、前記
第１の実施形態における素片作成部１０７のステップＳ
２０４（図３参照）での音声切り出し処理と同様であ
る。この音声切り出し回路８０８で切り出したデータを
素片ファイル８０９として、ディスク装置などに記憶す
る。The speech cut-out circuit 808 cuts out speech data of length L on each side of the local minimum immediately preceding the peak, so that the local minimum lies at the center. L is 12 ms, as in the first embodiment. This cut-out processing is the same as that in step S204 of the segment creating unit 107 in the first embodiment (see FIG. 3). The data cut out by the speech cut-out circuit 808 is stored on a disk device or the like as the segment file 809.

【０１０２】以上の処理動作は、入力された全ての音声
データについて行われる。The above processing operation is performed for all the input voice data.

【０１０３】次に、音声合成処理時の動作について説明
する。Next, the operation at the time of speech synthesis processing will be described.

【０１０４】文字列人力端子８１０を介して文字音素記
号変換回路８１１に文字列が入力されると、この文字音
素記号変換回路８１１では、入力された文字列に対し
て、対応するアクセント記号付きの音素記号を出力す
る。韻律情報設定回路８１２では、文字音素記号変換回
路８１１からの音素記号に、イントネーションの強さ、
音韻の継続時間などの韻律情報を設定する。When a character string is input to the character-to-phoneme-symbol conversion circuit 811 through the character string input terminal 810, the circuit outputs the corresponding phoneme symbols with accent marks. The prosody information setting circuit 812 attaches prosodic information, such as intonation strength and phoneme duration, to the phoneme symbols from the character-to-phoneme-symbol conversion circuit 811.

【０１０５】The segment selection circuit 813 selects and reads from the segment file 809 the segments needed to convert the phoneme symbol string into speech, and outputs them to the windowing circuit 814. The windowing circuit 814 applies a window to each frame of the segments read by the segment selection circuit 813 and outputs the result to the segment connection synthesis circuit 815. The segment connection synthesis circuit 815 superimposes the frame-by-frame segments windowed by the windowing circuit 814 while shifting them by the synthesis pitch period. Through the above speech synthesis operation, a time waveform of speech is obtained and output from the synthesized speech output terminal 816.
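
The windowing and superposition performed by the circuits 814 and 815 can be sketched in software as follows. This is a minimal Python illustration and not the disclosed circuit: the function names `hann` and `overlap_add`, the choice of a Hann window, and the use of one fixed synthesis pitch period for every frame are assumptions made for the example.

```python
import math

def hann(n):
    """Symmetric Hann window of n samples (n >= 2). The window shape is an assumption."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def overlap_add(frames, pitch_period):
    """Window each frame, then add the frames at offsets of pitch_period samples.

    frames: list of equal-length sample lists, each centered on its pitch mark.
    pitch_period: synthesis pitch period in samples (the shift between frames).
    """
    frame_len = len(frames[0])
    window = hann(frame_len)
    out = [0.0] * (pitch_period * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        start = k * pitch_period
        for i, sample in enumerate(frame):
            # One multiplication (windowing) and one addition (superposition) per sample.
            out[start + i] += sample * window[i]
    return out
```

In this sketch a smaller `pitch_period` packs the frames closer together, which raises the pitch of the output, and a larger one lowers it.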

【０１０６】[Effect] Even when the phase of the speech signal is inverted because the analog system differs from the original configuration, for example when an analog component such as a microphone or an amplifier has been changed, the inverting amplifier 802, the non-inverting amplifier 801, and the selector 803 make it possible to correct the phase change easily and in an analog manner. As a result, a single speech synthesizer can cope with differences in the phase of the analog system.

【０１０７】[Sixth Embodiment] Next, a sixth embodiment of the present invention will be described. FIG. 10 shows the configuration of a speech synthesizer according to the sixth embodiment. Because the overall configuration of the speech synthesizer according to this embodiment is almost the same as that of the speech synthesizer according to the fifth embodiment, identical parts are described using the same reference numerals.

【０１０８】As in the fifth embodiment, the speech signal input to the speech input terminal 900 in the figure is input to the selector 903 through two paths, one via the inverting amplifier 902 and one via the non-inverting amplifier 901. The selector 903 selects either the speech signal that has passed through the inverting amplifier 902 or the speech signal that has passed through the non-inverting amplifier 901, and inputs it to the AD converter 904. The AD converter 904 converts the input speech signal into a digital signal, which is stored in the storage medium 905.

【０１０９】The speech signal reading circuit 906 reads the speech data stored in the storage medium 905 and outputs it to the pitch period detection circuit 921. The pitch period detection circuit 921 detects the pitch period of the speech data. This pitch period detection is the same as the pitch period detection (step S403) of the segment creating unit 301 in the second embodiment. The cepstrum method or the like is used as the pitch period detection method. After the detection by the pitch period detection circuit 921, the minimum value detection circuit 907 detects the minimum value immediately before the peak.
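
A minimal Python sketch of detecting the minimum value immediately before the peak is given below. It assumes that the input holds a single pitch period containing one dominant positive peak, and that the minimum is found by walking back from the peak down the rising slope. The function name `minimum_before_peak` and this search rule are illustrative assumptions, not the procedure of the minimum value detection circuit 907.

```python
def minimum_before_peak(samples):
    """Return (peak_index, min_index) for one pitch period of samples.

    peak_index is the position of the largest sample; min_index is the local
    minimum reached by walking backwards from the peak (assumed search rule).
    """
    peak = max(range(len(samples)), key=lambda i: samples[i])
    i = peak
    # Walk backwards while the waveform keeps falling towards earlier samples.
    while i > 0 and samples[i - 1] <= samples[i]:
        i -= 1
    return peak, i
```

The returned `min_index` would then serve as the pitch mark on which the cut-out frame is centered.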

【０１１０】The speech cut-out circuit 908 cuts out speech data of length L (12 msec, as in the first embodiment) on each side of the minimum value immediately before the peak, and centers the data so that this minimum value is located at the center. The data cut out by the speech cut-out circuit 908 is output to the windowing circuit 923. The windowing circuit 923 applies a time window whose length is calculated by the window length calculation circuit 922 based on the pitch period from the pitch period detection circuit 921. The windowed speech data is stored in a disk device or the like as a segment file 909.
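
The cut-out and pre-windowing steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the disclosed circuits: the function name `cut_and_window`, the Hann window shape, the default constant `multiple=2` for the window length, and zero-padding at the ends of the signal are all choices made for the example. Only the 12 msec half-length and the idea of a window length derived from the pitch period come from the text.

```python
import math

def cut_and_window(samples, min_index, sample_rate, pitch_period,
                   half_len_sec=0.012, multiple=2):
    """Cut half_len_sec of signal on each side of min_index and window it.

    pitch_period is given in samples. The window spans multiple * pitch_period
    samples around the center; everything outside the window is set to zero.
    """
    half = int(round(half_len_sec * sample_rate))
    # Cut the frame so that min_index lands at the center (index `half`),
    # zero-padding where the frame runs past either end of the signal.
    frame = [samples[i] if 0 <= i < len(samples) else 0.0
             for i in range(min_index - half, min_index + half + 1)]
    win_half = min(half, (multiple * pitch_period) // 2)
    if win_half < 1:
        raise ValueError("window must span at least one sample on each side")
    out = [0.0] * len(frame)
    for offset in range(-win_half, win_half + 1):
        # Hann weight: 1 at the center, falling to 0 at +/- win_half (assumed shape).
        weight = 0.5 + 0.5 * math.cos(math.pi * offset / win_half)
        out[half + offset] = frame[half + offset] * weight
    return out
```

Because every stored frame has the same length regardless of the pitch period, the segment file can be handled in fixed-size units.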

【０１１１】The above processing is performed for all of the input speech data.

【０１１２】Next, the operation during speech synthesis will be described.

【０１１３】For the character string input to the character string input terminal 910, the character-to-phoneme-symbol conversion circuit 911 outputs the corresponding phoneme symbols with accent marks. The prosody information setting circuit 912 sets prosody information, such as intonation strength and phoneme duration, for these phoneme symbols.

【０１１４】The segment selection circuit 913 selects and reads from the segment file 909 the segments needed to convert the phoneme symbol string into speech, and outputs them to the segment connection synthesis circuit 915.

【０１１５】The segment connection synthesis circuit 915 superimposes the segments frame by frame while shifting them by the synthesis pitch period. Through the above speech synthesis operation, a time waveform of speech is obtained and output from the synthesized speech output terminal 916.

【０１１６】[Effect] According to this embodiment, windowing is applied when the speech segment file is created, so no windowing is needed during speech synthesis. The circuit configuration of the speech synthesis portion therefore becomes a simple one that contains no multiplier.

【０１１７】In addition, because the inverting amplifier and the selector make it possible to correct an inversion of the phase of the speech data easily and in an analog manner, the same speech synthesizer can be applied without modification.
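
The simplification at synthesis time can be seen in a short Python sketch. With segments that were windowed when the segment file was created, superposition needs additions only. The function name `overlap_add_prewindowed` and the fixed pitch period are assumptions made for the example, and this is not the disclosed circuit 915.

```python
def overlap_add_prewindowed(frames, pitch_period):
    """Add already-windowed frames at offsets of pitch_period samples.

    No multiplication is performed here; the window was applied beforehand.
    """
    frame_len = len(frames[0])
    out = [0.0] * (pitch_period * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        start = k * pitch_period
        for i, sample in enumerate(frame):
            out[start + i] += sample  # addition only
    return out
```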

【０１１８】[Modifications] In the second, fourth, and sixth embodiments, the cepstrum method was used as the pitch period detection method, but other methods may be used, for example the autocorrelation method or the modified autocorrelation method, which takes the autocorrelation of the linear prediction residual.
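
As an illustration of the autocorrelation alternative mentioned above, the following Python sketch picks the lag with the largest autocorrelation as the pitch period. The function name `autocorr_pitch_period`, the lag search range, and the absence of normalization or voicing decisions are simplifying assumptions made for the example, and the test signal is synthetic.

```python
import math

def autocorr_pitch_period(samples, min_lag, max_lag):
    """Return the lag (in samples) within [min_lag, max_lag] with the largest autocorrelation."""
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        value = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

# Synthetic example: a sine wave whose period is 40 samples
# (250 Hz at an assumed sampling rate of 10 kHz).
wave = [math.sin(2.0 * math.pi * i / 40) for i in range(400)]
period = autocorr_pitch_period(wave, 20, 100)
```

The lag range should bracket the expected pitch periods of the speaker; a range that is too wide can pick up multiples of the true period on real speech.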

【０１１９】The segment creating unit in the speech synthesis method and speech synthesis device of each of the above embodiments can also be applied to processing in various speech output devices, such as pitch mark setting in a so-called voice pitch conversion device, which changes the pitch of the original speech to alter the height of the voice.

【０１２０】[0120]

[Effects of the Invention] As described in detail above, the present invention provides the following effects.

【０１２１】(1) Because the local minimum point immediately before the peak of the speech signal is used as the center point of the speech signal to be cut out, the center point for superposition can be set easily by simple processing, and spectral distortion can also be kept small. As a result, the rumbling sound perceived by listeners was reduced.

【０１２２】(2) Handling speech segments in units of a fixed length and processing them frame by frame makes the speech waveform data easier to control during speech synthesis.

【０１２３】(3) Windowing the segments in advance based on the detected pitch period eliminates the need for windowing during speech synthesis and greatly reduces the amount of processing at synthesis time. As a result, the processing device can be simplified or the processing can be sped up.

【０１２４】(4) Because a function for inverting the sign of the speech waveform is provided, a change in phase caused by, for example, a change in the configuration of the analog system can be corrected digitally.

【０１２５】(5) Because the segments are windowed in advance, the amount of processing during speech synthesis can be greatly reduced.

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of a speech synthesizer according to a first embodiment of the present invention.

FIG. 2 is a schematic diagram showing the PSOLA method, which superimposes speech waveforms while changing the pitch.

FIG. 3 is a flowchart showing the processing in the segment creating unit of the speech synthesizer according to the first embodiment of the present invention.

FIG. 4 is a block diagram showing the configuration of a speech synthesizer according to a second embodiment of the present invention.

FIG. 5 is a flowchart showing the processing in the segment creating unit of the speech synthesizer according to the second embodiment of the present invention.

FIG. 6 is a flowchart illustrating the cepstrum method.

FIG. 7 is a flowchart showing the processing in the segment creating unit and the inversion control unit of a speech synthesizer according to a third embodiment of the present invention.

FIG. 8 is a flowchart showing the processing in the segment creating unit and the inversion control unit of a speech synthesizer according to a fourth embodiment of the present invention.

FIG. 9 is a block diagram showing the configuration of a speech synthesizer according to a fifth embodiment of the present invention.

FIG. 10 is a block diagram showing the configuration of a speech synthesizer according to a sixth embodiment of the present invention.

FIG. 11 is an explanatory diagram showing an example of detecting the minimum value immediately before the peak for a voiced sound in each embodiment of the present invention.

[Explanation of symbols]

101: text analysis unit, 102: word dictionary, 103: parameter generation unit, 104: speech synthesis unit, 105: segment dictionary, 106: windowing unit, 107: segment creating unit, 108: speech signal input unit.

Claims

(57) [Claims]

1. A speech synthesis method characterized in that a speech synthesis segment is created in advance by a step of detecting a local minimum point immediately before a peak of a speech signal and a step of cutting out the speech signal centered on the detected local minimum point, and the speech synthesis segments are windowed and superimposed while being shifted by a pitch period, with the local minimum point in each speech synthesis segment as the center of superposition.

2. A speech synthesis method characterized in that a speech synthesis segment is created in advance by a step of detecting a pitch period of a speech signal, a step of detecting a local minimum point immediately before a peak of the speech signal, a step of cutting out the speech signal centered on the detected local minimum point, and a step of applying to the cut-out speech signal a window whose length is a constant multiple of the pitch period, and the speech synthesis segments are superimposed while being shifted by a pitch period, with the local minimum point in each speech synthesis segment as the center of superposition.

3. A speech synthesis method characterized in that a speech synthesis segment is created in advance by a step of inverting the sign of a speech signal as appropriate so that the polarity of the entire speech signal is aligned, a step of detecting a local minimum point immediately before a peak of the speech signal, and a step of cutting out the speech signal centered on the detected local minimum point, and the speech synthesis segments are windowed and superimposed while being shifted by a pitch period, with the local minimum point in each speech synthesis segment as the center of superposition.

4. A speech synthesis method characterized in that a speech synthesis segment is created in advance by a step of detecting a pitch period of a speech signal, a step of inverting the sign of the speech signal as appropriate so that the polarity of the entire speech signal is aligned, a step of detecting a local minimum point immediately before a peak of the speech signal, a step of cutting out the speech signal centered on the detected local minimum point, and a step of applying to the cut-out speech signal a window whose length is a constant multiple of the pitch period, and the speech synthesis segments are superimposed while being shifted by a pitch period, with the local minimum point in each speech synthesis segment as the center of superposition.

5. A speech synthesis device comprising: local minimum point detecting means for detecting a local minimum point immediately before a peak of a speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; speech synthesis segment storage means for storing the speech synthesis segment cut out by the speech signal cut-out means; and speech synthesis means for windowing and superimposing the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.

6. A speech synthesis device comprising: pitch period detecting means for detecting a pitch period of a speech signal; local minimum point detecting means for detecting a local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; windowing means for applying to the speech signal cut out by the speech signal cut-out means a window whose length is a constant multiple of the pitch period; speech synthesis segment storage means for storing the speech synthesis segment windowed by the windowing means; and speech synthesis means for superimposing the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.

7. A speech synthesis device comprising: speech signal inverting means for inverting the sign of a speech signal as appropriate so that the polarity of the entire speech signal is aligned; local minimum point detecting means for detecting a local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; speech synthesis segment storage means for storing the speech synthesis segment cut out by the speech signal cut-out means; and speech synthesis means for windowing and superimposing the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.

8. A speech synthesis device comprising: pitch period detecting means for detecting a pitch period of a speech signal; speech signal inverting means for inverting the sign of the speech signal as appropriate so that the polarity of the entire speech signal is aligned; local minimum point detecting means for detecting a local minimum point immediately before a peak of the speech signal; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; windowing means for applying to the speech signal cut out by the speech signal cut-out means a window whose length is a constant multiple of the pitch period; speech synthesis segment storage means for storing the speech synthesis segment windowed by the windowing means; and speech synthesis means for superimposing the speech synthesis segments stored in the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.

9. A speech synthesis device comprising: an inverting amplifier that inverts and amplifies a speech signal; a non-inverting amplifier that amplifies the speech signal without inversion; a selector that selects between the speech signal from the inverting amplifier and the speech signal from the non-inverting amplifier; an AD converter that converts the speech signal selected by the selector into digital values; storage means for storing the data AD-converted by the AD converter; speech signal reading means for sequentially reading the speech signal stored in the storage means; local minimum point detecting means for detecting a local minimum point immediately before a peak of the speech signal read by the speech signal reading means; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; speech synthesis segment storage means for storing the speech synthesis segment cut out by the speech signal cut-out means; and segment connection synthesis means for windowing and superimposing the speech synthesis segments selected from the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.

10. A speech synthesis device comprising: an inverting amplifier that inverts and amplifies a speech signal; a non-inverting amplifier that amplifies the speech signal without inversion; a selector that selects between the speech signal from the inverting amplifier and the speech signal from the non-inverting amplifier; an AD converter that converts the speech signal selected by the selector into digital values; storage means for storing the data AD-converted by the AD converter; speech signal reading means for sequentially reading the speech signal stored in the storage means; pitch period detecting means for detecting a pitch period of the speech signal read by the speech signal reading means; window length calculating means for multiplying the pitch period detected by the pitch period detecting means by a constant; local minimum point detecting means for detecting a local minimum point immediately before a peak of the speech signal read by the speech signal reading means; speech signal cut-out means for cutting out the speech signal centered on the local minimum point detected by the local minimum point detecting means; windowing means for windowing the speech signal cut out by the speech signal cut-out means with the window length calculated by the window length calculating means; speech synthesis segment storage means for storing the speech synthesis segment windowed by the windowing means; and segment connection synthesis means for superimposing the speech synthesis segments selected from the speech synthesis segment storage means while shifting them by a pitch period, with the local minimum point of each segment as the center of superposition.