JPH0766272B2

JPH0766272B2 - Audio segmentation device

Info

Publication number: JPH0766272B2
Application number: JP62210691A
Authority: JP
Inventors: 和彦岩田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1987-08-24
Filing date: 1987-08-24
Publication date: 1995-07-19
Anticipated expiration: 2010-07-19
Also published as: JPS6454494A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to a speech segmentation device used in tasks such as collecting speech data for waveform-editing speech synthesis.

(Prior Art) Conventional speech segmentation techniques were developed for speech recognition. Known methods include the one detailed in "Segmentation and Automatic Labeling for Phoneme-Segment Representation of Speech," Proceedings of the Autumn Meeting of the Acoustical Society of Japan, fiscal year Showa 60 (1985), paper 1-4-24. These conventional methods divide the input speech into short intervals (frames) of a fixed length of 10 to 20 ms and estimate which phoneme each interval contains. The time resolution of segmentation by the conventional methods was therefore equal to the frame length.

(Problems to Be Solved by the Invention) However, collecting speech data for waveform-editing speech synthesis requires high resolution on the time axis, and the conventional techniques, which analyze the input speech after dividing it into short intervals, could not meet this requirement.

FIG. 3 illustrates the problems of the conventional technique. In the figure, the vertical axis represents the value of some acoustic parameter and the horizontal axis represents time. Because the analysis uses acoustic parameters computed for each frame of 10 to 20 ms, the resolution in the time direction is determined by the frame length. In this case, moreover, the extracted phoneme boundary candidate 31 deviates from the actual phoneme boundary 32, and such a deviation becomes a problem when creating speech data for waveform-editing speech synthesis.
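To make the resolution limit concrete, the following minimal sketch snaps a boundary to the start of the frame that contains it. The function name `frame_boundary_time_ms` and the 137 ms boundary are hypothetical, and snapping to the frame start is an assumption made for illustration; the text above only states that the resolution is determined by the frame length.

```python
def frame_boundary_time_ms(true_boundary_ms: float, frame_ms: float) -> float:
    """Snap a boundary time to the start of the frame that contains it.

    Assumption: a frame-based detector can only report frame starts.
    """
    frame_index = int(true_boundary_ms // frame_ms)
    return frame_index * frame_ms


if __name__ == "__main__":
    true_boundary_ms = 137.0  # hypothetical actual phoneme boundary
    for frame_ms in (10.0, 20.0):
        estimate = frame_boundary_time_ms(true_boundary_ms, frame_ms)
        print(frame_ms, estimate, true_boundary_ms - estimate)
```

Under this assumption the deviation can approach a full frame length, which is the kind of gap between candidate 31 and boundary 32 that FIG. 3 describes.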

The object of the present invention is therefore to provide a speech segmentation device with high time resolution.

(Means for Solving the Problems) The present invention is characterized by comprising: means for extracting a first phoneme boundary candidate from acoustic parameters extracted from input speech; means for detecting peaks of the input speech waveform in the vicinity of the first phoneme boundary candidate; means for selecting, from among those peaks, peaks corresponding to the periodic vibration of the vocal cords as phoneme boundary sub-candidates; and means for extracting acoustic parameters from the input speech in the vicinity of each of the first phoneme boundary candidate and the phoneme boundary sub-candidates and selecting one of these phoneme boundary candidates as the phoneme boundary by comparing the acoustic parameter values.

(Operation) FIG. 2 illustrates the method of extracting phoneme boundary candidate points. The first phoneme boundary candidate 21 is extracted on the basis of acoustic parameter values computed for each frame of a fixed length. Next, peaks of the input speech waveform are detected in the vicinity of the first phoneme boundary candidate 21, only those peaks that correspond to the periodic vibration of the vocal cords are extracted from them, and these become the phoneme boundary sub-candidates 22. As FIG. 2 shows, the actual phoneme boundary 23 can be expected to lie within the frames immediately before and after the first phoneme boundary candidate. The range in which phoneme boundary sub-candidates are selected can therefore be limited to those frames.
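The restricted search range can be sketched as follows. This is an illustration only: the function `search_range` is hypothetical, and it assumes that the first candidate sits on the border between frame `candidate_frame - 1` and frame `candidate_frame`, with all positions measured in samples.

```python
from typing import Tuple


def search_range(candidate_frame: int, frame_len: int, n_samples: int) -> Tuple[int, int]:
    """Return the [start, end) sample range covering the frames on either
    side of the first boundary candidate, clipped to the signal length.

    Assumption: the candidate lies at sample candidate_frame * frame_len.
    """
    start = max(0, (candidate_frame - 1) * frame_len)
    end = min(n_samples, (candidate_frame + 1) * frame_len)
    return start, end


if __name__ == "__main__":
    # Hypothetical values: 160-sample frames (10 ms at 16 kHz), 1 s of audio.
    print(search_range(5, 160, 16000))
```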

With this approach the boundary can be determined in synchrony with the waveform peaks, which makes accurate segmentation with high time resolution possible.

(Embodiment) FIG. 1 is a block diagram showing an embodiment of the present invention. The acoustic parameter extraction unit 12 extracts acoustic parameter values from the speech waveform input at the speech input terminal 11, one value for each short interval of a fixed frame length. The first boundary candidate extraction unit 13 extracts the first phoneme boundary candidate from these acoustic parameter values. One way to determine the phoneme boundary candidate from the acoustic parameters is, for example, to take the point at which the change in the acoustic parameter value is largest.
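A minimal sketch of units 12 and 13 might look like the following. The patent does not name the acoustic parameter, so mean squared amplitude per frame is used here purely as an assumption, and the function names are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete frame (assumed parameter)."""
    n_frames = len(samples) // frame_len
    return [
        sum(x * x for x in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        for i in range(n_frames)
    ]


def first_boundary_candidate(samples, frame_len):
    """Sample index of the frame border where the parameter changes most."""
    params = frame_energies(samples, frame_len)
    if len(params) < 2:
        raise ValueError("need at least two complete frames")
    # changes[i] is the change between frame i and frame i + 1.
    changes = [abs(params[i + 1] - params[i]) for i in range(len(params) - 1)]
    best = max(range(len(changes)), key=changes.__getitem__)
    return (best + 1) * frame_len


if __name__ == "__main__":
    # Toy signal: silence, then a constant level starting at sample 34.
    signal = [0.0] * 34 + [1.0] * 26
    print(first_boundary_candidate(signal, 10))  # prints 30
```

In this toy case the candidate lands on sample 30 although the level change is at sample 34, which shows why a finer second stage is needed.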

The peak extraction unit 14 extracts the peak points of the speech waveform within the frames before and after the first phoneme boundary candidate. Peak points can be extracted by computing the first-order difference signal of the input speech waveform and searching for the points at which its sign inverts. The phoneme boundary sub-candidate extraction unit 15 detects, among these peak points, the peaks that correspond to the periodic vibration of the vocal cords and selects them as phoneme boundary sub-candidates. One conceivable way to detect these peaks is to find the peak with the largest amplitude within a certain short interval. The length of this interval could be, for example, the period given by the reciprocal of the highest pitch frequency expected in the speech waveform. In addition, because a phoneme boundary used in waveform-editing speech synthesis and similar applications must be a zero-crossing point, the first zero-crossing point encountered when moving backward in time from the selected peak position is extracted as the phoneme boundary sub-candidate.
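The three steps of this paragraph can be sketched as below. Several choices are assumptions rather than details from the patent: only positive peaks (first difference going from positive to non-positive) are kept, the short intervals are non-overlapping blocks of `section_len` samples, and a zero crossing is taken as a change from non-positive to positive. All function names are hypothetical.

```python
def find_peaks(samples):
    """Indices of local maxima, found where the first difference changes
    from positive to non-positive."""
    diff = [samples[i + 1] - samples[i] for i in range(len(samples) - 1)]
    return [i for i in range(1, len(diff)) if diff[i - 1] > 0 and diff[i] <= 0]


def pitch_peaks(samples, peaks, section_len):
    """Keep the largest peak in each block of section_len samples.

    section_len would be about sample_rate / highest expected pitch (Hz).
    """
    best = {}
    for p in peaks:
        section = p // section_len
        if section not in best or samples[p] > samples[best[section]]:
            best[section] = p
    return [best[s] for s in sorted(best)]


def previous_zero_cross(samples, index):
    """Walk backward from index to the nearest upward zero crossing.

    Returns the index of the first positive sample after the crossing,
    or None if no crossing is found.
    """
    for i in range(index, 0, -1):
        if samples[i - 1] <= 0 < samples[i]:
            return i
    return None


if __name__ == "__main__":
    # Toy waveform with a period of 8 samples (hypothetical data).
    wave = [0, 3, 9, 4, -2, 2, -3, -1] * 2
    peaks = find_peaks(wave)
    glottal = pitch_peaks(wave, peaks, section_len=8)
    print([previous_zero_cross(wave, p) for p in glottal])  # prints [1, 9]
```

Fixed blocks are the simplest reading of "a certain short interval"; a sliding window would be an equally plausible interpretation.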

The phoneme boundary determination unit 17 selects the point best suited as the phoneme boundary on the basis of the acoustic parameter values that the acoustic parameter extraction unit 16 extracts from the vicinity of the first phoneme boundary candidate and of the phoneme boundary sub-candidates, and outputs this point as the phoneme boundary to the phoneme boundary output terminal 18. One way to determine the phoneme boundary from among the first phoneme boundary candidate and the phoneme boundary sub-candidates is, for example, to compute an acoustic parameter value in the interval to the left and in the interval to the right of each candidate on the time axis and to take as the phoneme boundary the candidate for which the ratio of the left and right values is largest.
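The ratio-based decision can be illustrated with the sketch below. The acoustic parameter (mean squared amplitude), the window length, the small constant `eps` that avoids division by zero, and the use of a direction-independent ratio are all assumptions; the patent only says that the candidate with the largest left/right ratio is chosen.

```python
def mean_energy(samples, start, end):
    """Mean squared amplitude over [start, end), clipped to the signal."""
    start = max(0, start)
    end = min(len(samples), end)
    if end <= start:
        return 0.0
    return sum(x * x for x in samples[start:end]) / (end - start)


def choose_boundary(samples, candidates, window, eps=1e-12):
    """Pick the candidate whose left and right windows differ most."""

    def contrast(c):
        left = mean_energy(samples, c - window, c) + eps
        right = mean_energy(samples, c, c + window) + eps
        # Use whichever ratio is >= 1 so rising and falling edges both count.
        return max(left / right, right / left)

    return max(candidates, key=contrast)


if __name__ == "__main__":
    # Toy signal: silence, then a constant level starting at sample 34.
    signal = [0.0] * 34 + [1.0] * 26
    # 30 stands in for the first candidate, 34 and 38 for sub-candidates.
    print(choose_boundary(signal, [30, 34, 38], window=10))  # prints 34
```

Here the frame-level candidate at 30 loses to the sub-candidate at 34, where the left window is entirely silent and the right window is entirely active.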

(Effects of the Invention) As described above, the present invention performs segmentation automatically and with high resolution on the time axis, so it can realize a device that accurately segments speech for tasks such as collecting speech data for waveform-editing speech synthesis.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the speech segmentation device according to the present invention, FIG. 2 is a diagram for explaining the method of determining phoneme boundary candidates, and FIG. 3 is a diagram for explaining the problems of the conventional technique. In the figures, 11 is a speech input terminal, 12 is an acoustic parameter extraction unit, 13 is a first boundary candidate extraction unit, 14 is a peak extraction unit, 15 is a phoneme boundary sub-candidate extraction unit, 16 is an acoustic parameter extraction unit, 17 is a phoneme boundary determination unit, and 18 is a phoneme boundary output terminal.

Claims

1. A speech segmentation device comprising: means for extracting a first phoneme boundary candidate from acoustic parameters extracted from input speech; means for detecting peaks of the waveform of the input speech in the vicinity of the first phoneme boundary candidate; means for selecting, from among the peaks, peaks corresponding to the periodic vibration of the vocal cords as phoneme boundary sub-candidates; and means for extracting acoustic parameters from the input speech in the vicinity of each of the first phoneme boundary candidate and the phoneme boundary sub-candidates and selecting one of the phoneme boundary candidates as the phoneme boundary by comparing the respective acoustic parameter values.