JPS6239754B2

Info

Publication number: JPS6239754B2
Application number: JP56115767A
Authority: JP
Inventors: Kazuo Kitagawa; Eiji Kira
Original assignee: Tokyo Shibaura Electric Co Ltd
Current assignee: Toshiba Corp
Priority date: 1981-07-23
Filing date: 1981-07-23
Publication date: 1987-08-25
Also published as: JPS5817497A

Description

[Detailed description of the invention]

The present invention relates to a device for detecting the pitch of a speech signal waveform.

Pitch detection in voiced speech is one of the processing steps required when speech signal waveforms are recorded, reproduced, or transmitted at low information density. In speech analysis and synthesis systems based on LPC (linear prediction), for example, pitch is generally obtained as follows: a whitening filter of about 10 stages extracts the spectral envelope and the other required speech parameters, the remaining residual signal is passed through a low-pass filter, its autocorrelation is computed, and moving-average or similar processing is applied to yield the pitch information. The amount of computation is enormous, and the hardware is correspondingly large. In LPC systems, pitch detection errors made during analysis are also one of the factors that degrade the quality of the reproduced speech, and because these errors are corrected by hand, the cost of analysis is very high.

In contrast, systems that record, reproduce, or transmit speech signal waveforms by delta modulation (including adaptive delta modulation; the same applies hereinafter) have an information density of 16 to 32 kb/s, which is not particularly low, but they have the advantage over LPC of simple hardware and real-time operation. For such delta-modulation recording, reproduction, or transmission systems, the inventors have been studying how to reduce the information density by recording or transmitting, in real time, one representative wave for, say, every four waves of the repetitive waveform in voiced portions of the speech signal. The operation that detects whether a voiced portion repeats is pitch detection. Using the same method as in LPC for this purpose would be very wasteful in terms of circuitry, so pitch detection with a simple configuration is required.

The present invention was made in view of the above, and its object is to provide a speech pitch detection device that can accurately detect the pitch of a speech signal waveform with a simple configuration.

In the present invention, the average slope of the speech signal waveform per predetermined time is detected successively, and a pulse corresponding to the period in which that slope exceeds a certain value is obtained. The leading edge of this pulse is then detected to produce a relatively narrow pulse, and of these narrow pulses only those that occur after a certain time or more has elapsed since the previous pulse are extracted, giving a pulse train whose spacing corresponds to the pitch of the speech signal waveform.

According to the present invention, speech pitch can be detected with a configuration far simpler than that used for pitch detection in LPC, and detection errors can also be made very few.

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 shows an embodiment of the present invention. In the figure, the speech signal applied to input terminal 1 passes through input filter 2 to delta-modulation encoder 3, where it becomes encoded data synchronized with the clock signal from clock oscillator 4. This encoded data is input to speech pitch detection section 5, which is configured as follows.
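As a rough illustration of the kind of bit stream that encoder 3 delivers, the following is a minimal sketch of a linear (non-adaptive) delta modulator. The function name `delta_modulate`, the fixed step size, and the zero initial estimate are assumptions made for illustration; the patent does not specify the encoder's internals.

```python
def delta_modulate(samples, step=1.0):
    """Linear delta modulation (illustrative sketch, not the patent's circuit).

    Emits 1 when the input is above the running estimate, else 0, then moves
    the estimate one step in that direction.
    """
    estimate = 0.0  # assumed initial state
    bits = []
    for x in samples:
        bit = 1 if x > estimate else 0
        estimate += step if bit else -step
        bits.append(bit)
    return bits
```

A steadily rising input produces a run of 1s and a steadily falling input a run of 0s, which is why the balance of 1s and 0s over a short window reflects the slope of the waveform.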

The output data of encoder 3 is shifted serially into shift register 11 in pitch detection section 5, one bit at a time, in step with the clock signal from clock oscillator 4. The contents of the stages of shift register 11 are output in parallel and summed by adder 13, each through ladder resistor 12.

FIG. 2 shows a typical waveform of a voiced portion of a speech signal, where T₀ is the pitch period and T₁ is the period of the first formant. Let N be the number of stages of shift register 11 and τ the period of the clock signal. The time T = τ·N is then set to a suitable value no greater than half the first-formant period T₁, as shown in FIG. 2. The optimum value of T depends on factors such as the speaker's sex, but lies in the range of about 0.5 to 2 msec. If the values of ladder resistors 12 are approximately equal, that is, if the outputs of the stages of shift register 11 are summed with approximately equal weights, then summing them in adder 13 amounts to detecting the average slope of the input speech signal waveform per unit time T.

For example, if the speech signal waveform at input terminal 1 is as shown in FIG. 3a, the data encoded by encoder 3 with the clock signal of b is as shown in c, and the output waveform of adder 13 at that time is as shown in d. Each time one data bit enters shift register 11, the bit that entered N bits earlier is shifted out, and adder 13 produces an output that depends on the difference between the number of "1"s and the number of "0"s among the N bits in shift register 11. Consequently, when a "1" enters shift register 11, the output of adder 13 is unchanged if a "1" is shifted out and rises by one step if a "0" is shifted out. Conversely, when a "0" enters shift register 11, the output is either unchanged or falls by one step.
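The incremental behaviour of the shift register and adder can be sketched as a sliding-window count. The function name `ones_in_window`, the all-zero initial register contents, and the choice to report the number of 1s (rather than ones minus zeros, which equals 2 × ones − N) are assumptions made for illustration.

```python
from collections import deque

def ones_in_window(bits, n_stages):
    """Count of 1s among the last n_stages bits, updated one bit at a time.

    Mirrors the description above: the count is unchanged when the bit that
    enters equals the bit that leaves, and moves by one step otherwise.
    The register is assumed to start filled with zeros.
    """
    register = deque([0] * n_stages, maxlen=n_stages)
    ones = 0
    out = []
    for b in bits:
        oldest = register[0]   # bit about to be shifted out
        register.append(b)     # a full deque with maxlen drops the oldest bit
        ones += b - oldest     # incremental update
        out.append(ones)
    return out
```

Keeping a running count instead of re-summing the window each tick is the software counterpart of the adder output moving by at most one step per clock.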

The output of adder 13 is input to comparator 14 and compared with the constant level indicated by Vth in FIG. 3d. As a result, comparator 14 outputs, as shown in FIG. 3e, a pulse corresponding to the period in which the output of adder 13 exceeds Vth, in other words the period in which the average slope of the input speech signal waveform per unit time T exceeds a certain value.
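In software, the comparator stage reduces to a threshold test on each window count. The sketch below assumes the input is a list of per-tick counts of 1s and that the output goes high at or above the threshold, following the "8 or more of 12 bits" setting given later in the text; the name `comparator` is illustrative.

```python
def comparator(ones_counts, threshold):
    """Return a 0/1 signal that is 1 while the count is at or above threshold."""
    return [1 if c >= threshold else 0 for c in ones_counts]
```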

The output of comparator 14 is applied to first monostable multivibrator 15. Multivibrator 15 is triggered by the rising (leading) edge of the output pulse of comparator 14 shown in FIG. 3e and generates a narrow pulse as shown in FIG. 3f. The output of first monostable multivibrator 15 is applied directly to one input of AND gate 17 and also to second monostable multivibrator 16. Second monostable multivibrator 16 has a time constant much longer than that of first monostable multivibrator 15 (for example, 2.5 to 10 msec). It is triggered by the falling (trailing) edge of the output pulse of first monostable multivibrator 15 shown in FIG. 3f and generates, as an inverted output, a pulse much wider than the pulse of f, as shown in FIG. 3g. The output of second monostable multivibrator 16 is applied to the other input of AND gate 17. Consequently, as shown in FIG. 3g, of the pulses P₁ and P₂ of f, only the pulse corresponding to P₁ appears at the output of AND gate 17.

The excitation source of voiced speech is impulse-like, and the impulse is radiated through the acoustic tube formed by the throat, palate, and lips. The waveform therefore decays from the excitation point, as shown in FIG. 2. Seen at the output of first monostable multivibrator 15, however, a pitch pulse may also be detected erroneously at the second peak of the voiced waveform, as shown in FIG. 3. This is called an addition error. In the present invention, therefore, those output pulses of first monostable multivibrator 15 that fall within a certain time of the previous pulse, a time equal to the width of the output pulse of second monostable multivibrator 16, are blocked by AND gate 17, and only pulses occurring after this time has elapsed are output. As a result, a pulse train whose spacing corresponds to the pitch period of the input speech signal waveform is obtained at the output of AND gate 17.
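The two monostable multivibrators and the AND gate amount to leading-edge detection followed by a hold-off interval. The sketch below works on tick indices rather than analog pulses. The function names are illustrative, and it assumes that a blocked pulse does not restart the hold-off interval (that is, that the second monostable multivibrator is non-retriggerable), which the text does not state explicitly.

```python
def leading_edges(level):
    """Tick indices where a 0/1 signal rises from 0 to 1."""
    prev = 0
    edges = []
    for i, v in enumerate(level):
        if v and not prev:
            edges.append(i)
        prev = v
    return edges

def holdoff_filter(edges, holdoff):
    """Keep an edge only if at least `holdoff` ticks have passed since the
    last kept edge (assumed non-retriggerable hold-off)."""
    kept = []
    for t in edges:
        if not kept or t - kept[-1] >= holdoff:
            kept.append(t)
    return kept
```

With a 16 kHz clock, a hold-off of 40 ticks would correspond to 2.5 msec, the lower end of the time-constant range quoted above.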

With the above configuration, the pitch of a speech signal waveform can be detected well. As one example, the clock frequency 1/τ was set to 16 kHz and the number of stages N of shift register 11 to 12, so that T = 0.75 msec, and the threshold level Vth of comparator 14 was set so that the comparator output inverts when 8 or more of the 12 bits in shift register 11 are "1". A speech signal of about one second, a woman saying "OYASUMI NO JIKAN", was applied to input terminal 1 and the pitch was actually detected. Excluding the weak portion at the end of the phrase, the signal contained 244 actual pitch periods (minimum 3 msec, maximum 4.2 msec, average 3.3 msec), and only three omit errors occurred, giving a correct detection rate as high as 98.75%. Moreover, the omit errors occurred at low-level transitions between phonemes, where treating the signal as silence makes no difference to the sound quality.
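The figures in this example can be checked with a few lines of arithmetic. The variable names are illustrative; the values are those quoted in the text.

```python
clock_hz = 16_000   # clock frequency 1/tau
n_stages = 12       # number of shift register stages N

# Window length T = N * tau, in milliseconds.
window_ms = n_stages / clock_hz * 1000

# Threshold: 8 or more of the 12 bits are "1".
ones_threshold = 8
ones_minus_zeros = ones_threshold - (n_stages - ones_threshold)

# Detection rate from 244 pitch periods with 3 omit errors.
total_pitches = 244
omit_errors = 3
detected_pct = (total_pitches - omit_errors) / total_pitches * 100

print(window_ms, ones_minus_zeros, round(detected_pct, 2))
```

The window comes out at 0.75 msec as stated, the threshold corresponds to four more 1s than 0s in the register, and 241 of 244 is roughly 98.8%, in line with the rate quoted in the text.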

As described above, the present invention can detect speech pitch accurately with a low error rate. Compared with pitch detection in conventional LPC, which works on the residual signal left after a multistage whitening filter has extracted the other speech parameters, the hardware is far simpler, which makes the invention very well suited to delta-modulation speech recording and reproduction systems and the like.

In addition, because the embodiment of FIG. 1 detects the pitch of the speech signal waveform from the output data of encoder 3, the whole circuit can be built from digital circuits and is suited to integration as an IC. The processing in the pitch detection section can also be carried out by a microcomputer.

Next, an application of the present invention is described with reference to FIGS. 4 and 5. FIG. 4 shows the invention applied to an automatic waveform compression device in a speech recording apparatus; parts common to FIG. 1 bear the same reference numerals. FIG. 5 shows the waveforms at various points in FIG. 4. Pitch detection section 5 delivers a pulse train whose spacing corresponds to the pitch period of the speech signal waveform, as shown in FIG. 5a, and this pulse train is input to N-ary counter 41 and interval counter 42. With N = 4, for example, N-ary counter 41 is a two-stage binary counter. Each time four pulses have arrived from pitch detection section 5, it causes AND gate 43 to generate the gate signal of FIG. 5b for one pitch period only, until the next pulse arrives. AND gate 44 takes from encoder 3 the one pitch period of data that corresponds to the interval designated by the gate signal from AND gate 43, as shown in FIG. 5c, and supplies it to memory system 45. For the waveform of voiced portions, therefore, only one pitch period out of every N consecutive pitch periods is stored in memory system 45. That is, the voiced waveform is compressed to 1/N before being stored.
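The one-in-N selection performed by the counter and gates can be sketched as follows. The function name `keep_one_pitch_in_n` and the representation of pitch pulses as a sorted list of tick indices are assumptions made for illustration, and the sketch covers only the voiced-portion path.

```python
def keep_one_pitch_in_n(bits, pitch_marks, n=4):
    """Keep the encoder bits of one pitch period out of every n.

    pitch_marks are tick indices of detected pitch pulses, in increasing
    order. After every n-th pulse, the bits up to the next pulse are kept.
    """
    kept = []
    for k in range(len(pitch_marks) - 1):
        if k % n == n - 1:  # the period that starts at every n-th pulse
            start, end = pitch_marks[k], pitch_marks[k + 1]
            kept.extend(bits[start:end])
    return kept
```

For example, with pulses at ticks 0, 5, 10, 15, 20 and n = 2, the periods starting at ticks 5 and 15 are kept and the others are dropped.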

On the other hand, for portions in which pitch detection section 5 cannot detect a pitch, such as unvoiced portions and word endings, the output data of encoder 3 is passed through AND gate 49 and stored in memory system 45 as is. This is achieved as follows. Level detection section 46 detects the level of the speech signal to produce the voiced/unvoiced discrimination signal shown in FIG. 5d. Interval counter 42 measures the spacing of the output pulses of pitch detection section 5, and window comparator 47 checks whether that spacing lies within a predetermined range, producing the omit signal shown in FIG. 5e. AND gate 48 detects the coincidence of d and e, shown in FIG. 5f, and this signal f controls AND gate 49.
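The interval check performed by the interval counter and window comparator can be sketched as below. The function name and the window limits are hypothetical, since the text only says "a predetermined range", and the sketch leaves out the level detector and the polarity of the omit signal, which the description does not spell out.

```python
def intervals_in_window(pitch_marks, lo, hi):
    """For each pair of consecutive pitch pulses, report whether their
    spacing in ticks lies inside [lo, hi]."""
    return [lo <= b - a <= hi for a, b in zip(pitch_marks, pitch_marks[1:])]
```

An interval outside the window suggests a missed or spurious pitch pulse, which is the condition the omit signal is meant to flag.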

With this configuration, the same speech signal waveform as before, a woman saying "OYASUMI NO JIKAN", was stored in memory system 45. The portion for which the output data of encoder 3 was stored as is, without compression, was about 3% of the whole, and the average information density fell to about 4 kb/s. In this case the omitted portions tend to be confined to word beginnings and transitions between phonemes, and do not occur where the level of the voiced portion is high and steady, so the omit-error information also appears useful for phoneme segmentation.
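A back-of-the-envelope estimate reproduces the order of magnitude of this result. It assumes the 16 kHz, one-bit-per-tick encoder of the earlier example (16 kb/s), N = 4, and that everything other than the 3% stored uncompressed is voiced speech compressed to 1/N; none of these assumptions is stated for this particular measurement.

```python
raw_rate_kbps = 16.0   # assumed encoder rate (16 kHz clock, 1 bit per tick)
raw_fraction = 0.03    # portion stored without compression
n = 4                  # assumed compression factor for voiced portions

average_kbps = (raw_fraction * raw_rate_kbps
                + (1 - raw_fraction) * raw_rate_kbps / n)
print(round(average_kbps, 2))
```

Under these assumptions the estimate is a little over 4 kb/s, consistent with the "about 4 kb/s" reported above.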

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a speech pitch detection device according to an embodiment of the present invention; FIG. 2 is a diagram showing a typical waveform of a voiced portion of a speech signal; FIG. 3 is a waveform diagram of various points, for explaining the operation of FIG. 1; FIG. 4 is a diagram showing an application of the present invention; and FIG. 5 is a waveform diagram of various points, for explaining the operation of FIG. 4.

1: speech signal input terminal; 2: input filter; 3: encoder; 4: clock oscillator; 5: pitch detection section; 11: shift register; 12: ladder resistor; 13: adder; 14: comparator; 15, 16: monostable multivibrators; 17: AND gate; 41: N-ary counter; 42: interval counter; 45: memory system; 46: level detection section; 47: window comparator.

Claims

[Claims]

1. A speech pitch detection device characterized by comprising: first means, consisting of a shift register that stores a fixed time length of the data string obtained by encoding a speech signal waveform by delta modulation and an adder that adds the outputs of the stages of the shift register, for successively detecting the average slope of the speech signal waveform per predetermined time; second means, consisting of a comparator, for detecting the period in which the average slope detected by the first means exceeds a certain value and generating a pulse corresponding to that period; third means for detecting the leading edge of the output pulse of the second means and generating a pulse; and fourth means for extracting, from the output pulses of the third means, only those pulses that occur after a certain time has elapsed since the previous pulse, thereby obtaining a pulse train whose spacing corresponds to the pitch period of the speech signal waveform.