JPH02717B2

Info

Publication number: JPH02717B2
Application number: JP13964283A
Authority: JP
Inventors: Hiroyoshi Yuasa
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1983-07-30
Filing date: 1983-07-30
Publication date: 1990-01-09
Also published as: JPS6031195A

Description

[Detailed description of the invention]

[Technical field]

The present invention relates to a voice message identification method for operating electronic equipment by voice messages.

[Background art]

In conventional methods, a spectral pattern obtained by frequency analysis of speech extracted with band-pass filters is transformed by a transformation matrix or the like, a characteristic pattern of the speech is extracted as symbols or parameters, and the speech is identified from that pattern. Within this class of methods, and with the aim of simplification and cost reduction, the inventors studied a method that identifies voice messages by forming micro-level feature patterns corresponding to the difference between voiced and unvoiced sounds and to the formant trajectories.

FIG. 1 is a block diagram showing the configuration of an earlier application related to the present invention (Japanese Patent Application No. Sho 58-67261). The voice message identification device in the figure is broadly divided into an acoustic analysis section A, a pattern conversion section B, and a word identification section C.

In the acoustic analysis section A, a preconditioning amplifier 2 amplifies the speech detected by a microphone 1, emphasizes its high-frequency components (equivalent circuit), and removes the DC component. The output of the preconditioning amplifier 2 is expanded into frequency spectrum components by a frequency analysis section 3, and the frequency spectrum pattern is converted to a logarithmic scale by a logarithmic conversion section 4. FIG. 2 shows an embodiment of the acoustic analysis section A of the conventional example. In FIG. 2, the frequency analysis of the speech is performed by six filter banks (V, UV, VL, VH, VB, VF). Filter bank V extracts the band in which the power of voiced sounds V is concentrated (approximately 0 to 1 kHz); UV, the band in which the power of unvoiced sounds UV is concentrated (approximately 5 to 10 kHz); VL, the band in which the first formant of narrow-jaw sounds VL is concentrated (approximately 0 to 0.4 kHz); VH, the band in which the first formant of wide-jaw sounds VH is concentrated (approximately 0.4 to 0.8 kHz); VB, the band in which the second formant of back-tongue sounds VB is concentrated (approximately 0.8 to 1.8 kHz); and VF, the band in which the second formant of front-tongue sounds VF is concentrated (approximately 1.8 to 3.2 kHz). FIGS. 3a and 3b show the frequency characteristics of the filter banks VL, VH, VB, and VF; FIG. 3a is drawn with a linear frequency axis and FIG. 3b with a logarithmic frequency axis. In FIG. 3, AP denotes the characteristic of the adjustment amplifier 2b described below.

As shown in the block diagram of FIG. 2, the preconditioning amplifier 2 has a preamplifier 2a that amplifies the speech signal detected by the microphone 1, an adjustment amplifier 2b that is connected to the output of the preamplifier 2a and adjusts the gain and offset, and a level adjuster 2c. The level adjuster 2c balances the power of the signal supplied to filter banks V and UV against the power of the signal supplied to the other filter banks. Next, the V/UV balance adjuster 3a balances the input of filter bank V against the input of filter bank UV. For the other filter banks, the VB/VL balance adjuster 3b is first set to its midpoint, the VH/VL balance adjuster 3c balances the inputs of filter banks VH and VL, and the VF/VB balance adjuster 3d balances filter banks VF and VB. The VB/VL balance adjuster 3b then balances filter bank VB against filter bank VH or VL.

The outputs of the filter banks are switched sequentially in time division by a multiplexer 3e and input to the logarithmic conversion section 4, which converts the input power to a logarithmic scale. The output of the logarithmic conversion section 4 is input to an A/D converter 4a and digitized into an 8-bit binary number. When the filters are implemented as digital filters, the A/D converter 4a is placed in the stage following the adjustment amplifier 2b.

The output of the A/D converter 4a is input to the difference signal vector converter 5 of the pattern conversion section B, which extracts a difference signal vector consisting of the differences between the filter bank outputs. FIG. 4 shows a configuration example of the difference signal vector converter 5. As shown in FIG. 4a, each component of the difference signal vector is the averaged (integrated) difference of two filter outputs. That is, when the filter bank outputs are digitized by the A/D converter 4a at a sampling period of 10 msec, the difference signal of the previous sample, multiplied by a coefficient α in the coefficient unit 5a, is added to the difference signal of the current sample, and the sum stored in register 5b becomes the difference signal vector component of the current sample. The coefficient α is approximately 0.6 to 0.8. FIG. 5 shows an embodiment of the pattern conversion section B of the earlier application, with the difference signal vector converter 5 drawn in the simplified form of FIG. 4b.

As shown in FIG. 5, the difference signal vector converter 5 extracts a difference signal vector with five components in total: the UV/V, V_eao/V_iu, V_a/V_eo, V_e/V_o, and V_i/V_u difference signals. The UV/V difference signal is used to distinguish voiced from unvoiced sound and is the difference between the outputs of filter banks UV and V. The V_eao/V_iu difference signal is used to distinguish the vowels e, a, o from i, u and is the difference between the outputs of filter banks VH and VL. The V_a/V_eo difference signal is used to distinguish the vowel a from e, o and is the difference between the outputs of filter banks VB and VL. The V_e/V_o difference signal is used to distinguish the vowels e and o and is the difference between the outputs of filter banks VF and VB. The V_i/V_u difference signal is used to distinguish the vowels i and u and is the difference between the outputs of filter banks VF and VH. Of these difference signals, the V_eao/V_iu and V_a/V_eo difference signals extract features of the first formant of the vowels.
The V_e/V_o and V_i/V_u difference signals extract features of the second formant of the vowels.

The UV/V difference signal produced as described above is input to the V, UV, S determination section 18 and used to determine the intervals of voiced sound V, unvoiced sound UV, and silence S. The V, UV, S determination section 18 compares the UV/V difference signal with predetermined reference values Rv and Ru (Rv < 0 < Ru). If the UV/V difference signal is smaller than Rv, the input is judged to be a voiced sound V; if it is larger than Ru, an unvoiced sound UV; and if it lies between Rv and Ru, silence S. The start/end detection section 6 judges an interval to be a speech interval when the input is judged to be voiced V or unvoiced UV, and a silent interval when the input is judged to be silence S.

Next, the difference signal vector of the outputs of filter banks VF, VB, VH, and VL is input to the symbol vector conversion section 7. The symbol vector conversion section 7 multiplies the four-dimensional difference signal vector, whose components are the V_eao/V_iu, V_a/V_eo, V_e/V_o, and V_i/V_u difference signals, by a transformation matrix [Tm]. This yields the short-time average powers V_i, V_e, V_a, V_o, V_u of the vowels i, e, a, o, u contained in the speech input, together with the short-time average powers V_h, V_l, V_f, V_b, V_w of wide-jaw voiced sounds, narrow-jaw voiced sounds, front-tongue voiced sounds, back-tongue voiced sounds, and voiced sounds intermediate between the vowels a and o. An example of the components of the transformation matrix [Tm] in the symbol vector conversion section 7 is given by the following equation.

The output of the symbol vector conversion section 7 is input to the maximum value determination section 8a, which determines which of the components V_i, V_e, V_a, V_o, V_u, V_h, V_l, V_f, V_b, V_w is largest. When the V, UV, S determination section 18 judges the speech input to be silence or unvoiced sound, the symbol output section 8b outputs the symbol S or UV, respectively. When the V, UV, S determination section 18 judges the speech input to be voiced, the symbol output section 8b outputs the symbol of the voiced sound judged largest by the maximum value determination section 8a. If, however, the largest of the components V_i, V_e, V_a, V_o, V_u, V_f, V_b, V_h, V_l, V_w does not reach a predetermined reference value, the symbol output section 8b outputs the symbol V_n, denoting a voiced sound that corresponds to none of the above. The symbol output section 8b therefore outputs one of 13 symbols: S, UV, V_i, V_e, V_a, V_o, V_u, V_h, V_l, V_f, V_b, V_w, V_n.

The time series of symbols output from the symbol output section 8b is input to the shaping section 9a. The shaping section 9a converts each run of the same symbol into a list entry consisting of one symbol and its duration. For an entry whose duration is shorter than a set value, if the preceding and following symbols are the same, the entries are merged into one; if the preceding and following symbols differ, the short entry is absorbed into the preceding symbol. Entries of short duration are thus eliminated.

The output of the shaping section 9a is input to the time-axis linear normalization section 9b, which normalizes the durations so that the durations in the list sum to a fixed value such as 200 (or 1000). This is done by multiplying each duration by the ratio of the total sample count, 200 (or 1000), to the total duration.

The above process produces the speech pattern for the input voice message.

In registration mode, this speech pattern is registered in the standard pattern storage section 10. In recognition mode, the correlation calculation section 12 matches it against the standard patterns, but the preliminary selection section 11 first performs a primary identification to narrow down the candidates. The preliminary selection section 11 limits the candidates by the number of unvoiced sounds UV, the number of voiced sounds V, the number of pauses P indicating somewhat long silent intervals (for example, before a plosive), the number of symbols, the total duration, and so on. The correlation calculation section 12 uses a correlation table 13, such as that shown in Table 1, and dynamically matches the input pattern against the standard pattern by the DP matching method so that the correlation is maximized (the distance is minimized).
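
The shaping, time-axis normalization, and DP matching steps described in this section can be sketched as follows. This is a minimal illustration rather than the circuit of the earlier application: the function names, the `min_len` threshold, and the `similarity` callable (a stand-in for correlation table 13, whose values are not given in the text) are all assumptions.

```python
from itertools import groupby


def shape(symbols, min_len=2):
    """Collapse repeated symbols into (symbol, duration) runs and remove short runs.

    `min_len` is a hypothetical threshold; the text only says "a set value".
    A short run at the very start has no predecessor and is kept as is.
    """
    runs = []
    for sym, group in groupby(symbols):
        dur = len(list(group))
        if runs and dur < min_len:
            runs[-1][1] += dur      # short run: absorb into the preceding symbol
        elif runs and runs[-1][0] == sym:
            runs[-1][1] += dur      # same symbol on both sides of a short run: merge
        else:
            runs.append([sym, dur])
    return [(sym, dur) for sym, dur in runs]


def normalize(runs, total=200):
    """Scale the durations linearly so that they sum to `total`."""
    length = sum(dur for _, dur in runs)
    return [(sym, dur * total / length) for sym, dur in runs]


def dp_match(pattern, reference, similarity):
    """Best cumulative similarity over monotonic alignments of two symbol sequences."""
    n, m = len(pattern), len(reference)
    neg = float("-inf")
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(score[i - 1][j - 1], score[i - 1][j], score[i][j - 1])
            score[i][j] = best + similarity(pattern[i - 1], reference[j - 1])
    return score[n][m]
```

In recognition mode, a score of this kind would be computed against each standard pattern that survives preliminary selection, and the pattern with the largest score taken as the result. An unnormalized cumulative score favours longer alignment paths, which is one reason the durations are normalized first.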

[Purpose of the invention]

The present invention has been made in view of the above points. Its object is to provide a voice message identification method that reduces the differences in the speech pattern of the same word caused by differences in how it is uttered, that performs symbolization more reliably even when the speech power is low, and that reliably symbolizes vowels following consonants and unvoiced sounds.

[Disclosure of the invention]

FIG. 8 is a block diagram of an embodiment of the present invention. In the figure, the acoustic analysis section A and the word identification section C have the same configuration as in the conventional example shown in FIG. 1; the configuration of the pattern conversion section B differs. In FIG. 8, 16 is a differential vector converter, which calculates the differential signal of the difference signal components output from the difference signal vector converter 5. 17 is a difference signal emphasizing section, which adds the differential signal calculated by the differential vector converter 16 to the original difference signal to emphasize the rising and falling edges of the difference signal. 18 is an offset calculation section, which calculates an offset value according to the sign and the absolute value of the differential signal output from the differential vector converter 16. 19 is an offset compensation section, which adds the offset value calculated by the offset calculation section 18 to the original difference signal. The offset compensation section 19 and the offset calculation section 18 together constitute the offset processing section D.

The circuit diagram of FIG. 9 shows the configuration of a concrete circuit that realizes the functions of the difference signal vector converter 5, the differential vector converter 16, and the difference signal emphasizing section 17. In this circuit, the current difference signal is stored in register t_o and the preceding difference signal in register t_(o-1). The differential average of registers t_o and t_(o-1) (the short-time average power of the variation of the difference) is formed as the differential vector component, and the short-time average power of the sum of this differential vector component and the contents of register t_(o-1) or register t_o is taken as the emphasized difference signal vector component.

The output of the differential vector converter 16 is input to the offset calculation section 18, which calculates the offset value. As described above, this offset value is added to the original difference signal in the offset compensation section 19. Equivalently, offset compensation can be performed by changing the matrix calculation in the symbol vector conversion section 7 as in the following equation, where Of₁, Of₂, Of₃, Of₄ denote the offset values.

FIGS. 10 and 11 show the differential waveforms of the difference signal waveforms of FIGS. 6 and 7, respectively, in the form before the differential waveform is integrated into a short-time average power. FIGS. 12 and 13 show the waveforms of the emphasized difference signal vector output by the difference signal emphasizing section 17, together with the symbolization patterns, for the case without the offset compensation section 19 of FIG. 8.

Comparing the waveforms of FIGS. 6 and 7 with those of FIGS. 12 and 13 shows that, in the emphasized difference signal vector of FIGS. 12 and 13, the rising and falling edges are emphasized and overshoot, and the peaks and troughs of the waveform are more distinct.

Japanese syllables are mostly CV syllables, a consonant followed by a vowel, and are perceived as syllables as the mouth moves from a closed state to an open state. It therefore appears particularly effective to emphasize and weight the edges at the rise of the illustrated difference signals, and the differential vector component provides this weighting.

FIGS. 14 and 15 and FIGS. 16 and 17 show the emphasized difference signal waveforms and the symbolization patterns obtained when two different male subjects uttered the voice message "dousa kaisi" ("start operation"). In this measurement example, the time constant for short-time averaging of the emphasized difference signal vector is set longer than in FIG. 13. V1, V2, V3, V4, and so on in FIGS. 15 and 17 denote voiced intervals; in interval V1, for example, the voiced symbolization pattern V_l, V_n, V_o, V_a, V_o, V_n was obtained. The symbol F indicates that a fricative (friction sound) was obtained among the unvoiced sounds UV. In this measurement example, the "s" of "dousa" and the "s" of "kaisi" are each detected as the voiceless fricative F.
The symbol P denotes the pause preceding the voiceless plosive "k". The symbolization patterns of FIGS. 15 and 17 show that the vowel sequence o, (u), a, a, i, i of "dousa kaisi" is detected regardless of the speaker.

FIGS. 18 to 21 show the emphasized difference signal waveforms and the symbolization patterns obtained when two different female subjects and two different male subjects uttered the five Japanese vowels "i-e-a-o-u". The figures show that the five vowels are symbolized uniformly regardless of the speaker.

As described above, emphasizing the difference signal with the differential waveform component emphasizes the rising and falling edges of the difference signal, so that vowels following consonants can be symbolized well regardless of the speaker.

The offset processing section D of FIG. 8 performs dynamic offset compensation. When the differential vector component is positive, the offset calculation section 18 causes the offset compensation section 19 to add a positive offset to the emphasized difference signal vector (or to the difference signal vector), forming an offset-compensated difference signal vector; the effect is as if the zero point of the difference signal had moved to the negative side by the amount of the offset. When the differential vector component is negative, a negative offset is added to the emphasized difference signal vector, which returns it to the original offset-free state. When detecting the change from consonant to vowel in a CV syllable, that is, the change from a narrow jaw opening to a wide one, it is effective to emphasize the rise of the difference signal vector components corresponding to the first formant of the vowel (V_eao/V_iu, V_a/V_eo). For the components corresponding to the second formant of the vowel (V_e/V_o, V_i/V_u), on the other hand, the change to be emphasized in a CV syllable is the one from the front-tongue state to the back-tongue state. In that case, when the differential vector component is negative, the offset compensation section 19 adds a negative offset to the emphasized difference signal vector (or to the difference signal vector), and when the differential vector component is positive, a positive offset is added to restore the original state. This makes the symbolization of sounds corresponding to the front and back tongue positions reliable. In other words, the offset compensation section 19 applies the offset values Of₁ to Of₄ of the equation to the emphasized difference signal vector (or to the difference signal vector).
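
The chain formed by the difference signal vector converter 5, the differential vector converter 16, the difference signal emphasizing section 17, and the offset processing section D can be sketched as follows. This is a simplified illustration under several assumptions: the recursion feeds back the stored register value, the differential is a plain first difference without the short-time averaging of FIG. 9, and the offset has a fixed magnitude rather than one derived from the absolute value of the differential. All function and parameter names are hypothetical.

```python
def smooth_difference(upper, lower, alpha=0.7):
    """Recursively averaged difference of two filter-bank outputs (one value per 10 ms)."""
    out, acc = [], 0.0
    for hi, lo in zip(upper, lower):
        acc = alpha * acc + (hi - lo)   # previous value weighted by alpha, plus current difference
        out.append(acc)
    return out


def derivative(signal):
    """First difference between the current and the preceding sample."""
    out = []
    prev = signal[0] if signal else 0.0
    for cur in signal:
        out.append(cur - prev)
        prev = cur
    return out


def emphasize(signal):
    """Add the differential to the signal so that rising and falling edges overshoot."""
    return [x + d for x, d in zip(signal, derivative(signal))]


def compensate(signal, deriv, offset, rising=True):
    """Dynamic offset: switched on by one edge direction, removed by the other.

    rising=True  models the first-formant signals (positive offset on a rise).
    rising=False models the second-formant signals (negative offset on a fall).
    """
    out, active = [], 0.0
    for x, d in zip(signal, deriv):
        if rising:
            if d > 0:
                active = offset
            elif d < 0:
                active = 0.0
        else:
            if d < 0:
                active = -offset
            elif d > 0:
                active = 0.0
        out.append(x + active)
    return out
```

For a step such as `[0, 0, 1, 1, 0]`, `emphasize` produces an overshoot at the rise and an undershoot at the fall, which corresponds to the edge emphasis described for FIGS. 12 and 13.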
In this case, because the offset-compensated difference signal vector separates correctly into positive and negative, or High and Low, with respect to the speech symbols, the symbolization can be carried out by sequential discrimination (branch and bound) as shown in FIG. 22. This is simpler and faster than performing a linear transformation such as that of the equation, and can be applied to simpler voice input devices. The sequential discrimination of FIG. 22 realizes the functions of the symbol vector conversion section 7 and the maximum value determination section 8a in a simple way, and can be implemented as a sequential discrimination program on a microcomputer. In the flowchart of the figure, the first stage divides the input into three groups according to whether the V_eao/V_iu difference signal is at the high level H, the medium level M, or the low level L. When the first stage gives H, the second stage examines the V_a/V_eo difference signal: if it is H, the symbol a is output; if M, the symbol h is output; if L, processing moves to a third stage, which examines the V_e/V_o difference signal and outputs e for H, w for M, and o for L. When the first stage gives M, the second stage examines the V_e/V_o difference signal and outputs f for H, m for M, and b for L. When the first stage gives L, the second stage examines the V_i/V_u difference signal and outputs i for H, l for M, and u for L.

[Effects of the invention]

As described above, the present invention concerns a voice message identification method having the following elements. Comparison means receives the difference signal output of a pair of filters that extract the short-time average powers of the high-frequency and low-frequency components of the speech input, and discriminates voiced sound, unvoiced sound, and silence. Voiced sound analysis means assigns one of the codes for the five Japanese vowels and other voiced sounds according to the magnitude relationships among the difference signal outputs of a plurality of filter pairs that extract the short-time average powers of different frequency regions of the speech input. The voiced-sound code in the output of the comparison means is replaced with the code output from the voiced sound analysis means, forming an input pattern that consists of a time series of codes for silence, unvoiced sound, the five vowels, and other voiced sounds. The input patterns formed when a plurality of voice messages are uttered in a standard manner are registered in advance as standard patterns, and the standard pattern closest to an input pattern is identified as the input message. In such a method, the invention provides a differential vector converter that calculates the differential signal of the difference signal output of each filter pair, and a difference signal emphasizing section that adds the differential signal to the original difference signal to emphasize the rising and falling edges of the difference signal. The rise and fall of the difference signal can therefore be emphasized, which makes symbolization reliable.
For a syllable consisting of a combination of a consonant and a vowel in particular, as described in the explanation of the embodiment, a positive weight can be added to the difference signal corresponding to the first formant at its rise, when the articulation changes from the narrow-jaw state to the wide-jaw state, and a negative weight can be added to the difference signal corresponding to the second formant at its fall, when the articulation changes from the front-tongue state to the back-tongue state. This has the advantage that the speech can be coded well even if there are differences between speakers or in speaking style.

The combined invention provides a differential vector conversion section that calculates a differential signal of the difference signal output of each filter pair, an offset calculation section that calculates an offset value according to the sign and the absolute value of the differential signal, and an offset compensation section that adds the offset value to the original difference signal. A positive offset value is added to a rising difference signal, and a negative offset value is added to a falling difference signal. For a syllable consisting of a combination of a consonant and a vowel in particular, a positive offset is added to the difference signal corresponding to the first formant at its rise, when the articulation changes from the narrow-jaw state to the wide-jaw state, and the signal is returned to its original value at the fall. A negative offset is added to the difference signal corresponding to the second formant at its fall, when the articulation changes from the front-tongue state to the back-tongue state, and the signal is returned to its original value at the rise. This has the advantage that the characteristics of the vowel following the consonant can be extracted well.
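The offset calculation and offset compensation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of the embodiment: the differential signal is approximated by a first backward difference, and the `threshold` and `step` parameters are hypothetical stand-ins for however the offset is derived from the absolute value of the differential signal. The sketch applies the offset frame by frame and does not attempt to model how the offset is held or released between edges.

```python
from typing import List, Sequence


def offset_for(delta: float, threshold: float = 1.0, step: float = 3.0) -> float:
    """Offset calculation: choose an offset from the sign and magnitude of delta.

    Hypothetical rule: a fixed `step` with the sign of delta when
    |delta| >= threshold, otherwise no offset.
    """
    if delta >= threshold:
        return step       # rising difference signal: positive offset
    if delta <= -threshold:
        return -step      # falling difference signal: negative offset
    return 0.0


def compensate(diff: Sequence[float], threshold: float = 1.0,
               step: float = 3.0) -> List[float]:
    """Offset compensation: add the computed offset to the original signal."""
    out: List[float] = []
    prev = diff[0] if diff else 0.0
    for d in diff:
        delta = d - prev  # first backward difference (0 for the first frame)
        out.append(d + offset_for(delta, threshold, step))
        prev = d
    return out


# One filter-pair difference signal that rises, holds, then falls.
signal = [0.0, 0.0, 2.0, 4.0, 4.0, 4.0, 2.0, 0.0, 0.0]
print(compensate(signal))
```

Compared with adding the differential signal directly, a thresholded offset of this kind ignores small fluctuations and shifts the signal by a fixed amount only when the change is large enough, which is one plausible reading of an offset that depends on both the sign and the absolute value.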

[Brief explanation of drawings]

Fig. 1 is a block diagram of the conventional example. Fig. 2 is a block diagram showing the configuration of the acoustic analysis section used in it. Figs. 3(a) and 3(b) are diagrams showing the characteristics of the filter banks used in it. Figs. 4(a) and 4(b) are schematic circuit diagrams showing the configuration of the difference signal vector conversion section used in it. Fig. 5 shows the pattern conversion section used in it. Figs. 6 and 7 are diagrams explaining its operation. Fig. 8 is a block diagram of an embodiment of the present invention. Fig. 9 is a schematic circuit diagram of the difference signal emphasis processing section used in the embodiment. Figs. 10 through 21 are diagrams explaining the operation of the embodiment. Fig. 22 is a flowchart showing the sequential discrimination processing used in the embodiment.

3: frequency analysis section; 5: difference signal vector conversion section; 7: symbol vector conversion section; 16: differential vector conversion section; 17: difference signal emphasis section; 18: offset calculation section; 19: offset compensation section.

Claims

[Claims]

1. A voice message identification system in which: comparison means is provided that receives the difference signal output of a filter pair extracting the short-term average power of the high-frequency component and the low-frequency component of a voice input, respectively, and that outputs the code for an unvoiced sound when the high-frequency component is stronger, the code for a voiced sound when the low-frequency component is stronger, and the code for silence when the high-frequency and low-frequency components are approximately equal; voiced sound analysis means is provided that assigns one of the codes for the five Japanese vowels and other voiced sounds according to the magnitude relationship among the difference signal outputs of a plurality of filter pairs, each extracting the short-term average power of a different frequency region of the voice input; the voiced-sound code in the output of the comparison means is replaced with the code output by the voiced sound analysis means, thereby forming an input pattern consisting of a time series of codes for silence, unvoiced sounds, the five vowels, and other voiced sounds; the input patterns formed when a plurality of kinds of voice messages are uttered in a standard manner are registered in advance as standard patterns; and the standard pattern most similar to the input pattern is identified as the input message; the system being characterized by comprising a differential vector conversion section that calculates a differential signal of the difference signal output of each filter pair, and a difference signal emphasis section that adds the differential signal to the original difference signal to emphasize the rising and falling edges of the difference signal.

2. A voice message identification system in which: comparison means is provided that receives the difference signal output of a filter pair extracting the short-term average power of the high-frequency component and the low-frequency component of a voice input, respectively, and that outputs the code for an unvoiced sound when the high-frequency component is stronger, the code for a voiced sound when the low-frequency component is stronger, and the code for silence when the high-frequency and low-frequency components are approximately equal; voiced sound analysis means is provided that assigns one of the codes for the five Japanese vowels and other voiced sounds according to the magnitude relationship among the difference signal outputs of a plurality of filter pairs, each extracting the short-term average power of a different frequency region of the voice input; the voiced-sound code in the output of the comparison means is replaced with the code output by the voiced sound analysis means, thereby forming an input pattern consisting of a time series of codes for silence, unvoiced sounds, the five vowels, and other voiced sounds; the input patterns formed when a plurality of kinds of voice messages are uttered in a standard manner are registered in advance as standard patterns; and the standard pattern most similar to the input pattern is identified as the input message; the system being characterized by comprising a differential vector conversion section that calculates a differential signal of the difference signal output of each filter pair, an offset calculation section that calculates an offset value according to the sign and the absolute value of the differential signal, and an offset compensation section that adds the offset value to the original difference signal.