JPH0462596B2

Info

Publication number: JPH0462596B2
Application number: JP10693684A
Authority: JP
Inventors: Hiroyoshi Yuasa; Koichi Oomura
Original assignee: Matsushita Electric Works Ltd
Current assignee: Panasonic Electric Works Co Ltd
Priority date: 1984-05-25
Filing date: 1984-05-25
Publication date: 1992-10-06
Also published as: JPS60250398A

Description

[Detailed description of the invention]

[Technical Field] The present invention relates to a voice message identification system for operating electronic equipment by means of voice messages.

[Background Art] Phonemic information depends on the manner of articulation, so it is contained mainly in the envelope of the frequency spectrum, which reflects changes in the cross-sectional area of the vocal tract. It is characterized in particular by the resonance frequencies (the first, second, and third formants F1, F2, F3) and their bandwidths (50 to 110 Hz). The frequency spectrum of speech is largely determined by the transfer characteristic of the vocal tract and the shape of the sound source waveform. The transfer characteristic of the vocal tract contains resonance points determined by the vocal tract cross-sectional area and resonance points determined by the vocal tract length. Articulation, that is, the phoneme, is determined almost entirely by the cross-sectional area, whereas the vocal tract length differs between men, women, and children and from one individual to another. The sound source waveform (in particular that produced by vocal fold vibration in voiced sounds) is considered to depend on the pitch and loudness of the voice.

Short-interval analysis is known as a method of obtaining the spectral envelope and the formants. It performs linear predictive analysis within an interval slightly longer than the period of the voiced sound, or within a short interval of about 3 msec (in particular the glottal closure interval), and is said to yield the formants without being affected by the vocal fold vibration frequency. However, it requires a large number of multiplications, for example to compute the correlation function for linear prediction and to compute a covariance matrix for determining the short interval.

[Object of the Invention] The present invention has been made in view of the above problem. Its object is to provide a voice message identification system that requires little computation and is little affected by individual differences between speakers or by vocal fold vibration.

[Disclosure of the Invention] (Embodiment) FIG. 1 is a circuit block diagram based on the processing flow of the present invention. Reference numeral 1 denotes a high-frequency emphasis unit, which emphasizes the high-frequency range of the input speech. Reference numeral 2 denotes an A/D converter that converts the high-frequency-emphasized input speech; its output is fed to an interval compensation unit 3 and a pitch detection unit 4. The pitch detection unit 4 obtains the absolute value IPOW of the amplitude of the time waveform of the A/D-converted speech within one frame, detects the downward large-amplitude point IPN1 at which the average over a predetermined number of samples is lowest in the first half of the frame, obtains the second downward large-amplitude point IPN3 that follows it, and obtains the pitch IPit from the downward large-amplitude points IPN1 and IPN3. The interval compensation unit 3 removes the DC component of the short interval described below, allowing for the case where the amplitude of the time waveform is small.

Reference numeral 5 denotes an analysis interval determination unit. It obtains the upward large-amplitude point IPN0 that precedes the downward large-amplitude point IPN1, takes the zero-amplitude point midway between IPN1 and IPN0 as the large-amplitude point IMiD, and determines a short interval centered on IMiD that contains one cycle made up of the half cycle of the upward large-amplitude point IPN0 and the half cycle of the downward large-amplitude point IPN1. The number of samples for a fast Fourier transform is generally a power of two such as 64, 128, or 256; for convenience of the window calculation, 32 points and 64 points were adopted as the short-interval lengths.

Reference numeral 7 denotes a fast Fourier transform unit, which obtains the envelope of the frequency spectrum by a fast Fourier transform. During the computation, the spectral window determined by an analysis window calculation unit 6 is applied. The analysis window calculation unit 6 optimizes the length and position of the spectral window applied for the fast Fourier transform so that the fast Fourier transform unit 7 can extract the spectral envelope more accurately and with less computation (fewer multiplications).

Reference numeral 8 denotes a frequency band division unit. From the output obtained by converting the frequency spectrum extracted by the fast Fourier transform unit 7 into a logarithmic power spectrum, it obtains the short-time average power of each frequency component, for example the six components UV, V, VH, VL, VF, and VB. Here, V is the short-time average power of the speech input in the 0 to 1 kHz band and corresponds to the energy of voiced sounds. UV is the short-time average power of the speech input in the 5 to 12 kHz band and corresponds to the energy of unvoiced sounds. VL, VH, VB, and VF are the short-time average powers of the speech input in the 0 to 0.5 kHz, 0.5 to 1.0 kHz, 1.0 to 2.0 kHz, and 2.0 to 4.0 kHz bands, respectively, and correspond to the energies of narrow-jaw, wide-jaw, back-tongue, and front-tongue sounds, respectively.

Reference numeral 9 denotes a difference signal vector conversion unit. From the above short-time average powers it obtains the difference signal vector UV/V, VH/VL, VF/VB, VB/VL, VF/VH, such that the five vowels (i, e, a, o, u) are roughly separated into eao/iu, a/eo, e/o, and i/u, respectively.

Reference numeral 18 denotes a formant trajectory conversion unit. Whereas the frequency band division unit 8 and the difference signal vector conversion unit 9 obtain a difference signal vector by frequency band division, this unit obtains a formant vector: it finds the peak frequencies of the spectral envelope (the formant frequencies) and uses them as the formant vector. Each component of the formant vector is the difference from the average value for that formant, with the frequency axis expressed on a logarithmic or linear scale. The recognition rate can be improved by using pitch detection to switch these average values, which serve as the reference frequency for each formant, between classes such as male, female, and child. FIG. 9a and FIG. 9b show the formant distribution of the five vowels and the positions of the peaks.

Reference numeral 10 denotes a symbol vector conversion unit, which converts the difference signal vector or the formant vector into the symbol vector {i, e, a, o, u, h, l, f, b, w} by means of a conversion matrix. The values of the conversion matrix need only have row components corresponding to the magnitude of each component of the difference signal vector or formant vector associated with the symbol.

Reference numeral 11 denotes a start and end detection unit. It has a voiced/unvoiced decision function that judges UV when the UV/V difference signal is more positive than a set value Ru, judges V when it is more negative than a set value Rv, and judges S in between. It detects the start of speech from the UV and V decisions, and detects the end when silence continues for at least a set number of samples.

Reference numeral 12 denotes a symbol conversion processing unit. In a V interval it outputs the symbol of the largest component of the symbol vector if that component is at or above a set value, and outputs m if it is below the set value. In UV and S intervals it outputs UV and S, respectively. Reference numeral 13 denotes a shaping unit, which converts repetitions of the same symbol into a list of single symbols and their durations, and omits entries whose duration is short.

Reference numeral 14 denotes a word standard pattern storage unit, in which speech patterns are registered in registration mode to serve as the standard patterns for recognition matching. A preselection unit 15 operates in recognition mode: before matching, it performs a primary discrimination based on, for example, the number of UV symbols, in order to limit the candidates to be matched.

Reference numeral 16 denotes a time-axis normalization and matching unit. It consists of a time-axis normalization function, which normalizes the durations so that the total duration in the above list becomes a constant value such as 200 (or 1000), and a distance calculation function, which matches the input against the standard patterns using the distance table shown in Table 1: the distance (correlation value) between the symbols that correspond on the time axis is obtained and summed over all samples to give the distance between patterns.
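The pitch detection and analysis interval steps described above (IPOW, IPN1, IPN3, IPit, IPN0, IMiD) can be illustrated with a minimal sketch. The text does not fix several details, so the following are assumptions made only for illustration: the 30-sample average is taken over sliding windows and refined to the most negative sample inside the best window, the search range for IPN3 is bounded by hypothetical lags `min_lag` and `max_lag`, IPN0 is searched within half a pitch period before IPN1, and `synthetic_frame` is an invented test signal.

```python
import math


def lowest_window(x, start, stop, n):
    """Start index of the n-sample window with the lowest sum, for starts in [start, stop)."""
    stop = min(stop, len(x) - n + 1)
    return min(range(start, stop), key=lambda s: sum(x[s:s + n]))


def find_short_interval(x, avg_len=30, length=64, min_lag=31, max_lag=139):
    """Sketch of pitch detection and short-interval selection on one frame.

    avg_len, min_lag and max_lag are illustrative assumptions; length is the
    short-interval length (the text uses 32 or 64 points).
    """
    ipow = sum(abs(v) for v in x) / len(x)                # mean absolute amplitude
    s = lowest_window(x, 0, len(x) // 2, avg_len)         # lowest average, first half
    ipn1 = min(range(s, s + avg_len), key=lambda t: x[t])
    if -x[ipn1] <= ipow:
        return None                                       # no clear downward peak
    s = lowest_window(x, ipn1 + min_lag, ipn1 + max_lag + 1, avg_len)
    ipn3 = min(range(s, s + avg_len), key=lambda t: x[t])  # next downward peak
    ipit = ipn3 - ipn1                                    # pitch period in samples
    lo = max(0, ipn1 - ipit // 2)
    ipn0 = max(range(lo, ipn1), key=lambda t: x[t])       # preceding upward peak
    imid = min(range(ipn0, ipn1 + 1), key=lambda t: abs(x[t]))  # sample nearest zero
    begin = max(0, min(imid - length // 2, len(x) - length))
    seg = x[begin:begin + length]
    dc = sum(seg) / length                                # DC component of the interval
    return {"ipn0": ipn0, "ipn1": ipn1, "ipn3": ipn3, "ipit": ipit,
            "imid": imid, "start": begin, "segment": [v - dc for v in seg]}


def synthetic_frame(n=512, period=100):
    """Hypothetical test signal: one 20-sample sine cycle per pitch period."""
    x = [0.0] * n
    for p in range(0, n, period):
        for k in range(20):
            if p + 40 + k < n:
                x[p + 40 + k] = math.sin(2 * math.pi * k / 20)
    return x
```

Calling `find_short_interval(synthetic_frame())` returns the detected points and a 64-sample, DC-free segment centered on IMiD, which would then be windowed and passed to the transform stage.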

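The shaping step (unit 13) and the time-axis normalization (unit 16) amount to run-length encoding followed by rescaling. The sketch below is one possible reading: the minimum run length `min_len` and the largest-remainder rounding are assumptions, since the text only says that short entries are omitted and that the durations should total a constant such as 200.

```python
from itertools import groupby


def shape(symbols, min_len=2):
    """Collapse runs of identical symbols into (symbol, duration) pairs.

    Runs shorter than min_len (an assumed threshold) are dropped. Runs that
    become adjacent after dropping are left unmerged in this sketch.
    """
    runs = [(s, len(list(g))) for s, g in groupby(symbols)]
    return [(s, n) for s, n in runs if n >= min_len]


def normalize(runs, total=200):
    """Rescale durations so they sum exactly to total (largest-remainder rounding)."""
    if not runs:
        return []
    length = sum(n for _, n in runs)
    exact = [n * total / length for _, n in runs]
    out = [int(e) for e in exact]
    # Hand the leftover units to the entries with the largest fractional parts.
    order = sorted(range(len(runs)), key=lambda i: exact[i] - out[i], reverse=True)
    for i in order[: total - sum(out)]:
        out[i] += 1
    return [(s, d) for (s, _), d in zip(runs, out)]
```

For example, `normalize(shape(list("SSSaaaaaaUiiiiiiSS")))` drops the single `U` frame and stretches the remaining four runs to a total duration of 200.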
[Table 1: distance table; the table itself is not reproduced in this text.]

In Table 1, the horizontal and vertical columns correspond to the symbols of the standard pattern and the symbols of the input pattern, respectively. For example, when the symbol of the standard pattern is a and the symbol of the input pattern is also a, the distance table outputs −2, which is a low distance value. The distance calculation function can therefore easily compute the degree of approximation between the input pattern and the standard pattern as a whole, simply by adding up the outputs of the distance table in sequence.

Reference numeral 17 denotes a significance test unit. It has a significance test function and a result output function. When the nearest pattern is closer than a set value and is also separated from the second nearest pattern by at least a set value, the nearest pattern is regarded as identical to the input pattern; in all other cases the input is rejected as a recognition failure. The result output function outputs the recognition result.

Reference numeral 19 denotes an optimization feedback unit, which optimizes the division of the frequency bands and the offsets of the difference signal vector by feedback. In learning mode it stores the time series of the speaker's utterances of /i, e, a, o, u/, varies the division frequencies in the neighborhood of the division frequencies preset as standard, and optimizes the direction and amount of variation according to the sensitivity characteristic of the symbol vector so that the symbol component becomes maximal. In doing so, the slope of the spectrum is compensated by the offset of the difference signal vector, in particular so that the i component stands out when the input speech is the /i/ sound and the a component stands out when it is the /a/ sound. The gain balance of the difference signal inputs is also adjusted so that /e/ and /u/ are discriminated more reliably. VH/VL is adjusted first, then VF/VB, and then VB/VL.

In the embodiment, the sampling period is 80 μsec (sampling frequency 12.5 kHz) and the frame length is 512 samples. If the lowest fundamental frequency is taken as 90 Hz, the fundamental period is 139 samples. Computing a 256-point frequency spectrum with an ordinary fast Fourier transform requires a 512-point computation, and the number of multiplications is 2⁹ × (2⁴ + 2⁵) = 512 × (16 + 32) = 24576. If instead 64 samples from an interval shorter than the fundamental period are extracted from the 512-sample frame and analyzed, a 128-point fast Fourier transform suffices, so only 2⁷ × (2³ + 2⁴) = 128 × (8 + 16) = 3072 multiplications are needed. In addition, the number of multiplications for the analysis window applied as preprocessing for the fast Fourier transform equals the number of samples of the frequency spectrum, so short-interval analysis is seen to be effective as a simple method.

FIG. 2 is a flowchart of the characteristic portion of the embodiment of FIG. 1, consisting of the pitch detection unit 4 and the analysis interval determination unit 5. Step (1) obtains the average IPOW of the absolute value of the amplitude within one frame. Step (2) detects the downward large-amplitude point IPN1 at which the average over 30 samples is lowest in the first half of the frame. Step (3) detects the next downward large-amplitude point IPN3. Step (4) obtains the pitch IPit = IPN3 − IPN1 from the downward large-amplitude points IPN1 and IPN3. After pitch detection, step (5) detects the preceding upward large-amplitude point IPN0 starting from the downward large-amplitude point IPN1. Step (6) takes the zero-amplitude point midway between IPN0 and IPN1 as the large-amplitude point IMiD and determines a short interval, centered on IMiD, that contains one cycle made up of the upward large-amplitude half cycle and the downward large-amplitude half cycle. Step (7) then compensates for the DC component, step (8) applies the analysis window, step (9) performs the fast Fourier transform, step (10) selects between the difference signal vector mode and the formant vector mode, step (11) performs the frequency band division, and step (12) obtains the formant trajectory.

FIG. 3 shows a specific circuit diagram of the present invention. Speech is input from a microphone 18, amplified by a preamplifier 19, and adjusted in gain and offset by an adjustment amplifier 20. An A/D conversion circuit 21 then digitizes the speech input, and each digitized speech frame is stored in a speech frame memory 23. Reference numeral 24 denotes an FFT processor, which is a general-purpose FFT chip comprising a control section 24a, an arithmetic register 24b, a built-in RAM 24c, and a coefficient ROM 24d in which the coefficients are stored. It takes in the speech frame read from the speech frame memory 23 and performs the fast Fourier transform with the window applied. Reference numeral 25 denotes a spectrum frame memory, which stores the spectrum frames computed by the FFT processor 24. Reference numeral 22 denotes a timing circuit that supplies the operating timing for the speech frame memory 23, the FFT processor 24, and the spectrum frame memory 25.

Reference numeral 26 denotes a CPU that performs control computations according to the operating program written in advance in a program ROM 27. In matching mode it operates a matching operation circuit 30, symbolizes the data stored in the spectrum frame memory 25, and matches it against the standard patterns stored in a standard pattern RAM 31 during registration mode. In registration mode it stores the pattern of the input speech in the standard pattern RAM 31 as a standard pattern. In learning mode it performs the optimization feedback described above. In the figure, 32 is a terminal section, 33 is a microcomputer bus, 28 is a working RAM, and 29 is a control input/output section.

FIG. 4a shows an example of analyzing one fundamental period, with a frame length of 128 points aligned to the large-amplitude position of the time waveform shown in FIG. 4b. FIG. 5 shows an analysis example using the present system: FIG. 5a is the simulation result of a fast Fourier transform of the time waveform of the /I/ sound shown in FIG. 5b, with a window of frame length 64 points applied (128-point fast Fourier transform). FIG. 6 likewise shows an analysis example using the present system: FIG. 6a is the simulation result of a fast Fourier transform of the time waveform of the /I/ sound shown in FIG. 6b, with a window of frame length 32 points applied (64-point fast Fourier transform).

FIG. 7 shows the symbolization process. In the figure, V is the short-time average power of the speech input in the 0 to 1 kHz band and corresponds to the energy of voiced sounds. UV is the short-time average power of the speech input in the 5 to 12 kHz band and corresponds to the energy of unvoiced sounds. VL, VH, VB, and VF are the short-time average powers of the speech input in the 0 to 0.4 kHz, 0.4 to 0.8 kHz, and 1.8 to 3.2 kHz bands, respectively, and correspond to the energies of narrow-jaw, wide-jaw, back-tongue, and front-tongue sounds, respectively.

S₀ to S₄ are differential amplification means, which compute the difference signals V/UV, Veao/Viu, Va/Veo, Ve/Vo, and Vi/Vu, respectively. C₀ is a comparison means: when the difference signal component output from the differential amplification means S₀ is smaller than the reference value Rv it assigns the code V for a voiced sound, when it is larger than the reference value Ru it assigns the code U for an unvoiced sound, and otherwise it judges silence S. Here Ru > 0 > Rv. MY₀ is a symbolization processing unit, which receives one of the codes S, V, and U for silence, voiced sound, and unvoiced sound.

MC₀ is a matrix calculation unit. It multiplies the four-dimensional vector whose components are the difference signal outputs Veao/Viu, Va/Veo, Ve/Vo, and Vi/Vu by a predetermined matrix Tm, and thereby computes the short-time average power of each of the vowels i, e, a, o, u and of the other voiced sounds h, l, f, b, w contained in the speech input. The output of the matrix calculation unit MC₀ is fed to a maximum value determination unit MX₀, which determines which of the components i, e, a, o, u, h, l, f, b, w is the largest and passes the code of that component to the symbolization processing unit MY₀. If the difference between the largest and the second largest component is small, the code m is output instead. When the code output from the comparison means C₀ is V, the symbolization processing unit MY₀ outputs one of the codes i, e, a, o, u, h, l, f, b, w, and m supplied by the maximum value determination unit MX₀; when the code output from C₀ is U or S, it outputs that code unchanged.

Matrices such as those of equations (1) to (3) can be used as the conversion matrix Tm of the matrix calculation unit MC₀. Each matrix is arranged here with one row per difference signal (in the order Veao/Viu, Va/Veo, Ve/Vo, Vi/Vu) and one column per symbol (in the order i, e, a, o, u, h, l, f, b, w).

$$
T_m = \begin{pmatrix}
-17 & 17 & 17 & 17 & -17 & 18 & -18 & 0 & 0 & 13 \\
0 & 0 & 17 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 17 & 0 & -17 & 0 & 0 & 0 & 18 & -18 & 0 \\
17 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & -13
\end{pmatrix} \tag{1}
$$

$$
T_m = \begin{pmatrix}
-16 & 16 & 16 & 16 & -16 & 18 & -18 & 0 & 0 & 13 \\
-8 & -8 & 16 & -8 & -8 & 0 & 0 & 0 & 0 & 0 \\
0 & 16 & 0 & -16 & 0 & 0 & 0 & 18 & -18 & 0 \\
16 & 0 & -8 & 0 & -16 & 0 & 0 & 0 & 0 & -13
\end{pmatrix} \tag{2}
$$

$$
T_m = \begin{pmatrix}
-14 & 14 & 14 & 14 & -14 & 18 & -18 & 0 & 0 & 13 \\
-14 & -14 & 14 & -14 & -14 & 0 & 0 & 0 & 0 & 0 \\
0 & 14 & 0 & -14 & 0 & 0 & 0 & 18 & -18 & 0 \\
14 & 0 & -14 & 0 & -14 & 0 & 0 & 0 & 0 & -13
\end{pmatrix} \tag{3}
$$

The conversion matrix Tm of equation (1) sets every element to 0 except those minimally required for discrimination, so that the computation is fast. Equation (2) adds redundancy through the elements of absolute value 8, so that the five vowels can still be symbolized over a wide range when the difference signals are only weakly detected. Equation (3) gives all the elements of the five vowels the same weight (absolute value 14) for the difference signals relating to the first formant F₁, while for the two difference signals relating to the second formant F₂ each of the five vowels receives one weight, in one signal or the other, as needed for discrimination. It can thus be said to give more importance to the first formant F₁ than to the second formant F₂. The conversion matrix Tm can be set freely according to, for example, the words to be identified. AP in FIG. 3 shows the characteristic of the adjustment amplifier 20 described above.

Apart from the matching method described above, it is also possible to derive binarized signals from the difference signals, symbolize their combination, and match sequentially. One such method is as follows. The UV/V, Veao/Viu, Va/Veo, Ve/Vo, and Vi/Vu difference signals obtained from the vector of short-time average powers are extracted. If the Veao/Viu difference signal is at or above a fixed positive value, the symbol Veao is assigned; if it is at or below a fixed negative value, the symbol Viu is assigned; otherwise the symbol S is assigned. If the Va/Veo difference signal is at or above a fixed positive value, the symbol Va is assigned; if it is at or below a fixed negative value, the symbol Veo is assigned; otherwise the symbol S is assigned. If the Ve/Vo difference signal is at or above a fixed positive value, the symbol Ve is assigned; if it is at or below a fixed negative value, the symbol Vo is assigned; otherwise the symbol S is assigned. If the Vi/Vu difference signal is at or above a fixed positive value, the symbol Vi is assigned; if it is at or below a fixed negative value, the symbol Vu is assigned; otherwise the symbol S is assigned. These symbols are stored in a temporary storage means and converted into one of the symbols a, e, o, i, u, h, l, f, b, w, and m by reference to the symbolization table shown in Table 2.
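The symbolization path of FIG. 7 (comparison means C₀, matrix calculation MC₀, maximum determination MX₀, symbolization MY₀) can be sketched as follows, using the values of equation (3). The threshold values `ru`, `rv`, and `margin`, the scaling of the difference signals to roughly ±1, and the layout of the matrix as rows per difference signal are all assumptions made for illustration, not values given in the text.

```python
SYMBOLS = ["i", "e", "a", "o", "u", "h", "l", "f", "b", "w"]

# Values of equation (3). Rows: Veao/Viu, Va/Veo, Ve/Vo, Vi/Vu (assumed layout).
TM3 = [
    [-14, 14, 14, 14, -14, 18, -18, 0, 0, 13],
    [-14, -14, 14, -14, -14, 0, 0, 0, 0, 0],
    [0, 14, 0, -14, 0, 0, 0, 18, -18, 0],
    [14, 0, -14, 0, -14, 0, 0, 0, 0, -13],
]


def classify_frame(uv_v, diffs, ru=0.5, rv=-0.5, margin=1.0, tm=TM3):
    """Return 'U', 'S', a symbol from SYMBOLS, or 'm' for one frame.

    uv_v  : UV/V difference signal (positive means unvoiced energy dominates)
    diffs : [Veao/Viu, Va/Veo, Ve/Vo, Vi/Vu] difference signals
    ru, rv, margin are hypothetical settings with ru > 0 > rv.
    """
    if uv_v > ru:
        return "U"                       # unvoiced
    if uv_v >= rv:
        return "S"                       # between the thresholds: silence
    # Voiced frame: project the difference vector onto the symbol axes.
    scores = [sum(tm[r][c] * diffs[r] for r in range(4)) for c in range(len(SYMBOLS))]
    ranked = sorted(range(len(SYMBOLS)), key=scores.__getitem__, reverse=True)
    best, second = ranked[0], ranked[1]
    if scores[best] - scores[second] < margin:
        return "m"                       # largest and runner-up too close
    return SYMBOLS[best]
```

With idealized inputs such as `classify_frame(-2.0, [1, 1, 0, 0])` the a column receives the largest score, while an all-zero difference vector falls through to `m`.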

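The figures quoted for the embodiment (the 139-sample fundamental period and the two multiplication counts) can be checked with a few lines of arithmetic. This only reproduces the expressions as printed in the text; it does not derive the multiplication formula, whose basis the text does not state.

```python
SAMPLING_PERIOD_US = 80                      # from the embodiment
fs_hz = 1_000_000 / SAMPLING_PERIOD_US       # 12500.0 Hz
period_samples = fs_hz / 90                  # samples per period at 90 Hz

full_frame = 2**9 * (2**4 + 2**5)            # 512-point transform, as printed
short_interval = 2**7 * (2**3 + 2**4)        # 128-point transform, as printed
ratio = full_frame // short_interval         # reduction factor
```

On these numbers the short-interval analysis needs one eighth of the multiplications of the full-frame transform.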
[Table 2: symbolization table; the table itself is not reproduced in this text.]

【表】ただし、第２表において＊は０、１のいずれで
もよいことを示しており、０／１は０の場合と１
の場合を示している。かかる記号化テーブルは例
えばROMなどを用いて構成されており、一時記
憶した内容をアドレス入力としてROMをアクセ
スすることにより、ａ、ｅ、ｏ……等の各記号の
コードがデータ出力として得られるようにする
か、あるいは一時記憶した内容と記号化テーブル
の内容とを排他的論理和で比較し、一致したとき
の記号を出力するとよい。第９図は照合部をマイ
クロコンピユータの逐次判別処理プログラムによ
つて実現する方法を示すフローチヤートであり、
まず第１段階としてVeo／Viu差信号が高レベル
Ｈであるか、中レベルＭであるか、低レベルＬで
あるかによつて、３グループに分けている。そし
て第２段階では、まず第１段階がＨのときは、
Va／Veo差信号がＨならば、記号／ａ／を出力
し、Ｍならば記号／Vo／を出力し、Ｌならば第
３段階に移り、Ve／Vo差信号を調べて、Ｈなら
ば記号／ｅ／を出力し、Ｍならば／ｈ／を出力
し、Ｌならば記号／ｏ／を出力する。一方、第１
段階がＭの場合、第２段階では、Ve／Vo差信号
がＨならば記号／ｆ／を出力し、Ｍならば記号／
ｍ／を出力し、Ｌならば記号／ｂ／を出力する。
更に第１段階がＬの場合、第２段階ではVi／Vu
差信号がAHならば記号／ｉ／を出力し、Ｍなら
ば記号／ｌ／を出力し、Ｌならば記号／ｕ／を出
力するのである。上述に実施例は直交変換として高速フーリエ変
換を用いてあるが、ウオルシユ変換を用いてもよ
い。［発明の効果］本発明は上述のように構成し音声入力の時間波
形の正負が声帯振動の向きに対して一定であるよ
うに正負の位相を保ち、声帯振動の微分波形音声
の時間波形を上向き大振幅点とし、その上向き大
振幅点の次ぎの下向きの大振幅点を声帯振動の立
ち下がりによる声帯振動の微分波形に対応させ有
声音について下向き大振幅点を音声サンプルの短
時間平均値の下向きの最大値の点でフレーム平均
値よりも大きな大きさの点として検出する手段
と、この下向き大振幅点の前の上向き大振幅点と
の中間を大振幅点として検出し、この大振幅点を
中心に短区間分析区間を決定する手段とを備え、
この決定された短区間分析区間の直流分を除去し
て分析窓をかけ短区間分析によつて音声入力の周
波数スペクトルの包絡線を抽出するので、入力音
声の情報量の大きな大振幅区間のみを短区間分析
するのでマクロな特徴が簡単に検出することがで
き、とくに音源波形の影響があるとしても、基本
周期の影響が無いため、スペクトル包絡が簡単に
求められ、話者の個人差や、声帯振動の影響が少
なくかつ計算量が少なくて高い認識が行えるとい
う効果がある。

[Table] In Table 2, * indicates that the value may be either 0 or 1, and 0/1 indicates the case of 0 or 1. Such a symbolization table is implemented using, for example, a ROM: by accessing the ROM with the temporarily stored contents as the address input, the code for each symbol (a, e, o, etc.) is obtained as the data output. Alternatively, the temporarily stored contents may be compared with the contents of the symbolization table by exclusive OR, and a symbol output when they match.

FIG. 9 is a flowchart showing a method of implementing the matching section as a sequential discrimination program on a microcomputer. In the first stage, the Veo/Viu difference signal is classified into three groups according to whether it is at the high level H, the medium level M, or the low level L. When the first stage gives H, the second stage examines the Va/Veo difference signal: H outputs the symbol /a/, M outputs the symbol /Vo/, and L proceeds to a third stage, which examines the Ve/Vo difference signal and outputs the symbol /e/ for H, /h/ for M, and /o/ for L. When the first stage gives M, the second stage examines the Ve/Vo difference signal and outputs the symbol /f/ for H, /m/ for M, and /b/ for L. When the first stage gives L, the second stage examines the Vi/Vu difference signal and outputs the symbol /i/ for H, /l/ for M, and /u/ for L.

Although the above embodiment uses the fast Fourier transform as the orthogonal transform, the Walsh transform may be used instead.

[Effects of the Invention] As described above, the present invention keeps the polarity of the time waveform of the voice input constant with respect to the direction of vocal cord vibration, takes the time waveform of the voice corresponding to the differential waveform of the vocal cord vibration as the upward large amplitude point, and makes the downward large amplitude point that follows the upward large amplitude point correspond to the differential waveform produced by the fall of the vocal cord vibration. It comprises means for detecting, for a voiced sound, the downward large amplitude point as the point at which the short-time average of the voice samples reaches its downward maximum and whose magnitude exceeds the frame average, and means for detecting the midpoint between this downward large amplitude point and the preceding upward large amplitude point as the large amplitude point and determining a short analysis section centered on it. The DC component of the determined short analysis section is removed, an analysis window is applied, and the envelope of the frequency spectrum of the voice input is extracted by short-section analysis. Because only the large-amplitude section of the input voice, which carries the most information, is analyzed, macroscopic features can be detected easily. In particular, even if the sound source waveform has some influence, the fundamental period has none, so the spectral envelope is obtained easily, and high recognition performance is achieved with little computation and with little influence from individual differences between speakers or from vocal cord vibration.
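
The three-stage sequential discrimination described above maps naturally onto a small decision tree. The following Python sketch is illustrative only and is not taken from the patent: the function name `classify_frame` and the dictionary keys are hypothetical, the difference signals are assumed to be already quantized to the levels H, M, and L, the M branch under a first-stage M is read here as /m/, and the symbol printed as /Vo/ in the text is kept as printed.

```python
from typing import Dict

# Leaf tables for the second and third stages (levels H, M, L -> symbol).
_VE_VO_AFTER_H = {"H": "e", "M": "h", "L": "o"}   # third stage, first stage H
_VE_VO_AFTER_M = {"H": "f", "M": "m", "L": "b"}   # second stage, first stage M
_VI_VU_AFTER_L = {"H": "i", "M": "l", "L": "u"}   # second stage, first stage L


def classify_frame(levels: Dict[str, str]) -> str:
    """Return the symbol for one analysis frame.

    `levels` maps a difference-signal name such as "Veo/Viu" to its
    quantized level "H", "M" or "L" (hypothetical input format).
    """
    first = levels["Veo/Viu"]
    if first == "H":
        second = levels["Va/Veo"]
        if second == "H":
            return "a"
        if second == "M":
            return "Vo"  # kept exactly as printed in the text
        if second == "L":
            return _VE_VO_AFTER_H[levels["Ve/Vo"]]
    elif first == "M":
        return _VE_VO_AFTER_M[levels["Ve/Vo"]]
    elif first == "L":
        return _VI_VU_AFTER_L[levels["Vi/Vu"]]
    raise ValueError("levels must be 'H', 'M' or 'L'")
```

Only the difference signals on the path actually taken are read, which mirrors the point of sequential discrimination: at most three comparisons per frame instead of a full table match.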

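The short-section analysis summarized under the effects of the invention (locate the downward large amplitude point, find the preceding upward large amplitude point, center a short section on their midpoint, remove the DC component, apply a window, and transform) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the patented circuit: the function names, the 3-sample averaging width, the limit on how far back the upward point is searched, and the Hann window are all assumptions, and a naive DFT stands in for the fast Fourier transform to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Optional, Sequence


def moving_average(x: Sequence[float], width: int = 3) -> List[float]:
    """Short-time average over `width` samples (width is an assumption)."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def find_center(frame: Sequence[float], length: int = 32,
                width: int = 3) -> Optional[int]:
    """Index midway between the upward and downward large amplitude points."""
    smooth = moving_average(frame, width)
    frame_avg = sum(abs(s) for s in frame) / len(frame)
    # Downward large amplitude point: lowest short-time average in the
    # first half of the frame, accepted only if it exceeds the frame average.
    ipn1 = min(range(len(frame) // 2), key=lambda i: smooth[i])
    if ipn1 == 0 or -smooth[ipn1] <= frame_avg:
        return None
    # Preceding upward point, searched at most `length` samples back
    # (the search limit is an assumption of this sketch).
    ipn0 = max(range(max(0, ipn1 - length), ipn1), key=lambda i: smooth[i])
    return (ipn0 + ipn1) // 2


def extract_section(frame: Sequence[float], center: int,
                    length: int = 32) -> Optional[List[float]]:
    """Cut `length` samples around `center`, remove DC, apply a Hann window."""
    start = center - length // 2
    if start < 0 or start + length > len(frame):
        return None
    section = list(frame[start:start + length])
    mean = sum(section) / length
    return [(s - mean) * (0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)))
            for n, s in enumerate(section)]


def log_power_spectrum(section: Sequence[float]) -> List[float]:
    """Log power spectrum in dB via a naive DFT (stand-in for the FFT)."""
    n = len(section)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(section[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(10 * math.log10(abs(acc) ** 2 + 1e-12))
    return spectrum
```

Because the section spans only about one cycle around the large amplitude point, a 32-point or 64-point transform suffices, which is where the reduction in multiplications comes from.
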
[Brief description of the drawings]

第１図は本発明の実施例の概略回路構成図、第
２図は同上の要部の動作説明用のフローチヤー
ト、第３図は同上の具体回路図、第４図乃至第６
図は同上の動作説明図、第７図は本発明の実施例
の記号化プロセスを説明する回路ブロツク図、第
８図は本発明の別の照合例の動作説明用のフロ
ーチヤート、第９図はホルマント軌跡についての
説明用波形図であり、３は区間補償部、４はピツ
チ検出部、５は分析区間決定部、６は分析窓計算
部、７は高速フーリエ変換部、８は周波数帯域分
割部、９は差信号ベクトル変換部、１０は記号ベ
クトル変換部、１２は記号化処理部、１６は時間
軸正規化・照合部、１８はホルマント軌跡変換部
である。
FIG. 1 is a schematic circuit configuration diagram of an embodiment of the present invention; FIG. 2 is a flowchart explaining the operation of its main parts; FIG. 3 is a specific circuit diagram of the same; FIGS. 4 to 6 are diagrams explaining the operation of the same; FIG. 7 is a circuit block diagram explaining the symbolization process of the embodiment; FIG. 8 is a flowchart explaining the operation of another matching example of the present invention; and FIG. 9 is a waveform diagram explaining the formant locus. In the figures, 3 is an interval compensation section, 4 is a pitch detection section, 5 is an analysis section determination section, 6 is an analysis window calculation section, 7 is a fast Fourier transform section, 8 is a frequency band division section, 9 is a difference signal vector conversion section, 10 is a symbol vector conversion section, 12 is a symbolization processing section, 16 is a time axis normalization and matching section, and 18 is a formant locus conversion section.

Claims

[Claims] 1. A voice message identification method comprising: means for extracting the frequency spectrum of a voice input, dividing its logarithmic power spectrum by frequency to extract the short-time average power of each frequency band, and either extracting from these short-time average powers a difference signal vector that separates the five vowels i, e, a, o, u according to the ratio of e, a, o to i, u, the ratio of a to e, o, the ratio of e to o, and the ratio of i to u, or extracting a formant vector from the formant locus; matrix calculation means for calculating symbol vectors of the five vowels and other voiced sounds by multiplying the difference signal vector or the formant vector by a transformation matrix; and means for outputting the largest component as the onomatopoeic symbol of the analysis frame; in which an input pattern consisting of symbols and durations based on the onomatopoeic symbols is compared, on the time axis or symbol by symbol, with standard patterns stored in advance, and the standard pattern closest to the input pattern is identified as the input message; the method being characterized in that the polarity of the time waveform of the voice input is kept constant with respect to the direction of vocal cord vibration, the time waveform of the voice corresponding to the differential waveform due to the rise of vocal cord vibration is taken as the upward large amplitude point, and the downward large amplitude point following the upward large amplitude point is made to correspond to the differential waveform due to the fall of vocal cord vibration; that there are provided means for detecting, for a voiced sound, the downward large amplitude point as the point at which the short-time average value of the voice samples reaches its downward maximum and whose magnitude is larger than the frame average value, and means for detecting the midpoint between this downward large amplitude point and the preceding upward large amplitude point as the large amplitude point and determining a short analysis section centered on this large amplitude point; and that the DC component of the determined short analysis section is removed, an analysis window is applied, and the envelope of the frequency spectrum of the voice input is extracted by short-section analysis.
2. The voice message identification method according to claim 1, wherein the width of the short section includes one cycle at the time of large amplitude of the voiced sound.