JPS6131479B2

Info

Publication number: JPS6131479B2
Application number: JP54158448A
Authority: JP
Inventors: Sadahiro Furui
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1979-12-06
Filing date: 1979-12-06
Publication date: 1986-07-21
Also published as: JPS5680100A

Description

[Detailed description of the invention]

The present invention relates to a speaker verification method, and more particularly to a method that automatically determines, with high accuracy and efficiency, whether a speaker is in fact the particular registered person he or she claims to be.

One type of telephone service determines whether a voice claiming to be that of a particular registered speaker really belongs to that speaker, and reports the result. Such a service can be used for shopping by telephone, deposits and withdrawals, identifying the maker of a threatening call, and telephone transactions, conferences, and the like.

A speech wave is the sound wave radiated from the lips or nose when the vocal tract is excited by the vocal cord vibration wave (the voiced source) or by the noise wave caused by turbulence at a constriction of the vocal tract (the unvoiced source).

FIG. 1 is a block diagram of an electrical model of the speech wave.

The voiced sound from an impulse generator 21, which represents the vocal cords, and the unvoiced sound from a white noise generator 22, which represents the turbulence, are selected by a switch 23 and fed to an electric circuit 24 representing the vocal tract tube, which drives a speaker 25 to produce the speech wave. In FIG. 1, the part to the left of the chain line is the sound source and the part to the right is the vocal tract characteristic.
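The model of FIG. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the circuit of the figure: the function names, the pitch period of 80 samples, and the two-coefficient all-pole filter standing in for the electric circuit 24 are all chosen for the example.

```python
import random

def excitation(voiced, n_samples, pitch_period=80, amplitude=1.0, seed=0):
    """Source: impulse train if voiced, white noise otherwise (the role of switch 23)."""
    if voiced:
        return [amplitude if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def vocal_tract(source, a):
    """All-pole filter y[n] = x[n] + sum_k a[k-1] * y[n-k], a stand-in for circuit 24."""
    y = []
    for n, x in enumerate(source):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        y.append(acc)
    return y

coeffs = [1.3, -0.8]  # hypothetical stable resonance, not taken from the patent
voiced_wave = vocal_tract(excitation(True, 400), coeffs)
unvoiced_wave = vocal_tract(excitation(False, 400), coeffs)
```

The same filter is driven by either source, which mirrors the separation in FIG. 1 between the sound source and the vocal tract characteristic.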

Analyzing speech means quantitatively determining, for each short interval (10 to 30 ms), the frequency spectrum of the speech wave (the spectrum of the vocal tract transfer function) and the properties of the sound source that drove the vocal tract. The former is called spectral analysis and the latter source analysis.

The sound source is described by three elements: a voiced/unvoiced decision signal V indicating whether the excitation is an impulse train or noise, the period (pitch) P of the impulse train when it is one, and the amplitude A of the impulse train or noise. Source analysis is the quantitative extraction of these three elements, but because they change quite rapidly, accurate analysis is rather difficult.

FIG. 2 is a block diagram of a speech analysis model.

To perform source analysis, the speech is input to a PARCOR analyzer 26, which first carries out spectral analysis (yielding $K_1, K_2, \ldots, K_p$), and the speech wave is then passed through an inverse spectrum circuit 28 whose characteristic is the inverse of that spectrum, removing the spectral peaks and valleys. The resulting spectrum is almost flat, and the impulse train or noise of the source appears as a residual wave R, which carries the sound source signal.

A sound source analysis circuit 29 therefore analyzes the residual wave R and extracts the source signals V, P, and A.
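The inverse filtering idea of FIG. 2 can be illustrated as follows. This is a minimal sketch, assuming the spectral envelope is given as direct-form predictor coefficients rather than the PARCOR coefficients of the analyzer 26; `synthesize`, `inverse_filter`, and `analyze_source` are hypothetical names, and the peak-spacing pitch estimate is a toy that only works on a clean impulse train.

```python
def synthesize(source, a):
    """All-pole synthesis: s[n] = e[n] + sum_k a[k-1] * s[n-k]."""
    s = []
    for n, e in enumerate(source):
        s.append(e + sum(ak * s[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0))
    return s

def inverse_filter(speech, a):
    """Residual e[n] = s[n] - sum_k a[k-1] * s[n-k]; this flattens the spectral envelope."""
    return [s - sum(ak * speech[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
            for n, s in enumerate(speech)]

def analyze_source(residual):
    """Toy estimate of (V, P, A) from the residual wave."""
    amplitude = max(abs(e) for e in residual)
    peaks = [n for n, e in enumerate(residual) if abs(e) > 0.5 * amplitude]
    gaps = {q - p for p, q in zip(peaks, peaks[1:])}
    voiced = len(peaks) >= 2 and len(gaps) == 1   # equally spaced pulses
    pitch = gaps.pop() if voiced else None
    return voiced, pitch, amplitude

a = [1.3, -0.8]                                   # hypothetical envelope coefficients
pulses = [0.7 if n % 80 == 0 else 0.0 for n in range(400)]
residual = inverse_filter(synthesize(pulses, a), a)
V, P, A = analyze_source(residual)
```

Because the inverse filter exactly undoes the synthesis filter here, the residual reproduces the impulse train, and the pitch and amplitude can be read off it directly. With real speech the envelope is only estimated, so the residual is noisier.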

Devices used for speaker verification fall into two classes. In the first, the utterance content (the words) of the input speech used for the decision is fixed in advance; the speaker always utters the same words, and the input is compared with the registered speech of each speaker to decide whether it can be regarded as that speaker's voice. In the second, the utterance content is not fixed in advance; arbitrary utterances are used, and only their characteristic features are compared with the corresponding features of each registered speaker's speech.

The latter approach is more widely applicable in general, but high accuracy is difficult to achieve. The former approach is somewhat more restricted in use, but in practice it can still be expected to cover a very wide range of applications, and it can achieve higher accuracy than the latter.

In general, comparing the input speech wave directly with the registered speech wave is inefficient, so it is preferable to convert both into so-called feature parameters, such as the frequency spectrum or linear prediction coefficients, before comparing them. In addition to these, conventional devices of this kind have used the fundamental frequency, speech energy, formant frequencies, cepstral coefficients, PARCOR coefficients, log area ratios, zero-crossing counts, and so on. These parameters, however, have drawbacks: some are difficult to extract stably and accurately, some require complicated computation, some do not adequately represent the characteristics of the speaker's voice, and some vary when the speech passes through a transmission path such as a telephone system, greatly reducing the accuracy of the decision.

The object of the present invention is to eliminate these drawbacks by providing a speaker verification method that easily extracts, from speech that has passed through a telephone system or the like, features that are little affected by transmission distortion, and that can decide with high accuracy whether the speaker is the claimed person.

The speaker verification method of the present invention uses linear prediction cepstral coefficients, which are extremely useful as parameters representing the characteristics of a speaker's voice and can be extracted by a relatively simple procedure. From the time waveforms of the cepstral coefficients it extracts polynomial expansion coefficients, feature parameters that are little affected by variations in the transmission path, and it decides whether the speech is that of the claimed speaker by nonlinear time-normalized matching against the feature parameters registered in advance for each speaker.

An embodiment of the present invention is described below with reference to FIG. 3.

FIG. 3 is a block diagram of the speaker verification method of the present invention.

As shown in FIG. 3, in the method of the present invention the speech to be verified is input at a speech input terminal 1 and passes through a speech interval detection circuit 3, a linear prediction analysis circuit 4, a cepstrum conversion circuit 5, a cepstrum storage section 6, a cepstrum averaging circuit 7, a subtraction circuit 8, a feature parameter storage section 9, a cepstrum register 10, and a polynomial expansion circuit 11, so that polynomial expansion coefficients are extracted from the time waveforms of the linear prediction cepstral coefficients.

Meanwhile, the number of the speaker to be verified is input at an identification number input terminal 2, and the corresponding pattern is taken out of a standard pattern storage section 13. A switch 12 switches between the learning mode and the verification mode. The polynomial expansion coefficients of the speech to be verified and the contents of a weight register 16 are supplied to a nonlinear time normalization circuit 15, which computes the degree of similarity; a comparison circuit 17 compares this with a threshold, and the result, whether or not the speech belongs to the speaker, is supplied to an output terminal 18. In addition, the feature parameters are input to a standard pattern averaging circuit 14, and the threshold used for the decision is input to a threshold arithmetic logic circuit 19 and updated.

The operation is now described in more detail. First, the speech wave to be used for the decision is input at the speech input terminal 1, and the number of the speaker to be verified against is input at the identification number input terminal 2. The number can be entered, for example, with the dial of a push-button telephone. Because the input usually contains both actual speech intervals and silent (noise) intervals, it is fed to the speech interval detection circuit 3, which detects the speech intervals.
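Speech interval detection from short-term energy, combined with a minimum duration, can be sketched as below. The function name `detect_speech`, the 8 kHz sampling rate implied by 80-sample (10 ms) frames, and the threshold values are assumptions for illustration, not values from the patent.

```python
import math

def detect_speech(samples, frame_len=80, energy_threshold=0.01, min_frames=3):
    """Return (start, end) sample indices of runs whose frame energy stays above threshold."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    intervals, run_start = [], None
    for i, e in enumerate(energies + [0.0]):      # sentinel closes a trailing run
        if e > energy_threshold and run_start is None:
            run_start = i
        elif e <= energy_threshold and run_start is not None:
            if i - run_start >= min_frames:       # keep only sufficiently long runs
                intervals.append((run_start * frame_len, i * frame_len))
            run_start = None
    return intervals

silence = [0.0] * 400
tone = [0.5 * math.sin(2.0 * math.pi * n / 16.0) for n in range(800)]
intervals = detect_speech(silence + tone + silence)   # [(400, 1200)]
```

A practical detector would also have to cope with background noise, for example by adapting the threshold, which this sketch does not attempt.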

This detection can use any of several well-known cues, such as the short-term energy of the input signal, the length of time for which the energy stays above a certain level, or the presence or absence of periodicity in the waveform. The signal in each detected speech interval is sent to the linear prediction analysis circuit 4 and converted into time waveforms of linear prediction coefficients. This technique is already well known (see, for example, Itakura and Saito, "Estimation of speech spectral density and formant frequencies by a statistical method," Transactions of the Institute of Electronics and Communication Engineers of Japan, 53-A, 1, 35, 1970), so the details are omitted. Basically, the signal is first passed through a low-pass filter, then sampled and digitized; short segments of the waveform are cut out at fixed intervals and multiplied by a Hamming window or the like, and correlation coefficients are computed by sum-of-products operations. The Hamming window is, for example, 30 ms long and is advanced, for example, every 10 ms. The linear prediction coefficients are readily obtained from the correlation coefficients by solving a set of algebraic equations iteratively. Correlation coefficients and linear prediction coefficients are computed, for example, from order 0 through order 10.

The time waveforms of the extracted linear prediction coefficients are converted by the cepstrum conversion circuit 5 into so-called linear prediction cepstral coefficients. These differ somewhat from the conventional cepstral coefficients obtained by Fourier transformation of the log power spectrum, but the spectral envelopes they represent are similar. Linear prediction cepstral coefficients are known to have excellent properties as parameters representing the characteristics of a speaker's voice (B. S. Atal, "Effectiveness of Linear Prediction Characteristics of the Speech Wave for Automatic Speaker Identification and Verification," J. Acoust. Soc. Amer., 55, 6, p. 1304, 1974). The conversion from linear prediction coefficients to linear prediction cepstral coefficients can be carried out by the following calculation.

$$c_1 = a_1 \qquad \cdots (1)$$

Here, $a_n$ is the $n$-th order linear prediction coefficient, $c_n$ is the $n$-th order linear prediction cepstral coefficient, and $p$ is the order of the linear prediction model. As noted above, a value of about 10 is used for $p$.
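The chain from a windowed frame to cepstral coefficients can be sketched as follows. The text here reproduces only Equation (1); the recursion for higher orders used in `lpc_to_cepstrum` is the standard one for a predictor of the form s[n] ≈ Σ a_k s[n-k] and is an assumption, not a quotation from the patent. The function names are hypothetical, and `levinson_durbin` is one common way to carry out the iterative solution that the description mentions.

```python
import math

def autocorrelation(frame, order):
    """Hamming-window one frame and return correlation values r[0..order]."""
    n = len(frame)
    x = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
         for i, s in enumerate(frame)]
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations iteratively; returns a[1..order]. Requires r[0] > 0."""
    a = [0.0] * (order + 1)            # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k             # remaining prediction error
    return a[1:]

def lpc_to_cepstrum(a, n_ceps=None):
    """c_1 = a_1; c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0 for n > p."""
    p = len(a)
    n_ceps = p if n_ceps is None else n_ceps
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

# Single-pole check: for a_1 = 0.5 the cepstrum is 0.5**n / n.
cepstrum = lpc_to_cepstrum([0.5], n_ceps=3)
```

At an assumed 8 kHz sampling rate, the 30 ms window and 10 ms update of the description would correspond to 240-sample frames advanced by 80 samples.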

The time waveforms of the linear prediction cepstral coefficients (hereinafter simply called cepstral coefficients) extracted over the entire speech interval are stored in the cepstrum storage section 6. At the same time, the waveforms of those coefficients that have been designated in advance as feature parameters for the later speaker decision are input to the cepstrum averaging circuit 7, where the average over the entire speech interval is computed for the cepstral coefficient of each order. Which of the cepstral coefficients are used as feature parameters is determined in advance by preliminary experiments and by statistical analysis such as analysis of variance.

Next, these averages and the waveforms of the designated cepstral coefficients held in the cepstrum storage section 6 are input to the subtraction circuit 8, which subtracts from each cepstral coefficient value the average of the corresponding order. The output is stored temporarily in the feature parameter storage section 9.
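The averaging and subtraction steps amount to removing the per-order mean from each cepstral track. The sketch below, with the hypothetical function name `subtract_mean`, also illustrates why this helps: a fixed transmission characteristic acts approximately as a constant added to each cepstral coefficient, and such a constant disappears after the subtraction.

```python
def subtract_mean(cepstra):
    """cepstra: list of frames, each a list of cepstral coefficients of the same length.

    Returns (normalized frames, per-order means over the whole utterance)."""
    n_frames = len(cepstra)
    means = [sum(column) / n_frames for column in zip(*cepstra)]
    normalized = [[c - m for c, m in zip(frame, means)] for frame in cepstra]
    return normalized, means

frames = [[1.0, 2.0], [3.0, 6.0]]
offset = [0.5, -1.0]                       # hypothetical channel effect
shifted = [[c + o for c, o in zip(frame, offset)] for frame in frames]

clean, _ = subtract_mean(frames)
through_channel, _ = subtract_mean(shifted)
```

Both `clean` and `through_channel` come out identical, since the constant offset is absorbed entirely by the mean.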

Meanwhile, for each of several predetermined cepstral coefficients held in the cepstrum storage section 6, a segment of fixed length is taken from the time waveform at fixed intervals and stored temporarily in the cepstrum register 10; the contents of the register 10 are sent to the polynomial expansion circuit 11, which computes the polynomial expansion coefficients. The segment of cepstral time waveform supplied to the cepstrum register 10 and the polynomial expansion circuit 11 is, for example, 90 ms long and is advanced, for example, every 10 ms. Various methods can be used to expand a time waveform into polynomials; here, as an example, the time waveform is represented as a linear combination of the following three functions.

$$P_{0j} = 1 \qquad \cdots (3)$$
$$P_{1j} = j - 5 \qquad \cdots (4)$$
$$P_{2j} = j^2 - 10j + \frac{55}{3} \qquad \cdots (5)$$

If the time waveform of a cepstral coefficient is denoted by $x_j$ ($j = 1, 2, \ldots, 9$), the expansion coefficients $a$, $b$, and $c$ corresponding to these three functions can be obtained by calculation. Of the coefficients $a$, $b$, and $c$, those designated in advance for later use as feature parameters, according to the order of the cepstral coefficient, are computed each time the input to the polynomial expansion circuit 11 is updated, every 10 ms, and are sent to the feature parameter storage section 9 and stored. The coefficient $a$, that is, the zeroth-order polynomial expansion coefficient, corresponds to the short-time average of the time waveform and is easily affected by variations in the transmission path; it is therefore not stored in the feature parameter storage section 9 and is not used as a feature parameter. The coefficients $b$ and $c$ represent the slope and curvature of the time waveform, respectively; because the influence of slow variations in the transmission path has already been removed along with the zeroth-order coefficient, they are little affected by the transmission path.

The feature parameter storage section 9 thus holds, over the entire speech interval, the time waveforms of about 18 to 20 cepstral coefficients and polynomial expansion coefficients of predetermined orders. The entire-interval average has already been subtracted from the cepstral time waveforms, and the zeroth-order coefficient has been dropped from the polynomial expansion coefficients, so both are little affected by the transmission path. The set of these 18 to 20 coefficients at each fixed interval (for example, every 10 ms, as above) is called the feature parameters.
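The three functions of Equations (3) to (5) are mutually orthogonal over j = 1, ..., 9, so one natural way to obtain the expansion coefficients is a least-squares projection onto each function. The explicit formulas for a, b, and c are not reproduced in this text, so the projection in `project` is an assumption; the function names are hypothetical.

```python
J = range(1, 10)                                   # nine frames, 10 ms apart (90 ms)
P0 = [1.0 for j in J]                              # Equation (3)
P1 = [j - 5.0 for j in J]                          # Equation (4)
P2 = [j * j - 10.0 * j + 55.0 / 3.0 for j in J]    # Equation (5)

def project(x, basis):
    """Least-squares coefficient of x on one orthogonal basis function (assumed form)."""
    return sum(xj * pj for xj, pj in zip(x, basis)) / sum(pj * pj for pj in basis)

def expand(x):
    """Return (a, b, c) for a 9-point segment x of one cepstral track."""
    return project(x, P0), project(x, P1), project(x, P2)

def slope_curvature_track(track):
    """Slide the 9-frame window one frame at a time and keep only (b, c)."""
    return [expand(track[i:i + 9])[1:] for i in range(len(track) - 8)]

# A segment built from known coefficients is recovered by the projection.
segment = [2.0 + 3.0 * p1 + 0.5 * p2 for p1, p2 in zip(P1, P2)]
a, b, c = expand(segment)
```

Adding a constant to `segment` changes only `a`, which is why discarding the zeroth-order coefficient removes slowly varying offsets while leaving the slope `b` and curvature `c` intact.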

The switch 12 selects between the learning mode and the verification mode. For every speaker, the switch 12 is connected to terminal 12a for the first utterance, and the contents of the feature parameter storage section 9 are input to the standard pattern storage section 13 and stored as that speaker's standard pattern. For subsequent utterances whose speaker identity is to be decided, the switch 12 is first connected to terminal 12c, and the contents of the feature parameter storage section 9 are input to the nonlinear time normalization circuit 15. At the same time, the standard pattern corresponding to the speaker number entered at the identification number input terminal 2 is read from the standard pattern storage section 13 and input to the nonlinear time normalization circuit 15, which computes the degree of similarity between the feature parameters of the standard pattern and those of the input speech. Speaking rate varies, both locally and overall, every time the same speaker repeats the same words. To compare the two patterns, therefore, one time axis must be stretched and compressed nonlinearly to fit the other so that common sounds (phonemes) correspond, and the feature parameters at corresponding instants must then be compared.

It is known that optimization by dynamic programming can be used to take one pattern as the reference and nonlinearly stretch and compress the time axis of the other so that the two match best, that is, so that their similarity is greatest (Sakoe and Chiba, "Continuous word recognition based on time normalization of speech using dynamic programming," Journal of the Acoustical Society of Japan, 27, 9, p. 438, 1971).
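Time alignment by dynamic programming can be sketched as below. This is a basic form under stated assumptions: it allows horizontal, vertical, and diagonal steps with no slope constraints, uses a plain squared Euclidean frame distance, and normalizes the total cost by the path length. None of these choices is taken from the patent or from the cited paper, and the names `dtw` and `sq_dist` are hypothetical.

```python
def sq_dist(r, z):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ri - zi) ** 2 for ri, zi in zip(r, z))

def dtw(reference, test, dist=sq_dist):
    """Align test to reference; return (average distance along the best path, path)."""
    K, L = len(reference), len(test)
    cost = [[float("inf")] * L for _ in range(K)]
    back = [[None] * L for _ in range(K)]
    for k in range(K):
        for l in range(L):
            d = dist(reference[k], test[l])
            steps = []
            if k > 0:
                steps.append((cost[k - 1][l], (k - 1, l)))
            if l > 0:
                steps.append((cost[k][l - 1], (k, l - 1)))
            if k > 0 and l > 0:
                steps.append((cost[k - 1][l - 1], (k - 1, l - 1)))
            if steps:
                best, back[k][l] = min(steps)     # cheapest predecessor
                cost[k][l] = best + d
            else:
                cost[k][l] = d                    # starting point (0, 0)
    path, node = [], (K - 1, L - 1)
    while node is not None:                       # trace the alignment backwards
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[K - 1][L - 1] / len(path), path

reference = [[0.0], [1.0], [2.0]]
slow_copy = [[0.0], [0.0], [1.0], [2.0], [2.0]]   # same pattern, spoken more slowly
distance, path = dtw(reference, slow_copy)
```

The returned `path` lists which instant of the input corresponds to each instant of the reference, so a slower rendition of the same pattern still aligns with zero distance.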

In the method of the present invention as well, the nonlinear time normalization circuit 15 performs a dynamic programming computation. Let the feature parameters of the standard pattern at time $k$ be $r_{ki}$ ($1 \le i \le N$) and those of the input speech at time $l$ be $z_{li}$ ($1 \le i \le N$). The distance between the two (a value that is smaller the greater the similarity) is then given by either of two alternative expressions in $r_{ki}$, $z_{li}$, and weights $w_i$. Here $N$ is the number of dimensions of the feature parameters, cepstral coefficients and polynomial expansion coefficients combined; as noted above, a value of about 18 to 20 is used. That is, both $z_{li}$ and $r_{ki}$ have cepstral coefficients and polynomial expansion coefficients as their elements. $w_i$ is a weight assigned in advance to each feature parameter; it is determined from the degree of variability of that parameter, measured on speech uttered several times by each of many speakers, and is held in the weight register 16. The dynamic programming computation aligns the time axes so that the standard pattern and the input speech agree as closely as possible, and the distance between the feature parameters of the standard pattern and the input speech at corresponding instants is averaged over the entire speech interval. This value is called the overall distance between the input speech and the standard pattern.
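The two distance expressions themselves are not reproduced in this text. The sketch below therefore assumes two common candidates, a weighted squared difference and a weighted absolute difference, and shows how the overall distance would be averaged along a given alignment. The function names and the example weights are hypothetical.

```python
def weighted_sq(r, z, w):
    """Assumed form 1: sum_i w_i * (r_i - z_i)^2."""
    return sum(wi * (ri - zi) ** 2 for ri, zi, wi in zip(r, z, w))

def weighted_abs(r, z, w):
    """Assumed form 2: sum_i w_i * |r_i - z_i|."""
    return sum(wi * abs(ri - zi) for ri, zi, wi in zip(r, z, w))

def overall_distance(reference, test, path, w, frame_distance=weighted_sq):
    """Average the frame distance over the aligned pairs (k, l) in path."""
    total = sum(frame_distance(reference[k], test[l], w) for k, l in path)
    return total / len(path)

reference = [[0.0, 0.0], [1.0, 1.0]]
test = [[0.0, 1.0], [1.0, 3.0]]
path = [(0, 0), (1, 1)]          # alignment, e.g. from a dynamic programming step
w = [1.0, 0.5]                   # second parameter assumed more variable, so weighted less

d_sq = overall_distance(reference, test, path, w)                   # 1.25
d_abs = overall_distance(reference, test, path, w, weighted_abs)    # 0.75
```

Giving smaller weights to parameters that vary more between repetitions by the same speaker keeps unreliable parameters from dominating the overall distance.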

Next, this overall distance and a threshold held in advance in the standard pattern storage section 13 are input to the comparison circuit 17, where a logic circuit determines which is larger. For each registered speaker, the standard pattern storage section 13 holds a threshold determined in advance from, for example, the history of distances between that speaker's standard pattern and that speaker's own input speech, and the distribution of distances between that speaker's standard pattern and the input speech of other speakers. The identification number entered at the identification number input terminal 2 is used to read out the corresponding speaker's threshold, which is input to the comparison circuit 17. If the overall distance between the input speech and the standard pattern is greater than the threshold, a signal indicating that the input speech is not that speaker's is output at the output terminal 18; if the overall distance is smaller than the threshold, a signal indicating that the input speech is that speaker's is output at the output terminal 18.

When the input speech is judged to be that speaker's, the switch 12 is connected to terminal 12b, and the feature parameters held in the feature parameter storage section 9 are input to the standard pattern averaging circuit 14. At the same time, the feature parameters of that speaker's standard pattern are read from the standard pattern storage section 13 and input to the standard pattern averaging circuit 14, together with the time-axis correspondence between the standard pattern and the input speech computed by the nonlinear time normalization circuit 15, that is, a sequence of numbers indicating which instant on one time axis corresponds to each instant on the other. From these inputs, the standard pattern averaging circuit 14 computes, for each feature parameter and at each instant of the standard pattern, a weighted average of the standard pattern and the input speech. The weight is chosen according to the number of that speaker's input utterances that have been used so far to build the standard pattern. The weighted averages of the feature parameters are transferred to the standard pattern storage section 13 and stored as the new standard pattern.

In addition, the overall distance computed by the nonlinear time normalization circuit 15 and the threshold used for the decision are input to the threshold arithmetic logic circuit 19, and the threshold is updated. An empirically determined value is stored in the standard pattern storage section 13 and used as the initial threshold. Thereafter, the threshold arithmetic logic circuit 19 keeps roughly the last two overall distances for each speaker, selects the maximum of these and the newly computed overall distance, adds a constant to that maximum, and takes the average of the result and the current threshold. This value is transferred to the standard pattern storage section 13 and stored as the new threshold.
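The decision and the two update steps can be sketched as follows. The threshold rule follows the description (maximum of the recent distances and the new one, plus a constant, averaged with the current threshold). The weighting in `update_reference`, where the existing pattern counts `n_used` times and the new utterance once, is an assumption, since the text only says the weight depends on the number of utterances used so far. All names and numeric values are hypothetical.

```python
def is_accepted(overall_distance, threshold):
    """Accept the claimed identity when the overall distance is below the threshold."""
    return overall_distance < threshold

def update_reference(reference, test, path, n_used):
    """Average the aligned input into the standard pattern, frame by frame."""
    new_reference = []
    for k, ref_frame in enumerate(reference):
        matched = [test[l] for kk, l in path if kk == k]      # input frames aligned to k
        aligned = [sum(column) / len(matched) for column in zip(*matched)]
        new_reference.append([(n_used * r + x) / (n_used + 1)
                              for r, x in zip(ref_frame, aligned)])
    return new_reference

def update_threshold(threshold, recent, new_distance, margin, keep=2):
    """Return (new threshold, distances to remember for the next update)."""
    candidates = recent[-keep:] + [new_distance]
    new_threshold = 0.5 * (threshold + max(candidates) + margin)
    return new_threshold, candidates[-keep:]

reference = [[0.0], [2.0]]
test = [[1.0], [1.0], [4.0]]
path = [(0, 0), (0, 1), (1, 2)]            # alignment from the time normalization step
new_reference = update_reference(reference, test, path, n_used=1)

new_threshold, history = update_threshold(1.0, [0.2, 0.4], 0.3, margin=0.2)
```

Here the maximum recent distance is 0.4, so the candidate threshold is 0.6 and the updated threshold is the midpoint between it and the current value of 1.0. The path is expected to visit every instant of the standard pattern, as an alignment produced by dynamic programming does.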

With this configuration, a speaker verification system can be realized that achieves high accuracy not only for speech picked up by a high-quality microphone but also for speech that has passed through a telephone system and speech affected by noise or transmission distortion. Experiments to date have shown that, for speech passed through a telephone set with an actual carbon transmitter and through a switching exchange, the method of the present invention makes the verification decision with an accuracy of 99% or better.

As described above, the present invention extracts and uses voice features that are little affected by transmission distortion in speech that has passed through a telephone system or the like, and can therefore decide with high accuracy whether the speaker is the claimed person. It can thus be applied widely to services, such as banking services, that use the voice on the telephone as a key for identifying the person.

[Brief description of the drawings]

FIG. 1 is a block diagram of an electrical model of the speech wave, FIG. 2 is a block diagram of a speech analysis model, and FIG. 3 is a block diagram of a speaker verification method according to an embodiment of the present invention.

1: speech input terminal, 2: identification number input terminal, 3: speech interval detection circuit, 4: linear prediction analysis circuit, 5: cepstrum conversion circuit, 6: cepstrum storage section, 7: cepstrum averaging circuit, 8: subtraction circuit, 9: feature parameter storage section, 10: cepstrum register, 11: polynomial expansion circuit, 12: switch, 13: standard pattern storage section, 14: standard pattern averaging circuit, 15: nonlinear time normalization circuit, 16: weight register, 17: comparison circuit, 18: output terminal, 19: threshold arithmetic logic circuit.

Claims

[Claims]

1. A speaker verification method characterized by comprising: means for computing the time waveforms of linear prediction coefficients of an input speech wave to be verified, converting the linear prediction coefficients into cepstral coefficients, and storing them; means for computing the average of the cepstral coefficients over the entire speech interval and normalizing the cepstral coefficients by subtracting the average from the time waveforms of the cepstral coefficients; means for computing polynomial expansion coefficients from the time waveforms of the cepstral coefficients; means for storing a standard pattern for each registered speaker; nonlinear time normalization means; and comparison means; wherein the time waveforms of the normalized cepstral coefficients, the polynomial expansion coefficients, and the standard pattern retrieved according to the number of the speaker to be verified are input to the nonlinear time normalization means to compute the degree of similarity between them, and the computed value and a value retrieved according to the speaker number are input to the comparison means and compared in magnitude to determine whether the input speech wave is from the speaker corresponding to the number.