JPH0314360B2

Info

Publication number: JPH0314360B2
Application number: JP58155878A
Authority: JP
Inventors: Yasushi Ishikawa
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1983-08-26
Filing date: 1983-08-26
Publication date: 1991-02-26
Also published as: JPS6048098A

Description

[Detailed Description of the Invention] This invention relates to a speaker verification device that determines whether a speaker is the registered speaker by computing the degree of similarity between the input speech and stored speech and judging on that basis.

FIG. 1 shows an example of the configuration of a conventional speaker verification device. Its operation is described below with reference to the figure. The reference pattern storage circuit 1 is assumed to hold, in the form of feature parameters, word speech uttered in advance by each registered speaker and analyzed in the same manner as in the feature parameter extraction circuit described below. The speaker enters a registration number into the registration number input circuit 2 (for example, a magnetic card reading circuit or a keyboard input circuit) to call up his or her own registered reference pattern 3, then speaks the predetermined word into the microphone 4. In the feature parameter extraction circuit 6, the speech signal 5 undergoes, for example, spectrum analysis and is converted at regular time intervals into feature parameters such as the cepstrum, yielding the input pattern 7. Because the input pattern 7 and the reference pattern 3 differ in duration, the DP matching circuit 8 matches them by nonlinear time expansion and contraction and obtains the distance 9 between the two patterns. The judgment circuit 10 compares the distance 9 with a threshold and outputs the recognition result 11: the speaker is accepted as the registered person if the distance is at or below the threshold and rejected as an impostor if it exceeds the threshold.

DP matching, however, minimizes the distance between two patterns by nonlinear expansion and contraction, so it is poorly suited as the matching method for a speaker verification device, whose purpose is to measure differences between speakers. Specifically, excessive time warping often obscures the temporal structure within a word that is characteristic of each speaker, so the distance does not adequately reflect differences between speakers. If a limit is placed on the time warping to avoid this, then when the limit is strict there is a risk that matching two utterances by the same speaker will align different phonemes with each other. In addition, basing the decision on the average distance over the whole word, as obtained by DP matching, can cause a difference in one particular phoneme to be overlooked because the other phonemes are similar. Conventional speaker verification devices had these drawbacks, which lowered the recognition rate.
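The conventional flow described above (DP matching followed by a single threshold test) can be sketched roughly as follows. This is a minimal illustration, not the circuit of FIG. 1: the Euclidean frame distance, the unconstrained three-way recursion, the division by N + M, and the names `frame_distance`, `dp_matching_distance`, and `verify_conventional` are all assumptions made for the example.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dp_matching_distance(input_pattern, reference_pattern):
    """Distance after nonlinear time alignment (basic, unconstrained DP matching)."""
    n, m = len(input_pattern), len(reference_pattern)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance aligning the first i input
    # frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_pattern[i - 1], reference_pattern[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Normalizing by n + m is an assumed convention for this sketch.
    return cost[n][m] / (n + m)


def verify_conventional(input_pattern, reference_pattern, threshold):
    """Accept (True) when the whole-word distance is at or below the threshold."""
    return dp_matching_distance(input_pattern, reference_pattern) <= threshold
```

In this sketch a pattern and a time-stretched copy of it, such as `[[0.0], [1.0], [2.0]]` and `[[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]`, have distance 0, which illustrates how nonlinear warping can hide differences in temporal structure.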

This invention was made to remedy these drawbacks. It provides a device in which the distance sufficiently reflects speaker-specific differences in temporal structure within a word, in which a difference in a particular phoneme within the word is not overlooked in the decision, and which requires little computation.

FIG. 2 shows an example of the configuration of this invention. In the figure, 12 is a segmentation circuit that cuts predetermined time sections out of the input pattern, 13 is a section input pattern, 14 is a section reference pattern, 15 is a section reference pattern storage circuit, 16 is a linear matching circuit, 17 is a section distance, 18 is a section judgment result, and 19 is a logical judgment circuit.

Next, the operation is described. Up to the point where the speaker enters a registration number, utters the word, and the speech is converted into feature parameters, operation is the same as in the conventional device. In the segmentation circuit 12, phoneme change points and the like are detected in the feature parameters of the input speech, which are then divided into the predetermined section input patterns 13. Because the word to be spoken is fixed in advance, the segmentation can be done relatively easily and with high accuracy using, for example, the amount of change of the feature parameters over time.

FIG. 3 shows an example of the processing procedure in the segmentation circuit 12; steps A to F below correspond to the labels ア, イ, ウ, エ, オ, and カ in the figure. First, in step A, the distance D(m) between adjacent frames of the input pattern 7 (m is the frame number), that is, the amount of change of the parameters over time, is computed. Next, in step B, a peak of D(m) is detected.

In step D, the peak value Dp is compared with the threshold Dt. If it is at or above the threshold, the frame number is stored in step E as a phoneme change point; if it is below the threshold, processing returns to step B. Phoneme change points are detected in this way repeatedly until the end of the word is detected in step C. In step F, the section input patterns 13 are cut out of the input pattern 7 using the phoneme change points found above as start and end points, and are output. Meanwhile, the section reference pattern storage circuit 15 holds section reference patterns 14 that were segmented in advance; in response to the signal from the registration number input circuit 2, they are sent to the linear matching circuit 16. The linear matching circuit 16 normalizes the duration of the section input pattern 13 by linear expansion and contraction to match the section reference pattern 14 and computes the section distance 17.
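The segmentation procedure of FIG. 3 can be sketched as below. This is only an illustration under stated assumptions: the Euclidean frame distance, the simple local-maximum test used as the "peak", the choice to treat the frame just after a peak as the start of the next section, and the names `adjacent_frame_distances`, `detect_change_points`, and `segment` are all hypothetical and not specified in the text.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def adjacent_frame_distances(pattern):
    """Step A: D(m), the distance between frame m-1 and frame m, for m = 1 .. len-1."""
    return [frame_distance(pattern[m - 1], pattern[m]) for m in range(1, len(pattern))]


def detect_change_points(pattern, dt):
    """Steps B to E: keep the frame numbers of peaks of D(m) that reach threshold dt."""
    d = adjacent_frame_distances(pattern)
    points = []
    for i, dp in enumerate(d):  # the loop ends at the end of the word (step C)
        left = d[i - 1] if i > 0 else -math.inf
        right = d[i + 1] if i + 1 < len(d) else -math.inf
        is_peak = dp > left and dp >= right  # assumed peak definition (step B)
        if is_peak and dp >= dt:             # step D: compare Dp with Dt
            points.append(i + 1)             # step E: store the frame number
    return points


def segment(pattern, dt):
    """Step F: cut the input pattern into sections at the stored change points."""
    bounds = [0] + detect_change_points(pattern, dt) + [len(pattern)]
    return [pattern[s:e] for s, e in zip(bounds[:-1], bounds[1:])]
```

For example, with one-dimensional frames `[[0.0], [0.0], [0.0], [5.0], [5.0], [5.0], [9.0], [9.0]]` and `dt = 1.0`, the sketch stores frames 3 and 6 as change points and returns three sections of 3, 3, and 2 frames.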

That is, let the section input pattern be $\{tt_n\}$ ($n = 0, 1, \ldots, N-1$) and the section reference pattern be $\{r_m\}$ ($m = 0, 1, \ldots, M-1$), where $n$ and $m$ are frame numbers. The distance $D$ by linear matching is then obtained as follows. First, the time-normalized pattern $\{t_m\}$ ($m = 0, 1, \ldots, M-1$) is obtained from $\{tt_n\}$:

$$t_m = a \times tt_k + (1 - a) \times tt_{k+1}, \qquad k = \left[\frac{N-1}{M-1} \times m\right], \qquad a = \frac{N-1}{M-1} \times m - k$$

Here $[i]$ is the Gauss bracket, denoting the integer part of $i$.
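The time normalization above can be sketched as follows. One point is an assumption of this sketch rather than something stated in the text: the weight 1 - a is placed on frame k and the weight a on frame k + 1, which is the usual linear-interpolation convention. With the weights exactly as printed, m = 0 would select frame 1 instead of frame 0 and m = M - 1 would refer to a frame beyond the end of the section. The function name `time_normalize` and the handling of one-frame sections are also hypothetical.

```python
import math


def time_normalize(section_input, m_frames):
    """Linearly stretch or shrink a section of N frames to m_frames frames."""
    n_frames = len(section_input)
    if n_frames == 1 or m_frames == 1:
        # Degenerate case (assumed handling): repeat the first frame.
        return [list(section_input[0]) for _ in range(m_frames)]
    normalized = []
    for m in range(m_frames):
        pos = (n_frames - 1) / (m_frames - 1) * m  # position on the input time axis
        k = min(math.floor(pos), n_frames - 2)     # integer part, kept in range
        a = pos - k                                # fractional part
        frame = [
            (1.0 - a) * x0 + a * x1                # assumed weight order, see lead-in
            for x0, x1 in zip(section_input[k], section_input[k + 1])
        ]
        normalized.append(frame)
    return normalized
```

For instance, `time_normalize([[0.0], [2.0], [4.0]], 5)` stretches three frames to five, giving the values 0, 1, 2, 3, 4, and the first and last output frames always coincide with the first and last input frames.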

The distance $D$ is

$$D = \frac{1}{M} \sum_{i=0}^{M-1} d(r_i, t_i)$$

where $d(r_i, t_i)$ is the distance between the $i$-th frame of $\{r_m\}$ and the $i$-th frame of $\{t_m\}$. The judgment circuit 10 applies a threshold judgment to each of these section distances 17. The logical judgment circuit 19 then performs a logical judgment on the section judgment results 18 to obtain the final recognition result 11. For example, let the section judgment results 18 be $A_1, A_2, \ldots, A_l, \ldots, A_L$ ($L$: number of sections), with $A_l = 1$ when the section distance is at or below the threshold and $A_l = 0$ when it is greater. If the recognition result $S$ is obtained by logical AND, then

$$S = A_1 \cdot A_2 \cdot \cdots \cdot A_l \cdot \cdots \cdot A_L$$

that is, the speaker is accepted as the registered person only when the section distance is within the threshold in every section, and is rejected as an impostor if the threshold is exceeded in even one section.
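The section distance, the per-section threshold judgment, and the logical AND can be put together in a short sketch. The assumptions are the Euclidean frame distance, standard linear-interpolation weights in the helper `stretch`, one threshold per section, and rejection when the number of detected sections differs from the number of reference sections; none of these details, nor the function names, come from the text.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def stretch(section_input, m_frames):
    """Linear time normalization of a section to m_frames frames."""
    n = len(section_input)
    out = []
    for m in range(m_frames):
        pos = (n - 1) * m / (m_frames - 1) if n > 1 and m_frames > 1 else 0.0
        k = min(int(pos), max(n - 2, 0))
        a = pos - k
        nxt = section_input[min(k + 1, n - 1)]
        out.append([(1 - a) * p + a * q for p, q in zip(section_input[k], nxt)])
    return out


def section_distance(section_input, section_reference):
    """D = (1/M) * sum of d(r_i, t_i) over the M frames of the reference section."""
    t = stretch(section_input, len(section_reference))
    total = sum(frame_distance(r_i, t_i) for r_i, t_i in zip(section_reference, t))
    return total / len(section_reference)


def verify_by_sections(section_inputs, section_references, thresholds):
    """S = A_1 * A_2 * ... * A_L: accept only if every section passes."""
    if len(section_inputs) != len(section_references):
        return False  # assumed policy: a different section count is rejected
    results = [
        section_distance(x, r) <= th  # A_l = 1 when at or below the threshold
        for x, r, th in zip(section_inputs, section_references, thresholds)
    ]
    return all(results)
```

With this structure a single section whose distance exceeds its threshold causes rejection even when every other section matches exactly, which is the behavior the logical AND is meant to provide.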

As described above, according to this invention, linear matching allows the distance to reflect the temporal structure sufficiently, and if each section is kept within about three phonemes in length, different phonemes are almost never aligned with each other for the same speaker. Furthermore, because the recognition result is obtained by a logical judgment over the threshold judgment results for the individual sections, a difference in a particular phoneme within the word is unlikely to be overlooked, and a higher recognition rate than before can be obtained. In addition, the amount of computation is considerably reduced because DP matching is not performed, and the segmentation circuit added relative to the conventional device needs only a relatively small amount of computation because the word is fixed in advance.
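As a rough illustration of the computational claim, the number of frame-distance evaluations can be compared. The counts below are assumptions for the sake of the example: they take an input of $N$ frames and a reference of $M$ frames in total, let $M_l$ be the number of reference frames in section $l$, and ignore any window or path constraint on the DP matching.

$$C_{\mathrm{DP}} \approx N \cdot M, \qquad C_{\mathrm{linear}} = \sum_{l=1}^{L} M_l = M$$

For a word with $N = M = 50$ frames, unconstrained DP matching would evaluate about $50 \times 50 = 2500$ frame distances, whereas linear matching evaluates 50, plus the $N - 1 = 49$ adjacent-frame distances needed for segmentation.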

In the above embodiment, the linear matching circuit 16 and the judgment circuit 10 are provided in parallel for each section. Needless to say, a configuration is also possible in which the section patterns are output sequentially from the segmentation circuit 12 and the section reference pattern storage circuit 15 and are processed by a single linear matching circuit and a single judgment circuit.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing a conventional speaker verification device, FIG. 2 is a block diagram showing an embodiment of the speaker verification device according to this invention, and FIG. 3 is a diagram showing an example of the segmentation procedure in the segmentation circuit of FIG. 2. In the figures, 1 is a reference pattern storage circuit, 2 is a registration number input circuit, 3 is a reference pattern, 4 is a microphone, 5 is the input speech, 6 is a feature parameter extraction circuit, 7 is an input pattern, 8 is a DP matching circuit, 9 is a distance, 10 is a judgment circuit, 11 is a recognition result, 12 is a segmentation circuit, 13 is a section input pattern, 14 is a section reference pattern, 15 is a section reference pattern storage circuit, 16 is a linear matching circuit, 17 is a section distance, 18 is a section judgment result, and 19 is a logical judgment circuit. The same or corresponding parts are denoted by the same reference numerals throughout the figures.

Claims

[Claims]

1. A speaker verification device comprising:
a feature parameter extraction circuit that spectrally analyzes the input word speech of the speaker to be verified, converts it into feature parameters at regular time intervals, and extracts an input pattern;
a segmentation circuit that detects phoneme change points within the word duration from the input pattern extracted by the feature parameter extraction circuit, divides the input pattern into a plurality of time sections, and extracts the feature parameters of each section as a section input pattern;
a section reference pattern storage circuit that stores section reference patterns, obtained by segmenting word speech uttered in advance by registered speakers, in a state in which they can be called up by specifying a registration number from a registration number input circuit;
a linear matching circuit that matches each of the plurality of section input patterns from the segmentation circuit against the corresponding one of the plurality of section reference patterns designated by the registration number input circuit and called up from the section reference pattern storage circuit, by normalizing duration through linear expansion and contraction, and obtains a distance for each of the plurality of sections;
a judgment circuit that judges pass or fail for each of the plurality of section distances obtained by the linear matching circuit against a predetermined threshold; and
a logical judgment circuit that combines the plurality of section threshold judgment results of the judgment circuit in a logical judgment to determine whether the speaker of the input word speech is the person registered in advance.