JPH0430598B2

Info

Publication number: JPH0430598B2
Application number: JP58110683A
Authority: JP
Priority date: 1983-06-20
Filing date: 1983-06-20
Publication date: 1992-05-22
Also published as: JPS602998A

Description

[Detailed description of the invention]

[Technical field of the invention]

本発明は、音声認識システムにおける話者適応
のために、入力話者ごとに最適の音声辞書を作成
する音声認識装置に関する。 The present invention relates to a speech recognition device that, for speaker adaptation in a speech recognition system, creates an optimal speech dictionary for each input speaker.

[Technical background]

一般に、音声認識システムでは、たとえば数百
語以上の多数の単語の認識を、何らの事前学習も
なしで行なうことは、現状ではなお認識率の点で
問題がある。他方、多数の単語を全て事前に発声
して登録する方式は、認識精度の点ですぐれてい
るが、話者の負担が大きくなりすぎ、実用的では
ない。したがつて、事前発声データはなるべく少
量にして話者に適した辞書を作成する手法が必要
である。 In general, in speech recognition systems, recognition of a large number of words, for example, several hundred words or more, without any prior learning is still problematic in terms of recognition rate. On the other hand, a method in which a large number of words are all uttered in advance and registered is excellent in recognition accuracy, but it imposes too much burden on the speaker and is not practical. Therefore, a method is needed to create a dictionary suitable for the speaker by using as little pre-utterance data as possible.

その１つの手法として、予め用意した複数の辞
書、たとえば他の複数の話者が発声した音声にも
とづいてそれぞれ作成した辞書から話者に最適な
辞書を１つだけ選択する方式が考えられるが、話
者に適した辞書の存在を保証するためには、かな
り多数の辞書を用意しなければならないという問
題がある。 One possible method is to select the single dictionary best suited to the speaker from multiple dictionaries prepared in advance, for example dictionaries each created from speech uttered by a different speaker. However, guaranteeing that a dictionary suited to the speaker exists requires preparing a considerably large number of dictionaries.

[Object and configuration of the invention]

本発明の目的は、音声認識システムを話者ごと
に最適化して認識精度を高める話者適応方式にお
いて、事前学習に対する話者の負担を少なくして
かつ最適の音声辞書を容易に構成することができ
る手段を提供することにある。 An object of the present invention is to provide, for a speaker adaptation scheme that optimizes a speech recognition system for each speaker to raise recognition accuracy, a means that reduces the speaker's burden of prior training and still allows an optimal speech dictionary to be constructed easily.

一般に、与えられた複数の音声辞書のうち、入
力話者に最適な辞書を唯１個選択して用いたとし
ても、もとの音声辞書の数が少なければ、入力話
者によつてはあまり適した辞書が存在せず、した
がつて高認識率が得られない場合がある。このよ
うな場合、複数の辞書の情報を用いることによ
り、そのいずれの辞書を単独で使用した場合より
も認識率を高くすることが可能である。 In general, even when the single dictionary best suited to the input speaker is selected from a given set of speech dictionaries, if the set is small there may be no well-suited dictionary for some input speakers, and a high recognition rate cannot be obtained. In such cases, using the information in multiple dictionaries can give a higher recognition rate than using any one of those dictionaries alone.

本発明は、この点に着目してなされたものであ
り、少ない数のサンプル音声データを辞書として
用いた認識結果により主の音声辞書の選択を行な
い、比較的類似度の高い音声辞書を複数個に絞つ
て使用し、あるいはさらにそれから新たな辞書を
作成することにより上記目的を達成している。 The present invention was made with this point in mind. It selects main speech dictionaries according to the results of recognition that uses small sets of sample speech data as dictionaries, narrows the candidates down to several speech dictionaries of relatively high similarity and uses them, or further creates a new dictionary from them, thereby achieving the above object.

本発明の構成は、それにより (1) 異なる話者により作成された複数の音声辞書
と、該複数の音声辞書のそれぞれごとに作成さ
れた同じカテゴリ群からなるサンプル音声デー
タ群と、該サンプル音声データ群のそれぞれを
辞書として認識対象の入力話者の音声データを
認識し認識率を計算する手段と、高い認識率を
示した上位複数のサンプル音声データ群に対応
する複数の音声辞書のみを選択する手段とをそ
なえ、該選択された複数の音声辞書を上記入力
話者に対する音声辞書として使用することを特
徴とする。 The configuration of the present invention is as follows. (1) The device comprises: a plurality of speech dictionaries created by different speakers; sample speech data groups, one created for each of the speech dictionaries and all consisting of the same group of categories; means for recognizing the speech data of the input speaker to be recognized, using each sample speech data group as a dictionary, and calculating a recognition rate; and means for selecting only the speech dictionaries corresponding to the top several sample speech data groups that showed high recognition rates. The selected speech dictionaries are used as the speech dictionaries for the input speaker.

(2) 異なる話者により作成された複数の音声辞書
と、該複数の音声辞書のそれぞれごとに作成さ
れた同じカテゴリ群からなるサンプル音声デー
タ群と、該サンプル音声データ群のそれぞれを
辞書として認識対象の入力話者の音声データを
認識し認識率を計算する手段と、高い認識率を
示した上位複数のサンプル音声データ群に対応
する複数の音声辞書のみを選択する手段と、該
選択された複数の音声辞書を平均化して新しい
音声辞書を作成する手段とをそなえ、該複数の
選択された音声辞書を平均化して作成された音
声辞書を、上記入力話者に対する音声辞書とし
て使用することを特徴とするものである。 (2) The device comprises: a plurality of speech dictionaries created by different speakers; sample speech data groups, one created for each of the speech dictionaries and all consisting of the same group of categories; means for recognizing the speech data of the input speaker to be recognized, using each sample speech data group as a dictionary, and calculating a recognition rate; means for selecting only the speech dictionaries corresponding to the top several sample speech data groups that showed high recognition rates; and means for creating a new speech dictionary by averaging the selected speech dictionaries. The speech dictionary created by averaging the selected speech dictionaries is used as the speech dictionary for the input speaker.
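The selection step common to configurations (1) and (2) can be sketched as follows. This is a minimal illustration only: the source does not specify the matcher, so Euclidean nearest-template matching is an assumption, and the function names, the speaker keys, and the toy feature vectors are all hypothetical.

```python
from math import dist

def recognize(dictionary, features):
    """Return the category whose template is nearest to `features` (assumed matcher)."""
    return min(dictionary, key=lambda word: dist(dictionary[word], features))

def recognition_rate(dictionary, utterances):
    """Fraction of (label, features) utterances recognized correctly."""
    hits = sum(recognize(dictionary, feats) == label for label, feats in utterances)
    return hits / len(utterances)

def select_top_speakers(sample_dicts, utterances, k):
    """Rank speakers by recognition rate on the sample vocabulary and keep the top k."""
    ranked = sorted(sample_dicts,
                    key=lambda spk: recognition_rate(sample_dicts[spk], utterances),
                    reverse=True)
    return ranked[:k]

# Toy data (hypothetical): three registered speakers, two sample words, 2-D features.
sample_dicts = {
    "spk_a": {"yes": [0.0, 0.0], "no": [1.0, 1.0]},
    "spk_b": {"yes": [0.1, 0.0], "no": [0.9, 1.0]},
    "spk_c": {"yes": [1.0, 1.0], "no": [0.0, 0.0]},  # poorly matched speaker
}
utterances = [("yes", [0.05, 0.0]), ("no", [0.95, 1.0])]
selected = select_top_speakers(sample_dicts, utterances, k=2)
```

The returned speaker keys identify which main speech dictionaries to use directly, as in configuration (1), or to average, as in configuration (2).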

[Embodiments of the invention]

以下に、本発明の詳細を実施例にしたがつて説
明する。 The present invention is described in detail below with reference to embodiments.

一般には、他人の辞書を用いた場合は自身の辞
書を用いた場合に比べて認識率がかなり下がる
が、学習により改善を図ることができる。 In general, when using someone else's dictionary, the recognition rate is much lower than when using your own dictionary, but it can be improved through learning.

他人の辞書の一部を自身の辞書と置き換えて新
しい辞書とした場合、使用している特徴量に個人
差を表わす情報が多く含まれていれば、認識率
は、その置換量に応じて、例えば第１図の実線の
グラフ１のような変化を示す。 When part of another person's dictionary is replaced with the user's own entries to form a new dictionary, and the features in use carry a lot of information reflecting individual differences, the recognition rate varies with the amount replaced, for example as shown by solid-line graph 1 in FIG. 1.

すなわち、他人の辞書のうち少量を自身の辞書
と置き換えた場合には、認識率がかえつて低下す
る傾向を示す。この現象は、特徴量に個人情報が
多く含まれている場合には、他人の発声した同一
の単語より、自身の発声した別の単語の方が類似
性が高くなることにより起こるものである。この
場合、辞書の中で自身のデータと他人のデータとは
あらかじめ区別できるので、自身のデータに対
して非類似性に関する閾値を設定し、認識時点
でこの閾値を超える自身の辞書データを採用し
ないことにより、第１図の破線のグラフ２のよ
うに、大体置換量に比例した認識率の増加をみ
ることができる。 That is, when only a small part of another person's dictionary is replaced with the user's own entries, the recognition rate tends to fall rather than rise. This happens because, when the features carry a lot of speaker-specific information, a different word uttered by the user can be more similar to the input than the same word uttered by another person. Since the user's own data and other people's data can be told apart in the dictionary beforehand, a dissimilarity threshold is set for the user's own data, and own-dictionary entries that exceed this threshold at recognition time are not adopted. As broken-line graph 2 in FIG. 1 shows, the recognition rate then increases roughly in proportion to the amount replaced.
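The threshold rule described above can be illustrated with a short sketch. It assumes that dissimilarity is a Euclidean distance between feature vectors, which the source does not state; the function name, the entry layout, the words, and the threshold values are hypothetical.

```python
from math import dist

def recognize_mixed(entries, features, own_threshold):
    """Nearest-template match over a mixed dictionary.

    `entries` is a list of (word, template, is_own) tuples. Templates recorded
    from the user's own voice are skipped when their distance to the input
    exceeds `own_threshold`; other speakers' templates are always considered.
    """
    best_word, best_d = None, float("inf")
    for word, template, is_own in entries:
        d = dist(template, features)
        if is_own and d > own_threshold:
            continue  # own data that is too dissimilar is not adopted
        if d < best_d:
            best_word, best_d = word, d
    return best_word

entries = [
    ("north", [0.0, 0.0], True),   # user's own template
    ("south", [1.0, 0.0], False),  # another speaker's template
]
# The user says "south"; the user's own "north" is nearer than the other speaker's "south".
x = [0.4, 0.0]
without_threshold = recognize_mixed(entries, x, own_threshold=float("inf"))
with_threshold = recognize_mixed(entries, x, own_threshold=0.2)
```

In this toy case the unrestricted match picks the user's own template for the wrong word, while the thresholded match rejects it and falls back to the other speaker's template for the correct word.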

また、予め多数の話者の辞書（多数語彙と少
数語彙すなわちサンプル語彙の２組：少数語彙
は多数語彙の一部としてもよい）を用意してお
き、利用者は、上記サンプル語彙を発声して、
その語彙の範囲内で認識を行なう。その結果、
認識率が最も高かつたサンプル語彙の話者によ
る多数語彙の辞書を使用することにすれば、単
に１人の辞書を用意して全ての利用者がその辞
書を使用する場合にくらべて、平均認識率を高
くすることができる。 In addition, dictionaries from many speakers are prepared in advance (two sets per speaker: a large vocabulary and a small vocabulary, that is, a sample vocabulary; the small vocabulary may be a subset of the large one). The user utters the sample vocabulary, and recognition is performed within that vocabulary. If the large-vocabulary dictionary of the speaker whose sample vocabulary gave the highest recognition rate is then used, the average recognition rate can be made higher than when a single person's dictionary is prepared and used by all users.

で認識率の最も高い辞書を１つ使用するか
わりに、認識率の高い辞書を複数個使用する方
法をとる。 Here, instead of using only the one dictionary with the highest recognition rate, several dictionaries with high recognition rates are used.

複数辞書の使用法としては、従来、マルチテ
ンプレート方式、平均パタン方式がよく用いら
れている。 Conventionally, the multi-template method and the average pattern method are often used as methods for using multiple dictionaries.

マルチテンプレート方式は、複数の辞書を単
に平面的に配列し、ひとまとめにして１つの辞
書とするものである。１つのカテゴリに複数
（話者の人数）のデータが存在することになり、
認識時点では、それら全てのデータの中から最
もよく似たデータを探す処理が行なわれる。 The multi-template method simply places multiple dictionaries side by side and treats them together as one dictionary. Each category then holds multiple entries (as many as there are speakers), and at recognition time all of these entries are searched for the most similar one.
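A minimal sketch of multi-template matching follows, assuming fixed-length feature vectors and Euclidean distance as the similarity measure (neither is specified in the source). The function names and toy data are hypothetical.

```python
from math import dist

def build_multi_template(dictionaries):
    """Pool several speakers' dictionaries: each word keeps one template per speaker."""
    pooled = {}
    for dictionary in dictionaries:
        for word, template in dictionary.items():
            pooled.setdefault(word, []).append(template)
    return pooled

def recognize_multi(pooled, features):
    """Return the word that owns the single nearest template among all pooled ones."""
    return min(pooled, key=lambda w: min(dist(t, features) for t in pooled[w]))

# Two speakers' dictionaries over the same two words (toy values).
dicts = [
    {"one": [0.0, 0.0], "two": [1.0, 1.0]},
    {"one": [0.2, 0.1], "two": [0.8, 1.2]},
]
pooled = build_multi_template(dicts)
result = recognize_multi(pooled, [0.7, 1.1])
```

The search cost grows with the number of pooled speakers, since every template of every category is compared with the input.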

次に平均パタン方式は、同一のカテゴリ内の
複数のデータにおいて、対応する特徴ごとに特
徴値を平均し、新しい１つのデータとするもの
である。音声の場合には、時間長の変動がある
為、一般には時間方向での対応付けを行なつた
後、平均する。 Next, the average pattern method takes the multiple entries within one category, averages the feature values feature by feature, and produces a single new entry. Because speech varies in duration, the entries are generally aligned along the time axis before they are averaged.

本実施例では、時間方向は単語長を16等分す
るという形で時間長の正規化を行なつているの
で、平均操作は簡単に行なうことができる。 In this embodiment, duration is normalized by dividing each word's length into 16 equal segments along the time axis, so the averaging operation is straightforward.
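The following sketch shows one way the 16-segment normalization and the averaging could fit together. The source states only that the word length is divided into 16 equal parts; representing each segment by the mean of its frames is an assumption, and the function names and toy sequences are hypothetical.

```python
def normalize_length(frames, segments=16):
    """Reduce a variable-length frame sequence to `segments` mean vectors."""
    total = len(frames)
    out = []
    for i in range(segments):
        start = i * total // segments
        # Guarantee at least one frame per segment, even for very short words.
        end = max(start + 1, (i + 1) * total // segments)
        chunk = frames[start:end]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out

def average_pattern(patterns):
    """Element-wise mean of equally shaped normalized patterns."""
    count = len(patterns)
    return [[sum(vals) / count for vals in zip(*segs)] for segs in zip(*patterns)]

# Two utterances of the same word with different durations (1-D features).
short = [[float(t)] for t in range(32)]       # 32 frames
long_ = [[float(t) / 2] for t in range(64)]   # 64 frames
a, b = normalize_length(short), normalize_length(long_)
mean = average_pattern([a, b])
```

Once every utterance has the same 16-segment shape, averaging needs no further time alignment.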

第２図は、マルチテンプレート方式と平均パ
タン方式の効果を比較したものである。 FIG. 2 compares the effects of the multi-template method and the average pattern method.

第２図は、40人のテスト対象話者のそれぞれ
について40個の辞書（語数200語）から類似度
の高い上位１，３，５，10個の辞書を選択した
場合を横軸にとり、縦軸には話者40人の平均認
識率を示したものである。グラフ３がマルチテ
ンプレートの場合、グラフ４が平均パタン辞書
の場合を示す。図から明らかなように、平均パ
タン方式がマルチテンプレート方式よりも優れ
ていることがわかる。これは、平均パタン方式
の場合、個々の辞書に含まれる個人情報部分
が、平均化により希釈され、その反対に有効な
特徴情報部分は強調されることによるものであ
る。他方、マルチテンプレート方式の場合に
は、このような効果を生じさせることができな
い。 In FIG. 2, for each of 40 test speakers, the top 1, 3, 5, or 10 dictionaries by similarity were selected from 40 dictionaries (200 words). The horizontal axis gives the number of dictionaries selected and the vertical axis gives the recognition rate averaged over the 40 speakers. Graph 3 shows the multi-template case and graph 4 shows the average pattern dictionary case. As the figure makes clear, the average pattern method outperforms the multi-template method. This is because, with the average pattern method, the speaker-specific information contained in the individual dictionaries is diluted by averaging, while the useful feature information is emphasized. The multi-template method cannot produce this effect.

第３図は、平均パタン方式の効果をさらに明
確にするための典型例のデータを示す。図は、
100個の辞書（200語）から類似度の上位20個の
辞書を選択したものを類似度順に横軸に配列
し、これに対して５人の入力話者Ａ，Ｂ，Ｃ，
Ｄ，Ｅの認識率を縦軸にとつたものである。各
入力話者について、下方向に伸びる実線グラフ
が、上位20個の辞書のそれぞれを単一辞書とし
て扱つたときの認識率を表わし、また上方向に
伸びる点線グラフが上位３個、５個、10個の辞
書を平均したときの認識率を表わす。 FIG. 3 shows data from typical examples to make the effect of the average pattern method clearer. From 100 dictionaries (200 words), the 20 dictionaries with the highest similarity were selected and arranged along the horizontal axis in order of similarity; the vertical axis gives the recognition rate for five input speakers A, B, C, D, and E. For each input speaker, the solid-line graph extending downward shows the recognition rate when each of the top 20 dictionaries is used as a single dictionary, and the dotted-line graph extending upward shows the recognition rate when the top 3, 5, and 10 dictionaries are averaged.

事前発声なしで認識を行なう為には、不特定
話者用辞書を用意する必要がある。特定の１人
の辞書を不特定話者用辞書として用いることは
前述したように高い認識率を得ることができな
い。 In order to perform recognition without prior utterance, it is necessary to prepare a speaker-independent dictionary. As described above, it is not possible to obtain a high recognition rate if a dictionary of one specific person is used as a dictionary for unspecified speakers.

ここでは与えられた複数の辞書を平均して不
特定話者用とする場合を考える。 Here, we will consider the case where a plurality of given dictionaries are averaged for use by unspecified speakers.

あらかじめ20人分の辞書（1000語彙）が登録
されているものとする。 It is assumed that dictionaries for 20 people (1000 vocabulary words) are registered in advance.

また、上記とは別の入力話者20人について、
1000語彙を対象に次の５種類の辞書で認識を行
なつた場合の認識率データを第４図に示す。 FIG. 4 shows recognition rate data for 20 input speakers, different from the speakers above, when recognition of the 1000-word vocabulary was performed with the following five types of dictionary.

① 用意されている20個の辞書を平均した不特定話者用辞書
② 50語のサンプル辞書で認識を行ない、認識率の高い10個のサンプル辞書に対応する主の辞書（1000語彙）を平均した平均辞書
③ 100語のサンプル辞書で認識を行ない、認識率の高い10個のサンプル辞書に対応する主の辞書（1000語彙）を平均した平均辞書
④ 200語のサンプル辞書で認識を行ない、認識率の高い10個のサンプル辞書に対応する主の辞書（1000語彙）を平均した平均辞書
⑤ 入力話者自身の発声で登録された1000語彙の個人辞書

なお第４図のグラフは、入力話者20人分の平均認識率を表わしている。

① A speaker-independent dictionary obtained by averaging the 20 prepared dictionaries.
② An average dictionary obtained by performing recognition with 50-word sample dictionaries and averaging the main dictionaries (1000-word vocabulary) that correspond to the 10 sample dictionaries with the highest recognition rates.
③ An average dictionary obtained in the same way with 100-word sample dictionaries.
④ An average dictionary obtained in the same way with 200-word sample dictionaries.
⑤ A personal dictionary of the 1000-word vocabulary registered from the input speaker's own utterances.

The graph in FIG. 4 shows the recognition rate averaged over the 20 input speakers.

，，のように少数語彙のサンプル辞書で
学習し、複数の主辞書の選択を行なつてそれから
平均辞書を作成する方法により、およびを結
ぶ破線５が示す事前発声語数に比例して向上する
個人辞書の認識率をさらに上回るところの、特性
６で示す効果を上げることができる。 As in the 50-, 100-, and 200-word cases above, training with a small-vocabulary sample dictionary, selecting several main dictionaries, and creating an average dictionary from them yields the effect shown by characteristic 6. This exceeds the recognition rate of a personal dictionary, which improves in proportion to the number of pre-uttered words as shown by broken line 5 connecting the speaker-independent dictionary case and the personal dictionary case.

第５図は上述した関係を総括したグラフであ
る。 FIG. 5 is a graph summarizing the above-mentioned relationships.

次に、本発明による音声認識システムの１実施
例の構成を、上述したにもとづく音声辞書構成
方式の場合を例にして説明する。 Next, the configuration of one embodiment of the speech recognition system according to the present invention is described, taking as an example the speech dictionary construction scheme described above.

第６図はその構成図であり、７は入力部、８は
認識部、９はサンプル辞書群フアイル、１０は認
識結果保持部、１１は選択部、１２は主辞書群フ
アイル、１３は平均辞書作成部、１４は平均辞書
格納部、１５はモード切替スイツチ、１６は出力
部を表わす。 FIG. 6 is a block diagram of this embodiment, in which 7 is an input section, 8 is a recognition section, 9 is a sample dictionary group file, 10 is a recognition result holding section, 11 is a selection section, 12 is a main dictionary group file, 13 is an average dictionary creation section, 14 is an average dictionary storage section, 15 is a mode changeover switch, and 16 is an output section.

本実施例システムは、辞書構成モードと、認識
処理モードとの２つのモードで動作する。 The system of this embodiment operates in two modes: a dictionary configuration mode and a recognition processing mode.

まず、モード切替スイツチ１５をサンプル辞書
群フアイル９側に設定し、辞書構成モードにす
る。ここで利用者は、学習用の少数の単語（100
語）を発声する。この学習用発声にもとづく音声
データは、入力部７から入力され、認識部８で認
識される。このとき使用される辞書は、サンプル
辞書群フアイル９中のものである。 First, the mode changeover switch 15 is set to the sample dictionary group file 9 side to select the dictionary configuration mode. The user then utters a small number of training words (100 words). The speech data from these training utterances is input from the input section 7 and recognized by the recognition section 8. The dictionaries used at this time are those in the sample dictionary group file 9.

サンプル辞書群フアイル９には、複数（20人）
の話者によつて発声されたサンプル辞書（20個）
があり、かつ全てのサンプル辞書は同一のカテゴ
リ群（100語彙）からなり、このカテゴリ群には
上記学習用の少数単語が全て含まれている。この
カテゴリ群は、後述する主辞書群フアイル１２と
同一あるいはその一部であつてもよいし、無関係
であつてもよい。 The sample dictionary group file 9 holds sample dictionaries (20 of them) uttered by multiple speakers (20 people). All sample dictionaries consist of the same group of categories (100-word vocabulary), and this group of categories includes all of the training words mentioned above. The group of categories may be identical to that of the main dictionary group file 12 described later, may be a part of it, or may be unrelated to it.

認識結果保持部１０は、上記サンプル辞書群フ
アイル９中の各サンプル辞書ごとに認識結果を保
持する。 The recognition result holding section 10 holds the recognition result for each sample dictionary in the sample dictionary group file 9.

選択部１１は、サンプル辞書群フアイル９中の
サンプル辞書のうち、認識率が高かつたサンプル
辞書を選択する。 The selection section 11 selects, from the sample dictionaries in the sample dictionary group file 9, those that gave high recognition rates.

主辞書群フアイル１２には、サンプル辞書群フ
アイル９のサンプル辞書データを発声した複数の
話者による認識対象単語群（1000語彙）を発声登
録した辞書が格納されている。 The main dictionary group file 12 stores dictionaries in which the recognition target words (1000-word vocabulary) were registered by utterance by the same speakers who uttered the sample dictionary data in the sample dictionary group file 9.

平均辞書作成部１３は、選択部１１で選択され
た複数のサンプル辞書と同一の発声者による主辞
書を主辞書群フアイル１２からとり出し、それら
の辞書を平均した１つの辞書を作成する。作成さ
れた平均辞書は、平均辞書格納部１４に格納され
る。 The average dictionary creation section 13 retrieves from the main dictionary group file 12 the main dictionaries spoken by the same speakers as the sample dictionaries selected by the selection section 11, and creates a single dictionary by averaging them. The resulting average dictionary is stored in the average dictionary storage section 14.

ここでモード切替スイツチ１５を、平均辞書格
納部１４側に設定変更し、認識処理モードにす
る。利用時点では、入力部７から入力された音声
が、認識部８において、平均辞書格納部１４に格
納されている辞書を用いて認識され、その結果が
出力部１６から出力される。 Here, the setting of the mode changeover switch 15 is changed to the average dictionary storage section 14 side, and the recognition processing mode is set. At the time of use, the speech input from the input section 7 is recognized by the recognition section 8 using the dictionary stored in the average dictionary storage section 14, and the result is output from the output section 16.
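The two operating modes of FIG. 6 can be sketched end to end as follows. This is an illustration under stated assumptions, not the circuit of the embodiment: feature vectors are fixed-length lists, matching is Euclidean nearest-template, and the class name, method names, speaker keys, and toy data are hypothetical. The comments map each step to the reference numerals in the text.

```python
from math import dist

def nearest(dictionary, features):
    """Assumed matcher: word whose template is closest to the input."""
    return min(dictionary, key=lambda w: dist(dictionary[w], features))

class AdaptiveRecognizer:
    def __init__(self, sample_dicts, main_dicts):
        self.sample_dicts = sample_dicts  # sample dictionary group file 9
        self.main_dicts = main_dicts      # main dictionary group file 12
        self.average_dict = None          # average dictionary storage section 14

    def build_dictionary(self, utterances, k):
        """Dictionary configuration mode: score, select, then average."""
        def rate(spk):
            d = self.sample_dicts[spk]
            return sum(nearest(d, f) == w for w, f in utterances) / len(utterances)
        rates = {spk: rate(spk) for spk in self.sample_dicts}    # holding section 10
        chosen = sorted(rates, key=rates.get, reverse=True)[:k]  # selection section 11
        words = self.main_dicts[chosen[0]].keys()
        self.average_dict = {                                    # creation section 13
            w: [sum(v) / len(chosen)
                for v in zip(*(self.main_dicts[s][w] for s in chosen))]
            for w in words
        }
        return chosen

    def recognize(self, features):
        """Recognition processing mode: match against the averaged dictionary."""
        if self.average_dict is None:
            raise RuntimeError("run build_dictionary first")
        return nearest(self.average_dict, features)

sample = {
    "s1": {"a": [0.0, 0.0], "b": [2.0, 0.0]},
    "s2": {"a": [0.5, 0.0], "b": [2.5, 0.0]},
    "s3": {"a": [2.0, 0.0], "b": [0.0, 0.0]},
}
main = {
    "s1": {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [4.0, 0.0]},
    "s2": {"a": [0.5, 0.0], "b": [2.5, 0.0], "c": [4.5, 0.0]},
    "s3": {"a": [2.0, 0.0], "b": [0.0, 0.0], "c": [9.0, 9.0]},
}
rec = AdaptiveRecognizer(sample, main)
chosen = rec.build_dictionary([("a", [0.25, 0.0]), ("b", [2.25, 0.0])], k=2)
word = rec.recognize([4.0, 0.0])
```

Note that the word "c" is never uttered during training; it is recognized through the averaged main dictionaries of the selected speakers, which is the point of selecting on a small sample vocabulary.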

なお、平均パタン方式の代りにマルチテンプレ
ート方式をとる場合には、主辞書群から選択した
複数の辞書をそのまま認識処理用辞書とすればよ
い。 Note that when a multi-template method is used instead of the average pattern method, a plurality of dictionaries selected from the main dictionary group may be directly used as dictionaries for recognition processing.

[Effect of the invention]

以上述べたように、本発明によればあらかじめ
用意された複数の辞書から、話者に応じてそれら
のいづれよりも認識率の高い辞書を作成すること
ができるので、対象語彙を全て発声することな
く、迅速かつ容易に高精度の辞書を作成すること
ができる。 As described above, according to the present invention, a dictionary with a higher recognition rate than any of the dictionaries prepared in advance can be created from those dictionaries to suit the speaker, so a highly accurate dictionary can be created quickly and easily without the speaker uttering the entire target vocabulary.

[Brief explanation of drawings]

第１図は他人の辞書に対する学習効果の説明
図、第２図はマルチテンプレート辞書と平均パタ
ン辞書についての選択辞書数の効果の説明図、第
３図は選択対象辞書の類似度順位の効果の説明
図、第４図は不特定話者用辞書についての学習効
果の説明図、第５図は第１図から第４図までを総
括した説明図、第６図は本発明の１実施例システ
ムの構成図である。図中、７は入力部、８は認識部、９はサンプル
辞書群フアイル、１０は認識結果保持部、１１は
選択部、１２は主辞書群フアイル、１３は平均辞
書作成部、１４は平均辞書格納部、１５はモード
切替スイツチ、１６は出力部を表わす。 FIG. 1 illustrates the effect of training on another person's dictionary; FIG. 2 illustrates the effect of the number of selected dictionaries for multi-template and average pattern dictionaries; FIG. 3 illustrates the effect of the similarity rank of the selected dictionaries; FIG. 4 illustrates the effect of training for a speaker-independent dictionary; FIG. 5 summarizes FIGS. 1 to 4; and FIG. 6 is a block diagram of a system according to one embodiment of the present invention. In the figures, 7 is an input section, 8 is a recognition section, 9 is a sample dictionary group file, 10 is a recognition result holding section, 11 is a selection section, 12 is a main dictionary group file, 13 is an average dictionary creation section, 14 is an average dictionary storage section, 15 is a mode changeover switch, and 16 is an output section.

Claims

[Claims]

1. A speech recognition device comprising: a plurality of speech dictionaries created by different speakers; sample speech data groups, one created for each of the plurality of speech dictionaries and each consisting of the same group of categories; means for recognizing speech data of an input speaker to be recognized, using each of the sample speech data groups as a dictionary, and calculating a recognition rate; and means for selecting only the plurality of speech dictionaries corresponding to the top plurality of sample speech data groups that showed high recognition rates; characterized in that the selected plurality of speech dictionaries are used as speech dictionaries for the input speaker.

2. A speech recognition device comprising: a plurality of speech dictionaries created by different speakers; sample speech data groups, one created for each of the plurality of speech dictionaries and each consisting of the same group of categories; means for recognizing speech data of an input speaker to be recognized, using each of the sample speech data groups as a dictionary, and calculating a recognition rate; means for selecting only the plurality of speech dictionaries corresponding to the top plurality of sample speech data groups that showed high recognition rates; and means for creating a new speech dictionary by averaging the selected plurality of speech dictionaries; characterized in that the speech dictionary created by averaging the selected plurality of speech dictionaries is used as the speech dictionary for the input speaker.