JP3166708B2 - Speech recognition device and method

Info

Publication number: JP3166708B2
Application number: JP15702398A
Authority: JP
Inventors: 健一磯
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-06-05
Filing date: 1998-06-05
Publication date: 2001-05-14
Anticipated expiration: 2018-06-05
Also published as: JPH11352983A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus and method with an improved speaker adaptation function.

[0002]

[Prior Art] As a conventional speaker adaptation method for speech recognition apparatuses, the method described in the paper "Speaker Adaptation Using Tree-Structured Speaker Clustering" (Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J78-D-II, No. 1, pp. 1-9, January 1995) is known.

[0003] In this conventional method, a large number of speakers are clustered in advance into a tree structure (called a "speaker tree"). The root node of the speaker tree corresponds to the set of all speakers (such a set of speakers is called a "speaker cluster"), and each terminal (leaf) node corresponds to an individual speaker. Each intermediate node between the root and the leaves corresponds to a set of acoustically similar speakers (a speaker cluster).

[0004] Each node is associated with a standard pattern (a hidden Markov model, abbreviated "HMM") created (trained) using the speech data of the corresponding set of speakers. When syllables such as "a", "i", ... are used as recognition units, this standard pattern is the complete set of the HMMs of all syllables.

[0005] Using a small number of utterances from a new speaker, a node in the speaker tree is selected, and subsequent speech recognition is performed using the standard pattern associated with that node.

[0006]

[Problems to be Solved by the Invention] However, the conventional speaker adaptation method described above has the following problem.

[0007] Each node of the speaker tree corresponds to the complete set of HMMs of all recognition units for the corresponding speaker cluster. Consequently, the method cannot handle a case in which, for example, one part of a new speaker's speech (e.g., the sound "a") resembles one speaker cluster while another part (e.g., the sound "ku") resembles a different speaker cluster.

[0008] Instead, the conventional method must choose the speaker cluster (node) that is, on average, most similar to the new speaker's small amount of speech taken as a whole, and must use that cluster's standard pattern (the HMMs of all recognition units) for every recognition unit.

[0009] The present invention has been made in view of the above problem. Its object is to provide a speech recognition apparatus in which, for speaker adaptation to a new speaker, the optimal speaker cluster and the HMM or probability distribution representing that cluster can be selected for each recognition unit or, at a finer granularity, for each HMM state.

[0010]

[Means for Solving the Problems] To achieve the above object, a speech recognition apparatus according to the present invention, which is based on hidden Markov models, includes: a speaker tree storage unit that stores a plurality of speaker trees whose nodes are probability distributions for speaker clusters; known node selecting means for selecting the optimal node in a speaker tree using an utterance of a new speaker; speaker frequency calculating means for counting frequency information at the terminal nodes of the subtree whose root (parent node) is the selected node; unknown node selecting means for selecting the optimal node in a speaker tree using the frequency information of the terminal nodes; and a standard pattern updating unit that adapts the standard pattern using the probability distributions of the nodes selected by the known node selecting means and the unknown node selecting means.

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described below. Referring to FIG. 1, in a preferred embodiment, the speech recognition apparatus of the present invention includes, in the part that provides the speaker adaptation function, a speaker tree storage unit (7), a known node selection unit (4), a speaker frequency calculation unit (5), an unknown node selection unit (6), and a standard pattern update unit (8).

[0012] The speaker tree storage unit (7) stores speaker trees constructed for each recognition unit or for each state of its HMM.

[0013] Using an utterance of the new speaker, the known node selection unit (4) selects, for each recognition unit (or HMM state thereof) that appears in the utterance, the optimal node from the corresponding speaker tree in the speaker tree storage unit (7).

[0014] For the speaker trees stored in the speaker tree storage unit (7), the speaker frequency calculation unit (5) counts the cumulative usage frequency of the speakers (clusters) corresponding to the terminal nodes of the subtree whose root (parent node) is the selected optimal node.

[0015] Using the per-speaker (per-cluster) usage frequency information counted by the speaker frequency calculation unit (5), the unknown node selection unit (6) selects the optimal node (speaker cluster) from the speaker tree for each recognition unit (or HMM state thereof) that did not appear in the new speaker's utterance.

[0016] The standard pattern update unit (8) uses the HMMs (or HMM states) corresponding to the nodes selected by the known node selection unit (4) and the unknown node selection unit (6) to create a standard pattern for the new speaker in the standard pattern storage unit (3).

[0017] Thus, according to the embodiment of the present invention, speaker adaptation can select the optimal speaker cluster for each recognition unit, or for each state of its HMM, and thereby prepare a more accurate standard pattern for the new speaker.

[0018]

[Examples] To describe the above embodiment in further detail, examples of the present invention are described below with reference to the drawings.

[0019] FIG. 1 shows the configuration of one embodiment of the speech recognition apparatus of the present invention. Referring to FIG. 1, the apparatus comprises an analysis unit 1, a recognition/matching unit 2, a standard pattern storage unit 3, a known node selection unit 4, a speaker frequency calculation unit 5, an unknown node selection unit 6, a speaker tree storage unit 7, and a standard pattern update unit 8.

[0020] The analysis unit 1 performs feature extraction (e.g., mel-cepstrum analysis) on the input speech waveform and outputs a time series of feature vectors.

[0021] The standard pattern storage unit 3 stores the hidden Markov models (HMMs) of all recognition units.

[0022] The recognition/matching unit 2 reads the HMMs stored in the standard pattern storage unit 3, matches them against the time series of feature vectors sent from the analysis unit 1, and outputs a recognition result. In this embodiment, techniques well known to those skilled in the art can be applied as they are to the analysis unit 1, the recognition/matching unit 2, and the standard pattern storage unit 3. Since their details are not directly related to the subject matter of the present invention, a detailed description of their configuration is omitted.

[0023] The speaker tree storage unit 7 stores a speaker tree prepared for each recognition unit (or, as described later, for each state of its HMM).

[0024] The following describes the case where a speaker tree is prepared for each recognition unit, with syllables ("a", "i", ..., "ka", ...) used as the recognition units.

[0025] Let N be the total number of syllable types. If a speaker tree is prepared for each syllable, a total of N speaker trees are stored in the speaker tree storage unit 7.

[0026] Consider the case where the tree structure shown in FIG. 2 is the speaker tree for the syllable "a". In this case, the topmost root node corresponds to the HMM of the recognition unit "a" created (trained) from the "a" speech data of all speakers (male and female combined).

[0027] At the next level (one level below the root), the tree branches into a node corresponding to an HMM created from the "a" speech data of all male speakers and a node corresponding to an HMM created from the "a" speech data of all female speakers.

[0028] Each terminal (leaf) node at the lowest level of the speaker tree corresponds to the "a" HMM of an individual speaker (speaker A, speaker B, speaker C, ...), created from that speaker's own "a" speech data.
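As an illustrative sketch only (it is not part of the patent text), the speaker tree described for FIG. 2 could be represented as follows. The class name `SpeakerNode`, its fields, the strings standing in for trained HMMs, and the assignment of speakers A to D to the male and female clusters are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeakerNode:
    """One node of a speaker tree: a speaker cluster and its HMM."""
    name: str                      # cluster or speaker label (hypothetical)
    hmm: str                       # placeholder standing in for a trained HMM
    children: List["SpeakerNode"] = field(default_factory=list)

    def leaf_speakers(self) -> List[str]:
        """Return the individual speakers (leaves) under this node."""
        if not self.children:
            return [self.name]
        return [s for c in self.children for s in c.leaf_speakers()]


def leaf(name: str) -> SpeakerNode:
    """Build a leaf node for one individual speaker."""
    return SpeakerNode(name, f"HMM('a', speaker {name})")


# Speaker tree for the syllable "a"; the grouping of speakers is assumed.
tree_a = SpeakerNode("all", "HMM('a', all speakers)", [
    SpeakerNode("male", "HMM('a', male speakers)", [leaf("A"), leaf("B")]),
    SpeakerNode("female", "HMM('a', female speakers)", [leaf("C"), leaf("D")]),
])

# One tree per syllable gives N trees in the speaker tree storage unit.
speaker_trees = {"a": tree_a}
```

Keying the storage by recognition unit mirrors the statement that N syllables yield N independent trees.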

[0029] Referring again to FIG. 1, the known node selection unit 4 uses the new speaker's adaptation utterance to select, for each recognition unit (or HMM state thereof) that appears in the utterance, the optimal node from the corresponding speaker tree in the speaker tree storage unit 7.

[0030] To be more specific, the procedure for selecting the optimal node from the speaker tree of the syllable "a" using the new speaker's speech data for the syllable "a" is described below with reference also to the flowchart of FIG. 4. FIG. 4 is a flowchart showing the main part of the processing of the speaker frequency calculation unit in one embodiment of the present invention.

[0031] Step A1: Set the variable "current node" to the topmost root node of the speaker tree for the syllable "a".

[0032] Step A2: If the "current node" is a terminal node (leaf) of the speaker tree, go to step A7.

[0033] Step A3: Using the HMM of the "current node", calculate the likelihood P0 of the new speaker's speech data for the syllable "a".

[0034] Step A4: Using the HMMs of all nodes immediately below the "current node", calculate the likelihoods Q1 to QM of the new speaker's speech data for the syllable "a", where M is the number of nodes immediately below.

[0035] Step A5: Let Q0 be the maximum of the likelihoods Q1 to QM, and store the node that gives the maximum in the variable "next node".

[0036] Step A6: If the maximum likelihood Q0 is greater than P0 + θ (where θ is a predetermined threshold), assign the "next node" to the "current node" and return to step A2.

[0037] Step A7: Select the "current node" as the optimal node.
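Steps A1 to A7 amount to a greedy descent from the root that continues only while some child fits the new speaker's data better than the current node by more than θ. The following is a minimal sketch under stated assumptions: the `Node` class, the `likelihood` callback (which stands in for scoring the speech data with a node's HMM), and the toy score values are hypothetical, and falling through to step A7 when the test in step A6 fails is inferred from the step order.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def select_known_node(root: Node,
                      likelihood: Callable[[Node], float],
                      theta: float) -> Node:
    current = root                                           # Step A1
    while current.children:                                  # Step A2: stop at a leaf
        p0 = likelihood(current)                             # Step A3
        scores = [likelihood(c) for c in current.children]   # Step A4
        q0 = max(scores)                                     # Step A5
        next_node = current.children[scores.index(q0)]
        if q0 > p0 + theta:                                  # Step A6
            current = next_node
        else:
            break                                            # no child is clearly better
    return current                                           # Step A7


# Toy scores standing in for HMM likelihoods (assumed values).
toy = {"all": -10.0, "male": -12.0, "female": -7.0, "C": -6.5, "D": -8.0}
tree = Node("all", [Node("male"), Node("female", [Node("C"), Node("D")])])
best = select_known_node(tree, lambda n: toy[n.name], theta=1.0)
print(best.name)  # female
```

With θ = 1.0 the descent stops at the "female" cluster, because speaker C's score exceeds it by only 0.5. A larger θ keeps the choice closer to the root; a smaller θ lets it descend toward individual speakers.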

[0038] The speaker frequency calculation unit 5 counts the cumulative usage frequency of the speakers (clusters) corresponding to the terminal nodes of the subtree whose root (parent node) is the selected optimal node.

[0039] As a concrete example, consider the case where the new speaker utters "ao" for speaker adaptation.

[0040] Using that utterance, the known node selection unit 4 selects the optimal node in the speaker tree of the syllable "a" and the optimal node in the speaker tree of the syllable "o".

[0041] In FIG. 3, the nodes marked with a double circle (◎) are the selected nodes.

[0042] The speaker frequency calculation unit 5 increments by 1 the cumulative usage frequency of each speaker corresponding to a terminal node of the subtree whose root (parent node) is the optimal node.

[0043] In the case of FIG. 3, the speaker tree of the syllable "a" causes the cumulative usage frequencies of speaker C and speaker D to be incremented by 1 each.

[0044] The speaker tree of the syllable "o" causes the cumulative usage frequencies of speaker C and speaker G to be incremented by 1 each.

[0045] In total, the cumulative usage frequency is 2 for speaker C, 1 for each of speakers D and G, and 0 for all other speakers.
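The tally for the "ao" example can be reproduced with a short sketch. The `Node` class and the two small subtrees below are assumptions: FIG. 3 is not reproduced in the text, so only the leaf speakers named above (C and D for "a", C and G for "o") are taken from the description.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

    def leaf_speakers(self) -> List[str]:
        """Return the speakers at the leaves of the subtree rooted here."""
        if not self.children:
            return [self.name]
        return [s for c in self.children for s in c.leaf_speakers()]


def count_speaker_usage(selected_nodes: Iterable[Node]) -> Counter:
    """Add 1 per selected node to every speaker under that node."""
    usage: Counter = Counter()
    for node in selected_nodes:
        usage.update(node.leaf_speakers())
    return usage


# Optimal nodes chosen for "a" and "o" (subtree shapes are assumed).
selected_a = Node("a-cluster", [Node("C"), Node("D")])
selected_o = Node("o-cluster", [Node("C"), Node("G")])

usage = count_speaker_usage([selected_a, selected_o])
print(dict(usage))  # {'C': 2, 'D': 1, 'G': 1}
```

A `Counter` returns 0 for speakers that were never counted, which matches the statement that all other speakers have a cumulative usage frequency of 0.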

[0046] Using the cumulative usage frequency of each speaker counted by the speaker frequency calculation unit 5, the unknown node selection unit 6 selects the optimal node (speaker cluster) from the speaker tree for each recognition unit (or HMM state thereof) that did not appear in the new speaker's utterance.

[0047] This is now described concretely for the example shown in FIG. 3. As noted above, the new speaker uttered "ao" for speaker adaptation, so there is no speech data for the syllable "i". The known node selection unit 4 therefore cannot determine the optimal node in the speaker tree of the syllable "i".

[0048] Therefore, in one embodiment of the present invention, the optimal node of the speaker tree of the syllable "i" is selected by the following procedure, described with reference also to FIG. 5, which shows the main part of the processing of the unknown node calculation unit in one embodiment of the present invention.

[0049] Step B1: Assign the cumulative usage frequency of each speaker, calculated by the speaker frequency calculation unit 5, to the terminal nodes (one per speaker) of the speaker tree of the syllable "i".

[0050] Step B2: Set the cumulative usage frequency of each node to the sum of the cumulative usage frequencies of the nodes immediately below (directly connected to) it. In this way, the cumulative usage frequencies of the upper nodes are calculated in order, starting from the terminal nodes.

[0051] Step B3: Set the variable "current node" to the topmost root node of the speaker tree.

[0052] Step B4: Let R be the maximum of the cumulative usage frequencies of all nodes immediately below the "current node", and store the node that gives the maximum in the variable "next node". If more than one node gives the maximum, go to step B7.

[0053] Step B5: If the cumulative usage frequency of the "next node" is less than or equal to a predetermined minimum usage frequency (for example, 2), go to step B7.

[0054] Step B6: Assign the "next node" to the "current node" and return to step B4.

[0055] Step B7: Select the "current node" as the optimal node.
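Steps B1 to B7 can be sketched as a descent that follows the child with the uniquely largest cumulative count and stops once that count is no longer above the minimum. This is a sketch under stated assumptions: the `Node` class, the shape of the tree for "i", and the cluster names `f1` and `f2` are hypothetical, and stopping at a leaf is an added guard because the steps do not say what happens when the current node has no children. Counts are recomputed recursively for brevity rather than cached.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def cumulative_count(node: Node, usage: Dict[str, int]) -> int:
    """Steps B1-B2: leaves take their count from `usage`; inner nodes sum children."""
    if not node.children:
        return usage.get(node.name, 0)
    return sum(cumulative_count(c, usage) for c in node.children)


def select_unknown_node(root: Node, usage: Dict[str, int],
                        min_count: int = 2) -> Node:
    current = root                                   # Step B3
    while current.children:                          # assumed guard: stop at a leaf
        counts = [cumulative_count(c, usage) for c in current.children]
        r = max(counts)                              # Step B4
        if counts.count(r) > 1:                      # tie for the maximum: go to B7
            break
        next_node = current.children[counts.index(r)]
        if r <= min_count:                           # Step B5
            break
        current = next_node                          # Step B6
    return current                                   # Step B7


usage = {"C": 2, "D": 1, "G": 1}                     # counts from the "ao" example
tree_i = Node("all", [
    Node("male", [Node("A"), Node("B")]),
    Node("female", [
        Node("f1", [Node("C"), Node("D")]),
        Node("f2", [Node("G"), Node("H")]),
    ]),
])
chosen = select_unknown_node(tree_i, usage)
print(chosen.name)  # f1
```

In this toy tree the descent passes through "female" (count 4) and "f1" (count 3) and stops there, because the best child, speaker C, has a count of 2, which does not exceed the minimum usage frequency.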

[0056] In this way, the optimal node can be selected even in the speaker trees of recognition units that are not included in the new speaker's adaptation utterance.

[0057] The standard pattern update unit 8 collects the HMMs (or HMM states) corresponding to the nodes selected by the known node selection unit 4 and the unknown node selection unit 6, and stores them in the standard pattern storage unit 3 as the standard pattern for the new speaker.
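The update step can be pictured as merging the two sets of selections into a single table keyed by recognition unit. The sketch below is illustrative only: the function name, the use of strings as stand-ins for HMMs, and the rule that a known selection takes precedence are assumptions (in the description the two sets cover disjoint recognition units, so no precedence rule is needed).

```python
from typing import Dict


def update_standard_patterns(known: Dict[str, str],
                             unknown: Dict[str, str]) -> Dict[str, str]:
    """Merge per-unit HMM choices into one adapted standard pattern."""
    patterns = dict(unknown)   # units absent from the adaptation utterance
    patterns.update(known)     # units observed in the utterance (assumed to win)
    return patterns


# Hypothetical selections for the "ao" example.
known = {"a": "HMM('a', cluster {C, D})", "o": "HMM('o', cluster {C, G})"}
unknown = {"i": "HMM('i', cluster {C, D})"}

adapted = update_standard_patterns(known, unknown)
print(sorted(adapted))  # ['a', 'i', 'o']
```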

[0058] In this way, the optimal node (speaker cluster) and the HMM representing that node (cluster) can be selected for each recognition unit. In particular, the optimal node can be selected not only for recognition units included in the new speaker's adaptation utterance but also for those not included in it.

[0059] When a separate, independent speaker tree is to be prepared for each state of the HMM of each recognition unit, it suffices to prepare a speaker tree for each HMM state instead of for each recognition unit. The optimal node can then be selected by exactly the same procedure as described in the above embodiment.

[0060]

[Effects of the Invention] As described above, according to the present invention, the optimal speaker cluster and the HMM or probability distribution representing that cluster can be selected for each recognition unit or, at a finer granularity, for each HMM state.

[0061] This is because, in the present invention, a separate, independent speaker tree is prepared for each recognition unit or, at a finer granularity, for each HMM state; the optimal node is selected in every speaker tree using a small amount of adaptation speech from the new speaker; and the corresponding HMMs are used as the standard pattern adapted to the new speaker.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

FIG. 2 is a diagram for explaining one embodiment of the present invention, illustrating a speaker tree.

FIG. 3 is a diagram for explaining the operation of the speaker frequency calculation unit and the unknown node calculation unit in one embodiment of the present invention.

FIG. 4 is a flowchart showing the processing flow of the speaker frequency calculation unit in one embodiment of the present invention.

FIG. 5 is a flowchart showing the processing flow of the unknown node calculation unit in one embodiment of the present invention.

[Explanation of symbols]

1 Analysis unit
2 Recognition/matching unit
3 Standard pattern storage unit
4 Known node selection unit
5 Speaker frequency calculation unit
6 Unknown node selection unit
7 Speaker tree storage unit
8 Standard pattern update unit

Continuation of the front page

(56) References

Abe (阿部俊朗) et al., "Speaker Adaptation Based on Speaker Classes for Each Phoneme," Proceedings of the Acoustical Society of Japan 1997 Autumn Meeting, Vol. I, 1-1-14, pp. 27-28 (published September 17, 1997)

Motoyuki Suzuki (鈴木基之) et al., "Speaker Adaptation by Tree-Structured Speaker Clustering for Each Phoneme," Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J82-D-II, No. 6, June 1999, pp. 981-989 (published June 25, 1999)

Tetsuo Kosaka (小坂哲夫) et al., "Speaker Adaptation Using Tree-Structured Speaker Clustering," Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J78-D-II, No. 1, January 1995, pp. 1-9 (published January 25, 1995)

J. Ishii et al., "Speaker Adaptation Using Tree Structured Shared-State HMMs," Proceedings of Fourth International Conference on Spoken Language Processing, ICSLP 96, Vol. 2, pp. 1149-1152, October 3-6, 1996, Philadelphia, U.S.A.

T. Watanabe et al., "High Speed Speech Recognition Using Tree-Structured Probability Density Function," Proceedings of 1995 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. 556-559

T. Kosaka et al., "Tree-Structured Speaker Clustering for Fast Speaker Adaptation," Proceedings of 1994 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 1, pp. I-245 to I-248

(58) Fields searched (Int. Cl.7, DB name): G10L 15/14; INSPEC (DIALOG); JICST File (JOIS); WPI (DIALOG); IEEE/IEE Electronic Library Online

Claims

(57) [Claims]

1. A speech recognition apparatus based on hidden Markov models, comprising: a speaker tree storage unit that stores, for each hidden Markov model of a recognition unit, a plurality of speaker trees whose nodes are probability distributions for speaker clusters; known node selecting means for selecting an optimal node in a speaker tree using an utterance of a new speaker; speaker frequency calculating means for counting frequency information at the terminal nodes of the subtree whose root (parent node) is the selected node; unknown node selecting means for selecting an optimal node in a speaker tree using the frequency information of the terminal nodes; and standard pattern updating means for adapting a standard pattern using the probability distributions of the nodes selected by the known node selecting means and the unknown node selecting means.

2. The speech recognition apparatus according to claim 1, wherein a speaker tree is stored in the speaker tree storage unit for each state of a hidden Markov model of a recognition unit.

3. An apparatus for performing speech recognition based on hidden Markov models, comprising: analysis means for extracting features from an input speech waveform and outputting a time series of feature vectors; a standard pattern storage unit that stores the hidden Markov models (HMMs) of recognition units or the states of the HMMs; a recognition/matching unit that reads the HMMs or HMM states stored in the standard pattern storage unit, matches them against the time series of feature vectors sent from the analysis means, and outputs a recognition result; a speaker tree storage unit that stores a plurality of speaker trees, one for each HMM or HMM state of the recognition units; known node selecting means for selecting an optimal node in a speaker tree using an utterance of a new speaker; speaker frequency calculating means for counting frequency information at the terminal nodes of the subtree whose root (parent node) is the selected node; unknown node selecting means for selecting an optimal node also in the speaker trees of recognition units not included in the new speaker's adaptation utterance, by calculating the frequency information of upper nodes from the frequency information of the terminal nodes of the speaker tree; and a standard pattern updating unit that stores, in the standard pattern storage unit, the HMMs or HMM states corresponding to the nodes selected by the known node selecting means and the unknown node selecting means as a standard pattern for the new speaker.

4. A speaker adaptation method for a speech recognition apparatus that performs speech recognition based on hidden Markov models and includes a speaker tree storage unit storing, for each recognition unit, a plurality of speaker trees whose nodes are probability distributions for speaker clusters, the method comprising: (a) a known node selection process of selecting a node in a speaker tree using an utterance of a new speaker; (b) a speaker frequency calculation process of counting frequency information at the terminal nodes of the subtree whose root (parent node) is the selected node; (c) an unknown node selection process of selecting, using the frequency information of the terminal nodes, an optimal node also in the speaker trees of recognition units not included in the new speaker's adaptation utterance; and (d) a standard pattern update process of adapting a standard pattern to the new speaker using the probability distributions of the nodes selected in the known node selection process and the unknown node selection process.

5. The method according to claim 4, wherein a speaker tree is stored in the speaker tree storage unit for each hidden Markov model of a recognition unit.

6. The speaker adaptation method for a speech recognition apparatus according to claim 4, wherein a speaker tree is stored in the speaker tree storage unit for each state of a hidden Markov model of a recognition unit.

7. The speaker adaptation method for a speech recognition apparatus according to claim 4, wherein the known node selection process includes the steps of: (a) setting, as the current node, the root node of the speaker tree of a syllable that is a recognition unit of the new speaker's utterance; (b) determining whether the current node is a leaf of the speaker tree and, if it is a leaf, moving to step (f); (c) calculating, using the HMM of the current node, the likelihood P0 of the new speaker's speech data for the syllable; (d) calculating, using the HMM of each node immediately below the current node, the likelihoods of the new speaker's speech data for the syllable, and setting the node that gives the maximum likelihood Q0 as the next node; (e) if the maximum value Q0 exceeds P0 by more than a predetermined value, setting the next node as the current node and moving to step (b); and (f) selecting the current node as the optimal node.

8. The speaker adaptation method for a speech recognition apparatus according to claim 4, wherein the unknown node selection process includes the steps of: (a) giving the cumulative usage frequency of each speaker, calculated by the speaker frequency calculation means, to the terminal nodes of the speaker tree of a syllable that is a recognition unit which did not appear in the new speaker's utterance, and calculating the cumulative usage frequencies of upper nodes in order from the terminal nodes by giving each node of the speaker tree the sum of the cumulative usage frequencies of the lower nodes directly connected to it; (b) setting the current node to the topmost root node of the speaker tree; (c) finding the maximum of the cumulative usage frequencies of all nodes immediately below the current node and storing the node that gives the maximum as the next node, and moving to step (f) if a plurality of nodes give the maximum; (d) moving to step (f) if the cumulative usage frequency of the next node is equal to or less than a predetermined value; (e) setting the next node as the current node and returning to step (c); and (f) selecting the current node as the optimal node.

9. A recording medium storing a program for causing a computer that constitutes an apparatus for performing speech recognition based on hidden Markov models, the apparatus including a speaker tree storage unit storing, for each recognition unit, a plurality of speaker trees whose nodes are probability distributions for speaker clusters, to execute the processes of: (a) a known node selection process of selecting a node in a speaker tree using an utterance of a new speaker; (b) a speaker frequency calculation process of counting frequency information at the terminal nodes of the subtree whose root (parent node) is the selected node; (c) an unknown node selection process of selecting an optimal node also in the speaker trees of recognition units not included in the new speaker's adaptation utterance, by calculating the frequency information of upper nodes from the frequency information of the terminal nodes; and (d) a standard pattern update process of adapting a standard pattern using the probability distributions of the nodes selected in the known node selection process and the unknown node selection process.