JP6944329B2 - Speech recognizer and speech recognition program

Info

Publication number: JP6944329B2
Application number: JP2017192767A
Authority: JP
Inventors: 嶐一岡
Original assignee: University of Aizu
Current assignee: University of Aizu
Priority date: 2017-10-02
Filing date: 2017-10-02
Publication date: 2021-10-06
Anticipated expiration: 2037-10-02
Also published as: JP2019066690A

Description

The present invention relates to a speech recognition device and a speech recognition program for performing speech recognition.

The "cocktail party effect" has long been known as the ability to pick out a topic of interest in an environment where multiple speakers are holding multiple conversations. For example, even when many people are talking about different topics, the cocktail party effect lets a listener naturally follow the conversation of a person of interest or a conversation whose content is of interest.

There is a growing demand to realize, by engineering means, the human ability underlying the "cocktail party effect". Doing so, however, is not easy. Research on the separation of mixed speech is known as one line of work toward this goal (see, for example, Patent Document 1 and Non-Patent Documents 1 to 6). Mixed speech here means, for example, speech made up of multiple conversations held by multiple speakers, as described above.

In typical research on mixed speech separation, at least as many microphones as speakers are installed, and the speech of each speaker is separated based on the waveforms of the multiple sounds picked up by the microphones. This technical field is called "blind source separation". The main method used for separating mixed speech is a technique called "independent component analysis" (hereinafter, ICA).

Patent Document 1: JP-A-2000-125397

Non-Patent Document 1: Pierre Comon, Christian Jutten (eds.), Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010, Chapter 8, pp. 281-324
Non-Patent Document 2: Emmanuel Vincent, Nancy Bertin, Remi Gribonval, Frederic Bimbot, "From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound", IEEE Signal Processing Magazine, Vol. 31, Issue 3, May 2014, pp. 107-115
Non-Patent Document 3: Sebastian Ewert, Bryan Pardo, Meinard Mueller, Mark D. Plumbley, "Score-Informed Source Separation for Musical Audio Recordings: An overview", IEEE Signal Processing Magazine, Vol. 31, Issue 3, May 2014, pp. 116-124
Non-Patent Document 4: Simon Arberet, Pierre Vandergheynst, R. Carrillo, J. Thiran, Y. Wiaux, "Sparse reverberant audio source separation via reweighted analysis", IEEE Transactions on Speech, Audio and Language Processing, Vol. 21, No. 7, Jul. 2013, pp. 1391-1402
Non-Patent Document 5: Simon Arberet, Pierre Vandergheynst, "Reverberant Audio Source Separation via Sparse and Low-Rank Modeling", IEEE Signal Processing Letters, Vol. 21, Issue 4, January 28, 2014, pp. 404-408
Non-Patent Document 6: Antoine Liutkus, Derry Fitzgerald, Zafar Rafii, Bryan Pardo, Laurent Daudet, "Kernel Additive Models for Source Separation", IEEE Transactions on Signal Processing, Vol. 62, Issue 16, August 15, 2014, pp. 4298-4310

In practice, ICA is mostly used in situations with about two or three speakers. When there are five or more speakers, it is difficult to separate mixed speech with ICA. It also becomes difficult to separate the mixed speech well when the number of speakers increases or decreases partway through. Furthermore, ICA requires at least as many microphones as speakers, which imposes the burden of preparing and installing a large number of sound-collecting devices.

Moreover, to realize the "cocktail party effect" by engineering means, speech recognition must be performed separately after the mixed speech waveform has been separated by ICA. Several problems therefore had to be solved one by one: multiple microphones are required, the mixed speech must be separable by ICA, and recognition processing must be performed after the separation.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition device and a speech recognition program capable of determining whether a reference speech is contained in an input speech.

To solve the above problems, a speech recognition device according to the present invention uses an input pattern and a reference pattern. The input pattern is two-dimensional data of frequency x (1 ≤ x ≤ X) and time t, obtained by converting the input speech into frequency-domain data at every first predetermined time. The reference pattern is two-dimensional data of frequency y (1 ≤ y ≤ X) and time τ (1 ≤ τ ≤ T), obtained by converting the reference speech into frequency-domain data at every second predetermined time. The device is characterized by having the following means.

Cumulative local distance calculation means: calculates the normalized value of each coordinate f(t, x) of the input pattern by dividing the value of f(t, x) by the sum of the values of f(t, x) from x = 1 to x = X, and calculates the normalized value of each coordinate z(τ, y) of the reference pattern by dividing the value of z(τ, y) by the sum of the values of z(τ, y) from y = 1 to y = X. By performing nonlinear matching between the coordinates f(t, x) of the input pattern and the coordinates z(τ, y) of the reference pattern using two-dimensional continuous dynamic programming, it calculates the local distance d(t, x, τ, y), which is the absolute value of the normalized value of f(t, x) minus the normalized value of z(τ, y), and accumulates the calculated local distances to obtain the cumulative local distance D(t, X, T, X).

Dip detection means: calculates D(t, X, T, X)/(T·X), that is, the cumulative local distance calculated by the cumulative local distance calculation means divided by the product of T and X, and detects a dip, meaning a valley bottom, in the way this value changes over time t.

Difference value calculation means: taking t* as the time at which the dip detected by the dip detection means occurs, calculates as the difference value h the maximum of D(t, X, T, X)/(T·X) over the time interval from t* − 2T to t*, minus the value D(t*, X, T, X)/(T·X) at time t*.

Reference pattern region detection means: when the difference value h calculated by the difference value calculation means is larger than a preset first threshold, performs backtrace processing starting from (t*, X, T, X) of the cumulative local distance D(t*, X, T, X). It thereby calculates the local distance d(t, x, τ, y) from the normalized value of each coordinate f(t, x) and the normalized value of the coordinate z(τ, y) that optimally matches it, and detects, as a target region, the set of coordinates z(τ, y) of the reference pattern for which the calculated local distance is no greater than a preset second threshold.

Speech determination means: determines that the input speech contains the reference speech when the target region is detected by the reference pattern region detection means.
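The quantities named in the claim above can be collected into one set of formulas. This is only a restatement for readability: the symbols \hat f, \hat z, S(t), \theta_1 and \theta_2 are notation introduced here and do not appear in the source, and the recurrence that produces D(t, X, T, X) by two-dimensional continuous dynamic programming is not given in this passage, so it is not shown.

```latex
\begin{align*}
\hat f(t,x) &= \frac{f(t,x)}{\sum_{x'=1}^{X} f(t,x')}, &
\hat z(\tau,y) &= \frac{z(\tau,y)}{\sum_{y'=1}^{X} z(\tau,y')}, \\
d(t,x,\tau,y) &= \bigl|\hat f(t,x) - \hat z(\tau,y)\bigr|, \\
S(t) &= \frac{D(t,X,T,X)}{T \cdot X}, \\
h &= \max_{t^{*}-2T \,\le\, t \,\le\, t^{*}} S(t) \;-\; S(t^{*}).
\end{align*}
```

Backtracing from (t^{*}, X, T, X) is triggered when h > \theta_1 (the first threshold), and the target region is the set of reference coordinates z(\tau, y) on the matched correspondence whose local distance satisfies d(t, x, \tau, y) \le \theta_2 (the second threshold).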

A speech recognition program according to the present invention is a program for a speech recognition device that determines whether a reference speech is contained in an input speech. It uses an input pattern, which is two-dimensional data of frequency x (1 ≤ x ≤ X) and time t obtained by converting the input speech into frequency-domain data at every first predetermined time, and a reference pattern, which is two-dimensional data of frequency y (1 ≤ y ≤ X) and time τ (1 ≤ τ ≤ T) obtained by converting the reference speech into frequency-domain data at every second predetermined time. The program is characterized by causing the control means of the speech recognition device to realize the following functions.

Cumulative local distance calculation function: calculates the normalized value of each coordinate f(t, x) of the input pattern by dividing the value of f(t, x) by the sum of the values of f(t, x) from x = 1 to x = X, and calculates the normalized value of each coordinate z(τ, y) of the reference pattern by dividing the value of z(τ, y) by the sum of the values of z(τ, y) from y = 1 to y = X. By performing nonlinear matching between the coordinates f(t, x) of the input pattern and the coordinates z(τ, y) of the reference pattern using two-dimensional continuous dynamic programming, it calculates the local distance d(t, x, τ, y), which is the absolute value of the normalized value of f(t, x) minus the normalized value of z(τ, y), and accumulates the calculated local distances to obtain the cumulative local distance D(t, X, T, X).

Dip detection function: calculates D(t, X, T, X)/(T·X), that is, the cumulative local distance calculated by the cumulative local distance calculation function divided by the product of T and X, and detects a dip, meaning a valley bottom, in the way this value changes over time t.

Difference value calculation function: taking t* as the time at which the dip detected by the dip detection function occurs, calculates as the difference value h the maximum of D(t, X, T, X)/(T·X) over the time interval from t* − 2T to t*, minus the value D(t*, X, T, X)/(T·X) at time t*.

Reference pattern region detection function: when the difference value h calculated by the difference value calculation function is larger than a preset first threshold, performs backtrace processing starting from (t*, X, T, X) of the cumulative local distance D(t*, X, T, X). It thereby calculates the local distance d(t, x, τ, y) from the normalized value of each coordinate f(t, x) and the normalized value of the coordinate z(τ, y) that optimally matches it, and detects, as a target region, the set of coordinates z(τ, y) of the reference pattern for which the calculated local distance is no greater than a preset second threshold.

Speech determination function: determines that the input speech contains the reference speech when the target region is detected by the reference pattern region detection function.

Further, when the target region is detected, the speech recognition device and the speech recognition program described above may detect the minimum time τs and the maximum time τe of the coordinates z(τ, y) constituting the target region, and determine that the portion of the reference speech within the time range from τs to τe is contained in the input speech.
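The τs/τe step above amounts to taking the smallest and largest time index in the detected region. A minimal Python sketch, assuming the target region is held as a collection of (τ, y) pairs; the function name `reference_time_span` and this data layout are illustrative and not taken from the source:

```python
def reference_time_span(target_region):
    """Return (tau_s, tau_e) for a target region given as (tau, y) pairs.

    Returns None when the region is empty, i.e. no target region was detected.
    """
    taus = [tau for tau, _y in target_region]
    if not taus:
        return None
    return min(taus), max(taus)


# Example: a region covering reference frames 3 to 5 in various bands.
region = {(3, 1), (5, 2), (4, 7)}
span = reference_time_span(region)  # (3, 5)
```

The portion of the reference speech between frames τs and τe is then the part judged to be present in the input speech.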

The speech recognition device described above may also have frequency conversion means for converting the input speech into frequency-domain data at every first predetermined time, and input pattern recording means in which the frequency-domain data converted by the frequency conversion means is normalized and recorded in order of conversion time. Each time the frequency conversion means generates new data, the input pattern recording means erases the oldest data it holds and records the new data after the data generated immediately before it. The cumulative local distance calculation means reads out all the data recorded in the input pattern recording means, keeping the recorded order, and uses it as the input pattern to calculate the cumulative local distance D(t, X, T, X).
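The recording behaviour described above (drop the oldest frame, append the newest, read everything back in order) is that of a fixed-capacity first-in, first-out buffer. A minimal Python sketch; the class name `InputPatternBuffer` and the idea of a fixed `capacity` in frames are assumptions made for illustration, since the source does not state the buffer size:

```python
from collections import deque


class InputPatternBuffer:
    """Fixed-capacity buffer of normalized frame vectors, oldest first."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest item when a new one is appended.
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        """Record a new frame; the oldest frame is dropped once the buffer is full."""
        self._frames.append(list(frame))

    def pattern(self):
        """Return all recorded frames in recorded order, as the input pattern."""
        return [list(f) for f in self._frames]


buf = InputPatternBuffer(capacity=3)
for value in (1.0, 2.0, 3.0, 4.0):
    buf.push([value])
current = buf.pattern()  # [[2.0], [3.0], [4.0]]
```

Because the buffer never needs a start or end marker, the same structure works for input that continues indefinitely.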

In the speech recognition device and the speech recognition program according to the present invention, the input speech is treated as a two-dimensional input pattern and the reference speech as a two-dimensional reference pattern, and nonlinear matching by two-dimensional continuous dynamic programming is used to determine whether the reference speech is contained in the input speech. Therefore, even when the input speech consists of several speakers talking at the same time, it can be recognized as one combined sound (a single speech signal) without waveform separation. There is no need for speech separation, or for separate recognition of each stream after separation, so processing can be made faster and the processing load reduced.

In addition, because the utterances of multiple speakers are recognized as one combined sound (a single speech signal), speech recognition can be performed without taking changes in the number of speakers into account, even if speakers join or leave partway through.

Furthermore, because the input speech is treated as a two-dimensional input pattern and the reference speech as a two-dimensional reference pattern, and recognition is performed by pattern matching between two sets of two-dimensional data, the recognition process can also take into account elements of vectors at different times. Speech recognition can therefore be performed with higher accuracy than in conventional speech recognition, where the input speech or the reference speech is treated as vector data.

Also, in the speech recognition device and the speech recognition program according to the present invention, the difference value h is calculated from the change over time t of the value D(t, X, T, X)/(T·X). When h is larger than the preset first threshold, the device checks whether a target region of the reference pattern can be detected in which the local distance d(t, x, τ, y) is no greater than the preset second threshold. This makes it possible to determine whether the input speech contains the reference speech.
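The dip and difference-value steps can be sketched in Python as follows. The source defines a dip only as a "valley bottom" of D(t, X, T, X)/(T·X) over time, so treating it as a simple local minimum is an assumption; the function names and the list `scores` (holding the value of D(t, X, T, X)/(T·X) at each time step) are also illustrative.

```python
def find_dips(scores):
    """Return indices t where scores[t] is a local minimum
    (lower than its predecessor and not higher than its successor)."""
    return [
        t for t in range(1, len(scores) - 1)
        if scores[t] < scores[t - 1] and scores[t] <= scores[t + 1]
    ]


def difference_value(scores, t_star, T):
    """h = max of scores over [t_star - 2T, t_star] minus scores[t_star]."""
    start = max(0, t_star - 2 * T)  # clamp the window at the start of the data
    return max(scores[start:t_star + 1]) - scores[t_star]


def should_backtrace(scores, t_star, T, first_threshold):
    """Backtracing is attempted only when h exceeds the first threshold."""
    return difference_value(scores, t_star, T) > first_threshold


scores = [0.9, 0.8, 0.85, 0.6, 0.2, 0.5, 0.7]
dips = find_dips(scores)                    # [1, 4]
h = difference_value(scores, t_star=4, T=2)  # about 0.7
```

With a first threshold of, say, 0.5 (an arbitrary value chosen for the example), only the dip at index 4 would trigger backtracing; the shallow dip at index 1 would be ignored.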

In particular, in the speech recognition device and the speech recognition program according to the present invention, when the target region is detected, detecting the minimum time τs and the maximum time τe of the coordinates z(τ, y) constituting the target region makes it possible to determine that the portion of the reference speech within the time range from τs to τe is contained in the input speech.

Furthermore, each time the frequency conversion means generates new data, the input pattern recording means erases the oldest of the data recorded in order of conversion time and records the new data after the data generated immediately before it. The cumulative local distance calculation means then reads out all the data recorded in the input pattern recording means, keeping the recorded order, and uses it as the input pattern to calculate the cumulative local distance D(t, X, T, X). This makes it possible to recognize the reference speech in input speech that arrives continuously. The input speech to be processed is not bounded by a start point or an end point. The speech recognition device and the speech recognition program according to the present invention can therefore take input speech that continues endlessly as the target of speech recognition.

FIG. 1 is a block diagram showing the schematic configuration of the speech recognition device according to the embodiment.
FIG. 2 is a block diagram showing the functions that the CPU according to the embodiment executes based on a program.
FIG. 3 is a diagram showing the change over time of the value D(t, X, T, X)/(T·X) obtained by the matching process, where (a) shows the case in which the value at the dip is zero and (b) shows the case in which the value at the dip is not zero.
FIG. 4 is a diagram for explaining the case in which the reference data is speech data for "I wonder if it will be sunny tomorrow, or rainy" and the input data contains speech corresponding to "I wonder if it will be sunny".
FIG. 5 is a flowchart showing the processing performed by the matching processing unit according to the embodiment.
FIG. 6 is a diagram for explaining a conventional speech recognition method based on vector-based matching.
FIG. 7 is a diagram for explaining matching in which the reference data and the input data are treated as region-based data.

An example of the speech recognition device according to the present invention is described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the schematic configuration of the speech recognition device. The speech recognition device 100 has a microphone 1, a CPU (Central Processing Unit: control means, cumulative local distance calculation means, dip detection means, difference value calculation means, reference pattern region detection means) 2, a ROM (Read Only Memory) 3, a RAM (Random Access Memory: input pattern recording means) 4, a recording unit (input pattern recording means) 5, and a display unit 6. The speech recognition device 100 is installed, for example, in a space where multiple speakers are present and are carrying on conversations with different content in parallel.

As described above, the microphone 1 picks up the sound of a space in which conversations with different content proceed in parallel. In the speech recognition device 100 according to the present embodiment, the case where only one microphone 1 is installed is shown as an example. The number of microphones 1 is not necessarily limited to one, however. As long as the sound of the space in which multiple speakers hold different conversations can be picked up as a single speech signal, there is no restriction on the number of microphones 1.

The recording unit 5 consists of a general-purpose hard disk or the like. Its configuration is not limited to a hard disk and may be flash memory, an SSD (Solid State Drive / Solid State Disk), or the like. The recording unit 5 stores, as reference data, the speech data used for speech recognition. The reference data is generated from speech (reference speech) whose content is wording such as a sentence or a word (hereinafter referred to as a "sentence symbol").

The CPU 2 performs speech recognition processing to determine whether the input speech (input sound, input data) picked up by the microphone 1 contains the reference speech (reference data). Specifically, the CPU 2 performs the speech recognition processing by reading out and executing a program recorded in the ROM 3 described next (such as a program based on the flowchart of FIG. 5 described later).

The ROM 3 stores programs and the like that define the processing performed by the CPU 2. The RAM 4 is used as a work area for the processing of the CPU 2, serving as the buffer memory unit 12 and the working memory 14a described below. For example, when the RAM 4 is used as the buffer memory unit 12, the speech data picked up by the microphone 1 is recorded in the RAM 4 as input data. When the RAM 4 is used as the working memory 14a, it serves as the working area for the matching process and the backtrace process using two-dimensional continuous dynamic programming.

The display unit 6 is used to display, among other things, the result of determining whether the input speech (input data) contains the reference speech (reference data). A general display device such as a liquid crystal display or a CRT display is used for the display unit 6.

Next, the method by which the CPU 2 performs speech recognition processing based on the input data and the reference data is described. FIG. 2 is a block diagram showing the functions that the CPU 2 executes based on the program recorded in the ROM 3. As shown in FIG. 2, the speech recognition device 100 has an A/D conversion unit 10, an FFT unit (frequency conversion means, cumulative local distance calculation means) 11, a buffer memory unit (input pattern recording means) 12, a reference data memory unit 13, a matching processing unit (control means, cumulative local distance calculation means, dip detection means, difference value calculation means, reference pattern region detection means) 14, and a detection determination unit (control means, speech determination means) 15.

The microphone 1 is connected to the A/D conversion unit 10. The A/D conversion unit 10 converts the input speech picked up by the microphone 1 from analog data to digital data. The input speech picked up by the microphone 1 is an endless speech waveform, for example a waveform in which the speech of several speakers talking at the same time overlaps into a single signal. The A/D conversion unit 10 converts this analog input speech into digital data at a sampling rate of, for example, 12 kHz or 16 kHz. The digital data produced by the A/D conversion unit 10 is output to the FFT unit 11.

The FFT unit 11 converts the input speech, which has been converted to digital data by the A/D conversion unit 10, from time-domain data to frequency-domain data. Specifically, the FFT unit 11 filters the digitized input speech with a window function and then applies a fast Fourier transform. The result of the fast Fourier transform is obtained as a spectrum vector with, for example, 20 bands. In the present embodiment, this spectrum vector is called a frame vector. The frame vector is data commonly used in speech analysis. The number of bands in the frame vector (spectrum vector) is not necessarily limited to 20 and may be, for example, 12 or 16.

フレームベクトルは、ＦＦＴ部１１によって逐次作成される。また、上述したように、入力音声は、マイク１によりエンドレスに収音される。従って、ＦＦＴ部１１において逐次、高速フーリエ変換処理を行うことによって、エンドレスに収音される入力音声に対応したフレームベクトルを作成することができる。本実施の形態では、一例として、５ｍｓｅｃ毎（第１所定時間毎）にフレームベクトルが作成されるものとする。なお、フレームベクトルの作成間隔は、５ｍｓｅｃには限定されず、８ｍｓｅｃであっても、あるいは、１２ｍｓｅｃであっても、その他の作成間隔であってもよい。 The frame vector is sequentially created by the FFT unit 11. Further, as described above, the input voice is endlessly picked up by the microphone 1. Therefore, by sequentially performing the fast Fourier transform process in the FFT unit 11, it is possible to create a frame vector corresponding to the input voice that is endlessly picked up. In the present embodiment, as an example, it is assumed that a frame vector is created every 5 msec (every first predetermined time). The frame vector creation interval is not limited to 5 msec, and may be 8 msec, 12 msec, or any other creation interval.

Further, the FFT unit 11 performs, on the frame vector created by the fast Fourier transform process, a process of normalizing the magnitude of the voice waveform for each time. Let t be the time at which a frame vector is created from the input voice. This time t also corresponds to the time at which the input voice is input to the FFT unit 11 via the microphone 1. The frame vector (spectrum vector) created at time t is denoted by f(t, x), where 1 ≦ x ≦ X and t = 1, 2, .... Here x is the band index of the frame vector and X is the number of bands; in the present embodiment, X = 20 as described above. The FFT unit 11 obtains the sum of the values of f(t, x) from x = 1 to x = X and divides each value of f(t, x) by that sum to calculate the value of the normalized frame vector.

Assuming that the normalized frame vector is f'(t, x), the value of f'(t, x) is obtained by

$$f'(t, x) = \frac{f(t, x)}{\sum_{x'=1}^{X} f(t, x')} \quad \cdots \text{Equation 1}$$

Then, the FFT unit 11 sets the value of the normalized frame vector f'(t, x) to f(t, x) again.

つまり、高速フーリエ変換処理により求められた入力音声のフレームベクトルに対して正規化処理が行われたフレームベクトルの値が、ｆ（ｔ，ｘ）の値となる。 That is, the value of the frame vector in which the normalization processing is performed on the frame vector of the input voice obtained by the fast Fourier transform processing becomes the value of f (t, x).

この正規化処理によって、ｆ（ｔ，ｘ）の値は、０≦ｆ（ｔ，ｘ）≦１となる。なお、本実施の形態に係る入力音声では、実質的にエンドレスな時間区間として時間ｔを判断するが、その詳細については後述する。このようにして作成されたフレームベクトルは、バッファーメモリ部１２へ出力される。 By this normalization processing, the value of f (t, x) becomes 0 ≦ f (t, x) ≦ 1. In the input voice according to the present embodiment, the time t is determined as a substantially endless time interval, and the details thereof will be described later. The frame vector created in this way is output to the buffer memory unit 12.
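The per-frame normalization described above (dividing each band value by the sum over all bands) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize_frame`, the 4-band example frame, and the handling of an all-zero frame are assumptions, and the band values are assumed to be non-negative spectral magnitudes so that the result stays within 0 and 1.

```python
def normalize_frame(frame):
    """Divide each band value by the sum over all bands of the frame."""
    total = sum(frame)
    if total == 0:
        # Assumption: an all-zero (silent) frame is returned unchanged,
        # since the source does not say how a zero sum is handled.
        return list(frame)
    return [value / total for value in frame]


# Hypothetical 4-band frame (the embodiment uses X = 20 bands).
frame = [2.0, 1.0, 1.0, 0.0]
normalized = normalize_frame(frame)
print(normalized)  # [0.5, 0.25, 0.25, 0.0]
```

After normalization the band values of a frame sum to 1, which is why each value lies between 0 and 1 for non-negative input.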

バッファーメモリ部１２には、ＦＦＴ部１１によって正規化されたフレームベクトル（以下、正規化されたフレームベクトルを、単にフレームベクトルと称する）が時系列的に記録される。つまり、フレームベクトルは、ＦＦＴ部１１によって作成（変換）された時間順にバッファーメモリ部１２に記録される。 The frame vector normalized by the FFT unit 11 (hereinafter, the normalized frame vector is simply referred to as a frame vector) is recorded in the buffer memory unit 12 in time series. That is, the frame vector is recorded in the buffer memory unit 12 in the order of time created (converted) by the FFT unit 11.

The buffer memory unit 12 has a memory capacity capable of recording T_M frame vectors (T_M frames). For example, it can record T_M frame vectors created every 5 msec, that is, input voice data with a time length of 5 msec × T_M. When the FFT unit 11 creates a new, (T_M + 1)-th frame vector, the buffer memory unit 12 performs first-in first-out (FIFO) processing: it erases the oldest recorded frame vector, keeps the other recorded frame vectors arranged in the order in which they were created by the FFT unit 11, and records the newest frame vector (the (T_M + 1)-th frame vector) after the T_M-th frame vector (the frame vector created (converted) immediately before the newest one).

The group of T_M frame vectors recorded in the buffer memory unit 12 is referred to as input data. As described above, a frame vector is data obtained by converting the input voice into the frequency domain, so it is vector data in which the signal level for each frequency is recorded. Further, since T_M frame vectors (T_M frames), created sequentially every 5 msec, are recorded in the buffer memory unit 12, the input data is two-dimensional data whose vertical axis is frequency and whose horizontal axis is time, with a signal level recorded at each coordinate.
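The first-in first-out behavior of the buffer memory unit 12 can be illustrated with a bounded queue. The sketch below uses Python's `collections.deque` with `maxlen`; the capacity `T_M = 4` and the 3-band placeholder frames are hypothetical values chosen only for the example, since the source does not fix T_M.

```python
from collections import deque

T_M = 4  # hypothetical buffer capacity in frames
buffer = deque(maxlen=T_M)

for t in range(1, 7):
    frame = [float(t)] * 3  # placeholder 3-band frame vector created at time t
    buffer.append(frame)    # once full, the oldest frame is discarded automatically

# The buffer holds the T_M most recent frames, oldest first.
oldest_first = [frame[0] for frame in buffer]
print(oldest_first)  # [3.0, 4.0, 5.0, 6.0]
```

Iterating over the buffer yields the frames in creation order, which corresponds to the time axis of the two-dimensional input data.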

参照データメモリ部１３には、所定の音声（所定の文記号の音声）を記録した参照データが記録される。参照データは、参照音声をデジタルデータ化したものである。参照データを作成する場合には、まず、単語や文などの記号表現（文記号）を作る。その後、これらの記号表現（文記号）を合成音に変換して波形を作成することにより、参照音声が作成される。本実施の形態では、参照音声を有限の音声とする。つまり、参照音声の波形は、有限の時間区間により構成されるものとする。 Reference data in which a predetermined voice (voice of a predetermined sentence symbol) is recorded is recorded in the reference data memory unit 13. The reference data is a digital data of the reference audio. When creating reference data, first create symbolic expressions (sentence symbols) such as words and sentences. After that, the reference voice is created by converting these symbolic expressions (sentence symbols) into synthetic sounds and creating a waveform. In the present embodiment, the reference voice is a finite voice. That is, the waveform of the reference voice is composed of a finite time interval.

A frame vector for reference data is generated by performing a fast Fourier transform on the created reference voice, for example, every 5 msec (every second predetermined time). The frame vector for the reference data is also subjected to the process of normalizing the magnitude of the voice waveform for each time. The time of the voice waveform of the reference data is denoted by τ, with 1 ≦ τ ≦ T. The time T will be described later. The frame vector (spectrum vector) for the reference data generated by the fast Fourier transform process at time τ is denoted by z(τ, y), where 1 ≦ y ≦ X. Here y is the band index of the frame vector for reference data and X is the number of bands; in this embodiment, X = 20 as described above. For the frame vector z(τ, y) for reference data as well, the sum of the values of z(τ, y) from y = 1 to y = X is obtained, and each value of z(τ, y) is divided by that sum to calculate the value of the normalized frame vector (spectrum vector).

Assuming that the normalized frame vector is z'(τ, y), the value of z'(τ, y) is obtained by

$$z'(\tau, y) = \frac{z(\tau, y)}{\sum_{y'=1}^{X} z(\tau, y')} \quad \cdots \text{Equation 2}$$

Then, the value of the normalized frame vector z'(τ, y) is set to z(τ, y) again.

That is, z(τ, y) is the frame vector for the reference data obtained by the fast Fourier transform process, after the normalization processing has been applied. By this normalization processing, the value of z(τ, y) satisfies 0 ≦ z(τ, y) ≦ 1.

As described above, the reference data is composed of T frame vectors (T frames), one for every 5 msec (every second predetermined time), where T_M > T. The reference data has the same data format as the input data and is composed of spectrum vectors each having 20 bands. Therefore, as an example, the reference data in the present embodiment can be regarded as data consisting of T frame vectors (T frames), each having 20 elements (20 bands). That is, one piece of reference data can be represented by a matrix with 20 rows and T columns.

参照データに用いられる有限の音声データ（参照音声）として、本実施の形態では、上述したように、合成音声を利用する。例えば、「明日は晴れかな、雨かな」という文記号に基づいて、音声合成ソフト等を用いた人工的な音声データ（合成音による波形データ）を生成する。参照データメモリ部１３に記録される参照データは、上述した入力データと同様のデータ形式、つまり、縦軸を周波数とし、横軸を時間とする２次元データであって、各座標に信号レベルが記録されたデータとなる。 As the finite voice data (reference voice) used for the reference data, in the present embodiment, the synthetic voice is used as described above. For example, artificial voice data (waveform data by synthetic sound) using voice synthesis software or the like is generated based on the sentence symbol "Tomorrow is sunny or rainy". The reference data recorded in the reference data memory unit 13 is in the same data format as the input data described above, that is, two-dimensional data in which the vertical axis is frequency and the horizontal axis is time, and the signal level is set at each coordinate. It will be the recorded data.

なお、参照データにおいてフレームベクトルが作成される間隔（第２所定時間毎）は、５ｍｓｅｃに限定されない。フレームベクトルの作成間隔は、８ｍｓｅｃであっても、あるいは、１２ｍｓｅｃであっても、その他の作成間隔であってもよい。 The interval at which the frame vector is created in the reference data (every second predetermined time) is not limited to 5 msec. The frame vector creation interval may be 8 msec, 12 msec, or any other creation interval.

The matching processing unit 14 reads out the input data recorded in the buffer memory unit 12 (the data group of T_M normalized frame vectors of the input voice) and the reference data recorded in the reference data memory unit 13 (the data group of T normalized frame vectors for reference data), and performs data matching processing and backtrace processing using two-dimensional continuous dynamic programming. The matching processing unit 14 is provided with a working memory 14a used for the matching processing and the backtrace processing. The working memory 14a has the amount of memory required for the calculations of the matching processing and the backtrace processing. The amount of memory required for the matching processing will be described later.

マッチング処理部１４では、２次元連続動的計画法を用いてマッチング処理およびバックトレース処理を行うが、既に説明したように、マッチング処理およびバックトレース処理の対象となる入力データは、エンドレスな入力音声に基づく複数のフレームベクトルによって構成され、単位時間毎に入力データの内容が更新されるという特徴がある。このため、マッチング処理部１４では、エンドレスに更新される入力データを用いて２次元連続動的計画法を行う。本実施の形態では、エンドレスに更新される入力データを用いて行われる２次元連続動的計画法を、「インクリメンタル２次元連続動的計画法」と称する。 The matching processing unit 14 performs matching processing and backtrace processing using a two-dimensional continuous dynamic programming method. As described above, the input data to be the target of the matching processing and backtrace processing is an endless input voice. It is composed of a plurality of frame vectors based on the above, and has a feature that the content of the input data is updated every unit time. Therefore, the matching processing unit 14 performs a two-dimensional continuous dynamic programming method using input data that is updated endlessly. In the present embodiment, a two-dimensional continuous dynamic programming method performed using input data that is updated endlessly is referred to as an "incremental two-dimensional continuous dynamic programming method".

また、縦軸を周波数とし、横軸を時間とする入力データの２次元のデータパターンを、入力パターンと称する。入力パターンでは、既に説明したように、時間を示す横軸をｔ，周波数を示す縦軸をｘで示し、１≦ｘ≦Ｘとする。また、入力データの時間ｔは、バッファーメモリ部１２において、一番新しいフレームベクトルの作成（変換）時間を示している。バッファーメモリ部１２に記録される入力データは、１番新しいフレームベクトルが記録される際に、１番古いフレームベクトルが消去されるため、実質的にエンドレスな時間区間と判断できる。 Further, a two-dimensional data pattern of input data having a vertical axis as a frequency and a horizontal axis as a time is referred to as an input pattern. In the input pattern, as described above, the horizontal axis indicating time is indicated by t, the vertical axis indicating frequency is indicated by x, and 1 ≦ x ≦ X. Further, the time t of the input data indicates the creation (conversion) time of the newest frame vector in the buffer memory unit 12. The input data recorded in the buffer memory unit 12 can be determined to be a substantially endless time interval because the oldest frame vector is erased when the newest frame vector is recorded.

さらに、縦軸を周波数とし、横軸を時間とする参照データの２次元のデータパターンを、参照パターンと称する。参照パターンでは、既に説明したように、時間を示す横軸をτ，周波数を示す縦軸をｙで示し、１≦τ≦Ｔ，１≦ｙ≦Ｘとする。 Further, a two-dimensional data pattern of reference data having a vertical axis as frequency and a horizontal axis as time is referred to as a reference pattern. In the reference pattern, as described above, the horizontal axis indicating time is indicated by τ and the vertical axis indicating frequency is indicated by y, and 1 ≦ τ ≦ T and 1 ≦ y ≦ X.

The matching processing unit 14 performs the matching processing and the backtrace processing between the reference pattern (τ, y) and the input pattern (t, x) by using two-dimensional continuous dynamic programming. The processing of two-dimensional continuous dynamic programming is known art. For example, Japanese Patent Application Laid-Open No. 2010-165104 describes matching processing between two images, a reference image and an input image (target image). Assuming that the input image is the image f(i, j) and the reference image is the image g(m, n), the optimum pixel correspondence between the image f(i, j) and the image g(m, n) can be obtained by using two-dimensional continuous dynamic programming.

In two-dimensional continuous dynamic programming, a four-dimensional space consisting of the variables i, j, m, n is constructed, and the local distance in this four-dimensional space, d(i, j, m, n) = |f(i, j) − g(m, n)|, is calculated. Further, the variable obtained by accumulating the local distance d(i, j, m, n) is defined as the cumulative local distance D(i, j, m, n). With i fixed, the cumulative local distance D(i, j, m, n) is calculated by varying j, m, and n over the ranges 1 ≦ j ≦ J, 1 ≦ m ≦ M, and 1 ≦ n ≦ N. In this calculation, by associating i, j, m, n with the above-mentioned t, x, τ, y, the cumulative local distance D(t, X, T, X) in the present embodiment can be obtained. As already described, the ranges of x, τ, and y corresponding to j, m, and n are 1 ≦ x ≦ X, 1 ≦ τ ≦ T, and 1 ≦ y ≦ X.

累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）は、時間ｔにおける局所距離ｄを最適累積した値である。このため、累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）の値が小さいほど、最適な座標の対応関係が求められたと判断される。つまり、累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）の値が小さいほど、参照パターンと入力パターンとがより合致するものと考えることができる。 The cumulative local distance D (t, X, T, X) is a value obtained by optimally accumulating the local distance d at time t. Therefore, it is judged that the smaller the value of the cumulative local distance D (t, X, T, X) is, the more the optimum coordinate correspondence is obtained. That is, it can be considered that the smaller the value of the cumulative local distance D (t, X, T, X) is, the more the reference pattern and the input pattern match.

累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）を求める場合には、入力パターンの時間を示すｔの値を、１つずつ増加させる。このように、従来から知られている２次元連続動的計画法を用いて、時間ｔを１つずつ増加させ（時間ｔを進行させ）ながら、時間ｔの増加（進行）に応じて、累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）等を計算する（演算を進行する）手法が、上述した「インクリメンタル２次元連続動的計画法」の特徴である。マッチング処理部１４では、インクリメンタル２次元連続動的計画法に基づいて、累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）を算出する。 When the cumulative local distance D (t, X, T, X) is obtained, the value of t indicating the time of the input pattern is incremented by one. In this way, using the conventionally known two-dimensional continuous dynamic programming method, the time t is increased one by one (the time t is advanced), and the time t is accumulated according to the increase (progress). The method of calculating the local distance D (t, X, T, X) and the like (advancing the calculation) is a feature of the above-mentioned "incremental two-dimensional continuous dynamic programming method". The matching processing unit 14 calculates the cumulative local distance D (t, X, T, X) based on the incremental two-dimensional continuous dynamic programming method.

さらに、マッチング処理部１４では、算出された累積局所距離Ｄ（ｔ,Ｘ,Ｔ,Ｘ）に基づいて、２次元連続動的計画法におけるバックトレース処理を行う。バックトレース処理を行うことにより、入力パターンの座標（ｔ，ｘ）と参照パターンの座標（τ，ｙ）との間の最適な座標の対応を求めることができる。なお、バックトレース処理は、動的計画法において一般に用いられる公知の処理であって、上述した特開２０１０−１６５１０４号明細書などにおいて、詳細に説明されている。 Further, the matching processing unit 14 performs backtrace processing in the two-dimensional continuous dynamic programming method based on the calculated cumulative local distance D (t, X, T, X). By performing the backtrace processing, it is possible to obtain the optimum coordinate correspondence between the coordinates (t, x) of the input pattern and the coordinates (τ, y) of the reference pattern. The backtrace process is a known process generally used in dynamic programming, and is described in detail in the above-mentioned JP-A-2010-165104 and the like.

The cumulative local distance D(t, X, T, X) represents the two-dimensional matching result between the two-dimensional time-frequency reference pattern and the two-dimensional time-frequency input pattern whose input time extends up to time t. Since the cumulative local distance D(t, X, T, X) obtained as the matching result is a scalar value at each time, outputting the calculated values in time order shows how the value of D(t, X, T, X) changes over time as a single wave-like series.

After calculating the cumulative local distance D(t, X, T, X), the matching processing unit 14 divides it by the product of the value X and the value T (X·T) to obtain D(t, X, T, X)/(X·T). Since the cumulative local distance D(t, X, T, X) depends on T and X, dividing it by (X·T) yields a value of D(t, X, T, X)/(X·T) that always falls between 0 and 1. D(t, X, T, X)/(X·T) corresponds to the normalized cumulative local distance D(t, X, T, X).

FIGS. 3(a) and 3(b) are diagrams showing the change over time of the value of D(t, X, T, X)/(X·T) obtained by the matching processing. As shown in FIGS. 3(a) and 3(b), the trajectory of the wave representing the output D(t, X, T, X)/(X·T) contains valleys and peaks. A local valley portion (valley bottom) as shown in FIGS. 3(a) and 3(b) is referred to as a dip (local depression).

FIG. 3(a) shows the case where the value of D(t, X, T, X)/(X·T) at the dip is zero, and FIG. 3(b) shows the case where the value of D(t, X, T, X)/(X·T) at the dip is not zero. In the wave trajectories shown in FIGS. 3(a) and 3(b), when a valid dip exists, the reference pattern at the time t* at which the dip occurs and the input pattern up to time t* match partially or entirely. The input pattern is input continuously and without a break before time t*. Therefore, the time (temporal start point) from which the reference pattern and the input pattern match partially or entirely can be determined after the fact: at the time t* that gives a local minimum of the cumulative local distance D(t, X, T, X) (the time at which a valid dip occurs), backtrace processing by two-dimensional continuous dynamic programming is performed from the coordinates (t*, X, T, X) of the cumulative local distance D(t*, X, T, X), and the time t*2 at which the backtrace reaches (t*2, 1, 1, 1) is obtained.

As shown in FIGS. 3(a) and 3(b), let t* be the time at which a dip occurs, and obtain the maximum value of D(t, X, T, X)/(X·T) in the interval [t* − 2T, t*] (from t* − 2T to t*). The value obtained by subtracting D(t*, X, T, X)/(X·T) from that maximum value is defined as the difference value h. Since the value of D(t, X, T, X)/(X·T) lies between 0 and 1, the difference value h also satisfies 0 ≦ h ≦ 1. In the voice recognition device 100 according to the present embodiment, when the difference value h exceeds a preset first threshold value, it is determined that a dip effective for voice recognition exists.
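The dip test can be sketched as below, assuming the normalized scores D(t, X, T, X)/(X·T) are already available as a list indexed by time. The function names, the 0-based indexing, the clipping of the window at the start of the series, and the example values (including the threshold 0.5) are all assumptions made for illustration; how the candidate time t* itself is located is not covered here.

```python
def dip_depth(scores, t_star, T):
    """Return h: the maximum score over [t_star - 2T, t_star] minus the score at t_star."""
    start = max(0, t_star - 2 * T)  # assumption: clip the window at the first sample
    window = scores[start:t_star + 1]
    return max(window) - scores[t_star]


def is_valid_dip(scores, t_star, T, first_threshold):
    """A dip is treated as valid when h exceeds the first threshold."""
    return dip_depth(scores, t_star, T) > first_threshold


# Hypothetical normalized scores; the candidate dip is at index 4.
scores = [0.5, 0.875, 0.75, 0.5, 0.125, 0.5]
h = dip_depth(scores, t_star=4, T=2)
print(h)                                                      # 0.75
print(is_valid_dip(scores, t_star=4, T=2, first_threshold=0.5))  # True
```

Because every score lies between 0 and 1, h computed this way also lies between 0 and 1, so a first threshold in that range is meaningful.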

ここで、第１閾値は、実験等によって予め設定される値である。例えば、参照音声が含まれると判断し得るぎりぎりの入力音声と参照音声とを実験的に用いて、差値ｈを求めることによって、入力音声に参照音声が含まれるか否かの境界となる閾値（第１閾値）を、予め設定することができる。 Here, the first threshold value is a value preset by an experiment or the like. For example, by experimentally using the input voice and the reference voice that can be determined to include the reference voice and obtaining the difference value h, a threshold value that serves as a boundary as to whether or not the input voice includes the reference voice is used. (First threshold value) can be set in advance.

The interval [t* − 2T, t*] is determined in consideration of the limit of the possible start point when time t* is the end point, that is, the limit of the matchable range. The maximum value of D(t, X, T, X)/(X·T) in the interval [t* − 2T, t*] means that, among the elements of the input pattern ending at time t*, there is an element that is least matched to the reference pattern. Further, the above-mentioned difference value h indicates the decrease in the cumulative value, from the least-matched stage, that results from partial matching. This is because, as already described, the cumulative local distance D(t, X, T, X) is the accumulated value of the local distance d.

When D(t*, X, T, X)/(X·T) is zero (that is, when the cumulative local distance D(t*, X, T, X) is zero), as in FIG. 3(a), there is a section in which the values of the elements of the reference pattern, which is a two-dimensional data pattern, completely match the values of all the optimally corresponding elements in the input pattern, which is likewise a two-dimensional data pattern. Such complete agreement of the values of all corresponding elements is referred to as perfect matching.

完全マッチングの場合には、参照パターンの要素の値が、入力パターンの中の最適に対応する全ての要素の値と完全に合致することになるが、カクテルパーティ効果が有効と判断され得る状況は、必ずしも参照パターンの要素の値が、対応する全ての入力パターンの要素の値と完全に合致する場合だけではない。 In the case of perfect matching, the values of the elements of the reference pattern will exactly match the values of all the elements that correspond optimally in the input pattern, but there are situations where the cocktail party effect can be judged to be effective. , Not necessarily when the value of the element of the reference pattern exactly matches the value of the element of all the corresponding input patterns.

例えば、参照パターンに該当する参照音声が所定の文記号が発せられた女性の音声であり、入力パターンに該当する入力音声が同じ文記号が発せられた男性の音声である場合であっても、それぞれの音声の文記号が同じであれば、参照音声が入力音声に含まれると判断され、カクテルパーティ効果が有効と判断され得る。この場合には、参照パターンの要素の値が、入力パターンの中の最適に対応する全ての要素の値と完全に合致（完全マッチング）したものではなく、部分的に合致（部分マッチング）したものになる。完全に合致すること（完全マッチング）は例外的であり、通常は部分的に合致すること（部分マッチング）が多い。 For example, even if the reference voice corresponding to the reference pattern is the voice of a woman with a predetermined sentence symbol and the input voice corresponding to the input pattern is the voice of a man with the same sentence symbol. If the sentence symbol of each voice is the same, it is determined that the reference voice is included in the input voice, and the cocktail party effect can be judged to be effective. In this case, the values of the elements of the reference pattern do not completely match (perfect match) with the values of all the elements that correspond optimally in the input pattern, but partially match (partially match). become. Exact matching (perfect matching) is an exception, and usually partial matching (partial matching) is common.

部分マッチングに該当する場合であっても、参照音声に含まれる所定の文記号が、入力音声に含まれていると認識できれば、カクテルパーティ効果に該当する音声認識を行うことができる。つまり、部分マッチングになる場合であっても、参照パターンの文記号に該当する要素が、入力パターンの中の最適に対応する要素に含まれると判断される場合がある。このため、本実施の形態に係る音声認識装置１００では、参照パターンの要素の値が、入力パターンの中の最適に対応する要素の値と部分的に合致する場合に該当する、部分マッチングに基づいて音声認識を行う。 Even in the case of partial matching, if it can be recognized that the predetermined sentence symbol included in the reference voice is included in the input voice, the voice recognition corresponding to the cocktail party effect can be performed. That is, even in the case of partial matching, it may be determined that the element corresponding to the sentence symbol of the reference pattern is included in the element corresponding to the optimum in the input pattern. Therefore, in the voice recognition device 100 according to the present embodiment, it is based on partial matching, which corresponds to the case where the value of the element of the reference pattern partially matches the value of the element corresponding to the optimum in the input pattern. To perform voice recognition.

When performing voice recognition, backtrace processing is performed from the point (t*, X, T, X) of the cumulative local distance D(t*, X, T, X). This backtrace processing corresponds to searching for grid points (t, x, τ, y) in the four-dimensional space. Each search point (t, x, τ, y) found by the backtrace in this four-dimensional space is treated as a pair of (t, x) and (τ, y). The search points (t, x, τ, y) obtained by the backtrace processing indicate the coordinates at which the coordinates (τ, y) of the time-frequency pattern that is the two-dimensional data pattern of the reference data (reference pattern) match the time-frequency coordinates (t, x) of the two-dimensional data pattern of the input data (input pattern). The local distance at the matching coordinates is d(t, x, τ, y). The value of the local distance d(t, x, τ, y) is projected onto (placed at) the point (τ, y) to create a light-dark pattern.

Then, the region of grid points (τ, y) whose values are equal to or less than a preset second threshold value (a value between 0 and 1) is obtained. This second threshold value is set in advance by experiments or the like. At the coordinates of grid points whose values are equal to or less than the second threshold value, and in the region they form, the reference pattern (τ, y) and the input pattern (t, x) can be judged to be optimally matched. That is, when the value of the local distance d(t, x, τ, y) is equal to or less than the preset second threshold value, the reference pattern (τ, y) and the input pattern (t, x) can be judged to be optimally matched.

Specifically, using the coordinates z(τ, y) of the reference pattern (reference data) and the coordinates f(t, x) of the input pattern (input data), the local distance d(t, x, τ, y) is obtained by

$$d(t, x, \tau, y) = |f(t, x) - z(\tau, y)|$$

and the cumulative local distance D(t, X, T, X) is calculated. Furthermore, by performing backtrace processing from the cumulative local distance D(t, X, T, X), the local distance d(t, x, τ, y) is obtained at the coordinates where the coordinates z(τ, y) of the reference pattern and the coordinates f(t, x) of the input pattern are optimally matched.
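As a small illustration of the local distance and its projection onto the reference grid, the sketch below evaluates d(t, x, τ, y) = |f(t, x) − z(τ, y)| along a list of matched points and places each value at (τ, y). The backtrace itself is not implemented: `path` is a hypothetical, hand-written list standing in for the points that the backtrace would return. The function names, the 0-based indices, and the tiny 2-band patterns are likewise assumptions.

```python
def local_distance(f, z, t, x, tau, y):
    """d(t, x, tau, y) = |f(t, x) - z(tau, y)| for normalized patterns."""
    return abs(f[t][x] - z[tau][y])


def project_onto_reference(f, z, path, T, X):
    """Place d for each matched point on the (tau, y) grid; unmatched cells stay None."""
    grid = [[None] * X for _ in range(T)]
    for t, x, tau, y in path:
        grid[tau][y] = local_distance(f, z, t, x, tau, y)
    return grid


f = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]  # input pattern: 3 frames x 2 bands
z = [[0.25, 0.75], [0.5, 0.5]]              # reference pattern: T = 2 frames x 2 bands
# Hypothetical backtrace result as (t, x, tau, y) tuples.
path = [(1, 0, 0, 0), (1, 1, 0, 1), (2, 0, 1, 0), (2, 1, 1, 1)]
grid = project_onto_reference(f, z, path, T=2, X=2)
print(grid)  # [[0.0, 0.0], [0.5, 0.5]]
```

Small values in the projected grid mark reference coordinates that match the input well, which is what the thresholding step then selects.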

By obtaining the region of grid points (τ, y) in which the local distance d(t, x, τ, y) obtained in this way is equal to or less than the second threshold value, the corresponding coordinates (time-frequency) between the coordinates f(t, x) of the input pattern and the coordinates z(τ, y) of the reference pattern can be obtained.

Next, among the regions formed by the points z(τ, y) that are equal to or less than the second threshold, the largest region is found, and the minimum and maximum times within that region are obtained to give the start and end of the region.

Specifically, the largest region is found among the regions of the reference pattern (reference data) in which the local distance d(t, x, τ, y) is equal to or less than the second threshold. The minimum time and the maximum time are then obtained within the range of that largest region. In other words, among the reference-pattern coordinates (τ, y) whose local distance d(t, x, τ, y) is equal to or less than the second threshold, the smallest and the largest values of the time τ are obtained.
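
The region search described above can be sketched as a connected-component search over a boolean grid. The description does not state how a "region" is delimited, so this sketch assumes 4-connectivity in the (τ, y) plane; the function name and the list-of-lists mask are likewise illustrative assumptions.

```python
from collections import deque


def largest_region_time_span(mask):
    """mask[tau][y] is True where the local distance is <= the second threshold.

    Returns (tau_s, tau_e) as 1-based times for the largest 4-connected
    region (an assumed notion of "region"), or None if no point qualifies.
    """
    T = len(mask)
    Y = len(mask[0]) if T else 0
    seen = [[False] * Y for _ in range(T)]
    best = []
    for tau0 in range(T):
        for y0 in range(Y):
            if not mask[tau0][y0] or seen[tau0][y0]:
                continue
            # Breadth-first flood fill from an unvisited qualifying point.
            region = []
            queue = deque([(tau0, y0)])
            seen[tau0][y0] = True
            while queue:
                tau, y = queue.popleft()
                region.append((tau, y))
                for d_tau, d_y in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    n_tau, n_y = tau + d_tau, y + d_y
                    if (0 <= n_tau < T and 0 <= n_y < Y
                            and mask[n_tau][n_y] and not seen[n_tau][n_y]):
                        seen[n_tau][n_y] = True
                        queue.append((n_tau, n_y))
            if len(region) > len(best):
                best = region
    if not best:
        return None
    times = [tau for tau, _ in best]
    return min(times) + 1, max(times) + 1
```

The returned pair plays the role of (τs, τe); a return value of None corresponds to the case in which no reference-pattern region satisfies the second threshold.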

FIG. 4 shows the largest region in which the local distance d(t, x, τ, y) is equal to or less than the second threshold. The minimum time τs and the maximum time τe are obtained from this region. FIG. 4 illustrates, as an example, a case in which the reference pattern (reference data) is voice data for the sentence symbol "明日は晴れかな、雨かな" ("I wonder if tomorrow will be sunny or rainy") and the input pattern (input data) contains voice data for the sentence symbol "晴れかな" ("I wonder if it will be sunny"), that is, a case in which partial matching is established. The whole voice of the reference pattern (the synthesized voice of the sentence symbol) is "明日は晴れかな、雨かな", but the sentence symbol of the voice (synthesized voice) corresponding to the span from time τs to time τe in the reference-pattern region of FIG. 4 is "晴れかな". This portion of the reference data, from time τs to time τe, is the portion recognized within the endlessly arriving input data (for example, voice data of simultaneous utterances by multiple speakers).

FIG. 5 is a flowchart showing the processing performed by the matching processing unit 14 described above. As described above, the matching processing unit 14 applies incremental two-dimensional continuous dynamic programming to the reference pattern (τ, y), which is a two-dimensional time-frequency data pattern, and the input pattern (t, x), which is likewise a two-dimensional time-frequency data pattern. The matching processing unit 14 then obtains the cumulative local distance D(t, X, T, X) by accumulating the local distances d(t, x, τ, y) (S.1, cumulative local distance calculation function).

The matching processing unit 14 divides the cumulative local distance D(t, X, T, X) by (T·X) and, based on how the value D(t, X, T, X)/(T·X) changes over time, detects a dip (valley bottom) in its trajectory (S.2, dip detection function). Next, taking the time of the dip as t*, it obtains the difference value h over the interval [t* - 2T, t*] (S.3, difference value calculation function). The matching processing unit 14 then determines whether the difference value h is equal to or greater than a preset first threshold (S.4, reference pattern region detection function).
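
Steps S.2 and S.3 can be illustrated with a short sketch over a list of normalized values D(t, X, T, X)/(T·X) indexed by time. Two assumptions are made here and are not confirmed by this passage: a dip is taken to be a strict local minimum of the series, and the difference value h is taken to be the maximum of the series within [t* - 2T, t*] minus the value at t*. The function names are hypothetical.

```python
def find_dips(series):
    """Indices of strict local minima in a list of normalized values D/(T*X).

    Treating a dip as a strict local minimum is an assumption of this sketch.
    """
    return [i for i in range(1, len(series) - 1)
            if series[i] < series[i - 1] and series[i] < series[i + 1]]


def difference_value(series, t_star, T):
    """Assumed definition: h = max over [t* - 2T, t*] minus the value at t*."""
    start = max(0, t_star - 2 * T)   # clamp the window at the start of the data
    window = series[start:t_star + 1]
    return max(window) - series[t_star]
```

A deep dip following a high plateau gives a large h, which is then compared with the first threshold in S.4.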

If the difference value h is smaller than the first threshold (No in S.4), the matching processing unit 14 determines that the input pattern contains neither the whole nor any part of the voice data of the reference pattern. The matching processing unit 14 then outputs information to the detection determination unit 15 indicating that the reference pattern is not contained in the input pattern (S.5).

If, on the other hand, the difference value h is equal to or greater than the first threshold (Yes in S.4), the matching processing unit 14 performs backtrace processing from the point (t*, X, T, X) of the cumulative local distance D(t*, X, T, X), and thereby calculates the local distance d(t, x, τ, y) at the coordinates where the reference-pattern coordinates z(τ, y) and the input-pattern coordinates f(t, x) are optimally matched (S.6, reference pattern region detection function).

It then determines whether there is a reference-pattern region in which the obtained local distance d(t, x, τ, y) is equal to or less than the second threshold (S.7, reference pattern region detection function). In other words, it determines whether there are reference-pattern coordinates (τ, y) at which the local distance d(t, x, τ, y) is equal to or less than the second threshold. If no such region exists (No in S.7), the matching processing unit 14 outputs information to the detection determination unit 15 indicating that the reference pattern is not contained in the input pattern (S.5).

If, on the other hand, there is a reference-pattern region (set of coordinates) in which the local distance d(t, x, τ, y) is equal to or less than the second threshold (Yes in S.7), the matching processing unit 14 obtains the minimum time τs and the maximum time τe within that region (S.8, reference pattern region detection function). It then outputs, to the detection determination unit 15, corresponding-time-range information indicating the time range from the minimum time to the maximum time (S.9).

After outputting the information that the reference pattern is not contained in the input pattern (S.5), or after outputting the corresponding-time-range information (S.9), the matching processing unit 14 determines whether the FFT unit 11 has generated a new frame vector for the input pattern (input data), the oldest frame vector recorded in the buffer memory unit 12 has been erased, and the new frame vector has been recorded. In other words, the matching processing unit 14 determines whether the input data (input pattern) recorded in the buffer memory unit 12 has been updated (S.10).

If the buffer memory unit 12 has not been updated (No in S.10), the matching processing unit 14 repeats the processing of S.10 until the buffer memory unit 12 is updated. If the buffer memory unit 12 has been updated (Yes in S.10), the matching processing unit 14 returns to S.1. That is, the matching processing unit 14 applies two-dimensional continuous dynamic programming to the reference pattern (τ, y) and the new input pattern (t, x) to obtain the cumulative local distance D(t, X, T, X) by accumulating the local distances d(t, x, τ, y) (S.1), and then executes the processing described above (S.2 to S.10).
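
The branching from S.4 to S.9 can be summarized in a few lines. The sketch below is an illustration only: `matching_step`, the result labels, and the `find_region` callback (standing in for the backtrace and region search of S.6 to S.8) are hypothetical names, not taken from the flowchart.

```python
def matching_step(h, first_threshold, find_region):
    """One pass through the decisions S.4 to S.9.

    h:               difference value obtained in S.3
    first_threshold: preset first threshold
    find_region:     caller-supplied function (hypothetical) that performs
                     S.6 to S.8 and returns (tau_s, tau_e), or None when no
                     region satisfies the second threshold
    """
    if h < first_threshold:               # S.4: No
        return ("not_contained", None)    # S.5
    span = find_region()                  # S.6 to S.8, run only when S.4 is Yes
    if span is None:                      # S.7: No
        return ("not_contained", None)    # S.5
    return ("time_range", span)           # S.9
```

Passing the region search as a callback mirrors the flowchart, in which the backtrace is performed only after the first-threshold test succeeds.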

The detection determination unit 15 acquires, from the matching processing unit 14, either the information that the reference pattern is not contained in the input pattern or the corresponding-time-range information. Based on the acquired information, the detection determination unit 15 determines whether the voice of the reference pattern (the voice of the sentence symbol) is contained in the input pattern (voice determination function).

When the information that the reference pattern is not contained in the input pattern is acquired from the matching processing unit 14 (the case of S.5), the detection determination unit 15 determines that the voice of the reference pattern (the voice of the sentence symbol) was not contained in the input pattern at all (voice determination function). The detection determination unit 15 then outputs, to the display unit 6, information indicating that the reference voice was not contained in the input voice at all. The display unit 6 displays the information received from the detection determination unit 15 and notifies the user that no match was established.

When the corresponding-time-range information, indicating the time range from the minimum time τs to the maximum time τe, is acquired from the matching processing unit 14 (the case of S.9), the detection determination unit 15 determines that all or part of the voice of the reference pattern (the voice of the sentence symbol) was contained in the input pattern (voice determination function). The detection determination unit 15 then outputs, to the display unit 6, information indicating that the reference voice is contained in the input voice. It also outputs, to the display unit 6, information indicating that the voice (the voice of the sentence symbol) in the corresponding time range of the reference voice (from time τs to time τe) was contained in the input voice. The display unit 6 displays the information received from the detection determination unit 15 and notifies the user that partial matching has been established (voice determination function).

In this way, the matching processing unit 14 executes the processing shown in FIG. 5, and the detection determination unit 15 causes the display unit 6 to display whether a match exists based on the processing result of the matching processing unit 14. The user can thus be informed either that the input pattern contains all or part of the reference pattern (the case of S.9) or that the input pattern does not contain the reference pattern at all (the case of S.5). From the result displayed on the display unit 6, the user can determine whether the mixed voice of multiple speakers (input voice) contains a given voice (the reference voice, that is, the voice of the sentence symbol).

As described above, the voice recognition device 100 according to the embodiment determines that the input voice contains the sentence symbol in the time range of the reference voice from time τs to the maximum time τe. In particular, when the value of D(t, X, T, X)/(X·T) is close to zero, the start time τs may be 1 and the end time τe may be T. In that case, the whole sentence symbol of the reference voice is determined to be contained in the input voice. In ordinary partial matching, the start time τs is greater than 1 and the end time τe is less than T, so that only part of the sentence symbol of the reference voice is recognized. The deeper the dip, the longer the portion of the reference voice (and the corresponding sentence symbol) contained in the input voice.
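
The distinction between a full match and a partial match reduces to a comparison of τs and τe with the bounds of the reference pattern. A minimal sketch follows, with the function name and return labels chosen here purely for illustration.

```python
def match_kind(tau_s, tau_e, T):
    """Classify a detected time range of the reference pattern.

    A range covering 1 through T means the whole reference sentence symbol
    was found; anything narrower is a partial match.
    """
    if tau_s == 1 and tau_e == T:
        return "full"
    return "partial"
```

For a reference pattern with T = 40 frames, a detected range (1, 40) would be classified as a full match and (12, 25) as a partial match.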

Next, the difference between the voice recognition method of the present embodiment and conventionally known voice recognition methods is described. The region determined by the reference-pattern matrix of the present embodiment (a time-frequency pattern, which is a two-dimensional representation) is a partial region of a two-dimensional plane. The reference-pattern matrix therefore has a time axis and a frequency axis, and a region can occupy part of the time axis and part of the frequency axis. The shapes of such regions vary. Accordingly, the reference data in the present embodiment can be regarded as a regional collection of two-dimensional points in time and frequency; it is not handled in units of the per-time vectors used for speech analysis.

In conventional speech recognition research, by contrast, reference data has in most cases been evaluated in units of vectors. Examples include vector-unit matching based on temporal transitions, vector-unit quantization of the codes used in the Hidden Markov Model (one of the typical methods of speech recognition), and the construction of continuous distributions of vectors. The processing method of the voice recognition device 100 according to the present embodiment thus also differs from conventional speech recognition in how the reference data is conceived and treated.

As an exception, formant tracking in time-frequency patterns is known. This technique tracks points with large values (called formants) in a time-frequency pattern and recognizes mainly vowels based on the tracking result. However, the tracking result is only the trajectory of a sequence of points. It does not perform the voice recognition method of the present embodiment, namely detecting a partial region by optimally matching two-dimensional patterns that represent regions with each other using two-dimensional continuous dynamic programming.

In the conventional approach, as described above, the reference data is evaluated in units of vectors, so matching on the temporal change of the reference pattern is generally performed using dynamic programming. When the similarity of temporal change is evaluated in units of vectors, differences in speaking speed cause the temporal change of the reference pattern to differ from that of the input pattern, and the two patterns do not change in step. Dynamic programming is therefore used to perform nonlinear alignment along the time axis. Because the data is evaluated in units of vectors, the quantity to be aligned in the conventional approach is given by the following distance d(t, τ) between vectors.

d(t, τ) = Σ_{x=1}^{X} |f_0(t, x) - z_0(τ, x)| ... Equation 3

Here, f_0(t, x) denotes element x of the vector in the input pattern, and z_0(τ, x) denotes element x of the vector in the reference pattern. As Equation 3 shows, because the sum is taken over the absolute values of the element-wise differences (f_0(t, x) - z_0(τ, x)), the relationship between the vector-based input-pattern element f_0(t, x) and any vector-based reference-pattern element z_0(τ, y) other than z_0(τ, x) (that is, y ≠ x) is excluded from detection. The problem that the corresponding points of the vector-based reference data and the vector-based input data differ because of differences in speaking speed is solved by dynamic-programming matching (alignment), which establishes a correspondence between the time axis t and the time axis τ, and speech recognition is then performed. Speech recognition performed on a vector basis in this way is called vector-based matching and is widely used in conventional speech recognition.
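
For comparison, the conventional vector-based approach can be sketched as ordinary dynamic time warping with a frame distance of the kind Equation 3 describes (the sum of absolute element-wise differences). This is a generic textbook formulation using the step pattern (1,0), (0,1), (1,1); it is not the matching method of the present embodiment, and the function names are illustrative.

```python
def frame_distance(f_frame, z_frame):
    """Sum over x of |f0(t, x) - z0(tau, x)| for two frames of equal length."""
    return sum(abs(a - b) for a, b in zip(f_frame, z_frame))


def dtw_cost(input_frames, reference_frames):
    """Cumulative cost of the best time alignment between two frame sequences."""
    n, m = len(input_frames), len(reference_frames)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(input_frames[i - 1], reference_frames[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

An input that repeats a frame, as a slower speaker would, still aligns to the reference at zero cost, which is the time-axis flexibility this approach offers. Frequency element x of one frame is only ever compared with element x of the other, which is the limitation noted above.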

FIG. 6 is a diagram for explaining the conventional speech recognition technique based on vector-based matching. Because the vector is the unit in FIG. 6, similarity among elements within a set of vectors at different times (a vector group) is not subject to recognition. Only the similarity between individual vectors is treated as the object of recognition.

In contrast, the voice recognition method of the voice recognition device 100 according to the present embodiment is characterized in that the reference data and the input data are treated not as vector-based data but as region-based data, and matching is performed on that basis.

FIG. 7 is a diagram for explaining matching in which the reference data and the input data are treated as region-based data. In the incremental two-dimensional continuous dynamic programming used in the present embodiment, a nonlinear correspondence over regions is obtained between the two-dimensional time-frequency reference pattern and the two-dimensional time-frequency input pattern, the latter being the pattern that represents the analysis result of the input waveform.

The matching processing unit 14 of the voice recognition device 100 then obtains the value D(t, X, T, X)/(X·T) described above as a value proportional to the area of the matching region. As D(t, X, T, X)/(X·T) decreases, a dip (local depression) occurs where it reaches a local minimum. Let t* be the time at which the dip occurs. As already explained, the value of D(t, X, T, X)/(X·T) decreases from the time at which it takes its maximum within the interval [t* - 2T, t*] to the time t* at which the dip occurs. This decrease is caused by the matching region. Matching over a region of this kind does not arise in vector-based (frame-by-frame) matching such as that described with reference to FIG. 6. Region matching does not occur unless a nonlinear correspondence is established between all (time-frequency) coordinates of the reference pattern and all (time-frequency) coordinates of the input pattern. Incremental two-dimensional continuous dynamic programming is best suited to finding the dip from the trajectory of the output value D(t, X, T, X)/(X·T).

The working memory 14a provided in the matching processing unit 14 is used for the matching processing and related operations in the incremental two-dimensional continuous dynamic programming described above. In incremental two-dimensional continuous dynamic programming the input voice arrives endlessly, but the working memory 14a cannot have unlimited capacity. The memory is therefore limited to the range actually used in the calculation, and the contents of the working memory 14a are updated every frame, that is, at every time t. With this update, the cumulative local distance D(t, X, T, X) can be calculated with a fixed amount of memory even when the input voice arrives endlessly.

At time t, with the input-pattern frequency x in the range 1 ≤ x ≤ X, the reference-pattern frequency y in the range 1 ≤ y ≤ X, and the time τ in the range 1 ≤ τ ≤ T, the value of the cumulative local distance D(t, X, T, X) is calculated for one input spectrum. The range of time t that affects the calculation of D(t, X, T, X) is [t - 2T, t], which is constant. The required amount of memory is therefore 2T × X × X × T, using the maximum extent 2T of the time range, the maximum value T of the time τ, and the maximum frequency X of the reference pattern and the input pattern. This amount of memory (2T × X × X × T) is constant. When time t advances to t + 1, the whole of the cumulative local distance D(t, X, T, X) is shifted by one time step, that is, the data for time t is moved to time t - 1, and the calculation is performed using the newly input input pattern.
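
The fixed memory bound and the one-step shift can be illustrated as follows. The cell count follows the 2T × X × X × T figure given above; representing the sliding time range with `collections.deque` and a `maxlen` of 2T is an implementation choice assumed for this sketch, and the function names are hypothetical.

```python
from collections import deque


def working_memory_cells(T, X):
    """Number of stored values: 2T (time range) x X x X (frequencies) x T (tau)."""
    return 2 * T * X * X * T


def make_time_window(T):
    """Sliding window over the last 2T time steps.

    Appending a new entry to a full deque discards the oldest one, which
    plays the role of shifting the stored data by one time step.
    """
    return deque(maxlen=2 * T)
```

For example, with T = 3 and X = 4 the bound is 288 values regardless of how long the input runs; a window created with T = 2 holds at most four time steps.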

The voice recognition device and voice recognition program according to the present invention have been described in detail above, taking the voice recognition device 100 as an example. However, the voice recognition device and voice recognition program according to the present invention are not limited to the configuration of the voice recognition device 100 of the embodiment.

For example, the voice recognition device 100 according to the embodiment has been described for the case in which only one microphone 1 is provided. However, the number of microphones is not limited to one. The voice recognition device and voice recognition program according to the present invention are characterized in that sounds uttered simultaneously by multiple speakers are treated as one sound (a waveform, a single voice), and the sound corresponding to the reference data is recognized from that one sound. Unlike ICA, a conventionally known technique, there is therefore no need to separate the waveform of the mixed voice. It is sufficient that the sounds uttered by multiple speakers can be picked up as one sound (waveform); provided that they can be picked up as one sound (waveform, single voice), there may be one microphone or several.

The voice recognition device 100 according to the embodiment has also been described for the case in which the input voice is picked up by the microphone 1 and continuously converted into frequency-domain data by the FFT unit 11, so that the input voice arrives endlessly. However, the configuration is not necessarily limited to one in which the input voice arrives endlessly. The voice recognition device and voice recognition program according to the present invention are characterized in that the reference data is treated as a two-dimensional time-frequency reference pattern, the input data is treated as a two-dimensional time-frequency input pattern, and speech recognition is performed by nonlinear matching between the input pattern and the reference pattern using two-dimensional continuous dynamic programming.

Accordingly, provided that the input data can be used as a two-dimensional time-frequency input pattern, the input voice is not necessarily limited to endless voice; input based on a finite voice can also be used with the voice recognition device and program according to the present invention. When speech recognition is performed on the voices of multiple speakers, the input can of course be an endless sound (waveform) for which no start or end is assumed, but recognition can also be performed on a sound (waveform) for which a start, an end, or both are assumed.

The description has also covered the case in which artificial voice data generated with speech synthesis software or the like (waveform data of synthesized sound) is used as the reference data recorded in the reference data memory unit 13 of the voice recognition device 100 according to the embodiment. However, the reference data used in the voice recognition device and voice recognition program according to the present invention is not necessarily limited to artificial voice data. In the present invention, speech recognition is performed by nonlinear matching between the input pattern and the reference pattern. Provided that the reference data is voice data containing a sentence symbol, its data format (type) is therefore not particularly limited.

In the voice recognition device and voice recognition program according to the present invention, speech recognition is performed by nonlinear matching of two-dimensional time-frequency data patterns, so the reference data need only be a waveform rendering of the sentence symbol on which the sound is based. The reference data therefore does not depend on the type of synthesized voice (male or female), and may instead be produced by directly recording a human voice. Because speech recognition is performed by nonlinear matching between the reference data and the input data, a wide variety of voice data can be used as reference data. For the same reason, speech recognition remains possible when several unspecified speakers talk simultaneously and the number of speakers in the input data increases or decreases partway through.

In addition, the voice recognition device and the voice recognition program according to the present invention perform voice recognition by treating the input voice as a single voice, without performing waveform separation or the like on the input data. Multiple processing stages such as voice separation are therefore unnecessary, and voice recognition can be performed with a single processing scheme such as the one shown in FIG. 5. This simplifies the processing and makes it possible to suppress increases in processing load and the occurrence of processing delays.

1 ... Microphone
2 ... CPU (control means, cumulative local distance calculation means, dip detection means, difference value calculation means, reference pattern area detection means)
3 ... ROM
4 ... RAM (input pattern recording means)
5 ... Recording unit (input pattern recording means)
6 ... Display unit
10 ... A/D conversion unit
11 ... FFT unit (frequency conversion means, cumulative local distance calculation means)
12 ... Buffer memory unit (input pattern recording means)
13 ... Reference data memory unit
14 ... Matching processing unit (control means, cumulative local distance calculation means, dip detection means, difference value calculation means, reference pattern area detection means)
14a ... Working memory
15 ... Detection and determination unit (control means, voice determination means)
100 ... Voice recognition device

Claims

A voice recognition device comprising:
a cumulative local distance calculation means that uses an input pattern composed of two-dimensional data of frequency x (1 ≦ x ≦ X) and time t, obtained by converting an input voice into frequency-domain data every first predetermined time, and a reference pattern composed of two-dimensional data of frequency y (1 ≦ y ≦ X) and time τ (1 ≦ τ ≦ T), obtained by converting a reference voice into frequency-domain data every second predetermined time; calculates a normalized value of the coordinate f(t, x) of the input pattern by dividing the value of the coordinate f(t, x) by the sum of the values of the coordinate f(t, x) from x = 1 to x = X; calculates a normalized value of the coordinate z(τ, y) of the reference pattern by dividing the value of the coordinate z(τ, y) by the sum of the values of the coordinate z(τ, y) from y = 1 to y = X; performs non-linear matching between the normalized coordinate f(t, x) of the input pattern and the normalized coordinate z(τ, y) of the reference pattern by a two-dimensional continuous dynamic programming method, thereby calculating a local distance d(t, x, τ, y) that is the absolute value of the value obtained by subtracting the normalized value of the coordinate z(τ, y) from the normalized value of the coordinate f(t, x); and accumulates the calculated local distance d(t, x, τ, y) to calculate a cumulative local distance D(t, X, T, X);
a dip detection means that divides the value of the cumulative local distance D(t, X, T, X) calculated by the cumulative local distance calculation means by the product of T and X to calculate D(t, X, T, X)/(T·X), and detects a dip, indicating a valley bottom, from the change in the value of D(t, X, T, X)/(T·X) over time t;
a difference value calculation means that takes t* as the occurrence time of the dip detected by the dip detection means and calculates a difference value h by subtracting the value of D(t*, X, T, X)/(X·T) at the time t* from the maximum value of D(t, X, T, X)/(X·T) in the time interval from the time t* - 2T to the time t*;
a reference pattern area detection means that, when the difference value h calculated by the difference value calculation means is larger than a preset first threshold value, performs backtrace processing starting from the point (t*, X, T, X) of the cumulative local distance D(t*, X, T, X), calculates the value of the local distance d(t, x, τ, y) from the normalized value of the coordinate f(t, x) and the normalized value of the coordinate z(τ, y) that optimally matches the coordinate f(t, x), and detects, as a target area, the set of the coordinates z(τ, y) of the reference pattern for which the calculated value of the local distance d(t, x, τ, y) is equal to or less than a preset second threshold value; and
a voice determination means that determines that the input voice includes the reference voice when the target area is detected by the reference pattern area detection means.
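
As an illustration only, the normalization and the local distance recited for the cumulative local distance calculation means can be sketched in Python as follows. The function names `normalize_frame` and `local_distance` are hypothetical, the indices are 0-based rather than the 1-based indices of the claim, and the handling of an all-zero frame is an assumption. The two-dimensional continuous dynamic programming recurrence that accumulates d into D is not given in this passage and is deliberately left out.

```python
def normalize_frame(frame):
    """Divide each frequency-bin value of one time frame by the sum over all bins.

    `frame` is a list of non-negative spectrum values f(t, x) for x = 1..X
    (stored 0-based here).
    """
    total = sum(frame)
    if total == 0:
        # Assumption: a silent (all-zero) frame is returned as zeros
        # to avoid a division by zero; the source does not cover this case.
        return [0.0 for _ in frame]
    return [value / total for value in frame]


def normalize_pattern(pattern):
    """Normalize every time frame of a time-by-frequency pattern."""
    return [normalize_frame(frame) for frame in pattern]


def local_distance(f_norm, z_norm, t, x, tau, y):
    """Local distance d(t, x, tau, y) = |f_norm(t, x) - z_norm(tau, y)|."""
    return abs(f_norm[t][x] - z_norm[tau][y])
```

Dividing each frame by its own sum makes the comparison depend on the shape of the spectrum within a frame rather than on its overall level, which is consistent with matching an input against reference voices recorded at a different volume.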

The voice recognition device according to claim 1, wherein
when the target area is detected, the reference pattern area detection means detects the minimum time τs and the maximum time τe of the coordinates z(τ, y) constituting the target area, and
the voice determination means determines that the portion of the reference voice within the time range from the time τs to the time τe is included in the input voice.
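
The target area selection and the times τs and τe can be sketched as below, under stated assumptions. The list `matches` of optimally matched index tuples (t, x, τ, y) stands in for the result of the backtrace processing, which is not reproduced here; that input format and the function names `detect_target_area` and `reference_time_range` are hypothetical. Indices are 0-based.

```python
def detect_target_area(matches, f_norm, z_norm, second_threshold):
    """Collect reference coordinates (tau, y) whose local distance is small.

    `matches` is assumed to be an iterable of (t, x, tau, y) tuples produced
    by a backtrace step that is outside the scope of this sketch.
    """
    area = set()
    for t, x, tau, y in matches:
        d = abs(f_norm[t][x] - z_norm[tau][y])  # local distance d(t, x, tau, y)
        if d <= second_threshold:
            area.add((tau, y))
    return area


def reference_time_range(area):
    """Return (tau_s, tau_e), the minimum and maximum time in the target area.

    Returns None when no target area was detected.
    """
    if not area:
        return None
    taus = [tau for tau, _y in area]
    return min(taus), max(taus)
```

In this sketch an empty target area means that the reference voice is judged not to be present, and a non-empty one yields the span of the reference voice judged to be contained in the input.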

The voice recognition device according to claim 1 or 2, further comprising:
a frequency conversion means that converts the input voice into frequency-domain data every first predetermined time; and
an input pattern recording means in which the frequency-domain data converted by the frequency conversion means is normalized and recorded in the order of the converted time, wherein
each time new data is generated by the frequency conversion means, the input pattern recording means erases the oldest data recorded in the input pattern recording means and records the new data next to the data generated immediately before the new data, and
the cumulative local distance calculation means calculates the cumulative local distance D(t, X, T, X) by reading out all the data recorded in the input pattern recording means while maintaining the recording order and using the data as the input pattern.
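
The recording behavior described for the input pattern recording means resembles a fixed-length first-in, first-out buffer. A minimal sketch using Python's `collections.deque` follows; the class name `InputPatternBuffer` and its `capacity` parameter are hypothetical, and the sketch assumes that the oldest frame is discarded only once the buffer is full.

```python
from collections import deque


class InputPatternBuffer:
    """Fixed-length buffer of normalized frequency-domain frames."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest item automatically
        # when a new item is appended to a full buffer.
        self._frames = deque(maxlen=capacity)

    def push(self, normalized_frame):
        """Record a new frame after the most recently recorded one."""
        self._frames.append(list(normalized_frame))

    def input_pattern(self):
        """Return all recorded frames, oldest first, as the input pattern."""
        return [list(frame) for frame in self._frames]
```

Reading the whole buffer in recorded order at each step gives the matching stage a sliding window over the most recent input, so the amount of data to match stays bounded however long the device runs.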

A voice recognition program for a voice recognition device that determines whether or not a reference voice is included in an input voice, the program causing a control means of the voice recognition device to realize:
a cumulative local distance calculation function that uses an input pattern composed of two-dimensional data of frequency x (1 ≦ x ≦ X) and time t, obtained by converting the input voice into frequency-domain data every first predetermined time, and a reference pattern composed of two-dimensional data of frequency y (1 ≦ y ≦ X) and time τ (1 ≦ τ ≦ T), obtained by converting the reference voice into frequency-domain data every second predetermined time; calculates a normalized value of the coordinate f(t, x) of the input pattern by dividing the value of the coordinate f(t, x) by the sum of the values of the coordinate f(t, x) from x = 1 to x = X; calculates a normalized value of the coordinate z(τ, y) of the reference pattern by dividing the value of the coordinate z(τ, y) by the sum of the values of the coordinate z(τ, y) from y = 1 to y = X; performs non-linear matching between the normalized coordinate f(t, x) of the input pattern and the normalized coordinate z(τ, y) of the reference pattern by a two-dimensional continuous dynamic programming method, thereby calculating a local distance d(t, x, τ, y) that is the absolute value of the value obtained by subtracting the normalized value of the coordinate z(τ, y) from the normalized value of the coordinate f(t, x); and accumulates the calculated local distance d(t, x, τ, y) to calculate a cumulative local distance D(t, X, T, X);
a dip detection function that divides the value of the cumulative local distance D(t, X, T, X) calculated by the cumulative local distance calculation function by the product of T and X to calculate D(t, X, T, X)/(T·X), and detects a dip, indicating a valley bottom, from the change in the value of D(t, X, T, X)/(T·X) over time t;
a difference value calculation function that takes t* as the occurrence time of the dip detected by the dip detection function and calculates a difference value h by subtracting the value of D(t*, X, T, X)/(X·T) at the time t* from the maximum value of D(t, X, T, X)/(X·T) in the time interval from the time t* - 2T to the time t*;
a reference pattern area detection function that, when the difference value h calculated by the difference value calculation function is larger than a preset first threshold value, performs backtrace processing starting from the point (t*, X, T, X) of the cumulative local distance D(t*, X, T, X), calculates the value of the local distance d(t, x, τ, y) from the normalized value of the coordinate f(t, x) and the normalized value of the coordinate z(τ, y) that optimally matches the coordinate f(t, x), and detects, as a target area, the set of the coordinates z(τ, y) of the reference pattern for which the calculated value of the local distance d(t, x, τ, y) is equal to or less than a preset second threshold value; and
a voice determination function that determines that the input voice includes the reference voice when the target area is detected by the reference pattern area detection function.
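
The dip detection and difference value calculation recited above can be sketched as follows, as one possible reading rather than the method fixed by the claims. The sketch assumes that a dip is a simple local minimum of the normalized sequence g(t) = D(t, X, T, X)/(T·X), and that the window from t* - 2T to t* is clamped at the start of the sequence; both are assumptions, as are all function names. Indices are 0-based.

```python
def normalize_cumulative(cumulative, T, X):
    """Divide each cumulative local distance D(t, X, T, X) by T * X."""
    return [value / (T * X) for value in cumulative]


def find_dips(g):
    """Return the times t at which g has a local minimum (assumed dip definition)."""
    return [t for t in range(1, len(g) - 1) if g[t - 1] > g[t] <= g[t + 1]]


def difference_value(g, t_star, T):
    """h = max of g over [t* - 2T, t*] minus g(t*)."""
    start = max(0, t_star - 2 * T)  # assumption: clamp the window at time 0
    return max(g[start:t_star + 1]) - g[t_star]


def candidate_dips(g, T, first_threshold):
    """Keep only the dips whose difference value h exceeds the first threshold."""
    return [t for t in find_dips(g) if difference_value(g, t, T) > first_threshold]
```

The difference value h measures how deep a valley is relative to the preceding stretch of length 2T, so shallow fluctuations of the normalized distance are not passed on to the backtrace step.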

The voice recognition program according to claim 4, wherein the program causes the control means to:
detect, in the reference pattern area detection function, the minimum time τs and the maximum time τe of the coordinates z(τ, y) constituting the target area when the target area is detected; and
determine, in the voice determination function, that the portion of the reference voice within the time range from the time τs to the time τe is included in the input voice.