JP6728083B2

JP6728083B2 - Intermediate feature amount calculation device, acoustic model learning device, speech recognition device, intermediate feature amount calculation method, acoustic model learning method, speech recognition method, program

Info

Publication number: JP6728083B2
Application number: JP2017021565A
Authority: JP
Inventors: 崇史森谷; 太一浅見
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2017-02-08
Filing date: 2017-02-08
Publication date: 2020-07-22
Anticipated expiration: 2037-02-08
Also published as: JP2018128574A

Description

The present invention relates to speech recognition technology, and in particular to technology for performing speech recognition with an acoustic model learned using a neural network.

Model learning using neural networks is performed in various fields such as speech recognition and image recognition. This section first outlines the deep neural network (DNN), a representative neural network model (see FIG. 1).

A DNN consists of an input layer (layer 0), K hidden layers (layers 1 to K, where K is an integer of 1 or more), and an output layer (layer K+1). Let the feature amount $x^k$ input to layer k (0 ≤ k ≤ K+1) of the DNN be an $n_k$-dimensional vector, and let the feature amount $x^{K+2}$ output from layer K+1 (that is, the output layer) be an $n_{K+2}$-dimensional vector. Layer k then receives the $n_k$-dimensional feature amount $x^k$ and outputs the $n_{k+1}$-dimensional feature amount $x^{k+1}$. The output feature amount $x^{k+1}$ of layer k is the input feature amount of layer k+1. Layer K is called the final hidden layer.

In general, each layer of a DNN performs a linear transformation followed by a nonlinear transformation. Let $W^k$ be the weight matrix and $b^k$ the bias vector that characterize the linear transformation in layer k (0 ≤ k ≤ K+1), and let f be the nonlinear transformation (f is also called the activation function). The output feature amount $x^{k+1}$ is then expressed in terms of the input feature amount $x^k$ as follows:

$$x^{k+1} = f\left(W^k x^k + b^k\right) \tag{1}$$

Here, $W^k$ is an $n_{k+1} \times n_k$ matrix and $b^k$ is an $n_{k+1}$-dimensional vector.

Further, f is a nonlinear transformation from an $n_{k+1}$-dimensional vector to an $n_{k+1}$-dimensional vector. In the hidden layers, the sigmoid function is used. In this case, letting $x_i^{k+1/2}$ be the i-th component of the linearly transformed feature amount $x^{k+1/2} = W^k x^k + b^k$ and $x_i^{k+1}$ be the i-th component of the output feature amount $x^{k+1}$, the following holds (1 ≤ i ≤ $n_{k+1}$):

$$x_i^{k+1} = \frac{1}{1 + \exp\left(-x_i^{k+1/2}\right)} \tag{2}$$

In the output layer, on the other hand, the softmax function is used. In this case, letting $x_i^{(K+1)+1/2}$ be the i-th component of the linearly transformed feature amount $x^{(K+1)+1/2}$ and $x_i^{K+2}$ be the i-th component of the output feature amount $x^{K+2}$, the following holds (1 ≤ i ≤ $n_{K+2}$):

$$x_i^{K+2} = \frac{\exp\left(x_i^{(K+1)+1/2}\right)}{\sum_{j=1}^{n_{K+2}} \exp\left(x_j^{(K+1)+1/2}\right)} \tag{3}$$

The i-th component $x_i^{K+2}$ of the output feature amount $x^{K+2}$ is the output of the i-th unit of the output layer. For each i, 0 ≤ $x_i^{K+2}$ ≤ 1, and $\sum_i x_i^{K+2} = 1$ holds. The output feature amount $x^{K+2}$ from the output layer is therefore called the output probability distribution. Because $x^{K+2}$ is a probability distribution, it is also written as the output probability distribution p, that is, $p = (p_1, \ldots, p_{n_{K+2}}) = (x_1^{K+2}, \ldots, x_{n_{K+2}}^{K+2})$. (Note that _ (underscore) denotes a subscript: for example, x^y_z means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.)
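The layer-by-layer computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patent's implementation: the function names (`affine`, `sigmoid`, `softmax`, `forward`), the toy weights, and the choice to apply the sigmoid to every layer except the last are assumptions made for the example.

```python
import math

def affine(W, b, x):
    """Linear transformation W x + b, with W stored as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def sigmoid(z):
    """Element-wise sigmoid, used here for every layer except the last."""
    return [1.0 / (1.0 + math.exp(-z_i)) for z_i in z]

def softmax(z):
    """Softmax for the output layer; subtracting max(z) avoids overflow."""
    z_max = max(z)
    exps = [math.exp(z_i - z_max) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(layers, x):
    """Apply each (W, b) pair in order and return the output distribution."""
    for index, (W, b) in enumerate(layers):
        z = affine(W, b, x)
        x = softmax(z) if index == len(layers) - 1 else sigmoid(z)
    return x

# Toy network (assumed values): 3 inputs, 2 sigmoid units, 3 softmax outputs.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.9]], [0.0, 0.0, 0.0]),
]
p = forward(layers, [1.0, 2.0, 3.0])
```

Each weight matrix has as many rows as the layer has outputs and as many columns as it has inputs, which matches the $n_{k+1} \times n_k$ shape stated for $W^k$; the returned `p` is non-negative and sums to 1.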

For a DNN, learning a model means learning the weight matrices and bias vectors. In other words, the model learned by a DNN consists of the parameters that characterize the DNN.

When learning an acoustic model for speech recognition, the input feature amount $x^0$ of the DNN (that is, the feature amount input to the input layer) is the speech feature amount of speech data, and the output feature amount $x^{K+2}$ of the DNN (that is, the feature amount output from the output layer) is the probability distribution p over phonemes (phoneme states), which are the output symbols of speech. The number of units in the output layer equals the number of phonemes, and the probability output from the i-th unit is the probability that the input speech feature amount corresponds to the i-th phoneme.

If the phoneme corresponding to the number of the output-layer unit with the highest probability is taken as the phoneme corresponding to the input speech feature amount, the DNN learns, as the acoustic model, a model that takes a speech feature amount as input and outputs a phoneme. The acoustic model learned here consists of the weight matrix and bias vector of each layer.
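Selecting the phoneme from the output distribution amounts to an argmax over the output units. The helper below is a hypothetical sketch; the name `most_probable_phoneme` and the example probabilities are not from the source, and phoneme numbers are taken to start at 1.

```python
def most_probable_phoneme(p):
    """Return the 1-based number of the unit with the highest probability."""
    best_index = max(range(len(p)), key=lambda i: p[i])
    return best_index + 1

# Example distribution over three phonemes (made-up values).
phoneme_number = most_probable_phoneme([0.1, 0.7, 0.2])
```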

A speech recognition device built with this acoustic model therefore generates a phoneme sequence from the speech feature amounts of the speech data to be recognized (recognition speech data) and produces a speech recognition result.

This concludes the outline of the DNN, including its application to speech recognition.

In general, a neural network can also be regarded as a function that takes a vector as input and outputs a vector.

The following describes how the neural networks of Non-Patent Documents 1 to 3 learn an acoustic model (that is, learn the parameters that characterize the neural network). First, the acoustic model learning device 700 of Non-Patent Document 1 is described with reference to FIGS. 12 and 13. FIG. 12 is a block diagram showing the configuration of the acoustic model learning device 700. FIG. 13 is a flowchart showing the operation of the acoustic model learning device 700. As shown in FIG. 12, the acoustic model learning device 700 includes a speech intermediate feature amount calculation unit 710, a phoneme probability distribution calculation unit 720, a parameter optimization unit 730, and a recording unit 790.

The recording unit 790 is a component that records, as needed, the information required for the processing of the acoustic model learning device 700. For example, the initial values of the parameters that make up the acoustic model (acoustic model parameters) are recorded in advance. The acoustic model parameters generated during learning are also recorded as needed.

The initial values of the acoustic model parameters may be generated from random numbers, or parameters learned from speech data other than the speech data used in the current learning may be used.

Let M be the number of phonemes. Each phoneme is assigned a number from 1 to M (hereinafter, the phoneme number), and each phoneme is identified by its phoneme number m (1 ≤ m ≤ M). The phoneme with phoneme number m is called the m-th phoneme.

Before learning starts, speech feature amounts are extracted from the speech data used as learning data (learning speech data). A speech feature amount is represented as a vector. A correct phoneme number, which is the phoneme number identifying the phoneme (correct phoneme) that corresponds to the speech feature amount, is prepared along with it. The input to the acoustic model learning device 700 is thus a pair of a speech feature amount and a correct phoneme number. Such a pair is called training data.

The acoustic model learning device 700 learns an acoustic model from the training data, that is, from pairs of a speech feature amount and a correct phoneme number.

The speech intermediate feature amount calculation unit 710 is a component that executes the DNN computation from the input layer to the final hidden layer. The phoneme probability distribution calculation unit 720 is a component that executes the computation of the DNN output layer. The acoustic model learned by the acoustic model learning device 700 therefore consists of the DNN parameters that characterize the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720.

Before learning starts, the acoustic model learning device 700 sets the initial values of the acoustic model parameters recorded in the recording unit 790 in the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720. During learning, each time the parameter optimization unit 730 optimizes the acoustic model parameters, the acoustic model learning device 700 sets the newly calculated parameters in the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720. The next training data item is thus processed with the DNN (the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720) characterized by the newly calculated acoustic model parameters.

The operation of the acoustic model learning device 700 is described with reference to FIG. 13. The speech intermediate feature amount calculation unit 710 calculates, from the input speech feature amount, a speech intermediate feature amount, which is an intermediate feature amount for phoneme identification (S710). The speech intermediate feature amount is the feature amount used to calculate the phoneme probability distribution $p = (p_1, \ldots, p_M)$, the distribution of the probabilities $p_m$ that the phoneme corresponding to the input speech feature amount is the phoneme with phoneme number m (1 ≤ m ≤ M). As described above, the speech intermediate feature amount calculation unit 710 executes the DNN computation from the input layer to the final hidden layer, so the speech intermediate feature amount is the output feature amount of the final hidden layer of the DNN being learned.

The phoneme probability distribution calculation unit 720 calculates the phoneme probability distribution from the speech intermediate feature amount calculated in S710 (S720). As described above, the phoneme probability distribution calculation unit 720 executes the computation of the DNN output layer, so the phoneme probability distribution is the output feature amount of the output layer of the DNN being learned. Assuming that the phoneme with phoneme number m (the m-th phoneme) corresponds to the m-th unit of the output layer, the phoneme probability distribution is the distribution $p = (p_1, \ldots, p_M)$ formed by the probabilities $p_m$ output from the m-th unit of the output layer (the values calculated by Equation (3)).

The parameter optimization unit 730 optimizes the acoustic model parameters using the phoneme probability distribution calculated in S720 and the input correct phoneme number (S730). For example, the acoustic model parameters are optimized so as to decrease the value of the loss function C defined by the following equation:

$$C = -\sum_{m=1}^{M} d_m \log p_m$$

Here, $p = (p_1, \ldots, p_M)$ is the phoneme probability distribution and $d = (d_1, \ldots, d_M)$ is the correct probability distribution defined by the following equation:

$$d_m = \begin{cases} 1 & \text{if } m \text{ is the correct phoneme number} \\ 0 & \text{otherwise} \end{cases}$$

The loss function C is called the cross entropy. It is defined between two probability distributions and measures the discrepancy between them.
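As a worked illustration, the sketch below evaluates the cross entropy between a one-hot correct distribution d and a phoneme probability distribution p, assuming the standard form $C = -\sum_m d_m \log p_m$. The function names and the small floor `eps` that guards against `log(0)` are assumptions added for the example.

```python
import math

def one_hot(correct_number, M):
    """Correct distribution d: 1 at the correct phoneme number, 0 elsewhere."""
    return [1.0 if m == correct_number else 0.0 for m in range(1, M + 1)]

def cross_entropy(d, p, eps=1e-12):
    """C = -sum_m d_m * log(p_m); eps is an assumed guard against log(0)."""
    return -sum(d_m * math.log(max(p_m, eps)) for d_m, p_m in zip(d, p))

p = [0.1, 0.7, 0.2]          # made-up phoneme probability distribution
d = one_hot(2, len(p))       # correct phoneme number is 2
loss = cross_entropy(d, p)   # equals -log(0.7) because d is one-hot
```

Because d is one-hot, the loss reduces to the negative log probability assigned to the correct phoneme, so it shrinks toward 0 as that probability approaches 1.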

In general, the number of training data items, each a pair of a speech feature amount and a correct phoneme number, is very large, on the order of tens of millions to hundreds of millions. To optimize the acoustic model parameters efficiently from such a large amount of training data, Equation (4) of Non-Patent Document 1 may be used, for example.

The acoustic model learning device 700 repeats the processing of S710 to S730 once for each training data item and outputs the finally calculated acoustic model parameters as the learning result.
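The repeated S710 to S730 cycle can be illustrated with a deliberately reduced sketch that updates only a softmax output layer by plain stochastic gradient descent on fixed, made-up intermediate features. This is an assumption-laden toy: it does not reproduce Equation (4) of Non-Patent Document 1, it omits the hidden layers, and the names `sgd_step`, `lr`, and the training pairs are hypothetical.

```python
import math

def softmax(z):
    z_max = max(z)
    exps = [math.exp(z_i - z_max) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(W, b, h, correct_number, lr=0.1):
    """One update of the output layer for one training pair; returns the loss.

    h is the intermediate feature; W and b are modified in place.
    For softmax with cross entropy, dC/dz_m = p_m - d_m.
    """
    z = [sum(w * x for w, x in zip(row, h)) + b_m for row, b_m in zip(W, b)]
    p = softmax(z)
    for m in range(len(W)):
        grad = p[m] - (1.0 if m + 1 == correct_number else 0.0)
        for n in range(len(h)):
            W[m][n] -= lr * grad * h[n]
        b[m] -= lr * grad
    return -math.log(p[correct_number - 1])

# Two phonemes, two-dimensional intermediate features (made-up data).
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
training_data = [([1.0, 0.0], 1), ([0.0, 1.0], 2)] * 50

# Process each training pair once, in order, with the latest parameters.
losses = [sgd_step(W, b, h, m) for h, m in training_data]
```

Each call uses the parameters left by the previous call, mirroring the statement that the next training data item is processed with the newly calculated parameters; the per-item loss falls as the loop proceeds.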

Next, the acoustic model learning device 800 of Non-Patent Document 2 is described with reference to FIGS. 14 and 15. In addition to the DNN used by the acoustic model learning device 700, the acoustic model learning device 800 uses a convolutional neural network (CNN), a neural network widely used in image recognition. A CNN consists of an input layer, convolutional layers, and pooling layers.

FIG. 14 is a block diagram showing the configuration of the acoustic model learning device 800. FIG. 15 is a flowchart showing the operation of the acoustic model learning device 800. As shown in FIG. 14, the acoustic model learning device 800 includes a noise-resistant intermediate feature amount calculation unit 810, a speech intermediate feature amount calculation unit 710, a phoneme probability distribution calculation unit 720, a parameter optimization unit 730, and a recording unit 790.

The acoustic model learning device 800 learns an acoustic model from the training data, that is, from pairs of a speech feature amount and a correct phoneme number.

The speech intermediate feature amount calculation unit 710 is a component that executes the DNN computation from the input layer to the final hidden layer. The phoneme probability distribution calculation unit 720 is a component that executes the computation of the DNN output layer. The noise-resistant intermediate feature amount calculation unit 810 is a component that executes the CNN computation. The acoustic model learned by the acoustic model learning device 800 therefore includes the DNN parameters that characterize the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720, and the CNN parameters that characterize the noise-resistant intermediate feature amount calculation unit 810.

Before learning starts, the acoustic model learning device 800 sets the initial values of the acoustic model parameters recorded in the recording unit 790 in the noise-resistant intermediate feature amount calculation unit 810, the speech intermediate feature amount calculation unit 710, and the phoneme probability distribution calculation unit 720. During learning, each time the parameter optimization unit 730 optimizes the acoustic model parameters, the acoustic model learning device 800 sets the newly calculated parameters in the noise-resistant intermediate feature amount calculation unit 810, the speech intermediate feature amount calculation unit 710, and the phoneme probability distribution calculation unit 720.

The operation of the acoustic model learning device 800 is described with reference to FIG. 15. The noise-resistant intermediate feature amount calculation unit 810 calculates, from the input speech feature amount, a noise-resistant intermediate feature amount, which is the feature amount of the target sound contained in the speech data from which the speech feature amount was extracted (S810). The noise-resistant intermediate feature amount is the feature amount corresponding to the target sound in noise-superimposed speech data, calculated from the speech feature amount as follows. First, a two-dimensional image of time versus log power spectrum is generated from the speech feature amount. Next, the noise-resistant intermediate feature amount is calculated from this two-dimensional image using the CNN.
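The two-step computation described above (form a time by log-power-spectrum image, then apply a CNN) can be sketched with one convolution and one max-pooling step in plain Python. The image values, the 2×2 kernel, the pooling size, and the function names are all assumptions for illustration; a real CNN would have learned kernels, several feature maps, and an activation function. As in most CNN libraries, the "convolution" here is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(feature_map, size):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]

# Toy image: rows are time frames, columns are frequency bins (made-up values).
log_power = [
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 2.0, 3.0, 2.0],
    [0.0, 1.0, 2.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
]
kernel = [[1.0, 0.0], [0.0, 1.0]]          # assumed 2x2 kernel

feature_map = conv2d_valid(log_power, kernel)   # 3x3 map
pooled = max_pool(feature_map, 2)               # 1x1 map
# Flatten to a vector so it can be fed to the layers that follow.
noise_resistant_feature = [v for row in pooled for v in row]
```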

The speech intermediate feature amount calculation unit 710 calculates, from the noise-resistant intermediate feature amount calculated in S810, a speech intermediate feature amount, which is an intermediate feature amount for phoneme identification (S710). It differs from the corresponding unit of the acoustic model learning device 700 in that its input is the noise-resistant intermediate feature amount, from which the influence of noise has been removed, rather than a speech feature amount extracted with the influence of noise still present; its operation is otherwise the same. That is, the speech intermediate feature amount calculation unit 710 uses the neural network formed by the DNN layers from the input layer to the final hidden layer to calculate, from the noise-resistant intermediate feature amount, the speech intermediate feature amount, which is the output feature amount of the final hidden layer.

The phoneme probability distribution calculation unit 720 calculates the phoneme probability distribution from the speech intermediate feature amount calculated in S710 (S720).

The parameter optimization unit 730 optimizes the acoustic model parameters using the phoneme probability distribution calculated in S720 and the input correct phoneme number (S730).

Speech recognition using the acoustic model learned by the acoustic model learning device 800 has been confirmed to achieve high accuracy on speech data that contains noise. In other words, the acoustic model learned by the acoustic model learning device 800 is a noise-resistant acoustic model.

Next, the acoustic model learning device 900 of Non-Patent Document 3 is described with reference to FIGS. 16 and 17. Like the acoustic model learning device 800, the acoustic model learning device 900 uses a DNN and a CNN. The difference lies in how the two neural networks are connected. The acoustic model learning device 800 uses a neural network in which the CNN and the DNN are connected in series, in that order, whereas the acoustic model learning device 900 uses a neural network in which the CNN and the DNN are connected in parallel.

FIG. 16 is a block diagram showing the configuration of the acoustic model learning device 900. FIG. 17 is a flowchart showing the operation of the acoustic model learning device 900. As shown in FIG. 16, the acoustic model learning device 900 includes a speech intermediate feature amount calculation unit 710, a noise-resistant intermediate feature amount calculation unit 810, an intermediate feature amount combining unit 910, a phoneme probability distribution calculation unit 720, a parameter optimization unit 730, and a recording unit 790.

The acoustic model learning device 900 learns an acoustic model from the training data, that is, from pairs of a speech feature amount and a correct phoneme number.

Like the acoustic model learned by the acoustic model learning device 800, the acoustic model learned by the acoustic model learning device 900 includes the DNN parameters that characterize the speech intermediate feature amount calculation unit 710 and the phoneme probability distribution calculation unit 720, and the CNN parameters that characterize the noise-resistant intermediate feature amount calculation unit 810.

Before learning starts, the acoustic model learning device 900 sets the initial values of the acoustic model parameters recorded in the recording unit 790 in the noise-resistant intermediate feature amount calculation unit 810, the speech intermediate feature amount calculation unit 710, and the phoneme probability distribution calculation unit 720. During learning, each time the parameter optimization unit 730 optimizes the acoustic model parameters, the acoustic model learning device 900 sets the newly calculated parameters in the noise-resistant intermediate feature amount calculation unit 810, the speech intermediate feature amount calculation unit 710, and the phoneme probability distribution calculation unit 720.

The operation of the acoustic model learning device 900 is described with reference to FIG. 17. The speech intermediate feature amount calculation unit 710 calculates, from the input speech feature amount, a speech intermediate feature amount that forms part of the intermediate feature amount for phoneme identification (S710).

The noise-resistant intermediate feature amount calculation unit 810 calculates, from the input speech feature amount, a noise-resistant intermediate feature amount that forms part of the intermediate feature amount for phoneme identification (S810).

The intermediate feature amount combining unit 910 generates a combined intermediate feature amount from the speech intermediate feature amount calculated in S710 and the noise-resistant intermediate feature amount calculated in S810 (S910). The combined intermediate feature amount is the vector obtained by concatenating the speech intermediate feature amount and the noise-resistant intermediate feature amount, both of which are vectors.

The phoneme probability distribution calculation unit 720 calculates the phoneme probability distribution from the combined intermediate feature amount generated in S910 (S720).
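The difference between the serial connection of device 800 and the parallel connection of device 900 can be sketched as two ways of composing the same two branches. The branch functions below are hypothetical stand-ins (a mean and a maximum), not neural networks, and the function names are assumptions; the point is only how the outputs are wired before the output layer.

```python
def serial_features(x, cnn, dnn_hidden):
    """Device-800 style: the CNN output is the input of the DNN hidden layers."""
    return dnn_hidden(cnn(x))

def parallel_features(x, cnn, dnn_hidden):
    """Device-900 style: both branches receive x; outputs are concatenated."""
    speech_intermediate = dnn_hidden(x)
    noise_resistant_intermediate = cnn(x)
    return list(speech_intermediate) + list(noise_resistant_intermediate)

# Hypothetical stand-ins for the two branches.
def toy_dnn_hidden(x):
    return [sum(x) / len(x)]

def toy_cnn(x):
    return [max(x)]

x = [1.0, 2.0, 3.0]
serial = serial_features(x, toy_cnn, toy_dnn_hidden)
combined = parallel_features(x, toy_cnn, toy_dnn_hidden)
```

In the parallel case both branches are evaluated for every input, which is consistent with the higher computation time attributed to this structure later in the text.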

For the same amount of learning data, speech recognition using the acoustic model learned by the acoustic model learning device 900 has been confirmed to be more accurate than speech recognition using the acoustic model learned by the acoustic model learning device 700 or the acoustic model learning device 800.

Non-Patent Document 1: Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, Brian Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, IEEE Signal Processing Magazine, Vol.29, No.6, pp.82-97, 2012.
Non-Patent Document 2: Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, Dong Yu, “Convolutional Neural Networks for Speech Recognition”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol.22, No.10, pp.1533-1545, 2014.
Non-Patent Document 3: Hagen Soltau, George Saon, Tara N. Sainath, “Joint Training of Convolutional and Non-Convolutional Neural Networks”, IEEE ICASSP, 2014.

Speech recognition using an acoustic model learned by the method of Non-Patent Document 1 has the problem that recognition accuracy for noise-superimposed speech is markedly lower than for noise-free speech.

Speech recognition using an acoustic model learned by the method of Non-Patent Document 2 is more noise-resistant than that of Non-Patent Document 1. However, learning the acoustic model takes more computation time than with the method of Non-Patent Document 1. In addition, for speech with little noise, recognition takes more computation time even though the accuracy differs little from that of speech recognition using an acoustic model learned by the method of Non-Patent Document 1.

In the method of Non-Patent Document 3, regardless of whether the speech used for learning contains noise, its speech feature amount is input to both the speech intermediate feature amount calculation unit 710, which corresponds to DNN learning, and the noise-resistant intermediate feature amount calculation unit 810, which corresponds to CNN learning, and is used for learning. As a result, speech recognition using the resulting acoustic model is more accurate, but the computation time required for learning increases. The computation time required for recognition also increases.

There are also cases in which the acoustic model is learned from speech that contains no noise, but the speech to be recognized at recognition time does contain noise. When, as in this case, the speech to be recognized is of a different type from the speech used for learning, recognition accuracy may drop. One conceivable way to address this is to prepare multiple types of speech as learning data and perform speech recognition with an acoustic model learned by the method of Non-Patent Document 1 or Non-Patent Document 2. However, this yields lower recognition accuracy than speech recognition with acoustic models learned separately for each type of speech.

以上述べたように、非特許文献１〜３の方法では、認識処理に必要な計算時間を抑制しつつ雑音の有無に関わらず高精度な音声認識を実現する音響モデルを学習することは難しい。 As described above, with the methods of Non-Patent Documents 1 to 3, it is difficult to learn an acoustic model that realizes highly accurate speech recognition regardless of the presence or absence of noise while suppressing the calculation time required for recognition processing.

Therefore, an object of the present invention is to provide a technique for calculating an intermediate feature amount used to learn an acoustic model that realizes highly accurate voice recognition regardless of the type of learning voice data while suppressing the calculation time required for recognition processing.

One aspect of the present invention is an intermediate feature amount calculation device in which the number of types into which voice data is classified is J, a number for identifying each type is a type number, the number of phonemes is M, and a number for identifying each phoneme is a phoneme number. The device includes: a type identification unit that determines, from a voice feature amount, a type number j′ (where j′ is an integer satisfying 1≦j′≦J) corresponding to the type of the voice data from which the voice feature amount was extracted; and a phoneme intermediate feature amount calculation unit that calculates, from the voice feature amount and the type number j′, a j′-th type intermediate feature amount as a phoneme intermediate feature amount, the j′-th type intermediate feature amount being a feature amount used to calculate a phoneme probability distribution p=(p_1,…,p_M), which is the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M). The phoneme intermediate feature amount calculation unit includes, for each integer j satisfying 1≦j≦J, a j-th type intermediate feature amount calculation unit that uses a neural network to calculate a j-th type intermediate feature amount from a voice feature amount extracted from voice data of type number j.

According to the present invention, it is possible to calculate an intermediate feature amount used to learn an acoustic model that realizes highly accurate voice recognition regardless of the type of learning voice data while suppressing the calculation time required for recognition processing.

A diagram showing an example of a DNN.
A diagram showing an example of the configuration of the acoustic model learning device 100.
A diagram showing an example of the operation of the acoustic model learning device 100.
A diagram showing an example of the configuration of the type identification unit 110.
A diagram showing an example of the operation of the type identification unit 110.
A diagram showing an example of the configuration of the phoneme intermediate feature amount calculation unit 120.
A diagram showing an example of the operation of the phoneme intermediate feature amount calculation unit 120.
A diagram showing an example of the configuration of the voice recognition device 200.
A diagram showing an example of the operation of the voice recognition device 200.
A diagram showing an example of the configuration of the voice recognition unit 220.
A diagram showing an example of the operation of the voice recognition unit 220.
A diagram showing an example of the configuration of the acoustic model learning device 700.
A diagram showing an example of the operation of the acoustic model learning device 700.
A diagram showing an example of the configuration of the acoustic model learning device 800.
A diagram showing an example of the operation of the acoustic model learning device 800.
A diagram showing an example of the configuration of the acoustic model learning device 900.
A diagram showing an example of the operation of the acoustic model learning device 900.

以下、本発明の実施の形態について、詳細に説明する。なお、同じ機能を有する構成部には同じ番号を付し、重複説明を省略する。 Hereinafter, embodiments of the present invention will be described in detail. It should be noted that components having the same function are denoted by the same reference numeral, and redundant description will be omitted.

以下、各実施形態で用いる用語について簡単に説明する。 Hereinafter, terms used in each embodiment will be briefly described.

音声データとは、音響モデルの学習や音声認識に用いるため、あらかじめ収録しておく音声データのことである。音声データは、例えば話者が発話した文章の音声である。また、音声データは、例えばサンプリング周波数１６ｋＨｚで離散値化されたデジタルデータである。 The voice data is voice data that is recorded in advance because it is used for learning an acoustic model and voice recognition. The voice data is, for example, the voice of a sentence spoken by the speaker. Further, the audio data is, for example, digital data digitized at a sampling frequency of 16 kHz.

音声特徴量とは、音声データから抽出した特徴量であり、例えば、音声データを分割したフレーム（通常20ms〜40ms程度）ごとに抽出されるFBANK（フィルタバンク対数パワー）などがある。なお、音声特徴量は一般にベクトルとして表現される。 The voice feature amount is a feature amount extracted from voice data, and includes, for example, FBANK (filter bank logarithmic power) extracted for each frame (normally about 20 ms to 40 ms) into which voice data is divided. The voice feature amount is generally expressed as a vector.
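The framing step mentioned above can be sketched as follows. This is a minimal illustration only: the 25 ms frame length is one value inside the 20 ms to 40 ms range given in the text, while the 10 ms frame shift and the function name `split_into_frames` are assumptions that do not appear in the source. The FBANK computation itself is not shown.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a sequence of samples into fixed-length, overlapping frames.

    frame_ms and hop_ms are illustrative assumptions, not values from the text
    (the text only says frames are usually about 20 ms to 40 ms long).
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    hop_len = sample_rate * hop_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    # Keep only frames that fit entirely inside the signal.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of silence sampled at 16 kHz.
one_second = [0.0] * 16000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))
```

One voice feature amount (a vector) would then be extracted from each frame, so the number of frames determines the number of feature vectors passed to the later stages.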

The type of voice is a category used to classify the learning voice data used when learning an acoustic model. For example, when classification is based on the presence or absence of noise, the learning voice data is classified into two types: voice that contains noise and voice that does not. The learning voice data can also be classified into three types: spoken-language voice, read-aloud voice, and conference voice. It can also be classified into two types: male voice and female voice. Furthermore, noise need not be divided only by presence or absence; when it is divided by noise level, for example into high, medium, and low, the learning voice data can be classified into three types.

<First embodiment>
[Acoustic model learning device 100]
Hereinafter, the acoustic model learning device 100 will be described with reference to FIGS. 2 to 7. As shown in FIG. 2, the acoustic model learning device 100 includes a type identification unit 110, a phoneme intermediate feature amount calculation unit 120, a phoneme probability distribution calculation unit 130, a parameter optimization unit 140, and a recording unit 790. The recording unit 790 is a component that appropriately records information necessary for the processing of the acoustic model learning device 100.

The learning voice data is assumed to be classified into J types of voice. Each type is assigned a number from 1 to J (hereinafter referred to as a type number), and each type is identified using its type number j (1≦j≦J). The voice of type number j is called the j-th type voice. For example, when classifying by the presence or absence of noise and by gender, a male voice without noise is the voice of type number 1, a female voice without noise is the voice of type number 2, a male voice containing noise is the voice of type number 3, and a female voice containing noise is the voice of type number 4, giving four types.
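The four-type example above can be written as a small mapping. This is only a sketch of that one example; the function name `type_number` and its boolean arguments are hypothetical and are not part of the source.

```python
def type_number(noisy: bool, female: bool) -> int:
    """Return the type number (1..4) for the noise/gender example in the text.

    1: male, no noise    2: female, no noise
    3: male, with noise  4: female, with noise
    """
    return 1 + (2 if noisy else 0) + (1 if female else 0)


J = 4  # number of types in this example
for noisy in (False, True):
    for female in (False, True):
        print(noisy, female, type_number(noisy, female))
```

Other classification criteria (speaking style, noise level, and so on) would need their own mapping; the only requirement taken from the text is that the type numbers run from 1 to J.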

Further, the number of phonemes is M, each phoneme is assigned a number from 1 to M (hereinafter referred to as a phoneme number), and each phoneme is identified using its phoneme number m (1≦m≦M). The phoneme with phoneme number m is called the m-th phoneme.

学習開始前に、学習用音声データから訓練データを用意しておくのは、音響モデル学習装置７００と同じである。 It is the same as the acoustic model learning device 700 that training data is prepared from the learning voice data before the learning is started.

音響モデル学習装置１００は、訓練データである音声特徴量と正解音素番号の組から、音響モデルを学習する。正解音素番号とは、音声特徴量に対応する音素（正解音素）を識別するための音素番号のことである。 The acoustic model learning device 100 learns an acoustic model from a set of a voice feature amount and a correct phoneme number, which is training data. The correct phoneme number is a phoneme number for identifying a phoneme (correct phoneme) corresponding to a voice feature amount.

The type identification unit 110, the phoneme intermediate feature amount calculation unit 120, and the phoneme probability distribution calculation unit 130 each include a configuration that executes calculation by a neural network. Therefore, before learning starts, the acoustic model learning device 100 sets the initial values of the acoustic model parameters recorded in the recording unit 790 in the type identification unit 110, the phoneme intermediate feature amount calculation unit 120, and the phoneme probability distribution calculation unit 130. During learning, each time the parameter optimization unit 140 optimizes the acoustic model parameters, the acoustic model learning device 100 sets the calculated acoustic model parameters in the type identification unit 110, the phoneme intermediate feature amount calculation unit 120, and the phoneme probability distribution calculation unit 130.

The operation of the acoustic model learning device 100 will be described with reference to FIG. 3. The type identification unit 110 determines, from the input voice feature amount, a type number j′ (where j′ is an integer satisfying 1≦j′≦J) corresponding to the type of the voice data from which the voice feature amount was extracted (S110). The type identification unit 110 is described below with reference to FIGS. 4 and 5.

As shown in FIG. 4, the type identification unit 110 includes a type intermediate feature amount calculation unit 111, a type probability distribution calculation unit 112, and a type number determination unit 113. The type intermediate feature amount calculation unit 111 is a component corresponding to the neural network from the input layer to the final hidden layer of a DNN. The type probability distribution calculation unit 112 is a component corresponding to the neural network of the output layer of the DNN. The number of units in the output layer of the type probability distribution calculation unit 112 is equal to the number of types J. The j-th unit (1≦j≦J) outputs the probability that the type number of the voice from which the voice feature amount was extracted is j (that is, that the voice is of the j-th type).

なお、種類中間特徴量計算部１１１と種類確率分布計算部１１２をDNN以外のニューラルネットワークを用いて構成してもよい。ただし、種類確率分布計算部１１２の出力はJ次元ベクトルとなるように構成する。 The type intermediate feature amount calculation unit 111 and the type probability distribution calculation unit 112 may be configured by using a neural network other than DNN. However, the output of the type probability distribution calculation unit 112 is configured to be a J-dimensional vector.

The operation of the type identification unit 110 will be described with reference to FIG. 5. The type intermediate feature amount calculation unit 111 calculates, from the input voice feature amount, a type intermediate feature amount, which is an intermediate feature amount for type identification (S111). The type intermediate feature amount is a feature amount used to calculate the type probability distribution q=(q_1,…,q_J), which is the distribution of the probability q_j that the voice from which the input voice feature amount was extracted is of the j-th type (1≦j≦J). As described above, when the type intermediate feature amount calculation unit 111 is a component corresponding to the neural network from the input layer to the final hidden layer of a DNN, the type intermediate feature amount is the output feature amount of the final hidden layer of the DNN being learned.

The type probability distribution calculation unit 112 calculates a type probability distribution from the type intermediate feature amount calculated in S111 (S112). As described above, when the type probability distribution calculation unit 112 is a component corresponding to the neural network of the output layer of a DNN, the type probability distribution is the output feature amount of the output layer of the DNN being learned. Since the type with type number j (the j-th type) corresponds to the j-th unit of the output layer, the type probability distribution is the distribution q=(q_1,…,q_J) in which the probabilities q_j output by the j-th units of the output layer are arranged.

The type number determination unit 113 determines, from the type probability distribution calculated in S112, the type number j′ with the maximum probability (S113).
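Steps S112 and S113 can be sketched as follows. The use of a softmax to turn the J output-layer values into probabilities is an assumption made for illustration, since this passage only states that the j-th unit outputs the probability q_j; the function names are also hypothetical.

```python
import math


def softmax(values):
    """Turn J raw output-layer values into a probability distribution.

    Softmax is an assumed choice of output activation, not stated in the text.
    """
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def decide_type_number(type_probs):
    """Return the 1-based type number j' whose probability q_j is largest (S113)."""
    best_index = max(range(len(type_probs)), key=lambda j: type_probs[j])
    return best_index + 1


# Hypothetical output-layer values for J = 4 types.
q = softmax([0.1, 2.0, -1.0, 0.5])
print(decide_type_number(q))  # -> 2
```

The type number is 1-based to match the numbering 1 to J used in the text, which is why the sketch adds 1 to the list index.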

The phoneme intermediate feature amount calculation unit 120 calculates, from the input voice feature amount and the type number j′ determined in S110, a phoneme intermediate feature amount, which is an intermediate feature amount for phoneme identification (S120). The phoneme intermediate feature amount is a feature amount used to calculate the phoneme probability distribution p=(p_1,…,p_M), which is the distribution of the probability p_m that the phoneme corresponding to the input voice feature amount is the phoneme with phoneme number m (1≦m≦M). The phoneme intermediate feature amount calculation unit 120 is described below with reference to FIGS. 6 and 7.

As shown in FIG. 6, the phoneme intermediate feature amount calculation unit 120 includes a voice feature amount input unit 121, a first type intermediate feature amount calculation unit 122_1, …, a J-th type intermediate feature amount calculation unit 122_J, and a phoneme intermediate feature amount output unit 123. Each of the first type intermediate feature amount calculation unit 122_1, …, and the J-th type intermediate feature amount calculation unit 122_J is either a component corresponding to the neural network from the input layer to the final hidden layer of a DNN or a component corresponding to a CNN. When the voice of type number j (the j-th type voice) contains noise, the j-th type intermediate feature amount calculation unit 122_j is preferably a component corresponding to a CNN. The j-th type intermediate feature amount is a feature amount for calculating, on the assumption that the input voice feature amount belongs to a voice of type number j (1≦j≦J), the phoneme probability distribution p=(p_1,…,p_M), which is the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M). The first type intermediate feature amount, …, and the J-th type intermediate feature amount all have the same dimension as vectors.

The first type intermediate feature amount calculation unit 122_1, …, and the J-th type intermediate feature amount calculation unit 122_J may be configured using neural networks other than a DNN or a CNN. Even in this case, however, it is preferable to prepare neural networks suited to the types of voice, such as a neural network that yields a noise-resistant acoustic model and one that does not.

For example, when classifying by the presence or absence of noise and by gender, J=4. The first type intermediate feature amount calculation unit 122_1 calculates the feature amount of a male voice without noise (voice of type number 1) as the first type intermediate feature amount, the second type intermediate feature amount calculation unit 122_2 calculates the feature amount of a female voice without noise (voice of type number 2) as the second type intermediate feature amount, the third type intermediate feature amount calculation unit 122_3 calculates the feature amount of a male voice containing noise (voice of type number 3) as the third type intermediate feature amount, and the fourth type intermediate feature amount calculation unit 122_4 calculates the feature amount of a female voice containing noise (voice of type number 4) as the fourth type intermediate feature amount. In this case, the first type intermediate feature amount calculation unit 122_1 and the second type intermediate feature amount calculation unit 122_2 are configured as components corresponding to the neural network from the input layer to the final hidden layer of a DNN, and the third type intermediate feature amount calculation unit 122_3 and the fourth type intermediate feature amount calculation unit 122_4 are configured as components corresponding to a CNN.

The operation of the phoneme intermediate feature amount calculation unit 120 will be described with reference to FIG. 7. Using the type number j′ determined in S110, the voice feature amount input unit 121 outputs the input voice feature amount to the j′-th type intermediate feature amount calculation unit 122_j′ (S121).

The j′-th type intermediate feature amount calculation unit 122_j′ calculates the j′-th type intermediate feature amount from the voice feature amount input from the voice feature amount input unit 121 (S122).

音素中間特徴量出力部１２３は、Ｓ１２２で計算した第j’種中間特徴量を音素中間特徴量として出力する（Ｓ１２３）。 The phoneme intermediate feature amount output unit 123 outputs the j'th type intermediate feature amount calculated in S122 as a phoneme intermediate feature amount (S123).
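Steps S121 to S123 amount to routing the voice feature amount to exactly one of the J sub-networks. The sketch below illustrates only that routing: the class name, the use of plain Python callables in place of real DNN or CNN sub-networks, and the toy functions in the usage example are all assumptions for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
SubNetwork = Callable[[Sequence[float]], Vector]


class PhonemeIntermediateFeatureCalculator:
    """Holds one sub-network per voice type and evaluates only the selected one."""

    def __init__(self, sub_networks: Sequence[SubNetwork]):
        # sub_networks[j - 1] plays the role of unit 122_j.
        self.sub_networks = list(sub_networks)

    def __call__(self, voice_feature: Sequence[float], type_number: int) -> Vector:
        if not 1 <= type_number <= len(self.sub_networks):
            raise ValueError("type number must satisfy 1 <= j' <= J")
        # S121: hand the feature to unit 122_j'.  S122: compute the j'-th type
        # intermediate feature.  S123: return it as the phoneme intermediate feature.
        return self.sub_networks[type_number - 1](voice_feature)


# Toy stand-ins for a DNN-like and a CNN-like sub-network (J = 2).
def clean_net(x):
    return [2.0 * v for v in x]


def noisy_net(x):
    return [v + 1.0 for v in x]


calculator = PhonemeIntermediateFeatureCalculator([clean_net, noisy_net])
print(calculator([1.0, 2.0], type_number=2))  # -> [2.0, 3.0]
```

Because only the selected sub-network is evaluated for each feature, the per-frame cost does not grow with the number of sub-networks apart from the type identification step. All sub-networks must return vectors of the same dimension so that the shared output layer can consume any of them.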

音素確率分布計算部１３０は、Ｓ１２０で計算した音素中間特徴量から、音素確率分布を計算する（Ｓ１３０）。音素確率分布計算部１３０は、DNNの出力層のニューラルネットワークに対応する構成部である。音素確率分布計算部１３０の出力層に含まれるユニットの数は、音素の数に等しい。また、第mユニットは、音声特徴量に対応する音素の音素番号がmである（音声特徴量に対応する音素が第m音素である）確率を出力するユニットになる。 The phoneme probability distribution calculation unit 130 calculates a phoneme probability distribution from the phoneme intermediate feature amount calculated in S120 (S130). The phoneme probability distribution calculator 130 is a component corresponding to the neural network in the output layer of the DNN. The number of units included in the output layer of the phoneme probability distribution calculation unit 130 is equal to the number of phonemes. The m-th unit is a unit that outputs the probability that the phoneme number of the phoneme corresponding to the voice feature amount is m (the phoneme corresponding to the voice feature amount is the m-th phoneme).

なお、音素確率分布計算部１３０をDNN以外のニューラルネットワークを用いて構成してもよい。ただし、音素確率分布計算部１３０の出力はM次元ベクトルとなるように構成する。 The phoneme probability distribution calculation unit 130 may be configured using a neural network other than DNN. However, the output of the phoneme probability distribution calculation unit 130 is configured to be an M-dimensional vector.

パラメータ最適化部１４０は、Ｓ１３０で計算した音素確率分布と入力された正解音素番号を用いて、音響モデルパラメータを最適化する（Ｓ１４０）。具体的な最適化計算方法は、音響モデル学習装置７００のパラメータ最適化部７３０と同様でよい。 The parameter optimizing unit 140 optimizes the acoustic model parameters using the phoneme probability distribution calculated in S130 and the input correct phoneme number (S140). A specific optimization calculation method may be the same as that of the parameter optimization unit 730 of the acoustic model learning device 700.
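The quantity that compares the phoneme probability distribution with the correct phoneme number can be illustrated with a cross-entropy loss. This passage does not specify the loss or the optimization method (it refers to the parameter optimization unit 730), so the choice of cross-entropy, the function name, and the clipping constant below are assumptions for illustration only.

```python
import math


def cross_entropy(phoneme_probs, correct_phoneme_number):
    """Negative log probability assigned to the correct phoneme (1-based number).

    Cross-entropy is an assumed training criterion, not stated in this passage.
    """
    p_correct = phoneme_probs[correct_phoneme_number - 1]
    # Clip to avoid log(0) when the model assigns zero probability.
    return -math.log(max(p_correct, 1e-12))


# Hypothetical distribution over M = 3 phonemes; the correct phoneme is number 2.
loss = cross_entropy([0.25, 0.5, 0.25], correct_phoneme_number=2)
print(loss)
```

A smaller loss means the distribution places more probability on the correct phoneme; an optimizer would adjust the acoustic model parameters to reduce this value over the training data.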

The calculated acoustic model parameters are fed back to the type intermediate feature amount calculation unit 111, the type probability distribution calculation unit 112, the j′-th type intermediate feature amount calculation unit 122_j′ (where j′ is the type number determined in S110), and the phoneme probability distribution calculation unit 130, and are used for learning with the next training data.

音響モデル学習装置１００は、Ｓ１１０〜Ｓ１４０の処理を訓練データの数だけ繰り返し、最終的に計算された音響モデルパラメータを学習結果として出力する。 The acoustic model learning device 100 repeats the processing of S110 to S140 for the number of training data, and outputs the finally calculated acoustic model parameter as a learning result.

なお、種類識別部１１０と音素中間特徴量計算部１２０をまとめて中間特徴量計算部１０５という（図２参照）。また、中間特徴量計算部を音響モデル学習装置１００の一部としてではなく、独立した装置として扱う場合、中間特徴量計算装置という。中間特徴量計算装置は、音声特徴量を入力として、当該音声特徴量を抽出した音声データの種類を識別したうえで、中間特徴量を計算、出力するものとなる。 The type identification unit 110 and the phoneme intermediate feature amount calculation unit 120 are collectively referred to as an intermediate feature amount calculation unit 105 (see FIG. 2). In addition, when the intermediate feature amount calculation unit is treated not as a part of the acoustic model learning device 100 but as an independent device, it is referred to as an intermediate feature amount calculation device. The intermediate feature amount calculation device receives the voice feature amount as an input, identifies the type of voice data from which the voice feature amount is extracted, and then calculates and outputs the intermediate feature amount.

[Voice recognition device 200]
Hereinafter, the voice recognition device 200 will be described with reference to FIGS. 8 to 11. As shown in FIG. 8, the voice recognition device 200 includes a voice feature amount extraction unit 210 and a voice recognition unit 220.

また、音声認識装置２００は、学習結果記録部２９０と接続している。学習結果記録部２９０は、音響モデル学習装置１００が学習した音響モデルを記録している。なお、学習結果記録部２９０は、音声認識装置２００に含まれる構成部としてもよい。 The voice recognition device 200 is also connected to the learning result recording unit 290. The learning result recording unit 290 records the acoustic model learned by the acoustic model learning device 100. The learning result recording unit 290 may be a component included in the voice recognition device 200.

図１０に示すように音声認識部２２０は、種類識別部１１０、音素中間特徴量計算部１２０、音素確率分布計算部１３０、音声認識結果生成部２２１を含む。種類識別部１１０、音素中間特徴量計算部１２０、音素確率分布計算部１３０は、音響モデル学習装置１００のそれと同様の構成部である。 As shown in FIG. 10, the voice recognition unit 220 includes a type identification unit 110, a phoneme intermediate feature amount calculation unit 120, a phoneme probability distribution calculation unit 130, and a voice recognition result generation unit 221. The type identification unit 110, the phoneme intermediate feature amount calculation unit 120, and the phoneme probability distribution calculation unit 130 are the same configuration units as those of the acoustic model learning device 100.

音声認識装置２００は、認識用音声データから、認識用音声データの認識結果である音声認識結果を生成する。 The voice recognition device 200 generates a voice recognition result, which is a recognition result of the recognition voice data, from the recognition voice data.

Before voice recognition starts, the voice recognition device 200 sets the acoustic model parameters recorded in the learning result recording unit 290 in the voice recognition unit 220 (specifically, in the type identification unit 110, the phoneme intermediate feature amount calculation unit 120, and the phoneme probability distribution calculation unit 130).

図９に従い音声認識装置２００の動作について説明する。音声特徴量抽出部２１０は、認識用音声データから、認識用音声データの音声特徴量を抽出する（Ｓ２１０）。音声特徴量抽出部２１０は、音響モデル学習装置１００の入力である音声特徴量の生成と同一条件にて音声特徴量を抽出する。 The operation of the voice recognition device 200 will be described with reference to FIG. The voice feature amount extraction unit 210 extracts the voice feature amount of the recognition voice data from the recognition voice data (S210). The voice feature amount extraction unit 210 extracts the voice feature amount under the same condition as the generation of the voice feature amount which is the input of the acoustic model learning device 100.

音声認識部２２０は、Ｓ２１０で抽出した音声特徴量から、認識用音声データを認識した結果である音声認識結果を生成する（Ｓ２２０）。図１１に従い、具体的処理について説明する。種類識別部１１０は、Ｓ２１０で抽出した音声特徴量から、当該音声特徴量を抽出した音声データの種類に対応する種類番号j’を決定する（Ｓ１１０）。音素中間特徴量計算部１２０は、Ｓ２１０で抽出した音声特徴量とＳ１１０で決定した種類番号j’から、音素識別用の中間特徴量である音素中間特徴量を計算する（Ｓ１２０）。音素確率分布計算部１３０は、Ｓ１２０で計算した音素中間特徴量から、音素確率分布を計算する（Ｓ１３０）。音声認識結果生成部２２１は、Ｓ１３０で計算した音素確率分布から確率が最大となる音素を決定し、決定した音素の系列から音声認識結果を生成する（Ｓ２２１）。なお、音素の系列の長さは、認識用音声データから抽出された音声特徴量の数と等しくなる。 The voice recognition unit 220 generates a voice recognition result which is a result of recognizing the recognition voice data from the voice feature amount extracted in S210 (S220). Specific processing will be described with reference to FIG. The type identifying unit 110 determines a type number j′ corresponding to the type of voice data from which the voice feature amount is extracted, from the voice feature amount extracted in S210 (S110). The phoneme intermediate feature amount calculation unit 120 calculates a phoneme intermediate feature amount, which is an intermediate feature amount for phoneme identification, from the voice feature amount extracted in S210 and the type number j′ determined in S110 (S120). The phoneme probability distribution calculation unit 130 calculates a phoneme probability distribution from the phoneme intermediate feature amount calculated in S120 (S130). The speech recognition result generation unit 221 determines a phoneme having the maximum probability from the phoneme probability distribution calculated in S130, and generates a speech recognition result from the determined phoneme sequence (S221). The length of the phoneme sequence is equal to the number of voice feature amounts extracted from the recognition voice data.
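Step S221 picks the most probable phoneme for each voice feature amount, so the resulting sequence has one phoneme per feature. A minimal sketch, with a hypothetical function name and made-up distributions:

```python
def decode_phoneme_sequence(phoneme_distributions):
    """Return the 1-based phoneme number with maximum probability for each frame."""
    sequence = []
    for probs in phoneme_distributions:
        best_index = max(range(len(probs)), key=lambda m: probs[m])
        sequence.append(best_index + 1)
    return sequence


# Hypothetical phoneme probability distributions for three frames (M = 3).
distributions = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]
print(decode_phoneme_sequence(distributions))  # -> [2, 1, 3]
```

How the phoneme sequence is turned into the final recognition result (for example, into words) is not detailed in this passage and is left out of the sketch.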

本実施形態の発明によれば、音声の種類を識別したうえで音素中間特徴量を計算する。また、音声の種類を反映して計算した音素中間特徴量を用いて、各種類の音声を認識するための音響モデルを結合したものに相当する１つの音響モデルを学習する。これにより、認識処理に必要な計算時間を抑制しつつ学習用音声データの種類にかかわらず高精度な音声認識を実現する音響モデルを学習することができる。 According to the invention of this embodiment, the phoneme intermediate feature amount is calculated after identifying the type of voice. Further, one acoustic model corresponding to a combination of acoustic models for recognizing each type of speech is learned using the phoneme intermediate feature amount calculated by reflecting the type of speech. As a result, it is possible to learn an acoustic model that realizes highly accurate voice recognition regardless of the type of learning voice data while suppressing the calculation time required for recognition processing.

本実施形態の発明による音響モデルを用いて音声認識をすることにより、音響モデル学習装置７００や音響モデル学習装置８００による音響モデルを用いた音声認識と比較して、雑音の有無に影響を受けない高精度な音声認識が可能となる。また、音響モデル学習装置９００による音響モデルを用いた音声認識と比較して、認識処理に必要な計算時間を抑制することも可能となる。 By performing voice recognition using the acoustic model according to the present invention, the presence or absence of noise is not affected as compared with the voice recognition using the acoustic model by the acoustic model learning device 700 or the acoustic model learning device 800. Highly accurate voice recognition is possible. Further, it is possible to suppress the calculation time required for the recognition processing, as compared with the voice recognition using the acoustic model by the acoustic model learning device 900.

<Modification>
It is needless to say that the present invention is not limited to the above-described embodiments and can be appropriately modified without departing from the spirit of the present invention. The various kinds of processing described in the above embodiments may be executed not only in time series according to the order described, but also in parallel or individually according to the processing capability of the device that executes the processing or as needed.

<Additional notes>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may be provided with a device (drive) capable of reading and writing a recording medium such as a CD-ROM. A general-purpose computer is an example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program required to realize the functions described above and the data required for the processing of this program (the program need not be stored in the external storage device; for example, it may be stored in ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data required for the processing of each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the constituent elements referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As described above, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. Then, when executing processing, the computer reads the program stored in its own recording medium and executes processing in accordance with the read program. As other modes of executing the program, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or, each time a program is transferred from the server computer to the computer, the computer may successively execute processing in accordance with the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of this processing content may instead be realized in hardware.

Claims

An intermediate feature amount calculation device, wherein the number of types for classifying voice data is J, a number for identifying a type is a type number, the number of phonemes is M, and a number for identifying a phoneme is a phoneme number, the device comprising:
a type identification unit that determines, from a voice feature amount, a type number j′ (where j′ is an integer satisfying 1≦j′≦J) corresponding to the type of the voice data from which the voice feature amount was extracted; and
a phoneme intermediate feature amount calculation unit that calculates, from the voice feature amount and the type number j′, a j′-th type intermediate feature amount as a phoneme intermediate feature amount, the j′-th type intermediate feature amount being a feature amount used to calculate a phoneme probability distribution p=(p_1,...,p_M), which is the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M),
wherein the phoneme intermediate feature amount calculation unit includes, for each integer j satisfying 1≦j≦J,
a j-th type intermediate feature amount calculation unit that calculates, using a neural network, a j-th type intermediate feature amount from a voice feature amount extracted from voice data of type number j,
and wherein the type identification unit includes:
a type intermediate feature amount calculation unit that calculates, from the voice feature amount, a type intermediate feature amount, which is an intermediate feature amount for type identification;
a type probability distribution calculation unit that calculates a type probability distribution from the type intermediate feature amount; and
a type determination unit that determines, from the type probability distribution, the type number having the maximum probability as the type number j′.
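
The data flow recited in this claim can be illustrated with a small sketch. It is not the patented implementation: the single sigmoid hidden layer per network, the softmax used for the type probability distribution, the random initialization, and all names (`IntermediateFeatureCalculator`, `identify_type`, `init_layer`, and so on) are assumptions made for illustration. What it tries to show is the routing itself: a type network yields a type probability distribution, the most probable type number j′ is selected, and only the j′-th of the J per-type networks is evaluated to produce the phoneme intermediate feature amount.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Hypothetical helper: random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, layer):
    """Compute W x + b for a layer given as (W, b)."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid_layer(x, layer):
    """Affine transform followed by an element-wise sigmoid (assumed activation)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in affine(x, layer)]


def softmax(z):
    """Numerically stable softmax (assumed form of the probability distributions)."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


class IntermediateFeatureCalculator:
    def __init__(self, n_in, n_hidden, J, seed=0):
        rng = random.Random(seed)
        self.J = J
        # Type identification unit: type intermediate feature, then type probabilities.
        self.type_hidden = init_layer(n_in, n_hidden, rng)
        self.type_output = init_layer(n_hidden, J, rng)
        # One j-th type intermediate feature amount calculation unit per type.
        self.per_type_hidden = [init_layer(n_in, n_hidden, rng) for _ in range(J)]

    def identify_type(self, x):
        """Return (j', type probability distribution); j' is 1-based."""
        type_feature = sigmoid_layer(x, self.type_hidden)
        type_probs = softmax(affine(type_feature, self.type_output))
        j_prime = max(range(self.J), key=lambda j: type_probs[j]) + 1
        return j_prime, type_probs

    def __call__(self, x):
        """Return (j', phoneme intermediate feature) using only the j'-th network."""
        j_prime, _ = self.identify_type(x)
        return j_prime, sigmoid_layer(x, self.per_type_hidden[j_prime - 1])


# Usage with made-up dimensions: 4-dimensional features, J = 2 types.
calc = IntermediateFeatureCalculator(n_in=4, n_hidden=3, J=2)
j_prime, phoneme_feature = calc([0.1, -0.2, 0.3, 0.0])
```

Because only one of the J per-type networks runs for each input, the cost of the phoneme path in this sketch does not grow with J, which is in line with the reduced recognition-time computation the description attributes to this embodiment.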

An acoustic model learning device that learns an acoustic model used for speech recognition from a voice feature amount and a correct phoneme number, which is the phoneme number identifying the phoneme corresponding to the voice feature amount, the device comprising:
an intermediate feature amount calculation unit that, using the intermediate feature amount calculation device according to claim 1, calculates from the voice feature amount, as a phoneme intermediate feature amount, a j′-th type intermediate feature amount (where j′ is the type number corresponding to the type of the voice data from which the voice feature amount was extracted and satisfies 1≦j′≦J), which is a feature amount used to calculate a phoneme probability distribution p=(p_1,...,p_M), the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M);
a phoneme probability distribution calculation unit that calculates the phoneme probability distribution from the phoneme intermediate feature amount; and
a parameter optimization unit that optimizes acoustic model parameters, which are the parameters of the acoustic model, using the phoneme probability distribution and the correct phoneme number,
wherein the parameter optimization unit optimizes, among the acoustic model parameters, the parameters that characterize the neural network used in the j′-th type intermediate feature amount calculation unit, included in the intermediate feature amount calculation unit, that corresponds to the type number j′.
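
One way to picture the parameter optimization recited in this claim is a single gradient step that touches only the network selected by j′. The sketch below is hedged throughout: the cross-entropy loss, plain SGD, the one-hidden-layer networks, the shared softmax output layer standing in for the phoneme probability distribution calculation unit, and the decision to update that output layer as well are all assumptions, and every name (`train_step`, `hidden_layers`, `output_layer`) is hypothetical. The type number j′ is taken as given, as if supplied by the type identification unit.

```python
import copy
import math
import random


def init_layer(n_in, n_out, rng):
    """Hypothetical helper: random weights and zero biases for one layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, layer):
    """Compute W x + b for a layer given as (W, b)."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def train_step(x, j_prime, correct_m, hidden_layers, output_layer, lr=0.1):
    """One SGD step on the cross-entropy loss; returns the loss before the update.

    x: voice feature amount, j_prime: 1-based type number,
    correct_m: 1-based correct phoneme number.
    Only hidden_layers[j_prime - 1] and output_layer are modified.
    """
    hidden = hidden_layers[j_prime - 1]
    h = [1.0 / (1.0 + math.exp(-a)) for a in affine(x, hidden)]  # phoneme intermediate feature
    p = softmax(affine(h, output_layer))                         # phoneme probability distribution
    loss = -math.log(p[correct_m - 1])

    # Gradient w.r.t. the output pre-activations: p minus the one-hot target.
    dz = [pm - (1.0 if m == correct_m - 1 else 0.0) for m, pm in enumerate(p)]
    w_out, b_out = output_layer
    # Back-propagate to the hidden layer before the output weights change.
    dh = [sum(dz[m] * w_out[m][i] for m in range(len(dz))) for i in range(len(h))]
    da = [dh[i] * h[i] * (1.0 - h[i]) for i in range(len(h))]

    for m in range(len(dz)):
        for i in range(len(h)):
            w_out[m][i] -= lr * dz[m] * h[i]
        b_out[m] -= lr * dz[m]
    w_hid, b_hid = hidden
    for i in range(len(da)):
        for k in range(len(x)):
            w_hid[i][k] -= lr * da[i] * x[k]
        b_hid[i] -= lr * da[i]
    return loss


# Usage with made-up sizes: J = 2 types, M = 5 phonemes, 4-dimensional features.
rng = random.Random(0)
hidden_layers = [init_layer(4, 3, rng) for _ in range(2)]
output_layer = init_layer(3, 5, rng)
other_type_before = copy.deepcopy(hidden_layers[1])
x = [0.1, -0.2, 0.3, 0.0]
losses = [train_step(x, 1, 2, hidden_layers, output_layer) for _ in range(20)]
```

In this sketch a sample identified as type 1 never changes the type 2 network, so each per-type network is fitted only to the data routed to it.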

A speech recognition device comprising:
a voice feature amount extraction unit that extracts, from recognition voice data, a voice feature amount of the recognition voice data; and
a voice recognition unit that generates, from the voice feature amount, a voice recognition result, which is the recognition result for the recognition voice data, using the acoustic model learned by the acoustic model learning device according to claim 2.

An intermediate feature amount calculation method, wherein the number of types for classifying voice data is J, a number for identifying a type is a type number, the number of phonemes is M, and a number for identifying a phoneme is a phoneme number, the method comprising:
a type identification step in which an intermediate feature amount calculation device determines, from a voice feature amount, a type number j′ (where j′ is an integer satisfying 1≦j′≦J) corresponding to the type of the voice data from which the voice feature amount was extracted; and
a phoneme intermediate feature amount calculation step in which the intermediate feature amount calculation device calculates, from the voice feature amount and the type number j′, a j′-th type intermediate feature amount as a phoneme intermediate feature amount, the j′-th type intermediate feature amount being a feature amount used to calculate a phoneme probability distribution p=(p_1,...,p_M), which is the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M),
wherein the phoneme intermediate feature amount calculation step includes, for each integer j satisfying 1≦j≦J,
a j-th type intermediate feature amount calculation step of calculating, using a neural network, a j-th type intermediate feature amount from a voice feature amount extracted from voice data of type number j,
and wherein the type identification step includes:
a type intermediate feature amount calculation step of calculating, from the voice feature amount, a type intermediate feature amount, which is an intermediate feature amount for type identification;
a type probability distribution calculation step of calculating a type probability distribution from the type intermediate feature amount; and
a type determination step of determining, from the type probability distribution, the type number having the maximum probability as the type number j′.

An acoustic model learning method in which an acoustic model learning device learns an acoustic model used for speech recognition from a voice feature amount and a correct phoneme number, which is the phoneme number identifying the phoneme corresponding to the voice feature amount, the method comprising:
an intermediate feature amount calculation step in which the acoustic model learning device, using the intermediate feature amount calculation method according to claim 4, calculates from the voice feature amount, as a phoneme intermediate feature amount, a j′-th type intermediate feature amount (where j′ is the type number corresponding to the type of the voice data from which the voice feature amount was extracted and satisfies 1≦j′≦J), which is a feature amount used to calculate a phoneme probability distribution p=(p_1,...,p_M), the distribution of the probability p_m that the phoneme corresponding to the voice feature amount is the phoneme with phoneme number m (1≦m≦M);
a phoneme probability distribution calculation step in which the acoustic model learning device calculates the phoneme probability distribution from the phoneme intermediate feature amount; and
a parameter optimization step in which the acoustic model learning device optimizes acoustic model parameters, which are the parameters of the acoustic model, using the phoneme probability distribution and the correct phoneme number,
wherein the parameter optimization step optimizes, among the acoustic model parameters, the parameters that characterize the neural network used in the j′-th type intermediate feature amount calculation step, included in the intermediate feature amount calculation step, that corresponds to the type number j′.

A speech recognition method comprising:
a voice feature amount extraction step in which a speech recognition device extracts, from recognition voice data, a voice feature amount of the recognition voice data; and
a voice recognition step in which the speech recognition device generates, from the voice feature amount, a voice recognition result, which is the recognition result for the recognition voice data, using the acoustic model learned by the acoustic model learning method according to claim 5.

A program for causing a computer to function as the intermediate feature amount calculation device according to claim 1.