JP7754428B2

JP7754428B2 - Speech recognition device, speech recognition method, and program

Info

Publication number: JP7754428B2
Application number: JP2022134800A
Authority: JP
Inventors: 唯周藤; 一博中臺; 龍武田
Original assignee: Honda Motor Co Ltd; Osaka University NUC
Current assignee: Honda Motor Co Ltd; University of Osaka NUC
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2025-10-15
Anticipated expiration: 2042-08-26
Also published as: JP2024031314A

Description

The present invention relates to a speech recognition device, a speech recognition method, and a program.

Speech recognition has a wide range of applications and is used in a variety of environments. It is known that when speech mixed with noise is used for speech recognition, the recognition rate is lower than for clean speech without noise. To improve the recognition rate under noise, speech enhancement is sometimes applied to speech recognition systems. Speech enhancement emphasizes the speech components of the recorded input speech, so that the noise components are relatively reduced. Noise suppression can be regarded as a form of speech enhancement.

Missing data speech recognition has been proposed as a technique that applies speech enhancement to speech recognition. For example, the techniques described in Non-Patent Documents 1 and 2 apply an evidence model. The evidence model is a model of the decoding process that passes statistical information from speech enhancement to speech recognition. It can be regarded as a mathematical model for evaluating the expected value of the classification score that yields the recognition result, and it is expressed using a probability density function trained so as to reduce misclassification.

A. C. Morris, J. Baker, and H. Bourlard, “FROM MISSING DATA TO MAYBE USEFUL DATA: SOFT DATA MODELLING FOR NOISE ROBUST ASR”, Proceedings of Workshop Innovation Speech Process, 2001
M. Kuhne, R. Togneri, and S. Nordholm, “Recognition with Applications in Reverberant Multi-Source Environments”, IEEE Transactions on Audio, Speech and Language Processing, vol. 19, No. 2, pp. 372-384, 2011

The proposed evidence models are applied frame by frame, so processing is independent between frames. As a result, the acoustic features representing the acoustic characteristics after processing may become discontinuous between frames and give rise to nonlinear distortion. Unexpected nonlinear distortion can lower the recognition rate. Meanwhile, in speech recognition, the use of end-to-end (E2E) models, which produce a recognition result made up of multiple character strings from a single utterance, has been proposed. One conceivable approach is to extend the evidence model to multiple frames so that it can be used with an E2E model. However, the evidence model is a high-dimensional statistical model, and simply extending it makes the amount of computation extremely large. It is also conceivable to reduce the amount of computation by sampling acoustic features from the evidence model. Even then, a number of samples commensurate with the scale of the model is expected to be needed to keep the recognition rate from falling.

(1) One aspect of the present embodiment is a speech recognition device including: an utterance section processing unit that determines an utterance section based on acoustic characteristics of an input speech signal; a speech enhancement unit that uses a first model to determine, for each frame, an enhanced feature in which the speech component of an acoustic feature of the input speech signal is emphasized; a hidden state processing unit that uses a second model to determine a hidden state feature sequence, which is a sequence of hidden state features, based on a target feature sequence, which is a sequence of target features; a sampling processing unit that uses a third model, which indicates a probability distribution of the target feature sequence corresponding to an enhanced feature sequence, which is the sequence of the enhanced features within the utterance section, and an acoustic feature sequence, which is the sequence of the acoustic features, to sample a sample value of the target feature sequence multiple times, causes the hidden state processing unit to calculate a sample value of the hidden state feature sequence for each sample value of the target feature sequence, and determines an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and an utterance processing unit that uses a fourth model to determine the utterance content of the utterance section based on the expected value of the hidden state feature sequence.

(2) One aspect of this embodiment is the speech recognition device of (1), wherein the target feature sequence is a weighted sum of the enhanced feature sequence and the acoustic feature sequence, the probability distribution is a probability distribution of the ratio between the enhanced feature sequence and the acoustic feature sequence, and the sampling processing unit may sample a sample value of the ratio using the third model and combine the enhanced feature sequence and the acoustic feature sequence based on the sample value of the ratio to calculate a sample value of the target feature.

(3) One aspect of this embodiment is the speech recognition device of (1), wherein the probability distribution may include a first probability distribution indicating the likelihood that the target feature sequence is equal to the enhanced feature sequence, and a second probability distribution that is a probability distribution of the target feature sequence being dispersed away from the enhanced feature sequence.

(4) One aspect of this embodiment is the speech recognition device of (3), wherein the sampling processing unit may determine the first probability distribution based on a posterior probability distribution of the enhanced feature for each frame in the utterance section.

(5) One aspect of this embodiment is the speech recognition device of (3) or (4), wherein the sampling processing unit may sample some of the sample values of the target feature sequence as sample values of a first-type target feature sequence using the first probability distribution, sample the other sample values of the target feature sequence as sample values of a second-type target feature sequence using the second probability distribution, and determine, as the expected value of the hidden state feature sequence, the average of the sample values of the hidden state feature sequence for the sample values of the first-type target feature sequence and the sample values of the hidden state feature sequence for the sample values of the second-type target feature sequence.

(6) One aspect of this embodiment is the speech recognition device of (1), wherein the fourth model includes an attention decoder and a connectionist temporal classification (CTC) decoder, the attention decoder calculates a first posterior probability for each candidate utterance content with respect to the expected value of the hidden state feature sequence, the CTC decoder calculates a sample value of a second posterior probability for each candidate utterance content with respect to each sample value of the hidden state feature sequence and calculates, for each candidate utterance content, the expected value of the sample values of the second posterior probability as the second posterior probability, and the utterance content may be determined based on a score obtained by combining the first posterior probability and the second posterior probability.

(7) One aspect of the present embodiment may be a program for causing a computer to execute: an utterance section processing step of determining an utterance section based on acoustic characteristics of an input speech signal; a speech enhancement step of using a first model to determine, for each frame, an enhanced feature in which the speech component of an acoustic feature of the input speech signal is emphasized; a hidden state processing step of using a second model to determine a hidden state feature sequence, which is a sequence of hidden state features, based on a target feature sequence, which is a sequence of target features; a sampling processing step of using a third model, which indicates a probability distribution of the target feature sequence corresponding to an enhanced feature sequence, which is the sequence of the enhanced features within the utterance section, and an acoustic feature sequence, which is the sequence of the acoustic features, to sample a sample value of the target feature sequence multiple times, causing the hidden state processing step to calculate a sample value of the hidden state feature sequence for each sample value of the target feature sequence, and determining an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and an utterance processing step of using a fourth model to determine the utterance content of the utterance section based on the expected value of the hidden state feature sequence.

(8) One aspect of the present embodiment is a speech recognition method in a speech recognition device, in which the speech recognition device executes: an utterance section processing step of determining an utterance section based on acoustic characteristics of an input speech signal; a speech enhancement step of using a first model to determine, for each frame, an enhanced feature in which the speech component of an acoustic feature of the input speech signal is emphasized; a hidden state processing step of using a second model to determine a hidden state feature sequence, which is a sequence of hidden state features, based on a target feature sequence, which is a sequence of target features; a sampling processing step of using a third model, which indicates a probability distribution of the target feature sequence corresponding to an enhanced feature sequence, which is the sequence of the enhanced features within the utterance section, and an acoustic feature sequence, which is the sequence of the acoustic features, to sample a sample value of the target feature sequence multiple times, causing the hidden state processing step to calculate a sample value of the hidden state feature sequence for each sample value of the target feature sequence, and determining an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and an utterance processing step of using a fourth model to determine the utterance content of the utterance section based on the expected value of the hidden state feature sequence.

According to one aspect of the present embodiment, a decrease in the speech recognition rate caused by speech enhancement processing can be suppressed.
For example, according to (1), (7), or (8), multiple sample values of the target feature sequence corresponding to the enhanced feature sequence and the acoustic feature sequence within an utterance section are obtained, and an expected value of the target feature sequence is obtained from those sample values. The utterance content is determined based on the expected value of the hidden state feature sequence obtained from the expected value of the target feature sequence. Because the target feature sequence can express the continuity of the acoustic characteristics as a trend of change within the utterance section, discontinuity of the acoustic characteristics between frames caused by random sampling can be avoided. A decrease in the speech recognition rate due to such discontinuity can therefore be avoided. In addition, sampling the target feature sequence within the utterance section suppresses the increase in processing load that higher dimensionality would otherwise bring.
According to (2), the probability distribution of the target feature sequence can be expressed by the ratio between the enhanced feature sequence and the acoustic feature sequence. The processing load of sampling can therefore be reduced while speech recognition accuracy is maintained.
According to (3), the first probability distribution quantifies the degree to which the enhanced feature sequence obtained by emphasizing the speech component is adopted as the target feature sequence, and the second probability distribution quantifies the degree to which the enhanced feature sequence deviates from the target feature sequence. Speech recognition accuracy can be maintained by sampling that takes into account how the reliability of the enhanced feature sequence differs with the usage environment.
According to (4), speech recognition accuracy can be maintained by sampling the target feature sequence in a way that takes into account the error of the enhanced feature in each frame together with the continuity of the target feature within the utterance section.
According to (5), when sample values of the target feature sequence are sampled, either the first probability distribution or the second probability distribution is used for each sample. Avoiding the addition of the first and second probability distributions for every sample reduces the processing load. In addition, processing the samples in parallel makes effective use of computational resources.
According to (6), samples of the hidden state feature sequence are input to the CTC decoder, which outputs sample values of the second posterior probability. Including the CTC decoder processing, which is performed independently of the attention decoder, in the per-sample processing allows computational resources to be used even more effectively.
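The scoring described for aspect (6) can be illustrated with a short sketch. The text only says that the first and second posterior probabilities are combined into a score; the log-linear interpolation and the weight `ctc_weight` used here are assumptions borrowed from common hybrid CTC/attention decoding, and the function names are hypothetical. Probabilities are assumed to be strictly positive.

```python
import math


def mean_probability(sample_probs):
    """Sample-mean estimate of the expected posterior over the drawn samples."""
    return sum(sample_probs) / len(sample_probs)


def combined_score(attention_prob, ctc_sample_probs, ctc_weight=0.3):
    """Combine the two posteriors for one candidate.

    attention_prob: first posterior, computed from the expected hidden sequence.
    ctc_sample_probs: per-sample values of the second posterior.
    ASSUMPTION: log-linear interpolation with weight ctc_weight.
    """
    ctc_prob = mean_probability(ctc_sample_probs)
    return ((1.0 - ctc_weight) * math.log(attention_prob)
            + ctc_weight * math.log(ctc_prob))


def pick_utterance(candidates, ctc_weight=0.3):
    """candidates maps text -> (attention_prob, list of CTC sample probabilities)."""
    return max(
        candidates,
        key=lambda text: combined_score(*candidates[text], ctc_weight),
    )
```

Because the CTC values are computed per sample and only averaged at the end, the per-sample work can run in parallel with, and independently of, the attention decoder.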

A schematic block diagram showing an example of the functional configuration of the speech recognition device according to the present embodiment.
A schematic block diagram showing an example of the hardware configuration of the speech recognition device according to the present embodiment.
A schematic block diagram showing a first example of the functional configuration of the sampling processing unit, the hidden state processing unit, and the utterance processing unit according to the present embodiment.
A graph showing a first example of the third model according to the present embodiment.
A schematic block diagram showing a second example of the functional configuration of the sampling processing unit, the hidden state processing unit, and the utterance processing unit according to the present embodiment.
A flowchart showing a first example of the speech recognition process according to the present embodiment.
A flowchart showing a second example of the speech recognition process according to the present embodiment.
A table illustrating examples of the CER (Character Error Rate).
A schematic block diagram showing the functional configuration of a comparative example.

Hereinafter, an embodiment of the present disclosure will be described with reference to the drawings. First, an example of the functional configuration of a speech recognition device 10 according to this embodiment will be described. FIG. 1 is a schematic block diagram showing an example of the functional configuration of the speech recognition device 10 according to this embodiment.
A speech signal is input to the speech recognition device 10 from a microphone 20 as an input speech signal. The microphone 20 includes an electroacoustic transducer that picks up sound arriving at it and converts the amplitude of the sound pressure into a voltage. The microphone 20 outputs to the speech recognition device 10 an electrical signal whose signal values indicate the converted voltage.

The speech recognition device 10 performs speech recognition processing on the input speech signal and estimates the utterance content. The speech recognition device 10 estimates an utterance section based on the acoustic characteristics of the input speech signal.
The speech recognition device 10 uses a first model to determine, for each frame, an enhanced feature in which the speech component of an acoustic feature of the input speech signal is emphasized. The speech recognition device 10 uses a second model to determine a sequence of hidden state features (sometimes referred to herein as a "hidden state feature sequence") based on a sequence of target features (sometimes referred to herein as a "target feature sequence"). The speech recognition device 10 uses a third model, which indicates a probability distribution of the target feature sequence corresponding to the sequence of enhanced features (sometimes referred to herein as an "enhanced feature sequence") and the sequence of acoustic features (sometimes referred to herein as an "acoustic feature sequence"), to sample a sample value of the target feature sequence multiple times, and determines an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence. The speech recognition device 10 uses a fourth model to determine the utterance content for each utterance section based on the expected value of the hidden state feature sequence.

The speech recognition device 10 may apply the determined utterance content to various processes, or may be incorporated as part of an electronic device whose main function is something else. The various processes may be, for example, any of identifying a voice command and executing the process it instructs, document creation, editing, and the like. In this application, the term "model" mainly refers to a mathematical model. An utterance section is a section containing the speech of a single utterance, that is, a section in which speech is produced continuously without interruption. Uttered speech means a human voice and need not have been uttered at that moment. The input speech signal may have been recorded in advance or may be synthesized. The input speech signal may also be input by wire or wirelessly from a device other than the microphone 20.

The speech recognition device 10 includes a control unit 110. The control unit 110 includes a feature analysis unit 112, a speech enhancement unit 114, an utterance section processing unit 116, a sampling processing unit 118, a hidden state processing unit 120, an utterance processing unit 122, and a model learning unit 130. The speech recognition device 10 illustrated in FIG. 1 performs missing data automatic speech recognition (MD-ASR). MD-ASR makes use of the uncertainty of speech enhancement (SE). The uncertainty of speech enhancement is represented by a probabilistic evidence model.

The feature analysis unit 112 acquires the input speech signal input from the microphone 20. The input speech signal is a digital signal representing a time series of signal values sampled at a predetermined sampling frequency. The sampling frequency is, for example, 16 kHz. The feature analysis unit 112 calculates an acoustic feature for each analysis window having a predetermined window length. The analysis window is the section over which the acoustic characteristics of the speech signal are analyzed at one time. For example, a Hanning window can be used as the analysis window. The window length corresponds to the period to be analyzed, that is, a frame. The window length is, for example, 512 samples. The feature analysis unit 112 shifts the analysis window by a predetermined hop length at regular time intervals. The hop length corresponds to the period by which the analysis window is shifted at one time. The hop length may be any positive real number equal to or less than the window length. The hop length is, for example, 128 samples. As the acoustic feature, the feature analysis unit 112 can use a feature representing frequency characteristics, such as short-time Fourier transform (STFT) coefficients or a Mel filter bank.
The feature analysis unit 112 outputs the acoustic feature calculated for each frame to the speech enhancement unit 114, the utterance section processing unit 116, and the sampling processing unit 118.
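As a rough illustration of the framing described above, the following sketch splits a signal into overlapping frames, applies a Hann window, and computes a magnitude spectrum per frame. The defaults (512-sample window, 128-sample hop) follow the example values in the text. The function names, the periodic form of the window, the dropping of incomplete trailing frames, and the naive DFT (used instead of an FFT to keep the sketch dependency-free) are all assumptions made for illustration.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window of the given length."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length) for n in range(length)]


def frame_signal(signal, window_length=512, hop_length=128):
    """Split a signal into overlapping frames; incomplete tail frames are dropped."""
    if not 0 < hop_length <= window_length:
        raise ValueError("hop_length must be positive and not exceed window_length")
    last_start = len(signal) - window_length
    return [signal[s:s + window_length] for s in range(0, last_start + 1, hop_length)]


def magnitude_spectrum(frame, window):
    """Magnitudes of the one-sided DFT of a windowed frame (naive O(n^2) DFT)."""
    n = len(frame)
    windowed = [x * w for x, w in zip(frame, window)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum
```

With a 16 kHz sampling frequency, a 512-sample window spans 32 ms and a 128-sample hop advances the window by 8 ms, so consecutive frames overlap by 75%.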

The speech enhancement unit 114 performs speech enhancement (SE) processing, using a first model, on the acoustic features input from the feature analysis unit 112, and calculates acoustic features in which the speech component is emphasized as enhanced features. Emphasizing the speech component means emphasizing it relative to the other components. Noise suppression processing may be applied as the speech enhancement processing. For example, a deep neural network (DNN) can be used as the first model. The speech enhancement unit 114 outputs the calculated enhanced features to the sampling processing unit 118. A specific example of the speech enhancement unit 114 will be described later.

The utterance section processing unit 116 determines utterance sections based on the per-frame acoustic features acquired from the feature analysis unit 112. The utterance section processing unit 116 can use a known voice activity detection (VAD) method to detect utterance sections. The utterance section processing unit 116 determines, for each frame, whether or not the frame is a speech section. For example, using a preset speech section determination model, the utterance section processing unit 116 determines a frame to be a speech section when the speech section probability calculated from the acoustic feature is equal to or greater than a predetermined speech section probability threshold, and to be a non-speech section when the probability is below that threshold.

The utterance section processing unit 116 can determine a run of non-speech frames that continues for at least a preset lower limit on the number of consecutive frames to be a non-utterance section. It can determine a section consisting of one or more frames that is sandwiched between non-utterance sections before and after it to be an utterance section. As a result, even an utterance section containing a temporary silent period is detected as a single continuous utterance section without being split by the silent period.
The utterance section processing unit 116 outputs the determined utterance section to the sampling processing unit 118. Note that the speech section determination model may be implemented as part of the first model or the third model.
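The two-stage decision described above (a per-frame threshold, then grouping frames so that short pauses do not split an utterance) can be sketched as follows. The function name, the default threshold of 0.5, and the default gap of 30 frames are hypothetical. As a simplification, the sketch treats the start and end of the input as boundaries and reports each section from its first to its last speech frame.

```python
def detect_utterance_sections(speech_probs, prob_threshold=0.5, min_gap_frames=30):
    """Return utterance sections as (start, end) frame indices, end exclusive.

    A frame counts as speech when its probability is >= prob_threshold.
    A run of at least min_gap_frames non-speech frames closes the current
    utterance; shorter pauses stay inside it.
    """
    sections = []
    start = None        # first frame of the utterance being tracked
    last_speech = None  # most recent speech frame in that utterance
    for i, prob in enumerate(speech_probs):
        if prob >= prob_threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_gap_frames:
            sections.append((start, last_speech + 1))
            start = None
    if start is not None:
        sections.append((start, last_speech + 1))
    return sections
```

For instance, with `min_gap_frames=3`, a two-frame pause between two bursts of speech is absorbed into one section, while a five-frame pause separates them.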

The utterance section processing unit 116 may analyze the acoustic characteristics of each frame on its own from the input speech signal, without acquiring acoustic features from the feature analysis unit 112. As the acoustic characteristics, it may analyze, for example, the power and the number of zero crossings. The number of zero crossings is the number of times the signal value within a frame in the time domain changes from a positive value to a negative value or from a negative value to a positive value. The utterance section processing unit 116 may determine a frame to be a speech section when the analyzed power is greater than a predetermined power threshold and the number of zero crossings is within a predetermined range (for example, 300 to 1000 per second), and determine any other frame to be a non-speech section.

The sampling processing unit 118 receives the acoustic feature for each frame from the feature analysis unit 112 and the enhanced feature for each frame from the speech enhancement unit 114.
The sampling processing unit 118 receives the utterance section from the utterance section processing unit 116. The sampling processing unit 118 arranges the per-frame enhanced features in the utterance section in frame order to form the enhanced feature sequence. Likewise, it arranges the per-frame acoustic features in the utterance section in frame order to form the acoustic feature sequence.

In accordance with the third model, the sampling processing unit 118 determines the probability distribution of the target feature sequence for the enhanced feature sequence and the acoustic feature sequence within the utterance section, and samples a sample value of the target feature sequence N times from that distribution using pseudo-random numbers. The target feature is a value estimated as the realization of the enhanced feature obtained by applying speech enhancement processing to the acoustic feature. It is introduced because speech enhancement processing does not necessarily emphasize only the speech component, owing to over- or under-suppression of the non-speech components.

そこで、本実施形態に係るモデル学習部１３０は、第３モデルに係る確率分布として、既知の強調特徴量系列と音響特徴量系列のセットに対する目標特徴量系列との関係を示す確率分布を用い、Ｎ個の目標特徴量系列のサンプル値を得る。このサンプル値に基づく期待値は、目標特徴量に係る期待値として統計的に現実の値となる可能性が高くなる。Ｎは、２以上の予め定めた整数である。Ｎは、例えば、１０～２００である。第３モデルは、後述の発話区間別エビデンスモデルに相当し、確率的エビデンスモデルの一種とみなすことができる。
なお、強調特徴量、音響特徴量が、それぞれ短時間フーリエ変換係数である場合、目標特徴量系列をなす各フレームの目標特徴量は、サンプリング処理部１１８または隠れ状態処理部１２０においてメルフィルタバンクに変換されてもよい。 Therefore, the model learning unit 130 according to this embodiment obtains sample values of N target feature sequences using, as the probability distribution related to the third model, a probability distribution indicating the relationship between a target feature sequence and a set of known emphasis feature sequences and acoustic feature sequences. Expected values based on these sample values are statistically more likely to be actual values related to the target features. N is a predetermined integer equal to or greater than 2. N is, for example, 10 to 200. The third model corresponds to an utterance segment-specific evidence model, which will be described later, and can be regarded as a type of probabilistic evidence model.
In addition, when the emphasis features and the acoustic features are each short-time Fourier transform coefficients, the target features of each frame forming the target feature sequence may be converted into a Mel filter bank in the sampling processing unit 118 or the hidden state processing unit 120.

サンプリング処理部１１８は、目標特徴量系列のサンプル値を隠れ状態処理部１２０に出力する。
サンプリング処理部１１８には、目標特徴量系列のサンプル値に対する応答として隠れ状態特徴量系列のサンプル値が隠れ状態処理部１２０から入力される。サンプリング処理部１１８は、隠れ状態特徴量系列のサンプル値をサンプル間で平均して得られる期待値をその発話区間における隠れ状態特徴量系列として定める。サンプリング処理部１１８は、定めた隠れ状態特徴量系列を発話処理部１２２に出力する。サンプリング処理部１１８の具体例については、後述する。 The sampling processing unit 118 outputs sample values of the target feature sequence to the hidden state processing unit 120 .
The sampling processing unit 118 receives sample values of the hidden state feature sequence as a response to the sample values of the target feature sequence from the hidden state processing unit 120. The sampling processing unit 118 determines an expected value obtained by averaging the sample values of the hidden state feature sequence between samples as the hidden state feature sequence for that utterance section. The sampling processing unit 118 outputs the determined hidden state feature sequence to the utterance processing unit 122. A specific example of the sampling processing unit 118 will be described later.

隠れ状態処理部１２０には、各サンプルについてサンプリング処理部１１８から目標特徴量系列のサンプル値が入力される。隠れ状態処理部１２０は、第２モデルを用い、目標特徴量系列のサンプル値に対する隠れ状態特徴量系列のサンプル値を算出する。第２モデルは、例えば、公知のＣＴＣ（connectionist temporal classification）／アテンションアーキテクチャ（attention architecture）の一部をなす共有エンコーダ（shared encoder）ネットワークに相当するモデルであってもよい。第２モデルには、既知のクリーン音声の音声特徴量系列に対して隠れ状態特徴量系列を与えるように学習されたパラメータセットが適用されてもよい。隠れ状態処理部１２０の具体例については、後述する。 The hidden state processing unit 120 receives sample values of the target feature sequence from the sampling processing unit 118 for each sample. The hidden state processing unit 120 uses a second model to calculate sample values of the hidden state feature sequence for the sample values of the target feature sequence. The second model may be, for example, a model equivalent to a shared encoder network that forms part of a known CTC (connectionist temporal classification)/attention architecture. A parameter set trained to provide a hidden state feature sequence for a speech feature sequence of known clean speech may be applied to the second model. A specific example of the hidden state processing unit 120 will be described later.

発話処理部１２２には、サンプリング処理部１１８から隠れ状態特徴量系列が入力される。発話処理部１２２は、第４モデルを用い、隠れ状態特徴量系列に対し、その発話区間における発話内容の候補（仮説）ごとに、その候補が発話された可能性を示す事後確率（posterior probability）を算出する。発話処理部１２２は、算出した事後確率が最大となる発話内容の候補を、その発話区間における認識結果として探索する。 The speech processing unit 122 receives the hidden state feature sequence from the sampling processing unit 118. Using the fourth model, the speech processing unit 122 calculates, for the hidden state feature sequence, a posterior probability indicating the likelihood that each candidate (hypothesis) of speech content in that speech section has been uttered. The speech processing unit 122 searches for the candidate of speech content with the highest calculated posterior probability as the recognition result for that speech section.

第４モデルは、例えば、ＣＴＣ／アテンションアーキテクチャを有してもよい。ＣＴＣ／アーキテクチャは、ＣＴＣデコーダネットワークとアテンションデコーダネットワークを含む。発話内容、または、その候補は、１以上のラベルを含むラベル列を用いて構成される。ラベルは、文字、音節、単語、その他、発話内容の表記に係る任意の単位となりうる。ラベル列は、テキストを用いて表現されることがある。発話情報の候補の集合から認識結果を探索する際、例えば、公知のビームサーチ法（beam search technique）を用いることができる。発話処理部１２２の具体例については、後述する。
なお、制御部１１０は、発話処理部１２２が定めた認識結果である発話情報を保存してもよいし、他の処理に用いてもよいし、他の機器に出力してもよい。 The fourth model may have, for example, a CTC/attention architecture. The CTC/attention architecture includes a CTC decoder network and an attention decoder network. The speech content or its candidates is configured using a label string including one or more labels. A label can be a character, a syllable, a word, or any other unit related to the representation of the speech content. The label string may be expressed using text. When searching for a recognition result from a set of speech information candidates, for example, a well-known beam search technique can be used. A specific example of the speech processing unit 122 will be described later.
The control unit 110 may store the speech information, which is the recognition result determined by the speech processing unit 122, or may use it for other processing, or may output it to other devices.

モデル学習部１３０は、予め構成された訓練データを用いて第２モデルおよび第４モデルを学習する。本開示では、「モデル学習」または「モデルを学習する」とは、モデルに基づく演算において用いられるパラメータセットを定めることを意味する。訓練データは、複数の異なるデータセットを含み、個々のデータセットは、既知の入力データと出力データを含み、それらを対応付けて構成される。モデル学習部１３０は、あるモデルの学習において、入力データをなす入力値に対する演算により得られる演算値が、その入力データに対応する出力データをなす出力値との差が訓練データ全体として減少（最小化）するようにパラメータセットを再帰的（recurrently）に更新する。差が所定の判定閾値以下になったとき、または、更新回数が所定の回数に達したとき、モデル学習部１３０は、その時点でモデル学習を停止し、得られたパラメータセットを、それぞれのモデルに係る機能部に設定する。 The model learning unit 130 learns the second and fourth models using pre-configured training data. In this disclosure, "model learning" or "learning a model" means determining parameter sets to be used in model-based calculations. The training data includes multiple different data sets, each of which includes known input data and output data and is configured by associating these data sets. In learning a certain model, the model learning unit 130 recurrently updates the parameter sets so that the difference between the calculated value obtained by calculating the input value that constitutes the input data and the output value that constitutes the output data corresponding to that input data is reduced (minimized) for the entire training data. When the difference falls below a predetermined threshold, or when the number of updates reaches a predetermined number, the model learning unit 130 stops model learning at that point and sets the obtained parameter sets in the functional units associated with each model.

モデル学習部１３０は、第２モデルと第４モデルを同時に学習する。学習に用いられる訓練データをなす個々のデータセットは、入力データとして、ある発話区間における音声信号から導出される音声特徴量を含み、出力データとして、その音声区間における既知の発話内容を示す発話情報を含む。第２モデルと第４モデルの学習では、クリーン音声を示す音声信号から導出される入力データが用いられてもよい。この出力データは、正解を与える発話情報の候補に対する事後確率を１、その他の発話情報に対する事後確率を０とするベクトル値で表されうる。 The model learning unit 130 learns the second and fourth models simultaneously. Each data set constituting the training data used for learning includes, as input data, speech features derived from a speech signal in a certain speech section, and, as output data, speech information indicating known speech content in that speech section. When learning the second and fourth models, input data derived from a speech signal indicating clean speech may be used. This output data may be represented as a vector value in which the posterior probability for the candidate speech information that provides the correct answer is 1, and the posterior probability for other speech information is 0.

演算値と出力値との差の大きさを示す損失関数（loss function）として、例えば、二元交差エントロピー（binary cross entropy）を用いることができる。パラメータセットの更新において、例えば、再急勾配法（steepest gradient）もしくは確率的勾配降下法（stochastic gradient descent）に基づく誤差逆伝搬法（backpropagation）、または、その変形（例えば、アダム最適化（Adam Optimizer））を用いることができる。 For example, binary cross entropy can be used as the loss function that indicates the magnitude of the difference between the calculated value and the output value. To update the parameter set, backpropagation based on the steepest descent method or on stochastic gradient descent, or a variant thereof (e.g., the Adam optimizer), can be used.
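As an illustration of the loss mentioned above, the following sketch evaluates binary cross entropy between a predicted posterior vector and a one-hot target vector. The function name, the clipping constant `eps` and the averaging over elements are assumptions made for this example; the source does not specify them.

```python
import math

def binary_cross_entropy(predicted, target, eps=1e-12):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(predicted, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(target)

# One-hot target: the first candidate is the correct utterance content.
target = [1.0, 0.0, 0.0]
good = binary_cross_entropy([0.9, 0.05, 0.05], target)
bad = binary_cross_entropy([0.1, 0.6, 0.3], target)
print(good < bad)
```

A prediction that concentrates probability on the correct candidate yields the smaller loss, which is the quantity a gradient-based update would reduce.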

なお、音声強調処理に用いられる第１モデルは、音声認識処理に用いられる第２モデルと第４モデルとは独立に学習することができる。第１モデルの学習については、音声強調部１１４との具体例とともに後述する。後述の第３モデルの構成によっては、必ずしも学習を要しない。第３モデルの一部または全部のパラメータは、第１モデルの学習過程で得られる演算値から導出されてもよい。 The first model used in the voice enhancement process can be trained independently of the second and fourth models used in the voice recognition process. The training of the first model will be described later along with a specific example of the voice enhancement unit 114. Depending on the configuration of the third model described below, training may not be necessary. Some or all of the parameters of the third model may be derived from calculated values obtained during the training process of the first model.

（ハードウェア構成例）
次に、本実施形態に係る音声認識装置１０のハードウェア構成例について説明する。図２は、本実施形態に係る音声認識装置１０のハードウェア構成例を示す概略ブロック図である。音声認識装置１０は、図１に例示される各１個または複数個の機能部の組をなす専用の部材（例えば、集積回路）を含む音声認識システムとして構成されてもよい。音声認識装置１０は、音声認識システムとして汎用のコンピュータシステムの一部または全部として構成されてもよい。 (Example of hardware configuration)
Next, an example of the hardware configuration of the speech recognition device 10 according to this embodiment will be described. Fig. 2 is a schematic block diagram showing an example of the hardware configuration of the speech recognition device 10 according to this embodiment. The speech recognition device 10 may be configured as a speech recognition system including dedicated components (e.g., integrated circuits) forming a set of one or more functional units illustrated in Fig. 1. The speech recognition device 10 may also be configured as a part or the entirety of a general-purpose computer system as a speech recognition system.

音声認識装置１０は、例えば、プロセッサ１５２、ドライブ部１５６、入力部１５８、出力部１６０、ＲＯＭ（Read Only Memory）１６２、ＲＡＭ（Random Access Memory）１６４、補助記憶部１６６、および、インタフェース部１６８を含んで構成される。プロセッサ１５２、ドライブ部１５６、入力部１５８、出力部１６０、ＲＯＭ１６２、ＲＡＭ１６４、補助記憶部１６６、および、インタフェース部１６８は、バスＢＳ（基線）を用いて相互に接続される。 The speech recognition device 10 includes, for example, a processor 152, a drive unit 156, an input unit 158, an output unit 160, a ROM (Read Only Memory) 162, a RAM (Random Access Memory) 164, an auxiliary storage unit 166, and an interface unit 168. The processor 152, drive unit 156, input unit 158, output unit 160, ROM 162, RAM 164, auxiliary storage unit 166, and interface unit 168 are interconnected using a bus BS (base line).

プロセッサ１５２は、例えば、ＲＯＭ１６２に記憶されたプログラムや各種のデータを読み出し、当該プログラムを実行して、音声認識装置１０の動作を制御する。音声認識装置１０におけるプロセッサ１５２の個数は１個に限らず、複数となってもよい。プロセッサ１５２は、例えば、ＣＰＵ（Central Processing Unit）である。プロセッサ１５２の個数が複数となる場合、本実施形態に係る処理が複数のプロセッサ１５２間で分担されてもよい。また、複数のプロセッサ１５２の種類は、必ずしも全て同一でなくてもよく、一部が異なっていてもよい。複数のプロセッサ１５２には、ＣＰＵの他、少なくとも１個のＧＰＵ（Graphic Processing Unit）が含まれてもよい。なお、本実施形態では「プログラムを実行する」とは、プログラムに記述された各種の指令（コマンド）で指示された処理を実行するとの意味を含む。 The processor 152, for example, reads programs and various data stored in the ROM 162, executes the programs, and controls the operation of the speech recognition device 10. The number of processors 152 in the speech recognition device 10 is not limited to one, but may be multiple. The processor 152 is, for example, a CPU (Central Processing Unit). When there are multiple processors 152, the processing according to this embodiment may be shared among the multiple processors 152. Furthermore, the multiple processors 152 do not necessarily all need to be the same type; some may be different. The multiple processors 152 may include at least one GPU (Graphics Processing Unit) in addition to a CPU. Note that in this embodiment, "executing a program" includes the meaning of executing processing instructed by various instructions (commands) written in the program.

プロセッサ１５２は、所定のプログラムを実行して、上記の制御部１１０の全部または一部の機能部、例えば、特徴分析部１１２、音声強調部１１４、発話区間処理部１１６、サンプリング処理部１１８、隠れ状態処理部１２０、発話処理部１２２、および、モデル学習部１３０の一部または全部の機能を実現する。 The processor 152 executes a predetermined program to realize some or all of the functions of the above-mentioned control unit 110, such as the feature analysis unit 112, speech enhancement unit 114, speech section processing unit 116, sampling processing unit 118, hidden state processing unit 120, speech processing unit 122, and some or all of the functions of the model learning unit 130.

記憶媒体１５４は、各種のデータを記憶する。記憶媒体１５４は、例えば、光磁気ディスク、フレキシブルディスク、フラッシュメモリなどの可搬記憶媒体である。
ドライブ部１５６は、例えば、記憶媒体１５４からの各種データの読み出しと、記憶媒体１５４への各種データの書き込みの一方または両方を行う機器である。 The storage medium 154 stores various types of data and is, for example, a portable storage medium such as a magneto-optical disk, a flexible disk, or a flash memory.
The drive unit 156 is, for example, a device that reads various data from the storage medium 154 and/or writes various data to the storage medium 154 .

入力部１５８は、入力元となる各種の機器から入力データが入力され、入力データをプロセッサ１５２に出力する。
出力部１６０は、プロセッサ１５２から入力される出力データを、出力先となる各種の機器に出力する。 The input unit 158 receives input data from various input source devices and outputs the input data to the processor 152 .
The output unit 160 outputs the output data input from the processor 152 to various devices as output destinations.

ＲＯＭ１６２は、例えば、プロセッサ１５２が実行するためのプログラムを記憶する。
ＲＡＭ１６４は、例えば、プロセッサ１５２で用いられる各種データ、プログラムを一時的に保存する作業領域として機能する主記憶媒体として用いられる。
補助記憶部１６６は、ＨＤＤ（Hard Disk Drive）、フラッシュメモリなどの記憶媒体である。 The ROM 162 stores, for example, a program to be executed by the processor 152 .
The RAM 164 is used as a main storage medium that functions as a work area for temporarily storing various data and programs used by the processor 152, for example.
The auxiliary storage unit 166 is a storage medium such as a hard disk drive (HDD) or a flash memory.

インタフェース部１６８は、他の機器と接続し各種のデータを入力および出力可能とする。インタフェース部１６８は、例えば、有線または無線でネットワークに接続する通信モジュールを備える。 The interface unit 168 connects to other devices and allows various types of data to be input and output. The interface unit 168 includes, for example, a communication module that connects to a network via wired or wireless means.

（音声強調部の具体例）
次に、本実施形態に係る音声認識装置１０の音声強調部１１４の具体例について説明する。
音声強調部１１４は、予め設定された第１モデルを用い、フレームｔごとに音響特徴量（noisy observed spectrum）ｘ_ｔに対し、強調特徴量（feature vector）ｓ_ｔを算出する。ここで、音声強調部１１４は、音響特徴量ｘ_ｔとそのソフトマスクｍ（ｘ^～ _ｔ）を要素ごとに乗じて得られる乗算値に対する対数値を雑音除去音響特徴量（denoised spectrum）ｙ^～ _ｔとして算出する。式（１）をはじめとする数式を構成する文字の上部に付された～、＾などの記号は、本文中では、ｘ^～ _ｔ、ｙ^～ _ｔなどと文字に隣接して表記する。 (Specific example of a voice enhancement unit)
Next, a specific example of the voice enhancement unit 114 of the voice recognition device 10 according to this embodiment will be described.
The speech enhancement unit 114 uses a preset first model to calculate an enhancement feature (feature vector) s_t from the acoustic feature (noisy observed spectrum) x_t for each frame t. Here, the speech enhancement unit 114 calculates, as the noise-removed acoustic feature (denoised spectrum) y^~_t, the logarithm of the element-wise product of the acoustic feature x_t and its soft mask m(x^~_t). Symbols such as ~ and ^ that are placed above letters in Equation (1) and the other formulas are written adjacent to the letters in the body text, as in x^~_t and y^~_t.

式（１）に例示されるように、雑音除去音響特徴量ｙ^～ _ｔには、その対数値に対して、さらに予測誤差（prediction error）ｎ_ｔ ^ｙが加算されてもよい。式（１）において、右辺第２項の小さい○は、要素ごとの乗算を示す。ｌｏｇ（…）は、…の対数値を示す。｜…｜は、…の絶対値を示す。ｘ^～ _ｔは、フレームｔ－ｋの音響特徴量ｘ_ｔ－ｋからフレームｔ＋ｋの音響特徴量ｘ_ｔ－ｋまでの２ｋ＋１フレームにわたり結合されてなる結合ベクトル（concatenated vector）を示す。ソフトマスクｍ（ｘ^～ _ｔ）は、ｘ^～ _ｔを入力値として各要素が０から１の間の自然数を出力値として与える関数によりモデル化される。予測誤差ｎ_ｔ ^ｙは、確率密度として多次元ガウス関数Ｎ（０，Λ_ｙｔ ^－１）に従い、疑似乱数を用いてサンプリングして得られる。Λ_ｙｔ ^－１は、精度行列（precision matrix）Λ_ｙｔの逆行列を示す。精度行列Λ_ｙｔは、Ｄ個の分散λ_{ｙ，ｔ，１} ^２，…，λ_{ｙ，ｔ，Ｄ} ^２を対角成分として有するＤ行Ｄ列の対角行列である。Ｄは、強調特徴量ｓ_ｔの次元数を示す整数値である。音声強調部１１４は、分散λ_{ｙ，ｔ，１} ^２，…，λ_{ｙ，ｔ，Ｄ} ^２もＤＮＮの他の一部の出力値として算出してもよい。なお、音響特徴量ｘ_ｔ、雑音除去音響特徴量ｙ^～ _ｔの次元数Ｆは、それぞれフレーム長に相当する。 As exemplified in Equation (1), a prediction error n_t^y may further be added to this logarithm to give the noise-removed acoustic feature y^~_t. In Equation (1), the small circle in the second term on the right side denotes element-wise multiplication, log(...) denotes the logarithm of ..., and |...| denotes the absolute value of .... x^~_t denotes the concatenated vector formed by concatenating the acoustic features over 2k+1 frames, from x_{t-k} of frame t-k to x_{t+k} of frame t+k. The soft mask m(x^~_t) is modeled by a function that takes x^~_t as its input and outputs, for each element, a real number between 0 and 1. The prediction error n_t^y is obtained by sampling with pseudo-random numbers from a multidimensional Gaussian function N(0, Λ_{y,t}^-1) as its probability density. Λ_{y,t}^-1 denotes the inverse of the precision matrix Λ_{y,t}. The precision matrix Λ_{y,t} is a D-by-D diagonal matrix having the D variances λ_{y,t,1}^2, ..., λ_{y,t,D}^2 as its diagonal components. D is an integer indicating the number of dimensions of the enhancement feature s_t. The speech enhancement unit 114 may also calculate the variances λ_{y,t,1}^2, ..., λ_{y,t,D}^2 as another part of the output values of the DNN.
Note that the number of dimensions F of the acoustic feature x_t and of the noise-removed acoustic feature y^~_t each corresponds to the frame length.
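The computation described for Equation (1) can be sketched as below. Because the equation itself is not reproduced in the text, this follows only the prose: the logarithm of the element-wise product of the mask and the magnitude of the observed spectrum, with an optional zero-mean Gaussian prediction error. The function name, the `floor` guard against log(0) and the use of per-element variances are assumptions made for this example, and the soft mask is passed in as plain numbers rather than computed by a DNN.

```python
import math
import random

def denoised_log_feature(x_t, mask_t, variances=None, rng=None, floor=1e-10):
    """Element-wise log(mask * |x|), optionally plus zero-mean Gaussian error."""
    y = [math.log(max(m * abs(x), floor)) for x, m in zip(x_t, mask_t)]
    if variances is not None:
        rng = rng or random.Random()
        # One independent error term per element (diagonal covariance assumed).
        y = [y_d + rng.gauss(0.0, math.sqrt(v)) for y_d, v in zip(y, variances)]
    return y

# Hypothetical spectrum with real and complex coefficients, and a soft mask in (0, 1].
x_t = [2.0, -4.0, 1 + 1j]
mask_t = [0.5, 0.25, 1.0]
y_clean = denoised_log_feature(x_t, mask_t)
y_noisy = denoised_log_feature(x_t, mask_t, variances=[0.01, 0.01, 0.01],
                               rng=random.Random(0))
print(y_clean)
```

Passing `variances=None` gives the deterministic feature without the prediction error, which corresponds to the quantity written y^_t later in the description.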

次に、音声強調部１１４は、雑音除去音響特徴量ｙ^～ _ｔに対する特徴抽出関数（feature extraction function）ｆ（ｙ^～ _ｔ）の関数値を強調特徴量ｓ_ｔとして算出する。式（２）に例示されるように、強調特徴量ｓ_ｔは、その関数値に対して、さらに予測誤差ｎ_ｔ ^ｓが加算されてもよい。予測誤差ｎ_ｔ ^ｓは、確率密度として多次元ガウス関数Ｎ（０，Λ_ｓ，ｔ ^－１）に従い、疑似乱数を用いてサンプリングして得られる。Λ_ｓｔ ^－１は、精度行列Λ_ｓ，ｔの逆行列を示す。精度行列Λ_ｓ，ｔは、Ｄ個の分散λ_{ｓ，ｔ，１} ^２，…，λ_{ｓ，ｔ，Ｄ} ^２を対角成分として有するＤ行Ｄ列の対角行列である。本実施形態では、ソフトマスクｍ（ｘ^～ _ｔ）、特徴抽出関数ｆ（ｙ^～ _ｔ）が、それぞれ第１モデルをなすＤＮＮの一部（サブセット）として実現される。また、分散λ_{ｙ，ｔ，１} ^２，…，λ_{ｙ，ｔ，Ｄ} ^２、λ_{ｓ，ｔ，１} ^２，…，λ_{ｓ，ｔ，Ｄ} ^２もＤＮＮからの出力値として算出されてもよい。 Next, the speech enhancement unit 114 calculates the function value of a feature extraction function f(y ^∼ _t ) for the noise-removed acoustic feature y ^∼ _t as an enhancement feature s _t . As exemplified in Equation (2), a prediction error n _t ^s may be added to the function value of the enhancement feature s _t . The prediction error n _t ^s is obtained by sampling using pseudo-random numbers according to a multidimensional Gaussian function N(0,Λ _s,t ⁻¹ ) as a probability density. Λ _st ⁻¹ indicates the inverse matrix of the precision matrix Λ _s,t . The precision matrix Λ _s,t is a diagonal matrix with D rows and D columns having D variances λ _s,t,1 ² , ..., λ _s,t,D ² as diagonal elements. In this embodiment, the soft mask m(x ^∼ _t ) and the feature extraction function f(y ^∼ _t ) are each realized as a part (subset) of a DNN constituting a first model. In addition, the variances λ _y,t,1 ² , ..., λ _y,t,D ² , λ _s,t,1 ² , ..., λ _s,t,D ² may also be calculated as output values from the DNN.

次に、第１モデルの学習について説明する。モデル学習部１３０は、音響特徴量ｘ_ｔの結合ベクトルｘ^～ _ｔに対する強調特徴量ｓ_ｔの確率密度関数ｐ（ｓ_ｔ｜ｘ^～ _ｔ）の対数尤度（log-likelihood）ｌｏｇｐ（ｓ_ｔ｜ｘ^～ _ｔ）が増加（最大化）するように第１モデルを学習する。確率密度ｐ（ｓ_ｔ｜ｘ^～ _ｔ）は、式（３）に示すように結合ベクトルｘ^～ _ｔを条件とする雑音除去音響特徴量ｙ^～ _ｔの条件付き確率ｐ（ｙ^～ _ｔ｜ｘ^～ _ｔ）と、結合ベクトルｘ^～ _ｔと雑音除去音響特徴量ｙ^～ _ｔのセットに対する強調特徴量ｓ_ｔの条件付き確率（ｓ_ｔ｜ｙ^～ _ｔ，ｘ^～ _ｔ）との畳み込み積分値となる。そのため、対数尤度ｌｏｇｐ（ｓ_ｔ｜ｘ^～ _ｔ）を解析的に導出することは一般的に困難である。 Next, the training of the first model will be described. The model learning unit 130 trains the first model so as to increase (maximize) the log-likelihood log p(s_t|x^~_t) of the probability density function p(s_t|x^~_t) of the enhancement feature s_t given the concatenated vector x^~_t of the acoustic features x_t. As shown in Equation (3), the probability density p(s_t|x^~_t) is the convolution integral of the conditional probability p(y^~_t|x^~_t) of the noise-removed acoustic feature y^~_t conditioned on the concatenated vector x^~_t, and the conditional probability p(s_t|y^~_t, x^~_t) of the enhancement feature s_t given the set of the concatenated vector x^~_t and the noise-removed acoustic feature y^~_t. It is therefore generally difficult to derive the log-likelihood log p(s_t|x^~_t) analytically.

本実施形態では、式（４）に示すように、対数尤度ｌｏｇｐ（ｓ_ｔ｜ｘ^～ _ｔ）の下限が、雑音除去音響特徴量ｙ＾_ｔが結合ベクトルｘ^～ _ｔを条件とする雑音除去音響特徴量ｙ＾_ｔの条件付き変分事後確率ｑ（ｙ＾_ｔ｜ｘ^～ _ｔ）に近似され、条件付き対数尤度条件付き確率ｐ（ｙ＾_ｔ｜ｘ^～ _ｔ）と条件付き確率（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）との積の対数値の期待値となることを利用する。式（４）において、Ｅ（…）は、…の期待値を示す。但し、式（４）では、簡単のため式（３）の雑音除去音響特徴量ｙ^～ _ｔに代えて、予測誤差ｎ_ｔ ^ｙが加算されていない雑音除去音響特徴量ｙ＾_ｔが適用されている。学習段階では、音声成分以外の雑音成分も既知であるためである。 In this embodiment, as shown in Equation (4), use is made of the fact that a lower bound of the log-likelihood log p(s_t|x^~_t) is given by the expected value, taken under a conditional variational posterior q(y^_t|x^~_t) that approximates the distribution of the noise-removed acoustic feature y^_t given the concatenated vector x^~_t, of the logarithm of the product of the conditional probability p(y^_t|x^~_t) and the conditional probability p(s_t|y^_t, x^~_t). In Equation (4), E(...) denotes the expected value of .... For simplicity, Equation (4) uses the noise-removed acoustic feature y^_t, to which the prediction error n_t^y has not been added, in place of the noise-removed acoustic feature y^~_t of Equation (3). This is because, in the training stage, the noise components other than the speech component are also known.
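For orientation, the standard variational argument behind a bound of this kind can be written out as follows. This is a textbook sketch based on Jensen's inequality and the factorization described for Equation (3); it is not a reproduction of Equation (4), which is not shown in the text, and in particular whether the patent's equation retains the term in q inside the logarithm is not stated.

```latex
\begin{align}
\log p(s_t \mid \tilde{x}_t)
  &= \log \int p(s_t \mid \hat{y}_t, \tilde{x}_t)\,
               p(\hat{y}_t \mid \tilde{x}_t)\, d\hat{y}_t \\
  &= \log \mathbb{E}_{q(\hat{y}_t \mid \tilde{x}_t)}
       \left[ \frac{p(s_t \mid \hat{y}_t, \tilde{x}_t)\,
                    p(\hat{y}_t \mid \tilde{x}_t)}
                   {q(\hat{y}_t \mid \tilde{x}_t)} \right] \\
  &\ge \mathbb{E}_{q(\hat{y}_t \mid \tilde{x}_t)}
       \left[ \log \frac{p(s_t \mid \hat{y}_t, \tilde{x}_t)\,
                         p(\hat{y}_t \mid \tilde{x}_t)}
                        {q(\hat{y}_t \mid \tilde{x}_t)} \right]
\end{align}
```

The last line separates into the expectation of log p(s_t|y^_t, x^~_t) + log p(y^_t|x^~_t), which is the expected log of the product named above, plus the entropy of q, which does not depend on the parameters of p.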

従って、モデル学習部１３０は、式（４）の右辺で定義される数値が最大化させるように第１モデルを学習できればよい。より具体的には、モデル学習部１３０は、式（５）に例示されるコスト関数（cost function）Ｊ_ｓｅを最小化するように第１モデルを学習することができる。コスト関数Ｊ_ｓｅは、条件付き確率ｐ（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）の対数値ｌｏｇｐ（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）と結合ベクトルｘ^～ _ｔを条件とする対数特徴量ｙ^－ _ｔの条件付き確率対数値ｌｏｇｐ（ｓ_ｔ｜ｙ^－ _ｔ｜ｘ^～ _ｔ）との和に対し正負の符号を反転させた値に相当する。式（５）は、式（４）の右辺において、条件付き変分事後確率ｑ（ｙ＾_ｔ｜ｘ^～ _ｔ）がデルタ関数であることを仮定して導出される。対数特徴量ｙ^－ _ｔは、雑音除去音響特徴量ｙ＾_ｔの絶対値に対する対数値に相当する。条件付き確率対数値ｌｏｇｐ（ｓ_ｔ｜ｙ^－ _ｔ｜ｘ^～ _ｔ）は、ソフトマスクの学習を促進する。ここで、条件付き確率ｐ（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）、ｐ（ｓ_ｔ｜ｙ^－ _ｔ｜ｘ^～ _ｔ）は多次元ガウス関数と仮定されてもよい。その場合、モデル学習部１３０は、学習済みの条件付き確率ｐ（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）の平均値を関数値ｆ（ｙ＾_ｔ）と定め、その分散を定めることで、多次元ガウス関数Ｎ（ｆ（ｙ＾_ｔ），Λ_ｓ，ｔ ^－１（ｙ＾_ｔ，ｚ^～ _ｔ））を定義することができる。Λ_ｓ，ｔ ^－１（ｙ＾_ｔ，ｚ^～ _ｔ）は、精度行列の逆行列Λ_ｓ，ｔ ^－１（ｙ＾_ｔ，ｚ^～ _ｔ）を示す。この関数値ｆ（ｙ＾_ｔ）は、第１モデルに基づく演算過程において得られる。モデル学習部１３０は、第１モデルからの出力である強調特徴量の相関行列に対してＬＵ分解を行って精度行列Λ_ｓ，ｔを算出することができる。 Therefore, it suffices for the model learning unit 130 to train the first model so as to maximize the value defined by the right side of Equation (4). More specifically, the model learning unit 130 can train the first model so as to minimize the cost function J_se exemplified in Equation (5). The cost function J_se corresponds to the sign-inverted sum of the logarithm log p(s_t|y^_t, x^~_t) of the conditional probability p(s_t|y^_t, x^~_t) and the conditional log-probability log p(s_t|y^-_t|x^~_t) of the logarithmic feature y^-_t conditioned on the concatenated vector x^~_t. Equation (5) is derived by assuming that the conditional variational posterior q(y^_t|x^~_t) on the right side of Equation (4) is a delta function. The logarithmic feature y^-_t corresponds to the logarithm of the absolute value of the noise-removed acoustic feature y^_t. The conditional log-probability log p(s_t|y^-_t|x^~_t) facilitates training of the soft mask. Here, the conditional probabilities p(s_t|y^_t, x^~_t) and p(s_t|y^-_t|x^~_t) may each be assumed to be multidimensional Gaussian functions.
In this case, the model learning unit 130 can define the multidimensional Gaussian function N(f(y^_t), Λ_{s,t}^-1(y^_t, z^~_t)) by setting the mean of the trained conditional probability p(s_t|y^_t, x^~_t) to the function value f(y^_t) and determining its variance. Λ_{s,t}^-1(y^_t, z^~_t) denotes the inverse of the precision matrix. The function value f(y^_t) is obtained in the course of the computation based on the first model. The model learning unit 130 can calculate the precision matrix Λ_{s,t} by performing LU decomposition on the correlation matrix of the enhancement features output from the first model.
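When the conditional probabilities are assumed Gaussian with a diagonal covariance, each sign-inverted log-probability term in a cost of this kind reduces to the familiar expression sketched below. This shows a single such term only; Equation (5) is not reproduced in the text, and the function name and the example values are assumptions made for this illustration.

```python
import math

def diag_gaussian_nll(s, mean, variances):
    """Negative log-density of s under N(mean, diag(variances))."""
    nll = 0.0
    for s_d, mu_d, var_d in zip(s, mean, variances):
        # Normalization term plus squared error scaled by the variance.
        nll += 0.5 * (math.log(2.0 * math.pi * var_d) + (s_d - mu_d) ** 2 / var_d)
    return nll

# Hypothetical target feature s_t, predicted mean f(y^_t) and predicted variances.
s_t = [0.2, -1.0]
close = diag_gaussian_nll(s_t, [0.25, -0.9], [0.1, 0.1])
far = diag_gaussian_nll(s_t, [1.5, 0.5], [0.1, 0.1])
print(close < far)
```

Minimizing this term pulls the predicted mean toward the target, while the variance outputs let the model express how much error it expects in each dimension.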

（サンプリング処理部の具体例）
次に、本実施形態に係るサンプリング処理部１１８の具体例について、第２モデルと第４モデルとの関係を含めて説明する。
サンプリング処理部１１８は、第３モデルとして、発話別エビデンスモデル（utterance-wise evidence model）に基づいて隠れ状態特徴量系列ｈ_１：Ｔ’を定め、発話処理部１２２に対し隠れ状態特徴量系列ｈ_１：Ｔ’に基づいて発話内容の候補の事後確率を算出させることができる。隠れ状態特徴量系列ｈ_１：Ｔ’は、式（６）に示すように、エンコーダネットワークをなす第２モデルを用いることで、発話区間における強調特徴量系列ｚ_１：Ｔから推定することができる。Ｔ’は、発話区間内における隠れ状態特徴量のフレーム数を示す。フレーム数Ｔ’は、第３モデルをなすエンコーダネットワークの構成に依存し、その発話区間における隠れ状態特徴量系列ｚ_１：Ｔのフレーム数Ｔと等しくなることも、より少なくなることもある。 (Specific example of sampling processing unit)
Next, a specific example of the sampling processing unit 118 according to this embodiment will be described, including the relationship between the second model and the fourth model.
The sampling processing unit 118 determines the hidden state feature sequence h_{1:T'} based on an utterance-wise evidence model serving as the third model, and can cause the utterance processing unit 122 to calculate the posterior probabilities of the utterance content candidates based on the hidden state feature sequence h_{1:T'}. As shown in Equation (6), the hidden state feature sequence h_{1:T'} can be estimated from the emphasis feature sequence z_{1:T} in the utterance section by using the second model, which forms an encoder network. T' denotes the number of frames of the hidden state features in the utterance section. The number of frames T' depends on the configuration of the encoder network forming the second model, and may be equal to or smaller than the number of frames T of the emphasis feature sequence z_{1:T} in that utterance section.

発話別エビデンスモデルによる隠れ状態特徴量系列ｈ_１：Ｔ’の推定は、式（７）のように定式化される。式（７）によれば、発話区間における強調特徴量系列ｓ_１：Ｔに対する隠れ状態特徴量の系列ｈ_１：Ｔ’（ｓ_１：Ｔ）の期待値が、目標特徴量系列ｚ_１：Ｔに対する隠れ状態特徴量の系列ｈ_１：Ｔ’（ｚ_１：Ｔ）と目標特徴量系列ｚ_１：Ｔに対する発話別エビデンスモデルによる確率分布（以下、「モデル確率分布」と呼ぶことがある）ε（ｚ_１：Ｔ；Θ）との畳み込み積分値で表される。Θは、モデル確率分布のパラメータセットを示す。 Estimation of the hidden state feature sequence h_{1:T'} using the utterance-wise evidence model is formulated as shown in Equation (7). According to Equation (7), the expected value of the hidden state feature sequence h_{1:T'}(s_{1:T}) for the emphasis feature sequence s_{1:T} in an utterance section is expressed as the convolution integral of the hidden state feature sequence h_{1:T'}(z_{1:T}) for the target feature sequence z_{1:T} and the probability distribution ε(z_{1:T}; Θ) given by the utterance-wise evidence model for the target feature sequence z_{1:T} (hereinafter sometimes referred to as the "model probability distribution"). Θ denotes the parameter set of the model probability distribution.

発話区間における目標特徴量系列にわたる積分演算は極めて演算量が多くなるため一般に困難である。モデル確率分布は、目標特徴量系列を表す高次元空間（Ｄ×Ｔ次元）で定義されるためである。本実施形態に係るサンプリング処理部１１８は、準モンテカルロ近似（Monte-Carlo-like approximation）を適用し、モデル確率分布ε（ｚ_１：Ｔ；Θ）に従って目標特徴量系列ｚ_１：Ｔのサンプル値｛ｚ_１：Ｔ ^（ｎ）｝_ｎを、疑似乱数を用い複数回ランダムに抽出する。 Integration over the target feature sequence in an utterance section is generally difficult because it requires an extremely large amount of computation. This is because the model probability distribution is defined in the high-dimensional space (D×T dimensions) that represents the target feature sequence. The sampling processing unit 118 according to this embodiment therefore applies a Monte-Carlo-like approximation and randomly draws sample values {z_{1:T}^(n)}_n of the target feature sequence z_{1:T} multiple times using pseudo-random numbers in accordance with the model probability distribution ε(z_{1:T}; Θ).

図３の例では、サンプリング処理部１１８は、抽出した強調特徴量系列ｚ_１：Ｔのサンプル値に対する隠れ状態特徴量系列のサンプル値｛ｈ_１：Ｔ’ ^（ｎ）｝_ｎを隠れ状態処理部１２０に推定させる。サンプリング処理部１１８は、隠れ状態特徴量系列のサンプル値｛ｈ_１：Ｔ’ ^（ｎ）｝_ｎの平均値を隠れ状態特徴量系列の期待値Ｅ〔ｈ_１：Ｔ‘〕として算出する。サンプリング処理部１１８は、算出した隠れ状態特徴量系列の期待値Ｅ〔ｈ_１：Ｔ‘〕を発話処理部１２２に出力し、第４モデルを用いて発話内容を推定させる。 In the example of Fig. 3, the sampling processing unit 118 causes the hidden state processing unit 120 to estimate sample values {h_{1:T'}^(n)}_n of the hidden state feature sequence corresponding to the drawn sample values of the emphasis feature sequence z_{1:T}. The sampling processing unit 118 calculates the average of the sample values {h_{1:T'}^(n)}_n of the hidden state feature sequence as the expected value E[h_{1:T'}] of the hidden state feature sequence. The sampling processing unit 118 outputs the calculated expected value E[h_{1:T'}] to the utterance processing unit 122 and causes it to estimate the utterance content using the fourth model.
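The sample-and-average procedure described above can be sketched as follows. The sampler and encoder here are hypothetical stand-ins (a toy noise model and a toy linear map), not the third and second models of the embodiment; only the structure of drawing N sequences, encoding each one and averaging the results follows the text.

```python
import random

def expected_hidden_sequence(sample_target, encode, num_samples, rng):
    """Average encode(z) over num_samples draws z from sample_target(rng)."""
    total = None
    for _ in range(num_samples):
        h = encode(sample_target(rng))  # list of frames, each a list of floats
        if total is None:
            total = [[0.0] * len(frame) for frame in h]
        for t, frame in enumerate(h):
            for d, value in enumerate(frame):
                total[t][d] += value
    return [[v / num_samples for v in frame] for frame in total]

def toy_sampler(rng):
    """Hypothetical model distribution: one noisy element, the rest fixed."""
    return [[1.0 + rng.gauss(0.0, 0.1), 2.0], [3.0, 4.0]]

def toy_encoder(z):
    """Hypothetical encoder: doubles every element and keeps the frame count."""
    return [[2.0 * v for v in frame] for frame in z]

h_mean = expected_hidden_sequence(toy_sampler, toy_encoder, 100, random.Random(0))
print(h_mean)
```

The number of samples, 100 here, falls inside the range of 10 to 200 given for N earlier in the description; a real encoder may also return fewer frames than it receives, which the averaging loop handles as long as every sample yields the same shape.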

図３に例示されるように、第４モデルがアテンションデコーダネットワーク１２２ａとＣＴＣデコーダネットワーク１２２ｃを含んで構成される場合、両者において隠れ状態特徴量系列の期待値Ｅ〔ｈ_１：Ｔ‘〕が用いられる。アテンションデコーダネットワーク１２２ａとＣＴＣデコーダネットワーク１２２ｃは、それぞれ発話内容の候補ごとに事後確率ｐ_ａｔｔ（ｃ_１：Ｌ｜Ｅ〔ｈ_１：Ｔ‘〕）、ｐ_ｃｔｃ（ｃ_１：Ｌ｜Ｅ〔ｈ_１：Ｔ‘〕）を算出する。そして、発話処理部１２２の発話内容推定部１２２ｄは、式（８）に例示される事後確率ｐ_ａｔｔ（ｃ_１：Ｌ｜Ｅ〔ｈ_１：Ｔ‘〕）の対数値と事後確率ｐ_ｃｔｃ（ｃ_１：Ｌ｜Ｅ〔ｈ_１：Ｔ‘〕）の対数値の加重和を発話内容の候補ごとに音声認識スコアＪ_ａｓｒとして算出する。発話内容推定部１２２ｄは、音声認識スコアＪ_ａｓｒが最大となる発話内容の候補を発話内容として推定することができる。式（８）において、ｗ_１、ｗ_２は、それぞれ０以上１以下の予め定めた実数値であり、それらの和が１となるように正規化される。 As illustrated in Fig. 3, when the fourth model includes the attention decoder network 122a and the CTC decoder network 122c, the expected value E[h_{1:T'}] of the hidden state feature sequence is used in both networks. The attention decoder network 122a and the CTC decoder network 122c calculate the posterior probabilities p_att(c_{1:L}|E[h_{1:T'}]) and p_ctc(c_{1:L}|E[h_{1:T'}]), respectively, for each utterance content candidate. The utterance content estimation unit 122d of the utterance processing unit 122 then calculates, for each utterance content candidate, the weighted sum of the logarithm of the posterior probability p_att(c_{1:L}|E[h_{1:T'}]) and the logarithm of the posterior probability p_ctc(c_{1:L}|E[h_{1:T'}]), as exemplified in Equation (8), as the speech recognition score J_asr. The utterance content estimation unit 122d can estimate, as the utterance content, the candidate that maximizes the speech recognition score J_asr. In Equation (8), w_1 and w_2 are predetermined real numbers, each between 0 and 1 inclusive, normalized so that their sum is 1.
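The score combination described above can be illustrated with a short sketch. The candidate strings and probabilities are made-up values, and the names `w_att` and `w_ctc` are assumptions: the text defines the weights only as w_1 and w_2 and does not state which one multiplies which term.

```python
import math

def asr_score(p_att, p_ctc, w_att=0.5, w_ctc=0.5):
    """Weighted sum of the log posteriors from the two decoders."""
    return w_att * math.log(p_att) + w_ctc * math.log(p_ctc)

def best_candidate(candidates, w_att=0.5, w_ctc=0.5):
    """candidates maps a label sequence to its (p_att, p_ctc) pair."""
    return max(candidates, key=lambda c: asr_score(*candidates[c], w_att, w_ctc))

# Hypothetical posteriors for two utterance content candidates.
candidates = {"hello": (0.6, 0.5), "hollow": (0.3, 0.4)}
print(best_candidate(candidates))
```

In a full recognizer the candidate set would come from a search such as beam search rather than from an explicit dictionary; the dictionary here only keeps the example self-contained.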

上記の例では、アテンションデコーダネットワーク１２２ａとＣＴＣデコーダネットワーク１２２ｃのいずれにも隠れ状態特徴量系列の期待値Ｅ〔ｈ_１：Ｔ‘〕が入力される場合を仮定したが、これには限られない。図５に例示されるように、アテンションデコーダネットワーク１２２ａに隠れ状態特徴量系列の期待値Ｅ〔ｈ_１：Ｔ‘〕が入力され、ＣＴＣデコーダネットワーク１２２ｃに隠れ状態特徴量系列のサンプル値｛ｈ_１：Ｔ’ ^（ｎ）｝_ｎが入力されてもよい。その場合、ＣＴＣデコーダネットワーク１２２ｃから事後確率のサンプル値ｐ_ｃｔｃ（ｃ_１：Ｌ｜｛ｈ_１：Ｔ’ ^（ｎ）｝_ｎ）がサンプルおよび発話内容の候補ごとに出力される。発話内容推定部１２２ｄは、事後確率のサンプル値ｐ_ｃｔｃ（ｃ_１：Ｌ｜｛ｈ_１：Ｔ’ ^（ｎ）｝_ｎ）のサンプル間の平均値を事後確率の期待値Ｅ〔ｐ_ｃｔｃ（ｃ_１：Ｌ｜ｈ_１：Ｔ’）〕として算出することができる。発話内容推定部１２２ｄは、事後確率の期待値Ｅ〔ｐ_ｃｔｃ（ｃ_１：Ｌ｜ｈ_１：Ｔ’）〕を式（８）の事後確率ｐ_ｃｔｃ（ｃ_１：Ｌ｜Ｅ〔ｈ_１：Ｔ‘〕）に代入し、音声認識スコアＪ_ａｓｒを算出することができる。よって、発話内容推定部１２２ｄは、音声認識スコアＪ_ａｓｒが最大となる発話内容の候補を発話内容として推定することができる。 The above example assumes that the expected value E[h_{1:T'}] of the hidden state feature sequence is input to both the attention decoder network 122a and the CTC decoder network 122c, but the configuration is not limited to this. As illustrated in Fig. 5, the expected value E[h_{1:T'}] may be input to the attention decoder network 122a while the sample values {h_{1:T'}^(n)}_n of the hidden state feature sequence are input to the CTC decoder network 122c. In that case, the CTC decoder network 122c outputs a sample value p_ctc(c_{1:L}|{h_{1:T'}^(n)}_n) of the posterior probability for each sample and each utterance content candidate. The utterance content estimation unit 122d can calculate the average of the posterior probability sample values p_ctc(c_{1:L}|{h_{1:T'}^(n)}_n) across samples as the expected value E[p_ctc(c_{1:L}|h_{1:T'})] of the posterior probability. The utterance content estimation unit 122d can then calculate the speech recognition score J_asr by substituting the expected value E[p_ctc(c_{1:L}|h_{1:T'})] for the posterior probability p_ctc(c_{1:L}|E[h_{1:T'}]) in Equation (8). The utterance content estimation unit 122d can thus estimate, as the utterance content, the candidate that maximizes the speech recognition score J_asr.
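The variant described above, in which the CTC posterior is averaged over samples before entering the score, might look like the sketch below. The probabilities are made-up values and the weight names `w_att` and `w_ctc` are assumptions, as the text names the weights only w_1 and w_2.

```python
import math

def expected_posterior(posterior_samples):
    """Average of per-sample posterior probabilities for one candidate."""
    return sum(posterior_samples) / len(posterior_samples)

def asr_score_with_sampled_ctc(p_att, p_ctc_samples, w_att=0.5, w_ctc=0.5):
    """Score using p_att from the averaged hidden states and the averaged CTC posterior."""
    p_ctc_mean = expected_posterior(p_ctc_samples)
    return w_att * math.log(p_att) + w_ctc * math.log(p_ctc_mean)

# Hypothetical CTC posteriors of one candidate for N = 4 hidden state samples.
p_ctc_samples = [0.2, 0.4, 0.3, 0.3]
score = asr_score_with_sampled_ctc(0.6, p_ctc_samples)
print(score)
```

Averaging the posterior itself is generally not the same as evaluating the posterior at the averaged hidden states, because the decoder is a nonlinear function of its input; that difference is what separates this configuration from the one in which both decoders receive E[h_{1:T'}].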

図５の例は、発話別エビデンスモデルによるＣＴＣデコーダネットワークによる事後確率ｐ_ｃｔｃの推定は、式（９）に例示される関係に基づく。式（９）は、発話区間における隠れ状態特徴量系列ｈ_１：Ｔ’の発話内容の候補をなすラベル列ｃ_１：ＬのＣＴＣデコーダネットワークによる事後確率ｐ_ｃｔｃ（ｃ_１：Ｌ｜ｈ_１：Ｔ’）の期待値が、目標特徴量の系列ｚ_１：Ｔに対する隠れ状態特徴量系列ｈ_１：Ｔ’（ｚ_１：Ｔ）を条件とするラベル列ｃ_１：ＬのＣＴＣデコーダネットワークによる事後確率ｐ_ｃｔｃ（ｃ_１：Ｌ｜ｈ_１：Ｔ’（ｚ_１：Ｔ））とモデル確率密度分布ε（ｚ_１：Ｔ；Θ）との畳み込み積分値に相当することを示す。 In the example of Fig. 5, the estimation of the posterior probability p_ctc by the CTC decoder network under the utterance-wise evidence model is based on the relationship exemplified in Equation (9). Equation (9) indicates that the expected value of the posterior probability p_ctc(c_{1:L}|h_{1:T'}), computed by the CTC decoder network for the label sequence c_{1:L} that forms an utterance content candidate given the hidden state feature sequence h_{1:T'} in the utterance section, corresponds to the convolution integral of the posterior probability p_ctc(c_{1:L}|h_{1:T'}(z_{1:T})), computed by the CTC decoder network for the label sequence c_{1:L} conditioned on the hidden state feature sequence h_{1:T'}(z_{1:T}) for the target feature sequence z_{1:T}, and the model probability distribution ε(z_{1:T}; Θ).

次に、本実施形態に係る第３モデルの例について説明する。
第３モデルの第１例では、第３モデルに係るモデル確率分布が、音声特徴量系列と強調特徴量系列のセットと目標特徴量系列との対応関係を示す潜在変数（latent variable）を用いて表される。この潜在変数を用いて、モデル確率分布がより低い次元で表現される
より具体的には、目標特徴量系列を強調特徴量系列と音響特徴量系列との加重和とする仮定のもとで、モデル確率分布が強調特徴量系列と音響特徴量系列との比率を潜在変数とする確率分布として表現される。その場合、サンプリング処理部１１８は、サンプリングにおいてモデル確率分布から比率のサンプル値を抽出する。サンプリング処理部１１８は、抽出した比率のサンプル値に従って強調特徴量系列と音響特徴量系列の加重和を目標特徴量系列のサンプル値として算出することができる。 Next, an example of a third model according to this embodiment will be described.
In a first example of the third model, the model probability distribution according to the third model is expressed using a latent variable indicating the correspondence between the set of the speech feature sequence and the enhancement feature sequence, on the one hand, and the target feature sequence, on the other. Using this latent variable, the model probability distribution is expressed in a lower dimension. More specifically, under the assumption that the target feature sequence is a weighted sum of the enhancement feature sequence and the acoustic feature sequence, the model probability distribution is expressed as a probability distribution in which the ratio between the enhancement feature sequence and the acoustic feature sequence is the latent variable. In this case, the sampling processing unit 118 extracts a sample value of the ratio from the model probability distribution during sampling. The sampling processing unit 118 can then calculate the weighted sum of the enhancement feature sequence and the acoustic feature sequence, weighted according to the extracted sample value of the ratio, as a sample value of the target feature sequence.

図４は、本例における強調特徴量系列、音響特徴量系列、目標特徴量系列および比率αとの対応関係を表すグラフである。図４において、ｓ_ｔ，ｄ、ｕ_ｔ，ｄ、ｚ_ｔ，ｄは、それぞれフレームｔ、次元ｄに係る強調特徴量、音響特徴量および目標特徴量の要素を示す。Ｔ、Ｄは、それぞれ発話区間におけるフレーム数、個々の特徴量の次元数を示す。αは、比率を示す。個々の矢印は、その起点に示される情報と終点に示される情報とのその順序での関連性を示す。即ち、図４は、強調特徴量系列の全体、音響特徴量系列の全体および比率αから目標特徴量系列が与えられることを示す。
モデル確率分布ε（ｚ_１：Ｔ｜α）は、式（１０）に例示されるように、比率αの確率分布ｐ（α）と比率αを条件とする目標特徴量系列ｚ_１：Ｔの条件付き確率分布ε_Ｕ（ｚ_１：Ｔ｜α）として表される。 Fig. 4 is a graph showing the correspondence between the enhancement feature sequence, the acoustic feature sequence, the target feature sequence, and the ratio α in this example. In Fig. 4, s_t,d, u_t,d, and z_t,d represent the elements of the enhancement feature, the acoustic feature, and the target feature, respectively, for frame t and dimension d. T and D represent the number of frames in the speech section and the number of dimensions of each feature, respectively. α represents the ratio. Each arrow indicates the relationship between the information indicated at its start point and the information indicated at its end point, in that order. That is, Fig. 4 shows that the target feature sequence is determined from the entire enhancement feature sequence, the entire acoustic feature sequence, and the ratio α.
The model probability distribution ε(z_1:T | α) is expressed in terms of the probability distribution p(α) of the ratio α and the conditional probability distribution ε_U(z_1:T | α) of the target feature sequence z_1:T conditioned on the ratio α, as exemplified in equation (10).

強調特徴量系列と音響特徴量系列のそれぞれの比率は、非負の実数値であり、それぞれの和は、１に正規化されてもよい。その場合、上記の加重和は加重平均に相当し、潜在変数をなす比率は、１個の変数で表現することができる。１個の変数αが音響特徴量系列に対する比率を示す場合、強調特徴量系列に対する比率は、１－αと定まる。比率αの値域をなす最小値、最大値は、それぞれ０、１とし、確率分布ｐ（α）を、比率αに対する確率密度を１とし、それ以外の比率αに対する確率密度を０とする一様分布（uniform distribution）と仮定されてもよい。その場合、条件付き確率分布ε_Ｕ（ｚ_１：Ｔ｜α）は、目標特徴量系列から強調特徴量系列と音響特徴量系列との加重和の差分値に対するディラックのデルタ関数（Dirac’s Delta、本願では「デルタ関数」と呼ぶことがある）として表される。これらの仮定のもとでは、式（１０）は式（１１）のように変形することができる。式（１１）において、ε_Ｕは、値域［０，１］に対する一様分布を示す。ε_δは、デルタ関数を示し、フレームおよび次元ごとのスカラー値に対して定義されている。式（１１）に示すフレームｔおよび次元ｄを跨ぐ乗算は、発話区間内の目標特徴量全体に対するデルタ関数を与えるためになされる。 The respective ratios of the enhancement feature sequence and the acoustic feature sequence are non-negative real values, and may be normalized so that they sum to 1. In this case, the weighted sum corresponds to a weighted average, and the ratio forming the latent variable can be expressed by a single variable. When this single variable α indicates the ratio for the acoustic feature sequence, the ratio for the enhancement feature sequence is determined to be 1-α. The minimum and maximum values of the range of the ratio α may be set to 0 and 1, respectively, and the probability distribution p(α) may be assumed to be a uniform distribution whose probability density is 1 for ratios α within this range and 0 for all other ratios α. In this case, the conditional probability distribution ε_U(z_1:T | α) is expressed as a Dirac delta function (sometimes referred to as a "delta function" in this application) of the difference between the target feature sequence and the weighted sum of the enhancement feature sequence and the acoustic feature sequence. Under these assumptions, equation (10) can be transformed into equation (11). In equation (11), ε_U denotes a uniform distribution over the range [0, 1]. ε_δ denotes the delta function, which is defined for a scalar value for each frame and dimension. The multiplication across frames t and dimensions d shown in equation (11) is performed to obtain a delta function for the entire set of target features within the speech section.
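
As a minimal sketch of how sampling under this first example might look, the following Python function draws one ratio α from the uniform distribution on [0, 1] and forms the target feature sequence as the weighted average described above. The function name, the list-of-lists data layout, and the use of Python's `random` module are assumptions made for illustration and do not appear in the original description.

```python
import random


def sample_target_sequence(enhanced, acoustic, rng):
    """Draw one target feature sequence z_{1:T} under the first example.

    enhanced, acoustic: lists of T frames, each a list of D floats
    (s_{t,d} and u_{t,d} in the text). A single ratio alpha is drawn per
    utterance and shared by every frame t and dimension d.
    """
    alpha = rng.random()  # alpha ~ Uniform[0, 1)
    # alpha weights the acoustic feature, 1 - alpha the enhancement feature.
    target = [
        [alpha * u + (1.0 - alpha) * s for s, u in zip(s_frame, u_frame)]
        for s_frame, u_frame in zip(enhanced, acoustic)
    ]
    return alpha, target
```

Because the delta function fixes every element of the target feature sequence once α is given, the only random quantity in this sketch is the scalar α, which is what allows the model probability distribution to be expressed in a lower dimension.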

次に、本実施形態に係る第３モデルの第２例について説明する。第２例に係るモデル確率分布は、目標特徴量系列が強調特徴量系列と等しくなる可能性を示す第１確率分布と、目標特徴量系列が強調特徴量系列から分散する確率分布である第２確率分布とを有する。第１確率分布は、音声強調部１１４により得られた強調特徴量が音声認識処理にそのままされる度合いを示す。第２確率分布は、強調特徴量が音声強調部１１４により得られた強調特徴量から逸脱する度合いを示す。これにより、音声強調部１１４により得られた強調特徴量を真の強調特徴量として信頼できる度合いが考慮される。 Next, a second example of the third model according to this embodiment will be described. The model probability distribution according to the second example has a first probability distribution indicating the likelihood that the target feature sequence is equal to the enhancement feature sequence, and a second probability distribution, which is a probability distribution describing how the target feature sequence disperses from the enhancement feature sequence. The first probability distribution indicates the degree to which the enhancement features obtained by the speech enhancement unit 114 are used as is in the speech recognition process. The second probability distribution indicates the degree to which the enhancement features deviate from the enhancement features obtained by the speech enhancement unit 114. This allows the degree to which the enhancement features obtained by the speech enhancement unit 114 can be trusted as the true enhancement features to be taken into account.

モデル確率分布は、第１確率分布と第２確率分布との加重平均で表されてもよい。式（１２）に例示されるモデル確率分布ε_{ｐ－，Ｕ－}（ｚ_１：Ｔ;Θ）では、第１確率分布は、フレームｔごとの目標特徴量の事後分布ｐ_ｓｅ（ｚ_ｔ）のフレーム間の積となり、目標特徴量の事後確率ｐ_ｓｅ（ｚ_ｔ）が強調特徴量の事後確率ｐ_ｓｅ（ｓ_ｔ）に相当するとの仮定に基づく。事後確率ｐ_ｓｅ（ｓ_ｔ）として、第１モデルの学習により得られる条件付き確率ｐ（ｓ_ｔ｜ｙ＾_ｔ，ｘ^～ _ｔ）を利用することができる。 The model probability distribution may be expressed as a weighted average of the first probability distribution and the second probability distribution. In the model probability distribution ε_{p−,U−}(z_1:T; Θ) exemplified in equation (12), the first probability distribution is the product over frames of the posterior distribution p_se(z_t) of the target feature for each frame t, and is based on the assumption that the posterior probability p_se(z_t) of the target feature corresponds to the posterior probability p_se(s_t) of the enhancement feature. As the posterior probability p_se(s_t), the conditional probability p(s_t | ŷ_t, x̃_t) obtained by training the first model can be used.

式（１２）において、πは、第１確率分布に対する重み係数（mixture weight）である。１－πは、第２確率分布に対する重み係数である。この例では、第１確率分布と第２確率分布のそれぞれに対する重み係数の和が１となるように正規化されている。重み係数πは、０より大きく１より小さい実数値である。重み係数πは、予め定められていてもよい。第２確率分布ε_Ｕ（ｚ_１：Ｔ）として、一様分布が適用されてもよい。一様分布を与える個々の目標特徴量の値域は、その要素値ごとに予め定められてもよい。 In equation (12), π is the weighting coefficient (mixture weight) for the first probability distribution, and 1-π is the weighting coefficient for the second probability distribution. In this example, the weighting coefficients for the first and second probability distributions are normalized so that their sum is 1. The weighting coefficient π is a real value greater than 0 and less than 1, and may be determined in advance. A uniform distribution may be applied as the second probability distribution ε_U(z_1:T). The value range of each target feature over which the uniform distribution is defined may be determined in advance for each element value.
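
The mixture described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the first probability distribution is simplified to an independent Gaussian per element (per-element means and standard deviations instead of full precision matrices), a single value range [lo, hi] is shared by all elements of the uniform distribution, and all function and parameter names are hypothetical.

```python
import random


def sample_mixture_sequence(means, stds, lo, hi, pi, rng):
    """Draw one z_{1:T} from pi * prod_t N(z_t) + (1 - pi) * Uniform.

    means, stds: lists of T frames, each a list of D floats (assumed
    diagonal Gaussian posterior of the enhancement feature per frame).
    The mixture component is chosen once per utterance, because the
    mixture is defined over the whole sequence rather than per frame.
    """
    use_first = rng.random() < pi
    if use_first:
        # First probability distribution: product of per-frame Gaussians.
        seq = [
            [rng.gauss(m, s) for m, s in zip(m_frame, s_frame)]
            for m_frame, s_frame in zip(means, stds)
        ]
    else:
        # Second probability distribution: uniform over a preset range.
        seq = [[rng.uniform(lo, hi) for _ in m_frame] for m_frame in means]
    return use_first, seq
```

If every standard deviation is set to zero, the first component returns the means exactly, which corresponds to treating the enhancement features as fully trusted.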

なお、モデル学習部１３０は、第３モデルを構成する各種のパラメータの一部または全部を、モデル学習を行って定めてもよい。その際、音声認識装置１０への入力データとして、ある発話区間における混合信号を入力し、音声強調部１１４から得られる出力データとしてから導出される強調特徴量の確率分布を目標特徴量の確率分布として導出することができる。混合信号として、既知の音声信号と非音声信号とを混合して制作しておく。目標特徴量の確率分布は、第１モデルの学習の過程において取得することができる。混合前の音声信号に対する音響特徴量が、理想的な強調特徴量となり、音声強調により現実に得られる強調特徴量は誤差を伴うため、統計的な分布を有する。
そして、モデル学習部１３０は、第３モデルの各例に係る確率分布と導出した確率分布との差分の大きさが少なくなるように再帰的に各種のパラメータを定めることができる。差分の大きさを示す損失関数として、例えば、ワッサースタイン計量（Wasserstein metric）、を用いることができる。 The model training unit 130 may determine some or all of the various parameters constituting the third model by performing model training. In this case, a mixed signal for a certain speech section is input as input data to the speech recognition device 10, and the probability distribution of the emphasis features derived from the output data obtained from the speech enhancement unit 114 can be derived as the probability distribution of the target features. The mixed signal is created by mixing a known speech signal and a non-speech signal. The probability distribution of the target features can be obtained in the process of training the first model. The acoustic features for the speech signal before mixing are ideal emphasis features, and the emphasis features actually obtained by speech enhancement involve errors and have a statistical distribution.
The model training unit 130 can then recursively determine the various parameters so as to reduce the magnitude of the difference between the probability distribution for each example of the third model and the derived probability distribution. As a loss function that indicates the magnitude of the difference, for example, the Wasserstein metric can be used.
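
The description does not state which order of the Wasserstein metric is used or how it is computed for multidimensional feature sequences. Purely as an illustration of the quantity involved, the sketch below computes the first-order Wasserstein distance between two one-dimensional empirical samples of equal size, for which the distance reduces to the mean absolute difference between the sorted samples. The function name is hypothetical.

```python
def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # Sorting pairs up the empirical quantiles of the two samples.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

A value of zero means that the two samples contain the same values regardless of order, and the value grows as probability mass has to be moved farther to turn one sample into the other.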

次に、本実施形態に係る第３モデルの第３例について説明する。式（１３）に例示されるモデル確率分布も、第１確率分布と第２確率分布との加重平均で表される。第１確率分布は、第１確率分布が目標特徴量系列から強調特徴量系列の差分に対するデルタ関数である。即ち、第１確率分布が、目標特徴量系列が強調特徴量系列となる度合いを示す。ε_δは、式（１１）の例と同様にフレームおよび次元ごとのスカラー値に対して定義されている。式（１３）に示すフレームｔおよび次元ｄを跨ぐ乗算は、デルタ関数が発話区間内の各フレームの目標特徴量をなす個々の要素に対して定義されていることによる。なお、式（１３）に示す第２確率分布は、式（１２）に示すものと同様である。 Next, a third example of the third model according to this embodiment will be described. The model probability distribution illustrated in equation (13) is also expressed as a weighted average of the first probability distribution and the second probability distribution. The first probability distribution is a delta function of the difference between the target feature sequence and the enhancement feature sequence. In other words, the first probability distribution indicates the degree to which the target feature sequence coincides with the enhancement feature sequence. ε_δ is defined for a scalar value for each frame and dimension, as in the example of equation (11). The multiplication across frames t and dimensions d shown in equation (13) arises because the delta function is defined for each individual element of the target feature of each frame in the speech section. Note that the second probability distribution shown in equation (13) is the same as that shown in equation (12).

次に、本実施形態に係る音声認識処理の第１例について説明する。図６は、本実施形態に係る音声認識処理の第１例を示すフローチャートである。図６の例では、発話処理部１２２が図３に例示される構成を有する場合を前提とする。
（ステップＳ１０２）特徴分析部１１２は、入力音声信号の音響特性を示す音響特徴量をフレームごとに分析する。
（ステップＳ１０４）音声強調部１１４は、第１モデルを用いてフレームごとに取得される音響特徴量に対して音声強調処理を行い、音声成分が強調された強調特徴量を定める。
（ステップＳ１０６）発話区間処理部１１６は、フレームごとの音響特性に基づいて公知の音声検出法を用い、複数のフレームからなる発話区間を判定する。 Next, a first example of the speech recognition process according to this embodiment will be described. Fig. 6 is a flowchart showing the first example of the speech recognition process according to this embodiment. The example of Fig. 6 is based on the premise that the speech processing unit 122 has the configuration illustrated in Fig. 3.
(Step S102) The feature analysis unit 112 analyzes the acoustic feature quantity indicating the acoustic characteristics of the input speech signal for each frame.
(Step S104) The speech enhancement unit 114 performs speech enhancement processing, using the first model, on the acoustic features acquired for each frame, and determines enhancement features in which the speech components are enhanced.
(Step S106) The speech section processing unit 116 uses a known voice detection method based on the acoustic characteristics of each frame to determine a speech section consisting of a plurality of frames.

サンプリング処理部１１８は、判定された発話区間内の強調特徴量からなる強調特徴量系列を構成し、音声特徴量からなる音声特徴量系列を構成する。サンプリング処理部１１８は、第３モデルに従って、構成した強調特徴量系列と音声特徴量系列に対応する目標特徴量系列の確率分布を定める。サンプリング処理部１１８は、ステップＳ１０８とステップＳ１１０の処理がＮ回繰り返し、Ｎ回の繰り返しが終了した後、ステップＳ１１２の処理に進む。Ｎは、２以上の予め定めた整数値である。本願では、Ｎを「サンプル数」と呼ぶことがある。 The sampling processing unit 118 constructs an enhancement feature sequence consisting of the enhancement features within the determined speech section, and constructs a speech feature sequence consisting of the speech features. The sampling processing unit 118 determines, according to the third model, a probability distribution of the target feature sequence corresponding to the constructed enhancement feature sequence and speech feature sequence. The sampling processing unit 118 repeats the processes of steps S108 and S110 N times, and after the N repetitions are completed, proceeds to the process of step S112. N is a predetermined integer greater than or equal to 2. In this application, N may be referred to as the "number of samples."

（ステップＳ１０８）サンプリング処理部１１８は、定めた確率分布を用いて目標特徴量系列のサンプル値をサンプリングする。
（ステップＳ１１０）サンプリング処理部１１８は、隠れ状態処理部１２０に対し、第２モデルを用いてサンプリングされた目標特徴量系列のサンプル値に対する隠れ状態特徴量系列のサンプル値を算出させる。 (Step S108) The sampling processing unit 118 samples the sample values of the target feature sequence using the determined probability distribution.
(Step S110) The sampling processing unit 118 causes the hidden state processing unit 120 to calculate, using the second model, a sample value of the hidden state feature sequence for the sampled sample value of the target feature sequence.

（ステップＳ１１２）サンプリング処理部１１８は、Ｎ個の隠れ状態特徴量系列のサンプル値の平均値を隠れ状態特徴量系列の期待値として算出する。
（ステップＳ１１４）発話処理部１２２は、第４モデルを用いて、発話内容の候補ごとに、隠れ状態特徴量系列の期待値に対する事後確率を算出する。発話処理部１２２は、算出した事後確率が最大となる発話内容の候補を発話区間における発話内容として推定する。その後、図６の処理を終了する。 (Step S112) The sampling processing unit 118 calculates the average of the N sample values of the hidden state feature sequence as the expected value of the hidden state feature sequence.
(Step S114) The utterance processing unit 122 uses the fourth model to calculate the posterior probability of each utterance candidate with respect to the expected value of the hidden state feature sequence. The utterance processing unit 122 estimates the utterance candidate with the highest calculated posterior probability as the utterance content in the utterance section. Then, the processing in FIG. 6 ends.
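
The flow of steps S108 to S114 can be sketched as follows. The three callables stand in for the third model (`sample_target`), the second model (`encode`), and the fourth model (`posterior`). Their names and signatures, and the representation of a feature sequence as a list of frames, are assumptions made for illustration only.

```python
def recognize_first_example(sample_target, encode, posterior, candidates, n_samples):
    """Average N hidden-state samples, then decode once (steps S108-S114).

    sample_target(): returns one target feature sequence sample (S108).
    encode(z): returns a hidden state sequence, a list of T' frames of floats (S110).
    posterior(c, h): returns the posterior probability of candidate c given h (S114).
    """
    hidden_samples = [encode(sample_target()) for _ in range(n_samples)]
    # S112: element-wise mean over the N hidden state sequences.
    expected_hidden = [
        [sum(values) / n_samples for values in zip(*frames)]
        for frames in zip(*hidden_samples)
    ]
    # S114: choose the candidate with the highest posterior given E[h].
    return max(candidates, key=lambda c: posterior(c, expected_hidden))
```

In this arrangement the decoder is evaluated only once per candidate, on the averaged hidden state feature sequence, so the cost of increasing N falls on the sampling and encoding steps.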

次に、本実施形態に係る音声認識処理の第２例について、第１例との差異点を主として説明する。第１例との共通点については、特に言及しない限り、その説明を援用する。図７は、本実施形態に係る音声認識処理の第２例を示すフローチャートである。図７の例では、発話処理部１２２が図５に例示される構成を有する場合を前提とする。
図７の処理は、ステップＳ１０２～Ｓ１１２の処理と、ステップＳ１２２～Ｓ１２８の処理を有する。 Next, a second example of the speech recognition processing according to this embodiment will be described, focusing on the differences from the first example. Regarding the commonalities with the first example, the explanation of the first example will be used unless otherwise specified. Fig. 7 is a flowchart showing the second example of the speech recognition processing according to this embodiment. The example of Fig. 7 is based on the premise that the speech processing unit 122 has the configuration illustrated in Fig. 5.
The process in FIG. 7 includes steps S102 to S112 and steps S122 to S128.

ステップＳ１０２～Ｓ１０６の処理が終了した後、サンプリング処理部１１８は、上記のように第３モデルに従って確率分布を定める。サンプリング処理部１１８は、ステップＳ１０８、Ｓ１１０およびＳ１２２の処理をＮ回繰り返し、Ｎ回の繰り返しが終了した後、ステップＳ１１２の処理に進む。
（ステップＳ１２２）サンプリング処理部１１８は、隠れ状態特徴系列のサンプル値を発話処理部１２２のＣＴＣデコーダネットワーク１２２ｃに出力し、発話内容の候補ごとに事後確率のサンプル値をＣＴＣ事後確率サンプル値として算出させる。 After steps S102 to S106 are completed, the sampling unit 118 determines the probability distribution according to the third model as described above. The sampling unit 118 repeats steps S108, S110, and S122 N times, and after the N repetitions are completed, the processing proceeds to step S112.
(Step S122) The sampling processing unit 118 outputs the sample value of the hidden state feature sequence to the CTC decoder network 122c of the utterance processing unit 122, and causes it to calculate, for each utterance content candidate, a sample value of the posterior probability as a CTC posterior probability sample value.

ステップＳ１１２の処理が終了した後、ステップＳ１２４の処理に進む。
（ステップＳ１２４）発話処理部１２２のアテンションデコーダネットワーク１２２ａは、発話内容の候補ごとに隠れ状態特徴量系列の期待値に対する事後確率の期待値をアテンション事後確率期待値として算出する。
（ステップＳ１２６）発話処理部１２２の発話内容推定部１２２ｄは、発話内容の候補ごとにＣＴＣ事後確率サンプル値の平均値をＣＴＣ事後確率期待値として算出する。
（ステップＳ１２８）発話処理部１２２は、発話内容の候補ごとに、アテンション事後確率期待値とＣＴＣ事後確率期待値との加重平均値を音声認識スコアとして算出する。発話処理部１２２は、算出した音声認識スコアが最大となる発話内容の候補を発話区間における発話内容として推定する。その後、図７の処理を終了する。 After the process of step S112 is completed, the process proceeds to step S124.
(Step S124) The attention decoder network 122a of the speech processing unit 122 calculates the expected value of the posterior probability for the expected value of the hidden state feature sequence for each candidate utterance content as the attention posterior probability expected value.
(Step S126) The utterance content estimation unit 122d of the utterance processing unit 122 calculates the average value of the CTC posterior probability sample values for each candidate utterance content as the CTC posterior probability expected value.
(Step S128) The speech processing unit 122 calculates the weighted average of the expected attention posterior probability and the expected CTC posterior probability for each candidate utterance content as a speech recognition score. The speech processing unit 122 estimates the candidate utterance content with the highest calculated speech recognition score as the utterance content for the utterance section. Then, the processing in FIG. 7 ends.
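
The second example differs from the first in where the averaging takes place for the CTC branch. The sketch below assumes a linear weighted average with a single hypothetical parameter `ctc_weight`. The exact form of the speech recognition score in equation (8) is not reproduced here, and the callables `att_posterior` and `ctc_posterior` are stand-ins for the two decoder networks.

```python
def recognize_second_example(hidden_samples, att_posterior, ctc_posterior,
                             candidates, ctc_weight):
    """Combine an attention score on E[h] with an averaged CTC score."""
    n = len(hidden_samples)
    # S112: expected value of the hidden state feature sequence.
    expected_hidden = [
        [sum(values) / n for values in zip(*frames)]
        for frames in zip(*hidden_samples)
    ]
    scores = {}
    for c in candidates:
        # S122 and S126: average the per-sample CTC posterior probabilities.
        ctc_expected = sum(ctc_posterior(c, h) for h in hidden_samples) / n
        # S124: attention posterior for the expected hidden state sequence.
        att_expected = att_posterior(c, expected_hidden)
        # S128: weighted average used as the speech recognition score.
        scores[c] = (1.0 - ctc_weight) * att_expected + ctc_weight * ctc_expected
    best = max(scores, key=scores.get)
    return best, scores
```

Compared with the first example, the CTC decoder is evaluated N times per candidate instead of once, which is the price paid for averaging after the nonlinear decoder rather than before it.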

上記の説明では、サンプリング処理部１１８が、モデル確率分布からＮ回のサンプリングにより得られるＮ個の目標特徴量系列のサンプル値を取得する場合を主とした。モデル確率分布が第１確率分布と第２確率分布との加重平均で表される場合、サンプリング処理部１１８は、モデル確率分布に代え第１確率分布からπＮ回目標特徴量系列のサンプル値を第１種目標特徴量系列としてサンプリングし、第２確率分布から（１－π）Ｎ回目標特徴量系列のサンプル値を第２種目標特徴量系列としてサンプリングしてもよい。 The above explanation mainly concerns the case where the sampling processing unit 118 acquires N sample values of the target feature sequence by sampling N times from the model probability distribution. When the model probability distribution is expressed as a weighted average of the first probability distribution and the second probability distribution, the sampling processing unit 118 may, instead of sampling from the model probability distribution, sample the target feature sequence πN times from the first probability distribution to obtain sample values of a first-type target feature sequence, and sample the target feature sequence (1-π)N times from the second probability distribution to obtain sample values of a second-type target feature sequence.

サンプリング処理部１１８は、πＮ個の第１種目標特徴量系列のサンプル値と（１－π）Ｎ個の第２種目標特徴量系列のサンプル値のそれぞれに対する計Ｎ個の隠れ状態特徴量系列のサンプル値を隠れ状態処理部１２０に取得させる。サンプリング処理部１１８は、取得したＮ個の隠れ状態特徴量系列のサンプル値の平均値を隠れ状態特徴量系列の期待値として算出することができる。Ｎが十分に大きい場合には、得られる期待値は、モデル確率分布からサンプリングされた目標特徴量系列のサンプル値に基づく隠れ状態特徴量系列のサンプル値の平均値に近似する。算出した隠れ状態特徴量系列の期待値は、上記のように発話処理部１２２における発話内容の推定に用いられる。 The sampling processing unit 118 causes the hidden state processing unit 120 to acquire a total of N sample values of the hidden state feature sequence for each of the πN sample values of the first type target feature sequence and the (1-π)N sample values of the second type target feature sequence. The sampling processing unit 118 can calculate the average value of the acquired N sample values of the hidden state feature sequence as the expected value of the hidden state feature sequence. When N is sufficiently large, the obtained expected value approximates the average value of the sample values of the hidden state feature sequence based on the sample values of the target feature sequence sampled from the model probability distribution. The calculated expected value of the hidden state feature sequence is used to estimate the speech content in the speech processing unit 122, as described above.
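
The split into πN and (1-π)N samples can be sketched as follows. The description does not say how a non-integer πN is handled, so rounding to the nearest integer is an assumption here, as are the function names and the two sampler callables.

```python
def stratified_counts(pi, n_samples):
    """Split N into the sample counts for the first and second distributions."""
    n_first = round(pi * n_samples)  # assumed rounding rule for pi * N
    return n_first, n_samples - n_first


def stratified_samples(sample_first, sample_second, pi, n_samples):
    """Draw pi*N samples from the first distribution and the rest from the second."""
    n_first, n_second = stratified_counts(pi, n_samples)
    first = [sample_first() for _ in range(n_first)]
    second = [sample_second() for _ in range(n_second)]
    return first + second
```

With π = 0.25 and N = 16, for instance, this yields 4 samples from the first probability distribution and 12 from the second, whereas drawing the component at random for each sample would make these counts vary from utterance to utterance.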

また、発話処理部１２２が図５に例示される構成を有する場合、Ｎ個の隠れ状態特徴量系列のサンプル値は、個々の発話内容の候補について、ＣＴＣデコーダネットワーク１２２ｃによりＮ個の事後確率のサンプル値の算出に用いられてもよい。得られたＮ個の事後確率のサンプル値の平均値は、事後確率の期待値として用いられる。 Furthermore, when the speech processing unit 122 has the configuration illustrated in FIG. 5, the sample values of the N hidden state feature sequences may be used by the CTC decoder network 122c to calculate N posterior probability sample values for each candidate utterance content. The average value of the obtained N posterior probability sample values is used as the expected value of the posterior probability.

図６に例示されるステップＳ１０８、Ｓ１１０の処理、または、図７に例示されるステップＳ１０８、Ｓ１１０およびＳ１２２の処理は、上記のようにサンプルごとに繰り返されてもよいし、サンプル間で並列に実行されてもよい。並列の処理は、プロセッサ１５２により提供される複数の演算資源を用いて分担されてもよい。演算資源の単位は、ソフトウェア的に定義されたものでもよいし、ハードウェア的に定義されたものでもよい。例えば、繰り返し処理または並列処理は、ＧＰＵにより実行され、その他の処理はＣＰＵにより実行されてもよい。並列処理がＧＰＵにより実行される場合には、１個以上の予め定めた個数のサンプルの処理に係るデータが、ＧＰＵミニバッチを用いて区分されてもよい。これにより、コンピュータシステムにおける演算資源の能力が発揮され、処理を高速化することができる。 The processing of steps S108 and S110 illustrated in FIG. 6, or the processing of steps S108, S110, and S122 illustrated in FIG. 7, may be repeated for each sample as described above, or may be performed in parallel across samples. Parallel processing may be shared using multiple computational resources provided by processor 152. The computational resource units may be software-defined or hardware-defined. For example, the repeated processing or parallel processing may be performed by a GPU, and other processing may be performed by a CPU. When parallel processing is performed by a GPU, data related to the processing of one or more predetermined number of samples may be divided using GPU mini-batches. This utilizes the capabilities of the computational resources in the computer system and speeds up processing.

（実験例）
次に、本実施形態に係る音声認識装置１０に対して実施した実験例について説明する。本実施形態の有効性を評価するため、本実施形態による音声認識率と他の手法による音声認識率とを比較した。 (Experimental Example)
Next, an example of an experiment performed on the speech recognition device 10 according to this embodiment will be described. In order to evaluate the effectiveness of this embodiment, the speech recognition rate according to this embodiment was compared with the speech recognition rate according to other methods.

実験では、日本語話し言葉コーパス（ＣＳＪ：Corpus of Spontaneous Japanese）から抽出した音声データとＰＳＥ（ProSoundEffects）効果音コーパスから抽出した非音声データを用いた。訓練データとして、約２３０時間の学術講演発表の音声データを用いた。テストセットとして、ＣＳＪの３個の公式評価セットｅｖａｌ１、ｅｖａｌ２、ｅｖａｌ３を用いた。訓練データの全長は、５時間となる。個々の音声データを、発話ごとに区分した。テストセットは、一連の音声認識処理に用いた。音声認識処理は、音声強調あり、なしのいずれについても実行した。ＰＳＥ効果音コーパスには、環境音、動物の鳴き声、楽音、などの非音声データが含まれる。音声強調処理に係る第１モデルに対する訓練データ、第２モデルならびに第４モデルに対する訓練データ、および、テストセットには、いずれも雑音を適用した。その他、テストセットと訓練データとは別個の評価セットを準備した。なお、音声データのサンプリング周波数をいずれも１６ｋＨｚとした。 The experiments used speech data extracted from the Corpus of Spontaneous Japanese (CSJ) and non-speech data extracted from the ProSoundEffects (PSE) sound effects corpus. Approximately 230 hours of speech data from academic presentations were used as training data. Three official CSJ evaluation sets, eval1, eval2, and eval3, were used as test sets. The total length of the training data was 5 hours. Each piece of speech data was divided into utterances. The test sets were used for a series of speech recognition processes. The speech recognition processes were performed both with and without speech enhancement. The PSE sound effects corpus includes non-speech data such as environmental sounds, animal calls, and musical sounds. Noise was applied to the training data for the first model used for speech enhancement processing, to the training data for the second and fourth models, and to the test sets. In addition, an evaluation set separate from the test sets and the training data was prepared. The sampling frequency of all the speech data was 16 kHz.

音声認識処理に係る第２モデルおよび第４モデルに対する訓練データとして、クリーン音声と残響音声を用いた。残響音声は、実験室内で測定したインパルス応答を用いてクリーン音声に対して畳み込み演算を行って生成した。第２モデルおよび第４モデルは、残響音声を訓練データとして用いることで、残響および雑音に対して頑健に学習され、評価に用いた。 Clean and reverberant speech were used as training data for the second and fourth models for speech recognition processing. The reverberant speech was generated by performing a convolution operation on the clean speech using an impulse response measured in a laboratory. By using reverberant speech as training data, the second and fourth models were trained to be robust against reverberation and noise, and were used for evaluation.

音声強調に係る第１モデルに対する訓練データとして、クリーン音声と非音声の混合データを用いた。混合データを、ＣＳＪからの個々の音声データには、ＰＳＥ効果音コーパスからランダムに選択した非音声データを加算して生成した。音声データと非音声データとの混合に係る信号雑音比（ＳＮＲ：Signal-to-Noise）を、－５、０、５、１０、１５の５通りのいずれかとなるように個々の音声データごとにランダムに選択した。テストセットも、第１モデルに対する訓練データと同様にクリーン音声と非音声を混合して生成した。但し、第１モデルに対する訓練データの生成に用いられなかった非音声データを音声データと混合した。また、ＳＮＲを、－５、０、５、１０の４通りのいずれかとなるように個々の音声データごとにランダムに選択した。 A mixture of clean speech and non-speech data was used as training data for the first model for speech enhancement. The mixture data was generated by adding randomly selected non-speech data from the PSE sound effects corpus to each piece of speech data from CSJ. The signal-to-noise ratio (SNR) for the mixture of speech and non-speech data was randomly selected for each piece of speech data to take one of five values: -5, 0, 5, 10, or 15. The test set was also generated by mixing clean speech and non-speech data, similar to the training data for the first model. However, non-speech data that was not used to generate the training data for the first model was mixed with speech data. The SNR was also randomly selected for each piece of speech data to take one of four values: -5, 0, 5, or 10.
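
Mixing at a prescribed SNR is commonly done by scaling the non-speech signal so that the ratio of average powers matches the target value in decibels. The sketch below follows that common approach. It is an assumption about how the mixing could be carried out, not a description of the tooling used in the experiment, and the function name is hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise must have non-zero power")
    # Required noise power is p_speech / 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Selecting the SNR per utterance, as in the experiment, could then be expressed as `random.choice([-5, 0, 5, 10, 15])` for the training data and `random.choice([-5, 0, 5, 10])` for the test sets.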

音声認識に係る第２モデルおよび第４モデルとして、ＥＳＰｎｅｔ（End-to-End Speech Processing Toolkit）を用いた。ＥＳＰｎｅｔは、オープンソースのトランスフォーマエンコーダ・デコーダ型の音声認識モデルの一例である。ＥＳＰｎｅｔでは、フレーム長は５１２点、シフト長は１２８である。第２モデルに加える音声特徴量として８０次元のメルフィルタバンクが適用される。 ESPnet (End-to-End Speech Processing Toolkit) was used as the second and fourth models for speech recognition. ESPnet is an example of an open-source transformer encoder-decoder type speech recognition model. In ESPnet, the frame length is 512 points and the shift length is 128 points. An 80-dimensional Mel filter bank is applied as the speech feature input to the second model.

音声強調に係る第１モデルを、ＰｙＴｏｒｃｈライブラリを用いて制作した。ＰｙＴｏｒｃｈは、オープンソースの機械学習ライブラリの一例である。ソフトマスクｍと精度行列Λ_ｙ，ｔ、Λ_ｓ，ｔの推論のために、同一の構成を有するニューラルネットワークを用いた。個々のニューラルネットワークは、８０次元のフィルタバンクネットワーク（filterbank）、検出力活性化関数（power activation function）、絶対活性化関数（absolute activation function）、中心フレームの前後３２フレームの結合（concatenation）、層別正規化（layer-wise normalization）、シグモイド関数を伴う三層全結合ネットワーク（three-layer fully-connected networks with sigmoid function）およびドロップアウト層（dropout layer）を備える。中間層の次元数を２０４８とした。マスク用のニューラルネットワークの最後の層にはシグモイド関数を適用した。但し、その他のニューラルネットワークの最後の層には、何も適用せずに直前の層からの出力値を出力させた。 The first model for speech enhancement was developed using the PyTorch library. PyTorch is an example of an open-source machine learning library. Neural networks with identical configurations were used to infer the soft mask m and the precision matrices Λ _y,t and Λ _s,t . Each neural network included an 80-dimensional filter bank network, a power activation function, an absolute activation function, concatenation of 32 frames before and after the central frame, layer-wise normalization, three-layer fully connected networks with a sigmoid function, and a dropout layer. The dimensionality of the hidden layer was set to 2048. The sigmoid function was applied to the last layer of the neural network for the mask. However, the last layer of the other neural networks did not apply any functions and simply output the output value from the previous layer.

第１モデルの学習では、確率的勾配降下法（stochastic gradient descent）の一種であるアダム最適化を用い、勾配クリッピング（gradient clipping）を適用した。第１モデルをなすパラメータセットは、発話ごとに更新した。学習率（learning rate）を１．０×１０^－４とした。パラメータセットの更新回数を５０エポックとした。この更新回数のもとで評価セットに対する性能が最良となった。 The first model was trained using Adam optimization, a type of stochastic gradient descent, with gradient clipping applied. The parameter set constituting the first model was updated for each utterance. The learning rate was set to 1.0 × 10⁻⁴. The parameter set was updated for 50 epochs. This number of updates yielded the best performance on the evaluation set.

なお、第４モデルに対するパラメータの設定値として、発話内容の推定に係るサンプリングパラメータを除き公知のＥＳＰｎｅｔＣＳＪレシピー（recipe）に記載のものを用いた。具体的には、ビームサイズを２０とし、ＣＴＣデコーダネットワーク、アテンションデコーダネットワーク、言語モデルそれぞれに対する重み係数を０．３、０．７、０．３とした。また、サンプル数Ｎを、１６、３２、６４、１２８の４通りとし、第１確率分布に対する重み係数πを０．２５とした。 The parameter settings for the fourth model were those described in the publicly known ESPnet CSJ recipe, except for the sampling parameters related to utterance content estimation. Specifically, the beam size was set to 20, and the weighting coefficients for the CTC decoder network, attention decoder network, and language model were set to 0.3, 0.7, and 0.3, respectively. The number of samples N was set to four: 16, 32, 64, and 128, and the weighting coefficient π for the first probability distribution was set to 0.25.

実験結果として、音声強調に関する手法およびテストセットとＳＮＲの組ごとに文字誤り率（ＣＥＲ：character error rate）を集計した。ＣＥＲは、正解語数に対する挿入語数と置換語数と削除語数の総和の比に相当する。ＣＥＲが小さいほど音声認識の性能が良好であることを示す。実験結果の集計において、発話ごとの認識結果となる文字列（character sequence）に対して文字列からなる既存のテキストと照合し、挿入語、置換語句、および、削除語の有無を検出した。テストセットｅｖａｌ１、ｅｖａｌ２、ｅｖａｌ３における全語数は１１５，７４５語である。 As experimental results, the character error rate (CER) was compiled for each combination of speech enhancement method, test set, and SNR. The CER is equivalent to the ratio of the total number of inserted words, replaced words, and deleted words to the number of correct words. A smaller CER indicates better speech recognition performance. In compiling the experimental results, the character sequence that was the recognition result for each utterance was compared with existing text consisting of character strings to detect the presence or absence of inserted words, replaced words, and deleted words. The total number of words in the test sets eval1, eval2, and eval3 was 115,745.
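
The error rate described above can be computed from the edit distance between the reference text and the recognition result. The sketch below uses the standard Levenshtein dynamic program over characters and divides by the length of the reference. Treating the denominator as the number of characters in the reference is an interpretation of "the number of correct words", and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions and substitutions from ref to hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of r
                curr[j - 1] + 1,         # insertion of h
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the number of characters in the reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions are counted in the numerator but not in the denominator, this ratio can exceed 1 for very poor recognition results.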

図８は、実験結果を例示する一覧表である。各行において、手法を示す。「処理なし」とは、音声強調処理を行わなかったことを示す。「クリーンモデル」とは、クリーン音声を用いた場合、「クリーンモデル」と表記されていない行は残響音声を用いた場合を示す。「音声強調のみ」とは、音声強調による強調特徴量をそのまま音声認識処理に用いた場合を示す。「フレーム別モデル」とは、図９に示す比較例のようにフレームごとに構成された確率的エビデンスモデルを用いて得られた目標特徴量を音声認識処理に用いた場合を示す。ここでは、式（１４）に示されるモデル確率分布を用いてフレームごとに目標特徴量をサンプリングした。式（１４）のモデル確率分布は、フレームごとの目標特徴量の事後確率ｐ_ｓｅ（ｚ_ｔ）の積で与えられる。このモデル確率分布は、式（１２）の右辺の第１項に示される第１確率分布に相当する。前述のように事後確率ｐ_ｓｅ（ｚ_ｔ）は多次元ガウス関数を用いて表される。 FIG. 8 is a table illustrating experimental results. Each row indicates a method. "No processing" indicates that no speech enhancement processing was performed. "Clean model" indicates the case where clean speech was used, and rows without "clean model" indicate the case where reverberant speech was used. "Speech enhancement only" indicates the case where enhanced features obtained by speech enhancement were used directly in speech recognition processing. "Frame-specific model" indicates the case where target features obtained using a probabilistic evidence model configured for each frame, as in the comparative example shown in FIG. 9, were used in speech recognition processing. Here, target features were sampled for each frame using the model probability distribution shown in Equation (14). The model probability distribution in Equation (14) is given by the product of the posterior probability p _se (z _t ) of the target features for each frame. This model probability distribution corresponds to the first probability distribution shown in the first term on the right-hand side of Equation (12). As mentioned above, the posterior probability p _se (z _t ) is expressed using a multidimensional Gaussian function.

なお、図９の比較例でも、エンコーダからの出力となる隠れ状態特徴量のサンプル値を平均化して期待値を算出してもよいし、デコーダからの出力となる事後確率もしくは音声認識スコアのサンプル値を平均化して期待値を算出してもよい。実験では、隠れ状態特徴量のサンプル値を平均化し、その期待値を算出した。算出した隠れ状態の特徴量がデコーダへの入力となる。 In the comparative example of Figure 9, the expected value may also be calculated by averaging sample values of hidden state features output from the encoder, or by averaging sample values of the posterior probability or speech recognition score output from the decoder. In the experiment, the sample values of hidden state features were averaged and their expected value was calculated. The calculated hidden state features are input to the decoder.

図８に戻り、本実施形態の「一様」とは、上記の第３モデルの第１例、つまり、式（１１）のモデル確率分布を用いた場合に相当する。「デルタ＋一様」とは、上記の第３モデルの第３例、つまり、式（１３）のモデル確率分布を用いた場合に相当する。「ガウス＋一様」とは、上記の第３モデルの第２例、つまり、式（１２）のモデル確率分布を用いた場合に相当する。 Returning to Figure 8, "uniform" in this embodiment corresponds to the first example of the third model above, i.e., when the model probability distribution of equation (11) is used. "Delta + uniform" corresponds to the third example of the third model above, i.e., when the model probability distribution of equation (13) is used. "Gaussian + uniform" corresponds to the second example of the third model above, i.e., when the model probability distribution of equation (12) is used.

For the frame-specific model and the present embodiment, the number of samples N and the type of expected value are shown as experimental conditions. "enc" in the "CTC" column indicates that the expected value obtained on the encoder side is used, that is, the expected value of the hidden state feature sequence is input to the CTC decoder network. "prob" indicates that the expected value obtained on the decoder side is used, that is, the speech recognition score is calculated using the expected value of the CTC posterior probability obtained from the sample values of the posterior probability output by the CTC decoder network. "Atten" indicates the attention decoder network. In the experiment, "enc" was used in every case, that is, the expected value of the hidden state feature sequence was input to the attention decoder network.

The columns eval1, eval2, and eval3 in FIG. 8 indicate the test sets. For each test set, the character error rate (CER) is shown for four SNRs. The "average" column shows the mean CER across the test sets and SNRs.
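For reference, the CER and the averaging behind the "average" column can be sketched as follows. This is a minimal sketch that assumes the common definition of CER as the character-level edit distance divided by the reference length; the function names and the table layout are hypothetical, and no values from FIG. 8 are reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hypothesis hyp against a non-empty reference ref."""
    return edit_distance(ref, hyp) / len(ref)


def average_cer(cer_table):
    """Mean CER over all test sets and SNRs.

    cer_table is a hypothetical layout: {test_set: {snr_db: cer_value}}.
    """
    values = [v for per_snr in cer_table.values() for v in per_snr.values()]
    return sum(values) / len(values)
```

A lower CER indicates better recognition, which is how the rows of FIG. 8 are compared in the following paragraphs.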

According to the experimental results illustrated in FIG. 8, the CER of the present embodiment was significantly lower than the CER of the other methods. The CER increases in the order of the present embodiment, speech enhancement only, the frame-specific model, and no processing. Within the present embodiment, the CER was lowest overall for Gaussian + uniform with N = 128 samples and both CTC and Atten set to enc. However, for test set eval1 at an SNR of 10 dB, the CER was lowest for Gaussian + uniform with N = 128 and CTC and Atten set to prob and enc, respectively. For test set eval3 at an SNR of -5 dB, the CER was lowest for uniform with N = 16 and both CTC and Atten set to enc, tying with the case in which both CTC and Atten were set to enc. For test set eval3 at an SNR of 10 dB, the CER was lowest for uniform with N = 128 and CTC and Atten set to prob and enc, respectively.

Within the present embodiment, no significant difference in CER was observed among the model probability distributions (first to third examples), the numbers of samples N, or the types of expected value. The differences among these conditions are sufficiently smaller than their difference from the CER obtained with speech enhancement only. The absence of a significant difference due to the number of samples indicates that sufficient performance is obtained even with a relatively small number of samples, and that further extending the second model does not cause an extreme increase in the amount of computation. Considering that all of the model probability distributions use a uniform function, the results indicate that a uniform function allows the distribution of the target features to be explained by the speech features and allows the enhanced features to be corrected effectively.

Furthermore, the fact that the CER of the frame-specific model is higher than the CER of speech enhancement only indicates that frame-by-frame processing actually degrades speech recognition. This supports the view that one contributing factor may be the loss of continuity in the target feature sequence, because the target features are sampled independently for each frame.

The speech recognition device 10 according to the present embodiment may be modified as follows. The speech recognition device 10 illustrated in FIG. 1 is separate from the microphone 20, but it may be configured to include the microphone 20.
The first model is not limited to the neural network used in the above experiment and may be configured using another type of learning model, such as a recurrent neural network (RNN).
The second model serving as the encoder and the fourth model serving as the decoder are not limited to transformers and may be configured using other types of learning models, such as RNNs and conformers.

The training of the first to fourth models may be performed by a device separate from the speech recognition device 10, and the parameter set of each model obtained by training may be set in the speech recognition device 10. The parameter sets may be provided in association with a program for realizing the functions of the speech recognition device 10. In that case, the model learning unit 130 may be omitted from the speech recognition device 10.

In the present embodiment, sound source separation processing may be applied instead of the speech enhancement processing by the speech enhancement unit 114. Sound source separation processing extracts, from the input speech signal, a source-specific component for each of a plurality of sound sources. Extracting the speech component of a given speaker as a source-specific component can also be regarded as one form of speech enhancement processing, since it enhances that speech component relative to the other components. The utterance section processing unit 116 can perform utterance section detection for each source-specific component and count, as the number of speakers, the number of sound sources for which utterance sections are detected at the same time. Even when the counted number of speakers is two or more, the sampling processing unit 118, the hidden state processing unit 120, and the utterance processing unit 122 need only operate on the source-specific component of each detected utterance section.
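The speaker counting described above can be sketched as follows. This is a minimal sketch that assumes utterance sections are already available per separated source as (start, end) frame pairs with an exclusive end; the function name and the data layout are hypothetical and are not specified in this document.

```python
def count_active_speakers(segments_per_source, frame):
    """Count the sound sources that have an utterance section covering `frame`.

    segments_per_source: one list of (start, end) frame pairs per separated
    source, with `end` exclusive (an assumed representation).
    """
    return sum(
        any(start <= frame < end for start, end in segments)
        for segments in segments_per_source
    )
```

Each source with a detected utterance section would then be passed through sampling, hidden state processing, and utterance processing independently of the other sources.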

As described above, the speech recognition device 10 according to the present embodiment includes: an utterance section processing unit 116 that determines an utterance section based on acoustic characteristics of an input speech signal; a speech enhancement unit 114 that uses a first model to determine, for each frame, an enhanced feature in which the speech component of an acoustic feature of the input speech signal is enhanced; and a hidden state processing unit 120 that uses a second model to determine a hidden state feature sequence, which is a sequence of hidden state features, based on a target feature sequence, which is a sequence of target features. The speech recognition device 10 also includes: a sampling processing unit 118 that samples sample values of the target feature sequence a plurality of times (for example, N times) using a third model indicating the probability distribution of the target feature sequence corresponding to an enhanced feature sequence, which is a sequence of the enhanced features within the utterance section, and an acoustic feature sequence, which is a sequence of the acoustic features, and that determines an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and an utterance processing unit 122 that uses a fourth model to determine the utterance content of the utterance section based on the expected value of the hidden state feature sequence.
With this configuration, sample values of a plurality of target feature sequences corresponding to the enhanced feature sequence and the acoustic feature sequence within the utterance section are obtained, and the expected value of the target feature sequence is obtained from those sample values. The utterance content is determined based on the expected value of the hidden state feature sequence obtained from the expected value of the target feature sequence. Because the target feature sequence can express the continuity of the acoustic characteristics as a trend of change within the utterance section, discontinuity of the acoustic characteristics between frames caused by random sampling can be avoided. A decrease in the speech recognition rate due to such discontinuity can therefore be avoided. In addition, sampling the target feature sequence within the utterance section suppresses the increase in processing load that would result from higher dimensionality.

The target feature sequence may be a weighted sum of the enhanced feature sequence and the acoustic feature sequence, and the probability distribution of the target feature sequence may be a probability distribution of the ratio between the enhanced feature sequence and the acoustic feature sequence. The sampling processing unit 118 may sample a sample value of the ratio using the third model and calculate a sample value of the target feature sequence by combining the enhanced feature sequence and the acoustic feature sequence based on the sample value of the ratio.
With this configuration, the probability distribution of the target feature sequence can be expressed by the ratio between the enhanced feature sequence and the acoustic feature sequence. The processing load of sampling can therefore be reduced while speech recognition accuracy is maintained.
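The ratio-based sampling can be sketched as follows. This is a minimal sketch that assumes the ratio is drawn from a uniform distribution on [0, 1) and is applied as the weight of the enhanced feature sequence, with the remainder applied to the acoustic feature sequence; the function name, the weighting convention, and the list-of-frames representation are assumptions made for illustration.

```python
import random


def sample_target_sequences(enhanced, acoustic, num_samples, rng):
    """Draw num_samples target feature sequences for one utterance section.

    enhanced, acoustic: lists of frames, each frame a list of floats.
    One ratio is drawn per sample and shared by every frame of the section,
    so only a scalar is sampled instead of a full feature sequence.
    """
    samples = []
    for _ in range(num_samples):
        ratio = rng.random()  # assumption: ratio ~ Uniform[0, 1)
        sequence = [
            [ratio * e + (1.0 - ratio) * a for e, a in zip(e_frame, a_frame)]
            for e_frame, a_frame in zip(enhanced, acoustic)
        ]
        samples.append((ratio, sequence))
    return samples
```

Because a single ratio applies to the whole utterance section, every sampled sequence varies smoothly from frame to frame in the same way as its two source sequences.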

The probability distribution of the target feature sequence may include a first probability distribution (for example, a delta function) indicating the probability that the target feature sequence is equal to the enhanced feature sequence, and a second probability distribution (for example, a uniform function) according to which the target feature sequence is dispersed away from the enhanced feature sequence.
With this configuration, the first probability distribution quantifies the degree to which the enhanced feature sequence obtained by enhancing the speech component is adopted as the target feature sequence, and the second probability distribution quantifies the degree to which the enhanced feature sequence deviates from the target feature sequence. Sampling that takes into account how the reliability of the enhanced feature sequence varies with the usage environment makes it possible to maintain speech recognition accuracy.

The sampling processing unit 118 may determine the first probability distribution based on the posterior probability distribution (for example, a multidimensional Gaussian function) of the enhanced feature for each frame in the utterance section.
With this configuration, speech recognition accuracy can be maintained by sampling the target feature sequence in a way that considers the per-frame error of the enhanced features together with the continuity of the target features within the utterance section.

The sampling processing unit 118 may sample some (for example, πN) of the sample values of the target feature sequence as sample values of a first-type target feature sequence using the first probability distribution, and sample the others (for example, (1-π)N) as sample values of a second-type target feature sequence using the second probability distribution. It may then determine, as the expected value of the hidden state feature sequence, the average of the sample values of the hidden state feature sequence obtained for the first-type target feature sequences and those obtained for the second-type target feature sequences.
With this configuration, either the first probability distribution or the second probability distribution is used for each sample when the target feature sequence is sampled. Because the two distributions do not have to be added together for each sample, the processing load is reduced. In addition, processing the samples in parallel makes effective use of computational resources.
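The per-sample choice between the two distributions can be sketched as follows. This is a minimal sketch that assumes the first probability distribution is a delta function (the target equals the enhanced feature sequence) and the second is a uniform distribution over the mixing ratio; `encoder` stands in for the second model and, like the other names, is hypothetical.

```python
import random


def expected_hidden_sequence(enhanced, acoustic, encoder, num_samples, pi, rng):
    """Average the encoder outputs over num_samples sampled target sequences.

    enhanced, acoustic: lists of frames, each frame a list of floats.
    encoder: callable mapping a target sequence to a hidden state sequence
    (a stand-in for the second model).
    pi: fraction of samples drawn from the first probability distribution.
    """
    n_first = int(round(pi * num_samples))
    hidden_samples = []
    for i in range(num_samples):
        if i < n_first:
            ratio = 1.0            # assumed delta function: target == enhanced
        else:
            ratio = rng.random()   # assumed uniform distribution of the ratio
        target = [
            [ratio * e + (1.0 - ratio) * a for e, a in zip(e_frame, a_frame)]
            for e_frame, a_frame in zip(enhanced, acoustic)
        ]
        hidden_samples.append(encoder(target))
    # Element-wise mean over samples gives the expected hidden state sequence.
    n_frames = len(hidden_samples[0])
    dim = len(hidden_samples[0][0])
    return [
        [sum(h[t][d] for h in hidden_samples) / num_samples for d in range(dim)]
        for t in range(n_frames)
    ]
```

Each loop iteration is independent of the others, so the iterations could be distributed across parallel workers in the manner the paragraph above suggests.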

The fourth model may include an attention decoder (for example, the attention decoder network 122a) and a CTC decoder (for example, the CTC decoder network 122c). The attention decoder may calculate a first posterior probability (for example, the attention posterior probability expected value) for each utterance content candidate with respect to the expected value of the hidden state feature sequence. The CTC decoder may calculate a sample value of a second posterior probability (for example, a CTC posterior probability sample value) for each utterance content candidate with respect to each sample value of the hidden state feature sequence, and the expected value of those sample values may be calculated for each candidate as the second posterior probability (for example, the CTC posterior probability expected value). The utterance content may then be determined based on a score (for example, the speech recognition score) obtained by combining the first posterior probability and the second posterior probability.
With this configuration, samples of the hidden state feature sequence are input to the CTC decoder, and sample values of the second posterior probability are output. Including the CTC decoder processing, which is performed independently of the attention decoder, in the per-sample processing allows computational resources to be used even more effectively.
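The score combination can be sketched as follows. This is a minimal sketch that assumes the two posterior probabilities are combined by a weighted sum of their logarithms, a common choice for joint CTC and attention decoding; this passage does not state the combination rule, so the rule, the weight `ctc_weight`, and the function names are assumptions.

```python
import math


def recognition_scores(attention_post, ctc_post_samples, ctc_weight=0.3):
    """Combine attention and CTC posteriors into one score per candidate.

    attention_post: {candidate: first posterior probability}, computed from
    the expected value of the hidden state feature sequence.
    ctc_post_samples: list of {candidate: second posterior sample value},
    one dictionary per sampled hidden state feature sequence.
    """
    scores = {}
    for candidate, p_att in attention_post.items():
        # The second posterior probability is the mean of its sample values.
        p_ctc = sum(s[candidate] for s in ctc_post_samples) / len(ctc_post_samples)
        # Assumed rule: weighted sum of log posteriors.
        scores[candidate] = (ctc_weight * math.log(p_ctc)
                             + (1.0 - ctc_weight) * math.log(p_att))
    return scores


def best_candidate(scores):
    """Return the utterance content candidate with the highest score."""
    return max(scores, key=scores.get)
```

With `ctc_weight` close to 0 the attention decoder dominates the decision, and with a value close to 1 the averaged CTC posterior dominates.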

One embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to the one described, and various design changes can be made without departing from the gist of the invention. In other words, the above disclosure includes specific examples, and because various modifications are apparent from the specification, drawings, and claims, the invention should not be limited to the scope of the disclosure.

10…speech recognition device, 20…microphone, 110…control unit, 112…feature analysis unit, 114…speech enhancement unit, 116…utterance section processing unit, 118…sampling processing unit, 120…hidden state processing unit, 122…utterance processing unit, 130…model learning unit, 152…processor, 156…drive unit, 158…input unit, 160…output unit, 162…ROM, 164…RAM, 166…auxiliary storage unit, 168…interface unit

Claims

A speech recognition device comprising:
an utterance section processing unit that determines an utterance section based on acoustic characteristics of an input speech signal;
a speech enhancement unit that determines, for each frame, using a first model, an enhanced feature in which a speech component of an acoustic feature of the input speech signal is enhanced;
a hidden state processing unit that determines, using a second model, a hidden state feature sequence that is a sequence of hidden state features based on a target feature sequence that is a sequence of target features;
a sampling processing unit that samples sample values of the target feature sequence a plurality of times using a third model indicating a probability distribution of the target feature sequence corresponding to an enhanced feature sequence that is a sequence of the enhanced features within the utterance section and an acoustic feature sequence that is a sequence of the acoustic features, causes the hidden state processing unit to calculate sample values of the hidden state feature sequence for the sample values of the target feature sequence, and determines an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and
an utterance processing unit that determines, using a fourth model, utterance content of the utterance section based on the expected value of the hidden state feature sequence.

The speech recognition device according to claim 1, wherein
the target feature sequence is a weighted sum of the enhanced feature sequence and the acoustic feature sequence, and the probability distribution is a probability distribution of a ratio between the enhanced feature sequence and the acoustic feature sequence, and
the sampling processing unit samples a sample value of the ratio using the third model and calculates the sample value of the target feature sequence by combining the enhanced feature sequence and the acoustic feature sequence based on the sample value of the ratio.

The speech recognition device according to claim 1, wherein the probability distribution includes a first probability distribution indicating a probability that the target feature sequence is equal to the enhanced feature sequence, and a second probability distribution according to which the target feature sequence is dispersed away from the enhanced feature sequence.

The speech recognition device according to claim 3, wherein the sampling processing unit determines the first probability distribution based on a posterior probability distribution of the enhanced feature for each frame in the utterance section.

The speech recognition device according to claim 3, wherein the sampling processing unit
samples some of the sample values of the target feature sequence as sample values of a first-type target feature sequence using the first probability distribution,
samples the other sample values of the target feature sequence as sample values of a second-type target feature sequence using the second probability distribution, and
determines, as the expected value of the hidden state feature sequence, an average of the sample values of the hidden state feature sequence for the sample values of the first-type target feature sequence and the sample values of the hidden state feature sequence for the sample values of the second-type target feature sequence.

The speech recognition device according to claim 1, wherein
the fourth model includes an attention decoder and a connectionist temporal classification (CTC) decoder,
the attention decoder calculates a first posterior probability for each utterance content candidate with respect to the expected value of the hidden state feature sequence,
the CTC decoder calculates a sample value of a second posterior probability for each utterance content candidate with respect to each sample value of the hidden state feature sequence,
an expected value of the sample values of the second posterior probability is calculated for each utterance content candidate as the second posterior probability, and
the utterance content is determined based on a score obtained by combining the first posterior probability and the second posterior probability.

A program for causing a computer to execute:
an utterance section processing step of determining an utterance section based on acoustic characteristics of an input speech signal;
a speech enhancement step of determining, for each frame, using a first model, an enhanced feature in which a speech component of an acoustic feature of the input speech signal is enhanced;
a hidden state processing step of determining, using a second model, a hidden state feature sequence that is a sequence of hidden state features based on a target feature sequence that is a sequence of target features;
a sampling processing step of sampling sample values of the target feature sequence a plurality of times using a third model indicating a probability distribution of the target feature sequence corresponding to an enhanced feature sequence that is a sequence of the enhanced features within the utterance section and an acoustic feature sequence that is a sequence of the acoustic features, causing the hidden state processing step to calculate sample values of the hidden state feature sequence for the sample values of the target feature sequence, and determining an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and
an utterance processing step of determining, using a fourth model, utterance content of the utterance section based on the expected value of the hidden state feature sequence.

A speech recognition method in a speech recognition device, wherein the speech recognition device executes:
an utterance section processing step of determining an utterance section based on acoustic characteristics of an input speech signal;
a speech enhancement step of determining, for each frame, using a first model, an enhanced feature in which a speech component of an acoustic feature of the input speech signal is enhanced;
a hidden state processing step of determining, using a second model, a hidden state feature sequence that is a sequence of hidden state features based on a target feature sequence that is a sequence of target features;
a sampling processing step of sampling sample values of the target feature sequence a plurality of times using a third model indicating a probability distribution of the target feature sequence corresponding to an enhanced feature sequence that is a sequence of the enhanced features within the utterance section and an acoustic feature sequence that is a sequence of the acoustic features, causing the hidden state processing step to calculate sample values of the hidden state feature sequence for the sample values of the target feature sequence, and determining an expected value of the hidden state feature sequence from the sample values of the hidden state feature sequence; and
an utterance processing step of determining, using a fourth model, utterance content of the utterance section based on the expected value of the hidden state feature sequence.