JP5229738B2

JP5229738B2 - Speech recognition device and speech conversion device

Info

Publication number: JP5229738B2
Application number: JP2009068545A
Authority: JP
Inventors: 誠司中川; 仁史中山; 俊介石光
Original assignee: National Institute of Advanced Industrial Science and Technology AIST; Hiroshima City University
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Hiroshima City University
Priority date: 2009-03-19
Filing date: 2009-03-19
Publication date: 2013-07-03
Anticipated expiration: 2029-03-19
Also published as: JP2010224020A

Description

The present invention relates to a speech recognition device and a speech conversion device.

Speech recognition devices that perform speech recognition on a speech signal picked up by a microphone are known, as are speech conversion devices that emit from a loudspeaker a synthesized speech signal corresponding to the recognition result of such a speech recognition device.

In speech recognition, the approach of modeling each recognition category (a unit such as a syllable or word that makes up a recognition result candidate) with a probabilistic model such as a Hidden Markov Model (hereinafter, HMM) offers high recognition performance and is the mainstream of current speech recognition technology. A conventional speech recognition device using HMMs is briefly described with reference to FIG. 4. A speech signal input from the input terminal 101 is converted into a digital signal by the A/D conversion unit 102. The feature vector extraction unit 103 extracts a speech feature vector from the digital signal. The HMMs created in advance for the speech unit of each recognition category are then read from the model memory 104, and the likelihood calculation unit 105 calculates the matching likelihood of each model against the extracted speech feature vector. The speech unit (recognition category) represented by the model with the highest matching likelihood is output from the output unit 106 as the recognition result.

However, because conventional speech recognition is performed after the utterance has ended, there is a time lag between the timing of the utterance and the timing of recognition.

Moreover, when a synthesized speech signal corresponding to a recognition result obtained by a conventional speech recognition method is emitted from a loudspeaker, the synthesized speech is emitted considerably later than the speaker's utterance. The speaker is affected by the delayed synthesized speech, which makes it difficult to carry on a conversation smoothly.

The present invention has been made to solve these problems, and its object is to provide a speech recognition device and a speech conversion device capable of performing speech recognition at high speed.

The above object of the present invention is achieved by a speech recognition device comprising speech input means for inputting speech, conversion means for converting the speech signal input to the speech input means into a digital speech waveform signal, and analysis means for analyzing the speech input to the speech input means from the digital speech waveform signal converted by the conversion means, wherein the analysis means comprises: a feature vector extraction unit that analyzes the digital speech waveform signal frame by frame and extracts a feature vector representing speech features; a feature vector storage unit that stores, in time series, the feature vectors extracted frame by frame for a plurality of frames; a recognition candidate speech storage unit that stores a plurality of speech sounds serving as speech recognition candidates; a first analysis unit that calculates the likelihood of each candidate speech sound based on the feature vectors of the plurality of frames stored in the feature vector storage unit; a second analysis unit that calculates an average feature vector per frame from the feature vectors of the plurality of frames and calculates the likelihood of candidate speech sounds from the average feature vector; and a speech determination unit that determines one speech sound based on the likelihood of each candidate calculated by the first analysis unit and the likelihood of each candidate calculated by the second analysis unit.

In this speech recognition device, at least one of the first analysis unit and the second analysis unit preferably calculates the likelihood of the candidate speech sounds using the Viterbi algorithm or a neural network.

Preferably, the speech determination unit adds, for each speech sound, the likelihood calculated by the first analysis unit and the likelihood calculated by the second analysis unit, and takes the speech sound with the largest sum as the recognition result.

The above object of the present invention is also achieved by a speech conversion device comprising the above speech recognition device and a speech generation device that generates synthesized speech corresponding to the speech sound that the speech determination unit has determined as the recognition result.

According to the present invention, a speech recognition device and a speech conversion device capable of performing speech recognition at high speed can be provided.

FIG. 1 is a block diagram showing a speech conversion device according to an embodiment of the present invention.
FIG. 2 is a graph showing an example of the digital speech waveform signal obtained when the word "asahi" is uttered.
FIG. 3 is a diagram schematically showing the recognition result for each frame.
FIG. 4 is a block diagram showing a conventional speech recognition device using HMMs.

Embodiments of the present invention are described below with reference to the accompanying drawings. FIG. 1 is a block diagram showing a speech conversion device 1 according to an embodiment of the present invention. As shown in FIG. 1, the speech conversion device 1 of this embodiment includes a speech recognition device 2 and a speech generation device 3. The speech generation device 3 generates synthesized speech corresponding to the speech sound that the speech recognition device 2 has output as its recognition result.

The speech recognition device 2 includes speech input means 21, conversion means 22, and analysis means 23. The speech input means 21 is an input device that picks up speech uttered by a speaker. Examples include a microphone and an acceleration pickup that extracts solid-propagated signals such as body-conducted sound, including bone-conducted sound.

The conversion means 22 is a device that converts the speech signal input to the speech input means 21 into a digital speech waveform signal. Specifically, it is, for example, a device that A/D-converts the speech signal to obtain a waveform signal in PCM (pulse code modulation) format.

The analysis means 23 analyzes the speech input to the speech input means 21 and includes a feature vector extraction unit 231, a feature vector storage unit 232, a recognition candidate speech storage unit 233, a first analysis unit 234, a second analysis unit 235, and a speech determination unit 236.

The feature vector extraction unit 231 analyzes the digital speech waveform signal frame by frame and extracts a feature vector representing speech features. Feature vectors used in speech recognition include cepstral-domain features (MFCC: Mel Frequency Cepstrum Coefficient) and power. MFCCs are parameters representing the spectral envelope. They are extracted by applying a mel-scale filter bank to the power spectrum obtained by FFT analysis of the speech data in each frame, and then applying the discrete cosine transform (DCT) to the frequency-warped power spectrum. Details are given in, for example, "Speech Recognition System" (音声認識システム), edited by 野清宏, 伊藤克亘, 河原達也, 武田一哉, and 山本幹雄, Ohmsha; ISBN4-274-13228-5.

In speech recognition, a 26-dimensional feature vector is used. It combines the 12-dimensional MFCCs (MFCC1, MFCC2, ..., MFCC12), obtained by applying the discrete cosine transform to the spectral features of the input speech and performing three processes in the cepstral domain (DC component removal, liftering, and cepstral mean removal); their first-order time derivatives (ΔMFCC1, ΔMFCC2, ..., ΔMFCC12); the first-order time derivative of the power POW (ΔPOW); and its second-order time derivative (ΔΔPOW).
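As a rough illustration of the 26-dimensional layout described above, the following Python sketch concatenates the four groups of coefficients for one frame. The function name `build_feature_vector` and the ordering of the groups within the vector are assumptions made for illustration; the description only specifies which components are included.

```python
from typing import List, Sequence


def build_feature_vector(
    mfcc: Sequence[float],        # 12 MFCCs (MFCC1..MFCC12)
    delta_mfcc: Sequence[float],  # 12 first-order time derivatives of the MFCCs
    delta_pow: float,             # first-order time derivative of power
    delta_delta_pow: float,       # second-order time derivative of power
) -> List[float]:
    """Concatenate per-frame features into one 26-dimensional vector.

    The ordering of the groups is a hypothetical choice for this sketch.
    """
    if len(mfcc) != 12 or len(delta_mfcc) != 12:
        raise ValueError("expected 12 MFCCs and 12 delta-MFCCs")
    return list(mfcc) + list(delta_mfcc) + [delta_pow, delta_delta_pow]


frame_vector = build_feature_vector([0.0] * 12, [0.0] * 12, 0.1, -0.05)
print(len(frame_vector))  # 12 + 12 + 1 + 1 = 26
```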

The feature vector storage unit 232 stores, in time series, the feature vectors extracted frame by frame for a plurality of frames. In this embodiment, it is configured to store the feature vectors of three frames sequentially in time series.

The recognition candidate speech storage unit 233 stores in advance a plurality of speech sounds serving as speech recognition candidates. It holds each speech sound, for example "a", "i", "u", "e", and "o", together with information on the feature vector corresponding to that sound.

The first analysis unit 234 calculates the likelihood of each candidate speech sound based on the feature vectors of the plurality of frames stored in the feature vector storage unit 232. Specifically, based on the feature vectors of the three frames stored in the feature vector storage unit 232 (26 dimensions × 3 frames = 78 dimensions) and the feature vector information for each speech sound stored in the recognition candidate speech storage unit 233, it performs a likelihood calculation to determine which speech sound the input speech corresponds to. Examples of the likelihood calculation method include a maximum-likelihood path search using the Viterbi algorithm or a neural network.

The second analysis unit 235 calculates an average feature vector per frame from the feature vectors of the plurality of frames and calculates the likelihood of candidate speech sounds from that average feature vector. The average feature vector is calculated using Equations 1 to 4. Equation 1 gives the average cepstrum, Equation 2 the average Δcepstrum, Equation 3 the average ΔPOW, and Equation 4 the average ΔΔPOW. In Equations 1 to 4, M is the number of frames. W_MFCC(j), WΔ_MFCC(j), WΔ_POW(j), and WΔΔ_POW(j) are weighting coefficients applied to each frame and can be set to any constant, for example from 0.0 to 1.0.
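Equations 1 to 4 are not reproduced in this text. As a hedged sketch only, the following Python function computes a per-frame weighted average over M stored frames, which is what the description says each equation does for its group of coefficients. Dividing by the frame count M (rather than, say, by the sum of the weights) and the name `weighted_average` are assumptions, not taken from the source.

```python
from typing import List, Sequence


def weighted_average(
    frames: Sequence[Sequence[float]],  # M frames, each a vector of equal length
    weights: Sequence[float],           # one weight per frame, e.g. in [0.0, 1.0]
) -> List[float]:
    """Weighted per-frame average of feature vectors.

    Assumption: the weighted sum is normalised by the number of frames M.
    """
    m = len(frames)
    if m == 0 or len(weights) != m:
        raise ValueError("need at least one frame and one weight per frame")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frames must have the same dimension")
    return [
        sum(w * f[i] for w, f in zip(weights, frames)) / m
        for i in range(dim)
    ]


# Three frames of a shortened 2-dimensional vector, all weights set to 1.0.
avg = weighted_average([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], [1.0, 1.0, 1.0])
print(avg)
```

In the embodiment the same averaging would be applied to each of the four coefficient groups, each with its own per-frame weights, to obtain one 26-dimensional average vector.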

To calculate the likelihood of candidate speech sounds from the average feature vector obtained with the above equations, the second analysis unit 235, like the first analysis unit 234, performs a likelihood calculation based on the average feature vector (26 dimensions) and the feature vector information for each speech sound stored in the recognition candidate speech storage unit 233, to determine which speech sound the input speech corresponds to. Examples of the likelihood calculation method include a maximum-likelihood path search using the Viterbi algorithm or a neural network. The second analysis unit 235 is preferably configured to perform the likelihood calculation only for the top N recognition results of the first analysis unit 234. For example, with N set to 2, the likelihood calculation is performed for the two highest-ranked speech sounds recognized by the first analysis unit 234.

The speech determination unit 236 determines one speech sound based on the likelihood of each candidate calculated by the first analysis unit 234 and the likelihood of each candidate calculated by the second analysis unit 235. Specifically, it is configured to determine the speech sound using Equation 5: it adds, for each speech sound, the likelihood calculated by the first analysis unit 234 and the likelihood calculated by the second analysis unit 235, and determines the speech sound with the largest sum as the recognition result.

Equation 5 yields the speech sound s for which the value {L1_s·λ + L2_s} is largest. L1 is the likelihood of a speech sound recognized by the first analysis unit 234, and L2 is the likelihood obtained by the second analysis unit 235 for the same speech sound. λ is a weighting coefficient.
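Equation 5 itself is not reproduced in this text. Based on the description above, it can be written as follows, where s ranges over the candidate speech sounds scored by both analysis units and the symbol \hat{s} is introduced here (it does not appear in the source) to denote the recognition result:

```latex
\hat{s} = \arg\max_{s} \left\{ \lambda \, L1_{s} + L2_{s} \right\}
```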

The operation of the speech conversion device 1 configured as described above is explained below. First, the speech uttered by the speaker is picked up by the speech input means 21, and the conversion means 22 then A/D-converts the picked-up speech signal into a digital speech waveform signal in PCM (pulse code modulation) format. As an example, FIG. 2 shows the waveform of the digital speech waveform signal obtained when the speaker utters the word "asahi".

Next, the feature vector extraction unit 231 of the analysis means 23 analyzes the digital speech waveform signal frame by frame and extracts a feature vector of 26 dimensions in total representing the speech features. The extracted feature vectors are stored in the feature vector storage unit 232 in time series for a plurality of frames, for example three frames.

The first analysis unit 234 then calculates, for example with the Viterbi algorithm, the likelihood of each speech sound the input speech may correspond to, based on the feature vectors of the three frames stored in the feature vector storage unit 232 (26 dimensions × 3 frames = 78 dimensions) and the feature vector information for each speech sound stored in the recognition candidate speech storage unit 233. Suppose the calculation gives a likelihood of 0.80 for "a", 0.20 for "i", 0.30 for "u", 0.65 for "e", 0.40 for "o", and 0.10 for silence.

Next, the second analysis unit 235 calculates the average feature vector from the feature vectors of the three frames using Equations 1 to 4. Then, based on the average feature vector (26 dimensions) and the feature vector information for each speech sound stored in the recognition candidate speech storage unit 233, it calculates, for example with the Viterbi algorithm, the likelihood of each speech sound the input speech may correspond to. In doing so, the second analysis unit 235 performs the likelihood calculation for, for example, the top two recognition results of the first analysis unit 234. Since the top two results of the first analysis unit 234 are "a" and "e", the likelihoods of these two are calculated. Suppose the calculation gives a likelihood of 0.90 for "a" and 0.60 for "e".
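The top-N selection in this step can be sketched in Python as follows, using the first-pass likelihoods from the example. The function name `top_n_candidates` and the use of a dictionary keyed by sound label are illustrative assumptions; the source does not specify a data structure.

```python
from typing import Dict, List


def top_n_candidates(likelihoods: Dict[str, float], n: int) -> List[str]:
    """Return the n candidate labels with the highest likelihood, best first."""
    ranked = sorted(likelihoods, key=lambda label: likelihoods[label], reverse=True)
    return ranked[:n]


# Likelihoods from the first analysis unit in the example above.
first_pass = {
    "a": 0.80, "i": 0.20, "u": 0.30,
    "e": 0.65, "o": 0.40, "silence": 0.10,
}
print(top_n_candidates(first_pass, 2))  # ['a', 'e']
```

Restricting the second pass to these N candidates means the second likelihood calculation runs on N sounds rather than on every stored candidate.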

The speech determination unit 236 then determines one speech sound using Equation 5, based on the likelihood of each candidate calculated by the first analysis unit 234 and the likelihood of each candidate calculated by the second analysis unit 235. With the weighting coefficient λ in Equation 5 set to 1, the value {L1_s·λ + L2_s} is {0.80 × 1 + 0.90} = 1.70 for "a" and {0.65 × 1 + 0.60} = 1.25 for "e". Because Equation 5 selects the speech sound with the larger value, 1.70 rather than 1.25, "a" is determined as the final recognition result.
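The decision rule of Equation 5 can be checked against the numbers in this example with a short Python sketch. The function name `decide_speech` and the dictionary representation are assumptions for illustration; only sounds scored by both analysis units are compared, matching the description of L2.

```python
from typing import Dict


def decide_speech(l1: Dict[str, float], l2: Dict[str, float], lam: float = 1.0) -> str:
    """Return the label maximising lam * L1 + L2 over labels scored by both units."""
    shared = [s for s in l1 if s in l2]
    if not shared:
        raise ValueError("no candidate was scored by both analysis units")
    return max(shared, key=lambda s: lam * l1[s] + l2[s])


# First-pass likelihoods (all candidates) and second-pass likelihoods (top two only).
l1 = {"a": 0.80, "i": 0.20, "u": 0.30, "e": 0.65, "o": 0.40, "silence": 0.10}
l2 = {"a": 0.90, "e": 0.60}
print(decide_speech(l1, l2))  # a
```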

This recognition is repeated while shifting the window one frame at a time, so that speech recognition is performed for all frames of the digital speech waveform signal. FIG. 3 schematically shows the recognition result for each frame. In FIG. 3, each □ represents one frame, and the label "無" (none) indicates silence.
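A minimal Python sketch of the one-frame shift is given below. It keeps the most recent three frames in a fixed-length buffer and produces one result per new frame. The names `recognize_stream` and `toy_recognizer` are hypothetical, the toy recognizer merely stands in for the two-pass likelihood calculation, and starting output only once the buffer is full is an assumption not stated in the source.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Sequence

Frame = Sequence[float]


def recognize_stream(
    frames: Iterable[Frame],
    recognize_window: Callable[[List[Frame]], str],
    window_size: int = 3,
) -> Iterator[str]:
    """Yield one recognition result per incoming frame once the window is full."""
    window: Deque[Frame] = deque(maxlen=window_size)  # oldest frame drops out
    for frame in frames:
        window.append(frame)
        if len(window) == window_size:
            yield recognize_window(list(window))


def toy_recognizer(window: List[Frame]) -> str:
    """Hypothetical stand-in: label a window by the sign of its mean value."""
    total = sum(sum(f) for f in window)
    count = sum(len(f) for f in window)
    return "a" if total / count > 0 else "silence"


stream = [[0.0], [0.0], [0.0], [1.0], [1.0]]
print(list(recognize_stream(stream, toy_recognizer)))
```

Because each new frame reuses the two previous frames already in the buffer, a result becomes available one frame after the previous one instead of after the whole utterance.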

The speech generation device 3 then generates synthesized speech corresponding to "a", which the speech determination unit 236 determined as the recognition result, and emits it from the loudspeaker.

Because the speech recognition device 2 of this embodiment can perform speech recognition frame by frame as described above, it can recognize that the uttered sound is "a" at an early stage of the interval in which the "a" of the word "asahi" is being uttered. The time lag between utterance and recognition is thereby shortened, and speech can be recognized at high speed.

Furthermore, when a synthesized speech signal corresponding to the recognition result obtained by the speech recognition device 2 of this embodiment is emitted from a loudspeaker, the shortened time lag between utterance and recognition means that, unlike in conventional devices, the synthesized speech is no longer emitted considerably later than the speaker's utterance. This effectively prevents the speaker from being affected by delayed synthesized speech and having difficulty carrying on a conversation smoothly.

DESCRIPTION OF SYMBOLS
1 Speech conversion device
2 Speech recognition device
21 Speech input means
22 Conversion means
23 Analysis means
231 Feature vector extraction unit
232 Feature vector storage unit
233 Recognition candidate speech storage unit
234 First analysis unit
235 Second analysis unit
236 Speech determination unit
3 Speech generation device

Claims

1. A speech recognition device comprising: speech input means for inputting speech; conversion means for converting a speech signal input to the speech input means into a digital speech waveform signal; and analysis means for analyzing the speech input to the speech input means from the digital speech waveform signal converted by the conversion means,
wherein the analysis means includes:
a feature vector extraction unit that analyzes the digital speech waveform signal frame by frame and extracts a feature vector representing speech features;
a feature vector storage unit that stores, in time series, the feature vectors extracted frame by frame for a plurality of frames;
a recognition candidate speech storage unit that stores a plurality of speech sounds serving as speech recognition candidates;
a first analysis unit that calculates the likelihood of each candidate speech sound based on the feature vectors of the plurality of frames stored in the feature vector storage unit;
a second analysis unit that calculates an average feature vector per frame from the feature vectors of the plurality of frames and calculates the likelihood of candidate speech sounds from the average feature vector; and
a speech determination unit that determines one speech sound based on the likelihood of each candidate speech sound calculated by the first analysis unit and the likelihood of each candidate speech sound calculated by the second analysis unit.

2. The speech recognition device according to claim 1, wherein at least one of the first analysis unit and the second analysis unit calculates the likelihood of the candidate speech sounds using the Viterbi algorithm or a neural network.

3. The speech recognition device according to claim 1, wherein the speech determination unit adds, for each speech sound, the likelihood of each candidate speech sound calculated by the first analysis unit and the likelihood of each candidate speech sound calculated by the second analysis unit, and takes the speech sound corresponding to the largest sum as the recognition result.

4. A speech conversion device comprising: the speech recognition device according to any one of claims 1 to 3; and a speech generation device that generates synthesized speech corresponding to the speech sound determined as the recognition result by the speech determination unit.