JP3973492B2

JP3973492B2 - Speech synthesis method and apparatus thereof, program, and recording medium recording the program

Info

Publication number: JP3973492B2
Application number: JP2002162815A
Authority: JP
Inventors: 泰浩南; 智広中谷; 俊夫入野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2002-06-04
Filing date: 2002-06-04
Publication date: 2007-09-12
Anticipated expiration: 2022-06-04
Also published as: JP2004012584A

Description

【０００１】
【発明の属する技術分野】
本発明は、音声認識用情報作成方法、音響モデル作成方法、音声認識方法、音声合成用情報作成方法、音声合成方法、及びそれらの装置、並びにプログラム、そのプログラムを記録した記録媒体に関する。
【０００２】
【従来の技術】
従来のヒドンマルコフモデル（ＨＭＭ）を用いる音声合成手法として、（文献[1] K.Tokuda, T.Kobayashi and S.Imai, "Speech parameter generation from HMM using dynamic features" Proc. ICASSP, pp.660-663, 1995）,（文献[2] K.Tokuda, T.Masuko and T.Yamada, T.Kobayashi and S. Imai, "An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features" Proc. Eurospeech, pp.757-760, 1995）,（文献[3] T.Masuko, K.Tokuda, T.Kobayashi and S.Imai, "Speech synthesis from HMMs using dynamic features" Proc. ICASSP, pp.389-392, 1996)があげられる。
この手法は、音声認識におけるパラメータと音声合成におけるパラメータに同一のものを用いることで、音声認識で用いられる手法を音声合成に用いて音声合成手法を高機能化したり、音声合成で用いられる手法を音声認識に用いて認識手法を高精度化したりできるという利点を持っている。
【０００３】
図１、図２に従来のヒドンマルコフモデル（ＨＭＭ）による音響モデル学習・音声認識装置、及び音声合成装置の構成を示す。
まず、図１を用いて音声認識モデル学習法と音声認識手法について説明する。
メルケプストラム分析部101では入力された音声をメルケプストラムに変換する。ヒドンマルコフモデル（ＨＭＭ）の学習時には、このメルケプストラムを音響モデル学習部102に送る。メルケプストラムパラメータと入力音声の学習テキスト（例えば、日本語文、言語的情報の付与された音素系列、または単語系列）からヒドンマルコフモデル（例えば、音素、または単語ヒドンマルコフモデル）を学習する。（すなわち、最大確率（最大尤度）を与えるモデルを選ぶ。）次に学習したモデルを記憶部103に記憶する。認識時には、入力音声のメルケプストラムパラメータを音声認識部104に送り、音響モデル学習部で学習されたヒドンマルコフモデル（ＨＭＭ）をＨＭＭ記憶部103から読み出して比較し、尤度が最大のものを出力することで、テキストに変換する。
【０００４】
次に図２を用いて、ヒドンマルコフモデル（ＨＭＭ）を用いてテキストから音声を作り出す音声合成方式について説明する。
ＨＭＭ記憶部105に記憶しているヒドンマルコフモデル（ＨＭＭ）は予め大量のデータより、上述の学習手法を学習しているものとする。まず、構文解析部110では、入力されたテキストを、言語的情報の付与された音素（または単語）系列に変換する。この音素（または単語）情報により音素（または単語）ヒドンマルコフモデル（ＨＭＭ）が接続され、入力のテキストに対するヒドンマルコフモデル（ＨＭＭ）の系列を生成する。平滑化パラメータ生成部109で、ヒドンマルコフモデル（ＨＭＭ）状態系列から自然で滑らかなメルケプストラム系列を出力する。この滑らかなメルケプストラム系列を、音声合成部106では各時刻でＭＬＳＡフィルタ108に変換し、音源情報を元にパルス／ノイズ系列生成部107によって生成される信号をこのフィルタを通すことで音声を合成する。なお、音源情報は処理対象となるテキスト列から得られ、ピッチ情報、パワー情報などを含む。
【０００５】
【発明が解決しようとする課題】
従来の手法では、メルケプストラムが、音声の包絡スペクトルの微細な構造を表現できず、合成音声の品質が十分でないという問題点があった。また、メルケプストラムを合成する際に用いられる逆メル変換と指数変換を近似するＭＬＳＡ(Mel Log Spectral Approximation)フィルタは複雑で、例えばサンプリング周波数を変更するということに問題点があった。
音声合成で利用されるSTRAIGHTスペクトルを用いるSTRAIGHT合成系は、高品質に音声を合成できることが知られている。そこで、このパラメータをヒドンマルコフモデル（ＨＭＭ）を用いた音声合成手法に導入すれば音声合成の品質が向上することが期待できる。しかし、このパラメータは、音声認識には直接利用できないため、ヒドンマルコフモデル（ＨＭＭ）の学習ができない。たとえ、音声認識に利用できるメルケプストラムなどのパラメータに変換できて、ＨＭＭが学習できたとしても、合成音を作成する際に、ＨＭＭから生成されるメルケプストラムを使って音声を合成する手段がなかった。このため、ＨＭＭを用いた音声合成にはSTRAIGHT合成系を導入するということが実現できなかった。
【０００６】
本発明はかかる事情に鑑みて、ヒドンマルコフモデル（ＨＭＭ）を使った音声合成系に、STRAIGHT合成系を導入し、STRAIGHTメルケプストラムというパラメータに変換する手段と、合成時に作成されるSTRAIGHTメルケプストラムをSTRAIGHTスペクトルに変換する手段を用いることで、認識と合成のパラメータを同一にし、従来のヒドンマルコフモデル（ＨＭＭ）を用いた合成法が持っていた利点をそのまま残しながら、高品質な合成音声システムを簡単な構成で実現できる新たな技術の提供を目的とする。
【０００７】
また、本発明は、ヒドンマルコフモデル（ＨＭＭ）を用いた音声合成系のパラメータとして、短時間フーリエ変換したスペクトルを対数変換し、離散コサイン変換して求めることができるメルケプストラムを使い、音声認識におけるパラメータとテキストからの音声合成におけるパラメータとで同一のものを用いることで、従来のヒドンマルコフモデル（ＨＭＭ）を用いた合成法が持っていた利点をそのまま残しながら、高品質な合成音声システムを簡単な構成で実現できる新たな技術の提供を目的とする。
【０００８】
【課題を解決するための手段】
本発明では、ヒドンマルコフモデル（ＨＭＭ）を用いた音声合成に、基本周波数の影響を除去してスペクトラム形状を求める高品質なSTRAIGHT音声合成系を導入することで音声合成の品質向上を実現する。これを実現するために、まず、STRAIGHTスペクトルを音声認識で高性能を実現するSTRAIGHTメルケプストラムに変換する。このメルケプストラム変換は、STRAIGHTスペクトルの対数変換をとり、その結果に対して周波数伸縮離散コサイン変換を行う手段を用いて実行する。これらの手段によりSTRAIGHTスペクトルを音声認識で利用できる形に変形できる。このため、音声認識のパラメータと音声合成のパラメータの統一を実現でき、ＨＭＭによる音声合成が持つ利点を保つことができる。このパラメータを用いて音素（または単語）ヒドンマルコフモデル（ＨＭＭ）を学習する手段と、学習された音素（または単語）ヒドンマルコフモデル（ＨＭＭ）を用いて音声を認識する手段により、音響モデルの学習と音声認識を実現する。音声合成時には、学習されたヒドンマルコフモデル（ＨＭＭ）の音素（または単語）系列から自然で滑らかなSTRAIGHTメルケプストラム系列を出力する手段を用いる。さらに、このSTRAIGHTメルケプストラム系列を、逆周波数伸縮離散コサイン変換し、さらに指数変換する手段を用い、STRAIGHTスペクトル系列に変換する。これらの手段によりSTRAIGHTメルケプストラムを音声合成で利用できる形に変形することができる。このため、音声認識のパラメータと音声合成のパラメータの統一を実現でき、ＨＭＭによる音声合成が持つ利点を保つことができる。これらの手段によって生成されたSTRAIGHTスペクトル系列とパルス／ノイズ系列生成手段により合成されたパルス／ノイズ系列から、逆ＦＦＴ重ね合わせ合成する手段により音声を合成する。
【０００９】
上記のSTRAIGHTスペクトル生成手段の代わりに短時間フーリエ変換によるスペクトル生成手段を用いてもよい。この場合は、メルケプストラム系列は、短時間フーリエ変換から求められるスペクトラムの対数変換を、周波数伸縮離散コサイン変換したものとなる。このパラメータを使っても、上記音響モデル学習手段、音声認識手段により、ヒドンマルコフモデル（ＨＭＭ）の学習とヒドンマルコフモデル（ＨＭＭ）を用いる音声認識も実現できる。合成時は、滑らかなパラメータ生成もできる。この生成された滑らかなパラメータを逆周波数伸縮離散コサイン変換し、指数変換を実行する手段を用いて、スペクトルに変換する。その後、このスペクトル系列とパルス／ノイズ系列生成手段により合成されたパルス／ノイズ系列から、逆ＦＦＴ重ね合わせ合成する手段により音声を合成する。
以上のような手段により、STRAIGHTパラメータの持つ高品質な合成品質を、ヒドンマルコフモデル（ＨＭＭ）を用いる音声合成に導入でき、音声合成の品質向上が実現する。
【００１０】
【発明の実施の形態】
（音声認識用情報作成・音響モデル作成・音声認識）
図３、図４に音声認識用情報作成・音響モデル作成・音声認識装置の構成を示す。
まず、図３を用いて音声認識用情報作成方法、音響モデル作成方法、音声認識方法、及びそれらの装置について説明する。
音声認識用情報作成部10のSTRAIGHT分析部11において、短時間フーリエ変換部13は、入力された音声を短時間フーリエ変換する。それと同時に、基本周波数推定部12で入力音声の無声／有声の判定を行い、有声の場合は基本周波数を計算（推定）する。その情報を元に、平滑化スペクトラム分析部14は、短時間フーリエ変換された音声から基本周波数の影響を取り除き、平滑化されたスペクトラムに変換する。この変換されたスペクトルをSTRAIGHTスペクトルと呼ぶ。次に、対数変換部15は、STRAIGHTスペクトルの対数変換を行い、周波数伸縮離散コサイン変換部16により、STRAIGHTメルケプストラムに変換する。
【００１１】
音響モデルの学習時には、このSTRAIGHTメルケプストラムを音響モデル学習部20に送る。音響モデル学習部20では、STRAIGHTメルケプストラムとそれに対応するテキスト（学習テキスト）から音素（または単語）ヒドンマルコフモデル（ＨＭＭ）を学習する。次に学習した音素（または単語）ヒドンマルコフモデル（ＨＭＭ）を記憶部30に記憶する。
認識時には、STRAIGHTメルケプストラムパラメータを、音声認識部40に送り、音響モデル学習部で学習され、記憶部30に保持されたヒドンマルコフモデル（ＨＭＭ）と比較し、尤度の最も高い値を示すテキストを出力する。これにより音声認識を実現する。
上記の手法で、STRAIGHTスペクトル生成時に、基本周波数推定を行わなくてもよい。このときの音声認識モデル学習法と音声認識手法を図４を用いて説明する。この場合に生成されるメルケプストラム系列は、短時間フーリエ変換から求められるスペクトラムを直接対数変換して周波数伸縮離散コサイン変換したものである。このメルケプストラム系列を使って、上記のようにヒドンマルコフモデル（ＨＭＭ）の学習を実現することもできる。さらに、ヒドンマルコフモデル（ＨＭＭ）から滑らかなパラメータを生成することもできる。
【００１２】
（音声合成）
図５に音声合成装置の構成を示す。
図５を用いて、テキストから音声を作り出す音声合成方法、及び装置について説明する。
ＨＭＭ記憶部63に記憶しているヒドンマルコフモデル（ＨＭＭ）は予め大量のデータより、上記の学習手法により作成する。音声合成用情報生成部60では、まず、入力されたテキストを、構文解析部62により、言語的情報の付与された音素（または単語）系列に変換する。平滑化パラメータ生成部61では、この音素（または単語）情報により音素（または単語）ヒドンマルコフモデル（ＨＭＭ）が接続され、入力のテキストに対する音素（または単語）ヒドンマルコフモデル（ＨＭＭ）の系列が生成される。入力が音素系列である場合は、構文解析部では構文解析を行わず、その音素情報からヒドンマルコフモデル（ＨＭＭ）をつなぎ合わせて、入力に対する音素（または単語）ヒドンマルコフモデル（ＨＭＭ）を作成する。また、入力がヒドンマルコフモデル（ＨＭＭ）の状態系列である場合は、音素（または単語）ヒドンマルコフモデル（ＨＭＭ）系列の代わりに、ヒドンマルコフモデル（ＨＭＭ）の状態系列を作成する。平滑化パラメータ生成部では、さらに、音素（または単語）ヒドンマルコフモデル（ＨＭＭ）系列から自然で滑らかなSTRAIGHTメルケプストラム系列を出力する。このSTRAIGHTメルケプストラム系列が音声合成部50に入力される。音声合成部50の逆周波数伸縮離散コサイン変換部55では、このSTRAIGHTメルケプストラム系列を逆周波数伸縮離散コサイン変換し、さらに、指数変換部54において指数変換することで、STRAIGHTスペクトル系列に変換し、STRAIGHTスペクトル系列とパルス／ノイズ生成部52によって生成された信号から逆ＦＦＴ重ね合わせ合成部53により音声を合成する。
上記の手法で、音声認識用情報作成部で、基本周波数推定部を使わない場合には、このヒドンマルコフモデル（ＨＭＭ）から生成された滑らかなパラメータを逆周波数伸縮離散コサイン変換し、さらに指数変換することで、スペクトルに変換する。その後、このスペクトル系列とパルス／ノイズ生成部によって生成されたパルス／ノイズ列から逆ＦＦＴ重ね合わせ合成により音声を合成する。
各構成部について詳細に説明する。
【００１３】
（STRAIGHT分析部）
STRAIGHT分析部11では、短時間フーリエ変換部13において、入力された音声を短時間フーリエ変換する。それと同時に、基本周波数推定部12は、入力音声の無声／有声の判定を行い、有声の場合は、基本周波数を計算する。その情報を元に、平滑化スペクトラム分析部14において、短時間フーリエ変換された音声から基本周波数の影響を取り除き、平滑化したスペクトラムに変換する。この変換されたスペクトルをSTRAIGHTスペクトルと呼ぶ。このSTRAIGHTスペクトルは次に周波数伸縮離散コサイン変換部16へ送られる。
【００１４】
（周波数伸縮離散コサイン変換部）
周波数伸縮離散コサイン変換部１６では入力されたSTRAIGHTスペクトルの対数変換を行い、その結果に対して周波数伸縮離散コサイン変換を行う。この変換の核の関数は、
【数１】

と定義されるフィルタの周波数応答の実数部である。Ｒｅ［Ψ_ｍ（ω）］の実部は｛ω｜０≦ω≦π｝の時、正規化直交変換になる。αは、周波数伸縮の度合いを決定する係数である。その伸縮の度合いは、
【数２】

という式によって求めることができる。αが０のときには、Ｒｅ［Ψ_ｍ（ω）］＝ｃｏｓ（ｍω）となり、離散コサイン変換となる。αが０と１の間では、Ｒｅ［Ψ_ｍ（ω）］は直交性を保存する重みつき関数による
【数３】

以上の変換でSTRAIGHTスペクトルはSTRAIGHTメルケプストラムに変換される。
【００１５】
（音響モデル学習部）
音響モデル学習時には、このSTRAIGHTメルケプストラムを、音響モデル学習部20へ入力する。音響モデル学習部20では、入力されたSTRAIGHTメルケプストラムを使って、ＥＭ(expectation-maximization)アルゴリズムにより音素（または単語）ヒドンマルコフモデル（ＨＭＭ）の学習を行う。上記の手法では、STRAIGHTスペクトル生成時に基本周波数推定を行わなくてもよい。この場合は、メルスペクトラム系列は、短時間フーリエ変換から求められるスペクトラムの対数変換を、周波数伸縮離散コサイン変換したものである。このメルケプストラム系列を使っても、上記のようにヒドンマルコフモデル（ＨＭＭ）の学習ができる。さらに、ヒドンマルコフモデル（ＨＭＭ）による音声認識も実現できる。
【００１６】
（音声合成）
音響モデル学習部で作成されたヒドンマルコフモデル（ＨＭＭ）を用いた音声合成は以下のように実現する。まず、入力されたテキストは、構文解析部62により、言語的情報の付与された音素（または単語）系列に変換される。この音素（または単語）情報により音素（または単語）ヒドンマルコフモデル（ＨＭＭ）が接続され、入力のテキストに対する音素（または単語）ヒドンマルコフモデル（ＨＭＭ）の系列が生成される。入力が音素系列である場合は、構文解析部では構文解析を行わず、その音素情報からヒドンマルコフモデル（ＨＭＭ）をつなぎ合わせて、入力に対する音素（または単語）ヒドンマルコフモデル（ＨＭＭ）系列を作成する。また、入力がヒドンマルコフモデル（ＨＭＭ）の状態系列である場合には、音素（または単語）ヒドンマルコフモデル（ＨＭＭ）系列の代わりに、ヒドンマルコフモデル（ＨＭＭ）の状態系列を作成する。平滑化パラメータ生成部61は、音素（または単語）ヒドンマルコフモデル（ＨＭＭ）系列から自然で滑らかなメルケプストラムパラメータ系列を出力する。この滑らかにする手法について以下に述べる。
【００１７】
上で述べた学習によって、ヒドンマルコフモデル（ＨＭＭ）が学習されているとする。ここで、S＝[s₁,s₂,・・・,s_T]は、ヒドンマルコフモデル（ＨＭＭ）のガウス分布時系列、M＝[μ₁,μ₂,・・・,μ_T]、ΔM＝[Δμ₁,Δμ₂,・・・,Δμ_T]、Δ²M＝[Δ²μ₁,Δ²μ₂,・・・,Δ²μ_T]は、そのガウス分布時系列でのヒドンマルコフモデル（ＨＭＭ）のSTRAIGHTメルケプストラム、その微分係数であるΔSTRAIGHTメルケプストラム、２次微分係数Δ²STRAIGHTメルケプストラムの平均値のベクトル時系列である。また、Σ＝[Σ₁,Σ₂,・・・,Σ_T]、ΔΣ＝[ΔΣ₁,ΔΣ₂,・・・,ΔΣ_T]、Δ²Σ＝[Δ²Σ₁,Δ²Σ₂,・・・,Δ²Σ_T]は、ヒドンマルコフモデル（ＨＭＭ）のSTRAIGHTメルケプストラム、ΔSTRAIGHTメルケプストラム、Δ² STRAIGHTメルケプストラム共分散行列の時系列である（対角共分散行列を仮定している）。ところで、STRAIGHTメルケプストラムC＝[c₁,c₂,・・・,c_T]、ΔSTRAIGHTメルケプストラムΔC＝[Δc₁,Δc₂,・・・,Δc_T]、Δ²STRAIGHTメルケプストラムΔ²C＝[Δ²c₁,Δ²c₂,・・・,Δ²c_T]の間には（３）,（４）に示すような拘束条件がある（拘束条件にはこの他にも複数考えられるがどれを使っても同様なことが実現できる）。
【数４】

ここで、２L＋１はウィンドウサイズ、b₀,b₁とb₂はウィンドウサイズによって決まる固定値である、このヒドンマルコフモデル（ＨＭＭ）の平均値時系列から文献[１〜３]手法を使って、この平均値の時系列を変形して、滑らかなSTRAIGHTメルケプストラムを生成する。ここでその手法について説明する。いま、ガウス分布時系列が与えられていると仮定する。与えられたガウス分布時系列に対して（３）,（４）の条件の下で（５）を最大化するC,ΔC,Δ²Cを選ぶことによって、STRAIGHTメルケプストラムの時系列を生成する。これは、（５）に（３）,（４）を代入し、ΔC,Δ²Cを消去してCだけで表現し、これをCで偏微分した（６）を満たすCを求めることによって実現できる。これにより、滑らかなSTRAIGHTメルケプストラムの係数が得られる。
【数５】

上記手法では、STRAIGHTメルケプストラムの２次微分までしか用いていないが、３次、４次以降の項を導入することもできる。また、上記手法以外に、フィルタを用いて、ヒドンマルコフモデル（ＨＭＭ）のSTRAIGHTメルケプストラム系列の平均値系列を滑らかにする手法も利用できる。
【００１８】
このようにして作成された滑らかなSTRAIGHTメルケプストラム系列は、逆周波数伸縮離散コサイン変換部55へ入力される。逆周波数伸縮離散コサイン変換部では逆周波数伸縮離散コサインを行う。周波数伸縮離散コサイン変換は、直交変換なので、逆周波数伸縮離散コサイン変換もこの変換から容易に計算できる。この変換を行って、さらに指数変換を行うことによりSTRAIGHTメルケプストラム系列は、STRAIGHTメルケプストラム系列はSTRAIGHTスペクトルの系列に変換される。STRAIGHT音声合成部51では、このSTRAIGHTスペクトル系列とパルス／ノイズ生成部52によって生成されたパルス／ノイズ列から逆ＦＦＴと重ね合わせ合成により音声を合成する。
【００１９】
（パルス／ノイズ系列生成部）
次に、パルス／ノイズ系列生成部32について述べる。まず、基本的なパルス／ノイズ系列生成手法を示す。これはある話者が発声した音声から抽出した基本周波数をそのまま利用する方法である。この手法を図６に示す。この方法では入力の音声から直接、基本周波数を推定し、その基本周波数を合成に利用する。
図７に基本周波数モデル学習部の構成、図８にヒドンマルコフモデル（ＨＭＭ）を利用した場合のパルス／ノイズ系列生成部の構成を示す。
図７と図８を用いて、ヒドンマルコフモデル（ＨＭＭ）を利用した場合のパルス／ノイズ系列生成部32について述べる。
図７では、パルス／ノイズ生成の場合のヒドンマルコフモデル（ＨＭＭ）の学習方法について示す。入力された音声から基本周波数推定部52-3では、無声／有声を判断し、その結果を出力する。また、有声の場合は、その周波数の情報も出力する。この出力された情報は、基本周波数モデル学習部52-2に送られる。これらの基本周波数、その微分係数、２次微分係数および、無声／有声の情報と、それに対応する学習テキストを使って、基本周波数、その微分係数、２次微分係数の平均値と分散および無声／有声の情報を、音素（または単語）基本周波数ヒドンマルコフモデル（ＨＭＭ）の構造上に、ＥＭアルゴリズムを使って学習する。次に、学習した基本周波数ヒドンマルコフモデル（ＨＭＭ）を記憶部52-4に記憶する。
【００２０】
図８にヒドンマルコフモデル（ＨＭＭ）を利用した場合のパルス／ノイズ合成について示す。まず、最初に入力されたテキストは、構文解析部52-8により、言語的情報の付与された音素（または単語）系列に変換される。この音素（または単語）情報により音素（または単語）基本周波数ヒドンマルコフモデル（ＨＭＭ）が接続され、入力のテキストに対する音素（または単語）基本周波数ヒドンマルコフモデル（ＨＭＭ）の系列が生成される。入力が音素系列である場合は、構文解析部では構文解析を行わず、その音素情報から音素（または単語）基本周波数ヒドンマルコフモデル（ＨＭＭ）をつなぎ合わせて、入力に対する音素（または単語）基本周波数ヒドンマルコフモデル（ＨＭＭ）系列を作成する。また、入力がヒドンマルコフモデル（ＨＭＭ）の状態系列である場合は、音素（または単語）基本周波ヒドンマルコフモデル（ＨＭＭ）の代わりに、基本周波数ヒドンマルコフモデル（ＨＭＭ）の状態系列を作成する。平滑化パルス／ノイズ生成部52-7では、入力に対する音素（または単語）基本周波ヒドンマルコフモデル（ＨＭＭ）系列から、滑らかなSTRAIGHTメルケプストラム生成のときの手法と同じ平滑化手法により滑らかな基本周波数系列を出力し、パルス情報に変換する。ただし、ヒドンマルコフモデル（ＨＭＭ）の状態が無声音であれば、平滑化を行わず、ノイズを生成する。
【００２１】
上記の手法で、音声認識用情報作成部で、基本周波数推定部を使わない場合には、このヒドンマルコフモデル（ＨＭＭ）から生成された滑らかなパラメータを逆周波数伸縮離散コサイン変換し、さらに指数変換することで、スペクトルに変換する。その後、このスペクトル系列とパルス／ノイズ生成部によって生成されたパルス／ノイズ列から逆ＦＦＴ重ね合わせ合成により音声を合成する。このパルス／ノイズ列の生成には、上述したパルス／ノイズ列生成手法が利用できる。
【００２２】
なお、上記に記載の音声認識用情報作成・音響モデル作成・音声認識装置及び音声合成装置は、ＣＰＵやメモリ等を有するコンピュータと、アクセス主体となるユーザが利用する端末と、記録媒体から構成することができる。記録媒体は、ＣＤ−ＲＯＭ、磁気ディスク装置、半導体メモリ等機械読み取り可能な記録媒体であり、ここに記録された制御用プログラムは、コンピュータに読み取られ、コンピュータの動作を制御し、コンピュータ上に前述した各構成要素を実現することができる。
【００２３】
【発明の効果】
以上説明したように、本発明によれば、ヒドンマルコフモデル（ＨＭＭ）を用いた音声合成系に、STRAIGHT合成系を導入し、音声認識におけるパラメータとテキストからの音声合成におけるパラメータとで同一のものを用いることで、従来のヒドンマルコフモデル（ＨＭＭ）を用いた合成法が持っていた利点をそのまま残しながら、高品質でかつ簡単なシステム構成で実現できる。
また、本発明によれば、ヒドンマルコフモデル（ＨＭＭ）を用いた音声合成系のパラメータとして、短時間フーリエ変換したスペクトルを対数変換して離散コサイン変換して求めることができるメルケプストラムを使うことで、音声認識におけるパラメータとテキストからの音声合成におけるパラメータとで同一のものを用いることで、従来のヒドンマルコフモデル（ＨＭＭ）を用いた合成法が持っていた利点をそのまま残しながら、高品質でかつ簡単なシステム構成が実現できる。
【図面の簡単な説明】
【図１】従来のヒドンマルコフモデル（ＨＭＭ）による音響モデル学習および音声認識装置の構成図。
【図２】従来のヒドンマルコフモデル（ＨＭＭ）による音声合成装置の構成図。
【図３】本発明の実施例である音声認識用情報作成・音響モデル作成・音声認識装置の構成図。
【図４】本発明の他の実施例である音声認識用情報作成・音響モデル作成・音声認識装置の構成図。
【図５】本発明の実施例であるテキストから音声を作り出す音声合成装置の構成図。
【図６】パルス／ノイズ系列生成部の説明図。
【図７】基本周波数モデル学習部の構成図。
【図８】ヒドンマルコフモデル（ＨＭＭ）を利用した場合のパルス／ノイズ系列生成部の構成図。
【符号の説明】
10・・・音声認識用情報作成部
11・・・STRAIGHT分析部
12・・・基本周波数推定部、13・・・短時間フーリエ変換部、14・・・平滑化スペクトラム分析部
15・・・対数変換部
16・・・周波数伸縮離散コサイン変換部
20・・・音響モデル学習部
30・・・ＨＭＭ（音響モデル）記憶部
40・・・音声認識部
50・・・音声合成部
51・・・STRAIGHT音声合成部
52・・・パルス／ノイズ系列生成部、53・・・逆ＦＦＴ重ね合わせ合成部
54・・・指数変換部
55・・・逆周波数伸縮離散コサイン変換部
60・・・音声合成用情報生成部
61・・・平滑化パラメータ生成部、62・・・構文解析部、63・・・ＨＭＭ（音響モデル）記憶部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition information creation method, an acoustic model creation method, a speech recognition method, a speech synthesis information creation method, a speech synthesis method, and devices therefor, as well as a program and a recording medium on which the program is recorded.
[0002]
[Prior art]
Conventional speech synthesis methods using the Hidden Markov Model (HMM) include those of Ref. [1] (K. Tokuda, T. Kobayashi and S. Imai, "Speech parameter generation from HMM using dynamic features," Proc. ICASSP, pp. 660-663, 1995), Ref. [2] (K. Tokuda, T. Masuko, T. Yamada, T. Kobayashi and S. Imai, "An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features," Proc. Eurospeech, pp. 757-760, 1995), and Ref. [3] (T. Masuko, K. Tokuda, T. Kobayashi and S. Imai, "Speech synthesis from HMMs using dynamic features," Proc. ICASSP, pp. 389-392, 1996).
Because this approach uses the same parameters for speech recognition and speech synthesis, techniques used in speech recognition can be applied to speech synthesis to extend its functionality, and techniques used in speech synthesis can be applied to speech recognition to improve recognition accuracy.
[0003]
FIG. 1 and FIG. 2 show the configuration of a conventional acoustic model learning / speech recognition apparatus and speech synthesis apparatus based on the Hidden Markov Model (HMM).
First, a speech recognition model learning method and a speech recognition method will be described with reference to FIG.
The mel cepstrum analysis unit 101 converts the input speech into a mel cepstrum. When a Hidden Markov Model (HMM) is trained, this mel cepstrum is sent to the acoustic model learning unit 102. A Hidden Markov Model (for example, a phoneme or word HMM) is learned from the mel cepstrum parameters and the learning text for the input speech (for example, a Japanese sentence, a phoneme sequence to which linguistic information is added, or a word sequence); that is, the model giving the maximum probability (maximum likelihood) is selected. The learned model is then stored in the storage unit 103. At recognition time, the mel cepstrum parameters of the input speech are sent to the speech recognition unit 104, which reads the Hidden Markov Models (HMMs) learned by the acoustic model learning unit from the HMM storage unit 103, compares them with the input, and outputs the one with the maximum likelihood, thereby converting the speech into text.
[0004]
Next, a speech synthesis method for generating speech from text using the Hidden Markov Model (HMM) will be described with reference to FIG.
It is assumed that the Hidden Markov Model (HMM) stored in the HMM storage unit 105 has been trained in advance on a large amount of data by the learning method described above. First, the syntax analysis unit 110 converts the input text into a phoneme (or word) sequence to which linguistic information is added. The phoneme (or word) Hidden Markov Models (HMMs) are connected according to this phoneme (or word) information, and a sequence of Hidden Markov Models (HMMs) for the input text is generated. The smoothing parameter generation unit 109 outputs a natural and smooth mel cepstrum sequence from the Hidden Markov Model (HMM) state sequence. The speech synthesizer 106 converts this smooth mel cepstrum sequence into the MLSA filter 108 at each time instant, and speech is synthesized by passing the signal that the pulse / noise sequence generator 107 generates from the sound source information through this filter. The sound source information is obtained from the text string to be processed and includes pitch information, power information, and the like.
[0005]
[Problems to be solved by the invention]
The conventional method has the problem that the mel cepstrum cannot express the fine structure of the spectral envelope of speech, so the quality of the synthesized speech is insufficient. In addition, the MLSA (Mel Log Spectral Approximation) filter, which approximates the inverse mel transformation and exponential transformation used when synthesizing from the mel cepstrum, is complicated, which makes it difficult, for example, to change the sampling frequency.
The STRAIGHT synthesis system, which uses the STRAIGHT spectrum employed in speech synthesis, is known to synthesize speech with high quality. Introducing this parameter into a speech synthesis method using the Hidden Markov Model (HMM) can therefore be expected to improve synthesis quality. However, this parameter cannot be used directly for speech recognition, so the Hidden Markov Model (HMM) cannot be trained on it. Even if it could be converted into a parameter usable for speech recognition, such as the mel cepstrum, and an HMM could be trained, there was no means of synthesizing speech from the mel cepstrum generated by the HMM when creating synthesized speech. For this reason, it has not been possible to introduce the STRAIGHT synthesis system into HMM-based speech synthesis.
[0006]
In view of these circumstances, an object of the present invention is to provide a new technique that introduces the STRAIGHT synthesis system into a speech synthesis system using the Hidden Markov Model (HMM). By using means for conversion into a parameter called the STRAIGHT mel cepstrum and means for converting the STRAIGHT mel cepstrum generated at synthesis time into a STRAIGHT spectrum, the parameters for recognition and synthesis are made identical, so that a high-quality speech synthesis system can be realized with a simple configuration while retaining the advantages of the conventional synthesis method using the Hidden Markov Model (HMM).
[0007]
A further object of the present invention is to provide a new technique that uses, as the parameter of a speech synthesis system based on the Hidden Markov Model (HMM), a mel cepstrum obtained by logarithmically transforming a short-time Fourier transform spectrum and applying a discrete cosine transform. By using the same parameters for speech recognition and for text-to-speech synthesis, a high-quality speech synthesis system can be realized with a simple configuration while retaining the advantages of the conventional synthesis method using the Hidden Markov Model (HMM).
[0008]
[Means for Solving the Problems]
In the present invention, the quality of speech synthesis is improved by introducing, into speech synthesis using the Hidden Markov Model (HMM), the high-quality STRAIGHT speech synthesis system, which obtains the spectrum shape after removing the influence of the fundamental frequency. To realize this, the STRAIGHT spectrum is first converted into a STRAIGHT mel cepstrum, which achieves high performance in speech recognition. This mel cepstrum conversion is executed by means that takes the logarithm of the STRAIGHT spectrum and applies the frequency expansion / contraction discrete cosine transform to the result. By these means, the STRAIGHT spectrum can be transformed into a form that can be used for speech recognition. For this reason, the speech recognition parameters and the speech synthesis parameters can be unified, and the advantages of HMM-based speech synthesis can be maintained. Acoustic model learning and speech recognition are realized by means for learning a phoneme (or word) Hidden Markov Model (HMM) using this parameter and means for recognizing speech using the learned phoneme (or word) Hidden Markov Model (HMM). At the time of speech synthesis, means for outputting a natural and smooth STRAIGHT mel cepstrum sequence from the phoneme (or word) sequence of the learned Hidden Markov Model (HMM) is used. Further, this STRAIGHT mel cepstrum sequence is converted into a STRAIGHT spectrum sequence by means that applies the inverse frequency expansion / contraction discrete cosine transform followed by exponential transformation. By these means, the STRAIGHT mel cepstrum can be transformed into a form that can be used in speech synthesis. For this reason, the speech recognition parameters and the speech synthesis parameters can be unified, and the advantages of HMM-based speech synthesis can be maintained. From the STRAIGHT spectrum sequence generated by these means and the pulse / noise sequence synthesized by the pulse / noise sequence generating means, speech is synthesized by means of inverse FFT superposition synthesis.
[0009]
Instead of the above STRAIGHT spectrum generation means, spectrum generation means based on the short-time Fourier transform may be used. In this case, the mel cepstrum sequence is obtained by applying the frequency expansion / contraction discrete cosine transform to the logarithm of the spectrum obtained from the short-time Fourier transform. Even with this parameter, learning of the Hidden Markov Model (HMM) and speech recognition using the Hidden Markov Model (HMM) can be realized by the acoustic model learning means and the speech recognition means. Smooth parameters can also be generated at synthesis time. The generated smooth parameters are converted into a spectrum by means that applies the inverse frequency expansion / contraction discrete cosine transform and then the exponential transform. Thereafter, speech is synthesized by means of inverse FFT superposition synthesis from this spectrum sequence and the pulse / noise sequence synthesized by the pulse / noise sequence generating means.
By the means described above, the high synthesis quality of the STRAIGHT parameters can be introduced into speech synthesis using the Hidden Markov Model (HMM), and speech synthesis quality is improved.
[0010]
DETAILED DESCRIPTION OF THE INVENTION
(Speech recognition information creation, acoustic model creation, speech recognition)
FIG. 3 and FIG. 4 show the configuration of the speech recognition information creation / acoustic model creation / speech recognition device.
First, a speech recognition information creation method, an acoustic model creation method, a speech recognition method, and devices thereof will be described with reference to FIG.
In the STRAIGHT analysis unit 11 of the speech recognition information creation unit 10, the short-time Fourier transform unit 13 performs a short-time Fourier transform on the input speech. At the same time, the fundamental frequency estimation unit 12 determines whether the input speech is unvoiced or voiced and, if it is voiced, calculates (estimates) the fundamental frequency. Based on this information, the smoothed spectrum analysis unit 14 removes the influence of the fundamental frequency from the short-time Fourier transformed speech and converts it into a smoothed spectrum. This converted spectrum is called the STRAIGHT spectrum. Next, the logarithmic conversion unit 15 takes the logarithm of the STRAIGHT spectrum, and the frequency expansion / contraction discrete cosine transform unit 16 converts the result into a STRAIGHT mel cepstrum.
[0011]
At the time of learning the acoustic model, this STRAIGHT mel cepstrum is sent to the acoustic model learning unit 20. The acoustic model learning unit 20 learns a phoneme (or word) hidden Markov model (HMM) from the STRAIGHT mel cepstrum and the corresponding text (learning text). Next, the learned phoneme (or word) hidden Markov model (HMM) is stored in the storage unit 30.
At the time of recognition, the STRAIGHT mel cepstrum parameters are sent to the speech recognition unit 40 and compared with the Hidden Markov Models (HMMs) that were learned by the acoustic model learning unit and are held in the storage unit 30, and the text showing the highest likelihood is output. This realizes speech recognition.
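The comparison step above, in which the input is scored against each stored HMM and the most likely one is output, can be sketched as follows. This is a minimal illustration rather than the method of the specification: it assumes one-dimensional features, left-to-right models with a single Gaussian per state, and a shared self-loop probability. The names `log_likelihood`, `recognize`, and the `models` dictionary are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_add(a, b):
    """log(exp(a) + exp(b)), safe when either argument is -inf."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def log_likelihood(obs, model):
    """Forward algorithm for a left-to-right HMM that must end in its last state."""
    means, variances, stay = model["means"], model["vars"], model["stay"]
    n = len(means)
    alpha = [-math.inf] * n
    alpha[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new_alpha = []
        for j in range(n):
            score = alpha[j] + math.log(stay)          # stay in state j
            if j > 0:                                  # or arrive from state j-1
                score = log_add(score, alpha[j - 1] + math.log(1.0 - stay))
            new_alpha.append(score + log_gauss(x, means[j], variances[j]))
        alpha = new_alpha
    return alpha[-1]

def recognize(obs, models):
    """Return the label of the model with the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, models[label]))

# Hypothetical two-state models for two labels.
models = {
    "a": {"means": [0.0, 1.0], "vars": [1.0, 1.0], "stay": 0.5},
    "b": {"means": [5.0, 6.0], "vars": [1.0, 1.0], "stay": 0.5},
}
print(recognize([5.1, 5.0, 6.2, 5.9], models))
```

A practical recognizer would use vector-valued STRAIGHT mel cepstra and a search over connected phoneme or word models; the sketch only shows the maximum-likelihood selection.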
With the above method, it is not necessary to perform fundamental frequency estimation when generating the STRAIGHT spectrum. The speech recognition model learning method and speech recognition method at this time will be described with reference to FIG. The mel cepstrum sequence generated in this case is obtained by directly logarithmically transforming the spectrum obtained from the short-time Fourier transform and performing frequency expansion / contraction discrete cosine transform. Using this mel cepstrum sequence, learning of a Hidden Markov model (HMM) can also be realized as described above. Furthermore, smooth parameters can be generated from the Hidden Markov Model (HMM).
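The first half of this variant, a short-time Fourier transform followed directly by a logarithm, can be sketched as below. The sketch assumes a Hann window, a direct (slow) DFT, and a small floor to avoid taking the logarithm of zero; none of these choices are stated in the specification, and the function names are hypothetical.

```python
import cmath
import math

def log_power_spectrum(frame, floor=1e-12):
    """Hann-window one frame, take its DFT, and return log power for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.log(max(abs(acc) ** 2, floor)))
    return spectrum

def short_time_log_spectra(signal, frame_length, frame_shift):
    """Slide a window over the signal and analyse every full frame."""
    starts = range(0, len(signal) - frame_length + 1, frame_shift)
    return [log_power_spectrum(signal[s:s + frame_length]) for s in starts]

# A test tone that falls exactly on DFT bin 4 of a 32-sample frame.
signal = [math.cos(2.0 * math.pi * 4 * i / 32) for i in range(96)]
frames = short_time_log_spectra(signal, frame_length=32, frame_shift=16)
```

Each entry of `frames` would then be passed to the frequency expansion / contraction discrete cosine transform to obtain the mel cepstrum.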
[0012]
(Speech synthesis)
FIG. 5 shows the configuration of the speech synthesizer.
A speech synthesis method and apparatus for creating speech from text will be described with reference to FIG.
The Hidden Markov Model (HMM) stored in the HMM storage unit 63 is created in advance from a large amount of data by the learning method described above. In the speech synthesis information generating unit 60, the input text is first converted by the syntax analysis unit 62 into a phoneme (or word) sequence to which linguistic information is added. In the smoothing parameter generation unit 61, the phoneme (or word) Hidden Markov Models (HMMs) are connected according to this phoneme (or word) information, and a phoneme (or word) Hidden Markov Model (HMM) sequence for the input text is generated. If the input is a phoneme sequence, the syntax analysis unit does not perform syntax analysis; the Hidden Markov Models (HMMs) are connected according to the phoneme information to create the phoneme (or word) Hidden Markov Model (HMM) for the input. When the input is a Hidden Markov Model (HMM) state sequence, a Hidden Markov Model (HMM) state sequence is created instead of the phoneme (or word) Hidden Markov Model (HMM) sequence. The smoothing parameter generation unit further outputs a natural and smooth STRAIGHT mel cepstrum sequence from the phoneme (or word) Hidden Markov Model (HMM) sequence. This STRAIGHT mel cepstrum sequence is input to the speech synthesizer 50. The inverse frequency expansion / contraction discrete cosine transform unit 55 of the speech synthesizer 50 applies the inverse frequency expansion / contraction discrete cosine transform to the STRAIGHT mel cepstrum sequence, and the exponent conversion unit 54 then applies the exponential transform, converting it into a STRAIGHT spectrum sequence. The inverse FFT superposition synthesis unit 53 synthesizes speech from the STRAIGHT spectrum sequence and the signal generated by the pulse / noise generation unit 52.
In the above method, when the fundamental frequency estimation unit is not used in the speech recognition information creation unit, the smooth parameters generated from the Hidden Markov Model (HMM) are converted into a spectrum by applying the inverse frequency expansion / contraction discrete cosine transform and then the exponential transform. Thereafter, speech is synthesized by inverse FFT superposition synthesis from this spectrum sequence and the pulse / noise sequence generated by the pulse / noise generator.
Each component will be described in detail.
[0013]
(STRAIGHT analysis unit)
In the STRAIGHT analysis unit 11, the short-time Fourier transform unit 13 performs a short-time Fourier transform on the input speech. At the same time, the fundamental frequency estimation unit 12 determines whether the input speech is unvoiced or voiced and, if it is voiced, calculates the fundamental frequency. Based on this information, the smoothed spectrum analysis unit 14 removes the influence of the fundamental frequency from the short-time Fourier transformed speech and converts it into a smoothed spectrum. This converted spectrum is called the STRAIGHT spectrum. The STRAIGHT spectrum is then sent to the frequency expansion / contraction discrete cosine transform unit 16.
[0014]
(Frequency expansion / contraction discrete cosine transform unit)
The frequency expansion / contraction discrete cosine transform unit 16 takes the logarithm of the input STRAIGHT spectrum and applies the frequency expansion / contraction discrete cosine transform to the result. The kernel function of this transform is the real part of the frequency response of the filter defined by
[Expression 1]

Re[Ψ_m(ω)] forms a normalized orthogonal transform on {ω | 0 ≦ ω ≦ π}. α is a coefficient that determines the degree of frequency expansion and contraction. The degree of expansion and contraction can be obtained from
[Expression 2]

When α is 0, Re[Ψ_m(ω)] = cos(mω), and the transform is the discrete cosine transform. When α is between 0 and 1, Re[Ψ_m(ω)] is given through a weighting function that preserves orthogonality:
[Equation 3]

With the above conversion, the STRAIGHT spectrum is converted to the STRAIGHT mel cepstrum.
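The conversion in both directions can be sketched as follows. Because Expressions 1 to 3 are not reproduced in this text, the sketch makes its own assumptions: the warped frequency is taken to be that of a first-order all-pass filter, the kernel is cos(m·warp(ω)), and the slope of the warping function serves as the orthogonality-preserving weight. It does reduce to the ordinary cosine kernel cos(mω) at α = 0, as stated above. All function names and the value of `alpha` are illustrative.

```python
import math

def warp(omega, alpha):
    """Warped frequency; assumed first-order all-pass form, identity when alpha = 0."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

def warp_slope(omega, alpha):
    """Derivative of warp() with respect to omega, used as the weighting."""
    return (1.0 - alpha ** 2) / (1.0 - 2.0 * alpha * math.cos(omega) + alpha ** 2)

def grid(n):
    """Frequencies pi * (k + 0.5) / n for k = 0..n-1, strictly inside (0, pi)."""
    return [math.pi * (k + 0.5) / n for k in range(n)]

def warped_dct(log_spectrum, alpha, order):
    """Log spectrum sampled on grid(n) -> coefficients c_0..c_order."""
    n = len(log_spectrum)
    coeffs = []
    for m in range(order + 1):
        acc = sum(value * math.cos(m * warp(w, alpha)) * warp_slope(w, alpha)
                  for value, w in zip(log_spectrum, grid(n)))
        coeffs.append(acc * (1.0 if m == 0 else 2.0) / n)
    return coeffs

def inverse_warped_dct(coeffs, alpha, n):
    """Coefficients -> log spectrum on grid(n); apply math.exp to get the spectrum."""
    return [sum(c * math.cos(m * warp(w, alpha)) for m, c in enumerate(coeffs))
            for w in grid(n)]

alpha = 0.42                       # illustrative value only
target = [0.3, -1.0, 0.5, 0.2]     # arbitrary cepstral coefficients
log_spec = inverse_warped_dct(target, alpha, 256)
recovered = warped_dct(log_spec, alpha, 3)
```

In this round trip `recovered` matches `target` to within numerical error, which illustrates why the inverse transform is easy to obtain once the forward transform is orthogonal.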
[0015]
(Acoustic model learning unit)
At the time of acoustic model learning, this STRAIGHT mel cepstrum is input to the acoustic model learning unit 20. The acoustic model learning unit 20 learns a phoneme (or word) Hidden Markov Model (HMM) from the input STRAIGHT mel cepstrum by the EM (expectation-maximization) algorithm. In the above method, the fundamental frequency need not be estimated when the STRAIGHT spectrum is generated. In this case, the mel cepstrum sequence is obtained by applying the frequency expansion / contraction discrete cosine transform to the logarithm of the spectrum obtained from the short-time Fourier transform. The Hidden Markov Model (HMM) can be learned with this mel cepstrum sequence as described above, and speech recognition by the Hidden Markov Model (HMM) can also be realized.
[0016]
(Speech synthesis)
Speech synthesis using the Hidden Markov Model (HMM) created by the acoustic model learning unit is realized as follows. First, the input text is converted by the syntax analysis unit 62 into a phoneme (or word) sequence to which linguistic information is added. The phoneme (or word) Hidden Markov Models (HMMs) are connected according to this phoneme (or word) information, and a phoneme (or word) Hidden Markov Model (HMM) sequence for the input text is generated. If the input is a phoneme sequence, the syntax analysis unit does not perform syntax analysis; the Hidden Markov Models (HMMs) are connected according to the phoneme information to create the phoneme (or word) Hidden Markov Model (HMM) sequence for the input. When the input is a Hidden Markov Model (HMM) state sequence, a Hidden Markov Model (HMM) state sequence is created instead of the phoneme (or word) Hidden Markov Model (HMM) sequence. The smoothing parameter generation unit 61 outputs a natural and smooth mel cepstrum parameter sequence from the phoneme (or word) Hidden Markov Model (HMM) sequence. The smoothing method is described below.
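The connection of phoneme models into one sequence can be sketched as follows. The sketch assumes that each state holds the mean and variance of a single coefficient and that every state lasts a fixed number of frames; this passage does not say how state durations are chosen, so `frames_per_state`, `build_state_sequence`, and the example models are all hypothetical.

```python
def build_state_sequence(phonemes, models, frames_per_state):
    """Concatenate per-phoneme states and repeat each for a fixed duration."""
    sequence = []
    for phoneme in phonemes:
        for state in models[phoneme]:
            sequence.extend([state] * frames_per_state)
    return sequence

# Hypothetical two-state models: each state is (mean, variance) of one coefficient.
models = {
    "k": [(0.1, 0.02), (0.4, 0.03)],
    "a": [(1.2, 0.05), (1.0, 0.04)],
}
states = build_state_sequence(["k", "a"], models, frames_per_state=3)
means = [mean for mean, _ in states]          # piecewise-constant mean track
variances = [variance for _, variance in states]
```

The resulting mean track is piecewise constant, which is why a smoothing step is needed before synthesis.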
[0017]
Assume that the Hidden Markov Model (HMM) has been trained by the learning described above. Let S = [s₁, s₂, ..., s_T] be the time series of Gaussian distributions of the Hidden Markov Model (HMM). M = [μ₁, μ₂, ..., μ_T], ΔM = [Δμ₁, Δμ₂, ..., Δμ_T], and Δ²M = [Δ²μ₁, Δ²μ₂, ..., Δ²μ_T] are the vector time series of the mean values, along that Gaussian time series, of the HMM's STRAIGHT mel cepstrum, its first derivative (the ΔSTRAIGHT mel cepstrum), and its second derivative (the Δ²STRAIGHT mel cepstrum). Likewise, Σ = [Σ₁, Σ₂, ..., Σ_T], ΔΣ = [ΔΣ₁, ΔΣ₂, ..., ΔΣ_T], and Δ²Σ = [Δ²Σ₁, Δ²Σ₂, ..., Δ²Σ_T] are the time series of the covariance matrices of the HMM's STRAIGHT mel cepstrum, ΔSTRAIGHT mel cepstrum, and Δ²STRAIGHT mel cepstrum (diagonal covariance matrices are assumed). The STRAIGHT mel cepstrum C = [c₁, c₂, ..., c_T], the ΔSTRAIGHT mel cepstrum ΔC = [Δc₁, Δc₂, ..., Δc_T], and the Δ²STRAIGHT mel cepstrum Δ²C = [Δ²c₁, Δ²c₂, ..., Δ²c_T] are related by the constraints shown in (3) and (4) (several other constraints are conceivable, and the same result can be achieved with any of them).
[Expression 4]

Here, 2L + 1 is the window size, and b₀, b₁, and b₂ are fixed values determined by the window size. Using the method of Refs. [1] to [3], the mean time series of this Hidden Markov Model (HMM) is transformed to generate a smooth STRAIGHT mel cepstrum. The method is as follows. Assume that a Gaussian time series is given. For the given Gaussian time series, the time series of the STRAIGHT mel cepstrum is generated by choosing the C, ΔC, and Δ²C that maximize (5) under the constraints (3) and (4). This is done by substituting (3) and (4) into (5) to eliminate ΔC and Δ²C so that (5) is expressed in terms of C alone, and then finding the C that satisfies (6), which is obtained by partially differentiating that expression with respect to C. As a result, smooth STRAIGHT mel cepstrum coefficients are obtained.
[Equation 5]
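As an illustration only, the selection of C under delta constraints can be sketched in Python with NumPy for a single cepstral coefficient. Because expressions (3) to (6) are not reproduced in this text, the sketch assumes simple central-difference windows in place of the coefficients b₀, b₁ and b₂, treats frames outside the utterance as zero, and assumes diagonal covariances. The names delta_matrices and generate_smooth_track are hypothetical. Under these assumptions the Gaussian log-likelihood is quadratic in C, so its maximum is the solution of a linear system.

```python
import numpy as np

def delta_matrices(T):
    """Return T x T matrices that map C to C, delta-C and delta-delta-C.

    Assumed windows (not the patent's b0, b1, b2 coefficients):
      delta:        (c[t+1] - c[t-1]) / 2
      delta-delta:   c[t+1] - 2*c[t] + c[t-1]
    Frames outside [0, T-1] are treated as zero.
    """
    ident = np.eye(T)
    nxt = np.eye(T, k=1)    # row t picks c[t+1]
    prv = np.eye(T, k=-1)   # row t picks c[t-1]
    return ident, 0.5 * (nxt - prv), nxt - 2.0 * ident + prv

def generate_smooth_track(mu, var):
    """Maximum-likelihood trajectory for one coefficient.

    mu, var: arrays of shape (T, 3) holding the per-frame mean and variance
    of the static, delta and delta-delta streams.
    Returns the length-T sequence c that maximizes the Gaussian likelihood
    when delta and delta-delta are tied to c by the windows above.
    """
    T = mu.shape[0]
    A = np.zeros((T, T))
    b = np.zeros(T)
    for k, W in enumerate(delta_matrices(T)):
        precision = np.diag(1.0 / var[:, k])
        A += W.T @ precision @ W
        b += W.T @ precision @ mu[:, k]
    # Zero gradient of the log-likelihood: A c = b
    return np.linalg.solve(A, b)
```

For a vector-valued cepstrum with diagonal covariances, the same routine can be applied independently to each coefficient.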

In the above method, derivatives of the STRAIGHT mel cepstrum are used only up to the second order, but third- and fourth-order terms can also be introduced. Besides the above method, the mean sequence of the STRAIGHT mel cepstrum of the Hidden Markov Model (HMM) may instead be smoothed with a filter.
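The filter-based alternative is not specified further in the text. A minimal sketch, assuming a plain moving-average filter with an odd width and edge padding (both are choices made here for illustration, as is the name smooth_means), might look as follows.

```python
import numpy as np

def smooth_means(means, width=5):
    """Moving-average smoothing of an HMM mean sequence along time.

    means: array of shape (T, D), one mean vector per frame.
    width: odd filter length (an assumed value, not from the patent).
    """
    pad = width // 2
    # Repeat the first and last frame so the output keeps T frames
    padded = np.pad(means, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(width) / width
    columns = [np.convolve(padded[:, d], kernel, mode="valid")
               for d in range(means.shape[1])]
    return np.stack(columns, axis=1)
```

Unlike the likelihood-based method, such a filter ignores the variances and the derivative statistics of the model.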
[0018]
The smooth STRAIGHT mel cepstrum sequence created in this way is input to the inverse frequency expansion / contraction discrete cosine transform unit 55, which applies the inverse frequency expansion / contraction discrete cosine transform. Since the frequency expansion / contraction discrete cosine transform is an orthogonal transformation, its inverse can easily be calculated from the forward transform. By performing this inverse transform and then applying exponential conversion, the STRAIGHT mel cepstrum sequence is converted into a STRAIGHT spectrum sequence. The STRAIGHT speech synthesis unit 51 synthesizes speech from the STRAIGHT spectrum sequence and the pulse / noise sequence generated by the pulse / noise sequence generation unit 52, by inverse FFT and superposition synthesis.
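The conversion from cepstrum back to spectrum can be sketched as follows. The frequency-warped transform Re[Ψ_m(z)] is not reproduced in this text, so the sketch substitutes an ordinary orthonormal DCT-II as a stand-in. It only illustrates the point that the inverse of an orthogonal transform is its transpose. The function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, used here as a stand-in for the
    patent's frequency expansion / contraction transform."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * j + 1) * k / (2 * n))
    D[0, :] /= np.sqrt(2.0)
    return D

def cepstrum_to_spectrum(cep, n_bins):
    """cep: (T, M) cepstra with M <= n_bins. Returns (T, n_bins) spectra."""
    D = dct_matrix(n_bins)
    full = np.zeros((cep.shape[0], n_bins))
    full[:, :cep.shape[1]] = cep   # zero-pad the truncated higher orders
    # Forward transform (row form) is x @ D.T, so the inverse is c @ D
    log_spec = full @ D
    return np.exp(log_spec)        # undo the logarithmic conversion
```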
[0019]
(Pulse / Noise sequence generator)
Next, the pulse / noise sequence generation unit 52 will be described. First, a basic pulse / noise sequence generation method is shown, in which the fundamental frequency extracted from speech uttered by a speaker is used as it is. This technique is shown in FIG. In this method, the fundamental frequency is estimated directly from the input speech and is used for synthesis.
FIG. 7 shows the configuration of the fundamental frequency model learning unit, and FIG. 8 shows the configuration of the pulse / noise sequence generation unit when the Hidden Markov Model (HMM) is used.
The pulse / noise sequence generation unit 52 for the case where the Hidden Markov Model (HMM) is used will be described with reference to FIGS.
FIG. 7 shows the method of training the Hidden Markov Model (HMM) used for pulse / noise generation. The fundamental frequency estimation unit 52-3 decides whether the input speech is unvoiced or voiced and outputs the result. For voiced speech, the fundamental frequency is also output. This output is sent to the fundamental frequency model learning unit 52-2. Using the fundamental frequencies, their first and second differential coefficients, the unvoiced / voiced information and the corresponding learning text, the means and variances of the fundamental frequency and of its first and second differential coefficients, together with the unvoiced / voiced information, are learned with the EM algorithm on the structure of a phoneme (or word) fundamental frequency Hidden Markov Model (HMM). The learned fundamental frequency Hidden Markov Model (HMM) is then stored in the storage unit 52-4.
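The observations passed to the fundamental frequency model learning unit can be pictured with the following sketch. It assumes that unvoiced frames are marked by a fundamental frequency of 0, uses central differences for the differential coefficients, and marks values that straddle an unvoiced frame as undefined (NaN). None of these conventions is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def f0_observations(f0):
    """Build per-frame training observations from an F0 contour.

    f0: length-T sequence in Hz, with 0 marking unvoiced frames (assumed).
    Returns (features, voiced):
      features: (T, 3) array of [f0, delta, delta-delta], NaN where undefined
      voiced:   (T,) boolean voiced / unvoiced flags
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    x = np.where(voiced, f0, np.nan)
    prev = np.concatenate([[np.nan], x[:-1]])
    nxt = np.concatenate([x[1:], [np.nan]])
    delta = 0.5 * (nxt - prev)          # first differential coefficient
    delta2 = nxt - 2.0 * x + prev       # second differential coefficient
    return np.stack([x, delta, delta2], axis=1), voiced
```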
[0020]
FIG. 8 shows pulse / noise synthesis when the Hidden Markov Model (HMM) is used. First, the syntax analysis unit 52-8 converts the input text into a phoneme (or word) sequence to which linguistic information is added. The phoneme (or word) fundamental frequency Hidden Markov Models (HMMs) are concatenated according to this phoneme (or word) information, generating a sequence of phoneme (or word) fundamental frequency Hidden Markov Models (HMMs) for the input text. If the input is a phoneme sequence, the syntax analysis unit does not perform parsing, and the phoneme (or word) fundamental frequency Hidden Markov Models (HMMs) are concatenated according to the phoneme information to create the sequence for the input. If the input is a Hidden Markov Model (HMM) state sequence, a state sequence of the fundamental frequency Hidden Markov Model (HMM) is created instead of a sequence of phoneme (or word) fundamental frequency Hidden Markov Models (HMMs). The smoothing pulse / noise generation unit 52-7 applies to the fundamental frequency Hidden Markov Model (HMM) sequence for the input the same smoothing method as that used to generate the smooth STRAIGHT mel cepstrum, outputs a smooth fundamental frequency sequence, and converts it into pulse information. However, where the Hidden Markov Model (HMM) is in an unvoiced state, smoothing is not performed and noise is generated.
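The final conversion from a fundamental frequency sequence to a pulse / noise sequence can be sketched as below. The sampling rate, the frame shift, the pulse amplitude scaling and the name make_excitation are all assumptions made for illustration. The text states only that voiced frames yield pulses and unvoiced frames yield noise.

```python
import numpy as np

def make_excitation(f0, fs=16000, hop=80, seed=0):
    """Generate a pulse / noise excitation signal.

    f0:  per-frame fundamental frequency in Hz, 0 for unvoiced frames.
    fs:  sampling rate in Hz (assumed).
    hop: frame shift in samples (assumed).
    """
    rng = np.random.default_rng(seed)
    out = np.zeros(len(f0) * hop)
    next_pulse = 0.0  # sample position of the next pitch pulse
    for i, f in enumerate(f0):
        start, end = i * hop, (i + 1) * hop
        if f <= 0:
            # Unvoiced frame: white noise with unit variance
            out[start:end] = rng.standard_normal(hop)
            next_pulse = float(end)
            continue
        period = fs / f
        next_pulse = max(next_pulse, float(start))
        while next_pulse < end:
            # Scale each pulse so the pulse train has about unit power
            out[int(next_pulse)] = np.sqrt(period)
            next_pulse += period
    return out
```

The pulse position is carried across frame boundaries so that the pitch period stays continuous within a voiced segment.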
[0021]
In the above method, when the fundamental frequency estimation unit is not used in the speech recognition information creation unit, the smooth parameters generated from the Hidden Markov Model (HMM) are subjected to the inverse frequency expansion / contraction discrete cosine transform and then to exponential conversion, which converts them into a spectrum. Speech is then synthesized from the spectrum sequence and the pulse / noise sequence generated by the pulse / noise sequence generation unit, by inverse FFT and superposition synthesis. The pulse / noise sequence generation method described above can be used to generate this pulse / noise sequence.
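Synthesis by inverse FFT and superposition can be illustrated with the simplified sketch below. It shapes each windowed excitation frame with the corresponding magnitude spectrum (zero-phase filtering) and overlap-adds the results. This is not the actual STRAIGHT synthesis algorithm. The Hanning window, the frame shift and the function name are assumptions.

```python
import numpy as np

def overlap_add_synthesis(spectra, excitation, hop=80):
    """Simplified spectrum-shaped overlap-add synthesis.

    spectra:    (T, n_fft // 2 + 1) magnitude spectra, one row per frame.
    excitation: pulse / noise signal with at least T * hop samples.
    hop:        frame shift in samples (assumed).
    """
    T, n_bins = spectra.shape
    n_fft = 2 * (n_bins - 1)
    window = np.hanning(n_fft)
    exc = np.concatenate([excitation, np.zeros(n_fft)])
    out = np.zeros(T * hop + n_fft)
    for t in range(T):
        seg = exc[t * hop : t * hop + n_fft] * window
        # Multiply in the frequency domain, then return to the time domain
        shaped = np.fft.irfft(np.fft.rfft(seg) * spectra[t], n=n_fft)
        out[t * hop : t * hop + n_fft] += shaped   # superposition
    return out[: T * hop]
```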
[0022]
The speech recognition information creation / acoustic model creation / speech recognition device and the speech synthesizer described above can each be composed of a computer having a CPU, a memory and the like, a terminal used by the user who accesses it, and a recording medium. The recording medium is a machine-readable recording medium such as a CD-ROM, a magnetic disk device or a semiconductor memory. The control program recorded on the recording medium is read by the computer and controls its operation, thereby realizing each of the components described above on the computer.
[0023]
[Effect of the invention]
As described above, according to the present invention, the STRAIGHT synthesis method is introduced into a speech synthesis system using the Hidden Markov Model (HMM), and the same parameters are used for speech recognition and for speech synthesis from text. This makes it possible to realize a high-quality and simple system configuration while retaining the advantages of the conventional synthesis method using the Hidden Markov Model (HMM).
Further, according to the present invention, a mel cepstrum obtained by logarithmically transforming a short-time Fourier transform spectrum and applying a discrete cosine transform is used as the parameter of a speech synthesis system using the Hidden Markov Model (HMM). By using the same parameters for speech recognition and for speech synthesis from text, a high-quality and simple system configuration can be realized while retaining the advantages of the conventional synthesis method using the Hidden Markov Model (HMM).
[Brief description of the drawings]
FIG. 1 is a configuration diagram of an acoustic model learning and speech recognition apparatus using a conventional Hidden Markov model (HMM).
FIG. 2 is a configuration diagram of a speech synthesizer using a conventional Hidden Markov model (HMM).
FIG. 3 is a configuration diagram of a speech recognition information creation / acoustic model creation / speech recognition apparatus according to an embodiment of the present invention.
FIG. 4 is a configuration diagram of a speech recognition information creation / acoustic model creation / speech recognition apparatus according to another embodiment of the present invention.
FIG. 5 is a configuration diagram of a speech synthesizer that creates speech from text according to an embodiment of the present invention.
FIG. 6 is an explanatory diagram of a pulse / noise sequence generation unit.
FIG. 7 is a configuration diagram of a fundamental frequency model learning unit.
FIG. 8 is a configuration diagram of a pulse / noise sequence generator when a Hidden Markov Model (HMM) is used.
[Explanation of symbols]
10 ... Speech recognition information creation unit
11 ... STRAIGHT analysis unit
12 ... Fundamental frequency estimation unit
13 ... Short-time Fourier transform unit
14 ... Smoothed spectrum analysis unit
15 ... Logarithmic conversion unit
16 ... Frequency expansion / contraction discrete cosine transform unit
20 ... Acoustic model learning unit
30 ... HMM (acoustic model) storage unit
40 ... Speech recognition unit
50 ... Speech synthesis unit
51 ... STRAIGHT speech synthesis unit
52 ... Pulse / noise sequence generation unit
53 ... Inverse FFT superposition synthesis unit
54 ... Exponential conversion unit
55 ... Inverse frequency expansion / contraction discrete cosine transform unit
60 ... Speech synthesis information creation unit
61 ... Smoothing parameter generation unit
62 ... Syntax analysis unit
63 ... HMM (acoustic model) storage unit

Claims

A speech synthesis method comprising:
a step of parsing input text;
a step of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a STRAIGHT mel cepstrum and the corresponding learning text, the STRAIGHT mel cepstrum being obtained by logarithmically transforming the STRAIGHT spectrum of learning speech and applying to the result the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a step of creating a smooth STRAIGHT mel cepstrum sequence by selecting, from among STRAIGHT mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the STRAIGHT mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a step of creating a STRAIGHT spectrum sequence by applying to the smooth STRAIGHT mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a step of performing speech synthesis, by inverse FFT and superposition synthesis, from the STRAIGHT spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A speech synthesis method comprising:
a step of parsing input text;
a step of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a mel cepstrum and the corresponding learning text, the mel cepstrum being obtained by logarithmically transforming the short-time Fourier transform of learning speech and then applying the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a step of creating a smooth mel cepstrum sequence by selecting, from among mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a step of creating a spectrum sequence by applying to the smooth mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a step of performing speech synthesis, by inverse FFT and superposition synthesis, from the spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A speech synthesizer comprising:
means for parsing input text;
means for creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a STRAIGHT mel cepstrum and the corresponding learning text, the STRAIGHT mel cepstrum being obtained by logarithmically transforming the STRAIGHT spectrum of learning speech and applying to the result the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
means for creating a smooth STRAIGHT mel cepstrum sequence by selecting, from among STRAIGHT mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the STRAIGHT mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
means for creating a STRAIGHT spectrum sequence by applying to the smooth STRAIGHT mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
means for performing speech synthesis, by inverse FFT and superposition synthesis, from the STRAIGHT spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A speech synthesizer comprising:
means for parsing input text;
means for creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a mel cepstrum and the corresponding learning text, the mel cepstrum being obtained by logarithmically transforming the short-time Fourier transform of learning speech and then applying the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
means for creating a smooth mel cepstrum sequence by selecting, from among mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
means for creating a spectrum sequence by applying to the smooth mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
means for performing speech synthesis, by inverse FFT and superposition synthesis, from the spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A program for causing a computer to execute a speech synthesis method comprising:
a process of parsing input text;
a process of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a STRAIGHT mel cepstrum and the corresponding learning text, the STRAIGHT mel cepstrum being obtained by logarithmically transforming the STRAIGHT spectrum of learning speech and applying to the result the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a process of creating a smooth STRAIGHT mel cepstrum sequence by selecting, from among STRAIGHT mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the STRAIGHT mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a process of creating a STRAIGHT spectrum sequence by applying to the smooth STRAIGHT mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a process of performing speech synthesis, by inverse FFT and superposition synthesis, from the STRAIGHT spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A program for causing a computer to execute a speech synthesis method comprising:
a process of parsing input text;
a process of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a mel cepstrum and the corresponding learning text, the mel cepstrum being obtained by logarithmically transforming the short-time Fourier transform of learning speech and then applying the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a process of creating a smooth mel cepstrum sequence by selecting, from among mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a process of creating a spectrum sequence by applying to the smooth mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a process of performing speech synthesis, by inverse FFT and superposition synthesis, from the spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A recording medium on which is recorded a program for causing a computer to execute a speech synthesis method comprising:
a process of parsing input text;
a process of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a STRAIGHT mel cepstrum and the corresponding learning text, the STRAIGHT mel cepstrum being obtained by logarithmically transforming the STRAIGHT spectrum of learning speech and applying to the result the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a process of creating a smooth STRAIGHT mel cepstrum sequence by selecting, from among STRAIGHT mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the STRAIGHT mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a process of creating a STRAIGHT spectrum sequence by applying to the smooth STRAIGHT mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a process of performing speech synthesis, by inverse FFT and superposition synthesis, from the STRAIGHT spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.

A recording medium on which is recorded a program for causing a computer to execute a speech synthesis method comprising:
a process of parsing input text;
a process of creating a Hidden Markov Model (HMM) sequence by acquiring, according to the result of the parsing, Hidden Markov Models (HMMs) from a storage device storing Hidden Markov Models (HMMs) trained on the basis of a mel cepstrum and the corresponding learning text, the mel cepstrum being obtained by logarithmically transforming the short-time Fourier transform of learning speech and then applying the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] (where Re[Ψ_m(z)] is the real part of Ψ_m(z), and Ψ_m(z) is

);
a process of creating a smooth mel cepstrum sequence by selecting, from among mel cepstrum sequences, the sequence that maximizes the likelihood of the Hidden Markov Model (HMM) sequence under the constraint that a difference approximation holds between the mel cepstrum sequence and its first-order, ..., N-th-order differential coefficients (N is an integer of 2 or more);
a process of creating a spectrum sequence by applying to the smooth mel cepstrum sequence the inverse of the frequency expansion / contraction discrete cosine transform Re[Ψ_m(z)] and then applying exponential conversion; and
a process of performing speech synthesis, by inverse FFT and superposition synthesis, from the spectrum sequence and a pulse / noise sequence generated by a fundamental frequency Hidden Markov Model.