JP3967571B2 - Sound source waveform generation device, speech synthesizer, sound source waveform generation method and program

Info

Publication number: JP3967571B2
Application number: JP2001278292A
Authority: JP
Inventors: Jordi Bonada; Yuji Hisaminato
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2001-09-13
Filing date: 2001-09-13
Publication date: 2007-08-29
Anticipated expiration: 2021-09-13
Also published as: JP2003084798A

Description

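[0001]
[Technical Field of the Invention]
The present invention relates to a speech synthesizer that synthesizes a speech waveform from character data or the like, to a sound source waveform generation device that generates the sound source waveform used in speech synthesis by that synthesizer, to a sound source waveform generation method, and to a program for performing sound source waveform generation processing.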
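[0002]
[Prior Art]
Various speech synthesis techniques for producing speech artificially have been proposed. These techniques regard speech production as a combination of two processes, generation of a vocal cord sound source and articulation by the vocal tract, and realize speech synthesis by approximating both as engineering models. That is, a vocal cord sound source waveform is generated, and a speech waveform is produced by applying to it filtering or similar processing that corresponds to articulation by the vocal tract. For example, a speech synthesizer based on linear predictive analysis (LPC: Linear Predictive Coding) comprises a sound source waveform generator, consisting of a voiced source and an unvoiced source, and a vocal tract filter. The vocal tract filter filters the pulse waveform emitted by the generator according to the content to be uttered, thereby synthesizing the corresponding speech waveform.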
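[0003]
[Problems to Be Solved by the Invention]
The sound source waveform generator used in such speech synthesis generates, frame by frame, a pulse waveform that faithfully follows the pitch information and gain information for the speech to be produced, and outputs it to the vocal tract filter as the sound source waveform. However, a source waveform that faithfully follows the pitch information and so on is uniform, so speech generated from it is likely to lose the naturalness of speech produced by a real person. The vocal cord waveform of real human speech contains irregular, subtle fluctuations in pitch, gain and the like, and depending on the manner and degree of these fluctuations the voice may be described as "hoarse" or "gravelly". (Even a voice that is not described as hoarse has subtle fluctuations; human voices can be said to contain some degree of hoarseness, differing from voice to voice.) In contrast, a source waveform that is uniform within each frame, with no fluctuation in pitch or gain inside the frame, lacks the elements (subtle fluctuations and so on) that give an impression of naturalness. Speech generated from such a waveform is therefore likely to sound unnatural.
[0004]
One conceivable way to address this loss of naturalness is to use random numbers or the like to add pitch and gain fluctuations to the uniform pulse waveform produced by the generator. Even so, such fluctuations are artificial. They do not necessarily give naturalness to the speech generated from the resulting source waveform, and may instead make it sound more unnatural.
[0005]
The present invention has been made in view of these circumstances. Its object is to provide a speech synthesizer that synthesizes speech giving the listener a more natural impression, a sound source waveform generation device for use in the speech synthesizer, a sound source waveform generation method, and a program.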
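[0006]
[Means for Solving the Problems]
To solve the above problem, a sound source waveform generation device according to the present invention is a device that generates a sound source waveform used in synthesizing a speech waveform, comprising: storage means that stores, for each preset pitch range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, reads from the storage means, within the range corresponding to the input pitch information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.
Another sound source waveform generation device according to the present invention is a device that generates a sound source waveform used in synthesizing a speech waveform, comprising: storage means that stores, for each preset gain range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, reads from the storage means, within the range corresponding to the input gain information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.
[0007]
With this configuration, the source waveform generated for speech synthesis is not one that faithfully follows the pitch information and gain information input according to the utterance content, but one corrected on the basis of correction information stored in advance in the storage means. Information for correcting pitch and gain, created beforehand from an analysis of speech actually produced by a person, can therefore be stored in the storage means, and using that information makes it possible to generate a source waveform containing more natural pitch and gain fluctuations.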
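[0008]
A speech synthesizer according to the present invention comprises: storage means that stores, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; information acquisition means that acquires pitch information, gain information and hoarseness degree information on the basis of the speech content to be uttered; waveform generation means that reads from the storage means, within the range corresponding to the pitch information or gain information acquired by the information acquisition means, the correction information corresponding to the hoarseness degree information, and generates a sound source waveform by correcting the acquired pitch information and gain information on the basis of the read correction information; and synthesis means that synthesizes a speech waveform by applying, to the sound source waveform generated by the waveform generation means, filtering according to the speech content to be uttered.
[0009]
A sound source waveform generation method according to the present invention is a method of generating a sound source waveform used in synthesizing a speech waveform, in which: a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech, are stored in storage means for each preset pitch range or each preset level range; pitch information, gain information and hoarseness degree information for the waveform to be generated are input; the correction information corresponding to the input hoarseness degree information is read from the storage means within the range corresponding to the input pitch information or gain information; and a sound source waveform is generated by correcting the input pitch information and gain information on the basis of the read correction information.
[0010]
A program according to the present invention causes a computer to function as: means for reading correction information from storage means that stores, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, has the reading means read, within the range corresponding to the input pitch information or gain information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.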
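[0011]
[Embodiments of the Invention]
Embodiments of the present invention are described below with reference to the drawings.
A. Speech Synthesizer
FIG. 1 is a block diagram showing the configuration of a speech synthesizer equipped with a sound source waveform generation device according to one embodiment of the present invention. As shown in the figure, the speech synthesizer 100 of this embodiment uses LPC synthesis and comprises a speech segment database (DB) 10, a parameter determination unit 20, a sound source waveform generation device 30 and a synthesis filter 40. The sound source waveform generation device 30 is not limited to LPC-based synthesizers; it can be applied to any speech synthesizer that has a sound source waveform generator as a component.
[0012]
In the speech synthesizer 100, document data representing the words to be uttered is input to the parameter determination unit 20. The parameter determination unit 20 analyzes the input document data and obtains results such as word readings and accents. Referring to the contents of the speech segment database 10, it determines from these results, frame by frame, the parameters used to synthesize the speech waveform. One frame is, for example, 5 msec or 10 msec. At each frame interval the unit obtains the feature parameters corresponding to the analysis result and outputs them to the sound source waveform generation device 30 and the synthesis filter 40.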
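[0013]
The speech segment database 10 stores, for each phoneme, or for each phoneme and each phoneme chain consisting of several phonemes, various feature parameters relating to the vocal cord source waveform and to articulation by the vocal tract, obtained in advance by analyzing speech waveforms produced by a person. The feature parameters relating to the vocal cord source waveform are pitch information and gain information; those relating to articulation include vocal tract characteristic information. On the basis of the document analysis result, the parameter determination unit 20 retrieves, from the feature parameters stored for the many phonemes and phoneme chains, those stored for the specific ones required. It outputs the retrieved pitch information and gain information to the sound source waveform generation device 30 and the vocal tract characteristic information to the synthesis filter 40. It also distinguishes voiced sections from unvoiced sections according to the analysis result and outputs voiced/unvoiced information, indicating which kind of section applies, to the sound source waveform generation device 30.
[0014]
In this embodiment, as shown in FIG. 2, the document data supplied to the parameter determination unit 20 contains, for each predetermined unit of the document such as a phoneme, word or phrase (per word in the illustrated example), hoarseness data indicating how hoarse the uttered voice should be. The parameter determination unit 20 refers to this data and outputs hoarseness degree information H (correction degree information) to the sound source waveform generation device 30. In other words, at each frame interval the parameter determination unit 20 outputs pitch information P, gain information G, hoarseness degree information H and vocal tract characteristic information to the sound source waveform generation device 30 and the synthesis filter 40 according to the speech to be uttered.
[0015]
Returning to FIG. 1, the sound source waveform generation device 30 generates a source waveform on the basis of the pitch information P, gain information G, hoarseness degree information H and voiced/unvoiced information supplied from the parameter determination unit 20 at each frame interval. The device is described in detail later.
[0016]
The synthesis filter 40 applies to the source waveform output from the sound source waveform generation device 30 a filtering process that depends on the vocal tract characteristic information supplied from the parameter determination unit 20, and outputs the result as the speech waveform. This waveform is supplied to a loudspeaker via a D/A converter and an amplifier, so that artificial speech is produced. The speech synthesizer of this embodiment is characterized by the sound source waveform generation device 30; the synthesis filter 40 and the other parts are the same as in a conventional LPC-based speech synthesizer, so their detailed description is omitted.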
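[0017]
B. Sound Source Waveform Generation Device
The overall configuration of the speech synthesizer 100 is as described above. The sound source waveform generation device 30 is now described in detail. As shown in FIG. 1, it has an unvoiced source waveform generator 31 and a voiced source waveform generator 32, and generates a source waveform on the basis of the pitch information P, gain information G, hoarseness degree information H and voiced/unvoiced information supplied from the parameter determination unit 20 at each frame interval.
[0018]
The unvoiced source waveform generator 31 provides the source waveform to be output from the sound source waveform generation device 30 when the voiced/unvoiced information supplied from the parameter determination unit 20 indicates an unvoiced section. Specifically, it outputs a white noise waveform whose gain follows the gain information G supplied from the parameter determination unit 20.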
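The unvoiced branch can be sketched in a few lines. This is only an illustration under assumptions that the text does not state: the gain information G is taken to be in dB, the noise is Gaussian, and the function name `unvoiced_source` and its `seed` parameter are hypothetical.

```python
import random


def unvoiced_source(gain_db, n_samples, seed=None):
    """Return n_samples of white noise scaled by a gain given in dB.

    Assumption: G is expressed in dB and is converted to a linear
    amplitude factor; the patent only says the noise gain follows G.
    """
    rng = random.Random(seed)
    amplitude = 10.0 ** (gain_db / 20.0)
    return [amplitude * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
```

With a fixed `seed` the output is repeatable, which makes it easy to compare frames that differ only in gain.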
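[0019]
The voiced source waveform generator 32 generates the source waveform output from the sound source waveform generation device 30 in voiced sections. It has a pulse generation unit 33, a correction information database 34 and a continuous-to-discrete conversion unit 35. The pulse generation unit 33 generates a pulse waveform according to the pitch information P and gain information G supplied from the parameter determination unit 20 at each frame interval. In this embodiment the pulse generation unit 33 does not generate a pulse waveform that faithfully follows P and G. Instead it refers to the correction information database 34 and, on the basis of the hoarseness degree information H supplied from the parameter determination unit 20, corrects the pulse waveform that follows P and G.
[0020]
The correction information database 34 stores the information (correction information) that the pulse generation unit 33 uses to generate the pulse waveform. As shown in FIG. 3, it stores a correction template for each of several hoarseness values, "0.2", "0.4", "0.6", "0.8" and "1.0" in the illustrated example. Each template contains Δgain information, Δgain mean information, Δpitch information and template time information.
[0021]
The Δgain information is used to correct individually the gain of each pulse in the pulse waveform that follows the pitch information P and gain information G supplied from the parameter determination unit 20. Specifically, it gives the gain correction amount (dB) for each pulse. Many gain correction amounts are provided, each stored in association with an elapsed time t from the start of the frame. Thus the gain of a pulse generated at time t1 after the start of a frame is corrected by the amount associated with t1, the gain of a pulse generated at time t2 is corrected by the amount associated with t2, and so on.
The Δgain mean information is the mean of these gain correction amounts.
[0022]
The Δpitch information is used to correct the time intervals at which pulses are generated according to the pitch information P supplied from the parameter determination unit 20. More specifically, when P is expressed as a frequency F and pulses are generated at intervals that faithfully follow P, each pulse is generated at an interval equal to the reciprocal of F. The Δpitch information corrects the generation timing of each pulse that would be generated at these (1/F) intervals. Like the Δgain information, many Δpitch values are provided, each stored in association with an elapsed time t from the start of the frame, that is, as a sequence. Thus the generation timing of the pulse that would be generated at time t1 in a frame if P were followed faithfully is corrected by the amount associated with t1 in the sequence, the timing of the pulse that would be generated at time t2 is corrected by the amount associated with t2, and so on. The Δpitch information may be stored as time values by which the generation timing is shifted, as frequency values given by the reciprocal of those times, or as those frequencies converted to cents. In the following description the Δpitch information is assumed to represent frequency.
[0023]
The template time information gives the duration of the template provided for each hoarseness value. As described above, each gain correction amount in the Δgain information and each pitch correction amount in the Δpitch information is associated with a time t, and t can take values only within the range given by the template time information. In the example of FIG. 3, for the template with hoarseness "0.2", 0 ≤ t ≤ T1. The template durations, that is, the template time information, may differ between hoarseness degrees or may all be the same.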
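A template as described in paragraphs [0020] to [0023] might be held in memory as follows. This is a sketch, not the patented data layout: the class name `Template`, its field names and the lookup rule (use the latest stored entry whose time does not exceed t) are assumptions; the text only says that each correction amount is associated with an elapsed time t.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Template:
    hoarseness: float   # hoarseness value, e.g. 0.2, 0.4, ...
    times: list         # elapsed times t in seconds, ascending
    d_gain_db: list     # delta-gain per time entry (dB)
    d_pitch_hz: list    # delta-pitch per time entry (frequency)
    duration: float     # template time information (seconds)

    @property
    def d_gain_mean(self):
        """Delta-gain mean information: mean of the gain corrections."""
        return sum(self.d_gain_db) / len(self.d_gain_db)

    def _index(self, t):
        # Assumed rule: pick the latest entry at or before time t.
        return max(bisect_right(self.times, t) - 1, 0)

    def d_gain(self, t):
        return self.d_gain_db[self._index(t)]

    def d_pitch(self, t):
        return self.d_pitch_hz[self._index(t)]
```

A database in the sense of FIG. 3 would then be a mapping from hoarseness value to `Template`, for example `{0.2: Template(...), 0.4: Template(...)}`.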
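[0024]
The hoarseness value indicates the degree of hoarseness: the larger the value, the hoarser the voice. A template containing Δgain information, Δgain mean information, Δpitch information and template time information is provided for each such value.
[0025]
Next, the way these templates are created is described. First, a speaker utters a given phoneme at a constant pitch and intensity for a fixed length of time, and the utterance is recorded. This is done five times with voices of differing hoarseness (different speakers may be used), and the templates for the different hoarseness degrees are created from the respective recordings.
[0026]
Because a template models the source waveform emitted by the sound source waveform generation device 30, that is, the human vocal cord waveform (the waveform before articulation by the vocal tract), the vocal cord waveform is extracted from each recording and its subtle fluctuations in pitch, gain and so on are analyzed. Known methods, such as a Kalman filter based on an LPC model, can be used to extract the vocal cord waveform from the recorded speech waveform. On the basis of the analysis, a template is created containing Δpitch information, Δgain information and so on that can reproduce the subtle pitch and gain fluctuations found. More specifically, the mean pitch and mean gain of the vocal cord waveform are computed, and the Δpitch information and Δgain information are obtained as the differences from these means. Doing this for each of the five recordings yields the five templates, one per hoarseness degree, that are stored in the correction information database 34.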
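The last step of paragraph [0026], taking differences from the mean, can be illustrated as below. The sketch assumes that per-pulse pitch and gain measurements of the extracted vocal cord waveform are already available as two lists; how they are measured (Kalman filter, electrodes) is outside its scope, and the function name `build_deltas` is hypothetical.

```python
def build_deltas(pitches_hz, gains_db):
    """Return (delta_pitch, delta_gain) as deviations from the means.

    pitches_hz and gains_db are assumed to be per-pulse measurements
    taken from one recording of a sustained phoneme.
    """
    mean_pitch = sum(pitches_hz) / len(pitches_hz)
    mean_gain = sum(gains_db) / len(gains_db)
    d_pitch = [p - mean_pitch for p in pitches_hz]
    d_gain = [g - mean_gain for g in gains_db]
    return d_pitch, d_gain
```

For example, `build_deltas([100, 102, 98], [60, 61, 59])` gives pitch deviations of 0, +2 and -2 and gain deviations of 0, +1 and -1.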
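[0027]
The correction information database 34 stores templates for the hoarseness values "0.2", "0.4", "0.6", "0.8" and "1.0". The template for "0.2" is created from the analysis of the voice judged to be the least hoarse of the five utterances, and the template for "1.0" from the analysis of the voice judged to be the hoarsest.
[0028]
Templates can also be created by a method other than the one based on recorded speech. In the method above, the vocal cord waveform is extracted from the speech waveform using a Kalman filter or the like. Alternatively, electrodes may be placed on the speaker's larynx, and the vibration detected by the electrodes during the five utterances of differing hoarseness may be taken as the vocal cord waveform. More specifically, a weak current is passed through the electrodes during utterance, the vibration of the vocal cords is detected from the change in resistance that accompanies their opening and closing, and the vocal cord waveform obtained in this way is analyzed to create the templates.
[0029]
The information stored in the correction information database 34 is created as described above. On the basis of this information, the pulse generation unit 33 corrects the pulse waveform that faithfully follows the pitch information P and gain information G supplied from the parameter determination unit 20, and outputs the corrected waveform to the continuous-to-discrete conversion unit 35.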
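[0030]
Next, with reference to FIG. 4, the source waveform generation process performed by the pulse generation unit 33, including the correction based on the information stored in the correction information database 34, is described. The pulse generation unit 33 generates a provisional pulse train on the basis of the pitch information P and gain information G supplied from the parameter determination unit 20 at each frame interval. For example, if f1 is supplied as the pitch information P and g1 as the gain information G for the first frame A, and f2 and g2 are supplied for the next frame B, the pulse generation unit 33 generates the provisional pulse train shown in FIG. 4.
[0031]
As shown in the upper part of the figure, for the first frame A, pulses of gain g1 are generated at equal time intervals a, faithfully following the pitch information P (= f1) and gain information G (= g1). Here a = 1/f1.
[0032]
For frame B, which follows frame A, a pulse train that faithfully follows the pitch information P and gain information G could be generated in the same way. In this embodiment, however, in order to smooth the pitch variation between frames and produce a more natural speech waveform, the provisional pulse train for every frame other than the first is generated using a pitch obtained by linear interpolation between the pitch of the previous frame and the pitch of the current frame.
[0033]
More specifically, let LT be the time at which the last pulse of the previous frame (here, frame A) was generated, t the current time, and Tf the difference between the time of the end of the current frame and LT. The pulse generation interval dT in the following frame B is then given by the equation below.
[Equation 1]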

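[0034]
Each time a pulse is generated in frame B, the interval dT from the previous pulse is obtained from the above equation, and the next pulse is generated dT after the previous one. The result is the pulse train shown in the upper part of FIG. 4. In the illustrated case, f2 given by the pitch information P of frame B is smaller than f1 given by the pitch information P of the previous frame A, so the pulse generation interval in frame B gradually increases as shown. That is, as shown in the lower part of FIG. 4, the pitch varies linearly so that it reaches f2 at the end of frame B.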
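The equation for dT is not reproduced in this text. One form consistent with the definitions of LT, t and Tf, and with the statement that the pitch reaches f2 at the end of frame B, is dT = 1 / (f1 + (f2 - f1)(t - LT) / Tf); this is an assumption, not the patent's verbatim formula. The sketch below builds a provisional pulse train on that basis. The function name `provisional_pulses`, placing the first pulse at the start of the first frame, and evaluating the interpolated pitch at the time of the previous pulse are all illustrative choices.

```python
def provisional_pulses(frames, frame_len):
    """Build a provisional pulse train as a list of (time, gain).

    frames: list of (pitch_hz, gain), one entry per frame.
    frame_len: frame length in seconds.
    """
    pulses = []
    prev_pitch = None
    for k, (pitch, gain) in enumerate(frames):
        start, end = k * frame_len, (k + 1) * frame_len
        if prev_pitch is None:
            # First frame: equal intervals a = 1 / f1.
            t = start
            while t < end:
                pulses.append((t, gain))
                t += 1.0 / pitch
        else:
            last_t = pulses[-1][0]   # LT: last pulse of previous frame
            span = end - last_t      # Tf: frame end minus LT
            t = last_t
            while True:
                # Assumed linear interpolation of pitch over [LT, end].
                f = prev_pitch + (pitch - prev_pitch) * (t - last_t) / span
                t += 1.0 / f         # dT = 1 / f
                if t >= end:
                    break
                pulses.append((t, gain))
        prev_pitch = pitch
    return pulses
```

With `frames = [(100.0, 1.0), (50.0, 0.5)]` and `frame_len = 0.045`, the pulses in the first frame are 10 ms apart and the intervals in the second frame grow steadily, which matches the behaviour described for FIG. 4 when f2 is smaller than f1.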
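[0035]
Having generated the provisional pulse train from the pitch information P and gain information G supplied at each frame interval, the pulse generation unit 33 corrects that pulse train (waveform) on the basis of the hoarseness degree information H supplied from the parameter determination unit 20. This correction is described below with reference to FIGS. 5 and 6, for the case in which hoarseness degree information h1 is supplied for the first frame A and h2 (different from h1) for the next frame B.
[0036]
As shown in FIG. 5, because h1 is supplied for frame A, the pulse generation unit 33 refers to the template for h1 stored in the correction information database 34 and corrects the gain of each pulse in frame A. If h1 = 0.4, the Δgain information dG2(t) of the template associated with hoarseness "0.4" in the correction information database 34 (see FIG. 3) is used to correct the pulse gains in frame A. The gain g1 of the pulse generated at time t1 after the start of frame A is therefore corrected by dG2(t1) (from the black dot to the cross in the figure; for the other pulses too, the gain before correction is shown by a black dot and the gain after correction by a cross). Similarly, the gain g1 of the pulse generated at t2 is corrected by dG2(t2), and that of the pulse generated at t3 by dG2(t3). Pulses generated later are likewise corrected by the gain correction amount for their elapsed time from the start of the frame.
[0037]
For frame B, h2 is supplied, so the pulse generation unit 33 refers to the template for h2 stored in the correction information database 34 and corrects the gain of each pulse in frame B. If h2 = 0.6, the Δgain information dG3(t) of the template associated with hoarseness "0.6" is used. The gain g2 of the pulse generated at time t8 after the start of frame B is therefore corrected by dG3(t8). Similarly, the gain g2 of the pulse generated at t9 is corrected by dG3(t9), and that of the pulse generated at t10 by dG3(t10). Pulses generated later are likewise corrected by the gain correction amount for their elapsed time from the start of the frame.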
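[0038]
The above is the process of individually correcting the gain of each pulse in the provisional pulse train generated from the pitch information P and gain information G. Next, the pitch correction, that is, the correction of the generation timing of each pulse, is described with reference to FIG. 6.
[0039]
As shown in the figure, because h1 (= 0.4) is supplied for frame A, the pulse generation unit 33 refers to the template for hoarseness "0.4" stored in the correction information database 34 and corrects the generation timing of each pulse in frame A. That is, the Δpitch information df2(t) is used. The generation timing of the pulse generated at time t1 after the start of frame A in the provisional pulse train is therefore corrected by the amount corresponding to df2(t1) (from the cross to the square in the figure). In this embodiment df2(t) is stored as a frequency value, so the generation timing is shifted by its reciprocal, 1/df2(t). For the other pulses too, the timing before correction is shown by a cross and the timing after correction by a square. Similarly, the timing of the pulse generated at t2 in the provisional pulse train is corrected by 1/df2(t2), and that of the pulse generated at t3 by 1/df2(t3). Later pulses in the provisional pulse train are likewise corrected by the amount for their elapsed time from the start of the frame.
[0040]
For frame B, h2 (= 0.6) is supplied, so the pulse generation unit 33 refers to the template for hoarseness "0.6" and corrects the generation timing of each pulse in frame B using the Δpitch information df3(t). The generation timing of the pulse generated at time t8 after the start of frame B in the provisional pulse train is corrected by the amount corresponding to df3(t8); in the same way, the pulse generated at t9 is corrected by the amount corresponding to df3(t9), and the pulse generated at t10 by the amount corresponding to df3(t10). Later pulses are likewise corrected by the amount for their elapsed time from the start of the frame.
[0041]
In this way, on the basis of the hoarseness degree information H, the pulse generation unit 33 individually corrects the gain and generation timing of each pulse in the provisional pulse train generated from the pitch information P and gain information G, and generates a pulse waveform such as that shown in the lower part of FIG. 7.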
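The per-pulse correction of paragraphs [0036] to [0041] can be sketched as follows. Several points are assumptions made for illustration: pulse gains are treated as linear amplitudes and the Δgain values as dB offsets, and the Δpitch information is taken in the time-shift representation that paragraph [0022] permits (a signed offset in seconds), rather than the frequency representation used in the embodiment. The function name `correct_pulses` and the callables `d_gain_db` and `d_time` are hypothetical.

```python
def correct_pulses(pulses, frame_start, d_gain_db, d_time):
    """Apply per-pulse gain and timing corrections within one frame.

    pulses: list of (time, gain) from the provisional pulse train.
    d_gain_db(t): gain correction in dB at elapsed time t.
    d_time(t): timing shift in seconds at elapsed time t (assumed form).
    """
    corrected = []
    for t, gain in pulses:
        elapsed = t - frame_start            # time since frame start
        new_gain = gain * 10.0 ** (d_gain_db(elapsed) / 20.0)
        new_time = t + d_time(elapsed)
        corrected.append((new_time, new_gain))
    return corrected
```

Each pulse is looked up by its own elapsed time, so two pulses in the same frame generally receive different corrections, which is the point of correcting pulses individually rather than per frame.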
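[0042]
The correction described above assumes that the hoarseness degree information H supplied for each frame matches a hoarseness value stored in the correction information database 34. However, hoarseness degree information H indicating a value that is not stored, such as "0.3" or "0.5", may also be supplied. The correction in that case is as follows.
[0043]
When hoarseness degree information H indicating a value other than those stored in the correction information database 34 is supplied, the pulse generation unit 33 obtains the value used for gain correction by the following procedure.
[0044]
First, the pulse generation unit 33 selects the template associated with the largest stored hoarseness value that is less than the value h indicated by the supplied hoarseness degree information H. For example, if h is 0.3, the template for hoarseness "0.2" is selected (step 1).
[0045]
Next, on the basis of the value h, one of the following equations (1) to (5) is selected, and the interpolation ratio R is obtained from the selected equation (step 2).
[Equation 2]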

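[0046]
Once the interpolation ratio R has been obtained from the equation selected according to h, the Δgain information of the template selected in step 1 is multiplied by R, and the product is used as the gain correction value. For example, if the template for hoarseness "0.2" was selected in step 1, the Δgain information used for correction is R × dG1(t).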
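Steps 1 and 2 can be sketched as below. Equations (1) to (5) for the interpolation ratio R are not reproduced in this text, so the sketch does not compute R; it takes R as an input supplied by the caller. The function names `select_template_degree` and `scaled_d_gain` are hypothetical.

```python
def select_template_degree(stored_degrees, h):
    """Step 1: largest stored hoarseness value that is less than h."""
    below = [d for d in stored_degrees if d < h]
    if not below:
        raise ValueError("no stored hoarseness value below h")
    return max(below)


def scaled_d_gain(d_gain, ratio):
    """Return a delta-gain function scaled by the interpolation ratio R.

    ratio must come from the patent's equations (1) to (5), which are
    not available here; this function only applies the multiplication.
    """
    return lambda t: ratio * d_gain(t)
```

For h = 0.3 and stored values 0.2 to 1.0, step 1 picks 0.2, and the correction applied at time t is R times the value of that template's Δgain information at t.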
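[0047]
Thus, when hoarseness degree information H indicating a value other than those stored in the correction information database 34 is supplied, the pulse generation unit 33 corrects the provisional pulse train using interpolated Δgain information.
[0048]
FIGS. 5 and 6 illustrate the correction for the case in which different hoarseness degree information H is supplied for two consecutive frames (frame A and frame B) and the template time stored in the correction information database 34 is longer than a frame. When the same hoarseness degree information H is supplied for consecutive frames, the pulse generation unit 33 performs the correction as follows.
[0049]
As shown in FIG. 8, for frame A the gain of each pulse is corrected, as in FIG. 5, using the gain correction amounts for the elapsed times t1, t2, ..., t7 from the start of frame A. For the pulses in frame B, the Δgain information of the template for the same hoarseness degree as frame A is used, here dG2(t), and t is taken as the elapsed time from the start of frame A, not frame B, giving t'8, t'9, ..., t'12. That is, the gain correction amounts for the pulses in frame B are, in order, dG2(t'8), dG2(t'9), dG2(t'10), dG2(t'11) and dG2(t'12), and the gain of each pulse in frame B is corrected by these amounts.
[0050]
The generation timing of each pulse in the following frame B is treated in the same way as the gain: the elapsed times t'8, t'9, ..., t'12 from the start of frame A, not frame B, are used. That is, the timing correction amounts for the pulses in frame B are, in order, df2(t'8), df2(t'9), df2(t'10), df2(t'11) and df2(t'12), and the generation timing of each pulse in frame B is corrected by these amounts.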
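[0051]
FIGS. 5 and 6 also assume a template that is longer than the frame interval, that is, a template time longer than one frame. When the template is shorter than the frame, the pulse generation unit 33 performs the correction as follows.
[0052]
As shown in FIG. 9, when the template used for frame A is shorter than frame A, the pulse generation unit 33 reuses the template from its beginning for the part of frame A that extends beyond the template length. In the illustrated example, the fifth and subsequent pulses in frame A are generated after the template time has elapsed, and for these pulses the elapsed times t'5, t'6 and t'7 from the end of the template time are used. That is, the gain correction amounts for the fifth and subsequent pulses in frame A are dG2(t'5), dG2(t'6) and dG2(t'7), and the pulse generation unit 33 corrects the gain of each pulse by these amounts.
[0053]
Thus, when the hoarseness degree information H does not match a value stored in the correction information database 34, when the hoarseness degree information H is the same for consecutive frames, or when the template time is shorter than the frame, the pulse generation unit 33 obtains the gain and pitch correction amounts as described above and individually corrects the gain and generation timing of each pulse in the provisional pulse train.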
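The rules for the template time index in paragraphs [0048] to [0052] can be combined in one small helper. This is a sketch under assumptions: the class name `TemplateClock` is hypothetical, and wrapping with a modulo is an assumed way of expressing "reuse the template once its time has run out", applied uniformly whether the template ends inside a frame or in a later frame with the same hoarseness degree.

```python
class TemplateClock:
    """Track the elapsed time used to index into a template."""

    def __init__(self):
        self.h = None        # hoarseness value of the template in use
        self.origin = 0.0    # time at which that template was started

    def elapsed(self, t, frame_start, h, duration):
        """Return the template time index for a pulse at absolute time t."""
        if h != self.h:
            # New hoarseness degree: restart the template at this frame.
            self.h = h
            self.origin = frame_start
        # Same degree as before: keep counting from the earlier origin,
        # wrapping around when the template duration is exceeded.
        return (t - self.origin) % duration
```

A caller would ask the clock for the index of each pulse and pass the result to the template's Δgain and Δpitch lookups.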
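[0054]
The pulse waveform generated by the pulse generation unit 33 is supplied to the continuous-to-discrete conversion unit 35 shown in FIG. 1. The generation time of each pulse in that waveform is expressed in continuous time, whereas the digital signal processing performed by the synthesis filter 40 in the following stage requires a waveform expressed in discrete time. The continuous-to-discrete conversion unit 35 therefore converts the pulse waveform generated by the pulse generation unit 33 into a waveform expressed in discrete time.
[0055]
If this conversion were done simply by quantizing the continuous times to discrete times, for example by rounding, pulses would occur only at sampling instants. Only pitches equal to the sampling frequency divided by an integer could then be produced, and the converted waveform would contain pitch errors. Furthermore, each pulse of a pulse waveform such as that in FIG. 7, viewed in the frequency domain, has components at all frequencies, so aliasing noise would arise on D/A conversion.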
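The pitch error caused by rounding pulse times to sampling instants is easy to quantify. The sketch below rounds the pulse period to a whole number of samples and reports the pitch that results; the sampling frequency of 44100 Hz and the target pitch of 440 Hz used in the example are illustrative values, not figures from the patent.

```python
def quantized_pitch(f0, fs):
    """Pitch actually produced if the period is rounded to whole samples."""
    period_samples = round(fs / f0)   # nearest integer number of samples
    return fs / period_samples
```

For a target of 440 Hz at 44100 Hz sampling, the ideal period is about 100.23 samples, which rounds to 100 samples and yields 441 Hz. The error grows as the pitch rises, because fewer samples are available per period.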
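[0056]
In view of these problems, when converting the waveform expressed in continuous time (see FIG. 7) into a waveform expressed in discrete time, the continuous-to-discrete conversion unit 35 of this embodiment first replaces each pulse of the pulse waveform generated by the pulse generation unit 33 with a waveform given by the following sinc function.
[Equation 3]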

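[0057]
In the above equation, G is the gain of the pulse being replaced, tp is the time at which that pulse occurs, and Ts is the sampling period.
[0058]
Next, the waveform obtained by this replacement is sampled with the sampling period Ts to obtain a source waveform expressed in discrete time. The result is a waveform whose gain is zero outside the band from 0 to fs/2 (Hz), where fs is the sampling frequency.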
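The sinc equation itself is not reproduced in this text. A form consistent with the definitions of G, tp and Ts, and with the stated band limit of fs/2, is s(t) = G · sin(π(t − tp)/Ts) / (π(t − tp)/Ts); this is an assumption rather than the patent's verbatim formula. The sketch below replaces each pulse with that function and samples the sum at multiples of Ts. Truncating each sinc to `half_width` samples on either side is a practical shortcut added here; an ideal sinc extends indefinitely. The function names are hypothetical.

```python
import math


def sinc_pulse(t, gain, t_p, t_s):
    """Value at time t of a sinc of height gain centred on t_p."""
    x = math.pi * (t - t_p) / t_s
    return gain if x == 0.0 else gain * math.sin(x) / x


def render(pulses, t_s, n_samples, half_width=32):
    """Sample the sum of sinc-replaced pulses at times n * t_s.

    pulses: list of (time, gain) in continuous time.
    half_width: truncation of each sinc, in samples (an assumption).
    """
    out = [0.0] * n_samples
    for t_p, gain in pulses:
        center = int(round(t_p / t_s))
        lo = max(0, center - half_width)
        hi = min(n_samples, center + half_width + 1)
        for n in range(lo, hi):
            out[n] += sinc_pulse(n * t_s, gain, t_p, t_s)
    return out
```

A pulse that falls exactly on a sampling instant contributes a single non-zero sample, while a pulse between two instants spreads over neighbouring samples, which is how sub-sample timing, and hence fine pitch, survives the conversion.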
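[0059]
By following this procedure the continuous-to-discrete conversion unit 35 converts the pulse waveform generated by the pulse generation unit 33 into a waveform expressed in discrete time. FIGS. 11 and 12 show the conversion when the pulse waveform shown in FIG. 10 is supplied from the pulse generation unit 33. As shown in FIG. 10, this pulse waveform contains two pulses of different amplitude. Replacing each pulse with the waveform given by the sinc function, with the position of the pulse at the center of the sinc function, yields the waveform shown in FIG. 11. Sampling this waveform at the sampling frequency fs yields the discrete-time waveform shown in FIG. 12.
[0060]
In voiced sections, the voiced source waveform generator 32 outputs the discrete-time source waveform obtained by the continuous-to-discrete conversion unit 35 in this way.
[0061]
As described above, the sound source waveform generation device 30 of this embodiment uses templates, created in advance by analyzing the vocal cord waveform obtained when a person produces a hoarse voice, to reproduce subtle pitch and gain fluctuations, and can generate a pulse waveform in which the pitch and gain of each pulse of the waveform generated according to the pitch information P and gain information G have been corrected. That is, by correcting the pulse waveform with templates created from the analysis of human vocal cord waveforms and stored in the correction information database 34, it can generate a source waveform that reproduces more accurately the subtle pitch and gain fluctuations of a natural human voice (vocal cord waveform). The sound produced from a speech waveform synthesized with this source waveform therefore gives the listener a more natural impression.
[0062]
In this embodiment, each pulse of the source waveform is corrected individually so that a natural human speech waveform (vocal cord waveform), including its hoarseness, is reproduced faithfully. By applying to each pulse the most suitable amount of correction, derived from a real human voice, it is possible to generate a source waveform that reproduces more accurately the irregular pitch and gain fluctuations of a natural human voice, and to synthesize a speech waveform that reproduces more accurately the naturalness of a human voice with some degree of hoarseness.
[0063]
Furthermore, because the correction information database 34 holds templates for correcting each pulse according to several hoarseness degrees, voices with various degrees of hoarseness can be reproduced selectively. The subtle pitch and gain fluctuations of the vocal cord waveform differ with the manner of utterance and with the speaker, and as a result the listener perceives different degrees of hoarseness. In this embodiment, templates created from voices with several different degrees of hoarseness, that is, with different patterns of subtle pitch and gain fluctuation, are stored in the correction information database 34, and by using them selectively, voices with various degrees of hoarseness can be reproduced in a more natural form.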
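[0064]
C. Modifications
The present invention is not limited to the embodiment described above, and various modifications such as the following are possible.
[0065]
(Modification 1)
The embodiment above applies the invention to a speech synthesizer using LPC synthesis, but the invention is not limited to this. It can be applied to various speech synthesizers that have a sound source waveform generation device generating a source waveform from pitch information P and gain information G. For example, it can be applied to a PARCOR synthesizer. It can also be applied to a speech synthesizer that converts the time-domain source waveform generated as above into a frequency-domain waveform, performs synthesis processing that reflects the vocal tract characteristics and so on in the frequency domain, and converts the result back into a time-domain waveform for output.
[0066]
(Modification 2)
In the embodiment above the correction information database 34 stores templates for five hoarseness degrees, but it may store templates for six or more degrees, or for four or fewer.
[0067]
A correction information database 34' configured as shown in FIG. 13 may also be used. As shown in the figure, this database stores, for each pitch range such as "0 to X (Hz)" and "X to Y (Hz)" (X < Y), templates for the same five hoarseness degrees as in the embodiment. As in the embodiment, the five templates for each pitch range are obtained by analyzing the vocal cord waveform of human utterances whose pitch lies within that range. When the pulse waveform is corrected using the correction information database 34', the pulse generation unit 33 identifies the pitch range to which the pitch indicated by the pitch information P supplied from the parameter determination unit 20 belongs, and corrects the pulse waveform using the templates associated with that range. For example, with the templates shown in the figure stored in the correction information database 34', if the pitch indicated by the pitch information P lies within the "X to Y" range and the hoarseness degree information H is "0.4", the template consisting of Δgain information dG12(t), Δgain mean information AG12, Δpitch information df12(t) and template time information T12 is used to correct the pulses.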
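The two-level lookup of FIG. 13, first by pitch range and then by hoarseness degree, can be sketched as below. The boundary values standing in for X and Y, the string labels standing in for the templates, and the rule that a pitch exactly on a boundary belongs to the higher range are all assumptions made for the example; the function name `find_template` is hypothetical.

```python
from bisect import bisect_right


def find_template(db, upper_bounds, pitch_hz, h):
    """Look up a template by pitch range, then by hoarseness degree.

    db: list of dicts, one per pitch range, mapping hoarseness -> template.
    upper_bounds: ascending upper limits of the ranges, e.g. [X, Y].
    """
    i = bisect_right(upper_bounds, pitch_hz)
    if i >= len(db):
        raise ValueError("pitch is above the highest stored range")
    return db[i][h]


# Hypothetical contents: X = 200 Hz, Y = 400 Hz, labels instead of data.
example_db = [
    {0.2: "T1", 0.4: "T2"},     # range 0 to X
    {0.2: "T11", 0.4: "T12"},   # range X to Y
]
```

With these values, `find_template(example_db, [200.0, 400.0], 250.0, 0.4)` returns the entry labelled "T12", mirroring the example in the text. The gain-range database of FIG. 14 could be looked up in the same way with gain in place of pitch.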
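[0068]
A correction information database 34" configured as shown in FIG. 14 may also be used. As shown in the figure, this database stores, for each gain range such as "0 to α (dB)" and "α to β (dB)" (α < β), templates for the same five hoarseness degrees as in the embodiment. As in the embodiment, the five templates for each gain range are obtained by analyzing the vocal cord waveform of human utterances whose gain lies within that range. When the pulse waveform is corrected using the correction information database 34", the pulse generation unit 33 identifies the gain range to which the gain indicated by the gain information G supplied from the parameter determination unit 20 belongs, and corrects the gain of each pulse using the templates associated with that range.
[0069]
Alternatively, vocal cord waveforms of male and female voices may each be analyzed, templates for male voices and for female voices may be stored in the correction information database 34, and the templates used to correct the pulse waveform may be selected according to a specified gender.
[0070]
(Modification 3)
In the embodiment above the parameter determination unit 20 generates the hoarseness degree information H from the hoarseness data contained in the document data supplied to the speech synthesizer 100 and supplies it to the sound source waveform generation device 30. The degree of hoarseness may instead be specified by the user, for example through an operation panel.
[0071]
(Modification 4)
The embodiment above applies the invention to the speech synthesizer 100, which synthesizes speech from document data. The invention may also be applied to a singing synthesizer that synthesizes a singing voice from data containing lyric information and melody information (for example, karaoke data).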
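[0072]
(Modification 5)
The speech synthesizer 100 of the embodiment above may be built from dedicated hardware circuits, or it may be implemented in software on a computer system such as that shown in FIG. 15. As shown in the figure, this computer system comprises a CPU (Central Processing Unit) 320 that controls the whole apparatus, a ROM (Read Only Memory) 321 that stores various data and programs, a RAM (Random Access Memory) 322 used as a work area, an external storage device 323 such as a hard disk or CD-ROM (Compact Disc Read Only Memory) drive that stores various data and programs, an operation unit 324 such as a keyboard and mouse, a display unit 325 that presents information to the user, a D/A converter 326, an amplifier 327 and a loudspeaker 328.
[0073]
Using the data that make up the correction information database 34, stored in the ROM 321 or in the external storage device 323 such as a hard disk, and following a program stored in the ROM 321 or the external storage device 323, the CPU 320 generates a source waveform in which the gain and pitch of each pulse have been corrected as in the embodiment, and synthesizes a speech waveform signal by applying to that source waveform synthesis processing according to the vocal tract characteristics.
[0074]
The CPU 320 then outputs the synthesized speech waveform signal to the D/A converter 326. The D/A converter 326 converts the signal into an analog signal, which is amplified by the amplifier 327 and emitted from the loudspeaker 328.
[0075]
A speech synthesizer equipped with the sound source waveform generation device of the embodiment can thus be implemented in software on a computer system, and may be provided to users in the form of a program that causes a computer system to execute the same speech synthesis processing as in the embodiment. Such a program may be provided on a recording medium such as a CD-ROM or floppy disk, or over a communication line such as the Internet.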
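[0076]
[Effects of the Invention]
As described above, the present invention makes it possible to synthesize a speech waveform that gives the listener a more natural impression.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of a speech synthesizer equipped with a sound source waveform generation device according to one embodiment of the present invention.
[FIG. 2] A diagram explaining the document data used for speech synthesis by the speech synthesizer.
[FIG. 3] A diagram explaining the data stored in the correction information database, a component of the sound source waveform generation device.
[FIG. 4] A diagram explaining the sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 5] A diagram explaining the sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 6] A diagram explaining the sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 7] A diagram explaining the sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 8] A diagram explaining another sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 9] A diagram explaining yet another sound source waveform generation process performed by the sound source waveform generation device.
[FIG. 10] A diagram explaining how a source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
[FIG. 11] A diagram explaining how a source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
[FIG. 12] A diagram explaining how a source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
[FIG. 13] A diagram explaining the data stored in the correction information database in a modification of the sound source waveform generation device.
[FIG. 14] A diagram explaining the data stored in the correction information database in another modification of the sound source waveform generation device.
[FIG. 15] A block diagram showing the hardware configuration of a computer system that implements in software the same processing as the speech synthesizer.
[Explanation of Reference Numerals]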
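10: speech segment database; 20: parameter determination unit; 30: sound source waveform generation device; 31: unvoiced source waveform generator; 32: voiced source waveform generator; 33: pulse generation unit; 34: correction information database; 35: continuous-to-discrete conversion unit; 40: synthesis filter; 100: speech synthesizer; 320: CPU; 321: ROM; 322: RAM; 323: external storage device
[0001]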
BACKGROUND OF THE INVENTION
The present invention relates to a speech synthesizer that synthesizes a speech waveform from character data or the like, to a sound source waveform generation device that generates the sound source waveform used in speech synthesis by that synthesizer, to a sound source waveform generation method, and to a program for performing sound source waveform generation processing.
[0002]
[Prior art]
Various speech synthesis techniques for producing speech artificially have been proposed. These techniques regard speech production as a combination of two processes, generation of a vocal cord sound source and articulation by the vocal tract, and realize speech synthesis by approximating both as engineering models. That is, a vocal cord sound source waveform is generated, and a speech waveform is produced by applying to it filtering or similar processing that corresponds to articulation by the vocal tract. For example, a speech synthesizer based on linear predictive analysis (LPC) comprises a sound source waveform generator, consisting of a voiced source and an unvoiced source, and a vocal tract filter. The vocal tract filter filters the pulse waveform emitted by the generator according to the content to be uttered, thereby synthesizing the corresponding speech waveform.
[0003]
[Problems to be solved by the invention]
The sound source waveform generator used in such speech synthesis generates, frame by frame, a pulse waveform that faithfully follows the pitch information and gain information for the speech to be produced, and outputs it to the vocal tract filter as the sound source waveform. However, a source waveform that faithfully follows the pitch information and so on is uniform, so speech generated from it is likely to lose the naturalness of speech produced by a real person. The vocal cord waveform of real human speech contains irregular, subtle fluctuations in pitch, gain and the like, and depending on the manner and degree of these fluctuations the voice may be described as "hoarse" or "gravelly". (Even a voice that is not described as hoarse has subtle fluctuations; human voices can be said to contain some degree of hoarseness, differing from voice to voice.) In contrast, a source waveform that is uniform within each frame, with no fluctuation in pitch or gain inside the frame, lacks the elements (subtle fluctuations and so on) that give an impression of naturalness. Speech generated from such a waveform is therefore likely to sound unnatural.
[0004]
One conceivable way to address this loss of naturalness is to use random numbers or the like to add pitch and gain fluctuations to the uniform pulse waveform produced by the generator. Even so, such fluctuations are artificial. They do not necessarily give naturalness to the speech generated from the resulting source waveform, and may instead make it sound more unnatural.
[0005]
The present invention has been made in view of these circumstances. Its object is to provide a speech synthesizer that synthesizes speech giving the listener a more natural impression, a sound source waveform generation device for use in the speech synthesizer, a sound source waveform generation method, and a program.
[0006]
[Means for Solving the Problems]
To solve the above problem, a sound source waveform generation device according to the present invention is a device that generates a sound source waveform used in synthesizing a speech waveform, comprising: storage means that stores, for each preset pitch range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, reads from the storage means, within the range corresponding to the input pitch information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.
Another sound source waveform generation device according to the present invention is a device that generates a sound source waveform used in synthesizing a speech waveform, comprising: storage means that stores, for each preset gain range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, reads from the storage means, within the range corresponding to the input gain information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.
[0007]
With this configuration, the source waveform generated for speech synthesis is not one that faithfully follows the pitch information and gain information input according to the utterance content, but one corrected on the basis of correction information stored in advance in the storage means. Information for correcting pitch and gain, created beforehand from an analysis of speech actually produced by a person, can therefore be stored in the storage means, and using that information makes it possible to generate a source waveform containing more natural pitch and gain fluctuations.
[0008]
A speech synthesizer according to the present invention comprises: storage means that stores, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; information acquisition means that acquires pitch information, gain information and hoarseness degree information on the basis of the speech content to be uttered; waveform generation means that reads from the storage means, within the range corresponding to the pitch information or gain information acquired by the information acquisition means, the correction information corresponding to the hoarseness degree information, and generates a sound source waveform by correcting the acquired pitch information and gain information on the basis of the read correction information; and synthesis means that synthesizes a speech waveform by applying, to the sound source waveform generated by the waveform generation means, filtering according to the speech content to be uttered.
[0009]
A sound source waveform generation method according to the present invention is a method of generating a sound source waveform used in synthesizing a speech waveform, in which: a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech, are stored in storage means for each preset pitch range or each preset level range; pitch information, gain information and hoarseness degree information for the waveform to be generated are input; the correction information corresponding to the input hoarseness degree information is read from the storage means within the range corresponding to the input pitch information or gain information; and a sound source waveform is generated by correcting the input pitch information and gain information on the basis of the read correction information.
[0010]
A program according to the present invention causes a computer to function as: means for reading correction information from storage means that stores, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and waveform generation means that receives pitch information, gain information and hoarseness degree information for the waveform to be generated, has the reading means read, within the range corresponding to the input pitch information or gain information, the correction information corresponding to the input hoarseness degree information, and generates a sound source waveform by correcting the input pitch information and gain information on the basis of the read correction information.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A. Speech synthesizer
FIG. 1 is a block diagram showing the configuration of a speech synthesizer equipped with a sound source waveform generation device according to one embodiment of the present invention. As shown in the figure, the speech synthesizer 100 of this embodiment uses LPC synthesis and comprises a speech segment database (DB) 10, a parameter determination unit 20, a sound source waveform generation device 30 and a synthesis filter 40. The sound source waveform generation device 30 is not limited to LPC-based synthesizers; it can be applied to any speech synthesizer that has a sound source waveform generator as a component.
[0012]
In the speech synthesizer 100, document data indicating words or the like to be pronounced is input to the parameter determination unit 20. The parameter determination unit 20 performs document analysis on input document data and the like, and obtains analysis results such as word reading and accent. The parameter determination unit 20 refers to the stored content of the speech unit database 10 and determines various parameters used for speech waveform synthesis for each frame from the document analysis result. One frame is, for example, 5 msec or 10 msec. A feature parameter corresponding to the document analysis result is acquired at each frame interval, and the acquired feature parameter is output to the sound source waveform generation device 30 or the synthesis filter 40.
[0013]
The speech segment database 10 stores, for each phoneme or each phoneme chain composed of a plurality of phonemes, various feature parameters relating to the vocal cord sound source waveform and to vocal tract articulation, obtained in advance by analyzing speech waveforms uttered by a person. The feature parameters stored in the speech segment database 10 include pitch information and gain information as parameters related to the vocal cord sound source waveform, and vocal tract characteristic parameters as parameters related to articulation. Based on the document analysis result described above, the parameter determination unit 20 acquires, from among the feature parameters stored for each phoneme, the feature parameters stored in association with the specific phoneme. Having acquired the feature parameters, the parameter determination unit 20 outputs the acquired pitch information and gain information to the sound source waveform generation device 30 and the vocal tract characteristic information to the synthesis filter 40. Further, the parameter determination unit 20 identifies voiced and unvoiced sections according to the document analysis result and outputs voiced/unvoiced information, indicating whether a section is voiced or unvoiced, to the sound source waveform generation device 30.
[0014]
Further, in the present embodiment, as shown in FIG. 2, the document data supplied to the parameter determination unit 20 includes, for each predetermined unit such as a phoneme, word, or phrase constituting the document (for each word in the example shown in FIG. 2), wrinkle degree data indicating the degree of wrinkle of the voice to be pronounced. The parameter determination unit 20 refers to this data and outputs wrinkle degree information H (correction degree information) to the sound source waveform generation device 30. That is, at each frame interval, the parameter determination unit 20 outputs the pitch information P, the gain information G, the wrinkle degree information H, and the vocal tract characteristic information to the sound source waveform generation device 30 and the synthesis filter 40 according to the sound to be generated.
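As an illustration of the per-frame parameters described above, the following minimal Python sketch groups P, G, H, and the voiced/unvoiced flag into one record per frame. The names `FrameParams` and `frames_for_unit` are hypothetical and do not appear in the source; the 5 msec default is one of the example frame lengths mentioned in the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameParams:
    pitch_hz: float   # pitch information P
    gain: float       # gain information G
    wrinkle: float    # wrinkle degree information H
    voiced: bool      # voiced / unvoiced information


def frames_for_unit(pitch_hz: float, gain: float, wrinkle: float,
                    duration_s: float, frame_s: float = 0.005) -> List[FrameParams]:
    """Repeat one unit's parameters for every frame it spans (hypothetical helper)."""
    n_frames = round(duration_s / frame_s)
    return [FrameParams(pitch_hz, gain, wrinkle, True) for _ in range(n_frames)]
```

A word annotated with wrinkle degree 0.4 that lasts 50 msec would thus yield ten frames carrying the same H value.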
[0015]
Returning to FIG. 1, the sound source waveform generation device 30 generates a sound source waveform based on the pitch information P, gain information G, wrinkle degree information H, and voiced/unvoiced information supplied from the parameter determination unit 20 at each frame interval. Details of the sound source waveform generation device 30 will be described later.
[0016]
The synthesis filter 40 filters the sound source waveform output from the sound source waveform generation device 30 according to the vocal tract characteristic information supplied from the parameter determination unit 20, and outputs the filtered waveform as a speech waveform. This speech waveform is supplied to a speaker via a D/A converter and an amplifier, whereby artificial speech is produced. The speech synthesizer of the present embodiment is characterized by the sound source waveform generation device 30. The synthesis filter 40 and the other components are the same as those of a conventional speech synthesizer using general LPC synthesis technology, so their description is omitted.
[0017]
B. Sound source waveform generation device
The above is the overall configuration of the speech synthesizer 100. The sound source waveform generation device 30 is described in detail below. As shown in FIG. 1, the sound source waveform generation device 30 includes an unvoiced sound source waveform generation device 31 and a voiced sound source waveform generation device 32, and generates a sound source waveform based on the pitch information P, gain information G, wrinkle degree information H, and voiced/unvoiced information supplied from the parameter determination unit 20 at each frame interval.
[0018]
The unvoiced sound source waveform generation device 31 generates the sound source waveform that the sound source waveform generation device 30 outputs when the voiced/unvoiced information supplied from the parameter determination unit 20 indicates an unvoiced section. More specifically, it outputs a white noise waveform having a gain corresponding to the gain information G supplied from the parameter determination unit 20.
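The unvoiced path can be sketched as follows. This is only an illustration: the source does not say how the gain information G maps to amplitude, so the sketch assumes G is a linear peak amplitude and uses uniformly distributed samples as the white noise. The function name `unvoiced_frame` is hypothetical.

```python
import random
from typing import List, Optional


def unvoiced_frame(gain: float, n_samples: int, seed: Optional[int] = None) -> List[float]:
    """Return n_samples of white noise scaled by gain (assumed linear amplitude)."""
    rng = random.Random(seed)
    # Independent, identically distributed samples have a flat (white) spectrum.
    return [gain * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
```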
[0019]
The voiced sound source waveform generation device 32 generates the sound source waveform that the sound source waveform generation device 30 outputs in a voiced section, and includes a pulse generation unit 33, a correction information database 34, and a continuous discrete conversion unit 35. The pulse generation unit 33 generates a pulse waveform according to the pitch information P and gain information G supplied from the parameter determination unit 20 at each frame interval. In the present embodiment, however, the pulse generation unit 33 does not output a pulse waveform that faithfully follows the pitch information P and the gain information G. Instead, it refers to the contents stored in the correction information database 34 and, based on the wrinkle degree information H supplied from the parameter determination unit 20, corrects the pulse waveform that follows the pitch information P and the gain information G.
[0020]
The correction information database 34 stores the information (correction information) that the pulse generation unit 33 uses to generate the pulse waveform described above. As shown in FIG. 3, the correction information database 34 stores a correction template for each of a plurality of values representing the wrinkle degree (in the illustrated example, “0.2”, “0.4”, “0.6”, “0.8”, and “1.0”). Each correction template includes Δ gain information, Δ gain average information, Δ pitch information, and template time information.
[0021]
The Δ gain information is used to individually correct the gain of each pulse included in the pulse waveform that follows the pitch information P and the gain information G supplied from the parameter determination unit 20; it indicates the correction gain amount (dB) for each pulse. A large number of correction gain amounts are prepared, and each is stored in association with an elapsed time t from the start of the frame. Accordingly, in the pulse waveform of a given frame, the gain of the pulse generated at time t1 after the start of the frame is corrected by the correction gain amount associated with time t1, and the gain of the pulse generated at time t2 is corrected by the correction gain amount associated with time t2.
The Δ gain average information indicates the average value of the correction gain amounts.
[0022]
The Δ pitch information is used to correct the pulse generation time interval that follows the pitch information P supplied from the parameter determination unit 20. More specifically, when the pulse waveform is generated at time intervals that follow the pitch information P and the pitch information P is expressed as a frequency F, pulses are generated at intervals equal to the reciprocal of F. The Δ pitch information corrects the generation timing of each pulse that would otherwise be generated at intervals of (1 / F) according to the pitch information P. As with the Δ gain information, a large number of Δ pitch values are prepared, each associated with an elapsed time t from the start of the frame; that is, they are stored as a numerical sequence. Accordingly, the generation timing of the pulse that would be generated at time t1 in the pulse waveform of a frame when the pitch information P is followed faithfully is corrected by the correction amount associated with time t1 in that sequence, and the generation timing of the pulse that would be generated at time t2 is corrected by the correction amount associated with time t2. The Δ pitch information may be stored as time information for shifting the generation timing, as frequency information expressing that time as a reciprocal, or as the frequency converted into a cent value. In the following description, the Δ pitch information is assumed to represent a frequency.
[0023]
The template time information indicates the time length of the template prepared for each wrinkle degree value. As described above, each gain correction amount included in the Δ gain information and each pitch correction amount included in the Δ pitch information is associated with a time t, and t lies within the range indicated by the template time information. That is, in the example shown in FIG. 3, 0 ≦ t ≦ T1 for the template with wrinkle degree “0.2”. Note that the time lengths of the templates prepared for the respective wrinkle degrees, that is, the template time information, may differ from one another or may all be the same.
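The template contents described above can be pictured with the following Python sketch. It is an assumption-laden illustration, not the stored format of the source: the class name `CorrectionTemplate`, the uniform spacing `step_s` between stored entries, and the "most recent entry" lookup are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectionTemplate:
    wrinkle: float            # wrinkle degree value, e.g. 0.2
    step_s: float             # assumed uniform spacing of stored entries (seconds)
    d_gain_db: List[float]    # delta gain information, in dB, indexed by time
    d_pitch_hz: List[float]   # delta pitch information, as frequency values

    @property
    def template_time(self) -> float:
        """Template time information: total time covered by the template."""
        return self.step_s * len(self.d_gain_db)

    @property
    def d_gain_average(self) -> float:
        """Delta gain average information."""
        return sum(self.d_gain_db) / len(self.d_gain_db)

    def _index(self, t: float) -> int:
        if t < 0.0 or t > self.template_time:
            raise ValueError("t lies outside the template time")
        return min(int(t / self.step_s), len(self.d_gain_db) - 1)

    def d_gain(self, t: float) -> float:
        return self.d_gain_db[self._index(t)]

    def d_pitch(self, t: float) -> float:
        return self.d_pitch_hz[self._index(t)]
```

A database such as the one in FIG. 3 could then be a dictionary from wrinkle degree value to `CorrectionTemplate`.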
[0024]
The value indicating the “wrinkle degree” expresses how hoarse the voice is; the larger the value, the greater the degree of wrinkle (hoarseness). A template including the Δ gain information, Δ gain average information, Δ pitch information, and template time information described above is prepared for each such value.
[0025]
Next, a method for creating the template prepared for each wrinkle degree will be described. First, a speaker utters a certain phoneme for a fixed period of time at a constant pitch and intensity, and the uttered sound is recorded. The speaker performs this utterance five times with voices having different wrinkle degrees (different speakers may also be used), and a template corresponding to each wrinkle degree is created from each recording.
[0026]
Since the template models the sound source waveform emitted by the sound source waveform generation device 30, that is, the vocal cord waveform emitted by a person (the waveform before articulation by the vocal tract), the vocal cord waveform of the voice corresponding to each wrinkle degree is extracted from the recording, and the subtle fluctuations in the pitch, gain, and so on of the extracted vocal cord waveform are analyzed. A known method, such as one using a Kalman filter based on the LPC model, can be used to extract the vocal cord waveform from the recorded speech waveform. Then, based on the analysis result of the extracted vocal cord waveform, a template is created that includes Δ pitch information, Δ gain information, and the like capable of reproducing the subtle pitch and gain fluctuations shown in that result. More specifically, the Δ pitch information and Δ gain information can be obtained by computing the average pitch and average gain of the vocal cord waveform and taking the differences from these averages. By creating a template in this way for each of the five recordings with different wrinkle degrees, the templates corresponding to the five wrinkle degrees stored in the correction information database 34 can be created.
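The last step above (average, then difference from the average) can be sketched in Python as follows. The helper name `deltas_from_average` is hypothetical, and the per-pulse pitch or gain measurements are assumed to have been extracted already by some vocal cord waveform analysis that is outside this sketch.

```python
from typing import List, Tuple


def deltas_from_average(values: List[float]) -> Tuple[float, List[float]]:
    """Return the average of the measurements and each measurement's deviation from it."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]
```

Applied once to the measured per-pulse pitch values and once to the measured per-pulse gain values, this yields candidate Δ pitch and Δ gain sequences for one template.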
[0027]
The correction information database 34 stores templates corresponding to numerical values representing the wrinkle degree, such as “0.2”, “0.4”, “0.6”, “0.8”, and “1.0”. The template corresponding to wrinkle degree “0.2” is created from the analysis result of the voice judged to have the lowest wrinkle degree among the five utterances, and the template corresponding to “1.0” is created from the analysis result of the voice judged to have the highest wrinkle degree.
[0028]
Besides creating a template from recordings of uttered voices as described above, a template can also be created by the following method. In the method described above, the vocal cord waveform is extracted from the speech waveform using a Kalman filter or the like. Alternatively, electrodes may be attached to the speaker's larynx, and the vocal cord waveform may be extracted with these electrodes while the speaker performs the five utterances with different wrinkle degrees. More specifically, a weak current is passed between the electrodes during vocalization, the resistance value that changes as the vocal cords open and close is detected, the vibration of the vocal cords is thereby detected, and the vocal cord waveform extracted in this way is analyzed to create a template.
[0029]
The above is the method for creating the information stored in the correction information database 34. Based on the information stored in the correction information database 34, the pulse generation unit 33 corrects the pulse waveform that follows the pitch information P and the gain information G supplied from the parameter determination unit 20, and outputs the corrected waveform to the continuous discrete conversion unit 35.
[0030]
Next, with reference to FIG. 4, the sound source waveform generation process performed by the pulse generation unit 33, including the correction based on the information stored in the correction information database 34, will be described. The pulse generation unit 33 first generates a temporary pulse train based on the pitch information P and gain information G supplied from the parameter determination unit 20 at each frame interval. For example, if f1 is supplied as the pitch information P and g1 as the gain information G for the first frame A, and f2 as the pitch information P and g2 as the gain information G for the next frame B, the pulse generation unit 33 generates a temporary pulse train as shown in FIG. 4.
[0031]
As shown in the upper part of the figure, for the first frame A, pulses of gain g1 are generated at equal time intervals a in accordance with the pitch information P (= f1) and gain information G (= g1). Here, a = 1 / f1.
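For the first frame, the temporary pulse train is simply a set of equally spaced pulses. A minimal Python sketch, with pulses represented as hypothetical `(time, gain)` tuples and with the convention (an assumption) that the first pulse falls at the start of the frame:

```python
from typing import List, Tuple

Pulse = Tuple[float, float]  # (generation time in seconds, gain)


def uniform_pulse_train(f_hz: float, gain: float,
                        frame_start: float, frame_len: float) -> List[Pulse]:
    """Pulses of constant gain at equal intervals a = 1 / f within one frame."""
    period = 1.0 / f_hz
    pulses: List[Pulse] = []
    k = 0
    # Use k * period rather than repeated addition to avoid accumulating rounding error.
    while frame_start + k * period < frame_start + frame_len:
        pulses.append((frame_start + k * period, gain))
        k += 1
    return pulses
```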
[0032]
Next, for frame B, which follows the first frame A, a pulse train that follows the pitch information P and the gain information G could be generated in the same way as for frame A. However, in order to smooth the pitch variation between frames and generate a more natural speech waveform, for frames other than the first, the temporary pulse train is generated using a pitch obtained by linearly interpolating between the pitch of the previous frame and the pitch of the current frame.
[0033]
More specifically, let LT be the generation time of the pulse generated at the end of the previous frame (here, frame A), t be the time, and Tf be the difference between the end time of the current frame and LT. The pulse generation time interval dT of the subsequent frame B is obtained by the following equation.
[Expression 1]

[0034]
Each time a pulse is generated in frame B, the time interval dT to the next pulse is obtained by the above formula, and the next pulse is generated when dT has elapsed from the generation time of the previous pulse. In the illustrated case, f2 indicated by the pitch information P of frame B is smaller than f1 indicated by the pitch information P of the previous frame A, so the pulse generation time interval in frame B gradually increases, as shown in the figure. That is, as shown in the lower part of FIG. 4, the pitch varies linearly so that it reaches f2 at the end of frame B.
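The interpolation can be sketched in Python as follows. The formula of Expression 1 is not reproduced in this text, so the sketch assumes one form that is consistent with the description: the pitch at time t is f_prev + (f_cur - f_prev)(t - LT) / Tf, and dT is its reciprocal. Both this form and the function name `interpolated_pulse_times` are assumptions.

```python
from typing import List


def interpolated_pulse_times(f_prev: float, f_cur: float,
                             last_pulse_time: float, frame_end: float) -> List[float]:
    """Pulse times in the current frame with pitch moving linearly from f_prev to f_cur.

    last_pulse_time plays the role of LT and frame_end - last_pulse_time that of Tf.
    """
    tf = frame_end - last_pulse_time
    times: List[float] = []
    t = last_pulse_time
    while True:
        # Assumed linear interpolation of the pitch at the time of the previous pulse.
        f = f_prev + (f_cur - f_prev) * (t - last_pulse_time) / tf
        t = t + 1.0 / f  # next pulse after the interval dT = 1 / f
        if t > frame_end:
            break
        times.append(t)
    return times
```

With f_prev = 200 Hz and f_cur = 100 Hz the intervals grow from pulse to pulse, which matches the behaviour described for frame B.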
[0035]
Having generated a temporary pulse train based on the pitch information P and gain information G supplied from the parameter determination unit 20 at each frame interval as described above, the pulse generation unit 33 corrects the temporary pulse train (waveform) based on the wrinkle degree information H supplied from the parameter determination unit 20. The correction of the temporary pulse train based on the wrinkle degree information H is described below with reference to FIGS. 5 and 6, for the case where wrinkle degree information h1 is supplied for the first frame A and wrinkle degree information h2 (different from h1) is supplied for the next frame B.
[0036]
As shown in FIG. 5, since the wrinkle degree information h1 is supplied for frame A, the pulse generation unit 33 refers to the template corresponding to h1 stored in the correction information database 34 and corrects the gain of each pulse included in frame A. When h1 = 0.4, the Δ gain information dG2(t) in the template associated with wrinkle degree “0.4” in the correction information database 34 (see FIG. 3) is used to correct the pulse gains in frame A. Accordingly, the gain g1 of the pulse generated t1 after the start of frame A is corrected by dG2(t1) (from the black dot to the x mark in the figure; for the subsequent pulses as well, the gain before correction is indicated by a black dot and the corrected gain by an x mark). Similarly, the gain g1 of the pulse generated at t2 is corrected by dG2(t2), and the gain g1 of the pulse generated at t3 is corrected by dG2(t3). Pulses generated at later timings are likewise corrected by the gain correction amount corresponding to the elapsed time from the start of the frame.
[0037]
Next, since the wrinkle degree information h2 is supplied for frame B, the pulse generation unit 33 refers to the template corresponding to h2 stored in the correction information database 34 and corrects the gain of each pulse included in frame B. When h2 = 0.6, the Δ gain information dG3(t) in the template associated with wrinkle degree “0.6” in the correction information database 34 is used to correct the pulse gains in frame B. Accordingly, the gain g2 of the pulse generated t8 after the start of frame B is corrected by dG3(t8). Similarly, the gain g2 of the pulse generated at t9 is corrected by dG3(t9), and the gain g2 of the pulse generated at t10 is corrected by dG3(t10). Pulses generated at later timings are likewise corrected by the gain correction amount corresponding to the elapsed time from the start of the frame.
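The per-pulse gain correction can be sketched in Python as follows. The description states that the correction gain amount is in dB but does not spell out how it is combined with the pulse gain; the sketch assumes the pulse gain is a linear amplitude and applies the dB amount as the factor 10^(dG/20). The function name `correct_gains` and the `(time, gain)` pulse tuples are hypothetical.

```python
from typing import Callable, List, Tuple

Pulse = Tuple[float, float]  # (generation time in seconds, linear gain)


def correct_gains(pulses: List[Pulse], frame_start: float,
                  d_gain_db: Callable[[float], float]) -> List[Pulse]:
    """Correct each pulse gain by the template value at its elapsed time in the frame."""
    corrected = []
    for t, g in pulses:
        elapsed = t - frame_start  # elapsed time from the start of the frame
        corrected.append((t, g * 10.0 ** (d_gain_db(elapsed) / 20.0)))
    return corrected
```

Here `d_gain_db` stands for a lookup such as dG2(t) or dG3(t), chosen according to the wrinkle degree supplied for the frame.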
[0038]
The above is the process of individually correcting the gain of each pulse included in the temporary pulse train generated based on the pitch information P and gain information G supplied from the parameter determination unit 20. Next, the correction of pitch, that is, the correction of the generation timing of each pulse, will be described with reference to FIG. 6.
[0039]
As shown in the figure, since the wrinkle degree information h1 (= 0.4) is supplied for frame A, the pulse generation unit 33 refers to the template corresponding to wrinkle degree “0.4” stored in the correction information database 34 and corrects the generation timing of each pulse included in frame A. That is, the Δ pitch information df2(t) is used to correct the pulse generation timing in frame A. Accordingly, the generation timing of the pulse that occurs t1 after the start of frame A in the temporary pulse train is corrected based on df2(t1) (from the x mark to the square mark in the figure). In this embodiment, df2(t) is stored as a frequency value, so the generation timing is corrected by the time 1/df2(t), the reciprocal of that frequency value. For the subsequent pulses as well, the timing before correction is indicated by an x mark and the corrected timing by a square mark. Similarly, the generation timing of the pulse that occurs at t2 in the temporary pulse train is corrected by 1/df2(t2), and the generation timing of the pulse that occurs at t3 is corrected by 1/df2(t3). For pulses that occur at later timings in the temporary pulse train, the generation timing is likewise corrected by the correction amount corresponding to the elapsed time from the start of the frame.
[0040]
Next, since the wrinkle degree information h2 (= 0.6) is supplied for frame B, the pulse generation unit 33 refers to the template corresponding to wrinkle degree “0.6” stored in the correction information database 34 and corrects the generation timing of each pulse included in frame B. That is, the Δ pitch information df3(t) is used to correct the pulse generation timing in frame B. Accordingly, the generation timing of the pulse that occurs t8 after the start of frame B in the temporary pulse train is corrected based on df3(t8), that is, by the time 1/df3(t8). Similarly, the generation timing of the pulse that occurs at t9 is corrected based on df3(t9), and that of the pulse that occurs at t10 based on df3(t10). For pulses that occur at later timings in the temporary pulse train, the generation timing is likewise corrected by the correction amount corresponding to the elapsed time from the start of the frame.
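The timing correction can be sketched in the same style. To keep the illustration simple, the sketch assumes the Δ pitch information is stored directly as a time shift in seconds, which is one of the storage alternatives the description mentions, rather than as the frequency value used in the embodiment. The function name `correct_timing` and the `(time, gain)` tuples are hypothetical.

```python
from typing import Callable, List, Tuple

Pulse = Tuple[float, float]  # (generation time in seconds, gain)


def correct_timing(pulses: List[Pulse], frame_start: float,
                   d_time_s: Callable[[float], float]) -> List[Pulse]:
    """Shift each pulse by the template value looked up at its uncorrected elapsed time."""
    corrected = []
    for t, g in pulses:
        elapsed = t - frame_start  # elapsed time in the temporary pulse train
        corrected.append((t + d_time_s(elapsed), g))
    return corrected
```

Note that the lookup uses the pulse time in the temporary (uncorrected) pulse train, as in the description.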
[0041]
As described above, based on the wrinkle degree information H supplied from the parameter determination unit 20, the pulse generation unit 33 individually corrects the gain and generation timing of each pulse in the temporary pulse train generated from the pitch information P and the gain information G supplied from the parameter determination unit 20, and thereby generates a pulse waveform as shown in the lower part of FIG. 7.
[0042]
The correction process has been described above for the case where the wrinkle degree information H supplied for each frame matches a wrinkle degree stored in the correction information database 34. However, wrinkle degree information H indicating a degree that is not stored in the correction information database 34, such as “0.3” or “0.5”, may also be supplied. The correction process for the case where such wrinkle degree information H is supplied is described next.
[0043]
When the wrinkle degree information H indicating the wrinkle degree other than that stored in the correction information database 34 is supplied, the pulse generator 33 obtains a value used for gain correction in the following procedure.
[0044]
First, from among the templates whose wrinkle degree is less than the value h indicated by the supplied wrinkle degree information H, the pulse generation unit 33 selects the one associated with the largest wrinkle degree. For example, if the value h of the supplied wrinkle degree information H is 0.3, the template corresponding to wrinkle degree “0.2” is selected (step 1).
[0045]
Next, based on the wrinkle degree value h indicated in the supplied wrinkle degree information H, one of the following formulas (1) to (5) is selected, and the interpolation ratio R is obtained by the selected formula. (Step 2).
[Expression 2]

[0046]
Once the interpolation ratio R has been obtained by the formula selected according to the value h indicated by the wrinkle degree information H, the Δ gain information of the template selected in step 1 is multiplied by R, and the result is used as the value for gain correction. For example, when the template corresponding to wrinkle degree “0.2” is selected in step 1, the Δ gain information used for correction is R × dG1(t).
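Steps 1 and 2 and the scaling by R can be sketched in Python as follows. Formulas (1) to (5) of Expression 2 are not reproduced in this text, so `interpolation_ratio` below is only a placeholder assumption (R = h divided by the selected template's wrinkle degree), not the formula of the source. The template selection rule follows the description; all function names are hypothetical.

```python
from typing import Callable, Dict, Sequence

STORED_DEGREES = (0.2, 0.4, 0.6, 0.8, 1.0)  # values from the example of FIG. 3


def select_template_degree(h: float, stored: Sequence[float] = STORED_DEGREES) -> float:
    """Step 1: largest stored wrinkle degree that is less than h."""
    lower = [d for d in stored if d < h]
    if not lower:
        raise ValueError("no stored wrinkle degree below h")
    return max(lower)


def interpolation_ratio(h: float, base: float) -> float:
    """Step 2 placeholder: assumed ratio, standing in for formulas (1) to (5)."""
    return h / base


def interpolated_d_gain(h: float, d_gain_by_degree: Dict[float, Callable[[float], float]],
                        t: float) -> float:
    """R x dG(t) for the template selected in step 1."""
    base = select_template_degree(h)
    return interpolation_ratio(h, base) * d_gain_by_degree[base](t)
```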
[0047]
In this way, when wrinkle degree information H indicating a wrinkle degree other than those stored in the correction information database 34 is supplied, the pulse generation unit 33 corrects the temporary pulse train using the interpolated Δ gain information.
[0048]
FIGS. 5 and 6 describe the correction for the case where the wrinkle degree information H supplied for two consecutive frames (frame A and frame B) differs and the template time stored in the correction information database 34 is longer than the frame time. When the same wrinkle degree information H is supplied for consecutive frames, the pulse generation unit 33 performs the correction process as follows.
[0049]
As shown in FIG. 8, for frame A the gain of each pulse is corrected, as in the case shown in FIG. 5, using the gain correction amounts corresponding to the elapsed times t1, t2, ..., t7 from the start of frame A. Next, for each pulse included in frame B, the Δ gain information of the template corresponding to the same wrinkle degree as in frame A (here dG2(t)) is used, and t is taken not as the elapsed time from the start of frame B but as the elapsed time t′8, t′9, ..., t′12 from the start of frame A. That is, the gain correction amounts of the pulses in frame B are, in order from the first pulse, dG2(t′8), dG2(t′9), dG2(t′10), dG2(t′11), and dG2(t′12). The gain of each pulse in frame B is corrected using these gain correction amounts.
[0050]
Likewise, for the generation timing of each pulse included in the subsequent frame B, the elapsed times t′8, t′9, ..., t′12 from the start of frame A, not from the start of frame B, are used, as in the case of the gain correction. That is, the correction amounts for the generation timing of the pulses in frame B are, in order from the first pulse, df2(t′8), df2(t′9), df2(t′10), df2(t′11), and df2(t′12). The generation timing of each pulse in frame B is corrected using these correction amounts.
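One way to picture this carry-over is to compute, for each frame, the time origin from which the template time t is measured: the start of the earliest frame in the current run of frames that share the same wrinkle degree. The following Python sketch does that; the function name `template_origin` and the list-based inputs are hypothetical.

```python
from typing import Sequence


def template_origin(frame_index: int, frame_starts: Sequence[float],
                    wrinkles: Sequence[float]) -> float:
    """Start time of the first frame in the run of equal wrinkle degrees ending here."""
    i = frame_index
    # Walk back while the previous frame was given the same wrinkle degree.
    while i > 0 and wrinkles[i - 1] == wrinkles[i]:
        i -= 1
    return frame_starts[i]
```

For a pulse at time `tp` in frame `k`, the template would then be looked up at `tp - template_origin(k, frame_starts, wrinkles)`.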
[0051]
FIGS. 5 and 6 describe the correction for the case where the template is longer than the frame interval, that is, where the time indicated by the template time information is longer than the time of one frame. When the template time is shorter than the frame time, the pulse generation unit 33 performs the correction process as follows.
[0052]
As shown in FIG. 9, when the time length of the template used for frame A is shorter than the time length of frame A, the pulse generation unit 33 uses the template repeatedly. In the illustrated example, the fifth and subsequent pulses in frame A are generated after the template time has elapsed, and for these pulses the elapsed times t′5, t′6, and t′7, measured from the point at which the template time elapsed, are used. That is, the gain correction amounts of the fifth and subsequent pulses in frame A are dG2(t′5), dG2(t′6), and dG2(t′7), and the pulse generation unit 33 corrects the gain of each pulse by the corresponding gain correction amount.
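Repeating a template that is shorter than the frame amounts to wrapping the elapsed time around the template time. A minimal Python sketch, with the hypothetical helper name `wrapped_template_time`:

```python
def wrapped_template_time(elapsed: float, template_time: float) -> float:
    """Map an elapsed time onto the template by restarting it each time it ends."""
    if template_time <= 0.0:
        raise ValueError("template_time must be positive")
    return elapsed % template_time
```

For example, with a 10 msec template, a pulse 12.5 msec into the frame is corrected with the template value at about 2.5 msec.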
[0053]
When the wrinkle degree information H does not match a wrinkle degree stored in the correction information database 34, when the wrinkle degree information H is the same for consecutive frames, or when the template time is shorter than the frame time, the pulse generation unit 33 obtains the gain and pitch correction amounts as described above and individually corrects the gain and generation timing of each pulse included in the temporary pulse train.
[0054]
The pulse waveform generated by the pulse generation unit 33 as described above is supplied to the continuous discrete conversion unit 35. In the pulse waveform generated by the pulse generation unit 33, the generation time of each pulse is expressed in continuous time, whereas the digital signal processing performed by the synthesis filter 40 in the subsequent stage requires a waveform expressed in discrete time. The continuous discrete conversion unit 35 therefore converts the pulse waveform generated by the pulse generation unit 33 into a waveform expressed in discrete time.
[0055]
When the continuous-time pulse waveform generated by the pulse generation unit 33 is converted into a discrete-time waveform, simply rounding each continuous time to the nearest discrete time means that pulses occur only at sampling instants. As a result, only frequencies equal to the sampling frequency divided by an integer can be produced, and a pitch error occurs in the converted waveform. Furthermore, viewed in the frequency domain, each pulse of a pulse waveform such as that shown in FIG. 7 has components at all frequencies, so aliasing noise is generated when D/A conversion is performed.
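The pitch error caused by rounding can be seen with a small calculation. The Python sketch below assumes a constant pitch and a pulse period rounded to a whole number of samples; the function name `quantized_pitch` and the 44.1 kHz sampling frequency in the usage note are illustrative assumptions, not values from the source.

```python
def quantized_pitch(f_hz: float, fs_hz: float) -> float:
    """Pitch actually produced when the pulse period is rounded to whole samples."""
    period_samples = round(fs_hz / f_hz)  # period forced onto the sampling grid
    return fs_hz / period_samples         # always fs divided by an integer
```

With fs = 44100 Hz, a requested pitch of 440 Hz has a period of about 100.23 samples; rounding to 100 samples produces 441 Hz, an error of 1 Hz.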
[0056]
In view of these problems, when converting a waveform expressed in continuous time (see FIG. 7) into a waveform expressed in discrete time, the continuous discrete conversion unit 35 in the present embodiment first replaces each pulse of the pulse waveform generated by the pulse generation unit 33 with a waveform represented by the following sinc function.
[Equation 3]
$$ s(t) = G \, \frac{\sin\bigl(\pi (t - t_p) / T_s\bigr)}{\pi (t - t_p) / T_s} $$
[0057]
In the above equation, G is the gain of the pulse to be replaced, tp is the generation time of the pulse to be replaced, and Ts is the sampling period.
[0058]
Next, the waveform replaced with the sinc function is sampled at the sampling period Ts to obtain a sound source waveform expressed in discrete time. As a result, a waveform having a gain of 0 is obtained in a band other than the frequency of 0 to fs / 2 (Hz). Note that fs is a sampling frequency.
[0059]
By following the above procedure, the continuous discrete conversion unit 35 converts the pulse waveform generated by the pulse generation unit 33 into a waveform expressed in discrete time. FIGS. 11 and 12 show the conversion when the pulse waveform shown in FIG. 10 is supplied from the pulse generation unit 33. As shown in FIG. 10, this pulse waveform contains two pulses with different amplitudes. Each pulse is replaced with a waveform represented by the sinc function so that the position of the pulse becomes the center of the sinc function, which yields the waveform shown in FIG. 11. Sampling this waveform at the sampling frequency fs then yields a waveform expressed in discrete time, as shown in FIG. 12.
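The two-step conversion (replace each pulse with a sinc waveform, then sample at the period Ts) can be sketched in Python as follows. The sketch assumes the sinc waveform has the form G·sin(π(t − tp)/Ts) / (π(t − tp)/Ts), which is consistent with the definitions of G, tp, and Ts and with the stated band limit of fs/2, and it sums the contributions of all pulses at each sampling instant. Function names are hypothetical.

```python
import math
from typing import List, Tuple

Pulse = Tuple[float, float]  # (generation time tp in seconds, gain G)


def sinc_pulse(t: float, gain: float, tp: float, ts: float) -> float:
    """Value at time t of a sinc waveform centred on tp (assumed form)."""
    x = math.pi * (t - tp) / ts
    return gain if x == 0.0 else gain * math.sin(x) / x


def sample_pulse_train(pulses: List[Pulse], ts: float, n_samples: int) -> List[float]:
    """Sample the sum of the sinc-replaced pulses at t = n * ts."""
    return [sum(sinc_pulse(n * ts, g, tp, ts) for tp, g in pulses)
            for n in range(n_samples)]
```

A pulse that falls exactly on a sampling instant contributes only to that sample, while a pulse halfway between two instants is spread over neighbouring samples, which is how sub-sample pulse timing survives the conversion.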
[0060]
The voiced sound source waveform generation device 32 outputs the sound source waveform expressed in discrete time converted by the continuous discrete conversion unit 35 as described above in the voiced sound section.
[0061]
As described above, the sound source waveform generation device 30 according to the present embodiment uses templates that reproduce subtle fluctuations in pitch and gain, created in advance by analyzing the vocal cord waveform obtained when a person utters a hoarse voice, to generate a pulse waveform in which the pitch and gain of each pulse in the waveform generated according to the pitch information P and the gain information G are corrected. That is, by correcting the pulse waveform using the templates in the correction information database 34, which are created from the analysis results of human vocal cord waveforms, it is possible to generate a sound source waveform that more accurately reproduces the subtle fluctuations in pitch and gain of the natural voice (vocal cord waveform) uttered by a person. As a result, the sound produced from the speech waveform synthesized using this sound source waveform can give the listener a more natural impression.
[0062]
Further, in the present embodiment, each pulse constituting the sound source waveform is corrected individually so that the natural voice waveform (vocal cord waveform) uttered by a person, including its wrinkle degree, is faithfully reproduced. Because each pulse is corrected by the amount best suited to it, based on voices actually uttered by people, it is possible to generate a sound source waveform that more accurately reproduces the irregular fluctuations in pitch and gain of the natural voice (vocal cord waveform) uttered by a person, and to synthesize a speech waveform that more accurately reproduces the naturalness of a human voice, including its wrinkle degree.
[0063]
In the present embodiment, since templates for correcting each pulse are prepared in the correction information database 34 for a plurality of degrees of hoarseness, human voices with various degrees of hoarseness can be selectively reproduced. In other words, people differ in how they produce their voices, and the subtle fluctuations in the pitch and gain of the vocal cord waveform vary from speaker to speaker; as a result, the listener perceives the voices as having different degrees of hoarseness. In the present embodiment, as described above, templates created for a plurality of different degrees of hoarseness, that is, for subtly different pitch and gain fluctuations, are stored in the correction information database 34. By using them, the voices of people with various degrees of hoarseness can be reproduced in a more natural form.
[0064]
C. Modifications
The present invention is not limited to the embodiment described above; various modifications, such as those illustrated below, are possible.
[0065]
(Modification 1)
In the embodiment described above, the present invention is applied to a speech synthesizer using LPC synthesis technology. However, the present invention is not limited to this and can be applied to various speech synthesizers having a sound source waveform generation device that generates a sound source waveform based on the pitch information P and the gain information G. For example, the present invention can be applied to a PARCOR synthesizer. It can also be applied to a speech synthesizer that converts the time-domain sound source waveform generated as described above into a frequency-domain waveform, performs synthesis processing that reflects vocal tract characteristics and the like on the frequency-domain waveform, and then converts the processed waveform back into a time-domain waveform for output.
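The frequency-domain variant mentioned above can be illustrated with a minimal sketch. The function name and the `envelope` argument (a per-bin gain standing in for the vocal tract characteristics) are hypothetical; the source does not specify how the vocal tract characteristics are represented or which transform is used, so an FFT of a single frame is assumed here.

```python
import numpy as np

def apply_vocal_tract_in_frequency_domain(source, envelope):
    """Time domain -> frequency domain -> apply envelope -> time domain.

    source:   one frame of the time-domain sound source waveform
    envelope: hypothetical per-bin gain, length len(source) // 2 + 1
    """
    spectrum = np.fft.rfft(source)            # to the frequency domain
    if envelope.shape != spectrum.shape:
        raise ValueError("envelope must have one value per rfft bin")
    shaped = spectrum * envelope              # reflect the assumed vocal tract response
    return np.fft.irfft(shaped, n=len(source))  # back to the time domain

# Example: attenuate the upper half of the spectrum of a 64-sample frame
frame = np.zeros(64)
frame[8] = 1.0
env = np.ones(33)
env[16:] = 0.25
voiced = apply_vocal_tract_in_frequency_domain(frame, env)
```

Multiplying spectra frame by frame corresponds to circular convolution, so a practical system would typically add windowing and overlap-add between frames; that is outside the scope of this sketch.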
[0066]
(Modification 2)
Further, in the above-described embodiment, five types of templates corresponding to degrees of hoarseness are stored in the correction information database 34. However, six or more types of templates, or four or fewer types, may be stored in the correction information database 34 instead.
[0067]
Further, a correction information database 34′ configured as shown in FIG. 13 may be used. As shown in the figure, this correction information database 34′ stores, for each pitch range such as "0 to X (Hz)" and "X to Y (Hz)" (X < Y), templates corresponding to the same five degrees of hoarseness as in the above embodiment. The five types of templates for each pitch range are obtained, as in the above embodiment, by analyzing the vocal cord waveform of speech uttered by a person at a pitch within that range. When the pulse waveform is corrected using the correction information database 34′, in which templates are stored for each pitch range, the pulse generation unit 33 identifies the pitch range to which the pitch indicated by the pitch information P supplied from the parameter determination unit 20 belongs, and corrects the pulse waveform using the templates associated with the identified pitch range. For example, when the illustrated templates are stored in the correction information database 34′, the pitch indicated by the pitch information P lies within the range "X to Y", and the hoarseness degree information H is "0.4", the templates Δ gain information dG12(t), Δ gain average information AG12, Δ pitch information df12(t), and template time information T12 are used for pulse correction.
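The two-step lookup described here (first the pitch range, then the degree of hoarseness) might be sketched as follows. All concrete values are placeholders: the boundaries standing in for X and Y, the five hoarseness levels, the half-open treatment of range boundaries, and the nearest-level matching are assumptions made only for illustration, and the template contents are dummy labels.

```python
from bisect import bisect_right

# Hypothetical stand-ins for the boundaries 0, X, Y (Hz) and the five levels
PITCH_EDGES_HZ = [0.0, 200.0, 400.0]
LEVELS = [0.2, 0.4, 0.6, 0.8, 1.0]

def build_database():
    """One dict per pitch range, keyed by hoarseness level.
    Each entry is a placeholder for a template set (dG(t), AG, df(t), T)."""
    return [
        {h: {"range": i, "level": h} for h in LEVELS}
        for i in range(len(PITCH_EDGES_HZ) - 1)
    ]

def select_template(db, pitch_hz, hoarseness):
    """Pick the template set for the pitch range containing pitch_hz and
    the stored level closest to the requested hoarseness degree."""
    i = bisect_right(PITCH_EDGES_HZ, pitch_hz) - 1  # ranges are [lower, upper)
    if i < 0 or i >= len(PITCH_EDGES_HZ) - 1:
        raise ValueError("pitch is outside every stored pitch range")
    level = min(LEVELS, key=lambda h: abs(h - hoarseness))
    return db[i][level]

db = build_database()
template = select_template(db, pitch_hz=250.0, hoarseness=0.4)
```

Keeping the range boundaries in a sorted list lets the range be found with a binary search, which stays cheap even if many more pitch ranges are stored.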
[0068]
Further, a correction information database 34″ configured as shown in FIG. 14 may be used. As shown in FIG. 14, the correction information database 34″ stores, for each gain range such as "0 to α (dB)" and "α to β (dB)" (α < β), five types of templates corresponding to degrees of hoarseness. The five types of templates for each gain range are obtained, as in the above embodiment, by analyzing the vocal cord waveform of speech uttered by a person at a gain within that range. When the pulse waveform is corrected using the correction information database 34″, in which templates are stored for each gain range, the pulse generation unit 33 identifies the gain range to which the gain indicated by the gain information G supplied from the parameter determination unit 20 belongs, and corrects the gain of each pulse using the templates associated with the identified gain range.
[0069]
Further, the vocal cord waveforms of male and female voices may each be analyzed, with templates for male voices and templates for female voices stored in the correction information database 34, so that the templates used to correct the pulse waveform are selected according to a specified gender.
[0070]
(Modification 3)
In the embodiment described above, the parameter determination unit 20 generates the hoarseness degree information H based on the hoarseness degree data included in the document data supplied to the speech synthesizer 100 and supplies the generated hoarseness degree information H to the sound source waveform generation device 30. However, the degree of hoarseness may instead be designated by the user via an operation panel or the like.
[0071]
(Modification 4)
In the above-described embodiment, the present invention is applied to the speech synthesizer 100, which synthesizes speech based on document data. However, the present invention may also be applied to a singing voice synthesizer that synthesizes a singing voice based on data including lyrics information and melody information (for example, karaoke data).
[0072]
(Modification 5)
The speech synthesizer 100 in the above-described embodiment may be configured as a dedicated hardware circuit, but it may also be realized in software by a computer system as shown in FIG. 15. As shown in the figure, this computer system includes a CPU (Central Processing Unit) 320 that controls the entire apparatus, a ROM (Read Only Memory) 321 that stores various data groups and program groups, a RAM (Random Access Memory) 322 used as a work area, an external storage device 323 such as a hard disk or a CD-ROM (Compact Disc Read Only Memory) drive that stores various data and program groups, an operation unit 324 such as a keyboard and a mouse, a display unit 325 that displays various information to the user, a D/A converter 326, an amplifier 327, and a speaker 328.
[0073]
In accordance with a program stored in the ROM 321 or in the external storage device 323 such as a hard disk, the CPU 320 uses the data groups constituting the correction information database 34, which are likewise stored in the ROM 321 or the external storage device 323, to generate a sound source waveform in which the gain and pitch are corrected for each pulse, as in the above-described embodiment. It then applies synthesis processing according to the vocal tract characteristics to this sound source waveform to synthesize a speech waveform signal.
[0074]
The CPU 320 then outputs the synthesized speech waveform signal to the D/A converter 326. The D/A converter 326 converts the speech waveform signal into an analog signal, which is amplified by the amplifier 327 and then emitted as sound from the speaker 328.
[0075]
As described above, the speech synthesizer including the sound source waveform generation device according to the above-described embodiment can be realized in software using a computer system, and a program that causes a computer system to execute speech synthesis processing similar to that of the above-described embodiment may be provided to users. Such a program may be provided, for example, by storing it on a recording medium such as a CD-ROM or a floppy disk, or via a communication line such as the Internet.
[0076]
[Effect of the invention]
As described above, according to the present invention, it is possible to synthesize a speech waveform that can give a listener a more natural impression.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a speech synthesizer including a sound source waveform generation device according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining document data used for speech synthesis by the speech synthesizer.
FIG. 3 is a diagram for explaining the contents of data stored in a correction information database that is a component of the sound source waveform generation apparatus.
FIG. 4 is a diagram for explaining the contents of a sound source waveform generation process by the sound source waveform generation apparatus.
FIG. 5 is a diagram for explaining the contents of a sound source waveform generation process by the sound source waveform generation apparatus.
FIG. 6 is a diagram for explaining the contents of a sound source waveform generation process by the sound source waveform generation apparatus.
FIG. 7 is a diagram for explaining the contents of a sound source waveform generation process by the sound source waveform generation apparatus.
FIG. 8 is a diagram for explaining the contents of another sound source waveform generation process by the sound source waveform generation apparatus.
FIG. 9 is a diagram for explaining the contents of other sound source waveform generation processing by the sound source waveform generation apparatus.
FIG. 10 is a diagram for explaining how a sound source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
FIG. 11 is a diagram for explaining how a sound source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
FIG. 12 is a diagram for explaining how a sound source waveform expressed in continuous time, generated by the sound source waveform generation device, is converted into a waveform expressed in discrete time.
FIG. 13 is a diagram for explaining the contents of data stored in a correction information database in a modification of the sound source waveform generation device.
FIG. 14 is a diagram for explaining the contents of data stored in a correction information database in another modification of the sound source waveform generation device.
FIG. 15 is a block diagram showing a hardware configuration of a computer system for realizing the same processing as that of the speech synthesizer by software.
[Explanation of symbols]
10 ... speech unit database, 20 ... parameter determination unit, 30 ... sound source waveform generation device, 31 ... unvoiced sound source waveform generation device, 32 ... voiced sound source waveform generation device, 33 ... pulse generation unit, 34 ... correction information database, 35 ... continuous-discrete conversion unit, 40 ... synthesis filter, 100 ... speech synthesizer, 320 ... CPU, 321 ... ROM, 322 ... RAM, 323 ... external storage device

Claims

A sound source waveform generation device that generates a sound source waveform used when synthesizing a speech waveform, comprising:
storage means for storing, for each preset pitch range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and
waveform generation means to which pitch information, gain information, and hoarseness degree information of the waveform to be generated are input, which reads from the storage means, in the range corresponding to the input pitch information, the correction information corresponding to the input hoarseness degree information, and which generates a sound source waveform by correcting the input pitch information and gain information based on the read correction information.

A sound source waveform generation device that generates a sound source waveform used when synthesizing a speech waveform, comprising:
storage means for storing, for each preset gain range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and
waveform generation means to which pitch information, gain information, and hoarseness degree information of the waveform to be generated are input, which reads from the storage means, in the range corresponding to the input gain information, the correction information corresponding to the input hoarseness degree information, and which generates a sound source waveform by correcting the input pitch information and gain information based on the read correction information.

The sound source waveform generation device according to claim 1, wherein the hoarseness degree information is input in predetermined units.

2. The sound source waveform generation device according to any one of 4 to 4, wherein the waveform generation means individually corrects, based on the correction information, the generation timing of each pulse and the gain of each pulse in a waveform composed of a pulse train according to the pitch information and the gain information.

5. The sound source waveform generation device according to claim 1, wherein the correction information stored in the storage means is information created based on a result of analyzing speech uttered by a person.

A speech synthesizer comprising:
storage means for storing, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech;
information acquisition means for acquiring pitch information, gain information, and hoarseness degree information based on the speech content to be pronounced;
waveform generation means for reading from the storage means, in the range corresponding to the pitch information or gain information acquired by the information acquisition means, the correction information corresponding to the hoarseness degree information, and for generating a sound source waveform by correcting the pitch information and the gain information acquired by the information acquisition means based on the read correction information; and
synthesis means for synthesizing a speech waveform by filtering the sound source waveform generated by the waveform generation means according to the speech content to be pronounced.

A method of generating a sound source waveform used when synthesizing a speech waveform,
storing in storage means, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and
receiving as input pitch information, gain information, and hoarseness degree information of the waveform to be generated, reading from the storage means, in the range corresponding to the input pitch information or gain information, the correction information corresponding to the input hoarseness degree information, and generating a sound source waveform by correcting the input pitch information and gain information based on the read correction information.

A program for causing a computer to function as:
means for reading correction information from storage means that stores, for each preset pitch range or each preset level range, a plurality of pieces of correction information for correcting the pitch, the gain, or both of a waveform, each corresponding to a predefined degree of hoarseness of speech; and
waveform generation means to which pitch information, gain information, and hoarseness degree information of the waveform to be generated are input, which causes the reading means to read, in the range corresponding to the input pitch information or gain information, the correction information corresponding to the input hoarseness degree information, and which generates a sound source waveform by correcting the input pitch information and gain information based on the read correction information.