JP3881971B2

JP3881971B2 - Voice quality difference evaluation table creation device, voice corpus voice quality difference evaluation table creation system, and speech synthesis system

Info

Publication number: JP3881971B2
Application number: JP2003297306A
Authority: JP
Inventors: 恒河井; 実津崎; 智基戸田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-08-21
Filing date: 2003-08-21
Publication date: 2007-02-14
Anticipated expiration: 2023-08-21
Also published as: JP2005070214A

Description

The present invention relates to speech synthesis technology, and more particularly to unit-concatenation speech synthesis, in which speech units are selected from a speech corpus and concatenated to synthesize speech close to natural utterances.

With the development of computer and data communication technology, the interface between humans and machines has become important. It is desirable for people to communicate with machines in the same way they talk with other people, and technology toward that end is under development.

Recognition technologies such as speech recognition and image recognition are mainly used to convey information from humans to machines. There are various ways to convey information from machines to humans; among them, speech synthesis is used increasingly often, with voice response systems and speech translation systems as typical applications. Furthermore, with recent progress in the development of robots and the like, combining speech recognition and image recognition with speech synthesis is expected to make communication between humans and robots resemble communication between humans.

FIG. 8 is a block diagram of a unit-concatenation speech synthesis system that uses a speech corpus. Referring to FIG. 8, in such a system, natural human utterances are recorded, and the speech units of those utterances are compiled in advance into speech corpus 40.

When input text 42 is given to this system, speech unit selection unit 44 selects, from speech corpus 40, the speech units corresponding to the sounds that make up the input text as candidates for use in synthesis. Speech unit selection unit 44 evaluates the selected candidates with evaluation function 50 and, according to the result, determines the speech units best suited for synthesis. In this way, the speech units corresponding to the sounds that make up input text 42 are extracted. Speech unit concatenation unit 46 concatenates this series of speech units to generate synthesized speech data 48 corresponding to input text 42.

Speech unit selection unit 44 supplies evaluation function 50 with observable physical quantities of the previously selected speech units and of the candidate speech units as variables. Evaluation function 50 outputs an evaluation value for the given physical quantities as its dependent variable. Based on the evaluation value output by evaluation function 50, speech unit selection unit 44 determines, from the selected candidates, a speech unit suitable for concatenation to the immediately preceding speech unit.

Speech synthesized in this way is built from speech actually uttered by a human. It is therefore possible to synthesize relatively natural speech that does not sound "mechanical". In addition, as the speech corpus grows, speech unit selection unit 44 has more options when selecting speech units. It can therefore pick units well suited for concatenation from among many options, which makes it more likely that the quality of the synthesized speech improves. For this reason, current unit-concatenation speech synthesis systems use large speech corpora containing several tens of hours of recorded speech.

On the other hand, a unit-concatenation speech synthesis system synthesizes continuous speech by concatenating speech units extracted from separately recorded speech. The voice quality of the concatenated units is therefore required to be uniform. Otherwise the synthesized speech sounds discontinuous and its quality degrades. In many cases, therefore, a speech corpus consisting of a single speaker's speech is used.

In this case, because there is only one speaker, building a large speech corpus requires recording over a long period. The utterances of the single speaker are therefore recorded over multiple recording periods (hereinafter, each such recording period is called a "session"). In some cases the recording takes from several months to several years.

Thus, even when recording a single speaker, creating a large speech corpus may require a recording period of several months, as noted above. Over such a long period it is extremely difficult to keep recording conditions constant from day to day. In particular, it is extremely difficult for the speaker to maintain a constant physical condition. It is therefore unavoidable that, owing to these changes in recording conditions, the voice quality of the recorded speech data differs from session to session. As a result, the voice quality of the synthesized speech varies and a sense of discontinuity arises. A technique for eliminating the unnaturalness that arises when speech waveform units are concatenated is therefore desired.

To eliminate the unnaturalness that arises when speech units are concatenated, the question is how to evaluate and select the speech waveform units used for synthesis. Usually, some acoustic feature related to each speech waveform unit is computed, and a unit that satisfies a predetermined condition is selected. To reduce unnaturalness, it is important to select units using a measure that matches perceptual characteristics as closely as possible.

In Non-Patent Document 1, cited below, an evaluation function called a "cost function" that reflects perceptual characteristics is used to compute a cost for each candidate speech unit, and the waveform unit with the minimum cost is selected. Selecting waveform units with such a cost function is expected to yield more natural synthesized speech.

However, it is not clear which physical measures, used as the cost function, would eliminate the unnaturalness perceived at waveform concatenation. That is, the correspondence between physical measures and the naturalness of synthesized speech is not clear. Non-Patent Document 1 therefore divides the cost function into multiple sub-cost functions corresponding to various factors.

Each sub-cost function, given its corresponding (observable) physical quantity, outputs a sub-cost as a function of that quantity. The cost that serves as the evaluation value is calculated by multiplying these sub-costs by weights and summing them.
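The weighted combination described above can be sketched as follows. This is a minimal illustration only, not the formulation of Non-Patent Document 1: the function name, the sub-cost names, and all numeric values are hypothetical placeholders.

```python
def total_cost(sub_costs, weights):
    """Return the weighted sum of sub-costs.

    Both arguments map a (hypothetical) sub-cost name to a number.
    """
    if set(sub_costs) != set(weights):
        raise ValueError("each sub-cost needs exactly one weight")
    return sum(weights[name] * sub_costs[name] for name in sub_costs)


# Hypothetical sub-costs and weights for one candidate unit.
example_sub_costs = {
    "prosody": 0.5,
    "f0_discontinuity": 1.0,
    "spectral_discontinuity": 2.0,
}
example_weights = {
    "prosody": 2.0,
    "f0_discontinuity": 0.5,
    "spectral_discontinuity": 0.25,
}
example_cost = total_cost(example_sub_costs, example_weights)
```

Keeping the weights separate from the sub-cost functions means the relative importance of each factor can be tuned without touching how the individual sub-costs are computed.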

Non-Patent Document 1 uses a sub-cost function for prosody, a sub-cost function for F0 (fundamental frequency) discontinuity, a sub-cost function for phonetic environment substitution, a sub-cost function for spectral discontinuity, and a sub-cost function for phoneme suitability. Of these, for phonetic environment substitution, a factor whose relationship with perceptual evaluation is comparatively easy to understand, a mapping between perceptual evaluation and physical quantities is performed. For the other factors, however, perceptual evaluation is not used.

Non-Patent Document 1: Tomoki Toda, Hisashi Kawai, Minoru Tsuzaki, Kiyohiro Shikano, "Unit Selection Based on Phoneme Units and Diphone Units in Concatenative Japanese Text-to-Speech Synthesis" (戸田智基、河井恒、津崎実、鹿野清宏、「素片接続型日本語テキスト音声合成における音素単位とダイフォン単位に基づく素片選択」), IEICE Transactions (電子情報通信学会論文誌), Vol. J85-D-II, No. 12, pp. 1760-1770, Dec. 2002.

The technique described in Non-Patent Document 1 evaluates the degradation of naturalness caused by phonetic environment substitution through perceptual evaluation and reflects the result in a sub-cost function. However, Non-Patent Document 1 does not consider degradation of the naturalness of synthesized speech caused by the speaker's voice quality differing from session to session. This is because the correspondence between voice quality differences arising from session differences and the physical characteristics of speech is unknown, or extremely difficult to identify.

Accordingly, an object of the present invention is to provide a voice quality difference evaluation table creation device and a speech synthesis system for reducing the degradation in quality of synthesized speech that is caused by changes in voice quality, arising from changes in the speaker's physical condition and the like, when speech units extracted from speech data recorded at different times are concatenated to synthesize speech.

Another object of the present invention is to provide a voice quality difference evaluation table creation device and a speech synthesis system for evaluating, with good sensitivity, differences in voice quality between speech data recorded at different times that arise from changes in the speaker's physical condition and the like.

A voice quality difference evaluation table creation device according to a first aspect of the present invention is a device that creates a voice quality difference evaluation value table representing differences in voice quality among multiple kinds of utterance speech data. The device includes: extraction means for extracting an arbitrary combination of first and second utterance speech data from the multiple kinds of utterance speech data; stimulus generation means for generating first and second speech stimuli based on the respective utterance speech data of the combination extracted by the extraction means; perceptual test means for conducting, for each such combination, a perceptual test on the audible difference in voice quality between the first and second speech stimuli and deriving an evaluation value for the voice quality difference between the utterance speech data of that combination; and table creation means for creating the voice quality difference evaluation value table by storing, based on the results of the perceptual test, the evaluation value representing the voice quality difference between the utterance speech data of each combination in association with that combination.

Differences in voice quality between utterance speech data recorded under different recording conditions can thus be evaluated by a perceptual test. Differences in the voice quality of utterances that are difficult to evaluate with physical measures can therefore be evaluated with good sensitivity.

Preferably, the perceptual test means includes: stimulus presentation means for presenting the first and second speech stimuli in contrast to a subject of the perceptual test; and acquisition means for acquiring, as the result of the perceptual test, the subject's rating of the magnitude of the audible voice quality difference between the first and second speech stimuli.

Having the subject compare the voice qualities of the utterances under evaluation during the perceptual test allows the subject to rate clearly the voice quality difference between the two sets of utterance speech data being compared.

More preferably, the acquisition means includes: means for obtaining the magnitude of the audible difference as rated by the subject on a predetermined rating scale; and evaluation value derivation means for deriving the evaluation value based on the rating scale values obtained from the subject's rating.

Conducting the perceptual test with a rating scale yields clearer test results.

The evaluation value derivation means may include means for deriving the evaluation value by converting the rating-scale result according to a predetermined conversion criterion.

Furthermore, by defining a conversion criterion and using it to derive the evaluation value, the magnitude of the audible voice quality difference can be quantified.
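As a rough sketch of such a conversion, a rating category could be mapped to an evaluation value through a lookup table. The text does not specify the conversion criterion, so the table below, the direction of the scale (1 taken to mean "no audible difference"), and the name `to_evaluation_value` are all assumptions made for illustration.

```python
# Hypothetical conversion criterion: rating category -> evaluation value.
# Category 1 is assumed to mean "no audible difference", 5 "very large difference".
CONVERSION = {1: 0.0, 2: 0.25, 3: 0.5, 4: 0.75, 5: 1.0}


def to_evaluation_value(rating, conversion=CONVERSION):
    """Convert one rating-scale category into a numeric evaluation value."""
    if rating not in conversion:
        raise ValueError(f"rating {rating!r} is not on the scale")
    return conversion[rating]


# A subject who chose the middle category.
middle_value = to_evaluation_value(3)
```

A lookup table keeps the conversion explicit, so a nonlinear criterion can be substituted without changing the calling code.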

A voice quality difference evaluation table creation device according to a second aspect of the present invention is a device that creates a voice quality difference evaluation value table representing differences in voice quality among multiple kinds of utterance speech data. The device includes: extraction means for extracting each of the utterance speech data included in the multiple kinds of utterance speech data in an order determined by a predetermined procedure; perceptual test means for conducting, for each of the utterance speech data extracted by the extraction means, a perceptual test on its audible voice quality and deriving an evaluation value for the voice quality of that utterance speech data; and table creation means for creating the voice quality evaluation value table by storing, based on the voice quality evaluation values of the multiple kinds of utterance speech data obtained by the perceptual test, an evaluation value for the voice quality difference between each arbitrary combination of utterance speech data in association with that combination.

Because the subject rates the auditory impression of each speech stimulus individually, the perceptual test yields a more multifaceted rating. In addition, evaluation values for voice quality differences can be derived with a small number of trials.

Preferably, the perceptual test means includes: stimulus generation means for generating a speech stimulus based on each of the utterance speech data extracted by the extraction means; and evaluation value acquisition means for acquiring the subject's evaluation value, on each of a plurality of predetermined rating scales, for the speech stimulus generated by the stimulus generation means.

Having the subject rate on a plurality of rating scales makes it possible to derive evaluation values for voice quality differences from a wider variety of auditory impressions.

More preferably, the evaluation value acquisition means includes: means for acquiring from the subject, as discrete values each representing one of a predetermined number of levels, the subject's evaluation value on each of the plurality of rating scales for the speech stimulus generated by the stimulus generation means; and vector generation means for generating a vector whose elements are the subject's discrete evaluation values on the respective rating scales acquired by the acquiring means.

Generating a vector from the subject's evaluations makes it possible to quantify the subject's evaluation of the auditory impression.

Still more preferably, the table creation means includes: distance calculation means for calculating, for each arbitrary combination of the plurality of utterance speech data, the distance between the vectors obtained by the vector generation means for the utterance speech data belonging to that combination, according to a predetermined calculation method; and means for storing the distance calculated by the distance calculation means in the voice quality difference evaluation table in association with the corresponding combination of utterance speech data.

Using the distance between the vectors obtained by the vector generation means as the evaluation value for the voice quality difference makes it possible to quantify voice quality differences on the basis of the subject's multifaceted ratings.
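The vector-distance idea can be illustrated with a short sketch. The text only says the distance is computed "according to a predetermined calculation method"; Euclidean distance is used here purely as an assumed example, and the session labels, scale count, and rating values are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations


def distance_table(rating_vectors):
    """Map each unordered pair of sessions to the distance between their rating vectors.

    rating_vectors maps a session label to a tuple of discrete ratings,
    one element per rating scale. Euclidean distance is an assumption.
    """
    table = {}
    for a, b in combinations(sorted(rating_vectors), 2):
        table[(a, b)] = math.dist(rating_vectors[a], rating_vectors[b])
    return table


# Hypothetical ratings on three scales for three sessions.
ratings = {"S1": (1, 2, 3), "S2": (1, 2, 5), "S3": (4, 6, 3)}
table = distance_table(ratings)
```

Because each session is rated once rather than once per pair, N sessions need only N ratings per subject, while the table still covers all N(N-1)/2 pairs.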

A voice quality difference evaluation table creation system for a speech corpus according to a third aspect of the present invention includes: a speech corpus consisting of multiple kinds of utterance speech data recorded under a plurality of different recording conditions; and the voice quality difference evaluation table creation device according to the first or second aspect of the present invention, which takes as input the multiple kinds of utterance speech data included in the speech corpus.

This voice quality difference evaluation table creation system makes it possible, in various speech processing technologies that use a speech corpus, to perform processing based on the evaluation of voice quality differences.

A speech synthesis system according to a fourth aspect of the present invention includes: a speech corpus consisting of multiple kinds of utterance speech data, each including a plurality of utterance speech data separable into speech units; means for acquiring predetermined input information and for selecting and extracting, from the speech corpus, speech units having a predetermined relationship with the input information; means for concatenating the series of speech units extracted by the extracting means to synthesize an utterance; and a voice quality difference evaluation table created by the voice quality difference evaluation table creation device according to any of the first to third aspects of the present invention, with the multiple kinds of utterance speech data included in the speech corpus as input. The extracting means of this speech synthesis system selects the speech units having the predetermined relationship based on the evaluation values of voice quality differences among the multiple kinds of utterance speech data stored in the voice quality difference evaluation table.

When synthesizing speech from a speech corpus consisting of multiple kinds of utterance speech data, this speech synthesis system can avoid concatenating speech units whose voice qualities differ greatly. It can therefore reduce the degradation in synthesized speech quality that results from concatenating such units.
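One way such table-aware selection might look is sketched below. The text does not say how the evaluation function combines the table value with other costs, so the additive penalty, the `vq_weight` parameter, the tuple layout of a candidate, and all values are assumptions for illustration.

```python
def select_unit(prev_session, candidates, vq_table, vq_weight=1.0):
    """Pick the candidate with the lowest cost after a voice quality penalty.

    candidates: list of (unit_id, session_id, base_cost) tuples (hypothetical layout).
    vq_table:   maps frozenset({session_a, session_b}) to a voice quality difference.
    """
    def cost(candidate):
        _, session, base_cost = candidate
        # Units from the same session as the previous unit incur no penalty.
        if session == prev_session:
            penalty = 0.0
        else:
            penalty = vq_table[frozenset((prev_session, session))]
        return base_cost + vq_weight * penalty

    return min(candidates, key=cost)


# Hypothetical table and candidates.
vq_table = {frozenset(("S1", "S2")): 0.8, frozenset(("S1", "S3")): 0.1}
candidates = [("u1", "S2", 0.2), ("u2", "S3", 0.5), ("u3", "S1", 0.9)]
chosen = select_unit("S1", candidates, vq_table)
chosen_without_penalty = select_unit("S1", candidates, vq_table, vq_weight=0.0)
```

In this example the unit with the lowest base cost comes from a session that sounds very different from the previous one, so the penalty steers the choice to a unit from a perceptually closer session.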

A unit-concatenation speech synthesis system according to an embodiment of the present invention is described below with reference to the drawings.

[First Embodiment]
FIG. 1 shows, in block diagram form, the functional configuration of a system according to one embodiment of the present invention. Referring to FIG. 1, speech unit selection system 70 includes: speech unit selection unit 44, similar to that used in the conventional speech synthesis system shown in FIG. 8, which selects speech units through evaluation using evaluation function 72; and voice quality difference evaluation table creation device 100, which creates voice quality difference evaluation table 108, referred to by evaluation function 72 and storing information representing voice quality differences between sessions.

Voice quality difference evaluation table creation device 100 includes: speech corpus 102, which stores the speech data serving as selection candidates when speech unit selection system 70 selects the series of speech units 60 used for speech synthesis; and voice quality difference evaluation device 104, connected to speech corpus 102, which conducts a perceptual test on subject 106 using the speech data in speech corpus 102, derives evaluation values for voice quality differences between sessions, and compiles the derived evaluation values into voice quality difference evaluation table 108. Voice quality difference evaluation table 108 is referred to during evaluation by evaluation function 72.

Speech corpus 102 stores, for multiple sessions, speech data 110A, 110B, ..., 110N, each consisting of the speech waveform signals of one session uttered by a single speaker and recorded. Each session's speech data is assigned an identification number 112A, 112B, ..., 112N for identifying the session.

FIG. 2 shows the configuration of voice quality difference evaluation device 104 in block diagram form. Referring to FIG. 2, voice quality difference evaluation device 104 includes: test processing unit 122, connected to speech corpus 102, which determines the combinations of stimulus pairs to be presented to subject 106 based on the identification numbers of the speech data of each session stored in speech corpus 102; and stimulus extraction unit 120, connected to speech corpus 102 and test processing unit 122, which, according to commands from test processing unit 122, extracts the stimulus pairs to be presented to subject 106 from the speech waveform data in speech corpus 102.

Voice quality difference evaluation device 104 further includes: stimulus presentation unit 124, which plays back the speech waveform data extracted by stimulus extraction unit 120 and presents the speech stimuli to subject 106; rating acquisition unit 126, which acquires subject 106's rating of the presented speech stimuli; and table creation unit 128, connected to test processing unit 122 and rating acquisition unit 126, which creates voice quality difference evaluation table 108 based on the stimulus pair combinations determined by test processing unit 122 and the ratings of subject 106 acquired by rating acquisition unit 126.

Table creation unit 128 has a function of sending a completion signal to test processing unit 122 when processing for one stimulus pair is complete. Test processing unit 122 has a function of starting processing of the next stimulus pair on receiving the completion signal.

FIG. 3 shows an example of the structure of voice quality difference evaluation table 108. Referring to FIG. 3, voice quality difference evaluation table 108 includes entries 130A, 130B, ... for the respective sessions. Each entry stores the identification number of its session and evaluation values indicating the voice quality difference between the speech recorded in that session and the speech recorded in each of the other sessions. These evaluation values integrate, for each combination of sessions, the rating results obtained from a plurality of subjects 106.
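A table with this shape could be held as a nested mapping, one entry per session. The sketch below pools ratings from several subjects per session pair; the text does not state how ratings are integrated, so the arithmetic mean, the function name `build_table`, and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean


def build_table(ratings):
    """Build a symmetric per-session table from (session_a, session_b, value) ratings.

    Ratings of the same unordered pair, from any subject, are pooled and
    averaged (the mean is an assumed integration rule).
    """
    pooled = defaultdict(list)
    for a, b, value in ratings:
        if a == b:
            raise ValueError("a session is not compared with itself")
        pooled[frozenset((a, b))].append(value)

    table = defaultdict(dict)
    for pair, values in pooled.items():
        a, b = sorted(pair)
        table[a][b] = table[b][a] = mean(values)
    return dict(table)


# Hypothetical ratings: two subjects rated S1 vs S2, one rated S1 vs S3.
sample = [("S1", "S2", 2), ("S2", "S1", 4), ("S1", "S3", 1)]
vq_table = build_table(sample)
```

Storing both `table[a][b]` and `table[b][a]` mirrors the per-session entries of FIG. 3, at the cost of holding each value twice.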

The system according to the present embodiment operates as follows. It is assumed that, before voice quality difference evaluation device 104 conducts the perceptual test, subjects 106 have been selected by an appropriate method and given sufficient and appropriate instruction.

Referring to FIG. 1, speech uttered naturally by the speaker is stored in speech corpus 102, session by session, as speech data 110A, 110B, ..., 110N. Each of speech data 110A, 110B, ..., 110N is assigned an identification number. To make the evaluation of voice quality differences as accurate as possible, the same sentence is read at the beginning of every session, and this sentence is used for evaluating voice quality or voice quality differences.

Referring to FIG. 2, when voice quality difference evaluation device 104 starts, test processing unit 122 reads identification numbers 112A, 112B, ..., 112N of the speech data stored in speech corpus 102 from speech corpus 102 and gives them to table creation unit 128. Based on the given identification numbers, table creation unit 128 prepares to create voice quality difference evaluation table 108; that is, it creates, on a storage device (not shown), the voice quality difference evaluation table 108 of FIG. 3 with no data in it. When preparation is complete, table creation unit 128 gives a completion signal to test processing unit 122.

In response to the completion signal from the table creation unit 128, the test processing unit 122 selects, from the identification numbers 112A, 112B, ..., 112N acquired from the speech corpus 102, the identification numbers of the voice data of the two sessions whose voice quality is to be compared, and gives them to the stimulus extraction unit 120 and the table creation unit 128. The two sessions can be chosen in various ways; any method may be used as long as every pair of sessions to be compared is eventually extracted.
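
One way to satisfy the requirement that every pair of sessions be extracted is simply to enumerate all unordered pairs of identification numbers. The following is a minimal sketch under that assumption; the function name `session_pairs` and the sample identifiers are illustrative only and are not taken from the embodiment.

```python
from itertools import combinations


def session_pairs(session_ids):
    """Return every unordered pair of distinct session IDs exactly once."""
    return list(combinations(session_ids, 2))


# Hypothetical identification numbers for three sessions.
pairs = session_pairs(["112A", "112B", "112C"])
```

Any other ordering (for example a randomized one) would serve equally well, provided that no pair is omitted.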

Given the identification numbers of the voice data of two sessions, the stimulus extraction unit 120 identifies the corresponding voice data in the speech corpus 102 and extracts from each the voice waveform data of the single utterance used for voice quality evaluation, as described above. The stimulus extraction unit 120 gives the extracted voice waveform data, as a pair, to the stimulus presentation unit 124.

刺激呈示部１２４は、与えられた一対の音声波形のデータを対比する形で再生し、被験者１０６に呈示する。刺激の呈示が終了すると、刺激呈示部１２４は、呈示が終了したことを示す信号を評定取得部１２６に与える。 The stimulus presentation unit 124 reproduces the given pair of speech waveform data in a form of comparison, and presents it to the subject 106. When the presentation of the stimulus is finished, the stimulus presentation unit 124 gives a signal indicating that the presentation is finished to the rating acquisition unit 126.

The subject 106 is shown in advance a rating scale for rating the auditory difference between the two stimuli that make up a stimulus pair. FIG. 4 shows an example of the rating scale. Referring to FIG. 4, the rating scale 160 consists of five graded categories, each of which is assigned in advance a numerical rating value. The rating scale 160 may be given on a printed rating sheet, or it may be shown to the subject on a display device.

刺激呈示部１２４より一対の刺激が呈示されると、被験者１０６は、それら刺激を比較する。被験者１０６はさらに、それらの間の差異を表わすのに、評定尺度１６０のカテゴリの内どれが最も適当であるかを判断し、そのカテゴリを選択する。本実施の形態では、評定取得部１２６は一般的な入力装置（例えばキーボード）を含んでおり、被験者１０６は、選択したカテゴリに対応するキーを押す。 When a pair of stimuli is presented from the stimulus presentation unit 124, the subject 106 compares the stimuli. Subject 106 further determines which category of rating scale 160 is most appropriate to represent the difference between them and selects that category. In the present embodiment, the rating acquisition unit 126 includes a general input device (for example, a keyboard), and the subject 106 presses a key corresponding to the selected category.

Referring to FIG. 2, when the subject 106 selects a category, the rating acquisition unit 126 gives the rating value assigned to that category to the table creation unit 128. When there are a plurality of subjects 106, the rating acquisition unit 126 identifies the rating value corresponding to each subject's selection and gives the average of the identified rating values to the table creation unit 128 as an integrated rating value.

The table creation unit 128 stores the rating value given by the rating acquisition unit 126 in the item of the previously prepared voice quality difference evaluation table 108 that corresponds to the combination of identification numbers given by the test processing unit 122. When the rating value has been stored in the voice quality difference evaluation table 108, the table creation unit 128 gives a completion signal to the test processing unit 122.
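
The averaging and storing steps can be sketched as follows. This is an illustration only: the dictionary-based table, the function name `store_rating`, and the use of an unordered pair as the key (which assumes that the difference between two sessions does not depend on their order) are assumptions, not details given in the embodiment.

```python
from statistics import mean


def store_rating(table, id_a, id_b, subject_ratings):
    """Average the subjects' rating values and store the result for the pair."""
    # Assumption: the table is symmetric, so the key is an unordered pair.
    key = frozenset((id_a, id_b))
    table[key] = mean(subject_ratings)
    return table[key]


table = {}
# Three hypothetical subjects rated the pair (112A, 112B) as 2, 3 and 4.
store_rating(table, "112A", "112B", [2, 3, 4])
```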

In response to this completion signal, the test processing unit 122 selects, by the selection method described above, the identification numbers of the voice data of another two sessions from the identification numbers acquired from the speech corpus 102, and gives them to the stimulus extraction unit 120 and the table creation unit 128.

声質差評価装置１０４は、上記した動作を繰返し、全てのセッションの組合せについて、被験者１０６に対する知覚試験を行なう。テーブル作成部１２８は、評定取得部１２６より与えられる評定値を声質差評価テーブル１０８に格納する。全ての組合せについて評定値の格納が終了すると、声質差評価装置１０４は一連の動作を終了する。 The voice quality difference evaluation apparatus 104 repeats the above-described operation, and performs a perceptual test on the subject 106 for all combinations of sessions. The table creation unit 128 stores the rating value given from the rating acquisition unit 126 in the voice quality difference evaluation table 108. When the storage of the rating values for all the combinations is finished, the voice quality difference evaluation device 104 finishes a series of operations.

The voice quality difference evaluation table 108 created in this way is referred to by the speech unit selection unit 44 shown in FIG. 1 when it calculates the value of the evaluation function 72. That is, referring to FIG. 1, the speech unit selection unit 44 gives the evaluation function 72, as variables, not only observable physical quantities but also the pair of identification numbers of two sessions: the session to which the candidate speech unit belongs, and the session to which the speech unit used for synthesis immediately before it belongs. The evaluation function 72 outputs an evaluation value based on the given physical quantities and on the rating value stored in the item of the voice quality difference evaluation table 108 that corresponds to the given pair of identification numbers. The evaluation value therefore reflects the rating of the voice quality difference between sessions.
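
The description does not specify how the evaluation function 72 combines the physical quantities with the rating value. Purely as an illustration, the sketch below assumes a weighted sum, assumes that two units from the same session have a difference of zero, and represents the physical quantities by a single precomputed cost; the names `evaluation_value` and `weight` are hypothetical.

```python
def evaluation_value(physical_cost, prev_session, cand_session, table, weight=1.0):
    """Combine a physical cost with the perceptual voice quality difference."""
    if prev_session == cand_session:
        # Assumption: units from the same session have no voice quality difference.
        difference = 0.0
    else:
        difference = table[frozenset((prev_session, cand_session))]
    # Assumption: a weighted sum; the source only says "based on" both inputs.
    return physical_cost + weight * difference


# Hypothetical table holding one rated pair of sessions.
example_table = {frozenset(("112A", "112B")): 2.0}
cost = evaluation_value(1.5, "112A", "112B", example_table, weight=0.5)
```

With such a form, a candidate unit from a session that sounds very different from the preceding unit's session receives a larger cost and is less likely to be selected.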

音声素片選択部４４はこの評価値をもとに一連の音声素片を決定する。その結果、図８に示す音声素片接続部４６が一連の音声素片を接続することによって合成される音声には、接続される音声素片同士の聴感上の声質のばらつきが少なくなる。そのため、このようにして合成された音声は、不連続感が軽減し、自然に聞こえるものとなる。 The speech segment selection unit 44 determines a series of speech segments based on this evaluation value. As a result, the voice synthesized by connecting a series of speech units by the speech unit connection unit 46 shown in FIG. 8 has less audible variation in voice quality between the connected speech units. For this reason, the synthesized voice can be heard naturally with reduced discontinuity.

[Second Embodiment]
In the system according to the first embodiment, the subject 106 compared voice waveform data recorded in two different sessions and rated the difference in voice quality between them. However, the present invention is not limited to such an embodiment.

The voice quality difference evaluation device according to the second embodiment does not compare the voice data of two sessions directly. Instead, it uses a perceptual test on the voice data of each session to create a feature vector representing the voice quality of that session, and the voice quality difference between sessions is expressed as the distance between the sessions' vectors. In the perceptual test according to the present embodiment, an evaluation word set consisting of a plurality of evaluation word pairs for rating voice quality is prepared in advance and given to the subject. Based on this evaluation word set, the subject rates the auditory impression of the voice quality of each presented stimulus.

FIG. 5 is a block diagram showing the configuration of the voice quality difference evaluation device 204 according to the present embodiment. Referring to FIG. 5, the voice quality difference evaluation device 204 includes: a test processing unit 222 that determines, one at a time and in a predetermined order, the session whose voice data is the source of the voice stimulus to be presented to the subject 106; a stimulus extraction unit 220 that extracts the specific voice data for voice quality evaluation from the voice data of the session determined by the test processing unit 222; a stimulus presentation unit 224 that presents the voice stimulus extracted by the stimulus extraction unit 220 to the subject 106; and a rating acquisition unit 226 that acquires the rating of the auditory impression that the subject 106 gives, using the evaluation word set described above, to the voice stimulus presented by the stimulus presentation unit 224.

The voice quality difference evaluation device 204 further includes: a voice quality vector creation unit 228, connected to the test processing unit 222 and the rating acquisition unit 226, that creates a voice quality vector representing the voice quality of the speech recorded in each session, based on the session identification number determined by the test processing unit 222 and the subject's ratings acquired by the rating acquisition unit 226; a voice quality vector table 230 that stores the voice quality vectors; a voice quality difference calculation unit 232 that calculates an evaluation value for the voice quality difference between any two sessions, based on the voice quality vectors stored in the voice quality vector table 230; and a table creation unit 234 that creates a voice quality difference evaluation table 236 from the evaluation values calculated by the voice quality difference calculation unit 232. The voice quality difference evaluation table 236 has the same configuration as the voice quality difference evaluation table 108 according to the first embodiment shown in FIG. 3.

FIG. 6 shows an example of the evaluation word set given to the subject in the perceptual test according to the present embodiment. Referring to FIG. 6, the evaluation word set 240 includes a plurality of evaluation word pairs, each consisting of a pair of adjectives that are antonyms ("with tension - without tension", "cloudy - clear", "bright - dark", and so on). For each evaluation word pair, seven graded categories ("very", "fairly", ..., "fairly", "very") are provided, running from the evaluation word on the left to the evaluation word on the right. As with the categories of the rating scale in the first embodiment, each of these categories is assigned a numerical rating value for each evaluation word pair.

図７に、本実施の形態に係る声質ベクトルテーブルの構成の一例を示す。図７を参照して、声質ベクトルテーブル２３０は、各セッションに対応するエントリ２８０Ａ，２８０Ｂ，…によって構成されたテーブルである。声質ベクトルテーブル２３０の各エントリは、セッションの識別番号の項目２７２と、当該セッションの音声データの声質を示す声質ベクトル２７４とを含む。声質ベクトル２７４は、図６に示す評価語セットの各評価語対に関する被験者の評定をそれぞれ数値化したものを成分とするベクトルである。 FIG. 7 shows an example of the configuration of a voice quality vector table according to the present embodiment. Referring to FIG. 7, voice quality vector table 230 is a table constituted by entries 280A, 280B,... Corresponding to each session. Each entry of the voice quality vector table 230 includes a session identification number item 272 and a voice quality vector 274 indicating the voice quality of the voice data of the session. The voice quality vector 274 is a vector whose components are numerical values obtained by quantifying the subject's ratings for each evaluation word pair in the evaluation word set shown in FIG.

本実施の形態に係るシステムは以下のように動作する。 The system according to the present embodiment operates as follows.

図５を参照して、最初に試験処理部２２２は、音声コーパス１０２より、全てのセッションの識別番号を読出し、テーブル作成部２３４及び声質ベクトル作成部２２８に与える。声質ベクトル作成部２２８は与えられた識別番号を声質ベクトルテーブル２３０（図７参照）に格納し、声質ベクトルテーブル２３０の準備を行なう。声質ベクトルテーブル２３０の準備が完了すると、声質ベクトル作成部２２８は、試験処理部２２２に完了信号を与える。 Referring to FIG. 5, first, test processing unit 222 reads the identification numbers of all sessions from voice corpus 102 and provides them to table creation unit 234 and voice quality vector creation unit 228. Voice quality vector creating section 228 stores the given identification number in voice quality vector table 230 (see FIG. 7), and prepares voice quality vector table 230. When the preparation of the voice quality vector table 230 is completed, the voice quality vector creation unit 228 gives a completion signal to the test processing unit 222.

In response to the completion signal from the voice quality vector creation unit 228, the test processing unit 222 selects the identification number of the voice data of one session from the identification numbers read from the speech corpus 102, and gives it to the stimulus extraction unit 220 and the voice quality vector creation unit 228.

Given this identification number, the stimulus extraction unit 220 extracts, from the voice data of the corresponding session stored in the speech corpus 102, the voice waveform data of the single utterance used for voice quality evaluation, as described above, and gives it to the stimulus presentation unit 224.

刺激呈示部２２４は、音声波形データを再生し、再生が完了すると、刺激呈示部２２４は、評定取得部２２６に対し再生完了を示す信号を与える。 The stimulus presentation unit 224 reproduces the audio waveform data, and when the reproduction is completed, the stimulus presentation unit 224 gives a signal indicating completion of reproduction to the rating acquisition unit 226.

The subject 106 is given the evaluation word set shown in FIG. 6 in advance. On receiving a voice stimulus from the stimulus presentation unit 224, the subject 106 judges, for each evaluation word pair in the set, which category of the rating scale the auditory impression produced by the stimulus belongs to. The subject 106 selects, from the rating scale for that evaluation word pair, the category judged most appropriate, and repeats this selection for all of the evaluation word pairs.

Referring to FIG. 5, the rating acquisition unit 226 acquires the ratings given by the subject 106 in response to the signal from the stimulus presentation unit 224 indicating that reproduction is complete. The rating acquisition unit 226 identifies, for each evaluation word pair, the category selected by the subject 106, and gives the corresponding rating values to the voice quality vector creation unit 228. When there are a plurality of subjects 106, the rating acquisition unit 226, as in the first embodiment, identifies the rating value corresponding to each subject's selection and gives the average of the identified rating values to the voice quality vector creation unit 228 as an integrated rating value.

声質ベクトル作成部２２８は、評価語対に対応する評定値をもとに、これらの評定値を成分とする声質ベクトルを作成する。声質ベクトル作成部２２８は、声質ベクトルテーブル２３０のうち、試験処理部２２２から与えられた識別番号のエントリの声質ベクトルの項目に、作成した声質ベクトルを格納する。声質ベクトル作成部２２８は、声質ベクトルの格納が完了すると、試験処理部２２２に完了信号を与える。試験処理部２２２は、この完了信号に応答して、次のセッションを決定する。 The voice quality vector creating unit 228 creates a voice quality vector having these rating values as components based on the rating values corresponding to the evaluation word pairs. The voice quality vector creation unit 228 stores the created voice quality vector in the voice quality vector item of the entry of the identification number given from the test processing unit 222 in the voice quality vector table 230. When the storage of the voice quality vector is completed, the voice quality vector creation unit 228 gives a completion signal to the test processing unit 222. The test processing unit 222 determines the next session in response to the completion signal.
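
The construction of a voice quality vector from the ratings can be sketched as follows. The sketch assumes that ratings from several subjects are averaged per evaluation word pair and that the vector components follow a fixed order of word pairs; only two of the word pairs named for FIG. 6 are used, and the rating values and the names `WORD_PAIRS` and `voice_quality_vector` are invented for illustration.

```python
from statistics import mean

# Two of the evaluation word pairs named in the description; order fixes the axes.
WORD_PAIRS = ["cloudy-clear", "bright-dark"]


def voice_quality_vector(ratings_by_subject):
    """Build one vector from a list of {word pair: rating value} dicts."""
    return [mean(r[pair] for r in ratings_by_subject) for pair in WORD_PAIRS]


vector_table = {}
# Hypothetical ratings from two subjects for the session identified as 112A.
vector_table["112A"] = voice_quality_vector([
    {"cloudy-clear": 2, "bright-dark": 5},
    {"cloudy-clear": 4, "bright-dark": 5},
])
```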

以上の動作を繰返すことにより、全てのセッションについての声質ベクトルが声質ベクトルテーブル２３０に格納される。 By repeating the above operation, voice quality vectors for all sessions are stored in the voice quality vector table 230.

When the voice quality vectors have been created and stored, the voice quality difference calculation unit 232 calculates the evaluation values for the voice quality differences between sessions as follows. The voice quality difference calculation unit 232 reads the identification numbers and voice quality vectors of any two sessions from the voice quality vector table 230. From the two voice quality vectors, it calculates the distance between them in the multidimensional space whose axes are the rating scales of the respective evaluation word pairs in the evaluation word set. The distance calculated here may be, for example, the Euclidean distance between the vectors. The voice quality difference calculation unit 232 gives the distance calculated in this way to the table creation unit 234, together with the identification numbers of the two sessions.
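
For reference, if the Euclidean distance is chosen, the voice quality difference between sessions $i$ and $j$ can be written as below. The symbols are notation introduced here for illustration: $\mathbf{v}_i$ is the voice quality vector of session $i$, $v_{i,k}$ is its component for the $k$-th evaluation word pair, and $K$ is the number of evaluation word pairs.

```latex
d(\mathbf{v}_i, \mathbf{v}_j) = \sqrt{\sum_{k=1}^{K} \left( v_{i,k} - v_{j,k} \right)^{2}}
```

Under this definition the distance is symmetric, $d(\mathbf{v}_i, \mathbf{v}_j) = d(\mathbf{v}_j, \mathbf{v}_i)$, and is zero when the two sessions receive identical ratings on every evaluation word pair.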

テーブル作成部２３４は、声質差評価テーブル２３６の、声質差算出部２３２より与えられた識別番号に対応する項目に、声質差算出部２３２より与えられるベクトル間の距離を格納する。なお、声質差評価テーブル２３６は、図３に示す声質差評価テーブル１０８と同様の構成である。 The table creation unit 234 stores the distance between vectors given by the voice quality difference calculation unit 232 in the item corresponding to the identification number given by the voice quality difference calculation unit 232 in the voice quality difference evaluation table 236. The voice quality difference evaluation table 236 has the same configuration as the voice quality difference evaluation table 108 shown in FIG.

声質差算出部２３２とテーブル作成部２３４とは、全てのセッションの組合せについてこれら一連の動作を実行し、声質差評価テーブル２３６を作成する。 The voice quality difference calculation unit 232 and the table creation unit 234 execute a series of these operations for all session combinations, and create a voice quality difference evaluation table 236.
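
The table-building loop described above can be sketched as follows, assuming the Euclidean distance and a dictionary keyed by unordered session pairs. The vectors and the function name `build_difference_table` are hypothetical; `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def build_difference_table(vector_table):
    """Store the Euclidean distance for every unordered pair of sessions."""
    return {
        frozenset((a, b)): dist(vector_table[a], vector_table[b])
        for a, b in combinations(sorted(vector_table), 2)
    }


# Hypothetical two-dimensional voice quality vectors for three sessions.
vectors = {"112A": [1, 1], "112B": [4, 5], "112C": [1, 3]}
difference_table = build_difference_table(vectors)
```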

このようにして作成された声質差評価テーブル２３６は、第１の実施の形態と同様に、図１に示す評価関数７２の計算の中に、１つの評価尺度として組込まれる。 The voice quality difference evaluation table 236 created in this way is incorporated as one evaluation measure in the calculation of the evaluation function 72 shown in FIG. 1, as in the first embodiment.

このように第２の実施の形態に係る声質差評価装置を含むシステムでは、セッションごとの音声データの間の声質の差異の大きさを被験者に直接評価させるのではなく、声質に関する評価語セットによって表現される被験者の聴覚上の印象をもとに声質差評価テーブルを作成する。そのため、より多角的な評定結果を被験者より得ることが可能となる。 Thus, in the system including the voice quality difference evaluation apparatus according to the second embodiment, the subject does not directly evaluate the magnitude of the voice quality difference between the voice data for each session, but by the evaluation word set related to the voice quality. A voice quality difference evaluation table is created based on the expressed auditory impression of the subject. Therefore, it becomes possible to obtain more diverse evaluation results from the subject.

この結果、音声素片選択部４４がこの評価値をもとに選択した音声素片を接続することにより合成される音声では、接続される音声素片同士の声質が類似するものとなるので、不連続感が生じることが少なくなる。従って、このようにして合成された音声は、自然なものとなる。また、セッションごとに１回だけ評価すればよいので、２つのセッションの組合せごとに評価する第１の実施の形態と比較して、評価に要する時間を短縮することができる。 As a result, in the speech synthesized by connecting the speech units selected by the speech unit selection unit 44 based on this evaluation value, the voice qualities of the connected speech units are similar. Less discontinuity occurs. Therefore, the synthesized voice is natural. Moreover, since it is sufficient to evaluate only once for each session, the time required for evaluation can be shortened as compared with the first embodiment in which evaluation is performed for each combination of two sessions.
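
The saving in evaluation effort can be made concrete by counting presentations. The sketch below assumes one presentation per session pair in the first embodiment and one presentation per session in the second; it does not account for the fact that each presentation in the second embodiment is rated on several evaluation word pairs. The function name `presentations_needed` is illustrative.

```python
from math import comb


def presentations_needed(num_sessions):
    """Return (pairwise presentations, per-session presentations)."""
    pairwise = comb(num_sessions, 2)  # first embodiment: every pair of sessions
    per_session = num_sessions        # second embodiment: each session once
    return pairwise, per_session


counts = presentations_needed(20)
```

For 20 sessions this gives 190 paired presentations against 20 single presentations, and the gap widens quadratically as the number of sessions grows.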

以上のブロック図形式で説明した各機能部は、いずれもコンピュータハードウェア及び当該コンピュータ上で実行されるプログラムにより実現することができる。このコンピュータとしては、音声を扱う設備を持ったものであれば、汎用のハードウェアを有するものを用いることができる。また、上で説明した装置の各機能ブロックは、この明細書の記載に基づき、当業者であればプログラムで実現することができる。そうしたプログラムもまた１つのデータであり、記憶媒体に記憶させて流通させることができる。 Each functional unit described in the above block diagram format can be realized by computer hardware and a program executed on the computer. As this computer, a computer having general-purpose hardware can be used as long as it has equipment for handling sound. Further, each functional block of the apparatus described above can be realized by a program by those skilled in the art based on the description in this specification. Such a program is also a piece of data and can be stored in a storage medium and distributed.

なお、上記した実施の形態における被験者の人数は問わない。十分な教示及び訓練を受けた被験者であれば、単一又は小人数の被験者であっても、十分な精度の試験結果を得ることができる。また、多人数の被験者に対して知覚試験を行なうことにより、合成された音声を聞く一般的なユーザの評定に近い評定値に基づく声質差評価テーブルを作成することができる。 Note that the number of subjects in the above-described embodiment does not matter. A test result with sufficient accuracy can be obtained by a subject who has been sufficiently taught and trained, even for a single subject or a small number of subjects. Further, by performing a perceptual test on a large number of subjects, it is possible to create a voice quality difference evaluation table based on a rating value close to that of a general user who hears synthesized speech.

また、上記した実施の形態では、複数の被験者を用いて知覚試験を行なう場合、各被験者から得られた評定値の平均値によって、評定値を統合した。しかし本発明は、このような実施の形態には限定されない。平均値以外にも中央値、最大値、及び最小値などによって評定値を統合してもよい。 In the above-described embodiment, when a perceptual test is performed using a plurality of subjects, the rating values are integrated by the average value of the rating values obtained from each subject. However, the present invention is not limited to such an embodiment. In addition to the average value, the rating values may be integrated by the median value, the maximum value, the minimum value, and the like.
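
The alternative ways of integrating ratings mentioned above can be sketched with a small lookup of aggregation functions. The names `AGGREGATORS` and `integrate_ratings` and the sample ratings are assumptions made for illustration.

```python
from statistics import mean, median

# The four integration methods named in the description.
AGGREGATORS = {"mean": mean, "median": median, "max": max, "min": min}


def integrate_ratings(ratings, method="mean"):
    """Combine several subjects' rating values into one integrated value."""
    return AGGREGATORS[method](ratings)


# Hypothetical ratings of one stimulus pair by three subjects.
ratings = [1, 2, 6]
integrated = {name: integrate_ratings(ratings, name) for name in AGGREGATORS}
```

The median is less sensitive than the mean to a single subject whose rating differs sharply from the others, which may matter when the number of subjects is small.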

なお、上記した実施の形態では、刺激抽出部１２０、２２０は、特定の１発声分の音声波形のデータを音声データよりそれぞれ抽出した。しかし、本発明はこのような実施の形態には限定されない。例えば複数発話分を用いてもよい。また不特定の発声を用いて評価することもできるが、その場合には、評価の精度が十分に保証される知覚試験の方法を用いることが望ましい。 In the above-described embodiment, the stimulus extraction units 120 and 220 extract the voice waveform data for one specific utterance from the voice data. However, the present invention is not limited to such an embodiment. For example, a plurality of utterances may be used. Although it is possible to evaluate using unspecified utterances, in that case, it is desirable to use a perceptual test method in which the accuracy of the evaluation is sufficiently guaranteed.

なお、上記した発明の実施の形態では、被験者に与える評定尺度の各カテゴリに、予め評定値が付与されていた。しかし、本発明は、このような実施の形態には限定されない。例えば、事前に別の知覚試験を行ない、各カテゴリに対応する評定値を決定してもよい。また、知覚試験によって得られる試験結果をもとに、各カテゴリの評定値を統計的に求めることも可能である。 In the above-described embodiment of the invention, a rating value is assigned in advance to each category of the rating scale given to the subject. However, the present invention is not limited to such an embodiment. For example, another perceptual test may be performed in advance to determine a rating value corresponding to each category. It is also possible to statistically determine the rating value of each category based on the test result obtained by the perceptual test.

In the embodiments described above, the perceptual test method and the method of determining rating values for evaluating voice quality differences consisted of presenting voice stimuli to the subject and having the subject rate them on a category scale. However, the present invention is not limited to such embodiments. Any method may be used as long as it takes the extracted voice waveform data as the presented stimuli and constructs, through a perceptual test, a scale based on auditory impressions.

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim in the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram showing the configuration of the system according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the voice quality difference evaluation device according to the first embodiment of the present invention.
FIG. 3 is a diagram showing an example of the voice quality difference evaluation table 108 according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the rating scale according to the first embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of the voice quality difference evaluation device according to the second embodiment of the present invention.
FIG. 6 is a diagram showing an example of the rating scale according to the second embodiment of the present invention.
FIG. 7 is a diagram showing an example of the voice quality vector table according to the second embodiment of the present invention.
FIG. 8 is a block diagram showing the basic configuration of a speech unit concatenation type speech synthesis system.

Explanation of symbols

40, 102: speech corpus; 42: input text; 44: speech unit selection unit; 46: speech unit connection unit; 48: synthesized speech data; 50, 72: evaluation function; 70: speech unit selection system; 100: voice quality difference evaluation table creation device; 104, 204: voice quality difference evaluation device; 106: subject; 108, 236: voice quality difference evaluation table; 110A, 110B, ..., 110N: voice data; 112A, 112B, ..., 112N: identification numbers; 120, 220: stimulus extraction unit; 122, 222: test processing unit; 124, 224: stimulus presentation unit; 126, 226: rating acquisition unit; 128, 234: table creation unit; 228: voice quality vector creation unit; 230: voice quality vector table; 232: voice quality difference calculation unit

Claims

A voice quality difference evaluation table creation device for creating a voice quality difference evaluation value table representing voice quality differences among a plurality of types of utterance voice data, each of the plurality of types of utterance voice data including voice waveform data for evaluating voice quality differences, the device comprising:
combination determining means for determining an arbitrary combination of first and second utterance voice data from the plurality of types of utterance voice data;
voice waveform data extracting means for extracting the voice waveform data for evaluating voice quality differences from each of the first and second utterance voice data of the combination determined by the combination determining means;
perceptual test means for conducting a perceptual test in which first and second voice stimuli, consisting of the sounds obtained by respectively reproducing the voice waveform data for evaluating voice quality differences of the first and second utterance voice data, are presented to a subject for comparison and a rating input on the magnitude of the auditory voice quality difference between the first and second voice stimuli is received from the subject, and for deriving, from the value of the input rating, an evaluation value that numerically represents the magnitude of the voice quality difference between the utterance voice data of the combination,
wherein the rating input indicates which of a plurality of graded categories the magnitude of the voice quality difference belongs to; and
table creation means for creating the voice quality difference evaluation value table by storing the evaluation value derived by the perceptual test means in association with the combination of the utterance voice data.

The voice quality difference evaluation table creation device according to claim 1, wherein the perceptual test means comprises:
stimulus presentation means for reproducing the voice waveform data for evaluating the voice quality difference of the first and second utterance voice data, and for presenting the first and second voice stimuli, consisting of the reproduced sounds, to the subject of the perceptual test for comparison; and
input means for receiving from the subject a rating input on the magnitude of the auditory voice quality difference between the first and second voice stimuli.

The voice quality difference evaluation table creation device according to claim 2, wherein the input means includes:
means for causing the subject to input a rating indicating which of the plurality of graded categories the magnitude of the auditory difference between the first and second voice stimuli belongs to; and
means for outputting, in response to the subject's input of the rating indicating which of the plurality of graded categories the magnitude of the auditory difference between the first and second voice stimuli belongs to, a numerical value assigned in advance to the category indicated by the input, as an evaluation value of the difference between the first and second voice stimuli.

A voice quality difference evaluation table creation device for creating a voice quality difference evaluation value table representing voice quality differences among a plurality of types of utterance voice data, each of the plurality of types of utterance voice data including voice waveform data for voice quality difference evaluation, the device including:
extraction means for extracting, in an order determined by a predetermined procedure, each item of the voice waveform data for voice quality difference evaluation included in the plurality of types of utterance voice data;
perceptual test means for conducting, for each item of voice waveform data extracted by the extraction means, a perceptual test in which a subject rates the auditory voice quality of the reproduced data on each of a plurality of different rating scales concerning voice quality, and for deriving from the subject's ratings a plurality of evaluation values corresponding to the plurality of rating scales, thereby calculating, for each of the plurality of types of utterance voice data, a voice quality vector having the plurality of evaluation values as elements; and
table creation means for creating the voice quality difference evaluation value table by calculating, as a distance defined between the voice quality vectors calculated by the perceptual test means for the respective types of utterance voice data, an evaluation value of the voice quality difference for any combination of the plurality of types of utterance voice data, and storing the evaluation value in association with that combination of utterance voice data.

The voice quality difference evaluation table creation device according to claim 4, wherein the perceptual test means comprises:
reproduction means for presenting to the subject the voice obtained by reproducing each item of the voice waveform data for voice quality difference evaluation extracted by the extraction means; and
rating value acquisition means for acquiring the subject's rating value on each of the plurality of rating scales for the voice presented to the subject by the reproduction means,
each of the plurality of rating scales is represented by a plurality of categories,
the subject's rating value is, for each of the plurality of rating scales, a value indicating to which of the plurality of categories of that rating scale the voice quality of the voice presented to the subject by the reproduction means belongs,
each category indicated by a rating value is assigned a corresponding evaluation value, and
the perceptual test means further includes vector creation means for calculating the voice quality vector whose elements are the evaluation values assigned to the categories corresponding to the subject's rating values on the respective rating scales.

The voice quality difference evaluation table creation device according to claim 5, wherein the table creation means includes:
distance calculation means for calculating, for any combination of the plurality of types of utterance voice data, the Euclidean distance between the voice quality vectors obtained by the vector creation means for the utterance voice data belonging to that combination; and
means for storing the Euclidean distance calculated by the distance calculation means in the voice quality difference evaluation table, in association with the corresponding combination of utterance voice data, as the magnitude of the voice quality difference between those utterance voice data.
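
The vector-based variant, in which ratings on several scales become a voice quality vector and the Euclidean distance between vectors becomes the table entry, can be sketched as follows. This is a minimal illustration only: the scale names in `SCALES`, the three categories in `CATEGORY_VALUES` and their values, and the function names are hypothetical assumptions, not taken from the claims.

```python
import math
from itertools import combinations

# Hypothetical rating scales and categories; the names and the values
# assigned to each category are illustrative assumptions.
SCALES = ["brightness", "hoarseness", "clarity"]
CATEGORY_VALUES = {"low": 1, "medium": 2, "high": 3}


def voice_quality_vector(ratings):
    """Turn one subject's per-scale category ratings into a vector.

    ratings maps each scale name to the category chosen on that scale;
    each element is the evaluation value assigned to that category.
    """
    return [CATEGORY_VALUES[ratings[scale]] for scale in SCALES]


def distance_table(vectors):
    """Store the Euclidean distance for every pair of utterance voice data.

    vectors maps a data identifier to its voice quality vector.
    """
    table = {}
    for a, b in combinations(sorted(vectors), 2):
        table[(a, b)] = math.dist(vectors[a], vectors[b])
    return table
```

With this approach each data set is rated once rather than once per pair, so the number of listening trials grows linearly with the number of data sets while the table still covers every combination.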

A speech corpus voice quality difference evaluation table creation system including:
a speech corpus consisting of a plurality of types of utterance voice data recorded under different recording conditions; and
the voice quality difference evaluation table creation device according to any one of claims 1 to 6, which receives as input the plurality of types of utterance voice data included in the speech corpus,
wherein each of the plurality of types of utterance voice data includes voice waveform data for voice quality difference evaluation.

A speech synthesis system including:
a speech corpus consisting of a plurality of types of utterance voice data, each including a plurality of items of speech data that can be separated into speech units;
extraction means for acquiring input information and, using a predetermined evaluation function for selecting speech units with which to synthesize speech from the input information, selecting and extracting a sequence of speech units from the speech corpus;
means for synthesizing speech by connecting the sequence of speech units extracted by the extraction means; and
the voice quality difference evaluation table created by the voice quality difference evaluation table creation device according to any one of claims 1 to 7 with the plurality of types of utterance voice data included in the speech corpus as input,
wherein each of the plurality of types of utterance voice data includes voice waveform data for voice quality difference evaluation, and
the extraction means selects the speech units for synthesizing speech from the input information by using, as an input to the evaluation function, the evaluation values of the voice quality differences among the plurality of types of utterance voice data stored in the voice quality difference evaluation table.
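
One way the stored voice quality differences could enter the evaluation function is sketched below. The claims do not specify the form of the evaluation function, so this example assumes a simple one: the sum of per-unit target costs plus a weighted voice quality difference between the source data of adjacent units, minimized by dynamic programming. The function name `select_units`, the `weight` parameter, and the candidate tuple layout are hypothetical.

```python
def select_units(candidates, vq_table, weight=1.0):
    """Pick one unit per position, minimizing an assumed total cost.

    candidates: non-empty list with one entry per target position; each
        entry is a list of (unit_id, data_id, target_cost) tuples, where
        data_id names the utterance voice data the unit was cut from.
    vq_table: maps frozenset({data_id_a, data_id_b}) to the stored
        voice quality difference evaluation value.
    Returns (total_cost, [unit_id, ...]).
    """
    def vq_diff(a, b):
        # Units from the same utterance voice data have no voice quality gap.
        return 0.0 if a == b else vq_table[frozenset((a, b))]

    # best[j] holds (cost, path) of the cheapest sequence ending in
    # candidate j of the position processed most recently.
    best = [(cost, [unit]) for unit, _, cost in candidates[0]]
    for pos in range(1, len(candidates)):
        new_best = []
        for unit, data_id, cost in candidates[pos]:
            options = []
            for j, (prev_cost, prev_path) in enumerate(best):
                prev_data = candidates[pos - 1][j][1]
                join = weight * vq_diff(prev_data, data_id)
                options.append((prev_cost + join, prev_path))
            prev_cost, prev_path = min(options, key=lambda o: o[0])
            new_best.append((prev_cost + cost, prev_path + [unit]))
        best = new_best
    return min(best, key=lambda entry: entry[0])
```

Raising `weight` makes the selection favor sequences drawn from data with similar voice quality, while `weight=0` reduces the choice to target costs alone.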