JP5598516B2 - Voice synthesis system for karaoke and parameter extraction device

Info

Publication number: JP5598516B2
Application number: JP2012191440A
Authority: JP
Inventors: 晃弘上村; 久美幡田; 典昭阿瀬見; 琢磨久野
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2012-08-31
Filing date: 2012-08-31
Publication date: 2014-10-01
Anticipated expiration: 2032-08-31
Also published as: JP2014048472A

Description

The present invention relates to a speech synthesis system that performs speech synthesis for karaoke, and to a parameter extraction device that extracts, from speech, the speech parameters required for speech synthesis.

Conventionally, karaoke apparatuses are known that correct the pitch of a singing voice input by a user singing and output the corrected voice (see Patent Document 1).
The karaoke apparatus described in Patent Document 1 corrects the pitch of the input singing voice so that it matches the closest pitch among those used by the musical tones of the song.
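The correction described for Patent Document 1 can be pictured with a short sketch. It is not taken from that document: the function names, the use of MIDI note numbers for the song's pitches, and the choice of measuring closeness in semitones are all assumptions made for illustration.

```python
from math import log2


def frequency_to_midi(freq_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number, with A4 = 440 Hz = 69."""
    return 69 + 12 * log2(freq_hz / 440.0)


def snap_to_song_pitch(freq_hz, song_note_numbers):
    """Return the note number used in the song that lies closest to the sung pitch.

    song_note_numbers is a hypothetical list of the MIDI note numbers that occur in the song.
    """
    sung = frequency_to_midi(freq_hz)
    return min(song_note_numbers, key=lambda n: abs(n - sung))
```

For example, a sung pitch of 450 Hz with song pitches 60, 64, 67, and 69 would be corrected to note 69 (A4), since 450 Hz is less than half a semitone above 440 Hz.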

Patent Document 1: JP 2003-167587 A

In the karaoke apparatus described in Patent Document 1, the pitch-corrected singing voice is output while the user is vocalizing and inputting voice to the apparatus, but it is not output while no voice is being input.

In general, while users are singing, they also hear their own uncorrected voice with their own ears. It is therefore difficult to practice singing while telling two singing voices apart at the same time against the accompaniment. Singing practice is generally done by singing once and then listening back to one's own voice together with the accompaniment of the song.

With the karaoke apparatus described in Patent Document 1, however, the user can hear his or her own singing voice at the appropriate pitch of the song only while actually singing it. Consequently, after singing the song, and all the more so for a different song, it is difficult to practice without singing again.

Accordingly, an object of the present invention is to allow the user to listen to his or her own singing voice at the appropriate pitch of a song even outside the period in which the user is singing that song.

The present invention, made to achieve the above object, relates to a speech synthesis system comprising a parameter extraction device and a synthesized sound output device.
The parameter extraction device of the speech synthesis system includes utterance information acquisition means, waveform acquisition means, parameter derivation means, parameter registration means, and model storage means.

Of these, the utterance information acquisition means acquires a song ID, performance information, lyrics information, and utterance timing information from song data, and plays back the song based on that song data. The song data contains the song ID identifying the song, the performance information representing the musical tones that make up the song identified by the song ID, the lyrics information representing the lyrics of that song, and the utterance timing information indicating when utterance of the lyrics represented by the lyrics information starts.

While the song based on the song data is being played back, the waveform acquisition means acquires the speech waveform input at the utterance start timing of the lyrics. The parameter derivation means extracts from the speech waveform a syllable waveform, that is, the speech waveform for each syllable forming the lyrics, and derives from each extracted syllable waveform a speech parameter consisting of at least one predefined feature quantity.

The parameter registration means stores the per-syllable speech parameters derived by the parameter derivation means in a first storage device, in association with a user ID identifying the user who sang the song. In addition, the model storage means stores model voice data, which is based on an ideal singing voice for each song and includes at least one speech parameter, in a second storage device in association with the song ID.

The synthesized sound output device of the speech synthesis system, on the other hand, includes identification information acquisition means, parameter acquisition means, model acquisition means, speech synthesis means, and output means.
In this synthesized sound output device, the identification information acquisition means acquires a designated song ID and user ID. The model acquisition means acquires, from the second storage device, the model singing voice data associated with the acquired song ID.
Further, the parameter acquisition means acquires, from among the speech parameters stored in the first storage device, the speech parameter that is associated with the user ID acquired by the identification information acquisition means and that is most similar to the speech parameter associated with the song ID included in the model voice data acquired by the model acquisition means.
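The selection step performed by the parameter acquisition means can be sketched as a nearest-neighbour search. The text does not say how similarity is measured, so the Euclidean distance used here is an assumption, as are the record layout and the name `select_parameter`.

```python
from math import dist


def select_parameter(first_storage, user_id, model_features):
    """Return the stored record of this user whose features lie closest to the model's.

    first_storage is a hypothetical list of dicts with "user_id" and "features" keys,
    where "features" is a tuple of numbers (e.g. fundamental frequency, power).
    Returns None when the user has no registered parameters.
    """
    candidates = [rec for rec in first_storage if rec["user_id"] == user_id]
    if not candidates:
        return None
    return min(candidates, key=lambda rec: dist(rec["features"], model_features))
```

In practice the features would probably need to be normalized before comparison, because a raw distance lets large-valued features such as frequency in Hz dominate small-valued ones.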

The speech synthesis means adjusts the speech parameter acquired by the parameter acquisition means so that it matches the speech parameter associated with the song ID, and synthesizes speech based on the adjusted speech parameter. The output means then outputs the generated synthesized sound.
Furthermore, the utterance information acquisition means acquires, as the performance information, score data that represents the score of a target song, which is one of the songs, and that defines at least the pitch and the performance start timing of each output sound output from a sound source module. If the key changes within the target song, the score data includes a modulation flag indicating the time, along the time axis, at which the target song modulates.
The parameter registration means of the present invention includes section specifying means, tonic specifying means, pitch name frequency deriving means, and key estimation means.
Of these, the section specifying means specifies, based on the acquired score data, same-key sections, each being a section of the target song in which the same key continues. The tonic specifying means specifies, as the tonic, the last output sound along the time axis in each same-key section specified by the section specifying means. The pitch name frequency deriving means derives, for each same-key section, the appearing pitch name frequency, which represents how often output sounds of the same pitch name occur in that section, counted starting from the pitch name of the tonic specified by the tonic specifying means. The key estimation means then matches each derived appearing pitch name frequency against key templates, prepared in advance for each key as templates representing the distribution of pitch names usable in that key, and estimates the key with the highest correlation for each section as metadata.
The parameter registration means of the present invention stores the metadata estimated by the key estimation means in the first storage device in association with the speech parameters.
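The key estimation steps above (split at modulation times, take the last note as the tonic, count pitch names relative to the tonic, correlate with templates) can be sketched as follows. The binary major and natural-minor templates, the use of Pearson correlation, and all function names are assumptions for illustration; the text does not specify the actual templates or the correlation measure.

```python
from math import sqrt

# Hypothetical templates: 1 marks a pitch name usable in the key,
# indexed in semitones above the tonic.
KEY_TEMPLATES = {
    "major": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "minor": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
}
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def split_same_key_sections(notes, modulation_times):
    """Group (start_time, note_number) pairs into sections bounded by modulation times."""
    bounds = sorted(modulation_times)
    sections = [[] for _ in range(len(bounds) + 1)]
    for start, number in notes:
        index = sum(1 for t in bounds if start >= t)
        sections[index].append(number)
    return [s for s in sections if s]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def estimate_key(note_numbers):
    """Estimate (tonic name, mode) for one non-empty same-key section of MIDI note numbers."""
    tonic = note_numbers[-1] % 12          # last output sound is taken as the tonic
    histogram = [0] * 12
    for number in note_numbers:
        histogram[(number - tonic) % 12] += 1   # count relative to the tonic
    mode = max(KEY_TEMPLATES, key=lambda m: pearson(histogram, KEY_TEMPLATES[m]))
    return NOTE_NAMES[tonic], mode
```

Calling `estimate_key` on each list returned by `split_same_key_sections` yields one key estimate per same-key section, which could then be stored as metadata alongside the speech parameters.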

With such a speech synthesis system, after singing one song, the user can listen to his or her own voice in a model singing state.
That is, using the speech parameters derived from the user's singing voice, speech can be synthesized in the user's own voice so that it is output at the utterance start timings and utterance pitches of the song.

Moreover, in the speech synthesis system of the present invention, once speech parameters and model singing voice data have been generated for one song, there is no need to generate them again for other songs.

Thus, with the speech synthesis system of the present invention, the user can listen to his or her own singing voice at the appropriate pitch, not only for the song already sung but for various other songs as well, even outside the period in which the user is singing.
In the speech synthesis system of the present invention, metadata can be estimated automatically based on the specific content information and the utterance timing information. The system therefore removes the need for users to input, as metadata, the nature of the voice with which the content of the character string represented by the utterance content information is uttered.

The feature quantities serving as speech parameters in the present invention are those required to perform speech synthesis by formant synthesis. They include, for example, the fundamental frequency, the mel-frequency cepstrum (MFCC), the power, and the time difference of each of these.
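Two of the simpler feature quantities named above, power and time differences, can be illustrated with a minimal sketch. The definitions used here (mean squared amplitude per frame, and a first-order difference between consecutive frames) are common choices assumed for illustration; the text does not define them, and the function names are hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of one analysis frame (a non-empty list of samples)."""
    return sum(sample * sample for sample in frame) / len(frame)


def time_differences(values):
    """First-order difference between consecutive frames of any per-frame feature."""
    return [later - earlier for earlier, later in zip(values, values[1:])]
```

The same `time_differences` helper applies to a per-frame fundamental frequency track, a power track, or each MFCC dimension taken separately.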

The utterance information acquisition means of the speech synthesis system may acquire, as the performance information, score data that represents the score of a target song, which is one of the songs, and that defines at least the pitch and the performance start timing of each output sound output from the sound source module. It may also acquire, as the lyrics information, the character string of the lyric characters that make up the lyrics of the target song. It may further acquire, as the utterance timing information, lyrics output timings in which the output timing of at least one lyric character is associated with the performance start timing of the output sound corresponding to that lyric character.

In these cases, the waveform acquisition means may acquire, as the speech waveform, the waveform showing how the voice input during performance of the target song based on the score data changes along the time axis, and the parameter derivation means may extract, as a syllable waveform, the portion of the speech waveform in the section corresponding to each output sound.
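Extracting the portion of the speech waveform that corresponds to each output sound can be sketched as slicing the recorded samples by each note's start and end times. The function name, the representation of notes as (start, end) pairs in seconds, and the assumption that the recording starts at the same instant as the performance are illustrative, not taken from the text.

```python
def extract_syllable_waveforms(samples, sample_rate, notes):
    """Cut one segment of the recording per note.

    samples:     list of audio samples recorded from the start of the performance
    sample_rate: samples per second
    notes:       list of (start_seconds, end_seconds) pairs, one per output sound
    """
    segments = []
    for start_s, end_s in notes:
        start = max(0, round(start_s * sample_rate))
        end = min(len(samples), round(end_s * sample_rate))
        segments.append(samples[start:end])
    return segments
```

Each returned segment would then be passed to the parameter derivation step as one syllable waveform.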

With such a speech synthesis system, the speech waveforms input while the target song is being performed based on the score data can be collected.

The nature of a voice in the present invention includes at least the emotion of the speaker when the voice was uttered, and is a concept that covers, for example, mood and atmosphere. The nature of a voice may also include information needed to estimate the emotion.

In the parameter registration means of the present invention, word dividing means may divide the character string represented by the lyrics information acquired by the utterance information acquisition means into word strings, each being the character string that makes up one word. Metadata extraction means may then extract, as metadata, the property information corresponding to the word represented by each divided word string from a word property table prepared in advance, in which property information representing the nature of each word is associated with identification information of that word.

In that case, the parameter registration means stores the metadata extracted by the metadata extraction means in the first storage device in association with the speech parameters.
With such a speech synthesis system, the nature of each word can be used as metadata. The nature of a word here includes the meaning of the word and the emotion expressed by the word.
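The word-based metadata extraction can be sketched as a split followed by a table lookup. The table contents and the function name are hypothetical, and whitespace splitting is a stand-in assumption: Japanese lyrics have no spaces between words, so a real system would need a morphological analyzer for the word dividing step.

```python
# Hypothetical word property table: word -> property (here, an emotion label).
WORD_PROPERTIES = {
    "love": "affection",
    "tears": "sadness",
    "smile": "joy",
}


def extract_metadata(lyrics):
    """Return (word, property) pairs for every lyric word found in the table."""
    words = lyrics.lower().split()   # stand-in for the word dividing means
    return [(word, WORD_PROPERTIES[word]) for word in words if word in WORD_PROPERTIES]
```

For example, `extract_metadata("Tears turn into a smile")` returns `[("tears", "sadness"), ("smile", "joy")]`; words absent from the table contribute no metadata.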

The present invention may also be implemented as a parameter extraction device. In that case, the parameter extraction device preferably includes the utterance information acquisition means, waveform acquisition means, parameter derivation means, parameter registration means, and model storage means described above.
Further, when the present invention is implemented as a parameter extraction device, the parameter registration means may include section specifying means, tonic specifying means, pitch name frequency deriving means, and key estimation means, and may store the metadata estimated by the key estimation means in the first storage device in association with the speech parameters.
Likewise, when the present invention is implemented as a parameter extraction device, the parameter registration means may include word dividing means and metadata extraction means, and may store the metadata extracted by the metadata extraction means in the first storage device in association with the speech parameters.

With such a parameter extraction device, speech parameters can be derived and model singing voice data can be generated when the user sings a song for the first time.
Further, based on the speech parameters stored in the first storage device and the model singing voice data stored in the second storage device, the parameter extraction device allows a speech synthesis device to synthesize speech so that it is output at the utterance start timings and utterance pitches of the song.

The drawings show, in order:
A block diagram showing the overall configuration of a speech synthesis system to which the present invention is applied (FIG. 1).
A flowchart showing the procedure of the parameter registration process.
A flowchart showing the procedure of the parameter data registration process.
A flowchart showing the procedure of the metadata estimation process in the first embodiment.
A diagram showing the content of the metadata estimation process.
A diagram showing the content of the metadata estimation process.
A diagram showing the content of the metadata estimation process.
A diagram showing an outline of the parameter data registration process in the first embodiment.
A diagram showing an outline of the parameter data and the model singing voice data.
A flowchart showing the procedure of the model singing voice data registration process.
A flowchart showing the procedure of the speech synthesis process.
A flowchart showing the procedure of the metadata estimation process in the second embodiment.
A diagram showing an outline of the parameter data registration process in the second embodiment.

Embodiments of the present invention are described below with reference to the drawings.
[First embodiment]
<Speech synthesis system>
As shown in FIG. 1, the speech synthesis system 1 performs speech synthesis so that an ideal singing voice for a song designated by a user (singer) (hereinafter, the designated song) is output in the user's own voice.

To achieve this, the speech synthesis system 1 includes a voice input device 10, an information storage server 25, an information processing device 30, a data storage server 50, and a voice output terminal 60.

The voice input device 10 plays a song based on music data MD used for karaoke and accepts voice input while the song is playing. The information storage server 25 stores the music data MD prepared for each song and the speech waveform data SV, which is the data of each voice input via the voice input device 10.

The information processing device 30 generates parameter data PM and model singing voice data ED based on the speech waveform data SV and the music data MD stored in the information storage server 25.

The parameter data PM, described in detail later, is data containing at least one speech parameter used for so-called formant synthesis. The model singing voice data ED is data containing at least one speech parameter based on an ideal singing voice for each song. These speech parameters include, for example, the fundamental frequency, the mel-frequency cepstrum (MFCC), and the power at each syllable of the uttered voice, together with their time differences.

The data storage server 50 stores the parameter data PM and the model singing voice data ED generated by the information processing device 30. The voice output terminal 60 outputs a synthesized sound, synthesized in the user's own voice based on the parameter data PM and the model singing voice data ED stored in the data storage server 50, so that it forms the ideal singing voice for the designated song. The speech synthesis system 1 of this embodiment may include a plurality of voice output terminals 60.
<Information storage server>
The information storage server 25 is a device built around a readable and writable storage device, and is connected to the voice input device 10 via a communication network.

The music data MD stored in the information storage server 25 comprises song MIDI data DM, a lyrics data group DL, and guide vocal data GD, which are associated with one another for each song. When a song is a duet, the music data MD may also include information distinguishing the male and female parts of the lyrics.

The song MIDI data DM represents the score of one song in accordance with the well-known MIDI (Musical Instrument Digital Interface) standard. Each item of song MIDI data DM has at least identification data that distinguishes the song, score tracks representing the score for each instrument used in the song, and a modulation flag indicating the time at which the key changes in the song.

Each score track defines, for each output sound output from the MIDI sound source, at least the pitch (the note number) and the period during which the sound source module outputs the sound (hereinafter, the note length). The note length is defined by the performance start timing (the note-on timing), which is the time from the start of the song's performance until output of the sound begins, and the performance end timing (the note-off timing), which is the time from the start of the song's performance until output of the sound ends.
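The relationship between note-on timing, note-off timing, and note length can be written out as a small data structure. The class and field names are hypothetical; the conversion from note number to frequency uses the standard equal-temperament convention in which note 69 is A4 at 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class OutputSound:
    """One output sound of a score track (hypothetical representation)."""
    note_number: int   # pitch as a MIDI note number
    note_on: float     # seconds from the start of the performance until the sound starts
    note_off: float    # seconds from the start of the performance until the sound ends

    @property
    def note_length(self) -> float:
        """Duration of the sound: note-off timing minus note-on timing."""
        return self.note_off - self.note_on

    @property
    def frequency_hz(self) -> float:
        """Equal-temperament frequency of the note, with A4 (note 69) = 440 Hz."""
        return 440.0 * 2 ** ((self.note_number - 69) / 12)
```

For example, `OutputSound(69, 1.0, 1.5)` has a note length of 0.5 seconds and a frequency of 440 Hz.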

A score track is prepared for each type of instrument.
The lyrics data group DL, on the other hand, is data relating to the lyrics displayed on the display device of a well-known karaoke apparatus. It comprises lyrics telop data DT representing the characters that make up the lyrics of the song (hereinafter, lyric characters), and lyrics output data DO defining the timing correspondence that associates the lyrics output timing, that is, the output timing of the lyric characters, with the performance of the song MIDI data DM.

Specifically, in the timing correspondence of this embodiment, the timing at which output of the lyrics telop data DT starts is tied to the timing at which performance of the song MIDI data DM starts, and the lyrics output timing of each lyric character along the time axis of the song is defined by the elapsed time from the start of performance of the song MIDI data DM. The elapsed time here is, for example, the time at which the color of a displayed lyric character is changed, and is defined by the color-change speed. A lyric character here may be an individual character of the lyrics, or a clause or phrase formed by grouping those characters according to a specific rule along the time axis.
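Because each lyrics output timing is an elapsed time from the start of the performance, the lyric characters whose color should already have changed at a given playback position can be found with a binary search. This is a sketch under the assumption that timings are held as a time-sorted list of (elapsed seconds, text) pairs; the function name and the data layout are hypothetical.

```python
from bisect import bisect_right


def characters_reached(timings, elapsed):
    """Return the lyric characters whose output timing is at or before `elapsed`.

    timings: list of (elapsed_seconds, text) pairs sorted by time
    elapsed: seconds since the performance of the song started
    """
    times = [t for t, _ in timings]
    count = bisect_right(times, elapsed)   # number of timings <= elapsed
    return [text for _, text in timings[:count]]
```

With timings for "ka", "ra", "o", "ke" at 1.0, 1.5, 2.0, and 2.5 seconds, a playback position of 1.7 seconds yields `["ka", "ra"]`.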

The guide vocal data GD is voice data prepared in advance as the ideal singing voice for the song. The ideal singing voice here is a voice considered to have been sung exactly according to the musical tones making up the song, and one that would receive a nearly perfect score from the scoring function commonly provided in karaoke apparatuses.

In a typical karaoke apparatus, the ideal singing voice based on the guide vocal data GD is played back together with the instrumental performance based on the song MIDI data DM.
<Voice input device>
Next, the voice input device 10 is described.

The voice input device 10 includes a communication unit 11, an input receiving unit 12, a display unit 13, a voice input unit 14, a voice output unit 15, a sound source module 16, a storage unit 17, and a control unit 20. In other words, the voice input device 10 of this embodiment is configured as a well-known karaoke apparatus.

Of these, the communication unit 11 enables the voice input device 10 to communicate with external devices via a communication network. The communication network in this embodiment is, for example, a public wireless communication network or a network line.

The input receiving unit 12 is an input device that accepts information and commands entered through external operations. Input devices in this embodiment include, for example, keys, switches, and a remote control receiver.

The display unit 13 is a display device that displays images including at least information represented by character codes. The display device in this embodiment is, for example, a liquid crystal display or a CRT. The voice input unit 14 is a device (a microphone) that converts sound into an electrical signal and inputs it to the control unit 20. The voice output unit 15 is a device (a speaker) that converts an electrical signal from the control unit 20 into sound and outputs it.

The sound source module 16 is a device, for example a MIDI sound source, that outputs sounds simulating those of a sound source (that is, output sounds) based on the song MIDI data DM.
The storage unit 17 is a readable and writable non-volatile storage device. The storage device in this embodiment is, for example, a hard disk drive or a flash memory.

The control unit 20 is built around a well-known computer having at least a ROM 21 that stores processing programs and data whose contents must be retained even when the power is turned off, a RAM 22 that temporarily stores processing programs and data, and a CPU 23 that executes each process (various computations) according to the processing programs stored in the ROM 21 and the RAM 22.

The ROM 21 stores a processing program that causes the control unit 20 to execute a karaoke performance process, which plays a song designated by the user (hereinafter, the target song). In the karaoke performance process of this embodiment, the voice input via the voice input unit 14 while the target song is being played is stored in the information storage server 25 as speech waveform data SV, in association with a song ID identifying the target song and a user ID identifying the voice of the user who sang the target song.

That is, following the karaoke performance process, the voice input device 10 plays the song based on the song MIDI data DM corresponding to the target song and displays the lyrics on the display unit 13 based on the lyrics data group DL corresponding to the target song. The voice input via the voice input unit 14 during the karaoke performance process is stored in the information storage server 25 as speech waveform data SV, in association with the song ID and the user ID.

The voice waveform data SV stored in the information storage server 25 is also associated with speaker feature information representing characteristics of the speaker. The speaker feature information includes at least, for example, the user's gender.

The data storage server 50 is a device built around a storage device whose contents can be read and written, and is connected to the information processing apparatus 30 via a communication network.
<Information processing apparatus>
The information processing apparatus 30 includes a communication unit 31, an input receiving unit 32, a display unit 33, a storage unit 34, and a control unit 40.

Of these, the communication unit 31 communicates with external devices via the communication network. The input receiving unit 32 is an input device that accepts information and commands in response to external operations. The display unit 33 is a display device that displays images.

The storage unit 34 is a nonvolatile storage device whose contents can be read and written. The control unit 40 is built around a well-known computer having at least a ROM 41, a RAM 42, and a CPU 43.

The ROM 41 of the information processing apparatus 30 stores a processing program with which the control unit 40 executes a parameter registration process. This process acquires, from the voice waveform data SV stored in the information storage server 25, the voice waveform data SV associated with a specified music ID and user ID, generates parameter data PM and model singing voice data ED, and stores them in the data storage server 50.

The parameter data PM and the model singing voice data ED stored in the data storage server 50 by the parameter registration process include metadata. In this embodiment, metadata represents the nature of a voice, including the emotion with which the voice was uttered. Emotion here includes, for example, mood and atmosphere. The nature of the voice may also include information needed to estimate the emotion.
<Parameter registration process>
When started, the parameter registration process executed by the information processing apparatus 30 first acquires a user ID via the input receiving unit 32, as shown in FIG. 2 (S110). That is, in S110 the user logs in to the information processing apparatus 30.

Next, a music ID is acquired via the input receiving unit 32 (S120). That is, in S120 the user specifies a piece of music.
Next, a parameter data registration process is executed, which generates and registers parameter data PM based on the voice waveform data SV corresponding to the user ID acquired in S110 and the music ID acquired in S120 (S130). A model singing voice data registration process is then executed, which generates and registers model singing voice data ED for the music corresponding to the music ID acquired in S120 (S140). The parameter data registration process and the model singing voice data registration process are described in detail later.

The parameter registration process then ends.
<Parameter data registration process>
As shown in FIG. 3, when started, the parameter data registration process acquires the music MIDI data DM corresponding to the music ID acquired in the preceding S120 (S210). It then acquires the lyrics data group DL corresponding to that music ID (S220), and acquires one set of voice waveform data SV that corresponds both to that music ID and to the user ID specified in the preceding S110 (S230).

Then, in the voice waveform data SV acquired in S230, the voice waveform in the section corresponding to each syllable of the uttered content (hereinafter referred to as a syllable waveform) is identified (S240).

Specifically, S240 of this embodiment extracts the performance start timing and performance end timing of each output sound defined in the score track that represents the singing melody (hereinafter referred to as the melody track) of the music MIDI data DM acquired in S210, and identifies the syllable of the lyric characters associated with each output sound. Then, in the voice waveform data SV, the voice waveform in the section from the performance start timing to the performance end timing of each output sound is identified as a syllable waveform. Each syllable waveform identified in S240 is associated with the content of the syllable uttered in that waveform.
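The segmentation in S240 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Note` record, the use of seconds for timing, and the function name are assumptions introduced here, and the melody-track notes are taken as already parsed from the music MIDI data DM.

```python
from dataclasses import dataclass


@dataclass
class Note:
    """One melody-track output sound (hypothetical record, not from the source)."""
    start_s: float   # performance start timing, in seconds
    end_s: float     # performance end timing, in seconds
    syllable: str    # lyric syllable associated with this output sound


def split_into_syllable_waveforms(samples, sample_rate, notes):
    """Return one (syllable, samples) pair per note, cut at the note boundaries."""
    segments = []
    for note in notes:
        begin = max(0, int(round(note.start_s * sample_rate)))
        end = min(len(samples), int(round(note.end_s * sample_rate)))
        segments.append((note.syllable, samples[begin:end]))
    return segments


# Toy input: 16 samples at 8 samples per second, two half-second notes.
samples = list(range(16))
notes = [Note(0.0, 0.5, "ka"), Note(0.5, 1.0, "ra")]
segments = split_into_syllable_waveforms(samples, 8, notes)
```

Cutting strictly at the note boundaries assumes the singer is in time with the accompaniment; a practical system might widen or align the windows, which the source does not discuss.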

Voice parameters are then derived from each syllable waveform (S250). The voice parameters derived in S250 of this embodiment include at least the fundamental frequency, the mel-frequency cepstrum (MFCC), the power, and the time differences of each. Methods for deriving the fundamental frequency, MFCC, and power are well known, so a detailed description is omitted here. The fundamental frequency, for example, can be derived from the autocorrelation of the syllable waveform along the time axis, from the autocorrelation of its frequency spectrum, or by the cepstrum method. The MFCC can be derived by applying a time analysis window to the syllable waveform, performing frequency analysis (for example, an FFT) on each window, taking the logarithm of the magnitude at each frequency, and performing frequency analysis again on the result. The power can be derived by applying a time analysis window to the syllable waveform and integrating the squared amplitude over time.
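Two of the parameters named in S250 can be sketched with the standard library alone. The code below estimates the fundamental frequency from the time-domain autocorrelation of one analysis window, computes the power as a sum of squared amplitudes, and takes first-order time differences. The function names, the search range of 80 to 800 Hz, and the use of an unnormalized autocorrelation are assumptions for illustration; MFCC extraction is omitted.

```python
import math


def frame_power(frame):
    """Power of one analysis window: squared amplitude summed over time."""
    return sum(x * x for x in frame)


def estimate_f0_autocorr(frame, sample_rate, f_min=80.0, f_max=800.0):
    """Return the frequency whose lag maximizes the autocorrelation, or None."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(len(frame) - 1, int(sample_rate / f_min))
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag is not None else None


def time_deltas(values):
    """First-order time differences between consecutive windows."""
    return [b - a for a, b in zip(values, values[1:])]


# Test tone: 200 Hz sine sampled at 8 kHz (period of 40 samples).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
f0 = estimate_f0_autocorr(frame, sample_rate)
```

A plain autocorrelation peak is prone to octave errors on real singing; it is used here only because the source lists it as one acceptable method.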

Next, a metadata estimation process is executed to estimate metadata for each syllable waveform identified in S240 (S260).
<Metadata estimation process>
As shown in FIG. 4, when started, the metadata estimation process of this embodiment first identifies, based on the music MIDI data acquired in the preceding S210, the same-key sections of the music, that is, the sections in which a single key continues (S310). Specifically, in S310 of this embodiment, as shown in FIG. 5, the section between each pair of modulation flags that are adjacent along the time axis in the music MIDI data is identified as a same-key section.
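As a rough sketch of S310, the same-key sections can be obtained by cutting the song at each modulation flag. How the flags are encoded in the music MIDI data is not stated in the source, so the code below assumes they have already been read out as a list of times in seconds, and additionally assumes that the start and end of the song act as section boundaries.

```python
def same_key_sections(modulation_times, song_end):
    """Split [0, song_end] into (start, end) sections at each modulation flag."""
    inner = sorted(t for t in modulation_times if 0.0 < t < song_end)
    bounds = [0.0] + inner + [song_end]
    return list(zip(bounds, bounds[1:]))


# A song of 180 s with key changes at 60 s and 120 s.
sections = same_key_sections([120.0, 60.0], 180.0)
```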

Next, the tonic of each same-key section identified in S310 is identified (S320). Specifically, in S320 of this embodiment, as shown in FIG. 6, the last output sound along the time axis in a same-key section is identified as the tonic of that section. In this embodiment, a tonic is identified for each of the same-key sections identified in S310.

Then a histogram (hereinafter referred to as the pitch-name appearance frequency) is derived in which the classes are the pitch names of the output sounds in the same-key section, starting from the pitch name of the tonic identified in S320, and the frequencies are the number of times each pitch name appears (S330). Specifically, as shown in FIG. 7(A), the pitch-name appearance frequency derived in S330 of this embodiment tallies the number of appearances of output sounds with the same pitch name within a same-key section. In this embodiment, output sounds in different octaves are counted together if they have the same pitch name. A pitch-name appearance frequency is derived for each same-key section.
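S320 and S330 together amount to a tonic-relative pitch-class histogram. The sketch below assumes the output sounds of one same-key section are available as MIDI note numbers in time order; the function name is hypothetical.

```python
def pitch_class_histogram(midi_notes):
    """Count pitch names relative to the tonic, taken as the last note (S320).

    Index 0 is the tonic itself; octaves are folded together (S330).
    """
    if not midi_notes:
        return [0] * 12
    tonic = midi_notes[-1]
    hist = [0] * 12
    for note in midi_notes:
        hist[(note - tonic) % 12] += 1
    return hist


# C4, D4, E4, C5, G4, C4: the last note (C) is treated as the tonic.
hist = pitch_class_histogram([60, 62, 64, 72, 67, 60])
```

Because the index is taken modulo 12, C4 and C5 land in the same bin, which matches the statement that octave differences are ignored.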

Next, the pitch-name appearance frequency derived in S330 is compared against key templates, prepared in advance for each key, that represent the distribution of pitch names usable in that key, and the key of the same-key section is identified from the result (S340). Specifically, in S340 of this embodiment, a major template representing the distribution of pitch names usable in major-key music (see FIG. 7(B)) and a minor template representing the distribution of pitch names usable in minor-key music (see FIG. 7(C)) are prepared in advance, and the pitch-name appearance frequency derived in S330 is compared against each key template. The key corresponding to the key template with the highest correlation is identified as the key of that same-key section. In S340 of this embodiment, a key is identified for each same-key section.
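The template comparison in S340 can be illustrated with a Pearson correlation against two 12-bin templates. The actual templates of FIG. 7(B) and FIG. 7(C) are not reproduced in the text, so the binary scale-membership templates below (major scale and natural minor scale, both starting at the tonic) are assumptions, as are the function names.

```python
import math

# Assumed templates: 1 where the pitch name belongs to the scale, index 0 = tonic.
MAJOR_TEMPLATE = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
MINOR_TEMPLATE = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def classify_mode(hist):
    """Return 'major' or 'minor', whichever template correlates more strongly."""
    major = pearson(hist, MAJOR_TEMPLATE)
    minor = pearson(hist, MINOR_TEMPLATE)
    return "major" if major >= minor else "minor"


mode = classify_mode([4, 0, 2, 0, 3, 1, 0, 3, 0, 1, 0, 1])
```

Ties are resolved toward major here purely for determinism; the source does not say how a tie should be handled.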

The nature of the voice corresponding to the key identified in S340 for each same-key section is then specified as metadata (S350). Specifically, in S350 of this embodiment, if the key of a same-key section is major, the metadata specified for that section is the voice property indicating that the lyrics (that is, the uttered content) express a "bright" emotion. If the key is minor, the metadata specified is the voice property indicating that the lyrics express a "dark" emotion. In this embodiment, the metadata for a same-key section is assigned to every syllable contained in that section.
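The assignment in S350 is a lookup from key to label followed by a per-syllable tag. In the sketch below, the tuple layouts and the rule that a syllable belongs to the section containing its onset time are assumptions made for illustration.

```python
MODE_TO_METADATA = {"major": "bright", "minor": "dark"}


def assign_metadata(syllables, sections):
    """Tag each syllable with the metadata of the same-key section it falls in.

    syllables: list of (syllable, onset_s)
    sections:  list of (start_s, end_s, mode) with mode 'major' or 'minor'
    """
    tagged = []
    for text, onset in syllables:
        label = None
        for start, end, mode in sections:
            if start <= onset < end:
                label = MODE_TO_METADATA[mode]
                break
        tagged.append((text, label))
    return tagged


tagged = assign_metadata(
    [("a", 1.0), ("i", 61.0)],
    [(0.0, 60.0, "major"), (60.0, 120.0, "minor")],
)
```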

The process then returns to S270 of the parameter data registration process, which associates the voice parameters derived in S250 with the metadata estimated in S260 for each corresponding syllable, thereby generating the parameter data PM, and registers it by storing it in the data storage server 50 (S270). In addition to the voice parameters and metadata, the parameter data PM stored in the data storage server 50 in S270 of this embodiment includes the content (type) of each uttered syllable, the user ID, and the speaker feature information.
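One way to picture the record assembled in S270 is a per-user container holding one entry per syllable. The class and field names below are hypothetical; the source specifies only which items the parameter data PM contains, not its layout.

```python
from dataclasses import dataclass, field


@dataclass
class SyllableEntry:
    syllable: str        # content (type) of the uttered syllable
    voice_params: dict   # e.g. {"f0": ..., "mfcc": [...], "power": ...}
    metadata: str        # e.g. "bright" or "dark"


@dataclass
class ParameterData:
    user_id: str
    speaker_features: dict               # e.g. {"gender": "female"}
    entries: list = field(default_factory=list)


def build_parameter_data(user_id, speaker_features, syllables, params, metadata):
    """Pair voice parameters with metadata syllable by syllable, in song order."""
    if not (len(syllables) == len(params) == len(metadata)):
        raise ValueError("syllables, params and metadata must have equal length")
    pm = ParameterData(user_id, speaker_features)
    for syllable, voice_params, label in zip(syllables, params, metadata):
        pm.entries.append(SyllableEntry(syllable, voice_params, label))
    return pm


pm = build_parameter_data(
    "user-001",
    {"gender": "female"},
    ["sa", "ku"],
    [{"f0": 220.0, "power": 0.8}, {"f0": 247.0, "power": 0.6}],
    ["bright", "dark"],
)
```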

The parameter data registration process then ends.
As described above and as shown in FIG. 8, the parameter data registration process of this embodiment operates on the voice waveform input during the performance of a piece of music. The voice waveform data SV based on that waveform is divided into sections corresponding to the performance period of each output sound making up the melody line of the music (that is, into the syllables of the uttered content) to produce syllable waveforms, and voice parameters are derived from each syllable waveform.

In parallel, the parameter data registration process identifies each period of the music in which a single key continues (that is, each same-key section) and identifies the key (tonality) of each same-key section. The voice property defined in advance as the emotion evoked by that key is then specified as metadata.

The parameter data registration process then associates the voice parameters with the metadata for each corresponding syllable to generate the parameter data PM, and stores it in the data storage server 50. That is, as shown in FIG. 9(A), the parameter data PM stored in the data storage server 50 contains voice parameters associated with metadata, arranged in the order in which the syllables of the lyric characters appear along the time axis of the music.

The parameter data registration process of this embodiment is carried out for each user and for each piece of music. Consequently, even for the same user and the same piece of music, separate parameter data PM is stored in the data storage server each time the music is sung.
<Model singing voice data registration process>
As shown in FIG. 10, when started, the model singing voice data registration process acquires the music MIDI data DM corresponding to the music ID acquired in the preceding S120 (S810). It then acquires the lyrics data group DL corresponding to that music ID (S820), and further acquires the guide vocal data GD corresponding to that music ID (S830).

In the acquired guide vocal data GD, the voice waveform in the section corresponding to each syllable of the uttered content of the guide vocal data GD is identified (S840). S840 of this embodiment may be carried out in the same way as S240 of the parameter data registration process.

Voice parameters are then derived from each of the voice waveforms identified in S840 (S850). Next, the metadata estimation process described above is executed (S860).
After that, data in which the voice parameters derived in S850 are associated with the metadata estimated in S860 is generated as model singing voice data ED and registered by storing it in the data storage server 50 (S870). In addition to the voice parameters and metadata, the model singing voice data ED generated in S870 of this embodiment includes the content (type) of each uttered syllable and the music ID.

The model singing voice data registration process then ends.
That is, as shown in FIG. 9(B), the model singing voice data ED stored in the data storage server 50 contains voice parameters associated with metadata, arranged in the order in which the syllables of the lyric characters appear along the time axis of the music.

Normally, when the song is the same and its tone matches, the metadata of the model singing voice data ED (FIG. 9(B)) and the singer's metadata (FIG. 9(A)) are often of the same type (bright or dark).

If the singer sings a song in a major key, the metadata is likely to be "bright"; if the singer sings a song in a minor key, the metadata is likely to be "dark".
However, the tone of the song the user sang may differ from the tone of the song the user wants to hear as a model voice.

For this reason, to listen at a more appropriate pitch, two kinds of guide vocal, major (bright) and minor (dark), may be prepared, and two sets of metadata and voice parameters created.

The CPU 23 identifies the guide vocal types associated with the music ID and performs the parameter registration process for each guide vocal type, so that parameter data is registered for as many types as there are guide vocals.

The model singing voice data registration process of this embodiment is carried out for each piece of music. Consequently, one set of model singing voice data ED is generated per piece of music and stored in the data storage server 50.

In this example, the metadata and voice parameters of the guide vocal data GD are created by playing the music data MD on the karaoke apparatus at least once (twice if there are two kinds of guide vocal, major and minor).

If, in addition to the pitches at which the lyrics are to be uttered (the guide vocal), model voice data for the guide vocal is incorporated into the music data MD in advance, in association with the lyric output timings synchronized with the progress of the performance, the model voice data registration process for the guide vocal need not be performed.
<Audio output terminal>
As shown in FIG. 1, the audio output terminal 60 includes an information receiving unit 61, a display unit 62, a sound output unit 63, a communication unit 64, a storage unit 65, and a control unit 67. The audio output terminal 60 of this embodiment may be, for example, a well-known portable terminal such as a mobile phone or a personal digital assistant, or a well-known information processing apparatus such as a personal computer.

Of these, the information receiving unit 61 accepts information input via an input device (not shown). The display unit 62 displays images based on commands from the control unit 67. The sound output unit 63 is a well-known device that outputs sound and includes, for example, a PCM sound source and a speaker.

The communication unit 64 allows the audio output terminal 60 to exchange information with external devices via the communication network. The storage unit 65 is a nonvolatile storage device whose contents can be read and written, and stores various processing programs and data.

The control unit 67 is built around a well-known computer having at least a ROM, a RAM, and a CPU.
The ROM of the control unit 67 stores a processing program with which the control unit 67 executes a speech synthesis process. Based on the parameter data PM and the model singing voice data ED, this process outputs a synthesized sound in which the ideal singing voice for a piece of music is reproduced in the user's own voice.
<Speech synthesis process>
The speech synthesis process is started when a start command is input via the information receiving unit 61 of the audio output terminal 60.

In the speech synthesis process, as shown in FIG. 11, a user ID is acquired via the information receiving unit 61 (S510). A music ID is then acquired via the information receiving unit 61 (S520).

The model singing voice data ED corresponding to the music ID acquired in S520 is then acquired (S530). Next, from among the parameter data PM associated with the user ID acquired in S510, the parameter data PM most similar to the voice parameters contained in the model singing voice data ED acquired in S530 is acquired (S540). One possible criterion for "most similar" in S540 is as follows: for each set of parameter data PM, compute a score from the correlation values derived for each feature of the voice parameters, and treat the parameter data PM whose voice parameters give the largest score as the most similar.
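A minimal sketch of the selection in S540 follows. The source says only that correlation values are derived per feature and that the largest result wins; averaging the per-feature Pearson correlations into a single score, the dictionary layout, and all names are assumptions introduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def similarity(candidate, model):
    """Mean per-feature correlation; both map feature name -> per-syllable values."""
    shared = [name for name in model if name in candidate]
    if not shared:
        return float("-inf")
    return sum(pearson(candidate[name], model[name]) for name in shared) / len(shared)


def most_similar(candidates, model):
    """Return the candidate parameter data with the highest similarity score."""
    return max(candidates, key=lambda c: similarity(c["features"], model))


model = {"f0": [220.0, 247.0, 262.0, 294.0], "power": [1.0, 2.0, 3.0, 4.0]}
candidates = [
    {"id": "take-1", "features": {"f0": [110.0, 123.5, 131.0, 147.0],
                                  "power": [2.0, 4.0, 6.0, 8.0]}},
    {"id": "take-2", "features": {"f0": [294.0, 262.0, 247.0, 220.0],
                                  "power": [4.0, 3.0, 2.0, 1.0]}},
]
best = most_similar(candidates, model)
```

Correlation ignores absolute level, so a take sung an octave low but with the right contour (as in `take-1`) still scores highly; whether that is desirable is a design choice the source leaves open.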

When there are multiple kinds of metadata (bright, dark), that is, when there are multiple sets of model singing voice data ED for the music ID, the model singing voice data ED whose metadata matches may be selected. Using the voice parameters on the side with matching metadata yields better-suited speech synthesis.

Next, the voice parameters contained in the parameter data PM acquired in S540 are adjusted to match the voice parameters contained in the model singing voice data ED acquired in S530 (S550). Speech is then synthesized based on the adjusted voice parameters (S560). The speech synthesis in S560 may use a well-known technique based on formant synthesis. If the music is a duet, the user's gender may be identified from the speaker feature information, and only the lyric portions of the part matching that gender may be synthesized.
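The source does not say which voice parameters are adjusted in S550 or how. One possible reading, assumed here purely for illustration, is that pitch and power follow the model while the spectral envelope (MFCC), which carries the user's timbre, is left untouched. The dictionary keys and function name are hypothetical.

```python
def adjust_to_model(user_entry, model_entry):
    """Return a copy of the user's parameters with pitch and power taken from the model.

    Assumption: the MFCC (timbre) stays with the user so the result still
    sounds like the user's own voice.
    """
    adjusted = dict(user_entry)
    adjusted["f0"] = model_entry["f0"]
    adjusted["power"] = model_entry["power"]
    return adjusted


user = {"f0": 180.0, "power": 0.4, "mfcc": [1.0, 2.0]}
model = {"f0": 220.0, "power": 0.9, "mfcc": [9.0, 9.0]}
adjusted = adjust_to_model(user, model)
```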

The synthesized sound generated by the speech synthesis in S560 is then output from the sound output unit 63 (S570).
In this embodiment, steps S530 to S570 are executed sequentially along the time axis of the music. Specifically, based on the music MIDI data DM of the music, speech synthesis is performed so that the synthesized sound for each syllable is output at the output timing of the corresponding musical sound.

S530 to S570 are repeated until the performance of the music corresponding to the music ID acquired in S520 ends (S580: NO). When the performance of the music ends (S580: YES), the speech synthesis process ends.
[Effects of the first embodiment]
As described above, with the speech synthesis system 1, after singing a piece of music the user can hear their own voice singing in the model manner. That is, the speech synthesis system 1 uses voice parameters derived from the user's singing voice to synthesize speech in the user's own voice so that it is output at the utterance start timings and utterance pitches of the music.

Moreover, in the speech synthesis system 1, once the model singing voice data ED has been generated for one piece of music, it suffices for other pieces of music to play the corresponding music data MD; the user does not need to sing each time in order to generate model singing voice data ED.

Consequently, the speech synthesis system 1 allows the user to listen, at times other than when the user is actually singing, to various pieces of music, including the one the user sang, in the user's own voice and at an appropriate pitch.

In the metadata estimation process of this embodiment, the singer's emotion that is likely to be expressed by the key of each same-key section of the target music is used as metadata. That is, the emotion of the speaker when uttering the lyrics of each same-key section can be used as metadata, and the key of each same-key section can be identified reliably.
[Second Embodiment]
The speech synthesis system of the second embodiment differs from the speech synthesis system 1 of the first embodiment mainly in the content of the metadata estimation process. In this embodiment, therefore, configurations and processes that are the same as in the first embodiment are given the same reference signs and their description is omitted; the description focuses on the metadata estimation process, which differs from that of the first embodiment.
<Metadata estimation process>
As shown in FIG. 12, when started from S260 of the parameter data registration process, the metadata estimation process of this embodiment performs morphological analysis on the lyrics represented by the lyrics telop data DT contained in the lyrics data group DL acquired in the preceding S210 (S710). That is, in S710 of this embodiment, morphological analysis divides the character string making up the lyrics into word characters, that is, the character strings that make up each word in the lyrics. The morphological analysis performed in S710 is a well-known process, so a detailed description is omitted here.

Next, word property information is acquired for each word produced by the morphological analysis of S710 from a word metadata database (DB in the figure) 100, which stores a word property table prepared in advance (S720). The word property table is a table that associates word property information, representing the nature of each word, with identification information for that word. The nature of a word here includes the meaning of the word and the emotion it expresses.

The word property information acquired in S720 is then assigned as metadata to the section in which that word is uttered (S730).
After that, the metadata estimation process ends and control returns to the parameter data registration process.
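Steps S720 and S730 can be sketched as a table lookup over words that have already been segmented. The morphological analysis of S710 is not reproduced here, so the input is assumed to be a list of words with the start and end times of the sections in which they are sung. The table contents, the tuple layout, and the function name are all hypothetical.

```python
# Hypothetical word property table: word -> word property information.
WORD_PROPERTIES = {"love": "warm", "tears": "sad", "sun": "bright"}


def assign_word_metadata(words, table, default=None):
    """Attach word property information to the section in which each word is sung.

    words: list of (word, start_s, end_s), assumed to come from a morphological
           analyzer combined with the lyric timing data.
    Returns a list of (start_s, end_s, metadata).
    """
    return [(start, end, table.get(word, default)) for word, start, end in words]


sections = assign_word_metadata(
    [("sun", 0.0, 0.8), ("and", 0.8, 1.0), ("tears", 1.0, 2.2)],
    WORD_PROPERTIES,
)
```

Words missing from the table receive `None` in this sketch; the source does not state how unknown words are handled.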

As described above and as shown in FIG. 13, the metadata estimation process of this embodiment performs morphological analysis on the lyrics of the target music and divides them into word characters, that is, the character strings that make up each word. Word property information corresponding to each word is then acquired from the word property table stored in the word metadata database 100 prepared in advance, and each item of word property information is used as metadata for the voice parameters of the corresponding syllables.
[Effects of the second embodiment]
As described above, the metadata estimation process of this embodiment makes it possible to use the meaning of a word uttered by the speaker, the emotion expressed by that word, and the like as metadata.
[Other Embodiments]
Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments and can be carried out in various forms without departing from the gist of the invention.

For example, although the speech synthesis system 1 is provided with the information storage server 25, the speech synthesis system of the present invention need not include the information storage server 25. In that case, the music data MD and the voice waveform data SV may be stored in the storage unit 17 of the voice input device 10, in the data storage server 50, or in the storage unit 34 of the information processing device 30.

Similarly, although the speech synthesis system 1 is provided with the data storage server 50, the data storage server 50 need not be provided. In that case, the parameter data PM and the model singing voice data ED may be stored in the storage unit 34 of the information processing device 30, in the storage unit 17 of the voice input device 10, or in the information storage server 25.

In the above embodiment, the model singing voice data ED is generated by executing the model singing voice data registration process, but the model singing voice data ED may instead be incorporated into the music data MD in advance. In that case, data including ideal voice waveform data SVr corresponding to the music ID and speech parameters based on that voice waveform data SVr may be incorporated as the model singing voice data ED. When the data is incorporated into the music data MD, it naturally suffices to store a single piece of music data MD that includes the model singing voice data ED.

Further, although the speech synthesis system 1 includes the voice input device 10, the information processing device 30, and the voice output terminal 60 as separate devices, the voice input device 10, the information processing device 30, and the voice output terminal 60 may be a common device. That is, the voice input device 10 (that is, the karaoke device) may also serve as at least one of the information processing device 30 and the voice output terminal 60.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the description of the claims will be explained.

The karaoke performance process in the above embodiment corresponds to the utterance information acquisition means of the present invention. In the parameter data registration process, S230 corresponds to the waveform acquisition means, S250 corresponds to the parameter derivation means, and S270 corresponds to the parameter registration means of the present invention. S140 in the parameter registration process corresponds to the model storage means of the present invention.

Further, in the speech synthesis process of the above embodiment, S510 and S520 correspond to the identification information acquisition means of the present invention, S540 corresponds to the parameter acquisition means, and S530 corresponds to the model acquisition means. In addition, S550 and S560 in the speech synthesis process correspond to the speech synthesis means of the present invention, and S570 corresponds to the output means.
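
The parameter acquisition and adjustment attributed to S540 and S550 above can be pictured with a minimal Python sketch. It rests on assumptions that this section does not confirm: each speech parameter is treated as a plain list of numbers (for example, a fundamental frequency and a duration), similarity is measured by Euclidean distance, and the hypothetical `weight` argument controls how far the parameter is moved toward the model value.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def acquire_most_similar(user_params, model_param):
    """Pick, from the user's stored parameters, the one closest to the model's."""
    return min(user_params, key=lambda p: distance(p, model_param))


def adjust_toward(param, model_param, weight=1.0):
    """Move the parameter toward the model; weight=1.0 makes it match the model."""
    return [x + weight * (m - x) for x, m in zip(param, model_param)]
```

Starting from the most similar stored parameter keeps the required adjustment small, which is one plausible reason for selecting by similarity before adjusting.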

Further, in the metadata estimation process of the first embodiment, S310 corresponds to the section specifying means recited in the claims, S320 corresponds to the tonic specifying means, S330 corresponds to the pitch name frequency deriving means, and S340 and S350 correspond to the key estimation means. In the metadata estimation process of the second embodiment, S710 corresponds to the word dividing means recited in the claims, and S720 and S720 correspond to the metadata extraction means.
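
The first-embodiment steps named above (S310 to S350) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: pitches are taken to be MIDI note numbers, the key templates hold only major and natural-minor scale degrees, and a template is scored by counting how many notes fall on its degrees.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Note:
    pitch: int    # MIDI note number (assumption)
    start: float  # performance start timing in seconds


# Hypothetical key templates: semitone offsets from the tonic usable in each mode.
KEY_TEMPLATES = {
    "major": {0, 2, 4, 5, 7, 9, 11},
    "minor": {0, 2, 3, 5, 7, 8, 10},
}
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def split_same_key_sections(notes, modulation_times):
    """S310: cut the note sequence at each time marked by a modulation flag."""
    bounds = sorted(modulation_times) + [float("inf")]
    sections = [[] for _ in bounds]
    for note in sorted(notes, key=lambda n: n.start):
        idx = next(i for i, b in enumerate(bounds) if note.start < b)
        sections[idx].append(note)
    return [s for s in sections if s]


def estimate_key(section):
    """S320 to S350: take the last note as tonic, count pitch names, match templates."""
    tonic = section[-1].pitch % 12                            # S320
    freq = Counter((n.pitch - tonic) % 12 for n in section)   # S330, counted from the tonic

    def score(template):
        return sum(count for degree, count in freq.items() if degree in template)

    mode = max(KEY_TEMPLATES, key=lambda m: score(KEY_TEMPLATES[m]))  # S340, S350
    return f"{PITCH_NAMES[tonic]} {mode}"
```

Because the pitch-name counts are taken relative to the tonic, a single template per mode covers all twelve keys in this sketch.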

DESCRIPTION OF SYMBOLS 1 … speech synthesis system, 10 … voice input device, 11 … communication unit, 12 … input reception unit, 13 … display unit, 14 … voice input unit, 15 … voice output unit, 16 … sound source module, 17 … storage unit, 20 … control unit, 21, 41 … ROM, 22, 42 … RAM, 23, 43 … CPU, 25 … MIDI storage server, 30 … information processing device, 31 … communication unit, 32 … input reception unit, 33 … display unit, 34 … storage unit, 40 … control unit, 50 … data storage server, 60 … voice output terminal, 61 … information reception unit, 62 … display unit, 63 … sound output unit, 64 … communication unit, 65 … storage unit, 67 … control unit, 100 … word metadata database

Claims

A speech synthesis system comprising:
a parameter extraction device having:
utterance information acquisition means for acquiring a music ID, performance information, lyrics information, and utterance timing information from music data that includes the music ID identifying a piece of music, the performance information representing the musical sounds constituting the piece of music identified by the music ID, the lyrics information representing the lyrics of the piece of music identified by the music ID, and the utterance timing information representing the utterance start timing of the lyrics represented by the lyrics information, and for playing the piece of music based on the music data;
waveform acquisition means for acquiring the voice waveform input at the utterance start timing of the lyrics while the piece of music is being played based on the music data;
parameter derivation means for extracting, from the voice waveform, a syllable waveform that is the voice waveform of each syllable forming the lyrics, and for deriving, from each extracted syllable waveform, a speech parameter that is at least one predefined feature quantity;
parameter registration means for storing, in a first storage device, the speech parameter of each syllable derived by the parameter derivation means in association with a user ID identifying the user who sang the piece of music; and
model storage means for storing, in a second storage device, model voice data that includes at least one speech parameter based on an ideal singing voice for each piece of music, in association with the music ID; and
a synthesized sound output device having:
identification information acquisition means for acquiring a designated music ID and a designated user ID;
model acquisition means for acquiring, from the second storage device, the model voice data associated with the music ID acquired by the identification information acquisition means;
parameter acquisition means for acquiring, from among the speech parameters stored in the first storage device, the speech parameter that is associated with the user ID acquired by the identification information acquisition means and that is most similar to the speech parameter associated with the music ID and included in the model voice data acquired by the model acquisition means;
speech synthesis means for adjusting the speech parameter acquired by the parameter acquisition means so that it matches the speech parameter associated with the music ID, and for synthesizing speech based on the adjusted speech parameter; and
output means for outputting the synthesized sound generated by the speech synthesis performed by the speech synthesis means,
wherein the utterance information acquisition means acquires, as the performance information, score data that represents the score of a target piece of music, which is one of the pieces of music, and that defines at least a pitch and a performance start timing for each output sound output from a sound source module,
the score data includes, when the target piece of music modulates partway through, a modulation flag representing the time along the time axis at which the target piece of music modulated,
the parameter registration means comprises:
section specifying means for specifying, based on the acquired score data, same-key sections, each being a section of the target piece of music in which the same key continues;
tonic specifying means for specifying, as the tonic of each same-key section specified by the section specifying means, the last output sound along the time axis in that same-key section;
pitch name frequency deriving means for deriving, for each same-key section and starting from the pitch name of the tonic specified by the tonic specifying means, an appearing pitch name frequency representing the frequency of output sounds of the same pitch name included in the same-key section specified by the section specifying means; and
key estimation means for estimating, as metadata, the key of each same-key section from the result of comparing each appearing pitch name frequency derived by the pitch name frequency deriving means against key templates prepared in advance for each key as templates representing the distribution of pitch names usable in that key,
and the metadata estimated by the key estimation means is stored in the first storage device in association with the speech parameter.

The speech synthesis system according to claim 1, wherein
the utterance information acquisition means acquires, as the lyrics information, a character string of lyrics constituent characters constituting the lyrics of the target piece of music,
and acquires, as the utterance timing information, a lyrics output timing that is the output timing of at least one of the lyrics constituent characters and that is associated with the performance start timing of the output sound corresponding to that lyrics constituent character,
the waveform acquisition means acquires, as the voice waveform, a waveform representing the transition along the time axis of the voice input during the performance of the target piece of music based on the score data,
and the parameter derivation means extracts, as the syllable waveform, the voice waveform in the section of the voice waveform that corresponds to each output sound.

A speech synthesis system comprising:
a parameter extraction device having:
utterance information acquisition means for acquiring a music ID, performance information, lyrics information, and utterance timing information from music data that includes the music ID identifying a piece of music, the performance information representing the musical sounds constituting the piece of music identified by the music ID, the lyrics information representing the lyrics of the piece of music identified by the music ID, and the utterance timing information representing the utterance start timing of the lyrics represented by the lyrics information, and for playing the piece of music based on the music data;
waveform acquisition means for acquiring the voice waveform input at the utterance start timing of the lyrics while the piece of music is being played based on the music data;
parameter derivation means for extracting, from the voice waveform, a syllable waveform that is the voice waveform of each syllable forming the lyrics, and for deriving, from each extracted syllable waveform, a speech parameter that is at least one predefined feature quantity;
parameter registration means for storing, in a first storage device, the speech parameter of each syllable derived by the parameter derivation means in association with a user ID identifying the user who sang the piece of music; and
model storage means for storing, in a second storage device, model voice data that includes at least one speech parameter based on an ideal singing voice for each piece of music, in association with the music ID; and
a synthesized sound output device having:
identification information acquisition means for acquiring a designated music ID and a designated user ID;
model acquisition means for acquiring, from the second storage device, the model singing voice data associated with the music ID acquired by the identification information acquisition means;
parameter acquisition means for acquiring, from among the speech parameters stored in the first storage device, the speech parameter that is associated with the user ID acquired by the identification information acquisition means and that is most similar to the speech parameter associated with the music ID and included in the model voice data acquired by the model acquisition means;
speech synthesis means for adjusting the speech parameter acquired by the parameter acquisition means so that it matches the speech parameter associated with the music ID, and for synthesizing speech based on the adjusted speech parameter; and
output means for outputting the synthesized sound generated by the speech synthesis performed by the speech synthesis means,
wherein the parameter registration means comprises:
word dividing means for dividing the character string represented by the lyrics information acquired by the utterance information acquisition means into word characters, each being a character string constituting a word; and
metadata extraction means for extracting, as metadata, from a word property table prepared in advance in which property information representing the properties of each word is associated with identification information of that word, the property information corresponding to the word represented by each word character obtained by the word dividing means,
and the metadata extracted by the metadata extraction means is stored in the first storage device in association with the speech parameter.

A parameter extraction device comprising:
utterance information acquisition means for acquiring a music ID, performance information, lyrics information, and utterance timing information from music data that includes the music ID identifying a piece of music, the performance information representing the musical sounds constituting the piece of music identified by the music ID, the lyrics information representing the lyrics of the piece of music identified by the music ID, and the utterance timing information representing the utterance start timing of the lyrics represented by the lyrics information, and for playing the piece of music based on the music data;
waveform acquisition means for acquiring the voice waveform input at the utterance start timing of the lyrics while the piece of music is being played based on the music data;
parameter derivation means for extracting, from the voice waveform, a syllable waveform that is the voice waveform of each syllable forming the lyrics, and for deriving, from each extracted syllable waveform, a speech parameter that is at least one predefined feature quantity;
parameter registration means for storing, in a first storage device, the speech parameter of each syllable derived by the parameter derivation means in association with a user ID identifying the user who sang the piece of music; and
model storage means for storing, in a second storage device, model voice data that includes at least one speech parameter based on an ideal singing voice for each piece of music, in association with the music ID,
wherein the utterance information acquisition means acquires, as the performance information, score data that represents the score of a target piece of music, which is one of the pieces of music, and that defines at least a pitch and a performance start timing for each output sound output from a sound source module,
the score data includes, when the target piece of music modulates partway through, a modulation flag representing the time along the time axis at which the target piece of music modulated,
the parameter registration means comprises:
section specifying means for specifying, based on the acquired score data, same-key sections, each being a section of the target piece of music in which the same key continues;
tonic specifying means for specifying, as the tonic of each same-key section specified by the section specifying means, the last output sound along the time axis in that same-key section;
pitch name frequency deriving means for deriving, for each same-key section and starting from the pitch name of the tonic specified by the tonic specifying means, an appearing pitch name frequency representing the frequency of output sounds of the same pitch name included in the same-key section specified by the section specifying means; and
key estimation means for estimating, as metadata, the key of each same-key section from the result of comparing each appearing pitch name frequency derived by the pitch name frequency deriving means against key templates prepared in advance for each key as templates representing the distribution of pitch names usable in that key,
and the metadata estimated by the key estimation means is stored in the first storage device in association with the speech parameter.

A parameter extraction device comprising:
utterance information acquisition means for acquiring a music ID, performance information, lyrics information, and utterance timing information from music data that includes the music ID identifying a piece of music, the performance information representing the musical sounds constituting the piece of music identified by the music ID, the lyrics information representing the lyrics of the piece of music identified by the music ID, and the utterance timing information representing the utterance start timing of the lyrics represented by the lyrics information, and for playing the piece of music based on the music data;
waveform acquisition means for acquiring the voice waveform input at the utterance start timing of the lyrics while the piece of music is being played based on the music data;
parameter derivation means for extracting, from the voice waveform, a syllable waveform that is the voice waveform of each syllable forming the lyrics, and for deriving, from each extracted syllable waveform, a speech parameter that is at least one predefined feature quantity;
parameter registration means for storing, in a first storage device, the speech parameter of each syllable derived by the parameter derivation means in association with a user ID identifying the user who sang the piece of music; and
model storage means for storing, in a second storage device, model voice data that includes at least one speech parameter based on an ideal singing voice for each piece of music, in association with the music ID,
wherein the parameter registration means comprises:
word dividing means for dividing the character string represented by the lyrics information acquired by the utterance information acquisition means into word characters, each being a character string constituting a word; and
metadata extraction means for extracting, as metadata, from a word property table prepared in advance in which property information representing the properties of each word is associated with identification information of that word, the property information corresponding to the word represented by each word character obtained by the word dividing means,
and the metadata extracted by the metadata extraction means is stored in the first storage device in association with the speech parameter.