JP6372066B2

JP6372066B2 - Synthesis information management apparatus and speech synthesis apparatus

Info

Publication number: JP6372066B2
Application number: JP2013215029A
Authority: JP
Inventors: 入山 達也
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2013-10-15
Filing date: 2013-10-15
Publication date: 2018-08-15
Anticipated expiration: 2033-10-15
Also published as: JP2015079065A

Description

The present invention relates to a technique for synthesizing speech.

When a plurality of phonemes are pronounced in succession, each phoneme can change under the influence of the preceding and following phonemes (hereinafter referred to as “speech change”). Typical examples of speech change are dropout (elision), in which the pronunciation of a specific phoneme is omitted, and assimilation, in which a phoneme changes into a phoneme similar to its neighbors. For example, when “out door”, in which “out” ([aU][t]) is followed by “door” ([dh][O@]), is pronounced, the final phoneme [t] of “out” is dropped and the phrase is pronounced [aU][dh][O@]. As in this example, this specification writes the phoneme symbol of each phoneme in a format conforming to X-SAMPA (Extended Speech Assessment Methods Phonetic Alphabet).

Meanwhile, speech synthesis techniques that generate speech pronouncing an arbitrary character string have been proposed. To synthesize speech that sounds natural, it is important to reproduce the speech changes described above, which occur in actual pronunciation. Against this background, Patent Document 1 discloses a technique for synthesizing speech in which speech changes such as dropout and assimilation are reproduced by omitting or changing, within a time series of phonemes, each phoneme that satisfies a condition defined in advance in a sound change rule dictionary.

Patent Document 1: JP 2011-175074 A

However, under the technique of Patent Document 1, a speech change is applied uniformly to every phoneme that satisfies a condition defined in the sound change rule dictionary, and is uniformly withheld from every phoneme that does not. An unnatural speech change that does not match actual pronunciation tendencies may therefore be applied. For example, speech changes tend to occur when phonemes are pronounced quickly at short intervals, and tend not to occur when phonemes are pronounced slowly or at sufficiently long intervals. In the technique of Patent Document 1, however, a speech change is applied whenever the sound change rule dictionary is matched, regardless of the speed or interval at which the phonemes are pronounced.

A configuration in which the user specifies, for each phoneme, whether a speech change occurs is also conceivable. However, choosing appropriately which phonemes should actually receive a speech change requires specialized knowledge of speech changes, and specifying the presence or absence of a speech change phoneme by phoneme places an excessive workload on the user. In view of these circumstances, an object of the present invention is to realize natural speech changes while reducing the burden on the user.

A synthesis information management apparatus according to a first aspect of the present invention manages synthesis information that specifies the pronunciation content of a synthesis target speech. The apparatus comprises index calculation means and information management means. The index calculation means calculates, for a phoneme of the synthesis target speech, a speech change index, which is an index of how readily a speech change occurs. The information management means determines, according to the speech change index of a phoneme of the synthesis target speech, whether to generate a speech change for that phoneme; it sets the occurrence of the speech change in the synthesis information for a phoneme for which it determines that a speech change is to be generated, and does not set it for a phoneme for which it determines that no speech change is to be generated. In this configuration, a speech change index is calculated for each phoneme of the synthesis target speech specified by the synthesis information, and the presence or absence of a speech change for that phoneme is controlled according to the index. Compared with the technique of Patent Document 1, in which the presence or absence of a speech change is decided uniformly and solely by whether each phoneme meets a predetermined condition, natural speech changes that match actual pronunciation tendencies can be realized while reducing the user's burden, such as the work of specifying the presence or absence of a speech change for each phoneme.

A speech synthesizer according to a second aspect of the present invention generates a speech signal of a synthesis target speech whose pronunciation content is specified by synthesis information. The synthesizer comprises index calculation means and speech synthesis means. The index calculation means calculates, for a phoneme of the synthesis target speech, a speech change index, which is an index of how readily a speech change occurs. The speech synthesis means generates a speech signal according to the synthesis information: it determines, according to the speech change index of a phoneme of the synthesis target speech, whether to generate a speech change for that phoneme, generates a speech signal in which the speech change occurs for a phoneme for which it determines that a speech change is to be generated, and generates a speech signal in which no speech change occurs for a phoneme for which it determines that no speech change is to be generated. In this configuration, a speech change index is calculated for each phoneme of the synthesis target speech specified by the synthesis information, and the presence or absence of a speech change for that phoneme is controlled according to the index. As in the first aspect, natural speech changes that match actual pronunciation tendencies can therefore be realized while reducing the user's burden, such as the work of specifying the presence or absence of a speech change for each phoneme. A specific example of the second aspect is described later as, for example, the third embodiment.

In a preferred example of the first and second aspects of the present invention, the index calculation means identifies, from reference information that specifies a basic index (an index of how readily a speech change occurs) for each combination of two phonemes, the basic index corresponding to a combination of two phonemes of the synthesis target speech, and calculates the speech change index according to that basic index. Because the basic index matching the phoneme combination in the synthesis target speech is taken from the reference information and applied to the calculation of the speech change index, this aspect can realize natural speech changes that reflect the tendency for the likelihood of a speech change to differ with the combination of two phonemes (some combinations readily undergo a speech change and others do not).

In a preferred example of the first and second aspects of the present invention, the synthesis information specifies a duration and pronunciation content for each speech unit of the synthesis target speech, and the index calculation means calculates the speech change index according to the duration of the phoneme of the synthesis target speech. Because the speech change index is calculated according to the duration of the speech unit, this aspect can reproduce the actual pronunciation tendency that speech changes occur readily when a plurality of phonemes are pronounced quickly.

In a preferred example of the first and second aspects of the present invention, the synthesis information specifies a pitch and pronunciation content for each speech unit of the synthesis target speech, and the index calculation means calculates the speech change index according to the pitch of the phoneme of the synthesis target speech. Because the speech change index is calculated according to the pitch of the speech unit, this aspect can realize natural speech changes that reflect the tendency for speech changes to occur more readily as the pitch of the speech rises.

In a preferred example of the first and second aspects of the present invention, the synthesis information includes control information representing the musical expression of the synthesis target speech, and the index calculation means calculates the speech change index according to the control information for the phoneme of the synthesis target speech. Because the speech change index is calculated according to control information representing the musical expression of the synthesis target speech, this aspect can apply natural speech changes suited to that musical expression. A specific example of this aspect is described later as, for example, the fourth embodiment.

A preferred example of the first and second aspects of the present invention comprises instruction receiving means for receiving instructions from a user. The information management means determines whether to generate a speech change for a phoneme of the synthesis target speech by comparing the speech change index that the index calculation means calculated for that phoneme with a threshold set according to an instruction that the instruction receiving means received from the user. Because the threshold compared with the speech change index is set according to the user's instruction, this aspect can realize speech changes suited to the user's intention and preference. A specific example of this aspect is described later as, for example, the second embodiment.
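The threshold comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `should_generate_change`, the numeric scale of the index, and the use of a strict "greater than" comparison are not specified in the text, which says only that the decision depends on the relative magnitude of the index and a user-set threshold.

```python
def should_generate_change(change_index: float, threshold: float) -> bool:
    """Decide whether a speech change is generated for one phoneme.

    Assumption: a change is generated when the index strictly exceeds
    the threshold; the source only states that the two are compared.
    """
    return change_index > threshold


# A lower user-set threshold lets more phonemes undergo a speech change.
indices = [30.0, 55.0, 80.0]
strict = [should_generate_change(a, threshold=70.0) for a in indices]
lenient = [should_generate_change(a, threshold=40.0) for a in indices]
```

With this convention the user tunes a single value rather than marking each phoneme individually.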

The synthesis information management apparatus and the speech synthesizer according to each of the above aspects can be realized by hardware (electronic circuits) such as a DSP (Digital Signal Processor) dedicated to editing synthesis information and generating speech signals, and also by cooperation between a general-purpose arithmetic processing device such as a CPU (Central Processing Unit) and a program. The program of the present invention can be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium; an optical recording medium (optical disc) such as a CD-ROM is a good example, but any known type of recording medium, such as a semiconductor recording medium or a magnetic recording medium, may be used. The program of the present invention can also be provided by distribution over a communication network and installed on a computer. The present invention is also specified as a method of operating the synthesis information management apparatus (a synthesis information management method) and a method of operating the speech synthesizer (a speech synthesis method) according to each of the aspects described above.

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 2 is a schematic diagram of synthesis information.
FIG. 3 is a schematic diagram of reference information.
FIG. 4 is a schematic diagram of an editing image.
FIG. 5 is a flowchart of the operation of the speech synthesizer.
FIG. 6 is a flowchart of an editing process.
FIG. 7 is a flowchart of a speech change setting process.
FIG. 8 is a specific example of calculation of a speech change index.
FIG. 9 is an explanatory diagram of changes in the score image according to the speech change index.
FIG. 10 is a schematic diagram of an editing image in a second embodiment.
FIG. 11 is a flowchart of a speech synthesis process in a third embodiment.
FIG. 12 is a schematic diagram of synthesis information in a fourth embodiment.

<First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 of the first embodiment is a signal processing device that generates a speech signal V of a singing voice for an arbitrary piece of music (hereinafter referred to as the “synthesized music”) by concatenative speech synthesis, which joins a plurality of speech segments. The generated speech signal V reproduces the phenomenon (speech change) in which each phoneme changes under the influence of the preceding and following phonemes when a plurality of phonemes are pronounced in succession. Of the various speech changes, which include assimilation and linking, the first embodiment illustrates generation of a speech signal V that reproduces dropout, in which the pronunciation of a specific phoneme is omitted.

As illustrated in FIG. 1, the speech synthesizer 100 is realized by a computer system (for example, an information processing device such as a mobile phone or a personal computer) comprising an arithmetic processing device 10, a storage device 12, a display device 14, an input device 16, and a sound emitting device 18. The display device 14 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 10. The input device 16 is an operating device (for example, a keyboard or a pointing device such as a mouse) that the user operates to give various instructions to the speech synthesizer 100, and includes, for example, a plurality of controls operated by the user. A touch panel integrated with the display device 14 may also be used as the input device 16. The sound emitting device 18 (for example, a speaker or headphones) reproduces sound corresponding to the speech signal V.

The storage device 12 stores the program executed by the arithmetic processing device 10 and various data used by the arithmetic processing device 10. Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media may be used as the storage device 12. The storage device 12 of the first embodiment stores a speech segment group L, synthesis information S, and reference information R, as described below.

The speech segment group L is a set of speech segments (a speech library) collected in advance from recorded speech of a specific speaker. Each speech segment is either a single phoneme (for example, a vowel or a consonant), which is the minimum unit that distinguishes linguistic meaning, or a phoneme chain in which a plurality of phonemes are joined. In the first embodiment, a phoneme chain joining two consecutive phonemes (a diphone) is used as the speech segment. Each speech segment is represented as a sample sequence of a speech waveform in the time domain, or as a time series of frequency-domain spectra calculated for each frame of the speech waveform. In the following description, silence is treated as one phoneme for convenience and is written with the symbol “Sil”.
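To make the diphone segmentation concrete, the following sketch lists the two-phoneme chains needed for a phoneme sequence. The function name `to_diphones`, the "a-b" label format, and the padding with “Sil” at both ends are assumptions made for illustration; the text states only that silence is treated as a phoneme and that each segment joins two consecutive phonemes.

```python
from typing import List


def to_diphones(phonemes: List[str], pad_silence: bool = True) -> List[str]:
    """Return labels of consecutive two-phoneme chains (diphones).

    Assumption: the utterance is padded with the silence phoneme "Sil"
    at both ends so that onset and release segments are included.
    """
    seq = ["Sil", *phonemes, "Sil"] if pad_silence else list(phonemes)
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]


# "out" ([aU][t]) followed by "door" ([dh][O@])
diphones = to_diphones(["aU", "t", "dh", "O@"])
```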

The synthesis information S specifies the speech to be synthesized (hereinafter referred to as the “synthesis target speech”). The synthesis information S of the first embodiment is time-series data that specifies the singing voice of the synthesized music as the synthesis target speech and, as illustrated in FIG. 2, specifies a pitch X1, a sounding period X2, and pronunciation content X3 for each speech unit (each note) of the synthesis target speech. A speech unit is the unit of speech assigned to one note of the synthesized music (a segmental unit such as a syllable or a mora).

The pitch X1 is the pitch of each note of the synthesized music and is specified, for example, by a note number conforming to the MIDI (Musical Instrument Digital Interface) standard. The sounding period X2 is the time length (note value) of the note and is defined, for example, by a sounding start time T1 and a duration (time length) T2. The duration T2 of the first embodiment is expressed as a tempo-style value (BPM: Beats Per Minute) with the quarter note as the reference value (T2 = 120). For example, the duration T2 corresponding to an eighth note is set to 240, and the duration T2 corresponding to a half note is set to 60. That is, the longer the sounding period X2, the smaller the value of the duration T2. A configuration in which the sounding period X2 is defined by the sounding start time T1 and an end time (so that the duration T2 can be calculated from the time length between the two) is also suitable. The pronunciation content X3 is the content pronounced by the synthesis target speech and includes a speech code (pronunciation character) YA representing the speech unit and the phoneme symbol YB of each phoneme constituting that speech unit. The speech code YA corresponds to a character (grapheme) of the lyrics of the synthesized music. As understood from the above description, the synthesis information S can also be described as data specifying the score of the synthesized music.
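The per-note fields and the inverse relation between note length and T2 can be sketched as below. The class name `Note`, its field names, and the helper `duration_value` are hypothetical; only the symbols X1, T1, T2, YA, YB and the reference values (quarter note 120, eighth note 240, half note 60) come from the text.

```python
from dataclasses import dataclass
from typing import List

QUARTER_NOTE_T2 = 120.0  # reference value for a quarter note, as stated in the text


def duration_value(quarter_notes: float) -> float:
    """T2 for a note lasting `quarter_notes` quarter notes.

    Shorter notes give larger values, matching the stated examples.
    """
    if quarter_notes <= 0:
        raise ValueError("note length must be positive")
    return QUARTER_NOTE_T2 / quarter_notes


@dataclass
class Note:
    pitch: int           # X1: MIDI note number
    start: float         # T1: sounding start time
    duration: float      # T2: tempo-style duration value
    lyric: str           # YA: speech code (grapheme)
    phonemes: List[str]  # YB: phoneme symbols in X-SAMPA style


# An eighth note carrying the lyric "out"
note = Note(pitch=64, start=0.0, duration=duration_value(0.5),
            lyric="out", phonemes=["aU", "t"])
```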

The reference information R is data used to determine whether a speech change (phoneme dropout) occurs in the synthesis target speech. As illustrated in FIG. 3, it specifies a pre-conversion pattern QA, a post-conversion pattern QB, and a basic index B for each sequence of two consecutive phonemes (hereinafter referred to as a “phoneme pair”). In other words, the reference information R is a data table that associates the pre-conversion pattern QA, the post-conversion pattern QB, and the basic index B with one another for each phoneme pair. The reference information R of the first embodiment specifies the pre-conversion pattern QA, the post-conversion pattern QB, and the basic index B for those phoneme pairs, among all combinations of two phonemes, in which a speech change (dropout) can occur in the language of the synthesis target speech.

The pre-conversion pattern QA specifies the phoneme pair before a speech change occurs, and the post-conversion pattern QB specifies the phoneme pair after the speech change occurs (that is, the result of applying the speech change to the phoneme pair specified by the pre-conversion pattern QA). For example, when “out door”, in which “out” ([aU][t]) is followed by “door” ([dh][O@]), is pronounced, a speech change (dropout) can occur in the final phoneme [t] of “out”, so that the phrase is pronounced [aU][dh][O@]. Accordingly, as illustrated in FIG. 3, the phoneme pair [t-dh], consisting of the final phoneme [t] of “out” and the initial phoneme [dh] of “door”, is specified as the pre-conversion pattern QA, and the phoneme pair [Sil-dh], obtained by converting the phoneme [t] of [t-dh] into the silent phoneme [Sil], is specified as the post-conversion pattern QB. Similarly, the pre-conversion pattern QA of the phoneme pair [p-th] contained in, for example, “keep talk” ([kh][i:][p][th][o:][k]) corresponds to the post-conversion pattern QB of the phoneme pair [Sil-th], in which the phoneme [p] is converted into the silent phoneme [Sil]; and the pre-conversion pattern QA of the phoneme pair [k-ph] contained in, for example, “pick part” ([ph][I][k][ph][Q@][t]) corresponds to the post-conversion pattern QB of the phoneme pair [Sil-ph], in which the phoneme [k] is converted into the silent phoneme [Sil].
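The mapping from pre-conversion pattern QA to post-conversion pattern QB can be illustrated with a small lookup. The names `CONVERSIONS` and `apply_dropout` are hypothetical, and this sketch converts every matching pair unconditionally purely to show the patterns; in the described apparatus the conversion is applied only to pairs selected according to the speech change index.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# QA -> QB for the three phoneme pairs named in the text.
CONVERSIONS: Dict[Pair, Pair] = {
    ("t", "dh"): ("Sil", "dh"),
    ("p", "th"): ("Sil", "th"),
    ("k", "ph"): ("Sil", "ph"),
}


def apply_dropout(phonemes: List[str]) -> List[str]:
    """Replace each pair matching a QA pattern with its QB pattern."""
    out = list(phonemes)
    for i in range(len(out) - 1):
        pair = (out[i], out[i + 1])
        if pair in CONVERSIONS:
            out[i], out[i + 1] = CONVERSIONS[pair]
    return out


out_door = apply_dropout(["aU", "t", "dh", "O@"])
keep_talk = apply_dropout(["kh", "i:", "p", "th", "o:", "k"])
```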

The basic index B in FIG. 3 is an index of how readily the speech change from the pre-conversion pattern QA to the post-conversion pattern QB occurs. In the first embodiment, the more readily a phoneme pair (pre-conversion pattern QA) undergoes a speech change, the larger the value to which its basic index B is set. Each basic index B is set experimentally or statistically by examining speech change tendencies in a large body of speech recorded in advance. For example, the phoneme [t] of the phoneme pair [t-dh] tends to drop more readily than the phoneme [k] of the phoneme pair [k-ph], and the phoneme [k] of the phoneme pair [k-ph] tends to drop more readily than the phoneme [p] of the phoneme pair [p-th]. Reflecting these tendencies, in the example of FIG. 3 the basic index B of the phoneme pair [t-dh] exceeds that of the phoneme pair [k-ph], and the basic index B of the phoneme pair [k-ph] exceeds that of the phoneme pair [p-th].
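A table of this kind could be held as a dictionary keyed by the pre-conversion pair. The numeric values of B below are placeholders chosen only to respect the stated ordering ([t-dh] > [k-ph] > [p-th]); the actual values in FIG. 3 are not reproduced here, and the names `ReferenceEntry`, `REFERENCE`, and `basic_index` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]


@dataclass(frozen=True)
class ReferenceEntry:
    post: Pair          # QB: pair after the speech change
    basic_index: float  # B: larger means the change occurs more readily


# Keys are QA patterns. The B values are illustrative placeholders.
REFERENCE: Dict[Pair, ReferenceEntry] = {
    ("t", "dh"): ReferenceEntry(("Sil", "dh"), 80.0),
    ("k", "ph"): ReferenceEntry(("Sil", "ph"), 60.0),
    ("p", "th"): ReferenceEntry(("Sil", "th"), 40.0),
}


def basic_index(pair: Pair) -> Optional[float]:
    """Return B for a registered pair, or None if no dropout is registered."""
    entry = REFERENCE.get(pair)
    return entry.basic_index if entry is not None else None
```

Returning `None` for unregistered pairs mirrors the statement that only pairs in which dropout can occur are listed in the reference information.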

By executing the program stored in the storage device 12, the arithmetic processing device 10 (CPU) in FIG. 1 realizes a plurality of functions for editing the synthesis information S and generating the speech signal V (an instruction receiving unit 22, a display control unit 24, an index calculation unit 26, an information management unit 32, and a speech synthesis unit 34). A configuration in which the functions of the arithmetic processing device 10 are distributed over a plurality of devices, or in which a dedicated electronic circuit (for example, a DSP) realizes some of the functions of the arithmetic processing device 10, may also be adopted.

The display control unit 24 causes the display device 14 to display various images. Specifically, the display control unit 24 of the first embodiment causes the display device 14 to display the editing image 40 of FIG. 4, with which the user checks and edits the content of the synthesized music specified by the synthesis information S. As illustrated in FIG. 4, the editing image 40 of the first embodiment includes a score image 42 representing the content of the synthesized music.

The score image 42 is a piano-roll image in which note images 44, each representing a note specified by the synthesis information S, are arranged in time series on a coordinate plane defined by a time axis (horizontal axis) and a pitch axis (vertical axis) that intersect each other. The position of each note image 44 along the pitch axis is set according to the pitch X1 specified by the synthesis information S, and its position and display length along the time axis are set according to the sounding period X2 specified by the synthesis information S. Each note image 44 is labeled with the speech code YA and the phoneme symbols YB of the pronunciation content X3 specified by the synthesis information S. The position at which the speech code YA and the phoneme symbols YB are displayed is not limited to the example of FIG. 4 (inside the note image 44).
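The piano-roll layout rule can be sketched as a mapping from a note to a rectangle. Everything numeric here is an assumption: the scale factors `px_per_sec` and `row_height`, the top row at MIDI note 127, and the use of seconds for the start time and length are illustrative choices, as are the names `Rect` and `note_rect`.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float


def note_rect(pitch: int, start_sec: float, length_sec: float,
              px_per_sec: float = 100.0, row_height: float = 10.0,
              top_pitch: int = 127) -> Rect:
    """Place a note image on a piano roll.

    Horizontal position and width follow the sounding period (X2);
    vertical position follows the pitch (X1), higher pitches nearer the top.
    """
    return Rect(
        x=start_sec * px_per_sec,
        y=(top_pitch - pitch) * row_height,
        width=length_sec * px_per_sec,
        height=row_height,
    )


rect = note_rect(pitch=60, start_sec=1.0, length_sec=0.5)
```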

The instruction receiving unit 22 in FIG. 1 receives instructions that the user gives by operating the input device 16. For example, by operating the input device 16 as appropriate while checking the editing image 40, the user can instruct the speech synthesizer 100 to edit the synthesis information S. The instruction receiving unit 22 receives such instructions to edit the synthesis information S from the user.

The index calculation unit 26 calculates a speech change index A for the phonemes corresponding to the pronunciation content X3 specified by the synthesis information S. The speech change index A is an index of how readily a speech change occurs. In the first embodiment, the more readily a phoneme pair undergoes a speech change, the larger the value of its speech change index A. Within the time series of phonemes specified by the pronunciation content X3, the index calculation unit 26 of the first embodiment calculates the speech change index A for each phoneme pair that matches any of the pre-conversion patterns QA registered in the reference information R.
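Locating the phoneme pairs for which an index A is calculated amounts to scanning consecutive pairs and keeping those registered as a pattern QA. The sketch below assumes the phonemes of all notes have already been flattened into one list; the function name `find_candidate_pairs` and the returned (position, pair) format are illustrative.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def find_candidate_pairs(phonemes: List[str],
                         registered: Iterable[Pair]) -> List[Tuple[int, Pair]]:
    """Return (index of the front phoneme, pair) for each registered QA match."""
    patterns: Set[Pair] = set(registered)
    return [
        (i, (a, b))
        for i, (a, b) in enumerate(zip(phonemes, phonemes[1:]))
        if (a, b) in patterns
    ]


qa_patterns = [("t", "dh"), ("p", "th"), ("k", "ph")]
matches = find_candidate_pairs(["aU", "t", "dh", "O@"], qa_patterns)
```

The returned position identifies the front phoneme of the pair, which in the dropout example is the phoneme that may be omitted.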

As described above, the basic index B specified by the reference information R is an index of how readily the speech change from the pre-conversion pattern QA to the post-conversion pattern QB occurs. The index calculation unit 26 of the first embodiment therefore calculates the speech change index A of each phoneme pair that matches a pre-conversion pattern QA, within the time series of phonemes of the synthesis target speech specified by the synthesis information S, according to the basic index B that the reference information R associates with that pre-conversion pattern QA. Specifically, the index calculation unit 26 calculates the speech change index A so that it takes a larger value as the basic index B is larger (that is, as the phoneme pair undergoes a speech change more readily).

When a plurality of phonemes are pronounced in quick succession (when the duration of each phoneme is short), voice changes generally tend to occur more readily for those phonemes. In view of this tendency, the index calculation unit 26 of the first embodiment calculates the voice change index A of each phoneme pair of the synthesis target speech according to the duration T2 of the pronunciation period X2 that the synthesis information S specifies for the note corresponding to that phoneme pair. Specifically, the index calculation unit 26 calculates the voice change index A according to the duration T2 of the note to which the front phoneme of the two phonemes in the matching pair (that is, the phoneme subject to dropping) is assigned. For example, the shorter the duration T2 is in time (the larger the numerical value of the duration T2), the larger the value set for the voice change index A.

There is also a general tendency for voice changes to occur more readily as the pitch of the voice becomes higher. In view of this tendency, the index calculation unit 26 of the first embodiment calculates the voice change index A of each phoneme pair of the synthesis target speech according to the pitch X1 that the synthesis information S specifies for the note corresponding to that phoneme pair. Specifically, the index calculation unit 26 calculates the voice change index A according to the pitch X1 of the note to which the front phoneme of the two phonemes in the matching pair is assigned. For example, the higher the pitch X1, the larger the value set for the voice change index A.

Specifically, the index calculation unit 26 of the first embodiment calculates the voice change index A by the following equation (1), which uses the basic index B, the duration T2, and the pitch X1.
A = B + (T2 × 0.2) + (X1 × 0.1) ... (1)
That is, the weighted sum of the basic index B, the duration T2, and the pitch X1 is calculated as the voice change index A. The weight of the basic index B exceeds the weights of the duration T2 and the pitch X1. In other words, the basic index B is reflected in the voice change index A with priority over the duration T2 and the pitch X1.
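As a minimal sketch, equation (1) can be written as a small Python function. The function and parameter names are illustrative assumptions and do not appear in the specification; only the weights 0.2 and 0.1 are taken from equation (1).

```python
def voice_change_index(basic_index_b: float, duration_t2: float, pitch_x1: float) -> float:
    """Weighted sum of equation (1): A = B + (T2 x 0.2) + (X1 x 0.1)."""
    # B carries weight 1, the largest of the three weights.
    return basic_index_b + duration_t2 * 0.2 + pitch_x1 * 0.1


# Changing one input at a time shows the relative weights: a change of 10 in B
# moves A further than the same change in T2 or X1.
print(voice_change_index(10, 0, 0))
print(voice_change_index(0, 10, 0))
print(voice_change_index(0, 0, 10))
```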

The information management unit 32 in FIG. 1 manages the synthesis information S stored in the storage device 12. For example, the information management unit 32 updates the synthesis information S according to editing instructions that the instruction receiving unit 22 receives from the user. The information management unit 32 of the first embodiment also sets the occurrence of voice changes in the synthesis information S according to the voice change index A calculated by the index calculation unit 26. Specifically, when the voice change index A of a phoneme pair that matches a pre-conversion pattern QA within the time series of phonemes of the synthesis target speech exceeds a predetermined threshold TH, the information management unit 32 changes that phoneme pair to the phoneme pair of the post-conversion pattern QB that the reference information R associates with that pre-conversion pattern QA. In the example of FIG. 3, when the voice change index A of the phoneme pair [t-dh] contained in the synthesis target speech exceeds the threshold TH, the phoneme pair [t-dh] is changed to the phoneme pair [Sil-dh] of the post-conversion pattern QB. That is, converting the phoneme [t] to the silent phoneme [Sil] reproduces the dropping of a phoneme (a voice change).

The speech synthesis unit 34 in FIG. 1 generates a speech signal V using the speech unit group L and the synthesis information S stored in the storage device 12. Specifically, the speech synthesis unit 34 sequentially selects, from the speech unit group L, the speech units corresponding to the pronunciation content X3 (speech code YA and phoneme symbols YB) of each note specified by the synthesis information S, adjusts each speech unit to the pitch X1 and the pronunciation period X2 specified by the synthesis information S, and concatenates the adjusted units to generate the speech signal V. Supplying the speech signal V generated by the speech synthesis unit 34 to the sound emitting device 18 reproduces the singing voice of the synthesized music.

FIG. 5 is a flowchart of the general operation of the speech synthesis apparatus 100 of the first embodiment. The processing of FIG. 5 starts in response to a user instruction to the input device 16. When the processing starts, the display control unit 24 causes the display device 14 to display the edited image 40 of FIG. 4 according to the synthesis information S stored in the storage device 12 (SA1). The instruction receiving unit 22 then determines whether an editing instruction has been received from the user (SA2). When an editing instruction has been received (SA2: YES), editing processing corresponding to the content of the instruction is executed (SA3). The editing processing includes the updating of the edited image 40 by the display control unit 24 and the updating of the synthesis information S by the information management unit 32. When no editing instruction has been received (SA2: NO), the editing processing is not executed.

When the above processing is complete, the instruction receiving unit 22 determines whether an instruction for speech synthesis (generation of the speech signal V) has been received from the user (SA4). When such an instruction has been received (SA4: YES), the speech synthesis unit 34 generates the speech signal V by speech synthesis processing that uses the speech unit group L and the synthesis information S (SA5). When speech synthesis has not been instructed (SA4: NO), the speech synthesis processing is not executed. The instruction receiving unit 22 also determines whether an instruction to end the processing has been received from the user (SA6). When the end of processing has not been instructed (SA6: NO), the processing returns to step SA1 and the subsequent steps are repeated; when it has been instructed (SA6: YES), the processing of FIG. 5 ends.

FIG. 6 is a flowchart of the specific content of the editing processing (SA3). When the editing processing starts, the instruction receiving unit 22 determines whether an editing instruction for a note image 44 of the score image 42 has been received from the user (SB1). When an editing instruction for a note image 44 has been received (SB1: YES), the display control unit 24 updates the edited image 40 and the information management unit 32 updates the synthesis information S according to the content of the instruction (SB2). For example, when the addition of a note image 44 to the score image 42 is instructed, the display control unit 24 adds a note image 44 at the position in the score image 42 indicated by the user, and the information management unit 32 adds the information (X1 to X3) of the note indicated by the user to the synthesis information S. When the movement of an existing note image 44 is instructed, the display control unit 24 changes the position of the note image 44 according to the user's instruction, and the information management unit 32 changes the pitch X1 and the pronunciation period X2 (start time T1) of the note being edited according to that instruction. When a change in the display length of a note image 44 along the time axis is instructed, the display control unit 24 changes the display length of the note image 44 according to the user's instruction, and the information management unit 32 changes the pronunciation period X2 (duration T2) of the note being edited according to that instruction. When a change of the speech code YA of a note is instructed, the display control unit 24 changes the speech code YA of the note being edited in the synthesis information S according to the user's instruction and updates the phoneme symbols YB according to the changed speech code YA, and the display control unit 24 places the speech code YA and the phoneme symbols YB corresponding to the changed pronunciation content X3 near the note image 44.

The user can also instruct the application of voice changes by operating the input device 16 as appropriate. When the instruction receiving unit 22 receives an instruction to apply voice changes from the user (SB3: YES), the voice change setting processing of FIG. 7, which sets voice changes in the synthesis information S according to the voice change index A, is executed (SB4).

When the voice change setting processing starts, the information management unit 32 selects one phoneme pair (hereinafter the "selected phoneme pair") from the time series of phonemes of the synthesis target speech specified by the synthesis information S (the phonemes designated by the phoneme symbols YB of each pronunciation content X3) (SC1). The information management unit 32 determines whether the selected phoneme pair matches any of the pre-conversion patterns QA registered in the reference information R (SC2). That is, it determines whether the selected phoneme pair is one in which a voice change can occur.

When the selected phoneme pair matches one of the pre-conversion patterns QA (SC2: YES), the index calculation unit 26 calculates the voice change index A of the selected phoneme pair by equation (1) above (SC3). The calculation uses the basic index B corresponding to the matching pre-conversion pattern QA, together with the pitch X1 and the duration T2 of the note corresponding to the front phoneme of the two phonemes in the selected phoneme pair.

The information management unit 32 determines whether the voice change index A calculated by the index calculation unit 26 exceeds the threshold TH (SC4). That is, whether to generate a voice change in the selected phoneme pair is determined according to the voice change index A. When the voice change index A exceeds the threshold TH (SC4: YES), the information management unit 32 sets the occurrence of a voice change for the selected phoneme pair in the synthesis information S (SC5). Specifically, it converts the phoneme symbol YB of each phoneme of the selected phoneme pair in the synthesis information S into the phoneme symbol YB of the corresponding phoneme specified by the post-conversion pattern QB associated with the pre-conversion pattern QA.

When the voice change index A is below the threshold TH (SC4: NO), the information management unit 32 does not execute step SC5. That is, no voice change is set in the synthesis information S for the selected phoneme pair. When the selected phoneme pair matches none of the pre-conversion patterns QA (SC2: NO), that is, when no voice change can occur for the selected phoneme pair, neither the calculation of the voice change index A (SC3) nor the conversion of the selected phoneme pair (SC4, SC5) is executed.

For example, consider the lyrics "out door" and "keep talk" illustrated in FIG. 8. The basic index B of the phoneme pair [t-dh] contained in "out door" is 80, and the basic index B of the phoneme pair [p-th] contained in "keep talk" is 20. Suppose that the note of the phoneme [t] in the pair [t-dh] has a duration T2 of 240 (an eighth note) and a pitch X1 of 50, and that the note of the phoneme [p] in the pair [p-th] has a duration T2 of 60 (a half note) and a pitch X1 of 50. The voice change index A of the pair [t-dh] in "out door" is then 133 (= 80 + (240 × 0.2) + (50 × 0.1)), and the voice change index A of the pair [p-th] in "keep talk" is 37 (= 20 + (60 × 0.2) + (50 × 0.1)). Therefore, assuming for example that the threshold TH is 110, a voice change to the phoneme pair [Sil-dh] specified by the post-conversion pattern QB is set for the pair [t-dh] of "out door" (SC4: YES), whereas no voice change is set for the pair [p-th] of "keep talk" (SC4: NO).
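The arithmetic of this example can be checked term by term with a short Python sketch. The variable and function names are illustrative assumptions; the numbers are those of the FIG. 8 example, with the threshold TH assumed to be 110 as in the text.

```python
THRESHOLD_TH = 110  # threshold assumed in the example above

# (lyric, phoneme pair, basic index B, duration value T2, pitch X1)
EXAMPLES = [
    ("out door", "t-dh", 80, 240, 50),
    ("keep talk", "p-th", 20, 60, 50),
]


def index_terms(b, t2, x1):
    """Return the three terms of equation (1) and their sum A."""
    duration_term = t2 * 0.2
    pitch_term = x1 * 0.1
    return b, duration_term, pitch_term, b + duration_term + pitch_term


for lyric, pair, b, t2, x1 in EXAMPLES:
    base, dur, pit, a = index_terms(b, t2, x1)
    # SC4: a voice change is set only when A exceeds TH.
    decision = "voice change set" if a > THRESHOLD_TH else "no voice change"
    print(f"{lyric} [{pair}]: A = {base} + {dur:g} + {pit:g} = {a:g} -> {decision}")
```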

After executing the above processing, the information management unit 32 determines whether steps SC1 to SC5 have been executed for all phoneme pairs contained in the time series of phonemes of the synthesis target speech (SC6). When an unprocessed phoneme pair remains (SC6: NO), the information management unit 32 returns the processing to step SC1, selects the phoneme pair immediately following the current selected phoneme pair as the new selected phoneme pair, and executes the subsequent steps (SC2-SC6). When all phoneme pairs have been processed (SC6: YES), the display control unit 24 updates the score image 42 displayed on the display device 14 to reflect the content after the voice changes have been set (SC7).
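The loop of steps SC1 to SC6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the apparatus: the `Phoneme` class, the `REFERENCE` table layout, and the function name are hypothetical, and the post-conversion pattern shown for [p-th] is assumed purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Phoneme:
    symbol: str      # phoneme symbol YB
    pitch: float     # pitch X1 of the note the phoneme is assigned to
    duration: float  # duration value T2 of that note


# Hypothetical layout of the reference information R:
# pre-conversion pattern QA -> (post-conversion pattern QB, basic index B).
REFERENCE = {
    ("t", "dh"): (("Sil", "dh"), 80),
    ("p", "th"): (("Sil", "th"), 20),  # QB for [p-th] is an assumption
}


def set_voice_changes(phonemes, reference, threshold):
    """Return a copy of the phoneme sequence with voice changes applied (SC1-SC6)."""
    result = [Phoneme(p.symbol, p.pitch, p.duration) for p in phonemes]
    for i in range(len(result) - 1):                    # SC1 / SC6: each adjacent pair in order
        pair = (result[i].symbol, result[i + 1].symbol)
        if pair not in reference:                       # SC2: NO
            continue
        post_pair, base = reference[pair]
        front = result[i]                               # front phoneme of the pair
        a = base + front.duration * 0.2 + front.pitch * 0.1   # SC3, equation (1)
        if a > threshold:                               # SC4: YES
            result[i].symbol, result[i + 1].symbol = post_pair  # SC5
    return result


out_door = [Phoneme("aU", 50, 240), Phoneme("t", 50, 240),
            Phoneme("dh", 50, 240), Phoneme("O@", 50, 240)]
keep_talk_fragment = [Phoneme("p", 50, 60), Phoneme("th", 50, 60)]

print([p.symbol for p in set_voice_changes(out_door, REFERENCE, 110)])
print([p.symbol for p in set_voice_changes(keep_talk_fragment, REFERENCE, 110)])
```

Returning a modified copy rather than editing the input in place is a design choice of this sketch; it keeps the original symbols available, which is convenient if the change is later cancelled.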

FIG. 9 illustrates how the score image 42 changes in conjunction with the setting of a voice change. FIG. 9 assumes that "out" and "door" are assigned to two notes (N1, N2). Part (A) of FIG. 9 shows the state before the voice change is set. Specifically, the speech code "out" and the phoneme symbols [aU t] are displayed together with the note image 44 of the note N1, and the speech code "door" and the phoneme symbols [dh O@] are displayed together with the note image 44 of the note N2. When the user instructs the application of voice changes (SB3: YES) and the voice change index A of the phoneme pair [t-dh] of "out door" exceeds the threshold TH as in the example above (SC4: YES), the phoneme symbols [aU t] of "out" on the note N1 are changed to [aU Sil], in which the phoneme [t] has been converted to the silent phoneme [Sil], as illustrated in part (B) of FIG. 9 (SC7). The user can therefore grasp the voice change visually. Applying the synthesis information S obtained after the voice change setting processing described above to the speech synthesis processing of the speech synthesis unit 34 generates a speech signal V that reproduces the voice change (dropping). A phoneme symbol YB changed by a voice change may also be highlighted (for example, underlined).

In the editing processing of FIG. 6, the user can also instruct the cancellation of voice changes by operating the input device 16 as appropriate. When the instruction receiving unit 22 receives an instruction to cancel voice changes from the user (SB5: YES), the voice changes set in the synthesis information S are cancelled (SB6). Specifically, the information management unit 32 converts the phoneme symbol YB of each phoneme designated by the pronunciation content X3 of the synthesis information S back to the initial content corresponding to the speech code YA (that is, the phoneme symbol YB to which no voice change has been applied), and the display control unit 24 updates the edited image 40 displayed on the display device 14 to the content corresponding to the synthesis information S after cancellation (that is, content in which the phoneme symbols YB of each note have been initialized). When cancellation of voice changes is not instructed (SB5: NO), the editing processing ends.
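The cancellation step (SB6) amounts to regenerating each note's phoneme symbols YB from its speech code YA. A minimal sketch follows, assuming a hypothetical lookup table `LEXICON` from speech codes to initial phoneme symbols and a dictionary layout for notes; neither is prescribed by the specification.

```python
# Hypothetical pronunciation lexicon: speech code YA -> initial phoneme symbols YB.
LEXICON = {"out": ["aU", "t"], "door": ["dh", "O@"]}


def cancel_voice_changes(notes, lexicon):
    """Reset every note's phoneme symbols YB to the initial content for its speech code YA (SB6)."""
    for note in notes:
        note["YB"] = list(lexicon[note["YA"]])
    return notes


notes = [
    {"YA": "out", "YB": ["aU", "Sil"]},   # [t] was dropped by the voice change setting
    {"YA": "door", "YB": ["dh", "O@"]},
]
cancel_voice_changes(notes, LEXICON)
print(notes[0]["YB"])
```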

As described above, in the first embodiment, a voice change index A indicating how likely a voice change is to occur is calculated for each phoneme pair of the synthesis target speech specified by the synthesis information S, and whether a voice change is applied to that phoneme pair is controlled according to the voice change index A. Compared with the technique of Patent Document 1, in which the presence or absence of a voice change is decided uniformly, based solely on whether each phoneme satisfies a predetermined condition, the first embodiment can therefore realize natural voice changes that match the tendencies of actual pronunciation.

For example, in the first embodiment the voice change index A is calculated according to the basic index B set individually for each phoneme pair. It is therefore possible to realize natural voice changes that match the tendency of actual pronunciation in which the likelihood of a voice change differs from one phoneme pair to another (some phonemes are prone to voice changes and others are not). In addition, because the voice change index A is calculated according to the duration T2 of the speech unit (note), the tendency of actual pronunciation in which voice changes occur more readily when phonemes are pronounced in quick succession can be reproduced. Because the voice change index A is also calculated according to the pitch X1 of the speech unit, the tendency for voice changes to occur more readily at higher pitches is reproduced as well.

<Second Embodiment>
A second embodiment of the present invention is described below. In the first embodiment, the threshold TH compared with the voice change index A is fixed at a predetermined value. In the second embodiment, by contrast, the threshold TH is variably controlled. In each of the embodiments described below, elements whose operation and function are the same as in the first embodiment are denoted by the reference signs used in the description of the first embodiment, and their detailed description is omitted as appropriate.

FIG. 10 is a schematic diagram of the edited image 40 in the second embodiment. As illustrated in FIG. 10, the edited image 40 of the second embodiment includes a score image 42 similar to that of the first embodiment and an operation image 46 for receiving instructions from the user. In FIG. 10, an image of a control (slider) that the user can operate is shown as the operation image 46. The user can specify an arbitrary threshold TH by operating the operation image 46 with the input device 16. The instruction receiving unit 22 receives the user's specification of the threshold TH. That is, the threshold TH of the second embodiment is variably set according to an instruction from the user. In step SC4 of the voice change setting processing (SB4), the information management unit 32 determines whether to generate a voice change in the selected phoneme pair by comparing the voice change index A calculated by the index calculation unit 26 with the threshold TH that corresponds to the instruction received from the user by the instruction receiving unit 22.

As understood from the above description, the larger the threshold TH specified by the user, the less likely voice changes are to occur in the synthesis target speech. Specifically, in the example of FIG. 8 described above, when the threshold TH is set to 150, no voice change is applied to either the phoneme pair [t-dh] or the phoneme pair [p-th]; when the threshold TH is set to 110, a voice change is applied to the phoneme pair [t-dh] only; and when the threshold TH is set to 30, a voice change is applied to both the phoneme pair [t-dh] and the phoneme pair [p-th].
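The effect of the user-specified threshold can be illustrated with the indices from the FIG. 8 example (A = 133 for [t-dh] and A = 37 for [p-th]). The sketch below uses hypothetical names; the three threshold values are those discussed in the text.

```python
# Voice change indices A from the FIG. 8 example.
INDICES = {"t-dh": 133, "p-th": 37}


def pairs_changed(indices, threshold):
    """Return the phoneme pairs whose voice change index A exceeds the threshold TH."""
    return [pair for pair, a in indices.items() if a > threshold]


# Lowering TH (for example by moving the slider) lets more pairs change.
for th in (150, 110, 30):
    print(th, pairs_changed(INDICES, th))
```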

The second embodiment achieves the same effects as the first embodiment. In addition, because the threshold TH applied to the determination of whether a voice change occurs is variably controlled, the frequency of voice changes in the speech signal V can be controlled in a variety of ways. In the second embodiment in particular, the threshold TH is set according to an instruction from the user, which has the advantage that voice changes suited to the user's intention and preference can be realized.

<Third Embodiment>
In the first embodiment, voice changes are set in the synthesis information S before the speech synthesis processing is executed. In the third embodiment, by contrast, voice changes are reflected in the speech signal V during the speech synthesis processing. That is, in the third embodiment voice changes are not reflected in the synthesis information S. Accordingly, the processing for setting and cancelling voice changes (SB3-SB6) is omitted from the editing processing (FIG. 6) of the third embodiment.

FIG. 11 is a flowchart of the speech synthesis processing (SA5) that the speech synthesis unit 34 of the third embodiment executes to generate the speech signal V. When the speech synthesis processing starts, the speech synthesis unit 34 selects the phoneme pairs corresponding to the pronunciation content X3 specified by the synthesis information S as the selected phoneme pair, in order from the beginning of the synthesized music (SD1). The speech synthesis unit 34 then determines whether the selected phoneme pair matches any of the pre-conversion patterns QA registered in the reference information R (SD2).

When the selected phoneme pair matches one of the pre-conversion patterns QA (SD2: YES), the index calculation unit 26 calculates the voice change index A of the selected phoneme pair by equation (1) above, as in the first embodiment (SD3). The speech synthesis unit 34 determines whether the voice change index A calculated by the index calculation unit 26 exceeds the threshold TH (SD4). That is, as in the first embodiment, whether to generate a voice change in the selected phoneme pair is determined according to the voice change index A.

When the voice change index A is below the threshold TH (SD4: NO), that is, when no voice change is to be generated for the selected phoneme pair, the speech synthesis unit 34 selects from the speech unit group L the speech unit corresponding to the selected phoneme pair taken from the synthesis target phonemes (SD5). When the voice change index A exceeds the threshold TH (SD4: YES), that is, when a voice change is to be generated for the selected phoneme pair, the speech synthesis unit 34 selects from the speech unit group L the speech unit corresponding to the phoneme pair of the post-conversion pattern QB instead of the selected phoneme pair (SD6).

For example, suppose that the selected phoneme pair is the phoneme pair [t-dh] contained in "out door". When the voice change index A of the selected phoneme pair is below the threshold TH, the speech synthesis unit 34 selects from the speech unit group L the speech unit corresponding to the phoneme pair [t-dh] (SD5). When the voice change index A exceeds the threshold TH, the speech synthesis unit 34 selects from the speech unit group L the speech unit corresponding to the phoneme pair [Sil-dh] of the post-conversion pattern QB associated with the pair [t-dh] (that is, the phoneme pair in which the phoneme [t] has been dropped) (SD6).

The speech synthesis unit 34 generates the speech signal V using the speech unit selected in step SD5 or step SD6 (SD7). As in the first embodiment, any known technique may be adopted to generate the speech signal V from speech units. As understood from the above description, the speech synthesis unit 34 of the third embodiment functions as an element that determines, according to the voice change index A, whether to generate a voice change in the selected phoneme pair, generates a speech signal V in which the voice change occurs for a selected phoneme pair determined to undergo a voice change (SD6, SD7), and generates a speech signal V in which no voice change occurs for a selected phoneme pair determined not to undergo one (SD5, SD7).
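The unit selection of steps SD1 to SD6 can be sketched as follows. Unlike the first embodiment, the input sequence (standing in for the synthesis information S) is left untouched and only the choice of unit changes. The data layout, the `REFERENCE` table, and the function name are hypothetical, and treating non-matching pairs (SD2: NO) the same way as step SD5 is an assumption of this sketch. How adjacent units are reconciled and concatenated (SD7) is outside its scope.

```python
# Hypothetical layout of the reference information R: QA -> (QB, basic index B).
REFERENCE = {("t", "dh"): (("Sil", "dh"), 80)}


def select_unit_pairs(phonemes, reference, threshold):
    """Choose, for each adjacent phoneme pair, the pair whose speech unit is fetched (SD1-SD6).

    phonemes: list of (symbol, pitch X1, duration value T2) tuples; it is not modified.
    """
    selected = []
    for (sym_a, pitch, duration), (sym_b, _, _) in zip(phonemes, phonemes[1:]):
        pair = (sym_a, sym_b)                            # SD1
        if pair in reference:                            # SD2: YES
            post_pair, base = reference[pair]
            a = base + duration * 0.2 + pitch * 0.1      # SD3, equation (1)
            if a > threshold:                            # SD4: YES
                selected.append(post_pair)               # SD6: unit of the QB pair
                continue
        selected.append(pair)                            # SD5 (assumed for SD2: NO as well)
    return selected


phonemes = [("aU", 50, 240), ("t", 50, 240), ("dh", 50, 240), ("O@", 50, 240)]
print(select_unit_pairs(phonemes, REFERENCE, threshold=110))
```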

As described above, in the third embodiment, the speech change index A is calculated for each phoneme pair specified by the synthesis information S, and whether or not a speech change occurs is controlled according to the speech change index A. Therefore, as in the first embodiment, there is an advantage that natural speech changes matching the tendencies of actual pronunciation can be realized.

<Fourth embodiment>
FIG. 12 is a schematic diagram of the synthesis information S in the fourth embodiment. As illustrated in FIG. 12, the synthesis information S of the fourth embodiment includes control information XC in addition to the same elements as in the first embodiment (pitch X1, pronunciation period X2, pronunciation content X3). The control information XC is a variable that specifies the musical expression of the synthesis target speech. For example, accent, staccato, and volume are specified by the control information XC. The index calculation unit 26 of the fourth embodiment calculates the speech change index A of each phoneme of the synthesis target speech by an operation that applies the variables specified by the control information XC of that phoneme in addition to the basic index B, the duration T2, and the pitch X1 (SC3). The control information XC is set according to an instruction that the instruction receiving unit 22 receives from the user.

For example, users tend to specify an accent or staccato for notes that are meant to be pronounced clearly (that is, notes they want to emphasize). In view of this tendency, a configuration is preferable in which the index calculation unit 26 calculates the speech change index A so that the speech change index A of a phoneme for which an accent or staccato is specified by the control information XC decreases (that is, a speech change becomes less likely to occur). Similarly, users tend to set the volume to a large value for notes that are meant to be pronounced clearly. In view of this tendency, a configuration is preferable in which the index calculation unit 26 calculates the speech change index A so that the speech change index A decreases as the volume specified by the control information XC increases. The application of speech changes using the speech change index A is the same as in the first embodiment.
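The direction of these adjustments can be illustrated with a short Python sketch. The description only states that the speech change index A should decrease for accented or staccato notes and decrease as the volume increases; the multiplicative form, the names `ControlInfo` and `adjust_index`, the parameters `articulation_factor` and `volume_weight`, and the normalization of the volume to the range 0 to 1 are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ControlInfo:
    """Hypothetical container for the control information XC of one note."""
    accent: bool = False
    staccato: bool = False
    volume: float = 0.5  # assumed to be normalized to the range 0..1


def adjust_index(
    base_index: float,
    xc: ControlInfo,
    articulation_factor: float = 0.5,  # assumed attenuation for accent/staccato
    volume_weight: float = 0.5,        # assumed strength of the volume effect
) -> float:
    """Reduce the speech change index A for notes meant to be sung clearly."""
    index_a = base_index
    if xc.accent or xc.staccato:
        # Accent or staccato: a speech change becomes less likely.
        index_a *= articulation_factor
    # Louder notes: the index decreases as the volume increases.
    volume = min(max(xc.volume, 0.0), 1.0)
    index_a *= 1.0 - volume_weight * volume
    return index_a


plain = adjust_index(0.8, ControlInfo())
accented = adjust_index(0.8, ControlInfo(accent=True))
loud = adjust_index(0.8, ControlInfo(volume=1.0))
print(plain, accented, loud)
```

Any other monotonic mapping would satisfy the description equally well; the point of the sketch is only that the accented and loud notes end up with a smaller index than the plain one.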

The fourth embodiment achieves the same effects as the first embodiment. In addition, in the fourth embodiment, the speech change index A is calculated according to the control information XC representing the musical expression of the synthesis target speech, so natural speech changes suited to that musical expression can be applied. Although the above description illustrates a configuration based on the first embodiment, the configuration of the fourth embodiment, in which the control information XC is reflected in the speech change index A, can likewise be applied to the second and third embodiments.

<Modifications>
Each of the above embodiments can be modified in various ways. Specific modifications are illustrated below. Two or more modifications arbitrarily selected from the following examples may be combined as appropriate.

(1) In each of the above embodiments, the speech change index A is calculated according to the basic index B, the duration T2, and the pitch X1, but the variables applied to the calculation of the speech change index A are not limited to these examples. For example, the speech change index A may be calculated according to the tempo of the synthesized music. As noted above, speech changes tend to occur more readily when a plurality of phonemes are pronounced quickly. Therefore, a configuration is preferable in which the speech change index A is set to a larger value (a value at which a speech change occurs more readily) as the tempo of the synthesized music becomes faster. The tempo of the synthesized music is variably specified by the user, for example, through an operation on the input device 16.

It is also possible, for example, to apply only one or two variables selected from the basic index B, the duration T2, and the pitch X1 to the speech change index A. A configuration that applies only the control information XC of the fourth embodiment to the calculation of the speech change index A, or a configuration that applies a combination of the control information XC and one or two variables selected from the basic index B, the duration T2, and the pitch X1, may also be adopted. Furthermore, the method of calculating the speech change index A from these variables is not limited to the example given above (Formula (1)).
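A tempo-dependent index of the kind mentioned in modification (1) might look like the following Python sketch. The description requires only that the speech change index A become larger as the tempo becomes faster; the linear scaling, the function name `tempo_scaled_index`, and the reference tempo of 120 beats per minute are assumptions and do not come from Formula (1).

```python
def tempo_scaled_index(
    base_index: float,
    tempo_bpm: float,
    reference_bpm: float = 120.0,  # assumed tempo at which the index is unchanged
) -> float:
    """Scale the speech change index A so that faster music yields a larger value."""
    if tempo_bpm <= 0 or reference_bpm <= 0:
        raise ValueError("tempo values must be positive")
    # Faster tempo -> ratio above 1 -> larger index (speech change more likely).
    return base_index * (tempo_bpm / reference_bpm)


for bpm in (90.0, 120.0, 180.0):
    print(bpm, tempo_scaled_index(0.5, bpm))
```

The same pattern extends to any subset of the variables listed above: each selected variable contributes one monotonic factor, and variables that are not used are simply left out.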

(2) Each of the above embodiments illustrates the case in which the speech change index A increases as a speech change becomes more likely to occur (and decreases as a speech change becomes less likely to occur), but the relationship between the likelihood of a speech change and the magnitude of the speech change index A may be changed as appropriate. That is, a configuration in which the speech change index A decreases as a speech change becomes more likely to occur (and increases as a speech change becomes less likely to occur) may also be adopted.

(3) In the first embodiment, the phoneme symbol YB of a phoneme in the synthesis information S for which a speech change (elision) is to be generated is replaced with the phoneme symbol YB after the speech change, but any configuration may be used to set the occurrence of a speech change in the synthesis information S. For example, a configuration may be adopted in which auxiliary information indicating a speech change (elision), such as the symbol “_0” indicating devoicing, is appended to the phoneme symbol YB of the phoneme for which the speech change is to be generated. Alternatively, if only elision is considered among the possible speech changes, the phoneme symbol YB of the phoneme for which the speech change is to be generated may simply be deleted.
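The three ways of recording an elision in the synthesis information S described in modification (3) can be compared in a small Python sketch. The function name `mark_elision`, the mode names, and the list-of-strings representation of the phoneme symbols are assumptions; the replacement symbol `Sil` is borrowed from the converted pattern [Sil-dh] of the third embodiment and is also an assumption for the first embodiment.

```python
from typing import List


def mark_elision(phonemes: List[str], position: int, mode: str = "replace") -> List[str]:
    """Return a copy of the phoneme symbols with an elision recorded at `position`."""
    result = list(phonemes)  # leave the caller's list untouched
    if mode == "replace":
        # Replace the symbol with the symbol after the speech change (assumed: Sil).
        result[position] = "Sil"
    elif mode == "annotate":
        # Append auxiliary information; "_0" is the devoicing mark named in the text.
        result[position] = result[position] + "_0"
    elif mode == "delete":
        # Considering elision only, drop the phoneme symbol altogether.
        del result[position]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return result


out_door = ["aU", "t", "dh", "O@"]
for mode in ("replace", "annotate", "delete"):
    print(mode, mark_elision(out_door, 1, mode))
```

For “out door” with the elision at [t], the three modes yield `['aU', 'Sil', 'dh', 'O@']`, `['aU', 't_0', 'dh', 'O@']`, and `['aU', 'dh', 'O@']` respectively.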

(4) In the third embodiment, the speech unit corresponding to the converted pattern QB is selected for a phoneme pair for which a speech change is to be generated, but the method of generating a speech signal V that reflects the presence or absence of a speech change is not limited to this example. For example, the elision of a phoneme can also be reproduced by setting to zero the volume of the phoneme in the phoneme pair for which the speech change (elision) is to be generated.
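The volume-based alternative in modification (4) amounts to zeroing a gain value rather than choosing a different speech unit. The following Python sketch shows the idea on a list of (phoneme, gain) pairs; this representation and the function name `mute_dropped_phonemes` are assumptions, since the description does not specify how the volume of an individual phoneme is held.

```python
from typing import Iterable, List, Tuple


def mute_dropped_phonemes(
    segments: List[Tuple[str, float]],
    dropped: Iterable[int],
) -> List[Tuple[str, float]]:
    """Set the gain of every phoneme whose index is in `dropped` to zero."""
    dropped_set = set(dropped)
    return [
        (phoneme, 0.0 if i in dropped_set else gain)
        for i, (phoneme, gain) in enumerate(segments)
    ]


segments = [("aU", 1.0), ("t", 1.0), ("dh", 1.0), ("O@", 1.0)]
print(mute_dropped_phonemes(segments, [1]))  # the phoneme [t] is silenced
```

Unlike unit substitution, this approach keeps the original phoneme sequence and timing intact, which may be convenient when the decision is made late in the synthesis pipeline.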

(5) In each of the above embodiments, the duration T2 of the note corresponding to the phoneme for which a speech change is to be generated is applied to the calculation of the speech change index A, but the duration (time length) for which that phoneme itself is pronounced within the pronunciation period X2 (duration T2) may instead be applied to the calculation of the speech change index A.

(6) The first embodiment illustrates a speech synthesis apparatus 100 that performs both management of the synthesis information S (display control unit 24 and information management unit 32) and generation of the speech signal V (speech synthesis unit 34), but the present invention may also be specified as a synthesis information management apparatus that manages the synthesis information S. Whether or not the synthesis information management apparatus includes the speech synthesis unit 34 is immaterial. The speech synthesis apparatus 100 or the synthesis information management apparatus may also be realized by a server device that communicates with a terminal device such as a mobile phone. In that case, the instruction receiving unit 22 receives, from the terminal device via a communication network, an instruction given to the terminal device by the user, and the display control unit 24 causes the display device of the terminal device to display the edited image 40, for example, by transmitting image data of the edited image 40 to the terminal device. The speech synthesis unit 34 transmits the speech signal V generated by the speech synthesis process to the terminal device.

(7) In each of the above embodiments, the storage device 12 that stores the speech unit group L and the synthesis information S is built into the speech synthesis apparatus 100, but a configuration in which an external device independent of the speech synthesis apparatus 100 (for example, a server device) stores the speech unit group L and the synthesis information S may also be adopted. In that case, the speech synthesis apparatus 100 acquires the speech units of the speech unit group L and the synthesis information S, for example, via a communication network, and executes the editing process and the speech synthesis process. As understood from the above description, an element that stores the speech unit group L and the synthesis information S is not an essential element of the speech synthesis apparatus 100.

(8) Each of the above embodiments illustrates the generation of a speech signal V of the singing voice of a synthesized piece of music, but the present invention can also be applied to the generation of a speech signal V of speech other than singing (for example, conversational speech). Accordingly, the pitch X1, which is suited to singing voice synthesis, may be omitted from the synthesis information S. As understood from the above description, the synthesis information S illustrated in each of the above embodiments is expressed comprehensively as information that specifies the pronunciation content of the speech to be synthesized. The speech synthesis method is also arbitrary. Specifically, in addition to the unit-concatenation speech synthesis process illustrated in each of the above embodiments, the speech signal V may be generated by a statistical-model speech synthesis process that applies, to a pitch estimated by an HMM (Hidden Markov Model), filter processing corresponding to the pronunciation content X3 (phonemes).

(9) Each of the above embodiments illustrates the synthesis of English speech, but the language of the speech to be synthesized is arbitrary. For example, the present invention can also be applied to the generation of speech in any language, such as Japanese, Spanish, Chinese, or Korean.

DESCRIPTION OF SYMBOLS 100 ... speech synthesis apparatus, 10 ... arithmetic processing device, 12 ... storage device, 14 ... display device, 16 ... input device, 18 ... sound emission device, 22 ... instruction receiving unit, 24 ... display control unit, 26 ... index calculation unit, 32 ... information management unit, 34 ... speech synthesis unit.

Claims

A synthesis information management apparatus that manages synthesis information specifying a pitch, a duration, and pronunciation content for each voice unit of a synthesis target speech, the apparatus comprising:
index calculation means for calculating, for each of a plurality of phonemes of the synthesis target speech, a speech change index, which is a variable indicating how readily a speech change occurs, in accordance with the duration of the phoneme;
instruction receiving means for receiving an instruction from a user; and
information management means for determining whether or not to generate a speech change for a phoneme of the synthesis target speech according to the relative magnitude of the speech change index calculated for the phoneme by the index calculation means and a threshold variably controlled according to the instruction received from the user by the instruction receiving means, setting the occurrence of the speech change in the synthesis information for a phoneme determined to undergo a speech change, and not setting the occurrence of a speech change in the synthesis information for a phoneme determined not to undergo a speech change.

A synthesis information management apparatus that manages synthesis information specifying a pitch, a duration, and pronunciation content for each voice unit of a synthesis target speech, the apparatus comprising:
index calculation means for calculating, for each of a plurality of phonemes of the synthesis target speech, a speech change index, which is a variable indicating how readily a speech change occurs, in accordance with the pitch of the phoneme;
instruction receiving means for receiving an instruction from a user; and
information management means for determining whether or not to generate a speech change for a phoneme of the synthesis target speech according to the relative magnitude of the speech change index calculated for the phoneme by the index calculation means and a threshold set according to the instruction received from the user by the instruction receiving means, setting the occurrence of the speech change in the synthesis information for a phoneme determined to undergo a speech change, and not setting the occurrence of a speech change in the synthesis information for a phoneme determined not to undergo a speech change.

The synthesis information management apparatus according to claim 1 or 2, wherein the index calculation means obtains, from reference information that designates, for each combination of two phonemes, a basic index indicating how readily a speech change occurs, the basic index corresponding to a combination of two phonemes of the synthesis target speech, and calculates the speech change index in accordance with that basic index.

A speech synthesis apparatus that generates a speech signal of a synthesis target speech for which synthesis information specifies a pitch, a duration, and pronunciation content for each voice unit, the apparatus comprising:
index calculation means for calculating, for each of a plurality of phonemes of the synthesis target speech, a speech change index, which is a variable indicating how readily a speech change occurs, in accordance with the duration of the phoneme;
instruction receiving means for receiving an instruction from a user; and
speech synthesis means for generating the speech signal according to the synthesis information, the speech synthesis means determining whether or not to generate a speech change for a phoneme of the synthesis target speech according to the relative magnitude of the speech change index calculated for the phoneme by the index calculation means and a threshold variably controlled according to the instruction received from the user by the instruction receiving means, generating the speech signal with the speech change for a phoneme determined to undergo a speech change, and generating the speech signal without a speech change for a phoneme determined not to undergo a speech change.