JP7611744B2 - Signal processing device and program

Info

Publication number: JP7611744B2
Application number: JP2021048936A
Authority: JP
Inventors: 信正清山; 礼子齋藤; 正熊野; 篤今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2021-03-23
Filing date: 2021-03-23
Publication date: 2025-01-10
Anticipated expiration: 2041-03-23
Also published as: JP2022147616A

Description

本発明は、信号処理装置およびプログラムに関する。 The present invention relates to a signal processing device and a program.

統計的な情報を用いて任意のテキストに対する音声を合成するテキスト音声合成を実現するには、事前準備として、音声信号データと、その発話内容のテキストや、読み方や、アクセントなどの情報を多量に用意しておくことが必要である。多量の音声信号データに対する読み方やアクセントの情報を正しく付与するには、例えばアナウンサーなどアクセントを正確に理解して聞き分けられる専門家による作業が必要となり、コストがかかる。そこで、音声信号データに対するアクセント型を判定する技術が求められ、研究されている。 To realize text-to-speech synthesis, which uses statistical information to synthesize speech for any text, it is necessary to prepare a large amount of speech signal data, the text of the speech, reading, accent, and other information as a preliminary step. Correctly assigning reading and accent information to a large amount of speech signal data requires the work of an expert who can accurately understand and distinguish accents, such as an announcer, which is costly. Therefore, there is a demand for technology that can determine the accent type for speech signal data, and research is being conducted.

非特許文献１に記載された技術は、与えられた音声信号データがどのアクセント型に該当するかを識別しようとするものである。具体的には、この技術は、事前に、アクセント句内において、モーラ（子音＋母音もしくは母音、促音、撥音）を単位として、各モーラを代表するピッチを求める。また、隣接するモーラ間における値の差分値により、（モーラ数－１）次元の特徴ベクトルを構成する。そして、同じアクセント型を持つアクセント句の特徴ベクトルを用いて、アクセント型ごとに（モーラ数－１）次元の特徴ベクトルの正規分布を求めておく。そして、これらのモデル化された正規分布を用いて、判定対象とするアクセント句において算出した（モーラ数－１）次元の特徴ベクトルがどのアクセント型に該当するかを識別するようにしている。 The technology described in Non-Patent Document 1 attempts to identify which accent type a given speech signal corresponds to. Specifically, within each accent phrase, it first determines a representative pitch for each mora (consonant + vowel, or a vowel, geminate consonant, or moraic nasal alone). It then constructs a (number of moras - 1)-dimensional feature vector from the differences between the values of adjacent moras. Using the feature vectors of accent phrases that share the same accent type, it estimates in advance a normal distribution of the (number of moras - 1)-dimensional feature vectors for each accent type. Finally, it uses these modeled normal distributions to identify which accent type the (number of moras - 1)-dimensional feature vector computed for the accent phrase under evaluation corresponds to.
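The prior-art procedure described above can be illustrated with a minimal Python sketch. It is not the implementation from Non-Patent Document 1: the diagonal-covariance Gaussian, the variance floor, the function names, and the toy pitch values are all assumptions made for illustration only.

```python
import math

def diff_features(mora_pitch):
    """Return the (number of moras - 1) differences between adjacent mora pitches."""
    return [b - a for a, b in zip(mora_pitch, mora_pitch[1:])]

def fit_diag_gaussian(vectors, var_floor=1e-3):
    """Fit a per-dimension mean and variance (diagonal covariance is an assumption)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [max(sum((v[d] - mean[d]) ** 2 for v in vectors) / n, var_floor)
           for d in range(dim)]
    return mean, var

def log_likelihood(x, model):
    """Log density of x under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - m) ** 2 / s)
               for xi, m, s in zip(x, mean, var))

def classify(mora_pitch, models):
    """Pick the accent type whose Gaussian gives the highest likelihood."""
    x = diff_features(mora_pitch)
    return max(models, key=lambda accent_type: log_likelihood(x, models[accent_type]))

# Toy 3-mora training data (made-up log-pitch values, keyed by accent type).
train = {
    0: [[4.8, 5.2, 5.3], [4.7, 5.1, 5.1]],  # low-high-high
    1: [[5.3, 4.9, 4.7], [5.2, 4.8, 4.7]],  # high-low-low
}
models = {t: fit_diag_gaussian([diff_features(p) for p in ps])
          for t, ps in train.items()}
predicted = classify([5.25, 4.85, 4.7], models)
```

Because each model has a fixed dimension, a separate set of distributions is needed for every mora count, which is one reason the approach is tied to accent-phrase-level difference values.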

また、特許文献１にも、音声信号に対するアクセントの種類を判断するための技術が記載されている。特許文献１の技術は、テキスト音声合成を支援することを目的としている。この技術は、その目的のために、対象語句に対応する音声を複数のアクセント句に分割する。そして、対象とするアクセント句において、モーラ単位でピッチの時間変化の傾きと最終ピッチを求める。そして、この技術は、それらの情報に基づいて、モーラ単位で、ハイ（High）、ロー（Low）、アップ（Up）、ダウン（Down）の４種類の評価関数の値を算出する。そして、アクセント句を構成するモーラ全体の評価関数の値の加算値が最大となる組み合わせを求めることによって、アクセントの種類を判断するようにしている。 Patent Document 1 also describes a technology for determining the type of accent for a speech signal. The technology in Patent Document 1 aims to support text-to-speech synthesis. To that end, this technology divides the speech corresponding to a target phrase into multiple accent phrases. Then, for the target accent phrase, the slope of the time change in pitch and the final pitch are found on a mora-by-mora basis. Then, based on that information, this technology calculates the values of four types of evaluation functions, high, low, up, and down, on a mora-by-mora basis. The type of accent is determined by finding the combination that maximizes the sum of the evaluation function values for all the moras that make up the accent phrase.

特許第４１２９９８９号公報 Japanese Patent No. 4129989

石井カルロス寿憲，峯松信明，広瀬啓吉，ピッチ知覚を考慮した日本語連続音声のアクセント型判定，電子情報通信学会技術研究報告，SP，音声， 101(270)，23-30，2001年08月23日，一般社団法人電子情報通信学会． Carlos Toshinori Ishii, Nobuaki Minematsu, Keikichi Hirose, "Accent type determination of continuous Japanese speech considering pitch perception," IEICE Technical Report, SP (Speech), 101(270), pp. 23-30, August 23, 2001, The Institute of Electronics, Information and Communication Engineers.

しかしながら、非特許文献１に記載された技術では、モーラ間のピッチの差分値を用いるため、（モーラ数－１）個の差分値に基づくアクセント型を判定することしかできない。また、特許文献１に記載された技術では、評価関数の計算結果の組合せとして、アクセントの種類への分類を行うものである。 However, because the technology described in Non-Patent Document 1 uses the pitch differences between moras, it can only determine the accent type based on (number of moras - 1) difference values. The technology described in Patent Document 1, in turn, classifies accent types according to combinations of the evaluation function results.
これらの従来技術のいずれの方法もピッチ抽出を前提としており、ピッチ抽出誤りの影響によって識別精度が低下する可能性があり得る。また、これらのいずれの技術も、アクセント句単位の分析に基づくものであり、発話全体あるいはフレーズ単位でのピッチの変動を考慮していない。したがって、精度よくアクセントを推定することができないという問題がある。このように、従来技術では、音声信号から精度よくアクセントを推定することができないという問題があった。 Both of these conventional methods presuppose pitch extraction, so pitch extraction errors can reduce identification accuracy. In addition, both techniques are based on analysis at the accent phrase level and do not take into account pitch variation over the whole utterance or over a phrase. As a result, they cannot estimate the accent with high accuracy. In short, the conventional techniques have the problem that the accent cannot be estimated accurately from a speech signal.

本発明は、上記の課題認識に基づいて行なわれたものであり、精度よくアクセントのラベルを推定することのできる信号処理装置およびプログラムを提供しようとするものである。 The present invention was developed based on the above problem recognition, and aims to provide a signal processing device and a program that can estimate accent labels with high accuracy.

［１］上記の課題を解決するため、本発明の一態様による信号処理装置は、所定単位のテキストごとに、前記所定単位のテキストに対応する音声信号を基に、フレーム単位の音響特徴量を求める音響分析部と、前記所定単位のテキストを基に、当該テキストの前記フレーム単位の言語特徴量を求める言語解析部と、前記フレーム単位の前記音響特徴量および前記言語特徴量を予め学習済みの統計モデルに入力することによって、前記統計モデルからの出力として前記テキストに対応するアクセントのラベルを推定するラベル推定部と、を備えるものである。 [1] In order to solve the above problem, a signal processing device according to one aspect of the present invention includes an acoustic analysis unit that determines, for each predetermined unit of text, acoustic features on a frame-by-frame basis based on a speech signal corresponding to the predetermined unit of text, a language analysis unit that determines linguistic features of the text on a frame-by-frame basis based on the predetermined unit of text, and a label estimation unit that inputs the frame-by-frame acoustic features and the linguistic features into a pre-trained statistical model, and estimates an accent label corresponding to the text as an output from the statistical model.

［２］また、本発明の一態様は、上記の信号処理装置において、所定単位の学習用テキストごとに、前記所定単位の学習用テキストに対応する学習用音声信号を基に、フレーム単位の学習用音響特徴量を求める学習用音響分析部と、前記所定単位の学習用テキストを基に、当該学習用テキストの前記フレーム単位の学習用言語特徴量を求める学習用言語解析部と、前記学習用言語特徴量に基づいて前記テキストに対応するアクセントの正解ラベルを求める正解ラベル抽出部と、前記フレーム単位の前記学習用音響特徴量および前記学習用言語特徴量を統計モデルに入力することによって前記統計モデルからの出力として前記学習用テキストに対応するアクセントの学習用推定ラベルを得て、当該学習用推定ラベルと前記正解ラベルとの誤差に基づいて前記統計モデルの学習を行うモデル学習部と、をさらに備えるものである。 [2] In one aspect of the present invention, the signal processing device further includes a training acoustic analysis unit that determines, for each training text of a predetermined unit, training acoustic features on a frame-by-frame basis based on a training speech signal corresponding to the training text of the predetermined unit, a training language analysis unit that determines training language features on a frame-by-frame basis of the training text of the predetermined unit, a correct label extraction unit that determines a correct label of an accent corresponding to the text based on the training language features, and a model learning unit that inputs the frame-by-frame training acoustic features and the training language features to a statistical model to obtain a training estimated label of an accent corresponding to the training text as an output from the statistical model, and learns the statistical model based on an error between the training estimated label and the correct label.

［３］また、本発明の一態様による信号処理装置は、所定単位のテキストごとに、前記所定単位のテキストに対応する音声信号を基に、フレーム単位の音響特徴量を求める音響分析部と、前記所定単位のテキストを基に、当該テキストの前記フレーム単位の言語特徴量を求める言語解析部と、前記言語特徴量に基づいて前記テキストに対応するアクセントの正解ラベルを求める正解ラベル抽出部と、前記フレーム単位の前記音響特徴量および前記言語特徴量を統計モデルに入力することによって前記統計モデルからの出力として前記テキストに対応するアクセントの推定ラベルを得て、当該推定ラベルと前記正解ラベルとの誤差に基づいて前記統計モデルの学習を行うモデル学習部と、を備えるものである。 [3] A signal processing device according to an aspect of the present invention includes an acoustic analysis unit that determines, for each predetermined unit of text, acoustic features on a frame-by-frame basis based on a speech signal corresponding to the predetermined unit of text; a language analysis unit that determines linguistic features of the text on a frame-by-frame basis based on the predetermined unit of text; a correct label extraction unit that determines a correct label for an accent corresponding to the text based on the linguistic features; and a model learning unit that inputs the frame-by-frame acoustic features and the linguistic features into a statistical model to obtain an estimated label for an accent corresponding to the text as an output from the statistical model, and learns the statistical model based on an error between the estimated label and the correct label.

［４］また、本発明の一態様は、上記の信号処理装置において、前記所定単位は、アクセント句の単位である、というものである。 [4] In one aspect of the present invention, in the signal processing device described above, the predetermined unit is a unit of accent phrase.

［５］また、本発明の一態様は、上記の信号処理装置において、前記言語特徴量は、発話内の呼気段落位置と、呼気段落内のアクセント句位置と、アクセント句内のモーラ位置と、モーラ内のフレーム位置、である、というものである。 [5] In one aspect of the present invention, in the above-mentioned signal processing device, the linguistic features are the breath group position within the utterance, the accent phrase position within the breath group, the mora position within the accent phrase, and the frame position within the mora.

［６］また、本発明の一態様は、上記の信号処理装置において、前記統計モデルは、ディープニューラルネットワークを用いて実現されるものである。 [6] In one aspect of the present invention, in the signal processing device, the statistical model is realized using a deep neural network.

［７］また、本発明の一態様は、上記の信号処理装置において、前記音響特徴量は、メルスペクトログラムである、というものである。 [7] In another aspect of the present invention, in the signal processing device, the acoustic feature is a mel spectrogram.

［８］また、本発明の一態様は、コンピューターを、上記［１］から［７］までのいずれか一項に記載の信号処理装置、として機能させるためのプログラムである。 [8] Another aspect of the present invention is a program for causing a computer to function as a signal processing device according to any one of [1] to [7] above.

本発明によれば、フレーム単位の音響特徴量と言語特徴量で学習した統計モデルを用いることによって、言語特徴量にも基づく発話の影響を考慮して、精度よくアクセントを推定することができる。 According to the present invention, by using a statistical model trained on frame-by-frame acoustic features and linguistic features, it is possible to accurately estimate accent, taking into account the influence of speech based on linguistic features.

本発明の実施形態によるアクセント推定装置の概略機能構成を示すブロック図である。 FIG. 1 is a block diagram showing a schematic functional configuration of an accent estimation device according to an embodiment of the present invention.
同実施形態によるアクセント推定装置が持つ前処理部（３２）の詳細な機能構成を示すブロック図である。 FIG. 2 is a block diagram showing a detailed functional configuration of the preprocessing unit (32) of the accent estimation device according to the embodiment.
同実施形態によるアクセント推定装置が使用するフルコンテキストラベルの形式を示す概略図（１／３）である。 FIG. 3 is a schematic diagram (1/3) showing the format of a full context label used by the accent estimation device according to the embodiment.
同実施形態によるアクセント推定装置が使用するフルコンテキストラベルの形式を示す概略図（２／３）である。 FIG. 4 is a schematic diagram (2/3) showing the format of a full context label used by the accent estimation device according to the embodiment.
同実施形態によるアクセント推定装置が使用するフルコンテキストラベルの形式を示す概略図（３／３）である。 FIG. 5 is a schematic diagram (3/3) showing the format of a full context label used by the accent estimation device according to the embodiment.
同実施形態による前処理部内の言語解析部がテキストの言語解析処理を行ったことによって抽出するフルコンテキストラベルの例を示す概略図（１／２）である。 FIG. 6 is a schematic diagram (1/2) showing an example of a full context label extracted when the language analysis unit in the preprocessing unit according to the embodiment performs language analysis on a text.
同実施形態による前処理部内の言語解析部がテキストの言語解析処理を行ったことによって抽出するフルコンテキストラベルの例を示す概略図（２／２）である。 FIG. 7 is a schematic diagram (2/2) showing an example of a full context label extracted when the language analysis unit in the preprocessing unit according to the embodiment performs language analysis on a text.
同実施形態による第１正解ラベル抽出部および第２正解ラベル抽出部がアクセント句単位の正解ラベルを求めるための処理の手順を示すフローチャートである。 FIG. 8 is a flowchart showing the procedure by which the first correct label extraction unit and the second correct label extraction unit according to the embodiment obtain a correct label for each accent phrase.
同実施形態による第２正解ラベル抽出部がアクセント句単位のトーンを作成する処理の手順を示すフローチャートである。 FIG. 9 is a flowchart showing the procedure by which the second correct label extraction unit according to the embodiment creates a tone for each accent phrase.
同実施形態のアクセント推定装置における、アクセント句のトーンから正解ラベルの数値への変換の規則を示す概略図である。 FIG. 10 is a schematic diagram showing the rules for converting the tones of an accent phrase into numerical correct-label values in the accent estimation device of the embodiment.
同実施形態のモデル学習部による統計モデルの学習時の処理の流れを示す概略図である。 FIG. 11 is a schematic diagram showing the processing flow when the model learning unit of the embodiment trains the statistical model.
同実施形態によるラベル推定部がラベルを推定する処理の流れを示す概略図である。 FIG. 12 is a schematic diagram showing the processing flow by which the label estimation unit according to the embodiment estimates a label.
同実施形態によるアクセント推定装置が持つ機能を実現するための内部構成の例を示すブロック図である。 FIG. 13 is a block diagram showing an example of an internal configuration for realizing the functions of the accent estimation device according to the embodiment.

次に、本発明の一実施形態について、図面を参照しながら説明する。まず、本実施形態が扱ういくつかの概念を説明する。発話（utterance）は、ひとまとまりの文に相当するスピーチである。呼気段落（breath group）は、発話中において休止（ポーズ，pause）によって区切られる一つの塊の単位である。書かれた文における読点で句切られる単位が、ほぼ呼気段落に対応する。アクセント句（accent phrase）は、通常においては、日本語の一単語、あるいは一単語プラス付属語という単位の塊である。モーラ（mora）は、１つの子音プラス１つの母音、または１つの母音もしくは撥音、促音のみから成る言語要素である。１つのモーラは、拗音もしくは主に外来語で使われる小さい母音を含まない場合には、１つの仮名（片仮名等）で表わされる。拗音もしくは主に外来語で使われる小さい母音を含む場合の１つのモーラは、１つの仮名プラス拗音を表す仮名で表わされる。例えば、「チ」や「オ」や「シャ」や「フィ」といったものがそれぞれモーラの単位である。音素（phoneme）は、モーラよりもさらに細かい、子音のみ、あるいは母音のみ、の単位である。 Next, an embodiment of the present invention will be described with reference to the drawings. First, several concepts used in this embodiment are explained. An utterance is speech corresponding to one coherent sentence. A breath group is a single chunk of an utterance delimited by pauses. The units delimited by commas in a written sentence correspond roughly to breath groups. An accent phrase is normally a chunk consisting of one Japanese word, or one word plus attached function words. A mora is a linguistic element consisting of one consonant plus one vowel, or of a single vowel, moraic nasal, or geminate consonant alone. A mora that contains neither a yōon (palatalized sound) nor a small vowel used mainly in loanwords is written with one kana (katakana, etc.). A mora that does contain a yōon or such a small vowel is written with one kana plus the small kana representing it. For example, "chi" (チ), "o" (オ), "sha" (シャ), and "fi" (フィ) are each one mora. A phoneme is a finer unit than a mora, consisting of a single consonant or a single vowel.

本実施形態では、アクセント推定装置は、事前学習においては、発話ごとのテキストをアクセント句単位に分割し、音声信号からアクセント句内でフレーム単位の音響特徴量を求める。また、アクセント推定装置は、テキストを基に、アクセント句内でフレーム単位の言語特徴量とアクセントの正解ラベルを求める。また、アクセント推定装置は、アクセント句内のフレーム単位の音響特徴量と言語特徴量を入力データとして、統計モデルの推定値と正解ラベルの誤差を最小とするように、統計モデルを学習する。 In this embodiment, during pre-training, the accent estimation device divides the text of each utterance into accent phrases and obtains frame-by-frame acoustic features within each accent phrase from the speech signal. From the text, it also obtains frame-by-frame linguistic features and the correct accent labels within each accent phrase. Using the frame-by-frame acoustic and linguistic features within each accent phrase as input data, it then trains the statistical model so as to minimize the error between the model's estimates and the correct labels.
アクセント推定装置は、ラベル推定の処理においては、発話ごとのテキストをアクセント句単位に分割し、音声信号からアクセント句内でフレーム単位の音響特徴量を求める。また、アクセント推定装置は、テキストを基に、アクセント句内でフレーム単位の言語特徴量を求める。そして、アクセント推定装置は、アクセント句内のフレーム単位の音響特徴量と言語特徴量を、学習済みの統計モデルに入力することによって、アクセントのラベルを推定する。 During label estimation, the accent estimation device divides the text of each utterance into accent phrases and obtains frame-by-frame acoustic features within each accent phrase from the speech signal. From the text, it also obtains frame-by-frame linguistic features within each accent phrase. It then estimates the accent labels by inputting the frame-by-frame acoustic and linguistic features within each accent phrase into the trained statistical model.

図１は、本実施形態によるアクセント推定装置の概略機能構成を示すブロック図である。図示するように、アクセント推定装置１は、モデル記憶部１１と、推定用音声コーパス記憶部２１と、前処理部２２と、ラベル推定部２３と、学習用音声コーパス記憶部３１と、前処理部３２と、モデル学習部３３とを含んで構成される。アクセント推定装置１が持つこれらの機能は、例えば、コンピューターと、プログラムとで実現することが可能である。また、各機能は、必要に応じて、記憶手段を有する。記憶手段は、例えば、プログラム上の変数や、プログラムの実行によりアロケーションされるメモリーである。また、必要に応じて、磁気ハードディスク装置やソリッドステートドライブ（ＳＳＤ）といった不揮発性の記憶手段を用いるようにしてもよい。また、各機能部の少なくとも一部の機能を、プログラムではなく専用の電子回路として実現してもよい。 FIG. 1 is a block diagram showing a schematic functional configuration of an accent estimation device according to this embodiment. As shown in the figure, the accent estimation device 1 includes a model storage unit 11, an estimation speech corpus storage unit 21, a preprocessing unit 22, a label estimation unit 23, a training speech corpus storage unit 31, a preprocessing unit 32, and a model learning unit 33. These functions of the accent estimation device 1 can be realized, for example, by a computer and a program. Furthermore, each function has a storage means as necessary. The storage means is, for example, a variable in the program or a memory allocated by the execution of the program. Furthermore, non-volatile storage means such as a magnetic hard disk device or a solid state drive (SSD) may be used as necessary. Furthermore, at least a part of the functions of each functional unit may be realized as a dedicated electronic circuit rather than a program.

図示するアクセント推定装置１は、学習用音声コーパスに基づいて、機械学習により統計モデル（アクセント推定モデル）を構築する機能を含む。アクセント推定装置１は、構築された統計モデルをモデル記憶部１１に書き込む。また、アクセント推定装置は、上記の統計モデルを用いて、未知のテキストのアクセントを表すラベルを推定して出力する機能を含む。未知のテキストは、例えば推定用音声コーパスから得られるものである。 The accent estimation device 1 shown in the figure has a function of constructing a statistical model (accent estimation model) by machine learning based on a training speech corpus. The accent estimation device 1 writes the constructed statistical model to a model storage unit 11. The accent estimation device also has a function of estimating and outputting a label representing the accent of unknown text using the above statistical model. The unknown text is obtained, for example, from the estimation speech corpus.

アクセント推定装置１が持つ機能のうち、予め構築された統計モデルに基づいて未知のテキストのアクセントを表すラベルを推定する部分の機能のみを、アクセント推定装置２として実施してもよい。その場合、アクセント推定装置２は、上に列挙した機能部のうち、推定用音声コーパス記憶部２１と、前処理部２２と、ラベル推定部２３とを含んで構成される。アクセント推定装置２は、モデル記憶部１１が記憶する統計モデルを参照してラベル推定の処理を行うことができる。このときの統計モデルについては学習が完了している。アクセント推定装置１が持つ機能のうち、統計モデルを構築する部分の機能のみを、アクセント推定モデル学習装置３として実施してもよい。その場合、アクセント推定モデル学習装置３は、上に列挙した機能部のうち、学習用音声コーパス記憶部３１と、前処理部３２と、モデル学習部３３とを含んで構成される。アクセント推定モデル学習装置３は、学習して得られる統計モデルの情報を、モデル記憶部１１に書き込む。つまり、アクセント推定モデル学習装置３は統計モデルの学習を行うことができる。この統計モデルを用いることによって、入力データに対応するアクセントのラベルを推定することが可能となる。 Of the functions of the accent estimation device 1, only the function of estimating a label representing an accent of unknown text based on a pre-constructed statistical model may be implemented as the accent estimation device 2. In this case, the accent estimation device 2 is configured to include the estimation speech corpus storage unit 21, the pre-processing unit 22, and the label estimation unit 23 among the functional units listed above. The accent estimation device 2 can perform label estimation processing by referring to the statistical model stored in the model storage unit 11. Learning of the statistical model at this time has been completed. Of the functions of the accent estimation device 1, only the function of constructing a statistical model may be implemented as the accent estimation model learning device 3. In this case, the accent estimation model learning device 3 is configured to include the training speech corpus storage unit 31, the pre-processing unit 32, and the model learning unit 33 among the functional units listed above. The accent estimation model learning device 3 writes information on the statistical model obtained by learning to the model storage unit 11. In other words, the accent estimation model learning device 3 can learn the statistical model. By using this statistical model, it becomes possible to estimate the accent label corresponding to the input data.

アクセント推定装置１や、アクセント推定装置２や、アクセント推定モデル学習装置３のそれぞれを、「信号処理装置」と呼んでもよい。後述する音声信号や、テキストや、音響特徴量や、言語特徴量などのそれぞれは信号処理装置によって処理される「信号」である。 Each of the accent estimation device 1, the accent estimation device 2, and the accent estimation model learning device 3 may be called a "signal processing device." The speech signal, text, acoustic features, and linguistic features described below are each a "signal" processed by the signal processing device.

モデル記憶部１１は、統計モデルの情報を記憶する。統計モデルは、機械学習可能なモデルである。統計モデルは、例えば、ニューラルネットワークを用いて実現される。統計モデルは、特に、ディープニューラルネットワークを用いて実現するようにしてもよい。ただし、統計モデルを、ニューラルネットワーク以外の手法で実現してもよい。いずれの場合も、統計モデルは、入力されたデータ（入力データ）に統計的に対応するデータ（出力データ）を出力する。モデル記憶部１１は、具体的には、統計モデルが持つ内部パラメーターの値を記憶する。なお、統計モデルの機械学習の手法自体は、既存技術を用いて実施することが可能である。 The model storage unit 11 stores information about the statistical model. The statistical model is a model that can be machine-learned. The statistical model is realized, for example, by using a neural network. The statistical model may be realized, in particular, by using a deep neural network. However, the statistical model may be realized by a method other than a neural network. In either case, the statistical model outputs data (output data) that statistically corresponds to input data (input data). Specifically, the model storage unit 11 stores values of internal parameters of the statistical model. Note that the machine learning method of the statistical model itself can be implemented using existing technology.

学習用音声コーパス記憶部３１は、統計モデルを構築するために用いる学習用音声コーパスを記憶する。学習用音声コーパスは、音声信号とテキストの対を、大量に含むものである。音声信号は、例えば、標本化周波数１６０００ヘルツ（Ｈｚ）、変換ビット数１６ビット（ｂｉｔ）で標本化されている。ただし、標本化周波数や変換ビット数は、ここに例示した値に限る必要はない。また学習用音声コーパスが持つテキストは、音声信号に対応するものであり、人手での作成あるいは音声認識処理等を用いて適切に作成されたテキストである。 The training speech corpus storage unit 31 stores the training speech corpus used to construct the statistical model. The training speech corpus contains a large number of pairs of speech signals and text. The speech signals are sampled, for example, at a sampling frequency of 16,000 hertz (Hz) with a quantization depth of 16 bits. However, the sampling frequency and bit depth need not be limited to the values given here. The text in the training speech corpus corresponds to the speech signals and has been prepared appropriately, either manually or by means such as speech recognition.
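As a rough illustration of what the example sampling parameters imply for frame-by-frame analysis, the following Python sketch computes the raw data size and the number of analysis frames for a signal. The 25 ms frame length and 10 ms frame shift are hypothetical values chosen for illustration; the text above does not specify them.

```python
SAMPLE_RATE_HZ = 16000   # example sampling frequency from the text
BITS_PER_SAMPLE = 16     # example bit depth from the text

def pcm_bytes(num_samples):
    """Size in bytes of a mono PCM signal with the parameters above."""
    return num_samples * BITS_PER_SAMPLE // 8

def frame_count(num_samples, frame_len, frame_shift):
    """Number of complete analysis frames that fit in the signal."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_shift

# Hypothetical analysis settings: 25 ms window, 10 ms shift.
frame_len = SAMPLE_RATE_HZ * 25 // 1000    # 400 samples
frame_shift = SAMPLE_RATE_HZ * 10 // 1000  # 160 samples

one_second = SAMPLE_RATE_HZ
size = pcm_bytes(one_second)
frames = frame_count(one_second, frame_len, frame_shift)
```

Under these assumed settings, one second of speech occupies 32,000 bytes and yields 98 complete frames.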

前処理部３２は、学習用音声コーパス記憶部３１から読み出される音声信号とテキストとの対についての処理を行い、学習処理のための入力データと正解ラベルとの対を出力する。なお、前処理部３２は、入力されるデータをアクセント句単位に分割し、アクセント句ごとの入力データと正解ラベルとを出力する。入力データは、アクセント句ごとの、フレーム単位の音響特徴量と言語特徴量とを含むものである。前処理部３２は、この入力データと正解ラベルとの対を、モデル学習部３３に渡す。なお、前処理部３２の処理のための詳細な機能構成については、後で図２を参照しながら説明する。 The preprocessing unit 32 processes pairs of speech signals and texts read from the training speech corpus storage unit 31, and outputs pairs of input data and correct labels for the training process. The preprocessing unit 32 divides the input data into accent phrase units, and outputs input data and correct labels for each accent phrase. The input data includes acoustic features and linguistic features on a frame-by-frame basis for each accent phrase. The preprocessing unit 32 passes the pairs of input data and correct labels to the model training unit 33. The detailed functional configuration for the processing of the preprocessing unit 32 will be described later with reference to FIG. 2.

モデル学習部３３は、前処理部３２から渡される入力データと正解ラベルとの対を用いて、機械学習処理を行う。モデル学習部３３は、具体的には、統計モデルを用いて、入力データを基にラベルを推定し、その推定ラベル（「学習用推定ラベル」とも呼ばれる）と正解ラベルとの差（ロス）を求め、そのロスを最小にするように統計モデルの内部パラメーターを調整する。モデル学習部３３は、学習用データとして入力データと正解ラベルの多数の対について、繰り返し、内部パラメーターを調整（更新）する処理を行う。モデル学習部３３が用いる機械学習の手法自体は、既存技術に属するものである。例えば、統計モデルがニューラルネットワークで実現される場合、モデル学習部３３は誤差逆伝播法を用いて統計モデルの学習を行う。 The model learning unit 33 performs machine learning processing using pairs of input data and correct labels passed from the preprocessing unit 32. Specifically, the model learning unit 33 uses a statistical model to estimate labels based on the input data, calculates the difference (loss) between the estimated labels (also called "estimated labels for learning") and the correct labels, and adjusts the internal parameters of the statistical model to minimize the loss. The model learning unit 33 repeatedly performs processing to adjust (update) the internal parameters for many pairs of input data and correct labels as learning data. The machine learning method itself used by the model learning unit 33 belongs to existing technology. For example, when the statistical model is realized by a neural network, the model learning unit 33 uses the backpropagation method to learn the statistical model.

言い換えれば、モデル学習部３３は、フレーム単位の音響特徴量（学習用音響特徴量）および言語特徴量（学習用言語特徴量）を統計モデルに入力することによって、統計モデルからの出力としてテキスト（学習用テキスト）に対応するアクセントの推定ラベル（学習用推定ラベル）を得て、当該推定ラベル（学習用推定ラベル）と前記正解ラベルとの誤差（ロス）に基づいて統計モデルの学習を行う。つまり、モデル学習部３３は、誤差が小さくなる方向に統計モデルの内部のパラメーターを調整（更新）する。 In other words, the model learning unit 33 inputs frame-by-frame acoustic features (training acoustic features) and language features (training language features) into a statistical model, obtains an estimated label (training estimated label) of an accent corresponding to text (training text) as output from the statistical model, and learns the statistical model based on the error (loss) between the estimated label (training estimated label) and the ground truth label. In other words, the model learning unit 33 adjusts (updates) the internal parameters of the statistical model in the direction that reduces the error.
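The parameter update described above can be sketched in plain Python. This is only an illustration under stated assumptions: a single softmax layer stands in for the statistical model (the embodiment may use a deep neural network), cross-entropy is assumed as the loss, and the feature vectors, labels, learning rate, and function names are made up.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, b, x):
    """Class probabilities for one input feature vector."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bk
                    for row, bk in zip(W, b)])

def mean_loss(W, b, data):
    """Mean cross-entropy between estimated and correct labels."""
    return -sum(math.log(forward(W, b, x)[y]) for x, y in data) / len(data)

def fit(data, n_classes, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    dim = len(data[0][0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in data:
            p = forward(W, b, x)
            for k in range(n_classes):
                err = p[k] - (1.0 if k == y else 0.0)  # gradient w.r.t. logit k
                gb[k] += err
                for d in range(dim):
                    gW[k][d] += err * x[d]
        for k in range(n_classes):  # step in the direction that reduces the loss
            b[k] -= lr * gb[k] / len(data)
            for d in range(dim):
                W[k][d] -= lr * gW[k][d] / len(data)
    return W, b

# Made-up (input features, correct label) pairs.
data = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.1, 0.9], 1), ([0.2, 0.8], 1)]
initial_loss = mean_loss([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], data)
W, b = fit(data, n_classes=2)
final_loss = mean_loss(W, b, data)
```

With a multi-layer network, the same error term would be propagated backward through the hidden layers, which is what backpropagation does.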

推定用音声コーパス記憶部２１は、ラベル推定の対象となるテキストを含んだ推定用音声コーパスを記憶する。推定用音声コーパスは、テキストと音声信号との対を含むものである。 The estimation speech corpus storage unit 21 stores an estimation speech corpus that includes text that is the subject of label estimation. The estimation speech corpus includes pairs of text and speech signals.

前処理部２２は、推定用音声コーパス記憶部２１から読み出されるテキストの前処理を行い、その結果を入力データとしてラベル推定部２３に渡す。 The preprocessing unit 22 performs preprocessing on the text read from the estimation speech corpus storage unit 21 and passes the result to the label estimation unit 23 as input data.

ラベル推定部２３は、統計モデルを用いて、前処理部２２から渡される入力データに対応するアクセントのラベルを推定する。ラベル推定部２３は、モデル記憶部１１を参照することによってラベルを推定する。その時点で、モデル記憶部１１は、学習済みの統計モデルの情報を既に記憶している。ラベル推定部２３は、具体的には、フレーム単位の音響特徴量および言語特徴量を予め学習済みの統計モデルに入力することによって、統計モデルからの出力としてテキストに対応するアクセントのラベルを推定する。 The label estimation unit 23 uses a statistical model to estimate an accent label corresponding to the input data passed from the preprocessing unit 22. The label estimation unit 23 estimates the label by referring to the model storage unit 11. At that point, the model storage unit 11 has already stored information about the trained statistical model. Specifically, the label estimation unit 23 inputs frame-by-frame acoustic features and linguistic features into a pre-trained statistical model, thereby estimating an accent label corresponding to the text as an output from the statistical model.

図２は、アクセント推定装置１が持つ前処理部３２のさらに詳細な機能構成を示すブロック図である。図示するように、前処理部３２は、音響分析部３５１と、音素セグメンテーション部３５２と、中心音素抽出部３５３と、言語解析部３５４と、モーラ単位統合部３６１と、アクセント句単位統合部３６２と、アクセント句分割部３７１と、第１言語特徴量抽出部３８１と、第２言語特徴量抽出部３８２と、第１正解ラベル抽出部３８６と、第２正解ラベル抽出部３８７と、音響特徴量言語特徴量統合部３９１とを含んで構成される。 Fig. 2 is a block diagram showing a more detailed functional configuration of the preprocessing unit 32 of the accent estimation device 1. As shown in the figure, the preprocessing unit 32 includes an acoustic analysis unit 351, a phoneme segmentation unit 352, a central phoneme extraction unit 353, a language analysis unit 354, a mora unit integration unit 361, an accent phrase unit integration unit 362, an accent phrase division unit 371, a first language feature extraction unit 381, a second language feature extraction unit 382, a first correct label extraction unit 386, a second correct label extraction unit 387, and an acoustic feature language feature integration unit 391.

前処理部３２は、学習用の前処理を行うものである。前処理部３２を「学習用前処理部」と呼んでもよい。前処理部３２が持つ音響分析部３５１を、「学習用音響分析部」と呼んでもよい。音響分析部３５１が分析対象とする音声信号を、「学習用音声信号」と呼んでもよい。音響分析部３５１が求める音響特徴量を、「学習用音響特徴量」と呼んでもよい。前処理部３２が持つ言語解析部３５４を、「学習用言語解析部」と呼んでもよい。言語解析部３５４が解析対象とするテキストを、「学習用テキスト」と呼んでもよい。言語解析部３５４が求める言語特徴量を、「学習用言語特徴量」と呼んでもよい。 The preprocessing unit 32 performs preprocessing for training. The preprocessing unit 32 may be called a "training preprocessing unit." The acoustic analysis unit 351 included in the preprocessing unit 32 may be called a "training acoustic analysis unit." The speech signal that the acoustic analysis unit 351 analyzes may be called a "training speech signal." The acoustic features determined by the acoustic analysis unit 351 may be called "training acoustic features." The language analysis unit 354 included in the preprocessing unit 32 may be called a "training language analysis unit." The text that the language analysis unit 354 analyzes may be called "training text." The language features determined by the language analysis unit 354 may be called "training language features."

音響分析部３５１は、所定単位のテキストごとに、前記所定単位のテキストに対応する音声信号を基に、フレーム単位の音響特徴量を求める。言い換えれば、学習用音響分析部は、所定単位の学習用テキストごとに、所定単位の学習用テキストに対応する学習用音声信号を基に、フレーム単位の学習用音響特徴量を求める。ここで「所定単位のテキスト」とは、例えば、アクセント句の単位のテキストである。 The acoustic analysis unit 351 determines frame-by-frame acoustic features for each predetermined unit of text based on the speech signal corresponding to the predetermined unit of text. In other words, the training acoustic analysis unit determines frame-by-frame training acoustic features for each predetermined unit of training text based on the training speech signal corresponding to the predetermined unit of training text. Here, "predetermined unit of text" refers to, for example, text in units of accent phrases.

音素セグメンテーション部３５２は、音声信号に関して、音素のセグメンテーションを行う。つまり、音素セグメンテーション部３５２は、音声信号内の音素ごとの区間（開始位置および終了位置）を特定する。 The phoneme segmentation unit 352 performs phoneme segmentation on the speech signal. In other words, the phoneme segmentation unit 352 identifies intervals (start and end positions) for each phoneme in the speech signal.

中心音素抽出部３５３は、言語解析結果（フルコンテキストラベル）に基づいて、中心音素を抽出する。 The central phoneme extraction unit 353 extracts central phonemes based on the language analysis results (full context labels).

言語解析部３５４は、上記の所定単位のテキストを基に、当該テキストのフレーム単位の言語特徴量を求める。言い換えれば、前記所定単位の学習用テキストを基に、当該学習用テキストの前記フレーム単位の学習用言語特徴量を求める。言語解析部３５４が求める言語特徴量の例は、発話内の呼気段落位置と、呼気段落内のアクセント句位置と、アクセント句内のモーラ位置と、モーラ内のフレーム位置とである。 The language analysis unit 354 determines frame-by-frame language features of the text based on the above-mentioned predetermined unit of text. In other words, based on the training text of the predetermined unit, it determines the frame-by-frame training language features of that training text. Examples of the language features determined by the language analysis unit 354 are the breath group position within the utterance, the accent phrase position within the breath group, the mora position within the accent phrase, and the frame position within the mora.
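The four positional features listed above can be illustrated with a short Python sketch that expands a nested utterance structure into one feature tuple per frame. The nested-list input format, the 1-based indexing, and the function name are assumptions for illustration; the actual encoding of these positions is not specified here.

```python
def frame_positions(utterance):
    """Expand an utterance into per-frame position tuples.

    utterance: list of breath groups; each breath group is a list of
    accent phrases; each accent phrase is a list of per-mora frame counts.
    Returns tuples (breath group position in utterance,
                    accent phrase position in breath group,
                    mora position in accent phrase,
                    frame position in mora), all 1-based.
    """
    rows = []
    for bg_idx, breath_group in enumerate(utterance, start=1):
        for ap_idx, accent_phrase in enumerate(breath_group, start=1):
            for mora_idx, n_frames in enumerate(accent_phrase, start=1):
                for frame_idx in range(1, n_frames + 1):
                    rows.append((bg_idx, ap_idx, mora_idx, frame_idx))
    return rows

# Two breath groups; the first has two accent phrases.
utterance = [[[2, 3], [1]], [[2]]]
rows = frame_positions(utterance)
```

Each frame thus carries its location in the whole utterance, which gives the model access to context beyond the single accent phrase being labeled.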

モーラ単位統合部３６１は、音響分析部３５１が出力するフレーム単位の音響特徴量のデータを、モーラ単位の音響特徴量のデータに統合する。 The mora-based integration unit 361 integrates the frame-based acoustic feature data output by the acoustic analysis unit 351 into mora-based acoustic feature data.

アクセント句単位統合部３６２は、前記の音響特徴量のデータを、アクセント句単位の音響特徴量のデータに統合する。 The accent phrase unit integration unit 362 integrates the acoustic feature data into acoustic feature data for each accent phrase.
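The two integration steps above can be pictured as regrouping a flat frame sequence, first by mora and then by accent phrase. The Python sketch below assumes that "integration" means grouping consecutive frames without altering them, and that mora boundaries are available as frame index spans (for example, derived from the phoneme segmentation). Both points, and all names, are assumptions.

```python
def frames_per_mora(frames, mora_spans):
    """Group frames by mora; spans are (start, end) frame indices, end exclusive."""
    return [frames[start:end] for start, end in mora_spans]

def moras_per_accent_phrase(mora_groups, moras_in_phrase):
    """Group consecutive mora-level groups into accent phrases."""
    phrases, i = [], 0
    for n in moras_in_phrase:
        phrases.append(mora_groups[i:i + n])
        i += n
    return phrases

# Six frames of (toy) one-dimensional acoustic features.
frames = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
mora_spans = [(0, 2), (2, 3), (3, 6)]          # three moras
mora_groups = frames_per_mora(frames, mora_spans)
phrases = moras_per_accent_phrase(mora_groups, [2, 1])  # two accent phrases
```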

アクセント句分割部３７１は、前記の言語解析処理の結果であるフルコンテキストラベルを、アクセント句の単位に分割する。 The accent phrase division unit 371 divides the full context label, which is the result of the above-mentioned language analysis process, into accent phrase units.

第１言語特徴量抽出部３８１は、モーラ内のフレーム単位での言語特徴量を抽出する。 The first language feature extraction unit 381 extracts language features on a frame-by-frame basis within a mora.

第２言語特徴量抽出部３８２は、アクセント句内のフレーム単位での言語特徴量を抽出する。 The second language feature extraction unit 382 extracts language features on a frame-by-frame basis within an accent phrase.

第１正解ラベル抽出部３８６は、モーラ内のフレーム単位での正解ラベルを抽出する。 The first correct label extraction unit 386 extracts correct labels on a frame-by-frame basis within a mora.

第２正解ラベル抽出部３８７は、アクセント句内のフレーム単位での正解ラベルを抽出する。 The second correct label extraction unit 387 extracts correct labels on a frame-by-frame basis within an accent phrase.

なお、第１正解ラベル抽出部３８６と第２正解ラベル抽出部３８７とをまとめて単に「正解ラベル抽出部」と呼んでもよい。つまり、正解ラベル抽出部は、言語特徴量（学習用言語特徴量）に基づいてテキストに対応するアクセントの正解ラベルを求める。 The first correct label extraction unit 386 and the second correct label extraction unit 387 may be collectively referred to simply as the "correct label extraction unit." In other words, the correct label extraction unit finds a correct label for the accent corresponding to the text based on the language features (training language features).

音響特徴量言語特徴量統合部３９１は、アクセント句単位統合部３６２が出力する音響特徴量のデータと、第２言語特徴量抽出部３８２が出力する言語特徴量のデータとを、統合する。音響特徴量言語特徴量統合部３９１によって統合されたデータは、統計モデルへの入力データとなる。 The acoustic feature/language feature integration unit 391 integrates the acoustic feature data output by the accent phrase unit integration unit 362 and the language feature data output by the second language feature extraction unit 382. The data integrated by the acoustic feature/language feature integration unit 391 becomes input data to the statistical model.

Figures 3, 4, and 5 are schematic diagrams showing the format of the full context labels (context-dependent phoneme labels) used by the accent estimation device 1. A full context label is obtained by analyzing a text and is a phoneme label that depends on the context within the text. Figures 3 to 5 together show a single data structure for the full context label. How the accent estimation device 1 uses the full context label during processing is described later. As shown in Figures 3, 4, and 5, a full context label contains the data listed below.

Items p_1 to p_5 below are phoneme information.
p_1: identifier of the phoneme before the previous phoneme
p_2: identifier of the previous phoneme
p_3: identifier of the current phoneme (the central phoneme of p_1 to p_5)
p_4: identifier of the next phoneme
p_5: identifier of the phoneme after the next phoneme

Items a_1 to a_3 below are accent information.
a_1: difference between the accent type (mora position of the accent nucleus) and the position of the current mora
a_2: position of the current mora in the current accent phrase (forward)
a_3: position of the current mora in the current accent phrase (backward)

Items b_1 to b_3, c_1 to c_3, and d_1 to d_3 below are part-of-speech information.
b_1: part of speech of the previous word
b_2: conjugation form of the previous word
b_3: conjugation type of the previous word

c_1: part of speech of the current word
c_2: conjugation form of the current word
c_3: conjugation type of the current word

d_1: part of speech of the next word
d_2: conjugation form of the next word
d_3: conjugation type of the next word

Items e_1 to e_5, f_1 to f_8, and g_1 to g_5 below are accent information.
e_1: number of moras in the previous accent phrase
e_2: accent type of the previous accent phrase (mora position of the accent nucleus)
e_3: whether the previous accent phrase is an interrogative
e_4: undefined context
e_5: whether a pause is inserted between the previous accent phrase and the current accent phrase

f_1: number of moras in the current accent phrase
f_2: accent type of the current accent phrase (mora position of the accent nucleus)
f_3: whether the current accent phrase is an interrogative
f_4: undefined context
f_5: position of the current accent phrase within the current breath group, in accent-phrase units (forward)
f_6: position of the current accent phrase within the current breath group, in accent-phrase units (backward)
f_7: position of the current accent phrase within the current breath group, in mora units (forward)
f_8: position of the current accent phrase within the current breath group, in mora units (backward)

g_1: number of moras in the next accent phrase
g_2: accent type of the next accent phrase (mora position of the accent nucleus)
g_3: whether the next accent phrase is an interrogative
g_4: undefined context
g_5: whether a pause is inserted between the next accent phrase and the current accent phrase

Items h_1 to h_2, i_1 to i_8, and j_1 to j_2 below are breath group information.
h_1: number of accent phrases in the previous breath group
h_2: number of moras in the previous breath group

i_1: number of accent phrases in the current breath group
i_2: number of moras in the current breath group
i_3: position of the current breath group, in breath-group units (forward)
i_4: position of the current breath group, in breath-group units (backward)
i_5: position of the current breath group, in accent-phrase units (forward)
i_6: position of the current breath group, in accent-phrase units (backward)
i_7: position of the current breath group, in mora units (forward)
i_8: position of the current breath group, in mora units (backward)

j_1: number of accent phrases in the next breath group
j_2: number of moras in the next breath group

Items k_1 to k_3 below are total-count information.
k_1: total number of breath groups in the utterance
k_2: total number of accent phrases in the utterance
k_3: total number of moras in the utterance

Figures 6 and 7 are schematic diagrams showing an example of the full context labels that the language analysis unit 354 extracts by performing language analysis on a text. The example is based on the text 「晴れ、のち、曇り。」 ("Sunny, later cloudy."). In these figures, one line of full context label data corresponds to one phoneme, although each line is wrapped partway through for display. Figure 6 shows lines 1 to 8 of the data, and Figure 7 shows lines 9 to 15. The beginning of each line corresponds to items p_1 to p_5 described above. The remaining parts of each line correspond to the items described above as follows.
Part beginning with "/A:": a_1 to a_3
Part beginning with "/B:": b_1 to b_3
Part beginning with "/C:": c_1 to c_3
Part beginning with "/D:": d_1 to d_3
Part beginning with "/E:": e_1 to e_5
Part beginning with "/F:": f_1 to f_8
Part beginning with "/G:": g_1 to g_5
Part beginning with "/H:": h_1 to h_2
Part beginning with "/I:": i_1 to i_8
Part beginning with "/J:": j_1 to j_2
Part beginning with "/K:": k_1 to k_3
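As a rough illustration of the line layout described above, the following sketch splits one label line into its leading phoneme field and the "/A:" to "/K:" parts. The sample string, the function names, and the inner delimiters (in particular, the current phoneme sitting between "-" and "+") are assumptions modeled on the HTS-style labels produced by Open JTalk; Figures 3 to 7 are not reproduced here, so the delimiters used in the embodiment may differ.

```python
import re

def split_label(label):
    """Split one full context label line into its parts.

    Key "P" holds the leading phoneme field; keys "A" to "K" hold the text
    that follows each "/A:" to "/K:" marker.
    """
    parts = re.split(r"/([A-K]):", label)
    sections = {"P": parts[0]}
    # With a capture group, re.split alternates marker letters and bodies.
    for key, body in zip(parts[1::2], parts[2::2]):
        sections[key] = body
    return sections

def central_phoneme(label):
    """Return p_3, assuming the layout p1^p2-p3+p4=p5 (an assumption)."""
    match = re.search(r"-(.+?)\+", split_label(label)["P"])
    return match.group(1) if match else None

# Hypothetical, shortened label line (not copied from Figures 6 and 7).
sample = "xx^sil-h+a=r/A:-1+1+2/F:2_1#0_xx@1_1|1_2/K:3+3-7"
sections = split_label(sample)
current = central_phoneme(sample)
```

Keeping the split generic (one regular expression over the "/X:" markers) avoids hard-coding the field separators inside each part, which this text does not specify.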

Next, the pre-training process and the label estimation process are each described in detail.

Pre-training uses, as training data, the speech signal of each utterance and the text corresponding to that speech signal. In the pre-training process, frame-level data within each accent phrase is created from the per-utterance data, and this frame-level data is used to train a statistical model for estimating the correct accent labels. The details are as follows.

The preprocessing unit 32 processes pairs of a speech signal and a text read from the training speech corpus storage unit 31.

The acoustic analysis unit 351 performs acoustic analysis of the speech signal. That is, it analyzes the acoustic features of each frame while shifting a window of a predetermined width. Specifically, the acoustic analysis unit 351 extracts acoustic features for each frame in units of, for example, 5 milliseconds (ms). The frame length may be other than 5 milliseconds. A mel spectrogram is used as the acoustic feature. A mel spectrogram is a frequency-domain representation of the speech signal in which the frequency axis has been converted to the mel scale. The mel spectrogram itself is an acoustic feature obtained by an existing method. As one example, it can be extracted with the function provided at the following URL.
URL https://librosa.org/doc/latest/generated/librosa.feature.melspectrogram.html

In this embodiment, the acoustic analysis unit 351 extracts the mel spectrogram over a range from a minimum frequency of 80 Hz to a maximum frequency of 7600 Hz, with 80 dimensions. The acoustic analysis unit 351 also standardizes the mel spectrogram of each utterance, using the mean and standard deviation calculated for each dimension of the mel spectrogram over the non-silent segments. The acoustic analysis unit 351 outputs these 80-dimensional values as the frame-level acoustic features. Standardizing each utterance in this way is expected to absorb variation between utterances and between speakers.
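The per-utterance standardization described above might be sketched as follows, using plain Python lists in place of a real 80-dimensional mel spectrogram. The function name, the boolean `voiced` mask marking non-silent frames, the use of the population standard deviation, and the guard against a zero standard deviation are all assumptions made for illustration.

```python
import math

def standardize_utterance(frames, voiced):
    """Standardize every frame using statistics taken from non-silent frames.

    frames: list of per-frame feature vectors (lists of floats).
    voiced: list of booleans, True where the frame is non-silent.
    """
    voiced_frames = [f for f, v in zip(frames, voiced) if v]
    n_dims = len(frames[0])
    means, stds = [], []
    for d in range(n_dims):
        values = [f[d] for f in voiced_frames]
        mean = sum(values) / len(values)
        var = sum((x - mean) ** 2 for x in values) / len(values)
        means.append(mean)
        # Fall back to 1.0 when a dimension is constant (added safeguard).
        stds.append(math.sqrt(var) or 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(n_dims)]
            for f in frames]

# Toy utterance: 4 frames, 2 dimensions, first frame silent.
frames = [[0.0, 10.0], [2.0, 10.0], [4.0, 30.0], [6.0, 50.0]]
voiced = [False, True, True, True]
normalized = standardize_utterance(frames, voiced)
```

Note that the statistics come only from the non-silent frames, while the transformation is applied to every frame of the utterance.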

The language analysis unit 354 performs language analysis on the text corresponding to the speech signal and obtains context-dependent full context labels. The language analysis includes morphological analysis and syntactic analysis of the text (sentence). The format of the full context label is as described with reference to Figures 3, 4, and 5. An example of extracted full context labels is as described with reference to Figures 6 and 7. The full context labels shown in these figures are themselves also used in existing methods. The language analysis unit 354 extracts the full context labels from the text using, for example, the analysis technology of the speech synthesis toolkit provided at the following URL.
URL http://open-jtalk.sourceforge.net

As described below, the full context label contains information such as the position of the breath group within the sentence that contains the phoneme, the position of the accent phrase within the breath group, and the position of the mora within the accent phrase.

The central phoneme extraction unit 353 extracts the central phoneme from each full context label output by the language analysis unit 354.

The phoneme segmentation unit 352 performs phoneme segmentation based on the speech signal and the central phonemes of the full context labels extracted by the central phoneme extraction unit 353. Through this processing, the phoneme segmentation unit 352 obtains phoneme interval information for each phoneme contained in the speech signal. The phoneme interval information consists of the start time and end time of the phoneme. That is, the phoneme segmentation unit 352 identifies the interval of each phoneme within the speech signal. The phoneme segmentation unit 352 performs the segmentation using, for example, the forced alignment technique of the speech recognition toolkit provided at the following URL.
URL http://htk.eng.cam.ac.uk

In addition to the phoneme segmentation performed by the phoneme segmentation unit 352, the segmentation may also be checked and corrected manually.

Meanwhile, the accent phrase division unit 371 divides the full context labels output by the language analysis unit 354 into accent phrase units. The accent phrase division unit 371 passes the full context label information, divided into accent phrases, to the accent-phrase-unit integration unit 362.

The mora-unit integration unit 361 integrates the acoustic feature data output by the acoustic analysis unit 351 into mora units. Specifically, the mora-unit integration unit 361 determines the mora-level acoustic features using the central phonemes of the full context labels extracted by the central phoneme extraction unit 353 and the phoneme interval information produced by the phoneme segmentation unit 352. The acoustic features of a mora are the set of frame-level acoustic features within that mora.
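One way to picture the mora-unit integration is to group the aligned phonemes into moras and then collect the frame indices that fall inside each mora. The sketch below does this under several assumptions that the text does not state: Open JTalk-style phoneme symbols, a rule that a mora ends at a vowel, the moraic nasal "N", or the geminate marker "cl", interval times given in seconds, and hypothetical function names.

```python
FRAME_SEC = 0.005  # 5 ms frame shift, as in the embodiment

# Assumed symbol inventory: a mora ends at a vowel, "N", or "cl".
MORA_FINAL = set("aiueoAIUEO") | {"N", "cl"}
SILENCE = {"sil", "pau"}

def group_into_moras(phonemes):
    """phonemes: list of (symbol, start_sec, end_sec) from forced alignment.

    Returns a list of (mora_name, start_sec, end_sec).
    """
    moras, current = [], []
    for symbol, start, end in phonemes:
        if symbol in SILENCE:
            continue  # silences and pauses belong to no mora
        current.append((symbol, start, end))
        if symbol in MORA_FINAL:
            name = "".join(s for s, _, _ in current)
            moras.append((name, current[0][1], current[-1][2]))
            current = []
    return moras

def frames_in(start, end):
    """Indices of the frames whose start time lies in [start, end)."""
    return list(range(round(start / FRAME_SEC), round(end / FRAME_SEC)))

# Hypothetical alignment for the word "hare" preceded by silence.
phonemes = [("sil", 0.0, 0.1), ("h", 0.1, 0.13), ("a", 0.13, 0.2),
            ("r", 0.2, 0.225), ("e", 0.225, 0.3)]
moras = group_into_moras(phonemes)
mora_frames = [frames_in(start, end) for _, start, end in moras]
```

Given a list `features` of frame-level acoustic feature vectors, the acoustic features of the k-th mora would then be `[features[i] for i in mora_frames[k]]`.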

The accent-phrase-unit integration unit 362 integrates the mora-level acoustic feature data output by the mora-unit integration unit 361 into accent phrase units. Specifically, the accent-phrase-unit integration unit 362 integrates the mora-level acoustic features using the accent phrase information contained in the full context labels. The acoustic features of an accent phrase are the set of mora-level acoustic features contained in that accent phrase.

Meanwhile, the first language feature extraction unit 381 and the second language feature extraction unit 382 obtain language features from the full context labels output by the language analysis unit 354. The first language feature extraction unit 381 outputs frame-level language features within a mora. The second language feature extraction unit 382 outputs frame-level language features within an accent phrase. The first language feature extraction unit 381 assigns the following four pieces of information to each frame in each mora of the target accent phrase.
(1) A numerical value representing the position of the breath group within the sentence
(2) A numerical value representing the position of the accent phrase within the breath group
(3) A numerical value representing the position of the mora within the accent phrase
(4) A numerical value representing the position of the frame within the mora

Specifically, the four values are as follows. Value (1) is the position of the breath group within the sentence divided by the total number of breath groups in the sentence. Expressed with the symbols of the full context label format in Figures 3 to 5, the breath group position BG (breath group) is BG = i_3 / k_1. Value (2) is the position of the accent phrase within the breath group divided by the total number of accent phrases in the breath group. Expressed with the same symbols, the accent phrase position AP (accent phrase) is AP = f_5 / i_1. Value (3) is the position of the mora within the accent phrase divided by the total number of moras in the accent phrase. Expressed with the same symbols, the mora position MR (mora) is MR = a_2 / f_1. Value (4) is the position of the frame within the mora divided by the total number of frames in the mora. That is, the frame position FR (frame) is FR = frame_i / frame_mora, where frame_i is the frame position within the mora (an integer) and frame_mora is the total number of frames in that mora. These four values together form the four-dimensional frame-level language feature.
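The four ratios above can be computed directly from the label fields. The sketch below is a minimal illustration for a single mora; the function name is hypothetical, and it assumes that the frame position frame_i counts from 1, which the text does not state explicitly.

```python
def frame_language_features(i3, k1, f5, i1, a2, f1, n_frames):
    """Return one [BG, AP, MR, FR] vector per frame of a single mora."""
    bg = i3 / k1  # breath group position / breath groups in the utterance
    ap = f5 / i1  # accent phrase position / accent phrases in the breath group
    mr = a2 / f1  # mora position / moras in the accent phrase
    # Assumption: frame_i starts at 1, so FR runs from 1/n_frames up to 1.0.
    return [[bg, ap, mr, frame_i / n_frames]
            for frame_i in range(1, n_frames + 1)]

# Second mora of a 4-mora accent phrase, in the first of two breath groups,
# where the mora spans two frames.
features = frame_language_features(i3=1, k1=2, f5=1, i1=1,
                                   a2=2, f1=4, n_frames=2)
```

Within one mora, only the last component changes from frame to frame; BG, AP, and MR are shared by every frame of the mora.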

The first language feature extraction unit 381 extracts and outputs the set of frame-level language features within a mora. That is, the language features of a mora consist of the frame-level language feature repeated once for each frame in the mora.

The second language feature extraction unit 382 extracts and outputs the set of frame-level language features within an accent phrase. That is, the language features of an accent phrase are the concatenation of the language features of all moras in that accent phrase.

The acoustic feature/language feature integration unit 391 then integrates the acoustic features output by the accent-phrase-unit integration unit 362 with the language features output by the second language feature extraction unit 382, and the result serves as the input data for training the model.
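The text here does not spell out how the two feature streams are combined. Assuming that "integration" means concatenating the 80-dimensional acoustic vector and the 4-dimensional language vector of each frame, the model input would have 84 dimensions per frame, as in this sketch (the function name is hypothetical).

```python
def integrate_features(acoustic, linguistic):
    """Concatenate per-frame acoustic and language feature vectors.

    Assumption: integration is frame-wise concatenation.
    """
    if len(acoustic) != len(linguistic):
        raise ValueError("both sequences must cover the same frames")
    return [a + l for a, l in zip(acoustic, linguistic)]

# Three frames of placeholder data for one mora.
acoustic = [[0.0] * 80 for _ in range(3)]
linguistic = [[0.5, 1.0, 0.5, (i + 1) / 3] for i in range(3)]
model_input = integrate_features(acoustic, linguistic)
```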

The first correct label extraction unit 386 and the second correct label extraction unit 387 extract the correct labels from the full context labels by the following processing. Japanese accent can be represented in simplified form by two tones, in which the pitch of each mora is either high (H) or low (L). Within one accent phrase, there is at most one mora position at which the pitch falls. In standard Japanese (also called the Tokyo dialect), except when the first mora of the accent phrase is high and the pitch falls at the second mora (the head-high type), the first mora is low and the pitch rises at the second mora.

Figure 8 is a flowchart showing the procedure by which the first correct label extraction unit 386 and the second correct label extraction unit 387 in the preprocessing unit 32 obtain the correct labels in accent phrase units. As a precondition, the language analysis unit 354 has already obtained the full context labels by analyzing the text. The process of obtaining the correct labels is described below along this flowchart.

In step S11, the first correct label extraction unit 386 converts the text into a kana reading in mora units, based on the full context labels obtained by the language analysis unit 354. For example, if the original text is 「あらゆる現実を、すべて、自分の方へ捻じ曲げたのだ。」 ("He twisted all of reality toward himself."), this step yields the reading 「ア／ラ／ユ／ル／ゲ／ン／ジ／ツ／オ／ス／ベ／テ／ジ／ブ／ン／ノ／ホ／オ／エ／ネ／ジ／マ／ゲ／タ／ノ／ダ」 (a/ra/yu/ru/ge/n/ji/tsu/o/su/be/te/ji/bu/n/no/ho/o/e/ne/ji/ma/ge/ta/no/da). In this reading, each slash (/) marks a mora boundary.

Next, in step S12, the second correct label extraction unit 387 refers to the full context labels and merges the mora-unit reading above into a reading in accent phrase units. Through this step, the reading in the example above is converted into the accent-phrase-unit reading 「アラユル／ゲンジツオ／スベテ／ジブンノホオエ／ネジマゲタノダ」 (arayuru/genjitsuo/subete/jibunnohooe/nejimagetanoda). Here, each slash (/) marks an accent phrase boundary.
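Steps S11 and S12 can be pictured as regrouping a flat mora sequence by the mora count of each accent phrase. The sketch below reproduces the example reading; the function name is hypothetical, and the phrase lengths 4, 5, 3, 7, 7 are read off the example (in practice they would come from f_1 of each accent phrase).

```python
def group_moras(moras, phrase_lengths):
    """Join a flat mora sequence into accent phrases of the given lengths."""
    if sum(phrase_lengths) != len(moras):
        raise ValueError("phrase lengths must cover every mora")
    phrases, pos = [], 0
    for n in phrase_lengths:
        phrases.append("".join(moras[pos:pos + n]))
        pos += n
    return phrases

# Mora-unit reading from step S11 (ASCII slashes used as separators here).
reading = ("ア/ラ/ユ/ル/ゲ/ン/ジ/ツ/オ/ス/ベ/テ/"
           "ジ/ブ/ン/ノ/ホ/オ/エ/ネ/ジ/マ/ゲ/タ/ノ/ダ")
moras = reading.split("/")
phrases = group_moras(moras, [4, 5, 3, 7, 7])
```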

Next, in step S13, the second correct label extraction unit 387 refers to the full context labels and creates the tones in accent phrase units that correspond to the accent-phrase-unit reading above. Specifically, the second correct label extraction unit 387 creates these tones from the information on the mora position of the accent nucleus in the full context labels. The accent nucleus is the mora immediately before the pitch falls. More precisely, the unit observes the position of the mora within the accent phrase (a_2 in the full context label format), the number of moras in the accent phrase (f_1), and the accent type of the accent phrase (f_2). When f_2 = 1, the phrase is of the head-high type: the tone at mora position a_2 = 1 is H, and the tone at every other position is L. When f_2 = f_1, the phrase is of the flat type or the tail-high type: the tone at a_2 = 1 is L, and the tone at every other position is H. All other cases are of the mid-high type: the tone is L at a_2 = 1 and at positions where a_2 > f_2, and H at every other position. The details of this step are described later with reference to Figure 9.
Through this step, the tones 「ＬＨＨＬ／ＬＨＨＨＨ／ＨＬＬ／ＬＨＨＨＨＬＬ／ＬＨＨＬＬＬＬ」 in accent phrase units are obtained for the accent-phrase-unit reading in the example above. Here too, each slash (/) marks an accent phrase boundary. L stands for low and H for high.
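The three rules of step S13 translate into a short function. The sketch below follows the rules exactly as stated in the text; the function names are hypothetical, and the (f_1, f_2) pairs in the example are values chosen here so that the output matches the example tones, not values quoted from the figures.

```python
def mora_tone(a2, f1, f2):
    """Tone (H or L) of the mora at position a2 in an accent phrase.

    a2: mora position within the phrase (1-based)
    f1: number of moras in the phrase
    f2: accent type (mora position of the accent nucleus)
    """
    if f2 == 1:   # head-high type
        return "H" if a2 == 1 else "L"
    if f2 == f1:  # flat or tail-high type
        return "L" if a2 == 1 else "H"
    # mid-high type: low on the first mora and after the nucleus
    return "L" if (a2 == 1 or a2 > f2) else "H"

def phrase_tones(f1, f2):
    """Tone string for a whole accent phrase."""
    return "".join(mora_tone(a2, f1, f2) for a2 in range(1, f1 + 1))

# Assumed (f_1, f_2) pairs for the five accent phrases of the example.
example = [(4, 3), (5, 5), (3, 1), (7, 5), (7, 3)]
tones = "/".join(phrase_tones(f1, f2) for f1, f2 in example)
```

The head-high test comes first, so a one-mora phrase with f_2 = 1 is treated as head-high rather than as flat or tail-high.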

Next, in step S14, the second correct label extraction unit 387 converts the tones in accent phrase units into correct accent labels. In this embodiment, the correct accent label is a sequence of the three values 0, 1, and 2 that corresponds to the tones in accent phrase units described above. The meanings of these values are as follows. The value 2 is assigned to a mora in the accent phrase when the pitch falls at the next mora. That is, when the tone changes from H to L between adjacent moras, the mora carrying the H is labeled 2. The value 1 is assigned to a mora when the pitch rises at the next mora. That is, when the tone changes from L to H between adjacent moras, the mora carrying the L is labeled 1. The value 0 is assigned to a mora when the pitch does not change at the next mora. That is, when the tones of adjacent moras are L-L or H-H, the earlier of the two moras is labeled 0. The last mora in the accent phrase is also labeled 0.
Through this step, the correct labels 「１０２０／１００００／２００／１０００２００／１０２００００」 in accent phrase units are obtained from the tones in the example above. Here too, each slash (/) marks an accent phrase boundary, and each digit corresponds to one mora.
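The conversion in step S14 compares each mora's tone with the tone of the following mora. A minimal sketch, with a hypothetical function name, that reproduces the example labels:

```python
def tones_to_labels(tones):
    """Convert a tone string such as "LHHL" into a label string such as "1020"."""
    if not tones:
        return ""
    labels = []
    for current, following in zip(tones, tones[1:]):
        if current == "H" and following == "L":
            labels.append("2")  # pitch falls at the next mora
        elif current == "L" and following == "H":
            labels.append("1")  # pitch rises at the next mora
        else:
            labels.append("0")  # no change (L-L or H-H)
    labels.append("0")          # the last mora is always 0
    return "".join(labels)

tone_phrases = "LHHL/LHHHH/HLL/LHHHHLL/LHHLLLL".split("/")
labels = "/".join(tones_to_labels(t) for t in tone_phrases)
```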

In other words, the value "2" in a correct label corresponds to the mora immediately before the pitch falls. The value "1" corresponds to the mora immediately before the pitch rises. The value "0" corresponds to a mora after which the pitch does not change, or to the last mora in the accent phrase.

In the same step, the second correct label extraction unit 387 further assigns the value of each mora (its correct label) to every frame in that mora. That is, the second correct label extraction unit 387 copies the value corresponding to the correct label of a mora (0, 1, or 2) once for each frame contained in that mora. In this way, the second correct label extraction unit 387 generates a sequence of frame-level values within the accent phrase. These one-dimensional values are the frame-level correct labels within the accent phrase.
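Expanding mora-level labels to frame-level labels is a simple repetition. The sketch below assumes that the number of frames in each mora is already known from the phoneme segmentation; the function name and the frame counts are hypothetical.

```python
def expand_to_frames(mora_labels, frames_per_mora):
    """Repeat each mora's label once per frame contained in that mora."""
    if len(mora_labels) != len(frames_per_mora):
        raise ValueError("one frame count is needed per mora")
    frame_labels = []
    for label, n_frames in zip(mora_labels, frames_per_mora):
        frame_labels.extend([label] * n_frames)
    return frame_labels

# Labels "1020" of the first example phrase, with made-up frame counts.
frame_labels = expand_to_frames([1, 0, 2, 0], [3, 2, 4, 1])
```

The resulting sequence has the same length as the frame-level input features of the accent phrase, so each input frame is paired with exactly one label.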

Through the processing described above, the acoustic feature/language feature integration unit 391 of the preprocessing unit 32 outputs data in which the acoustic features and the language features are integrated. The second correct label extraction unit 387 outputs the correct accent labels. Pairs of these two kinds of data are used by the model learning unit 33 to train the statistical model.

Figure 9 is a flowchart showing the details of step S13 in Figure 8, that is, the procedure by which the second correct label extraction unit 387 creates the tones in accent phrase units. The procedure is described below along this flowchart.

In step S21, the second correct label extraction unit 387 determines whether f_2 = 1. That is, it determines whether the value of the accent type of the accent phrase is 1. If f_2 = 1 (step S21: YES), the process proceeds to the determination in step S22. Otherwise (step S21: NO), the process proceeds to the determination in step S25.

In step S22, the second correct label extraction unit 387 determines whether a_2 = 1. That is, it determines whether the position of the mora in the accent phrase is 1. If a_2 = 1 (step S22: YES), the tone is set to "H" (step S24) and the processing of this flowchart ends. Otherwise (step S22: NO), the tone is set to "L" (step S23) and the processing of this flowchart ends.

If the process proceeds to step S25, the second correct label extraction unit 387 determines in that step whether f_2 = f_1. That is, it determines whether the value of the accent type of the accent phrase is equal to the number of moras in the accent phrase. If f_2 = f_1 (step S25: YES), the process proceeds to the determination in step S26. Otherwise (step S25: NO), the process proceeds to the determination in step S29.

In step S26, the second correct label extraction unit 387 determines whether a_2 = 1. That is, it determines whether the position of the mora in the accent phrase is 1. If a_2 = 1 (step S26: YES), the tone is set to "L" (step S28) and the processing of this flowchart ends. Otherwise (step S26: NO), the tone is set to "H" (step S27) and the processing of this flowchart ends.

If the process proceeds to step S29, the second correct label extraction unit 387 determines in that step whether the condition (a_2 = 1 or a_2 > f_2) is satisfied. a_2 = 1 corresponds to the case where the position of the mora in the accent phrase is 1. a_2 > f_2 corresponds to the case where the position of the mora in the accent phrase is greater than the accent type of the accent phrase. If (a_2 = 1 or a_2 > f_2) is satisfied (step S29: YES), the tone is set to "L" (step S31) and the processing of this flowchart ends. Otherwise (step S29: NO), the tone is set to "H" (step S30) and the processing of this flowchart ends.

以上、説明したように、第２正解ラベル抽出部３８７は、本フローチャートで説明した処理によって、アクセント句単位のトーンを作成する。 As explained above, the second correct label extraction unit 387 creates tones for each accent phrase through the process described in this flowchart.
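As a reading aid, the decision flow of steps S21 to S31 can be written as a small function. This is a minimal sketch that follows the flowchart exactly as described in the text; the function names `mora_tone` and `phrase_tones` are hypothetical, and the arguments follow the symbols in the text (f_1: number of moras in the accent phrase, f_2: accent type, a_2: 1-based mora position).

```python
def mora_tone(f1, f2, a2):
    """Return "H" or "L" for the mora at position a2 (1-based)."""
    if f2 == 1:                          # step S21
        return "H" if a2 == 1 else "L"   # step S22 -> S24 / S23
    if f2 == f1:                         # step S25
        return "L" if a2 == 1 else "H"   # step S26 -> S28 / S27
    if a2 == 1 or a2 > f2:               # step S29
        return "L"                       # step S31
    return "H"                           # step S30


def phrase_tones(f1, f2):
    """Tones of all moras in one accent phrase, as a string of H/L."""
    return "".join(mora_tone(f1, f2, a2) for a2 in range(1, f1 + 1))


tones_type3 = phrase_tones(4, 3)  # four moras, accent type 3
tones_type1 = phrase_tones(4, 1)  # four moras, accent type 1
```

For a four-mora phrase with accent type 3, this yields the tone string "LHHL", which corresponds to the label "1020" of the first accent phrase in the example above.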

図１０は、図８に示したステップＳ１４の処理における、アクセント句のトーンから正解ラベルの数値への変換の規則を示す概略図である。図示するように、第２正解ラベル抽出部３８７は、アクセント句のトーンにおける連続する２つのモーラのトーンのパターンに応じて、当該２つのモーラのうちの前側のモーラに対応する正解ラベルの数値を決定する。ただし、第２正解ラベル抽出部３８７は、アクセント句が終結するモーラ（図１０のパターン（Ｅ）または（Ｆ））に関しては、当該モーラのトーンに依らずに、当該モーラに対応する正解ラベルの数値を決定する。 Figure 10 is a schematic diagram showing the rules for converting the tones of an accent phrase into the numerical value of a correct label in the processing of step S14 shown in Figure 8. As shown in the figure, the second correct label extraction unit 387 determines the numerical value of the correct label corresponding to the earlier of two consecutive moras in the tone of the accent phrase, depending on the tone pattern of the two moras. However, for a mora that ends an accent phrase (pattern (E) or (F) in Figure 10), the second correct label extraction unit 387 determines the numerical value of the correct label corresponding to that mora, regardless of the tone of that mora.

Pattern (A) in FIG. 10 is the case where the tones of two consecutive moras change from "H" to "L"; in this case the correct label value corresponding to the earlier mora is "2". Pattern (B) is the case where the tones of two consecutive moras change from "L" to "H"; in this case the correct label value corresponding to the earlier mora is "1". Pattern (C) is the case where the tones of two consecutive moras continue as "H" then "H"; in this case the correct label value corresponding to the earlier mora is "0". Pattern (D) is the case where the tones of two consecutive moras continue as "L" then "L"; in this case the correct label value corresponding to the earlier mora is "0". In other words, patterns (C) and (D) are the cases where the tone continues unchanged across two consecutive moras. Patterns (E) and (F) correspond to the last mora in an accent phrase; in this case, regardless of whether the tone of the mora is "H" or "L", the correct label value corresponding to that mora is "0".
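The conversion rules of FIG. 10 can be sketched as follows. The sketch takes pattern (B) to be the rising case "L" to "H", which is what the definition of label "1" (the mora immediately before the pitch rises) implies. The function name `tones_to_labels` is hypothetical.

```python
def tones_to_labels(tones):
    """Convert a string of H/L tones (one per mora) into mora labels 0/1/2."""
    labels = []
    for i, tone in enumerate(tones):
        if i == len(tones) - 1:
            labels.append(0)              # patterns (E), (F): last mora
        elif tone == "H" and tones[i + 1] == "L":
            labels.append(2)              # pattern (A): pitch falls next
        elif tone == "L" and tones[i + 1] == "H":
            labels.append(1)              # pattern (B): pitch rises next
        else:
            labels.append(0)              # patterns (C), (D): no change
    return labels


labels_lhhl = tones_to_labels("LHHL")
labels_hll = tones_to_labels("HLL")
```

Applied to the tone strings "LHHL" and "HLL", the function reproduces the labels "1020" and "200" that appear in the example label sequence.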

モデル学習部３３は、前処理部３２から渡される、アクセント句を単位とする、入力データ（フレーム単位の音響特徴量と言語特徴量を結合したデータ。学習用データ。）と、正解ラベルとを用いて、統計モデルの学習を行う。統計モデルは、例えば、ＤＮＮ（ディープニューラルネットワーク）を用いて実現される。具体的には、モデル学習部は、上記の入力データに基づいてラベルの推定を行い、推定されたラベルと正解ラベルとの差を求め、その差に基づく誤差逆伝播法により、統計モデルの内部パラメーターを更新する。 The model learning unit 33 learns a statistical model using input data (learning data combining acoustic features and linguistic features on a frame-by-frame basis) in units of accent phrases passed from the preprocessing unit 32 and the correct label. The statistical model is realized, for example, using a DNN (deep neural network). Specifically, the model learning unit estimates a label based on the above input data, finds the difference between the estimated label and the correct label, and updates the internal parameters of the statistical model using the backpropagation method based on the difference.

統計モデルとしては、男性話者の音声用のモデルと女性話者の音声用のモデルとを区別して構築してよい。あるいは、統計モデルとして、男性話者あるいは女性話者を区別せず性別不特定のモデルを構築してもよい。男性話者用モデルの学習を行う場合には、モデル学習部３３は、男性話者の学習用データを使用する。女性話者用のモデルの学習を行う場合には、モデル学習部３３は、女性話者の学習用データを使用する。性別不特定のモデルの学習を行う場合には、モデル学習部３３は、男性話者および女性話者の両方の学習用データを混合して使用する。また、統計モデルとして特定の個人の話者用の統計モデルを構築しても良い。 The statistical models may be constructed by distinguishing between a model for the voices of male speakers and a model for the voices of female speakers. Alternatively, a gender-independent model may be constructed as the statistical model without distinguishing between male and female speakers. When training a model for male speakers, the model training unit 33 uses training data for male speakers. When training a model for female speakers, the model training unit 33 uses training data for female speakers. When training a gender-independent model, the model training unit 33 uses a mixture of training data for both male and female speakers. In addition, a statistical model for a specific individual speaker may be constructed as the statistical model.

The input data in the training data is, for example, 84-dimensional data obtained by concatenating 80-dimensional acoustic features and 4-dimensional language features. As the input data, for example, data for several hundred thousand accent phrases is used. The input data is standardized in advance by computing the mean and standard deviation of each of the 84 dimensions. The correct label data may be used as is, without standardization.
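A minimal sketch of the per-dimension standardization is shown below, using three dimensions in place of 84 for brevity. The function names are hypothetical. Two choices here are assumptions not stated in the text: the population standard deviation is used, and a dimension with zero variance is divided by 1.0 to avoid division by zero.

```python
from statistics import fmean, pstdev


def fit_standardizer(frames):
    """Compute the mean and standard deviation of every feature dimension."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    # Assumption: constant dimensions keep a divisor of 1.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(frames, means, stds):
    """Apply (x - mean) / std to every dimension of every frame."""
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


frames = [[1.0, 10.0, 5.0], [3.0, 20.0, 5.0], [5.0, 30.0, 5.0]]
means, stds = fit_standardizer(frames)
scaled = standardize(frames, means, stds)
```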

上記のように例えば数十万件の学習用データを、ランダムな順番に並べ替え、例えば８：１：１の割合で、それぞれ、訓練データ、検証データ、評価データに分割して利用するようにする。モデル学習部３３による学習処理は、ミニバッチ単位で行う。一度に扱うミニバッチサイズを例えば３２個のアクセント句とする。そのミニバッチごとに、フレーム数が最長となるアクセント句（フレーム数：frame_max）に合わせて、より短いアクセント句の後部にはゼロ詰めをして学習に使用する。 As described above, for example, hundreds of thousands of pieces of learning data are rearranged in a random order and divided into training data, validation data, and evaluation data in a ratio of, for example, 8:1:1 for use. The learning process by the model learning unit 33 is performed in mini-batch units. The mini-batch size handled at one time is, for example, 32 accent phrases. For each mini-batch, the shorter accent phrases are padded with zeros to match the accent phrase with the longest number of frames (number of frames: frame_max) and used for learning.
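The shuffling, the 8:1:1 split, and the zero padding of a mini-batch can be sketched as follows. The function names, the fixed random seed, and the way the remainder is assigned to the evaluation set are assumptions made for illustration only.

```python
import random


def split_811(items, seed=0):
    """Shuffle and split into training, validation, and evaluation sets (8:1:1)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


def pad_batch(batch, dim):
    """Append all-zero frames so every phrase has frame_max frames."""
    frame_max = max(len(phrase) for phrase in batch)
    return [phrase + [[0.0] * dim for _ in range(frame_max - len(phrase))]
            for phrase in batch]


train, valid, evaluation = split_811(range(100))
batch = [[[1.0, 2.0]], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # 2 phrases, dim 2
padded = pad_batch(batch, dim=2)
```

In the setting described in the text, a mini-batch would hold 32 accent phrases with 84 dimensions per frame; the sketch uses two phrases and two dimensions only to keep the example short.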

The process by which the model learning unit 33 trains the statistical model can itself be performed using existing technology. As an example, TensorFlow, an open-source machine learning toolkit, can be used. TensorFlow is described at the following URL:
URL https://tensorflow.org

図１１は、モデル学習部３３による統計モデルの学習時の処理の流れを示す概略図である。図示するように、統計モデルは、入力層４１１（Input Layer）と、マスキング層４１２（Masking）と、双方向ＬＳＴＭ４１３（Bidirectional LSTM）と、正規化層４１４（Layer Normalization）と、ドロップアウト層４１５（Dropout）と、全結合層４１６（Dense）と、ソフトマックス関数４１７（Softmax）との各層を含むように構成される。なお、ＬＳＴＭは、「Long short-term memory」の略である。双方向ＬＳＴＭは、「ＢＬＳＴＭ」と表記される場合がある。 Figure 11 is a schematic diagram showing the flow of processing when the model learning unit 33 learns a statistical model. As shown in the figure, the statistical model is configured to include the following layers: an input layer 411, a masking layer 412, a bidirectional LSTM 413, a normalization layer 414, a dropout layer 415, a fully connected layer 416, and a softmax function 417. Note that LSTM is an abbreviation for "Long short-term memory". Bidirectional LSTM is sometimes written as "BLSTM".

During training, the model learning unit 33 inputs to the input layer a tensor [32,frame_max,84] of 84-dimensional input data with a mini-batch size of 32, padded to frame_max, the largest number of frames in the mini-batch. The input data is processed by the input layer 411 and the masking layer 412. The masking layer 412 outputs a tensor [32,frame_max,84] carrying masking information that marks the zero-padded portion to be ignored. The masking information propagates to the bidirectional LSTM 413 and the subsequent layers, enabling calculations that ignore the zero-padded portion. The model learning unit 33 passes the output tensor [32,frame_max,84] from the masking layer 412 through six bidirectional LSTM layers 413 with 64 units each (the forward and backward outputs are added) and outputs a tensor [32,frame_max,64]. In the bidirectional LSTM layers 413, L2 regularization with a coefficient of 0.001 is applied to the kernel to prevent overfitting.

Then, to further improve generalization performance, the model learning unit 33 passes the output tensor [32,frame_max,64] from the six bidirectional LSTM layers 413 through the normalization layer 414 and the dropout layer 415 with a drop rate of 0.5, and outputs a tensor [32,frame_max,64]. The model learning unit 33 then passes the output tensor [32,frame_max,64] from the dropout layer 415 through the fully connected layer 416 with three units, and outputs a tensor [32,frame_max,3]. Based on the output tensor [32,frame_max,3] from the fully connected layer 416, the model learning unit 33 computes the softmax function 417 for each frame while ignoring the values masked by the masking. The model learning unit 33 outputs the tensor [32,frame_max,3] of estimated values, which is the output of the model.

モデル学習部３３は、前処理部３２から出力された正解ラベル[32,frame_max,1]と、上記の推定値[32,frame_max,3]とを用いて損失関数（差，loss）を演算する。具体的には、モデル学習部３３は、正解ラベルのゼロ詰め部分を無視するためのマスク値[32,frame_max,1]を求め、正解ラベル[32,frame_max,1]と推定値[32,frame_max,3]で、多クラス分類でラベルを用いて損失を算出する。損失を演算するためには、モデル学習部３３は、SparseCategoricalCrossentropy()関数（疎なカテゴリカル交差エントロピー）の演算を行い、その演算結果[32,frame_max]を求める。そして、モデル学習部３３は、その次元をexpand_dims()関数で拡張し[32,frame_max,1]とする。モデル学習部３３は、その値と上で求めたマスク値[32,frame_max,1]との論理積（logical_and()関数）をとり、得られた値の平均（reduce_mean()関数）を算出し損失とする。なお、reduce_mean()関数は、テンソルの特定の次元の平均をとる関数である。 The model learning unit 33 calculates a loss function (difference, loss) using the correct label [32,frame_max,1] output from the preprocessing unit 32 and the above estimated value [32,frame_max,3]. Specifically, the model learning unit 33 obtains a mask value [32,frame_max,1] for ignoring the zero-padded portion of the correct label, and calculates the loss using the labels in multi-class classification with the correct label [32,frame_max,1] and the estimated value [32,frame_max,3]. To calculate the loss, the model learning unit 33 calculates the SparseCategoricalCrossentropy() function (sparse categorical cross entropy) and obtains the calculation result [32,frame_max]. Then, the model learning unit 33 expands the dimension with the expand_dims() function to [32,frame_max,1]. The model learning unit 33 takes the logical product (logical_and() function) of this value and the mask value [32,frame_max,1] obtained above, and calculates the average (reduce_mean() function) of the obtained value as the loss. Note that the reduce_mean() function is a function that takes the average of a specific dimension of a tensor.
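The masked loss can be illustrated with a small pure-Python sketch rather than with TensorFlow calls. It rests on assumptions that go beyond the text: the mask is represented as 0/1 values and combined with the per-frame loss by multiplication (the text describes this step with logical_and()), the average is taken over all positions including padded ones (an alternative would be to divide by the number of valid frames), and probabilities are clipped at 1e-7 before the logarithm. The function name `masked_sparse_cce` is hypothetical.

```python
import math


def masked_sparse_cce(labels, probs, mask):
    """Mean sparse categorical cross-entropy with padded frames zeroed out.

    labels: [batch][frame] integer labels (0, 1, or 2)
    probs:  [batch][frame][3] softmax outputs
    mask:   [batch][frame] 1 for real frames, 0 for zero-padded frames
    """
    total = 0.0
    count = 0
    for label_seq, prob_seq, mask_seq in zip(labels, probs, mask):
        for label, p, m in zip(label_seq, prob_seq, mask_seq):
            frame_loss = -math.log(max(p[label], 1e-7))  # per-frame loss
            total += m * frame_loss                      # padded frames add 0
            count += 1
    return total / count


labels = [[1, 0]]                                   # second frame is padding
probs = [[[0.25, 0.5, 0.25], [0.1, 0.1, 0.8]]]
mask = [[1, 0]]
loss = masked_sparse_cce(labels, probs, mask)
```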

モデル学習部３３は、訓練データのバッチごとにモデルの訓練損失を計算し、訓練損失に関するモデルの重みの勾配を取得し、オプティマイザーを使用して、勾配に基づいてモデルの重みを更新する。モデル学習部３３は、その際、オプティマイザーとして、learning_rate=0.00001、decay=1e-6、momentum=0.9のＳＧＤ（Stochastic Gradient Descent、確率的勾配降下法）を利用する。 The model learning unit 33 calculates the training loss of the model for each batch of training data, obtains the gradient of the model weights with respect to the training loss, and uses an optimizer to update the model weights based on the gradient. In this case, the model learning unit 33 uses SGD (Stochastic Gradient Descent) with learning_rate=0.00001, decay=1e-6, and momentum=0.9 as the optimizer.

モデル学習部３３は、上記の訓練損失が前エポック（epoch）までの最小訓練損失を下回ったら、保持するモデルを更新する。モデル学習部３３は、訓練データのすべてのミニバッチで計算を済ませる処理を１エポックとして、最大１０００エポックまで学習を行う。モデル学習部３３は、１エポックごとに検証データでモデルの検証損失と検証精度を求め、検証損失が連続して50エポック改善しない場合は、学習を早期終了（EarlyStop）する。モデル学習部３３は、このようにして検証損失が最小となった時に保持したモデルを採用する。 The model learning unit 33 updates the retained model when the above training loss falls below the minimum training loss up to the previous epoch. The model learning unit 33 performs learning for up to 1000 epochs, with one epoch being the process of completing calculations for all mini-batches of training data. The model learning unit 33 calculates the validation loss and validation accuracy of the model using validation data for each epoch, and if the validation loss does not improve for 50 consecutive epochs, it terminates learning early. The model learning unit 33 adopts the retained model when the validation loss is minimized in this way.
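The stopping rule (at most 1000 epochs, stop after 50 consecutive epochs without improvement in the validation loss) can be sketched as below. This is a simplified illustration, not the training loop itself: it takes a precomputed sequence of per-epoch validation losses, and it tracks the best model by validation loss, which is one reading of the text. The function name and return values are hypothetical.

```python
def run_with_early_stopping(val_losses, max_epochs=1000, patience=50):
    """Return (best_epoch, best_loss, epochs_run) for per-epoch losses."""
    best_loss = float("inf")
    best_epoch = None
    stale = 0        # consecutive epochs without improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss:
            # A real trainer would save the model parameters here.
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break    # early stop
    return best_epoch, best_loss, epochs_run


history = [1.0, 0.8, 0.9, 0.7, 0.75, 0.8, 0.9]
result = run_with_early_stopping(history, patience=3)
```

With a patience of 3 in this toy history, training stops at epoch 7 and the model from epoch 4 is the one kept.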

モデル学習部３３は、採用したモデルのパラメーター情報を、モデル記憶部１１に書き込む。 The model learning unit 33 writes the parameter information of the adopted model to the model storage unit 11.

次に、ラベル推定の処理の詳細について説明する。ラベル推定の処理の際には、推定の基となるデータとして、発話ごとの音声信号と、当該音声信号に対応する書き起こしテキストが存在する。本実施形態では、推定用音声コーパス記憶部２１が、上記の音声信号およびテキストの対を保持している。 Next, the label estimation process will be described in detail. When performing label estimation, the data on which the estimation is based includes a voice signal for each utterance and a transcribed text corresponding to the voice signal. In this embodiment, the estimation voice corpus storage unit 21 holds the above pairs of voice signals and text.

前処理部２２は、前述の前処理部３２とほぼ同様の構成を有している。しかし、前処理部２２は、正解ラベルを抽出するための構成を持つ必要がない。即ち、図２に示した前処理部３２の第１正解ラベル抽出部３８６および第２正解ラベル抽出部３８７の機能を、前処理部２２が持つ必要はない。前処理部２２は、発話ごとのテキストと音声信号とに基づいて、前処理部３２と同様に、アクセント句ごとに、フレーム単位の音響特徴量と言語特徴量からなる入力データを作成し、出力する。前処理部３２について図２を参照しながら説明済みであるため、ここでは、前処理部２２が音響特徴量と言語特徴量からなる入力データを作成するための詳細な手順についての説明を省略する。 The preprocessing unit 22 has a configuration similar to that of the preprocessing unit 32 described above. However, the preprocessing unit 22 does not need to have a configuration for extracting correct labels. In other words, the preprocessing unit 22 does not need to have the functions of the first correct label extraction unit 386 and the second correct label extraction unit 387 of the preprocessing unit 32 shown in FIG. 2. The preprocessing unit 22 creates and outputs input data consisting of acoustic features and language features on a frame-by-frame basis for each accent phrase based on the text and speech signal for each utterance, similar to the preprocessing unit 32. Since the preprocessing unit 32 has already been described with reference to FIG. 2, a detailed description of the procedure by which the preprocessing unit 22 creates input data consisting of acoustic features and language features will be omitted here.

ラベル推定部２３は、前処理部２２が出力した入力データを基に、アクセントについてのラベルを推定する。 The label estimation unit 23 estimates labels for accents based on the input data output by the preprocessing unit 22.

図１２は、ラベル推定部２３がラベルを推定する処理の流れを示す概略図である。図示するように、このときの統計モデルの構成は、図１１に示した構成（符号４１１から４１７まで）と同一である。ただし、推定時には、ドロップアウト層４１５は無視される。ラベル推定処理の時点では、統計モデルは既に学習済みである。即ち、統計モデルが持つパラメーター（重み）の値は、既に決定されている。ラベル推定部２３によるラベル推定処理の際には、正解ラベルに基づくパラメーターの調整を行わない。 Figure 12 is a schematic diagram showing the flow of processing in which the label estimation unit 23 estimates a label. As shown in the figure, the configuration of the statistical model at this time is the same as the configuration shown in Figure 11 (reference numerals 411 to 417). However, during estimation, the dropout layer 415 is ignored. At the time of label estimation processing, the statistical model has already been trained. In other words, the values of the parameters (weights) of the statistical model have already been determined. During label estimation processing by the label estimation unit 23, no adjustment of parameters based on the correct label is performed.

The input data that the label estimation unit 23 receives from the preprocessing unit 22 is a tensor [1,frame,84] with a mini-batch size of 1, a frame count of frame, and 84 dimensions. The label estimation unit 23 feeds the input data into the input layer 411. The processing in each layer from the input layer 411 to the softmax function 417 is as already described for model training. Using the trained model, the label estimation unit 23 obtains an estimate [1,frame,3] with a mini-batch size of 1, a frame count of frame, and 3 dimensions. For each frame of the obtained estimate [1,frame,3], the label estimation unit 23 takes the index (0, 1, or 2) of the largest of the three values as the estimated label for that frame. Since one mora contains many frames, the label estimation unit 23 determines the estimated label of each mora by taking a majority vote over the frame-level estimated labels (0, 1, or 2) within the mora.
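The per-frame decision and the per-mora majority vote can be sketched as follows. The function names and the example probabilities are hypothetical. The text does not say how ties in the vote are resolved; in this sketch a tie goes to the label that appears first within the mora, which is an assumption.

```python
from collections import Counter


def frame_argmax(frame_probs):
    """Index (0, 1, or 2) of the largest value in each frame's output."""
    return [max(range(len(p)), key=p.__getitem__) for p in frame_probs]


def mora_labels_by_vote(frame_labels, frames_per_mora):
    """Majority vote of frame labels inside each mora."""
    labels, start = [], 0
    for n_frames in frames_per_mora:
        votes = Counter(frame_labels[start:start + n_frames])
        labels.append(votes.most_common(1)[0][0])
        start += n_frames
    return labels


frame_probs = [
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1],  # mora 1 (3 frames)
    [0.1, 0.1, 0.8], [0.2, 0.2, 0.6],                   # mora 2 (2 frames)
    [0.9, 0.05, 0.05],                                  # mora 3 (1 frame)
]
frame_labels = frame_argmax(frame_probs)
mora_labels = mora_labels_by_vote(frame_labels, [3, 2, 1])
```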

アクセント句内のモーラ単位の推定ラベルの値により、アクセントを特定することができる。合成音声による発話を生成する場合には、モーラ単位の推定ラベルの値により、音の高さを調整することが可能となる。モーラ単位の推定ラベルが「１」の場合には、当該モーラの次のモーラにおいて声の高さが上昇する。モーラ単位の推定ラベルが「２」の場合には、当該モーラの次のモーラにおいて声の高さが下降する。モーラ単位の推定ラベルが「０」の場合には、当該モーラの次のモーラで声の高さが変化する状況が生じない（アクセント句内に次のモーラが存在しない場合を含む）。 The accent can be identified by the value of the estimated label for each mora in the accent phrase. When generating speech using synthetic speech, the pitch can be adjusted by the value of the estimated label for each mora. If the estimated label for each mora is "1", the pitch of the voice will rise in the mora following that mora. If the estimated label for each mora is "2", the pitch of the voice will fall in the mora following that mora. If the estimated label for each mora is "0", there will be no situation in which the pitch changes in the mora following that mora (including cases where there is no next mora in the accent phrase).

FIG. 13 is a block diagram showing an example of an internal configuration for realizing the functions of the accent estimation device 1. The accent estimation device 1 (or at least a part of it) can be realized using a computer. As shown in the figure, the computer includes a central processing unit 901, a RAM 902, an input/output port 903, input/output devices 904 and 905 and the like, and a bus 906. The computer itself can be realized using existing technology. The central processing unit 901 executes instructions included in a program read from the RAM 902 or the like. In accordance with each instruction, the central processing unit 901 writes data to the RAM 902, reads data from the RAM 902, and performs arithmetic and logical operations. The RAM 902 stores data and programs. Each element of the RAM 902 has an address and can be accessed using that address. Note that RAM is an abbreviation for "random access memory." The input/output port 903 is a port through which the central processing unit 901 exchanges data with external input/output devices and the like. The input/output devices 904 and 905 exchange data with the central processing unit 901 via the input/output port 903. The bus 906 is a common communication path used inside the computer. For example, the central processing unit 901 reads and writes data in the RAM 902 via the bus 906. Also, for example, the central processing unit 901 accesses the input/output port via the bus 906.

At least some of the functions of the accent estimation device 1 in the embodiment described above can be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on the recording medium. Note that the "computer system" referred to here includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, CD-ROMs, DVD-ROMs, and USB memories, and to storage devices such as hard disks built into a computer system. In other words, the "computer-readable recording medium" may be a non-transitory computer-readable recording medium. Furthermore, the "computer-readable recording medium" may also include a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as volatile memory inside a computer system serving as a server or client in that case. The program may be one for realizing some of the functions described above, and may also be one that realizes the functions described above in combination with a program already recorded in the computer system.

以上説明した通り、本実施形態では、アクセント句内のフレーム単位の音響特徴量と言語特徴量で学習した統計モデルを用いることによって、発話内の呼気段落位置、呼気段落内のアクセント句位置、アクセント句内のモーラ位置、モーラ内のフレーム位置といった発話の影響を考慮して、精度よくアクセントを推定することができる。また、本実施形態では、アクセント句内のフレーム単位の音響特徴量と言語特徴量、および正解ラベルに基づいて、統計モデルの学習処理を行うことができる。 As described above, in this embodiment, by using a statistical model trained on acoustic features and linguistic features on a frame-by-frame basis within an accent phrase, it is possible to accurately estimate an accent by taking into account the influence of an utterance, such as the breath paragraph position within the utterance, the accent phrase position within the breath paragraph, the mora position within the accent phrase, and the frame position within the mora. Furthermore, in this embodiment, the statistical model can be trained based on the acoustic features and linguistic features on a frame-by-frame basis within the accent phrase, and the correct answer label.

なお、次の変形例によってアクセント推定装置１、あるいは２や、アクセント推定モデル学習装置３を実施するようにしてもよい。上記実施形態では、アクセント句を単位として、テキストおよび音声信号に基づいて統計モデルの学習を行った。また、学習済みの統計モデルがあることを前提として、アクセント句を単位として、テキストおよび音声信号に基づいてアクセントのラベルを推定する処理を行った。変形例では、アクセント句の単位に限らず、その他の所定の単位のテキストおよび音声信号に基づいて、アクセントのラベルを推定したり、統計モデルの学習を行ったりしてもよい。また、変形例では、発話内の所定単位（アクセント句の単位に限らず、任意の単位）の位置の情報や、モーラの位置の情報や、フレームの位置の情報を、言語特徴量とする。 The accent estimation device 1 or 2 and the accent estimation model learning device 3 may be implemented by the following modified examples. In the above embodiment, a statistical model was learned based on text and speech signals with accent phrases as units. Also, assuming that a learned statistical model exists, a process was performed to estimate accent labels based on text and speech signals with accent phrases as units. In modified examples, accent labels may be estimated or statistical models may be learned based on text and speech signals of other predetermined units, not limited to accent phrase units. Also, in modified examples, information on the position of a predetermined unit (not limited to accent phrase units, any unit) in an utterance, information on the position of a mora, or information on the position of a frame is used as a linguistic feature.

また、上記実施形態では音響特徴量としてメルスペクトログラムを用いた。変形例として、メルスペクトログラム以外の音響特徴量を用いてアクセント推定装置１、あるいは２や、アクセント推定モデル学習装置３を実施するようにしてもよい。 In addition, in the above embodiment, a mel spectrogram is used as the acoustic feature. As a modification, the accent estimation device 1 or 2 and the accent estimation model learning device 3 may be implemented using an acoustic feature other than a mel spectrogram.

以上、この発明の実施形態および変形例について図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 The above describes in detail the embodiments and variations of the present invention with reference to the drawings, but the specific configuration is not limited to this embodiment and includes designs that do not deviate from the gist of the present invention.

[Evaluation of Examples]
Evaluation of the examples is described below. In the examples, the accent estimation model learning device 3 trained a speaker-independent statistical model for female voices and a speaker-independent statistical model for male voices. As training data for the speaker-independent statistical model for female voices, 289,197 accent phrases extracted from 37,226 utterances by eight female speakers were used. As training data for the speaker-independent statistical model for male voices, 196,956 accent phrases extracted from 39,065 utterances by nine male speakers were used.

また、学習済みの、女声用の不特定話者統計モデルおよび男声用の不特定話者統計モデルのそれぞれを用いて、ラベル推定の評価を行った。評価用のデータとしては、モデルの訓練用には使用していないデータを用いた。その結果、推定されたアクセントのラベルの精度は、次の通りであった。即ち、女声用の不特定話者統計モデルの精度は、９８．９８％であった。男声用の不特定話者統計モデルの精度は、９９．２８％であった。このように、本実施形態により、高い精度でアクセントのラベルを推定することができた。 Furthermore, label estimation was evaluated using the trained speaker-independent statistical model for female voices and speaker-independent statistical model for male voices. The evaluation data used was data that was not used for training the models. As a result, the accuracy of the estimated accent labels was as follows. That is, the accuracy of the speaker-independent statistical model for female voices was 98.98%. The accuracy of the speaker-independent statistical model for male voices was 99.28%. In this way, this embodiment was able to estimate accent labels with high accuracy.

本発明は、例えば、与えられたアクセント句に対して正確なアクセントのラベルを推定するために利用することができる。さらに本発明は、例えば、音声合成に利用することができる。但し、本発明の利用範囲はここに例示したものには限られない。 The present invention can be used, for example, to estimate an accurate accent label for a given accent phrase. Furthermore, the present invention can be used, for example, for speech synthesis. However, the scope of use of the present invention is not limited to the examples given here.

1, 2 Accent estimation device (signal processing device)
3 Accent estimation model learning device (signal processing device)
11 Model storage unit
21 Estimation speech corpus storage unit
22 Preprocessing unit
23 Label estimation unit
31 Training speech corpus storage unit
32 Preprocessing unit
33 Model learning unit
351 Acoustic analysis unit (training acoustic analysis unit)
352 Phoneme segmentation unit
353 Central phoneme extraction unit
354 Language analysis unit (training language analysis unit)
361 Mora unit integration unit
362 Accent phrase unit integration unit
371 Accent phrase division unit
381 First language feature extraction unit
382 Second language feature extraction unit
386 First correct label extraction unit (correct label extraction unit)
387 Second correct label extraction unit (correct label extraction unit)
391 Acoustic feature/language feature integration unit
901 Central processing unit
902 RAM
903 Input/output port
904, 905 Input/output device
906 Bus

Claims

A signal processing device comprising:
an acoustic analysis unit that calculates, for each predetermined unit of text, frame-by-frame acoustic features based on a speech signal corresponding to the predetermined unit of text;
a language analysis unit that calculates frame-by-frame linguistic features of the text based on the text in the predetermined unit; and
a label estimation unit that inputs the frame-by-frame acoustic features and the linguistic features into a pre-trained statistical model and obtains, as an output from the pre-trained statistical model, an estimated accent label corresponding to the text.

The signal processing device according to claim 1, further comprising:
a training acoustic analysis unit that calculates, for each predetermined unit of training text, frame-by-frame training acoustic features based on a training speech signal corresponding to the predetermined unit of training text;
a training language analysis unit that calculates frame-by-frame training linguistic features of the training text based on the training text in the predetermined unit;
a correct label extraction unit that determines a correct accent label corresponding to the text based on the training linguistic features; and
a model training unit that inputs the frame-by-frame training acoustic features and the training linguistic features into a statistical model, obtains, as an output from the statistical model, a training estimated accent label corresponding to the training text, and trains the statistical model based on an error between the training estimated label and the correct label.

A signal processing device comprising:
an acoustic analysis unit that calculates, for each predetermined unit of text, frame-by-frame acoustic features based on a speech signal corresponding to the predetermined unit of text;
a language analysis unit that calculates frame-by-frame linguistic features of the text based on the text in the predetermined unit;
a correct label extraction unit that determines a correct accent label corresponding to the text based on the linguistic features; and
a model training unit that inputs the frame-by-frame acoustic features and the linguistic features into a statistical model, obtains, as an output from the statistical model, an estimated accent label corresponding to the text, and trains the statistical model based on an error between the estimated label and the correct label.

The signal processing device according to any one of claims 1 to 3, wherein the predetermined unit is an accent phrase unit.

The signal processing device according to claim 4, wherein the linguistic features are a breath group position in an utterance, an accent phrase position in the breath group, a mora position in the accent phrase, and a frame position in the mora.

The signal processing device according to any one of claims 1 to 5, wherein the statistical model is implemented using a deep neural network.

The signal processing device according to any one of claims 1 to 6, wherein the acoustic features are a mel spectrogram.

A program for causing a computer to function as the signal processing device according to any one of claims 1 to 7.
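The four frame-level linguistic features recited in the claims (breath group position in the utterance, accent phrase position in the breath group, mora position in the accent phrase, and frame position in the mora) can be illustrated with a minimal sketch. The nested-list layout of the utterance, the function name `frame_position_features`, and the use of 0-based integer indices are all assumptions made for illustration; the actual encoding used by the language analysis unit may differ (for example, positions could be normalized).

```python
from typing import List, Tuple

# Hypothetical layout: an utterance is a list of breath groups, a breath group
# is a list of accent phrases, an accent phrase is a list of moras, and each
# mora is represented only by its length in frames.
Utterance = List[List[List[int]]]
FrameFeature = Tuple[int, int, int, int]


def frame_position_features(utterance: Utterance) -> List[FrameFeature]:
    """Return one (breath group, accent phrase, mora, frame) position per frame."""
    features: List[FrameFeature] = []
    for bg_pos, breath_group in enumerate(utterance):
        for ap_pos, accent_phrase in enumerate(breath_group):
            for mora_pos, n_frames in enumerate(accent_phrase):
                for frame_pos in range(n_frames):
                    features.append((bg_pos, ap_pos, mora_pos, frame_pos))
    return features


# Illustrative utterance: two breath groups, three accent phrases, seven frames.
example = [
    [[2, 1], [1]],  # breath group 0: accent phrases with 2 moras and 1 mora
    [[3]],          # breath group 1: one accent phrase with a 3-frame mora
]
features = frame_position_features(example)
for row in features:
    print(row)
```

Each resulting tuple lines up with one frame, so it could be concatenated with that frame's acoustic features (for example, one column of a mel spectrogram) before being passed to the statistical model.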