JP7601649B2

JP7601649B2 - Voice synthesis adjustment device

Info

Publication number: JP7601649B2
Application number: JP2021011012A
Authority: JP
Inventors: 歩坂口
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2024-12-17
Anticipated expiration: 2041-01-27
Also published as: JP2022114640A

Description

The present disclosure relates to a speech synthesis adjustment device that adjusts (tunes) the speech synthesis processing performed by a speech synthesis device.

Speech synthesis devices are known that apply natural language processing to an input sentence to obtain its prosodic information and then generate synthetic speech corresponding to the input sentence based on that prosodic information. For example, a technique is known that generates synthetic speech according to predetermined prosodic information generation rules in order to bring the synthetic speech closer to human speech. Here, "prosodic information" means information indicating the phonetic properties of human speech, such as pauses between phrases and word accents. Conventional speech synthesis devices aim to output speech that is optimal for everyone, but which speech is optimal varies with the domain in which the speech is used (in this document, "domain" means a classification determined by the usage scene and purpose). The prosodic information generation rules therefore have to be adjusted (tuned) as appropriate.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H09-62286

With conventional speech synthesis devices, however, this adjustment work takes a great deal of time and requires specialized knowledge.

In addition, although Patent Document 1 suggests performing speech synthesis with pauses (one kind of prosodic information) set according to the purpose, it does not disclose the details needed for adjustment based on more appropriate prosodic information, such as using pauses that depend on the sentence structure obtained by analyzing the input sentence. There was therefore room for improvement.

The present disclosure therefore aims to adjust speech synthesis based on more appropriate prosodic information, while shortening the time required for the adjustment work and without requiring specialized knowledge.

A speech synthesis adjustment device according to an embodiment of the present disclosure causes a speech synthesis device to create and output speech suited to a domain determined by the usage scene and purpose of the speech. The device includes: a domain setting unit that sets the domain based on externally supplied domain information; a text data acquisition unit that acquires text data, including prosodic information, corresponding to externally supplied input data; a determination unit that determines whether the prosodic information obtained from the text data acquired by the text data acquisition unit is new prosodic information not stored in a prosodic information dictionary that stores prosodic information for each domain in advance; a correction information acquisition unit that, when the determination unit determines that the prosodic information is new, transfers the text data acquired by the text data acquisition unit to the speech synthesis device and acquires correction information for the prosodic information of the synthetic speech output by the speech synthesis device; and a correction control unit that, based on the correction information acquired by the correction information acquisition unit, registers the new prosodic information for the corresponding domain in the prosodic information dictionary and causes the speech synthesis device to recreate and output speech corrected based on that correction information.

In the above speech synthesis adjustment device, the domain setting unit sets a domain based on externally supplied domain information, the text data acquisition unit acquires text data, including prosodic information, corresponding to externally supplied input data, and the determination unit determines whether the prosodic information obtained from the acquired text data is new prosodic information not stored in the prosodic information dictionary that stores prosodic information for each domain in advance. If the prosodic information obtained from the text data is determined to be new, the correction information acquisition unit transfers the acquired text data to the speech synthesis device and acquires correction information for the prosodic information of the synthetic speech output by the speech synthesis device. This correction information is, for example, input by the user after the speech synthesis device has output the synthetic speech to the user, and is transferred to the speech synthesis adjustment device via the speech synthesis device, where the correction information acquisition unit acquires it. The correction control unit then registers the new prosodic information for the corresponding domain in the prosodic information dictionary based on the acquired correction information, and causes the speech synthesis device to recreate and output speech corrected based on that correction information. The user can thus listen to speech that has been corrected based on the correction information. Because new prosodic information that was not in the prosodic information dictionary is registered for the corresponding domain based on the acquired correction information, part of the conventional processing can be automated and simplified the next time the same prosodic information is obtained from acquired text data. This makes it possible to adjust speech synthesis based on more appropriate prosodic information, while shortening the time required for the adjustment work and without requiring specialized knowledge.

According to the present disclosure, speech synthesis can be adjusted based on more appropriate prosodic information, while shortening the time required for the adjustment work and without requiring specialized knowledge.

Figure 1 is a configuration diagram of the speech synthesis adjustment device and related devices in the first embodiment.
Figure 2 is a diagram showing examples of various domains.
Figure 3(a) is a diagram showing an example of the pause rules for each sentence structure for domain A, and Figure 3(b) is a diagram showing an example of the pause rules for each sentence structure for domain B.
Figure 4 is a flow diagram showing the processing executed by the speech synthesis adjustment device and related devices in the first embodiment.
Figure 5 is a configuration diagram of the speech synthesis adjustment device and related devices in the second embodiment.
Figure 6 is a diagram showing examples of the accent type of each word for various domains.
Figure 7 is a flow diagram showing the processing executed by the speech synthesis adjustment device and related devices in the second embodiment.
Figure 8 is a diagram for explaining the process of extracting the fundamental frequency by speech analysis.
Figure 9 is a diagram showing an example of the hardware configuration of the speech synthesis adjustment device.

Various embodiments of the present disclosure are described below in order with reference to the accompanying drawings. In the description of the drawings, identical elements are given identical reference numerals, and duplicate descriptions are omitted.

[First embodiment]
The first embodiment newly registers or selects a pause rule (pause insertion pattern) for the sentence structure of the input text according to the set domain (the usage scene and purpose of the speech), and adjusts speech synthesis based on that pause rule.

Figure 1 shows an example of the configuration of the speech synthesis adjustment device 10 and related devices in the first embodiment. The related devices are a speech synthesis device 20, a control device 30, and other related devices 40. The control device 30 controls the entire system made up of the devices shown in Figure 1 and serves as the user interface: in normal operation it receives text input and domain information from the user, during correction it receives voice input and domain information, and it outputs synthetic speech to the user. The other related devices 40 convert user input into text data including prosodic information by performing processing such as speech recognition and natural language processing as necessary.

The speech synthesis device 20, while being adjusted by the speech synthesis adjustment device 10 described below, performs speech synthesis processing on the above text data based on appropriate prosodic information (in the first embodiment, for example, processing that creates synthetic speech with pauses inserted according to an appropriate pause rule). It has a built-in pause dictionary 21 that stores pause rules according to sentence structure for various domains.

The speech synthesis adjustment device 10 adjusts (tunes) the speech synthesis processing of the speech synthesis device 20 so that more appropriate prosodic information (in the first embodiment, pause rules) is used.

To realize this function, the speech synthesis adjustment device 10 includes a domain setting unit 11, a text data acquisition unit 12, a sentence structure determination unit 13, a pause correction information acquisition unit 14, a pause correction control unit 15, and a first creation control unit 16.

Of these, the domain setting unit 11 sets the domain based on externally supplied domain information, and the text data acquisition unit 12 acquires text data, including prosodic information, corresponding to externally supplied input data. The sentence structure determination unit 13 analyzes the sentence structure of the text data acquired by the text data acquisition unit 12 and determines whether the resulting sentence structure is new to the pause dictionary 21. The analysis is performed, for example, by dividing each sentence of the text data into phrases and classifying the sentence by how many phrases it contains and where its internal break falls (for example, the position where a comma "、" would appear in written text). For example, the example sentence for sentence structure id1 in Figure 3(a), which shows the pause rules for each sentence structure, is divided into first to sixth phrases separated by "/", as in "昨日の/夜は/かなり/冷え込みましたが、/ぐっすり/眠れましたか？" ("It got quite cold last night, but did you sleep well?"), and the break is found after the fourth phrase, "冷え込みましたが、" ("it got cold, but"). Likewise, the example sentence for sentence structure id3 is divided into first to sixth phrases, as in "昨日は/3000歩/歩いたので、/今日は/身体を/休めましょう！" ("You walked 3000 steps yesterday, so let's rest your body today!"), and the break is found after the third phrase, "歩いたので、" ("since you walked"). Sentence structures id1 and id3 are both divided into six phrases, but they are treated as different sentence structures because the break falls in a different place. This analysis method is only an example, and a different method may be used.
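The classification described above can be sketched in a few lines. This is only an illustrative sketch, not the patented analysis: it assumes the sentence has already been split into phrases marked with "/" (the patent leaves the segmentation method open), and the function name `sentence_structure` and the tuple it returns are hypothetical.

```python
def sentence_structure(segmented: str) -> tuple[int, tuple[int, ...]]:
    """Classify a sentence by phrase count and break positions.

    `segmented` is assumed to be pre-split into phrases with "/".
    A phrase ending in the Japanese comma "、" marks a break after it.
    """
    phrases = [p for p in segmented.split("/") if p]
    # 1-based indices of phrases that are followed by a break
    breaks = tuple(i for i, p in enumerate(phrases, start=1) if p.endswith("、"))
    return len(phrases), breaks


# The two example sentences quoted in the text (sentence structures id1 and id3)
id1 = sentence_structure("昨日の/夜は/かなり/冷え込みましたが、/ぐっすり/眠れましたか？")
id3 = sentence_structure("昨日は/3000歩/歩いたので、/今日は/身体を/休めましょう！")
print(id1, id3, id1 != id3)
```

Both sentences have six phrases, but the break positions differ, so the two keys compare unequal, which matches the statement that id1 and id3 are different sentence structures.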

When the sentence structure determination unit 13 determines that the sentence structure is new, the pause correction information acquisition unit 14 transfers the analyzed text data to the speech synthesis device 20 and then acquires, via the speech synthesis device 20, text data that corresponds to the user's corrected speech and includes pause correction information for the synthetic speech. Based on the pause correction information acquired by the pause correction information acquisition unit 14, the pause correction control unit 15 registers the pause rule for the relevant sentence structure and domain in the pause dictionary 21 and causes the speech synthesis device 20 to recreate and output speech whose pauses have been corrected based on the pause correction information.

When the sentence structure determination unit 13 determines that the sentence structure is not new, the first creation control unit 16 transfers the pause rule for the relevant sentence structure and domain, together with the text data, to the speech synthesis device 20 and causes the speech synthesis device 20 to create and output speech with pauses inserted according to that pause rule.

The series of processes executed by the speech synthesis adjustment device 10 with these functional units, together with the related devices, is described later.

Figure 2 shows examples of various domains, each determined by a combination of a usage scene and a purpose of the speech. As shown in Figure 2, examples of usage scenes include "nursing care," "childcare," and "news reading." A single usage scene can have several purposes, and each combination of usage scene and purpose is treated as one domain. For example, the usage scene "nursing care" has the purposes "for healthy people in their 70s" and "for people with slow cognitive judgment." The combination of "nursing care" and "for healthy people in their 70s" is one domain (domain A in Figure 3(a)), and the combination of "nursing care" and "for people with slow cognitive judgment" is another (domain B in Figure 3(b)).

Figures 3(a) and 3(b) show examples of the pause rules stored in the pause dictionary 21. As shown in Figure 3(a), the pause dictionary 21 stores, in association with sentence structure IDs, the pause rule for each sentence structure for domain A (usage scene "nursing care," purpose "for healthy people in their 70s"). Similarly, as shown in Figure 3(b), it stores the pause rule for each sentence structure for domain B (usage scene "nursing care," purpose "for people with slow cognitive judgment"). Comparing Figures 3(a) and 3(b) for the same sentence structure shows that, to reflect the difference between the two purposes, the pause rules in Figure 3(b) generally insert longer pauses than those in Figure 3(a). For example, at the break in each sentence (the position where a comma "、" would appear in written text), that is, after the fourth phrase of sentence structure id1, the fourth phrase of sentence structure id2, and the third phrase of sentence structure id3, the pause rules for domain A all specify a "medium pause," whereas the pause rules for domain B all specify a "long pause."
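One possible in-memory layout for such a pause dictionary is sketched below. The layout itself (a nested mapping keyed by domain, then sentence structure ID, then phrase number) and all names such as `Domain`, `pause_dictionary`, and `pause_after` are assumptions made for illustration. Only the break-position pauses stated in the text are filled in; the other pauses in Figures 3(a) and 3(b) are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Domain:
    """A domain is one combination of usage scene and purpose."""
    scene: str
    purpose: str


DOMAIN_A = Domain("nursing care", "for healthy people in their 70s")
DOMAIN_B = Domain("nursing care", "for people with slow cognitive judgment")

# Phrase number after which each sentence structure has its break (from the text)
BREAK_PHRASE = {"id1": 4, "id2": 4, "id3": 3}

# domain -> sentence structure id -> {phrase number: pause label}
pause_dictionary = {
    DOMAIN_A: {sid: {pos: "medium"} for sid, pos in BREAK_PHRASE.items()},
    DOMAIN_B: {sid: {pos: "long"} for sid, pos in BREAK_PHRASE.items()},
}


def pause_after(domain: Domain, structure_id: str, phrase_no: int) -> Optional[str]:
    """Return the stored pause label after a phrase, or None if none is stored."""
    return pause_dictionary[domain][structure_id].get(phrase_no)


print(pause_after(DOMAIN_A, "id1", 4))  # medium
print(pause_after(DOMAIN_B, "id1", 4))  # long
```

Keying the outer mapping by domain keeps the rules for different purposes separate even when the sentence structure is identical, which is the property the comparison of Figures 3(a) and 3(b) relies on.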

The series of processes executed by the speech synthesis adjustment device 10 and related devices is described below with reference to the flow diagram in Figure 4.

First, in the speech synthesis adjustment device 10, the domain setting unit 11 receives the domain information input by the user via the control device 30 and sets the domain indicated by that information (step S1). Specifically, the domain setting unit 11 selects the domain if it already exists in the pause dictionary 21, and registers it if it is a new domain not yet registered in the pause dictionary 21.
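Step S1 amounts to a select-or-register operation on the dictionary. A minimal sketch follows, assuming the pause dictionary is a plain mapping keyed by a (usage scene, purpose) pair; the function name `set_domain` and the returned status strings are hypothetical.

```python
def set_domain(pause_dictionary: dict, domain: tuple[str, str]) -> str:
    """Select an existing domain, or register a new one with no rules yet."""
    if domain in pause_dictionary:
        return "selected"
    pause_dictionary[domain] = {}  # new domain: no pause rules registered yet
    return "registered"


dictionary = {("nursing care", "for healthy people in their 70s"): {}}
print(set_domain(dictionary, ("nursing care", "for healthy people in their 70s")))
print(set_domain(dictionary, ("childcare", "example purpose")))  # purpose is a placeholder
```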

The input from the user is processed by the other related devices 40 (speech recognition, natural language processing, and so on) and converted into text data including prosodic information. The converted text data is output to the speech synthesis adjustment device 10 via the control device 30 and the speech synthesis device 20 (step S2).

In the speech synthesis adjustment device 10, the text data acquisition unit 12 acquires the text data output in step S2 (step S3), and the sentence structure determination unit 13 analyzes the sentence structure of the acquired text data (step S4) and determines whether the resulting sentence structure is new to the pause dictionary 21 (step S5).

If the sentence structure obtained by the analysis is determined to be new (YES in step S5), the pause correction information acquisition unit 14 of the speech synthesis adjustment device 10 transfers the analyzed text data (including prosodic information) as is to the speech synthesis device 20 and waits for pause correction information to arrive via the speech synthesis device 20 (step S6). The speech synthesis device 20 then creates speech with pauses inserted according to the prosodic information (here, the pause rule) included in the transferred text data, and outputs the synthetic speech to the user via the control device 30 (step S7). When the user, having listened to the synthetic speech, inputs corrected speech to the control device 30, the other related devices 40 process the corrected speech (speech recognition, natural language processing, and so on) and convert it into text data including corrected prosodic information (here, a corrected pause rule). The converted text data, including the corrected pause rule, is output to the speech synthesis adjustment device 10 via the control device 30 and the speech synthesis device 20 (step S8).

In the speech synthesis adjustment device 10, the pause correction information acquisition unit 14 acquires the converted text data (including the corrected pause rule). The pause correction control unit 15 registers the corrected pause rule obtained from that text data in the pause dictionary 21 as the pause rule for the relevant sentence structure and domain, and instructs the speech synthesis device 20 to correct the pauses based on the corrected pause rule and to recreate and output synthetic speech with the corrected pauses (step S9). On receiving this instruction, the speech synthesis device 20 corrects the pauses based on the corrected pause rule, recreates the synthetic speech with the corrected pauses, and outputs it to the user via the control device 30 (step S10). The user can thus listen to speech that has been corrected based on the corrected pause rule.

If the speech synthesis device 20 does not receive the converted text data (including a corrected pause rule) from the control device 30 within a predetermined time after outputting the synthetic speech to the user in step S7, it regards the speech output in step S7 as needing no pause correction and notifies the speech synthesis adjustment device 10 accordingly (the dashed line from S7 to S9 in Figure 4). In this case, in step S9, the pause correction control unit 15 registers the pause rule included in the text data acquired in step S3 in the pause dictionary 21 as the pause rule for the new sentence structure and the relevant domain. The speech recreation and output of step S10 is then unnecessary.
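The timeout behavior can be illustrated with a blocking queue. This is a sketch under assumptions: the patent does not say how the devices exchange messages, so the use of `queue.Queue` as the channel carrying corrected pause rules, the function name `rule_to_register`, and the representation of a pause rule as a dict are all hypothetical.

```python
import queue


def rule_to_register(original_rule: dict, corrections: queue.Queue, timeout_s: float):
    """Wait up to timeout_s for a corrected pause rule.

    Returns (rule, needs_resynthesis). If nothing arrives in time, the
    original rule is registered as is and no re-synthesis (step S10) is needed.
    """
    try:
        corrected = corrections.get(timeout=timeout_s)
        return corrected, True
    except queue.Empty:
        return original_rule, False


channel = queue.Queue()
print(rule_to_register({4: "medium"}, channel, timeout_s=0.05))  # no correction arrives
channel.put({4: "long"})
print(rule_to_register({4: "medium"}, channel, timeout_s=0.05))  # correction available
```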

If, on the other hand, the sentence structure obtained by the analysis is determined in step S5 not to be new (NO in step S5), the first creation control unit 16 of the speech synthesis adjustment device 10 transfers the pause rule for the relevant sentence structure and domain, together with the analyzed text data, to the speech synthesis device 20 and instructs it to create and output synthetic speech with pauses inserted according to that pause rule (step S11). On receiving this instruction, the speech synthesis device 20 creates the synthetic speech with pauses inserted according to the pause rule and outputs it to the user via the control device 30 (step S12). The user can thus listen to speech with appropriate pauses based on the pause rule registered in the pause dictionary 21.

According to the first embodiment described above, when the sentence structure obtained from the acquired text data is new, that is, when the pause information for that sentence structure in the set domain is not yet in the pause dictionary 21, the pause information for the domain, corrected based on the pause correction information obtained from the corrected speech, is registered in the pause dictionary 21. Part of the conventional processing can therefore be automated and simplified the next time input with the same sentence structure is obtained from acquired text data. This makes it possible to adjust speech synthesis based on more appropriate pause information, while shortening the time required for the adjustment work and without requiring specialized knowledge. The user can also listen to speech whose pauses have been appropriately corrected based on the corrected speech.

If the sentence structure obtained from the acquired text data is not new, the first creation control unit 16 transfers the pause rule for the relevant sentence structure and domain, together with the text data, to the speech synthesis device 20, and causes the speech synthesis device 20 to create and output speech with pauses inserted according to that pause rule. The user can therefore listen to speech with appropriate pauses based on the pause rule registered in the pause dictionary 21.
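The branch at step S5 and the two paths that follow it can be summarized in one function. The sketch below is an illustration under stated assumptions, not the device's implementation: the speech synthesis device and the user's correction are stood in for by the callbacks `synthesize` and `get_correction` (both hypothetical), a pause rule is represented as a dict from phrase number to pause label, and `get_correction` returning `None` models the case where no correction arrives in time.

```python
from typing import Callable, Hashable, Optional

PauseRule = dict[int, str]  # phrase number -> pause label (assumed representation)


def adjust(
    pause_dictionary: dict,
    domain: Hashable,
    structure: Hashable,
    text: str,
    initial_rule: PauseRule,
    synthesize: Callable[[str, PauseRule], None],
    get_correction: Callable[[], Optional[PauseRule]],
) -> PauseRule:
    """Run steps S5 to S12 for one input and return the rule now registered."""
    rules = pause_dictionary.setdefault(domain, {})
    if structure not in rules:                 # S5: new sentence structure
        synthesize(text, initial_rule)         # S6-S7: speak with the rule in the text data
        corrected = get_correction()           # S8: user's correction, or None on timeout
        if corrected is None:
            rules[structure] = initial_rule    # S9 without correction; S10 skipped
        else:
            rules[structure] = corrected       # S9: register the corrected rule
            synthesize(text, corrected)        # S10: speak again with corrected pauses
    else:                                      # S5: known sentence structure
        synthesize(text, rules[structure])     # S11-S12: speak with the registered rule
    return rules[structure]


spoken = []
store: dict = {}
adjust(store, "A", (6, (4,)), "first input", {4: "medium"},
       lambda t, r: spoken.append((t, dict(r))), lambda: {4: "long"})
adjust(store, "A", (6, (4,)), "second input", {4: "medium"},
       lambda t, r: spoken.append((t, dict(r))), lambda: None)
print(spoken)
```

On the second call the structure is already registered, so the stored rule is used directly and no correction is requested, which is the automation benefit the text describes.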

[Second embodiment]
The second embodiment newly registers or selects a rule for the accent of a word in the input text (hereinafter an "accent type") according to the set domain (the usage scene and purpose of the speech), and adjusts speech synthesis based on that accent type.

Figure 5 shows an example of the configuration of the speech synthesis adjustment device 10 and related devices in the second embodiment. The related devices are the speech synthesis device 20, the control device 30, and the other related devices 40. The control device 30 and the other related devices 40 have the same functions as in the first embodiment, so their description is not repeated.

The speech synthesis device 20, while being adjusted by the speech synthesis adjustment device 10 described below, performs speech synthesis processing on the text data based on appropriate prosodic information (in the second embodiment, for example, processing that creates synthetic speech with word accents according to an appropriate accent type). It has a built-in accent dictionary 22 that stores the accent type of each word for various domains.

音声合成調整装置１０は、上記の音声合成装置２０による音声合成処理において、より適切な韻律情報（例えば、第２実施形態ではアクセント型）が用いられるように調整（チューニング）を行う機能を有する。 The voice synthesis adjustment device 10 has a function of adjusting (tuning) the voice synthesis process by the voice synthesis device 20 so that more appropriate prosodic information (e.g., accent type in the second embodiment) is used.

音声合成調整装置１０は、上記の機能を実現するために、ドメイン設定部１１、テキストデータ取得部１２、単語判定部１３Ｓ、アクセント修正情報取得部１４Ｓ、アクセント修正制御部１５Ｓ、および、第２の作成制御部１６Ｓを含む。このうち、ドメイン設定部１１およびテキストデータ取得部１２については、第１実施形態と同様の機能を有するため、重複した説明は省略する。 To realize the above functions, the voice synthesis adjustment device 10 includes a domain setting unit 11, a text data acquisition unit 12, a word determination unit 13S, an accent correction information acquisition unit 14S, an accent correction control unit 15S, and a second creation control unit 16S. Of these, the domain setting unit 11 and the text data acquisition unit 12 have the same functions as in the first embodiment, so a duplicated description will be omitted.

単語判定部１３Ｓは、テキストデータ取得部１２により取得されたテキストデータがアクセント辞書２２における新規の単語を含むか否かを判定する機能部である。 The word determination unit 13S is a functional unit that determines whether the text data acquired by the text data acquisition unit 12 includes a word that is new to the accent dictionary 22.

アクセント修正情報取得部１４Ｓは、単語判定部１３Ｓにより新規の単語を含むと判定された場合に、テキストデータ取得部１２により取得されたテキストデータをそのまま音声合成装置２０へ転送し、その後、ユーザからの修正音声（ここでは、アクセント型を修正すべき修正単語の音声を含んだ音声）に対応した、合成音声に対するアクセント修正情報を含むテキストデータを、音声合成装置２０経由で取得する機能部であり、アクセント修正制御部１５Ｓは、アクセント修正情報取得部１４Ｓにより取得されたアクセント修正情報に基づいて、該当するドメインに関する該当の単語のアクセント型をアクセント辞書２２へ登録するとともに、音声合成装置２０によって、アクセント修正情報に基づき単語のアクセントを修正した音声を再作成させ出力させる機能部である。 The accent correction information acquisition unit 14S is a functional unit that transfers the text data acquired by the text data acquisition unit 12 as is to the speech synthesis device 20 when the word determination unit 13S determines that the text data contains a new word, and then acquires text data including accent correction information for the synthesized speech corresponding to the corrected speech from the user (here, speech including the speech of the corrected word whose accent type is to be corrected) via the speech synthesis device 20. The accent correction control unit 15S is a functional unit that registers the accent type of the relevant word for the relevant domain in the accent dictionary 22 based on the accent correction information acquired by the accent correction information acquisition unit 14S, and causes the speech synthesis device 20 to recreate and output a speech in which the accent of the word has been corrected based on the accent correction information.

第２の作成制御部１６Ｓは、単語判定部１３Ｓにより新規の単語を含まないと判定された場合に、該当のドメインに関する該当の単語のアクセント型およびテキストデータを音声合成装置２０に転送し、音声合成装置２０によって、上記単語のアクセント型に応じたアクセントが付された音声を作成させ出力させる機能部である。 The second creation control unit 16S is a functional unit that, when the word determination unit 13S determines that the text data does not contain a new word, transfers the accent type of the relevant word for the relevant domain, together with the text data, to the voice synthesis device 20, and causes the voice synthesis device 20 to create and output a voice with an accent according to the accent type of that word.

図６には、アクセント辞書２２に記憶された、第２実施形態におけるさまざまなドメインについての単語ごとのアクセント型が例示されている。なお、ドメインは、図２を用いて説明したように、「利用シーン」と「用途」の組合せに相当するが、図６では、説明を分かりやすくするために、ドメインを「標準語向け」、「ビジネスの場向け」、「年配の方向け」など簡易な表現で例示している。 Figure 6 illustrates examples of accent types for each word for various domains in the second embodiment, stored in the accent dictionary 22. As explained with reference to Figure 2, a domain corresponds to a combination of a "usage scene" and a "purpose," but in Figure 6, to make the explanation easier to understand, the domains are illustrated using simple expressions such as "for standard language," "for business situations," and "for the elderly."

図６から分かるように、例えば、単語「ほうれん草」では、ドメインA(標準語向け)については「れ」の位置がアクセント核（即ち、アクセントが置かれる位置）となるのに対し、ドメインB(関西弁向け)についてはアクセント核が存在しない。また、単語「まじ？」では、ドメインA(標準語向け)については「ま」の位置がアクセント核となるのに対し、ドメインB(関西弁向け)については「じ」の位置がアクセント核となる。このように同じ単語であっても、ドメインが異なれば、異なる位置がアクセント核となるような異なるアクセント型が設定される。 As can be seen from Figure 6, for the word "hōrensō" (spinach), for example, the accent nucleus (i.e., the position where the accent is placed) is at "re" in domain A (for standard Japanese), whereas there is no accent nucleus in domain B (for the Kansai dialect). Likewise, for the word "maji?" ("really?"), the accent nucleus is at "ma" in domain A (for standard Japanese), whereas it is at "ji" in domain B (for the Kansai dialect). In this way, even for the same word, different domains are assigned different accent types, whose accent nuclei fall at different positions.
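One way to picture the per-domain entries described above is a lookup table keyed by domain and word. The following Python sketch is only an illustration under stated assumptions: the name `accent_dictionary`, the helper `is_new_word`, and the integer encoding of the accent type (0 for "no accent nucleus", n for a nucleus on the n-th mora) are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for accent dictionary 22.
# Key: (domain, word). Value: accent type encoded as an integer
# (assumption for this sketch): 0 = no accent nucleus,
# n = accent nucleus on the n-th mora (1-based).
accent_dictionary: Dict[Tuple[str, str], int] = {
    ("A", "ほうれん草"): 3,  # domain A: nucleus on "れ" (ほ=1, う=2, れ=3)
    ("B", "ほうれん草"): 0,  # domain B: no accent nucleus
    ("A", "まじ？"): 1,      # domain A: nucleus on "ま"
    ("B", "まじ？"): 2,      # domain B: nucleus on "じ"
}


def is_new_word(domain: str, word: str) -> bool:
    """Return True when no accent type is registered for this word in this domain."""
    return (domain, word) not in accent_dictionary
```

Keeping "no accent nucleus" (value 0) distinct from "not registered" (missing key) lets the same word be known in one domain and new in another.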

以下、図７のフロー図を用いて、音声合成調整装置１０および関連装置により実行される一連の処理を説明する。 Below, a series of processes executed by the voice synthesis adjustment device 10 and related devices will be explained using the flow diagram in Figure 7.

ステップＳ１～Ｓ３では、第１実施形態と同様に、ユーザからのドメイン情報に示されたドメインの設定、ユーザからの入力データのテキスト変換・出力、および、変換後の音声テキストデータの取得が行われる。 In steps S1 to S3, similar to the first embodiment, the domain indicated in the domain information from the user is set, the input data from the user is converted into text and output, and the converted voice text data is obtained.

その後、音声合成調整装置１０は、単語判定部１３Ｓにより、テキストデータ取得部１２により取得されたテキストデータがアクセント辞書２２における新規の単語を含むか否かを判定する（ステップＳ２１）。 Then, the voice synthesis adjustment device 10 determines, by the word determination unit 13S, whether or not the text data acquired by the text data acquisition unit 12 includes a new word in the accent dictionary 22 (step S21).

ここで、テキストデータがアクセント辞書２２における新規の単語を含むと判定された場合（ステップＳ２１でＹＥＳ）、音声合成調整装置１０は、アクセント修正情報取得部１４Ｓにより、テキストデータ（韻律情報を含む）をそのまま音声合成装置２０へ転送し、音声合成装置２０経由で届く修正音声（アクセント修正単語の音声を含んだ音声）を待つ（ステップＳ２２）。その後、音声合成装置２０は、転送されたテキストデータに含まれた韻律情報（ここではアクセント型）に沿って単語アクセントが付された音声を作成し、作成された合成音声を制御装置３０経由でユーザへ出力する（ステップＳ２３）。出力された合成音声を聴いたユーザにより、修正音声が制御装置３０に入力され、音声合成装置２０経由で音声合成調整装置１０に転送されると、音声合成調整装置１０は、アクセント修正情報取得部１４Ｓにより、修正音声から音声分析によって基本周波数（声の高さ）を抽出する（ステップＳ２４）。つまり、図８に示すように、修正音声における時間軸に沿った周波数の変化を検出し、基本周波数（声の高さ）がどのように変化したかを把握する。そして、基本周波数（声の高さ）が高い位置が単語のアクセント核として検出される。 Here, if it is determined that the text data contains a new word in the accent dictionary 22 (YES in step S21), the speech synthesis adjustment device 10 transfers the text data (including prosodic information) as is to the speech synthesis device 20 by the accent correction information acquisition unit 14S, and waits for the corrected speech (speech including the speech of the accent correction word) that arrives via the speech synthesis device 20 (step S22). After that, the speech synthesis device 20 creates a speech to which a word accent is added according to the prosodic information (here, the accent type) included in the transferred text data, and outputs the created synthetic speech to the user via the control device 30 (step S23). When the corrected speech is input to the control device 30 by the user who listened to the output synthetic speech, and is transferred to the speech synthesis adjustment device 10 via the speech synthesis device 20, the speech synthesis adjustment device 10 extracts the fundamental frequency (pitch) from the corrected speech by speech analysis using the accent correction information acquisition unit 14S (step S24). That is, as shown in Figure 8, the system detects frequency changes along the time axis in the corrected speech and determines how the fundamental frequency (pitch) has changed. The position where the fundamental frequency (pitch) is high is then detected as the accent nucleus of the word.
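The detection in step S24 can be sketched roughly as follows. This is a simplified illustration, not the method of the disclosure: it assumes the fundamental frequency has already been extracted and averaged per mora, and the function name `estimate_accent_type`, the flatness threshold `min_range_hz`, and the integer encoding of the result (0 for no accent nucleus, n for the n-th mora) are all assumptions introduced here.

```python
from typing import List


def estimate_accent_type(mora_f0_hz: List[float], min_range_hz: float = 10.0) -> int:
    """Estimate an accent type from per-mora fundamental frequency values.

    Returns 0 when the contour is nearly flat (treated here as "no accent
    nucleus"), otherwise the 1-based position of the mora with the highest F0.
    """
    if not mora_f0_hz:
        raise ValueError("at least one mora is required")
    # Assumption: a contour whose range is below the threshold has no nucleus.
    if max(mora_f0_hz) - min(mora_f0_hz) < min_range_hz:
        return 0
    # The mora where the pitch is highest is taken as the accent nucleus.
    peak_index = max(range(len(mora_f0_hz)), key=lambda i: mora_f0_hz[i])
    return peak_index + 1
```

For a six-mora word whose third mora has the highest pitch, this returns 3; a practical system would also need pitch tracking, mora alignment, and speaker normalization, none of which are shown.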

そして、音声合成調整装置１０は、アクセント修正制御部１５Ｓにより、上記ステップＳ２４の処理結果から、単語のアクセント核の位置に応じた、修正された単語アクセント型を推定し、推定結果を該当のドメインに関する該当の単語のアクセント型としてアクセント辞書２２へ登録するとともに、音声合成装置２０に対し、アクセント修正情報（ここでは、推定で得られた修正された単語アクセント型の情報）に基づき単語のアクセントを修正し、修正後のアクセントが付された合成音声を再作成して出力するように指示する（ステップＳ２５）。ステップＳ２５における指示を受けた音声合成装置２０は、アクセント修正情報に基づき単語のアクセントを修正し、修正後のアクセントが付された合成音声を再作成して、制御装置３０経由でユーザへ出力する（ステップＳ２６）。これにより、ユーザは、アクセント修正情報に基づき修正された音声を聴くことができる。 Then, the speech synthesis adjustment device 10 uses the accent correction control unit 15S to estimate a corrected word accent type according to the position of the accent nucleus of the word from the processing result of step S24, registers the estimated result in the accent dictionary 22 as the accent type of the relevant word for the relevant domain, and instructs the speech synthesis device 20 to correct the accent of the word based on the accent correction information (here, information on the corrected word accent type obtained by estimation) and recreate and output the synthetic voice with the corrected accent (step S25). The speech synthesis device 20 that has received the instruction in step S25 corrects the accent of the word based on the accent correction information, recreates the synthetic voice with the corrected accent, and outputs it to the user via the control device 30 (step S26). This allows the user to hear the speech corrected based on the accent correction information.

なお、音声合成調整装置１０は、ステップＳ２２でテキストデータを音声合成装置２０へ転送した後、予め定められた時間内に、ユーザからの修正音声を受信しない場合は、アクセントに関する修正が無いものと見做し、ステップＳ２４、Ｓ２５の実行を回避して処理を終了する。即ち、アクセントに関する修正が有った場合のみ、ステップＳ２５のアクセント辞書２２への登録が実行される。 If the voice synthesis adjustment device 10 does not receive corrected voice from the user within a predetermined time after transferring the text data to the voice synthesis device 20 in step S22, it assumes that there has been no correction to the accent, avoids execution of steps S24 and S25, and ends the process. In other words, registration in the accent dictionary 22 in step S25 is executed only if there has been a correction to the accent.
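The timeout behavior described above might be sketched with a blocking queue. The function name `wait_for_corrected_speech` and the use of a queue as the delivery channel for the corrected speech are assumptions made for illustration; the disclosure does not specify how the waiting is implemented.

```python
import queue
from typing import Optional


def wait_for_corrected_speech(inbox: "queue.Queue[bytes]", timeout_s: float) -> Optional[bytes]:
    """Wait up to timeout_s seconds for corrected speech.

    Returns the audio bytes if they arrive in time, or None when the wait
    times out, which is treated as "no accent correction".
    """
    try:
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        return None
```

A caller would skip steps S24 and S25 whenever this returns None, so nothing is registered in the accent dictionary in that case.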

一方、ステップＳ２１で、テキストデータがアクセント辞書２２における新規の単語を含まないと判定された場合（ステップＳ２１でＮＯ）、音声合成調整装置１０は、第２の作成制御部１６Ｓにより、該当のドメインに関する該当の単語のアクセント型およびテキストデータを音声合成装置２０に転送し、音声合成装置２０に対し、上記アクセント型に沿ったアクセントが付された合成音声を再作成して出力するように指示する（ステップＳ２７）。ステップＳ２７における指示を受けた音声合成装置２０は、上記アクセント型に沿ったアクセントが付された合成音声を再作成して、制御装置３０経由でユーザへ出力する（ステップＳ２８）。これにより、ユーザは、アクセント辞書２２に登録されたアクセント型に基づく適切なアクセントが付された音声を聴くことができる。 On the other hand, if it is determined in step S21 that the text data does not include a new word in the accent dictionary 22 (NO in step S21), the voice synthesis adjustment device 10 transfers the accent type and text data of the relevant word for the relevant domain to the voice synthesis device 20 by the second creation control unit 16S, and instructs the voice synthesis device 20 to recreate and output synthetic voice with an accent according to the accent type (step S27). Having received the instruction in step S27, the voice synthesis device 20 recreates synthetic voice with an accent according to the accent type and outputs it to the user via the control device 30 (step S28). This allows the user to hear voice with an appropriate accent based on the accent type registered in the accent dictionary 22.
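The branch at step S21 can be summarized in a short sketch. It assumes a dictionary keyed by (domain, word) and a caller-supplied callback `request_correction` that stands in for steps S22 to S24 and returns None when no corrected speech arrives in time; these names and the integer accent-type encoding are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

AccentDict = Dict[Tuple[str, str], int]


def adjust_accent(
    domain: str,
    word: str,
    dictionary: AccentDict,
    request_correction: Callable[[str], Optional[int]],
) -> Optional[int]:
    """Return the accent type to synthesize with, or None if there is none to apply."""
    key = (domain, word)
    if key not in dictionary:
        # Step S21 = YES: the word is new for this domain.
        corrected = request_correction(word)  # stands in for steps S22 to S24
        if corrected is not None:
            dictionary[key] = corrected       # step S25: register the accent type
        return corrected
    # Step S21 = NO: reuse the registered accent type (steps S27 and S28).
    return dictionary[key]
```

Once a correction has been registered, a later call with the same domain and word takes the second branch and no longer asks the user.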

以上説明した第２実施形態によれば、上記のように、取得されたテキストデータから得られた単語が新規である、即ち、設定されたドメインについての当該単語のアクセント型がアクセント辞書２２における新規の情報である場合に、修正音声から取得された単語のアクセント修正情報に基づいて、該当するドメインについての単語のアクセント型をアクセント辞書２２へ登録することで、次回以降、取得されたテキストデータから同じ単語のアクセント型が得られたときに従来の処理を一部自動化・簡易化できる。これにより、専門的な知識を要さずとも、音声合成における調整作業に要する時間を短縮しつつ、より適切なアクセント修正情報に基づいて音声合成の調整を行うことができる。また、ユーザは、修正音声に基づき単語のアクセントが適切に修正された音声を聴くことができる。 According to the second embodiment described above, when a word obtained from the acquired text data is new, that is, when the accent type of the word for the set domain is new information in the accent dictionary 22, the accent type of the word for the corresponding domain is registered in the accent dictionary 22 based on the accent correction information of the word obtained from the corrected voice, so that the conventional process can be partially automated and simplified the next time the accent type of the same word is obtained from the acquired text data. This makes it possible to adjust the voice synthesis based on more appropriate accent correction information while shortening the time required for adjustment work in voice synthesis without requiring specialized knowledge. In addition, the user can listen to a voice in which the accent of the word has been appropriately corrected based on the corrected voice.

また、取得されたテキストデータから得られた単語が新規でない場合は、第２の作成制御部１６Ｓによって、該当のドメインに関する該当の単語のアクセント型およびテキストデータを音声合成装置２０に転送し、音声合成装置２０によって、上記アクセント型に応じたアクセントが付された音声を作成させ出力させる。そのため、ユーザは、アクセント辞書２２に登録された適切な単語のアクセント型に基づくアクセントが付された音声を聴くことができる。 In addition, if the word obtained from the acquired text data is not new, the second creation control unit 16S transfers the accent type and text data of the relevant word related to the relevant domain to the voice synthesis device 20, and the voice synthesis device 20 creates and outputs a voice with an accent according to the accent type. Therefore, the user can listen to a voice with an accent based on the accent type of the appropriate word registered in the accent dictionary 22.

なお、音声合成調整装置１０は、第１実施形態で説明したポーズの修正機能と、第２実施形態で説明したアクセントの修正機能の両方を兼ね備えてもよい。 The voice synthesis adjustment device 10 may have both the pause correction function described in the first embodiment and the accent correction function described in the second embodiment.

また、第１実施形態における図４の処理に関する変形例として、第１段階（学習段階）として、さまざまなドメインについてのさまざまな文構造のポーズルールをポーズ辞書へ登録する処理を多数回繰り返し行うことで、学習済みのポーズ辞書を作成し、その後、第２段階（ルール適用段階）として、学習済みのポーズ辞書に個別ケースを適用することで、個別ケースのポーズに関する音声合成調整を行う実施形態、即ち、時期的に大別される２つの段階で処理を行う実施形態を採用してもよい。 As a variation of the process of FIG. 4 in the first embodiment, the process may be performed in two stages that are broadly separated in time. In the first stage (learning stage), a learned pause dictionary is created by repeating, many times, the process of registering pause rules for various sentence structures for various domains in the pause dictionary. In the second stage (rule application stage), individual cases are applied to the learned pause dictionary to adjust the speech synthesis with respect to the pauses of those cases.

同様に、第２実施形態における図７の処理に関する変形例として、第１段階（学習段階）として、さまざまなドメインについてのさまざまな単語のアクセント型をアクセント辞書へ登録する処理を多数回繰り返し行うことで、学習済みのアクセント辞書を作成し、その後、第２段階（ルール適用段階）として、学習済みのアクセント辞書に個別ケースを適用することで、個別ケースのアクセントに関する音声合成調整を行う実施形態、即ち、時期的に大別される２つの段階で処理を行う実施形態を採用してもよい。 Similarly, as a variation of the process of FIG. 7 in the second embodiment, the process may be performed in two stages that are broadly separated in time. In the first stage (learning stage), a learned accent dictionary is created by repeating, many times, the process of registering accent types of various words for various domains in the accent dictionary. In the second stage (rule application stage), individual cases are applied to the learned accent dictionary to adjust the speech synthesis with respect to the accents of those cases.

［用語、変形態様などについて］ [Terminology, variations, etc.]
なお、上記の実施形態の説明に用いられたブロック図は、機能単位のブロックを示している。これらの機能ブロック（構成部）は、ハードウェアおよびソフトウェアの少なくとも一方の任意の組み合わせによって実現される。各機能ブロックの実現方法は特に限定されない。すなわち、各機能ブロックは、物理的または論理的に結合した１つの装置を用いて実現されてもよいし、物理的または論理的に分離した２つ以上の装置を直接的または間接的に（例えば、有線、無線などを用いて）接続し、これら複数の装置を用いて実現されてもよい。機能ブロックは、上記１つの装置または上記複数の装置にソフトウェアを組み合わせて実現されてもよい。 The block diagrams used in the description of the above embodiments show functional blocks. These functional blocks (components) are realized by any combination of at least one of hardware and software. The method of realizing each functional block is not particularly limited. That is, each functional block may be realized using one device that is physically or logically coupled, or may be realized using two or more devices that are physically or logically separated and directly or indirectly connected (for example, using wires, wirelessly, etc.). The functional block may be realized by combining the one device or the multiple devices with software.

機能には、判断、決定、判定、計算、算出、処理、導出、調査、探索、確認、受信、送信、出力、アクセス、解決、選択、選定、確立、比較、想定、期待、見做し、報知（broadcasting）、通知（notifying）、通信（communicating）、転送（forwarding）、構成（configuring）、再構成（reconfiguring）、割り当て（allocating、mapping）、および割り振り（assigning）などがあるが、これらの機能に限られない。例えば、送信を機能させる機能ブロック（構成部）は、送信部（transmitting unit）または送信機（transmitter）と呼称される。いずれも、上述したとおり、実現方法は特に限定されない。 Functions include, but are not limited to, judging, deciding, determining, calculating, computing, processing, deriving, investigating, searching, ascertaining, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating (mapping), and assigning. For example, a functional block (component) that performs transmission is called a transmitting unit or a transmitter. In either case, as described above, the method of realization is not particularly limited.

例えば、本開示の第１、第２実施形態における音声合成調整装置１０は、本開示の処理を行うコンピュータとして機能してもよい。図９に示されるように、上述の音声合成調整装置１０は、物理的には、プロセッサ１００１、メモリ１００２、ストレージ１００３、通信装置１００４、入力装置１００５、出力装置１００６、およびバス１００７などを含むコンピュータ装置として構成されてもよい。この点は、音声合成調整装置１０以外の装置についても同様であるが、ここでは、音声合成調整装置１０の構成例として説明する。 For example, the voice synthesis adjustment device 10 in the first and second embodiments of the present disclosure may function as a computer that performs the processing of the present disclosure. As shown in FIG. 9, the above-mentioned voice synthesis adjustment device 10 may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, and a bus 1007. This point is similar to devices other than the voice synthesis adjustment device 10, but here it will be described as an example configuration of the voice synthesis adjustment device 10.

なお、以下の説明では、「装置」という文言は、回路、デバイス、およびユニットなどに読み替えることができる。音声合成調整装置１０のハードウェア構成は、図に示された各装置を１つまたは複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following description, the term "apparatus" may be interpreted as a circuit, device, unit, etc. The hardware configuration of the voice synthesis adjustment device 10 may be configured to include one or more of the devices shown in the figure, or may be configured to exclude some of the devices.

音声合成調整装置１０における各機能は、プロセッサ１００１およびメモリ１００２などのハードウェア上に所定のソフトウェア（プログラム）を読み込ませることによって、プロセッサ１００１が演算を行い、通信装置１００４による通信を制御したり、メモリ１００２およびストレージ１００３におけるデータの読み出しおよび書き込みの少なくとも一方を制御したりすることによって実現される。 The functions of the voice synthesis adjustment device 10 are realized by loading a specific software (program) onto hardware such as the processor 1001 and memory 1002, causing the processor 1001 to perform calculations, control communications via the communication device 1004, and control at least one of the reading and writing of data in the memory 1002 and storage 1003.

プロセッサ１００１は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ１００１は、周辺装置とのインターフェース、制御装置、演算装置、およびレジスタなどを含む中央処理装置（ＣＰＵ：Central Processing Unit）によって構成されてもよい。例えば、上述の音声合成調整装置１０の各機能は、プロセッサ１００１によって実現されてもよい。 The processor 1001, for example, operates an operating system to control the entire computer. The processor 1001 may be configured with a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic unit, and a register. For example, each function of the voice synthesis adjustment device 10 described above may be realized by the processor 1001.

プロセッサ１００１は、プログラム（プログラムコード）、ソフトウェアモジュール、およびデータなどを、ストレージ１００３および通信装置１００４の少なくとも一方からメモリ１００２に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施形態において説明された動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。例えば、音声合成調整装置１０の各機能は、メモリ１００２に格納され、プロセッサ１００１において動作する制御プログラムによって実現されてもよい。上述の各種処理は、１つのプロセッサ１００１によって実行される旨を説明してきたが、２以上のプロセッサ１００１により同時または逐次に実行されてもよい。プロセッサ１００１は、１以上のチップによって実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されてもよい。 The processor 1001 reads out programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 into the memory 1002, and executes various processes according to these. The programs used are those that cause a computer to execute at least some of the operations described in the above-mentioned embodiments. For example, each function of the voice synthesis adjustment device 10 may be realized by a control program stored in the memory 1002 and running on the processor 1001. Although the above-mentioned various processes have been described as being executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented by one or more chips. The programs may be transmitted from a network via a telecommunications line.

メモリ１００２は、コンピュータ読み取り可能な記録媒体であり、例えば、ＲＯＭ（Read Only Memory）、ＥＰＲＯＭ（Erasable Programmable ＲＯＭ）、ＥＥＰＲＯＭ（Electrically Erasable Programmable ＲＯＭ）、およびＲＡＭ（Random Access Memory）などの少なくとも１つによって構成されてもよい。メモリ１００２は、レジスタ、キャッシュ、またはメインメモリ（主記憶装置）などと呼ばれてもよい。メモリ１００２は、本開示の一実施形態に係る情報提供方法を実施するために実行可能なプログラム（プログラムコード）、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium, and may be composed of at least one of, for example, a read only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), and a random access memory (RAM). The memory 1002 may also be called a register, a cache, or a main memory (primary storage device). The memory 1002 can store executable programs (program codes), software modules, and the like for implementing an information providing method according to one embodiment of the present disclosure.

ストレージ１００３は、コンピュータ読み取り可能な記録媒体であり、例えば、ＣＤ－ＲＯＭ（Compact Disc ＲＯＭ）などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク（例えば、コンパクトディスク、デジタル多用途ディスク、Ｂｌｕ－ｒａｙ（登録商標）ディスク）、スマートカード、フラッシュメモリ（例えば、カード、スティック、キードライブ）、フロッピー（登録商標）ディスク、および磁気ストリップなどの少なくとも１つによって構成されてもよい。ストレージ１００３は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ１００２およびストレージ１００３の少なくとも一方を含むデータベース、サーバ、その他の適切な媒体であってもよい。 Storage 1003 is a computer-readable recording medium, and may be, for example, at least one of an optical disk such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disk, a digital versatile disk, a Blu-ray (registered trademark) disk), a smart card, a flash memory (e.g., a card, a stick, a key drive), a floppy (registered trademark) disk, and a magnetic strip. Storage 1003 may also be referred to as an auxiliary storage device. The above-mentioned storage medium may be, for example, a database, a server, or other suitable medium including at least one of memory 1002 and storage 1003.

通信装置１００４は、有線ネットワークおよび無線ネットワークの少なくとも一方を介してコンピュータ間の通信を行うためのハードウェア（送受信デバイス）であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、または通信モジュールなどともいう。通信装置１００４は、例えば周波数分割複信（ＦＤＤ：Frequency Division Duplex）および時分割複信（ＴＤＤ：Time Division Duplex）の少なくとも一方を実現するために、高周波スイッチ、デュプレクサ、フィルタ、周波数シンセサイザなどを含んで構成されてもよい。 The communication device 1004 is hardware (transmitting/receiving device) for communicating between computers via at least one of a wired network and a wireless network, and is also called, for example, a network device, a network controller, a network card, or a communication module. The communication device 1004 may be configured to include a high-frequency switch, a duplexer, a filter, a frequency synthesizer, etc., to realize, for example, at least one of Frequency Division Duplex (FDD) and Time Division Duplex (TDD).

入力装置１００５は、外部からの入力を受け付ける入力デバイス（例えば、キーボード、マウス、マイクロフォン、スイッチ、ボタン、センサなど）である。出力装置１００６は、外部への出力を実施する出力デバイス（例えば、ディスプレイ、スピーカー、ＬＥＤランプなど）である。なお、入力装置１００５および出力装置１００６は、一体となった構成（例えば、タッチパネル）であってもよい。 The input device 1005 is an input device (e.g., a keyboard, a mouse, a microphone, a switch, a button, a sensor, etc.) that accepts input from the outside. The output device 1006 is an output device (e.g., a display, a speaker, an LED lamp, etc.) that performs output to the outside. Note that the input device 1005 and the output device 1006 may be integrated into one device (e.g., a touch panel).

プロセッサ１００１およびメモリ１００２などの各装置は、情報を通信するためのバス１００７によって接続される。バス１００７は、単一のバスを用いて構成されてもよいし、装置間ごとに異なるバスを用いて構成されてもよい。 Each device, such as the processor 1001 and the memory 1002, is connected by a bus 1007 for communicating information. The bus 1007 may be configured using a single bus, or may be configured using different buses between each device.

音声合成調整装置１０は、マイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ：Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）、ＦＰＧＡ（Field Programmable Gate Array）などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部または全てが実現されてもよい。例えば、プロセッサ１００１は、これらのハードウェアの少なくとも１つを用いて実装されてもよい。 The voice synthesis adjustment device 10 may be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA), and some or all of the functional blocks may be realized by the hardware. For example, the processor 1001 may be implemented using at least one of these pieces of hardware.

情報の通知は、本開示において説明された態様／実施形態に限られず、他の方法を用いて行われてもよい。 Notification of information is not limited to the aspects/embodiments described in this disclosure and may be performed using other methods.

本開示において説明された各態様／実施形態の処理手順、シーケンス、フローチャートなどは、矛盾の無い限り、順序を入れ替えてもよい。例えば、本開示において説明された方法については、例示的な順序を用いて様々なステップの要素を提示しており、提示された特定の順序に限定されない。 The processing steps, sequences, flow charts, etc. of each aspect/embodiment described in this disclosure may be reordered unless inconsistent. For example, the methods described in this disclosure present elements of various steps using an example order and are not limited to the particular order presented.

情報等は、上位レイヤから下位レイヤへ、または下位レイヤから上位レイヤへ出力され得る。情報等は、複数のネットワークノードを介して入出力されてもよい。 Information, etc. may be output from a higher layer to a lower layer, or from a lower layer to a higher layer. Information, etc. may be input/output via multiple network nodes.

入出力された情報等は特定の場所（例えば、メモリ）に保存されてもよいし、管理テーブルを用いて管理されてもよい。入出力される情報等は、上書き、更新、または追記され得る。出力された情報等は削除されてもよい。入力された情報等は他の装置へ送信されてもよい。 The input and output information may be stored in a specific location (e.g., memory) or may be managed using a management table. The input and output information may be overwritten, updated, or added to. The output information may be deleted. The input information may be transmitted to another device.

判定は、１ビットで表される値（０か１か）によって行われてもよいし、真偽値（Boolean：trueまたはfalse）によって行われてもよいし、数値の比較（例えば、所定の値との比較）によって行われてもよい。 The determination may be based on a value represented by one bit (0 or 1), a Boolean (true or false) value, or a numerical comparison (e.g., with a predetermined value).

本開示において説明された各態様／実施形態は単独で用いられてもよいし、組み合わせて用いられてもよいし、実行に伴って切り替えて用いられてもよい。所定の情報の通知（例えば、「Ｘであること」の通知）は、明示的な通知に限られず、暗黙的に（例えば、当該所定の情報の通知を行わないことによって）行われてもよい。 Each aspect/embodiment described in this disclosure may be used alone, in combination, or switched according to execution. Notification of specific information (e.g., notification that "X is the case") is not limited to explicit notification, but may be performed implicitly (e.g., by not notifying the specific information).

以上、本開示について詳細に説明したが、当業者にとっては、本開示が本開示中に説明された実施形態に限定されないということは明らかである。本開示は、請求の範囲の記載により定まる本開示の趣旨および範囲を逸脱することなく修正および変更態様として実施することができる。したがって、本開示の記載は、例示説明を目的とし、本開示に対して何ら制限的な意味を有しない。 Although the present disclosure has been described in detail above, it is clear to those skilled in the art that the present disclosure is not limited to the embodiments described herein. The present disclosure can be implemented in modified and altered forms without departing from the spirit and scope of the present disclosure as defined by the claims. Therefore, the description of the present disclosure is intended as an illustrative example and does not have any limiting meaning on the present disclosure.

ソフトウェアは、ソフトウェア、ファームウェア、ミドルウェア、マイクロコード、ハードウェア記述言語と呼ばれるか、他の名称で呼ばれるかを問わず、命令、命令セット、コード、コードセグメント、プログラムコード、プログラム、サブプログラム、ソフトウェアモジュール、アプリケーション、ソフトウェアアプリケーション、ソフトウェアパッケージ、ルーチン、サブルーチン、オブジェクト、実行可能ファイル、実行スレッド、手順、機能などを意味するよう広く解釈されるべきである。 Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.

ソフトウェア、命令、および情報などは、伝送媒体を介して送受信されてもよい。例えば、ソフトウェアが、有線技術（同軸ケーブル、光ファイバケーブル、ツイストペア、デジタル加入者回線（ＤＳＬ：Digital Subscriber Line）など）および無線技術（赤外線、マイクロ波など）の少なくとも一方を使用してウェブサイト、サーバ、または他のリモートソースから送信される場合、これらの有線技術および無線技術の少なくとも一方は、伝送媒体の定義内に含まれる。 Software, instructions, information, and the like may be transmitted and received via a transmission medium. For example, if the software is transmitted from a website, server, or other remote source using wired technologies (such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL)), and/or wireless technologies (such as infrared, microwave), these wired and/or wireless technologies are included within the definition of a transmission medium.

本開示において説明された情報、および信号などは、様々な異なる技術のいずれかを使用して表されてもよい。例えば、上記の説明全体に渡って言及され得るデータ、命令、コマンド、情報、信号、ビット、シンボル、およびチップなどは、電圧、電流、電磁波、磁界若しくは磁性粒子、光場若しくは光子、またはこれらの任意の組み合わせによって表されてもよい。 The information, signals, and the like described in this disclosure may be represented using any of a variety of different technologies. For example, the data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

なお、本開示において説明された用語および本開示の理解に必要な用語については、同一のまたは類似する意味を有する用語と置き換えられてもよい。 Note that terms explained in this disclosure and terms necessary for understanding this disclosure may be replaced with terms having the same or similar meanings.

本開示において使用される「システム」および「ネットワーク」という用語は、互換的に使用される。 As used in this disclosure, the terms "system" and "network" are used interchangeably.

本開示において説明された情報、およびパラメータなどは、絶対値を用いて表されてもよいし、所定の値からの相対値を用いて表されてもよいし、対応する別の情報を用いて表されてもよい。 The information, parameters, etc. described in this disclosure may be expressed using absolute values, may be expressed using relative values from a predetermined value, or may be expressed using other corresponding information.

上述したパラメータに使用される名称はいかなる点においても限定的な名称ではない。さらに、これらのパラメータを使用する数式等は、本開示で明示的に開示した数式等と異なる場合もある。 The names used for the parameters described above are not limiting in any way. Furthermore, the formulas etc. using these parameters may differ from the formulas etc. explicitly disclosed in this disclosure.

本開示で使用される「判断（determining）」、および「決定（determining）」という用語は、多種多様な動作を包含する場合がある。「判断」、「決定」は、例えば、判定（judging）、計算（calculating）、算出（computing）、処理（processing）、導出（deriving）、調査（investigating）、探索（looking up、search、inquiry）（例えば、テーブル、データベースまたは別のデータ構造での探索）、確認（ascertaining）した事を「判断」「決定」したとみなす事などを含み得る。「判断」、「決定」は、受信（receiving）（例えば、情報を受信すること）、送信（transmitting）（例えば、情報を送信すること）、入力（input）、出力（output）、アクセス（accessing）（例えば、メモリ中のデータにアクセスすること）した事を「判断」「決定」したとみなす事などを含み得る。「判断」、「決定」は、解決（resolving）、選択（selecting）、選定（choosing）、確立（establishing）、比較（comparing）などした事を「判断」「決定」したとみなす事を含み得る。つまり、「判断」「決定」は、何らかの動作を「判断」「決定」したとみなす事を含み得る。「判断（決定）」は、「想定する（assuming）」、「期待する（expecting）」、または「みなす（considering）」などで読み替えられてもよい。 The terms "judging (determining)" and "deciding (determining)" as used in this disclosure may encompass a wide variety of actions. "Judging" and "deciding" may include, for example, regarding judging, calculating, computing, processing, deriving, investigating, looking up (searching, inquiring; e.g., searching in a table, database, or other data structure), or ascertaining as having "judged" or "decided." They may also include regarding receiving (e.g., receiving information), transmitting (e.g., sending information), input, output, or accessing (e.g., accessing data in a memory) as having "judged" or "decided," and regarding resolving, selecting, choosing, establishing, comparing, and the like as having "judged" or "decided." In other words, "judging" and "deciding" may include regarding some action as having been "judged" or "decided." "Judging (deciding)" may also be read as "assuming," "expecting," "considering," or the like.

「接続された（connected）」、「結合された（coupled）」という用語、またはこれらのあらゆる変形は、２またはそれ以上の要素間の直接的または間接的なあらゆる接続または結合を意味し、互いに「接続」または「結合」された２つの要素間に１またはそれ以上の中間要素が存在することを含むことができる。要素間の結合または接続は、物理的に行われてもよく、論理的に行われてもよく、或いはこれらの組み合わせで実現されてもよい。例えば、「接続」は「アクセス」で読み替えられてもよい。本開示で使用される場合、２つの要素は、１またはそれ以上の電線、ケーブルおよびプリント電気接続の少なくとも一つを用いて、並びにいくつかの非限定的かつ非包括的な例として、無線周波数領域、マイクロ波領域および光（可視および不可視の両方）領域の波長を有する電磁エネルギーなどを用いて、互いに「接続」または「結合」されると考えることができる。 The terms "connected" and "coupled", or any variation thereof, refer to any direct or indirect connection or coupling between two or more elements, and may include the presence of one or more intermediate elements between two elements that are "connected" or "coupled" to each other. The coupling or connection between elements may be physical, logical, or a combination thereof. For example, "connected" may be read as "access". As used in this disclosure, two elements may be considered to be "connected" or "coupled" to each other using at least one of one or more wires, cables, and printed electrical connections, as well as electromagnetic energy having wavelengths in the radio frequency range, microwave range, and optical (both visible and invisible) range, as some non-limiting and non-exhaustive examples.

As used in this disclosure, the phrase "based on" does not mean "based only on" unless expressly stated otherwise. In other words, the phrase "based on" means both "based only on" and "based at least on."

Any reference to elements using designations such as "first" and "second" in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient way of distinguishing between two or more elements. Thus, a reference to first and second elements does not mean that only two elements may be employed, nor that the first element must precede the second element in some way.

The term "unit" in the configuration of each of the above devices may be replaced with "circuit," "device," or the like.

Where the terms "include," "including," and variations thereof are used in this disclosure, they are intended to be inclusive in the same way as the term "comprising." Furthermore, the term "or" as used in this disclosure is not intended to be an exclusive or.

In this disclosure, where articles such as "a," "an," and "the" in English have been added by translation, this disclosure may include the case where the nouns following these articles are plural.

In this disclosure, the expression "A and B are different" may mean "A and B are different from each other." The expression may also mean "A and B are each different from C." Terms such as "separated" and "coupled" may be interpreted in the same way as "different."

10: voice synthesis adjustment device, 11: domain setting unit, 12: text data acquisition unit, 13: sentence structure determination unit, 13S: word determination unit, 14: pause correction information acquisition unit, 14S: accent correction information acquisition unit, 15: pause correction control unit, 15S: accent correction control unit, 16: first creation control unit, 16S: second creation control unit, 20: speech synthesis device, 21: pause dictionary, 22: accent dictionary, 30: control device, 40: other related devices, 1001: processor, 1002: memory, 1003: storage, 1004: communication device, 1005: input device, 1006: output device, 1007: bus.

Claims

A voice synthesis adjustment device that causes a speech synthesis device to create and output speech suited to a domain determined by the usage scene and purpose of the speech, the voice synthesis adjustment device comprising:
a domain setting unit that sets the domain based on domain information from an external source;
a text data acquisition unit that acquires text data including prosodic information corresponding to input data from an external source;
a determination unit that determines whether the prosodic information obtained from the text data acquired by the text data acquisition unit is new prosodic information that is not stored in a prosodic information dictionary in which prosodic information is stored in advance for each domain;
a correction information acquisition unit that, when the determination unit determines that the prosodic information is new, transfers the text data acquired by the text data acquisition unit to the speech synthesis device and acquires correction information on the prosodic information for the synthetic speech output by the speech synthesis device; and
a correction control unit that registers the new prosodic information for the corresponding domain in the prosodic information dictionary based on the correction information acquired by the correction information acquisition unit, and causes the speech synthesis device to re-create and output speech corrected based on the correction information,
wherein the determination unit includes a sentence structure determination unit that analyzes the sentence structure of the text data acquired by the text data acquisition unit and determines whether the sentence structure obtained by the analysis is new to a pause dictionary in which pause rules corresponding to sentence structures are stored in advance for each domain,
the correction information acquisition unit includes a pause correction information acquisition unit that, when the sentence structure determination unit determines that the sentence structure is new, transfers the text data analyzed by the sentence structure determination unit to the speech synthesis device and acquires pause correction information for the synthetic speech output by the speech synthesis device, and
the correction control unit includes a pause correction control unit that registers a pause rule for the corresponding sentence structure and the corresponding domain in the pause dictionary based on the pause correction information acquired by the pause correction information acquisition unit, and causes the speech synthesis device to re-create and output speech in which the pauses are corrected based on the pause correction information.

The voice synthesis adjustment device of claim 1, further comprising:
a creation control unit that, when the determination unit determines that the prosodic information is not new, transfers the prosodic information for the corresponding domain and the text data to the speech synthesis device, and causes the speech synthesis device to create and output speech to which prosody corresponding to the prosodic information has been added.

The voice synthesis adjustment device of claim 1, further comprising:
a first creation control unit that, when the sentence structure determination unit determines that the sentence structure is not new, transfers the pause rule of the sentence structure for the corresponding domain and the text data to the speech synthesis device, and causes the speech synthesis device to create and output speech to which pauses are added in accordance with the pause rule.

The voice synthesis adjustment device of claim 1, wherein:
the determination unit includes a word determination unit that determines whether the text data acquired by the text data acquisition unit includes a word that is new to an accent dictionary in which accent types of words are stored in advance for each domain;
the correction information acquisition unit includes an accent correction information acquisition unit that, when the word determination unit determines that the text data includes a new word, transfers the text data acquired by the text data acquisition unit to the speech synthesis device and acquires accent correction information for the synthetic speech output by the speech synthesis device; and
the correction control unit includes an accent correction control unit that registers the accent type of the corresponding word for the corresponding domain in the accent dictionary based on the accent correction information acquired by the accent correction information acquisition unit, and causes the speech synthesis device to re-create and output speech in which the accent of the word is corrected based on the accent correction information.

The voice synthesis adjustment device of claim 4, further comprising:
a second creation control unit that, when the word determination unit determines that the text data does not include a new word, transfers the accent type of the corresponding word for the corresponding domain and the text data to the speech synthesis device, and causes the speech synthesis device to create and output speech accented in accordance with the accent type of the word.
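
For illustration only, the pause-related flow recited in the claims above (look up the sentence structure in the per-domain pause dictionary, synthesize directly when a rule exists, otherwise synthesize, obtain a correction, register it, and re-create the speech) might be sketched in Python as follows. This is a minimal sketch and not the implementation of the disclosure: every name in it (`DomainDictionary`, `synthesize_with_pause_tuning`, and the callables passed in) is a hypothetical stand-in, and it assumes for simplicity that the acquired pause correction information is itself stored as the pause rule.

```python
from typing import Callable, Dict, Optional, Tuple


class DomainDictionary:
    """Hypothetical per-domain store, e.g. sentence structure -> pause rule."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], str] = {}

    def lookup(self, domain: str, key: str) -> Optional[str]:
        # Returns None when the key is new for this domain.
        return self._entries.get((domain, key))

    def register(self, domain: str, key: str, value: str) -> None:
        self._entries[(domain, key)] = value


def synthesize_with_pause_tuning(
    domain: str,
    text: str,
    analyze_structure: Callable[[str], str],          # stand-in for the sentence structure analysis
    pause_dictionary: DomainDictionary,               # stand-in for the pause dictionary
    synthesize: Callable[[str, Optional[str]], str],  # stand-in for the speech synthesis device
    get_pause_correction: Callable[[str], str],       # stand-in for acquiring correction information
) -> str:
    structure = analyze_structure(text)
    rule = pause_dictionary.lookup(domain, structure)

    if rule is not None:
        # Known sentence structure: synthesize with the stored pause rule.
        return synthesize(text, rule)

    # New sentence structure: produce a first synthetic speech without a rule,
    # obtain a correction for it, register it for this domain, then re-create.
    draft = synthesize(text, None)
    correction = get_pause_correction(draft)  # assumption: the correction doubles as the rule
    pause_dictionary.register(domain, structure, correction)
    return synthesize(text, correction)
```

Under the same assumptions, the accent-related flow would follow the same pattern, with words as dictionary keys and accent types as stored values.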