JP7614877B2 - Content production device and program

Info

Publication number: JP7614877B2
Application number: JP2021022463A
Authority: JP
Inventors: 正熊野; 篤今井
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2020-02-18
Filing date: 2021-02-16
Publication date: 2025-01-16
Anticipated expiration: 2041-02-16
Also published as: JP2021131537A

Description

The present invention relates to a content production device and a program.

Attempts have been made to produce content that contains no human speech by splicing together pieces of synthesized speech.

Patent Document 1 describes the configuration of a speech synthesis device. Specifically, it states that the speech synthesis device includes "a plurality of synthesized sound creation means 300 that estimate, by a predetermined method, the length of the silence to be inserted between synthesized sounds, or generate the silence, and create an audio file storing the synthesized sounds and information on the length of the silence inserted between them." With this configuration, the speech synthesis device described in Patent Document 1 generates synthesized speech (a broadcast program or the like) that fits within a predetermined duration.

Paragraph 0018 and other passages of Patent Document 1 also state that speech rate conversion is applied to the audio in order to fit the program to the desired length.

Patent Document 1: JP 2016-009061 A

If the wording is fixed, the speech rate may in some cases have to be changed "substantially." To maintain quality, the amount of text itself must therefore be increased or reduced.

The present invention was made in recognition of the above problem, and aims to provide a content production device and a program that can produce natural content of a predetermined length with high speech quality, without relying on speech rate conversion.

[1] To solve the above problem, a content production device according to one aspect of the present invention includes: a template storage unit that stores a content template holding, for each of a plurality of variations, a sentence template for generating text; a data acquisition unit that acquires data; a variation generation unit that generates text by applying the data to the sentence template of each of the plurality of variations included in the content template, and determines the total duration of the generated text; a search processing unit that searches for a combination of variations, using as an evaluation function the sum of suitably determined evaluation values of the individual variations, under constraints consisting of a condition on the duration of pauses, which are the joints at which the texts are concatenated, and a condition on the sum of the durations of the texts and the durations of the pauses; a selection unit that selects, based on the value of the evaluation function, a combination of variations that satisfies the constraints; and a pause adjustment unit that adjusts the durations of the pauses so that both the condition on the duration of the pauses and the condition on the sum of the durations of the texts and the pauses are satisfied.

[2] In one aspect of the present invention, in the content production device described above, the content template is configured as a sequence of topics, and each topic is configured to include a plurality of the variations, which can be selected in a mutually exclusive manner.

[3] In one aspect of the present invention, in the content production device described above, the condition on the duration of the pauses includes a condition on the duration of an inter-sentence pause, which is a pause at a boundary between sentences included in a variation, and a condition on the duration of an inter-topic pause, which is a pause at a boundary between topics.

[4] In one aspect of the present invention, in the content production device described above, the pause adjustment unit adjusts the inter-sentence pauses so that they all have the same duration, and adjusts the inter-topic pauses so that they all have the same duration.

[5] In one aspect of the present invention, in the content production device described above, the evaluation value of each variation is predetermined as an attribute value of that variation in the content template.

[6] One aspect of the present invention is a program for causing a computer to function as a content production device that includes: a template storage unit that stores a content template holding, for each of a plurality of variations, a sentence template for generating text; a data acquisition unit that acquires data; a variation generation unit that generates text by applying the data to the sentence template of each of the plurality of variations included in the content template, and determines the total duration of the generated text; a search processing unit that searches for a combination of variations, using as an evaluation function the sum of suitably determined evaluation values of the individual variations, under constraints consisting of a condition on the duration of pauses, which are the joints at which the texts are concatenated, and a condition on the sum of the durations of the texts and the durations of the pauses; a selection unit that selects, based on the value of the evaluation function, a combination of variations that satisfies the constraints; and a pause adjustment unit that adjusts the durations of the pauses so that both the condition on the duration of the pauses and the condition on the sum of the durations of the texts and the pauses are satisfied.

According to the present invention, content of a desired duration can be produced without relying on speech rate conversion (stretching or compressing speech in time).

FIG. 1 is a block diagram showing a schematic functional configuration of a content production device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an example of the configuration of weather forecast data used in the embodiment.
FIG. 3 is a schematic diagram showing an example of the configuration of "weather forecast (today)" data used in the embodiment.
FIG. 4 is a schematic diagram showing an example of the configuration of content template data stored in the template storage unit according to the embodiment.
FIG. 5 is a schematic diagram showing an example of the configuration of topic template data according to the embodiment.
FIG. 6 is a schematic diagram showing an example of the relationship between a sentence template included in a topic template in the embodiment and the text generated from that sentence template.
FIG. 7 is a schematic diagram showing another example of the relationship between a sentence template included in a topic template in the embodiment and the text generated from that sentence template.
FIG. 8 is a schematic diagram showing the interrelationships between the topics and their variations generated by the variation generation unit 30 based on the content template in the embodiment.
FIG. 9 is a schematic diagram showing the conditions under which the search processing unit according to the embodiment performs search processing.
FIG. 10 is a flowchart showing the overall processing procedure of the content production device according to the embodiment.
FIG. 11 is a schematic diagram showing an example of a user interface screen displayed by the content production device according to the embodiment.

Next, an embodiment of the present invention will be described with reference to the drawings. The content production device 1 according to this embodiment automatically generates content using acquired data. The content generated by the content production device 1 can, for example, be carried on a broadcast signal and transmitted over a wide area, or sent to terminal devices via the Internet or the like. To produce content, the content production device 1 holds content template data in advance. The content production device 1 produces content by applying the acquired data to the content template. The content is, for example, audio content. The content production device 1 generates audio content automatically by using speech synthesis. In addition, to automatically generate content of a desired duration, the content production device 1 searches among a large number of variations (candidates) of the content. In this embodiment, a content production device 1 that automatically generates a weather forecast program (an audio-only program) is described as an example.

FIG. 1 is a block diagram showing the schematic functional configuration of the content production device according to this embodiment. In the figure, reference numeral 1 denotes the content production device. The content production device 1 includes a data receiving unit 10, a template storage unit 20, a variation generation unit 30, a search processing unit 40, a selection unit 50, a pause adjustment unit 60, and an output unit 70. Each of these functional units can be realized, for example, by a computer and a program. Each functional unit also has storage means as necessary. The storage means is, for example, a variable in the program or memory allocated when the program runs. Non-volatile storage means such as a magnetic hard disk drive or a solid state drive (SSD) may also be used as necessary. At least some of the functions of each functional unit may be realized as dedicated electronic circuitry rather than as a program.

The data receiving unit 10 receives weather forecast data from an external device via a communication network (for example, the Internet). The external device is, for example, a data server operated by a weather forecasting agency. The weather forecast data received by the data receiving unit 10 contains information about the weather forecast. An example of the configuration of the weather forecast data is described later with reference to another drawing. The data receiving unit 10 passes the received weather forecast data to the variation generation unit 30. The data receiving unit 10 may also be called a "data acquisition unit." Instead of receiving data by communication, the data receiving unit 10 may acquire data by a method that does not rely on communication, such as reading it from a recording medium.

The template storage unit 20 stores template data for producing content. A template for producing content is called a content template. A content template holds, for each of a plurality of variations, a sentence template for generating a text consisting of one or more sentences. In one form, the content template may be configured as a sequence of topics. A topic is configured to include one or more variations. Within one topic, the variations can be selected in a mutually exclusive manner. An evaluation value for each variation may also be held, in predetermined form, as an attribute value of that variation in the content template. This evaluation value is used as information for evaluation when combinations of variations are searched for and selected. The sentence templates stored in the template storage unit 20 may be natural-language text (a sequence of character codes), text expressed in a synthesis instruction format, or both. The synthesis instruction format is a data format for giving synthesis instructions to a speech synthesizer, and includes data on the sequence of sounds and data on prosodic instructions. The specific form of the synthesis instruction format may differ from one speech synthesizer to another. The specific configuration of the templates is described later with reference to another drawing.

The variation generation unit 30 generates text for each of the plurality of variations included in the template by applying the received data to the sentence template. The variation generation unit 30 also determines the total duration of the generated text. One example of how the variation generation unit 30 determines this total duration is as follows. The variation generation unit 30 uses a speech synthesizer to generate synthesized speech for each sentence of the generated text. It then determines the total duration of the variation, which is the sum of the durations of the synthesized speech of those sentences. For example, once the variation generation unit 30 has generated the synthesized speech, the duration of that speech is necessarily determined.

The search processing unit 40 performs search processing under constraints consisting of a condition on the duration of pauses, which are the joints at which pieces of synthesized speech are concatenated, and a condition on the sum of the durations of the synthesized speech and the durations of the pauses. The evaluation function used by the search processing unit 40 in the search may be the sum of the evaluation values of the selected variations. In other words, the search processing unit 40 searches for a combination of variations that gives a high value of the evaluation function. The condition on the duration of pauses may include a condition on the duration of inter-sentence pauses, which are pauses at boundaries between sentences included in a variation, and a condition on the duration of inter-topic pauses, which are pauses at boundaries between topics. Specific examples of these conditions are described with reference to FIG. 9. Details of the processing by the search processing unit 40 are also described later.

The selection unit 50 selects, based on the value of the evaluation function, one combination of variations that satisfies the time constraints. For example, the selection unit 50 selects the combination with the best evaluation function value from among the combinations of variations that satisfy the constraints. Alternatively, depending on the search algorithm executed by the search processing unit 40, the selection unit 50 selects one combination whose evaluation function value is judged to be good (not necessarily the best).
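The search and selection steps can be sketched in Python as below. This is only an illustrative sketch under stated assumptions: exhaustive enumeration stands in for whatever search algorithm is actually used, the pause bounds are placeholder values (the actual conditions are those of FIG. 9), and the function name, argument names, and tuple layout are hypothetical.

```python
from itertools import product


def search_best_combination(topics, target,
                            s_pause=(0.3, 1.0), t_pause=(0.8, 2.0)):
    """Pick at most one variation per topic so that the total length can hit `target`.

    topics: list of {"required": bool,
                     "variations": [(score, n_sentences, speech_seconds), ...]}
    s_pause / t_pause: assumed (min, max) bounds in seconds for inter-sentence
    and inter-topic pauses.
    Returns (best_score, chosen_indices) or None if nothing is feasible.
    """
    options = []
    for topic in topics:
        indices = list(range(len(topic["variations"])))
        if not topic["required"]:
            indices.append(None)  # a non-required topic may be omitted
        options.append(indices)

    best = None
    for choice in product(*options):
        picked = [t["variations"][i] for t, i in zip(topics, choice) if i is not None]
        if not picked:
            continue
        speech = sum(v[2] for v in picked)
        n_sentence_gaps = sum(v[1] - 1 for v in picked)
        n_topic_gaps = len(picked) - 1
        shortest = speech + n_sentence_gaps * s_pause[0] + n_topic_gaps * t_pause[0]
        longest = speech + n_sentence_gaps * s_pause[1] + n_topic_gaps * t_pause[1]
        if not (shortest <= target <= longest):
            continue  # no admissible pause lengths reach the target duration
        score = sum(v[0] for v in picked)  # evaluation function: sum of evaluation values
        if best is None or score > best[0]:
            best = (score, choice)
    return best
```

A combination is treated as feasible when the target duration lies between the shortest and longest totals obtainable with admissible pauses; the exact pause lengths are left to the later pause adjustment step.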

The pause adjustment unit 60 adjusts the durations of the pauses so that the condition on the duration of the pauses is satisfied and the condition on the sum of the durations of the synthesized speech and the pauses (that is, the total duration of the generated content) is also satisfied. A pause is a silent interval in audio content. In this embodiment, the pause adjustment unit 60 adjusts both the inter-sentence pauses inserted at sentence boundaries and the inter-topic pauses inserted at topic boundaries. By adjusting the pause lengths, the pause adjustment unit 60 makes the overall length of the content generated by the content production device 1 satisfy the given condition.

As one example of its processing, the pause adjustment unit 60 may adjust the inter-sentence pauses so that they all have the same duration. It may likewise adjust the inter-topic pauses so that they all have the same duration. Conversely, the inter-sentence pauses need not be uniform in duration, and neither need the inter-topic pauses.
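One way to realize uniform pauses is sketched below in Python. It is a sketch under assumptions, not the method of the embodiment: the (min, max) bounds are placeholder values, and moving both pause types by the same fraction of their allowed ranges is just one simple choice among many. All names are hypothetical.

```python
def adjust_pauses(speech, n_sentence_gaps, n_topic_gaps, target,
                  s_range=(0.3, 1.0), t_range=(0.8, 2.0)):
    """Return (sentence_pause, topic_pause) in seconds, or None if infeasible.

    All inter-sentence pauses share one length and all inter-topic pauses share
    another, chosen so that speech plus pauses equals `target`.
    """
    shortest = n_sentence_gaps * s_range[0] + n_topic_gaps * t_range[0]
    longest = n_sentence_gaps * s_range[1] + n_topic_gaps * t_range[1]
    budget = target - speech  # time that must be filled with silence
    if budget < shortest - 1e-9 or budget > longest + 1e-9:
        return None
    # Position of the budget between the shortest and longest pause totals.
    frac = 0.0 if longest == shortest else (budget - shortest) / (longest - shortest)
    frac = min(1.0, max(0.0, frac))
    sentence_pause = s_range[0] + frac * (s_range[1] - s_range[0])
    topic_pause = t_range[0] + frac * (t_range[1] - t_range[0])
    return sentence_pause, topic_pause
```

Because both pause lengths are interpolated with the same fraction, the pause total is a linear function of that fraction, so the target is met exactly whenever it lies within the feasible range.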

The output unit 70 outputs the content formed by the combination of variations selected by the selection unit 50 (a single piece of audio content in which multiple topics are concatenated). In the content output by the output unit 70, the pause lengths (inter-sentence pauses and inter-topic pauses) are those already adjusted by the pause adjustment unit 60. Silence, or a sound equivalent to silence, has already been inserted into the pause intervals. The content output by the output unit 70 is passed to broadcasting equipment or Internet distribution equipment.

FIG. 2 is a schematic diagram showing an example of the configuration of the weather forecast data. As described above, the data receiving unit 10 receives this weather forecast data. As shown in the figure, the weather forecast data 100 contains three data items: date, prefecture, and content. The date is the date to which the weather forecast data applies. In the illustrated example, the date is "January 25, 2020." The prefecture is the name of the prefecture (region) to which the weather forecast data applies. In the illustrated example, the prefecture is "Kanagawa Prefecture." The content is the part that holds the substance of the weather forecast data. The content is a sequence of data chunks. In the illustrated example, the content is a sequence that includes "warning/advisory" data 101, "weather forecast (today)" data 102, "expected temperature (today)" data 103, and "probability of precipitation (today)" data 104. Further data chunks may follow the "probability of precipitation (today)" data 104. Here, the data is organized so that each of the "warning/advisory" data 101, the "weather forecast (today)" data 102, the "expected temperature (today)" data 103, and the "probability of precipitation (today)" data 104 corresponds to a topic, described later. A more specific configuration example of the "weather forecast (today)" data 102 is described next.

FIG. 3 is a schematic diagram showing an example of the configuration of the "weather forecast (today)" data 102. As shown in the figure, the "weather forecast (today)" data 102 is expressed as tabular data with the items region, wind direction, weather, time transition, and local weather. The region is the name of an area obtained by dividing the prefecture into smaller units. The wind direction is the forecast wind direction. The weather is a weather category expressed by words such as sunny, cloudy, rain, or snow. The time transition is an expression describing how the weather changes over time. Time-transition expressions often used in weather forecasts include "later cloudy," "occasional rain," and "temporary snow." The local weather is an expression describing weather conditions that may differ depending on more local positions within the region. In the illustrated example, for the region "East," the wind direction is "West," the weather is "sunny," the time transition is "later cloudy," and there is no local weather. For the region "West," the wind direction is "South," the weather is "sunny," the time transition is "later cloudy," and the local weather is "rain in places."

In the example shown in FIG. 3, the "weather forecast (today)" data 102 is expressed as tabular data. However, any format may be used to represent the data. For example, the data shown in FIG. 3 may be represented in XML (Extensible Markup Language), or in some other format.
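To illustrate the XML option, the following Python sketch parses one possible XML encoding of the FIG. 3 table. The element and attribute names are purely hypothetical (the source does not define a schema), and the values are the English renderings of the example rows.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for the "weather forecast (today)" table of FIG. 3.
XML_TEXT = """
<weatherForecastToday>
  <area region="East" wind="West" weather="sunny" transition="later cloudy" local=""/>
  <area region="West" wind="South" weather="sunny" transition="later cloudy" local="rain in places"/>
</weatherForecastToday>
"""


def parse_forecast(xml_text):
    """Return one dict per region row, keyed by the (assumed) attribute names."""
    root = ET.fromstring(xml_text.strip())
    return [dict(area.attrib) for area in root.findall("area")]


rows = parse_forecast(XML_TEXT)
```

Each resulting dictionary corresponds to one row of the table, so downstream code does not need to know which on-disk format was used.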

FIG. 3 shows a configuration example of the "weather forecast (today)" data 102; the "warning/advisory" data 101, the "expected temperature (today)" data 103, the "probability of precipitation (today)" data 104, and the other data are likewise structured in whatever form is appropriate. In this way, the data receiving unit 10 obtains from outside the information needed to generate content.

FIG. 4 is a schematic diagram showing an example of the configuration of the content template data stored in the template storage unit 20. As shown in the figure, the content template includes multiple topic templates, and these topic templates are ordered. In other words, one content template is configured as an ordered sequence of topic templates. In the illustrated example, the content template includes the topic templates "Greeting 1," "Topic: warning/advisory," "Topic: today's weather," "Topic: today's expected temperature," "Topic: probability of precipitation," and "Greeting 2." The configuration of a topic template is described in more detail next.

FIG. 5 is a schematic diagram showing an example of the configuration of the data of one topic template. As shown in the figure, the topic template data includes a required flag and variation data. The required flag indicates whether the topic is required in the content. A required topic must always be included when the content is generated. A topic that is not required may be omitted, for example to adjust the overall duration of the content. The creator of the content template may decide as appropriate whether to set the required flag to true or false. The variations are data representing the multiple forms of text that the content production device 1 can generate for the topic. As shown in the figure, the variation data may be organized, for example, as a table. In the illustrated example, the table holds information on five variations, numbered 1 to 5. The number of variations is not limited to five and is arbitrary.
The table has the items number, sentence template, evaluation value, generated text, synthesized speech of each sentence, and total duration. The sentence template is the template from which the variation generation unit 30 generates text. Each sentence template has portions (parameters) that can be replaced using the weather forecast data. The evaluation value is a numerical value used to rate the choice of that variation. For example, the evaluation value may be a fixed value predetermined for each variation. The generated text, synthesized speech, and total duration items are blank at the template stage and are filled in when the variation generation unit 30 generates the actual text. The generated text field stores the text that the variation generation unit 30 generated from the sentence template. The synthesized speech field stores the synthesized speech of each sentence that the variation generation unit 30 synthesized from the generated text. The synthesized speech data is stored, for example, as a series of sound pressure levels (or as an encoded form of that series). The total duration is the sum of the playback times of the synthesized speech of the sentences, and is expressed, for example, in seconds.
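The template layout described above could be modelled roughly as follows. This Python sketch is only one possible representation: the class and field names are hypothetical, and the example template string and evaluation value are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variation:
    number: int
    sentence_template: str           # contains replaceable parameters
    evaluation_value: float          # used when rating a combination of variations
    # The three fields below are blank at the template stage and are filled in
    # once the text has been generated and synthesized.
    generated_text: List[str] = field(default_factory=list)
    synthesized_speech: List[bytes] = field(default_factory=list)
    total_duration: Optional[float] = None  # seconds


@dataclass
class TopicTemplate:
    name: str
    required: bool                   # the required flag
    variations: List[Variation]


@dataclass
class ContentTemplate:
    topics: List[TopicTemplate]      # ordered sequence of topic templates


example = ContentTemplate(topics=[
    TopicTemplate(
        name="Topic: today's weather",
        required=True,
        variations=[Variation(1, "This is the weather today in (prefecture).", 5.0)],
    ),
])
```

Keeping the generated fields on the variation itself mirrors the table in FIG. 5, where the same row holds both the template and the results derived from it.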

Within a content template, whether a topic is required, or whether a topic must be omitted, may be defined so as to depend on whether other topics are adopted. For example, where topic A precedes topic B, the template may specify that "topic B is required if topic A is present." As another example, where topic A precedes topic B, the template may specify that "topic B cannot be adopted if topic A is present." In other words, relationships may be introduced in which multiple topics are adopted as a set, or in which multiple topics are adopted exclusively of one another. The forms of dependency between topics within a content template are not limited to those illustrated here, and any relationship may be introduced.
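The two example dependency rules can be checked with a few lines of Python. This is a hedged sketch: representing the rules as pairs of topic names, and the function name itself, are assumptions made for illustration only.

```python
def dependencies_ok(adopted, requires=(), excludes=()):
    """Check a set of adopted topic names against inter-topic rules.

    requires: pairs (a, b) meaning "if topic a is present, topic b is required".
    excludes: pairs (a, b) meaning "if topic a is present, topic b cannot be adopted".
    """
    for a, b in requires:
        if a in adopted and b not in adopted:
            return False
    for a, b in excludes:
        if a in adopted and b in adopted:
            return False
    return True
```

A search over combinations of variations could call such a check to discard candidate topic sets that violate the template's rules before evaluating them.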

The data of the content template, and of the topic templates that make up the content template, have been described above. In short, audio content is generated by arranging multiple topics in a fixed order in accordance with the content template. Topics that are not required may, however, be omitted. One topic can have multiple variations. Because the topic template holds a sentence template for each variation, the variation generation unit 30 can use the topic template to generate the text of each variation. Because the variation generation unit 30 actually synthesizes speech from the generated text, the total duration of that synthesized speech becomes known. The topic template also gives an evaluation value for each variation. Thus, after the variation generation unit 30 has generated the variations of each topic, the optimal form of the content as a whole can be searched for on the basis of their durations and evaluation values. This search processing is described later.

An example was described above in which a fixed value is assigned in advance as the evaluation value of each variation. However, the evaluation value of each variation does not necessarily have to be fixed in advance. For example, the evaluation value may vary depending on the content of the sentence generated from the sentence template.

FIG. 6 is a schematic diagram showing an example of the relationship between a sentence template included in the topic template described above and the sentences generated from that sentence template. In the illustrated example, the sentence template is data indicating that the sentence "This is the weather today in (prefecture)." may be followed by one or more repetitions of the sentence "In (region), wind will be (wind direction), (weather) (time transition) (local weather)." In this example, the repetition is over the regions within the prefecture. Each of (prefecture), (region), (wind direction), (weather), (time transition), and (local weather) in the sentence template is a parameter. A parameter is replaced with actual data when the template is converted into a generated sentence. The actual data that replaces the parameters is contained in the weather forecast data acquired by the data receiving unit 10. In this example, the parameter (prefecture) is replaced with the actual prefecture name "Kanagawa Prefecture" (see FIG. 2). As a result, based on the template "This is the weather today in (prefecture).", the variation generation unit 30 generates the sentence "This is the weather today in Kanagawa Prefecture." The parameters (region), (wind direction), (weather), (time transition), and (local weather) are each replaced with information from the "weather forecast (today)" data shown in FIG. 3. For example, for the eastern part of Kanagawa Prefecture, based on the template "In (region), wind will be (wind direction), (weather) (time transition) (local weather).", the variation generation unit 30 generates the sentence "In the eastern part, wind will be from the west, sunny, then cloudy." Likewise, for the western part of Kanagawa Prefecture, based on the same template, the variation generation unit 30 generates the sentence "In the western part, wind will be from the south, sunny, then cloudy, with rain in some places."
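The parameter substitution for this detailed variation can be sketched as follows. The function name `render_detailed`, the dictionary keys, and the rule that parameters without data are simply left out are assumptions for illustration; the specification does not define a concrete data format.

```python
def render_detailed(prefecture, regions):
    """Fill a FIG. 6 style template: one header sentence, then one sentence per region."""
    sentences = [f"This is the weather today in {prefecture}."]
    for r in regions:
        # Optional parameters that have no data are left out of the sentence (assumption).
        parts = [r["weather"], r.get("transition"), r.get("local")]
        forecast = ", ".join(p for p in parts if p)
        sentences.append(
            f"In the {r['region']}, wind will be from the {r['wind']}, {forecast}."
        )
    return sentences


# Hypothetical forecast records standing in for the "weather forecast (today)" data.
forecast_today = [
    {"region": "eastern part", "wind": "west",
     "weather": "sunny", "transition": "then cloudy"},
    {"region": "western part", "wind": "south",
     "weather": "sunny", "transition": "then cloudy",
     "local": "with rain in some places"},
]
sentences = render_detailed("Kanagawa Prefecture", forecast_today)
```

Each element of `sentences` corresponds to one sentence between sentence separators, which matters later because inter-sentence pauses are counted per sentence.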

FIG. 7 is a schematic diagram showing another example of the relationship between a sentence template included in the topic template described above and the sentences generated from that sentence template. Note that FIG. 6 and FIG. 7 show two different variations belonging to the same topic, "Today's weather". In the example of FIG. 7, the sentence template is data representing the expression "Today's weather in (prefecture) will be (weather) (time transition) throughout the prefecture." Compared with the sentence template of FIG. 6, this template omits the (wind direction) and (local weather) information. In other words, it conveys less information than the template of FIG. 6. In addition, this sentence template generates a sentence only when (weather) (time transition) is identical in all (regions) within the (prefecture). Otherwise, this variation is treated as not existing. Each of (prefecture), (weather), and (time transition) in this sentence template is a parameter. As in the case of FIG. 6, the parameters in the sentence template are replaced with information from the actual data (weather forecast data). By replacing the parameters in the sentence template of FIG. 7 with actual data, the variation generation unit 30 generates the sentence "Today's weather in Kanagawa Prefecture will be sunny, then cloudy throughout the prefecture."
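The conditional nature of this brief variation (it exists only when every region shares the same forecast) can be sketched as below. The function name `render_brief`, the dictionary keys, and the use of `None` to signal "variation does not exist" are assumptions for this example.

```python
def render_brief(prefecture, regions):
    """Fill a FIG. 7 style template, or return None when the variation does not exist."""
    forecasts = {(r["weather"], r.get("transition", "")) for r in regions}
    if len(forecasts) != 1:
        return None  # forecasts differ between regions: treat the variation as absent
    weather, transition = forecasts.pop()
    forecast = ", ".join(p for p in (weather, transition) if p)
    return [f"Today's weather in {prefecture} will be {forecast} throughout the prefecture."]


# Hypothetical inputs: one uniform forecast, one mixed forecast.
uniform = [
    {"region": "eastern part", "weather": "sunny", "transition": "then cloudy"},
    {"region": "western part", "weather": "sunny", "transition": "then cloudy"},
]
mixed = [
    {"region": "eastern part", "weather": "sunny", "transition": "then cloudy"},
    {"region": "western part", "weather": "rain", "transition": ""},
]
brief = render_brief("Kanagawa Prefecture", uniform)
missing = render_brief("Kanagawa Prefecture", mixed)
```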

As described above, the two sentence templates shown in FIG. 6 and FIG. 7 belong to two variations of the same topic that differ in their level of detail. Based on the same received data (here, weather forecast data), the variation generation unit 30 uses multiple sentence templates to generate multiple sets of sentences as variations of one topic. The variation generation unit 30 also generates a voice for each of these sets of generated sentences by voice synthesis. The total duration of the voice differs from variation to variation. In the examples of FIG. 6 and FIG. 7, the sentences generated from the template of FIG. 6 are longer than the sentence generated from the template of FIG. 7. Accordingly, the total duration of the synthetic voice corresponding to the template of FIG. 6 is also longer than that of the synthetic voice corresponding to the template of FIG. 7. In this way, by using different sentence templates, the variation generation unit 30 generates synthetic voices with different total durations for one topic.

FIG. 8 is a schematic diagram showing the relationship between the topics, and their variations, generated by the variation generation unit 30 based on the content template. In other words, FIG. 8 shows a search space that has multiple variations for each topic. As illustrated, the search space consists of a serial sequence of topics, and each topic in the sequence can have one or more parallel variations. In the example of FIG. 8, "Greeting 1" has one variation. "Topic 1" has three parallel variations. "Topic 2" has three parallel variations, one of which is marked "None". "None" indicates that the variation has neither generated sentences nor synthetic voice. "Greeting 2" has one variation. Each ordinary variation, that is, each variation other than the one marked "None", has generated sentences, a synthetic voice for each sentence, a total duration, and an evaluation value. The generated sentences are the sentences that the variation generation unit 30 generated from the sentence template. The synthetic voice is the voice that the variation generation unit 30 synthesized from each of the generated sentences. The total duration is the sum of the durations, from beginning to end, of the synthetic voices of the individual sentences generated by the variation generation unit 30. The evaluation value is a numerical value that indicates how desirable it is for the variation to be selected. As a general rule, a larger evaluation value is given to a variation that is more detailed and that, as a result, has a longer total duration.

The variation generation unit 30 constructs the search space shown in FIG. 8. In other words, the variation generation unit 30 generates the sentences of each variation in the search space, generates the synthetic voice corresponding to those sentences, and determines the total duration of that synthetic voice. The evaluation value of each variation is either predetermined or determined when the generated sentences or the synthetic voice are generated. Thus, by the time the variation generation unit 30 has constructed the search space, the evaluation value of every variation is fixed. With the concrete duration and evaluation value of each variation in the search space of FIG. 8 determined in this way, the search processing unit 40 can select variations so as to maximize the total evaluation value. Specifically, the search processing unit 40 can select one variation from each topic. The search processing unit 40 then finds the solution that maximizes the total evaluation value while satisfying predetermined conditions. A solution here is a combination of the variations selected for the topics in the content. The variation generation unit 30 writes data representing the search space of FIG. 8 to a storage medium such as a memory so that the search processing unit 40 can refer to it.
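One possible in-memory representation of such a search space is sketched below. The class names, field names, variation labels, durations, and evaluation values are all hypothetical; only the shape (a serial list of topics, each with parallel variations, one of them possibly empty) follows the description.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Variation:
    name: str
    sentence_durations: list = field(default_factory=list)  # seconds, one entry per sentence
    score: float = 0.0  # evaluation value

    @property
    def total_duration(self):
        return sum(self.sentence_durations)

    @property
    def is_none(self):
        # A "None" variation has no generated sentences and no synthetic voice.
        return not self.sentence_durations


@dataclass
class Topic:
    name: str
    variations: list


# Hypothetical numbers laid out like FIG. 8: 1, 3, 3, and 1 variations.
search_space = [
    Topic("Greeting 1", [Variation("fixed", [4.0], 1.0)]),
    Topic("Topic 1", [Variation("detailed", [5.0, 9.0, 9.5], 3.0),
                      Variation("standard", [5.0, 8.0], 2.0),
                      Variation("brief", [6.0], 1.0)]),
    Topic("Topic 2", [Variation("detailed", [7.0, 7.5], 2.0),
                      Variation("brief", [5.5], 1.0),
                      Variation("none")]),
    Topic("Greeting 2", [Variation("fixed", [3.0], 1.0)]),
]

# Selecting one variation per topic gives 1 * 3 * 3 * 1 candidate combinations.
n_combinations = prod(len(t.variations) for t in search_space)
```

Keeping per-sentence durations, rather than only the total, lets later steps count inter-sentence pauses directly from the selection.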

FIG. 9 is a schematic diagram showing the constraint conditions under which the search processing unit 40 performs the search process. As illustrated, the conditions include the total content length, the inter-sentence pause lower limit, the inter-sentence pause upper limit, the inter-topic pause lower limit, and the inter-topic pause upper limit. These conditions are, for example, set in advance and written to a memory or the like that the search processing unit 40 can reference. The total content length is the duration of the entire content generated by the content production device 1. The duration obtained by concatenating the audio of all topics in the content to be produced must not exceed the total content length. The inter-sentence pause lower limit and upper limit are the lower and upper limits of the duration of a pause between sentences. The inter-topic pause lower limit and upper limit are the lower and upper limits of the duration of a pause between topics. All of these settings are expressed in seconds.

An inter-sentence position is a junction between sentences within a generated topic. In other words, it is a place where "<sentence separator>" appears in the sentence templates and generated sentences shown in FIG. 6 and FIG. 7. An inter-topic position is a place between one topic and the next in the content template. The places marked "<topic separator>" in FIG. 8 are inter-topic positions.

In the illustrated example, the total content length is 240.000 seconds. The inter-sentence pause lower limit is 0.700 seconds and the upper limit is 1.500 seconds. The inter-topic pause lower limit is 1.000 seconds and the upper limit is 3.000 seconds.

The search processing unit 40 searches for the combination of variations that maximizes the sum of the evaluation values of the selected variations while satisfying the set conditions. The inter-sentence pause is set so that its lower and upper limits differ, and so is the inter-topic pause. The search processing unit 40 therefore selects variations on the premise that the durations of these pauses can be adjusted afterwards. Specifically, a combination is admissible when both of the following hold for the result of concatenating all the synthetic voices of the selected variations: with every inter-sentence and inter-topic pause set to its lower limit, the resulting length is less than or equal to the content length; and with every pause set to its upper limit, the resulting length is greater than or equal to the content length. Among the admissible combinations, the search processing unit 40 searches for the one with the largest sum of evaluation values. That the inter-sentence and inter-topic pauses are adjustable means that the duration of each pause can be lengthened or shortened within the set limits.

The above constraint on content length can be expressed mathematically as follows. Let the content length be $a$ seconds (a fixed length), the number of topics included in the content be $b$, the total number of sentences be $c$, the total duration of the voice portions be $d$ seconds, the inter-sentence pause lower limit be $e$ seconds, the inter-sentence pause upper limit be $f$ seconds, the inter-topic pause lower limit be $g$ seconds, and the inter-topic pause upper limit be $h$ seconds. The constraint is then expressed by the following formula (1).

$$d + (c - b)\,e + (b - 1)\,g \;\le\; a \;\le\; d + (c - b)\,f + (b - 1)\,h \tag{1}$$

If the search is performed so as to satisfy this constraint, the total content length can be made exactly $a$ seconds by appropriately adjusting the lengths of the inter-sentence pauses and the inter-topic pauses.
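As a quick check of formula (1), the sketch below evaluates both bounds for a candidate combination. The function name `is_feasible` is hypothetical, the default limits are the FIG. 9 example values, and the topic count, sentence count, and voice durations in the usage lines are made-up numbers.

```python
def is_feasible(a, b, c, d, e=0.7, f=1.5, g=1.0, h=3.0):
    """Check formula (1): can pauses within their limits stretch the content to exactly a?

    a: content length [s], b: number of topics, c: total number of sentences,
    d: total voice duration [s], e/f: inter-sentence pause limits [s],
    g/h: inter-topic pause limits [s].
    """
    shortest = d + (c - b) * e + (b - 1) * g  # every pause at its lower limit
    longest = d + (c - b) * f + (b - 1) * h   # every pause at its upper limit
    return shortest <= a <= longest


# Hypothetical candidate: 4 topics, 20 sentences, 240-second slot.
ok = is_feasible(240.0, 4, 20, 220.0)        # bounds are about 234.2 and 253.0
too_long = is_feasible(240.0, 4, 20, 230.0)  # shortest possible is about 244.2
too_short = is_feasible(240.0, 4, 20, 180.0) # longest possible is 213.0
```

With 220 seconds of voice the slot can be filled exactly, whereas 230 seconds overruns even with minimal pauses and 180 seconds cannot be stretched far enough.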

Note that, instead of fixing the content length as described above, an upper limit and a lower limit may be set for the content length, and the search process may be performed under that constraint.

Note that, for an optional topic, the search processing unit 40 may find a solution in which no variation belonging to that topic is selected. An optional topic is a topic whose required flag is set to "false" in the topic template shown in FIG. 5. Selecting no variation for an optional topic is equivalent to selecting a variation that has no generated sentences or synthetic voice (for example, the variation marked "(none)" in FIG. 8).

The search processing unit 40 may use any search method (algorithm) to perform the search process. As one example, it may use an A* search (A* search algorithm) with a predetermined length constraint. A* search itself is an existing technique. In this case, the search processing unit 40 starts from the first topic of the content and searches in the depth direction, giving priority to variations with high evaluation values. At a given topic position during the search, the search processing unit 40 holds three values for the remaining topics: the expected evaluation value (the sum of the evaluation values when the variation with the maximum evaluation value is chosen for each topic), the total maximum duration (the total when, for each topic, the variation with the longest total duration is chosen and every pause is set to its upper limit), and the total minimum duration (the total when, for each topic, the variation with the shortest total duration is chosen and every pause is set to its lower limit). The search processing unit 40 then adds the maximum and minimum durations of the remaining topics to, respectively, the maximum (with every pause at its upper limit) and the minimum (with every pause at its lower limit) of the total duration up to that topic position, thereby estimating the maximum and minimum total duration of the content to be generated. Among the hypotheses whose estimated maximum and minimum total durations fall within the predetermined range, the search processing unit 40 extends the search for the hypothesis with the largest value of (the sum of the evaluation values up to that topic position) + (the expected evaluation value of the remaining topics).
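The following is a minimal best-first sketch in the spirit of the search described above, not the method of the specification itself. It assumes each variation is a pair `(sentence_durations, score)`, that an empty duration tuple stands for a "None" variation, and that deliberately loose duration bounds are acceptable for pruning; the function name and all numbers are hypothetical.

```python
import heapq
from itertools import count


def search(topics, a, e, f, g, h):
    """Return (picked variation indices, total score) or None if nothing fits length a."""
    n = len(topics)
    best_score = [0.0] * (n + 1)  # optimistic score of topics i..n-1
    min_time = [0.0] * (n + 1)    # loose lower bound on time added by topics i..n-1
    max_time = [0.0] * (n + 1)    # loose upper bound on time added by topics i..n-1
    for i in range(n - 1, -1, -1):
        best_score[i] = best_score[i + 1] + max(s for _, s in topics[i])
        min_time[i] = min_time[i + 1] + min(sum(d) for d, _ in topics[i])
        max_time[i] = max_time[i + 1] + h + max(
            sum(d) + max(len(d) - 1, 0) * f for d, _ in topics[i])
    tie = count()  # tie-breaker so the heap never compares the payload fields
    heap = [(-best_score[0], next(tie), 0, 0.0, (), 0.0, 0, 0)]
    while heap:
        _, _, i, score, picks, speech, sents, used = heapq.heappop(heap)
        if i == n:
            return picks, score  # first complete hypothesis popped has the best score
        for k, (durs, s) in enumerate(topics[i]):
            speech2 = speech + sum(durs)
            sents2 = sents + len(durs)
            used2 = used + (1 if durs else 0)  # a "None" variation adds no topic
            gaps_s = sents2 - used2            # inter-sentence pauses so far
            gaps_t = max(used2 - 1, 0)         # inter-topic pauses so far
            lo = speech2 + gaps_s * e + gaps_t * g + min_time[i + 1]
            hi = speech2 + gaps_s * f + gaps_t * h + max_time[i + 1]
            if lo > a or hi < a:
                continue  # this hypothesis can no longer hit length a
            score2 = score + s
            heapq.heappush(heap, (-(score2 + best_score[i + 1]), next(tie),
                                  i + 1, score2, picks + (k,), speech2, sents2, used2))
    return None


# Hypothetical search space: (sentence durations in seconds, evaluation value).
topics = [
    [((4.0,), 1.0)],
    [((5.0, 9.0, 9.5), 3.0), ((5.0, 8.0), 2.0), ((6.0,), 1.0)],
    [((7.0, 7.5), 2.0), ((5.5,), 1.0), ((), 0.0)],
    [((3.0,), 1.0)],
]
result_40 = search(topics, 40.0, 0.7, 1.5, 1.0, 3.0)
result_55 = search(topics, 55.0, 0.7, 1.5, 1.0, 3.0)
result_100 = search(topics, 100.0, 0.7, 1.5, 1.0, 3.0)
```

Because the remaining-score estimate never underestimates what can still be gained, the first complete hypothesis taken from the queue has the highest total score among the feasible combinations. At the last topic the two bounds coincide with formula (1), so only combinations that can be adjusted to exactly the target length are ever completed.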

However, the processing by the search processing unit 40 does not necessarily have to be based on the algorithm given above as an example. Whichever search method is used, the search processing unit 40 searches the search space for the solution with the largest sum of evaluation values while excluding combinations of variations for which the total duration of the content cannot satisfy the constraint.

The duration T of the entire content used by the search processing unit 40 during the search is calculated as follows. Once a combination of variations is determined, the lengths of the synthetic voices of those variations are determined, as are the number of inter-sentence pauses and the number of inter-topic pauses needed when generating the content from those variations. The duration of one inter-sentence pause and the duration of one inter-topic pause are variable within the constraints shown in FIG. 9. All inter-sentence pauses are given the same duration, and all inter-topic pauses are likewise given the same duration. The values A, B, and C (all durations) are then defined as follows.
A: the total duration of the synthetic voices of the selected variations
B: (duration of an inter-sentence pause) × (number of inter-sentence pauses), where the number of inter-sentence pauses = total number of sentences - number of topics
C: (duration of an inter-topic pause) × (number of inter-topic pauses), where the number of inter-topic pauses = number of topics - 1
The duration T of the entire content is expressed as T = A + B + C. Here, B and C are variable, and each can take any value between its minimum and its maximum. The search processing unit 40 searches for a combination of variations under the constraint that T satisfies the condition described above.
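The quantities A, B, and C can be derived directly from a selection, as in the sketch below. It assumes that each selected variation is given as a list of per-sentence durations and that an empty list stands for an omitted topic; the function name and the numbers are hypothetical.

```python
def duration_range(selected, e, f, g, h):
    """Return (T_min, T_max) for one selected variation per topic.

    selected: list of per-sentence duration lists [s]; an empty list means the
    topic is omitted (a "None" variation). e/f and g/h are the pause limits [s].
    """
    spoken = [s for s in selected if s]                # omitted topics add nothing
    A = sum(sum(s) for s in spoken)                    # total voice duration
    n_sentence_pauses = sum(len(s) for s in spoken) - len(spoken)
    n_topic_pauses = max(len(spoken) - 1, 0)
    t_min = A + n_sentence_pauses * e + n_topic_pauses * g  # B and C at their minimum
    t_max = A + n_sentence_pauses * f + n_topic_pauses * h  # B and C at their maximum
    return t_min, t_max


# Hypothetical selection: 4 topics, 6 sentences, 34.5 s of voice.
t_min, t_max = duration_range([[4.0], [5.0, 8.0], [7.0, 7.5], [3.0]], 0.7, 1.5, 1.0, 3.0)
```

Here there are 2 inter-sentence pauses and 3 inter-topic pauses, so T can range from about 38.9 to 46.5 seconds.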

FIG. 10 is a flowchart showing the overall processing procedure of the content production device 1. The procedure is described below with reference to this flowchart.

First, in step S1, the data receiving unit 10 receives data for content production. The data for content production is, for example, the weather forecast data 100 described above (see FIG. 2). The data receiving unit 10 passes the received data to the variation generation unit 30.

Next, in step S2, the variation generation unit 30 reads the content template data from the template storage unit 20. An example of the content template data is shown in FIG. 4.

Next, in step S3, the variation generation unit 30 applies the data received in step S1 to the template read in step S2 to generate variations of the content. Here, the variation generation unit 30 generates every variation that might be used. Each variation includes the generated sentences, the synthetic voice of each sentence, and information on the total duration of that voice. An evaluation value is also assigned to each variation.

Next, in step S4, the search processing unit 40 searches over the variations generated in step S3. As already described, the search processing unit 40 searches for a combination of variations that satisfies the condition on the length (duration) of the content and that yields a high evaluation value. The search processing unit 40 passes the result of the search process to the selection unit 50. The result includes information on the combinations of variations and on the evaluation value obtained when each combination is selected.

Next, in step S5, the selection unit 50 selects a combination of variations. Specifically, the selection unit 50 selects a combination of variations so that the total evaluation value is high. As one example, the selection unit 50 selects, from among the combinations that satisfy the constraint on content length, the combination with the highest total evaluation value. The selection unit 50 passes information on the selected combination of variations to the pause adjustment unit 60.

Next, in step S6, the pause adjustment unit 60 adjusts the lengths of the pauses. Specifically, the pause adjustment unit 60 adjusts the duration of each inter-sentence pause and the duration of each inter-topic pause. Based on the length $L$ of the entire content to be produced, the pause adjustment unit 60 determines $P_S$ and $P_T$ that satisfy the following equation (2).

$$L_U + N_S \cdot P_S + N_T \cdot P_T = L \tag{2}$$

In equation (2), $N_S$ and $N_T$ are, respectively, the number of inter-sentence pauses and the number of inter-topic pauses when the selected combination of variations is adopted. Once the combination of variations is determined, the values of $N_S$ and $N_T$ are determined. $P_S$ and $P_T$ are, respectively, the duration of one inter-sentence pause and the duration of one inter-topic pause; these are the values that the pause adjustment unit 60 must determine. $L_U$ is the total duration of the synthetic voices. The pause adjustment unit 60 may, as necessary, strike an appropriate balance between the value of $P_S$ and the value of $P_T$.
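Equation (2) has two unknowns, so some balancing rule is needed to pick a single solution. The sketch below assumes one simple rule that is not taken from the specification: both pause types are moved by the same fraction t of the way from their lower limit to their upper limit. The function name and the example numbers are hypothetical.

```python
def adjust_pauses(L, L_U, N_S, N_T, e=0.7, f=1.5, g=1.0, h=3.0, tol=1e-9):
    """Solve L_U + N_S*P_S + N_T*P_T = L with e <= P_S <= f and g <= P_T <= h.

    Balancing rule (an assumption for this sketch): P_S = e + t*(f - e) and
    P_T = g + t*(h - g) share the same fraction t in [0, 1].
    """
    slack = L - (L_U + N_S * e + N_T * g)    # time left after minimal pauses
    room = N_S * (f - e) + N_T * (h - g)     # how much the pauses can stretch in total
    if slack < -tol or slack > room + tol:
        raise ValueError("no pause lengths within the limits give the target length")
    t = min(max(slack / room, 0.0), 1.0) if room > 0 else 0.0
    return e + t * (f - e), g + t * (h - g)


# Hypothetical case: 34.5 s of voice, 2 inter-sentence and 3 inter-topic pauses, 40 s slot.
P_S, P_T = adjust_pauses(40.0, 34.5, 2, 3)
total = 34.5 + 2 * P_S + 3 * P_T
```

Other balancing rules, such as stretching inter-topic pauses first, would satisfy equation (2) equally well; the shared fraction is chosen here only because it keeps both pause types proportionally relaxed.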

Once the values of $P_S$ and $P_T$ have been obtained, the pause adjustment unit 60 concatenates all of the generated synthetic voices while inserting pauses (silent intervals) of the lengths given by $P_S$ and $P_T$ between sentences and between topics, respectively, thereby creating a single piece of audio content. The pause adjustment unit 60 passes the created audio content to the output unit 70.
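The concatenation step can be pictured as building a timeline of voice and silence segments. In the sketch below, durations stand in for the actual audio data, omitted topics are assumed to have been removed already, and the function name, segment labels, and numbers are hypothetical.

```python
def build_timeline(topics, p_s, p_t):
    """Interleave voice segments with inter-sentence and inter-topic pauses.

    topics: list of per-sentence duration lists [s], one list per adopted topic.
    Returns a list of (kind, duration) tuples in playback order.
    """
    timeline = []
    for ti, sentences in enumerate(topics):
        if ti > 0:
            timeline.append(("topic_pause", p_t))         # silence between topics
        for si, duration in enumerate(sentences):
            if si > 0:
                timeline.append(("sentence_pause", p_s))  # silence between sentences
            timeline.append(("speech", duration))
    return timeline


# Hypothetical selection: 4 topics, 6 sentences, pauses of 1.0 s and 2.0 s.
timeline = build_timeline([[4.0], [5.0, 8.0], [7.0, 7.5], [3.0]], 1.0, 2.0)
total_length = sum(duration for _, duration in timeline)
```

In a real implementation each `"speech"` entry would carry the synthesized samples and each pause would be rendered as a run of silent samples of the corresponding length.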

Next, in step S7, the output unit 70 outputs the created audio content to the outside. The audio content is distributed, for example, by broadcasting (radio or television) or by Internet distribution.

FIG. 11 is a schematic diagram showing an example screen of the user interface of the content production device 1. The intended user of the content production device 1 is a content creator. The content production device 1 functions, for example, as a server device and displays this screen on the display of a user terminal (such as a PC) acting as a client device. The illustrated screen presents to the user the structure of the audio content (a weather information program) produced by the content production device 1.

On this screen, the character string "Content Production" is displayed as the screen title. The character string "2020/1/25 10:00 AM Weather Information" is displayed as the content name. The content name is set as appropriate. The screen also shows that the content length is "240 seconds". The content length is set in advance (see FIG. 9). The character string "2020/1/25 10:00:00" is displayed as the broadcast date and time, which is also set in advance. The content production device 1 may automatically output the generated audio content to the outside at the appropriate time based on the broadcast date and time setting. The character string "Not completed" is displayed as the transmission status. The transmission status may be displayed, for example, based on per-content status information managed by a management unit (not shown) in the content production device 1.

The lower part of this screen displays a list of information for each topic. The screen may be made vertically scrollable as appropriate. The topic information displayed includes the name of the topic, a name indicating the type of variation, and the text of the generated sentences. The screen may also have a listen button and an edit button. When the user presses (for example, clicks) the listen button, the user can listen to the synthetic voice of that topic. When the user presses the edit button, the user may be allowed to correct (edit) the generated sentences on an edit screen and to resynthesize the voice based on the corrected sentences.

Note that the user interface shown here is only an example; the content production device 1 may display other information on the screen or allow the user to perform other operations.

According to this embodiment, the content production device 1 can automatically produce content by selecting appropriate sentences from a plurality of variations based on a template and combining them. The content production device 1 can also automatically produce content whose overall length (duration) matches a desired value. Furthermore, the content production device 1 can automatically produce content announced at a natural speed, without performing speech rate conversion. Finally, because the content production device 1 selects variations based on their evaluation values within the time constraint (the duration of the content), it can automatically produce content with a high evaluation value, that is, content whose substance is more preferable.
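The selection described above can be illustrated with a minimal sketch. It is not the search method of the embodiment: it simply enumerates every combination by brute force, uses a single uniform pause between topics, and all names (`select_variations`, `pause_min`, `pause_max`) and sample values are assumptions made for illustration.

```python
from itertools import product

def select_variations(topics, target, pause_min, pause_max):
    """Pick one variation per topic so that speech plus pauses equals `target`.

    topics: list of topics; each topic is a list of (name, duration_sec, score).
    Returns (total_score, chosen_variations, pause_sec), or None if infeasible.
    Hypothetical helper; brute-force enumeration stands in for the real search.
    """
    n_pauses = len(topics) - 1
    best = None
    for combo in product(*topics):
        speech = sum(duration for _, duration, _ in combo)
        # Time left over for pauses; it must be reachable within the pause bounds.
        slack = target - speech
        if not (n_pauses * pause_min <= slack <= n_pauses * pause_max):
            continue
        score = sum(s for _, _, s in combo)
        if best is None or score > best[0]:
            pause = slack / n_pauses if n_pauses else 0.0
            best = (score, combo, pause)
    return best

# Illustrative data only: (name, duration in seconds, evaluation value).
example_topics = [
    [("greeting", 5.0, 1)],
    [("weather_short", 20.0, 2), ("weather_long", 30.0, 5)],
    [("temp_short", 10.0, 1), ("temp_mid", 12.0, 2), ("temp_long", 18.0, 3)],
]
result = select_variations(example_topics, target=50.0, pause_min=0.5, pause_max=3.0)
```

With these sample values, two combinations fit the 50-second target, and the one with the higher total evaluation value is returned together with the pause length that makes the total exact. No speech rate conversion is involved.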

At least some of the functions of the content production device in the embodiment described above can be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the functions may be realized by having a computer system read and execute the program recorded on that medium. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The term "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, CD-ROMs, DVD-ROMs, and USB memory devices, and to storage devices such as hard disks built into a computer system.
The term "computer-readable recording medium" may further include media that hold a program temporarily and dynamically, such as a communication line used when a program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period of time, such as the volatile memory inside a computer system acting as the server or client in that case. The program may realize only some of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

Although an embodiment has been described above, the present invention may also be implemented in the following modified examples.

In the content example described in the embodiment, the topics included "Greeting 1" and "Greeting 2." Each of these topics contained only one variation. For topics that do not have multiple variations, automatic generation by the content production device 1 (the search for a solution that adjusts the duration) may be omitted. For example, the portion corresponding to the greeting may be produced by another method.

In the embodiment, speech synthesis was performed in advance for all variations in order to determine the duration of each sentence, but speech synthesis generally requires time and computational resources. Therefore, before speech synthesis is actually performed, a simpler process that only determines the accurate duration of each sentence may be performed first. Determining the accurate duration of a sentence without actually synthesizing speech is possible with existing technology (for example, by using the duration model that is common in DNN speech synthesis). In other words, only the durations are obtained in place of the speech synthesis performed before the search; the search is then carried out; and after the search finishes, speech synthesis is performed only for the sentences adopted as the solution.
In general, the cost of determining only the duration of speech is far smaller than the cost of actually generating the speech waveform.
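The ordering described in this modification (estimate durations, search, then synthesize only the adopted sentences) can be sketched as follows. This is an illustrative outline only: `produce_content` and the three stand-in callables are hypothetical, and the stand-ins do not model a real duration model, search, or synthesizer.

```python
def produce_content(topics, estimate_duration, search, synthesize):
    """Run the three stages in order; all callables are supplied by the caller.

    topics: list of topics, each a list of candidate texts (variations).
    """
    # Stage 1: cheap duration estimate for every candidate (no waveform generated).
    durations = {text: estimate_duration(text)
                 for variations in topics for text in variations}
    # Stage 2: the search sees durations only.
    chosen = search(topics, durations)
    # Stage 3: expensive synthesis runs only for the adopted texts.
    return [synthesize(text) for text in chosen]

# Stand-ins for illustration (assumptions, not real components).
synth_calls = []

def fake_estimate(text):
    return 0.2 * len(text)  # pretend each character takes 0.2 s

def pick_shortest(topics, durations):
    # Toy "search": take the shortest variation of each topic.
    return [min(variations, key=durations.__getitem__) for variations in topics]

def fake_synthesize(text):
    synth_calls.append(text)
    return f"<waveform for {text!r}>"

example_topics = [
    ["Sunny today.", "It will be sunny all day today."],
    ["High of 12.", "The high will reach 12 degrees."],
]
audio = produce_content(example_topics, fake_estimate, pick_shortest, fake_synthesize)
```

In this toy run there are four candidate texts but the synthesizer stand-in is called only twice, which is the saving the modification aims at.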

The above embodiment was described with a configuration that includes speech synthesis processing, but the content production device 1 need not include it. After the search process finishes, the device may output the utterance text only for the sentences adopted as the solution. The output utterance text may then be synthesized into speech separately.

The above embodiment was described using an example in which the duration of each sentence is determined accurately, but the duration of each sentence may contain an error. In other words, at the search stage, the duration of each sentence is allowed to contain a certain amount of error. This is because, as long as the durations have a certain degree of accuracy at the search stage, the error can be absorbed by adjustment in the pause adjustment unit or by length adjustment using speech rate conversion technology or the like. In that case, the total content length shown in FIG. 9 may be a parameter expressed by a lower limit and an upper limit, like the other parameters. Alternatively, the total content length may have an allowable error range. The allowable error range may be expressed in seconds (for example, ±10 seconds) or as a ratio to the total content length (for example, ±5%).
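The two ways of expressing the allowable error range can be illustrated with a small check. The function name `within_tolerance` and its parameters are assumptions introduced for this sketch, not part of the embodiment.

```python
def within_tolerance(total, target, abs_tol=None, rel_tol=None):
    """Return True if `total` is within the allowed error of `target` (seconds).

    Exactly one of abs_tol (seconds) or rel_tol (fraction of target) must be given.
    Hypothetical helper for illustration.
    """
    if (abs_tol is None) == (rel_tol is None):
        raise ValueError("specify exactly one of abs_tol or rel_tol")
    allowed = abs_tol if abs_tol is not None else target * rel_tol
    return abs(total - target) <= allowed

# A 240-second target with +/-10 seconds, and with +/-5% (that is, +/-12 seconds).
ok_abs = within_tolerance(248.0, 240.0, abs_tol=10.0)
ok_rel = within_tolerance(250.0, 240.0, rel_tol=0.05)
too_long = within_tolerance(253.0, 240.0, rel_tol=0.05)
```

A search that accepts such a range would treat any combination passing this check as feasible and leave the remaining difference to later adjustment.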

The above embodiment described an example in which a fixed value is given in advance as the evaluation value of each variation, but the value need not be fixed in advance. It is sufficient that the evaluation value of a variation is determined by the time the search process is performed. For example, during the search, the evaluation value may vary depending on whether the sentences adopted so far include an identical sentence or a sentence with the same meaning. This avoids repeating the same expression or the same information. The evaluation value may also be changed according to the total duration of the sentences adopted so far, which makes it possible to vary the substance of the content according to the remaining time. In other words, even for the same generated sentence, the evaluation value may differ depending on when it appears in the content and on the sentences that precede and follow it.
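A context-dependent evaluation value might look like the following minimal sketch. It only detects exact repeats; detecting sentences with the same meaning, or varying the value with elapsed time, would need additional logic not shown here. The name `dynamic_score` and the `repeat_factor` parameter are assumptions.

```python
def dynamic_score(text, base_score, adopted, repeat_factor=0.0):
    """Evaluation value of `text` given the sentences adopted so far.

    If `text` already appears in `adopted`, its base score is scaled by
    `repeat_factor` (0.0 removes all value from an exact repeat).
    Hypothetical helper for illustration.
    """
    if text in adopted:
        return base_score * repeat_factor
    return base_score

adopted_so_far = ["Expect rain in the afternoon."]
fresh = dynamic_score("Highs will reach 12 degrees.", 3.0, adopted_so_far)
repeat = dynamic_score("Expect rain in the afternoon.", 3.0, adopted_so_far)
```

Because the value is computed at search time from the current partial solution, the same generated sentence can score differently at different positions in the content.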

The above embodiment described processing performed after the exact length of each sentence has been determined in advance, but this length may be an approximate value. In that case, the actual length of the content produced by this method may deviate from the target length, depending on the differences between the approximate values and the exact values. If the deviation is small, it may be corrected to the exact length using speech rate conversion technology or the like. It is known that, within a certain range, applying speech rate conversion does not impair naturalness (for example, A. Nakamura et al., "A New Approach to Compensate Degeneration of Speech Intelligibility for Elderly Listeners", IEEE Transactions on Broadcasting, Vol. 42, No. 3, 1996). For example, when the final duration of the content is 240 seconds, the sentences may be composed with an allowable error of 12 seconds, and this error may then be adjusted by speech rate conversion or the like so that it does not sound unnatural. The content must then be produced within the allowable error. One way to do so is to measure in advance the average speech rate of the speech synthesizer of the device, use it as the reference speech rate, and estimate the utterance time from the number of characters using that value. Since the speech rate of synthesized speech is generally constant, the utterance time can be expected to be estimated within the allowable error. Depending on the use of the content, slightly faster or slower speech may be required. In that case, the average speech rate of the speech synthesizer can be uniformly shifted to the desired value by speech rate conversion technology or the like and used as the new reference speech rate; as in the method above, the utterance time can then be estimated with little error according to that reference rate. By managing the reference speech rate, content is produced with an amount of information that matches that rate.
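The character-count estimate and the residual correction can be worked through in a short sketch. The reference rate of 8 characters per second and the function names are illustrative assumptions, not values from the specification.

```python
def estimate_duration(text, chars_per_sec):
    """Estimate utterance time in seconds from the character count.

    chars_per_sec is the reference speech rate measured beforehand for the
    synthesizer (an assumed input here).
    """
    return len(text) / chars_per_sec

def rate_factor(actual_sec, target_sec):
    """Playback-speed multiplier that maps actual_sec onto target_sec.

    A value above 1 means the audio must be sped up; below 1, slowed down.
    """
    return actual_sec / target_sec

# An 80-character sentence at an assumed reference rate of 8 characters/second.
estimated = estimate_duration("x" * 80, chars_per_sec=8.0)

# Content that came out at 252 s against a 240 s target (12 s over).
factor = rate_factor(252.0, 240.0)
```

For the 240-second example with a 12-second allowance, the required speed change stays between 0.95 and 1.05 (228/240 and 252/240), that is, within 5% of the original rate.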

The embodiment described processing that automatically generates audio content for a weather forecast (weather information). The generated content may be something other than a weather forecast. For example, the content production device 1 may produce a news program in which news is read aloud by synthesized speech, a securities market program that reports the price of each security using synthesized speech, a music program in which the pieces to be played are introduced by synthesized speech, or other content.

An embodiment of the present invention has been described in detail above with reference to the drawings, but the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the invention.

The present invention can be used, for example, in content production. However, the scope of use of the present invention is not limited to the examples given here.

1 Content production device
10 Data receiving unit (data acquisition unit)
20 Template storage unit
30 Variation generation unit
40 Search processing unit
50 Selection unit
60 Pause adjustment unit
70 Output unit

Claims

1. A content production device comprising:
a template storage unit that stores a content template having a plurality of variations of a sentence template for generating sentences;
a data acquisition unit that acquires data;
a variation generation unit that, for each of the plurality of variations included in the content template, generates sentences by applying the data to the sentence template, and determines a synthetic speech duration that is the sum of the durations of the synthetic speech corresponding to the respective generated sentences;
a search processing unit that searches for a combination of the variations, using as an evaluation function the sum of evaluation values determined as appropriate for each of the variations, under constraints comprising a condition regarding the duration of pauses that serve as joints when the sentences are connected and a condition regarding the total duration of the synthetic speech and the pauses;
a selection unit that selects, based on the value of the evaluation function, a combination of the variations that satisfies the constraints; and
a pause adjustment unit that adjusts the duration of the pauses so as to satisfy the condition regarding the duration of the pauses and the condition regarding the total duration of the synthetic speech and the pauses.

2. The content production device according to claim 1, wherein
the content template is organized as a sequence of topics, and
each topic includes a plurality of the variations that can be selected mutually exclusively.

3. The content production device according to claim 2, wherein
the condition regarding the duration of the pauses includes a condition regarding the duration of inter-sentence pauses, which are pauses at the boundaries between sentences included in a variation, and a condition regarding the duration of inter-topic pauses, which are pauses at the boundaries between topics.

4. The content production device according to claim 3, wherein
the pause adjustment unit adjusts the durations of the inter-sentence pauses so that they are all the same, and adjusts the durations of the inter-topic pauses so that they are all the same.

5. The content production device according to any one of claims 1 to 4, wherein
the evaluation value for each variation is predetermined as an attribute value of the variation included in the content template.

6. A program for causing a computer to function as a content production device comprising:
a template storage unit that stores a content template having a plurality of variations of a sentence template for generating sentences;
a data acquisition unit that acquires data;
a variation generation unit that, for each of the plurality of variations included in the content template, generates sentences by applying the data to the sentence template, and determines a synthetic speech duration that is the sum of the durations of the synthetic speech corresponding to the respective generated sentences;
a search processing unit that searches for a combination of the variations, using as an evaluation function the sum of evaluation values determined as appropriate for each of the variations, under constraints comprising a condition regarding the duration of pauses that serve as joints when the sentences are connected and a condition regarding the total duration of the synthetic speech and the pauses;
a selection unit that selects, based on the value of the evaluation function, a combination of the variations that satisfies the constraints; and
a pause adjustment unit that adjusts the duration of the pauses so as to satisfy the condition regarding the duration of the pauses and the condition regarding the total duration of the synthetic speech and the pauses.