JP7643638B2

JP7643638B2 - Training data generation device, training data generation method, and program

Info

Publication number: JP7643638B2
Application number: JP2024504339A
Authority: JP
Inventors: いつみ斉藤; 京介西田; 仙吉田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2022-03-04
Filing date: 2022-03-04
Publication date: 2025-03-11
Anticipated expiration: 2042-03-04
Also published as: JPWO2023166747A1; WO2023166747A1

Description

The present invention relates to a technique for generating training data used to train a summary model that generates summary text of a video from the video.

In recent years, online meetings have become more common, and many videos of presentations given at such meetings have been published on the Internet.

Presentation videos are generally long, so understanding their content requires watching for a long time. There is therefore a demand for grasping the content of a presentation video in a short time.

Non-Patent Document 1: BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

To grasp the content of a presentation video quickly, one conceivable approach is to use a neural network model (called a summary model) to generate text that summarizes the presentation video (summary text).

However, for presentation videos, the amount of correct answer data (training data) available for training the summary model is small, and a summary model with sufficient accuracy could not be obtained from the collected correct answer data alone. This problem is not limited to presentation videos; it can arise for any video for which a summary is to be generated.

The present invention has been made in view of the above points, and aims to provide a technique that makes it possible to generate training data for training a summary model that generates summary text from a video.

According to the disclosed technology, there is provided a training data generation device for generating a training data set for training a summary model that generates summary text for a video, the device comprising:
a training data generation unit configured to generate at least one further training data set from an original training data set having a first text, which is text extracted from an image in the video, a second text, which is text extracted from audio in the video, and a correct summary text of the video.

The disclosed technology provides a technique that makes it possible to generate training data for training a summary model that generates summary text from a video.

FIG. 1 is a diagram showing the basic process flow for creating summary text from a presentation video.
FIG. 2 is a configuration diagram of the summary generation device 100.
FIG. 3 is a flowchart for explaining the operation of the summary generation device 100.
FIG. 4 is a configuration diagram of the summary model learning device 200.
FIG. 5 is a diagram showing the configuration for pre-training the summary model.
FIG. 6 is a flowchart for explaining the operation of the summary model learning device 200.
FIG. 7 is a diagram showing an example of input to the summary model and output from the summary model in pre-training.
FIG. 8 is a diagram for explaining the process of cutting out images from a video.
FIG. 9 is a diagram for explaining text extraction from an image.
FIG. 10 is a diagram for explaining text extraction from audio.
FIG. 11 is a diagram showing an example of input to the summary model and output from the summary model in training.
FIG. 12 is a diagram showing the configuration of the data augmentation unit 400.
FIG. 13 is a flowchart for explaining the operation of the data augmentation unit 400.
FIG. 14 is a diagram showing an example of data division.
FIG. 15 is a diagram for explaining training using divided training data sets.
FIG. 16 is a diagram showing an example of the hardware configuration of the device.
FIG. 17 is a diagram showing the effect of pre-training on paper data.
FIG. 18 is a diagram showing the effect of pre-training on slide overviews.
FIG. 19 is a diagram showing the effect of training on the original training data set together with the training data sets obtained by division.

Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiment described below is merely an example, and embodiments to which the present invention is applicable are not limited to the following embodiment.

Both the summary generation device 100 and the summary model learning device 200 described below provide specific improvements over conventional techniques, such as those that generate summaries from papers, and represent an advance in the technical field of generating summaries from videos.

The data augmentation unit 400 (training data generation device 400) described below provides specific improvements over conventional techniques, such as creating summaries by hand, and represents an advance in the technical field of training summary models that generate summary text for videos.

In the following, a presentation video is used as the video for which a summary is generated, but this is only an example. The technology according to the present invention can be applied to videos in general, not only to presentation videos.

(Overview of the embodiment)
In recent years, online meetings have become more common, and many videos of presentations given at such meetings have been published. Presentation videos are generally long, so there is a demand for grasping their content in a short time. To meet this demand, it is desirable to be able to generate a summary of a presentation video.

This embodiment therefore describes a technique for generating summary text corresponding to a presentation video.

<Example of a presentation video>
As an example, as disclosed at "https://slideslive.com/38928967/predicting-depression-in-screening-interviews-from-latent-categorization-of-interview-prompts" (retrieved February 27, 2022) and "https://videolectures.net/" (retrieved February 27, 2022), a typical presentation video consists of images of slides describing the content of the presentation, images of the presenter, and the presenter's voice. In many cases, the presenter's image is not displayed.

<Basic process flow for creating summary text from a presentation video>
The basic process flow for creating summary text from a presentation video is described with reference to FIG. 1. In the following description, for convenience, a presentation video may be referred to as a "video" and summary text as a "summary".

First, from the video to be summarized, (A) presentation slides, (B) images cut out from the video, and (C) audio are prepared as input data for the summary generation unit 130.

The presentation slides in (A) are assumed to be a file separate from the video. A summary can be generated if at least one of (A), (B), and (C) is available as input data. However, to generate a more accurate summary, it is desirable to have all three of (A), (B), and (C), or the two inputs (A) and (C), or the two inputs (B) and (C).
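The input combinations described above can be checked mechanically. The following is a minimal Python sketch for illustration only; the function name `classify_inputs` and its three return labels are assumptions and do not appear in the disclosure.

```python
def classify_inputs(has_slides: bool, has_frames: bool, has_audio: bool) -> str:
    """Classify the availability of (A) slides, (B) frames cut from the video, (C) audio.

    Returns "unusable" when no input is available, "preferred" for the
    combinations described as desirable ((A)+(B)+(C), (A)+(C), or (B)+(C)),
    and "usable" for any other non-empty combination.
    """
    if not (has_slides or has_frames or has_audio):
        return "unusable"
    # Every desirable combination contains audio plus at least one image source.
    if has_audio and (has_slides or has_frames):
        return "preferred"
    return "usable"
```

For example, slides plus audio is classified as "preferred", while slides plus frames without audio is only "usable".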

Next, the input data, converted into text by image recognition and speech recognition, is input to the summary generation unit 130, which outputs the summary text. The summary generation unit 130 is a functional unit included in the summary generation device 100 described later.

<About the summary generation technique>
In this embodiment, the summary generation unit 130 uses a neural network model (called a summary model) to generate a summary from text.

Any summary model may be used as long as it takes text as input and outputs summary text. In this embodiment, as an example, a model based on BART, disclosed in Non-Patent Document 1, is used.

BART is a model consisting of an encoder and a decoder. With a trained model, when text is input to the encoder, summary text is output from the decoder.

<About the problems>
Techniques that take text as input and output a summary have existed, but no technique that outputs a summary from multimodal input data has been found. In other words, the conventional art offered no technique for appropriately generating summary text from a video containing audio and images (slide images and the like), such as a presentation video.

From the perspective of the embodiment, the above problem can be divided into the more specific Problems 1 to 3 below.

Problem 1: Creating training data that includes correct summary text, for use in training a summary model that generates summaries of videos, is costly.

Problem 2: No summary generation technique exists that uses a summary model which takes audio and images extracted from a video as input and outputs summary text.

Problem 3: Even if correct summary text for training a summary model that generates summaries of videos can be collected from an external server or the like, its quantity is small. The amount of training data is therefore small, and an accurate summary model cannot be generated.

The following describes the configuration and operation of the summary generation device 100, which generates a summary from a presentation video, and of the summary model learning device 200, which generates (trains) the summary model used in the summary generation device 100. The techniques described below solve Problems 1 to 3 above.

(Configuration and operation of the summary generation device 100)
FIG. 2 shows a configuration diagram of the summary generation device 100 in this embodiment. As shown in FIG. 2, the summary generation device 100 has an image processing unit 110, an audio processing unit 120, a summary generation unit 130, and a summary model DB (database) 140. The summary model DB 140 stores a trained summary model. A DB in this specification may also be called a storage unit.

The operation flow of the summary generation device 100 shown in FIG. 2 is described with reference to the flowchart in FIG. 3.

Audio information and image information are extracted in advance from the video to be summarized. In S101, the image information is input to the image processing unit 110, and the audio information is input to the audio processing unit 120. In the example of FIG. 2, the functional unit that extracts audio information and image information (in particular, image information) from the video is assumed to be outside the summary generation device 100, but it may instead be provided inside the summary generation device 100.

In S102, the image processing unit 110 uses image recognition technology to extract text from the images. In addition to the text, the image processing unit 110 may extract accompanying auxiliary information (such as the color of characters in a slide).

In S103, the audio processing unit 120 uses speech recognition technology to extract text from the audio. The order of S102 and S103 may be reversed, or S102 and S103 may be executed simultaneously.

The text extracted in S102 and the text extracted in S103 are input to the summary generation unit 130. In S104, the summary generation unit 130 uses the summary model read from the summary model DB 140 to generate a summary from the text extracted in S102 and the text extracted in S103. As also explained in the description of summary model training, the input to the summary model may include, in addition to the text, any one, several, or all of character arrangement features, image features, and audio features. The "summary model" itself is data consisting of the functions, weight parameters, and the like that constitute the neural network. In S104, the summary generation unit 130 outputs the generated summary.
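The flow of S102 to S104 can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the implementation of the summary generation device 100: the function name `generate_summary` and the callables `ocr`, `asr`, and `summarize` are hypothetical placeholders for the image processing unit 110, the audio processing unit 120, and the summary generation unit 130, and joining per-image text with spaces is likewise an assumption.

```python
from typing import Callable, Sequence


def generate_summary(
    frames: Sequence[bytes],
    audio: bytes,
    ocr: Callable[[bytes], str],
    asr: Callable[[bytes], str],
    summarize: Callable[[str, str], str],
) -> str:
    """Run the S102-S104 flow with caller-supplied recognizers and summary model."""
    # S102: extract text from each image (hypothetical OCR callable).
    image_text = " ".join(ocr(frame) for frame in frames)
    # S103: extract text from the audio (hypothetical speech recognition callable).
    speech_text = asr(audio)
    # S104: the summary model receives both texts and returns the summary.
    return summarize(image_text, speech_text)
```

Because S102 and S103 do not depend on each other, the two extraction steps in this sketch could be reordered or run concurrently without changing the result.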

As described above, using both the audio information and the image information obtained from a video makes it possible to generate a high-quality summary.

The processing in the functional unit that extracts audio information and image information from the video, in the image processing unit 110, and in the audio processing unit 120 is the same as the processing in the learning data input unit 220, the image processing unit 230, and the audio processing unit 240, respectively, of the summary model learning device 200 described later. The details of this processing are therefore given in the description of the summary model learning device 200.

The summary generation device 100 of this embodiment solves Problem 2 described above, and realizes a summary generation technique that uses a summary model which takes audio and images extracted from a video as input and outputs summary text. The summary model is trained by the summary model learning device 200 described below.

(Configuration and operation of the summary model learning device)
FIG. 4 shows an example of the configuration of the summary model learning device 200 in this embodiment. As shown in FIG. 4, the summary model learning device 200 has a data acquisition unit 210, a learning data input unit 220, an image processing unit 230, an audio processing unit 240, a summary model learning unit 250, a data augmentation unit 400, a model setting unit 270, a summary model DB 280 that stores the pre-trained summary model, and a summary model DB 290 that stores the summary model being trained.

In this embodiment, when training the summary model, a summary model is first created by pre-training on a large number of paper summaries, which are considered to be highly similar in content to presentations. That summary model is then fine-tuned with a small amount of presentation summary data. This makes it possible to achieve high accuracy even when only a small amount of correct summary data is available for presentation videos.

Performing pre-training as described above is one way of solving Problem 3. Problem 3 can also be solved without pre-training, by using the further training data generated by the data augmentation unit 400 described later. Pre-training and the use of the further training data generated by the data augmentation unit 400 may also be combined.

The configuration shown in FIG. 4 is the configuration for the case where the above pre-training is performed, but training with the training data generated by the data augmentation unit 400 may be performed without pre-training. Alternatively, a pre-trained summary model may be further trained with the training data generated by the data augmentation unit 400.

FIG. 5 shows the configuration for pre-training. As shown in FIG. 5, the configuration for pre-training has a summary model pre-learning unit 310 and a summary model DB 320 that stores the summary model being pre-trained.

A summary model pre-learning device (a device separate from the summary model learning device 200) having the summary model pre-learning unit 310 and the summary model DB 320 may be configured, or the summary model pre-learning unit 310 and the summary model DB 320 may be included in the summary model learning device 200.

The operation flow of the summary model learning device 200 and the summary model pre-learning unit 310 is described with reference to the flowchart in FIG. 6. The detailed processing is described later.

S201 and S202 are processes in the configuration for pre-training shown in FIG. 5. In S201, pre-training data is input to the summary model pre-learning unit 310. The pre-training data is, for example, the text of papers related to presentations and the summaries of those papers (correct answer data).

In S202, the summary model pre-learning unit 310 trains (pre-trains) the summary model using the input data. The pre-trained summary model is stored in the summary model DB 280 of the summary model learning device 200.

S203 to S207 are processes in the summary model learning device 200 shown in FIG. 4. In the input process of S203, access information (for example, the URL at which a paper and its presentation video are published) is input to the data acquisition unit 210. Using the access information, the data acquisition unit 210 acquires training data, for example from a server on the network, and inputs it to the learning data input unit 220. The training data is, for example, a presentation video about a paper and the correct summary text corresponding to that video. Also in S203, the learning data input unit 220 separates the presentation video into image information and audio information, inputs the image information to the image processing unit 230, inputs the audio information to the audio processing unit 240, and inputs the correct summary to the summary model learning unit 250.

The image information that the learning data input unit 220 inputs to the image processing unit 230 may be slide images or the like held in a file separate from the presentation video, or slide images or the like extracted from the presentation video. In either case, the images may be referred to as "images in the video" or "images related to the video", and text can be extracted from them by image recognition processing.

In the following description, the image information input to the image processing unit 230 is assumed to be slide images or the like extracted from the presentation video.

In S204, the image processing unit 230 uses image recognition technology to extract text from the images. In addition to the text, the image processing unit 230 may extract accompanying auxiliary information (such as the color of characters in a slide), character arrangement features, image features, and the like.

In S205, the audio processing unit 240 uses speech recognition technology to extract text from the audio. In addition to the text, the audio processing unit 240 may extract audio features and the like. The order of S204 and S205 may be reversed, or S204 and S205 may be executed simultaneously.

The text extracted in S204 and the text extracted in S205 are input to the summary model learning unit 250. The correct summary is also input to the summary model learning unit 250.

At this point, the model setting unit 270 has read the pre-trained summary model from the summary model DB 280 and stored it in the summary model DB 290. The training (fine-tuning) described below is performed with the parameters of this pre-trained summary model as initial values.

In S206, the summary model learning unit 250 uses the summary model read from the summary model DB 290 to generate a summary from the text extracted in S204 and the text extracted in S205, and trains the summary model (updates its parameters) so that the error between the generated summary and the correct summary is minimized.
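The text does not specify how the error between the generated summary and the correct summary is measured. One common choice for encoder-decoder models, assumed here purely for illustration, is the token-level negative log-likelihood of the correct summary. In the expression below, the symbols are assumptions introduced for this sketch: $x^{\mathrm{img}}$ is the text extracted in S204, $x^{\mathrm{aud}}$ is the text extracted in S205, $y_1, \dots, y_T$ are the tokens of the correct summary, and $\theta$ denotes the parameters of the summary model.

```latex
\mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \,\middle|\, y_{<t},\, x^{\mathrm{img}},\, x^{\mathrm{aud}}\right)
```

Under this assumption, updating the parameters in S206 corresponds to taking gradient steps that decrease $\mathcal{L}(\theta)$, starting from the initial parameter values.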

When training is complete, the summary model learning unit 250 stores the trained summary model in the summary model DB 140 of the summary generation device 100.

The above example performs pre-training and then fine-tunes the pre-trained model, but as mentioned earlier, pre-training is not essential. Processing may start from S203 in FIG. 6 without pre-training. When pre-training is not performed, the initial values of the summary model parameters may be random values or values other than random values.
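The choice of initial values can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `initial_parameters`, the representation of a model as a dictionary of named scalar parameters, and the Gaussian random initialization are all hypothetical and are not taken from the disclosure.

```python
import random
from typing import Dict, Iterable, Optional


def initial_parameters(
    pretrained: Optional[Dict[str, float]],
    names: Iterable[str],
    seed: int = 0,
) -> Dict[str, float]:
    """Return the starting parameters for training the summary model.

    If a pre-trained model is available, its parameters are copied and used
    as initial values (fine-tuning). Otherwise each named parameter is
    initialized randomly (one possible choice when pre-training is skipped).
    """
    if pretrained is not None:
        # Copy so that training does not modify the stored pre-trained model.
        return dict(pretrained)
    rng = random.Random(seed)
    return {name: rng.gauss(0.0, 0.02) for name in names}
```

The copy in the pre-trained branch loosely mirrors the separation between the summary model DB 280 (pre-trained model) and the summary model DB 290 (model being trained).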

The processing in each of the steps S201 to S207 is described in more detail below.

(S201, S202: Pre-training)
A detailed example of the pre-training performed by the summary model pre-learning unit 310 shown in FIG. 5 is described here. In pre-training, the summary model is trained using text from a field related to the field of the presentation videos to be summarized (called related-field text) and its correct summary. Related-field text is, for example, paper text (the body text of a paper) or slide text.

FIG. 7 shows an example of the input to and output from the summary model when paper text is used as the related-field text. As mentioned earlier, the summary model in this embodiment consists of an encoder and a decoder.

As shown in FIG. 7, the body text of a paper is input to the encoder, and summary text is output from the decoder. The summary model is trained so that the error between the output summary text and the correct summary text is minimized. When slide text is used as input, the processing is the same as with paper text.

When text is input to the encoder, the token sequence of the text is first converted into fixed-dimension vectors of d dimensions, and is then converted into summary text through the encoder-decoder.
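The conversion of a token sequence into d-dimensional vectors can be illustrated with a simple lookup table. The following Python sketch is an assumption-laden illustration, not the embedding layer of the actual model: the names `build_embedding_table` and `embed`, the `<unk>` fallback token, and the random vector values are all hypothetical, and in a trained model the vectors would be learned parameters.

```python
import random
from typing import Dict, List, Sequence


def build_embedding_table(vocab: Sequence[str], d: int, seed: int = 0) -> Dict[str, List[float]]:
    """Assign each vocabulary token a d-dimensional vector (random values for illustration)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-0.1, 0.1) for _ in range(d)] for token in vocab}


def embed(tokens: Sequence[str], table: Dict[str, List[float]], unk: str = "<unk>") -> List[List[float]]:
    """Map a token sequence to a sequence of d-dimensional vectors.

    Tokens missing from the table fall back to the vector of the `unk` token,
    which must itself be present in the table.
    """
    return [table.get(token, table[unk]) for token in tokens]
```

A sequence of n tokens thus becomes n vectors of length d, which is the form in which the text is passed on to the encoder.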

An example of input paper text is shown below.

「We assume familiarity with basic notions of graph theory (see, for instance, 1]) and with elementary notions of polyhedral combinatorics (see, for instance, 6]).", "Our graphs will be undirected and simple (no loops and no multiple edges).", "As usual, K n denotes the complete graph with n vertices; K n;m denotes the complete bipartite graph with n + m vertices and n m edges.", "Let G be a graph; G is connected if for every pair of distinct vertices there exists a path in G joining them; G is twoconnected if for every vertex v of G, the graph G ?", "v is connected; G is planar if it can be embedded in the plane.", "A subgraph H of a G is spanning if the vertex sets of H and G are the same.", "Subdivision of an edge uv of G consists of removing edge uv, and adding a new vertex w and the two edges uw and vw; w is called subdivision vertex.", "If G and H are two graphs, we say that G contains a subdivision of H, if H arises by subdivision of the edges of some subgraph of G. As usual, (u) denotes the set of all edges that are incident in the vertex u.", "In automatic graph drawing the following problem arises: nd in a complete graph with weights on its edges a two-connected planar spanning subgraph with weight as Partially supported by DFG-Grant JU204/7-1 Forschungsschwerpunkt ＼" E ziente Algorithmen f ur diskrete Probleme und ihre Anw…」
An example of the output for the above input (or of the summary text serving as correct answer data) is shown below.

「The problem of finding a two-connected planar spanning subgraph of maximum weight in a complete edge-weighted graph is important in automatic graph drawing.", "We investigate the problem from a polyhedral point of view."」
On presentation video sites and the like, the slide file can sometimes be obtained as a file separate from the video. A slide file also often contains both the data of the slides themselves (slide text) and an overview of the slides (summary text). In such cases, the summary model can be pre-trained by using the slide text as the input to the encoder-decoder and the summary text as the correct answer.

An example of the input slide text is shown below.

「[["ssn"], ["MASTERS", "IN", "AUTOMOTIVE"], ["ENGINEERING"], ["Karthiek", "Nagaraj"], ["PRESENTED", "AT", "IRIS", ",", "DEPARTMENT", "OF", "MECHANICAL", "ENGINEERING"], ["SSN"], ["WHY", "AUTOMOBILE", "ENGINEERING", "?"], ["Its", "scope", "is", "irrefutable", "and", "job", "prospects", "are", "very", "strong", "in", "any", "part", "of", "the", "world", ".", "Also", "the", "prospect", "of", "returning", "to", "India", "to", "work", "is", "bright", "as", "the", "indian", "automotive", "industry", "is", "making", "tremendous", "progress", "."], [">", "It", "is", "a", "stream", "which", "blends", "passion", "for", "vehicles", "and", "technical", "knowledge", ",", "thus", "making", "it", "all", "the", "more", "interesting", "."], ["It", "is", "an", "interdisciplinary", "field", "which", "encompasses", "mechanical", "engineering", ",", "electrical", "and", "electronics", "engineering", "and", "software", "engineering", ".", "This", "again", "adds", "to", "the", "interest", "factor", "."], ["A", "multitude", "of", "research", "options", "are", "on", "offer", ",", "especially", "in", "hybrid", "powertrains", "and", "fuel", "cells", "."], ["PRESENTED", "AT", "IRIS", ",", "DEPARTMENT", "OF", "MECHANICAL", "ENGINEERING"], ["2"], ["SSN"], ["KEY", "AREAS", "OF", "AUTOMOTIVE", "ENGINEERING"], ["Vehicle", "Propulsion", "~", "Internal", "combustion", "engines"], ["Powertrain", "dynamics", "and", "control"], ["Vehicle", "dynamics", "~", "Handling", "response"], ["~", "Advanced", "transmission"], ["systems"], ["~", "Hybrid", "propulsion", "systems"], ["~", "Terrain", "modelling"], ["~", "Fuel", "cells"], ["~", "Drivetrain", "control", "systems"], ["~", "NVH", "modelling"], ["Automotive", "body", "structures", "~", "Material", "selection"], ["Automotive", "safety", "~", "Active", "and", "passive", "safety"], ["systems"], ["~", "Crash", "worthiness"], ["~", "Human", "factor", "engineering"], ["and",」
An example of the output (or of the slide summary that serves as the correct answer data) for the above input is shown below.

「A guide to Masters in Automotive Engineering at International Destinations」
(S203: Input processing of the summary model learning device 200)
Next, detailed examples of the processing by the data acquisition unit 210 and the processing by the learning data input unit 220 in the summary model learning device 200 shown in FIG. 4 will be described.

The data acquisition unit 210 accesses, for example, a presentation video site on the Internet and acquires from that site a presentation video and the correct summary corresponding to the video. An example of a site from which such videos and summaries can be acquired is "https://aclanthology.org/" (retrieved February 27, 2022).

As described above, by acquiring presentation videos and their summaries from a server on the network, training data can be created without writing summaries by hand, which solves the aforementioned problem 1.

The learning data input unit 220 separates the presentation video acquired by the data acquisition unit 210 into image information and audio information, inputs the image information to the image processing unit 230, and inputs the audio information to the audio processing unit 240.

The image information is not limited to any particular kind of image, but here it is assumed to be the slide images in the presentation video.

An example of the processing by which the learning data input unit 220 extracts images from a presentation video will be described with reference to FIG. 8.

S203(1-1):
The learning data input unit 220 extracts an image from the presentation video every k seconds, where k is a predetermined real number greater than 0. The upper part of FIG. 8 shows six images extracted at k-second intervals.

S203(1-2):
The learning data input unit 220 compares the images extracted in S203(1-1) in chronological order, and if the similarity between the t-th image and the (t-1)-th image is equal to or greater than a threshold, it determines that the two images are the same image. Any method may be used to determine the similarity between images. FIG. 8 shows examples of the similarity between pairs of images among the six images.

S203(1-3):
The learning data input unit 220 repeats S203(1-1) and S203(1-2) to extract a set of distinct images. FIG. 8 shows image 1, image 4, and image 6 as the set of distinct images obtained when the threshold is 25. The resulting image set is input to the image processing unit 230.
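
Steps S203(1-1) to S203(1-3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: frames are represented as flat lists of grayscale pixel values, and `pixel_similarity` (a hypothetical measure scaled to 0-100) stands in for the similarity method, which the description leaves open. The names `sample_times` and `distinct_frames` are likewise illustrative.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an image: a flattened grayscale frame (0-255 per pixel).
Frame = Sequence[int]


def sample_times(duration_sec: float, k: float) -> List[float]:
    """Timestamps 0, k, 2k, ... that fall before the end of the video."""
    if k <= 0:
        raise ValueError("k must be a real number greater than 0")
    times = []
    i = 0
    while i * k < duration_sec:
        times.append(i * k)
        i += 1
    return times


def pixel_similarity(a: Frame, b: Frame) -> float:
    """Assumed similarity in [0, 100]; 100 means the frames are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and equally sized")
    mean_abs_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 100.0 * (1.0 - mean_abs_diff / 255.0)


def distinct_frames(
    frames: Sequence[Frame],
    threshold: float,
    similarity: Callable[[Frame, Frame], float] = pixel_similarity,
) -> List[int]:
    """Indices of frames kept after comparing each frame with its predecessor.

    The t-th frame is treated as the same image as the (t-1)-th frame when
    their similarity is equal to or greater than the threshold.
    """
    kept = []
    for t, frame in enumerate(frames):
        if t == 0 or similarity(frames[t - 1], frame) < threshold:
            kept.append(t)
    return kept


if __name__ == "__main__":
    frames = [[0] * 4, [0] * 4, [1] * 4, [200] * 4, [200] * 4, [90] * 4]
    print(distinct_frames(frames, threshold=90.0))  # [0, 3, 5]
```

With six sampled frames, the sketch keeps the first, fourth, and sixth, mirroring the image 1, image 4, image 6 pattern described for FIG. 8. The threshold value used here is arbitrary and depends entirely on the chosen similarity measure.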

(S204: Image processing)
Next, a detailed example of the image processing executed by the image processing unit 230 will be described. The image processing unit 230 performs OCR (Optical Character Recognition) on the set of distinct images input from the learning data input unit 220 and, as shown in FIG. 9, acquires the text, character color, character size, character position information, and the like from each image in the set. The acquired information may be the text alone.

(S205: Audio processing)
Next, a detailed example of the audio processing executed by the audio processing unit 240 will be described. As shown in FIG. 10, the audio processing unit 240 performs speech recognition on the audio input from the learning data input unit 220 and obtains the text of the speech recognition result.

(S206: Training process)
Next, a detailed example of the training process executed by the summary model learning unit 250 will be described. The summary model learning unit 250 concatenates the text obtained by the image processing unit 230 with the text obtained by the audio processing unit 240 and inputs the concatenated text to the summary model. The summary model learning unit 250 trains the summary model so that the error between the summary text output by the summary model and the correct summary text is minimized. The input to the summary model may also be the concatenated text augmented with information obtained by the image processing unit 230, such as character layout features, image features, and character size and color information. Likewise, the concatenated text may be augmented with audio features obtained by the audio processing unit 240.
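
The description does not specify which error is minimized. One common choice for encoder-decoder models, assumed here purely for illustration, is the mean token-level cross-entropy between the decoder's predicted distributions and the tokens of the correct summary. The function name and the dictionary-based representation of the distributions are hypothetical.

```python
import math
from typing import Dict, List


def sequence_cross_entropy(
    step_probs: List[Dict[str, float]], reference: List[str]
) -> float:
    """Mean negative log-likelihood of the reference tokens.

    step_probs[i] maps candidate tokens to the probability the decoder
    assigns them at step i; reference[i] is the correct token at that step.
    """
    if len(step_probs) != len(reference) or not reference:
        raise ValueError("need one distribution per reference token")
    eps = 1e-12  # guards against log(0) when the correct token gets no mass
    total = 0.0
    for probs, token in zip(step_probs, reference):
        total -= math.log(max(probs.get(token, 0.0), eps))
    return total / len(reference)


if __name__ == "__main__":
    probs = [{"we": 0.5, "the": 0.5}, {"investigate": 1.0}]
    print(sequence_cross_entropy(probs, ["we", "investigate"]))
```

The loss falls toward zero as the decoder places more probability on each correct token, which is the sense in which training reduces the error between the generated and correct summaries.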

The initial state of the above summary model is the summary model pre-trained in S202. However, as described above, pre-training may be omitted, so the initial state need not be the summary model pre-trained in S202. When pre-training is not performed, training may instead use the further training data generated by the data expansion unit 400 described later.

FIG. 11 shows an example of the input to and output from the summary model. As described above, the summary model according to the present embodiment is a model consisting of an encoder and a decoder.

As shown in FIG. 11, the texts joined by [SEP], together with the character size and color information, are input to the encoder, and the summary text is output from the decoder. The summary model is trained so that the error between the output summary text and the correct summary text is minimized.
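
Joining the two texts with [SEP] can be sketched as below. The ordering (OCR text first, then ASR text), the whitespace normalization, and the function name `build_encoder_input` are assumptions made for illustration; the description only states that the texts are joined by [SEP].

```python
def build_encoder_input(ocr_text: str, asr_text: str, sep_token: str = "[SEP]") -> str:
    """Join OCR text and ASR text into a single encoder input string."""
    # Collapse runs of whitespace so line breaks from OCR do not leak through.
    parts = [" ".join(text.split()) for text in (ocr_text, asr_text)]
    # Drop an empty modality instead of emitting a dangling separator.
    parts = [part for part in parts if part]
    return f" {sep_token} ".join(parts)


if __name__ == "__main__":
    print(build_encoder_input("Structured  sparsity", "So to put in context"))
    # Structured sparsity [SEP] So to put in context
```

Skipping an empty modality lets the same helper serve inputs that contain only OCR text or only ASR text.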

When text is input to the encoder, the token sequence of the text is first converted into fixed-dimension vectors of d dimensions and is then converted into the summary text through the encoder-decoder. The character size and color information may be omitted from the input.

The text obtained by the audio processing unit 240 may be called ASR (Automatic Speech Recognition) text, and the text obtained by the image processing unit 230 may be called OCR text.

An example of ASR text is shown below.

「So to put in context to put my presentation in the context, I will, I would like to begin with the word decision support or decision-making. And first ask the question who, or what is making decisions and obviously we get two branches here. One is that we have a human decision maker who makes a decision and all of us are decision makers and then we are also talking about the decision systems. So computers robots.」
An example of OCR text is shown below. It is text obtained from a slide image published at "http://videolectures.net/site/normal_dl/tag=1005123/icml2015_schmidt_time_framework_01.pdf" (retrieved February 26, 2022).

「Structured sparsity sparsity is widely used in signal processing, machine learning, and statistics (compressive sensing, sparse linear regression, etc.) Examples of sparsity….」
An example of the summary text (or its correct answer) that is output when the ASR text and the OCR text are concatenated and input to the summary model is shown below.

「Decision Support is a discipline concerned with human decision making: it aims to provide methods and tools that support, rather than replace, people in making difficult decisions. One of the widely used decision-support approaches relies on decision models, which are developed in the decision process and used to evaluate and analyse decision alternatives. In this lecture, we shall present the method DEX (Decision EXpert), which was heavily influenced by ideas from Artificial Intelligence. DEX is a hierarchical, qualitative, rule-based, multi-criteria modelling method, suitable particularly for solving classification decision problems. DEX combines traditional approaches with those from expert systems and machine learning. DEX is supported by the software called DEXi and has been used in hundreds of real-world decision-making studies. The presentation will be illustrated by recent applications in the areas of electric energy production, food safety and health care.」
(Configuration and operation of the data expansion unit 400)
The following describes a technique for automatically generating additional training data sets, which is one of the techniques for solving problem 3.

FIG. 12 shows the configuration of the data expansion unit 400 in the summary model learning device 200 shown in FIG. 4. As shown in FIG. 12, the data expansion unit 400 has a training data generation unit 410, a key sentence extraction unit 420, and a task information assignment unit 430. The data expansion unit 400 may be a functional unit within the summary model learning device 200 or a separate device external to it. When the data expansion unit 400 is within the summary model learning device 200, the summary model learning device 200 may be called the training data generation device 400. When the data expansion unit 400 is a separate external device, that separate device may be called the training data generation device 400.

The flow of operation of the data expansion unit 400 (training data generation device 400) shown in FIG. 12 will be described with reference to the flowchart in FIG. 13. In S301, the ASR text obtained by audio processing, the OCR text obtained by image processing, and the corresponding correct summary text are input to the training data generation unit 410.

In S302, the training data generation unit 410 performs a training data generation process (which may also be called a data division process) on the input data. In S302, the key sentence extraction unit 420 also performs a key sentence extraction process. The key sentence extraction unit 420 may be included in the training data generation unit 410.

In S303, the task information assignment unit 430 assigns task information to the generated training data sets, and in S304 it outputs the training data sets with the task information assigned. The output data is input to the summary model learning unit 250 and used to train the summary model. The processing of each of these steps is described in more detail below.

(S301: Input, S302: Data division)
For each presentation video, data is input to the training data generation unit 410 as one set consisting of "OCR text, ASR text, correct summary text". A data set used for training is called a training data set.

Based on the above input data, the training data generation unit 410 generates the following five training data sets, as shown in FIG. 14. Note that (1) is the original training data set. Since each training data set represents a task, a training data set may also be called a task. The five sets below are examples; it suffices that at least one further training data set is generated in addition to the original training data set. In addition to the following, (6) OCR text, OCR key sentences and (7) ASR text, ASR key sentences may also be generated.

(1) OCR text, ASR text, correct summary text
(2) OCR text, correct summary text
(3) ASR text, correct summary text
(4) OCR text, ASR key sentences
(5) ASR text, OCR key sentences
The ASR key sentences and the OCR key sentences are both examples of pseudo-correct-answer information, and both are created by the key sentence extraction unit 420. Examples of how these key sentences are created are described below.
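
The division of one original training data set into the five sets (1) to (5) can be sketched as follows. The `Example` record and the function name `split_training_example` are hypothetical, and the key sentences are taken as given inputs because their extraction is a separate step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    """One training data set: the available inputs and the target text."""
    ocr_text: Optional[str]   # None when the OCR text is not part of the input
    asr_text: Optional[str]   # None when the ASR text is not part of the input
    target: str               # correct summary or pseudo-correct key sentences
    target_kind: str          # "summary", "asr_key_sentences", or "ocr_key_sentences"


def split_training_example(
    ocr_text: str,
    asr_text: str,
    summary: str,
    ocr_key_sentences: str,
    asr_key_sentences: str,
) -> List[Example]:
    """Return sets (1) to (5); the first entry is the original set."""
    return [
        Example(ocr_text, asr_text, summary, "summary"),            # (1)
        Example(ocr_text, None, summary, "summary"),                # (2)
        Example(None, asr_text, summary, "summary"),                # (3)
        Example(ocr_text, None, asr_key_sentences, "asr_key_sentences"),  # (4)
        Example(None, asr_text, ocr_key_sentences, "ocr_key_sentences"),  # (5)
    ]


if __name__ == "__main__":
    for example in split_training_example("ocr", "asr", "summary", "ocr key", "asr key"):
        print(example)
```

Sets (4) and (5) pair one modality with key sentences drawn from the other, so a single video yields cross-modal targets without any additional manual annotation.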

For the ASR key sentences, the key sentence extraction unit 420 extracts them by matching the summary text against the ASR text. For example, the key sentence extraction unit 420 extracts, as the ASR key sentences, the portions of the ASR text that are highly similar to the summary text.

For the OCR key sentences, the key sentence extraction unit 420 extracts them by matching the summary text against the OCR text. For example, the key sentence extraction unit 420 extracts, as the OCR key sentences, the portions of the OCR text that are highly similar to the summary text.

Any technique can be applied for the matching used to extract the ASR/OCR key sentences. For example, a method used for creating extractive summarization data may be used, such as the method described in Fine-tune BERT for Extractive Summarization (https://arxiv.org/pdf/1903.10318v2.pdf, retrieved February 27, 2022).
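
Since any matching technique is permitted, the sketch below uses a deliberately simple stand-in: each sentence of the source text is scored by the fraction of its tokens that also occur in the summary, and the highest-scoring sentences are kept. This is not the method of the cited paper; the tokenizer, the sentence splitter, the scoring function, and all function names are assumptions for illustration.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercased alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text: str) -> List[str]:
    """Naive split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(sentence: str, summary_counts: Counter) -> float:
    """Fraction of the sentence's tokens that are matched in the summary."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    matched = sum(min(c, summary_counts[t]) for t, c in counts.items())
    return matched / len(tokens)


def extract_key_sentences(source_text: str, summary: str, top_n: int = 1) -> List[str]:
    """Return the top_n sentences most similar to the summary, in source order."""
    summary_counts = Counter(tokenize(summary))
    sentences = split_sentences(source_text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: overlap_score(sentences[i], summary_counts),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:top_n])]


if __name__ == "__main__":
    asr = "Thank you for coming today. We present a method for decision support. Any questions?"
    summary = "The lecture presents a method for decision support."
    print(extract_key_sentences(asr, summary))
    # ['We present a method for decision support.']
```

The same function applies unchanged to OCR text. A ROUGE-based greedy selection, as used when building extractive summarization data, could replace `overlap_score` without altering the surrounding logic.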

(S303: Task information assignment)
The task information assignment unit 430 assigns, to each training data set generated by the training data generation unit 410, identification information (which may be called a label) for identifying the task. The identification information is a special token. For examples (1) to (5) above, identification information such as [task0] is assigned as follows.

(1) [task0] OCR text, ASR text, correct summary text
(2) [task1] OCR text, correct summary text
(3) [task2] ASR text, correct summary text
(4) [task3] OCR text, ASR key sentences
(5) [task4] ASR text, OCR key sentences
(S304: Output (and training))
Each task (each training data set) to which identification information was assigned in S303 is output to the summary model learning unit 250.

The summary model learning unit 250 trains the summary model using each of the training data sets with identification information. The training method for each training data set is the same as that in S206 described above. Here, however, as shown in FIG. 15, text with the identification information attached is used as the input to the decoder. FIG. 15 shows an example of training on task (2) of the five tasks above. Such training is performed for each of (1) to (5).
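
Attaching the identification token to the decoder-side text can be sketched as below. Placing the token as a prefix of the target text is an assumption, since FIG. 15 is not reproduced here; the function name `add_task_tokens` is hypothetical, while the token format [task0] to [task4] follows the list above.

```python
from typing import List


def add_task_tokens(targets: List[str]) -> List[str]:
    """Prefix the i-th target text with the special token [task<i>].

    targets is ordered as tasks (1) to (5), so index 0 receives [task0].
    """
    return [f"[task{i}] {target}" for i, target in enumerate(targets)]


if __name__ == "__main__":
    targets = [
        "correct summary",      # (1) OCR + ASR input
        "correct summary",      # (2) OCR input only
        "correct summary",      # (3) ASR input only
        "ASR key sentences",    # (4) OCR input only
        "OCR key sentences",    # (5) ASR input only
    ]
    for text in add_task_tokens(targets):
        print(text)
```

In practice the special tokens would also need to be registered in the tokenizer vocabulary so that each is treated as a single token; how that is done depends on the toolkit and is outside this sketch.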

This increases the amount of training data and makes it possible to generate a more accurate summary model.

(Hardware configuration example)
The summary generation device 100, the summary model learning device 200, and the training data generation device 400 can each be realized, for example, by causing a computer to execute a program. The computer may be a physical computer or a virtual machine in the cloud. Hereinafter, the summary generation device 100, the summary model learning device 200, and the training data generation device 400 are collectively referred to as the "device".

That is, the device can be realized by executing a program corresponding to the processing performed by the device, using hardware resources such as a CPU and memory built into the computer. The program can be recorded on a computer-readable recording medium (such as portable memory) for storage or distribution. The program can also be provided over a network, for example via the Internet or email.

FIG. 16 is a diagram showing an example of the hardware configuration of the computer. The computer in FIG. 16 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are connected to one another by a bus BS.

The program that realizes the processing on the computer is provided by a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program need not be installed from the recording medium 1001 and may instead be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.

When an instruction to start the program is given, the memory device 1003 reads the program from the auxiliary storage device 1002 and stores it. The CPU 1004 realizes the functions of the device in accordance with the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network or the like. The display device 1006 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 1007 consists of a keyboard and mouse, buttons, a touch panel, or the like, and is used to enter various operating instructions. The output device 1008 outputs the results of computation.

(Effects of the embodiment)
As described above, the technology according to the present embodiment makes it possible to appropriately generate summary text from videos that contain audio and images, such as presentation videos. It also makes it possible to automatically generate additional training data for training a summary model that generates summary text from videos.

In particular, in the present embodiment, the accuracy of the summary model can be improved by performing pre-training or data augmentation (generation of additional training data by data division).

The effects observed in experiments with pre-training and in experiments with data division are described below. ROUGE-1, ROUGE-2, and ROUGE-L are used as evaluation metrics and are denoted R1, R2, and RL, respectively.
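
For readers unfamiliar with these metrics, the sketch below computes simplified F1 variants of ROUGE-1, ROUGE-2, and ROUGE-L. It assumes whitespace tokenization with no stemming and treats each text as a single sequence, so its values will not match an official ROUGE toolkit exactly. The description does not state which implementation produced the reported scores, and the function names here are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of n-grams in the token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(match: int, candidate_total: int, reference_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there is no match."""
    if match == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = match / candidate_total
    recall = match / reference_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1 based on clipped n-gram overlap."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    match = sum((cand & ref).values())
    return f1(match, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 based on the longest common subsequence."""
    cand, ref = candidate.split(), reference.split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    hyp = "the cat sat on the mat"
    ref = "the cat is on the mat"
    print(rouge_n(hyp, ref, 1), rouge_n(hyp, ref, 2), rouge_l(hyp, ref))
```

R1 and R2 reward shared words and word pairs, while RL rewards words that appear in the same order, which is why the three scores are usually reported together.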

FIG. 17 shows the effect of pre-training on paper data. "ASR+OCR" shows, for comparison, the evaluation results without pre-training on paper data. "+Paper summaries (300,000)" and "+Paper summaries (500,000)" show the evaluation results when the model was pre-trained on 300,000 and 500,000 paper summaries, respectively. As FIG. 17 shows, pre-training on paper data improves accuracy.

FIG. 18 shows the effect of pre-training on slide summaries. "ASR+OCR(4096)" shows, for comparison, the evaluation results without pre-training on slide summaries. "+slideshare" shows the evaluation results when the model was pre-trained on slide summaries. As FIG. 18 shows, pre-training on slide summaries improves accuracy.

FIG. 19 shows the effect of training on the further training data sets obtained by division together with the original training data set. "ASR+OCR(4096)" shows, for comparison, the evaluation results when only the original training data set was used. "ASR+OCR(4096)+extend" shows the evaluation results when the further training data sets obtained by division were used together with the original training data set. As FIG. 19 shows, training on the data sets obtained by division together with the original training data set improves accuracy.

(Additional Notes)
The following additional notes are further disclosed with respect to the above embodiment.
(Additional Note 1)
A training data generation device that generates a training data set for training a summary model that generates a summary text for a video, the device comprising:
a memory; and
at least one processor connected to the memory,
wherein the processor
generates at least one further training data set from an original training data set having a first text that is text extracted from images in the video, a second text that is text extracted from audio in the video, and a correct summary text of the video.
(Additional Note 2)
The training data generation device according to Additional Note 1, wherein the processor generates, as the further training data set, a training data set that includes the first text and does not include the second text, or a training data set that includes the second text and does not include the first text.
(Additional Note 3)
The training data generation device according to Additional Note 1, wherein the processor generates, as the further training data set, a training data set including either the first text or the second text, and important sentences obtained by matching either the first text or the second text against the correct summary text.
(Additional Note 4)
The training data generation device according to Additional Note 1, wherein the processor assigns, to the further training data set, identification information for identifying the task performed with the further training data set.
(Additional Note 5)
A training data generation method executed by a computer used as a training data generation device that generates a training data set for training a summary model that generates a summary text for a video, the method comprising:
a training data generation step of generating at least one further training data set from an original training data set having a first text that is text extracted from images in the video, a second text that is text extracted from audio in the video, and a correct summary text of the video.
(Additional Note 6)
A non-transitory storage medium storing a program executable by a computer to perform a training data generation process that generates a training data set for training a summary model that generates a summary text for a video, wherein
the training data generation process
generates at least one further training data set from an original training data set having a first text that is text extracted from images in the video, a second text that is text extracted from audio in the video, and a correct summary text of the video.
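The matching in Additional Note 3 can be sketched as follows. The notes do not specify the matching measure, so this example assumes a simple one: each sentence is scored by the fraction of its distinct word tokens that also appear in the correct summary text, and the highest-scoring sentences are kept as important sentences. The function names `tokenize`, `overlap_score`, and `extract_key_sentences` and the parameter `top_k` are hypothetical.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def overlap_score(sentence: str, summary: str) -> float:
    """Fraction of the sentence's distinct tokens that occur in the summary."""
    sent_tokens = set(tokenize(sentence))
    if not sent_tokens:
        return 0.0
    return len(sent_tokens & set(tokenize(summary))) / len(sent_tokens)


def extract_key_sentences(sentences: List[str], summary: str,
                          top_k: int = 2) -> List[str]:
    """Select the top_k sentences by overlap score, kept in source order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: overlap_score(sentences[i], summary),
                    reverse=True)
    keep = sorted(ranked[:top_k])
    return [sentences[i] for i in keep]
```

The selected sentences could then serve as the target of an auxiliary task whose input is the first text or the second text. A different similarity measure can be substituted for `overlap_score` without changing the rest of the sketch.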

Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

100 Summary generation device
110 Image processing unit
120 Audio processing unit
130 Summary generation unit
140 Summary model DB
200 Summary model learning device
210 Data acquisition unit
220 Learning data input unit
230 Image processing unit
240 Audio processing unit
250 Summary model learning unit
270 Model setting unit
280 Summary model DB
290 Summary model DB
310 Summary model pre-learning unit
320 Summary model DB
400 Data expansion unit
410 Learning data generation unit
420 Key sentence extraction unit
430 Task information assignment unit
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims

1. A training data generation device that generates a training data set for training a summary model that generates a summary text for a video, the device comprising:
a training data generation unit configured to generate at least one further training data set from an original training data set having a first text that is text extracted from images in the video, a second text that is text extracted from audio in the video, and a correct summary text of the video.

2. The training data generation device according to claim 1, wherein the training data generation unit generates, as the further training data set, a training data set that includes the first text and does not include the second text, or a training data set that includes the second text and does not include the first text.

3. The training data generation device according to claim 1, wherein the training data generation unit generates, as the further training data set, a training data set including either the first text or the second text, and important sentences obtained by matching either the first text or the second text against the correct summary text.

4. The training data generation device according to claim 1, further comprising a task information assigning unit that assigns, to the further training data set, identification information for identifying the task performed with the further training data set.

5. A training data generation method executed by a computer used as a training data generation device that generates a training data set for training a summary model that generates a summary text for a video, the method comprising:
a training data generation step of generating at least one further training data set from an original training data set having a first text that is text extracted from images in the video, a second text that is text extracted from audio in the video, and a correct summary text of the video.

6. A program that causes a computer to function as each unit of the training data generation device according to any one of claims 1 to 4.