JP7619576B2 - Information processing device and information processing method

Info

Publication number: JP7619576B2
Application number: JP2023032994A
Authority: JP
Inventors: 寛貴宅島; 隆之堀; 裕真鈴木; 秀明岡本; 隼人田之上; 一也植木
Original assignee: SoftBank Corp; Meisei Gakuen
Current assignee: SoftBank Corp; Meisei Gakuen
Priority date: 2023-03-03
Filing date: 2023-03-03
Publication date: 2025-01-22
Anticipated expiration: 2043-03-03
Also published as: JP2024124970A

Description

The present invention relates to an information processing device and an information processing method.

Techniques for generating captions (also called caption text; hereinafter referred to as "video descriptions") from videos are known. For example, a technique is known in which a video captured by a surveillance camera is input to a multi-layer neural network that outputs elements contained in an image as words, thereby generating a description of the video.

JP 2018-101317 A

However, the conventional technique described above merely inputs a video captured by a surveillance camera into a multi-layer neural network that outputs elements contained in an image as words, and generates a description of the video. It therefore cannot necessarily generate a variety of video descriptions according to the frame image of interest.

An object of the present application is to provide an information processing device and an information processing method capable of generating a variety of video descriptions according to the frame image of interest.

The information processing device according to the present application includes: a video generation unit that generates a training video based on an image-text dataset including pairs of a captured image and an image description, which is text describing the content of the captured image; an extraction unit that extracts training frame features, which are the features of each of the plurality of frame images constituting the training video; a determination unit that determines training weights, which are weights corresponding to each of the plurality of frame images constituting the training video; and a model generation unit that generates, based on the training frame features and the training weights, a sentence generation model, which is a machine learning model trained to generate a training video description, that is, text describing the content of the training video and having features corresponding to the training frame features weighted by the training weights.

The information processing device according to the present application also includes: an acquisition unit that acquires a sentence generation model, which is a machine learning model trained, based on training frame features and training weights, to generate a training video description, that is, text describing the content of a training video and having features corresponding to the training frame features weighted by the training weights, where the training video is generated based on an image-text dataset including pairs of a captured image and an image description (text describing the content of the captured image), the training frame features are the features of each of the plurality of frame images constituting the training video, and the training weights are weights corresponding to each of those frame images; an extraction unit that extracts target frame features, which are the features of each of the plurality of frame images constituting a target video, that is, a video to be processed; a determination unit that determines target weights, which are weights corresponding to each of the plurality of frame images constituting the target video, such that the weight corresponding to a designated frame image designated by a user is larger than the weights corresponding to the other frame images; and a sentence generation unit that, based on the target frame features and the target weights, inputs the target frame features weighted by the target weights to the sentence generation model to generate a target video description, which is text describing the content of the target video.

According to one aspect of the embodiment, a variety of video descriptions can be generated according to the frame image of interest.

FIG. 1 is a diagram illustrating a configuration example of the information processing device according to the embodiment.
FIG. 2 is a diagram illustrating an example of information processing related to the pre-training method according to the embodiment.
FIG. 3 is a diagram illustrating an example of information processing related to the first additional training method according to the embodiment.
FIG. 4 is a diagram for explaining a method of weighting frame features by weights according to the embodiment.
FIG. 5 is a diagram illustrating an example of information processing related to the second additional training method according to the embodiment.
FIG. 6 is a diagram for explaining a method of calculating a similarity according to the embodiment.
FIG. 7 is a diagram illustrating an example of information processing related to the inference method according to the embodiment.
FIG. 8 is a diagram illustrating a conditional generative adversarial network (CGAN), which is an example of the sentence generation model according to the embodiment.
FIG. 9 is a diagram illustrating a conditional variational autoencoder (CVAE), which is an example of the sentence generation model according to a first modification.
FIG. 10 is a diagram illustrating a conditional diffusion model, which is an example of the sentence generation model according to a second modification.
FIG. 11 is a hardware configuration diagram illustrating an example of a computer that realizes the functions of the information processing device.

A mode for implementing the information processing device and the information processing method according to the present application (hereinafter referred to as the "embodiment") is described in detail below with reference to the drawings. The information processing device and the information processing method according to the present application are not limited by this embodiment. In the following embodiments, identical parts are given identical reference numerals, and duplicate descriptions are omitted.

(Embodiment)
[1. Introduction]
Techniques for generating, from a video, a video description, which is text describing the content of the video, are known. For example, machine learning models that generate a video description from a video are known. The content of a video description generated by such a model is known to differ depending on which part of the video the model focuses on.

In recent years, techniques related to generative models have become known. A generative model is a model of the process by which data is generated. It is a machine learning model that learns from training data and can generate data similar to that training data. Techniques related to conditional generative models are also known. A conditional generative model is a machine learning model that changes its data generation process according to a given condition and can thereby generate diverse, high-quality data.

Here, a condition in a conditional generative model is a condition that the features of the data to be generated by the model must satisfy. In other words, it is a condition on the type or attributes of the data to be generated (for example, the type or attributes of the features that appear in that data). For example, when the data to be generated is an image, the condition may be information indicating the attributes or type of an object contained in the image. Specifically, a condition vector, which is a vector given to the conditional generative model as the condition, is input to the model as input information, which makes it possible to generate data having features corresponding to the condition vector. For example, the condition vector may be a vector corresponding to information indicating the condition.

One known example of a conditional generative model is the Conditional GAN (CGAN) (reference: paper titled "Conditional Generative Adversarial Nets", <Internet> https://arxiv.org/pdf/1411.1784.pdf (retrieved February 16, 2023)). The CGAN, also called a conditional generative adversarial network, is a machine learning model that extends the GAN (generative adversarial network), which generates specific data from noise, so that a condition can be given to it.

Another known example of a conditional generative model is the Conditional Variational Auto Encoder (CVAE) (reference: paper titled "Semi-supervised Learning with Deep Generative Models", <Internet> https://proceedings.neurips.cc/paper/2014/file/d523773c6b194f37b938d340d5d02232-Paper.pdf (retrieved February 16, 2023)). The CVAE, also called a conditional variational autoencoder, is a machine learning model that extends the VAE (variational autoencoder), which generates data according to a latent representation, so that a condition can be given to it.

A further known example of a conditional generative model is the Diffusion Model (reference: paper titled "Denoising Diffusion Probabilistic Models", <Internet> https://arxiv.org/pdf/2006.11239.pdf (retrieved February 16, 2023)). The Diffusion Model learns the process of restoring data from noise little by little. It is generally called a diffusion model and, as an applied use, can generate data under a given condition.

Other known examples of conditional generative models include GLIDE (reference: paper titled "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", <Internet> https://arxiv.org/pdf/2112.10741.pdf (retrieved February 16, 2023)), DALL-E 2 unCLIP (reference: paper titled "Hierarchical Text-Conditional Image Generation with CLIP Latents", <Internet> https://arxiv.org/pdf/2204.06125.pdf (retrieved February 16, 2023)), Imagen (reference: paper titled "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", <Internet> https://arxiv.org/pdf/2205.11487.pdf (retrieved February 16, 2023)), and Parti (reference: paper titled "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", <Internet> https://arxiv.org/pdf/2206.10789.pdf (retrieved February 16, 2023)).

The information processing device according to this embodiment inputs weighted frame features to a conditional generative model as the condition, based on weights corresponding to each of the plurality of frame images constituting a video and on frame features, which are the features of each of those frame images. The information processing device then generates a video description having features corresponding to the weighted frame features. By using the weighted frame features corresponding to each frame image as the condition given to the conditional generative model, the information processing device can control which part of the video (which frame image) the generated video description emphasizes. The information processing device can thus generate a variety of video descriptions according to the frame image of interest. Using the weighted frame features as the condition also makes it possible to reflect the time-series information of the video in natural language generation.
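As a rough illustration of the idea above, the following sketch scales each frame's feature vector by that frame's weight and joins the results in time order to form a condition. The concrete weighting method of the embodiment is the subject of FIG. 4 and is not given in this passage, so the per-frame scalar multiplication, the concatenation, and the names `weight_frame_features` and `to_condition_vector` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def weight_frame_features(features: List[Vector], weights: List[float]) -> List[Vector]:
    """Scale each frame's feature vector by that frame's weight (assumed scheme)."""
    if len(features) != len(weights):
        raise ValueError("exactly one weight per frame is required")
    return [[w * x for x in vec] for vec, w in zip(features, weights)]


def to_condition_vector(weighted: List[Vector]) -> Vector:
    """Concatenate the weighted per-frame vectors in time order (hypothetical choice)."""
    return [x for vec in weighted for x in vec]


# Three frames with 2-dimensional features; the middle frame is emphasized.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [0.25, 0.5, 0.25]
condition = to_condition_vector(weight_frame_features(frames, weights))
print(condition)
```

Keeping the frames in time order inside the condition is one simple way the time-series information of the video could be preserved; other aggregation schemes are equally conceivable.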

[2. Configuration of the information processing device]
A configuration example of the information processing device 100 according to the embodiment is described with reference to FIG. 1. FIG. 1 is a diagram illustrating a configuration example of the information processing device 100 according to the embodiment. The information processing device 100 includes a communication unit 110, a storage unit 120, and a control unit 130.

(Communication unit 110)
The communication unit 110 is realized by a NIC (Network Interface Card), an antenna, or the like. The communication unit 110 is connected to various networks by wire or wirelessly and, for example, transmits and receives information to and from information processing devices other than the information processing device 100.

(Storage unit 120)
The storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk. Specifically, the storage unit 120 stores various data. For example, the storage unit 120 stores training data used for training the sentence generation model. The storage unit 120 also stores various programs. For example, the storage unit 120 stores information on the sentence generation model generated by the model generation unit 134.

(Control unit 130)
The control unit 130 is a controller and is realized, for example, by a CPU (Central Processing Unit), an MPU (Micro Processing Unit), or the like executing various programs stored in a storage device inside the information processing device 100, using the RAM as a work area. The control unit 130 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array).

The control unit 130 has, as functional units, a video generation unit 131, an extraction unit 132, a determination unit 133, a model generation unit 134, an acquisition unit 135, and a sentence generation unit 136, and may realize or execute the information processing described below. The internal configuration of the control unit 130 is not limited to the configuration shown in FIG. 1 and may be any other configuration that performs the information processing described later. Each functional unit represents a function of the control unit 130 and need not be physically distinct.

(Video generation unit 131)
The video generation unit 131 generates a training video based on an image-text dataset including pairs of a captured image and an image description, which is text describing the content of the captured image.

(Extraction unit 132)
The extraction unit 132 extracts training frame features, which are the features of each of the plurality of frame images constituting the training video. The extraction unit 132 also extracts target frame features, which are the features of each of the plurality of frame images constituting a target video, that is, a video to be processed.

(Determination unit 133)
The determination unit 133 determines training weights, which are weights corresponding to each of the plurality of frame images constituting the training video. The determination unit 133 also determines target weights, which are weights corresponding to each of the plurality of frame images constituting the target video, such that the weight corresponding to a designated frame image designated by a user is larger than the weights corresponding to the other frame images.
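The rule for target weights can be sketched as follows. The text only requires that the designated frame's weight exceed the weights of the other frames; the `emphasis` factor, the normalization to a sum of 1, and the function name `determine_target_weights` are hypothetical choices made for this sketch.

```python
from typing import List


def determine_target_weights(num_frames: int, designated_index: int,
                             emphasis: float = 3.0) -> List[float]:
    """Return one weight per frame, with the designated frame weighted highest.

    `emphasis` (> 1) is a hypothetical ratio between the designated frame's
    weight and every other frame's weight; weights are normalized to sum to 1.
    """
    if not 0 <= designated_index < num_frames:
        raise ValueError("designated_index must refer to an existing frame")
    if emphasis <= 1.0:
        raise ValueError("emphasis must be greater than 1")
    raw = [emphasis if i == designated_index else 1.0 for i in range(num_frames)]
    total = sum(raw)
    return [r / total for r in raw]


# The user designates the second of three frames (index 1).
target_weights = determine_target_weights(num_frames=3, designated_index=1)
print(target_weights)
```

Designating a different frame of the same video changes only the weights, which is what allows different descriptions to be produced from one video.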

(Model generation unit 134)
The model generation unit 134 generates, based on the training frame features and the training weights, a sentence generation model, which is a machine learning model trained to generate a training video description, that is, text describing the content of the training video and having features corresponding to the training frame features weighted by the training weights. The sentence generation model may be a conditional generative model. For example, the sentence generation model may be a conditional generative adversarial network (CGAN), a conditional variational autoencoder (CVAE), or a conditional diffusion model.

The model generation unit 134 also generates a pre-trained sentence generation model by pre-training the sentence generation model. The model generation unit 134 then generates an additionally trained sentence generation model by additionally training the pre-trained sentence generation model. The model generation unit 134 stores information on the generated additionally trained sentence generation model in the storage unit 120. In the following, the additionally trained sentence generation model may be referred to simply as the "sentence generation model".

(Acquisition unit 135)
The acquisition unit 135 acquires a sentence generation model, which is a machine learning model trained, based on training frame features and training weights, to generate a training video description, that is, text describing the content of a training video and having features corresponding to the training frame features weighted by the training weights. Here, the training video is generated based on an image-text dataset including pairs of a captured image and an image description (text describing the content of the captured image), the training frame features are the features of each of the plurality of frame images constituting the training video, and the training weights are weights corresponding to each of those frame images. Specifically, the acquisition unit 135 acquires the sentence generation model generated by the model generation unit 134. For example, the acquisition unit 135 acquires information on the sentence generation model from the storage unit 120.

(Sentence generation unit 136)
Based on the target frame features and the target weights, the sentence generation unit 136 inputs the target frame features weighted by the target weights to the sentence generation model and generates a target video description, which is text describing the content of the target video.

[3. Pre-training method]
A method of pre-training the sentence generation model according to the embodiment is described with reference to FIG. 2. FIG. 2 is a diagram illustrating an example of information processing related to the pre-training method according to the embodiment. Here, pre-training refers to preparatory training of the model performed before the model is trained in earnest by the first additional training (see FIG. 3) or the second additional training (see FIG. 5) described later.

As shown in FIG. 2, in the pre-training stage, (1) image features are extracted from each frame image constituting a video included in a video-text dataset that contains pairs of a video and a video description. (2) No weighting is applied to the frame images constituting the video. (3) The image features extracted from each frame image are input as the condition to the sentence generation model, which is a conditional generative model, and the sentence generation model is trained to generate a video description having features corresponding to the image features.

Specifically, the extraction unit 132 may acquire a video-text dataset #1 including a pair of a captured video for pre-training (hereinafter sometimes referred to as "pre-training video #1") and a video description corresponding to that captured video (hereinafter sometimes referred to as "pre-training video description #1"). For example, the extraction unit 132 may acquire the video-text dataset #1 from an external information processing device via the communication unit 110.

Next, the extraction unit 132 may extract the image features of each of the plurality of frame images constituting the pre-training video #1 included in the video-text dataset #1 (step S11). For example, each image feature may be a multidimensional vector. For simplicity, FIG. 2 describes a case in which the pre-training video #1 consists of three frame images, but the pre-training video #1 may consist of four or more frame images. FIG. 2 shows, arranged in chronological order, a first frame image corresponding to the start time of the pre-training video #1, a second frame image corresponding to a time between the start time and the end time of the pre-training video #1, and a third frame image corresponding to the end time of the pre-training video #1. For example, the extraction unit 132 extracts a feature vector V11 from the first frame image, a feature vector V12 from the second frame image, and a feature vector V13 from the third frame image. The extraction unit 132 may then acquire the set of feature vectors (V11, V12, V13) as pre-training frame features #1, which are the image features of each of the plurality of frame images constituting the pre-training video #1.

For example, the extraction unit 132 may extract image features from each of the plurality of frame images constituting a video using any known technique capable of extracting image features from an image. For example, the extraction unit 132 may include an image encoder and extract the image features using the image encoder. For example, the extraction unit 132 may include a convolutional neural network (CNN) as the image encoder and extract image features from each frame image using the CNN. As the image encoder, the extraction unit 132 may also include a network developed for object recognition, such as ResNet (Residual Network) (Kaiming He et al., 2015), AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan et al., 2014), GoogLeNet (Szegedy et al., 2014), SENet (Squeeze-and-Excitation Networks) (Jie Hu et al., 2018), EfficientNet (Tan et al., 2019), or ZFNet (Matthew D. Zeiler et al., 2013), and extract image features from each frame image using that network. Alternatively, the extraction unit 132 may include, as the image encoder, a network developed for object detection, such as Faster R-CNN (Shaoqing Ren et al., 2015), YOLO (You Only Look Once) (Joseph Redmon et al., 2015), or SSD (Single Shot MultiBox Detector) (Wei Liu et al., 2015), and extract image features from each frame image using that network.

In this way, the extraction unit 132 extracts the pre-training frame features #1, which are the features of each of the plurality of frame images constituting the captured video included in the video-text dataset #1. The video-text dataset #1 includes a pair of a captured video (in the example of FIG. 2, the pre-training video #1) and a video description, which is text describing the content of the captured video (in the example of FIG. 2, the pre-training video description #1).
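The per-frame extraction of step S11 can be sketched as a loop that applies an image encoder to each frame in chronological order. In practice the encoder would be a trained network such as one of the CNNs named above; the `toy_encoder` below, which returns only the mean, minimum, and maximum pixel intensity, is a hypothetical stand-in that keeps the sketch runnable without any model weights, and the nested-list frame representation is likewise an assumption.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities (assumed format)
Vector = List[float]


def toy_encoder(frame: Frame) -> Vector:
    """Hypothetical stand-in for an image encoder: (mean, min, max) intensity."""
    pixels = [p for row in frame for p in row]
    return [sum(pixels) / len(pixels), min(pixels), max(pixels)]


def extract_frame_features(video: Sequence[Frame],
                           encoder: Callable[[Frame], Vector]) -> List[Vector]:
    """Apply the encoder to every frame, preserving chronological order."""
    return [encoder(frame) for frame in video]


# A three-frame video of 2x2 frames, mirroring the three-frame example of FIG. 2.
video = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 1.0], [0.0, 1.0]],
]
v11, v12, v13 = extract_frame_features(video, toy_encoder)
print(v11, v12, v13)
```

Passing the encoder as a parameter reflects the statement that any known feature extraction technique may be used.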

続いて、モデル生成部１３４は、抽出部１３２によって抽出された事前学習用フレーム特徴量＃１を文章生成モデルＭ１に入力してよい（ステップＳ１２）。例えば、モデル生成部１３４は、事前学習用フレーム特徴量＃１に基づく条件ベクトル＃１を生成してよい。続いて、モデル生成部１３４は、生成した条件ベクトル＃１とノイズベクトル（乱数ベクトルともいう）を結合してよい。例えば、モデル生成部１３４は、線形変換処理を用いて、条件ベクトル＃１とノイズベクトルのサイズが同じになるように調整してよい。続いて、モデル生成部１３４は、条件ベクトル＃１の各要素をノイズベクトルの各要素に加算することにより、条件ベクトル＃１とノイズベクトルを結合してよい。あるいは、モデル生成部１３４は、条件ベクトル＃１の各要素をノイズベクトルの各要素に乗算することにより、条件ベクトル＃１とノイズベクトルを結合してよい。続いて、モデル生成部１３４は、結合された条件ベクトル＃１とノイズベクトルを入力情報として条件付き生成モデルである文章生成モデルＭ１に入力してよい。 Next, the model generation unit 134 may input the pre-learning frame feature #1 extracted by the extraction unit 132 to the sentence generation model M1 (step S12). For example, the model generation unit 134 may generate a condition vector #1 based on the pre-learning frame feature #1. Next, the model generation unit 134 may combine the generated condition vector #1 with a noise vector (also called a random vector). For example, the model generation unit 134 may use a linear conversion process to adjust the size of the condition vector #1 and the noise vector so that they are the same. Next, the model generation unit 134 may combine the condition vector #1 and the noise vector by adding each element of the condition vector #1 to each element of the noise vector. Alternatively, the model generation unit 134 may combine the condition vector #1 and the noise vector by multiplying each element of the condition vector #1 by each element of the noise vector. Next, the model generation unit 134 may input the combined condition vector #1 and the noise vector as input information to the sentence generation model M1, which is a conditional generation model.
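The combination step described above can be sketched in a few lines. This is a minimal illustration only: the vector sizes, the projection matrix, and the names `linear_project` and `combine` are hypothetical, and the choice to project the condition vector (rather than the noise vector) onto the common size is an assumption, since the description only states that a linear transformation makes the two sizes equal.

```python
import random


def linear_project(vec, weights):
    """Apply a linear map; `weights` holds one row per output element."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def combine(condition, noise, mode="add"):
    """Combine two equal-length vectors element by element."""
    if len(condition) != len(noise):
        raise ValueError("vectors must have the same length")
    if mode == "add":
        return [c + n for c, n in zip(condition, noise)]
    if mode == "mul":
        return [c * n for c, n in zip(condition, noise)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical sizes: a 4-element condition vector and a 3-element noise vector.
rng = random.Random(0)
condition_raw = [0.5, -1.0, 2.0, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]

# Hypothetical 3x4 projection matrix; in a real model it would be learned.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
condition = linear_project(condition_raw, projection)  # now 3 elements

model_input_add = combine(condition, noise, mode="add")  # element-wise sum
model_input_mul = combine(condition, noise, mode="mul")  # element-wise product
```

Either `model_input_add` or `model_input_mul` would then play the role of the input information passed to the conditional generation model.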

続いて、モデル生成部１３４は、結合された条件ベクトル＃１とノイズベクトルの入力に応じて文章生成モデルＭ１が生成した動画説明文であって、文章生成モデルＭ１から出力情報として出力された動画説明文を取得してよい（ステップＳ１３）。モデル生成部１３４は、条件ベクトル＃１に基づいて、条件ベクトル＃１と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、文章生成モデルＭ１から出力された動画説明文と、動画文データセット＃１に含まれる事前学習用動画説明文＃１との誤差が小さくなるように文章生成モデルＭ１を学習させてよい。このようにして、モデル生成部１３４は、事前学習用フレーム特徴量＃１に基づいて、事前学習用フレーム特徴量＃１と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を学習させてよい。 Then, the model generation unit 134 may acquire the video description generated by the sentence generation model M1 in response to the input of the combined condition vector #1 and the noise vector, and output as output information from the sentence generation model M1 (step S13). The model generation unit 134 may train the sentence generation model M1 so as to generate a video description having a feature corresponding to the condition vector #1 based on the condition vector #1. For example, the model generation unit 134 may train the sentence generation model M1 using backpropagation (error backpropagation method) or the like so as to reduce the error between the video description output from the sentence generation model M1 and the pre-training video description #1 included in the video description dataset #1. In this way, the model generation unit 134 may train the sentence generation model M1 so as to generate a video description having a feature corresponding to the pre-training frame feature #1 based on the pre-training frame feature #1.
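The description does not say how the "error" between the generated and the reference video description is measured. One common choice for text generators is token-level cross-entropy, sketched below under that assumption; the toy vocabulary, the probability tables, and the function name `cross_entropy` are all hypothetical. Backpropagation would then adjust the model parameters to reduce this value.

```python
import math


def cross_entropy(step_probs, reference_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_probs[t][k] is the model's probability of vocabulary item k at step t.
    """
    if len(step_probs) != len(reference_ids):
        raise ValueError("one probability distribution is needed per reference token")
    total = 0.0
    for probs, token_id in zip(step_probs, reference_ids):
        total += -math.log(probs[token_id])
    return total / len(reference_ids)


# Toy vocabulary and reference description (hypothetical).
vocab = ["<eos>", "a", "dog", "runs"]
reference = [1, 2, 3, 0]  # "a dog runs <eos>"

# A model that puts most probability on the reference tokens ...
confident = [
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.05, 0.85],
    [0.85, 0.05, 0.05, 0.05],
]
# ... versus one that has learned nothing yet.
uniform = [[0.25] * 4 for _ in reference]

loss_confident = cross_entropy(confident, reference)
loss_uniform = cross_entropy(uniform, reference)
```

The better the output matches the reference description, the smaller the loss, which is the quantity training drives down.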

このように、モデル生成部１３４は、事前学習用フレーム特徴量＃１に基づいて、事前学習用フレーム特徴量＃１と対応する特徴を有する動画説明文（図２では、事前学習用動画説明文＃１）を生成するように事前に学習された機械学習モデルである事前学習済み文章生成モデルＭ１を生成する。 In this way, the model generation unit 134 generates a pre-trained sentence generation model M1, which is a machine learning model that has been pre-trained to generate a video description (pre-training video description #1 in Figure 2) having features corresponding to the pre-training frame feature #1, based on the pre-training frame feature #1.

〔４．第１の追加学習方法〕 [4. First additional learning method]
図３を用いて、実施形態に係る第１の追加学習方法について説明する。図３は、実施形態に係る第１の追加学習方法に関する情報処理の一例を示す図である。第１の追加学習は、図２で説明した事前学習の後に行われる本格的なモデルの学習のことを指す。 A first additional learning method according to the embodiment will be described with reference to Fig. 3. Fig. 3 is a diagram showing an example of information processing related to the first additional learning method according to the embodiment. The first additional learning refers to full-scale model learning performed after the pre-learning described in Fig. 2.

図３に示すように、第１の追加学習の段階では、（１）公知の動画生成モデルを用いて、画像（静止画像）と画像説明文との組を含む画像文データセットに含まれる画像から動画を生成する。以下では、動画を生成する元となった画像（画像文データセットに含まれる画像）のことを「オリジナルの画像」と記載する場合がある。生成された動画は、オリジナルの画像をフレームに含む。（２）生成された動画を構成する各フレーム画像のうち、オリジナルの画像に対応するフレーム画像を注目箇所として、動画を構成する各フレーム画像に対応する重みを決定する。また、生成された動画を構成する各フレーム画像から画像特徴量を抽出し、各フレーム画像から抽出された画像特徴量を各フレーム画像に対応する重みによって重み付けする。（３）重み付けされた画像特徴量を条件として、条件付き生成モデルである事前学習済み文章生成モデルＭ１に入力し、重み付けされた画像特徴量に対応する特徴を有する動画説明文を生成するように事前学習済み文章生成モデルＭ１を再学習させることにより、第１の追加学習済みの文章生成モデルＭ２を生成する。 As shown in FIG. 3, in the first additional learning stage, (1) a known video generation model is used to generate a video from images included in an image-sentence dataset including a pair of images (still images) and image captions. Hereinafter, the image (included in the image-sentence dataset) from which the video is generated may be referred to as the "original image". The generated video includes the original image as a frame. (2) Among the frame images constituting the generated video, the frame image corresponding to the original image is set as a focus point, and weights corresponding to each frame image constituting the video are determined. In addition, image features are extracted from each frame image constituting the generated video, and the image features extracted from each frame image are weighted by the weights corresponding to each frame image. (3) The weighted image features are input as conditions to a pre-trained sentence generation model M1, which is a conditional generation model, and the pre-trained sentence generation model M1 is re-trained to generate a video caption having features corresponding to the weighted image features, thereby generating a first additionally trained sentence generation model M2.

具体的には、動画生成部１３１は、撮像画像（以下、「画像＃２」と記載する場合がある）と撮像画像の内容を説明する文章である画像説明文（以下、「画像説明文＃２」と記載する場合がある）との組を含む画像文データセット＃２に基づいて、学習用動画＃２を生成してよい。例えば、動画生成部１３１は、通信部１１０を介して、外部の情報処理装置から画像文データセット＃２を取得してよい。続いて、動画生成部１３１は、画像から動画を生成する機械学習モデルである第１の動画生成モデルＭ２１を取得してよい。例えば、第１の動画生成モデルＭ２１は、画像から動画を生成する公知の機械学習モデルであってよい（参考文献；“Generating Videos with Scene Dynamics“, Carl Vondrick et al. ,2016 ,＜インターネット＞https://arxiv.org/pdf/1609.02612.pdf（令和５年２月１６日検索））。例えば、動画生成部１３１は、あらかじめ第１の動画生成モデルＭ２１に関する情報を格納している記憶部１２０から第１の動画生成モデルＭ２１を取得してよい。続いて、動画生成部１３１は、画像文データセット＃２に含まれる画像＃２を第１の動画生成モデルＭ２１に入力して、画像＃２から画像＃２をフレームに含む学習用動画＃２を生成してよい（ステップＳ２１）。 Specifically, the video generating unit 131 may generate the learning video #2 based on an image and sentence data set #2 including a pair of a captured image (hereinafter, sometimes referred to as "image #2") and an image description (hereinafter, sometimes referred to as "image description #2") that is a sentence that explains the contents of the captured image. For example, the video generating unit 131 may acquire the image and sentence data set #2 from an external information processing device via the communication unit 110. Next, the video generating unit 131 may acquire a first video generation model M21, which is a machine learning model that generates a video from an image. For example, the first video generation model M21 may be a known machine learning model that generates a video from an image (Reference: "Generating Videos with Scene Dynamics", Carl Vondrick et al., 2016, <Internet> https://arxiv.org/pdf/1609.02612.pdf (searched on February 16, 2023)). For example, the video generator 131 may acquire the first video generation model M21 from the storage unit 120, which stores information about the first video generation model M21 in advance. Next, the video generator 131 may input image #2 included in the image-sentence dataset #2 to the first video generation model M21, and generate a learning video #2 from image #2 that includes image #2 in a frame (step S21).

このように、動画生成部１３１は、画像から動画を生成する機械学習モデルである第１の動画生成モデルＭ２１を用いて、画像文データセット＃２に含まれる撮像画像（図３では、画像＃２）から、撮像画像をフレームに含む学習用動画＃２を生成する。以下では、学習用動画＃２を生成する元となった画像＃２のことを「オリジナルの画像＃２」と記載する場合がある。 In this way, the video generation unit 131 uses the first video generation model M21, which is a machine learning model that generates videos from images, to generate a learning video #2 that includes a captured image in a frame from a captured image (image #2 in FIG. 3) included in the image-sentence dataset #2. Hereinafter, image #2 that was the source from which learning video #2 was generated may be referred to as the "original image #2."

また、抽出部１３２は、動画生成部１３１によって生成された学習用動画＃２を構成する複数のフレーム画像それぞれから、学習用動画＃２を構成する複数のフレーム画像それぞれの画像特徴量を抽出してよい（ステップＳ２２）。なお、抽出部１３２が、各フレーム画像から画像特徴量を抽出する方法は、図２で説明した事前学習において各フレーム画像から画像特徴量を抽出する方法と同様であってよい。以下では、図２と重複する説明は省略する。図３では、簡単のため、学習用動画＃２を構成するフレーム画像が３つである場合について説明するが、学習用動画＃２を構成するフレーム画像の数は４つ以上であってよい。図３では、学習用動画＃２の開始時刻に対応する１枚目のフレーム画像と、学習用動画＃２の開始時刻と終了時刻の間の時刻に対応する２枚目のフレーム画像と、学習用動画＃２の終了時刻に対応する３枚目のフレーム画像とが時系列順に並んでいる様子を示す。例えば、抽出部１３２は、１枚目のフレーム画像から特徴量ベクトルＶ２１を抽出する。また、抽出部１３２は、２枚目のフレーム画像から特徴量ベクトルＶ２２を抽出する。また、抽出部１３２は、３枚目のフレーム画像から特徴量ベクトルＶ２３を抽出する。続いて、抽出部１３２は、学習用動画＃２を構成する複数のフレーム画像それぞれの画像特徴量である学習用フレーム特徴量＃２として、特徴量ベクトルＶ２１～Ｖ２３の組のベクトル（Ｖ２１、Ｖ２２、Ｖ２３）を取得してよい。 The extraction unit 132 may extract image features of each of the multiple frame images constituting the learning video #2 from each of the multiple frame images constituting the learning video #2 generated by the video generation unit 131 (step S22). The method by which the extraction unit 132 extracts image features from each frame image may be the same as the method of extracting image features from each frame image in the pre-learning described in FIG. 2. In the following, explanations that overlap with FIG. 2 will be omitted. In FIG. 3, for simplicity, a case will be described in which there are three frame images constituting the learning video #2, but the number of frame images constituting the learning video #2 may be four or more. In FIG. 3, a first frame image corresponding to the start time of the learning video #2, a second frame image corresponding to a time between the start time and the end time of the learning video #2, and a third frame image corresponding to the end time of the learning video #2 are arranged in chronological order. For example, the extraction unit 132 extracts a feature vector V21 from the first frame image. The extraction unit 132 also extracts a feature vector V22 from the second frame image. The extraction unit 132 also extracts a feature vector V23 from the third frame image. 
Next, the extraction unit 132 may acquire a vector (V21, V22, V23) of the set of feature vectors V21 to V23 as learning frame feature #2, which is the image feature of each of the multiple frame images that make up learning video #2.

このように、抽出部１３２は、学習用動画＃２を構成する複数のフレーム画像それぞれの特徴量である学習用フレーム特徴量＃２を抽出する。 In this way, the extraction unit 132 extracts learning frame features #2, which are features of each of the multiple frame images that make up learning video #2.

また、決定部１３３は、動画生成部１３１によって生成された学習用動画＃２を構成する複数のフレーム画像それぞれに対応する重みを決定してよい（ステップＳ２２）。なお、抽出部１３２が画像特徴量を抽出する処理と、決定部１３３が重みを決定する処理は、いずれの処理が先に行われてもよく、抽出部１３２および決定部１３３によってそれぞれ同時に行われてもよい。 The determination unit 133 may determine weights corresponding to each of the multiple frame images constituting the learning video #2 generated by the video generation unit 131 (step S22). Either the process of extracting image features by the extraction unit 132 or the process of determining weights by the determination unit 133 may be performed first, or may be performed simultaneously by the extraction unit 132 and the determination unit 133.

例えば、決定部１３３は、学習用動画＃２を構成する複数のフレーム画像のうち、オリジナルの画像＃２に対応するフレーム画像の重みをオリジナルの画像＃２に対応するフレーム画像以外の他のフレーム画像に対応する重みよりも大きくするように複数のフレーム画像それぞれに対応する重みを決定してよい。例えば、決定部１３３は、ガウス関数（正規分布ともいう）や円の一部のような凸状の関数であって、極大値の周囲が微分可能な関数の値に基づいて、複数のフレーム画像それぞれに対応する重みを決定してよい。なお、決定部１３３は、ガウス関数や円の一部に限らず、極大値の周囲が微分可能な関数であればどのような関数を用いて重みを決定してもよい。例えば、決定部１３３は、極大値の周囲が微分可能な関数の極大値に対応する値をオリジナルの画像＃２に対応するフレーム画像の重みとしてよい。また、決定部１３３は、極大値の周囲が微分可能な関数の極大値の周辺に対応する値をオリジナルの画像＃２に対応するフレーム画像以外の他のフレーム画像に対応する重みとしてよい。 For example, the determination unit 133 may determine the weights for the frame images constituting learning video #2 such that the weight of the frame image corresponding to the original image #2 is greater than the weights of the other frame images. For example, the determination unit 133 may determine the weight for each frame image from the value of an upwardly convex (bump-shaped) function, such as a Gaussian function (also called a normal distribution) or an arc of a circle, that is differentiable in the neighborhood of its local maximum. The function is not limited to a Gaussian function or an arc of a circle; the determination unit 133 may use any function that is differentiable in the neighborhood of its local maximum. For example, the determination unit 133 may set the local maximum of such a function as the weight of the frame image corresponding to the original image #2, and may set values of the function near that maximum as the weights of the other frame images.
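As a rough sketch of the two function shapes named above, the following evaluates a Gaussian and a circular arc at hypothetical frame times. The playback times, `sigma`, `radius`, and all function names are illustrative assumptions; both functions are scaled here so that the peak value is 1.0, which the description does not require.

```python
import math


def gaussian_weight(t, center, sigma):
    """Unnormalised Gaussian; equals 1.0 at t == center."""
    return math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2))


def arc_weight(t, center, radius):
    """Upper half of a circle scaled to peak at 1.0; 0.0 outside the circle."""
    d = t - center
    if abs(d) >= radius:
        return 0.0
    return math.sqrt(radius ** 2 - d ** 2) / radius


def frame_weights(times, original_time, weight_fn, **params):
    """Evaluate a bump function, peaked at the original image's time, per frame."""
    return [weight_fn(t, original_time, **params) for t in times]


times = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical playback times in seconds
original_time = 2.0                # time of the frame holding the original image

w_gauss = frame_weights(times, original_time, gaussian_weight, sigma=1.5)
w_arc = frame_weights(times, original_time, arc_weight, radius=3.0)
```

In both cases the frame at `original_time` receives the largest weight and the weights fall off symmetrically on either side. The arc is not differentiable at its end points, but it is smooth around the maximum, which is the only condition the text imposes.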

図３では、決定部１３３は、横軸を動画の再生時刻、縦軸を重みとするガウス関数の値を用いて複数のフレーム画像それぞれに対応する重みを決定してよい。例えば、決定部１３３は、ガウス関数の平均値に対応する時刻をオリジナルの画像＃２に対応する２枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値に対応する時刻の値である「１．０」をオリジナルの画像＃２に対応する２枚目のフレーム画像の重み＃２２としてよい。また、決定部１３３は、ガウス関数の平均値よりも小さい値に対応する時刻を１枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値よりも小さい値に対応する時刻の値である「０．８」を１枚目のフレーム画像の重み＃２１としてよい。また、決定部１３３は、ガウス関数の平均値よりも大きい値に対応する時刻を３枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値よりも大きい値に対応する時刻の値である「０．８」を３枚目のフレーム画像の重み＃２３としてよい。例えば、決定部１３３は、学習用動画＃２を構成する複数のフレーム画像それぞれに対応する重みである学習用重み＃２として、１枚目のフレーム画像の重み＃２１～３枚目のフレーム画像の重み＃２３の組のベクトル（重み＃２１、重み＃２２、重み＃２３）＝（０．８、１．０、０．８）を取得してよい。 In FIG. 3, the determination unit 133 may determine the weight for each frame image from the value of a Gaussian function whose horizontal axis is the playback time of the video and whose vertical axis is the weight. For example, the determination unit 133 may align the mean of the Gaussian function with the playback time of the second frame image, which corresponds to the original image #2, and set the function value at that time, "1.0", as weight #22 of the second frame image. The determination unit 133 may place the playback time of the first frame image at a time earlier than the mean and set the function value at that time, "0.8", as weight #21 of the first frame image. Likewise, the determination unit 133 may place the playback time of the third frame image at a time later than the mean and set the function value at that time, "0.8", as weight #23 of the third frame image. For example, the determination unit 133 may obtain, as learning weight #2 (the weights corresponding to the frame images constituting learning video #2), the vector (weight #21, weight #22, weight #23) = (0.8, 1.0, 0.8) formed from weight #21 of the first frame image through weight #23 of the third frame image.

このように、決定部１３３は、学習用動画＃２を構成する複数のフレーム画像のうち、撮像画像（図３では、オリジナルの画像＃２）に対応する重み（図３では、２枚目のフレーム画像の重み＃２２である「１．０」）を撮像画像以外の他のフレーム画像に対応する重み（図３では、１枚目のフレーム画像の重み＃２１である「０．８」および３枚目のフレーム画像の重み＃２３である「０．８」）よりも大きくするように複数のフレーム画像それぞれに対応する学習用重み＃２（図３では、（重み＃２１、重み＃２２、重み＃２３）＝（０．８、１．０、０．８））を決定する。また、このように、決定部１３３は、学習用動画＃２を構成する複数のフレーム画像それぞれに対応する重みである学習用重み＃２を決定する。 In this way, the determination unit 133 determines the learning weights #2 (in FIG. 3, (weights #21, #22, #23) = (0.8, 1.0, 0.8)) corresponding to each of the multiple frame images constituting the learning video #2 so that the weight (in FIG. 3, weight #22 of the second frame image, "1.0") corresponding to the captured image (original image #2 in FIG. 3) is greater than the weights corresponding to the other frame images other than the captured image (in FIG. 3, weight #21 of the first frame image, "0.8" and weight #23 of the third frame image, "0.8"). Also, in this way, the determination unit 133 determines the learning weights #2 that correspond to each of the multiple frame images constituting the learning video #2.
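The values (0.8, 1.0, 0.8) in FIG. 3 are consistent with an unnormalised Gaussian whose peak is 1.0. As a worked illustration, assume the three frames are evenly spaced by an interval $\Delta$ and the Gaussian has mean $\mu$ at the second frame; the symbols $\mu$, $\sigma$, and $\Delta$ are introduced here for illustration and do not appear in the description. The spread that yields 0.8 for the neighbouring frames then follows directly:

```latex
w(t) = \exp\!\left(-\frac{(t-\mu)^2}{2\sigma^2}\right), \qquad w(\mu) = 1,
\qquad
w(\mu \pm \Delta) = \exp\!\left(-\frac{\Delta^2}{2\sigma^2}\right) = 0.8
\;\Longrightarrow\;
\sigma = \frac{\Delta}{\sqrt{2\ln(1/0.8)}} \approx 1.50\,\Delta .
```

A wider spread (larger $\sigma$) brings the neighbouring weights closer to 1.0, while a narrower one concentrates the weighting on the frame that corresponds to the original image.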

また、モデル生成部１３４は、決定部１３３によって決定された学習用重み＃２によって、抽出部１３２によって抽出された学習用フレーム特徴量＃２を重み付けしてよい。モデル生成部１３４は、決定部１３３によって決定された学習用重み＃２によって重み付けされた学習用フレーム特徴量＃２である、重み付けされた学習用フレーム特徴量＃２´を生成してよい。図３では、モデル生成部１３４は、学習用動画＃２を構成する１枚目のフレーム画像に対応する重み＃２１である「０．８」を特徴量ベクトルＶ２１の各要素に乗じることにより、重み＃２１によって重み付けされた特徴量ベクトルＶ２１´を生成してよい。また、モデル生成部１３４は、学習用動画＃２を構成する２枚目のフレーム画像に対応する重み＃２２である「１．０」を特徴量ベクトルＶ２２の各要素に乗じることにより、重み＃２２によって重み付けされた特徴量ベクトルＶ２２´を生成してよい。また、モデル生成部１３４は、学習用動画＃２を構成する３枚目のフレーム画像に対応する重み＃２３である「０．８」を特徴量ベクトルＶ２３の各要素に乗じることにより、重み＃２３によって重み付けされた特徴量ベクトルＶ２３´を生成してよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃２´を生成してよい。図３では、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃２´として、（重み＃２１、重み＃２２、重み＃２３）＊（Ｖ２１、Ｖ２２、Ｖ２３）＝（重み＃２１＊Ｖ２１、重み＃２２＊Ｖ２２、重み＃２３＊Ｖ２３）＝（Ｖ２１´、Ｖ２２´、Ｖ２３´）を生成してよい。 The model generation unit 134 may weight the learning frame feature #2 extracted by the extraction unit 132 by the learning weight #2 determined by the determination unit 133. The model generation unit 134 may generate a weighted learning frame feature #2', which is the learning frame feature #2 weighted by the learning weight #2 determined by the determination unit 133. In FIG. 3, the model generation unit 134 may generate a feature vector V21' weighted by the weight #21 by multiplying each element of the feature vector V21 by "0.8", which is the weight #21 corresponding to the first frame image constituting the learning video #2. The model generation unit 134 may generate a feature vector V22' weighted by the weight #22 by multiplying each element of the feature vector V22 by "1.0", which is the weight #22 corresponding to the second frame image constituting the learning video #2. Furthermore, the model generation unit 134 may generate a feature vector V23' weighted by weight #23 by multiplying each element of the feature vector V23 by "0.8", which is weight #23 corresponding to the third frame image constituting the learning video #2. In this manner, the model generation unit 134 may generate weighted learning frame feature #2'. In FIG. 3, the model generation unit 134 may generate (weight #21, weight #22, weight #23) * (V21, V22, V23) = (weight #21 * V21, weight #22 * V22, weight #23 * V23) = (V21', V22', V23') as the weighted learning frame feature #2'.
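The per-frame weighting above amounts to multiplying every element of a frame's feature vector by that frame's scalar weight. A minimal sketch follows; the two-element feature vectors stand in for V21 to V23 and are made-up values, and the function name `weight_frame_features` is hypothetical. Only the weights (0.8, 1.0, 0.8) come from FIG. 3.

```python
def weight_frame_features(features, weights):
    """Scale each frame's feature vector by that frame's scalar weight."""
    if len(features) != len(weights):
        raise ValueError("one weight is needed per frame")
    return [[w * x for x in vec] for vec, w in zip(features, weights)]


# Hypothetical 2-element feature vectors standing in for V21, V22, V23.
frame_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 10.0]]
learning_weights = [0.8, 1.0, 0.8]  # (weight #21, weight #22, weight #23)

# Corresponds to (V21', V22', V23') in the text.
weighted = weight_frame_features(frame_features, learning_weights)
```

The frame with weight 1.0 passes through unchanged, while the other frames are attenuated, so the downstream condition vector is dominated by the frame of interest.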

続いて、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃２´を事前学習済み文章生成モデルＭ１（以下、「文章生成モデルＭ１」と略記する場合がある）に入力してよい（ステップＳ２３）。例えば、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃２´に基づく条件ベクトル＃２を生成してよい。続いて、モデル生成部１３４は、生成した条件ベクトル＃２とノイズベクトルを結合してよい。なお、モデル生成部１３４が、条件ベクトル＃２とノイズベクトルを結合する方法は、図２で説明した事前学習において条件ベクトル＃１とノイズベクトルを結合する方法と同様であってよい。以下では、図２と重複する説明は省略する。続いて、モデル生成部１３４は、結合された条件ベクトル＃２とノイズベクトルを入力情報として条件付き生成モデルである文章生成モデルＭ１に入力してよい。 Then, the model generation unit 134 may input the weighted learning frame feature #2' to the pre-trained sentence generation model M1 (hereinafter, sometimes abbreviated as "sentence generation model M1") (step S23). For example, the model generation unit 134 may generate a condition vector #2 based on the weighted learning frame feature #2'. Then, the model generation unit 134 may combine the generated condition vector #2 with a noise vector. Note that the method by which the model generation unit 134 combines the condition vector #2 with the noise vector may be the same as the method by which the condition vector #1 with the noise vector is combined in the pre-training described in FIG. 2. In the following, the description that overlaps with FIG. 2 will be omitted. Then, the model generation unit 134 may input the combined condition vector #2 and noise vector as input information to the sentence generation model M1, which is a conditional generation model.

続いて、モデル生成部１３４は、結合された条件ベクトル＃２とノイズベクトルの入力に応じて文章生成モデルＭ１が生成した動画説明文であって、文章生成モデルＭ１から出力情報として出力された動画説明文（以下、「学習用動画説明文＃２」と記載する場合がある）を取得してよい（ステップＳ２４）。モデル生成部１３４は、条件ベクトル＃２に基づいて、条件ベクトル＃２と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を再学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、文章生成モデルＭ１から出力された学習用動画説明文＃２と、画像文データセット＃２に含まれる画像説明文＃２（オリジナルの画像＃２に対応する画像説明文）との誤差が小さくなるように文章生成モデルＭ１を再学習させてよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃２´に基づいて、重み付けされた学習用フレーム特徴量＃２´と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を再学習させてよい。 Next, the model generation unit 134 may acquire the video description generated by the sentence generation model M1 in response to the input of the combined condition vector #2 and the noise vector, and output as output information from the sentence generation model M1 (hereinafter, may be referred to as "learning video description #2") (step S24). The model generation unit 134 may retrain the sentence generation model M1 so as to generate a video description having a feature corresponding to the condition vector #2 based on the condition vector #2. For example, the model generation unit 134 may retrain the sentence generation model M1 using backpropagation (error backpropagation method) or the like so as to reduce the error between the learning video description #2 output from the sentence generation model M1 and the image description #2 (image description corresponding to the original image #2) included in the image sentence data set #2. In this way, the model generation unit 134 may retrain the sentence generation model M1 so as to generate a video description having a feature corresponding to the weighted learning frame feature #2' based on the weighted learning frame feature #2'.

このように、モデル生成部１３４は、学習用フレーム特徴量＃２と学習用重み＃２とに基づいて、学習用動画の内容を説明する文章である学習用動画説明文＃２であって、学習用重み＃２によって重み付けされた学習用フレーム特徴量＃２´と対応する特徴を有する学習用動画説明文＃２を生成するように事前学習済み文章生成モデルＭ１を再学習させることにより、文章生成モデルＭ２を生成する。 In this way, the model generation unit 134 generates a sentence generation model M2 by re-training the pre-trained sentence generation model M1 to generate training video description #2, which is a sentence that explains the content of the training video, based on training frame feature #2 and training weight #2, and has features corresponding to training frame feature #2' weighted by training weight #2.

図４は、実施形態に係る重みによってフレーム特徴量を重み付けする方法について説明するための図である。フレーム特徴量は、画像の各ピクセルに対応する値を持ってよい。図４に示す例では、簡単のため、画像の画素が３×３の行列で表される場合について説明する。このとき、フレーム特徴量は、３×３の行列で表されてよい。また、簡単のため、重みの値を「３」とする。このとき、モデル生成部１３４は、フレーム特徴量の各要素（３×３の行列の各要素）に重みの値である「３」を乗じることにより、重み付けされたフレーム特徴量を生成する。 FIG. 4 is a diagram for explaining a method of weighting frame features by weights according to an embodiment. The frame features may have values corresponding to each pixel of the image. In the example shown in FIG. 4, for simplicity, a case will be explained in which the pixels of the image are represented by a 3×3 matrix. In this case, the frame features may be represented by a 3×3 matrix. Also, for simplicity, the weight value is set to "3". In this case, the model generation unit 134 generates weighted frame features by multiplying each element of the frame features (each element of the 3×3 matrix) by the weight value "3".

〔５．第２の追加学習方法〕 [5. Second additional learning method]
図５を用いて、実施形態に係る第２の追加学習方法について説明する。図５は、実施形態に係る第２の追加学習方法に関する情報処理の一例を示す図である。第２の追加学習は、図２で説明した事前学習の後に行われる本格的なモデルの学習のことを指す。図５では、モデル生成部１３４は、第１の追加学習の代わりに、第２の追加学習により、事前学習済み文章生成モデルＭ１を再学習させる点が図３と異なる。 A second additional learning method according to the embodiment will be described with reference to Fig. 5. Fig. 5 is a diagram showing an example of information processing related to the second additional learning method according to the embodiment. The second additional learning refers to full-scale model learning performed after the pre-learning described in Fig. 2. Fig. 5 differs from Fig. 3 in that the model generation unit 134 retrains the pre-trained sentence generation model M1 by the second additional learning instead of the first additional learning.

図５に示すように、第２の追加学習の段階では、（１）公知の動画生成モデルを用いて、画像（静止画像）と画像説明文との組を含む画像文データセットに含まれる画像説明文から動画を生成する。以下では、動画を生成する元となった画像説明文に対応する画像（画像文データセットに含まれる画像）のことを「オリジナルの画像」と記載する場合がある。（２）生成された動画を構成する各フレーム画像とオリジナルの画像との類似度を算出し、算出された類似度を、動画を構成する各フレーム画像に対応する重みとする。また、生成された動画を構成する各フレーム画像から画像特徴量を抽出し、各フレーム画像から抽出された画像特徴量を各フレーム画像に対応する重みによって重み付けする。（３）重み付けされた画像特徴量を条件として、条件付き生成モデルである事前学習済み文章生成モデルＭ１に入力し、重み付けされた画像特徴量に対応する特徴を有する動画説明文を生成するように事前学習済み文章生成モデルＭ１を再学習させることにより、第２の追加学習済みの文章生成モデルＭ３を生成する。 As shown in FIG. 5, in the second additional learning stage, (1) a known video generation model is used to generate a video from an image description included in an image-sentence dataset including a pair of an image (still image) and an image description. Hereinafter, the image (included in the image-sentence dataset) corresponding to the image description from which the video is generated may be referred to as the "original image". (2) The similarity between each frame image constituting the generated video and the original image is calculated, and the calculated similarity is set as the weight corresponding to each frame image constituting the video. In addition, image features are extracted from each frame image constituting the generated video, and the image features extracted from each frame image are weighted by the weight corresponding to each frame image. (3) The weighted image features are input as conditions to the pre-trained sentence generation model M1, which is a conditional generation model, and the pre-trained sentence generation model M1 is re-trained to generate a video description having features corresponding to the weighted image features, thereby generating a second additionally trained sentence generation model M3.

具体的には、動画生成部１３１は、撮像画像（以下、「画像＃３」と記載する場合がある）と撮像画像の内容を説明する文章である画像説明文（以下、「画像説明文＃３」と記載する場合がある）との組を含む画像文データセット＃３に基づいて、学習用動画＃３を生成してよい。例えば、動画生成部１３１は、通信部１１０を介して、外部の情報処理装置から画像文データセット＃３を取得してよい。続いて、動画生成部１３１は、文章から動画を生成する機械学習モデルである第２の動画生成モデルＭ３１を取得してよい。例えば、第２の動画生成モデルＭ３１は、文章から動画を生成する公知の機械学習モデルであってよい（参考文献；“ CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers “, Wenyi Hong et al,2022) ,＜インターネット＞https://arxiv.org/pdf/2205.15868.pdf（令和５年２月１６日検索））。例えば、動画生成部１３１は、あらかじめ第２の動画生成モデルＭ３１に関する情報を格納している記憶部１２０から第２の動画生成モデルＭ３１を取得してよい。続いて、動画生成部１３１は、画像文データセット＃３に含まれる画像説明文＃３を第２の動画生成モデルＭ３１に入力して、画像説明文＃３から学習用動画＃３を生成してよい（ステップＳ３１）。 Specifically, the video generation unit 131 may generate the learning video #3 based on an image and sentence data set #3 including a pair of a captured image (hereinafter, sometimes referred to as "image #3") and an image description (hereinafter, sometimes referred to as "image description #3") that is a sentence that explains the content of the captured image. For example, the video generation unit 131 may acquire the image and sentence data set #3 from an external information processing device via the communication unit 110. Next, the video generation unit 131 may acquire a second video generation model M31, which is a machine learning model that generates a video from a sentence. For example, the second video generation model M31 may be a known machine learning model that generates a video from a sentence (Reference: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", Wenyi Hong et al, 2022), <Internet> https://arxiv.org/pdf/2205.15868.pdf (searched on February 16, 2023)). For example, the video generation unit 131 may acquire the second video generation model M31 from the storage unit 120 that stores information about the second video generation model M31 in advance. 
Next, the video generation unit 131 may input the image description #3 included in the image text data set #3 to the second video generation model M31, and generate a learning video #3 from the image description #3 (step S31).

このように、動画生成部１３１は、文章から動画を生成する機械学習モデルである第２の動画生成モデルＭ３１を用いて、画像文データセット＃３に含まれる画像説明文＃３から学習用動画＃３を生成する。以下では、学習用動画＃３を生成する元となった画像説明文＃３と対応する画像＃３（画像文データセット＃３に含まれる画像＃３）のことを「オリジナルの画像＃３」と記載する場合がある。 In this way, the video generation unit 131 generates training video #3 from image description #3 included in image sentence dataset #3 using the second video generation model M31, which is a machine learning model that generates videos from text. In the following, image #3 (the image #3 included in image sentence dataset #3) that corresponds to image description #3, from which training video #3 was generated, may be referred to as the "original image #3."

また、抽出部１３２は、動画生成部１３１によって生成された学習用動画＃３を構成する複数のフレーム画像それぞれから、学習用動画＃３を構成する複数のフレーム画像それぞれの画像特徴量を抽出してよい（ステップＳ３２）。なお、図３と同様に、抽出部１３２が、各フレーム画像から画像特徴量を抽出する方法は、図２で説明した事前学習において各フレーム画像から画像特徴量を抽出する方法と同様であってよい。以下では、図２と重複する説明は省略する。図５では、簡単のため、学習用動画＃３を構成するフレーム画像が３つである場合について説明するが、学習用動画＃３を構成するフレーム画像の数は４つ以上であってよい。図５では、学習用動画＃３の開始時刻に対応する１枚目のフレーム画像と、学習用動画＃３の開始時刻と終了時刻の間の時刻に対応する２枚目のフレーム画像と、学習用動画＃３の終了時刻に対応する３枚目のフレーム画像とが時系列順に並んでいる様子を示す。例えば、抽出部１３２は、１枚目のフレーム画像から特徴量ベクトルＶ３１を抽出する。また、抽出部１３２は、２枚目のフレーム画像から特徴量ベクトルＶ３２を抽出する。また、抽出部１３２は、３枚目のフレーム画像から特徴量ベクトルＶ３３を抽出する。続いて、抽出部１３２は、学習用動画＃３を構成する複数のフレーム画像それぞれの画像特徴量である学習用フレーム特徴量＃３として、特徴量ベクトルＶ３１～Ｖ３３の組のベクトル（Ｖ３１、Ｖ３２、Ｖ３３）を取得してよい。 The extraction unit 132 may extract image features of each of the multiple frame images constituting the learning video #3 from each of the multiple frame images constituting the learning video #3 generated by the video generation unit 131 (step S32). As in FIG. 3, the method in which the extraction unit 132 extracts image features from each frame image may be the same as the method in which image features are extracted from each frame image in the pre-learning described in FIG. 2. In the following, explanations that overlap with FIG. 2 will be omitted. In FIG. 5, for simplicity, a case in which there are three frame images constituting the learning video #3 will be described, but the number of frame images constituting the learning video #3 may be four or more. In FIG. 5, a first frame image corresponding to the start time of the learning video #3, a second frame image corresponding to a time between the start time and the end time of the learning video #3, and a third frame image corresponding to the end time of the learning video #3 are shown arranged in chronological order. For example, the extraction unit 132 extracts a feature vector V31 from the first frame image. The extraction unit 132 also extracts a feature vector V32 from the second frame image. The extraction unit 132 also extracts a feature vector V33 from the third frame image. 
Next, the extraction unit 132 may acquire a vector (V31, V32, V33) of the set of feature vectors V31 to V33 as learning frame feature #3, which is the image feature of each of the multiple frame images that make up learning video #3.

このように、抽出部１３２は、学習用動画＃３を構成する複数のフレーム画像それぞれの特徴量である学習用フレーム特徴量＃３を抽出する。 In this way, the extraction unit 132 extracts learning frame features #3, which are features of each of the multiple frame images that make up learning video #3.

また、決定部１３３は、動画生成部１３１によって生成された学習用動画＃３を構成する複数のフレーム画像それぞれに対応する重みを決定してよい（ステップＳ３２）。なお、図３と同様に、抽出部１３２が画像特徴量を抽出する処理と、決定部１３３が重みを決定する処理は、いずれの処理が先に行われてもよく、抽出部１３２および決定部１３３によってそれぞれ同時に行われてもよい。 The determination unit 133 may determine weights corresponding to each of the multiple frame images constituting the learning video #3 generated by the video generation unit 131 (step S32). As in FIG. 3, either the process of extracting image features by the extraction unit 132 or the process of determining weights by the determination unit 133 may be performed first, or the two processes may be performed simultaneously by the extraction unit 132 and the determination unit 133, respectively.

例えば、決定部１３３は、学習用動画＃３を構成する複数のフレーム画像それぞれとオリジナルの画像＃３との類似度に基づいて、複数のフレーム画像それぞれに対応する重みを決定してよい。図５では、決定部１３３は、１枚目のフレーム画像とオリジナルの画像＃３との類似度＃３１を「０．１」と算出する。続いて、決定部１３３は、算出された類似度＃３１の値である「０．１」を１枚目のフレーム画像の重み＃３１としてよい。また、決定部１３３は、２枚目のフレーム画像とオリジナルの画像＃３との類似度＃３２を「０．７」と算出する。続いて、決定部１３３は、算出された類似度＃３２の値である「０．７」を２枚目のフレーム画像の重み＃３２としてよい。また、決定部１３３は、３枚目のフレーム画像とオリジナルの画像＃３との類似度＃３３を「０．２」と算出する。続いて、決定部１３３は、算出された類似度＃３３の値である「０．２」を３枚目のフレーム画像の重み＃３３としてよい。例えば、決定部１３３は、学習用動画＃３を構成する複数のフレーム画像それぞれに対応する重みである学習用重み＃３として、１枚目のフレーム画像の重み＃３１～３枚目のフレーム画像の重み＃３３の組のベクトル（重み＃３１、重み＃３２、重み＃３３）＝（類似度＃３１、類似度＃３２、類似度＃３３）＝（０．１、０．７、０．２）を取得してよい。 For example, the determination unit 133 may determine the weight for each frame image based on the similarity between that frame image of the learning video #3 and the original image #3. In FIG. 5, the determination unit 133 calculates the similarity #31 between the first frame image and the original image #3 to be "0.1". Then, the determination unit 133 may set the calculated value of similarity #31, "0.1", as the weight #31 of the first frame image. The determination unit 133 also calculates the similarity #32 between the second frame image and the original image #3 to be "0.7". Then, the determination unit 133 may set the calculated value of similarity #32, "0.7", as the weight #32 of the second frame image. The determination unit 133 also calculates the similarity #33 between the third frame image and the original image #3 to be "0.2". Next, the determination unit 133 may set the calculated value of similarity #33, "0.2", as the weight #33 of the third frame image. For example, the determination unit 133 may obtain, as learning weight #3 (the weights corresponding to the frame images constituting learning video #3), the vector (weight #31, weight #32, weight #33) = (similarity #31, similarity #32, similarity #33) = (0.1, 0.7, 0.2) formed from weight #31 of the first frame image through weight #33 of the third frame image.

このように、決定部１３３は、学習用動画＃３を構成する複数のフレーム画像（図５では、１枚目のフレーム画像～３枚目のフレーム画像）それぞれと撮像画像（図５では、オリジナルの画像＃３）との類似度（図５では、（類似度＃３１、類似度＃３２、類似度＃３３）＝（０．１、０．７、０．２））に関する情報を複数のフレーム画像それぞれに対応する学習用重み＃３（図５では、（重み＃３１、重み＃３２、重み＃３３）＝（０．１、０．７、０．２））とする。また、このように、決定部１３３は、学習用動画＃３を構成する複数のフレーム画像それぞれに対応する重みである学習用重み＃３を決定する。 In this way, the determination unit 133 uses information regarding the similarity (in FIG. 5, (similarity #31, similarity #32, similarity #33) = (0.1, 0.7, 0.2)) between each of the multiple frame images constituting the learning video #3 (in FIG. 5, the first to third frame images) and the captured image (in FIG. 5, the original image #3) as the learning weights #3 (in FIG. 5, (weight #31, weight #32, weight #33) = (0.1, 0.7, 0.2)) corresponding to each of the multiple frame images. Also in this way, the determination unit 133 determines the learning weights #3, which are the weights corresponding to each of the multiple frame images constituting the learning video #3.

また、モデル生成部１３４は、決定部１３３によって決定された学習用重み＃３によって、抽出部１３２によって抽出された学習用フレーム特徴量＃３を重み付けしてよい。モデル生成部１３４は、決定部１３３によって決定された学習用重み＃３によって重み付けされた学習用フレーム特徴量＃３である、重み付けされた学習用フレーム特徴量＃３´を生成してよい。図５では、モデル生成部１３４は、学習用動画＃３を構成する１枚目のフレーム画像に対応する重み＃３１である「０．１」を特徴量ベクトルＶ３１の各要素に乗じることにより、重み＃３１によって重み付けされた特徴量ベクトルＶ３１´を生成してよい。また、モデル生成部１３４は、学習用動画＃３を構成する２枚目のフレーム画像に対応する重み＃３２である「０．７」を特徴量ベクトルＶ３２の各要素に乗じることにより、重み＃３２によって重み付けされた特徴量ベクトルＶ３２´を生成してよい。また、モデル生成部１３４は、学習用動画＃３を構成する３枚目のフレーム画像に対応する重み＃３３である「０．２」を特徴量ベクトルＶ３３の各要素に乗じることにより、重み＃３３によって重み付けされた特徴量ベクトルＶ３３´を生成してよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃３´を生成してよい。図５では、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃３´として、（重み＃３１、重み＃３２、重み＃３３）＊（Ｖ３１、Ｖ３２、Ｖ３３）＝（重み＃３１＊Ｖ３１、重み＃３２＊Ｖ３２、重み＃３３＊Ｖ３３）＝（Ｖ３１´、Ｖ３２´、Ｖ３３´）を生成してよい。 The model generation unit 134 may weight the learning frame feature #3 extracted by the extraction unit 132 by the learning weight #3 determined by the determination unit 133. The model generation unit 134 may generate a weighted learning frame feature #3', which is the learning frame feature #3 weighted by the learning weight #3 determined by the determination unit 133. In FIG. 5, the model generation unit 134 may generate a feature vector V31' weighted by the weight #31 by multiplying each element of the feature vector V31 by "0.1", which is the weight #31 corresponding to the first frame image constituting the learning video #3. The model generation unit 134 may generate a feature vector V32' weighted by the weight #32 by multiplying each element of the feature vector V32 by "0.7", which is the weight #32 corresponding to the second frame image constituting the learning video #3. Furthermore, the model generation unit 134 may generate a feature vector V33' weighted by weight #33 by multiplying each element of the feature vector V33 by "0.2", which is weight #33 corresponding to the third frame image constituting the learning video #3. In this manner, the model generation unit 134 may generate weighted learning frame feature #3'. In FIG. 5, the model generation unit 134 may generate (weight #31, weight #32, weight #33) * (V31, V32, V33) = (weight #31 * V31, weight #32 * V32, weight #33 * V33) = (V31', V32', V33') as the weighted learning frame feature #3'.
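The per-frame weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the embodiment: the function name `weight_frame_features` and the 4-dimensional feature values standing in for V31 to V33 are assumptions introduced here, and only the weights (0.1, 0.7, 0.2) come from FIG. 5.

```python
def weight_frame_features(weights, features):
    """Multiply every element of each frame's feature vector by that frame's weight."""
    if len(weights) != len(features):
        raise ValueError("one weight is required per frame feature vector")
    return [[w * x for x in vec] for w, vec in zip(weights, features)]


# Hypothetical 4-dimensional feature vectors standing in for V31, V32, V33.
v31 = [1.0, 2.0, 3.0, 4.0]
v32 = [0.5, 0.5, 0.5, 0.5]
v33 = [2.0, 0.0, -2.0, 1.0]

# (weight #31, weight #32, weight #33) as given for FIG. 5.
learning_weights = [0.1, 0.7, 0.2]

# (V31', V32', V33'): the weighted learning frame feature #3'.
weighted = weight_frame_features(learning_weights, [v31, v32, v33])
```

Because the second frame carries the largest weight, its feature vector contributes most to whatever is derived from `weighted` afterwards.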

続いて、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃３´を事前学習済み文章生成モデルＭ１（以下、「文章生成モデルＭ１」と略記する場合がある）に入力してよい（ステップＳ３３）。例えば、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃３´に基づく条件ベクトル＃３を生成してよい。続いて、モデル生成部１３４は、生成した条件ベクトル＃３とノイズベクトルを結合してよい。なお、モデル生成部１３４が、条件ベクトル＃３とノイズベクトルを結合する方法は、図２で説明した事前学習において条件ベクトル＃１とノイズベクトルを結合する方法と同様であってよい。以下では、図２と重複する説明は省略する。続いて、モデル生成部１３４は、結合された条件ベクトル＃３とノイズベクトルを入力情報として条件付き生成モデルである文章生成モデルＭ１に入力してよい。 Then, the model generation unit 134 may input the weighted learning frame feature #3' to the pre-trained sentence generation model M1 (hereinafter, sometimes abbreviated as "sentence generation model M1") (step S33). For example, the model generation unit 134 may generate a condition vector #3 based on the weighted learning frame feature #3'. Then, the model generation unit 134 may combine the generated condition vector #3 with a noise vector. Note that the method by which the model generation unit 134 combines the condition vector #3 with the noise vector may be the same as the method by which the condition vector #1 with the noise vector is combined in the pre-training described in FIG. 2. In the following, the description that overlaps with FIG. 2 will be omitted. Then, the model generation unit 134 may input the combined condition vector #3 and noise vector as input information to the sentence generation model M1, which is a conditional generation model.
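As a rough sketch of how a condition vector and a noise vector might be combined into the generator input, the Python below makes two assumptions that the text does not confirm: the condition vector is formed by summing the weighted frame feature vectors element-wise, and "combining" means concatenation. The function names and the toy values are hypothetical.

```python
import random


def make_condition_vector(weighted_features):
    """Assumed pooling: element-wise sum of the weighted frame feature vectors."""
    dim = len(weighted_features[0])
    return [sum(vec[i] for vec in weighted_features) for i in range(dim)]


def build_generator_input(weighted_features, noise_dim, rng):
    """Assumed combination: concatenate the condition vector with Gaussian noise."""
    condition = make_condition_vector(weighted_features)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return condition + noise


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weighted = [[0.1, 0.2], [0.35, 0.7], [0.4, 0.0]]  # toy stand-in for (V31', V32', V33')
generator_input = build_generator_input(weighted, noise_dim=3, rng=rng)
```

Drawing a different noise vector for the same condition vector is what allows a conditional generation model to produce varied outputs for the same weighted features.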

続いて、モデル生成部１３４は、結合された条件ベクトル＃３とノイズベクトルの入力に応じて文章生成モデルＭ１が生成した動画説明文であって、文章生成モデルＭ１から出力情報として出力された動画説明文（以下、「学習用動画説明文＃３」と記載する場合がある）を取得してよい（ステップＳ３４）。モデル生成部１３４は、条件ベクトル＃３に基づいて、条件ベクトル＃３と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を再学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、文章生成モデルＭ１から出力された学習用動画説明文＃３と、画像文データセット＃３に含まれる画像説明文＃３（オリジナルの画像＃３に対応する画像説明文）との誤差が小さくなるように文章生成モデルＭ１を再学習させてよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量＃３´に基づいて、重み付けされた学習用フレーム特徴量＃３´と対応する特徴を有する動画説明文を生成するように文章生成モデルＭ１を再学習させてよい。 Then, the model generation unit 134 may acquire the video description generated by the sentence generation model M1 in response to the input of the combined condition vector #3 and the noise vector, and output from the sentence generation model M1 as output information (hereinafter, may be referred to as "learning video description #3") (step S34). The model generation unit 134 may retrain the sentence generation model M1 so as to generate a video description having a feature corresponding to the condition vector #3 based on the condition vector #3. For example, the model generation unit 134 may retrain the sentence generation model M1 using backpropagation (error backpropagation method) or the like so as to reduce the error between the learning video description #3 output from the sentence generation model M1 and the image description #3 (image description corresponding to the original image #3) included in the image sentence data set #3. In this way, the model generation unit 134 may retrain the sentence generation model M1 so as to generate a video description having a feature corresponding to the weighted learning frame feature #3' based on the weighted learning frame feature #3'.
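The text does not say how the "error" between the generated description and image description #3 is measured. One common choice for text models, assumed here purely for illustration, is the token-level cross-entropy between the model's per-step output distributions and the reference tokens. The function name, the 3-word vocabulary, and the probabilities below are all hypothetical.

```python
import math


def cross_entropy(predicted_probs, reference_ids):
    """Mean negative log-likelihood of the reference tokens under the predicted distributions."""
    if len(predicted_probs) != len(reference_ids):
        raise ValueError("one predicted distribution is required per reference token")
    total = 0.0
    for probs, ref in zip(predicted_probs, reference_ids):
        total -= math.log(probs[ref])
    return total / len(reference_ids)


# Reference description as token ids over a toy 3-word vocabulary.
reference_ids = [0, 2]

# Output distributions before and after a hypothetical retraining step.
probs_before = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
probs_after = [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]

loss_before = cross_entropy(probs_before, reference_ids)
loss_after = cross_entropy(probs_after, reference_ids)
```

Retraining by backpropagation would adjust the model parameters in the direction that lowers such a loss, so `loss_after` is smaller than `loss_before` in this toy case.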

このように、モデル生成部１３４は、学習用フレーム特徴量＃３と学習用重み＃３とに基づいて、学習用動画＃３の内容を説明する文章である学習用動画説明文＃３であって、学習用重み＃３によって重み付けされた学習用フレーム特徴量＃３´と対応する特徴を有する学習用動画説明文＃３を生成するように事前学習済み文章生成モデルＭ１を再学習させることにより、文章生成モデルＭ３を生成する。 In this way, the model generation unit 134 generates a sentence generation model M3 by re-training the pre-trained sentence generation model M1 to generate training video description #3, which is a sentence that explains the content of training video #3, based on training frame feature #3 and training weight #3, and has features corresponding to training frame feature #3' weighted by training weight #3.

図６は、実施形態に係る類似度を算出する方法について説明するための図である。フレーム画像およびオリジナルの画像＃３は、画像の各ピクセルに対応する画素値を持っている。図６に示す例では、簡単のため、画像の画素が３×３の行列で表される場合について説明する。図６の左側は、学習用動画＃３を構成する複数のフレーム画像のうちの一のフレーム画像を示す。図６の右側は、オリジナルの画像＃３を示す。このとき、決定部１３３は、一のフレーム画像とオリジナルの画像＃３の類似度として、一のフレーム画像とオリジナルの画像＃３とのコサイン類似度を算出してよい。例えば、決定部１３３は、下記に示す数式（１）に従って、一のフレーム画像とオリジナルの画像＃３とのコサイン類似度を算出してよい。 FIG. 6 is a diagram for explaining a method of calculating similarity according to an embodiment. The frame image and original image #3 have pixel values corresponding to each pixel of the image. In the example shown in FIG. 6, for simplicity, a case where the pixels of the image are expressed as a 3×3 matrix will be explained. The left side of FIG. 6 shows one of the multiple frame images constituting the learning video #3. The right side of FIG. 6 shows the original image #3. At this time, the determination unit 133 may calculate the cosine similarity between the one frame image and the original image #3 as the similarity between the one frame image and the original image #3. For example, the determination unit 133 may calculate the cosine similarity between the one frame image and the original image #3 according to the following formula (1).
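Formula (1) itself is not reproduced in this text. Assuming it is the standard definition of cosine similarity, it can be written as follows, where the symbols are introduced here for illustration: a_i and b_i are the i-th pixel values of the one frame image and of the original image #3, respectively, and n is the number of pixels (n = 9 for the 3×3 example in FIG. 6).

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```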

例えば、決定部１３３は、上記の数式（１）に従って、一のフレーム画像とオリジナルの画像＃３とのコサイン類似度を「｛(１．２＊０．２)＋(２．４＊７．２)＋(（－２．３）＊０．９)＋(０．８＊（－２．４）)＋(（－１．３）＊（－３．９）)＋(（－１．２）＊（－３．６）)＋(２．０＊６．０)＋(（－３．２）＊９．６)＋(０．３＊１．９)｝／｛１．２^２＋２．４^２＋(（－２．３）)^２＋０．８^２＋(（－１．３）)^２＋(（－１．２）)^２＋２．０^２＋(（－３．２）)^２＋０．３^２｝^１／２｛０．２^２＋７．２^２＋０．９^２＋(（－２．４）)^２＋(（－３．９）)^２＋(（－３．６）)^２＋６．０^２＋９．６^２＋１．９^２｝^１／２＝０．０５」と算出してよい。 For example, in accordance with formula (1) above, the determination unit 133 may calculate the cosine similarity between the one frame image and the original image #3 as follows:
\[
\frac{(1.2)(0.2)+(2.4)(7.2)+(-2.3)(0.9)+(0.8)(-2.4)+(-1.3)(-3.9)+(-1.2)(-3.6)+(2.0)(6.0)+(-3.2)(9.6)+(0.3)(1.9)}
     {\sqrt{1.2^{2}+2.4^{2}+(-2.3)^{2}+0.8^{2}+(-1.3)^{2}+(-1.2)^{2}+2.0^{2}+(-3.2)^{2}+0.3^{2}}\;\sqrt{0.2^{2}+7.2^{2}+0.9^{2}+(-2.4)^{2}+(-3.9)^{2}+(-3.6)^{2}+6.0^{2}+9.6^{2}+1.9^{2}}}
  = 0.05
\]
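The calculation above can be checked with a short Python sketch. The pixel values are taken from the worked example; their arrangement into 3×3 rows is an assumption (the result does not depend on it, since both images are flattened in the same order), and the function name `cosine_similarity` is introduced here for illustration.

```python
import math


def cosine_similarity(image_a, image_b):
    """Cosine similarity between two images given as equally sized 2-D lists of pixel values."""
    flat_a = [x for row in image_a for x in row]
    flat_b = [x for row in image_b for x in row]
    dot = sum(x * y for x, y in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(x * x for x in flat_a))
    norm_b = math.sqrt(sum(y * y for y in flat_b))
    return dot / (norm_a * norm_b)


# One frame image of the learning video #3 (left side of FIG. 6).
frame_image = [[1.2, 2.4, -2.3],
               [0.8, -1.3, -1.2],
               [2.0, -3.2, 0.3]]

# Original image #3 (right side of FIG. 6).
original_image = [[0.2, 7.2, 0.9],
                  [-2.4, -3.9, -3.6],
                  [6.0, 9.6, 1.9]]

similarity = cosine_similarity(frame_image, original_image)
```

With these values the numerator is 4.77 and the result is roughly 0.058, which matches the figure of 0.05 quoted in the text when the third decimal place is dropped.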

〔６．推論方法〕 [6. Inference Method]
図７を用いて、実施形態に係る推論方法について説明する。図７は、実施形態に係る推論方法に関する情報処理の一例を示す図である。推論の段階では、図３で説明した第１の追加学習済みの文章生成モデルＭ２、または、図５で説明した第２の追加学習済みの文章生成モデルＭ３を用いて、処理対象の動画である対象動画の内容を説明する文章である対象動画説明文を生成する。図７では、文章生成部１３６が、第１の追加学習済みの文章生成モデルＭ２（以下、「文章生成モデルＭ２」と略記する場合がある）を用いて対象動画説明文を生成する場合について説明する。なお、文章生成部１３６は、文章生成モデルＭ２の代わりに、第２の追加学習済みの文章生成モデルＭ３（以下、「文章生成モデルＭ３」と略記する場合がある）を用いて対象動画説明文を生成してもよい。 The inference method according to the embodiment will be described with reference to FIG. 7. FIG. 7 is a diagram showing an example of information processing related to the inference method according to the embodiment. In the inference stage, the first additionally trained sentence generation model M2 described in FIG. 3 or the second additionally trained sentence generation model M3 described in FIG. 5 is used to generate a target video description, which is a sentence that describes the contents of a target video that is a video to be processed. FIG. 7 describes a case where the sentence generation unit 136 generates a target video description using the first additionally trained sentence generation model M2 (hereinafter, sometimes abbreviated as "sentence generation model M2"). Note that the sentence generation unit 136 may generate a target video description using the second additionally trained sentence generation model M3 (hereinafter, sometimes abbreviated as "sentence generation model M3") instead of the sentence generation model M2.

図７に示すように、推論の段階では、（１）対象動画を構成する各フレーム画像から画像特徴量を抽出する。（２）利用者から注目するフレーム画像（以下、「指定フレーム画像」と記載する場合がある）の指定を受け付け、指定フレーム画像に対応する重みが最大となるように複数のフレーム画像それぞれに対応する重みを決定する。（３）各フレーム画像から抽出された画像特徴量を各フレーム画像に対応する重みによって重み付けする。重み付けされた画像特徴量を条件として、条件付き生成モデルである文章生成モデルＭ２に入力する。（４）文章生成モデルＭ２によって対象動画説明文を生成する。 As shown in FIG. 7, in the inference stage, (1) image features are extracted from each frame image constituting the target video. (2) A frame image of interest (hereinafter sometimes referred to as a "designated frame image") is specified by the user, and weights corresponding to each of the multiple frame images are determined so that the weight corresponding to the designated frame image is maximized. (3) The image features extracted from each frame image are weighted by the weight corresponding to each frame image. The weighted image features are input as conditions to sentence generation model M2, which is a conditional generation model. (4) A description of the target video is generated by sentence generation model M2.
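Steps (1) to (4) above can be read as a small pipeline. The Python sketch below shows only that orchestration under stated assumptions: the feature extractor, the weight rule, and the generation model are passed in as callables, and the toy stand-ins used here (`toy_extract`, `toy_weights`, `toy_generate`) are hypothetical placeholders, not the components of the embodiment. Flattening the weighted features and appending noise is likewise an assumed way of forming the model input.

```python
import random


def describe_video(frames, designated_index, extract, decide_weights, generate,
                   noise_dim=4, seed=0):
    """Run steps (1) to (4) with caller-supplied components."""
    # (1) Extract an image feature vector from every frame.
    features = [extract(frame) for frame in frames]
    # (2) Decide one weight per frame; the designated frame gets the largest.
    weights = decide_weights(len(frames), designated_index)
    # (3) Scale each feature vector by its frame's weight.
    weighted = [[w * x for x in vec] for w, vec in zip(weights, features)]
    # Assumed conditioning: flatten the weighted features and append noise.
    condition = [x for vec in weighted for x in vec]
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    # (4) Let the conditional generation model produce the description.
    return generate(condition + noise)


# Toy stand-ins, for illustration only.
def toy_extract(frame):
    return [float(sum(frame)), float(max(frame))]


def toy_weights(frame_count, designated_index):
    return [1.0 if i == designated_index else 0.8 for i in range(frame_count)]


def toy_generate(model_input):
    return "description conditioned on %d values" % len(model_input)


frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
text = describe_video(frames, 1, toy_extract, toy_weights, toy_generate)
```

Changing `designated_index` changes only the weights in step (2), which is how a different frame of interest leads to a different condition for the same video.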

具体的には、抽出部１３２は、処理対象の動画である対象動画＃４を取得してよい。例えば、抽出部１３２は、通信部１１０を介して、利用者によって使用される情報処理装置から対象動画＃４を取得してよい。続いて、抽出部１３２は、対象動画＃４を構成する複数のフレーム画像それぞれから、対象動画＃４を構成する複数のフレーム画像それぞれの画像特徴量を抽出してよい（ステップＳ４１）。なお、図３および図５と同様に、抽出部１３２が、各フレーム画像から画像特徴量を抽出する方法は、図２で説明した事前学習において各フレーム画像から画像特徴量を抽出する方法と同様であってよい。以下では、図２と重複する説明は省略する。図７では、簡単のため、対象動画＃４を構成するフレーム画像が３つである場合について説明するが、対象動画＃４を構成するフレーム画像の数は４つ以上であってよい。図７では、対象動画＃４の開始時刻に対応する１枚目のフレーム画像と、対象動画＃４の開始時刻と終了時刻の間の時刻に対応する２枚目のフレーム画像と、対象動画＃４の終了時刻に対応する３枚目のフレーム画像とが時系列順に並んでいる様子を示す。例えば、抽出部１３２は、１枚目のフレーム画像から特徴量ベクトルＶ４１を抽出する。また、抽出部１３２は、２枚目のフレーム画像から特徴量ベクトルＶ４２を抽出する。また、抽出部１３２は、３枚目のフレーム画像から特徴量ベクトルＶ４３を抽出する。続いて、抽出部１３２は、対象動画＃４を構成する複数のフレーム画像それぞれの画像特徴量である対象フレーム特徴量＃４として、特徴量ベクトルＶ４１～Ｖ４３の組のベクトル（Ｖ４１、Ｖ４２、Ｖ４３）を取得してよい。 Specifically, the extraction unit 132 may acquire the target video #4, which is a video to be processed. For example, the extraction unit 132 may acquire the target video #4 from the information processing device used by the user via the communication unit 110. Next, the extraction unit 132 may extract image features of each of the multiple frame images constituting the target video #4 from each of the multiple frame images constituting the target video #4 (step S41). As in FIG. 3 and FIG. 5, the method in which the extraction unit 132 extracts image features from each frame image may be the same as the method in which image features are extracted from each frame image in the pre-learning described in FIG. 2. In the following, explanations that overlap with FIG. 2 will be omitted. In FIG. 7, for simplicity, a case in which the frame images constituting the target video #4 are three will be described, but the number of frame images constituting the target video #4 may be four or more. FIG. 7 shows a first frame image corresponding to the start time of the target video #4, a second frame image corresponding to a time between the start time and end time of the target video #4, and a third frame image corresponding to the end time of the target video #4, all arranged in chronological order. For example, the extraction unit 132 extracts a feature vector V41 from the first frame image. The extraction unit 132 also extracts a feature vector V42 from the second frame image. The extraction unit 132 also extracts a feature vector V43 from the third frame image. Next, the extraction unit 132 may acquire a vector (V41, V42, V43) of the set of feature vectors V41 to V43 as target frame feature #4, which is the image feature of each of the multiple frame images that make up the target video #4.

このように、抽出部１３２は、処理対象の動画である対象動画＃４を構成する複数のフレーム画像それぞれの特徴量である対象フレーム特徴量＃４を抽出する。 In this way, the extraction unit 132 extracts target frame features #4, which are features of each of the multiple frame images that make up target video #4, which is the video to be processed.

また、決定部１３３は、対象動画＃４を構成する複数のフレーム画像の中から利用者によって指定されたフレーム画像（以下、「指定フレーム画像＃４」と記載する場合がある）および対象動画＃４を取得してよい。例えば、決定部１３３は、通信部１１０を介して、利用者によって使用される情報処理装置から指定フレーム画像＃４および対象動画＃４を取得してよい。続いて、決定部１３３は、対象動画＃４を構成する複数のフレーム画像それぞれに対応する重みを決定してよい（ステップＳ４１）。なお、図３および図５と同様に、抽出部１３２が画像特徴量を抽出する処理と、決定部１３３が重みを決定する処理は、いずれの処理が先に行われてもよく、抽出部１３２および決定部１３３によってそれぞれ同時に行われてもよい。 The determination unit 133 may also acquire the target video #4 and a frame image designated by the user from among the multiple frame images constituting the target video #4 (hereinafter, may be referred to as "designated frame image #4"). For example, the determination unit 133 may acquire the designated frame image #4 and the target video #4 from an information processing device used by the user via the communication unit 110. Next, the determination unit 133 may determine weights corresponding to each of the multiple frame images constituting the target video #4 (step S41). Note that, as in FIGS. 3 and 5, either the process of extracting image features by the extraction unit 132 or the process of determining weights by the determination unit 133 may be performed first, or the two processes may be performed simultaneously by the extraction unit 132 and the determination unit 133, respectively.

例えば、決定部１３３は、対象動画＃４を構成する複数のフレーム画像のうち、指定フレーム画像＃４に対応するフレーム画像の重みを指定フレーム画像＃４に対応するフレーム画像以外の他のフレーム画像に対応する重みよりも大きくするように複数のフレーム画像それぞれに対応する重みを決定してよい。例えば、決定部１３３は、ガウス関数や円の一部のような凸状の関数であって、極大値の周囲が微分可能な関数の値に基づいて、複数のフレーム画像それぞれに対応する重みを決定してよい。図７では、決定部１３３は、図３と同様に、横軸を動画の再生時刻、縦軸を重みとするガウス関数の値を用いて複数のフレーム画像それぞれに対応する重みを決定してよい。例えば、決定部１３３は、ガウス関数の平均値に対応する時刻を指定フレーム画像＃４に対応する２枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値に対応する時刻の値である「１．０」を指定フレーム画像＃４に対応する２枚目のフレーム画像の重み＃４２としてよい。また、決定部１３３は、ガウス関数の平均値よりも小さい値に対応する時刻を１枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値よりも小さい値に対応する時刻の値である「０．８」を１枚目のフレーム画像の重み＃４１としてよい。また、決定部１３３は、ガウス関数の平均値よりも大きい値に対応する時刻を３枚目のフレーム画像の再生時刻としてよい。また、決定部１３３は、ガウス関数の平均値よりも大きい値に対応する時刻の値である「０．８」を３枚目のフレーム画像の重み＃４３としてよい。例えば、決定部１３３は、対象動画＃４を構成する複数のフレーム画像それぞれに対応する重みである対象重み＃４として、１枚目のフレーム画像の重み＃４１～３枚目のフレーム画像の重み＃４３の組のベクトル（重み＃４１、重み＃４２、重み＃４３）＝（０．８、１．０、０．８）を取得してよい。 For example, the determination unit 133 may determine the weights corresponding to each of the multiple frame images such that, among the multiple frame images constituting the target video #4, the weight of the frame image corresponding to the designated frame image #4 is greater than the weights of the other frame images. For example, the determination unit 133 may determine the weights corresponding to each of the multiple frame images based on the values of a convex-shaped (peaked) function, such as a Gaussian function or part of a circle, that is differentiable around its maximum. In FIG. 7, as in FIG. 3, the determination unit 133 may determine the weights corresponding to each of the multiple frame images using the values of a Gaussian function whose horizontal axis is the playback time of the video and whose vertical axis is the weight. For example, the determination unit 133 may set the time corresponding to the mean of the Gaussian function as the playback time of the second frame image, which corresponds to the designated frame image #4. The determination unit 133 may then set "1.0", which is the value of the Gaussian function at the time corresponding to its mean, as the weight #42 of the second frame image corresponding to the designated frame image #4. The determination unit 133 may also set a time corresponding to a value smaller than the mean of the Gaussian function as the playback time of the first frame image, and may set "0.8", which is the value of the Gaussian function at that time, as the weight #41 of the first frame image. Likewise, the determination unit 133 may set a time corresponding to a value larger than the mean of the Gaussian function as the playback time of the third frame image, and may set "0.8", which is the value of the Gaussian function at that time, as the weight #43 of the third frame image. For example, the determination unit 133 may obtain the vector (weight #41, weight #42, weight #43) = (0.8, 1.0, 0.8), consisting of weight #41 of the first frame image through weight #43 of the third frame image, as target weight #4, which is the weight corresponding to each of the multiple frame images constituting the target video #4.
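A minimal Python sketch of the Gaussian weighting follows. It assumes an unnormalized Gaussian whose peak value is 1.0 at the playback time of the designated frame, which is consistent with weight #42 being "1.0". The playback times (0, 1, and 2 seconds) and the spread `sigma` are hypothetical; `sigma` is chosen here so that frames one second away from the designated frame receive 0.8, matching the weights in FIG. 7.

```python
import math


def gaussian_weights(playback_times, designated_time, sigma):
    """Weight each frame by a Gaussian of its playback time, peaking at 1.0 on the designated frame."""
    return [math.exp(-((t - designated_time) ** 2) / (2.0 * sigma ** 2))
            for t in playback_times]


playback_times = [0.0, 1.0, 2.0]  # hypothetical playback times in seconds
designated_time = 1.0             # the second frame is the designated frame image #4

# Solve exp(-1 / (2 * sigma^2)) = 0.8 for sigma (frames 1 s from the peak get 0.8).
sigma = 1.0 / math.sqrt(2.0 * math.log(1.0 / 0.8))

target_weights = gaussian_weights(playback_times, designated_time, sigma)
```

A smaller `sigma` would concentrate the description on the designated frame, while a larger one would spread the emphasis more evenly over the video.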

このように、決定部１３３は、対象動画＃４を構成する複数のフレーム画像のうち、利用者によって指定された指定フレーム画像＃４に対応する重み（図７では、２枚目のフレーム画像の重み＃４２である「１．０」）を指定フレーム画像以外の他のフレーム画像に対応する重み（図７では、１枚目のフレーム画像の重み＃４１である「０．８」および３枚目のフレーム画像の重み＃４３である「０．８」）よりも大きくするように複数のフレーム画像それぞれに対応する重みである対象重み＃４を決定する。また、このように、決定部１３３は、対象動画＃４を構成する複数のフレーム画像それぞれに対応する重みである対象重み＃４を決定する。 In this way, the determination unit 133 determines target weights #4, which are weights corresponding to each of the multiple frame images that make up the target video #4, such that the weight corresponding to the designated frame image #4 designated by the user (in FIG. 7, weight #42 of the second frame image, "1.0") is greater than the weights corresponding to the other frame images other than the designated frame image (in FIG. 7, weight #41 of the first frame image, "0.8", and weight #43 of the third frame image, "0.8"). Also, in this way, the determination unit 133 determines target weights #4, which are weights corresponding to each of the multiple frame images that make up the target video #4.

また、文章生成部１３６は、決定部１３３によって決定された対象重み＃４によって、抽出部１３２によって抽出された対象フレーム特徴量＃４を重み付けしてよい。文章生成部１３６は、決定部１３３によって決定された対象重み＃４によって重み付けされた対象フレーム特徴量＃４である、重み付けされた対象フレーム特徴量＃４´を生成してよい。図７では、文章生成部１３６は、対象動画＃４を構成する１枚目のフレーム画像に対応する重み＃４１である「０．８」を特徴量ベクトルＶ４１の各要素に乗じることにより、重み＃４１によって重み付けされた特徴量ベクトルＶ４１´を生成してよい。また、文章生成部１３６は、対象動画＃４を構成する２枚目のフレーム画像に対応する重み＃４２である「１．０」を特徴量ベクトルＶ４２の各要素に乗じることにより、重み＃４２によって重み付けされた特徴量ベクトルＶ４２´を生成してよい。また、文章生成部１３６は、対象動画＃４を構成する３枚目のフレーム画像に対応する重み＃４３である「０．８」を特徴量ベクトルＶ４３の各要素に乗じることにより、重み＃４３によって重み付けされた特徴量ベクトルＶ４３´を生成してよい。このようにして、文章生成部１３６は、重み付けされた対象フレーム特徴量＃４´を生成してよい。図７では、文章生成部１３６は、重み付けされた対象フレーム特徴量＃４´として、（重み＃４１、重み＃４２、重み＃４３）＊（Ｖ４１、Ｖ４２、Ｖ４３）＝（重み＃４１＊Ｖ４１、重み＃４２＊Ｖ４２、重み＃４３＊Ｖ４３）＝（Ｖ４１´、Ｖ４２´、Ｖ４３´）を生成してよい。 The sentence generation unit 136 may weight the target frame feature #4 extracted by the extraction unit 132 by the target weight #4 determined by the determination unit 133. The sentence generation unit 136 may generate a weighted target frame feature #4', which is the target frame feature #4 weighted by the target weight #4 determined by the determination unit 133. In FIG. 7, the sentence generation unit 136 may generate a feature vector V41' weighted by the weight #41 by multiplying each element of the feature vector V41 by "0.8", which is the weight #41 corresponding to the first frame image constituting the target video #4. The sentence generation unit 136 may generate a feature vector V42' weighted by the weight #42 by multiplying each element of the feature vector V42 by "1.0", which is the weight #42 corresponding to the second frame image constituting the target video #4. Furthermore, the sentence generation unit 136 may generate a feature vector V43' weighted by weight #43 by multiplying each element of the feature vector V43 by "0.8", which is weight #43 corresponding to the third frame image constituting the target video #4. In this manner, the sentence generation unit 136 may generate weighted target frame feature #4'. In FIG. 7, the sentence generation unit 136 may generate (weight #41, weight #42, weight #43) * (V41, V42, V43) = (weight #41 * V41, weight #42 * V42, weight #43 * V43) = (V41', V42', V43') as the weighted target frame feature #4'.

また、取得部１３５は、文章生成モデルＭ２を取得してよい。例えば、取得部１３５は、文章生成モデルＭ２に関する情報を格納している記憶部１２０から文章生成モデルＭ２を取得してよい。 The acquisition unit 135 may also acquire the sentence generation model M2. For example, the acquisition unit 135 may acquire the sentence generation model M2 from the storage unit 120, which stores information about the sentence generation model M2.

このように、取得部１３５は、撮像画像（図７では、図３で説明した画像＃２）と撮像画像の内容を説明する文章である画像説明文（図７では、図３で説明した画像説明文＃２）との組を含む画像文データセット＃２に基づいて生成された学習用動画＃２を構成する複数のフレーム画像それぞれの特徴量である学習用フレーム特徴量＃２と、学習用動画＃２を構成する複数のフレーム画像それぞれに対応する重みである学習用重み＃２とに基づいて、学習用動画＃２の内容を説明する文章である学習用動画説明文＃２であって、学習用重み＃２によって重み付けされた学習用フレーム特徴量＃２と対応する特徴を有する学習用動画説明文＃２を生成するように学習された機械学習モデルである文章生成モデルＭ２を取得する。 In this way, the acquisition unit 135 acquires a sentence generation model M2, which is a machine learning model trained to generate training video description #2, which is a sentence explaining the content of the training video #2, having features corresponding to the training frame feature #2 weighted by the training weight #2, based on the training frame feature #2, which is a feature of each of the multiple frame images constituting the training video #2 generated based on the image and sentence dataset #2 including a pair of a captured image (in FIG. 7, image #2 described in FIG. 3) and an image description (in FIG. 7, image description #2 described in FIG. 3), which is a sentence explaining the content of the captured image, and based on the training weight #2, which is a weight corresponding to each of the multiple frame images constituting the training video #2.

また、文章生成部１３６は、取得部１３５によって取得された文章生成モデルＭ２に重み付けされた対象フレーム特徴量＃４´を入力してよい（ステップＳ４２）。例えば、文章生成部１３６は、重み付けされた対象フレーム特徴量＃４´に基づく条件ベクトル＃４を生成してよい。続いて、文章生成部１３６は、生成した条件ベクトル＃４とノイズベクトルを結合してよい。なお、文章生成部１３６が、条件ベクトル＃４とノイズベクトルを結合する方法は、図２で説明した事前学習において条件ベクトル＃１とノイズベクトルを結合する方法と同様であってよい。以下では、図２と重複する説明は省略する。続いて、文章生成部１３６は、結合された条件ベクトル＃４とノイズベクトルを入力情報として条件付き生成モデルである文章生成モデルＭ２に入力してよい。 The sentence generation unit 136 may input the weighted target frame feature #4' to the sentence generation model M2 acquired by the acquisition unit 135 (step S42). For example, the sentence generation unit 136 may generate a condition vector #4 based on the weighted target frame feature #4'. Next, the sentence generation unit 136 may combine the generated condition vector #4 with a noise vector. Note that the method by which the sentence generation unit 136 combines the condition vector #4 with the noise vector may be the same as the method by which the condition vector #1 with the noise vector is combined in the pre-learning described in FIG. 2. In the following, descriptions that overlap with FIG. 2 will be omitted. Next, the sentence generation unit 136 may input the combined condition vector #4 and noise vector as input information to the sentence generation model M2, which is a conditional generation model.

続いて、文章生成部１３６は、結合された条件ベクトル＃４とノイズベクトルの入力に応じて文章生成モデルＭ２が生成した動画説明文であって、文章生成モデルＭ２から出力情報として出力された動画説明文（以下、「対象動画説明文＃４」と記載する場合がある）を取得してよい（ステップＳ４３）。 Next, the sentence generation unit 136 may acquire the video description generated by the sentence generation model M2 in response to the input of the combined condition vector #4 and the noise vector, and output as output information from the sentence generation model M2 (hereinafter, this may be referred to as "target video description #4") (step S43).

このように、文章生成部１３６は、対象フレーム特徴量＃４と対象重み＃４とに基づいて、対象重みによって重み付けされた対象フレーム特徴量＃４´を文章生成モデルＭ２に入力して、対象動画＃４の内容を説明する文章である対象動画説明文＃４を生成する。 In this way, based on the target frame feature #4 and the target weight #4, the sentence generation unit 136 inputs the target frame feature #4' weighted by the target weight to the sentence generation model M2 to generate the target video description #4, which is a sentence that explains the content of the target video #4.

〔７．文章生成モデルの例〕 [7. Example of a Sentence Generation Model]
図８は、実施形態に係る文章生成モデルの一例である条件付き敵対的生成ネットワーク（ＣＧＡＮ）を示す図である。図８に示すように、文章生成モデルは、生成器ネットワークＧ１および識別器ネットワークＤ１を含む条件付き敵対的生成ネットワークであってよい。図８では、図３で説明した第１の追加学習または図５で説明した第２の追加学習について説明する。 FIG. 8 is a diagram showing a conditional generative adversarial network (CGAN) as an example of a sentence generation model according to an embodiment. As shown in FIG. 8, the sentence generation model may be a conditional generative adversarial network including a generator network G1 and a discriminator network D1. FIG. 8 describes the first additional learning described in FIG. 3 or the second additional learning described in FIG. 5.

図８に示す生成器ネットワークＧ１は、時系列データであるテキストの生成に向いている機械学習モデルであってよい。例えば、生成器ネットワークＧ１は、再帰型ニューラルネットワーク（ＲＮＮ：Recurrent Neural Network）、ＧＲＵ（Gated Recurrent Unit）、ＬＳＴＭ（Long Short Term Memory）、Ｔｒａｎｓｆｏｒｍｅｒ（Ashish Vaswani et al., 2017）、ＴｒａｎｓｆｏｒｍｅｒをベースとしたＢＥＲＴ（Bidirectional Encoder Representations from Transformers）、ＧＰＴ－３（Generative Pre-trained Transformer 3）またはＴ５（Text-to-Text Transfer Transformer）等であってよい。 The generator network G1 shown in FIG. 8 may be a machine learning model suited to generating text, which is time-series data. For example, the generator network G1 may be a recurrent neural network (RNN), a GRU (Gated Recurrent Unit), an LSTM (Long Short-Term Memory), a Transformer (Ashish Vaswani et al., 2017), or a Transformer-based model such as BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pre-trained Transformer 3), or T5 (Text-to-Text Transfer Transformer).

まず、モデル生成部１３４は、図２で説明した事前学習により、生成器ネットワークＧ１および識別器ネットワークＤ１を含む条件付き敵対的生成ネットワークを学習させてよい。例えば、モデル生成部１３４は、事前学習用フレーム特徴量＃５０に基づく条件ベクトルである事前学習用敵対的条件ベクトルＶ５０を生成してよい。続いて、モデル生成部１３４は、事前学習用敵対的条件ベクトルＶ５０およびノイズベクトルを入力情報として生成器ネットワークＧ１に入力した場合に、事前学習用敵対的条件ベクトルＶ５０と対応する特徴を有する動画説明文を生成するように生成器ネットワークＧ１を学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、生成器ネットワークＧ１から出力された動画説明文と、動画文データセット＃５０に含まれる事前学習用動画説明文＃５０との誤差が小さくなるように生成器ネットワークＧ１を学習させてよい。このように、モデル生成部１３４は、事前学習用フレーム特徴量＃５０に基づいて、事前学習用フレーム特徴量＃５０と対応する特徴を有する動画説明文を生成するように事前に学習された事前学習済み生成器ネットワークＧ１を生成してよい。 First, the model generation unit 134 may train a conditional generative adversarial network including the generator network G1 and the discriminator network D1 by the pre-learning described in FIG. 2. For example, the model generation unit 134 may generate a pre-learning adversarial condition vector V50, which is a condition vector based on the pre-learning frame feature #50. Next, the model generation unit 134 may train the generator network G1 to generate a video description having features corresponding to the pre-learning adversarial condition vector V50 when the pre-learning adversarial condition vector V50 and a noise vector are input to the generator network G1 as input information. For example, the model generation unit 134 may train the generator network G1 using backpropagation (error backpropagation method) or the like so that the error between the video description output from the generator network G1 and the pre-learning video description #50 included in the video description dataset #50 is reduced. In this way, the model generation unit 134 may generate a pre-trained generator network G1 that has been trained in advance to generate video description text having features that correspond to the pre-training frame feature #50, based on the pre-training frame feature #50.

また、モデル生成部１３４は、事前学習用敵対的条件ベクトルＶ５０および事前学習用動画説明文＃５０を入力情報として識別器ネットワークＤ１に入力した場合に、事前学習用動画説明文＃５０が、真の動画説明文であって、かつ、事前学習用敵対的条件ベクトルＶ５０と対応する動画説明文であることを示す情報（例えば、数字の「１」など）を出力情報として出力するように識別器ネットワークＤ１を学習させてよい。なお、モデル生成部１３４は、例えば、線形変換処理を用いて、事前学習用敵対的条件ベクトルＶ５０と事前学習用動画説明文＃５０のサイズが同じになるように調整してよい。続いて、モデル生成部１３４は、事前学習用敵対的条件ベクトルＶ５０と事前学習用動画説明文＃５０を結合し、結合された事前学習用敵対的条件ベクトルＶ５０と事前学習用動画説明文＃５０を入力情報として識別器ネットワークＤ１に入力してよい。また、モデル生成部１３４は、事前学習用敵対的条件ベクトルＶ５０および生成器ネットワークＧ１が生成した偽の動画説明文を入力情報として識別器ネットワークＤ１に入力した場合に、生成器ネットワークＧ１が生成した偽の動画説明文が、真の動画説明文であって、かつ、事前学習用敵対的条件ベクトルＶ５０と対応する動画説明文であることを示す情報以外の情報（例えば、数字の「０」など）を出力情報として出力するように識別器ネットワークＤ１を学習させてよい。このように、モデル生成部１３４は、事前学習用フレーム特徴量＃５０に基づいて、事前に学習された事前学習済み識別器ネットワークＤ１を生成してよい。 In addition, the model generation unit 134 may train the discriminator network D1 so that, when the pre-learning adversarial condition vector V50 and the pre-learning video description #50 are input to the discriminator network D1 as input information, the discriminator network D1 outputs, as output information, information (e.g., the number "1") indicating that the pre-learning video description #50 is a true video description and corresponds to the pre-learning adversarial condition vector V50. Note that the model generation unit 134 may, for example, use a linear transformation process to adjust the pre-learning adversarial condition vector V50 and the pre-learning video description #50 so that they have the same size. Next, the model generation unit 134 may combine the pre-learning adversarial condition vector V50 and the pre-learning video description #50, and input the combined pre-learning adversarial condition vector V50 and pre-learning video description #50 to the discriminator network D1 as input information. Furthermore, the model generation unit 134 may train the discriminator network D1 so that, when the pre-learning adversarial condition vector V50 and a fake video description generated by the generator network G1 are input to the discriminator network D1 as input information, the discriminator network D1 outputs, as output information, information (e.g., the number "0") other than information indicating that the fake video description generated by the generator network G1 is a true video description and corresponds to the pre-learning adversarial condition vector V50. In this way, the model generation unit 134 may generate a pre-trained discriminator network D1 that has been trained in advance based on the pre-learning frame feature #50.
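The text specifies the discriminator's targets ("1" for a true, matching description and "0" otherwise) but not the loss used to reach them. A binary cross-entropy loss is a common choice for such targets and is assumed here only for illustration; the function names and the example scores are hypothetical, and each score stands for the discriminator's output in (0, 1) for one (condition vector, description) pair.

```python
import math


def binary_cross_entropy(score, target):
    """Binary cross-entropy for a single discriminator score in (0, 1)."""
    eps = 1e-12
    score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(score) + (1.0 - target) * math.log(1.0 - score))


def discriminator_loss(score_real, score_fake):
    """Push real pairs toward "1" and generated (fake) pairs toward "0"."""
    return binary_cross_entropy(score_real, 1.0) + binary_cross_entropy(score_fake, 0.0)


# A discriminator that cannot tell real from fake scores both pairs at 0.5.
loss_undecided = discriminator_loss(score_real=0.5, score_fake=0.5)

# After training, real pairs score high and fake pairs score low.
loss_trained = discriminator_loss(score_real=0.9, score_fake=0.1)
```

Minimizing this quantity with backpropagation would move the discriminator toward the outputs "1" and "0" described above.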

また、モデル生成部１３４は、図３で説明した第１の追加学習または図５で説明した第２の追加学習により、事前学習済み生成器ネットワークＧ１（以下、「生成器ネットワークＧ１」と略記する場合がある）および事前学習済み識別器ネットワークＤ１（以下、「識別器ネットワークＤ１」と略記する場合がある）を再学習させてよい。 The model generation unit 134 may also retrain the pre-trained generator network G1 (hereinafter sometimes abbreviated as "generator network G1") and the pre-trained discriminator network D1 (hereinafter sometimes abbreviated as "discriminator network D1") by the first additional learning described in FIG. 3 or the second additional learning described in FIG. 5.

図８では、モデル生成部１３４は、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである第１の学習用敵対的条件ベクトルＶ５１（以下、「第１の敵対的条件ベクトルＶ５１」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、第１の敵対的条件ベクトルＶ５１およびノイズベクトルＮ１を入力情報として生成器ネットワークＧ１に入力した場合に、第１の敵対的条件ベクトルＶ５１と対応する特徴を有する動画説明文（図８では、学習用動画説明文＃５１）を生成するよう生成器ネットワークＧ１を再学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、生成器ネットワークＧ１から出力された学習用動画説明文＃５１と、画像文データセット＃５１に含まれる画像説明文＃５１（オリジナルの画像＃５１に対応する画像説明文）との誤差が小さくなるように生成器ネットワークＧ１を再学習させてよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量に基づいて、重み付けされた学習用フレーム特徴量と対応する特徴を有する動画説明文を生成するように生成器ネットワークＧ１を再学習させてよい。このようにして、モデル生成部１３４は、第１の追加学習済みまたは第２の追加学習済みの生成器ネットワークＧ１を生成してよい。 In FIG. 8, the model generation unit 134 may generate a first training adversarial condition vector V51 (hereinafter, may be abbreviated as "first adversarial condition vector V51"), which is a condition vector based on the training frame feature weighted by the training weight. Next, the model generation unit 134 may retrain the generator network G1 so as to generate a video description (training video description #51 in FIG. 8) having features corresponding to the first adversarial condition vector V51 when the first adversarial condition vector V51 and the noise vector N1 are input to the generator network G1 as input information. For example, the model generation unit 134 may retrain the generator network G1 using backpropagation (error backpropagation method) or the like so as to reduce the error between the training video description #51 output from the generator network G1 and the image description #51 (image description corresponding to the original image #51) included in the image description dataset #51. In this manner, the model generation unit 134 may retrain the generator network G1 so as to generate video description sentences having features corresponding to the weighted training frame features based on the weighted training frame features. In this manner, the model generation unit 134 may generate a first additionally trained or second additionally trained generator network G1.

また、モデル生成部１３４は、第１の敵対的条件ベクトルＶ５１および生成器ネットワークＧ１が生成した偽の動画説明文である学習用動画説明文＃５１を入力情報として識別器ネットワークＤ１に入力した場合に、学習用動画説明文＃５１が、真の動画説明文であって、かつ、第１の敵対的条件ベクトルＶ５１と対応する動画説明文であることを示す情報以外の情報（例えば、数字の「０」など）を出力情報として出力するように識別器ネットワークＤ１を再学習させてよい。また、モデル生成部１３４は、事前学習用動画＃５２と事前学習用動画説明文＃５２との組を含む動画文データセット＃５２に含まれる事前学習用動画＃５２を構成する複数のフレーム画像それぞれの特徴量である事前学習用フレーム特徴量＃５２に基づく条件ベクトルである第２の学習用敵対的条件ベクトルＶ５２（以下、「第２の敵対的条件ベクトルＶ５２」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、第２の敵対的条件ベクトルＶ５２および事前学習用動画説明文＃５２を入力情報として識別器ネットワークＤ１に入力した場合に、事前学習用動画説明文＃５２が、真の動画説明文であって、かつ、第２の敵対的条件ベクトルＶ５２と対応する動画説明文であることを示す情報（例えば、数字の「１」など）を出力情報として出力するように識別器ネットワークＤ１を再学習させてよい。このようにして、モデル生成部１３４は、第１の追加学習済みまたは第２の追加学習済みの識別器ネットワークＤ１を生成してよい。 In addition, the model generation unit 134 may retrain the discriminator network D1 so that, when the first adversarial condition vector V51 and the training video description #51, which is a fake video description generated by the generator network G1, are input to the discriminator network D1 as input information, the discriminator network D1 outputs, as output information, information (e.g., the number "0") other than information indicating that the training video description #51 is a true video description that corresponds to the first adversarial condition vector V51. The model generation unit 134 may also generate a second training adversarial condition vector V52 (hereinafter sometimes abbreviated as "second adversarial condition vector V52"), which is a condition vector based on the pre-training frame features #52, that is, the features of each of the multiple frame images constituting the pre-training video #52 included in the video-text dataset #52, which includes a pair of the pre-training video #52 and the pre-training video description #52. Next, the model generation unit 134 may retrain the discriminator network D1 so that, when the second adversarial condition vector V52 and the pre-training video description #52 are input to the discriminator network D1 as input information, the discriminator network D1 outputs, as output information, information (e.g., the number "1") indicating that the pre-training video description #52 is a true video description that corresponds to the second adversarial condition vector V52. In this manner, the model generation unit 134 may generate the first additionally trained or second additionally trained discriminator network D1.
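
The targets described above for the discriminator network D1 ("1" for a true description paired with V52, "0" for a generated description paired with V51) can be illustrated with a minimal sketch. The binary cross-entropy loss, the function names, and the concatenation of the condition vector with the noise vector are assumptions made for illustration; the source only states that D1 is retrained to output the respective information and that both vectors are input to G1.

```python
import math
from typing import List, Sequence


def bce(prediction: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for a single discriminator output in (0, 1)."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Loss for D1 (assumed form).

    d_real: D1 output for (V52, true video description #52), target 1.
    d_fake: D1 output for (V51, generated video description #51), target 0.
    """
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_input(condition: Sequence[float], noise: Sequence[float]) -> List[float]:
    """One possible way to feed both vectors to G1: simple concatenation."""
    return list(condition) + list(noise)
```

Under this assumed loss, a discriminator that outputs values near 1 for the true pair and near 0 for the generated pair incurs a smaller loss than one that outputs 0.5 for both.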

また、文章生成部１３６は、対象重みによって重み付けされた対象フレーム特徴量に基づく条件ベクトルである敵対的条件ベクトルＶ５３を生成してよい。続いて、モデル生成部１３４は、敵対的条件ベクトルＶ５３およびノイズベクトルＮ２を入力情報として生成器ネットワークＧ１に入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。このようにして、文章生成部１３６は、対象フレーム特徴量と対象重みとに基づいて、対象重みによって重み付けされた対象フレーム特徴量を、第１の追加学習済みまたは第２の追加学習済みの生成器ネットワークＧ１に入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。 The sentence generation unit 136 may also generate an adversarial condition vector V53, which is a condition vector based on the target frame features weighted by the target weight. Next, the model generation unit 134 may input the adversarial condition vector V53 and the noise vector N2 as input information to the generator network G1 to generate a target video description, which is a sentence that describes the contents of the target video. In this way, the sentence generation unit 136 may input the target frame features weighted by the target weight to the first additionally trained or second additionally trained generator network G1 based on the target frame features and the target weight to generate a target video description, which is a sentence that describes the contents of the target video.

〔８．変形例〕 8. Modifications
上述した実施形態に係る処理は、上記実施形態以外にも種々の異なる形態にて実施されてよい。 The process according to the embodiment described above may be implemented in various different forms other than the above embodiment.

〔８－１．第１の変形例〕 8-1. First Modified Example
上述した実施形態では、文章生成モデルが、生成器ネットワークおよび識別器ネットワークを含む条件付き敵対的生成ネットワークである場合について説明した。第１の変形例では、文章生成モデルが、エンコーダおよびデコーダを含む条件付き変分オートエンコーダである場合について説明する。 In the above-described embodiment, a case has been described in which the sentence generation model is a conditional generative adversarial network including a generator network and a discriminator network. In the first modified example, a case in which the sentence generation model is a conditional variational autoencoder including an encoder and a decoder will be described.

図９は、第１の変形例に係る文章生成モデルの一例である条件付き変分オートエンコーダ（ＣＶＡＥ）を示す図である。図９に示すように、文章生成モデルは、エンコーダＥＮ１およびデコーダＤＥ１を含む条件付き変分オートエンコーダであってよい。図９では、図２で説明した事前学習について説明する。 Figure 9 is a diagram showing a conditional variational autoencoder (CVAE), which is an example of a sentence generation model related to the first modified example. As shown in Figure 9, the sentence generation model may be a conditional variational autoencoder including an encoder EN1 and a decoder DE1. Figure 9 explains the pre-learning described in Figure 2.

まず、モデル生成部１３４は、図２で説明した事前学習により、エンコーダＥＮ１およびデコーダＤＥ１を含む条件付き条件付き変分オートエンコーダを学習させてよい。例えば、モデル生成部１３４は、事前学習用フレーム特徴量＃６０に基づく条件ベクトルである事前学習用変分条件ベクトルＶ６０（以下、「変分条件ベクトルＶ６０」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、変分条件ベクトルＶ６０および事前学習用動画説明文＃６０を入力情報としてエンコーダＥＮ１に入力した場合に、多変量正規分布における平均ベクトルμおよび分散ベクトルσを出力情報として出力するようにエンコーダＥＮ１を学習させてよい。また、モデル生成部１３４は、平均ベクトルμおよび分散ベクトルσに基づく多変量正規分布に従う標本である潜在ベクトルｚを決定してよい。なお、モデル生成部１３４は、標準正規分布からランダムにサンプリングして得る確率変数εを導入し、これを用いて潜在ベクトルｚを決定してよい。続いて、モデル生成部１３４は、潜在ベクトルｚおよび変分条件ベクトルＶ６０を入力情報としてデコーダＤＥ１に入力した場合に、変分条件ベクトルＶ６０と対応する特徴を有する動画説明文（図９では、事前学習用動画説明文＃６０）を出力情報として出力するようにデコーダＤＥ１を学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、エンコーダＥＮ１に入力された事前学習用動画説明文＃６０と、デコーダＤＥ１から出力された動画説明文との誤差が小さくなるようにエンコーダＥＮ１およびデコーダＤＥ１を学習させてよい。このように、モデル生成部１３４は、事前学習用フレーム特徴量＃６０に基づいて、事前学習用フレーム特徴量＃６０と対応する特徴を有する動画説明文を生成するように事前に学習された事前学習済みデコーダＤＥ１を生成してよい。 First, the model generation unit 134 may train a conditional variational autoencoder including the encoder EN1 and the decoder DE1 by the pre-learning described in FIG. 2. For example, the model generation unit 134 may generate a pre-learning variational condition vector V60 (hereinafter, may be abbreviated as "variational condition vector V60"), which is a condition vector based on the pre-learning frame feature #60. Next, the model generation unit 134 may train the encoder EN1 so that, when the variational condition vector V60 and the pre-learning video description #60 are input to the encoder EN1 as input information, the encoder EN1 outputs the mean vector μ and variance vector σ in the multivariate normal distribution as output information. In addition, the model generation unit 134 may determine a latent vector z, which is a sample that follows a multivariate normal distribution based on the mean vector μ and variance vector σ. In addition, the model generation unit 134 may introduce a random variable ε obtained by randomly sampling from a standard normal distribution, and use this to determine the latent vector z. 
Next, the model generation unit 134 may train the decoder DE1 so that, when the latent vector z and the variational condition vector V60 are input to the decoder DE1 as input information, the decoder DE1 outputs, as output information, a video description (pre-learning video description #60 in FIG. 9) having features corresponding to the variational condition vector V60. For example, the model generation unit 134 may train the encoder EN1 and the decoder DE1 using backpropagation or the like so that the error between the pre-learning video description #60 input to the encoder EN1 and the video description output from the decoder DE1 is reduced. In this way, the model generation unit 134 may generate a pre-trained decoder DE1 that has been trained in advance to generate, based on the pre-learning frame feature #60, a video description having features corresponding to the pre-learning frame feature #60.
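
The determination of the latent vector z from μ, σ and a random variable ε drawn from a standard normal distribution can be sketched as follows. This is a minimal illustration of the commonly used reparameterization z = μ + σ · ε, not the patent's stated formula. The source calls σ a "variance vector"; treating it as a per-dimension standard deviation is an assumption, and the function name is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_latent(mu: Sequence[float],
                  sigma: Sequence[float],
                  rng: Optional[random.Random] = None) -> List[float]:
    """Return z = mu + sigma * eps with eps drawn from N(0, 1) per dimension.

    sigma is assumed to hold per-dimension standard deviations.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    rng = rng or random.Random()
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Writing z this way keeps the randomness in ε, so that the error can be propagated back through μ and σ to the encoder EN1 during training.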

また、モデル生成部１３４は、図３で説明した第１の追加学習または図５で説明した第２の追加学習により、事前学習済みデコーダＤＥ１（以下、「デコーダＤＥ１」と略記する場合がある）を再学習させてよい。 The model generation unit 134 may also retrain the pre-trained decoder DE1 (hereinafter sometimes abbreviated as "decoder DE1") by the first additional learning described in FIG. 3 or the second additional learning described in FIG. 5.

例えば、モデル生成部１３４は、エンコーダＥＮ１から出力された平均ベクトルμおよび分散ベクトルσに基づく多変量正規分布に従う標本である潜在ベクトルｚを決定してよい。また、モデル生成部１３４は、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである学習用変分条件ベクトルＶ６１（以下、「変分条件ベクトルＶ６１」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、潜在ベクトルｚおよび変分条件ベクトルＶ６１を入力情報としてデコーダＤＥ１に入力した場合に、変分条件ベクトルＶ６１と対応する特徴を有する動画説明文を出力情報として出力するようにデコーダＤＥ１を再学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、デコーダＤＥ１から出力された学習用動画説明文＃６１と、画像文データセット＃６１に含まれる画像説明文＃６１（オリジナルの画像＃６１に対応する画像説明文）との誤差が小さくなるようにデコーダＤＥ１を再学習させてよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量に基づいて、重み付けされた学習用フレーム特徴量と対応する特徴を有する動画説明文を生成するようにデコーダＤＥ１を再学習させてよい。このようにして、モデル生成部１３４は、第１の追加学習済みまたは第２の追加学習済みのデコーダＤＥ１を生成してよい。 For example, the model generation unit 134 may determine a latent vector z, which is a sample that follows a multivariate normal distribution based on the mean vector μ and variance vector σ output from the encoder EN1. The model generation unit 134 may also generate a training variation condition vector V61 (hereinafter, sometimes abbreviated as "variation condition vector V61"), which is a condition vector based on training frame features weighted by training weights. Next, the model generation unit 134 may retrain the decoder DE1 so that, when the latent vector z and the variation condition vector V61 are input to the decoder DE1 as input information, a video description having a feature corresponding to the variation condition vector V61 is output as output information. For example, the model generation unit 134 may retrain the decoder DE1 using backpropagation (error backpropagation method) or the like so that the error between the training video description #61 output from the decoder DE1 and the image description #61 (image description corresponding to the original image #61) included in the image description dataset #61 is reduced. In this manner, the model generation unit 134 may retrain the decoder DE1 so as to generate video description text having features corresponding to the weighted training frame features based on the weighted training frame features. 
In this manner, the model generation unit 134 may generate a first additionally trained or second additionally trained decoder DE1.

また、文章生成部１３６は、対象重みによって重み付けされた対象フレーム特徴量に基づく条件ベクトルである変分条件ベクトルＶ６２を生成してよい。続いて、モデル生成部１３４は、潜在ベクトルｚおよび変分条件ベクトルＶ６２を入力情報としてデコーダＤＥ１に入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。このようにして、文章生成部１３６は、対象フレーム特徴量と対象重みとに基づいて、対象重みによって重み付けされた対象フレーム特徴量を、第１の追加学習済みまたは第２の追加学習済みのデコーダＤＥ１に入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。 The sentence generation unit 136 may also generate a variational condition vector V62, which is a condition vector based on the target frame features weighted by the target weight. Next, the model generation unit 134 may input the latent vector z and the variational condition vector V62 as input information to the decoder DE1 to generate a target video description, which is a sentence that describes the contents of the target video. In this way, the sentence generation unit 136 may input the target frame features weighted by the target weight to the first additionally trained or second additionally trained decoder DE1 based on the target frame features and the target weight to generate a target video description, which is a sentence that describes the contents of the target video.

〔８－２．第２の変形例〕 8-2. Second Modified Example
上述した実施形態では、文章生成モデルが、生成器ネットワークおよび識別器ネットワークを含む条件付き敵対的生成ネットワークである場合について説明した。また、第１の変形例では、文章生成モデルが、エンコーダおよびデコーダを含む条件付き変分オートエンコーダである場合について説明した。第２の変形例では、文章生成モデルが、条件付き拡散モデルである場合について説明する。 In the above-described embodiment, a case has been described in which the sentence generation model is a conditional generative adversarial network including a generator network and a discriminator network. In the first modified example, a case has been described in which the sentence generation model is a conditional variational autoencoder including an encoder and a decoder. In the second modified example, a case in which the sentence generation model is a conditional diffusion model will be described.

図１０は、第２の変形例に係る文章生成モデルの一例である条件付き拡散モデルを示す図である。図１０に示すように、文章生成モデルは、条件付き拡散モデルであってよい。図１０では、図３で説明した第１の追加学習または図５で説明した第２の追加学習について説明する。 Figure 10 is a diagram showing a conditional diffusion model, which is an example of a sentence generation model related to the second modified example. As shown in Figure 10, the sentence generation model may be a conditional diffusion model. Figure 10 explains the first additional learning described in Figure 3 or the second additional learning described in Figure 5.

図１０では、条件付き拡散モデルの学習処理に用いるデータの一例として、初期の動画説明文ｘ_０に対してノイズが段階的に付与された複数のノイズ付き動画説明文を示す。モデル生成部１３４は、図１０に示す複数のノイズ付き動画説明文を含む学習用データを用いて条件付き拡散モデルを学習させる。図１０では、初期の動画説明文ｘ_０は、ノイズの付与に関する段階が段階＃０である。すなわち、ノイズが付加されていない動画説明文である。モデル生成部１３４は、初期の動画説明文ｘ_０に徐々にガウスノイズを足していき、最終的に純粋なガウスノイズｘ_Ｔを得る過程（拡散過程）において、初期の動画説明文ｘ_０に対して何度か微小なノイズが付加されたノイズ付き動画説明文ｘ_ｔ－１を生成する。ノイズ付き動画説明文ｘ_ｔ－１は、ノイズの付与に関する段階が段階＃ｔ－１である。すなわち、初期の動画説明文ｘ_０に対してノイズがｔ－１段階付与された動画説明文である。続いて、モデル生成部１３４は、ノイズ付き動画説明文ｘ_ｔ－１に微小なノイズが付加されたノイズ付き動画説明文ｘ_ｔを生成する。ノイズ付き動画説明文ｘ_ｔは、ノイズ付き動画説明文ｘ_ｔ－１に対してノイズがさらに１段階付加された動画説明文である。ノイズ付き動画説明文ｘ_ｔは、ノイズの付与に関する段階が段階＃ｔである。すなわち、初期の動画説明文ｘ_０に対してノイズがｔ段階付与された動画説明文である。例えば、ノイズ付き動画説明文ｘ_ｔは、ノイズ付き動画説明文ｘ_ｔ－１にノイズを付与するノイズ付与処理により生成される。図１０に示すｑ（ｘ_ｔ｜ｘ_ｔ－１）は、ノイズ付き動画説明文ｘ_ｔ－１からノイズ付き動画説明文ｘ_ｔに遷移する遷移確率を示す。 FIG. 10 shows, as an example of data used in the training process of the conditional diffusion model, a plurality of noise-added video descriptions in which noise is added stepwise to an initial video description x_0. The model generation unit 134 trains the conditional diffusion model using training data including the plurality of noise-added video descriptions shown in FIG. 10. In FIG. 10, the initial video description x_0 is at stage #0 of noise addition. That is, it is a video description to which no noise has been added. In the process of gradually adding Gaussian noise to the initial video description x_0 until pure Gaussian noise x_T is finally obtained (the diffusion process), the model generation unit 134 generates a noise-added video description x_t-1 in which minute noise has been added several times to the initial video description x_0. The noise-added video description x_t-1 is at stage #t-1 of noise addition. That is, it is a video description in which t-1 stages of noise have been added to the initial video description x_0. Next, the model generation unit 134 generates a noise-added video description x_t in which minute noise is added to the noise-added video description x_t-1. The noise-added video description x_t is a video description in which one more stage of noise is added to the noise-added video description x_t-1. The noise-added video description x_t is at stage #t of noise addition. That is, it is a video description in which t stages of noise have been added to the initial video description x_0. For example, the noise-added video description x_t is generated by a noise addition process that adds noise to the noise-added video description x_t-1. q(x_t | x_t-1) shown in FIG. 10 indicates the transition probability of transitioning from the noise-added video description x_t-1 to the noise-added video description x_t.
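
The stepwise noise addition from x_0 to x_T can be sketched as below. The source does not specify the form of q(x_t | x_t-1); the sketch assumes the widely used Gaussian step x_t = sqrt(1 - β_t) · x_t-1 + sqrt(β_t) · ε, and it assumes that a video description is represented as a continuous feature vector. The names `add_noise_step`, `diffuse`, and the schedule `betas` are hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence


def add_noise_step(x_prev: Sequence[float],
                   beta: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """One assumed forward step q(x_t | x_t-1): scale x_prev and add Gaussian noise."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    rng = rng or random.Random()
    keep = math.sqrt(1.0 - beta)   # how much of x_t-1 is retained
    scale = math.sqrt(beta)        # standard deviation of the added noise
    return [keep * v + scale * rng.gauss(0.0, 1.0) for v in x_prev]


def diffuse(x0: Sequence[float],
            betas: Sequence[float],
            rng: Optional[random.Random] = None) -> List[List[float]]:
    """Return the trajectory [x_0, x_1, ..., x_T] for a noise schedule of length T."""
    rng = rng or random.Random()
    trajectory = [list(x0)]
    for beta in betas:
        trajectory.append(add_noise_step(trajectory[-1], beta, rng))
    return trajectory
```

Each element of the returned trajectory corresponds to one stage of noise addition, so `trajectory[t]` plays the role of the noise-added video description at stage #t.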

続いて、モデル生成部１３４は、純粋なガウスノイズｘ_Ｔから徐々にガウスノイズを除去していき、最終的にノイズが付加されていない動画説明文ｘ_０を得る過程（逆拡散過程）において、ノイズ付き動画説明文ｘ_ｔから微小なノイズを除去してノイズ付き動画説明文ｘ_ｔ－１を生成する条件付き拡散モデルを学習させる。例えば、モデル生成部１３４は、ノイズ付き動画説明文ｘ_ｔを入力とし、一つ手前の過程、すなわちノイズ付き動画説明文ｘ_ｔからノイズを１段階除去したノイズ付き動画説明文ｘ_ｔ－１を出力するように条件付き拡散モデルを学習させる。図１０に示すｐ_θ（ｘ_ｔ－１｜ｘ_ｔ、Ｖ７１）は、ノイズ付き動画説明文ｘ_ｔからノイズ付き動画説明文ｘ_ｔ－１に遷移する遷移確率を示す。また、ｐ_θ（ｘ_ｔ－１｜ｘ_ｔ、Ｖ７１）は、学習によって定まるパラメータθを持つニューラルネットワークの出力である。このように、モデル生成部１３４は、純粋なガウスノイズｘ_Ｔを入力とし、徐々にノイズを除去していくことで、最終的にノイズが付加されていない動画説明文ｘ_０を生成する機械学習モデルである条件付き拡散モデルを学習させる。 Next, in the process of gradually removing Gaussian noise from the pure Gaussian noise x_T until a video description x_0 without added noise is finally obtained (the reverse diffusion process), the model generation unit 134 trains a conditional diffusion model that removes minute noise from the noise-added video description x_t to generate the noise-added video description x_t-1. For example, the model generation unit 134 trains the conditional diffusion model so that it takes the noise-added video description x_t as input and outputs the preceding stage, that is, the noise-added video description x_t-1 obtained by removing one stage of noise from the noise-added video description x_t. p_θ(x_t-1 | x_t, V71) shown in FIG. 10 indicates the transition probability of transitioning from the noise-added video description x_t to the noise-added video description x_t-1. Also, p_θ(x_t-1 | x_t, V71) is the output of a neural network having a parameter θ determined by training. In this way, the model generation unit 134 trains a conditional diffusion model, which is a machine learning model that takes pure Gaussian noise x_T as input and gradually removes the noise, thereby ultimately generating a video description x_0 to which no noise has been added.
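
The reverse diffusion process can be sketched as a loop that starts from pure Gaussian noise x_T and applies one denoising step per stage until x_0 is reached. This is a minimal sketch: `denoise_step` is a hypothetical callable standing in for the trained network with parameter θ, and its signature (current vector, stage t, condition vector) is an assumption, as is the representation of a video description as a feature vector.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical signature of one learned reverse step: (x_t, t, condition) -> x_t-1
DenoiseStep = Callable[[List[float], int, Sequence[float]], List[float]]


def generate(denoise_step: DenoiseStep,
             condition: Sequence[float],
             num_steps: int,
             dim: int,
             rng: Optional[random.Random] = None) -> List[float]:
    """Run the reverse process from x_T (pure Gaussian noise) down to x_0."""
    rng = rng or random.Random()
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T
    for t in range(num_steps, 0, -1):              # t = T, T-1, ..., 1
        x = denoise_step(x, t, condition)          # x_t -> x_t-1
    return x                                       # x_0
```

The condition vector (for example V71 during training or V72 at generation time) is passed unchanged to every step, which is one straightforward way to condition each transition on the weighted frame features.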

まず、モデル生成部１３４は、図２で説明した事前学習により、条件付き拡散モデルを学習させてよい。例えば、モデル生成部１３４は、事前学習用フレーム特徴量＃７０に基づく条件ベクトルである事前学習用拡散条件ベクトルＶ７０（以下、「拡散条件ベクトルＶ７０」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、拡散条件ベクトルＶ７０を入力情報として条件付き拡散モデルに入力した場合に、ノイズベクトルが従う多変量正規分布における平均ベクトルおよび分散ベクトルを出力情報として出力するように条件付き拡散モデルを学習させてよい。例えば、モデル生成部１３４は、拡散条件ベクトルＶ７０および純粋なガウスノイズベクトルを入力情報として条件付き拡散モデルに入力した場合に、拡散条件ベクトルＶ７０と対応する特徴を有する動画説明文を生成するように条件付き拡散モデルを学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、条件付き拡散モデルから出力された動画説明文と、動画文データセット＃７０に含まれる事前学習用動画説明文＃７０との誤差が小さくなるように条件付き拡散モデルを学習させてよい。このように、モデル生成部１３４は、事前学習用フレーム特徴量＃７０に基づいて、事前学習用フレーム特徴量＃７０と対応する特徴を有する動画説明文を生成するように事前に学習された事前学習済み条件付き拡散モデルを生成してよい。 First, the model generation unit 134 may train a conditional diffusion model by the pre-learning described in FIG. 2. For example, the model generation unit 134 may generate a pre-learning diffusion condition vector V70 (hereinafter, may be abbreviated as "diffusion condition vector V70"), which is a condition vector based on the pre-learning frame feature #70. Next, the model generation unit 134 may train the conditional diffusion model so that when the diffusion condition vector V70 is input as input information to the conditional diffusion model, the mean vector and variance vector in the multivariate normal distribution to which the noise vector follows are output as output information. For example, the model generation unit 134 may train the conditional diffusion model so that when the diffusion condition vector V70 and a pure Gaussian noise vector are input as input information to the conditional diffusion model, the model generation unit 134 may train the conditional diffusion model so as to generate a video description having characteristics corresponding to the diffusion condition vector V70. 
For example, the model generation unit 134 may use backpropagation or the like to train the conditional diffusion model so as to reduce the error between the video description output from the conditional diffusion model and the pre-training video description #70 included in the video description dataset #70. In this way, the model generation unit 134 may generate a pre-trained conditional diffusion model that has been trained in advance based on the pre-training frame feature #70 to generate a video description having features corresponding to the pre-training frame feature #70.

また、モデル生成部１３４は、図３で説明した第１の追加学習または図５で説明した第２の追加学習により、事前学習済み条件付き拡散モデル（以下、「条件付き拡散モデル」と略記する場合がある）を再学習させてよい。例えば、モデル生成部１３４は、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである学習用拡散条件ベクトルＶ７１（以下、「拡散条件ベクトルＶ７１」と略記する場合がある）を生成してよい。続いて、モデル生成部１３４は、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである拡散条件ベクトルＶ７１を入力情報として条件付き拡散モデルに入力した場合に、ノイズベクトルが従う多変量正規分布における平均ベクトルおよび分散ベクトルを出力情報として出力するように条件付き拡散モデルを学習させてよい。例えば、モデル生成部１３４は、拡散条件ベクトルＶ７１および純粋なガウスノイズを入力情報として条件付き拡散モデルに入力した場合に、拡散条件ベクトルＶ７１と対応する特徴を有する動画説明文を生成するように条件付き拡散モデルを再学習させてよい。例えば、モデル生成部１３４は、バックプロパゲーション（誤差逆伝播法）等を用いて、条件付き拡散モデルから出力された学習用動画説明文＃７１と、画像文データセット＃７１に含まれる画像説明文＃７１（オリジナルの画像＃７１に対応する画像説明文）との誤差が小さくなるように条件付き拡散モデルを再学習させてよい。このようにして、モデル生成部１３４は、重み付けされた学習用フレーム特徴量に基づいて、重み付けされた学習用フレーム特徴量と対応する特徴を有する動画説明文を生成するように条件付き拡散モデルを再学習させてよい。このようにして、モデル生成部１３４は、第１の追加学習済みまたは第２の追加学習済みの条件付き拡散モデルを生成してよい。 The model generation unit 134 may retrain the pre-trained conditional diffusion model (hereinafter, may be abbreviated as "conditional diffusion model") by the first additional learning described in FIG. 3 or the second additional learning described in FIG. 5. For example, the model generation unit 134 may generate a learning diffusion condition vector V71 (hereinafter, may be abbreviated as "diffusion condition vector V71") which is a condition vector based on the learning frame feature weighted by the learning weight. Next, the model generation unit 134 may train the conditional diffusion model so that when the diffusion condition vector V71 which is a condition vector based on the learning frame feature weighted by the learning weight is input as input information to the conditional diffusion model, the model generation unit 134 outputs the mean vector and the variance vector in the multivariate normal distribution to which the noise vector follows as output information. 
For example, the model generation unit 134 may retrain the conditional diffusion model so that, when the diffusion condition vector V71 and pure Gaussian noise are input to the conditional diffusion model as input information, the conditional diffusion model generates a video description having features corresponding to the diffusion condition vector V71. For example, the model generation unit 134 may use backpropagation or the like to retrain the conditional diffusion model so as to reduce the error between the training video description #71 output from the conditional diffusion model and the image description #71 (image description corresponding to the original image #71) included in the image description dataset #71. In this manner, the model generation unit 134 may retrain the conditional diffusion model so as to generate, based on the weighted training frame features, a video description having features corresponding to the weighted training frame features. In this manner, the model generation unit 134 may generate a first additionally trained or second additionally trained conditional diffusion model.

また、文章生成部１３６は、学習済みの条件付き拡散モデル（以下、「条件付き拡散モデル」と略記する場合がある）を用いてノイズベクトルを推定し、ノイズ付き動画説明文特徴量からノイズベクトルを取り除くことにより、動画説明文を生成する。例えば、文章生成部１３６は、対象重みによって重み付けされた対象フレーム特徴量に基づく条件ベクトルである拡散条件ベクトルＶ７２を生成してよい。続いて、モデル生成部１３４は、拡散条件ベクトルＶ７２および純粋なガウスノイズベクトルを入力情報として条件付き拡散モデルに入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。このようにして、文章生成部１３６は、対象フレーム特徴量と対象重みとに基づいて、対象重みによって重み付けされた対象フレーム特徴量を、第１の追加学習済みまたは第２の追加学習済みの条件付き拡散モデルに入力して、対象動画の内容を説明する文章である対象動画説明文を生成してよい。 The sentence generation unit 136 also generates a video description by estimating a noise vector using a trained conditional diffusion model (hereinafter sometimes abbreviated as "conditional diffusion model") and removing the noise vector from the noised video description feature. For example, the sentence generation unit 136 may generate a diffusion condition vector V72, which is a condition vector based on the target frame feature weighted by the target weight. Next, the model generation unit 134 may input the diffusion condition vector V72 and a pure Gaussian noise vector as input information to the conditional diffusion model to generate a target video description, which is a sentence that describes the contents of the target video. In this way, the sentence generation unit 136 may input the target frame feature weighted by the target weight to the first additionally trained or second additionally trained conditional diffusion model based on the target frame feature and the target weight to generate a target video description, which is a sentence that describes the contents of the target video.

〔９．効果〕 9. Effects
上述したように、実施形態に係る情報処理装置１００は、動画生成部１３１と抽出部１３２と決定部１３３とモデル生成部１３４を備える。動画生成部１３１は、撮像画像と撮像画像の内容を説明する文章である画像説明文との組を含む画像文データセットに基づいて、学習用動画を生成する。抽出部１３２は、学習用動画を構成する複数のフレーム画像それぞれの特徴量である学習用フレーム特徴量を抽出する。決定部１３３は、学習用動画を構成する複数のフレーム画像それぞれに対応する重みである学習用重みを決定する。モデル生成部１３４は、学習用フレーム特徴量と学習用重みとに基づいて、学習用動画の内容を説明する文章である学習用動画説明文であって、学習用重みによって重み付けされた学習用フレーム特徴量と対応する特徴を有する学習用動画説明文を生成するように学習された機械学習モデルである文章生成モデルを生成する。 As described above, the information processing device 100 according to the embodiment includes a video generating unit 131, an extracting unit 132, a determining unit 133, and a model generating unit 134. The video generating unit 131 generates a learning video based on an image and sentence data set including a pair of a captured image and an image description that is a sentence that describes the content of the captured image. The extracting unit 132 extracts learning frame features that are features of each of a plurality of frame images that constitute the learning video. The determining unit 133 determines learning weights that are weights corresponding to each of a plurality of frame images that constitute the learning video. The model generating unit 134 generates a sentence generation model that is a machine learning model trained to generate a learning video description that is a sentence that describes the content of the learning video and has features corresponding to the learning frame features weighted by the learning weight, based on the learning frame features and the learning weights.

これにより、情報処理装置１００は、条件付き生成モデルに与える条件として、各フレーム画像に対応する重み付けされたフレーム特徴量を用いることにより、動画のどの部分（どのフレーム画像）を重視した動画説明文を生成するのかをコントロール可能とすることができる。また、情報処理装置１００は、注目するフレーム画像に応じた多様な動画説明文を生成可能とすることができる。また、情報処理装置１００は、注目するフレーム画像に応じた多様な動画説明文を生成可能とすることができるので、持続可能な開発目標（ＳＤＧｓ）の目標９「産業と技術革新の基盤をつくろう」の達成に貢献できる。また、情報処理装置１００は、条件付き生成モデルに与える条件として、各フレーム画像に対応する重み付けされたフレーム特徴量を用いることにより、動画の時系列情報を自然言語生成に反映することを可能とすることができる。 As a result, by using weighted frame features corresponding to each frame image as the condition given to the conditional generative model, the information processing device 100 can control which part of the video (which frame image) is emphasized when generating a video description. The information processing device 100 can also generate a variety of video descriptions according to the frame image of interest. Because it can generate a variety of video descriptions according to the frame image of interest, the information processing device 100 can contribute to the achievement of Goal 9 of the Sustainable Development Goals (SDGs), "Industry, Innovation and Infrastructure." In addition, by using weighted frame features corresponding to each frame image as the condition given to the conditional generative model, the information processing device 100 can reflect the time-series information of the video in natural language generation.

また、動画生成部１３１は、画像から動画を生成する機械学習モデルである第１の動画生成モデルを用いて、画像文データセットに含まれる撮像画像から、撮像画像をフレームに含む学習用動画を生成する。決定部１３３は、学習用動画を構成する複数のフレーム画像のうち、撮像画像に対応する重みを撮像画像以外の他のフレーム画像に対応する重みよりも大きくするように複数のフレーム画像それぞれに対応する学習用重みを決定する。 The video generation unit 131 also generates a learning video including captured images in frames from captured images included in the image-sentence dataset using a first video generation model, which is a machine learning model that generates videos from images. The determination unit 133 determines learning weights corresponding to each of the multiple frame images constituting the learning video such that the weight corresponding to the captured image is greater than the weights corresponding to the other frame images other than the captured image.

これにより、情報処理装置１００は、動画を構成する複数のフレーム画像のうち、動画を生成する元となった撮像画像を他のフレーム画像よりも重視した動画説明文を生成可能とすることができる。 This enables the information processing device 100 to generate a video description that places more importance on the captured image that was the source of generating the video than on the other frame images among the multiple frame images that make up the video.
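
The weighting rule for a training video generated from a captured image can be sketched as follows: the frame that is the captured image itself receives a larger weight than every other frame. The concrete weight values and the function name are assumptions for illustration; the source only requires that the captured image's weight be larger.

```python
from typing import List


def training_weights(num_frames: int,
                     source_index: int,
                     source_weight: float = 1.0,
                     other_weight: float = 0.2) -> List[float]:
    """Per-frame training weights where the captured image's frame dominates.

    source_index is the position of the captured image within the generated video.
    The default values 1.0 and 0.2 are illustrative only.
    """
    if not 0 <= source_index < num_frames:
        raise ValueError("source_index must refer to a frame of the video")
    if source_weight <= other_weight:
        raise ValueError("the captured image must receive the larger weight")
    return [source_weight if i == source_index else other_weight
            for i in range(num_frames)]
```

For a four-frame video whose first frame is the captured image, `training_weights(4, 0)` yields one large weight followed by three small ones.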

また、動画生成部１３１は、文章から動画を生成する機械学習モデルである第２の動画生成モデルを用いて、画像文データセットに含まれる画像説明文から学習用動画を生成する。決定部１３３は、学習用動画を構成する複数のフレーム画像それぞれと撮像画像との類似度に関する情報を複数のフレーム画像それぞれに対応する学習用重みとする。 The video generation unit 131 also generates training videos from the image captions included in the image-sentence dataset using a second video generation model, which is a machine learning model that generates videos from text. The determination unit 133 sets information regarding the similarity between each of the multiple frame images constituting the training video and the captured image as training weights corresponding to each of the multiple frame images.

これにより、情報処理装置１００は、動画を生成する元となった画像説明文に対応する撮像画像との類似度が低いフレーム画像よりも、動画を生成する元となった画像説明文に対応する撮像画像との類似度が高いフレーム画像を重視した動画説明文を生成可能とすることができる。 This enables the information processing device 100 to generate a video description that places emphasis on frame images that have a high similarity to the captured image corresponding to the image description from which the video was generated, rather than frame images that have a low similarity to the captured image corresponding to the image description from which the video was generated.
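
For a training video generated from an image description, the weights are based on the similarity between each frame and the captured image. The source speaks only of "information regarding the similarity"; using the cosine similarity between feature vectors is an assumption of this sketch, as are the function names.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_weights(frame_features: Sequence[Sequence[float]],
                       image_feature: Sequence[float]) -> List[float]:
    """Use each frame's similarity to the captured image as that frame's training weight."""
    return [cosine_similarity(f, image_feature) for f in frame_features]
```

Frames whose features point in nearly the same direction as the captured image's features receive weights close to 1, so they contribute more to the condition given to the sentence generation model.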

また、抽出部１３２は、撮像動画と撮像動画の内容を説明する文章である動画説明文との組を含む動画文データセットに含まれる撮像動画を構成する複数のフレーム画像それぞれの特徴量である事前学習用フレーム特徴量を抽出する。モデル生成部１３４は、事前学習用フレーム特徴量に基づいて、事前学習用フレーム特徴量と対応する特徴を有する動画説明文を生成するように事前に学習された機械学習モデルである事前学習済み文章生成モデルを生成し、学習用フレーム特徴量と学習用重みとに基づいて、学習用動画の内容を説明する文章である学習用動画説明文であって、学習用重みによって重み付けされた学習用フレーム特徴量と対応する特徴を有する学習用動画説明文を生成するように事前学習済み文章生成モデルを再学習させることにより、文章生成モデルを生成する。 The extraction unit 132 also extracts pre-training frame features, which are features of each of a plurality of frame images constituting the captured video included in a video text dataset including a pair of a captured video and a video description, which is a text that describes the content of the captured video. The model generation unit 134 generates a pre-trained text generation model, which is a machine learning model that has been trained in advance to generate a video description having features corresponding to the pre-training frame features, based on the pre-training frame features, and generates a text generation model by re-training the pre-trained text generation model to generate a training video description, which is a text that describes the content of the training video, based on the training frame features and the training weights, and which has features corresponding to the training frame features weighted by the training weights.

これにより、情報処理装置１００は、フレーム特徴量と対応する特徴を有する動画説明文を生成可能とすることができる。 This enables the information processing device 100 to generate video description text that has characteristics corresponding to the frame features.

また、情報処理装置１００は、取得部１３５と文章生成部１３６をさらに備える。取得部１３５は、撮像画像と撮像画像の内容を説明する文章である画像説明文との組を含む画像文データセットに基づいて生成された学習用動画を構成する複数のフレーム画像それぞれの特徴量である学習用フレーム特徴量と、学習用動画を構成する複数のフレーム画像それぞれに対応する重みである学習用重みとに基づいて、学習用動画の内容を説明する文章である学習用動画説明文であって、学習用重みによって重み付けされた学習用フレーム特徴量と対応する特徴を有する学習用動画説明文を生成するように学習された機械学習モデルである文章生成モデルを取得する。抽出部１３２は、処理対象の動画である対象動画を構成する複数のフレーム画像それぞれの特徴量である対象フレーム特徴量を抽出する。決定部１３３は、対象動画を構成する複数のフレーム画像のうち、利用者によって指定された指定フレーム画像に対応する重みを指定フレーム画像以外の他のフレーム画像に対応する重みよりも大きくするように複数のフレーム画像それぞれに対応する重みである対象重みを決定する。文章生成部１３６は、対象フレーム特徴量と対象重みとに基づいて、対象重みによって重み付けされた対象フレーム特徴量を文章生成モデルに入力して、対象動画の内容を説明する文章である対象動画説明文を生成する。 The information processing device 100 further includes an acquisition unit 135 and a sentence generation unit 136. The acquisition unit 135 acquires a sentence generation model, which is a machine learning model trained to generate a learning video description, which is a sentence that describes the content of the learning video, having features corresponding to the learning frame features weighted by the learning weight, based on the learning frame features that are features of each of the multiple frame images constituting the learning video generated based on an image and sentence dataset including a pair of a captured image and an image description that is a sentence that describes the content of the captured image, and the learning weights that are weights corresponding to each of the multiple frame images constituting the learning video. The extraction unit 132 extracts target frame features that are features of each of the multiple frame images constituting the target video, which is a video to be processed. The determination unit 133 determines target weights that are weights corresponding to each of the multiple frame images such that the weight corresponding to the designated frame image designated by the user among the multiple frame images constituting the target video is larger than the weight corresponding to other frame images other than the designated frame image. 
The sentence generation unit 136 inputs the target frame features weighted by the target weights into the sentence generation model, based on the target frame features and the target weights, and generates a target video description, which is text that explains the content of the target video.

これにより、情報処理装置１００は、条件付き生成モデルに与える条件として、各フレーム画像に対応する重み付けされたフレーム特徴量を用いることにより、動画のどの部分（どのフレーム画像）を重視した動画説明文を生成するのかをコントロール可能とすることができる。また、情報処理装置１００は、注目するフレーム画像に応じた多様な動画説明文を生成することができる。また、情報処理装置１００は、注目するフレーム画像に応じた多様な動画説明文を生成することができるので、持続可能な開発目標（ＳＤＧｓ）の目標９「産業と技術革新の基盤をつくろう」の達成に貢献できる。また、情報処理装置１００は、条件付き生成モデルに与える条件として、各フレーム画像に対応する重み付けされたフレーム特徴量を用いることにより、動画の時系列情報を自然言語生成に反映することを可能とすることができる。 As a result, by using the weighted frame features corresponding to each frame image as the condition given to the conditional generative model, the information processing device 100 can control which part of the video (which frame image) is emphasized when generating a video description. The information processing device 100 can also generate a variety of video descriptions according to the frame image of interest. Because it can generate such a variety of video descriptions, the information processing device 100 can contribute to the achievement of Goal 9 of the Sustainable Development Goals (SDGs), "Industry, Innovation and Infrastructure." Furthermore, by using the weighted frame features corresponding to each frame image as the condition given to the conditional generative model, the information processing device 100 can reflect the time-series information of the video in natural language generation.
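The weighting described above can be illustrated with a minimal sketch. It is not the implementation disclosed in this publication: it assumes that each frame feature is a fixed-length vector, that the designated frame receives a weight larger than the others by a hypothetical factor `emphasis`, that the weights are normalized to sum to 1, and that the condition is the weighted sum of the frame features. The function names `determine_target_weights` and `condition_vector` are illustrative only.

```python
from typing import List


def determine_target_weights(num_frames: int, designated: int,
                             emphasis: float = 3.0) -> List[float]:
    """Give the user-designated frame a larger weight than every other frame.

    `emphasis` is a hypothetical parameter (not from the source): the ratio of
    the designated frame's weight to that of any other frame.
    """
    if not 0 <= designated < num_frames:
        raise ValueError("designated frame index is out of range")
    if emphasis <= 1.0:
        raise ValueError("emphasis must exceed 1 so the designated frame dominates")
    raw = [emphasis if i == designated else 1.0 for i in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]  # normalize so the weights sum to 1


def condition_vector(features: List[List[float]],
                     weights: List[float]) -> List[float]:
    """Weighted sum of per-frame feature vectors (assumed aggregation)."""
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Three toy frame features; the user designates the second frame (index 1).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = determine_target_weights(len(features), designated=1)
cond = condition_vector(features, weights)
```

Changing `designated` shifts the condition toward a different frame, which is the mechanism by which different descriptions could be obtained from the same video.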

また、文章生成モデルは、生成器ネットワークおよび識別器ネットワークを含む条件付き敵対的生成ネットワークであり、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである第１の敵対的条件ベクトルおよびノイズベクトルを入力情報として生成器ネットワークに入力した場合に、学習用動画説明文を出力情報として出力するように学習された生成器ネットワークと、撮像動画と撮像動画の内容を説明する文章である動画説明文との組を含む動画文データセットに含まれる撮像動画を構成する複数のフレーム画像それぞれの特徴量である事前学習用フレーム特徴量に基づく条件ベクトルである第２の敵対的条件ベクトルおよび動画説明文を入力情報として識別器ネットワークに入力した場合に、動画説明文が、真の動画説明文であって、かつ、第２の敵対的条件ベクトルと対応する動画説明文であることを示す情報を出力情報として出力するように学習された識別器ネットワークであって、第１の敵対的条件ベクトルおよび生成器ネットワークが生成した偽の動画説明文である学習用動画説明文を入力情報として識別器ネットワークに入力した場合に、学習用動画説明文が、真の動画説明文であって、かつ、第１の敵対的条件ベクトルと対応する動画説明文であることを示す情報以外の情報を出力情報として出力するように学習された識別器ネットワークと、を含む機械学習モデルである。 The sentence generation model is a conditional generative adversarial network including a generator network and a classifier network. Specifically, it is a machine learning model that includes: a generator network trained to output a training video description as output information when a first adversarial condition vector, which is a condition vector based on the training frame features weighted by the training weights, and a noise vector are input to the generator network as input information; and a classifier network trained to output, as output information, information indicating that a video description is a true video description and corresponds to a second adversarial condition vector when the second adversarial condition vector, which is a condition vector based on pre-training frame features that are features of each of a plurality of frame images constituting a captured video included in a video text dataset including a pair of a captured video and a video description that is text explaining the content of the captured video, and the video description are input to the classifier network as input information, the classifier network being further trained to output, as output information, information other than information indicating that a training video description is a true video description and corresponds to the first adversarial condition vector when the first adversarial condition vector and the training video description, which is a false video description generated by the generator network, are input to the classifier network as input information.

これにより、情報処理装置１００は、条件付き敵対的生成ネットワークを用いて、注目するフレーム画像に応じた多様な動画説明文を生成することができる。 This allows the information processing device 100 to use a conditional generative adversarial network to generate a variety of video descriptions that correspond to the frame image of interest.
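The training targets of the two networks can be sketched as follows. This is a hedged illustration, not the disclosed implementation: it assumes the classifier network (commonly called a discriminator) outputs a probability that a (condition vector, description) pair is true and matching, and it uses binary cross-entropy with the common non-saturating generator loss, neither of which is stated in the source. The probabilities passed in are hard-coded stand-ins for network outputs.

```python
import math


def bce(prob: float, label: float) -> float:
    """Binary cross-entropy for one prediction, clamped for numerical safety."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def classifier_loss(p_real_pair: float, p_fake_pair: float) -> float:
    """Classifier network objective (assumed form).

    p_real_pair: output for (second adversarial condition vector, real description),
                 pushed toward 1 ("true and matching").
    p_fake_pair: output for (first adversarial condition vector, generated description),
                 pushed toward 0 (anything other than "true and matching").
    """
    return bce(p_real_pair, 1.0) + bce(p_fake_pair, 0.0)


def generator_loss(p_fake_pair: float) -> float:
    """Generator objective: make the classifier accept the generated description."""
    return bce(p_fake_pair, 1.0)


# A classifier that separates real from generated pairs well has a lower loss.
good = classifier_loss(p_real_pair=0.9, p_fake_pair=0.1)
poor = classifier_loss(p_real_pair=0.4, p_fake_pair=0.7)
```

In an actual training loop the two losses would be minimized alternately, each with respect to its own network's parameters.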

また、文章生成モデルは、エンコーダおよびデコーダを含む条件付き変分オートエンコーダであり、撮像動画と撮像動画の内容を説明する文章である動画説明文との組を含む動画文データセットに含まれる撮像動画を構成する複数のフレーム画像それぞれの特徴量である事前学習用フレーム特徴量に基づく条件ベクトルである第１の変分条件ベクトルおよび動画説明文を入力情報としてエンコーダに入力した場合に、多変量正規分布における平均ベクトルおよび分散ベクトルを出力情報として出力するように学習されたエンコーダと、平均ベクトルおよび分散ベクトルに基づく多変量正規分布に従う標本である潜在ベクトル、および、学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである第２の変分条件ベクトルを入力情報としてデコーダに入力した場合に、学習用動画説明文を出力情報として出力するように学習されたデコーダと、を含む機械学習モデルである。 The sentence generation model is a conditional variational autoencoder including an encoder and a decoder, and is a machine learning model including an encoder trained to output a mean vector and a variance vector in a multivariate normal distribution as output information when a first variational condition vector, which is a condition vector based on pre-training frame features that are features of each of a plurality of frame images constituting a captured video included in a video text dataset including a pair of a captured video and a video description that is a text that explains the content of the captured video, and the video description are input as input information to the encoder, and a decoder trained to output the training video description as output information when a latent vector, which is a sample that follows a multivariate normal distribution based on the mean vector and the variance vector, and a second variational condition vector, which is a condition vector based on the training frame features weighted by the training weight, are input as input information to the decoder.

これにより、情報処理装置１００は、条件付き変分オートエンコーダを用いて、注目するフレーム画像に応じた多様な動画説明文を生成することができる。 This allows the information processing device 100 to use a conditional variational autoencoder to generate a variety of video descriptions that correspond to the frame image of interest.
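The step in which a latent vector is drawn from the distribution given by the encoder's mean and variance vectors, and then passed to the decoder together with a condition vector, can be sketched as below. This is a minimal illustration under assumptions: the covariance is taken to be diagonal, the decoder input is formed by concatenation, and a KL-divergence term to a standard normal prior is shown because it is the usual regularizer for variational autoencoders, although the source does not mention the loss. All function names are hypothetical.

```python
import math
import random
from typing import List


def sample_latent(mean: List[float], variance: List[float],
                  rng: random.Random) -> List[float]:
    """Reparameterized sample z = mean + sqrt(variance) * eps, eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]


def kl_to_standard_normal(mean: List[float], variance: List[float]) -> float:
    """KL( N(mean, diag(variance)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, variance))


def decoder_input(latent: List[float], condition: List[float]) -> List[float]:
    """Concatenate the latent vector with the second variational condition vector."""
    return latent + condition


rng = random.Random(0)
mean, variance = [0.5, -0.2], [0.8, 1.2]   # stand-ins for encoder outputs
condition = [0.4, 0.8, 0.1]                # stand-in for the weighted-feature condition
z = sample_latent(mean, variance, rng)
x = decoder_input(z, condition)
```

At inference time only the decoder is needed: a latent vector sampled from the prior and a condition vector built from the weighted target frame features would be sufficient to generate a description.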

また、文章生成モデルは、条件付き拡散モデルであり、ノイズベクトルを含む学習用動画説明文であるノイズ付き動画説明文および学習用重みによって重み付けされた学習用フレーム特徴量に基づく条件ベクトルである拡散条件ベクトルを入力情報として条件付き拡散モデルに入力した場合に、ノイズベクトルが従う多変量正規分布における平均ベクトルおよび分散ベクトルを出力情報として出力するように学習された条件付き拡散モデルを用いてノイズベクトルを推定し、ノイズ付き動画説明文からノイズベクトルを取り除くことにより、学習用動画説明文を生成する機械学習モデルである。 The sentence generation model is a conditional diffusion model. The conditional diffusion model is trained to output, as output information, the mean vector and the variance vector of the multivariate normal distribution that the noise vector follows when a noisy video description, which is a training video description containing a noise vector, and a diffusion condition vector, which is a condition vector based on the training frame features weighted by the training weights, are input to the conditional diffusion model as input information. The sentence generation model is a machine learning model that estimates the noise vector using this trained conditional diffusion model and generates the training video description by removing the noise vector from the noisy video description.

これにより、情報処理装置１００は、条件付き拡散モデルを用いて、注目するフレーム画像に応じた多様な動画説明文を生成することができる。 This allows the information processing device 100 to use the conditional diffusion model to generate a variety of video descriptions that correspond to the frame image of interest.
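The idea of adding a noise vector to a description and recovering the description by removing an estimated noise vector can be sketched as follows. This is an illustration under assumptions, not the disclosed method: the description is assumed to be represented as a continuous embedding vector, the noising follows the widely used form x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise with a hypothetical cumulative coefficient `alpha_bar`, and the trained conditional diffusion model is replaced by an oracle that returns the true noise.

```python
import math
import random
from typing import List


def add_noise(clean: List[float], noise: List[float], alpha_bar: float) -> List[float]:
    """Forward step: mix the clean embedding with Gaussian noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(clean, noise)]


def remove_noise(noisy: List[float], estimated_noise: List[float],
                 alpha_bar: float) -> List[float]:
    """Reverse step: subtract the estimated noise and rescale."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * n) / a for x, n in zip(noisy, estimated_noise)]


rng = random.Random(0)
description = [0.3, -1.2, 0.7]                  # stand-in description embedding
noise = [rng.gauss(0.0, 1.0) for _ in description]
alpha_bar = 0.5                                 # hypothetical noise level in (0, 1)

noisy_description = add_noise(description, noise, alpha_bar)
# A trained model would estimate the noise from the noisy description and the
# diffusion condition vector; here the true noise is used as an oracle estimate.
recovered = remove_noise(noisy_description, noise, alpha_bar)
```

With a learned estimator the recovery is only approximate, and practical diffusion samplers repeat the denoising over many steps rather than once.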

〔１０．ハードウェア構成〕 10. Hardware Configuration
また、上述してきた実施形態に係る情報処理装置１００は、例えば図１１に示すような構成のコンピュータ１０００によって実現される。図１１は、情報処理装置１００の機能を実現するコンピュータの一例を示すハードウェア構成図である。コンピュータ１０００は、ＣＰＵ１１００、ＲＡＭ１２００、ＲＯＭ１３００、ＨＤＤ１４００、通信インターフェイス（Ｉ／Ｆ）１５００、入出力インターフェイス（Ｉ／Ｆ）１６００、及びメディアインターフェイス（Ｉ／Ｆ）１７００を備える。 Moreover, the information processing device 100 according to the embodiment described above is realized by a computer 1000 having a configuration as shown in Fig. 11, for example. Fig. 11 is a hardware configuration diagram showing an example of a computer that realizes the functions of the information processing device 100. The computer 1000 includes a CPU 1100, a RAM 1200, a ROM 1300, a HDD 1400, a communication interface (I/F) 1500, an input/output interface (I/F) 1600, and a media interface (I/F) 1700.

ＣＰＵ１１００は、ＲＯＭ１３００またはＨＤＤ１４００に格納されたプログラムに基づいて動作し、各部の制御を行う。ＲＯＭ１３００は、コンピュータ１０００の起動時にＣＰＵ１１００によって実行されるブートプログラムや、コンピュータ１０００のハードウェアに依存するプログラム等を格納する。 The CPU 1100 operates based on a program stored in the ROM 1300 or the HDD 1400, and controls each component. The ROM 1300 stores a boot program executed by the CPU 1100 when the computer 1000 is started, and programs that depend on the hardware of the computer 1000, etc.

ＨＤＤ１４００は、ＣＰＵ１１００によって実行されるプログラム、及び、かかるプログラムによって使用されるデータ等を格納する。通信インターフェイス１５００は、所定の通信網を介して他の機器からデータを受信してＣＰＵ１１００へ送り、ＣＰＵ１１００が生成したデータを所定の通信網を介して他の機器へ送信する。 HDD 1400 stores programs executed by CPU 1100 and data used by such programs. Communication interface 1500 receives data from other devices via a specified communication network and sends it to CPU 1100, and transmits data generated by CPU 1100 to other devices via the specified communication network.

ＣＰＵ１１００は、入出力インターフェイス１６００を介して、ディスプレイやプリンタ等の出力装置、及び、キーボードやマウス等の入力装置を制御する。ＣＰＵ１１００は、入出力インターフェイス１６００を介して、入力装置からデータを取得する。また、ＣＰＵ１１００は、生成したデータを入出力インターフェイス１６００を介して出力装置へ出力する。 The CPU 1100 controls output devices such as a display and a printer, and input devices such as a keyboard and a mouse, via the input/output interface 1600. The CPU 1100 acquires data from the input devices via the input/output interface 1600. The CPU 1100 also outputs generated data to the output devices via the input/output interface 1600.

メディアインターフェイス１７００は、記録媒体１８００に格納されたプログラムまたはデータを読み取り、ＲＡＭ１２００を介してＣＰＵ１１００に提供する。ＣＰＵ１１００は、かかるプログラムを、メディアインターフェイス１７００を介して記録媒体１８００からＲＡＭ１２００上にロードし、ロードしたプログラムを実行する。記録媒体１８００は、例えばＤＶＤ（Digital Versatile Disc）、ＰＤ（Phase change rewritable Disk）等の光学記録媒体、ＭＯ（Magneto-Optical disk）等の光磁気記録媒体、テープ媒体、磁気記録媒体、または半導体メモリ等である。 The media interface 1700 reads a program or data stored in the recording medium 1800 and provides it to the CPU 1100 via the RAM 1200. The CPU 1100 loads the program from the recording medium 1800 onto the RAM 1200 via the media interface 1700 and executes the loaded program. The recording medium 1800 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disc), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

例えば、コンピュータ１０００が実施形態に係る情報処理装置１００として機能する場合、コンピュータ１０００のＣＰＵ１１００は、ＲＡＭ１２００上にロードされたプログラムを実行することにより、制御部１３０の機能を実現する。コンピュータ１０００のＣＰＵ１１００は、これらのプログラムを記録媒体１８００から読み取って実行するが、他の例として、他の装置から所定の通信網を介してこれらのプログラムを取得してもよい。 For example, when the computer 1000 functions as the information processing device 100 according to the embodiment, the CPU 1100 of the computer 1000 executes programs loaded onto the RAM 1200 to realize the functions of the control unit 130. The CPU 1100 of the computer 1000 reads and executes these programs from the recording medium 1800, but as another example, the CPU 1100 may obtain these programs from another device via a specified communication network.

以上、本願の実施形態のいくつかを図面に基づいて詳細に説明したが、これらは例示であり、発明の開示の欄に記載の態様を始めとして、当業者の知識に基づいて種々の変形、改良を施した他の形態で本発明を実施することが可能である。 Although several embodiments of the present application have been described in detail above with reference to the drawings, these are merely examples, and the present invention can be embodied in other forms that incorporate various modifications and improvements based on the knowledge of those skilled in the art, including the forms described in the disclosure section of the invention.

〔１１．その他〕 11. Other
また、上記実施形態及び変形例において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。例えば、各図に示した各種情報は、図示した情報に限られない。 Furthermore, among the processes described in the above embodiments and modifications, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, the information including the processing procedures, specific names, various data and parameters shown in the above documents and drawings can be changed arbitrarily unless otherwise specified. For example, the various information shown in each drawing is not limited to the illustrated information.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。 In addition, each component of each device shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or part of them can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.

例えば、上述した実施形態では、情報処理装置１００が、動画生成部１３１と、抽出部１３２と、決定部１３３と、モデル生成部１３４と、取得部１３５と、文章生成部１３６を機能部として有する場合について説明したが、各部はそれぞれ別々の装置に分散して構成することができる。例えば、情報処理装置１００は、文章を生成する情報処理装置として、取得部１３５と、文章生成部１３６を機能部として有することができる。また、情報処理装置１００以外の情報処理装置（以下、「生成装置」と記載する）は、文章生成モデルを生成する情報処理装置として、動画生成部１３１と、抽出部１３２と、決定部１３３と、モデル生成部１３４を機能部として有することができる。このとき、情報処理装置１００と生成装置とは、各種ネットワークと有線または無線で接続され、相互に情報の送受信を行ってよい。例えば、情報処理装置１００は、生成装置によって生成された文章生成モデルに関する情報を生成装置から受信してよい。 For example, in the above embodiment, the information processing device 100 has the video generation unit 131, the extraction unit 132, the determination unit 133, the model generation unit 134, the acquisition unit 135, and the sentence generation unit 136 as functional units. However, each unit can be distributed and configured in a separate device. For example, the information processing device 100 can have the acquisition unit 135 and the sentence generation unit 136 as functional units as an information processing device that generates sentences. In addition, an information processing device other than the information processing device 100 (hereinafter, referred to as a "generation device") can have the video generation unit 131, the extraction unit 132, the determination unit 133, and the model generation unit 134 as functional units as an information processing device that generates a sentence generation model. At this time, the information processing device 100 and the generation device may be connected to various networks by wire or wirelessly and may transmit and receive information to each other. For example, the information processing device 100 may receive information about the sentence generation model generated by the generation device from the generation device.
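The split between a generation device and the information processing device 100 can be pictured with a small sketch. It is purely illustrative: the class names, the JSON payload layout, and the idea of exchanging model parameters as a byte string are assumptions, and the network transport between the two devices is omitted (the bytes are handed over directly).

```python
import json
from typing import Dict, List, Optional


class GenerationDevice:
    """Hypothetical device holding the units that generate the sentence generation model."""

    def __init__(self, parameters: Dict[str, List[float]]) -> None:
        self._parameters = parameters  # stand-in for trained model parameters

    def export_model(self) -> bytes:
        """Serialize information about the generated model for transmission."""
        payload = {"model_type": "sentence_generation", "parameters": self._parameters}
        return json.dumps(payload).encode("utf-8")


class InformationProcessingDevice:
    """Hypothetical device holding only the acquisition and sentence generation units."""

    def __init__(self) -> None:
        self.model: Optional[dict] = None

    def receive_model(self, payload: bytes) -> None:
        """Acquisition unit: accept model information received from the generation device."""
        self.model = json.loads(payload.decode("utf-8"))


generator = GenerationDevice({"w": [0.1, 0.2]})
device = InformationProcessingDevice()
device.receive_model(generator.export_model())
```

A real deployment would send the payload over a wired or wireless network and would likely use a binary format suited to large models.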

また、上述してきた実施形態及び変形例は、処理内容を矛盾させない範囲で適宜組み合わせることが可能である。 The above-described embodiments and variations can be combined as appropriate to the extent that they do not cause inconsistencies in the processing content.

REFERENCE SIGNS LIST
１００情報処理装置 100 Information processing device
１１０通信部 110 Communication unit
１２０記憶部 120 Storage unit
１３０制御部 130 Control unit
１３１動画生成部 131 Video generation unit
１３２抽出部 132 Extraction unit
１３３決定部 133 Determination unit
１３４モデル生成部 134 Model generation unit
１３５取得部 135 Acquisition unit
１３６文章生成部 136 Sentence generation unit

Claims

An information processing device comprising:
a video generation unit that generates a training video based on an image and text dataset including a pair of a captured image and an image description that is text explaining the content of the captured image;
an extraction unit that extracts training frame features, which are features of each of a plurality of frame images constituting the training video;
a determination unit that determines training weights, which are weights corresponding to each of the plurality of frame images constituting the training video, such that the weight corresponding to the captured image is greater than the weights corresponding to frame images other than the captured image; and
a model generation unit that generates a sentence generation model, which is a machine learning model trained, based on the training frame features and the training weights, to generate a training video description that is text explaining the content of the training video and that has features corresponding to the training frame features weighted by the training weights.

The information processing device according to claim 1, wherein
the video generation unit generates, from the captured image included in the image and text dataset, the training video that includes the captured image among its frames, using a first video generation model that is a machine learning model that generates a video from an image.

The information processing device according to claim 1, wherein
the video generation unit generates the training video from the image description included in the image and text dataset, using a second video generation model that is a machine learning model that generates a video from text, and
the determination unit sets, as the training weight corresponding to each of the plurality of frame images constituting the training video, information regarding the similarity between that frame image and the captured image.

The information processing device according to claim 1, wherein
the extraction unit extracts pre-training frame features, which are features of each of a plurality of frame images constituting a captured video included in a video text dataset including a pair of the captured video and a video description that is text explaining the content of the captured video, and
the model generation unit generates, based on the pre-training frame features, a pre-trained sentence generation model, which is a machine learning model pre-trained to generate the video description having features corresponding to the pre-training frame features, and generates the sentence generation model by training the pre-trained sentence generation model, based on the training frame features and the training weights, to generate the training video description that is text explaining the content of the training video and that has features corresponding to the training frame features weighted by the training weights.

An information processing device comprising:
an acquisition unit that acquires a sentence generation model, which is a machine learning model trained, based on training frame features and training weights, to generate a training video description that is text explaining the content of a training video and that has features corresponding to the training frame features weighted by the training weights, the training video being generated based on an image and text dataset including a pair of a captured image and an image description that is text explaining the content of the captured image, the training frame features being features of each of a plurality of frame images constituting the training video, and the training weights being weights corresponding to each of the plurality of frame images constituting the training video;
an extraction unit that extracts target frame features, which are features of each of a plurality of frame images constituting a target video that is a video to be processed;
a determination unit that determines target weights, which are weights corresponding to each of the plurality of frame images constituting the target video, such that the weight corresponding to a designated frame image designated by a user is greater than the weights corresponding to frame images other than the designated frame image; and
a sentence generation unit that, based on the target frame features and the target weights, inputs the target frame features weighted by the target weights to the sentence generation model and generates a target video description that is text explaining the content of the target video.

The information processing device according to claim 1 or 5, wherein
the sentence generation model is a conditional generative adversarial network including a generator network and a classifier network,
the generator network is trained to output the training video description as output information when a first adversarial condition vector, which is a condition vector based on the training frame features weighted by the training weights, and a noise vector are input to the generator network as input information, and
the classifier network is trained to output, as output information, information indicating that a video description is a true video description and corresponds to a second adversarial condition vector when the second adversarial condition vector, which is a condition vector based on pre-training frame features that are features of each of a plurality of frame images constituting a captured video included in a video text dataset including a pair of the captured video and the video description that is text explaining the content of the captured video, and the video description are input to the classifier network as input information, and is further trained to output, as output information, information other than information indicating that the training video description is a true video description and corresponds to the first adversarial condition vector when the first adversarial condition vector and the training video description, which is a false video description generated by the generator network, are input to the classifier network as input information.

The information processing device according to claim 1 or 5, wherein
the sentence generation model is a conditional variational autoencoder including an encoder and a decoder,
the encoder is trained to output, as output information, a mean vector and a variance vector of a multivariate normal distribution when a first variational condition vector, which is a condition vector based on pre-training frame features that are features of each of a plurality of frame images constituting a captured video included in a video text dataset including a pair of the captured video and a video description that is text explaining the content of the captured video, and the video description are input to the encoder as input information, and
the decoder is trained to output the training video description as output information when a latent vector, which is a sample following the multivariate normal distribution based on the mean vector and the variance vector, and a second variational condition vector, which is a condition vector based on the training frame features weighted by the training weights, are input to the decoder as input information.

The information processing device according to claim 1 or 5, wherein
the sentence generation model is a conditional diffusion model, and is a machine learning model that generates the training video description by estimating a noise vector using the conditional diffusion model and removing the noise vector from a noisy video description, the conditional diffusion model being trained to output, as output information, a mean vector and a variance vector of the multivariate normal distribution that the noise vector follows when the noisy video description, which is the training video description including the noise vector, and a diffusion condition vector, which is a condition vector based on the training frame features weighted by the training weights, are input to the conditional diffusion model as input information.

An information processing method implemented by a program executed by an information processing device, the method comprising:
a video generation step of generating a training video based on an image and text dataset including a pair of a captured image and an image description that is text explaining the content of the captured image;
an extraction step of extracting training frame features, which are features of each of a plurality of frame images constituting the training video;
a determination step of determining training weights, which are weights corresponding to each of the plurality of frame images constituting the training video, such that the weight corresponding to the captured image is greater than the weights corresponding to frame images other than the captured image; and
a model generation step of generating a sentence generation model, which is a machine learning model trained, based on the training frame features and the training weights, to generate a training video description that is text explaining the content of the training video and that has features corresponding to the training frame features weighted by the training weights.

An information processing method implemented by a program executed by an information processing device, the method comprising:
an acquisition step of acquiring a sentence generation model, which is a machine learning model trained, based on training frame features and training weights, to generate a training video description that is text explaining the content of a training video and that has features corresponding to the training frame features weighted by the training weights, the training video being generated based on an image and text dataset including a pair of a captured image and an image description that is text explaining the content of the captured image, the training frame features being features of each of a plurality of frame images constituting the training video, and the training weights being weights corresponding to each of the plurality of frame images constituting the training video;
an extraction step of extracting target frame features, which are features of each of a plurality of frame images constituting a target video that is a video to be processed;
a determination step of determining target weights, which are weights corresponding to each of the plurality of frame images constituting the target video, such that the weight corresponding to a designated frame image designated by a user is greater than the weights corresponding to frame images other than the designated frame image; and
a sentence generation step of inputting, based on the target frame features and the target weights, the target frame features weighted by the target weights to the sentence generation model to generate a target video description that is text explaining the content of the target video.