JP7623619B2

JP7623619B2 - Learning device, estimation device, learning method, estimation method, and program

Info

Publication number: JP7623619B2
Application number: JP2023565771A
Authority: JP
Inventors: 崇之梅田; 暁経三反崎; 正樹北原
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2025-01-29
Anticipated expiration: 2041-12-08
Also published as: JPWO2023105673A1; WO2023105673A1

Description

The present invention relates to a learning device, an estimation device, a learning method, an estimation method, and a program.

With the recent development of sensing and machine learning technologies, efforts are under way to have machines recognize a wide variety of objects in the real world. For example, human behavior, the surrounding environment, and objects can be recognized from video captured with a general RGB camera. Multispectral cameras are making it possible to recognize more detailed information, such as the material of an object. Beyond video, the content of conversations and people's emotions can be recognized from audio information.

Machine learning technologies that exploit such varied sensor information and modal information are being actively developed. For example, in Non-Patent Document 1, an image and a text describing it are given, and the relationship between image features and text features is learned directly, which makes it possible to attach a description to an arbitrary image. In Non-Patent Document 2, an autoencoder is used to learn a new feature representation common to multiple modalities, which makes it possible to solve tasks in multiple modalities with a single feature representation.

Non-Patent Document 1: Vicente Ordonez, Girish Kulkarni, Tamara L Berg, "Im2Text: Describing Images Using 1 Million Captioned Photographs" [online], [retrieved November 26, 2021], Internet <URL: https://papers.nips.cc/paper/2011/file/5dd9db5e033da9c6fb5ba83c7a7ebea9-Paper.pdf>
Non-Patent Document 2: Jiquan Ngiam et al., "Multimodal Deep Learning" [online], [retrieved November 26, 2021], Internet <URL: https://icml.cc/2011/papers/399_icmlpaper.pdf>

However, these prior art techniques presuppose that every piece of training data carries all of the modal information. For example, if the modalities are image and language, an image and its description must represent the same instance, and data consisting only of images or only of text cannot serve as training data. In other words, learning cannot be carried out unless image data and language data that are linked to each other exist.

The accuracy of analysis increases with the accuracy of acquiring the information common to the linked data. Achieving highly accurate recognition therefore requires a large amount of data in which multiple kinds of modal information are linked, so that the information common to the linked data can be acquired more accurately. However, preparing a large amount of such linked data is very difficult. As a result, with conventional techniques such as those in the above prior art documents, estimating information common to multiple types of data sometimes required a large effort.

In view of the above circumstances, an object of the present invention is to provide a technology that reduces the effort required to estimate information common to multiple types of data.

One aspect of the present invention is a learning device comprising: a multimodal network, which is a neural network including a first encoder that acquires features of input data, a second encoder that acquires features of input data, a first decoder that decodes input data, and a second decoder that decodes input data; and a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss. The cross self-consistency loss indicates the difference between input data and a self-consistent result, which is the result of that data being converted by the first encoder, the second encoder, the first decoder, and the second decoder. The common loss indicates the difference between the result of the first encoder encoding data indicating a first attribute of a main object, which is a person, living thing, object, intangible object, or event having multiple attributes, and the result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

One aspect of the present invention is an estimation device comprising: a data acquisition unit that acquires first type target data indicating a first attribute of an estimation target and second type target data indicating a second attribute that is among the attributes of the estimation target and different from the first attribute; and an estimation unit that executes, on the first type target data and the second type target data, the processing performed by a trained multimodal network obtained by a learning device. The learning device comprises: a multimodal network, which is a neural network including a first encoder that acquires features of input data, a second encoder that acquires features of input data, a first decoder that decodes input data, and a second decoder that decodes input data; and a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss. The cross self-consistency loss indicates the difference between input data and a self-consistent result, which is the result of that data being converted by the first encoder, the second encoder, the first decoder, and the second decoder. The common loss indicates the difference between the result of the first encoder encoding data indicating a first attribute of a main object, which is a person, living thing, object, intangible object, or event having multiple attributes, and the result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

One aspect of the present invention is a learning method comprising: a network execution step of executing a multimodal network, which is a neural network including a first encoder that acquires features of input data, a second encoder that acquires features of input data, a first decoder that decodes input data, and a second decoder that decodes input data; and a network control step of updating the multimodal network using a cross self-consistency loss and a common loss. The cross self-consistency loss indicates the difference between input data and a self-consistent result, which is the result of that data being converted by the first encoder, the second encoder, the first decoder, and the second decoder. The common loss indicates the difference between the result of the first encoder encoding data indicating a first attribute of a main object, which is a person, living thing, object, intangible object, or event having multiple attributes, and the result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

One aspect of the present invention is an estimation method comprising: a data acquisition step of acquiring first type target data indicating a first attribute of an estimation target and second type target data indicating a second attribute that is among the attributes of the estimation target and different from the first attribute; and an estimation step of executing, on the first type target data and the second type target data, the processing performed by a trained multimodal network obtained by a learning device. The learning device comprises: a multimodal network, which is a neural network including a first encoder that acquires features of input data, a second encoder that acquires features of input data, a first decoder that decodes input data, and a second decoder that decodes input data; and a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss. The cross self-consistency loss indicates the difference between input data and a self-consistent result, which is the result of that data being converted by the first encoder, the second encoder, the first decoder, and the second decoder. The common loss indicates the difference between the result of the first encoder encoding data indicating a first attribute of a main object, which is a person, living thing, object, intangible object, or event having multiple attributes, and the result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

One aspect of the present invention is a program for causing a computer to function as the above learning device.

One aspect of the present invention is a program for causing a computer to function as the above estimation device.

The present invention makes it possible to reduce the effort required to estimate information common to multiple types of data.

FIG. 1 is a first explanatory diagram illustrating an overview of a learning device according to an embodiment.
FIG. 2 is a second explanatory diagram illustrating an overview of the learning device of the embodiment.
FIG. 3 is a third explanatory diagram illustrating an overview of the learning device of the embodiment.
FIG. 4 is a fourth explanatory diagram illustrating an overview of the learning device of the embodiment.
FIG. 5 is a diagram illustrating an example of the hardware configuration of the learning device of the embodiment.
FIG. 6 is a diagram illustrating an example of a control unit included in the learning device of the embodiment.
FIG. 7 is a flowchart illustrating an example of the flow of processing executed by the learning device of the embodiment.
FIG. 8 is a diagram illustrating an example of the hardware configuration of an estimation device of the embodiment.
FIG. 9 is a diagram illustrating an example of the functional configuration of a control unit included in the estimation device of the embodiment.
FIG. 10 is a flowchart illustrating an example of the flow of processing executed by the estimation device of the embodiment.

(Embodiment)
An overview of a learning device 1 according to an embodiment will be described with reference to FIGS. 1 to 4. The learning device 1 uses machine learning to update a mathematical model that estimates information common to two types of data (hereinafter referred to as a "multimodal estimation model").

Because the multimodal estimation model estimates information common to two types of data, when the two types of data indicate different attributes of the same object, the target of estimation is the object whose attributes the two types of data indicate. One of the two types of data indicating attributes of the same object is called first type data, and the other is called second type data. In other words, the first type data and the second type data are information on mutually different attributes of the same object.

Hereinafter, the object whose attributes the first type data and the second type data indicate is referred to as the main object. The main object may be any person, living thing, object, intangible object, or event, as long as it has multiple attributes.

As described above, when two types of data indicate different attributes of the same object, the multimodal estimation model estimates that object, so the main object is an example of a target of estimation by the multimodal estimation model. The first type data indicates a first attribute of the main object, and the second type data indicates a second attribute of the main object. The second attribute is an attribute of the main object that is different from the first attribute.

For example, when the first type data indicates the shape of an object, the second type data indicates the name of that object. That is, when the first attribute is shape, the second attribute is, for example, name. Thus, the difference in type between the first type data and the second type data is, specifically, a difference in the attribute each indicates.

A mathematical model is a set including one or more processes for which the conditions and order of execution (hereinafter referred to as "execution rules") are predetermined. Learning means updating a mathematical model by a machine learning method. Updating a mathematical model means suitably adjusting the values of the parameters in the mathematical model. Executing a mathematical model means executing each process included in the mathematical model in accordance with the execution rules.

The mathematical model is updated through learning until a predetermined termination condition for learning (hereinafter referred to as the "learning termination condition") is satisfied. The learning termination condition is, for example, that learning has been performed a predetermined number of times.
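The update-until-termination behavior described above can be sketched as a simple loop. This is a minimal illustration, not the patent's implementation: the names `train`, `update_step`, and `max_rounds`, and the toy gradient-descent update, are assumptions introduced here.

```python
def train(update_step, params, max_rounds):
    """Repeat update_step until the learning termination condition holds.

    The termination condition used here is the example given in the text:
    a predetermined number of learning rounds has been performed.
    """
    rounds = 0
    while rounds < max_rounds:  # learning termination condition not yet met
        params = update_step(params)  # one update of the model parameters
        rounds += 1
    return params, rounds


def toy_step(w):
    """Hypothetical update: one gradient-descent step on f(w) = (w - 3)^2."""
    gradient = 2.0 * (w - 3.0)
    return w - 0.1 * gradient


final_w, rounds_done = train(toy_step, 0.0, max_rounds=50)
print(rounds_done, round(final_w, 3))
```

Other termination conditions, such as the change in loss falling below a threshold, would replace only the `while` test.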

The learning device 1 includes two encoders and two decoders, and updates the processing of each encoder and each decoder through learning until the learning termination condition is satisfied. The neural network including these two encoders and two decoders is a neural network that represents the multimodal estimation model. Updating the processing of each encoder and each decoder is therefore an update of the multimodal estimation model.

A neural network is a circuit, such as an electronic circuit, an electrical circuit, an optical circuit, or an integrated circuit, that represents a mathematical model. Updating a neural network means updating the values of the parameters of the mathematical model that represents the neural network.

FIG. 1 is a first explanatory diagram illustrating an overview of the learning device 1 of the embodiment. FIG. 2 is a second explanatory diagram illustrating an overview of the learning device 1 of the embodiment. FIG. 3 is a third explanatory diagram illustrating an overview of the learning device 1 of the embodiment. FIG. 4 is a fourth explanatory diagram illustrating an overview of the learning device 1 of the embodiment.

The learning device 1 includes a first encoder 101, a second encoder 102, a first decoder 103, and a second decoder 104. The neural network including the first encoder 101, the second encoder 102, the first decoder 103, and the second decoder 104 (hereinafter referred to as the "multimodal network") is an example of a neural network that represents the multimodal estimation model. Updating the multimodal network therefore means updating the multimodal estimation model.

The first encoder 101, the second encoder 102, the first decoder 103, and the second decoder 104 are each a neural network that is updated by learning. In learning of the multimodal network, each of the four is updated so as to reduce the difference between the output of the first encoder 101 and the output of the second encoder 102.

The first encoder 101 acquires the feature of input data. Hereinafter, the feature acquired by the first encoder 101 is referred to as the first feature. For example, first type data (001) is input to the first encoder 101, in which case the first encoder 101 acquires the first feature of the first type data.

Third type data may be input to the first encoder 101. Third type data indicates either the first attribute or the second attribute of a person, living thing, object, intangible object, or event that is different from the main object (hereinafter referred to as a "sub-object"). The sub-object may or may not have multiple attributes.

When third type data is input to the first encoder 101, the first encoder 101 acquires the first feature of the third type data. FIG. 3 shows first type data as an example of the data input to the first encoder 101, but in FIG. 3 the input to the first encoder 101 is not limited to first type data and may be third type data.

The second encoder 102 acquires the feature of input data. Hereinafter, the feature acquired by the second encoder 102 is referred to as the second feature. For example, second type data (002) is input to the second encoder 102, in which case the second encoder 102 acquires the second feature of the second type data.

Third type data may be input to the second encoder 102. When third type data is input to the second encoder 102, the second encoder 102 acquires the second feature of the third type data. FIG. 4 shows second type data as an example of the data input to the second encoder 102, but in FIG. 4 the input to the second encoder 102 is not limited to second type data and may be third type data.

The first decoder 103 decodes input data. Hereinafter, the result of decoding by the first decoder 103 is referred to as a first decoding result. The first decoder 103 decodes the first feature, for example, as shown in FIG. 1. Hereinafter, the result of the first decoder 103 decoding the first feature is referred to as the first decoding result of the first feature. The first decoder 103 decodes the second feature, for example, as shown in FIG. 2 or FIG. 3. Hereinafter, the result of the first decoder 103 decoding the second feature is referred to as the first decoding result of the second feature.

The second decoder 104 decodes input data. Hereinafter, the result of decoding by the second decoder 104 is referred to as a second decoding result. The second decoder 104 decodes the second feature, for example, as shown in FIG. 1. Hereinafter, the result of the second decoder 104 decoding the second feature is referred to as the second decoding result of the second feature. The second decoder 104 decodes the first feature, for example, as shown in FIG. 2 or FIG. 4. Hereinafter, the result of the second decoder 104 decoding the first feature is referred to as the second decoding result of the first feature.

The second encoder 102 acquires the second decoding result of the first feature, for example, as shown in FIG. 3. The first encoder 101 acquires the first decoding result of the second feature, for example, as shown in FIG. 4.
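The data paths named above can be traced with toy stand-ins for the four networks. In this sketch each encoder and decoder is a fixed elementwise scaling, which is purely an assumption for illustration; `make_scale` and the dictionary keys are hypothetical names, and the actual networks are learned neural networks.

```python
def make_scale(factor):
    """Return a toy 'network' that multiplies every element by factor."""
    return lambda vec: [factor * v for v in vec]


E1, D1 = make_scale(0.5), make_scale(2.0)   # first encoder 101, first decoder 103
E2, D2 = make_scale(0.25), make_scale(4.0)  # second encoder 102, second decoder 104

d1 = [1.0, 2.0]  # first type data
d2 = [2.0, 4.0]  # second type data

z1 = E1(d1)  # first feature
z2 = E2(d2)  # second feature

paths = {
    "first decoding result of the first feature": D1(z1),
    "second decoding result of the second feature": D2(z2),
    "first decoding result of the second feature": D1(z2),
    "second decoding result of the first feature": D2(z1),
    # The encoders can also take a cross-decoded result as input.
    "second encoder applied to D2(z1)": E2(D2(z1)),
    "first encoder applied to D1(z2)": E1(D1(z2)),
}

for name, value in paths.items():
    print(name, value)
```

The toy factors are chosen so that both features coincide, which is the situation the training is meant to approach.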

When a pair of first type data and second type data is input as learning data, the learning device 1 performs learning using a linked loss function as the loss function. When only one of first type data, second type data, or third type data is input as learning data, the learning device 1 performs learning using a non-linked loss function as the loss function. Learning data is data used for learning by the learning device 1. Learning is performed so that the difference indicated by the loss function becomes smaller.
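The choice between the two loss functions can be sketched as a small dispatch on which kinds of data a piece of learning data contains. The dictionary layout with keys `"first"`, `"second"`, and `"third"` and the function name `select_loss_function` are assumptions made for this illustration.

```python
def select_loss_function(sample):
    """Return which loss function applies to one piece of learning data.

    sample is a hypothetical dict whose keys name the kinds of data present;
    a value of None means that kind of data is absent.
    """
    present = {kind for kind, value in sample.items() if value is not None}
    if present == {"first", "second"}:
        return "linked"        # paired first type and second type data
    if present in ({"first"}, {"second"}, {"third"}):
        return "non_linked"    # only one kind of data is available
    raise ValueError(f"unsupported combination of data: {sorted(present)}")


print(select_loss_function({"first": [1.0], "second": [2.0]}))
print(select_loss_function({"third": [3.0]}))
```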

<About the linked loss function>
The linked loss function includes a reconstruction loss, a common loss, and a cross reconstruction loss. The reconstruction loss includes a first sub-reconstruction loss and a second sub-reconstruction loss. The first sub-reconstruction loss indicates the difference between the first type data and the first decoding result of the first type first feature. The first type first feature is the first feature of the first type data. The second sub-reconstruction loss indicates the difference between the second type data and the second decoding result of the second type second feature. Likewise, the second type second feature is the second feature of the second type data.

The reconstruction loss is expressed, for example, by the following equation (1).

d_1 denotes the first type data. d_2 denotes the second type data. E_1 denotes the encoding performed by the first encoder 101. E_2 denotes the encoding performed by the second encoder 102. D_1 denotes the decoding performed by the first decoder 103. D_2 denotes the decoding performed by the second decoder 104. The first term on the right side of equation (1) is an example of the first sub-reconstruction loss. The second term on the right side of equation (1) is an example of the second sub-reconstruction loss.
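Equation (1) itself is not reproduced in this text. From the term descriptions above, one form consistent with them is the following, where the squared Euclidean norm as the measure of difference and the symbol L_rec are assumptions introduced here and are not confirmed by the source:

```latex
\mathcal{L}_{\mathrm{rec}}
  = \left\| d_1 - D_1\bigl(E_1(d_1)\bigr) \right\|_2^2
  + \left\| d_2 - D_2\bigl(E_2(d_2)\bigr) \right\|_2^2
\tag{1}
```

The first term compares the first type data with the first decoding result of the first type first feature, and the second term compares the second type data with the second decoding result of the second type second feature.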

Equation (1) indicates, for the autoencoder formed by the first encoder 101 and the first decoder 103 and for the autoencoder formed by the second encoder 102 and the second decoder 104, the difference between each autoencoder's output and the data input to it. Accordingly, when learning is performed so as to reduce the reconstruction loss, each autoencoder is updated so that the difference between its input data and its output becomes smaller.

Hereinafter, the autoencoder formed by the first encoder 101 and the first decoder 103 is referred to as the first autoencoder, and the autoencoder formed by the second encoder 102 and the second decoder 104 is referred to as the second autoencoder.
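As a rough illustration of the reconstruction loss over the first and second autoencoders, the sketch below sums the two sub-reconstruction losses. The squared Euclidean distance, the helper names, and the toy scaling networks are all assumptions; the source does not state the distance measure.

```python
def squared_distance(a, b):
    """Assumed difference measure: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_loss(d1, d2, E1, D1, E2, D2):
    """Sum of the first and second sub-reconstruction losses."""
    first_sub = squared_distance(d1, D1(E1(d1)))   # first autoencoder
    second_sub = squared_distance(d2, D2(E2(d2)))  # second autoencoder
    return first_sub + second_sub


def make_scale(factor):
    """Toy 'network' that multiplies every element by factor."""
    return lambda vec: [factor * v for v in vec]


# The first autoencoder is deliberately imperfect; the second reconstructs exactly.
loss = reconstruction_loss(
    [2.0, 4.0], [2.0, 4.0],
    make_scale(0.5), make_scale(1.0),
    make_scale(0.5), make_scale(2.0),
)
print(loss)
```

Only the first autoencoder contributes here, because the second one reproduces its input exactly.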

The common loss indicates the difference between the first feature and the second feature. The common loss is expressed, for example, by the following equation (2).

Equation (2) indicates the difference between the intermediate representation of the first autoencoder (that is, the first feature) and the intermediate representation of the second autoencoder (that is, the second feature). Accordingly, when learning is performed so as to reduce the common loss, the first autoencoder and the second autoencoder are updated so that the difference between the first feature and the second feature becomes smaller.
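Equation (2) is likewise not reproduced in this text. Using E_1, E_2, d_1, and d_2 as defined for equation (1), one form consistent with the description is the following, where the squared Euclidean norm and the symbol L_com are assumptions introduced here:

```latex
\mathcal{L}_{\mathrm{com}}
  = \left\| E_1(d_1) - E_2(d_2) \right\|_2^2
\tag{2}
```

The loss is zero exactly when the first feature of the first type data and the second feature of the second type data coincide.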

As a result, learning that reduces the common loss increases the accuracy with which the multimodal estimation model estimates the information common to the first type data and the second type data, which indicate different attributes. As described above, the first type data and the second type data both indicate attributes of the main object. Therefore, the more accurately the multimodal estimation model estimates the information common to the first type data and the second type data, the more accurately it can estimate the main object.

The cross reconstruction loss includes a first sub-cross reconstruction loss and a second sub-cross reconstruction loss. The first sub-cross reconstruction loss indicates the difference between the first type data and the first decoding result of the second type second feature. The second sub-cross reconstruction loss indicates the difference between the second type data and the second decoding result of the first type first feature.

The cross reconstruction loss is expressed, for example, by the following equation (3).

The first term on the right side of equation (3) is an example of the first sub-cross reconstruction loss. The second term on the right side of equation (3) is an example of the second sub-cross reconstruction loss.
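Equation (3) is not reproduced in this text either. Following the definitions of the two sub-cross reconstruction losses, one form consistent with them is the following, where the squared Euclidean norm and the symbol L_cross are again assumptions introduced here:

```latex
\mathcal{L}_{\mathrm{cross}}
  = \left\| d_1 - D_1\bigl(E_2(d_2)\bigr) \right\|_2^2
  + \left\| d_2 - D_2\bigl(E_1(d_1)\bigr) \right\|_2^2
\tag{3}
```

The first term decodes the second type second feature with the first decoder and compares the result with the first type data; the second term decodes the first type first feature with the second decoder and compares the result with the second type data.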

第１種データと第２種データとはどちらも主対象に関するデータであるので、学習が進めば第１種第１特徴量と第２種第２特徴量とは略同一であるはずになるはずである。第１種第１特徴量と第２種第２特徴量とが略同一であるならば、第１種第１特徴量のデコードを第１デコーダ１０３に代えて第２デコーダ１０４で実行したとしても、第１種第１特徴量の第２デコード結果は第１データに略同一であるはずである。 Because both the first type data and the second type data are data related to the main subject, the first type first feature and the second type second feature should be substantially identical as learning progresses. If the first type first feature and the second type second feature are substantially identical, even if the decoding of the first type first feature is performed by the second decoder 104 instead of the first decoder 103, the second decoding result of the first type first feature should be substantially identical to the first data.

また、第１種第１特徴量と第２種第２特徴量とが略同一であるならば、第２種第２特徴量のデコードを第２デコーダ１０４に代えて第１デコーダ１０３で実行したとしても、第２種第２特徴量の第１デコード結果は第１種データに略同一であるはずである。したがって、交差再構成損失が大きい場合には、マルチモーダル推定モデルの推定の精度が良くないことを意味する。そのため、交差再構成損失を小さくするように学習が行われることで、マルチモーダル推定モデルの推定の精度が高まる。 Likewise, if the first type first feature and the second type second feature are substantially identical, then even if the second type second feature is decoded by the first decoder 103 instead of the second decoder 104, the first decoding result of the second type second feature should be substantially identical to the first type data. A large cross reconstruction loss therefore means that the estimation accuracy of the multimodal estimation model is poor, and learning that reduces the cross reconstruction loss improves that accuracy.
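The cross reconstruction loss described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the encoders and decoders are toy functions (the second modality is simply the first one scaled by 2), the squared error is an assumed difference measure, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Codec = Callable[[Vector], List[float]]


def squared_error(a: Vector, b: Vector) -> float:
    """Sum of squared element-wise differences (assumed difference measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_reconstruction_loss(x1: Vector, x2: Vector,
                              enc1: Codec, enc2: Codec,
                              dec1: Codec, dec2: Codec) -> float:
    z1 = enc1(x1)  # first type first feature
    z2 = enc2(x2)  # second type second feature
    # First sub-cross term: first type data vs. first decoding of z2.
    first_sub = squared_error(x1, dec1(z2))
    # Second sub-cross term: second type data vs. second decoding of z1.
    second_sub = squared_error(x2, dec2(z1))
    return first_sub + second_sub


# Toy stand-ins for encoders 101/102 and decoders 103/104 (hypothetical).
enc1 = lambda x: list(x)
dec1 = lambda z: list(z)
enc2 = lambda x: [v / 2 for v in x]
dec2 = lambda z: [v * 2 for v in z]

# A pair whose features coincide gives zero loss; a mismatched pair does not.
aligned = cross_reconstruction_loss([1.0, 2.0], [2.0, 4.0], enc1, enc2, dec1, dec2)
misaligned = cross_reconstruction_loss([1.0, 2.0], [2.0, 5.0], enc1, enc2, dec1, dec2)
```

In this toy setting the loss is zero exactly when the two features coincide, which matches the argument above that a large cross reconstruction loss signals poorly aligned features.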

このように、紐づき損失関数は、第ｉ副再構成損失と、共通損失と、第ｉ副交差再構成損失と、を含む。なお、ｉは１又は２である。より具体的には、紐づき損失関数は、第１再構成損失（００３）と、第２再構成損失（００４）と、共通損失（００５）と、第１交差再構成損失（００６）と、第２交差再構成損失（００７）とを含む。 Thus, the linking loss function includes the i-th sub-reconstruction loss, the common loss, and the i-th sub-cross reconstruction loss, where i is 1 or 2. More specifically, the linking loss function includes the first reconstruction loss (003), the second reconstruction loss (004), the common loss (005), the first cross reconstruction loss (006), and the second cross reconstruction loss (007).

第ｉ副再構成損失は、第ｉ種データと、第ｉ種第ｉ特徴量の第ｉデコード結果と、の違いを示す。第ｉ種第ｉ特徴量は、第ｉ種データの第ｉ特徴量である。第ｉ副交差再構成損失は、第ｉ種データと、第ｊ種第ｊ特徴量の第ｉデコード結果と、の違いを示す。なお、ｊは１又は２であり、ｊとｉとは互いに異なる値を示す。すなわち、ｉが１の場合にはｊは２であり、ｉが２の場合にはｊは１である。 The i-th sub-reconstruction loss indicates the difference between the i-th type data and the i-th decoding result of the i-th type i-th feature. The i-th type i-th feature is the i-th feature of the i-th type data. The i-th sub-cross reconstruction loss indicates the difference between the i-th type data and the i-th decoding result of the j-th type j-th feature. Note that j is 1 or 2, and j and i take different values. That is, when i is 1, j is 2, and when i is 2, j is 1.
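As a rough sketch of how the components listed above could be combined, the following Python computes each term of the linked loss function for one pair of data. Several points are assumptions made only for this illustration: the common loss is taken to be a distance between the first type first feature and the second type second feature, every term uses the squared error, the terms are summed without weights, and the toy encoders and decoders are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Codec = Callable[[Vector], List[float]]


def squared_error(a: Vector, b: Vector) -> float:
    """Sum of squared element-wise differences (assumed difference measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def linked_loss_terms(x1: Vector, x2: Vector,
                      enc1: Codec, enc2: Codec,
                      dec1: Codec, dec2: Codec) -> Dict[str, float]:
    z1, z2 = enc1(x1), enc2(x2)
    return {
        "reconstruction_1": squared_error(x1, dec1(z1)),
        "reconstruction_2": squared_error(x2, dec2(z2)),
        "common": squared_error(z1, z2),  # assumed form of the common loss
        "cross_reconstruction_1": squared_error(x1, dec1(z2)),
        "cross_reconstruction_2": squared_error(x2, dec2(z1)),
    }


def linked_loss(terms: Dict[str, float]) -> float:
    """Unweighted sum of the five terms (the weighting is an assumption)."""
    return sum(terms.values())


# Toy encoders/decoders (hypothetical): modality 2 is modality 1 scaled by 2.
enc1 = lambda x: list(x)
dec1 = lambda z: list(z)
enc2 = lambda x: [v / 2 for v in x]
dec2 = lambda z: [v * 2 for v in z]

terms = linked_loss_terms([1.0, 2.0], [2.0, 6.0], enc1, enc2, dec1, dec2)
total = linked_loss(terms)
```

Here both plain reconstruction terms are zero because each toy decoder inverts its own encoder, so the whole loss comes from the mismatch between the two features.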

＜非紐づき損失関数について＞ <About the unlinked loss function>
非紐づき損失関数は、交差自己無撞着損失を含む。交差自己無撞着損失は、第１副交差自己無撞着損失と第２副交差自己無撞着損失とを含む。 The unlinked loss function includes a cross self-consistency loss, which in turn includes a first sub-cross self-consistency loss and a second sub-cross self-consistency loss.

第１副交差自己無撞着損失は、第１エンコーダ１０１に入力されたデータ（以下「第１入力データ」という。）と、第１入力データの第１交差自己無撞着データと、の違いを示す。第１交差自己無撞着データは、第２交差デコード結果の第２特徴量を第１デコーダ１０３がデコードした結果である。第２交差デコード結果は、第１入力データの第１特徴量を第２デコーダ１０４がデコードした結果である。 The first sub-cross self-consistency loss indicates the difference between the data input to the first encoder 101 (hereinafter referred to as "first input data") and the first cross self-consistent data of the first input data. The first cross self-consistent data is the result of the first decoder 103 decoding the second feature of the second cross decoding result. The second cross decoding result is the result of the second decoder 104 decoding the first feature of the first input data.

第２副交差自己無撞着損失は、第２エンコーダ１０２に入力されたデータ（以下「第２入力データ」という。）と、第２入力データの第２交差自己無撞着データと、の違いを示す。第２交差自己無撞着データは、第１交差デコード結果の第１特徴量を第２デコーダ１０４がデコードした結果である。第１交差デコード結果は、第２入力データの第２特徴量を第１デコーダ１０３がデコードした結果である。 The second sub-cross self-consistency loss indicates the difference between the data input to the second encoder 102 (hereinafter referred to as "second input data") and the second cross self-consistent data of the second input data. The second cross self-consistent data is the result of the second decoder 104 decoding the first feature of the first cross decoding result. The first cross decoding result is the result of the first decoder 103 decoding the second feature of the second input data.

なお、第１入力データは、第１属性を示すデータであればどのようなものであってもよく、第１種データであってもよいし、副対象の第１属性を示す第３種データであってもよい。また、第２入力データは、第２属性を示すデータであればどのようなものであってもよく、第２種データであってもよいし、副対象の第２属性を示す第３種データであってもよい。すなわち、第ｉ入力データは、第ｉ属性を示すデータであればどのようなものであってもよい。 The first input data may be any type of data indicating the first attribute, and may be first type data or third type data indicating the first attribute of the secondary object. The second input data may be any type of data indicating the second attribute, and may be second type data or third type data indicating the second attribute of the secondary object. In other words, the i-th input data may be any type of data indicating the i-th attribute.

このように交差自己無撞着損失は第ｉ副交差自己無撞着損失を含む。第ｉ副交差自己無撞着損失は、第ｉエンコーダに入力されたデータである第ｉ入力データと、第ｉ入力データの第ｉ交差自己無撞着データと、の違いを示す。第ｉ交差自己無撞着データは、第ｊ交差デコード結果の第ｊ特徴量を第ｉデコーダがデコードした結果である。第ｊ交差デコード結果は、第ｉ入力データの第ｉ特徴量を第ｊデコーダがデコードした結果である。 In this way, the cross self-consistency loss includes the i-th sub-cross self-consistency loss. The i-th sub-cross self-consistency loss indicates the difference between the i-th input data, which is the data input to the i-th encoder, and the i-th cross self-consistent data of the i-th input data. The i-th cross self-consistent data is the result of the j-th feature of the j-th cross decoding result being decoded by the i-th decoder. The j-th cross decoding result is the result of the i-th feature of the i-th input data being decoded by the j-th decoder.

なお、第ｉエンコーダは、ｉ＝１の場合には第１エンコーダ１０１を意味し、ｉ＝２の場合には第２エンコーダ１０２を意味する。第ｊエンコーダは、ｊ＝１の場合には第１エンコーダ１０１を意味し、ｊ＝２の場合には第２エンコーダ１０２を意味する。なお、第ｉデコーダは、ｉ＝１の場合には第１デコーダ１０３を意味し、ｉ＝２の場合には第２デコーダ１０４を意味する。第ｊデコーダは、ｊ＝１の場合には第１デコーダ１０３を意味し、ｊ＝２の場合には第２デコーダ１０４を意味する。 Note that the i-th encoder means the first encoder 101 when i=1, and the second encoder 102 when i=2. The j-th encoder means the first encoder 101 when j=1, and the second encoder 102 when j=2. Note that the i-th decoder means the first decoder 103 when i=1, and the second decoder 104 when i=2. The j-th decoder means the first decoder 103 when j=1, and the second decoder 104 when j=2.

交差自己無撞着損失は、例えば以下の式（４）で表される。 The cross self-consistency loss is expressed, for example, by the following equation (4).

式（４）の右辺の第１項が第１副交差自己無撞着損失の一例である。式（４）の右辺の第２項が第２副交差自己無撞着損失の一例である。 The first term on the right side of equation (4) is an example of the first sub-cross self-consistency loss. The second term on the right side of equation (4) is an example of the second sub-cross self-consistency loss.
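Equation (4) itself is not reproduced in this text. As an illustration only, the following is one form that would be consistent with the definitions of the two sub-cross self-consistency losses given above. The symbols are assumptions introduced here: x_1 and x_2 denote the first input data and the second input data, E_1 and E_2 the first encoder 101 and the second encoder 102, D_1 and D_2 the first decoder 103 and the second decoder 104, and the squared Euclidean norm stands in for whatever difference measure the original equation uses.

```latex
% Hypothetical sketch only; not the patent's equation (4).
% First term: x_1 -> E_1 -> D_2 (second cross decoding result) -> E_2 -> D_1.
% Second term: x_2 -> E_2 -> D_1 (first cross decoding result) -> E_1 -> D_2.
\mathcal{L}_{\mathrm{csc}}
  = \bigl\lVert x_1 - D_1\bigl(E_2\bigl(D_2\bigl(E_1(x_1)\bigr)\bigr)\bigr) \bigr\rVert_2^2
  + \bigl\lVert x_2 - D_2\bigl(E_1\bigl(D_1\bigl(E_2(x_2)\bigr)\bigr)\bigr) \bigr\rVert_2^2
```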

なお、第ｉ入力データが第ｉエンコーダに入力される場合の第ｊ交差自己無撞着損失の値は０である。すなわち、第１入力データがマルチモーダル推定モデルに入力される場合には式（４）の右辺の第２項の値は０であり、第２入力データがマルチモーダル推定モデルに入力される場合には式（４）の右辺の第１項の値は０である。 Note that the value of the jth cross self-consistency loss when the ith input data is input to the ith encoder is 0. That is, when the first input data is input to the multimodal estimation model, the value of the second term on the right side of equation (4) is 0, and when the second input data is input to the multimodal estimation model, the value of the first term on the right side of equation (4) is 0.
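The round trip behind the cross self-consistency loss, including the rule that the term for the absent input is 0, can be sketched as follows. This is a minimal illustration under assumptions: the squared error is an assumed difference measure, the encoders and decoders are toy functions held in dictionaries keyed by 1 and 2, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]
Codec = Callable[[Vector], List[float]]


def squared_error(a: Vector, b: Vector) -> float:
    """Sum of squared element-wise differences (assumed difference measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sub_cross_self_consistency(x: Vector, i: int,
                               encoders: Dict[int, Codec],
                               decoders: Dict[int, Codec]) -> float:
    """i-th sub-cross self-consistency loss for data fed to the i-th encoder."""
    j = 2 if i == 1 else 1
    z_i = encoders[i](x)        # i-th feature of the i-th input data
    cross = decoders[j](z_i)    # j-th cross decoding result
    z_j = encoders[j](cross)    # j-th feature of the j-th cross decoding result
    cycled = decoders[i](z_j)   # i-th cross self-consistent data
    return squared_error(x, cycled)


def cross_self_consistency_loss(x1: Optional[Vector], x2: Optional[Vector],
                                encoders: Dict[int, Codec],
                                decoders: Dict[int, Codec]) -> float:
    """Sum of both sub-terms; the term for a missing input contributes 0."""
    total = 0.0
    if x1 is not None:
        total += sub_cross_self_consistency(x1, 1, encoders, decoders)
    if x2 is not None:
        total += sub_cross_self_consistency(x2, 2, encoders, decoders)
    return total


# Toy networks (hypothetical): modality 2 is modality 1 scaled by 2.
encoders = {1: lambda x: list(x), 2: lambda x: [v / 2 for v in x]}
decoders = {1: lambda z: list(z), 2: lambda z: [v * 2 for v in z]}
consistent = cross_self_consistency_loss([1.0, 2.0], None, encoders, decoders)

# A second decoder that scales by 3 breaks the round trip.
bad_decoders = {1: decoders[1], 2: lambda z: [v * 3 for v in z]}
inconsistent = cross_self_consistency_loss([1.0, 2.0], None, encoders, bad_decoders)
```

Only first input data is supplied in both calls, so the second term is skipped, mirroring the statement that its value is 0 in that case.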

図１及び図２の例では、第１種データと第２種データとの組がマルチモーダルネットワークに入力されていた。しかしながら、必ずしもユーザが第１種データと第２種データとの両方を用意できない場合もある。さらには、ユーザは、第１属性又は第２属性のいずれか一方を示すデータではあるものの主対象のデータではない、というデータを用意する場合もある。このような場合であっても、推定の精度が高まるようにマルチモーダルネットワークを更新することを可能にするのが、交差自己無撞着損失である。 In the examples of Figures 1 and 2, a pair of first type data and second type data was input to the multimodal network. However, the user cannot always prepare both the first type data and the second type data. The user may also prepare data that indicates either the first attribute or the second attribute but is not data of the main object. Even in such cases, the cross self-consistency loss makes it possible to update the multimodal network so that the estimation accuracy improves.

式（４）が示すように交差自己無撞着損失は、マルチモーダルネットワークに入力された１つのデータ（以下「単入力データ」という。）の自己無撞着結果と、単入力データとの違いを示す。第１入力データと第２入力データとのそれぞれは、単入力データの一例である。以下、単入力データとして第１入力データを例に交差自己無撞着損失を用いることの効果を説明する。 As equation (4) shows, the cross self-consistency loss indicates the difference between the self-consistency result of a single piece of data input to the multimodal network (hereinafter referred to as "single-input data") and the single-input data itself. The first input data and the second input data are each an example of single-input data. The effect of using the cross self-consistency loss is explained below, taking the first input data as the single-input data.

＜交差自己無撞着損失の奏する効果＞ <Effect of the cross self-consistency loss>
自己無撞着結果は、単入力データが、第１エンコーダ１０１、第２エンコーダ１０２、第１デコーダ１０３及び第２デコーダ１０４で変換された結果である。マルチモーダル推定モデルの推定の精度が高まれば第１特徴量と第２特徴量とは略同一になるはずであり、第１特徴量と第２特徴量とは略同一であるならば、第１特徴量の第２デコード結果の第２特徴量も略同一のはずである。その結果、第１入力データの第１特徴量から得られた第２デコード結果の第２特徴量を第１デコーダ１０３でデコードした結果は、第１入力データに略同一であるはずである。したがって、交差自己無撞着損失を小さくするように学習が行われることで、マルチモーダル推定モデル（すなわちマルチモーダルネットワーク）は、推定の精度が高まるように更新される。 The self-consistency result is the result of the single-input data being transformed by the first encoder 101, the second encoder 102, the first decoder 103, and the second decoder 104. As the estimation accuracy of the multimodal estimation model improves, the first feature and the second feature should become substantially identical, and if they are substantially identical, the second feature of the second decoding result of the first feature should also be substantially identical to them. Consequently, the result of the first decoder 103 decoding the second feature of the second decoding result obtained from the first feature of the first input data should be substantially identical to the first input data. Therefore, learning that reduces the cross self-consistency loss updates the multimodal estimation model (that is, the multimodal network) so that its estimation accuracy improves.

このように非紐づき損失関数は、交差自己無撞着損失を含む。より具体的には、非紐づき損失関数は、第１副交差自己無撞着損失（００８）と、第２副交差自己無撞着損失（００９）とを含む。図３又は図４が示すように、非紐づき損失関数は、第１種データ、第２種データ又は第３種データのいずれか１つが得られれば値が得られる。したがって、非紐づき損失関数を用いることで、第１種データと第２種データとの２種類のデータを用いることなくマルチモーダル推定モデルの更新が可能である。 In this way, the unlinked loss function includes the cross self-consistency loss. More specifically, the unlinked loss function includes the first sub-cross self-consistency loss (008) and the second sub-cross self-consistency loss (009). As shown in FIG. 3 or FIG. 4, a value of the unlinked loss function can be obtained as long as any one of the first type data, the second type data, or the third type data is available. Therefore, using the unlinked loss function makes it possible to update the multimodal estimation model without using both the first type data and the second type data.

ここまで説明してきたように、学習装置１は、第１種データと第２種データとの組が入力される場合には紐づき損失関数を用いてマルチモーダルネットワークの更新が可能である。また、学習装置１は、第１種データ、第２種データ又は第３種データのいずれか一種のみが入力される場合であっても非紐づき損失関数を用いることでマルチモーダルネットワークの更新が可能である。As described above, the learning device 1 can update the multimodal network using a linked loss function when a pair of first type data and second type data is input. Also, the learning device 1 can update the multimodal network by using a non-linked loss function even when only one of the first type data, second type data, and third type data is input.

したがって学習装置１は、複数回の学習のうちの一部の学習において第１種データ、第２種データ又は第３種データのいずれか１つだけを用いた学習を行ったとしても、マルチモーダルネットワークを推定の精度が高まるように更新することが可能である。複数回の学習における他の一部の学習においては、第１種データと第２種データとの組を用いた学習が行われることで、学習装置１はマルチモーダルネットワークを推定の精度が高まるように更新する。Therefore, even if the learning device 1 performs learning using only one of the first type data, the second type data, or the third type data in some of the multiple learnings, it is possible to update the multimodal network to improve the accuracy of estimation. In other parts of the multiple learnings, learning is performed using a pair of the first type data and the second type data, and the learning device 1 updates the multimodal network to improve the accuracy of estimation.

図５は、実施形態における学習装置１のハードウェア構成の一例を示す図である。学習装置１は、バスで接続されたＣＰＵ（Central Processing Unit）等のプロセッサ９１とメモリ９２とを備える制御部１１を備え、プログラムを実行する。学習装置１は、プログラムの実行によって制御部１１、入力部１２、通信部１３、記憶部１４及び出力部１５を備える装置として機能する。 Figure 5 is a diagram showing an example of the hardware configuration of the learning device 1 in an embodiment. The learning device 1 has a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.

より具体的には、学習装置１は、プロセッサ９１が記憶部１４に記憶されているプログラムを読み出し、読み出したプログラムをメモリ９２に記憶させる。プロセッサ９１が、メモリ９２に記憶させたプログラムを実行することによって、学習装置１は、制御部１１、入力部１２、通信部１３、記憶部１４及び出力部１５を備える装置として機能する。More specifically, in the learning device 1, the processor 91 reads out a program stored in the storage unit 14 and stores the read out program in the memory 92. When the processor 91 executes the program stored in the memory 92, the learning device 1 functions as a device including a control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.

制御部１１は、学習装置１が備える各種機能部の動作を制御する。制御部１１は、例えばマルチモーダル推定モデルの学習を行う。The control unit 11 controls the operation of various functional units of the learning device 1. The control unit 11, for example, performs learning of a multimodal estimation model.

入力部１２は、例えばマウスやキーボード、タッチパネル等の入力装置を含んで構成される。入力部１２は、これらの入力装置を学習装置１に接続するインタフェースを含んで構成されてもよい。The input unit 12 includes input devices such as a mouse, a keyboard, a touch panel, etc. The input unit 12 may include an interface that connects these input devices to the learning device 1.

通信部１３は、学習装置１を外部装置に接続するためのインタフェースを含んで構成される。通信部１３は、有線又は無線を介して外部装置と通信する。外部装置は、例えば第１属性を示す第３種データの送信元の装置である。通信部１３は、第１属性を示す第３種データの送信元の装置との通信によって、第１属性を示す第３種データを取得する。外部装置は、例えば第２属性を示す第３種データの送信元の装置である。通信部１３は、第２属性を示す第３種データの送信元の装置との通信によって、第２属性を示す第３種データを取得する。The communication unit 13 includes an interface for connecting the learning device 1 to an external device. The communication unit 13 communicates with the external device via wired or wireless communication. The external device is, for example, a device that transmits third type data indicating the first attribute. The communication unit 13 acquires the third type data indicating the first attribute by communicating with the device that transmits the third type data indicating the first attribute. The external device is, for example, a device that transmits third type data indicating the second attribute. The communication unit 13 acquires the third type data indicating the second attribute by communicating with the device that transmits the third type data indicating the second attribute.

外部装置は、例えば第１種データの送信元の装置である。通信部１３は、第１種データの送信元の装置との通信によって、第１種データを取得する。外部装置は、例えば第２種データの送信元の装置である。通信部１３は、第２種データの送信元の装置との通信によって、第２種データを取得する。 The external device is, for example, a device that transmits the first type of data. The communication unit 13 acquires the first type of data by communicating with the device that transmits the first type of data. The external device is, for example, a device that transmits the second type of data. The communication unit 13 acquires the second type of data by communicating with the device that transmits the second type of data.

記憶部１４は、磁気ハードディスク装置や半導体記憶装置などのコンピュータ読み出し可能な記憶媒体装置を用いて構成される。記憶部１４は、学習装置１に関する各種情報を記憶する。記憶部１４は、例えば制御部１１が実行する処理の結果生じた各種情報を記憶する。 The storage unit 14 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various information related to the learning device 1. The storage unit 14 stores, for example, various information generated as a result of processing executed by the control unit 11.

出力部１５は、例えばＣＲＴ（Cathode Ray Tube）ディスプレイや液晶ディスプレイ、有機ＥＬ（Electro-Luminescence）ディスプレイ等の表示装置を含んで構成される。出力部１５は、これらの表示装置を学習装置１に接続するインタフェースを含んで構成されてもよい。The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may include an interface that connects these display devices to the learning device 1.

図６は、実施形態の学習装置１が備える制御部１１の機能構成の一例を示す図である。制御部１１は、データ取得部１１１、データ入力部１１２、学習部１１３、通信制御部１１４、記憶制御部１１５及び出力制御部１１６を備える。 Figure 6 is a diagram showing an example of the functional configuration of the control unit 11 provided in the learning device 1 of the embodiment. The control unit 11 includes a data acquisition unit 111, a data input unit 112, a learning unit 113, a communication control unit 114, a storage control unit 115, and an output control unit 116.

データ取得部１１１は、通信部１３に入力されたデータを取得する。データ取得部１１１の取得するデータの候補は、具体的には、第１種データ、第２種データ及び第３種データである。The data acquisition unit 111 acquires data input to the communication unit 13. Specifically, the candidates for data acquired by the data acquisition unit 111 are first type data, second type data, and third type data.

データ入力部１１２は、データ取得部１１１の取得したデータを、各データが示す属性に応じた出力先に出力する。データ入力部１１２は、例えばデータ取得部１１１の取得した第１種データを第１エンコーダ１０１に出力する。データ入力部１１２は、例えばデータ取得部１１１の取得した第２種データを第２エンコーダ１０２に出力する。データ入力部１１２は、例えばデータ取得部１１１の取得した第３種データであって第１属性を示す第３種データを第１エンコーダ１０１に出力する。The data input unit 112 outputs the data acquired by the data acquisition unit 111 to an output destination according to the attribute indicated by each data. For example, the data input unit 112 outputs the first type of data acquired by the data acquisition unit 111 to the first encoder 101. For example, the data input unit 112 outputs the second type of data acquired by the data acquisition unit 111 to the second encoder 102. For example, the data input unit 112 outputs the third type of data acquired by the data acquisition unit 111, which indicates a first attribute, to the first encoder 101.

データ入力部１１２は、例えばデータ取得部１１１の取得した第３種データであって第２属性を示す第３種データを第２エンコーダ１０２に出力する。第１種データ、第２種データ、第３種データの各データは、各データの示す属性を示す情報を有する。属性を示す情報は、予め定められた規則で属性を示す情報であればどのような情報であってもよい。属性を示す情報は、例えばデータの形式の違いで属性の違いを表現する情報である。データの形式の違いは、例えば画像データとテキストデータ等のデータの形式の違いである。The data input unit 112 outputs, for example, third type data acquired by the data acquisition unit 111, which indicates a second attribute, to the second encoder 102. Each of the first type data, second type data, and third type data has information indicating the attribute indicated by each data. The information indicating the attribute may be any information as long as it indicates the attribute according to a predetermined rule. The information indicating the attribute is, for example, information that expresses a difference in attribute by a difference in data format. The difference in data format is, for example, a difference in data format between image data and text data, etc.

このように、各データの示す属性は、属性を示す情報によって示されている。したがって、データ入力部１１２は、各データの属性を、属性を示す情報の違いに基づいて判定することができる。その結果、データ入力部１１２は、データ取得部１１１の取得したデータを、各データが示す属性に応じた出力先に出力することができる。なお、各データの属性を示す情報は、入力部１２又は通信部１３にユーザが入力してもよい。In this way, the attributes indicated by each piece of data are indicated by the information indicating the attributes. Therefore, the data input unit 112 can determine the attributes of each piece of data based on differences in the information indicating the attributes. As a result, the data input unit 112 can output the data acquired by the data acquisition unit 111 to an output destination according to the attributes indicated by each piece of data. Note that the information indicating the attributes of each piece of data may be input by the user to the input unit 12 or the communication unit 13.
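The routing performed by the data input unit 112 can be sketched as a small dispatch on the attribute information carried by each piece of data. This is a minimal sketch under assumptions: the `Sample` record, its field names, and the returned labels are hypothetical and serve only to illustrate that the destination depends on the attribute (first or second) and not on whether the data is first, second, or third type data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one acquired piece of data."""
    payload: object
    attribute: int   # 1 = first attribute, 2 = second attribute
    data_type: int   # 1, 2, or 3 (first, second, or third type data)


def route(sample: Sample) -> str:
    """Return the encoder that should receive the sample."""
    if sample.attribute == 1:
        return "first_encoder_101"
    if sample.attribute == 2:
        return "second_encoder_102"
    raise ValueError(f"unknown attribute: {sample.attribute}")


# First type data and third type data with the first attribute share a destination.
image_of_main_object = Sample(payload="image-A", attribute=1, data_type=1)
image_of_secondary_object = Sample(payload="image-B", attribute=1, data_type=3)
text_of_main_object = Sample(payload="text-A", attribute=2, data_type=2)

destinations = [route(s) for s in (image_of_main_object,
                                   image_of_secondary_object,
                                   text_of_main_object)]
```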

学習部１１３は、マルチモーダルネットワーク１３１とネットワーク制御部１３２とを備える。マルチモーダルネットワーク１３１は、マルチモーダルネットワークである。したがって、マルチモーダルネットワーク１３１は、マルチモーダル推定モデルを表現するニューラルネットワークである。そのため、マルチモーダルネットワーク１３１は、第１エンコーダ１０１、第２エンコーダ１０２、第１デコーダ１０３及び第２デコーダ１０４を備える。The learning unit 113 includes a multimodal network 131 and a network control unit 132. The multimodal network 131 is a multimodal network. Therefore, the multimodal network 131 is a neural network that represents a multimodal estimation model. Therefore, the multimodal network 131 includes a first encoder 101, a second encoder 102, a first decoder 103, and a second decoder 104.

ネットワーク制御部１３２は、マルチモーダルネットワーク１３１が得た結果に基づき、マルチモーダルネットワーク１３１を更新する。より具体的には、ネットワーク制御部１３２は、マルチモーダルネットワーク１３１が得た結果と、マルチモーダルネットワーク１３１に入力されたデータと、に基づき、マルチモーダルネットワーク１３１を更新する。The network control unit 132 updates the multimodal network 131 based on the results obtained by the multimodal network 131. More specifically, the network control unit 132 updates the multimodal network 131 based on the results obtained by the multimodal network 131 and the data input to the multimodal network 131.

ネットワーク制御部１３２は、例えばデータ取得部１１１の取得したデータが第１種データと第２種データとの組である場合には、紐づき損失関数を用い、マルチモーダルネットワーク１３１が得た結果に基づき、マルチモーダルネットワーク１３１を更新する。 For example, when the data acquired by the data acquisition unit 111 is a pair of a first type of data and a second type of data, the network control unit 132 uses a linking loss function and updates the multimodal network 131 based on the results obtained by the multimodal network 131.

ネットワーク制御部１３２は、例えばデータ取得部１１１の取得したデータが第１種データと第２種データとの組では無い場合には、非紐づき損失関数を用い、マルチモーダルネットワーク１３１が得た結果に基づき、マルチモーダルネットワーク１３１を更新する。 For example, when the data acquired by the data acquisition unit 111 is not a pair of the first type of data and the second type of data, the network control unit 132 uses an unlinked loss function and updates the multimodal network 131 based on the results obtained by the multimodal network 131.

通信制御部１１４は通信部１３の動作を制御する。記憶制御部１１５は記憶部１４の動作を制御する。出力制御部１１６は出力部１５の動作を制御する。 The communication control unit 114 controls the operation of the communication unit 13. The storage control unit 115 controls the operation of the storage unit 14. The output control unit 116 controls the operation of the output unit 15.

図７は、実施形態における学習装置１が実行する処理の流れの一例を示すフローチャートである。データ取得部１１１がデータを取得する（ステップＳ１０１）。次に、データ入力部１１２がデータ取得部１１１の取得したデータを、各データの示す属性に応じた入力先にデータを入力する（ステップＳ１０２）。入力先は、具体的には、第１エンコーダ１０１又は第２エンコーダ１０２である。 Figure 7 is a flowchart showing an example of the flow of processing executed by the learning device 1 in an embodiment. The data acquisition unit 111 acquires data (step S101). Next, the data input unit 112 inputs the data acquired by the data acquisition unit 111 to an input destination according to the attributes indicated by each data (step S102). Specifically, the input destination is the first encoder 101 or the second encoder 102.

次にマルチモーダルネットワーク１３１が、ステップＳ１０２で入力されたデータに対してマルチモーダル推定モデルを実行する（ステップＳ１０３）。次にネットワーク制御部１３２が、マルチモーダル推定モデルの実行の結果に基づき、マルチモーダル推定モデルを更新する（ステップＳ１０４）。より具体的にはネットワーク制御部１３２は、マルチモーダル推定モデルの実行の結果に基づき、データ取得部１１１の取得したデータに応じて紐づき損失関数又は非紐づき損失関数のいずれか一方を用いて、マルチモーダル推定モデルを更新する。Next, the multimodal network 131 executes the multimodal estimation model on the data input in step S102 (step S103). Next, the network control unit 132 updates the multimodal estimation model based on the result of the execution of the multimodal estimation model (step S104). More specifically, the network control unit 132 updates the multimodal estimation model using either the linked loss function or the unlinked loss function depending on the data acquired by the data acquisition unit 111 based on the result of the execution of the multimodal estimation model.

次にネットワーク制御部１３２は、学習終了条件が満たされたか否かを判定する（ステップＳ１０５）。学習終了条件が満たされた場合（ステップＳ１０５：ＹＥＳ）、処理が終了する。一方、学習終了条件が満たされない場合（ステップＳ１０５：ＮＯ）、ステップＳ１０１の処理に戻る。Next, the network control unit 132 determines whether the learning end condition is satisfied (step S105). If the learning end condition is satisfied (step S105: YES), the process ends. On the other hand, if the learning end condition is not satisfied (step S105: NO), the process returns to step S101.
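The flow of steps S101 to S105 can be sketched as a small training loop. Everything concrete here is an assumption made for illustration: the model is reduced to a single scalar parameter `s` (the first encoder and decoder are the identity, the second encoder divides by `s`, the second decoder multiplies by `s`), the gradient is taken by finite differences, and the end condition is a loss threshold on paired batches. The embodiment does not specify any of these choices.

```python
from typing import Iterable, Optional, Tuple

Batch = Tuple[Optional[float], Optional[float]]  # (first-attribute, second-attribute)


def linked_loss(s: float, x1: float, x2: float) -> float:
    z1, z2 = x1, x2 / s                              # features from each encoder
    reconstruction = (x1 - z1) ** 2 + (x2 - z2 * s) ** 2
    common = (z1 - z2) ** 2                          # assumed form of the common loss
    cross = (x1 - z2) ** 2 + (x2 - z1 * s) ** 2
    return reconstruction + common + cross


def unlinked_loss(s: float, x1: Optional[float], x2: Optional[float]) -> float:
    total = 0.0
    if x1 is not None:
        total += (x1 - (x1 * s) / s) ** 2            # enc1 -> dec2 -> enc2 -> dec1
    if x2 is not None:
        total += (x2 - (x2 / s) * s) ** 2            # enc2 -> dec1 -> enc1 -> dec2
    return total


def train(batches: Iterable[Batch], s: float = 1.0, lr: float = 0.05,
          tol: float = 1e-10, eps: float = 1e-6) -> float:
    for x1, x2 in batches:                                  # S101: acquire data
        paired = x1 is not None and x2 is not None          # S102: route by attribute
        loss_fn = linked_loss if paired else unlinked_loss  # pick the loss by input
        loss = loss_fn(s, x1, x2)                           # S103: run the model
        grad = (loss_fn(s + eps, x1, x2) - loss_fn(s - eps, x1, x2)) / (2 * eps)
        s -= lr * grad                                      # S104: update the model
        if paired and loss < tol:                           # S105: end condition (assumed)
            break
    return s


# Paired batches follow x2 = 2 * x1; unpaired batches carry only one attribute.
batches = [(1.0, 2.0), (3.0, None), (None, 5.0)] * 200
learned_scale = train(batches)
```

In this one-parameter toy the round trip is exact for any `s`, so the unpaired batches contribute essentially no gradient. The sketch is meant to show the control flow and the choice between the two loss functions, not the benefit of unpaired data.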

このようにして得られた学習済みのマルチモーダル推定モデルは、図８に示す推定装置２等の推定対象の第１属性のデータと推定対象の第２属性のデータとの組に基づいて推定対象を推定する装置で用いられる。なお、学習済みのマルチモーダル推定モデルとは、学習終了条件が満たされた時点のマルチモーダル推定モデルである。以下、推定装置２の推定対象を注目対象という。以下、注目対象の第１属性のデータを第１種対象データという。以下、注目対象の第２属性のデータを第２種対象データという。注目対象は、主対象と異なる対象であってもよい。 The trained multimodal estimation model obtained in this manner is used in a device, such as the estimation device 2 shown in FIG. 8, that estimates an estimation target based on a pair of data on the first attribute of the estimation target and data on the second attribute of the estimation target. Note that the trained multimodal estimation model is the multimodal estimation model at the time when the learning end condition is satisfied. Hereinafter, the estimation target of the estimation device 2 is referred to as the target of interest, the data on the first attribute of the target of interest is referred to as first type target data, and the data on the second attribute of the target of interest is referred to as second type target data. The target of interest may be an object different from the main object.

図８は、実施形態の推定装置２のハードウェア構成の一例を示す図である。推定装置２は、第１種対象データと第２種対象データとの組を取得し、学習済みのマルチモーダル推定モデルを用いて、取得した第１種対象データと第２種対象データとが示す注目対象を推定する。 Figure 8 is a diagram illustrating an example of a hardware configuration of the estimation device 2 of an embodiment. The estimation device 2 acquires a pair of first type target data and second type target data, and estimates a target of interest indicated by the acquired first type target data and second type target data using a trained multimodal estimation model.

推定装置２は、バスで接続されたＣＰＵ等のプロセッサ９３とメモリ９４とを備える制御部２１を備え、プログラムを実行する。推定装置２は、プログラムの実行によって制御部２１、入力部２２、通信部２３、記憶部２４及び出力部２５を備える装置として機能する。 The estimation device 2 has a control unit 21 including a processor 93 such as a CPU and a memory 94 connected by a bus, and executes a program. By executing the program, the estimation device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.

より具体的には、推定装置２は、プロセッサ９３が記憶部２４に記憶されているプログラムを読み出し、読み出したプログラムをメモリ９４に記憶させる。プロセッサ９３が、メモリ９４に記憶させたプログラムを実行することによって、推定装置２は、制御部２１、入力部２２、通信部２３、記憶部２４及び出力部２５を備える装置として機能する。More specifically, in the estimation device 2, the processor 93 reads a program stored in the storage unit 24 and stores the read program in the memory 94. When the processor 93 executes the program stored in the memory 94, the estimation device 2 functions as a device including a control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.

制御部２１は、推定装置２が備える各種機能部の動作を制御する。制御部２１は、例えば学習済みのマルチモーダル推定モデルを実行する。The control unit 21 controls the operation of various functional units of the estimation device 2. The control unit 21 executes, for example, a trained multimodal estimation model.

入力部２２は、例えばマウスやキーボード、タッチパネル等の入力装置を含んで構成される。入力部２２は、これらの入力装置を推定装置２に接続するインタフェースを含んで構成されてもよい。The input unit 22 includes input devices such as a mouse, a keyboard, a touch panel, etc. The input unit 22 may include an interface that connects these input devices to the estimation device 2.

通信部２３は、推定装置２を外部装置に接続するためのインタフェースを含んで構成される。通信部２３は、有線又は無線を介して外部装置と通信する。外部装置は例えば、第１種対象データと第２種対象データとの組の送信元の装置である。通信部２３は、第１種対象データと第２種対象データとの組の送信元の装置との通信によって、第１種対象データと第２種対象データとの組を取得する。外部装置は、例えば学習装置１である。通信部２３は、学習装置１との通信によって、学習済みのマルチモーダル推定モデルを取得する。The communication unit 23 includes an interface for connecting the estimation device 2 to an external device. The communication unit 23 communicates with the external device via wired or wireless communication. The external device is, for example, a device that transmits a pair of the first type of target data and the second type of target data. The communication unit 23 acquires the pair of the first type of target data and the second type of target data by communicating with the device that transmits the pair of the first type of target data and the second type of target data. The external device is, for example, the learning device 1. The communication unit 23 acquires a trained multimodal estimation model by communicating with the learning device 1.

記憶部２４は、磁気ハードディスク装置や半導体記憶装置などのコンピュータ読み出し可能な記憶媒体装置を用いて構成される。記憶部２４は、推定装置２に関する各種情報を記憶する。記憶部２４は、例えば制御部２１が実行する処理の結果生じた各種情報を記憶する。 The storage unit 24 is configured using a computer-readable storage medium device such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various information related to the estimation device 2. The storage unit 24 stores, for example, various information generated as a result of processing executed by the control unit 21.

出力部２５は、例えばＣＲＴディスプレイや液晶ディスプレイ、有機ＥＬディスプレイ等の表示装置を含んで構成される。出力部２５は、これらの表示装置を推定装置２に接続するインタフェースを含んで構成されてもよい。The output unit 25 is configured to include a display device such as a CRT display, a liquid crystal display, an organic EL display, etc. The output unit 25 may be configured to include an interface that connects these display devices to the estimation device 2.

図９は、実施形態の推定装置２が備える制御部２１の機能構成の一例を示す図である。制御部２１は、データ取得部２１１、推定部２１２、通信制御部２１３、記憶制御部２１４及び出力制御部２１５を備える。 Figure 9 is a diagram showing an example of the functional configuration of the control unit 21 provided in the estimation device 2 of the embodiment. The control unit 21 includes a data acquisition unit 211, an estimation unit 212, a communication control unit 213, a storage control unit 214, and an output control unit 215.

データ取得部２１１は、通信部２３に入力された、第１種対象データと第２種対象データとの組のデータを取得する。推定部２１２は、学習済みのマルチモーダル推定モデルを、データ取得部２１１の取得したデータに対して実行する。すなわち、推定部２１２は、データ取得部２１１の取得した第１種対象データ及び第２種対象データに対して、学習済みのマルチモーダルネットワークによる注目対象の推定の処理を実行する。通信制御部２１３は通信部２３の動作を制御する。記憶制御部２１４は記憶部２４の動作を制御する。出力制御部２１５は出力部２５の動作を制御する。 The data acquisition unit 211 acquires the pair of first type target data and second type target data input to the communication unit 23. The estimation unit 212 executes the trained multimodal estimation model on the data acquired by the data acquisition unit 211. That is, the estimation unit 212 uses the trained multimodal network to estimate the target of interest from the first type target data and the second type target data acquired by the data acquisition unit 211. The communication control unit 213 controls the operation of the communication unit 23. The storage control unit 214 controls the operation of the storage unit 24. The output control unit 215 controls the operation of the output unit 25.

図１０は、実施形態の推定装置２が実行する処理の流れの一例を示すフローチャートである。データ取得部２１１が、第１種対象データと第２種対象データとの組のデータを取得する（ステップＳ２０１）。次に、推定部２１２が、ステップＳ２０１で取得されたデータに対して学習済みのマルチモーダル推定モデルを実行する（ステップＳ２０２）。すなわち、推定部２１２が、ステップＳ２０１で取得された第１種対象データ及び第２種対象データに対して学習済みのマルチモーダルネットワークの実行する処理を実行する。学習済みのマルチモーダルネットワークの実行する処理とは、具体的には、注目対象の推定の処理である。 Figure 10 is a flowchart showing an example of the flow of processing executed by the estimation device 2 of the embodiment. The data acquisition unit 211 acquires a set of data of a first type of target data and a second type of target data (step S201). Next, the estimation unit 212 executes a trained multimodal estimation model on the data acquired in step S201 (step S202). That is, the estimation unit 212 executes processing executed by the trained multimodal network on the first type of target data and the second type of target data acquired in step S201. The processing executed by the trained multimodal network is, specifically, processing to estimate the target of interest.

ステップＳ２０１で取得されたデータに対する学習済みのマルチモーダル推定モデルの実行により、注目対象が推定される。ステップＳ２０２の次に、出力制御部２１５が出力部２５の動作を制御して、出力部２５に、推定部２１２の推定の結果を出力させる（ステップＳ２０３）。The target object is estimated by executing the trained multimodal estimation model on the data acquired in step S201. After step S202, the output control unit 215 controls the operation of the output unit 25 to cause the output unit 25 to output the result of the estimation by the estimation unit 212 (step S203).
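Steps S201 to S203 can be sketched as a three-stage pipeline: acquire the pair, run the trained model, and output the result. The text does not state how the estimate is formed from the two inputs, so the averaging of the two encoder features below is purely an assumption for illustration, as are the toy encoders and every function name.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[Vector], List[float]]


def estimate_target(x1: Vector, x2: Vector,
                    enc1: Encoder, enc2: Encoder) -> List[float]:
    """Fuse the two features by averaging (an assumed fusion rule)."""
    z1, z2 = enc1(x1), enc2(x2)
    return [(a + b) / 2 for a, b in zip(z1, z2)]


def run_estimation(acquire: Callable[[], Tuple[Vector, Vector]],
                   enc1: Encoder, enc2: Encoder,
                   emit: Callable[[List[float]], None]) -> List[float]:
    x1, x2 = acquire()                            # S201: acquire the data pair
    result = estimate_target(x1, x2, enc1, enc2)  # S202: run the trained model
    emit(result)                                  # S203: output the estimate
    return result


# Toy trained encoders (hypothetical) and an in-memory output sink.
enc1 = lambda x: list(x)
enc2 = lambda x: [v / 2 for v in x]
outputs: List[List[float]] = []

result = run_estimation(lambda: ([1.0, 2.0], [2.0, 4.0]), enc1, enc2, outputs.append)
```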

このように構成された学習装置１は、複数種類のデータに共通する情報を推定する数理モデルの推定の精度を高めるように学習を行う。学習装置１において、複数種類のデータに共通する情報は、具体的には、主対象である。そして、学習装置１は、紐づき損失関数を用いた学習だけでなく、非紐づき損失関数を用いた学習の実行も可能である。The learning device 1 configured in this manner performs learning so as to improve the accuracy of estimation of a mathematical model that estimates information common to multiple types of data. In the learning device 1, the information common to multiple types of data is specifically the main target. The learning device 1 is capable of performing learning using not only a linked loss function, but also a non-linked loss function.

そのため、第１種データと第２種データとの組のような紐づいた情報が存在する場合だけでなく、一方のみが存在する場合や、第３種データのみが存在する場合であっても学習装置１は学習を行うことが可能である。そのため、学習装置１は、紐づいた情報が存在する場合しか学習を行えない技術で得られる数理モデルよりも、少ない労力で、複数種類のデータに共通する情報を推定する数理モデルを得ることができる。すなわち、学習装置１は、複数種類のデータに共通する情報の推定に要する労力を軽減することができる。 Therefore, the learning device 1 can perform learning not only when linked information such as a pair of first type data and second type data exists, but also when only one of them exists, or when only third type data exists. Therefore, the learning device 1 can obtain a mathematical model that estimates information common to multiple types of data with less effort than a mathematical model obtained by a technology that can perform learning only when linked information exists. In other words, the learning device 1 can reduce the effort required to estimate information common to multiple types of data.

Furthermore, the estimation device 2 configured in this manner estimates the estimation target using the trained mathematical model obtained by the learning device 1. Therefore, the estimation device 2 can reduce the effort required to estimate information common to multiple types of data.

(Modification)
In learning with the unlinked loss function, it is preferable for the learning device 1 to use the third type data rather than the first type data or the second type data. This is because learning about secondary objects as well as the main object suppresses overfitting.

Note that the unlinked loss function may include the i-th reconstruction loss in addition to the i-th cross self-consistent loss. As described above, the value of the j-th cross self-consistent loss is 0 when training data is input to the i-th encoder. When the unlinked loss function also includes the i-th reconstruction loss, the estimation accuracy of the trained multimodal estimation model is higher than when it does not.
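The combination of the i-th cross self-consistent loss and the i-th reconstruction loss can be sketched as below. Several points are assumptions made for illustration only: the squared-error distance, the conversion order (i-th encoder, j-th decoder, j-th encoder, i-th decoder), the `weight` parameter, and the scalar-scaling toy encoders and decoders. None of these is confirmed by the passage above.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Transform = Callable[[Vector], List[float]]


def scale(factor: float) -> Transform:
    # Toy encoder/decoder: multiplies every element by a constant.
    return lambda v: [factor * x for x in v]


def squared_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_loss(x: Vector, enc_i: Transform, dec_i: Transform) -> float:
    # Difference between the input and its encode-then-decode result.
    return squared_error(dec_i(enc_i(x)), x)


def cross_self_consistent_loss(
    x: Vector, enc_i: Transform, dec_j: Transform, enc_j: Transform, dec_i: Transform
) -> float:
    # Assumed path: x -> enc_i -> dec_j -> enc_j -> dec_i, compared with x.
    return squared_error(dec_i(enc_j(dec_j(enc_i(x)))), x)


def unlinked_loss(
    x: Vector,
    enc_i: Transform,
    dec_i: Transform,
    enc_j: Transform,
    dec_j: Transform,
    include_reconstruction: bool = True,
    weight: float = 1.0,  # hypothetical weighting of the reconstruction term
) -> float:
    loss = cross_self_consistent_loss(x, enc_i, dec_j, enc_j, dec_i)
    if include_reconstruction:
        loss += weight * reconstruction_loss(x, enc_i, dec_i)
    return loss


x = [1.0, 2.0]
enc1, dec1 = scale(0.5), scale(1.0)  # dec1 deliberately fails to invert enc1
enc2, dec2 = scale(0.5), scale(2.0)  # dec2 inverts enc2 exactly
with_reconstruction = unlinked_loss(x, enc1, dec1, enc2, dec2)
without_reconstruction = unlinked_loss(x, enc1, dec1, enc2, dec2, include_reconstruction=False)
```

In this toy setting both terms penalize the same mismatch between `enc1` and `dec1`, so including the reconstruction term doubles the loss value. In a real network the two terms would generally differ.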

Note that the learning device 1 and the estimation device 2 do not necessarily need to be implemented as separate devices. For example, they may be implemented as a single device or system that combines the functions of both.

In addition, each functional unit of the learning device 1 and the estimation device 2 may be implemented using multiple information processing devices communicably connected via a network.

Each of the learning device 1 and the estimation device 2 may be implemented using a plurality of information processing devices communicably connected via a network. All or part of the functions of the learning device 1 and the estimation device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. Examples of computer-readable recording media include portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and storage devices such as hard disks built into computer systems. The program may be transmitted via a telecommunications line.

An embodiment of the present invention has been described above in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the present invention.

1...Learning device, 2...Estimation device, 101...First encoder, 102...Second encoder, 103...First decoder, 104...Second decoder, 11...Control unit, 12...Input unit, 13...Communication unit, 14...Memory unit, 15...Output unit, 111...Data acquisition unit, 112...Data input unit, 113...Learning unit, 131...Multimodal network, 132...Network control unit, 114...Communication control unit, 115...Memory control unit, 116...Output control unit, 21...Control unit, 22...Input unit, 23...Communication unit, 24...Memory unit, 25...Output unit, 211...Data acquisition unit, 212...Estimation unit, 213...Communication control unit, 214...Memory control unit, 215...Output control unit, 91, 93...Processor, 92, 94...Memory

Claims

1. A learning device comprising:
a multimodal network that is a neural network including a first encoder that acquires a feature amount of input data, a second encoder that acquires a feature amount of the input data, a first decoder that decodes the input data, and a second decoder that decodes the input data; and
a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss, the cross self-consistency loss indicating the difference between the input data and a self-consistency result, which is a result of the input data being converted by the first encoder, the second encoder, the first decoder, and the second decoder, and the common loss indicating the difference between a result of the first encoder encoding data indicating a first attribute of a main object, which is a person, a living thing, an object, an intangible object, or an event having multiple attributes, and a result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

2. The learning device according to claim 1, wherein
the network control unit updates the multimodal network using the self-consistency result of third type data, which is data indicating either a first attribute or a second attribute of a person, a living thing, an object, an intangible object, or an event different from the main object.

3. The learning device according to claim 1 or 2, wherein
the network control unit updates the multimodal network based on a first secondary reconstruction loss and a second secondary reconstruction loss, the first secondary reconstruction loss indicating a difference between first type data, which is data indicating the first attribute of the main object, and a result of the first decoder decoding a result of the first encoder encoding the first type data, and the second secondary reconstruction loss indicating a difference between second type data, which is data indicating the second attribute of the main object, and a result of the second decoder decoding a result of the second encoder encoding the second type data.

4. An estimation device comprising:
a data acquisition unit that acquires first type target data indicating a first attribute of an estimation target and second type target data indicating a second attribute, which is different from the first attribute, among attributes of the estimation target; and
an estimation unit that executes, on the first type target data and the second type target data, a process executed by a trained multimodal network obtained by a learning device including: a multimodal network that is a neural network including a first encoder that acquires a feature amount of input data, a second encoder that acquires a feature amount of the input data, a first decoder that decodes the input data, and a second decoder that decodes the input data; and a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss, the cross self-consistency loss indicating a difference between the input data and a self-consistency result, which is a result of the input data being converted by the first encoder, the second encoder, the first decoder, and the second decoder, and the common loss indicating a difference between a result of the first encoder encoding data indicating a first attribute of a main object, which is a person, a living thing, an object, an intangible object, or an event having a plurality of attributes, and a result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

5. A learning method comprising:
a network execution step of executing a multimodal network, which is a neural network including a first encoder for acquiring a feature amount of input data, a second encoder for acquiring a feature amount of the input data, a first decoder for decoding the input data, and a second decoder for decoding the input data; and
a network control step of updating the multimodal network using a cross self-consistency loss and a common loss, the cross self-consistency loss indicating the difference between the input data and a self-consistency result, which is a result of the input data being converted by the first encoder, the second encoder, the first decoder, and the second decoder, and the common loss indicating the difference between a result of the first encoder encoding data indicating a first attribute of a main object, which is a person, a living thing, an object, an intangible object, or an event having a plurality of attributes, and a result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

6. An estimation method comprising:
a data acquisition step of acquiring first type target data indicating a first attribute of an estimation target and second type target data indicating a second attribute, which is different from the first attribute, among attributes of the estimation target; and
an estimation step of executing, on the first type target data and the second type target data, a process executed by a trained multimodal network obtained by a learning device including: a multimodal network that is a neural network including a first encoder that acquires a feature amount of input data, a second encoder that acquires a feature amount of the input data, a first decoder that decodes the input data, and a second decoder that decodes the input data; and a network control unit that updates the multimodal network using a cross self-consistency loss and a common loss, the cross self-consistency loss indicating a difference between the input data and a self-consistency result, which is a result of the input data being converted by the first encoder, the second encoder, the first decoder, and the second decoder, and the common loss indicating a difference between a result of the first encoder encoding data indicating a first attribute of a main object, which is a person, a living thing, an object, an intangible object, or an event having a plurality of attributes, and a result of the second encoder encoding data indicating a second attribute of the main object that is different from the first attribute.

7. A program for causing a computer to function as the learning device according to any one of claims 1 to 3.

8. A program for causing a computer to function as the estimation device according to claim 4.