JP7362076B2

JP7362076B2 - Information processing device, information processing method, and information processing program

Info

Publication number: JP7362076B2
Application number: JP2021087753A
Authority: JP
Inventors: 隆之堀; 容範金; 裕真鈴木; 一也植木
Original assignee: SoftBank Corp; Meisei Gakuen
Current assignee: SoftBank Corp; Meisei Gakuen
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2023-10-17
Anticipated expiration: 2041-05-25
Also published as: JP2022180958A

Description

The present invention relates to an information processing device, an information processing method, and an information processing program.

Conventionally, techniques are known for generating text information (captions, tags, and the like) that indicates the content of an image so that the image can be searched for from text information. For example, to make a person included in an image searchable, one known technique identifies a person region containing the person in the image, divides the person region into a plurality of partial regions, generates a query element for each partial region, and combines the query elements of the partial regions to generate a search query for the person.

Japanese Patent Application Publication No. 2016-162414
Japanese Patent Application Publication No. 2019-219988

There is a need for technology that improves image search accuracy.

The information processing device according to the present application includes: an attribute information extraction unit that extracts attribute information regarding attributes of an object based on (i) region information regarding an object region, which is a divided region containing an object having a structure, in an image divided into regions using a segmentation technique, and (ii) structural information regarding the structure of the object estimated using a pose estimation technique; and a model generation unit that generates a VSE (Visual-Semantic Embedding) model trained to associate the image with a sentence that is generated based on the attribute information extracted by the attribute information extraction unit and that indicates the content of the image, and to embed them in a common space.

FIG. 1 is a diagram for explaining an overview of information processing according to an embodiment.
FIG. 2 is a diagram for explaining a VSE (Visual-Semantic Embedding) model and a concept classifier according to the embodiment.
FIG. 3 is a diagram illustrating a configuration example of an information processing device according to the embodiment.
FIG. 4 is a diagram showing an attribute information extraction procedure and a VSE model generation procedure according to the embodiment.
FIG. 5 is a diagram showing an information processing procedure according to the embodiment.
FIG. 6 is a diagram showing an information processing procedure according to a modification.
FIG. 7 is a hardware configuration diagram showing an example of a computer that implements the functions of the information processing device.

Modes for carrying out the information processing device, information processing method, and information processing program according to the present application (hereinafter referred to as "embodiments") are described in detail below with reference to the drawings. Note that these embodiments do not limit the information processing device, information processing method, and information processing program according to the present application. In the embodiments below, the same parts are given the same reference numerals, and redundant explanations are omitted.

(Embodiment)
[1. Introduction]
Conventionally, techniques are known for extracting, from an image, attribute information regarding the attributes of an object (for example, a person) included in the image. For example, pose estimation techniques, which extract posture information regarding the posture of a person included in an image, are known. Segmentation techniques, which extract information regarding the region of a person or the region of clothing included in an image, are also known.

However, although segmentation techniques can accurately extract the regions of people and clothing included in an image, they cannot extract the posture of a person included in the image. Conversely, although pose estimation techniques can accurately extract the posture of a person included in an image, it is difficult for them to accurately extract the regions of people and clothing included in the image.

In contrast, the information processing device 100 according to one embodiment combines a segmentation technique and a pose estimation technique, and can thereby accurately extract the regions of people and clothing included in an image as well as the posture of a person included in the image. FIG. 1 is a diagram for explaining an overview of information processing according to one embodiment. The information processing shown in FIG. 1 is performed by the information processing device 100 (see FIG. 3), which is described later.

As shown in FIG. 1, the information processing device 100 according to one embodiment divides an image into regions using a segmentation technique and extracts person region information regarding a person region, which is a divided region containing a person, from the region-divided image. Here, by dividing the image into regions using the segmentation technique, the information processing device 100 may extract, for each person, a person region containing the whole person from the top of the head to the tips of the toes. The information processing device 100 may also extract, for each body part, body part region information regarding a body part region, which is a divided region containing a body part of the person (for example, the hair, the face, or a fashion item worn by the person).

Here, a segmentation technique takes an image as input, divides it into regions at the pixel level, and labels each region. Segmentation techniques fall into three broad types according to what the labels mean. Semantic segmentation classifies every pixel in the image into a class; it divides the image into regions by object type. Instance segmentation separates the region of each individual object and recognizes the object's type; it divides the image into regions object by object. Panoptic segmentation combines semantic segmentation and instance segmentation: it performs instance segmentation on objects such as people, animals, and cars (countable classes, also called Thing classes) and semantic segmentation on backgrounds such as the sky, roads, and lawns (uncountable classes, also called Stuff classes).
The information processing device 100 may divide the image into regions using semantic segmentation, instance segmentation, or panoptic segmentation, or using a combination of these techniques.
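The relationship between the three segmentation types can be illustrated with a minimal sketch. The patent does not specify any merging procedure; the class ids, the `merge_panoptic` function, and the toy arrays below are hypothetical and serve only to show how a panoptic result keeps per-object ids for Thing classes while Stuff classes keep only a class label.

```python
import numpy as np

# Hypothetical class ids used only in this sketch.
SKY, ROAD, PERSON = 0, 1, 2
THING_CLASSES = {PERSON}  # countable classes; everything else is treated as stuff


def merge_panoptic(semantic, instance_masks):
    """Combine a semantic map and per-instance masks into a panoptic result.

    semantic: (H, W) int array holding a class id for every pixel.
    instance_masks: list of (class_id, (H, W) bool mask), one per detected object.
    Returns (class_map, instance_map); instance id 0 means "no instance" (stuff).
    """
    class_map = semantic.copy()
    instance_map = np.zeros_like(semantic)
    for instance_id, (class_id, mask) in enumerate(instance_masks, start=1):
        if class_id not in THING_CLASSES:
            continue  # stuff classes keep their semantic label only
        class_map[mask] = class_id
        instance_map[mask] = instance_id
    return class_map, instance_map


# Toy 2x4 image: sky on top, two adjacent people standing on a road.
semantic = np.array([[SKY, SKY, SKY, SKY],
                     [ROAD, PERSON, PERSON, ROAD]])
person_a = np.zeros((2, 4), dtype=bool)
person_a[1, 1] = True
person_b = np.zeros((2, 4), dtype=bool)
person_b[1, 2] = True

class_map, instance_map = merge_panoptic(
    semantic, [(PERSON, person_a), (PERSON, person_b)])
```

In this toy case the semantic map alone cannot tell the two people apart, whereas `instance_map` assigns them different ids, which is what allows a person region to be extracted for each person.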

The information processing device 100 also extracts skeletal information regarding the skeleton of a person included in the image using a pose estimation technique. Specifically, the information processing device 100 may extract the skeletal information from the image using any known pose estimation technique. For example, the information processing device 100 may extract the skeletal information using a pose estimation technique that estimates the posture (skeleton) of a person or animal from a video or still image with a deep learning model called a pose estimation model. When one image shows multiple people, the information processing device 100 may detect feature points for each of them through the pose estimation processing and estimate skeletal information for each person's skeleton. By using a pose estimation technique, the information processing device 100 can precisely estimate the body parts of a person in an image.

For example, OpenPose ("OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", Zhe Cao et al., 2018) is known as one pose estimation model. OpenPose detects feature points (also called keypoints) that indicate the physical features of a person included in an image and estimates the person's posture by connecting those feature points. For example, OpenPose estimates, as the feature points, joint points indicating the positions of the joints of the person's body, and detects as the person's posture a skeletal model, generated by connecting the joint points, that represents the skeleton of the person's body. Some pose estimation models of the feature-point type can estimate 30 kinds of body parts of a person in an image. Specifically, such a model can identify the following body parts: head, eyes (right, left), ears (right, left), nose, neck, shoulders (right, center, left), elbows (right, left), spine, wrists (right, left), hands (right, left), thumbs (right, left), fingertips (right, left), hips (right, center, left), knees (right, left), ankles (right, left), and feet (right, left).

DensePose (reference URL: http://openaccess.thecvf.com/content_cvpr_2018/html/Guler_DensePose_Dense_Human_CVPR_2018_paper.html) is known as another pose estimation model. DensePose detects the person region of a person in a two-dimensional image and generates a three-dimensional body surface model corresponding to the detected person region. More specifically, DensePose takes an RGB image as input and estimates the UV coordinates of the three-dimensional surface of the person in the RGB image. Because DensePose can estimate the UV coordinates of the three-dimensional body surface from the person region shown in a two-dimensional image, it can precisely estimate each body part within that person region. DensePose can estimate 24 kinds of body parts of a person in an image. Specifically, from an RGB image, DensePose can identify the following body parts: head (left, right), neck, torso, arms (left, right / upper arm, forearm / front, back), legs (left, right / thigh, calf / front, back), hands (left, right), and feet (left, right).

Next, the information processing device 100 extracts person attribute information regarding the attributes of the person included in the image, based on the extracted person region information and skeletal information. As one example of person attribute information, the information processing device 100 extracts posture information regarding the person's posture based on the extracted skeletal information. As another example, the information processing device 100 extracts motion information regarding the person's motion based on the pattern of temporal change in the extracted skeletal information; for instance, it extracts the information that the person in the image changed posture from standing to sitting. As a further example, the information processing device 100 extracts information regarding the person's clothing, hairstyle, and facial expression based on the extracted body part region information for each body part; for instance, it extracts the information that the person in the image is wearing blue clothes.
Subsequently, the information processing device 100 generates a sentence indicating the content of the image based on the extracted person attribute information. For example, based on the information that the person in the image changed posture from standing to sitting and the information that the person is wearing blue clothes, the information processing device 100 may generate the sentence "A person wearing blue clothes sat on a red sofa in a restaurant." as an example of a sentence indicating the content of the image.
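The flow from skeletal information to posture, motion, and finally a sentence can be sketched as follows. The patent does not disclose a concrete rule, so everything here is an assumption for illustration: the knee-angle heuristic, the 140-degree threshold, the keypoint names, and the sentence template are all hypothetical.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def classify_posture(keypoints, knee_threshold_deg=140.0):
    """Hypothetical heuristic: a nearly straight knee means 'standing'."""
    angle = joint_angle(keypoints["hip"], keypoints["knee"], keypoints["ankle"])
    return "standing" if angle >= knee_threshold_deg else "sitting"


def detect_action(postures):
    """Return 'sat' if consecutive frames go from standing to sitting."""
    for before, after in zip(postures, postures[1:]):
        if before == "standing" and after == "sitting":
            return "sat"
    return None


def make_sentence(clothing_color, action, place):
    """Fill a fixed template with the extracted attributes."""
    return f"A person wearing {clothing_color} clothes {action} {place}."


# Two frames of 2D keypoints (x, y); the values are made up for the example.
frames = [
    {"hip": (0, 0), "knee": (0, 1), "ankle": (0, 2)},  # straight leg
    {"hip": (0, 0), "knee": (1, 0), "ankle": (1, 1)},  # knee bent at 90 degrees
]
postures = [classify_posture(f) for f in frames]
action = detect_action(postures)
sentence = make_sentence("blue", action, "on a red sofa in a restaurant")
```

The clothing color would come from the body part regions and the place from scene-level information; both are passed in as plain strings here to keep the sketch short.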

As described above, the information processing device 100 according to one embodiment extracts person attribute information regarding the attributes of a person based on person region information regarding a person region, which is a divided region containing the person in an image divided into regions using a segmentation technique, and on skeletal information regarding the person's skeleton estimated using a pose estimation technique. The information processing device 100 according to one embodiment also generates a VSE model trained to associate the image with a sentence that is generated based on the extracted person attribute information and that indicates the content of the image, and to embed them in a common space.

In this way, by combining a segmentation technique and a pose estimation technique, the information processing device 100 according to one embodiment can accurately extract the regions of people and clothing included in an image as well as the posture of a person included in the image. This allows the information processing device 100 to hierarchically decompose a person region included in an image and the components of that person region (for example, the joint position information of each joint, the body part regions, and the fashion item regions). Based on the hierarchical relationships among the decomposed components, the information processing device 100 can appropriately extract person attribute information regarding the person's attributes from the person region and the skeletal information. Based on the person attribute information appropriately extracted from the image, the information processing device 100 can then appropriately generate a sentence indicating the content of the image. The information processing device 100 according to one embodiment can therefore improve image search accuracy.

The example above describes the case where the information processing device 100 extracts person region information from an image using a segmentation technique and extracts person attribute information based on the extracted person region information, but the processing is not limited to this. For example, the information processing device 100 may first extract a person region from the image using a segmentation technique and then, using a pose estimation technique, extract from that person region the skeletal information regarding the skeleton of the person it contains. By using the person region extracted with the segmentation technique when estimating skeletal information with the pose estimation technique, the information processing device 100 can estimate posture information more accurately. The information processing device 100 may also estimate person attribute information from a combination of image information, person region information, and skeletal information, which allows it to estimate person attribute information more accurately.
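One way to read the variant in which pose estimation runs on the segmented person region is sketched below. This is an illustrative assumption, not the patent's implementation: the bounding-box crop, the `margin` parameter, and the stand-in `pose_estimator` callable (which returns keypoints in crop coordinates) are all hypothetical.

```python
import numpy as np


def mask_bounding_box(mask, margin=0):
    """Return (top, left, bottom, right) enclosing the True pixels of mask."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("mask contains no person pixels")
    height, width = mask.shape
    top = max(int(rows[0]) - margin, 0)
    left = max(int(cols[0]) - margin, 0)
    bottom = min(int(rows[-1]) + 1 + margin, height)
    right = min(int(cols[-1]) + 1 + margin, width)
    return top, left, bottom, right


def estimate_pose_in_region(image, mask, pose_estimator, margin=0):
    """Run pose estimation on the person region only.

    pose_estimator is any callable mapping a cropped image to
    {keypoint_name: (x, y)} in crop coordinates. The returned keypoints
    are shifted back into the coordinates of the full image.
    """
    top, left, bottom, right = mask_bounding_box(mask, margin)
    crop = image[top:bottom, left:right]
    keypoints = pose_estimator(crop)
    return {name: (x + left, y + top) for name, (x, y) in keypoints.items()}


# Toy data: a 6x8 image whose person mask covers rows 2-4 and columns 3-5.
image = np.zeros((6, 8, 3), dtype=np.uint8)
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:6] = True


def fake_pose_estimator(crop):
    # Stand-in for a real model: reports the crop centre as a single keypoint.
    return {"center": (crop.shape[1] // 2, crop.shape[0] // 2)}


keypoints = estimate_pose_in_region(image, mask, fake_pose_estimator)
```

Restricting the input to the person region removes background clutter from what the pose model sees, which is one plausible reason the text expects the posture estimate to become more accurate.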

In recent years, to organize and manage enormous amounts of video content efficiently, techniques have become known that automatically analyze the content of images and automatically generate and attach text information (captions, tags, and the like) indicating that content. This makes images searchable from text information. In this specification, an "image" may be a moving image such as a video, or an individual scene (still image) included in a video.

Against this background, a study is known that compares two representative methods of retrieving images from text information. Specifically, the study compares (1) a method that retrieves images from a query sentence using a trained machine learning model (hereinafter also called a concept classifier) that has learned in advance the detection targets (hereinafter also called concepts) such as objects, people, scenes, and actions included in images, and (2) a method that retrieves images matching a query sentence using a common space in which image features, which represent the characteristics of an image, and language features, which represent the characteristics of a linguistic expression describing the content of the image, are embedded in association with each other (reference: "Comparison and Evaluation of Video Retrieval Approaches Using Query Sentences", IMIP 2020: Proceedings of the 2020 2nd International Conference on Intelligent Medicine and Image Processing, April 2020, Pages 103-107, https://doi.org/10.1145/3399637.3399657).

According to the above study, the method using a concept classifier shown in (1) and the method using VSE shown in (2) are complementary. The inventors of the present invention therefore propose a technique that improves image search accuracy by integrating the concept classifier method of (1) with the VSE-based method of (2). Specifically, the information processing device 100 according to one embodiment extracts appropriate concepts from an image using a concept classifier, and uses a VSE model to search again for images similar to the extracted concepts. For example, the information processing device 100 searches again for images similar to the extracted concepts using a VSE model trained to associate visual graph information with text graph information and embed them in a common space (hereinafter sometimes simply referred to as the VSE model). This allows the information processing device 100 according to one embodiment to, for example, extract appropriate concepts from among concepts that are not stated explicitly in the query sentence received from the user (for example, implicit concepts) and search again for images similar to the extracted concepts. According to one embodiment of the present invention, image search accuracy can therefore be improved.

The VSE model and the concept classifier according to the embodiment are described next with reference to FIG. 2. FIG. 2 is a diagram for explaining the VSE model and the concept classifier according to the embodiment.

The left side of FIG. 2 shows an example of the VSE model according to the embodiment. When retrieving sentences from an image, the image is input to the VSE model shown on the left side of FIG. 2, which generates a feature vector corresponding to the image features that represent the characteristics of the image (hereinafter also called the image feature vector). The VSE model then maps the generated image feature vector into a space shared with the feature vectors corresponding to sentence features that represent the characteristics of sentences (hereinafter also called sentence feature vectors), and returns as the search result the sentences whose feature vectors are similar to the image feature vector. For example, the VSE model outputs, as the search result, the sentences whose sentence feature vectors have a similarity to the image feature vector that exceeds a predetermined threshold.

The VSE model can also extract image features from an image. The VSE model may be implemented with any known technique capable of extracting image features from an image. For example, the VSE model may include a convolutional neural network (CNN) and use the CNN to extract image features from the image. The VSE model may also include a network developed for object recognition, such as ResNet (Residual Network) (Kaiming He et al., 2015), AlexNet (Krizhevsky et al., 2012), VGGNet (Simonyan et al., 2014), GoogLeNet (Szegedy et al., 2014), SENet (Squeeze-and-Excitation Networks) (Jie Hu et al., 2018), EfficientNet (Tan et al., 2019), or ZFNet (Matthew Zeiler et al., 2013), and use that network to extract image features from the image. The VSE model may also include a network developed for object detection, such as Faster R-CNN (Shaoqing Ren et al., 2015), YOLO (You Only Look Once) (Joseph Redmon et al., 2015), or SSD (Single Shot MultiBox Detector) (Wei Liu et al., 2015), and use that network to extract image features from the image.

When retrieving images from a sentence, the sentence is input to the VSE model shown on the left side of FIG. 2, which generates the sentence feature vector. The VSE model then maps the generated sentence feature vector into the space shared with the image feature vectors, and returns as the search result the images whose feature vectors are similar to the sentence feature vector. For example, the VSE model outputs, as the search result, the images whose image feature vectors have a similarity to the sentence feature vector that exceeds a predetermined threshold.
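The threshold-based retrieval in the common space can be sketched as follows. The text only says "similarity", so the use of cosine similarity is an assumption, as are the function names, the two-dimensional toy vectors, and the threshold value. The vectors stand in for embeddings that a trained VSE model would produce.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale vectors to unit length along the last axis."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def retrieve(query_vec, candidate_vecs, threshold):
    """Return candidate indices whose cosine similarity to the query
    exceeds threshold (most similar first), plus all similarities."""
    q = l2_normalize(np.asarray(query_vec, dtype=float))
    c = l2_normalize(np.asarray(candidate_vecs, dtype=float))
    sims = c @ q  # cosine similarity, since both sides are unit length
    hits = np.flatnonzero(sims > threshold)
    ranked = hits[np.argsort(-sims[hits])]
    return ranked.tolist(), sims


# Toy embeddings in a shared 2-D space (made-up values).
sentence_vec = [1.0, 0.0]
image_vecs = [[2.0, 0.0],   # same direction as the sentence
              [1.0, 1.0],   # 45 degrees away
              [0.0, 3.0]]   # orthogonal
ranked, sims = retrieve(sentence_vec, image_vecs, threshold=0.5)
```

Because both modalities live in the same space, the same function covers image-to-sentence retrieval by passing an image vector as the query and sentence vectors as the candidates.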

The VSE model also extracts, from a linguistic expression (a sentence, phrase, keyword, or the like), language features that represent the characteristics of that expression. The VSE model may be implemented with any known technique capable of extracting language features. For example, the VSE model may include a recurrent neural network (RNN) and use the RNN to extract language features from the linguistic expression. The VSE model may also include a GRU (Gated Recurrent Unit) or an LSTM (Long Short-Term Memory) and use it to extract language features from the linguistic expression. The VSE model may also include Transformer (Ashish Vaswani et al., 2017) or a Transformer-based model such as BERT (Bidirectional Encoder Representations from Transformers), GPT-3 (Generative Pre-trained Transformer 3), or T5 (Text-to-Text Transfer Transformer), and use it to extract language features from the linguistic expression.

図２の右側は、実施形態に係るコンセプト識別器の出力結果の一例を示す。コンセプト識別器は、コンセプトを含む画像が入力された場合に、画像に含まれるコンセプトと画像との類似度を示すコンセプト類似度を出力するよう学習された学習済みの機械学習モデルである。ここで、コンセプト識別器が学習するコンセプト（検出対象ともいう）には、画像に含まれる物体や人物等の対象物に限らず、画像の場面（シーン）および画像に含まれる人物や動物等の動作（走っている、座っている等）等の概念が含まれる。例えば、図２の右側に示すコンセプト識別器の出力結果は、バイクの横に男性が立っている画像がコンセプト識別器に入力された場合に、画像に含まれる男性の髪の毛の色、男性が着ている服装、男性の体の部位、バイクの色、背景の山や海、赤い橋といった対象を検出し、対象のクラス（カテゴリ）を出力したものである。なお、図２の右側では図示を省略しているが、コンセプト識別器は、画像に含まれるコンセプトのクラスとともに、画像に含まれるコンセプトが当該コンセプトのクラス（カテゴリ）に該当する確率を出力する。このように、コンセプト識別器は、コンセプトを含む画像が入力された場合に、画像に含まれるコンセプトを検出するとともに、検出されたコンセプトのクラスを推定する。すなわち、コンセプト識別器は、コンセプトを含む画像が入力された場合に、コンセプト類似度として、画像に含まれる各コンセプトが推定された各コンセプトのクラスに該当する確率をそれぞれ出力する。 The right side of FIG. 2 shows an example of the output result of the concept classifier according to the embodiment. The concept classifier is a trained machine learning model that, when an image including a concept is input, outputs a concept similarity indicating the degree of similarity between the concept included in the image and the image. Here, the concepts that the concept classifier learns (also called detection targets) are not limited to objects such as things and people included in the image; they also include notions such as the scene of the image and the actions (running, sitting, etc.) of people or animals included in the image. For example, the output result shown on the right side of FIG. 2 is obtained when an image of a man standing next to a motorcycle is input to the concept classifier: the classifier detects targets such as the color of the man's hair, the clothes the man is wearing, the man's body parts, the color of the motorcycle, the mountains and sea in the background, and a red bridge, and outputs the class (category) of each target. Although not shown on the right side of FIG. 2, the concept classifier outputs, together with the class of each concept included in the image, the probability that the concept corresponds to that class (category).
In this way, when an image including a concept is input, the concept classifier detects the concept included in the image and estimates the class of the detected concept. That is, when an image including a concept is input, the concept classifier outputs the probability that each concept included in the image corresponds to the estimated class of each concept as the concept similarity.

〔２．情報処理装置の構成〕
次に、図３を用いて、実施形態に係る情報処理装置の構成について説明する。図３は、実施形態に係る情報処理装置の構成例を示す図である。図３に示すように、情報処理装置１００は、通信部１１０と、記憶部１２０と、入力部１３０と、出力部１４０と、制御部１５０とを有する。 [2. Configuration of information processing device]
Next, the configuration of the information processing device according to the embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating a configuration example of the information processing device according to the embodiment. As shown in FIG. 3, the information processing device 100 includes a communication unit 110, a storage unit 120, an input unit 130, an output unit 140, and a control unit 150.

（通信部１１０）
通信部１１０は、例えば、ＮＩＣ（Network Interface Card）、モデムチップ及びアンテナモジュール等によって実現される。また、通信部１１０は、ネットワークＮ（図示略）と有線又は無線で接続される。 (Communication unit 110)
The communication unit 110 is realized by, for example, a NIC (Network Interface Card), a modem chip, an antenna module, and the like. Further, the communication unit 110 is connected to a network N (not shown) by wire or wirelessly.

（記憶部１２０）
記憶部１２０は、例えば、ＲＡＭ（Random Access Memory）、フラッシュメモリ等の半導体メモリ素子、又は、ハードディスク、光ディスク等の記憶装置によって実現される。例えば、記憶部１２０は、複数の映像または複数の映像それぞれに含まれる各シーンである画像のデータベースである映像プールを記憶する。また、記憶部１２０は、複数の文章または複数の文章それぞれに含まれる各テキストである文字列のデータベースであるキャプションプールを記憶する。 (Storage unit 120)
The storage unit 120 is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. For example, the storage unit 120 stores a video pool, which is a database of a plurality of videos or of images that are the individual scenes included in each of the plurality of videos. The storage unit 120 also stores a caption pool, which is a database of a plurality of sentences or of character strings that are the individual texts included in each of the plurality of sentences.

（入力部１３０）
入力部１３０は、利用者から各種操作の入力を受け付ける。例えば、入力部１３０は、タッチパネル機能により表示面（例えば出力部１４０）を介して利用者からの各種操作を受け付けてもよい。また、入力部１３０は、情報処理装置１００に設けられたボタンや、情報処理装置１００に接続されたキーボードやマウスからの各種操作を受け付けてもよい。例えば、入力部１３０は、利用者からクエリ文の入力を受け付けてよい。また、入力部１３０は、利用者からクエリ画像の入力を受け付けてよい。 (Input unit 130)
The input unit 130 receives inputs for various operations from the user. For example, the input unit 130 may receive various operations from the user via a display screen (for example, the output unit 140) using a touch panel function. Further, the input unit 130 may accept various operations from buttons provided on the information processing device 100 or a keyboard or mouse connected to the information processing device 100. For example, the input unit 130 may accept input of a query sentence from the user. Furthermore, the input unit 130 may accept input of a query image from the user.

（出力部１４０）
出力部１４０は、例えば液晶ディスプレイや有機ＥＬ（Electro-Luminescence）ディスプレイ等によって実現される表示画面であり、各種情報を表示するための表示装置である。出力部１４０は、制御部１５０の制御に従って、各種情報を表示する。なお、情報処理装置１００にタッチパネルが採用される場合には、入力部１３０と出力部１４０とは一体化される。また、以下の説明では、出力部１４０を画面と記載する場合がある。 (Output unit 140)
The output unit 140 is a display screen realized by, for example, a liquid crystal display or an organic EL (Electro-Luminescence) display, and is a display device for displaying various information. The output unit 140 displays various information under the control of the control unit 150. Note that when a touch panel is employed in the information processing device 100, the input unit 130 and the output unit 140 are integrated. In the following description, the output unit 140 may also be referred to as a screen.

（制御部１５０）
制御部１５０は、コントローラ（Controller）であり、例えば、ＣＰＵ（Central Processing Unit）、ＭＰＵ（Micro Processing Unit）、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡ（Field Programmable Gate Array）等によって、情報処理装置１００の内部の記憶装置に記憶されている各種プログラム（情報処理プログラムの一例に相当）がＲＡＭ等の記憶領域を作業領域として実行されることにより実現される。図３に示す例では、制御部１５０は、属性情報抽出部１５１と、モデル生成部１５２と、受付部１５３と、取得部１５４と、検索部１５５と、抽出部１５６と、生成部１５７を有する。 (Control unit 150)
The control unit 150 is a controller. It is realized by, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field Programmable Gate Array) executing various programs (corresponding to an example of the information processing program) stored in a storage device inside the information processing device 100, using a storage area such as a RAM as a work area. In the example shown in FIG. 3, the control unit 150 includes an attribute information extraction unit 151, a model generation unit 152, a reception unit 153, an acquisition unit 154, a search unit 155, an extraction unit 156, and a generation unit 157.

（属性情報抽出部１５１）
属性情報抽出部１５１は、セグメンテーションの技術を用いて領域分割された画像のうち、構造を有する物体を含む分割領域である物体領域に関する領域情報を抽出する。例えば、属性情報抽出部１５１は、構造を有する物体の一例として、人物を含む分割領域である人物領域に関する人物領域情報を抽出する。例えば、属性情報抽出部１５１は、人物領域情報の一例として、人物が身に着けている各ファッションアイテムを含む分割領域であるアイテム領域に関するアイテム領域情報を抽出してよい。 (Attribute information extraction unit 151)
The attribute information extraction unit 151 extracts region information regarding an object region, which is a divided region including an object having a structure, from an image divided into regions using a segmentation technique. For example, the attribute information extraction unit 151 extracts person area information regarding a person area, which is a divided area including a person, as an example of an object having a structure. For example, the attribute information extraction unit 151 may extract item area information regarding an item area, which is a divided area including each fashion item worn by a person, as an example of person area information.

また、属性情報抽出部１５１は、人物領域情報の一例として、人物の各身体部位を含む分割領域である身体部位領域に関する身体部位領域情報を抽出してよい。例えば、属性情報抽出部１５１は、身体部位領域情報の一例として、人物の頭髪の領域に関する情報を抽出してよい。また、属性情報抽出部１５１は、身体部位領域情報の一例として、人物の顔の領域に関する情報を抽出してよい。 Further, the attribute information extraction unit 151 may extract body part area information regarding a body part area, which is a divided area including each body part of a person, as an example of person area information. For example, the attribute information extraction unit 151 may extract information regarding a person's hair region as an example of body part region information. Further, the attribute information extraction unit 151 may extract information regarding a person's face area as an example of body part area information.

また、属性情報抽出部１５１は、姿勢推定の技術を用いて推定された物体の構造に関する構造情報を抽出する。例えば、属性情報抽出部１５１は、構造情報の一例として、姿勢推定の技術を用いて推定された人物の骨格に関する骨格情報を抽出する。例えば、属性情報抽出部１５１は、骨格情報の一例として、人物の各関節の関節位置情報を抽出する。 Furthermore, the attribute information extraction unit 151 extracts structural information regarding the structure of the object estimated using the pose estimation technique. For example, the attribute information extraction unit 151 extracts, as an example of structural information, skeletal information regarding a human skeleton estimated using a posture estimation technique. For example, the attribute information extraction unit 151 extracts joint position information of each joint of a person as an example of skeletal information.

また、属性情報抽出部１５１は、領域情報と構造情報を抽出すると、抽出した領域情報と構造情報に基づいて、物体の属性に関する属性情報を抽出する。例えば、属性情報抽出部１５１は、抽出した人物領域情報と骨格情報に基づいて、画像に含まれる人物の属性に関する人物属性情報を抽出する。 Furthermore, after extracting the area information and structure information, the attribute information extraction unit 151 extracts attribute information regarding the attributes of the object based on the extracted area information and structure information. For example, the attribute information extraction unit 151 extracts person attribute information regarding the attributes of the person included in the image based on the extracted person area information and skeleton information.

例えば、属性情報抽出部１５１は、人物属性情報の一例として、抽出した人物の各関節の関節位置情報に基づいて、人物の姿勢に関する姿勢情報を抽出する。例えば、属性情報抽出部１５１は、人物が立っている状態、座っている状態、または右手を挙げている状態である等の姿勢情報を抽出してよい。 For example, the attribute information extraction unit 151 extracts posture information regarding the posture of the person based on joint position information of each joint of the extracted person as an example of the person attribute information. For example, the attribute information extraction unit 151 may extract posture information such as whether the person is standing, sitting, or raising their right hand.
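The mapping from joint positions to posture labels described above can be illustrated with a minimal sketch. The specification does not disclose the actual rules, so everything here is an assumption made for illustration: the joint names (`right_hip`, `right_knee`, `right_shoulder`, `right_wrist`), the image coordinate convention (y grows downward), and the 0.5 ratio used to separate sitting from standing.

```python
import math

def classify_posture(joints):
    """Return posture labels from 2D joint positions.

    joints: dict mapping a (hypothetical) joint name to (x, y) in image
    coordinates, where y increases downward.
    """
    labels = []

    # Assumption: when seated, the thigh is closer to horizontal, so the
    # vertical drop from hip to knee is small relative to the thigh length.
    hip, knee = joints["right_hip"], joints["right_knee"]
    thigh_len = math.dist(hip, knee)
    vertical_drop = knee[1] - hip[1]
    if thigh_len > 0 and vertical_drop / thigh_len < 0.5:
        labels.append("sitting")
    else:
        labels.append("standing")

    # Assumption: the right hand counts as raised when the wrist is above
    # the shoulder (smaller y value).
    if joints["right_wrist"][1] < joints["right_shoulder"][1]:
        labels.append("right hand raised")

    return labels
```

A real system would more likely learn such labels from annotated skeleton data; the hand-written rules only show how joint position information can carry posture information.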

また、属性情報抽出部１５１は、人物属性情報の一例として、抽出した人物の各関節の関節位置情報の時間変化のパターンに基づいて、人物の動作に関する動作情報を抽出する。例えば、属性情報抽出部１５１は、人物が立っている状態から座っている状態に姿勢を変化させたという動作情報を抽出してよい。また、属性情報抽出部１５１は、人物が走っている、歩いている、または右手を振っている等の動作情報を抽出してよい。 Further, the attribute information extraction unit 151 extracts motion information regarding the motion of the person, as an example of the person attribute information, based on a pattern of temporal changes in joint position information of each joint of the extracted person. For example, the attribute information extraction unit 151 may extract motion information indicating that the person changed their posture from a standing state to a sitting state. Furthermore, the attribute information extraction unit 151 may extract motion information such as whether the person is running, walking, or waving their right hand.
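As a rough illustration of deriving motion information from a temporal pattern, the sketch below scans per-frame posture labels and reports each change, such as standing to sitting. It assumes that a posture label has already been estimated for every frame; the specification works from the temporal change of joint positions and does not prescribe this particular simplification.

```python
def detect_posture_transitions(posture_per_frame):
    """Return (frame_index, previous_posture, new_posture) for each change.

    posture_per_frame: list of posture labels, one per video frame
    (assumed to have been estimated beforehand from joint positions).
    """
    transitions = []
    for i in range(1, len(posture_per_frame)):
        prev, cur = posture_per_frame[i - 1], posture_per_frame[i]
        if prev != cur:
            transitions.append((i, prev, cur))
    return transitions
```

For continuous motions such as running or waving, a sequence model over the raw joint trajectories would be a more natural fit than label comparison.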

また、属性情報抽出部１５１は、人物属性情報の一例として、抽出したアイテム領域情報に基づいて、人物が身に着けているファッションアイテムの属性に関するアイテム属性情報を抽出する。例えば、属性情報抽出部１５１は、アイテム属性情報の一例として、人物が身に着けている衣服や靴、帽子、カバンなどのファッションアイテムの種類や色、形状、材質等を示す情報を抽出してよい。 Further, the attribute information extraction unit 151 extracts item attribute information regarding the attributes of a fashion item worn by the person, as an example of person attribute information, based on the extracted item area information. For example, as an example of item attribute information, the attribute information extraction unit 151 may extract information indicating the type, color, shape, material, etc. of fashion items worn by the person, such as clothes, shoes, hats, and bags.

また、属性情報抽出部１５１は、人物属性情報の一例として、人物の身体部位の属性に関する身体部位属性情報を抽出する。例えば、属性情報抽出部１５１は、身体部位属性情報の一例として、人物の頭髪の領域に関する情報に基づいて、人物の髪型に関する情報を抽出する。例えば、属性情報抽出部１５１は、人物の髪型の種類や色、形状、毛質当を示す情報を抽出してよい。また、属性情報抽出部１５１は、身体部位領域情報の一例として、人物の顔の領域に関する情報に基づいて、人物の表情に関する情報を抽出する。例えば、属性情報抽出部１５１は、人物の表情の種類（笑っている、怒っている等）や表情の度合い（少し笑っている、とても怒っている等）等を示す情報を抽出してよい。 Further, the attribute information extraction unit 151 extracts body part attribute information regarding attributes of a person's body parts as an example of person attribute information. For example, the attribute information extraction unit 151 extracts information regarding a person's hairstyle based on information regarding the hair region of the person as an example of body part attribute information. For example, the attribute information extraction unit 151 may extract information indicating the type, color, shape, and hair type of a person's hairstyle. Further, the attribute information extraction unit 151 extracts information regarding the facial expression of the person based on information regarding the facial area of the person as an example of body part area information. For example, the attribute information extraction unit 151 may extract information indicating the type of facial expression of the person (smiling, angry, etc.), the degree of facial expression (slightly smiling, very angry, etc.), and the like.

なお、上述した例では、属性情報抽出部１５１が、構造を有する物体の一例として、画像に含まれる人物の属性に関する人物属性情報を抽出する場合について説明したが、属性情報抽出部１５１は、構造を有する物体であれば、人物以外のどのような物体の属性情報を抽出してもよい。例えば、属性情報抽出部１５１は、構造を有する物体の一例として、ドア部分、窓ガラス部分、およびタイヤ部分といったパーツを組み合わせて構成される車両の属性に関する属性情報を抽出してよい。例えば、属性情報抽出部１５１は、画像のうち、領域情報として、車両である物体を含む分割領域である車両領域に関する車両領域情報、および、構造情報として、姿勢推定の技術を用いて推定された車両の骨格に関する骨格情報に基づいて、属性情報として、車両の属性に関する車両属性情報を抽出する。 In the above example, the case where the attribute information extraction unit 151 extracts person attribute information regarding the attributes of a person included in an image, as an example of an object having a structure, has been described. However, the attribute information extraction unit 151 may extract attribute information of any object other than a person, as long as the object has a structure. For example, the attribute information extraction unit 151 may extract attribute information regarding the attributes of a vehicle, which is composed of a combination of parts such as door portions, window glass portions, and tire portions, as an example of an object having a structure. For example, the attribute information extraction unit 151 extracts vehicle attribute information regarding the attributes of the vehicle as the attribute information, based on vehicle region information (the region information) regarding a vehicle region, which is a divided region of the image including the vehicle, and on skeleton information (the structure information) regarding the skeleton of the vehicle estimated using a pose estimation technique.

（モデル生成部１５２）
モデル生成部１５２は、属性情報抽出部１５１によって抽出された属性情報に基づいて生成された文章であって、画像の内容を示す文章と画像とを対応付けて共通空間に埋め込むように学習されたＶＳＥモデルを生成する。具体的には、モデル生成部１５２は、属性情報抽出部１５１によって抽出された人物属性情報に基づいて生成された文章であって、画像の内容を示す文章と画像とを対応付けて共通空間に埋め込むように学習されたＶＳＥモデルを生成する。 (Model generation unit 152)
The model generation unit 152 generates a VSE model trained to associate an image with a sentence that is generated based on the attribute information extracted by the attribute information extraction unit 151 and that indicates the content of the image, and to embed them in a common space. Specifically, the model generation unit 152 generates a VSE model trained to associate an image with a sentence that is generated based on the person attribute information extracted by the attribute information extraction unit 151 and that indicates the content of the image, and to embed them in a common space.

例えば、モデル生成部１５２は、属性情報抽出部１５１によって抽出された人物属性情報の特徴を示す人物属性特徴ベクトルを生成する。例えば、モデル生成部１５２は、セグメンテーションの技術を用いて抽出された人物領域の特徴を示す人物領域特徴ベクトルを生成する。また、モデル生成部１５２は、姿勢推定の技術を用いて推定された人物の骨格情報の特徴を示す骨格情報特徴ベクトルを生成する。続いて、モデル生成部１５２は、生成した人物領域特徴ベクトルと骨格情報特徴ベクトルをつなぎ合わせることで、人物属性特徴ベクトルを生成する。また、モデル生成部１５２は、生成した画像の内容を示す文章の特徴を示す文章特徴ベクトルを生成する。モデル生成部１５２は、人物属性特徴ベクトルと文章特徴ベクトルを生成すると、生成した人物属性特徴ベクトルと文章特徴ベクトルとが共通空間において類似するようにＶＳＥモデルを学習することで、ＶＳＥモデルを生成する。 For example, the model generation unit 152 generates a person attribute feature vector indicating the features of the person attribute information extracted by the attribute information extraction unit 151. For example, the model generation unit 152 generates a person area feature vector indicating the features of the person area extracted using a segmentation technique. The model generation unit 152 also generates a skeleton information feature vector indicating the features of the person's skeleton information estimated using a pose estimation technique. Subsequently, the model generation unit 152 generates the person attribute feature vector by concatenating the generated person area feature vector and skeleton information feature vector. The model generation unit 152 also generates a sentence feature vector indicating the features of the generated sentence that indicates the content of the image. Having generated the person attribute feature vector and the sentence feature vector, the model generation unit 152 generates the VSE model by training it so that the two vectors are similar in the common space.
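The concatenation step and the training objective can be sketched as follows. The specification only states that the person attribute feature vector and the sentence feature vector are trained to be similar in the common space; the hinge-based ranking loss, the cosine similarity measure, the margin of 0.2, and all function names below are assumptions chosen because they are common in VSE-style training, not details taken from the source. The sketch also assumes the vectors have already been projected to the same dimensionality.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def person_attribute_vector(region_vec, skeleton_vec):
    """Concatenate the person area features and the skeleton features."""
    return list(region_vec) + list(skeleton_vec)

def hinge_ranking_loss(image_vec, matching_text_vec, other_text_vec, margin=0.2):
    """Assumed loss: penalize the model when a non-matching sentence is not
    at least `margin` less similar to the image than the matching sentence."""
    pos = cosine(image_vec, matching_text_vec)
    neg = cosine(image_vec, other_text_vec)
    return max(0.0, margin - pos + neg)
```

Minimizing such a loss over many image and sentence pairs pulls matching pairs together and pushes non-matching pairs apart, which is one way to obtain the "similar in the common space" property the paragraph describes.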

（受付部１５３）
受付部１５３は、利用者によって入力されたクエリ文を受け付ける。例えば、受付部１５３は、入力部１３０を介して利用者が入力したクエリ文を受け付ける。ここで、本願明細書におけるクエリ文とは、完全な文章でなくてもよく、例えば、キーワードやフレーズであってもよい。以下では、受付部１５３が利用者から最初に受け付けたクエリ文を「第１クエリ文」と記載する。例えば、受付部１５３は、第１クエリ文の一例として、利用者から「person in a car」というフレーズを受け付ける。 (Reception unit 153)
The reception unit 153 receives a query sentence input by a user. For example, the reception unit 153 accepts a query sentence input by a user via the input unit 130. Here, the query sentence in this specification does not have to be a complete sentence, and may be a keyword or a phrase, for example. In the following, the query sentence that the reception unit 153 first receives from the user will be referred to as a "first query sentence." For example, the reception unit 153 accepts the phrase "person in a car" from the user as an example of the first query sentence.

（取得部１５４）
取得部１５４は、映像プールから画像を取得する。例えば、取得部１５４は、受付部１５３が第１クエリ文を受け付けると、記憶部１２０を参照して、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する。例えば、取得部１５４は、Ｎ個（Ｎは自然数）の画像＃１１～画像＃１Ｎを映像プールから取得する。 (Acquisition unit 154)
The acquisition unit 154 acquires images from the video pool. For example, when the reception unit 153 receives the first query sentence, the acquisition unit 154 refers to the storage unit 120 and acquires, from the video pool, a plurality of videos or images that are the individual scenes included in each of the plurality of videos. For example, the acquisition unit 154 acquires N images #11 to #1N (N is a natural number) from the video pool.

（検索部１５５）
検索部１５５は、モデル生成部１５２によって生成されたＶＳＥモデルを用いて、受付部１５３によって受け付けられた第１クエリ文に関する第１画像を検索する。具体的には、検索部１５５は、取得部１５４が画像を取得すると、受付部１５３によって受け付けられた第１クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。例えば、検索部１５５は、第１クエリ文である「person in a car」とＮ個の画像＃１１～画像＃１Ｎそれぞれとの組をＶＳＥモデルに入力する。 (Search unit 155)
The search unit 155 uses the VSE model generated by the model generation unit 152 to search for a first image related to the first query sentence received by the reception unit 153. Specifically, when the acquisition unit 154 acquires an image, the search unit 155 inputs the set of the first query sentence accepted by the reception unit 153 and the image acquired by the acquisition unit 154 to the VSE model. For example, the search unit 155 inputs a set of the first query sentence "person in a car" and each of N images #11 to #1N to the VSE model.

続いて、検索部１５５は、第１クエリ文と画像との第１類似度をＶＳＥモデルから出力する。例えば、検索部１５５は、第１クエリ文とＮ個の画像＃１１～画像＃１Ｎそれぞれとの類似度＃１１～類似度＃１Ｎそれぞれを出力する。続いて、検索部１５５は、出力された第１類似度が第１閾値を超える第１画像を検索する。例えば、類似度＃１１～類似度＃１３は第１閾値を超えるが、類似度＃１４～類似度＃１Ｎは第１閾値以下であるとする。このとき、検索部１５５は、Ｎ個の画像＃１１～画像＃１Ｎの中から、第１クエリ文との第１類似度が第１閾値を超える画像＃１１～画像＃１３を第１画像として取得する。 Subsequently, the search unit 155 obtains, as output from the VSE model, the first similarity between the first query sentence and each image. For example, the search unit 155 outputs similarities #11 to #1N between the first query sentence and each of the N images #11 to #1N. Subsequently, the search unit 155 searches for first images whose output first similarity exceeds a first threshold. For example, suppose that similarities #11 to #13 exceed the first threshold, while similarities #14 to #1N are less than or equal to the first threshold. In this case, the search unit 155 acquires, from among the N images #11 to #1N, images #11 to #13, whose first similarity to the first query sentence exceeds the first threshold, as the first images.

なお、検索部１５５は、出力された第１類似度が第１閾値を超える第１画像を検索する代わりに、出力された第１類似度が高い方から順にいくつかの第１画像を検索してよい。例えば、第１クエリ文とＮ個の画像＃１１～画像＃１Ｎそれぞれとの類似度＃１１～類似度＃１Ｎのうち、類似度＃１１の類似度が最も高く、Ｎが大きくなるほど類似度が低いとする。このとき、検索部１５５は、Ｎ個の画像＃１１～画像＃１Ｎの中から、第１クエリ文との第１類似度が高い方から順に、例えば、３つの画像＃１１～画像＃１３を第１画像として取得してよい。 Note that instead of searching for first images whose output first similarity exceeds the first threshold, the search unit 155 may retrieve several first images in descending order of output first similarity. For example, suppose that, among similarities #11 to #1N between the first query sentence and each of the N images #11 to #1N, similarity #11 is the highest and the similarity decreases as the index increases. In this case, the search unit 155 may acquire, from among the N images #11 to #1N, for example the three images #11 to #13 in descending order of first similarity to the first query sentence as the first images.
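The two selection strategies described for the search unit 155, thresholding and taking the highest-ranked images, can be compared in a short sketch. It assumes the similarities have already been computed by the VSE model and are supplied as a dictionary from image identifier to score; the function names and example values are hypothetical.

```python
def select_by_threshold(similarities, threshold):
    """Return the images whose similarity strictly exceeds the threshold."""
    return [image for image, score in similarities.items() if score > threshold]

def select_top_k(similarities, k):
    """Return the k images with the highest similarity, best first."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:k]

# Illustrative scores only; these values do not come from the source.
scores = {"image#11": 0.92, "image#12": 0.88, "image#13": 0.81, "image#14": 0.40}
first_images_by_threshold = select_by_threshold(scores, threshold=0.8)
first_images_by_rank = select_top_k(scores, k=3)
```

Thresholding returns a variable number of images (possibly none), whereas top-k selection always returns a fixed number regardless of how well the images actually match the query.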

（抽出部１５６）
抽出部１５６は、第１画像に関するコンセプトを抽出する。具体的には、抽出部１５６は、コンセプトを含む画像が入力された場合に、画像に含まれるコンセプトと画像とのコンセプト類似度を出力するよう学習された学習済みの機械学習モデルであるコンセプト識別器を用いて、第１画像から第１画像に関するコンセプトを抽出する。例えば、抽出部１５６は、検索部１５５によって第１画像が検索されると、検索部１５５によって検索された第１画像をコンセプト識別器に入力する。例えば、抽出部１５６は、検索部１５５によって検索された第１画像である画像＃１１～画像＃１３それぞれをコンセプト識別器に入力する。なお、以下では、簡単のため、第１画像が画像＃１１のみである場合について説明する。 (Extraction unit 156)
The extraction unit 156 extracts concepts related to the first image. Specifically, the extraction unit 156 extracts the concepts related to the first image from the first image using the concept classifier, which is a trained machine learning model that, when an image including a concept is input, outputs the concept similarity between the concept included in the image and the image. For example, when the search unit 155 retrieves a first image, the extraction unit 156 inputs the retrieved first image to the concept classifier. For example, the extraction unit 156 inputs each of images #11 to #13, which are the first images retrieved by the search unit 155, to the concept classifier. Note that, for the sake of simplicity, the following describes the case in which the first image is image #11 only.

続いて、抽出部１５６は、第１画像に含まれるコンセプトと第１画像とのコンセプト類似度をコンセプト識別器から出力する。例えば、抽出部１５６は、画像＃１１に含まれるコンセプトである「car_interior」と画像＃１１とのコンセプト類似度＃２１である「９０％」をコンセプト識別器から出力する。また、例えば、抽出部１５６は、画像＃１１に含まれるコンセプトである「自動車」と画像＃１１とのコンセプト類似度＃２２である「８０％」をコンセプト識別器から出力する。また、例えば、抽出部１５６は、画像＃１１に含まれるコンセプトである「バイク」と画像＃１１とのコンセプト類似度＃２３である「７０％」をコンセプト識別器から出力する。 Subsequently, the extraction unit 156 obtains, as output from the concept classifier, the concept similarity between each concept included in the first image and the first image. For example, the extraction unit 156 outputs from the concept classifier "90%" as concept similarity #21 between "car_interior", a concept included in image #11, and image #11. Likewise, it outputs "80%" as concept similarity #22 between "automobile", a concept included in image #11, and image #11, and "70%" as concept similarity #23 between "motorcycle", a concept included in image #11, and image #11.

続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトを抽出する。例えば、コンセプト閾値が「８５％」であるとする。このとき、抽出部１５６は、コンセプト類似度がコンセプト閾値である「８５％」を超えるコンセプトである「car_interior」を抽出する。続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトの中から、第１クエリ文に含まれないコンセプトである隠れコンセプトを抽出する。例えば、抽出部１５６は、第１クエリ文に含まれない文字列を含むコンセプトを隠れコンセプトとして抽出する。例えば、抽出部１５６は、第１クエリ文である「person in a car」に含まれない文字列である「interior」を含むコンセプトである「car_interior」を隠れコンセプトとして抽出する。 Subsequently, the extraction unit 156 extracts concepts whose output concept similarity exceeds the concept threshold. For example, assume that the concept threshold is "85%". At this time, the extraction unit 156 extracts "car_interior", which is a concept whose concept similarity exceeds the concept threshold of "85%". Subsequently, the extraction unit 156 extracts hidden concepts, which are concepts that are not included in the first query sentence, from among the concepts whose output concept similarity exceeds the concept threshold. For example, the extraction unit 156 extracts a concept that includes a character string that is not included in the first query sentence as a hidden concept. For example, the extraction unit 156 extracts "car_interior", which is a concept that includes the character string "interior" that is not included in the first query sentence "person in a car", as a hidden concept.
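The hidden-concept step can be sketched with the example values from the text ("car_interior" at 90%, "automobile" at 80%, "motorcycle" at 70%, and a concept threshold of 85%). How the comparison against the query is performed is not specified beyond "includes a character string not included in the first query sentence", so splitting concept labels on underscores and comparing lowercase words is an assumption, as is the function name.

```python
def extract_hidden_concepts(concept_scores, query, concept_threshold):
    """Return concepts above the threshold that contain a word absent from the query.

    concept_scores: dict mapping a concept label to its concept similarity (0..1).
    """
    query_words = set(query.lower().split())
    hidden = []
    for concept, score in concept_scores.items():
        if score <= concept_threshold:
            continue  # keep only concepts whose similarity exceeds the threshold
        # Assumption: concept labels use underscores between words.
        tokens = concept.lower().split("_")
        if any(token not in query_words for token in tokens):
            hidden.append(concept)
    return hidden

concept_scores = {"car_interior": 0.90, "automobile": 0.80, "motorcycle": 0.70}
hidden = extract_hidden_concepts(concept_scores, "person in a car", 0.85)
```

With these inputs only "car_interior" passes the threshold, and it counts as hidden because "interior" does not appear in the query.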

なお、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトを抽出する代わりに、出力されたコンセプト類似度が高い方から順にいくつかのコンセプトを抽出してよい。例えば、第１画像とＭ個（Ｍは自然数）のコンセプト＃１１～コンセプト＃１Ｍそれぞれとのコンセプト類似度＃１１～コンセプト類似度＃１Ｍのうち、コンセプト類似度＃１１のコンセプト類似度が最も高く、Ｍが大きくなるほどコンセプト類似度が低いとする。このとき、抽出部１５６は、Ｍ個のコンセプト＃１１～コンセプト＃１Ｍの中から、第１画像とのコンセプト類似度が高い方から順に、例えば、３つのコンセプト＃１１～コンセプト＃１３を抽出してよい。 Note that instead of extracting concepts whose output concept similarity exceeds the concept threshold, the extraction unit 156 may extract several concepts in descending order of output concept similarity. For example, suppose that, among concept similarities #11 to #1M between the first image and each of M concepts #11 to #1M (M is a natural number), concept similarity #11 is the highest and the concept similarity decreases as the index increases. In this case, the extraction unit 156 may extract, from among the M concepts #11 to #1M, for example the three concepts #11 to #13 in descending order of concept similarity to the first image.

（生成部１５７）
以下では、抽出部１５６によって抽出された隠れコンセプトに基づいて生成されたクエリ文のことを「第２クエリ文」と記載する。生成部１５７は、抽出部１５６によって抽出された隠れコンセプトに基づいて、第２クエリ文を生成する。例えば、生成部１５７は、抽出部１５６によって抽出された隠れコンセプトに基づいて、第１クエリ文を更新して、第２クエリ文を生成してよい。例えば、生成部１５７は、抽出部１５６によって抽出された隠れコンセプトである「car_interior」を含む第２クエリ文を生成してよい。出力部１４０は、生成部１５７によって生成された第２クエリ文を出力する。例えば、出力部１４０は、生成部１５７によって生成された第２クエリ文の一例として、「car_interior」を出力する。受付部１５３は、出力部１４０によって出力された第２クエリ文を利用者から受け付ける。例えば、受付部１５３は、出力部１４０によって出力された第２クエリ文である「car_interior」を利用者から受け付ける。 (Generation unit 157)
Hereinafter, the query sentence generated based on the hidden concept extracted by the extraction unit 156 will be referred to as a "second query sentence." The generation unit 157 generates a second query sentence based on the hidden concept extracted by the extraction unit 156. For example, the generation unit 157 may update the first query sentence based on the hidden concept extracted by the extraction unit 156 to generate the second query sentence. For example, the generation unit 157 may generate a second query sentence that includes “car_interior”, which is the hidden concept extracted by the extraction unit 156. The output unit 140 outputs the second query statement generated by the generation unit 157. For example, the output unit 140 outputs “car_interior” as an example of the second query sentence generated by the generation unit 157. The reception unit 153 receives the second query sentence output by the output unit 140 from the user. For example, the reception unit 153 receives “car_interior”, which is the second query sentence output by the output unit 140, from the user.

なお、生成部１５７が第２クエリ文を生成する代わりに、出力部１４０によって出力された隠れコンセプトに基づいて利用者が第２クエリ文を生成してもよい。受付部１５３は、利用者によって生成された第２クエリ文を利用者から受け付けてもよい。 Note that instead of the generation unit 157 generating the second query sentence, the user may generate the second query sentence based on the hidden concept output by the output unit 140. The reception unit 153 may receive a second query sentence generated by the user from the user.

また、取得部１５４は、受付部１５３が第２クエリ文を受け付けると、記憶部１２０を参照して、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する。検索部１５５は、ＶＳＥモデルを用いて、受付部１５３によって受け付けられた第２クエリ文に関する第２画像を再検索する。例えば、検索部１５５は、受付部１５３によって受け付けられた第２クエリ文である「car_interior」に関する第２画像を再検索する。例えば、検索部１５５は、受付部１５３によって受け付けられた第２クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。続いて、検索部１５５は、画像と第２クエリ文との第２類似度をＶＳＥモデルから出力する。続いて、検索部１５５は、出力された第２類似度が第２閾値を超える第２画像を再検索する。出力部１４０は、検索部１５５によって再検索された第２画像を検索結果として出力する。このようにして、出力部１４０は、例えば、隠れコンセプトである「car_interior」に基づいて生成された第２クエリ文である「car_interior」に関する第２画像を検索結果として出力する。 Further, when the reception unit 153 receives the second query sentence, the acquisition unit 154 refers to the storage unit 120 and acquires, from the video pool, a plurality of videos or images that are the individual scenes included in each of the plurality of videos. The search unit 155 uses the VSE model to search again, this time for second images related to the second query sentence received by the reception unit 153. For example, the search unit 155 searches again for second images related to "car_interior", which is the second query sentence received by the reception unit 153. For example, the search unit 155 inputs each pair of the second query sentence received by the reception unit 153 and an image acquired by the acquisition unit 154 into the VSE model. Subsequently, the search unit 155 obtains, as output from the VSE model, the second similarity between the image and the second query sentence. Subsequently, the search unit 155 retrieves the second images whose output second similarity exceeds a second threshold. The output unit 140 outputs the second images retrieved by the search unit 155 as the search result. In this way, the output unit 140 outputs, as the search result, for example the second images related to the second query sentence "car_interior", which was generated based on the hidden concept "car_interior".

上述した例では、抽出部１５６が、第１クエリ文である「person in a car」に含まれない文字列「interior」を含むコンセプトである「car_interior」を隠れコンセプトとして抽出する場合について説明したが、他の例について説明する。例えば、受付部１５３は、第１クエリ文の一例として、「destroyed old building」というフレーズを利用者から受け付ける。検索部１５５は、ＶＳＥモデルを用いて、受付部１５３によって受け付けられた第１クエリ文である「destroyed old building」に関する第１画像を検索する。抽出部１５６は、第１画像をコンセプト識別器に入力して、第１画像に含まれるコンセプトである「ruin」を抽出する。続いて、抽出部１５６は、第１クエリ文である「destroyed old building」に含まれない文字列である「ruin」を含むコンセプトである「ruin」を隠れコンセプトとして抽出する。 The above example described the case in which the extraction unit 156 extracts, as a hidden concept, "car_interior", a concept that includes the character string "interior", which is not included in the first query sentence "person in a car". Another example is described next. For example, the reception unit 153 receives the phrase "destroyed old building" from the user as an example of the first query sentence. The search unit 155 uses the VSE model to search for first images related to "destroyed old building", the first query sentence received by the reception unit 153. The extraction unit 156 inputs the first image to the concept classifier and extracts "ruin", a concept included in the first image. Subsequently, the extraction unit 156 extracts "ruin" as a hidden concept, because it includes the character string "ruin", which is not included in the first query sentence "destroyed old building".

〔３．情報処理のフロー〕
次に、図４を用いて、実施形態に係る情報処理の手順について説明する。図４は、実施形態に係る情報処理の一例を示すフローチャートである。図４では、属性情報抽出部１５１が、処理対象となる画像を取得する（ステップＳ１１）。例えば、属性情報抽出部１５１は、処理対象となる画像を映像プールから取得する。 [3. Information processing flow]
Next, an information processing procedure according to the embodiment will be described using FIG. 4. FIG. 4 is a flowchart illustrating an example of information processing according to the embodiment. In FIG. 4, the attribute information extraction unit 151 acquires an image to be processed (step S11). For example, the attribute information extraction unit 151 acquires an image to be processed from a video pool.

続いて、属性情報抽出部１５１は、セグメンテーションの技術を用いて、画像のうち人物を含む人物領域に関する人物領域情報を抽出する。また、属性情報抽出部１５１は、姿勢推定の技術を用いて、画像に含まれる人物の骨格情報を抽出する（ステップＳ１２）。 Next, the attribute information extraction unit 151 uses segmentation technology to extract person area information regarding a person area including a person from the image. Further, the attribute information extraction unit 151 extracts skeletal information of a person included in the image using a posture estimation technique (step S12).

続いて、属性情報抽出部１５１は、人物領域情報および骨格情報に基づいて、画像に含まれる人物の属性に関する人物属性情報を抽出する（ステップＳ１３）。属性情報抽出部１５１は、人物属性情報を抽出すると、抽出された人物属性情報に基づいて、画像の内容を示す文章を生成する（ステップＳ１４）。 Next, the attribute information extraction unit 151 extracts person attribute information regarding the attributes of the person included in the image based on the person area information and the skeleton information (step S13). After extracting the person attribute information, the attribute information extraction unit 151 generates a sentence indicating the content of the image based on the extracted person attribute information (step S14).

モデル生成部１５２は、属性情報抽出部１５１が生成した文章と画像とを対応付けて共通空間に埋め込むように学習されたＶＳＥモデルを生成する（ステップＳ１５）。 The model generation unit 152 generates a VSE model learned to associate the text and image generated by the attribute information extraction unit 151 and embed them in a common space (step S15).
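
Step S14 turns the extracted person attribute information into a sentence, but the text does not say how the sentence is composed. The following is a minimal template-based sketch of one possibility. The function name `generate_caption` and the attribute keys (`hairstyle`, `expression`, `items`, `action`, `posture`) are hypothetical and chosen only for illustration.

```python
def generate_caption(attrs: dict) -> str:
    """Compose a caption from person attribute information (step S14).

    The keys are hypothetical. The source names the attribute categories
    (posture, action, fashion items, hairstyle, expression) but neither a
    concrete schema nor a generation method.
    """
    subject = "a person"
    details = []
    if attrs.get("hairstyle"):
        details.append(f"{attrs['hairstyle']} hair")
    if attrs.get("expression"):
        details.append(f"a {attrs['expression']} expression")
    if details:
        subject += " with " + " and ".join(details)

    parts = [subject]
    if attrs.get("items"):
        parts.append("wearing " + " and ".join(attrs["items"]))
    # Prefer the action (e.g. "walking") over the static posture when both exist.
    verb = attrs.get("action") or attrs.get("posture")
    if verb:
        parts.append(verb)
    return " ".join(parts)
```

A sentence produced this way would then be paired with its source image as training data for the VSE model in step S15.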

次に、図５を用いて、実施形態に係る情報処理の手順について説明する。図５は、実施形態に係る情報処理の一例を示すフローチャートである。図５では、受付部１５３が、利用者によって入力された第１クエリ文を受け付ける（ステップＳ１０１）。取得部１５４は、受付部１５３が第１クエリ文を受け付けると、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する（ステップＳ１０２）。 Next, the information processing procedure according to the embodiment will be described using FIG. 5. FIG. 5 is a flowchart illustrating an example of information processing according to the embodiment. In FIG. 5, the reception unit 153 receives the first query sentence input by the user (step S101). When the reception unit 153 receives the first query sentence, the acquisition unit 154 acquires from the video pool images that are either a plurality of videos or the individual scenes included in each of those videos (step S102).

検索部１５５は、受付部１５３によって受け付けられた第１クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。続いて、検索部１５５は、第１クエリ文と画像との第１類似度をＶＳＥモデルから出力する（ステップＳ１０３）。続いて、検索部１５５は、出力された第１類似度が第１閾値を超える第１画像を検索する（ステップＳ１０４）。 The search unit 155 inputs the set of the first query sentence received by the reception unit 153 and the image acquired by the acquisition unit 154 into the VSE model. Subsequently, the search unit 155 outputs the first similarity between the first query sentence and the image from the VSE model (step S103). Subsequently, the search unit 155 searches for a first image whose output first similarity exceeds the first threshold (step S104).
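
Steps S103 and S104 can be sketched as a threshold filter over similarities in the common embedding space. This sketch assumes the VSE model has already embedded the query sentence and each image as vectors, and that similarity is cosine similarity. Cosine similarity is a common choice for VSE models, but the source does not name the measure. The names `cosine_similarity` and `search_images` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_images(
    query_vec: Sequence[float],
    image_vecs: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (image_id, similarity) pairs whose similarity exceeds the threshold.

    Results are sorted best first. The sort is an addition for convenience;
    the source only requires the threshold comparison.
    """
    scored = (
        (image_id, cosine_similarity(query_vec, vec))
        for image_id, vec in image_vecs.items()
    )
    hits = [(image_id, score) for image_id, score in scored if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

The same filter applies to the re-search in steps S112 and S113, with the second query sentence and the second threshold in place of the first.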

抽出部１５６は、検索部１５５によって検索された第１画像をコンセプト識別器に入力する（ステップＳ１０５）。続いて、抽出部１５６は、第１画像に含まれるコンセプトと第１画像とのコンセプト類似度をコンセプト識別器から出力する（ステップＳ１０６）。続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトを抽出する。続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトの中から、第１クエリ文に含まれないコンセプトである隠れコンセプトを抽出する（ステップＳ１０７）。 The extraction unit 156 inputs the first image searched by the search unit 155 to the concept classifier (step S105). Subsequently, the extraction unit 156 outputs the concept similarity between the concept included in the first image and the first image from the concept classifier (step S106). Subsequently, the extraction unit 156 extracts concepts whose output concept similarity exceeds the concept threshold. Subsequently, the extraction unit 156 extracts hidden concepts that are not included in the first query sentence from among the concepts whose output concept similarity exceeds the concept threshold (step S107).
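
The hidden-concept filter of step S107 can be sketched as follows. The sketch takes the concept classifier's output as a plain mapping from concept label to concept similarity, which is an assumed representation. The matching rule (a concept is hidden when at least one word of its label is absent from the query sentence) is one reading of the "car_interior" and "ruin" examples in the description, not a rule the source states explicitly. The name `extract_hidden_concepts` is hypothetical.

```python
import re
from typing import Dict, List


def extract_hidden_concepts(
    concept_scores: Dict[str, float],
    query: str,
    concept_threshold: float,
) -> List[str]:
    """Return concepts above the threshold that are not covered by the query."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    hidden = []
    for concept, score in concept_scores.items():
        if score <= concept_threshold:
            continue  # keep only concepts whose similarity exceeds the threshold
        # "car_interior" -> ["car", "interior"]; underscores act as separators.
        tokens = re.findall(r"[a-z0-9]+", concept.lower())
        # Assumed rule: hidden if any word of the label is missing from the query.
        if any(token not in query_words for token in tokens):
            hidden.append(concept)
    return hidden
```

With the query "person in a car", the label "car" is fully covered by the query and is dropped, while "car_interior" is kept because "interior" does not appear in the query.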

出力部１４０は、抽出部１５６によって抽出された隠れコンセプトを出力する（ステップＳ１０８）。利用者は、出力部１４０によって出力された隠れコンセプトに基づいて第１クエリ文を更新し、新たな第２クエリ文を生成する（ステップＳ１０９）。受付部１５３は、出力部１４０によって出力された隠れコンセプトに基づいて更新された第２クエリ文を利用者から受け付ける。例えば、受付部１５３は、入力部１３０を介して、利用者によって更新された第２クエリ文を利用者から受け付ける（ステップＳ１１０）。取得部１５４は、受付部１５３が第２クエリ文を受け付けると、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する（ステップＳ１１１）。 The output unit 140 outputs the hidden concept extracted by the extraction unit 156 (step S108). The user updates the first query sentence based on the hidden concept output by the output unit 140 and generates a new second query sentence (step S109). The reception unit 153 receives from the user the second query sentence updated based on the hidden concept output by the output unit 140. For example, the reception unit 153 receives the second query sentence updated by the user via the input unit 130 (step S110). When the reception unit 153 receives the second query sentence, the acquisition unit 154 acquires from the video pool images that are either a plurality of videos or the individual scenes included in each of those videos (step S111).

検索部１５５は、受付部１５３によって受け付けられた第２クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。続いて、検索部１５５は、画像と第２クエリ文との第２類似度をＶＳＥモデルから出力する（ステップＳ１１２）。続いて、検索部１５５は、出力された第２類似度が第２閾値を超える第２画像を再検索する（ステップＳ１１３）。出力部１４０は、検索部１５５によって再検索された第２画像を検索結果として出力する（ステップＳ１１４）。 The search unit 155 inputs the set of the second query sentence accepted by the reception unit 153 and the image acquired by the acquisition unit 154 into the VSE model. Subsequently, the search unit 155 outputs the second similarity between the image and the second query sentence from the VSE model (step S112). Subsequently, the search unit 155 searches again for a second image whose output second similarity exceeds the second threshold (step S113). The output unit 140 outputs the second image searched again by the search unit 155 as a search result (step S114).
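
The flow of FIG. 5 as a whole (steps S101 to S114) can be sketched as two search rounds joined by a query update. In this sketch the VSE model, the concept classifier, and the query update (performed by the user in the embodiment) are passed in as callables. All function and parameter names are hypothetical, and the hidden-concept test is simplified to a substring check on the query sentence.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def search(
    query: str,
    images: Iterable[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Keep the images whose similarity to the query exceeds the threshold."""
    return [img for img in images if similarity(query, img) > threshold]


def two_round_search(
    first_query: str,
    images: Iterable[str],
    similarity: Callable[[str, str], float],           # stands in for the VSE model
    classify: Callable[[str], Dict[str, float]],       # stands in for the concept classifier
    refine: Callable[[str, List[str]], str],           # stands in for the user's update
    first_threshold: float,
    concept_threshold: float,
    second_threshold: float,
) -> Tuple[str, List[str]]:
    images = list(images)
    # S103-S104: first search with the first query sentence.
    first_images = search(first_query, images, similarity, first_threshold)
    # S105-S107: collect concepts above the threshold that the query lacks.
    hidden: List[str] = []
    for img in first_images:
        for concept, score in classify(img).items():
            if (score > concept_threshold
                    and concept not in first_query
                    and concept not in hidden):
                hidden.append(concept)
    # S108-S110: the hidden concepts are shown and the query is updated.
    second_query = refine(first_query, hidden)
    # S111-S114: re-search with the second query sentence.
    return second_query, search(second_query, images, similarity, second_threshold)
```

Passing the update step as a callable keeps the sketch neutral between the embodiment, in which the user rewrites the query, and the variant in which the generation unit 157 produces the second query sentence.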

〔４．変形例〕 [4. Modification]
次に、図６を用いて、変形例に係る情報処理の手順について説明する。図６は、変形例に係る情報処理の一例を示すフローチャートである。図６では、受付部１５３が、利用者によって入力されたクエリ画像を受け付ける（ステップＳ２０１）。例えば、受付部１５３は、入力部１３０を介して利用者が入力したクエリ画像を受け付ける。ここで、本願明細書におけるクエリ画像とは、画像全体でなくてもよく、例えば、画像の一部であってもよい。 Next, an information processing procedure according to a modification will be described using FIG. 6. FIG. 6 is a flowchart illustrating an example of information processing according to the modification. In FIG. 6, the reception unit 153 receives a query image input by a user (step S201). For example, the reception unit 153 receives a query image input by the user via the input unit 130. Here, the query image in this specification does not have to be an entire image and may be, for example, a part of an image.

取得部１５４は、受付部１５３がクエリ画像を受け付けると、記憶部１２０を参照して、複数の文章または複数の文章それぞれに含まれる各テキストである文字列をキャプションプールから取得する（ステップＳ２０２）。 When the reception unit 153 receives the query image, the acquisition unit 154 refers to the storage unit 120 and acquires from the caption pool character strings that are either a plurality of sentences or the individual texts included in each of those sentences (step S202).

また、検索部１５５は、取得部１５４が文字列を取得すると、受付部１５３によって受け付けられたクエリ画像と取得部１５４によって取得された文字列の組をＶＳＥモデルに入力する。続いて、検索部１５５は、クエリ画像と文字列との第３類似度をＶＳＥモデルから出力する（ステップＳ２０３）。続いて、検索部１５５は、出力された第３類似度が第３閾値を超える文字列を検索する（ステップＳ２０４）。 Further, when the acquisition unit 154 acquires a character string, the search unit 155 inputs the set of the query image accepted by the reception unit 153 and the character string acquired by the acquisition unit 154 to the VSE model. Subsequently, the search unit 155 outputs the third degree of similarity between the query image and the character string from the VSE model (step S203). Subsequently, the search unit 155 searches for a character string whose output third similarity exceeds the third threshold (step S204).

生成部１５７は、検索部１５５によって検索された文字列に基づいて第３クエリ文を生成する。出力部１４０は、生成部１５７によって生成された第３クエリ文を出力する。受付部１５３は、出力部１４０によって出力された第３クエリ文を利用者から受け付ける（ステップＳ２０５）。取得部１５４は、受付部１５３が第３クエリ文を受け付けると、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する（ステップＳ２０６）。 The generation unit 157 generates a third query sentence based on the character string found by the search unit 155. The output unit 140 outputs the third query sentence generated by the generation unit 157. The reception unit 153 receives from the user the third query sentence output by the output unit 140 (step S205). When the reception unit 153 receives the third query sentence, the acquisition unit 154 acquires from the video pool images that are either a plurality of videos or the individual scenes included in each of those videos (step S206).

検索部１５５は、受付部１５３によって受け付けられた第３クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。続いて、検索部１５５は、画像と第３クエリ文との第１類似度をＶＳＥモデルから出力する（ステップＳ２０７）。続いて、検索部１５５は、出力された第１類似度が第１閾値を超える第３画像を検索する（ステップＳ２０８）。 The search unit 155 inputs the set of the third query sentence received by the reception unit 153 and the image acquired by the acquisition unit 154 into the VSE model. Subsequently, the search unit 155 outputs the first similarity between the image and the third query sentence from the VSE model (step S207). Subsequently, the search unit 155 searches for a third image whose output first similarity exceeds the first threshold (step S208).

抽出部１５６は、検索部１５５によって検索された第３画像をコンセプト識別器に入力する（ステップＳ２０９）。続いて、抽出部１５６は、第３画像に含まれるコンセプトと第３画像とのコンセプト類似度をコンセプト識別器から出力する（ステップＳ２１０）。続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトを抽出する。続いて、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトの中から、第３クエリ文に含まれないコンセプトである隠れコンセプトを抽出する（ステップＳ２１１）。 The extraction unit 156 inputs the third image searched by the search unit 155 to the concept classifier (step S209). Subsequently, the extraction unit 156 outputs the concept similarity between the concept included in the third image and the third image from the concept classifier (step S210). Subsequently, the extraction unit 156 extracts concepts whose output concept similarity exceeds the concept threshold. Subsequently, the extraction unit 156 extracts hidden concepts that are not included in the third query sentence from among the concepts whose output concept similarity exceeds the concept threshold (step S211).

出力部１４０は、抽出部１５６によって抽出された隠れコンセプトを出力する（ステップＳ２１２）。利用者は、出力部１４０によって出力された隠れコンセプトに基づいて第３クエリ文を更新し、新たな第４クエリ文を生成する（ステップＳ２１３）。受付部１５３は、出力部１４０によって出力された隠れコンセプトに基づいて更新された第４クエリ文を利用者から受け付ける。例えば、受付部１５３は、入力部１３０を介して、利用者によって更新された第４クエリ文を利用者から受け付ける（ステップＳ２１４）。取得部１５４は、受付部１５３が第４クエリ文を受け付けると、複数の映像または複数の映像それぞれに含まれる各シーンである画像を映像プールから取得する（ステップＳ２１５）。 The output unit 140 outputs the hidden concept extracted by the extraction unit 156 (step S212). The user updates the third query sentence based on the hidden concept output by the output unit 140 and generates a new fourth query sentence (step S213). The reception unit 153 receives from the user the fourth query sentence updated based on the hidden concept output by the output unit 140. For example, the reception unit 153 receives the fourth query sentence updated by the user via the input unit 130 (step S214). When the reception unit 153 receives the fourth query sentence, the acquisition unit 154 acquires from the video pool images that are either a plurality of videos or the individual scenes included in each of those videos (step S215).

検索部１５５は、受付部１５３によって受け付けられた第４クエリ文と取得部１５４によって取得された画像の組をＶＳＥモデルに入力する。続いて、検索部１５５は、画像と第４クエリ文との第２類似度をＶＳＥモデルから出力する（ステップＳ２１６）。続いて、検索部１５５は、出力された第２類似度が第２閾値を超える第４画像を再検索する（ステップＳ２１７）。出力部１４０は、検索部１５５によって再検索された第４画像を検索結果として出力する（ステップＳ２１８）。 The search unit 155 inputs the set of the fourth query sentence received by the reception unit 153 and the image acquired by the acquisition unit 154 into the VSE model. Subsequently, the search unit 155 outputs the second similarity between the image and the fourth query sentence from the VSE model (step S216). Subsequently, the search unit 155 searches again for a fourth image whose output second similarity exceeds the second threshold (step S217). The output unit 140 outputs the fourth image found by this re-search as a search result (step S218).

なお、ステップＳ２０５において、利用者は、出力部１４０によって出力された第３クエリ文を変更することができる。受付部１５３は、入力部１３０を介して、利用者によって変更された第３クエリ文を利用者から受け付ける。 Note that in step S205, the user can change the third query sentence output by the output unit 140. The reception unit 153 receives from the user, via the input unit 130, the third query sentence that has been changed by the user.

〔５．効果〕 [5. Effects]
上述してきたように、実施形態に係る情報処理装置１００は、属性情報抽出部１５１と、モデル生成部１５２を有する。属性情報抽出部１５１は、セグメンテーションの技術を用いて領域分割された画像のうち、構造を有する物体を含む分割領域である物体領域に関する領域情報、および、姿勢推定の技術を用いて推定された物体の構造に関する構造情報に基づいて、物体の属性に関する属性情報を抽出する。モデル生成部１５２は、属性情報抽出部１５１によって抽出された属性情報に基づいて生成された文章であって、画像の内容を示す文章と画像とを対応付けて共通空間に埋め込むように学習されたＶＳＥ（Visual-Semantic Embedding）モデルを生成する。 As described above, the information processing device 100 according to the embodiment includes the attribute information extraction unit 151 and the model generation unit 152. The attribute information extraction unit 151 extracts attribute information regarding the attributes of an object based on two inputs: region information regarding an object region, which is a divided region that includes an object having a structure within an image divided into regions using a segmentation technology, and structure information regarding the structure of the object estimated using a pose estimation technology. The model generation unit 152 generates a VSE (Visual-Semantic Embedding) model trained to associate an image with a sentence that is generated based on the attribute information extracted by the attribute information extraction unit 151 and that indicates the content of the image, and to embed the two in a common space.

このように、情報処理装置１００は、セグメンテーションの技術と姿勢推定の技術を組み合わせることで、画像に含まれる構造を有する物体（例えば、車両や人物）の領域や物体の構成要素（各部位や各パーツ）の領域、および画像に含まれる物体の姿勢を精度よく抽出することができる。これにより、情報処理装置１００は、画像に含まれる物体領域および物体領域の構成要素（例えば、物体の構造、各部位の領域など）を階層的に分解することができる。また、情報処理装置１００は、分解された構成要素の階層的な関係性に基づいて、物体領域や構造情報から、物体の属性に関する属性情報を適切に抽出することができる。また、情報処理装置１００は、画像から適切に抽出された属性情報に基づいて、画像の内容を示す文章を適切に生成することができる。したがって、本願発明に係る情報処理装置１００は、画像の検索精度を向上させることができる。 In this way, by combining segmentation technology and pose estimation technology, the information processing device 100 can accurately extract the region of a structured object (for example, a vehicle or a person) included in an image, the regions of the object's constituent elements (each body part or component), and the posture of the object included in the image. Thereby, the information processing device 100 can hierarchically decompose the object region and the constituent elements of the object region (for example, the structure of the object, the region of each part, etc.) included in the image. Further, the information processing device 100 can appropriately extract attribute information regarding the attributes of the object from the object region and structure information based on the hierarchical relationship of the decomposed components. Further, the information processing device 100 can appropriately generate a sentence indicating the content of the image based on attribute information appropriately extracted from the image. Therefore, the information processing device 100 according to the present invention can improve image search accuracy.

また、属性情報抽出部１５１は、画像のうち、領域情報として、人物である物体を含む分割領域である人物領域に関する人物領域情報、および、構造情報として、姿勢推定の技術を用いて推定された人物の骨格に関する骨格情報に基づいて、属性情報として、人物の属性に関する人物属性情報を抽出する。モデル生成部１５２は、属性情報抽出部１５１によって抽出された人物属性情報に基づいて生成された文章であって、画像の内容を示す文章と画像とを対応付けて共通空間に埋め込むように学習されたＶＳＥモデルを生成する。 Further, the attribute information extraction unit 151 extracts, as the attribute information, person attribute information regarding the attributes of a person. It does so based on person area information (serving as the region information) regarding a person area, which is a divided area of the image that includes an object that is a person, and on skeletal information (serving as the structure information) regarding the skeleton of the person estimated using posture estimation technology. The model generation unit 152 generates a VSE model trained to associate an image with a sentence that is generated based on the person attribute information extracted by the attribute information extraction unit 151 and that indicates the content of the image, and to embed the two in a common space.

情報処理装置１００は、セグメンテーションの技術と姿勢推定の技術を組み合わせることで、画像に含まれる人物の領域や服装の領域、および画像に含まれる人物の姿勢を精度よく抽出することができる。これにより、情報処理装置１００は、画像に含まれる人物領域および人物領域の構成要素（例えば、各関節の関節位置情報、身体部位領域、およびファッションアイテムの領域など）を階層的に分解することができる。また、情報処理装置１００は、分解された構成要素の階層的な関係性に基づいて、人物領域や骨格情報から、人物の属性に関する人物属性情報を適切に抽出することができる。また、情報処理装置１００は、画像から適切に抽出された人物属性情報に基づいて、画像の内容を示す文章を適切に生成することができる。したがって、本願発明に係る情報処理装置１００は、画像の検索精度を向上させることができる。 By combining segmentation technology and posture estimation technology, the information processing device 100 can accurately extract the region of a person included in an image, the region of the person's clothing, and the posture of the person included in the image. Thereby, the information processing device 100 can hierarchically decompose the person area and the components of the person area (for example, joint position information of each joint, body part areas, and fashion item areas) included in the image. Furthermore, the information processing device 100 can appropriately extract person attribute information regarding attributes of a person from the person area and skeletal information based on the hierarchical relationship of the decomposed components. Further, the information processing device 100 can appropriately generate a sentence indicating the content of the image based on the person attribute information appropriately extracted from the image. Therefore, the information processing device 100 according to the present invention can improve image search accuracy.

また、属性情報抽出部１５１は、骨格情報として、人物の各関節の関節位置情報に基づいて、人物属性情報として、人物の姿勢に関する姿勢情報および人物の動作に関する動作情報を抽出する。 Further, the attribute information extraction unit 151 extracts posture information regarding the posture of the person and motion information regarding the movement of the person as the person attribute information based on the joint position information of each joint of the person as the skeletal information.

これにより、情報処理装置１００は、画像に含まれる人物の姿勢および動作に関する情報を適切に抽出することができるので、画像の内容を示す文章を適切に生成することができる。 Thereby, the information processing apparatus 100 can appropriately extract information regarding the posture and motion of the person included in the image, and therefore can appropriately generate a sentence indicating the content of the image.
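
As an illustration of deriving posture information from joint position information, the sketch below computes the angle at the knee from three 2D keypoints and applies a threshold. The source only states that posture and motion information are extracted from the joint positions. The joint names (`hip`, `knee`, `ankle`), the 150-degree threshold, and the two-class rule are all hypothetical simplifications, not the method of the embodiment.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at joint b formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("joints must not coincide")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def estimate_posture(joints: Dict[str, Point], straight_knee_deg: float = 150.0) -> str:
    """Toy rule: a nearly straight knee means standing, otherwise sitting."""
    knee = joint_angle(joints["hip"], joints["knee"], joints["ankle"])
    return "standing" if knee >= straight_knee_deg else "sitting"
```

A practical system would more likely feed the full set of joint positions to a trained classifier, and would need several frames to derive motion information.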

また、属性情報抽出部１５１は、人物領域情報として、人物が身に着けている各ファッションアイテムを含む分割領域であるアイテム領域に関するアイテム領域情報に基づいて、人物属性情報として、人物が身に着けているファッションアイテムの属性に関するアイテム属性情報を抽出する。 Further, the attribute information extraction unit 151 extracts, as the person attribute information, item attribute information regarding the attributes of the fashion items worn by the person, based on item area information (serving as the person area information) regarding item areas, which are divided areas each including a fashion item worn by the person.

これにより、情報処理装置１００は、画像に含まれる人物のファッションアイテムに関する情報を適切に抽出することができるので、画像の内容を示す文章を適切に生成することができる。 Thereby, the information processing apparatus 100 can appropriately extract information regarding the fashion items of the person included in the image, and therefore can appropriately generate a sentence indicating the content of the image.

また、属性情報抽出部１５１は、人物領域情報として、人物の各身体部位を含む分割領域である身体部位領域に関する身体部位領域情報に基づいて、人物属性情報として、人物の身体部位の属性に関する身体部位属性情報を抽出する。例えば、属性情報抽出部１５１は、身体部位領域情報として、人物の頭髪の領域に関する情報に基づいて、身体部位属性情報として、人物の髪型に関する情報を抽出する。また、属性情報抽出部１５１は、身体部位領域情報として、人物の顔の領域に関する情報に基づいて、身体部位属性情報として、人物の表情に関する情報を抽出する。 The attribute information extraction unit 151 also extracts, as the person attribute information, body part attribute information regarding the attributes of the person's body parts, based on body part area information (serving as the person area information) regarding body part areas, which are divided areas each including a body part of the person. For example, the attribute information extraction unit 151 extracts information about the person's hairstyle as body part attribute information based on information about the person's hair region as body part area information. Further, the attribute information extraction unit 151 extracts information about the person's facial expression as body part attribute information based on information about the person's face region as body part area information.

これにより、情報処理装置１００は、画像に含まれる人物の髪型や表情に関する情報を適切に抽出することができるので、画像の内容を示す文章を適切に生成することができる。 Thereby, the information processing apparatus 100 can appropriately extract information regarding the hairstyle and facial expression of the person included in the image, and can therefore appropriately generate a sentence indicating the content of the image.

また、情報処理装置１００は、受付部１５３と、検索部１５５と、抽出部１５６を有する。受付部１５３は、利用者によって入力された第１クエリ文を受け付ける。検索部１５５は、モデル生成部１５２によって生成されたＶＳＥモデルを用いて、第１クエリ文に関する第１画像を検索する。抽出部１５６は、第１画像に関するコンセプトを抽出する。検索部１５５は、ＶＳＥモデルを用いて、抽出部１５６によって抽出されたコンセプトに基づく第２クエリ文に関する第２画像を再検索する。 Further, the information processing device 100 includes a reception section 153, a search section 155, and an extraction section 156. The reception unit 153 receives the first query sentence input by the user. The search unit 155 uses the VSE model generated by the model generation unit 152 to search for a first image related to the first query sentence. The extraction unit 156 extracts a concept related to the first image. The search unit 155 uses the VSE model to search again for the second image related to the second query sentence based on the concept extracted by the extraction unit 156.

これにより、情報処理装置１００は、ＶＳＥを用いることで、利用者によって入力されたクエリ文に関する画像を適切に検索することができる。また、情報処理装置１００は、適切に検索された画像からコンセプトを抽出したうえで、抽出したコンセプトに基づいて画像を再検索することができる。したがって、情報処理装置１００は、画像の検索精度を向上させることができる。 Thereby, the information processing apparatus 100 can appropriately search for images related to the query sentence input by the user by using VSE. Further, the information processing apparatus 100 can extract a concept from an appropriately searched image and then search for images again based on the extracted concept. Therefore, the information processing device 100 can improve image search accuracy.

また、情報処理装置１００は、生成部１５７をさらに備える。生成部１５７は、抽出部１５６によって抽出されたコンセプトに基づいて、第２クエリ文を生成する。検索部１５５は、ＶＳＥモデルを用いて、生成部１５７によって生成された第２クエリ文に関する第２画像を再検索する。 Further, the information processing device 100 further includes a generation unit 157. The generation unit 157 generates a second query sentence based on the concept extracted by the extraction unit 156. The search unit 155 uses the VSE model to search again for the second image related to the second query sentence generated by the generation unit 157.

これにより、情報処理装置１００は、適切なコンセプトに基づいて適切なクエリ文を生成することができる。例えば、情報処理装置１００は、適切な検索キーワードを追加（または不適切な検索キーワードを排除）することで、検索精度を向上させることを可能にする。したがって、情報処理装置１００は、適切なクエリ文に基づいて画像を再検索することができるので、画像の検索精度を向上させることができる。 Thereby, the information processing device 100 can generate an appropriate query sentence based on an appropriate concept. For example, the information processing device 100 can improve search accuracy by adding appropriate search keywords (or excluding inappropriate search keywords). Therefore, the information processing apparatus 100 can search for images again based on an appropriate query sentence, and thus can improve image search accuracy.
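
Adding appropriate keywords and removing inappropriate ones can be sketched as a simple word-level update of the query sentence. The source does not describe how the generation unit 157 composes the second query sentence, so the function `update_query` and its append-and-filter behavior are assumptions made for illustration.

```python
from typing import Iterable


def update_query(query: str, add: Iterable[str] = (), remove: Iterable[str] = ()) -> str:
    """Return a new query with unwanted words dropped and new keywords appended.

    Concept labels such as "car_interior" are split on underscores, and only
    the words not already present in the query are appended.
    """
    remove_set = {word.lower() for word in remove}
    words = [word for word in query.split() if word.lower() not in remove_set]
    present = {word.lower() for word in words}
    for keyword in add:
        for word in keyword.replace("_", " ").split():
            if word.lower() not in present:
                words.append(word)
                present.add(word.lower())
    return " ".join(words)
```

For the examples in the description, the hidden concept "car_interior" extends "person in a car" to "person in a car interior", and "ruin" is appended to "destroyed old building".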

また、情報処理装置１００は、検索部１５５による検索結果を出力する出力部１４０をさらに備える。出力部１４０は、抽出部１５６によって抽出されたコンセプトを出力する。受付部１５３は、出力部１４０によって出力されたコンセプトに基づく第２クエリ文を利用者から受け付ける。検索部１５５は、ＶＳＥモデルを用いて、受付部１５３によって受け付けられた第２クエリ文に関する第２画像を再検索する。 Furthermore, the information processing device 100 further includes an output unit 140 that outputs the search results by the search unit 155. The output unit 140 outputs the concept extracted by the extraction unit 156. The reception unit 153 receives a second query sentence based on the concept output by the output unit 140 from the user. The search unit 155 uses the VSE model to search again for the second image related to the second query sentence received by the reception unit 153.

これにより、情報処理装置１００は、利用者が、適切なコンセプトに基づいて適切なクエリ文を生成するのを助けることができる。例えば、情報処理装置１００は、利用者が適切な検索キーワードを追加（または不適切な検索キーワードを排除）することで、検索精度を向上させることを可能にする。したがって、情報処理装置１００は、適切なクエリ文に基づいて画像を再検索することができるので、画像の検索精度を向上させることができる。 Thereby, the information processing apparatus 100 can help the user generate an appropriate query sentence based on an appropriate concept. For example, the information processing device 100 allows the user to improve search accuracy by adding appropriate search keywords (or eliminating inappropriate search keywords). Therefore, the information processing apparatus 100 can search for images again based on an appropriate query sentence, and thus can improve image search accuracy.

また、情報処理装置１００は、取得部１５４をさらに備える。取得部１５４は、複数の映像または複数の映像それぞれに含まれる各シーンである画像を取得する。検索部１５５は、取得部１５４によって取得された画像と受付部１５３によって受け付けられた第１クエリ文の組をＶＳＥモデルに入力して、画像と第１クエリ文との第１類似度をＶＳＥモデルから出力し、出力された第１類似度が第１閾値を超える第１画像を検索する。 Further, the information processing device 100 further includes an acquisition unit 154. The acquisition unit 154 acquires images that are either a plurality of videos or the individual scenes included in each of those videos. The search unit 155 inputs each pair of an image acquired by the acquisition unit 154 and the first query sentence received by the reception unit 153 into the VSE model, obtains the first similarity between the image and the first query sentence as the output of the VSE model, and searches for a first image whose first similarity exceeds a first threshold.

これにより、情報処理装置１００は、ＶＳＥに基づく処理により、適切な画像を選択することができる。 Thereby, the information processing apparatus 100 can select an appropriate image through processing based on VSE.

また、検索部１５５は、取得部１５４によって取得された画像と抽出部１５６によって抽出されたコンセプトに基づく第２クエリ文の組をＶＳＥモデルに入力して、画像と第２クエリ文との第２類似度をＶＳＥモデルから出力し、出力された第２類似度が第２閾値を超える第２画像を再検索する。 Furthermore, the search unit 155 inputs each pair of an image acquired by the acquisition unit 154 and the second query sentence based on the concept extracted by the extraction unit 156 into the VSE model, obtains the second similarity between the image and the second query sentence as the output of the VSE model, and searches again for a second image whose second similarity exceeds a second threshold.

これにより、情報処理装置１００は、ＶＳＥに基づく処理とコンセプト識別器に基づく処理を回すことで、適切なコンセプトを選択することができる。例えば、情報処理装置１００は、利用者が入力したクエリ文に明示されていない内容（例えば、暗示的な内容）に関するコンセプトを抽出することができる。 Thereby, the information processing device 100 can select an appropriate concept by iterating the processing based on the VSE and the processing based on the concept classifier. For example, the information processing device 100 can extract a concept related to content (for example, implicit content) that is not explicitly stated in the query sentence input by the user.

また、抽出部１５６は、コンセプトを含む画像が入力された場合に、画像に含まれるコンセプトと画像とのコンセプト類似度を出力するよう学習された学習済みの機械学習モデルであるコンセプト識別器を用いて、第１画像から第１画像に関するコンセプトを抽出する。 Further, the extraction unit 156 extracts a concept related to the first image from the first image using a concept classifier, which is a trained machine learning model that, when an image containing a concept is input, outputs the concept similarity between that concept and the image.

これにより、情報処理装置１００は、コンセプト識別器を用いることで、適切に検索された画像から適切なコンセプトを抽出することができる。また、情報処理装置１００は、適切なコンセプトを抽出したうえで、適切なコンセプトに基づいて画像を再検索することができる。したがって、情報処理装置１００は、画像の検索精度を向上させることができる。 Thereby, the information processing apparatus 100 can extract an appropriate concept from an appropriately searched image by using the concept classifier. Further, the information processing apparatus 100 can extract an appropriate concept and then search for images again based on the appropriate concept. Therefore, the information processing device 100 can improve image search accuracy.

また、抽出部１５６は、検索部１５５によって検索された第１画像をコンセプト識別器に入力して、第１画像に含まれるコンセプトと第１画像とのコンセプト類似度をコンセプト識別器から出力し、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトを抽出する。 Further, the extraction unit 156 inputs the first image searched by the search unit 155 to the concept classifier, outputs the concept similarity between the concept included in the first image and the first image from the concept classifier, Concepts whose output concept similarity exceeds a concept threshold are extracted.

これにより、情報処理装置１００は、適切なコンセプトを抽出することができる。 Thereby, the information processing apparatus 100 can extract an appropriate concept.

また、抽出部１５６は、出力されたコンセプト類似度がコンセプト閾値を超えるコンセプトの中から、第１クエリ文に含まれないコンセプトである隠れコンセプトを抽出する。 Furthermore, the extraction unit 156 extracts hidden concepts that are not included in the first query sentence from among the concepts whose output concept similarity exceeds the concept threshold.

これにより、情報処理装置１００は、利用者が入力したクエリ文に明示されていない内容（例えば、暗示的な内容）に関するコンセプトを抽出することができる。また、情報処理装置１００は、利用者が入力したクエリ文に明示されていないコンセプト（例えば、暗示的なコンセプト）に基づいて画像を再検索することができる。したがって、情報処理装置１００は、画像の検索精度を向上させることができる。 Thereby, the information processing apparatus 100 can extract a concept related to content (for example, implicit content) that is not explicitly stated in the query sentence input by the user. Furthermore, the information processing apparatus 100 can re-search for images based on a concept (for example, an implicit concept) that is not explicitly stated in the query sentence input by the user. Therefore, the information processing device 100 can improve image search accuracy.

また、受付部１５３は、利用者によって入力されたクエリ画像を受け付ける。検索部１５５は、ＶＳＥモデルを用いて、受付部１５３によって受け付けられたクエリ画像に関する文字列を検索し、検索した文字列に基づく第３クエリ文に関する第３画像を検索する。 Further, the reception unit 153 receives a query image input by a user. The search unit 155 uses the VSE model to search for a character string related to the query image accepted by the reception unit 153, and searches for a third image related to a third query sentence based on the searched character string.

これにより、情報処理装置１００は、利用者が入力したクエリ画像に明示されていない内容（例えば、暗示的な内容）に関するコンセプトを抽出することができる。 Thereby, the information processing apparatus 100 can extract a concept related to content (for example, implicit content) that is not explicitly stated in the query image input by the user.

また、取得部１５４は、複数の文章または複数の文章それぞれに含まれる各テキストである文字列を取得する。検索部１５５は、取得部１５４によって取得された文字列と受付部１５３によって受け付けられたクエリ画像の組をＶＳＥモデルに入力して、文字列とクエリ画像との第３類似度をＶＳＥモデルから出力し、出力された第３類似度が第３閾値を超える文字列を検索し、検索した文字列に基づく第３クエリ文に関する第３画像を検索する。 Further, the acquisition unit 154 acquires a plurality of sentences or a character string that is each text included in each of the plurality of sentences. The search unit 155 inputs the combination of the character string acquired by the acquisition unit 154 and the query image accepted by the reception unit 153 into the VSE model, and outputs a third degree of similarity between the character string and the query image from the VSE model. Then, a character string whose output third similarity exceeds a third threshold is searched, and a third image related to a third query sentence based on the searched character string is searched.

これにより、情報処理装置１００は、ＶＳＥに基づく処理とコンセプト識別器に基づく処理を回すことで、適切なコンセプトを選択することができる。 Thereby, the information processing device 100 can select an appropriate concept by iterating the processing based on the VSE and the processing based on the concept classifier.

また、抽出部１５６は、第３画像に関するコンセプトを抽出する。検索部１５５は、ＶＳＥモデルを用いて、抽出部１５６によって抽出されたコンセプトに基づく第４クエリ文に関する第４画像を再検索する。 Furthermore, the extraction unit 156 extracts a concept related to the third image. The search unit 155 uses the VSE model to search again for the fourth image related to the fourth query sentence based on the concept extracted by the extraction unit 156.

これにより、情報処理装置１００は、利用者が入力したクエリ画像に明示されていない内容（例えば、暗示的な内容）に関するコンセプトを抽出することができる。また、情報処理装置１００は、利用者が入力したクエリ画像に明示されていないコンセプト（例えば、暗示的なコンセプト）に基づいて画像を再検索することができる。したがって、情報処理装置１００は、画像の検索精度を向上させることができる。 Thereby, the information processing apparatus 100 can extract a concept related to content (for example, implicit content) that is not explicitly stated in the query image input by the user. Further, the information processing apparatus 100 can re-search for images based on a concept (for example, an implicit concept) that is not explicitly stated in the query image input by the user. Therefore, the information processing device 100 can improve image search accuracy.

〔６．ハードウェア構成〕 [6. Hardware configuration]
また、上述してきた実施形態に係る情報処理装置１００は、例えば図７に示すような構成のコンピュータ１０００によって実現される。図７は、情報処理装置１００の機能を実現するコンピュータの一例を示すハードウェア構成図である。コンピュータ１０００は、ＣＰＵ１１００、ＲＡＭ１２００、ＲＯＭ１３００、ＨＤＤ１４００、通信インターフェイス（Ｉ／Ｆ）１５００、入出力インターフェイス（Ｉ／Ｆ）１６００、及びメディアインターフェイス（Ｉ／Ｆ）１７００を備える。 Further, the information processing device 100 according to the embodiment described above is realized by, for example, a computer 1000 having a configuration as shown in FIG. 7. FIG. 7 is a hardware configuration diagram showing an example of a computer that implements the functions of the information processing device 100. The computer 1000 includes a CPU 1100, a RAM 1200, a ROM 1300, an HDD 1400, a communication interface (I/F) 1500, an input/output interface (I/F) 1600, and a media interface (I/F) 1700.

ＣＰＵ１１００は、ＲＯＭ１３００またはＨＤＤ１４００に格納されたプログラムに基づいて動作し、各部の制御を行う。ＲＯＭ１３００は、コンピュータ１０００の起動時にＣＰＵ１１００によって実行されるブートプログラムや、コンピュータ１０００のハードウェアに依存するプログラム等を格納する。 CPU 1100 operates based on a program stored in ROM 1300 or HDD 1400, and controls each section. The ROM 1300 stores a boot program executed by the CPU 1100 when the computer 1000 is started, programs depending on the hardware of the computer 1000, and the like.

The HDD 1400 stores programs executed by the CPU 1100, data used by those programs, and the like. The communication interface 1500 receives data from other devices via a predetermined communication network and passes it to the CPU 1100, and transmits data generated by the CPU 1100 to other devices via the predetermined communication network.

The CPU 1100 controls output devices such as a display and a printer, and input devices such as a keyboard and a mouse, via the input/output interface 1600. The CPU 1100 acquires data from the input devices via the input/output interface 1600, and outputs generated data to the output devices via the input/output interface 1600. Note that an MPU (Micro Processing Unit) may be used instead of the CPU 1100, or, because the processing requires a large amount of computing power, a GPU (Graphics Processing Unit) may be used.

The media interface 1700 reads a program or data stored in a recording medium 1800 and provides it to the CPU 1100 via the RAM 1200. The CPU 1100 loads the program from the recording medium 1800 onto the RAM 1200 via the media interface 1700 and executes the loaded program. The recording medium 1800 is, for example, an optical recording medium such as a DVD (Digital Versatile Disc) or a PD (Phase change rewritable Disk), a magneto-optical recording medium such as an MO (Magneto-Optical disk), a tape medium, a magnetic recording medium, or a semiconductor memory.

For example, when the computer 1000 functions as the information processing device 100, the CPU 1100 of the computer 1000 realizes the functions of the control unit 150 by executing programs loaded onto the RAM 1200. The CPU 1100 of the computer 1000 reads these programs from the recording medium 1800 and executes them; as another example, it may acquire these programs from another device via a predetermined communication network.

Some embodiments of the present application have been described above in detail with reference to the drawings, but these are merely examples. The present invention can be carried out in other forms to which various modifications and improvements have been made based on the knowledge of those skilled in the art, including the aspects described in the disclosure of the invention section.

[7. Others]
Among the processes described in the above embodiments and modifications, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified. For example, the various kinds of information shown in each figure are not limited to the illustrated information.

Each component of each illustrated device is a functional concept and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like.

The embodiments and modifications described above can be combined as appropriate as long as the processing contents do not contradict each other.

The term "unit" (section, module, unit) used above can be read as "means", "circuit", or the like. For example, the search unit can be read as search means or a search circuit.

100 Information processing device
110 Communication unit
120 Storage unit
130 Input unit
140 Output unit
150 Control unit
151 Attribute information extraction unit
152 Model generation unit
153 Reception unit
154 Acquisition unit
155 Search unit
156 Extraction unit
157 Generation unit

Claims

An information processing device comprising:
an attribute information extraction unit that extracts attribute information regarding attributes of an object having a structure, based on region information regarding an object region, which is a divided region including the object in an image divided into regions using a segmentation technique, and on structural information regarding the structure of the object estimated using a pose estimation technique; and
a model generation unit that generates a VSE (Visual-Semantic Embedding) model trained to associate a sentence with the image and embed them in a common space, the sentence being generated based on the attribute information extracted by the attribute information extraction unit and indicating the content of the image.

The information processing device according to claim 1, wherein
the attribute information extraction unit extracts, as the attribute information, person attribute information regarding attributes of a person, based on person region information, as the region information, regarding a person region that is a divided region of the image including the object that is the person, and on skeletal information, as the structural information, regarding the skeleton of the person estimated using the pose estimation technique, and
the model generation unit generates the VSE model trained to associate a sentence with the image and embed them in a common space, the sentence being generated based on the person attribute information extracted by the attribute information extraction unit and indicating the content of the image.

The information processing device according to claim 2, wherein
the attribute information extraction unit extracts, as the person attribute information, posture information regarding the posture of the person and motion information regarding the motion of the person, based on joint position information of each joint of the person as the skeletal information.

The information processing device according to claim 2 or 3, wherein
the attribute information extraction unit extracts, as the person attribute information, item attribute information regarding attributes of fashion items worn by the person, based on item region information, as the person region information, regarding an item region that is a divided region including each fashion item worn by the person.

The information processing device according to any one of claims 2 to 4, wherein
the attribute information extraction unit extracts, as the person attribute information, body part attribute information regarding attributes of body parts of the person, based on body part region information, as the person region information, regarding a body part region that is a divided region including each body part of the person.

The information processing device according to claim 5, wherein
the attribute information extraction unit extracts, as the body part attribute information, information regarding the hairstyle of the person, based on information regarding the hair region of the person as the body part region information.

The information processing device according to claim 5 or 6, wherein
the attribute information extraction unit extracts, as the body part attribute information, information regarding the facial expression of the person, based on information regarding the face region of the person as the body part region information.

The information processing device according to any one of claims 1 to 7, further comprising:
a reception unit that receives a first query sentence input by a user;
a search unit that searches for a first image related to the first query sentence using the VSE model generated by the model generation unit; and
an extraction unit that extracts a first concept that is a detection target included in the first image, wherein
the search unit uses the VSE model to re-search for a second image related to a second query sentence based on the first concept extracted by the extraction unit.

The information processing device according to claim 8, wherein
the detection target includes at least one of an object, a person, a scene, and an action.

The information processing device according to claim 8, further comprising
a generation unit that generates the second query sentence based on the first concept extracted by the extraction unit, wherein
the search unit uses the VSE model to re-search for the second image related to the second query sentence generated by the generation unit.

The information processing device according to claim 8, further comprising
an output unit that outputs search results of the search unit, wherein
the output unit outputs first concept information regarding the first concept extracted by the extraction unit,
the reception unit receives, from the user, the second query sentence based on the first concept information output by the output unit, and
the search unit uses the VSE model to re-search for the second image related to the second query sentence received by the reception unit.

The information processing device according to any one of claims 8 to 11, further comprising
an acquisition unit that acquires images, each of which is one of a plurality of videos or a scene included in each of the plurality of videos, wherein
the search unit inputs a pair of an image acquired by the acquisition unit and the first query sentence received by the reception unit into the VSE model, causes the VSE model to output a first similarity between the image and the first query sentence, and searches for the first image for which the output first similarity exceeds a first threshold.

The information processing device according to claim 12, wherein
the search unit inputs a pair of an image acquired by the acquisition unit and the second query sentence based on the first concept extracted by the extraction unit into the VSE model, causes the VSE model to output a second similarity between the image and the second query sentence, and re-searches for the second image for which the output second similarity exceeds a second threshold.

The information processing device according to any one of claims 8 to 13, wherein
the extraction unit extracts the first concept from the first image using a concept classifier, which is a machine learning model trained to output, when an image is input, a concept similarity between the image and a concept that is a detection target included in the image.

The information processing device according to claim 14, wherein
the extraction unit inputs the first image found by the search unit into the concept classifier, causes the concept classifier to output the concept similarity between the first image and the first concept that is a detection target included in the first image, and extracts the first concept for which the output concept similarity exceeds a concept threshold.

The information processing device according to claim 15, wherein
the extraction unit extracts, from among the first concepts for which the output concept similarity exceeds the concept threshold, a hidden concept that is a first concept corresponding to a character string not included in the first query sentence.

The information processing device according to any one of claims 8 to 16, wherein
the reception unit receives a query image input by the user, and
the search unit uses the VSE model to search for a character string related to the query image received by the reception unit, and searches for a third image related to a third query sentence based on the character string found.

The information processing device according to claim 17, further comprising
an acquisition unit that acquires character strings, each of which is one of a plurality of texts or a sentence included in each of the plurality of texts, wherein
the search unit inputs a pair of a character string acquired by the acquisition unit and the query image received by the reception unit into the VSE model, causes the VSE model to output a third similarity between the character string and the query image, searches for a character string for which the output third similarity exceeds a third threshold, and searches for the third image related to the third query sentence based on the character string found.

The information processing device according to claim 18, wherein
the extraction unit extracts a third concept that is a detection target included in the third image, and
the search unit uses the VSE model to re-search for a fourth image related to a fourth query sentence based on the third concept extracted by the extraction unit.

An information processing method realized by a program executed by an information processing device, the method comprising:
an attribute information extraction step of extracting attribute information regarding attributes of an object having a structure, based on region information regarding an object region, which is a divided region including the object in an image divided into regions using a segmentation technique, and on structural information regarding the structure of the object estimated using a pose estimation technique; and
a model generation step of generating a VSE (Visual-Semantic Embedding) model trained to associate a sentence with the image and embed them in a common space, the sentence being generated based on the attribute information extracted in the attribute information extraction step and indicating the content of the image.

An information processing program that causes a computer to execute:
an attribute information extraction procedure of extracting attribute information regarding attributes of an object having a structure, based on region information regarding an object region, which is a divided region including the object in an image divided into regions using a segmentation technique, and on structural information regarding the structure of the object estimated using a pose estimation technique; and
a model generation procedure of generating a VSE (Visual-Semantic Embedding) model trained to associate a sentence with the image and embed them in a common space, the sentence being generated based on the attribute information extracted by the attribute information extraction procedure and indicating the content of the image.
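As an illustration of the hidden-concept extraction recited in the claims above (concepts whose concept similarity exceeds a concept threshold and whose character string is not included in the first query sentence), the following is a minimal sketch. The function name, the score dictionary, and the threshold value are hypothetical. Whole-word matching on lowercase ASCII words is an assumed simplification, so multi-word concept names would need a different matching rule.

```python
import re


def extract_hidden_concepts(query_sentence, concept_scores, concept_threshold=0.5):
    """Return concepts above the threshold whose name does not occur in the query sentence.

    concept_scores maps a concept name to the concept similarity that a
    (hypothetical) concept classifier output for one retrieved image.
    """
    query_words = set(re.findall(r"[a-z]+", query_sentence.lower()))
    # Keep only concepts whose similarity exceeds the concept threshold.
    detected = [(score, name) for name, score in concept_scores.items()
                if score > concept_threshold]
    # Highest similarity first; drop concepts already named in the query sentence.
    return [name for score, name in sorted(detected, reverse=True)
            if name.lower() not in query_words]


# Hypothetical classifier output for an image retrieved by "a person running".
scores = {"person": 0.95, "running": 0.80, "park": 0.70, "dog": 0.10}
print(extract_hidden_concepts("a person running", scores))
```

Here only "park" is returned: "person" and "running" already appear in the query sentence, and "dog" falls below the assumed threshold. The returned concept could then be used to form a second query sentence.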