JP7417015B2

JP7417015B2 - Gesture shape learning device, gesture shape estimation device, methods thereof, and programs

Info

Publication number: JP7417015B2
Application number: JP2020086507A
Authority: JP
Inventors: 亮石井; 竜一郎東中; 裕司青野; 芙巳雄二瓶; 有紀子中野
Original assignee: SCHOOL JURIDICAL PERSON SEIKEI GAKUEN; Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: SCHOOL JURIDICAL PERSON SEIKEI GAKUEN; NTT Inc; NTT Inc USA
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2024-01-18
Anticipated expiration: 2040-05-18
Also published as: JP2021182179A

Description

Application of Article 30, Paragraph 2 of the Patent Act: (1) Website publication date: October 14, 2019; website address: ICMI 2019 Conference Program website, https://icmi.acm.org/2019/index.php?id=program. (2) Website publication date: October 14, 2019; website address: ICMI 2019 website, https://dl.acm.org/citation.cfm?id=3353736. (3) Dates held: October 14 to October 18, 2019 (date of public disclosure: October 15, 2019); meeting name and venue: ICMI 2019 Conference, GRAND DUSHULAKE (No. 299, ■月 Street, 工■■ District, Suzhou, China).

The present invention relates to a technique for assigning an appropriately shaped iconic gesture to a word.

In speech, a hand gesture (hereinafter also simply called a "gesture") that depicts the image of a concrete object is called an iconic gesture. Non-Patent Document 1 describes a technique that collects images from the web by searching with a word and uses decision tree learning to build an estimator that estimates one of three shape types for that word from the SIFT features of the images.

Yuki Kadono, Yutaka Takase, and Yukiko I. Nakano, "Generating iconic gestures based on graphic data analysis and clustering," ACM/IEEE International Conference on Human-Robot Interaction, pp. 447-448, 2016.

However, the conventional technique described in Non-Patent Document 1 has the problem that its estimation accuracy is not high.

An object of the present invention is to estimate, with high accuracy, an iconic gesture of an appropriate shape for a word.

To solve the above problem, a gesture shape learning device according to a first aspect of the present invention includes: a learning data storage unit that stores learning data in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to that word; and a gesture shape learning unit that uses the learning data to learn a gesture shape determination model, which takes as input a feature amount consisting of the color information of each pixel extracted from an image and estimates the gesture shape corresponding to that image.

A gesture shape learning device according to a second aspect of the present invention includes: a learning data storage unit that stores learning data in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to that word; a basic shape learning unit that uses the learning data to learn a basic shape determination model, which takes as input a feature amount consisting of the color information of each pixel extracted from an image and estimates the basic shape of the gesture shape corresponding to that image; and an aspect ratio learning unit that uses the learning data to learn an aspect ratio determination model, which takes as input a feature amount consisting of the color information of each pixel extracted from an image and estimates the aspect ratio of the gesture shape corresponding to that image.

A gesture shape estimation device according to a third aspect of the present invention includes: a model storage unit that stores the gesture shape determination model learned by the gesture shape learning device of the first aspect; and a gesture shape estimation unit that inputs a feature amount consisting of the color information of each pixel extracted from an input image into the gesture shape determination model and estimates the gesture shape corresponding to that input image.

A gesture shape estimation device according to a fourth aspect of the present invention includes: a model storage unit that stores the basic shape determination model and the aspect ratio determination model learned by the gesture shape learning device of the second aspect; a basic shape estimation unit that inputs a feature amount consisting of the color information of each pixel extracted from an input image into the basic shape determination model and estimates the basic shape of the gesture shape corresponding to that input image; an aspect ratio estimation unit that inputs a feature amount consisting of the color information of each pixel extracted from the input image into the aspect ratio determination model and estimates the aspect ratio of the gesture shape corresponding to that input image; and a gesture shape determination unit that determines the gesture shape corresponding to the input image from the basic shape and the aspect ratio.

According to the present invention, an iconic gesture of an appropriate shape for a word can be estimated with high accuracy. In particular, estimating the basic shape of the gesture shape and the aspect ratio of that basic shape in two stages makes it possible to estimate the gesture shape with still higher accuracy.

FIG. 1 is a diagram illustrating the functional configuration of a gesture shape learning device.
FIG. 2 is a diagram illustrating the processing procedure of a gesture shape learning method.
FIG. 3 is a diagram for explaining the configuration of the gesture shape determination model.
FIG. 4 is a diagram illustrating the functional configuration of a gesture shape estimation device.
FIG. 5 is a diagram illustrating the processing procedure of a gesture shape estimation method.
FIG. 6 is a diagram illustrating the functional configuration of a gesture shape learning device.
FIG. 7 is a diagram illustrating the processing procedure of a gesture shape learning method.
FIG. 8 is a diagram illustrating the functional configuration of a gesture shape estimation device.
FIG. 9 is a diagram illustrating the processing procedure of a gesture shape estimation method.
FIG. 10 is a diagram illustrating the functional configuration of a gesture shape estimation device.
FIG. 11 is a diagram illustrating the processing procedure of a gesture shape estimation method.
FIG. 12 is a diagram illustrating the functional configuration of a computer.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate description is omitted.

In this invention, the RGB information of each pixel of an image is used as the input information (feature amount) for estimating a gesture shape, and deep learning is used for the estimator that estimates the gesture shape from that input. Two ways of building the estimator are described: one that estimates the gesture shape from the feature amount in a single stage, and one that estimates the basic shape of the gesture shape and the aspect ratio of that basic shape in two stages. Even the single-stage estimation can estimate the gesture shape more accurately than the conventional technique, and the two-stage estimation can estimate it more accurately still.

[First embodiment]
The first embodiment consists of a gesture shape learning device 1, which learns a gesture shape determination model that estimates a gesture shape from image features in a single stage, and a gesture shape estimation device 2, which estimates a gesture shape from input images using the gesture shape determination model learned by the gesture shape learning device 1.

<Gesture shape learning device 1>
As illustrated in FIG. 1, the gesture shape learning device 1 of the first embodiment includes a word dictionary storage unit 110, an object image storage unit 120, an image size conversion unit 11, a learning data generation unit 12, a learning data storage unit 130, a gesture shape learning unit 13, and a model storage unit 140. The gesture shape learning method of the first embodiment is realized by the gesture shape learning device 1 performing the processing of each step illustrated in FIG. 2.

The gesture shape learning device is a special device configured by loading a special program into a known or dedicated computer that has, for example, a central processing unit (CPU) and a main storage device (RAM: random access memory). The gesture shape learning device executes each process under the control of, for example, the central processing unit. Data input to the gesture shape learning device and data obtained in each process are stored, for example, in the main storage device, and the stored data is read out to the central processing unit as needed and used for other processing. At least part of each processing unit of the gesture shape learning device may be implemented in hardware such as an integrated circuit. Each storage unit of the gesture shape learning device can be implemented by, for example, a main storage device such as RAM; an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory; or middleware such as a relational database or a key-value store.

The processing procedure of the gesture shape learning method executed by the gesture shape learning device 1 of the first embodiment is described with reference to FIG. 2.

In step S110, the gesture shape learning device 1 builds a word dictionary database and stores it in the word dictionary storage unit 110. The word dictionary database associates each of a plurality of predetermined words with information on the gesture shape corresponding to that word. The words are, for example, about 1000 words that represent the names of objects, extracted from a dictionary or the like (hereinafter also called "object name words"). The gesture shape is selected from, for example, 14 types: "circle", "horizontal ellipse", "vertical ellipse", "square", "horizontal rectangle", "vertical rectangle", "equilateral triangle", "horizontal triangle", "vertical triangle", "wave", "linear", "horizontal rhombus", "vertical rhombus", and "unknown". In the following, the types that are rarely selected are merged, and seven types are used: "circle", "horizontal ellipse", "vertical ellipse", "square", "horizontal rectangle", "vertical rectangle", and "linear". The choice of gesture shapes is not limited to the above, however, and may be set freely according to the usage environment. The association between each word and a gesture shape is made manually by a plurality of annotators.
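
As a minimal sketch of the word dictionary database, the following assumes that each annotator's choice is recorded per word and that disagreements are settled by majority vote. The source states only that several annotators assign the shapes by hand, so the voting rule, the function name `build_word_dictionary`, and the sample words are illustrative assumptions.

```python
from collections import Counter

# The seven gesture shapes used in the description.
GESTURE_SHAPES = (
    "circle", "horizontal ellipse", "vertical ellipse",
    "square", "horizontal rectangle", "vertical rectangle", "linear",
)

def build_word_dictionary(annotations):
    """Map each word to one gesture shape.

    `annotations` maps a word to the list of shapes chosen by the annotators.
    Majority voting is an assumption; the source does not say how
    disagreements between annotators are resolved.
    """
    dictionary = {}
    for word, votes in annotations.items():
        unknown = [v for v in votes if v not in GESTURE_SHAPES]
        if unknown:
            raise ValueError(f"unknown shape(s) for {word!r}: {unknown}")
        # most_common(1) returns the shape with the highest vote count.
        dictionary[word] = Counter(votes).most_common(1)[0][0]
    return dictionary

# Hypothetical sample annotations.
word_dictionary = build_word_dictionary({
    "ball": ["circle", "circle", "square"],
    "pencil": ["linear", "linear", "linear"],
})
```

A plain dictionary keyed by word keeps the later lookup from image to word to shape a single step.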

In step S120, the gesture shape learning device 1 builds an object image database and stores it in the object image storage unit 120. The object image database associates each word in the word dictionary database with a plurality of images corresponding to that word. The images can be collected by, for example, running an Internet image search with each word as the search term and downloading about 10 suitable images from the search results.

In step S11, the image size conversion unit 11 converts each image in the object image database to an image size appropriate for the machine learning method that will use it, for example a 224×224-pixel RGB image.
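
The size conversion can be sketched in plain Python as follows. The source specifies only the target format (for example 224×224 RGB); nearest-neighbour sampling, stretching without preserving the aspect ratio, and the name `resize_nearest` are assumptions made for this sketch, and a practical system would more likely call an image library.

```python
TARGET_SIZE = 224  # example size given in the description

def resize_nearest(image, size=TARGET_SIZE):
    """Resize an RGB image to size x size pixels by nearest-neighbour sampling.

    `image` is a list of rows, each row a list of (R, G, B) tuples.
    The image is stretched; its original aspect ratio is not preserved.
    """
    src_h = len(image)
    src_w = len(image[0])
    return [
        [image[y * src_h // size][x * src_w // size] for x in range(size)]
        for y in range(size)
    ]

# Hypothetical 2 x 4 input image: left half red, right half blue.
RED, BLUE = (255, 0, 0), (0, 0, 255)
small = [[RED, RED, BLUE, BLUE], [RED, RED, BLUE, BLUE]]
resized = resize_nearest(small)
```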

In step S12, the learning data generation unit 12 generates learning data by attaching, as teacher data, the corresponding gesture shape to each image converted by the image size conversion unit 11. The gesture shape corresponding to an image is obtained by looking up the word associated with that image in the object image database and then looking up the gesture shape associated with that word in the word dictionary database. The learning data generation unit 12 stores the generated learning data in the learning data storage unit 130.
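
The two-step lookup (image to word, word to gesture shape) might look like the following sketch. Both databases are modelled as plain dictionaries, and the record layout and the name `generate_training_data` are assumptions rather than anything the source prescribes.

```python
def generate_training_data(object_images, word_dictionary):
    """Pair every image with the gesture shape of the word it was collected for.

    `object_images` maps a word to its list of (already resized) images, and
    `word_dictionary` maps a word to its gesture shape.
    """
    training_data = []
    for word, images in object_images.items():
        shape = word_dictionary[word]  # teacher label, looked up via the word
        for image in images:
            training_data.append({"word": word, "image": image, "shape": shape})
    return training_data

# Hypothetical databases; the strings stand in for image data.
object_images = {"ball": ["ball_01", "ball_02"], "pencil": ["pencil_01"]}
word_dictionary = {"ball": "circle", "pencil": "linear"}
training_data = generate_training_data(object_images, word_dictionary)
```

Keeping the word in each record lets the images of one word be grouped again when the model averages their features.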

In step S13, the gesture shape learning unit 13 uses the learning data stored in the learning data storage unit 130 to learn a gesture shape determination model that takes as input the feature amount extracted from images and estimates the corresponding gesture shape. The gesture shape determination model is built with a neural network; FIG. 3 shows a concrete configuration example. The inputs of the model (Input 1, Input 2, ..., Input n in FIG. 3) correspond to the n images (for example, 10) collected for a word. First, a pretrained deep learning model, VGG-16, is applied to each input image (a 224×224-pixel image with three RGB channels) to obtain a 4096-dimensional feature amount (VGG-16 in FIG. 3). Next, a 4096-dimensional average vector is computed from the n 4096-dimensional feature amounts obtained from the n images (Average in FIG. 3).

The gesture shape determination model takes this 4096-dimensional average vector as input and outputs the gesture shape corresponding to the images. A separate VGG-16 paired with each of Inputs 1 to n is not strictly required; the Average block in FIG. 3 may instead sum the VGG-16 feature extraction results for the n inputs and take their mean. The use of VGG-16 is only an example: any other general model that extracts a feature amount from an input image may be used, and the number of feature dimensions may be whatever that model provides. For learning, the output of a neural network with two fully connected layers is passed to the softmax function to compute the likelihood of the labels representing the seven gesture shapes. The first layer (FC1 in FIG. 3) is a fully connected layer that uses, for example, the ReLU function as its activation function and outputs a 128-dimensional vector. The second layer (FC2 in FIG. 3) takes the output vector of FC1 as input and uses the softmax function to compute the likelihood of each label corresponding to the seven gesture shapes. The gesture shape determination model is obtained as the result of machine learning with this neural network. The gesture shape learning unit 13 stores the learned gesture shape determination model in the model storage unit 140.
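
To make the data flow of FIG. 3 concrete, the sketch below implements only the forward pass from per-image features to label likelihoods: Average, FC1 with ReLU, and FC2 with softmax. It is not the patented implementation. The VGG-16 features are replaced by random placeholder vectors, the weights are random rather than learned, and training (loss function, optimizer) is omitted because the source does not specify it; all function names are illustrative.

```python
import math
import random

def average_vectors(vectors):
    """Element-wise mean of n equally sized feature vectors (Average in FIG. 3)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights; a trained model would load learned values instead."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features_per_image, fc1, fc2):
    """Average -> FC1 (ReLU) -> FC2 (softmax); returns one likelihood per label."""
    mean = average_vectors(features_per_image)
    hidden = relu(dense(mean, *fc1))
    return softmax(dense(hidden, *fc2))

rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM, NUM_SHAPES = 4096, 128, 7
fc1 = init_layer(FEATURE_DIM, HIDDEN_DIM, rng)
fc2 = init_layer(HIDDEN_DIM, NUM_SHAPES, rng)
# Placeholder features for n = 10 images; real ones would come from VGG-16.
features = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(10)]
likelihoods = forward(features, fc1, fc2)
```

Averaging before the fully connected layers means the classifier sees one fixed-size vector per word regardless of how many images were collected.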

<Gesture shape estimation device 2>
As illustrated in FIG. 4, the gesture shape estimation device 2 of the first embodiment includes a model storage unit 140, an image size conversion unit 21, and a gesture shape estimation unit 22. The gesture shape estimation method of the first embodiment is realized by the gesture shape estimation device 2 performing the processing of each step illustrated in FIG. 5.

The gesture shape estimation device is a special device configured by loading a special program into a known or dedicated computer that has, for example, a central processing unit (CPU) and a main storage device (RAM: random access memory). The gesture shape estimation device executes each process under the control of, for example, the central processing unit. Data input to the gesture shape estimation device and data obtained in each process are stored, for example, in the main storage device, and the stored data is read out to the central processing unit as needed and used for other processing. At least part of each processing unit of the gesture shape estimation device may be implemented in hardware such as an integrated circuit. Each storage unit of the gesture shape estimation device can be implemented by, for example, a main storage device such as RAM; an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory; or middleware such as a relational database or a key-value store.

The processing procedure of the gesture shape estimation method executed by the gesture shape estimation device 2 of the first embodiment is described with reference to FIG. 5.

The model storage unit 140 stores the gesture shape determination model learned by the gesture shape learning device 1.

The gesture shape estimation device 2 receives as input n images corresponding to the object name word to be estimated. These images may be downloaded from the Internet in the same way as when the object image database is built, or they may be collected by some other method.

In step S21, the image size conversion unit 21 converts each input image to an image size suitable for applying the model. That is, the image size conversion unit 21 converts the size of the input images in the same way as the image size conversion unit 11 of the gesture shape learning device 1.

In step S22, the gesture shape estimation unit 22 inputs the n input images converted by the image size conversion unit 21 into the gesture shape determination model stored in the model storage unit 140 and estimates the gesture shape corresponding to the input images. Specifically, the pretrained VGG-16 is first applied to each input image to obtain a 4096-dimensional feature amount. Next, a 4096-dimensional average vector is computed from the 4096-dimensional feature amounts of the input images. This average vector is input to the trained neural network, which outputs the likelihoods of the labels representing the seven gesture shapes. The gesture shape with the maximum likelihood is then output as the gesture shape corresponding to the object name word being estimated.
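
The final selection in step S22 is an argmax over the label likelihoods, as in this short sketch. The label order, the name `select_shape`, and the sample likelihood values are assumptions; a real system must use the same label order that was used during training.

```python
GESTURE_SHAPES = (
    "circle", "horizontal ellipse", "vertical ellipse",
    "square", "horizontal rectangle", "vertical rectangle", "linear",
)

def select_shape(likelihoods, labels=GESTURE_SHAPES):
    """Return the label whose likelihood is largest."""
    if len(likelihoods) != len(labels):
        raise ValueError("one likelihood per label is required")
    best_index = max(range(len(labels)), key=lambda i: likelihoods[i])
    return labels[best_index]

# Hypothetical model output for one object name word.
likelihoods = [0.05, 0.60, 0.10, 0.05, 0.10, 0.05, 0.05]
estimated = select_shape(likelihoods)
```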

[Second embodiment]
The second embodiment consists of a gesture shape learning device 3, which learns gesture shape determination models that estimate a gesture shape from image features in two stages, and a gesture shape estimation device 4, which estimates a gesture shape from input images using the models learned by the gesture shape learning device 3. In the two-stage estimation, the basic shape of the gesture shape is estimated first, and then the aspect ratio of that basic shape is estimated. If the gesture shapes to be finally estimated are the seven types illustrated in the first embodiment ("circle", "horizontal ellipse", "vertical ellipse", "square", "horizontal rectangle", "vertical rectangle", and "linear"), the basic shape may take the three values "circle", "rectangle", and "linear", and the aspect ratio the three values "horizontal", "vertical", and "equal". The gesture shape that is finally output is determined by the combination of basic shape and aspect ratio. For example, if the estimated basic shape is "circle" and the estimated aspect ratio is "horizontal", the output gesture shape is "horizontal ellipse". The following description focuses on the differences from the first embodiment.

<Gesture shape learning device 3>
As illustrated in FIG. 6, the gesture shape learning device 3 of the second embodiment differs from the gesture shape learning device 1 of the first embodiment in that the gesture shape learning unit 13 includes a basic shape learning unit 131 and an aspect ratio learning unit 132. The gesture shape learning method of the second embodiment is realized by the gesture shape learning device 3 performing the processing of each step illustrated in FIG. 7.

In step S131, the basic shape learning unit 131 uses the learning data stored in the learning data storage unit 130 to learn a basic shape determination model that takes as input the feature amount extracted from images and estimates the basic shape of the corresponding gesture shape. The structure of the basic shape determination model is the same as that of the gesture shape determination model of the first embodiment shown in FIG. 3. The first embodiment used seven gesture shapes in the learning data ("circle", "horizontal ellipse", "vertical ellipse", "square", "horizontal rectangle", "vertical rectangle", and "linear"); here, learning is performed after converting "circle", "horizontal ellipse", and "vertical ellipse" to "circle", and "square", "horizontal rectangle", and "vertical rectangle" to "rectangle". The basic shape learning unit 131 stores the learned basic shape determination model in the model storage unit 140.

In step S132, the aspect ratio learning unit 132 uses the learning data stored in the learning data storage unit 130 to learn an aspect ratio determination model that takes as input the feature amount extracted from images and estimates the aspect ratio of the corresponding gesture shape. The structure of the aspect ratio determination model is the same as that of the gesture shape determination model of the first embodiment shown in FIG. 3. The first embodiment used seven gesture shapes in the learning data ("circle", "horizontal ellipse", "vertical ellipse", "square", "horizontal rectangle", "vertical rectangle", and "linear"); here, learning is performed after converting "circle" and "square" to "equal", "horizontal ellipse" and "horizontal rectangle" to "horizontal", and "vertical ellipse" and "vertical rectangle" to "vertical". The aspect ratio learning unit 132 stores the learned aspect ratio determination model in the model storage unit 140.
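
The label conversions of steps S131 and S132 amount to two lookup tables applied to the seven-class learning data, sketched below. The English label names are renderings chosen for this sketch (the three-class basic shape for quadrilaterals is written "rectangle"). The source does not say how "linear" samples are treated when the aspect ratio model is trained; leaving them out is an assumption.

```python
BASIC_SHAPE = {
    "circle": "circle", "horizontal ellipse": "circle", "vertical ellipse": "circle",
    "square": "rectangle", "horizontal rectangle": "rectangle",
    "vertical rectangle": "rectangle",
    "linear": "linear",
}
ASPECT_RATIO = {
    "circle": "equal", "square": "equal",
    "horizontal ellipse": "horizontal", "horizontal rectangle": "horizontal",
    "vertical ellipse": "vertical", "vertical rectangle": "vertical",
}

def relabel(training_data):
    """Split seven-class samples into basic shape and aspect ratio training sets."""
    basic_set, aspect_set = [], []
    for features, shape in training_data:
        basic_set.append((features, BASIC_SHAPE[shape]))
        # Assumption: "linear" samples carry no aspect ratio label and are
        # left out of the aspect ratio training set.
        if shape in ASPECT_RATIO:
            aspect_set.append((features, ASPECT_RATIO[shape]))
    return basic_set, aspect_set

# Hypothetical samples; the strings stand in for averaged feature vectors.
samples = [("f1", "horizontal ellipse"), ("f2", "square"), ("f3", "linear")]
basic_set, aspect_set = relabel(samples)
```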

<Gesture shape estimation device 4>
As illustrated in FIG. 8, the gesture shape estimation device 4 of the second embodiment differs from the gesture shape estimation device 2 of the first embodiment in that the gesture shape estimation unit 22 includes a basic shape estimation unit 221, an aspect ratio estimation unit 222, and a gesture shape determination unit 223. The gesture shape estimation method of the second embodiment is realized by the gesture shape estimation device 4 performing the processing of each step illustrated in FIG. 9.

In step S221, the basic shape estimation unit 221 inputs the n input images converted by the image size conversion unit 21 into the basic shape determination model stored in the model storage unit 140 and estimates the basic shape of the gesture shape corresponding to the input images. The basic shape estimation unit 221 outputs the basic shape with the maximum likelihood in the output of the basic shape determination model to the gesture shape determination unit 223 as the basic shape of the gesture shape.

If the basic shape estimated by the basic shape estimation unit 221 is one for which no aspect ratio distinction exists, such as "linear", the subsequent processing is not executed, and that basic shape is output as the gesture shape corresponding to the object name word being estimated.

In step S222, the aspect ratio estimation unit 222 inputs the n input images converted by the image size conversion unit 21 into the aspect ratio determination model stored in the model storage unit 140 and estimates the aspect ratio of the gesture shape corresponding to the input images. The aspect ratio estimation unit 222 outputs the aspect ratio with the maximum likelihood in the output of the aspect ratio determination model to the gesture shape determination unit 223 as the aspect ratio of the gesture shape.

In step S223, the gesture shape determination unit 223 determines the gesture shape from the basic shape output by the basic shape estimation unit 221 and the aspect ratio output by the aspect ratio estimation unit 222, and outputs it as the gesture shape corresponding to the object name word targeted for estimation.
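The flow of steps S221 to S223 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this document: the two models are stand-in callables that return a likelihood per class, the likelihoods are averaged over the n images before taking the maximum (the aggregation rule is an assumption), and all names in the code are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Image = Sequence[float]                      # flattened per-pixel color values
Model = Callable[[Image], Dict[str, float]]  # image -> likelihood per class

# Assumption: basic shapes for which no aspect ratio distinction exists.
SHAPES_WITHOUT_ASPECT_RATIO = {"linear"}


def most_likely(model: Model, images: List[Image]) -> str:
    """Average the per-class likelihoods over all images and return the best class."""
    totals: Dict[str, float] = {}
    for image in images:
        for label, likelihood in model(image).items():
            totals[label] = totals.get(label, 0.0) + likelihood
    return max(totals, key=lambda label: totals[label] / len(images))


def estimate_gesture_shape(
    images: List[Image], basic_shape_model: Model, aspect_ratio_model: Model
) -> Tuple[str, Optional[str]]:
    """Return (basic shape, aspect ratio); the aspect ratio is None when skipped."""
    basic_shape = most_likely(basic_shape_model, images)    # step S221
    if basic_shape in SHAPES_WITHOUT_ASPECT_RATIO:
        return basic_shape, None                            # later steps are skipped
    aspect_ratio = most_likely(aspect_ratio_model, images)  # step S222
    return basic_shape, aspect_ratio                        # step S223
```

Returning the pair rather than a single merged label keeps the sketch neutral about how the final gesture shape is encoded.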

[Third embodiment]
In the third embodiment, the gesture shape estimation technique of the first or second embodiment is combined with a motion label generation technique. A motion label generation technique generates, from utterance text obtained by transcribing an utterance, labels for whole-body motion that match the content of the utterance (see, for example, References 1 and 2).

[Reference 1] International Publication No. 2019/160104
[Reference 2] Ryo Ishii, Taichi Katayama, Ryuichiro Higashinaka, and Junji Tomita, "Generating Body Motions using Spoken Language in Dialogue," Proceedings of the 18th International Conference on Intelligent Virtual Agents (IVA '18), pp. 87-92, 2018.

The motion label generation technique takes utterance text as input and generates information indicating what kind of hand gesture to perform for each phrase of the utterance text. For example, the techniques described in References 1 and 2 output, for each phrase, a label indicating whether or not an iconographic gesture is to be performed. In the third embodiment, when this label is output, the gesture shape estimation of the first or second embodiment is performed using the words contained in the phrase for which the label was output and in the phrases before and after it, so that the shape of the gesture to be generated is estimated in more detail. The following description focuses on the differences from the first and second embodiments.

<Gesture shape estimation device 5>
As illustrated in FIG. 10, the gesture shape estimation device 5 of the third embodiment includes, as in the first embodiment, a model storage unit 140, an image size conversion unit 21, and a gesture shape estimation unit 22, and further includes a motion label generation unit 31 and an input image generation unit 32. The gesture shape estimation method of the third embodiment is realized by the gesture shape estimation device 5 performing the processing of each step illustrated in FIG. 11.

In step S31, the motion label generation unit 31 generates, from the input utterance text, a motion label for each phrase indicating whether or not a gesture is to be performed. When a motion label indicating that a gesture is to be performed is generated, the motion label generation unit 31 notifies the input image generation unit 32 to that effect.

In step S32, the input image generation unit 32 collects, for each word contained in the phrase for which a motion label indicating that a gesture is to be performed was generated and in the phrases before and after it, n images corresponding to that word. The images may be collected, for example, by acquiring the images corresponding to the word from the object image database used by the gesture shape learning device, or by running an Internet image search with the word as the search term and downloading the results, as was done when the object image database was generated. The input image generation unit 32 inputs the collected n images to the image size conversion unit 21 as input images.
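The word selection and image collection of step S32 can be sketched as follows. This is an illustrative sketch only: the window of exactly one phrase on each side, the representation of phrases as lists of words, and the dictionary standing in for the object image database are all assumptions, and the function names are hypothetical.

```python
from typing import Dict, List


def words_for_gesture(phrases: List[List[str]], labels: List[bool]) -> List[List[str]]:
    """For each phrase labeled True, gather its words and those of the adjacent phrases."""
    result = []
    for i, has_gesture in enumerate(labels):
        if not has_gesture:
            continue
        # Assumption: "before and after" means one phrase on each side.
        window = phrases[max(i - 1, 0): i + 2]
        result.append([word for phrase in window for word in phrase])
    return result


def collect_images(word: str, database: Dict[str, List[str]], n: int) -> List[str]:
    """Return up to n image identifiers stored for the word (empty if unknown)."""
    return database.get(word, [])[:n]
```

A word with no database entry yields an empty list here; a fuller version might fall back to an image search in that case, as the text allows.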

In step S21, as in the first embodiment, the image size conversion unit 21 converts each input image to an image size suitable for applying the model, and inputs it to the gesture shape estimation unit 22.
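As a rough illustration of step S21, the sketch below resizes an image to a fixed size and flattens the color information of each pixel into a feature vector. Nearest-neighbor sampling, the nested-list image representation, and the function names are assumptions made for this example; the document does not specify the resizing method or the target size.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B)


def resize_nearest(image: List[List[Pixel]], width: int, height: int) -> List[List[Pixel]]:
    """Resize a row-major image to width x height by nearest-neighbor sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def to_feature(image: List[List[Pixel]]) -> List[int]:
    """Flatten the color values of every pixel into one feature vector."""
    return [channel for row in image for pixel in row for channel in pixel]
```

Resizing every input to the same dimensions gives all feature vectors the same length, which a fixed-input model requires.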

In step S22, as in the first embodiment, the gesture shape estimation unit 22 estimates the gesture shape from the n input images.

The third embodiment has been described as a configuration that combines the gesture shape estimation device 2 of the first embodiment with the motion label generation technique, but it may instead be combined with the gesture shape estimation device 4 of the second embodiment. In that case, the gesture shape estimation device 5 may be configured so that the gesture shape estimation unit 22 includes the basic shape estimation unit 221, the aspect ratio estimation unit 222, and the gesture shape determination unit 223.

Conventional motion label generation techniques could not estimate the specific shape of an iconographic gesture. Using the gesture shape estimation technique of the third embodiment makes it possible to estimate a more detailed gesture shape.

Although embodiments of the present invention have been described above, the specific configuration is not limited to these embodiments, and it goes without saying that design changes made as appropriate without departing from the spirit of the invention are included in the invention. The various processes described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.
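As one example of such parallel execution, the basic shape estimation and the aspect ratio estimation of the second embodiment take the same input images and do not depend on each other, so they could run concurrently. The sketch below assumes the two estimators are plain callables and uses a thread pool; the function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Image = Sequence[float]
Estimator = Callable[[List[Image]], str]


def estimate_in_parallel(
    images: List[Image],
    estimate_basic_shape: Estimator,
    estimate_aspect_ratio: Estimator,
) -> Tuple[str, str]:
    """Run both estimators concurrently on the same images and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        basic_future = pool.submit(estimate_basic_shape, images)
        ratio_future = pool.submit(estimate_aspect_ratio, images)
        return basic_future.result(), ratio_future.result()
```

The trade-off is that the aspect ratio is always computed, even when the basic shape turns out to be one, such as "linear", for which it is not needed.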

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing content of the functions that each device should have is described by a program. By loading this program into the storage unit 1020 of the computer shown in FIG. 12 and operating the control unit 1010, the input unit 1030, the output unit 1040, and so on, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in the storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own storage device and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A gesture shape learning device comprising:
a learning data storage unit that stores learning data generated by using a database, in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to the word, to assign to each image, as teacher data, the gesture shape corresponding to that image; and
a gesture shape learning unit that uses the learning data to learn a gesture shape determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates a gesture shape corresponding to the image.

A gesture shape learning device comprising:
a learning data storage unit that stores learning data generated by using a database, in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to the word, to assign to each image, as teacher data, the gesture shape corresponding to that image;
a basic shape learning unit that uses the learning data to learn a basic shape determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates a basic shape of a gesture shape corresponding to the image; and
an aspect ratio learning unit that uses the learning data to learn an aspect ratio determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates an aspect ratio of a gesture shape corresponding to the image.

A gesture shape estimation device comprising:
a model storage unit that stores a gesture shape determination model learned by the gesture shape learning device according to claim 1; and
a gesture shape estimation unit that inputs a feature amount consisting of color information of each pixel extracted from an input image to the gesture shape determination model and estimates a gesture shape corresponding to the input image.

A gesture shape estimation device comprising:
a model storage unit that stores the basic shape determination model and the aspect ratio determination model learned by the gesture shape learning device according to claim 2;
a basic shape estimation unit that inputs a feature amount consisting of color information of each pixel extracted from an input image to the basic shape determination model and estimates a basic shape of a gesture shape corresponding to the input image;
an aspect ratio estimation unit that inputs a feature amount consisting of color information of each pixel extracted from the input image to the aspect ratio determination model and estimates an aspect ratio of the gesture shape corresponding to the input image; and
a gesture shape determination unit that determines a gesture shape corresponding to the input image from the basic shape and the aspect ratio.

The gesture shape estimation device according to claim 3 or 4, further comprising:
a motion label generation unit that generates, from utterance text, a motion label indicating a phrase at which a gesture is to be performed; and
an input image generation unit that acquires, as the input image, an image corresponding to a word in the vicinity of the phrase for which the motion label was generated.

A gesture shape learning method, wherein
a learning data storage unit stores learning data generated by using a database, in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to the word, to assign to each image, as teacher data, the gesture shape corresponding to that image, and
a gesture shape learning unit uses the learning data to learn a gesture shape determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates a gesture shape corresponding to the image.

A gesture shape learning method, wherein
a learning data storage unit stores learning data generated by using a database, in which a plurality of images corresponding to a word are associated with a gesture shape corresponding to the word, to assign to each image, as teacher data, the gesture shape corresponding to that image,
a basic shape learning unit uses the learning data to learn a basic shape determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates a basic shape of a gesture shape corresponding to the image, and
an aspect ratio learning unit uses the learning data to learn an aspect ratio determination model that receives as input a feature amount consisting of color information of each pixel extracted from an image and estimates an aspect ratio of a gesture shape corresponding to the image.

A gesture shape estimation method, wherein
a model storage unit stores a gesture shape determination model learned by the gesture shape learning method according to claim 6, and
a gesture shape estimation unit inputs a feature amount consisting of color information of each pixel extracted from an input image to the gesture shape determination model and estimates a gesture shape corresponding to the input image.

A gesture shape estimation method, wherein
a model storage unit stores a basic shape determination model and an aspect ratio determination model learned by the gesture shape learning method according to claim 7,
a basic shape estimation unit inputs a feature amount consisting of color information of each pixel extracted from an input image to the basic shape determination model and estimates a basic shape of a gesture shape corresponding to the input image,
an aspect ratio estimation unit inputs a feature amount consisting of color information of each pixel extracted from the input image to the aspect ratio determination model and estimates an aspect ratio of the gesture shape corresponding to the input image, and
a gesture shape determination unit determines a gesture shape corresponding to the input image from the basic shape and the aspect ratio.

A program for causing a computer to function as the gesture shape learning device according to claim 1 or 2.

A program for causing a computer to function as the gesture shape estimation device according to claim 3.