JP7700511B2

JP7700511B2 - Human image data analysis system

Info

Publication number: JP7700511B2
Application number: JP2021086077A
Authority: JP
Inventors: 浩平渡邉; 晶詳中瀬
Original assignee: JTEKT Corp
Current assignee: JTEKT Corp
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2025-07-01
Anticipated expiration: 2041-05-21
Also published as: JP2022178935A

Description

The present invention relates to a human image data analysis system.

In recent years, image data of a person has been acquired in order to evaluate the person's posture. For example, Patent Document 1 describes determining the posture of a worker in order to measure the worker's working time at a manufacturing site. The work scene is captured by a camera, and skeletal data, including feature point data indicating the joint positions of the worker appearing in the captured image data, is acquired. A posture model in which a posture label is associated with each piece of skeletal data is stored in advance. Then, based on the acquired skeletal data, the posture of the person appearing in the image data is determined from the posture labels predefined in the posture model.

Besides determining a person's posture in order to measure a worker's working time, it is also important to evaluate the posture itself. For example, one may evaluate whether a person is walking with the correct posture. One may also evaluate whether a person using a care device such as a walker is using it with the correct posture. Various walking support devices that assist walking or operate to promote independent walking are also known, and it is likewise important to evaluate the posture of a person using such a device.

In addition, when a worker in a factory or the like wears an active power assist suit to reduce workload, it is also important to evaluate the worker's posture. Evaluating the worker's posture makes it possible to assess whether the worker is using the assist suit appropriately and whether the assist suit is functioning properly.

Patent Document 1: JP 2020-201772 A

As described above, evaluating a person's posture is very important. In the method described in Patent Document 1, the posture is determined from the person's skeletal data. When the person in the image data is facing fully forward or facing backward, the posture can be determined easily from the skeletal data.

However, when the torso is oriented sideways, for example, it may not be possible to determine the person's posture from skeletal data alone. With a sideways torso, it is not easy to determine whether the right foot or the left foot is positioned forward. Similarly, it is not easy to determine whether the right arm or the left arm is positioned forward. It is also not easy to determine how each body part is positioned when the person's upper and lower body are twisted relative to each other.

The present invention has been made in view of this background, and aims to provide a human image data analysis system that can determine the posture of a person appearing in person image data with high accuracy.

One aspect of the present invention is a human image data analysis system constituted by a computer device including an arithmetic processing device and a storage device, wherein
the storage device stores:
a trained model for feature extraction generated by machine learning using first person image data including a person as an explanatory variable and a feature amount in the first person image data as an objective variable;
a trained model for behavior analysis generated by machine learning using the feature amount extracted based on the first person image data as an explanatory variable and a behavior type of the person in a time series of multiple pieces of the first person image data as an objective variable; and
a trained model for keypoint extraction generated by machine learning using the feature amount and the behavior type as explanatory variables and keypoints representing the posture of the person in the first person image data as an objective variable,
the arithmetic processing device includes:
a feature extraction unit that, using the trained model for feature extraction stored in the storage device, receives second person image data including a person as input and extracts the feature amount in the second person image data;
a behavior type output unit that, using the trained model for behavior analysis stored in the storage device, receives the feature amount extracted by the feature extraction unit as input and outputs the behavior type of the person in a time series of multiple pieces of the second person image data; and
a keypoint output unit that, using the trained model for keypoint extraction stored in the storage device, receives the feature amount extracted by the feature extraction unit and the behavior type output by the behavior type output unit as input and outputs the keypoints of the person in the second person image data, and
the trained model for feature extraction, the trained model for behavior analysis, and the trained model for keypoint extraction are trained, in a learning phase, with a loss function that includes a keypoint element and a behavior type element.
Another aspect of the present invention is a human image data analysis system constituted by a computer device including an arithmetic processing device and a storage device, wherein
the storage device stores:
a trained model for feature extraction generated by machine learning using first person image data including a person as an explanatory variable and a feature amount in the first person image data as an objective variable;
a trained model for behavior analysis generated by machine learning using the feature amount extracted based on the first person image data as an explanatory variable and a behavior type of the person in a time series of multiple pieces of the first person image data as an objective variable; and
a trained model for keypoint extraction generated by machine learning using the feature amount and the behavior type as explanatory variables and keypoints representing the posture of the person in the first person image data as an objective variable, and
the arithmetic processing device includes:
a feature extraction unit that, using the trained model for feature extraction stored in the storage device, receives a time series of multiple pieces of second person image data including a person as input and extracts the feature amount in each of the multiple pieces of second person image data;
a behavior type output unit that, using the trained model for behavior analysis stored in the storage device, receives as input the feature amounts for the multiple images extracted by the feature extraction unit from each piece of the time series of second person image data, and outputs the behavior type of the person in the time series of second person image data; and
a keypoint output unit that, using the trained model for keypoint extraction stored in the storage device, receives as input the feature amount for one image, extracted by the feature extraction unit from one piece of second person image data selected from the time series, together with the behavior type output by the behavior type output unit, and outputs the keypoints of the person in the selected piece of second person image data.
Another aspect of the present invention is a human image data analysis system constituted by a computer device including an arithmetic processing device and a storage device, wherein
the storage device stores:
a trained model for feature extraction generated by machine learning using first person image data including a person as an explanatory variable and a feature amount in the first person image data as an objective variable;
a trained model for behavior analysis generated by machine learning using the feature amount for one image extracted based on the first person image data as an explanatory variable and a behavior type of the person in a time series of multiple pieces of the first person image data as an objective variable; and
a trained model for keypoint extraction generated by machine learning using the feature amount and the behavior type as explanatory variables and keypoints representing the posture of the person in the first person image data as an objective variable, and
the arithmetic processing device includes:
a feature extraction unit that, using the trained model for feature extraction stored in the storage device, sequentially receives a time series of multiple pieces of second person image data including a person as input and sequentially extracts the feature amount in each of the multiple pieces of second person image data;
a behavior type output unit that, using the trained model for behavior analysis stored in the storage device, sequentially receives as input the feature amount for one image extracted by the feature extraction unit, performs a recurrent calculation that uses the result of the previous calculation, and thereby outputs the behavior type of the person in the one piece of second person image data that is the target of the current calculation; and
a keypoint output unit that, using the trained model for keypoint extraction stored in the storage device, receives as input the feature amount for one image extracted by the feature extraction unit and the behavior type output by the behavior type output unit, and outputs the keypoints of the person in the one piece of second person image data that is the target of the current calculation.
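
The first aspect states only that the three models are trained with a loss function containing a keypoint element and a behavior type element; no formula is given in this part of the text. The following is a minimal sketch under stated assumptions: a mean squared error term for keypoint coordinates, a softmax cross-entropy term for the behavior type, and hypothetical weights `w_kp` and `w_act`. None of these choices comes from the source.

```python
import math


def keypoint_loss(pred, target):
    """Mean squared error over (x, y) keypoint coordinates (assumed form)."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(target)


def behavior_loss(scores, true_index):
    """Cross-entropy of softmax(scores) against the true behavior type (assumed form)."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[true_index]


def combined_loss(pred_kp, true_kp, scores, true_index, w_kp=1.0, w_act=1.0):
    """Weighted sum of the keypoint element and the behavior type element.

    w_kp and w_act are hypothetical weights, not values from the source.
    """
    return (w_kp * keypoint_loss(pred_kp, true_kp)
            + w_act * behavior_loss(scores, true_index))
```

Because one scalar combines both terms, gradients from keypoint errors and from behavior type errors can both reach the shared feature extractor, which is one plausible reading of why the three models are trained together.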

The keypoint output unit does not output a person's keypoints using only the feature amounts in the person image data. In addition to those feature amounts, it receives the person's behavior type as input and then outputs the person's keypoints.

In this way, by grasping the person's behavior type before outputting the keypoints, the keypoint output unit can output the keypoints with high accuracy. For example, when a person's upper and lower body are twisted relative to each other, a connection relationship between adjacent joint positions, which is one kind of keypoint, may be output incorrectly. By taking the behavior type into account, however, the connection relationship can be output with high accuracy even for a twisted posture. The person's posture can therefore be analyzed with high accuracy.

FIG. 1 is a diagram showing the configuration of the human image data analysis system.
FIG. 2 is a diagram showing the input and output of trained model A in the inference phase in the human image data analysis system of the first embodiment.
FIG. 3 is a diagram showing the relationship between trained models B, C, and D in the inference phase, and the inputs and outputs of trained models B, C, and D, in the human image data analysis system of the first embodiment.
FIG. 4 is a graph showing output information of trained model C, representing the score for each behavior type.
FIG. 5 is a diagram explaining the output information of trained model D, showing the keypoints of a person.
FIG. 6 is a diagram showing the inputs and outputs of models B, C, and D in the learning phase in the human image data analysis system of the first embodiment.
FIG. 7 is a diagram showing the keypoints of a person as a comparative example.
FIG. 8 is a diagram showing the relationship between trained models B, C, and D in the inference phase, and the inputs and outputs of trained models B, C, and D, in the human image data analysis system of the second embodiment.

(1. Overview of human image data analysis system)
The human image data analysis system acquires person image data and analyzes the posture of the person included in the acquired data. A person's posture is classified into, for example, standing, sitting, lying, and kneeling postures, and each of these is classified in further detail. The posture also differs depending on whether the person is stationary or moving. In other words, the human image data analysis system analyzes what posture the person appearing in the person image data is in.

The posture information analyzed by the human image data analysis system is used, for example, as follows. The posture of a person is evaluated while the person is stationary. For example, when the person is standing, whether the standing posture is appropriate is evaluated, and the person can be guided toward an appropriate standing posture. Likewise, when the person is sitting or lying down, whether the sitting or lying posture is appropriate is evaluated, and the result can be used to select appropriate seating or bedding, or to develop seating or bedding.

The system can also be used to evaluate a person's posture during movement. It can evaluate posture when moving from standing to sitting and vice versa, or from sitting to lying and vice versa. It can also evaluate posture when walking, running, jumping, and so on, as well as the various postures of a person playing sports.

Furthermore, it is possible to evaluate whether a person using a care device such as a walker is using it with the correct posture. For a walking support device that is driven to assist walking or to promote independent walking, the posture of the person using the device can also be evaluated. The posture evaluation results can be used to assess whether the walking support device is functioning properly. The posture of the person using the walking support device can also be analyzed, and the analysis results can be used to control the device.

In addition, when a care recipient or a worker in a factory or the like wears an active power assist suit to reduce the load of movement, the wearer's posture can also be evaluated. The evaluation results can be used to assess whether the assist suit is functioning properly. The wearer's posture can also be analyzed, and the analysis results can be used to control the assist suit. Analyzing the posture of a worker in a factory or the like also makes it possible to evaluate the worker's working time, including the working time for each type of work.

(2. First Embodiment)
(2-1. Configuration of human image data analysis system 1 in inference phase)
The configuration of the human image data analysis system 1 will be described with reference to FIGS. 1 to 6. In particular, the following describes its configuration in the inference phase. As shown in FIG. 1, the human image data analysis system 1 is made up of an imaging device 2 and a computer device used for analysis. The computer device includes a storage device 3 and an arithmetic processing device 4.

The imaging device 2 is, for example, a moving image imaging device capable of capturing continuous moving images in a time series, or a still image imaging device capable of capturing still images in a time series. The imaging device 2 is used to capture images that include the person who is the subject of posture analysis. The storage device 3 stores trained models A, B, C, and D generated by machine learning. The arithmetic processing device 4 includes a person image data generation unit 11, a feature extraction unit 12, a behavior type output unit 13, and a keypoint output unit 14.

As shown in FIG. 2, trained model A is a machine learning model for person image data extraction generated by machine learning. When image data D1 captured by the imaging device 2 (hereinafter referred to as "original image data") is input, trained model A extracts a person region D1a from the original image data D1. The original image data D1 includes the person region D1a and a surrounding region D1b located around the person region D1a. In addition to the person, the person region D1a may also include an object held by the person.

When the original image data D1 is input, trained model A outputs person image data D2, which is the image data of the extracted person region D1a. Trained model A uses, for example, R-CNN (Regions with Convolutional Neural Networks) or the like. Trained model A extracts the person region D1a as, for example, a rectangular region (bounding box).
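
As an illustration of how person image data D2 could be obtained once a bounding box is known, the sketch below crops a rectangular region from an image. The detector itself (for example R-CNN) is not shown. The nested-list image representation and the `(left, top, right, bottom)` box convention with exclusive right and bottom edges are assumptions made for this example only.

```python
def crop_person_region(image, box):
    """Return the part of `image` inside `box`.

    image: list of rows, each a list of pixel values (hypothetical representation).
    box: (left, top, right, bottom); right and bottom are exclusive (assumed convention).
    """
    left, top, right, bottom = box
    height = len(image)
    width = len(image[0]) if height else 0
    # Clamp the box to the image so an oversized box does not fail.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


# Toy 4x5 "original image data D1"; each pixel value encodes its row and column.
d1 = [[10 * r + c for c in range(5)] for r in range(4)]
# Hypothetical bounding box for the person region D1a.
d2 = crop_person_region(d1, (1, 1, 4, 3))
```

Everything outside the box corresponds to the surrounding region D1b and is discarded.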

As shown in FIG. 3, trained model B is a machine learning model for feature extraction generated by machine learning. A machine learning algorithm that includes a neural network (including deep learning) is suitable for trained model B, but other machine learning algorithms may also be applied. Trained model B is generated by machine learning using the person image data D2 output by trained model A, which includes a person, as an explanatory variable and the feature amounts in the person image data D2 as an objective variable. In other words, when the person image data D2 is input, trained model B outputs the feature amounts in the person image data D2.

The types of feature amounts extracted by trained model B may be set in advance, or may be determined automatically by machine learning. The two approaches may of course be combined. For example, the types of feature amounts may be determined automatically by machine learning and then corrected by the person configuring the system.

Trained model C is a machine learning model for behavior analysis generated by machine learning. A machine learning algorithm that includes a neural network (including deep learning) is suitable for trained model C, but other machine learning algorithms may also be applied. Trained model C is generated by machine learning using, as explanatory variables, the feature amounts for multiple images extracted by trained model B from each piece of a time series of person image data D2, and, as an objective variable, the behavior type of the person in that time series of person image data D2. The number of images and the time span of the time series used for the explanatory variables can be set arbitrarily.

The behavior types of a person can be grouped into broad categories such as, in a stationary state, standing, sitting, lying, and kneeling postures, and, in a moving state, walking, running, and jumping postures and postures taken while playing various sports. Each broad category is further subdivided. For example, sitting postures are classified into cross-legged sitting (agura), relaxed cross-legged sitting (anza), formal kneeling sitting (seiza), long sitting, edge sitting, semi-sitting, and so on. Lying postures are classified into supine, lateral, and prone positions, among others. Other postures are likewise subdivided.

When trained model C receives the feature amounts for multiple images extracted by trained model B from each piece of a time series of person image data D2, it generates a score for each behavior type of the person, as shown in FIG. 4. Trained model C then identifies the behavior type with the highest score as the person's behavior type and outputs that behavior type.
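
The selection step described above amounts to taking the behavior type with the highest score. A minimal sketch follows; the labels are taken from the broad categories mentioned in the text, while the score values are invented for illustration.

```python
def select_behavior_type(scores):
    """Return the behavior type whose score is highest.

    scores: dict mapping behavior type label -> score (hypothetical representation).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=scores.get)


# Illustrative scores only; these are not values from the source.
scores = {"standing": 0.10, "sitting": 0.05, "walking": 0.80, "running": 0.05}
behavior = select_behavior_type(scores)
```

The source does not say how ties are handled; in this sketch the first label reaching the maximum is returned.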

Trained model D is a machine learning model for keypoint extraction generated by machine learning. A machine learning algorithm that includes a neural network (including deep learning) is suitable for trained model D, but other machine learning algorithms may also be applied. Trained model D is generated by machine learning using the feature amounts and the behavior type as explanatory variables and keypoints representing the posture of the person in the person image data D2 as an objective variable. The feature amounts are the information output by trained model B. The behavior type is the information output by trained model C.
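
The text does not specify how the feature amounts and the behavior type are combined into a single input for trained model D. One common way, shown here purely as an assumption, is to append a one-hot encoding of the behavior type to the feature vector. The label list and the function names are hypothetical.

```python
# Hypothetical label set based on the broad categories named in the text.
BEHAVIOR_TYPES = ["standing", "sitting", "lying", "kneeling",
                  "walking", "running", "jumping"]


def one_hot(label, labels=BEHAVIOR_TYPES):
    """Encode a behavior type as a vector with a single 1.0 entry."""
    return [1.0 if name == label else 0.0 for name in labels]


def build_model_d_input(features, behavior):
    """Concatenate feature amounts (from model B) with the behavior type (from model C)."""
    if behavior not in BEHAVIOR_TYPES:
        raise ValueError(f"unknown behavior type: {behavior}")
    return list(features) + one_hot(behavior)


# Three illustrative feature values followed by the encoded behavior type.
x = build_model_d_input([0.2, 0.7, 0.1], "walking")
```

With this layout, identical image features paired with different behavior types give model D different inputs, which is what allows the behavior type to influence the keypoint output.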

The keypoints will be described with reference to FIG. 5. The keypoints include the joint positions of the person, shown by some of the black circles in FIG. 5. In this embodiment, the keypoints also include the positions of the person's eyes, shown by others of the black circles in FIG. 5. Furthermore, the keypoints include connection relationships, shown by the lines connecting the black circles in FIG. 5. For example, the keypoints include connections between adjacent joint positions, connections between adjacent eye positions, and connections between an eye position and a joint position close to it. In other words, the keypoints are feature data that include the body parts used to express the person's posture and the connection relationships between those parts. When the feature amounts and the behavior type are input, trained model D outputs the keypoints shown in FIG. 5.
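
Since the keypoints consist of part positions plus connection relationships, they can be represented as a small graph. The sketch below is one possible representation; the part names and coordinates are hypothetical and are not taken from FIG. 5.

```python
from dataclasses import dataclass, field


@dataclass
class Keypoints:
    """Part positions plus the connection relationships between parts."""
    positions: dict                                   # part name -> (x, y) in image coordinates
    connections: list = field(default_factory=list)   # pairs of part names

    def validate(self):
        """Check that every connection refers to parts that have a position."""
        for a, b in self.connections:
            if a not in self.positions or b not in self.positions:
                raise KeyError(f"connection ({a}, {b}) refers to an unknown part")
        return True


# Hypothetical example: two eye positions and three joint positions.
pose = Keypoints(
    positions={
        "left_eye": (52, 20),
        "right_eye": (44, 20),
        "neck": (48, 40),
        "right_shoulder": (36, 44),
        "right_elbow": (30, 66),
    },
    connections=[
        ("left_eye", "right_eye"),         # adjacent eye positions
        ("right_eye", "neck"),             # eye position and a nearby joint
        ("neck", "right_shoulder"),        # adjacent joints
        ("right_shoulder", "right_elbow"),
    ],
)
```

Keeping the connections separate from the positions makes it possible to check a left/right assignment error, such as the one described for twisted postures, as a wrong edge rather than a wrong coordinate.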

As shown in FIGS. 1 and 2, the person image data generation unit 11 acquires the original image data D1 from the imaging device 2. The person image data generation unit 11 inputs the original image data D1 into trained model A stored in the storage device 3, and thereby generates person image data D2 in which the person region D1a has been extracted from the original image data D1.

When the original image data D1 is moving image data, the person image data generation unit 11 generates a time series of multiple still images from the acquired moving image data. The person image data generation unit 11 then inputs each still image of the generated time series (for example, times T1 to T10) into trained model A and generates person image data D2 for each still image. In other words, the person image data generation unit 11 generates multiple pieces of person image data D2, for example for times T1 to T10.

When the original image data D1 is still image data, the person image data generation unit 11 inputs each still image of the acquired time series (for example, T1 to T10) into trained model A and generates person image data D2 for each still image. In this case as well, the person image data generation unit 11 generates multiple pieces of person image data D2, for example for times T1 to T10.
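
The two cases above reduce to the same operation: obtain a time series of still images and pass each one through trained model A. The sketch below assumes evenly spaced sampling of ten frames from a video, which the source does not specify, and uses a stand-in callable `extract_person` in place of trained model A.

```python
def sample_frame_indices(total_frames, count):
    """Return `count` evenly spaced frame indices, including the first and last frame.

    Even spacing is an assumption; the source only gives T1 to T10 as an example.
    """
    if total_frames < 1 or count < 1:
        raise ValueError("total_frames and count must be positive")
    count = min(count, total_frames)
    if count == 1:
        return [0]
    step = (total_frames - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


def generate_person_images(frames, extract_person, count=10):
    """Apply the person extractor (stand-in for trained model A) to each sampled frame."""
    indices = sample_frame_indices(len(frames), count)
    return [extract_person(frames[i]) for i in indices]


# Toy data: 100 "frames" represented by integers, and a dummy extractor.
frames = list(range(100))
person_images = generate_person_images(frames, lambda frame: frame * 2)
```

For input that is already a time series of still images, `generate_person_images` can be called with `count=len(frames)` so that every image is processed.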

The person image data D2 includes various kinds of image data, such as image data in which at least the person's torso directly faces the imaging device 2, image data in which at least the torso faces away from it, and image data in which at least the torso is sideways. "Sideways" here is not limited to an orientation of 90° relative to the imaging device 2. It means any orientation other than directly facing or directly facing away from the imaging device 2, and includes diagonal orientations.

また、人物画像データＤ２には、人物が上半身と下半身とがねじれていない姿勢の画像データや、ねじれた姿勢の画像データなども含まれる。人物が歩行中においては、左手および右足が前方に位置し、右手および左足が後方に位置する状態となることがある。このような場合には、人物の上半身と下半身とがねじれた姿勢となっている。 The person image data D2 also includes image data of a person in a posture in which the upper and lower halves of the body are not twisted, as well as image data of a person in a twisted posture. When a person is walking, the left hand and right foot may be positioned forward, and the right hand and left foot may be positioned backward. In such a case, the person's upper and lower body are in a twisted posture.

特徴量抽出部１２は、図１および図３に示すように、人物画像データ生成部１１により生成された時系列（Ｔ１～Ｔ１０）からなる複数枚の人物画像データＤ２を取得する。特徴量抽出部１２は、記憶装置３に記憶された学習済みモデルＢを用いて、時系列（Ｔ１～Ｔ１０）からなる複数枚の人物画像データＤ２を入力する。そうすると、特徴量抽出部１２は、学習済みモデルＢの出力として、複数枚の人物画像データＤ２のそれぞれにおける特徴量、すなわち複数枚分の特徴量を抽出する。 As shown in Figs. 1 and 3, the feature extraction unit 12 acquires multiple images of person image data D2 consisting of a time series (T1 to T10) generated by the person image data generation unit 11. The feature extraction unit 12 inputs multiple images of person image data D2 consisting of a time series (T1 to T10) using a trained model B stored in the storage device 3. The feature extraction unit 12 then extracts features for each of the multiple images of person image data D2, i.e., the feature amounts for multiple images, as the output of the trained model B.

As shown in Figs. 1 and 3, the behavior type output unit 13 acquires the feature amounts for multiple images extracted by the feature extraction unit 12. Using the trained model C stored in the storage device 3, the behavior type output unit 13 inputs into the trained model C the feature amounts for multiple images that the feature extraction unit 12 extracted from the multiple images of person image data D2 consisting of a time series (T1 to T10).

そうすると、行動種類出力部１３は、時系列（Ｔ１～Ｔ１０）からなる複数枚分の特徴量を用いて、時系列の複数枚の人物画像データＤ２における人物の行動種類を出力する。具体的には、行動種類出力部１３は、図４に示すように、行動種類ごとのスコアを生成し、スコア値が最も高い行動種類を、当該人物の行動種類として出力する。 Then, the behavior type output unit 13 uses the feature amounts for multiple images in the time series (T1 to T10) to output the behavior type of the person in the multiple images of the person image data D2 in a time series. Specifically, as shown in FIG. 4, the behavior type output unit 13 generates a score for each behavior type, and outputs the behavior type with the highest score value as the behavior type of the person.
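The selection step described above, choosing the behavior type with the highest score, can be sketched as follows. This is a minimal illustration only: the function name `select_behavior_type`, the behavior labels, and the score values are hypothetical and do not come from the specification or from FIG. 4.

```python
from typing import Dict


def select_behavior_type(scores: Dict[str, float]) -> str:
    """Return the behavior type whose score is highest."""
    if not scores:
        raise ValueError("scores must not be empty")
    # max() over the keys, ranked by their score value.
    return max(scores, key=lambda label: scores[label])


# Hypothetical per-behavior scores; labels and values are made up.
example_scores = {"walking": 0.82, "standing": 0.11, "sitting": 0.07}
selected = select_behavior_type(example_scores)
```

How the scores themselves are produced is left to the trained model; the sketch covers only the final comparison.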

本形態においては、行動種類出力部１３は、１枚の人物画像データＤ２における特徴量ではなく、複数枚の人物画像データＤ２における特徴量、すなわち複数枚分の特徴量を入力している。つまり、時系列の複数枚の人物画像データＤ２における人物の位置の変化を判定することにより、行動種類を特定している。 In this embodiment, the behavior type output unit 13 inputs the feature amounts in multiple images of person image data D2, i.e., the feature amounts for multiple images, rather than the feature amounts in one image of person image data D2. In other words, the behavior type is identified by determining the change in the position of the person in the multiple images of person image data D2 in a time series.

キーポイント出力部１４は、図１および図３に示すように、特徴量抽出部１２により抽出された特徴量を取得する。特徴量抽出部１２は、上述したように、時系列（Ｔ１～Ｔ１０）からなる複数枚の人物画像データＤ２のそれぞれにおける特徴量、すなわち複数枚分の特徴量を抽出している。 As shown in Figures 1 and 3, the key point output unit 14 acquires the features extracted by the feature extraction unit 12. As described above, the feature extraction unit 12 extracts features for each of the multiple pieces of person image data D2 consisting of a time series (T1 to T10), i.e., the feature amounts for multiple pieces.

ただし、キーポイント出力部１４は、時系列（Ｔ１～Ｔ１０）からなる複数枚分の特徴量を用いる必要はない。本形態においては、キーポイント出力部１４は、時系列（Ｔ１～Ｔ１０）からなる複数枚の人物画像データＤ２のうち選択された１枚の人物画像データＤ２に基づいて抽出された１枚分の特徴量を取得する。例えば、キーポイント出力部１４は、時刻Ｔ１～Ｔ１０の中間時刻Ｔ５における人物画像データＤ２に基づいて抽出された特徴量を取得する。なお、キーポイント出力部１４が選択する時刻は、任意に決定できる。 However, the keypoint output unit 14 does not need to use feature amounts for multiple images consisting of a time series (T1 to T10). In this embodiment, the keypoint output unit 14 acquires feature amounts for one image extracted based on one image of person image data D2 selected from multiple images of person image data D2 consisting of a time series (T1 to T10). For example, the keypoint output unit 14 acquires feature amounts extracted based on person image data D2 at intermediate time T5 between times T1 to T10. Note that the time selected by the keypoint output unit 14 can be determined arbitrarily.

Furthermore, the key point output unit 14 acquires the behavior type output by the behavior type output unit 13. Using the trained model D stored in the storage device 3, the key point output unit 14 inputs the acquired feature amounts and behavior type into the trained model D, thereby outputting the key points of the person in the person image data D2 at time T5.

図５に示すように、キーポイント出力部１４は、時刻Ｔ５の人物画像データＤ２における人物のキーポイントとして、関節位置、目の位置、各位置を接続する接続関係を出力する。 As shown in FIG. 5, the key point output unit 14 outputs the joint positions, eye positions, and the connection relationships connecting each position as the key points of the person in the person image data D2 at time T5.
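The data flow of the inference phase in this embodiment might be sketched as below. The three callables are stand-ins for the trained feature extraction, behavior analysis, and key point models; their names, signatures, and the toy stubs in the usage lines are assumptions made for illustration, not the actual models.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[float]


def analyze_window(
    frames: Sequence[object],
    extract_features: Callable[[object], Features],         # stand-in: feature extraction model
    classify_behavior: Callable[[List[Features]], str],     # stand-in: behavior analysis model
    estimate_keypoints: Callable[[Features, str], object],  # stand-in: key point model
    selected_index: int,
) -> Tuple[str, object]:
    """Run one time-series window (e.g. T1 to T10) through the three stages."""
    # Features are extracted for every frame in the window.
    per_frame = [extract_features(frame) for frame in frames]
    # The behavior type is decided from the features of all frames.
    behavior = classify_behavior(per_frame)
    # Key points use the features of one selected frame plus the behavior type.
    keypoints = estimate_keypoints(per_frame[selected_index], behavior)
    return behavior, keypoints


# Toy stubs (hypothetical): frames are the integers 1..10 standing for T1..T10.
frames = list(range(1, 11))
behavior, keypoints = analyze_window(
    frames,
    extract_features=lambda f: [float(f)],
    classify_behavior=lambda feats: "walking" if feats[-1][0] > feats[0][0] else "standing",
    estimate_keypoints=lambda feat, b: {"frame_feature": feat, "behavior": b},
    selected_index=4,  # index 4 corresponds to T5
)
```

The selected index is a free parameter here, reflecting the statement that the time chosen by the key point output unit can be determined arbitrarily.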

(2-2. Configuration of human image data analysis system 1 in learning phase)
The configuration of the human image data analysis system 1 in the learning phase will be described with reference to Fig. 6. In particular, the learning phase for models B, C, and D will be described.

まず、学習に使用する訓練データセットを準備する。訓練データセットとして、時系列の複数枚の人物画像データＤ２からなるユニットを多数準備する。例えば、複数の動画像データは、時系列の複数枚の人物画像データＤ２からなるユニットを多数含むものであるため、訓練データセットとして好適である。さらに、訓練データセットは、当該人物画像データＤ２における人物のキーポイント、人物の行動種類についてのラベル情報を含む。 First, a training dataset to be used for learning is prepared. As the training dataset, many units each consisting of multiple time-series images of person image data D2 are prepared. For example, multiple video image data sets are suitable as a training dataset because they contain many units each consisting of multiple time-series images of person image data D2. Furthermore, the training dataset includes label information on key points of people in the person image data D2 and types of behavior of people.

学習に用いる損失関数Ｆ（ｘ，ｙ）は、キーポイントの要素ｘ、および、行動種類の要素ｙを含む。モデルＢ，Ｃ，Ｄは、訓練データセットを入力して、損失関数Ｆ（ｘ，ｙ）を小さくするように学習を行う。損失関数Ｆ（ｘ，ｙ）がキーポイントの要素および行動種類の要素を有することにより、モデルＢ，Ｃ，Ｄは、キーポイントおよび行動種類の正解を出力するように学習される。このようにして学習された学習済みモデルＢ，Ｃ，Ｄは、記憶装置３に記憶される。 The loss function F(x, y) used in learning includes a keypoint element x and an action type element y. Models B, C, and D input a training dataset and learn to reduce the loss function F(x, y). As the loss function F(x, y) has keypoint elements and action type elements, models B, C, and D are trained to output correct answers for keypoints and action types. The trained models B, C, and D trained in this way are stored in the storage device 3.

Learning with the loss function F(x, y) described above does not train models B, C, and D independently; it treats models B, C, and D as a single unified model. Therefore, the parts of each of models B, C, and D that are affected by the loss function F(x, y) are trained effectively.
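One possible shape for a loss with a key point element and a behavior type element is sketched below. The specification does not state the form of F(x, y); the mean squared error term, the cross-entropy term, the `weight` parameter, and all function names are assumptions chosen only to make the idea concrete.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def keypoint_term(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared distance between predicted and labeled key point positions."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)


def behavior_term(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy for a single labeled behavior type."""
    # Clamp to avoid log(0) when the predicted probability is zero.
    return -math.log(max(probs[true_index], 1e-12))


def combined_loss(pred: Sequence[Point], target: Sequence[Point],
                  probs: Sequence[float], true_index: int,
                  weight: float = 1.0) -> float:
    """Assumed form: key point term plus weighted behavior type term."""
    return keypoint_term(pred, target) + weight * behavior_term(probs, true_index)


# Made-up values: a perfect prediction and an imperfect one.
loss_perfect = combined_loss([(1.0, 2.0)], [(1.0, 2.0)], [1.0, 0.0], 0)
loss_off = combined_loss([(0.0, 0.0)], [(3.0, 4.0)], [0.5, 0.5], 1)
```

Because a single scalar depends on both outputs, minimizing it updates every model that contributes to either output, which is the sense in which the models are trained as one unit.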

(2-3. Effects)
In the person image data analysis system 1, the key point output unit 14 does not output the key points of a person by using only the feature amounts in the person image data D2. The key point output unit 14 inputs the type of behavior of the person in addition to the feature amounts in the person image data D2, and outputs the key points of the person.

このように、キーポイント出力部１４は、人物の行動種類を把握した上で人物のキーポイントを出力することで、人物のキーポイントを高精度に出力することができる。このことについて、本形態におけるキーポイントの出力結果である図５と、比較例としてのキーポイントの出力結果である図７とを比較して説明する。 In this way, the keypoint output unit 14 can output a person's keypoints with high accuracy by grasping the type of behavior of the person and then outputting the person's keypoints. This will be explained by comparing Figure 5, which shows the output result of keypoints in this embodiment, with Figure 7, which shows the output result of keypoints as a comparative example.

図５は、本形態におけるキーポイント出力部１４が出力したキーポイントを示す。一方、図７は、行動種類を考慮せずに、人物画像データＤ２における特徴量のみに基づいて出力されたキーポイントを示す。図５および図７に示すキーポイントに用いた人物画像データＤ２は、人物が上半身と下半身とをねじれさせた姿勢である。さらに、人物画像データＤ２は、人物の少なくとも胴体が横向きとなる姿勢の画像データである。 Figure 5 shows key points output by the key point output unit 14 in this embodiment. On the other hand, Figure 7 shows key points output based only on the feature amounts in the person image data D2, without taking into account the type of activity. The person image data D2 used for the key points shown in Figures 5 and 7 is a posture in which the person's upper and lower body are twisted. Furthermore, the person image data D2 is image data of a posture in which at least the torso of the person is turned sideways.

図５に示す人物の下半身において、右股関節と右膝関節とが接続され、左股関節と左膝関節とが接続されている。このように、図５においては、関節同士が正しく接続されている。一方、図７に示す人物の下半身において、右股関節と左膝関節とが接続され、左股関節と右膝関節とが接続されている。つまり、図７においては、関節同士が誤って接続されている。 In the lower body of the person shown in Figure 5, the right hip joint is connected to the right knee joint, and the left hip joint is connected to the left knee joint. Thus, in Figure 5, the joints are correctly connected. On the other hand, in the lower body of the person shown in Figure 7, the right hip joint is connected to the left knee joint, and the left hip joint is connected to the right knee joint. In other words, in Figure 7, the joints are incorrectly connected.

図５および図７に示す人物の下半身において、右股関節は、右膝関節よりも、左膝関節の方が近い位置に位置し、左股関節は、左膝関節よりも、右膝関節の方が近い位置に位置する。そして、人物画像データが人物の胴体が横向きの姿勢であるため、左右股関節と左右膝関節とが、左右の前後位置が反対になっている。図７においては、近い位置に位置する関節同士を接続したものと思われる。 In the lower body of the person shown in Figures 5 and 7, the right hip joint is closer to the left knee joint than the right knee joint, and the left hip joint is closer to the right knee joint than the left knee joint. And because the person's torso is posed sideways in the person's image data, the left and right hip joints and the left and right knee joints are in reversed front-to-back positions. In Figure 7, it appears that joints located close to each other are connected.
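The geometric situation described above can be reproduced with a toy calculation. The coordinates below are invented so that each hip is slightly closer to the opposite knee, as in the sideways, twisted posture; they are not taken from Figure 5 or Figure 7, and nearest-distance pairing is only the behavior the text supposes for the comparative example.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def nearest_knee(hip: Point, knees: Dict[str, Point]) -> str:
    """Pick the knee with the smallest image-plane distance to the hip."""
    return min(knees, key=lambda name: math.dist(hip, knees[name]))


# Hypothetical image coordinates (x, y) for a sideways, twisted posture.
hips = {"right_hip": (100.0, 200.0), "left_hip": (110.0, 200.0)}
knees = {"right_knee": (114.0, 260.0), "left_knee": (98.0, 260.0)}

# Distance-only pairing crosses left and right.
naive = {hip: nearest_knee(pos, knees) for hip, pos in hips.items()}

# The anatomically correct pairing keeps each side together.
anatomical = {"right_hip": "right_knee", "left_hip": "left_knee"}
```

With these values the distance-only rule connects the right hip to the left knee and the left hip to the right knee, which is the kind of crossed connection the comparative example exhibits.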

As shown in FIG. 7, when a person is in a posture in which the upper and lower halves of the body are twisted, the connection relationship connecting adjacent joint positions, which is one of the key points, may be output erroneously. If the connections between joint positions are not recognized correctly, the person's posture cannot be recognized correctly. In this embodiment, however, as shown in FIG. 5, grasping the type of behavior of the person makes it possible to output the connection relationship, as one of the key points, with high accuracy even when the person is in a twisted posture and facing sideways. Therefore, the person's posture can be analyzed with high accuracy.

行動種類出力部１３は、時系列の複数枚分の特徴量を入力することにより、人物の行動種類を出力している。従って、行動種類出力部１３は、時系列の複数枚の人物画像データＤ２を用いることで、高精度に人物の行動種類を特定することができる。その結果、人物のキーポイントを高精度に出力できる。 The behavior type output unit 13 outputs the behavior type of a person by inputting feature amounts for multiple time-series images. Therefore, the behavior type output unit 13 can identify the behavior type of a person with high accuracy by using multiple time-series images of person image data D2. As a result, it is possible to output key points of a person with high accuracy.

また、学習済みモデルＢ，Ｃ，Ｄは、学習フェーズにおいて、キーポイントの要素および行動種類の要素を含む損失関数Ｆ（ｘ，ｙ）により学習されている。つまり、キーポイント出力部１４が行動種類を考慮したキーポイントを高精度に出力できるように、学習済みモデルＢ，Ｃ，Ｄが学習される。このようにして学習された学習済みモデルＢ，Ｃ，Ｄを用いて、人物のキーポイントを出力することから、高精度なキーポイントを出力できる。 In addition, in the learning phase, the trained models B, C, and D are trained using a loss function F(x, y) that includes keypoint elements and behavior type elements. In other words, the trained models B, C, and D are trained so that the keypoint output unit 14 can output keypoints that take behavior type into account with high accuracy. The trained models B, C, and D trained in this way are used to output person keypoints, making it possible to output highly accurate keypoints.

また、人物画像データ解析システム１は、撮像機器２により撮像された元画像データＤ１そのものを特徴量抽出部１２に入力するのではなく、元画像データＤ１から人物領域Ｄ１ａが抽出された人物画像データＤ２を特徴量抽出部１２に入力している。このように、人物領域Ｄ１ａを抽出した人物画像データＤ２を生成することにより、人物画像データＤ２における人物のキーポイントを高精度に出力することにつながる。 In addition, the person image data analysis system 1 does not input the original image data D1 captured by the imaging device 2 itself to the feature extraction unit 12, but inputs person image data D2 in which the person area D1a has been extracted from the original image data D1 to the feature extraction unit 12. In this way, generating person image data D2 in which the person area D1a has been extracted leads to highly accurate output of key points of people in the person image data D2.

(3. Second Embodiment)
The configuration of the inference phase of the human image data analysis system 1 according to the second embodiment will be described with reference to FIGS. 1 and 8.

図１に示すように、人物画像データ解析システム１は、学習済みモデルＡ，Ｂ，Ｃ，Ｄを記憶する記憶装置３、および、演算処理装置４を備える。演算処理装置４は、人物画像データ生成部１１、特徴量抽出部１２、行動種類出力部１３、および、キーポイント出力部１４を備える。 As shown in FIG. 1, the human image data analysis system 1 includes a storage device 3 that stores trained models A, B, C, and D, and a processing device 4. The processing device 4 includes a human image data generation unit 11, a feature extraction unit 12, a behavior type output unit 13, and a key point output unit 14.

Trained models A, B, and D are the same as trained models A, B, and D in the first embodiment. Trained model C applies a recurrent (recursive) algorithm. For example, trained model C applies an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory), or the like.

つまり、学習済みモデルＣは、特徴量抽出部１２により１枚の人物画像データＤ２に基づいて抽出された１枚分の特徴量を順次入力した場合に、前回演算処理を行った結果を用いた再帰型演算を行うことにより、今回演算処理の対象である１枚の人物画像データＤ２における人物の行動種類を出力する機械学習モデルである。 In other words, the trained model C is a machine learning model that, when the feature amounts for one image extracted by the feature extraction unit 12 based on one image of person image data D2 are input sequentially, performs a recursive calculation using the result of the previous calculation process, and outputs the behavior type of the person in the image of person image data D2 that is the subject of the current calculation process.

In this embodiment, the processing of each unit constituting the processing device 4 is as follows. The feature extraction unit 12 uses the trained model B to sequentially input a plurality of time-series images of person image data D2, thereby sequentially extracting features from each of the plurality of images of person image data D2. In other words, the feature extraction unit 12 sequentially extracts the features of the one image of person image data D2 that is the subject of calculation processing.

The behavior type output unit 13 uses the trained model C to sequentially input the feature amounts for one image extracted by the feature extraction unit 12 based on one image of person image data D2, and performs a recursive calculation using the result of the previous calculation process, thereby outputting the behavior type of the person in the one image of person image data D2 that is the subject of the current calculation process.

キーポイント出力部１４は、学習済みモデルＤを用いて、今回演算処理の対象である１枚分の特徴量、および、今回演算処理の対象を含む人物画像データＤ２における人物の行動種類を入力することにより、今回演算処理の対象である１枚の人物画像データＤ２における人物のキーポイントを出力する。 The key point output unit 14 uses the trained model D to input the feature values of the one image that is the subject of the current calculation process and the type of behavior of the person in the person image data D2 that includes the subject of the current calculation process, and outputs the key points of the person in the one image of person image data D2 that is the subject of the current calculation process.

行動種類出力部１３が、再帰型演算を行うことにより、今回演算処理の対象である１枚分の特徴量を用いて、人物の行動種類を出力できる。従って、特徴量抽出部１２、行動種類出力部１３、および、キーポイント出力部１４における処理が、今回演算処理の対象としての１枚の人物画像データＤ２の入力により実行される。従って、時系列の人物画像データを順次入力する度に、当該人物画像データにおける人物のキーポイントを出力することができる。つまり、リアルタイムに人物のキーポイントを出力できる。その結果、リアルタイムに、人物の姿勢を解析することができる。 By performing recursive calculations, the behavior type output unit 13 can output the behavior type of a person using the feature amounts for one image that is the subject of the current calculation process. Therefore, the processing in the feature amount extraction unit 12, behavior type output unit 13, and key point output unit 14 is executed by inputting one image of person image data D2 that is the subject of the current calculation process. Therefore, each time time-series person image data is input sequentially, the key points of the person in the person image data can be output. In other words, the key points of the person can be output in real time. As a result, the posture of the person can be analyzed in real time.
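The frame-by-frame processing described for this embodiment can be sketched with a toy recurrent cell that carries one scalar of state from the previous step. The class `ToyRecurrentClassifier`, its weights, its threshold, and the labels "moving" and "still" are all hypothetical; a real system would use a trained RNN or LSTM rather than this hand-set cell.

```python
import math
from typing import Callable, Iterable, Iterator, Tuple


class ToyRecurrentClassifier:
    """Scalar recurrent cell: state_t = tanh(w_in * x_t + w_rec * state_{t-1})."""

    def __init__(self, w_in: float = 1.0, w_rec: float = 0.5,
                 threshold: float = 0.5) -> None:
        self.w_in = w_in
        self.w_rec = w_rec
        self.threshold = threshold
        self.state = 0.0  # result of the previous calculation process

    def step(self, feature: float) -> str:
        # The new state depends on the current input and the previous state.
        self.state = math.tanh(self.w_in * feature + self.w_rec * self.state)
        return "moving" if self.state > self.threshold else "still"


def stream_keypoints(
    frames: Iterable[float],
    extract: Callable[[float], float],
    classifier: ToyRecurrentClassifier,
    estimate: Callable[[float, str], object],
) -> Iterator[Tuple[str, object]]:
    """Yield a behavior type and key points for each frame as it arrives."""
    for frame in frames:
        feature = extract(frame)
        behavior = classifier.step(feature)
        yield behavior, estimate(feature, behavior)


# Made-up frame values; each frame produces an output without waiting for later frames.
results = list(stream_keypoints(
    [0.0, 0.0, 1.0, 1.0],
    extract=lambda f: f,
    classifier=ToyRecurrentClassifier(),
    estimate=lambda feat, b: (feat, b),
))
```

Since each step needs only the current frame and the stored state, an output is available as soon as a frame is input, which is what makes per-frame, real-time operation possible.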

REFERENCE SIGNS LIST
1 Person image data analysis system
3 Storage device
4 Processing device
11 Person image data generation unit
12 Feature extraction unit
13 Behavior type output unit
14 Key point output unit
D1 Original image data
D2 Person image data

Claims

A human image data analysis system including a computer device having a processor and a memory device,
The storage device includes:
storing a trained model for feature extraction generated by performing machine learning using first person image data including a person as an explanatory variable and a feature in the first person image data as a target variable;
storing a trained model for behavior analysis generated by performing machine learning using the feature amount extracted based on the first person image data as an explanatory variable and a behavior type of the person in the multiple first person image data in a time series as a target variable;
storing a trained model for keypoint extraction generated by performing machine learning using the feature amount and the behavior type as explanatory variables and a keypoint representing a posture of the person in the first person image data as a target variable;
The arithmetic processing device includes:
a feature extraction unit that extracts the feature from second person image data including a person by inputting the second person image data using the trained model related to the feature extraction stored in the storage device; and
a behavior type output unit that outputs the behavior type of the person in the plurality of time-series images of the second person image data by inputting the feature amount extracted by the feature amount extraction unit using a trained model related to the behavior analysis stored in the storage device;
a keypoint output unit that outputs the keypoints of the person in the second person image data by inputting the feature extracted by the feature extraction unit and the behavior type output by the behavior type output unit using a trained model related to the keypoint extraction stored in the storage device;
wherein the trained model for feature extraction, the trained model for behavior analysis, and the trained model for keypoint extraction are trained in a learning phase using a loss function that includes elements of the keypoints and elements of the behavior types.

the feature extraction unit extracts the feature from each of the second person image data by inputting the second person image data in a time series using a trained model for the feature extraction;
the behavior type output unit outputs the behavior type by inputting the feature amounts for a plurality of images extracted based on each of the plurality of second person image data in time series using a trained model related to the behavior analysis; and
The human image data analysis system according to claim 1, wherein the keypoint output unit uses a trained model for the keypoint extraction to input the feature amount for one image extracted based on the one image of the second human image data selected from the plurality of images of the second human image data in a time series, and the behavior type, thereby outputting the keypoints of the person in the one image of the second human image data selected.

the feature extraction unit sequentially inputs the plurality of second person image data in time series using a trained model for the feature extraction, thereby sequentially extracting the feature from each of the plurality of second person image data;
the behavior type output unit sequentially inputs the extracted feature amounts for one image using a trained model for the behavior analysis, and performs a recursive calculation using a result of a previous calculation process, thereby outputting the behavior type of the person in the one image of the second person image data that is a target of a current calculation process;
The human image data analysis system according to claim 1, wherein the keypoint output unit uses a trained model for the keypoint extraction to input the feature amounts and the behavior type for one image, and outputs the keypoints of the person in the one image of the second human image data that is the subject of a current calculation process.

A human image data analysis system including a computer device having a processor and a memory device,
The storage device includes:
storing a trained model for feature extraction generated by performing machine learning using first person image data including a person as an explanatory variable and a feature in the first person image data as a target variable;
storing a trained model for behavior analysis generated by performing machine learning using the feature amount extracted based on the first person image data as an explanatory variable and a behavior type of the person in the multiple first person image data in a time series as a target variable;
storing a trained model for keypoint extraction generated by performing machine learning using the feature amount and the behavior type as explanatory variables and a keypoint representing a posture of the person in the first person image data as a target variable;
The arithmetic processing device includes:
a feature extraction unit that extracts the feature from each of a plurality of second person image data by inputting a plurality of second person image data in a time series including a person, using the trained model related to the feature extraction stored in the storage device;
a behavior type output unit that outputs the behavior type of the person in the plurality of time-series second person image data by inputting the feature amounts for the plurality of images extracted by the feature amount extracting unit based on each of the plurality of time-series second person image data, using a trained model for the behavior analysis stored in the storage device;
a key point output unit that uses a trained model for the key point extraction stored in the storage device to input the feature amounts for one image selected from the plurality of second person image data in time series, the feature amounts being extracted by the feature amount extraction unit based on the one image of the second person image data selected from the plurality of second person image data in time series, and the behavior type being output by the behavior type output unit, and outputs the key points of the person in the selected one image of the second person image data;
A human image data analysis system comprising:

A human image data analysis system including a computer device having a processor and a memory device,
The storage device includes:
storing a trained model for feature extraction generated by performing machine learning using first person image data including a person as an explanatory variable and a feature in the first person image data as a target variable;
storing a trained model for behavior analysis generated by performing machine learning using the feature amount for one image extracted based on the first person image data as an explanatory variable and a behavior type of the person in the multiple first person image data in a time series as a target variable;
storing a trained model for keypoint extraction generated by performing machine learning using the feature amount and the behavior type as explanatory variables and a keypoint representing a posture of the person in the first person image data as a target variable;
The arithmetic processing device includes:
a feature extraction unit that sequentially inputs a plurality of second person image data in a time series including a person, by using the trained model related to the feature extraction stored in the storage device, and sequentially extracts the feature from each of the plurality of second person image data;
a behavior type output unit that uses a trained model for the behavior analysis stored in the storage device to sequentially input the feature amounts for one image extracted by the feature amount extraction unit and to perform a recursive calculation using a result of a previous calculation process, thereby outputting the behavior type of the person in the one image of the second person image data that is the target of a current calculation process;
a keypoint output unit that uses a trained model for the keypoint extraction stored in the storage device to input the feature amounts for one image extracted by the feature amount extraction unit and the behavior type output by the behavior type output unit, and outputs the keypoints of the person in the second person image data for one image that is the subject of a current calculation process;
A human image data analysis system comprising:

The human image data analysis system according to claim 1, wherein the first human image data and the second human image data include image data of the human in a pose in which at least a torso of the human is turned sideways.

The human image data analysis system according to any one of claims 1 to 6, wherein the arithmetic processing device further includes a human image data generation unit that inputs original image data including a human area and a surrounding area, and generates the first human image data and the second human image data in which the human area is extracted from the original image data.

The human image data analysis system according to claim 1, wherein the key points include joint positions of the person and connection relationships that connect adjacent joint positions of the person.