JP7624941B2 - Motion prediction device, method and program

Info

Publication number: JP7624941B2
Application number: JP2022035654A
Authority: JP
Inventors: 智尋中塚; 絵美明堂; 翔一郎三原; 賢史小森田
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2025-01-31
Anticipated expiration: 2042-03-08
Also published as: JP2023131014A

Description

The present invention relates to a device, method, and program for predicting the continuation of a person's motion from video of that person in motion, and in particular to a motion prediction device, method, and program that predict the person's motion while taking the surrounding situation into account.

Predicting detailed human motion, such as the movement of hands and feet, several seconds ahead is needed in fields such as the rapid detection of dangerous behavior and the control of robots and actuators that operate in cooperation with people.

Non-Patent Document 2 discloses a technique that detects a person's skeletal positions from 2D video and predicts future skeletal positions from past ones using an LSTM (Long Short-Term Memory). While this approach readily predicts simple movements, constraints on the postures a person can take must be added separately, and it is difficult to impose strict constraints.

Non-Patent Document 1 discloses a technique based on a CVAE (Conditional Variational Autoencoder) model that uses a GRU (Gated Recurrent Unit), rather than an LSTM, as its encoder and decoder modules. A CVAE can be trained to output a plausible continuation of a motion from a latent representation sampled at random from a standard normal distribution, with the past motion as the condition.

Non-Patent Document 1 further imposes diversity on the latent representations so that multiple, diverse predictions are output as the continuation, reflecting the fact that a person's future motion has many possibilities. Specifically, based on the given past motion, it applies a variety of linear transformations to randomly sampled latent variables and supplies the results to the CVAE as new latent representations. This makes it possible to predict plausible motions that cover a diverse range of possible futures.

Patent Document 1 discloses a technique that predicts a person's behavior in a way that reflects individual characteristics. It analyzes the behavioral state and the identifier of a photographed person, and generates behavior prediction information based on rules that are matched against the resulting person information.

Patent Document 1: JP 2013-186556 A

Non-Patent Document 1: Ye Yuan and Kris Kitani, "DLow: Diversifying Latent Flows for Diverse Human Motion Prediction", ECCV 2020
Non-Patent Document 2: Erwin Wu and Hideki Koike, "FuturePose - Mixed Reality Martial Arts Training Using Real-Time 3D Human Pose Forecasting With a RGB Camera", WACV 2019
Non-Patent Document 3: J. Redmon, S. Divvala, R. Girshick and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection", 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91
Non-Patent Document 4: Nicolai Wojke, Alex Bewley and Dietrich Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric", ICIP 2017
Non-Patent Document 5: Zhe Cao, Tomas Simon, Shih-En Wei and Yaser Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields", CVPR 2017
Non-Patent Document 6: Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai and Bolei Zhou, "Temporal Pyramid Network for Action Recognition", CVPR 2020

Non-Patent Document 1 can predict plausible and diverse motions as the continuation of a given motion. However, because it does not consider the surrounding situation, it may also predict unlikely motions. For example, when an autonomous robot plans its actions by referring to motion predictions, it is desirable for the predictions to concentrate on the motions that are most plausible given the surrounding situation. Because Non-Patent Document 1 presents more predictions than necessary, including unlikely ones, it may hinder the robot's action planning.

Patent Document 1 predicts behavior that reflects individual characteristics, but because it does not consider the surrounding situation, it may generate predictions that are inconsistent with that situation.

An object of the present invention is to solve the above technical problems and to provide a motion prediction device, method, and program that can accurately predict the continuation of a photographed person's motion while taking the surrounding situation into account.

To achieve the above object, the present invention provides a motion prediction device that predicts a subsequent motion based on video of a person in motion, characterized by the following configurations.

(1) The device comprises means for analyzing the person's motion up to the present, means for acquiring a destination point of the motion, means for extracting a latent representation of the subsequent motion based on the person's motion up to the present and the destination point, and means for predicting the person's subsequent motion based on the motion up to the present, the destination point, and the latent representation.

(2) The means for acquiring the destination point estimates the destination point based on the person's motion up to the present.

(3) The means for acquiring the destination point comprises means for detecting objects in the video and means for estimating the person's direction of movement based on the video, and estimates the position of an object detected in the direction of movement as the destination point.

(1) Because the continuation of a photographed person's motion is predicted with the surrounding situation taken into account, accurate prediction becomes possible.

(2) The destination point is related to the motion, so estimating it from the motion up to the present allows it to be estimated with high accuracy.

(3) The destination point depends on the direction of movement, so estimating the position of an object detected in that direction as the destination point allows it to be estimated with even higher accuracy.

FIG. 1 is a functional block diagram showing the configuration of the main part of a motion prediction device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the expression extraction unit.
FIG. 3 is a flowchart showing the operation of the expression extraction unit.
FIG. 4 is a functional block diagram showing the configuration of the motion prediction unit.

An embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a functional block diagram showing the configuration of the main part of a motion prediction device 1 according to an embodiment of the present invention. Its main components are a video acquisition unit 10, a motion analysis unit 20, a destination point estimation unit 30, an expression extraction unit 40, and a motion prediction unit 50.

The motion prediction device 1 can be configured by installing applications (programs) that implement each function on at least one general-purpose computer or server equipped with a CPU, ROM, RAM, a bus, interfaces, and the like. Alternatively, it can be configured as a dedicated or single-function machine in which part of the application is implemented in hardware or software.

In this embodiment, for a single motion of one person, the motion that follows is predicted based on the analysis of the past motion up to the present. The description here uses as an example the prediction of a scene in which a person is about to pick up a (stationary) object. When there are multiple target persons, individual predictions can be made by repeating the same processing for each person.

The video acquisition unit 10 uses an RGB camera or a depth camera to capture the person whose motion is to be predicted for a predetermined time (for example, 3 seconds) and acquires frame images showing the person. Alternatively, it may acquire a video file in which the person was captured for the predetermined time in advance.

The motion analysis unit 20 analyzes the acquired video to detect and track the person, and estimates the person's motion up to the present from the detection and tracking results. Deep learning models can be used for this purpose, such as YOLO (You Only Look Once), disclosed in Non-Patent Document 3, for person detection and DeepSORT, disclosed in Non-Patent Document 4, for person tracking. This allows the same person to be tracked accurately across multiple frames.

The motion analysis unit 20 further performs posture estimation on each frame image in which the person appears. A method that outputs the positions of skeletal keypoints in the video as the posture, such as OpenPose disclosed in Non-Patent Document 5, can be used. Keypoint positions can be expressed in two-dimensional or three-dimensional coordinates.

Because the output of OpenPose and similar tools is usually in two-dimensional coordinates, additional processing is required to express it in three-dimensional coordinates. Possible approaches include triangulation (which requires multi-view video as input), adding depth information (which requires depth images as input), and fitting a 3D human model.
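As one illustration of the depth-based option, the following minimal sketch lifts 2D keypoints to 3D camera coordinates with the standard pinhole camera model. The function name `backproject` and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumptions introduced for this example; the source does not specify how depth information is attached.

```python
import numpy as np

def backproject(keypoints_2d, depth, fx, fy, cx, cy):
    """Lift (J, 2) pixel keypoints (u, v) to (J, 3) camera coordinates.

    depth is an (H, W) depth image aligned with the RGB frame, and
    fx, fy, cx, cy are pinhole intrinsics (hypothetical values here).
    """
    kp = np.asarray(keypoints_2d, dtype=float)
    u, v = kp[:, 0], kp[:, 1]
    # Read depth at the nearest pixel; rows index v (y), columns index u (x).
    z = depth[np.rint(v).astype(int), np.rint(u).astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

A real pipeline would also need to handle missing depth values and keypoints that fall outside the image, which this sketch leaves out.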

The motion analysis unit 20 combines the results for all frames and outputs the resulting estimate of the person's motion as the motion analysis result. If the number of frames is T, the number of skeletal keypoints is J, and postures are expressed in three-dimensional coordinates, the motion analysis result is a T×3J matrix.
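The T×3J layout can be sketched as follows. This is only an illustration of the data format; the function name `flatten_poses`, the frame rate, and the keypoint count are assumptions, not values from the source.

```python
import numpy as np

def flatten_poses(keypoints: np.ndarray) -> np.ndarray:
    """Flatten per-frame 3D keypoints of shape (T, J, 3) into a (T, 3J) matrix."""
    if keypoints.ndim != 3 or keypoints.shape[2] != 3:
        raise ValueError("expected an array of shape (T, J, 3)")
    T, J, _ = keypoints.shape
    # Each row holds one frame: x, y, z of keypoint 0, then keypoint 1, and so on.
    return keypoints.reshape(T, 3 * J)

# Hypothetical example: 90 frames (3 s at an assumed 30 fps) and 18 keypoints.
poses = np.zeros((90, 18, 3))
x = flatten_poses(poses)
print(x.shape)  # (90, 54)
```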

In addition to the above analysis, the kind of action the person's motion represents may also be analyzed, and the resulting action label attached to the motion analysis result. The action label can be estimated using, for example, the deep learning model disclosed in Non-Patent Document 6.

The destination point estimation unit 30 includes an object detection unit 301 and a direction estimation unit 302. It detects the object that is the target of the motion based on the motion analysis result, and estimates the point the person is heading for based on the person's direction of movement.

When predicting the motion of a person picking up an object, as in this embodiment, the object detection unit 301 is given in advance several categories of objects that a person can pick up, and performs object detection on objects belonging to those categories using the method of Non-Patent Document 3.

Motions other than picking up an object can be handled in the same way if the objects related to the target motion are defined in advance. For example, the detection target would be a door handle when predicting the motion of opening a door, or a chair when predicting the motion of sitting down. If the motion analysis unit 20 has recognized an action label, detection may be limited to objects related to that action label.

Once objects have been detected, the object that is the target of the motion is identified using cues such as the direction in which the person's motion is progressing. The direction estimation unit 302 calculates the person's direction of travel as the average over frames of the direction of movement of the body's center coordinates, and identifies the target object as the object closest to the person along the extension of that direction.
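The target selection described above might be sketched as follows. The tolerance `max_offset`, which decides whether an object counts as lying on the extension of the direction of travel, is an assumption introduced for this example, as are the function names; the source does not state how "on the extension line" is tested.

```python
import numpy as np

def mean_direction(centers: np.ndarray) -> np.ndarray:
    """Unit vector of the frame-averaged displacement of the body center, input (T, D)."""
    steps = np.diff(centers, axis=0)  # (T-1, D) movement between consecutive frames
    direction = steps.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        raise ValueError("the person is not moving; direction is undefined")
    return direction / norm

def pick_target(centers, objects, max_offset=0.5):
    """Index of the nearest object ahead of the person and close to the ray, or None."""
    centers = np.asarray(centers, dtype=float)
    objects = np.asarray(objects, dtype=float)
    d = mean_direction(centers)
    rel = objects - centers[-1]  # vectors from the person to each object
    along = rel @ d              # signed distance along the direction of travel
    off = np.linalg.norm(rel - np.outer(along, d), axis=1)  # distance from the ray
    ok = (along > 0) & (off <= max_offset)
    if not ok.any():
        return None
    dist = np.where(ok, np.linalg.norm(rel, axis=1), np.inf)
    return int(np.argmin(dist))
```

Objects behind the person or far from the ray are excluded before the nearest remaining object is chosen, so a close object off to the side does not win over a farther one straight ahead.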

The estimated object position is expressed in the same coordinate system as the person's motion. In two dimensions, it is the center of the region output by the detector; in three dimensions, it is the three-dimensional coordinates of the object's center obtained by triangulation, depth images, fitting of a predefined object model, or the like. The estimated object position is sent, as the destination point of the motion, to the expression extraction unit 40 in the subsequent stage together with the person's motion.

For the motion of picking up an object, the way the object is grasped is likely to vary with its type, for example a cup or a bag, so the object type may also be sent to the expression extraction unit 40. The object type can be represented by a numerical value such as an ID, and the detector disclosed in Non-Patent Document 3 can output this ID together with the detection result.

When the object type is taken into account, the object ID may, for example, be converted into a one-hot vector and concatenated with the destination point data before the subsequent processing. A configuration for inputting the destination point may also be provided in place of the destination point estimation unit 30 so that the destination point can be entered manually.
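A minimal sketch of the one-hot encoding and concatenation follows. The function name `encode_destination` and the number of categories are assumptions for illustration.

```python
import numpy as np

def encode_destination(point, object_id: int, num_categories: int) -> np.ndarray:
    """Concatenate a destination point with a one-hot encoding of the object ID."""
    if not 0 <= object_id < num_categories:
        raise ValueError("object_id must be in [0, num_categories)")
    one_hot = np.zeros(num_categories)
    one_hot[object_id] = 1.0
    return np.concatenate([np.asarray(point, dtype=float), one_hot])

# Hypothetical example: a 3D destination point and object ID 2 out of 4 categories.
y = encode_destination([0.5, 1.0, 2.0], object_id=2, num_categories=4)
print(y)  # [0.5 1.  2.  0.  0.  1.  0. ]
```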

Based on the person's motion x and the estimated destination point y, the expression extraction unit 40 extracts multiple plausible latent representations z by inferring, with an approximate probability distribution, the latent representation of the motion the person will continue with.

FIG. 2 is a functional block diagram showing the configuration of the expression extraction unit 40. The person's motion x up to the present is input to a gated recurrent unit (GRU) 401, and the destination point y is input to a multilayer perceptron (MLP) 402. The GRU 401 weights the person's motion x according to each time step and outputs the result to the multilayer perceptron 403 in the subsequent stage.

FIG. 3 is a flowchart showing the procedure by which the expression extraction unit 40 extracts the latent representations z based on the person's motion x and the estimated destination point y. Here, the person's motion x and the destination point y are assumed to be expressed in three-dimensional coordinates.

In step S1, K vectors $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_K\}$ are randomly sampled from the multivariate Gaussian distribution $N(0, I)$ and provided to the conversion unit 405. In step S2, each MLP 403 encodes the person's motion $x \in \mathbb{R}^{T \times 3J}$ with K complex nonlinear functions φ (for example, pre-trained neural networks) to calculate K matrix-vector pairs $A_k$, $b_k$ ($k = 1, \ldots, K$).

In step S3, the conversion unit 405 in the subsequent stage applies equation (1) to each of the K vectors $\varepsilon_k$ to convert them into K vectors $\delta_1, \ldots, \delta_K$.

$$\delta_k = A_k \varepsilon_k + b_k \tag{1}$$

In step S4, the MLP 402 encodes the destination point y with L complex nonlinear functions ψ (for example, pre-trained neural networks) to calculate L matrix-vector pairs $C_l$, $d_l$ ($l = 1, \ldots, L$).

In step S5, the conversion unit 406 in the subsequent stage applies equation (2) to each of the K vectors $\delta_k$ to convert them into K·L latent representations $Z = \{z_{1,1}, \ldots, z_{K,L}\}$.

$$\begin{aligned} z_{k,1} &= C_1 \delta_k + d_1 \\ &\;\;\vdots \\ z_{k,L} &= C_L \delta_k + d_L \end{aligned} \tag{2}$$
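Steps S1 to S5 can be sketched in a few lines of NumPy. In the device, the pairs $A_k$, $b_k$ and $C_l$, $d_l$ are produced by the trained functions φ and ψ; here they are simply passed in as arrays, and the function names and the latent dimension n are assumptions for this illustration.

```python
import numpy as np

def sample_eps(K: int, n: int, rng: np.random.Generator) -> np.ndarray:
    """Step S1: draw K vectors of dimension n from N(0, I)."""
    return rng.standard_normal((K, n))

def transform_latents(eps, A, b, C, d):
    """Steps S3 and S5: apply equations (1) and (2).

    eps: (K, n), A: (K, n, n), b: (K, n)  -> delta_k = A_k eps_k + b_k
    C: (L, n, n), d: (L, n)               -> z_{k,l} = C_l delta_k + d_l
    Returns z with shape (K, L, n).
    """
    delta = np.einsum("kij,kj->ki", A, eps) + b       # equation (1)
    z = np.einsum("lij,kj->kli", C, delta) + d[None]  # equation (2)
    return z
```

Every one of the K intermediate vectors is paired with every one of the L destination-dependent transforms, which is why K samples yield K·L latent representations.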

The nonlinear functions φ of step S2 and ψ of step S4 are trained jointly on a dataset containing pairs of a person's motion x and a destination point y. Each motion x in the dataset is split at an intermediate frame and treated as an "observed motion" and the ground-truth "subsequent motion". The training method may closely follow that of Non-Patent Document 1. Specifically, the objective function of equation (3) is minimized using an optimization method such as stochastic gradient descent.

$$L_{Recon} + L_{Div} + L_{KL} \tag{3}$$

Here, $L_{Recon}$ is the reconstruction error and $L_{Div}$ is an energy function that encourages diversity in the predictions. Letting the K·L motion predictions obtained by inputting the latent representations z into the motion prediction unit 50 (described later) be given by equation (4) and the ground truth by equation (5), $L_{Recon}$ and $L_{Div}$ are obtained by equations (6) and (7), where λ is set to a suitable constant.
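Equations (4) to (7) are not reproduced in this text. As a rough illustration only, the sketch below assumes the forms used in DLow (Non-Patent Document 1), which the passage says the training closely follows: a reconstruction term that takes the smallest squared error over all predictions, and a diversity term that averages exp(-squared distance / scale) over all distinct pairs of predictions. Treating λ as that scale is an assumption, as are the function names; the actual equations may differ.

```python
import numpy as np

def recon_loss(preds: np.ndarray, target: np.ndarray) -> float:
    """Smallest squared error between any of the N predictions and the ground truth.

    preds: (N, T_future, 3J), target: (T_future, 3J). Assumed DLow-style form.
    """
    errs = ((preds - target[None]) ** 2).sum(axis=(1, 2))
    return float(errs.min())

def diversity_loss(preds: np.ndarray, lam: float = 1.0) -> float:
    """Mean of exp(-squared distance / lam) over distinct prediction pairs (N >= 2).

    The value is near 1 when predictions coincide and near 0 when they are far apart.
    """
    n = preds.shape[0]
    flat = preds.reshape(n, -1)
    sq = ((flat[:, None, :] - flat[None, :, :]) ** 2).sum(axis=-1)  # (N, N)
    mask = ~np.eye(n, dtype=bool)  # drop the zero self-distances
    return float(np.exp(-sq[mask] / lam).mean())
```

Under these assumed forms, minimizing the first term only requires that at least one prediction match the ground truth, while minimizing the second pushes the predictions apart.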

$L_{KL}$ is the average Kullback-Leibler divergence between the probability distributions of the K·L latent representations z and $N(0, I)$. The probability distribution of the latent representation $z_{k,l}$ can be expressed by equation (8) as the distribution obtained by applying to $N(0, I)$ the affine transformation with $A_k$, $b_k$ followed by the affine transformation with $C_l$, $d_l$.

Letting $D_{k,l}$ be the Kullback-Leibler divergence between the distribution of equation (8) and $N(0, I)$, $L_{KL}$ is obtained by equation (9). Because this training uses the output of the motion prediction unit 50, the motion prediction unit 50 must be trained first.
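Equations (8) and (9) are not reproduced in this text. The following is a reconstruction from the definitions given above, not the patent's own typesetting. It assumes that the divergence is taken from the distribution of $z_{k,l}$ to $N(0, I)$, that n denotes the dimension of the latent vectors, and that $\Sigma_{k,l}$ is non-singular.

```latex
% Applying z_{k,l} = C_l (A_k \varepsilon + b_k) + d_l to \varepsilon \sim N(0, I):
z_{k,l} \sim N\left(\mu_{k,l}, \Sigma_{k,l}\right), \qquad
\mu_{k,l} = C_l b_k + d_l, \qquad
\Sigma_{k,l} = C_l A_k A_k^{\top} C_l^{\top}

% Closed-form KL divergence between a Gaussian and the standard normal:
D_{k,l} = \frac{1}{2}\left( \operatorname{tr}\Sigma_{k,l}
        + \mu_{k,l}^{\top}\mu_{k,l} - n - \ln\det\Sigma_{k,l} \right)

% Average over all K \cdot L latent representations:
L_{KL} = \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} D_{k,l}
```

The mean and covariance follow directly from composing the two affine maps, and the expression for $D_{k,l}$ is the standard closed form for Gaussians.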

Steps S2 and S3 may be exchanged with steps S4 and S5. In that case the symbols in the above equations are exchanged accordingly; for example, L vectors are sampled in step S1.

Steps S2 and S3 may also be performed together with steps S4 and S5: the person's motion x and the destination point y are encoded with K·L complex nonlinear functions (for example, pre-trained neural networks) to calculate K·L matrix-vector pairs A, b, and the latent representations z are obtained with a single transformation of the vectors ε.

The resulting latent representations z can be regarded as values sampled from a probability distribution conditioned on the person's motion x and the object position y. A total of K·L latent representations z are sent to the motion prediction unit 50 in the subsequent stage, each as a latent representation of the subsequent motion.

The motion prediction unit 50 predicts multiple plausible motions of the person based on the person's motion x up to the present, the destination point y, and the latent representations z. For example, the person's motion x, the destination point y, and the K·L latent representations z are input to a complex nonlinear function ρ (for example, a pre-trained neural network), which outputs K·L predictions x' of the subsequent motion. FIG. 4 shows an example of the model when the complex nonlinear function is a neural network.

The GRU 501 weights the person's past motion x up to the present according to each time step. The MLP 502 encodes the destination point y with the L complex nonlinear functions ψ to calculate L matrix-vector pairs $C_l$, $d_l$ ($l = 1, \ldots, L$). The combination module 503 combines the outputs of the GRU 501 and the MLP 502 with the latent representation z and outputs the result to the GRU 504 in the subsequent stage.
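The source does not say how the combination module 503 merges its inputs. A common choice, assumed here purely for illustration, is to concatenate the motion encoding, the destination encoding, and each latent representation into one conditioning vector per prediction. The function name and the flat vector shapes are likewise assumptions.

```python
import numpy as np

def combine_inputs(h_x: np.ndarray, e_y: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Concatenate shared encodings with each latent representation.

    h_x: (H,) encoding of the past motion (stand-in for the GRU 501 output)
    e_y: (E,) encoding of the destination point (stand-in for the MLP 502 output)
    z:   (N, n) latent representations, N = K * L
    Returns an (N, H + E + n) array with one conditioning vector per prediction.
    """
    context = np.concatenate([h_x, e_y])
    # Repeat the shared context once for every latent representation.
    tiled = np.tile(context, (z.shape[0], 1))
    return np.concatenate([tiled, z], axis=1)
```

Each row would then condition one decoding pass of the downstream network, so the K·L latent representations give K·L predicted motions.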

The GRU 504 infers the motion prediction x' from x, y, and z with the nonlinear function ρ. The nonlinear function ρ is trained on a dataset containing pairs of a person's motion x and a destination point y. A method similar to that of the CVAE, the deep generative model on which Non-Patent Document 1 is based, may be used, with information on the target variable taken into account when generating the latent variables so that the ideal state is estimated.

For the GRU 504, an objective function with two terms is considered, namely the reconstruction error of the prediction x' and the Kullback-Leibler divergence between the approximate probability distribution of the latent representation z and $N(0, I)$, and it is minimized using an optimization method such as stochastic gradient descent.

In the above embodiment, both the expression extraction unit 40 and the motion prediction unit 50 take the destination point y into account, but only one of them may do so. Specifically, either the processing of steps S4 and S5 performed by the expression extraction unit 40 or the input of the destination point y to the motion prediction unit 50 may be omitted.

According to the above embodiment, the continuation of a person's motion can be accurately predicted from video of that person in motion, which makes it possible to provide an inexpensive and convenient motion prediction system to many people across geographic and economic disparities. As a result, the embodiment can contribute to Goal 9, "Build resilient infrastructure, promote inclusive and sustainable industrialization", and Goal 11, "Make cities inclusive, safe, resilient and sustainable", of the United Nations-led Sustainable Development Goals (SDGs).

1...motion prediction device, 10...video acquisition unit, 20...motion analysis unit, 30...destination point estimation unit, 40...expression extraction unit, 50...motion prediction unit, 401, 501, 504...GRU, 402, 403, 404, 502...MLP, 405, 406...conversion unit, 503...combination module

Claims

1. A motion prediction device for predicting a subsequent motion of a person based on a video of the person performing the motion, comprising:
means for analyzing the video of the person in motion, estimating each posture of the person up to the present, and outputting a sequence of the estimated postures as a motion analysis result;
means for estimating a destination point of the person's motion based on the motion analysis result;
means for extracting a latent representation of the subsequent motion of the person based on the motion analysis result and the destination point; and
means for predicting the subsequent motion of the person based on the motion analysis result, the destination point, and the latent representation.

2. The motion prediction device according to claim 1, wherein the means for outputting the motion analysis result analyzes the video of the person in motion, performs detection, tracking, and posture estimation of the person, and outputs the sequence of the person's postures up to the present.

3. The motion prediction device according to claim 1, further comprising means for acquiring a manually input destination point instead of the means for estimating the destination point.

4. The motion prediction device according to claim 1, wherein the means for estimating the destination point comprises:
means for detecting an object from the video; and
means for estimating a direction of movement of the person based on the video,
and estimates the position of the object detected in the direction of movement as the destination point.

5. The motion prediction device according to claim 4, wherein the means for estimating the destination point is notified of a category of objects in advance, and the means for detecting an object detects objects belonging to the category.

6. The motion prediction device according to claim 4, wherein the means for outputting the motion analysis result identifies the person's action and outputs an action label as the identification result, and the means for estimating the destination point detects an object corresponding to the action label.

7. A motion prediction method in which a computer predicts a subsequent motion of a person based on a video of the person performing the motion, comprising:
analyzing the video of the person in motion, estimating each posture of the person up to the present, and outputting a sequence of the estimated postures as a motion analysis result;
estimating a destination point of the person's motion based on the motion analysis result;
extracting a latent representation of the subsequent motion of the person based on the motion analysis result and the destination point; and
predicting the subsequent motion of the person based on the motion analysis result, the destination point, and the latent representation.

8. A motion prediction program for predicting a subsequent motion of a person based on a video of the person performing a motion, comprising:
a step of analyzing the video of the person in motion, estimating each posture of the person up to the present, and outputting a sequence of the estimated postures as a motion analysis result;
a step of estimating a destination point of the person's motion based on the motion analysis result;
a step of extracting a latent representation of the subsequent motion of the person based on the motion analysis result and the destination point; and
a step of predicting the subsequent motion of the person based on the motion analysis result, the destination point, and the latent representation.