JP7659158B2 - Motion recognition system, motion recognition method, and motion recognition program

Info

Publication number: JP7659158B2
Application number: JP2020188923A
Authority: JP
Inventors: 利生遠藤; 琢磨山本; 裕也大日方
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2020-11-12
Filing date: 2020-11-12
Publication date: 2025-04-09
Anticipated expiration: 2040-11-12
Also published as: JP2022077870A

Description

The present invention relates to a motion recognition system, a motion recognition method, and a motion recognition program.

Technology that automatically recognizes specific human motions from images is expected to find applications in a variety of fields. For example, cameras could be installed in a store that sells products, and customers' motions could be identified automatically from the captured images. Possible applications include unmanned stores, realized by detecting customers' purchasing actions and settling payment automatically; detection of fraudulent or dangerous customer behavior; and optimization of product displays and prices through analysis of customer behavior.

Regarding automatic motion recognition, the following image search device, for example, has been proposed. This device recognizes posture information of a search target from an input image, extracts feature amounts from the posture information and the input image, and stores them in a database in association with the input image. It then generates a search query from posture information specified by the user and searches the database for images containing similar postures according to the query.

There is also technology that recognizes motions automatically by learning from images. For example, a method has been proposed that trains on training images in which a large number of positions and types of motions to be detected are specified, and then uses the training result to estimate human motions and their types in unknown test images.

JP 2019-91138 A

Chunhui Gu and 11 others, "AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018

When determining whether a specific motion appears in an image subject to recognition, based on the similarity between that image and a query image showing the specific motion, feature amounts are compared, for example, between the feature points of the two images. However, the structure of the human body is complex, and even when a person performs one specific motion, the body as a whole can take a variety of postures. Consequently, there is a problem in that the recognition accuracy for the motion cannot be improved unless a large number of query images are used in which the specific motion is performed in various postures.

In one aspect, an object of the present invention is to provide a motion recognition system, a motion recognition method, and a motion recognition program capable of highly accurate motion recognition using a small number of query images.

In one proposal, a motion recognition system having an input receiving unit and a determination unit is provided. In this motion recognition system, the input receiving unit plays back and displays a first video in which a person performing a specific motion is captured, and receives, during playback, an instruction input that indicates a position related to the specific motion on a frame of the displayed first video. The determination unit calculates a similarity between the first video and a second video based on a comparison result of feature amounts between a second feature point group, which includes one or more feature points included in the second video, and a first feature point group, which includes two or more feature points included in the first video, and determines whether the second video includes the specific motion based on the calculated similarity. When calculating the similarity, the determination unit assigns a larger weight to the feature amount comparison result for a feature point in the first feature point group the closer that feature point is to the position indicated by the instruction input.

Also, in one proposal, a motion recognition method is provided in which a computer executes processing that plays back and displays a first video in which a person performing a specific motion is captured, receives, during playback, an instruction input that indicates a position related to the specific motion on a frame of the displayed first video, calculates a similarity between the first video and a second video based on a comparison result of feature amounts between a second feature point group, which includes one or more feature points included in the second video, and a first feature point group, which includes two or more feature points included in the first video, and determines whether the second video includes the specific motion based on the calculated similarity. In this motion recognition method, when the similarity is calculated, a larger weight is assigned to the feature amount comparison result for a feature point in the first feature point group the closer that feature point is to the position indicated by the instruction input.

Furthermore, in one proposal, a motion recognition program is provided that causes a computer to execute processing similar to that of the motion recognition method described above.

In one aspect, highly accurate motion recognition can be performed using a small number of query images.

A diagram showing a configuration example and a processing example of a motion recognition system according to a first embodiment.
A diagram showing a configuration example of an information processing system according to a second embodiment.
A diagram showing a hardware configuration example of an information processing device.
A block diagram showing a configuration example of the processing functions of the information processing device.
An example of a flowchart showing the overall procedure of motion detection and analysis processing performed by the information processing device.
A diagram for explaining motion feature amounts.
An example of a flowchart showing processing for calculating motion feature amounts from a sample video.
A diagram showing a first example of motion feature amounts.
A diagram showing a third example of motion feature amounts.
A diagram showing an example of a time-space instruction input.
A diagram for explaining attention level calculation processing.
An example of a flowchart showing attention level calculation processing according to a time-space instruction input.
An example of a flowchart showing processing for calculating motion feature amounts from a detection target video.
A diagram for explaining similarity search processing between a sample video and a detection target video.
A diagram for explaining similarity calculation between an input section video and a partial video.
An example of a flowchart showing motion detection processing based on similarity.
A diagram showing an example of an analysis table.
A diagram showing a configuration example of the processing functions of an information processing device according to a third embodiment.
An example of a flowchart showing the overall procedure of motion detection and analysis processing in the third embodiment.
An example of a flowchart showing the procedure of learning and re-determination.
A diagram showing a configuration example of an information processing system according to a fourth embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First Embodiment
FIG. 1 is a diagram showing a configuration example and a processing example of a motion recognition system according to a first embodiment. The motion recognition system 1 shown in FIG. 1 executes motion recognition processing that determines whether a person performing a specific motion appears in a video, and is realized, for example, as a computer. The motion recognition system 1 has an input receiving unit 2 and a determination unit 3. The processing of the input receiving unit 2 and the determination unit 3 is realized, for example, by a processor executing a predetermined program. Note that the input receiving unit 2 and the determination unit 3 may be implemented in separate devices.

The input receiving unit 2 plays back and displays a first video in which a person performing a specific motion is captured. This first video is used as the query image in the motion recognition processing. The input receiving unit 2 receives an instruction input that indicates a position related to the specific motion on a frame of the displayed first video.

In FIG. 1, the first video includes frames 10a, 10b, ... from the beginning. In frame 10a, an instruction input indicating an indicated position 11a is made, for example with a mouse cursor, and in frame 10b, an instruction input indicating an indicated position 11b is made. In this way, the instruction input may be made continuously across multiple frames of the first video. The indicated position can also be changed while the frames are displayed one after another, in which case the position indicated by the instruction input is recorded for each frame.

As described above, the instruction input is made so as to indicate a position related to the specific motion to be recognized. For example, the instruction input indicates the vicinity of the body part, among the body parts of the person appearing in the first video, that is most closely related to the specific motion. In FIG. 1, the specific motion to be detected is a person picking up a product (clothing). In such a motion, the movement near the tip of the hand is what distinguishes it from other motions. The operator therefore watches the first video as it is played back and makes the instruction input so as to indicate the vicinity of the tip of the hand of the person on the screen.

The determination unit 3 compares feature amounts between a group of feature points included in a second video, which is the target in which the specific motion is to be detected, and a group of feature points included in the first video, and calculates the similarity between the first video and the second video based on the comparison result. The determination unit 3 determines whether the second video includes the specific motion based on the calculated similarity. Note that in FIG. 1, the second video includes frames 20a, 20b, ... from the beginning.

For example, assume that body parts of a person are extracted as feature points. In the example of FIG. 1, the position of the palm and the position of the ankle are extracted as feature points from each frame of the first video. For example, a feature point 12a corresponding to the palm and a feature point 12b corresponding to the ankle are extracted from frame 10a of the first video. In this case, the palm and ankle positions are likewise extracted as feature points from each frame of the second video. For example, a feature point 22a corresponding to the palm and a feature point 22b corresponding to the ankle are extracted from frame 20a of the second video.

When the similarity between frame 10a and frame 20a is calculated, feature amounts are compared between feature point 12a and feature point 22a, and between feature point 12b and feature point 22b. The similarity between frame 10a and frame 20a is then calculated based on the results of these comparisons.

Finally, a similarity is calculated for every combination of a frame of the first video and a frame of the second video, and these similarities are summed to obtain the similarity between the first video and the second video. The determination unit 3 determines whether the second video includes the specific motion, for example, by comparing the similarity obtained in this way with a predetermined threshold.
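The frame-pair summation and threshold test described above can be sketched as follows. This is a minimal, unweighted illustration only: the per-point similarity `1 / (1 + Euclidean distance)`, the pairing of feature points by list index, and the names `frame_similarity`, `video_similarity`, and `contains_motion` are all assumptions made for the example, since the text does not fix a concrete comparison function.

```python
import math
from typing import Sequence

# A frame is a list of feature vectors, one per feature point (e.g. palm, ankle).
Frame = Sequence[Sequence[float]]


def frame_similarity(query_frame: Frame, target_frame: Frame) -> float:
    """Sum per-feature-point similarities; points are paired by index (assumption)."""
    total = 0.0
    for q_feat, t_feat in zip(query_frame, target_frame):
        # Hypothetical comparison: identical features give 1.0, distant ones approach 0.
        total += 1.0 / (1.0 + math.dist(q_feat, t_feat))
    return total


def video_similarity(query_video: Sequence[Frame], target_video: Sequence[Frame]) -> float:
    """Sum the frame similarity over every (query frame, target frame) combination."""
    return sum(
        frame_similarity(q_frame, t_frame)
        for q_frame in query_video
        for t_frame in target_video
    )


def contains_motion(query_video, target_video, threshold: float) -> bool:
    """Judge the motion as present when the summed similarity reaches the threshold."""
    return video_similarity(query_video, target_video) >= threshold
```

Because the sum runs over all frame combinations, its magnitude grows with the lengths of both videos, so a threshold chosen for this sketch would have to be scaled to the video lengths being compared.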

However, when calculating this similarity, the determination unit 3 assigns a larger weight to the feature amount comparison result for a feature point included in the first video the closer that feature point is to the position indicated by the instruction input. As a result, the closer a feature point is to the indicated position, the more its feature amount contributes to the similarity calculation.

For example, in FIG. 1, feature point 12a on frame 10a is close to the indicated position 11a, so a large weight is assigned to the comparison result of the feature amounts between feature point 12a and feature point 22a. In contrast, feature point 12b on frame 10a is far from the indicated position 11a, so a small weight is assigned to the comparison result of the feature amounts between feature point 12b and feature point 22b.
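As a rough illustration of this weighting, the sketch below gives each feature point of the query frame a weight that decays with its distance from the indicated position. The Gaussian decay, the `sigma` parameter, the `1 / (1 + distance)` feature comparison, and the function names are assumptions for the example; the text only requires that closer feature points receive larger weights.

```python
import math
from typing import Sequence, Tuple

# Each feature point is (position_xy, feature_vector); points are paired by index.
Point = Tuple[Tuple[float, float], Sequence[float]]


def attention_weight(feature_xy, indicated_xy, sigma: float = 50.0) -> float:
    """Weight in (0, 1] that shrinks as the feature point moves away from the
    indicated position (Gaussian fall-off is an assumed choice)."""
    d = math.dist(feature_xy, indicated_xy)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def weighted_frame_similarity(
    query_points: Sequence[Point],
    target_points: Sequence[Point],
    indicated_xy: Tuple[float, float],
    sigma: float = 50.0,
) -> float:
    """Weighted sum of per-point feature similarities for one frame pair."""
    total = 0.0
    for (q_xy, q_feat), (_t_xy, t_feat) in zip(query_points, target_points):
        # The weight depends only on the query-side point and the indicated position.
        w = attention_weight(q_xy, indicated_xy, sigma)
        total += w / (1.0 + math.dist(q_feat, t_feat))
    return total
```

With the indicated position near the palm, a mismatch at the ankle changes the result far less than an equal mismatch at the palm, which mirrors the treatment of feature points 12a and 12b above.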

The structure of the human body is complex, and the way the body appears in a captured image varies widely. Even when a person performs the specific motion to be detected, the appearance of the body as a whole is diverse. For example, when a person picks up a product, the body parts near the hand move in a similar way each time, whereas the movement of the feet may differ greatly depending on the person and the situation. Therefore, when the motion of picking up a product is to be detected from an image, information on feature points near the hand is highly useful for detecting the motion, while information on feature points near the feet is less useful.

With the processing of the determination unit 3 described above, the closer a feature point is to the position indicated by the instruction input, the more its feature amount contributes to the similarity calculation. Therefore, simply by preparing at least one first video as a query image, the specific motion can be detected with high accuracy even when the body movements that accompany the motion are diverse. In other words, according to the first embodiment, highly accurate motion recognition can be performed using a small number of query images.

Note that the instruction input for the first video can be made, for example, by a simple operation using a mouse. For example, a mouse drag operation allows the instruction input to be made continuously across multiple frames of the first video.

Second Embodiment
Next, as a second embodiment, a system including an information processing device that has the motion recognition processing function of the motion recognition system 1 in FIG. 1 will be described.

FIG. 2 is a diagram showing a configuration example of an information processing system according to the second embodiment. The information processing system shown in FIG. 2 includes an information processing device 100 and a camera 200 connected to the information processing device 100.

The information processing device 100 is realized, for example, as a computer such as a personal computer or a server device. Based on a video captured by the camera 200, the information processing device 100 detects that a person appearing in the video has performed a specific motion. The camera 200 is placed at a predetermined position where a person may be captured, and transmits the data of the captured video to the information processing device 100.

As shown in FIG. 2, in this embodiment, as an example, the camera 200 is placed in a store 210 where a product 211 is sold, and captures the motions of a person 212 (a customer) in front of the product 211. Based on the captured video, the information processing device 100 detects that the person 212 performs a specific motion relating to the product 211. For example, the information processing device 100 detects the motion of the person 212 picking up the product 211 or the motion of the person 212 checking the price of the product 211. Furthermore, the information processing device 100 can use the detection results for the motions of the person 212 over a predetermined period to perform various analyses, such as customer behavior analysis.

FIG. 3 is a diagram showing a hardware configuration example of the information processing device. The information processing device 100 is realized, for example, as a computer such as the one shown in FIG. 3. The information processing device 100 shown in FIG. 3 has a processor 101, a RAM (Random Access Memory) 102, an HDD (Hard Disk Drive) 103, a graphic interface (I/F) 104, an input interface (I/F) 105, a reading device 106, a communication interface (I/F) 107, and a network interface (I/F) 108.

The processor 101 controls the information processing device 100 as a whole. The processor 101 is, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), or a PLD (Programmable Logic Device). The processor 101 may also be a combination of two or more of a CPU, an MPU, a DSP, an ASIC, and a PLD.

The RAM 102 is used as the main storage device of the information processing device 100. The RAM 102 temporarily stores at least part of the OS (Operating System) program and the application programs to be executed by the processor 101. The RAM 102 also stores various data necessary for processing by the processor 101.

The HDD 103 is used as an auxiliary storage device of the information processing device 100. The HDD 103 stores the OS program, application programs, and various data. Note that another type of non-volatile storage device, such as an SSD (Solid State Drive), can also be used as the auxiliary storage device.

A display device 104a is connected to the graphic interface 104. The graphic interface 104 displays images on the display device 104a in accordance with instructions from the processor 101. Examples of the display device include a liquid crystal display and an organic EL (ElectroLuminescence) display.

An input device 105a is connected to the input interface 105. The input interface 105 transmits signals output from the input device 105a to the processor 101. Examples of the input device 105a include a keyboard and a pointing device. Examples of the pointing device include a mouse, a touch panel, a tablet, a touch pad, and a trackball.

A portable recording medium 106a can be attached to and detached from the reading device 106. The reading device 106 reads data recorded on the portable recording medium 106a and transmits it to the processor 101. Examples of the portable recording medium 106a include an optical disc and a semiconductor memory.

The communication interface 107 transmits and receives data to and from the camera 200. The network interface 108 transmits and receives data to and from other devices via a network 108a.

The processing functions of the information processing device 100 can be realized by the hardware configuration described above.
The information processing device 100 detects a specific motion from a video by the following procedure. Prior to motion detection, the information processing device 100 acquires from the camera 200 a video (a sample video) showing a person actually performing the specific motion to be detected, and detects from this sample video motion feature amounts that indicate the characteristics of the motion. The information processing device 100 then acquires a video that is the target of motion detection (a detection target video) and detects motion feature amounts from this detection target video. By comparing the motion feature amounts between the sample video and the detection target video, the information processing device 100 detects, from the detection target video, the video of a section in which a person performing the specific motion appears.
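One way to picture the final step, finding the section of the detection target video that matches the sample video, is a sliding-window search. The sketch below is only an illustration under stated assumptions: each frame is reduced to a single feature vector, windows have the same length as the sample, frames are compared in aligned order, and the similarity `1 / (1 + distance)` and the names `window_similarity` and `detect_sections` are hypothetical. The procedure actually used by the information processing device 100 is not specified in this passage.

```python
import math
from typing import List, Sequence, Tuple

FrameFeature = Sequence[float]  # one feature vector per frame (simplifying assumption)


def window_similarity(sample: Sequence[FrameFeature], window: Sequence[FrameFeature]) -> float:
    """Mean similarity of frames compared in aligned order (assumed scheme)."""
    return sum(1.0 / (1.0 + math.dist(s, w)) for s, w in zip(sample, window)) / len(sample)


def detect_sections(
    sample: Sequence[FrameFeature],
    target: Sequence[FrameFeature],
    threshold: float,
) -> List[Tuple[int, int, float]]:
    """Return (start_frame, end_frame, score) for each window scoring at or above threshold."""
    if not sample:
        raise ValueError("sample video must contain at least one frame")
    n = len(sample)
    hits = []
    # Slide a window of the sample's length over the detection target video.
    for start in range(len(target) - n + 1):
        score = window_similarity(sample, target[start:start + n])
        if score >= threshold:
            hits.append((start, start + n - 1, score))
    return hits
```

For example, `detect_sections([(0.0,), (1.0,)], [(9.0,), (0.0,), (1.0,), (9.0,)], 0.9)` reports only the window covering frames 1 to 2, where the target matches the sample exactly.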

FIG. 4 is a block diagram showing a configuration example of the processing functions of the information processing device. As shown in FIG. 4, the information processing device 100 includes a storage unit 110, a video input unit 111, a feature amount calculation unit 112, a time-space batch input unit 113, an attention calculation unit 114, a video search unit 115, a similarity calculation unit 116, and an analysis unit 117.

The storage unit 110 is realized by a storage area of a storage device included in the information processing device 100, such as the RAM 102 or the HDD 103. The storage unit 110 stores sample video data 121 and detection target video data 131. The storage unit 110 also stores motion feature amount data 122 and attention level data 123 as data corresponding to the sample video data 121. In addition, the storage unit 110 stores motion feature amount data 132 as data corresponding to the detection target video data 131.

The processing of the video input unit 111, the feature amount calculation unit 112, the time-space batch input unit 113, the attention calculation unit 114, the video search unit 115, the similarity calculation unit 116, and the analysis unit 117 is realized, for example, by the processor 101 executing a predetermined application program.

Data of videos captured by the camera 200 is input to the video input unit 111. The videos that are input are the sample video and the detection target video described above. The feature amount calculation unit 112 calculates motion feature amounts from the video input to the video input unit 111. The motion feature amounts are calculated as data for each feature point extracted from each frame of the video.

When a sample video is input to the video input unit 111, the feature amount calculation unit 112 calculates motion feature amounts from the sample video and stores them in the storage unit 110 as motion feature amount data 122. At the same time, the feature amount calculation unit 112 stores the data of the sample video in the storage unit 110 as sample video data 121. When a detection target video is input to the video input unit 111, the feature amount calculation unit 112 calculates motion feature amounts from the detection target video and stores them in the storage unit 110 as motion feature amount data 132. At the same time, the feature amount calculation unit 112 stores the data of the detection target video in the storage unit 110 as detection target video data 131.

The time-space batch input unit 113 plays back the sample video input to the video input unit 111 and displays it on the display device. During playback of the sample video, the time-space batch input unit 113 receives an instruction input from the operator. This instruction input is made, while the person appearing in the sample video is performing the motion to be detected, so as to point to the position of the body part that moves characteristically in that motion. The instruction input continues while the motion is being performed and ends when the motion ends. In this way, the range of the section of the sample video in which the specific motion is performed (that is, temporal information) and the position of the body part that moves characteristically (that is, spatial information) are specified together. Note that the instruction input is made, for example, by a mouse drag operation.

Hereinafter, the operator's instruction input received by the time-space batch input unit 113 is referred to as the "time-space instruction input", and the information specified by that instruction input is referred to as the "time-space instruction information". The time-space instruction information includes the start time and end time of the time-space instruction input in the sample video, and the indicated position in each frame from the start time to the end time.
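The contents of the time-space instruction information listed above map naturally onto a small record type. The sketch below is one possible representation, not the format used by the device: the class name `TimeSpaceInstruction`, its field names, and the idea of calling `record` once per displayed frame while the mouse button is held are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TimeSpaceInstruction:
    """Start time, end time, and per-frame indicated positions of one instruction input."""
    start_time: Optional[float] = None  # playback time (s) when the input began
    end_time: Optional[float] = None    # playback time (s) of the last recorded frame
    positions: Dict[int, Tuple[float, float]] = field(default_factory=dict)

    def record(self, frame_index: int, timestamp: float, xy: Tuple[float, float]) -> None:
        """Store the indicated position for one displayed frame during the drag."""
        if self.start_time is None:
            self.start_time = timestamp
        self.end_time = timestamp
        self.positions[frame_index] = xy


# Hypothetical usage: two frames recorded during a mouse drag.
info = TimeSpaceInstruction()
info.record(frame_index=10, timestamp=0.33, xy=(120.0, 80.0))
info.record(frame_index=11, timestamp=0.37, xy=(124.0, 82.0))
```

Keeping the positions keyed by frame index makes it straightforward to look up the indicated position when the weight for each feature point of that frame is computed.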

注目度算出部１１４は、見本動画像のフレームのうち、時空間指示入力が行われている区間の各フレームについて、指示された位置に基づいて特徴点ごとの注目度を算出する。注目度は、類似度算出部１１６によって見本動画像と検出対象動画像との間の類似度が計算される際に、動作特徴量に対して重み付けするために利用される。注目度算出部１１４は、算出された注目度を注目度データ１２３として記憶部１１０に格納する。 For each frame of the sample video in the section where the time-space instruction input is made, the attention calculation unit 114 calculates an attention level for each feature point based on the indicated position. The attention level is used to weight the motion feature amounts when the similarity calculation unit 116 calculates the similarity between the sample video and the detection target video. The attention calculation unit 114 stores the calculated attention level in the storage unit 110 as attention level data 123.

映像検索部１１５は、検出対象動画像データ１３１と見本動画像データ１２１とを比較して、検出対象動画像の中から、見本動画像のうち時空間指示入力が行われている区間の動画像と類似する動画像を検索する類似検索処理を制御する。この類似検索処理により、検出対象動画像の中から人物が所定の動作を行っている動画像が検索される。 The video search unit 115 controls a similarity search process that compares the detection target video data 131 with the sample video data 121 and searches the detection target video data for video similar to the video of the sample video data in the section in which the time-space instruction input is being performed. This similarity search process searches the detection target video data for video in which a person is performing a specified action.

類似度算出部１１６は、映像検索部１１５の制御の下で、検出対象動画像と見本動画像との間における類似度を動作特徴量に基づいて算出する。この算出において、注目度データ１２３が示す注目度が動作特徴量に対する重みとして利用される。これにより、時空間指示入力の位置に近い特徴点の動作特徴量ほど大きな重みが付与されて、類似度が計算される。類似度の算出結果は映像検索部１１５に通知され、映像検索部１１５は、通知された算出結果に基づいて、検出対象動画像から特定の動作が行われている動画像を抽出する。 Under the control of the video search unit 115, the similarity calculation unit 116 calculates the similarity between the detection target video and the sample video based on the motion feature amount. In this calculation, the attention level indicated by the attention level data 123 is used as a weight for the motion feature amount. As a result, the motion feature amount of a feature point closer to the position of the time-space instruction input is weighted more heavily to calculate the similarity. The similarity calculation result is notified to the video search unit 115, which extracts a video in which a specific motion is being performed from the detection target video based on the notified calculation result.
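The role of the attention level as a weight can be illustrated with a minimal sketch. The description does not give the similarity formula at this point, so the form used below (an attention-weighted average of a per-feature-point similarity exp(-d²)) is an assumption made only for illustration, and the function names `point_similarity` and `weighted_similarity` are hypothetical.

```python
import math


def point_similarity(a, b):
    """Similarity of two value-attribute vectors: 1.0 when identical,
    decaying with squared distance. Assumed form, not taken from the source."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2)


def weighted_similarity(sample_values, target_values, weights):
    """Attention-weighted average of per-feature-point similarities."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(
        w * point_similarity(a, b)
        for a, b, w in zip(sample_values, target_values, weights)
    )
    return weighted_sum / total_weight


# Two feature points: the first matches exactly, the second does not.
sample = [(0.0, 0.0), (5.0, 5.0)]
target = [(0.0, 0.0), (9.0, 9.0)]
near_first = weighted_similarity(sample, target, weights=[1.0, 0.0])
near_second = weighted_similarity(sample, target, weights=[0.0, 1.0])
```

With all the weight on the matching feature point the result is high, and with all the weight on the mismatched point it is close to zero, which is the effect the weighting is intended to have.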

分析部１１７は、映像検索部１１５の検索結果に基づいて様々な分析を行う。例えば、分析部１１７は、商品を手に取る動作や、商品の価格を確認する動作の検出回数に基づいて、その商品の魅力や割高感を評価する。 The analysis unit 117 performs various analyses based on the search results of the video search unit 115. For example, the analysis unit 117 evaluates the attractiveness and perceived price of a product based on the number of times the action of picking up the product or checking the price of the product is detected.

図５は、情報処理装置による動作検出・分析処理全体の処理手順を示すフローチャートの例である。
まず、ステップＳ１１，Ｓ１２において、動作検出に対する前処理が実行される。 FIG. 5 is an example of a flowchart showing the overall processing procedure of the motion detection and analysis process performed by the information processing device.
First, in steps S11 and S12, pre-processing for motion detection is performed.

［ステップＳ１１］カメラ２００により、検出対象とする特定の動作を行う人物が写った見本動画像が撮影され、撮影された見本動画像のデータが映像入力部１１１に入力される。見本動画像は、検出対象とする人物の動作を示す画像の見本を与えるものであり、後述する類似検索における検索クエリとして利用される。 [Step S11] A sample video of a person performing a specific action to be detected is captured by the camera 200, and data of the captured sample video is input to the video input unit 111. The sample video provides a sample image showing the action of the person to be detected, and is used as a search query in a similarity search, which will be described later.

特徴量算出部１１２は、見本動画像の各フレームから特徴点を抽出し、特徴点ごとに動作特徴量を算出する。見本動画像のデータは、見本動画像データ１２１として記憶部１１０に格納される。これとともに、算出された動作特徴量は、動作特徴量データ１２２として見本動画像データ１２１に対応付けて記憶部１１０に格納される。 The feature amount calculation unit 112 extracts feature points from each frame of the sample video and calculates a motion feature amount for each feature point. The data of the sample video is stored in the storage unit 110 as sample video data 121. At the same time, the calculated motion feature amount is stored in the storage unit 110 in association with the sample video data 121 as motion feature amount data 122.

［ステップＳ１２］時空間一括入力部１１３は、見本動画像データ１２１に基づき、見本動画像を再生表示する。時空間一括入力部１１３は、見本動画像の再生表示中に、オペレータからの時空間指示入力を受け付ける。これにより、見本動画像において特定の動作が行われている時間的な範囲と、特徴的な動きをする人物の部位の位置とが一括して指定される。 [Step S12] The time-space batch input unit 113 plays and displays the sample video based on the sample video data 121. The time-space batch input unit 113 accepts time-space instruction input from the operator while the sample video is being played and displayed. This allows the time range in which a specific action is performed in the sample video and the positions of the body parts of the person making the characteristic movement to be specified all at once.

注目度算出部１１４は、見本動画像のフレームのうち、時空間指示入力が行われている区間の各フレームについて、時空間指示入力における指示位置に基づいて特徴点ごとの注目度を算出する。算出された注目度は、注目度データ１２３として記憶部１１０に格納される。 The attention calculation unit 114 calculates the attention level for each feature point for each frame of the sample video image in a section where the time-space instruction input is being performed, based on the position of the instruction in the time-space instruction input. The calculated attention level is stored in the storage unit 110 as attention level data 123.

続いて、ステップＳ１３，１４において、検出対象の動画像からの動作検出が実行される。
［ステップＳ１３］カメラ２００により、特定の動作の検出対象となる検出対象動画像が撮影され、撮影された検出対象動画像のデータが映像入力部１１１に入力される。検出対象動画像としては、例えば、店舗内を１日撮影して得られた動画像が用いられる。 Subsequently, in steps S13 and S14, motion detection is performed from the moving image of the detection target.
[Step S13] The camera 200 captures a detection target video in which the specific action is to be detected, and data of the captured detection target video is input to the video input unit 111. As the detection target video, for example, a video obtained by filming the inside of a store for one day is used.

特徴量算出部１１２は、検出対象動画像の各フレームから特徴点を抽出し、特徴点ごとに動作特徴量を算出する。検出対象動画像のデータは、検出対象動画像データ１３１として記憶部１１０に格納される。これとともに、算出された動作特徴量は、動作特徴量データ１３２として検出対象動画像データ１３１に対応付けて記憶部１１０に格納される。 The feature calculation unit 112 extracts feature points from each frame of the detection target video and calculates a motion feature for each feature point. Data of the detection target video is stored in the storage unit 110 as detection target video data 131. At the same time, the calculated motion feature is stored in the storage unit 110 in association with the detection target video data 131 as motion feature data 132.

［ステップＳ１４］類似度算出部１１６は、検出対象動画像と、見本動画像のうち時空間指示入力が行われている区間の動画像との間で、動作特徴量に基づく類似度を算出する。この算出において、注目度データ１２３が示す注目度が動作特徴量に対する重みとして利用される。映像検索部１１５は、類似度の算出結果を用いた類似検索により、検出対象動画像から特定の動作が行われている動画像を検出する。 [Step S14] The similarity calculation unit 116 calculates a similarity between the detection target video and the video of the sample video in a section where a time-space instruction input is being performed, based on the motion feature amount. In this calculation, the attention level indicated by the attention level data 123 is used as a weight for the motion feature amount. The video search unit 115 performs a similarity search using the similarity calculation result to detect a video in which a specific motion is being performed from the detection target video.

［ステップＳ１５］分析部１１７は、映像検索部１１５による特定の動作の検出結果を分析する。例えば、分析部１１７は、商品を手に取る動作や、商品の価格を確認する動作の検出回数に基づいて、その商品の魅力や割高感を評価する。 [Step S15] The analysis unit 117 analyzes the results of the detection of specific actions by the video search unit 115. For example, the analysis unit 117 evaluates the attractiveness or perceived price of the product based on the number of times the action of picking up the product or the action of checking the price of the product is detected.

以下、図５に示した各処理の詳細について説明する。
＜見本動画像からの動作特徴量の算出＞
図６は、動作特徴量について説明するための図である。動作特徴量は、人物の動作を表す特徴量であり、特徴点ごとに算出される。近年の機械学習、特に深層学習（ディープラーニング）の進展により、画像からそこに写っている人物の存在領域（外接長方形）や、関節などの身体部位の位置の推定が実用的なレベルで可能となった。 Each process shown in FIG. 5 will be described in detail below.
<Calculation of motion features from sample video>
FIG. 6 is a diagram for explaining the motion feature amount. The motion feature amount is a feature amount that represents the motion of a person, and is calculated for each feature point. With recent progress in machine learning, particularly deep learning, it has become possible to estimate, at a practical level, the area in which a person appears in an image (circumscribed rectangle) and the positions of body parts such as joints.

そこで、本実施の形態では、特徴点として人物の身体部位が用いられる。例えば、目の両端、両肩、両手首、１０指の指先、両足先など、あらかじめ決められた複数の身体部位が、特徴点として利用される。また、特徴点には身体部位ごとにあらかじめ決められた特徴点番号が付与される。なお、人物の身体部位は、既知の骨格検出技術を用いて検出される。 Therefore, in this embodiment, body parts of a person are used as feature points. For example, a number of predetermined body parts, such as both ends of the eyes, both shoulders, both wrists, the tips of all ten fingers, and the tips of both feet, are used as feature points. Furthermore, a predetermined feature point number is assigned to each feature point. The body parts of a person are detected using known skeletal detection technology.

特徴点番号をｉとし、検出可能な特徴点数（身体部位数）をｎとすると、一連の動画像についての動作特徴量Ｐは、次の式（１）に示すような時刻ｔに関する関数として定義される。
Ｐ＝｛ｐ_i（ｔ）｝（ただし、ｉ＝１～ｎ）・・・（１）
すなわち、動作特徴量Ｐは、ｎ個の関数ｐ_i（ｔ）の列として定義される。図６では例として、ｎ＝２の場合（２か所の特徴点が検出される場合）に見本動画像から算出される動作特徴量を示している。図６では、見本動画像の先頭からのフレームの撮影時刻をｔ₁，ｔ₂，ｔ₃，・・・としている。また、動作特徴量データ１２２は、例えば、図６のように特徴点番号と時刻（フレーム識別情報）とを対応付けたテーブルとして作成される。 If the feature point number is i and the number of detectable feature points (number of body parts) is n, then the motion feature amount P for a series of moving images is defined as a function of time t as shown in the following equation (1).
P = {p_i(t)} (where i = 1 to n) ... (1)
That is, the motion feature amount P is defined as a sequence of n functions p_i(t). FIG. 6 shows, as an example, motion feature amounts calculated from a sample video when n = 2 (when two feature points are detected). In FIG. 6, the shooting times of the frames from the beginning of the sample video are t₁, t₂, t₃, .... Furthermore, the motion feature amount data 122 is created, for example, as a table indexed by feature point number and time (frame identification information), as shown in FIG. 6.

また、個々の動作特徴量は、空間属性と値属性を含むものとする。すなわち、上記の関数ｐ_i（ｔ）は、空間属性Ｘ（ｐ_i（ｔ））と値属性Ａ（ｐ_i（ｔ））を含む。これらのうち、空間属性は、フレームにおける特徴点（身体部位）の２次元座標を示す。また、カメラ２００から人物までの距離ｚも検出可能な場合には、空間属性は３次元座標を示してもよい。値属性については後述する。 Each motion feature amount includes a spatial attribute and a value attribute. That is, the above function p_i(t) includes a spatial attribute X(p_i(t)) and a value attribute A(p_i(t)). Of these, the spatial attribute indicates the two-dimensional coordinates of a feature point (body part) in a frame. Furthermore, if the distance z from the camera 200 to the person can also be detected, the spatial attribute may indicate three-dimensional coordinates. The value attribute will be described later.

なお、画像内に複数の人物が写っている場合、動作特徴量は人物ごとに算出される。例えば、動画像から検出された人物については出現順にラベル（番号など）が付与され、ラベルに対応付けて上記の動作特徴量Ｐが算出される。 When multiple people appear in an image, the motion feature is calculated for each person. For example, people detected from a video are given labels (e.g., numbers) in the order of appearance, and the motion feature P is calculated in association with the labels.
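As a minimal sketch of how the motion feature amount data might be held in memory, the following keeps one entry per person label, feature point number, and frame time index. The key layout, the class name `MotionFeature`, and the helper functions are assumptions for illustration and are not specified in the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionFeature:
    """p_i(t) for one feature point in one frame."""
    spatial: tuple  # spatial attribute X(p_i(t)), e.g. (u, v)
    value: tuple    # value attribute A(p_i(t))


def register(table, person, i, t, feature):
    """Store the feature of feature point i of a person at time index t."""
    table[(person, i, t)] = feature


def trajectory(table, person, i):
    """Return p_i(t) of one person as a list ordered by time index."""
    keys = sorted(k for k in table if k[0] == person and k[1] == i)
    return [table[k] for k in keys]


# Person 1, feature point 1 observed in three frames; feature point 2 in one.
table = {}
for t, (u, v) in enumerate([(10, 20), (12, 21), (15, 23)], start=1):
    register(table, person=1, i=1, t=t, feature=MotionFeature((u, v), (u, v)))
register(table, person=1, i=2, t=1, feature=MotionFeature((40, 50), (40, 50)))
```

Keying by person label keeps the features of several people in the same video separate, in line with the per-person calculation described above.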

図７は、見本動画像からの動作特徴量の算出処理を示すフローチャートの例である。この図７は、図５のステップＳ１１の処理に対応する。
［ステップＳ２１］検出対象とする特定の動作を行う人物が写った見本動画像が、カメラ２００から映像入力部１１１に入力され、特徴量算出部１１２に受け渡される。特徴量算出部１１２は、見本動画像のデータを見本動画像データ１２１として記憶部１１０に格納する。 FIG. 7 is an example of a flowchart showing the process for calculating motion feature amounts from a sample video. The process in FIG. 7 corresponds to step S11 in FIG. 5.
[Step S21] A sample video showing a person performing a specific action to be detected is input from the camera 200 to the video input unit 111 and passed to the feature calculation unit 112. The feature calculation unit 112 stores the data of the sample video in the storage unit 110 as sample video data 121.

［ステップＳ２２］特徴量算出部１１２は、見本動画像の先頭側からフレームを１つ選択する。
［ステップＳ２３］特徴量算出部１１２は、選択されたフレームから特徴点を抽出する。前述のように、特徴点としては人物上の身体部位の位置が抽出される。抽出された特徴点のそれぞれには、身体部位ごとにあらかじめ統一的に決められた特徴点番号が付与される。 [Step S22] The feature amount calculation unit 112 selects one frame from the beginning of the sample video.
[Step S23] The feature amount calculation unit 112 extracts feature points from the selected frame. As described above, the positions of body parts of the person are extracted as feature points. Each extracted feature point is assigned the feature point number that is uniformly determined in advance for its body part.

［ステップＳ２４］特徴量算出部１１２は、抽出された各特徴点について動作特徴量を算出する。動作特徴量としては、空間属性と値属性とが算出される。特徴量算出部１１２は、算出された動作特徴量を、記憶部１１０内の動作特徴量データ１２２に登録する。 [Step S24] The feature calculation unit 112 calculates a motion feature for each extracted feature point. The motion feature includes a spatial attribute and a value attribute. The feature calculation unit 112 registers the calculated motion feature in the motion feature data 122 in the storage unit 110.

［ステップＳ２５］特徴量算出部１１２は、見本動画像上の全フレームについて処理が完了したかを判定する。未処理のフレームが存在する場合、処理がステップＳ２２に進められ、先頭側から未処理のフレームが１つ選択される。一方、全フレームについて処理済みの場合、処理が終了する。 [Step S25] The feature calculation unit 112 determines whether processing has been completed for all frames in the sample video. If there are unprocessed frames, processing proceeds to step S22, where one unprocessed frame is selected from the top. On the other hand, if all frames have been processed, processing ends.
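The loop of steps S22 to S25 can be sketched as follows. The skeleton detector is passed in as a callable because the description only refers to known skeletal detection technology; `detect_keypoints` and the dictionary layout are hypothetical, and the value attribute is set equal to the coordinates purely as a placeholder.

```python
def compute_motion_features(frames, detect_keypoints):
    """Build a table of motion features for every frame of a video.

    frames: frames in playback order.
    detect_keypoints: hypothetical callable that returns
        {feature_point_number: (u, v)} for one frame.
    """
    table = {}
    for t, frame in enumerate(frames):        # S22: take frames from the beginning
        keypoints = detect_keypoints(frame)   # S23: extract feature points
        for i, (u, v) in keypoints.items():   # S24: compute and register features
            table[(i, t)] = {"spatial": (u, v), "value": (u, v)}
    return table                              # S25: done when no frame is left


# Stand-in "frames" that already contain their keypoints, for demonstration.
demo_frames = [{1: (0, 0), 2: (5, 5)}, {1: (1, 0)}]
demo_table = compute_motion_features(demo_frames, detect_keypoints=lambda f: f)
```

A feature point that is not detected in a frame simply has no entry for that frame in this sketch.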

［動作特徴量の第１の例］
図８は、動作特徴量の第１の例を示す図である。前述のように、動作特徴量に含まれる空間属性は、特徴点の座標を示す。また、動作特徴量に含まれる値属性も、空間属性と同様に特徴点の座標を示してもよい。この場合、時刻ｔのフレームにおけるｉ番目の特徴点の座標を（ｕ_i（ｔ），ｖ_i（ｔ））とすると、特徴点ごとの動作特徴量の関数ｐ_i（ｔ）は下記の式（２－１）で表され、空間属性、値属性はそれぞれ式（２－２），（２－３）で表される。なお、図８は、ｎ＝２の場合（２か所の特徴点が検出される場合）について例示している。
ｐ_i（ｔ）＝（ｕ_i（ｔ），ｖ_i（ｔ））（２－１）
Ｘ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ））・・・（２－２）
Ａ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ））・・・（２－３）
［動作特徴量の第２の例］
動作特徴量は、値属性として、特徴点の時間的な変化を示す情報を含むことができる。例えば、特徴量算出部１１２は、特徴量算出対象のフレームを中心とした前後の所定数のフレームを参照して、特徴点の速度ベクトル（ｕ_i’（ｔ），ｖ_i’（ｔ））を算出する。この場合、特徴点ごとの動作特徴量の関数ｐ_i（ｔ）は下記の式（３－１）で表され、空間属性、値属性はそれぞれ式（３－２），（３－３）で表される。
ｐ_i（ｔ）＝（ｕ_i（ｔ），ｖ_i（ｔ），ｕ_i’（ｔ），ｖ_i’（ｔ））（３－１）
Ｘ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ））・・・（３－２）
Ａ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ），ｕ_i’（ｔ），ｖ_i’（ｔ））・・・（３－３）
なお、値属性Ａ（ｐ_i（ｔ））は、ｕ_i’（ｔ），ｖ_i’（ｔ）のみ含み、ｕ_i（ｔ），ｖ_i（ｔ）を含まないようにしてもよい。 [First Example of Motion Feature Amount]
FIG. 8 is a diagram showing a first example of a motion feature amount. As described above, the spatial attribute included in the motion feature amount indicates the coordinates of a feature point. Like the spatial attribute, the value attribute included in the motion feature amount may also indicate the coordinates of the feature point. In this case, if the coordinates of the i-th feature point in the frame at time t are (u_i(t), v_i(t)), the function p_i(t) of the motion feature amount for each feature point is expressed by the following formula (2-1), and the spatial attribute and value attribute are expressed by formulas (2-2) and (2-3), respectively. Note that FIG. 8 illustrates an example in which n = 2 (two feature points are detected).
p_i(t) = (u_i(t), v_i(t)) ... (2-1)
X(p_i(t)) = (u_i(t), v_i(t)) ... (2-2)
A(p_i(t)) = (u_i(t), v_i(t)) ... (2-3)
[Second Example of Motion Feature Amount]
The motion feature amount may include, as a value attribute, information indicating the temporal change of the feature point. For example, the feature amount calculation unit 112 calculates the velocity vector (u_i'(t), v_i'(t)) of the feature point by referring to a predetermined number of frames before and after the frame for which the feature amount is to be calculated. In this case, the function p_i(t) of the motion feature amount for each feature point is expressed by the following formula (3-1), and the spatial attribute and value attribute are expressed by formulas (3-2) and (3-3), respectively.
p_i(t) = (u_i(t), v_i(t), u_i'(t), v_i'(t)) ... (3-1)
X(p_i(t)) = (u_i(t), v_i(t)) ... (3-2)
A(p_i(t)) = (u_i(t), v_i(t), u_i'(t), v_i'(t)) ... (3-3)
Note that the value attribute A(p_i(t)) may include only u_i'(t) and v_i'(t), without u_i(t) and v_i(t).
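One way to obtain the velocity vector of the second example is a finite difference over the frames surrounding the target frame. The sketch below assumes a symmetric window clipped at the ends of the track and expresses velocity in pixels per frame. The description only says that a predetermined number of preceding and following frames is referenced, so the exact scheme here is an assumption.

```python
def velocity(track, t, half_window=1):
    """Finite-difference velocity of one feature point at frame index t.

    track: list of (u, v) coordinates, one per frame.
    half_window: number of frames referenced on each side (assumed parameter).
    """
    lo = max(0, t - half_window)
    hi = min(len(track) - 1, t + half_window)
    if hi == lo:
        return (0.0, 0.0)  # a single-frame track has no measurable motion
    du = (track[hi][0] - track[lo][0]) / (hi - lo)
    dv = (track[hi][1] - track[lo][1]) / (hi - lo)
    return (du, dv)


def motion_feature(track, t, half_window=1):
    """Return the spatial attribute (3-2) and value attribute (3-3) at frame t."""
    u, v = track[t]
    du, dv = velocity(track, t, half_window)
    return {"spatial": (u, v), "value": (u, v, du, dv)}


track = [(0, 0), (2, 1), (4, 2), (6, 3)]
feature = motion_feature(track, 1)
```

For the track above, the feature point moves two pixels right and one pixel down per frame, so `feature["value"]` is `(2, 1, 2.0, 1.0)`.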

［動作特徴量の第３の例］
図９は、動作特徴量の第３の例を示す図である。動作特徴量の値属性として、深層学習（ディープラーニング）技術を用いた情報が算出されてもよい。例えば、所定数の種類（例えば１００種類）の動作を個別に含む画像フレームが訓練データとして用意される。各訓練データには、写っている動作を識別する動作ラベルが付与されている。 [Third Example of Motion Feature Amount]
FIG. 9 is a diagram showing a third example of the motion feature amount. Information obtained using deep learning technology may be calculated as the value attribute of the motion feature amount. For example, image frames, each showing one of a predetermined number of motion types (for example, 100 types), are prepared as training data. Each item of training data is given a motion label that identifies the motion shown.

また、図９の上側に示すように、学習のために、特徴点ごとに個別に畳み込みを行う数段の畳み込み層の配列１４１－１，・・・，１４１－ｎと、全特徴点で共通の全結合層１４２とを含むニューラルネットワークが用いられる。このニューラルネットワークでは、特徴点別の最終段の畳み込み層１４３－１，・・・，１４３－ｎの後段に、動作ラベルを出力する全結合層１４２が配置される。そして、このニューラルネットワークと上記の訓練データとを用い、動画像における注目フレームを中心とした前後の所定数のフレーム（例えば５フレーム）を入力としたときに注目フレームの動作ラベルが出力されるようにあらかじめ学習が行われる。 As shown in the upper part of Figure 9, a neural network is used for learning, which includes an arrangement of several convolutional layers 141-1, ..., 141-n that perform convolution for each feature point individually, and a fully connected layer 142 that is common to all feature points. In this neural network, a fully connected layer 142 that outputs action labels is placed after the final convolutional layers 143-1, ..., 143-n for each feature point. Then, using this neural network and the above training data, learning is performed in advance so that the action label of the target frame is output when a predetermined number of frames (e.g., five frames) before and after the target frame in the video are input.

このような学習が完了した後、図９の下側に示すように全結合層１４２が除去される。この状態では、最終段の畳み込み層１４３－１，・・・，１４３－ｎの出力は、対応する特徴点ごとに人物の動作を適切に表現することが期待される。換言すると、最終段の畳み込み層１４３－１，・・・，１４３－ｎからは、対応する特徴点について、人物の動作を判別するための適切な特徴量が出力される。そこで、特徴量算出部１１２は、時刻ｔのフレームを中心とした前後の所定数のフレームを学習後の配列１４１－１，・・・，１４１－ｎに入力する。その結果、配列１４１－１，・・・，１４１－ｎからそれぞれ特徴点ごとの値属性Ａ（ｐ₁（ｔ）），・・・，Ａ（ｐ_n（ｔ））が算出される。 After such learning is completed, the fully connected layer 142 is removed as shown in the lower part of FIG. 9. In this state, the outputs of the final convolutional layers 143-1, ..., 143-n are expected to appropriately express the motion of a person for each corresponding feature point. In other words, the final convolutional layers 143-1, ..., 143-n output, for the corresponding feature points, features suitable for discriminating the motion of a person. Therefore, the feature amount calculation unit 112 inputs a predetermined number of frames before and after, centered on the frame at time t, to the trained arrays 141-1, ..., 141-n. As a result, the value attributes A(p₁(t)), ..., A(p_n(t)) for the respective feature points are calculated from the arrays 141-1, ..., 141-n.

［動作特徴量の第４の例］
複数台のカメラで人物を異なる方法から同時に撮影することで、ステレオ視の原理により特徴点（身体部位）の３次元座標を求めることができる。この場合、特徴点ごとの動作特徴量の関数ｐ_i（ｔ）は下記の式（４－１）のように３次元の情報として表される。また、前述の「動作特徴量の第１の例」の方法を用いた場合、空間属性、値属性はそれぞれ式（４－２），（４－３）で表される。
ｐ_i（ｔ）＝（ｕ_i（ｔ），ｖ_i（ｔ），ｗ_i（ｔ））・・・（４－１）
Ｘ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ），ｗ_i（ｔ））・・・（４－２）
Ａ（ｐ_i（ｔ））＝（ｕ_i（ｔ），ｖ_i（ｔ），ｗ_i（ｔ））・・・（４－３）
なお、値属性は、前述の「動作特徴量の第２の例」または「動作特徴量の第３の例」の方法を用いて算出されてもよい。 [Fourth Example of Motion Feature Amount]
By simultaneously photographing a person from different angles using multiple cameras, the three-dimensional coordinates of feature points (body parts) can be obtained using the principle of stereoscopic vision. In this case, the function p_i(t) of the motion feature amount for each feature point is expressed as three-dimensional information as shown in the following formula (4-1). Furthermore, when the method described in the above "First Example of Motion Feature Amount" is used, the spatial attribute and value attribute are expressed by formulas (4-2) and (4-3), respectively.
p_i(t) = (u_i(t), v_i(t), w_i(t)) ... (4-1)
X(p_i(t)) = (u_i(t), v_i(t), w_i(t)) ... (4-2)
A(p_i(t)) = (u_i(t), v_i(t), w_i(t)) ... (4-3)
The value attribute may instead be calculated using the method of the "Second Example of Motion Feature Amount" or the "Third Example of Motion Feature Amount" described above.
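The stereo principle mentioned above can be illustrated for the simplest case of two rectified, horizontally aligned cameras. This configuration, the pinhole camera model, and all parameter names (`focal_px`, `baseline`, `cx`, `cy`) are assumptions for this sketch. The description does not specify the camera arrangement or the reconstruction method.

```python
def triangulate_rectified(u_left, u_right, v, focal_px, baseline, cx, cy):
    """Recover 3D coordinates of a feature point from a rectified stereo pair.

    u_left, u_right: horizontal pixel coordinate in the left and right image.
    v: vertical pixel coordinate (identical in both images after rectification).
    focal_px: focal length in pixels; baseline: distance between the cameras.
    cx, cy: principal point in pixels.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline / disparity   # depth from similar triangles
    x = (u_left - cx) * z / focal_px      # back-project through the left camera
    y = (v - cy) * z / focal_px
    return (x, y, z)


# A wrist seen 40 px apart by two cameras 0.1 m apart.
wrist_3d = triangulate_rectified(
    u_left=360, u_right=320, v=240, focal_px=800, baseline=0.1, cx=320, cy=240
)
```

The returned triple plays the role of (u_i(t), v_i(t), w_i(t)) in formula (4-1), expressed here in the units of `baseline`.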

＜時空間指示入力および注目度の算出＞
時空間一括入力部１１３は、見本動画像を再生表示して、オペレータからの時空間指示入力を受け付ける。この時空間指示入力の操作は、見本動画像から特定の動作を検出するための入力操作である。時空間指示入力は、見本動画像に写っている人物が検出対象の動作を行っているときに、その動作において特徴的な動きをする人物の部位の位置を指し示すように行われる。その動作が行われている間、指示入力が継続され、動作が終了すると指示入力が終了される。これにより、見本動画像において特定の動作が行われている時間的な範囲（開始時刻と終了時刻）と、特徴的な動きをする人物の部位の位置とが一括して指定される。 <Time-space instruction input and attention level calculation>
The time-space batch input unit 113 plays and displays the sample video and accepts time-space instruction input from the operator. The time-space instruction input operation is an input operation for picking out a specific motion in the sample video. While the person in the sample video is performing the motion to be detected, the operator points to the position of the body part that moves characteristically in that motion. The instruction input continues while the motion is being performed, and ends when the motion ends. In this way, the time range (start time and end time) in which the specific motion is performed in the sample video and the position of the characteristically moving body part are specified together.

時空間指示入力は、マウスを用いた簡単な操作によって行うことが可能である。例えば、次の図１０に示すように、時空間指示入力はマウスドラッグ操作によって行われる。
図１０は、時空間指示入力の例を示す図である。図１０では、商品を手に取る動作を検出する場合について例示する。このような動作では、手首や手のひらの動きが特徴的であると推定される。そこで、オペレータは、再生画像上の人物が商品を手に取る動作を行っているときに、その人物の手首や手のひらにできるだけ近い位置をマウスカーソル（ポインタ）１５１が指し示すように時空間指示入力を行う。 The time-space instruction can be input by a simple operation using a mouse. For example, as shown in FIG. 10, the time-space instruction can be input by dragging the mouse.
Fig. 10 is a diagram showing an example of a time-space instruction input. Fig. 10 illustrates an example of detecting an action of picking up a product. In such an action, it is estimated that the movement of the wrist and palm is characteristic. Therefore, when a person on the reproduced image is picking up a product, the operator performs a time-space instruction input so that the mouse cursor (pointer) 151 points to a position as close as possible to the wrist or palm of the person.

図１０の例では、見本動画像がフレームＦ₁～Ｆ₂₀₀の２００フレームを含むものとする。そして、その中のフレームＦ₃₁～Ｆ₁₈₀の１５０フレームにおいて商品を手に取る動作が行われているとする。この場合、オペレータは見本動画像を見ながら、フレームＦ₃₁において人物の手首や手のひらの近傍をマウスカーソル１５１で指し示しながらクリックボタンを押下して、ドラッグ操作を開始する。その後、オペレータは、上記動作が終了するまで、人物の手首や手のひらの近傍をマウスカーソル１５１で指し示しながらクリックボタンを押下し続けて、ドラッグ操作を継続する。そして、オペレータは、フレームＦ₁₈₀の表示時点でクリックボタンの押下を中止して、ドラッグ操作を終了する。 In the example of FIG. 10, the sample video includes 200 frames, F₁ to F₂₀₀. The action of picking up a product is assumed to be performed in the 150 frames from F₃₁ to F₁₈₀. In this case, while viewing the sample video, the operator presses the click button in frame F₃₁ while pointing the mouse cursor 151 at the vicinity of the person's wrist or palm, thereby starting a drag operation. The operator then keeps the click button pressed while pointing the mouse cursor 151 at the vicinity of the person's wrist or palm until the action is completed, thereby continuing the drag operation. Finally, the operator releases the click button when frame F₁₈₀ is displayed, ending the drag operation.

このような操作により、時空間一括入力部１１３は、マウスドラッグ操作の開始時刻ｔ_startおよび終了時刻ｔ_endを取得する。開始時刻ｔ_startは、特定の動作が開始されたフレームＦ₃₁を示す情報となり、終了時刻ｔ_endは、この動作が行われている最終のフレームＦ₁₈₀を示す情報となる。なお、図１０では例として、開始時刻ｔ_startとして時刻ｔ₁が取得され、終了時刻ｔ_endとして時刻ｔ₁₅₀が取得されている。さらに、時空間一括入力部１１３は、開始時刻ｔ_startから終了時刻ｔ_endまでの間、各フレームにおいてマウスカーソル１５１が指示する位置を取得する。 Through this operation, the time-space batch input unit 113 acquires the start time t_start and the end time t_end of the mouse drag operation. The start time t_start indicates the frame F₃₁ in which the specific action starts, and the end time t_end indicates the final frame F₁₈₀ in which the action is performed. Note that, as an example in FIG. 10, time t₁ is acquired as the start time t_start, and time t₁₅₀ is acquired as the end time t_end. Furthermore, the time-space batch input unit 113 acquires the position indicated by the mouse cursor 151 in each frame from the start time t_start to the end time t_end.

このようにして時空間指示操作によって得られる情報をＲとすると、Ｒは時間区間［ｔ_start，ｔ_end］から画像フレーム上に定義された非負値関数への写像として表される。この情報をＲ（ｘ，ｙ；ｔ）と表すと、後で詳述するようにＲ（ｘ，ｙ；ｔ）は、時刻ｔの画像フレーム上の位置（ｘ，ｙ）についての注目度（動作への寄与度）を表す。以下、Ｒ（ｘ，ｙ；ｔ）を「注目度」とする。 If the information obtained by the time-space instruction operation in this way is denoted R, then R is expressed as a mapping from the time interval [t_start, t_end] to non-negative functions defined on the image frame. Writing this information as R(x, y; t), R(x, y; t) represents the attention level (degree of contribution to the action) for the position (x, y) on the image frame at time t, as will be described in detail later. Hereinafter, R(x, y; t) will be referred to as the "attention level."

なお、時空間指示入力の方法としては、上記のようなマウスドラッグ操作を用いた方法に限らない。例えば、開始時刻ｔ_startと終了時刻ｔ_endにおいてそれぞれ１回ずつマウスクリック操作が行われる方法を用いてもよい。例えば図１０のケースでは、フレームＦ₃₁で人物の手首や手のひらの近傍をマウスカーソル１５１で指し示しながらクリックボタンが１回押下される。その時刻から、クリックボタンを押下しない状態のまま人物の手首や手のひらの近傍がマウスカーソル１５１で指し示される。そして、動作の終了時点においてクリックボタンが再度押下される。これにより、最初のボタン押下から最後のボタン押下までの間、各フレームにおいてマウスカーソル１５１が指し示す位置が記録される。 The method of inputting a time-space instruction is not limited to the method using the mouse drag operation as described above. For example, a method in which a mouse click operation is performed once each at the start time t _start and the end time t _end may be used. For example, in the case of FIG. 10, the click button is pressed once in frame F ₃₁ while pointing the vicinity of the wrist or palm of the person with the mouse cursor 151. From that time, the mouse cursor 151 points to the vicinity of the wrist or palm of the person without pressing the click button. Then, at the end of the movement, the click button is pressed again. As a result, the position pointed to by the mouse cursor 151 in each frame from the first button press to the last button press is recorded.

このようにボタン押下が２回行われる場合でも、マウスドラッグ操作を用いる場合と同様に、画面上の位置をマウスカーソル１５１で指示し続けるという簡単な操作で、検出対象の動作に対して特徴的な身体部位の近傍位置を記録することができる。そして、このような空間的な情報とともに、動作の開始時刻と終了時刻という時間的な情報を簡単な操作で一括して記録することができる。 Even when the button is pressed twice like this, the vicinity of a body part characteristic of the movement to be detected can be recorded by the simple operation of continuing to point to a position on the screen with the mouse cursor 151, just as in the case of using a mouse drag operation. Furthermore, together with this spatial information, the time information, i.e., the start and end times of the movement, can be recorded all at once with a simple operation.
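The information gathered by the drag operation can be sketched as a small recorder that consumes one pointer sample per displayed frame. The sample format `(frame_index, (x, y), button_down)` and the class name `TimeSpaceInstruction` are hypothetical; a real implementation would receive these values from the GUI toolkit's mouse events.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class TimeSpaceInstruction:
    """Time-space instruction information: section boundaries plus pointer positions."""
    t_start: Optional[int] = None
    t_end: Optional[int] = None
    positions: Dict[int, Tuple[int, int]] = field(default_factory=dict)


def record_instruction(pointer_samples):
    """Record one drag from per-frame samples (frame_index, (x, y), button_down)."""
    info = TimeSpaceInstruction()
    for frame_index, pos, button_down in pointer_samples:
        if button_down:
            if info.t_start is None:
                info.t_start = frame_index     # first frame with the button held
            info.positions[frame_index] = pos  # pointer position for this frame
            info.t_end = frame_index           # last frame seen with the button held
        elif info.t_start is not None:
            break                              # button released: input has ended
    return info


samples = [
    (1, (0, 0), False),
    (2, (5, 5), True),
    (3, (6, 5), True),
    (4, (7, 6), True),
    (5, (9, 9), False),
    (6, (1, 1), True),  # a later press is ignored in this sketch
]
info = record_instruction(samples)
```

The two-click variant described above would differ only in how `button_down` is derived: it would be toggled on the first click and off on the second.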

図１１は、注目度の算出処理について説明するための図である。図１１では、時刻ｔにおけるマウスカーソル１５１の指示位置（２次元座標）を（ｘ₀（ｔ），ｙ₀（ｔ））とし、見本動画像のうち時刻ｔ₁から時刻ｔ₁₅₀までの各フレームの表示時に時空間指示入力が行われたとする。この場合、時刻ｔ₁，時刻ｔ₂，．．．，時刻ｔ₁₅₀の各フレームについて注目度Ｒ（ｘ，ｙ；ｔ₁），Ｒ（ｘ，ｙ；ｔ₂），．．．，Ｒ（ｘ，ｙ；ｔ₁₅₀）が算出されて、注目度データ１２３に登録される。 FIG. 11 is a diagram for explaining the attention level calculation process. In FIG. 11, the position (two-dimensional coordinates) indicated by the mouse cursor 151 at time t is (x₀(t), y₀(t)), and the time-space instruction input is assumed to have been made while each frame from time t₁ to time t₁₅₀ of the sample video was displayed. In this case, the attention levels R(x, y; t₁), R(x, y; t₂), ..., R(x, y; t₁₅₀) are calculated for the frames at times t₁, t₂, ..., t₁₅₀, and are registered in the attention level data 123.

［注目度の第１の例］
第１の例として、注目度Ｒ（ｘ，ｙ；ｔ）を次の式（５）によって求めることができる。
Ｒ（ｘ，ｙ；ｔ）＝ｅｘｐ（－（（ｘ－ｘ₀（ｔ））²＋（ｙ－ｙ₀（ｔ））²）／σ²）・・・（５）
この式（５）によれば、注目度の値はマウスカーソル１５１の指示位置に近いほど大きな値となる。詳しくは後述するが、注目度は、類似検索における特徴点間の類似度の計算時に重みとして使用される。注目度が大きいほど、すなわち、時空間指示入力による指示位置に近い特徴点ほど、大きな重みが付与され、その特徴点についての類似度が特定動作の有無の判別に大きく寄与する。 [First Example of Attention Level]
As a first example, the attention level R(x, y; t) can be calculated by the following formula (5).
R(x, y; t) = exp(-((x - x₀(t))² + (y - y₀(t))²) / σ²) ... (5)
According to formula (5), the closer a position is to the position indicated by the mouse cursor 151, the larger the attention level. As will be described in detail later, the attention level is used as a weight when calculating the similarity between feature points in the similarity search. The larger the attention level, that is, the closer a feature point is to the position indicated by the time-space instruction input, the greater the weight assigned to that feature point, and the more the similarity for that feature point contributes to determining the presence or absence of the specific action.

また、式（５）において、σは注目度の空間的な広さを調整するための正の定数である。σの値により、検出対象の動作が、身体の広い範囲（例えば全身）が関与する（特徴的である）動作であるか、狭い範囲（例えば手指の先）が関与する動作であるかを区別する指示を、情報処理装置１００に与えることができる。例えば、検出対象の動作が、身体の狭い範囲が関与する動作である場合、σとして小さな値が設定されればよい。これにより、指示位置を中心として特徴点の動作特徴量が類似度計算に大きく寄与する範囲を限定することができる。 In addition, in equation (5), σ is a positive constant for adjusting the spatial extent of the attention level. The value of σ can be used to provide an instruction to the information processing device 100 to distinguish whether the motion to be detected is a (characteristic) motion involving a wide range of the body (e.g., the whole body) or a motion involving a narrow range (e.g., the fingertips). For example, if the motion to be detected is a motion involving a narrow range of the body, a small value can be set for σ. This makes it possible to limit the range in which the motion feature value of the feature point contributes significantly to the similarity calculation, centered on the indicated position.
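Formula (5) translates directly into code. The sketch below evaluates the attention level at a single position. The function name and argument order are chosen here for illustration.

```python
import math


def attention(x, y, x0, y0, sigma):
    """Attention level R(x, y; t) of formula (5) for pointer position (x0, y0)."""
    if sigma <= 0:
        raise ValueError("sigma must be a positive constant")
    squared_distance = (x - x0) ** 2 + (y - y0) ** 2
    return math.exp(-squared_distance / sigma ** 2)


at_pointer = attention(10, 10, 10, 10, sigma=5.0)    # exactly at the pointer
one_sigma_away = attention(15, 10, 10, 10, sigma=5.0)
narrow = attention(15, 10, 10, 10, sigma=2.0)         # same point, smaller sigma
```

The value is 1 at the indicated position and falls to exp(-1) at a distance of σ. With a smaller σ the same point receives a much lower attention level, which is how a small σ restricts the contributing range to a narrow part of the body.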

［注目度の第２の例］
時空間指示入力の際に、オペレータは、上記のようなマウスドラッグ操作と並行して、キーボードのテンキーなどを用いて注目度の空間的な広さをフレームの進行に伴って可変させてもよい。この場合、注目度Ｒ（ｘ，ｙ；ｔ）は次の式（６）によって求められる。
Ｒ（ｘ，ｙ；ｔ）＝ｅｘｐ（－（（ｘ－ｘ₀（ｔ））²＋（ｙ－ｙ₀（ｔ））²）／σ²（ｔ））・・・（６）
式（６）において、σ²（ｔ）は、時間経過に伴って設定される、注目度の空間的な広さを調整するための正の係数である。式（６）のようにσを時間経過に伴って可変にすることで、身体の広い範囲の動きが特徴的であるか、狭い範囲の動きが特徴的であるかを、身体の動きの流れの中で適応的に設定することができる。その結果、類似検索の精度を向上させることができる。 [Second Example of Attention Level]
When making the time-space instruction input, the operator may, in parallel with the mouse drag operation described above, use the numeric keypad of a keyboard or the like to vary the spatial extent of the attention level as the frames progress. In this case, the attention level R(x, y; t) is calculated by the following formula (6).
R(x, y; t) = exp(-((x - x₀(t))² + (y - y₀(t))²) / σ²(t)) ... (6)
In formula (6), σ²(t) is a positive coefficient, set as time passes, for adjusting the spatial extent of the attention level. By making σ variable over time as in formula (6), whether movement over a wide range of the body or movement over a narrow range is characteristic can be set adaptively within the flow of the body movement. As a result, the accuracy of the similarity search can be improved.
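Formula (6) differs from formula (5) only in that σ is looked up per frame. In the following sketch the pointer positions and σ values are held in dictionaries keyed by frame time index. This storage layout and the function name are assumptions for illustration.

```python
import math


def attention_t(x, y, t, pointer, sigma_by_frame):
    """Attention level R(x, y; t) of formula (6) with a per-frame sigma.

    pointer: {t: (x0, y0)} positions indicated by the operator.
    sigma_by_frame: {t: sigma} values set by the operator during the input.
    """
    x0, y0 = pointer[t]
    sigma = sigma_by_frame[t]
    squared_distance = (x - x0) ** 2 + (y - y0) ** 2
    return math.exp(-squared_distance / sigma ** 2)


pointer = {1: (100, 100), 2: (102, 101)}
sigma_by_frame = {1: 50.0, 2: 10.0}   # wide extent first, then narrowed
wide = attention_t(110, 100, 1, pointer, sigma_by_frame)
narrowed = attention_t(110, 100, 2, pointer, sigma_by_frame)
```

A position about ten pixels from the pointer keeps a high attention level while σ is large and loses much of it once the operator narrows σ.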

なお、注目度算出における係数σの他の利用方法としては、例えば、ｘ座標とｙ座標とでそれぞれ個別の係数σを用いる方法も適用可能である。
図１２は、時空間指示入力に応じた注目度の算出処理を示すフローチャートの例である。この図１２の処理は、図５のステップＳ１２の処理に対応する。 As another method of using the coefficient σ in calculating the attention level, for example, a method of using separate coefficients σ for the x coordinate and the y coordinate can also be applied.
FIG. 12 is an example of a flowchart showing the process of calculating the attention level in response to a time-space instruction input. The process in FIG. 12 corresponds to step S12 in FIG. 5.

［ステップＳ３１］時空間一括入力部１１３は、見本動画像データ１２１に基づき、見本動画像の再生表示を開始する。
［ステップＳ３２］時空間一括入力部１１３は、表示中のフレームに対して時空間指示入力が開始されたかを判定する。時空間指示入力が開始された場合、処理がステップＳ３３に進められる。一方、時空間指示入力が開始されていない場合、次に表示されるフレームについてステップＳ３２の処理が実行される。 [Step S31] The time-space batch input unit 113 starts playing and displaying the sample video based on the sample video data 121.
[Step S32] The time-space batch input unit 113 determines whether a time-space instruction input has been started for the currently displayed frame. If a time-space instruction input has been started, the process proceeds to step S33. If it has not been started, the process of step S32 is executed for the next frame to be displayed.

［ステップＳ３３］時空間一括入力部１１３は、時空間指示入力の開始時刻ｔ_startを記憶部１１０に記録する。
［ステップＳ３４］時空間一括入力部１１３は、現フレームにおけるマウスカーソルの指示位置（座標）を記憶部１１０に記録する。 [Step S33] The time-space batch input unit 113 records in the storage unit 110 the start time t _start of the time-space instruction input.
[Step S34] The time-space batch input block 113 records in the storage block 110 the position (coordinates) designated by the mouse cursor in the current frame.

［ステップＳ３５］時空間一括入力部１１３は、次に表示されたフレームにおいて時空間指示入力が終了したかを判定する。時空間指示入力が終了した場合、処理がステップＳ３６に進められる。一方、時空間指示入力が終了していない場合、現フレーム（終了の判定対象のフレーム）についてステップＳ３４の処理が実行される。 [Step S35] The time-space batch input unit 113 determines whether the time-space instruction input has ended for the next displayed frame. If the time-space instruction input has ended, the process proceeds to step S36. On the other hand, if the time-space instruction input has not ended, the process of step S34 is executed for the current frame (the frame for which the end is to be determined).

［ステップＳ３６］時空間一括入力部１１３は、前のフレームについての時刻を、時空間指示入力の終了時刻ｔ_endとして記憶部１１０に記録する。
［ステップＳ３７］注目度算出部１１４は、見本動画像のうち、開始時刻ｔ_startから終了時刻ｔ_endまでの区間の動画像（以下、「入力区間動画像」と記載する）に含まれる未処理のフレームの中から、先頭フレームを選択する。 [Step S36] The time-space batch input unit 113 records the time of the previous frame in the storage unit 110 as the end time t _end of the time-space instruction input.
[Step S37] The attention level calculation unit 114 selects the leading frame from among unprocessed frames included in the sample video from the start time t _start to the end time t _end (hereinafter referred to as the "input section video").

［ステップＳ３８］注目度算出部１１４は、選択されたフレームについてステップＳ３４で記録されたマウスカーソルの指示位置に基づき、そのフレームの各画素についての注目度Ｒを算出する。算出された注目度Ｒは、注目度データ１２３に記録される。 [Step S38] The attention calculation unit 114 calculates the attention level R for each pixel of the selected frame based on the mouse cursor position recorded in step S34. The calculated attention level R is recorded in the attention level data 123.

［ステップＳ３９］注目度算出部１１４は、入力区間動画像内の全フレームについて処理が終了したかを判定する。未処理のフレームが存在する場合、処理がステップＳ３７に進められて、次の未処理のフレームが選択される。一方、全フレームについて処理済みの場合には、処理が終了する。 [Step S39] The attention calculation unit 114 determines whether processing has been completed for all frames in the input section video image. If there are unprocessed frames, processing proceeds to step S37, where the next unprocessed frame is selected. On the other hand, if all frames have been processed, processing ends.
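The flow of steps S31 to S39 can be sketched roughly as follows. This is only an illustration under stated assumptions: the operator input is modeled as a list of (time, position) pairs, one per displayed frame, where the position is None while no input is in progress; σ is held constant; and the function names are hypothetical.

```python
import math

def record_input_section(events):
    """Steps S31 to S36: find the input section and the pointer positions.

    events: list of (time, position) per displayed frame, where position
    is an (x, y) tuple during the instruction input and None otherwise.
    Returns (t_start, t_end, positions) for the first input section, or
    None if no input occurred.
    """
    t_start = t_end = None
    positions = {}
    for t, pos in events:
        if pos is None:
            if t_start is not None:
                break      # S35: input ended; t_end is the previous frame (S36)
            continue       # S32: input has not started yet
        if t_start is None:
            t_start = t    # S33: record the start time
        positions[t] = pos  # S34: record the indicated position
        t_end = t
    if t_start is None:
        return None
    return t_start, t_end, positions

def attention_per_frame(positions, width, height, sigma):
    """Steps S37 to S39: attention level R for each pixel of each frame."""
    result = {}
    for t in sorted(positions):
        x0, y0 = positions[t]
        result[t] = [[math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / sigma ** 2)
                      for x in range(width)] for y in range(height)]
    return result
```

For example, `record_input_section([(0, None), (1, (1, 1)), (2, (2, 1)), (3, None)])` yields the section from time 1 to time 2 together with the two recorded positions.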

The calculation of the motion feature amounts in step S11 of FIG. 5 and the process of step S12 may be executed in parallel. Alternatively, the process of step S12 may be executed before the calculation of the motion feature amounts in step S11. In that case, the motion feature amounts need to be calculated only for the frames of the sample video that belong to the input section video determined by the time-space instruction input in step S12.

The attention level R may be calculated and recorded for each feature point in a frame rather than for every pixel in the frame. The attention level R may also be calculated at the time of the similarity search described later.

<Calculation of motion feature amounts from the detection target video>
FIG. 13 is an example of a flowchart showing the process of calculating motion feature amounts from the detection target video. FIG. 13 corresponds to the process of step S13 in FIG. 5. As shown in FIG. 13, motion feature amounts are calculated from the detection target video by the same procedure as for the sample video.

[Step S41] The detection target video is input from the camera 200 to the video input unit 111 and passed to the feature calculation unit 112. The detection target video is, for example, video captured over the course of one day. The feature calculation unit 112 stores the data of the detection target video in the storage unit 110 as the detection target video data 131.

[Step S42] The feature calculation unit 112 selects one frame from the beginning of the detection target video.
[Step S43] The feature calculation unit 112 extracts feature points (body parts) from the selected frame. Each extracted feature point is assigned a feature point number that is determined in advance, uniformly, for each body part. The same body part is assigned the same feature point number in the sample video and in the detection target video.

[Step S44] The feature calculation unit 112 calculates a motion feature amount for each extracted feature point. A spatial attribute and a value attribute are calculated as the motion feature amount. The feature calculation unit 112 registers the calculated motion feature amounts in the motion feature amount data 132 in the storage unit 110.

[Step S45] The feature calculation unit 112 determines whether all frames of the detection target video have been processed. If an unprocessed frame remains, the process returns to step S42, where the first unprocessed frame is selected. If all frames have been processed, the process ends.

<Motion detection based on similarity>
The similarity calculation unit 116 calculates a similarity, based on the motion feature amounts, between the detection target video and the section of the sample video in which the time-space instruction input was made (the input section video). The video search unit 115 uses the calculated similarities in a similarity search to detect, from the detection target video, sections in which the specific motion is performed.

FIG. 14 is a diagram for explaining the similarity search between the sample video and the detection target video. As described above, the input section video 161 is the section of the sample video in which the time-space instruction input was made (the video from the frame at the start time t_start to the frame at the end time t_end). In the similarity search, a partial video of the same length as the input section video 161 is extracted from the detection target video 162 and compared with the input section video 161 to determine whether the partial video is similar to it.

Partial videos are extracted in order from the beginning of the detection target video 162, sliding toward the end one frame at a time. FIG. 14 illustrates a case in which m partial videos are extracted from the detection target video 162. In this case, partial videos 163-1, 163-2, ..., 163-m are extracted starting from the beginning of the detection target video 162. A similarity based on the motion feature amounts is then calculated between the input section video 161 and each of the partial videos 163-1, 163-2, ..., 163-m. If the similarity is equal to or greater than a predetermined threshold, the partial video is determined to show a person performing the specific motion.
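The sliding extraction of partial videos can be sketched as below. The function name and the `similarity_of` callback are hypothetical, and the relation m = N - L + 1 (for N frames in the detection target video and L frames in the input section video) is not stated in the text; it simply follows from sliding a window of fixed length one frame at a time.

```python
def detect_sections(num_target_frames, section_length, similarity_of, threshold):
    """Return the start indices of partial videos judged to show the motion.

    similarity_of(start) is a caller-supplied function that returns the
    similarity between the input section video and the partial video
    beginning at frame index `start` of the detection target video.
    """
    if section_length <= 0:
        raise ValueError("section_length must be positive")
    m = num_target_frames - section_length + 1  # number of partial videos
    hits = []
    for start in range(max(m, 0)):
        # A partial video matches when its similarity reaches the threshold.
        if similarity_of(start) >= threshold:
            hits.append(start)
    return hits
```

The callback keeps the window enumeration separate from the choice of similarity measure, so any of the similarity calculations in this description could be plugged in.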

A dissimilarity may be calculated instead of the similarity between videos. Where Sim denotes the similarity and D the dissimilarity, D can be converted to Sim by inverting its sign, for example by setting Sim = -D.

FIG. 15 is a diagram for explaining the calculation of the similarity between the input section video and a partial video. The similarity calculation unit 116 calculates the similarity between corresponding frames (the inter-frame similarity) of the input section video extracted from the sample video and the partial video extracted from the detection target video, in order from the first frame. The similarity calculation unit 116 then adds up all the calculated inter-frame similarities to obtain the similarity between the input section video and the partial video.

In the example of FIG. 15, the input section video contains frames 171-1, 171-2, 171-3, ... in order from the beginning, and the partial video contains frames 172-1, 172-2, 172-3, ... in order from the beginning. In this case, an inter-frame similarity is calculated between frames 171-1 and 172-1, between frames 171-2 and 172-2, and between frames 171-3 and 172-3. In this way, an inter-frame similarity is calculated for every such pair of frames.

Between a frame of the input section video and a frame of the partial video, the motion feature amounts of the same feature point (that is, the same body part) are compared. The smaller the difference between the motion feature amounts, the higher the similarity. The similarities calculated for the individual feature points are added together to give the inter-frame similarity. When they are added, however, each similarity is weighted according to the position of its feature point, using the attention level derived from the time-space instruction input described above.

For example, suppose that in FIG. 15 the indicated position 173-1 is designated by the time-space instruction input in frame 171-1 of the input section video. In this case, the closer a feature point on frame 171-1 is to the indicated position 173-1, the larger the weight it receives when the similarity between frames 171-1 and 172-1 is calculated. Similarly, suppose that the indicated position 173-2 is designated in frame 171-2 of the input section video. In this case, the closer a feature point on frame 171-2 is to the indicated position 173-2, the larger the weight it receives when the similarity between frames 171-2 and 172-2 is calculated.

As described above, the attention level R is set to a larger value for pixels closer to the position indicated by the time-space instruction input. When the motion feature amounts of a feature point are compared between frames 171-1 and 172-1, the attention level R at the position of that feature point on frame 171-1 is used as the weight. Therefore, the closer the feature point is to the indicated position, the larger the weight by which the similarity based on its motion feature amount is multiplied.

With this calculation, feature points of body parts that are more characteristic of the motion to be detected contribute more to the similarity calculation. The similarity between videos therefore becomes an index that is sensitive to similarities and differences in the movement of the body parts that matter for the motion to be detected. As a result, motion detection places emphasis on the movement of those body parts, and the detection accuracy can be improved.

The structure of the human body is complex, and the appearance of the body in a captured image varies widely. Even when a person performs the specific motion to be detected, the appearance of the body as a whole is diverse. For example, when people pick up a product, the body parts near the hand move in similar ways, whereas the movement of the feet can differ greatly from person to person and from situation to situation. Consequently, to detect the motion of picking up a product from images, information on feature points near the hand is highly useful, while information on feature points near the feet is of little use.

As described above, in this embodiment, the closer a feature point is to the position indicated by the time-space instruction input, the more its motion feature amount contributes to the similarity calculation for motion detection. For example, as indicated by the dotted arrows in FIG. 15, among the feature points in frame 171-1, the feature point corresponding to the wrist is close to the indicated position 173-1, whereas the feature point corresponding to the ankle is far from it. A larger weight (attention level) is therefore given to the motion feature amount of the wrist feature point than to that of the ankle feature point, so the wrist contributes more to the similarity calculation than the ankle. As a result, the specific motion can be detected with high accuracy even when body movements are diverse.
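To give a feel for the size of this effect, the following worked example computes the weights of a wrist and an ankle feature point with a Gaussian attention level of the form exp(-d²/σ²). All coordinates and the value of σ are made-up numbers chosen only for illustration.

```python
import math

sigma = 50.0                   # assumed spread of the attention level, in pixels
pointer = (320.0, 240.0)       # assumed indicated position in the frame
points = {
    "wrist": (326.0, 248.0),   # 10 pixels from the indicated position
    "ankle": (300.0, 440.0),   # about 201 pixels from the indicated position
}

weights = {}
for name, (x, y) in points.items():
    d2 = (x - pointer[0]) ** 2 + (y - pointer[1]) ** 2
    weights[name] = math.exp(-d2 / sigma ** 2)

# The wrist weight stays close to 1, while the ankle weight is nearly 0,
# so differences in ankle movement barely change the similarity.
```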

One conceivable method of detecting motions from images is to perform learning using, as training data, images annotated with the position and type of the motion to be detected, and then to use the learning result to detect motions in input images. Because the training data supplies many different postures of people, this method can detect a specific motion accurately even when body movements are diverse. However, a large amount of training data must be prepared in advance to achieve high accuracy. In particular, the position and type of the motion to be detected must be specified, manually or otherwise, for a large number of images, so creating the training data takes an enormous amount of work.

Another conceivable method is to use posture information of a person as a search query and to search a large number of images for those containing the specific motion. However, as described above, the postures a person can take for a specific motion are diverse and change over time. It is therefore not easy to specify posture information that represents the motion to be detected, and it is difficult to prevent the detection accuracy from deteriorating because of the diversity of postures.

In contrast, this embodiment requires little work to create the information that must be prepared in advance. A person performing the specific motion is filmed, and while the captured video is played back and displayed, a simple mouse operation such as dragging (the time-space instruction input) designates the region in which the movement of interest takes place. The information processing device 100 can then calculate the attention level used for the similarity calculation.

In the time-space instruction input, the start time and end time are specified with the mouse button, so the section of the captured video in which the specific movement takes place can be identified easily. This makes it unnecessary to analyze irrelevant frames of the video and improves processing efficiency. In addition, the region in which the movement of interest takes place can be designated simply by pointing with the mouse cursor, so the operator does not need advanced skills such as image analysis. In other words, this simple preliminary operation alone makes it possible to detect the specific motion with high accuracy even when body movements are diverse. Highly accurate motion detection can therefore be performed efficiently and with little effort.

[First Example of Similarity Calculation]
Here, the motion feature amounts of the i-th feature point in the input section video of the sample video are denoted P = {p_i(t)}, and those of the i-th feature point in a partial video of the detection target video are denoted Q = {q_i(t)}. For the attention level R(x, y; t) derived from the time-space instruction input, the dissimilarity D(P, Q; R) is calculated by the following formula (7).
D(P, Q; R) = Σ_t Σ_i R(X(p_i(t)); t) |A(p_i(t)) - A(q_i(t))|² ... (7)
In formula (7), R(X(p_i(t)); t) is the attention level at the position of the i-th feature point in the frame at time t. Σ denotes summation over the subscripted variable. The time t ranges over the interval [t_start, t_end] specified by the time-space instruction input. With formula (7), the closer a feature point in the input section video is to the position indicated by the time-space instruction input, the larger the attention level applied as a weight to the squared difference in the value attribute A.
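Formula (7) could be implemented along the following lines. The data layout is an assumption made for this sketch: each video is a list of frames, each frame is a list of (position, value) pairs ordered by feature point number, and the frame index counted from the start of the section stands in for the time t.

```python
def dissimilarity(P, Q, attention):
    """Weighted dissimilarity D(P, Q; R) of formula (7).

    P, Q: lists over frames; each frame is a list over feature points of
    (position, value) pairs, where position is (x, y) and value is the
    value attribute A as a sequence of floats. Index i refers to the same
    body part in P and Q. attention(x, y, t) returns R(x, y; t).
    """
    total = 0.0
    for t, (frame_p, frame_q) in enumerate(zip(P, Q)):
        for (pos_p, a_p), (_, a_q) in zip(frame_p, frame_q):
            # The weight is taken at the feature point position in P.
            weight = attention(pos_p[0], pos_p[1], t)
            squared_diff = sum((u - v) ** 2 for u, v in zip(a_p, a_q))
            total += weight * squared_diff
    return total
```

A similarity can then be obtained as `-dissimilarity(P, Q, attention)`, in line with the conversion Sim = -D mentioned earlier.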

[Second Example of Similarity Calculation]
In formula (7) above, the scale (magnitude) of the value attribute can affect the accuracy of the similarity determination. So that similar motions are correctly determined to be similar even when the scale of the value attribute differs with body size, position in the image, and so on, the cosine similarity can be used, for example, as shown in the following formula (8-1).
Sim(P, Q; R) = Cos(P_R, Q_R) = (P_R · Q_R) / (|P_R| |Q_R|) ... (8-1)
P_R = {R(X(p_i(t)); t) A(p_i(t))} ... (8-2)
Q_R = {R(X(p_i(t)); t) A(q_i(t))} ... (8-3)
Here, i takes values from 1 to n, and t takes values from the start time t_start to the end time t_end. The notation {...} represents a vector whose elements (values on the individual coordinate axes) are the values obtained as the variables i and t vary over these ranges. "P_R · Q_R" denotes the inner product of the two vectors, and "|P_R| |Q_R|" denotes the product of their 2-norms (the square root of the sum of the squares of the elements).
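A sketch of formulas (8-1) to (8-3) is shown below. The data layout (a list of frames, each a list of (position, value) pairs ordered by feature point number) and the handling of zero vectors are assumptions of this sketch, not part of the embodiment.

```python
import math

def weighted_cosine(P, Q, attention):
    """Cosine similarity between the attention-weighted vectors P_R and Q_R.

    P, Q: lists over frames of lists of (position, value) pairs.
    attention(x, y, t) returns R(x, y; t); as in (8-2) and (8-3), the
    weight for both vectors is taken at the feature point position in P.
    """
    p_vec, q_vec = [], []
    for t, (frame_p, frame_q) in enumerate(zip(P, Q)):
        for (pos_p, a_p), (_, a_q) in zip(frame_p, frame_q):
            w = attention(pos_p[0], pos_p[1], t)
            p_vec.extend(w * u for u in a_p)   # elements of P_R
            q_vec.extend(w * v for v in a_q)   # elements of Q_R
    dot = sum(u * v for u, v in zip(p_vec, q_vec))
    norm = (math.sqrt(sum(u * u for u in p_vec))
            * math.sqrt(sum(v * v for v in q_vec)))
    if norm == 0.0:
        return 0.0  # assumption: a zero vector is treated as not similar
    return dot / norm
```

Because the result depends only on the direction of the two vectors, multiplying every value attribute in Q by the same positive factor leaves the similarity unchanged, which is the scale robustness this example aims for.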

[Third Example of Similarity Calculation]
In the first and second examples above, the value attribute of the motion feature amount is used as is. When the value attribute carries position information (that is, when its value differs depending on the position of the body), the following points need to be considered.

(1) For motions that depend strongly on the position in the image (for example, taking clothes from a fixed position), the value attribute keeps its position information (fixed offset).
(2) For motions determined by the position relative to a target object in the image (for example, when the clothes are placed in a different location from day to day), the position information in the value attribute is replaced with the displacement from the target object (object offset).
(3) For motions determined independently of the position in the image (for example, falling to the floor), the position information in the value attribute is replaced with the displacement from the center of the body (body offset).

A similarity or dissimilarity that takes the object offset or body offset into account can be calculated, for example, by replacing A(p_i(t)) in the first and second examples with A(p_i(t)) - obj(t) or A(p_i(t)) - A(p_μ(t)), respectively. The same applies to the motion feature amounts Q. Here, obj(t) is a vector of the same length as the value attribute; the position of the target object is set in the elements that carry position information, and 0 is set in the other elements. A(p_μ(t)) is the average of the value attribute over all feature points.
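The two substitutions can be sketched as small helper functions applied to the value attributes of one frame before the similarity is calculated. The function names and the list-based representation of the value attribute are assumptions for illustration.

```python
def subtract_object_offset(value, obj):
    """Return A(p_i(t)) - obj(t) for one feature point.

    obj has the same length as value; it holds the target object position
    in the elements that carry position information and 0 elsewhere.
    """
    if len(value) != len(obj):
        raise ValueError("value and obj must have the same length")
    return [a - o for a, o in zip(value, obj)]

def subtract_body_offset(values):
    """Return A(p_i(t)) - A(p_mu(t)) for every feature point of one frame.

    values: list of value attributes, one per feature point. A(p_mu(t)) is
    the element-wise average over all feature points in the frame.
    """
    if not values:
        return []
    n = len(values)
    mean = [sum(column) / n for column in zip(*values)]
    return [[a - m for a, m in zip(v, mean)] for v in values]
```

After the body offset is subtracted, shifting every feature point by the same amount no longer changes the values, so the same motion performed at a different place in the image yields the same input to the similarity calculation.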

FIG. 16 is an example of a flowchart showing the motion detection process based on similarity. FIG. 16 corresponds to the process of step S14 in FIG. 5.
[Step S51] The video search unit 115 extracts, starting from the beginning of the detection target video, a partial video of the same length as the input section video (the section of the sample video from the start time t_start to the end time t_end).

[Step S52] The similarity calculation unit 116 selects one frame each from the input section video and the partial video, starting from the beginning. This determines the frames to be compared in steps S53 to S55.

[Step S53] The similarity calculation unit 116 selects a feature point from the selected frame of the input section video.
[Step S54] The similarity calculation unit 116 acquires the motion feature amount of the feature point selected in step S53 from the selected frame of the input section video, the motion feature amount of the feature point with the same feature point number in the selected frame of the partial video, and the attention level at the coordinates of the former feature point. Based on the acquired motion feature amounts and attention level, the similarity calculation unit 116 calculates the inter-frame similarity for the selected feature point.

[Step S55] The similarity calculation unit 116 determines whether all feature points in the selected frame have been processed. If an unprocessed feature point remains, the process returns to step S53, where one unprocessed feature point is selected. If all feature points have been processed, the process proceeds to step S56.

[Step S56] The similarity calculation unit 116 determines whether all frames in the input section video (or the partial video) have been processed. If an unprocessed frame remains, the process returns to step S52, where the first of the unprocessed frames is selected. If all frames have been processed, the process proceeds to step S57.

[Step S57] The similarity calculation unit 116 adds up all the inter-frame similarities calculated in step S54 and outputs the result as the similarity between the input section video and the partial video.
[Step S58] The video search unit 115 compares the output similarity with a predetermined determination threshold to determine whether the partial video contains the specific motion. If the similarity is equal to or greater than the determination threshold, the partial video is determined to contain the specific motion.

[Step S59] The video search unit 115 determines whether all partial videos that can be extracted from the detection target video have been processed. If an unprocessed partial video remains, the process returns to step S51, where the first frame of the partial video is advanced by one and the next partial video is extracted from the detection target video. If all partial videos have been processed, the process ends.

As described above, in the process of FIG. 16, an inter-frame dissimilarity may be calculated instead of the inter-frame similarity. In that case, in step S58, the summed inter-frame dissimilarity is converted into a similarity, and the converted similarity is compared with the threshold.

＜分析処理＞
ここでは例として、１日の間に撮影された検出対象画像から、図５のステップＳ１３，Ｓ１４により、商品を手に取る動作の検出と、商品の価格を確認する動作の検出とが行われたものとする。また、商品を手に取る動作の検出は、商品ごとに行われたものとする。 <Analysis processing>
As an example, it is assumed that the action of picking up a product and the action of checking the price of the product are detected from detection target images taken in one day in steps S13 and S14 of Fig. 5. It is also assumed that the detection of the action of picking up a product is performed for each product.

この場合、分析部１１７は例えば、１日の中で該当商品が手に取られた回数を、他の商品を含めて１日に手に取られた平均的な回数と比較することで、該当商品の魅力を評価する。また、分析部１１７は、１日の中で該当商品の価格が確認された回数を、その商品の同じ日における売り上げ回数と比較することで、該当商品の値ごろ感を評価する。 In this case, the analysis unit 117 evaluates the attractiveness of the product by, for example, comparing the number of times the product is picked up in a day with the average number of times other products are picked up in a day. The analysis unit 117 also evaluates the affordability of the product by comparing the number of times the price of the product is checked in a day with the number of times the product is sold in the same day.

図１７は、分析テーブルの一例を示す図である。図１７に示す分析テーブル１８１は、商品を分析するためのデータが一時的に登録されるテーブル情報である。この分析テーブル１８１には、「商品を手に取る」「価格を確認する」という各動作に対して、検出回数、基準回数、比率、分析結果が対応付けて登録されている。 Figure 17 is a diagram showing an example of an analysis table. The analysis table 181 shown in Figure 17 is table information in which data for analyzing a product is temporarily registered. In this analysis table 181, the number of detections, the reference number of times, the ratio, and the analysis results are registered in association with each of the actions of "picking up the product" and "checking the price."

商品を手に取る動作に関しては、検出回数の項目に、１日の中で該当商品が手に取られた回数が登録される。また、基準回数は、他の商品を含めて１日に手に取られた平均的な回数を示す。比率は、基準回数に対する検出回数の比率を示す。例えば、比率が１．４以上であれば該当商品の魅力が高いと評価され、比率が０．８以上１．４未満であれば魅力が中程度と評価され、比率が０．８未満であれば魅力が低いと評価されて、評価結果が分析結果の項目に登録される。 Regarding the action of picking up a product, the number of times the product was picked up in a day is registered in the detection count item. The reference count indicates the average number of times the product was picked up in a day, including other products. The ratio indicates the ratio of the detection count to the reference count. For example, if the ratio is 1.4 or more, the product is evaluated as having high appeal, if the ratio is 0.8 or more and less than 1.4, the product is evaluated as having medium appeal, and if the ratio is less than 0.8, the product is evaluated as having low appeal, and the evaluation results are registered in the analysis results item.

また、価格を確認する動作に関しては、検出結果の項目に、１日の中で該当商品の価格が確認された回数が登録される。基準回数の項目には、その商品の同じ日における売り上げ回数が登録される。比率は、基準回数に対する検出回数の比率を示す。例えば、比率が４以上であれば該当商品は割高と評価され、比率が２以上４未満であれば該当商品の価格が適正と評価され、比率が２未満であれば該当商品は割安と評価されて、評価結果が分析結果の項目に登録される。 Furthermore, with regard to the operation of checking prices, the number of times the price of the relevant product was checked during the day is registered in the detection results field. The number of times the product was sold on the same day is registered in the reference count field. The ratio indicates the ratio of the detection count to the reference count. For example, if the ratio is 4 or more, the relevant product is evaluated as overpriced, if the ratio is 2 or more but less than 4, the price of the relevant product is evaluated as appropriate, and if the ratio is less than 2, the relevant product is evaluated as underpriced, and the evaluation results are registered in the analysis results field.
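The two ratio-based evaluations in analysis table 181 can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names and label strings are hypothetical, and rejecting a non-positive reference count is an assumption, since the description does not say how that case is handled.

```python
def rate_appeal(pickup_count: int, average_pickup_count: float) -> tuple[float, str]:
    """Rate appeal from the ratio of detected pickups to the reference count."""
    if average_pickup_count <= 0:
        # Assumption: the description does not cover a zero reference count.
        raise ValueError("reference count must be positive")
    ratio = pickup_count / average_pickup_count
    if ratio >= 1.4:
        label = "high"
    elif ratio >= 0.8:
        label = "medium"
    else:
        label = "low"
    return ratio, label


def rate_price(price_check_count: int, sales_count: int) -> tuple[float, str]:
    """Rate price perception from the ratio of price checks to same-day sales."""
    if sales_count <= 0:
        # Assumption: the description does not cover a day with no sales.
        raise ValueError("reference count must be positive")
    ratio = price_check_count / sales_count
    if ratio >= 4:
        label = "overpriced"
    elif ratio >= 2:
        label = "appropriate"
    else:
        label = "underpriced"
    return ratio, label
```

For instance, a product picked up 30 times against an average of 20 has a ratio of 1.5 and falls in the high-appeal band under the thresholds quoted above.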

〔第３の実施の形態〕 Third embodiment
次に、第３の実施の形態として、第２の実施の形態に係る情報処理装置１００の処理の一部を変形した場合について説明する。 Next, a third embodiment will be described in which a part of the processing of the information processing device 100 according to the second embodiment is modified.

上記の第２の実施の形態では、クエリ画像として用いた見本動画像に写った人物の動きに検索結果が過剰適合することがあり、このことが原因で動作の誤検出が発生し、検出精度が低下する場合がある。そこで、第３の実施の形態では、情報処理装置は、類似検索によって検索された部分動画像をそれぞれ再生表示し、各部分動画像に特定の動作が実際に写っているかをオペレータに確認させる。そして、情報処理装置は、確認の結果と各部分動画像とを用いて、動画像が特定の動作を含むか否かを弁別するための学習を実行する。最終的に、情報処理装置は学習の結果を用いて各部分動画像が特定の動作を含むかを再判定する。これにより、動作の検出精度を向上させる。 In the second embodiment described above, the search results may overfit the movements of the person captured in the sample video used as the query image, which can cause false detections of the action and reduce detection accuracy. Therefore, in the third embodiment, the information processing device plays and displays each of the partial video images found by the similarity search, and has the operator confirm whether the specific action is actually captured in each partial video image. The information processing device then uses the confirmation results and the partial video images to perform learning for discriminating whether or not a video image contains the specific action. Finally, the information processing device uses the learning results to re-determine whether each partial video image contains the specific action. This improves the detection accuracy of the action.

図１８は、第３の実施の形態に係る情報処理装置が備える処理機能の構成例を示す。なお、図１８では、図４と同じ構成要素には同じ符号を付して示し、その説明を省略する。図１８に示す情報処理装置１００ａは、図４に示した記憶部１１０、映像入力部１１１、特徴量算出部１１２、時空間一括入力部１１３、注目度算出部１１４、映像検索部１１５、類似度算出部１１６、分析部１１７に加えて、修正指示入力部１１８と弁別学習部１１９を備える。 Figure 18 shows an example of the configuration of processing functions provided in an information processing device according to the third embodiment. In Figure 18, the same components as those in Figure 4 are denoted with the same reference numerals, and their description will be omitted. The information processing device 100a shown in Figure 18 includes a correction instruction input unit 118 and a discrimination learning unit 119 in addition to the memory unit 110, video input unit 111, feature amount calculation unit 112, time-space batch input unit 113, attention calculation unit 114, video search unit 115, similarity calculation unit 116, and analysis unit 117 shown in Figure 4.

修正指示入力部１１８は、映像検索部１１５による類似度に基づく類似検索によって特定の動作が含まれると判定された部分動画像をそれぞれ再生表示し、各部分動画像に特定の動作が実際に含まれているかを示す入力を受け付ける。弁別学習部１１９は、上記各部分動画像と、修正指示入力部１１８が受け付けた入力の内容とを訓練データとして用いて、動画像に特定の動作が含まれるかを弁別するための学習を実行する。 The correction instruction input unit 118 plays and displays each of the partial video images that have been determined to contain a specific action by the video search unit 115 through a similarity search based on similarity, and accepts input indicating whether each partial video image actually contains a specific action. The discrimination learning unit 119 uses each of the partial video images and the content of the input accepted by the correction instruction input unit 118 as training data to perform learning to discriminate whether a video image contains a specific action.

図１９は、第３の実施の形態における動作検出・分析処理全体の処理手順を示すフローチャートの例である。図１９では、図５と同じ処理には同じステップ番号を付して示している。第３の実施の形態では、図５のステップＳ１４とステップＳ１５との間にステップＳ６１，Ｓ６２の処理が実行される。 Figure 19 is an example of a flowchart showing the overall processing procedure of the motion detection and analysis process in the third embodiment. In Figure 19, the same processes as in Figure 5 are indicated by the same step numbers. In the third embodiment, the processes of steps S61 and S62 are executed between steps S14 and S15 in Figure 5.

［ステップＳ６１］修正指示入力部１１８は、ステップＳ１４で特定の動作が検出された部分動画像を１つずつ再生表示し、各部分動画像に特定の動作が実際に含まれているかを示す入力をオペレータから受け付ける。弁別学習部１１９は、これらの部分動画像と、修正指示入力部１１８が受け付けた入力の内容（すなわち、部分動画像に特定の動作が含まれているか否かを示す情報）とを訓練データとして用いて、動画像に特定の動作が含まれるかを弁別するための学習を実行する。 [Step S61] The correction instruction input unit 118 plays and displays each of the partial video images in which the specific action was detected in step S14, and receives input from the operator indicating whether the specific action is actually included in each partial video image. The discrimination learning unit 119 uses these partial video images and the content of the input received by the correction instruction input unit 118 (i.e., information indicating whether the partial video image contains the specific action) as training data to perform learning to discriminate whether the video image contains the specific action.

［ステップＳ６２］映像検索部１１５は、ステップＳ６１での学習の結果を用いて、ステップＳ１４で特定の動作が検出された各部分動画像に特定の動作が含まれるかを再判定する。 [Step S62] Using the results of the learning performed in step S61, the video search unit 115 re-determines whether the specific action is included in each partial video image in which the specific action was detected in step S14.

この後、ステップＳ１５では、ステップＳ６２での判定結果に基づいて各種の分析が行われる。 Thereafter, in step S15, various analyses are carried out based on the results of the determination in step S62.
図２０は、学習および再判定の処理手順を示すフローチャートの例である。なお、ステップＳ７１～Ｓ７４が図１９のステップＳ６１に対応し、ステップＳ７５～Ｓ７７が図１９のステップＳ６２に対応する。 FIG. 20 is a flowchart showing an example of the procedure for learning and re-determination. Note that steps S71 to S74 correspond to step S61 in FIG. 19, and steps S75 to S77 correspond to step S62 in FIG. 19.

［ステップＳ７１］修正指示入力部１１８は、図１９のステップＳ１４で特定の動作が検出された部分動画像の中から１つを選択し、選択した部分動画像を再生表示する。 [Step S71] The correction instruction input unit 118 selects one of the partial video images in which a specific action was detected in step S14 of FIG. 19, and plays back and displays the selected partial video image.
［ステップＳ７２］オペレータは、再生表示された部分動画像を確認して、特定の動作が写っているか否かを入力する操作を行う。修正指示入力部１１８は、入力情報を受け付ける。 [Step S72] The operator checks the reproduced and displayed partial video image, and performs an operation to input whether or not a specific action is captured in the partial video image. The correction instruction input unit 118 accepts the input information.

例えば、オペレータは、再生表示された部分動画像に特定の動作が写っている場合、その旨を示す入力操作を行う。例えば、表示画面上に設けられたチェック部にチェックする操作を行う。すると、特定の動作が写っていることが修正指示入力部１１８に通知され、処理が次のステップＳ７３に進められる。一方、部分動画像に特定の動作が写っていなかった場合、オペレータは、次の部分動画像を再生表示させる操作を行う。すると、特定の動作が写っていないことが修正指示入力部１１８に通知され、処理が次のステップＳ７３に進められる。 For example, if a specific action is shown in the displayed partial video image, the operator performs an input operation to indicate this. For example, the operator performs an operation to check a check section provided on the display screen. This notifies the correction instruction input unit 118 that the specific action is shown, and processing proceeds to the next step S73. On the other hand, if the specific action is not shown in the partial video image, the operator performs an operation to play and display the next partial video image. This notifies the correction instruction input unit 118 that the specific action is not shown, and processing proceeds to the next step S73.

［ステップＳ７３］修正指示入力部１１８は、ステップＳ１４で特定の動作が検出されたすべての部分動画像について処理が完了したかを判定する。未処理の部分動画像が存在する場合、処理がステップＳ７１に進められ、未処理の部分動画像の中から１つが選択される。一方、該当するすべての部分動画像について処理済みの場合、処理がステップＳ７４に進められる。 [Step S73] The correction instruction input unit 118 determines whether processing has been completed for all partial video images in which a specific action was detected in step S14. If there are unprocessed partial video images, processing proceeds to step S71, where one of the unprocessed partial video images is selected. On the other hand, if all applicable partial video images have been processed, processing proceeds to step S74.

［ステップＳ７４］弁別学習部１１９は、ステップＳ１４で特定の動作が検出された部分動画像と、ステップＳ７２での入力結果とを訓練データとして用いて、動画像に特定の動作が写っているか否かを弁別するための学習を実行する。この学習では、これらの部分動画像のうち、特定の動作が写っていることが入力された部分動画像それぞれの動作特徴量群が正例として用いられ、それ以外の各部分動画像の動作特徴量群が負例として用いられる。 [Step S74] The discrimination learning unit 119 uses the partial video images in which the specific action was detected in step S14 and the input result in step S72 as training data to perform learning to discriminate whether or not the specific action is captured in the video image. In this learning, of these partial video images, the motion feature sets of each partial video image in which it has been input that the specific action is captured are used as positive examples, and the motion feature sets of each of the other partial video images are used as negative examples.

例えば、特徴量空間から実数空間［０，１］への所与の関数ｙ＝ｆ（ｘ；θ）が用意される。弁別学習部１１９は、正例に属する動作特徴量群に対してはｙ＝１を出力し、負例に属する動作特徴量群に対してはｙ＝０を出力するように、未知のパラメータθを調整する。これにより、動作特徴量の値属性が入力されると、正例か負例かの弁別結果が出力されるように学習される。 For example, a given function y = f(x; θ) from feature space to real space [0, 1] is prepared. The discrimination learning unit 119 adjusts the unknown parameter θ so that it outputs y = 1 for a group of motion features that belong to positive examples, and outputs y = 0 for a group of motion features that belong to negative examples. In this way, when a value attribute of a motion feature is input, it is learned to output a discrimination result of whether it is a positive example or a negative example.

学習の際に入力される値属性は、例えば｛Ａ（ｑ_i（ｔ））｝と表される。ただし、ｉは１～ｎの値をとり、ｔは開始時刻ｔ_startから終了時刻ｔ_endまでの値をとる。また、｛・・・｝は、変数であるｉ．ｔをこの範囲内で変化させたときの値を要素（個別の座標軸上の値）としてもつベクトルを表す。 The value attribute input during learning is represented as, for example, {A(q_i(t))}. Here, i takes the values 1 to n, and t takes values from the start time t_start to the end time t_end. Also, {...} represents a vector whose elements (values on individual coordinate axes) are the values obtained when the variables i and t are varied within these ranges.

学習は、例えば、サポートベクターマシン（ＳＶＭ）などの機械学習技術を用いて実行することができる。例えば、関数ｆ（ｘ；θ）としてｘ・θが用いられる。ただし、「・」はベクトルの内積を示す。 The learning can be performed using a machine learning technique such as a support vector machine (SVM). For example, x·θ is used as the function f(x;θ), where "·" indicates the inner product of vectors.
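As a rough illustration of adjusting θ so that f(x; θ) = x·θ outputs values near 1 for positive examples and near 0 for negative examples, the sketch below fits θ by stochastic gradient descent on squared error. This is a simple stand-in chosen for brevity, not the SVM mentioned above and not the patent's method. The function names, learning rate, epoch count, and the constant 1.0 appended to each vector as a bias feature are all assumptions.

```python
def dot(x: list[float], theta: list[float]) -> float:
    """Inner product x . theta, i.e. f(x; theta) in the linear case."""
    return sum(a * b for a, b in zip(x, theta))


def train_linear_discriminator(
    samples: list[list[float]],
    labels: list[int],
    learning_rate: float = 0.1,
    epochs: int = 500,
) -> list[float]:
    """Fit theta so that x . theta is close to 1 for positives and 0 for negatives.

    Assumption: plain per-sample gradient descent on squared error,
    used here in place of an SVM solver.
    """
    theta = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = dot(x, theta) - y
            theta = [t - learning_rate * error * xi for t, xi in zip(theta, x)]
    return theta


# Toy feature vectors (hypothetical); the trailing 1.0 is an assumed bias feature.
positives = [[1.0, 1.0], [0.9, 1.0]]   # operator confirmed the action
negatives = [[0.0, 1.0], [0.1, 1.0]]   # operator rejected the clip
theta = train_linear_discriminator(positives + negatives, [1, 1, 0, 0])
```

With θ learned this way, an unseen vector can be scored with `dot(x, theta)` and compared against a threshold such as 0.5, which is the comparison described for step S76.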

また、上記の値属性に対しては注目度に応じた重みが付与されることが望ましい。この場合、学習の際の入力は、｛Ｒ（Ｘ（ｐ_i（ｔ））；ｔ）Ａ（ｑ_i（ｔ））｝と表される。 It is also desirable to assign weights to the value attributes according to their degree of attention. In this case, the input for learning is represented as {R(X(p_i(t)); t) A(q_i(t))}.
［ステップＳ７５］映像検索部１１５は、ステップＳ１４で特定の動作が検出された部分動画像の中から１つを選択する。 [Step S75] The video search unit 115 selects one of the partial video images in which the specific action was detected in step S14.

［ステップＳ７６］映像検索部１１５は、学習結果である関数ｙ＝ｆ（ｘ；θ）に対して、選択された部分動画像についての動作特徴量の値属性を入力することで、その部分動画像に特定の動作が写っているかを判定する。このとき、値属性がそのまま入力されるのではなく、注目度に応じて重み付けされた値属性が入力されてもよい。また、θとしては学習の結果として得られた値が用いられる。また、部分動画像に特定の動作が写っているか否かは、ｙの出力値が所定の閾値以上か否かによって判定される。 [Step S76] The video search unit 115 inputs the value attribute of the motion feature for the selected partial video into the function y = f(x; θ) that is the result of learning, to determine whether the partial video contains a specific motion. At this time, instead of inputting the value attribute as is, a value attribute weighted according to the attention level may be input. Furthermore, the value obtained as a result of learning is used as θ. Furthermore, whether the partial video contains a specific motion is determined by whether the output value of y is equal to or greater than a predetermined threshold.

［ステップＳ７７］映像検索部１１５は、ステップＳ１４で特定の動作が検出されたすべての部分動画像について処理が完了したかを判定する。未処理の部分動画像が存在する場合、処理がステップＳ７５に進められ、未処理の部分動画像の中から１つが選択される。一方、該当するすべての部分動画像について処理済みの場合、処理が終了する。なお、ステップＳ７５～Ｓ７７の処理は、特定の動作が検出されなかった画像を含むすべての部分動画像に対してステップＳ７６の処理が実行されるように変形してもよい。 [Step S77] The video search unit 115 determines whether processing has been completed for all partial video images in which a specific action was detected in step S14. If there are unprocessed partial video images, processing proceeds to step S75, where one of the unprocessed partial video images is selected. On the other hand, if all applicable partial video images have been processed, processing ends. Note that the processing of steps S75 to S77 may be modified so that the processing of step S76 is performed on all partial video images, including images in which a specific action was not detected.
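Steps S75 to S77 can be sketched as a loop that builds the attention-weighted input {R(X(p_i(t)); t) A(q_i(t))} for each candidate clip, evaluates y = x·θ, and keeps the clips whose output reaches the threshold. This is a hedged sketch: the function names, the data layout (one row per frame t, one column per feature point i), the scalar value attributes, and the default threshold of 0.5 are assumptions made for illustration.

```python
def weighted_input(attention: list[list[float]], values: list[list[float]]) -> list[float]:
    """Flatten R * A over frames t and feature points i into one vector.

    attention[t][i] stands for R(X(p_i(t)); t) from the sample video;
    values[t][i] stands for A(q_i(t)) from the clip being judged.
    Assumption: each value attribute is a single number.
    """
    return [
        r * a
        for r_row, a_row in zip(attention, values)
        for r, a in zip(r_row, a_row)
    ]


def redetermine(
    clips: dict[str, list[list[float]]],
    attention: list[list[float]],
    theta: list[float],
    threshold: float = 0.5,
) -> list[str]:
    """Return the ids of clips judged to contain the specific action."""
    kept = []
    for clip_id, values in clips.items():          # S75: select one clip
        x = weighted_input(attention, values)
        y = sum(xi * ti for xi, ti in zip(x, theta))
        if y >= threshold:                         # S76: compare y with the threshold
            kept.append(clip_id)
    return kept                                    # S77: all clips processed


# Toy data (hypothetical): 2 frames x 2 feature points.
attention = [[1.0, 0.0], [0.5, 0.5]]
clips = {
    "clip_a": [[1.0, 9.0], [1.0, 1.0]],
    "clip_b": [[0.0, 9.0], [0.2, 0.2]],
}
theta = [0.5, 0.5, 0.5, 0.5]
result = redetermine(clips, attention, theta)
```

In the toy data, the second feature point of the first frame has attention 0, so its large value of 9.0 does not contribute to y for either clip.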

以上のように、オペレータから入力された特定の動作の有無の判定結果を用いて学習が行われ、学習結果を用いて動作の再判定が行われることで、動作の誤検出の発生確率を抑制でき、検出精度を向上させることができる。なお、オペレータは、学習結果を用いた判定結果の正確性を目視などによって確認し、正確性が不十分な場合には再度学習を実行させることもできる。 As described above, learning is performed using the results of the judgment of the presence or absence of a specific action input by the operator, and the action is re-judged using the learning results, thereby reducing the probability of erroneous detection of the action and improving detection accuracy. The operator can also visually check the accuracy of the judgment results using the learning results, and if the accuracy is insufficient, can execute learning again.

〔第４の実施の形態〕 Fourth embodiment
第２、第３の実施の形態では、見本動画像に対する処理と、検出対象動画像から特定動作を含む部分動画像を検出する処理とが、同一の装置において実行されていた。しかし、これらの処理は別々の装置で実行されていてもよい。 In the second and third embodiments, the process for the sample video and the process for detecting a partial video including a specific action from the detection target video are executed in the same device. However, these processes may be executed in separate devices.

図２１は、第４の実施の形態に係る情報処理システムの構成例を示す図である。図２１に示す情報処理システムは、サーバ装置１００ｂと情報処理装置１００ｃとを含む。なお、図２１では、図４と同じ構成要素には同じ符号を付して示し、その説明を省略する。 Fig. 21 is a diagram showing an example of the configuration of an information processing system according to the fourth embodiment. The information processing system shown in Fig. 21 includes a server device 100b and an information processing device 100c. Note that in Fig. 21, the same components as those in Fig. 4 are denoted by the same reference numerals, and their description will be omitted.

サーバ装置１００ｂは、第２の実施の形態に係る情報処理装置１００が備える処理機能のうち、大別して見本動画像に関する処理機能を備える。サーバ装置１００ｂは、記憶部１１０ａ、通信部１９１、特徴量算出部１１２ａ、時空間一括入力部１１３および注目度算出部１１４を備える。 The server device 100b has the processing functions related to the sample video image, broadly classified as the processing functions of the information processing device 100 according to the second embodiment. The server device 100b has a storage unit 110a, a communication unit 191, a feature amount calculation unit 112a, a time-space batch input unit 113, and an attention calculation unit 114.

このサーバ装置１００ｂでは、カメラ２００によって撮影された見本動画像が情報処理装置１００ｃから送信され、通信部１９２によって受信される。通信部１９１は、受信した見本動画像を特徴量算出部１１２ａに受け渡す。すると、特徴量算出部１１２ａは、図４の特徴量算出部１１２と同様に、見本動画像から特徴点ごとの動作特徴量を算出する。これにより、特徴量算出部１１２ａは、見本動画像データ１２１と動作特徴量データ１２２とを記憶部１１０ａに格納する。 In this server device 100b, a sample video captured by the camera 200 is transmitted from the information processing device 100c and received by the communication unit 192. The communication unit 191 passes the received sample video to the feature amount calculation unit 112a. The feature amount calculation unit 112a then calculates motion feature amounts for each feature point from the sample video, similar to the feature amount calculation unit 112 in FIG. 4. As a result, the feature amount calculation unit 112a stores the sample video data 121 and the motion feature amount data 122 in the storage unit 110a.

また、時空間一括入力部１１３は、見本動画像を表示装置に再生表示し、時空間指示入力を受け付ける。注目度算出部１１４は、見本動画像のフレームのうち、時空間指示入力が行われている区間の各フレームについて、指示された位置に基づいて特徴点ごとの注目度を算出する。これにより、注目度データ１２３が記憶部１１０ａに格納される。 The time-space batch input unit 113 also plays and displays the sample video on the display device and accepts time-space instruction input. The attention level calculation unit 114 calculates the attention level for each feature point based on the indicated position for each frame of the sample video in the section where the time-space instruction input is being made. As a result, attention level data 123 is stored in the storage unit 110a.

以上の処理が終了すると、通信部１９１は、記憶部１１０ａに格納された動作特徴量データ１２２と注目度データ１２３とを、情報処理装置１００ｃに送信する。 When the above process is completed, the communication unit 191 transmits the action feature amount data 122 and the attention level data 123 stored in the storage unit 110a to the information processing device 100c.
一方、情報処理装置１００ｃは、第２の実施の形態に係る情報処理装置１００が備える処理機能のうち、大別して検出対象動画像から特定動作を含む部分動画像を検出する処理機能を備える。情報処理装置１００ｃは、記憶部１１０、映像入力部１１１、通信部１９２、特徴量算出部１１２ｂ、映像検索部１１５、類似度算出部１１６および分析部１１７を備える。 On the other hand, the information processing device 100c has, of the processing functions of the information processing device 100 according to the second embodiment, broadly those for detecting a partial moving image including a specific action from a detection target moving image. The information processing device 100c has a storage unit 110, a video input unit 111, a communication unit 192, a feature amount calculation unit 112b, a video search unit 115, a similarity calculation unit 116, and an analysis unit 117.

この情報処理装置１００ｃでは、カメラ２００によって見本動画像が撮影されると、見本動画像が映像入力部１１１に入力される。すると、通信部１９２は、見本動画像のデータを見本動画像データ１２１として記憶部１１０に格納するとともに、見本動画像をサーバ装置１００ｂに送信する。その後、サーバ装置１００ｂから見本動画像に対応する動作特徴量データ１２２および注目度データ１２３が送信されると、通信部１９２がこれらを受信して記憶部１１０に格納する。 In this information processing device 100c, when a sample video is captured by the camera 200, the sample video is input to the video input unit 111. The communication unit 192 then stores the data of the sample video in the memory unit 110 as sample video data 121, and transmits the sample video to the server device 100b. After that, when the action feature data 122 and attention level data 123 corresponding to the sample video are transmitted from the server device 100b, the communication unit 192 receives them and stores them in the memory unit 110.

一方、カメラ２００によって検出対象動画像が撮影されると、検出対象動画像は映像入力部１１１から特徴量算出部１１２ｂに受け渡される。特徴量算出部１１２ｂは、図４の特徴量算出部１１２と同様に、検出対象動画像から特徴点ごとの動作特徴量を算出する。これにより、特徴量算出部１１２ｂは、検出対象動画像データ１３１と動作特徴量データ１３２とを記憶部１１０に格納する。 On the other hand, when the detection target moving image is captured by the camera 200, the detection target moving image is passed from the video input unit 111 to the feature amount calculation unit 112b. The feature amount calculation unit 112b calculates the motion feature amount for each feature point from the detection target moving image, similar to the feature amount calculation unit 112 in FIG. 4. As a result, the feature amount calculation unit 112b stores the detection target moving image data 131 and the motion feature amount data 132 in the storage unit 110.

これ以後、映像検索部１１５、類似度算出部１１６および分析部１１７によって、図４の情報処理装置１００と同様の処理が実行される。 Thereafter, the video search unit 115, the similarity calculation unit 116, and the analysis unit 117 execute the same processes as those of the information processing device 100 in FIG. 4.
なお、情報処理装置１００ｃは、図１８に示した修正指示入力部１１８および弁別学習部１１９をさらに備えてもよい。 The information processing device 100c may further include the correction instruction input unit 118 and the discrimination learning unit 119 shown in FIG. 18.

なお、上記の各実施の形態に示した装置（例えば、動作認識システム１の機能を実現する装置や、情報処理装置１００）の処理機能は、コンピュータによって実現することができる。その場合、各装置が有すべき機能の処理内容を記述したプログラムが提供され、そのプログラムをコンピュータで実行することにより、上記処理機能がコンピュータ上で実現される。処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、磁気記憶装置、光ディスク、半導体メモリなどがある。磁気記憶装置には、ハードディスク装置（ＨＤＤ）、磁気テープなどがある。光ディスクには、ＣＤ（Compact Disc）、ＤＶＤ（Digital Versatile Disc）、ブルーレイディスク（Blu-ray Disc：ＢＤ、登録商標）などがある。 The processing functions of the devices shown in the above embodiments (for example, a device that realizes the functions of the action recognition system 1 and the information processing device 100) can be realized by a computer. In this case, a program describing the processing contents of the functions that each device should have is provided, and the above processing functions are realized on the computer by executing the program on a computer. The program describing the processing contents can be recorded on a computer-readable recording medium. Examples of computer-readable recording media include magnetic storage devices, optical disks, and semiconductor memories. Examples of magnetic storage devices include hard disk drives (HDDs) and magnetic tapes. Examples of optical disks include CDs (Compact Discs), DVDs (Digital Versatile Discs), and Blu-ray Discs (BD, registered trademark).

プログラムを流通させる場合には、例えば、そのプログラムが記録されたＤＶＤ、ＣＤなどの可搬型記録媒体が販売される。また、プログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することもできる。 When distributing a program, for example, portable recording media such as DVDs and CDs on which the program is recorded are sold. The program can also be stored in a storage device of a server computer, and the program can be transferred from the server computer to other computers via a network.

プログラムを実行するコンピュータは、例えば、可搬型記録媒体に記録されたプログラムまたはサーバコンピュータから転送されたプログラムを、自己の記憶装置に格納する。そして、コンピュータは、自己の記憶装置からプログラムを読み取り、プログラムにしたがった処理を実行する。なお、コンピュータは、可搬型記録媒体から直接プログラムを読み取り、そのプログラムにしたがった処理を実行することもできる。また、コンピュータは、ネットワークを介して接続されたサーバコンピュータからプログラムが転送されるごとに、逐次、受け取ったプログラムにしたがった処理を実行することもできる。 A computer that executes a program stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its own storage device. The computer then reads the program from its own storage device and executes processing according to the program. Note that the computer can also read a program directly from a portable recording medium and execute processing according to that program. The computer can also execute processing according to the received program each time a program is transferred from a server computer connected via a network.

１動作認識システム
２入力受け付け部
３判定部
１０ａ，１０ｂ，２０ａ，２０ｂフレーム
１１ａ，１１ｂ指示位置
１２ａ，１２ｂ，２２ａ，２２ｂ特徴点
REFERENCE SIGNS LIST
1 Action recognition system
2 Input acceptance unit
3 Determination unit
10a, 10b, 20a, 20b Frames
11a, 11b Pointing positions
12a, 12b, 22a, 22b Feature points

Claims

an input receiving unit that plays and displays a first moving image captured of a person performing a specific action, and that continuously receives, during the playing and display of the first moving image, an instruction input that indicates a position related to the specific action for each frame included in the first moving image;
a determination unit that calculates a similarity between the first video and the second video based on a comparison result of feature amounts between a second feature point group including one or more feature points included in the second video and a first feature point group including two or more feature points included in the first video, and determines whether or not the specific action is included in the second video based on the calculated similarity, wherein, when calculating the similarity, the determination unit assigns a greater weight to the comparison result of feature amounts for feature points in the first feature point group that are closer to a position pointed to by the instruction input;
A motion recognition system having the above configuration.

the input receiving unit plays and displays a third moving image including the first moving image, and extracts, from the third moving image, a section from a position where the instruction input is started to a position where the instruction input is ended during the playing and display, as the first moving image.
The action recognition system according to claim 1.

the input receiving unit records a position pointed to by the instruction input for each frame included in the first moving image;
the determination unit performs a comparison of feature amounts between the second video and the first video for each frame, and when calculating the similarity, determines the weight based on a designated position corresponding to a frame for which feature amounts are to be compared, among designated positions recorded for each frame included in the first video;
The action recognition system according to claim 1 or 2.

the input reception unit receives an input for adjusting a setting value over time while receiving the instruction input;
when calculating the similarity, the determination unit changes the weight value according to a distance from the instruction position by the instruction input to each of two or more feature points included in the first feature point group for each frame according to the input setting value.
The action recognition system according to claim 3.

The determination unit is
comparing the first video with a plurality of the second video to determine whether or not the specific action is included in each of the plurality of the second video;
playing and displaying a fourth video determined to include the specific action among the plurality of second videos, and receiving a determination input indicating whether or not the specific action is included for each of the fourth videos;
using the fourth video and the content of the judgment input as training data, learning is performed to discriminate whether the specific action is included in the video;
determining a video including the specific action from among the fourth video using a result of the learning;
The action recognition system according to claim 1.

each of the two or more feature points included in the first feature point group is a body part of a person detected in the first video image;
The action recognition system according to claim 1.

The instruction input is performed by a mouse drag operation.
The action recognition system according to claim 1.

The computer
playing and displaying a first moving image captured of a person performing a specific action, and continuously receiving, during the playing and display of the first moving image, an instruction input for indicating a position related to the specific action for each frame included in the first moving image;
calculating a similarity between the first video and the second video based on a comparison result of feature amounts between a second feature point group including one or more feature points included in the second video and a first feature point group including two or more feature points included in the first video, and determining whether or not the second video includes the specific action based on the calculated similarity;
Execute the process,
In the calculation of the similarity, a feature point in the first feature point group that is closer to the pointing position by the pointing input is weighted more heavily in a comparison result of the feature amounts.
Action recognition method.

On the computer,
playing and displaying a first moving image captured of a person performing a specific action, and continuously receiving, during the playing and display of the first moving image, an instruction input for indicating a position related to the specific action for each frame included in the first moving image;
calculating a similarity between the first video and the second video based on a comparison result of feature amounts between a second feature point group including one or more feature points included in the second video and a first feature point group including two or more feature points included in the first video, and determining whether or not the second video includes the specific action based on the calculated similarity;
Execute the process,
In the calculation of the similarity, a feature point in the first feature point group that is closer to the pointing position by the pointing input is weighted more heavily in a comparison result of the feature amounts.
Motion recognition program.