JP6823686B2

JP6823686B2 - Object detection device and method and storage medium

Info

Publication number: JP6823686B2
Application number: JP2019091606A
Authority: JP
Inventors: ホァーンヤオハイ; ツァンヤン; リーヤン; ジャンジーユエン
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2018-05-25
Filing date: 2019-05-14
Publication date: 2021-02-03
Anticipated expiration: 2039-05-14
Also published as: CN110532838A; JP2019204505A

Description

The present invention relates to image processing, and in particular to object detection processing, for example.

In surveillance systems, people are generally the primary monitoring target. Because people routinely wear, carry, or use objects (for example, glasses, bags, suitcases, and wheelchairs), these objects can be used as auxiliary cues for monitoring people. In this specification, such objects are referred to, for example, as attachments of the associated person. In monitoring, person recognition is generally the main processing operation, and the most basic information it requires is the position of each person and of the related objects in the video. Whether people and related objects can be detected from video/images with a high recall rate therefore directly affects the accuracy of person recognition. In this specification, person recognition processing includes, for example, person attribute recognition, person matching (verification of a target person's ID), person image retrieval, and recognition or analysis of a person's actions and behavior (for example, whether the target person is carrying an object, and analysis of the interaction between the target person and other objects).

Non-Patent Document 1 discloses an exemplary object detection technique for detecting people and related objects from video/images with a high recall rate. In outline, it works as follows. First, a neural network extracts features at several levels from the input image: for example, low-level features for small-scale objects, mid-level features for medium-scale objects, and high-level features for large-scale objects. Next, a corresponding pre-generated candidate region generation network extracts, from the features at each level, information about the candidate regions of objects (for example, the position, score, and features of each candidate region).

In object detection techniques such as the one above, generally only the candidate regions whose scores are not below a preset threshold, or only the candidate regions whose scores rank in the top N, are output as the final result. In other words, the finally output candidate regions are regarded as the regions of the objects (people and things) that can be detected from the image. However, for objects that are occluded by other objects in the image or affected by illumination (for example, a suitcase occluded by the person with it, a wheelchair occluded by the person sitting in it, or a bag lying in a shadow cast on the ground), the features of the object cannot be fully extracted from the image. Consequently, even if the above technique detects candidate regions for such objects, their scores are low, so they fall below the preset threshold or fail to rank in the top N. Such candidate regions are therefore not output, the objects go undetected, and the recall rate for those objects suffers.
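The selection rule described above can be sketched as follows. This is a minimal illustration only: the `Candidate` type, the `select_final` function, and the example scores are hypothetical and are not taken from Non-Patent Document 1 or from this specification.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical candidate region: a category label and a detection score."""
    label: str
    score: float


def select_final(cands: List[Candidate],
                 threshold: Optional[float] = None,
                 top_n: Optional[int] = None) -> List[Candidate]:
    """Keep candidates whose score is not below `threshold` and/or rank in the top N."""
    kept = sorted(cands, key=lambda c: c.score, reverse=True)
    if threshold is not None:
        kept = [c for c in kept if c.score >= threshold]
    if top_n is not None:
        kept = kept[:top_n]
    return kept


# Assumed scores: the occluded wheelchair scores low because its features
# could not be fully extracted.
candidates = [
    Candidate("person", 0.92),
    Candidate("wheelchair", 0.31),
    Candidate("bag", 0.78),
]

by_threshold = select_final(candidates, threshold=0.5)
by_rank = select_final(candidates, top_n=2)
```

Under either rule the low-scoring wheelchair candidate is dropped, which is the recall problem the passage describes.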

Non-Patent Document 1: "Feature Pyramid Networks for Object Detection" (Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie, CVPR 2017).

In view of the related art described above, the present invention solves at least one of the above problems.

An object detection device according to one aspect of the present invention comprises: extraction means for extracting features from an image; acquisition means for acquiring a spatial relationship between the extracted features on the basis of a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to predetermined categories; detection means for detecting candidate regions of objects from the image on the basis of the extracted features; update means for updating the score of a candidate region, the score indicating the probability that an object of a predetermined category exists in that region of the image, on the basis of the spatial relationship between the features included in two of the detected candidate regions; and determination means for determining the regions of the objects present in the image on the basis of the updated scores of the candidate regions.

An object detection method according to another aspect of the present invention comprises: an extraction step of extracting features from an image; an acquisition step of acquiring a spatial relationship between the extracted features on the basis of a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to predetermined categories; a detection step of detecting candidate regions of objects from the image on the basis of the extracted features; an update step of updating the score of a candidate region, the score indicating the probability that an object of a predetermined category exists in that region of the image, on the basis of the spatial relationship between the features included in two of the detected candidate regions; and a determination step of determining the regions of the objects present in the image on the basis of the updated scores of the candidate regions.

An object detection device according to a further aspect of the present invention comprises: feature extraction means for extracting features from the image of the current video frame in a video; acquisition means for acquiring a spatial relationship between the extracted features on the basis of a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to predetermined categories; detection means for detecting candidate regions of objects from the image on the basis of the extracted features; update means for updating the score of a candidate region, the score indicating the probability that an object of a predetermined category exists in that region of the image, on the basis of the spatial relationship between the features included in two of the detected candidate regions; and determination means for determining the regions of the objects present in the image on the basis of the updated scores of the candidate regions.

In yet another aspect of the present invention, there is provided a storage medium storing instructions that, when executed by a processor, cause the object detection method described above to be performed.

In the present invention, the spatial relationships between individual feature points are used when detecting the regions of objects, so that these relationships constrain the detection. This makes it possible to detect related objects more reliably. Since related objects are generally useful for monitoring people, the present invention can improve not only the recall rate of object detection but also the effectiveness of person monitoring.
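To illustrate how a spatial relationship could constrain detection through the score update mentioned above, the sketch below raises the score of a candidate region according to its strongest related partner. This passage does not give the update rule, so the formula, the `weight` parameter, and all numbers are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple


def update_scores(scores: List[float],
                  relation: Dict[Tuple[int, int], float],
                  weight: float = 0.5) -> List[float]:
    """Hypothetical rule: score_i += weight * max_j(relation_ij * score_j), capped at 1.

    `relation[(i, j)]` is an assumed spatial relationship value (a probability)
    between candidate regions i and j.
    """
    updated = []
    for i, s in enumerate(scores):
        support = 0.0
        for (a, b), r in relation.items():
            if i == a:
                support = max(support, r * scores[b])
            elif i == b:
                support = max(support, r * scores[a])
        updated.append(min(1.0, s + weight * support))
    return updated


# Candidate 0: person, candidate 1: occluded suitcase, candidate 2: unrelated region.
scores = [0.90, 0.35, 0.40]
relation = {(0, 1): 0.8}  # person and suitcase are strongly related

new_scores = update_scores(scores, relation)
```

With an assumed output threshold of 0.5, the occluded suitcase (0.35 before the update) now passes, while the unrelated low-scoring region is unchanged.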

Further features and advantages of the present invention will become apparent from the embodiments described below with reference to the accompanying drawings.

The accompanying drawings are incorporated in and constitute a part of this specification, and illustrate embodiments of the present invention in order to explain its principles.
FIG. 1 is a block diagram showing a hardware configuration capable of implementing the techniques according to the embodiments of the present invention.
FIG. 2 is a block diagram showing the configuration of the object detection device according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the detection unit 230 shown in FIG. 2.
FIG. 4 is a flowchart of object detection according to the first embodiment of the present invention.
FIG. 5 is a schematic diagram of the pre-generated model used in the flowchart of FIG. 4.
FIG. 6 is a flowchart of the detection step shown in FIG. 4 according to the first embodiment of the present invention.
FIG. 7 is a flowchart of the rank determination step shown in FIG. 6 according to the first embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the object detection device according to the second embodiment of the present invention.
FIG. 9 is a flowchart of a generation method for generating a model applicable to the present invention.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Note that the following description is merely illustrative in nature and is not intended to limit the invention or its applications or uses. Unless otherwise stated, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in the embodiments do not limit the scope of the invention. In addition, techniques, methods, and devices known to those skilled in the art may not be discussed in detail, but they are intended to be part of this specification where appropriate.

Note that like reference numerals and letters denote like items in the drawings; once an item has been defined in one figure, it need not be discussed again for subsequent figures.

The inventors found that, in actual object detection scenes, there are generally objects that have a specific relationship (in particular, a spatial relationship) with one another; such objects are referred to as related objects. On the one hand, interactions (for example, mutual occlusion) are likely to occur between related objects. On the other hand, related objects are useful for monitoring people. The inventors therefore found that, in the object detection process, the specific relationship (in particular, the spatial relationship) that exists between related objects can constrain the detection of object regions, even when some of the related objects are occluded. As a result, the recall rate of object detection can be improved, and with it the effectiveness of person monitoring. In the present invention, a person (for example, a target person) and an object that the person wears, holds, or uses can be regarded as related objects: for example, a woman and the suitcase she is pulling, or a man and the wheelchair he is sitting in. A woman and the child she is holding, or two people who overlap because one stands in front of the other, can also be regarded as related objects. Adjacent objects (for example, a target object and another object next to it), such as a suitcase and a bag placed on top of it, or a person's shadow and a bag partly or wholly covered by that shadow, can likewise be regarded as related objects. Obviously, however, related objects are not limited to these examples.

In the present invention, the spatial relationship between related objects represents a spatial constraint between them. For two related objects, for example, the spatial relationship between them (for example, the spatial constraint between the regions corresponding to the two objects) includes at least the following constraints:
- Relative positional relationship between the two objects (for example, a directional relationship or a distance relationship)
For example, for a computer placed on a desk, the directional relationship between the computer and the desk is "on the desk". For a person or animal on a lawn, the directional relationship between the person or animal and the lawn is "on the grass". For a woman leading a child by the hand, the distance relationship between the woman and the child is "adjacent to, close to". Obviously, however, the relationship is not limited to these examples.
- Topological relationship between the two objects (for example, an overlapping relationship, an inclusion relationship, or an adjacency relationship)
For example, for a man sitting in a wheelchair, the topological relationship between the man and the wheelchair is an "overlapping relationship". For a woman holding a child, the topological relationship between the woman and the child is an "inclusion relationship". For a woman holding a parasol, the topological relationship between the woman and the parasol is an "adjacency relationship". Obviously, however, the relationship is not limited to these examples.
- Relative shape relationship between the two objects
For example, for a man sitting in a wheelchair, the spatial constraint between the man and the wheelchair is also a "relative shape relationship". Obviously, however, the relationship is not limited to this example.
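The three kinds of constraint can be made concrete with axis-aligned bounding boxes. The sketch below is only an illustration under stated assumptions: the `Box` type, the function names, the `adjacency_gap` tolerance, and the example coordinates are hypothetical, and the specification itself obtains these relationships from a learned model rather than from box geometry.

```python
import math
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned region in image coordinates (y grows downward)."""
    x1: float
    y1: float
    x2: float
    y2: float


def contains(a: Box, b: Box) -> bool:
    return a.x1 <= b.x1 and a.y1 <= b.y1 and a.x2 >= b.x2 and a.y2 >= b.y2


def gap(a: Box, b: Box) -> float:
    """Shortest distance between two boxes (0 if they touch or intersect)."""
    dx = max(a.x1 - b.x2, b.x1 - a.x2, 0.0)
    dy = max(a.y1 - b.y2, b.y1 - a.y2, 0.0)
    return math.hypot(dx, dy)


def topological_relation(a: Box, b: Box, adjacency_gap: float = 5.0) -> str:
    """Classify as inclusion, overlapping, adjacent, or disjoint."""
    if contains(a, b) or contains(b, a):
        return "inclusion"
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw > 0 and ih > 0:
        return "overlapping"
    if gap(a, b) <= adjacency_gap:
        return "adjacent"
    return "disjoint"


def direction(a: Box, b: Box) -> str:
    """Direction of a relative to b, from the offset between box centers."""
    dx = (a.x1 + a.x2) / 2 - (b.x1 + b.x2) / 2
    dy = (a.y1 + a.y2) / 2 - (b.y1 + b.y2) / 2
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


def relative_shape(a: Box, b: Box) -> Tuple[float, float]:
    """Width ratio and height ratio of a to b."""
    return ((a.x2 - a.x1) / (b.x2 - b.x1), (a.y2 - a.y1) / (b.y2 - b.y1))


man, wheelchair = Box(100, 50, 160, 200), Box(90, 120, 180, 220)
woman, child = Box(0, 0, 100, 200), Box(20, 40, 80, 120)
parasol = Box(102, 0, 150, 60)
```

With these assumed coordinates, the man and wheelchair overlap, the woman's box includes the child's, and the parasol is adjacent to the woman, mirroring the examples in the list above.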

The present invention has been proposed in view of the above findings and is described in detail below with reference to the accompanying drawings.

[Hardware configuration]
First, a hardware configuration capable of implementing the techniques described below is described with reference to FIG. 1.

The hardware configuration 100 includes, for example, a central processing unit (CPU) 110, a random access memory (RAM) 120, a read-only memory (ROM) 130, a hard disk 140, an input device 150, an output device 160, a network interface 170, and a system bus 180. The hardware configuration 100 may be implemented in, for example, a camera, a video camera, a personal digital assistant (PDA), a smartphone, a tablet, a laptop, a desktop computer, or any other suitable electronic device.

In one implementation, the object detection processing of the present invention is configured by hardware or firmware and functions as a module or component of the hardware configuration 100. For example, the object detection device 200, described in detail below with reference to FIG. 2, and the object detection device 800, described in detail below with reference to FIG. 8, serve as modules or components of the hardware configuration 100. In another implementation, the object detection processing is configured by software stored in the ROM 130 or the hard disk 140 and executed by the CPU 110. For example, the process 400, described in detail below with reference to FIG. 4, and the process 900, described in detail below with reference to FIG. 9, serve as programs stored in the ROM 130 or the hard disk 140.

The CPU 110 is any suitable programmable control device, such as a processor, and can perform the various functions described below by executing application programs stored in the ROM 130 or the hard disk 140 (that is, in memory). The RAM 120 temporarily stores programs and data loaded from the ROM 130 or the hard disk 140, and also serves as the work space in which the CPU 110 executes various processes (for example, the techniques of FIGS. 3 to 7 and FIG. 9 described below) and other available functions. The hard disk 140 stores various information such as an operating system (OS), applications, control programs, videos, images, pre-generated models, and predefined data (for example, thresholds (THs)).

In one embodiment, the input device 150 allows the user to interact with the hardware configuration 100. For example, the user can input images, video, or data through the input device 150, or can trigger the corresponding processing of the present invention through it. The input device 150 can take various forms, such as buttons, a keyboard, or a touch screen. In another embodiment, the input device 150 is used to receive images/video output from dedicated electronic devices such as digital cameras, video cameras, and/or network cameras.

In one embodiment, the output device 160 is used to display detection results (for example, the position, score, and features of each detected object region) to the user. The output device 160 can take various forms, such as a cathode ray tube (CRT) or a liquid crystal display. In another embodiment, the output device 160 is used to pass the detection results to subsequent processing such as person recognition (for example, person attribute recognition, person matching, person image retrieval, and recognition or analysis of a person's actions and behavior).

The network interface 170 provides an interface for connecting the hardware configuration 100 to a network. For example, through the network interface 170 the hardware configuration 100 can exchange data with other electronic devices connected over the network. Alternatively, the hardware configuration 100 may include a wireless interface for wireless data communication. The system bus 180 provides a data transfer path between the CPU 110, the RAM 120, the ROM 130, the hard disk 140, the input device 150, the output device 160, the network interface 170, and so on. Although it is called a bus, the system bus 180 is not limited to any particular data transfer technology.

The hardware configuration 100 described above is merely illustrative and is not intended to limit the invention or its applications or uses. For simplicity, only one hardware configuration is shown in FIG. 1; however, multiple hardware configurations may be used as needed.

[Object detection]
Next, object detection according to the present invention is described with reference to FIGS. 2 to 8.

FIG. 2 is a block diagram showing the configuration of the object detection device 200 according to the first embodiment of the present invention. Some or all of the modules shown in FIG. 2 may be implemented with dedicated hardware. As shown in FIG. 2, the object detection device 200 includes an extraction unit 210, a determination unit 220, and a detection unit 230.

First, the input device 150 shown in FIG. 1 receives an image output from a dedicated electronic device (for example, a video camera). The input device 150 then transfers the received image to the object detection device 200 via the system bus 180.

Then, as shown in FIG. 2, the extraction unit 210 extracts features from the received image. The extraction unit 210 can use an existing feature extraction algorithm, such as the local binary pattern (LBP) algorithm, the Gabor algorithm, the scale-invariant feature transform (SIFT) algorithm, or a neural network (NN) algorithm. The extracted features may be, for example, gradient features, edge features, apparent features, or semantic features of the image.

The determination unit 220 determines the spatial relationships between individual feature points in the image on the basis of the extracted features. Here, a feature point is a point on the extracted features. The spatial relationship between any two feature points is classified, according to whether the two points belong to the same region or to different regions, as either a positional relationship within the same region (an intra-category positional relationship) or a positional relationship between different regions (an inter-category positional relationship). Each determined spatial relationship has a corresponding spatial relationship value, which represents the probability that the two feature points have that spatial relationship.

In one embodiment, the determination unit 220 may determine the spatial relationships between feature points from the extracted features according to predefined rules. The predefined rules may be stored in a corresponding storage device, for example the storage device 240 shown in FIG. 2. The storage device 240 may be the ROM 130 or the hard disk 140 shown in FIG. 1, or a server or external storage device connected to the object detection device 200 via a network (not shown). In this embodiment, the determination unit 220 first acquires the predefined rules from the storage device 240 and then performs the corresponding determination of spatial relationships.

In another embodiment, so that spatial relationships can be determined conveniently in a variety of scenes, a model used to determine the spatial relationships between feature points (the pre-generated model) is trained in advance on training samples labeled with spatial relationships and stored in a corresponding storage device (for example, the storage device 240). The method of generating the pre-generated model is described in detail below with reference to FIG. 9. The determination unit 220 then uses the pre-generated model to determine the spatial relationships between feature points on the basis of the extracted features.

Furthermore, to improve the processing speed of object detection, the pre-generated model may, when it is generated as described with reference to FIG. 9, include a part for feature extraction in addition to the part for determining spatial relationships. In that case, the extraction unit 210 may also use the pre-generated model to extract features from the image: the extraction unit 210 first acquires the pre-generated model from the storage device 240 and then extracts features from the image with it. Moreover, in this case the determination unit 220 need not acquire the pre-generated model from the storage device 240 itself and may directly use the model already acquired by the extraction unit 210.

Returning to FIG. 2, after the spatial relationships between the individual feature points have been determined, the detection unit 230 detects the regions of objects in the image on the basis of those relationships. Here, the objects are preferably related objects in the image, and a detected object region includes, for example, the position of the region, the score of the region, and the features contained in the region. The score of a region indicates the probability that the region belongs to an object of a given category, and the features contained in a region are those of the features extracted by the extraction unit 210 that belong to the region.

In one embodiment, the detection unit 230 may detect the regions of objects from the image by directly using the determined spatial relationships. Specifically, the detection unit 230 first clusters the individual feature points based on the determined spatial relationships. Each clustering result can be regarded as one region, and the spatial relationships between the feature points within each clustering result belong to the "intra-category spatial relationship" described above. The detection unit 230 then determines the corresponding regions as the final detection regions of the objects based on the spatial relationships between feature points belonging to different clustering results (that is, the "inter-category spatial relationship" described above). For example, clustering results whose mutual distance is less than a predetermined threshold (for example, TH1) can be regarded as final detection regions of the objects, and clustering results that overlap each other can likewise be regarded as final detection regions.

In another embodiment, so that the regions of more closely related objects can be output preferentially and the positions of the detected regions are more accurate, the detection unit 230 may include a candidate region detection unit 231 and a rank determination unit 232, as shown in FIG. 3.

As shown in FIG. 3, the candidate region detection unit 231 detects candidate regions of objects from the image based on the features extracted by the extraction unit 210. The candidate region detection unit 231 may detect the candidate regions using an existing region detection algorithm such as the selective search algorithm, the EdgeBoxes algorithm, or the Objectness algorithm. Furthermore, as described above, the pre-generated model may include a portion for extracting features and a portion for determining spatial relationships; to further increase the processing speed of object detection, the pre-generated model may also include, when it is generated as shown in FIG. 9, a portion for detecting candidate regions of objects. The candidate region detection unit 231 may therefore use the pre-generated model to detect the candidate regions from the image based on the extracted features. In this case, the candidate region detection unit 231 may use the pre-generated model that the extraction unit 210 acquired from the storage device 240. Each detected candidate region includes, for example, the position of the candidate region, the score of the candidate region, and the features contained in the candidate region. The score of a candidate region indicates the probability that the candidate region belongs to an object of a certain category and may be obtained, for example, by classifying the candidate region.

Next, as shown in FIG. 3, after the candidate regions have been detected from the image, the rank determination unit 232 determines the ranking of the detected candidate regions based on the spatial relationships determined by the determination unit 220, and takes the ranked candidate regions as the detection regions of the objects.

Furthermore, in addition to the portion for detecting candidate regions of objects, the pre-generated model may include, when it is generated as shown in FIG. 9, a portion for directly detecting the regions of objects. Therefore, in yet another embodiment, the detection unit 230 may directly use the pre-generated model to detect the regions of objects in the image based on the determined spatial relationships. In this case, the detection unit 230 may use the pre-generated model that the extraction unit 210 acquired from the storage device 240.

Returning to FIG. 2, after the regions of objects have been detected from the image, the detection unit 230 takes as the final detection result either the regions whose scores are equal to or higher than a predefined threshold or the regions whose scores rank in the top N. It then transfers the final detection result to the output device 160 via the system bus 180 shown in FIG. 1, either to display the finally detected objects (for example, the positions, scores, and features of the regions) to the user or to output the detected regions to subsequent processing such as person recognition processing (for example, person attribute recognition, person matching, person image retrieval, or recognition or analysis of a person's behavior or actions).
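
The selection of the final detection result described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Region` type, its `box` layout, and the function name `select_final` are assumptions introduced here, and the sketch allows the threshold and top-N criteria to be combined even though the description presents them as alternatives.

```python
from typing import List, NamedTuple, Optional, Tuple


class Region(NamedTuple):
    box: Tuple[int, int, int, int]  # (x, y, width, height); this layout is an assumption
    score: float                    # probability that the region belongs to some category


def select_final(regions: List[Region],
                 threshold: Optional[float] = None,
                 top_n: Optional[int] = None) -> List[Region]:
    """Keep regions scoring at or above `threshold` and/or the `top_n` best regions."""
    # Sort from highest to lowest score so that slicing yields the top-N regions.
    ranked = sorted(regions, key=lambda r: r.score, reverse=True)
    if threshold is not None:
        ranked = [r for r in ranked if r.score >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return ranked
```

Calling `select_final(regions, threshold=0.7)` applies only the threshold criterion, and `select_final(regions, top_n=3)` applies only the top-N criterion.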

The flowchart 400 shown in FIG. 4 is the processing corresponding to the object detection device 200 shown in FIG. 2. The following description takes as an example the case in which the extraction unit 210, the determination unit 220, and the detection unit 230 all perform their processing using the pre-generated model. The schematic structure of the pre-generated model used in this processing is, for example, as shown in FIG. 5. Obviously, however, the processing is not necessarily limited to this.

As shown in FIG. 4, in the extraction step S410, the extraction unit 210 acquires the pre-generated model from the storage device 240 and extracts features from the received image using the acquired model (in particular, its feature extraction portion).

In the determination step S420, the determination unit 220 uses the pre-generated model (in particular, its portion for determining spatial relationships) to determine the spatial relationships between the feature points based on the extracted features.

In the detection step S430, the detection unit 230 detects the regions of objects from the image using the acquired pre-generated model. Here, the objects are preferably related objects in the image. As described above, so that the regions of more closely related objects are output preferentially and the positions of the detected regions are more accurate, in one embodiment the detection unit 230 detects the regions of objects in the image in accordance with FIG. 6.

As shown in FIG. 6, in the candidate region detection step S431, the candidate region detection unit 231 uses the pre-generated model (in particular, its portion for detecting candidate regions) to detect candidate regions of objects from the image based on the features extracted by the extraction unit 210. In the rank determination step S432, the rank determination unit 232 determines the ranking of the candidate regions based on the spatial relationships determined by the determination unit 220 and takes the ranked candidate regions as the final detection regions of the objects. In one embodiment, the rank determination unit 232 determines the ranking of the candidate regions in accordance with FIG. 7.

As shown in FIG. 7, in step S4321, the rank determination unit 232 determines the spatial relationships between the candidate regions detected by the candidate region detection unit 231. Specifically, for any two candidate regions, the rank determination unit 232 determines the spatial relationship between the two candidate regions based on the mutual spatial relationships between the feature points contained in them. Here, as long as the two candidate regions contain two corresponding feature points that have a specific spatial relationship, the two candidate regions can be regarded as having that specific spatial relationship.

In one embodiment, for any two candidate regions, the rank determination unit 232 takes the spatial relationship between any two feature points of the two candidate regions as the spatial relationship between the two candidate regions. Preferably, for example, the spatial relationship between the two feature points located at the centers of the two candidate regions is taken as the spatial relationship between the two candidate regions, and the spatial relationship value of those two feature points is regarded as the spatial relationship value between the two candidate regions. Alternatively, for example, the spatial relationship between the two feature points having the largest spatial relationship value is taken as the spatial relationship between the two candidate regions, and that largest value is regarded as the spatial relationship value between the two candidate regions.
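
The "largest spatial relationship value" variant can be sketched as follows. The data layout is an assumption made only for illustration: feature points are identified by hashable keys, one point of each pair is taken from each region, and `point_relations` maps a pair of keys to a hypothetical `(label, value)` tuple produced by the determination step.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple

Relation = Tuple[str, float]  # (relationship label, spatial relationship value)


def strongest_relation(points_a: Iterable[Hashable],
                       points_b: Iterable[Hashable],
                       point_relations: Dict[Tuple[Hashable, Hashable], Relation]
                       ) -> Optional[Relation]:
    """Return the feature-point relation with the largest value between two regions."""
    points_b = list(points_b)  # allow repeated iteration over the second region
    best: Optional[Relation] = None
    for p in points_a:
        for q in points_b:
            rel = point_relations.get((p, q))
            if rel is not None and (best is None or rel[1] > best[1]):
                best = rel
    return best  # None means no feature-point pair is related
```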

In another embodiment, for any two candidate regions, the rank determination unit 232 determines the spatial relationship between the two candidate regions using all of the spatial relationships between the feature points of the two regions. Preferably, for example, the spatial relationships between the feature points are put to a vote, and the spatial relationship receiving the most votes is determined as the spatial relationship between the two candidate regions. In addition, all of the spatial relationship values belonging to the spatial relationship with the most votes are combined by averaging, weighted summation, or taking the maximum, and the resulting value is regarded as the spatial relationship value between the two candidate regions.
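
The voting scheme can be illustrated with the short sketch below. It assumes that each feature-point pair contributes one hypothetical `(label, value)` tuple; the function name and the tie-breaking rule (the label seen first wins) are choices made here, not details from the description. Only averaging and the maximum are shown; a weighted sum would additionally need per-pair weights.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_by_vote(point_relations: Iterable[Tuple[str, float]],
                      reduce: str = "mean") -> Tuple[str, float]:
    """Pick the most frequent relation label and combine its values."""
    values: Dict[str, List[float]] = defaultdict(list)
    for label, value in point_relations:
        values[label].append(value)
    if not values:
        raise ValueError("at least one feature-point relation is required")
    # One vote per feature-point pair; ties go to the label encountered first.
    winner = max(values, key=lambda label: len(values[label]))
    winning_values = values[winner]
    if reduce == "mean":
        combined = sum(winning_values) / len(winning_values)
    elif reduce == "max":
        combined = max(winning_values)
    else:
        raise ValueError("reduce must be 'mean' or 'max'")
    return winner, combined
```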

Returning to FIG. 7, in step S4322, the rank determination unit 232 updates the scores of the candidate regions based on the spatial relationship values of the spatial relationships determined between the candidate regions. In one embodiment, the rank determination unit 232 may update the scores by a matrix operation. Specifically, for example, a mathematical operation (for example, matrix multiplication) is performed on a matrix composed of the spatial relationship values determined between the candidate regions and a matrix composed of the scores of the candidate regions, and the result of the operation is taken as the updated scores of the candidate regions. In another embodiment, when the target object to be detected (for example, a target person) is specified, the rank determination unit 232 need only update the scores of the candidate regions of objects related to the target object (for example, accessories of the target person). Specifically, for example, the related object having the largest spatial relationship value is first determined from among the related objects that have a spatial relationship with the target object, and that largest spatial relationship value is then added to the score of the candidate region of the determined related object to update the score.
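
Both score update variants can be sketched in a few lines. The sketch assumes an n x n matrix of spatial relationship values and a score vector of length n; the function names, and the choice of 1.0 on the diagonal in the usage example so that each region retains its own score, are assumptions. The description does not say whether the updated scores are normalized, so none is applied.

```python
from typing import List


def update_scores(relation: List[List[float]], scores: List[float]) -> List[float]:
    """Matrix variant: multiply the relation-value matrix by the score vector."""
    n = len(scores)
    return [sum(relation[i][j] * scores[j] for j in range(n)) for i in range(n)]


def boost_most_related(scores: List[float],
                       relation_to_target: List[float]) -> List[float]:
    """Target variant: add the largest relation value to that region's score."""
    best = max(range(len(scores)), key=lambda i: relation_to_target[i])
    updated = list(scores)  # leave the input list unchanged
    updated[best] += relation_to_target[best]
    return updated
```

For example, `update_scores([[1.0, 0.5], [0.5, 1.0]], [0.5, 1.0])` raises the score of the first region because it is related to a confidently detected second region.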

Furthermore, to narrow the range over which the spatial relationships between candidate regions are determined and thereby improve the processing speed, step S4320 may be included before step S4321 (that is, before the spatial relationships between the candidate regions are determined), as shown in FIG. 7. In step S4320, the rank determination unit 232 acquires corresponding auxiliary information, for example, information on a specific detection task or information on a specific detection scene.

For a specific detection task, the target object to be detected (for example, a target person) is generally specified; that is, the position information and category information of the target object are generally given. Furthermore, the objects to be detected preferentially are generally other objects related to the target object (for example, accessories around the target person).

Therefore, for a specific detection task, the auxiliary information acquired by the rank determination unit 232 is, for example, the position information and category information of at least one target object. On the one hand, because the category information of the target object is known, the rank determination unit 232 can clearly determine the types of spatial relationship that exist between the target object and other objects. For example, when the target object is a target person, the spatial relationships between the target object and other objects are only "spatial relationships between a person and another person" and "spatial relationships between a person and another object", and never "spatial relationships between an object and an object". On the other hand, because the position information of the target object is known, the rank determination unit 232 can roughly define which pairs of candidate regions need their spatial relationship determined, instead of determining the spatial relationships between all candidate regions. Therefore, when the rank determination unit 232 determines the spatial relationships between candidate regions in step S4321, only the spatial relationships between specific candidate regions need to be determined, which improves the processing speed.
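
One way to narrow the set of region pairs using the known target position is sketched below. The description only says that the pairs can be roughly defined from the position information; the distance test on region centers, the `max_distance` parameter, and the function name are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def pairs_for_target(target_index: int,
                     centers: Sequence[Tuple[float, float]],
                     max_distance: float) -> List[Tuple[int, int]]:
    """Return (target, candidate) index pairs whose centers are close enough."""
    target_center = centers[target_index]
    return [
        (target_index, i)
        for i, center in enumerate(centers)
        # Skip the target itself and candidates that are too far away.
        if i != target_index and math.dist(target_center, center) <= max_distance
    ]
```

Only the returned pairs would then be passed to step S4321, so pairs that do not involve the target are never evaluated.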

Also for a specific detection task, when the target object is a target person, the auxiliary information acquired by the rank determination unit 232 is, for example, the joint point information of at least one target object (that is, of the target person). The joint point information of the target person can be obtained by manual labeling or by a joint point detection method. Furthermore, by classifying or recognizing the joint points of the target person, the rank determination unit 232 may obtain the action that corresponds to the spatial relationship between the target person and a person or object associated with the target person. For example, when the target person is pulling a suitcase, the action corresponding to the spatial relationship between the target person and the suitcase is "pulling". Therefore, when the rank determination unit 232 determines the spatial relationships between candidate regions in step S4321, not only are the spatial relationships determined for specific candidate regions only, but also only the specific spatial relationships corresponding to the specific action are determined, which further improves the processing speed.

For a specific detection scene, specific spatial relationships generally exist between the scene and the objects in it (for example, persons or animals). For example, on grass or a prairie, flying animals (such as birds) are generally in the air and are unlikely to be walking on the ground, whereas people and walking animals (such as sheep) generally walk on the ground and almost never fly. Therefore, for a specific detection scene, the auxiliary information acquired by the rank determination unit 232 is, for example, scene information (that is, background information of the input image). The rank determination unit 232 can then determine, according to the scene information, the specific spatial relationships between specific objects and the scene. Therefore, when the rank determination unit 232 determines the spatial relationships between candidate regions in step S4321, only the specific spatial relationships need to be determined rather than all spatial relationships, which improves the processing speed.

Returning to FIG. 4, after the regions of objects have been detected from the image, the detection unit 230 takes as the final detection result the detected regions whose scores are equal to or higher than a threshold or that rank in the top N, and sends the detection result to the output device 160 via the system bus 180 shown in FIG. 1, either to display the finally detected objects (for example, the positions, scores, and features of the regions) to the user or to output the detected regions to subsequent processing such as person recognition processing (for example, person attribute recognition, person matching, person image retrieval, or recognition or analysis of a person's behavior or actions). For example, for recognition or analysis of a person's behavior or actions, the regions detected by the object detection device 200 shown in FIG. 2 are preferably the regions of the target person, of the accessories that the target person wears, holds, or uses, and of other persons near the target person, so that the behavior or actions between the target person and an accessory or a neighboring person can be recognized or analyzed directly from the spatial relationships between the regions. For example, for a target person and another person adjacent to the target person, when the spatial relationship between the regions is an "inclusion relationship", the action of the target person can be inferred to be, for example, "holding". For a target person and an accessory of the target person, when the spatial relationship between the regions is an "adjacency relationship", the action of the target person can be inferred to be, for example, "grasping". In addition, in person image retrieval within a video segment, the spatial relationship between a target person and the target person's accessories generally does not change much. It is therefore sufficient to determine only whether the target persons in the video segment that have the detected spatial relationship between regions are similar. For example, only whether the target persons pulling a suitcase in the video segment are similar is determined.
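
The two inferences given as examples above amount to a small lookup from the kind of related region and the spatial relationship to an action. The sketch below encodes only those two examples; the key strings and the function name are hypothetical, and any other combination deliberately returns `None` rather than guessing.

```python
from typing import Dict, Optional, Tuple

# (kind of related region, spatial relationship) -> inferred action of the target person
BEHAVIOR_RULES: Dict[Tuple[str, str], str] = {
    ("person", "inclusion"): "holding",
    ("accessory", "adjacency"): "grasping",
}


def infer_behavior(related_kind: str, relation: str) -> Optional[str]:
    """Look up the action implied by a relationship; None if no rule applies."""
    return BEHAVIOR_RULES.get((related_kind, relation))
```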

According to the first embodiment of the present invention, the spatial relationships between the individual feature points in the image are used when the regions of objects are detected, so these spatial relationships constrain the region detection and thereby allow related objects to be detected more reliably. Because related objects are generally useful for monitoring persons during the monitoring process, the present invention can improve not only the recall of object detection but also the effectiveness of monitoring persons.

In the first embodiment of the present invention, the object detection operation is performed on a single image. Because the spatial relationships between objects generally do not change greatly within a short duration, the present invention can also be used to perform object detection in a video segment. FIG. 8 is a block diagram showing the configuration of an object detection device 800 according to a second embodiment of the present invention. Some or all of the modules shown in FIG. 8 may be implemented by dedicated hardware. As shown in FIG. 8, the object detection device 800 includes a feature extraction unit 810, a candidate region detection unit 820, a spatial relationship determination unit 830, and a rank determination unit 840.

First, the input device 150 shown in FIG. 1 receives a video segment that is output from a specific electronic device (for example, a video camera) or input by the user. Next, the input device 150 transfers the received video to the object detection device 800 via the system bus 180.

Next, as shown in FIG. 8, the feature extraction unit 810 extracts features from the current video frame of the received video. The operation of the feature extraction unit 810 is the same as that of the extraction unit 210 shown in FIG. 2, so its description is not repeated here.

The candidate region detection unit 820 detects candidate regions of objects from the current video frame based on the features extracted by the feature extraction unit 810. The operation of the candidate region detection unit 820 is the same as that of the candidate region detection unit 231 shown in FIG. 3, so its description is not repeated here.

The spatial relationship determination unit 830 determines the spatial relationships between the candidate regions detected by the candidate region detection unit 820 based on the detection results of the video frames preceding the current video frame. Here, the detection results of the preceding video frames may be obtained according to the first embodiment of the present invention. In one embodiment, for example, the spatial relationships between the regions of objects detected in any one of the preceding video frames are taken as the spatial relationships between the candidate regions in the current video frame. In another embodiment, for example, a combined result (obtained, for example, by a mathematical operation such as weighting or averaging) of the spatial relationships between the regions of objects detected in the N preceding video frames is taken as the spatial relationships between the candidate regions in the current video frame.
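
The combination over N preceding frames can be sketched as an element-wise weighted average of per-frame relation-value matrices. The sketch assumes that the regions have already been put into the same order in every frame; how regions are associated across frames is not described here. The function name and the optional `weights` parameter are assumptions.

```python
from typing import List, Optional

Matrix = List[List[float]]


def fuse_previous_relations(history: List[Matrix],
                            weights: Optional[List[float]] = None) -> Matrix:
    """Element-wise weighted average of relation-value matrices from past frames."""
    if not history:
        raise ValueError("history must contain at least one frame")
    if weights is None:
        weights = [1.0] * len(history)  # plain average by default
    if len(weights) != len(history):
        raise ValueError("one weight per frame is required")
    total = sum(weights)
    rows, cols = len(history[0]), len(history[0][0])
    return [
        [sum(w * frame[i][j] for w, frame in zip(weights, history)) / total
         for j in range(cols)]
        for i in range(rows)
    ]
```

Giving larger weights to more recent frames is one plausible choice when the scene changes gradually.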

The rank determination unit 840 determines the ranking of the candidate regions detected by the candidate region detection unit 820 based on the spatial relationships between the candidate regions determined by the spatial relationship determination unit 830, and takes the ranked candidate regions as the regions of the objects.

After the regions of objects have been detected from the current video frame, the rank determination unit 840 takes as the final detection result either the regions whose scores are equal to or higher than a predefined threshold or the regions that rank in the top N. It then transfers the final detection result to the output device 160 via the system bus 180 shown in FIG. 1, either to display the finally detected objects of the current video frame (for example, the positions, scores, and features of the regions) to the user or to output the detected regions to subsequent processing such as person recognition processing (for example, person attribute recognition, person matching, person image retrieval, or recognition or analysis of a person's behavior or actions).

As an application example of the second embodiment of the present invention, the object detection device 800 shown in FIG. 8 may be used to track a person in a video. Specifically, for the current video frame of the video, when the person in the current video frame can be tracked successfully by a commonly used person tracking device, the person is detected using that person tracking device. When the person in the current video frame cannot be tracked successfully by the commonly used person tracking device, the person may instead be detected using the object detection device 800 shown in FIG. 8. In this way, the person can be tracked throughout the entire video.
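
The fallback logic of this application example can be sketched as a simple loop. `tracker` and `detector` are hypothetical callables standing in for the commonly used person tracking device and the object detection device 800; signalling a tracking failure by returning `None` is an assumption of this sketch.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Frame = TypeVar("Frame")
Box = TypeVar("Box")


def track_video(frames: Iterable[Frame],
                tracker: Callable[[Frame], Optional[Box]],
                detector: Callable[[Frame], Optional[Box]]) -> List[Optional[Box]]:
    """Use the tracker on each frame and fall back to the detector when it fails."""
    results: List[Optional[Box]] = []
    for frame in frames:
        box = tracker(frame)
        if box is None:  # tracking failed on this frame
            box = detector(frame)
        results.append(box)
    return results
```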

[Model generation]
As described in the first embodiment of the present invention, the model applicable to the present invention (that is, the pre-generated model) is trained and generated in advance by learning from samples labeled with spatial relationships. As described above, to improve the processing speed of the present invention, the pre-generated model includes, for example, a portion for extracting features, a portion for determining spatial relationships, and a portion for detecting regions or candidate regions, as shown in FIG. 5. In the present invention, the pre-generated model may be generated by a deep learning method (for example, a neural network method) based on training with samples labeled with spatial relationships. Each portion of the pre-generated model may be composed of a multi-layer network; for example, the portion for extracting features may be composed of an N-layer network, the portion for determining spatial relationships of an M-layer network, and the portion for detecting regions or candidate regions of a T-layer network. Here, N, M, and T are natural numbers whose values may be the same or different.

In one embodiment, to shorten the time needed to generate the pre-generated model, the portion for extracting features, the portion for determining spatial relationships, and the portion for detecting regions / candidate regions are updated simultaneously by backpropagation. FIG. 9 is a flowchart 900 schematically showing a generation method for generating a model applicable to the present invention. The flowchart 900 of FIG. 9 is described using an example in which a neural network is used to generate the model; obviously, the generation method need not be limited to this. The generation method of FIG. 9 can also be executed by the hardware configuration 100 shown in FIG. 1.

As shown in FIG. 9, the CPU 110 shown in FIG. 1 first acquires an initial neural network and a plurality of training samples that have been set in advance through the input device 150. Each training sample is labeled with a spatial relationship, a region position, and an object category. The spatial relationship labeled in a training sample indicates, for example, whether a spatial relationship is present and which category the spatial relationship belongs to.
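As a rough illustration of the three kinds of labels, a training sample could be represented as follows. The class name `TrainingSample`, its field names, and the use of integer category numbers are hypothetical; only the kinds of labels (spatial relationship, region position, object category) come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrainingSample:
    image_path: str                            # hypothetical reference to the sample image
    region: Tuple[float, float, float, float]  # labeled region position as (x, y, w, h)
    object_category: int                       # labeled object category number
    has_spatial_relationship: bool             # label: spatial relationship present or absent
    spatial_category: Optional[int] = None     # label: which spatial relationship category

    def __post_init__(self) -> None:
        # A sample marked as having a spatial relationship must name its category.
        if self.has_spatial_relationship and self.spatial_category is None:
            raise ValueError("spatial_category is required when a relationship is present")
```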

Next, in step S910, the CPU 110 first passes a training sample through the current neural network of the portion for extracting features (for example, the initial neural network) and the current neural network of the portion for determining spatial relationships (for example, the initial neural network) to obtain the spatial relationship present in the training sample. The CPU 110 then determines a loss (for example, a first loss Loss1) between the obtained spatial relationship and the sample spatial relationship, where the sample spatial relationship may be obtained from the spatial relationship labeled in the training sample. The first loss Loss1 represents the error between the spatial relationship value predicted by the current neural network and the spatial relationship value of the sample (that is, the real spatial relationship value); the error is measured, for example, as a distance. For example, the first loss Loss1 can be obtained by the following equation (1).

Here, j denotes the number of the spatial relationship category to which an object in the training sample belongs, C denotes the maximum number of spatial relationship categories, y_j denotes the real spatial relationship value of an object of spatial relationship category j, and P_j denotes the predicted spatial relationship value of an object of spatial relationship category j.
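Equation (1) itself is not reproduced in this text. As a sketch only, assuming that the distance is a cross-entropy over the C spatial relationship categories (a common choice that fits the symbols defined above, but an assumption rather than the published formula), the first loss could take the form:

```latex
\mathrm{Loss1} = -\sum_{j=1}^{C} y_j \log P_j
```

Under this assumption, the loss is zero when the predicted value P_j equals 1 for the category whose real value y_j is 1, and it grows as the prediction for that category approaches 0.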

In step S920, the CPU 110 first passes the training sample through all of the current neural networks (for example, the initial neural networks) to obtain the region / candidate region position of an object and the object category of the object. That is, the CPU 110 passes the training sample through the current neural network of the portion for extracting features, the current neural network of the portion for determining spatial relationships, and the current neural network of the portion for detecting regions / candidate regions of objects. For the obtained region / candidate region position, the CPU 110 then determines a loss (for example, a second loss Loss2) between that position and the sample region position, where the sample region position can be obtained from the region position labeled in the training sample. The second loss Loss2 represents the error between the region / candidate region position predicted by the current neural network and the sample region position; the error is measured, for example, as a distance. For example, the second loss Loss2 can be obtained by the following equations (2) and (3).

Here, smooth_L1(x) denotes the difference between the region / candidate region position and the real region position of the object, x denotes the horizontal coordinate of the upper-left corner of the region / candidate region of the object, y denotes the vertical coordinate of that upper-left corner, w denotes the width of the region / candidate region, h denotes its height, t^n_i denotes the region / candidate region position of an object whose object category is n, and v^n_i denotes the real region position of an object whose object category is n.
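Equations (2) and (3) are not reproduced in this text. The symbols defined above match the smooth L1 box regression loss widely used in region-based detectors such as Fast R-CNN, so one plausible reading, offered as an assumption rather than as the published formulas, is:

```latex
\mathrm{Loss2} = \sum_{i \in \{x,\, y,\, w,\, h\}} \mathrm{smooth}_{L1}\left(t^{n}_{i} - v^{n}_{i}\right)

\mathrm{smooth}_{L1}(x) =
\begin{cases}
0.5\,x^{2} & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```

In this form the loss is quadratic for small position errors and linear for large ones, which makes it less sensitive to outlier boxes than a purely squared error.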

For the obtained object category, the CPU 110 determines a loss (for example, a third loss Loss3) between the object category of the obtained object and the sample object category, where the sample object category can be obtained from the object category labeled in the training sample. The third loss Loss3 represents the error between the object category predicted by the current neural network and the sample object category (that is, the real object category); the error can be measured, for example, as a distance. For example, the third loss Loss3 can be obtained by the following equation (4).

Here, m denotes the number of the object category to which an object in the training sample belongs, M denotes the maximum number of object categories, y_m denotes the real object category of an object of object category m, and p_m denotes the predicted object category of an object of object category m.
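Equation (4) is likewise not reproduced in this text. Assuming the same cross-entropy form as is commonly used for category prediction (an assumption based on the symbols defined above, not the published formula), the third loss could be written as:

```latex
\mathrm{Loss3} = -\sum_{m=1}^{M} y_m \log p_m
```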

Returning to FIG. 9, in step S930 the CPU 110 determines whether all of the current neural networks satisfy a predetermined condition, based on all of the determined losses (that is, the first loss Loss1, the second loss Loss2, and the third loss Loss3). For example, the sum or weighted sum of the three losses is compared with a threshold (for example, TH2). When the sum / weighted sum is less than or equal to TH2, all of the current neural networks are determined to satisfy the predetermined condition and are output as the final neural network (that is, the pre-generated model). The final neural network is output, for example, to the storage device 240 shown in FIG. 2 for use in the object detection described with reference to FIGS. 2 to 8. When the sum / weighted sum is greater than TH2, the current neural networks are determined not to satisfy the predetermined condition, and the generation process proceeds to step S940.
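The check in step S930 amounts to comparing a sum or weighted sum with TH2. The following sketch shows that comparison; the function name `meets_stop_condition` and the default weights of 1.0 (which reduce the weighted sum to a plain sum) are illustrative assumptions.

```python
from typing import Sequence


def meets_stop_condition(
    losses: Sequence[float],
    th2: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> bool:
    """Return True when the weighted sum of the losses is at most TH2."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total <= th2
```

For example, `meets_stop_condition((loss1, loss2, loss3), th2)` uses the plain sum, while passing `weights` lets one loss count more than the others.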

In step S940, the CPU 110 updates the parameters of each layer of the current neural network of the portion for determining spatial relationships, based on the first loss Loss1. The parameters of each layer are, for example, the weights of each convolution layer of the current neural network. In one example, the parameters of each layer are updated based on the first loss Loss1 by using, for example, stochastic gradient descent.

In step S950, the CPU 110 updates the parameters of each layer of the current neural network of the portion for detecting regions / candidate regions of objects, based on the second loss Loss2 and the third loss Loss3. Here too, the parameters of each layer are, for example, the weights of each convolution layer of the current neural network. In one embodiment, the parameters of each layer are updated based on the second loss Loss2 and the third loss Loss3 by using, for example, stochastic gradient descent.

In step S960, the CPU 110 updates the parameters of each layer of the current neural network of the portion for extracting features, based on the first loss Loss1, the second loss Loss2, and the third loss Loss3. Here too, the parameters of each layer are, for example, the weights of each convolution layer of the current neural network. In one example, the parameters of each layer are updated based on the first loss Loss1, the second loss Loss2, and the third loss Loss3, again by using stochastic gradient descent. The generation process then returns to step S910.
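How each loss reaches each portion in steps S940 to S960 can be seen on a toy model with one scalar weight per portion. This is a sketch under strong simplifying assumptions: squared errors stand in for Loss1 and for the combination of Loss2 and Loss3, the detection head is fed by the shared feature only, and the names `update_step`, `y_rel`, and `y_det` are hypothetical.

```python
from typing import Dict, Tuple


def update_step(
    params: Dict[str, float], x: float, y_rel: float, y_det: float, lr: float = 0.1
) -> Tuple[Dict[str, float], float]:
    """One pass of S910 to S960 on a toy scalar model (illustrative only)."""
    w_f, w_s, w_d = params["feature"], params["spatial"], params["detect"]

    # Forward pass: a shared feature feeds both heads.
    f = w_f * x
    rel_pred = w_s * f  # spatial relationship head
    det_pred = w_d * f  # region / category head

    # Squared errors stand in for Loss1 and for Loss2 + Loss3.
    loss1 = (rel_pred - y_rel) ** 2
    loss23 = (det_pred - y_det) ** 2

    # Gradients by the chain rule (backpropagation).
    g_rel = 2.0 * (rel_pred - y_rel)
    g_det = 2.0 * (det_pred - y_det)
    grad_s = g_rel * f                        # S940: Loss1 only
    grad_d = g_det * f                        # S950: Loss2 and Loss3
    grad_f = (g_rel * w_s + g_det * w_d) * x  # S960: all three losses

    # All three portions are updated from the same forward pass.
    new_params = {
        "feature": w_f - lr * grad_f,
        "spatial": w_s - lr * grad_s,
        "detect": w_d - lr * grad_d,
    }
    return new_params, loss1 + loss23
```

The shared feature weight is the only parameter that receives gradient from every loss, which mirrors the description of step S960.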

In the flowchart shown in FIG. 9, whether the sum / weighted sum of the three losses (the first loss Loss1, the second loss Loss2, and the third loss Loss3) satisfies the predetermined condition is used as the condition for stopping updates to the current neural network. Obviously, the stopping condition need not be limited to this. For example, step S930 may be omitted, and the update operation may instead be stopped once the number of updates to the current neural network reaches a predetermined number.
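Both stopping conditions fit into one loop skeleton. In this sketch, `compute_loss` and `apply_update` are hypothetical callables standing for steps S910 to S920 and steps S940 to S960 respectively, and passing `th2=None` models the variant in which step S930 is omitted.

```python
from typing import Callable, Optional, Tuple


def generate_model(
    compute_loss: Callable[[], float],
    apply_update: Callable[[], None],
    max_updates: int,
    th2: Optional[float] = None,
) -> Tuple[int, float]:
    """Repeat updates until the loss threshold or the update budget is reached."""
    updates = 0
    while True:
        total = compute_loss()                # S910, S920: total of the three losses
        if th2 is not None and total <= th2:  # S930, skipped when th2 is None
            break
        if updates >= max_updates:            # predetermined number of updates reached
            break
        apply_update()                        # S940 to S960
        updates += 1
    return updates, total
```

The function returns the number of updates performed and the last total loss, so a caller can tell which condition ended the loop.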

All of the units described above are exemplary and/or preferred modules for implementing the processing described in the present disclosure. These units may be hardware units (such as a field programmable gate array (FPGA), a digital signal processor, or an application-specific integrated circuit) and/or software modules (such as computer-readable programs). The units for implementing each step are not described in detail above. However, where there is a step that performs a particular process, there may be a corresponding functional module or unit (implemented by hardware and/or software) for implementing that process. Technical solutions formed by all combinations of the described steps and of the units corresponding to those steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.

The method and the device of the present invention can be implemented in multiple ways. For example, the method and the device of the present invention may be implemented by software, hardware, firmware, or any combination thereof. The order of the steps of the method described above is intended to be merely illustrative, and the steps of the method of the present invention are not limited to the order specifically described above unless otherwise stated. Furthermore, in some embodiments, the present invention can also be implemented as a program recorded on a recording medium and containing machine-readable instructions for implementing the method according to the present invention. Accordingly, the present invention also covers a recording medium on which a program for implementing the method according to the present invention is recorded.

Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above embodiments are merely illustrative and do not limit the scope of the present invention. Those skilled in the art will appreciate that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.

Claims

An object detection device comprising:
an extraction means for extracting features from an image;
an acquisition means for acquiring the spatial relationship between the extracted features based on a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to a predetermined category;
a detection means for detecting candidate regions of an object from the image based on the extracted features;
an update means for updating, based on the spatial relationship between the features included in two of the detected candidate regions, a score of a candidate region, the score indicating the probability that an object of a predetermined category is present in that region of the image; and
a determination means for determining the region of an object present in the image based on the updated score of the candidate region.

The object detection device according to claim 1, wherein the pre-generated model is a model that has learned, based on the extracted features, the probability that two related objects are in the specific positional relationship.

The object detection device according to claim 1 or 2, wherein the acquisition means acquires the spatial relationship based on at least one of:
position information and category information of at least one of the related objects;
connection point information of at least one of the related objects; and
background information of the image.

The object detection device according to any one of claims 1 to 3, wherein the acquisition means acquires the spatial relationship representing a spatial constraint between the two related objects, and
the spatial constraint includes at least one of a relative positional relationship between the two related objects, a phase relationship between the two related objects, and a corresponding shape relationship between the two related objects.

The object detection device according to any one of claims 1 to 4, wherein the pre-generated model has at least three portions: a portion for extracting features from an image, a portion for determining the spatial relationship from the features, and a portion for detecting a region of an object from the features and the spatial relationship, and
the three portions are learned simultaneously by backpropagation.

The object detection device according to claim 5, wherein the extraction means extracts the features from the image based on the pre-generated model, and
the detection means detects the candidate regions of the object from the image based on the pre-generated model.

The object detection device according to claim 5 or 6, wherein the pre-generated model is generated, using a deep learning method, based on training samples labeled with the spatial relationship.

An object detection method comprising:
an extraction step of extracting features from an image;
an acquisition step of acquiring the spatial relationship between the extracted features based on a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to a predetermined category;
a detection step of detecting candidate regions of an object from the image based on the extracted features;
an update step of updating, based on the spatial relationship between the features included in two of the detected candidate regions, a score of a candidate region, the score indicating the probability that an object of a predetermined category is present in that region of the image; and
a determination step of determining the region of an object present in the image based on the updated score of the candidate region.

The object detection method according to claim 8, wherein the pre-generated model is a model that has learned, based on the extracted features, the probability that two related objects are in the specific positional relationship.

The object detection method according to claim 8 or 9, wherein the acquisition step acquires the spatial relationship representing a spatial constraint between the two related objects, and
the spatial constraint includes at least one of a relative positional relationship between the two related objects, a phase relationship between the two related objects, and a corresponding shape relationship between the two related objects.

The object detection method according to any one of claims 8 to 10, wherein the pre-generated model has at least three portions: a portion for extracting features from an image, a portion for determining the spatial relationship from the features, and a portion for detecting a region of an object from the features and the spatial relationship,
the three portions are learned simultaneously by backpropagation,
in the extraction step, the features are extracted from the image using the pre-generated model, and
in the detection step, the candidate regions of the object are detected from the image using the pre-generated model.

An object detection device comprising:
an extraction means for extracting features from an image of a current video frame in a video;
an acquisition means for acquiring the spatial relationship between the extracted features based on a pre-generated model that has learned a spatial relationship indicating a specific positional relationship between two related objects belonging to a predetermined category;
a detection means for detecting candidate regions of an object from the image based on the extracted features;
an update means for updating, based on the spatial relationship between the features included in two of the detected candidate regions, a score of a candidate region, the score indicating the probability that an object of a predetermined category is present in that region of the image; and
a determination means for determining the region of an object present in the image based on the updated score of the candidate region.

A storage medium storing instructions that, when executed by a computer, cause the computer to execute the object detection method according to any one of claims 8 to 10.