JP7560950B2

JP7560950B2 - Image processing system and control program

Info

Publication number: JP7560950B2
Application number: JP2020050251A
Authority: JP
Inventors: 翔平今田; 秀行青木
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2024-10-03
Anticipated expiration: 2040-03-19
Also published as: JP2021149691A

Description

本発明は、人物のジェスチャを検出する画像処理システム及び制御プログラムに関する。 The present invention relates to an image processing system and control program for detecting human gestures.

近年、監視空間を撮像した画像に基づいて、人物のジェスチャを検出する画像処理システムが開発されている。 In recent years, image processing systems have been developed that detect human gestures based on images captured in a monitored space.

特許文献１には、強度変化する光を物体に発し、その光の物体による反射光を外光から分離して検出し、光の物体による反射光画像を検出する情報入力装置が開示されている。 Patent document 1 discloses an information input device that emits light with varying intensity toward an object, separates the reflected light from the object from external light, and detects an image of the reflected light from the object.

特許文献２には、物体の所定の周期による往復動作を認識する携帯式コンピュータが開示されている。この携帯式コンピュータは、物体を撮影したイメージ・センサから連続する複数のフレームを受け取る。次にこの携帯式コンピュータは、背景画像と各フレームが含むブラー画像を比較し、対応する画素の階調値の差を計算して２値化した差分画像から物体の動作を認識する。 Patent document 2 discloses a portable computer that recognizes the reciprocating motion of an object at a predetermined cycle. The portable computer receives multiple successive frames from an image sensor that captures an image of the object. The portable computer then compares the background image with the blurred image contained in each frame, calculates the difference in the gradation values of corresponding pixels, and recognizes the object's motion from the binarized difference image.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H10-177449

Patent Document 2: Japanese Patent No. 5782061

画像処理システムでは、監視空間内の人物のジェスチャを精度良く検出することが望まれている。 In an image processing system, it is desirable to accurately detect gestures made by people in a monitored space.

本発明が解決しようとする課題は、撮影画像において動いた物体の動作を認識し易い画像を生成する画像処理システム及び制御プログラムを提供することである。また、監視空間内の人物が手を前に出して行うジェスチャを精度良く認識することができる画像処理システム及び制御プログラムを提供することである。 The problem that this invention aims to solve is to provide an image processing system and control program that generates an image in which the movement of a moving object in a captured image is easily recognized. Also, to provide an image processing system and control program that can accurately recognize a gesture made by a person in a monitored space with their hand out in front of them.

上述の課題を解決するため、本発明は、その一態様として、監視空間内の基準位置から物体までの距離に関する情報を階調値とする距離画像を順次取得する距離画像取得手段と、順次取得される距離画像に対応した、監視空間内の濃淡に関する情報を階調値とする２次元画像を順次取得する２次元画像取得手段と、距離画像取得手段により所定期間に取得された複数の距離画像内で同一位置に配置された画素又は領域のグループ毎に、グループの中で階調値が相対的に小さい画素又は領域を特定し、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域を用いて、所定期間に取得された複数の２次元画像が合成された処理画像を生成する合成手段と、を有することを特徴とする画像処理システムを提供する。 In order to solve the above-mentioned problems, as one aspect of the present invention, an image processing system is provided that includes: a distance image acquisition means for sequentially acquiring distance images whose gradation values are information relating to the distance from a reference position to an object in a monitored space; a two-dimensional image acquisition means for sequentially acquiring two-dimensional images whose gradation values are information relating to the shading in the monitored space corresponding to the sequentially acquired distance images; and a synthesis means for identifying, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired by the distance image acquisition means over a predetermined period of time, a pixel or region having a relatively small gradation value in the group, and generating a processed image in which a plurality of two-dimensional images acquired over a predetermined period of time are synthesized using a pixel or region corresponding to the identified pixel or region in a two-dimensional image corresponding to the distance image including the pixel or region identified for each group.

上記の画像処理システムにおいて、合成手段は、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域の階調値を、そのグループに対応する画素又は領域の階調値として処理画像を生成することが好ましい。 In the above image processing system, it is preferable that the synthesis means generates a processed image by using the gradation value of a pixel or area corresponding to a specified pixel or area in a two-dimensional image corresponding to a distance image including the pixels or areas specified for each group as the gradation value of the pixel or area corresponding to that group.

上記の画像処理システムにおいて、合成手段は、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域の階調値を、そのグループに対応する画素又は領域の第１成分の階調値とし、距離画像内でそのグループ毎に特定された画素又は領域の階調値を、そのグループに対応する画素又は領域の第２成分の階調値とするように処理画像を生成することが好ましい。 In the above image processing system, it is preferable that the synthesis means generates a processed image such that the gradation value of a pixel or region corresponding to a specified pixel or region in a two-dimensional image corresponding to a distance image including the pixels or regions specified for each group is set to the gradation value of the first component of the pixel or region corresponding to that group, and the gradation value of a pixel or region specified for each group in the distance image is set to the gradation value of the second component of the pixel or region corresponding to that group.
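The two-component composition described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `compose_processed_image`, the nested-list image layout, and the rule that ties go to the earliest frame are all assumptions made for the example.

```python
def compose_processed_image(distance_frames, gray_frames):
    """Build a two-component processed image from aligned frame sequences.

    distance_frames[t][y][x]: distance gradation value (smaller = nearer).
    gray_frames[t][y][x]: gradation value of the matching two-dimensional image.
    Returns rows of (first_component, second_component) tuples.
    """
    if not distance_frames or len(distance_frames) != len(gray_frames):
        raise ValueError("need the same non-zero number of distance and 2D frames")
    height = len(distance_frames[0])
    width = len(distance_frames[0][0])
    processed = []
    for y in range(height):
        row = []
        for x in range(width):
            # The "group" is the set of pixels at (x, y) across all frames.
            # min() keeps the earliest frame on ties (an assumption of this sketch).
            nearest_t = min(range(len(distance_frames)),
                            key=lambda t: distance_frames[t][y][x])
            # First component: 2D-image value from the frame where the object
            # was nearest. Second component: that nearest distance value.
            row.append((gray_frames[nearest_t][y][x],
                        distance_frames[nearest_t][y][x]))
        processed.append(row)
    return processed
```

Dropping the second element of each tuple gives the single-component variant in which only the two-dimensional image values are composited.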

上記の画像処理システムにおいて、入力された学習用処理画像に含まれる人物のジェスチャ動作に関する情報を出力するように学習されたモデルに処理画像を入力し、モデルから出力された情報に基づいて、監視空間内の人物のジェスチャ動作を検出する検出手段をさらに有することが好ましい。 In the above image processing system, it is preferable that the system further includes a detection means for inputting the processed image into a model trained to output information about the gesture movements of a person contained in the input learning processed image, and detecting the gesture movements of a person in the monitored space based on the information output from the model.

上記の画像処理システムにおいて、距離画像又は２次元画像内で人物を含む人物領域を検出する人物領域検出手段をさらに有し、合成手段は、検出された人物領域に対応するグループに限り、そのグループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域を用いて、処理画像を生成することが好ましい。 In the above image processing system, it is preferable that the system further includes a person area detection means for detecting a person area including a person in the distance image or the two-dimensional image, and the synthesis means generates a processed image using pixels or areas corresponding to the identified pixels or areas in the two-dimensional image corresponding to the distance image including the pixels or areas identified for each group, only for the groups corresponding to the detected person areas.

上記の画像処理システムにおいて、処理画像に基づいて、監視空間内に存在する人物の前方でなされた所定のジェスチャ動作を検出する検出手段をさらに有することが好ましい。 In the above image processing system, it is preferable that the system further includes a detection means for detecting a predetermined gesture movement made in front of a person present in the monitored space based on the processed image.

また、上述の課題を解決するため、本発明は、他の一態様として、監視空間内の基準位置から物体までの距離に関する情報を階調値とする距離画像を順次取得する距離画像取得手段と、距離画像取得手段により所定期間に取得された複数の距離画像内で同一位置に配置された画素又は領域のグループ毎に、グループの中で階調値が相対的に小さい画素又は領域を特定し、グループ毎に特定された画素又は領域を用いて、所定期間に取得された複数の距離画像を合成して処理画像を生成する処理画像生成手段と、を有することを特徴とする画像処理システムを提供する。 In addition, in order to solve the above-mentioned problems, in another aspect, the present invention provides an image processing system characterized by having a distance image acquisition means for sequentially acquiring distance images in which information relating to the distance from a reference position to an object in a monitored space is used as a gradation value, and a processed image generation means for identifying, for each group of pixels or regions located at the same position in multiple distance images acquired by the distance image acquisition means over a predetermined period of time, a pixel or region with a relatively small gradation value within the group, and using the pixels or regions identified for each group, synthesizing the multiple distance images acquired over the predetermined period of time to generate a processed image.

また、上述の課題を解決するため、本発明は、他の一態様として、監視空間内の基準位置から物体までの距離に関する情報を階調値とする距離画像を順次取得し、順次取得した距離画像に対応して、監視空間内の濃淡に関する情報を階調値とする２次元画像を順次取得し、所定期間に取得した複数の距離画像内で同一位置に配置された画素又は領域のグループ毎に、グループの中で階調値が相対的に小さい画素又は領域を特定し、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域を用いて、所定期間に取得した複数の２次元画像を合成して処理画像を生成する、ことをコンピュータに実行させることを特徴とする制御プログラムを提供する。 In addition, in order to solve the above-mentioned problems, in another aspect, the present invention provides a control program that causes a computer to execute the following steps: sequentially acquire distance images in which information relating to the distance from a reference position to an object in a monitored space is used as a gradation value; sequentially acquire two-dimensional images in which information relating to the shading in the monitored space is used as a gradation value corresponding to the sequentially acquired distance images; identify, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired over a predetermined period, a pixel or region in the group that has a relatively small gradation value; and generate a processed image by synthesizing the plurality of two-dimensional images acquired over the predetermined period using pixels or regions corresponding to the identified pixels or regions in the two-dimensional images corresponding to the distance images including the pixels or regions identified for each group.

また、上述の課題を解決するため、本発明は、他の一態様として、監視空間内の基準位置から物体までの距離に関する情報を階調値とする距離画像を順次取得し、所定期間に取得した複数の距離画像内で同一位置に配置された画素又は領域のグループ毎に、グループの中で階調値が相対的に小さい画素又は領域を特定し、グループ毎に特定された画素又は領域を用いて、所定期間に取得した複数の距離画像を合成して処理画像を生成する、ことをコンピュータに実行させることを特徴とする制御プログラムを提供する。 In order to solve the above-mentioned problems, in another aspect, the present invention provides a control program that causes a computer to execute the following steps: sequentially acquire distance images in which information relating to the distance from a reference position to an object in a monitored space is used as a gradation value; for each group of pixels or regions located at the same position in multiple distance images acquired over a specified period, identify pixels or regions with relatively small gradation values within the group; and use the pixels or regions identified for each group to synthesize the multiple distance images acquired over the specified period to generate a processed image.

本発明によれば、撮影画像において動いた物体の動作を認識し易い画像を生成する画像処理システム及び制御プログラムを提供することができる。また、監視空間内の人物が手を前に出して行うジェスチャを精度良く認識することができる画像処理システム及び制御プログラムを提供することができる。 The present invention provides an image processing system and control program that generates an image in which the movement of a moving object in a captured image is easily recognized. It also provides an image processing system and control program that can accurately recognize a gesture made by a person in a monitored space with their hand out in front of them.

FIG. 1 is a block diagram of the image processing system.

FIG. 2 is a flowchart showing the operation of the image processing system.

The remaining drawings are: a diagram explaining the correspondence between pixels in each processed image; two conceptual diagrams explaining the processed image; and three examples of processed images.

以下、図面を参照しつつ、本発明の様々な実施形態について説明する。ただし、本発明の技術的範囲は、それらの実施形態に限定されず、特許請求の範囲に記載された発明とその均等物に及ぶ点に留意されたい。また、各図において同一、又は相当する機能を有するものは、同一符号を付し、その説明を省略又は簡潔にすることもある。 Various embodiments of the present invention will be described below with reference to the drawings. However, please note that the technical scope of the present invention is not limited to these embodiments, but extends to the inventions described in the claims and their equivalents. In addition, in each drawing, parts having the same or equivalent functions are given the same reference numerals, and their description may be omitted or simplified.

(Overview of image processing system 1)
FIG. 1 is a block diagram of the image processing system 1. The image processing system 1 is used, for example, to watch over a person in a monitored space, such as a hospitalized patient or a person receiving care. It detects a motion by the watched person, such as waving a hand, as a gesture and notifies an external device used by the person doing the watching. The image processing system 1 includes an imaging device 2, a distance sensor 3, an image processing device 4, and the like.

The imaging device 2 is an example of an image generating means and sequentially generates two-dimensional images of the monitored space. A two-dimensional image is an image in which pixels are arranged two-dimensionally, each having as its gradation value information about the shading in the monitored space (a luminance value, a color value, or the like). The imaging device 2 includes a light emitter, a two-dimensional detector, an imaging optical system, an A/D converter, and the like. The light emitter emits near-infrared light with a wavelength of, for example, about 890 nm toward the monitored space. The two-dimensional detector has photoelectric conversion elements that are sensitive to near-infrared light, such as CCD (Charge-Coupled Device) or CMOS (Complementary MOS) elements. The imaging optical system forms an image of the monitored location on the two-dimensional detector. The A/D converter amplifies the electrical signal output from the two-dimensional detector and performs analog-to-digital (A/D) conversion on it.

The imaging device 2 images the monitored space at regular time intervals (for example, every 1/30 second) while causing the light emitter to emit near-infrared light. It generates, as the two-dimensional image, a near-infrared image in which each pixel has as its gradation value a luminance value representing the intensity of the near-infrared light, and outputs the image to the image processing device 4. Because humans cannot directly see near-infrared light, the imaging device 2 does not affect the vision of people in the monitored space. The image processing system 1 can therefore watch over, for example, a hospitalized patient or a person receiving care without disturbing that person's sleep.

尚、２次元検出器は、可視光に感度を有する光電変換器を有し、各画素が可視光の輝度値、ＲＧＢ値又はＣＭＹ値を階調値として有する可視光画像を２次元画像として生成してもよい。この場合、発光器は省略されてもよい。 The two-dimensional detector may have a photoelectric converter that is sensitive to visible light, and generate a two-dimensional visible light image in which each pixel has a luminance value, RGB value, or CMY value of visible light as a gradation value. In this case, the light emitter may be omitted.

The distance sensor 3 is an example of a distance image generating means and sequentially generates distance images. A distance image is an image in which pixels are arranged two-dimensionally, each having as its gradation value information about the distance from a reference position in the monitored space to the corresponding position on an object. The reference position is the position at which the distance sensor 3 is installed. Each time the imaging device 2 captures an image, the distance sensor 3 emits near-infrared light toward the imaging range of the imaging device 2 at a timing offset from the timing at which the light emitter of the imaging device 2 emits near-infrared light. The distance sensor 3 sequentially directs an exploration signal at each position in the monitored space that corresponds to a pixel in the two-dimensional image. For example, the distance sensor 3 divides the imaging range of the imaging device 2 horizontally and vertically at equal intervals according to the horizontal and vertical pixel counts of the two-dimensional image, and sets a position within each divided area as the position corresponding to the respective pixel in the two-dimensional image. The distance sensor 3 receives the reflected signal arriving along the scanning direction in which the exploration signal was emitted, and generates a received light signal whose value depends on the intensity of the reflected signal.

Based on the phase information of the exploration signal, angle information indicating the direction in which the exploration signal is currently being emitted, and the received light signal, the distance sensor 3 measures, for each scanning direction, the distance from the distance sensor 3 to the object that produced the reflected signal, and generates distance measurement data indicating the relationship between each scanning direction and its distance. For example, following the Time of Flight method, the distance sensor 3 obtains the difference between the phase of the reflected signal, derived from the received light signal, and the phase of the exploration signal, and measures the distance from that difference. The distance sensor 3 generates a distance image in which the gradation value of the pixel corresponding to each scanning direction is a value that depends on the distance for that scanning direction in the distance measurement data, and outputs the image to the image processing device 4. For example, the distance sensor 3 divides a predetermined distance range (for example, 0.5 m to 7 m) into 256 equal intervals and assigns the values 0 to 255 to them. The distance sensor 3 sets, as the gradation value of the pixel corresponding to each scanning direction, the value assigned to the interval containing the distance for that scanning direction. The gradation values are set so that a shorter distance to the corresponding object gives a smaller gradation value and a longer distance gives a larger one.
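As a rough sketch of the two calculations in this passage, the code below converts a phase difference to a distance and then maps a distance to a gradation value. The phase formula is the standard relation for continuous-wave time-of-flight ranging; the modulation frequency, the function names, and the clamping of out-of-range distances to 0 or 255 are assumptions, since the text does not specify them.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0

def phase_to_distance(phase_diff_rad, modulation_hz):
    """Continuous-wave ToF relation: d = c * dphi / (4 * pi * f).

    modulation_hz is a hypothetical parameter; the text gives no value.
    """
    return SPEED_OF_LIGHT_M_PER_S * phase_diff_rad / (4.0 * math.pi * modulation_hz)

def distance_to_gradation(distance_m, near_m=0.5, far_m=7.0, levels=256):
    """Map a distance to 0..levels-1 using equal-width intervals.

    Nearer objects get smaller values. Distances outside [near_m, far_m)
    are clamped to the end values (an assumption of this sketch).
    """
    step = (far_m - near_m) / levels
    index = int((distance_m - near_m) // step)
    return max(0, min(levels - 1, index))
```

With the example range of 0.5 m to 7 m, each of the 256 intervals is about 2.5 cm wide.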

尚、距離センサ３は、近赤外光やミリ波・レーザーなどを照射して物体に反射して返ってくる時間を計測するTOF・LiDAR方式、ステレオカメラなどを用いて三角測量を行う方式等の他の公知の方式に従って距離を測定してもよい。 In addition, the distance sensor 3 may measure distance according to other known methods, such as a TOF/LiDAR method that irradiates near-infrared light, millimeter waves, laser, etc. and measures the time it takes for the light to reflect off an object and return, or a method that performs triangulation using a stereo camera, etc.

このように、距離センサ３は、順次生成される２次元画像に対応して、距離画像を順次生成する。即ち、撮像装置２は、順次生成される距離画像に対応して、２次元画像を順次生成する。 In this way, the distance sensor 3 sequentially generates distance images corresponding to the sequentially generated two-dimensional images. That is, the imaging device 2 sequentially generates two-dimensional images corresponding to the sequentially generated distance images.

尚、撮像装置２と距離センサ３は、離間して配置し、撮影及び測定してもよい。その場合、処理部１２が、監視空間内の同一位置に対応する画素が２次元画像及び距離画像内で同一位置に配置されるように、２次元画像又は距離画像を補正する。画像処理装置４は、２次元画像及び距離画像の各画素の関係が示されるテーブルを記憶部９に予め記憶しておき、処理部１２は、記憶部９に記憶されたテーブルを参照して画像を補正する。 The imaging device 2 and distance sensor 3 may be placed apart to capture and measure. In this case, the processing unit 12 corrects the two-dimensional image or distance image so that pixels corresponding to the same position in the monitored space are positioned at the same position in the two-dimensional image and distance image. The image processing device 4 stores in advance in the memory unit 9 a table showing the relationship between each pixel in the two-dimensional image and distance image, and the processing unit 12 corrects the image by referring to the table stored in the memory unit 9.
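The table-based correction can be pictured as a per-pixel lookup. The sketch below assumes a hypothetical table layout in which each output pixel stores the source coordinates to copy from, or `None` where no corresponding pixel exists; the text only states that the table records the relationship between the pixels of the two images.

```python
def align_with_table(image, table, fill=0):
    """Resample `image` so that output pixel (x, y) shows the same point in
    the monitored space as pixel (x, y) of the other sensor's image.

    table[y][x] is the (src_x, src_y) position in `image` to copy from,
    or None where this image has no corresponding pixel.
    """
    aligned = []
    for table_row in table:
        out_row = []
        for source in table_row:
            if source is None:
                # No correspondence: use a placeholder value.
                out_row.append(fill)
            else:
                src_x, src_y = source
                out_row.append(image[src_y][src_x])
        aligned.append(out_row)
    return aligned
```

Because the table is computed once in advance, the correction at run time costs only one lookup per pixel.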

また、撮像装置２と距離センサ３の一部または全部が共通に用いられてもよい。例えば、撮像装置２及び距離センサ３は、共通の発光器及び／又は受光器を用いて２次元画像及び距離画像を生成してもよい。 Furthermore, a part or all of the imaging device 2 and the distance sensor 3 may be used in common. For example, the imaging device 2 and the distance sensor 3 may generate a two-dimensional image and a distance image using a common light emitter and/or light receiver.

画像処理装置４は、デスクトップコンピュータ、ワークステーション、ノートパソコン等の一般的なコンピュータである。画像処理装置４は、インタフェース部５、入力部６、表示部７、通信部８、記憶部９、処理部１２、データバスＢを有する。 The image processing device 4 is a general computer such as a desktop computer, a workstation, or a notebook computer. The image processing device 4 has an interface unit 5, an input unit 6, a display unit 7, a communication unit 8, a memory unit 9, a processing unit 12, and a data bus B.

インタフェース部５は、撮像装置２及び距離センサ３とデータ通信を行うためのインタフェース回路を有し、撮像装置２及び距離センサ３と電気的に接続して、各種の制御信号又は画像信号を送受信する。なお、画像処理装置４が撮像装置２及び距離センサ３を有していてもよい。 The interface unit 5 has an interface circuit for performing data communication with the imaging device 2 and the distance sensor 3, and is electrically connected to the imaging device 2 and the distance sensor 3 to transmit and receive various control signals or image signals. Note that the image processing device 4 may also have the imaging device 2 and the distance sensor 3.

入力部６は、（キーボード、マウス等の）入力装置、及び、入力装置から信号を取得するインタフェース回路を有し、画像処理装置４を操作するオペレータからの入力操作を受け付ける。 The input unit 6 has an input device (such as a keyboard or mouse) and an interface circuit that acquires signals from the input device, and accepts input operations from an operator who operates the image processing device 4.

表示部７は、液晶、有機ＥＬ（Ｅｌｅｃｔｒｏ－Ｌｕｍｉｎｅｓｃｅｎｃｅ）等のディスプレイ及びディスプレイに画像データを出力するインタフェース回路を有し、各種の情報をディスプレイに表示する。 The display unit 7 has a display such as a liquid crystal display or an organic EL (Electro-Luminescence) display, and an interface circuit that outputs image data to the display, and displays various information on the display.

通信部８は、出力手段の一例であり、例えばＴＣＰ／ＩＰ等に準拠した通信インタフェース回路を有し、インターネット等の通信ネットワークに接続する。通信部８は、通信ネットワークから受信したデータを処理部１２へ出力し、処理部１２から入力されたデータを通信ネットワークに送信する。 The communication unit 8 is an example of an output means, and has a communication interface circuit that complies with, for example, TCP/IP, and connects to a communication network such as the Internet. The communication unit 8 outputs data received from the communication network to the processing unit 12, and transmits data input from the processing unit 12 to the communication network.

The storage unit 9 has semiconductor memory such as ROM and RAM, a magnetic disk, or an optical disk drive for CD-ROM, DVD-ROM, or similar media together with its recording medium. The storage unit 9 stores a control program and various data for controlling the image processing device 4, and exchanges this information with the processing unit 12. The computer program may be installed in the storage unit 9 from a computer-readable portable recording medium such as a CD-ROM or DVD-ROM using a known setup program or the like. The storage unit 9 also stores a model 10 and a background image 11 as data.

モデル１０は、入力された画像に対して、その画像に検出対象となるジェスチャが含まれている確からしさを示す評価値を出力するように事前学習された判定モデルである。評価値は、その画像に検出対象となるジェスチャが含まれている可能性が高いほど高くなるように定められる。 Model 10 is a judgment model that has been pre-trained to output an evaluation value that indicates the likelihood that an input image contains a gesture to be detected. The evaluation value is set to be higher the more likely the image contains a gesture to be detected.

背景画像１１は、無人状態の監視空間が撮影されて生成された２次元画像である。背景画像１１は、定期的に、または、監視空間内に人物が存在しないと判定されたタイミングで、適宜更新されてもよい。 The background image 11 is a two-dimensional image generated by photographing an unoccupied monitored space. The background image 11 may be updated periodically or as needed when it is determined that no people are present in the monitored space.

処理部１２は、ＣＰＵ、ＭＰＵ等のプロセッサと、ＲＯＭ、ＲＡＭ等のメモリと、その周辺回路とを有し、画像処理装置４の各種信号処理を実行する。なお、処理部１２として、ＤＳＰ、ＬＳＩ、ＡＳＩＣ、ＦＰＧＡ等が用いられてもよい。処理部１２は、距離画像取得手段１３、２次元画像取得手段１４、人物領域検出手段１５、抽出手段１６、処理画像生成手段１７、検出手段１８、出力制御手段１９、学習手段２０等を有する。 The processing unit 12 has a processor such as a CPU or MPU, memories such as a ROM or RAM, and peripheral circuits, and executes various signal processing of the image processing device 4. Note that a DSP, an LSI, an ASIC, an FPGA, etc. may be used as the processing unit 12. The processing unit 12 has a distance image acquisition means 13, a two-dimensional image acquisition means 14, a person area detection means 15, an extraction means 16, a processed image generation means 17, a detection means 18, an output control means 19, a learning means 20, etc.

(Gesture detection operation of image processing system 1)
FIG. 2 is a flowchart showing an operation sequence of the image processing system 1. This operation sequence is executed mainly by the processing unit 12, in cooperation with the other elements of the image processing device 4, based on the control program stored in the storage unit 9. It is executed at every time interval at which a distance image and a two-dimensional image are generated.

まず、距離画像取得手段１３は、距離センサ３が生成した最新の距離画像を取得する（ステップＳ１）。距離画像取得手段１３は、監視空間内の基準位置から物体までの距離に関する情報を階調値とする距離画像を順次取得する。距離画像取得手段１３は、取得した距離画像を、取得した時刻と関連付けて記憶部９に記憶させる。 First, the distance image acquisition means 13 acquires the latest distance image generated by the distance sensor 3 (step S1). The distance image acquisition means 13 sequentially acquires distance images in which information relating to the distance from a reference position to an object in the monitored space is used as a gradation value. The distance image acquisition means 13 stores the acquired distance images in the memory unit 9 in association with the time of acquisition.

次に、２次元画像取得手段１４は、撮像装置２が生成した最新の２次元画像を取得する（ステップＳ２）。２次元画像取得手段１４は、順次取得される距離画像に対応した、監視空間内の濃淡に関する情報を階調値とする２次元画像を順次取得する。２次元画像取得手段１４は、取得した２次元画像を、取得した時刻と関連付けて記憶部９に記憶させる。このように、距離画像取得手段１３は、距離センサ３が順次生成した距離画像を順次取得し、２次元画像取得手段１４は、距離センサ３が順次生成した距離画像に対応して撮像装置２が順次生成した２次元画像を順次取得する。 Then, the two-dimensional image acquisition means 14 acquires the latest two-dimensional image generated by the imaging device 2 (step S2). The two-dimensional image acquisition means 14 sequentially acquires two-dimensional images whose gradation values are information about the shading in the monitored space, corresponding to the sequentially acquired distance images. The two-dimensional image acquisition means 14 stores the acquired two-dimensional images in the memory unit 9 in association with the time of acquisition. In this way, the distance image acquisition means 13 sequentially acquires the distance images sequentially generated by the distance sensor 3, and the two-dimensional image acquisition means 14 sequentially acquires the two-dimensional images sequentially generated by the imaging device 2 corresponding to the distance images sequentially generated by the distance sensor 3.

次に、人物領域検出手段１５は、２次元画像内で人物を含む人物領域を検出する（ステップＳ３）。 Next, the person area detection means 15 detects a person area that includes a person within the two-dimensional image (step S3).

The person area detection means 15 calculates the absolute value of the difference between the gradation value of each pixel in the two-dimensional image and the gradation value of the corresponding pixel in the background image 11 stored in the storage unit 9, and extracts the region of pixels whose absolute difference is equal to or greater than a predetermined threshold as a difference area. The person area detection means 15 groups difference areas belonging to the same object by labeling and detects them as a changed area. That is, among the difference areas extracted from a single two-dimensional image, the person area detection means 15 groups mutually adjacent (8-connected) pixels, merges groups that are close to each other (located within a predetermined range) based on their size or positional relationship, and treats the merged region as a changed area.
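The background difference and 8-connected grouping can be sketched as follows. This is an illustrative version only: the function names are hypothetical, images are plain nested lists, and the later step of merging nearby groups by size or position is left out.

```python
from collections import deque

def difference_mask(image, background, threshold):
    """True where |image - background| is at or above the threshold."""
    return [[abs(p - b) >= threshold for p, b in zip(img_row, bg_row)]
            for img_row, bg_row in zip(image, background)]

def label_8_connected(mask):
    """Return a list of regions; each region is a list of (x, y) pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited difference pixel.
            seen[y0][x0] = True
            queue, region = deque([(x0, y0)]), []
            while queue:
                x, y = queue.popleft()
                region.append((x, y))
                # Visit all 8 neighbours (the centre is already marked seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            regions.append(region)
    return regions
```

With 8-connectivity, two difference pixels that touch only diagonally end up in the same group, which a 4-connected labeling would split.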

尚、人物領域検出手段１５は、フレーム間差分を用いて変化領域を検出してもよい。その場合、人物領域検出手段１５は、最新の２次元画像内の各画素の輝度値と、直前の２次元画像内の対応する各画素の輝度値との差の絶対値を算出し、算出した差の絶対値が所定閾値以上となる画素の領域を差分領域として抽出する。 The person area detection means 15 may detect changed areas using inter-frame differences. In this case, the person area detection means 15 calculates the absolute value of the difference between the luminance value of each pixel in the latest two-dimensional image and the luminance value of each corresponding pixel in the immediately preceding two-dimensional image, and extracts, as a difference area, a pixel area where the absolute value of the calculated difference is equal to or greater than a predetermined threshold value.

次に、人物領域検出手段１５は、変化領域の大きさ、縦横比等の特徴量に基づいて、その変化領域に写っている物体が人物らしいか否かを判定する。人物領域検出手段１５は、変化領域の大きさが人物の大きさに相当する所定範囲内であり、且つ、変化領域の縦横比が人物の縦横比に相当する所定範囲内であるか否かにより、その変化領域に写っている物体が人物らしいか否かを判定する。なお、各変化領域の大きさは、２次元画像内の位置、及び、記憶部９に記憶されている撮像装置２の設置情報等を用いて実際の大きさに変換される。人物領域検出手段１５は、変化領域が人物らしい場合、その変化領域を人物領域として検出する。 Then, the person area detection means 15 judges whether the object in the changed area is likely to be a person based on the feature quantities such as the size and aspect ratio of the changed area. The person area detection means 15 judges whether the object in the changed area is likely to be a person based on whether the size of the changed area is within a predetermined range corresponding to the size of a person and whether the aspect ratio of the changed area is within a predetermined range corresponding to the aspect ratio of a person. The size of each changed area is converted to an actual size using the position in the two-dimensional image and the installation information of the imaging device 2 stored in the memory unit 9. If the changed area is likely to be a person, the person area detection means 15 detects the changed area as a person area.
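A minimal sketch of the person-likeness test follows. The numeric ranges and the single `metres_per_pixel` scale factor are placeholders invented for the example; the text instead converts sizes using the position in the image and the installation information of the imaging device 2, and gives no concrete ranges.

```python
def bounding_box(region):
    """Width and height in pixels of a region given as (x, y) pixels."""
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return max(xs) - min(xs) + 1, max(ys) - min(ys) + 1

def looks_like_person(region, metres_per_pixel,
                      height_range_m=(0.5, 2.2), aspect_range=(1.2, 5.0)):
    """True if both the real-world height and the height/width ratio of the
    region's bounding box fall inside the given (hypothetical) ranges."""
    width_px, height_px = bounding_box(region)
    height_m = height_px * metres_per_pixel
    aspect = height_px / width_px
    return (height_range_m[0] <= height_m <= height_range_m[1]
            and aspect_range[0] <= aspect <= aspect_range[1])
```

In practice the ranges would have to match the scene, for example a person lying in bed has a very different aspect ratio from a person standing.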

The person area detection means 15 may also detect a person area in the distance image in the same way as it does in the two-dimensional image. Alternatively, the person area detection means 15 may detect a person area in an image using a judgment model. In that case, the image processing device 4 stores in the storage unit 9 a judgment model trained on a plurality of learning images that contain people, using a known machine learning technique such as deep learning. The judgment model is trained in advance so that, when a learning image is input, it outputs the position of the person area contained in that image. The machine learning technique is, for example, a multi-layer neural network made up of an input layer, a plurality of intermediate layers, and an output layer. The learning image is input to the input layer. Each node in the intermediate layers extracts feature vectors from the image output by the nodes of the input layer and outputs the sum of the extracted feature vectors multiplied by weights. The output layer outputs the sum of the feature vectors output by the nodes of the intermediate layers multiplied by weights. The judgment model learns by adjusting the weights so that the difference between the output value of the output layer and the position of the person area contained in the learning image becomes small. The person area detection means 15 inputs a two-dimensional image or a distance image into the judgment model and detects the person area in that image from the value the judgment model outputs.

次に、抽出手段１６は、所定期間に生成された所定数の距離画像内で同一位置に配置された画素のグループ毎に、グループの中で階調値が最小である画素を抽出する（ステップＳ４）。所定数は２以上であり、例えば１０である。 Next, the extraction means 16 extracts the pixel with the smallest gradation value in each group of pixels that are arranged at the same position in a predetermined number of distance images generated during a predetermined period (step S4). The predetermined number is 2 or more, for example 10.

抽出手段１６は、記憶部９に記憶されている距離画像の中から、直近の所定数の距離画像を読み出す。抽出手段１６は、読み出した各距離画像の、人物領域検出手段１５により検出された人物領域に対応する領域内で、同一位置に配置された画素をグループ化する。即ち、各グループには、所定数（読み出した距離画像と同数）の画素が含まれる。なお、抽出手段１６は、読み出した各距離画像の全領域内で、同一位置に配置された画素をグループ化してもよい。抽出手段１６は、各グループの中で階調値が最小である画素、即ち対応する物体までの距離が最も短い画素を抽出する。 The extraction means 16 reads out a predetermined number of the most recent distance images from the distance images stored in the memory unit 9. The extraction means 16 groups pixels that are located at the same position within an area of each read distance image that corresponds to a person area detected by the person area detection means 15. That is, each group contains a predetermined number of pixels (the same number as the read distance image). The extraction means 16 may also group pixels that are located at the same position within the entire area of each read distance image. The extraction means 16 extracts the pixel with the smallest gradation value in each group, i.e. the pixel with the shortest distance to the corresponding object.
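
The grouping and minimum extraction of step S4 can be sketched as below, assuming each distance image is a list of rows of numeric gradation values and all images have the same size. The function name `nearest_frame_index` is hypothetical, and returning the frame index (rather than the pixel itself) is a design choice of this sketch so that the same index can later be used to look up the corresponding two-dimensional image.

```python
def nearest_frame_index(distance_images):
    """For each pixel position, return the index of the distance image whose
    gradation value (distance) is smallest among the given images."""
    rows = len(distance_images[0])
    cols = len(distance_images[0][0])
    index_map = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # One group: the pixels at the same position in every image.
            group = [frame[y][x] for frame in distance_images]
            # On ties, list.index returns the earliest image.
            index_map[y][x] = group.index(min(group))
    return index_map
```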

尚、抽出手段１６は、階調値が閾値以下である画素に限り、各グループの中で階調値が最小である画素を抽出してもよい。また、抽出手段１６は、背景及び人物よりも手前に位置する物体が撮像された画素に限り、各グループの中で階調値が最小である画素を抽出してもよい。その場合、画像処理装置４は、無人状態の監視空間内で距離を測定して生成された背景距離画像を予め記憶部９に記憶しておく。抽出手段１６は、距離画像内の各画素の内、背景距離画像内の対応する画素の階調値より小さい階調値を有する画素に限り、各グループの中で階調値が最小である画素を抽出する。さらに、抽出手段１６は、距離画像内の各画素の内、所定時間前（例えば、抽出手段１６により読み出された直近の所定数の距離画像の直前の距離画像）に人物領域検出手段１５により検出された人物領域に対応する距離画像内の領域内の各画素の階調値の平均値より小さい階調値を有する画素に限り、各グループの中で階調値が最小である画素を抽出する。各グループの中で階調値が閾値以下である画素がなかった場合、抽出手段１６は、階調値が最小である画素の代わりに、予め定められた画素（例えば最新の距離画像内の画素）を抽出する。これらにより、抽出手段１６は、動きがあった背景（例えば風で揺らいだ植物等）が撮影された画素を抽出対象から除外することができる。その結果、画像処理システム１は、人物のジェスチャをより精度良く検出することができる。尚、人物領域に対応する距離画像内の各画素の階調値の平均値は、人物領域全体の階調値の平均値ではなく、人物領域の上半身（上半分）や頭部領域の階調値の平均値としてもよい。 The extraction means 16 may extract the pixel with the smallest gradation value in each group only from pixels whose gradation value is equal to or less than the threshold value. The extraction means 16 may also extract the pixel with the smallest gradation value in each group only from pixels in which an object located in front of the background and the person is captured. In this case, the image processing device 4 stores in advance in the storage unit 9 a background distance image generated by measuring the distance in an unmanned monitored space. The extraction means 16 extracts the pixel with the smallest gradation value in each group only from pixels in the distance image that have a gradation value smaller than the gradation value of the corresponding pixel in the background distance image. Furthermore, the extraction means 16 extracts the pixel with the smallest gradation value in each group only from pixels in the distance image that have a gradation value smaller than the average gradation value of each pixel in the distance image corresponding to the person area detected by the person area detection means 15 a predetermined time ago (for example, the distance image immediately before the predetermined number of distance images read out by the extraction means 16). 
If there is no pixel in each group whose gradation value is equal to or less than the threshold value, the extraction means 16 extracts a predetermined pixel (e.g., a pixel in the latest distance image) instead of the pixel with the smallest gradation value. This allows the extraction means 16 to exclude pixels in which a moving background (e.g., plants swaying in the wind) is captured from the extraction target. As a result, the image processing system 1 can detect a person's gestures with greater accuracy. Note that the average gradation value of each pixel in the distance image corresponding to the person region may be the average gradation value of the upper body (upper half) or head region of the person region, rather than the average gradation value of the entire person region.
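
One of the restrictions above, limiting candidates to pixels nearer than a stored background distance image and falling back to the latest image otherwise, might look like this. It is a sketch under the assumption that images are lists of rows and that the last element of `distance_images` is the latest image; the function name is hypothetical.

```python
def nearest_frame_index_in_front(distance_images, background):
    """Per-pixel minimum restricted to values smaller than the background
    distance at that position. Positions with no such value fall back to
    the latest image (the last element of distance_images)."""
    latest = len(distance_images) - 1
    rows, cols = len(background), len(background[0])
    index_map = [[latest] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Keep only measurements that are in front of the background.
            candidates = [(frame[y][x], t)
                          for t, frame in enumerate(distance_images)
                          if frame[y][x] < background[y][x]]
            if candidates:
                # min() on (value, index) picks the smallest value first.
                index_map[y][x] = min(candidates)[1]
    return index_map
```

The same structure would apply to the fixed-threshold variant by replacing `background[y][x]` with a constant.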

尚、抽出手段１６は、グループ毎に抽出する画素は階調値が最小の画素でなくてもよい。例えば、抽出手段１６は、画素を抽出する際、グループの中で階調値が相対的に小さい画素を抽出してもよい。例えば、抽出手段１６は、グループの中で最小の階調値ではなく、所定番目（２番目または３番目等）に小さい階調値等、相対的に小さい階調値を有する画素を抽出する。例えば、抽出対象の画素の周囲の画素（例えば、上下左右の4近傍）との差分が所定以上である画素が所定数以上（例えば、上下左右の画素うち３つの画素との差分が所定以上）である場合、抽出対象の画素は２番目または３番目や、周囲の階調値同士で近い値の画素の中央値や平均値等、相対的に小さい階調値を有する画素を抽出する。このようにすれば、例えば、基準位置から同じ距離に位置する物体を測定しているはずが、距離センサ３のノイズ等の理由により、一時的に周囲領域の階調値とは異なる最小の階調値を有することになった画素を抽出対象から除外することができる。また、抽出手段１６は、グループ毎に画素を抽出する際、その画素の階調値に加えて、その画素に隣接する他の画素の階調値を参照して、その画素の抽出の要否を判定してもよい。この場合、抽出手段１６は、グループ毎に画素を抽出する際の指標として、その画素そのものの階調値に加えて、その画素に隣接する画素の階調値を参照する。例えば、抽出手段１６は、ある画素についての指標として、その画素の階調値と、その画素の上下左右に隣接する４つの画素の階調値から代表値（平均値、中央値、最頻値等）を算出する。更に、抽出手段１６は、距離画像取得手段により所定期間に取得された複数の距離画像内で同一位置に配置された画素に代わって、複数の画素からなる領域をグループ化してもよい。この場合、抽出手段１６は、領域毎に、その領域に属する画素の階調値の代表値を算出する。抽出手段１６は、領域に属する画素に関する抽出を行う際、その画素そのものの階調値の代わりに、その画素が属する領域の代表値を用いて、抽出する画素を選択する。 Note that the pixel extracted by the extraction means 16 for each group does not have to be the pixel with the smallest gradation value. For example, when extracting pixels, the extraction means 16 may extract pixels with relatively small gradation values in the group. For example, the extraction means 16 extracts pixels with relatively small gradation values, such as a predetermined (second or third) smallest gradation value, rather than the smallest gradation value in the group. 
For example, when at least a predetermined number of the pixels surrounding the candidate pixel (for example, its four neighbors above, below, left, and right) differ from it by a predetermined value or more (for example, three of the four neighbors differ by the predetermined value or more), the extraction means 16 extracts, instead of the minimum, a pixel with a relatively small gradation value, such as the second or third smallest, or the median or average of the surrounding pixels whose gradation values are close to one another. In this way, a pixel that should be measuring an object at the same distance from the reference position, but that temporarily takes a minimum gradation value different from that of the surrounding area because of noise in the distance sensor 3 or the like, can be excluded from extraction. Furthermore, when extracting a pixel for each group, the extraction means 16 may refer to the gradation values of adjacent pixels, in addition to the gradation value of the pixel itself, to decide whether to extract that pixel. In this case, the extraction means 16 uses the gradation values of the adjacent pixels together with that of the pixel itself as the index for extraction. For example, as the index for a given pixel, the extraction means 16 calculates a representative value (average, median, mode, etc.) from the gradation value of the pixel and the gradation values of the four pixels adjacent to it above, below, left, and right. 
Furthermore, the extraction means 16 may group regions consisting of multiple pixels instead of pixels arranged at the same position in multiple distance images acquired by the distance image acquisition means during a predetermined period. In this case, the extraction means 16 calculates a representative value of the gradation values of the pixels belonging to the region for each region. When extracting pixels belonging to the region, the extraction means 16 selects a pixel to be extracted using a representative value of the region to which the pixel belongs instead of the gradation value of the pixel itself.
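
Two of the noise-tolerant variants above can be sketched with small helpers: picking the k-th smallest value in a group instead of the minimum, and computing a representative value from a pixel and its four neighbors. Both function names are hypothetical, and the choice of the median as the representative value is only one of the options the text lists.

```python
from statistics import median

def kth_smallest_index(group, k=2):
    """Index of the k-th smallest value in a group (k=1 is the minimum).
    If the group has fewer than k values, the largest one is returned."""
    order = sorted(range(len(group)), key=lambda t: group[t])
    return order[min(k, len(group)) - 1]

def neighborhood_median(image, y, x):
    """Median of a pixel and those of its 4 neighbors (up, down, left,
    right) that lie inside the image."""
    rows, cols = len(image), len(image[0])
    values = [image[y][x]]
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < rows and 0 <= nx < cols:
            values.append(image[ny][nx])
    return median(values)
```

With these helpers, an isolated spike such as a single 0 surrounded by 9s is replaced by the value of its surroundings before the per-group comparison.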

次に、処理画像生成手段１７は、撮像装置２により所定期間に生成された２次元画像、及び／又は、距離センサ３により所定期間に生成された距離画像から処理画像を生成する（ステップＳ５）。処理画像生成手段１７は、抽出手段１６により抽出された階調値が最小の画素を含む距離画像に対応する２次元画像（グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像）の抽出された画素に対応する画素、及び／又は、距離画像にて抽出された階調値が最小の画素を用いて、所定期間に取得された複数の２次元画像、及び／又は、距離画像が合成された処理画像を生成する。 Then, the processed image generating means 17 generates a processed image from the two-dimensional images generated by the imaging device 2 during a predetermined period and/or the distance images generated by the distance sensor 3 during a predetermined period (step S5). The processed image generating means 17 generates a processed image in which multiple two-dimensional images and/or distance images acquired during a predetermined period are synthesized using pixels corresponding to the extracted pixels of the two-dimensional images corresponding to the distance images including the pixels with the smallest gradation values extracted by the extraction means 16 (two-dimensional images corresponding to the distance images including the pixels or regions specified for each group) and/or pixels with the smallest gradation values extracted in the distance images.

例えば、処理画像生成手段１７は、撮像装置２により所定期間に生成された２次元画像から処理画像を生成する。その場合、処理画像生成手段１７は、抽出手段１６によりグループ毎に抽出された各画素を含む各距離画像を抽出する。処理画像生成手段１７は、抽出した距離画像に対応する２次元画像について、抽出手段１６により抽出された画素に対応する画素の階調値を特定する。処理画像生成手段１７は、２次元画像内で特定した階調値を処理画像内のそのグループに対応する画素の階調値として設定することにより処理画像を生成する。 For example, the processed image generating means 17 generates a processed image from two-dimensional images generated by the imaging device 2 during a predetermined period. In this case, the processed image generating means 17 extracts each distance image including each pixel extracted for each group by the extraction means 16. The processed image generating means 17 identifies the gradation value of a pixel corresponding to the pixel extracted by the extraction means 16 for a two-dimensional image corresponding to the extracted distance image. The processed image generating means 17 generates a processed image by setting the gradation value identified in the two-dimensional image as the gradation value of a pixel corresponding to that group in the processed image.
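
Given a per-pixel map saying which frame supplied the nearest measurement, the two-dimensional case of step S5 reduces to a per-pixel lookup. The sketch below assumes such a map already exists (here called `index_map`, a hypothetical name) and that the two-dimensional images are aligned pixel-for-pixel with the distance images.

```python
def compose_processed_image(index_map, images):
    """Build a processed image whose pixel (y, x) is copied from the frame
    images[index_map[y][x]] at the same position."""
    return [[images[t][y][x] for x, t in enumerate(row)]
            for y, row in enumerate(index_map)]
```

For example, with `index_map = [[1, 2]]` the first pixel is copied from the second frame and the second pixel from the third frame.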

なお、抽出手段１６が領域のグループ毎に領域を抽出した場合、処理画像生成手段１７は、抽出手段１６によりグループ毎に特定された各領域を含む各距離画像を抽出する。処理画像生成手段１７は、抽出した距離画像に対応する２次元画像について、抽出手段１６により抽出された領域に対応する各画素の階調値を処理画像内のそのグループに対応する領域内の各画素の階調値として設定することにより処理画像を生成する。 When the extraction means 16 extracts areas for each group of areas, the processed image generation means 17 extracts each distance image including each area identified for each group by the extraction means 16. The processed image generation means 17 generates a processed image by setting the gradation value of each pixel corresponding to the area extracted by the extraction means 16 as the gradation value of each pixel in the area corresponding to that group in the processed image for a two-dimensional image corresponding to the extracted distance image.

また、処理画像生成手段１７は、距離センサ３により所定期間に生成された距離画像から処理画像を生成してもよい。その場合、処理画像生成手段１７は、抽出手段１６によりグループ毎に抽出された距離画像内の各画素の階調値を特定する。処理画像生成手段１７は、距離画像内で特定した階調値を処理画像内のそのグループに対応する画素の階調値として設定することにより処理画像を生成する。 The processed image generating means 17 may also generate a processed image from a distance image generated by the distance sensor 3 during a predetermined period of time. In this case, the processed image generating means 17 identifies the gradation value of each pixel in the distance image extracted for each group by the extraction means 16. The processed image generating means 17 generates a processed image by setting the gradation value identified in the distance image as the gradation value of the pixel corresponding to that group in the processed image.

また、抽出手段１６が領域のグループ毎に領域を抽出した場合、処理画像生成手段１７は、抽出手段１６によりグループ毎に特定された距離画像内の各領域に対応する各画素の階調値を処理画像内のそのグループに対応する領域内の各画素の階調値として設定することにより処理画像を生成する。 In addition, when the extraction means 16 extracts areas for each group of areas, the processed image generation means 17 generates a processed image by setting the gradation value of each pixel corresponding to each area in the distance image identified for each group by the extraction means 16 as the gradation value of each pixel in the area corresponding to that group in the processed image.

また、処理画像生成手段１７は、撮像装置２により所定期間に生成された２次元画像及び距離センサ３により所定期間に生成された距離画像から処理画像を生成してもよい。その場合、処理画像生成手段１７は、抽出手段１６によりグループ毎に抽出された各画素を含む各距離画像を抽出する。処理画像生成手段１７は、抽出した距離画像に対応する２次元画像について、抽出手段１６により抽出された画素に対応する画素の階調値を特定する。また、処理画像生成手段１７は、抽出手段１６によりグループ毎に抽出された距離画像内の各画素の階調値を特定する。処理画像生成手段１７は、２次元画像内で特定した階調値を処理画像内のそのグループに対応する画素の第１成分の階調値として設定し、距離画像内で特定した階調値を処理画像内のそのグループに対応する画素の第２成分の階調値として設定することにより処理画像を生成する。処理画像は、例えばＲＧＢ各色の成分を有する画像であり、第１成分は例えばＧ成分であり、第２成分は例えばＲ成分である。尚、第１成分、第２成分はＲＧＢ各色の成分の内の他の成分でもよい。また、第１成分、第２成分はＣＭＹの各成分の内の何れかの成分でもよい。また、第１成分、第２成分は人間の視覚に関連して定められない成分でもよい。 The processed image generating means 17 may also generate a processed image from both the two-dimensional images generated by the imaging device 2 during a predetermined period and the distance images generated by the distance sensor 3 during the predetermined period. In this case, the processed image generating means 17 extracts each distance image that includes a pixel extracted for each group by the extraction means 16. For the two-dimensional image corresponding to each extracted distance image, the processed image generating means 17 identifies the gradation value of the pixel corresponding to the pixel extracted by the extraction means 16. The processed image generating means 17 also identifies the gradation value of each pixel in the distance image extracted for each group by the extraction means 16. The processed image generating means 17 generates the processed image by setting the gradation value identified in the two-dimensional image as the gradation value of the first component of the pixel corresponding to that group in the processed image, and setting the gradation value identified in the distance image as the gradation value of the second component of that pixel. The processed image is, for example, an image having RGB color components, where the first component is, for example, the G component and the second component is, for example, the R component. 
The first and second components may be other components among the RGB color components. The first and second components may also be any of the CMY color components. The first and second components may also be components that are not determined in relation to human vision.
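
The two-component case can be sketched as follows, with the first component (G) taken from the two-dimensional image and the second component (R) taken from the distance image, as in the Figure 3 example and the description of the synthesis means. The function name, the `(R, G, B)` tuple layout, and leaving B at 0 are assumptions of this sketch; `index_map` is a hypothetical per-pixel map of the frame that supplied the nearest measurement.

```python
def compose_two_component(index_map, gray_images, distance_images):
    """Return an image of (R, G, B) tuples. For each pixel, G comes from the
    two-dimensional image and R from the distance image, both taken from the
    frame selected for that pixel."""
    processed = []
    for y, row in enumerate(index_map):
        out_row = []
        for x, t in enumerate(row):
            r = distance_images[t][y][x]  # second component: distance
            g = gray_images[t][y][x]      # first component: intensity
            out_row.append((r, g, 0))     # B is unused in this sketch
        processed.append(out_row)
    return processed
```

In practice the distance values would likely need scaling to the same range as the intensity values before being stored in a color channel; that step is omitted here.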

上述したように、抽出手段１６は、人物領域検出手段１５により検出された人物領域に対応するグループに限り、グループの中で階調値が最小である画素を抽出している。即ち、処理画像生成手段１７は、検出された人物領域に対応するグループに限り、グループ毎に抽出された画素及び／又はその画素に対応する２次元画像内の画素の階調値を、そのグループに対応する画素の階調値とするように処理画像を生成する。処理画像生成手段１７は、検出された人物領域に対応しない画素については、予め定められた画像（例えば最新の２次元画像及び／又は距離画像）内の画素の階調値を、そのグループに対応する画素の階調値とする。これにより、処理画像生成手段１７は、処理画像を生成する処理の負荷を軽減させるとともに、人物に対応する領域に限定して複数の画像を合成した処理画像を生成することができる。なお、抽出手段１６及び処理画像生成手段１７の両方を含むものを合成手段と呼ぶ。合成手段は、グループ毎に特定された画素又は領域を含む距離画像に対応する二次元画像内の、特定された画素又は領域に対応する画素又は領域を用いて、所定期間に取得された複数の二次元画像が合成された処理画像を生成する。特に、合成手段は、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域の階調値を、そのグループに対応する画素又は領域の階調値として処理画像を生成する。または、合成手段は、グループ毎に特定された画素又は領域を用いて、所定期間に取得された複数の距離画像を合成して処理画像を生成する。また、合成手段は、グループ毎に特定された画素又は領域の階調値を、そのグループに対応する画素又は領域の階調値として処理画像を生成する。または、合成手段は、グループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域の階調値を、そのグループに対応する画素又は領域の第１成分の階調値とし、距離画像内でそのグループ毎に特定された画素又は領域の階調値を、そのグループに対応する画素又は領域の第２成分の階調値とするように処理画像を生成する。合成手段は、検出された人物領域に対応するグループに限り、そのグループ毎に特定された画素又は領域を含む距離画像に対応する２次元画像内の、特定された画素又は領域に対応する画素又は領域を用いて、処理画像を生成する。 As described above, the extraction means 16 extracts pixels with the smallest gradation value from the group only for the group corresponding to the person area detected by the person area detection means 15. That is, the processed image generation means 17 generates a processed image only for the group corresponding to the detected person area so that the gradation value of the pixel extracted for each group and/or the pixel in the two-dimensional image corresponding to the pixel is set to the gradation value of the pixel corresponding to the group. For pixels not corresponding to the detected person area, the processed image generation means 17 sets the gradation value of the pixel in a predetermined image (e.g., the latest two-dimensional image and/or distance image) to the gradation value of the pixel corresponding to the group. 
In this way, the processed image generation means 17 can reduce the processing load of generating the processed image and can generate a processed image in which multiple images are synthesized only in the area corresponding to the person. Note that the extraction means 16 and the processed image generation means 17 are together referred to as the synthesis means. The synthesis means generates a processed image in which multiple two-dimensional images acquired during a predetermined period are synthesized, using, in the two-dimensional image corresponding to the distance image that includes the pixel or region specified for each group, the pixel or region corresponding to the specified pixel or region. In particular, the synthesis means generates the processed image by using the gradation value of that corresponding pixel or region in the two-dimensional image as the gradation value of the pixel or region corresponding to the group. Alternatively, the synthesis means generates a processed image by synthesizing a plurality of distance images acquired during a predetermined period using the pixel or region specified for each group. In that case, the synthesis means uses the gradation value of the pixel or region specified for each group as the gradation value of the pixel or region corresponding to the group. Alternatively, the synthesis means generates a processed image by using the gradation value of the corresponding pixel or region in the two-dimensional image as the gradation value of the first component of the pixel or region corresponding to the group, and the gradation value of the pixel or region specified for each group in the distance image as the gradation value of the second component of the pixel or region corresponding to the group. 
The synthesis means generates a processed image by using the pixel or region corresponding to a specified pixel or region in a two-dimensional image corresponding to a distance image including a pixel or region specified for each group only for the group corresponding to the detected person region.

図３は、距離画像、２次元画像及び処理画像の対応関係について説明するための図である。図３には、時刻Ｔ１、Ｔ２、Ｔ３にそれぞれ生成された距離画像Ｄ１～Ｄ３及び２次元画像Ｅ１～Ｅ３と、距離画像Ｄ１～Ｄ３及び２次元画像Ｅ１～Ｅ３から生成された処理画像Ｆ３が示されている。 Figure 3 is a diagram for explaining the correspondence between distance images, two-dimensional images, and processed images. Figure 3 shows distance images D1-D3 and two-dimensional images E1-E3 generated at times T1, T2, and T3, respectively, and a processed image F3 generated from the distance images D1-D3 and two-dimensional images E1-E3.

図３に示した例において、距離画像Ｄ１～Ｄ３の各画素Ｐ１及び各画素Ｐ２はそれぞれ同一位置に配置されており、同一グループに分類される。仮に、距離画像Ｄ１～Ｄ３の各画素Ｐ１の中で階調値が最小である画素が距離画像Ｄ１の画素Ｐ１であり、距離画像Ｄ１～Ｄ３の各画素Ｐ２の中で階調値が最小である画素が距離画像Ｄ３の画素Ｐ２であるものとする。その場合、距離画像Ｄ１に対応する２次元画像Ｅ１の画素Ｐ１の階調値が処理画像Ｆ３の画素Ｐ１の第１成分（Ｇ成分）の階調値として設定され、距離画像Ｄ１の画素Ｐ１の階調値が処理画像Ｆ３の画素Ｐ１の第２成分（Ｒ成分）の階調値として設定される。また、距離画像Ｄ３に対応する２次元画像Ｅ３の画素Ｐ２の階調値が処理画像Ｆ３の画素Ｐ２の第１成分（Ｇ成分）の階調値として設定され、距離画像Ｄ３の画素Ｐ２の階調値が処理画像Ｆ３の画素Ｐ２の第２成分（Ｒ成分）の階調値として設定される。 In the example shown in FIG. 3, the pixels P1 of distance images D1 to D3 are all located at one position and the pixels P2 are all located at another position, so the pixels P1 form one group and the pixels P2 form another. Suppose that, among the pixels P1 of distance images D1 to D3, the one with the smallest gradation value is pixel P1 of distance image D1, and that, among the pixels P2 of distance images D1 to D3, the one with the smallest gradation value is pixel P2 of distance image D3. In this case, the gradation value of pixel P1 in two-dimensional image E1, which corresponds to distance image D1, is set as the gradation value of the first component (G component) of pixel P1 in processed image F3, and the gradation value of pixel P1 in distance image D1 is set as the gradation value of the second component (R component) of pixel P1 in processed image F3. Likewise, the gradation value of pixel P2 in two-dimensional image E3, which corresponds to distance image D3, is set as the gradation value of the first component (G component) of pixel P2 in processed image F3, and the gradation value of pixel P2 in distance image D3 is set as the gradation value of the second component (R component) of pixel P2 in processed image F3.

図４は、２次元画像から生成される処理画像の一例を示す。２次元画像２１～２３は、時刻Ｔ１、Ｔ２、Ｔ３の各時刻において、監視空間内で人物が撮像装置２に向けて手を振っている状況を撮像した画像である。一般に、人物が所定位置に向けて手を振る場合、その人物は手を所定位置側に押し出して手を振る。そのため、手は背景又は人物より所定位置に近い側に配置される。したがって、処理画像２４は、２次元画像２１～２３内でそれぞれ手が写っている領域２５～２７が含まれるように生成される。 Figure 4 shows an example of a processed image generated from a two-dimensional image. Two-dimensional images 21 to 23 are images captured at times T1, T2, and T3 of a person waving his or her hand towards the imaging device 2 in the monitored space. Generally, when a person waves his or her hand towards a specific position, the person waves by pushing their hand towards the specific position. Therefore, the hand is positioned closer to the specific position than the background or the person. Therefore, processed image 24 is generated so as to include areas 25 to 27 in which the hands are captured in two-dimensional images 21 to 23, respectively.

図５は、距離画像から生成される処理画像の一例を示す。距離画像３１～３３は、時刻Ｔ１、Ｔ２、Ｔ３の各時刻において、監視空間内で人物が距離センサ３に向けて手を振っている状況が測定されて生成された距離画像である。一般に、人物が所定位置に向けて手を振る場合、その人物は手を所定位置側に押し出して手を振る。そのため、手は背景又は人物より所定位置に近い側に配置される（図５において、色が濃くなるほど近い）。したがって、処理画像３４は、距離画像３１～３３内でそれぞれ手が写っている領域３５～３７が背景や人物とは異なる階調値で生成される。 Figure 5 shows an example of a processed image generated from distance images. Distance images 31 to 33 are generated by measuring, at times T1, T2, and T3, a person in the monitored space waving a hand toward the distance sensor 3. Generally, when a person waves a hand toward a specific position, the person pushes the hand out toward that position while waving. The hand is therefore located closer to that position than the background or the rest of the person (in Figure 5, the darker the color, the closer). As a result, processed image 34 is generated so that areas 35 to 37, where the hand appears in distance images 31 to 33, have gradation values different from those of the background and the person.

図６は、２次元画像及び距離画像から生成された処理画像の一例である。この処理画像では、距離画像から抽出した階調値がＲ成分の階調値として設定され、２次元画像から抽出した階調値がＧ成分の階調値として設定されている。一般に、撮像装置から物体までの距離が短いほど、その物体が写っている画像は明瞭になり、撮像装置から物体までの距離が長いほどその物体が写っている画像がぼやけて、物体のエッジが不明瞭になる。そのため、この処理画像では、撮像装置から離れた背景について、２次元画像から抽出されたＧ成分はぼやけてしまっている。しかしながら、この背景のエッジは、距離画像から抽出されたＲ成分によって明瞭となっている。一方、この処理画像では、撮像装置の近傍に存在する人物について、２次元画像から抽出されたＧ成分により、人物の服装の質感等のテクスチャが明瞭となり、人物が手を振っている様子、及び、肘を支点として少しずつ動いている腕の姿勢が明瞭に表現されている。このように、画像処理システム１は、２次元画像及び距離画像から処理画像を生成することにより、２次元画像において失われやすい遠方の細部に関する情報を、距離情報によって補完して、背景のエッジを明瞭化することができる。 Figure 6 is an example of a processed image generated from a two-dimensional image and a distance image. In this processed image, the gradation value extracted from the distance image is set as the gradation value of the R component, and the gradation value extracted from the two-dimensional image is set as the gradation value of the G component. In general, the shorter the distance from the imaging device to the object, the clearer the image in which the object is captured, and the longer the distance from the imaging device to the object, the blurrier the image in which the object is captured, and the unclearer the edges of the object. Therefore, in this processed image, the G component extracted from the two-dimensional image is blurred for the background away from the imaging device. However, the edges of this background are made clear by the R component extracted from the distance image. On the other hand, in this processed image, for a person present near the imaging device, the G component extracted from the two-dimensional image makes the texture of the person's clothing, etc. clear, and the appearance of the person waving his/her hand and the posture of his/her arm moving little by little with his/her elbow as the fulcrum are clearly expressed. 
In this way, by generating a processed image from a two-dimensional image and a distance image, image processing system 1 can complement information about distant details that are easily lost in two-dimensional images with distance information, thereby clarifying the edges of the background.

図７は、２次元画像から生成された処理画像の一例である。図７に示すように、このように生成された処理画像には、濃淡に関する情報によって、人物のテクスチャが明瞭となり、人物が手を振っている様子、及び、肘を支点として少しずつ動いている腕の姿勢が明瞭に表現されている。 Figure 7 is an example of a processed image generated from two-dimensional images. As shown in Figure 7, the shading information in a processed image generated in this way makes the texture of the person clear, and the image clearly shows the person waving a hand and the posture of the arm moving little by little about the elbow.

図８は、距離画像から生成された処理画像の一例である。図８に示すように、このように生成された処理画像には、距離に関する情報が含まれるため、背景と人物とのエッジが明瞭に表現され、さらに背景及び人物と手のエッジも明瞭に表現されている。 Figure 8 is an example of a processed image generated from distance images. As shown in Figure 8, a processed image generated in this way contains distance information, so the edges between the background and the person are clearly depicted, as are the edges between the hand and both the background and the person.

次に、検出手段１８は、処理画像生成手段１７により生成された処理画像についての評価値を取得する（ステップＳ６）。検出手段１８は、入力された学習用処理画像に含まれる人物のジェスチャ動作に関する情報を出力するように学習されたモデル１０に処理画像を入力し、モデル１０から出力された情報に基づいて、監視空間内の人物のジェスチャ動作を検出する。 Then, the detection means 18 obtains an evaluation value for the processed image generated by the processed image generation means 17 (step S6). The detection means 18 inputs the processed image to the model 10, which has been trained to output information about the gesture movements of a person contained in the input learning processed image, and detects the gesture movements of a person in the monitored space based on the information output from the model 10.

例えば、検出手段１８は、記憶部９に記憶されたモデル１０を用いて、処理画像についての評価値を取得する。モデル１０は、学習手段２０により生成される。学習手段２０は、例えばディープラーニング等の公知の機械学習技術を用いて、複数の学習用処理画像と、各学習用処理画像に検出対象のジェスチャが含まれている確からしさを示す評価値及び検出対象のジェスチャ動作が含まれる領域の位置との関係性を学習する。検出対象のジェスチャは、例えば手を振る動作である。特に、検出手段１８は、人の手など人体の一部を用いて行われるジェスチャ動作のうち人体の身体の前方（距離画像の階調値が人体の階調値よりも小さい領域）でなされた所定のジェスチャ動作を検出する。なお、検出対象のジェスチャは、手招き等の周期的な動作でもよい。また、検出対象のジェスチャは、複数でもよく、例えば手を振る動作及び手招きする動作の両方でもよい。各学習用処理画像は、様々な状態（立ち上がった状態、座った状態又は横たわった状態等）の物体による様々な大きさのジェスチャが含まれる画像又はジェスチャが含まれない画像から、処理画像と同様にして生成された画像である。学習手段２０は、学習した関係性をモデル１０として記憶部９に記憶する。 For example, the detection means 18 obtains an evaluation value for the processed image using the model 10 stored in the storage unit 9. The model 10 is generated by the learning means 20. The learning means 20 learns the relationship between the multiple learning processed images, the evaluation value indicating the likelihood that the gesture to be detected is included in each learning processed image, and the position of the area including the gesture to be detected, using a known machine learning technique such as deep learning. The gesture to be detected is, for example, a waving motion. In particular, the detection means 18 detects a predetermined gesture motion made in front of the human body (an area where the gradation value of the distance image is smaller than the gradation value of the human body) among gesture motions made using a part of the human body such as a human hand. The gesture to be detected may be a periodic motion such as a beckoning motion. In addition, the gesture to be detected may be multiple, and may be, for example, both a waving motion and a beckoning motion. Each learning processing image is an image generated in the same manner as the processing image from images including gestures of various sizes by objects in various states (standing, sitting, lying, etc.) or images including no gestures. The learning means 20 stores the learned relationships as a model 10 in the storage unit 9.

入力層には、学習用処理画像が入力される。中間層の各ノードは、入力層の各ノードから出力された画像から特徴ベクトルを抽出し、抽出した各特徴ベクトルに重みを乗算した値の総和を出力する。出力層は、中間層の各ノードから出力された各特徴ベクトルに重みを乗算した値の総和を出力する。学習手段２０は、各重みを調整しながら、出力層からの出力値と、正解値、及び、検出対象のジェスチャが含まれる領域の位置との差分が小さくなるように学習する。正解値は、例えばその学習用処理画像に検出対象のジェスチャが含まれる場合は１に設定され、検出対象のジェスチャが含まれない場合は０に設定される。なお、モデル１０は、ＤＰＭ（Deformable Part Model）、Ｒ－ＣＮＮ（Regions with Convolutional Neural Networks）、ＹＯＬＯ等の他の機械学習技術により学習されてもよい。また、モデル１０は、画像処理装置４とは別の外部のコンピュータで生成され、画像処理装置４に送信されてもよい。その場合、学習手段２０は省略されてもよい。 The learning processed image is input to the input layer. Each node of the intermediate layer extracts a feature vector from the image output from each node of the input layer, and outputs the sum of the values obtained by multiplying each extracted feature vector by a weight. The output layer outputs the sum of the values obtained by multiplying each feature vector output from each node of the intermediate layer by a weight. The learning means 20 learns by adjusting each weight so that the differences between the output values from the output layer and both the correct answer value and the position of the area including the gesture to be detected become small. The correct answer value is set, for example, to 1 when the learning processed image includes the gesture to be detected and to 0 when it does not. Note that the model 10 may be trained by other machine learning techniques such as DPM (Deformable Part Model), R-CNN (Regions with Convolutional Neural Networks), and YOLO. The model 10 may also be generated by an external computer other than the image processing device 4 and transmitted to the image processing device 4. In that case, the learning means 20 may be omitted.

検出手段１８は、記憶部９に記憶されたモデル１０に、処理画像を入力し、モデル１０から出力された出力値を処理画像についての評価値及びジェスチャが検出された領域の位置として取得する。 The detection means 18 inputs the processed image into the model 10 stored in the memory unit 9, and obtains the output value output from the model 10 as an evaluation value for the processed image and the position of the area where the gesture was detected.

尚、検出手段１８は、処理画像内で動きがある領域を切り出し、切り出した画像をモデル１０に入力して評価値を算出してもよい。静止している物体では、距離センサ３からの距離が変化しないため、その物体内の位置毎に抽出手段１６によって階調値が最小である画素が抽出される距離画像に、ばらつきが発生する可能性がある。一方、動いている物体では、物体内の全領域について、階調値が最小である画素は一つの距離画像からまとめて抽出される可能性が高い。そこで、検出手段１８は、所定期間に生成された複数の距離画像毎に、各距離画像からステップＳ４で抽出手段１６により抽出された画素を特定する。検出手段１８は、各距離画像内で特定した画素の内、相互に密に隣接しながら連結し且つ所定サイズ以上である画素の領域に対応する処理画像内の領域を動きがある領域として検出する。これにより、検出手段１８は、動きがある領域に限定してジェスチャを検出することができ、ジェスチャをより精度良く検出することができる。 The detection means 18 may cut out an area in the processed image where there is movement, and input the cut-out image to the model 10 to calculate the evaluation value. For a stationary object, the distance from the distance sensor 3 does not change, so there is a possibility that the distance image in which the pixel with the smallest gradation value is extracted by the extraction means 16 for each position in the object may vary. On the other hand, for a moving object, it is highly likely that the pixel with the smallest gradation value is extracted collectively from one distance image for the entire area in the object. Therefore, the detection means 18 identifies the pixel extracted by the extraction means 16 in step S4 from each distance image for each of the multiple distance images generated during a predetermined period. The detection means 18 detects, as an area in the processed image where there is movement, an area of pixels that are closely adjacent to each other, connected, and have a predetermined size or more, among the pixels identified in each distance image. This allows the detection means 18 to detect gestures only in areas where there is movement, and to detect gestures more accurately.
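
The motion-region idea above (pixels drawn from the same distance image that form a connected area of at least a given size) can be sketched with a breadth-first flood fill. This is an illustration under assumptions: `index_map` is a hypothetical per-pixel map of the frame from which each pixel was extracted, connectivity is taken as 4-neighbor, and `min_size` is an arbitrary example value.

```python
from collections import deque

def motion_mask(index_map, min_size=4):
    """Mark pixels that belong to 4-connected regions of at least min_size
    pixels whose pixels were all extracted from the same frame."""
    rows, cols = len(index_map), len(index_map[0])
    mask = [[False] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for sy in range(rows):
        for sx in range(cols):
            if seen[sy][sx]:
                continue
            frame = index_map[sy][sx]
            seen[sy][sx] = True
            region, queue = [], deque([(sy, sx)])
            while queue:  # flood fill over pixels from the same frame
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and index_map[ny][nx] == frame):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_size:
                for y, x in region:
                    mask[y][x] = True
    return mask
```

The sketch does not distinguish pixels that were filled by a fallback rule (for example, pixels outside the person area that default to the latest image); a real implementation would need to exclude those before labeling regions.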

The detection means 18 may also calculate the evaluation value using a pattern matching technique. In that case, the image processing device 4 stores in advance, in the storage unit 9, a plurality of image patterns in which the detection target gesture appears in sample processed images. The detection means 18 cuts out a region of a predetermined size from the processed image generated in step S5 while shifting the position of the region, and acquires, as the evaluation value, the degree of similarity between the cut-out region and the image patterns stored in the storage unit 9. The degree of similarity is, for example, a normalized cross-correlation value.
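
As a non-limiting illustration, the sliding-window pattern matching described above can be sketched as follows. This is a minimal sketch assuming the zero-mean variant of normalized cross-correlation and a single stored pattern; the function names `ncc` and `best_match` are hypothetical and do not appear in the embodiment.

```python
import math


def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-sized 2D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a)
                    * sum((q - mean_b) ** 2 for q in b))
    # A flat patch or template has no variance; treat it as no similarity.
    return num / den if den > 0 else 0.0


def best_match(image, template):
    """Slide the template over the image; return (best score, (y, x))."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = float("-inf"), (0, 0)
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos
```

With several stored patterns, one option would be to take the maximum score over all patterns as the evaluation value; the embodiment does not specify how the scores are combined.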

Next, the detection means 18 detects the detection target gesture of a person based on the acquired evaluation value (step S7). When the evaluation value is equal to or greater than a predetermined threshold, the detection means 18 determines that the processed image contains the detection target gesture; when the evaluation value is less than the threshold, it determines that the processed image does not contain the detection target gesture. In this way, the detection means 18 detects the gesture of a person in the monitored space based on the processed image. In particular, the detection means 18 inputs the processed image into the model 10, which has been trained to output, when a processed image for learning is input, information on the gesture of the person contained in that processed image for learning, and detects the gesture of a person in the monitored space based on the information output from the model 10.

Next, the detection means 18 determines whether or not the detection target gesture has been detected (step S8). If the detection target gesture has not been detected, the detection means 18 ends the series of steps without performing any further processing.

If, on the other hand, the detection target gesture has been detected, the detection means 18 determines whether or not a person is present in the vicinity of the region in which the gesture was detected (step S9). The detection means 18 calculates the distance between the position of the region in which the gesture was detected, acquired in step S6, and the region of the processed image corresponding to the person region detected in step S3. If the calculated distance is less than a predetermined distance threshold, the detection means 18 determines that a person is present in the vicinity of the region in which the gesture was detected, and that the detected gesture was made by a person. If the calculated distance is equal to or greater than the distance threshold, the detection means 18 determines that no person is present in the vicinity of the region in which the gesture was detected and that the detected gesture was not made by a person, and ends the series of steps. This prevents the detection means 18 from erroneously detecting the movement of an object other than a person in the monitored space as the detection target gesture.
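
As a non-limiting illustration, the decisions of steps S7 to S9 can be sketched as follows. The embodiment does not specify how the distance between the two regions is measured. This minimal sketch assumes that each region is an axis-aligned box `(x0, y0, x1, y1)` and that the distance is the smallest gap between the boxes; the function and parameter names are hypothetical.

```python
import math


def box_gap(a, b):
    """Smallest Euclidean distance between two boxes (x0, y0, x1, y1).

    Returns 0 when the boxes touch or overlap. Measuring the distance
    as the gap between boxes is an assumption of this sketch.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return math.hypot(dx, dy)


def is_person_gesture(score, gesture_box, person_boxes,
                      score_threshold, distance_threshold):
    """Apply the evaluation-value test (S7/S8), then the proximity test (S9)."""
    if score < score_threshold:
        return False  # gesture not detected; nothing further to do
    # A gesture counts as made by a person only if some person region
    # lies strictly closer than the distance threshold.
    return any(box_gap(gesture_box, p) < distance_threshold
               for p in person_boxes)
```

For example, `is_person_gesture(0.9, (0, 0, 10, 10), [(13, 14, 20, 20)], 0.5, 6)` passes the evaluation-value test and then compares a gap of 5 against a distance threshold of 6.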

When it is determined that a person is present in the vicinity of the region in which the gesture was detected, the output control means 19 outputs information on the detected gesture via the communication unit 8 to notify an external device (step S10), and ends the series of steps. The information on the gesture includes the fact that a detection target gesture made by a person has been detected, the type of the gesture, the time at which the gesture was detected, the duration of the gesture, the region in which the gesture was detected, and the like. The output control means 19 may also display the information on the detected gesture on the display unit 7, or output it from a sound output device (not shown).

The processing of step S3 may be omitted. In that case, the extraction means 16 performs the processing of step S4 on all pixels in the distance images and the two-dimensional images.

The processing of step S9 may also be omitted. In that case, the model 10 may be trained in advance to output an evaluation value indicating the likelihood that each processed image for learning contains a detection target gesture made by a person, and the detection means 18 may detect a gesture made by a person using the model 10.

(Effects of Image Processing System 1)
As described above, the image processing system 1 detects the detection target gesture based on a processed image generated using the pixels that were captured or measured when an object in the monitored space was nearest to the sensor within a predetermined period. This allows the image processing system 1 to focus on regions in which movement occurred on the near side, and to accurately detect gestures whose movement occurs on the near side. The image processing system 1 can therefore accurately detect the gesture of a person in the monitored space. In particular, when a person being watched over is asked to hold a hand out in front of the body and wave it, or make a similar motion, as a means of calling a nurse, the image processing system 1 can accurately detect that motion and notify the caregiver.
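
As a non-limiting illustration, the generation of the processed image summarized above can be sketched as follows. This is a minimal sketch assuming that a smaller gradation value in a distance image means a shorter distance, that each distance image is paired with a two-dimensional image of the same size, and that every distance value is valid; pixels for which the distance sensor 3 cannot measure a distance are not handled here. The name `synthesize` is hypothetical.

```python
def synthesize(distance_images, intensity_images):
    """Build a processed image from the nearest observation per pixel.

    distance_images and intensity_images are equally long lists of 2D
    lists of the same height and width. For each pixel position, the
    frame with the smallest distance value is selected and the pixel of
    the paired two-dimensional image is copied. Ties go to the earliest
    frame. Also returns the map of selected frame indices.
    """
    h, w = len(distance_images[0]), len(distance_images[0][0])
    processed = [[0] * w for _ in range(h)]
    source_index = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Index of the frame in which this position was nearest.
            t = min(range(len(distance_images)),
                    key=lambda i: distance_images[i][y][x])
            source_index[y][x] = t
            processed[y][x] = intensity_images[t][y][x]
    return processed, source_index
```

The returned index map records which distance image each pixel came from, which is the information needed to look for regions whose pixels were all extracted from the same distance image.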

The image processing system 1 also generates the processed image based on both the distance images and the two-dimensional images. By generating the processed image based on the two-dimensional images, the image processing system 1 can include information on the shape and texture of objects in the processed image. In addition, although the distance sensor 3 may be unable to measure the distance to an object with low reflectance for near-infrared light, the image processing system 1 can still generate a highly reliable processed image by generating it based on the two-dimensional images. Conversely, by generating the processed image based on the distance images, the image processing system 1 can generate a processed image in which the background and the person are clearly distinguished even when their luminance is similar. Because luminance and distance thus play complementary roles in generating the processed image, the image processing system 1 can detect gestures more accurately by using a processed image generated based on both the distance images and the two-dimensional images. Furthermore, by using the distance images, the image processing system 1 can generate a processed image in which the person and the hand are clearly separated even when the hand moves while overlapping the person, and can therefore detect gestures more accurately.

1 Image processing system, 2 Imaging device, 3 Distance sensor, 4 Image processing device, 8 Communication unit, 9 Storage unit, 12 Processing unit

Claims

An image processing system comprising:
a distance image acquisition means for sequentially acquiring distance images in which information regarding a distance from a reference position to an object in a monitored space is used as a gradation value;
a two-dimensional image acquisition means for sequentially acquiring two-dimensional images, the two-dimensional images having gradation values representing information on shading within the monitored space, corresponding to the sequentially acquired distance images; and
a synthesis means for identifying, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired by the distance image acquisition means during a predetermined period of time, a pixel or region having a relatively small gradation value within the group, and generating a processed image in which the plurality of two-dimensional images acquired during the predetermined period of time are synthesized using a pixel or region corresponding to the identified pixel or region in a two-dimensional image corresponding to the distance image including the pixel or region identified for each group.

The image processing system according to claim 1, wherein the synthesis means generates the processed image by using the gradation value of a pixel or region corresponding to the identified pixel or region in a two-dimensional image corresponding to a distance image including the pixel or region identified for each group as the gradation value of the pixel or region corresponding to the group.

The image processing system according to claim 1 or 2, wherein the synthesis means generates the processed image such that the gradation value of a pixel or region corresponding to the identified pixel or region in a two-dimensional image corresponding to a distance image including the pixel or region identified for each group is set to the gradation value of the first component of the pixel or region corresponding to the group, and the gradation value of a pixel or region identified for each group in the distance image is set to the gradation value of the second component of the pixel or region corresponding to the group.

The image processing system according to any one of claims 1 to 3, further comprising a detection means for inputting the processed image into a model trained to output, when a processed image for learning is input, information regarding the gesture of a person contained in the processed image for learning, and detecting the gesture of a person in the monitored space based on information output from the model.

The image processing system according to any one of claims 1 to 4, further comprising a person area detection unit that detects a person area including a person in the distance image or the two-dimensional image,
wherein the synthesis means generates the processed image using pixels or areas corresponding to the identified pixels or areas in a two-dimensional image corresponding to a distance image including the pixels or areas identified for each group, only for the group corresponding to the detected person area.

The image processing system according to claim 1, further comprising a detection unit that detects, based on the processed image, a predetermined gesture made in front of a person present within the monitored space.

An image processing system comprising:
a distance image acquisition means for sequentially acquiring distance images in which information regarding a distance from a reference position to an object in a monitored space is used as a gradation value; and
a processed image generating means for identifying, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired by the distance image acquisition means during a predetermined period, a pixel or region having a gradation value indicating a relatively short distance within the group, and generating a processed image by synthesizing the plurality of distance images acquired during the predetermined period using the gradation values of the pixels or regions identified for each group.

A control program that causes a computer to execute:
sequentially acquiring distance images in which information regarding the distance from a reference position to an object in a monitored space is used as a gradation value;
sequentially acquiring two-dimensional images in which information regarding shading in the monitored space is used as a gradation value, corresponding to the sequentially acquired distance images;
identifying, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired during a predetermined period, a pixel or region having a relatively small gradation value within the group; and
generating a processed image by synthesizing a plurality of two-dimensional images acquired during the predetermined period using pixels or regions corresponding to the identified pixels or regions in a two-dimensional image corresponding to a distance image including the pixels or regions identified for each group.

A control program that causes a computer to execute:
sequentially acquiring distance images in which information regarding the distance from a reference position to an object in a monitored space is used as a gradation value;
identifying, for each group of pixels or regions arranged at the same position in a plurality of distance images acquired during a predetermined period, a pixel or region having a gradation value indicating a relatively short distance within the group; and
generating a processed image by synthesizing the plurality of distance images acquired during the predetermined period using the gradation values of the pixels or regions identified for each group.