JP7611535B2

JP7611535B2 - Method, device, computer-readable storage medium, and electronic device for multi-source heterogeneous data fusion using deep learning

Info

Publication number: JP7611535B2
Application number: JP2023051743A
Authority: JP
Inventors: 劉東昇; 劉彦▲に▼; 王黎明; 陳亜輝
Original assignee: Zhejiang Gongshang University
Current assignee: Zhejiang Gongshang University
Priority date: 2022-08-02
Filing date: 2023-03-28
Publication date: 2025-01-10
Anticipated expiration: 2043-03-28
Also published as: JP2024021037A; CN115471719A; CN115471719B

Description

The present invention relates to the field of image processing technology, and in particular to a multi-source heterogeneous data fusion method based on deep learning.

Multi-source data may contain information about the same object. That is, information about one object may be recorded in different types of multi-source data, each carried in a different format. The different data represent different aspects of the same object. How to fuse multi-source data so that the same object is represented in multiple dimensions, or how to fully fuse multi-source data to extract the information about the same object and build other applications on the fusion result, has long been an important research problem in the field of image processing.

The present invention aims to solve at least one of the technical problems existing in the prior art.

Text-type scene information and a picture-type image are acquired. The scene information describes the scene of the image and includes at least first object information characterizing the scene and second object information characterizing the target sub-objects associated with the scene.
Based on the first object information, a model cluster is determined. The model cluster includes a segmentation model for each sub-object in the scene characterized by the first object information.
Based on the second object information, the target model corresponding to each target sub-object is determined within the model cluster.
The image is coarsely segmented based on each target model to obtain a first image corresponding to each target sub-object.
For each first image, segmentation based on information aggregation associated with the corresponding target sub-object is performed to obtain a fine segmentation result corresponding to that first image.
A main image corresponding to the image is rendered based on the fine segmentation results. The main image characterizes the fusion result of the scene information and the image.
In one embodiment, the fine segmentation result includes a mask matrix, and performing the information-aggregation-based segmentation on each first image to obtain the corresponding fine segmentation result includes:
For each first image, the first image is input to the corresponding aggregated information extractor, which performs aggregated information extraction on the target sub-object in the first image to obtain fused feature information corresponding to the first image. The aggregated information extractor is obtained by training based on the detector corresponding to the target sub-object of the first image.
The fused feature information is input to the segmenter corresponding to the target sub-object of the first image to obtain the mask matrix.

The main image is rendered based on the image and the mask matrix corresponding to each first image.
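The rendering step can be illustrated with a minimal NumPy sketch. It assumes that each mask matrix is a binary array already mapped back to the coordinates of the full image, and that pixels outside every mask are filled with a constant background; the source fixes neither choice, and the function name `render_main_image` is hypothetical.

```python
import numpy as np

def render_main_image(image, masks, background=0):
    """Keep pixels covered by any mask; fill the rest with `background`.

    Assumption: every mask is an H x W array of 0/1 values aligned with `image`.
    """
    # Union of all per-object binary masks.
    union = np.zeros(image.shape[:2], dtype=bool)
    for mask in masks:
        union |= mask.astype(bool)
    # Start from a uniform background and copy over the masked pixels.
    main_image = np.full_like(image, background)
    main_image[union] = image[union]
    return main_image

# Toy 4 x 4 RGB image with two single-object masks.
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
mask_a = np.zeros((4, 4), dtype=np.uint8)
mask_a[0:2, 0:2] = 1
mask_b = np.zeros((4, 4), dtype=np.uint8)
mask_b[3, 3] = 1
main_image = render_main_image(image, [mask_a, mask_b])
```

Taking the union of the masks is one simple way to combine several target sub-objects into a single rendered image; other compositions (for example one layer per sub-object) would fit the description equally well.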

The aggregated information extractor consists mainly of an extraction network and a mask generation network. The mask generation network generates a target mask that distinguishes sub-objects from non-sub-objects in the image input to the aggregated information extractor. The extraction network and the mask generation network are trained as follows:
A sample image and the detector corresponding to the type of sub-object in the sample image are acquired. The sample image carries labeling information characterizing the position of the sub-object within the sample image, and contains only a single type of sub-object.
The sample image is input to the extraction network, which performs the following operations for each pixel region in the sample image: extract a sample first feature corresponding to the pixel region; extract an associated position feature for each associated pixel position; and fuse the associated position features to obtain a sample second feature corresponding to the pixel region. The associated pixel positions are the pixel positions in the neighborhood of the pixel region that do not belong to the pixel region itself.
The sample first feature and the sample second feature corresponding to each pixel region are input to the mask generation network to obtain a sample target mask.
For each pixel region, the sample first features other than the one corresponding to that pixel region are input to the detector to obtain a first detection result for the pixel region, and the sample second features other than the one corresponding to that pixel region are input to the detector to obtain a second detection result for the pixel region.
An indicative target mask is determined based on the first detection results, the second detection results, and the labeling information.
The parameters of the extraction network and the mask generation network are adjusted according to the difference between the sample target mask and the indicative target mask until training is complete.
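The source does not say how the difference between the predicted sample target mask and the detector-derived target mask is measured. As one possible choice, the sketch below uses mean binary cross-entropy; this loss, the function name `mask_difference`, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np

def mask_difference(sample_mask, target_mask, eps=1e-7):
    """Mean binary cross-entropy between a soft predicted mask and a 0/1 target mask.

    Assumption: the patent only speaks of a "difference"; BCE is one common option.
    """
    p = np.clip(sample_mask, eps, 1.0 - eps)  # avoid log(0)
    t = target_mask.astype(float)
    return float(-np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)))

target = np.array([[1, 0], [0, 1]])
good_prediction = np.array([[0.9, 0.1], [0.2, 0.8]])
bad_prediction = np.array([[0.1, 0.9], [0.8, 0.2]])

loss_good = mask_difference(good_prediction, target)
loss_bad = mask_difference(bad_prediction, target)
```

A prediction that agrees with the target yields the smaller value, so minimizing this quantity pushes the two networks toward reproducing the detector-derived mask.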

For the first detection result and the second detection result corresponding to each pixel region: if the degree of difference between the first detection result and the labeling information is greater than a first threshold, and the degree of difference between the second detection result and the labeling information is greater than a second threshold, the pixel region is determined to belong to the core pixel region.
If the degree of difference between the first detection result and the labeling information is greater than the first threshold, and the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold, the pixel region is determined to belong to the boundary pixel region.
The indicative target mask is generated based on the core pixel region and the boundary pixel region.
For each pixel region, the difference between its sample first feature and its sample second feature is calculated to obtain difference degree information.
Determining that the pixel region belongs to the boundary pixel region, when the first degree of difference exceeds the first threshold and the second degree of difference does not exceed the second threshold, then includes:
If the degree of difference between the first detection result and the labeling information is greater than the first threshold, the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold, and the difference degree information is greater than a third threshold, the pixel region is determined to belong to the boundary pixel region.
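The threshold rules above can be written as a small decision function. The sketch below is illustrative only: the names `RegionKind`, `classify_region`, and `build_target_mask` are hypothetical, the label `OTHER` for regions that meet neither rule is an assumption (the source does not name that case), and marking both core and boundary regions with 1 is one possible encoding of the resulting mask.

```python
from enum import Enum

class RegionKind(Enum):
    CORE = "core"
    BOUNDARY = "boundary"
    OTHER = "other"  # assumption: regions that satisfy neither rule

def classify_region(diff1, diff2, t1, t2, feat_diff=None, t3=None):
    """Classify one pixel region.

    diff1 / diff2: difference between the first / second detection result
    and the labeling information. feat_diff and t3 enable the refined rule
    that also compares the first and second features of the region.
    """
    if diff1 <= t1:
        return RegionKind.OTHER
    if diff2 > t2:
        return RegionKind.CORE
    # Here diff1 > t1 and diff2 <= t2.
    if t3 is not None and feat_diff is not None and feat_diff <= t3:
        return RegionKind.OTHER
    return RegionKind.BOUNDARY

def build_target_mask(diffs1, diffs2, t1, t2):
    """Return a grid with 1 for core or boundary regions and 0 otherwise."""
    return [
        [int(classify_region(d1, d2, t1, t2) is not RegionKind.OTHER)
         for d1, d2 in zip(row1, row2)]
        for row1, row2 in zip(diffs1, diffs2)
    ]

mask = build_target_mask([[0.8, 0.1]], [[0.7, 0.9]], t1=0.5, t2=0.5)
```

With both thresholds at 0.5, the first region (differences 0.8 and 0.7) is a core region and the second (0.1 and 0.9) falls outside both rules.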

The first image is input to the extraction network, and the mask generation network is triggered to generate the target mask corresponding to the first image.
For the portion of the first image covered by the target mask, deep feature extraction is performed based on the aggregation network to obtain a first target feature.
For the portion of the first image not covered by the target mask, multi-scale feature extraction is performed based on the aggregation network to obtain a second target feature.
The first target feature and the second target feature are fused to obtain the fused feature information.
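The data flow of this extraction can be sketched in NumPy. The two extractor functions below are simple stand-ins, not the aggregation network of the source, and concatenation is an assumed fusion operator; all function names are hypothetical.

```python
import numpy as np

def deep_features(region):
    """Stand-in for deep feature extraction: two stacked nonlinear stages."""
    x = region.astype(float)
    for _ in range(2):
        x = np.tanh(x - x.mean())
    return np.array([x.mean(), x.std()])

def multiscale_features(region, scales=(1, 2, 4)):
    """Stand-in for multi-scale extraction: one statistic per block-averaged scale."""
    x = region.astype(float)
    feats = []
    for s in scales:
        # Crop to a multiple of s, then average over s x s blocks.
        h, w = (x.shape[0] // s) * s, (x.shape[1] // s) * s
        pooled = x[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))
        feats.append(pooled.std())
    return np.array(feats)

def fused_feature(first_image, target_mask):
    """Extract features from the masked and unmasked parts, then concatenate."""
    m = target_mask.astype(bool)
    covered = np.where(m, first_image, 0)    # part covered by the target mask
    uncovered = np.where(m, 0, first_image)  # part not covered by the mask
    return np.concatenate([deep_features(covered), multiscale_features(uncovered)])

first_image = np.random.default_rng(0).random((8, 8))
target_mask = np.zeros((8, 8), dtype=np.uint8)
target_mask[2:6, 2:6] = 1
fused = fused_feature(first_image, target_mask)
```

The point of the sketch is the split: the masked part, which carries most of the object, gets the heavier extractor, while the surrounding part is summarized at several scales.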
In another aspect, an embodiment of the present invention provides a multi-source heterogeneous data fusion device based on deep learning, including:
A multi-source heterogeneous data acquisition module, which acquires text-type scene information and a picture-type image. The scene information describes the scene of the image and includes at least first object information characterizing the scene and second object information characterizing the target sub-objects associated with the scene.
A cluster determination module, which determines, based on the first object information, a model cluster including a segmentation model for each sub-object in the scene characterized by the first object information.
A model determination module, which determines, based on the second object information, the target model corresponding to each target sub-object in the model cluster.
A coarse segmentation module, which coarsely segments the image based on each target model to obtain a first image corresponding to each target sub-object.
A fine segmentation module, which performs, for each first image, segmentation based on information aggregation associated with the corresponding target sub-object to obtain a fine segmentation result corresponding to that first image.
A main image rendering module, which renders, based on the fine segmentation results, a main image corresponding to the image. The main image characterizes the fusion result of the scene information and the image.
In another aspect, an embodiment of the present invention provides a computer-readable storage medium storing at least one instruction or at least one program that is loaded and executed by a processor to implement the deep-learning-based multi-source heterogeneous data fusion method described above.
In another aspect, an embodiment of the present invention provides an electronic device including at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the at least one processor executes the instructions stored in the memory to implement the deep-learning-based multi-source heterogeneous data fusion method described above.
In another aspect, an embodiment of the present invention provides a computer program product including a computer program or instructions which, when executed by a processor, implement the deep-learning-based multi-source heterogeneous data fusion method described above.

An embodiment of the present invention provides a multi-source heterogeneous data fusion method based on deep learning. The method first performs a coarse segmentation of the image based on the scene information. Coarse segmentation has limited accuracy, however, and by itself fuses the scene information and the image only at a shallow level. To fuse the heterogeneous data more deeply, the scene information is used again to select the aggregated information extractor corresponding to each target sub-object. The aggregated information extractor aggregates the information in the image, and this aggregation can be regarded as being carried out under prior knowledge of the target sub-object. During information aggregation, the prior knowledge related to the scene information is therefore deeply fused with the image information, and fine segmentation can be performed on the result of the aggregation. This yields an accurate fine segmentation result. The main image rendered from that result embodies the fusion of the scene information and the image and, because it also draws on the related prior knowledge, represents a deep fusion of multi-source heterogeneous data.

FIG. 1 is a schematic diagram of a possible implementation framework of the deep-learning-based multi-source heterogeneous data fusion method provided by an embodiment.
FIG. 2 is a flowchart of an information aggregation method according to an embodiment of the present invention.
FIG. 3 is a block diagram of a deep-learning-based multi-source heterogeneous data fusion device provided by an embodiment of the present invention.

The deep-learning-based multi-source heterogeneous data fusion method of an embodiment is described below. FIG. 1 shows the flow of the method provided by an embodiment of the present application. Although the embodiments and flowcharts set out specific operation steps, the method may include more or fewer steps on the basis of conventional or non-inventive effort. The order of steps listed in the embodiments is only one of many possible execution orders and is not the only one. When an actual system, terminal device, or server product executes the method, the steps may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment) according to the method shown in the embodiments or the drawings. The method may include the following steps.

Step S101: Acquire text-type scene information and a picture-type image. The scene information describes the scene of the image and includes at least first object information characterizing the scene and second object information characterizing the target sub-objects associated with the scene.
The scene information is text describing the first object information and the second object information. The first object information is the scene itself, for example an office scene, a sports scene, or a zoo scene. The second object information characterizes the target sub-objects that belong to the scene and appear in the image. Taking a zoo scene as an example, the scene may contain four types of sub-objects: felines, birds, reptiles, and fish. If the image contains only two cats and one dog, there are three target sub-objects: two cats and one dog. The scene information can be constructed using conventional techniques; this is not the focus of the present application and is not described here.
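One way to hold the scene information in code is a small record with the two kinds of object information. The class and field names below are hypothetical; the source only requires that the text carry the scene and the target sub-objects.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class SceneInfo:
    # First object information: the scene itself, e.g. "zoo".
    scene: str
    # Second object information: one entry per target sub-object in the image.
    target_sub_objects: list = field(default_factory=list)

    def target_types(self):
        """Distinct sub-object types, in first-seen order."""
        return list(dict.fromkeys(self.target_sub_objects))

    def instance_counts(self):
        """Number of target sub-objects of each type."""
        return Counter(self.target_sub_objects)

# The zoo example from the text: two cats and one dog.
info = SceneInfo(scene="zoo", target_sub_objects=["cat", "cat", "dog"])
types = info.target_types()
counts = info.instance_counts()
```

Keeping one entry per instance rather than per type preserves the count of target sub-objects (three here), which later determines how many first images are expected.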

Step S102: Based on the first object information, determine a model cluster that includes a segmentation model for each sub-object in the scene characterized by the first object information.
The present application requires model clusters to be built for the relevant scenes. The models in a model cluster may be taken from the prior art or trained by the developers themselves. They are used only for coarse, sub-object-based segmentation, so the required segmentation accuracy is not high and the models are not difficult to obtain or train. This is not the focus of the present application and is not described further. Taking the zoo scene as an example, the corresponding model cluster can include segmentation models for sub-objects such as cats, dogs, fish, and birds.

Step S103: Based on the second object information, determine the target model corresponding to each target sub-object in the model cluster.

Step S104: Coarsely segment the image based on each target model to obtain the first image corresponding to each target sub-object.
If the image contains two cats and one dog, two target models can be determined: target model 1 for segmenting cats and target model 2 for segmenting dogs. Target model 1 yields two first images, and target model 2 yields one first image.
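Steps S102 to S104 amount to two lookups followed by cropping. The sketch below assumes a hypothetical registry `MODEL_CLUSTERS` keyed by scene and sub-object type, and replaces the real coarse segmentation models with stubs that return fixed detection boxes as `(top, left, bottom, right)`; none of these names or box values come from the source.

```python
import numpy as np

def stub_model(boxes):
    """Return a stand-in coarse model that always reports the given boxes."""
    return lambda image: boxes

# Hypothetical registry: scene -> {sub-object type -> coarse segmentation model}.
MODEL_CLUSTERS = {
    "zoo": {
        "cat": stub_model([(0, 0, 4, 4), (4, 4, 8, 8)]),
        "dog": stub_model([(0, 4, 4, 8)]),
        "fish": stub_model([]),
        "bird": stub_model([]),
    },
}

def coarse_segment(image, scene, target_types):
    """Select the target models for a scene and crop one first image per box."""
    cluster = MODEL_CLUSTERS[scene]                  # S102: scene -> model cluster
    targets = {t: cluster[t] for t in target_types}  # S103: pick target models
    first_images = []
    for sub_type, model in targets.items():          # S104: coarse segmentation
        for top, left, bottom, right in model(image):
            first_images.append((sub_type, image[top:bottom, left:right]))
    return first_images

image = np.zeros((8, 8, 3), dtype=np.uint8)
first_images = coarse_segment(image, "zoo", ["cat", "dog"])
```

For the zoo example this produces three first images: two from the cat model and one from the dog model.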

Step S105: For each first image, perform segmentation based on information aggregation associated with the corresponding target sub-object to obtain the fine segmentation result corresponding to that first image.
In the embodiment of the present invention, one aggregated information extractor is provided for each type of sub-object. The aggregated information extractor is obtained by training based on the detector corresponding to that type of sub-object. The detector may be derived from a detection model or a segmentation model for the sub-object, or obtained using conventional techniques; the way the detector is obtained is not limited, as long as it can detect sub-objects of that type. The aggregated information extractor performs aggregated information extraction on the corresponding first image to obtain the corresponding fused feature information.

In the example above, target model 1 can crop out a first image that contains a cat. The image certainly contains a cat, but the precise outline of the cat is not yet determined. Inputting this first image into the corresponding aggregated information extractor yields fused feature information about the cat. Because this aggregated information extractor is trained using a detector that can detect cats, it is particularly suited to fusing the features of cat objects.

Specifically, each first image is input to the corresponding aggregated information extractor, which performs aggregated information extraction on the target sub-object in the first image to obtain the fused feature information corresponding to the first image. The aggregated information extractor is obtained by training based on the detector corresponding to the target sub-object of the first image.
Then, the fused feature information is input to the segmenter corresponding to the target sub-object of the first image to obtain the mask matrix. Continuing the previous example, aggregated information extraction yields very rich fused feature information about the cat; that is, the fused feature information has a high degree of information aggregation and high information quality. Inputting it into a segmenter that can segment objects of the cat type yields a mask matrix that characterizes the fine segmentation result. The present application does not limit how the segmenter is obtained: a prior-art segmenter may be used, or one may be trained. The effect of fine segmentation depends mainly on the quality of the fused feature information, so the requirements on the segmenter are not particularly high.

Step S106: Based on the fine segmentation results, render the main image corresponding to the image. The main image characterizes the fusion result of the scene information and the image.
Compared with coarse segmentation, fine segmentation delineates the outlines of specific objects such as cats and dogs very accurately, so the main image can be rendered from the fine segmentation results. That is, the outlines and bodies of the main subjects recorded in the scene information of the image are rendered to obtain the main image. In essence, a detailed contour segmentation of the image is performed based on the scene information, which produces a main image that characterizes the fusion result of the scene information and the image.

Step S201: Acquire a sample image and the detector corresponding to the type of sub-object in the sample image. The sample image carries labeling information characterizing the position of the sub-object in the sample image, and contains only a single type of sub-object.
The embodiment is described using, as an example, the training of the extraction network and the mask generation network of the aggregated information extractor for the sub-object "cat". The sample image contains only cats and is annotated with the cats' position information, and the detector is one that can detect cat objects.

Step S202: Input the sample image into the extraction network. The extraction network performs the following operations for each pixel region in the sample image: extract a sample first feature corresponding to the pixel region; extract an associated position feature for each associated pixel position; and fuse the associated position features to obtain a sample second feature corresponding to the pixel region. The associated pixel positions are the pixel positions in the neighborhood of the pixel region that do not belong to the pixel region itself.

In this embodiment, the way the pixel regions are divided is not limited and can be set according to actual needs. The result of coarse segmentation is a detection box, and the first image is obtained by cropping the image inside the detection box, so the first image is a rectangular image. How detection boxes are obtained by segmentation is covered by the prior art and is not described here. The first image can be divided into a 9-cell (3 × 3) or 16-cell (4 × 4) grid, with each cell corresponding to one pixel region. The finer the division into pixel regions, the better the segmentation effect.
The embodiment of the present invention does not limit the extent of the neighborhood or how it is determined. For example, it is sufficient that the neighborhood contains the pixel region and also contains surrounding pixels that do not belong to the pixel region.
The sample first feature characterizes the pixel region itself, while the sample second feature characterizes the scene in which the pixel region is located, that is, its surroundings. The method of extracting these two features is not limited; it can be implemented by one or a combination of convolution, multi-layer convolution, self-attention-based convolution, multi-channel fusion, pooling, and similar methods, and is not described further in the embodiments of the present application.
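Dividing a rectangular first image into a 3 × 3 or 4 × 4 grid of pixel regions can be sketched with `numpy.array_split`, which also handles image sizes that are not exact multiples of the grid size. The function name `split_into_regions` is hypothetical, and the row-major ordering is an arbitrary choice.

```python
import numpy as np

def split_into_regions(first_image, grid=3):
    """Split an H x W (x C) image into grid x grid pixel regions, row-major."""
    rows = np.array_split(first_image, grid, axis=0)
    return [cell for row in rows for cell in np.array_split(row, grid, axis=1)]

first_image = np.arange(6 * 9).reshape(6, 9)
regions_9 = split_into_regions(first_image, grid=3)   # 9-cell grid
regions_16 = split_into_regions(first_image, grid=4)  # 16-cell grid
```

Every pixel lands in exactly one region, so per-region features can later be mapped back to image coordinates without overlap.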

Step S203: Input the sample first feature and the sample second feature corresponding to each pixel region into the mask generation network to obtain a sample target mask.
The mask generation network predicts the sample target mask from the sample first features and the sample second features. The portion of the sample image covered by the sample target mask can be regarded as the region formed by the more important pixels; the mask thus screens out the region of the sample image with the highest content of useful information.

Step S204: For each pixel region, input the sample first features other than the one corresponding to that pixel region into the detector to obtain a first detection result for the pixel region, and input the sample second features other than the one corresponding to that pixel region into the detector to obtain a second detection result for the pixel region.
For each pixel region, the sample first features other than the one corresponding to that region form a sample first feature information set, which excludes the region's own sample first feature. This set is input to the detector to obtain the first detection result. If the first detection result is highly consistent with the labeling information, the presence or absence of this pixel region has little influence on detection. The pixel region is then a region formed by unimportant pixels, its content of useful information is low, and it is likely unrelated to the sub-object in the sample image.

Likewise, for each pixel region, the sample second features other than its own form a sample second feature information set that excludes the sample second feature of that pixel region. This set is input into the detector to obtain the second detection result. If the second detection result is highly consistent with the labeling information, the scene information of the pixel region has little influence. The positions near the pixel region but outside it are therefore positions of unimportant pixels: they are likely to be unrelated to the subordinate object in the sample image or to lie at its edge, and their content of useful information is low.
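The leave-one-out procedure of step S204 can be illustrated with a minimal sketch. The function name `leave_one_out_results` and the `toy_detector` stand-in are hypothetical and are not part of the patent; the real detector is a trained model for the subordinate object type, and its output format is not specified in this section.

```python
from typing import Callable, List, Sequence, Tuple

Feature = Sequence[float]
Detector = Callable[[List[Feature]], float]


def leave_one_out_results(
    first_feats: List[Feature],
    second_feats: List[Feature],
    detector: Detector,
) -> List[Tuple[float, float]]:
    """For each pixel region i, run the detector on all first features
    except the i-th, and on all second features except the i-th."""
    results = []
    for i in range(len(first_feats)):
        # Feature information sets that exclude region i's own feature.
        first_set = [f for j, f in enumerate(first_feats) if j != i]
        second_set = [f for j, f in enumerate(second_feats) if j != i]
        results.append((detector(first_set), detector(second_set)))
    return results


def toy_detector(features: List[Feature]) -> float:
    # Hypothetical stand-in: the "detection result" is just the sum of
    # all feature values. A real detector would be a trained network.
    return sum(sum(f) for f in features)
```

Each returned pair holds the first and second detection results for one pixel region, which are then compared against the labeling information.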

Step S205. Determine an instruction target mask based on the first detection results, the second detection results, and the labeling information.
Specifically, for the first detection result and the second detection result corresponding to each pixel region, if the degree of difference between the first detection result and the labeling information is greater than a first threshold, and the degree of difference between the second detection result and the labeling information is greater than a second threshold, the pixel region is determined to belong to the core pixel region. The degree of difference characterizes the difference between a detection result and the labeling information. The embodiments of the present application do not limit how it is calculated: any of the many difference measures used in the field of neural networks may be used, as long as it expresses the difference between two pieces of information. When both the first detection result and the second detection result differ greatly from the labeling information, the pixel region is very important and most likely belongs to the core pixel region, which characterizes a region carrying part of the subordinate object's information. The first threshold and the second threshold are not limited in this specification; they can be set during actual training of the neural network, and the first threshold may be greater than the second threshold.

If the degree of difference between the first detection result and the labeling information is greater than the first threshold, and the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold, the pixel region is determined to belong to the boundary pixel region.
Specifically, a large difference between the first detection result and the labeling information indicates that the pixel region is very important and cannot be omitted, whereas a small difference between the second detection result and the labeling information indicates that the area surrounding the pixel region is not so important and may be omitted. When both hold, the pixel region is likely to be located at the boundary of the subordinate object in the sample image.
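The core and boundary rules of step S205 amount to two threshold comparisons per pixel region. The sketch below is illustrative only: the names `classify_region` and `build_instruction_mask` are hypothetical, the label `"other"` for regions matching neither rule is an assumption (the text does not name that case), and encoding the instruction target mask as a binary list is also an assumption.

```python
from typing import List, Tuple


def classify_region(diff_first: float, diff_second: float,
                    t1: float, t2: float) -> str:
    """Classify one pixel region from its two degrees of difference.

    diff_first:  difference between first detection result and labels
    diff_second: difference between second detection result and labels
    """
    if diff_first > t1 and diff_second > t2:
        return "core"
    if diff_first > t1 and diff_second <= t2:
        return "boundary"
    return "other"  # assumed label; not named in the source text


def build_instruction_mask(diffs: List[Tuple[float, float]],
                           t1: float, t2: float) -> List[int]:
    # Assumption: the mask marks core and boundary regions with 1
    # and every other region with 0.
    return [
        1 if classify_region(d1, d2, t1, t2) in ("core", "boundary") else 0
        for d1, d2 in diffs
    ]
```

For example, with `t1 = 0.5` and `t2 = 0.3`, a region with differences `(0.9, 0.8)` is classified as core and one with `(0.9, 0.1)` as boundary.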

Step S301. For each pixel region, calculate the difference between the corresponding sample first feature and the corresponding sample second feature to obtain difference information.
The difference information can be obtained by calculating the feature distance between the sample first feature and the sample second feature. Feature distance is a kind of information distance and can be measured with any information-distance measure; the embodiments of the present application do not limit this. The difference information characterizes the information distance between the pixel region and its adjacent surroundings. If the distance is small, the pixels do not change abruptly, so the pixel region is likely to lie entirely inside or entirely outside the subordinate object in the sample image; in either case it does not straddle the object's boundary. Conversely, if the distance is large, the pixel region is more likely to be located at the edge of the subordinate object.

Determining that the pixel region belongs to the boundary pixel region when the degree of difference between the first detection result and the labeling information is greater than the first threshold and the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold includes:
If the degree of difference between the first detection result and the labeling information is greater than the first threshold, the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold, and the difference information is greater than a third threshold, the pixel region is determined to belong to the boundary pixel region.
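The refined boundary test adds the feature-distance check of step S301 as a third condition. The sketch below assumes Euclidean distance as the feature distance; the patent leaves the measure open, so this choice and the names `feature_distance` and `is_boundary` are hypothetical.

```python
import math
from typing import Sequence


def feature_distance(first: Sequence[float], second: Sequence[float]) -> float:
    # Assumption: Euclidean distance stands in for the unspecified
    # "information distance" between the first and second features.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(first, second)))


def is_boundary(diff_first: float, diff_second: float,
                first_feat: Sequence[float], second_feat: Sequence[float],
                t1: float, t2: float, t3: float) -> bool:
    """Boundary rule with the additional third-threshold condition."""
    return (
        diff_first > t1
        and diff_second <= t2
        and feature_distance(first_feat, second_feat) > t3
    )
```

The third condition filters out regions whose own feature is close to that of their surroundings, which the text associates with regions lying wholly inside or outside the subordinate object.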

Step S206. Adjust the parameters of the extraction network and the mask generation network according to the difference between the sample target mask and the instruction target mask until training is complete.
The way of expressing the difference between the sample target mask and the instruction target mask, the way of adjusting parameters by feedback, and the condition for completing training may follow the prior art in the field of neural networks and are not described here. The network structures of the extraction network and the mask generation network may be designed based on, for example, a deep convolutional neural network; the embodiments of the present application do not limit this.
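As one possible way to express the difference between the two masks, the sketch below uses mean binary cross-entropy. This is an assumption for illustration: the patent defers the choice of loss to the prior art, and the name `mask_bce` and the flat-list mask encoding are hypothetical.

```python
import math
from typing import Sequence


def mask_bce(pred: Sequence[float], target: Sequence[int],
             eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between a predicted sample target mask
    (probabilities in [0, 1]) and a binary instruction target mask."""
    total = 0.0
    for p, t in zip(pred, target):
        # Clamp to avoid log(0) for fully confident predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)
```

In a training loop, a value of this kind would be minimized by adjusting the parameters of the extraction network and the mask generation network, with any standard optimizer and stopping criterion.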

After training, in this embodiment, the aggregated information extractor can be used to extract information from the first image and obtain the fusion feature information. Specifically, the aggregated information extractor further includes an aggregation network, and after the first image is input into the corresponding aggregated information extractor, the extractor performs the following operations:
The first image is input into the extraction network, and the mask generation network is triggered to generate the target mask corresponding to the first image. This process is performed by the extraction network and the mask generation network as described above. For the portion of the first image covered by the target mask, deep feature extraction is performed by the aggregation network to obtain the first target feature. In the embodiments of the present invention, the portion covered by the target mask is regarded as the portion carrying important, useful information, that is, an effective information aggregation area. This area closely matches the area where the target subordinate object is located in the first image, so the first target feature is extracted from it. The first target feature can be extracted by an aggregation network with a pyramid multi-scale structure, which extracts hierarchically rich multi-scale information; fusing this multi-scale information yields the first target feature.
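The idea of extracting multi-scale information from the mask-covered portion and fusing it can be sketched with plain average pooling. This is a toy illustration under stated assumptions, not the patent's aggregation network: the functions `apply_mask`, `avg_pool`, and `pyramid_feature` are hypothetical, the pyramid levels are non-overlapping average pools, and fusion is done by simple concatenation.

```python
from typing import List, Sequence

Image = List[List[float]]


def apply_mask(image: Image, mask: List[List[int]]) -> Image:
    # Keep only the pixels covered by the target mask (mask value 1).
    return [[px * m for px, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]


def avg_pool(image: Image, size: int) -> Image:
    # Non-overlapping size x size average pooling.
    h, w = len(image), len(image[0])
    pooled = []
    for r in range(0, h - size + 1, size):
        row = []
        for c in range(0, w - size + 1, size):
            window = [image[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def pyramid_feature(image: Image, mask: List[List[int]],
                    scales: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Pool the masked image at several scales and fuse the levels by
    concatenation (an assumed, simplest-possible fusion)."""
    masked = apply_mask(image, mask)
    feature: List[float] = []
    for s in scales:
        for row in avg_pool(masked, s):
            feature.extend(row)
    return feature
```

For a 4x4 image with scales `(1, 2, 4)`, the fused vector has 16 + 4 + 1 entries; a learned network would replace the fixed pooling with trained convolutional layers.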

Claims

1. A deep learning-based multi-source heterogeneous data fusion method, comprising:
obtaining text-type scene information for describing the scene of an image, and a picture-type image, the scene information including at least first object information characterizing the scene and second object information characterizing an associated target subordinate object in the scene;
determining, based on the first object information, a model cluster including a segmentation model corresponding to each subordinate object in the scene characterized by the first object information;
determining, based on the second object information, a target model corresponding to each target subordinate object in the model cluster;
coarsely segmenting the image based on each of the target models to obtain a first image corresponding to each of the target subordinate objects;
for each of the first images, performing segmentation based on information aggregation associated with the corresponding target subordinate object to obtain a subdivision result corresponding to each of the first images; and
rendering a body image corresponding to the image based on the subdivision results, the body image characterizing a fusion result between the scene information and the image.

2. The method of claim 1, wherein the subdivision result includes a mask matrix, and the step of performing segmentation based on information aggregation associated with the corresponding target subordinate object for each of the first images to obtain a subdivision result corresponding to each of the first images comprises:
for each first image, inputting the first image into a corresponding aggregated information extractor, the aggregated information extractor being obtained by training based on a detector corresponding to the target subordinate object to which the first image corresponds, and the aggregated information extractor performing aggregated information extraction on the target subordinate object in the first image to obtain fusion feature information corresponding to the first image; and
inputting the fusion feature information into a divider corresponding to the target sub-image to which the first image corresponds, to obtain the mask matrix.

3. The method of claim 2, wherein rendering the body image corresponding to the image, the body image characterizing a fusion result between the scene information and the image based on the subdivision result, comprises:
rendering the body image based on the mask matrix and the image corresponding to each of the first images.

4. The method of claim 3, wherein the aggregated information extractor is mainly composed of an extraction network and a mask generation network, the mask generation network is used to generate a target mask for distinguishing subordinate objects from non-subordinate objects in an image input to the aggregated information extractor, and the extraction network and the mask generation network are trained by:
acquiring a sample image and a detector corresponding to the type of subordinate object to which the sample image corresponds, the sample image having labeling information characterizing positional information of the subordinate object within the sample image, and the sample image including only a single type of subordinate object;
inputting the sample image into the extraction network, the extraction network extracting, for each pixel region in the sample image, a sample first feature corresponding to the pixel region, extracting a related position feature corresponding to each related pixel position, and fusing the related position features to obtain a sample second feature corresponding to the pixel region, the related pixel positions being other pixel positions in the vicinity of the pixel region that do not belong to the pixel region;
inputting the sample first feature and the sample second feature corresponding to each pixel region into the mask generation network to obtain a sample target mask;
for each pixel region, inputting the sample first features other than the corresponding sample first feature into the detector to obtain a first detection result corresponding to the pixel region, and inputting the sample second features other than the corresponding sample second feature into the detector to obtain a second detection result corresponding to the pixel region;
determining an instruction target mask based on each of the first detection results, each of the second detection results, and the labeling information; and
adjusting parameters of the extraction network and the mask generation network according to the difference between the sample target mask and the instruction target mask until training is complete.

5. The method of claim 4, wherein the step of determining an instruction target mask based on each of the first detection results, each of the second detection results, and the labeling information comprises:
for the first detection result and the second detection result corresponding to each pixel region, determining that the pixel region belongs to a core pixel region when the degree of difference between the first detection result and the labeling information is greater than a first threshold and the degree of difference between the second detection result and the labeling information is greater than a second threshold;
determining that the pixel region belongs to a boundary pixel region when the degree of difference between the first detection result and the labeling information is greater than the first threshold and the degree of difference between the second detection result and the labeling information is less than or equal to the second threshold; and
generating the instruction target mask based on the core pixel region and the boundary pixel region.

6. A deep learning-based multi-source heterogeneous data fusion device, comprising:
a multi-source heterogeneous data acquisition module for acquiring text-type scene information for describing the scene of an image, and a picture-type image, the scene information including at least first object information characterizing the scene and second object information characterizing an associated target subordinate object in the scene;
a cluster determination module for determining, based on the first object information, a model cluster including a segmentation model corresponding to each subordinate object in the scene characterized by the first object information;
a model determination module for determining, based on the second object information, a target model corresponding to each target subordinate object in the model cluster;
a coarse segmentation module for coarsely segmenting the image based on each of the target models to obtain a first image corresponding to each of the target subordinate objects;
a subdivision module for performing, for each of the first images, segmentation based on information aggregation associated with the corresponding target subordinate object to obtain a subdivision result corresponding to each of the first images; and
a body image rendering module for rendering a body image corresponding to the image based on the subdivision results, the body image characterizing a fusion result between the scene information and the image.

7. The apparatus of claim 6, wherein the subdivision result includes a mask matrix, and the subdivision module is configured to:
for each first image, input the first image into a corresponding aggregated information extractor, the aggregated information extractor being obtained by training based on a detector corresponding to the target subordinate object to which the first image corresponds, and the aggregated information extractor performing aggregated information extraction on the target subordinate object in the first image to obtain fusion feature information corresponding to the first image; and
input the fusion feature information into a divider corresponding to the target sub-image to which the first image corresponds, to obtain the mask matrix.

8. The apparatus of claim 7, wherein the body image rendering module includes a first image processor for generating a first image based on the mask matrix and the first image corresponding to the first image.

9. A computer-readable storage medium having stored thereon at least one instruction or at least one program that is loaded and executed by a processor to implement the deep learning-based multi-source heterogeneous data fusion method according to any one of claims 1 to 5.

10. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the at least one processor executes the instructions stored in the memory to implement the deep learning-based multi-source heterogeneous data fusion method according to any one of claims 1 to 5.