CN118570889A - Sequence image target recognition method, device and electronic device based on image quality optimization

Info

Publication number: CN118570889A
Application number: CN202411062099.8A
Authority: CN
Inventors: 宋鸿飞; 王麒; 陈帅斌; 蒋泽飞; 夏虹
Original assignee: Hangzhou Denghong Technology Co Ltd
Current assignee: Hangzhou Denghong Technology Co Ltd
Priority date: 2024-08-05
Filing date: 2024-08-05
Publication date: 2024-08-30
Anticipated expiration: 2044-08-05
Also published as: CN118570889B

Abstract

The application relates to the field of image data processing, and in particular to a sequential image target recognition method, device and electronic device based on image quality optimization. First, a queue of object images to be identified, captured by a camera, is acquired. Next, the target IOU relationship of each object image to be identified is determined based on a face target detection network and a human body target detection network to obtain a queue of target IOU relationships. A queue of target object images is then extracted from the queue of object images to be identified based on the queue of target IOU relationships, and the target object image with the optimal face quality is selected from the queue of target object images as the target object image to be detected. Face recognition is performed on the target object image to be detected to obtain a recognition result, the recognition result being a person identity tag. Finally, the person identity tag in the recognition result is designated as the person identity tag of the queue of target object images.

Description

Image quality optimization-based sequential image target identification method and device and electronic equipment

Technical Field

The application relates to the field of image data processing, in particular to a sequential image target identification method and device based on image quality optimization and electronic equipment.

Background

In modern video monitoring and intelligent security systems, sequential image target recognition technology plays a vital role. As the technology develops, the requirements on the accuracy and efficiency of image recognition keep rising. However, the prior art has limitations in processing large-scale image data, particularly in cross-domain personnel identification and trajectory tracking. For example, existing cross-domain personnel recognition and trajectory tracking algorithms usually send every captured face image to a face recognition module to match a personnel identity id when constructing personnel trajectories, which consumes excessive computing resources and is slow and inefficient. In addition, because of factors such as camera viewing angle, illumination conditions and distance, the quality of captured face images is uneven, which makes the face recognition results less credible. Moreover, not every captured face image meets the requirements of face recognition, yet if some image frames are filtered out, gaps and break points appear when the personnel trajectory is drawn, which harms the integrity of the trajectory and the accuracy of analysis.

Accordingly, a sequential image object recognition scheme based on image quality preference is desired.

Disclosure of Invention

The present application has been made in view of the above problems. An object of the present application is to provide a sequential image target recognition method, apparatus and electronic device based on image quality preference.

The embodiment of the application provides a sequential image target recognition method based on image quality optimization, which comprises the following steps:

Acquiring a queue of images of objects to be identified, which are acquired by a camera;

determining target IOU relations of all the object images to be identified in the queues of the object images to be identified based on a face target detection network and a human target detection network to obtain a queue of target IOU relations;

extracting a queue of target object images from the queue of object images to be identified based on the queue of target IOU relationships;

selecting a target object image with the optimal face quality from the queue of the target object images as a target object image to be detected;

performing face recognition on the target object image to be detected to obtain a recognition result, wherein the recognition result is a personnel identity tag;

and designating the personnel identity label in the identification result as the personnel identity label of the queue of the target object image.

For example, according to an embodiment of the present application, a method for identifying a sequential image object based on image quality preference, wherein determining, based on a face object detection network and a body object detection network, a target IOU relationship of each of a queue of object images to be identified to obtain a queue of target IOU relationships includes:

inputting the images of the objects to be identified into the human face target detection network and the human body target detection network respectively to obtain a human body boundary box and a human face boundary box;

calculating a target IOU relationship between the human body boundary box and the human face boundary box according to the following relationship calculation formula:

$$\mathrm{IOU} = \frac{\text{intersection area}}{\text{union area}}$$

wherein the intersection area is the area of the intersection between the human body boundary box and the human face boundary box, and the union area is the area of the union between the human body boundary box and the human face boundary box.
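As a non-limiting illustration, the ratio of intersection area to union area can be sketched as follows. The box layout `(x_min, y_min, x_max, y_max)` and the names `Box`, `box_area` and `target_iou` are assumptions made for this sketch; the text above does not fix a coordinate convention.

```python
from typing import Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; an inverted or empty box has zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def target_iou(body_box: Box, face_box: Box) -> float:
    """Intersection area divided by union area of the body and face boxes."""
    # The intersection of two axis-aligned boxes is itself an axis-aligned box.
    intersection = box_area((
        max(body_box[0], face_box[0]),
        max(body_box[1], face_box[1]),
        min(body_box[2], face_box[2]),
        min(body_box[3], face_box[3]),
    ))
    union = box_area(body_box) + box_area(face_box) - intersection
    return intersection / union if union > 0 else 0.0
```

For a face box lying entirely inside the body box, this value reduces to the face area divided by the body area, so any preset threshold compared against it would need to be chosen with that scale in mind.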

For example, according to an embodiment of the present application, a sequential image object recognition method based on image quality preference, wherein extracting a queue of target object images from the queue of object images to be recognized based on the queue of target IOU relations, includes:

in response to the target IOU relation being smaller than or equal to a preset threshold, eliminating the corresponding object image to be identified;

and in response to the target IOU relationship being greater than a preset threshold, incorporating the corresponding object image to be identified into a queue of the target object image.

For example, according to an embodiment of the present application, a method for identifying a sequential image target based on image quality preference, wherein selecting a target object image with optimal face quality from a queue of the target object images as a target object image to be detected includes: for each target object image in the queue of target object images:

processing each target object image by using an LBP mode operator to obtain a target object image LBP feature vector;

processing each target object image by using the HOG feature descriptors to obtain a target object HOG feature vector;

inputting the target object HOG feature vector and the target object image LBP feature vector into a dynamic interaction module under gating response to obtain a target object multi-mode statistical feature vector;

inputting each target object image into an image feature extractor based on a cavity (dilated) convolutional neural network model to obtain a target object image feature map;

inputting the target object image feature map into a feature foreground mask salizer based on a convolution gating feedforward mechanism to obtain a foreground salient target object image feature map;

inputting the foreground salient target object image feature map and the target object multi-mode statistical feature vector into a MetaNet model-based cross-domain joint encoder to obtain a multi-mode statistical feature assisted target object image fusion feature map;

and inputting the multi-mode statistical feature assisted target object image fusion feature map into a decoder-based image quality scoring device to obtain a scoring decoding value.

For example, according to an embodiment of the present application, a method for identifying a sequential image target based on image quality preference, wherein inputting the target object HOG feature vector and the target object image LBP feature vector into a dynamic interaction module under a gating response to obtain a target object multi-modal statistical feature vector includes:

inputting the HOG feature vector of the target object and the LBP feature vector of the target object into a feature combination module for cascade processing to obtain a multi-mode statistical information combination feature vector of the target object;

after matrix multiplication of the target object multi-mode statistical information joint feature vector and the parameter matrix is calculated, the obtained feature vector and the bias vector are added according to positions to obtain a linear transformation target object multi-mode statistical information joint feature vector;

activating the linear transformation target object multi-mode statistical information joint feature vector with an activation function to obtain a target object multi-mode statistical information dynamic fusion response gating value;

calculating the position-wise product between the target object HOG feature vector and the target object multi-mode statistical information dynamic fusion response gating value to obtain a weight-modulated target object HOG feature vector;

after calculating the target object multi-mode statistical information dynamic fusion response gating value, multiplying the obtained weight value position-wise with the target object image LBP feature vector to obtain a weight-modulated target object image LBP feature vector;

and combining the weight-modulated target object HOG feature vector and the weight-modulated target object image LBP feature vector position-wise to obtain the target object multi-mode statistical feature vector.
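As a non-limiting illustration, the gating steps above can be sketched as follows. Several details are assumptions of this sketch rather than statements of the text: the activation is taken to be a sigmoid, the weight applied to the LBP vector is taken to be one minus the gating value, the final position-wise combination is taken to be an addition, and the two input vectors are assumed to have equal length. The names `gated_fusion`, `weight` and `bias` are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed activation)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(
    hog: Sequence[float],
    lbp: Sequence[float],
    weight: Sequence[Sequence[float]],  # parameter matrix, shape d x 2d
    bias: Sequence[float],              # bias vector, length d
) -> List[float]:
    """Fuse two equal-length feature vectors under a learned gate."""
    if len(hog) != len(lbp):
        raise ValueError("this sketch assumes HOG and LBP vectors of equal length")
    # Cascade (concatenate) the two vectors into a joint feature vector.
    joint = list(hog) + list(lbp)
    # Linear transformation: matrix product plus position-wise bias.
    linear = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]
    # Gating value in (0, 1) per position.
    gate = [sigmoid(z) for z in linear]
    # Weight-modulated HOG vector: gate * hog.
    hog_mod = [g * h for g, h in zip(gate, hog)]
    # Weight-modulated LBP vector: (1 - gate) * lbp (assumed complement weight).
    lbp_mod = [(1.0 - g) * v for g, v in zip(gate, lbp)]
    # Assumed final combination: position-wise addition.
    return [a + b for a, b in zip(hog_mod, lbp_mod)]
```

Under these assumptions the output is a per-position convex combination of the two vectors, so a gate near 1 favours the HOG component and a gate near 0 favours the LBP component.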

For example, according to an embodiment of the present application, a method for identifying a sequential image target based on image quality preference, wherein inputting the target object image feature map into a feature foreground mask salizer based on a convolution-gated feed-forward mechanism to obtain a foreground salient target object image feature map includes:

carrying out layer normalization processing on the target object image feature map to obtain a normalized target object image feature map;

performing channel expansion based on point convolution and depth convolution coding based on a cavity (dilated) convolution layer on the normalized target object image feature map to obtain a target object image depth convolution backup feature map and a target object image depth convolution original feature map;

inputting the target object image depth convolution original feature map into a foreground gating mask module based on the GELU function to obtain a target object image depth convolution gating mask weight feature map;

calculating the position-wise product between the target object image depth convolution gating mask weight feature map and the target object image depth convolution backup feature map to obtain a target object image gating mask foreground salient feature map;

and performing channel contraction based on point convolution on the target object image gating mask foreground salient feature map to obtain the foreground salient target object image feature map.
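As a non-limiting illustration, the convolution gating feedforward steps above can be sketched in plain Python on a small feature map stored as nested lists `[channel][row][col]`. The following are assumptions of this sketch: layer normalization is applied across channels at each spatial position, the two feature maps are obtained by splitting the expanded channels into two halves, the depth convolution uses a 3x3 kernel with zero padding, and the dilation rate is 2. All function and parameter names are hypothetical.

```python
import math
from typing import List

Map = List[List[List[float]]]  # [channel][row][col]


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(x: Map, eps: float = 1e-5) -> Map:
    """Normalize across channels at every spatial position (assumed axis)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            vals = [x[k][i][j] for k in range(c)]
            mean = sum(vals) / c
            var = sum((v - mean) ** 2 for v in vals) / c
            for k in range(c):
                out[k][i][j] = (vals[k] - mean) / math.sqrt(var + eps)
    return out


def pointwise_conv(x: Map, weight: List[List[float]]) -> Map:
    """1x1 convolution: weight[o][k] mixes input channel k into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wk * x[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]


def depthwise_dilated_conv(x: Map, kernels: Map, dilation: int) -> Map:
    """Per-channel 3x3 dilated convolution with zero padding (no kernel flip)."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for plane_in, kern in zip(x, kernels):
        plane = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for a in range(3):
                    for b in range(3):
                        ii = i + (a - 1) * dilation
                        jj = j + (b - 1) * dilation
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kern[a][b] * plane_in[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out


def foreground_mask_salizer(x: Map, w_expand: List[List[float]],
                            dw_kernels: Map, w_contract: List[List[float]],
                            dilation: int = 2) -> Map:
    """Layer norm, expand, dilated depthwise conv, GELU gate, contract."""
    normed = layer_norm(x)
    expanded = depthwise_dilated_conv(
        pointwise_conv(normed, w_expand), dw_kernels, dilation)
    half = len(expanded) // 2
    original, backup = expanded[:half], expanded[half:]  # assumed channel split
    h, w = len(x[0]), len(x[0][0])
    # GELU of the original branch acts as the gating mask on the backup branch.
    gated = [[[gelu(original[k][i][j]) * backup[k][i][j]
               for j in range(w)] for i in range(h)] for k in range(half)]
    return pointwise_conv(gated, w_contract)
```

With `C` input channels, `w_expand` has `2E` rows of length `C`, `dw_kernels` holds `2E` kernels of size 3x3, and `w_contract` has `C` rows of length `E`, so the output has the same shape as the input.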

For example, according to the image quality preference-based sequential image target recognition method of the embodiment of the present application, a target object image corresponding to the largest of the scored decoding values is determined as the target object image to be detected.

For example, according to an embodiment of the present application, a method for identifying a sequential image target based on image quality preference, wherein identifying the image of the target object to be detected to obtain an identification result includes:

inputting the target object image to be detected into an AlexNet-based face feature extractor to obtain a face feature vector;

and inputting the face feature vector into a classifier-based face recognition device to obtain the recognition result.

For example, the image quality preference-based sequential image target recognition method according to the embodiment of the present application further includes a training step for training the dynamic interaction module under the gating response, the image feature extractor based on the cavity convolutional neural network model, the feature foreground mask salizer based on the convolution gating feedforward mechanism, the cross-domain joint encoder based on the MetaNet model and the image quality scoring device based on the decoder;

wherein the training step comprises:

acquiring training data, wherein the training data comprises a queue of training images of an object to be identified, which are acquired by a camera;

Determining target IOU relations of all training object images to be identified in the training object image queue based on the face target detection network and the human body target detection network to obtain a training target IOU relation queue;

extracting a queue of training target object images from the queue of training images of the object to be identified based on the queue of training target IOU relations;

Processing each training target object image in the queue of training target object images by using the LBP mode operator to obtain a training target object image LBP feature vector;

Processing each training target object image by using the HOG feature descriptors to obtain training target object HOG feature vectors;

Inputting the HOG feature vector of the training target object and the LBP feature vector of the training target object image into a dynamic interaction module under the gating response to obtain a multi-mode statistical feature vector of the training target object;

inputting the images of the training target objects into the image feature extractor based on the cavity convolutional neural network model to obtain a training target object image feature map;

inputting the training target object image feature map into the feature foreground mask salizer based on the convolution gating feedforward mechanism to obtain a training foreground salient target object image feature map;

Inputting the training foreground significant target object image feature map and the training target object multi-mode statistical feature vector into the MetaNet model-based cross-domain joint encoder to obtain a training multi-mode statistical feature assisted target object image fusion feature map;

inputting the target object image fusion feature map under the assistance of the training multi-mode statistical features into the decoder-based image quality scoring device to obtain a decoding loss function value;

Calculating a preset loss function value of the target object image fusion feature map under the assistance of the training multi-mode statistical features to obtain a target object image fusion loss function value under the assistance of the multi-mode statistical features;

and taking the weighted sum of the decoding loss function value and the target object image fusion loss function value under the assistance of the multi-mode statistical feature as a loss function value, and training the dynamic interaction module under the gating response, the image feature extractor based on the cavity convolutional neural network model, the feature foreground mask salizer based on the convolution gating feedforward mechanism, the cross-domain joint encoder based on the MetaNet model and the image quality scoring device based on the decoder.
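As a non-limiting illustration, the weighted sum used as the overall loss function value can be sketched as follows. The text above does not state the form of the decoding loss, the definition of the preset loss, or the weights. The mean squared error against a labelled quality score, the default weights, and the names `decoding_loss` and `total_loss` are therefore assumptions of this sketch, and the fusion loss is simply passed in as a number.

```python
from typing import Sequence


def decoding_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Assumed decoding loss: mean squared error of decoded quality scores."""
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def total_loss(decode_value: float, fusion_value: float,
               decode_weight: float = 1.0, fusion_weight: float = 0.5) -> float:
    """Weighted sum of the decoding loss and the fusion feature map loss."""
    # Both weights are hypothetical hyperparameters, not values from the source.
    return decode_weight * decode_value + fusion_weight * fusion_value
```

In this sketch a single scalar would drive the joint training of all five modules, with the fusion weight controlling how strongly the preset loss on the fusion feature map contributes.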

The embodiment of the application also provides a sequential image target recognition device based on image quality optimization, which comprises:

the image queue acquisition module is used for acquiring a queue of images of the object to be identified, which are acquired by the camera;

The IOU relation determining module is used for determining the target IOU relation of each object image to be identified in the queue of the object images to be identified based on the face target detection network and the human body target detection network so as to obtain a queue of the target IOU relation;

A target object image queue extracting module, configured to extract a queue of target object images from the queue of object images to be identified based on the queue of target IOU relationships;

the optimization module is used for selecting a target object image with optimal face quality from the queue of the target object images as a target object image to be detected;

The face recognition module is used for recognizing the image of the target object to be detected to obtain a recognition result, wherein the recognition result is a personnel identity tag;

And the personnel identity label designating module is used for designating the personnel identity label in the identification result as the personnel identity label of the queue of the target object image.

The embodiment of the application also provides electronic equipment, which comprises:

A processor; and

A memory in which computer program instructions are stored which, when executed by the processor, cause the processor to perform the image quality preferred sequence image target recognition method of any preceding claim.

According to the image quality optimization-based sequential image target recognition method, device and electronic equipment, computing resources are saved while a more reliable target recognition scheme is provided for fields such as security monitoring; the problems in traditional cross-domain personnel recognition and trajectory tracking are effectively addressed, and the accuracy and efficiency of face recognition are improved.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present application, the following description will briefly explain the drawings of the embodiments of the present application. It is apparent that the figures in the following description relate only to some embodiments of the application and are not limiting of the application.

Fig. 1 shows a schematic diagram of an application architecture of a sequential image object recognition method based on image quality preference in an embodiment of the present application;

Fig. 2 shows a flowchart of a sequential image target recognition method based on image quality preference in an embodiment of the present application;

Fig. 3 shows a flowchart of sub-step S540 of the sequential image target recognition method based on image quality preference in an embodiment of the present application;

Fig. 4 shows a schematic structural diagram of a sequential image target recognition apparatus based on image quality preference in an embodiment of the present application;

Fig. 5 shows an application scenario diagram of a sequential image target recognition method based on image quality preference in an embodiment of the present application; and

Fig. 6 shows a schematic diagram of a queue of four object images to be identified.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are also within the scope of the application.

The terms used in the present specification are general terms that are currently widely used in the art in view of functions of the present application, but may be changed according to the intention, precedent, or new technology in the art of the person of ordinary skill in the art. Furthermore, specific terms may be selected, and in this case, detailed meanings thereof will be described in the detailed description of the present application. Accordingly, the terms used in the specification should not be construed as simple names, but are based on meanings of the terms and general description of the present application.

Although the present application makes various references to certain modules in a system according to embodiments of the present application, any number of different modules may be used and run on a user terminal and/or server. The modules are merely illustrative, and different aspects of the systems and methods may use different modules.

A flowchart is used in the present application to describe the operations performed by a system according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in order precisely. Rather, the various steps may be processed in reverse order or simultaneously, as desired. Also, other operations may be added to or removed from these processes.

Fig. 1 shows an application architecture diagram of a sequential image target recognition method based on image quality preference in an embodiment of the present application, including a server 100 and a terminal device 200.

The terminal device 200 and the server 100 may be connected to each other through a network to communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (Local Area Network, LAN), a metropolitan area network (Metropolitan Area Network, MAN), a wide area network (Wide Area Network, WAN), a mobile, wired or wireless network, a private network, a virtual private network, or any combination thereof. In some embodiments, the data exchanged over the network is represented using techniques and/or formats including hypertext markup language (HyperText Markup Language, HTML), extensible markup language (Extensible Markup Language, XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as secure sockets layer (Secure Sockets Layer, SSL), transport layer security (Transport Layer Security, TLS), virtual private network (Virtual Private Network, VPN), internet protocol security (Internet Protocol Security, IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.

The server 100 may provide various network services for the terminal device 200, wherein the server 100 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing center. In particular, the server 100 may include a processor 110 (Central Processing Unit, CPU), a memory 120, an input device 130, an output device 140, etc. The input device 130 may include a keyboard, a mouse, a touch screen, etc., and the output device 140 may include a display device such as a liquid crystal display (Liquid Crystal Display, LCD), a cathode ray tube (Cathode Ray Tube, CRT), etc.

The memory 120 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides the processor 110 with program instructions and data stored in the memory 120. In the embodiment of the present application, the memory 120 may be used to store a program of the sequential image object recognition method preferred based on image quality in the embodiment of the present application.

The processor 110 is configured to call the program instructions stored in the memory 120 and to perform, according to the obtained program instructions, the steps of any of the sequential image target recognition methods based on image quality preference in the embodiments of the present application.

In addition, the application architecture diagram in the embodiment of the present application is intended to illustrate the technical solution of the embodiment more clearly and does not limit it; the technical solution provided by the embodiment of the present application is equally applicable to similar problems in other application architectures and service applications.

The method for identifying a sequential image object based on image quality preference provided according to at least one embodiment of the present application is described below in a non-limiting manner by means of several examples or embodiments, and as described below, different features of these specific examples or embodiments may be combined with each other without contradiction, so as to obtain new examples or embodiments, which are also within the scope of protection of the present application.

In view of the above technical problems, the technical solution of the present application provides a sequential image target recognition method based on image quality optimization, as shown in Fig. 2, which includes:

S510, acquiring a queue of images of the object to be identified, which are acquired by a camera;

S520, determining the target IOU relation of each object image to be identified in the queue of the object images to be identified based on a human face target detection network and a human body target detection network so as to obtain a queue of target IOU relations;

S530, extracting a queue of target object images from the queue of object images to be identified based on the queue of target IOU relations;

S540, selecting a target object image with optimal face quality from the queue of the target object images as a target object image to be detected;

S550, carrying out face recognition on the target object image to be detected to obtain a recognition result, wherein the recognition result is a personnel identity tag;

S560, designating the personnel identity label in the identification result as the personnel identity label of the queue of the target object image.

In step S520, determining, based on the face target detection network and the body target detection network, a target IOU relationship of each object image to be identified in the queue of the object images to be identified to obtain a queue of target IOU relationships, includes: inputting each object image to be identified into the human face target detection network and the human body target detection network respectively to obtain a human body bounding box and a human face bounding box; and calculating a target IOU relationship between the human body bounding box and the human face bounding box according to the following relationship calculation formula:

$$\mathrm{IOU} = \frac{\text{intersection area}}{\text{union area}}$$

where the intersection area is the area of the intersection between the human body bounding box and the human face bounding box, and the union area is the area of their union.

Wherein, in step S530, extracting the queue of the target object image from the queue of the object image to be identified based on the queue of the target IOU relationship includes: in response to the target IOU relation being smaller than or equal to a preset threshold, eliminating the corresponding object image to be identified; and in response to the target IOU relationship being greater than a preset threshold, incorporating the corresponding object image to be identified into a queue of the target object image.

It should be understood that, in the technical solution of the present application, whether the objects framed by the human body bounding box and the human face bounding box belong to the same object can be confirmed based on the target IOU relationship. In this particular example, target screening is performed by comparing the target IOU relationship with a preset threshold. That is, the above image quality preference-based sequential image target recognition method avoids the defect of the conventional scheme, in which face recognition must be performed on every captured face image. Face detection and human body detection are performed on each appearing target, a queue of target object images is extracted according to the target IOU relationship between the face detection result and the human body detection result, and a target object image with reliable face quality is then screened from the queue of target object images for face recognition to match the identity id of the person. In this way, the image sequence of the target object is constructed through the target IOU relationship, only the target object image with the optimal face quality is sent to personnel identity recognition, face identity verification is completed with a single face recognition pass, and the identity id is matched to the whole image sequence. The method thus saves computing resources while providing a more reliable target recognition scheme for fields such as security monitoring, addresses the problems in traditional cross-domain personnel identification and trajectory tracking, and improves the accuracy and efficiency of face recognition.
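As a non-limiting illustration, the flow of steps S520 to S560 can be sketched as follows, with the detectors, the quality scorer and the face recognizer passed in as callables. The names `recognize_sequence`, `iou_of`, `quality_of` and `recognize_face` are hypothetical placeholders for the corresponding networks, which are not implemented here.

```python
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def recognize_sequence(
    images: Sequence[Image],
    iou_of: Callable[[Image], float],       # target IOU relationship per image
    quality_of: Callable[[Image], float],   # scoring decoding value per image
    recognize_face: Callable[[Image], str],  # returns a personnel identity tag
    iou_threshold: float,
) -> Tuple[List[Image], Optional[str]]:
    """Label a whole image queue using one face recognition call."""
    # S520-S530: keep images whose target IOU relationship exceeds the threshold.
    targets = [img for img in images if iou_of(img) > iou_threshold]
    if not targets:
        return [], None
    # S540: the image with the largest quality score is the image to be detected.
    best = max(targets, key=quality_of)
    # S550: face recognition runs once, on the selected image only.
    identity = recognize_face(best)
    # S560: the identity tag is assigned to the whole queue of target images.
    return targets, identity
```

The point of the design is visible in the call count: the recognizer is invoked once per queue rather than once per captured frame, while every retained frame still receives the identity tag and can contribute to the trajectory.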

In particular, in the image quality optimization-based sequential image target recognition method, the step of selecting the target object image with the optimal face quality from the queue of target object images is critical: it ensures that the facial features of the target object to be detected are as clear as possible, which helps the face recognition algorithm extract key information and confirm personnel identity more accurately. That is, in cross-domain personnel identification and trajectory tracking, selecting the optimal image for recognition reduces recognition failures caused by poor image quality and preserves the continuity and integrity of personnel trajectories.

Based on the above, the technical idea of the application is to process each target object image in the queue of target object images with an image processing and analysis algorithm based on artificial intelligence and deep learning, so as to learn and capture the multi-modal statistical features, depth feature information and facial semantics of each target object image. The multi-modal statistical features are used to assist and optimize the expression of the target object features so that the quality of different target object images can be scored, and the target object image corresponding to the maximum scoring decoding value is determined as the target object image to be detected for the subsequent face recognition and personnel identity detection tasks.

Accordingly, as shown in Fig. 3, selecting a target object image with the optimal face quality from the queue of target object images as a target object image to be detected includes, for each target object image in the queue of target object images:

S541, processing each target object image by using an LBP mode operator to obtain a target object image LBP feature vector;

S542, processing each target object image by using the HOG feature descriptors to obtain a target object HOG feature vector;

S543, inputting the target object HOG feature vector and the target object image LBP feature vector into a dynamic interaction module under gating response to obtain a target object multi-mode statistical feature vector;

S544, inputting each target object image into an image feature extractor based on a cavity (dilated) convolutional neural network model to obtain a target object image feature map;

S545, inputting the target object image feature map into a feature foreground mask salizer based on a convolution gating feedforward mechanism to obtain a foreground salient target object image feature map;

S546, inputting the foreground salient target object image feature map and the target object multi-mode statistical feature vector into a MetaNet model-based cross-domain joint encoder to obtain a multi-mode statistical feature assisted target object image fusion feature map;

S547, inputting the multi-mode statistical feature assisted target object image fusion feature map into a decoder-based image quality scoring device to obtain a scoring decoding value.

Specifically, the target object image with the optimal face quality is selected from the queue of target object images as the target object image to be detected as follows. For each target object image in the queue of target object images, the target object image is first processed with an LBP (local binary pattern) operator to obtain a target object image LBP feature vector, and with a HOG (histogram of oriented gradients) feature descriptor to obtain a target object HOG feature vector. It should be understood that the LBP operator captures texture feature information in an image and characterizes well targets with pronounced texture, such as faces. In its rotation-invariant form, the LBP feature is also stable under image rotation, so that the feature remains stable even if the face rotates in the image. The HOG feature descriptor effectively captures edge and shape feature information in the image and characterizes well features such as the face contour. Moreover, the HOG features have a certain degree of scale invariance and can adapt to target objects and faces at different scales.
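The two statistical descriptors of steps S541 and S542 can be illustrated with a minimal pure-Python sketch. It is not the implementation of the application: the function names `lbp_histogram` and `orientation_histogram`, the basic 8-neighbour LBP code, the single-cell orientation histogram (a full HOG descriptor uses many cells with block normalization) and the 9-bin setting are all assumptions made for illustration. The image is assumed to be a grayscale list of rows.

```python
import math

def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes (interior pixels only)."""
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0.0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def orientation_histogram(img, bins=9):
    """Single-cell, HOG-style histogram of unsigned gradient orientations (L2-normalised)."""
    hist = [0.0] * bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            gx = img[r][c + 1] - img[r][c - 1]   # central difference along x
            gy = img[r + 1][c] - img[r - 1][c]   # central difference along y
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]
```

For an image containing only a vertical edge, all gradient energy falls into the first orientation bin, which is the kind of edge and shape cue the description attributes to the HOG features.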

Next, it is considered that the target object HOG feature vector mainly describes the edge and shape feature information of the target object face in the image, whereas the target object image LBP feature vector focuses more on the texture feature information of the target object face in the image. The two features describe different aspects and types of the face semantics of the target object, and there is implicit correlation and interaction information between them. Therefore, in order to combine the two types of face semantic features effectively, so as to take the different aspects of the target object image into account comprehensively and improve the diversity and characterization capability of the features, in the technical scheme of the application the target object HOG feature vector and the target object image LBP feature vector are further input into a dynamic interaction module of feature vectors under a gating response to obtain a target object multi-modal statistical feature vector. Through the processing of this module, the correlation and mutual influence between the target object HOG feature vector and the target object image LBP feature vector can be learned and captured, so that the two kinds of feature information supplement each other through their implicit semantic correlation, and the importance and contribution of the different types of image semantic features to the subsequent image quality assessment task are identified.
In this way, when the target object HOG feature vector and the target object image LBP feature vector are fused to assist the subsequent image quality scoring task, the gating response mechanism realizes an adaptive, dynamically weighted fusion of the two types of image features, so that the model can learn the key multi-modal statistical features of the target object image. This improves the feature expression capability and provides a more comprehensive and accurate data basis for the subsequent processing and image quality scoring of the target object image.

Accordingly, in step S543, inputting the target object HOG feature vector and the target object image LBP feature vector into the dynamic interaction module under the gating response to obtain the target object multi-modal statistical feature vector includes: inputting the target object HOG feature vector and the target object image LBP feature vector into a feature combination module for concatenation to obtain a target object multi-modal statistical information joint feature vector; multiplying the target object multi-modal statistical information joint feature vector by a parameter matrix and adding a bias vector to the result position-wise to obtain a linearly transformed target object multi-modal statistical information joint feature vector; activating the linearly transformed target object multi-modal statistical information joint feature vector with a sigmoid function to obtain a target object multi-modal statistical information dynamic fusion response gating value; calculating the position-wise product of the target object HOG feature vector and the target object multi-modal statistical information dynamic fusion response gating value to obtain a weight-modulated target object HOG feature vector; calculating one minus the target object multi-modal statistical information dynamic fusion response gating value and multiplying the resulting weight value position-wise with the target object image LBP feature vector to obtain a weight-modulated target object image LBP feature vector; and adding the weight-modulated target object HOG feature vector and the weight-modulated target object image LBP feature vector position-wise to obtain the target object multi-modal statistical feature vector.

In a specific example, inputting the target object HOG feature vector and the target object image LBP feature vector into the dynamic interaction module under the gating response to obtain the target object multi-modal statistical feature vector includes: inputting the target object HOG feature vector and the target object image LBP feature vector into the dynamic interaction module of feature vectors under the gating response, where they are processed according to the following dynamic interaction formula to obtain the target object multi-modal statistical feature vector; wherein the dynamic interaction formula is:

$$g = \sigma\left(W\,[h_{HOG};\, h_{LBP}] + b\right)$$

$$h = g \odot h_{HOG} + (1 - g) \odot h_{LBP}$$

Wherein $h_{HOG}$ and $h_{LBP}$ are respectively the target object HOG feature vector and the target object image LBP feature vector, $[\cdot\,;\cdot]$ represents the vector concatenation operation, $W$ is the parameter matrix, $b$ is the bias vector, $\sigma$ is the sigmoid function, $\odot$ is position-wise multiplication, $g$ is the target object multi-modal statistical information dynamic fusion response gating value, and $h$ is the target object multi-modal statistical feature vector.
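The gated interaction of step S543 can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the implementation of the application: the function name `gated_fusion` is hypothetical, both feature vectors are assumed to have the same length, and the LBP branch is assumed to be weighted by the complement of the gate (one minus the gating value) before the two branches are added position-wise.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gated_fusion(hog, lbp, weight, bias):
    """Fuse two equal-length feature vectors with a learned sigmoid gate.

    weight has len(hog) rows of length 2 * len(hog); bias has len(hog) entries.
    """
    assert len(hog) == len(lbp) == len(weight) == len(bias)
    joint = list(hog) + list(lbp)                     # concatenation of both vectors
    linear = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]         # parameter matrix times joint vector, plus bias
    gate = [sigmoid(v) for v in linear]                # gating value in (0, 1) per position
    # Assumption: HOG is weighted by the gate, LBP by its complement, then both are added.
    fused = [g * h + (1.0 - g) * l for g, h, l in zip(gate, hog, lbp)]
    return gate, fused
```

With an all-zero parameter matrix and bias the gate is 0.5 everywhere and the result is the plain average of the two vectors; a strongly positive bias drives the gate toward 1 so that the HOG branch dominates.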

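Further, after the multi-modal statistical features of the target object have been extracted, in order to understand more deeply and comprehensively the face semantics of the target object contained in each target object image, so that the image quality can be scored better and the image with the optimal quality can be screened out for personnel identity recognition, each target object image is further input into an image feature extractor based on a dilated convolutional neural network model to obtain a target object image feature map.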

It should be appreciated that the target object image feature map contains a variety of semantic and feature information from the image, including background interference and redundant information unrelated to the face semantics of the target object, which can affect the subsequent assessment of image quality. Based on this, in order to better capture the important features and structural information in the target object image, in the technical scheme of the application the target object image feature map is further input into a feature foreground mask saliency enhancer based on a convolution gating feedforward mechanism to obtain a foreground salient target object image feature map. The feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism highlights the target object in the image and enhances its saliency by suppressing background noise and interference, which plays an important role in the subsequent screening of the image with the optimal face quality and in cross-domain personnel identity recognition.

Specifically, the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism first performs layer normalization on the target object image feature map to eliminate scale differences between different layers. The normalized target object image feature map then undergoes channel expansion through point convolution, which enhances the channel expression capability of the feature map, and a dilated convolution layer is introduced to realize depth convolution coding by adjusting the coverage of the convolution kernel, providing rich spatial and depth information for the subsequent gating mechanism. In the core step, the target object image depth convolution master feature map is sent to a foreground gating mask module, which dynamically generates foreground mask weights based on the GELU function and thereby adaptively selects and strengthens the foreground features of the target object image. This step is the key to making the foreground information salient: through the gating mechanism, the model can focus on the foreground region of the target object image and suppress background noise. Next, the position-wise product of the target object image depth convolution gating mask weight feature map and the target object image depth convolution backup feature map is calculated to realize the gated-mask foreground highlighting of the target object image feature map; this operation combines the foreground mask weights with the target object image feature map to generate a feature map in which the foreground information is highlighted. Finally, the target object image gating mask foreground salient feature map undergoes channel contraction through point convolution, which completes the final feature modulation and generates the foreground salient target object image feature map.
This not only improves the expression capability of the target object image features, but also strengthens the ability of the model to recognize and process the key foreground information in the image, namely the face region; in particular, it can markedly improve the performance of the deep learning model when processing data containing complex face foregrounds and backgrounds.

Accordingly, in step S545, inputting the target object image feature map into the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism to obtain the foreground salient target object image feature map includes: performing layer normalization on the target object image feature map to obtain a normalized target object image feature map; performing channel expansion based on point convolution and depth convolution coding based on a dilated convolution layer on the normalized target object image feature map to obtain a target object image depth convolution backup feature map and a target object image depth convolution master feature map; inputting the target object image depth convolution master feature map into a foreground gating mask module based on the GELU function to obtain a target object image depth convolution gating mask weight feature map; calculating the position-wise product of the target object image depth convolution gating mask weight feature map and the target object image depth convolution backup feature map to obtain a target object image gating mask foreground salient feature map; and performing channel contraction based on point convolution on the target object image gating mask foreground salient feature map to obtain the foreground salient target object image feature map.

In one specific example, inputting the target object image feature map into the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism to obtain the foreground salient target object image feature map includes: inputting the target object image feature map into the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism, where it is processed with the following foreground mask saliency formula to obtain the foreground salient target object image feature map; the foreground mask saliency formula is as follows:

$$F_{n} = \mathrm{LN}(F)$$

$$[F_{b},\, F_{o}] = \mathrm{DConv}\left(\mathrm{PConv}(F_{n})\right)$$

$$F_{w} = \mathrm{Mask}\left(\mathrm{GELU}(F_{o})\right)$$

$$F_{s} = F_{w} \odot F_{b}$$

$$F' = \mathrm{PConv}(F_{s})$$

Wherein $F$ is the target object image feature map, $\mathrm{LN}(\cdot)$ represents the layer normalization operation on the feature map, $F_{n}$ is the normalized target object image feature map, $\mathrm{PConv}(\cdot)$ is the point convolution operation, $\mathrm{DConv}(\cdot)$ is the dilated convolution operation with a convolution kernel of the predetermined size, $F_{b}$ is the target object image depth convolution backup feature map, $F_{o}$ is the target object image depth convolution master feature map, $\mathrm{GELU}(\cdot)$ is the GELU activation function, $\mathrm{Mask}(\cdot)$ is the masking process, which is applied to each position feature value $x$ in the feature map and is controlled by a predetermined hyperparameter, $F_{w}$ is the target object image depth convolution gating mask weight feature map, $\odot$ is position-wise multiplication, $F_{s}$ is the target object image gating mask foreground salient feature map, and $F'$ is the foreground salient target object image feature map.
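The data flow of step S545 (layer normalization, point-convolution expansion, dilated depth convolution, GELU-based gating mask, position-wise product, point-convolution contraction) can be sketched as follows. Several simplifications are assumptions made only to keep the example short: positions are laid out along one axis instead of a two-dimensional image, the dilated convolution uses three taps per channel with a dilation of 2, and the masking rule (keep the activated value where it exceeds a threshold `tau`, otherwise zero) is hypothetical, since the exact rule is not recoverable from the text. All function names are illustrative.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalise one position's channel vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def point_conv(fmap, weight):
    """1x1 convolution: the same linear map applied independently at every position."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in fmap]

def dilated_depth_conv(fmap, kernels, dilation):
    """Per-channel 3-tap convolution along the position axis with zero padding."""
    n, ch = len(fmap), len(fmap[0])
    out = [[0.0] * ch for _ in range(n)]
    for p in range(n):
        for c in range(ch):
            for k, tap in enumerate(kernels[c]):
                q = p + (k - 1) * dilation        # taps sit `dilation` positions apart
                if 0 <= q < n:
                    out[p][c] += tap * fmap[q][c]
    return out

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def foreground_salience(fmap, w_expand, kernels, w_contract, dilation=2, tau=0.0):
    hidden = len(w_expand) // 2                   # expansion yields backup + master branches
    normed = [layer_norm(vec) for vec in fmap]
    coded = dilated_depth_conv(point_conv(normed, w_expand), kernels, dilation)
    backup = [vec[:hidden] for vec in coded]
    master = [vec[hidden:] for vec in coded]
    # Hypothetical mask: keep the GELU-activated value only where it exceeds tau.
    mask = [[g if g > tau else 0.0 for g in map(gelu, vec)] for vec in master]
    salient = [[m * b for m, b in zip(mv, bv)] for mv, bv in zip(mask, backup)]
    return point_conv(salient, w_contract)        # contract back to the input channel count
```

The output has the same number of positions and channels as the input, and positions whose gate is fully masked contribute nothing, which is the suppression behaviour the description attributes to the gating mechanism.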

In order to understand the semantics of the target object image more deeply and accurately and to identify which images have better face quality, the foreground salient target object image feature map and the target object multi-modal statistical feature vector are further input into a MetaNet model-based cross-domain joint encoder to obtain a multi-modal statistical feature assisted target object image fusion feature map. The MetaNet model-based cross-domain joint encoder can learn a shared feature representation among image features of different modalities, and learning this cross-domain representation helps to fuse the different features effectively. Specifically, the statistical features of the target object image are used to assist in optimizing the expression of the semantic features of the target object image, which improves the semantic representation capability of the target object image and benefits the subsequent image quality detection task and target identity recognition task.

Then, the multi-modal statistical feature assisted target object image fusion feature map is input into a decoder-based image quality scorer to obtain a scoring decoding value. That is, decoding regression is performed on the optimized characterization of the semantic features of the target object image obtained with the assistance of the statistical features, so that the quality of the image is evaluated and a scoring decoding value is obtained. The target object image corresponding to the largest scoring decoding value is then determined as the target object image to be detected, which facilitates the subsequent face recognition and personnel identity detection tasks. In this way, in cross-domain personnel recognition and trajectory tracking, the optimal image can be selected for recognition, which reduces recognition failures caused by poor image quality and ensures the continuity and integrity of personnel trajectories.

Wherein, in step S547, the target object image corresponding to the largest of the scoring decoding values is determined as the target object image to be detected.
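Selecting the image with the largest scoring decoding value amounts to an argmax over the queue. A minimal sketch, in which the function name `select_best_image` and the representation of a queue as parallel lists of image identifiers and scores are assumptions for illustration:

```python
def select_best_image(image_ids, scores):
    """Return the identifier and score of the highest-scoring image in one queue."""
    if not image_ids or len(image_ids) != len(scores):
        raise ValueError("need at least one image and exactly one score per image")
    # max() returns the first index on ties, so the earliest best frame wins.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return image_ids[best], scores[best]
```

Only the returned image is then passed to face recognition, so each queue incurs a single recognition call regardless of its length.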

Further, in step S550, performing face recognition on the target object image to be detected to obtain a recognition result includes: inputting the target object image to be detected into an AlexNet-based face feature extractor to obtain a face feature vector; and inputting the face feature vector into a classifier-based face recognizer to obtain the recognition result.

Further, in the technical scheme of the present application, the sequential image target recognition method based on image quality optimization further includes a training step for training the dynamic interaction module under the gating response, the image feature extractor based on the dilated convolutional neural network model, the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism, the MetaNet model-based cross-domain joint encoder and the decoder-based image quality scorer.

Wherein the training step comprises: acquiring training data, the training data comprising a queue of training object images to be identified acquired by a camera; determining, based on the face target detection network and the human body target detection network, the target IOU relation of each training object image to be identified in the queue of training object images to be identified to obtain a queue of training target IOU relations; extracting a queue of training target object images from the queue of training object images to be identified based on the queue of training target IOU relations; processing each training target object image in the queue of training target object images with the LBP operator to obtain a training target object image LBP feature vector; processing each training target object image with the HOG feature descriptor to obtain a training target object HOG feature vector; inputting the training target object HOG feature vector and the training target object image LBP feature vector into the dynamic interaction module under the gating response to obtain a training target object multi-modal statistical feature vector; inputting each training target object image into the image feature extractor based on the dilated convolutional neural network model to obtain a training target object image feature map; inputting the training target object image feature map into the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism to obtain a training foreground salient target object image feature map; inputting the training foreground salient target object image feature map and the training target object multi-modal statistical feature vector into the MetaNet model-based cross-domain joint encoder to obtain a training multi-modal statistical feature assisted target object image fusion feature map; inputting the training multi-modal statistical feature assisted target object image fusion feature map into the decoder-based image quality scorer to obtain a decoding loss function value; calculating a predetermined loss function value of the training multi-modal statistical feature assisted target object image fusion feature map to obtain a multi-modal statistical feature assisted target object image fusion loss function value; and taking the weighted sum of the decoding loss function value and the multi-modal statistical feature assisted target object image fusion loss function value as the loss function value to train the dynamic interaction module under the gating response, the image feature extractor based on the dilated convolutional neural network model, the feature foreground mask saliency enhancer based on the convolution gating feedforward mechanism, the MetaNet model-based cross-domain joint encoder and the decoder-based image quality scorer.

In a preferred example, the training target object multi-modal statistical feature vector represents the gating-mechanism-based dynamic interaction feature representation of the LBP features and HOG features of the training target object image, and the training foreground salient target object image feature map represents the foreground-saliency-enhanced image semantic features obtained by dilated convolution coding of the training target object image. When the training foreground salient target object image feature map and the training target object multi-modal statistical feature vector are input into the MetaNet model-based cross-domain joint encoder, the training target object multi-modal statistical feature vector acts as an auxiliary modality that constrains the feature expression of the training foreground salient target object image feature map along the channel dimension, so that the training multi-modal statistical feature assisted target object image fusion feature map has a rich multi-modal, channel-mixed feature expression. At the same time, however, this fusion feature map has complex semantic features, which makes decoding regression difficult and affects the decoding training efficiency.

Accordingly, in the model training process the applicant of the present application further introduces, in addition to the decoding loss function value (for example, a difference loss function between the true result and the predicted result), a predetermined loss function value for the training multi-modal statistical feature assisted target object image fusion feature map; that is, the model is trained by gradient back propagation based on these loss function values.

Specifically, calculating the predetermined loss function value of the training multi-modal statistical feature assisted target object image fusion feature map to obtain the multi-modal statistical feature assisted target object image fusion loss function value includes the following steps: unfolding the training multi-modal statistical feature assisted target object image fusion feature map into a training multi-modal statistical feature assisted target object image fusion feature vector; calculating a first multi-modal statistical feature assisted target object image fusion weight matrix and a second multi-modal statistical feature assisted target object image fusion weight matrix based on the training multi-modal statistical feature assisted target object image fusion feature vector, wherein the feature values at position (i, j) of the first and second multi-modal statistical feature assisted target object image fusion weight matrices are, respectively, the mean of the i-th feature value and the j-th feature value of the training multi-modal statistical feature assisted target object image fusion feature vector and one half of the absolute value of their difference; multiplying the training multi-modal statistical feature assisted target object image fusion feature vector with the first multi-modal statistical feature assisted target object image fusion weight matrix and with the second multi-modal statistical feature assisted target object image fusion weight matrix respectively to obtain a first multi-modal statistical feature assisted target object image fusion intermediate vector and a second multi-modal statistical feature assisted target object image fusion intermediate vector; calculating the vector inner product of the first multi-modal statistical feature assisted target object image fusion intermediate vector and the second multi-modal statistical feature assisted target object image fusion intermediate vector to obtain a first multi-modal statistical feature assisted target object image fusion loss term; performing matrix multiplication on the first multi-modal statistical feature assisted target object image fusion weight matrix and the second multi-modal statistical feature assisted target object image fusion weight matrix and calculating the norm of the result matrix to obtain a second multi-modal statistical feature assisted target object image fusion loss term; and subtracting the product of a predetermined weight hyperparameter and the second multi-modal statistical feature assisted target object image fusion loss term from the first multi-modal statistical feature assisted target object image fusion loss term to obtain the multi-modal statistical feature assisted target object image fusion loss function value.

Then, model parameters can be optimized by gradient back propagation based on a weighted sum of the target object image fusion loss function value and the decoding loss function value under the assistance of the multi-modal statistical features.

The process of obtaining the multi-modal statistical feature assisted target object image fusion loss function value can be specifically expressed as the following loss calculation formula:

$$m^{(1)}_{i,j} = \frac{v_i + v_j}{2}, \qquad m^{(2)}_{i,j} = \frac{\left|v_i - v_j\right|}{2}$$

$$\mathcal{L} = \left(V \otimes M_1\right)\left(V \otimes M_2\right)^{\top} - \lambda \left\| M_1 \otimes M_2 \right\|$$

Wherein $V$ is the training multi-modal statistical feature assisted target object image fusion feature vector, $M_1$ and $M_2$ are respectively the first multi-modal statistical feature assisted target object image fusion weight matrix and the second multi-modal statistical feature assisted target object image fusion weight matrix, $m^{(1)}_{i,j}$ and $m^{(2)}_{i,j}$ are respectively the feature values at position $(i, j)$ of the first and second multi-modal statistical feature assisted target object image fusion weight matrices, $v_i$ and $v_j$ are respectively the $i$-th feature value and the $j$-th feature value of the training multi-modal statistical feature assisted target object image fusion feature vector, $\otimes$ is matrix multiplication, $\|\cdot\|$ denotes calculating the norm of a matrix, $\lambda$ is the predetermined weight hyperparameter, and $\mathcal{L}$ is the multi-modal statistical feature assisted target object image fusion loss function value.
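The loss calculation steps described above can be traced with a short pure-Python sketch. It follows the verbal description (mean and half-absolute-difference weight matrices, two intermediate vectors, their inner product, a matrix norm and a weighted subtraction), but the function names, the default hyperparameter values and in particular the choice of the Frobenius norm are assumptions, because the type of matrix norm is not recoverable from the text.

```python
import math

def fusion_loss(v, lam=0.1):
    """Fusion loss for a flattened fusion feature vector v (list of floats)."""
    n = len(v)
    # Weight matrices: pairwise mean and half of the pairwise absolute difference.
    m1 = [[(v[i] + v[j]) / 2.0 for j in range(n)] for i in range(n)]
    m2 = [[abs(v[i] - v[j]) / 2.0 for j in range(n)] for i in range(n)]
    # Intermediate vectors: the row vector v multiplied by each weight matrix.
    u1 = [sum(v[i] * m1[i][j] for i in range(n)) for j in range(n)]
    u2 = [sum(v[i] * m2[i][j] for i in range(n)) for j in range(n)]
    term1 = sum(a * b for a, b in zip(u1, u2))          # inner product of the two vectors
    prod = [[sum(m1[i][k] * m2[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
    # Assumption: Frobenius norm of the product of the two weight matrices.
    term2 = math.sqrt(sum(x * x for row in prod for x in row))
    return term1 - lam * term2

def total_loss(decoding_loss, fusion_loss_value, alpha=1.0, beta=1.0):
    """Weighted sum of the decoding loss and the fusion loss used for back propagation."""
    return alpha * decoding_loss + beta * fusion_loss_value
```

For a constant feature vector every pairwise difference is zero, so both loss terms vanish; the loss only becomes non-zero when the feature values differ, which is consistent with its role of regularizing the internal structure of the fusion feature.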

In other words, in the above preferred example, the multi-modal statistical feature assisted target object image fusion loss function value uses the structured feature representation of the training multi-modal statistical feature assisted target object image fusion feature map, in which short-range and long-range cross-scale details are linked, to perform a query combination in the detail inner product space of the feature map. This approximates the low-rank, independently observable combination of the linked detail combinations provided by the structural detail interaction of the feature map. By training with this loss function value, the distributed detail groups of the training multi-modal statistical feature assisted target object image fusion feature map are decomposed according to detail complexity, which promotes the decoding regression of the complex feature structure of the feature map and improves the decoding training efficiency. Therefore, the image quality can be scored more effectively to select the target object image with the optimal face quality for the subsequent face recognition and personnel identity detection tasks.

Further, based on the above embodiment, fig. 4 shows a schematic structural diagram of a sequential image target recognition device 800 based on image quality optimization in an embodiment of the present application. The sequential image target recognition device 800 based on image quality optimization includes: an image queue obtaining module 810, configured to obtain a queue of object images to be identified acquired by the camera; an IOU relation determining module 820, configured to determine, based on the face target detection network and the human body target detection network, the target IOU relation of each object image to be identified in the queue of object images to be identified to obtain a queue of target IOU relations; a target object image queue extracting module 830, configured to extract a queue of target object images from the queue of object images to be identified based on the queue of target IOU relations; an optimal selection module 840, configured to select, from the queue of target object images, the target object image with the optimal face quality as the target object image to be detected; a face recognition module 850, configured to perform face recognition on the target object image to be detected to obtain a recognition result, where the recognition result is a personnel identity tag; and a personnel identity tag designating module 860, configured to designate the personnel identity tag in the recognition result as the personnel identity tag of the queue of target object images.

Here, it will be understood by those skilled in the art that the specific functions and operations of the respective modules in the above sequential image target recognition device 800 based on image quality optimization have been described in detail in the above description of the sequential image target recognition method based on image quality optimization with reference to fig. 2 to 3, and a repeated description is therefore omitted.

Fig. 5 is an application scenario diagram of the sequential image target recognition method based on image quality optimization according to an embodiment of the present application. As shown in fig. 5, in this application scenario, a queue of object images to be identified (for example, D illustrated in fig. 5) acquired by a camera is first obtained and then input into a server (for example, S illustrated in fig. 5) on which the sequential image target recognition algorithm based on image quality optimization is deployed. The server processes the queue of object images to be identified with this algorithm to determine the personnel identity tag of the queue of object images to be identified.

Based on the foregoing embodiments, an embodiment of the present application further provides an electronic device of another exemplary embodiment, including: a processor; and a memory in which computer program instructions are stored, the computer program instructions, when executed by the processor, causing the processor to perform the sequential image target recognition method based on image quality optimization according to any of the foregoing embodiments.

For example, taking the case where the electronic device is the server 100 in fig. 1 of the present application, the processor in the electronic device is the processor 110 in the server 100, and the memory in the electronic device is the memory 120 in the server 100.

Further, in one embodiment of the present application, there is also provided a sequential image target recognition method based on image quality optimization, which comprises the following steps:

Step 1, storing the queue to be identified. Human body/face information is captured through the camera lens, stored as target sequences according to the IOU relation of the targets, and assigned pseudo tags.

Step 2, face quality optimization. Pictures of higher quality are selected as the face recognition input, which improves the reliability of the face recognition result.

Step 3, face recognition. Face features are extracted from the clear frontal face and compared against the faces in a personnel database to identify the personnel identity information.

Step 4, human body id matching. The recognized face is matched with a human body by performing an IOU calculation on the face box and the human body box; when the IOU exceeds a threshold, the face and the human body are regarded as having the same person id.

Accordingly, the method has the following beneficial effects: 1. A target queue needs only one face recognition pass to verify the face identity information of all of its targets. The pseudo tag method stores targets of the same identity in the same queue, so identity id information can be matched for the whole queue with a single face recognition. Even if face information is lost, the identity can still be confirmed through the pseudo tag of the queue.
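The "recognize once, label the whole queue" effect described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the `Frame` class, its `quality` field, the `label_queue` function and the `recognize` callback are all hypothetical names introduced here, and the quality score is assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Frame:
    """One image in a target queue (all fields are illustrative)."""
    frame_id: int
    quality: float                  # hypothetical face quality score, higher is better
    identity: Optional[str] = None  # person identity tag, filled in later


def label_queue(queue: List[Frame], recognize: Callable[[Frame], str]) -> str:
    """Recognize only the best-quality frame and copy its tag to every frame."""
    best = max(queue, key=lambda f: f.quality)
    label = recognize(best)  # the only face recognition call for this queue
    for frame in queue:
        frame.identity = label
    return label
```

With a queue of three frames, `recognize` is invoked once, on the frame with the highest quality score, and all three frames receive the returned tag.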

Specifically, as shown in Fig. 6, four queues Q1 to Q4 are first provided, each storing a plurality of images. When the first face of Q3 and the last face of Q4 meet the face recognition requirement, Q4 is directly assigned an id, and after Q3 is assigned an id, the human body feature vectors of Q3 and Q4 are matched. Here the score is lower; Q2 is ranked with higher priority and Q1 later, because the re-ranking relationship analyzes body posture orientation and Q1 shows the back. Q3 is matched successfully with Q2 and then with Q1, so the face id of Q3 is assigned to Q2 and Q1. Q4 does not match the upper body characteristic information, so its face id does not need to be assigned to a cross-domain event.

More specifically, in step 1, the multi-view cameras capture human shape/face events to form a cross-domain time-sequence queue to be identified. Qi denotes all events at moment T of a single viewing angle, including face snapshots and human shape snapshots. The captured face/human shape events are allocated to time-sequence queues by calculating whether the IOU of the target detection frames in adjacent context exceeds a threshold, and a pseudo tag id is assigned to each queue q at the same time. For a face event and a human shape event at the same moment, when the face area lies within the human body area, the two events are regarded as the same target.
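The queue allocation of step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, a detection is compared only with the most recent box of each queue, each queue accepts at most one detection per frame, and the default threshold of 0.5 is arbitrary. The names `iou`, `assign_queues` and `face_inside_body` are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_queues(frames: List[List[Box]], threshold: float = 0.5) -> Dict[int, List[Box]]:
    """Group per-frame detections into queues keyed by a pseudo tag id.

    A detection joins the queue whose latest box overlaps it most, provided
    that overlap exceeds `threshold`; otherwise it opens a new queue.
    """
    queues: Dict[int, List[Box]] = {}
    next_tag = 0
    for detections in frames:
        taken = set()  # queues already extended (or created) in this frame
        for box in detections:
            best_tag, best_iou = None, threshold
            for tag, boxes in queues.items():
                if tag in taken:
                    continue
                score = iou(boxes[-1], box)
                if score > best_iou:
                    best_tag, best_iou = tag, score
            if best_tag is None:  # no sufficiently overlapping queue: new pseudo tag
                best_tag = next_tag
                next_tag += 1
                queues[best_tag] = []
            queues[best_tag].append(box)
            taken.add(best_tag)
    return queues


def face_inside_body(face: Box, body: Box) -> bool:
    """True when the face box lies entirely within the body box."""
    return (face[0] >= body[0] and face[1] >= body[1]
            and face[2] <= body[2] and face[3] <= body[3])
```

Greedy matching against the latest box keeps the sketch short; a production tracker would typically resolve conflicts between competing detections more carefully.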

In step 2, low-quality pictures among the recorded face events, such as blurred or heavily occluded pictures, are filtered out. Pictures that do not meet the face recognition requirements are not sent to the recognition module, which saves computing resources. N faces that are frontal, clear and unoccluded are selected to await recognition.
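The filtering of step 2 can be sketched as follows. This is only an illustration: the text does not specify how blur, occlusion or pose are measured, so the `sharpness`, `occlusion` and `yaw_deg` fields, the threshold defaults and the function name `select_faces` are all assumptions, and the per-image scores are taken as already computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaceEvent:
    event_id: int
    sharpness: float  # hypothetical score in [0, 1], higher is sharper
    occlusion: float  # hypothetical fraction of the face that is covered
    yaw_deg: float    # hypothetical head yaw in degrees, 0 means frontal


def select_faces(events: List[FaceEvent], n: int,
                 min_sharpness: float = 0.5,
                 max_occlusion: float = 0.2,
                 max_yaw_deg: float = 30.0) -> List[FaceEvent]:
    """Drop blurred, occluded or non-frontal faces and keep the n sharpest."""
    usable = [
        e for e in events
        if e.sharpness >= min_sharpness
        and e.occlusion <= max_occlusion
        and abs(e.yaw_deg) <= max_yaw_deg
    ]
    usable.sort(key=lambda e: e.sharpness, reverse=True)
    return usable[:n]
```

Only the events returned by `select_faces` would be forwarded to the recognition module; the rejected events stay in their queue and later inherit the tag of the queue.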

In step 3, the pictures that meet the requirements are sent to the recognition module. The face recognition module uses the ArcFace method with res50 (ResNet-50) as the backbone network to extract a unique feature vector for each face. The feature vector is used as the query content and matched against a personnel information base entered in advance: the similarity between the feature vectors is calculated, the maximum score is found, and the matched identity information is obtained if that score exceeds the threshold set by the implementation. To eliminate accidental errors, the most frequent recognition result among the face time-sequence events to be recognized is taken as the final result. Once the face events in a face sequence are successfully identified, the id can be assigned to all events in that face sequence. Because the faces and the human bodies were paired in advance in step 1, when a face obtains the person information, the corresponding human body sequence is assigned the same person id information.
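The matching and voting logic of step 3 can be sketched as follows. This is a minimal illustration that assumes the feature vectors have already been extracted; cosine similarity is an assumption (the text only says "similarity"), as are the default threshold of 0.6 and the names `cosine_similarity`, `match_identity` and `vote_identity`.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def match_identity(query: Sequence[float],
                   gallery: Dict[str, Sequence[float]],
                   threshold: float = 0.6) -> Optional[str]:
    """Return the gallery id with the highest score, or None if it is not above threshold."""
    best_id, best_score = None, threshold
    for person_id, feature in gallery.items():
        score = cosine_similarity(query, feature)
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id


def vote_identity(queries: List[Sequence[float]],
                  gallery: Dict[str, Sequence[float]],
                  threshold: float = 0.6) -> Optional[str]:
    """Take the most frequent match over several faces of one sequence."""
    matches = [m for m in (match_identity(q, gallery, threshold) for q in queries)
               if m is not None]
    if not matches:
        return None
    return Counter(matches).most_common(1)[0][0]
```

The id returned by `vote_identity` would then be assigned to every event of the face sequence and to the human body sequence paired with it in step 1.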

Those skilled in the art will appreciate that various modifications and improvements of the present disclosure may occur. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.

Furthermore, while the present application makes various references to certain elements in a system according to an embodiment of the present application, any number of different elements may be used and run on a client and/or server. The units are merely illustrative and different aspects of the systems and methods may use different units.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a program that instructs associated hardware, and the program may be stored on a computer readable storage medium such as a read-only memory, a magnetic or optical disk, etc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiment may be implemented in the form of hardware, or may be implemented in the form of a software functional module. The present application is not limited to any specific form of combination of hardware and software.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

The foregoing is illustrative of the present application and is not to be construed as limiting thereof. Although exemplary embodiments of the present application have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this application.

Claims

1. A method for sequential image target recognition based on image quality optimization, characterized by comprising:

Obtaining a queue of object images to be identified captured by a camera;

Based on a face target detection network and a human target detection network, determining the target IOU relationship of each object image to be identified in the queue of object images to be identified to obtain a queue of target IOU relationships;

Based on the queue of target IOU relationships, extracting a queue of target object images from the queue of object images to be identified;

Selecting a target object image with the best face quality from the queue of target object images as the target object image to be detected;

Performing face recognition on the target object image to be detected to obtain a recognition result, wherein the recognition result is a person identity tag;

Designating the person identity tag in the recognition result as the person identity tag of the queue of target object images.

2. The method for sequential image target recognition based on image quality optimization according to claim 1, characterized in that, based on a face target detection network and a human target detection network, determining the target IOU relationship of each object image to be recognized in the queue of the object images to be recognized to obtain a queue of target IOU relationships comprises:

Inputting the images of the objects to be identified into the face target detection network and the human target detection network respectively to obtain a human body bounding box and a face bounding box;

The target IOU relationship between the human body bounding box and the face bounding box is calculated using the following relationship calculation formula, wherein the relationship calculation formula is:

$$\mathrm{IOU} = \frac{\text{intersection area}}{\text{union area}}$$;

The intersection area is the area of the intersection between the human body bounding box and the face bounding box, and the union area is the area of the union between the human body bounding box and the face bounding box.

3. The method for sequential image object recognition based on image quality optimization according to claim 2, characterized in that extracting a queue of target object images from the queue of the to-be-recognized object images based on the queue of the target IOU relationship comprises:

In response to the target IOU relationship being less than or equal to a preset threshold, the corresponding image of the object to be identified is eliminated;

In response to the target IOU relationship being greater than a preset threshold, the corresponding to-be-recognized object image is included in the queue of the target object image.
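The IOU calculation of claim 2 and the threshold rule of claim 3 can be sketched together as follows. This is an illustration only: the `(x1, y1, x2, y2)` box layout, the default threshold of 0.05 and the names `iou` and `filter_by_iou` are assumptions that do not come from the claims.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_by_iou(items: List[Tuple[str, Box, Box]], threshold: float = 0.05) -> List[str]:
    """Keep the ids of images whose body/face IOU is strictly greater than threshold.

    Each item is (image_id, body_box, face_box); images at or below the
    threshold are eliminated.
    """
    return [image_id for image_id, body, face in items if iou(body, face) > threshold]
```

Because a face box is much smaller than the body box that contains it, the IOU of a correctly paired face and body is small, so a workable threshold is likely to be well below the values used for box-to-box tracking.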

4. The method for sequential image target recognition based on image quality optimization according to claim 3, characterized in that selecting a target object image with the best face quality from the queue of target object images as the target object image to be detected comprises, for each target object image in the queue of target object images:

Using an LBP (local binary pattern) operator to process each target object image to obtain a target object image LBP feature vector;

Using the HOG feature descriptor to process each target object image to obtain a target object HOG feature vector;

Inputting the target object HOG feature vector and the target object image LBP feature vector into a dynamic interaction module under a gated response to obtain a target object multimodal statistical feature vector;

Inputting each target object image into an image feature extractor based on a dilated convolutional neural network model to obtain a target object image feature map;

Inputting the target object image feature map into a feature foreground mask salient device based on a convolutional gated feed-forward mechanism to obtain a foreground salient target object image feature map;

Inputting the foreground salient target object image feature map and the target object multimodal statistical feature vector into a cross-domain joint encoder based on a MetaNet model to obtain a target object image fusion feature map assisted by multimodal statistical features;

The target object image fusion feature map assisted by the multimodal statistical features is input into a decoder-based image quality scorer to obtain a score decoding value.
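As an illustration of the LBP step in claim 4, the following sketch computes a basic 3x3 local binary pattern histogram in pure Python. The claim does not fix the LBP variant, so the 8-neighbour layout, the 256-bin normalized histogram and the function name `lbp_histogram` are assumptions.

```python
from typing import List


def lbp_histogram(image: List[List[int]]) -> List[float]:
    """Normalized 256-bin histogram of basic 3x3 LBP codes over interior pixels.

    `image` is a grayscale image given as a list of rows. Each neighbour that
    is greater than or equal to the centre pixel sets one bit of the code.
    """
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * 256
```

The resulting histogram plays the role of the target object image LBP feature vector; a HOG descriptor would be computed separately and passed with it to the gated fusion module.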

5. The method for sequential image target recognition based on image quality optimization according to claim 4, characterized in that inputting the target object HOG feature vector and the target object image LBP feature vector into the dynamic interaction module under the gated response to obtain the target object multimodal statistical feature vector comprises:

Inputting the target object HOG feature vector and the target object image LBP feature vector into a feature combination module for cascade processing to obtain a target object multimodal statistical information joint feature vector;

Calculating the matrix product of the target object multimodal statistical information joint feature vector and a parameter matrix, and adding the resulting feature vector to a bias vector position-wise to obtain a linearly transformed target object multimodal statistical information joint feature vector;

Activating the linearly transformed target object multimodal statistical information joint feature vector with an activation function to obtain a target object multimodal statistical information dynamic fusion response gating value;

Calculating the position-wise product between the target object HOG feature vector and the target object multimodal statistical information dynamic fusion response gating value to obtain a weight-modulated target object HOG feature vector;

Subtracting the target object multimodal statistical information dynamic fusion response gating value from one, and multiplying the obtained weight value position-wise by the target object image LBP feature vector to obtain a weight-modulated target object image LBP feature vector;

The weight-modulated target object HOG feature vector and the weight-modulated target object image LBP feature vector are processed according to position points to obtain a target object multimodal statistical feature vector.
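The gated fusion of claim 5 can be sketched as follows. Several points are assumptions made only for this illustration: the activation function is taken to be a sigmoid, the complementary weight is taken to be one minus the gating value, the final position-wise operation is taken to be addition, and the HOG and LBP vectors are taken to have the same length d so that position-wise products are defined. The name `gated_fusion` and its parameter layout are hypothetical.

```python
import math
from typing import List, Sequence


def gated_fusion(hog: Sequence[float], lbp: Sequence[float],
                 weight: Sequence[Sequence[float]],
                 bias: Sequence[float]) -> List[float]:
    """Fuse two equal-length vectors with an element-wise gate.

    `weight` has d rows of length 2d and `bias` has length d,
    where d = len(hog) = len(lbp).
    """
    if len(hog) != len(lbp):
        raise ValueError("hog and lbp must have the same length")
    joint = list(hog) + list(lbp)  # cascade (concatenation)
    # Linear transform: parameter matrix times joint vector, plus bias.
    linear = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]
    gate = [1.0 / (1.0 + math.exp(-z)) for z in linear]    # sigmoid (assumed)
    hog_mod = [g * h for g, h in zip(gate, hog)]           # weight-modulated HOG
    lbp_mod = [(1.0 - g) * l for g, l in zip(gate, lbp)]   # weight-modulated LBP
    return [h + l for h, l in zip(hog_mod, lbp_mod)]       # addition (assumed)
```

With zero weights and zero bias the gate is 0.5 everywhere and the output is the element-wise mean of the two inputs; a strongly positive bias drives the output toward the HOG vector.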

6. The method for sequential image target recognition based on image quality optimization according to claim 5, characterized in that inputting the target object image feature map into the feature foreground mask salient device based on the convolutional gated feed-forward mechanism to obtain the foreground salient target object image feature map comprises:

Performing layer normalization processing on the target object image feature map to obtain a normalized target object image feature map;

Performing point-convolution-based channel expansion and dilated-convolution-layer-based deep convolution coding on the normalized target object image feature map to obtain a target object image deep convolution backup feature map and a target object image deep convolution original feature map;

Inputting the target object image deep convolution original feature map into a foreground gated mask module based on the Gelu function to obtain a target object image deep convolution gated mask weight feature map;

Calculating the position-wise product between the target object image deep convolution gated mask weight feature map and the target object image deep convolution backup feature map to obtain a target object image gated mask foreground salient feature map;

The target object image gated mask foreground salient feature map is subjected to channel shrinkage based on point convolution to obtain the foreground salient target object image feature map.
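The data flow of claim 6 can be sketched with NumPy as follows. This is a rough illustration under several assumptions: feature maps are `(C, H, W)` arrays, layer normalization is applied over the channel axis, "deep convolution" is read as a per-channel (depthwise) 3x3 convolution, the backup and original feature maps are obtained by splitting one encoded map in half along the channel axis, and GELU uses the tanh approximation. All function names and weight shapes are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis of a (C, H, W) feature map."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pointwise_conv(x, weight):
    """1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def depthwise_dilated_conv(x, kernel, dilation=2):
    """Per-channel 3x3 dilated convolution with zero padding; kernel is (C, 3, 3)."""
    _, h, w = x.shape
    pad = dilation
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            dy, dx = i * dilation, j * dilation
            out += kernel[:, i, j, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def gated_feed_forward(x, w_expand, dw_kernel, w_shrink, dilation=2):
    """Layer norm, channel expansion, dilated encoding, GELU gating, channel shrinkage."""
    normed = layer_norm(x)
    expanded = pointwise_conv(normed, w_expand)        # C -> 2 * hidden channels
    encoded = depthwise_dilated_conv(expanded, dw_kernel, dilation)
    backup, original = np.split(encoded, 2, axis=0)    # two branches (assumed split)
    gate = gelu(original)                              # gated mask weights
    return pointwise_conv(gate * backup, w_shrink)     # hidden -> C channels
```

For a 4-channel input with a hidden width of 4, `w_expand` would have shape `(8, 4)`, `dw_kernel` shape `(8, 3, 3)` and `w_shrink` shape `(4, 4)`, and the output keeps the input shape.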

7. The method for sequential image target recognition based on image quality optimization according to claim 6, characterized in that the target object image corresponding to the largest score decoding value is determined as the target object image to be detected.

8. The method for sequential image target recognition based on image quality optimization according to claim 7, characterized in that performing face recognition on the target object image to be detected to obtain a recognition result comprises:

Inputting the target object image to be detected into a face feature extractor based on AlexNet to obtain a face feature vector;

Inputting the face feature vector into a classifier-based face recognizer to obtain the recognition result.

9. The method for sequential image target recognition based on image quality optimization according to claim 8, characterized by further comprising a training step for training the dynamic interaction module under the gated response, the image feature extractor based on the dilated convolutional neural network model, the feature foreground mask salient device based on the convolutional gated feed-forward mechanism, the cross-domain joint encoder based on the MetaNet model and the decoder-based image quality scorer;

Wherein, the training step includes:

Acquiring training data, wherein the training data includes a queue of training object images to be identified captured by a camera;

Based on the face target detection network and the human target detection network, determining the target IOU relationship of each training object image to be identified in the queue of training object images to obtain a queue of training target IOU relationships;

Based on the queue of the training target IOU relationship, extracting a queue of training target object images from the queue of training object images to be identified;

Using the LBP (local binary pattern) operator to process each training target object image in the queue of training target object images to obtain a training target object image LBP feature vector;

Using the HOG feature descriptor to process each of the training target object images to obtain a training target object HOG feature vector;

Inputting the training target object HOG feature vector and the training target object image LBP feature vector into the dynamic interaction module under the gated response to obtain a training target object multimodal statistical feature vector;

Inputting each of the training target object images into the image feature extractor based on the dilated convolutional neural network model to obtain a training target object image feature map;

Inputting the training target object image feature map into the feature foreground mask salient device based on the convolutional gated feed-forward mechanism to obtain a training foreground salient target object image feature map;

Inputting the training foreground salient target object image feature map and the training target object multimodal statistical feature vector into the cross-domain joint encoder based on the MetaNet model to obtain a target object image fusion feature map assisted by the training multimodal statistical feature;

Inputting the target object image fusion feature map assisted by the training multimodal statistical feature into the decoder-based image quality scorer to obtain a decoding loss function value;

Calculating a predetermined loss function value of the training multimodal statistical feature-assisted target object image fusion feature map to obtain a multimodal statistical feature-assisted target object image fusion loss function value;

Training the dynamic interaction module under the gated response, the image feature extractor based on the dilated convolutional neural network model, the feature foreground mask salient device based on the convolutional gated feed-forward mechanism, the cross-domain joint encoder based on the MetaNet model and the decoder-based image quality scorer, using the weighted sum of the decoding loss function value and the multimodal statistical feature-assisted target object image fusion loss function value as the loss function value.
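The weighted sum used as the training loss can be written compactly as below. The symbols are introduced here only for illustration and do not appear in the claims: $\mathcal{L}_{\text{decode}}$ stands for the decoding loss function value, $\mathcal{L}_{\text{fusion}}$ for the multimodal statistical feature-assisted target object image fusion loss function value, and $\alpha$, $\beta$ for the two weights, which are assumed to be non-negative hyperparameters.

```latex
\mathcal{L} = \alpha \, \mathcal{L}_{\text{decode}} + \beta \, \mathcal{L}_{\text{fusion}},
\qquad \alpha, \beta \ge 0
```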

10. A sequential image target recognition device based on image quality optimization, characterized by comprising:

An image queue acquisition module, configured to acquire a queue of object images to be identified captured by a camera;

An IOU relationship determination module, configured to determine, based on a face target detection network and a human target detection network, the target IOU relationship of each object image to be identified in the queue of object images to be identified to obtain a queue of target IOU relationships;

A target object image queue extraction module, configured to extract a queue of target object images from the queue of object images to be identified based on the queue of target IOU relationships;

A selection module, configured to select a target object image with the best face quality from the queue of target object images as the target object image to be detected;

A face recognition module, configured to perform face recognition on the target object image to be detected to obtain a recognition result, wherein the recognition result is a person identity tag;

A person identity tag designation module, configured to designate the person identity tag in the recognition result as the person identity tag of the queue of target object images.

11. An electronic device, comprising:

A processor; and

A memory storing computer program instructions which, when executed by the processor, cause the processor to execute the method for sequential image target recognition based on image quality optimization according to any one of claims 1 to 9.