JP7542802B2

JP7542802B2 - Image recognition device using neural network and program used in the image recognition device

Info

Publication number: JP7542802B2
Application number: JP2020104577A
Authority: JP
Inventors: 弘亘藤吉; 隆義山下; 翼平川
Original assignee: Chubu University
Current assignee: Chubu University
Priority date: 2019-07-25
Filing date: 2020-06-17
Publication date: 2024-09-02
Anticipated expiration: 2040-06-17
Also published as: JP2024091853A; JP7691686B2; JP2021022368A

Description

Application of Article 30, Paragraph 2 of the Patent Act: http://mprg.jp/research/abn_j, posted June 20, 2019 and July 2019; http://cvim.ipsj.or.jp/MIRU2019/ (website of the 22nd Symposium on Image Recognition and Understanding (MIRU2019)), posted July 2019; 22nd Symposium on Image Recognition and Understanding (MIRU2019), Oral Session OS2A-5, Grand Cube Osaka (Osaka Prefectural International Convention Center), July 31, 2019.

The present invention relates to an image recognition device and a training device that use a neural network.

Conventionally, in image recognition technology using a neural network such as a CNN (Convolutional Neural Network), techniques are known for generating an attention map that represents the gaze area used by the neural network during inference (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1: Ramprasaath, R., S., Michael, C., Abhishek, D., Ramakrishna, V., Devi, P. and Dhruv, B.: Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, International Conference on Computer Vision, pp. 618-626 (2017).
Non-Patent Document 2: Zhou, B., Khosla, A., Lapedriza, A., Oliva, A. and Torralba, A.: Learning Deep Features for Discriminative Localization, Computer Vision and Pattern Recognition, pp. 2921-2929 (2016).

However, according to the inventors' study, when a neural network generates a recognition result and an attention map for an image, the recognition result and the gaze area in the attention map do not always agree. For example, if the recognition result is "smile" but the gaze area is the hair, the recognition result and the gaze area in the attention map do not match. Such a mismatch is a problem that can lead to errors in the image recognition itself, and at present there is no way to correct it.

In view of the above, the present disclosure aims to incorporate human knowledge into the recognition function or the learning of a neural network in image recognition technology that uses a neural network that outputs an attention map.

According to one aspect of the present disclosure, an image recognition device includes: an input unit (120) that inputs an image (51) to a neural network (10); a map correction unit (140) that, when the neural network has generated a feature map (52) containing features of the input image and an attention map (53) representing a gaze area of the input image, corrects the generated attention map in accordance with a correction operation by a person; and an output unit (160) that, when the neural network (10) has generated a recognition result for the image based on the image and on a composite map (54) in which the corrected attention map and the feature map are combined, outputs the generated recognition result. According to another aspect, a program causes the image recognition device to function.

Because the attention map 53 is corrected using human knowledge in this way, the recognition unit 14 places emphasis on the area intended by the person. As a result, image recognition can be performed in line with the person's intention. Moreover, the parameters of the neural network 10 are not changed at this time. In other words, the recognition result intended by the person can be obtained without retraining the neural network 10.

The reference signs in parentheses attached to each component indicate an example of the correspondence between that component and the specific components described in the embodiments below.

FIG. 1 is a configuration diagram of the image recognition device.
FIG. 2 is a diagram showing the configuration of the neural network.
FIG. 3 is a diagram showing the configuration of the attention unit.
FIG. 4 is a flowchart of the process executed by the processing unit.
FIG. 5 is a diagram showing a display in which the input image and the uncorrected attention map are superimposed.
FIG. 6 is a diagram showing a display in which the input image and the attention map from which the gaze area has been erased are superimposed.
FIG. 7 is a diagram showing a display in which the input image and the attention map to which a gaze area has been added are superimposed.
FIG. 8 is a diagram showing the relationship between the corrected attention map and the neural network.
FIG. 9 is a flowchart of the process executed by the processing unit in the second embodiment.
FIG. 10 is a diagram showing an outline of retraining.

(First Embodiment)
Hereinafter, a first embodiment will be described. As shown in FIG. 1, an image recognition device 1 according to this embodiment includes an operation device 2, a display device 3, a memory 4, and a processing unit 5.

The operation device 2 is a device that accepts operations from a person and outputs a signal corresponding to the accepted operation to the processing unit 5. The operation device 2 may be, for example, a mouse, a keyboard, or a touch panel. The display device 3 is a device that displays images to a person.

The memory 4 includes RAM, which is a rewritable volatile storage medium; ROM, which is a non-rewritable non-volatile storage medium; and flash memory, which is a rewritable non-volatile storage medium. RAM, ROM, and flash memory are non-transitory tangible storage media. The data of the trained neural network 10 is recorded in the flash memory in advance.

The processing unit 5 executes a program (not shown) stored in the ROM or the flash memory and uses the RAM as a working area during execution, thereby realizing the various processes described below.

The neural network 10 will now be described. As shown in FIG. 3, the neural network 10 is a deep neural network that includes a feature extraction unit 11, an attention unit 12, a synthesis unit 13, and a recognition unit 14.

When an input image 51 is input, the neural network 10 generates an attention map 53. The attention map 53 is data that represents the gaze area of the neural network 10 during inference. In other words, the attention map 53 is visual-explanation data that shows which regions of the input image 51 the neural network 10 emphasizes during inference.

The neural network 10 also outputs a classification result for the input image 51 based on the input image 51 and the attention map 53. The classification result for the input image 51 is a set of likelihoods, one for each of a plurality of classes (e.g., Dalmatian, crayfish, finch, frog) that correspond to recognition targets in the image. Here, the number of classes is K.

The feature extraction unit 11 is a neural network having multiple layers. These layers include at least multiple convolutional layers. They may also be components of multiple residual blocks, and may include multiple pooling layers and the like. The feature extraction unit 11 generates a feature map 52 by propagating the information of the input image 51 through these layers.

The feature map 52 consists of K maps of resolution h×w, one for each of the K classes, where h and w are arbitrary integers. The number of channels in the feature map 52 is therefore K. The resolution of the feature map 52 may be the same as, or lower than, the resolution of the input image 51.

The feature extraction unit 11 may consist of the portion of a baseline model that starts at the input layer and precedes the first fully connected layer. The baseline model is chosen to be one that has multiple convolutional layers and generates likelihoods for the same set of classes as the neural network 10. For example, the VGGNet shown in Non-Patent Document 3, the ResNet shown in Non-Patent Document 4, or another CNN (Convolutional Neural Network) may be used as the baseline model.

The attention unit 12 generates the attention map 53 from the feature map 52 generated by the feature extraction unit 11. The attention unit 12 is a neural network having multiple layers. As shown in FIG. 3, these layers include a first part 12a having one or more convolutional layers or one or more residual blocks, and a K×1×1 convolutional layer 12b following the first part. Here, where L, a, and b are arbitrary natural numbers, an L×a×b convolutional layer means a convolutional layer that uses an a×b kernel in each of L channels.

After the convolutional layer 12b, the attention unit 12 branches into two layers: a K×1×1 convolutional layer 12c and a 1×1×1 convolutional layer 12e. The attention unit 12 also has a GAP (Global Average Pooling) layer 12d following the convolutional layer 12c.

The information of the feature map 52 input to the attention unit 12 propagates through the first part 12a, the convolutional layer 12b, the convolutional layer 12c, and the GAP layer 12d, and the output of the GAP layer 12d is input to the Softmax function. This yields, as a classification result, likelihoods for the same set of classes as the neural network 10. A classification result is one type of recognition result.

The information of the feature map 52 input to the attention unit 12 is also propagated through the first part 12a, the convolutional layer 12b, and the convolutional layer 12e, which generates the attention map 53. Because the attention map 53 is generated via the convolutional layer 12b rather than a fully connected layer, the gaze-area information reaches the attention map 53 while remaining spatially localized. In addition, passing through the 1×1×1 convolutional layer 12e produces a one-channel attention map 53 as a weighted sum of the gaze areas corresponding to all classes. The kernel values of the convolutional layer 12e may all be 1 or may be other values.
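The two branches of the attention unit 12 described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the K-channel output of the convolutional layer 12b is represented as nested lists indexed [channel][row][column], a 1×1 convolution is taken to be a per-pixel weighted sum over input channels as in a standard CNN, the function names and weight values are hypothetical, and biases, activations, and any normalization of the attention map are omitted.

```python
import math


def conv1x1(maps, weights):
    """1x1 convolution: each output channel is a per-pixel weighted sum of the input channels.

    maps: [in_channels][h][w]; weights: [out_channels][in_channels] (hypothetical layout).
    """
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [[sum(wt * ch[y][x] for wt, ch in zip(row, maps)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]


def global_average_pool(maps):
    """GAP layer 12d: average each channel over all of its pixels."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in maps]


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_heads(maps_12b, w_12c, w_12e):
    """Return (class likelihoods, one-channel attention map) from the output of layer 12b."""
    # Classification branch: 12c -> GAP 12d -> Softmax.
    likelihoods = softmax(global_average_pool(conv1x1(maps_12b, w_12c)))
    # Attention branch: 12e collapses the K channels into a single map.
    attention_map = conv1x1(maps_12b, [w_12e])[0]
    return likelihoods, attention_map


# Toy example with K = 2 classes and a 2x2 resolution (values are illustrative only).
maps_12b = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
identity_12c = [[1.0, 0.0], [0.0, 1.0]]
likelihoods, attention_map = attention_heads(maps_12b, identity_12c, [1.0, 1.0])
```

With all kernel values of layer 12e set to 1, as the text allows, the attention map is simply the sum of the K class-specific maps, so spatial positions are preserved from layer 12b to the attention map.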

The resolution of each map in the feature map 52 is the same as the resolution of the attention map 53; the attention unit 12 is configured so that this is the case. In the attention map 53, pixels that belong to the gaze area are given relatively high pixel values, and pixels that do not belong to the gaze area are given lower pixel values than those in the gaze area. Each pixel value in the attention map 53 may be binary or may take one of 256 levels. The higher the value of a pixel, the higher the degree of attention at that pixel's position.

The synthesis unit 13 combines the feature map 52 and the attention map 53. Specifically, the h×w map in each of the K channels of the feature map 52 is multiplied by the attention map 53. The multiplication is performed element-wise, between pixels at the same position coordinates. The combination may be multiplication as described above, addition, or an operation consisting of a combination of addition and multiplication. This combination yields a composite map 54. The number of channels and the resolution of the composite map 54 are the same as those of the feature map 52.
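The combination performed by the synthesis unit 13 can be sketched as below. This is an illustrative sketch only: the function name, the `mode` parameter, and the nested-list layout [channel][row][column] are assumptions, and the `"residual"` mode is just one possible "combination of addition and multiplication", which the text does not specify further.

```python
def synthesize(feature_map, attention_map, mode="multiply"):
    """Combine a K-channel feature map with a one-channel attention map, pixel by pixel."""

    def combine(f, a):
        if mode == "multiply":
            return f * a            # element-wise product described in the text
        if mode == "add":
            return f + a            # additive variant
        if mode == "residual":
            return f * (1.0 + a)    # f + f*a: one hypothetical add-and-multiply variant
        raise ValueError(f"unknown mode: {mode}")

    # The same attention map is applied to every channel, so the output keeps
    # the channel count and resolution of the feature map.
    return [
        [[combine(f, a) for f, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, attention_map)]
        for channel in feature_map
    ]


# Toy example: K = 2 channels at 2x2 resolution.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
attention_map = [[0.0, 1.0], [0.5, 1.0]]
composite_map = synthesize(feature_map, attention_map)
```

Pixels where the attention map is 0 are suppressed in every channel of the composite map, which is why editing the attention map changes what the recognition unit 14 receives.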

The recognition unit 14 outputs the likelihood of each class based on the composite map 54. The recognition unit 14 is a neural network having multiple layers. These layers include at least multiple convolutional layers, and also include a fully connected layer, a GAP layer, or both. They may also be components of multiple residual blocks, and may include multiple pooling layers. The recognition unit 14 propagates the information of the input composite map 54 through these layers and outputs the likelihood of each class as a classification result. The classification result is also a recognition result. The recognition unit 14 may consist of the portion of the above-mentioned baseline model from immediately after the part used in the attention unit 12 to the output layer.

The functions described above as being performed by the neural network 10, the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the recognition unit 14 are in practice realized by the processing unit 5 performing processing in accordance with the structure and parameters of the neural network 10.

The feature extraction unit 11, the attention unit 12, and the synthesis unit 13 are trained in advance by supervised learning using backpropagation so that the functions described above are realized. In training, L = Latt + Lper is used as the training error (also called the loss function) L. Here, Latt is the training error for the classification result output by the attention unit 12, and Lper is the training error for the classification result output by the recognition unit 14. Latt and Lper may each be calculated by applying a combination of the Softmax function and cross entropy to the respective classification result. The feature extraction unit 11 is trained by the gradients that pass back through both the attention unit 12 and the recognition unit 14 during backpropagation.
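The loss L = Latt + Lper can be illustrated with a small sketch. The assumptions here are that each branch produces one raw score per class before the Softmax function, that the label is a single class index, and that the function names are hypothetical; gradient computation is not shown.

```python
import math


def softmax_cross_entropy(scores, target):
    """Cross entropy of Softmax(scores) against a one-hot label at index `target`."""
    m = max(scores)
    # log(sum(exp(scores))) computed stably by factoring out the maximum score.
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def total_loss(attention_scores, recognition_scores, target):
    """L = Latt + Lper for one training image."""
    l_att = softmax_cross_entropy(attention_scores, target)    # attention unit 12
    l_per = softmax_cross_entropy(recognition_scores, target)  # recognition unit 14
    return l_att + l_per


# Toy example with K = 3 classes; the correct class is index 2.
loss_uninformed = total_loss([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], target=2)
loss_confident = total_loss([0.0, 0.0, 5.0], [0.0, 0.0, 5.0], target=2)
```

Because both terms depend on the shared feature extraction unit 11, minimizing their sum sends gradients from both branches into the shared layers, which matches the last sentence of the paragraph above.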

The image classification process performed by the processing unit 5 using the trained neural network 10 configured in this way is described below.

When a predetermined condition is satisfied, such as a person performing an execution start operation on the operation device 2, the processing unit 5 starts the process shown in FIG. 4, which is defined by a predetermined program recorded in the memory 4. In this process, the processing unit 5 first reads the neural network 10 from the memory 4 in step 110.

Next, in step 120, the processing unit 5 acquires an input image 51 and inputs it to the neural network 10. The input image 51 may be an image selected, for example by a person's operation of the operation device 2, from a plurality of images recorded in the memory 4 in advance, or may be an image received from another device via a communication network (not shown).

When the input image 51 is input to the neural network 10, as described above, the feature extraction unit 11 generates the feature map 52 and a classification result from the input image 51, and the attention unit 12 generates the attention map 53 from the feature map 52.

In step 130, which follows step 120, the processing unit 5 acquires the attention map 53 generated in this way. That is, it copies or moves the attention map 53 that the neural network 10 generated in the memory 4 to another area of the memory 4.

Next, in step 140, the processing unit 5 corrects the acquired attention map 53 (that is, the copy or the moved map) based on a person's correction operation on the operation device 2. The attention map 53 is thereby corrected using human knowledge.

Specifically, the processing unit 5 causes the display device 3 to display the uncorrected attention map 53 and a pointer. The pointer is an image that moves within the display range of the attention map 53 on the display device 3 in response to the person's operation of the operation device 2. By performing a predetermined correction operation (e.g., an erase operation or an add operation) on the operation device 2, the person corrects the values of the displayed attention map 53 in the range that overlaps the pointer.

At this time, as shown in FIG. 5, the processing unit 5 may align the input image 51 with the attention map 53, overlay it transparently, display the result on the display device 3, and in that state reflect the corrections corresponding to the above correction operation in the attention map 53. If the input image 51 and the attention map 53 have different resolutions, the processing unit 5 lowers the resolution of the input image 51 to match that of the attention map 53 before overlaying it transparently on the attention map 53.
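The transparent overlay can be sketched as follows. This is a hedged illustration: the text does not say how the resolution is lowered or how the two layers are blended, so nearest-neighbor sampling, linear alpha blending, a grayscale image, and intensities in [0, 1] are all assumptions, as are the function names and the `alpha` parameter.

```python
def resize_nearest(image, out_h, out_w):
    """Lower the resolution of a grayscale image by nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def overlay(image, attention_map, alpha=0.5):
    """Blend the input image with the attention map at the attention map's resolution."""
    h, w = len(attention_map), len(attention_map[0])
    if (len(image), len(image[0])) != (h, w):
        # Match the input image to the attention map, as in the text.
        image = resize_nearest(image, h, w)
    return [
        [(1.0 - alpha) * pixel + alpha * attention
         for pixel, attention in zip(image_row, attention_row)]
        for image_row, attention_row in zip(image, attention_map)
    ]


# Toy example: a 4x4 image overlaid on a 2x2 attention map.
image = [[(y * 4 + x) / 16 for x in range(4)] for y in range(4)]
attention_map = [[1.0, 1.0], [1.0, 1.0]]
blended = overlay(image, attention_map, alpha=0.5)
```

Keeping the display at the attention map's resolution means a pointer position maps directly to one attention-map pixel, so no coordinate conversion is needed when the correction is applied.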

In FIG. 5, an input image 51 showing a Dalmatian holding a soccer ball in its mouth is transparently overlaid on the attention map 53.

In this attention map 53, the gaze area lies on the soccer ball. If the attention map 53 were input to the synthesis unit 13 as it is, and the composite map 54 obtained by combining that attention map 53 with the feature map 52 were input to the first layer of the recognition unit 14, the classification result generated by the recognition unit 14 would give the highest likelihood to the soccer ball. In other words, the neural network 10 would recognize the input image 51 as an image of a soccer ball.

However, if the person using the image recognition device 1 wants the input image 51 to be recognized as an image of a Dalmatian, the gaze area in this attention map 53 should be the region where the Dalmatian is.

In such a case, the person uses the operation device 2 to correct the gaze area in the attention map 53. Specifically, the person first uses the operation device 2 to erase the gaze area in the attention map 53. For example, while pressing a predetermined erase button on the operation device 2, the person moves the pointer so that it sweeps over the entire gaze area in the attention map 53. The processing unit 5 then lowers the pixel values of the attention map 53 in the region swept by the pointer while the erase button was pressed, to values that do not constitute a gaze area, as shown in FIG. 6.

After that, the person uses the operation device 2 to set the region of the attention map 53 that should become the new gaze area. For example, while pressing a predetermined add button on the operation device 2, the person moves the pointer so that it sweeps over the entire region of the attention map 53 that should become the gaze area. The processing unit 5 then raises the pixel values of the attention map 53 in the region swept by the pointer while the add button was pressed, to values that constitute a gaze area, as shown in FIG. 7. In the example of FIG. 7, the new gaze area specified by the person is the face of the Dalmatian.
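The erase and add operations can be sketched as edits to a copy of the attention map. The sketch below rests on assumptions: the pointer stroke is given as a list of (row, column) pixel positions, erased pixels are set to a fixed low value and added pixels to a fixed high value (the text only says the values are lowered or raised), and the function and parameter names are hypothetical.

```python
def apply_stroke(attention_map, stroke, mode, low=0.0, high=1.0):
    """Return a corrected copy of the attention map after one pointer stroke.

    stroke: iterable of (row, column) positions swept by the pointer.
    mode: "erase" lowers the swept pixels, "add" raises them.
    """
    value = {"erase": low, "add": high}[mode]
    corrected = [row[:] for row in attention_map]  # leave the original map untouched
    h, w = len(corrected), len(corrected[0])
    for y, x in stroke:
        if 0 <= y < h and 0 <= x < w:  # ignore pointer positions outside the map
            corrected[y][x] = value
    return corrected


# Toy example: move the gaze area from the top-left pixel to the bottom-right pixel.
original = [[0.9, 0.1], [0.1, 0.1]]
erased = apply_stroke(original, [(0, 0)], "erase")
corrected = apply_stroke(erased, [(1, 1), (5, 5)], "add")
```

Working on a copy mirrors step 130, in which the attention map is copied or moved to another memory area before step 140 corrects it.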

Because the input image 51 is displayed on the display device 3 overlaid on the attention map 53 in this way, a person who can judge which part of the input image 51 should be the gaze area can use that knowledge efficiently and easily specify the gaze area in the attention map 53. Through the processing of step 140, the attention map 53 acquired in step 130 is corrected in the memory 4.

Next, in step 150, the processing unit 5 inputs the attention map 53 corrected in the preceding step 140 to the synthesis unit 13. The synthesis unit 13 then combines the feature map 52 and the attention map 53 as described above to generate the composite map 54 and inputs it to the first layer of the recognition unit 14. The recognition unit 14 generates a classification result based on the composite map 54 as described above. In this classification result, the likelihood of the Dalmatian is the highest. In other words, the neural network 10 recognizes the input image 51 as an image of a Dalmatian.

In step 160, which follows step 150, the processing unit 5 acquires and outputs the classification result generated by the recognition unit 14 in this way. The output destination may be another device reached via a communication network (not shown), the memory 4, or the display device 3.

Because the attention map 53 is corrected using human knowledge in this way, the recognition unit 14 gives a higher weight to the region intended by the person. As a result, image recognition can be performed in line with the person's intention. In other words, using an attention map that has been manually corrected based on human knowledge makes it possible to adjust the recognition result.

Moreover, the parameters of the neural network 10 are not changed at this time. In other words, the recognition result intended by the person can be obtained without retraining the neural network 10.

For example, when a fundus image is input to the neural network 10 as the input image 51, a doctor can correct the gaze area of the attention map 53 using knowledge based on his or her own experience, which makes classification of the grade of an eye disease more accurate. The function of this embodiment is thus useful in, for example, medical image diagnosis.

As described above, and as shown in FIG. 8, the processing unit 5 of the image recognition device 1 corrects the attention map 53, which was generated by the neural network 10 from the input image 51, in accordance with a person's correction operation (step 140). The processing unit 5 then outputs the recognition result for the input image 51 that the neural network 10 generated based on the corrected attention map 53 and the input image 51 (step 160).
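The flow of steps 120 to 160 can be summarized in a short sketch. Everything below is an assumption made for illustration: the five stages are passed in as ordinary functions, the toy stand-ins replace the trained network, and the two-class, 1×2 "image" exists only to show that changing the attention map alone can change the result.

```python
def recognize_with_correction(image, extract, attend, correct, synthesize, recognize):
    """Run steps 120 to 160 once; no network parameter is updated."""
    feature_map = extract(image)                            # step 120: feature extraction unit 11
    attention_map = attend(feature_map)                     # step 130: attention unit 12
    corrected_map = correct(attention_map)                  # step 140: human correction
    composite_map = synthesize(feature_map, corrected_map)  # step 150: synthesis unit 13
    return recognize(composite_map)                         # step 160: recognition unit 14


# Toy stand-ins (hypothetical): two classes and a 1x2 grayscale "image".
CLASSES = ["soccer ball", "dalmatian"]


def toy_extract(image):
    # Channel 0 responds at pixel 0, channel 1 responds at pixel 1.
    return [[[image[0][0], 0.0]], [[0.0, image[0][1]]]]


def toy_attend(feature_map):
    return [[1.0, 0.0]]  # the network attends to pixel 0 only


def toy_synthesize(feature_map, attention_map):
    return [
        [[f * a for f, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, attention_map)]
        for channel in feature_map
    ]


def toy_recognize(composite_map):
    scores = [sum(map(sum, channel)) for channel in composite_map]
    return CLASSES[scores.index(max(scores))]


image = [[1.0, 1.0]]
uncorrected = recognize_with_correction(
    image, toy_extract, toy_attend, lambda m: m, toy_synthesize, toy_recognize)
corrected = recognize_with_correction(
    image, toy_extract, toy_attend, lambda m: [[0.0, 1.0]], toy_synthesize, toy_recognize)
```

In this toy setting the uncorrected run yields "soccer ball" and the corrected run yields "dalmatian", while the stand-in functions themselves are identical in both runs.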

In addition, the path along which image information propagates to generate the attention map 53 and the path along which image information propagates to generate the recognition result are shared in one part (the feature extraction unit 11) and separated in the other parts (the attention unit 12 and the recognition unit 14). The synthesis unit 13 inputs the composite map 54, which reflects the corrected attention map 53, to the recognition unit 14 side of the separated part. Because the point at which the composite map 54 based on the corrected attention map 53 is input suits the structure of the neural network 10 in this way, the corrected attention map 53 improves the recognition result to a greater degree.

Furthermore, the processing unit 5 reflects the corrections corresponding to the person's correction operation in the attention map 53 while the input image 51 is displayed on the display device 3 transparently overlaid on the attention map 53. A person can judge relatively easily which part of the input image 51 should be the gaze area by looking at the input image 51. Therefore, because the input image 51 is displayed on the display device 3 overlaid on the attention map 53, the person can use his or her own knowledge visually and efficiently and can easily specify the gaze area in the attention map 53.

In this embodiment, the processing unit 5 functions as the input unit by executing step 120, as the map correction unit by executing step 140, and as the output unit by executing step 160.

(Second Embodiment)
Next, a second embodiment will be described. In this embodiment, learning parameters of the neural network 10, such as weights and biases, are corrected based on an attention map that has been corrected in accordance with a person's correction operation. That is, the neural network 10 is retrained based on the corrected attention map.

The hardware configuration of this embodiment is the same as that shown in FIG. 1 for the first embodiment. The configuration of the trained neural network 10 stored in the memory 4 is also the same as in the first embodiment. Note that the image recognition device 1 of this embodiment corresponds to a training device.

One difference between this embodiment and the first embodiment is that the processing unit 5 does not execute the process of FIG. 4; instead, it causes the neural network 10 to generate a classification result for the input image 51 without correcting the attention map 53.

That is, as in the first embodiment, the processing unit 5 first reads the neural network 10 from the memory 4, then acquires the input image 51 and inputs it to the neural network 10.

In the neural network 10, the feature extraction unit 11 and the attention unit 12 then operate as in the first embodiment, and the attention unit 12 generates the attention map 53 and a classification result. This attention map 53 is input to the synthesis unit 13 without any human correction operation, that is, uncorrected. The synthesis unit 13 generates the composite map 54 by combining the feature map 52 with the uncorrected attention map 53. The recognition unit 14 generates a classification result based on this composite map 54, as in the first embodiment. The processing unit 5 acquires and outputs this classification result as in the first embodiment.

In addition to acquiring the classification result of the input image 51 with the neural network 10 as described above, the processing unit 5 executes the process shown in FIG. 9 to retrain the neural network 10. This retraining fine-tunes the neural network 10.

The processing unit 5 starts the process of FIG. 9 when a person performs a predetermined retraining start operation on the operation device 2. In this process, the processing unit 5 uses a retraining dataset. The retraining dataset has multiple groups (for example 10, 100, or 100,000), each consisting of a training image and a teacher label.

A training image is data that is input to the feature extraction unit 11 in the same way as the input image 51. A teacher label is data treated as the correct value of the classification results output by the attention unit 12 and the recognition unit 14 when the training image of the same group is input to the feature extraction unit 11.

The retraining dataset may be generated in advance and recorded in a non-volatile storage medium of the memory 4, or may be acquired from a data server via a communication network (not shown). The training images and teacher labels of the retraining dataset may be the same as those of the training dataset used for the initial training of the neural network 10, or may be different.

In the process of FIG. 9, the processing unit 5 first executes the loop of steps 210 and 220 once for each group in the retraining dataset. In each iteration, the processing unit 5 first inputs the training image of the target group to the feature extraction unit 11 in step 210. Then, in step 220, it acquires the attention map 53 and the classification result that the attention unit 12 generated from the input training image, together with the classification result that the recognition unit 14 generated from the same training image, and records them in the memory 4.

The way the neural network 10 outputs the attention map 53 and the two types of classification results from an input training image is the same as the method described above, with the input image 51 replaced by the training image. One iteration of the loop ends after step 220.

When the loop has run once for every group, the processing unit 5 proceeds to step 230. At this point, for every group in the retraining dataset, the attention map 53 and classification result generated by the attention unit 12 and the classification result generated by the recognition unit 14 are recorded in the memory 4 in association with that group.
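The loop of steps 210 and 220 can be sketched as below. This is only an illustration: the record layout, the function names, and `toy_network` (a stand-in for the neural network 10 that returns an attention map and two score lists) are hypothetical.

```python
def toy_network(image):
    # Hypothetical stand-in for neural network 10: returns the attention map,
    # the attention unit's class scores, and the recognition unit's class scores.
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return image, [mean, 1.0 - mean], [1.0 - mean, mean]

def collect_outputs(dataset, network):
    records = []
    for group in dataset:
        # Step 210: input the group's training image to the network.
        attention_map, att_scores, per_scores = network(group["image"])
        # Step 220: record the outputs in association with the group.
        records.append({
            "image": group["image"],
            "label": group["label"],
            "attention_map": attention_map,
            "att_scores": att_scores,
            "per_scores": per_scores,
        })
    return records
```

After the loop, `records` holds one entry per group, which is the state the description assumes on entering step 230.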

In step 230, the processing unit 5 extracts, from among the groups, those in which a recognition error occurred. A group is extracted as misrecognized when the class with the highest likelihood in the classification result output by the recognition unit 14 does not match the class indicated by the teacher label (that is, the class with the highest likelihood in the teacher label). Alternatively, a group may be extracted as misrecognized when the class with the highest likelihood in the classification result output by the attention unit 12 does not match the class indicated by the teacher label. Groups meeting either criterion may also be extracted. In most cases, more than one group is extracted.
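A minimal sketch of the extraction in step 230, assuming each record stores the teacher label and both classification results as lists of per-class likelihoods (the field names `label`, `per_scores`, and `att_scores` are hypothetical):

```python
def top_class(scores):
    # Index of the class with the highest likelihood.
    return max(range(len(scores)), key=lambda k: scores[k])

def extract_misrecognized(records, check_recognition=True, check_attention=False):
    extracted = []
    for rec in records:
        teacher = top_class(rec["label"])
        # Mismatch in the recognition unit's classification result.
        wrong_per = check_recognition and top_class(rec["per_scores"]) != teacher
        # Mismatch in the attention unit's classification result.
        wrong_att = check_attention and top_class(rec["att_scores"]) != teacher
        if wrong_per or wrong_att:
            extracted.append(rec)
    return extracted
```

The two flags correspond to the three options in the text: recognition unit only (the default here), attention unit only, or both.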

Next, in step 240, the processing unit 5 corrects, based on human knowledge, the attention map 53 recorded in the memory 4 for each of the groups extracted in the preceding step 230. Specifically, it corrects the attention map 53 according to a person's correction operation on the operation device 2, using the same processing as step 140 of FIG. 4. The processing unit 5 then stores the corrected attention map 53 in the memory 4 as the teacher attention map belonging to that group.

A teacher attention map created in this way is data treated as the correct value of the attention map 53 that the attention unit 12 outputs when the training image of the same group is input to the feature extraction unit 11. Through this processing, the teacher attention maps are added to the retraining dataset.
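As an illustration of how a correction operation might turn a generated attention map into a teacher attention map, the sketch below overwrites the values inside one rectangular region and leaves all other pixels unchanged. The rectangle-plus-value form of the operation and the function name are assumptions; the actual operations of step 140 are described elsewhere in the specification.

```python
def apply_correction(attention_map, region, value):
    """Return a corrected copy of attention_map.

    region is (top, left, bottom, right) with exclusive bottom/right;
    value is the attention value the person assigns to that region.
    """
    top, left, bottom, right = region
    corrected = [row[:] for row in attention_map]   # keep the original intact
    for r in range(top, bottom):
        for c in range(left, right):
            corrected[r][c] = value
    return corrected
```

The returned map would then be stored as the teacher attention map of the group, while the original attention map remains available for comparison.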

Next, in step 250, the processing unit 5 retrains the neural network 10 based on the two types of classification results and the attention maps 53 obtained in the current run of the FIG. 9 process, and on the retraining dataset. As described above, the retraining dataset includes the teacher attention maps and the teacher labels.

Specifically, as shown in FIG. 10, the sum of three learning errors, L = Latt + Lper + Lmap, is used as the learning error, and the learning parameters of the attention unit 12 and the recognition unit 14, such as weights and biases, are updated by error backpropagation. FIG. 10 shows the output layer 14b of the recognition unit 14 and the part 14a of the recognition unit 14 that precedes the output layer 14b. Note that in this embodiment the learning parameters of the feature extraction unit 11, such as weights and biases, are not updated.

Here, Latt is a quantity indicating the error between the classification result output by the attention unit 12 when a training image 61 is input to the neural network 10 and the teacher label 60 belonging to the same group as that training image 61.

Lper is a quantity indicating the error between the classification result output by the recognition unit 14 when the training image 61 is input to the neural network 10 and the teacher label 60 belonging to the same group as that training image 61.

Lmap is a quantity indicating the error between the attention map 53 output by the attention unit 12 when the training image 61 is input to the neural network 10 and the teacher attention map belonging to the same group as that training image 61.
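The text does not say which error measure is used for the two classification errors. As one possible choice, the sketch below uses cross-entropy, which is an assumption; `predicted` and `teacher_label` are hypothetical lists of per-class likelihoods.

```python
import math

def cross_entropy(predicted, teacher_label):
    # Error between predicted class likelihoods and the teacher label.
    # Only classes with non-zero teacher likelihood contribute.
    return -sum(t * math.log(p) for t, p in zip(teacher_label, predicted) if t > 0)

def classification_errors(att_scores, per_scores, teacher_label):
    l_att = cross_entropy(att_scores, teacher_label)   # attention unit 12
    l_per = cross_entropy(per_scores, teacher_label)   # recognition unit 14
    return l_att, l_per
```

Both errors are measured against the same teacher label 60, since the attention unit and the recognition unit are both expected to classify the training image correctly.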

As the learning error Lmap, an L2-norm error may be used, as in the following formula, or another form of error may be used.
Lmap = γ × ||M' − M||₂
Here, M is the value of the attention map 53 output by the attention unit 12 when the training image 61 is input to the neural network 10, and M' is the value of the corrected attention map (the teacher attention map) of the same group as the training image 61. By computing the error between these two attention maps element by element, the attention unit 12 is trained to output attention maps that are close to human knowledge.

Here, γ is a coefficient that scales the learning error Lmap. Lmap takes larger values than Latt and Lper, so multiplying Lmap by γ allows the relative magnitudes of the three learning errors Lmap, Latt, and Lper to be balanced. After step 250, the process of FIG. 9 ends, and the retrained neural network 10 is recorded in the memory 4.
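The attention-map error can be computed directly from its definition, Lmap = γ × ||M' − M||₂. The sketch below assumes attention maps stored as nested lists; the value of `gamma` in the usage example is arbitrary, since the text does not give one.

```python
import math

def l_map(teacher_map, predicted_map, gamma):
    # Sum the squared element-wise differences, take the square root (L2 norm),
    # then scale by gamma to balance this term against Latt and Lper.
    squared = sum((t - p) ** 2
                  for t_row, p_row in zip(teacher_map, predicted_map)
                  for t, p in zip(t_row, p_row))
    return gamma * math.sqrt(squared)

# Usage: combine with the two classification errors, L = Latt + Lper + Lmap.
example_total = 0.3 + 0.2 + l_map([[3.0, 0.0], [0.0, 4.0]],
                                  [[0.0, 0.0], [0.0, 0.0]], gamma=0.1)
```

In the usage line the L2 norm is 5, so with γ = 0.1 the map term contributes 0.5, a magnitude comparable to the other two terms.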

Fine-tuning the neural network 10 with attention maps corrected on the basis of human knowledge in this way improves its image recognition capability. That is, when the processing unit 5 inputs various input images 51 to the fine-tuned neural network 10, the accuracy of the recognition results generated by the recognition unit 14 improves.

As described above, the processing unit 5 retrains the neural network 10 using the retraining dataset (step 250). The retraining dataset includes multiple teacher attention maps.

Thus, when the neural network 10 that generates attention maps is retrained, teacher attention maps, which are treated as the correct values of the attention maps, are used. Because the teacher attention maps are created on the basis of human knowledge, this makes it possible to incorporate human knowledge into the training of the neural network 10.

The processing unit 5 also inputs multiple training images to the neural network 10 to obtain multiple attention maps, one for each training image (steps 210 and 220). The processing unit 5 then corrects those attention maps according to a person's correction operations and uses the results as teacher attention maps (step 240).

In this way, teacher attention maps can be generated from the correction operations that a person performs on the attention maps generated by the neural network 10. Human knowledge can therefore be incorporated into the training of the neural network 10 more directly. Moreover, the correction work is simpler than creating teacher attention maps from scratch.

In addition, the teacher attention maps included in the retraining dataset correspond only to training images that the neural network 10 misrecognized before retraining. Using many teacher attention maps that correspond to misrecognized training images makes retraining more efficient, because an attention map generated from a misrecognized training image is itself likely to contain many errors.

Furthermore, the neural network 10 generates the recognition result of the input image 51 based on the input image 51 and the attention map 53. In a neural network 10 that feeds back not only the input image 51 but also the attention map 53 as information for image recognition, the recognition result of the input image 51 and the attention map 53 are strongly related. In such a neural network 10, retraining with teacher attention maps therefore contributes strongly to improving the recognition result of the input image 51.

The processing unit 5 also retrains the attention unit 12 without retraining the feature extraction unit 11. Retraining the part of the neural network 10 that is strongly related to generating the attention map 53 in this way achieves efficient fine-tuning of the neural network 10.
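Retraining some units while leaving the feature extraction unit unchanged amounts to applying gradient updates only to the parameters of the selected units. The sketch below uses a plain gradient-descent step as an assumed update rule; the `unit.parameter` naming scheme, the function name, and the learning rate are all hypothetical.

```python
# Units whose parameters are updated in the second embodiment.
TRAINABLE_UNITS = ("attention", "recognition")

def update_parameters(params, grads, lr=0.5, trainable_units=TRAINABLE_UNITS):
    updated = {}
    for name, value in params.items():
        unit = name.split(".")[0]          # e.g. "attention.w" -> "attention"
        if unit in trainable_units:
            updated[name] = value - lr * grads[name]   # gradient step
        else:
            updated[name] = value                      # frozen, left as-is
    return updated
```

Changing `trainable_units` selects a different subset of units to retrain without touching the update logic.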

In this embodiment, the processing unit 5 functions as a readout unit by executing step 205, as a training unit by executing step 250, as an acquisition unit by executing steps 210 and 220, and as a map correction unit by executing step 240.

(Other Embodiments)
The present invention is not limited to the embodiments described above and can be modified as appropriate. The above embodiments are not unrelated to one another and can be combined as appropriate, except where a combination is clearly impossible. In each of the above embodiments, the elements constituting the embodiment are not necessarily essential, except where they are explicitly stated to be essential or are clearly essential in principle. Where an embodiment mentions numerical values such as the number, value, amount, or range of its constituent elements, the invention is not limited to those specific numbers, except where they are explicitly stated to be essential or where the invention is clearly limited to them in principle. Where multiple values are given as examples for a quantity, a value between them may also be adopted, unless otherwise noted or clearly impossible in principle.

The present invention also permits the following variations of the above embodiments, as well as variations within their range of equivalents. Each of the following variations can be independently applied or not applied to the above embodiments. In other words, any combination of the following variations can be applied to the above embodiments.

(Variation 1)
The image recognition device 1 may have both the function of the first embodiment (image recognition using an attention map corrected on the basis of human knowledge) and the function of the second embodiment (retraining using an attention map corrected on the basis of human knowledge).

(Variation 2)
In the above embodiments, a classification result is given as an example of the recognition result output by the attention unit 12 and the recognition unit 14. However, the recognition result output by the attention unit 12 and the recognition unit 14 is not limited to a classification result and may be a regression result. In other words, the image recognition performed by the neural network 10 may be classification or regression.

(Variation 3)
In the first embodiment, the neural network 10 has the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the recognition unit 14. However, a neural network for image recognition using an attention map corrected on the basis of human knowledge is not limited to this configuration. Any neural network that generates an attention map from an input image and generates a recognition result for the image from the image and the attention map can have its recognition capability improved by correcting the attention map.

(Variation 4)
In the second embodiment, the neural network 10 has the feature extraction unit 11, the attention unit 12, the synthesis unit 13, and the recognition unit 14. However, a neural network for retraining using an attention map corrected on the basis of human knowledge is not limited to this configuration. Any neural network that generates an attention map and a recognition result from an input image can have its recognition capability improved by retraining with the corrected attention map. For example, a neural network such as CAM (Class Activation Mapping) described in Non-Patent Document 2 may be retrained using an attention map corrected on the basis of human knowledge.

(Variation 5)
In the first embodiment, the processing unit 5 applies corrections to the attention map according to a person's correction operations while the input image 51 is displayed on the display device 3 transparently overlaid on the attention map 53. However, this is not essential. For example, the processing unit 5 may apply the corrections while the input image 51 and the attention map 53 are displayed on the display device 3 side by side without overlapping. As another example, the processing unit 5 may apply the corrections while the attention map 53 is displayed on the display device 3 and the input image 51 is not.

(Variation 6)
In the first and second embodiments, the attention map is corrected by changing the values of only some pixels in the attention map generated by the attention unit 12 and leaving the values of the remaining pixels unchanged. In other words, the method shown modifies the attention map generated by the attention unit 12.

However, the method of correcting the attention map is not limited to this. For example, a new attention map may be created from scratch, separately from the attention map output by the attention unit 12 when the image is input to the neural network 10. In that case, the new attention map is input to the synthesis unit 13 in the first embodiment, and it becomes the teacher attention map in the second embodiment.

A new attention map can be created, for example, as follows. First, a person looks at the image input to the neural network 10 and decides the position and extent of the gaze area. The person then operates a computer to create a new attention map that reflects that gaze area. This computer may be the image recognition device 1 or another device.
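As a simple illustration of creating an attention map from scratch, the sketch below builds a binary map from one rectangular gaze area. The rectangular shape, the values 1.0 and 0.0, and the function name are assumptions made for the example; the text does not prescribe how the gaze area is encoded.

```python
def attention_map_from_region(height, width, region):
    """Build a new attention map: 1.0 inside the gaze area, 0.0 elsewhere.

    region is (top, left, bottom, right) with exclusive bottom/right.
    """
    top, left, bottom, right = region
    return [[1.0 if top <= r < bottom and left <= c < right else 0.0
             for c in range(width)]
            for r in range(height)]
```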

(Variation 7)
In the second embodiment, the teacher attention maps used for retraining are only those corresponding to training images that the neural network 10 misrecognized before retraining. However, the teacher attention maps used for retraining may also include teacher attention maps corresponding to training images that the neural network 10 recognized correctly before retraining.

Even in that case, retraining can be made more efficient as long as the teacher attention maps corresponding to misrecognized training images outnumber those corresponding to correctly recognized training images.

Alternatively, the teacher attention maps corresponding to misrecognized training images may be fewer than those corresponding to correctly recognized training images.

(Variation 8)
In the above embodiment, when the neural network 10 is retrained, the feature extraction unit 11 is not retrained and only the attention unit 12 and the recognition unit 14 are retrained. Retraining of the neural network 10 is not limited to this form. For example, only the attention unit 12 may be retrained, without retraining the feature extraction unit 11 or the recognition unit 14. As another example, only the feature extraction unit 11 may be retrained, without retraining the attention unit 12 or the recognition unit 14. As yet another example, the feature extraction unit 11 and the recognition unit 14 may be retrained without retraining the attention unit 12.

A form in which the feature extraction unit 11, the attention unit 12, and the recognition unit 14 are all retrained is also permitted. In that case, the retraining of the neural network 10 is not fine-tuning.

(Variation 9)
In the above embodiment, retraining is performed by error backpropagation using the three learning errors Lmap, Latt, and Lper. However, it is not necessary to use all of Lmap, Latt, and Lper. For example, only Lmap may be used.

1...image recognition device, 2...operation device, 3...display device, 4...memory, 5...processing unit, 10...neural network, 11...feature extraction unit, 12...attention unit, 13...synthesis unit, 14...recognition unit, 51...input image, 52...feature map, 53...attention map, 54...composite map, 60...teacher label, 61...training image

Claims

An image recognition device comprising:
an input unit (120) for inputting an image (51) to a neural network (10);
a map correction unit (140) that, when the neural network generates a feature map (52) including features of the input image and an attention map (53) expressing a gaze area of the input image, corrects the generated attention map in response to a correction operation by a person; and
an output unit (160) that, when the neural network (10) generates a recognition result for the image based on the image and on a composite map (54) in which the corrected attention map and the feature map are combined, outputs the generated recognition result.

The image recognition device according to claim 1, wherein
the neural network includes a feature extraction unit (11), an attention unit (12), a synthesis unit (13), and a recognition unit (14),
the feature extraction unit includes a plurality of convolution layers and generates the feature map (52) by propagating information of the image through the plurality of convolution layers,
the attention unit generates the attention map based on the feature map,
the synthesis unit combines the feature map and the corrected attention map to generate the composite map, and
the recognition unit generates the recognition result based on the composite map.

The image recognition device according to claim 1 or 2, characterized in that the map correction unit reflects the correction corresponding to the correction operation in the attention map while the image is transparently superimposed on the attention map and displayed on a display device (3).

A program used in an image recognition device that outputs a recognition result of an input image (51), the program causing the image recognition device to function as:
an input unit (120) for inputting the image to a neural network (10);
a map correction unit (140) that, when the neural network generates a feature map (52) including features of the input image and an attention map (53) expressing a gaze area of the input image, corrects the generated attention map in response to a correction operation by a person; and
an output unit (160) that, when the neural network (10) generates a recognition result of the image based on the image and on a composite map (54) in which the corrected attention map and the feature map are combined, outputs the generated recognition result.