JP6945987B2

JP6945987B2 - Arithmetic circuit, its control method and program

Info

Publication number: JP6945987B2
Application number: JP2016211898A
Authority: JP
Inventors: 加藤　政美; 山本　貴久; 野村　修; 伊藤　嘉則; 森　克彦
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2016-10-28
Filing date: 2016-10-28
Publication date: 2021-10-06
Anticipated expiration: 2036-10-28
Also published as: JP2018073103A

Description

The present invention relates to an arithmetic circuit used for pattern recognition and the like, a control method therefor, and a program.

Neural network techniques are widely applied to image processing devices such as pattern recognition devices. Among neural networks, the computation method called Convolutional Neural Networks (hereinafter abbreviated as CNN) is attracting attention as a method that enables pattern recognition robust to variations in the recognition target. For example, Non-Patent Document 1 discloses various applications and implementations of convolutional neural networks (CNN). For CNN processing, various network configurations have been proposed according to the signal to be recognized, the recognition function to be realized, and so on. Here, the configuration of a convolutional neural network means the configuration expressed by the connection relationships of the convolution operations, such as the number of layers and the number of feature planes in each layer.

FIG. 15 is a network configuration diagram showing an example of simple CNN processing. When CNN processing is performed on image data, the input layer 1501 corresponds to raster-scanned image data of a predetermined size. The feature planes 1503a to 1503c are the feature planes of the first layer 1508. A feature plane is a data plane corresponding to the result of a predetermined feature extraction operation (a convolution operation followed by nonlinear processing). A feature plane corresponds to a feature extraction result used to recognize a predetermined target in a higher layer, and because it is the result of processing raster-scanned image data, the result is also represented as a plane.

The feature planes 1503a to 1503c are calculated by convolution operations and nonlinear processing applied to the input layer 1501. For example, the feature plane 1503a is calculated by a convolution operation with the schematically shown two-dimensional filter kernel 15021a followed by a nonlinear transformation of the result.

For example, a convolution operation whose filter kernel (filter coefficient matrix) size is columnSize × rowSize is processed by a product-sum operation in which each reference pixel value is multiplied by the corresponding weight coefficient and the products are summed over the kernel.

Here, "input(x, y)" denotes the reference pixel value at coordinates (x, y), and "output(x, y)" denotes the operation result at coordinates (x, y). Further, "weight(column, row)" denotes the weight coefficient applied at coordinates (x + column, y + row), and "columnSize" and "rowSize" denote the convolution kernel size.
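The product-sum expression itself does not appear in this text. A form consistent with the definitions of input, output, weight, columnSize and rowSize given here can be written as follows; the index ranges (starting at 0) are an assumption, and a range centered on the kernel would fit the same definitions equally well.

```latex
\[
\mathrm{output}(x, y) =
\sum_{\mathrm{row}=0}^{\mathrm{rowSize}-1}
\sum_{\mathrm{column}=0}^{\mathrm{columnSize}-1}
\mathrm{input}(x + \mathrm{column},\; y + \mathrm{row}) \times \mathrm{weight}(\mathrm{column}, \mathrm{row})
\]
```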

In CNN processing, the product-sum operation is repeated while a plurality of filter kernels are scanned pixel by pixel, and a feature plane is calculated by applying a nonlinear transformation to the final product-sum result. Since the feature plane 1503a is calculated from a single piece of image data in the preceding layer, its number of connections is 1, and there is one filter kernel 15021a for calculating it. The filter kernels 15021b and 15021c are the filter kernels used to calculate the feature planes 1503b and 1503c, respectively. A filter kernel may be referred to simply as a filter or a kernel.

FIG. 16 is a diagram illustrating an example of calculating the feature plane 1505a in CNN processing. The feature plane 1505a is calculated from the three feature planes 1503a to 1503c of the preceding layer 1508 and is connected to those feature planes. To calculate the data of the feature plane 1505a, a filter operation (convolution operation) using the schematically shown kernel 15041a is first performed on the feature plane 1503a, and the result is held in the cumulative adder 1601. Similarly, convolution operations with kernels 15042a and 15043a are performed on the feature planes 1503b and 1503c, respectively, and the results are accumulated in the cumulative adder 1601. After the convolution operations with the three kernels are complete, the nonlinear transformation processing 1602 is performed using a logistic function or a hyperbolic tangent function (tanh function).
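The accumulation described for FIG. 16 can be sketched in Python as follows. This is only an illustration: the function names, the absence of padding, and the use of nested lists are assumptions made for the example and are not taken from the patent.

```python
import math


def conv2d_valid(plane, kernel):
    """Product-sum of `kernel` over `plane` at every position where it fits (no padding)."""
    rows, cols = len(kernel), len(kernel[0])
    out_h = len(plane) - rows + 1
    out_w = len(plane[0]) - cols + 1
    return [
        [
            sum(plane[y + r][x + c] * kernel[r][c]
                for r in range(rows) for c in range(cols))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def feature_plane(ref_planes, kernels, nonlinearity=math.tanh):
    """Accumulate one convolution per connected reference plane, then transform.

    Mirrors the role of cumulative adder 1601 followed by nonlinear
    transformation 1602; one kernel is paired with each reference plane.
    """
    acc = None
    for plane, kernel in zip(ref_planes, kernels):
        partial = conv2d_valid(plane, kernel)
        if acc is None:
            acc = partial
        else:
            acc = [[a + p for a, p in zip(acc_row, part_row)]
                   for acc_row, part_row in zip(acc, partial)]
    return [[nonlinearity(v) for v in row] for row in acc]
```

Calling `feature_plane` with three reference planes and three kernels corresponds to the three-connection case of feature plane 1505a.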

The feature plane 1505a is calculated by performing the above processing while scanning the entire image one pixel at a time. Similarly, the feature plane 1505b is calculated from the three feature planes of the preceding layer 1508 using three convolution operations with kernels 15041b, 15042b and 15043b. Further, the feature plane 1507 is calculated from the feature planes 1505a and 1505b of the preceding layer 1509 using two convolution operations with kernels 15061 and 15062.

Each convolution coefficient is assumed to have been determined in advance by learning, using a general method such as perceptron learning or backpropagation learning. In object detection, pattern recognition and the like, large convolution kernels of 10 × 10 or more may be used.

As described above, CNN processing repeats many convolution operations with large kernel sizes and therefore requires an enormous number of product-sum operations. To handle various recognition tasks with common hardware, diverse networks must be processed efficiently with a high degree of parallelism.

Patent Document 1 proposes a device that increases speed by providing a plurality of product-sum arithmetic units and processing in parallel the convolution operations corresponding to a plurality of receptive field positions (pixel positions of the feature plane to be calculated). Patent Document 2 proposes a CNN processing device in which arithmetic units are assigned to convolution kernels.

Patent Document 1: Japanese Patent Laid-Open No. 2010-134697
Patent Document 2: US 2012/0303932

Non-Patent Document 1: Yann LeCun, Koray Kavukcuoglu and Clement Farabet: Convolutional Networks and Applications in Vision, Proc. International Symposium on Circuits and Systems (ISCAS'10), IEEE, 2010

However, Patent Document 1 focuses on a single feature plane to be calculated and processes a plurality of receptive fields in parallel, so efficient parallel processing may not be possible depending on the convolution kernel size, the region to be processed, and so on. For example, when the kernel size is small, the transfer time of the data input to the product-sum arithmetic units becomes a bottleneck, and the processing efficiency of the product-sum operations may fall.

The present invention has been made in view of the above problem. Its object is to provide an arithmetic circuit that avoids a decrease in the processing efficiency of product-sum operations by sequentially performing filter operations between a part of the reference data held in a holding unit and different filters. A further object is to provide a control method and a program for the arithmetic circuit.

To solve the above problem, according to the present invention, an arithmetic circuit that is connected to a storage device storing reference data and coefficient data for filter operations, and that calculates a plurality of calculated feature planes by filter operations on one reference feature plane, includes: at least one arithmetic unit that executes the filter operation on the reference data and the coefficient data; reference data holding means that holds a plurality of reference data transferred from the storage device; coefficient data holding means that holds a plurality of coefficient data transferred from the storage device; and control means. The control means causes each of the at least one arithmetic unit to execute an operation between one of the plurality of reference data held in the reference data holding means and one of the plurality of coefficient data held in the coefficient data holding means, in such a way that operations between the same one reference datum and respectively different coefficient data are executed sequentially. The control means also causes the plurality of calculated feature planes to be calculated while sequentially switching the calculated feature plane for each partial region of a predetermined size within the region of the calculated feature plane to be calculated.

According to the present invention, a decrease in the processing efficiency of product-sum operations can be avoided by sequentially performing filter operations between a part of the reference data held in a holding unit and different filters.

FIG. 1 is a block diagram showing the configuration of the arithmetic circuit of the first embodiment.
FIG. 2 is a diagram conceptually explaining the arithmetic processing of the arithmetic circuit of the first embodiment.
FIG. 3 is a diagram explaining the basic concept of the convolution operation.
FIG. 4 is a diagram showing the configuration of the control unit of the arithmetic circuit.
FIG. 5 is a diagram explaining an example of a parallel convolution operation.
FIG. 6 is a diagram explaining the configuration of the parallel product-sum arithmetic unit of the arithmetic circuit.
FIG. 7 is a diagram explaining the configuration of a shift register of the arithmetic circuit.
FIG. 8 is a time chart explaining the operation of the first embodiment.
FIG. 9 is a time chart explaining the operation of the first embodiment.
FIG. 10 is a diagram explaining the configuration of an image processing apparatus provided with the parallel arithmetic circuit.
FIG. 11 is a flowchart explaining the operation of the image processing apparatus.
FIG. 12 is a time chart explaining the operation of the second embodiment.
FIG. 13 is a block diagram showing the configuration of the arithmetic circuit of the second embodiment.
FIG. 14 is a diagram explaining the characteristic configuration and operation of the second embodiment.
FIG. 15 is a network configuration diagram showing an example of CNN processing.
FIG. 16 is a diagram explaining the convolution operation in CNN processing.

(First Embodiment)
First, the first embodiment of the present invention will be described. FIG. 1 is a diagram illustrating the configuration of an arithmetic circuit according to the first embodiment of the present invention.

Before describing FIG. 1, the basic concept of the convolution operation performed by this arithmetic circuit is explained with reference to FIG. 3, as one example of the various arithmetic processes the circuit of the present embodiment performs. FIG. 3 shows an example of calculating the feature plane 306 from the reference feature plane 302 by a convolution operation. The concept is explained here for the case where the feature plane data 305 at three vertically consecutive positions of the feature plane 306 are calculated in parallel. The basic idea is the same when horizontally consecutive positions of the feature plane 306 are calculated in parallel. For the purpose of explanation, the convolution kernel (filter kernel) is assumed to have coefficients in 3 rows and 1 column. The reference data needed to calculate the data 305 of the feature plane 306 in parallel are the data 301 of the reference feature plane 302.

The shift register 303 and the shift register 307 in FIG. 3 hold the reference data 301 and the coefficient data of the convolution kernel, respectively. The shift register 303 supplies data at different reference positions to the plurality of product-sum arithmetic units 304 in parallel, and the shift register 307 sequentially supplies coefficient data common to the plurality of product-sum arithmetic units 304. The shift registers 303 and 307 operate step by step in synchronization with a clock (not shown), and their outputs are processed in parallel by the parallel product-sum arithmetic units 304. Focusing on the feature plane datum o1 to be calculated, o1 = i1 × w1 is calculated at the first clock, o1 = o1 + i2 × w2 at the second clock, and o1 = o1 + i3 × w3 at the third clock. As a result, the desired convolution result (i1 × w1 + i2 × w2 + i3 × w3) is obtained in three clocks. When the convolution kernel is two-dimensional, a two-dimensional convolution operation is realized by repeating the above processing column by column and accumulating the results while changing the reference data and the coefficient data.
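The clock-by-clock behavior described for FIG. 3 can be illustrated with a small Python simulation. This is a sketch under simplifying assumptions: the function and variable names are hypothetical, each list models a register, and shifting is modeled by dropping the head element.

```python
def parallel_mac_column(ref, weights, m):
    """Simulate m product-sum units fed by a reference and a coefficient shift register.

    ref     : reference data i1, i2, ... (at least m + len(weights) - 1 values)
    weights : coefficients w1, w2, ... of a one-column kernel
    Returns the m accumulated outputs o1 .. om.
    """
    needed = m + len(weights) - 1
    if len(ref) < needed:
        raise ValueError("not enough reference data for the requested parallelism")

    ref_reg = list(ref[:needed])   # models shift register 303
    coef_reg = list(weights)       # models shift register 307
    acc = [0] * m                  # one accumulator per product-sum unit

    for _clock in range(len(weights)):
        w = coef_reg[0]            # the same coefficient goes to every unit
        for k in range(m):
            acc[k] += ref_reg[k] * w   # unit k sees a different reference position
        # Both registers shift by one position per clock.
        ref_reg = ref_reg[1:] + [0]
        coef_reg = coef_reg[1:] + [0]
    return acc


print(parallel_mac_column([1, 2, 3, 4, 5], [10, 20, 30], m=3))  # [140, 200, 260]
```

With three units and a three-coefficient kernel, all three outputs are complete after three clocks, matching the description above.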

By performing the convolution operation with the calculated feature plane 306 as the reference in this way, the data of the feature plane 306 at as many positions as the degree of parallelism of the product-sum arithmetic units 304 can be calculated in parallel, in a number of clocks that depends on the size of the filter kernel.

The present embodiment is described using this parallel convolution method, which takes the feature plane to be calculated as the reference, as an example. A characteristic of this method is that the size of the filter kernel and the degree of parallelism of the product-sum arithmetic units 304 are independent of each other. In other words, convolution operations can be processed with various degrees of parallelism.

The arithmetic circuit shown in FIG. 1 corresponds to the arithmetic circuit 1002 in the image processing apparatus shown in FIG. 10. It calculates feature planes sequentially from the lower layers, following the hierarchical connection relationships among a plurality of data groups as shown in FIG. 15. The RAM 101 is a memory that stores the data of the preceding layer to be processed and the data of the operation results. The RAM 101 is the same as the RAM 101 in FIG. 10.

The control unit 102 governs control of data transfer, control of the processing order of feature planes, and the like. FIG. 4 is a diagram illustrating the configuration of the control unit 102 in more detail. The sequence control unit 1201 inputs and outputs various control signals 1204 that control the operation of the circuit of FIG. 1 according to the information set in the register group 1202. Similarly, the sequence control unit 1201 outputs a control signal 1206 for controlling the memory control unit 1205. The sequence control unit 1201 is a sequencer built from binary counters, Johnson counters and the like. The register group 1202 consists of a plurality of register sets and records, for example, information on the feature planes to be referenced and calculated, information on the kernels, and information on the processing order of the feature planes. Predetermined values are written to the register group 1202 in advance by the CPU 1007 via the bridge 1004 and the image bus 1003.

The reference data shift register 106 is a data supply unit that supplies reference data to the parallel product-sum arithmetic unit 107. It supplies the reference data buffered in the reference data buffer 105 (the feature plane data of the preceding layer needed for the convolution operation) to the parallel product-sum arithmetic unit 107 in parallel at predetermined timing. The coefficient data shift register 104 is a data supply unit that supplies coefficient data to the parallel product-sum arithmetic unit 107; it sequentially supplies the parameter data (weight coefficients) needed for the convolution operation.

The parallel product-sum arithmetic unit 107 contains m product-sum arithmetic units (m is 1 or more), which operate in parallel on the same clock. FIG. 6 is a diagram showing the schematic configuration of the parallel product-sum arithmetic unit 107. The data 6011 to 601m are the output data of the reference data shift register 106; they are different reference data supplied to the respective multipliers 6031 to 603m. The data 602 is the output data of the coefficient data shift register 104 and is supplied in common to the multipliers 6031 to 603m. The cumulative adders 6041 to 604m accumulate the multiplication results during the convolution kernel operation period. The clear signal 605 is used to clear the internal latches of the cumulative adders 6041 to 604m when a predetermined unit of convolution operation ends. The latch enable signal (LatchEnable signal) 606 triggers the update of the accumulated value. A signal synchronized with a clock signal (not shown) is assumed to be connected to the LatchEnable signal.

The coefficient data holding units 1031 to 103n temporarily store, out of the coefficient data (parameter data) stored in the RAM 101, the coefficient data needed for the arithmetic processing. They are implemented as, for example, a cache or a prefetch buffer. There are n holding units (n is 1 or more), which hold the weight coefficients corresponding to n kinds of convolution kernels. In the present embodiment the coefficient data are assumed to be stored in the RAM 101, but they are not limited to the RAM 101 and may be stored in another storage unit or storage device; for example, the coefficient data may be stored in a ROM or the like (not shown). The result shift register 108 serves as the operation result extraction unit and takes out the operation result each time a convolution operation ends.

In the present embodiment, a plurality of kinds of convolution kernels are stored in the coefficient data buffers 1031 to 103n and are switched in turn and supplied to the parallel product-sum arithmetic unit 107, so that different convolution operations are performed on the same reference data. That is, the data of different feature planes are calculated in sequence.
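The idea of reusing the same buffered reference data with several kernels in turn can be sketched as follows. This is a minimal illustration for a one-dimensional kernel; the function name and the list-based data layout are assumptions made for the example, not the circuit's actual interface.

```python
def planes_from_shared_reference(ref_row, kernels, m):
    """Compute m outputs for each kernel from one buffered set of reference data.

    ref_row : reference data transferred once (at least m + kernel length - 1 values)
    kernels : list of coefficient lists, one per calculated feature plane
    Returns one list of m outputs per kernel, in kernel order.
    """
    results = []
    for kernel in kernels:              # switch the coefficient buffer, keep ref_row
        acc = [0] * m
        for j, w in enumerate(kernel):  # one coefficient per clock
            for k in range(m):          # m units work on neighboring positions
                acc[k] += ref_row[k + j] * w
        results.append(acc)
    return results


print(planes_from_shared_reference([1, 2, 3, 4], [[1, 1], [1, -1]], m=3))
# [[3, 5, 7], [-1, -1, -1]]
```

Because `ref_row` is loaded once and used for every kernel, the number of reference data transfers does not grow with the number of feature planes calculated from it, which is the effect the embodiment aims for when kernels are small.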

The nonlinear transformation processing unit 109 applies nonlinear transformation processing, such as a sigmoid function, to the data output from the result shift register 108. Its output is stored in the RAM 101 via the control unit 102 and held there as reference data for the next layer. By referring to the feature planes stored in the RAM 101 as the operation results of the preceding layer, a multi-layer network can be processed layer by layer.

The coefficient data shift register 104, the reference data shift register 106 and the result shift register 108 are shift registers with a data load function. The reference data buffer 105 and the coefficient data buffers 1031 to 103n each consist of a plurality of registers with the same bit width as the reference data shift register 106 and the coefficient data shift register 104, respectively. The result shift register 108 consists of a plurality of registers with the same number of bits as the effective bits of the cumulative adder outputs of the parallel product-sum arithmetic unit 107. FIG. 7 shows a configuration example of these shift registers.

FIG. 7 illustrates an example with four registers. The multi-bit flip-flops 701a to 701d latch data of a predetermined number of bits in synchronization with the CLOCK signal. The selectors 702a to 702c select OUTx (x: 0 to 2) when the selection signal (Load signal) is 0 and INx (x: 1 to 3) when it is 1. That is, a shift operation or a load operation is selected according to the Load signal. The Enable signal enables data transitions: when it is 1, data are latched at the rising edge of the CLOCK signal, and when it is 0, the latched data are held as they are (no state transition).
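A behavioral model of this loadable shift register may make the Load and Enable semantics easier to follow. The class below is a hypothetical sketch, not the circuit itself; in particular, it assumes that the first flip-flop always latches IN0, since the text describes only three selectors for four flip-flops.

```python
class LoadableShiftRegister:
    """Behavioral model of a shift register with Load and Enable inputs."""

    def __init__(self, size=4):
        self.out = [0] * size  # OUT0 .. OUT(size-1)

    def clock(self, inputs, load, enable):
        """Apply one rising CLOCK edge and return the new outputs.

        inputs : IN0 .. IN(size-1)
        load   : 1 selects INx for stages 1 and up (load), 0 selects OUT(x-1) (shift)
        enable : 0 holds the current state regardless of the other inputs
        """
        if not enable:
            return list(self.out)
        nxt = [inputs[0]]  # assumption: stage 0 has no selector and takes IN0
        for x in range(1, len(self.out)):
            nxt.append(inputs[x] if load else self.out[x - 1])
        self.out = nxt
        return list(self.out)


sr = LoadableShiftRegister(4)
print(sr.clock([1, 2, 3, 4], load=1, enable=1))  # [1, 2, 3, 4]  (load)
print(sr.clock([0, 0, 0, 0], load=0, enable=1))  # [0, 1, 2, 3]  (shift)
print(sr.clock([9, 9, 9, 9], load=1, enable=0))  # [0, 1, 2, 3]  (hold)
```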

The Load2, Load4 and Load5 signals in FIG. 1 are the Load signals of the coefficient data shift register 104, the reference data shift register 106 and the result shift register 108, respectively. The Enable1, Enable2 and Enable3 signals in FIG. 1 are the Enable signals of the coefficient data shift register 104, the reference data shift register 106 and the result shift register 108, respectively. After the initial data are loaded, the coefficient data shift register 104 performs shift operations for a number of clocks equal to the horizontal size of the convolution kernel, and supplies weight coefficient data to the parallel product-sum arithmetic unit 107 one after another as it shifts. The OUTn signal in FIG. 7 of each of the shift registers 405a and 405b is connected in common to all of the parallel product-sum arithmetic units 107.

Similarly, the reference data shift register 106 is loaded with initial data from the reference data buffer 105. It then performs shift operations for a number of clocks equal to the horizontal size of the convolution kernel and supplies a plurality of reference data (the OUT1 to OUTn signals in FIG. 7) to the parallel product-sum arithmetic unit 107 simultaneously.

The coefficient data shift register 104 and the reference data shift register 106 operate in synchronization. The parallel product-sum arithmetic unit 107 executes product-sum operations on the data supplied from the two shift registers. After the operations for all convolution kernels corresponding to the target feature plane are finished, the accumulated sums obtained are loaded into the result shift register 108 and sent to the nonlinear transformation processing unit 109 at predetermined timing. As shown in FIG. 6, the parallel product-sum arithmetic unit 107 consists of m identical circuits arranged side by side, all operating on the same clock. The result shift register 108 consists of flip-flops capable of holding the m product-sum outputs.

Only predetermined effective bits of the output of the parallel product-sum arithmetic unit 107 are connected to the result shift register 108. The nonlinear transformation processing unit 109 can be implemented with a lookup table or the like. The transformed data are stored at a predetermined address in the RAM 101. The storage address is also controlled by the control unit 102.
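One way a lookup-table nonlinearity of this kind might look is sketched below. All specifics are assumptions chosen for illustration: the 8-bit signed input, the 8-bit unsigned output, the input scaling of 1/16, and the function names are not taken from the patent.

```python
import math


def build_sigmoid_lut(in_bits=8, out_bits=8, input_scale=1 / 16):
    """Precompute sigmoid outputs for every possible input code.

    The table index is the two's-complement bit pattern of the input,
    so a hardware implementation could address it directly.
    """
    size = 1 << in_bits
    half = size >> 1
    max_out = (1 << out_bits) - 1
    lut = []
    for index in range(size):
        signed = index - size if index >= half else index  # decode two's complement
        y = 1.0 / (1.0 + math.exp(-signed * input_scale))
        lut.append(round(y * max_out))
    return lut


def apply_lut(lut, value, in_bits=8):
    """Transform one signed value by table lookup (the value is wrapped to in_bits)."""
    return lut[value & ((1 << in_bits) - 1)]


lut = build_sigmoid_lut()
print(apply_lut(lut, -128), apply_lut(lut, 127))  # near 0 and near 255
```

A table avoids evaluating the exponential at run time, at the cost of a memory whose size grows with the number of effective input bits, which is one reason to pass only the effective bits of the accumulator on to this stage.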

FIG. 5 is a diagram schematically illustrating a specific example of parallel processing by the arithmetic circuit of the present embodiment. The reference data plane 802 and the calculated data plane 804 in FIG. 5 are each represented using raster-scanned data coordinates. In the reference data plane 802, each datum (the smallest square shown schematically) is an operation result of the preceding layer stored in the RAM 101 in raster-scan order (input(x, y), x: horizontal position, y: vertical position). In the calculated data plane 804, each datum is a raster-scanned operation result (output(x, y), x: horizontal position, y: vertical position).

The calculation range 803 indicates the positions of the data obtained in parallel by the parallel product-sum arithmetic unit 107 (for m = 4), and the reference range 801 is the range of reference data needed for the calculation range 803 when the kernel size of the convolution operation is 3 × 3. The control unit transfers the data of each line in the reference range 801 to the reference data register buffer in turn, and the parallel product-sum arithmetic unit carries out the convolution operation as the reference data are shifted.

ここで、コンボリューション演算処理の基本的な動作について説明する。本実施形態によるコンボリューション演算は、算出特徴面の水平方向に連続するｍ画素位置のデータを並列に算出するものである。係数データバッファ１０３１〜１０３ｎのそれぞれは少なくともコンボリューションカーネルの水平方向のサイズより多いレジスタで構成される。例えば重み係数が８ビットで表されるデータの場合、８ビット幅の複数のレジスタで構成される。例えば、水平方向のコンボリューションカーネルサイズが「１１」の場合、当該レジスタの数は「１１」とする。 Here, the basic operation of the convolution processing will be described. The convolution operation according to the present embodiment calculates, in parallel, the data at m consecutive pixel positions in the horizontal direction of the feature plane being calculated. Each of the coefficient data buffers 1031 to 103n consists of at least as many registers as the horizontal size of the convolution kernel. For example, when each weighting coefficient is 8-bit data, each buffer consists of a plurality of 8-bit-wide registers. For example, when the horizontal convolution kernel size is 11, the number of registers is 11.

実際には、想定する最大コンボリューションサイズのレジスタ数で構成する。制御部１０２は、積和演算処理に必要な複数種類の係数を当該レジスタに予めロードし、算出特徴面毎に選択して利用する。参照データバッファ１０５はＲＡＭ１０１に格納された参照データの一部を一時的に保持するために使用される。 Actually, it is composed of the number of registers of the assumed maximum convolution size. The control unit 102 loads a plurality of types of coefficients required for the product-sum calculation process into the register in advance, and selects and uses each of the calculation feature planes. The reference data buffer 105 is used to temporarily hold a part of the reference data stored in the RAM 101.

参照データが８ビットで表されるデータの場合、参照データバッファ１０５は８ビット幅の複数のレジスタで構成される。参照データバッファ１０５は所定数以上の個数のレジスタで構成される。この所定数は、並列に処理する複数の演算器のそれぞれが一単位のコンボリューション演算を実行するために必要な参照データの数である。この所定数は、例えば、（「並列に処理する演算器の数」＋「並列処理する方向と同じ方向のコンボリューションカーネルサイズ」−１）×「並列処理する方向と直行する方向のコンボリューションカーネルサイズ」によって計算される。 When the reference data is 8-bit data, the reference data buffer 105 consists of a plurality of 8-bit-wide registers. The reference data buffer 105 consists of at least a predetermined number of registers. This predetermined number is the number of reference data items needed for each of the arithmetic units operating in parallel to execute one unit of the convolution operation. It is calculated, for example, as ("number of arithmetic units operating in parallel" + "convolution kernel size in the direction of parallel processing" − 1) × "convolution kernel size in the direction orthogonal to the direction of parallel processing".

さらに、ここでは、参照データの読み出しとコンボリューション演算をパイプライン動作させるために、参照データバッファは上記サイズの２倍のレジスタからなるダブルバッファで構成されてよい。参照データバッファは制御部１０２の制御に従って参照データシフトレジスタ１０６にロードする複数のデータを並列に出力する。 Further, here, in order to pipeline the reading of the reference data and the convolution operation, the reference data buffer may be composed of a double buffer composed of registers having twice the above size. The reference data buffer outputs a plurality of data to be loaded into the reference data shift register 106 in parallel under the control of the control unit 102.
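
The register count described above can be checked with a short calculation. The following Python sketch is only an illustration of the stated formula; the function name and parameters are hypothetical and do not appear in the specification.

```python
def reference_buffer_registers(parallelism, kernel_along, kernel_across,
                               double_buffered=False):
    """Minimum number of registers in the reference data buffer.

    parallelism   -- number of arithmetic units operating in parallel (m)
    kernel_along  -- kernel size in the direction of parallel processing
    kernel_across -- kernel size in the orthogonal direction
    All names are illustrative assumptions, not taken from the patent.
    """
    count = (parallelism + kernel_along - 1) * kernel_across
    # A double buffer, used to pipeline reads and computation, doubles the count.
    return 2 * count if double_buffered else count


print(reference_buffer_registers(20, 5, 5))                        # 120
print(reference_buffer_registers(20, 5, 5, double_buffered=True))  # 240
```

For the m = 4, 3 × 3 case of FIG. 5, the same formula gives (4 + 3 − 1) × 3 = 18 registers.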

図２は図１の演算回路の動作モードを概念的に説明する図である。 FIG. 2 is a diagram conceptually explaining the operation mode of the arithmetic circuit of FIG.

図２（Ａ）は１対１の結合関係のネットワークを４並列で動作する積和演算器２０２を用いて算出する場合の例を示している。ここでは算出する特徴面２０６の４画素位置２０３のコンボリューション演算を並列に処理する。積和演算器２０２はコンボリューション演算の内容によって定まる必要な参照データ２０１を、データバッファ２０５を介して参照しながら並列処理単位でラスタ―スキャン順に演算を進める。 FIG. 2A shows an example of calculating a network having a one-to-one coupling relationship using a multiply-accumulate calculator 202 that operates in four parallels. Here, the convolution calculation of the 4-pixel position 203 of the feature surface 206 to be calculated is processed in parallel. The product-sum calculator 202 proceeds with the calculation in the order of raster scan in parallel processing units while referring to the necessary reference data 201 determined by the content of the convolution operation via the data buffer 205.

図２（Ｂ）は１対２の結合関係のネットワークを図２（Ａ）で示す構成の演算処理装置で処理する場合の例を示している。４並列の積和演算器２０２を用いて特徴面２０８、２０９を順に処理する。特徴面２０８では算出領域２０７、２１０の順に算出する。即ち、並列処理単位で面順次処理により特徴面２０８、２０９を順次算出する。この場合、特徴面２０８と２０９が参照する参照データは共通であるが、特徴面を面順次で順に処理するため、特徴面２０８，２０９の処理毎に同じ参照データ２０１がデータバッファ２０５に転送されることになる。 FIG. 2B shows an example in which a network having a one-to-two coupling relationship is processed by the arithmetic processing unit having the configuration shown in FIG. 2A. The feature planes 208 and 209 are processed in order using the four-parallel multiply-accumulate calculator 202. On the feature surface 208, the calculation areas 207 and 210 are calculated in this order. That is, the feature planes 208 and 209 are sequentially calculated by the plane sequential processing in the parallel processing unit. In this case, the reference data referred to by the feature planes 208 and 209 is common, but since the feature planes are processed in order of the planes, the same reference data 201 is transferred to the data buffer 205 for each processing of the feature planes 208 and 209. Will be.

図２（Ｃ）は４並列の積和演算器２０２を用いて異なる特徴面２０８、２０９を並列処理単位で順次算出する。つまり、算出領域２０７、２１１、２１０、２１２の順に処理する。この場合、例えば、２つの特徴面２０８、２０９の算出領域２０７，２１１の演算時に必要となる参照特徴面２０４上の参照データ２０１はデータバッファ２０５に保持され、再利用される。一般的に、参照特徴面は低速な大容量なメモリに格納され、データバッファ２０５は高速・小容量なメモリやレジスタ等で構成される。図２（Ｃ）に示すように複数の特徴面を跨いで並列処理単位で順に処理する場合、データバッファ２０５を介して特徴面２０８と２０９の対象領域算出時のデータを共用する事ができる。このため、図２（Ｂ）に示すように面順次で処理する場合に比べて参照データのデータバッファ２０５への転送数を半減させることができる。参照データの読み出し転送速度を考慮しない場合、図２（Ｂ）と図２（Ｃ）の処理時間を同等とみなすことができるが、転送速度が遅い場合、図２（Ｂ）はデータ転送時間が処理時間を律し、図２（Ｃ）に比べて処理時間が増加する場合がある。これは、演算器の並列度が高く、コンボリューションカーネルのサイズが小さい場合に顕著になる。 In FIG. 2(C), the different feature planes 208 and 209 are calculated alternately in parallel processing units using the four-parallel multiply-accumulate unit 202. That is, the calculation areas 207, 211, 210, and 212 are processed in this order. In this case, for example, the reference data 201 on the reference feature plane 204, which is needed to compute the calculation areas 207 and 211 of the two feature planes 208 and 209, is held in the data buffer 205 and reused. In general, the reference feature plane is stored in a slow, large-capacity memory, whereas the data buffer 205 consists of a fast, small-capacity memory, registers, or the like. When processing proceeds in parallel processing units across a plurality of feature planes as shown in FIG. 2(C), the data used to calculate the target areas of the feature planes 208 and 209 can be shared via the data buffer 205. The number of transfers of reference data to the data buffer 205 can therefore be halved compared with the plane-sequential processing shown in FIG. 2(B). If the read transfer speed of the reference data is ignored, the processing times of FIG. 2(B) and FIG. 2(C) can be regarded as equal; if the transfer speed is low, however, the data transfer time limits the processing time in FIG. 2(B), and the processing time may be longer than in FIG. 2(C). This effect is most pronounced when the degree of parallelism of the arithmetic units is high and the convolution kernel is small.
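
The halving of buffer transfers can be illustrated by counting reloads for the two processing orders. The Python sketch below is a simplified model under stated assumptions: the data buffer holds the reference data for one calculation position at a time, and calculation areas at the same position on different feature planes (207 and 211, or 210 and 212) need the same reference data. The function name and the tuple encoding are hypothetical.

```python
def count_buffer_transfers(order):
    """Count how often reference data must be transferred to the data buffer.

    order -- list of (feature_plane, position) pairs in processing order.
    Assumption: the buffer holds reference data for a single position, so a
    transfer is needed whenever the position differs from the one held.
    """
    transfers = 0
    held_position = None
    for _plane, position in order:
        if position != held_position:
            transfers += 1
            held_position = position
    return transfers


# FIG. 2(B): plane-sequential order, areas 207, 210, then 211, 212.
plane_sequential = [(208, 0), (208, 1), (209, 0), (209, 1)]
# FIG. 2(C): interleaved order, areas 207, 211, 210, 212.
interleaved = [(208, 0), (209, 0), (208, 1), (209, 1)]

print(count_buffer_transfers(plane_sequential))  # 4
print(count_buffer_transfers(interleaved))       # 2
```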

この例で示すように、ＣＮＮネットワークの構成や動作条件に応じて図２（Ａ）と図２（Ｃ）に示すように処理順を切り替えることで並列演算器の数に応じた最良の性能を引き出すことができる。 As this example shows, switching the processing order between those of FIG. 2(A) and FIG. 2(C) according to the configuration and operating conditions of the CNN makes it possible to obtain the best performance for the given number of parallel arithmetic units.

本実施形態では、ＣＮＮネットワークの構成や動作条件に応じて演算回路の処理順を最適化し、適切な動作モードを選択することを提案する。 In the present embodiment, it is proposed to optimize the processing order of the arithmetic circuit according to the configuration of the CNN network and the operating conditions, and select an appropriate operating mode.

次に、図８及び図９を用いて本実施形態の演算回路の動作モードをより詳細に説明する。図８は、図２（Ａ）に示すように、各特徴面を面順次で順次に演算する場合のタイムチャート概要を示す。図９は、図２（Ｃ）に示すように、２つの特徴面をコンボリューション演算単位で切り替えながら演算する場合のタイムチャート概要を示す。図８及び図９は、演算処理の処理順が異なる動作モードのタイムチャートである。動作モードは制御部１０２内のレジスタ群１２０２の設定で変更可能である。 Next, the operation mode of the arithmetic circuit of the present embodiment will be described in more detail with reference to FIGS. 8 and 9. As shown in FIG. 2A, FIG. 8 shows an outline of a time chart in the case where each feature plane is sequentially calculated in a plane sequence. As shown in FIG. 2C, FIG. 9 shows an outline of a time chart in a case where two characteristic planes are switched in convolution calculation units for calculation. 8 and 9 are time charts of operation modes in which the processing order of arithmetic processing is different. The operation mode can be changed by setting the register group 1202 in the control unit 102.

まず、図８を用いて、一つの特徴面を並列処理単位でラスタスキャン処理する場合の例（図２Ａの処理に相当する）を説明する。なお、図８に示す信号は全て図示しないクロック信号に基づいて同期動作する。図８は特徴面処理開始時の一部のタイミングを示す。図８はカーネルサイズが５×５の場合である。 First, an example in which one feature plane is raster-scanned in parallel processing units (corresponding to the processing of FIG. 2(A)) will be described with reference to FIG. 8. All the signals shown in FIG. 8 operate synchronously based on a clock signal (not shown). FIG. 8 shows part of the timing at the start of feature plane processing, for a kernel size of 5 × 5.

係数データバッファ１０３１〜１０３ｎには、特徴面の演算処理開始前に、必要な係数データがロードされているものとする。ｓｅｌ信号は係数データバッファ１０３１〜１０３ｎの出力を選択する信号であり、複数のコンボリューションカーネルに対応する係数から所望の係数を選択するために使用する。ここでは、一つの特徴面を演算処理する動作中では、ｓｅｌ信号は常に０である。 It is assumed that necessary coefficient data is loaded in the coefficient data buffers 1031 to 103n before the start of the arithmetic processing of the feature surface. The sel signal is a signal for selecting the output of the coefficient data buffers 1031 to 103n, and is used to select a desired coefficient from the coefficients corresponding to a plurality of convolution kernels. Here, the sel signal is always 0 during the operation of arithmetically processing one characteristic surface.

また、参照データバッファ１０５には、カーネル垂直方向演算区間１である区間４０２では、演算処理するために必要な参照データが全てロードされているものとする。 Further, it is assumed that the reference data buffer 105 is loaded with all the reference data necessary for arithmetic processing in the interval 402, which is the kernel vertical arithmetic interval 1.

制御部１０２は、まず、次のカーネル垂直方向演算区間２である区間４０３で必要な参照データのロードを開始するためにＬｏａｄ３信号を有効化する。ここで、Ｌｏａｄ３信号は信号レベル１の場合が有効化された状態であるものとする。なお、カーネル垂直方向区間１に必要な参照データは既に参照データバッファ１０５に格納済みであるとする。ここでは、参照データバッファはダブルバッファで構成されているとし、データの参照とデータのロードを同時に処理可能である。 The control unit 102 first enables the Load3 signal to start loading the reference data needed in the section 403, which is the next kernel vertical calculation section 2. Here, the Load3 signal is assumed to be enabled when its signal level is 1. The reference data needed for the kernel vertical calculation section 1 is assumed to be already stored in the reference data buffer 105. The reference data buffer is assumed here to be a double buffer, so that data can be referenced and loaded at the same time.

制御部１０２は、Ｌｏａｄ３信号の有効化と同時にＲＡＭ１０１から参照データを取り出し、参照データバッファ１０５にセットする。セットするデータの数はコンボリューションカーネルの大きさ及び並列度から決定する。例えば、コンボリューション演算のカーネルサイズが５×５である場合、演算器の並列度を２０とすると、２０＋５−１＝２４個のデータをセットする。＊ＣＬＲ信号は、並列積和演算器１０７の累積加算器６０４１〜６０４ｍを初期化するための信号であり、当該信号が０である場合、累積加算器に内蔵するレジスタは０に初期化される。 At the same time as it enables the Load3 signal, the control unit 102 reads reference data from the RAM 101 and sets it in the reference data buffer 105. The number of data items to be set is determined by the size of the convolution kernel and the degree of parallelism. For example, when the kernel size of the convolution operation is 5 × 5 and the degree of parallelism of the arithmetic units is 20, 20 + 5 − 1 = 24 data items are set. The *CLR signal initializes the accumulators 6041 to 604m of the parallel multiply-accumulate unit 107; when this signal is 0, the registers built into the accumulators are initialized to 0.

制御部１０２は、新たな特徴面位置のコンボリューション演算開始前に＊ＣＬＲ信号を０にする。参照データバッファ１０５はダブルバッファ構成であるため、カーネル垂直方向演算区間１（区間４０２）で使用するデータを出力すると共に、カーネル垂直方向演算区間２（区間４０３）で使用するためのデータを格納する。以後、参照データバッファ１０５は図示しないトグル信号に従ってダブルバッファとしてデータの読み出し、書き出しが制御される。 The control unit 102 sets the *CLR signal to 0 before starting the convolution operation for a new feature plane position. Because the reference data buffer 105 has a double-buffer configuration, it outputs the data used in the kernel vertical calculation section 1 (section 402) while storing the data to be used in the kernel vertical calculation section 2 (section 403). Thereafter, reading from and writing to the reference data buffer 105 are controlled as a double buffer according to a toggle signal (not shown).

Ｌｏａｄ２信号は係数データシフトレジスタ１０４の初期化を指示するための信号である。当該信号が１でかつＥｎａｂｌｅ１信号が有効（信号レベル１）の場合、係数データバッファ１０３１に保持する複数の重み係数データが係数データシフトレジスタ１０４に一括ロードされる。 The Load2 signal is a signal for instructing the initialization of the coefficient data shift register 104. When the signal is 1 and the Enable1 signal is valid (signal level 1), a plurality of weight coefficient data held in the coefficient data buffer 1031 are collectively loaded into the coefficient data shift register 104.

Ｅｎａｂｌｅ１信号はシフトレジスタのデータ遷移を制御する信号である。演算器の動作中は、Ｅｎａｂｌｅ１信号は常に１に設定されているため、Ｌｏａｄ２信号が１の場合、クロック信号に応じての出力をラッチし、Ｌｏａｄ２信号が０の場合、クロック信号に応じてシフト処理を継続する。 The Enable1 signal controls the data transitions of the shift register. Because the Enable1 signal is always set to 1 while the arithmetic units are operating, the shift register latches the output of the coefficient data buffer in response to the clock signal when the Load2 signal is 1, and continues shifting in response to the clock signal when the Load2 signal is 0.

制御部１０２のシーケンス制御部１２０１は、コンボリューションカーネルの水平方向サイズに応じたクロック数をカウントするとＬｏａｄ２信号を有効化し、シフト動作を停止させる。同時に、シーケンス制御部１２０１は、係数データバッファ１０３１に保持する重み係数データを係数データシフトレジスタ１０４に一括ロードする。 The sequence control unit 1201 of the control unit 102 activates the Load2 signal when counting the number of clocks corresponding to the horizontal size of the convolution kernel, and stops the shift operation. At the same time, the sequence control unit 1201 collectively loads the weighting coefficient data held in the coefficient data buffer 1031 into the coefficient data shift register 104.

即ち、コンボリューションカーネルの水平方向単位で重み係数を一括ロードし、ロードした係数を動作クロックに応じてシフトアウトする。ここで、図８の場合Ｓｅｌ信号は常に０ｘ００であり、係数データバッファ１０３１〜１０３ｎは特定のカーネルの係数を順次に出力する。つまり、同じカーネルで一つの特徴面を算出する。 That is, the weighting coefficients are loaded all at once for each horizontal unit of the convolution kernel, and the loaded coefficients are shifted out in step with the operating clock. In the case of FIG. 8, the Sel signal is always 0x00, and the coefficient data buffers 1031 to 103n sequentially output the coefficients of one specific kernel. In other words, one feature plane is calculated with the same kernel.

Ｌｏａｄ４信号は、参照データシフトレジスタ１０６の初期化を指示するための信号である。当該信号が１でかつＥｎａｂｌｅ２信号が有効（信号レベル１）の場合、参照データバッファ１０５に保持する参照データが参照データシフトレジスタ１０６に一括ロードされる。参照データバッファ１０５に格納されているデータは、図示しないタイミング信号に従って水平方向の処理単位で必要な参照データを出力する。参照データバッファ１０５が出力するデータはカーネル水平方向演算区間（区間４０１）毎に対応する異なる参照データを出力する。 The Load4 signal is a signal for instructing the initialization of the reference data shift register 106. When the signal is 1 and the Enable2 signal is valid (signal level 1), the reference data held in the reference data buffer 105 is collectively loaded into the reference data shift register 106. The data stored in the reference data buffer 105 outputs necessary reference data in horizontal processing units according to a timing signal (not shown). The data output by the reference data buffer 105 outputs different reference data corresponding to each kernel horizontal operation section (section 401).

なお、Ｅｎａｂｌｅ２信号はシフトレジスタのデータ遷移を制御する信号であるが、動作中は常に１に設定されている。従って、Ｌｏａｄ４信号が１の場合、クロック信号に応じて参照データバッファ１０５の出力をラッチし、Ｌｏａｄ４信号が０である場合、クロック信号に応じてシフト処理を継続する。 The Enable2 signal is a signal that controls the data transition of the shift register, but is always set to 1 during operation. Therefore, when the Load4 signal is 1, the output of the reference data buffer 105 is latched according to the clock signal, and when the Load4 signal is 0, the shift process is continued according to the clock signal.

制御部１０２のシーケンス制御部１２０１は、コンボリューションカーネルの水平方向サイズに応じたクロック数をカウントするとＬｏａｄ４信号を有効化し、シフト動作を停止させると同時に参照データバッファ１０５に保持する参照データを一括ロードする。 When the sequence control unit 1201 of the control unit 102 has counted the number of clocks corresponding to the horizontal size of the convolution kernel, it enables the Load4 signal, stops the shift operation, and at the same time loads the reference data held in the reference data buffer 105 all at once.

即ち、コンボリューションカーネルの１列単位で必要な参照データを一括ロードし、ロードした参照データを動作クロックに応じてシフトする。以上、制御部１０２はＬｏａｄ４信号をＬｏａｄ２信号と同一タイミングで制御する。 That is, the necessary reference data is collectively loaded in units of one column of the convolution kernel, and the loaded reference data is shifted according to the operating clock. As described above, the control unit 102 controls the Load4 signal at the same timing as the Load2 signal.

並列積和演算器１０７は、クロックに同期して積和演算を継続しているため、シフトレジスタ１０４及び１０６のシフト動作に従って算出する特徴面の複数の点に対して同時にコンボリューションカーネルサイズに応じた積和演算処理を実行する。 Since the parallel multiply-accumulate unit 107 continues the product-sum operation in synchronization with the clock, the convolution kernel size is simultaneously applied to a plurality of points on the feature surface calculated according to the shift operation of the shift registers 104 and 106. Executes the product-sum operation process.

具体的には、シフトレジスタ１０４とシフトレジスタ１０６のシフト動作期間（図８中のカーネル水平方向演算区間４０１）中にコンボリューションカーネルの１列分の積和演算がなされることになる。 Specifically, during the shift operation period of the shift register 104 and the shift register 106 (kernel horizontal calculation section 401 in FIG. 8), the product-sum calculation for one column of the convolution kernel is performed.

当該列単位の演算を重み係数及び参照データを入替ながら水平方向に繰り返すことで並列度の数に応じた二次元のコンボリューション演算結果が算出される（図８のカーネル垂直方向演算区間１（区間４０２））。 By repeating this column-unit operation in the horizontal direction while replacing the weighting coefficients and the reference data, as many two-dimensional convolution results as the degree of parallelism are calculated (kernel vertical calculation section 1 (section 402) in FIG. 8).
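
The load-and-shift operation described above can be modeled in software. The following Python sketch is a behavioral model only, not the circuit itself: it assumes that one kernel line of coefficients and m + kernel width − 1 reference data items are loaded per horizontal section, that both shift registers shift by one position per clock, and that accumulator p always reads position p of the reference data shift register. All names are hypothetical.

```python
def parallel_conv_unit(image, kernel, x0, y0, m):
    """Compute m horizontally adjacent convolution outputs starting at (x0, y0).

    image  -- 2D list of reference data, indexed as image[y][x]
    kernel -- 2D list of weighting coefficients, indexed as kernel[j][i]
    m      -- degree of parallelism (number of accumulators)
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    acc = [0] * m  # accumulators cleared first (role of the *CLR signal)
    for j in range(kernel_h):
        # Bulk load of one line of reference data and coefficients
        # (roles of the Load4 and Load2 signals).
        ref_sr = list(image[y0 + j][x0:x0 + m + kernel_w - 1])
        coef_sr = list(kernel[j])
        for _ in range(kernel_w):  # one clock per coefficient
            coefficient = coef_sr[0]  # coefficient currently shifted out
            for p in range(m):
                acc[p] += coefficient * ref_sr[p]
            # Both shift registers advance by one position.
            ref_sr = ref_sr[1:]
            coef_sr = coef_sr[1:]
    return acc


image = [[y * 8 + x for x in range(8)] for y in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(parallel_conv_unit(image, kernel, x0=0, y0=0, m=4))
```

Because the reference data shifts while each accumulator stays at a fixed tap, one shared coefficient per clock is enough to serve all m outputs, which is the property the shift-register arrangement exploits.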

このように、制御部１０２はカーネルサイズ及び並列度に応じて各信号を制御することで、積和演算処理と積和演算処理に必要なデータ（参照データ）のＲＡＭ１０１からの供給を並行に実行させる。 In this way, by controlling each signal according to the kernel size and the degree of parallelism, the control unit 102 causes the multiply-accumulate processing and the supply from the RAM 101 of the data it needs (the reference data) to run in parallel.

Ｌｏａｄ５信号は並列積和演算器の結果を結果シフトレジスタ１０８に並列にロードするための信号であり、制御部１０２は対象となる特徴面の並列処理単位の積和演算が終了するとＬｏａｄ５信号及びＥｎａｂｌｅ３信号に１を出力する。結果シフトレジスタ１０８はＬｏａｄ５信号が１でＥｎａｂｌｅ３信号が１の場合、並列積和演算器１０７の出力を一括ロードする。制御部１０２はシフトレジスタ１０４及び１０５のシフト動作中にＥｎａｂｌｅ３の信号を有効化し、結果シフトレジスタ１０８に保持する演算結果をシフトアウトする。シフトアウトした演算結果は非線形変換処理部１０９で変換処理された後、制御部１０２により、レジスタ群１２０２に記された演算結果格納先ポインタ等の情報に従ってＲＡＭ１０１の所定のアドレスに格納される。 The Load5 signal is a signal for loading the results of the parallel multiply-accumulate unit into the result shift register 108 in parallel. When the multiply-accumulate operation for one parallel processing unit of the target feature plane is finished, the control unit 102 outputs 1 on the Load5 signal and the Enable3 signal. When the Load5 signal is 1 and the Enable3 signal is 1, the result shift register 108 loads the outputs of the parallel multiply-accumulate unit 107 all at once. The control unit 102 enables the Enable3 signal during the shift operation of the shift registers 104 and 105 and shifts out the calculation results held in the result shift register 108. The shifted-out calculation results are converted by the non-linear conversion processing unit 109 and then stored by the control unit 102 at a predetermined address in the RAM 101 according to information such as the calculation result storage destination pointer held in the register group 1202.

本実施形態の演算回路では、ＲＡＭ１０１に対する参照データの読み出し、演算結果の書き出しを積和演算処理期間に並行処理することで、高速に処理することができる。但し、並列度とコンボリューションカーネルの関係によっては、ＲＡＭ１０１へのアクセスを積和演算期間中に完全にパイプライン化できない場合もある。例えば、並列度が高くかつコンボリューションカーネルが小さい場合は、Ｌｏａｄ３による参照データの転送が間に合わない場合がある。その場合、制御部１０２はＲＡＭ１０１へアクセス完了を優先し、Ｅｎａｂｌｅ１／Ｅｎａｂｌｅ２／Ｅｎａｂｌｅ３信号及び累積加算器のＬａｔｃｈＥｎａｂｌｅ信号を制御することで積和演算処理の開始を遅延させる。 In the arithmetic circuit of the present embodiment, reading of reference data to the RAM 101 and writing of the arithmetic result can be processed at high speed by parallel processing during the product-sum calculation processing period. However, depending on the relationship between the degree of parallelism and the convolution kernel, access to the RAM 101 may not be completely pipelined during the multiply-accumulate operation period. For example, if the degree of parallelism is high and the convolution kernel is small, the transfer of reference data by Load 3 may not be in time. In that case, the control unit 102 gives priority to the completion of access to the RAM 101, and delays the start of the product-sum calculation process by controlling the Enable1 / Enable2 / Enable3 signal and the Latch Enable signal of the cumulative adder.

図９は２つの特徴面を並列演算単位で順に処理する場合のタイムチャートである。つまり、図２（Ｃ）に対応する。 FIG. 9 is a time chart in which two feature planes are sequentially processed in parallel arithmetic units. That is, it corresponds to FIG. 2 (C).

ここでは図８との違いのみについて説明する。図９はＳｅｌ信号とＬｏａｄ３信号が図８と異なる。図９は２つの特徴面をカーネル演算単位で切り替えながら処理する場合の例を示している。特徴面の処理開始に先立ち、制御部１０２は係数データバッファ１０３１及び係数データバッファ１０３２にそれぞれ特徴面の演算に必要な重み係数を格納する。また、参照データバッファ１０５にはカーネル垂直方向演算区間１（区間５０２）及びカーネル垂直方向演算区間２（区間５０３）で共通に使用する参照データが既にロードされているものとする。 Only the differences from FIG. 8 are described here. FIG. 9 differs from FIG. 8 in the Sel signal and the Load3 signal. FIG. 9 shows an example in which two feature planes are processed while switching between them in kernel operation units. Before processing of the feature planes starts, the control unit 102 stores the weighting coefficients needed to calculate each feature plane in the coefficient data buffer 1031 and the coefficient data buffer 1032, respectively. It is also assumed that the reference data buffer 105 is already loaded with the reference data used in common by the kernel vertical calculation section 1 (section 502) and the kernel vertical calculation section 2 (section 503).

カーネル垂直方向演算区間１（区間５０２）ではｓｅｌ＝０ｘ００で選択される係数データを用いて並列積和演算器１０７でコンボリューション演算実行される。一方カーネル垂直方向演算区間２（区間５０３）ではｓｅｌ＝０ｘ０１で選択される係数データを用いてコンボリューション演算が実行される。この２つの区間では、参照データバッファ１０５に格納済みの共通の参照データが参照され、参照データバッファ１０５が出力する参照データは図８の場合と同様に図示しないタイミング信号に従って水平方向の処理単位で必要な参照データを出力する。その際、カーネル垂直方向演算区間１（区間５０２）とカーネル垂直方向演算区間２（区間５０３）では水平方向の処理単位で同じ参照データが繰り返し出力する。このため、カーネル垂直方向演算区間３（区間５０４）と非図示のカーネル垂直方向演算区間４で共通に使用する参照データのロードに許される時間は区間５０５となり、図８のケース（区間４０５に対応）に対して２倍の時間となる。 In the kernel vertical calculation section 1 (section 502), the parallel multiply-accumulate unit 107 executes the convolution operation using the coefficient data selected by sel = 0x00. In the kernel vertical calculation section 2 (section 503), the convolution operation is executed using the coefficient data selected by sel = 0x01. Both sections refer to the common reference data already stored in the reference data buffer 105, which, as in FIG. 8, outputs the necessary reference data in horizontal processing units according to a timing signal (not shown). In doing so, the same reference data is output repeatedly, per horizontal processing unit, in the kernel vertical calculation section 1 (section 502) and the kernel vertical calculation section 2 (section 503). Consequently, the time available for loading the reference data used in common by the kernel vertical calculation section 3 (section 504) and the kernel vertical calculation section 4 (not shown) is the section 505, which is twice as long as in the case of FIG. 8 (where it corresponds to the section 405).

図９の動作では参照データを共有し、係数データを入れ替えて順次処理することで図２（ｃ）の特徴面２０８における算出領域２０７及び特徴面２０９における算出領域２１１のデータを順次に算出する。更に、カーネル垂直方向演算区間３（区間５０４）では再び係数を入れ替えて特徴面２０８の算出領域２１０のデータを算出する。この様に参照データを再利用しながら、係数を入れ替えることで算出する特徴面の処理順を制御する。 In the operation of FIG. 9, the reference data is shared, the coefficient data is exchanged, and the data of the calculation area 207 on the feature surface 208 and the calculation area 211 on the feature surface 209 of FIG. 2C are sequentially calculated. Further, in the kernel vertical direction calculation section 3 (section 504), the coefficients are exchanged again to calculate the data in the calculation area 210 of the feature plane 208. While reusing the reference data in this way, the processing order of the feature planes calculated by exchanging the coefficients is controlled.

図８と比べて明らかな様に、図９の場合、２つのカーネル垂直方向演算区間（区間５０２、区間５０３）で参照データを共有することで、参照データバッファへのデータ転送回数（＝転送レート）を半減することが可能になる。これにより、参照データの転送に時間を要する場合、或いはカーネルサイズが小さく、カーネル演算区間が短い場合に、データ転送が処理時間を律するケースを低減することができる。 As is clear from FIG. 8, in the case of FIG. 9, by sharing the reference data between the two kernel vertical calculation sections (section 502, section 503), the number of data transfers to the reference data buffer (= transfer rate). ) Can be halved. As a result, it is possible to reduce the case where the data transfer regulates the processing time when the transfer of the reference data takes a long time, or when the kernel size is small and the kernel calculation interval is short.

例えば、並列積和演算器１０７の並列度を２０、カーネルサイズを５とし、並列演算器は１サイクルで一つの重み係数に対する積和演算を完了するものとする。また重み係数が１バイトであり、データ転送サイクルが４バイト／サイクルであるとすると、一つのコンボリューション処理あたりの処理サイクルは図８の動作モードでは以下の様になる。 For example, it is assumed that the parallel degree of the parallel product-sum calculation unit 107 is 20 and the kernel size is 5, and the parallel calculation unit completes the product-sum operation for one weighting coefficient in one cycle. Assuming that the weighting coefficient is 1 byte and the data transfer cycle is 4 bytes / cycle, the processing cycle per convolution processing is as follows in the operation mode of FIG.

並列演算処理単位の処理サイクルは５×５＝２５サイクルである。 The processing cycle of the parallel computing unit is 5 × 5 = 25 cycles.

並列演算処理単位の演算に必要な参照データの転送に要する処理サイクルは（２０＋５−１）×５／４＝３０サイクル。 The number of cycles required to transfer the reference data needed for one parallel processing unit is (20 + 5 − 1) × 5 / 4 = 30 cycles.

この場合、データ転送が処理時間を律することになり、並列演算器の性能を十分活かしていない。 In this case, the data transfer controls the processing time, and the performance of the parallel computing unit is not fully utilized.

一方、図９の動作モードでは、参照データを共有しているので、その処理サイクルは以下のようになる。 On the other hand, in the operation mode of FIG. 9, since the reference data is shared, the processing cycle is as follows.

二つの並列演算処理単位の処理サイクル２５×２＝５０サイクルである。 The processing cycle of two parallel arithmetic processing units is 25 × 2 = 50 cycles.

並列演算処理単位の演算に必要な参照データの転送に要する処理サイクルは３０サイクルとなり演算処理が処理時間を律し、並列演算器の性能を活かしていることになる。 The processing cycle required for transferring the reference data required for the operation of the parallel arithmetic processing unit is 30 cycles, and the arithmetic processing regulates the processing time, and the performance of the parallel arithmetic unit is utilized.
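
The cycle counts in this example can be reproduced with a short calculation. The Python sketch below is an illustration under the stated conditions (one multiply-accumulate per coefficient per cycle, 1-byte data, 4 bytes transferred per cycle) and additionally assumes that transfer and computation overlap perfectly, so that the slower of the two sets the time. The function and parameter names are hypothetical.

```python
import math


def unit_cycles(parallelism, kernel_w, kernel_h, planes_sharing=1,
                bytes_per_datum=1, bytes_per_cycle=4):
    """Return (compute_cycles, transfer_cycles, bottleneck) for one load of
    reference data shared by `planes_sharing` feature planes."""
    # One cycle per weighting coefficient, repeated for each sharing plane.
    compute = kernel_w * kernel_h * planes_sharing
    # Reference data for one parallel processing unit, transferred once.
    data_bytes = (parallelism + kernel_w - 1) * kernel_h * bytes_per_datum
    transfer = math.ceil(data_bytes / bytes_per_cycle)
    bottleneck = "transfer" if transfer > compute else "compute"
    return compute, transfer, bottleneck


print(unit_cycles(20, 5, 5, planes_sharing=1))  # (25, 30, 'transfer'), FIG. 8 mode
print(unit_cycles(20, 5, 5, planes_sharing=2))  # (50, 30, 'compute'), FIG. 9 mode
```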

図１０は本実施形態の演算回路１００２を具備した画像処理装置の構成を示すものである。この画像処理装置は、入力画像データからパターン認識処理によって特定の物体を検出する機能を有する。図１０の画像入力モジュール１０００は、光学系、ＣＣＤ又はＣＭＯＳセンサー等の光電変換デバイス及びセンサーを制御するドライバー回路／ＡＤコンバーター／各種画像補正を司る信号処理回路／フレームバッファ等により構成される。 FIG. 10 shows the configuration of an image processing apparatus including the arithmetic circuit 1002 of the present embodiment. This image processing device has a function of detecting a specific object from input image data by pattern recognition processing. The image input module 1000 of FIG. 10 is composed of an optical system, a photoelectric conversion device such as a CCD or CMOS sensor, a driver circuit for controlling the sensor, an AD converter, a signal processing circuit for controlling various image corrections, a frame buffer, and the like.

ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）１０１は、演算回路１００２の演算作業バッファとして使用されるメモリである。ＲＡＭ１０１にはＣＮＮの特徴面に相当するデータが記憶される。 The RAM (Random Access Memory) 101 is a memory used as a calculation work buffer of the calculation circuit 1002. Data corresponding to the characteristic surface of the CNN is stored in the RAM 101.

ＤＭＡＣ（ＤｉｒｅｃｔＭｅｍｏｒｙＡｃｃｅｓｓＣｏｎｔｒｏｌｌｅｒ）１００６は、画像バス１００３上の各処理部とＣＰＵバス１０１０間のデータ転送を司る。ブリッジ１００４は、画像バス１００３とＣＰＵバス１０１０のブリッジ機能を提供する。 The DMAC (Direct Memory Access Controller) 1006 controls data transfer between each processing unit on the image bus 1003 and the CPU bus 1010. The bridge 1004 provides a bridge function between the image bus 1003 and the CPU bus 1010.

前処理モジュール１００５は、ＣＮＮ処理によるパターン認識処理を効果的に行うための各種前処理を行う。前処理モジュール１００５は、色変換処理／コントラスト補正処理等の画像データ変換処理を処理するハードウェアである。 The pre-processing module 1005 performs various pre-processing for effectively performing the pattern recognition processing by the CNN processing. The preprocessing module 1005 is hardware that processes image data conversion processing such as color conversion processing / contrast correction processing.

ＣＰＵ１００７は、制御プログラムを実行することによって、装置全体の動作を制御するものである。ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）１００８は、ＣＰＵ１００７の動作を規定する命令やパラメータデータを格納する。ＲＡＭ１００９はＣＰＵ１００７の動作に必要なメモリである。ＣＰＵ１００７はブリッジ１００４を介して画像バス１００３上のＲＡＭ１０１にアクセスすることも可能である。 The CPU 1007 controls the operation of the entire device by executing a control program. The ROM (Read Only Memory) 1008 stores instructions and parameter data that define the operation of the CPU 1007. The RAM 1009 is a memory required for the operation of the CPU 1007. The CPU 1007 can also access the RAM 101 on the image bus 1003 via the bridge 1004.

図１１は本実施形態の画像処理装置の動作を説明するフローチャートである。以下、フローチャートは、ＣＰＵ１００７が制御プログラムを実行することにより実現されるものとする。ステップＳ１１０１では認識処理の開始に先立ち、ＣＰＵ１００７が各種初期化処理を実行する。ＣＰＵ１００７は、演算回路の動作に必要な重み係数をＲＯＭ１００８からＲＡＭ１０１に転送すると共に、演算回路１００２の動作、即ちＣＮＮのネットワーク構成を定義する為の各種レジスタ設定を行う。具体的には、ＣＰＵ１００７は、演算回路１００２の制御部１０２に存在する複数のレジスタに所定の値を設定する。同様に、ＣＰＵ１００７は、前処理モジュール１００５等のレジスタに対しても動作に必要な値を書き込む。 FIG. 11 is a flowchart illustrating the operation of the image processing apparatus of the present embodiment. Hereinafter, it is assumed that the flowchart is realized by the CPU 1007 executing the control program. In step S1101, the CPU 1007 executes various initialization processes prior to the start of the recognition process. The CPU 1007 transfers the weighting coefficient necessary for the operation of the arithmetic circuit from the ROM 1008 to the RAM 101, and sets various registers for defining the operation of the arithmetic circuit 1002, that is, the network configuration of the CNN. Specifically, the CPU 1007 sets predetermined values in a plurality of registers existing in the control unit 102 of the arithmetic circuit 1002. Similarly, the CPU 1007 writes a value necessary for operation to a register such as the preprocessing module 1005.

次に、ステップＳ１１０２で各特徴面を算出する際の処理順を決定する。 Next, in step S1102, the processing order for calculating each feature plane is determined.

図８と図９等で説明したように、ＣＮＮのネットワーク構造やＲＡＭ１０１から演算器に対するデータ転送性能、並列に動作する演算器の数等の条件に従って特徴面の処理順を決定する。例えば、下位階層の特徴面に対して複数の特徴面を算出する場合、転送サイクルと演算サイクルに基づいて処理順を決定する。転送サイクルは複数の算出特徴面の位置を並列に処理するコンボリューション演算に必要な参照データの読み出しに必要なサイクル（転送時間）であり、演算サイクルはコンボリューション演算に要する処理サイクルである。転送サイクルと演算サイクルに基づいて、処理順を決定する。 As described with reference to FIGS. 8 and 9, the processing order of the feature planes is determined according to conditions such as the network structure of the CNN, the data transfer performance from the RAM 101 to the arithmetic units, and the number of arithmetic units operating in parallel. For example, when calculating a plurality of feature planes with respect to the feature planes of the lower hierarchy, the processing order is determined based on the transfer cycle and the calculation cycle. The transfer cycle is a cycle (transfer time) required for reading reference data required for a convolution operation that processes the positions of a plurality of calculated feature planes in parallel, and the operation cycle is a processing cycle required for the convolution operation. The processing order is determined based on the transfer cycle and the calculation cycle.

即ち、ステップＳ１１０２で、動作条件に基づいて各特徴面を面順次で特徴面毎に順次処理するか、或いは特徴面を跨いで演算器の処理単位で順次処理するかを決定する。 That is, in step S1102, it is determined whether to sequentially process each feature surface for each feature surface in a surface sequence based on the operating conditions, or to sequentially process each feature surface in a processing unit of an arithmetic unit across the feature surfaces.
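
The decision in step S1102 could be sketched as follows. The specification states only that the processing order is determined from the transfer cycle and the operation cycle; the concrete rule below (interleave across feature planes when the transfer takes longer than the computation and more than one feature plane shares the reference data) is an assumption made for illustration, and the names are hypothetical.

```python
def choose_processing_order(transfer_cycles, compute_cycles, num_output_planes):
    """Pick a processing order for feature planes that share reference data.

    Assumed rule (not stated in the specification): interleaving across
    feature planes pays off only when data transfer is the bottleneck and
    there is more than one feature plane to share the reference data.
    """
    if num_output_planes > 1 and transfer_cycles > compute_cycles:
        return "interleaved"       # across feature planes, as in FIG. 9
    return "plane_sequential"      # one feature plane at a time, as in FIG. 8


# Values from the 20-parallel, 5 x 5 kernel example: 30 transfer vs 25 compute cycles.
print(choose_processing_order(30, 25, num_output_planes=2))  # interleaved
print(choose_processing_order(10, 25, num_output_planes=2))  # plane_sequential
```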

ステップＳ１１０１の初期化処理及びステップＳ１１０２の処理順決定が終了した後に、一連の物体認識動作が開始する。 After the initialization process of step S1101 and the process order determination of step S1102 are completed, a series of object recognition operations are started.

まず、ステップＳ１１０３では画像入力モジュール１０００が、画像センサーの出力する信号をディジタルデータに変換し、フレーム単位で図示しない（画像入力部１０００に内蔵する）フレームバッファに格納する。 First, in step S1103, the image input module 1000 converts the signal output by the image sensor into digital data and stores it in a frame buffer (built in the image input unit 1000) for each frame.

フレームバッファへの格納が完了すると、ステップＳ１１０４では、所定の信号に基づいて、前処理モジュール１００５が画像変換処理を開始する。前処理モジュール１００５は、前記フレームバッファ上の画像データから輝度データを抽出し、コントラスト補正処理を行う。 When the storage in the frame buffer is completed, in step S1104, the preprocessing module 1005 starts the image conversion process based on the predetermined signal. The preprocessing module 1005 extracts luminance data from the image data on the frame buffer and performs contrast correction processing.

輝度データの抽出は一般的な線形変換処理によりＲＧＢ画像データから輝度データを生成する。コントラスト補正の手法も一般的に知られているコントラスト補正処理を適用してコントラストを強調する。前処理モジュール１００５は、コントラスト補正処理後の輝度データを検出用画像としてＲＡＭ１０１に格納する。 The luminance data is extracted by generating the luminance data from the RGB image data by a general linear conversion process. The contrast correction method also applies a generally known contrast correction process to enhance the contrast. The preprocessing module 1005 stores the luminance data after the contrast correction processing in the RAM 101 as a detection image.
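A minimal sketch of the preprocessing in step S1104 is given below. The description does not name the linear transform or the contrast correction method, so the ITU-R BT.601 luminance weights and the simple min/max contrast stretch used here are assumptions, as are the function names.

```python
def rgb_to_luminance(pixels):
    """Convert (R, G, B) tuples to luminance with a linear transform.

    The BT.601 weights are an assumption; the description only says
    "a general linear conversion process".
    """
    return [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]


def stretch_contrast(luminance, low=0.0, high=255.0):
    """Enhance contrast by stretching values to the range [low, high].

    A min/max stretch is one commonly known method, chosen for illustration.
    """
    smallest, largest = min(luminance), max(luminance)
    if largest == smallest:
        # A flat image has no contrast to stretch.
        return [low] * len(luminance)
    scale = (high - low) / (largest - smallest)
    return [low + (value - smallest) * scale for value in luminance]
```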

１フレームの画像データに対して前処理が完了すると、前処理モジュール１００５は図示しない完了信号を有効にする。ステップＳ１１０５では、演算回路１００２は当該完了信号に基づいて演算回路１００２を起動し、ＣＮＮに基づく物体の検出処理を開始する。ステップＳ１１０６では、最終層の特徴面の算出が終了すると演算回路１００２はＣＰＵ１００７に対して完了割り込みを発生する。ステップＳ１１０７では、ＣＰＵ１００７は演算回路１００２の処理終了割り込を受信すると、最終層の特徴面を解析し、画像中の物体の位置や属性を判定する。ステップＳ１１０７の解析処理が完了すると、ステップＳ１１０８では、次のフレームの画像に対する処理を継続する。 When preprocessing of one frame of image data is completed, the preprocessing module 1005 asserts a completion signal (not shown). In step S1105, the arithmetic circuit 1002 is activated based on the completion signal and starts the CNN-based object detection process. In step S1106, when the calculation of the feature planes of the final layer is completed, the arithmetic circuit 1002 issues a completion interrupt to the CPU 1007. In step S1107, on receiving the processing-end interrupt from the arithmetic circuit 1002, the CPU 1007 analyzes the feature planes of the final layer and determines the position and attributes of objects in the image. When the analysis of step S1107 is completed, processing continues with the image of the next frame in step S1108.

以上、本実施形態では、並列積和演算器１０７に供給する参照データと係数データを動作条件に応じて制御することで処理する特徴面の順番をコンボリューションカーネル単位で変える。これにより、参照データの再利用性を高め、メモリアクセスボトルネックを解消することができる。 As described above, in the present embodiment, the reference data and the coefficient data supplied to the parallel multiply-accumulate calculator 107 are controlled according to the operating conditions, so that the order in which the feature planes are processed is changed in units of convolution kernels. As a result, the reusability of the reference data can be improved and the memory access bottleneck can be eliminated.

本実施形態によれば、ＣＮＮネットワークの構成や参照データの転送サイクル及び演算サイクルに基づいて特徴面の処理順を制御することで、簡単な制御で、様々なネットワークを効率的に処理することができる。 According to the present embodiment, by controlling the processing order of the feature planes based on the configuration of the CNN network, the transfer cycle of the reference data, and the calculation cycle, various networks can be processed efficiently with simple control.

なお、本実施形態では２つ特徴面を跨ぐ処理順で処理する場合について説明したが、これに限るわけではなく、更に多くの特徴面を切り替えながら処理する構成でも良い。 In the present embodiment, the case where the processing is performed in the processing order straddling the two characteristic surfaces has been described, but the present invention is not limited to this, and a configuration in which processing is performed while switching more characteristic surfaces may be used.

また、本実施形態では係数データバッファを複数有する構成の例を示した。この場合、係数データのロード時間の影響を低減することができるが、この限りではない。 Further, in the present embodiment, an example of a configuration having a plurality of coefficient data buffers is shown. In this case, the influence of the load time of the coefficient data can be reduced, but the configuration is not limited to this.

また、本実施形態では並列演算器で２次元のコンボリューション演算を処理する場合について説明したが、コンボリューション演算に限るわけではない。実施形態では２次元の画像データに対するＣＮＮ処理の例を説明したが、音声データ等の１次元データや時間方向の変化も含めた３次元データに対するＣＮＮ処理に適用することも可能である。 Further, in the present embodiment, the case where the two-dimensional convolution operation is processed by the parallel arithmetic unit has been described, but the present invention is not limited to the convolution operation. In the embodiment, an example of CNN processing for two-dimensional image data has been described, but it can also be applied to CNN processing for one-dimensional data such as audio data and three-dimensional data including changes in the time direction.

また、本実施形態ではＣＮＮ処理の場合について説明したがこれに限るわけではなく、ＲｅｓｔｒｉｃｔｅｄＢｏｌｔｚｍａｎｎＭａｃｈｉｎｅｓやＲｅｃｕｒｓｉｖｅＮｅｕｒａｌＮｅｔｗｏｒｋ等他の階層的な処理に適用可能である。 Further, in the present embodiment, the case of CNN processing has been described, but the present invention is not limited to this, and can be applied to other hierarchical processing such as Restricted Boltzmann Machines and Recursive Neural Network.

また、本実施形態では図５に示すように、水平方向に並ぶ複数の特徴面のデータを並列に処理する場合について説明したが、これに限るわけでなく、垂直方向に連続する特徴面データを並列に処理する構成にしても良い。 Further, in the present embodiment, as shown in FIG. 5, a case where data of a plurality of feature planes arranged in the horizontal direction is processed in parallel has been described, but the present invention is not limited to this, and feature plane data continuous in the vertical direction may be processed in parallel.

また、本実施形態では、複数の積和演算器が並列に処理することについて説明したが、これに限るわけでなく、一つの積和演算器を用いて特徴面データを算出する構成にしても良い。 Further, in the present embodiment, it has been described that a plurality of multiply-accumulate calculators process in parallel, but the present invention is not limited to this, and the feature plane data may be calculated using a single multiply-accumulate calculator.

（第２の実施形態） (Second Embodiment)
実施形態１ではコンボリューションカーネル単位で処理する特徴面を切り替えながら処理する構成について説明したが、本実施形態では積和演算単位で処理する特徴面を切り替えながら処理する構成について説明する。 In the first embodiment, the configuration for processing while switching the feature planes to be processed in the convolution kernel unit has been described, but in the present embodiment, the configuration for processing while switching the feature planes to be processed in the multiply-accumulate operation unit will be described.

図１３は本実施形態の演算回路の構成を示す図である。ここでは第１の実施形態との違いのみについて説明する。図１３は図１の構成に対して積和ステート保持部１１０が新たに追加されている。積和ステート保持部１１０は、積和演算のステートを保持する機能を有する。図１４は積和ステート保持部１１０を含む並列積和演算器１０７の構成と動作を説明する図である。 FIG. 13 is a diagram showing the configuration of the arithmetic circuit of the present embodiment. Here, only the differences from the first embodiment will be described. In FIG. 13, a product-sum state holding unit 110 is newly added to the configuration of FIG. 1. The product-sum state holding unit 110 has a function of holding the state of the product-sum operation. FIG. 14 is a diagram illustrating the configuration and operation of the parallel product-sum calculator 107 including the product-sum state holding unit 110.

図１４（Ａ）に示すように３つの特徴面１４０５〜１４０７を演算処理単位で順に処理する場合について説明する。フィルタカーネル１４０２〜１４０４はそれぞれ特徴面１４０５〜１４０７を算出する際に必要となるコンボリューションカーネルマトリクスである。 As shown in FIG. 14A, a case where the three feature planes 1405 to 1407 are sequentially processed in arithmetic processing units will be described. The filter kernels 1402 to 1404 are convolution kernel matrices required for calculating the feature planes 1405 to 1407, respectively.

図１４（Ｂ）は係数データシフトレジスタ１４０８、参照データシフトレジスタ１４０９、並列積和演算器１４１０、積和ステート保持部１１０の例を説明する図である。係数データシフトレジスタ１４０８は、図１の係数データシフトレジスタ１０４と同様に、重み係数を積和演算器１４１０に順に供給する。参照データシフトレジスタ１４０９は、図１の参照データシフトレジスタ１０６と同様に、参照データを積和演算器１４１０に供給する。積和演算器１４１０は、ここでは並列積和演算器の中の一つの積和演算器を示している。積和演算器１４１０は乗算器と加算器からなる。積和ステート保持部１４１１は、複数の積和ステート保持部の中の一つの積和ステート保持部を示している。累積和シフトレジスタ１４１２は、３つのシフトレジスタからなる。セレクタ１４１３は、累積和シフトレジスタ１４１２の出力のいずれかを選択する。 FIG. 14B is a diagram illustrating an example of a coefficient data shift register 1408, a reference data shift register 1409, a parallel product-sum calculator 1410, and a product-sum state holding unit 110. The coefficient data shift register 1408 supplies the weighting coefficients to the product-sum calculator 1410 in order, in the same manner as the coefficient data shift register 104 of FIG. 1. The reference data shift register 1409 supplies the reference data to the product-sum calculator 1410 in the same manner as the reference data shift register 106 of FIG. 1. The product-sum calculator 1410 here represents one of the product-sum calculators in the parallel product-sum calculator. The product-sum calculator 1410 consists of a multiplier and an adder. The product-sum state holding unit 1411 represents one of the plurality of product-sum state holding units. The cumulative sum shift register 1412 consists of three shift registers. The selector 1413 selects one of the outputs of the cumulative sum shift register 1412.

図１４（Ｃ）は、図１４（Ｂ）に示す構成の動作を説明する図である。ここでは３つの特徴面１４０５〜１４０７を積和演算単位に順に処理する場合について説明する。係数データシフトレジスタ１４０８には係数データ１４０２〜１４０４が係数毎にインターリーブした順番で格納する。ここではカーネルサイズが３×３のフィルタカーネルについて説明する。データ列Ａ１〜Ａ３、データ列Ｂ１〜Ｂ３及びデータ列Ｃ１〜Ｃ３はそれぞれフィルタカーネル１４０２〜１４０４の一つのデータ列であるとする。この場合、セレクタ１４１３は、累積和シフトレジスタ１４１２のＭＡ３出力をフィードバックするように設定する。 FIG. 14C is a diagram illustrating the operation of the configuration shown in FIG. 14B. Here, a case where the three feature planes 1405 to 1407 are sequentially processed in the multiply-accumulate operation unit will be described. The coefficient data shift register 1408 stores the coefficient data 1402 to 1404 in the order of interleaving for each coefficient. Here, a filter kernel having a kernel size of 3 × 3 will be described. It is assumed that the data strings A1 to A3, the data strings B1 to B3, and the data strings C1 to C3 are each one data string of the filter kernels 1402 to 1404. In this case, the selector 1413 is set to feed back the MA3 output of the cumulative sum shift register 1412.

係数データシフトレジスタの出力１４１４は、異なるカーネルの係数がインターリーブされた順番で順に出力する。第１の実施形態ではＳｅｌ信号を制御して演算に使用するカーネルの係数を選択する必要があるが、本実施形態の場合は、その必要はない。係数データバッファ１０３１〜１０３ｎに係数データを格納する際に異なるカーネルの係数データをインターリーブした順番で格納するだけで良い。制御部１０２は、係数データバッファ１０３１への係数データ格納時に、係数データをインターリーブした状態で格納しておく。 The output 1414 of the coefficient data shift register delivers the coefficients of the different kernels one after another in interleaved order. In the first embodiment, it is necessary to control the Sel signal to select the kernel coefficients used for the calculation, but in the present embodiment this is not necessary. When storing the coefficient data in the coefficient data buffers 1031 to 103n, it is only necessary to store the coefficient data of the different kernels in interleaved order. The control unit 102 stores the coefficient data in an interleaved state when writing the coefficient data to the coefficient data buffer 1031.
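The interleaved storage order described above can be sketched as follows. The function name interleave_kernels and the use of flat Python lists to stand in for the coefficient data buffer are assumptions made for illustration.

```python
def interleave_kernels(kernels):
    """Interleave the coefficients of several kernels, coefficient by coefficient.

    kernels: list of equal-length coefficient sequences, one per kernel
             (for example, one data row of each 3x3 filter kernel).
    Returns a single list in the order A1, B1, C1, A2, B2, C2, ...
    """
    length = len(kernels[0])
    if any(len(kernel) != length for kernel in kernels):
        raise ValueError("all kernels must have the same number of coefficients")
    return [kernel[i] for i in range(length) for kernel in kernels]
```

With the data rows A1 to A3, B1 to B3, and C1 to C3, the result is A1, B1, C1, A2, B2, C2, A3, B3, C3, so no per-cycle kernel selection signal is needed when the stream is shifted out.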

参照データシフトレジスタの出力１４１５は、参照データを順にシフト出力する。ここでは、参照データシフトレジスタのシフトクロックは、係数データシフトレジスタのシフトクロックの１／３となる。最初の３クロックで参照データＤ１に対する３つの異なるフィルタカーネルの積和演算を処理し、乗算器出力１４０６は、累積加算出力１４１７と同じである。その次に、順次に累積和シフトレジスタの出力１４１８〜１４２０が図１４（Ｃ）のように変化する。累積和シフトレジスタ１４１２のＭＡ３を積和演算器１４１０に帰還することで３つのステートの積和演算結果を保持することが可能になる。図１４（Ｃ）において点線で示す矢印はコンボリューションカーネル１４０２に対する積和演算の状態を示すものである。累積和シフトレジスタ１４１２を介した積和演算ループにより、９サイクル後に累積和シフトレジスタ１４１２の出力にコンボリューション演算結果が出力されている（出力１４２０）。同様に、コンボリューションカーネル１４０３、１４０４に対するコンボリューション演算結果が累積和シフトレジスタ１４１２の出力として順次に出力する。 The output 1415 of the reference data shift register shifts and outputs the reference data in order. Here, the shift clock of the reference data shift register is 1/3 of the shift clock of the coefficient data shift register. The first three clocks process the multiply-accumulate operation of three different filter kernels on the reference data D1, and the multiplier output 1406 is the same as the cumulative add output 1417. Next, the outputs 1418 to 1420 of the cumulative sum shift register are sequentially changed as shown in FIG. 14 (C). By feeding back MA3 of the cumulative sum shift register 1412 to the multiply-accumulate calculator 1410, it is possible to hold the product-sum calculation results of the three states. The arrow indicated by the dotted line in FIG. 14C indicates the state of the product-sum operation with respect to the convolution kernel 1402. The product-sum calculation loop via the cumulative sum shift register 1412 outputs the convolution calculation result to the output of the cumulative sum shift register 1412 after 9 cycles (output 1420). Similarly, the convolution calculation results for the convolution kernels 1403 and 1404 are sequentially output as the output of the cumulative sum shift register 1412.
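The feedback loop through the cumulative sum shift register can be modeled in software as in the sketch below. It is a behavioral illustration for one data row of each kernel, not a description of the actual circuit: the function name interleaved_mac, the zero-initialized register stages, and the sample values in the usage note are assumptions.

```python
from collections import deque


def interleaved_mac(reference, kernels):
    """Simulate one multiply-accumulate unit shared by several kernels.

    reference: reference data D1, D2, ... (one value per kernel tap)
    kernels:   list of n coefficient rows, each as long as `reference`
    Returns the n accumulated sums, one per kernel, in kernel order.
    """
    n = len(kernels)
    taps = len(kernels[0])
    # Coefficient stream in interleaved order: A1, B1, C1, A2, B2, C2, ...
    coeff_stream = [kernel[i] for i in range(taps) for kernel in kernels]
    # stages[0] models MA1 and stages[-1] models MAn (the fed-back output).
    stages = deque([0] * n)
    for clock, coeff in enumerate(coeff_stream):
        # The reference data shifts once every n coefficient clocks.
        ref = reference[clock // n]
        # The last stage is fed back to the adder, so each kernel meets
        # its own partial sum again exactly n clocks later.
        feedback = stages.pop()
        stages.appendleft(feedback + coeff * ref)
    # After taps * n clocks the oldest entry belongs to the first kernel.
    return list(reversed(stages))
```

For example, interleaved_mac([1, 2, 3], [[1, 1, 1], [1, 0, -1], [2, 2, 2]]) accumulates the three rows over 9 clocks while each reference value is used three times in a row. Passing two kernels instead of three models the case where the feedback is taken from the second stage and the reference data shifts at half the coefficient clock.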

図１２は本実施形態の動作タイミング例を示す図である。基本的な動作は第１の実施形態と同じである。本実施形態ではＬｏａｄ３信号の有効期間で、並列度水平方向とカーネルサイズに応じたデータをＲＡＭ１０１から読み出す。また、カーネル水平方向演算区間で３つの異なる特徴面の行単位の積和演算を算出し、カーネル垂直方向演算区間で３つの特徴面のコンボリューション演算を実行する。演算結果は非線形変換処理部１０９を通して、３つの特徴面の結果がインターリーブされた順番で出力される。 FIG. 12 is a diagram showing an example of the operation timing of the present embodiment. The basic operation is the same as that of the first embodiment. In the present embodiment, data corresponding to the degree of parallelism in the horizontal direction and the kernel size is read from the RAM 101 during the valid period of the Load3 signal. In the kernel horizontal direction calculation section, the row-wise product-sum operations for the three different feature planes are calculated, and in the kernel vertical direction calculation section, the convolution operations for the three feature planes are executed. The calculation results are output through the nonlinear conversion processing unit 109 in an order in which the results of the three feature planes are interleaved.

このように本実施形態では、積和演算単位で異なるコンボリューション演算を順次処理する。このため、参照データを積和演算単位で共有し、再利用することができる。即ち、３つの特徴面を算出するに際してＲＡＭ１０１から係数データシフトレジスタに転送する参照データの回数は１回で良い。従って、第１の実施形態の場合と同様にＲＡＭ１０１から係数データバッファ１０３１〜ｎへのデータ転送が処理時間を律する可能性を低減することができる。 As described above, in the present embodiment, different convolution operations are sequentially processed in the product-sum operation unit. Therefore, the reference data can be shared and reused in the product-sum operation unit. That is, the number of reference data to be transferred from the RAM 101 to the coefficient data shift register when calculating the three characteristic planes may be one. Therefore, as in the case of the first embodiment, it is possible to reduce the possibility that the data transfer from the RAM 101 to the coefficient data buffers 1031 to n limits the processing time.

また、第１の実施形態ではコンボリューションカーネル演算単位で特徴面を跨いで処理するため、参照データバッファに格納する参照データとして、カーネル演算に必要なサイズの参照データが必要になる。一方、本実施形態では、積和演算単位で特徴面を跨いで処理するため、積和演算単位に必要なサイズの参照データでよい。即ち、「並列に処理する演算器の数」＋「並列処理する方向と同じ方向のコンボリューションカーネルサイズ」−１だけで良い。 Further, in the first embodiment, since the convolution kernel operation unit is processed across the feature planes, the reference data of the size required for the kernel operation is required as the reference data to be stored in the reference data buffer. On the other hand, in the present embodiment, since the processing is performed across the feature planes in the product-sum calculation unit, the reference data of the size required for the product-sum calculation unit may be used. That is, only "the number of arithmetic units to be processed in parallel" + "convolution kernel size in the same direction as the direction of parallel processing" -1 is sufficient.
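The buffer size stated above can be written as a one-line calculation. The function and parameter names are assumptions for illustration, and the figures in the usage note are example values, not values taken from the embodiment.

```python
def reference_data_needed(num_parallel_units, kernel_size_along_parallel_axis):
    """Number of reference data needed per product-sum unit of work.

    Adjacent output positions share all but one input, so each extra
    parallel unit adds only one more reference datum.
    """
    return num_parallel_units + kernel_size_along_parallel_axis - 1
```

For example, with 8 arithmetic units working in parallel along the horizontal direction and a kernel that is 3 wide, 8 + 3 - 1 = 10 reference data suffice.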

更に、本実施形態では、特徴面を跨いで処理する特徴面の数を変更する場合、係数データシフトレジスタ１４０８に設定するデータとセレクタ１４１３の設定及び参照データシフトレジスタ１４０９のシフトクロックを修正するだけで良い。例えば、２つの特徴面を積和演算単位で跨いで処理する場合、係数データシフトレジスタには係数データＡｎ１４０２とＢｎ１４０３をインターリーブして格納する。参照データシフトレジスタのシフトクロックは１／２倍とし、累積和シフトレジスタ１４１２のＭＡ２出力を積和演算器１４１０の加算器に帰還するように設定する。このように簡単な構成の追加で特徴面の処理順に関する自由度を高めることができる。 Further, in the present embodiment, when changing the number of feature planes processed in an interleaved manner, it is only necessary to modify the data set in the coefficient data shift register 1408, the setting of the selector 1413, and the shift clock of the reference data shift register 1409. For example, when two feature planes are processed alternately in units of product-sum operations, the coefficient data An (1402) and Bn (1403) are interleaved and stored in the coefficient data shift register. The shift clock of the reference data shift register is set to 1/2 of the coefficient data shift clock, and the MA2 output of the cumulative sum shift register 1412 is fed back to the adder of the product-sum calculator 1410. With the addition of such a simple configuration, the degree of freedom regarding the processing order of the feature planes can be increased.

第２の実施形態では算出る特徴面の数が１〜３の場合に処理順を制御する場合について説明したがこれに限るわけではない。累積和シフトレジスタ１４１２の数を増やすことでより多くの算出特徴面に対して処理順を制御することができる。 In the second embodiment, the case where the processing order is controlled when the number of feature planes to be calculated is 1 to 3 has been described, but the present invention is not limited to this. By increasing the number of cumulative sum shift registers 1412, the processing order can be controlled for more calculated feature planes.

（その他の実施形態） (Other embodiments)
本発明は、上述の実施形態の１以上の機能を実現するプログラムを、ネットワーク又は記憶媒体を介してシステム又は装置に供給し、そのシステム又は装置のコンピュータにおける１つ以上のプロセッサがプログラムを読出し実行する処理でも実現可能である。また、１以上の機能を実現する回路（例えば、ＡＳＩＣ）によっても実現可能である。 The present invention can also be realized by a process in which a program that realizes one or more functions of the above-described embodiments is supplied to a system or device via a network or storage medium, and one or more processors in a computer of the system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that realizes one or more functions.

１０１ＲＡＭ 101 RAM
１０２制御部 102 Control unit
１０３係数データバッファ 103 Coefficient data buffer
１０４係数データシフトレジスタ 104 Coefficient data shift register
１０５参照データバッファ 105 Reference data buffer
１０６参照データシフトレジスタ 106 Reference data shift register
１０７並列積和演算器 107 Parallel product-sum calculator
１０８結果シフトレジスタ 108 Result shift register
１０９非線形変換処理部 109 Non-linear conversion processing unit

Claims

It is an arithmetic circuit that connects to a storage device that stores reference data and coefficient data of filter calculation and calculates a plurality of calculated feature planes by filter calculation for one reference feature plane.
At least one arithmetic unit that executes the filter operation of the reference data and the coefficient data, and
A reference data holding means for holding a plurality of reference data transferred from the storage device, and
A coefficient data holding means for holding a plurality of coefficient data transferred from the storage device, and
A control means for causing each of the at least one arithmetic unit to execute the operation between one reference data among the plurality of reference data held by the reference data holding means and one coefficient data among the plurality of coefficient data held by the coefficient data holding means such that operations between the same one reference data and mutually different coefficient data are executed sequentially, and for executing the calculation of the plurality of calculated feature planes while sequentially switching the calculated feature plane to be calculated for each partial region of a predetermined size within the region of the calculated feature planes.
An arithmetic circuit characterized by having.

The arithmetic circuit according to claim 1, further comprising:
A reference data supply means that loads reference data from the reference data holding means and supplies the reference data to the arithmetic unit, and
A coefficient data supply means that loads coefficient data from the coefficient data holding means and supplies the coefficient data to the arithmetic unit,
wherein, while the reference data supply means supplies the same one reference data to one of the arithmetic units, the coefficient data supply means supplies a plurality of mutually different coefficient data to that arithmetic unit one at a time.

The reference data holding means has at least a first buffer and a second buffer, and the first buffer holds the reference data used for the filter calculation and supplies the held reference data to the calculator. The arithmetic circuit according to claim 1 or 2, wherein the second buffer holds reference data transferred from the storage device.

The arithmetic circuit according to any one of claims 1 to 3, wherein the filter calculation is a convolution calculation of the reference data and the coefficient data.

The arithmetic circuit according to any one of claims 1 to 4, further comprising a shift register that holds the output data of the arithmetic unit and a conversion means that performs non-linear conversion processing on the output data of the shift register.

The arithmetic circuit according to claim 5, wherein the control means stores the output data of the shift register or the output data of the conversion means in the storage device.

It is an arithmetic circuit that connects to a storage device that stores reference data and coefficient data of filter calculation and calculates a plurality of calculated feature planes by filter calculation for one reference feature plane.
At least one arithmetic unit that executes the filter operation of the reference data and the coefficient data, and
A reference data holding means for holding a plurality of reference data transferred from the storage device, and
A coefficient data holding means for holding a plurality of coefficient data transferred from the storage device, and
A reference data supply means that loads reference data from the reference data holding means and supplies the loaded reference data to the arithmetic unit.
A coefficient data supply means that loads the plurality of coefficient data from the coefficient data holding means and supplies the loaded coefficient data to the arithmetic unit.
A control means for causing each of the at least one arithmetic unit to execute the operation between one reference data among the plurality of reference data supplied from the reference data supply means and one coefficient data among the plurality of coefficient data supplied from the coefficient data supply means such that operations between the same one reference data and mutually different coefficient data are executed sequentially, and for executing the calculation of the plurality of calculated feature planes while sequentially switching the calculated feature plane to be calculated for each partial region of a predetermined size within the region of the calculated feature planes.
An arithmetic circuit characterized by having.

The arithmetic circuit according to claim 7, wherein the filter calculation is a convolution calculation of the reference data and the coefficient data.

The arithmetic circuit according to claim 4 or 8, wherein the number of reference data held in the reference data holding means is the number of reference data required for the arithmetic unit to perform one unit of the convolution operation, and is determined based on the number of the arithmetic units and the size of the filter of the filter operation.

The arithmetic circuit according to claim 2 or 7, wherein the reference data supply means and the coefficient data supply means are shift registers having a data load function.

The arithmetic circuit according to any one of claims 1 to 10, wherein the filter arithmetic is an arithmetic represented by a hierarchical connection relationship of a plurality of data groups of a convolutional neural network.

The arithmetic circuit has a plurality of arithmetic units that process the filter operations in parallel, and the control means controls parallel processing by the plurality of arithmetic units based on the hierarchical coupling relationship. The arithmetic circuit according to claim 11.

An image processing apparatus having the arithmetic circuit according to any one of claims 1 to 12 and processing image data as the reference data.

The image processing apparatus according to claim 13, wherein the arithmetic circuit performs arithmetic processing for pattern recognition.

It is a control method of an arithmetic circuit that connects to a storage device that stores reference data and coefficient data of a filter calculation and calculates a plurality of calculated feature planes by a filter calculation for one reference feature plane.
A calculation step in which at least one arithmetic unit executes the filter operation of the reference data of the filter operation and the coefficient data of the filter operation.
A reference data holding step in which the reference data holding means holds a predetermined number of reference data transferred from the storage device, and
A coefficient data holding step in which the coefficient data holding means holds the coefficient data of the first filter and the coefficient data of the second filter transferred from the storage device, and
A control step in which the control means causes each of the at least one arithmetic unit to execute the operation between one reference data among the plurality of reference data held by the reference data holding means and one coefficient data among the plurality of coefficient data held by the coefficient data holding means such that operations between the same one reference data and mutually different coefficient data are executed sequentially, and executes the calculation of the plurality of calculated feature planes while sequentially switching the calculated feature plane to be calculated for each partial region of a predetermined size within the region of the calculated feature planes.
A method characterized by having.

It is a control program of an arithmetic circuit that connects to a storage device that stores reference data and coefficient data of filter calculation and calculates a plurality of calculated feature planes by filter calculation for one reference feature plane.
An operation step of causing at least one arithmetic unit to execute the filter operation of the reference data of the filter operation and the coefficient data of the filter operation.
A reference data holding step of causing the reference data holding means to hold a predetermined number of reference data transferred from the storage device, and
A coefficient data holding step of causing the coefficient data holding means to hold the coefficient data of the first filter and the coefficient data of the second filter transferred from the storage device, and
A control step of causing each of the at least one arithmetic unit to execute the operation between one reference data among the plurality of reference data held by the reference data holding means and one coefficient data among the plurality of coefficient data held by the coefficient data holding means such that operations between the same one reference data and mutually different coefficient data are executed sequentially, and of executing the calculation of the plurality of calculated feature planes while sequentially switching the calculated feature plane to be calculated for each partial region of a predetermined size within the region of the calculated feature planes.
A program characterized by causing a computer to execute the above steps.