JP7795766B2

JP7795766B2 - Neural network circuit device

Info

Publication number: JP7795766B2
Application number: JP2021192336A
Authority: JP
Inventors: 真人本村; 載勲劉
Original assignee: Tokyo Institute of Technology NUC; Institute of Science Tokyo
Current assignee: Tokyo Institute of Technology NUC; Institute of Science Tokyo
Priority date: 2021-11-26
Filing date: 2021-11-26
Publication date: 2026-01-08
Anticipated expiration: 2041-11-26
Also published as: JP2023078975A

Description

The present invention relates to a neural network circuit device.

Conventionally, a neural network such as a CNN (Convolutional Neural Network) is trained by updating (adjusting) the weights of the connections (edges) between nodes (neurons) in the learning stage. In contrast, Non-Patent Document 1 discloses an algorithm for a neural network (HNN: Hidden Network) that does not update the weights even in the learning stage. In the learning stage, the algorithm of Non-Patent Document 1 extracts a subnetwork that is hidden in the neural network and has good inference accuracy. In doing so, the algorithm does not update the randomly set weights. In other words, in the technology of Non-Patent Document 1, the randomly set weights are fixed. Furthermore, in the technology of Non-Patent Document 1, a supermask (sometimes simply referred to as a "mask") that defines whether each connection between nodes is valid or invalid is determined in the learning stage. The model (subnetwork) used for inference is then constructed from the supermask and the fixed weights.

Non-Patent Document 1: Vivek Ramanujan et al., "What's Hidden in a Randomly Weighted Neural Network?", CVPR 2020, https://arxiv.org/abs/1911.13299

Non-Patent Document 1 does not disclose a method for implementing in hardware the HNN model used for inference. Therefore, if the algorithm of Non-Patent Document 1 were simply implemented in hardware as it is, and the amount of model information were enormous, the resulting hardware could have poor processing efficiency, requiring an enormous memory capacity and an enormous number of memory accesses.

An object of the present invention is to provide a neural network circuit device that can perform processing efficiently.

The neural network circuit device according to the present invention realizes a neural network model that is constructed using weights and a mask, wherein, in the learning stage, the weight of each connection between nodes is set by a fixed random number and the mask is generated by determining whether each of the connections is valid or invalid. The neural network circuit device includes: a weight generation circuit that, when inference is performed, generates the weights by a random number generator configured to generate the same random numbers as those generated in the learning stage, using the same seed as that used in the learning stage; and an arithmetic circuit that performs a product-sum operation using the weights, the mask, and input data that is either inference target data or activation value data corresponding to the inference target data.

According to the present invention, it is possible to provide a neural network circuit device that can perform processing efficiently.

FIG. 1 is a diagram for explaining the HNN algorithm.
FIG. 2 is a diagram for explaining the HNN algorithm.
FIG. 3 is a graph showing the relationship between the value of k of the supermask in the HNN algorithm and the inference accuracy.
FIG. 4 is a diagram for explaining selection of a supermask according to the present embodiment.
FIG. 5 is a diagram showing the configuration of the neural network circuit device according to the first embodiment.
FIG. 6 is a diagram for explaining the processing of the weight generation circuit and the mask expansion circuit according to the first embodiment.
FIG. 7 is a diagram showing the configuration of the weight generation circuit according to the first embodiment.
FIG. 8 is a diagram showing the configuration of the mask expansion circuit according to the first embodiment.
FIG. 9 is a diagram for explaining the input data according to the first embodiment.
FIG. 10 is a diagram for explaining the operation of the arithmetic circuit according to the first embodiment.
FIG. 11 is a diagram for explaining the processing of a PE vector according to the first embodiment.
FIG. 12 is a timing chart for the components of the neural network circuit device according to the first embodiment.
FIG. 13 is a diagram showing the configuration of the post-processing circuit according to the first embodiment.
FIG. 14 is a diagram showing results of calculations performed by the neural network circuit device according to the first embodiment.
FIG. 15 is a diagram showing results of calculations performed by the neural network circuit device according to the first embodiment.
FIG. 16 is a diagram showing results of calculations performed by the neural network circuit device according to the first embodiment.
FIG. 17 is a diagram showing results of calculations performed by the neural network circuit device according to the first embodiment.

(Outline of the embodiment)
Before describing the embodiments, an outline of the embodiments of the present invention is given. Although embodiments of the present invention are described below, they do not limit the invention according to the claims. Furthermore, not all combinations of the features described in the embodiments are necessarily essential to the solution provided by the invention.

<About HNN>
FIGS. 1 and 2 are diagrams for explaining the HNN algorithm. The HNN algorithm corresponds to the technology of Non-Patent Document 1. The left side of FIG. 1 illustrates the learning of a model (neural network) according to the HNN algorithm.

In the HNN algorithm, an initial model 10 and a supermask 20 are provided in the learning stage. The initial model 10 is composed of nodes 12 (neurons) and connections 13 between the nodes 12. A weight 14 is set for each connection 13. The weight 14 of each connection is set, for example, by a random number. As described above, in the HNN algorithm the weights 14 are fixed and are not updated by learning.

The supermask 20 defines whether each connection 13 is valid or invalid. The supermask 20 is generated using scores 30, each of which indicates the importance of a connection 13. The scores 30 are determined in the learning stage. In other words, in the HNN algorithm, a score 30 is learned for each of the connections 13. The scores 30 are learned, for example, by backpropagation.

FIG. 2 is a diagram for explaining a method of determining the supermask 20. FIG. 2 shows the distribution of the scores 30. The horizontal axis of FIG. 2 represents the value of the score 30, and the vertical axis represents the number of connections 13 having that score value. The supermask 20 is generated by making connections 13 with high scores 30 valid and connections 13 with low scores 30 invalid. Specifically, the supermask 20 is generated so that the connections 13 corresponding to the top k% of the scores 30 (the hatched portion in FIG. 2) are valid and the remaining connections 13 are invalid. The value of k is set in advance for each supermask 20 to be generated. A supermask 20 can be provided for each of a plurality of values of k.

The thick solid lines in the supermask 20 shown in FIG. 1 correspond to the connections 13 in the top k% of the scores 30, that is, to the connections 13 determined to be valid. The dashed lines in the supermask 20 shown in FIG. 1 correspond to the connections 13 that are not in the top k% of the scores 30 (that is, the connections 13 in the bottom (100-k)% of the scores 30), and therefore to the connections 13 determined to be invalid. In the supermask 20 according to the present embodiment, "1" is assigned to each connection 13 determined to be valid and "0" is assigned to each connection 13 determined to be invalid. Note that the scores 30 (and the supermask 20) are learned based on the value of k corresponding to the supermask 20. Specifically, the scores 30 are learned for each value of k corresponding to a supermask 20 to be generated. For example, when generating a supermask 20 with k = 20%, all the scores 30 are learned (updated) using the top 20% of the scores 30. As described above, the weights 14 are not updated at that time.
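The top-k% rule described above can be sketched in a few lines. This is only an illustration of the selection step, not the learning procedure of Non-Patent Document 1: the function name `topk_supermask`, the example scores, and the choice to round the number of kept connections down are assumptions made for this sketch.

```python
def topk_supermask(scores, k_percent):
    """Return a 0/1 mask that marks the top k% of connections (by score) as valid."""
    # Number of connections to keep, rounded down (an assumption of this sketch).
    n_keep = len(scores) * k_percent // 100
    # Connection indices ordered from highest score to lowest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    for i in ranked[:n_keep]:
        mask[i] = 1  # "1" = valid connection, "0" = invalid connection
    return mask


# Hypothetical scores for ten connections.
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 0.05]
print(topk_supermask(scores, 30))  # [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```

With k = 30%, only the three highest-scoring connections (scores 0.9, 0.8, and 0.7) are marked "1".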

The subnetwork 40 shown on the right side of FIG. 1 is then constructed by combining the initial model 10 and the supermask 20. In other words, a neural network model (the subnetwork 40) is constructed using the connections 13 (weights 14) of the initial model 10 and the supermask 20. Specifically, the subnetwork 40 is constructed by the logical product (AND) of each connection 13 (weight 14) of the initial model 10 and the valid/invalid information (for example, "1" or "0") of that connection 13 indicated in the supermask 20. In the HNN algorithm, inference is performed using this subnetwork 40 (neural network model).

FIG. 3 is a graph showing the relationship between the value of k of the supermask 20 in the HNN algorithm and the inference accuracy. The upper graph in FIG. 3 shows the results of applying the HNN algorithm to ResNet50 and ImageNet. The lower graph in FIG. 3 shows the results of applying the HNN algorithm to VGG16 and CIFAR-100. The horizontal axis of each graph corresponds to the value of k, and the vertical axis corresponds to the inference accuracy. As shown in FIG. 3, for k between 0 and 30%, the inference accuracy can be higher as k increases.

As described above, the supermask 20 according to the present embodiment is binary information of "0" or "1". The supermask 20 generated in the learning stage can therefore be compressed and stored, and the compressed supermask 20 can be read out and expanded in the inference stage. Compressing the supermask 20 reduces the size of the supermask 20 corresponding to the model (trained model) that is read out in the inference stage. In other words, the amount of data transferred when reading the supermask 20 can be reduced. This saves power, so inference processing can be performed more efficiently.

The smaller the value of k, the fewer "1"s there are in the supermask 20 (that is, the fewer connections 13 determined to be "valid") and the more "0"s there are (that is, the more connections 13 determined to be "invalid"). Therefore, when the supermask 20 is compressed by a zero-run-length (ZRL) compression method, a supermask 20 with a smaller value of k has a higher compression rate. Note that a "high compression rate" here means that the ratio of the data size after compression to the data size before compression is small, that is, that the data is compressed well. A supermask 20 with a small value of k has few "1"s and can therefore be said to be sparse. Accordingly, the sparser the supermask 20, the higher the compression rate at which it is stored in its compressed state.
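The effect of sparsity on zero-run-length compression can be illustrated with a toy encoder. The code format below is an assumption made for this sketch, since the text does not specify one: each fixed-width code gives the number of zeros preceding a "1", the all-ones code value stands for a maximal run of zeros with no "1" after it, and a trailing run of zeros is written as an ordinary code that the decoder is assumed to truncate using the known mask length.

```python
def zrl_encode(mask, code_bits):
    """Encode a 0/1 mask as a list of zero-run codes of code_bits bits each."""
    max_run = (1 << code_bits) - 1  # escape value: max_run zeros, no "1" follows
    codes = []
    run = 0
    for bit in mask:
        if bit == 0:
            run += 1
            if run == max_run:
                codes.append(max_run)
                run = 0
        else:
            codes.append(run)  # 'run' zeros followed by a single "1"
            run = 0
    if run:
        # Trailing zeros: the decoder is assumed to know the mask length
        # and to drop the extra "1" this code would otherwise imply.
        codes.append(run)
    return codes


def compressed_bits(mask, code_bits):
    """Size of the encoded mask in bits."""
    return len(zrl_encode(mask, code_bits)) * code_bits


# Toy masks of 100 connections each: 10% valid and 30% valid.
mask_k10 = ([0] * 9 + [1]) * 10
mask_k30 = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1] * 10
print(compressed_bits(mask_k10, 4), compressed_bits(mask_k30, 4))
```

For these toy masks and a 4-bit code, the k = 10% mask needs fewer codes than the k = 30% mask, which matches the statement that a sparser supermask compresses better.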

As described above, in the example of FIG. 3, for k of approximately 0 to 30%, the inference accuracy is higher as k increases and lower as k decreases. Also as described above, the smaller the value of k, the sparser the supermask 20 and the higher the compression rate. Therefore, for k of approximately 0 to 30%, a supermask 20 with a large value of k can be selected when high inference accuracy is desired. Conversely, a supermask 20 with a small value of k (that is, a high compression rate) can be selected when efficient inference processing is desired (when the amount of data transfer and the power consumption are to be reduced). The trade-off between inference accuracy and inference processing efficiency is thus easily realized. Accordingly, a supermask 20 that prioritizes inference accuracy and a supermask 20 that prioritizes inference processing efficiency (reduced data transfer and power consumption) can be generated separately.

A supermask 20 may also be generated for each inference target. For example, in inference that detects animals in image data, different supermasks 20 may be generated for the case where the inference target is "dog" and the case where the inference target is "cat". In this case, for example, a supermask 20 with a relatively large value of k (for example, about 20 to 30%) and a supermask 20 with a relatively small value of k (for example, about 1 to 10%) can both be generated with "dog" as the inference target. The same applies when learning is performed with "cat" as the inference target. A supermask 20 may therefore be generated for each inference accuracy, for each inference target, or both. As a result, simply by swapping the supermask 20 and without changing the weights 14, inference can easily be performed for a different inference target or with a different accuracy.

FIG. 4 is a diagram for explaining selection of the supermask 20 according to the present embodiment. As described above, in the initial model 10 a weight 14 is set for each connection 13, and these weights 14 are fixed. Inference can be performed by superimposing a supermask 20 on the initial model 10. Assume that a supermask 20A with k = 30%, a supermask 20B with k = 20%, and a supermask 20C with k = 10% have been generated. When the supermask 20A is selected, power consumption during inference is relatively high, but the inference accuracy can be relatively good. When the supermask 20B is selected, power consumption during inference can be lower than with the supermask 20A, but the inference accuracy can be worse. When the supermask 20C is selected, power consumption during inference can be lower still than with the supermask 20B, but the inference accuracy can be even worse. In this way, the supermask 20 to be used can be selected according to whether inference accuracy or inference processing efficiency (reduced power consumption) is prioritized.

As described above, in the HNN algorithm the weights 14 are set by random numbers in the initial model 10 and are fixed. The weights used in the inference stage must be the same as the weights set in the learning stage. One way to achieve this is to store the weights set by random numbers in the learning stage and to use the stored weights in the inference stage. In that case, however, a storage area for the weights is required.

On the other hand, since the weights are generated from random numbers as described above, the weights 14 can be generated using a random number generator. If the random number generator is the same and the seed input to the random number generator is the same, the same random numbers are output. In the present embodiment, the seed and the random number generator used in the learning stage are used in the inference stage. As a result, the same random numbers as those generated in the learning stage are also generated in the inference stage, so the same weights as those set in the learning stage can be used in the inference stage. In this case, the weights 14 do not need to be stored in the device, so the size of the storage area of the device, and hence the memory capacity, can be reduced. Furthermore, since no memory access is needed to set the weights in the inference stage, memory accesses can be reduced. Good processing efficiency can therefore be achieved.
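The idea of regenerating the weights instead of storing them can be demonstrated with any deterministic pseudo-random generator. The sketch below uses Python's standard `random.Random` purely as a stand-in; the function name `generate_weight_bits` and the seed values are hypothetical, and the generator actually used by the device may differ.

```python
import random


def generate_weight_bits(seed, n_weights):
    """Regenerate n_weights one-bit weights from a seed (nothing is stored)."""
    rng = random.Random(seed)  # same generator + same seed -> same sequence
    return [rng.getrandbits(1) for _ in range(n_weights)]


# "Learning stage": weights are derived from a seed.
learning_weights = generate_weight_bits(seed=42, n_weights=64)

# "Inference stage": only the seed is carried over, not the weights.
inference_weights = generate_weight_bits(seed=42, n_weights=64)

print(learning_weights == inference_weights)  # True
```

Only the seed has to be kept between the two stages, which is what removes the need for a weight memory in this scheme.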

In the HNN algorithm, the inference accuracy is good even when each weight 14 is set to "+1" or "-1". The weights 14 can therefore be binary information. For example, the weights 14 can be represented as binary information by interpreting the binary value "1" as "+1" and "0" as "-1". As described above, the supermask 20 is also binary information. Therefore, in the present embodiment, the device can be made small when the HNN algorithm is implemented in hardware.

(Embodiment 1)
Embodiments are described below with reference to the drawings. For clarity of explanation, the following description and the drawings are omitted and simplified as appropriate. In the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted as necessary.

<Neural network circuit device>
FIG. 5 is a diagram showing the configuration of a neural network circuit device 100 according to the first embodiment. The neural network circuit device 100 according to the first embodiment includes a control circuit 110, a weight generation circuit 200, a mask expansion circuit 300, an arithmetic circuit 400, an activation value data memory 500, a barrel shifter 130, and a post-processing circuit 140. The neural network circuit device 100 is hardware configured to perform inference using the HNN algorithm described above. The above circuits of the neural network circuit device 100 may be implemented on a single chip (on chip). The neural network circuit device 100 realizes a neural network model constructed using the weights set in the learning stage and the supermask generated in the learning stage.

When inference is performed, the weight generation circuit 200 (WGU: Weight Generation Unit) generates the weight 14 corresponding to each connection 13 by a random number generator configured to generate the same random numbers as those generated in the learning stage, using the same seed as that used in the learning stage. The weight generation circuit 200 generates the weights 14 for each kernel (filter, output channel). Details are described later.

When inference is performed, the mask expansion circuit 300 (SEU: Supermask Expansion Unit) expands the supermask 20 from its compressed state. The mask expansion circuit 300 expands the supermask 20 for each kernel (filter, output channel). Details are described later.
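As a rough software analogue of the expansion step, the sketch below decodes a zero-run-length stream back into a 0/1 mask. The stream format is an assumption of this sketch and is not taken from the text: each code is `code_bits` wide and gives the number of zeros preceding a "1", the all-ones code value means a maximal run of zeros with no "1" after it, and the mask length is known to the decoder. The name `zrl_expand` and the example codes are hypothetical.

```python
def zrl_expand(codes, code_bits, length):
    """Expand zero-run codes into a 0/1 mask of the given length."""
    max_run = (1 << code_bits) - 1  # escape value: zeros only, no "1" follows
    mask = []
    for code in codes:
        mask.extend([0] * code)
        if code != max_run:
            mask.append(1)
    # A trailing run of zeros is stored as an ordinary code, so the final
    # implied "1" (if any) falls beyond the known length and is dropped here.
    return mask[:length]


# Four 2-bit codes expand into a 10-element mask with two valid connections.
print(zrl_expand([2, 3, 1, 2], code_bits=2, length=10))
# [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
```

In this format the decoder has to be told the code width, which is one plausible reason for supplying code-length information to the expansion stage.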

FIG. 6 is a diagram for explaining the processing of the weight generation circuit 200 and the mask expansion circuit 300 according to the first embodiment. The weight generation circuit 200 generates the weight 14 corresponding to each connection 13, thereby reconstructing the initial model 10 that was set in the learning stage. The mask expansion circuit 300 expands the supermask 20, thereby recovering the original, uncompressed supermask 20. Inference can then be performed by superimposing the supermask 20 on the initial model 10.

When inference is performed, the control circuit 110 (MCC: Model Construction Controller) performs control for constructing the inference model (the combination of the initial model 10 and the supermask 20) that was configured in the learning stage. In particular, the control circuit 110 controls the weight generation circuit 200 and the mask expansion circuit 300. The control circuit 110 generates the seed that is input to the random number generator used in the weight generation circuit 200. The control circuit 110 also outputs (provides) information on the compression type (code length) of the compressed supermask 20 to the mask expansion circuit 300. In addition, the control circuit 110 performs control so that the operation of the weight generation circuit 200 and the operation of the mask expansion circuit 300 are synchronized. Details are described later. The control circuit 110 may include a processor and a storage device, and the functions of the control circuit 110 may be implemented by a program stored in the storage device and executed by the processor.

The activation value data memory 500 (AMEM: Activation Memory) stores input data (iAct). Here, "input data" is either inference target data such as image data, or activation value data corresponding to the inference target data. The activation value data corresponds to data generated by the processing of the first convolutional layer using the inference target data, or to data generated by the processing of the second or a subsequent convolutional layer. In other words, the activation value data corresponds to the output data of the processing of the preceding convolutional layer. When the inference target data is image data, the input data can be organized in three dimensions: the horizontal direction of the image, the vertical direction of the image, and the input channels. Details are described later. When the input data is inference target data, the input channels correspond to the channels of the image (for example, the R, G, and B values). When the input data is activation value data, the input channels correspond to the kernel types of the processing of the preceding convolutional layer (that is, the output channels of the output data of the preceding convolutional layer).

The barrel shifter 130 adjusts the position of the input data. Specifically, when the input data is processed by the arithmetic circuit 400, the barrel shifter 130 adjusts the position of the input data so that each piece of input data is placed at the appropriate position for processing. The barrel shifter 130 shifts the three-dimensionally organized input data in the vertical or horizontal direction.

The arithmetic circuit 400 performs a product-sum operation using the input data, the weights 14 generated by the weight generation circuit 200, and the supermask 20 expanded by the mask expansion circuit 300. The arithmetic circuit 400 performs the operation so that each product of input data and a weight 14 is made valid or invalid by the supermask 20. The arithmetic circuit 400 performs the product-sum operation for each input channel and for each kernel (output channel). The arithmetic circuit 400 outputs the partial sums (PSUM: partial sum) obtained by the respective product-sum operations to the post-processing circuit 140. Details are described later.
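A masked product-sum of this kind can be written out as follows. This is a behavioral sketch only, assuming the one-bit weight encoding mentioned in the outline ("1" read as +1, "0" read as -1) and a 0/1 supermask; the function name `masked_mac` and the sample values are hypothetical, and the sketch says nothing about how the circuit is organized internally.

```python
def masked_mac(activations, weight_bits, mask_bits):
    """Partial sum over connections: sum(act * (+1 or -1)) where the mask is 1."""
    psum = 0
    for act, w_bit, m_bit in zip(activations, weight_bits, mask_bits):
        if m_bit:  # connection marked valid by the supermask
            # Weight bit 1 means +1, weight bit 0 means -1, so the
            # "multiplication" reduces to an add or a subtract.
            psum += act if w_bit else -act
        # Connections with mask bit 0 contribute nothing.
    return psum


activations = [3, 1, 4, 1]
weight_bits = [1, 0, 1, 0]   # +1, -1, +1, -1
mask_bits = [1, 1, 0, 1]     # third connection is invalid
print(masked_mac(activations, weight_bits, mask_bits))  # 3 - 1 - 1 = 1
```

Because both the weight and the mask are single bits, no multiplier is needed in this formulation, only a conditional add or subtract.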

The post-processing circuit 140 (PPU: Post-Processing Unit) performs the processing required for a convolutional layer in a CNN. The post-processing circuit 140 performs the processing that follows the processing executed by the arithmetic circuit 400. The post-processing circuit 140 stores the resulting output data (oAct) in the activation value data memory 500. Details are described later.

FIG. 7 is a diagram showing the configuration of the weight generation circuit 200 according to the first embodiment. The weight generation circuit 200 is configured to generate the weights 14, when inference is performed, by a random number generator configured to generate the same random numbers as those generated in the learning stage, using the same seed as that used in the learning stage. The weight generation circuit 200 can perform processing in parallel for each output channel (that is, for each kernel). In other words, the weight generation circuit 200 can generate the weights 14 for a plurality of kernels (output channels) in parallel.

重み生成回路２００は、出力チャネルごと（つまりカーネルごと）に処理を行う重み生成部２１０を、並列に処理可能な出力チャネルの数だけ有する。図７の例では、重み生成回路２００は、１６個の出力チャネル（１６ｏＣｈｓ）ごとに処理を行う。つまり、重み生成回路２００は、１６個の重み生成部２１０を有する。重み生成回路２００は、出力チャネルごと（つまり重み生成部２１０ごと）に、乱数生成器２０２と、セレクタ２０４とを有する。乱数生成器２０２は、例えば、Ｘｏｒｓｈｉｆｔ１６のような、疑似乱数生成アルゴリズムによって実現され得る。 The weight generation circuit 200 has weight generation units 210 that perform processing for each output channel (i.e., for each kernel) equal to the number of output channels that can be processed in parallel. In the example of Figure 7, the weight generation circuit 200 performs processing for each of 16 output channels (16oChs). In other words, the weight generation circuit 200 has 16 weight generation units 210. The weight generation circuit 200 has a random number generator 202 and a selector 204 for each output channel (i.e., for each weight generation unit 210). The random number generator 202 can be implemented using a pseudo-random number generation algorithm such as Xorshift16, for example.

なお、ニューラルネットワークモデルの出力チャネルの数は、処理対象の出力チャネルの数（つまり重み生成部２１０の数）よりも多くてもよい。つまり、出力チャネルの数は、１６個でなくてもよく、例えば２５６個であってもよいし、５１２個であってもよい。出力チャネルが２５６個である場合、重み生成回路２００は、１６個の出力チャネルごとの処理を、１６回行ってもよい。このことは、後述するマスク展開回路３００及び演算回路４００においても同様である。 The number of output channels of the neural network model may be greater than the number of output channels processed at one time (i.e., the number of weight generation units 210). In other words, the number of output channels does not have to be 16, and may be, for example, 256 or 512. If there are 256 output channels, the weight generation circuit 200 may process 16 output channels at a time and repeat this processing 16 times. This also applies to the mask expansion circuit 300 and the arithmetic circuit 400 described below.
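The pass count in the 256-channel example can be checked with a short sketch. The function name `output_channel_passes` and the constant `PARALLEL_OCHS` are illustrative names, not taken from the patent; the value 16 comes from the Figure 7 example.

```python
import math

PARALLEL_OCHS = 16  # weight generation units 210 in the Figure 7 example


def output_channel_passes(total_ochs, parallel=PARALLEL_OCHS):
    """Yield (start, stop) output-channel index ranges, one range per pass."""
    for p in range(math.ceil(total_ochs / parallel)):
        start = p * parallel
        yield start, min(start + parallel, total_ochs)


passes = list(output_channel_passes(256))
print(len(passes), passes[0], passes[-1])
```

With 256 output channels this gives 16 passes, matching the text; with 512 it would give 32.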

ここで、上述したように、重み生成部２１０は、カーネルごとに重みを生成する。つまり、１つの重み生成部２１０が、１つのカーネルに対応する重みの集まりであるカーネル重み群２２０を生成する。ここで、カーネル重み群２２０は、入力データの３次元構成に対応して３次元に構成されている。カーネル重み群２２０は、入力データの横方向に対応する幅ｋｗと、入力データの縦方向に対応する高さｋｈと、入力データの処理対象の入力チャネルに対応する奥行１６チャネル（１６ｉＣｈｓ）のサイズを有している。１つの重み生成部２１０は、１回のサイクルで、１×１×１６のサイズのカーネル重み部分２２２を生成する。なお、カーネル重み部分２２２は、１ビットの重み１４が、処理対象の入力チャネルの数（１６個）だけ集まった集合である。したがって、重み生成部２１０がｋｗ×ｋｈ回のサイクルで処理を繰り返すことによって、ｋｗ×ｋｈ個のカーネル重み部分２２２が集まったカーネル重み群２２０が生成される。 As described above, the weight generation unit 210 generates a weight for each kernel. That is, one weight generation unit 210 generates a kernel weight group 220, which is a collection of weights corresponding to one kernel. Here, the kernel weight group 220 is configured three-dimensionally to correspond to the three-dimensional structure of the input data. The kernel weight group 220 has a width kw corresponding to the horizontal direction of the input data, a height kh corresponding to the vertical direction of the input data, and a depth of 16 channels (16iChs) corresponding to the input channels to be processed of the input data. One weight generation unit 210 generates a kernel weight portion 222 of size 1 x 1 x 16 in one cycle. Note that the kernel weight portion 222 is a collection of 1-bit weights 14, equal to the number of input channels to be processed (16). Therefore, by the weight generation unit 210 repeating the process in kw x kh cycles, a kernel weight group 220 consisting of kw x kh kernel weight portions 222 is generated.

乱数生成器２０２に入力されるシードは、後述する方法により制御回路１１０で生成され得る。図７の例では、制御回路１１０は、１６個の出力チャネルそれぞれについて、互いに異なる１６ビットのシードを生成する。したがって、重み生成回路２００（乱数生成器２０２）は、出力チャネルごとに異なるシードを用いて乱数を生成する。これにより、重み生成回路２００は、出力チャネルごとに異なる重み（異なるカーネル重み部分２２２及びカーネル重み群２２０）を生成することができる。 The seed input to the random number generator 202 can be generated by the control circuit 110 using a method described below. In the example of FIG. 7, the control circuit 110 generates a different 16-bit seed for each of the 16 output channels. Therefore, the weight generation circuit 200 (random number generator 202) generates random numbers using a different seed for each output channel. This allows the weight generation circuit 200 to generate different weights (different kernel weight portions 222 and kernel weight groups 220) for each output channel.

乱数生成器２０２は、初めの１サイクル目では、制御回路１１０から出力されたシードを用いて乱数を生成する。これにより、１個目のカーネル重み部分２２２が生成される。そして、乱数生成器２０２は、ｎ回目（ｎは２以上（ｋｗ×ｋｈ）以下の整数）のサイクルでは、（ｎ－１）回目に生成された乱数（重み）を用いて、乱数を生成する。これにより、ｎ個目のカーネル重み部分２２２が生成される。したがって、セレクタ２０４は、１回目のサイクルでは、制御回路１１０から取得された乱数のシード（初期値）を選択して乱数生成器２０２に出力し、２回目以降では前回の乱数生成器２０２の出力（生成された乱数）を選択する。重み生成部２１０それぞれは、生成されたカーネル重み部分２２２を、演算回路４００に出力する。したがって、重み生成回路２００は、１回のサイクルで、処理対象の出力チャネルの個数に対応する１６個のカーネル重み部分２２２を、演算回路４００に出力する。これにより、重み生成回路２００は、１回のサイクルで、１６（ｉＣｈ）×１６（ｏＣｈ）＝２５６個（つまり２５６ビット）の重み１４を、演算回路４００に出力することとなる。 In the first cycle, the random number generator 202 generates a random number using the seed output from the control circuit 110. This generates the first kernel weight portion 222. Then, in the nth cycle (n is an integer greater than or equal to 2 and less than or equal to (kw × kh)), the random number generator 202 generates a random number using the random number (weight) generated in the (n-1)th cycle. This generates the nth kernel weight portion 222. Accordingly, in the first cycle, the selector 204 selects the random number seed (initial value) obtained from the control circuit 110 and outputs it to the random number generator 202, and in the second and subsequent cycles, it selects the previous output (generated random number) of the random number generator 202. Each weight generation unit 210 outputs the generated kernel weight portion 222 to the arithmetic circuit 400. Therefore, in one cycle, the weight generation circuit 200 outputs 16 kernel weight portions 222 to the arithmetic circuit 400, corresponding to the number of output channels to be processed. As a result, the weight generation circuit 200 outputs 16 (iCh) x 16 (oCh) = 256 weights 14 (i.e., 256 bits) to the arithmetic circuit 400 in one cycle.
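The seed-then-feedback behaviour of one weight generation unit 210 can be sketched as follows. This is a minimal illustration under stated assumptions: the text names Xorshift16 only as an example and does not give shift constants, so the triple (7, 9, 8) used here is an assumption, as is the mapping of bit `ich` of the 16-bit random value to input channel `ich`. The function names are hypothetical.

```python
def xorshift16(state):
    """One step of a 16-bit xorshift generator (shift triple 7, 9, 8 is assumed)."""
    state ^= (state << 7) & 0xFFFF
    state ^= state >> 9
    state ^= (state << 8) & 0xFFFF
    return state


def kernel_weight_group(seed, kw, kh):
    """Return kw*kh kernel weight portions, each a list of 16 one-bit weights."""
    portions = []
    value = seed                      # cycle 1: the selector picks the seed
    for _ in range(kw * kh):
        value = xorshift16(value)     # later cycles: the previous output is fed back
        # Assumed mapping: bit `ich` of the random value is the weight for input channel `ich`.
        portions.append([(value >> ich) & 1 for ich in range(16)])
    return portions


group = kernel_weight_group(seed=0x1234, kw=3, kh=3)
print(len(group), len(group[0]))      # 9 portions of 16 bits each
```

Because the generator is deterministic, calling `kernel_weight_group` again with the same seed reproduces the same weights, which is the property the circuit relies on to avoid storing them. The seed must be nonzero, since an xorshift generator started at zero stays at zero.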

このように、重み生成回路２００は、推論段階で、学習段階で設定された重みと同じ重みを生成するように構成されている。したがって、重みを記憶しておくことが不要となる。したがって、メモリ容量を削減することが可能となる。さらに、推論段階で重みを設定する際にメモリにアクセスする必要がなくなるので、メモリアクセスを抑制することが可能となる。したがって、消費電力を抑制することが可能となる。これにより、処理効率が良好となる。 In this way, the weight generation circuit 200 is configured to generate weights in the inference stage that are the same as the weights set in the learning stage. This eliminates the need to store weights, making it possible to reduce memory capacity. Furthermore, since there is no need to access memory when setting weights in the inference stage, it is possible to reduce memory accesses. This makes it possible to reduce power consumption, resulting in improved processing efficiency.

制御回路１１０は、出力チャネル番号（＃ｏＣｈ）と層番号（＃Ｌａｙｅｒ）とを用いて、シードを生成する。ここで、出力チャネル番号は、処理対象の出力チャネルの識別情報（インデックス番号）に対応する。また、層番号は、ニューラルネットワークモデルを構成する層（畳み込み層）の識別情報（インデックス番号）に対応する。例えば、制御回路１１０は、１層目の処理において、図７に示された最も手前の重み生成部２１０に供給されるシードを生成する際に、「＃ｏＣｈ＝０」及び「＃Ｌａｙｅｒ＝０」を用いてもよい。 The control circuit 110 generates a seed using an output channel number (#oCh) and a layer number (#Layer). Here, the output channel number corresponds to the identification information (index number) of the output channel to be processed. The layer number corresponds to the identification information (index number) of a layer (convolutional layer) that constitutes the neural network model. For example, in processing the first layer, the control circuit 110 may use "#oCh=0" and "#Layer=0" when generating a seed to be supplied to the front-most weight generation unit 210 shown in FIG. 7.

このように、出力チャネル番号（＃ｏＣｈ）と層番号（＃Ｌａｙｅｒ）とを用いてシードを生成することにより、各層についての処理及び出力チャネルごとに、異なるシードを生成することができる。これにより、各層についての処理及び出力チャネルごとに、乱数生成の周期性を崩すことができる。言い換えると、各層についての処理及び出力チャネルごとに、異なる周期で乱数を生成することができる。したがって、Ｘｏｒｓｈｉｆｔ１６のような簡易な疑似乱数生成アルゴリズムを用いて生成された乱数を重みとして生成しても、推論の精度の劣化を抑制することができる。したがって、疑似乱数の周期が短い疑似乱数生成アルゴリズムを用いることが可能となる。これにより、乱数生成器２０２を小型化することができる。 In this way, by generating a seed using the output channel number (#oCh) and the layer number (#Layer), a different seed can be generated for the processing of each layer and for each output channel. This breaks the periodicity of random number generation across layers and output channels. In other words, random numbers can be generated with a different period for the processing of each layer and for each output channel. Therefore, even if random numbers generated using a simple pseudo-random number generation algorithm such as Xorshift16 are used as weights, deterioration in inference accuracy can be suppressed. This makes it possible to use a pseudo-random number generation algorithm with a short period, which in turn allows the random number generator 202 to be made smaller.

また、制御回路１１０は、出力チャネル番号（＃ｏＣｈ）と層番号（＃Ｌａｙｅｒ）とを入力とする任意のハッシュ関数によって、シードを生成してもよい。このようにしてシードを生成することによって、推論の精度の劣化を抑制しつつ、重みを生成するために記憶しておくデータ量をさらに削減することができる。 The control circuit 110 may also generate a seed using any hash function that takes the output channel number (#oCh) and layer number (#Layer) as input. By generating a seed in this manner, it is possible to further reduce the amount of data stored for generating weights while suppressing degradation in inference accuracy.
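As a rough illustration of deriving seeds from the two indices, the sketch below hashes (#Layer, #oCh) and keeps 16 bits. The text allows any hash function; SHA-256 from Python's standard library is used here purely for convenience and is an assumption, as is the function name `make_seed`. A hardware implementation would presumably use something far simpler.

```python
import hashlib


def make_seed(layer, och):
    """Derive a nonzero 16-bit seed from (#Layer, #oCh). The hash choice is hypothetical."""
    digest = hashlib.sha256(f"{layer}:{och}".encode()).digest()
    seed = int.from_bytes(digest[:2], "big")   # keep the first 16 bits
    return seed or 1   # avoid zero, which a xorshift-style generator cannot leave


seeds = [make_seed(layer=0, och=och) for och in range(16)]
print([hex(s) for s in seeds])
```

Only the two small indices need to be known to regenerate every seed, so no per-channel seed table has to be stored. Note that a hash truncated to 16 bits does not by itself guarantee that the 16 seeds are mutually different; a real design would need to check or enforce that.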

図８は、実施の形態１にかかるマスク展開回路３００の構成を示す図である。マスク展開回路３００は、推論を行う際に、圧縮された状態のスーパーマスク２０を展開するように構成されている。ここで、上述したように、推論処理の前段階では、スーパーマスク２０は、ゼロランレングス圧縮方式によって圧縮されている。スーパーマスク２０は、所定のコードレングスで圧縮されている。本実施の形態では、処理対象のスーパーマスク２０に対応するｋの値に応じて、コードレングスが定められていてもよい。 Figure 8 is a diagram showing the configuration of the mask expansion circuit 300 according to the first embodiment. The mask expansion circuit 300 is configured to expand the supermask 20 in a compressed state when performing inference. As described above, prior to the inference process, the supermask 20 is compressed using the zero-run length compression method. The supermask 20 is compressed to a predetermined code length. In this embodiment, the code length may be determined according to the value of k corresponding to the supermask 20 to be processed.

具体的には、上述したように、ｋの値が小さいほど、スーパーマスク２０は疎となる。つまり、ｋの値が小さいほど、スーパーマスク２０における「１」の数（つまり「有効」と決定された接続１３の数）が少なくなり、「０」の数（つまり「無効」と決定された接続１３の数）が多くなる。したがって、この場合は、長いコードレングスでゼロランレングス圧縮を行った方が、圧縮率を高くすることができる。一方、ｋの値が大きいほど、スーパーマスク２０は密となる。つまり、ｋの値が大きいほど、スーパーマスク２０における「１」の数が多くなり、「０」の数が少なくなる。したがって、この場合は、長いコードレングスでゼロランレングス圧縮を行うと圧縮前よりも圧縮後のデータ長が長くなり得るため、短いコードレングスでゼロランレングス圧縮を行った方が、圧縮率を高くすることができる。 Specifically, as described above, the smaller the value of k, the sparser the supermask 20. In other words, the smaller the value of k, the fewer the number of "1s" (i.e., the number of connections 13 determined to be "valid") in the supermask 20, and the more "0s" (i.e., the number of connections 13 determined to be "invalid"). Therefore, in this case, performing zero run length compression with a long code length can achieve a higher compression rate. On the other hand, the larger the value of k, the denser the supermask 20. In other words, the larger the value of k, the more "1s" there are in the supermask 20, and the fewer "0s" there are. Therefore, in this case, performing zero run length compression with a long code length can result in a longer data length after compression than before compression, so performing zero run length compression with a short code length can achieve a higher compression rate.
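The trade-off between code length and mask density can be illustrated with a small encoder. The exact zero-run-length format is not restated in this section, so the sketch assumes one common variant: each n-bit code gives the number of zeros preceding a "1", and the maximum code value (2^n - 1) stands for that many zeros with no "1" following. The function names are illustrative.

```python
def zrl_encode(bits, code_length):
    """Zero-run-length encode a 0/1 list into a list of code values (assumed format)."""
    max_run = (1 << code_length) - 1
    codes, run = [], 0
    for b in bits:
        if b:
            codes.append(run)          # `run` zeros followed by a 1
            run = 0
        else:
            run += 1
            if run == max_run:
                codes.append(max_run)  # max_run zeros, no 1 follows
                run = 0
    if run:
        codes.append(run)              # trailing zeros; the decoder truncates the implied 1
    return codes


def compressed_bits(bits, code_length):
    """Size of the encoded mask in bits."""
    return len(zrl_encode(bits, code_length)) * code_length


sparse = [0] * 15 + [1]   # one valid connection out of 16
dense = [1] * 16          # every connection valid
for name, mask in (("sparse", sparse), ("dense", dense)):
    print(name, {n: compressed_bits(mask, n) for n in (2, 3, 4)})
```

For the sparse 16-bit mask the 4-bit code needs 8 bits against 12 bits for the 2-bit code, while for the dense mask the 4-bit code expands the data to 64 bits against 32 bits for the 2-bit code, which mirrors the tendency described above.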

例えば、ゼロランレングス圧縮のコードレングスを｛２，３，４｝［ビット］とする。この場合、ｋの値が１～１５％程度のスーパーマスク２０は、コードレングスが「４ビット」の圧縮手法で圧縮されてもよい。また、ｋの値が１５～２５％程度のスーパーマスク２０は、コードレングスが「３ビット」の圧縮手法で圧縮されてもよい。また、ｋの値が２５～３５％程度のスーパーマスク２０は、コードレングスが「２ビット」の圧縮手法で圧縮されてもよい。 For example, the code length for zero run length compression is {2, 3, 4} bits. In this case, a supermask 20 with a k value of approximately 1 to 15% may be compressed using a compression method with a code length of "4 bits." A supermask 20 with a k value of approximately 15 to 25% may be compressed using a compression method with a code length of "3 bits." A supermask 20 with a k value of approximately 25 to 35% may be compressed using a compression method with a code length of "2 bits."
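The example ranges can be written as a small lookup. The ranges in the text are approximate and share their endpoints (15% and 25%), so assigning each boundary value to the shorter code is an assumption of this sketch, as is the function name.

```python
def select_code_length(k_percent):
    """Pick a ZRL code length in bits from the mask density k (in percent)."""
    if not 1 <= k_percent <= 35:
        raise ValueError("k is outside the range covered by the example")
    if k_percent < 15:
        return 4   # sparse mask: long zero runs, long code
    if k_percent < 25:
        return 3
    return 2       # denser mask: short zero runs, short code


print([select_code_length(k) for k in (5, 20, 30)])
```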

マスク展開回路３００は、圧縮データメモリ３０２と、ＦＩＦＯバッファ３０４と、デコーダ３０６と、マスクメモリ３１０とを有する。ＦＩＦＯバッファ３０４及びデコーダ３０６は、処理対象の出力チャネルごと（つまり処理対象のカーネルごと）に設けられている。図８の例では、１６個の出力チャネルにそれぞれに対応するＦＩＦＯバッファ３０４及びデコーダ３０６が設けられている。これにより、ＦＩＦＯバッファ３０４及びデコーダ３０６は、出力チャネルごと（つまりカーネルごと）に並列に処理を行う。 The mask expansion circuit 300 has a compressed data memory 302, a FIFO buffer 304, a decoder 306, and a mask memory 310. A FIFO buffer 304 and a decoder 306 are provided for each output channel to be processed (i.e., for each kernel to be processed). In the example of Figure 8, a FIFO buffer 304 and a decoder 306 are provided for each of the 16 output channels. As a result, the FIFO buffer 304 and the decoder 306 perform processing in parallel for each output channel (i.e., for each kernel).

圧縮データメモリ３０２（ＺＭＥＭ：ZRL Memory）は、圧縮されたスーパーマスク２０（圧縮スーパーマスク）のデータを格納する。圧縮データメモリ３０２は、圧縮スーパーマスク（ZRL Encoded Supermask）のデータを取得する。圧縮データメモリ３０２は、制御回路１１０から、圧縮スーパーマスクのデータを取得してもよい。ここで、図８に示すように、圧縮スーパーマスクのデータは、出力チャネルそれぞれについて所定のデータ量に分割された状態で取得され、圧縮データメモリ３０２に格納されてもよい。例えば、図８において圧縮データメモリ３０２に格納された「ｏＣｈ０」のデータそれぞれが、出力チャネル＃０のカーネルに対応する、分割された圧縮スーパーマスクのデータに対応する。なお、マスク展開回路３００が取得するスーパーマスク２０は圧縮されていなくてもよい。この場合、取得されたスーパーマスク２０は、圧縮データメモリ３０２ではなく、後述するマスクメモリ３１０に、直接、入力されてもよい（Pass-through mode）。 The compressed data memory 302 (ZMEM: ZRL Memory) stores compressed supermask 20 (compressed supermask) data. The compressed data memory 302 acquires compressed supermask (ZRL Encoded Supermask) data. The compressed data memory 302 may acquire compressed supermask data from the control circuit 110. Here, as shown in FIG. 8, the compressed supermask data may be acquired in a state where it is divided into a predetermined data amount for each output channel and stored in the compressed data memory 302. For example, in FIG. 8, each piece of "oCh0" data stored in the compressed data memory 302 corresponds to the divided compressed supermask data corresponding to the kernel of output channel #0. Note that the supermask 20 acquired by the mask expansion circuit 300 does not have to be compressed. In this case, the acquired supermask 20 may be input directly to the mask memory 310 (described later) rather than to the compressed data memory 302 (pass-through mode).

ＦＩＦＯバッファ３０４（ＦＩＦＯ：First-In First-Out）は、分割された圧縮スーパーマスクを結合するように構成されている。具体的には、ＦＩＦＯバッファ３０４は、出力チャネルごと（つまりカーネルごと）に、並列に、分割された圧縮スーパーマスクを結合する。例えば、図８に示した最も手前のＦＩＦＯバッファ３０４が、圧縮データメモリ３０２に格納された複数の「ｏＣｈ０」のデータを結合する。なお、各出力チャネルそれぞれに関するＦＩＦＯバッファ３０４は、１つの処理対象のカーネルに対応する圧縮スーパーマスクが生成されるように、分割された圧縮スーパーマスクを結合してもよい。各出力チャネルそれぞれに関するＦＩＦＯバッファ３０４は、３２ビットのデータを圧縮データメモリ３０２から一度に受信する。そして、ＦＩＦＯバッファ３０４は、最大４ビット（コードレングスの最大値に対応）のデータをデコーダ３０６に一度に出力してもよい。ここで、出力されるデータ量は、圧縮スーパーマスクのコードレングスに対応する。取得された圧縮スーパーマスクに関するコードレングスの情報は、制御回路１１０から受信されてもよい。 The FIFO buffer 304 (FIFO: First-In First-Out) is configured to combine the divided compressed supermasks. Specifically, the FIFO buffer 304 combines the divided compressed supermasks in parallel for each output channel (i.e., for each kernel). For example, the frontmost FIFO buffer 304 shown in FIG. 8 combines multiple "oCh0" data stored in the compressed data memory 302. Note that the FIFO buffer 304 for each output channel may combine the divided compressed supermasks so that a compressed supermask corresponding to one kernel to be processed is generated. The FIFO buffer 304 for each output channel receives 32 bits of data from the compressed data memory 302 at a time. The FIFO buffer 304 may then output up to 4 bits (corresponding to the maximum code length) of data to the decoder 306 at a time. Here, the amount of data output corresponds to the code length of the compressed supermask. Code length information for the obtained compressed supermask may be received from the control circuit 110.
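The width conversion performed by the FIFO buffer 304 (32 bits in, one code out) can be sketched as a generator. The bit order within a word is not specified in this section; most-significant-bit-first is assumed here, and `fifo_codes` is a hypothetical name.

```python
def fifo_codes(words, code_length):
    """Yield code_length-bit codes from a stream of 32-bit words (MSB first, assumed)."""
    buf, nbits = 0, 0
    mask = (1 << code_length) - 1
    for word in words:
        buf = (buf << 32) | (word & 0xFFFFFFFF)   # append the newly received word
        nbits += 32
        while nbits >= code_length:
            nbits -= code_length
            yield (buf >> nbits) & mask           # hand one code to the decoder
            buf &= (1 << nbits) - 1               # keep only the unread bits


print(list(fifo_codes([0x12345678], 4)))
```

One reason for joining the divided data before decoding is visible with a 3-bit code length: 32 is not a multiple of 3, so a code can straddle two consecutive 32-bit words, and the buffer carries the leftover bits over to the next word.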

デコーダ３０６（ZRL Decoder）は、圧縮スーパーマスクを展開（デコード）するように構成されている。デコーダ３０６は、出力チャネルごと（つまりカーネルごと）に、並列に、圧縮スーパーマスクを展開する。デコーダ３０６は、制御回路１１０からコードレングスの情報を受信してもよい。デコーダ３０６は、受信されたコードレングスの情報を用いて、ＦＩＦＯバッファ３０４から出力された圧縮データを展開する。デコーダ３０６は、例えば、各出力チャネルについて８ビットごとに、展開されたスーパーマスクのデータを、マスクメモリ３１０に出力してもよい。 The decoder 306 (ZRL Decoder) is configured to expand (decode) the compressed supermask. The decoder 306 expands the compressed supermask in parallel for each output channel (i.e., for each kernel). The decoder 306 may receive code length information from the control circuit 110. The decoder 306 uses the received code length information to expand the compressed data output from the FIFO buffer 304. The decoder 306 may output the expanded supermask data to the mask memory 310, for example, in 8-bit increments for each output channel.
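A decoder for one output channel might look like the sketch below. It assumes a zero-run-length format in which each code gives the number of zeros before a "1" and the maximum code value means that many zeros with no "1" following; the actual format used by the decoder 306 may differ. The target length is passed in so that padding and the implied final "1" can be cut off. Function names are illustrative.

```python
def zrl_decode(codes, code_length, length):
    """Expand code values into a 0/1 mask of `length` bits (assumed ZRL format)."""
    max_run = (1 << code_length) - 1
    out = []
    for code in codes:
        if len(out) >= length:
            break
        out.extend([0] * code)
        if code != max_run:
            out.append(1)      # a run shorter than the maximum is terminated by a 1
    return out[:length]


def in_bytes(bits):
    """Group the expanded mask into 8-bit pieces, as written to the mask memory."""
    return [bits[i:i + 8] for i in range(0, len(bits), 8)]


mask = zrl_decode([2, 3, 0, 0, 1], code_length=2, length=9)
print(mask)
print(in_bytes(mask))
```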

マスクメモリ３１０（ＳＭＥＭ：Supermask Memory）は、展開されたスーパーマスクのデータを、一時的に格納する。マスクメモリ３１０は、１つの処理対象のカーネルに対応するスーパーマスクのデータの集合であるカーネルマスク群３２０を、処理対象のカーネルの数（出力チャネルの数：１６ｏＣｈｓ）だけ格納する。ここで、カーネルマスク群３２０のサイズは、上述したカーネル重み群２２０のサイズと実質的に同じである。つまり、カーネルマスク群３２０は、入力データの横方向に対応する幅ｋｗと、入力データの縦方向に対応する高さｋｈと、入力データの処理対象の入力チャネルに対応する奥行１６ビット（１６ｉＣｈｓ）のサイズを有している。 The mask memory 310 (SMEM: Supermask Memory) temporarily stores the expanded supermask data. The mask memory 310 stores the kernel mask group 320, which is a collection of supermask data corresponding to one kernel to be processed, in the same number as the number of kernels to be processed (number of output channels: 16oChs). Here, the size of the kernel mask group 320 is substantially the same as the size of the kernel weight group 220 described above. In other words, the kernel mask group 320 has a width kw corresponding to the horizontal direction of the input data, a height kh corresponding to the vertical direction of the input data, and a depth of 16 bits (16iChs) corresponding to the input channel of the input data to be processed.

マスクメモリ３１０は、出力チャネルごとに、並列して、１回のサイクルで、１×１×１６のサイズのカーネルマスク部分３２２を、演算回路４００に出力する。カーネルマスク部分３２２のサイズは、上述したカーネル重み部分２２２のサイズと実質的に同じである。したがって、カーネルマスク部分３２２は、１ビットのマスクデータが、処理対象の入力チャネルの数（１６個）だけ集まった集合である。そして、ｋｗ×ｋｈ個のカーネルマスク部分３２２によって、カーネルマスク群３２０が構成される。そして、出力チャネル（カーネル）それぞれについて、ｋｗ×ｋｈ回のサイクルで、カーネルマスク部分３２２が、演算回路４００に出力される。 The mask memory 310 outputs a kernel mask portion 322 of size 1 x 1 x 16 to the arithmetic circuit 400 in one cycle in parallel for each output channel. The size of the kernel mask portion 322 is substantially the same as the size of the kernel weight portion 222 described above. Therefore, the kernel mask portion 322 is a collection of 1-bit mask data, the number of which is equal to the number of input channels to be processed (16). The kernel mask group 320 is then composed of kw x kh kernel mask portions 322. The kernel mask portion 322 is then output to the arithmetic circuit 400 in kw x kh cycles for each output channel (kernel).

マスクメモリ３１０は、２つのバンク３１２（バンク３１２Ａ（Bank 1）及びバンク３１２Ｂ（Bank 2））を有する。バンク３１２Ａとバンク３１２Ｂとによって、ダブルバッファリングを実現する。具体的には、例えば、まず、デコーダ３０６から出力されたデータがバンク３１２Ａに格納される。そして、処理対象のカーネルに対応するカーネルマスク群３２０のデータがバンク３１２Ａに格納された後、バンク３１２Ａに格納されたデータが読み出されて演算回路４００に出力され、演算に使用される。バンク３１２Ａに格納されたデータが読み出されて演算回路４００に出力されている間、バンク３１２Ｂに、次の処理対象のカーネルに対応するカーネルマスク群３２０のデータが格納される。 Mask memory 310 has two banks 312 (bank 312A (Bank 1) and bank 312B (Bank 2)). Banks 312A and 312B implement double buffering. Specifically, for example, data output from decoder 306 is first stored in bank 312A. Then, after data from kernel mask group 320 corresponding to the kernel to be processed is stored in bank 312A, the data stored in bank 312A is read out and output to arithmetic circuit 400, where it is used in the calculation. While the data stored in bank 312A is being read out and output to arithmetic circuit 400, data from kernel mask group 320 corresponding to the next kernel to be processed is stored in bank 312B.

そして、バンク３１２Ａに格納された全てのデータが読み出されて演算回路４００に出力されると、バンク３１２Ｂに格納されたデータが読み出されて演算回路４００に出力され、演算に使用される。バンク３１２Ｂに格納されたデータが読み出されて演算回路４００に出力されている間、バンク３１２Ａに、次の処理対象のカーネルに対応するカーネルマスク群３２０のデータが格納される。以下、これを繰り返す。これにより、演算回路４００に対応する処理を効率よく行うことができる。つまり、マスクメモリ３１０がデータの読出し及び書込み（格納）を並行して行うため、演算回路４００は、処理を効率よく行うことができる。 When all the data stored in bank 312A has been read out and output to the arithmetic circuit 400, the data stored in bank 312B is read out and output to the arithmetic circuit 400, where it is used in the calculation. While the data stored in bank 312B is being read out and output to the arithmetic circuit 400, data from the kernel mask group 320 corresponding to the next kernel to be processed is stored in bank 312A. This process is repeated thereafter. This allows the processing corresponding to the arithmetic circuit 400 to be performed efficiently. In other words, because the mask memory 310 reads and writes (stores) data in parallel, the arithmetic circuit 400 can perform processing efficiently.
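The alternation between the two banks can be sketched as a ping-pong buffer. The class and function names are illustrative, and the sketch is sequential: in hardware the store into one bank and the read from the other happen concurrently, which Python does not model here.

```python
class MaskMemory:
    """Two-bank ping-pong buffer: one bank is read while the other is written."""

    def __init__(self):
        self.banks = [None, None]
        self.write_bank = 0

    def store(self, kernel_mask_group):
        self.banks[self.write_bank] = kernel_mask_group

    def swap(self):
        self.write_bank ^= 1               # the filled bank becomes the read bank

    def read(self):
        return self.banks[self.write_bank ^ 1]


def run(groups):
    """Feed kernel mask groups through the two banks and return the read order."""
    mem = MaskMemory()
    consumed = []
    mem.store(groups[0])                   # the first group fills bank A
    for nxt in groups[1:] + [None]:
        mem.swap()
        if nxt is not None:
            mem.store(nxt)                 # in hardware this overlaps the read below
        consumed.append(mem.read())
    return consumed


print(run(["group0", "group1", "group2"]))
```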

ここで、制御回路１１０は、重み生成回路２００（乱数生成器２０２）からカーネル重み部分２２２が出力されるタイミングと、マスク展開回路３００（マスクメモリ３１０）からカーネルマスク部分３２２が出力されるタイミングとを、同期させる。具体的には、制御回路１１０は、シードを重み生成回路２００に出力するタイミングに対応するタイミングで、マスクメモリ３１０に読出し信号（read_addr）を出力する。これにより、同じカーネルにおける同じ位置（ｋｗ×ｋｈ上の位置）に対応するカーネル重み部分２２２とカーネルマスク部分３２２とが、実質的に同じタイミングで、演算回路４００に出力され得る。 Here, the control circuit 110 synchronizes the timing at which the kernel weight portion 222 is output from the weight generation circuit 200 (random number generator 202) with the timing at which the kernel mask portion 322 is output from the mask expansion circuit 300 (mask memory 310). Specifically, the control circuit 110 outputs a read signal (read_addr) to the mask memory 310 at a timing corresponding to the timing at which the seed is output to the weight generation circuit 200. This allows the kernel weight portion 222 and the kernel mask portion 322 corresponding to the same position (position on kw × kh) in the same kernel to be output to the arithmetic circuit 400 at substantially the same timing.

図９は、実施の形態１にかかる入力データを説明するための図である。上述したように、入力データ５１０は、画像の横方向（幅：Ｗｉｄｔｈ）と、画像の縦方向（高さ：Ｈｅｉｇｈｔ）と、入力チャネル（奥行：ｉＣｈｓ）とによって、３次元構造を有している。そして、実施の形態１では、演算回路４００における入力データ５１０の演算の単位として、チャンク５２０を考える。チャンク５２０は、３次元に構成された入力データ５１０の一部のデータに対応する。１つのチャンク５２０に対応するデータが、一度に、演算回路４００に入力され得る。 Figure 9 is a diagram for explaining input data according to the first embodiment. As described above, the input data 510 has a three-dimensional structure consisting of the horizontal direction of the image (width: Width), the vertical direction of the image (height: Height), and the input channels (depth: iChs). In the first embodiment, a chunk 520 is used as the unit in which the arithmetic circuit 400 operates on the input data 510. A chunk 520 corresponds to a portion of the three-dimensionally structured input data 510. The data corresponding to one chunk 520 can be input to the arithmetic circuit 400 at one time.

チャンク５２０は、例えば、幅ｃｗ＝４、高さｃｈ＝４、奥行（入力チャネル）ｉＣｈ＝１６のサイズを有する。この場合において、入力データ５１０のサイズをＷｉｄｔｈ＝１２、Ｈｅｉｇｈｔ＝１２、ｉＣｈｓ＝４８とすると、入力データ５１０は、３×３×３＝２７個のチャンク５２０を有する。ここで、入力データ５１０は、例えば、８ビットのデータの集まりである。この場合、入力データ５１０のデータサイズは、１２×１２×４８×８［ビット］である。また、１つのチャンク５２０のデータサイズは、４×４×１６×８［ビット］である。 For example, chunk 520 has a width of cw = 4, height ch = 4, and depth (input channel) iCh = 16. In this case, if the size of input data 510 is Width = 12, Height = 12, and iChs = 48, then input data 510 has 3 x 3 x 3 = 27 chunks 520. Here, input data 510 is, for example, a collection of 8-bit data. In this case, the data size of input data 510 is 12 x 12 x 48 x 8 [bits]. Furthermore, the data size of one chunk 520 is 4 x 4 x 16 x 8 [bits].
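The figures in this example can be reproduced with a few lines of arithmetic. The function name `chunk_stats` and its parameter names are illustrative, and the sketch assumes every dimension of the input data is an exact multiple of the chunk size.

```python
def chunk_stats(width, height, ichs, cw=4, ch=4, ich_per_chunk=16, bits=8):
    """Return (number of chunks, input size in bits, chunk size in bits)."""
    n_chunks = (width // cw) * (height // ch) * (ichs // ich_per_chunk)
    input_bits = width * height * ichs * bits
    chunk_bits = cw * ch * ich_per_chunk * bits
    return n_chunks, input_bits, chunk_bits


print(chunk_stats(12, 12, 48))  # (27, 55296, 2048)
```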

図１０は、実施の形態１にかかる演算回路４００の動作を説明するための図である。上述したように、重み生成回路２００から、各出力チャネル（カーネル）に対応するカーネル重み部分２２２が、演算回路４００に入力される。図１０の例では、上向きの矢印で示すように、１６個の出力チャネル（１６ｏＣｈｓ）それぞれに対応するカーネル重み部分２２２が、演算回路４００に入力される。同様に、マスク展開回路３００から、各出力チャネル（カーネル）に対応するカーネルマスク部分３２２が、演算回路４００に入力される。図１０の例では、下向きの矢印で示すように、１６個の出力チャネル（１６ｏＣｈｓ）それぞれに対応するカーネルマスク部分３２２が、演算回路４００に入力される。 Figure 10 is a diagram for explaining the operation of the arithmetic circuit 400 according to the first embodiment. As described above, the kernel weight portions 222 corresponding to each output channel (kernel) are input from the weight generation circuit 200 to the arithmetic circuit 400. In the example of Figure 10, as indicated by the upward arrows, the kernel weight portions 222 corresponding to each of the 16 output channels (16 oChs) are input to the arithmetic circuit 400. Similarly, the mask expansion circuit 300 inputs the kernel mask portions 322 corresponding to each output channel (kernel) to the arithmetic circuit 400. In the example of Figure 10, as indicated by the downward arrows, the kernel mask portions 322 corresponding to each of the 16 output channels (16 oChs) are input to the arithmetic circuit 400.

また、演算回路４００には、チャンク５２０のデータが入力される。ここで、チャンク５２０は、ｃｗ×ｃｈ個のチャンク部分５２２で構成される。上記の例では、チャンク５２０は、４×４＝１６個のチャンク部分５２２を有する。そして、チャンク部分５２２のサイズは、１×１×１６である。つまり、チャンク部分５２２のサイズは、カーネル重み部分２２２及びカーネルマスク部分３２２のサイズと対応している。そして、１６個（＝４×４）のチャンク部分５２２が、並列して演算回路４００に入力される。 The arithmetic circuit 400 also receives input of chunk 520 data. Here, chunk 520 is composed of cw x ch chunk portions 522. In the above example, chunk 520 has 4 x 4 = 16 chunk portions 522. The size of chunk portion 522 is 1 x 1 x 16. In other words, the size of chunk portion 522 corresponds to the size of kernel weight portion 222 and kernel mask portion 322. The 16 (= 4 x 4) chunk portions 522 are then input in parallel to the arithmetic circuit 400.

演算回路４００は、多数の演算素子（ＰＥ：Processing Element）を有する。また、演算回路４００は、複数のＰＥマトリクス４１０を有する。ここで、ＰＥマトリクス４１０は、処理対象の出力チャネルの数（つまり処理対象のカーネルの数）に対応する数だけ設けられている。図１０の例では、演算回路４００は、１６個のＰＥマトリクス４１０（16 PE Matrices）を有する。また、１つのＰＥマトリクス４１０は、複数のＰＥベクトル４２０を有する。ＰＥマトリクス４１０は、それぞれ、入力されるチャンク部分５２２の数（ｃｗ×ｃｈ個）に対応する数のＰＥベクトル４２０を有する。図１０の例では、ＰＥマトリクス４１０は、それぞれ、１６個（＝４×４）のＰＥベクトル４２０（16 PE Vectors）を有する。つまり、ＰＥマトリクス４１０におけるＰＥベクトル４２０は、チャンク５２０におけるチャンク部分５２２の位置（位置ずれ）に対応する。また、後述するように、ＰＥベクトル４２０は、処理対象の入力チャネルの数に対応する数の演算素子を有する。 The arithmetic circuit 400 has a large number of processing elements (PEs). The arithmetic circuit 400 also has multiple PE matrices 410. The number of PE matrices 410 corresponds to the number of output channels to be processed (i.e., the number of kernels to be processed). In the example of FIG. 10, the arithmetic circuit 400 has 16 PE matrices 410. Each PE matrix 410 has multiple PE vectors 420, the number of which corresponds to the number of input chunk portions 522 (cw × ch). In the example of FIG. 10, each PE matrix 410 has 16 (= 4 × 4) PE vectors 420. In other words, each PE vector 420 in a PE matrix 410 corresponds to the position (positional offset) of a chunk portion 522 within the chunk 520. Furthermore, as described below, each PE vector 420 has a number of processing elements corresponding to the number of input channels to be processed.

ここで、同じカーネルにおける同じ位置（ｋｗ×ｋｈ上の位置）に対応するカーネル重み部分２２２及びカーネルマスク部分３２２は、同じＰＥマトリクス４１０に入力される。すなわち、図１０に示されたカーネル重み群２２０とカーネルマスク群３２０とが互いに同じカーネルに対応するとする。この場合、このカーネル重み群２２０における（ｋｗ，ｋｈ）＝（ｘ１，ｙ１）の位置のカーネル重み部分２２２と、このカーネルマスク群３２０における（ｋｗ，ｋｈ）＝（ｘ１，ｙ１）の位置のカーネルマスク部分３２２とが、互いに同じＰＥマトリクス４１０に入力される。なお、この場合、上記と並列して、別のカーネルに関する（ｋｗ，ｋｈ）＝（ｘ１，ｙ１）の位置のカーネル重み部分２２２及びカーネルマスク部分３２２が、別のＰＥマトリクス４１０に入力される。 Here, the kernel weight portion 222 and kernel mask portion 322 corresponding to the same position (position on kw × kh) in the same kernel are input to the same PE matrix 410. That is, assume that the kernel weight group 220 and kernel mask group 320 shown in FIG. 10 correspond to the same kernel. In this case, the kernel weight portion 222 at the position (kw, kh) = (x1, y1) in this kernel weight group 220 and the kernel mask portion 322 at the position (kw, kh) = (x1, y1) in this kernel mask group 320 are input to the same PE matrix 410. In this case, in parallel with the above, the kernel weight portion 222 and kernel mask portion 322 at the position (kw, kh) = (x1, y1) for a different kernel are input to a different PE matrix 410.

そして、これらのカーネル重み部分２２２及びカーネルマスク部分３２２は、同じＰＥマトリクス４１０の複数のＰＥベクトル４２０に入力される。つまり、あるＰＥマトリクス４１０に含まれる複数のＰＥベクトル４２０には、同じカーネル重み部分２２２及び同じカーネルマスク部分３２２が入力される。言い換えると、各ＰＥマトリクス４１０は、各カーネル（出力チャネル）に対応付けられている。 These kernel weight portions 222 and kernel mask portions 322 are then input to multiple PE vectors 420 of the same PE matrix 410. In other words, the same kernel weight portion 222 and the same kernel mask portion 322 are input to multiple PE vectors 420 included in a certain PE matrix 410. In other words, each PE matrix 410 is associated with a respective kernel (output channel).

また、同じ位置のチャンク部分５２２は、右向きの矢印で示すように、各ＰＥマトリクス４１０における互いに同じ位置のＰＥベクトル４２０に入力される。つまり、複数のチャンク部分５２２それぞれは、各ＰＥマトリクス４１０の、そのチャンク５２０における位置に対応するＰＥベクトル４２０に入力される。例えば、図１０において最も上のチャンク部分５２２は、各ＰＥマトリクス４１０における最も上のＰＥベクトル４２０に入力される。他の位置のチャンク部分５２２についても同様である。このように、各ＰＥマトリクス４１０におけるＰＥベクトル４２０は、チャンク５２０内の位置（位置ずれ）に対応付けられている。 Furthermore, chunk portions 522 at the same position are input to PE vectors 420 at the same position in each PE matrix 410, as indicated by the right-pointing arrows. In other words, each of the multiple chunk portions 522 is input to the PE vector 420 in each PE matrix 410 that corresponds to its position in that chunk 520. For example, in Figure 10, the topmost chunk portion 522 is input to the topmost PE vector 420 in each PE matrix 410. The same applies to chunk portions 522 at other positions. In this way, the PE vectors 420 in each PE matrix 410 are associated with positions (positional offsets) within the chunk 520.

したがって、各ＰＥベクトル４２０には、チャンク５２０内における当該ＰＥベクトル４２０に対応する位置のチャンク部分５２２が入力される。また、各ＰＥベクトル４２０には、当該ＰＥベクトル４２０（ＰＥマトリクス４１０）に対応するカーネルに関するカーネル重み部分２２２及びカーネルマスク部分３２２が入力される。そして、各ＰＥベクトル４２０は、後述するように、これらのチャンク部分５２２と、カーネル重み部分２２２と、カーネルマスク部分３２２とを用いて、積和演算を行う。 Therefore, each PE vector 420 receives the chunk portion 522 at a position within the chunk 520 that corresponds to that PE vector 420. Furthermore, each PE vector 420 receives the kernel weight portion 222 and kernel mask portion 322 for the kernel that corresponds to that PE vector 420 (PE matrix 410). Then, as described below, each PE vector 420 performs a multiply-and-accumulate operation using these chunk portion 522, kernel weight portion 222, and kernel mask portion 322.

なお、入力データ５１０における別のチャンク５２０が演算回路４００に入力されることによって、他のチャンク５２０のデータについても、同様に積和演算を行うことができる。また、カーネル重み群２２０及びカーネルマスク群３２０それぞれにおける他の位置（ｋｗ×ｋｈにおける他の位置）のカーネル重み部分２２２及びカーネルマスク部分３２２が演算回路４００に入力されることによって、他のカーネル上の位置についても、同様に積和演算を行うことができる。例えば、各カーネルのある位置（例えば（ｋｗ，ｋｈ）＝（ｘ１，ｙ１））に対応するカーネル重み部分２２２及びカーネルマスク部分３２２が対応するＰＥマトリクス４１０内の各ＰＥベクトル４２０に入力されているとする。その間に、入力されるチャンク５２０を変えることによって、演算回路４００は、入力データ５１０における全てのデータについて、上記の位置の積和演算を行うことができる。そして、各カーネルの別の位置（例えば（ｋｗ，ｋｈ）＝（ｘ１，ｙ２））に対応するカーネル重み部分２２２及びカーネルマスク部分３２２が対応するＰＥマトリクス４１０内の各ＰＥベクトル４２０に入力されるようにする。その間に、入力されるチャンク５２０を変えることによって、演算回路４００は、同様に、入力データ５１０における全てのデータについて、上記の位置の積和演算を行うことができる。このようにして、各カーネルにおける全ての位置に対して、入力データ５１０の全ての位置についての積和演算を行うことができる。なお、これらのデータの入力のタイミングは、ニューラルネットワーク回路装置１００に入力されるクロックを用いて、例えば制御回路１１０によって調整されてもよい。 Note that by inputting another chunk 520 of the input data 510 to the arithmetic circuit 400, similar product-sum operations can be performed on the data of the other chunk 520. Furthermore, by inputting the kernel weight portion 222 and kernel mask portion 322 at other positions (other positions in kw × kh) in the kernel weight group 220 and kernel mask group 320, respectively, to the arithmetic circuit 400, similar product-sum operations can be performed for the other positions in the kernel. For example, assume that the kernel weight portion 222 and kernel mask portion 322 corresponding to a certain position of each kernel (e.g., (kw, kh) = (x1, y1)) are input to each PE vector 420 in the corresponding PE matrix 410. Meanwhile, by changing the input chunk 520, the arithmetic circuit 400 can perform product-sum operations on the above positions for all data in the input data 510. Then, the kernel weight portion 222 and kernel mask portion 322 corresponding to a different position of each kernel (for example, (kw, kh) = (x1, y2)) are input to each PE vector 420 in the corresponding PE matrix 410. Meanwhile, by changing the input chunk 520, the arithmetic circuit 400 can similarly perform a product-sum operation for the above positions for all data in the input data 510. In this way, a product-sum operation can be performed for all positions in the input data 510 for all positions in each kernel. Note that the timing of inputting this data may be adjusted by, for example, the control circuit 110 using a clock input to the neural network circuit device 100.
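The ordering described here, in which a kernel position is held while all chunks are streamed through and only then the next kernel position is loaded, can be sketched as a loop nest. The traversal order of the kernel positions themselves (row by row below) is not stated in the text and is an assumption, as is the function name `schedule`.

```python
def schedule(kw, kh, n_chunks):
    """Yield (kernel_position, chunk_index): kernel position outer, chunk inner."""
    for y in range(kh):
        for x in range(kw):
            # The weight and mask portions for position (x, y) stay in the PE vectors
            # while every chunk of the input data is fed through.
            for chunk in range(n_chunks):
                yield (x, y), chunk


steps = list(schedule(kw=3, kh=3, n_chunks=27))
print(len(steps), steps[0], steps[27])
```

For a 3 x 3 kernel and the 27 chunks of the earlier example this gives 243 steps, and each kernel weight portion and kernel mask portion is loaded only once per pass over the input data.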

図１１は、実施の形態１にかかるＰＥベクトル４２０の処理を説明するための図である。ＰＥベクトル４２０は、処理対象の入力チャネルの数に対応する数の演算素子４３０（ＰＥ）を有する。上記の例では、処理対象の入力チャネルの数は１６個なので、ＰＥベクトル４２０は、１６個の演算素子４３０（16 PEs）を有する。なお、図１１には、紙面の都合上、８個の演算素子４３０のみが示されている。 Figure 11 is a diagram illustrating the processing of the PE vector 420 according to the first embodiment. The PE vector 420 has a number of processing elements 430 (PEs) corresponding to the number of input channels to be processed. In the above example, the number of input channels to be processed is 16, so the PE vector 420 has 16 processing elements 430 (16 PEs). Note that due to space limitations, only eight processing elements 430 are shown in Figure 11.

As described above, one kernel weight portion 222, one kernel mask portion 322, and one chunk portion 522 are input to one PE vector 420. The data for each input channel of the kernel weight portion 222, the kernel mask portion 322, and the chunk portion 522 is input to one arithmetic element 430. That is, the PE vector 420 has an arithmetic element 430 for each input channel. For example, the data for input channel #0 of each of the kernel weight portion 222, the kernel mask portion 322, and the chunk portion 522 is input to the arithmetic element 430 corresponding to input channel #0 (the bottom-right, frontmost arithmetic element 430 in FIG. 11). Likewise, the data for input channel #15 of each of the kernel weight portion 222, the kernel mask portion 322, and the chunk portion 522 is input to the arithmetic element 430 corresponding to input channel #15 (the top-left, rearmost arithmetic element 430 in FIG. 11).

The arithmetic element 430 performs a logical product (AND) operation using the data of each input channel of the input kernel weight portion 222, kernel mask portion 322, and chunk portion 522. The PE vector 420 then calculates the sum of the logical products obtained by the 16 arithmetic elements 430 as a partial sum (PSUM). The arithmetic circuit 400 outputs the partial sums obtained by the PE vectors 420 (16 × 16 partial sums in total) to the post-processing circuit 140.

Here, the partial sum is expressed as
$$\mathrm{PSUM} = \sum_{\text{if}(\mathrm{SMask})} \{\mathrm{iAct} \times \mathrm{Weight}\} \qquad \cdots (1)$$
Equation 1 means that, for each input channel, the product of the input data value (iAct) and the weight value (Weight) is calculated when the supermask value (SMask) is "1" (i.e., not "0"), and the products obtained for the input channels are summed. In other words, Equation 1 means that the operation is performed so that the product of the input data and the weight value is enabled or disabled by the supermask value (1 or 0).

Here, Equation 1 is equivalent to Equation 2 below.
$$\mathrm{PSUM} = \sum \{(\mathrm{iAct}\ \text{and}\ \mathrm{SMask}) \times \mathrm{Weight}\} \qquad \cdots (2)$$
Equation 2 means that, for each input channel, the logical product of the input data value (iAct) and the supermask value (SMask) is multiplied by the weight value (Weight), and the products obtained for the input channels are summed.

The arithmetic element 430 according to the first embodiment performs the operation corresponding to Equation 2. That is, the arithmetic element 430 (arithmetic circuit 400) multiplies the logical product of the input data input to a certain connection and the value (1 or 0) in the supermask indicating whether that connection is valid or invalid by the weight corresponding to that connection. The PE vector 420 (arithmetic circuit 400) then calculates the sum of the products obtained for the input channels as the partial sum (PSUM). Note that, since the weight value is either "+1" or "-1", multiplying the logical product of the input data value (iAct) and the supermask value (SMask) by the weight value (Weight) amounts to a simple sign inversion.
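The equivalence of Equation 1 and Equation 2, and the reduction of the weight multiplication to a sign inversion, can be checked with a short sketch. The function names below are hypothetical, and treating "iAct and SMask" as a bitwise AND of every bit of iAct with the single mask bit is an assumption about how the logical product is realized.

```python
def psum_eq1(i_act, s_mask, weight):
    """Equation 1: sum iAct * Weight over the channels whose mask bit is 1."""
    return sum(a * w for a, m, w in zip(i_act, s_mask, weight) if m == 1)


def gate(a, m):
    """'iAct and SMask': AND every bit of a with the single mask bit m.

    -1 is all ones in two's complement and -0 is all zeros, so a & -m
    returns a when m == 1 and 0 when m == 0.
    """
    return a & -m


def psum_eq2(i_act, s_mask, weight):
    """Equation 2: gate the input, then apply the +1/-1 weight as a sign flip."""
    total = 0
    for a, m, w in zip(i_act, s_mask, weight):
        gated = gate(a, m)
        total += gated if w == +1 else -gated  # no multiplier needed
    return total


acts, mask, weights = [9, 4, 0, 12], [0, 1, 1, 1], [-1, -1, +1, +1]
print(psum_eq1(acts, mask, weights) == psum_eq2(acts, mask, weights))
```

Because a masked channel contributes a gated value of 0, the two forms agree for any inputs whose weights are +1 or -1.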

To simplify the explanation, a specific example of the above operation is described for the case of four input channels. Assume that the following data is input to the PE vector 420:
"iAct" = {5, 3, 7, 1}
"SMask" = {1, 0, 1, 1}
"Weight" = {+1, -1, -1, +1}

In this case, for the first input channel, "SMask" is "1", so "iAct" is valid. The arithmetic element 430 corresponding to the first input channel calculates the product of "iAct" = "5" and "Weight" = "+1" as "+5". For the second input channel, "SMask" is "0", so the input data is invalid. Therefore, the arithmetic element 430 corresponding to the second input channel does not perform the product operation.

For the third input channel, "SMask" is "1", so "iAct" is valid. The arithmetic element 430 corresponding to the third input channel calculates the product of "iAct" = "7" and "Weight" = "-1" as "-7". For the fourth input channel, "SMask" is "1", so "iAct" is valid. The arithmetic element 430 corresponding to the fourth input channel calculates the product of "iAct" = "1" and "Weight" = "+1" as "+1". The PE vector 420 then calculates the sum (PSUM) of the obtained products "+5", "-7", and "+1" as "-1".
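The four-channel example can be reproduced directly. The variable names below are illustrative only; the masked second channel is represented as a contribution of 0.

```python
i_act = [5, 3, 7, 1]
s_mask = [1, 0, 1, 1]
weight = [+1, -1, -1, +1]

# Per-channel contribution: the product when the mask bit is 1, otherwise 0.
contributions = [a * w if m == 1 else 0 for a, m, w in zip(i_act, s_mask, weight)]
psum = sum(contributions)

print(contributions)  # [5, 0, -7, 1]
print(psum)           # -1
```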

FIG. 12 is a timing chart for the components of the neural network circuit device 100 according to the first embodiment. The numbers in the timing chart shown in FIG. 12 indicate output channel (kernel) indices. At timing T0, the control circuit 110 outputs information on the code lengths of the compressed supermasks for the 16 kernels #0 to #15 (output channels #0 to #15) to the decoder 306 of the mask expansion circuit 300. Between timings T0 and T1, the decoder 306 expands the compressed supermasks for kernels #0 to #15 and stores the expanded supermasks 20 in bank 312A of the mask memory 310.

At timing T1, the control circuit 110 outputs the seeds for generating the weights 14 for each of kernels #0 to #15 to the random number generator 202 of the weight generation circuit 200. The random number generator 202 thereby generates the weights 14 for kernels #0 to #15 and outputs the generated weights 14 to the arithmetic circuit 400. Also at timing T1, the control circuit 110 outputs to bank 312A a read signal for reading out the supermasks 20 for each of kernels #0 to #15. The supermasks 20 for each of kernels #0 to #15 are thereby output to the arithmetic circuit 400. Therefore, the arithmetic circuit 400 performs the operations for kernels #0 to #15 (output channels #0 to #15) between timings T1 and T2.

Furthermore, at timing T1, the control circuit 110 outputs information on the code lengths of the compressed supermasks for the next 16 kernels #16 to #31 (output channels #16 to #31) to the decoder 306 of the mask expansion circuit 300. Between timings T1 and T2, the decoder 306 expands the compressed supermasks for kernels #16 to #31 and stores the expanded supermasks 20 in bank 312B of the mask memory 310. In this way, the operations for kernels #0 to #15 (output channels #0 to #15) and the expansion of the compressed supermasks for the next kernels #16 to #31 (output channels #16 to #31) are executed in parallel.

At timing T2, the control circuit 110 outputs the seeds for generating the weights 14 for each of kernels #16 to #31 to the random number generator 202 of the weight generation circuit 200. The random number generator 202 thereby generates the weights 14 for kernels #16 to #31 and outputs the generated weights 14 to the arithmetic circuit 400. Also at timing T2, the control circuit 110 outputs to bank 312B a read signal for reading out the supermasks 20 for each of kernels #16 to #31. The supermasks 20 for each of kernels #16 to #31 are thereby output to the arithmetic circuit 400. Therefore, the arithmetic circuit 400 performs the operations for kernels #16 to #31 (output channels #16 to #31) between timings T2 and T3.

Furthermore, at timing T2, the control circuit 110 outputs information on the code lengths of the compressed supermasks for the next 16 kernels #32 to #47 (output channels #32 to #47) to the decoder 306 of the mask expansion circuit 300. Between timings T2 and T3, the decoder 306 expands the compressed supermasks for kernels #32 to #47 and stores the expanded supermasks 20 in bank 312A of the mask memory 310. In this way, the operations for kernels #16 to #31 (output channels #16 to #31) and the expansion of the compressed supermasks for the next kernels #32 to #47 (output channels #32 to #47) are executed in parallel. The processing for the next 16 kernels #32 to #47 (output channels #32 to #47) and for the following 16 kernels #48 to #63 (output channels #48 to #63) is performed in the same manner.
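The overlap between mask expansion and computation can be sketched as a two-bank (ping-pong) schedule in which the decoder always works one group of 16 kernels ahead of the arithmetic circuit. The function `pingpong_schedule` and its output format are hypothetical; the total of 64 kernels follows the example above, and the bank used for kernels #48 to #63 is an assumption extrapolated from the alternation between banks 312A and 312B.

```python
def pingpong_schedule(num_kernels=64, group=16):
    """Return, for each interval T(t)..T(t+1), what is decoded and what is computed.

    Each entry pairs a kernel range with the mask-memory bank it uses.
    While one bank is being read by the arithmetic circuit, the decoder
    fills the other bank with the next group's expanded supermask.
    """
    groups = [(s, s + group - 1) for s in range(0, num_kernels, group)]
    banks = ["312A", "312B"]
    slots = []
    for t in range(len(groups) + 1):
        decode = (groups[t], banks[t % 2]) if t < len(groups) else None
        compute = (groups[t - 1], banks[(t - 1) % 2]) if t >= 1 else None
        slots.append({"slot": f"T{t}-T{t + 1}", "decode": decode, "compute": compute})
    return slots


for slot in pingpong_schedule():
    print(slot)
```

In every interval the bank being written differs from the bank being read, which is what allows the two activities to proceed in parallel.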

FIG. 13 is a diagram showing the configuration of the post-processing circuit 140 according to the first embodiment. The post-processing circuit 140 includes a floating-point type conversion unit 141, a batch norm unit 142, a block floating-point type conversion unit 143, an activation function processing unit 144, and a pooling unit 145. The floating-point type conversion unit 141 (INT2FP) converts the PSUM data, which is of integer type, to floating-point type. This enables the post-processing circuit 140 to perform floating-point operations.

The batch norm unit 142 (BN) performs batch normalization on the data converted to floating-point type. The block floating-point type conversion unit 143 (FP2BFP) converts the normalized floating-point data to block floating-point type, which reduces the amount of data. The activation function processing unit 144 (ReLU) applies an activation function such as ReLU (Rectified Linear Unit) to the data converted to block floating-point type, yielding the output of the activation function. The pooling unit 145 (MaxPool) performs max pooling on that output.

The post-processing circuit 140 stores the obtained data for each output channel (kernel) in the activation value data memory 500. Then, using the data stored in the activation value data memory 500 as input data, the neural network circuit device 100 repeats the above processing for the next layer of the neural network. In this way, the neural network circuit device 100 processes multiple layers of the neural network. In other words, the neural network circuit device 100 processes each layer of the neural network by repeating the loop shown in FIG. 5, which starts from the activation value data memory 500, passes through the barrel shifter 130, the arithmetic circuit 400, and the post-processing circuit 140, and returns to the activation value data memory 500.
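The order of the post-processing stages (INT2FP, BN, FP2BFP, ReLU, MaxPool) can be sketched in software as follows. This is a rough sketch under several assumptions that the text does not specify: the batch normalization parameters, an 8-bit signed mantissa with one shared exponent per block, and a one-dimensional pooling window of 2. All function names are hypothetical.

```python
import math


def int2fp(psums):
    """INT2FP: convert integer partial sums to floating point."""
    return [float(p) for p in psums]


def batch_norm(xs, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """BN: normalize with given statistics (assumed inference-time parameters)."""
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]


def fp2bfp(xs, mantissa_bits=8):
    """FP2BFP: one shared exponent plus signed integer mantissas (assumed format)."""
    peak = max(abs(x) for x in xs)
    if peak == 0.0:
        return 0, [0] * len(xs)
    shared_exp = math.frexp(peak)[1]              # peak < 2 ** shared_exp
    scale = 2.0 ** (mantissa_bits - 1 - shared_exp)
    limit = 2 ** (mantissa_bits - 1) - 1
    return shared_exp, [max(-limit, min(limit, round(x * scale))) for x in xs]


def relu(mantissas):
    """ReLU acts on the mantissas; the shared exponent is unaffected."""
    return [max(0, m) for m in mantissas]


def max_pool(mantissas, window=2):
    """MaxPool over non-overlapping 1-D windows (simplified from 2-D pooling)."""
    return [max(mantissas[i:i + window]) for i in range(0, len(mantissas), window)]


def post_process(psums, mean, var):
    shared_exp, mantissas = fp2bfp(batch_norm(int2fp(psums), mean, var))
    return shared_exp, max_pool(relu(mantissas))


print(post_process([-1, 3, 0, 2], mean=1.0, var=1.0))
```

Applying ReLU and max pooling after the block floating-point conversion means both stages can operate on small integer mantissas, since neither changes the shared exponent.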

<Computation results>
FIGS. 14 to 17 are diagrams showing the results of computations performed by the neural network circuit device 100 according to the first embodiment. FIG. 14 shows that applying the technique according to the first embodiment reduces external memory access. Generating the weights 14 inside the neural network circuit device 100 (chip) with the weight generation circuit 200 reduces external memory access by 48.0%. Furthermore, expanding the compressed supermask with the mask expansion circuit 300 reduces external memory access by 45.9%. In this way, the neural network circuit device 100 according to the first embodiment (On-chip Model Construction) can reduce external memory access.

FIG. 15 shows that, when the technique according to the first embodiment is applied to ResNet50, the amount of data on the weights 14 stored within the device (within the chip) can be reduced. Generating the weights 14 from random numbers produced with a different seed for each output channel (Random Seed) reduces the amount of data stored for setting the weights 14 by 98.2% compared with not using this technique (Binaried Weight). Generating the weights 14 from seeds produced by a hash function that takes the output channel number and the layer number as input (Hashed Seed) reduces the amount of data stored for setting the weights 14 by 100%. In this way, the neural network circuit device 100 according to the first embodiment can reduce the amount of data stored for setting the weights 14.
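The Hashed Seed approach can be sketched as follows. The text does not specify the hash function or the random number generator in this passage, so SHA-256 and Python's `random.Random` are stand-ins chosen purely for illustration; `hashed_seed` and `kernel_weights` are hypothetical names.

```python
import hashlib
import random


def hashed_seed(layer, out_channel):
    """Derive a 32-bit seed from the layer number and output channel number.

    SHA-256 is an assumed stand-in for the unspecified hash function.
    """
    digest = hashlib.sha256(f"{layer}:{out_channel}".encode()).digest()
    return int.from_bytes(digest[:4], "big")


def kernel_weights(layer, out_channel, count):
    """Regenerate the +1/-1 weights of one kernel from its derived seed."""
    rng = random.Random(hashed_seed(layer, out_channel))
    return [1 if rng.getrandbits(1) else -1 for _ in range(count)]


first = kernel_weights(layer=2, out_channel=5, count=8)
again = kernel_weights(layer=2, out_channel=5, count=8)
print(first == again)
```

Because the seed is recomputed from the layer and channel indices, neither the weights nor the seeds need to be stored, which is consistent with the 100% reduction reported for Hashed Seed. The same seed must of course yield the same sequence at training and at inference.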

FIG. 16 shows that accuracy degradation is suppressed even when the technique according to the first embodiment is applied. Compared with the existing case (PyTorch Default), accuracy degradation is suppressed even when the weights 14 are generated from random numbers produced with a different seed for each output channel (Random Seed). Likewise, compared with the existing case (PyTorch Default), accuracy degradation is suppressed even when the weights 14 are generated from seeds produced by a hash function that takes the output channel number and the layer number as input (Hashed Seed).

FIG. 17 shows the effect of compressing the supermask 20 using the technique according to the first embodiment. FIG. 17 shows the compression ratios obtained when compressing supermasks 20 with k values of 10%, 20%, and 30%. A supermask 20 with a k value of 10% contains many "0"s and few "1"s. In this case, therefore, the degree of compression can be increased by using a compression method with a code length of 4 bits (4-bit encoding). The resulting compression ratio (compressed size relative to the original) is approximately 0.5, a 49.6% improvement (reduction in data volume) compared with leaving the supermask 20 uncompressed.

A supermask 20 with a k value of 20% has fewer "0"s and more "1"s than a supermask 20 with a k value of 10%. In this case, therefore, the degree of compression can be increased by using a compression method with a code length of 3 bits (3-bit encoding). The resulting compression ratio is approximately 0.75. A supermask 20 with a k value of 30% has fewer "0"s and more "1"s than a supermask 20 with a k value of 20%. In this case, therefore, the degree of compression can be increased by using a compression method with a code length of 2 bits (2-bit encoding). The resulting compression ratio is approximately 0.9. Thus, the sparser the corresponding supermask 20 (the fewer "1"s it contains), the more strongly the compressed supermask is compressed (that is, the smaller its size relative to the original).
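The relationship between sparsity and code length can be illustrated with a generic zero run-length coder that uses fixed-length codes. The exact code format of the compressed supermask is not given in this passage, so the format below is an assumption: a code value v smaller than the maximum means "v zeros followed by one 1", and the maximum value means "that many zeros with no 1 following". The function names are hypothetical.

```python
def zrl_encode(mask, code_bits):
    """Encode a 0/1 mask as a list of fixed-width zero-run codes (assumed format)."""
    max_run = (1 << code_bits) - 1
    codes, run = [], 0
    for bit in mask:
        if bit == 0:
            run += 1
            if run == max_run:      # run too long for one code: emit an escape
                codes.append(max_run)
                run = 0
        else:
            codes.append(run)       # `run` zeros followed by a single 1
            run = 0
    return codes                    # trailing zeros are implied by the mask length


def zrl_decode(codes, code_bits, length):
    """Expand the codes back into a mask of the given length."""
    max_run = (1 << code_bits) - 1
    mask = []
    for code in codes:
        mask.extend([0] * code)
        if code != max_run:
            mask.append(1)
    mask.extend([0] * (length - len(mask)))
    return mask


def compression_ratio(mask, code_bits):
    """Compressed size in bits divided by the uncompressed size in bits."""
    return len(zrl_encode(mask, code_bits)) * code_bits / len(mask)


example = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]
print(zrl_encode(example, code_bits=2))  # [3, 0, 3, 3, 2, 0]
```

Under this assumed format, and for a mask whose bits are independently "1" with probability p, the expected ratio is n·p / (1 − (1 − p)^(2^n − 1)) for an n-bit code, which evaluates to roughly 0.50 for p = 10% with 4 bits, 0.76 for p = 20% with 3 bits, and 0.91 for p = 30% with 2 bits. Longer codes pay off only when long zero runs are common, which is why sparser masks favor larger code lengths.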

(Modifications)
The present invention is not limited to the above embodiment and can be modified as appropriate without departing from its spirit. For example, the neural network circuit device 100 according to the first embodiment described above need not include the mask expansion circuit 300. In other words, the supermask 20 need not be provided in a compressed state. On the other hand, when the supermask 20 is provided in a compressed state and the mask expansion circuit 300 expands the compressed supermask, memory access can be reduced as described above.

In the embodiment described above, the number of input channels to be processed is 16, but the configuration is not limited to this; the number of input channels to be processed need not be 16. Similarly, the number of output channels (kernels) to be processed is 16 in the embodiment described above, but it need not be 16. Furthermore, the horizontal and vertical size of the chunk 520 is 4 × 4 in the embodiment described above, but it need not be 4 × 4.

In the above examples, the program includes a set of instructions (or software code) that, when loaded into a computer, causes the computer to perform one or more of the functions described in the embodiments. The program may be stored on a non-transitory computer-readable medium or a tangible storage medium. By way of example and not limitation, the computer-readable medium or tangible storage medium includes random-access memory (RAM), read-only memory (ROM), flash memory, a solid-state drive (SSD) or other memory technology, CD-ROM, digital versatile disk (DVD), Blu-ray (registered trademark) disc or other optical disc storage, and a magnetic cassette, magnetic tape, magnetic disk storage, or other magnetic storage device. The program may also be transmitted on a transitory computer-readable medium or a communication medium. By way of example and not limitation, the transitory computer-readable medium or communication medium includes electrical, optical, acoustic, or other forms of propagated signals.

10 Initial model
12 Node
13 Connection
14 Weight
20 Supermask
30 Score
40 Subnetwork
100 Neural network circuit device
110 Control circuit
130 Barrel shifter
140 Post-processing circuit
141 Floating-point type conversion unit
142 Batch norm unit
143 Block floating-point type conversion unit
144 Activation function processing unit
145 Pooling unit
200 Weight generation circuit
202 Random number generator
204 Selector
210 Weight generation unit
220 Kernel weight group
222 Kernel weight portion
300 Mask expansion circuit
302 Compressed data memory
304 FIFO buffer
306 Decoder
310 Mask memory
312 Bank
320 Kernel mask group
322 Kernel mask portion
400 Arithmetic circuit
410 PE matrix
420 PE vector
430 Arithmetic element
500 Activation value data memory
510 Input data
520 Chunk
522 Chunk portion

Claims

1. A neural network circuit device that realizes a neural network model generated, in a learning stage, by setting the weight of each connection between nodes to a fixed random number, generating a mask by determining whether each of the connections is valid or invalid, and using the weights and the mask, the neural network circuit device comprising:
a weight generation circuit that, when performing inference, generates the weights using a random number generator configured to generate the same random numbers as those generated in the learning stage, using the same seed as that used in the learning stage; and
an arithmetic circuit that performs a product-sum operation using the weights, the mask, and input data that is inference target data or activation value data corresponding to the inference target data,
wherein the weight generation circuit generates, as the weights, random numbers generated using a different seed for each output channel.

2. The neural network circuit device according to claim 1, further comprising a control circuit that generates the seed using an output channel number and a layer number of a layer constituting the neural network model.

3. The neural network circuit device according to claim 2, wherein the control circuit generates the seed by a hash function that receives the output channel number and the layer number as input.

4. The neural network circuit device according to claim 1, further comprising a mask expansion circuit that expands the mask in a compressed state when performing inference,
wherein the arithmetic circuit performs the product-sum operation using the mask expanded by the mask expansion circuit.

5. The neural network circuit device according to claim 4, wherein the mask in the compressed state is compressed using a zero run-length compression method.

6. The neural network circuit device according to claim 5, wherein the mask in the compressed state is compressed at a higher compression rate as the mask becomes sparser.

7. The neural network circuit device according to claim 1, wherein the arithmetic circuit performs an operation such that the product of the input data and the weight is enabled or disabled by the mask.

8. The neural network circuit device according to claim 7, wherein the arithmetic circuit multiplies the logical product of the input data input to a certain connection and a value in the mask indicating validity or invalidity corresponding to that connection by the weight corresponding to that connection.

9. The neural network circuit device according to claim 1, wherein the mask is generated for at least one of the accuracy of inference and the inference target.