JP7199283B2

JP7199283B2 - accelerator device

Info

Publication number: JP7199283B2
Application number: JP2019064104A
Authority: JP
Inventors: 啓太山口
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2019-03-28
Filing date: 2019-03-28
Publication date: 2023-01-05
Anticipated expiration: 2039-03-28
Also published as: JP2020166368A

Description

本開示は、疎行列ベクトル積を並列計算するアクセラレータ装置に関する。 The present disclosure relates to an accelerator device for parallel computing of sparse matrix-vector products.

近年、ＡＩ（Artificial Intelligence）、機械学習の分野では、その計算過程において大規模な行列を用いた計算が行われる。例えば、全結合型ニューラルネットワークの計算では、大規模なパラメータ行列が用いられ、当該パラメータ行列の計算をプロセッサで処理されている。大規模なパラメータ行列をプロセッサで処理する場合、計算量が多くなり計算の処理負担が大きくなる。 In recent years, in the fields of AI (Artificial Intelligence) and machine learning, calculations using large-scale matrices are performed in the calculation process. For example, in the computation of a fully-connected neural network, a large-scale parameter matrix is used, and the computation of the parameter matrix is processed by a processor. When a processor processes a large-scale parameter matrix, the amount of calculation increases and the processing load of the calculation increases.

そのため、パラメータ行列をプルーニング（Pruning）などでパラメータを削減することで、多くの要素が零で構成されている疎行列に変換して、パラメータ容量および計算量を削減する処理が行われている。疎行列は要素の多くが零で構成される行列であるので、プロセッサでは非零要素のみを計算することになり計算量を削減できる。 Therefore, by reducing the parameters by pruning or the like, the parameter matrix is converted to a sparse matrix in which many elements are composed of zeros, thereby reducing the parameter capacity and the amount of calculation. Since a sparse matrix is a matrix in which most of the elements are composed of zeros, the processor calculates only the nonzero elements, thereby reducing the amount of calculation.

このような大規模なパラメータ行列を変換して得られた疎行列を計算するためにアクセラレータ装置が開発されている。アクセラレータ装置として、例えば、特許文献１、非特許文献１，２などに開示されている。これらアクセラレータ装置では、疎行列ベクトル積（ＳｐＭＶ：Sparse Matrix-Vector Multiplication）を並列計算することで、高速計算を実現している。 Accelerator devices have been developed to compute sparse matrices obtained by transforming such large-scale parameter matrices. Examples of accelerator devices are disclosed in Patent Document 1, Non-Patent Documents 1 and 2, and the like. These accelerator devices realize high-speed computation by performing parallel computation of sparse matrix-vector multiplication (SpMV).

米国特許出願公開第２０１８／００４６９００号明細書U.S. Patent Application Publication No. 2018/0046900

Fowers, J. et al. 、” A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication"、IEEE Int. Symp. on Field-Programmable Custom Computing Machines (2014)Fowers, J. et al., ``A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication'', IEEE Int. Symp. on Field-Programmable Custom Computing Machines (2014) Song Han. et al. 、"ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA"、International Symposium on FPGA 2017Song Han. et al., "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA", International Symposium on FPGA 2017

しかし、疎行列ベクトル積（ＳｐＭＶ）では、疎行列の非零要素と、それに対応する入力ベクトルの要素との積和から、出力ベクトルの要素を計算する。そのため、疎行列ベクトル積（ＳｐＭＶ）を行うアクセラレータ装置では、疎行列の非零要素のみを計算するので、入力ベクトルの要素へのアクセスが疎行列の非零要素の構造に基づいて不連続なアクセスとなる。 However, in the sparse matrix vector product (SpMV), the elements of the output vector are calculated from the sum of the products of the non-zero elements of the sparse matrix and the corresponding elements of the input vector. Therefore, in an accelerator device that performs sparse matrix vector multiplication (SpMV), only the nonzero elements of the sparse matrix are calculated, so access to the elements of the input vector is discontinuous based on the structure of the nonzero elements of the sparse matrix. becomes.

また、アクセラレータ装置において、疎行列ベクトル積（ＳｐＭＶ）を並列計算するには、例えば、出力ベクトルの１つの要素の計算を、１つの並列実行の単位（スレッド）に割当て、並列計算することになる。この場合、アクセラレータ装置では、並列実行するスレッド間で、不連続な入力ベクトルの要素に対して並列アクセスするための機構を設ける必要があった。 In addition, in order to perform parallel calculation of a sparse matrix-vector product (SpMV) in an accelerator device, for example, the calculation of one element of an output vector is assigned to one parallel execution unit (thread) for parallel calculation. . In this case, in the accelerator device, it is necessary to provide a mechanism for parallel access to discontinuous input vector elements between threads executing in parallel.

アクセラレータ装置では、入力ベクトルの要素への不連続なアクセスを行うため入力データを複製してスレッドごとのメモリに格納する構成が必要であった。また、アクセラレータ装置では、不連続な入力ベクトルの要素に対して並列アクセスするために、入力データを分割して多数の独立したメモリバンクに格納し、スレッド間の並列アクセスを分散する構成が必要であった。アクセラレータ装置では、これらの構成を設けることでスレッドごとに冗長な複製メモリが必要であったり、また多数の独立したメモリバンクが必要であったりするなど、多くのハードウェア資源を必要とする問題があった。 In order to perform discontinuous access to the elements of the input vector, the accelerator device requires a configuration in which the input data is duplicated and stored in the memory for each thread. In addition, in order to access discontinuous input vector elements in parallel, the accelerator device requires a configuration in which input data is divided and stored in a large number of independent memory banks, and parallel accesses are distributed among threads. there were. In the accelerator device, by providing these configurations, there is a problem that a lot of hardware resources are required, such as redundant duplicated memory for each thread and a large number of independent memory banks. there were.

そこで、本開示は、上述した課題を解決するためのものであり、疎行列ベクトル積を並列計算するアクセラレータ装置において、必要とするハードウェア資源を削減することができるアクセラレータ装置を提供することを目的としている。 Therefore, the present disclosure is intended to solve the above-described problems, and an object thereof is to provide an accelerator device capable of reducing required hardware resources in an accelerator device for parallel calculation of a sparse matrix-vector product. and

本開示のある実施形態に従うと、疎行列ベクトル積の計算を含む演算処理を実行するアクセラレータ装置であって、疎行列の非零要素の各々と入力ベクトルの要素の各々との積和を含む出力ベクトルの要素の各々の計算処理を並列に実行する並列計算部と、疎行列の非零要素の各々、入力ベクトルの要素の各々、および出力ベクトルの要素の各々を格納する記憶部と、を備え、並列計算部は、記憶部に記憶している要素の各々に対して並列にアクセスすることができるデータロード部と、データロード部がアクセスした入力ベクトルの要素の各々と、疎行列の非零要素の各々との積和を計算する演算部と、データロード部が記憶部に記憶している入力ベクトルの要素の各々に対して並列にアクセスする順序に従い、疎行列の非零要素の各々の配置にパディングを挿入する挿入部とを有する。 According to an embodiment of the present disclosure, an accelerator apparatus for performing an arithmetic operation comprising computing a sparse matrix-vector product, the output comprising the sum of products of each non-zero element of a sparse matrix and each element of an input vector a parallel computing unit that executes computation processing of each element of the vector in parallel; and a storage unit that stores each of the non-zero elements of the sparse matrix, each of the input vector elements, and each of the output vector elements. , the parallel computing unit includes a data loading unit that can access each of the elements stored in the storage unit in parallel, each of the elements of the input vector accessed by the data loading unit and the nonzero Each non-zero element of the sparse matrix is processed in accordance with the order in which the operation unit that calculates the sum of products with each element and the data load unit access each element of the input vector stored in the storage unit in parallel. an insert for inserting padding into the arrangement.

本開示のアクセラレータ装置では、記憶部に記憶している入力ベクトルの要素の各々に対して並列にアクセスする順序に従い、疎行列の非零要素の各々の配置にパディングを挿入することで、必要とするハードウェア資源を削減することができる。 In the accelerator device of the present disclosure, by inserting padding into the placement of each non-zero element of the sparse matrix according to the order in which each element of the input vector stored in the storage unit is accessed in parallel, hardware resources to be used can be reduced.

本実施の形態１に係るアクセラレータ装置の構成を示すブロック図である。1 is a block diagram showing the configuration of an accelerator device according to Embodiment 1; FIG. 本実施の形態１に係るＳｐＭＶユニットの構成を示すブロック図である。2 is a block diagram showing the configuration of an SpMV unit according to the first embodiment; FIG. 本実施の形態１に係るポストプロセスユニットの構成を示すブロック図である。3 is a block diagram showing the configuration of a post-process unit according to the first embodiment; FIG. 要素の多くが零で構成される疎行列の一例を示す図である。FIG. 2 is a diagram showing an example of a sparse matrix in which many elements are zero; 図４に示した疎行列のＥＬＬＰＡＣＫ形式で表した一例を示す図である。5 is a diagram showing an example of the sparse matrix shown in FIG. 4 expressed in ELLPACK format; FIG. 外部メモリに格納されるＥＬＬＰＡＣＫ形式の疎行列の一例を示す図である。FIG. 3 is a diagram showing an example of an ELLPACK format sparse matrix stored in an external memory; 疎行列ベクトル積の計算処理を行うプログラムの一例を示した図である。FIG. 10 is a diagram showing an example of a program for calculating a sparse matrix-vector product; 疎行列の行ごとの疎行列ベクトル積の計算処理を行うプログラムの一例を示した図である。FIG. 10 is a diagram showing an example of a program for calculating a sparse matrix-vector product for each row of a sparse matrix; 図８に示すプログラムを、疎行列の行ごと複数のスレッドとして並列実行する場合の一例を示す図である。FIG. 9 is a diagram showing an example of parallel execution of the program shown in FIG. 8 as a plurality of threads for each row of a sparse matrix; 疎行列のインデックスのリストの一例を示す図である。FIG. 10 is a diagram showing an example of a list of indices of a sparse matrix; 図１０に示す疎行列のインデックスのリストに基づいて入力ベクトルの要素にアクセスする順序の一例を示す図である。FIG. 11 is a diagram showing an example of the order of accessing elements of an input vector based on the list of indices of the sparse matrix shown in FIG. 10; 図１０に示す疎行列のインデックスのリストに対してパディングを平準化して挿入した一例を示す図である。FIG. 11 is a diagram showing an example in which padding is leveled and inserted into the sparse matrix index list shown in FIG. 10 ; 疎行列のインデックスのリストに対してパディングを平準化して挿入する手順について示すフローチャートである。FIG. 11 is a flowchart showing a procedure for leveling and inserting padding into a list of indices of a sparse matrix; FIG. 本実施の形態１に係るデータバッファのメモリ構成を示すブロック図である。3 is a block diagram showing a memory configuration of a data buffer according to the first embodiment; FIG. 図１４に示すデータバッファにアクセスするためのインデックスの一例を示す図である。15 is a diagram showing an example of an index for accessing the data buffer shown in FIG. 14; FIG. 本実施の形態１に係るロードユニットの構成を示すブロック図である。2 is a block diagram showing the configuration of a load unit according to the first embodiment; FIG. 本実施の形態２に係るデータバッファのメモリ構成を示すブロック図である。FIG. 9 is a block diagram showing a memory configuration of a data buffer according to the second embodiment; FIG. 図１７に示すデータバッファにアクセスするためのインデックスの一例を示す図である。18 is a diagram showing an example of an index for accessing the data buffer shown in FIG. 17; FIG. 本実施の形態２に係るロードユニットの構成を示すブロック図である。FIG. 9 is a block diagram showing the configuration of a load unit according to the second embodiment; FIG.

以下に、本開示の実施の形態に係るアクセラレータ装置について図面を参照して詳しく説明する。なお、図中同一符号は同一または相当部分を示す。 Below, an accelerator device according to an embodiment of the present disclosure will be described in detail with reference to the drawings. The same reference numerals in the drawings indicate the same or corresponding parts.

（実施の形態１）
大規模な疎行列を高速計算するハードウェアのアクセラレータ装置として、例えばＦＰＧＡ（Field Programmable Gate Array）を用いて実現することができる。ＦＰＧＡを用いたアクセラレータ装置では、メモリ、ロジック、演算器などのハードウェア資源を用いて、疎行列ベクトル積（ＳｐＭＶ）を並列計算している。以下の実施の形態では、ＦＰＧＡを用いたアクセラレータ装置について説明するが、ハードウェアで構成されたアクセラレータ装置であれば、ＦＰＧＡを用いたアクセラレータ装置に限定されない。 (Embodiment 1)
A hardware accelerator device for high-speed calculation of large-scale sparse matrices can be implemented using, for example, an FPGA (Field Programmable Gate Array). Accelerator devices using FPGAs use hardware resources such as memories, logics, and computing units to perform parallel computation of sparse matrix-vector products (SpMV). In the following embodiments, an accelerator device using FPGA will be described, but the accelerator device is not limited to an accelerator device using FPGA as long as it is configured by hardware.

まず、疎行列ベクトル積（ＳｐＭＶ）を並列計算するＦＰＧＡを用いたアクセラレータ装置の構成について説明する。図１は、本実施の形態１に係るアクセラレータ装置１００の構成を示すブロック図である。図１に示すアクセラレータ装置（FPGA Accelerator）１００は、例えば全結合型ニューラルネットワークの計算を実行するためのアクセラレータ装置である。アクセラレータ装置１００は、ホストプロセッサ（図示せず）と接続するシステムバス（System Bus）２００、および外部メモリ（External Memory）３００とそれぞれ接続されている。 First, the configuration of an accelerator device using FPGA for parallel calculation of sparse matrix-vector product (SpMV) will be described. FIG. 1 is a block diagram showing the configuration of the accelerator device 100 according to the first embodiment. An accelerator device (FPGA accelerator) 100 shown in FIG. 1 is, for example, an accelerator device for executing computation of a fully-connected neural network. The accelerator device 100 is connected to a system bus (System Bus) 200 connected to a host processor (not shown) and an external memory (External Memory) 300 .

アクセラレータ装置１００において全結合型ニューラルネットワークの計算を行う場合、ホストプロセッサからの入力データおよび出力データは、システムバス２００を経由して、アクセラレータ装置１００の内部のデータバッファに格納される。全結合型ニューラルネットワークの計算に用いる重みパラメータ（疎行列）は、ホストプロセッサからシステムバス２００、アクセラレータ装置１００を経由して外部メモリ３００に格納される。 When the accelerator device 100 performs computation of the fully-connected neural network, input data and output data from the host processor are stored in a data buffer inside the accelerator device 100 via the system bus 200 . Weight parameters (sparse matrix) used for calculation of the fully-connected neural network are stored in the external memory 300 from the host processor via the system bus 200 and the accelerator device 100 .

アクセラレータ装置１００は、メモリインターフェース／インターコネクト（Memory Interface／Interconnect）１０、データバッファ（Data Buffer）２０、ＳｐＭＶユニット（SpMV Unit）３０、ポストプロセスユニット（Post-Process Unit）４０を含んでいる。メモリインターフェース／インターコネクト１０は、システムバス２００、および外部メモリ３００とそれぞれ接続されている。メモリインターフェース／インターコネクト１０は、システムバス２００を介してホストプロセッサと接続されている。メモリインターフェース／インターコネクト１０は、アクセラレータ装置１００の内部のデータバッファ２０、ＳｐＭＶユニット３０、およびポストプロセスユニット４０の間を相互接続している。 The accelerator device 100 includes a memory interface/interconnect 10 , a data buffer 20 , an SpMV unit 30 and a post-process unit 40 . Memory interface/interconnect 10 is connected to system bus 200 and external memory 300, respectively. Memory interface/interconnect 10 is connected to the host processor via system bus 200 . Memory interface/interconnect 10 interconnects between data buffer 20 , SpMV unit 30 and post-processing unit 40 within accelerator device 100 .

データバッファ２０は、ホストプロセッサからの入力データおよび出力データ（次層の入力データ）を格納する。ＳｐＭＶユニット３０は、データバッファ２０から読み出した入力データ列（入力ベクトルの要素）と、外部メモリ３００から読み出した重みパラメータ列（疎行列の非零要素）との積和を並列計算する。ポストプロセスユニット４０は、ＳｐＭＶユニット３０の後処理を行い。例えば、バイアスを加算する処理および活性化関数の処理を並列計算して出力データ列（出力ベクトルの要素）をデータバッファ２０に格納する。 The data buffer 20 stores input data and output data (next layer input data) from the host processor. The SpMV unit 30 performs parallel calculation of sums of products of input data strings (input vector elements) read from the data buffer 20 and weight parameter strings (sparse matrix non-zero elements) read from the external memory 300 . Post-processing unit 40 performs post-processing of SpMV unit 30 . For example, the process of adding the bias and the process of the activation function are calculated in parallel, and the output data string (element of the output vector) is stored in the data buffer 20 .

次に、ＳｐＭＶユニット３０の構成について説明する。図２は、本実施の形態１に係るＳｐＭＶユニット３０の構成を示すブロック図である。ＳｐＭＶユニット３０は、疎行列ロードユニット（Sparse Matrix Load Unit）３１、疎行列のインデックスのリストを格納するインデックスバッファ（Index Buffer）３２、疎行列の非零要素の値のリストを格納するバリューバッファ（Value Buffer）３３、データロードユニット（Data Load Unit）３４、ＦＭＡ（Fused Multiply Add：融合積和演算）アレイ（FMA Array）３５、コントロールユニット（Control Unit）３６を含んでいる。 Next, the configuration of the SpMV unit 30 will be described. FIG. 2 is a block diagram showing the configuration of the SpMV unit 30 according to the first embodiment. The SpMV unit 30 includes a sparse matrix load unit 31, an index buffer 32 that stores a list of sparse matrix indices, and a value buffer ( Value Buffer) 33 , Data Load Unit 34 , FMA (Fused Multiply Add) Array (FMA Array) 35 , and Control Unit 36 .

疎行列ロードユニット３１は、外部メモリ３００から重みパラメータ列（疎行列の非零要素）を読み出し、読み出した重みパラメータ列（疎行列の非零要素）から疎行列のインデックスのリストをインデックスバッファ３２に、疎行列の非零要素のリストをバリューバッファ３３にそれぞれ格納する。インデックスバッファ３２およびバリューバッファ３３は、並列計算に必要なデータ幅に構成されている。 The sparse matrix load unit 31 reads the weight parameter string (non-zero elements of the sparse matrix) from the external memory 300, and stores a list of sparse matrix indexes from the read weight parameter string (non-zero elements of the sparse matrix) in the index buffer 32. , a list of non-zero elements of the sparse matrix are stored in the value buffer 33, respectively. The index buffer 32 and value buffer 33 are configured to have a data width required for parallel computation.

データロードユニット３４は、インデックスバッファ３２に格納した疎行列のインデックスのリストから並列計算を行うインデックス列を参照し、データバッファ２０から並列計算を行う入力データ列（入力ベクトルの要素）を読み出す。なお、データロードユニット３４が、疎行列のインデックスのリストに基づき、データバッファ２０にアクセスして入力ベクトルの要素を読み出す処理については、以下で詳しく説明する。 The data load unit 34 refers to the index columns for parallel calculation from the sparse matrix index list stored in the index buffer 32 and reads out the input data columns (input vector elements) for parallel calculation from the data buffer 20 . The process by which the data load unit 34 accesses the data buffer 20 and reads the elements of the input vector based on the list of sparse matrix indices will be described in detail below.

ＦＭＡアレイ３５は、複数のＦＭＡユニットで構成され、各々のＦＭＡユニットでデータロードユニット３４から入力データ列（入力ベクトルの要素）、バリューバッファ３３から疎行列の非零要素のデータ列をそれぞれ受け取り、融合積和を並列計算する。ＦＭＡアレイ３５は、疎行列の全ての列を並列計算した後、出力データ列（出力ベクトルの要素）をポストプロセスユニット４０に転送する。コントロールユニット３６は、全結合型ニューラルネットワークの各層の計算を繰り返し実行するためのシーケンス制御を行う。 The FMA array 35 is composed of a plurality of FMA units, each of which receives an input data string (input vector elements) from the data load unit 34 and a data string of non-zero elements of the sparse matrix from the value buffer 33, Parallel calculation of the fused sum of products. After FMA array 35 computes all the columns of the sparse matrix in parallel, it transfers the output data columns (elements of the output vector) to post-processing unit 40 . The control unit 36 performs sequence control for repeatedly executing calculations in each layer of the fully-connected neural network.

なお、ＳｐＭＶユニット３０では、疎行列ベクトル積（ＳｐＭＶ）を並列計算するために、疎行列ロードユニット３１、データロードユニット３４、およびＦＭＡアレイ３５を構成するＦＭＡユニットがスレッドに対応してそれぞれ複数存在し、それぞれが並列に接続されている。 In the SpMV unit 30, a plurality of FMA units constituting the sparse matrix load unit 31, the data load unit 34, and the FMA array 35 are provided in correspondence with the threads in order to perform parallel calculation of the sparse matrix-vector product (SpMV). and are connected in parallel.

次に、ポストプロセスユニット４０の構成について説明する。図３は、本実施の形態１に係るポストプロセスユニット４０の構成を示すブロック図である。ポストプロセスユニット４０は、バイアスロードユニット（Bias Load Unit）４１、バイアスを格納するバイアスバッファ（Bias Buffer）４２、バイアスを加算する処理を行うバイアス加算ユニット（Bias Add）４３、ＬＵＴ（Look Up Table）ロードユニット（LUT Load Unit）４４、活性化関数の処理を実行するアクティベーションユニット（Activation）４５、データストアユニット（Data Store Unit）４６、コントロールユニット（Control Unit）４７を含んでいる。 Next, the configuration of the post-process unit 40 will be described. FIG. 3 is a block diagram showing the configuration of the post-process unit 40 according to the first embodiment. The post-processing unit 40 includes a bias load unit 41, a bias buffer 42 that stores bias, a bias add unit 43 that performs bias addition processing, and a LUT (Look Up Table). It includes a load unit (LUT Load Unit) 44 , an activation unit (Activation) 45 that executes activation function processing, a data store unit (Data Store Unit) 46 , and a control unit (Control Unit) 47 .

バイアスロードユニット４１は、外部メモリ３００からバイアスパラメータを読み出し、バイアスバッファ４２に格納する。このバイアスバッファ４２は、並列計算に必要なデータ幅に構成されている。バイアス加算ユニット４３は、ＳｐＭＶユニット３０の出力データ列（出力ベクトルの要素）と、バイアスバッファ４２から対応するバイアスパラメータ列を受け取り、バイアスを加算する並列計算を行う。 The bias load unit 41 reads bias parameters from the external memory 300 and stores them in the bias buffer 42 . This bias buffer 42 is configured to have a data width required for parallel computation. The bias addition unit 43 receives the output data string (element of the output vector) of the SpMV unit 30 and the corresponding bias parameter string from the bias buffer 42, and performs parallel calculation of adding the biases.

ＬＵＴロードユニット４４は、活性化関数のアルゴリズム（ＬＵＴ構成とする）を事前に設定する。アクティベーションユニット４５は、バイアス加算ユニット４３からの出力データに対して活性化関数を処理する並列計算を行う。データストアユニット４６は、アクティベーションユニット４５からの出力データ列を受け取り、メモリインターフェース／インターコネクト１０を経由して、データバッファ２０に格納（書き戻す）する。コントロールユニット４７は、全結合型ニューラルネットワークの各層の計算を繰り返し実行するためのシーケンス制御を行う。 The LUT load unit 44 presets the activation function algorithm (assumed to be a LUT configuration). Activation unit 45 performs parallel computation processing activation functions on the output data from bias summation unit 43 . The data store unit 46 receives the output data string from the activation unit 45 and stores (writes back) it in the data buffer 20 via the memory interface/interconnect 10 . The control unit 47 performs sequence control for repeatedly executing calculations in each layer of the fully-connected neural network.

なお、ポストプロセスユニット４０では、バイアスを加算する処理および活性化関数の処理を並列計算するために、バイアスロードユニット４１、バイアス加算ユニット４３、ＬＵＴロードユニット４４、アクティベーションユニット４５、およびデータストアユニット４６がスレッドに対応してそれぞれ複数存在し、それぞれが並列に接続されている。 In the post-processing unit 40, a bias load unit 41, a bias addition unit 43, an LUT load unit 44, an activation unit 45, and a data store unit are used for parallel calculation of bias addition processing and activation function processing. A plurality of 46 exist corresponding to the threads, and are connected in parallel.

次に、アクセラレータ装置１００において、疎行列ベクトル積を並列計算する場合の疎行列の処理について詳しく説明する。疎行列は、非零要素のみに圧縮したデータ形式で、外部メモリ３００に格納される。本実施の形態では、疎行列の格納形式の一例としてＥＬＬＰＡＣＫ形式を用いる。ＥＬＬＰＡＣＫ形式では、疎行列の非零要素の値のリストと、非零要素の列位置を示す疎行列のインデックスのリストとで疎行列を表す形式である。 Next, the processing of a sparse matrix in parallel calculation of a sparse matrix vector product in the accelerator device 100 will be described in detail. A sparse matrix is stored in the external memory 300 in a data format in which only non-zero elements are compressed. In this embodiment, the ELLPACK format is used as an example of a storage format for a sparse matrix. In the ELLPACK format, a sparse matrix is represented by a list of non-zero element values of the sparse matrix and a list of sparse matrix indices indicating the column positions of the non-zero elements.

ＥＬＬＰＡＣＫ形式で表した疎行列について具体例を示して説明する。図４は、要素の多くが零で構成される疎行列の一例を示す図である。図５は、図４に示した疎行列のＥＬＬＰＡＣＫ形式で表した一例を示す図である。図５に示す要素のうち「＊」で示した要素はパディングを挿入した要素を示している。なお、パディングを挿入した要素の値は任意とする。ＥＬＬＰＡＣＫ形式では、データ（ｄａｔａ）には、疎行列の非零要素の値が格納され、インデックス（ｉｎｄｅｘ）には、非零要素の値の列位置が格納される、例えば、図１に示す０行（ｒｏｗ）目［３，７，０，０］は、非零要素が０列（ｃｏｌｕｍｎ）目と１列目とに存在する。そのため、ＥＬＬＰＡＣＫ形式で表わすと、ｄａｔａ［３，７］でｉｎｄｅｘ［０，１］となる。なお、２行目［１，０，５，９］は、非零要素の値が３つあるので、他の行目にはパディングが挿入されている。 A specific example of a sparse matrix expressed in the ELLPACK format will be described. FIG. 4 is a diagram showing an example of a sparse matrix in which many elements are zero. FIG. 5 is a diagram showing an example of the sparse matrix shown in FIG. 4 expressed in ELLPACK format. Among the elements shown in FIG. 5, the elements indicated by "*" indicate the elements in which padding is inserted. Note that the value of the element into which the padding is inserted is arbitrary. In the ELLPACK format, the data stores the values of non-zero elements in the sparse matrix, and the index stores the column position of the non-zero element values. In the row [3,7,0,0], non-zero elements are present in the 0th column and the 1st column. Therefore, when expressed in the ELLPACK format, data[3,7] becomes index[0,1]. Since the second row [1, 0, 5, 9] has three non-zero element values, padding is inserted in the other rows.

図５に示すＥＬＬＰＡＣＫ形式の疎行列を外部メモリ３００に格納する場合、列優先で格納される。図６は、外部メモリ３００に格納されるＥＬＬＰＡＣＫ形式の疎行列の一例を示す図である。図６に示すｄａｔａでは、疎行列の非零要素の値を１列目、２列目、３列目の順に並べて外部メモリ３００に格納されている。また、図６に示すｉｎｄｅｘでは、非零要素の値の列位置を１列目、２列目、３列目の順に並べて外部メモリ３００に格納されている。 When the ELLPACK format sparse matrix shown in FIG. 5 is stored in the external memory 300, it is stored in column priority. FIG. 6 is a diagram showing an example of an ELLPACK format sparse matrix stored in the external memory 300 . In the data shown in FIG. 6, the values of the non-zero elements of the sparse matrix are stored in the external memory 300 in the order of the first, second, and third columns. In the index shown in FIG. 6, the column positions of the values of the non-zero elements are arranged in the order of the first column, the second column, and the third column and stored in the external memory 300 .

次に、ＥＬＬＰＡＣＫ形式で表した疎行列に対する疎行列ベクトル積（ＳｐＭＶ）の計算処理について説明する。図７は、疎行列ベクトル積（ＳｐＭＶ）の計算処理を行うプログラムの一例を示した図である。図７に示すプログラムでは、Ａを疎行列、ｘとｙとをそれぞれ列ベクトルとした場合に、ｙ＝Ａｘを計算している。疎行列ＡのサイズはＭ行×Ｎ列とし、ＥＬＬＰＡＣＫ形式で表した疎行列ＡのサイズはＭ行×Ｋ（≦Ｎ）列となる。なお、パディングで挿入する値は、計算に影響を与えない値（零）とし、疎行列ベクトル積（ＳｐＭＶ）での計算で処理をスキップさせる。また、図７に示すプログラムは、図１に示すＳｐＭＶユニット３０で実行される。 Next, calculation processing of a sparse matrix vector product (SpMV) for a sparse matrix expressed in ELLPACK format will be described. FIG. 7 is a diagram showing an example of a program for calculating a sparse matrix-vector product (SpMV). The program shown in FIG. 7 calculates y=Ax where A is a sparse matrix and x and y are column vectors. The size of the sparse matrix A is M rows×N columns, and the size of the sparse matrix A expressed in ELLPACK format is M rows×K (≦N) columns. Note that the value inserted by padding is a value (zero) that does not affect the calculation, and the process is skipped in the calculation of the sparse matrix-vector product (SpMV). Also, the program shown in FIG. 7 is executed by the SpMV unit 30 shown in FIG.

本実施の形態では、複数のデータに対して同じ処理を適用するＳＩＭＤ（Single Instruction Multiple Data）方式で、疎行列ベクトル積（ＳｐＭＶ）の並列計算を行ってもよい。このＳＩＭＤ方式では、複数のスレッド間で並列計算を行う際に、並列計算に必要なデータ幅で構成された実行機構を共有することができる。そのため、ＳｐＭＶユニット３０にＳＩＭＤ方式を採用することで、スレッドごとに機構を実装する方式に比べて、少ないハードウェア資源で疎行列ベクトル積（ＳｐＭＶ）の並列計算を実現することができる。 In the present embodiment, parallel computation of sparse matrix-vector product (SpMV) may be performed by SIMD (Single Instruction Multiple Data), which applies the same processing to a plurality of data. In this SIMD method, when performing parallel computation among a plurality of threads, an execution mechanism configured with a data width required for parallel computation can be shared. Therefore, by adopting the SIMD method for the SpMV unit 30, parallel calculation of sparse matrix-vector product (SpMV) can be realized with fewer hardware resources than a method of implementing a mechanism for each thread.

さらに、本実施の形態では、図７に示したプログラムを疎行列の行（ｒｏｗ）ごとに展開できるようなループ形式のプログラムを構成している。図８は、疎行列の行ごとの疎行列ベクトル積（ＳｐＭＶ）の計算処理を行うプログラムの一例を示した図である。図８に示す疎行列ベクトル積のプログラムを、疎行列の行ごと複数のスレッドとして並列実行することで、疎行列ベクトル積（ＳｐＭＶ）の並列計算を実現することができる。図９は、図８に示すプログラムを、疎行列の行ごと複数のスレッドとして並列実行する場合の一例を示す図である。ＳｐＭＶユニット３０は、ＳＩＭＤ方式を採用した場合、図８に示すプログラムを図９に示す方式で並列実行することで実現することができ、複数のデータに対して同じプログラムを適用する。 Furthermore, in the present embodiment, a loop-type program is constructed so that the program shown in FIG. 7 can be expanded for each row of a sparse matrix. FIG. 8 is a diagram showing an example of a program for calculating a sparse matrix vector product (SpMV) for each row of a sparse matrix. Parallel computation of the sparse matrix-vector product (SpMV) can be realized by executing the sparse matrix-vector product program shown in FIG. 8 in parallel as a plurality of threads for each row of the sparse matrix. FIG. 9 is a diagram showing an example of parallel execution of the program shown in FIG. 8 as a plurality of threads for each row of a sparse matrix. The SpMV unit 30 can be realized by parallelly executing the program shown in FIG. 8 in the method shown in FIG. 9 when the SIMD method is adopted, and the same program is applied to a plurality of data.

本実施の形態では、疎行列を格納する外部メモリ３００、入力データ（入力ベクトル）および出力データ（出力ベクトル）を格納するデータバッファ２０のそれぞれが、共有メモリ方式としスレッド間で同じメモリをアクセスする。なお、疎行列のインデックスのリストを格納するインデックスバッファ３２、疎行列の非零要素の値のリストを格納するバリューバッファ３３のそれぞれも共有メモリ方式である。 In this embodiment, the external memory 300 that stores the sparse matrix and the data buffer 20 that stores the input data (input vector) and the output data (output vector) are each of a shared memory system, and the same memory is accessed between threads. . The index buffer 32 that stores the list of indices of the sparse matrix and the value buffer 33 that stores the list of non-zero element values of the sparse matrix are also of the shared memory type.

外部メモリ３００から疎行列を読み出す場合、疎行列ロードユニット３１は、スレッド間で連続する要素を並列アクセスすることができる。また、データバッファ２０から出力ベクトルを読み出す場合、データロードユニット３４は、スレッド間で連続する要素を並列アクセスすることができる。しかし、データバッファ２０から入力ベクトルを読み出す場合、データロードユニット３４は、インデックスバッファ３２に格納した疎行列のインデックスのリストで間接参照して、スレッド間で不連続な要素を並列アクセスすることになる。 When reading a sparse matrix from the external memory 300, the sparse matrix load unit 31 can access consecutive elements in parallel between threads. Also, when reading the output vector from the data buffer 20, the data load unit 34 can access consecutive elements in parallel between threads. However, when reading an input vector from the data buffer 20, the data load unit 34 indirectly references the sparse matrix index list stored in the index buffer 32 to access discontinuous elements in parallel between threads. .

具体例を示して説明する。図１０は、疎行列のインデックスのリストの一例を示す図である。図１１は、図１０に示す疎行列のインデックスのリストに基づいて入力ベクトルの要素にアクセスする順序の一例を示す図である。データロードユニット３４は、図１０に示すｉｎｄｅｘの０列目の［４，６，２，９］を１回目の繰り返し処理（Iteration1）で読み出し、ｉｎｄｅｘの１列目の［１３，１０，７，１３］を２回目の繰り返し処理（Iteration2）で読み出す（図１１）。 A specific example will be shown and explained. FIG. 10 is a diagram showing an example of a sparse matrix index list. FIG. 11 is a diagram showing an example of the order of accessing the elements of the input vector based on the sparse matrix index list shown in FIG. The data load unit 34 reads [4, 6, 2, 9] in the 0th column of the index shown in FIG. 13] is read in the second iteration (Iteration 2) (FIG. 11).

しかし、ＳｐＭＶユニット３０が、ＳＩＭＤ方式を採用した場合、並列実行するスレッド間でメモリアクセスの機構や演算器などの実行機構が共有される。そのため、データロードユニット３４は、データバッファ２０から入力ベクトルを読み出す場合、並列計算に必要なデータ幅でメモリへのアクセスが構成され、スレッド間で連続するデータ列を並列にアクセスすることになる。 However, when the SpMV unit 30 adopts the SIMD method, the execution mechanisms such as the memory access mechanism and arithmetic units are shared between threads executing in parallel. Therefore, when the data load unit 34 reads an input vector from the data buffer 20, access to the memory is configured with a data width required for parallel calculation, and consecutive data strings are accessed in parallel between threads.

データロードユニット３４は、スレッド間で不連続なデータ（要素）を並列アクセスするときに、メモリアクセスのデータ幅に不連続なデータ（要素）が収まらない場合がある。この場合、データロードユニット３４は、メモリアクセスを複数回繰り返して、不連続なデータ（要素）を連続するデータ列に寄せ集める機構（Gather機構）が必要となる。 When the data load unit 34 accesses discontinuous data (elements) in parallel between threads, the discontinuous data (elements) may not fit within the data width of memory access. In this case, the data load unit 34 requires a mechanism (Gather mechanism) that repeats memory access multiple times and gathers discontinuous data (elements) into a continuous data string.

ここで、疎行列は、ＥＬＬＰＡＣＫ形式などにより一定の割合で要素の削減が行われる。これにより、疎行列の非零要素はある間隔で並び、またスレッド間で並列アクセスする非零要素は、ある範囲のデータ幅に収まる確率が高い。この間隔と範囲とは、例えば、非零要素間距離の平均と標準偏差の倍数としてもよい。図１０示す例では、疎行列の非零要素は４要素の間隔で並び、スレッド間で並列アクセスする非零要素は６要素の範囲のデータ幅に収まっている。 Here, the elements of the sparse matrix are reduced at a constant rate by the ELLPACK format or the like. As a result, the non-zero elements of the sparse matrix are arranged at certain intervals, and the non-zero elements accessed in parallel between threads have a high probability of being within a certain range of data width. The intervals and ranges may be, for example, multiples of the mean and standard deviation of the distances between non-zero elements. In the example shown in FIG. 10, the non-zero elements of the sparse matrix are arranged at intervals of four elements, and the non-zero elements accessed in parallel between threads are within the data width of six elements.

そこで、データロードユニット３４は、不連続なデータ（要素）を連続するデータ列に寄せ集める機構（Gather機構）が必要となる。データロードユニット３４は、当該機構により、この間隔をメモリアクセスのストライド幅と、この範囲をメモリアクセスのデータ幅としてデータバッファ２０に対してスレッド間で並列アクセスすることができるようになる。 Therefore, the data load unit 34 requires a mechanism (Gather mechanism) for gathering discontinuous data (elements) into a continuous data string. With this mechanism, the data load unit 34 can access the data buffer 20 in parallel between threads using this interval as the memory access stride width and this range as the memory access data width.

しかし、ＳｐＭＶユニット３０では、例えばデータロードユニット３４によりスレッド間で並列アクセスする入力ベクトルの複数の要素が、１回のメモリアクセスのデータ幅に収まるように、疎行列のインデックスのリストに対してパディングを挿入することができる。ＳｐＭＶユニット３０で疎行列のインデックスのリストに対してパディングを挿入することで、データロードユニット３４において寄せ集める機構（Gather機構）を不要にできる。具体的に、ＳｐＭＶユニット３０では、疎行列のインデックスの各列に含まれるパディングを挿入した要素の数が平準化するようにパディングを挿入する。つまり、ＳｐＭＶユニット３０は、例えば、疎行列ロードユニット３１で疎行列のインデックスのリストに対してパディングを挿入する。 However, in the SpMV unit 30, for example, the list of sparse matrix indices is padded so that a plurality of elements of an input vector accessed in parallel between threads by the data load unit 34 can be accommodated within the data width of one memory access. can be inserted. By inserting padding into the list of sparse matrix indices in the SpMV unit 30, the gather mechanism in the data load unit 34 can be made unnecessary. Specifically, the SpMV unit 30 inserts padding so that the number of elements inserted with padding included in each column of the index of the sparse matrix is leveled. That is, the SpMV unit 30 inserts padding to the list of sparse matrix indices, for example, in the sparse matrix load unit 31 .

疎行列ロードユニット３１は、外部メモリ３００から疎行列の非零要素を読み出し、疎行列のインデックスの各列に含まれるパディングの要素の数が平準化するようにパディングを挿入して、疎行列のインデックスのリストを生成してインデックスバッファ３２に格納する。疎行列ロードユニット３１は、パディングを挿入する挿入部として機能している。図１０に示す疎行列のインデックスのリストに対して、各列に含まれるパディングの要素の数が平準化するようにパディングを挿入した例を説明する。図１２は、図１０に示す疎行列のインデックスのリストに対してパディングを平準化して挿入した一例を示す図である。図１２に示す例では、疎行列のインデックスの各列がメモリアクセスの４要素のストライド幅、８要素のデータ幅に収まるように、パディングを平準化して挿入している。 The sparse matrix load unit 31 reads the non-zero elements of the sparse matrix from the external memory 300, inserts padding so that the number of padding elements included in each column of the index of the sparse matrix is leveled, and restores the sparse matrix. An index list is generated and stored in the index buffer 32 . The sparse matrix load unit 31 functions as an inserter that inserts padding. An example will be described in which padding is inserted into the sparse matrix index list shown in FIG. 10 so that the number of padding elements included in each column is equalized. FIG. 12 is a diagram showing an example in which padding is leveled and inserted into the sparse matrix index list shown in FIG. In the example shown in FIG. 12, the padding is leveled and inserted so that each column of the sparse matrix index fits within the stride width of 4 elements and the data width of 8 elements for memory access.

図１２に示すｉｎｄｅｘの０列目の［４，６，２，＊］を１回目の繰り返し処理（Iteration1）で読み出し、ｉｎｄｅｘの１列目の［＊，１０，７，１３］を２回目の繰り返し処理（Iteration2）で読み出す。そのため、データロードユニット３４は、１回目の繰り返し処理（Iteration1）で０番目～７番目までの８要素のデータ幅に対してメモリアクセスを行い、２回目の繰り返し処理（Iteration2）で４番目～１１番目までの８要素のデータ幅に対してメモリアクセスを行うことになる。つまり、データロードユニット３４は、メモリアクセスを複数回繰り返す必要がなく、不連続なデータ（要素）を連続するデータ列に寄せ集める機構（Gather機構）も必要ない。 [4, 6, 2, *] in the 0th column of the index shown in FIG. Read by iteration process (Iteration2). Therefore, the data load unit 34 performs memory access to the data width of 8 elements from 0th to 7th in the first iteration (Iteration 1), and accesses the 4th to 11th elements in the second iteration (Iteration 2). Memory access is performed for the data width of 8 elements up to the th. In other words, the data load unit 34 does not need to repeat memory access multiple times, and does not need a mechanism (Gather mechanism) for gathering discontinuous data (elements) into a continuous data string.

次に、疎行列ロードユニット３１で疎行列のインデックスのリストに対してパディングを平準化して挿入する手順について説明する。図１３は、疎行列のインデックスのリストに対してパディングを平準化して挿入する手順について示すフローチャートである。 Next, a procedure for leveling and inserting padding into a sparse matrix index list by the sparse matrix load unit 31 will be described. FIG. 13 is a flowchart showing a procedure for leveling and inserting padding into a list of indices of a sparse matrix.

まず、疎行列ロードユニット３１は、ＥＬＬＰＡＣＫ形式の疎行列の１列目から順に、全ての列に対して以下の処理を繰り返す（ステップＳ１）。疎行列ロードユニット３１は、列に含まれる要素の中で最小のインデックスを求める（ステップＳ２）。次に、疎行列ロードユニット３１は、ストライド幅のメモリ領域をメモリブロック、データ幅のメモリブロックをウィンドウとして、ステップＳ２で求めた最小のインデックスが含まれるメモリブロックから始まるウィンドウの範囲を求める（ステップＳ３）。 First, the sparse matrix load unit 31 repeats the following process for all columns of the sparse matrix in ELLPACK format, starting from the first column (step S1). The sparse matrix load unit 31 finds the minimum index among the elements included in the column (step S2). Next, the sparse matrix load unit 31 uses the stride width memory area as a memory block and the data width memory block as a window, and obtains the range of the window starting from the memory block containing the minimum index obtained in step S2 (step S3).

疎行列ロードユニット３１は、ウィンドウの範囲外の要素に対して、対応する行に１列分のパディングを挿入する（ステップＳ４）。例えば、図１１に示す例では、１回目の繰り返し処理（Iteration1）で０番目～７番目までの８要素のデータ幅をウィンドウの範囲とすると、疎行列のインデックスの９番目はウィンドウの範囲外となるのでパディングを挿入する。疎行列のインデックスの９番目に対応する行に１列分のパディングを挿入すると、図１２に示す１回目の繰り返し処理（Iteration1）のようになる。疎行列ロードユニット３１は、１列の処理が終わった場合、次の列に対して処理を繰り返す（ステップＳ５）。 The sparse matrix load unit 31 inserts one column of padding in the corresponding row for elements outside the window range (step S4). For example, in the example shown in FIG. 11, if the data width of eight elements from the 0th to the 7th elements in the first iteration (Iteration1) is set as the window range, the 9th sparse matrix index is out of the window range. So insert padding. Inserting one column of padding into the row corresponding to the ninth index of the sparse matrix results in the first iteration (Iteration 1) shown in FIG. When the sparse matrix load unit 31 finishes processing one column, it repeats the processing for the next column (step S5).

次に、データバッファ２０のメモリ構成について説明する。図１４は、本実施の形態１に係るデータバッファ２０のメモリ構成を示すブロック図である。図１４に示すデータバッファ２０では、データ数が２５６のバッファの構成例であり、１つのメモリバンクを３２行×４列（４要素のデータ幅）のメモリで構成し、２つのメモリバンクで１つのバッファを構成している。 Next, the memory configuration of the data buffer 20 will be explained. FIG. 14 is a block diagram showing the memory configuration of the data buffer 20 according to the first embodiment. The data buffer 20 shown in FIG. 14 is a configuration example of a buffer with 256 data. consists of one buffer.

１つのメモリバンクのデータ幅は、メモリアクセスのストライド幅に相当し、全てのメモリバンクを合わせたデータ幅は、メモリアクセスのデータ幅に相当する。図１４に示す例では、メモリアクセスのストライド幅は４、データ幅は２×４となる。 The data width of one memory bank corresponds to the stride width of memory access, and the total data width of all memory banks corresponds to the data width of memory access. In the example shown in FIG. 14, the memory access stride width is 4 and the data width is 2×4.

データロードユニット３４は、図１４に示すデータバッファにアクセスする場合、８ビットのインデックスでアクセスを行う。図１５は、図１４に示すデータバッファにアクセスするためのインデックスの一例を示す図である。図１５に示すインデックスでは、データの行を５ビットで、メモリバンクを１ビットで、データの列を２ビットでそれぞれ表している。 The data load unit 34 accesses the data buffer shown in FIG. 14 using an 8-bit index. 15 is a diagram showing an example of an index for accessing the data buffer shown in FIG. 14. FIG. In the index shown in FIG. 15, a row of data is represented by 5 bits, a memory bank by 1 bit, and a column of data by 2 bits.

次に、図１４に示すデータバッファに対して複数のインデックスで並列アクセスするロードユニットの構成について説明する。図１６は、本実施の形態１に係るロードユニットの構成を示すブロック図である。 Next, the configuration of the load unit that accesses the data buffer shown in FIG. 14 in parallel using a plurality of indexes will be described. FIG. 16 is a block diagram showing the configuration of the load unit according to the first embodiment.

図１６に示すロードユニットでは、２つのメモリバンクごとのアドレスを生成するアドレスコアレスユニット（アドレス生成部）３４ａと、２つのメモリバンクからの出力データを選択するデータクロスバ（データ出力部）３４ｂとを含んでいる。 The load unit shown in FIG. 16 includes an address coreless unit (address generator) 34a that generates addresses for each of the two memory banks, and a data crossbar (data output section) 34b that selects output data from the two memory banks. contains.

アドレスコアレスユニット３４ａは、並列アクセスの複数のインデックスからメモリバンクごとに行（ｒｏｗ）インデックスを融合し、メモリバンクごとのアドレスを生成する。並列アクセスの複数のインデックスは、同一のメモリアクセスの範囲に制限されているため、メモリバンクごとのアドレス（行インデックス）は１つに決まる。 Address coreless unit 34a fuses row indices for each memory bank from multiple indices of parallel access to generate an address for each memory bank. Since multiple indexes for parallel access are restricted to the same memory access range, only one address (row index) is determined for each memory bank.

データクロスバ３４ｂは、並列アクセスの複数のインデックスごとに、図１５に示すメモリバンクインデックスでメモリバンクを選択し、列（ｃｏｌｕｍｎ）インデックスでメモリバンクからの出力データを選択する。 The data crossbar 34b selects a memory bank with a memory bank index shown in FIG. 15 and selects output data from the memory bank with a column index for each of a plurality of indexes for parallel access.

以上のように、本実施の形態１に係るアクセラレータ装置１００は、疎行列ベクトル積の計算を含む演算処理を実行するアクセラレータ装置である。アクセラレータ装置１００は、疎行列の非零要素の各々と入力ベクトルの要素の各々との積和を含む出力ベクトルの要素の各々の計算処理を並列に実行するＳｐＭＶユニット３０と、疎行列の非零要素の各々、入力ベクトルの要素の各々、および出力ベクトルの要素の各々を格納するデータバッファ２０とを備えている。ＳｐＭＶユニット３０部は、データロードユニット３４と、ＦＭＡアレイ３５と、疎行列ロードユニット３１（挿入部）とを有する。データロードユニット３４は、データバッファ２０に記憶している要素の各々に対して並列にアクセスすることができる。ＦＭＡアレイ３５は、データロードユニット３４がアクセスした入力ベクトルの要素の各々と、疎行列の非零要素の各々との積和（融合積和）を計算する。疎行列ロードユニット３１は、データロードユニット３４がデータバッファ２０に記憶している入力ベクトルの要素の各々に対して並列にアクセスする順序に従い、疎行列の非零要素の各々の配置にパディングを挿入する。 As described above, the accelerator device 100 according to the first embodiment is an accelerator device that executes arithmetic processing including calculation of a sparse matrix-vector product. Accelerator device 100 includes an SpMV unit 30 that executes in parallel computation processing of each element of an output vector including sums of products of each nonzero element of a sparse matrix and each element of an input vector; A data buffer 20 for storing each element, each element of the input vector, and each element of the output vector. The SpMV unit 30 section has a data load unit 34, an FMA array 35, and a sparse matrix load unit 31 (insertion section). Data load unit 34 can access each of the elements stored in data buffer 20 in parallel. FMA array 35 calculates the sum of products (fused sum of products) between each element of the input vector accessed by data load unit 34 and each non-zero element of the sparse matrix. Sparse matrix load unit 31 inserts padding into the placement of each non-zero element of the sparse matrix according to the order in which data load unit 34 accesses each of the elements of the input vector stored in data buffer 20 in parallel. do.

これにより、本実施の形態１に係るアクセラレータ装置１００では、記憶部に記憶している入力ベクトルの要素の各々に対して並列にアクセスする順序に従い、疎行列の非零要素の各々の配置にパディングを挿入することで、不連続なデータ（要素）を連続するデータ列に寄せ集める機構が不要で、ハードウェア資源を削減することができる。 As a result, in the accelerator device 100 according to the first embodiment, padding is applied to the placement of each non-zero element of the sparse matrix in accordance with the order in which the elements of the input vector stored in the storage unit are accessed in parallel. By inserting , there is no need for a mechanism for gathering discontinuous data (elements) into a continuous data string, and hardware resources can be reduced.

疎行列ロードユニット３１は、入力ベクトルの要素の各々に対して並列にアクセスする順序において、予め定められたストライド幅とデータ幅とで規定されるアクセス単位で並列にアクセスする場合、アクセス単位に含まれない疎行列の非零要素の配置に対して前記パディングを挿入する。これにより、ＳｐＭＶユニット３０では、疎行列のインデックスの各列に含まれるパディングを挿入した要素の数が平準化してパディングを挿入することができる。 The sparse matrix load unit 31 accesses each element of the input vector in parallel in an access unit defined by a predetermined stride width and data width in the order of parallel access. The padding is inserted for placements of non-zero elements of the sparse matrix that are not filled. As a result, the SpMV unit 30 can insert padding while leveling the number of elements into which padding is inserted in each column of the index of the sparse matrix.

データロードユニット３４は、入力ベクトルの要素の各々に対して並列にアクセスする場合に、メモリバンクごとにアクセスするアドレスを生成するアドレスコアレスユニット３４ａと、アドレスコアレスユニット３４ａで生成したアドレスに対応するメモリバンクからの出力データを選択するデータクロスバ３４ｂとを有してもよい。これにより、データバッファ２０に対して複数のインデックスで並列アクセスするデータロードユニット３４を構成することができる。 The data load unit 34 includes an address coreless unit 34a that generates an address to be accessed for each memory bank and a memory corresponding to the address generated by the address coreless unit 34a when accessing each element of the input vector in parallel. and a data crossbar 34b for selecting output data from the bank. As a result, the data load unit 34 can be configured to access the data buffer 20 in parallel using a plurality of indexes.

データロードユニット３４は、アクセス単位のストライド幅を１つのメモリバンクが有するデータの列数とし、アクセス単位のデータ幅を全てのメモリバンクが有するデータの列数の合計数としてもよい。これにより、データロードユニット３４は、データバッファ２０に対してメモリバンクの粒度でアクセスすることができる。 The data load unit 34 may use the stride width of the access unit as the number of columns of data held by one memory bank, and the data width of the access unit as the total number of columns of data held by all memory banks. This allows the data load unit 34 to access the data buffer 20 at memory bank granularity.

アクセラレータ装置１００は、ＦＰＧＡで構成されてもよい。これにより、アクセラレータ装置１００のコストを低減することができる。 The accelerator device 100 may be configured with an FPGA. Thereby, the cost of the accelerator device 100 can be reduced.

なお、パディングを平準化して挿入する挿入部の例として、疎行列ロードユニット３１を説明したが、疎行列ロードユニット３１以外に、別途挿入部を設けてもよい。 Although the sparse matrix load unit 31 has been described as an example of an insertion unit that smoothes and inserts padding, an insertion unit other than the sparse matrix load unit 31 may be provided.

（実施の形態２）
実施の形態１に係るアクセラレータ装置１００では、１つのメモリバンクを３２行×４列（４要素のデータ幅）のメモリで構成し、２つのメモリバンクで１つのバッファを構成している例を説明した。データバッファ２０は、メモリバンクの粒度でアクセスを行うため、疎行列の非零要素の構造により不要なメモリアクセスが発生することになる。 (Embodiment 2)
In the accelerator device 100 according to the first embodiment, an example will be described in which one memory bank is configured with a memory of 32 rows×4 columns (data width of 4 elements), and two memory banks configure one buffer. did. Since the data buffer 20 performs access at the granularity of the memory bank, unnecessary memory access occurs due to the structure of the non-zero elements of the sparse matrix.

メモリアクセスの粒度を細かくする場合、メモリバンクのデータ幅がメモリアクセスのストライド幅を等分したサイズとし、メモリバンクの総数はメモリアクセスのデータ幅に合わせる必要がある。本実施の形態２に係るアクセラレータ装置では、メモリアクセスの粒度を細かくした場合について説明する。なお、本実施の形態２に係るアクセラレータ装置では、実施の形態１に係るアクセラレータ装置１００とデータバッファの構成を除いて同じ構成であり、詳細な説明は繰り返さない。 When the granularity of memory access is finer, the data width of memory banks must be equal to the stride width of memory access, and the total number of memory banks must match the data width of memory access. In the accelerator device according to the second embodiment, a case where the granularity of memory access is made fine will be described. The accelerator device according to the second embodiment has the same configuration as accelerator device 100 according to the first embodiment except for the configuration of the data buffer, and detailed description thereof will not be repeated.

本実施の形態２に係るデータバッファのメモリ構成について説明する。図１７は、本実施の形態２に係るデータバッファのメモリ構成を示すブロック図である。図１７に示すデータバッファでは、データ数が２５６のバッファの構成例であり、１つのメモリバンクを３２行×２列（２要素のデータ幅）のメモリで構成し、４つのメモリバンクで１つのバッファを構成している。 A memory configuration of the data buffer according to the second embodiment will be described. FIG. 17 is a block diagram showing the memory configuration of the data buffer according to the second embodiment. The data buffer shown in FIG. 17 is a configuration example of a buffer with 256 data. constitute a buffer.

１つのメモリバンクのデータ幅は、メモリアクセスのストライド幅に相当し、全てのメモリバンクを合わせたデータ幅は、メモリアクセスのデータ幅に相当する。図１７に示す例では、メモリアクセスのストライド幅は２、データ幅は４×２となる。 The data width of one memory bank corresponds to the stride width of memory access, and the total data width of all memory banks corresponds to the data width of memory access. In the example shown in FIG. 17, the memory access stride width is 2 and the data width is 4×2.

データロードユニット３４は、図１７に示すデータバッファにアクセスする場合、８ビットのインデックスでアクセスを行う。図１８は、図１７に示すデータバッファにアクセスするためのインデックスの一例を示す図である。図１８に示すインデックスでは、データの行を５ビットで、メモリバンクを２ビットで、データの列を１ビットでそれぞれ表している。 The data load unit 34 accesses the data buffer shown in FIG. 17 using an 8-bit index. 18 is a diagram showing an example of an index for accessing the data buffer shown in FIG. 17. FIG. In the index shown in FIG. 18, a row of data is represented by 5 bits, a memory bank by 2 bits, and a column of data by 1 bit.

次に、図１７に示すデータバッファに対して複数のインデックスで並列アクセスするロードユニットの構成について説明する。図１９は、本実施の形態２に係るロードユニットの構成を示すブロック図である。 Next, the configuration of the load unit that accesses the data buffer shown in FIG. 17 in parallel using a plurality of indexes will be described. FIG. 19 is a block diagram showing the configuration of a load unit according to the second embodiment.

図１９に示すロードユニットでは、４つのメモリバンクごとのアドレスを生成するアドレスコアレスユニット（アドレス生成部）３４ｃと、４つのメモリバンクからの出力データを選択するデータクロスバ（データ出力部）３４ｄとを含んでいる。 The load unit shown in FIG. 19 includes an address coreless unit (address generator) 34c that generates addresses for each of the four memory banks, and a data crossbar (data output section) 34d that selects output data from the four memory banks. contains.

アドレスコアレスユニット３４ｃは、並列アクセスの複数のインデックスからメモリバンクごとに行（ｒｏｗ）インデックスを融合し、メモリバンクごとのアドレスを生成する。並列アクセスの複数のインデックスは、同一のメモリアクセスの範囲に制限されているため、メモリバンクごとのアドレス（行インデックス）は１つに決まる。 Address coreless unit 34c fuses row indices for each memory bank from multiple indices of parallel access to generate an address for each memory bank. Since multiple indexes for parallel access are restricted to the same memory access range, only one address (row index) is determined for each memory bank.

データクロスバ３４ｄは、並列アクセスの複数のインデックスごとに、図１８に示すメモリバンクインデックスでメモリバンクを選択し、列（ｃｏｌｕｍｎ）インデックスでメモリバンクからの出力データを選択する。 The data crossbar 34d selects a memory bank with a memory bank index shown in FIG. 18 and selects output data from the memory bank with a column index for each of a plurality of indexes for parallel access.

図１７に示したようにメモリアクセスの粒度を細かくすると、メモリバンク数が増加してハードウェア構成が複雑になる。一方、スレッドの並列数を増やすと、メモリアクセスが均一化することで、不要なメモリアクセスが低減されるが、並列計算に必要となるメモリ帯域幅が増える。これらから、メモリアクセスの粒度（メモリバンクの数）は、疎行列の非零要素構造とスレッドの並列数とのトレードオフで決めることになる。 As shown in FIG. 17, finer memory access granularity increases the number of memory banks and complicates the hardware configuration. On the other hand, increasing the parallel number of threads reduces unnecessary memory accesses by uniforming memory accesses, but increases the memory bandwidth required for parallel computation. From these, the granularity of memory access (the number of memory banks) is determined by the trade-off between the non-zero element structure of the sparse matrix and the parallel number of threads.

以上のように、本実施の形態２に係るアクセラレータ装置では、データロードユニット３４が、メモリバンクの数を変更することができる。これにより、本実施の形態２に係るアクセラレータ装置では、疎行列の非零要素構造とスレッドの並列数とのバランスを考慮してメモリバンクの数を決定することができる。 As described above, in the accelerator device according to the second embodiment, the data load unit 34 can change the number of memory banks. Thus, in the accelerator device according to the second embodiment, the number of memory banks can be determined in consideration of the balance between the non-zero element structure of the sparse matrix and the parallel number of threads.

なお、以上の実施の形態１，２において、処理可能なスレッド数が疎行列の行数よりも少ない場合（処理可能なスレッド数＜疎行列の行数）、疎行列の行をスレッド数で分割して、上述した処理を繰り返し実行することになる。 In Embodiments 1 and 2 above, when the number of processable threads is less than the number of rows of the sparse matrix (the number of processable threads < the number of rows of the sparse matrix), the rows of the sparse matrix are divided by the number of threads. Then, the above-described processing is repeatedly executed.

今回開示された実施の形態はすべての点で例示であって制限的なものではないと考えられるべきである。本発明の範囲は上記した実施の形態ではなくて特許請求の範囲によって示され、特許請求の範囲と均等の意味および範囲内でのすべての変更が含まれることが意図される。 It should be considered that the embodiments disclosed this time are illustrative in all respects and not restrictive. The scope of the present invention is indicated by the scope of the claims rather than the above-described embodiments, and is intended to include all modifications within the scope and meaning of equivalents to the scope of the claims.

１０メモリインターフェース／インターコネクト、２０データバッファ、３０ＳｐＭＶユニット、３１行疎列ロードユニット、３２インデックスバッファ、３３バリューバッファ、３４データロードユニット、３４ａ，３４ｃアドレスコアレスユニット、３４ｂ，３４ｄデータクロスバ、３５ＦＭＡアレイ、３６，４７コントロールユニット、４０ポストプロセスユニット、４１バイアスロードユニット、４２バイアスバッファ、４３バイアス加算ユニット、４４ロードユニット、４５アクティベーションユニット、４６データストアユニット、１００アクセラレータ装置、２００システムバス、３００外部メモリ。 10 memory interface/interconnect, 20 data buffer, 30 SpMV unit, 31 row sparse column load unit, 32 index buffer, 33 value buffer, 34 data load unit, 34a, 34c address coreless unit, 34b, 34d data crossbar, 35 FMA array , 36, 47 control unit, 40 post-process unit, 41 bias load unit, 42 bias buffer, 43 bias addition unit, 44 load unit, 45 activation unit, 46 data store unit, 100 accelerator device, 200 system bus, 300 external memory.

Claims

An accelerator device for performing arithmetic operations including computation of a sparse matrix-vector product,
a parallel computing unit that executes in parallel computation processing of each of the elements of the output vector including the sum of products of each of the non-zero elements of the sparse matrix and each of the elements of the input vector;
a storage unit that stores each non-zero element of the sparse matrix, each element of the input vector, and each element of the output vector;
The parallel computing unit
a data loading unit capable of accessing each of the elements stored in the storage unit in parallel;
an operation unit that calculates the sum of products of each element of the input vector accessed by the data loading unit and each non-zero element of the sparse matrix;
an inserting unit that inserts padding into the placement of each of the non-zero elements of the sparse matrix according to the order in which the data loading unit accesses each of the elements of the input vector stored in the storage unit in parallel; Accelerator device.

The insertion section is
When accessing each element of the input vector in parallel in an access unit defined by a predetermined stride width and data width in the order of accessing each element in parallel,
2. The accelerator apparatus according to claim 1, wherein said padding is inserted into an arrangement of non-zero elements of said sparse matrix not included in said access unit.

The data loading unit
an address generator for generating an address to be accessed for each memory bank when accessing each of the elements of the input vector in parallel;
3. The accelerator device according to claim 2, further comprising a data output section for selecting output data from said memory bank corresponding to the address generated by said address generation section.

The data loading unit
The stride width of the access unit is the number of columns of data held by one memory bank,
4. The accelerator device according to claim 3, wherein said data width of said access unit is the total number of columns of data held by all said memory banks.

The data loading unit
5. The accelerator device of claim 4, wherein the number of memory banks can be changed.

6. The accelerator device according to any one of claims 1 to 5, wherein said accelerator device is composed of an FPGA.