JP6836065B2

JP6836065B2 - Information processing device, PLD management program and PLD management method

Info

Publication number: JP6836065B2
Application number: JP2017034302A
Authority: JP
Inventors: デビッドタシ; 久典藤澤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-02-27
Filing date: 2017-02-27
Publication date: 2021-02-24
Anticipated expiration: 2037-02-27
Also published as: US20180246735A1; JP2018142046A; US10534621B2

Description

The present invention relates to an information processing device, a PLD management program, and a PLD management method.

A programmable logic device (hereinafter, PLD) is an integrated circuit in which a plurality of logic circuit elements, memory circuit elements, wiring, switches, and the like are formed in advance. When configuration data for configuring a circuit capable of executing a predetermined process is set in or written to the PLD, the PLD is reconfigured into a circuit capable of executing that process. One example of such a PLD is an FPGA (Field Programmable Gate Array), an LSI whose internal circuit can be reconfigured into various logic circuits by rewriting the configuration data. The following description uses an FPGA, which is one type of PLD, as an example.

When a processor executes a predetermined software process (for example, a job) on a dedicated hardware circuit, the processor sets or writes configuration data for that dedicated circuit in the FPGA, thereby configuring the dedicated circuit in the FPGA, and causes the dedicated circuit to execute the process. When the dedicated circuit finishes the process, the processor sets or writes configuration data for another dedicated circuit that executes a different process, configures that circuit in the FPGA, and causes it to execute the different process. By having dedicated circuits in the FPGA execute predetermined software processes, the processor uses the FPGA as an accelerator. This allows an information processing device (computer) having the processor to save power and offer higher functionality.

As FPGAs have grown in scale, it has become possible to configure a plurality of logic circuits in one FPGA and operate them in parallel. It has also become possible to reconfigure logic circuits dynamically and asynchronously and to operate them asynchronously in parallel, for example by reconfiguring some logic circuits and starting a new logic circuit while other logic circuits configured in the FPGA continue to operate.

Configuring a plurality of circuits in an FPGA is disclosed in the following patent documents. Patent Document 5 is an earlier application, but it is not publicly known prior art.

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-154417
Patent Document 2: Japanese Unexamined Patent Publication No. 2004-32043
Patent Document 3: Japanese Unexamined Patent Publication No. 2016-76867
Patent Document 4: Japanese Unexamined Patent Publication No. 2015-231205
Patent Document 5: Japanese Patent Application No. 2016-248297

When a plurality of users share an information processing device equipped with a processor and an FPGA, specific processes of the users' programs may each be handled by one of a plurality of logic circuits configured in the FPGA. In that case, the users' programs configure their own logic circuits in the FPGA without being aware of one another's logic circuits, and the configured logic circuits share the FPGA partially and dynamically. As a result, the bandwidth used on the bus between the FPGA and the memory may reach the upper limit of the bus bandwidth, creating a bus bandwidth bottleneck.

When a bus bandwidth bottleneck occurs, one option is to lower the degree of parallelism of a given logic circuit and raise that of another logic circuit instead, so as to limit the deterioration of the overall execution time. How much the execution time shortens when the degree of parallelism is raised depends on the type of logic circuit.

An object of the present invention is therefore to provide an information processing device, a PLD management program, and a PLD management method that improve the utilization efficiency of the circuit resources of a PLD.

A first aspect of the embodiment is an information processing device including:
a processor that executes a program; and
a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, the logic circuit requested by the configuration request is configured,
wherein the processor
compares a first execution time with a second execution time, the first execution time being the execution time of a plurality of logic circuits configured and operating in the reconfiguration area when a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, and the second execution time being the execution time of the plurality of logic circuits when the parallelism adjustment is not performed, and
requests the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and does not request the PLD to perform the parallelism adjustment when it is not shorter.
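The decision rule in the first aspect can be sketched as follows. This is only an illustration: the function name and the numeric estimates are hypothetical, and how the two execution times are estimated is not specified at this point in the description.

```python
def should_request_adjustment(adjusted_time: float, unadjusted_time: float) -> bool:
    """Return True only when the adjustment is predicted to shorten execution.

    adjusted_time   -- first execution time (parallelism adjustment applied)
    unadjusted_time -- second execution time (no adjustment)
    """
    return adjusted_time < unadjusted_time


# Illustrative estimates in seconds (hypothetical, not taken from the source).
print(should_request_adjustment(adjusted_time=8.0, unadjusted_time=10.0))
print(should_request_adjustment(adjusted_time=10.0, unadjusted_time=10.0))
```

Note that equal execution times do not trigger a request, which matches the "not shorter" branch of the aspect.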

According to the first aspect, the utilization efficiency of the circuit resources of a PLD can be improved.

FIG. 1 is a diagram showing a configuration example of the information processing device according to the present embodiment.
FIG. 2 is a diagram showing a configuration example of the FPGA.
FIG. 3 is a diagram showing an example of the reconfiguration area of the FPGA.
FIG. 4 is a diagram explaining an example in which the logic circuits of a plurality of users are dynamically configured and deleted.
FIG. 5 is a diagram showing an example of controlling the degree of parallelism of logic circuits configured in the FPGA.
FIG. 6 is a diagram explaining a bus bandwidth bottleneck in the FPGA.
FIG. 7 is a diagram showing an example of controlling the degree of parallelism by the FPGA management method according to the first embodiment.
FIG. 8 is a diagram explaining the execution time in the case of a CI processing circuit.
FIG. 9 is a diagram showing the difference in the execution time of a CI processing circuit when the degree of parallelism Pi is changed.
FIG. 10 is a flowchart of the FPGA management program according to the first embodiment.
FIG. 11 is a table showing the parameters of the user circuits managed by the processor.
FIG. 12 is a flowchart of the parallelism adjustment process S8 for the user circuits.
FIG. 13 is a flowchart showing the process of step S13A for increasing the degree of parallelism.
FIG. 14 is a diagram showing a detailed flowchart of steps S15, S15B, S15C, and S15D of FIG. 12.
FIG. 15 is a flowchart showing the process of step S17.
FIG. 16 is a diagram showing a first specific example.
FIG. 17 is a diagram showing a second specific example.

FIG. 1 is a diagram showing a configuration example of the information processing device according to the present embodiment. A server 10, which is the information processing device, has a processor or CPU (Central Processing Unit) 11 that executes an OS, application programs, and middleware programs; a main memory 12 such as a DRAM; and a first bus BUS_1, such as a CPU bus, that connects them. The server 10 further has I/O devices 13 such as a mouse, a keyboard, and a display panel; a NIC (Network Interface Card) 14 connected to a network NET; and an auxiliary storage device 17, such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), that stores the OS, application programs APL, data DATA, and the like. These are connected to the first bus BUS_1 via a second bus BUS_2 such as a PCI bus.

The server 10 further has a PLD (Programmable Logic Device) 15 in which arbitrary logic circuits can be reconfigured, a memory 16 that stores PLD configuration data and the like, and a third bus BUS_3 that connects them. The PLD is, for example, an FPGA; in that case the memory 16 is an FPGA memory and the third bus is an FPGA bus.

For example, when a job management program of the OS executed by the CPU 11 detects, in a running application program, a job that can be processed by a logic circuit in the FPGA, the CPU writes (or sets) the configuration data for that logic circuit in the FPGA, configures the logic circuit, and runs it.

The auxiliary storage device 17 stores an FPGA management program that manages the FPGA and configuration data C_DATA for configuring logic circuits. When the server 10 starts up, the OS, the applications APL, and the FPGA management program in the auxiliary storage device are loaded into the main memory 12 and executed by the processor 11. The configuration data C_DATA in the auxiliary storage device is loaded into the FPGA memory.

Because various logic circuits can be configured in the FPGA 15 by changing the configuration data, even after the server 10 has been manufactured, a variety of jobs can be processed at high speed by logic circuits configured in the FPGA simply by changing the configuration data.

In a cloud service or the like, a plurality of users have the server 10 execute their respective application programs. As a result, the processor 11 of the server 10 executes the application programs of the plurality of users in parallel. Logic circuits that execute predetermined processes (jobs) of the respective application programs are built asynchronously in the FPGA 15, and the resulting logic circuits (user circuits) operate in parallel, each executing its own job.

FIG. 2 is a diagram showing a configuration example of the FPGA. The FPGA 15 in FIG. 2 has a bus interface circuit BUS_IF to the third bus BUS_3, a control circuit 151 that controls the writing of configuration data and performs other control, a configuration data memory C_RAM to which configuration data is written, a reconfiguration area RC_REG in which various logic circuits are reconfigured according to the written configuration data, and an internal bus I_BUS.

Although not shown, a plurality of logic circuit elements, memory circuit elements, wiring, switches, and the like are formed in advance in the reconfiguration area RC_REG. The reconfiguration area RC_REG is divided, logically or physically, into a plurality of partial reconfiguration blocks PB. A logic circuit to be reconfigured is configured in one or more partial reconfiguration blocks, with a circuit block that fits in one partial reconfiguration block PB as the unit of configuration. Accordingly, for example, the configuration data memory C_RAM is divided into a plurality of storage areas corresponding to the partial reconfiguration blocks PB, and when configuration data C_DATA is written to a storage area, the corresponding logic circuit is configured in the partial reconfiguration block PB associated with that storage area.

In addition, a logic circuit (user circuit) that executes a given job may be configured across a plurality of partial reconfiguration blocks PB. In that case, the configuration data for the logic circuit is written to each of the storage areas corresponding to those blocks, and the circuits configured in the individual partial reconfiguration blocks together form the logic circuit (user circuit) that executes the job.

As described above, the reconfiguration area RC_REG in the FPGA is composed of a plurality of partial reconfiguration blocks PB. A logic circuit that executes a predetermined process (job) in a user's application program may be configured in a single partial reconfiguration block PB or across a plurality of partial reconfiguration blocks PB.

Input data from the CPU is supplied to a logic circuit configured in the reconfiguration area RC_REG via the bus interface BUS_IF, and the result of processing the input data is output to the CPU. The logic circuits configured in the reconfiguration area RC_REG also exchange data with the FPGA memory 16 during operation via the internal bus I_BUS, the bus interface BUS_IF, and the FPGA bus BUS_3.

FIG. 3 is a diagram showing an example of the reconfiguration area of the FPGA. As shown in FIG. 2, the reconfiguration area RC_REG is divided into a plurality of partial reconfiguration blocks PB arranged in a matrix. The reconfiguration area RC_REG also has an operation circuit OC for data transfer between logic circuits configured in the partial reconfiguration blocks PB, and between the bus interface BUS_IF of FIG. 2 and the logic circuits configured in the partial reconfiguration blocks PB. The operation circuit OC includes network wiring, network switches, routing circuits, and the like.

In the example of FIG. 3, the circuits configured in the 3×3 partial reconfiguration blocks PB on the left form the logic circuit UC_A of user A, and the circuits configured in the 2×4 partial reconfiguration blocks PB on the right form the logic circuit UC_B of user B. The eight partial reconfiguration blocks PB in which no circuit is configured are shown uncolored.

FIG. 4 is a diagram explaining an example in which the logic circuits of a plurality of users are dynamically configured and deleted. At time T1, no logic circuit is configured in the reconfiguration area RC_REG in the FPGA. At time T2, the logic circuit of user A is configured in two partial reconfiguration blocks and starts executing its job. At time T3, the logic circuit of user B is configured in six partial reconfiguration blocks and starts executing. After time T3, the logic circuit of user A completes its processing, and at time T4 the logic circuit of user C is configured in four partial reconfiguration blocks and starts executing. At time T5, the logic circuit of user B finishes its processing, and at time T6 the logic circuit of user D is configured in four partial reconfiguration blocks and starts executing. When a configured logic circuit completes its processing, the partial reconfiguration blocks in which it was configured are released, for example, so that other logic circuits can be configured in them. In that case, for example, the configuration data in the configuration data memory C_RAM is not deleted until another logic circuit is configured in the released blocks, and if a configuration request for the same logic circuit occurs again, the already configured logic circuit is re-enabled.
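The allocate-and-release timeline of FIG. 4 can be replayed with a small bookkeeping sketch. The class `BlockPool`, the circuit names, and the total of 12 blocks are assumptions made for illustration; the description does not state the total block count for FIG. 4, and the retention of configuration data in C_RAM after release is not modeled here.

```python
class BlockPool:
    """Tracks which partial reconfiguration blocks (PBs) each user circuit holds."""

    def __init__(self, total_blocks: int) -> None:
        self.free: list[int] = list(range(total_blocks))
        self.owner: dict[str, list[int]] = {}

    def configure(self, circuit: str, blocks_needed: int) -> bool:
        """Assign free blocks to a circuit; fail if it exists or space is short."""
        if circuit in self.owner or blocks_needed > len(self.free):
            return False
        self.owner[circuit] = [self.free.pop(0) for _ in range(blocks_needed)]
        return True

    def release(self, circuit: str) -> None:
        """Return a finished circuit's blocks to the free list."""
        self.free.extend(self.owner.pop(circuit))
        self.free.sort()


pool = BlockPool(total_blocks=12)   # hypothetical size
pool.configure("UC_A", 2)           # T2: user A, two blocks
pool.configure("UC_B", 6)           # T3: user B, six blocks
pool.release("UC_A")                # user A completes after T3
pool.configure("UC_C", 4)           # T4: user C, four blocks
pool.release("UC_B")                # T5: user B finishes
pool.configure("UC_D", 4)           # T6: user D, four blocks
print(sorted(pool.owner), len(pool.free))
```

Under the assumed pool size, users C and D hold four blocks each at the end of the timeline and four blocks remain free.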

As shown in FIG. 4, different logic circuits of the same user or of different users are configured asynchronously in the reconfiguration area of the FPGA, and the configured logic circuits execute their jobs. The FPGA management program in the server 10 described above controls the reconfiguration of logic circuits in the FPGA.

FIG. 5 is a diagram showing an example of controlling the degree of parallelism of logic circuits configured in the FPGA. In an FPGA, which is one type of PLD, a logic circuit is configured by setting configuration data; the logic circuit executes a job and thereby functions as an accelerator for the CPU. However, because a logic circuit in an FPGA is built from lookup tables and switching circuits reconfigured by the configuration data, it operates more slowly than an ordinary custom integrated circuit. One way to use logic circuits in the FPGA as a CPU accelerator is therefore to configure multiple copies of the same logic circuit in the FPGA and operate them in parallel.

For example, when the processor executing the FPGA management program configures in the FPGA a logic circuit that executes a given job, and the reconfiguration area RC_REG has free space, the processor controls the FPGA so that multiple copies of the same logic circuit are configured, and has the copies execute the job in parallel.

In the example of FIG. 5, at time T11, the processor executing the FPGA management program configures user A's logic circuit UC_A in six partial reconfiguration blocks and user B's logic circuit UC_B in two partial reconfiguration blocks. At the subsequent time T12, the processor configures a second logic circuit UC_B2 of user B in two partial reconfiguration blocks and has the two logic circuits UC_B and UC_B2 operate in parallel. Similarly, at time T13, the processor configures a second logic circuit UC_A2 of user A in six partial reconfiguration blocks and has the two logic circuits UC_A and UC_A2 operate in parallel. This speeds up the operation of the logic circuits in the FPGA.

For example, if the user's logic circuit is an adder and one adder completes a computation in N cycles, configuring two adders and performing the additions in parallel completes the computation in N/2 cycles. This is an example of shortening the job execution time by increasing the degree of parallelism of a logic circuit.
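As a minimal sketch of the adder example, the ideal cycle count is the single-circuit cycle count divided by the degree of parallelism. The function name is hypothetical, and the model assumes the work divides evenly and ignores bus bandwidth limits.

```python
def ideal_cycles(cycles_single: int, parallelism: int) -> float:
    """Ideal cycle count when work that takes cycles_single on one circuit
    is split evenly across `parallelism` identical circuits."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return cycles_single / parallelism


N = 1000  # illustrative cycle count for one adder
print(ideal_cycles(N, 1), ideal_cycles(N, 2))
```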

[Bus bandwidth bottleneck]
FIG. 6 is a diagram explaining a bus bandwidth bottleneck in the FPGA. Logic circuits configured in the FPGA access the FPGA memory 16 shown in FIGS. 1 and 2 via the FPGA bus BUS_3. The FPGA memory 16 stores the configuration data of the logic circuits to be reconfigured and the data accessed by the configured logic circuits. Thus, when the processor executing the FPGA management program requests the FPGA to configure a logic circuit, the control circuit in the FPGA accesses the FPGA memory and downloads the configuration data of the logic circuit. Furthermore, when the logic circuits configured in the FPGA execute their jobs, each logic circuit accesses data stored in the FPGA memory. Each logic circuit configured in the FPGA therefore uses, out of the bandwidth that the FPGA bus BUS_3 can provide, the bandwidth corresponding to its own data transfer amount.

In the example of FIG. 6, at time T21, the user circuits UC_1, UC_3, and UC_4 of users 1, 3, and 4 are configured in the reconfiguration area RC_REG of the FPGA with a degree of parallelism of 1, and the user circuit UC_2 of user 2 is configured with a degree of parallelism of 2. Suppose that the bandwidth the FPGA bus BUS_3 can provide (the upper limit of the data transfer amount) is 1350 MB/s, and that the average data transfer amounts of the user circuits UC_1, UC_2, UC_3, and UC_4 are 100 MB/s, 200 MB/s, 200 MB/s, and 300 MB/s, respectively. In the state at time T21, the total of the average data transfer amounts of the configured user circuits UC_1 to UC_4 is 100+200*2+200+300=1000 MB/s, which has not reached the upper limit of 1350 MB/s. In this state, no bottleneck occurs on the FPGA bus BUS_3, each user circuit operates at its predicted data transfer amount, and the job execution times match the predicted execution times.

At time T22, on the other hand, the processor executing the FPGA management program has requested the control circuit of the FPGA to increase the degree of parallelism of user 2's logic circuit UC_2 to 4, and the degree of parallelism of UC_2 has been increased to 4. Presumably, the reconfiguration area had enough free partial reconfiguration blocks to raise UC_2 to a degree of parallelism of 4, and, because the predicted data transfer amount of UC_2 was low, the processor predicted that the upper limit of the bus bandwidth would not be exceeded even at a degree of parallelism of 4.

In practice, however, the total data transfer amount of the operating logic circuits becomes 100+200*4+200+300=1400 MB/s, which exceeds the FPGA bus upper limit of 1350 MB/s, and a bottleneck may occur in the FPGA bus bandwidth. As a result, the logic circuit UC_2 of user 2, whose degree of parallelism was raised to 4, cannot use the bandwidth its jobs require, and the execution time of one job by UC_2 becomes longer than the predicted execution time.
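The arithmetic for times T21 and T22 can be checked with a short sketch. The rates and the bus limit are the figures from the FIG. 6 example; the dictionary layout and function names are illustrative, and treating "total at or above the limit" as a bottleneck is an assumption based on the text's "reaches the upper limit".

```python
BUS_LIMIT_MB_S = 1350  # upper limit of FPGA bus BUS_3 in the example

# Average data transfer amount per circuit instance (MB/s), from the example.
RATE_MB_S = {"UC_1": 100, "UC_2": 200, "UC_3": 200, "UC_4": 300}


def total_transfer(parallelism: dict[str, int]) -> int:
    """Sum of per-instance rate times degree of parallelism over all circuits."""
    return sum(RATE_MB_S[name] * degree for name, degree in parallelism.items())


def is_bottlenecked(parallelism: dict[str, int]) -> bool:
    """True when the total demand reaches or exceeds the bus limit."""
    return total_transfer(parallelism) >= BUS_LIMIT_MB_S


t21 = {"UC_1": 1, "UC_2": 2, "UC_3": 1, "UC_4": 1}
t22 = {**t21, "UC_2": 4}  # UC_2 raised to a degree of parallelism of 4

print(total_transfer(t21), is_bottlenecked(t21))
print(total_transfer(t22), is_bottlenecked(t22))
```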

As described above, even when partial reconfiguration blocks in the reconfiguration area RC_REG are free, increasing the degree of parallelism of a logic circuit may leave the FPGA bus short of bandwidth: the total data transfer amount of the logic circuits reaches the upper limit of the bus bandwidth and a bottleneck occurs. In that case, the performance of the logic circuit whose degree of parallelism was increased does not improve, and the partial reconfiguration blocks in the reconfiguration area are wasted.

[First Embodiment]
FIG. 7 is a diagram showing an example of controlling the degree of parallelism by the FPGA management method according to the first embodiment. In this FPGA management method, each user's logic circuit includes an execution time measurement circuit that measures the time required to execute one job, and a data transfer amount measurement circuit that monitors accesses to the FPGA bus and measures the average data transfer amount per unit time. These measurement circuits can be configured by FPGA configuration data. When the control circuit of the FPGA configures a user's logic circuit from configuration data, it configures the measurement circuits from configuration data at the same time. Alternatively, the measurement circuits may be formed in the partial reconfiguration blocks in advance and used as the measurement circuits for the logic circuits configured in those blocks.

The processor executing the FPGA management program acquires the measured data transfer amounts of the logic circuits configured and operating in the reconfiguration area of the FPGA, and increases the degree of parallelism of each of the logic circuits configured in the reconfiguration area as long as the total of the acquired measurements does not exceed the upper limit of the data transfer amount of the FPGA bus.
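One way to read "within the range that does not exceed the upper limit" is as a headroom calculation. The sketch below assumes that each added instance contributes the circuit's measured per-instance rate; the function name is hypothetical, and the availability of free partial reconfiguration blocks is not modeled.

```python
def extra_instances(measured_total: int, rate_per_instance: int, bus_limit: int) -> int:
    """Number of instances that can be added without the total exceeding bus_limit.

    measured_total    -- sum of measured data transfer amounts (MB/s)
    rate_per_instance -- measured rate of one instance of the circuit (MB/s)
    bus_limit         -- upper limit of the FPGA bus (MB/s)
    """
    headroom = bus_limit - measured_total
    if headroom <= 0:
        return 0
    return headroom // rate_per_instance


# Figures from the FIG. 6 example at time T21: total 1000 MB/s, limit 1350 MB/s.
print(extra_instances(1000, 200, 1350))  # instances of UC_2 (200 MB/s each)
print(extra_instances(1000, 100, 1350))  # instances of UC_1 (100 MB/s each)
```

With the T21 figures, this leaves room for only one more instance of UC_2, which is consistent with the bottleneck described for a degree of parallelism of 4.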

Further, in the first embodiment, when the sum of the acquired data transfer measurements reaches the upper limit of the data transfer amount of the FPGA bus, the processor reduces the degree of parallelism of a logic circuit that satisfies a predetermined condition. The processor then increases the degree of parallelism of one of the other logic circuits, within a range that does not exceed the upper limit of the FPGA bus. The logic circuit whose parallelism was increased can therefore be expected to finish sooner than predicted. After that logic circuit finishes, the processor raises the degree of parallelism of the logic circuit whose parallelism was reduced, again within a range that does not exceed the upper limit. That logic circuit can then also be expected to finish sooner than predicted.

To explain using the example of FIG. 7: the processor acquires the measured data transfer amounts of the logic circuits operating in the state at time T22 in FIG. 6 and detects that their total has reached the limit of the FPGA bus. As shown at time T23 in FIG. 7, the processor therefore reduces the degree of parallelism of user 2's logic circuit UC_2, which is considered to be the cause of the bus bottleneck, from 4 to 2. It then increases the degree of parallelism of user 1's logic circuit UC_1, whose data transfer amount is low, from 1 to 4. As a result, the total of the measured data transfer amounts of the operating logic circuits becomes 100*4 + 200*2 + 200 + 300 = 1300 MB/s, which is below the FPGA bus upper limit of 1350 MB/s, and the bus bandwidth bottleneck is eliminated.

As a result, the operating time of user 1's logic circuit UC_1 is expected to shorten, so that it completes in a short time. Then, as shown at time T24, when logic circuit UC_1 completes, the processor preferentially increases the degree of parallelism of user 2's logic circuit UC_2, whose parallelism had been reduced, from 2 back to 4. The processor then acquires the measured data transfer amounts of the operating logic circuits and detects that their total, 200*4 + 200 + 300 = 1300 MB/s, is below the FPGA bus upper limit of 1350 MB/s. In this state too, there is no bus bandwidth bottleneck, and the logic circuits can operate at full performance.
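The bandwidth totals at times T23 and T24 can be checked with a few lines of Python. This is only an illustrative sketch: the per-instance bandwidths are read off the arithmetic in the text, and the names UC_3 and UC_4 for the two circuits contributing 200 MB/s and 300 MB/s are assumptions.

```python
BD_L = 1350  # FPGA bus upper limit in MB/s, from the example in the text


def total_bandwidth(circuits):
    """Sum of (bandwidth per instance) * (degree of parallelism)."""
    return sum(bw * p for bw, p in circuits.values())


# name -> (bandwidth per instance in MB/s, degree of parallelism)
# UC_3 and UC_4 are assumed names for the 200 and 300 MB/s terms.
t23 = {"UC_1": (100, 4), "UC_2": (200, 2), "UC_3": (200, 1), "UC_4": (300, 1)}
t24 = {"UC_2": (200, 4), "UC_3": (200, 1), "UC_4": (300, 1)}

total_t23 = total_bandwidth(t23)  # 100*4 + 200*2 + 200 + 300
total_t24 = total_bandwidth(t24)  # 200*4 + 200 + 300
```

Both totals come to 1300 MB/s, which is below the 1350 MB/s limit.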

[Data processing types CI and DI]
When the degree of parallelism of user circuit UC_1 is increased at time T23 in FIG. 7, how much the execution time shortens depends on the data processing pattern of UC_1. Data processing patterns include, for example, data-intensive (DI) and computation-intensive (CI). In general, for a DI processing circuit, increasing the degree of parallelism shortens the execution time proportionally, whereas for a CI processing circuit, increasing the degree of parallelism may shorten the execution time only slightly.

In a DI processing circuit, data reads (loads) and data writes (stores) occur constantly while the circuit is operating. The load/store unit (LSU) that performs memory accesses in the user circuit is therefore heavily used, and the circuit uses a large share of the bus bandwidth, which makes it a cause of bus bottlenecks. When a bus bottleneck occurs, the performance of the DI processing circuit degrades and job execution time tends to lengthen. In addition, for a DI processing circuit, multiplying the degree of parallelism by N generally shortens the execution time to 1/N.

In a CI processing circuit, on the other hand, memory accesses occur at the beginning and end of circuit operation. That is, the input data needed for data processing is read from memory at the start of operation, and the processed output data is written to memory at the end. Almost no memory access occurs during data processing. A CI processing circuit therefore leaves the bus unused for relatively long periods and uses little bus bandwidth; even if a bus bottleneck occurs, its performance does not degrade and its job execution time changes little. In addition, increasing the degree of parallelism of a CI processing circuit generally shortens the execution time only slightly.

In FIG. 7, if user 1's logic circuit UC_1 is a DI processing circuit at time T23, increasing its degree of parallelism shortens the job execution time in proportion to the increase. If UC_1 is a CI processing circuit, increasing its degree of parallelism shortens the job execution time very little. The reason is that a user circuit normally has a pipeline structure, in which data processing is already parallelized. For a CI processing circuit, therefore, increasing the degree of parallelism shortens the execution time only by the pipeline initiation intervals. For a DI processing circuit, increasing the degree of parallelism also parallelizes the memory accesses during data processing, so multiplying the degree of parallelism by N normally shortens the execution time to 1/N.

FIG. 8 is a diagram illustrating the execution time of a CI processing circuit. The execution time of a CI processing circuit can be predicted from the aforementioned number of executions N, the number of data sets S processed in one execution, the load (read) time T_LD of the input data, the store (write) time T_ST of the output data, the processing time T_{COMP_SINGLE} of one data set, and the initiation interval T_II, which is the interval between the processing start times of successive data sets.

As shown in FIG. 8, a CI processing circuit performs a process LD that loads input data from memory at the start of execution, performs the computation COMP on the input data S times as pipeline processing, and finally performs a process ST that stores the output data to memory. The S pipelined computations are started at intervals of the initiation interval T_II. The execution time of the computation COMP for one data set is T_{COMP_SINGLE}, and the initiation intervals for the S data sets add up to Δi. As described above, in a CI processing circuit the data load LD and data store ST occur at the beginning and end of data processing, and almost no memory access occurs during the computation COMP.

Accordingly, the execution time Ti of the i-th of the N executions included in one job is as follows.
Ti = T_LD + Δi + T_{COMP_SINGLE} + T_ST   (Equation 1)
Here, Δi is (S-1) times T_II, and is therefore as follows.
Δi = T_II * (S-1)

Further, if the degree of parallelism of the CI processing circuit is Pi, Δi is shortened according to Pi as follows.
Δi = T_II * {(S / Pi) - 1}

FIG. 9 is a diagram showing how the execution time of a CI processing circuit differs as the degree of parallelism Pi is changed. In the example of FIG. 9, the number of data sets is S = 4, and Δi is shown for Pi = 1, Pi = 2, and Pi = S = 4. When Pi = 1, Δi = 3 * T_II; when Pi = 2, Δi = T_II; and when Pi = S = 4, Δi = 0.
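The three cases can be reproduced directly from Δi = T_II * {(S / Pi) - 1}. The following is a small sketch; the function name `pipeline_delta` and the choice T_II = 1 (so that results read as multiples of T_II) are illustrative assumptions.

```python
from fractions import Fraction


def pipeline_delta(t_ii, s, p):
    """Delta_i = T_II * ((S / Pi) - 1), computed exactly with fractions."""
    return t_ii * (Fraction(s, p) - 1)


S = 4
T_II = 1  # assumed unit value, so each result is a multiple of T_II
deltas = {p: pipeline_delta(T_II, S, p) for p in (1, 2, 4)}
# deltas[1] == 3, deltas[2] == 1, deltas[4] == 0
```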

From the above, with degree of parallelism Pi, the i-th execution time Ti of a CI processing circuit follows from Equation 1 as follows.
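The expression itself does not appear in this text. Substituting Δi = T_II * {(S / Pi) - 1} into Equation 1 gives the following form, offered as a reconstruction from the definitions above:

```latex
T_i = T_{LD} + T_{II}\left(\frac{S}{P_i} - 1\right) + T_{COMP\_SINGLE} + T_{ST}
```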

Since the memory access times T_LD and T_ST are sufficiently small compared with the data processing time, omitting T_LD and T_ST from Equation 1 gives the execution time Ti as follows.
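The simplified expression is likewise not shown in this text. Dropping T_LD and T_ST from Equation 1 and using Δi = T_II * {(S / Pi) - 1} leaves the following, which later text appears to cite as Equation 2 (a reconstruction, with the numbering inferred from those references):

```latex
T_i \approx T_{II}\left(\frac{S}{P_i} - 1\right) + T_{COMP\_SINGLE}
```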

Accordingly, the difference in execution time when the degree of parallelism of the circuit is changed from Pi to Pj (the time saved, when Pi < Pj) is as follows.
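The difference is not written out in this text. Taking the execution time as T_II * {(S / P) - 1} + T_{COMP_SINGLE} with load and store times omitted, the T_{COMP_SINGLE} terms cancel. A reconstruction under that assumption is:

```latex
T_i(P_i) - T_i(P_j) = T_{II}\left(\frac{S}{P_i} - \frac{S}{P_j}\right)
                    = T_{II}\, S \left(\frac{1}{P_i} - \frac{1}{P_j}\right)
```

The result is positive when Pi < Pj, which matches its description as a time saving.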

Further, in Equation 2, when the degree of parallelism of the CI processing circuit is set to its maximum, Pi = S, the execution time Ti becomes the shortest execution time T_min, as follows.
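With Pi = S, the factor (S / Pi) - 1 is zero, so the initiation-interval term vanishes. Assuming load and store times are omitted as in the preceding simplification, the shortest execution time can be reconstructed as:

```latex
T_{min} = T_{II}\left(\frac{S}{S} - 1\right) + T_{COMP\_SINGLE} = T_{COMP\_SINGLE}
```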

The total execution time T_{CI_total} over the N executions of the CI processing circuit, where Pi is the degree of parallelism during the i-th execution, is as follows.
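The total is not written out in this text. Summing the per-execution time (load and store times omitted) over the N executions gives the following reconstruction, which the later text appears to cite as Equation 5:

```latex
T_{CI\_total} = \sum_{i=1}^{N} \left[ T_{II}\left(\frac{S}{P_i} - 1\right) + T_{COMP\_SINGLE} \right]
```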

When the degree of parallelism is changed from Pi to Pj, the total execution time before the change, T_{CI_total_before}, and the total execution time after the change, T_{CI_total_after}, are given by Equation 5 above with the degree of parallelism set to Pi and Pj, respectively.

Furthermore, when the degree of parallelism Pi takes its maximum value S in all N executions (Pi = S for every element i of [1:N]), the total execution time takes the following minimum value T_{CI_total_min}.
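When every execution runs with Pi = S, each Δi is zero, so the minimum reduces to N * T_{COMP_SINGLE} (with load and store times omitted as above). The sketch below evaluates the total for a few parallelism schedules; the function name and the numeric values are hypothetical and chosen only for illustration.

```python
def ci_total_time(t_ii, t_comp_single, s, parallelism):
    """Total CI execution time over all executions.

    Each execution i contributes T_II * ((S / Pi) - 1) + T_COMP_SINGLE,
    where parallelism[i] is Pi. Load/store times are omitted.
    """
    return sum(t_ii * (s / p - 1) + t_comp_single for p in parallelism)


# Hypothetical values: T_II = 2, T_COMP_SINGLE = 10, S = 4, N = 3 executions
total_serial = ci_total_time(2, 10, 4, [1, 1, 1])  # no parallelism
total_mixed = ci_total_time(2, 10, 4, [1, 2, 4])   # parallelism raised over time
total_min = ci_total_time(2, 10, 4, [4, 4, 4])     # Pi = S for every execution
```

With these values, `total_min` equals N * T_{COMP_SINGLE} = 30, while the serial schedule takes 48.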

Next, the degree of parallelism and execution time of a DI processing circuit will be described. In a DI processing circuit, the data load LD and data store ST also occur during the data processing COMP of each data set. Therefore, unlike in a CI processing circuit, the computation time T_{COMP_SINGLE} does not account for most of the execution time Ti. In the present embodiment, the execution time of a DI processing circuit is calculated on the assumption that multiplying the degree of parallelism by N shortens the execution time to 1/N.

In the present embodiment, when a bus bottleneck occurs, the degree of parallelism of a predetermined user circuit is reduced to eliminate the bottleneck, and the degree of parallelism of a different user circuit is increased instead. However, whether to carry out this adjustment is decided by comparing T_{total_before}, the total execution time of all user circuits before the adjustment, with T_{total_after}, the total after the adjustment. That is, the parallelism adjustment for eliminating the bottleneck is carried out if T_{total_after} is shorter than T_{total_before}, and is not carried out otherwise.

T_{total_before} and T_{total_after} above are each the sum of the execution times of the CI processing circuits and the DI processing circuits, as follows.
T_{total_before} = T_{DI_total_before} + T_{CI_total_before}
T_{total_after} = T_{DI_total_after} + T_{CI_total_after}

The execution time of a CI processing circuit after its degree of parallelism is adjusted is predicted by Equations 5 and 6 above. The execution time of a DI processing circuit after adjustment is predicted by multiplying the execution time measured before adjustment by 1/N, the reciprocal of the factor N by which the degree of parallelism is multiplied. Whether a circuit is a CI or a DI processing circuit can be determined, for example, when the processing program is compiled. Alternatively, it can be determined from the measured bandwidth: a circuit that uses bandwidth only at the beginning and end of data processing is judged to be a CI processing circuit, and any other circuit a DI processing circuit.
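The before/after comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the CI time uses the simplified per-execution form T_II * {(S / P) - 1} + T_{COMP_SINGLE} with a constant degree of parallelism, and all numbers are invented.

```python
def di_time(measured_time, p_before, p_after):
    """DI assumption from the text: execution time scales as 1/parallelism."""
    return measured_time * p_before / p_after


def ci_time(t_ii, t_comp_single, s, n, p):
    """CI total over n executions at constant parallelism p (load/store omitted)."""
    return n * (t_ii * (s / p - 1) + t_comp_single)


def should_adjust(total_before, total_after):
    """Adjust only if the predicted total execution time gets shorter."""
    return total_after < total_before


# Hypothetical case: a DI circuit is reduced from 4 to 2,
# and a CI circuit is increased from 1 to 4.
di_before = 100.0
di_after = di_time(di_before, 4, 2)
ci_before = ci_time(2, 10, 4, 3, 1)
ci_after = ci_time(2, 10, 4, 3, 4)
decision = should_adjust(di_before + ci_before, di_after + ci_after)
```

In this invented case the DI circuit's time doubles while the CI circuit gains little, so the predicted total grows and the adjustment would not be carried out.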

[Outline of FPGA management program processing]
FIG. 10 is a flowchart of the FPGA management program in the first embodiment. For example, the job management program of the OS (Operating System) monitors the jobs of user application programs executed by the processor and, when the processing of a job can be executed by a logic circuit in the FPGA, raises an interrupt to the processor requesting configuration of a new user circuit.

When the processor executing the FPGA management program receives a request from the OS to configure a new user circuit (YES in S1), it processes the request as follows. First, the processor determines whether the value obtained by subtracting the total area of the operating user circuits from the total area of the FPGA's reconfiguration area is larger than the area of the new user circuit (S2). The total area of the reconfiguration area is, for example, the number of partial reconfiguration blocks PB, and the total area of the operating user circuits is, for example, the number of partial reconfiguration blocks in which operating user circuits are configured.

If the determination in step S2 is YES, the processor requests the FPGA to configure the new user circuit (S3). When the FPGA reports that configuration of the new user circuit is complete (YES in S4), the processor notifies the FPGA to start the job on the user circuit (S5). If the determination in step S2 is NO, the processor does not request the FPGA to configure the new user circuit, and instead stores the new circuit configuration request in a request queue (S9). Requests in the request queue are checked as new user circuit configuration requests in step S1 of the next cycle.

Further, when the processor receives a job completion notification for a user circuit from the FPGA (S6), it notifies the FPGA to release the user circuit whose job has completed (S7). The control circuit in the FPGA then puts the user circuit configured in the reconfiguration area into the released state.

Further, the processor executes the user circuit parallelism adjustment process S8, which is described later. The processor repeats steps S1 to S8 above.
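The area check and request queue of steps S1 to S9 can be sketched as below. The class and method names are hypothetical, area is counted in partial reconfiguration blocks, and the FPGA handshake (S3 to S5, S7) is reduced to dictionary updates. Retrying queued requests immediately after a release is a simplification of the "next cycle" check described above.

```python
from collections import deque


class FpgaManager:
    """Toy model of steps S1-S9; all names here are illustrative assumptions."""

    def __init__(self, total_blocks):
        self.total_blocks = total_blocks  # total reconfiguration area
        self.running = {}                 # circuit name -> blocks in use
        self.queue = deque()              # pending configuration requests (S9)

    def free_blocks(self):
        return self.total_blocks - sum(self.running.values())

    def request(self, name, blocks):
        # S2: free area must be larger than the new circuit's area
        if self.free_blocks() > blocks:
            self.running[name] = blocks   # S3-S5: configure and start the job
            return True
        self.queue.append((name, blocks))  # S9: hold the request
        return False

    def job_done(self, name):
        del self.running[name]            # S6-S7: release the finished circuit
        # Re-check each queued request once (simplified "next cycle" S1)
        for _ in range(len(self.queue)):
            self.request(*self.queue.popleft())


mgr = FpgaManager(total_blocks=8)
first = mgr.request("UC_1", 3)
second = mgr.request("UC_2", 4)
third = mgr.request("UC_3", 2)   # only 1 block free, so this is queued
mgr.job_done("UC_1")             # frees 3 blocks; UC_3 is retried
```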

[User circuit parameters]
Before describing the parallelism adjustment process S8, examples of the user circuit parameters managed by the processor will be described.

FIG. 11 is a table showing the user circuit parameters managed by the processor. For each of the user circuits UC_1, UC_2, UC_3, and UC_4 configured in the FPGA's reconfiguration area, the table shows the degree of parallelism P of the logic circuit, the predicted configuration time CT_E, the predicted execution time ET_E, the predicted bandwidth BD_E, the measured execution time ET_M, and the measured bandwidth BD_M.

The predicted configuration time CT_E is the predicted time required to download the configuration data of a logic circuit from the FPGA memory and set it in the configuration data memory C_RAM in the FPGA. The predicted execution time ET_E is the predicted time for the logic circuit to complete one job. The predicted bandwidth BD_E is the predicted bus bandwidth (data transfer amount per unit time) that the logic circuit uses during job execution, in MB/s.

The measured execution time ET_M and measured bandwidth BD_M, on the other hand, are the values measured by the execution time measurement circuit and the data transfer amount measurement circuit provided in the logic circuit.

The upper limit of the FPGA bus bandwidth is denoted BD_L. BD_L is the bandwidth of the FPGA bus, and the total amount of data that the logic circuits configured in the reconfiguration area transfer over the FPGA bus cannot exceed it. Therefore, when the total data transfer amount of the logic circuits configured in the reconfiguration area has reached BD_L, a bottleneck can be considered to have occurred in the bus bandwidth.

The parameters in FIG. 11 further include, for each of the user circuits UC_1, UC_2, UC_3, and UC_4: the circuit type indicating the circuit's processing pattern, CI (Computation Intensive) or DI (Data Intensive); the number of executions N of the user circuit for a job; the number of data sets S processed in one execution; the input data read time T_LD; the output data write time T_ST; the processing time T_{COMP_SINGLE} of one data set; and the initiation interval time T_II of the pipeline circuit.

In the parallelism adjustment process S8 for the users' logic circuits, the processor controls the degree of parallelism of the users' logic circuits in the FPGA's reconfiguration area based on the values shown in FIG. 11.

FIG. 12 is a flowchart of the user circuit parallelism adjustment process S8. Each time a fixed waiting period elapses (YES in S10), the processor executing the FPGA management program obtains the measured execution time ET_M and measured bandwidth BD_M currently being measured by the execution time measurement circuit and bandwidth measurement circuit of each user circuit configured in the FPGA, either by reading them from those circuits or by receiving them from the control circuit 151 in the FPGA (S11).

[Parallelism increase control (1)]
The processor then determines whether the value obtained by subtracting the total measured bandwidth of the user circuits from the FPGA bus bandwidth upper limit BD_L is larger than the minimum bandwidth required to increase the degree of parallelism of any one of the user circuits configured in the FPGA (S12). If the determination in step S12 is YES, the processor increases the degree of parallelism of the user circuits within the range that satisfies Equations 1 and 2 shown below (S13A).

FIG. 13 is a flowchart showing the process of increasing the degree of parallelism in step S13A. First, the processor sorts the plurality of (n) user circuits in a predetermined order (for example, ascending order of measured bandwidth BD_M) (S131). Let the index i = 1 to n denote the position of each user circuit in this sorted order. Then, for each i = 1 to n in sorted order (S132 to S135), the processor determines whether Equations 1 and 2 below are satisfied with the degree of parallelism PXi (= Pi + 1) obtained by increasing the degree of parallelism Pi of the i-th user circuit by one (S133).

Equations 1 and 2 are shown in FIG. 11 and are as follows.
Σ (BD_Mj / Pj) * PXj < BD_L   (Equation 1)
Σ (Aj * PXj) ≤ A_L   (Equation 2)
Here, Σ denotes the sum over all user circuits j = 1 to n. In Equations 1 and 2, PXj = Pj + 1 if j = i and PXj = Pj if j ≠ i; that is, the degree of parallelism is increased by 1 only for the i-th user circuit being processed, and the other user circuits keep their current degree of parallelism Pj.

That is, Equation 1 for n = 4 and i = 2 is as follows.
(BD_M1 / P1) * P1 + (BD_M2 / P2) * PX2 + (BD_M3 / P3) * P3 + (BD_M4 / P4) * P4 < BD_L
The first term on the left side is (BD_M1 / P1) * P1 = BD_M1, and the same holds for the third and fourth terms. Therefore:
BD_M1 + (BD_M2 / P2) * PX2 + BD_M3 + BD_M4 < BD_L

Further, in Equation 2, Aj is the circuit area of user circuit j at a degree of parallelism of 1 (for example, the number of partial reconfiguration blocks), and A_L is the total circuit area of the reconfiguration area (for example, the total number of partial reconfiguration blocks). Equation 2 for n = 4 and i = 2 is as follows.
A1 * P1 + A2 * PX2 + A3 * P3 + A4 * P4 ≤ A_L

Satisfying Equation 1 means that the total bandwidth used by all user circuits, after the degree of parallelism Pi of only the i-th user circuit is increased by one, is smaller than the FPGA bus bandwidth upper limit BD_L. The term (BD_M2 / P2) * PX2 in Equation 1 reflects the assumption that the measured bandwidth is proportional to the degree of parallelism. Satisfying Equation 2 means that the total area used by all user circuits, after the degree of parallelism Pi of only the i-th user circuit is increased by one, is no more than the total circuit area A_L of the FPGA.
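The two conditions can be checked together in a few lines. The sketch below assumes each user circuit is described by a dictionary with hypothetical keys `bd_m` (measured bandwidth), `p` (degree of parallelism), and `area` (area at a degree of parallelism of 1); the function name and the numeric values are illustrative only.

```python
def satisfies_constraints(circuits, i, bd_l, a_l):
    """Check Equations 1 and 2 when only circuit i gets parallelism P + 1.

    circuits: list of dicts with keys bd_m, p, area (illustrative layout).
    i: zero-based index of the circuit whose parallelism is increased.
    """
    bandwidth = 0.0
    area = 0
    for j, c in enumerate(circuits):
        px = c["p"] + 1 if j == i else c["p"]
        bandwidth += c["bd_m"] / c["p"] * px  # Equation 1, left side
        area += c["area"] * px                # Equation 2, left side
    return bandwidth < bd_l and area <= a_l


# Hypothetical values for four user circuits
circuits = [
    {"bd_m": 100, "p": 1, "area": 1},
    {"bd_m": 400, "p": 2, "area": 1},
    {"bd_m": 200, "p": 1, "area": 1},
    {"bd_m": 300, "p": 1, "area": 1},
]
ok_bandwidth_and_area = satisfies_constraints(circuits, 1, bd_l=1350, a_l=8)
fails_area = satisfies_constraints(circuits, 1, bd_l=1350, a_l=5)
fails_bandwidth = satisfies_constraints(circuits, 3, bd_l=1250, a_l=8)
```

For the second circuit, the predicted bandwidth is 100 + (400 / 2) * 3 + 200 + 300 = 1200 and the area is 1 + 3 + 1 + 1 = 6, so the check passes with BD_L = 1350 and A_L = 8 but fails when A_L = 5.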

If the determination in step S133 is YES (YES in S133), the processor sets the increased degree of parallelism PXi as the degree of parallelism Pi of user circuit UC_i (S134). If step S133 is YES for every i = 1 to n, the degree of parallelism Pi of every user circuit has been increased by 1.

If, on the other hand, the determination in step S133 is NO for some i between 1 and n (NO in S133), the processor exits the loop of S132 to S135. That is, the degree of parallelism is increased by 1 for each user circuit in order, and the loop S132 to S135 ends as soon as the determination in step S133 is NO for some user circuit.
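The loop S131 to S135 can be sketched as follows. The names and data layout are hypothetical. One point is an explicit assumption rather than something the text states: after a circuit's degree of parallelism is raised, later checks in the same pass use its scaled bandwidth (per-instance bandwidth times the new degree of parallelism) rather than the original measurement.

```python
def increase_parallelism(circuits, bd_l, a_l):
    """Return new degrees of parallelism after one pass of S131-S135.

    circuits: dict name -> {"bd_m": measured bandwidth, "p": parallelism,
    "area": area at parallelism 1}. Layout and names are illustrative.
    """
    # Assumption: bandwidth per instance is the measured value divided by P.
    per_instance = {n: c["bd_m"] / c["p"] for n, c in circuits.items()}
    new_p = {n: c["p"] for n, c in circuits.items()}

    # S131: process circuits in ascending order of measured bandwidth
    for name in sorted(circuits, key=lambda n: circuits[n]["bd_m"]):
        trial = dict(new_p)
        trial[name] += 1
        bandwidth = sum(per_instance[n] * trial[n] for n in circuits)  # Eq. 1
        area = sum(circuits[n]["area"] * trial[n] for n in circuits)   # Eq. 2
        if not (bandwidth < bd_l and area <= a_l):
            break          # S133 is NO: leave the loop
        new_p = trial      # S134: adopt the increased parallelism
    return new_p


# Hypothetical values for four user circuits
circuits = {
    "UC_1": {"bd_m": 100, "p": 1, "area": 1},
    "UC_2": {"bd_m": 400, "p": 2, "area": 1},
    "UC_3": {"bd_m": 200, "p": 1, "area": 1},
    "UC_4": {"bd_m": 300, "p": 1, "area": 1},
}
result = increase_parallelism(circuits, bd_l=1350, a_l=16)
```

Here UC_1 and UC_3 are each raised by one (predicted totals 1100 and 1300 MB/s), and the loop stops at UC_4, whose increase would bring the total to 1600 MB/s.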

The CPU then requests the FPGA to reconfigure user circuit UC_i with the newly set degree of parallelism Pi and, after receiving the reconfiguration completion notification for that user circuit, notifies the FPGA to resume execution of the user circuit's job (S137).

In FIG. 13, the user circuits are sorted in ascending order of measured bandwidth and the degree of parallelism of user circuits with small measured bandwidth is increased preferentially. When some user circuit fails to satisfy Equation 1, the processor may again determine whether the degree of parallelism of the circuits with small measured bandwidth can be increased further. In general, the smaller a circuit's measured bandwidth, the smaller the increase in bandwidth when its degree of parallelism is raised by 1, so the total can be kept below the bus bandwidth upper limit. The aim is to increase the degree of parallelism of such user circuits further, shortening their job execution time and completing their jobs sooner. Once a user circuit's job has completed, it may then be possible to increase the degree of parallelism of the other user circuits and shorten their job execution times as well.

[Reducing the parallelism of the user circuit presumed to cause the bus bandwidth bottleneck, and increasing the parallelism of other user circuits]
Returning to FIG. 12, if the determination in step S12 is NO, the processor determines whether the total measured bandwidth has reached the FPGA bus bandwidth upper limit (S14). A YES determination in step S14 means that a bottleneck has occurred in the FPGA bus bandwidth.

The processor therefore reduces the degree of parallelism of a predetermined user circuit UC_MAX so that the bandwidth in use falls below the upper limit (S15). As a first example, the user circuit with the largest measured bandwidth is selected as UC_MAX. To decide how far to reduce the degree of parallelism, the processor calculates the predicted bandwidth of UC_MAX and chooses a value such that its sum with the measured bandwidths of the other user circuits stays below the bandwidth upper limit. The user circuit with the largest measured bandwidth is, in most cases, a DI processing circuit.

In a second example, the user circuit with the largest difference between the execution time ET_E and the measured execution time ET_M is selected as UC_MAX. Such a user circuit is highly likely to be unable, because of the bus bottleneck, to use as much FPGA bus bandwidth as its predicted used bandwidth BD_E. Lowering the parallelism of that circuit therefore improves a situation in which part of the circuit cannot operate fully because of the bus bottleneck and is configured in the FPGA to no purpose.

In a third example, the user circuit with the largest difference between the predicted used bandwidth BD_E and the measured used bandwidth BD_M may be selected as the target whose parallelism is lowered. The circuit with the largest difference is the one that, because of the bus bottleneck, cannot use as much FPGA bus bandwidth as its predicted used bandwidth BD_E, so it is chosen as the target.

In a fourth example, the user circuit with the highest parallelism may be selected and its parallelism lowered. A user circuit whose parallelism is set to the maximum can be regarded as being favored over the other user circuits, so it is chosen as the target whose parallelism is lowered.
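The four selection examples for UC_MAX can be compared side by side in a short sketch. The record type, field names, policy names and sample values are all hypothetical; the sign conventions (ET_M minus ET_E, BD_E minus BD_M) are assumptions based on the reasoning that a starved circuit runs longer and uses less bandwidth than expected.

```python
from dataclasses import dataclass


@dataclass
class CircuitStats:
    name: str
    parallelism: int     # current parallelism P
    bd_measured: float   # measured used bandwidth BD_M
    bd_predicted: float  # predicted used bandwidth BD_E
    et_measured: float   # measured execution time ET_M
    et_reference: float  # execution time ET_E


# One scoring function per example; the circuit with the highest score becomes UC_MAX.
POLICIES = {
    "largest_bandwidth": lambda c: c.bd_measured,                    # first example
    "largest_time_gap": lambda c: c.et_measured - c.et_reference,    # second example
    "largest_bandwidth_gap": lambda c: c.bd_predicted - c.bd_measured,  # third example
    "largest_parallelism": lambda c: c.parallelism,                  # fourth example
}


def select_uc_max(circuits, policy):
    """Return the circuit whose parallelism would be lowered under the given policy."""
    return max(circuits, key=POLICIES[policy])


stats = [
    CircuitStats("UC_1", 4, 9.0, 10.0, 12.0, 10.0),
    CircuitStats("UC_2", 8, 3.0, 8.0, 11.0, 10.0),
    CircuitStats("UC_3", 2, 5.0, 5.5, 20.0, 12.0),
]
chosen = {policy: select_uc_max(stats, policy).name for policy in POLICIES}
```

With these sample values the four policies do not agree on a single circuit, which illustrates why the description presents them as alternatives.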

Along with lowering the parallelism of the predetermined user circuit UC_MAX, the processor increases the parallelism of user circuits other than UC_MAX (S15). The circuits whose parallelism is increased can be chosen in various ways. In a first example, as in FIG. 13, the other user circuits are sorted in an arbitrary order and their parallelism is increased in that order as long as Equations 1 and 2 are satisfied. In other words, among the user circuits in the FPGA other than UC_MAX, the parallelism of any user circuit is increased without distinguishing DI processing circuits from CI processing circuits.

In a second example, CI processing circuits are selected preferentially and their parallelism is increased. Because a CI processing circuit uses little bandwidth, its execution time may be shortened while the total stays below the bandwidth upper limit. As noted above, however, the reduction in the execution time of a CI processing circuit may not be very large.

In a third example, the processor selects, among the CI processing circuits, those whose execution time is shortened substantially by an increase in parallelism, additionally selects some DI processing circuits, and increases the parallelism of both the selected CI processing circuits and the selected DI processing circuits. In this case it is desirable that the execution time of the circuits whose parallelism was increased is shortened and that, after they complete, the parallelism of the predetermined user circuit that was lowered to resolve the bus bottleneck can be raised again, so that the total execution time is shortened.

Next, the processor calculates two totals. The pre-adjustment (unadjusted) total execution time T_{total_before} is the sum of the execution times until job completion of all user circuits when the parallelism increase and decrease of step S15 is not applied. The post-adjustment (adjusted) total execution time T_{total_after} is the same sum when the adjustment is applied. The processor then determines whether T_{total_after} < T_{total_before} (S15B).

If the determination is YES, the processor executes the parallelism adjustment (S15C). If the determination is NO, the processor does not execute it (S15D); that is, it tolerates the bus bottleneck and leaves the parallelism of the user circuits unadjusted. A user circuit stays configured in the FPGA while its processing is running, so a shorter total execution time until job completion across all user circuits means that the circuit resources of the FPGA are used more efficiently. By making the determination of step S15B, the processor can decide whether to resolve the bus bottleneck from the standpoint of whether the FPGA circuit resources would be used more efficiently.

FIG. 14 is a detailed flowchart of steps S15, S15B, S15C, and S15D of FIG. 12. To lower the parallelism of the predetermined user circuit UC_MAX, the processor extracts the user circuit UC_MAX with the largest measured used bandwidth (S151), lowers its parallelism, and predicts the new used bandwidth by calculation. For example, the processor estimates that if the parallelism is lowered to 1/N of its value, the used bandwidth also falls to 1/N.
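The proportional estimate can be written as a single relation. The symbols \widehat{BD} (predicted used bandwidth) and P' (new parallelism) are notation introduced here for illustration; BD_M and P are the measured used bandwidth and the current parallelism of UC_MAX. The form matches the term (BD_M2 / P2) * PX2 used in Equation 1 later in this description.

```latex
\widehat{BD}(P') = BD_M \cdot \frac{P'}{P},
\qquad
P' = \frac{P}{N} \;\Longrightarrow\; \widehat{BD}(P') = \frac{BD_M}{N}
```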

The processor then determines whether the predicted used bandwidth falls below the upper limit (S153). If it does not, the processor executes step S152 again and lowers the parallelism further. If it does, the processor determines whether the difference between the bandwidth upper limit and the total predicted used bandwidth of all user circuits exceeds a predetermined reference value Vth (S154). If the difference exceeds Vth, the processor increases the parallelism of a user circuit other than UC_MAX and predicts the new used bandwidth of that circuit (S155).
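The S152/S153 loop and the S154 headroom check can be sketched as below. The function names are hypothetical, the parallelism is assumed to be lowered one step at a time (the description does not fix the step size), and the used bandwidth is assumed to scale linearly with parallelism.

```python
def lower_parallelism(p_current, bd_measured, bd_others, bd_limit):
    """Lower the parallelism of UC_MAX until the predicted total is below the limit.

    Returns (new_parallelism, predicted_total, resolved). 'resolved' is False when
    even parallelism 1 cannot bring the total under the limit.
    """
    per_unit = bd_measured / p_current          # linear model: bandwidth per parallel unit
    p = p_current
    predicted_total = bd_measured + bd_others   # at the bottleneck this sits at the limit
    while predicted_total >= bd_limit and p > 1:
        p -= 1                                   # corresponds to repeating S152
        predicted_total = per_unit * p + bd_others
    return p, predicted_total, predicted_total < bd_limit


def has_headroom(bd_limit, predicted_total, v_th):
    """S154: is the gap to the bandwidth limit larger than the reference value Vth?"""
    return bd_limit - predicted_total > v_th


result = lower_parallelism(p_current=4, bd_measured=8.0, bd_others=4.0, bd_limit=12.0)
```

Here one step from parallelism 4 to 3 is enough, and the remaining gap of 2.0 is what the S154 check compares against Vth before any other circuit is given more parallelism.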

The processor then predicts the total execution time T_{total_before} of all user circuits at the parallelism before the lowering and raising adjustment (the old parallelism) and, likewise, the total execution time T_{total_after} of all user circuits at the parallelism after the adjustment (the new parallelism) (S156). The processor compares the two predicted values and determines whether T_{total_after} < T_{total_before} (S157).

If this determination is YES (YES in S157), the processor requests the FPGA to apply the new parallelism of the user circuits and, after receiving the circuit reconfiguration completion notification, notifies the user circuits whose parallelism was updated to resume their jobs (S161). The processor also executes step S161 when the determination of step S154 is NO. The processor then stores the user circuit UC_MAX in the reduced-parallelism list (S162).

If, on the other hand, the determination of T_{total_after} < T_{total_before} is NO (NO in S157), the processor determines whether the user circuits other than the lowered user circuit UC_MAX include a DI processing circuit whose parallelism has not yet been adjusted (S158). If they do (YES in S158), raising the parallelism of a DI processing circuit instead of a CI processing circuit may turn the determination of step S157 to YES. The processor therefore changes the combination of DI processing circuits and CI processing circuits among the first user circuits, that is, the circuits whose parallelism is increased (S160), and executes steps S155, S156, and S157 again.

For example, among the first user circuits, increasing the parallelism of a certain DI processing circuit instead of a certain CI processing circuit may make the post-adjustment execution time T_{total_after} shorter than the pre-adjustment execution time T_{total_before}. In particular, for a CI processing circuit that has a large circuit scale and whose execution time is shortened only slightly by higher parallelism, it is desirable to cancel the parallelism increase, free the circuit resources it would have used, and select a DI processing circuit in its place for the parallelism increase.

When the determination of step S157 becomes YES, the processor executes steps S161 and S162 and carries out the parallelism adjustment.

If, on the other hand, the determination of step S157 is still NO after the combination of DI processing circuits and CI processing circuits among the first user circuits has been changed K times (NO in S159), the processor does not lower the parallelism of the predetermined user circuit UC_MAX or raise the parallelism of the first user circuits (S15D). When the determination of step S158 is NO, the determination of step S157 is likewise unlikely to be reversed, so the processor does not adjust the parallelism of the user circuits (S15D).
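The bounded search over combinations (S155 to S160) can be sketched as follows. The function and parameter names are hypothetical. Because this section does not give a formula for predicting the total execution time, the predictor is passed in by the caller; the sample totals are invented for illustration.

```python
def search_adjustment(t_total_before, candidate_combinations, predict_total_after, max_changes):
    """Return the first combination predicted to beat the unadjusted total, else None.

    The initial combination plus at most 'max_changes' (K) alternative combinations
    are evaluated; None corresponds to leaving the parallelism unadjusted (S15D).
    """
    for attempt, combination in enumerate(candidate_combinations):
        if attempt > max_changes:
            break  # K changes of combination exhausted
        if predict_total_after(combination) < t_total_before:  # S157
            return combination
    return None


# Hypothetical predicted totals T_total_after for each set of circuits to be raised.
predicted = {("UC_B",): 120.0, ("UC_C",): 95.0, ("UC_B", "UC_C"): 90.0}
chosen = search_adjustment(100.0, list(predicted), predicted.get, max_changes=2)
```

The search stops at the first improving combination rather than the best one, which mirrors the flowchart: as soon as S157 returns YES the adjustment is applied.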

According to the flowchart of FIG. 14, the processor searches for a combination of first user circuits for which the total execution time after the adjustment, in which the parallelism of the predetermined user circuit UC_MAX is lowered to resolve the bus bottleneck and the parallelism of other user circuits (the first user circuits) is raised, is shorter than the total execution time before the adjustment. This makes it possible both to resolve the bus bottleneck and to use the circuit resources of the FPGA effectively.

[Control for increasing the parallelism of the user circuit UC_MAX]
Returning to FIG. 12, if the processor receives a job execution completion notification from a user circuit (YES in S16) while waiting for the fixed period (NO in S10), it increases the parallelism of the user circuit UC_MAX within the range that satisfies Equations 1 and 2 (S17). If no job execution completion notification is received during the wait, the processor ends the parallelism adjustment process S8 for the user circuits.

FIG. 15 is a flowchart of the processing of step S17. On receiving a job execution completion notification from a user circuit (YES in S16 of FIG. 12), the processor determines whether the user circuit UC_MAX is present in the reduced-parallelism list (S171). If it is (YES in S171), the processor calculates the largest new parallelism PX of UC_MAX that satisfies Equations 1 and 2 (S172). Equations 1 and 2 are the same as those of FIG. 13. Here, however, a user circuit has just completed its job and been released, so the released circuit is excluded from Equations 1 and 2, and the circuit whose parallelism is increased is the predetermined user circuit UC_MAX.

For example, if the circuit UC_2 of user 2 is stored in the reduced-parallelism list and the jobs of the circuits UC_1 and UC_3 of users 1 and 3 have completed, Equations 1 and 2 become:
(BD_M2 / P2) * PX2 + BD_M4 < BD_L   (Equation 1)
A2 * PX2 + A4 * P4 ≤ A_L   (Equation 2)

The processor calculates the largest new parallelism PX2 that satisfies the above equations. As a result, the user circuit UC_MAX (UC_2) receives priority for a parallelism increase whenever another user circuit finishes executing.
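Finding the largest PX2 allowed by Equations 1 and 2 amounts to a small integer search. The sketch below uses a hypothetical function name and invented numbers for BD_M2, P2, A2, BD_M4, A4, P4, BD_L, and A_L; it assumes the per-unit bandwidth and area of the target circuit are positive so the loop terminates.

```python
def max_new_parallelism(bd_m, p, area, others, bd_limit, area_limit):
    """Largest integer PX satisfying both constraints, or 0 if none does.

    Equation 1: (bd_m / p) * PX + sum of the others' measured bandwidth <  bd_limit
    Equation 2: area * PX       + sum of the others' area * parallelism <= area_limit
    'others' is a list of (BD_M, A, P) tuples for the circuits still running.
    """
    other_bandwidth = sum(bd for bd, _, _ in others)
    other_area = sum(a * q for _, a, q in others)
    px = 0
    while ((bd_m / p) * (px + 1) + other_bandwidth < bd_limit
           and area * (px + 1) + other_area <= area_limit):
        px += 1
    return px


# UC_2 (BD_M2=6, P2=2, A2=2) is the lowered circuit; UC_4 (BD_M4=5, A4=1, P4=3) still runs.
px2 = max_new_parallelism(bd_m=6.0, p=2, area=2.0, others=[(5.0, 1.0, 3)],
                          bd_limit=20.0, area_limit=12.0)
```

In this example both constraints bind at the same value: PX2 = 5 would make the bandwidth total exactly 20 (not below BD_L) and the area total 13 (above A_L).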

The processor then requests the FPGA to configure the logic circuit of the user circuit UC_MAX at the new parallelism PX and, on receiving the circuit reconfiguration completion notification, issues a notification to resume the job (S173). The processor also deletes the user circuit UC_MAX whose parallelism was increased from the reduced-parallelism list (S173).

Returning to FIG. 12, the parallelism adjustment control that the processor applies to the user circuits can be summarized as follows. Normally, at fixed intervals, the processor acquires the measured execution time ET_M and the measured used bandwidth BD_M from the measurement circuits of the user circuits in the FPGA (S11). If the difference between the bandwidth upper limit of the FPGA bus and the total measured used bandwidth of all user circuits is larger than the bandwidth needed for a parallelism increase (YES in S12), the processor increases the parallelism of a user circuit within the range that satisfies Equations 1 and 2 (S13A). Also, when a user circuit completes its job (YES in S16) and there is no user circuit UC_MAX whose parallelism was lowered (NO in S171), the determination of step S12 becomes YES on the basis of the measured used bandwidth BD_M acquired in the next measurement cycle, and the processor again increases parallelism, giving priority to a predetermined user circuit, within the range that satisfies Equations 1 and 2 (S13A).

If, on the other hand, the total measured used bandwidth has reached the bandwidth upper limit of the FPGA bus (YES in S14), the processor selects a predetermined user circuit UC_MAX suspected of causing the FPGA bus bandwidth bottleneck and lowers its parallelism so that the bottleneck is resolved (S15). A DI processing circuit with a large used bandwidth, for example, is selected as UC_MAX. The processor also increases the parallelism of user circuits other than UC_MAX within the range that satisfies Equations 1 and 2 (S15). A CI processing circuit with a small used bandwidth, for example, is selected as a circuit whose parallelism is raised.

If the total predicted execution time of all user circuits after the parallelism adjustment is shorter than the total predicted execution time before the adjustment, the processor applies the parallelism change (S15C); otherwise it does not (S15D).

Furthermore, when a user circuit completes its job (YES in S16), the processor raises the parallelism of the lowered user circuit UC_MAX to the largest value that satisfies Equations 1 and 2 (S17). In this way the processor temporarily lowers the parallelism of the user circuit UC_MAX regarded as the cause of the bus bandwidth bottleneck, while the increased parallelism of the other user circuits shortens their total execution time. When the other user circuits finish their jobs, the parallelism of UC_MAX, which was lowered temporarily, is raised again. As a result, adjusting the parallelism when a bus bottleneck occurs can resolve the bottleneck and may also shorten the total execution time of all user circuits.
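The branch structure of this summary (S12, then S14) can be condensed into one decision function. This is a simplified sketch with hypothetical names and return labels; it covers only the periodic measurement path, not the job completion path of S16 and S17.

```python
def next_action(total_measured, bd_limit, bd_needed_for_increase):
    """Classify one measurement cycle of FIG. 12 from the total measured bandwidth."""
    headroom = bd_limit - total_measured
    if headroom > bd_needed_for_increase:
        return "increase"    # S12 YES: raise parallelism within Equations 1 and 2 (S13A)
    if total_measured >= bd_limit:
        return "rebalance"   # S14 YES: lower UC_MAX and raise others, subject to S15B
    return "wait"            # neither condition holds; wait for the next cycle


actions = [next_action(total, 10.0, 2.0) for total in (6.0, 9.0, 10.0)]
```

The middle case shows the dead band between the two branches: the bus is not saturated, but there is not enough spare bandwidth to justify another parallelism increase.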

[Specific examples of parallelism adjustment]
FIG. 16 shows a first specific example. The horizontal axis is time (TIME), and the rows show, from top to bottom, (1) the predicted execution time when no bus bottleneck occurs, (2) the predicted execution time when a bus bottleneck occurs and the parallelism is not adjusted, and (3) the predicted execution time when a bus bottleneck occurs and the parallelism is adjusted. The first specific example is the simplest case: user circuits UC-A and UC-B are configured in the FPGA, UC-A being a DI processing circuit and UC-B a CI processing circuit.

In (1) of FIG. 16, the user circuit UC-A is a DI processing circuit with parallelism P = 4 and an execution time of 8 per run, and its job completes after 5 runs. The user circuit UC-B is a CI processing circuit with parallelism P = 2 and an execution time of 25 per run (T_LD = 1, initiation interval Δ = 2, T_COMP = 21, T_ST = 1), and its job completes after 2 runs.

(2) Bus bottleneck, no parallelism adjustment: a bus bottleneck occurs at time t1, and memory accesses by the user circuits UC-A and UC-B take longer. The parallelism of neither user circuit is changed despite the bottleneck. As a result, the execution time of UC-A becomes longer by 6 and that of UC-B becomes longer by 3.

(3) Bus bottleneck, parallelism adjusted: the parallelism P of the user circuit UC-A is reduced from 4 to 2, which doubles its execution time per run from 8 to 16. The parallelism P of the user circuit UC-B is doubled from 2 to 4, but because UC-B is a CI processing circuit, only the initiation interval Δ decreases, from 2 to 1, and the execution time per run shortens only slightly, from 25 to 24. After UC-B finishes executing, the parallelism P of UC-A is changed from 2 to 8 and its final run takes only 4.
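The per-run figures quoted for FIG. 16 are consistent with two simple models, sketched below as an illustration. Both are assumptions inferred from the numbers rather than formulas stated in this section: a DI run time inversely proportional to parallelism, and a CI run time equal to the sum T_LD + Δ + T_COMP + T_ST. The function names are hypothetical.

```python
def di_run_time(base_time, base_parallelism, parallelism):
    """DI circuit: run time assumed inversely proportional to parallelism."""
    return base_time * base_parallelism / parallelism


def ci_run_time(t_ld, interval, t_comp, t_st):
    """CI circuit: run time assumed to be load + initiation interval + compute + store."""
    return t_ld + interval + t_comp + t_st


uc_a = [di_run_time(8, 4, p) for p in (4, 2, 8)]                  # P = 4, 2, 8
uc_b = [ci_run_time(1, delta, 21, 1) for delta in (2, 1)]         # interval 2, then 1
```

Under these assumptions, halving the parallelism of UC-A costs 8 time units per run, while doubling the parallelism of UC-B saves only 1, which is why the trade is unfavorable in this example.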

Comparing the total execution time of the two user circuits in cases (2) and (3), UC-A becomes longer by 22 and UC-B becomes shorter by 1, so the post-adjustment total execution time T_{total_after} of (3) is longer than the pre-adjustment total execution time T_{total_before} of (2). Therefore, when the bus bottleneck occurs, the processor determines that the parallelism of the user circuits is not to be adjusted.

FIG. 17 shows a second specific example. The horizontal axis and rows (1), (2), and (3) are the same as in FIG. 16. In the second specific example, a user circuit UC-C is configured in the FPGA in addition to UC-A and UC-B; UC-A and UC-C are DI processing circuits and UC-B is a CI processing circuit.

In (1) of FIG. 17, the user circuits UC-A and UC-B are the same as in FIG. 16. The user circuit UC-C is a DI processing circuit with parallelism P = 3 and an execution time of 12 per run, and its job completes after 4 runs.

(2) Bus bottleneck, no parallelism adjustment: a bus bottleneck occurs at time t1, and memory accesses by the user circuits UC-A, UC-B, and UC-C take longer. The parallelism of none of the user circuits is changed despite the bottleneck. As a result, the execution time of UC-A becomes longer by 6, that of UC-B by 3, and that of UC-C by 6.

(3) Bus bottleneck, parallelism adjusted: the parallelism P of the user circuit UC-A is reduced from 4 to 2, which doubles its execution time per run from 8 to 16. As explained in the first specific example, raising the parallelism of UC-B shortens its execution time only slightly, so the parallelism P of UC-B is left at 2. Instead, the parallelism P of the user circuit UC-C is changed from 3 to 4. As a result, UC-B incurs no reconfiguration time and its execution time per run is 25, the same as in (1), while the execution time of UC-C per run shortens from 12 to 9. In this example the bandwidth used by UC-C was small, so its parallelism could be increased while staying below the bandwidth upper limit.

After UC-C finishes executing, the parallelism P of UC-A is changed from 2 to 8, and each of its last three runs takes only 4.

Comparing the total execution time of the user circuits in cases (2) and (3), UC-A becomes longer by 14, while UC-B becomes shorter by 3 and UC-C becomes shorter by 12. As a result, the post-adjustment total execution time T_{total_after} of (3) is shorter than the pre-adjustment total execution time T_{total_before} of (2). Therefore, when the bus bottleneck occurs, it is determined that the parallelism of the user circuits is to be adjusted.
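The two outcomes can be checked with the per-circuit differences quoted for FIG. 16 and FIG. 17. The sketch below uses a hypothetical function name and takes each value as the change in total execution time from case (2) to case (3), positive meaning longer; adjusting is worthwhile only when the changes sum to a negative number, which is equivalent to T_{total_after} < T_{total_before}.

```python
def adjustment_is_worthwhile(time_changes):
    """True when the summed change in execution time (after minus before) is negative."""
    return sum(time_changes.values()) < 0


fig16 = {"UC-A": +22, "UC-B": -1}               # first specific example
fig17 = {"UC-A": +14, "UC-B": -3, "UC-C": -12}  # second specific example

net16 = sum(fig16.values())
net17 = sum(fig17.values())
```

The net change is +21 in the first example (do not adjust) and -1 in the second (adjust), so the second example only narrowly passes the comparison.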

In the second specific example, when the bus bottleneck occurs, the parallelism of the DI processing circuit UC-A is lowered to resolve the bottleneck, and the freed bus bandwidth and FPGA circuit resources are used to raise the parallelism of UC-C, which is a DI processing circuit, rather than that of the CI processing circuit. Because the total execution time after the adjustment is shorter in this case, the determination is that the parallelism is to be adjusted.

As described above, in the present embodiment, when a bus bottleneck occurs, the processor temporarily lowers the parallelism of a predetermined user circuit to resolve the bottleneck, uses the freed circuit resources to raise the parallelism of another user circuit, and raises the parallelism of the predetermined user circuit again after the other user circuit finishes executing. Whether to perform this parallelism adjustment is determined by comparing the total predicted execution time before the adjustment with the total predicted execution time after the adjustment.

Normally, when a bus bottleneck occurs, lowering the parallelism of a DI processing circuit that uses a large bandwidth and raising the parallelism of a CI processing circuit that uses a small bandwidth resolves the bottleneck while limiting the deterioration in execution time.

However, the execution time of a CI processing circuit shortens only slightly even when its parallelism is raised, so the total execution time may not become shorter even after the parallelism adjustment. In that case it is determined that the parallelism is not to be adjusted.

Even among CI processing circuits, however, the reduction in execution time obtained from higher parallelism varies. By selecting a CI processing circuit with a large reduction and raising its parallelism, and by additionally selecting a DI processing circuit that does not cause a bus bottleneck and raising its parallelism to shorten its execution time, it may be possible both to resolve the bus bottleneck and to limit the deterioration in execution time.

In the embodiment above, when a bus bottleneck occurs, whether to perform a parallelism adjustment that lowers the parallelism of some user circuits and raises the parallelism of other user circuits is determined by comparing the total execution times before and after the adjustment. The same approach is not limited to bus bottlenecks. For example, when the power consumption of the FPGA reaches its upper limit (a power consumption bottleneck), whether to lower the parallelism of a user circuit with high power consumption and raise the parallelism of other user circuits with low power consumption may likewise be determined by comparing the total execution times before and after the adjustment.

The above embodiments are summarized in the following appendices.

(Appendix 1)
An information processing device comprising:
a processor that executes a program; and
a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured,
wherein the processor
compares a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed, and
requests the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and does not request the PLD to perform the parallelism adjustment when it is not shorter.

(Appendix 2)
The information processing device according to Appendix 1, wherein the processor
acquires measured values of the data transfer amounts of the plurality of logic circuits that are configured and operating in the reconfiguration area, and
executes the comparison when the total of the data transfer amounts reaches the upper limit of the data transfer amount of the bus of the PLD.
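
The trigger condition of Appendix 2 can be sketched as follows. This is a minimal illustration, assuming the measured bandwidth of each operating logic circuit (BD_M) and the upper-limit bandwidth of the bus (BD_L) are given as plain numbers in the same unit. The function name and the sample values are assumptions, not part of the embodiment.

```python
def bus_bottleneck_reached(bd_m, bd_l):
    """Return True when the summed measured bandwidth reaches the bus limit.

    bd_m: measured bandwidth (BD_M) of each operating logic circuit
    bd_l: upper-limit bandwidth (BD_L) of the PLD bus, same unit as bd_m
    """
    # "Reaches" is read here as greater than or equal to the limit.
    return sum(bd_m) >= bd_l


# Hypothetical measurements: 300 + 450 + 250 = 1000, which meets the limit.
measured = [300, 450, 250]
if bus_bottleneck_reached(measured, 1000):
    print("run the execution-time comparison")
```

Only when this condition holds does the processor go on to compare the first and second execution times.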

(Appendix 3)
The information processing device according to Appendix 1, wherein the processor,
in the comparison, compares the total of the first execution times of the plurality of logic circuits with the total of the second execution times, and
requests the PLD to perform the parallelism adjustment when the total of the first execution times is shorter than the total of the second execution times, and does not request the PLD to perform the parallelism adjustment when it is not shorter.

(Appendix 4)
The information processing device according to Appendix 1, wherein the processor further
calculates the first execution time and the second execution time of the plurality of logic circuits.

(Appendix 5)
The information processing device according to Appendix 4, wherein the processor
calculates, as the first execution time of the plurality of logic circuits, the execution time of the plurality of logic circuits that is predicted for a case where, after the parallelism adjustment that lowers the degree of parallelism of the first logic circuit and raises the degree of parallelism of the second logic circuit is performed, the degree of parallelism of the first logic circuit is raised once the execution of the second logic circuit has completed.
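
The prediction in Appendix 5 can be illustrated with a small worked calculation. The sketch below rests on two assumptions that are not stated in this appendix: the progress rate of each circuit is proportional to its degree of parallelism, and the total execution time is the sum of the completion times of the two circuits. All names and values are hypothetical.

```python
def total_time_with_restore(t_a, p_a, p_a_low, t_b, p_b, p_b_high):
    """Predict the total execution time after adjustment with later restoration.

    t_a, t_b: remaining execution times of circuits A and B at their current
              degrees of parallelism p_a and p_b
    p_a_low:  lowered degree of parallelism of A (the first logic circuit)
    p_b_high: raised degree of parallelism of B (the second logic circuit)
    Assumption: progress rate scales linearly with the degree of parallelism.
    """
    # B runs at the raised parallelism for all of its remaining work.
    finish_b = t_b * p_b / p_b_high
    # Time A would need if it stayed at the lowered parallelism.
    finish_a_low = t_a * p_a / p_a_low
    if finish_a_low <= finish_b:
        # A completes before B does, so there is nothing to restore.
        finish_a = finish_a_low
    else:
        # Work A completed while slowed, expressed in time at its original rate.
        done = finish_b * p_a_low / p_a
        # After B completes, A is restored and finishes the rest at its original rate.
        finish_a = finish_b + (t_a - done)
    return finish_a + finish_b


# Hypothetical case: A has 100 time units left at P=4 (lowered to 2),
# B has 60 time units left at P=2 (raised to 4).
t_after = total_time_with_restore(100, 4, 2, 60, 2, 4)
t_before = 100 + 60
print(t_after, t_before)  # 145.0 160
```

In this example B completes at 30, A has finished 15 units of its work by then and completes at 115 after restoration, so the predicted total of 145 is shorter than the unadjusted total of 160.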

(Appendix 6)
The information processing device according to Appendix 4, wherein
the plurality of logic circuits include one or both of a data-intensive processing circuit, in which memory accesses occur during data processing, and a computation-intensive processing circuit, in which memory accesses occur at the beginning and end of data processing, and
the processor
calculates the first execution time such that, when the degree of parallelism of the data-intensive processing circuit is multiplied by N, the execution time becomes 1/N times, and
calculates the first execution time such that, when the degree of parallelism of the computation-intensive processing circuit is multiplied by N, the initiation interval time in the pipeline processing of the computation-intensive processing circuit becomes 1/N times.
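
The two scaling rules of Appendix 6 can be sketched as follows. The data-intensive rule follows the text directly. For the computation-intensive circuit, the sketch assumes the common pipeline model in which the execution time is a fixed latency plus the initiation interval times the number of items minus one; that formula, the function names, and the sample values are assumptions and are not taken from this appendix.

```python
def di_time(exec_time, n):
    """Data-intensive circuit: the whole execution time becomes 1/N."""
    return exec_time / n


def ci_time(latency, interval, items, n):
    """Computation-intensive circuit: only the initiation interval becomes 1/N.

    Assumed pipeline model: time = latency + interval * (items - 1).
    """
    return latency + (interval / n) * (items - 1)


# Doubling the parallelism halves the DI circuit's execution time.
print(di_time(120.0, 2))              # 60.0
# For the CI circuit, only the interval term shrinks.
print(ci_time(10.0, 4.0, 101, 2))     # 210.0 (410.0 at n = 1)
# When the fixed latency dominates, the gain from doubling is small.
print(ci_time(200.0, 4.0, 11, 1), ci_time(200.0, 4.0, 11, 2))  # 240.0 220.0
```

Under this model the benefit for a computation-intensive circuit depends on how much of its execution time is governed by the initiation interval, which is why the reduction differs from one such circuit to another.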

(Appendix 7)
The information processing device according to Appendix 1, wherein
the plurality of logic circuits include both a data-intensive processing circuit, in which memory accesses occur during data processing, and a computation-intensive processing circuit, in which memory accesses occur at the beginning and end of data processing, and
the processor
selects the first logic circuit from among the data-intensive processing circuits and selects the second logic circuit from among the computation-intensive processing circuits.

(Appendix 8)
The information processing device according to Appendix 7, wherein the processor,
when the first execution time is not shorter than the second execution time, changes the second logic circuit and performs the comparison again.

(Appendix 9)
The information processing device according to Appendix 8, wherein the processor,
in changing the second logic circuit, selects a data-intensive processing circuit in place of a computation-intensive processing circuit, among the second logic circuits, whose execution time is shortened only slightly when its degree of parallelism is increased.
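
The retry described in Appendices 8 and 9 can be sketched as a loop over candidates for the second logic circuit. The order used here (computation-intensive candidates by descending predicted reduction, then data-intensive candidates) is one possible reading and an assumption, as are the function name, the dictionary fields, the circuit names UC_C and UC_D, and the sample values.

```python
def pick_second_circuit(candidates, t_total_before, predict_after):
    """Return the first candidate whose adjustment shortens the total time.

    candidates:     list of dicts with keys "name", "kind" ("CI" or "DI") and
                    "reduction" (predicted time saved by raising parallelism)
    t_total_before: total execution time without adjustment
    predict_after:  callable giving the predicted total execution time when
                    the named circuit is used as the second logic circuit
    """
    # Try CI circuits with the largest predicted reduction first.
    ci = sorted((c for c in candidates if c["kind"] == "CI"),
                key=lambda c: c["reduction"], reverse=True)
    # Fall back to DI circuits when no CI circuit gives an improvement.
    di = [c for c in candidates if c["kind"] == "DI"]
    for cand in ci + di:
        if predict_after(cand["name"]) < t_total_before:
            return cand["name"]
    return None  # no candidate improves the total: do not request adjustment


# Hypothetical candidates and predicted totals.
candidates = [
    {"name": "UC_B", "kind": "CI", "reduction": 2},
    {"name": "UC_C", "kind": "CI", "reduction": 5},
    {"name": "UC_D", "kind": "DI", "reduction": 20},
]
predicted = {"UC_B": 105, "UC_C": 101, "UC_D": 90}
print(pick_second_circuit(candidates, 100, predicted.get))  # UC_D
```

In the example neither computation-intensive candidate brings the total below 100, so the comparison is repeated with the data-intensive candidate, which does.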

(Appendix 10)
A PLD management program that causes a processor to execute a management process for a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured,
the management process comprising:
comparing a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed; and
requesting the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and not requesting the PLD to perform the parallelism adjustment when it is not shorter.

(Appendix 11)
A PLD management method for an information processing device that has a processor that executes a program and a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured, the method comprising:
comparing a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed; and
requesting the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and not requesting the PLD to perform the parallelism adjustment when it is not shorter.

10: Information processing device
11: CPU, processor
12: Main memory
15: FPGA, PLD
16: Auxiliary storage device
17: Data memory for FPGA
BUS_1: CPU bus
BUS_2: PCI bus
BUS_3: FPGA bus
I_BUS: FPGA internal bus
RC_REG: Reconfiguration area
OC: FPGA operation circuit
PB: Partial reconfiguration block
UC_A, UC_B: User circuits
151: C_DATA write control circuit
C_RAM: Configuration data memory
P: Degree of parallelism
ET_E: Predicted execution time
BD_E: Predicted bandwidth
ET_M: Measured execution time
BD_M: Measured bandwidth, used bandwidth
A1, A2: User circuit area
BD_L: Upper-limit bandwidth
A_L: Total circuit area
T_{total_after}: First execution time, first total execution time, total execution time after parallelism adjustment
T_{total_before}: Second execution time, second total execution time, total execution time before parallelism adjustment
CI: Computation-intensive processing circuit
DI: Data-intensive processing circuit

Claims

An information processing device comprising:
a processor that executes a program; and
a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured,
wherein the processor
compares a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed, and
requests the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and does not request the PLD to perform the parallelism adjustment when it is not shorter.

The information processing device according to claim 1, wherein the processor
acquires measured values of the data transfer amounts of the plurality of logic circuits that are configured and operating in the reconfiguration area, and
executes the comparison when the total of the data transfer amounts reaches the upper limit of the data transfer amount of the bus of the PLD.

The information processing device according to claim 1, wherein the processor,
in the comparison, compares the total of the first execution times of the plurality of logic circuits with the total of the second execution times, and
requests the PLD to perform the parallelism adjustment when the total of the first execution times is shorter than the total of the second execution times, and does not request the PLD to perform the parallelism adjustment when it is not shorter.

The information processing device according to claim 1, wherein the processor further
calculates the first execution time and the second execution time of the plurality of logic circuits.

The information processing device according to claim 4, wherein the processor
calculates, as the first execution time of the plurality of logic circuits, the execution time of the plurality of logic circuits that is predicted for a case where, after the parallelism adjustment that lowers the degree of parallelism of the first logic circuit and raises the degree of parallelism of the second logic circuit is performed, the degree of parallelism of the first logic circuit is raised once the execution of the second logic circuit has completed.

The information processing device according to claim 4, wherein
the plurality of logic circuits include one or both of a data-intensive processing circuit, in which memory accesses occur during data processing, and a computation-intensive processing circuit, in which memory accesses occur at the beginning and end of data processing, and
the processor
calculates the first execution time such that, when the degree of parallelism of the data-intensive processing circuit is multiplied by N, the execution time becomes 1/N times, and
calculates the first execution time such that, when the degree of parallelism of the computation-intensive processing circuit is multiplied by N, the initiation interval time in the pipeline processing of the computation-intensive processing circuit becomes 1/N times.

The information processing device according to claim 1, wherein
the plurality of logic circuits include both a data-intensive processing circuit, in which memory accesses occur during data processing, and a computation-intensive processing circuit, in which memory accesses occur at the beginning and end of data processing, and
the processor
selects the first logic circuit from among the data-intensive processing circuits and selects the second logic circuit from among the computation-intensive processing circuits.

The information processing device according to claim 7, wherein the processor,
when the first execution time is not shorter than the second execution time, changes the second logic circuit and performs the comparison again.

A PLD management program that causes a processor to execute a management process for a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured,
the management process comprising:
comparing a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed; and
requesting the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and not requesting the PLD to perform the parallelism adjustment when it is not shorter.

A PLD management method for an information processing device that has a processor that executes a program and a programmable logic circuit device (hereinafter, PLD) having a reconfiguration area in which, in response to a configuration request from the processor, a logic circuit requested by the configuration request is configured, the method comprising:
comparing a first execution time of a plurality of logic circuits that are configured and operating in the reconfiguration area, for a case where a parallelism adjustment is performed that lowers the degree of parallelism of a first logic circuit and raises the degree of parallelism of a second logic circuit among the plurality of logic circuits, with a second execution time of the plurality of logic circuits for a case where the parallelism adjustment is not performed; and
requesting the PLD to perform the parallelism adjustment when the first execution time is shorter than the second execution time, and not requesting the PLD to perform the parallelism adjustment when it is not shorter.