JP7305052B2

JP7305052B2 - Delay update device, processing system and program

Info

Publication number: JP7305052B2
Application number: JP2022539830A
Authority: JP
Inventors: 真吾追立; 堅太藤本
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2023-07-07
Anticipated expiration: 2040-07-28
Also published as: JPWO2022024214A1; WO2022024214A1

Description

本開示は、マルチコアプロセッサシステムに関する。 The present disclosure relates to multicore processor systems.

特許文献１に記載されているように、従来、処理の高速化を図るために、複数のコアを有するマルチコアプロセッサを使用して並列処理を実行するマルチコアプロセッサシステムに関する技術が提案されている。また、マルチコアプロセッサシステムの処理性能を高めるため、複数のコアでキャッシュを共有する技術が知られている。 As described in Patent Literature 1, conventionally, in order to speed up processing, there has been proposed a technology related to a multicore processor system that executes parallel processing using a multicore processor having a plurality of cores. Also, in order to improve the processing performance of a multi-core processor system, a technology is known in which a plurality of cores share a cache.

マルチコアプロセッサシステムにおいて、全プログラムがキャッシュに格納される場合には、マルチコアプロセッサは効率よくプログラムを実行することができる。 In a multi-core processor system, if all programs are cached, the multi-core processor can execute programs efficiently.

しかしながら、ソフトウェアの肥大化等により、全プログラムがキャッシュに格納されない場合、キャッシュに収まりきらないプログラムは、キャッシュミスの発生都度、キャッシュへロードされる。そのため、マルチコアプロセッサの実行効率が低下する可能性がある。 However, if the entire program cannot be stored in the cache due to software bloat or the like, the program that cannot be stored in the cache is loaded into the cache each time a cache miss occurs. Therefore, the execution efficiency of the multicore processor may decrease.

一方で、並列処理方法としては、複数のデータに対して複数コアが同じ処理を並列に実行するデータ並列処理が知られている。 On the other hand, as a parallel processing method, data parallel processing is known in which a plurality of cores execute the same processing on a plurality of data in parallel.

Patent Literature 1: JP 2008-191949 A

For a multi-core processor system, it is desirable that the multi-core processor can be used effectively.

そこで、本開示は上述の点に鑑みて成されたものであり、マルチコアプロセッサの有効利用を図ることが可能な技術を提供することを目的とする。 Therefore, the present disclosure has been made in view of the above points, and an object thereof is to provide a technology that enables effective use of a multi-core processor.

One aspect of a program creation device is a program creation device that creates a program to be executed by the multi-core processor of a multi-core processor system comprising a multi-core processor including a plurality of cores and a shared cache shared by the plurality of cores. The program creation device comprises: a specifying unit that specifies, in a processing target program, a target partial program, which is a partial program to be converted into a data parallel program in which data parallel processing is described; and a generation unit that converts the target partial program into the data parallel program describing data parallel processing in which the processing start timing in some of the plurality of cores is delayed relative to the processing start timing in the other cores, and that generates an execution target program including the data parallel program thus obtained.

Further, one aspect of a delay amount updating device is a delay amount updating device that updates the delay amount of the processing start timing in said some of the cores in the data parallel processing described in the data parallel program created by the above program creation device. The delay amount updating device comprises: an acquisition unit that acquires execution state information indicating an execution state of the data parallel processing in the multi-core processor; and an update unit that updates the delay amount based on the execution state information.

また、処理システムの一態様は、上記のプログラム作成装置と、上記の遅延量更新装置とを備える。 Further, one aspect of the processing system includes the above program creation device and the above delay amount updating device.

Further, one aspect of a program is a program for creating a program to be executed by the multi-core processor of a multi-core processor system comprising a multi-core processor including a plurality of cores and a shared cache shared by the plurality of cores. The program causes a computer device to execute: a step of specifying, in a processing target program, a target partial program, which is a partial program to be converted into a data parallel program in which data parallel processing is described; and a step of converting the target partial program into the data parallel program describing data parallel processing in which the processing start timing in some of the plurality of cores is delayed relative to the processing start timing in the other cores, and generating an execution target program including the data parallel program thus obtained.

Further, one aspect of a program is a program for updating the delay amount of the processing start timing in said some of the cores in the data parallel processing described in the data parallel program created by the above program creation device. The program causes a computer device to execute: a step of acquiring execution state information indicating an execution state of the data parallel processing in the multi-core processor; and a step of updating the delay amount based on the execution state information.

マルチコアプロセッサの有効利用を図ることができる。 Effective use of a multi-core processor can be achieved.

本開示の目的、特徴、態様、および利点は、以下の詳細な説明と添付図面とによって、より明白となる。 Objects, features, aspects and advantages of the present disclosure will become more apparent with the following detailed description and accompanying drawings.

A block diagram showing an example of the configuration of a processing system.
A block diagram showing an example of the configuration of a program creation device.
A block diagram showing an example of the configuration of a program writing device.
A block diagram showing an example of the configuration of a program creation device.
A flowchart showing an example of the operation of a program creation device.
A diagram showing an example of the operation of a multi-core processor.
A diagram showing an example of the operation of a multi-core processor.
A diagram showing an example of a for statement.
A diagram showing an example of a delayed data parallel program.
A block diagram showing an example of the configuration of a delay amount updating device.
A flowchart showing an example of the operation of a delay amount updating device.
A diagram showing an example of the operation of a multi-core processor.
A diagram showing an example of the operation of a multi-core processor.
A diagram showing an example of a designation description.
A diagram showing an example of the configuration of part of an engineering tool.
A block diagram showing an example of an update unit.
A flowchart showing an example of the operation of a delay amount updating device.

Embodiment 1.
FIG. 1 is a block diagram showing an example of the configuration of a processing system 1 according to this embodiment. The processing system 1 comprises a multi-core processor system 2 and an engineering tool 3. The engineering tool 3 creates a program to be executed by the multi-core processor system 2. The engineering tool 3 then writes the created program to the multi-core processor system 2.

<Configuration example of multi-core processor system>
The multi-core processor system 2 (also simply called the processor system 2) is installed, for example, in embedded devices such as home appliances and IoT (Internet of Things) devices. The processor system 2 may also be installed in a device other than an embedded device.

As shown in FIG. 1, the processor system 2 includes, for example, a multi-core processor 20 having a plurality of cores 21, a shared cache 22 shared by the plurality of cores 21, and a shared memory 23 shared by the plurality of cores 21. Each core 21 is a kind of arithmetic circuit and can execute a program. The plurality of cores 21 can execute a common program independently of one another. The plurality of cores 21 are housed, for example, in a single package. The multi-core processor 20 can be regarded as, for example, a kind of CPU (Central Processing Unit).

共有メモリ２３（単にメモリ２３ともいう）は、プログラム等の各種データを記憶する記憶回路である。メモリ２３はメインメモリとも呼ばれる。メモリ２３内のプログラムはマルチコアプロセッサ２０によって実行される。メモリ２３は、例えば、メインプログラム２３０、更新プログラム２３１及び遅延量２３２を記憶する。メインプログラム２３０は、例えば、プロセッサシステム２が搭載される組み込み機器が実行する各種処理が記述されたプログラムである。遅延量２３２は、後述するように、メインプログラム２３０の実行時に参照されるパラメータである。更新プログラム２３１は、遅延量２３２を更新する処理が記述されたプログラムである。メモリ２３は、例えばＲＯＭ（Read Only Memory）で構成されている。メモリ２３は、フラッシュＲＯＭで構成されてもよいし、フラッシュＲＯＭ以外のメモリで構成されてもよい。 The shared memory 23 (simply called memory 23) is a storage circuit that stores various data such as programs. The memory 23 is also called main memory. Programs in the memory 23 are executed by the multicore processor 20 . The memory 23 stores a main program 230, an update program 231 and a delay amount 232, for example. The main program 230 is, for example, a program in which various types of processing to be executed by an embedded device in which the processor system 2 is installed are described. The delay amount 232 is a parameter that is referenced when the main program 230 is executed, as will be described later. The update program 231 is a program in which processing for updating the delay amount 232 is described. The memory 23 is composed of, for example, a ROM (Read Only Memory). The memory 23 may be composed of a flash ROM, or may be composed of a memory other than the flash ROM.

共有キャッシュ２２（単にキャッシュ２２ともいう）は、プログラム等の各種データを一時的に記憶する記憶回路である。キャッシュ２２の記憶容量は、メモリ２３の記憶容量よりも小さくなっている。また、キャッシュ２２の速度はメモリ２３の速度よりも速くなっている。キャッシュ２２は、例えばＲＡＭ（Random Access Memory）で構成されている。キャッシュ２２は、ＳＲＡＭ（Static Random Access Memory）で構成されてもよいし、ＳＲＡＭ以外のメモリで構成されてもよい。 The shared cache 22 (simply referred to as cache 22) is a storage circuit that temporarily stores various data such as programs. The storage capacity of cache 22 is smaller than the storage capacity of memory 23 . Also, the speed of the cache 22 is faster than the speed of the memory 23 . The cache 22 is composed of, for example, a RAM (Random Access Memory). The cache 22 may be composed of SRAM (Static Random Access Memory), or may be composed of memory other than SRAM.

各コア２１は、プログラムを実行する場合、キャッシュ２２内に処理対象のデータ（プログラムも含む）が存在するときにはキャッシュ２２から処理対象のデータを読み出す。キャッシュ２２内に処理対象のデータ（言い換えれば必要なデータ）が存在することは、キャッシュヒットと呼ばれる。一方で、各コア２１は、キャッシュ２２内に処理対象のデータが存在しないときには、共有メモリ２３から処理対象のデータを読み出してキャッシュ２２内に一旦記憶する。そして、各コア２１は、キャッシュ２２から処理対象のデータを読み出す。キャッシュ２２内に処理対象のデータが存在しないことは、キャッシュミスと呼ばれる。 When executing a program, each core 21 reads the data to be processed from the cache 22 if the data to be processed (including the program) exists in the cache 22 . The presence of data to be processed (in other words, necessary data) in the cache 22 is called a cache hit. On the other hand, when the data to be processed does not exist in the cache 22 , each core 21 reads the data to be processed from the shared memory 23 and temporarily stores it in the cache 22 . Then, each core 21 reads data to be processed from the cache 22 . The absence of data to be processed in the cache 22 is called a cache miss.
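The hit and miss behaviour described above can be illustrated with a small sketch. This is a toy model, not the patent's implementation: the class name `SharedCache`, the dictionary-based memory, the capacity of two entries, and the FIFO eviction policy are all assumptions introduced for illustration.

```python
class SharedCache:
    """Toy model of a small shared cache in front of a larger shared memory."""

    def __init__(self, memory, capacity):
        self.memory = memory      # backing store (stands in for the shared memory 23)
        self.capacity = capacity  # number of entries the cache can hold
        self.lines = {}           # address -> data, kept in insertion order
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Cache hit: the data is served directly from the cache.
            self.hits += 1
            return self.lines[address]
        # Cache miss: load the data from memory into the cache, then read it.
        self.misses += 1
        if len(self.lines) >= self.capacity:
            oldest = next(iter(self.lines))  # FIFO eviction (an assumption)
            del self.lines[oldest]
        self.lines[address] = self.memory[address]
        return self.lines[address]


memory = {addr: f"data{addr}" for addr in range(8)}
cache = SharedCache(memory, capacity=2)
for addr in (0, 1, 0, 2, 0):
    cache.read(addr)
print(cache.hits, cache.misses)
```

When the working set is larger than the cache, as in the access pattern above, entries are evicted and reloaded repeatedly, which is the situation the background section describes as lowering execution efficiency.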

<Configuration example of engineering tool>
As shown in FIG. 1, the engineering tool 3 includes a program creation device 30 (also simply called the creation device 30) and a program writing device 35 (also simply called the writing device 35). The creation device 30 can create the main program 230 that the multi-core processor 20 executes. The writing device 35 can write the main program 230 created by the creation device 30 into the memory 23. Note that the creation device 30 may also create the update program 231. In this case, the writing device 35 may write the update program 231 created by the creation device 30 into the memory 23.

FIG. 2 is a block diagram showing an example of the hardware configuration of the creation device 30. The creation device 30 is a kind of computer device. The creation device 30 may be a personal computer or another computer device. As shown in FIG. 2, the creation device 30 includes, as hardware, for example, a processing circuit 300, a storage device 301, a communication device 302, an input device 303, and a display device 304.

処理回路３００は、例えば、少なくとも一つのプロセッサを備える。処理回路３００は、例えば、プロセッサの一種であるＣＰＵを備える。処理回路３００は、複数のＣＰＵを備えてもよいし、少なくとも一つのＤＳＰ（Digital Signal Processor）を備えてもよい。処理回路３００は処理部３００とも言える。 Processing circuitry 300 comprises, for example, at least one processor. The processing circuit 300 includes, for example, a CPU, which is a type of processor. The processing circuit 300 may include multiple CPUs or at least one DSP (Digital Signal Processor). The processing circuit 300 can also be called a processing unit 300 .

Note that at least part of the processing circuit 300 may be implemented by a hardware circuit that does not require software to realize its functions. In this case, at least part of the processing circuit 300 may be composed of, for example, at least one of a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, an ASIC (Application Specific Integrated Circuit), and an FPGA (field-programmable gate array).

The storage device 301 includes non-transitory recording media, such as ROM and RAM, that are readable by a processor such as a CPU. The storage device 301 may include a computer-readable non-transitory recording medium other than ROM and RAM. The storage device 301 may include, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

記憶装置３０１は、プログラム作成装置３０を制御するためのプログラム３０１ａを記憶する。プログラム３０１ａには、例えばＯＳ（Operating System）が含まれる。処理回路３００のプロセッサがプログラム３０１ａを実行することによって、処理回路３００の各種機能が実現される。 The storage device 301 stores a program 301 a for controlling the program creating device 30 . The program 301a includes, for example, an OS (Operating System). Various functions of the processing circuit 300 are realized by the processor of the processing circuit 300 executing the program 301a.

表示装置３０４は、文字、記号及び図形等の各種情報を表示することが可能である。表示装置３０４は、液晶表示装置であってもよいし、他の種類の表示装置であってもよい。入力装置３０３は、ユーザからの入力を受け付けることが可能である。入力装置３０３は、例えば、キーボード及びマウスを備える。入力装置３０３は、タッチセンサを備えてもよい。この場合、表示装置３０４は、入力装置３０３が備えるタッチセンサを有するタッチパネルディスプレイであってもい。入力装置３０３はマイクを備えてもよい。 The display device 304 can display various types of information such as characters, symbols, and graphics. The display device 304 may be a liquid crystal display device or another type of display device. The input device 303 can accept input from the user. Input device 303 includes, for example, a keyboard and a mouse. Input device 303 may include a touch sensor. In this case, the display device 304 may be a touch panel display having a touch sensor included in the input device 303 . Input device 303 may comprise a microphone.

通信装置３０２は、作成装置３０の外部の装置と通信することが可能である。通信装置３０２は通信回路とも言える。通信装置３０２は、例えば、書き込み装置３５と通信することが可能である。通信装置３０２は、例えば、書き込み装置３５と直接的に有線通信を行う。通信装置３０２は、書き込み装置３５と無線通信を行ってもよい。また、通信装置３０２は、インターネット等を含むネットワークを通じて書き込み装置３５と通信してもよい。 The communication device 302 can communicate with devices external to the creation device 30 . The communication device 302 can also be said to be a communication circuit. The communication device 302 can communicate with the writing device 35, for example. The communication device 302 performs wired communication directly with the writing device 35, for example. The communication device 302 may perform wireless communication with the writing device 35 . Also, the communication device 302 may communicate with the writing device 35 through a network including the Internet.

なお、作成装置３０のハードウェア構成は上記の例に限られない。例えば、作成装置３０は、処理回路３００、記憶装置３０１、通信装置３０２、入力装置３０３及び表示装置３０４以外の装置を備えてもよい。 Note that the hardware configuration of the creation device 30 is not limited to the above example. For example, the creation device 30 may include devices other than the processing circuitry 300 , the storage device 301 , the communication device 302 , the input device 303 and the display device 304 .

FIG. 3 is a block diagram showing an example of the hardware configuration of the writing device 35. The writing device 35 is a kind of computer device. As shown in FIG. 3, the writing device 35 includes, as hardware, for example, a processing circuit 350, a storage device 351, a communication device 352, and a communication device 353.

処理回路３５０は、例えば、少なくとも一つのプロセッサを備える。処理回路３５０は、例えば、プロセッサの一種であるＣＰＵを備える。処理回路３５０は、複数のＣＰＵを備えてもよいし、少なくとも一つのＤＳＰを備えてもよい。 Processing circuitry 350 includes, for example, at least one processor. The processing circuitry 350 includes, for example, a CPU, which is a type of processor. The processing circuitry 350 may comprise multiple CPUs and may comprise at least one DSP.

なお、処理回路３５０の少なくとも一部は、その機能の実現にソフトウェアが不要なハードウェア回路によって実現されてもよい。この場合には、処理回路３５０の少なくとも一部は、例えば、単一回路、複合回路、プログラム化されたプロセッサ、並列プログラム化されたプロセッサ、ＡＳＩＣ及びＦＰＧＡの少なくとも一つで構成されてもよい。 Note that at least part of the processing circuit 350 may be implemented by a hardware circuit that does not require software to implement its functions. In this case, at least a portion of processing circuitry 350 may comprise, for example, single circuits, multiple circuits, programmed processors, parallel programmed processors, ASICs, and/or FPGAs.

The storage device 351 includes, for example, non-transitory recording media, such as ROM and RAM, that are readable by a processor such as a CPU. Like the storage device 301, the storage device 351 may include a computer-readable non-transitory recording medium other than ROM and RAM.

The storage device 351 stores a program 351a for controlling the writing device 35. Various functions of the processing circuit 350 are realized by the processor of the processing circuit 350 executing the program 351a.

通信装置３５２は、作成装置３０の通信装置３０２と通信することが可能である。通信装置３５３は、プロセッサシステム２と通信することが可能である。プロセッサシステム２は通信装置３５３を通信する通信装置を備える。通信装置３５３は、プロセッサシステム２と直接通信してもよいし、インターネット等を含むネットワークを通じてプロセッサシステム２と通信してもよい。通信装置３５３は有線通信を行ってもよいし、無線通信を行ってもよい。 The communication device 352 can communicate with the communication device 302 of the creation device 30 . Communication device 353 is capable of communicating with processor system 2 . Processor system 2 includes a communication device that communicates with communication device 353 . The communication device 353 may communicate directly with the processor system 2 or may communicate with the processor system 2 through a network including the Internet. The communication device 353 may perform wired communication or wireless communication.

なお、書き込み装置３５のハードウェア構成は上記の例に限られない。例えば、書き込み装置３５は、処理回路３５０、記憶装置３５１、通信装置３５２及び通信装置３５３以外の装置を備えてもよい。例えば、書き込み装置３５は、ユーザからの入力を受け付けることが可能な入力装置を備えてもよい。 Note that the hardware configuration of the writing device 35 is not limited to the above example. For example, writing device 35 may comprise devices other than processing circuitry 350 , storage device 351 , communication device 352 and communication device 353 . For example, writing device 35 may comprise an input device capable of accepting input from a user.

以上のようなハードウェア構成例を有するエンジニアリングツール３では、作成装置３０の処理回路３００がメインプログラム２３０を作成する。そして、作成装置３０の通信装置３０２は、処理回路３００で作成されたメインプログラム２３０を書き込み装置３５の通信装置３５２に送信する。書き込み装置３５では、処理回路３５０が、通信装置３５２で受信されたメインプログラム２３０を通信装置３５３に入力する。通信装置３５３は、入力されたメインプログラム２３０を、プロセッサシステム２の通信装置を通じてメモリ２３内に書き込む。 In the engineering tool 3 having the hardware configuration example as described above, the processing circuit 300 of the creation device 30 creates the main program 230 . The communication device 302 of the creating device 30 then transmits the main program 230 created by the processing circuit 300 to the communication device 352 of the writing device 35 . In the writing device 35 , the processing circuit 350 inputs the main program 230 received by the communication device 352 to the communication device 353 . The communication device 353 writes the input main program 230 into the memory 23 through the communication device of the processor system 2 .

なお、書き込み装置３５は、プロセッサシステム２と常に接続されている必要はない。例えば、書き込み装置３５がネットワークを通じてプロセッサシステム２と接続されている場合、メインプログラム２３０がメモリ２３内に書き込まれた後、書き込み装置３５は当該ネットワークから切り離されてもよい。 Note that the writing device 35 does not have to be always connected to the processor system 2 . For example, if the writing device 35 is connected to the processor system 2 through a network, the writing device 35 may be disconnected from the network after the main program 230 is written into the memory 23 .

In the above example, the writing device 35 writes the program to the memory 23 mounted in the processor system 2, but it may instead write the program to the memory 23 while the memory 23 is not mounted in the processor system 2. In this case, the writing device 35 includes a holding component, such as a socket, that holds the memory 23. The processing circuit 350 of the writing device 35 then writes the program to the memory 23 held by the holding component. The memory 23 to which the program has been written is then mounted in the processor system 2.

<Example of functional block configuration of program creation device>
FIG. 4 is a diagram showing an example of the functional block configuration of the creation device 30. As shown in FIG. 4, the creation device 30 includes, as functional blocks, a program description unit 310, a specifying unit 320, a generation unit 330, and a compilation unit 340. Each of the program description unit 310, the specifying unit 320, the generation unit 330, and the compilation unit 340 is, for example, a functional block realized in the processing circuit 300 when the processing circuit 300 executes the program 301a. Note that when at least part of the processing circuit 300 is realized by a hardware circuit that does not require software to realize its functions, at least one of the program description unit 310, the specifying unit 320, the generation unit 330, and the compilation unit 340 may be realized by such a hardware circuit.

プログラム記述部３１０は、入力装置３０３が受け付けるユーザからの入力に基づいて第１テキスト形式プログラム４０１を記述することが可能である。ユーザは、プログラム記述部３１０の機能を利用して、プログラム作成装置３０上で、第１テキスト形式プログラム４０１を記述することが可能である。プログラム記述部３１０は、入力装置３０３が受け付けるユーザの入力に基づいて第１テキスト形式プログラム４０１を作成する。 The program description unit 310 can describe the first text format program 401 based on the input from the user received by the input device 303 . The user can use the function of the program writing unit 310 to write the first text format program 401 on the program creation device 30 . The program description unit 310 creates the first text format program 401 based on user input received by the input device 303 .

プログラム記述部３１０は、例えば、スレッド単位で処理を記述することができる。プログラム記述部３１０は、各スレッドについて、そのスレッドをどのコア２１が実行するかを、ユーザの指示に応じて第１テキスト形式プログラム４０１に記述することができる。 The program description unit 310 can describe processing in units of threads, for example. The program description unit 310 can describe, for each thread, which core 21 executes the thread in the first text format program 401 according to the user's instruction.

プログラム記述部３１０は、例えば、プログラム３０１ａに含まれるＯＳが処理回路３００で実行されることによって実現される。ＯＳには、例えば、プログラムの一種であるテキストエディタが含まれており、当該テキストエディタが実行されることによってプログラム記述部３１０が実現される。プログラム記述部３１０が作成する第１テキスト形式プログラム４０１は、例えば表示装置３０４に表示される。これにより、ユーザは、例えば、表示装置３０４に表示される作成中の第１テキスト形式プログラム４０１を見ながら、第１テキスト形式プログラム４０１の記述を進めることができる。 The program description unit 310 is realized by executing an OS included in the program 301a in the processing circuit 300, for example. The OS includes, for example, a text editor, which is a type of program, and the program description unit 310 is realized by executing the text editor. The first text format program 401 created by the program description unit 310 is displayed on the display device 304, for example. As a result, the user can advance the description of the first text format program 401 while watching the first text format program 401 being created displayed on the display device 304, for example.

特定部３２０は、第１テキスト形式プログラム４０１において、データ並列処理が記述されたデータ並列プログラムに変換される対象の部分プログラムを特定する。以後、当該部分プログラムを変換対象部分プログラムと呼ぶことがある。 The identification unit 320 identifies a partial program to be converted into a data parallel program describing data parallel processing in the first text format program 401 . Hereinafter, the partial program may be called a conversion target partial program.

Here, data parallel processing is processing in which a plurality of cores execute the same processing in parallel on a plurality of data (or a plurality of data sets). In other words, data parallel processing means that a plurality of cores perform parallel processing on a plurality of data (or a plurality of data sets) using a common program (that is, the same program). The plurality of cores 21 of the processor system 2 can execute the same processing in parallel on a plurality of data (or a plurality of data sets) that have no dependency on one another. Data parallel processing is also sometimes called data parallelism. Hereinafter, the processing executed by each individual core in data parallel processing may be referred to as parallel individual processing. The operation of the specifying unit 320 will be described in detail later.
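As a rough illustration of data parallel processing, the following sketch applies one common routine to several independent data sets in parallel. Worker threads stand in for the cores 21; the function names `job` and `data_parallel`, the squaring operation, and the way the data is split are assumptions made for the example and do not come from the source.

```python
from concurrent.futures import ThreadPoolExecutor


def job(chunk):
    """Parallel individual processing: the same routine, applied to one data set."""
    return [x * x for x in chunk]


def data_parallel(data, num_cores):
    # Split the data into one independent chunk per worker ("core").
    chunks = [data[i::num_cores] for i in range(num_cores)]
    # Every worker runs the same function on its own chunk.
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        return list(pool.map(job, chunks))


results = data_parallel(list(range(8)), num_cores=4)
print(results)
```

Because the chunks have no dependency on one another, the workers need no synchronization other than waiting for all of them to finish.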

The generation unit 330 generates a second text format program 402 from the first text format program 401 based on the result of the specification by the specifying unit 320. Specifically, the generation unit 330 converts the conversion target partial program into a data parallel program (also called a delayed data parallel program) describing data parallel processing (also called delayed data parallel processing) in which, among the plurality of cores 21, the processing start timing in some of the cores 21 is delayed relative to the processing start timing in the other cores 21. The processing start timing can also be called the processing start time. The generation unit 330 then generates the second text format program 402 including the delayed data parallel program. The second text format program 402 is the first text format program 401 with the conversion target partial program replaced by the delayed data parallel program. As will be described later, the second text format program 402 is compiled and then executed by the multi-core processor 20, so it can be regarded as an execution target program. The operation of the generation unit 330 will be described in detail later.

コンパイル部３４０は、第２テキスト形式プログラム４０２をコンパイルして、第２テキスト形式プログラム４０２を実行形式プログラム４０３に変換する。実行形式プログラム４０３は、メインプログラム２３０として、書き込み装置３５によってメモリ２３に書き込まれる。 The compiling unit 340 compiles the second text format program 402 and converts the second text format program 402 into the executable format program 403 . The executable program 403 is written into the memory 23 by the writing device 35 as the main program 230 .

上記の説明から理解できるように、メインプログラム２３０には、遅延データ並列プログラム(詳細には実行形式に変換された遅延データ並列プログラム）が含まれる。この遅延データ並列プログラムに記述された遅延データ並列処理では、一部のコア２１が実行する並列個別処理の開始タイミングが、他のコア２１が実行する並列個別処理の開始タイミングよりも遅延している。 As can be understood from the above description, the main program 230 includes a delayed data parallel program (more specifically, a delayed data parallel program converted into executable form). In the delayed data parallel processing described in this delayed data parallel program, the start timing of the parallel individual processing executed by some cores 21 is delayed from the start timing of the parallel individual processing executed by the other cores 21. .

以後、並列個別処理をジョブと呼ぶことがある。また、遅延データ並列処理において、他の並列個別処理と比較して、開始タイミングが遅延している並列個別処理を、遅延並列個別処理あるいは遅延ジョブと呼ぶことがある。また、遅延データ並列処理において、他の並列個別処理と比較して、開始タイミングが先行している並列個別処理を、先行並列個別処理あるいは先行ジョブと呼ぶことがある。 Hereafter, parallel individual processing may be called a job. In the delayed data parallel processing, a parallel individual process whose start timing is later than other parallel individual processes may be called a delayed parallel individual process or a delayed job. Also, in the delayed data parallel processing, a parallel individual process whose start timing is earlier than other parallel individual processes may be called a preceding parallel individual process or a preceding job.

The delay amount 232 in the memory 23 indicates how much the start timing of a delayed parallel individual process (delayed job) is delayed relative to the start timing of the preceding parallel individual process (preceding job). The delay amount 232 is referenced by the cores 21 that execute the delayed jobs during execution of the delayed data parallel program included in the main program 230. When a core 21 executing the delayed data parallel program refers to the delay amount 232 and the delay amount 232 is stored in the cache 22, the core 21 reads it from the cache 22. If the delay amount 232 is not stored in the cache 22, the core 21 reads it from the memory 23, stores it in the cache 22, and then reads it from the cache 22.

FIG. 5 is a flowchart showing an example of the operation of the engineering tool 3. As shown in FIG. 5, in step s1, the program description unit 310 writes and creates the first text format program 401 based on user input. Next, in step s2, the identification unit 320 identifies, in the first text format program 401, the partial program to be converted into a data parallel program, that is, the conversion target partial program. Next, in step s3, the generation unit 330 converts the conversion target partial program into a delayed data parallel program and generates the second text format program 402 including the resulting delayed data parallel program. Next, in step s4, the compiling unit 340 compiles the second text format program 402 to generate the executable format program 403. Then, in step s5, the writing device 35 writes the executable format program 403 to the memory 23 as the main program 230. During actual operation of the embedded device in which the processor system 2 is mounted, the main program 230 in the memory 23 is executed, which realizes the various functions of the embedded device. During execution of the main program 230, for example, the delayed data parallel processing is executed repeatedly. Note that the first text format program 401 may be created by a device other than the program creation device 30.

As described above, in data parallel processing, the plurality of cores 21 process a plurality of data in parallel using a common program, so executing data parallel processing can shorten the processing time in the multicore processor 20.

FIG. 6 is a diagram showing an example of the operation of the multicore processor 20 when it executes data parallel processing 50. In the example of FIG. 6, the plurality of cores 21 consists of four cores 21a to 21d. In the data parallel processing 50, the cores 21a to 21d execute parallel individual processes 51 to 54, respectively. The data parallel processing 50 is not delayed data parallel processing: the start timings of the parallel individual processes 51 to 54 are set to be the same. The parallel individual processes 51 to 54 process different data, but their processing content is the same.

In the example of FIG. 6, the core 21a performs sequential processing 55 before the parallel individual process 51 and sequential processing 56 after it. In the data parallel processing 50, the same processing is executed in parallel on the plurality of data sets obtained by the sequential processing 55. The core 21a performs the sequential processing 56 on the plurality of data sets obtained by the data parallel processing 50.

The cores 21b to 21d execute low-priority processes 57 to 59, respectively. The low-priority processes 57 to 59 have a lower processing priority than the data parallel processing 50, the sequential processing 55, and the sequential processing 56. The low-priority processes 57 to 59 are, for example, independent processes unrelated to the data parallel processing 50, the sequential processing 55, and the sequential processing 56.

In the example of FIG. 6, the core 21a starts the sequential processing 55, and the cores 21b to 21d start the low-priority processes 57 to 59, respectively. When the sequential processing 55 completes, the data parallel processing 50 becomes executable. The core 21a executes the parallel individual process 51 after executing the sequential processing 55. The core 21b suspends the low-priority process 57 and executes the higher-priority parallel individual process 52. Similarly, the core 21c suspends the low-priority process 58 and executes the parallel individual process 53, and the core 21d suspends the low-priority process 59 and executes the parallel individual process 54. When the high-priority data parallel processing 50 completes, the cores 21b to 21d resume the suspended low-priority processes 57 to 59, respectively, and the core 21a executes the sequential processing 56.

Executing the data parallel processing 50 in this way shortens the completion time of the high-priority processing as a whole.

FIG. 7 is a diagram showing an example of the operation of the multicore processor 20 when it executes delayed data parallel processing 150. In the example of FIG. 7, the plurality of cores 21 consists of four cores 21a to 21d. In the delayed data parallel processing 150, the cores 21a to 21d execute the parallel individual processes 51 to 54, respectively. In the delayed data parallel processing 150 shown in FIG. 7, for example, the start timing of the parallel individual processes 52 to 54 executed by the cores 21b to 21d is delayed relative to the start timing of the parallel individual process 51 executed by the core 21a. In the delayed data parallel processing 150, the parallel individual process 51 is the preceding job, and the parallel individual processes 52 to 54 are the delayed jobs. The start timing of the parallel individual processes 52 to 54 lags the start timing of the parallel individual process 51 by the delay amount 232.

In the example of FIG. 7, as in the example of FIG. 6, the core 21a starts the sequential processing 55, and the cores 21b to 21d start the low-priority processes 57 to 59, respectively. When the sequential processing 55 completes, the delayed data parallel processing 150 becomes executable. The core 21a starts the parallel individual process 51 after executing the sequential processing 55. When the start timing of the parallel individual process 52 arrives after the parallel individual process 51 has started, the core 21b suspends the low-priority process 57 and starts the higher-priority parallel individual process 52. Similarly, when the start timing of the parallel individual process 53 arrives, the core 21c suspends the low-priority process 58 and starts the parallel individual process 53, and when the start timing of the parallel individual process 54 arrives, the core 21d suspends the low-priority process 59 and starts the parallel individual process 54. When the high-priority delayed data parallel processing 150 completes, the cores 21b to 21d resume the suspended low-priority processes 57 to 59, respectively, and the core 21a executes the sequential processing 56.

In the example of FIG. 6, when the multicore processor 20 executes the data parallel processing 50, the cores 21a to 21d execute a common data parallel program in which the data parallel processing 50 is described. If the data parallel program is not present in the cache 22, it must be loaded from the memory 23 into the cache 22. While the data parallel program is being loaded into the cache 22, the multicore processor 20 incurs a wait time. In the data parallel processing 50, the parallel individual processes 51 to 54 start at the same timing, so the wait time for loading the data parallel program into the cache 22 occurs equally in all of the cores 21a to 21d.

In contrast, in the delayed data parallel processing 150 shown in FIG. 7, the start timing of the parallel individual processes 52 to 54 is delayed relative to the start timing of the parallel individual process 51. As a result, while the delayed data parallel processing 150 is being executed, the cores 21b to 21d can execute processing other than the parallel individual processes 52 to 54. In the example of FIG. 7, the cores 21b to 21d execute the low-priority processes 57 to 59 during execution of the delayed data parallel processing 150, and run them longer by the amount by which the start of the parallel individual processes 52 to 54 is delayed. This allows the multicore processor 20 to be used effectively.

Furthermore, the wait time for loading the delayed data parallel program into the cache 22 does not occur equally in all of the cores 21a to 21d. That is, the cores 21b to 21d can execute the portion of the delayed data parallel program that was loaded into the cache 22 during execution of the parallel individual process 51 by the core 21a, which starts processing first, so the wait time of the cores 21b to 21d can be made shorter than that of the core 21a. Therefore, even though the processing start timing in the cores 21b to 21d is delayed, the processing time of the delayed data parallel processing 150 as a whole can be kept short. In the examples of FIGS. 6 and 7, the overall processing time of the delayed data parallel processing 150 (in other words, its completion time) and the overall processing time of the data parallel processing 50 (in other words, its completion time) are almost the same.

<Detailed description of operation examples of the identification unit and the generation unit>
When identifying the conversion target partial program in the first text format program 401, the identification unit 320 focuses on, for example, loop processing. Consider the case where the first text format program 401 is written in C. In this case, the identification unit 320 identifies a for statement in the first text format program 401. If the loop processing described by the identified for statement handles data that is independent between iterations, the identification unit 320 takes that for statement as the conversion target partial program.

FIG. 8 is a diagram showing an example of a for statement included in the first text format program 401. In the loop processing described by the for statement shown in FIG. 8, the same operation V[i] = X[i] + Y[i] is performed in every iteration. The loop processing of FIG. 8 handles a data set (V[i], X[i], Y[i]) corresponding to the loop count i (0 ≤ i ≤ 399), so the iterations handle mutually independent data. The loop processing described by the for statement shown in FIG. 8 can therefore be converted into data parallel processing. When the identification unit 320 identifies the for statement shown in FIG. 8 in the first text format program 401, it takes that for statement as the conversion target partial program, that is, the partial program to be converted into a data parallel program describing data parallel processing.
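FIG. 8 itself is not reproduced in this text. A minimal sketch of a for statement matching the description could look as follows; the function name `add_arrays`, the macro `NUM_ELEMENTS`, and the `int` element type are assumptions made for illustration.

```c
#define NUM_ELEMENTS 400  /* loop count i runs over 0..399, as described for FIG. 8 */

/* Element-wise addition. Each iteration reads X[i] and Y[i] and writes V[i] only,
 * so no iteration depends on data produced by another iteration. */
void add_arrays(const int X[NUM_ELEMENTS], const int Y[NUM_ELEMENTS], int V[NUM_ELEMENTS])
{
    for (int i = 0; i < NUM_ELEMENTS; i++) {
        V[i] = X[i] + Y[i];
    }
}
```

Because the iterations are independent, the index range can be split among cores without changing the result, which is the property the identification unit 320 is described as checking.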

The generation unit 330 converts the conversion target partial program identified by the identification unit 320 into a delayed data parallel program. FIG. 9 is a diagram showing an example of a delayed data parallel program 550. The delayed data parallel program 550 of FIG. 9 is the result of converting the for statement shown in FIG. 8, and is executed by, for example, the four cores 21a, 21b, 21c, and 21d.

In the example of FIG. 9, the core 21a, for example, executes V[i] = X[i] + Y[i] on the data sets (X[i], Y[i]) with 0 ≤ i ≤ 99. This processing, in which the core 21a performs V[i] = X[i] + Y[i] over the range 0 ≤ i ≤ 99, is the parallel individual process (in other words, the job) that the core 21a performs in the delayed data parallel processing. Similarly, the core 21b executes V[i] = X[i] + Y[i] on the data sets with 100 ≤ i ≤ 199, the core 21c on those with 200 ≤ i ≤ 299, and the core 21d on those with 300 ≤ i ≤ 399. Each of these is the parallel individual process that the respective core performs in the delayed data parallel processing.

The delayed data parallel program 550 of FIG. 9 also includes a program 550a describing, for each of the three cores 21b, 21c, and 21d, delay processing that delays the start timing of the parallel individual process performed by that core 21. Because each of the cores 21b, 21c, and 21d executes the delay processing, the start timing of their parallel individual processes is delayed relative to the start timing of the parallel individual process performed by the core 21a. In the example of FIG. 9, the parallel individual process performed by the core 21a is the preceding job, and the parallel individual processes performed by the cores 21b, 21c, and 21d are the delayed jobs. In the example of FIG. 9, the program 550a is written using OpenMP.
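FIG. 9 is not reproduced in this text, so the following is only a sketch of how such a delayed split might be written, not the program 550 itself. The names `delayed_vector_add` and `spin_delay`, the busy-wait used as the delay, and the `delay_iterations` parameter standing in for the delay amount 232 are all assumptions.

```c
#define N 400
#define NUM_CORES 4

/* Hypothetical delay: a busy-wait of a given number of iterations. */
static void spin_delay(unsigned long iterations)
{
    volatile unsigned long counter = 0;
    while (counter < iterations) {
        counter++;
    }
}

/* Splits V[i] = X[i] + Y[i] into NUM_CORES jobs of N / NUM_CORES elements each.
 * Job 0 (the preceding job) starts immediately; jobs 1..3 (the delayed jobs)
 * first wait. When compiled with OpenMP enabled, the jobs are spread over up to
 * four threads; otherwise the pragma is ignored and the jobs run one after
 * another, with the same result. */
void delayed_vector_add(const int *X, const int *Y, int *V, unsigned long delay_iterations)
{
    #pragma omp parallel for num_threads(NUM_CORES) schedule(static, 1)
    for (int core = 0; core < NUM_CORES; core++) {
        int chunk = N / NUM_CORES;
        int begin = core * chunk;
        int end = begin + chunk;
        if (core != 0) {
            spin_delay(delay_iterations);  /* delay processing for delayed jobs only */
        }
        for (int i = begin; i < end; i++) {
            V[i] = X[i] + Y[i];
        }
    }
}
```

The sketch keeps the index ranges described for the cores 21a to 21d (0 to 99, 100 to 199, 200 to 299, 300 to 399). A real implementation would more likely wait on a timer than on a spin loop.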

The delayed data parallel program 550 of FIG. 9 also includes a program 550b describing, for each of the four cores 21a to 21d, monitoring processing that monitors the execution state of the parallel individual process in that core 21. While executing the monitoring processing, a core 21 outputs execution state information (also called core execution state information) indicating the execution state of the parallel individual process in that core 21. The core execution state information includes, for example, the execution time of the parallel individual process, the number of instructions executed in the parallel individual process, the number of cache misses during its execution, and the number of stalls during its execution. The execution time of the parallel individual process in a given core 21 can also be regarded as the execution time of the delayed data parallel processing in that core 21. As described later, the core execution state information is used to update the delay amount 232. The information included in the core execution state information is not limited to these items.

After generating the delayed data parallel program, the generation unit 330 generates the second text format program 402 including that delayed data parallel program. The second text format program 402 is the first text format program 401 with the conversion target partial program replaced by the delayed data parallel program. The second text format program 402 generated by the generation unit 330 is compiled and then written to the memory 23.

<Example of the delay update device>
The processing system 1 of this example includes a delay update device 250 that updates the delay amount 232. In this example, the multicore processor 20, for example, functions as the delay update device 250. FIG. 10 is a block diagram showing an example of the delay update device 250.

As shown in FIG. 10, the delay update device 250 includes, for example, an acquisition unit 251 that acquires execution state information 600 (also called processor execution state information 600) indicating the execution state of the delayed data parallel processing in the multicore processor 20, and an update unit 252 that updates the delay amount 232 based on the execution state information 600 acquired by the acquisition unit 251. The processor execution state information 600 includes the core execution state information for each core 21. It also includes the execution time of the delayed data parallel processing as a whole.
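One way to picture the two levels of execution state information is as nested records. The layout below is a sketch only: the type names, field names, field types, and the helper `total_cache_misses` are assumptions, and the source notes that the core execution state information is not limited to the listed items.

```c
#define NUM_CORES 4  /* four cores 21a to 21d, as in the examples */

/* Core execution state information for one core 21 (fields named in the description). */
typedef struct {
    double execution_time;               /* execution time of the parallel individual process */
    unsigned long executed_instructions; /* instructions executed in the process */
    unsigned long cache_misses;          /* cache misses during its execution */
    unsigned long stalls;                /* stalls during its execution */
} CoreExecState;

/* Processor execution state information 600: per-core information plus the
 * execution time of the delayed data parallel processing as a whole. */
typedef struct {
    CoreExecState core[NUM_CORES];
    double total_execution_time;
} ProcessorExecState;

/* Hypothetical helper: sum of cache misses over all cores. */
unsigned long total_cache_misses(const ProcessorExecState *state)
{
    unsigned long sum = 0;
    for (int i = 0; i < NUM_CORES; i++) {
        sum += state->core[i].cache_misses;
    }
    return sum;
}
```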

Each of the acquisition unit 251 and the update unit 252 is, for example, a functional block realized by one of the cores 21 of the multicore processor 20 executing the update program 231. The acquisition unit 251 acquires the core execution state information from each core 21 of the multicore processor 20 while the delayed data parallel program is being executed. The acquisition unit 251 also acquires the execution time of the delayed data parallel processing as a whole during execution of the delayed data parallel program.

FIG. 11 is a flowchart showing an example of the update processing performed by the delay update device 250. The update processing shown in FIG. 11 may be executed during actual operation of the processor system 2 (in other words, during actual operation of the embedded device in which the processor system 2 is mounted), or before actual operation of the processor system 2 (for example, before shipment of the embedded device in which the processor system 2 is mounted).

As shown in FIG. 11, in step s11, the acquisition unit 251 causes the multicore processor 20 to execute the main program 230. Next, in step s12, the acquisition unit 251 acquires the processor execution state information 600 while the main program 230 is being executed.

During execution of the main program 230, the delayed data parallel processing is executed repeatedly. The initial value of the delay amount 232 used in the delayed data parallel processing is set, for example, so that the completion time of the preceding parallel individual process (preceding job) becomes the start time of the delayed parallel individual processes (delayed jobs). The initial value of the delay amount 232 is not limited to this.

FIG. 12 is a diagram showing an example of the delayed data parallel processing 150 and the processing before and after it when the delay amount 232 is set to its initial value. FIG. 12 does not show the low-priority processes 57 to 59 described above, but in practice, as shown in FIG. 7, the cores 21b to 21d execute the low-priority processes 57 to 59, respectively, when they are not executing a parallel individual process. The same applies to FIG. 13 described later.

The preceding parallel individual process 51 starts immediately after the sequential processing 55, with no delay in its start timing. In this example, the execution time of the preceding parallel individual process (preceding job) 51 is denoted t_normal. The delayed parallel individual processes (delayed jobs) 52 to 54 start at the end timing of the preceding parallel individual process 51, and the delay amount 232 is denoted t_delay. The execution time of the delayed parallel individual processes 52 to 54 is denoted t_fast. The execution time of the delayed data parallel processing 150 as a whole, that is, the time from the start of the preceding job to the end of all jobs, is denoted t_total.
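The relation among these four quantities can be written out as a short worked equation. This is an inference from the definitions above rather than a formula stated in the source, and it assumes that all delayed jobs share the same delay t_delay and the same execution time t_fast.

```latex
\begin{align*}
t_{\mathrm{total}} &= \max\bigl(t_{\mathrm{normal}},\; t_{\mathrm{delay}} + t_{\mathrm{fast}}\bigr) \\
% Initial value: the delayed jobs start when the preceding job ends,
% so t_delay = t_normal and the second argument of max is the larger one.
t_{\mathrm{total}} &= t_{\mathrm{normal}} + t_{\mathrm{fast}}
  \qquad \text{when } t_{\mathrm{delay}} = t_{\mathrm{normal}}
\end{align*}
```

With the initial delay amount, the jobs therefore run back to back, and the total is longer than the preceding job alone by the full execution time of a delayed job.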

As described above, the cores 21b to 21d can execute the portion of the delayed data parallel program that was loaded into the cache 22 during execution of the preceding parallel individual process 51. The wait time of the cores 21b to 21d can therefore be made shorter than that of the core 21a, so that t_normal ≥ t_fast.

After step s12, in step s13, the update unit 252 determines whether an update end condition, which is the condition for ending the update of the delay amount 232, is satisfied. In this example, conditional expression (1) is adopted as the update end condition.
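Conditional expression (1) itself is not reproduced in this text. The description states that step s13 evaluates it using the execution time of the preceding job and the execution time of the delayed data parallel processing as a whole, and that the delay amount is updated until those two times match. A form consistent with that would be the following; it is offered as an assumption, not as the published expression.

```latex
\begin{equation}
t_{\mathrm{total}} = t_{\mathrm{normal}} \tag{1, assumed}
\end{equation}
```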

In step s13, the update unit 252 determines whether conditional expression (1) is satisfied, using the execution time of the preceding job and the execution time of the delayed data parallel processing 150 as a whole, both of which are included in the processor execution state information acquired in step s12.

If it is determined in step s13 that the update end condition (conditional expression (1)) is satisfied, the update processing ends and execution of the main program 230 ends. This completes the update of the delay amount 232. If, on the other hand, it is determined in step s13 that the update end condition (conditional expression (1)) is not satisfied, step s14 is executed.

In step s14, the update unit 252 updates the delay amount 232. First, the update unit 252 obtains a new delay amount 232 (t_delay) using, for example, expression (2). The update unit 252 then replaces the delay amount 232 in the memory 23 with the new delay amount 232. The delay amount 232 in the memory 23 is thereby updated.
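Expression (2) is likewise not reproduced in this text. The description states that the new delay amount is computed from the execution time of the preceding job and the execution time of the delayed job, with the aim that the delayed jobs finish together with the preceding job. A form consistent with that would be the following, again as an assumption rather than the published expression.

```latex
\begin{equation}
t_{\mathrm{delay}} = t_{\mathrm{normal}} - t_{\mathrm{fast}} \tag{2, assumed}
\end{equation}
```

Under this form, a delayed job that starts t_delay after the preceding job and again runs for t_fast would end exactly when the preceding job ends.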

In step s14, the update unit 252 obtains the new delay amount using the execution time of the preceding job and the execution time of the delayed job, both of which are included in the processor execution state information acquired in step s12.

After step s14, step s12 is executed again to acquire new processor execution state information. Step s13 is then executed in the same way, and the delay update device 250 continues to operate in the same manner.

Here, the usage state of the cache 22 changes according to the delay amount 232 of the delayed jobs, so the execution time of the preceding job (t_normal) and the execution time of the delayed jobs (t_fast) also change according to the delay amount 232. Consequently, even when the delay amount 232 is updated using expression (2), a single update does not necessarily satisfy conditional expression (1). In this example, the delay update device 250 therefore updates the delay amount 232 repeatedly until conditional expression (1) is satisfied.
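The loop of steps s12 to s14 can be sketched as below. Several things here are assumptions: expressions (1) and (2) are not reproduced in this text, so the sketch uses t_total − t_normal ≤ tolerance as the end condition and t_delay = t_normal − t_fast as the update; `toy_measure` is a made-up timing model standing in for a real measurement; and `max_rounds` is a safety bound that the source does not describe.

```c
/* Timing of one run of the delayed data parallel processing. */
typedef struct {
    double t_normal; /* execution time of the preceding job */
    double t_fast;   /* execution time of a delayed job */
    double t_total;  /* from preceding-job start to the end of all jobs */
} RunTiming;

typedef RunTiming (*MeasureFn)(double t_delay);

/* Hypothetical stand-in for step s12: the delayed jobs get faster as the
 * delay grows (more of the program is already cached), up to a fixed saving. */
RunTiming toy_measure(double t_delay)
{
    RunTiming r;
    double saving = 4.0 * t_delay;
    if (saving > 40.0) {
        saving = 40.0;
    }
    r.t_normal = 100.0;
    r.t_fast = 100.0 - saving;
    double delayed_end = t_delay + r.t_fast;
    r.t_total = delayed_end > r.t_normal ? delayed_end : r.t_normal;
    return r;
}

/* Repeats measure (s12), check (s13), update (s14) until the assumed end
 * condition holds or max_rounds is reached; returns the final delay amount. */
double update_delay(MeasureFn measure, double t_delay, double tolerance, int max_rounds)
{
    for (int round = 0; round < max_rounds; round++) {
        RunTiming r = measure(t_delay);              /* s12 */
        if (r.t_total - r.t_normal <= tolerance) {   /* s13, assumed condition */
            break;
        }
        t_delay = r.t_normal - r.t_fast;             /* s14, assumed update */
    }
    return t_delay;
}
```

With the toy model and an initial delay of 100 (the delayed jobs start when the preceding job ends), the first measurement gives a total of 160, the delay is updated to 40, and the second measurement gives a total of 100, so the loop stops. A real system would generally need more rounds because t_normal and t_fast shift with each new delay amount.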

FIG. 13 is a diagram showing an example of the delayed data parallel processing 150 executed using the delay amount 232 after the update is complete, together with the processing before and after it. As FIG. 13 also shows, in this example the delay amount 232 (t_delay) is updated so that the execution time of the delayed data parallel processing 150 (t_total) matches the execution time of the preceding parallel individual process 51 (t_normal). Because the parallel individual processes 51 to 54 have equivalent amounts of computation, updating the delay amount 232 as in this example causes the delayed parallel individual processes 52 to 54 to end at the same timing as the preceding parallel individual process 51.

By updating the delay amount 232 so that the delayed parallel individual processes 52 to 54 end at the same timing as the preceding parallel individual process 51, the execution time (t_total) of the delayed data parallel processing 150 can be kept short even though the start timing of the parallel individual processes 52 to 54 is delayed. For example, as shown in FIGS. 6 and 7 above, the overall execution time of the delayed data parallel processing 150 can be made equivalent to the overall execution time of the data parallel processing 50, which includes no delayed parallel individual processes.

なお、更新終了条件としては、上記の条件式（１）以外の条件が採用されてもよい。遅延量２３２が変化しない場合であっても並列個別処理の実行時間はばらつくことから、更新終了条件としては、例えば以下の条件式（３）が採用されてもよい。 Note that conditions other than conditional expression (1) above may be adopted as the update end condition. Even if the delay amount 232 does not change, the execution time of the parallel individual processing varies, so the following conditional expression (3), for example, may be adopted as the update end condition.

条件式（３）のパラメータｅの値は、並列個別処理の実行時間のばらつきに基づく値である。パラメータｅの値が決定される場合には、例えば、同じ遅延量２３２が用いられて複数回遅延データ並列処理が実行される。そして、その結果から得られる並列個別処理の実行時間のばらつきに基づいてパラメータｅの値が事前に決定される。 The value of parameter e in conditional expression (3) is a value based on variations in execution time of parallel individual processing. When the value of parameter e is determined, for example, the same delay amount 232 is used to execute delayed data parallel processing multiple times. Then, the value of the parameter e is determined in advance based on the variation in the execution time of the parallel individual processing obtained from the result.
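As a rough illustration of how such a tolerance might be obtained and used, the Python sketch below measures the spread of repeated runs made with one fixed delay amount and then applies a tolerance-based end condition. Conditional expression (3) is not reproduced in this part of the description, so both the choice of the spread (maximum minus minimum) as the value of e and the form of the check are assumptions, and the function names are hypothetical.

```python
def estimate_tolerance(execution_times):
    """Spread of execution times measured over repeated runs that all used
    the same delay amount. Using max - min is an assumption; the description
    only says that e is based on the variation."""
    return max(execution_times) - min(execution_times)


def update_finished(t_normal, t_delay, t_fast, e):
    """Assumed tolerance-based end condition: the delayed job finishes
    within e of the preceding job."""
    return abs((t_delay + t_fast) - t_normal) <= e


e = estimate_tolerance([100, 103, 98, 101])
print(e, update_finished(100, 40, 62, e))
```

A tolerance of this kind keeps the update loop from chasing run-to-run noise that no choice of delay amount can remove.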

In addition, in the processor system 2 in actual operation, the execution time and the like of the parallel individual processing may fluctuate due to, for example, interrupt processing on a core 21 during the delayed data parallel processing. Therefore, the update process shown in FIG. 11 may be executed repeatedly while the processor system 2 is in actual operation. This allows the delay amount 232 to be set to an appropriate value even when the execution time and the like of the parallel individual processing fluctuate. The update process may be repeated periodically or irregularly. The update process may also be executed in response to a user's instruction.

In the above example, the identification unit 320 treats a specific for statement included in the first text format program 401 as the conversion target partial program, but it may treat another partial program included in the first text format program 401 as the conversion target partial program. For example, the identification unit 320 may treat a specific while statement included in the first text format program 401 as the conversion target partial program. In this case, the identification unit 320 may treat the while statement as the conversion target partial program when the loop processing described by the while statement handles data that is independent between loop iterations. The identification unit 320 may also treat each of a specific for statement and a specific while statement included in the first text format program 401 as a conversion target partial program.

特定部３２０が、複数の変換対象部分プログラムを特定する場合には、生成部３３０は、当該複数の変換対象部分プログラムのそれぞれについて、当該変換対象部分プログラムを遅延データ並列プログラムに変換する。そして、生成部３３０は、得られた複数の遅延データ並列プログラムを含む第２テキスト形式プログラム４０２を生成する。複数の遅延データ並列プログラムを含む第２テキスト形式プログラム４０２は、コンパイル後に、メインプログラム２３０としてメモリ２３に書き込まれる。この場合、メモリ２３には、複数の遅延データ並列プログラムでそれぞれ使用される複数の遅延量２３２が記憶される。そして、遅延量更新装置２５０は、複数の遅延量２３２のそれぞれを上述のようにして更新する。 When the identifying unit 320 identifies a plurality of conversion target partial programs, the generating unit 330 converts each of the plurality of conversion target partial programs into a delayed data parallel program. The generation unit 330 then generates the second text format program 402 including the obtained multiple delayed data parallel programs. The second text format program 402 including multiple delayed data parallel programs is written to memory 23 as main program 230 after compilation. In this case, the memory 23 stores a plurality of delay amounts 232 respectively used in a plurality of delay data parallel programs. Then, the delay amount update device 250 updates each of the plurality of delay amounts 232 as described above.

In the above example, the delay amount 232 is updated, but a predetermined delay amount 232 may always be used without being updated. Also, in the examples of FIGS. 7, 12, 13, and so on, only one of the parallel individual processes constituting the delayed data parallel processing is the preceding job, but a plurality of parallel individual processes may be preceding jobs. For example, not only the parallel individual process executed by the core 21a but also the parallel individual process executed by the core 21b may be a preceding job. Alternatively, the three parallel individual processes executed by the cores 21a, 21b, and 21c may be the preceding jobs, and the parallel individual process executed by the core 21d may be the delayed job.

以上のように、本実施の形態では、遅延データ並列処理において、一部のコア２１での開始タイミングが他のコア２１での開始タイミングよりも遅延している。これにより、遅延データ並列処理の実行中に、一部のコア２１は、並列個別処理以外の処理を実行することができる。これにより、マルチコアプロセッサ２０の有効利用を図ることができる。 As described above, in the present embodiment, the start timing of some cores 21 is delayed from the start timing of other cores 21 in delayed data parallel processing. This allows some of the cores 21 to execute processes other than the parallel individual processes during execution of the delayed data parallel processing. As a result, the multi-core processor 20 can be used effectively.

Furthermore, in the delayed data parallel program, some of the cores 21 can execute a portion that was loaded into the cache 22 while the other cores, which start processing earlier, were executing their parallel individual processes. The waiting time of those cores 21 can therefore be made shorter than that of the other cores 21. Consequently, even though the processing start timing of some of the cores 21 is delayed, the processing time of the delayed data parallel processing as a whole can be shortened.

Embodiment 2.
In this embodiment, the conversion target partial program is designated by the user. The configuration of the processing system 1 according to this embodiment is the same as that of the processing system 1 according to Embodiment 1 described above.

In this embodiment, the user can use the function of the program description unit 310 to designate a conversion target partial program in the first text format program 401. Specifically, using the function of the program description unit 310, the user can include in the first text format program 401 a description (also called a designation description) indicating that a given partial program is to be converted into a delayed data parallel program. The designation description is itself a kind of program. The program description unit 310 includes the designation description in the first text format program 401 based on input from the user received by the input device 303. In other words, the designation description indicates that the partial program designated by the user is to be converted into a delayed data parallel program.

FIG. 14 is a diagram showing an example of the designation description 410. In the example of FIG. 14, the designation description 410 is expressed using OpenMP. The designation description 410 in FIG. 14 indicates that the for statement immediately below it is to be converted into a delayed data parallel program. The first text format program 401 may include a plurality of designation descriptions.

In this embodiment, the identification unit 320 identifies the designation description 410 in the first text format program 401. The identification unit 320 then identifies the conversion target partial program based on the identified designation description 410. In the example of FIG. 14, the identification unit 320 identifies the for statement immediately below the designation description 410 as the conversion target partial program. The generation unit 330 then converts the conversion target partial program identified by the identification unit 320 into a delayed data parallel program and generates the second text format program 402 including that delayed data parallel program. The second text format program 402 generated by the generation unit 330 is stored in the memory 23 after compilation. The other operations of the processing system 1 according to this embodiment are the same as those of the processing system 1 according to Embodiment 1.
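The identification step can be pictured as a scan of the program text for designation lines. The Python sketch below is a minimal illustration under stated assumptions: the actual text of the designation description 410 in FIG. 14 is not reproduced here, so the marker string `#pragma delayed_data_parallel` is made up for this example and is not an actual OpenMP directive. The function name and the sample program are likewise hypothetical.

```python
import re

# Matches a line that begins (after indentation) with a C-style for statement.
FOR_STATEMENT = re.compile(r"\s*for\s*\(")


def find_designated_loops(source_text, marker):
    """Return the 0-based line indexes of for statements that sit directly
    below a line consisting of the designation marker."""
    lines = source_text.splitlines()
    targets = []
    for index in range(len(lines) - 1):
        is_marker = lines[index].strip() == marker
        if is_marker and FOR_STATEMENT.match(lines[index + 1]):
            targets.append(index + 1)
    return targets


MARKER = "#pragma delayed_data_parallel"  # hypothetical marker, not real OpenMP
SAMPLE = """\
int main(void) {
    #pragma delayed_data_parallel
    for (int i = 0; i < n; i++) { out[i] = f(in[i]); }
    for (int j = 0; j < n; j++) { sum += out[j]; }
    #pragma delayed_data_parallel
    for (int k = 0; k < n; k++) { out[k] *= 2; }
}
"""

print(find_designated_loops(SAMPLE, MARKER))
```

Only the loops directly below a marker are reported, so the undesignated loop over `j` is left as it is. This reflects the point that the user, not the tool, decides which partial programs are converted.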

このように、本実施の形態では、第１テキスト形式プログラム４０１は、ユーザによって指定された部分プログラムが遅延データ並列プログラムに変換される対象であることを示す指定記述を含んでいる。これにより、特定部３２０は、ユーザによって指定された部分プログラムを変換対象部分プログラムとすることができる。つまり、ユーザは変換対象部分プログラムを指定することができる。 Thus, in this embodiment, the first text format program 401 includes a specification description indicating that a partial program specified by the user is to be converted into a delayed data parallel program. As a result, the specifying unit 320 can set the partial program specified by the user as the conversion target partial program. That is, the user can specify the partial program to be converted.

Here, communication may take place between the cores 21 during execution of data parallel processing. For example, communication may take place between the cores 21 when the completion of the parallel individual processes executed by the respective cores 21 needs to be synchronized. For some partial programs included in the first text format program 401, the overhead of communication between the cores 21 during execution of the resulting delayed data parallel program may leave little benefit from the conversion.

本実施の形態では、ユーザは、変換対象部分プログラムを指定することができることから、それを遅延データ並列プログラムに変換するメリットが大きい部分プログラムを、変変換対象部分プログラムに指定することができる。よって、マルチコアプロセッサ２０での処理時間をより確実に短縮することができる。 In this embodiment, since the user can specify the partial program to be converted, the user can specify, as the partial program to be converted, a partial program that is highly advantageous in converting it into a delayed data parallel program. Therefore, the processing time in the multi-core processor 20 can be shortened more reliably.

Embodiment 3.
In this embodiment, as shown in FIG. 15, the engineering tool 3 further includes a delay amount estimation model 460 that outputs the delay amount 232 and a model learning unit 360 that trains the delay amount estimation model 460. The delay amount estimation model 460 and the model learning unit 360 are provided, for example, in the program creation device 30 of the engineering tool 3. Each of the delay amount estimation model 460 and the model learning unit 360 is implemented in the processing circuit 300 by, for example, the processing circuit 300 executing the program 301a.

学習済みの遅延量推定モデル４６０は、例えば、プロセッサ実行状態情報に基づいて、遅延並列個別処理が先行並列個別処理と同じタイミングで終了するような遅延量２３２を推定して出力する。言い換えれば、学習済みの遅延量推定モデル４６０は、例えば、プロセッサ実行状態情報に基づいて、遅延並列個別処理の完了時刻と先行並列個別処理の完了時刻とが同時刻となる遅延量２３２を推定して出力する。 The learned delay amount estimation model 460 estimates and outputs the delay amount 232 that makes the delayed parallel individual processing end at the same timing as the preceding parallel individual processing, for example, based on the processor execution state information. In other words, the learned delay amount estimation model 460 estimates the delay amount 232 at which the completion time of the delayed parallel individual processing and the completion time of the preceding parallel individual processing are the same time, for example, based on the processor execution state information. output.

遅延量推定モデル４６０は、例えば、ニューラルネットワークで構成されている。遅延量推定モデル４６０を構成するニューラルネットワーク（第１ニューラルネットワークともいう）の入力層には、プロセッサ実行状態情報６００が入力される。第１ニューラルネットワークの出力層からは遅延量２３２が出力される。第１ニューラルネットワークは、畳み込みニューラルネットワーク（ＣＮＮ（Convolutional Neural Network））であってもよいし、他の種類のニューラルネットワークであってもよい。第１ニューラルネットワークは、モデル学習部３６０によって学習されるパラメータを含む。このパラメータには、人工ニューロン間の結合の重みを示す重み付け係数が含まれる。第１ニューラルネットワークは、入力層に入力されるプロセッサ実行状態情報６００に対してパラメータに基づく演算を行って、出力層から遅延量２３２を出力する。 The delay amount estimation model 460 is composed of, for example, a neural network. Processor execution state information 600 is input to the input layer of the neural network (also referred to as the first neural network) that constitutes delay estimation model 460 . A delay amount 232 is output from the output layer of the first neural network. The first neural network may be a convolutional neural network (CNN) or another type of neural network. The first neural network includes parameters learned by model learning section 360 . This parameter includes a weighting factor that indicates the weight of the connections between artificial neurons. The first neural network performs a parameter-based operation on the processor execution state information 600 input to the input layer, and outputs a delay amount 232 from the output layer.

The model learning unit 360 trains the delay amount estimation model 460 by, for example, machine learning of the parameters of the first neural network. Once the parameters of the first neural network have been learned, the delay amount estimation model 460 can output, based on the processor execution state information, a delay amount 232 such that the delayed parallel individual processing ends at the same timing as the preceding parallel individual processing. The model learning unit 360 learns the parameters of the first neural network using, for example, supervised learning.

The storage device 301 stores the teacher data and the learning data (also called training data) used in learning the parameters of the first neural network. The learning data and the teacher data are sometimes collectively called supervised learning data. In this example, the processor system 2, for example, generates the learning data and the teacher data.

The teacher data includes delay amounts 232 such that the delayed parallel individual processing ends at the same timing as the preceding parallel individual processing. Hereinafter, a delay amount 232 included in the teacher data may be called a reference delay amount 232.

基準遅延量２３２としては、例えば、上述の図１１の更新処理において更新が完了した遅延量２３２が採用される。教師データには複数の基準遅延量２３２が含まれる。本実施の形態では、第１ニューラルネットワークのパラメータの学習が実行される前に、図１１の更新処理がプロセッサシステム２において繰り返し実行されることによって、複数の基準遅延量２３２が取得される。 As the reference delay amount 232, for example, the delay amount 232 that has been updated in the above-described update processing of FIG. 11 is used. Teacher data includes a plurality of reference delay amounts 232 . In the present embodiment, a plurality of reference delay amounts 232 are acquired by repeatedly executing the updating process in FIG. 11 in the processor system 2 before learning the parameters of the first neural network.

Here, as described above, the usage state of the cache 22 changes according to the delay amount 232, and the processor execution state information changes according to the usage state of the cache 22. The processor execution state information therefore changes according to the delay amount 232. In other words, there is a correlation between the delay amount 232 and the processor execution state information.

Therefore, in this example, processor execution state information is used as the learning data. Specifically, the learning data includes a plurality of pieces of processor execution state information respectively corresponding to the plurality of reference delay amounts 232 included in the teacher data. Here, the processor execution state information corresponding to a reference delay amount 232 means the processor execution state information acquired by the acquisition unit 251 while the delayed data parallel program is being executed using that reference delay amount 232. In other words, the processor execution state information included in the learning data is the information acquired by the acquisition unit 251 while the delayed data parallel program is being executed using a delay amount 232 such that the delayed parallel individual processing ends at the same timing as the preceding parallel individual processing. Hereinafter, the processor execution state information included in the learning data may be called learning execution state information.

The learning execution state information includes, for example, core execution state information for each core 21 and the execution time of the delayed data parallel processing as a whole. The core execution state information included in the learning execution state information includes, for example, the execution time of the parallel individual processing, the number of instructions executed in the parallel individual processing, the number of cache misses during execution of the parallel individual processing, and the number of stalls during execution of the parallel individual processing.
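One way to picture this information is as a small record per core plus one overall time, flattened into a vector before it is fed to a model. The Python sketch below is illustrative only. The four per-core items and the overall execution time come from the description, while the class names, field names, ordering of the vector, and the sample values are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoreExecutionState:
    execution_time: float     # execution time of the parallel individual processing
    instruction_count: int    # instructions executed in the parallel individual processing
    cache_misses: int         # cache misses during the parallel individual processing
    stalls: int               # stalls during the parallel individual processing


@dataclass
class ExecutionState:
    cores: List[CoreExecutionState]
    total_execution_time: float   # whole delayed data parallel processing

    def to_input_vector(self) -> List[float]:
        """Flatten per-core values, then append the overall execution time."""
        vector: List[float] = []
        for core in self.cores:
            vector += [
                core.execution_time,
                float(core.instruction_count),
                float(core.cache_misses),
                float(core.stalls),
            ]
        vector.append(self.total_execution_time)
        return vector


# Made-up values for one preceding core and one delayed core.
state = ExecutionState(
    cores=[CoreExecutionState(100.0, 5000, 40, 12),
           CoreExecutionState(60.0, 5000, 3, 1)],
    total_execution_time=100.0,
)
print(state.to_input_vector())
```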

When training the first neural network, the model learning unit 360 inputs the learning data to the input layer of the first neural network. The model learning unit 360 then adjusts the parameters of the first neural network so that the error of the output data from the output layer with respect to the teacher data becomes small. More specifically, the model learning unit 360 inputs the learning execution state information included in the learning data to the input layer of the first neural network. The model learning unit 360 then adjusts the parameters so that the output data (in other words, the delay amount 232) that the output layer produces for that learning execution state information has a small error with respect to the reference delay amount 232 corresponding to that learning execution state information. Error backpropagation or another method may be used to adjust the parameters. When the adjustment of the parameters is complete, the learning of the parameters ends. The adjusted parameters become the learned parameters and are stored in the storage device 301. The learned parameters in the storage device 301 are written, for example, to the memory 23 of the processor system 2 through the writing device 35. In the processor system 2, the learned parameters are used to update the delay amount 232. Note that the method of learning the parameters of the first neural network is not limited to this.
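The parameter adjustment described here is ordinary supervised learning. As a deliberately minimal stand-in for the first neural network, the Python sketch below trains a single linear unit by per-sample gradient descent on the squared error between its output and the reference value. It is not the network of the embodiment: the model shape, learning rate, epoch count, and the synthetic sample data are all assumptions chosen so that the example is small and reproducible.

```python
def predict(weights, bias, features):
    """Output of a single linear unit."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(weights, bias, samples, targets):
    errors = [(predict(weights, bias, f) - t) ** 2 for f, t in zip(samples, targets)]
    return sum(errors) / len(errors)


def train(samples, targets, learning_rate=0.1, epochs=2000):
    """Adjust the parameters so that the output error against the targets shrinks."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(samples, targets):
            error = predict(weights, bias, features) - target
            # Gradient step for the loss 0.5 * error**2.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Synthetic, normalized stand-ins for learning execution state information
# (inputs) and reference delay amounts (targets). Not measured data.
samples = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [0.4, 0.7]]
targets = [0.4, 0.5, 0.6, 0.7]

error_before = mean_squared_error([0.0, 0.0], 0.0, samples, targets)
weights, bias = train(samples, targets)
error_after = mean_squared_error(weights, bias, samples, targets)
print(error_before, error_after)
```

A multi-layer network trained with error backpropagation follows the same pattern, with the gradient step applied layer by layer. The learned `weights` and `bias` play the role of the learned parameters that are stored and later written to the target system.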

In this embodiment, the delay amount update device 250 further includes the update unit 252a shown in FIG. 16. The update unit 252a includes a trained model 460a and uses it to update the delay amount 232. In this example, the update unit 252 described above is used to generate the teacher data and the learning data, and after the learned parameters have been generated, the update unit 252a is used to update the delay amount 232.

The trained model 460a is composed of, for example, a second neural network that includes the learned parameters generated by the model learning unit 360. The second neural network has the same configuration as the first neural network. The processor execution state information 600 acquired by the acquisition unit 251 is input to the input layer of the second neural network. The second neural network performs computation based on the learned parameters on the processor execution state information 600 input to the input layer and outputs the delay amount 232 from the output layer. As a result, the output layer of the second neural network outputs a delay amount 232 such that the delayed parallel individual processing ends at the same timing as the preceding parallel individual processing.

FIG. 17 is a flowchart showing an example of the update process using the update unit 252a. The update process shown in FIG. 17 may be executed while the processor system 2 is in actual operation (in other words, while the embedded device in which the processor system 2 is mounted is in actual operation), or before the processor system 2 enters actual operation (for example, before shipment of the embedded device in which the processor system 2 is mounted). Hereinafter, the update process of FIG. 11 may be called the first update process, and the update process of FIG. 17 may be called the second update process.

As shown in FIG. 17, steps s11 and s12 described above are executed. Then, in step s24, the update unit 252a obtains a new delay amount 232 based on the processor execution state information 600 acquired in step s12. Specifically, the update unit 252a inputs the processor execution state information acquired in step s12 to the input layer of the second neural network. The update unit 252a then takes the delay amount 232 output from the output layer of the second neural network as the new delay amount 232. The update unit 252 replaces the delay amount 232 in the memory 23 with the new delay amount 232. The delay amount 232 in the memory 23 is thereby updated, and the second update process ends.
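Step s24 amounts to one inference followed by one store. The Python sketch below illustrates that flow under stated assumptions: the dictionary standing in for the memory 23, the key name, the function names, and the stub used in place of the second neural network are all hypothetical, and the stub is a fixed formula rather than a trained model.

```python
def second_update(memory, execution_state, trained_model):
    """Infer a new delay amount from the acquired execution state and
    overwrite the stored delay amount with it."""
    new_delay = trained_model(execution_state)
    memory["delay_amount"] = new_delay
    return new_delay


def stub_model(execution_state):
    """Placeholder for the trained model; NOT a neural network.
    It simply returns t_normal - t_fast so the example is deterministic."""
    return execution_state["t_normal"] - execution_state["t_fast"]


memory = {"delay_amount": 10}
acquired_state = {"t_normal": 100, "t_fast": 60}   # made-up values
second_update(memory, acquired_state, stub_model)
print(memory["delay_amount"])
```

Unlike the first update process, nothing here loops until an end condition holds. A single pass through the model yields the delay amount that is written back.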

なお、プロセッサシステム２の実稼働中において、第２更新処理が繰り返し実行されてもよい。この場合、第２更新処理は、定期的に繰り返し実行されてもよいし、不定期的に繰り返し実行されてもよい。また、第２更新処理は、ユーザの指示に応じて実行されてもよい。 Note that the second update process may be repeatedly executed during the actual operation of the processor system 2 . In this case, the second update process may be repeatedly executed periodically or irregularly. Also, the second update process may be executed according to a user's instruction.

Also, while the processor system 2 is in actual operation, the first update process may first be executed repeatedly to update the delay amount 232, after which the second update process may be executed repeatedly to update the delay amount 232. In this case, the learned parameters may be generated based on the teacher data and learning data obtained by the repeated execution of the first update process, and the second update process may be executed repeatedly using those learned parameters.

Also, in the above example, the multi-core processor system 2 generates the teacher data and the learning data, but a multi-core processor system having the same configuration as the multi-core processor system 2 may generate the teacher data and the learning data by repeatedly executing a process similar to the first update process. In this case, the delay amount update device 250 does not need the update unit 252.

Also, in the above example, the engineering tool 3 includes the model learning unit 360 and the delay amount estimation model 460, but the processor system 2 may include them instead. In this case, the trained delay amount estimation model 460 may be used as the trained model 460a.

Also, the delay amount estimation model 460 may be retrained. In this case, new teacher data and learning data may be acquired by repeatedly executing the second update process, and the delay amount estimation model 460 may be retrained based on the new teacher data and learning data. The second update process may then be executed using the learned parameters of the retrained delay amount estimation model 460.

上記の例では、プロセッサシステム２が遅延量更新装置２５０として機能していたが、遅延量更新装置２５０はプロセッサシステム２とは別に設けられてもよい。この場合、エンジニアリングツール３及び遅延量更新装置２５０を備える処理システム１に、プロセッサシステム２が含められなくてもよい。つまり、エンジニアリングツール３及び遅延量更新装置２５０を備える処理システム１と、それとは別のプロセッサシステム２とを備えるシステムが構築されてもよい。 Although the processor system 2 functions as the delay update device 250 in the above example, the delay update device 250 may be provided separately from the processor system 2 . In this case, the processor system 2 may not be included in the processing system 1 including the engineering tool 3 and the delay update device 250 . That is, a system may be constructed that includes the processing system 1 that includes the engineering tool 3 and the delay update device 250, and the processor system 2 that is separate from it.

本開示は詳細に説明されたが、上記した説明は、すべての局面において、例示であって、限定的なものではない。例示されていない無数の変形例が想定され得るものと解される。 While the present disclosure has been described in detail, the foregoing description is, in all respects, illustrative and not restrictive. It is understood that a myriad of variations not illustrated may be envisioned.

また、各実施の形態を自由に組み合わせたり、各実施の形態を適宜、変形、省略することが可能である。 In addition, it is possible to freely combine each embodiment, and to modify or omit each embodiment as appropriate.

１処理システム、２マルチコアプロセッサシステム、２０マルチコアプロセッサ、２１，２１ａ，２１ｂ，２１ｃ，２１ｄコア、２２共有キャッシュ、３０プログラム作成装置、２３１更新プログラム、２５０遅延量更新装置、２５１取得部、２５２，２５２ａ更新部、３２０特定部、３３０生成部、３０１ａプログラム。 1 processing system, 2 multi-core processor system, 20 multi-core processor, 21, 21a, 21b, 21c, 21d core, 22 shared cache, 30 program creation device, 231 update program, 250 delay amount update device, 251 acquisition unit, 252, 252a Update unit, 320 identification unit, 330 generation unit, 301a program.

Claims

A delay amount update device whose processing target is a program created by a program creation device, the program creation device creating a program to be executed by a multi-core processor of a multi-core processor system that comprises the multi-core processor, which includes a plurality of cores, and a shared cache shared by the plurality of cores,
wherein the program creation device comprises:
a specifying unit that specifies, in a processing target program, a target partial program to be converted into a data parallel program in which data parallel processing is described; and
a generation unit that converts the target partial program into the data parallel program, in which the data parallel processing is described such that, among the plurality of cores, the processing start timing of some cores is delayed relative to the processing start timing of the other cores, and that generates an execution target program including the data parallel program thus obtained,
the delay amount update device comprising:
an acquisition unit that acquires execution state information indicating an execution state of the data parallel processing in the multi-core processor; and
an update unit that updates, based on the execution state information, a delay amount of the processing start timing of the some cores in the data parallel processing described in the data parallel program,
wherein the execution state information includes an execution time of the data parallel processing in the some cores and an execution time of the data parallel processing in the other cores.

The delay amount update device according to claim 1,
wherein the update unit updates the delay amount using a trained model that outputs the delay amount when the execution state information is input.

A processing system comprising:
the delay amount update device according to claim 1 or claim 2; and
the program creation device that generates the data parallel program to be processed by the delay amount update device.

A program whose processing target is a program created by a program creation device, the program creation device creating a program to be executed by a multi-core processor of a multi-core processor system that comprises the multi-core processor, which includes a plurality of cores, and a shared cache shared by the plurality of cores,
wherein the program creation device comprises:
a specifying unit that specifies, in a processing target program, a target partial program to be converted into a data parallel program in which data parallel processing is described; and
a generation unit that converts the target partial program into the data parallel program, in which the data parallel processing is described such that, among the plurality of cores, the processing start timing of some cores is delayed relative to the processing start timing of the other cores, and that generates an execution target program including the data parallel program thus obtained,
the program causing a computer device to execute:
a step of acquiring execution state information indicating an execution state of the data parallel processing in the multi-core processor; and
a step of updating, based on the execution state information, a delay amount of the processing start timing of the some cores in the data parallel processing described in the data parallel program,
wherein the execution state information includes an execution time of the data parallel processing in the some cores and an execution time of the data parallel processing in the other cores.