JP5379765B2

JP5379765B2 - Program execution apparatus and program execution method

Info

Publication number: JP5379765B2
Application number: JP2010184509A
Authority: JP
Inventors: 悟近藤; 淳一赤埴
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2010-08-20
Filing date: 2010-08-20
Publication date: 2013-12-25
Anticipated expiration: 2030-08-20
Also published as: JP2012043232A

Abstract

PROBLEM TO BE SOLVED: To provide technology accelerating the speed of a processor. SOLUTION: A program execution device comprises an analysis unit that decides independence-promoting processing for allocating a thread to a specific core (occupied core) in a fixed manner and ordering the execution so as to accelerate the speed of a processor. The analysis unit includes: a step S403 for calculating a cache hit coefficient P, comparing it with a threshold and narrowing down options for the independence-promoting processing; a step S406 for calculating an independence-promoting determination value F, comparing it with a threshold and determining the independence-promoting processing from the options; and a step S411 for calculating a throughput coefficient TH, generating core allocation information to execute core occupation processing when the core occupation processing has the greater throughput coefficient TH than normal processing, and executing core allocation processing (high-speed processing) based on the core allocation information. COPYRIGHT: (C)2012,JPO&INPIT

Description

The present invention relates to a technique for improving the processing speed of a processor.

Patent Document 1 discloses a technique that, in order to improve processing speed on a multi-core processor, schedules which thread runs on which core at which time so as to prevent cache misses. Specifically, threads are scheduled onto the core on which they previously ran so that data already stored in the L2 (Level 2) cache can be reused, which reduces data-load time and thereby speeds up processing.

JP 2000-148518 A

However, when a single thread handles data with many variations (differences in data address, data size, data bit pattern, and so on) during processing, and all of that data is used about equally, the data is in most cases cached out (no longer present in the cache). Moreover, even when a particular thread viewed in isolation handles data with few variations and could achieve a high cache hit rate, if other threads running on the same core use data with many variations, the other threads' data comes to dominate the L2 cache, and data that would have a high hit rate on its own is likely to be cached out. In other words, with the technique described in Patent Document 1 there are cases in which an improvement in processor speed cannot be reliably expected.

To improve the processing efficiency of the processor, processes that can run concurrently may also be run in parallel on multiple cores as a thread group. In that case, however, even data with few variations, which should inherently have a high cache hit rate, ends up being referenced at the same address from multiple cores because the threads run on multiple cores, and data loading may consequently take a long time. When cache data for the same address exists in the caches of multiple cores, the overhead of maintaining cache coherency (the consistency of identical data stored in multiple caches) grows, which hinders improvement of the processor's processing speed.

Accordingly, an object of the present invention is to solve the above problems and provide a technique for improving the processing speed of a processor.

The present invention is a program execution apparatus that divides a program into threads, assigns each of the threads to a core constituting a CPU (Central Processing Unit), and executes the program. The apparatus comprises a storage unit, an analysis unit, and an execution unit.
The storage unit stores, in association with each other, the processing start time of each variable described in the program and cache hit information indicating that the processing of that variable resulted in a cache hit, both collected at a predetermined period during execution of normal processing, in which the program is processed with arbitrary threads assigned to arbitrary cores. The storage unit also stores the number of cores of the CPU and a threshold used to judge the effect of a cache hit.
The analysis unit performs the following processes:
(1) It reads the processing start time of a first variable among the variables and the processing start time of a second variable processed next after the first variable, calculates the difference, calculates the average of the differences over the number of times they were collected, and takes the calculated average as a first processing time.
(2) It reads the processing start time of the first variable with which the cache hit information is associated and the processing start time of the second variable processed next after that first variable, calculates the difference, calculates the average of the differences over the number of times they were collected, and takes the calculated average as a second processing time.
(3) It divides the first processing time by the second processing time to calculate an independence determination value.
(4) It determines whether the independence determination value is greater than the threshold, which indicates the number of threads used for processing the first variable. If it is greater, the processing time of the first variable is set to the second processing time; otherwise, the processing time of the first variable is set to the first processing time.
(5) It executes processes (1) to (4) for the variables of the program.
(6) It divides the number of cores of the CPU by the sum of the first processing times of the variables of the program during execution of the normal processing, and takes the result as a first throughput coefficient.
(7) It calculates a first division value, which represents the throughput when one core of the CPU is assigned to process one variable whose processing time is the second processing time: the numerator is 1 (the number of cores) and the denominator is the sum of the second processing time and the waiting time from the processing of that variable until the next processing. It also calculates a second division value: the numerator is the number of cores of the CPU minus the number of variables whose processing time is the second processing time, and the denominator is the total, over the variables of the program whose processing time did not become the second processing time, of the first processing time and the waiting time until the processing that follows the processing of that variable. The smallest of the first division value and the second division value is taken as a second throughput coefficient.
(8) If the second throughput coefficient is greater than the first throughput coefficient, it generates core allocation information indicating that the thread that processes each variable whose processing time is the second processing time is to be assigned to a fixed core.
The execution unit, based on the core allocation information generated by the analysis unit, assigns the thread that processes each variable whose processing time is the second processing time to the fixed core and executes the program.

The present invention is also a program execution method used in a program execution apparatus that divides a program into threads, assigns each of the threads to a core constituting a CPU, and executes the program. The program execution apparatus comprises a storage unit, an analysis unit, and an execution unit.
The storage unit stores, in association with each other, the processing start time of each variable described in the program and cache hit information indicating that the processing of that variable resulted in a cache hit, both collected at a predetermined period during execution of normal processing, in which the program is processed with arbitrary threads assigned to arbitrary cores. The storage unit also stores the number of cores of the CPU and a threshold used to judge the effect of a cache hit.
The analysis unit executes the following steps:
(1) A step of reading the processing start time of a first variable among the variables and the processing start time of a second variable processed next after the first variable, calculating the difference, calculating the average of the differences over the number of times they were collected, and taking the calculated average as a first processing time.
(2) A step of reading the processing start time of the first variable with which the cache hit information is associated and the processing start time of the second variable processed next after that first variable, calculating the difference, calculating the average of the differences over the number of times they were collected, and taking the calculated average as a second processing time.
(3) A step of dividing the first processing time by the second processing time to calculate an independence determination value.
(4) A step of determining whether the independence determination value is greater than the threshold, which indicates the number of threads used for processing the first variable, and setting the processing time of the first variable to the second processing time if it is greater, or to the first processing time otherwise.
(5) A step of executing processes (1) to (4) for the variables of the program.
(6) A step of dividing the number of cores of the CPU by the sum of the first processing times of the variables of the program during execution of the normal processing, and taking the result as a first throughput coefficient.
(7) A step of calculating a first division value, which represents the throughput when one core of the CPU is assigned to process one variable whose processing time is the second processing time, with a numerator of 1 (the number of cores) and a denominator equal to the sum of the second processing time and the waiting time from the processing of that variable until the next processing; calculating a second division value whose numerator is the number of cores of the CPU minus the number of variables whose processing time is the second processing time and whose denominator is the total, over the variables of the program whose processing time did not become the second processing time, of the first processing time and the waiting time until the processing that follows the processing of that variable; and taking the smallest of the first division value and the second division value as a second throughput coefficient.
(8) A step of generating, if the second throughput coefficient is greater than the first throughput coefficient, core allocation information indicating that the thread that processes each variable whose processing time is the second processing time is to be assigned to a fixed core.
The execution unit executes a step of assigning, based on the core allocation information generated by the analysis unit, the thread that processes each variable whose processing time is the second processing time to the fixed core and executing the program.

With this configuration, processes (1) to (8) are executed on measured values, so that, with the variation of the variables taken into account, the threads that process variables for which a cache hit has a clear benefit are identified. For example, processes (1) to (5) calculate the independence determination value as an index for judging how much a cache hit contributes to improving processing speed. Processes (6) to (8) then calculate throughput coefficients for the cases that the independence determination value judged to benefit from cache hits, and thereby verify that benefit. Based on the verification result, the thread can be assigned to a fixed core. That is, the processing speed of the processor can be improved while the variation of the variables is taken into account.
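Processes (1) to (8) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the tuple layouts, the fallback used when no cache-hit samples exist, and the handling of the case in which every variable is pinned are all hypothetical choices made for the example.

```python
from statistics import mean


def processing_times(samples):
    """Processes (1) and (2) for one variable.

    samples: list of (start_time, next_start_time, cache_hit) tuples
    (hypothetical layout). Returns (first, second) processing times.
    """
    diffs = [nxt - start for start, nxt, _hit in samples]
    hit_diffs = [nxt - start for start, nxt, hit in samples if hit]
    t1 = mean(diffs)
    # Assumption: with no cache-hit samples, fall back to the first time.
    t2 = mean(hit_diffs) if hit_diffs else t1
    return t1, t2


def plan_core_allocation(variables, n_cores, threshold):
    """Processes (3) to (8).

    variables: {name: (t1, t2, wait)} with wait the waiting time until
    the next processing. Returns the names whose threads would be pinned
    to fixed cores, or [] when normal processing should continue.
    """
    # (3)-(5): independence determination value F = t1 / t2 per variable.
    occupied = [n for n, (t1, t2, _w) in variables.items()
                if t1 / t2 > threshold]
    if not occupied:
        return []
    # (6): first throughput coefficient (normal processing).
    th1 = n_cores / sum(t1 for t1, _t2, _w in variables.values())
    # (7): one core per pinned variable ...
    values = [1 / (variables[n][1] + variables[n][2]) for n in occupied]
    # ... and the remaining cores shared by the remaining variables.
    rest = [n for n in variables if n not in occupied]
    if rest:
        denom = sum(variables[n][0] + variables[n][2] for n in rest)
        values.append((n_cores - len(occupied)) / denom)
    th2 = min(values)
    # (8): pin only when core occupation gives the higher throughput.
    return occupied if th2 > th1 else []
```

For example, with four cores, a threshold of 2.0, and three variables of which only `a0` speeds up markedly on a cache hit, the sketch selects `a0` for a fixed core; with a single core, it keeps normal processing because no core is left for the other variables.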

In the present invention, the storage unit further stores the cache size of the cores of the CPU; the array length of each variable, the memory address at which the variable is stored during execution of the normal processing, and its data size, all collected at a predetermined period during execution of the normal processing; and a second threshold for the cache hit rate. The analysis unit further comprises calculation means and determination means. The calculation means, for each variable, reads the addresses from the storage unit and counts the number of different addresses, and then, taking as parameters that number of different addresses together with the cache size, the array length, and the data size read from the storage unit, calculates a cache hit rate that becomes larger as the number of different addresses decreases, the array length decreases, the data size decreases, or the cache size increases. The determination means determines whether the cache hit rate is greater than the second threshold stored in the storage unit. When the determination means determines that the cache hit rate is greater than the second threshold, the analysis unit executes processes (1) to (8) with that variable as the first variable used in processes (1) and (2).

In the method of the present invention, likewise, the program execution apparatus comprises the storage unit, which further stores the cache size of the cores of the CPU; the array length of each variable, the memory address at which the variable is stored during execution of the normal processing, and its data size, all collected at a predetermined period during execution of the normal processing; and a second threshold for the cache hit rate. The analysis unit executes: a calculation step of, for each variable, reading the addresses from the storage unit, counting the number of different addresses, and, taking as parameters that number of different addresses together with the cache size, the array length, and the data size read from the storage unit, calculating a cache hit rate that becomes larger as the number of different addresses decreases, the array length decreases, the data size decreases, or the cache size increases; a determination step of determining whether the cache hit rate is greater than the second threshold stored in the storage unit; and a step of executing processes (1) to (8), when the determination step determines that the cache hit rate is greater than the second threshold, with that variable as the first variable used in processes (1) and (2).

With this configuration, the candidate variables to be used in processes (1) to (8) can be roughly narrowed down based on the cache hit rate. This shortens the time needed to decide which variables will have their threads assigned to and executed on a specific core, so the processing speed of the processor can be improved in a short time.
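The narrowing step can be illustrated with a small sketch. The text here states only the monotonic behaviour of the cache hit rate (larger for fewer distinct addresses, shorter arrays, smaller data sizes, and larger caches), not a formula, so the ratio below is purely an assumed stand-in that satisfies those properties up to the clamp at 1.0. The function names and the interpretation of the data size as a per-element size in bytes are also assumptions.

```python
def cache_hit_coefficient(cache_size, n_addresses, array_length, element_size):
    """Hypothetical estimate: cache size over the estimated working set.

    Not the formula of the patent; it only mirrors the stated monotonic
    behaviour. The result is clamped to 1.0.
    """
    working_set = n_addresses * array_length * element_size
    return min(1.0, cache_size / working_set)


def narrow_candidates(stats, cache_size, second_threshold):
    """Keep variables whose estimate exceeds the second threshold.

    stats: {name: (n_addresses, array_length, element_size)}.
    """
    return [name for name, (n, length, size) in stats.items()
            if cache_hit_coefficient(cache_size, n, length, size)
            > second_threshold]
```

Only the variables returned by `narrow_candidates` would then go through processes (1) to (8), which is where the time saving described above comes from.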

According to the present invention, a technique for improving the processing speed of a processor can be provided.

A diagram showing a configuration example of the high-speed processing system in this embodiment.
A diagram showing an example processing sequence of the high-speed processing system.
A diagram showing an example of program analysis and an example of data analysis.
A diagram showing an example processing flow in the analysis unit of the high-speed processing system.
A diagram showing an example of determining whether to perform independence processing.
A diagram showing an example of assigning threads to cores in this embodiment.
A diagram showing an application example of the high-speed processing system.

The program execution device in the mode for carrying out the present invention (hereinafter, "the present embodiment") is configured to determine the threads that handle data with few variations, assign those threads in a fixed manner to cores of a multi-core processor, and process them in parallel in a pipeline fashion (described later), in order to improve the processing speed of software on the processor. The configuration and processing flow are described in detail below with reference to the drawings as appropriate.

The configuration of the program execution device in this embodiment will be described with reference to FIG. 1.
As shown in FIG. 1, the program execution device 10 is arranged in a network 20 and can exchange information with terminals 30 (30A, 30B, 30C), such as PCs (Personal Computers), connected to the network 20. The processing operation of the program execution device 10 is determined by the program deployed from a terminal 30. The program execution device 10 is, for example, a general-purpose computer, a server, or a router.
The program execution device 10 receives the program deployed from the terminal 30C, runs that program to process the data received from the terminal 30A, and transmits the processed data to the terminal 30B.
Although FIG. 1 shows only three terminals 30, four or more terminals may be connected to the network 20.

Next, a configuration example of the program execution device 10 will be described.
As shown in FIG. 1, the program execution device 10 includes an input unit 11, an execution unit 12, an output unit 13, an analysis unit 14, a storage unit 15, and a reception unit 16. The program execution device 10 consists of a processing unit (not shown), made up of a CPU and a main memory (neither shown), and the storage unit 15, which stores application programs and the like. The processing unit loads the application program stored in the storage unit 15 into the main memory and thereby implements the execution unit 12 and the analysis unit 14.

The input unit 11 is a communication interface and receives data information, which contains the data to be processed, from the terminal 30.
The execution unit 12 acquires the threads of the program generated by the analysis unit 14 from the storage unit 15, executes the acquired threads, and processes the data acquired via the input unit 11.
The output unit 13 is a communication interface and transmits data information, which contains the processing result as data, to the terminal 30.

The analysis unit 14 executes a compile process that divides the program deployed from the terminal 30 into threads, and stores the resulting threads (the divided program) in the storage unit 15. The analysis unit 14 also executes a process that decides how the threads are to be assigned to cores, and stores core allocation information (described later) about that assignment in the storage unit 15.
The reception unit 16 is a communication interface and receives program information, which contains the deployed program, from the terminal 30. The reception unit 16 may instead be an interface connected directly to the terminal 30C by a communication cable rather than through the network 20.
The storage unit 15 stores, among other things, the results of processing by the analysis unit 14.

Next, an example processing sequence of the program execution device 10 will be described with reference to FIG. 2 (see also FIG. 1 as appropriate).
In step S201, the terminal 30C deploys (transmits) to the program execution device 10 the program that the device is to execute, and the reception unit 16 receives the program. The program may be a program written for sequential processing.
In step S202, the reception unit 16 stores the received program in the storage unit 15.
In step S203, the analysis unit 14 acquires the program from the storage unit 15. The acquisition is triggered, for example, when a new program is stored.
In step S204, the analysis unit 14 performs compiler-level analysis. Specifically, the analysis unit 14 compiles the program and generates a divided program that is split into threads.
In step S205, the analysis unit 14 stores the analysis result (the divided program) in the storage unit 15.

In step S206, the execution unit 12 acquires the divided program from the storage unit 15. The acquisition is triggered, for example, when a new analysis result is stored. Based on the divided program, the execution unit 12 then puts normal processing (sequential processing) into the running state, as indicated by the thick line.
In step S207, the terminal 30A transmits data information containing data to the program execution device 10, and the input unit 11 receives the data information.
In step S208, the execution unit 12 acquires the data via the input unit 11.
In step S209, the execution unit 12 performs normal processing (sequential processing) on the acquired data.
In steps S210 to S211, the execution unit 12 outputs data information containing the processing result to the terminal 30B via the output unit 13.

In step S212, the execution unit 12 collects, at each preset period, the processing progress status and the processing start time of the data processed by the normal processing as analysis data, and stores them in the storage unit 15. The processing progress status is, for example, the data size of an argument (variable) of a function described in the program and the memory address at which the variable is stored. When the variable is an array, for example, its data size as a variable is the array length × the data size of each element. The processing start time is, for example, the time at which processing of the variable started.
In step S213, the analysis unit 14 acquires the analysis data stored in the storage unit 15.
In step S214, the analysis unit 14 reads and analyzes the analysis data stored in the storage unit 15. Through this analysis it determines whether high-speed processing can be performed and, if so, creates core allocation information indicating which threads are to be assigned to, and occupy, cores. If the analysis unit 14 determines that high-speed processing cannot be performed, it does not create core allocation information.

In step S215, the analysis unit 14 stores the core allocation information created in step S214 in the storage unit 15. If the analysis unit 14 has determined that high-speed processing cannot be performed and core allocation information is already stored in the storage unit 15, it erases that core allocation information.
In step S216, the execution unit 12 acquires the core allocation information from the storage unit 15. Based on that core allocation information, the execution unit 12 then puts high-speed processing (core occupation processing) into the running state, as indicated by the thick dotted line. If no core allocation information is stored in the storage unit 15, high-speed processing (core occupation processing) does not start, and normal processing (sequential processing) continues.
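The interplay of steps S215 and S216 can be sketched as follows. This is an illustrative sketch only: the storage unit is modeled as a plain dictionary, and the key name, function names, and mode strings are hypothetical.

```python
def update_core_allocation(store, new_allocation):
    """Step S215: store fresh core allocation information, or erase any
    stale information when the analysis found no high-speed processing.

    store: dict standing in for the storage unit (an assumption).
    new_allocation: mapping of thread to core, or None/empty.
    """
    if new_allocation:
        store["core_allocation"] = new_allocation
    else:
        store.pop("core_allocation", None)


def select_mode(store):
    """Step S216: run core occupation processing only when core
    allocation information is present; otherwise keep normal processing."""
    return "core-occupation" if "core_allocation" in store else "normal"
```

Erasing stale information in step S215 matters because the presence or absence of core allocation information is what the execution unit uses to choose between the two modes.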

In step S217, the terminal 30A transmits data information containing data to the program execution device 10, and the input unit 11 receives the data information.
In step S218, the execution unit 12 acquires the data via the input unit 11.
In step S219, the execution unit 12 performs high-speed processing (core occupation processing) on the acquired data.
In steps S220 to S221, the execution unit 12 outputs data information containing the processing result to the terminal 30B via the output unit 13.

In step S214, the analysis unit 14 preferably acquires analysis data covering at least two iterations from the storage unit 15, applies preprocessing such as statistical processing to calculate average values, and only then performs the analysis.

An example of how the execution unit 12 collects the analysis data in step S212 of FIG. 2 will now be described with reference to FIG. 3.
The left side of FIG. 3 shows an example program; the programming language is not limited. A specifier (written as "analyze" in FIG. 3), which triggers collection of the processing progress and the processing start time, is attached to executable statements and variables in the program. Based on this specifier, the execution unit 12 collects, at a preset period, the data size of each variable of the functions described in the program (for an array, array length × data size of each element), the address at which the variable is stored, and the processing start time.

The right side of FIG. 3 shows an example of the collection result. In the present embodiment, the collection targets are data structures whose size can be measured, such as the number of elements of an array (for example, a0, a1) or the number of leaves of a tree (for example, tree0, tree1). The targets are not limited to these; any other element may be used as long as its scale can be measured. Because a tree structure can be represented as an array, the following description does not distinguish between trees and arrays and refers to both as arrays.
As shown in FIG. 3, the following items are collected: the array length (number of elements in the array), the data size of each element of the array, whether the access was a cache hit (shown as "hit") or a cache miss (shown as "miss"), the processing start time, the address called during the normal processing (sequential processing), and the data size of the array.
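As a rough illustration of the items listed for FIG. 3, the following Python sketch builds one record per annotated array. The names `AnalysisRecord` and `collect` are hypothetical and do not come from the patent. Detecting a cache hit or miss is platform-specific, so the sketch assumes the caller supplies that flag.

```python
import time
from array import array
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnalysisRecord:
    """One collected sample for an annotated variable (hypothetical layout)."""
    name: str                  # label of the variable, e.g. "a0"
    array_length: int          # number of elements in the array
    element_size: int          # data size of each element, in bytes
    cache_hit: Optional[bool]  # True = hit, False = miss, None = not measured
    start_time: int            # processing start time (monotonic clock, ns)
    address: int               # address of the array's buffer
    data_size: int             # array_length * element_size, in bytes


def collect(name: str, values: array,
            cache_hit: Optional[bool] = None) -> AnalysisRecord:
    """Build a record for `values`; hit/miss must be supplied by the caller."""
    address, length = values.buffer_info()  # (buffer address, element count)
    return AnalysisRecord(
        name=name,
        array_length=length,
        element_size=values.itemsize,
        cache_hit=cache_hit,
        start_time=time.perf_counter_ns(),
        address=address,
        data_size=length * values.itemsize,
    )


record = collect("a0", array("i", [1, 2, 3, 4]), cache_hit=True)
```

The data size follows the rule stated above for arrays: array length multiplied by the data size of each element.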

Next, the processing flow of the analysis unit 14 in step S214 of FIG. 2 will be described with reference to FIG. 4 (see also FIGS. 2 and 3 as appropriate).
In step S401, the analysis unit 14 acquires the analysis data stored in the storage unit 15. The analysis unit 14 then applies statistical processing to the analysis data: it counts the number of different addresses (denoted K) and calculates the average array length (denoted L) and the average data size of each element (denoted E). The analysis unit 14 also acquires the cache size C of the core from, for example, the system information (not shown) of the program execution device 10.
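The statistical processing of step S401 could look roughly like the following Python sketch. The record field names (`address`, `array_length`, `element_size`) and the sample values are assumptions made for illustration; the patent does not specify a storage format.

```python
from statistics import mean


def summarize(records):
    """Return (K, L, E) for a list of collected records.

    Each record is a dict with the hypothetical keys 'address',
    'array_length', and 'element_size'.
    """
    k = len({r["address"] for r in records})            # number of different addresses
    avg_len = mean(r["array_length"] for r in records)  # average array length L
    avg_elem = mean(r["element_size"] for r in records) # average element size E
    return k, avg_len, avg_elem


# Made-up sample data, for illustration only.
samples = [
    {"address": 0x1000, "array_length": 8, "element_size": 4},
    {"address": 0x1000, "array_length": 12, "element_size": 4},
    {"address": 0x2000, "array_length": 10, "element_size": 4},
]
K, L, E = summarize(samples)
```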

In step S402, the analysis unit 14 calculates the cache hit coefficient P = (1/K + 1/L) × C/E. The characteristics of P are as follows. When K is large, that is, when a different address is used for every operation, the address range used by the thread is wide and the probability of hitting the L2 cache is considered low, so P is estimated to be small. When L is large, the data handled by the processing at that location is more likely not to fit in the L2 cache, so P is again estimated to be small. Likewise, when E is large, the data is more likely not to fit in the L2 cache, so P is estimated to be small. Conversely, the larger C is, the more likely the data handled by the processing at that location fits in the L2 cache, so P is estimated to be large. Using the cache hit coefficient P therefore makes it possible to roughly extract candidate processing locations that handle data likely to fit in the cache once they are run as threads.
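The formula for P and the four tendencies described for it can be checked with a short Python sketch. The function name and the numeric inputs (including the 256 KiB cache size) are illustrative assumptions, not values from the patent.

```python
def cache_hit_coefficient(k: float, l: float, e: float, c: float) -> float:
    """P = (1/K + 1/L) * C / E, as given for step S402."""
    return (1 / k + 1 / l) * c / e


# Illustrative inputs: 2 addresses, average length 10, 4-byte elements,
# and an assumed cache size of 256 KiB.
base = cache_hit_coefficient(k=2, l=10, e=4, c=262144)

larger_k = cache_hit_coefficient(k=4, l=10, e=4, c=262144)   # smaller than base
larger_l = cache_hit_coefficient(k=2, l=20, e=4, c=262144)   # smaller than base
larger_e = cache_hit_coefficient(k=2, l=10, e=8, c=262144)   # smaller than base
larger_c = cache_hit_coefficient(k=2, l=10, e=4, c=524288)   # larger than base
```

Increasing K, L, or E lowers P, while increasing C raises it, which matches the qualitative behavior described for step S402.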

In step S403, the analysis unit 14 determines whether the cache hit coefficient P is greater than a preset threshold Th0 (second threshold). The threshold Th0 (second threshold) is used to classify independence processing and is stored in the storage unit 15.
If P is greater than the threshold Th0 (Yes in step S403), then in step S404 the analysis unit 14 sets the processing marked with the specifier as a candidate for independence processing (processing that is assigned to a core exclusively and executed there).
If P is equal to or less than the threshold Th0 (No in step S403), the process proceeds to step S413.
Steps S401 to S404 are preprocessing for determining the threads for independence processing; they roughly screen out processing that does not qualify as independence processing. This also shortens the processing time of step S405 and later, where the threads for independence processing are determined. When there are few variables, for example, steps S401 to S404 may therefore be omitted.
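The screening in steps S403 and S404 amounts to a strict greater-than filter against Th0. A minimal Python sketch, with hypothetical process names and made-up P values:

```python
def select_candidates(p_by_process: dict, th0: float) -> list:
    """Keep the processes whose cache hit coefficient P exceeds Th0 (S403/S404).

    Processes with P <= Th0 are dropped, which corresponds to the branch
    that goes to step S413.
    """
    return [name for name, p in p_by_process.items() if p > th0]


# Made-up coefficients for three annotated processes.
candidates = select_candidates({"a1": 3.0, "a2": 1.0, "a3": 2.0}, th0=2.0)
```

Note that a process with P exactly equal to Th0 is not selected, because the source states the condition as "greater than".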

In step S405, for each candidate for independence processing, the analysis unit 14 reads the processing start times from the analysis data stored in the storage unit 15 and calculates the independence determination value F, where F = (processing time M during normal processing) / (processing time H on a cache hit).
A specific example of this processing will be described with reference to FIG. 5.

The upper part of FIG. 5(a) shows the processing time of the specifier-marked process [a1] in normal processing. The time from the processing start time of process [a1] (= 101001) to the processing start time of process [a2] (= 101250) is taken as the processing time (M) of the marked process in normal processing. In FIG. 5(a), the value of M is 249. M is calculated by statistical processing without distinguishing cache hits from cache misses; for example, M is an average value.
The lower part of FIG. 5(a) shows the processing time of process [a1] when a cache hit occurs. The time from the processing start time of process [a1] (= 318741) to the processing start time of process [a2] (= 318806) is taken as the processing time (H) on a cache hit. In FIG. 5(a), the value of H is 55. H is calculated by statistical processing over only the processing times for which a cache hit occurred; for example, H is an average value.
In the case of FIG. 5(a), the independence determination value is F = M / H = 249 / 55 = 4.53.

The upper part of FIG. 5(b) shows the processing time of the specifier-marked process [a2] in normal processing. The time from the processing start time of process [a2] (= 101250) to the processing start time of process [a3] (= 101510) is taken as the processing time (M) of the marked process in normal processing. In FIG. 5(b), the value of M is 260.
The lower part of FIG. 5(b) shows the processing time of process [a2] when a cache hit occurs. The time from the processing start time of process [a2] (= 318806) to the processing start time of process [a3] (= 318846) is taken as the processing time (H) on a cache hit. In FIG. 5(b), the value of H is 40.
In the case of FIG. 5(b), the independence determination value is F = M / H = 260 / 40 = 6.5.
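The two F values of FIG. 5 can be reproduced with a small Python sketch. It takes the M and H values stated for FIG. 5 as single samples and averages them, since the text gives the average as one example of the statistical processing. The function name is a hypothetical choice.

```python
from statistics import mean


def independence_value(normal_times, hit_times) -> float:
    """F = M / H, where M and H are averages of the given duration samples.

    normal_times: durations measured in normal processing (hits and misses).
    hit_times:    durations measured only when a cache hit occurred.
    """
    m = mean(normal_times)
    h = mean(hit_times)
    return m / h


f_a = independence_value([249], [55])   # FIG. 5(a): process [a1]
f_b = independence_value([260], [40])   # FIG. 5(b): process [a2]
```

Rounded to two decimals, `f_a` is 4.53 and `f_b` is 6.5, matching the values given for FIG. 5(a) and FIG. 5(b).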

Returning to FIG. 4, in step S406 the analysis unit 14 uses the number of threads included in the candidate for independence processing as a threshold and determines whether the independence determination value F is greater than that threshold. This threshold is calculated in the compiler-level analysis of step S204 and is stored in the storage unit 15 together with the analysis result (divided program) of step S205.
For example, with a threshold of 5 in the example of FIG. 5, in the case of FIG. 5(a) the independence determination value F = 4.53 < threshold (= 5), so the process is determined not to be made independence processing. In the case of FIG. 5(b), F = 6.5 > threshold (= 5), so the process is determined to be made independence processing.
In this determination, F = threshold means a speedup equal to the original degree of parallelism. If F > threshold, the candidate for independence processing can be handled entirely by one core, and the effect of occupying the core (improved processing speed) can be expected.
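One possible reading of the comparison in step S406: if a process were split across n threads with ideal scaling, it would take about M/n, whereas on an occupied core with cache hits it takes about H. The condition H < M/n is equivalent to F = M/H > n. The Python sketch below applies the test to the FIG. 5 values with the example threshold of 5; the function name is hypothetical.

```python
def is_independent(f: float, thread_count: int) -> bool:
    """Step S406: confirm independence processing only if F > thread count."""
    return f > thread_count


threshold = 5                                   # example thread count from the text
decision_a = is_independent(4.53, threshold)    # FIG. 5(a)
decision_b = is_independent(6.5, threshold)     # FIG. 5(b)
```

With these inputs, the process of FIG. 5(a) is rejected and the process of FIG. 5(b) is accepted, as described above.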

If the independence determination value F is equal to or less than the threshold in step S406 (No in step S406), the process proceeds to step S413.
If the independence determination value F is greater than the threshold (Yes in step S406), then in step S407 the analysis unit 14 confirms the candidate as independence processing. When F is greater than the threshold, the threads of the confirmed independence processing are threads capable of high-speed processing that handle data that can fit in the cache.

In step S408, the analysis unit 14 determines whether independence processes are consecutive and P is smaller than a preset threshold (Th1). The threshold (Th1) is used to determine whether to join independence processes and is stored in the storage unit 15.
If the independence processes are consecutive and P is determined to be smaller than the threshold (Th1) (Yes in step S408), then in step S409 the analysis unit 14 combines the consecutive independence processes into one independence process. When a sufficiently small value is set for the threshold (Th1), P falling below Th1 suggests that the L2 cache of one core may have room to spare. Therefore, as long as such independence regions continue, they are kept as a single thread without being divided. Increasing the thread granularity in this way reduces a drawback that can arise in pipeline processing (described later): the waiting time in the queues provided to coordinate data input and output between processes becomes shorter, and the processing speed improves.
If the result of step S408 is No, the process skips step S409 and proceeds to step S410.
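The merging in steps S408 and S409 can be sketched in Python as grouping runs of adjacent independence processes. The sketch assumes that P is evaluated per process and that processes are listed in program order; both are assumptions, as are the function name and the sample values.

```python
def merge_consecutive(processes, th1: float):
    """Group runs of consecutive independence processes whose P is below Th1.

    processes: list of (name, is_independent, p) tuples in program order.
    Returns a list of tuples; each tuple is one resulting thread/stage.
    """
    merged = []
    run = []  # current run of mergeable independence processes
    for name, independent, p in processes:
        if independent and p < th1:
            run.append(name)
        else:
            if run:                      # close the run before this process
                merged.append(tuple(run))
                run = []
            merged.append((name,))       # this process stays on its own
    if run:
        merged.append(tuple(run))
    return merged


# Made-up flags and coefficients, for illustration only.
stages = merge_consecutive(
    [("b0", True, 1.0), ("b1", True, 1.5), ("b2", False, 0.2), ("b3", True, 9.0)],
    th1=2.0,
)
```

Here `b0` and `b1` are adjacent, independent, and below Th1, so they become one stage; `b2` and `b3` remain separate stages. Fewer stages means fewer queues between stages.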

In step S410, the analysis unit 14 calculates the throughput coefficient TH for the independence processing.
A specific example of calculating the throughput coefficient TH will be described with reference to FIG. 6. In the example of FIG. 6, the processor has 16 cores and one thread is assigned to each core.
The upper part of FIG. 6 shows the processing time in normal processing (using 16 threads) and the processing time in core occupation processing. In core occupation processing, the CPU cores are divided into a region in which independence processing is placed exclusively and the remaining region, and threads are assigned accordingly. For example, as in the lower-left diagram of FIG. 6, one independence process is placed in a fixed manner on one occupied core (hatched core). The other processing is placed as threads, without particular restriction, on the cores for the divided threads of normal processing (unhatched cores). Because core occupation processing mixes independence processing with normal processing (divided thread processing), the processing is executed as a pipeline of divided stages, as shown in the lower right of FIG. 6.

Because this pipelined processing executes each divided thread independently, one after another, the processing can run in parallel and the processing speed improves. It can be implemented, for example, using the method disclosed in a known example (Matt Welsh et al., "SEDA: An Architecture for Well-Conditioned, Scalable Internet Services", SOSP 2001, October 2001). In the present embodiment, pipelined processing is therefore applied when core occupation processing is executed, as shown in FIG. 6.
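A pipeline of stages connected by queues can be sketched in Python with the standard `threading` and `queue` modules. This is only a generic staged pipeline under simplifying assumptions (one thread per stage, unbounded queues); it is not the SEDA implementation, and it does not pin stages to cores. Core pinning is platform-specific (on Linux, `os.sched_setaffinity` is one option) and is omitted here.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a stage to shut down


def stage(func, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Run one pipeline stage: take an item, process it, pass it on."""
    while True:
        item = inbox.get()
        if item is _STOP:
            outbox.put(_STOP)   # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run `items` through `funcs` in order, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is _STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


results = run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])
```

The queues between stages are where the connection delay discussed for FIG. 6 would arise; merging stages removes a queue and its waiting time.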

As shown in FIG. 6, for the specifier-marked processes b0, b1, b2, b3, b4, the processing times in normal processing are 80, 90, 80, and 100, respectively. In core occupation processing, assuming that independence processing thread 0 and independence processing thread 2 are each assigned to an occupied core and that the connection delay (the waiting time for coordinating data input and output between processes) is 5, the processing times are 5, 5, 90, 5, 5, 5, and 100, respectively. The processing time on a cache hit is used as the processing time of an independence processing thread.
The throughput coefficient TH is calculated by dividing the number of cores C by the total processing time T. The number of cores C is stored in the storage unit 15.
For normal processing, the throughput coefficient is TH = C / T = (16 − 0) / (80 + 90 + 80 + 100) = 0.046. For core occupation processing, the throughput coefficient is TH = C / T = min{1 / (5 (thread 0) + 5 (connection delay)) = 0.1, 1 / (5 (thread 2) + 5 (connection delay)) = 0.1, (16 − 2) / (90 (thread 1) + 5 (connection delay) + 100 (thread 3)) = 0.072} = 0.072, where min is a function that selects the smallest value. Taking the minimum reflects that the worst bottleneck after core allocation determines the final throughput. In this example, the bottleneck occurs on the side that does not occupy cores.
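The FIG. 6 arithmetic can be reproduced with the following Python sketch. The function and parameter names are hypothetical. The connection delays are passed in explicitly rather than derived, because the text gives them only for this one example.

```python
def normal_throughput(cores: int, times) -> float:
    """TH for normal processing: number of cores / total processing time."""
    return cores / sum(times)


def occupied_throughput(cores: int, occupied, shared_times, shared_delays) -> float:
    """TH for core occupation processing: the minimum over all parts.

    occupied:      list of (hit_time, connection_delay), one per occupied core.
    shared_times:  processing times of the threads left on the shared cores.
    shared_delays: connection delays counted on the shared side.
    """
    per_occupied = [1 / (t + d) for t, d in occupied]
    shared_cores = cores - len(occupied)
    shared = shared_cores / (sum(shared_times) + sum(shared_delays))
    return min(per_occupied + [shared])


th_normal = normal_throughput(16, [80, 90, 80, 100])
th_occupied = occupied_throughput(
    16,
    occupied=[(5, 5), (5, 5)],   # threads 0 and 2, each with delay 5
    shared_times=[90, 100],      # threads 1 and 3
    shared_delays=[5],
)
```

Rounded to three decimals, `th_normal` is 0.046 and `th_occupied` is 0.072, and the minimum comes from the shared-core term, consistent with the bottleneck being on the side that does not occupy cores.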

Returning to FIG. 4, in step S411 the analysis unit 14 determines whether the throughput coefficient TH of core occupation processing is greater than the throughput coefficient TH of normal processing.
If the throughput coefficient TH of core occupation processing is greater than that of normal processing (Yes in step S411), then in step S412 the analysis unit 14 stores the core allocation information described above in the storage unit 15 so that core occupation processing (high-speed processing) is performed.
If the throughput coefficient TH of core occupation processing is equal to or less than that of normal processing (No in step S411), then in step S413 the analysis unit 14 deletes the core allocation information from the storage unit 15 so that normal processing is performed. If no core allocation information is stored in the storage unit 15 at step S413, the analysis unit 14 does not perform the deletion.
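Steps S411 to S413 reduce to one comparison followed by a store or a delete. In the Python sketch below, a plain dictionary stands in for the storage unit 15, and the key name `core_allocation` and the allocation format are hypothetical.

```python
def update_core_allocation(storage: dict, th_occupied: float,
                           th_normal: float, allocation: dict) -> dict:
    """Store the allocation if core occupation wins (S412), else delete it (S413)."""
    if th_occupied > th_normal:
        storage["core_allocation"] = allocation
    else:
        # pop with a default does nothing when no allocation is stored,
        # mirroring the note that no deletion is performed in that case
        storage.pop("core_allocation", None)
    return storage


storage = {}
update_core_allocation(storage, 0.072, 0.046, {"thread0": 0, "thread2": 1})
stored_after_win = dict(storage)

update_core_allocation(storage, 0.040, 0.046, {"thread0": 0})
stored_after_loss = dict(storage)
```

The execution unit would then switch to core occupation processing only when an allocation is present, as described for step S216.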

As described above, the program execution device 10 of the present embodiment includes the analysis unit 14, which determines independence processing in which a thread is assigned in a fixed manner to a specific core (occupied core) and executed there so as to improve the processing speed of the processor. Specifically, in step S403 of FIG. 4, the analysis unit 14 calculates the cache hit coefficient P, compares it with a threshold, and narrows down the candidates for independence processing. In step S406 of FIG. 4, the analysis unit 14 calculates the independence determination value F, compares it with a threshold, and confirms the independence processing from among the candidates. In step S411 of FIG. 4, the analysis unit 14 calculates the throughput coefficient TH and, if the throughput coefficient TH of core occupation processing is greater than that of normal processing, creates core allocation information for executing core occupation processing. The execution unit 12 then executes core allocation processing (high-speed processing) based on that core allocation information. The program execution device 10 of the present embodiment can therefore improve the processing speed. In addition, because pipelined processing executes each divided thread independently, one after another, the processing can run in parallel and the processing speed can be improved further.

An example in which the program execution device 10 is applied to call processing will now be described with reference to FIG. 7.
As shown in FIG. 7, a server or router 50 is arranged in the network 20. The server or router 50 receives a connection request from the terminal 30A connected to the network 20, performs call processing, and transmits the connection request to the destination terminal 30B.
In FIG. 7, the server or router 50 includes a load distribution device 40 and two or more program execution devices 10, and the program execution devices 10 execute call processing in parallel with one another. The management terminal 31, operated by the administrator of the network 20, is used to deploy a program to the program execution devices 10. Each program execution device 10 then executes the normal processing and high-speed processing described above on the deployed program.

Conventional call processing programs generally contain few repetitive operations and are written for sequential processing. The processing speed of call processing therefore varied greatly depending on whether a thread hit the cache. In contrast, applying a program deployed by an administrator or the like to the program execution device 10 makes stable high-speed processing possible. In addition, by assigning each of two or more program execution devices 10 its own portion of the processing and running them in parallel as a pipeline across the program execution devices 10, the processing can be distributed and the degree of parallelism increased, which improves the processing speed.

DESCRIPTION OF SYMBOLS
10 Program execution device
12 Execution unit
14 Analysis unit
15 Storage unit
E Data size
F Independence determination value
H Processing time on a cache hit
K Number of different addresses
L Array length
M Processing time during normal processing
P Cache hit coefficient
TH Throughput coefficient
Th0 Threshold (second threshold)
Th1 Threshold

Claims

A program execution device that divides a program into threads, assigns each of the threads to a core that constitutes a CPU (Central Processing Unit), and executes the program,
a storage unit that stores, in association with each other, a processing start time of a variable described in the program and cache hit information indicating that the processing of the variable resulted in a cache hit, both collected at a predetermined period during execution of normal processing in which an arbitrary thread is assigned to an arbitrary core to execute the processing of the program, and that further stores the number of cores of the CPU and a threshold used to determine the effect of a cache hit;
an analysis unit that
(1) reads the processing start time of a first variable among the variables and the processing start time of a second variable processed next after the first variable, calculates the difference between them, calculates the average of the differences over the number of times collected, and sets the calculated average as a first processing time,
(2) reads, among the variables, the processing start time of the first variable associated with the cache hit information and the processing start time of the second variable processed next after the first variable, calculates the difference between them, calculates the average of the differences over the number of times collected, and sets the calculated average as a second processing time,
(3) divides the first processing time by the second processing time to calculate an independence determination value,
(4) determines whether the independence determination value is greater than the threshold, the threshold indicating the number of threads used for the processing of the first variable, sets the processing time of the first variable to the second processing time if the independence determination value is greater, and sets the processing time of the first variable to the first processing time otherwise,
(5) executes the processes (1) to (4) for the variables of the program,
(6) divides the number of cores of the CPU by the total of the first processing times of the variables of the program during execution of the normal processing, and sets the resulting value as a first throughput coefficient,
(7) calculates, as the throughput obtained when one core of the CPU is assigned to execute the processing of one variable whose processing time is the second processing time, a first division value whose numerator is 1, representing one of the cores, and whose denominator is the sum of the second processing time and the waiting time from the processing of that variable to the next processing; calculates a second division value whose numerator is the number of cores of the CPU minus the number of variables whose processing time is the second processing time, and whose denominator is the total of the first processing times of those variables of the program whose processing time did not become the second processing time and the waiting times from the processing of those variables to the next processing; and sets the smallest of the first division value and the second division value as a second throughput coefficient, and
(8) generates, when the second throughput coefficient is greater than the first throughput coefficient, core allocation information indicating that a thread that processes the variable whose processing time is the second processing time is to be assigned to the fixed core; and
an execution unit that, based on the core allocation information generated by the analysis unit, assigns the thread that processes the variable whose processing time is the second processing time to the fixed core and executes the program.

The program execution device according to claim 1, wherein
the storage unit further stores the cache size of the core of the CPU; the array length of the variable, the address of the memory in which the variable is stored, and the data size, each collected at a predetermined period during execution of the normal processing; and a second threshold for a cache hit rate,
the analysis unit further comprises:
calculation means for calculating, for the variable, the cache hit rate using as parameters the number of different addresses, obtained by reading the addresses from the storage unit and counting the different addresses, and the cache size, the array length, and the data size read from the storage unit, the cache hit rate taking a larger value as the number of different addresses decreases, as the array length decreases, as the data size decreases, and as the cache size increases; and
determination means for determining whether the cache hit rate is greater than the second threshold stored in the storage unit, and
when the determination means determines that the cache hit rate is greater than the second threshold, the analysis unit executes the processes (1) to (8) with the variable as the first variable used in the processes (1) and (2).

A program execution method used in a program execution device that divides a program into threads, assigns each of the threads to a core that constitutes a CPU, and executes the program,
The program execution device includes:
A process start time of a variable described in the program, which is collected at a predetermined period during execution of a normal process for assigning an arbitrary thread to an arbitrary core and executing the process of the program, and processing of the variable Cache hit information indicating that a cache hit has been associated and stored, a storage unit storing the number of cores of the CPU, and a threshold value used for determining the effect of the cache hit, an analysis unit, An execution unit, and
The analysis unit
(1) The number of times the processing start time of the first variable of the variables and the processing start time of the second variable processed next to the first variable are read to calculate the difference and collected Calculating an average value of the difference in minutes, and setting the calculated average value as a first processing time;
(2) Among the variables, the processing start time of the first variable associated with the cache hit information and the processing start time of the second variable processed next to the first variable are read. Calculating a difference, calculating an average value of the difference for the collected number of times, and setting the calculated average value as a second processing time;
(3) calculating an independence determination value by dividing the first processing time by the second processing time;
(4) determining whether or not the independence determination value is larger than the threshold, which indicates the number of threads used for the processing of the first variable, and, when it is determined to be larger, setting the processing time of the first variable to the second processing time, and, when it is determined not to be larger, setting the processing time of the first variable to the first processing time;
(5) executing the processes of (1) to (4) for the variables of the program;
(6) dividing the number of cores of the CPU by a total value obtained by summing up the first processing times for the variables of the program at the time of execution of the normal processing, and setting the value as a first throughput coefficient;
(7) calculating a first division value, which is the throughput when one core of the CPU is assigned to execute the processing of one variable having the second processing time as its processing time, with 1 out of the number of cores as the numerator and the sum of the second processing time and a waiting time from the processing of the variable until its next processing as the denominator; calculating a second division value whose numerator is the value obtained by subtracting, from the number of cores of the CPU, the number of variables having the second processing time as their processing time, and whose denominator is the total of the first processing times of those variables of the program that did not take the second processing time and the waiting times from the processing of those variables until their next processing; and setting the smallest value among the first division value and the second division value as a second throughput coefficient;
(8) generating, when the second throughput coefficient is larger than the first throughput coefficient, core allocation information indicating that a thread for processing the variable having the second processing time as its processing time is to be assigned to a fixed core,
executes the steps (1) to (8) above, and
the execution unit
executes a step of allocating, based on the core allocation information generated by the analysis unit, a thread for processing the variable having the second processing time as its processing time to the fixed core and executing the program. A program execution method characterized by the above.
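Steps (1) to (8) of the method above can be sketched as follows. This is an illustrative reading of the claim, not the patented implementation: the sample layout `(start, next_start, cache_hit)`, the per-variable `waits` table, the fallback used when a variable has no cache-hit samples, and all function names are assumptions.

```python
from statistics import mean


def processing_times(samples):
    """Steps (1)-(2): samples are (start, next_start, cache_hit) tuples."""
    diffs = [nxt - start for start, nxt, _ in samples]
    hit_diffs = [nxt - start for start, nxt, hit in samples if hit]
    t1 = mean(diffs)  # first processing time: average over all samples
    # Second processing time: average over cache-hit samples only.
    # Assumption: fall back to t1 when no cache hit was recorded.
    t2 = mean(hit_diffs) if hit_diffs else t1
    return t1, t2


def classify(samples_by_var, thresholds):
    """Steps (3)-(5): map each variable to (t1, t2, occupy_core)."""
    result = {}
    for var, samples in samples_by_var.items():
        t1, t2 = processing_times(samples)
        f = t1 / t2  # independence determination value
        result[var] = (t1, t2, f > thresholds[var])
    return result


def plan_core_allocation(samples_by_var, thresholds, waits, num_cores):
    """Steps (6)-(8): return variables to pin to fixed cores, or []."""
    info = classify(samples_by_var, thresholds)
    # Step (6): first throughput coefficient for normal processing.
    th1 = num_cores / sum(t1 for t1, _, _ in info.values())
    pinned = [v for v, (_, _, occupy) in info.items() if occupy]
    others = [v for v in info if v not in pinned]
    # Step (7): one first division value per pinned variable.
    candidates = [1 / (info[v][1] + waits[v]) for v in pinned]
    if others:
        # Second division value: remaining cores over remaining work.
        free_cores = num_cores - len(pinned)
        total = sum(info[v][0] + waits[v] for v in others)
        candidates.append(free_cores / total)
    th2 = min(candidates)  # second throughput coefficient
    # Step (8): pin only when core occupation beats normal processing.
    return pinned if pinned and th2 > th1 else []
```

A caller would pass the collected samples for each variable, the thread-count thresholds, the waiting times, and the core count; a non-empty result plays the role of the core allocation information handed to the execution unit.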

The program execution device includes:
The storage unit, which further stores the cache size of the cores of the CPU, the array length of the variable, the memory addresses and the data size collected at a predetermined period during execution of the normal processing, and a second threshold for the cache hit rate,
The analysis unit
A calculation step of calculating, for the variable, a cache hit rate using as parameters the number of distinct addresses, obtained by reading the addresses from the storage unit and counting the distinct ones, together with the cache size, the array length, and the data size read from the storage unit, the cache hit rate taking a larger value as the number of distinct addresses decreases, the array length decreases, the data size decreases, and the cache size increases;
A determination step of determining whether or not the cache hit rate is greater than the second threshold stored in the storage unit;
A step of, when it is determined in the determination step that the cache hit rate is greater than the second threshold, setting the variable as the first variable used in the processes (1) and (2) and executing the process of (8), are further executed. The program execution method according to claim 3, characterized by the above.
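The narrowing-down performed by the calculation and determination steps above might look like the following sketch. The record layout `(variable_name, address)` and the `rate_fn` callback are assumptions; in particular, `rate_fn` here takes only the distinct-address count, a simplification of the four-parameter cache hit rate recited in the claim.

```python
from collections import defaultdict


def distinct_address_counts(records):
    """Count distinct addresses per variable.

    records: iterable of (variable_name, address) pairs, assumed to have
    been sampled during normal processing.
    """
    seen = defaultdict(set)
    for var, addr in records:
        seen[var].add(addr)
    return {var: len(addrs) for var, addrs in seen.items()}


def narrow_candidates(records, rate_fn, second_threshold):
    """Return, sorted by name, the variables whose rate exceeds the threshold.

    rate_fn is a hypothetical callback mapping a distinct-address count
    to a cache hit rate.
    """
    counts = distinct_address_counts(records)
    return sorted(v for v, n in counts.items() if rate_fn(n) > second_threshold)
```

Only the variables returned here would go on to be treated as the first variable in processes (1) and (2); the rest stay under normal processing.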