JP7795103B2

JP7795103B2 - Program, data processing device and data processing method

Info

Publication number: JP7795103B2
Application number: JP2022092512A
Authority: JP
Inventors: ムハマドバーゲルベイク; アリシェイクホレスラミ
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2021-06-18
Filing date: 2022-06-07
Publication date: 2026-01-07
Anticipated expiration: 2042-06-07
Also published as: JP2023001055A

Description

本発明は、プログラム、データ処理装置及びデータ処理方法に関する。 The present invention relates to a program, a data processing device, and a data processing method.

割当問題は、配車ルーティングやＦＰＧＡ（Field-Programmable Gate Array）のブロック配置など、様々な実世界のアプリケーションをもつＮＰ困難な組合せ最適化問題のクラスである。割当問題の例として、２次割当問題（ＱＡＰ：Quadratic Assignment Problem）がある（たとえば、非特許文献１参照）。 Assignment problems are a class of NP-hard combinatorial optimization problems with a variety of real-world applications, such as vehicle routing and FPGA (Field-Programmable Gate Array) block placement. An example of an assignment problem is the quadratic assignment problem (QAP) (see, for example, Non-Patent Document 1).

２次割当問題は、ｎ個の要素（施設など）をｎ個の割当先に割り当てる際に、要素間のフロー量（施設間での物資の輸送量などのコスト）と、各要素が割り当てられる割当先間の距離との積の総和を最小化するような割当を求める問題である。すなわち、２次割当問題は、以下の式（１）で表される評価関数の値を最小化するような割当を探索する問題である。評価関数は割当状態に応じたコストを表し、コスト関数などとも呼ばれる。 The quadratic assignment problem is a problem of finding an assignment that minimizes the sum of the products of the flow volume between elements (costs such as transporting materials between facilities) and the distances between the destinations to which each element is assigned when allocating n elements (facilities, etc.) to n destinations. In other words, the quadratic assignment problem is a problem of finding an assignment that minimizes the value of the evaluation function expressed in equation (1) below. The evaluation function represents the cost according to the allocation state and is also called a cost function.

式（１）において、ｆ_ｉ，ｊは、識別番号＝ｉ，ｊの要素間のフロー量、ｄ_{φ（ｉ），φ（ｊ）}は、識別番号＝ｉ，ｊの要素が割り当てられている割当先間の距離を示す。
ところで、ノイマン型コンピュータが不得意とする大規模な離散最適化問題を計算する装置として、イジング型の評価関数を用いたボルツマンマシン（イジング装置とも呼ばれる）がある。ボルツマンマシンは、回帰型ニューラルネットワークの一種である。 In equation (1), f _i,j indicates the flow amount between elements with identification numbers = i and j, and d _φ(i),φ(j) indicates the distance between the assignment destinations to which the elements with identification numbers = i and j are assigned.
Meanwhile, there is a Boltzmann machine (also called an Ising machine) that uses an Ising evaluation function and is a type of recurrent neural network that can solve large-scale discrete optimization problems, which von Neumann computers are not good at.

ボルツマンマシンは、組合せ最適化問題を磁性体のスピンの振る舞いを表すイジングモデルに変換する。そして、ボルツマンマシンは、疑似焼き鈍し法やパラレルテンパリング（たとえば、非特許文献２参照）などのマルコフ連鎖モンテカルロ法により、イジング型の評価関数の値が最小となるイジングモデルの状態の探索を行う（たとえば、非特許文献３参照）。評価関数の値は、エネルギーに相当する。なお、ボルツマンマシンは、評価関数の符号を変えれば、評価関数の値が極大になる状態を探索することもできる。イジングモデルの状態は、複数の状態変数の値（ニューロン値とも呼ばれる）の組合せにより表現できる。各状態変数の値として、０または１を用いることができる。 A Boltzmann machine converts a combinatorial optimization problem into an Ising model that represents the spin behavior of a magnetic material. The Boltzmann machine then searches for the state of the Ising model that minimizes the value of an Ising-type evaluation function (see, for example, Non-Patent Document 3) using Markov chain Monte Carlo methods such as simulated annealing and parallel tempering (see, for example, Non-Patent Document 2). The value of the evaluation function corresponds to energy. Note that by changing the sign of the evaluation function, the Boltzmann machine can also search for the state where the value of the evaluation function is maximized. The state of the Ising model can be expressed by a combination of the values of multiple state variables (also called neuron values). The value of each state variable can be either 0 or 1.

イジング型の評価関数は、たとえば、以下の式（２）で定義される。 The Ising-type evaluation function is defined, for example, by the following equation (2):

右辺の１項目は、イジングモデルの全状態変数の全組合せについて、漏れと重複なく、Ｎ個の状態変数から選ばれる２つの状態変数の値（０または１）と重み値（２つの状態変数の間の相互作用の強さを表す）との積を積算したものである。ｓ_ｉは、識別番号がｉの状態変数、ｓ_ｊは、識別番号がｊの状態変数であり、ｗ_ｉ，ｊは、識別番号がｉとｊの状態変数間の相互作用の大きさを示す重み値である。右辺の２項目は、各識別番号についてのバイアス係数と状態変数との積の総和を求めたものである。ｂ_ｉは、識別番号＝ｉについてのバイアス係数を示している。 The first item on the right-hand side is the sum of the products of the values (0 or 1) of two state variables selected from the N state variables, without omission or duplication, and the weight value (representing the strength of interaction between the two state variables), for all combinations of all state variables of the Ising model. s _i is the state variable with identification number i, s _j is the state variable with identification number j, and wi _,j is a weight value indicating the magnitude of interaction between the state variables with identification numbers i and j. The two items on the right-hand side are the sum of the products of the bias coefficient and the state variable for each identification number. _{b i} indicates the bias coefficient for identification number = i.

また、ｓ_ｉの値の変化に伴うエネルギーの変化量（ΔＥ_ｉ）は、以下の式（３）で表される。 The amount of change in energy (ΔE _i ) associated with a change in the value of s _i is expressed by the following equation (3).

式（３）において、ｓ_ｉが１から０に変化するとき、Δｓ_ｉは－１となり、ｓ_ｉが０から１に変化するとき、Δｓ_ｉは１となる。なお、ｈ_ｉは局所場と呼ばれ、Δｓ_ｉに応じてｈ_ｉに符号（＋１または－１）を乗じたものがΔＥ_ｉとなる。 In equation (3), when s _i changes from 1 to 0, Δs _i becomes -1, and when s _i changes from 0 to 1, Δs _i becomes 1. Note that h _i is called a local field, and ΔE _i is obtained by multiplying h _i by the sign (+1 or -1) according to Δs _i .

そして、たとえば、ΔＥ_ｉが、乱数と温度パラメータの値に基づいて得られるノイズ値（熱ノイズとも呼ばれる）より小さい場合には、ｓ_ｉの値を更新することで状態遷移を発生させ、局所場も更新する、という処理が繰り返される。 Then, for example, if ΔE _i is smaller than a noise value (also called thermal noise) obtained based on the random number and the temperature parameter value, the value of s _i is updated to cause a state transition, and the local field is also updated, and this process is repeated.

このようなボルツマンマシンを用いて２次割当問題を計算する技術が提案されている（たとえば、特許文献１、非特許文献４参照）。
２次割当問題のイジング型の評価関数は、以下の式（４）で表せる。 Techniques have been proposed for solving quadratic assignment problems using such Boltzmann machines (see, for example, Patent Document 1 and Non-Patent Document 4).
The Ising-type evaluation function for the quadratic assignment problem can be expressed by the following equation (4).

式（４）において、ｘは状態変数をベクトル化したものであり、ｎ個の要素のｎ個の割当先への割当状態を表す。ｘ^Ｔは、（ｘ_１，１，…，ｘ_１，ｎ，ｘ_２，１，…，ｘ_２，ｎ，……，ｘ_ｎ，１，…，ｘ_ｎ，ｎ）と表せる。ｘ_ｉ，ｊ＝１は、識別番号＝ｉの要素が、識別番号＝ｊの割当先に割り当てられていることを示し、ｘ_ｉ，ｊ＝０は、識別番号＝ｉの要素が、識別番号＝ｊの割当先に割り当てられていないことを示す。 In equation (4), x is a vector of state variables and represents the allocation state of n elements to n allocation destinations. ^xT can be expressed as ( _x1,1 , ..., x1 _,n , _x2,1 , ..., _x2,n , ..., xn _,1 , ..., xn _,n ). _xi,j = 1 indicates that the element with identification number = i is allocated to the allocation destination with identification number = j, and xi _,j = 0 indicates that the element with identification number = i is not allocated to the allocation destination with identification number = j.

Ｗは、重み値の行列であり、前述のフロー量（ｆ_ｉ，ｊ）と、ｎ個の割当先間の距離の行列Ｄを用いて、以下の式（５）で表せる。 W is a matrix of weight values, and can be expressed by the following equation (5) using the flow amount (f _i,j ) and the matrix D of the distances between n destinations.

米国特許出願公開第２０２１／０３２６６７９号明細書US Patent Application Publication No. 2021/0326679

Eugene L. Lawler, “The quadratic assignment problem”, Management Science, Vol.9, No.4 pp.586-599, July 1963Eugene L. Lawler, “The quadratic assignment problem”, Management Science, Vol.9, No.4 pp.586-599, July 1963 Robert H. Swendsen and Jian-Sheng Wang, ”Replica monte carlo simulation of spin-glasses”, Physical Review Letters, Vol.57, No.21, pp.2607-2609, November 1986Robert H. Swendsen and Jian-Sheng Wang, “Replica monte carlo simulation of spin-glasses”, Physical Review Letters, Vol.57, No.21, pp.2607-2609, November 1986 K. Dabiri, et al., “Replica Exchange MCMC Hardware With Automatic Temperature Selection and Parallel Trial”, IEEE Transactions on Parallel and Distributed Systems, Vol.31, No.7, pp.1681-1692, July 2020K. Dabiri, et al., “Replica Exchange MCMC Hardware With Automatic Temperature Selection and Parallel Trial”, IEEE Transactions on Parallel and Distributed Systems, Vol.31, No.7, pp.1681-1692, July 2020 M.Bagherbeik et al., ”A permutational boltzmann machine with parallel tempering for solving combinatorial optimization problems”, In International Conference on Parallel Problem Solving from Nature, pp.317-331, Springer, 2020M. Bagherbeik et al., “A permutational boltzmann machine with parallel tempering for solving combinatorial optimization problems”, In International Conference on Parallel Problem Solving from Nature, pp.317-331, Springer, 2020 Rainer E Burkard, Stefan E Karisch, and Franz Rendl, “Qaplib-a quadratic assignment problem library”, Journal of Global optimization, 10(4), pp.391-403, 1997Rainer E Burkard, Stefan E Karisch, and Franz Rendl, “Qaplib-a quadratic assignment problem library”, Journal of Global optimization, 10(4), pp.391-403, 1997 Gintaras Palubeckis, “An algorithm for construction of test cases for the quadratic assignment problem”, Informatica, Lith. Acad. Sci., Vol.11, No.3, pp.281-296, 2000Gintaras Palubeckis, “An algorithm for construction of test cases for the quadratic assignment problem”, Informatica, Lith. Acad. Sci., Vol.11, No.3, pp.281-296, 2000 Zvi Drezner, Peter M Hahn, and Eric D Taillard, “Recent advances for the quadratic assignment problem with special emphasis on instances that are difficult for meta-heuristic methods”, Annals of Operations research, 139(1), pp.65-94, 2005Zvi Drezner, Peter M Hahn, and Eric D Taillard, “Recent advances for the quadratic assignment problem with special emphasis on instances that are difficult for meta-heuristic methods”, Annals of Operations research, 139(1), pp.65-94, 2005 Allyson Silva, Leandro C. Coelho, and Maryam Darvish, “Quadratic assignment problem variants: A survey and an effective parallel memetic iterated tabu search”, European Journal of Operational Research, 2020Allyson Silva, Leandro C. Coelho, and Maryam Darvish, “Quadratic assignment problem variants: A survey and an effective parallel memetic iterated tabu search”, European Journal of Operational Research, 2020 Danny Munera, Daniel Diaz, and Salvador Abreu, “Hybridization as cooperative parallelism for the quadratic assignment problem”, In Hybrid Metaheuristics, pp. 47-61, Springer International Publishing Switzerland, 2016Danny Munera, Daniel Diaz, and Salvador Abreu, “Hybridization as cooperative parallelism for the quadratic assignment problem”, In Hybrid Metaheuristics, pp. 47-61, Springer International Publishing Switzerland, 2016 Kresimir Mihic, Kevin Ryan, and Alan Wood, “Randomized decomposition solver with the quadratic assignment problem as a case study”, INFORMS Journal on Computing, Vol.30, No.2, pp.295-308, 2018Kresimir Mihic, Kevin Ryan, and Alan Wood, “Randomized decomposition solver with the quadratic assignment problem as a case study”, INFORMS Journal on Computing, Vol.30, No.2, pp.295-308, 2018

上記のような評価関数を用いて２次割当問題を計算する手法は、エネルギーの変化量の計算や、局所場の更新などの処理が多数回繰り返され、メモリアクセスも多数回繰り返されるため計算に時間がかかる。 Methods of calculating quadratic assignment problems using evaluation functions like those above require repeated calculations of energy changes and local field updates, as well as repeated memory accesses, which takes a long time to calculate.

１つの側面では、本発明は、割当問題を高速に計算できるプログラム、データ処理装置及びデータ処理方法を提供することを目的とする。 In one aspect, the present invention aims to provide a program, data processing device, and data processing method that can quickly calculate assignment problems.

１つの実施態様では、割当問題の解を、割当状態に応じたコストを表す評価関数を用いた局所探索により探索する処理をコンピュータに実行させるプログラムであって、メモリに記憶された、複数の割当先へ割り当てられる複数の要素間のフロー量を表すフロー行列と、前記複数の割当先間の距離を表す距離行列と、に基づいて、前記複数の要素のうち、第１の要素と第２の要素の割当先が入れ替わる第１の割当変更が生じた場合の、前記評価関数の第１の変化量を、ベクトル算術演算を用いて計算し、前記第１の変化量に基づいて、前記第１の割当変更を許容するか否かを判定し、前記第１の割当変更を許容すると判定した場合、前記割当状態を更新するとともに、前記距離行列において、前記第１の要素と前記第２の要素に対応する２列または２行を入れ替えるように更新する、処理を前記コンピュータに実行させるプログラムが提供される。 In one embodiment, a program is provided that causes a computer to execute a process of searching for a solution to an assignment problem through local search using an evaluation function that represents costs according to the assignment state. The program calculates, using vector arithmetic operations, a first change in the evaluation function when a first assignment change occurs in which the assignment destinations of a first element and a second element among the multiple elements are swapped, based on a flow matrix that represents the flow volume between multiple elements assigned to multiple assignment destinations and a distance matrix that represents the distances between the multiple assignment destinations, both stored in memory; determines whether to allow the first assignment change based on the first change; and, if it is determined that the first assignment change is allowable, updates the assignment state and updates the distance matrix so that two columns or two rows corresponding to the first element and the second element are swapped.

また、１つの実施態様では、データ処理装置が提供される。
また、１つの実施態様では、データ処理方法が提供される。 Also, in one embodiment, a data processing apparatus is provided.
Also, in one embodiment, a data processing method is provided.

１つの側面では、本発明は、割当問題を高速に計算できる。 In one aspect, the present invention allows for fast computation of assignment problems.

ＱＡＰの計算例を示す図である。FIG. 10 is a diagram illustrating an example of calculating QAP. ＱＡＰの計算時における距離行列の並べ替え例とデータ処理装置の例を示す図である。10A and 10B are diagrams illustrating an example of rearrangement of a distance matrix when calculating a QAP and an example of a data processing device. キャッシュ行列の更新計算の例を示す図である。FIG. 10 is a diagram illustrating an example of update calculation of a cache matrix. パラレルテンパリングを実行するソルバーシステムの例を示す図である。FIG. 1 illustrates an example of a solver system that performs parallel tempering. ＱＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。FIG. 10 is a diagram illustrating an example of an algorithm for searching for a solution to a QAP by local search using parallel tempering. パラレルテンパリングによる局所探索の全体の処理の流れを示すフローチャートである。10 is a flowchart showing the overall processing flow of local search by parallel tempering. ＱＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。10 is a flowchart illustrating an example of a process for initializing a replica in the case of a QAP. ＱＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。10 is a flowchart illustrating an example of a replica search process in the case of a QAP. ＱＳＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。FIG. 10 is a diagram illustrating an example of an algorithm for searching for a solution to QSAP by local search using parallel tempering. ＱＳＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。10 is a flowchart illustrating an example of the flow of a replica initialization process in the case of QSAP. ＱＳＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。10 is a flowchart illustrating an example of a replica search process in the case of QSAP. ＶΔＣ法と比較したＳＡＭ法、ＢＭ＄法による計算の高速化の度合いの評価結果を示す図である。FIG. 10 is a diagram showing the evaluation results of the degree of speedup of calculations by the SAM method and the BM$ method compared to the VΔC method. スカラー型の演算処理に対するベクトル型の演算処理の高速化の度合いの評価結果を示す図である。FIG. 10 is a diagram showing the results of an evaluation of the degree of speedup of vector-type arithmetic processing relative to scalar-type arithmetic processing. 負荷分散の一例を示す図である。FIG. 10 illustrates an example of load balancing. 負荷分散による演算処理の高速化の度合いの評価結果を示す図である。FIG. 10 is a diagram showing the evaluation results of the degree of speedup of calculation processing by load distribution. 測定アルゴリズムの一例を示す図である。FIG. 10 is a diagram illustrating an example of a measurement algorithm. ＶΔＣ法、ＳＡＭ法、ＢＭ＄法について測定された相対的な高速化の度合いと、問題サイズに応じて占有されるメモリ階層を示す図である。FIG. 1 shows the relative speedup measured for the V.DELTA.C, SAM, and BM$ methods, and the memory hierarchy occupied depending on the problem size. ΔＣ生成回路の一例を示す図である。FIG. 10 is a diagram illustrating an example of a ΔC generation circuit. 列の入れ替えを行うハードウェア構成の第１の例を示す図である。FIG. 10 illustrates a first example of a hardware configuration for performing column swapping. 列の入れ替え例を示す図である。FIG. 10 is a diagram illustrating an example of column interchange. 列の入れ替えを行うハードウェア構成の第２の例を示す図である。FIG. 10 illustrates a second example of a hardware configuration for performing column swapping. 列の入れ替えを行うハードウェア構成の第２の例の１つ目の変形例を示す図である。FIG. 10 is a diagram illustrating a first modified example of the second example of the hardware configuration for performing column interchange. 列の入れ替えを行うハードウェア構成の第２の例の２つ目の変形例を示す図である。FIG. 10 is a diagram illustrating a second modification of the second example of the hardware configuration for performing column interchange. 列の入れ替えを行うハードウェア構成の第３の例を示す図である。FIG. 10 illustrates a third example of a hardware configuration for performing column swapping. 列の入れ替えを行うハードウェア構成の第３の例の変形例を示す図である。FIG. 10 is a diagram illustrating a modification of the third example of the hardware configuration for performing column interchange. ΔＣ生成回路の他の例を示す図である。FIG. 10 is a diagram illustrating another example of a ΔC generation circuit. ２つのレプリカについての処理を行うハードウェア構成の例を示す図である。FIG. 10 illustrates an example of a hardware configuration for performing processing on two replicas. ΔＣ生成回路の他の例を示す図である。FIG. 10 is a diagram illustrating another example of a ΔC generation circuit. レプリカ処理回路の一例を示す図である。FIG. 10 is a diagram illustrating an example of a replica processing circuit. 非対称行列を用いたＱＡＰの計算で用いられるレプリカ処理回路の一例を示す図である。FIG. 10 is a diagram illustrating an example of a replica processing circuit used in calculating a QAP using an asymmetric matrix. データ処理装置の一例であるコンピュータのハードウェア例を示す図である。FIG. 1 is a diagram illustrating an example of hardware of a computer that is an example of a data processing device.

以下、発明を実施するための形態を、図面を参照しつつ説明する。
本実施の形態のデータ処理装置は、割当問題の例として２次割当問題（ＱＡＰ）または２次半割当問題（ＱＳＡＰ：Quadratic Semi-Assignment Problem）の解を、局所探索により探索する。以下、ＱＡＰ、ＱＳＡＰ及び局所探索について説明する。 Hereinafter, embodiments of the invention will be described with reference to the drawings.
The data processing device of this embodiment searches for a solution to a quadratic assignment problem (QAP) or a quadratic semi-assignment problem (QSAP) by local search, as an example of an assignment problem. QAP, QSAP, and local search will be described below.

（ＱＡＰ）
図１は、ＱＡＰの計算例を示す図である。ＱＡＰは、ｎ個の要素（施設など）をｎ個の割当先に割り当てる際に、要素間のフロー量と、各要素が割り当てられる割当先間の距離との積の総和を最小化するような割当を求める問題である。 (QAP)
1 shows an example of how to calculate QAP. QAP is a problem of finding an allocation that minimizes the sum of the products of the flow rates between elements and the distances between the elements when allocating n elements (such as facilities) to n destinations.

図１には、１～４の識別番号が付された４つの施設を４箇所の割当先（Ｌ_１～Ｌ_４）に割り当てる例が示されている。
ｎ個の要素間のフロー量を表すフロー行列は、以下の式（６）で表される。 FIG. 1 shows an example in which four facilities, each having identification numbers 1 to 4, are assigned to four locations (L ₁ to L ₄ ).
A flow matrix representing the amount of flow between n elements is expressed by the following equation (6).

フロー行列（Ｆ）は、ｎ行ｎ列の行列である。ｆ_ｉ，ｊはｉ行ｊ列のフロー量であり、識別番号＝ｉ，ｊの要素間のフロー量を表す。たとえば、図１の識別番号＝１の施設と識別番号＝２の施設との間のフロー量は、ｆ_１，２と表せる。 The flow matrix (F) is a matrix with n rows and n columns. _{f i,j} is the flow amount in row i and column j, and represents the flow amount between elements with identification numbers = i, j. For example, the flow amount between a facility with identification number = 1 and a facility with identification number = 2 in Figure 1 can be expressed as f _1,2 .

ｎ箇所の割当先間の距離を表す距離行列は、以下の式（７）で表される。 The distance matrix representing the distance between n allocation destinations is expressed by the following equation (7).

距離行列（Ｄ）は、ｎ行ｎ列の行列である。ｄ_ｋ，ｌはｋ行ｌ列の距離であり、識別番号＝ｋ，ｌの割当先間の距離を表す。たとえば、図１の識別番号＝１の割当先（Ｌ_１）と識別番号＝２の割当先（Ｌ_２）との間の距離は、ｄ_１，２と表せる。 The distance matrix (D) is a matrix with n rows and n columns. _{d k,l} is the distance in the kth row and lth column, and represents the distance between the assignees of identification numbers = k,l. For example, the distance between the assignee (L ₁ ) of identification number = 1 and the assignee (L ₂ ) of identification number = 2 in FIG. 1 can be expressed as d _1,2 .

ＱＡＰの計算は、前述の式（１）を最小化するような割当を探索することで行われる。ｎ個の要素のｎ箇所の割当先への割当状態は、整数割当ベクトルφまたはバイナリ状態行列Ｘで表される。φは集合Φ_ｎの要素である。Φ_ｎは、セットＮ＝｛１，２，３，…，ｎ｝の全ての順列のセットである。バイナリ状態行列Ｘに含まれるバイナリ変数であるｘ_ｉ，ｊは、以下の式（８）で表される。 The QAP is calculated by searching for an assignment that minimizes the above-mentioned equation (1). The assignment state of n elements to n assignment destinations is represented by an integer assignment vector φ or a binary state matrix X. φ is an element of the set Φ _n . Φ _n is a set of all permutations of the set N = {1, 2, 3, ..., n}. The binary variables x _i,j included in the binary state matrix X are expressed by the following equation (8).

図１には、識別番号＝１の施設が識別番号＝２の割当先、識別番号＝２の施設が識別番号＝３の割当先、識別番号＝３の施設が識別番号＝４の割当先、識別番号＝４の施設が識別番号＝１の割当先にそれぞれ割り当てられた場合の例が示されている。整数割当ベクトルφは、φ（１）＝２、φ（２）＝３、φ（３）＝４、φ（４）＝１となっている。バイナリ状態行列Ｘは、ｘ_１，２、ｘ_２，３、ｘ_３，４、ｘ_４，１がそれぞれ１となっており、その他のｘ_ｉ，ｊは０になっている。 1 shows an example in which a facility with identification number 1 is assigned to an assignee with identification number 2, a facility with identification number 2 is assigned to an assignee with identification number 3, a facility with identification number 3 is assigned to an assignee with identification number 4, and a facility with identification number 4 is assigned to an assignee with identification number 1. The integer assignment vector φ is φ(1) = 2, φ(2) = 3, φ(3) = 4, and φ(4) = 1. In the binary state matrix X, x _1,2 , x _2,3 , x _3,4 , and x _4,1 are each 1, and the other x _i,j are 0.

ＱＡＰには、フロー行列と距離行列の一方または両方が対称行列の場合と、フロー行列と距離行列の両方が非対称行列の場合がある。本実施の形態では、主に対称行列（対角成分が０（バイアスレス））を用いたＱＡＰに焦点が当てられている。このようなＱＡＰが、インスタンスの大部分であるし、計算を単純化するためである。ただし、対称行列を用いたＱＡＰは、非対称行列を用いたＱＡＰに直接的に変換可能である。 A QAP can have either one or both of the flow and distance matrices as symmetric matrices, or both asymmetric matrices. In this embodiment, we focus primarily on QAPs that use symmetric matrices (diagonal elements are 0 (biasless)). This is because such QAPs represent the majority of instances and simplify calculations. However, a QAP that uses a symmetric matrix can be directly converted to a QAP that uses an asymmetric matrix.

（ＱＳＡＰ）
ＱＳＡＰはＱＡＰを変形したものである。ＱＳＡＰでは、要素数と割当先の数とが等しくない。たとえば、各割当先に複数の要素を割り当てることが許容される。ＱＳＡＰでは、距離行列は以下の式（９）で表される。 (QSAP)
QSAP is a modified version of QAP. In QSAP, the number of elements is not equal to the number of destinations. For example, multiple elements are allowed to be assigned to each destination. In QSAP, the distance matrix is expressed by the following equation (9):

距離行列（Ｄ）の対角要素は、割当先内でのルーティングを考慮するため非零である。ＱＳＡＰではさらに、以下の式（１０）で表される追加の行列Ｂが用いられる。 The diagonal elements of the distance matrix (D) are non-zero to take into account routing within the assigned destination. QSAP also uses an additional matrix B, expressed as equation (10) below.

ｂ_ｉ，ｋは、識別番号＝ｉの要素を識別番号＝ｋの割当先に割り当てるための一定のコストを表している。
ＱＳＡＰの計算は、ＱＡＰの評価関数である式（１）の代わりに、以下の式（１１）を最小化するような割当を探索することで行われる。 b _i,k represents a constant cost for assigning an element with identification number=i to an assignee with identification number=k.
The QSAP is calculated by searching for an allocation that minimizes the following equation (11) instead of equation (1), which is the evaluation function of the QAP.

ｎ個の要素のｍ箇所の割当先への割当状態は、整数割当ベクトルψ（ψ∈［１，ｍ］^ｎ）、またはバイナリ状態行列Ｓで表される。バイナリ状態行列Ｓに含まれるバイナリ変数であるｓ_ｉ，ｊは、以下の式（１２）で表される。 The allocation state of n elements to m allocation destinations is expressed by an integer allocation vector ψ (ψε[1, m] ⁿ ) or a binary state matrix S. The binary variables s _i,j included in the binary state matrix S are expressed by the following equation (12).

（局所探索）
局所探索では、現在の状態の変更によって到達可能な近傍状態内において候補解の探索が行われる。ＱＡＰに対して局所探索を行う方法として、要素間のペアワイズ交換がある。この手法では、２つの要素が選択され、それらの割当先が交換される。２つの要素（識別番号＝ａ，ｂで表される）の、割当先（識別番号＝φ（ａ），φ（ｂ）で表される）の交換による評価関数（式（１））の値の変化量は、以下の式（１３）で表すことができる。 (Local search)
In local search, a candidate solution is searched for within a nearby state that can be reached by changing the current state. A method for performing local search on a QAP is pairwise exchange between elements. In this method, two elements are selected and their assignment destinations are exchanged. The change in the value of the evaluation function (Equation (1)) due to the exchange of assignment destinations (represented by identification numbers = φ(a) and φ(b)) of two elements (represented by identification numbers = a and b) can be expressed by the following Equation (13).

式（１３）のように、変化量（ΔＣ_ｅｘ）は積和演算ループによって生成される。
ＱＳＡＰに対してペアワイズ交換により局所探索を行う場合、評価関数（式（１１））の値の変化量は、以下の式（１４）で表すことができる。 As shown in equation (13), the change amount (ΔC _ex ) is generated by a multiply-and-accumulate loop.
When performing a local search on a QSAP by pairwise exchange, the amount of change in the value of the evaluation function (Equation (11)) can be expressed by the following Equation (14).

式（１４）においてΔＢ_ｅｘは、各割当先への要素の割当に関する一定のコストの変化を示し、以下の式（１５）で表される。 In equation (14), ΔB _ex indicates a change in a fixed cost related to the allocation of elements to each allocation destination, and is expressed by the following equation (15).

ＱＳＡＰにおける状態空間は、割当状態を表す前述の順列のみに制約されない。ある割当先に、別の要素が割り当てられているかどうかに関係なく、要素をその割当先に再配置することができる。識別番号＝ａの要素を、現在の割当先（識別番号＝ψ（ａ））から識別番号＝ｌの割当先に、割当変更する際の評価関数の値の変化量は、以下の式（１６），（１７）で表される。 The state space in QSAP is not restricted to the permutations described above that represent the allocation state. An element can be relocated to a certain allocation destination regardless of whether another element is assigned to that allocation destination. The change in the value of the evaluation function when the element with identification number = a is reassigned from its current allocation destination (identification number = ψ(a)) to the allocation destination with identification number = l is expressed by the following equations (16) and (17).

局所探索では、上記のように計算される変化量に基づいて、その変化量の評価関数の値の変化を引き起こす割当変更の提案を受け入れるか否かが決定される。提案を受け入れるか否かは、たとえば、貪欲法など事前に定義された基準に基づいて決定される。貪欲法が用いられる場合、コスト（評価関数の値（エネルギー））を削減する割当変更が受け入れられる。貪欲法が用いられる場合、提案の受け入れ確率（ＰＡＲ：Proposal Acceptance Rate）は、探索の開始時に高くなるが、探索が評価関数の極小値でスタックし、それ以上の改善の動きが見つからないため、後に０に近づく傾向がある。 In local search, a decision is made as to whether to accept a proposed allocation change that will cause a change in the value of the evaluation function by the amount of change calculated as above. The decision as to whether to accept a proposal is based on predefined criteria, such as a greedy method. When a greedy method is used, allocation changes that reduce the cost (value of the evaluation function (energy)) are accepted. When a greedy method is used, the proposal acceptance rate (PAR) is high at the start of the search, but tends to approach 0 later as the search gets stuck in a local minimum of the evaluation function and no further improvement is found.

貪欲法に代えて、本実施の形態では、疑似焼き鈍し法などの確率的局所探索法を使用することができる。確率的局所探索法では、割当の変化にランダム性を加えるために、温度パラメータ（Ｔ）を使用して、以下の式（１８）で表されるメトロポリス法の受入れ確率（Ｐ_ａｃｃ）を用いることができる。 Instead of the greedy method, in this embodiment, a stochastic local search method such as simulated annealing can be used. In the stochastic local search method, the acceptance probability (P acc ) of the Metropolis method, expressed by the following equation (18), can be used with a temperature parameter ( _T ) to add randomness to the change in allocation.

Ｔが無限大に向かって増加すると、Ｐ_ａｃｃ及びＰＡＲは増加し、ΔＣの値に関係なくすべての提案が受け入れられる。Ｔが下げられて０に近づくと、提案の受け入れは貪欲になり、ＰＡＲは探索が最終的に評価関数の極小値でスタックするため、０になる傾向がある。 As T increases towards infinity, P _acc and PAR increase and all proposals are accepted regardless of the value of ΔC. As T is lowered and approaches 0, proposal acceptance becomes greedy and PAR tends to 0 as the search eventually gets stuck in a local minimum of the evaluation function.

ＱＡＰとＱＳＡＰのΔＣの計算は、データ処理装置が割当問題の解を探索する際に、データ処理装置における合計処理時間の大部分を占める。
ΔＣの計算を、以下のようにベクトル算術演算を用いて行うことで計算の高速化が可能となる。 The calculation of the QAP and ΔC of the QSAP takes up a large portion of the total processing time in the data processing device as it searches for a solution to the assignment problem.
The calculation of ΔC can be performed at high speed by using vector arithmetic operations as follows:

（ΔＣ計算のベクトル化）
以下この手法をＶΔＣ法と呼ぶ場合がある。式（１３）によるＱＡＰのΔＣ_ｅｘの計算は、ｉ＝ａ，ｂの計算をスキップするという条件がなければ、内積計算を用いて簡単にベクトル化できる。 (Vectorization of ΔC calculation)
Hereinafter, this method may be referred to as the VΔC method. The calculation of ΔC _ex of QAP according to equation (13) can be easily vectorized using inner product calculations unless there is a condition that the calculation of i=a, b is skipped.

本実施の形態のデータ処理装置は、この条件をなくし、ΔＣ_ｅｘの計算をｎ個のすべての要素に関する積和演算ループとし、以下の式（１９）に示されるように、フロー量と距離との追加の積を補償項として加えることで、その条件をなくすことによる補償を行う。 The data processing device of this embodiment eliminates this condition, performs the calculation of ΔC _ex as a product-sum operation loop for all n elements, and adds an additional product of the flow amount and the distance as a compensation term, as shown in the following equation (19), thereby performing compensation by eliminating the condition.

式（１９）は、識別番号＝ａ，ｂの要素の割当先が入れ替わったときの、フロー量の差のベクトルΔＦ（式（２０））及び距離の差のベクトルΔＤ（式（２１））を用いて、式（２２）のように、内積を使用して再定式化できる。 Equation (19) can be reformulated using the dot product, as in equation (22), using the flow rate difference vector ΔF (equation (20)) and the distance difference vector ΔD (equation (21)) when the allocation destinations of elements with identification numbers = a and b are swapped.

式（２０）のΔ^ｂ _ａＦは、フロー行列のｂ行とａ行の差ベクトルを表し、Ｆ_ｂ，＊はフロー行列のｂ行を示し、Ｆ_ａ，＊はフロー行列のａ行を示す。
ベクトルΔＦの要素は、元のフロー行列の順序で配置されるが、ベクトルΔＤの要素は現在の割当状態に対応するように並べられる。これは、式（２１）のように、転置されたバイナリ状態行列Ｘ（Ｘ^Ｔと表記されている）と（Ｄ_{φ（ａ），＊}－Ｄ_{φ（ｂ），＊}）との乗算によって数学的に示される。Ｄ_{φ（ａ），＊}は距離行列のａ行を示し、Ｄ_{φ（ｂ），＊}は距離行列のｂ行を示す。 In equation (20), Δ ^b _a F represents the difference vector between row b and row a of the flow matrix, F _b,* represents row b of the flow matrix, and F _a,* represents row a of the flow matrix.
The elements of vector ΔF are arranged in the order of the original flow matrix, while the elements of vector ΔD are ordered to correspond to the current assignment state. This is mathematically shown by multiplying the transposed binary state matrix X (denoted as ^XT ) by (D _φ(a),* - D _φ(b),* ), as in equation (21). _{D φ(a),*} denotes row a of the distance matrix, and D _φ(b),* denotes row b of the distance matrix.

ソフトウェアでは、式（２１）のような計算は、要素数＝ｎに比例する時間でベクトルΔＤの要素を並べ替えることで実現される。
ＱＳＡＰの場合、１つの割当先に要素が複数配置可能であるため、以下の式（２３）のようにｍ行ｍ列の距離行列から、サイズ＝ｎのベクトル（ΔＤ）が生成される。 In software, calculations such as equation (21) are realized by rearranging the elements of vector ΔD in a time proportional to the number of elements=n.
In the case of QSAP, since a plurality of elements can be allocated to one allocation destination, a vector (ΔD) of size=n is generated from a distance matrix of m rows and m columns as shown in the following equation (23).

このようなベクトルΔＤと式（２０）に示したベクトルΔＦとを用いて、ΔＣ^ｓ _ｅｘは、以下の式（２４）により計算できる。 Using such a vector ΔD and the vector ΔF shown in equation (20), ΔC ^s _ex can be calculated by the following equation (24).

ＱＳＡＰの場合、複数の要素が同じ割当先に割り当てられることが制限されないことから、式（２４）の２行目のような追加の補償項がΔＣに加えられる。
また、式（１７）に対応する再配置によるΔＣ^ｓ _ｒｅｌの計算のベクトル化した形式は、以下の式（２５）のように表せる。 In the case of QSAP, since there is no restriction on multiple elements being assigned to the same destination, an additional compensation term such as the second line of equation (24) is added to ΔC.
Furthermore, the vectorized form of the calculation of ΔC ^s _rel by rearrangement corresponding to equation (17) can be expressed as the following equation (25).

再配置による要素の割当先の移動は、１つの要素について行われるため、ベクトルΔＦの計算と、フロー量と距離との積を加えることによる補償は必要ない。
（距離行列の並べ替え）
上記のＶΔＣ法は、ＳＩＭＤ（Single Instruction/Multiple Data）を用いることで、複数のＣＰＵ（Central Processing Unit）や複数のＧＰＵ（Graphics Processing Unit）などのプロセッサ上に実装可能である。 Since the shift of the allocation destination of an element due to rearrangement is performed for one element, there is no need to calculate the vector ΔF and compensate by adding the product of the flow amount and the distance.
(Sorting the distance matrix)
The V.DELTA.C method can be implemented on processors such as multiple CPUs (Central Processing Units) and multiple GPUs (Graphics Processing Units) by using SIMD (Single Instruction/Multiple Data).

しかし、ＶΔＣ法で行われるΔＤの要素の並べ替えは、要素数＝ｎに比例する時間で実行することができるが、２つの要素の割当先の入れ替えによるΔＣの計算のたびに実行されるため処理効率がよくない。また、大量の計算コストがかかる可能性がある。 However, while the rearrangement of ΔD elements performed by the VΔC method can be performed in a time proportional to the number of elements (n), it is not very efficient because it is performed every time ΔC is calculated by swapping the allocation destinations of two elements. It can also incur a large amount of computational cost.

適切な要素順をもつΔＤを生成する計算コストを最小限に抑えるために、本実施形態のデータ処理装置は、現在の割当状態に応じて距離行列の列を並べ替える。以下この手法をＳＡＭ（State-Aligned D Matrix）法と呼ぶ場合がある。また、以下では、このように並べ替えを行った距離行列を状態整列Ｄ行列と呼び、割当問題がＱＡＰの場合はＤ^Ｘ、割当問題がＱＳＡＰの場合はＤ^Ｓと表記する。 In order to minimize the computational cost of generating ΔD with an appropriate element order, the data processing device of this embodiment rearranges the columns of the distance matrix according to the current allocation state. Hereinafter, this method may be referred to as the SAM (State-Aligned D Matrix) method. Hereinafter, the distance matrix rearranged in this way is referred to as the state-aligned D matrix, and is denoted as D ^X when the assignment problem is QAP, and as D ^S when the assignment problem is QSAP.

図２は、ＱＡＰの計算時における距離行列の並べ替え例とデータ処理装置の例を示す図である。
データ処理装置１０は、たとえば、コンピュータであり、記憶部１１、処理部１２を有する。 FIG. 2 is a diagram showing an example of rearrangement of the distance matrix when calculating the QAP, and an example of a data processing device.
The data processing device 10 is, for example, a computer, and includes a storage unit 11 and a processing unit 12 .

記憶部１１は、たとえば、ＤＲＡＭ（Dynamic Random Access Memory）などの電子回路である揮発性の記憶装置、または、ＨＤＤ（Hard Disk Drive）やフラッシュメモリなどの電子回路である不揮発性の記憶装置である。記憶部１１は、ＳＲＡＭ（Static Random Access Memory）レジスタなどの電子回路を含んでいてもよい。 The storage unit 11 is, for example, a volatile storage device that is an electronic circuit such as a dynamic random access memory (DRAM), or a non-volatile storage device that is an electronic circuit such as a hard disk drive (HDD) or flash memory. The storage unit 11 may also include an electronic circuit such as an SRAM (static random access memory) register.

記憶部１１は、たとえば、局所探索などをコンピュータに実行させるプログラムを記憶するとともに、前述のフロー行列、距離行列、状態整列Ｄ行列（Ｄ^ＸまたはＤ^Ｓ）、割当状態（整数割当ベクトルφまたはバイナリ状態行列Ｘで表される）を記憶する。 The storage unit 11 stores, for example, a program that causes a computer to execute a local search or the like, as well as the flow matrix, distance matrix, state alignment D matrix (D ^X or D ^S ), and assignment state (represented by an integer assignment vector φ or a binary state matrix X) described above.

処理部１２は、たとえば、ＣＰＵ、ＧＰＵ、ＤＳＰ（Digital Signal Processor）などのハードウェアであるプロセッサにより実現できる。また、処理部１２は、ＡＳＩＣ（Application Specific Integrated Circuit）やＦＰＧＡなどの電子回路により実現されるようにしてもよい。処理部１２は、記憶部１１に記憶されたプログラムを実行して、局所探索処理をデータ処理装置１０に行わせる。なお、処理部１２は、複数のプロセッサの集合であってもよい。 The processing unit 12 can be realized by a hardware processor such as a CPU, GPU, or DSP (Digital Signal Processor). The processing unit 12 may also be realized by an electronic circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA. The processing unit 12 executes a program stored in the storage unit 11 to cause the data processing device 10 to perform local search processing. The processing unit 12 may also be a collection of multiple processors.

処理部１２は、２つの要素の割当先を入れ替える処理を繰り返し、たとえば、式（１）または式（１１）に示した評価関数の値が極小になる割当状態を探索する。評価関数の極小値のうちの最小値になる割当状態が最適解となる。なお、式（１）または式（１１）に示した評価関数の符号を変えれば、処理部１２は、評価関数の値が極大になる割当状態を探索することもできる（この場合、最大値が最適解となる）。 The processing unit 12 repeatedly switches the allocation destinations of the two elements, searching for an allocation state in which the value of the evaluation function shown in equation (1) or equation (11) is minimized, for example. The allocation state in which the smallest of the minimum values of the evaluation function is the optimal solution. Note that by changing the sign of the evaluation function shown in equation (1) or equation (11), the processing unit 12 can also search for an allocation state in which the value of the evaluation function is maximized (in this case, the maximum value is the optimal solution).

図２には、ＱＡＰの計算時における距離行列の並べ替え例が示されている。あるタイムステップ＝ｔの場合の割当状態と、その割当状態に対応した配列となっている状態整列Ｄ行列が、φ^（ｔ）、Ｄ^Ｘ（ｔ）と表記されている。識別番号＝１，３の要素の割当先を入れ替える割当変更が提案され、受け入れられると、タイムステップ＝ｔ＋１では、図２のφ^{（ｔ＋１）}で表されているような割当状態となる。このとき、状態整列Ｄ行列として、受け入れられた割当状態に対応するように、Ｄ^Ｘ（ｔ）に対して１列目と３列目が入れ替えられたＤ^{Ｘ（ｔ＋１）}が生成される。 2 shows an example of rearrangement of a distance matrix when calculating a QAP. The allocation state at a certain time step = t and the state-ordered D matrix arranged corresponding to the allocation state are denoted as φ ^(t) and D ^X(t) . If an allocation change that swaps the allocation destinations of elements with identification numbers = 1 and 3 is proposed and accepted, the allocation state at time step = t+1 will be as shown by φ ^(t+1) in FIG. 2. In this case, D ^X(t+1) is generated as the state-ordered D matrix, in which the first and third columns of D ^X(t) are swapped to correspond to the accepted allocation state.

なお、提案された割当変更が受け入れられなかった場合、φ^（ｔ）とＤ^Ｘ（ｔ）の更新はされない。
ＱＡＰの場合、状態整列Ｄ行列（Ｄ^Ｘ）は、以下の式（２６）に示されるように、現在の割当状態を表すバイナリ状態行列Ｘの転置行列Ｘ^Ｔに、元の距離行列（Ｄ）を乗ずることで表すことができる。なお、ハードウェア的に状態整列Ｄ行列の列の入れ替えを行う場合のハードウェア構成の例については後述する。 Note that if the proposed reassignment is not accepted, φ ^(t) and D ^X(t) are not updated.
In the case of QAP, the state alignment D matrix (D ^X ) can be expressed by multiplying the transpose matrix X ^T of the binary state matrix X representing the current allocation state by the original distance matrix (D), as shown in the following equation (26). Note that an example of a hardware configuration for rearranging the columns of the state alignment D matrix in hardware will be described later.

ベクトルΔＤ^Ｘは、以下の式（２７）により、前述のベクトルΔＤの要素の並べ替えを行わずに計算でき、ΔＤ^Ｘを前述の式（２２）に代入することで、ΔＣ_ｅｘが計算できる。 The vector ΔD ^X can be calculated by the following equation (27) without rearranging the elements of the above-mentioned vector ΔD, and ΔC _ex can be calculated by substituting ΔD ^X into the above-mentioned equation (22).

ＱＳＡＰの場合、状態整列Ｄ行列（Ｄ^Ｓ）は、以下の式（２８）に示されるように、現在の割当状態を表すバイナリ状態行列Ｓの転置行列Ｓ^Ｔに、元の距離行列（Ｄ）を乗ずることで表すことができる。Ｄ^Ｓは、ｍ行ｎ列の行列となる。 In the case of QSAP, the state alignment D matrix (D ^S ) can be expressed by multiplying the transpose matrix S ^T of the binary state matrix S representing the current allocation state by the original distance matrix (D) as shown in the following equation (28). ^{D S} is a matrix with m rows and n columns.

割当先交換と再配置によるベクトルΔＤ^Ｓはそれぞれ、以下の式（２９）、式（３０）により、前述のΔＤの要素の並べ替えを行わずに計算でき、前述の式（２４）、式（２５）に代入することで、ΔＣ^ｓ _ｅｘとΔＣ^ｓ _ｒｅｌを計算できる。 The vectors ΔD ^S resulting from the assignment exchange and rearrangement can be calculated using the following equations (29) and (30), respectively, without rearranging the elements of the above-mentioned ΔD, and by substituting these into the above-mentioned equations (24) and (25), ΔC ^s _ex and ΔC ^s _rel can be calculated.

前述のＶΔＣ法では、ベクトルΔＤの要素を並べ替えるには、ΔＣが計算されるたびに要素数＝ｎに比例する操作が行われる。これとは対照的に、上記の手法（ＳＡＭ法）では、割当状態に一致するようにＤ^ＸやＤ^Ｓの並べ替えが行われるのは、提案された割当変更が受け入れられた場合のみである。これにより、平均して反復ごとに必要な操作の数が削減され計算効率が上がるため、計算時間が短縮され、割当問題を高速に計算できるようになる。 In the VΔC method described above, rearranging the elements of vector ΔD requires operations proportional to the number of elements (n) each time ΔC is calculated. In contrast, in the above technique (SAM method), rearranging D ^X and D ^S to match the allocation state is only performed if the proposed allocation change is accepted. This reduces the number of operations required per iteration on average, improving computational efficiency, thereby shortening computation time and enabling the assignment problem to be calculated quickly.

（ボルツマンマシンキャッシング法（比較例））
ΔＣ計算をベクトル化して行う手法には以下のような手法も考えられる。この手法は、バイナリ状態行列のビットの反転に対応する部分的なΔＣの値（前述の式（３）に示した局所場（ｈ_ｉ）に対応する）を計算して保存する方法である。以下この手法をボルツマンマシンキャッシング法（ＢＭ＄法と表記する）と呼ぶ場合がある。 (Boltzmann Machine Caching Method (Comparative Example))
The following method can also be considered as a method for vectorizing ΔC calculation. This method calculates and stores partial ΔC values (corresponding to the local field (h _i ) shown in the above-mentioned equation (3)) corresponding to bit inversions of the binary state matrix. Hereinafter, this method may be referred to as the Boltzmann machine caching method (BM$ method).

バイナリ状態行列（ＸまたはＳ）の各ビットに対応して、キャッシュされた局所場が用いられる。割当変更の提案が受け入れられたときに、ｎ^２に比例する時間でｎ行ｎ列の局所場による行列（以下キャッシュ行列という）の更新が行われることになるが、ΔＣの計算については、ｎに依存しない時間で行うことができる。 A cached local field is used for each bit of the binary state matrix (X or S). When a proposal for reassignment is accepted, an n-by-n matrix (hereinafter referred to as the cache matrix) using the local fields is updated in a time proportional to ^n² , but ΔC can be calculated in a time independent of n.

ＱＡＰの場合、キャッシュ行列（Ｈ）は、探索処理の開始時に式（３１）により、ｎ^３に比例する時間で生成される。 In the case of QAP, the cache matrix (H) is generated at the start of the search process according to equation (31) in a time proportional to n ³ .

キャッシュされた局所場を使用して、式（３２）により内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２２）に代入することで、ΔＣ_ｅｘが計算できる。 Using the cached local field, the inner product ΔF·ΔD can be generated by equation (32). Then, by substituting the generated ΔF·ΔD into the above equation (22), ΔC _ex can be calculated.

割当変更の提案が受け入れられると、キャッシュ行列は、以下の図３に示されるように、式（３３）によりクロネッカー積を使用することによって更新される。 If the reassignment proposal is accepted, the cache matrix is updated by using the Kronecker product according to equation (33), as shown in Figure 3 below.

図３は、キャッシュ行列の更新計算の例を示す図である。
更新は、一度にキャッシュ行列の一行分行われる。ベクトルΔＦの０の要素に対応する行についての処理はスキップされる。 FIG. 3 is a diagram illustrating an example of update calculation of the cache matrix.
The update is performed one row of the cache matrix at a time, skipping the processing of rows corresponding to zero elements of the vector ΔF.

ＱＳＡＰの場合、キャッシュ行列（Ｈ^ｓ）は、探索処理の開始時に式（３４）により生成される。 In the case of QSAP, the cache matrix (H ^s ) is generated at the beginning of the search process according to equation (34).

キャッシュされた局所場を使用して、式（３５）により内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２４）に代入することで、ΔＣ^ｓ _ｅｘが計算できる。 Using the cached local field, the inner product ΔF·ΔD can be generated by equation (35). Then, by substituting the generated ΔF·ΔD into the above equation (24), ΔC ^s _ex can be calculated.

割当変更の提案が受け入れられると、キャッシュ行列は、式（３６）によりクロネッカー積を使用することによって更新される。 If the reassignment proposal is accepted, the cache matrix is updated using the Kronecker product according to equation (36).

同様に、キャッシュされた局所場を使用して、式（３７）により再配置が生じた場合の内積ΔＦ・ΔＤを生成できる。そして、生成したΔＦ・ΔＤを前述の式（２５）に代入することで、ΔＣ^ｓ _ｒｅｌが計算できる。 Similarly, the cached local field can be used to generate the inner product ΔF·ΔD when rearrangement occurs using equation (37). Then, ΔC ^s _rel can be calculated by substituting the generated ΔF·ΔD into the above equation (25).

再配置の割当変更の提案が受け入れられると、キャッシュ行列は、式（３８）によって更新される。 When a proposed relocation change is accepted, the cache matrix is updated according to equation (38).

上記のようなＢＭ＄法では、キャッシュ行列を保持することになるが、問題の規模が大きくなるとキャッシュ行列を保持するメモリの容量も大きくなる。このため、高速読み出し可能な比較的容量が小さいメモリの使用が難しくなる。ＳＡＭ法は、このようなキャッシュ行列を保持しなくてよい。 In the BM$ method described above, a cache matrix is maintained, but as the problem size increases, the memory capacity required to maintain the cache matrix also increases. This makes it difficult to use relatively small memory that can be read quickly. The SAM method does not require the maintenance of such a cache matrix.

（ソルバーシステムの設計）
上記の３つの手法（ＶΔＣ法、ＳＡＭ法、ＢＭ＄法）の性能を比較検証するため、以下に示すようなソルバーシステムを設計した。なお、以下では、データ処理装置１０の処理部１２がＳＩＭＤ機能を備えたマルチコアＣＰＵを含み、マルチコアＣＰＵ上にパラレルテンパリングアルゴリズムが実装されているものとして説明する。 (Solver system design)
In order to compare and verify the performance of the above three methods (VΔC method, SAM method, and BM$ method), we designed the solver system shown below. Note that the following description will be given assuming that the processing unit 12 of the data processing device 10 includes a multi-core CPU with SIMD functionality and that the parallel tempering algorithm is implemented on the multi-core CPU.

各専用コアは、探索インスタンスデータ（前述のＤ^Ｘ、Ｄ^Ｓ、Ｈ、Ｈ^ｓ）を保持するための専用キャッシュをもつ。
図４は、パラレルテンパリングを実行するソルバーシステムの例を示す図である。 Each dedicated core has a dedicated cache for holding search instance data (D ^x , D ^s , H, H ^s mentioned above).
FIG. 4 is a diagram illustrating an example of a solver system that performs parallel tempering.

ソルバーシステムは、探索部２０とパラレルテンパリングコントローラ２１を有する。探索部２０は、複数のコア２０ａ１～２０ａｍを有し、コア２０ａ１～２０ａｍのそれぞれが複数のレプリカ（インスタンスに相当する）についての前述の局所探索（ＳＬＳ：Stochastic Local Search）を並列に実行する。レプリカ数はＭである。コア２０ａ１は、レプリカ２０ｂ１を含む複数のレプリカについての局所探索を行う。コア２０ａｍは、レプリカ２０ｂＭを含む複数のレプリカについての局所探索を行う。 The solver system has a search unit 20 and a parallel tempering controller 21. The search unit 20 has multiple cores 20a1-20am, each of which performs the aforementioned local search (SLS: Stochastic Local Search) on multiple replicas (corresponding to instances) in parallel. The number of replicas is M. Core 20a1 performs a local search on multiple replicas, including replica 20b1. Core 20am performs a local search on multiple replicas, including replica 20bM.

ＱＡＰの場合は、レプリカごとに、前述の整数割当ベクトルφ、キャッシュ行列（Ｈ）（ＢＭ＄法の場合）、状態整列Ｄ行列（Ｄ^Ｘ）（ＳＡＭ法の場合）、評価関数の値（Ｃ）が保持される。 In the case of QAP, the aforementioned integer allocation vector φ, cache matrix (H) (in the case of the BM$ method), state alignment D matrix (D ^X ) (in the case of the SAM method), and value of the evaluation function (C) are stored for each replica.

ＱＳＡＰの場合は、レプリカごとに、前述の整数割当ベクトルψ、キャッシュ行列（Ｈ^ｓ）（ＢＭ＄法の場合）、状態整列Ｄ行列（Ｄ^Ｓ）（ＳＡＭ法の場合）、評価関数の値（Ｃ）が保持される。 In the case of QSAP, the aforementioned integer allocation vector ψ, cache matrix (H ^s ) (in the case of the BM$ method), state alignment D matrix (D ^s ) (in the case of the SAM method), and value of the evaluation function (C) are stored for each replica.

式（６）に示したフロー行列（Ｆ）と、ＱＳＡＰの場合に用いられる式（１０）に示した行列Ｂは全レプリカ２０ｂ１～２０ｂＭについて共通である。
Ｍ個のレプリカ２０ｂ１～２０ｂＭには、互いに異なる値の温度パラメータ（Ｔ）が設定される。レプリカ２０ｂ１～２０ｂＭには、Ｔ_ｍｉｎからＴ_ｍａｘまで段階的に上昇する温度の何れかが設定されている。以下、このような段階的に上昇する温度パラメータの値を、温度ラダーと呼ぶ場合がある。 The flow matrix (F) shown in equation (6) and the matrix B shown in equation (10) used in the case of QSAP are common to all replicas 20b1 to 20bM.
The M replicas 20b1 to 20bM are set with different values of a temperature parameter (T). Each replica 20b1 to 20bM is set with one of temperatures that increases stepwise from T _min to T _max . Hereinafter, such stepwise increasing temperature parameter values may be referred to as a temperature ladder.

パラレルテンパリングコントローラ２１は、隣接する温度パラメータの値（Ｔ_ｋ、Ｔ_ｋ＋１）が設定されるレプリカ間で、評価関数の値（Ｃ_ｋ、Ｃ_ｋ＋１）とＴ_ｋ、Ｔ_ｋ＋１に基づいて、以下の式（３９）で表される交換許容確率ＳＡＰで、Ｔ_ｋ、Ｔ_ｋ＋１の値を交換する。 The parallel tempering controller 21 exchanges the values of Tk and _{Tk+1 between replicas in which adjacent temperature parameter values (Tk, Tk+1} ₎ are set, based on the evaluation function values ( _Ck , _Ck ₊₁ ) and _Tk , Tk ₊ ₁ , with the exchange allowance probability SAP expressed by the following equation (39).

上記のような探索部２０とパラレルテンパリングコントローラ２１は、図２に示した処理部１２と記憶部１１により実現される。
フロー行列、状態整列Ｄ行列、キャッシュ行列は、たとえば、単精度浮動小数点数形式で記憶部１１に記憶される。ＶΔＣ法やＳＡＭ法を実行する場合の内積計算やキャッシュ行列の更新などを行う際に、融合積和演算命令が利用できるようにするためである。 The search unit 20 and the parallel tempering controller 21 described above are realized by the processing unit 12 and the storage unit 11 shown in FIG.
The flow matrix, state alignment D matrix, and cache matrix are stored in the storage unit 11 in the single-precision floating-point format, for example, so that a fused multiply-and-accumulate instruction can be used when performing inner product calculations when executing the VΔC method or the SAM method, or when updating the cache matrix.

図５は、ＱＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。なお、図５のアルゴリズムの例では、並列に局所探索が行われるレプリカ数が３２である。なお、図５において、１６行目、１８行目などに示されている“（１）”、“（３１）”などは、前述の式（１）、式（３１）などを表す。 Figure 5 shows an example of an algorithm for finding a solution to QAP using local search with parallel tempering. In the example algorithm shown in Figure 5, the number of replicas for which local search is performed in parallel is 32. In Figure 5, "(1)" and "(31)" on lines 16 and 18 represent the aforementioned equations (1) and (31).

アルゴリズムは、温度パラメータの値とレプリカ状態を初期化することから始まる。次に、ソルバーシステムは最適化ループに入り、全てのレプリカに対して並列に、局所探索がＩ回繰り返し実行される。 The algorithm begins by initializing the temperature parameter values and replica states. The solver system then enters an optimization loop, performing one local search iteration in parallel across all replicas.

Ｉ回の局所探索が終わると、レプリカ交換処理が行われ、事前設定されたＢＫＳ（Best-Known-Solution）コスト（Ｃ_ＢＫＳ）が見つかるか、タイムアウト制限に達するまで、最適化ループが再開される。 After I local searches, a replica exchange process is performed and the optimization loop is restarted until a preset Best-Known-Solution (BKS) cost (C _BKS ) is found or a timeout limit is reached.

ＱＡＰの局所探索では、ループが事前設定された反復回数（Ｉ）で実行され、２つの要素（図５の例では施設）が選択され、その２つの要素の割当先を入れ替えたときのΔＣ_ｅｘの値が、選択された手法（ＶΔＣ法、ＢＭ＄法またはＳＡＭ法）により計算される。 In the local search of the QAP, a loop is executed for a preset number of iterations (I), two elements (facilities in the example of FIG. 5) are selected, and the value of ΔC _ex when the allocation destinations of the two elements are swapped is calculated using the selected method (VΔC method, BM$ method, or SAM method).

次に、計算されたΔＣ_ｅｘと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される。次に、０から１の範囲の乱数値が生成され、Ｐ_ａｃｃと比較されてベルヌーイ試行が実行され、選択された２つの要素間の割当先の入れ替えの提案が受け入れられるかどうかが決定される。提案が受け入れられた場合には、状態（割当状態）が更新される。 Next, P _acc in equation (18) is calculated using the calculated ΔC _ex and the current temperature parameter value of each replica. A random value between 0 and 1 is then generated and compared with P _acc to perform a Bernoulli trial to determine whether the proposed swap of assignments between the two selected elements is acceptable. If the proposal is accepted, the state (assignment state) is updated.

図６は、パラレルテンパリングによる局所探索の全体の処理の流れを示すフローチャートである。図６では、図５に示したアルゴリズムのパラレルテンパリングによる局所探索の全体の処理の流れが示されている。 Figure 6 is a flowchart showing the overall processing flow of local search using parallel tempering. Figure 6 shows the overall processing flow of local search using parallel tempering for the algorithm shown in Figure 5.

まず初期化ループ（ステップＳ１０～Ｓ１３）が、ｉ＝０からｉ＜Ｍ－１の間、ｉを１つずつ増やしつつ行われる。Ｍはレプリカ数である（図５の例では３２個）。
初期化ループでは、各レプリカに対して初期温度（温度ラダー）の設定（Ｔ［ｉ］←Ｔ^０［ｉ］）が行われる（ステップＳ１１）。そして、レプリカの初期化処理（図７参照）が行われる（ステップＳ１２）。 First, an initialization loop (steps S10 to S13) is performed while incrementing i by 1 from i=0 to i<M-1, where M is the number of replicas (32 in the example of FIG. 5).
In the initialization loop, the initial temperature (temperature ladder) is set for each replica (T[i]←T ⁰ [i]) (step S11), and then the replica is initialized (see FIG. 7) (step S12).

次に、最適化ループ（ステップＳ１４～Ｓ１６）が、ｉ＝０からｉ＜Ｍ－１の間、ｉを１つずつ増やしつつ行われる。
最適化ループでは、レプリカ探索処理が行われる（ステップＳ１５）。レプリカ探索処理では、各レプリカについて、設定された温度パラメータの値を用いて、Ｉ回の局所探索が行われる。 Next, an optimization loop (steps S14 to S16) is performed while increasing i by one from i=0 to i<M-1.
In the optimization loop, a replica search process is performed (step S15), in which a local search is performed I times for each replica using the set temperature parameter value.

その後、レプリカ交換処理（ステップＳ１７）が行われ、探索を終了させるか否かが判定される（ステップＳ１８）。たとえば、前述のように、事前設定されたＢＫＳコスト（Ｃ_ＢＫＳ）が見つかるか、タイムアウト制限に達した場合、探索が終了したと判定され、探索が終了する。探索を終了しないと判定された場合、ステップＳ１４からの最適化ループが再開される。 Then, a replica exchange process (step S17) is performed, and it is determined whether or not to terminate the search (step S18). For example, as described above, if a preset BKS cost (C _BKS ) is found or the timeout limit is reached, it is determined that the search is terminated, and the search ends. If it is determined not to terminate the search, the optimization loop is restarted from step S14.

なお、データ処理装置１０は、探索が終了した場合、探索結果（たとえば、後述の最小値、Ｃ_ｍｉｎ、φ_ｍｉｎ）を出力してもよい。探索結果は、たとえばデータ処理装置１０に接続される表示装置に表示されてもよいし、外部の装置に送信されてもよい。 When the search is completed, the data processing device 10 may output the search results (e.g., minimum values C _min and φ _min described below). The search results may be displayed on a display device connected to the data processing device 10, or may be transmitted to an external device.

なお、図６ではレプリカごとに順番に処理を行う例を示しているが、たとえば、３２個のコアをもつプロセッサにより、３２個のレプリカについてステップＳ１１，Ｓ１２，Ｓ１５の処理を並列に実行可能である。 Note that while Figure 6 shows an example in which processing is performed sequentially for each replica, it is possible to execute steps S11, S12, and S15 in parallel for 32 replicas using a processor with 32 cores, for example.

次に、ＳＡＭ法を実行する場合を例にして、ステップＳ１２のレプリカの初期化処理と、ステップＳ１５のレプリカ探索処理の流れを、フローチャートを用いて説明する。
図７は、ＱＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。 Next, the replica initialization process in step S12 and the replica search process in step S15 will be described with reference to a flowchart, taking the case where the SAM method is executed as an example.
FIG. 7 is a flowchart showing an example of the flow of the replica initialization process in the case of a QAP.

まず、整数割当ベクトルφの初期化が行われる（ステップＳ２０）。ステップＳ２０の処理では、たとえば、ランダムに要素の割当先（施設の配置先）が決定される。そして、式（１）の計算により、初期コスト（Ｃ）が計算される（ステップＳ２１）。 First, the integer allocation vector φ is initialized (step S20). In step S20, the allocation destination of elements (location of facilities) is determined randomly, for example. Then, the initial cost (C) is calculated using equation (1) (step S21).

その後、初期化したφとＣにより、最小値（Ｃ_ｍｉｎとφ_ｍｉｎ）が初期化される（ステップＳ２２）。また、φの初期値に対応するように、状態整列Ｄ行列（Ｄ^Ｘ）の初期化が行われる（ステップＳ２３）。その後、図６に示したフローチャートの処理に戻る。 Then, the minimum values (C _min and φ _min ) are initialized by the initialized φ and C (step S22). Also, the state alignment D matrix (D ^X ) is initialized to correspond to the initial value of φ (step S23). Then, the process returns to the flowchart shown in FIG.

図８は、ＱＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。
レプリカ探索処理では、イタレーションループ（ステップＳ３０～Ｓ４０）が、ｉ＝１からｉ＜Ｉの間、ｉを１つずつ増やしつつ行われる。 FIG. 8 is a flowchart showing an example of the flow of replica search processing in the case of QAP.
In the replica search process, an iteration loop (steps S30 to S40) is performed while increasing i by one from i=1 to i<I.

まず、割当先の入れ替え候補となる２つの要素（識別番号＝ａ，ｂで表されている）が選択される（ステップＳ３１）。
そして、その２つの要素の割当先を入れ替えたときのΔＣ_ｅｘ（ａ，ｂ）の値が、式（２２）により計算される（ステップＳ３２）。 First, two elements (represented by identification numbers a and b) that are candidates for replacement of allocation destinations are selected (step S31).
Then, the value of ΔC _ex (a, b) when the allocation destinations of the two elements are swapped is calculated using equation (22) (step S32).

次に、計算されたΔＣ_ｅｘと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される（ステップＳ３３）。そして、Ｐ_ａｃｃが０から１の範囲の乱数値ｒａｎｄ（）よりも大きいか否かが判定される（ステップＳ３４）。 Next, P _acc in equation (18) is calculated using the calculated ΔC _ex and the current temperature parameter value of each replica (step S33), and it is determined whether P _acc is greater than a random number rand() ranging from 0 to 1 (step S34).

Ｐ_ａｃｃ＞ｒａｎｄ（）であると判定された場合、Ｄ^Ｘの列ａと列ｂの入れ替え（Ｄ^Ｘ _＊，ａ，Ｄ^Ｘ _＊，ｂ←Ｄ^Ｘ _＊，ｂ，Ｄ^Ｘ _＊，ａ）が行われる（ステップＳ３５）。さらに、割当先の入れ替えによる割当状態の更新（φ（ａ），φ（ｂ）←φ（ｂ），φ（ａ））が行われ（ステップＳ３６）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ_ｅｘ（ａ，ｂ））が行われる（ステップＳ３７）。 If it is determined that P _acc > rand(), columns a and b of D ^X are swapped (D ^X _{*, a} , D ^X _{*, b} ←D ^X _{*, b} , D ^X _{*, a} ) (step S35). Furthermore, the allocation status is updated by swapping the allocation destinations (φ(a), φ(b) ←φ(b), φ(a)) (step S36), and the value of the evaluation function (C) is updated (C ← C + ΔC _ex (a, b)) (step S37).

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ３８）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、φ_ｍｉｎ←φ）が行われる（ステップＳ３９）。 Thereafter, it is determined whether C<C _min (step S38), and if it is determined that C<C _min , the minimum value is updated (C _min ←C, φ _min ←φ) (step S39).

ステップＳ３９の処理後、またはステップＳ３４の処理でＰ_ａｃｃ＞ｒａｎｄ（）ではないと判定された場合、またはステップＳ３８の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ｉ＝ＩとなるまでステップＳ３１からの処理が繰り返される。ｉ＝Ｉとなった場合、図６に示したフローチャートの処理に戻る。 After the process of step S39, or if it is determined in the process of step S34 that P _acc > rand() is not true, or if it is determined in the process of step S38 that C < C _min is not true, the process from step S31 is repeated until i = I. If i = I, the process returns to the process of the flowchart shown in FIG.

なお、図６～図８に示した処理の順序は一例であり、適宜処理の順序を入れ替えてもよい。
図９は、ＱＳＡＰの解をパラレルテンパリングによる局所探索で探索するアルゴリズムの例を示す図である。 The order of the processes shown in FIGS. 6 to 8 is an example, and the order of the processes may be changed as appropriate.
FIG. 9 is a diagram showing an example of an algorithm for searching for a solution to QSAP by local search using parallel tempering.

図９に示されているＱＳＡＰを計算する局所探索の関数は、再配置の提案を受け入れるか否かの判定や更新を行う再配置ループを、２つの要素の割当先を入れ替えるか否かの判定や更新を行う交換ループの前に含んでいる。 The local search function for computing the QSAP shown in Figure 9 includes a rearrangement loop that determines whether to accept or update a rearrangement proposal, before an exchange loop that determines or updates whether to swap the allocation destinations of two elements.

次に、ＳＡＭ法を実行する場合を例にして、ＱＳＡＰを計算する場合の、図６のステップＳ１２のレプリカの初期化処理と、ステップＳ１５のレプリカ探索処理の流れを、フローチャートを用いて説明する。 Next, using the SAM method as an example, we will explain the replica initialization process in step S12 of Figure 6 and the replica search process in step S15 when calculating QSAP using a flowchart.

図１０は、ＱＳＡＰの場合のレプリカの初期化処理の一例の流れを示すフローチャートである。
まず、整数割当ベクトルψの初期化が行われる（ステップＳ５０）。ステップＳ５０の処理では、たとえば、ランダムに要素の割当先（施設の配置先）が決定される。そして、式（１１）の計算により、初期コスト（Ｃ）が計算される（ステップＳ５１）。 FIG. 10 is a flowchart showing an example of the flow of a replica initialization process in the case of QSAP.
First, the integer allocation vector ψ is initialized (step S50). In the processing of step S50, the allocation destinations of elements (location destinations of facilities) are determined, for example, randomly. Then, the initial cost (C) is calculated using equation (11) (step S51).

その後、初期化したψとＣにより、最小値（Ｃ_ｍｉｎとψ_ｍｉｎ）が初期化される（ステップＳ５２）。また、ψの初期値に対応するように、式（２８）により、状態整列Ｄ行列（Ｄ^Ｓ）の初期化が行われる（ステップＳ５３）。その後、図６に示したフローチャートの処理に戻る。 Then, the minimum values (C _min and ψ _min ) are initialized by the initialized ψ and C (step S52). Also, the state alignment D matrix (D ^S ) is initialized by equation (28) so as to correspond to the initial value of ψ (step S53). Then, the process returns to the flowchart shown in FIG.

図１１は、ＱＳＡＰの場合のレプリカ探索処理の一例の流れを示すフローチャートである。
ＱＳＡＰの場合のレプリカ探索処理でも、イタレーションループ（ステップＳ６０～Ｓ７８）が、ｉ＝１からｉ＜Ｉの間、ｉを１つずつ増やしつつ行われる。 FIG. 11 is a flowchart showing an example of the flow of replica search processing in the case of QSAP.
In the replica search process in the case of QSAP, an iteration loop (steps S60 to S78) is also performed while increasing i by one from i=1 to i<I.

まず、識別番号＝ａの要素と識別番号＝ｌの割当先が選択される（ステップＳ６１）。
そして、識別番号＝ａの要素を識別番号＝ｌの割当先に割り当てたときのΔＣ^ｓ _ｒｅｌの値が、式（２５）により計算される（ステップＳ６２）。 First, an element with identification number = a and an allocation destination of identification number = l are selected (step S61).
Then, the value of ΔC ^s _rel when the element with identification number=a is assigned to the assignment destination with identification number=l is calculated using equation (25) (step S62).

次に、計算されたΔＣ^ｓ _ｒｅｌと各レプリカの現在の温度パラメータの値を用いて、式（１８）のＰ_ａｃｃが計算される（ステップＳ６３）。そして、Ｐ_ａｃｃが０から１の範囲の乱数値ｒａｎｄ（）よりも大きいか否かが判定される（ステップＳ６４）。 Next, P _acc in equation (18) is calculated using the calculated ΔC ^s _rel and the current temperature parameter value of each replica (step S63), and it is determined whether P _acc is greater than a random number rand() ranging from 0 to 1 (step S64).

Ｐ_ａｃｃ＞ｒａｎｄ（）であると判定された場合、Ｄ^ｓの列ａの値が、距離行列Ｄの列ｌの値で更新される（ステップＳ６５）。さらに、割当状態の更新（φ（ａ）←ｌ）が行われ（ステップＳ６６）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ^ｓ _ｒｅｌ）が行われる（ステップＳ６７）。 If it is determined that P _acc > rand(), the values of column a of D ^s are updated with the values of column l of the distance matrix D (step S65). Furthermore, the allocation state is updated (φ(a)←l) (step S66), and the value of the evaluation function (C) is updated (C←C+ΔC ^s _rel ) (step S67).

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ６８）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、ψ_ｍｉｎ←ψ）が行われる（ステップＳ６９）。 Thereafter, it is determined whether C<C _min (step S68), and if it is determined that C<C _min , the minimum value is updated (C _min ←C, ψ _min ←ψ) (step S69).

ステップＳ６９の処理後、またはステップＳ６４の処理でＰ_ａｃｃ＞ｒａｎｄ（）ではないと判定された場合、またはステップＳ６８の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ステップＳ７０の処理が行われる。 After the process of step S69, or if it is determined in the process of step S64 that P _acc > rand( ) is not true, or if it is determined in the process of step S68 that C < C _min is not true, the process of step S70 is performed.

ステップＳ７０の処理では、割当先の入れ替え候補となる２つの要素（識別番号＝ａ，ｂで表されている）が選択される。
そして、その２つの要素の割当先を入れ替えたときのΔＣ^ｓ _ｅｘの値が、式（２４）により計算される（ステップＳ７１）。 In the process of step S70, two elements (represented by identification numbers a and b) are selected as candidates for replacement of allocation destinations.
Then, the value of ΔC ^s _ex when the allocation destinations of the two elements are swapped is calculated using equation (24) (step S71).

次に、ΔＣ^ｓ _ｅｘ＜０であるか否かが判定される（ステップＳ７２）。なお、０の代わりに所定の値（固定値）を用いてもよい。
ΔＣ_ｅｘ ^ｓ＜０であると判定された場合、Ｄ^Ｓの列ａと列ｂの入れ替え（Ｄ^Ｓ _＊，ａ，Ｄ^Ｓ _＊，ｂ←Ｄ^Ｓ _＊，ｂ，Ｄ^Ｓ _＊，ａ）が行われる（ステップＳ７３）。さらに、割当先の入れ替えによる割当状態の更新（ψ（ａ），ψ（ｂ）←ψ（ｂ），ψ（ａ））が行われ（ステップＳ７４）、評価関数の値（Ｃ）の更新（Ｃ←Ｃ＋ΔＣ^ｓ _ｅｘ）が行われる（ステップＳ７５）。 Next, it is determined whether ΔC ^sex _< 0 (step S72). Note that a predetermined value (fixed value) may be used instead of 0.
If it is determined that ΔC _ex ^s < 0, columns a and b of D ^S are swapped (D ^S _*,a , D ^S _*,b ←D ^S _*,b , D ^S _*,a ) (step S73). Furthermore, the allocation state is updated by swapping the allocation destinations (ψ(a), ψ(b) ←ψ(b), ψ(a)) (step S74), and the value of the evaluation function (C) is updated (C ←C + ΔC ^s _ex ) (step S75).

その後、Ｃ＜Ｃ_ｍｉｎであるか否かが判定され（ステップＳ７６）、Ｃ＜Ｃ_ｍｉｎであると判定された場合、最小値の更新（Ｃ_ｍｉｎ←Ｃ、ψ_ｍｉｎ←ψ）が行われる（ステップＳ７７）。 Thereafter, it is determined whether C<C _min is true (step S76), and if it is determined that C<C _min is true, the minimum value is updated (C _min <-C, ψ _min <-ψ) (step S77).

ステップＳ７７の処理後、またはステップＳ７２の処理でΔＣ^ｓ _ｅｘ＜０ではないと判定された場合、またはステップＳ７６の処理でＣ＜Ｃ_ｍｉｎではないと判定された場合、ｉ＝ＩとなるまでステップＳ６１からの処理が繰り返される。ｉ＝Ｉとなった場合、図６に示したフローチャートの処理に戻る。 After the process of step S77, or if it is determined in the process of step S72 _that ΔC ^sex < 0 is not satisfied, or if it is determined in the process of step S76 that C < C _min is not satisfied, the process from step S61 onwards is repeated until i = I. If i = I, the process returns to the process of the flowchart shown in FIG.

なお、図１０～図１１に示した処理の順序は一例であり、適宜処理の順序を入れ替えてもよい。
（計算速度の評価結果）
まず、スカラー型の演算処理を使用し、ＶΔＣ法、ＳＡＭ法、及びＢＭ＄法の３つの方法による計算速度を評価した結果を示す。計算対象は、１０個のＱＡＰインスタンスであり、図５に示したアルゴリズムにより計算が行われた。計算は、図４に示したようなソルバーシステムを２つ用いて、６４レプリカのパラレルテンパリングにより行われた。 The order of the processes shown in FIGS. 10 and 11 is an example, and the order of the processes may be changed as appropriate.
(Evaluation results of calculation speed)
First, we will show the results of evaluating the calculation speed of three methods using scalar arithmetic processing: the VΔC method, the SAM method, and the BM$ method. The calculation targets 10 QAP instances, and calculations were performed using the algorithm shown in Figure 5. The calculations were performed using two solver systems such as those shown in Figure 4, with 64 replicas in parallel tempering.

各インスタンスは、シーケンシャルに１００回、独立な乱数シードで実行され、ＢＫＳに到達した時間（ＴｔＳ：Time-to-Solution）が記録された。
図１２は、ＶΔＣ法と比較したＳＡＭ法、ＢＭ＄法による計算の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均（図１２では“ＧＥＯＭＥＡＮ”と表記されている）を表し、縦軸はＶΔＣ法と比較したＳＡＭ法、ＢＭ＄法の高速化の度合いを表している。高速化の度合いは、ＶΔＣ法によるＴｔＳに対するＳＡＭ法とＢＭ＄法によるＴｔＳの比率により表されている。 Each instance was run sequentially 100 times with independent random seeds, and the time to reach BKS (Time-to-Solution (TtS)) was recorded.
FIG. 12 shows the evaluation results of the degree of speedup of calculations by the SAM method and the BM$ method compared to the VΔC method. The horizontal axis represents 10 instances of QAP and the geometric mean (denoted as "GEOMEAN" in FIG. 12), and the vertical axis represents the degree of speedup of the SAM method and the BM$ method compared to the VΔC method. The degree of speedup is represented by the ratio of TtS by the SAM method and the BM$ method to TtS by the VΔC method.

ＳＡＭ法を用いた場合、Ｄ^Ｘを更新する追加の計算コストがあるにもかかわらず、ＶΔＣ法を用いた場合よりも計算速度が向上している。
なお、ＢＭ＄法を用いた場合、ＶΔＣ法に対して幾何平均で２．５７倍、計算速度が向上している。 When the SAM method is used, the calculation speed is improved compared to when the VΔC method is used, despite the additional calculation cost of updating ^DX .
When the BM$ method is used, the calculation speed is improved by 2.57 times in geometric mean compared to the VΔC method.

（ＳＩＭＤの場合の計算速度の向上）
次に、スカラー型の演算処理に対するベクトル型の演算処理の計算速度の向上の度合いを評価した結果を示す。ベクトル型の演算処理を行うために、ＡＶＸ（Advanced Vector eXtensions）２のＳＩＭＤ組み込み関数が用いられた。 (Improvement of calculation speed in the case of SIMD)
Next, we will show the results of evaluating the degree of improvement in calculation speed of vector-type arithmetic processing compared to scalar-type arithmetic processing. To perform the vector-type arithmetic processing, SIMD built-in functions of AVX (Advanced Vector eXtensions) 2 were used.

前述と同様の手順を用いて、同じ１０のＱＡＰインスタンスをベクトル型の演算処理で実行した。
図１３は、スカラー型の演算処理に対するベクトル型の演算処理の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均（図１３では“ＧＥＯＭＥＡＮ”と表記されている）を表し、縦軸はスカラー型の演算処理に対するベクトル型の演算処理の高速化の度合いを表している。高速化の度合いは、ＶΔＣ法、ＳＡＭ法、ＢＭ＄法のそれぞれについての、スカラー型の演算処理によるＴｔＳに対するベクトル型の演算処理によるＴｔＳの比率により表されている。 Using the same procedure as above, the same 10 QAP instances were run with vector-based computation.
13 is a diagram showing the evaluation results of the degree of speedup of vector-type arithmetic processing relative to scalar-type arithmetic processing. The horizontal axis represents 10 instances of QAP and the geometric mean (denoted as "GEOMEAN" in FIG. 13), and the vertical axis represents the degree of speedup of vector-type arithmetic processing relative to scalar-type arithmetic processing. The degree of speedup is represented by the ratio of TtS by vector-type arithmetic processing to TtS by scalar-type arithmetic processing for each of the VΔC method, the SAM method, and the BM$ method.

スカラー型の演算処理に対してベクトル型の演算処理では、平均して、ＶΔＣ法の場合ほぼ２倍高速であり、ＢＭ＄法とＳＡＭ法の場合、３倍以上高速であった。ＶΔＣ法では、前述のようにＶΔＣ法で行われるΔＤの要素の並べ替えが実行時間のかなりの部分を占めるため、ＳＩＭＤ組み込み関数のメリットが最も少ない。 On average, the VΔC method is nearly twice as fast as scalar-type arithmetic processing in vector-type arithmetic processing, and the BM$ and SAM methods are more than three times faster. As mentioned above, the VΔC method offers the least benefit from SIMD built-in functions, since the rearrangement of the ΔD elements performed in the VΔC method takes up a significant portion of the execution time.

（動的負荷分散）
パラレルテンパリングでは、最低温度が設定されているレプリカと最高温度が設定されているレプリカとでは、ＰＡＲが異なる。Ｔの値が大きいほど（温度が高いほど）、ＰＡＲも上がる。これによりＳＡＭ法やＢＭ＄法の更新処理が、レプリカの処理を行うスレッド間で大きな実行時間のギャップを引き起こす可能性がある。 (Dynamic Load Balancing)
In parallel tempering, the PAR differs between the replica with the lowest temperature setting and the replica with the highest temperature setting. The larger the value of T (the higher the temperature), the higher the PAR. This can cause large execution time gaps between the threads processing the replicas in the SAM and BM$ update processes.

これを軽減するために、データ処理装置１０は、各温度のイタレーションごとの時間をＳＬＳ関数内で追跡し、その時間を用いて、各温度で実行されるイタレーションの回数をスケーリングする。 To mitigate this, the data processing device 10 tracks the time per iteration at each temperature within the SLS function and uses that time to scale the number of iterations performed at each temperature.

図１４は、負荷分散の一例を示す図である。図１４では、Ｔ１～Ｔ３２（Ｔ１＜Ｔ２＜…＜Ｔ３２）の温度ラダーが設定される３２のレプリカについての、温度パラメータの交換前のΔＣの計算時間、更新処理の時間が示されている。 Figure 14 shows an example of load balancing. Figure 14 shows the calculation time of ΔC before exchanging temperature parameters and the update processing time for 32 replicas set with a temperature ladder of T1 to T32 (T1<T2<...<T32).

負荷分散を行わない場合、大きい温度パラメータの値が設定されたレプリカほど更新処理の時間が長くなり、小さい温度パラメータの値が設定されたレプリカほど更新処理の時間が短い。このため、小さい温度パラメータの値が設定されたレプリカほど長いアイドル時間が生じている。 Without load balancing, replicas with higher temperature parameter values take longer to update, while replicas with lower temperature parameter values take shorter update times. As a result, replicas with lower temperature parameter values experience longer idle times.

これに対して負荷分散を行う場合、たとえば、Ｔ１が設定されているレプリカの実行時間（スレッドの実行時間）を基準に、他のレプリカにおけるΔＣ計算のイタレーション回数がスケーリングされる。 When load balancing is performed, for example, the number of iterations of the ΔC calculation in other replicas is scaled based on the execution time (thread execution time) of the replica to which T1 is set.

これにより、図１４に示すように、各レプリカ間でスレッドの実行時間をほぼ同等にすることができ、アイドル時間が短縮される。
このような負荷分散の有効性を定量的に示すために、前述の１０個のインスタンスを計算対象として、負荷分散による演算処理の高速化の度合いを評価した結果を以下に示す。 This allows the thread execution times to be roughly equal among the replicas, as shown in FIG. 14, thereby reducing idle time.
In order to quantitatively demonstrate the effectiveness of such load balancing, the degree of speedup of calculation processing due to load balancing was evaluated using the above-mentioned 10 instances as calculation targets, and the results are shown below.

図１５は、負荷分散による演算処理の高速化の度合いの評価結果を示す図である。横軸はＱＡＰの１０個のインスタンスと幾何平均“ＧＥＯＭＥＡＮ”を表し、縦軸は負荷分散を行わない場合の演算処理に対する負荷分散を行った場合の高速化の度合いを表している。高速化の度合いは、負荷分散をせずにＳＡＭ法、ＢＭ＄法のそれぞれを実行したときのＴｔＳに対する、負荷分散をしてＳＡＭ法、ＢＭ＄法のそれぞれを実行したときのＴｔＳの比率により表されている。なお、負荷分散をしたときもしなかったときも、同じ温度パラメータの値のセット（たとえば、図１４のＴ１～Ｔ３２）が用いられている。 Figure 15 shows the evaluation results of the degree of speedup of calculation processing due to load balancing. The horizontal axis represents 10 instances of QAP and the geometric mean "GEOMEAN", and the vertical axis represents the degree of speedup when load balancing is performed compared to calculation processing without load balancing. The degree of speedup is represented by the ratio of TtS when the SAM method and the BM$ method are performed with load balancing to TtS when the SAM method and the BM$ method are performed without load balancing. Note that the same set of temperature parameter values (for example, T1 to T32 in Figure 14) was used whether load balancing was performed or not.

負荷分散により、ＳＡＭ法では平均１．８６倍、ＢＭ＄では平均１．３８倍、演算処理の高速化が達成できていることが分かった。インスタンスファミリ間での負荷分散の有効性の違いは、インスタンスファミリの特性により、温度パラメータの極値間の異なるＰＡＲのギャップに起因する可能性がある。 We found that load balancing achieved an average speedup of 1.86 times for the SAM method and 1.38 times for the BM$ method. The difference in the effectiveness of load balancing between instance families may be due to different PAR gaps between extreme values of the temperature parameter, depending on the characteristics of the instance families.

上記の各機能によって提供される計算の高速化の評価結果を表１にまとめる。 Table 1 summarizes the evaluation results of the computational speedup provided by each of the above functions.

２種類の高速化の度合（Incremental Speed-UpとCumulative Speed-Up）が示されている。なお、高速化の度合いは、前述の１０個のインスタンスについてのＴｔＳの幾何平均を用いて計算されたものであり、ＶΔＣ法を用いたスカラー型の演算処理によるＴｔＳを基準（１．００）としている。 Two degrees of speed-up are shown (Incremental Speed-Up and Cumulative Speed-Up). Note that the degree of speed-up is calculated using the geometric mean of the TtS for the 10 instances mentioned above, with the TtS obtained by scalar-type calculation processing using the VΔC method set as the standard (1.00).

（ベンチマーク結果）
上記のＶΔＣ法、ＳＡＭ法、ＢＭ＄法に関して、６４コアのＣＰＵを含む所定のハードウェア構成のデータ処理装置１０を用いてベンチマークスコアが測定された。 (Benchmark results)
Regarding the V.DELTA.C method, the SAM method, and the BM$ method, benchmark scores were measured using a data processing device 10 with a predetermined hardware configuration including a 64-core CPU.

上記３つの方法を行うＱＡＰのソルバーのベンチマークスコアを測定するために、ＱＡＰライブラリーリファレンス（非特許文献５参照）にあるインスタンスが用いられた。さらに、非特許文献６，７で提案されたインスタンスが用いられた。また、本実施の形態では、非特許文献７に示された問題サイズの大きいＱＡＰ（既知の最適解がない）の新しいＢＫＳ状態を見つけることが試みられた。またＱＳＡＰのソルバーのベンチマークスコアを測定するために、非特許文献８で紹介されたインスタンスのセットが用いられた。 To measure the benchmark scores of QAP solvers that implement the above three methods, instances from the QAP Library Reference (see Non-Patent Document 5) were used. Additionally, instances proposed in Non-Patent Documents 6 and 7 were used. In this embodiment, we also attempted to find a new BKS state for the large-size QAP (with no known optimal solution) shown in Non-Patent Document 7. Furthermore, to measure the benchmark scores of QSAP solvers, a set of instances introduced in Non-Patent Document 8 was used.

（ＱＡＰベンチマーク） (QAP benchmark)

表２には、ベクトル型の演算処理を行う３つの方法（ＶΔＣ法、負荷分散を行うＳＡＭ法及びＢＭ＄法）を実施するソルバーによる、ＱＡＰの各インスタンスに対するＴｔＳが示されている。さらに表２には、ＶΔＣ法を用いたベクトル型の演算処理によるＴｔＳを基準（１．００）とした高速化の度合い（ＴｔＳの幾何平均を用いて計算されたもの）が示されている。さらに、表２には、公開されている２つのソルバーであるＰａｒＥＯＴＳ（非特許文献９参照）と、ＰＢＭ（Permutational Boltzmann Machine）（非特許文献１参照）によるＴｔＳと高速化の度合いが示されている。これら２つの公開されているソルバーは、５分のタイムアウトウィンドウ内で、いくつかの難しいインスタンスを１００％の成功率で解くことができるものである。 Table 2 shows the TtS for each instance of the QAP using solvers that implement three vector-based computation methods (the VΔC method, the SAM method with load balancing, and the BM$ method). Table 2 also shows the degree of speedup (calculated using the geometric mean of TtS), with the TtS for vector-based computation using the VΔC method set as the baseline (1.00). Table 2 also shows the TtS and degree of speedup for two publicly available solvers, ParEOTS (see Non-Patent Document 9) and PBM (Permutational Boltzmann Machine) (see Non-Patent Document 1). These two publicly available solvers are capable of solving some difficult instances with a 100% success rate within a 5-minute timeout window.

表２に示されているＴｔＳの値は、ＶΔＣ法、ＳＡＭ法、及びＢＭ＄法による局所探索を１００回独立に連続して実行し、測定されたＴｔＳの平均値（９９％の信頼区間の値として表されている）である。各実行のタイムアウト制限は５分である。ＰａｒＥＯＴＳとＰＢＭについてのＴｔＳの値は、それぞれの開示によるものである。 The TtS values shown in Table 2 are the average TtS values (expressed as values with a 99% confidence interval) measured over 100 independent consecutive runs of local searches using the VΔC, SAM, and BM$ methods. The timeout limit for each run is 5 minutes. The TtS values for ParEOTS and PBM are from their respective publications.

表２に示されているように、ＢＭ＄法は、５つのソルバーの中で全てのインスタンスにわたって最高のパフォーマンスを示し、ＰＢＭと比較して２倍、ＰａｒＥＯＴＳと比較して３００倍以上の高速化を示した。ＳＡＭ法とＢＭ＄法は、ＶΔＣ法に対してそれぞれ１．９２倍、３．２２倍の平均速度向上を示した。なお、表１と比較した表２の結果の違いは、使用されたインスタンスによるものである。 As shown in Table 2, the BM$ method performed best across all instances among the five solvers, achieving a speedup of 2x compared to PBM and over 300x compared to ParEOTS. The SAM and BM$ methods achieved average speedups of 1.92x and 3.22x compared to the VΔC method, respectively. Note that the differences in the results in Table 2 compared to Table 1 are due to the instances used.

ただし、ＳＡＭ法とＢＭ＄法のどちらが良いかは、問題規模やＰＡＲに依存し、ＢＭ＄法よりもＳＡＭ法が優位なケースもある。たとえば後述のように（図１７参照）、ＰＡＲが高くなるとＢＭ＄法よりもＳＡＭ法の方が優位になる傾向がある。また、本評価結果は、ＣＰＵでの実装の結果であり、比較的大きなメモリをもった系での結果である。専用回路での実装の場合、局所場を記憶しなくてもよいＳＡＭ法が、ＢＭ＄法よりも優位となる場合も考えられる。 However, whether the SAM or BM$ method is better depends on the problem scale and PAR, and there are cases where the SAM method is superior to the BM$ method. For example, as will be described later (see Figure 17), as the PAR increases, the SAM method tends to be superior to the BM$ method. Furthermore, these evaluation results are the result of an implementation on a CPU, and are the results of a system with a relatively large memory. When implemented on a dedicated circuit, the SAM method, which does not need to store local fields, may be superior to the BM$ method in some cases.

（ＱＳＡＰベンチマーク）
ＱＡＰと比較して、ＱＳＡＰソルバーに関するこれまでの開示はほとんどない。主な参考資料は、２０コアのＣＰＵに実装されたＰＭＩＴＳ（Parallel Memetic Iterative Tabu Search）アルゴリズムである（非特許文献８参照）。ＰＭＩＴＳは、ポピュレーションの５０％を用いて、１時間のタイムアウト制限で、停止基準と同じ最良のソリューションに到達する。この方法は、協調レプリカを使用したＭｅｍｅｔｉｃアルゴリズムにおいて、収束を予測する合理的な方法である。しかし、レプリカが温度ラダーに沿って絶えず流動しているパラレルテンパリングでは、このような基準は妥当ではない。 (QSAP Benchmark)
Compared to QAP, there has been little previous disclosure on QSAP solvers. The main reference is the Parallel Memetic Iterative Tabu Search (PMITS) algorithm implemented on a 20-core CPU (see Non-Patent Document 8). PMITS reaches the best solution with a 1-hour timeout limit using 50% of the population, which is the same as the stopping criterion. This method is a reasonable way to predict convergence in memetic algorithms with cooperative replicas. However, in parallel tempering, where replicas are constantly flowing along the temperature ladder, such a criterion is not valid.

したがって、本実施の形態では、ＱＡＰの場合と同様の方法でＴｔＳが測定され、ＰＭＩＴＳについては基準としてＴｔＳの値が測定された。その結果が、以下の表３に示されている。パラレルテンパリングソルバーとＰＭＩＴＳとの間の相関関係については示されていない。 Therefore, in this embodiment, TtS was measured in the same manner as for QAP, and the TtS value was measured as a reference for PMITS. The results are shown in Table 3 below. No correlation between the parallel tempering solver and PMITS is shown.

表３のように、ＱＡＰと比較すると、ＱＳＡＰインスタンス全体のＶΔＣ法と比較したＳＡＭ法及びＢＭ＄法のパフォーマンスは、ほぼ２倍低下している。これは、より高いＰＡＲに起因するものと考えられる。 As shown in Table 3, when compared to QSAP, the performance of the SAM and BM$ methods is almost 2x worse than the VΔC method across QSAP instances. This is likely due to higher PAR.

（拡張Ｔａｉｌｌａｒｄインスタンス上のＱＡＰスケーリング）
非特許文献７で紹介された、いくつかのＱＡＰインスタンスは未知の最適解をもち、最大ｎ=７２９のサイズである。このサイズと難易度のため、ベンチマークに使用されることはめったにないが、データ処理装置１０は、より良い解を見つけるために、所定の時間制限で、ＶΔＣ法、ＳＡＭ法及びＢＭ＄法を用いて、これらのインスタンスを実行した。 QAP Scaling on Extended Taillard Instances
Some QAP instances introduced in Non-Patent Document 7 have unknown optimal solutions and have a maximum size of n = 729. Due to their size and difficulty, they are rarely used for benchmarking, but the data processing device 10 executed these instances using the VΔC method, the SAM method, and the BM$ method with a predetermined time limit to find better solutions.

これらのインスタンスのＢＫＳ値を改善しようとした以前の試みでは、数分から数時間実行したにもかかわらず、ほとんどまたはまったく改善されなかった（たとえば、非特許文献１０参照）。 Previous attempts to improve the BKS values of these instances have resulted in little or no improvement, even after running for minutes to hours (see, for example, Non-Patent Document 10).

実験では、各インスタンスは、２０回実行され、ｎ＝１２５及びｎ＝１７５の場合は、１０秒、ｎ＝３４３及びｎ＝７２９のインスタンスの場合は３０秒で終了する。表４，５には、各インスタンスについてＶΔＣ法、ＳＡＭ法及びＢＭ＄法を適用した場合の最良のコスト（式（１）の評価関数の値）とともに、平均コストが示されている。 In the experiment, each instance was executed 20 times, completing in 10 seconds for n=125 and n=175, and 30 seconds for n=343 and n=729 instances. Tables 4 and 5 show the average cost along with the best cost (the value of the evaluation function in Equation (1)) when applying the VΔC, SAM, and BM$ methods for each instance.

実行時間が短いにもかかわらず、ＳＡＭ法とＢＭ＄法による４つを除くすべてのインスタンスのＢＫＳを改善でき、平均コストも以前のＢＫＳよりも低くなっていることが分かった。 Despite the shorter execution time, we were able to improve BKS for all but four instances using the SAM and BM$ methods, and found that the average cost was also lower than the previous BKS.

（スケーリング分析）
ベンチマーク結果と、ＶΔＣ法とＳＡＭ法の定性的比較に基づくと、ＳＡＭ法がより効率的であると考えられる。ＶΔＣ法ではイタレーションごとにΔＤの各要素の連続的な並べ替えが行われるのに対し、ＳＡＭ法では、割当状態に一致するようにＤ^ＸやＤ^Ｓの並べ替えが行われるのは、提案された割当変更が受け入れられた場合のみであるためである。 (Scaling analysis)
Based on benchmark results and a qualitative comparison of the VΔC and SAM methods, the SAM method appears to be more efficient because the VΔC method sequentially reorders each element of ΔD at each iteration, whereas the SAM method reorders D ^X and D ^S to match the allocation state only if the proposed reassignment is accepted.

一方、ＳＡＭ法とＢＭ＄法の相対的な性能は、主にＰＡＲに依存する。ＰＡＲは、使用されている探索アルゴリズムによって大きく異なる可能性がある。さらに、同じ探索アルゴリズム内であっても、ＰＡＲは、シミュレーテッドアニーリングなどのアルゴリズムを使用した実行時や、パラレルテンパリングなどで同時探索されるインスタンス間で変化する可能性がある。 On the other hand, the relative performance of SAM and BM$ methods depends primarily on PAR, which can vary significantly depending on the search algorithm used. Furthermore, even within the same search algorithm, PAR can change between runs using algorithms such as simulated annealing, or between instances searched simultaneously, such as in parallel tempering.

以下、問題サイズとＰＡＲ値に応じた実行時間の比較結果を示す。
図１６は、測定アルゴリズムの一例を示す図である。
測定アルゴリズムでは、最初に１～１００のランダムな値をもつフロー行列（Ｆ）及び距離行列（Ｄ）が生成される。そして割当状態（φ）が初期化される。次に、所定のイタレーション回数（Ｉ_{ｌｉｍｉｔ}）の処理が実行され、１ループの実行時間が計測される。このとき、所望のＰＡＲ値で提案がランダムに受け入れられる。 Below, we show the results of comparing execution times according to problem size and PAR value.
FIG. 16 is a diagram illustrating an example of a measurement algorithm.
In the measurement algorithm, first, a flow matrix (F) and a distance matrix (D) with random values between 1 and 100 are generated. Then, the allocation state (φ) is initialized. Next, a predetermined number of iterations (I _limit ) are executed, and the execution time of one loop is measured. At this time, a proposal with a desired PAR value is randomly accepted.

問題のサイズは、ＳＡＭ法及びＢＭ＄法のレプリカデータを格納するために使用されるメモリ階層に基づいて３つのグループに分割されている。ｎ＜２５６と２５６＜ｎ＜１０２４の問題サイズの場合には、Ｉ_{ｌｉｍｉｔ}はそれぞれ１００Ｍ及び１０Ｍが使用される。ｎ＞１０２４の問題サイズの場合には、Ｉ_{ｌｉｍｉｔ}は１Ｍが使用される。 The problem sizes are divided into three groups based on the memory hierarchy used to store replica data for the SAM and BM$ methods. For problem sizes of n<256 and 256<n<1024, I _limit is set to 100M and 10M, respectively. For problem sizes of n>1024, I _limit is set to 1M.

各イタレーションでは、ΔＣ_ｅｘの値が計算され、乱数生成器が、所望のＰＡＲで提案を受け入れるか否かを判定するために用いられる。
ＣＰＵパイプライン、キャッシュ、メモリに全負荷がかかっている間の性能を測定するため、６４個のループインスタンスが並列に実行された（コアごとに１つ）。この処理は、データポイントごとに１０の異なる乱数シードを使用して繰り返され、平均実行時間が測定された。 At each iteration, the value of ΔC _ex is calculated and a random number generator is used to determine whether to accept the proposal with the desired PAR.
To measure performance while the CPU pipeline, cache, and memory were fully loaded, 64 loop instances were run in parallel (one per core). This process was repeated using 10 different random number seeds for each data point, and the average execution time was measured.

フロー行列の密度がＢＭ＄法のキャッシュ行列（Ｈ）の更新関数に影響を与えるため、２つの別々のシミュレーションセットが実行された。１つは完全に密なフロー行列で、もう１つはスパースなフロー行列であり、ゼロ以外の値が１０％しかない。各測定は、ｎ＝［１００，５０００］の範囲の１０種類の問題サイズと、ＰＡＲ＝［０．０００１，０．１］の範囲の２９のＰＡＲを組み合わせて、合計２９０のパラメータの組み合わせについて計算が行われた。 Because the density of the flow matrix affects the update function of the cache matrix (H) in the BM$ method, two separate sets of simulations were run: one with a fully dense flow matrix and one with a sparse flow matrix with only 10% non-zero values. Each measurement was calculated for 10 different problem sizes in the range n = [100, 5000] and 29 PARs in the range PAR = [0.0001, 0.1], for a total of 290 parameter combinations.

図１７は、ＶΔＣ法、ＳＡＭ法、ＢＭ＄法について測定された相対的な高速化の度合いと、問題サイズに応じて占有されるメモリ階層を示す図である。高速化の度合いを示す図では、横軸はＰＡＲ［％］を表し、縦軸は高速化の度合い（Ｓｐｅｅｄ－Ｕｐ）を表す。図１７には、ＳＡＭ法及びＶΔＣ法に対するＢＭ＄法の高速化の度合いが、それぞれ異なる密度のフロー行列を用いた場合について示されている。さらに、図１７には、ＶΔＣ法に対するＳＡＭ法の高速化の度合いが示されている。図１７の例では、メモリ階層は、記憶容量が小さい順に、Ｌ２キャッシュ、Ｌ３キャッシュ、ＤＲＡＭがある。 Figure 17 shows the relative speedup measured for the VΔC, SAM, and BM$ methods, and the memory hierarchy occupied depending on the problem size. In the diagram showing the speedup, the horizontal axis represents PAR [%], and the vertical axis represents the speedup (Speed-Up). Figure 17 also shows the speedup of the BM$ method relative to the SAM and VΔC methods, using flow matrices of different densities. Furthermore, Figure 17 shows the speedup of the SAM method relative to the VΔC method. In the example of Figure 17, the memory hierarchy is arranged in order of decreasing storage capacity: L2 cache, L3 cache, and DRAM.

なお、ＱＡＰの１０のインスタンスに対する非負荷分散のパラレルテンパリングシミュレーションによる、あるフロー行列の密度に対する最大のＰＡＲの値は、以下の表６のようなものであった。 Note that the maximum PAR values for a given flow matrix density in a non-load-balanced parallel tempering simulation for 10 instances of QAP were as shown in Table 6 below.

なお、ＱＳＡＰシミュレーションは、ＱＡＰシミュレーションと類似するため、省略した。両シミュレーションの結果には顕著な差異はないと考えられる。
問題サイズによる相対的な高速化の度合いは、メモリ階層のどの層に探索に用いるデータの大部分が格納されているかに基づいて、３つのグループに分けることができる。各コアには専用のＬ２キャッシュ（本計算例では記憶容量が２５６ｋＢ）があり、４つのコアのグループはＬ３キャッシュ（本計算例では記憶容量が１６ＭＢ）を共有している。 The QSAP simulation was omitted because it is similar to the QAP simulation, and it is believed that there is no significant difference between the results of both simulations.
The relative speedups for different problem sizes can be divided into three groups based on which layer of the memory hierarchy contains the majority of the data used for searching: Each core has its own L2 cache (256 kB in this example), and a group of four cores shares an L3 cache (16 MB in this example).

図１７のように、ＳＡＭ法とＢＭ＄法の場合、各ループスレッドのＤ^Ｘとキャッシュ行列（Ｈ）は、コアのＬ２キャッシュに最大ｎ＝２５６のサイズで収まり、ｎ＝１０２４までの問題はＬ３キャッシュに収まる。 As shown in FIG. 17, in the case of the SAM and BM$ methods, the D ^X and cache matrix (H) of each loop thread can fit into the L2 cache of the core with a maximum size of n=256, and problems up to n=1024 can fit into the L3 cache.

図１７のように、問題サイズが大きくなり、より記憶容量が大きいメモリ階層が用いられるようになると、ＢＭ＄法と他の２つの方法の間の相対的な性能（高速化の度合い）が低下する。探索に用いるデータが上層のメモリ階層に移動すると、ＢＭ＄法がＳＡＭ法及びＶΔＣ法よりも優れた性能を維持するために要するＰＡＲの値が大幅に減少する。 As shown in Figure 17, as the problem size increases and a memory hierarchy with larger storage capacity is used, the relative performance (degree of speedup) between the BM$ method and the other two methods decreases. As the data used for the search moves to a higher memory hierarchy, the PAR value required for the BM$ method to maintain superior performance to the SAM and VΔC methods decreases significantly.

パラレルテンパリングを実施するソルバーの場合、表６に示すように、レプリカ全体の最大のＰＡＲの値は、インスタンスファミリ全体の問題サイズとともに必ずしも減少しない。これは、より小さなＱＡＰインスタンスについて、表２に示したのと同じ相対速度を維持するためのＰＡＲと問題サイズの関係は、必ずしも図１７に示すような関係に依存しないことを示している。 For solvers that implement parallel tempering, the maximum PAR across replicas does not necessarily decrease with problem size across instance families, as shown in Table 6. This indicates that for smaller QAP instances, the relationship between PAR and problem size to maintain the same relative speeds as shown in Table 2 does not necessarily depend on the relationship shown in Figure 17.

ＶΔＣ法に対するＳＡＭ法の高速化の度合いは、問題サイズに基づいて２つのグループに分けられる。問題サイズがｎ≦８００で、ループインスタンスごとのＤ^Ｘの大部分がキャッシュに収まる場合、ＳＡＭ法では、ＰＡＲ＜１０％の範囲ではＶΔＣよりも明らかに有利である。探索に用いるデータの保存先をＬ２キャッシュからＬ３キャッシュに移行することは、２つの方法間の相対的な性能に大きな影響を与えていない。 The speedup of the SAM method over the VΔC method can be divided into two groups based on the problem size. When the problem size is n≦800 and most of the D ^X for each loop instance fits in the cache, the SAM method is clearly more advantageous than the VΔC method in the range of PAR<10%. Moving the storage destination of the search data from the L2 cache to the L3 cache does not have a significant impact on the relative performance between the two methods.

（ハードウェア例）
以下、ＳＡＭ法を実現するためのハードウェア例を説明する。
なお、以下では、説明を簡略化するために、フロー行列と距離行列の両方が対称行列（対角成分が０（バイアスレス））であるものとして説明する。前述のように、このようなＱＡＰが、インスタンスの大部分であるし、計算を単純化するためである。対称行列を用いたＱＡＰは、非対称行列を用いたＱＡＰに直接的に変換可能である。 (Hardware example)
An example of hardware for implementing the SAM method will be described below.
In the following, for the sake of simplicity, it is assumed that both the flow matrix and the distance matrix are symmetric matrices (diagonal elements are 0 (biasless)). As mentioned above, this type of QAP is the majority of instances, and it simplifies calculations. A QAP using a symmetric matrix can be directly converted into a QAP using an asymmetric matrix.

対称行列のみを用いたＱＡＰの評価関数は、以下の式（４０）のように表せる。 The evaluation function for QAP using only symmetric matrices can be expressed as follows:

式（１）と式（４０）との違いは、式（１）の“ｊ＝１”が“ｊ＝ｉ”となっている点だけである。
式（４０）で表される評価関数を用いた場合、ＱＡＰの計算に用いられるΔＣ_ｅｘは式（１９）、式（２２）の代わりに、以下の式（４１）、式（４２）で表せる。 The only difference between equation (1) and equation (40) is that "j=1" in equation (1) becomes "j=i".
When the evaluation function expressed by equation (40) is used, ΔC _ex used in calculating QAP can be expressed by the following equations (41) and (42) instead of equations (19) and (22).

図１８は、ΔＣ生成回路の一例を示す図である。
ΔＣ生成回路３０は、フロー行列メモリ３１ａ、状態整列Ｄ行列メモリ３１ｂ、差分計算回路３２ａ，３２ｂ、マルチプレクサ３３ａ，３３ｂ、内積計算回路３４、レジスタ３５、乗算回路３６、加算回路３７を有する。これらは、たとえば、図２に示した記憶部１１または処理部１２に含まれる回路やメモリである。 FIG. 18 is a diagram illustrating an example of a ΔC generating circuit.
The ΔC generation circuit 30 includes a flow matrix memory 31 a, a state-ordered D matrix memory 31 b, difference calculation circuits 32 a and 32 b, multiplexers 33 a and 33 b, an inner product calculation circuit 34, a register 35, a multiplication circuit 36, and an addition circuit 37. These are, for example, circuits or memories included in the storage unit 11 or the processing unit 12 shown in FIG.

フロー行列メモリ３１ａは、フロー行列（Ｆ）を記憶する。
状態整列Ｄ行列メモリ３１ｂは、距離行列を更新したものである状態整列Ｄ行列（Ｄ^Ｘ）を記憶する。 The flow matrix memory 31a stores the flow matrix (F).
The state alignment D matrix memory 31b stores the state alignment D matrix (D ^X ) which is an updated distance matrix.

図１８の例では、フロー行列メモリ３１ａと状態整列Ｄ行列メモリ３１ｂは、２つのポートを有するデュアルポートメモリである。一部のＦＰＧＡではこのようなデュアルポートメモリを使用可能である。 In the example of Figure 18, the flow matrix memory 31a and the state-aligned D matrix memory 31b are dual-port memories with two ports. Such dual-port memories can be used in some FPGAs.

差分計算回路３２ａは、式（４２）のΔ^ｂ _ａＦであるＦ_ｂ，＊－Ｆ_ａ，＊を計算する。
差分計算回路３２ｂは、式（４２）のΔ^φ（ａ） _φ（ｂ）Ｄ^ＸであるＤ^Ｘ _{φ（ａ），＊}－Ｄ^Ｘ _{φ（ｂ），＊}を計算する。 The difference calculation circuit 32a calculates F _b,* - _{F a,*,} which is Δ ^b _a F in equation (42).
The difference calculation circuit 32b calculates D ^{X φ(a)} ,* - D ^X _φ(b) ,*, which is Δ _φ(a) ^φ _(b) D ^X in equation (42).

マルチプレクサ３３ａは、Ｆ_ａ，＊からｆ_ａ，ｂを選択して出力する。
マルチプレクサ３３ｂは、Ｄ^Ｘ _{φ（ａ），＊}からｄ_{φ（ａ），ｂ}を選択して出力する。
内積計算回路３４は、Δ^ｂ _ａＦとΔ^φ（ａ） _φ（ｂ）Ｄ^Ｘの内積を計算する。内積計算回路３４は、たとえば、並列に接続された複数の乗算器により実現できる。 The multiplexer 33a selects and outputs f _a,b from F _a,* .
The multiplexer 33b selects and outputs d _φ(a),b from D ^X _φ(a),* .
The dot product calculation circuit 34 calculates the dot product of Δ ^b _a F and Δ ^φ(a) _φ(b) D ^X. The dot product calculation circuit 34 can be realized by, for example, a plurality of multipliers connected in parallel.

レジスタ３５は、式（４２）に含まれる係数“２”を保持する。
乗算回路３６は２ｆ_ａ，ｂｄ_{φ（ａ），ｂ}を計算する。
加算回路３７は、内積の計算結果に２ｆ_ａ，ｂｄ_{φ（ａ），ｂ}を加算することでΔＣ_ｅｘを計算し、出力する。 The register 35 holds the coefficient "2" included in the equation (42).
The multiplication circuit 36 calculates 2f _a,b d _φ(a),b .
The adder circuit 37 calculates and outputs ΔC _ex by adding 2f _a,b d _φ(a),b to the calculation result of the inner product.

このようなハードウェアを用いたΔＣ_ｅｘの計算は、たとえば、図２の処理部１２に含まれる図示しない制御回路がプログラムを実行することで制御される。
前述のようにＳＡＭ法では、ΔＣ_ｅｘを発生させる要素の割当先の入れ替え（割当変更）が提案され、受け入れられると、状態整列Ｄ行列は、受け入れられた割当状態に対応するように、列が入れ替えられる。 The calculation of ΔC _ex using such hardware is controlled by, for example, a control circuit (not shown) included in the processing unit 12 of FIG. 2 executing a program.
As described above, in the SAM method, when a swap (allocation change) of the allocation destination of the element that generates ΔC _ex is proposed and accepted, the columns of the state alignment D matrix are swapped to correspond to the accepted assigned state.

図１９は、列の入れ替えを行うハードウェア構成の第１の例を示す図である。
図１９のように、状態整列Ｄ行列メモリ３１ｂに記憶されている状態整列Ｄ行列の列の入れ替えは、たとえば、マルチプレクサ４０ａ，４０ｂ、スイッチ４１を用いて行うことができる。これらの回路も図２の処理部１２に含みうる。 FIG. 19 illustrates a first example of a hardware configuration for performing column interchange.
19, the columns of the state order D matrix stored in the state order D matrix memory 31b can be interchanged using, for example, multiplexers 40a and 40b and a switch 41. These circuits may also be included in the processing unit 12 in FIG.

マルチプレクサ４０ａは、状態整列Ｄ行列メモリ３１ｂから入れ替えのために読み出される状態整列Ｄ行列の各行のうちの、第１の列の値を順に選択して出力する。
マルチプレクサ４０ｂは、状態整列Ｄ行列メモリ３１ｂから読み出される状態整列Ｄ行列の各行のうちの、第２の列の値を順に選択して出力する。 The multiplexer 40a sequentially selects and outputs the value of the first column of each row of the state alignment D matrix read out from the state alignment D matrix memory 31b for replacement.
The multiplexer 40b sequentially selects and outputs the value of the second column of each row of the state alignment D matrix read from the state alignment D matrix memory 31b.

スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、マルチプレクサ４０ａから出力される値を、マルチプレクサ４０ｂから出力される値が記憶されていた場所に記憶させるように記憶場所を入れ替えて書き込む。また、スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、マルチプレクサ４０ｂから出力される値を、マルチプレクサ４０ａから出力される値が記憶されていた場所に記憶させるように記憶場所を入れ替えて書き込む。 Switch 41 swaps the memory locations in state-aligned D-matrix memory 31b so that the value output from multiplexer 40a is stored in the location where the value output from multiplexer 40b was stored. Switch 41 also swaps the memory locations in state-aligned D-matrix memory 31b so that the value output from multiplexer 40b is stored in the location where the value output from multiplexer 40a was stored.

このようなハードウェアを用いた列の入れ替えは、たとえば、図２の処理部１２に含まれる図示しない制御回路がプログラムを実行することで制御される。
図２０は、列の入れ替え例を示す図である。図２０では、１列目と３列目の入れ替えが行われる例が示されている。 Such column switching using hardware is controlled by, for example, a control circuit (not shown) included in the processing unit 12 of FIG. 2 executing a program.
20 is a diagram showing an example of column interchange, in which the first and third columns are interchanged.

最初にｄ_１，３とｄ_１，１がマルチプレクサ４０ａ，４０ｂによって選択され、スイッチ４１によって記憶場所が入れ替えられる。次に、ｄ_２，３とｄ_２，１がマルチプレクサ４０ａ，４０ｂによって選択され、スイッチ４１によって記憶場所が入れ替えられる。同様の処理が合計ｎサイクル繰り返されることで、列の入れ替えが完了する。 First, d _1,3 and d _1,1 are selected by multiplexers 40a and 40b, and their storage locations are swapped by switch 41. Next, d _2,3 and d _2,1 are selected by multiplexers 40a and 40b, and their storage locations are swapped by switch 41. The same process is repeated a total of n cycles to complete the column swapping.

図２１は、列の入れ替えを行うハードウェア構成の第２の例を示す図である。図２１において、図１９に示した要素と同じ要素については同一符号が付されている。
図２１のように、状態整列Ｄ行列メモリ３１ｂには、状態整列Ｄ行列のほか、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）も記憶されている。状態整列Ｄ行列の列の入れ替えは、読み出される転置行列の要素を用いて、たとえば、前述のスイッチ４１のほか、シフトレジスタ４５ａ，４５ｂにより行うことができる。これらの回路も処理部１２に含みうる。 Fig. 21 is a diagram showing a second example of a hardware configuration for performing column interchange, in which the same elements as those shown in Fig. 19 are denoted by the same reference numerals.
21 , the state order D matrix memory 31b stores not only the state order D matrix but also the transposed matrix ((D ^X ) ^T ) of the state order D matrix. The columns of the state order D matrix can be swapped using the elements of the read transposed matrix, for example, by the aforementioned switch 41 and shift registers 45a and 45b. These circuits may also be included in the processing unit 12.

図２１の回路構成では、状態整列Ｄ行列メモリ３１ｂからは、状態整列Ｄ行列の入れ替えが行われる２列に対応する、転置行列の２行が読み出される。
シフトレジスタ４５ａは、状態整列Ｄ行列メモリ３１ｂから読み出される転置行列の２行のうちの、第１の行のｎ個の値を保持し、１サイクルずつシフトさせて１つずつ値を出力する。 In the circuit configuration of FIG. 21, two rows of the transposed matrix corresponding to the two columns of the state reordering D matrix to be transposed are read out from the state reordering D matrix memory 31b.
The shift register 45a holds n values in the first row of the two rows of the transposed matrix read from the state-ordered D-matrix memory 31b, and shifts them one cycle at a time to output the values one by one.

シフトレジスタ４５ｂは、状態整列Ｄ行列メモリ３１ｂから読み出される転置行列の２行のうちの、第２の行のｎ個の値を保持し、１サイクルずつシフトさせて１つずつ値を出力する。 Shift register 45b holds the n values in the second row of the two rows of the transposed matrix read from state-aligned D-matrix memory 31b, shifts them one cycle at a time, and outputs each value one by one.

スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、シフトレジスタ４５ａから出力される値を、シフトレジスタ４５ｂから出力される値が記憶されていた場所に記憶させるように記憶場所を切り替える。また、スイッチ４１は、状態整列Ｄ行列メモリ３１ｂにおいて、シフトレジスタ４５ｂから出力される値を、シフトレジスタ４５ａから出力される値が記憶されていた場所に記憶させるように記憶場所を切り替える。 Switch 41 switches the storage location in state-ordered D-matrix memory 31b so that the value output from shift register 45a is stored in the location where the value output from shift register 45b was stored. Switch 41 also switches the storage location in state-ordered D-matrix memory 31b so that the value output from shift register 45b is stored in the location where the value output from shift register 45a was stored.

このようなハードウェア構成によってもｎサイクルで列の入れ替えができる。また、図２１のハードウェア構成の場合、図１９のハードウェア構成よりも配線性が向上する可能性がある。また、比較的回路規模が大きくなる可能性がある図１９に示したようなマルチプレクサ４０ａ，４０ｂが不要になる。 This hardware configuration also allows columns to be swapped every n cycles. Furthermore, the hardware configuration of Figure 21 may have improved wiring efficiency compared to the hardware configuration of Figure 19. It also eliminates the need for multiplexers 40a and 40b as shown in Figure 19, which may result in a relatively large circuit size.

図２２は、列の入れ替えを行うハードウェア構成の第２の例の１つ目の変形例を示す図である。図２２において、図２１に示した要素と同じ要素については同一符号が付されている。 Figure 22 shows a first variant of the second example of a hardware configuration for column swapping. In Figure 22, elements that are the same as those shown in Figure 21 are designated by the same reference numerals.

図２２のハードウェア構成では、状態整列Ｄ行列メモリ３１ｂと分離した転置行列メモリ４６に、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）が記憶されている点が、図２１のハードウェア構成と異なっている。転置行列メモリ４６から、転置行列の上記２行が読み出される。その他の構成については、図２１と同様である。 The hardware configuration in Fig. 22 differs from the hardware configuration in Fig. 21 in that the transpose matrix ((D ^X ) ^T ) of the state ordered D matrix is stored in a transpose matrix memory 46 that is separate from the state ordered D matrix memory 31b. The above two rows of the transpose matrix are read from the transpose matrix memory 46. The other configurations are the same as those in Fig. 21.

図２３は、列の入れ替えを行うハードウェア構成の第２の例の２つ目の変形例を示す図である。図２３において、図２２に示した要素と同じ要素については同一符号が付されている。 Figure 23 shows a second variation of the second example of a hardware configuration for column swapping. In Figure 23, elements that are the same as those shown in Figure 22 are designated by the same reference numerals.

図２３のハードウェア構成では、図１８に示したフロー行列メモリ３１ａに、フロー行列のほか、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）も記憶されている。状態整列Ｄ行列の列の入れ替えは、フロー行列メモリ３１ａから読み出される転置行列の要素を用いて、前述のように、スイッチ４１、シフトレジスタ４５ａ，４５ｂにより行うことができる。 In the hardware configuration of Fig. 23, in addition to the flow matrix, the transposed matrix ((D ^X ) ^T ) of the state order D matrix is also stored in the flow matrix memory 31a shown in Fig. 18. The columns of the state order D matrix can be swapped by the switch 41 and shift registers 45a and 45b, as described above, using the elements of the transposed matrix read from the flow matrix memory 31a.

図２４は、列の入れ替えを行うハードウェア構成の第３の例を示す図である。図２４において、図１９に示した要素と同じ要素については同一符号が付されている。
図２４のハードウェア構成では、状態整列Ｄ行列は、２つの状態整列Ｄ行列メモリ３１ｂ１，３１ｂ２に記憶される。 Fig. 24 is a diagram showing a third example of a hardware configuration for performing column interchange, in which the same elements as those shown in Fig. 19 are denoted by the same reference numerals.
In the hardware configuration of FIG. 24, the state alignment D matrix is stored in two state alignment D matrix memories 31b1 and 31b2.

状態整列Ｄ行列メモリ３１ｂ１から読み出された状態整列Ｄ行列の各行の要素を用いて、状態整列Ｄ行列メモリ３１ｂ２に記憶されている状態整列Ｄ行列に対して、図１９のハードウェア構成を用いた場合と同様に列の入れ替えが行われる。 Using the elements of each row of the state alignment D matrix read from state alignment D matrix memory 31b1, columns of the state alignment D matrix stored in state alignment D matrix memory 31b2 are swapped in the same way as when the hardware configuration of Figure 19 is used.

列の入れ替え後の状態整列Ｄ行列は、状態整列Ｄ行列メモリ３１ｂ１にコピーされる。
このようなハードウェア構成によってもｎサイクルで列の入れ替えができる。また、図２４のハードウェア構成の場合、図１９のハードウェア構成よりも配線性が向上する可能性がある。 The state-ordered D matrix after the column exchange is copied to the state-ordered D matrix memory 31b1.
With this hardware configuration, columns can be swapped in n cycles. In addition, the hardware configuration of Fig. 24 may have improved wiring efficiency compared to the hardware configuration of Fig. 19.

図２５は、列の入れ替えを行うハードウェア構成の第３の例の変形例を示す図である。図２５において、図１９に示した要素と同じ要素については同一符号が付されている。
図２５のハードウェア構成では、図１８に示したフロー行列メモリ３１ａに、フロー行列のほか、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）も記憶されている。状態整列Ｄ行列の列の入れ替えは、フロー行列メモリ３１ａから読み出される転置行列の要素を用いて、前述のように、マルチプレクサ４０ａ，４０ｂ、スイッチ４１により行うことができる。 25 is a diagram showing a modification of the third example of the hardware configuration for performing column interchange, in which the same elements as those shown in FIG. 19 are denoted by the same reference numerals.
In the hardware configuration of Fig. 25, in addition to the flow matrix, the transposed matrix ((D ^X ) ^T ) of the state order D matrix is also stored in the flow matrix memory 31a shown in Fig. 18. The columns of the state order D matrix can be swapped by the multiplexers 40a and 40b and the switch 41, as described above, using the elements of the transposed matrix read from the flow matrix memory 31a.

図２６は、ΔＣ生成回路の他の例を示す図である。図２６において、図１８に示した要素と同じ要素については同一符号が付されている。
図２６のΔＣ生成回路５０では、フロー行列メモリ５１ａ、状態整列Ｄ行列メモリ５１ｂとして、シングルポートメモリが用いられている。この場合、Ｆ_ａ，＊を保持し、フロー行列メモリ５１ａからＦ_ｂ，＊が読み出されるタイミングでＦ_ａ，＊を出力し、差分計算回路３２ａとマルチプレクサ３３ａに供給するレジスタ５２ａが用いられる。また、Ｄ^Ｘ _{φ（ａ），＊}を保持し、状態整列Ｄ行列メモリ５１ｂからＤ^Ｘ _{φ（ｂ），＊}が読み出されるタイミングでＤ^Ｘ _{φ（ａ），＊}を出力し、差分計算回路３２ｂとマルチプレクサ３３ｂに供給するレジスタ５２ｂが用いられる。 Fig. 26 is a diagram showing another example of a ΔC generating circuit, in which the same elements as those shown in Fig. 18 are denoted by the same reference numerals.
26, single-port memories are used as the flow matrix memory 51a and the state-ordered D matrix memory 51b. In this case, a register 52a is used that holds F _a,* , outputs F _a, * when F _b,* is read from the flow matrix memory 51a, and supplies it to the difference calculation circuit 32a and the multiplexer 33a. Also, a register 52b is used that holds D ^x _φ(a),* , outputs D ^x φ(a),* when D ^x φ ₍ _b),* is read from the state-ordered D matrix memory 51b, and supplies it to the difference calculation circuit 32b and the multiplexer 33b.

その他の構成については、図１８と同様である。
ところで、図２１～図２４に示した列交換のためのハードウェア構成を、図１８または図２６に示したΔＣ生成回路３０，５０に適用する場合、２つのレプリカについての処理を同時に実行することが可能となる。 The other configurations are the same as those in FIG.
Incidentally, when the hardware configuration for column exchange shown in FIGS. 21 to 24 is applied to the ΔC generation circuits 30 and 50 shown in FIG. 18 or 26, it becomes possible to simultaneously execute processing for two replicas.

図２７は、２つのレプリカについての処理を行うハードウェア構成の例を示す図である。図２７には、図２６に示したΔＣ生成回路５０に図２２に示した列交換のためのハードウェア構成を組み合わせた構成が示されている。 Figure 27 shows an example of a hardware configuration for processing two replicas. Figure 27 shows a configuration that combines the ΔC generation circuit 50 shown in Figure 26 with the hardware configuration for column swapping shown in Figure 22.

図２７の例では、状態整列Ｄ行列メモリ５１ｂには、２つのレプリカ（Ｒ１、Ｒ２）の状態整列Ｄ行列（Ｄ^Ｘ _Ｒ１、Ｄ^Ｘ _Ｒ２）が記憶されている。
１つのレプリカで割当変更が受け入れられ、状態整列Ｄ行列メモリ５１ｂの書き込みポートを使用して１つのレプリカについて列交換が行われる。その間、他方のレプリカについて、状態整列Ｄ行列メモリ５１ｂの読み出しポートから読み出される状態整列Ｄ行列の要素に基づいて、ΔＣ_ｅｘを計算する処理が行われる。 In the example of FIG. 27, the state alignment D matrix memory 51b stores state alignment D matrices (D ^x _R1 , D ^x _R2 ) of two replicas (R1, R2).
One replica accepts the assignment change and performs column swapping for that replica using the write port of the state-aligned D-matrix memory 51 b, while the other replica performs a process of calculating ΔC _ex based on the elements of the state-aligned D-matrix read from the read port of the state-aligned D-matrix memory 51 b.

なお、どちらのレプリカも更新処理を実行していない場合、一方のレプリカについての処理は、他方のレプリカについてのΔＣ_ｅｘの計算が行われている間は、ストールされる。 Note that if neither replica is currently performing an update, processing for one replica will be stalled while ΔC _ex is being calculated for the other replica.

図２８は、ΔＣ生成回路の他の例を示す図である。図２８において、図１８に示した要素と同じ要素については同一符号が付されている。
対称順列の問題のみが計算対象となり、ΔＣ_ｅｘが、以下の式（４３）で表される場合、図２８に示すようなΔＣ生成回路６０を用いることもできる。 Fig. 28 is a diagram showing another example of a ΔC generating circuit, in which the same elements as those shown in Fig. 18 are denoted by the same reference numerals.
When only the problem of symmetric permutation is the object of calculation and ΔC _ex is expressed by the following equation (43), a ΔC generating circuit 60 as shown in FIG. 28 can also be used.

ΔＣ生成回路６０では、図１８に示したようなマルチプレクサ３３ａ，３３ｂなどが不要になり、その代わり、セレクタ６１が設けられている。
たとえば、差分計算回路３２ｂは、ｄ_{φ（ａ），ｉ}－ｄ_{φ（ｂ），ｉ}を計算する減算器３２ｂｉをｎ個有し、セレクタ６１は、減算器３２ｂｉの出力と０との一方を選択して出力する２入力１出力のマルチプレクサ６１ａｉをｎ個有する。 In the ΔC generating circuit 60, the multiplexers 33a and 33b shown in FIG. 18 are not necessary, and instead a selector 61 is provided.
For example, the difference calculation circuit 32b has n subtractors 32bi that calculate d _φ(a),i −d _φ(b),i , and the selector 61 has n two-input, one-output multiplexers 61ai that select and output either the output of the subtractor 32bi or 0.

マルチプレクサ６１ａｉは、式（４３）において、ｉ＝ａ，ｂとなる場合に０を出力する。このような場合に、０を出力する方法は、マルチプレクサを用いる方法以外の他の方法でもよい。 Multiplexer 61ai outputs 0 when i = a, b in equation (43). In such cases, the method of outputting 0 may be other than using a multiplexer.

（レプリカ処理回路）
各レプリカについての処理を行う回路は、前述したΔＣ生成回路４０，５０，６０の何れかと、状態整列Ｄ行列の列の入れ替えを行う前述のいくつかのハードウェア構成の何れかと、を組み合わせることで実現できる。 (Replica processing circuit)
The circuit that performs processing for each replica can be realized by combining any of the ΔC generation circuits 40, 50, and 60 described above with any of the several hardware configurations described above that perform column interchange of the state alignment D matrix.

図２９は、レプリカ処理回路の一例を示す図である。図２９には、図２６に示したΔＣ生成回路５０に図１９に示した列交換のためのハードウェア構成を組み合わせたレプリカ処理回路７０の例が示されている。図２９において、図１９と図２６に示した要素と同じ要素については同一符号が付されている。 Figure 29 is a diagram showing an example of a replica processing circuit. Figure 29 shows an example of a replica processing circuit 70 that combines the ΔC generation circuit 50 shown in Figure 26 with the hardware configuration for column swapping shown in Figure 19. In Figure 29, elements that are the same as those shown in Figures 19 and 26 are assigned the same reference numerals.

図２９の例では、マルチプレクサ３３ｂが図１９に示したマルチプレクサ４０ａの機能も有している。
これらの回路は、図２の記憶部１１または処理部１２に含みうる。 In the example of FIG. 29, the multiplexer 33b also has the function of the multiplexer 40a shown in FIG.
These circuits may be included in the memory unit 11 or the processing unit 12 in FIG.

なお、割当状態を表す整数割当ベクトルφを記憶する構成や、計算されたΔＣ_ｅｘを生じさせる割当変更の提案を受け入れるか否かの判定を行う構成などは、図示が省略されている。 Note that the configuration for storing the integer allocation vector φ representing the allocation state, and the configuration for determining whether or not to accept a proposal for an allocation change that will result in the calculated ΔC _ex are not shown in the figure.

（非対称行列を用いた場合のＱＡＰへの拡張）
非対称行列（対角成分が非零）を用いたＱＡＰの計算では、対称行列を用いた場合のΔＣ_ｅｘに相当するΔＣ_ａｓｙｍは、以下の式（４４）で表される。 (Extension to QAP when using asymmetric matrices)
In the calculation of QAP using an asymmetric matrix (diagonal elements are non-zero), ΔC _asym , which corresponds to ΔC _ex when a symmetric matrix is used, is expressed by the following equation (44).

このような、ΔＣ_ａｓｙｍを計算するレプリカ処理回路は、ΔＣ_ｅｘを計算するレプリカ処理回路を用いて実現できる。
図３０は、非対称行列を用いたＱＡＰの計算で用いられるレプリカ処理回路の一例を示す図である。図３０において、図２９に示した要素と同じ要素については同一符号が付されている。 Such a replica processing circuit for calculating ΔC _asym can be realized using a replica processing circuit for calculating ΔC _ex .
30 is a diagram showing an example of a replica processing circuit used in calculating QAP using an asymmetric matrix. In FIG. 30, elements that are the same as those shown in FIG. 29 are assigned the same reference numerals.

ΔＣ_ａｓｙｍを計算するレプリカ処理回路８０は、図２９に示したΔＣ_ｅｘを計算するレプリカ処理回路７０ａ１，７０ａ２を２つ有する。ただし、一方のレプリカ処理回路７０ａ２のフロー行列メモリ５１ａには、フロー行列の転置行列（Ｆ^Ｔ）が記憶されており、レプリカ処理回路７０ａ２の状態整列Ｄ行列メモリ５１ｂには、状態整列Ｄ行列の転置行列（（Ｄ^Ｘ）^Ｔ）が記憶されている。 Replica processing circuit 80 that calculates ΔC _asym has two replica processing circuits 70a1 and 70a2 that calculate ΔC _ex shown in Fig. 29. However, the flow matrix memory 51a of one of the replica processing circuits, 70a2, stores the transposed matrix (F ^T ) of the flow matrix, and the state order D matrix memory 51b of replica processing circuit 70a2 stores the transposed matrix ((D ^X ) ^T ) of the state order D matrix.

レプリカ処理回路８０は、さらに、メモリ８１ａ，８１ｂ、レジスタ８２ａ，８２ｂ、差分計算回路８３ａ，８３ｂ、乗算回路８４、補償項計算回路８５、加算回路８６を有する。 The replica processing circuit 80 further includes memories 81a and 81b, registers 82a and 82b, difference calculation circuits 83a and 83b, a multiplication circuit 84, a compensation term calculation circuit 85, and an addition circuit 86.

メモリ８１ａは、フロー行列の対角成分（Ｆ_ｄ）を記憶し、メモリ８１ｂは、距離行列の対角成分（Ｄ_ｄ）を記憶する。非対称行列を用いた場合、対称行列を用いた場合と異なり、これらの対角成分は非零の値を含みうる。 The memory 81a stores the diagonal elements of the flow matrix ( _Fd ), and the memory 81b stores the diagonal elements of the distance matrix ( _Dd ). When an asymmetric matrix is used, unlike when a symmetric matrix is used, these diagonal elements may contain non-zero values.

レジスタ８２ａは、メモリ８１ａから読み出されるｆ_ａ，ａまたはｆ_ｂ，ｂの一方を保持し、メモリ８１ａからｆ_ａ，ａまたはｆ_ｂ，ｂの他方が読み出されるタイミングで、ｆ_ａ，ａまたはｆ_ｂ，ｂの一方を出力し、差分計算回路８３ａに供給する。 The register 82a holds either f _a,a or f _b,b read from the memory 81a, and outputs either f _a,a or f _b,b at the timing when the other of f _a,a or f _b,b is read from the memory 81a, and supplies it to the difference calculation circuit 83a.

レジスタ８２ｂは、メモリ８１ｂから読み出されるｄ_{φ（ａ），φ（ａ）}またはｄ_{φ（ｂ），φ（ｂ）}の一方を保持する。そして、レジスタ８２ｂは、メモリ８１ｂからｄ_{φ（ａ），φ（ａ）}またはｄ_{φ（ｂ），φ（ｂ）}の他方が読み出されるタイミングで、ｄ_{φ（ａ），φ（ａ）}またはｄ_{φ（ｂ），φ（ｂ）}の一方を出力し、差分計算回路８３ｂに供給する。 The register 82b holds either _{dφ(a), φ(a)} or dφ _{(b), φ(b)} read from the memory 81b, and outputs either _{dφ(a), φ(a) or dφ(b)} _{, φ(b)} at the timing when the other of _{dφ(a), φ(a)} or _{dφ(b), φ(b)} is read from the memory 81b, and supplies it to the difference calculation circuit 83b.

差分計算回路８３ａは、式（４４）のｆ_ｂ，ｂ－ｆ_ａ，ａを計算する。
差分計算回路８３ｂは、式（４４）のｄ_{φ（ａ），φ（ａ）}－ｄ_{φ（ｂ），φ（ｂ）}を計算する。 The difference calculation circuit 83a calculates f _b,b −f _a,a in equation (44).
The difference calculation circuit 83b calculates d _{φ(a), φ(a)} −d _{φ(b), φ(b)} in equation (44).

乗算回路８４は、差分計算回路８３ａ，８３ｂの計算結果に基づいて、式（４４）の（ｆ_ｂ，ｂ－ｆ_ａ，ａ）（ｄ_{φ（ａ），φ（ａ）}－ｄ_{φ（ｂ），φ（ｂ）}）を計算する。
補償項計算回路８５は、式（４４）の右辺の４項目である補償項（ｉ＝ａ，ｂの計算をスキップするという条件をなくしたことによる補償を行う項）を計算する。補償項計算回路８５は、補償項を計算するために、レプリカ処理回路７０ａ１のマルチプレクサ３３ａ，３３ｂからｆ_ａ，ｂ、ｄ_{φ（ａ），φ（ｂ）}を受け、レプリカ処理回路７０ａ２のマルチプレクサ３３ａ，３３ｂからｆ_ｂ，ａ、ｄ_{φ（ｂ），φ（ａ）}を受ける。 The multiplication circuit 84 calculates (f _b,b − _{f a,a} ) (d _{φ(a),φ(a) − d φ(b),} _{φ(b) ) in equation (44) based on the calculation results of the difference calculation circuits 83 a and 83 b} .
Compensation term calculation circuit 85 calculates the compensation terms (terms that compensate for the elimination of the condition that the calculation of i = a, b is skipped), which are the four items on the right side of equation (44). To calculate the compensation terms, compensation term calculation circuit 85 receives f _a,b , d _{φ(a), and φ(b)} from multiplexers 33 a and 33 b in replica processing circuit 70 a 1, and receives f b _,a , d φ(b _{), and φ(a)} from multiplexers 33 a and 33 b in replica processing circuit 70 a 2.

加算回路８６は、レプリカ処理回路７０ａ１の内積計算回路３４から式（４４）の右辺の１項目の値を受け、レプリカ処理回路７０ａ２の内積計算回路３４から式（４４）の右辺の２項目の値を受ける。さらに加算回路８６は、乗算回路８４から、式（４４）の右辺の３項目の値、補償項計算回路８５から、式（４４）の右辺の４項目の値を受ける。そして、加算回路８６はこれらの和を計算することでΔＣ_ａｓｙｍを生成し、出力する。 Adder circuit 86 receives the value of one item on the right-hand side of equation (44) from dot product calculation circuit 34 of replica processing circuit 70a1, and receives the values of two items on the right-hand side of equation (44) from dot product calculation circuit 34 of replica processing circuit 70a2. Adder circuit 86 also receives the values of three items on the right-hand side of equation (44) from multiplier circuit 84, and the value of four items on the right-hand side of equation (44) from compensation term calculation circuit 85. Adder circuit 86 then calculates the sum of these values to generate and output ΔC _asym .

これらの回路やメモリなどは、図２の記憶部１１または処理部１２に含みうる。
なお、割当状態を表す整数割当ベクトルφを記憶する構成、計算されたΔＣ_ｅｘを生じさせる割当変更の提案を受け入れるか否かの判定を行う構成などは、図示が省略されている。 These circuits, memories, etc. may be included in the storage unit 11 or the processing unit 12 in FIG.
Note that the configuration for storing the integer allocation vector φ representing the allocation state, the configuration for determining whether or not to accept a proposal for an allocation change that will result in the calculated ΔC _ex , and the like are omitted from the illustration.

以上のようなハードウェア構成により、ＱＡＰに対してＳＡＭ法による局所探索を行うことができる。ＱＳＡＰに対してＳＡＭ法による局所探索を行う構成も、上記のハードウェア構成を適宜変更することで実現できる。たとえば、式（２４）の２行目の計算を行うための演算回路（乗算回路や加算回路など）が追加される。 With the above hardware configuration, it is possible to perform a local search using the SAM method on a QAP. A configuration for performing a local search using the SAM method on a QSAP can also be realized by appropriately modifying the above hardware configuration. For example, an arithmetic circuit (such as a multiplication circuit or addition circuit) is added to perform the calculation on the second line of equation (24).

なお、上記の処理内容（たとえば、図６～図８、図１０、図１１など）は、データ処理装置１０にプログラムを実行させることでソフトウェアにて実現できる。
プログラムは、コンピュータ読み取り可能な記録媒体に記録しておくことができる。記録媒体として、たとえば、磁気ディスク、光ディスク、光磁気ディスク、半導体メモリなどを使用できる。磁気ディスクには、フレキシブルディスク（ＦＤ：Flexible Disk）及びＨＤＤが含まれる。光ディスクには、ＣＤ（Compact Disc）、ＣＤ－Ｒ（Recordable）／ＲＷ（Rewritable）、ＤＶＤ（Digital Versatile Disc）及びＤＶＤ－Ｒ／ＲＷが含まれる。プログラムは、可搬型の記録媒体に記録されて配布されることがある。その場合、可搬型の記録媒体から他の記録媒体にプログラムをコピーして実行してもよい。 The above-described processing contents (for example, FIGS. 6 to 8, 10, 11, etc.) can be realized by software by causing the data processing device 10 to execute a program.
The program can be recorded on a computer-readable recording medium. Examples of recording media that can be used include magnetic disks, optical disks, magneto-optical disks, and semiconductor memories. Magnetic disks include flexible disks (FDs) and HDDs. Optical disks include compact discs (CDs), recordable/rewritable CD-Rs/RWs, digital versatile discs (DVDs), and DVD-Rs/RWs. The program may be recorded on a portable recording medium and distributed. In this case, the program may be copied from the portable recording medium to another recording medium and executed.

図３１は、データ処理装置の一例であるコンピュータのハードウェア例を示す図である。
コンピュータ９０は、ＣＰＵ９１、ＲＡＭ９２、ＨＤＤ９３、ＧＰＵ９４、入力インタフェース９５、媒体リーダ９６及び通信インタフェース９７を有する。上記ユニットは、バスに接続されている。 FIG. 31 is a diagram illustrating an example of hardware of a computer, which is an example of a data processing device.
The computer 90 includes a CPU 91, a RAM 92, a HDD 93, a GPU 94, an input interface 95, a media reader 96, and a communication interface 97. The above units are connected to a bus.

ＣＰＵ９１は、プログラムの命令を実行する演算回路を含むプロセッサである。ＣＰＵ９１は、ＨＤＤ９３に記憶されたプログラムやデータの少なくとも一部をＲＡＭ９２にロードし、プログラムを実行する。なお、ＣＰＵ９１は、たとえば、図４に示したように、複数のレプリカの処理を並列に実行するために、複数のプロセッサコアを備えてもよい。また、コンピュータ９０は複数のプロセッサを備えてもよい。なお、複数のプロセッサの集合（マルチプロセッサ）を「プロセッサ」と呼んでもよい。 The CPU 91 is a processor including an arithmetic circuit that executes program instructions. The CPU 91 loads at least a portion of the program and data stored in the HDD 93 into the RAM 92 and executes the program. The CPU 91 may include multiple processor cores to execute the processing of multiple replicas in parallel, for example, as shown in FIG. 4. The computer 90 may also include multiple processors. A collection of multiple processors (a multiprocessor) may also be called a "processor."

ＲＡＭ９２は、ＣＰＵ９１が実行するプログラムやＣＰＵ９１が演算に用いるデータを一時的に記憶する揮発性の半導体メモリである。なお、コンピュータ９０は、ＲＡＭ９２以外の種類のメモリを備えてもよく、複数個のメモリを備えてもよい。 RAM 92 is a volatile semiconductor memory that temporarily stores programs executed by CPU 91 and data used by CPU 91 for calculations. Note that computer 90 may be equipped with other types of memory besides RAM 92, or may be equipped with multiple memories.

ＨＤＤ９３は、ＯＳ（Operating System）やミドルウェアやアプリケーションソフトウェアなどのソフトウェアのプログラム、及び、データを記憶する不揮発性の記憶装置である。プログラムには、たとえば、前述のような割当問題の解を探索する処理をコンピュータ９０に実行させるプログラムが含まれる。なお、コンピュータ９０は、フラッシュメモリやＳＳＤ（Solid State Drive）などの他の種類の記憶装置を備えてもよく、複数の不揮発性の記憶装置を備えてもよい。 The HDD 93 is a non-volatile storage device that stores software programs such as the OS (Operating System), middleware, and application software, as well as data. The programs include, for example, a program that causes the computer 90 to execute a process for searching for a solution to the assignment problem described above. The computer 90 may also be equipped with other types of storage devices, such as flash memory or an SSD (Solid State Drive), or may be equipped with multiple non-volatile storage devices.

ＧＰＵ９４は、ＣＰＵ９１からの命令にしたがって、コンピュータ９０に接続されたディスプレイ９４ａに画像（たとえば、割当問題の計算結果などを表す画像）を出力する。ディスプレイ９４ａとしては、ＣＲＴ（Cathode Ray Tube）ディスプレイ、液晶ディスプレイ（ＬＣＤ：Liquid Crystal Display）、プラズマディスプレイ（ＰＤＰ：Plasma Display Panel）、有機ＥＬ（ＯＥＬ：Organic Electro-Luminescence）ディスプレイなどを用いることができる。 In accordance with instructions from the CPU 91, the GPU 94 outputs an image (for example, an image showing the calculation results of an assignment problem) to a display 94a connected to the computer 90. The display 94a can be a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD), a plasma display panel (PDP), an organic electro-luminescence (OEL) display, or the like.

入力インタフェース９５は、コンピュータ９０に接続された入力デバイス９５ａから入力信号を取得し、ＣＰＵ９１に出力する。入力デバイス９５ａとしては、マウスやタッチパネルやタッチパッドやトラックボールなどのポインティングデバイス、キーボード、リモートコントローラ、ボタンスイッチなどを用いることができる。また、コンピュータ９０に、複数の種類の入力デバイスが接続されていてもよい。 The input interface 95 receives input signals from an input device 95a connected to the computer 90 and outputs them to the CPU 91. Examples of the input device 95a include pointing devices such as a mouse, touch panel, touchpad, or trackball, a keyboard, a remote controller, and button switches. Multiple types of input devices may also be connected to the computer 90.

媒体リーダ９６は、記録媒体９６ａに記録されたプログラムやデータを読み取る読み取り装置である。記録媒体９６ａとして、たとえば、磁気ディスク、光ディスク、光磁気ディスク（ＭＯ：Magneto-Optical disk）、半導体メモリなどを使用できる。磁気ディスクには、ＦＤやＨＤＤが含まれる。光ディスクには、ＣＤやＤＶＤが含まれる。 The media reader 96 is a reading device that reads programs and data recorded on a recording medium 96a. Examples of recording media 96a that can be used include magnetic disks, optical disks, magneto-optical disks (MO: Magneto-Optical disks), and semiconductor memories. Magnetic disks include floppy disks and HDDs. Optical disks include CDs and DVDs.

媒体リーダ９６は、たとえば、記録媒体９６ａから読み取ったプログラムやデータを、ＲＡＭ９２やＨＤＤ９３などの他の記録媒体にコピーする。読み取られたプログラムは、たとえば、ＣＰＵ９１によって実行される。なお、記録媒体９６ａは、可搬型記録媒体であってもよく、プログラムやデータの配布に用いられることがある。また、記録媒体９６ａやＨＤＤ９３を、コンピュータ読み取り可能な記録媒体ということがある。 The media reader 96 copies programs and data read from the recording medium 96a to other recording media such as the RAM 92 and HDD 93. The read programs are executed by the CPU 91, for example. The recording medium 96a may be a portable recording medium and may be used to distribute programs and data. The recording medium 96a and HDD 93 may also be referred to as computer-readable recording media.

通信インタフェース９７は、ネットワーク９７ａに接続され、ネットワーク９７ａを介して他の情報処理装置と通信を行うインタフェースである。通信インタフェース９７は、スイッチなどの通信装置とケーブルで接続される有線通信インタフェースでもよいし、基地局と無線リンクで接続される無線通信インタフェースでもよい。 The communication interface 97 is connected to the network 97a and communicates with other information processing devices via the network 97a. The communication interface 97 may be a wired communication interface connected to a communication device such as a switch via a cable, or a wireless communication interface connected to a base station via a wireless link.

以上、実施の形態に基づき、本発明のプログラム、データ処理装置及びデータ処理方法の一観点について説明してきたが、これらは一例にすぎず、上記の記載に限定されるものではない。 The above describes one aspect of the program, data processing device, and data processing method of the present invention based on an embodiment, but these are merely examples and are not limited to the above description.

たとえば、上記の説明では、割当状態に応じて距離行列の列を並べ替えるものとしたが、割当状態に応じた距離行列の行を並べ替えても、適宜式を変形すれば同様の作用効果が得られる。 For example, in the above explanation, the columns of the distance matrix are rearranged according to the allocation status, but the same effect can be achieved by rearranging the rows of the distance matrix according to the allocation status, by modifying the formula appropriately.

１０データ処理装置
１１記憶部
１２処理部 10 Data processing device 11 Storage unit 12 Processing unit

Claims

A program that causes a computer to execute a process of searching for a solution to an assignment problem by local search using an evaluation function that represents a cost according to an assignment state,
calculating, using vector arithmetic operations, a first change in the evaluation function when a first allocation change occurs in which allocation destinations of a first element and a second element among the plurality of elements are swapped, based on a flow matrix representing flow amounts between a plurality of elements allocated to a plurality of allocation destinations and a distance matrix representing distances between the plurality of allocation destinations, both of which are stored in a memory;
determining whether to allow the first allocation change based on the first change amount;
If it is determined that the first allocation change is permitted, the allocation state is updated, and the distance matrix is updated so that two columns or two rows corresponding to the first element and the second element are swapped.
A program that causes the computer to execute a process.

The program of claim 1 causes the computer to execute a process that, when the assignment problem is QAP, determines whether to accept the first assignment change based on a comparison result between an acceptance probability calculated based on the first change amount and the value of a temperature parameter and a random value.

Before calculating the first change amount,
When the assignment problem is QSAP, a second change in the evaluation function is calculated when a second assignment change occurs in which the first element is assigned to a first assignment destination;
determining whether to accept the second allocation change based on a comparison result between an acceptance probability calculated based on the second change amount and the value of the temperature parameter and a random value;
If it is determined that the second allocation change is permitted, the allocation state and the distance matrix are updated;
After calculating the first change amount,
determining whether or not to permit the first allocation change based on a comparison result between the first change amount and a predetermined value;
The program according to claim 1 , which causes the computer to execute a process.

reading the distance matrix row by row from the memory;
Select two values in the two columns included in the read row;
writing the two values to the memory by swapping their storage locations;
Repeat the process to swap the two columns.
The program according to claim 1 , which causes the computer to execute a process.

the memory further stores a transpose of the distance matrix;
storing a first row of the transposed matrix in a first shift register and a second row of the transposed matrix in a second shift register, the first row corresponding to the two columns of the distance matrix;
the storage locations of two values output from the first shift register and the second shift register are swapped and written into the memory, and the process is repeated to swap the two columns;
The program according to claim 1 , which causes the computer to execute a process.

the memory includes a first memory and a second memory in which the distance matrix is stored;
reading the distance matrix row by row from the first memory;
Select two values in the two columns included in the read row;
writing the two values to the second memory by swapping their storage locations;
Repeat the process to swap the two columns.
The program according to claim 1 , which causes the computer to execute a process.

the local search is performed using parallel tempering with a plurality of replicas, each replica having a different temperature parameter value;
the memory stores the distance matrices of a first replica and a second replica among the plurality of replicas;
calculating the first change amount based on the distance matrix of the second replica while updating the distance matrix of the first replica;
The program according to claim 1 , which causes the computer to execute a process.

A data processing device that searches for a solution to an assignment problem by local search using an evaluation function that represents a cost according to an assignment state,
a storage unit that stores a flow matrix that represents flow amounts between a plurality of elements that are allocated to a plurality of allocation destinations, and a distance matrix that represents distances between the plurality of allocation destinations;
a processing unit that calculates, based on the flow matrix and the distance matrix, a first change amount of the evaluation function using vector arithmetic operation when a first allocation change occurs in which allocation destinations of a first element and a second element among the plurality of elements are swapped, determines whether or not to allow the first allocation change based on the first change amount, and when it is determined that the first allocation change is allowed, updates the allocation state and updates the distance matrix so that two columns or two rows corresponding to the first element and the second element are swapped;
A data processing device having:

A data processing method for searching for a solution to an assignment problem by local search using an evaluation function that represents a cost according to an assignment state, comprising:
The computer
calculating, using vector arithmetic operations, a first change in the evaluation function when a first allocation change occurs in which allocation destinations of a first element and a second element among the plurality of elements are swapped, based on a flow matrix representing flow amounts between a plurality of elements allocated to a plurality of allocation destinations and a distance matrix representing distances between the plurality of allocation destinations, both of which are stored in a memory;
determining whether to allow the first allocation change based on the first change amount;
If it is determined that the first allocation change is permitted, the allocation state is updated, and the distance matrix is updated so that two columns or two rows corresponding to the first element and the second element are swapped.
Data processing methods.