JPH0421900B2

Info

Publication number: JPH0421900B2
Application number: JP59034450A
Authority: JP
Inventors: Junichi Takahashi; Sanshiro Hatsutori; Takashi Kimura; Atsushi Iwata
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1984-02-27
Filing date: 1984-02-27
Publication date: 1992-04-14
Also published as: JPS60179871A

Description

[Detailed Description of the Invention]

[Technical Field of the Invention]

The present invention relates to an array processor used to execute operations over all combinations of two kinds of variables, together with recurrence-formula operations with local data dependence that use the results of those operations. A typical example is the matching operation based on dynamic programming that is used for pattern matching in speech recognition or character recognition.

[Prior art]

As a representative example, a matching operation based on dynamic programming is shown below. It consists of an operation between two kinds of vector variables and a recurrence formula that accumulates the results of that operation.

$$D_{ij} = \lVert c_i - r_j \rVert^2 = \sum_{k=1}^{m} \lvert c_k^i - r_k^j \rvert^2 \quad \cdots (1)$$

$$S_{ij} = D_{ij} + \min\{S_{i,j-1},\ S_{i-1,j-1},\ S_{i-1,j}\} \quad \cdots (2)$$

$$S_{0j} = \infty, \quad S_{i0} = \infty, \quad S_{00} = 0 \quad \cdots (3)$$

Here, $c_i$ and $r_j$ are the $i$-th and $j$-th elements of the sequence of $I$ vectors $C = \{c_1, c_2, \ldots, c_I\}$ and the sequence of $N$ vectors $R = \{r_1, r_2, \ldots, r_N\}$, respectively. $m$ denotes the order of each vector, so that $c_i = (c_1^i, c_2^i, \ldots, c_m^i)$ and $r_j = (r_1^j, r_2^j, \ldots, r_m^j)$. $D_{ij}$ and $S_{ij}$ denote the inter-vector distance and the cumulative distance, respectively. Equation (3) is the initial condition of recurrence (2).
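As a point of reference for the hardware schemes discussed below, equations (1) to (3) can be evaluated sequentially in a few lines. The following is only a minimal sketch: the function name `dp_matching` and the sample vectors are illustrative assumptions, not part of the patent.

```python
import math

def dp_matching(C, R):
    """Return the cumulative-distance table S for vector sequences C and R.

    S has (I + 1) x (N + 1) entries; row 0 and column 0 hold the
    initial conditions of equation (3).
    """
    I, N = len(C), len(R)
    S = [[math.inf] * (N + 1) for _ in range(I + 1)]
    S[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, N + 1):
            # Equation (1): squared Euclidean distance between c_i and r_j.
            d = sum((ck - rk) ** 2 for ck, rk in zip(C[i - 1], R[j - 1]))
            # Equation (2): add the smallest of the three predecessors.
            S[i][j] = d + min(S[i][j - 1], S[i - 1][j - 1], S[i - 1][j])
    return S

# Hypothetical two-dimensional sample data (m = 2).
C = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
R = [(0.0, 0.0), (2.0, 2.0)]
S = dp_matching(C, R)
print(S[len(C)][len(R)])  # final cumulative result S_{I,N}
```

Every grid point $(i, j)$ depends only on its three neighbours at $(i, j-1)$, $(i-1, j-1)$ and $(i-1, j)$, which is the local data dependence that the array processors described in this document exploit.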

Conventionally, an array processor capable of processing this kind of operation in parallel has been built by arranging $(I \times N)$ processing elements (hereinafter abbreviated as PEs) in two dimensions, where $I$ and $N$ are the numbers of data in the two vector sequences. FIG. 1 shows this configuration, and FIGS. 2 to 5 show an example of its operation. In FIG. 1, 100 denotes a PE, 200 a data bus, and 300 a control bus; 400 denotes an input terminal and 500 an output terminal.

Each PE 100 has means for executing the inter-vector distance calculation of equation (1), which consists of product-sum operations, and the comparison and accumulation of equation (2). It also has means for exchanging comparison results, cumulative results $S_{ij}$, and vector data $c_i$, $r_j$ with adjacent PEs. If each PE is labeled with its position in the two-dimensional array, so that the PE at position $(i, j)$ is written $\mathrm{PE}_{ij}$, equations (1), (2), and (3) can be executed by the following operations.

① Input the two vector data $c_i$ and $r_j$ from the left-adjacent $\mathrm{PE}_{i-1,j}$ and the lower-adjacent $\mathrm{PE}_{i,j-1}$, respectively (or from the input terminals at the left edge and the bottom edge), and obtain the distance between them by executing equation (1).

② Transfer the vector data $c_i$ and $r_j$ to the right-adjacent $\mathrm{PE}_{i+1,j}$ and the upper-adjacent $\mathrm{PE}_{i,j+1}$, respectively.

③ Input the cumulative result $S_{i-1,j}$ from the left-adjacent $\mathrm{PE}_{i-1,j}$ and the comparison result $\min(S_{i,j-1}, S_{i-1,j-1})$ from the lower-adjacent $\mathrm{PE}_{i,j-1}$, execute the comparison $\min[S_{i-1,j}, \min(S_{i,j-1}, S_{i-1,j-1})]$, and add the $D_{ij}$ obtained in ① to the result to obtain $S_{ij}$.

④ Execute the comparison $\min(S_{ij}, S_{i-1,j})$, transfer its result to the upper-adjacent $\mathrm{PE}_{i,j+1}$, and transfer the cumulative result $S_{ij}$ to the right-adjacent $\mathrm{PE}_{i+1,j}$.

Here, ③ and ④, the last two of these operations, are the steps that carry out the comparison and accumulation of equation (2). Of the three cumulative results $S_{i,j-1}$, $S_{i-1,j}$, $S_{i-1,j-1}$ that $\mathrm{PE}_{ij}$ needs in order to execute equation (2), $S_{i,j-1}$ and $S_{i-1,j}$ reside in the lower-adjacent $\mathrm{PE}_{i,j-1}$ and the left-adjacent $\mathrm{PE}_{i-1,j}$ of the destination $\mathrm{PE}_{ij}$, whereas $S_{i-1,j-1}$ resides in $\mathrm{PE}_{i-1,j-1}$, which is diagonally adjacent to $\mathrm{PE}_{ij}$. The former two data therefore need one transfer each, while the latter needs two transfers by way of $\mathrm{PE}_{i,j-1}$. However, suppose that $\mathrm{PE}_{i,j-1}$, which relays $S_{i-1,j-1}$, first compares $S_{i,j-1}$ with $S_{i-1,j-1}$ and transfers the result to $\mathrm{PE}_{ij}$, and that $\mathrm{PE}_{ij}$ then compares this datum with $S_{i-1,j}$ transferred from $\mathrm{PE}_{i-1,j}$. The outcome is then equivalent to performing the three-way comparison of equation (2) in $\mathrm{PE}_{ij}$.
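The equivalence claimed above rests on the identity $\min[a, \min(b, c)] = \min(a, b, c)$. The sketch below checks it on a small grid by computing the table twice: once with an explicit three-way minimum, and once using only the values a position would receive from its left and lower neighbours. The function names, the `up` table standing in for the upward transfers, and the local distances `D` are illustrative assumptions.

```python
import math

def direct_table(D):
    """Equation (2) evaluated with an explicit three-way minimum."""
    I, N = len(D) - 1, len(D[0]) - 1
    S = [[math.inf] * (N + 1) for _ in range(I + 1)]
    S[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, N + 1):
            S[i][j] = D[i][j] + min(S[i][j - 1], S[i - 1][j - 1], S[i - 1][j])
    return S

def forwarded_table(D):
    """Same table, but each position only uses what its left and lower
    neighbours hand over."""
    I, N = len(D) - 1, len(D[0]) - 1
    S = [[math.inf] * (N + 1) for _ in range(I + 1)]
    S[0][0] = 0.0
    # up[i][j] is what position (i, j) forwards upward: min(S[i][j], S[i-1][j]).
    up = [[math.inf] * (N + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        up[i][0] = min(S[i][0], S[i - 1][0])  # boundary values from equation (3)
    for i in range(1, I + 1):
        for j in range(1, N + 1):
            from_left = S[i - 1][j]           # S_{i-1,j}, one transfer
            from_below = up[i][j - 1]         # min(S_{i,j-1}, S_{i-1,j-1})
            S[i][j] = D[i][j] + min(from_left, from_below)
            up[i][j] = min(S[i][j], S[i - 1][j])
    return S

# Hypothetical local distances; row 0 and column 0 are unused padding.
D = [[0, 0, 0],
     [0, 1, 4],
     [0, 2, 3],
     [0, 5, 1]]
same = forwarded_table(D) == direct_table(D)
print(same)
```

The forwarded version never reads a diagonal entry directly, which mirrors the saving of one transfer per grid point described in the text.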

These operations can be executed in either of two ways. In the first, all PEs on each diagonal shown by a solid line in FIG. 1 execute operations ① to ④ as one parallel processing unit. In the second, the four operations are grouped into two pairs, each pair forming one parallel processing unit, and the PEs on adjacent diagonals execute the two units alternately. Either way, the final cumulative result $S_{I,N}$ is obtained while the inter-vector distances $D_{ij}$ and the cumulative results $S_{ij}$ are being calculated. With the latter method, the two parallel processing units differ in their effective number of dynamic steps, so the number of execution steps must be equalized with NOP (no-operation) instructions; a detailed description is omitted here. FIGS. 2 to 5 show the operation on the two-dimensional array for this latter case from time $t$ to time $t+3$. The unit of time is the time each PE needs to execute all processing of both pairs of operations, and parts a and b of each figure show the state in which the data enclosed in rectangles have been calculated in each PE during that unit time.

Such a two-dimensional array configuration does allow parallel processing that exploits the locality and regularity of the operations. However, suppose that the matching recurrence (2) is replaced by a more complex expression such as equation (4). The transfer of the cumulative results $S_{i-1,j-1}$, $S_{i-1,j-2}$, $S_{i-2,j-1}$ in equation (4), and the operations that generate the three data to be compared, must then be executed by way of two PEs, and the comparison in $\mathrm{PE}_{ij}$ must wait until all three data have been input into $\mathrm{PE}_{ij}$. Not only does the processing unit that each PE must execute in parallel become complicated, but parallel processing that uses all PEs with sufficient efficiency also becomes impossible.

$$S_{ij} = D_{ij} + \min\{S_{i-2,j-1} + 2D_{i-1,j},\ S_{i-1,j-1} + D_{ij},\ S_{i-1,j-2} + 2D_{i,j-1}\} \quad \cdots (4)$$

Furthermore, the number of PEs must be determined from both of the positive integers $N$ and $I$ that give the numbers of data in the two vector sequences to be matched. To execute dynamic-programming matching between many vector sequences $C_u$ ($C_u = \{c_1^u, c_2^u, \ldots, c_{I_u}^u\}$; $u = 1, 2, \ldots, l_c$) and many vector sequences $R_v$ ($R_v = \{r_1^v, r_2^v, \ldots, r_{N_v}^v\}$; $v = 1, 2, \ldots, l_r$), $N$ and $I$ must therefore be chosen as $N_{\max} = \max_{1 \le v \le l_r} N_v$ and $I_{\max} = \max_{1 \le u \le l_c} I_u$, and $(N_{\max} \times I_{\max})$ PEs are required. Consequently, when vector sequences $C_u$ and $R_v$ are processed, every combination other than that of $C_{\max}$ and $R_{\max}$ leaves many PEs with no matching operation to execute, and the hardware cannot be used effectively.

Moreover, having to determine the required number of PEs from the maximum number of data to be processed is a serious obstacle to miniaturization with LSI technology. The number of PEs that fit on one LSI depends on the functions of the PE. If, for example, about four PEs fit on one LSI and $N_{\max} = 60$, $I_{\max} = 60$, then as many as 900 LSIs must be arranged and connected in two dimensions.
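The arithmetic behind these two objections is easy to reproduce. The sketch below computes the LSI count for the figures quoted in the text and the fraction of a fixed $(I_{\max} \times N_{\max})$ array that does useful work for one shorter pair of sequences. The function names and the lengths 30 and 40 are hypothetical values chosen only for illustration.

```python
import math

def lsi_count(n_max, i_max, pes_per_lsi):
    """Number of LSIs needed for an (i_max x n_max) array of PEs."""
    return math.ceil(n_max * i_max / pes_per_lsi)

def utilisation(i_u, n_v, i_max, n_max):
    """Fraction of the PEs that have a grid point to process when a
    sequence pair of lengths (i_u, n_v) runs on the full-size array."""
    return (i_u * n_v) / (i_max * n_max)

print(lsi_count(60, 60, 4))            # the case quoted in the text
print(utilisation(30, 40, 60, 60))     # hypothetical shorter pair
```

With the hypothetical pair above, two thirds of the PEs are idle, which illustrates why sizing the array by the longest sequences wastes hardware.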

[Object and structure of the invention]

An object of the present invention is therefore to provide an array processor that executes operations over all combinations of two kinds of variables, together with recurrence-formula operations with local data dependence that use the results, as typified by matching operations based on dynamic programming. The processor is to do so by highly efficient parallel processing, in an array whose number of PEs is adapted to the amount of computation required, while keeping every PE effectively in operation.

To achieve this object, the present invention arranges $n$ processing elements in a ring. Each processing element comprises: means for inputting each datum $c_i$, $r_j$ of two kinds of external input data sequences $C = \{c_i\}$ ($i = 1, 2, \ldots, I$) and $R = \{r_j\}$ ($j = 1, 2, \ldots, N$); means for performing the desired operations among addition/subtraction, comparison, and product-sum between the two kinds of data and storing the results; means for sending and receiving the input data $c_i$ and operation results to and from adjacent processing elements; and means for outputting the final operation result to the outside. The processing elements are connected in a ring through multiplexers that switch between the data transfer path used for exchanging data with the adjacent processing element and an external input path. The invention further comprises means by which, each time $n$ consecutive input data of the input data sequence $C$ are replaced, all processing elements transfer their processing results to the adjacent processing elements $(N \bmod n)$ times in parallel with the normal processing unit of each processing element, and means for controlling these processing elements.

Here, $N \bmod n$ denotes the remainder when $N$ is divided by $n$. $I$, $N$, and $n$ may be any positive integers, although in practice the relationship between $N$ and $n$ is restricted to the range in which $N \bmod n$ is meaningful. The invention is described in detail below with reference to an embodiment.

[Example]

The following describes the case in which equations (1), (2), and (3) above, an example of a matching operation based on dynamic programming, are executed for two kinds of vector sequences $C_u$ and $R_v$ ($u = 1, 2, \ldots, l_c$; $v = 1, 2, \ldots, l_r$). FIG. 6 shows the configuration of one embodiment of the present invention.

FIG. 6 shows the case where the number of PEs is $n$. Reference numeral 1 denotes a processing element (PE). It contains an arithmetic unit that performs the addition/subtraction, comparison, and product-sum operations needed to execute the dynamic-programming matching equations (1), (2), and (3), registers for exchanging data with adjacent PEs and with the outside, and memory for storing operation results and transferred data. 2-1 to 2-n are multiplexers that switch between two cases: input of the external input data $c_i^u$ ($i = 1, 2, \ldots, I_u$) to the array, $n$ at a time (the number of PEs), and circular transfer of the input data $c_i^u$ from the adjacent PE. For example, when the $n$ input data $c_1^u, c_2^u, \ldots, c_n^u$ for the PEs of the array are input from PE 1, only multiplexer 2-1 selects the external input data bus 3, which becomes the entry port for the external input data $c_1^u, c_2^u, \ldots, c_n^u$. The data $c_i^u$ entered at PE 1 are transferred to the adjacent PEs one after another, so that the $n$ data $c_1^u, c_2^u, \ldots, c_n^u$ are allocated one to each PE. In all other cases, every multiplexer 2-1 to 2-n selects the inter-PE data transfer bus 5, and the input data $c_1^u, c_2^u, \ldots, c_n^u$ circulate among the PEs. As described later, each PE is also configured so that, every time the $n$ input data are replaced, it can transfer its processing result to the adjacent PE $(N \bmod n)$ times, simultaneously with the other PEs and in parallel with the normal processing unit. 4 is an I/O bus for sequentially inputting each vector of the other input vector data sequence $R_v = \{r_1^v, r_2^v, \ldots, r_{N_v}^v\}$ ($v = 1, 2, \ldots, l_r$) to each PE and for outputting the final operation results $S_{I_1,N_1}, S_{I_1,N_2}, \ldots, S_{I_u,N_v}, \ldots, S_{I_{l_c},N_{l_r}}$ to the outside. 5 is the data transfer bus for the circular transfer of the vector data $c_i^u$ between PEs and for the transfer of the cumulative results $S_{ij}$. 6 is the I/O terminal of each PE connected to the I/O bus. 7, 8, and 9 denote, respectively, the input vector data $c_i^u$ and $r_j^v$ ($i = 1, 2, \ldots, I_u$; $j = 1, 2, \ldots, N_v$; $u = 1, 2, \ldots, l_c$; $v = 1, 2, \ldots, l_r$) and the final operation results $S_{I_1,N_1}, S_{I_1,N_2}, \ldots, S_{I_u,N_v}, \ldots, S_{I_{l_c},N_{l_r}}$. 10 is a control unit that controls the whole system, including judging when the input data are to be replaced and counting the number of transfers of processing results.

FIG. 7 shows a configuration example of each PE. In the figure, the part enclosed by the dash-dot line is one PE 1. 11 is the external I/O bus for inputting the vector data $r_j^v$ ($j = 1, 2, \ldots, N_v$) to each PE and for outputting the final operation result $S_{I_u,N_v}$, and 12 is the I/O terminal for exchanging data with external I/O bus 11. 13 is the data transfer bus terminal from the left-adjacent PE, and 14 is the data transfer bus terminal to the right-adjacent PE. 15 is a buffer register for inputting the vector data $r_j$ from external I/O bus 11, and 16 is a buffer register for outputting the final operation result $S_{I_u,N_v}$ to external I/O bus 11. 17 is a register for inputting, from the adjacent PE, the vector data $c_i^u$ ($i = 1, 2, \ldots, I_u$) and the data needed to calculate the cumulative result $S_{ij}$ in processing operations ⓑ and ⓒ described later. 18 is a register for transferring the vector data $c_i^u$ and the data needed to calculate $S_{ij}$ to the adjacent PE, and 19 is an internal bus. 20 and 21 are buffer memories that store all components $r_k^{v,j}$ and $c_k^{u,i}$ ($k = 1, 2, \ldots, m$) of the vector data $r_j^v$ and $c_i^u$ input to this PE, respectively. 22 is an arithmetic unit with addition/subtraction, comparison, and product-sum functions for executing equations (1) and (2), and 23 is a work memory that holds the data needed when executing equations (2) and (3). Work memory 23 is divided into two areas, 23-1 and 23-2, according to the nature of the data held. Area 23-1 holds the data needed to execute processing operations ⓐ, ⓑ, ⓒ during the circular transfer of the input vector data $c_i^u$ described later. Area 23-2 holds the data needed to execute processing operations ⓑ and ⓒ immediately after $n$ vector data of the vector sequences $C_1, C_2, \ldots, C_{l_c}$ have been replaced. 24 is a control unit, which exercises control according to a built-in microprogram or to external instructions. 25 is the input terminal for the control signal from control unit 10 of FIG. 6. 26 and 27 are address lines to the work memory. Line 26 is used by counter 28 to access area 23-2, which holds intermediate results, whereas line 27 corresponds, for example, to a direct address from the microprogram and accesses area 23-1, which stores the data needed for the individual steps of processing operations ⓑ and ⓒ.

As described above, the results produced by arithmetic unit 22 are held in work memory 23. Because registers 17 and 18 are provided for data transfer between adjacent PEs, arithmetic unit 22 can carry out the next operation while a result is being fetched from work memory 23 into register 18 and transferred from there to register 17 of the adjacent PE. Consequently, when the input data pattern is replaced, the $(N \bmod n)$ simultaneous transfers of processing results to adjacent PEs described later can be carried out while each PE is executing its computation. They run in parallel with the normal processing unit consisting of ⓐ, ⓑ, ⓒ described later (transferring the input data $c_i$, calculating $D_{ij}$ and $S_{ij}$, and transferring the results), that is, without disturbing the normal flow of processing at all.

Next, the method of executing the dynamic-programming matching operation given by equations (1), (2), and (3) with this configuration is described. The matching operation amounts to executing equations (1) and (2) at every grid point of the two-dimensional grid plane formed by each pair of vector data sequences $C_u$, $R_v$. FIG. 8 shows how this configuration continuously executes the matching operations (1), (2), and (3) for two kinds of vector data sequences, namely $l_c$ sequences $C_u = \{c_1^u, c_2^u, \ldots, c_{I_u}^u\}$ ($u = 1, 2, \ldots, l_c$) and $l_r$ sequences $R_v = \{r_1^v, r_2^v, \ldots, r_{N_v}^v\}$ ($v = 1, 2, \ldots, l_r$). In the figure, each diagonal broken line and diagonal solid line on the grid plane represents a point in time, with the PE processing unit taken as the unit of time, and arrow A indicates the direction in which time advances. Grid points on the same broken or solid line are thus processed simultaneously. Since there are $n$ PEs, $n$ grid points on a diagonal are always processed simultaneously while processing is in progress.
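One way to picture this diagonal schedule is to assign every grid point a time step and a PE and then check that no time step uses more than $n$ PEs. The sketch below does this under an assumed model that is not spelled out at this point in the text: rows are taken in blocks of $n$, each PE keeps one column $j$ while the $n$ data $c_i$ of a block circulate past it, and a new block starts as soon as the previous one has consumed its last column. The function name `schedule` and the sizes `I=10`, `N=17`, `n=5` are illustrative assumptions.

```python
from collections import defaultdict

def schedule(I, N, n):
    """Map each time step to the list of (pe, i, j) triples processed then.

    Assumed model: rows are handled in blocks of n, the wavefront inside a
    block advances along anti-diagonals, and columns are dealt to the PEs
    of the ring in round-robin order, continuing across block boundaries.
    """
    steps = defaultdict(list)
    for i in range(1, I + 1):
        block, a = divmod(i - 1, n)           # block index from 0, row offset 0..n-1
        for j in range(1, N + 1):
            t = block * N + (j - 1) + a + 1   # time unit in which (i, j) is processed
            pe = ((j - 1) + block * N) % n + 1
            steps[t].append((pe, i, j))
    return steps

steps = schedule(I=10, N=17, n=5)
busiest = max(len(points) for points in steps.values())
print(busiest)  # never exceeds the number of PEs
```

In this model the three predecessors of a grid point are always scheduled at an earlier time step as long as $N \ge n$, so the recurrence can proceed without stalls.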

FIG. 9 shows the data input operation in this configuration for the case $n = 6$. 31 denotes a PE, 32 the data transfer bus for transferring the vector data $c_i$ ($i = 1, 2, \ldots, I$) and the cumulative results $S_{ij}$ to the adjacent PE, 33 the flow of the vector data $c_i$ ($i = 1, 2, \ldots, I$) on the data transfer bus at each processing time, and 34 the vector data $r_j$ ($j = 1, 2, \ldots, N$) on the I/O bus to be input to each PE at each processing time. As many vector data as there are PEs, namely the six data $c_1, c_2, \ldots, c_6$, are input in order starting from $\mathrm{PE}_1$, and each is transferred to the right-adjacent PE whenever the processing in its current PE finishes. Until the first datum $c_1$ returns to $\mathrm{PE}_1$, the number of data transfer buses carrying a datum $c_i$ ($i = 1, 2, \ldots, 6$) grows by one at each processing time. From the time $c_1$ is transferred from $\mathrm{PE}_6$ to $\mathrm{PE}_1$ onward, the data $c_1$ to $c_6$ held in the PEs are all transferred to the adjacent PEs simultaneously at every time step. Meanwhile, the data $r_j$ ($j = 1, 2, \ldots, N$) are input to the PEs one after another in synchronization with this transfer of the data $c_i$ ($i = 1, 2, \ldots, 6$) between PEs. Equations (1), (2), and (3) are thus executed for all grid points while data are exchanged between the PEs in a regular pattern.
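The loading and circulation phases described for FIG. 9 can be traced with a few lines of code. The sketch below records, for each time step, which datum $c_i$ each PE holds; it models only the movement of the $c_i$, not the arithmetic or the input of $r_j$. The function name `ring_contents` and the list representation of the ring are illustrative assumptions.

```python
def ring_contents(n, steps):
    """Return, per time step, the index i of the datum c_i held by each PE
    (None while a PE is still empty). Position k of each list is PE_{k+1}."""
    held = [None] * n
    history = []
    for t in range(1, steps + 1):
        # During loading, PE_1 takes c_t from the external bus; afterwards
        # it takes the datum arriving from the last PE of the ring.
        incoming = t if t <= n else held[-1]
        # Every datum moves one PE to the right.
        held = [incoming] + held[:-1]
        history.append(held)
    return history

history = ring_contents(n=6, steps=8)
for t, held in enumerate(history, start=1):
    print(t, held)
```

After six steps every PE holds one datum, and from the seventh step on the same six data simply rotate, which corresponds to all multiplexers selecting the inter-PE bus.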

In FIG. 8, the first group of broken lines (group 1) indicates that only multiplexer 2-1 is set to select the external input data bus, that the $n$ input vector data $c_1^1, c_2^1, \ldots, c_n^1$ (as many as there are PEs) are input in order, and that $\mathrm{PE}_2$ to $\mathrm{PE}_n$ exchange the vector data $c_x^1$ ($x = 1, 2, \ldots, n-1$) with their adjacent PEs simultaneously each time a processing unit ends. The group of solid lines that follows indicates that, once datum $c_1^1$ has been input to $\mathrm{PE}_n$, all multiplexers 2-1 to 2-n select the inter-PE data transfer bus, and equations (1), (2), and (3) are executed while the input vector data $c_1^1, c_2^1, \ldots, c_n^1$ circulate among the PEs. The next group of broken lines shows the process of continuing the computation while the input vector data $c_1^1, c_2^1, \ldots, c_n^1$ are replaced by the next $n$ vector data $c_{n+1}^1, \ldots, c_{I_1}^1, \ldots, c_i^u$.

Each PE receives the two vector data $c_i^u$ and $r_j^v$ in every processing unit, so equation (1) is executed independently and in parallel in each PE, whereas equation (2) is executed while data are exchanged with adjacent PEs. For example, FIG. 10 shows, for $n = 5$ PEs, the processing sequence of each PE and the distribution of the grid points each PE handles when all combinations of the vector data sequences $C_1$, $C_2$ and $R_1$, $R_2$ are processed continuously. Grid points enclosed by ⊂⊃ in the figure are processed in the same PE, and the numeral at the upper left gives the PE number. To obtain $S_{7,8}$ in that figure, for example, the data needed at time $t_1$ are $S_{6,7}$, $S_{7,7}$, and $S_{6,8}$, which are obtained in $\mathrm{PE}_4$ and $\mathrm{PE}_5$ at times $t_2$ and $t_3$. Since $t_2$ and $t_3$ precede $t_1$, $S_{6,8}$ is already in $\mathrm{PE}_5$, which calculates $S_{7,8}$, and $S_{6,7}$ and $S_{7,7}$ are in $\mathrm{PE}_4$. The required data are thus always in the PE itself or in the adjacent PE. The comparison of equation (2) for $S_{7,8}$ is therefore executed by computing $\min(S_{6,7}, S_{7,7})$ in $\mathrm{PE}_4$, transferring the result to $\mathrm{PE}_5$, and computing $\min[S_{6,8}, \min(S_{6,7}, S_{7,7})]$ in $\mathrm{PE}_5$.

この場合、前述したように入力ベクトルデータ
列C₁，C₂……C_lcをPEの個数分（ｎ個）ごとに区
切つてアレイに入力し処理を行なうため、第１０
図に示すように斜線で示した格子点に対応する
S_ijは、入力ベクトルデータの入れ換えが始まる
までに、所定のPEへ転送しておかなければなら
ない。例えば、PE₁に存在するS_5,1はPE₃へ、PE₂
に存在するS_5,2はPE₄へ、PE₃に存在するS_5,3は
PE₅へ、PE₄に存在するS_5,4はPE₁へ、PE₅に存在
するS_5,5はPE₂へそれぞれ転送しなければならな
い。一般に、ｎ個のベクトルデータ列の入れ換え
が始まる（mod_lr 〓^v=1 N_v）時刻前の時刻から、すな
わち第１０図の例ではｍｏ⁵ d17＝２時刻前の時刻
から全PEは、各時刻ごとにそれぞれ蓄えている
累積結果S_po,j（ｐ＝１，２，……）を隣接するPE
へ同時に転送する動作を開始し、これらのデータ
の転送を後述するａ○，ｂ○，ｃ○の通常の処理動作と
並列に、前述したようにPEが演算処理のみを行
なつている間を利用してPEの各処理単位に１回
ずつ行なうことにより、ｎ個の入力ベクトルデー
タ列の入れ換え直前までに必要なデータS_po,jを所
定のPEに転送しておくことができる。第１０図
に示す例では、PE₁の格子点（c^1/1，r^2/6）に対す
る処理と並列に、PE₁，PE₂，PE₃，PE₄，PE₅の
各ワークメモリ２３−２の同一アドレスに存在す
るデータS_5,1，S_5,2，S_5,3，S_5,4，S_5,5は隣接する
PEへ転送されてPE₂，PE₃，PE₄，PE₅，PE₁に
配置され、PE₁の格子点（c^1/2，r^2/5に対する処理
では同様にしてPE₃，PE₄，PE₅，PE₁，PE₂に配
置されて転送が完了し、PE₁の次の格子点（c^1/3，
r^2/6）に対する処理時刻での次の入力ベクトルデ
ータ列c^1/5，c^1/6，c^1/1，c^2/1との入れ換え直後の処
理では、PE₃，PE₄，PE₅，PE₁，PE₂が上記の２
回の転送により得られたデータS_5,1，S_5,2，S_5,3，
S_5,4，S_5,5を使つて処理動作ａ○，ｂ○，ｃ○を実行す
る。このような処理を繰り返し実行することによ
り各PEはダイナミツクプログラミングに基づく
マツチング演算式(1)，(2)，(3)を規則的かつ連続的
に実行することができる。 In this case, as mentioned above for the input vector data strings $C_1, C_2, \ldots$, the accumulated results $S_{i,j}$ corresponding to the hatched grid points in the figure must be transferred to predetermined PEs before the exchange of the input vector data begins. For example, $S_{5,1}$ held in PE₁ must be transferred to PE₃, $S_{5,2}$ held in PE₂ to PE₄, $S_{5,3}$ held in PE₃ to PE₅, $S_{5,4}$ held in PE₄ to PE₁, and $S_{5,5}$ held in PE₅ to PE₂. In general, starting $\left(\sum_{v=1}^{l_r} N_v\right) \bmod n$ processing units before the exchange of the n vector data begins (that is, $17 \bmod 5 = 2$ processing units before, in the example of FIG. 10), all PEs simultaneously transfer the accumulated results $S_{pn,j}$ ($p = 1, 2, \ldots$) that they hold to the adjacent PE. This transfer runs in parallel with the normal processing operations ⓐ, ⓑ, and ⓒ described later, using the intervals in which, as described above, the PEs are only computing. By performing the transfer once per processing unit of the PE, the required data $S_{pn,j}$ can be delivered to the predetermined PEs just before the n input vector data are exchanged. In the example shown in FIG. 10, in parallel with the processing of PE₁ for the grid point $(c^1_1, r^2_6)$, the data $S_{5,1}, S_{5,2}, S_{5,3}, S_{5,4}, S_{5,5}$ held at the same address of the work memory 23-2 in PE₁, PE₂, PE₃, PE₄, PE₅ are transferred to the adjacent PEs and placed in PE₂, PE₃, PE₄, PE₅, PE₁. During the processing of PE₁ for the grid point $(c^1_2, r^2_5)$ they are moved in the same way to PE₃, PE₄, PE₅, PE₁, PE₂, which completes the transfer. In the processing immediately after the exchange with the next input vector data $c^1_5, c^1_6, c^1_1, c^2_1$, at the time PE₁ processes its next grid point $(c^1_3, r^2_6)$, PE₃, PE₄, PE₅, PE₁, PE₂ execute the processing operations ⓐ, ⓑ, and ⓒ using the data $S_{5,1}, S_{5,2}, S_{5,3}, S_{5,4}, S_{5,5}$ obtained by the two transfers. By repeating such processing, each PE can execute the matching equations (1), (2), and (3) based on dynamic programming regularly and continuously.
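The two-step transfer in the FIG. 10 example can be checked with a small simulation. The following Python sketch is illustrative only: the function name `ring_transfer` and the list-based model of the ring are assumptions made here, not part of the patent. List index k stands for PE(k+1), and every PE hands its held value to its right-hand neighbour once per processing unit.

```python
def ring_transfer(values, steps):
    """Shift the value held by every PE to its right-hand neighbour `steps` times."""
    n = len(values)
    held = list(values)
    for _ in range(steps):
        # All PEs send at the same time: PE k receives what PE k-1 held.
        held = [held[(k - 1) % n] for k in range(n)]
    return held


n = 5                 # number of PEs in the FIG. 10 example
total_N = 17          # sum of N_v in the FIG. 10 example
steps = total_N % n   # number of transfers before the exchange (17 mod 5)

start = ["S_5,1", "S_5,2", "S_5,3", "S_5,4", "S_5,5"]  # held by PE1..PE5
after = ring_transfer(start, steps)                    # held by PE1..PE5 afterwards
print(steps, after)
```

Under this model, S_5,1 ends up in PE₃ and S_5,4 in PE₁, which matches the placement described in the text.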

As described above, the input (or exchange) of n vector data of the input vector sequences $C_u$ ($u = 1, 2, \ldots, l_c$) and their cyclic transfer are repeated alternately, and the vector data $r_j^v$ ($j = 1, 2, \ldots, N_v$) are input to each PE in synchronization with the input and cyclic transfer of the vector data $c_i^u$ ($i = 1, 2, \ldots, I_u$). Each PE repeatedly executes equations (1), (2), and (3) for each grid point, and the processing for all grid points is thereby completed.

To summarize the above, the general processing operation (the normal processing unit) of a PE when executing equations (1) and (2) is as follows.

ⓐ Vector data $c_i$ ($i = 1, 2, \ldots, I$) is input from the left-adjacent PE or from the external input data bus and, at the same time, vector data $c_{i-1}$ is transferred to the right-adjacent PE. In synchronization with these transfers, vector data $r_j$ ($j = 1, 2, \ldots, N$) is input from the I/O bus, and equation (1) is executed to obtain $D_{ij}$.

ⓑ The comparison operation $\min[S_{i-1,j}, \min(S_{i-1,j-1}, S_{i,j-1})]$ is executed, and $D_{ij}$ is added to the result to obtain $S_{ij}$.

ⓒ The comparison operation $\min(S_{i-1,j}, S_{ij})$ is executed and its result is transferred to the right-adjacent PE; at the same time, the comparison result $\min(S_{i,j-1}, S_{i+1,j-1})$ is input from the left-adjacent PE.

ⓐ corresponds to the execution of equation (1), and ⓑ and ⓒ correspond to the execution of equations (2) and (3). All PEs perform the operations in the order ⓐ, ⓑ, ⓒ in lockstep: when ⓐ is performed, every PE performs ⓐ; when ⓑ is performed, every PE performs ⓑ; and so on.
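For reference, the result that the array must produce can be written as a plain sequential computation of equations (1) to (3). The Python sketch below is not the ring pipeline itself; it is a minimal single-processor version, and the function name `dp_matching`, the list-of-lists layout, and the use of index 0 for the boundary conditions of equation (3) are assumptions made for illustration.

```python
import math


def dp_matching(C, R):
    """Sequential reference for equations (1)-(3).

    C and R are lists of vectors (lists of numbers of equal length).
    Returns the final accumulated distance S[I][N].
    """
    I, N = len(C), len(R)
    # Row 0 and column 0 hold the assumed boundary conditions of equation (3):
    # infinity everywhere except S[0][0] = 0.
    S = [[math.inf] * (N + 1) for _ in range(I + 1)]
    S[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, N + 1):
            # Equation (1): squared distance between c_i and r_j (operation a).
            d = sum((ck - rk) ** 2 for ck, rk in zip(C[i - 1], R[j - 1]))
            # Equation (2): add the minimum of the three neighbouring
            # accumulated distances (operations b and c).
            S[i][j] = d + min(S[i][j - 1], S[i - 1][j - 1], S[i - 1][j])
    return S[I][N]


print(dp_matching([[0.0], [2.0]], [[1.0]]))
```

In the ring configuration the same values are produced, but the inner minimum is split between operations ⓑ and ⓒ so that each PE needs only one neighbour exchange per processing unit.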

The fundamental difference between this operation and that of the two-dimensional array configuration lies in the data transfer performed when executing equation (2). In the two-dimensional array configuration, the accumulated result $S_{i-1,j-1}$ is first transferred to the left-adjacent PE and the comparison operation $\min(S_{i,j-1}, S_{i-1,j-1})$ is then performed. In this operation, by contrast, $S_{i-1,j-1}$ resides in the same PE as $S_{i,j-1}$, which is obtained at the next time step, so the comparison can be performed without any data transfer.

When equation (4) is executed, each PE performs two kinds of simple operations: ⓓ a single input/output operation in which the accumulated result is input from an adjacent PE, twice the inter-vector distance computed in that PE is added to it, and the sum is output to the adjacent PE; and ⓔ an operation in which the accumulated result is input from an adjacent PE, the inter-vector distance is added, and the sum is held. With these, the accumulated results can be obtained regularly in the same way as described above.

As explained above, according to the present invention, the number of PEs does not depend at all on the positive integers $I_u$ and $N_v$ that represent the numbers of vector data to be processed, and can be set to an appropriate value according to the expected amount of data processing. The PEs run at full capacity by repeating a regular processing operation, and matching based on dynamic programming can be executed by pipelined parallel processing that makes maximum use of the hardware. Therefore, when the array is implemented in LSI, the implementation scale is far smaller than that of the conventional two-dimensional array configuration, in which the number of PEs must be determined by the positive integers $I_u$ and $N_v$, and the hardware is also used more effectively. In addition, the configuration is extensible in the number of PEs: whatever number of PEs is chosen, processing can be executed on vector data sequences of arbitrary lengths $N_v$ and $I_u$.

Next, the efficiency of the two-dimensional array configuration and of this configuration are compared in terms of throughput per PE per unit time, taking the average utilization of the PEs into account.

For the two-dimensional array configuration, let Usquare be the larger of the step counts of the two kinds of processing units formed by the processing operations described earlier, and let Uring be the number of dynamic steps of the processing unit consisting of the operations ⓐ, ⓑ, and ⓒ of this configuration. In the two-dimensional array configuration, completing the matching operation based on dynamic programming for one pair of vector data sequences requires 2Usquare steps when the two kinds of processing units are executed alternately. In the matching operation considered here, once $PE_{ij}$ has executed equations (1), (2), and (3) for one vector data sequence R and obtained the accumulated result $S_{ij}$, $PE_{ij}$ no longer needs to process this sequence R while $PE_{i'j'}$ ($i' > i$, $j' > j$) is executing those equations. PEs that are not contributing to the processing of a given vector data sequence $R_v$ can therefore be assigned to the processing of another vector data sequence $R_{v'}$. In other words, while the accumulated results $S_{ij}$ for the first vector data sequence $R_1$ are being computed, the accumulated results $S_{ij}$ for the second vector data sequence $R_2$ can also be computed with a phase difference of 2Usquare steps. Obtaining the final result $S_{I_u,N_v}$ for a vector data sequence $C_u$ and a vector data sequence $R_v$ takes (Nmax + Imax) steps, counted in units of 2Usquare, the number of dynamic steps needed to obtain one $S_{ij}$; within the time of these (Nmax + Imax) steps, (Nmax + Imax) final accumulated results $S_{I_u,N_v}$ can be obtained. In the configuration according to the present invention, on the other hand, the final accumulated results $S_{I_u,N_v}$ are obtained by repeating, for every n vector data of the input vector data sequences $C_1, C_2, \ldots, C_{l_c}$, the processing against the input vector data sequences $R_1, R_2, \ldots, R_{l_r}$.

Based on the processing operation of the entire array described above, the efficiency of a PE when processing is executed for all combinations of the vector data sequences $C_1, C_2, \ldots, C_{l_c}$ and $R_1, R_2, \ldots, R_{l_r}$ is obtained as follows.

For the two-dimensional array configuration: obtaining $l_r \cdot l_c$ final results requires $(N_{\max} + I_{\max} + l_r \cdot l_c)$ steps in units of $2U_{\mathrm{square}}$. Since the number of PEs is $N_{\max} \cdot I_{\max}$, the PE efficiency $\eta_{sq}$ is

$$\eta_{sq} = \frac{l_r\, l_c}{(N_{\max} + I_{\max} + l_r\, l_c) \cdot 2U_{\mathrm{square}} \cdot N_{\max}\, I_{\max}} \qquad \cdots (5)$$

For the present invention: in units of $U_{\mathrm{ring}}$, the processing during the exchange of n input vector data takes n steps, and the processing executed while the input vector data are cyclically transferred takes $\left(\sum_{v=1}^{l_r} N_v - n\right)$ steps. Because this is equivalent to treating the input vector data sequences $C_1, C_2, \ldots, C_{l_c}$ as a single input vector data sequence, obtaining $l_r \cdot l_c$ final results requires the number of steps given by equation (6).

[Equation (6) is not recoverable from the source text.]

The first term of equation (6) is the number of steps during cyclic transfer, the second term is the number of steps during data exchange, and the third term is the number of steps at the start and end of processing. Here $r_I = \left(\sum_{u=1}^{l_c} I_u\right) \bmod n$. Since the number of PEs is n, the PE efficiency $\eta_{\mathrm{ring}}$ is given by equation (7).

[Equation (7) is not recoverable from the source text.]

Let $N_{av}$ be the average of $N_1, N_2, \ldots, N_{l_r}$ and $I_{av}$ the average of $I_1, I_2, \ldots, I_{l_c}$. Rewriting equation (7) with $\sum_{v=1}^{l_r} N_v = l_r \cdot N_{av}$ and $\sum_{u=1}^{l_c} I_u = l_c \cdot I_{av}$ gives

$$\frac{\eta_{\mathrm{ring}}}{\eta_{sq}} = \frac{N_{\max}}{N_{av}} \cdot \frac{I_{\max}}{I_{av}} \cdot \frac{1 + \dfrac{N_{\max} + I_{\max}}{l_r\, l_c}}{1 + \dfrac{n - r_I}{l_c\, I_{av}} + \dfrac{n^2}{l_c\, I_{av}}} \cdot \frac{2U_{\mathrm{square}}}{U_{\mathrm{ring}}} \qquad \cdots (8)$$

The terms other than 1 in the numerator and denominator of the third factor of equation (8) relate to the efficiency at the start and end of processing in each configuration. The ratio of PE efficiencies while processing is in progress is therefore

$$\frac{\eta_{\mathrm{ring}}}{\eta_{sq}} \approx \frac{N_{\max}}{N_{av}} \cdot \frac{I_{\max}}{I_{av}} \cdot \frac{2U_{\mathrm{square}}}{U_{\mathrm{ring}}} \qquad \cdots (9)$$

If each PE in the two-dimensional array configuration has means for performing input and output simultaneously, then $2U_{\mathrm{square}} \approx U_{\mathrm{ring}}$; since in addition $N_{\max} > N_{av}$ and $I_{\max} > I_{av}$, the configuration of the present invention is always more efficient than the two-dimensional array configuration. For example, when $N_{av} = \tfrac{3}{4} N_{\max}$ and $I_{av} = \tfrac{3}{4} I_{\max}$, the efficiency is about 1.8 times higher. If each PE only has means for performing input and output alternately, one per processing unit, then $2U_{\mathrm{square}} > U_{\mathrm{ring}}$, and by equation (9) the efficiency ratio of the present invention over the two-dimensional array configuration becomes even larger.

In the two-dimensional array configuration, at least (Nmax × Imax) PEs must be arranged and connected, so the implementation scale becomes very large. Conventionally, therefore, each PE was kept compact by performing its input and output bit-serially. However, the data handled in the matching operation based on dynamic programming considered here are vector data, in which a data string of a certain number of dimensions is treated as a single datum, as shown in equation (1). If input and output are performed bit-serially, the number of transfer steps between PEs becomes very large and the overall computation takes a very long time. In this configuration, by contrast, the number of PEs can be reduced substantially, so data transfer between PEs can be implemented as parallel transfer without causing a problem of implementation scale. The configuration is therefore well suited to processing vector data, as in the matching operation based on dynamic programming considered here.

The description above has centered on the dynamic programming operation shown in equations (1), (2), and (3), but the present invention is not limited to this. As mentioned earlier, it applies equally when, for example, equation (2) is replaced by equation (4), and more generally to any operation over all combinations of two kinds of variables together with a recurrence-formula operation with local data dependence that uses the results of that operation.

[Effect of the invention]

As explained above, according to the present invention, processing elements each provided with predetermined input/output means and arithmetic means are connected in a ring via multiplexers that switch between the data transfer bus used to exchange data with adjacent processing elements and the external input bus, and the processing in which all processing elements simultaneously transfer their processing results to the adjacent processing elements can be executed a predetermined number of times in parallel with the normal processing unit of each processing element. As a result, operations over all combinations of two kinds of variables, typified by the matching operation based on dynamic programming, together with recurrence-formula operations with local data dependence that use their results, can be realized by highly efficient parallel processing that keeps every processing element effectively busy, using an array whose number of PEs is appropriate to the amount of computation to be performed.

[Brief explanation of drawings]

FIG. 1 is a diagram showing a configuration example of a conventional two-dimensional array processor. FIGS. 2a, b to 5a, b are diagrams for explaining an example of its processing operation. FIG. 6 is a configuration diagram showing an embodiment of the present invention. FIG. 7 is a block diagram showing a configuration example of each processing element. FIG. 8 is a diagram for explaining an example of the processing operation in the configuration of FIG. 6. FIG. 9 is a diagram for explaining external data input and data transfer between processing elements in the same configuration. FIG. 10 is a diagram for explaining an example of the processing operation of each processing element.

1, 31: processing elements; 2-1 to 2-n: multiplexers; 3: external input data bus; 4, 11: external I/O buses; 5, 32: data transfer buses; 6, 12: I/O terminals; 7, 8, 33, 34: input vector data; 9: final operation result; 10: control unit; 13, 14: data transfer bus terminals; 15, 16: buffer registers; 17, 18: registers; 20, 21: buffer memories; 22: arithmetic unit; 23: work memory; 24: control unit; 26, 27: address lines; 28: counter.

Claims

[Claims] 1. An array processor characterized in that: n processing elements PE are arranged in a ring; each processing element comprises means for inputting each datum $c_i$, $r_j$ of two kinds of external input data sequences $C = \{c_i\}$ ($i = 1, 2, \ldots, I$) and $R = \{r_j\}$ ($j = 1, 2, \ldots, N$), means for performing desired operations such as addition/subtraction, comparison, and product-sum operations between the two kinds of data and storing the results, means for transmitting and receiving the input data $c_i$ ($i = 1, 2, \ldots, I$) and operation results to and from adjacent processing elements, and means for outputting the final operation result to the outside; the processing elements are connected in a ring via multiplexers that switch between a data transfer path for exchanging data with adjacent processing elements and an external input path, so that the external input data $c_i$ can be input from any processing element; and means is provided for controlling the processing elements so that the processing in which all processing elements simultaneously transfer their respective processing results to the adjacent processing elements is executed $(N \bmod n)$ times, in parallel with the processing unit in each processing element, for each replacement of n consecutive input data of the input data sequence C.