JP3971719B2

JP3971719B2 - SIMD type microprocessor

Info

Publication number: JP3971719B2
Application number: JP2003159703A
Authority: JP
Inventors: 貴雄片山
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-06-04
Filing date: 2003-06-04
Publication date: 2007-09-05
Anticipated expiration: 2023-06-04
Also published as: JP2004362253A

Description

【０００１】
【発明の属する技術分野】
本発明は、画像データ等を高速処理するために単一の命令で複数データに対して同じ処理を行うＳＩＭＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎ−ｓｔｒｅａｍＭｕｌｔｉｐｌｅＤａｔａ− ｓｔｒｅａｍ）型マイクロプロセッサに関し、特に、プロセッサエレメントのネイバーフッドオペレーション（隣接処理）を行なうＳＩＭＤ型マイクロプロセッサに関わる。
【０００２】
【従来の技術】
近年、デジタル複写機やファクシミリ装置等の画像処理では、画素数の増加、画像処理の多様化などにより画質の向上が図られている。こういった画像処理では複数のデータに対して同一の処理をすることが多く、１命令で１つのデータを処理するＳＩＳＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎ−ｓｔｒｅａｍＳｉｎｇｌｅＤａｔａ−ｓｔｒｅａｍ）型マイクロプロセッサよりも、１命令で複数のデータを同時処理するＳＩＭＤ（ＳｉｎｇｌｅＩｎｓｔｒｕｃｔｉｏｎ−ｓｔｒｅａｍＭｕｌｔｉｐｌｅＤａｔａ−ｓｔｒｅａｍ）型マイクロプロセッサが用いられることが多い。
【０００３】
ＳＩＭＤ型マイクロプロセッサ２は、図５のようにグローバルプロセッサ４と、演算アレイ６と、レジスタファイル８からなる。図５はデータの流れを説明するため簡略化されており、図５のプロセッサエレメントグループ１０はプロセッサエレメント（ＰＥ）の集合体である。
【０００４】
なお、本明細書では便宜上、プロセッサエレメントは２５６個という構成にして説明している。個々のプロセッサエレメントの構成は、後で説明する。
【０００５】
例えば、スキャナやカメラなどの外部入力装置から入力された画像データは、レジスタファイル８に書き込まれ、演算アレイ６にて論理算術演算等の含まれた画像処理が行われ、再度レジスタファイル８に書き込まれ、プリンタ・記憶装置などの外部出力装置に出力される。グローバルプロセッサ４は、プログラムを解読し制御信号をレジスタファイル８や演算アレイ６に送信する。グローバルプロセッサ４はグローバルプロセッサそのものでの演算処理・データ転送・逐次処理等も行う。
【０００６】
図６は、図５のプロセッサエレメント（ＰＥ）１２をより詳しく示す従来技術例である。図６では、プロセッサエレメント１２はＰＥ〔０〕〜ＰＥ〔２５５〕までの２５６個用意されている。なお個々のプロセッサエレメントは順に０乃至２５５の序数が「ＰＥ番号」として付番されている。本明細書では、各図面の左方から順に付番されているものとする。
【０００７】
１つのプロセッサエレメント１２は、レジスタ１４と、ＡＬＵ（算術論理演算器）１６を中心とした演算部からなる。レジスタ１４は、例えばＲ０〜Ｒ３１の３２本用意されており、そのうちの一部のレジスタが外部入出力に接続されている。Ｒ０〜Ｒ３１のレジスタは、データをシフト・拡張するシフト・拡張器１８に接続されＡＬＵ１６の片側のデータとして第２の記憶部２２にラッチされる。ＡＬＵ１６のもう片側のデータは、演算部に備わるＡレジスタ２４の値を第１の記憶部２０にラッチしたものとなる。図６の例では、レジスタ１本のサイズを８ビットとしている。更に、レジスタのうちＲ０〜Ｒ２３の２４本が外部入出力に接続され、レジスタ１４からシフト・拡張器１８までは８ビット、シフト・拡張器１８はその８ビットを１６ビットに拡張して第２の記憶部２２に出力する、としている。よって、ＡＬＵ１６、Ａレジスタ２４及び第１の記憶部２０は、対応して１６ビットとなる。
【０００８】
図６でのデータの流れは概略次のようになる。まず、外部入力からＲレジスタ１４に転送された８ビット画像データは、シフト・拡張器１８によって１６ビットデータに拡張もしくはシフト処理され、第２の記憶部２２に転送され、ＡＬＵ１６でデータ処理を施される。その結果Ａレジスタ２４に書き込まれたデータは、シフト・拡張器１８からＲレジスタ１４に書きこまれ外部出力に出力される。
【０００９】
更に図７の従来技術例では、Ｒレジスタ１４とシフト・拡張器１８との経路が、近接するプロセッサエレメントの範囲内でのＲレジスタ１４とシフト・拡張器１８との間に設定できるようにしている。なお図７の演算器２６は、ＡＬＵ１６、第１の記憶部２０と第２の記憶部２２、及びＡレジスタ２４をまとめて示している。ここで注目のＰＥを「ＰＥ〔ｎ〕」とすると、ＰＥ〔ｎ〕の演算器２６は、ＰＥ〔ｎ〕のＲレジスタ１４とのデータ通信を行うほか、１つ左のＰＥ（ＰＥ〔ｎ−１〕）、２つ左のＰＥ（ＰＥ〔ｎ−２〕）、３つ左のＰＥ（ＰＥ〔ｎ−３〕）、１つ右のＰＥ（ＰＥ〔ｎ＋１〕）、２つ右のＰＥ（ＰＥ〔ｎ＋２〕）、３つ右のＰＥ（ＰＥ〔ｎ＋３〕）の計７つのＰＥのＲレジスタ１４とデータ通信することができる。このとき、バス選択器が必要であるが、図７の従来技術例では、シフト・拡張器１８とまとめて符号２８により示している。
【００１０】
本明細書では、この近接するＰＥにおけるＲレジスタ１４と演算部２６とのデータ通信を、「ＰＥシフト」と称することにする。一般的には、ネイバーフッドオペレーション（隣接処理）とも呼ばれている。
【００１１】
ＰＥシフトが使用される画像処理には、ＭＴＦ（ＭｏｄｕｌａｔｉｏｎＴｒａｎｓｆｅｒＦｕｎｃｔｉｏｎ）補正、フレアデータ除去などがある。
【００１２】
上記のＭＴＦ補正は、注目画素を除く周辺画素の強調成分を計算して、その結果に強度倍率を乗じて、注目画素に加算することにより実現する補正である。図８のような３ライン×５画素のフィルタマトリクスにおいて中央の注目画素をｄ２１とし、ＭＴＦフィルタ係数をＭ００〜Ｍ４２、強度倍率をｍａｇ、フィルタへの入力画素データのマトリクスをｄ００〜ｄ４２、フィルタの出力をｍｔｆｏとすると、ＭＴＦ補正演算は以下の式で表される。
【００１３】
ｍｔｆｏ＝ｄ２１×Ｍ２１＋
ｍａｇ×（ｄ００×Ｍ００＋ｄ１０×Ｍ１０＋ｄ２０×Ｍ２０＋ｄ３０×Ｍ３０＋ｄ４０×Ｍ４０＋ｄ０１×Ｍ０１＋ｄ１１×Ｍ１１＋ｄ３１×Ｍ３１＋ｄ４１×Ｍ４１＋ｄ０２×Ｍ０２＋ｄ１２×Ｍ１２＋ｄ２２×Ｍ２２＋ｄ３２×Ｍ３３＋ｄ４２×Ｍ４２）
【００１４】
上記のフレアデータ除去は、スキャナ等の読み取りで発生するフレア光（原稿面から反射された拡散光以外の光）を除去する処理である。図９に示す３ライン×５画素の領域内で、しきい値Ｔ未満の値を持つ画素の値を積算して、画素数Ｎで割ることにより平均値（フレア補正量）を求め、この値を注目画素データから減算する。
【００１５】
以上の演算は、いずれも注目画素の前後のデータを参照する必要がある演算である。フィルタは３ライン×５画素を例に挙げたが、５ライン×７画素、それ以上のマトリクス範囲のフィルタも存在する。
【００１６】
通常、ＳＩＭＤ型マイクロプロセッサでの１命令の処理分を「１ＳＩＭＤ」と称する。例えばデジタルコピアの場合、１枚の原稿の画像処理は、原稿１ラインにおける画素数を「１ＳＩＭＤ」で処理できる画素数で割った回数に、ライン数を乗じて算出された全体回数の「ＳＩＭＤ」によって行なわれる。また、通常、ラインの方向を横方向として主走査方向と称し、縦方向を副走査方向と称する。よって、主走査方向の処理を行う場合、図１０のように「１ＳＩＭＤ」ずつぴたりと連続してデータの処理ができれば効率が良い。
【００１７】
けれども、前述した処理のように近隣の画素データを参照する場合には、図１１のように「１ＳＩＭＤ」のデータの両端の数個（この数個はフィルタの画素数で決まる）をオーバーラップさせた形で取り込み参照しなければならない。言うまでも無く、オーバーラップが無いほうが、処理効率は良い。
【００１８】
ところで、下記の特許文献１、特許文献２、特許文献３及び特許文献４では、ネイバーフッドオペレーション（隣接処理）の改良を目指すものであるが、上記のようなオーベーラップの問題点を解決するものではない。
【００１９】
【特許文献１】
特開平１１−１５８０１号
【特許文献２】
特許第２７５６２５７号
【特許文献３】
特許第２８１２２９２号
【特許文献４】
特開２００２−２４７３４７号
【００２０】
【発明が解決しようとする課題】
本発明は、ＳＩＭＤ型マイクロプロセッサにて、１ラインを複数「ＳＩＭＤ」により処理する場合に、上記のようなオーバーラップをなくし効率良くデータを処理できることを目的とする。
【００２１】
【課題を解決するための手段】
本発明は、上記の目的を達成するために為されたものである。本発明に係る請求項１に記載のＳＩＭＤ型マイクロプロセッサは、
グローバルプロセッサと、ｍａｘ個のプロセッサエレメントを有し、各プロセッサエレメントには０から（ｍａｘ−１）の序数が配置に従い順に付番され、
更に各プロセッサエレメントは、データを処理する演算器と、演算器の入力・出力のバスに接続された複数のレジスタとを含み、
プロセッサエレメント配置の両端から見て所定の第１の個数の範囲内のプロセッサエレメントでは、端からｉ番目（ｉは自然数、且つ、１≦ｉ≦（第１の個数））のプロセッサエレメントからの上記バスは、反対の端のプロセッサエレメントから（（第１の個数）−ｉ）番目のプロセッサエレメントまで、夫々のバスと選択器を挟んで経路により接続しており、
プロセッサエレメント配置の両端から見て第１の個数の範囲内のプロセッサエレメント以外のプロセッサエレメントでは、夫々のプロセッサエレメントを中心にして第１の個数の範囲内の、左右に隣接するプロセッサエレメントのバスと、経路により接続しており、
各プロセッサエレメントの演算器の入力及び出力のバスに接続する複数のレジスタのうち、いずれを選択するかを指示する上記グローバルプロセッサからの選択信号が、プロセッサエレメント配置の両端から見て所定の第１の個数の範囲内のプロセッサエレメントと、プロセッサエレメント配置の両端から見て所定の第１の個数の範囲内のプロセッサエレメント以外のプロセッサエレメントとで、異なることを特徴とする。
【００２２】
本発明に係る請求項２に記載のＳＩＭＤ型マイクロプロセッサは、
記選択器が、制御信号により各プロセッサエレメントの演算器に特定の値を設定することを特徴とする請求項１に記載のＳＩＭＤ型マイクロプロセッサである。
【００２３】
本発明に係る請求項３に記載のＳＩＭＤ型マイクロプロセッサは、
プロセッサエレメントのレイアウト配置において、
個別のプロセッサエレメント間の配線が均等な距離になるように、プロセッサエレメント配置の両端のプロセッサエレメントが近傍に配置されることを特徴とする請求項１に記載のＳＩＭＤ型マイクロプロセッサである。
【００２４】
【発明の実施の形態】
以下、図面を参照して本発明に係る好適な実施の形態を説明する。
【００２５】
≪第１の実施の形態≫
図１は、本発明の第１の実施の形態に係るプロセッサエレメント１２のブロック図である。図１は特にプロセッサエレメントグループ１０の両端におけるプロセッサエレメント１２を示す。
【００２６】
第１の実施の形態は、図７の従来技術例に加えて、ＰＥシフトのためのバス接続を変更している。つまり、ＰＥシフト選択器３０を新たに設定している。
【００２７】
通常、両端のＰＥ１２のＰＥシフト用バスは、特定の値を入力、例えばＧＮＤ接続して“０”を入力するようになっている。
【００２８】
図１では、左隣に３個のプロセッサエレメントまで、右隣に３個のプロセッサエレメントまでの間でデータ通信できる仕様となっている。この個数は増減されてもよい。図１では、自プロセッサエレメントを含めて７ヶ所（のＰＥ）から選択できる。また、図１では図示してないが、レジスタ１４とシフト・拡張・バス選択器２８’とのバスのサイズは、８ビットである。
【００２９】
ＰＥ〔２５３〕、ＰＥ〔２５４〕、ＰＥ〔２５５〕とＰＥ〔０〕、ＰＥ〔１〕、ＰＥ〔２〕とが、ＰＥシフト選択器３０により接続することで、図１２のようなデータ通信が可能となる。図１２にて、「注目ＰＥ」とは注目する演算器２６が存在するＰＥを示し、「データ転送可能なＰＥ」とは注目ＰＥの演算器２６と接続しデータをリード／ライトする対象のレジスタが属するＰＥを示す。Ｌ３は注目ＰＥから左に３つ隣、Ｌ２は左に２つ隣、Ｌ１は左に１つ隣、Ｃは自ＰＥ、Ｒ１は右に１つ隣、Ｒ２は右に２つ隣、Ｒ３は右に３つ隣を表す。
【００３０】
例えば、図１２ではＰＥ〔２５４〕のバスは、左側ではＰＥ〔２５１〕、ＰＥ〔２５２〕、及びＰＥ〔２５３〕と接続され、右側ではＰＥ〔２５５〕、ＰＥシフト選択器３０を介してＰＥ〔０〕、及び同じくＰＥシフト選択器３０を介してＰＥ〔１〕と接続されている。ＰＥシフト選択器３０では、（ＰＥ〔０〕やＰＥ〔１〕ではなく）“０”もしくは“０ＦＦｈ”（全ビット“１"）を選択する。
【００３１】
ＰＥシフト選択器３０は、ＰＥ〔２５３〕、ＰＥ〔２５４〕、ＰＥ〔２５５〕、ＰＥ〔０〕、ＰＥ〔１〕及びＰＥ〔２〕に接続される。これら両端３つずつのＰＥシフト選択器は、それぞれのＰＥ１２のブロックにある（図１では、説明の便宜により、ブロックの外に置いている。）。ＰＥシフト選択器３０の制御信号は、グローバルプロセッサ４のＳＣＵ５から供給される。ＰＥシフト選択器３０は、制御信号（“０”、“１”）によって、反対側のＰＥの選択が制御される。“０”とするならば、全ビット０であるからＧＮＤに繋げば問題ない。“０ｆｆｈ”とするならば、全ビット１であるからＶＣＣ接続でよい。ＰＥシフト選択器３０はセレクタを含むが、該セレクタは、セレクト信号２ビットであり且つ３ｔｏ１のマルチプレクサでよい。
【００３２】
なお、上記の「反対側」とは、ＰＥ〔０〕を対象としたときには、ＰＥ〔２５３〕、ＰＥ〔２５４〕もしくはＰＥ〔２５５〕である。ＰＥ１を対象としたときには、ＰＥ〔２５４〕もしくはＰＥ〔２５５〕である。ＰＥ２を対象としたときには、ＰＥ〔２５５〕である。一方、ＰＥ〔２５５〕を対象としたときには、ＰＥ〔０〕、ＰＥ〔１〕もしくはＰＥ〔２〕である。ＰＥ〔２５４〕を対象としたときには、ＰＥ〔０〕もしくはＰＥ〔１〕である。ＰＥ〔２５３〕を対象としたときには、ＰＥ〔０〕である。
【００３３】
図２は、第１の実施の形態でのレジスタ１４の制御信号を示す。レジスタ１２の制御信号は、図２のようにグローバルプロセッサ４のシーケンシャルユニット（ＳＣＵ）５から各ＰＥに送られる。シーケンシャルユニット５は、グローバルプロセッサ４のメモリに書きこまれているプログラムをデコードし、制御信号をグローバルプロセッサ４並びにプロセッサエレメント１２に送信し、各ブロックを動かす。
【００３４】
レジスタの選択は、第１のレジスタ選択信号３２と第２のレジスタ選択信号３４とによって為される。第１のレジスタ選択信号３２は、ＰＥ〔３〕乃至ＰＥ〔２５２〕のレジスタ１２に送られ、第２のレジスタ選択信号３４は、ＰＥ〔０〕乃至ＰＥ〔０〕、ＰＥ〔２５３〕乃至ＰＥ〔２５５〕に送られる。ここで、レジスタ選択信号（第１のレジスタ選択信号３２、第２のレジスタ選択信号３４）は、複数の選択信号の集合である。つまり、各ＰＥにてレジスタが３２本あったとしたら、読み取り処理及び書きこみ処理で１本ずつ必要であるので、最低６４本の制御信号が、それぞれのレジスタ選択信号に必要となる。
【００３５】
例えば、図７において、１ラインのデータは左から順に「１ＳＩＭＤ」ずつ図６の外部入出力からロードされる。ここで、１番最初の「１ＳＩＭＤ」分のデータはＲ１に、２番目の「１ＳＩＭＤ」分のデータはＲ２に、更に、３番目の「１ＳＩＭＤ」分のデータはＲ３にロードするものとする。つまり、例えばＲ１を現在処理するべきデータとすると、処理すべき「１ＳＩＭＤ」分のデータの前後の「ＳＩＭＤ」分のデータを常に別のレジスタに置いておくことになる。言い換えると、Ｒ０には前回Ｒ１がロードしていた前ＳＩＭＤ分のデータが格納されており、Ｒ２には次回Ｒ１がロードする次ＳＩＭＤのデータが格納されていることになる。
【００３６】
ここで、あるフィルタ処理にて、例えばＰＥ番号が１だけ大きいＰＥのＲ１レジスタからデータを参照し、それぞれのＰＥの演算器２６の中のＡレジスタ２４の値と演算し、結果をＡレジスタ２４に格納する、とする。つまり、ＰＥ〔０〕のＡレジスタ２４はＰＥ〔１〕のＲ１と、ＰＥ〔１〕のＡレジスタ２４はＰＥ〔２〕のＲ１と、というように、ＰＥ〔０〕からＰＥ〔２５４〕までのＡレジスタ２４は、夫々ＰＥ〔１〕からＰＥ〔２５５〕のＲ１の値と演算することになる。
【００３７】
ここで、ＰＥ〔２５５〕のＡレジスタ２４は、ＰＥ〔０〕のＲ２の値と演算すれば、他のＰＥと同要件の演算を行なうことになり、且つ、第１の実施の形態を利用すればそれが容易に可能であることになる。この演算を実施するには、命令でレジスタを指定すればよい。命令の仕様は、「１ＳＩＭＤ」の範囲を超えるＰＥを参照するとき（即ち、反対側のＰＥを参照するとき）のレジスタを変える、というものになる。
【００３８】
また、前ＳＩＭＤのデータを参照する場合、例えば、〔ＰＥ番号−３〕のＰＥのＲ１レジスタからデータを参照し、それぞれのＰＥの演算器２６の中のＡレジスタ２４と演算し、結果をＡレジスタ２４に格納するような場合を、想定する。つまり、ＰＥ〔３〕のＡレジスタ２４はＰＥ〔０〕のＲ１と、ＰＥ〔４〕のＡジスタ２４はＰＥ〔１〕のＲ１と、というように、ＰＥ〔３〕からＰＥ〔２５５〕までのＡレジスタ２４は、夫々ＰＥ〔０〕からＰＥ〔２５２〕までのＲ１の値と演算することになる。ここで、ＰＥ〔０〕のＡレジスタ２４は、ＰＥ〔２５３〕のＲ０の値と演算すれば、他のＰＥと同要件の演算を行なうことになり、且つ、第１の実施の形態を利用すればそれが容易に可能であることになる。よって同様に、ＰＥ〔１〕のＡレジスタ２４はＰＥ〔２５４〕のＲ０の値と演算し、ＰＥ〔２〕のＡレジスタ２４はＰＥ〔２５５〕のＲ０の値と演算する。
【００３９】
上記の第１の実施の形態では、第１のレジスタ選択信号３２と第２のレジスタ選択信号３４とが、設定されている。更に、図示していないが、ＳＣＵ５からレジスタの選択モードを表す信号を１本用意し、ＰＥ〔０〕乃至ＰＥ〔２〕、ＰＥ〔２５３〕乃至ＰＥ〔２５５〕の夫々のブロックにレジスタ選択変更装置を加え、レジスタ制御信号を全てのＰＥで一対とし、ＰＥ〔０〕乃至ＰＥ〔２〕、ＰＥ〔２５３〕乃至ＰＥ〔２５５〕のブロックでは選択モードに従い、他のＰＥで選択されるレジスタと異なるレジスタが選択され得る、という構成にしてもよい。
【００４０】
≪第２の実施の形態≫
第２の実施の形態では、次に説明するように、プロセッサエレメントの配置に工夫が施されている。それ以外の構成は、第１の実施の形態と同様である。
【００４１】
ウェハにプロセッサエレメント１２をレイアウトする場合、図３のようなレイアウト配置にすると、ＰＥ〔０〕乃至ＰＥ〔２〕と、ＰＥ〔２５３〕乃至ＰＥ〔２５５〕との間のバス配線が非常に長くなる。すると、配線抵抗が増え、遅延が増大し、ＰＥシフトの致命的なスピード劣化の決定的な要因となる。それら以外のＰＥ間では、隣接処理の配線遅延は均等である。
【００４２】
図４は、本発明の第２の実施の形態に係るプロセッサエレメントの配置である。全プロセッサエレメントの前半、後半で２分し、ミラー配置としている。このようにすれば、全プロセッサエレメントにて、隣接処理の配線遅延は略均等である。
【００４３】
【発明の効果】
本発明を利用することにより、以下のような効果を得ることができる。
【００４４】
まず、１ラインの画素数が多い画像処理において、余分なオーバーヘッドを減少させることができる。
【００４５】
プロセッサエレメントの配置をミラー配置にすることで、両端のプロセッサエレメント間の距離を減らすことができ、よって配線遅延による速度低下を防ぐことができる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係るプロセッサエレメントのブロック図である。
【図２】第１の実施の形態でのレジスタの制御信号を示す。
【図３】本発明に係るプロセッサエレメントの配置である。
【図４】本発明の第２の実施の形態に係るプロセッサエレメントの配置である。
【図５】ＳＩＭＤ型マイクロプロセッサの概略のブロック図である。
【図６】図５のプロセッサエレメントをより詳しく示すブロック図である。
【図７】従来技術に係るプロセッサエレメントのブロック図である。
【図８】ＭＴＦ補正を示す模式図である。
【図９】フレアデータ除去を示す模式図である。
【図１０】１ラインと１ＳＩＭＤの関係図（１）である。
【図１１】１ラインと１ＳＩＭＤの関係図（２）である。
【図１２】本発明に係るデータ通信を示す表である。
【符号の説明】
２・・・ＳＩＭＤ型マイクロプロセッサ、４・・・グローバルプロセッサ、１２・・・プロセッサエレメント、２６・・・演算器、２８、２８’・・・シフト・拡張・バス選択器、３０・・・ＰＥシフト選択器。
[0001]
[Technical field of the invention]
The present invention relates to a SIMD (Single Instruction-stream Multiple Data-stream) type microprocessor that applies the same processing to multiple data items with a single instruction in order to process image data and the like at high speed, and in particular to a SIMD type microprocessor that performs neighborhood operations (adjacent processing) among its processor elements.
[0002]
[Prior art]
In recent years, image processing in digital copiers, facsimile machines, and the like has improved image quality through higher pixel counts and more varied processing. Because such image processing often applies the same operation to many data items, a SIMD (Single Instruction-stream Multiple Data-stream) type microprocessor, which processes multiple data items simultaneously with one instruction, is used more often than a SISD (Single Instruction-stream Single Data-stream) type microprocessor, which processes one data item per instruction.
[0003]
The SIMD type microprocessor 2 comprises a global processor 4, an arithmetic array 6, and a register file 8, as shown in FIG. 5. FIG. 5 is simplified to explain the flow of data, and the processor element group 10 in FIG. 5 is a collection of processor elements (PEs).
[0004]
In the present specification, for the sake of convenience, a description is given of a configuration having 256 processor elements. The configuration of each processor element will be described later.
[0005]
For example, image data input from an external input device such as a scanner or camera is written to the register file 8, undergoes image processing including arithmetic and logical operations in the arithmetic array 6, is written back to the register file 8, and is output to an external output device such as a printer or storage device. The global processor 4 decodes the program and sends control signals to the register file 8 and the arithmetic array 6. The global processor 4 also performs arithmetic processing, data transfer, sequential processing, and the like within itself.
[0006]
FIG. 6 shows the processor element (PE) 12 of FIG. 5 in more detail as a prior art example. In FIG. 6, 256 processor elements 12, PE[0] to PE[255], are provided. Each processor element is assigned an ordinal from 0 to 255 as its “PE number”. In this specification, the numbers are assigned in order from the left of each drawing.
[0007]
Each processor element 12 consists of registers 14 and an arithmetic section built around an ALU (arithmetic logic unit) 16. The registers 14 comprise, for example, 32 registers R0 to R31, some of which are connected to the external input/output. The registers R0 to R31 are connected to a shift/extension unit 18 that shifts and extends data, and its output is latched in the second storage unit 22 as one operand of the ALU 16. The other operand of the ALU 16 is the value of the A register 24 in the arithmetic section, latched in the first storage unit 20. In the example of FIG. 6, each register is 8 bits wide. Furthermore, the 24 registers R0 to R23 are connected to the external input/output, the path from the registers 14 to the shift/extension unit 18 is 8 bits wide, and the shift/extension unit 18 extends the 8 bits to 16 bits and outputs them to the second storage unit 22. Accordingly, the ALU 16, the A register 24, and the first storage unit 20 are 16 bits wide.
[0008]
The data flow in FIG. 6 is roughly as follows. First, 8-bit image data transferred from the external input to the R register 14 is extended to 16 bits or shifted by the shift/extension unit 18, transferred to the second storage unit 22, and processed by the ALU 16. The resulting data written to the A register 24 is then written back to the R register 14 through the shift/extension unit 18 and output to the external output.
[0009]
Further, in the prior art example of FIG. 7, the path between the R register 14 and the shift/extension unit 18 can also be set between the R registers 14 and shift/extension units 18 of nearby processor elements. The arithmetic unit 26 in FIG. 7 collectively denotes the ALU 16, the first storage unit 20, the second storage unit 22, and the A register 24. Taking the PE of interest as “PE[n]”, the arithmetic unit 26 of PE[n] can exchange data with the R register 14 of PE[n] itself and also with those of the PE one to the left (PE[n-1]), two to the left (PE[n-2]), three to the left (PE[n-3]), one to the right (PE[n+1]), two to the right (PE[n+2]), and three to the right (PE[n+3]), that is, with the R registers 14 of seven PEs in total. A bus selector is required for this; in the prior art example of FIG. 7 it is shown together with the shift/extension unit 18 under reference numeral 28.
[0010]
In this specification, this data communication between the R registers 14 of nearby PEs and the arithmetic unit 26 is referred to as “PE shift”. It is also generally called a neighborhood operation (adjacent processing).
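As an illustration only, the prior-art PE shift can be modeled in software as reading a value from the PE at a fixed offset. The sketch below assumes that a PE near either end reads a fixed fill value (for example 0 from a GND tie-off) when the requested neighbor does not exist; the function name and parameters are hypothetical and are not taken from the specification.

```python
NUM_PE = 256      # PE count used for explanation in this specification
MAX_SHIFT = 3     # neighbors reachable on each side in FIG. 7


def pe_shift_prior_art(r_values, offset, fill=0):
    """Return, for every PE[n], the R register value of PE[n + offset].

    r_values holds one R register value per PE. Neighbors that fall
    outside the array read the fixed value `fill` (assumed tie-off).
    """
    if abs(offset) > MAX_SHIFT:
        raise ValueError("offset outside the PE shift range")
    n_pe = len(r_values)
    return [
        r_values[n + offset] if 0 <= n + offset < n_pe else fill
        for n in range(n_pe)
    ]
```

With this model, the PEs within three positions of either end receive the fill value rather than real pixel data for some offsets.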
[0011]
Image processing using PE shift includes MTF (Modulation Transfer Function) correction and flare data removal.
[0012]
MTF correction computes an emphasis component from the peripheral pixels excluding the target pixel, multiplies the result by an intensity magnification, and adds it to the target pixel. For a filter matrix of 3 lines × 5 pixels as shown in FIG. 8, let the central target pixel be d21, the MTF filter coefficients be M00 to M42, the intensity magnification be mag, the matrix of input pixel data be d00 to d42, and the filter output be mtfo. The MTF correction is then expressed by the following equation.
[0013]
mtfo = d21 × M21 +
mag × (d00 × M00 + d10 × M10 + d20 × M20 + d30 × M30 + d40 × M40 + d01 × M01 + d11 × M11 + d31 × M31 + d41 × M41 + d02 × M02 + d12 × M12 + d22 × M22 + d32 × M32 + d42 × M42)
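The equation can be sketched for a single output pixel as follows. This is a minimal illustration, assuming that every pixel d_xy is paired with the coefficient M_xy of the same index, that the first index is the pixel position (0 to 4), and that the second is the line (0 to 2). The function name is hypothetical.

```python
def mtf_correct(d, M, mag):
    """Compute mtfo for one 3-line x 5-pixel window.

    d and M are indexed [x][y], x = 0..4 (pixel), y = 0..2 (line),
    so the target pixel d21 is d[2][1].
    """
    cx, cy = 2, 1
    # Emphasis component: weighted sum over the 14 peripheral pixels.
    surround = sum(
        d[x][y] * M[x][y]
        for x in range(5)
        for y in range(3)
        if (x, y) != (cx, cy)
    )
    return d[cx][cy] * M[cx][cy] + mag * surround
```

On the SIMD processor each PE would evaluate this sum for its own column, which is why the terms with x other than 2 require PE shift.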
[0014]
Flare data removal is a process that removes flare light (light other than the diffused light reflected from the document surface) arising when a scanner or the like reads a document. Within the region of 3 lines × 5 pixels shown in FIG. 9, the values of the pixels whose value is less than a threshold T are summed and divided by the number of pixels N to obtain an average value (the flare correction amount), and this value is subtracted from the target pixel data.
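A rough sketch of this computation for one window is given below. The specification does not define N precisely, so the sketch assumes N is the number of pixels below the threshold; integer division and clamping the result at 0 are likewise assumptions, and the function name is hypothetical.

```python
def remove_flare(window, threshold):
    """Subtract the flare correction amount from the target pixel.

    window is indexed [line][pixel] with 3 lines of 5 pixels;
    the target pixel is window[1][2].
    """
    below = [v for line in window for v in line if v < threshold]
    # Assumed: N counts only the pixels below the threshold.
    correction = sum(below) // len(below) if below else 0
    return max(window[1][2] - correction, 0)
```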
[0015]
All of the above operations need to reference data before and after the target pixel. A filter of 3 lines × 5 pixels was used as the example, but filters covering larger matrix ranges, such as 5 lines × 7 pixels or more, also exist.
[0016]
Usually, the amount processed by one instruction in a SIMD type microprocessor is referred to as “1 SIMD”. In a digital copier, for example, the image processing of one original page requires a total number of “SIMD” operations obtained by dividing the number of pixels in one line by the number of pixels that “1 SIMD” can process and multiplying the quotient by the number of lines. The line direction is usually taken as horizontal and called the main scanning direction, and the vertical direction is called the sub-scanning direction. Processing in the main scanning direction is therefore efficient if the data can be processed in exactly contiguous “1 SIMD” units, as shown in FIG. 10.
[0017]
However, when neighboring pixel data must be referenced, as in the processing described above, each “1 SIMD” of data has to be loaded with several pixels at both ends overlapping the adjacent units, as shown in FIG. 11 (the number of overlapping pixels is determined by the filter width). Needless to say, processing is more efficient without the overlap.
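The cost of the overlap can be estimated with a simple count of loads per line. The model below is an assumption made for illustration: every load sacrifices `overlap` pixels at each end, and the slightly better behavior at the true ends of the line is ignored. The function name and the line width used in the usage note are hypothetical.

```python
def simd_passes_per_line(line_pixels, num_pe=256, overlap=0):
    """Count the "1 SIMD" loads needed to cover one line.

    overlap is the number of extra pixels fetched at EACH end of a
    load only so the filter can see its neighbors (assumed model).
    """
    useful = num_pe - 2 * overlap      # pixels producing output per load
    if useful <= 0:
        raise ValueError("overlap leaves no useful pixels")
    return -(-line_pixels // useful)   # ceiling division
```

For a hypothetical 7680-pixel line, `simd_passes_per_line(7680)` gives 30 loads, while `simd_passes_per_line(7680, overlap=2)` (a filter five pixels wide) gives 31.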
[0018]
Patent Documents 1 to 4 listed below aim to improve neighborhood operations (adjacent processing), but none of them solves the overlap problem described above.
[0019]
[Patent Document 1]
Japanese Patent Laid-Open No. 11-15801
[Patent Document 2]
Japanese Patent No. 2756257
[Patent Document 3]
Japanese Patent No. 2812292
[Patent Document 4]
Japanese Patent Laid-Open No. 2002-247347
[0020]
[Problems to be solved by the invention]
An object of the present invention is to eliminate the above overlap and process data efficiently when one line is processed by multiple “SIMD” operations in a SIMD type microprocessor.
[0021]
[Means for Solving the Problems]
The present invention has been made to achieve the above object. The SIMD type microprocessor according to claim 1 of the present invention
has a global processor and max processor elements, the processor elements being numbered in order of arrangement with ordinals from 0 to (max-1),
each processor element further including an arithmetic unit that processes data and a plurality of registers connected to the input and output buses of the arithmetic unit,
wherein, for the processor elements within a predetermined first number from either end of the processor element arrangement, the bus from the i-th processor element from that end (i is a natural number and 1 ≦ i ≦ (first number)) is connected by paths, through selectors, to the respective buses of the processor elements from the one at the opposite end up to the ((first number) - i)-th processor element,
the processor elements other than those within the first number from either end are each connected by paths to the buses of the processor elements adjacent on the left and right within the first number, centered on the respective processor element,
and a selection signal from the global processor, which indicates which of the plurality of registers connected to the input and output buses of the arithmetic unit of each processor element is selected, differs between the processor elements within the predetermined first number from either end of the processor element arrangement and the other processor elements.
[0022]
The SIMD type microprocessor according to claim 2 of the present invention is
the SIMD type microprocessor according to claim 1, wherein the selector sets a specific value in the arithmetic unit of each processor element according to a control signal.
[0023]
The SIMD type microprocessor according to claim 3 of the present invention is
the SIMD type microprocessor according to claim 1, wherein, in the layout of the processor elements,
the processor elements at both ends of the processor element arrangement are placed close to each other so that the wiring between individual processor elements is of equal length.
[0024]
[Embodiments of the invention]
Preferred embodiments according to the present invention will be described below with reference to the drawings.
[0025]
<< First Embodiment >>
FIG. 1 is a block diagram of a processor element 12 according to the first embodiment of the present invention. FIG. 1 particularly shows the processor elements 12 at both ends of the processor element group 10.
[0026]
In the first embodiment, the bus connection for PE shift is modified relative to the prior art example of FIG. 7. Specifically, a PE shift selector 30 is newly provided.
[0027]
Normally, the PE shift buses of the PEs 12 at both ends receive a specific value; for example, they are tied to GND so that “0” is input.
[0028]
In FIG. 1, data can be exchanged with up to three processor elements on the left and up to three on the right. This number may be increased or decreased. In FIG. 1, the source can thus be selected from seven PEs, including the PE itself. Although not shown in FIG. 1, the bus between the register 14 and the shift/extension/bus selector 28' is 8 bits wide.
[0029]
Connecting PE[253], PE[254], and PE[255] to PE[0], PE[1], and PE[2] through the PE shift selector 30 enables the data communication shown in FIG. 12. In FIG. 12, “target PE” denotes the PE containing the arithmetic unit 26 of interest, and “data transferable PE” denotes the PE whose registers are connected to the arithmetic unit 26 of the target PE for reading and writing data. L3 denotes the PE three positions to the left of the target PE, L2 two to the left, L1 one to the left, C the target PE itself, R1 one to the right, R2 two to the right, and R3 three to the right.
[0030]
For example, in FIG. 12 the bus of PE[254] is connected on the left to PE[251], PE[252], and PE[253], and on the right to PE[255] directly and to PE[0] and PE[1] through the PE shift selector 30. The PE shift selector 30 can also select “0” or “0FFh” (all bits “1”) instead of PE[0] or PE[1].
[0031]
The PE shift selector 30 is connected to PE[253], PE[254], PE[255], PE[0], PE[1], and PE[2]. The PE shift selectors for these three PEs at each end reside within the block of each PE 12 (in FIG. 1 they are drawn outside the blocks for convenience of explanation). The control signal of the PE shift selector 30 is supplied from the SCU 5 of the global processor 4, and this control signal (“0”, “1”) controls whether the PE on the opposite side is selected. To supply “0”, all bits are 0, so the input can simply be tied to GND. To supply “0FFh”, all bits are 1, so the input can be tied to VCC. The PE shift selector 30 includes a selector, which may be a 3-to-1 multiplexer with a 2-bit select signal.
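The behavior of the selector can be sketched as a 3-to-1 multiplexer on an 8-bit bus. The specification states only that the select signal is 2 bits wide; the particular encoding below (00, 01, 10) and the function name are assumptions made for illustration.

```python
def pe_shift_selector(select, opposite_pe_value):
    """Model of a 3-to-1 multiplexer with a 2-bit select (encoding assumed).

    0b00 -> constant 0x00 (input tied to GND)
    0b01 -> constant 0xFF (input tied to VCC)
    0b10 -> 8-bit value from the PE on the opposite side
    """
    if select == 0b00:
        return 0x00
    if select == 0b01:
        return 0xFF
    if select == 0b10:
        return opposite_pe_value & 0xFF
    raise ValueError("select code 0b11 is unused in this sketch")
```

Selecting a constant reproduces the conventional end-of-array behavior, while selecting the opposite PE enables the wrap-around connection.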
[0032]
Here the “opposite side” means PE[253], PE[254], or PE[255] for PE[0]; PE[254] or PE[255] for PE[1]; and PE[255] for PE[2]. Conversely, it means PE[0], PE[1], or PE[2] for PE[255]; PE[0] or PE[1] for PE[254]; and PE[0] for PE[253].
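The connectivity described for FIG. 12 amounts to taking PE numbers modulo the PE count. The following sketch reproduces one row of such a table for any target PE; the function name and the dictionary layout are hypothetical, and the labels L3 to R3 follow the notation of FIG. 12.

```python
def transferable_pes(n, num_pe=256, reach=3):
    """Return the PEs reachable from PE[n], keyed by the FIG. 12 labels."""
    row = {}
    for off in range(-reach, reach + 1):
        if off < 0:
            label = f"L{-off}"
        elif off == 0:
            label = "C"
        else:
            label = f"R{off}"
        # Wrap-around models the path through the PE shift selector.
        row[label] = (n + off) % num_pe
    return row
```

For example, `transferable_pes(254)` maps R2 to PE[0] and R3 to PE[1], matching the description of PE[254].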
[0033]
FIG. 2 shows the control signals of the registers 14 in the first embodiment. As shown in FIG. 2, the control signals of the registers 14 are sent from the sequential unit (SCU) 5 of the global processor 4 to each PE. The sequential unit 5 decodes the program written in the memory of the global processor 4 and sends control signals to the global processor 4 and the processor elements 12 to operate each block.
[0034]
Registers are selected by a first register selection signal 32 and a second register selection signal 34. The first register selection signal 32 is sent to the registers 14 of PE[3] to PE[252], and the second register selection signal 34 is sent to PE[0] to PE[2] and PE[253] to PE[255]. Each register selection signal (the first register selection signal 32 and the second register selection signal 34) is a set of multiple selection signals. That is, if each PE has 32 registers, one signal per register is needed for reading and one for writing, so each register selection signal requires at least 64 control signals.
[0035]
For example, in FIG. 7, one line of data is loaded “1 SIMD” at a time, in order from the left, from the external input/output of FIG. 6. Suppose the first “1 SIMD” of data is loaded into R1, the second into R2, and the third into R3. In other words, if R1 holds the data currently being processed, the “SIMD” units of data before and after it are always kept in other registers. Put differently, R0 holds the data of the previous SIMD that R1 loaded last time, and R2 holds the data of the next SIMD that R1 will load next time.
[0036]
Now suppose that a certain filter process references data from the R1 register of the PE whose PE number is larger by 1, operates on it with the value of the A register 24 in the arithmetic unit 26 of each PE, and stores the result in the A register 24. That is, the A register 24 of PE[0] operates with R1 of PE[1], the A register 24 of PE[1] operates with R1 of PE[2], and so on, so that the A registers 24 of PE[0] to PE[254] operate with the R1 values of PE[1] to PE[255], respectively.
[0037]
If the A register 24 of PE[255] operates with the R2 value of PE[0], it performs an operation equivalent to that of the other PEs, and the first embodiment makes this easy. The operation is carried out by specifying the register in the instruction. The instruction is specified so that a different register is used when referencing a PE beyond the range of “1 SIMD” (that is, a PE on the opposite side).
[0038]
Next, consider referencing data of the previous SIMD: for example, data is referenced from the R1 register of the PE at [PE number - 3], operated on with the A register 24 in the arithmetic unit 26 of each PE, and the result is stored in the A register 24. That is, the A register 24 of PE[3] operates with R1 of PE[0], the A register 24 of PE[4] operates with R1 of PE[1], and so on, so that the A registers 24 of PE[3] to PE[255] operate with the R1 values of PE[0] to PE[252], respectively. If the A register 24 of PE[0] operates with the R0 value of PE[253], it performs an operation equivalent to that of the other PEs, and the first embodiment makes this easy. Likewise, the A register 24 of PE[1] operates with the R0 value of PE[254], and the A register 24 of PE[2] operates with the R0 value of PE[255].
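The combined effect of the wrap-around path and the alternate register selection can be checked with a small software model. The sketch below assumes that R0, R1, and R2 hold the previous, current, and next “1 SIMD” of a line, as in the example above; the function name and list-based representation are hypothetical.

```python
NUM_PE = 256


def pe_shift_wrapped(r_prev, r_cur, r_next, offset):
    """Value seen by each PE[n] when it references PE[n + offset].

    In-range references read the current register (R1 in the text).
    References past the right end wrap to the left-end PEs and read the
    next-SIMD register (R2); references past the left end wrap to the
    right-end PEs and read the previous-SIMD register (R0).
    """
    out = []
    for n in range(NUM_PE):
        m = n + offset
        if m < 0:
            out.append(r_prev[m + NUM_PE])
        elif m >= NUM_PE:
            out.append(r_next[m - NUM_PE])
        else:
            out.append(r_cur[m])
    return out
```

Applied to three consecutive 256-pixel segments of one line, the result equals the corresponding slice of the undivided line, so no overlapping load is needed.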
[0039]
In the first embodiment described above, the first register selection signal 32 and the second register selection signal 34 are provided. Alternatively, although not illustrated, a single signal indicating the register selection mode may be provided from the SCU 5, a register selection changing device may be added to each of the blocks PE[0] to PE[2] and PE[253] to PE[255], and the register control signals may be made common to all PEs, so that in the blocks PE[0] to PE[2] and PE[253] to PE[255] a register different from the one selected in the other PEs can be selected according to the selection mode.
[0040]
<< Second Embodiment >>
In the second embodiment, as will be described below, the arrangement of the processor elements is devised. The other configuration is the same as that of the first embodiment.
[0041]
When the processor elements 12 are laid out on a wafer with the arrangement shown in FIG. 3, the bus wiring between PE[0] to PE[2] and PE[253] to PE[255] becomes very long. The wiring resistance and delay then increase, which becomes a decisive cause of severe speed degradation in PE shift. Between the other PEs, the wiring delay for adjacent processing is uniform.
[0042]
FIG. 4 shows the arrangement of processor elements according to the second embodiment of the present invention. The processor elements are divided into a first half and a second half, which are placed in a mirror arrangement. In this way, the wiring delay for adjacent processing is substantially uniform across all processor elements.
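The benefit of the mirror arrangement can be illustrated with an idealized geometry. The grid coordinates below are assumptions, not taken from FIG. 3 or FIG. 4: the linear layout places all PEs in one row, and the mirrored layout places the first half left to right in one row and the second half right to left in the row beneath it. Wire length is approximated by Manhattan distance in PE pitches.

```python
def linear_position(n, num_pe=256):
    """All PEs in a single row (assumed model of a FIG. 3 style layout)."""
    return (n, 0)


def mirrored_position(n, num_pe=256):
    """First half in row 0, second half mirrored in row 1 (assumed model)."""
    half = num_pe // 2
    return (n, 0) if n < half else (num_pe - 1 - n, 1)


def longest_pe_shift_wire(position, num_pe=256, reach=3):
    """Longest Manhattan distance over all wrap-around PE shift links."""
    longest = 0
    for n in range(num_pe):
        x0, y0 = position(n, num_pe)
        for off in range(1, reach + 1):
            x1, y1 = position((n + off) % num_pe, num_pe)
            longest = max(longest, abs(x1 - x0) + abs(y1 - y0))
    return longest
```

Under these assumptions the longest link is 255 pitches in the linear layout and 3 pitches in the mirrored layout.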
[0043]
[Effects of the invention]
By using the present invention, the following effects can be obtained.
[0044]
First, in image processing with a large number of pixels in one line, extra overhead can be reduced.
[0045]
By making the arrangement of the processor elements into a mirror arrangement, the distance between the processor elements at both ends can be reduced, thereby preventing a reduction in speed due to wiring delay.
[Brief description of the drawings]
FIG. 1 is a block diagram of a processor element according to a first embodiment of the present invention.
FIG. 2 shows a register control signal in the first embodiment.
FIG. 3 is an arrangement of processor elements according to the present invention.
FIG. 4 is an arrangement of processor elements according to a second embodiment of the present invention.
FIG. 5 is a schematic block diagram of a SIMD type microprocessor.
FIG. 6 is a block diagram showing the processor element of FIG. 5 in more detail.
FIG. 7 is a block diagram of a processor element according to the prior art.
FIG. 8 is a schematic diagram showing MTF correction.
FIG. 9 is a schematic diagram showing flare data removal.
FIG. 10 is a relational diagram (1) between one line and one SIMD.
FIG. 11 is a relational diagram (2) between one line and one SIMD.
FIG. 12 is a table showing data communication according to the present invention.
[Explanation of symbols]
2 ... SIMD type microprocessor, 4 ... global processor, 12 ... processor element, 26 ... arithmetic unit, 28, 28' ... shift/extension/bus selector, 30 ... PE shift selector.

Claims

1. A SIMD type microprocessor comprising a global processor and max processor elements, the processor elements being numbered in order of arrangement with ordinals from 0 to (max-1),
each processor element further including an arithmetic unit that processes data and a plurality of registers connected to the input and output buses of the arithmetic unit,
wherein, for the processor elements within a predetermined first number from either end of the processor element arrangement, the bus from the i-th processor element from that end (i is a natural number and 1 ≦ i ≦ (first number)) is connected by paths, through selectors, to the respective buses of the processor elements from the one at the opposite end up to the ((first number) - i)-th processor element,
the processor elements other than those within the first number from either end are each connected by paths to the buses of the processor elements adjacent on the left and right within the first number, centered on the respective processor element,
and a selection signal from the global processor, which indicates which of the plurality of registers connected to the input and output buses of the arithmetic unit of each processor element is selected, differs between the processor elements within the predetermined first number from either end of the processor element arrangement and the other processor elements.

2. The SIMD type microprocessor according to claim 1, wherein the selector sets a specific value in the arithmetic unit of each processor element according to a control signal.

3. The SIMD type microprocessor according to claim 1, wherein, in the layout of the processor elements,
the processor elements at both ends of the processor element arrangement are placed close to each other so that the wiring between individual processor elements is of equal length.