JPH0740268B2

JPH0740268B2 - Document matching device

Info

Publication number: JPH0740268B2
Application number: JP60209630A
Authority: JP
Inventors: 立武田; 節子川島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: NTT Inc
Priority date: 1985-09-21
Filing date: 1985-09-21
Publication date: 1995-05-01
Anticipated expiration: 2010-05-01
Also published as: JPS6269367A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Industrial field of application]

The present invention relates to a document filing apparatus that uses a large-capacity storage device, such as an optical disk device, to store the page information of printed matter either as a raw image or after conversion into character information. More specifically, it relates to a method of constructing the index information used when retrieving a stored document.

[Conventional technology]

Conventionally, when a document is stored in a document filing apparatus, the page is converted into an electric signal by a scanner or the like and input as the original information. Independently of this, a person separately enters, from a keyboard or the like, character-string information such as keywords, that is, linguistic information such as a heading that represents the page. Both are then stored. To retrieve a document that should already be stored, the person relies on memory to type in the character string believed to have been assigned at storage time, searches for the document's location, and then obtains the original page.

[Problems to be solved by the invention]

Character-string information used as a keyword is excellent in terms of precision. However, because of the diversity of natural language and the vagueness of human memory, it is difficult to recall exactly the character string that was assigned when the document was stored. Retrieval from a large-volume document storage apparatus therefore tends to involve much trial and error, and the effective retrieval speed actually achieved is far lower than the speed the hardware is capable of.

Another conventional method uses a representative drawing that evokes the document as the index, but this has the drawback that the amount of information is too large for an index to be practical. Yet another conceivable method, as in Japanese Patent Application No. 58-199572, is to display the original documents at high speed, but even that is effective only within a limited region of a file that has already been identified to some extent.

An object of the present invention is to solve the conventional problem that an index consisting only of character-string information is inconvenient, and to provide a document filing apparatus that can make use of image information, which is comparatively stable (hard to forget) as human memory.

[Means and Actions for Solving Problems]

The document filing apparatus of the present invention comprises: means for reading a document; means for storing the read document; means for dividing the page of the read document into a plurality of regions, determining the attribute of each region from its physical features and, when the attribute of a divided region cannot be determined, further dividing that region into a plurality of subregions and repeating this until the attribute is determined, thereby obtaining the attributes of the subregions; and means for storing the set of attributes obtained for the regions and subregions as an index of the read document, in association with that document.

That is, the present invention measures and calculates the physical constants of the input document page by hierarchical division and stores the principal resulting values in the index storage means together with the keyword index and the like. When a document is retrieved, the physical constants of the page can then be specified along with keywords, so that the desired document can be retrieved quickly and accurately from a large number of stored documents.

[Example]

FIG. 1 shows the overall configuration of an embodiment of the document filing apparatus of the present invention. The image scanner 1 reads the document page as black-and-white binary information. The image buffer 2 temporarily stores the document data read by the image scanner 1. In this embodiment, the image buffer 2 consists of 1728 dots × 2304 lines. The buffer control circuit 3 controls the image buffer 2 and the switching circuit 4 so that the central portion of the input document data in the image buffer 2, excluding the margins (1024 dots × 1536 lines), is divided into N regions that are output one after another to the image data bus 9 through the switching circuit 4. In this embodiment, N = 6. The information companding circuit 10 compresses and decompresses document data. The document data storage device 110 stores the document data compressed by the information companding circuit 10. Conversely, document data read from the document data storage device 110 is decompressed by the information companding circuit 10 and displayed on the CRT display device 112. The index data storage device 111 stores the indexes for the document data held in the document data storage device 110.

The horizontal marginal distribution counting circuit 5 counts the horizontal marginal distribution of each divided region and stores it in the memory 51. The vertical marginal distribution counting circuit 6 likewise calculates the vertical marginal distribution and stores it in the memory 61. The horizontal run-length counting circuit 7 calculates the horizontal run lengths of each divided region and stores them in the memory 71 or 72. The vertical run-length counting circuit 8 likewise calculates the vertical run lengths and stores them in the memory 81 or 82. The fast Fourier transform circuit 32 performs Fourier transform processing at high speed, independently of the CPU 100. The CPU 100 is, for example, a microprocessor, and executes most of the processing for generating the index corresponding to the document data. The operation is described below.

Reading

The CPU 100 issues a read command to the image scanner 1 via the bus 101. The document inserted in the scanner 1 is read and temporarily stored in the image buffer 2 as black-and-white binary data (1728 dots × 2304 lines).

This page information is stored in the document data storage device 110 via the information companding circuit 10. Independently of this, only the central 1024 dots × 1536 lines of the page, as shown in FIG. 2, are retained in the image buffer 2 for the feature extraction operation.

Division

The image buffer control circuit 3 first scans, in the horizontal direction, the binary data of 512 dots × 512 lines corresponding to region 1 in the upper left of the document page shown in FIG. 2 (1/6 of the central portion of the page) and outputs it sequentially to the image data bus 9. The image buffer control circuit 3 then scans the binary data of the same upper-left region in the vertical direction and again outputs it sequentially to the image data bus 9. This is repeated for regions 2 to 6.
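The cropping and six-way division described above can be sketched in software as follows. This is only an illustrative Python sketch, not the circuit of the embodiment: the function names, the centred placement of the window, and the row-major numbering of the regions (2 columns × 3 rows, region 1 at the upper left) are assumptions.

```python
def crop_center(page, width=1024, height=1536):
    """Return the central width x height window of a page.

    `page` is a list of rows, each a list of 0/1 pixels (1 = black).
    Centring the window is an assumption; the text only says the margins
    are excluded.
    """
    rows, cols = len(page), len(page[0])
    top, left = (rows - height) // 2, (cols - width) // 2
    return [row[left:left + width] for row in page[top:top + height]]


def split_regions(center, n_cols=2, n_rows=3):
    """Split the window into n_rows x n_cols equal regions in row-major order."""
    h, w = len(center) // n_rows, len(center[0]) // n_cols
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            regions.append([row[c * w:(c + 1) * w]
                            for row in center[r * h:(r + 1) * h]])
    return regions


def scan_horizontal(region):
    """Yield the pixels line by line, as in the horizontal scan."""
    for row in region:
        yield from row


def scan_vertical(region):
    """Yield the pixels column by column, as in the vertical scan."""
    for column in zip(*region):
        yield from column
```

With the dimensions of the embodiment (a 1728 × 2304 buffer and a 1024 × 1536 window), `split_regions` would return six regions of 512 × 512 pixels.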

Horizontal distribution and run length

For each horizontal line of the divided image data, the horizontal marginal distribution circuit 5 accumulates the number of black pixels in that line and stores the value in the memory 51; this is repeated for 512 lines. At the same time, the horizontal run-length counting circuit 7 measures each run length in the horizontal line, stores the values in the memory 71 while averaging them, and simultaneously calculates the variance and stores it in the memory 72. This operation is also repeated for 512 lines. In the same way, for the vertical direction, the vertical marginal distribution circuit 6 and the vertical run-length counting circuit 8 calculate the number of black pixels and the mean and variance of the run lengths, which are stored in the memories 61, 81 and 82, respectively. This operation is repeated for 512 lines.
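The counting performed by circuits 5 to 8 can be illustrated with a short Python sketch. The function names are hypothetical, and two details are assumptions rather than statements of the text: the runs of all lines in a region are pooled before averaging, and the variance is the population variance.

```python
def marginal_counts(region):
    """Return (Ph, Pv): black-pixel counts per horizontal and per vertical line."""
    ph = [sum(row) for row in region]
    pv = [sum(column) for column in zip(*region)]
    return ph, pv


def black_runs(line):
    """Return the lengths of the consecutive runs of black (1) pixels in a line."""
    runs, length = [], 0
    for pixel in line:
        if pixel:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs


def run_stats(lines):
    """Return (mean, variance) of the black run lengths over all given lines."""
    runs = [r for line in lines for r in black_runs(line)]
    if not runs:
        return 0.0, 0.0
    mean = sum(runs) / len(runs)
    variance = sum((r - mean) ** 2 for r in runs) / len(runs)
    return mean, variance
```

Horizontal statistics come from `run_stats(region)`; the vertical ones can be obtained by passing the transposed region, for example `run_stats(list(zip(*region)))`.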

Feature vectors

The series of horizontal and vertical operations above yields four vectors: the horizontal marginal distribution vector Ph (512 elements), the vertical marginal distribution vector Pv (512 elements), the run-length mean vector Lr (horizontal black-run mean, vertical black-run mean), and the run-length variance vector Lv (horizontal run variance, vertical run variance). Their dimensions are 512, 512, 2 and 2, respectively. One set of these four vectors is generated for each of the six regions of the page in FIG. 2. As a result, six sets, that is, 24 vectors, per page are transferred to the CPU 100 via the bus 101. As they stand, these vectors contain too much data to serve as an index and are inconvenient for classification, so the CPU 100 further converts them into representative values as follows.

Representative values of the marginal distribution

The marginal distribution vectors P, of which two are generated per region of the six-way divided page, can be regarded, as is well known, as two marginal distribution curves (vertical and horizontal) of a single-valued function, and each can be represented by a mean Pm and a variance Vp. There is one such pair for the horizontal direction and one for the vertical direction, but by definition the horizontal marginal distribution mean Pmh and the vertical marginal distribution mean Pmv satisfy Pmh = Pmv, and this value is proportional to the average brightness of the page (number of black pixels / total number of pixels). Both are therefore referred to as the average brightness Pm. In this way, the arithmetic features of the marginal distribution are condensed into a three-dimensional vector P(Pm, Vpv, Vph). Physically, the average brightness expresses the degree of concentration of black portions.
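A small Python sketch may make the reduction to P(Pm, Vpv, Vph) concrete. The function name is hypothetical. As an assumption, each line count is normalised by the line length so that both means equal the black-pixel fraction even for non-square regions; for the square 512 × 512 regions of the embodiment the raw counts already give equal means.

```python
def marginal_summary(region):
    """Return (Pm, Vpv, Vph) for a region given as rows of 0/1 pixels."""
    height, width = len(region), len(region[0])
    # Fraction of black pixels in each horizontal and each vertical line.
    ph = [sum(row) / width for row in region]
    pv = [sum(column) / height for column in zip(*region)]

    def mean_and_variance(values):
        mean = sum(values) / len(values)
        return mean, sum((v - mean) ** 2 for v in values) / len(values)

    pmh, vph = mean_and_variance(ph)
    pmv, vpv = mean_and_variance(pv)
    # Both means equal (black pixels) / (all pixels), so one value Pm suffices.
    assert abs(pmh - pmv) < 1e-9
    return pmh, vpv, vph
```

A region whose black pixels are concentrated in a few rows gives a large Vph and a small Vpv, which is the kind of layout cue the three values are meant to retain.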

Next, the periodic features of the marginal distribution are extracted using the FFT (fast Fourier transform) as follows. On command, the CPU 100 transfers the data of the horizontal marginal distribution vector Ph (for example, 512 values) from the horizontal distribution memory 51 to the FFT circuit 32 and has it execute the FFT. The CPU 100 then reads back only the result via the bus 101 and obtains the representative horizontal frequency Fh.

Similarly, the representative vertical frequency Fv is obtained from the vertical marginal distribution vector. With these as elements, the frequency feature vector (Fh, Fv) is defined.
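The text does not say how the representative frequency is chosen from the transform result. The sketch below assumes it is the non-zero frequency bin with the largest magnitude, and uses a plain discrete Fourier transform from the Python standard library in place of the FFT circuit 32; the function names are hypothetical.

```python
import cmath


def dft_magnitudes(samples):
    """Return the magnitudes of DFT bins 0 .. n/2 of a real-valued sequence."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def representative_frequency(samples):
    """Return the index of the strongest bin, ignoring bin 0 (the mean level).

    Taking the dominant bin is an assumption made for this sketch.
    """
    magnitudes = dft_magnitudes(samples)
    return max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
```

Under this assumption, a profile whose first half is black and whose second half is blank gives a frequency of 1, and a profile with eight evenly spaced text lines gives 8.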

The physical parameters based on the marginal distribution have thus been condensed into two vectors (five scalar quantities). Next, representative values are derived for the run-length parameters.

Representative values of the run lengths

As shown earlier, the run-length information is condensed into four scalar quantities: the means Lrw and Lrb of the white and black runs, and the variances Lvh and Lvb. The run vector L having these as elements can be written L(Lrw, Lrb), and the run variance vector Lv can be written Lv(Lvh, Lvb). That is, the four scalar quantities are condensed into two vectors. Physically, they express the tendency of black pixels to be connected over the whole region, and they serve to detect photographs and charts.

Index size

If each scalar parameter is represented in 2 bytes, 18 bytes of index information consisting of four vectors (nine scalars in all) have been generated. With six regions per page, six such sets are produced per page, giving 18 × 6 = 108 bytes. This is still not suitable for storage as an index, so it is condensed further as follows.

Category determination for a region

Each of the six regions under consideration is classified by the following method into exactly one of four categories: blank, line drawing, character string, or photograph. A region is not allowed to span two categories. In other words, the model adopted is that a document consists of these four kinds of elements.

Blank: a region is regarded as blank when its average density Pm is smaller than 1 × 10^-2. However, to distinguish it from a line drawing, regions whose black run-length variance is greater than 1 × 10^2 are excluded.

Line drawing: a region is regarded as a line drawing when its average density Pm is smaller than 1 × 10^-1. However, to distinguish it from a blank, regions whose black run-length variance is smaller than 1 × 10^2 are excluded.

Character string: determined from the mapping of the frequency feature vector onto a two-dimensional space. As shown in FIG. 3, a region is regarded as a character string when either coordinate lies in the part where the frequency is greater than 10. This makes use of the line-structure information. In addition, the run vector is examined, and a black-run mean smaller than 5 is required as a condition, exploiting the fact that runs are short in text. Furthermore, when the point lies close to one of the two axes, the region can be regarded as written vertically or horizontally. In FIG. 3, each circle corresponds to one divided region; white circles are character strings or tables, and black circles are other material (drawings, graphs, photographs, and so on). A frequency of 1 corresponds to a case such as the right half of the image being blank and the left half being solid black.

Photograph: a region is regarded as a photograph when its average density Pm is greater than 1 × 10^-1. In addition, the run vector is examined, and the black-run mean is required to be greater than 5 both horizontally and vertically.
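The four rules above can be collected into one decision function. The following Python sketch is an illustration only: the class and field names are hypothetical, the order in which the rules are tried is an assumption, a single black run-length variance per region is assumed, and the character-string rule is assumed to require the black-run mean to be below 5 in both directions. The one-letter symbols S, G, C, P and X follow the notation the document uses for the determination results.

```python
from dataclasses import dataclass


@dataclass
class RegionFeatures:
    pm: float                # average density (black pixels / all pixels)
    black_run_var: float     # variance of the black run lengths
    black_run_mean_h: float  # mean horizontal black run length
    black_run_mean_v: float  # mean vertical black run length
    fh: float                # representative horizontal frequency
    fv: float                # representative vertical frequency


def classify(f: RegionFeatures) -> str:
    """Return S (blank), G (line drawing), C (character string),
    P (photograph) or X (undetermined)."""
    run_means = (f.black_run_mean_h, f.black_run_mean_v)
    if f.pm < 1e-2 and not f.black_run_var > 1e2:
        return "S"
    if f.pm < 1e-1 and not f.black_run_var < 1e2:
        return "G"
    if max(f.fh, f.fv) > 10 and max(run_means) < 5:
        return "C"
    if f.pm > 1e-1 and min(run_means) > 5:
        return "P"
    return "X"
```

A region that satisfies none of the rules is reported as X, which is the case handled by the hierarchical subdivision.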

Hierarchical determination of region attributes

Naturally, because several kinds of elements are often mixed within one of the six regions, in the majority of cases the first-layer determination above does not settle the category. In that case, the 512 × 512 pixel region is further divided into four, giving four 128 × 128 regions. The second-layer determination is then performed on these subregions, applying the category determination described above. When the categories of all four subregions agree, the first layer is defined as resolved.

If the category is still undetermined, each subregion is divided into four again, into 64 × 64 regions, and the third-layer determination is performed. In the same way thereafter, agreement of the four regions of layer N + 1 resolves layer N. FIG. 4 illustrates this.
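The layered determination can be written as a recursion. The sketch below is a hedged illustration with hypothetical names: `classify` is any function that returns one of S, G, C, P or X for a pixel region, and the subdivision simply halves each side, which is an assumption (the text quotes 128 × 128 and 64 × 64 as the subregion sizes).

```python
def quarter(region):
    """Split a region into upper-left, upper-right, lower-left, lower-right."""
    h, w = len(region) // 2, len(region[0]) // 2
    return [[row[c:c + w] for row in region[r:r + h]]
            for r in (0, h) for c in (0, w)]


def resolve(region, classify, max_depth=2):
    """Return (symbol, children) for a region.

    `children` is None when the region is resolved, otherwise a list of four
    (symbol, children) pairs for the next layer down.
    """
    symbol = classify(region)
    too_small = len(region) < 2 or len(region[0]) < 2
    if symbol != "X" or max_depth == 0 or too_small:
        return symbol, None
    children = [resolve(q, classify, max_depth - 1) for q in quarter(region)]
    labels = {child_symbol for child_symbol, _ in children}
    if len(labels) == 1 and "X" not in labels:
        # All four subregions agree, so the layer above is resolved.
        return labels.pop(), None
    return "X", children
```

With `max_depth=2`, a first-layer region is refined down to the third layer at most, matching the depth the embodiment uses in ordinary cases.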

Describing the determination result

The symbols S, G, C and P are assigned to blank, line drawing, character string and photograph, respectively. For example, when all six regions are character strings, this is written

sample document 1 = C

which expresses that the document contains no elements other than character strings. This is called the layer-0 representation.

If, for example, the two lower regions in FIG. 2 (regions 5 and 6) are a line drawing and a photograph, this is written

sample document 2 = X,CCCCGP (X denotes an undetermined category)

The X in the layer-0 representation indicates a mixed document, and the following symbols show in more detail that the six regions consist of four character strings, one line drawing and one photograph, in that proportion and order. This is called the layer-1 representation.

A document that is not resolved at the first layer is described as follows. As in FIG. 4, if the photograph in the lower part of sample document 2 is somewhat larger and protrudes into region 4, region 4 contains a mixture of margin and part of the photograph, so an unambiguous determination may not be possible. In such a case the document is written

sample document 3 = X,CCCXGP,......

and the description continues with the second-layer information.

If, in the four subregions of the second layer, the upper half is blank and the lower half is mostly photograph, sample document 3 is written

sample document 3 = X,CCCXGP,SSGG.

If the document is not resolved even by the second layer, for example if the lower-right subregion in the example above is a mixture of margin and photograph, the third layer is also described:

sample document 4 = X,CCCXGP,SSGX,SSGG.

In this case the layer levels need not be marked explicitly, and the symbols may simply be concatenated:

sample document 4 = XCCCXGPSSGXSSGG

This is because there is always exactly one leading symbol, it is always followed by six symbols, and the second and lower layers are always delimited in groups of four. Moreover, the lower layers are described in the same order as the undetermined symbols X appear.
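The notation above can be generated mechanically from a tree of determination results. In the following Python sketch, which uses hypothetical names, each region is a `(symbol, children)` pair with `children` set to `None` for a resolved region. Emitting the groups layer by layer, in the order in which the X symbols appear, is an assumption drawn from the sample strings.

```python
def encode_page(regions, sep=","):
    """Encode six first-layer regions as an index symbol string."""
    first = [symbol for symbol, _ in regions]
    if len(set(first)) == 1 and first[0] != "X":
        return first[0]                       # layer-0 representation only
    groups = ["X", "".join(first)]            # layer 0, then layer 1
    pending = [kids for symbol, kids in regions if symbol == "X" and kids]
    while pending:                            # one pass per deeper layer
        next_pending = []
        for kids in pending:
            groups.append("".join(symbol for symbol, _ in kids))
            next_pending += [k for symbol, k in kids if symbol == "X" and k]
        pending = next_pending
    return sep.join(groups)


# Sample document 4: region 4 is mixed, and so is its lower-right subregion.
layer3 = [("S", None), ("S", None), ("G", None), ("G", None)]
layer2 = [("S", None), ("S", None), ("G", None), ("X", layer3)]
sample4 = [("C", None), ("C", None), ("C", None),
           ("X", layer2), ("G", None), ("P", None)]
print(encode_page(sample4))          # X,CCCXGP,SSGX,SSGG
print(encode_page(sample4, sep=""))  # XCCCXGPSSGXSSGG
```

Passing `sep=""` gives the concatenated form without layer markers.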

Indexing the determination result

Each symbol above must represent five states in total, the four categories plus one for an undetermined category, so at least 3 bits are required. For alignment with the processing unit, 4 bits would be the closest fit, but to allow for future expansion of the number of categories, each symbol is hereafter described in 8 bits (1 byte). That is, the symbol string above is treated as the index symbol string as it is.

With this scheme, the amount of index information is 1 byte per page in the "cleanly resolved" case, 1 + 6 bytes when the first layer is used, 1 + 6 + 4 × n bytes (n ≤ 6) when refined to the second layer, and 1 + 6 + 4 × n + 4 × m bytes (m ≤ 4) when refined to the third layer, so that using the third layer gives a maximum of 47 bytes per page. Using the fourth layer and below makes the index larger still, so these are not used except for special purposes.
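The byte counts follow directly from one byte per symbol. A short Python check, with a hypothetical function name and with n and m taken as the numbers of four-symbol groups at the second and third layers, reproduces the quoted figures, including the maximum of 47 bytes at n = 6 and m = 4.

```python
def index_bytes(n_layer2=0, n_layer3=0, mixed=True):
    """Return the index size in bytes for one page, at one byte per symbol.

    mixed=False is the cleanly resolved case (a single layer-0 symbol).
    The bounds n <= 6 and m <= 4 are those given in the text.
    """
    if not mixed:
        return 1
    if not 0 <= n_layer2 <= 6:
        raise ValueError("n must be between 0 and 6")
    if not 0 <= n_layer3 <= 4:
        raise ValueError("m must be between 0 and 4")
    return 1 + 6 + 4 * n_layer2 + 4 * n_layer3


print(index_bytes(mixed=False))  # 1
print(index_bytes())             # 7
print(index_bytes(6, 4))         # 47
```

As a cross-check, a string such as XCCCXGPSSGXSSGG has n = 1 and m = 1 and is 15 symbols long, which equals 1 + 6 + 4 + 4.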

An index symbol string generated in this way can be classified in the same way as a conventional keyword character string. For example, when the characters assigned to the category symbols are letters of the alphabet, the index symbol string can be classified and sorted exactly like an ordinary keyword. This is extremely advantageous for efficient retrieval later.

If this index symbol string is regarded as a sequence of numerical values, several vectors can be generated for each layer. Calling these the feature vector set A:

A_0 = a_0
A_1 = (a_{11}, a_{12}, a_{13}, a_{14}, a_{15}, a_{16})
A_2 = (a_{21}, a_{22}, a_{23}, a_{24})
...
A_m = (a_{m1}, a_{m2}, a_{m3}, a_{m4})

where each a_{mn} is a numerical value in one-to-one correspondence with the symbols S, G, C, P and X. When a layer-m vector exists, it is called the layer-m feature vector; in general there is more than one.

Storing the index

When documents are stored, it is common practice to assign several keywords for later retrieval, and these keywords are also stored in this embodiment. That is, the date, document name and so on are stored in character-string form in the index data storage device 111 by a known method.

Since the feature vectors are by now expressed as character strings just like keywords, they are stored immediately after the keyword storage area. When a layer-m vector exists, it is stored following these. In this case, it is standard practice, in order to achieve effective high-speed retrieval, to generate a keyword index file in which the keywords alone have been classified and sorted. Needless to say, generating a feature vector index file in the index data storage device 111 for the same purpose increases the processing speed when retrieval is later performed. The index data storage device 111 may share the same memory as the document data storage device 110.

This completes the document storage operation and the accompanying storage of the retrieval index.

Retrieving a document

When retrieving a document, the user first enters keywords from a keyboard or the like, based on memory, to narrow down the candidate documents. Up to this point, conventional known techniques are used. When the narrowed result is not a single document, or when the candidates cannot be narrowed at all, the only conventional recourse was to approach the target document by changing the input keywords by trial and error. In this embodiment, the procedure continues as follows.

Input of an example page and feature extraction

Following keyword input, an example page based on memory is input. In principle, the user prepares a blank sheet of about the same size as the target document, draws with a writing instrument the layout of one remembered page of the target document, inserts the sheet into the scanner 1, and has it parameterized by the same procedure as when a document is stored.

In practice, instead of drawing on paper, the example page is created by operations on the CRT alone, using known means similar to CAD (computer-aided design) or a word processor with drawing functions. Needless to say, the example page should preferably include format information such as vertical or horizontal writing, line pitch, and the positions of drawings, photographs and margins. From the example page data created in this way, a feature vector E is calculated for each of the six regions, in the same way as when a document is stored. When E has only layer 0, E is one of S, C, G and P. When it extends to layer 1,

E_1 = (e_{11}, e_{12}, e_{13}, e_{14}, e_{15}, e_{16})

is obtained, and when it extends to layer m,

E_m = (e_{m1}, e_{m2}, e_{m3}, e_{m4})

is obtained, where each e_{mn} is a numerical value in one-to-one correspondence with the symbols S, G, C, P and X.

Parameter comparison method

Let S_m be one of the layer-m feature vectors of a candidate document. The distance vector D_m, which indicates its difference from the vector E_m of the example document, is defined by

D_m = S_m - E_m

Furthermore, a distance function |D_m| is defined for the case where the vectors contain no undetermined element X [equation not reproduced in this text]. Under this definition, |D_1| = 0 means that the attributes of the candidate and the example agree in all six regions of the page, and |D_1| = 6 means that the attributes differ in every region.

When a vector contains X, on the other hand, the distance function |D_m| can be defined by the following extension. A layer-m feature vector containing X must always be accompanied by a layer-(m+1) feature vector; writing this as the vector S_m1, for example

S_m1 = (CCGP)

is immediately available. E_m, by contrast, contains no undetermined symbol X and so has no accompanying layer-(m+1) feature vector, but it is self-evident that all four elements of the divided region were e_m, so it is written, for example,

E_{m+1} = (CCCC)

The extended distance function |D_m| is then defined in the same form as before, with δ_x denoting a term that requires an operation involving the undetermined element X [equations for |D_m| and δ_x not reproduced in this text]. With δ_x so defined, |D_m| = 1 when all four elements one layer down differ, and |D_m| = 0 when all of them are identical.
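Because the defining equations are not available here, the following Python sketch is a reconstruction from the stated properties only, not the formula of the original. It assumes that each resolved region contributes 0 on a match and 1 on a mismatch, and that a region containing X contributes one quarter of the mismatches among its four subregions, with a resolved symbol expanded to four copies of itself. Applying the same rule recursively to deeper layers, and counting an X that has no finer layer as a full mismatch, are further assumptions. All names are hypothetical.

```python
def distance(stored, example):
    """Distance between two layers given as lists of (symbol, children) pairs."""
    total = 0.0
    for (s, s_kids), (e, e_kids) in zip(stored, example):
        if s != "X" and e != "X":
            total += 0.0 if s == e else 1.0
        elif (s == "X" and not s_kids) or (e == "X" and not e_kids):
            total += 1.0  # undetermined with no finer layer: full mismatch
        else:
            # Expand a resolved symbol to four copies, e.g. C -> (CCCC).
            s_next = s_kids or [(s, None)] * 4
            e_next = e_kids or [(e, None)] * 4
            total += distance(s_next, e_next) / 4
    return total


# Stored page X,CCCXGP with region 4 refined to CCGP; example page CCCCGP.
region4 = [("C", None), ("C", None), ("G", None), ("P", None)]
stored = [("C", None), ("C", None), ("C", None),
          ("X", region4), ("G", None), ("P", None)]
example = [("C", None)] * 4 + [("G", None), ("P", None)]
print(distance(stored, example))  # 0.5
```

In the example, two of the four subregions of region 4 differ from C, so that region contributes 2/4 and every other region contributes 0.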

With the above definitions, the magnitude of |D| is a good measure of how closely a candidate document approximates the example document. Therefore, if |D| is calculated for all candidate documents and only those with a value at or below some constant, for example 1, are retained, only documents of similar appearance remain. Moreover, rather than examining every candidate document, retrieval can be made faster by searching the feature vector index file that was classified and sorted when the documents were stored.

Final identification

The CPU 100 searches the indexes one after another while calculating the degree of approximation to the example page from the value of the distance vector, and retains only the candidates with |D| = 0. The original document data found in this way is fetched from the storage device 110 and displayed on the CRT display device 112 via the information companding circuit 10. The user makes the final identification, and the document retrieval ends.

If no document remains at |D| = 0, the documents with |D| < 1.25 are displayed; if there are still none, the range is widened to |D| < 1.5 and so on, and the choice is left to human judgment.
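
The widening procedure can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `widen_search`, the dictionary of precomputed |D| values, the fixed step of 0.25 (inferred from the two bounds 1.25 and 1.5 given in the text), and the upper limit `max_bound` are all assumptions.

```python
from typing import Dict, List


def widen_search(distances: Dict[str, float],
                 first_bound: float = 1.25,
                 step: float = 0.25,
                 max_bound: float = 3.0) -> List[str]:
    """Return document ids for the tightest range that yields any hit.

    `distances` maps a document id to its precomputed |D| value.
    `max_bound` is a hypothetical stopping point; the source gives none.
    """
    # First pass: exact matches only (|D| = 0).
    hits = [doc for doc, d in distances.items() if d == 0]
    bound = first_bound
    # Widen the range (|D| < 1.25, |D| < 1.5, ...) until something is found.
    while not hits and bound <= max_bound:
        hits = [doc for doc, d in distances.items() if d < bound]
        bound += step
    # Present the closest candidates first for the human operator.
    return sorted(hits, key=lambda doc: distances[doc])


candidates = {"doc_a": 1.3, "doc_b": 2.0, "doc_c": 1.4}
print(widen_search(candidates))  # ['doc_a', 'doc_c']
```

In the example no candidate has |D| = 0 and none falls below 1.25, so the range widens to |D| < 1.5, which admits `doc_a` and `doc_c`; these would then be shown to the operator for the final choice.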

In the embodiment described above, the original document page is divided into 6 regions and each region is subdivided into 4, but other values may be used depending on the page size, aspect ratio, and so on. Likewise, although the fast Fourier transform is used for the frequency component conversion, another orthogonal transform such as the Hadamard transform may be used instead. Furthermore, narrowing by the feature index may be performed before narrowing by keyword.

[Effect of the invention]

As described above, according to the present invention, the index captures not only keywords but also the appearance and layout of the document page in fine detail, through hierarchical region attribute determination. Compared with the conventional method of searching by keyword alone, a desired document can therefore be found effectively among a large number of documents. In addition, even when the keywords have been completely forgotten, the desired document can easily be found in cases such as self-written reports, where the author often remembers the general look of the page. Furthermore, when documents are stored in a personal binder, folder, or the like within an electronic filing system, it becomes possible to assign only the automatically generated feature index, with no special keywords other than the binder name and folder name.

Moreover, the ability to classify documents by their appearance, that is, mainly from the physical side, complements the conventional classification based on meaning and conceptual information, and contributes to the automation of document processing, which accounts for the majority of office work.

Considering that people, in many cases, take in an image and then remember and think using both the non-verbal information extracted from it and the linguistic information extracted further from it, these advantages can be said to be extremely effective in making electronic filing systems more convenient.

[Brief description of drawings]

FIG. 1 is an overall configuration diagram of an embodiment of the present invention, FIG. 2 is a diagram showing an example of dividing a document page, FIG. 3 is a conceptual diagram of feature extraction for a character string region, and FIG. 4 is a schematic diagram of the hierarchical structure representation of a document page.

1 ... scanner, 2 ... image buffer memory, 3 ... buffer control circuit, 4 ... switching circuit, 5 ... horizontal peripheral distribution counting circuit, 6 ... vertical peripheral distribution counting circuit, 7 ... horizontal run length counting circuit, 8 ... vertical run length counting circuit, 9 ... image data bus, 10 ... information companding circuit, 32 ... fast Fourier transform circuit, 100 ... microcomputer, 101 ... control data bus, 110 ... document data storage device, 111 ... index data storage device, 112 ... CRT display device.

Claims

[Claims]

1. A document filing apparatus comprising: a device for reading a document; a device for storing the read document; means for dividing the read page into a plurality of regions and determining the attribute of each region from its physical characteristics; means for, when the attribute of a divided region cannot be determined, further dividing that region into a plurality of small regions and repeatedly determining the attributes of the small regions until the attributes are determined; and means for storing the set of determined attributes of the regions and small regions, as an index of the read document, in association with the corresponding document.

2. The document filing apparatus according to claim 1, further comprising: means for inputting an example page; means for obtaining an index of the example page by operating, on the example page, the means for dividing a page and determining attributes; and means for calculating the similarity between the index of the example page and the stored indexes and extracting the original documents whose similarity falls within a fixed range.