JP4853891B2 - Method for creating document structure information

Info

Publication number: JP4853891B2
Application number: JP2004375548A
Authority: JP
Inventors: 尚紀浅田; 雅之椋木; 正人青山
Original assignee: Hiroshima City University
Current assignee: Hiroshima City University
Priority date: 2004-12-27
Filing date: 2004-12-27
Publication date: 2012-01-11
Anticipated expiration: 2024-12-27
Also published as: JP2006185008A

Description

The present invention relates to a method for creating document structure information and, more particularly, to a method for creating document structure information for documents composed of boxes delimited by ruled lines.

In recent years, companies and local governments have been digitizing office documents, and development is also progressing at the national level, for example through the provision of a one-stop e-government portal. To make clerical work more efficient, the processing of forms is rapidly being computerized. A form contains many ruled lines drawn vertically and horizontally, and each area enclosed by these ruled lines is called a box or a cell. A box may be a self-indication box preprinted on the form itself, such as “Name”, or an insertion box in which the person filling in the form actually writes a name, address, or the like. These boxes are laid out on the form according to predetermined rules.

In general, a form processing system is configured as follows. To extract the necessary information from a form, the system first obtains an image (bitmap) of the form with a reading device (OCR or the like). It then grasps the layout of the form by analyzing this image against a format stored in advance in the system's memory. This analysis identifies what kind of information (for example, an address or a name) is located at which position on the form, after which the characters, digits, and symbols that actually appear as an image at that position are recognized as text using well-known character recognition techniques. The information entered at that position is thereby recognized as text. Here, the format is a model for analyzing the layout of a form, and the layout is analyzed by comparison with this template (see Patent Document 1).
Patent Document 1: JP 2001-266066 A

However, the method described in Patent Document 1 lacks extensibility and flexibility when it must handle ruled-line documents in a wide variety of formats.

Furthermore, although the above analysis method is effective for documents in which information has already been entered, it cannot analyze documents in which no information has been entered.

The present invention has been made in view of the above problems. One object of the present invention is to provide a method for creating document structure information that can analyze the structure of a ruled-line document that is composed of boxes delimited by ruled lines and in which no information has been entered.

The method for creating document structure information of the present invention creates document structure information from a document in which general-structure and table-structure portions are mixed. It comprises: a first step of classifying, by computer or manually, the boxes of the document, which has a plurality of boxes delimited by ruled lines, according to box type; a second step of creating, by computer, a box list indicating the vertical and/or horizontal adjacency of the boxes; and a third step of applying to the box list, by computer, a format structure grammar that is composed of a plurality of grammar rules having priorities and in which each grammar rule clarifies the indication relationship between boxes. In the third step, a general structure grammar, which is the format structure grammar with the indication relationship between boxes restricted to only one of the vertical and horizontal directions, is applied to the box list, and a table structure grammar, which is the format structure grammar with the indication relationship allowed in both the vertical and horizontal directions, is applied to the group of boxes from the box at which a contradiction arises onward.

According to the method for creating document structure information of the present invention, a format structure grammar is applied to a box list indicating vertical and/or horizontal adjacency for a document having a plurality of boxes delimited by ruled lines. The indication relationships of a ruled-line document can therefore be analyzed to support the management and tabulation of entered information, so that digitized documents can be processed easily and in a general-purpose manner. Furthermore, applying the method to word processors for creating interactive documents, word processors for filling in interactive documents, electronic application systems of local governments, in-house document processing systems, and the like makes it possible to create interactive documents easily and to manage completed interactive documents efficiently.

In addition, according to the method of the present invention, ruled-line documents in which vertical and horizontal indication directions are mixed can also be analyzed, so a wide variety of ruled-line documents can be handled.

Furthermore, according to the method of the present invention, classifying each box of a ruled-line document as an indication box, a blank box, an explanation box, or an insertion box makes it possible to analyze the indication relationships between the boxes of a ruled-line document in which no information has been entered.

<First embodiment>
First, the document formats to which the method of this embodiment can be applied will be described. Documents in general use can be classified into two types: the complete document model and the interactive document model.

The complete document model is used for newspapers, books, advertisements, and the like; the document is complete at the moment its author creates it. The reader then obtains information by reading what the author intended to convey from the completed document.

The interactive document model is used for application forms, questionnaires, resumes, and the like, and involves people in three roles: the creator, the writer, and the processor. The creator produces an incomplete document consisting of indications about what is to be entered and areas in which to enter it. The writer reads the indications in the incomplete document, enters the required information in the appropriate areas, and thereby completes the document. The processor extracts and manages the necessary information from the document completed by the writer. The processor may be the same person as the creator.

Furthermore, as shown in FIG. 1, there are several kinds of interactive document model. For example, FIG. 1(A) shows an interactive ruled-line document 1 whose areas are delimited by ruled lines, and FIG. 1(B) shows an interactive free-format document 2 that is not divided into distinct areas. The method of this embodiment targets interactive ruled-line documents such as document 1 in FIG. 1(A). In an interactive ruled-line document, every indication about entry contents and every entry area lies within an area delimited by ruled lines. A document such as document 3 in FIG. 2 is therefore not an interactive ruled-line document. However, the present invention can be applied to the ruled-line region of FIG. 2 to analyze its indication relationships.

In an interactive ruled-line document, the smallest rectangular area delimited by ruled lines is defined as a box. Each box is given its own meaning by its position and the character string it contains. Combining a plurality of boxes yields a document that serves a single purpose as a whole. An interactive ruled-line document consists of three elements: the relationships between indicating and indicated boxes, which form the interactive element; the character strings relating to the entered information; and the layout information describing how the ruled lines are drawn.

The ruled-line documents targeted by the method are those that contain no superfluous boxes, that is, no boxes that carry no meaning.

Boxes are broadly divided into non-entry boxes and entry boxes. Non-entry boxes comprise indication boxes (IND) and explanation boxes (EXP), and entry boxes comprise insertion boxes (INS) and blank boxes (BLK). The characteristics of each box are as follows.
・An indication box (IND) gives some entry indication to other boxes.
・An insertion box (INS) contains a character string.
・A blank box (BLK) is empty, with nothing written inside.
・An explanation box (EXP) contains an explanation.
These are referred to below as IND, INS, BLK, and EXP.
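The four box types and the entry/non-entry split can be sketched as a small data structure. This is only an illustration: the names `BoxType`, `Box`, and `is_entry_box` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class BoxType(Enum):
    IND = "indication"   # non-entry box: gives an entry indication to other boxes
    EXP = "explanation"  # non-entry box: contains an explanation
    INS = "insertion"    # entry box: already contains a character string
    BLK = "blank"        # entry box: nothing written inside


@dataclass(frozen=True)
class Box:
    label: str       # hypothetical identifier such as "IND1"
    kind: BoxType
    text: str = ""   # character string inside the box, if any

    @property
    def is_entry_box(self) -> bool:
        # Entry boxes are INS and BLK; IND and EXP are non-entry boxes.
        return self.kind in (BoxType.INS, BoxType.BLK)
```

A blank box is then `Box("BLK1", BoxType.BLK)`, and an indication box carrying the text "Name" is `Box("IND1", BoxType.IND, "Name")`.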

The boxes are described concretely with reference to FIGS. 3(A) and 3(B). In the ruled-line document 4 shown in FIG. 3(A), the boxes containing “Name” and “Date of birth” are IND boxes. The box containing “year/month/day” is an INS, the empty box is a BLK, and the box containing “* Please ... the date of birth” is an EXP. The classification result is shown in FIG. 3(B).

In FIG. 3(B), as the arrows indicate, IND1 gives an indication to the BLK adjacent on its right, and IND2 gives an indication to the INS adjacent on its right.

Next, the indication relationships between boxes are described with reference to FIGS. 4 to 7. In an interactive ruled-line document, indication relationships concerning the entry contents hold between boxes. These relationships have two possible directions, vertical and horizontal, and there are also various kinds of relationship. This embodiment assumes four kinds of indication relationship between adjacent boxes, such as those shown in FIG. 4.

FIG. 4(A) shows a single-indication box group 5, a one-to-one indication relationship in which one indication box acts on one entry box. Boxes may be adjacent in either the vertical or the horizontal direction. The arrows in the figure represent indication relationships: the IND gives an indication to the adjacent BLK. This form of indication is called a single indication. Although a BLK (blank box) is used here as the indicated box, an INS (insertion box) can also be an indicated box. The same holds for the indication relationships described below.

FIG. 4(B) shows a self-indication box, an entry box that receives no indication from any indication box. What should be entered in such a box can be inferred well enough from the character string it contains, so the box is regarded as indicating itself. A self-indication relationship therefore arises only for insertion boxes.

FIGS. 5 and 6 illustrate one-to-n indication relationships, in which the indication of one indication box acts on a plurality of boxes. The indication direction is either vertical or horizontal, and the indication runs in a straight line in one of these directions.

The box group shown in FIG. 5 is a single-layer repetition indication box group 6, whose indication relationship is a single-layer repetition indication. In a single-layer repetition indication, an indication box repeatedly indicates a run consisting only of BLK boxes or only of insertion boxes with the same internal character string.

Specifically, in FIGS. 5(A) and 5(B), the IND gives an indication to the adjacent BLK1 and also to BLK2, which is adjacent to BLK1.

In FIGS. 5(C) and 5(D), the IND is adjacent to both BLK1 and BLK2 and gives the same indication to each.

The box group shown in FIG. 6 is a multi-layer repetition indication box group 7, whose indication relationship is a multi-layer repetition indication. In a multi-layer repetition indication, a single indication box groups together several box groups that themselves contain indication boxes; it is used to bundle information that would otherwise be duplicated. Because a parent-child relationship arises between indication boxes, the entry boxes receive hierarchical indications. The indication higher in the hierarchy is called the parent indication.

In FIGS. 6(A) and 6(B), IND1 gives the same indication to IND2 and IND3. IND2 gives an indication to the adjacent BLK1, and IND3 gives an indication to the adjacent BLK2. All of these indications run in the same direction; IND1 is adjacent to IND2 but not to IND3. This is the parent-child relationship described above: IND1 is the parent indication, and IND2 and IND3 are child indications.

In FIG. 6(C), IND1 is adjacent to IND2, and IND2 is adjacent to IND3, in both cases vertically. IND1 gives indications to IND2 and IND3. IND2 gives an indication to the horizontally adjacent BLK1, and IND3 likewise to the horizontally adjacent BLK2. For BLK1, therefore, IND1 is the parent indication and IND2 is the child indication.

The same kind of indication relationship also arises in the multi-layer repetition indication box groups 7D and 7E shown in FIGS. 6(D) and 6(E).

The ruled-line documents to which the method of this embodiment is applied are built from the indication relationships described above. Indication relationships that span pages, that are expressed by arrows, or that act on distant locations are not handled.

The indication relationships described above can be expressed as a context-free grammar written in extended Backus-Naur form. In this embodiment this grammatical representation is defined as the format structure grammar and uses the following symbols.
“·” represents adjacency between boxes.
“::=” represents derivation (reduction) from the right-hand side to the left-hand side.
“+” represents one or more repetitions of the immediately preceding element.
“|” represents a choice between elements.

Here, two boxes are adjacent when they share a side of the same height or the same width.

The format structure grammar consists of grammar rules with priorities. In addition to the four terminal symbols IND, EXP, BLK, and INS, which correspond to the box types, it uses the non-terminal symbols <gcb> and <icb>.

<gcb> (general compound box) represents a compound box that does not indicate any other box. A compound box represents a group of boxes linked by indicating/indicated relationships.

<icb> (indication compound box) represents a compound box that may indicate another compound box.

The specific grammar rules of the format structure grammar and their priorities are described with reference to FIG. 7. The format structure grammar 50 consists of eight grammar rules, each numbered according to its priority; for example, grammar rule 1 has higher priority than grammar rule 2. The priorities were determined empirically and are explained below.

Each grammar rule represents an indication relationship. Rule 1 expresses the multi-layer repetition indication and rule 2 the single-layer repetition indication. Rule 3 expresses the self-indicating insertion box and rule 4 the explanation box. Rule 5 performs a reduction in the direction orthogonal to that in which rule 1 is applied, as preprocessing for recognizing single-layer repetition indications such as those in FIGS. 5(C) and 5(D) and multi-layer repetition indications such as those in FIGS. 6(D) and 6(E). Rule 6 reduces compound boxes with one another.

In addition to these indication relationships, rule 7 handles the case in which a box group was reduced to <icb> in anticipation that it might indicate another <gcb>, but no <gcb> for it to indicate exists in the end. Rule 7 turns such an <icb> into an element of the set of <gcb>, which signifies termination with a successful analysis.

Rule 8 exists to reduce the indication junction that characterizes tabular indications, so that a tabular indication can be obtained by treating it as a combination of structure analyses in the horizontal and vertical directions.

Specifically, an insertion box is treated as self-indicating only when no indication box exists for it, so rule 2 takes priority over rule 3.

Rules 5 and 6 are used, when no higher rule can reduce, as preprocessing for detecting repetition indications such as those in FIGS. 5(C), 5(D), 6(D), and 6(E); the multi-layer, single-layer (including single indication), and self-indication rules therefore take priority over them.

Furthermore, rule 7 is used only when, in the end, no <gcb> is adjacent to an <icb>, and rule 8 exists to handle table structures formally, so neither may rank above the rules that extract indication relationships.

The structure of a document is analyzed using the format structure grammar described above. The analysis consists of two stages: a parsing stage, which builds a structure analysis tree, and an indication relationship analysis stage, which derives the indication relationships between boxes from that tree. The structure analysis tree is a data structure that starts from a node representing one component and branches repeatedly, spreading like the branches of a tree. It is generated directly from the grammar: starting with the box type classification results as leaf nodes, reductions based on the grammar rules are applied for as long as any reduction is possible, and each time the symbol produced by a reduction is placed as the parent node of the group of symbols that were reduced.

The indication relationships are obtained from the result of parsing. Specifically, for any node whose left child is an indication box, the indication of that left child applies to the whole of the right child.
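The left-child rule can be sketched as a tree walk that collects, for each entry box, the chain of indications from parent to child. This is a minimal sketch under assumptions: the `Node` class, the label convention ("IND1", "BLK2", "gcb"), and the function name `indication_chains` are hypothetical, and the tree is assumed to have been built already by the parsing stage.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str                                   # e.g. "IND1", "BLK2", or "gcb"
    children: List["Node"] = field(default_factory=list)


def indication_chains(root: Node) -> Dict[str, List[str]]:
    """Map each entry box (BLK/INS leaf) to its indications, parent first."""
    chains: Dict[str, List[str]] = {}

    def walk(node: Node, inherited: List[str]) -> None:
        if not node.children:
            # Only entry boxes receive indications; an empty chain means
            # the box is self-indicating.
            if node.label.startswith(("BLK", "INS")):
                chains[node.label] = list(inherited)
            return
        first, rest = node.children[0], node.children[1:]
        if first.label.startswith("IND") and not first.children:
            # The left indication box applies to everything on its right.
            for child in rest:
                walk(child, inherited + [first.label])
        else:
            for child in node.children:
                walk(child, inherited)

    walk(root, [])
    return chains
```

For a multi-layer repetition such as the one described for FIG. 6, a tree in which IND1 sits to the left of two sub-trees (IND2, BLK1) and (IND3, BLK2) yields the chain IND1, IND2 for BLK1 and IND1, IND3 for BLK2.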

FIG. 8 shows the characteristic tree structures formed by three kinds of indication relationship: single indication, self-indication, and repetition indication. In the single indication of FIG. 8(A), the indication box gives an indication only to the entry box, and the entry box receives no indication from any other box. In the self-indication of FIG. 8(B), the box gives an indication to itself and to no other box. In the single-layer repetition indication of FIG. 8(C), the entry boxes (a) to (c) all receive their indication from (1). In the multi-layer repetition indication of FIG. 8(D), the box (d) receives its indications from (2) to (4).

As described above, document structure information is obtained by using the format structure grammar to make explicit the indication relationships between the boxes of an interactive ruled-line document.

Flowchart 28 in FIG. 9 shows a method for analyzing an interactive ruled-line document. Each step is described below.

In step S1, a vertical or horizontal box list is created from the target interactive ruled-line document. A box list is a list indicating the adjacency between boxes.

The interactive ruled-line documents handled in this embodiment are limited to documents whose indication direction is horizontal only or vertical only; documents with indications in both directions are not handled. Accordingly, an H_list representing left-right connections is created for a document whose indication direction is horizontal, and a V_list is created for a document whose indication direction is vertical. The method of creating a box list is described later.

In step S2, the grammar rule with the highest priority in the format structure grammar is set as the current rule.

In step S3, it is determined whether the box list contains a place to which the current rule can be applied. If it does, processing proceeds to step S4; otherwise, it proceeds to step S6.

In step S4, a reduction based on the grammar rule is performed at the applicable place. If there are several applicable places, only the first one is reduced.

In step S5, it is determined whether every box in the box list is described by <gcb> (a compound box that does not indicate any other box). If so, the analysis of the box list has succeeded and the document structure information is obtained. If any box is not yet described by <gcb>, processing returns to step S2.

In step S6, the grammar rule one priority level below the current rule is set as the current rule, and processing returns to step S3.

Performing the above steps on the vertical or horizontal box list completes the structure analysis of the interactive ruled-line document.
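The control flow of steps S2 to S6 can be sketched as a priority-driven reduction loop. The three rules below are simplified stand-ins chosen only to exercise the loop; the actual eight rules are those of FIG. 7, which the text does not reproduce. The names `RULES`, `match_repetition`, `match_single`, and `analyze` are hypothetical, and reporting failure once the lowest-priority rule has been tried without success is an assumption, since the flowchart description does not cover that case.

```python
from typing import Callable, List, Optional, Tuple

Row = List[str]                      # symbols joined by adjacency within one row
Span = Optional[Tuple[int, int]]     # half-open range [start, end) inside a row


def match_repetition(row: Row) -> Span:
    """Stand-in rule: IND followed by one or more BLK, or one or more INS."""
    for i, sym in enumerate(row):
        if sym == "IND" and i + 1 < len(row) and row[i + 1] in ("BLK", "INS"):
            target = row[i + 1]
            j = i + 1
            while j < len(row) and row[j] == target:
                j += 1
            return i, j
    return None


def match_single(terminal: str) -> Callable[[Row], Span]:
    """Stand-in rule: a lone terminal symbol reduces to <gcb>."""
    def matcher(row: Row) -> Span:
        for i, sym in enumerate(row):
            if sym == terminal:
                return i, i + 1
        return None
    return matcher


# Highest priority first (illustrative only, not the rules of FIG. 7).
RULES = [match_repetition, match_single("INS"), match_single("EXP")]


def analyze(box_list: List[Row]) -> Tuple[bool, List[Row]]:
    rows = [list(row) for row in box_list]
    rule = 0                                        # S2: highest-priority rule
    while rule < len(RULES):
        applied = False
        for row in rows:                            # S3: look for a place
            span = RULES[rule](row)
            if span is not None:
                start, end = span
                row[start:end] = ["gcb"]            # S4: reduce first place only
                applied = True
                break
        if applied:
            if all(s == "gcb" for row in rows for s in row):
                return True, rows                   # S5: everything is <gcb>
            rule = 0                                # S5: otherwise back to S2
        else:
            rule += 1                               # S6: next lower priority
    return False, rows                              # assumption: rules exhausted
```

With these stand-in rules, a row such as IND·BLK is reduced to a single <gcb>, which mirrors the kind of single-indication reduction the flowchart is meant to produce.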

The structure analysis method for interactive ruled-line documents according to this embodiment is described concretely with reference to FIGS. 10 to 18.

First, step S1 is described in detail with reference to FIGS. 10 and 11, taking as an example the interactive ruled-line document 30A shown in FIG. 10(A), whose indications run only in the horizontal direction.

FIG. 10(B) shows an interactive ruled-line document 30B, in which each box of document 30A has been classified as one of the four types IND, INS, EXP, and BLK. So that the classified boxes can be told apart, each is given a number that is unique within its box type.

As for classification, boxes containing no character string can be identified automatically as BLK by image reading means such as OCR, but boxes are basically classified manually. A box list indicating horizontal adjacency (H_list) is then created.

The method of creating the horizontal box list (H_list) 31 is described with reference to FIG. 11.

To create the horizontal box list 31, assume a coordinate system in which each box is located by its upper-left corner, the positive x axis points to the right of the page, and the positive y axis points downward. The boxes are sorted in ascending order of y coordinate, and boxes with equal y coordinates are then sorted in ascending order of x coordinate. After sorting, if two consecutive boxes have the same height and completely share one side, they are judged to be horizontally adjacent and the adjacency symbol “·” is inserted between them. The resulting list is the horizontal box list 31. In FIG. 11(A), the boxes are sorted in the order of arrows A1 to A4; each arrow A passes over the boxes being sorted and connects the first and last box of its run. The horizontal box list 31 formed in this way is shown in FIG. 11(B), where the boxes sorted along each arrow A are written on the same line.
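The sort-then-join procedure for the horizontal box list can be sketched as follows. The `Rect` type, its integer coordinates, and the function names are hypothetical. As a simplification, each output row here holds only one run of mutually adjacent boxes, whereas FIG. 11(B) is described as writing all boxes along one arrow on the same line.

```python
from typing import List, NamedTuple, Optional


class Rect(NamedTuple):
    label: str
    x: int   # left edge of the upper-left corner; x grows to the right
    y: int   # top edge of the upper-left corner; y grows downward
    w: int   # width
    h: int   # height


def horizontal_box_list(boxes: List[Rect]) -> List[List[str]]:
    # Sort by y first, then by x among boxes with equal y.
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    runs: List[List[str]] = []
    prev: Optional[Rect] = None
    for box in ordered:
        # Same top edge and same height, and the right side of the previous
        # box coincides with the left side of this one: one side fully shared.
        adjacent = (
            prev is not None
            and prev.y == box.y
            and prev.h == box.h
            and prev.x + prev.w == box.x
        )
        if adjacent:
            runs[-1].append(box.label)
        else:
            runs.append([box.label])
        prev = box
    return runs


def render(runs: List[List[str]]) -> List[str]:
    # Join each run with the adjacency symbol.
    return ["·".join(run) for run in runs]
```

The vertical list would follow the same pattern with the roles of x and y, and of width and height, exchanged.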

The above processing is performed in step S1, after which the process proceeds to step S2.

Although it is not needed in this embodiment, the method of creating the vertical box list (V_list) is also described here. Assuming the coordinate system described above, the boxes are sorted in ascending order of x coordinate, and boxes with equal x coordinates are then sorted in ascending order of y coordinate. After sorting, if two consecutive boxes have the same width and completely share one side, they are judged to be vertically adjacent, and the adjacency symbol “・” is inserted between them. The resulting list is the vertical box list.

Next, the processing from step S2 to step S6 will be described in detail with reference to FIGS. 12 to 16.

In step S2, the grammar rule with the highest priority in the format structure grammar 50 shown in FIG. 7 is set as the current rule. Grammar rule 1 therefore becomes the current rule.

In step S3, the horizontal box list 31 is searched for a location to which grammar rule 1 can be applied. Because no such location exists, the process proceeds to step S6, where grammar rule 2 is set as the current rule, and then returns to step S3. In step S3, it is determined whether grammar rule 2 can be applied to the box list 31. As shown in FIG. 12, grammar rule 2 can be applied to the box group S1 enclosed by the broken line, so the process proceeds to step S4 and the reduction is performed. In the box list 31 after the reduction, the box group S1 has been reduced to <gcb1>. This reduction identifies a single instruction relationship in which IND1 gives an instruction to BLK1.

After the reduction, step S5 determines whether the box list 31 is described only by <gcb>. Because unreduced locations remain at this stage, the process returns to step S2, where grammar rule 1 is again set as the current rule.

As described above, the grammar rules of the format structure grammar 50 are tested against the box list in descending order of priority, a reduction is performed at an applicable location, and testing then restarts from grammar rule 1. This is repeated until the box list is described only by <gcb>.
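The loop over steps S2 to S6 can be sketched as below. The four rules are illustrative stand-ins chosen for the example; they are not the rules of the format structure grammar 50 in FIG. 7, which is not reproduced in this text. The sketch also assumes that a list "described only by <gcb>" contains no remaining adjacency symbols, and it operates on box types without the numeric suffixes.

```python
ADJ = "・"  # adjacency symbol used in the box list

# Hypothetical (pattern, result) rules, highest priority first.
RULES = [
    (("IND", ADJ, "BLK"), "gcb"),
    (("INS",), "gcb"),
    (("EXP",), "gcb"),
    (("gcb", ADJ, "gcb"), "gcb"),
]


def find_first(tokens, pattern):
    """Return the index of the first occurrence of pattern, or -1."""
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == pattern:
            return i
    return -1


def analyze(tokens, rules=RULES):
    """Reduce tokens until only 'gcb' remains or no rule applies."""
    tokens = list(tokens)
    applied = []  # index (1-based) of the rule used in each pass
    while any(t != "gcb" for t in tokens):                  # check as in S5
        for number, (pattern, result) in enumerate(rules, start=1):  # S2, S6
            i = find_first(tokens, pattern)                 # S3
            if i >= 0:
                tokens[i:i + len(pattern)] = [result]       # S4: reduce
                applied.append(number)
                break                                       # restart from rule 1
        else:
            return tokens, applied, False                   # no rule applies
    return tokens, applied, True


print(analyze(["IND", ADJ, "BLK", "INS", "EXP"]))
# (['gcb', 'gcb', 'gcb'], [1, 2, 3], True)
```

Only the first applicable location is reduced in each pass, and the search always restarts from the highest-priority rule, which is what lets a higher-priority rule win whenever it becomes applicable again.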

From here on, the series of steps from setting grammar rule 1 as the current rule until the first reduction is performed is counted as one processing pass, and the description focuses on the grammar rule applied in each pass. Thus, as shown in FIG. 12, grammar rule 2 is applied in pass 1, and the box group S1 consisting of IND1 and BLK1 is reduced to <gcb1>.

The processing in passes 2 to 7 will be described with reference to FIG. 13. In each pass, the box group whose number matches the pass number is processed; that is, the box group S2 is reduced in pass 2. All reductions in passes 2 to 7 are based on grammar rule 2. In pass 2, the box group S2 consisting of IND2 and BLK2 is reduced to <gcb2>. The box groups S3 to S7 are reduced in the same way: in pass 3, the box group S3 consisting of IND3 and BLK3 is reduced to <gcb3>, and the box groups S4 to S7 are likewise reduced to <gcb4> to <gcb7>. The number of each box group corresponds to the number of the non-terminal symbol <gcb> to which it is reduced.

The processing in passes 8 to 10 will be described with reference to FIG. 14. Here, reductions based on grammar rule 3 are performed. The box groups S8 to S10 are insertion boxes, and the analysis finds that they have no instruction relationship with any horizontally adjacent instruction box. In pass 8, the box group S8, which is INS1, is reduced to <gcb8>. Similarly, in passes 9 and 10, the box group S9 is reduced to <gcb9> and the box group S10 to <gcb10>.

The processing in pass 11 will be described with reference to FIG. 15. Here, a reduction based on grammar rule 4 is performed, and the box group S11, which is the explanation box EXP1, is reduced to <gcb11>.

The processing in passes 12 to 15 will be described with reference to FIG. 16. Here, reductions based on grammar rule 6 are performed; that is, box groups that give no instruction to other adjacent boxes are merged into a single box group. In pass 12, the box group S12 is reduced to <gcb12>. Similarly, the box group S13 is reduced to <gcb13> in pass 13, the box group S14 to <gcb14> in pass 14, and the box group S15 to <gcb15> in pass 15. When pass 15 is complete, step S5 determines that the analysis of the horizontal box list 31 has succeeded.

The horizontal instruction relationships between the analyzed boxes will be described with reference to FIG. 17. FIG. 17(A) shows the horizontal adjacency structure analysis tree 35 created from the analysis result, and FIG. 17(B) shows the analysis result.

As FIG. 17(A) shows, the whole document <document> is composed of the box groups <gcb8>, <gcb11>, <gcb12>, <gcb13>, <gcb14>, and <gcb15>. These box groups have no instruction relationship with one another. However, instruction relationships hold among the box groups below them. For example, <gcb1> and <gcb2>, which lie below <gcb12>, are in an instruction relationship. The solid arrows indicate the instruction direction. For example, <gcb1> consists of IND1 and BLK1, and a single instruction relationship holds in which IND1 gives an instruction to BLK1. Similarly, <gcb2> consists of IND2 and BLK2, which are also in a single instruction relationship.

Referring to FIG. 17(B), the analysis result obtained from the horizontal adjacency structure analysis tree 35 is drawn on the interactive ruled line document 30B as arrows 38, making the instruction relationship of each box clear. Document structure information can be created in this way. The document handled here contains only box groups in single instruction relationships, but documents containing box groups in other instruction relationships can also be analyzed by the method described above.

For a document in which instruction relationships exist only in the vertical direction, document structure information can be created in the same way from a vertical structure analysis tree obtained from V_list.

By analyzing the instruction relationships of a ruled line document and managing the layout information, entry information, and instruction information of the interactive document separately, it becomes possible to support tasks such as managing and tabulating entry information, and digitized documents can be processed easily and in a general-purpose manner. Only the necessary information can be extracted, which makes document information management more efficient. Managing the document structure information separately from the ruled line layout information also makes office work more efficient. Furthermore, displaying arrows or similar marks that show the instruction relationships on the ruled line document helps prevent entry errors and omissions.

Applying the method of this embodiment to an interactive document creation word processor, an interactive document entry word processor, a local government electronic application system, an in-house document processing system, or the like makes it easy to create interactive documents and to manage completed interactive documents efficiently. A structured description based on the format structure grammar can also be generated semi-automatically.

<Second Embodiment>
Since this embodiment is basically the same as the first embodiment, the description focuses on the differences.

The document analyzed in this embodiment is a ruled line document in which vertical and horizontal instructions are mixed. A single structure analysis is not sufficient to analyze the structure of such a document. Therefore, both a horizontal priority structure analysis and a vertical priority structure analysis are performed to analyze its instruction relationships.

The horizontal priority structure analysis is a method in which, when the structure analysis tree is built, locations where a grammar rule can be applied are searched for first in H_list and then in V_list. The procedure for creating the structure analysis tree is described below as process numbers 1 to 6.

1. Classify the box types of the target document and create H_list and V_list.

2. Set the grammar rule with the highest priority as the current rule.

3. Scan H_list for a location where the current rule can be applied. If there is one, perform the reduction and go to process number 6. If there are several applicable locations, reduce only the first one.

4. Scan V_list for a location where the current rule can be applied. If there is one, perform the reduction and go to process number 6. If there are several applicable locations, reduce only the first one.

5. Set the grammar rule with the next lower priority as the current rule and return to process number 3. If no grammar rule is applicable at this point, end the analysis.

6. Because the reduction may have changed the arrangement and connections of the boxes, create H_list and V_list again, then go to process number 2. However, if H_list or V_list has been reduced to <gcb> only, the analysis has succeeded, so end it.

The horizontal priority structure analysis is performed as described above. The vertical priority structure analysis is the same method with process numbers 3 and 4 interchanged.
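Process numbers 1 to 6 can be sketched as a loop that rebuilds both lists from the current boxes before every search. The sketch rests on several assumptions: boxes are rectangles with integer coordinates, a reduction replaces the matched boxes with one "gcb" box covering their bounding rectangle, the rules are hypothetical sequences of box types rather than the rules of FIG. 7, and success is taken to mean that every remaining box is a gcb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    kind: str  # "IND", "BLK", "INS", "EXP" or "gcb"
    x: int
    y: int
    w: int
    h: int


def h_adjacent(a, b):
    return a.y == b.y and a.h == b.h and a.x + a.w == b.x


def v_adjacent(a, b):
    return a.x == b.x and a.w == b.w and a.y + a.h == b.y


# A direction is (name, sort key, adjacency test).
H = ("H", lambda b: (b.y, b.x), h_adjacent)
V = ("V", lambda b: (b.x, b.y), v_adjacent)


def find_site(boxes, direction, pattern):
    """First run of adjacent boxes in the sorted list whose kinds match."""
    _, key, adjacent = direction
    ordered = sorted(boxes, key=key)  # rebuild the list (processes 1 and 6)
    n = len(pattern)
    for i in range(len(ordered) - n + 1):
        run = ordered[i:i + n]
        kinds_match = tuple(b.kind for b in run) == pattern
        if kinds_match and all(adjacent(p, q) for p, q in zip(run, run[1:])):
            return run
    return None


def merge(run):
    x0, y0 = min(b.x for b in run), min(b.y for b in run)
    x1, y1 = max(b.x + b.w for b in run), max(b.y + b.h for b in run)
    return Box("gcb", x0, y0, x1 - x0, y1 - y0)


def analyze(boxes, rules, order=(H, V)):
    """order=(H, V) is horizontal priority; order=(V, H) is vertical priority."""
    boxes, log = set(boxes), []
    while any(b.kind != "gcb" for b in boxes):
        site = None
        for rule_no, pattern in enumerate(rules, start=1):  # processes 2 and 5
            for direction in order:                         # processes 3 and 4
                run = find_site(boxes, direction, pattern)
                if run:
                    site = (rule_no, direction[0], run)
                    break
            if site:
                break
        if site is None:
            return boxes, log, False                        # no rule applies
        rule_no, name, run = site
        boxes = (boxes - set(run)) | {merge(run)}           # reduce
        log.append((rule_no, name))
    return boxes, log, True


doc = [Box("IND", 0, 0, 1, 1), Box("BLK", 1, 0, 1, 1),
       Box("IND", 0, 1, 2, 1), Box("INS", 0, 2, 2, 1)]
rules = [("IND", "BLK"), ("IND", "INS")]
print(analyze(doc, rules)[1])  # [(1, 'H'), (2, 'V')]
```

In this toy layout the first pair is found in the horizontal list and the second only in the vertical list, which mirrors how the search falls through to V_list when H_list offers no applicable location.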

Because the search for rule application locations always proceeds in a fixed order of directions, the horizontal priority structure analysis preferentially analyzes instruction relationships acting in the horizontal direction, and the vertical priority structure analysis preferentially analyzes those acting in the vertical direction. This makes it possible to create document structure information for a ruled line document with mixed instruction directions.

Taking the ruled line document 70A as an example, a method of analyzing a ruled line document with mixed vertical and horizontal instruction directions will be described with reference to FIG. 18(A).

First, the ruled line document 70B, in which the box types of the ruled line document 70A have been classified, will be described with reference to FIG. 18(B). Here, IND1 gives an instruction to BLK1 and IND2 gives an instruction to BLK2, both in the horizontal direction. IND3 gives an instruction to INS2 in the vertical direction. The format structure grammar 50 shown in FIG. 7 is applied to this ruled line document 70.

FIG. 18(C) shows the box lists for the ruled line document 70. Because the horizontal priority structure analysis is being described, the format structure grammar is applied first to the horizontal box list 71. If the horizontal box list 71 contains no location where the current grammar rule can be applied, the current grammar rule is applied to the vertical box list 72.

Referring to FIG. 19(A), grammar rule 2 is applied to the box group S30 consisting of IND1 and BLK1 in the horizontal box list 71, and the group is reduced to <gcb1>. FIG. 19(B) shows the ruled line document 70C after this reduction. As shown in FIG. 19(C), the box lists are then newly created for the reduced ruled line document 70C.

Referring to FIG. 20(A), grammar rule 2 is applied to the box group S31 consisting of IND2 and BLK2 in the horizontal box list 71, and the group is reduced to <gcb2>. As shown in FIG. 20(B), the box lists are then newly created for the reduced ruled line document 70.

Referring to FIG. 21(A), grammar rule 2 is applied to the box group S32 consisting of IND3 and INS2 in the vertical box list 72, and the group is reduced to <gcb3>. As shown in FIG. 21(B), the box lists are then newly created for the reduced ruled line document.

Referring to FIG. 22(A), grammar rule 4 is applied to the box group S33, which is EXP1 in the horizontal box list 71, and the group is reduced to <gcb4>. As shown in FIG. 22(B), when the box lists are newly created, the horizontal box list 71 and the vertical box list 72 are expressed only by <gcb>, so the analysis is complete.

Referring to FIG. 23(A), the horizontal priority structure analysis tree 73 is created from the analysis result. This reveals the horizontal instruction relationships, with each box giving an instruction in the direction of its arrow. It can thus be seen that IND1 gives an instruction to BLK1 and IND2 gives an instruction to BLK2.

The vertical priority structure analysis is not described in detail here, but the result of analyzing by the method described above is shown in FIG. 23(B). The vertical priority structure analysis tree 74 will be described with reference to FIG. 23(B). IND3 and INS2 are in a vertical instruction relationship, with IND3 giving an instruction to INS2.

Referring to FIG. 23(C), the instruction relationships obtained from the horizontal priority structure analysis tree 73 and the vertical priority structure analysis tree 74 are shown by arrows in the ruled line document 70. In this way, document structure information can be created for a ruled line document in which vertical and horizontal instruction directions are mixed.

<Third Embodiment>
Since this embodiment is basically the same as the first and second embodiments, the description focuses on the differences.

A ruled line document to which the method of this embodiment can be applied will be described with reference to FIG. 24. The documents handled in this embodiment are formed as table structures. Specifically, a table structure is a document structure in which the box group as a whole is rectangular and every box other than the instruction boxes is an entry box, such as a blank box or an insertion box. The box at the upper left of the table structure, however, is regarded as an explanation box.

A table structure has two characteristic features: the entry boxes are arranged in a matrix, so the entry box group is rectangular, and instruction box continuous portions 41 run horizontally along the upper side and vertically along the left side of that group.

A document structure expressed as a table structure is used to lay out multilayer repeated instructions effectively. That is, when a multilayer repeated instruction has two layers that each contain several pieces of attribute information bundling parent instructions, the attribute information of each layer is laid out in the horizontal or the vertical direction.

Specific examples of the table structures targeted in this embodiment are described below.

Referring to FIG. 24(A), the table structure 40A is the simplest table structure. Here three instruction boxes are arranged consecutively in each of the vertical and horizontal directions, but the number of instruction boxes in either direction is not limited. The table structures 40B to 40E in FIGS. 24(B) to 24(E) have instruction box groups in a parent-child relationship. Specifically, in the table structures 40B and 40C, a parent-child relationship exists among some of the instruction boxes that give instructions in the horizontal direction. Although not illustrated, a table structure in which some of the instruction boxes giving instructions in the vertical direction are in a parent-child relationship is also covered by this embodiment. In the table structure 40D, a parent-child relationship holds for all instruction boxes giving instructions in the horizontal direction, and in the table structure 40E, for all instruction boxes giving instructions in the vertical direction.

In this embodiment, document structure information is created by applying a table structure grammar, which is a format structure grammar adapted to table structures, to a document formed as a table structure as described above.

The table structure grammar is characterized by combining blank boxes and insertion boxes in a single step without distinguishing between them. This handles tables in which insertion boxes with pre-embedded units or the like are mixed with blank boxes whose units are unnecessary or omitted. In addition, because a table contains portions where instruction boxes run consecutively in a line, the grammar includes rules that reduce those portions in a meaningful way.

Because an entry box inside a table receives instructions from two directions, the left and the top, once a run of consecutive entry boxes has been combined with an instruction box and reduced to a composite box, that composite box must then be combined in a direction different from the first reduction direction. For this reason, the new non-terminal symbols igt (indication group in table) and tbl (table) were added for the portions that express these features of tables.

<igt> (indication group in table) is a box group that represents a connected portion of instruction boxes in a table but does not itself give an instruction.

<tbl> (table) is a box group having tabular instructions.

A tabular instruction is an instruction relationship in which each entry box receives instructions independently from two different instruction boxes, one in the vertical and one in the horizontal direction, and each instruction box gives its instruction repeatedly in one direction. Each entry box is adjacent, vertically and horizontally, to boxes of the same width or height, respectively. In a tabular instruction, there is no parent-child relationship or semantic connection between the instruction boxes of the two directions.

The grammar rules of the table structure grammar 60 and their priorities will be described with reference to FIG. 25. The table structure grammar 60 consists of nine grammar rules, each numbered according to its priority. As before, the priorities are assigned so that the most reliable analysis result appears first when a table structure is analyzed.

The nature of each grammar rule is described below. To distinguish them from the grammar rules of the format structure grammar, they are referred to as table rule 1, table rule 2, and so on.

Table rule 1 reduces a run of consecutive entry boxes, which is a characteristic feature of the table structure.

Table rule 2 reduces multilayer repeated instructions.

Table rule 3 performs a reduction in the direction orthogonal to the application direction of table rule 2, as preprocessing for receiving the multilayer repeated instruction of table rule 2.

Table rules 4, 5, and 6 reduce the connected instruction portions that are a characteristic feature of the table structure.

Table rule 4 reduces the upper left of the table to igt.

Table rule 5 successively reduces a continuous portion of instruction boxes, that is, a sequence of INDs, to igt.

Table rule 6 successively reduces to igt either a continuous portion of instruction boxes having a parent instruction or several igt already reduced by this table rule.

Table rules 7, 8, and 9 perform the reduction to a table structure.

Table rule 7 successively reduces to igt either a continuous portion of instruction boxes having a parent instruction or several igt already reduced by this table rule.

Table rule 8 allows a parent instruction to apply to a whole table: it reduces a parent instruction applying to the whole table, together with a repetition of tbl, to tbl.

Under table rule 9, the table structure document is finally reduced to a composite box (gcb) and becomes a component of the document.

A single structure analysis is not sufficient to analyze a table structure, because the entry boxes inside the table receive instructions from the vertical and horizontal directions at the same time. Therefore, both a horizontal priority structure analysis and a vertical priority structure analysis are performed to analyze the instruction relationships of the table structure. The specific methods of these analyses were described in the second embodiment and are omitted here.

A method of creating the document structure information of the table structure document 61 will be described in detail with reference to FIGS. 26 to 32.

Referring to FIG. 26(A), the table structure document 61 contains boxes that receive instructions from both the vertical and horizontal directions. FIG. 26(B) shows the result of classifying the box types. BLK1 receives instructions from IND1 and IND3, and BLK2 receives instructions from IND2 and IND3. The instruction direction of IND3 is horizontal, whereas that of IND1 and IND2 is vertical.
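The two-way relationship in this example can be written out as a small lookup. The sketch below only illustrates what a tabular instruction looks like as data; it is not the grammar-based analysis of this embodiment. The function name and the layout (IND1 and IND2 as column heads, IND3 as the single row head) are assumptions inferred from the stated instruction directions.

```python
def tabular_instructions(column_heads, row_heads, entries):
    """Map each entry box to its (vertical, horizontal) instruction boxes.

    column_heads give instructions downward, row_heads give instructions
    to the right, and entries[r][c] is the entry box in row r, column c.
    """
    relations = {}
    for r, row_head in enumerate(row_heads):
        for c, column_head in enumerate(column_heads):
            relations[entries[r][c]] = (column_head, row_head)
    return relations


table = tabular_instructions(
    column_heads=["IND1", "IND2"],
    row_heads=["IND3"],
    entries=[["BLK1", "BLK2"]],
)
print(table)  # {'BLK1': ('IND1', 'IND3'), 'BLK2': ('IND2', 'IND3')}
```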

Referring to FIG. 27(A), box lists are created after the box types are classified. The horizontal box list 62 is H_list, and the vertical box list 63 is V_list. Because the horizontal priority structure analysis is performed here, the table structure grammar rules are applied first to the horizontal box list 62. Table rule 1 is applied to the box group S20, and as a result IND3, BLK1, and BLK2 are reduced to <gcb1>. FIG. 27(B) shows the box lists newly created after the reduction; the vertical box list 63 is recreated at the same time.

Referring to FIG. 28(A), table rule 4 is applied to the box group S21 consisting of EXP1 and IND1. As a result, the box group S21 is reduced to <igt1>. The box lists are then created again, as shown in FIG. 28(B).

Referring to FIG. 29(A), table rule 4 is applied to the box group S22 consisting of <igt1> and IND2. As a result, the box group S22 is reduced to <igt2>. The box lists are then created again, as shown in FIG. 29(B).

Referring to FIG. 30A, table rule 7 is applied to the box group S23 consisting of <igt1> and <gcb1>. As a result, the box group S23 is reduced to <tbl1>. At this point no table rule is applicable to the horizontal box list 62, so a table rule applicable to the vertical box list 63 is applied instead. The box lists are then created again, as shown in FIG. 30B.

Referring to FIG. 31A, table rule 9 is applied to the box group 25, which is <tbl1>. As a result, <tbl1> is converted to <gcb2>. Since all boxes have now been converted to <gcb>, the analysis ends. The horizontal priority structure is analyzed in this way.
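The reduction sequence walked through for FIGS. 27 to 31 can be sketched as a small priority-ordered rewriting loop. This is only an illustration under stated assumptions: the actual table rules are defined in the table structure grammar of FIG. 25, which is not reproduced here, so the bodies of `rule1`, `rule4`, `rule7`, and `rule9` are guesses inferred from the walkthrough, and the row layout (EXP1, IND1, IND2 on the first row; IND3, BLK1, BLK2 on the second) is likewise assumed. The helper names `pair_rule` and `reduce_lists` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Match = Tuple[int, int, str]                  # (start, end_exclusive, new symbol)
Rule = Callable[[List[str]], Optional[Match]]

def rule1(seq: List[str]) -> Optional[Match]:
    """Assumed table rule 1: IND followed by one or more BLK -> gcb."""
    for i, sym in enumerate(seq):
        if sym == "IND":
            j = i + 1
            while j < len(seq) and seq[j] == "BLK":
                j += 1
            if j > i + 1:
                return i, j, "gcb"
    return None

def pair_rule(left: Tuple[str, ...], right: str, out: str) -> Rule:
    """Build a rule that reduces an adjacent (left, right) pair to `out`."""
    def rule(seq: List[str]) -> Optional[Match]:
        for i in range(len(seq) - 1):
            if seq[i] in left and seq[i + 1] == right:
                return i, i + 2, out
        return None
    return rule

def rule9(seq: List[str]) -> Optional[Match]:
    """Assumed table rule 9: tbl -> gcb."""
    for i, sym in enumerate(seq):
        if sym == "tbl":
            return i, i + 1, "gcb"
    return None

rule4 = pair_rule(("EXP", "igt"), "IND", "igt")   # assumed table rule 4
rule7 = pair_rule(("igt",), "gcb", "tbl")         # assumed table rule 7

def reduce_lists(lists, rules, trace):
    """Apply the highest-priority applicable rule, then rescan from the top."""
    lists = [list(seq) for seq in lists]
    applied = True
    while applied:
        applied = False
        for name, rule in rules:              # rules are in priority order
            for seq in lists:
                match = rule(seq)
                if match:
                    start, end, sym = match
                    seq[start:end] = [sym]    # reduce the matched box group
                    trace.append(name)
                    applied = True
                    break
            if applied:
                break
    return lists

# Horizontal priority: exhaust the horizontal list first, then the vertical one.
trace: List[str] = []
h_list = [["EXP", "IND", "IND"], ["IND", "BLK", "BLK"]]
rows = reduce_lists(h_list, [("rule1", rule1), ("rule4", rule4)], trace)
v_list = [[row[0] for row in rows]]           # each row is now a single symbol
result = reduce_lists(v_list, [("rule7", rule7), ("rule9", rule9)], trace)
```

Under these assumptions, `trace` records the rules in the same order as the walkthrough (rule 1, rule 4 twice, rule 7, rule 9), and the whole table ends up as a single `gcb` symbol. Swapping the order in which the two lists are processed would give the vertical priority analysis.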

The instruction relationships between the analyzed boxes will be described with reference to FIG. 32.

First, the horizontal priority structure analysis tree 64 created from the analysis result will be described with reference to FIG. 32A. The arrows drawn with thick lines represent instruction relationships: IND1 gives an instruction to BLK1, and IND2 gives an instruction to BLK2.

Next, the vertical priority structure analysis tree 65 will be described with reference to FIG. 32B. The analysis procedure for the vertical priority structure is omitted because, as described above, it differs only in the order of the box lists to which the table rules are applied; only the analysis result is shown here. The thick arrows represent instruction relationships: IND3 gives instructions to BLK1 and BLK2.

Referring to FIG. 32C, the arrows in the table structure document 61 show the instruction relationships obtained from these analysis trees. The structure information of a table structure document can be created in this way.
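As a minimal illustration of how the relationships from the two analysis trees combine, the following sketch merges the instruction pairs read off FIG. 32A and FIG. 32B as described above. Representing each relationship as a `(source, target)` tuple and the function name `merge_instructions` are assumptions made for this example; the patent does not specify a data format.

```python
from collections import defaultdict

# Instruction relationships stated in the text for each analysis tree.
horizontal_priority = [("IND1", "BLK1"), ("IND2", "BLK2")]
vertical_priority = [("IND3", "BLK1"), ("IND3", "BLK2")]

def merge_instructions(*edge_lists):
    """Collect, for every target box, the boxes that give it an instruction."""
    received = defaultdict(set)
    for edges in edge_lists:
        for source, target in edges:
            received[target].add(source)
    return {target: sorted(sources) for target, sources in received.items()}

merged = merge_instructions(horizontal_priority, vertical_priority)
```

The merged result agrees with the description of FIG. 26: BLK1 is instructed by IND1 and IND3, and BLK2 by IND2 and IND3.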

<Fourth embodiment>
This embodiment is basically the same as the first to third embodiments, so the description focuses on the differences. Hereinafter, the documents handled in the first and second embodiments are referred to as general structure documents, and the format structure grammar used for them is referred to as the general structure grammar. Likewise, the documents handled in the third embodiment are referred to as table structure documents, and the format structure grammar used for them is referred to as the table structure grammar.

This embodiment handles both general structure documents and table structure documents, including documents in which the two are combined. In addition, it allows the box types to be classified by specifying only the insertion boxes.

The general structure document is analyzed by the method described in the second embodiment, which performs both the horizontal priority structure analysis and the vertical priority structure analysis.

Specifically, the box types are classified as follows. First, blank boxes are detected automatically by the image reading means as boxes containing no character string and are automatically set to BLK. Next, all remaining boxes, that is, those containing a character string, are set as instruction boxes. The insertion boxes are then set manually. At this point, both explanation boxes and instruction boxes are classified as instruction boxes. When the format structure grammar is applied, however, any box that is classified as an instruction box but gives no instruction to any box is treated as an explanation box, which separates the instruction boxes from the explanation boxes.
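The classification order described above can be sketched as follows. This is a simplified illustration, not the patented implementation: boxes are modeled as a mapping from a box number to its recognized text, the label "INS" for insertion boxes is a hypothetical abbreviation (only BLK, IND, and EXP appear in the description), and the instruction relationships are passed in as ready-made `(source, target)` pairs rather than derived from a grammar.

```python
def classify_boxes(boxes, insertion_ids):
    """Initial classification: blank -> BLK, text -> IND, manual picks -> INS."""
    kinds = {}
    for box_id, text in boxes.items():
        # Blank boxes are detected automatically; everything else starts as IND.
        kinds[box_id] = "BLK" if text.strip() == "" else "IND"
    for box_id in insertion_ids:
        # Insertion boxes are the only type specified by hand.
        kinds[box_id] = "INS"
    return kinds

def demote_silent_instruction_boxes(kinds, instructions):
    """Reclassify IND boxes that give no instruction as explanation boxes."""
    sources = {source for source, _target in instructions}
    return {
        box_id: "EXP" if kind == "IND" and box_id not in sources else kind
        for box_id, kind in kinds.items()
    }

# Illustrative form: box contents are made up for this example.
boxes = {1: "Name", 2: "", 3: "Application form", 4: "Male / Female"}
kinds = classify_boxes(boxes, insertion_ids={4})
final = demote_silent_instruction_boxes(kinds, instructions=[(1, 2)])
```

Here box 3 starts as an instruction box and ends as an explanation box because it instructs nothing. In the full method of this embodiment, such a box is first checked as a possible upper-left box of a table structure before it is reclassified.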

Grammar rule 8 of the format structure grammar 50 shown in FIG. 7 was included only to handle table structures formally, so it is no longer needed. In the second embodiment, the upper-left box of a table structure was assumed to be an explanation box. Because the user specifies only the insertion boxes, however, both the upper-left explanation box of a table structure and the explanation boxes elsewhere in the ruled-line document are classified as instruction boxes. By definition, an explanation box neither gives instructions to other boxes nor receives instructions from them. Consequently, when the general structure grammar is applied, both the horizontal priority and the vertical priority structure analyses yield a contradiction: a box is classified as an instruction box yet gives no instruction to any other box. At this point it cannot be determined whether the contradictory instruction box is the upper-left explanation box of a table structure or an explanation box in the general structure, but it is known to be one of the two.

Therefore, the group of boxes from the contradictory box onward is analyzed with the table structure grammar. If the contradictory box is the upper-left box of a table structure, the group is reduced to <tbl>, and the contradictory box is treated as an explanation box. Conversely, if the group is not reduced to <tbl>, the contradictory box is judged to be an explanation box in the general structure; it is replaced with an explanation box, and the general structure grammar is applied again.

A specific method for creating document structure information according to this embodiment will be described with reference to FIG. 33.

In step S10 of the flowchart 70, a box list is created for the document whose box types have been classified as described above.

Next, in step S11, the format structure grammar is applied to the box list. The grammar applied here is the format structure grammar 50 shown in FIG. 7 with grammar rule 8 excluded.

In step S12, after the grammar has been applied, it is determined whether an unindicated relation error has occurred. If no such error has occurred, the analysis has succeeded. A box that causes an unindicated relation error is either a box with an incorrect box type or the first box of a table structure. Therefore, if an unindicated relation error occurs, the process proceeds to step S13 to determine which of the two caused it.

In step S13, to determine the cause of the unindicated relation error, the instruction relationships of the structure analysis tree are analyzed using the table structure grammar. The table structure grammar is applied to all boxes from the box in which the error occurred onward.

In step S14, if applying the table structure grammar finds no table structure starting at the box in which the unindicated relation error occurred, the box is judged to have an incorrect box type and is converted to an explanation box.

After the box in which the error occurred has been converted to an explanation box, the format structure grammar is applied again in step S11. If no error occurs this time, the analysis has succeeded. In other words, the box in which the error occurred has turned out to be an explanation box rather than an insertion box.

If a tabular instruction is detected, a table structure exists. In that case, the range of the table structure is identified, the process returns to step S11, and the format structure grammar is applied again to the boxes following the table structure. The table structure extends up to the highest-numbered entry box that receives an instruction.
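The control flow of steps S11 to S14 can be sketched as a loop that either skips past a detected table or converts the offending box to an explanation box and retries. The sketch below is illustrative only: the two grammars are replaced by toy stand-ins (`toy_find_error` and `toy_find_table_end`, both hypothetical) that operate on a flat list of box types in box-number order, whereas the method described here works on horizontal and vertical box lists with the grammars of FIG. 7 and FIG. 25.

```python
from typing import Callable, List, Optional, Tuple

ErrorFinder = Callable[[List[str], int], Optional[int]]
TableFinder = Callable[[List[str], int], Optional[int]]

def analyze_mixed(kinds: List[str], find_error: ErrorFinder,
                  find_table_end: TableFinder) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return the corrected box types and the (first, last) index of each table."""
    kinds = list(kinds)
    tables: List[Tuple[int, int]] = []
    start = 0
    while True:
        err = find_error(kinds, start)        # S11/S12: general grammar, error check
        if err is None:
            return kinds, tables              # no unindicated relation error: success
        end = find_table_end(kinds, err)      # S13: table grammar from the error box
        if end is None:
            kinds[err] = "EXP"                # S14: wrong type, make it an explanation box
        else:
            tables.append((err, end))         # table found: resume after its range
            start = end + 1

def toy_find_error(kinds: List[str], start: int) -> Optional[int]:
    """Toy stand-in: an IND box errs unless the next box can receive its instruction."""
    for i in range(start, len(kinds)):
        if kinds[i] == "IND":
            nxt = kinds[i + 1] if i + 1 < len(kinds) else None
            if nxt not in ("BLK", "INS"):
                return i
    return None

def toy_find_table_end(kinds: List[str], first: int) -> Optional[int]:
    """Toy stand-in: a table is the first box, two or more IND, then one or more BLK."""
    j = first + 1
    while j < len(kinds) and kinds[j] == "IND":
        j += 1
    k = j
    while k < len(kinds) and kinds[k] == "BLK":
        k += 1
    if j - (first + 1) >= 2 and k > j:
        return k - 1
    return None

boxes = ["IND", "BLK", "IND", "IND", "IND", "BLK", "BLK", "IND", "IND", "BLK"]
final_kinds, table_ranges = analyze_mixed(boxes, toy_find_error, toy_find_table_end)
```

With these stand-ins, the box at index 2 triggers an error and is recognized as the start of a table covering indices 2 to 6, while the box at index 7 triggers an error with no table behind it and is converted to EXP. The loop terminates because each pass either advances past a table or removes one instruction box from consideration.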

With the above method, a document in which general structure documents and table structure documents are mixed can also be analyzed, and its document structure information can be created. The method can also create document structure information for a document in which general structure documents and table structure documents alternate.

A document 80 shown in FIG. 34 is taken as an example of a document in which general structure documents and table structure documents are mixed. FIG. 35 shows the result of classifying the box types of the document 80. As described above, all explanation boxes are classified as instruction boxes here. When the general structure grammar is first applied to this document, the hatched box 85 causes an unindicated relation error. The table structure grammar is therefore applied to all boxes from box 85 onward. When the instruction relationships are analyzed after the table structure grammar has been applied, tabular instructions are found from the boxes adjacent to boxes 86A and 87A through the boxes that repeatedly receive instructions from boxes 86B and 87B, and no unindicated relation error occurs. This range is therefore confirmed as a table, and the boxes up to box 85 are reanalyzed with the general structure grammar. When the general structure grammar is applied again from box 90 onward, a structure analysis tree is created, and the instruction relationship analysis produces no unindicated relation error. FIG. 36 shows the instruction relationships analyzed in this way. In FIG. 36, the box at the start of each arrow gives an instruction to the box at the end of the arrow. The document 80 thus turns out to be a document in which the table structure document 89 lies between the general structure document 88A and the general structure document 88B.

As described above, the document structure information of a document in which general structure documents and table structure documents are mixed is created.

FIG. 1(A) and (B) are diagrams explaining a document to which the document structure information creation method of the present invention is applied.
FIG. 2 is a diagram explaining a document to which the method is applied.
FIG. 3(A) and (B) are diagrams explaining the box types used in the method.
FIG. 4(A) and (B) are diagrams explaining the instruction relationships between boxes in the method.
FIG. 5(A) to (D) are diagrams explaining the instruction relationships between boxes in the method.
FIG. 6(A) to (E) are diagrams explaining the instruction relationships between boxes in the method.
FIG. 7 is a diagram explaining the format structure grammar used in the method.
FIG. 8(A) to (D) are diagrams explaining the structure analysis tree of the method.
FIG. 9 is a diagram explaining a flowchart of the method.
FIG. 10(A) and (B) are diagrams explaining a document to which the method is applied.
FIG. 11(A) and (B) are diagrams explaining the box list of the method.
FIGS. 12 to 16 are diagrams explaining the method.
FIG. 17(A) and (B) are diagrams explaining the method.
FIGS. 18 and 19, each with panels (A) to (C), are diagrams explaining the method.
FIGS. 20 to 22, each with panels (A) and (B), are diagrams explaining the method.
FIG. 23(A) to (C) are diagrams explaining the method.
FIG. 24(A) to (E) are diagrams explaining the table structure to which the method is applied.
FIG. 25 is a diagram explaining the table structure grammar used in the method.
FIGS. 26 to 31, each with panels (A) and (B), are diagrams explaining the method.
FIG. 32(A) to (C) are diagrams explaining the method.
FIGS. 33 to 36 are diagrams explaining the method.

Claims

A method for creating document structure information from a document in which general structure documents and table structure documents are mixed, the method comprising:
a first step of classifying, automatically or manually, the boxes of the document, which has a plurality of boxes separated by ruled lines, according to box type;
a second step of creating, in a computer, a box list indicating the vertical and/or horizontal adjacency relationships of the boxes; and
a third step of applying to the box list a format structure grammar that consists of a plurality of grammar rules having a priority order and in which each grammar rule defines an instruction relationship between boxes,
wherein the third step comprises:
applying to the box list a general structure grammar, which is the format structure grammar in which the instruction relationships between the boxes run in only one direction, vertical or horizontal; and
applying, to the group of boxes following a box in which a contradiction has occurred, a table structure grammar, which is the format structure grammar in which the instruction relationships between the boxes run in both the vertical and horizontal directions.

The method for creating document structure information according to claim 1, wherein each box is one of:
an instruction box that gives instructions to other boxes;
an explanation box that neither receives instructions from other boxes nor gives instructions;
a blank box in which nothing is written and into which information is to be entered; and
an insertion box in which characters are written and into which information is to be entered.

The method for creating document structure information according to claim 2, wherein, in the first step,
the blank boxes are classified automatically using image reading means, and
the instruction boxes, the explanation boxes, and the insertion boxes are classified manually.

The method for creating document structure information according to claim 1, wherein, in the third step,
a structure analysis tree indicating the vertical or horizontal instruction relationships between the boxes is formed.

The method for creating document structure information according to claim 1, wherein, in the third step:
a horizontal priority structure analysis tree is created by applying the format structure grammar to the box list indicating horizontal adjacency and then to the box list indicating vertical adjacency;
a vertical priority structure analysis tree is created by applying the format structure grammar to the box list indicating vertical adjacency and then to the box list indicating horizontal adjacency; and
the document structure is analyzed from the horizontal priority structure analysis tree and the vertical priority structure analysis tree.

The method for creating document structure information according to claim 2, wherein:
in the first step, the blank boxes are classified using image reading means,
the insertion boxes are classified manually, and the boxes other than the blank boxes and the insertion boxes are classified as instruction boxes; and
in the third step, the instruction relationships between adjacent boxes are analyzed, and instruction boxes having no instruction relationship with any other box are reclassified as explanation boxes.

The method for creating document structure information according to claim 1, wherein the grammar rules are described in extended Backus-Naur form.

The method for creating document structure information according to claim 1, wherein, in the third step,
the box list is scanned for applicable locations in order, starting from the grammar rule with the highest priority, and,
after a grammar rule has been applied to an applicable location, the scan for applicable locations starts again from the grammar rule with the highest priority.