JP3607182B2

JP3607182B2 - Document information extraction apparatus, method, and recording medium recording the program

Info

Publication number: JP3607182B2
Application number: JP2000256353A
Authority: JP
Inventors: 大子郎横関; 隆彦村山; 修一郎山本
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2000-08-25
Filing date: 2000-08-25
Publication date: 2005-01-05
Anticipated expiration: 2020-08-25
Also published as: JP2002073589A

Description

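[0001]
[Technical field of the invention]
The present invention relates to a document information extraction apparatus and method capable of extracting only specific subtree documents from document information having a tree structure, and to a recording medium on which a program therefor is recorded.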
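[0002]
[Prior art]
In recent years, it has become common to create and manage document information in XML (Extensible Markup Language), which evolved from HTML (HyperText Markup Language) and can represent document structure as a tree.
Document information in HTML or XML describes the structure of a document using marks called tags, which indicate the definition of the document. In HTML the kinds and meanings of the available tags are fixed in advance, whereas in XML users can define their own. Related companies therefore define their own tags so that data can be exchanged easily between them.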
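[0003]
[Problems to be solved by the invention]
Conventionally, however, the methods available for obtaining arbitrary required document information from tree-structured document information described in XML each had the following problems.
First, in the method that handles the XML tree structure as a tree and extracts the required documents with a programming language, dedicated programming is required for each document structure to be extracted. Extraction is laborious and not general-purpose, and because many programs are required, the method is also inefficient in terms of equipment.
Second, in the method that combines a means for designating the target part of the tree structure with a program that executes an action, such as deleting or keeping that part, when it is found, the correspondence between the two must be specified each time according to the document structure to be extracted. This method is likewise not general-purpose and is inefficient.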
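[0004]
The present invention has been made in view of the above problems, and its object is to provide a document information extraction apparatus capable of extracting only specific subtree documents from document information having a tree structure. More specifically, its object is to provide a document information extraction apparatus and method, and a recording medium recording a program therefor, that compare the tree structure of definition information, which describes conversion rules for converting document information into document information that can be disclosed to a user, with the tree structure of the document information, and extract from the document information the subtree documents having the same structure and values as the definition information.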
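[0005]
[Means for Solving the Problems]
To solve the above problems, the present invention is a document information extraction apparatus that extracts, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the apparatus comprising: definition information management means for extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; definition interpretation means for interpreting the tree structure of the definition information extracted by the definition information management means; and subtree extraction means for extracting the subtree document by comparing the tree structure of the definition information interpreted by the definition interpretation means with the tree structure of the document information. The subtree extraction means comprises: first tree structure information holding means having an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; second tree structure information holding means having an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false; leaf node processing means which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true in the first tree structure information holding means as the evaluation result of the leaf node; branch node processing means which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing means which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing means, and, after the branch node processing means has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing means which activates the section processing means, accumulates the evaluation results of a plurality of runs of the section processing means in the second tree structure information holding means by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing means are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.
With this configuration, the tree structure of the definition information, which describes conversion rules for converting document information into document information that can be disclosed to a user, can be compared with the tree structure of the document information, and the subtree documents having the same structure and values as the definition information can be extracted from the document information. In addition, the evaluation results of the extraction target document for the plurality of subtree documents that the definition information designates for extraction can be accumulated and held, and the subtree documents can be extracted together at the end.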
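[0007]
In the above document information extraction apparatus, the definition information includes an identifier that designates an element or value to be extracted, and a constraint that designates the condition under which the element or value specified by the identifier is extracted.
[0008]
In the above document information extraction apparatus, the markup language is XML.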
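[0009]
The present invention is also a document information extraction method for extracting, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the method comprising: a process of extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; a process of interpreting the tree structure of the extracted definition information; and a process of extracting the subtree document by comparing the tree structure of the interpreted definition information with the tree structure of the document information. The process of extracting the subtree document comprises: leaf node processing which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true as the evaluation result of the leaf node in first tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; branch node processing which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing, and, after the branch node processing has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing which activates the section processing, accumulates the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false, by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.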
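[0010]
The present invention is also a recording medium on which is recorded a program used in a document information extraction method for extracting, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the program comprising: a process of extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; a process of interpreting the tree structure of the extracted definition information; and a process of extracting the subtree document by comparing the tree structure of the interpreted definition information with the tree structure of the document information. As the process of extracting the subtree document, the program causes a computer to execute: leaf node processing which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true as the evaluation result of the leaf node in first tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; branch node processing which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing, and, after the branch node processing has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing which activates the section processing, accumulates the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false, by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.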
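[0011]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
First, an embodiment of the present invention will be described with reference to FIGS. 1 to 10.
FIG. 1 is a block diagram illustrating the configuration of the document information extraction apparatus according to the embodiment. In FIG. 1, reference numeral 1 denotes a subtree extraction processing unit that extracts subtree documents from an extraction target document using designated definition information. Reference numeral 2 denotes a user information management unit that provides a user's authority information based on the user's ID information and the like. Reference numeral 3 denotes a user information database that uniquely records users' ID information and the like together with their authority information. Reference numeral 4 denotes a definition information management unit that designates definition information based on the document information to be extracted and the user's authority information. Reference numeral 5 denotes a definition information database in which definition information is recorded in advance in correspondence with user authority information.
Reference numeral 6 denotes a definition interpretation processing unit that interprets the tree structure of the designated definition information.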
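[0012]
The subtree extraction processing unit 1 comprises a rule processing unit 11, a tree structure information holding unit 12 (second tree structure information holding means), a section processing unit 13, a tree structure information temporary holding unit 14 (first tree structure information holding means), a branch node processing unit 15, and a leaf node processing unit 16.
The rule processing unit 11 creates, in the tree structure information holding unit 12, an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the given document information. It also logically combines and accumulates into the tree structure information holding unit 12 the subtree evaluation results recorded in the tree structure information temporary holding unit 14 for each of the plurality of subtree documents that the definition information supplied by the definition interpretation processing unit 6 designates for extraction, and extracts as subtree documents the subtrees for which true is recorded in the tree structure information holding unit 12.
The tree structure information holding unit 12 is a recording unit made up of structure recording units, each recording a plurality of elements, and is a buffer that logically combines and accumulates subtree evaluation results.
The section processing unit 13 creates, in the tree structure information temporary holding unit 14, an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the document information. It also searches the document information for a node identical to a node constituting the tree structure of the definition information, activates the branch node processing unit 15, and, after the branch node processing unit 15 has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true as the evaluation result of each subtree that contains only true values.
The tree structure information temporary holding unit 14 is a recording unit made up of structure recording units, each recording a plurality of elements, and is a buffer that temporarily holds subtree evaluation results.
The branch node processing unit 15 logically combines the evaluation results of the child nodes, including leaf nodes, located on the branches below it in the tree structure, and records the result in the tree structure information temporary holding unit 14.
The leaf node processing unit 16 compares the value at a designated terminal part of the tree structure of the definition information with the value at a designated terminal part of the tree structure of the document information and, when the two match, records true in the tree structure information temporary holding unit 14 as the evaluation result of the leaf node.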
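[0013]
The tree structure information holding unit 12 and the tree structure information temporary holding unit 14 of the subtree extraction processing unit 1, as well as the user information database 3 and the definition information database 5, are each composed of a computer-readable and writable recording medium: a nonvolatile memory such as a hard disk drive, magneto-optical disk drive, or flash memory; a read-only recording medium such as a CD-ROM; a volatile memory such as RAM (Random Access Memory); or a combination of these.
[0014]
The rule processing unit 11, section processing unit 13, branch node processing unit 15, and leaf node processing unit 16 of the subtree extraction processing unit 1, as well as the user information management unit 2, definition information management unit 4, and definition interpretation processing unit 6, may each be implemented by dedicated hardware, or may be implemented by a memory and a CPU (central processing unit), with a program for realizing the function of each unit loaded into the memory and executed.
[0015]
Peripheral devices such as an input device and a display device (neither shown) are connected to the document information extraction apparatus of this embodiment. Here, the input device means an input device such as a keyboard or mouse, and the display device means a CRT (Cathode Ray Tube) display, a liquid crystal display, or the like.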
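[0016]
Next, the operation of the embodiment will be described with reference to FIGS. 2 to 10.
First, the XML representation and tree representation of the extraction target document and the definition information are described with reference to FIGS. 2 to 6. FIG. 2 is a schematic diagram showing an example of the extraction target document and extracted document described in this embodiment, and FIG. 3 is a schematic diagram illustrating the evaluation result holding unit for the extraction target document that is built in the tree structure information holding unit 12 or the tree structure information temporary holding unit 14.
In FIG. 2, <A> and <B> each represent one node (junction), and tags such as <B> and <C> written between a matching pair of tags such as <A> and </A> represent branch nodes or leaf nodes located below that node.
In FIG. 3, the evaluation result holding unit for the extraction target document built in the tree structure information holding unit 12 or the tree structure information temporary holding unit 14 has structure recording units D01 to D18 arranged in correspondence with the nodes shown in FIG. 2, forming a tree structure.
FIG. 4(a) shows the contents of the evaluation result holding unit that the rule processing unit 11 creates in the tree structure information holding unit 12; each structure recording unit holds a link to the extraction target document and the truth value of the evaluation result. FIG. 4(b) shows the contents of the evaluation result holding unit that the section processing unit 13 creates in the tree structure information temporary holding unit 14; each structure recording unit holds a link to the extraction target document, a link to the tree structure information holding unit 12, and the truth value of the evaluation result.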
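The two evaluation result holding units of FIG. 4 can be pictured as mirror trees of the document. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `ResultRecord`, `build_holder`, and `count_records` are hypothetical, and the sample document is invented for illustration. Each record holds a link to its document node and a truth value initialized to false; records in the temporary holder (unit 14) additionally link to the record at the same position in the accumulating holder (unit 12).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResultRecord:
    """Hypothetical stand-in for one structure recording unit (D01, D02, ...)."""
    doc_node: ET.Element                          # link to the extraction target document
    value: bool = False                           # evaluation result, initialized to false
    accumulated: Optional["ResultRecord"] = None  # link into unit 12 (used by unit 14 only)
    children: List["ResultRecord"] = field(default_factory=list)


def build_holder(doc_node, accumulated=None):
    """Build a holder with the same tree shape as the document."""
    record = ResultRecord(doc_node, accumulated=accumulated)
    for index, child in enumerate(doc_node):
        linked = accumulated.children[index] if accumulated is not None else None
        record.children.append(build_holder(child, linked))
    return record


def count_records(record):
    """Count the records in a holder tree."""
    return 1 + sum(count_records(child) for child in record.children)


# Invented sample document; the real layout is given in FIG. 2.
document = ET.fromstring("<A><B>b1</B><C><D>d1</D></C></A>")
holder_12 = build_holder(document)             # as in FIG. 4(a)
holder_14 = build_holder(document, holder_12)  # as in FIG. 4(b)
```

Keeping a direct link from each temporary record to its accumulating counterpart means that results can later be merged position by position without searching either tree.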
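FIG. 5 is a schematic diagram showing an example of the definition information described in this embodiment. In the definition information, each group of subtree documents to be extracted is expressed as a section (<section>). The document information extraction apparatus of the present invention extracts from the extraction target document the subtree documents that have the same structure and values as the tags written between the <section> and </section> tags of each section. The definition information is expressed, for example, by an identifier "*" indicating the part to be extracted and a conditional expression =="b1", both written between <B> and </B>. FIG. 6 is a schematic diagram of this definition information converted into a tree representation. In FIG. 6, R02 representing "section x" and R11 representing "section y", each a group of subtree documents to be extracted, are placed under R01, which indicates that this is definition information processed by the rule processing unit 11. Under each of these are placed R03 representing node A, R04 representing node B, and so on.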
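As an illustration of the identifier and conditional expression mentioned above, a leaf entry of the definition information could be split as follows. This is a sketch only: the exact syntax is defined by FIG. 5, which is not reproduced in the text, so the assumed form (the identifier `*` optionally followed by `=="value"`) and the helper name `parse_leaf_spec` are hypothetical. Only the `*` identifier and the equality condition are attested in the description.

```python
import re

# Assumed form: '*' alone, or '*' followed by =="value".
LEAF_SPEC = re.compile(r'^\s*(?P<identifier>\*)\s*(?:==\s*"(?P<expected>[^"]*)")?\s*$')


def parse_leaf_spec(text):
    """Return (identifier, expected_value); expected_value is None when no condition is set."""
    match = LEAF_SPEC.match(text or "")
    if match is None:
        raise ValueError(f"unsupported leaf specification: {text!r}")
    return match.group("identifier"), match.group("expected")


with_condition = parse_leaf_spec('*=="b1"')
without_condition = parse_leaf_spec("*")
```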
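[0017]
With the above in mind, the operation of the subtree extraction processing unit 1 is described using the flowcharts of FIGS. 7 to 10. It is assumed that the definition information determined from the given user information has been supplied to the subtree extraction processing unit 1, via the definition interpretation processing unit 6, as the tree structure shown in FIG. 6, and that the document information has also been supplied in advance.
FIG. 7 is a flowchart illustrating the operation of the rule processing unit 11 of the subtree extraction processing unit 1. First, the rule processing unit 11 creates in the tree structure information holding unit 12 an evaluation result holding unit having the same tree structure as the extraction target document shown in FIG. 3 (step S1).
The truth values of the evaluation result holding unit created in the tree structure information holding unit 12 are initialized to "False" (step S2).
Next, the section processing unit 13 is activated and section processing is executed (step S3).
When section processing for one section is complete, the truth values of its evaluation result ("False" or "True") are logically combined with the contents of the tree structure information holding unit 12, so that the evaluation results of the sections are accumulated and held (step S4).
Once the evaluation result for one section has been stored, it is determined whether all sections have been evaluated (step S5).
If not all sections have been evaluated (NO in step S5), the process returns to step S3 and the next section is evaluated.
If all sections have been evaluated (YES in step S5), the process proceeds to step S6.
When all sections have been evaluated, the subtrees whose evaluation result in the tree structure information holding unit 12 is "True" are extracted from the extraction target document (step S6), and the process ends.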
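The loop of steps S1 to S6 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the patented implementation: evaluation results are kept as flat lists of booleans in document order instead of mirror trees, section processing is passed in as a callable, and `tags_in_section` is a hypothetical stand-in that merely marks nodes by tag name.

```python
import xml.etree.ElementTree as ET


def rule_process(document, sections, evaluate_section):
    """Accumulate per-section results with logical OR, then return the nodes marked true."""
    nodes = list(document.iter())              # S1: one result slot per document node
    accumulated = [False] * len(nodes)         # S2: initialize to False
    for section in sections:                   # S3 and S5: run every section in turn
        temporary = evaluate_section(document, section)
        # S4: OR the temporary result into the accumulated result, position by position
        accumulated = [a or t for a, t in zip(accumulated, temporary)]
    return [node for node, keep in zip(nodes, accumulated) if keep]  # S6


def tags_in_section(document, section):
    """Hypothetical stand-in for section processing: mark nodes whose tag is listed."""
    return [node.tag in section for node in document.iter()]


document = ET.fromstring("<A><B>b1</B><C>c1</C><D>d1</D></A>")
extracted = rule_process(document, [{"A", "B"}, {"A", "D"}], tags_in_section)
extracted_tags = [node.tag for node in extracted]
```

Because the results are merged with OR, a node selected by any one section stays selected, and the extraction itself happens only once after the last section.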
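[0018]
FIG. 8 is a flowchart illustrating the operation of the section processing unit 13 of the subtree extraction processing unit 1. Activated by the rule processing unit 11, the section processing unit 13 creates in the tree structure information temporary holding unit 14 an evaluation result holding unit having the same tree structure as the extraction target document shown in FIG. 3 (step S11).
The truth values of the evaluation result holding unit created in the tree structure information temporary holding unit 14 are initialized to "False" (step S12).
When initialization of the tree structure information temporary holding unit 14 is complete, the extraction target document is searched for a child node having the same name as the tag written at the top level between <section> and </section>, as one section to be extracted (step S13).
Next, it is determined whether a matching node was found in the extraction target document (step S14). If no matching node was found (NO in step S14), nothing further is done and the recorded contents of the tree structure information temporary holding unit are returned to the rule processing unit 11 (step S17).
If a matching node was found in step S14 (YES in step S14), the branch node processing unit 15 is activated and branch node processing is executed (step S15).
When branch node processing is complete, the subtree documents making up the section are evaluated: the subtree documents located on the branches below the node are evaluated, and "True" is recorded as the evaluation result of each subtree that contains only true values (step S16).
When evaluation of the subtree documents making up the section is complete, the recorded contents of the tree structure information temporary holding unit are returned to the rule processing unit 11 (step S17), and the process ends.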
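[0019]
FIG. 9 is a flowchart illustrating the operation of the branch node processing unit 15 of the subtree extraction processing unit 1. Activated by the section processing unit 13, the branch node processing unit 15 creates within itself truth value holding units that hold the truth values of the child nodes located below it (step S21). One truth value holding unit is created for each kind (name) of child node described at the corresponding branch node of the definition information.
The truth values of the created truth value holding units are initialized to "False" (step S22).
Next, it is determined whether the child node located below is a branch node that has further branches below it, or a leaf node that has no nodes below it and holds only a value (step S23).
If the child node is a leaf node that has no nodes below it and holds only a value (NO in step S23), the leaf node processing unit 16 is activated and leaf node processing is executed (step S32). The process then proceeds to step S25.
If the child node is a branch node that has nodes below it (YES in step S23), branch node processing is executed again for that node (step S24).
When step S24 or step S32 has finished, the truth value corresponding to the kind (name) of the child node just processed is read from the truth value holding unit, its logical OR with the result of the branch node processing or leaf node processing is computed, and the result is written back to the same truth value holding unit (step S25).
Next, it is determined whether all child nodes have been evaluated (step S26).
If not all child nodes have been evaluated (NO in step S26), the process returns to step S23 and the next child node is evaluated.
If all child nodes have been evaluated (YES in step S26), it is determined whether the truth value holding units for all kinds (names) of child nodes are "True" (step S27).
If in step S27 the truth value holding units for all kinds (names) of child nodes are "True" (YES in step S27), "True" is set at the corresponding branch node in the tree structure information temporary holding unit 14 (step S28).
"True" is then returned to the caller (step S29), and the process ends. If in step S27 any of the truth value holding units is "False" (NO in step S27), "False" is set at the corresponding branch node (step S30).
"False" is then returned to the caller (step S31), and the process ends.
If all the kinds (names) of child nodes described at the corresponding branch node of the definition information, and the truth values of the evaluation results of the child nodes, are expressed as
[Equation 1]

then the evaluation performed when the child nodes of the above branch node are themselves branch nodes having nodes below them is given by the following expression.
[Equation 2]

Here, the logical OR in the above expression corresponds to step S25 and the logical AND to step S27.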
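The images for Equations 1 and 2 are not reproduced in this text. One notation consistent with the surrounding description, offered here as an assumption rather than as the original formula, is the following: let $n_1, \dots, n_k$ be the kinds (names) of child nodes described at the branch node of the definition information, let $m_i$ be the number of child nodes named $n_i$ in the document, and let $v_{i,j}$ be the truth value obtained for the $j$-th child node named $n_i$. The evaluation result $V$ of the branch node is then

```latex
\[
V \;=\; \bigwedge_{i=1}^{k} \Bigl( \bigvee_{j=1}^{m_i} v_{i,j} \Bigr)
\]
```

The inner OR is what step S25 builds up one child at a time, and the outer AND is the check of step S27. For example, with two names B and C, where the B children evaluate to (False, True) and the single C child evaluates to True, $V = (\mathrm{False} \lor \mathrm{True}) \land \mathrm{True} = \mathrm{True}$.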
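[0020]
FIG. 10 is a flowchart illustrating the operation of the leaf node processing unit 16 of the subtree extraction processing unit 1. Activated by the branch node processing unit 15, the leaf node processing unit 16 first determines whether a conditional expression specifying a value is set at the position corresponding to the leaf node in the definition information (step S41).
If no conditional expression specifying the value of the leaf node is set (NO in step S41), the condition is regarded as satisfied, the conditional check of step S42 is skipped, and the process proceeds to step S43.
If a conditional expression specifying the value of the leaf node is set (YES in step S41), it is determined whether the value of the leaf node satisfies the conditional expression set in the definition information (step S42).
If the value of the leaf node does not satisfy the conditional expression (NO in step S42), the truth value in the tree structure information temporary holding unit 14 corresponding to that leaf node is set to "False" (step S47).
Once the truth value in the tree structure information temporary holding unit 14 has been set, "False" is returned to the caller (step S48), and the process ends.
If the value of the leaf node satisfies the conditional expression (YES in step S42), the truth value in the tree structure information temporary holding unit 14 corresponding to that leaf node is set to "True" (step S46).
Once the truth value in the tree structure information temporary holding unit 14 has been set, "True" is returned to the caller (step S45), and the process ends.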
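The leaf node decision can be sketched in a few lines. This is an illustration under stated assumptions: the condition is modelled as an optional Python callable, the temporary holder as a plain dictionary, and the path-like keys are hypothetical. The description states only that a missing condition counts as satisfied and that the result is both recorded and returned.

```python
def leaf_process(doc_value, condition, temporary, key):
    """Evaluate one leaf node, record the result in the temporary holder, and return it."""
    if condition is None:
        satisfied = True                   # S41 NO: no condition set, treated as satisfied
    else:
        satisfied = condition(doc_value)   # S42: check the value against the condition
    temporary[key] = satisfied             # record "True" or "False" for this leaf
    return satisfied                       # hand the same result back to the caller


temporary = {}
first = leaf_process("b1", lambda value: value == "b1", temporary, "A/B")
second = leaf_process("c9", lambda value: value == "c1", temporary, "A/C")
third = leaf_process("d1", None, temporary, "A/D")
```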
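[0021]
As described above, the rule processing unit 11 activates the section processing unit 13, the section processing unit 13 activates the branch node processing unit 15, and the branch node processing unit 15 activates either itself or the leaf node processing unit 16; each evaluates its designated subtree and returns the evaluation result to its caller. Each time the section processing unit completes one run, the evaluation result of the extraction target document for one group of subtree documents to be extracted (one section) is obtained. Rule processing repeats the evaluation of the extraction target document as many times as there are sections designated in the definition information and, when all are finished, identifies the subtree documents to be extracted from the logical OR of the evaluation results of the sections.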
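The recursive call chain between branch node and leaf node processing can be sketched as below. This is a simplified illustration, not the patented implementation. It assumes that the definition is itself a small XML tree in which a leaf's text is the required value and an empty leaf means "no condition" (a simplification of the identifier and conditional expression syntax), that each name appears once per definition branch, and that the function name `evaluate` and the sample data are hypothetical. The follow-up subtree evaluation of step S16 is not modelled.

```python
import xml.etree.ElementTree as ET


def evaluate(def_node, doc_node, marks):
    """Evaluate doc_node against def_node; record every result in marks and return it."""
    def_children = list(def_node)
    if not def_children:
        # Leaf node: no text in the definition means no condition, which counts as satisfied.
        expected = def_node.text
        result = expected is None or (doc_node.text or "") == expected
    else:
        by_name = {child.tag: child for child in def_children}
        per_name = {name: False for name in by_name}   # one truth value per child name
        for doc_child in doc_node:
            if doc_child.tag in by_name:
                outcome = evaluate(by_name[doc_child.tag], doc_child, marks)
                # Same-named children are combined with OR.
                per_name[doc_child.tag] = per_name[doc_child.tag] or outcome
        result = all(per_name.values())                # different names are combined with AND
    marks[doc_node] = result
    return result


definition = ET.fromstring("<A><B>b1</B><C/></A>")
document = ET.fromstring(
    "<root>"
    "<A><B>b0</B><B>b1</B><C>c1</C></A>"
    "<A><B>b2</B><C>c2</C></A>"
    "</root>"
)
marks = {}
results = [evaluate(definition, node, marks) for node in document.iter("A")]
```

In this sample the first `A` is accepted because one of its `B` children equals `b1` and a `C` child exists, while the second `A` is rejected because none of its `B` children matches.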
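[0022]
Next, an example using the document information extraction apparatus described in the above embodiment is described with reference to FIGS. 11 to 15. In this example, the apparatus is used to sort information when estimate documents are exchanged between a personal computer manufacturer and its component suppliers.
FIG. 11 is a schematic diagram illustrating this example. In FIG. 11, reference numeral 50 denotes an information management center that uses the document information extraction apparatus.
Reference numeral 51 denotes the subtree extraction processing unit of the document information extraction apparatus. Reference numeral 52 denotes its user information management unit. Reference numeral 53 denotes its definition information management unit.
Reference numeral 54 denotes document information Da. Document information Da is a document specifying the contents of an estimate; as shown in FIG. 12, the information needed for the estimate, such as part name, model name, quantity, and price, is described as tags for each component of the personal computer, such as memory and disk. Document information Da is registered in advance at the information management center 50 by the personal computer manufacturer.
Reference numerals 55 and 56 denote definition information Rb and definition information Rc, set respectively for supplier B and supplier C, the personal computer component companies from which estimates are requested. Definition information Rb and Rc are also registered in advance at the information management center 50 by the personal computer manufacturer. FIG. 13(a) shows the definition information for supplier B, and FIG. 13(b) shows the definition information for supplier C. In FIG. 13(a), because supplier B is a component company that handles memory, the tags are arranged so that only information on memory is extracted. Similarly, in FIG. 13(b), because supplier C is a component company that handles disks, the tags are arranged so that only information on disks is extracted.
Reference numerals 57 and 58 denote extracted document Tb and extracted document Tc, extracted for supplier B and supplier C respectively. Reference numerals 60 and 70 denote supplier B and supplier C respectively.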
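[0023]
Now, when supplier B 60 and supplier C 70 access the information management center 50, the user information management unit 52 identifies the user attributes of each of them and outputs these to the definition information management unit 53. Using the user attributes of supplier B 60 and supplier C 70 and the pre-registered document information Da 54, the definition information management unit 53 extracts definition information Rb 55 and definition information Rc 56 from the pre-registered definition information.
Definition information Rb 55 and definition information Rc 56 are output to the subtree extraction processing unit 51 (the definition interpretation processing unit is omitted here for reasons of space), and extraction of subtree documents from document information Da 54 is executed.
The extracted document Tb 57 is shown in FIG. 14. In FIG. 14, only the information on memory, which supplier B 60 handles, has been extracted from the estimate-specifying document shown in FIG. 12. The extracted document Tb 57 is transmitted to supplier B 60.
The extracted document Tc 58 is shown in FIG. 15. In FIG. 15, only the information on disks, which supplier C 70 handles, has been extracted from the estimate-specifying document shown in FIG. 12. The extracted document Tc 58 is transmitted to supplier C 70.
As described above, with the document information extraction apparatus of the present invention, information specifying the contents of an estimate, with only the relevant part extracted for each supplier, can be sent from a single estimate-specifying document.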
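The supplier example can be illustrated with a deliberately reduced sketch. Everything concrete here is an assumption made for illustration: the tag names (`estimate`, `part`, `name`, `model`, `quantity`), the user keys, and the reduction of each supplier's definition information to a single required part name. The actual documents are those of FIGS. 12 and 13, which are not reproduced in the text.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical estimate document standing in for document information Da.
ESTIMATE_XML = (
    "<estimate>"
    "<part><name>memory</name><model>M-1</model><quantity>10</quantity></part>"
    "<part><name>disk</name><model>D-7</model><quantity>4</quantity></part>"
    "</estimate>"
)

# Hypothetical per-user definitions, reduced to the part name each supplier may see.
DEFINITIONS = {"supplier_b": "memory", "supplier_c": "disk"}


def extract_for(user, estimate_xml):
    """Return a copy of the estimate containing only the parts disclosed to this user."""
    wanted = DEFINITIONS[user]                 # select the definition by user
    source = ET.fromstring(estimate_xml)
    result = ET.Element(source.tag)
    for part in source.findall("part"):
        if part.findtext("name") == wanted:    # leaf value must match the definition
            result.append(copy.deepcopy(part))
    return result


for_b = extract_for("supplier_b", ESTIMATE_XML)
for_c = extract_for("supplier_c", ESTIMATE_XML)
names_b = [part.findtext("name") for part in for_b]
names_c = [part.findtext("name") for part in for_c]
```

The source document is never modified; each supplier receives its own pruned copy, which mirrors how one registered estimate serves several recipients.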
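[0024]
The functions of the document information extraction apparatus described in the above embodiment and of the information management center described in the example may also be realized by recording a program for realizing those functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium.
[0025]
Here, the "computer system" includes an OS and hardware such as peripheral devices and, where a WWW (World Wide Web) system is used, also includes the homepage providing environment (or display environment). The "computer-readable recording medium" means a portable medium such as a floppy disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into the computer system. The "computer-readable recording medium" further includes media that hold a program dynamically for a short time (a transmission medium or transmission wave), as when a program is sent over a computer network such as the Internet or a communication line such as a telephone line, and media that hold a program for a certain time, such as the volatile memory inside the computer systems acting as server and client in that case.
[0026]
The program may realize only part of the functions described above, or may be a so-called difference file (difference program) that realizes those functions in combination with a program already stored in the computer system.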
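[0027]
[Effects of the invention]
As described above, according to the present invention, the subtree extraction means included in a document information extraction apparatus that extracts, from the same tree-structured document information, all or part of the document information as subtree documents to be disclosed to each user according to that user's authority, is provided with: first tree structure information holding means, initialized to false, that records the evaluation results of the document information; leaf node processing means that compares the value at a designated terminal part of the tree structure of the definition information with the value at a designated terminal part of the tree structure of the document information and, when the two match, records true in the first tree structure information holding means as the evaluation result of the leaf node; branch node processing means that logically combines the evaluation results of the child nodes, including leaf nodes, located on the branches below it in the tree structure and records the result in the first tree structure information holding means; section processing means that creates in the first tree structure information holding means an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the document information, searches the document information for a node identical to a node constituting the tree structure of the definition information, activates the branch node processing means, and, after the branch node processing means has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true as the evaluation result of each subtree that contains only true values; and rule processing means that activates the section processing means and, after the section processing means has finished its evaluation, extracts as subtree documents the subtrees for which true is recorded in the evaluation result holding unit.
This makes it possible to compare the tree structure of the definition information, which describes conversion rules for converting document information into document information that can be disclosed to a user, with the tree structure of the document information, and to extract from the document information the subtree documents having the same structure and values as the definition information.
[0028]
In the above document information extraction apparatus, the present invention further comprises second tree structure information holding means, initialized to false, that records the evaluation results of the document information, and the rule processing means comprises: means for creating in the second tree structure information holding means an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the document information; means for logically combining and accumulating into the second tree structure information holding means the subtree evaluation results recorded in the first tree structure information holding means for each of the plurality of subtree documents that the definition information designates for extraction; and means for extracting as subtree documents the subtrees for which true is recorded in the second tree structure information holding means.
This makes it possible to accumulate and hold the evaluation results of the extraction target document for the plurality of subtree documents that the definition information designates for extraction, and to extract the subtree documents together at the end.
Accordingly, when arbitrary required document information is obtained from tree-structured document information described in a markup language such as XML, there is no need for dedicated programming for each document structure to be extracted, which was a problem with the conventional approach, and subtree documents can be extracted efficiently.
Likewise, there is no need to specify each time, according to the document structure to be extracted, the correspondence between a means for designating the target part of the tree structure and a program that executes an action such as deleting or keeping that part when it is found, which was another problem with the conventional approach, so subtree documents can again be extracted efficiently.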
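[Brief description of the drawings]
[FIG. 1] A block diagram illustrating the configuration of an embodiment of the present invention.
[FIG. 2] A schematic diagram showing an example of the extraction target document and extracted document described in the embodiment.
[FIG. 3] A schematic diagram illustrating the evaluation result holding unit for the extraction target document built in the tree structure information holding unit.
[FIG. 4] A schematic diagram illustrating an example of the recorded contents of the evaluation result holding unit.
[FIG. 5] A schematic diagram showing an example of the definition information described in the embodiment.
[FIG. 6] A schematic diagram showing the tree representation of the definition information of FIG. 5.
[FIG. 7] A flowchart illustrating the operation of the rule processing unit of the embodiment.
[FIG. 8] A flowchart illustrating the operation of the section processing unit of the embodiment.
[FIG. 9] A flowchart illustrating the operation of the branch node processing unit of the embodiment.
[FIG. 10] A flowchart illustrating the operation of the leaf node processing unit of the embodiment.
[FIG. 11] A schematic diagram illustrating an example of the embodiment.
[FIG. 12] A schematic diagram illustrating the extraction target document described in the example.
[FIG. 13] A schematic diagram illustrating the definition information described in the example.
[FIG. 14] A schematic diagram illustrating a subtree document extracted in the example.
[FIG. 15] A schematic diagram illustrating a subtree document extracted in the example.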
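[Description of reference numerals]
1 Subtree extraction processing unit
2 User information management unit
3 User information database
4 Definition information management unit
5 Definition information database
6 Definition interpretation processing unit
11 Rule processing unit
12 Tree structure information holding unit
13 Section processing unit
14 Tree structure information temporary holding unit
15 Branch node processing unit
16 Leaf node processing unit
50 Information management center
51 Subtree extraction processing unit
52 User information management unit
53 Definition information management unit
54 Document information Da
55 Definition information Rb
56 Definition information Rc
57 Extracted document Tb
58 Extracted document Tc
60 Supplier B
70 Supplier C
[0001]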
BACKGROUND OF THE INVENTION
The present invention relates to a document information extraction apparatus and method capable of extracting only specific subtree documents from document information having a tree structure, and to a recording medium on which a program therefor is recorded.
[0002]
[Prior art]
In recent years, it has become common to create and manage document information in XML (Extensible Markup Language), which evolved from HTML (HyperText Markup Language) and can represent document structure as a tree.
Document information in HTML or XML describes the structure of a document using marks called tags, which indicate the definition of the document. In HTML the kinds and meanings of the available tags are fixed in advance, whereas in XML users can define their own. Related companies therefore define their own tags so that data can be exchanged easily between them.
[0003]
[Problems to be solved by the invention]
However, conventionally, there are the following methods for acquiring arbitrarily required document information from document information having a tree structure described in XML, but each has the following problems.
First, in the method that handles the XML tree structure as a tree and extracts the required documents with a programming language, dedicated programming is required for each document structure to be extracted. Extraction is laborious and not general-purpose, and because many programs are required, the method is also inefficient in terms of equipment.
Second, in the method that combines a means for designating the target part of the tree structure with a program that executes an action, such as deleting or keeping that part, when it is found, the correspondence between the two must be specified each time according to the document structure to be extracted. This method is likewise not general-purpose and is inefficient.
[0004]
The present invention has been made in view of the above problems, and its object is to provide a document information extraction apparatus capable of extracting only specific subtree documents from document information having a tree structure. More specifically, its object is to provide a document information extraction apparatus and method, and a recording medium recording a program therefor, that compare the tree structure of definition information, which describes conversion rules for converting document information into document information that can be disclosed to a user, with the tree structure of the document information, and extract from the document information the subtree documents having the same structure and values as the definition information.
[0005]
[Means for Solving the Problems]
To solve the above problems, the present invention is a document information extraction apparatus that extracts, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the apparatus comprising: definition information management means for extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; definition interpretation means for interpreting the tree structure of the definition information extracted by the definition information management means; and subtree extraction means for extracting the subtree document by comparing the tree structure of the definition information interpreted by the definition interpretation means with the tree structure of the document information. The subtree extraction means comprises: first tree structure information holding means having an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; second tree structure information holding means having an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false; leaf node processing means which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true in the first tree structure information holding means as the evaluation result of the leaf node; branch node processing means which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing means which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing means, and, after the branch node processing means has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing means which activates the section processing means, accumulates the evaluation results of a plurality of runs of the section processing means in the second tree structure information holding means by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing means are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.
With this configuration, the tree structure of the definition information, which describes conversion rules for converting document information into document information that can be disclosed to a user, can be compared with the tree structure of the document information, and the subtree documents having the same structure and values as the definition information can be extracted from the document information. In addition, the evaluation results of the extraction target document for the plurality of subtree documents that the definition information designates for extraction can be accumulated and held, and the subtree documents can be extracted together at the end.
[0007]
In the above document information extraction apparatus, the definition information includes an identifier that designates an element or value to be extracted, and a constraint that designates the condition under which the element or value specified by the identifier is extracted.
[0008]
The present invention is characterized in that, in the document information extracting apparatus, the markup language is XML.
[0009]
The present invention is also a document information extraction method for extracting, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the method comprising: a process of extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; a process of interpreting the tree structure of the extracted definition information; and a process of extracting the subtree document by comparing the tree structure of the interpreted definition information with the tree structure of the document information. The process of extracting the subtree document comprises: leaf node processing which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true as the evaluation result of the leaf node in first tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; branch node processing which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing, and, after the branch node processing has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing which activates the section processing, accumulates the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false, by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.
[0010]
The present invention is also a recording medium on which is recorded a program used in a document information extraction method for extracting, from document information written in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the program comprising: a process of extracting, from a definition information database and in accordance with given user authority information and the document information, definition information in which conversion rules for converting the document information into document information that can be disclosed to the user are written in a markup language; a process of interpreting the tree structure of the extracted definition information; and a process of extracting the subtree document by comparing the tree structure of the interpreted definition information with the tree structure of the document information. As the process of extracting the subtree document, the program causes a computer to execute: leaf node processing which, taking the terminal nodes of a tree structure as leaf nodes, compares the value of a designated leaf node in the tree structure of the definition information with the value of a designated leaf node in the tree structure of the document information and, when the two match, records true as the evaluation result of the leaf node in first tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being initialized to false; branch node processing which, taking the nodes below the node being processed as child nodes, combines the evaluation results of same-named child nodes (leaf nodes included) by logical OR, combines the evaluation results of differently named child nodes (leaf nodes included) by logical AND, and records the combined result in the first tree structure information holding means as the evaluation result of the node being processed; section processing which searches the document information for a node having the same name as a node constituting the tree structure of the definition information, activates the branch node processing, and, after the branch node processing has finished its evaluation, evaluates the subtree documents located on the branches below that node and records true in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and rule processing which activates the section processing, accumulates the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area that holds an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being initialized to false, by taking the logical OR of the nodes at the same position in the second and first tree structure information holding means, and, after all evaluations by the section processing are complete, extracts as the subtree document the subtree consisting of the nodes for which true is recorded in the second tree structure information holding means.
[0011]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
First, an embodiment of the present invention will be described with reference to FIGS. 1 to 10.
FIG. 1 is a block diagram illustrating the configuration of the document information extraction apparatus according to the embodiment. In FIG. 1, reference numeral 1 denotes a subtree extraction processing unit that extracts subtree documents from an extraction target document using designated definition information. Reference numeral 2 denotes a user information management unit that provides a user's authority information based on the user's ID information and the like. Reference numeral 3 denotes a user information database that uniquely records users' ID information and the like together with their authority information. Reference numeral 4 denotes a definition information management unit that designates definition information based on the document information to be extracted and the user's authority information. Reference numeral 5 denotes a definition information database in which definition information is recorded in advance in correspondence with user authority information.
Reference numeral 6 denotes a definition interpretation processing unit that interprets the tree structure of the designated definition information.
[0012]
The subtree extraction processing unit 1 comprises a rule processing unit 11, a tree structure information holding unit 12 (second tree structure information holding means), a section processing unit 13, a tree structure information temporary holding unit 14 (first tree structure information holding means), a branch node processing unit 15, and a leaf node processing unit 16.
The rule processing unit 11 creates, in the tree structure information holding unit 12, an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the given document information. It also logically combines and accumulates into the tree structure information holding unit 12 the subtree evaluation results recorded in the tree structure information temporary holding unit 14 for each of the plurality of subtree documents that the definition information supplied by the definition interpretation processing unit 6 designates for extraction, and extracts as subtree documents the subtrees for which true is recorded in the tree structure information holding unit 12.
The tree structure information holding unit 12 is a recording unit composed of a structure recording unit that records a plurality of elements, and is a buffer that logically synthesizes and accumulates evaluation results of subtrees.
The section processing unit 13 creates, in the tree structure information temporary holding unit 14, an evaluation result holding unit in which a structure recording unit for recording a plurality of elements is configured in the same tree structure as that of the document information. In addition, the same node as that constituting the tree structure of the definition information is retrieved from the document information, and the branch node processing unit 15 is activated. After the evaluation of the branch node processing unit 15 is finished, the branch node processing unit 15 The subtree document located is evaluated, and the evaluation result of the subtree including only the true value is recorded as the true value.
The tree structure information temporary storage unit 14 is a recording unit including a structure recording unit that records a plurality of elements, and is a buffer that temporarily stores the evaluation result of the partial tree.
The branch node processing unit 15 logically synthesizes the evaluation result of the child node including the leaf node located in the branch below itself in the tree structure, and records it in the tree structure information temporary storage unit 14.
The leaf node processing unit 16 compares the value of the end part of the tree structure of the specified definition information with the value of the end part of the tree structure of the specified document information. The node evaluation result is recorded in the tree structure information temporary storage unit 14.
[0013]
Note that the tree structure information holding unit 12 and the tree structure information temporary holding unit 14 of the subtree extraction processing unit 1, the user information database 3, and the definition body information database 5 are each assumed to be composed of a recording medium that a computer can read and write: a non-volatile memory such as a hard disk device, a magneto-optical disk device, or a flash memory; a read-only recording medium such as a CD-ROM; a volatile memory such as a RAM (Random Access Memory); or a combination of these.
[0014]
In addition, the rule processing unit 11, the section processing unit 13, the branch node processing unit 15, and the leaf node processing unit 16 of the subtree extraction processing unit 1, as well as the user information management unit 2, the definition body information management unit 4, and the definition body interpretation processing unit 6, may each be realized by dedicated hardware. Alternatively, each of these units may be configured from a memory and a CPU (central processing unit), with its function realized by loading a program for that function into the memory and executing it.
[0015]
In addition, an input device, a display device, and the like (none of which are shown) are connected as peripheral devices to the document information extraction device of the present embodiment. Here, the input device refers to an input device such as a keyboard and a mouse. The display device refers to a CRT (Cathode Ray Tube) display device, a liquid crystal display device, or the like.
[0016]
Next, the operation of the embodiment of the present invention will be described with reference to FIGS.
First, the XML representation and the tree structure representation of the extraction target document and the definition body information will be described with reference to the drawings. FIG. 2 is a schematic diagram illustrating an example of an extraction target document and an extracted document described in the embodiment, and FIG. 3 is a schematic diagram explaining the evaluation result holding unit for the extraction target document that is configured in the tree structure information holding unit 12 or the tree structure information temporary holding unit 14.
In FIG. 2, <A> and <B> each represent one node, and tags such as <B> and <C> written between a pair of tags bearing the same symbol, such as <A> and </A>, represent branch nodes or leaf nodes located below that node.
In FIG. 3, the evaluation result holding unit for the extraction target document, configured in the tree structure information holding unit 12 or the tree structure information temporary holding unit 14, consists of structure recording units D01 to D18 arranged in a tree structure in correspondence with the nodes shown in FIG. 2.
FIG. 4A shows the contents recorded in the evaluation result holding unit that the rule processing unit 11 creates in the tree structure information holding unit 12; each structure recording unit holds a link to the extraction target document and the truth value of the evaluation result. FIG. 4B shows the contents recorded in the evaluation result holding unit that the section processing unit 13 creates in the tree structure information temporary holding unit 14; each structure recording unit holds a link to the extraction target document, a link to the tree structure information holding unit 12, and the truth value of the evaluation result.
Furthermore, FIG. 5 is a schematic diagram showing an example of the definition body information described in the embodiment. In the definition body information, each group of subtree documents to be extracted is represented as a section (<section>). The document information extraction apparatus of the present invention extracts, from the extraction target document, a subtree document having the same structure and values as the tags described between the <section> and </section> tags of each section. The definition body information is expressed by, for example, an identifier "*" indicating a part to be extracted and a conditional expression such as == "b1" written between <B> and </B>. FIG. 6 is a schematic diagram in which the definition body information is converted into a tree structure representation. In FIG. 6, R02 indicating "section x" and R11 indicating "section y", each a group of subtree documents to be extracted, are arranged under R01, which indicates the definition body information processed by the rule processing unit 11. Below each of these, R03 indicating node A, R04 indicating node B, and so on are arranged.
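The two kinds of structure recording unit described for FIG. 4 can be pictured as small record types. The following is a minimal Python sketch under stated assumptions: the names DocNode, ResultRecord, TempRecord, build_result, and build_temp are hypothetical and do not appear in the specification, and a "link" is modeled simply as an object reference.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocNode:
    """One node of the extraction target document (hypothetical in-memory form)."""
    name: str
    text: Optional[str] = None
    children: List["DocNode"] = field(default_factory=list)


@dataclass
class ResultRecord:
    """Structure recording unit in holding unit 12 (FIG. 4A):
    a link to the document node and the truth value of the evaluation result."""
    doc: DocNode
    value: bool = False
    children: List["ResultRecord"] = field(default_factory=list)


@dataclass
class TempRecord:
    """Structure recording unit in temporary holding unit 14 (FIG. 4B):
    additionally links to the matching record in holding unit 12."""
    doc: DocNode
    accumulated: ResultRecord
    value: bool = False
    children: List["TempRecord"] = field(default_factory=list)


def build_result(doc: DocNode) -> ResultRecord:
    """Mirror the document tree, with every truth value initialised to False."""
    return ResultRecord(doc, False, [build_result(c) for c in doc.children])


def build_temp(acc: ResultRecord) -> TempRecord:
    """Mirror holding unit 12, keeping a link back to each of its records."""
    return TempRecord(acc.doc, acc, False, [build_temp(c) for c in acc.children])


# Hypothetical document fragment: <A><B>b1</B><C>c1</C></A>
doc = DocNode("A", children=[DocNode("B", "b1"), DocNode("C", "c1")])
unit12 = build_result(doc)
unit14 = build_temp(unit12)
```

Because both mirrors share the shape of the document tree, a record in one unit can be matched with the record at the same position in the other, which is what the later accumulation step relies on.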
[0017]
Based on the above, the operation of the subtree extraction processing unit 1 will be described using the flowcharts of FIGS. 7 to 10. It is assumed that the definition body information determined from the given user information has already been supplied to the subtree extraction processing unit 1 in the tree structure representation described above, and that the document information has also been supplied in advance.
FIG. 7 is a flowchart for explaining the operation of the rule processing unit 11 of the subtree extraction processing unit 1. First, the rule processing unit 11 creates an evaluation result holding unit having the same tree structure as the extraction target document shown in FIG. 3 in the tree structure information holding unit 12 (step S1).
The truth value of the evaluation result holding unit of the created tree structure information holding unit 12 is initialized to “False” (step S2).
Next, the section processing unit 13 is activated to execute section processing (step S3).
When section processing for one section is complete, the truth values (“False” or “True”) of its evaluation results are logically combined with the contents of the tree structure information holding unit 12, so that the evaluation results of the sections are accumulated (step S4).
Once the evaluation result for one section has been stored, it is determined whether all sections have been evaluated (step S5).
If not all sections have been evaluated (NO in step S5), the process returns to step S3 and the next section is evaluated.
If all sections have been evaluated (YES in step S5), the process proceeds to step S6.
When all the sections have been evaluated, the subtree whose truth values in the tree structure information holding unit 12 are “True” is extracted from the extraction target document (step S6), and the process ends.
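The loop of steps S1 to S6 can be sketched as follows. This is an illustrative Python sketch, not the implementation of the embodiment: nodes are identified by arbitrary hashable keys, and the function name rule_processing and the callback evaluate_section (standing in for the section processing unit 13) are assumptions.

```python
from typing import Callable, Dict, Hashable, Iterable, List


def rule_processing(
    node_ids: List[Hashable],
    sections: Iterable[object],
    evaluate_section: Callable[[object], Dict[Hashable, bool]],
) -> List[Hashable]:
    """Accumulate per-section results by logical sum and return the True nodes."""
    # Steps S1-S2: one truth value per document node, initialised to False.
    accumulated = {node_id: False for node_id in node_ids}
    # Steps S3 and S5: run section processing once per section.
    for section in sections:
        temp = evaluate_section(section)
        # Step S4: OR the temporary results into the accumulated results.
        for node_id in node_ids:
            accumulated[node_id] = accumulated[node_id] or temp.get(node_id, False)
    # Step S6: the nodes left True make up the subtree document.
    return [node_id for node_id in node_ids if accumulated[node_id]]


# Stand-in section results for a four-node document (illustrative values only).
fake_results = {
    "section x": {"A": True, "B": True},
    "section y": {"A": True, "D": True},
}
extracted = rule_processing(
    ["A", "B", "C", "D"], fake_results, fake_results.__getitem__
)
print(extracted)
```

A node selected by any one section stays selected, since the accumulation only ever turns a value from False to True.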
[0018]
FIG. 8 is a flowchart for explaining the operation of the section processing unit 13 of the subtree extraction processing unit 1. The section processing unit 13, started from the rule processing unit 11, creates an evaluation result holding unit having the same tree structure as the extraction target document shown in FIG. 3 in the tree structure information temporary holding unit 14 (step S11).
The truth values of the evaluation result holding unit created in the tree structure information temporary holding unit 14 are initialized to “False” (step S12).
When the initialization of the tree structure information temporary holding unit 14 is complete, the extraction target document is searched for one node having the same name as the tag described at the highest level between <section> and </section> (step S13).
Next, it is determined whether a corresponding node has been found in the extraction target document (step S14). If no corresponding node has been found (NO in step S14), nothing further is done, and the recorded contents of the tree structure information temporary holding unit 14 are returned as they are to the rule processing unit 11 (step S17).
If a corresponding node is found in step S14 (YES in step S14), the branch node processing unit 15 is started to execute branch node processing (step S15).
When the branch node processing is complete, the subtree document constituting the section is evaluated. In this evaluation, the subtree document located in the branch below the node is evaluated, and the evaluation result of a subtree containing only true values (“True”) is recorded as “True” (step S16).
When the evaluation of the subtree document constituting the section is complete, the recorded contents of the tree structure information temporary holding unit 14 are returned to the rule processing unit 11 (step S17), and the process ends.
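Steps S11 to S17 can be sketched as below. This is a hedged Python sketch: the Node class, the helper walk, and the callback branch_processing are hypothetical, and step S16 is implemented under one possible reading of the text, namely that a node remains true only when every node on the path from the matched node down to it is also true.

```python
from typing import Callable, Dict, Iterator, List, Optional


class Node:
    """Minimal document node; the class itself is a hypothetical stand-in."""

    def __init__(self, name: str, children: Optional[List["Node"]] = None):
        self.name = name
        self.children = children or []


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def section_processing(
    doc_root: Node,
    top_name: str,
    branch_processing: Callable[[Node, Dict[int, bool]], bool],
) -> Dict[int, bool]:
    # Steps S11-S12: temporary holding unit, every truth value False.
    temp = {id(n): False for n in walk(doc_root)}
    # Step S13: search for a node named like the section's top-level tag.
    target = next((n for n in walk(doc_root) if n.name == top_name), None)
    if target is None:  # NO in step S14
        return temp  # step S17, nothing evaluated
    branch_processing(target, temp)  # step S15

    # Step S16 (assumed reading): keep True only along all-True paths.
    def keep_true_paths(node: Node, path_ok: bool) -> None:
        temp[id(node)] = path_ok and temp[id(node)]
        for child in node.children:
            keep_true_paths(child, temp[id(node)])

    keep_true_paths(target, True)
    return temp  # step S17


# Illustrative document A(B, C(D)) and a stub that marks C as False.
d = Node("D")
c = Node("C", [d])
b = Node("B")
a = Node("A", [b, c])


def stub_branch(node: Node, temp: Dict[int, bool]) -> bool:
    for n, value in ((a, True), (b, True), (c, False), (d, True)):
        temp[id(n)] = value
    return True


result = section_processing(a, "A", stub_branch)
```

In this reading, D is dropped even though it was individually evaluated as true, because its parent C failed.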
[0019]
FIG. 9 is a flowchart for explaining the operation of the branch node processing unit 15 of the subtree extraction processing unit 1. The branch node processing unit 15, started from the section processing unit 13, creates within itself truth value holding units for holding the truth values of the child nodes located below it (step S21). As many truth value holding units are created as there are types (names) of child nodes described in the corresponding branch node of the definition body information.
The truth values of the created truth value holding units are initialized to “False” (step S22).
Next, it is determined whether a child node located below is a branch node that has further nodes below it, or a leaf node that has no node below it and holds only a value (step S23).
If the child node is a leaf node (NO in step S23), the leaf node processing unit 16 is started to execute leaf node processing (step S32), and the process proceeds to step S25.
If the child node is a branch node having nodes below it (YES in step S23), branch node processing is executed again on that child node (step S24).
When step S24 or step S32 is finished, the truth value corresponding to the type (name) of the child node just processed is read from the truth value holding unit, the logical sum of that value and the result of the branch node processing or leaf node processing is taken, and the result is written back to the same truth value holding unit (step S25).
Next, it is determined whether all child nodes have been evaluated (step S26).
If not all child nodes have been evaluated (NO in step S26), the process returns to step S23 and the next child node is evaluated.
If all child nodes have been evaluated (YES in step S26), it is determined whether the truth value holding units for all types (names) of child nodes are “True” (step S27).
If it is determined in step S27 that the truth value holding units for all types (names) of child nodes are “True” (YES in step S27), “True” is set in the corresponding branch node of the tree structure information temporary holding unit 14 (step S28).
Then “True” is returned to the caller (step S29), and the process ends. If any of the truth value holding units of the child nodes is “False” in step S27 (NO in step S27), “False” is set in the corresponding branch node (step S30).
Then “False” is returned to the caller (step S31), and the process ends.
Let all the types (names) of child nodes described in the corresponding branch node of the definition body information, and the truth values of the evaluation results of the individual child nodes, be denoted as follows.
[Expression 1]

Then, when a child node of the above branch node is itself a branch node having nodes below it, the evaluation process is expressed by the following expression.
[Expression 2]

Here, the logical sum in the above expression corresponds to step S25, and the logical product corresponds to step S27.
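The combination of a logical sum over same-name child nodes and a logical product over the different names can be sketched recursively. The following Python sketch rests on assumptions: the Node class and the helper leaf_matches (a simplified stand-in for leaf node processing that compares text for equality) are hypothetical, and truth values are kept in a dictionary keyed by object identity rather than in a mirrored tree.

```python
from typing import Dict, List, Optional


class Node:
    """Minimal tree node used for both document and definition (hypothetical)."""

    def __init__(self, name: str, text: Optional[str] = None,
                 children: Optional[List["Node"]] = None):
        self.name = name
        self.text = text
        self.children = children or []


def leaf_matches(doc_leaf: Node, def_leaf: Node) -> bool:
    """Stand-in for leaf node processing: no text in the definition means no condition."""
    return def_leaf.text is None or doc_leaf.text == def_leaf.text


def branch_processing(doc_node: Node, def_node: Node, temp: Dict[int, bool]) -> bool:
    # Steps S21-S22: one truth value per child name in the definition, all False.
    slots = {d.name: False for d in def_node.children}
    for d in def_node.children:
        for c in doc_node.children:
            if c.name != d.name:
                continue
            if d.children:  # YES in step S23: recurse (step S24)
                result = branch_processing(c, d, temp)
            else:  # NO in step S23: leaf node processing (step S32)
                result = leaf_matches(c, d)
                temp[id(c)] = result
            # Step S25: logical sum over child nodes of the same name.
            slots[d.name] = slots[d.name] or result
    # Step S27: logical product over the different child names.
    verdict = all(slots.values())
    temp[id(doc_node)] = verdict  # step S28 or S30
    return verdict  # step S29 or S31


# Illustrative data: two <B> children, only one of which equals "b1".
doc = Node("A", children=[Node("B", "b1"), Node("B", "b2"), Node("C", "c1")])
definition = Node("A", children=[Node("B", "b1"), Node("C")])
temp: Dict[int, bool] = {}
verdict = branch_processing(doc, definition, temp)
print(verdict)
```

Here node A is true because at least one B matches and a C exists, while the non-matching B is individually recorded as false.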
[0020]
FIG. 10 is a flowchart illustrating the operation of the leaf node processing unit 16 of the subtree extraction processing unit 1. The leaf node processing unit 16, started from the branch node processing unit 15, first determines whether a conditional expression specifying a value is set at the position corresponding to the leaf node in the definition body information (step S41).
If no conditional expression specifying the value of the leaf node is set (NO in step S41), the condition is regarded as satisfied, the conditional expression determination of step S42 is skipped, and the process proceeds to step S43.
If a conditional expression specifying the value of the leaf node is set (YES in step S41), it is determined whether the value of the leaf node satisfies the conditional expression set in the definition body information (step S42).
If the value of the leaf node does not satisfy the conditional expression (NO in step S42), the truth value of the tree structure information temporary holding unit 14 corresponding to that leaf node is set to “False” (step S47).
Once the truth value of the tree structure information temporary holding unit 14 has been set, “False” is returned to the caller (step S48), and the process ends.
If the value of the leaf node satisfies the conditional expression (YES in step S42), the truth value of the tree structure information temporary holding unit 14 corresponding to that leaf node is set to “True” (step S46).
Once the truth value of the tree structure information temporary holding unit 14 has been set, “True” is returned to the caller (step S45), and the process ends.
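The leaf node decision can be sketched as a single function. In this Python sketch the condition syntax is an assumption: only the equality form == "literal" mentioned for FIG. 5 is handled, and the names leaf_processing and _EQUALS are hypothetical. The absence of a conditional expression is represented by None.

```python
import re
from typing import Dict, Hashable, Optional

# Assumed condition syntax: == "literal" (only equality is sketched here).
_EQUALS = re.compile(r'^\s*==\s*"(.*)"\s*$')


def leaf_processing(
    doc_value: str,
    condition: Optional[str],
    temp: Dict[Hashable, bool],
    key: Hashable,
) -> bool:
    if condition is None:  # NO in step S41: no conditional expression set
        satisfied = True
    else:  # YES in step S41, then the check of step S42
        match = _EQUALS.match(condition)
        if match is None:
            raise ValueError(f"unsupported condition: {condition!r}")
        satisfied = doc_value == match.group(1)
    temp[key] = satisfied  # record the truth value in temporary holding unit 14
    return satisfied  # return it to the calling branch node processing


temp_unit: Dict[Hashable, bool] = {}
print(leaf_processing("b1", '== "b1"', temp_unit, "B"))
```

Raising an error on an unrecognised condition is a design choice of this sketch; the specification does not say how such input is handled.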
[0021]
As described above, the rule processing unit 11 starts the section processing unit 13, the section processing unit 13 starts the branch node processing unit 15, and the branch node processing unit 15 starts either itself recursively or the leaf node processing unit 16. Each unit evaluates the designated subtree and returns the evaluation result to its caller, so that each time the section processing unit finishes one run, the evaluation result for one group (section) of subtree documents to be extracted from the extraction target document is obtained. In the rule processing, evaluation of the extraction target document is repeated for the number of sections specified in the definition body information, and when all evaluations are complete, the subtree document to be extracted is determined by the logical sum of the evaluation results of the sections.
[0022]
Next, an example using the document information extraction apparatus described in the above embodiment will be described with reference to the drawings. This example describes a case in which the document information extraction apparatus is used to distribute information when quotation documents are exchanged between a personal computer manufacturer and its parts suppliers.
FIG. 11 is a schematic diagram for explaining this example. In FIG. 11, reference numeral 50 denotes an information management center that uses the document information extraction apparatus.
Reference numeral 51 denotes the subtree extraction processing unit of the document information extraction apparatus. Reference numeral 52 denotes its user information management unit. Reference numeral 53 denotes its definition body information management unit.
Reference numeral 54 denotes document information Da. The document information Da is a document that specifies the contents of a quotation. As shown in FIG. 12, it describes, as tags, the information needed for a quotation, such as part name, model name, quantity, and price, for each component of a personal computer such as memory or disk. The document information Da is registered in advance in the information management center 50 by the personal computer manufacturer.
Reference numerals 55 and 56 denote definition body information Rb and definition body information Rc, set respectively for quotation supplier B and quotation supplier C, which are personal computer parts suppliers. The definition body information Rb and the definition body information Rc are also registered in advance in the information management center 50 by the personal computer manufacturer. FIG. 13(a) shows the definition body information for quotation supplier B, and FIG. 13(b) shows the definition body information for quotation supplier C. In FIG. 13(a), since quotation supplier B is a parts supplier that handles memory, the tags are configured to extract only information related to memory. Similarly, in FIG. 13(b), since quotation supplier C is a parts supplier that handles disks, the tags are configured to extract only information related to disks.
Further, reference numerals 57 and 58 denote an extracted document Tb and an extracted document Tc, extracted for quotation supplier B and quotation supplier C, respectively.
Reference numerals 60 and 70 denote quotation supplier B and quotation supplier C, respectively.
[0023]
Now, when quotation supplier B60 and quotation supplier C70 access the information management center 50, the user information management unit 52 identifies the user attributes of quotation supplier B60 and quotation supplier C70 and outputs them to the definition body information management unit 53. Using the user attributes of quotation supplier B60 and quotation supplier C70 and the document information Da54 registered in advance, the definition body information management unit 53 extracts the definition body information Rb55 and the definition body information Rc56 from the definition body information registered in advance.
The definition body information Rb55 and the definition body information Rc56 are output to the subtree extraction processing unit 51 (the definition body interpretation processing unit is omitted here for reasons of space), and extraction of the subtree documents of the document information Da54 is executed.
The resulting extracted document Tb57 is shown in FIG. 14. In FIG. 14, only the information related to memory, which quotation supplier B60 handles, has been extracted from the document specifying the contents of the quotation shown in FIG. 12. This extracted document Tb57 is transmitted to quotation supplier B60.
The resulting extracted document Tc58 is shown in FIG. 15. In FIG. 15, only the information related to disks, which quotation supplier C70 handles, has been extracted from the document specifying the contents of the quotation shown in FIG. 12. This extracted document Tc58 is transmitted to quotation supplier C70.
As described above, using the document information extraction apparatus of the present invention makes it possible to send to each quotation supplier quotation information in which only the specific part relevant to that supplier has been extracted from the document specifying the contents of the quotation.
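The flow of this example can be imitated end to end with Python's standard xml.etree.ElementTree module. Everything specific in the sketch is an assumption made for illustration: the tag names (estimate, part, name, model, quantity, rule), the sample values, and the helper functions are hypothetical stand-ins for FIGS. 12 and 13, the only condition handled is == "literal", and the sketch assumes that the top-level tag of each section is the root of the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for FIG. 12 and FIG. 13; tag names are assumptions.
DA = (
    "<estimate>"
    "<part><name>memory</name><model>M-1</model><quantity>10</quantity></part>"
    "<part><name>disk</name><model>D-1</model><quantity>5</quantity></part>"
    "</estimate>"
)
RULE = (
    "<rule><section><estimate><part>"
    '<name>== "{item}"</name><model>*</model><quantity>*</quantity>'
    "</part></estimate></section></rule>"
)


def leaf_ok(doc_el, def_el):
    cond = (def_el.text or "").strip()
    if cond.startswith("=="):
        return (doc_el.text or "").strip() == cond[2:].strip().strip('"')
    return True  # "*" or empty: extract without a condition


def branch(doc_el, def_el, temp):
    slots = {d.tag: False for d in def_el}
    for d in def_el:
        for c in doc_el:
            if c.tag != d.tag:
                continue
            if len(d):
                result = branch(c, d, temp)
            else:
                result = temp[c] = leaf_ok(c, d)
            slots[d.tag] = slots[d.tag] or result  # OR within one name
    temp[doc_el] = all(slots.values())  # AND across names
    return temp[doc_el]


def prune(el, path_ok, temp):
    # Assumed reading of step S16: keep True only along all-True paths.
    temp[el] = path_ok and temp[el]
    for child in el:
        prune(child, temp[el], temp)


def copy_true(el, acc):
    if not acc[el]:
        return None
    out = ET.Element(el.tag, dict(el.attrib))
    if len(el) == 0:
        out.text = el.text
    for child in el:
        kept = copy_true(child, acc)
        if kept is not None:
            out.append(kept)
    return out


def extract(doc_root, rule_root):
    nodes = list(doc_root.iter())
    acc = {el: False for el in nodes}  # holding unit 12
    for section in rule_root.findall("section"):
        temp = {el: False for el in nodes}  # temporary holding unit 14
        top = section[0]
        target = next((el for el in nodes if el.tag == top.tag), None)
        if target is not None:
            branch(target, top, temp)
            prune(target, True, temp)
        for el in nodes:
            acc[el] = acc[el] or temp[el]
    return copy_true(doc_root, acc)


da = ET.fromstring(DA)
tb = extract(da, ET.fromstring(RULE.format(item="memory")))  # for supplier B
tc = extract(da, ET.fromstring(RULE.format(item="disk")))    # for supplier C
print(ET.tostring(tb, encoding="unicode"))
print(ET.tostring(tc, encoding="unicode"))
```

The same registered document yields a memory-only result for one definition and a disk-only result for the other, without any per-supplier extraction code.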
[0024]
In addition, the functions of the document information extraction apparatus described in the above embodiment and of the information management center described in the example may be realized by recording a program for realizing those functions on a computer-readable recording medium and causing a computer system to read and execute the program recorded on that medium.
[0025]
Here, the “computer system” includes an OS and hardware such as peripheral devices, and, if a WWW (World Wide Web) system is used, also includes a homepage providing environment (or display environment). The “computer-readable recording medium” refers to a portable medium such as a floppy disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into the computer system. Furthermore, the “computer-readable recording medium” also includes a medium that holds a program dynamically for a short time (a transmission medium or transmission wave), as when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as the volatile memory inside a computer system serving as a server or client in that case.
[0026]
Further, the program may realize only part of the functions described above, or may be a so-called difference file (difference program) that realizes the functions described above in combination with a program already recorded in the computer system.
[0027]
[Effects of the Invention]
As described above, according to the present invention, in a document information extraction apparatus that extracts, from the same document information having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the subtree extraction means of the apparatus is provided with: first tree structure information holding means, initialized to false values, for recording the evaluation results of the document information; leaf node processing means that compares the value of a designated terminal part of the tree structure of the definition body information with the value of the designated terminal part of the tree structure of the document information and, when the two match, records a true value in the first tree structure information holding means as the evaluation result of the leaf node; branch node processing means that logically combines the evaluation results of the child nodes, including leaf nodes, located in the branches below itself in the tree structure and records the result in the first tree structure information holding means; section processing means that creates, in the first tree structure information holding means, an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the document information, searches the document information for a node identical to a node constituting the tree structure of the definition body information, starts the branch node processing means, evaluates, after the branch node processing means has finished its evaluation, the subtree document located in the branch below that node, and records the evaluation result of a subtree containing only true values as a true value; and rule processing means that starts the section processing means and, after the section processing means has finished its evaluation, extracts as a subtree document the subtree whose evaluation results in the evaluation result holding unit are true.
As a result, the tree structure of the definition body information, which describes conversion rules for converting document information into document information that can be disclosed to the user, can be compared with the tree structure of the document information, and a subtree document having the same structure and values as the definition body information can be extracted from the document information.
[0028]
The present invention further provides the document information extraction apparatus with second tree structure information holding means, initialized to false values, for recording the evaluation results of the document information, and the rule processing means includes: means for creating, in the second tree structure information holding means, an evaluation result holding unit in which structure recording units, each recording a plurality of elements, are arranged in the same tree structure as the document information; means for logically combining the evaluation results of the subtree documents recorded in the first tree structure information holding means, corresponding to the plurality of subtree documents to be extracted that are specified in the definition body information, and accumulating them in the second tree structure information holding means; and means for extracting, as a subtree document, the subtree whose evaluation results in the second tree structure information holding means are true.
As a result, the evaluation results for the extraction target document that correspond to the plurality of subtree documents to be extracted, as specified in the definition body information, can be accumulated and held, and the subtree documents can finally be extracted in a single batch.
Therefore, when arbitrarily required document information is obtained from document information having a tree structure described in a markup language such as XML, there is no need to write a dedicated program for each document structure to be extracted, which was a problem in the prior art, and subtree documents can be extracted efficiently.
In addition, in contrast to the conventional method that combines a means for designating a target part in the tree structure with a program that executes an action, such as deleting or not deleting, when the target part is found, there is no need to specify that correspondence each time according to the document structure to be extracted, and subtree documents can be extracted efficiently.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating an example of an extraction target document and an extracted document described in the embodiment.
FIG. 3 is a schematic diagram illustrating an evaluation result holding unit of an extraction target document configured in a tree structure information holding unit.
FIG. 4 is a schematic diagram illustrating an example of recorded contents of an evaluation result holding unit.
FIG. 5 is a schematic diagram showing an example of definition body information described in the embodiment.
FIG. 6 is a schematic diagram showing a tree structure representation of the definition body information shown in FIG. 5.
FIG. 7 is a flowchart for explaining the operation of the rule processing unit according to the embodiment.
FIG. 8 is a flowchart for explaining the operation of the section processing unit according to the embodiment.
FIG. 9 is a flowchart for explaining the operation of the branch node processing unit according to the embodiment.
FIG. 10 is a flowchart for explaining the operation of the leaf node processing unit according to the embodiment.
FIG. 11 is a schematic diagram for explaining an example of the embodiment.
FIG. 12 is a schematic diagram illustrating an extraction target document described in the example.
FIG. 13 is a schematic diagram illustrating definition body information described in the example.
FIG. 14 is a schematic diagram illustrating a subtree document extracted in the example.
FIG. 15 is a schematic diagram illustrating a subtree document extracted in the example.
[Explanation of symbols]
1 Subtree extraction processing unit
2 User information management unit
3 User information database
4 Definition body information management unit
5 Definition body information database
6 Definition body interpretation processing unit
11 Rule processing unit
12 Tree structure information holding unit
13 Section processing unit
14 Tree structure information temporary holding unit
15 Branch node processing unit
16 Leaf node processing unit
50 Information management center
51 Subtree extraction processing unit
52 User information management unit
53 Definition body information management unit
54 Document information Da
55 Definition body information Rb
56 Definition body information Rc
57 Extracted document Tb
58 Extracted document Tc
60 Quotation supplier B
70 Quotation supplier C

Claims

A document information extraction apparatus that extracts, from document information described in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the apparatus comprising:
definition body information management means for extracting, from a definition body information database and according to given user authority information and the document information, definition body information in which a conversion rule for converting the document information into document information that can be disclosed to the user is described in a markup language;
definition body interpretation means for interpreting the tree structure of the definition body information extracted by the definition body information management means; and
subtree extraction means for extracting the subtree document by comparing the tree structure of the definition body information interpreted by the definition body interpretation means with the tree structure of the document information,
wherein the subtree extraction means includes:
first tree structure information holding means having an area for holding an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being set to a false value as its initial value;
second tree structure information holding means having an area for holding an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being set to a false value as its initial value;
leaf node processing means for, taking a terminal node of a tree structure as a leaf node, comparing the value of a specified leaf node of the tree structure of the definition body information with the value of a specified leaf node of the tree structure of the document information and, when the two match, recording a true value in the first tree structure information holding means as the evaluation result of the leaf node;
branch node processing means for, taking nodes below a node to be processed as child nodes, combining by logical OR the evaluation results of child nodes having the same name, including leaf nodes, combining by logical AND the evaluation results of child nodes having different names, including leaf nodes, and recording the combined result in the first tree structure information holding means as the evaluation result of the node to be processed;
section processing means for searching the document information for a node having the same name as a node constituting the tree structure of the definition body information, activating the branch node processing means, and, after the branch node processing means has completed its evaluation, evaluating the subtree documents located on the branches below that node and recording a true value in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and
rule processing means for activating the section processing means, accumulating the evaluation results of a plurality of runs of the section processing means in the second tree structure information holding means by taking the logical OR of the nodes located at the same position in the second tree structure information holding means and the first tree structure information holding means, and, after all evaluations by the section processing means have been completed, extracting as the subtree document a subtree consisting of the nodes for which a true value is recorded in the evaluation results of the second tree structure information holding means.
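
The combination rule applied by the branch node processing means can be illustrated with a minimal Python sketch. It assumes the child evaluation results are already available as (name, result) pairs; the function name combine_children and the sample element names are hypothetical and do not appear in the specification. Results of children sharing a name are combined by logical OR, and the per-name results are then combined by logical AND.

```python
from collections import defaultdict


def combine_children(child_results):
    """Combine child evaluation results for one branch node.

    child_results: iterable of (child_name, bool) pairs.
    Same-named children are ORed; the per-name results are ANDed.
    Assumption: a node with no children evaluates to True here,
    because all() of an empty sequence is True.
    """
    groups = defaultdict(list)
    for name, result in child_results:
        groups[name].append(result)
    return all(any(results) for results in groups.values())


# Hypothetical children of one branch node: two <item> nodes and one <owner>.
matched = combine_children([("item", False), ("item", True), ("owner", True)])
unmatched = combine_children([("item", True), ("owner", False)])
```

Under this reading, one matching child is enough among siblings of the same name, while every distinct child name must contribute at least one match.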

The document information extraction apparatus according to claim 1, wherein the definition body information comprises:
an identifier that specifies the element or value to be extracted; and
a constraint condition that specifies the condition under which the element or value specified by the identifier is to be extracted.
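
As an illustration only, an identifier paired with a constraint condition might be modelled as follows. The class DefinitionEntry, the function select, and the sample element names are hypothetical, and the constraint is assumed to be a simple equality test on the element's text; the claim does not limit the constraint to that form.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class DefinitionEntry:
    identifier: str  # name of the element to be extracted
    constraint: str  # assumed form: required text value of that element


def select(root, entry):
    """Return the elements named entry.identifier whose text meets the constraint."""
    return [
        element
        for element in root.iter(entry.identifier)
        if (element.text or "").strip() == entry.constraint
    ]


# Hypothetical document with two <dept> elements.
document = ET.fromstring("<staff><dept>sales</dept><dept>hr</dept></staff>")
hits = select(document, DefinitionEntry(identifier="dept", constraint="sales"))
```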

The document information extraction apparatus according to claim 2, wherein the markup language is XML (Extensible Markup Language).
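
For an XML document, the false-initialized evaluation area and the leaf node comparison could be sketched as below. This is a sketch under stated assumptions, not the patented implementation: each node position is represented as a tuple of child indices, and the function names init_results and process_leaves, as well as the sample document, are hypothetical.

```python
import xml.etree.ElementTree as ET


def init_results(root):
    """Map every node position (tuple of child indices) to False."""
    results = {}

    def walk(node, path):
        results[path] = False
        for index, child in enumerate(node):
            walk(child, path + (index,))

    walk(root, ())
    return results


def process_leaves(root, leaf_name, expected, results):
    """Record True for each leaf named leaf_name whose text equals expected."""

    def walk(node, path):
        if len(node) == 0:  # a terminal node is treated as a leaf
            if node.tag == leaf_name and (node.text or "").strip() == expected:
                results[path] = True
        for index, child in enumerate(node):
            walk(child, path + (index,))

    walk(root, ())


# Hypothetical document: only the first <dept> leaf matches "sales".
document = ET.fromstring(
    "<staff><person><dept>sales</dept></person>"
    "<person><dept>hr</dept></person></staff>"
)
first_tree = init_results(document)
process_leaves(document, "dept", "sales", first_tree)
```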

A document information extraction method for extracting, from document information described in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority, the method comprising:
a process of extracting, from a definition body information database and according to given user authority information and the document information, definition body information in which a conversion rule for converting the document information into document information that can be disclosed to the user is described in a markup language;
a process of interpreting the tree structure of the extracted definition body information; and
a process of extracting the subtree document by comparing the tree structure of the interpreted definition body information with the tree structure of the document information,
wherein the process of extracting the subtree document includes:
leaf node processing of, taking a terminal node of a tree structure as a leaf node, comparing the value of a specified leaf node of the tree structure of the definition body information with the value of a specified leaf node of the tree structure of the document information and, when the two match, recording a true value as the evaluation result of the leaf node in first tree structure information holding means, which has an area for holding an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being set to a false value as its initial value;
branch node processing of, taking nodes below a node to be processed as child nodes, combining by logical OR the evaluation results of child nodes having the same name, including leaf nodes, combining by logical AND the evaluation results of child nodes having different names, including leaf nodes, and recording the combined result in the first tree structure information holding means as the evaluation result of the node to be processed;
section processing of searching the document information for a node having the same name as a node constituting the tree structure of the definition body information, starting the branch node processing, and, after the branch node processing has completed its evaluation, evaluating the subtree documents located on the branches below that node and recording a true value in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and
rule processing of starting the section processing, accumulating the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area for holding an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being set to a false value as its initial value, by taking the logical OR of the nodes located at the same position in the second tree structure information holding means and the first tree structure information holding means, and, after all evaluations by the section processing have been completed, extracting as the subtree document a subtree consisting of the nodes for which a true value is recorded in the evaluation results of the second tree structure information holding means.
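
The accumulation and extraction performed by the rule processing can be sketched in Python as follows. The sketch assumes that both holding areas are dictionaries keyed by node position (a tuple of child indices) and that the results of each section processing run are already available; the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def all_paths(node, path=()):
    """Yield the position of every node as a tuple of child indices."""
    yield path
    for index, child in enumerate(node):
        yield from all_paths(child, path + (index,))


def accumulate(second, first):
    """OR the first-tree results into the second tree, position by position."""
    for path in second:
        second[path] = second[path] or first.get(path, False)


def extract(node, second, path=()):
    """Copy the subtree made of nodes marked True; return None if node is False."""
    if not second[path]:
        return None
    copy = ET.Element(node.tag, dict(node.attrib))
    copy.text = node.text
    for index, child in enumerate(node):
        kept = extract(child, second, path + (index,))
        if kept is not None:
            copy.append(kept)
    return copy


# Hypothetical document and two section-processing results.
document = ET.fromstring("<a><b>1</b><c>2</c><d>3</d></a>")
second_tree = {path: False for path in all_paths(document)}
accumulate(second_tree, {(): True, (0,): True})  # first run marks <b>
accumulate(second_tree, {(): True, (2,): True})  # second run marks <d>
subtree = extract(document, second_tree)
```

Because the merge is a logical OR, a node marked true by any single run stays true in the accumulated area, so the extracted subtree document is the union of the nodes selected by all runs.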

A computer-readable recording medium recording a program used in a document information extraction method for extracting, from document information described in a markup language having a tree structure, all or part of the document information as a subtree document to be disclosed to each user according to that user's authority,
the program causing a computer to execute:
a process of extracting, from a definition body information database and according to given user authority information and the document information, definition body information in which a conversion rule for converting the document information into document information that can be disclosed to the user is described in a markup language;
a process of interpreting the tree structure of the extracted definition body information; and
a process of extracting the subtree document by comparing the tree structure of the interpreted definition body information with the tree structure of the document information,
wherein the process of extracting the subtree document includes:
leaf node processing of, taking a terminal node of a tree structure as a leaf node, comparing the value of a specified leaf node of the tree structure of the definition body information with the value of a specified leaf node of the tree structure of the document information and, when the two match, recording a true value as the evaluation result of the leaf node in first tree structure information holding means, which has an area for holding an evaluation result for each node in correspondence with the tree structure of the document information, each evaluation result being set to a false value as its initial value;
branch node processing of, taking nodes below a node to be processed as child nodes, combining by logical OR the evaluation results of child nodes having the same name, including leaf nodes, combining by logical AND the evaluation results of child nodes having different names, including leaf nodes, and recording the combined result in the first tree structure information holding means as the evaluation result of the node to be processed;
section processing of searching the document information for a node having the same name as a node constituting the tree structure of the definition body information, starting the branch node processing, and, after the branch node processing has completed its evaluation, evaluating the subtree documents located on the branches below that node and recording a true value in the first tree structure information holding means as the evaluation result of each subtree that contains only true values; and
rule processing of starting the section processing, accumulating the evaluation results of a plurality of runs of the section processing in second tree structure information holding means, which has an area for holding an evaluation result for each node in correspondence with the tree structures of both the document information and the first tree structure information holding means, each evaluation result being set to a false value as its initial value, by taking the logical OR of the nodes located at the same position in the second tree structure information holding means and the first tree structure information holding means, and, after all evaluations by the section processing have been completed, extracting as the subtree document a subtree consisting of the nodes for which a true value is recorded in the evaluation results of the second tree structure information holding means.