JP4397221B2

JP4397221B2 - Link setting device and information using information extracted from text sentence

Info

Publication number: JP4397221B2
Application number: JP2003397196A
Authority: JP
Inventors: 徹持田; 高広三浦
Original assignee: 株式会社日立システムアンドサービス
Priority date: 2003-11-27
Filing date: 2003-11-27
Publication date: 2010-01-13
Anticipated expiration: 2023-11-27
Also published as: JP2005157853A

Description

The present invention relates to an apparatus and a method for performing efficient processing in an information processing apparatus by using abstract model data generated from numerical information data in a document and from the dependency information of that data.

In electronic documents, a function that associates related descriptions within a document with one another (the so-called link function) is commonly used so that a reader can easily understand the content. For example, even if a reader encounters a word whose meaning is unclear, the reader can easily consult its explanation by clicking with the mouse, provided that a link to the explanatory text has been set for that word.

Because setting such links by hand is tedious, techniques exist in which a computer sets links by performing matching processing. In the prior art, links are set mechanically by matching keywords, extracted through morphological analysis of the text sentences in an electronic document, against index character strings (see, for example, Patent Document 1 and Patent Document 2).

Japanese Patent Laid-Open No. 3-95673

Japanese Patent Laid-Open No. 7-325827

However, none of these conventional link setting devices was intended to set hyperlinks on numerical information data (for example, 100 or 百, "one hundred"). A numeral has no specific meaning by itself and is therefore unsuited to link setting by matching. The same problem, namely that matching text sentences alone cannot extract only the desired numerical information data, also exists in keyword search.

To solve the above problem, an object of the present invention is to generate an abstracted model from numerical information data and its dependency information, and to provide the model to an information processing apparatus as useful information.

(1, 2, 14) The link setting device of the present invention is a link setting device that sets links between target data containing related sentence element data in document data, characterized by comprising:
extraction and analysis means that divides text sentence data in given document data into a plurality of sentence element data, determines the type of each sentence element and records it in a recording unit, determines whether the given text sentence data contains numerical information data, and, if so, identifies the numerical information data and records it in the recording unit;
dependency information extraction means that refers to the recording unit to identify the sentence element data containing the numerical information data, determines, based on the type of that sentence element data, at least one sentence element data in a dependency relation with the numerical information data, and extracts dependency information data from the determined sentence element data;
abstract model recording means that records, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating its position in the document data;
abstract model selection means that selects, from the plurality of abstract model data recorded in the recording unit, abstract model data having the same numerical information data and the same dependency information data; and
link setting means that sets links between the target data of the abstract model data based on the position information data of the selected abstract model data.

As a result, links can be set accurately even for numerical information data, for which links cannot be set by simple matching alone.
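The selection and link-setting steps of (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `AbstractModel` class, its field names, the integer `position`, and the `set_links` function are all hypothetical names introduced here, and the sample records are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AbstractModel:
    """Hypothetical record: one numerical value plus its dependency information."""
    numeric: str      # numerical information data, e.g. "32776百万円"
    dependency: str   # dependency information data, e.g. "営業利益"
    position: int     # assumed here to be a character offset in the document


def set_links(models):
    """Return (position, position) pairs for models that share both the
    numerical information and the dependency information."""
    groups = defaultdict(list)
    for model in models:
        groups[(model.numeric, model.dependency)].append(model.position)
    links = []
    for positions in groups.values():
        # Every pair of occurrences of the same abstract model is linked.
        links.extend(combinations(sorted(positions), 2))
    return links


models = [
    AbstractModel("32776百万円", "営業利益", 10),
    AbstractModel("32776百万円", "営業利益", 250),
    AbstractModel("32776百万円", "売上高", 400),  # same number, different dependency
]
links = set_links(models)
```

In this sketch the third record is not linked, because an equal number alone is not treated as a match; the dependency information must also agree.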

(3) The link setting device of the present invention is characterized in that, when the link-destination target data is a text sentence, the link setting means sets the link so that only additional information is extracted from the link-destination text sentence.

As a result, peripheral information (additional information) can be displayed effectively by omitting information that is self-evident at the link destination.

(4, 5, 15) The document search device of the present invention is a document search device that searches document data based on target data serving as a search element, characterized by comprising:
extraction and analysis means that divides text sentence data in given document data into a plurality of sentence element data, determines the type of each sentence element and records it in a recording unit, determines whether the given text sentence data contains numerical information data, and, if so, identifies the numerical information data and records it in the recording unit;
dependency information extraction means that refers to the recording unit to identify the sentence element data containing the numerical information data, determines, based on the type of that sentence element data, at least one sentence element data in a dependency relation with the numerical information data, and extracts dependency information data from the determined sentence element data;
abstract model holding means that holds, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating its position in the document data;
abstract model comparison means that searches abstract model recording means, in which abstract model data generated from document data is recorded, and compares the result with the abstract model data received from the abstract model holding means, thereby extracting only the position information data of abstract model data having the same numerical information data and the same dependency information data; and
search result display means that, based on the position information data of the extracted abstract model data, displays the target data having the same abstract model data as a search result.

As a result, the desired search processing can be performed even on numerical information data, which cannot be searched by a simple matching search alone.

(6) The document search device of the present invention is characterized in that abstract model data for the entire document data to be searched is generated and recorded in the abstract model recording means in advance, before the search.

As a result, document data can be searched efficiently.

(7, 8, 16) The document input verification device of the present invention is a document input verification device that verifies document data input from a document editing device based on target data of a verification element, characterized by comprising:
extraction and analysis means that divides text sentence data in the document data given from the document editing device into a plurality of sentence element data, determines the type of each sentence element and records it in a recording unit, determines whether the given text sentence data contains numerical information data, and, if so, identifies the numerical information data and records it in the recording unit;
dependency information extraction means that refers to the recording unit to identify the sentence element data containing the numerical information data, determines, based on the type of that sentence element data, at least one sentence element data in a dependency relation with the numerical information data, and extracts dependency information data from the determined sentence element data;
abstract model holding means that holds, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating its position in the document data;
abstract model determination means that searches abstract model recording means, in which abstract model data generated from document data is recorded, and determines whether there is abstract model data whose dependency information is the same as that of the abstract model data received from the abstract model holding means but whose numerical information data differs; and
input error output means that outputs input error information to the document editing device based on the determination result obtained from the abstract model determination means.

This makes it easy to check numerical information data, which is prone to input errors.
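The determination in (7) amounts to looking for recorded abstract models that have the same dependency information but a different number. A minimal sketch, assuming a hypothetical `AbstractModel` record and a hypothetical `find_conflicts` function (neither name comes from the specification; the sample values are invented):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractModel:
    """Hypothetical record: one numerical value plus its dependency information."""
    numeric: str      # numerical information data
    dependency: str   # dependency information data
    position: int     # assumed here to be a character offset in the document


def find_conflicts(entered, recorded):
    """Return recorded models with the same dependency information as the
    newly entered model but a different numerical value."""
    return [
        model for model in recorded
        if model.dependency == entered.dependency
        and model.numeric != entered.numeric
    ]


recorded = [
    AbstractModel("32776百万円", "営業利益", 10),
    AbstractModel("5000百万円", "売上高", 40),
]
# The author mistypes the operating-profit figure later in the document.
typo = AbstractModel("32767百万円", "営業利益", 300)
conflicts = find_conflicts(typo, recorded)
if conflicts:
    print(f"possible input error at {typo.position}: "
          f"{typo.numeric} vs {conflicts[0].numeric}")
```

The comparison here is an exact string match, so a real checker would probably need to normalize full-width and half-width digits first; that step is left out of the sketch.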

(9) The document search device of the present invention is characterized in that the abstract model data in the abstract model recording means is generated as needed based on input from verification element input means and is kept constantly up to date.

As a result, document input can be verified efficiently.

(10) Each device of the present invention is characterized in that the document data contains processing target elements other than text sentence data, and in that the device further comprises:
expression format determination means that determines the expression format of each processing target element before abstract model data is extracted;
abstract model generation means that, when the expression format of the target data is a text sentence, extracts from the processing target element a sentence element in a dependency relation with the numerical information data as abstract model data, together with its position information data; and
abstract model generation means that, when the expression format of the target data is another element, generates abstract model data based on a predetermined rule and supplies it, together with position information data, to the abstract model recording means.

As a result, abstract model data can be generated from elements other than text sentences.

(11) Each device of the present invention is characterized in that the other target data is table data or image data.

As a result, abstract model data can be generated from tables and images.

(12) Each device of the present invention is characterized in that the document data is described in XML format and each tag contained in the file content is given an expression-format attribute in advance.

This makes it easy to determine the expression format and to identify abstraction processing elements.
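To illustrate why a pre-assigned attribute simplifies this determination, the sketch below groups XML elements by an expression-format attribute. The specification does not name the tags or the attribute, so `document`, `section`, `row`, `cell`, and `format` are all hypothetical, as is the sample content.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names and the "format" attribute are assumptions.
SAMPLE = """<document>
  <section format="text">営業利益は前年度と同水準の32776百万円となった。</section>
  <section format="table"><row><cell>営業利益</cell><cell>32776</cell></row></section>
</document>"""


def elements_by_format(xml_text, attribute="format"):
    """Group every element that carries the expression-format attribute."""
    root = ET.fromstring(xml_text)
    grouped = {}
    for element in root.iter():
        fmt = element.get(attribute)
        if fmt is not None:
            grouped.setdefault(fmt, []).append(element)
    return grouped


grouped = elements_by_format(SAMPLE)
for fmt, elements in grouped.items():
    print(fmt, len(elements))
```

With the format available as an attribute, the device can route each element to the text, table, or image handling without inspecting the element's content.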

(13) Each device of the present invention is characterized by further comprising abstract model translation means that, when the text sentences contained in the document data are written in more than one language, unifies the abstract model data into a single language by referring to a translation dictionary.

As a result, abstract model data can be generated and used even when the document contains multiple languages.

In this specification, an "abstraction processing element (target data)" means data that is subject to abstraction processing by each device, obtained by dividing document data by expression format, such as text sentences containing numerical information data, tables, and images.

In this specification, "abstract model data" means data that associates numerical information data extracted from a text sentence or the like with dependency information.

1. First Embodiment [Abstract Model Generation Device]
First, the principle of extracting numerical information data and dependency information in order to generate abstract model data, and the processing up to generation of the abstract model data, are described below with reference to FIGS. 1 to 18.

1-1-1. Functional Block Diagram
FIG. 1 shows a functional block diagram of the abstract model generation device according to this embodiment. As shown in the figure, the abstract model generation device according to the present invention comprises cutout means 101, numerical information data extraction means 103, text sentence analysis means 105, dependency information extraction means 107, and abstract model output means 109.

The cutout means 101 cuts text sentence data out of the input document one sentence at a time and records it in the recording unit. For example, it cuts out text sentence data using the full stop "。" appearing in the text as a delimiter.
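The cutout step can be sketched as a simple split on the delimiter. The function name `cut_sentences` is hypothetical, and the second sample sentence is invented for illustration; only the first sentence and the use of "。" as the delimiter come from the specification.

```python
def cut_sentences(text, delimiter="。"):
    """Split text into sentences, keeping the delimiter at the end of each one."""
    sentences = []
    start = 0
    while True:
        end = text.find(delimiter, start)
        if end == -1:
            break
        sentences.append(text[start:end + len(delimiter)].strip())
        start = end + len(delimiter)
    # Keep any trailing text that has no closing delimiter.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


document = "営業利益は前年度と同水準の３２７７６百万円となった。売上高は増加した。"
sentences = cut_sentences(document)
```

Keeping the delimiter attached to each sentence preserves the end-of-sentence token that the later morphological analysis reports.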

The numerical information data extraction means 103 scans the text sentence data read from the recording unit from the beginning and determines whether it contains predetermined numerical information data. If it does, the means extracts the numerical information data and passes the text sentence to the text sentence analysis means.

The text sentence analysis means 105 performs morphological analysis on the text sentence data read from the recording unit, then analyzes each morpheme based on predetermined grammar definition information and divides the text sentence data into predetermined sentence element data.

Based on the sentence element data that contains the extracted numerical information data, the dependency information extraction means 107 extracts, from the text sentence data divided into sentence element data, the other sentence element data that is in a dependency relation with the numerical information data.

The abstract model output means 109 associates the numerical information data extracted by the numerical information data extraction means 103 with the dependency information data extracted by the dependency information extraction means 107 and outputs the result as abstract model data, for example to a display or a database.

1-1-2. Hardware Configuration
FIG. 2 shows an example of a hardware configuration in which the abstract model generation device shown in FIG. 1 is implemented using a CPU. The device comprises a display 201, a CPU 203, a memory 205, a keyboard/mouse 207, a hard disk 209, a CD-ROM drive 211, and a communication circuit 215.

The hard disk 209 stores, among other things, an information extraction program 2091 for performing the abstract model extraction processing according to the present invention, a morphological analysis dictionary 2093 for morphological analysis, grammar definition information 2094 for syntax analysis, a numerical unit master 2095 for extracting predetermined numerical information data, and an abstract model DB 2097 that records the abstract model data produced by the abstract model extraction processing. These are installed by reading the data recorded on a CD-ROM 212 through the CD-ROM drive 211.

The installation may instead use data downloaded from the Internet 216 or the like via the communication circuit 215.

1-1-3. Flowchart
Processing based on the information extraction program 2091 is described with reference to the flowcharts of FIGS. 3 to 6. The following description takes as an example the case where text containing the sentence "営業利益は前年度と同水準の３２７７６百万円となった。" ("Operating profit was 32,776 million yen, the same level as the previous fiscal year.") is input to the cutout means 101.

In the flowchart of the abstract model extraction processing shown in FIG. 3, the CPU 203 reads the text and loads it into the memory 205 (step S301).

One sentence of text sentence data is cut out from the beginning of the input text (step S303). In this embodiment, the full stop "。" serves as the sentence delimiter. For example, when the above text is read, the first sentence, "営業利益は前年度と同水準の３２７７６百万円となった。", is cut out.

The CPU 203 performs numerical information data extraction processing on the cut-out text sentence data (step S305).

1-1-3-1. Numerical Information Data Extraction Processing
In the flowchart of the numerical information data extraction processing shown in FIG. 4, the CPU 203 obtains the unit characters of numerical information data from the numerical unit master 2095 (step S401). In this embodiment, for example, the numerical unit master records unit characters for monetary amounts such as "百万円" (million yen) and "億円" (hundred million yen).

The numerical unit master should be set according to the content of the text from which numerical information data is to be extracted. For securities reports and financial statements, for example, monetary unit characters such as "円" (yen), "億円" (hundred million yen), "百万円" (million yen), and "千円" (thousand yen) are set; for historical chronologies, newspaper articles, and the like, date unit characters such as "年" (year), "月" (month), and "日" (day) are set. In this way, only the necessary information is extracted according to the content of the document.

If the numerical unit master is not used, the predetermined unit characters for numerical information data may instead be set in the information extraction program 2091 itself.

The CPU 203 searches the text sentence data from the beginning for the extracted unit characters (step S403). For example, it searches the sentence "営業利益は、前年度と同水準の３２７７６百万円となった。" from the beginning for the unit characters "百万円" or "億円".

If a predetermined unit character is present in the text sentence data (step S405, YES), the digits immediately preceding the unit character are extracted as a constituent element of the numerical information data (step S407). If several digits are consecutive, the character string formed by concatenating all of them is extracted as the constituent element. Symbols such as the comma "，" and the decimal point "．" do not break the run; they are kept as part of the extracted character string. The CPU 203 then concatenates the extracted constituent element with the unit character to form the numerical information data.

For the above text sentence data, for example, "３２７７６" is extracted as the constituent element and concatenated with the unit character "百万円", so that "３２７７６百万円" is extracted as the numerical information data and recorded in a predetermined area of the memory 205 or the hard disk 209.
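Steps S401 to S409 can be sketched as a scan for unit characters followed by a backward walk over the preceding digits. This is an illustration under assumptions, not the program 2091 itself: the function name `extract_numeric_info`, the `NUMERIC_CHARS` set, and the choice to try longer units first are introduced here, and the second sample sentence is invented.

```python
# Half-width and full-width digits plus the separators that do not break a run.
NUMERIC_CHARS = set("0123456789０１２３４５６７８９,，.．")


def extract_numeric_info(sentence, unit_master):
    """Return every "digits + unit" string found in the sentence."""
    results = []
    # Assumption: longer units are tried first so "百万円" wins over "円".
    units = sorted(unit_master, key=len, reverse=True)
    i = 0
    while i < len(sentence):
        unit = next((u for u in units if sentence.startswith(u, i)), None)
        if unit is None:
            i += 1
            continue
        # Walk backwards over the digits and separators right before the unit.
        start = i
        while start > 0 and sentence[start - 1] in NUMERIC_CHARS:
            start -= 1
        if start < i:
            results.append(sentence[start:i] + unit)
        i += len(unit)
    return results


unit_master = ["百万円", "億円"]
found = extract_numeric_info("営業利益は、前年度と同水準の３２７７６百万円となった。", unit_master)
mixed = extract_numeric_info("売上高は1,234.5億円、利益は12百万円。", unit_master)
```

A unit character with no digits in front of it is skipped in this sketch, so a bare "円" in running text would not produce an empty numerical value.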

If the text sentence data contains another unit character (step S409, NO), the CPU 203 repeats step S407. If no other unit character remains, the processing ends (step S409, YES). The processing also ends if no unit character is found in step S405 (step S405, NO).

1-1-3-2. Text Sentence Analysis Processing
If numerical information data has been extracted in the numerical information data extraction processing (FIG. 3, step S305) (step S307, YES), the CPU 203 performs text sentence analysis processing (step S309).

In the flowchart of the text sentence analysis processing shown in FIG. 5, the CPU 203 performs morphological analysis of the text sentence data based on the morphological analysis dictionary 2093 (step S501). For the morphological analysis, a tool such as ChaSen (茶筌), developed by the Matsumoto Laboratory at the Nara Institute of Science and Technology, may be used.

FIGS. 7, 7a, and 7b show an example in which the abstract model extraction processing is applied to the text sentence data 701, "営業利益は、前年度と同水準の３２７７６百万円となった。".

Through the morphological analysis of step S501, the CPU 203 divides the text sentence data 701 (FIG. 7) into morphemes, the smallest meaningful units of language.

For example, in FIG. 7, the text sentence data 701 is morphologically analyzed as shown at 703: "営業利益 (noun) / は (particle) / 前年度 (noun) / と (particle) / 同水準 (noun) / の (particle) / ３２７７６百万円 (noun) / と (particle) / なった (verb) / 。 (end of sentence)". In the text sentence data 703 shown above and in FIG. 7, a "/" is inserted between morphemes for the sake of explanation; in practice the result is recorded as morphological analysis data in a predetermined area of the memory 205 or the hard disk 209, as shown at A in FIG. 7a.

Based on the grammar definition information 2094, the CPU 203 parses the morphologically analyzed text sentence data (step S503). For the parsing, a tool such as YACC, an LR grammar parser that performs bottom-up parsing, may be used. A parser is a syntax analysis program: it analyzes an input character string and builds a parse tree or the like.
The LR grammar parser parses according to the preset grammar definition information 2094. FIG. 8 shows an example of the grammar definition information 2094 set in this embodiment. In FIG. 8, the grammar information is defined in BNF notation.

In step S503, the CPU 203 feeds the morphologically analyzed text sentence data to YACC, the LR grammar parser, and parses it. For example, parsing the text sentence data 701 shown in FIG. 7 using YACC and the grammar definition information 2094 yields the result shown at 705.

At 705, parsing proceeds in stages, levels 1 to 8, based on the grammar definition information 2094. At each level, the whole text sentence data can thus be divided into predetermined sentence element data. At level 2, for example, the text sentence data 701 is divided into the sentence element data "営業利益は (dependent phrase) / 前年度と (dependent phrase) / 同水準の (dependent phrase) / ３２７７６百万円 (dependent phrase) と / なった (verb phrase)".

As shown at B to D in FIG. 7a and at E to I in FIG. 7b, the CPU 203 records the parsed data for each level in a predetermined area of the memory 205 or the hard disk 209.

From the parsed text sentence data, the CPU 203 acquires the text sentence data divided into sentence element data at the "phrase" (〜句) level (step S505).

For example, it acquires the text sentence data 707, which is the level-2 division of the text sentence data 705 in FIG. 7; specifically, the parse data (level 2) shown at C in FIG. 7a.

Based on the acquired parse data (level 2), the CPU 203 determines the nominative phrase (step S507). Here, a nominative phrase is sentence element data that contains the subject, and whether sentence element data is a nominative phrase is judged from the particle at its end. For example, if the sentence element data ends in "〜は", "〜が", or "〜も", it is judged to be a nominative phrase.

FIG. 9 shows an example of determining the nominative phrase. In the text sentence data 901 shown in FIG. 9, the particle at the end of the sentence element data "営業利益は (dependent phrase)" is "は", according to the text sentence data 703 obtained immediately after the morphological analysis in FIG. 7. The sentence element data "営業利益は (dependent phrase)" therefore ends in "〜は" and can be determined to be the nominative phrase.
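The particle test of step S507 can be sketched as a suffix check over the level-2 phrases. The function name `find_nominative_phrase` is hypothetical, and the phrase list below is a simplified segmentation assumed for illustration. The specification takes the final particle from the morphological analysis data, whereas this sketch only inspects the last character of each phrase.

```python
from typing import List, Optional

# Particles that mark a nominative phrase in this embodiment.
NOMINATIVE_PARTICLES = ("は", "が", "も")


def find_nominative_phrase(phrases: List[str]) -> Optional[str]:
    """Return the first phrase that ends in one of the nominative particles."""
    for phrase in phrases:
        if phrase.endswith(NOMINATIVE_PARTICLES):
            return phrase
    return None


# Assumed level-2 segmentation of the example sentence.
phrases = ["営業利益は", "前年度と", "同水準の", "３２７７６百万円と", "なった"]
nominative = find_nominative_phrase(phrases)
```

Returning only the first match is a simplification; a sentence with several phrases ending in these particles would need a rule for choosing among them, which this sketch does not attempt.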

Note that if dependent phrases acting as adnominal modifiers precede the dependent phrase determined here to be the nominative phrase, the sequence of dependent phrases including them may be judged to be the nominative phrase, or such adnominal dependent phrases may be left out of the nominative phrase. For example, if "当連結会計年度の (dependent phrase)" ("of the current consolidated fiscal year") precedes "営業利益 (dependent phrase)" ("operating profit"), either "当連結会計年度の営業利益" (dependent phrase sequence) or only "営業利益 (dependent phrase)" may be judged to be the nominative phrase.

The CPU 203 determines the predicate in the acquired text sentence data 707 (step S509). Here, a predicate (述部) is sentence element data that contains the predicate word. The CPU 203 determines whether sentence element data is a predicate from the part of speech at its end.

For example, if the part of speech at the end of sentence element data is a noun, an adjective, or a verb, the sentence element data is determined to be a predicate or a constituent of a predicate. In particular, an auxiliary verb, that is, a verb whose meaning has low independence (a dependent verb), as well as the noun and adjective mentioned above, is concatenated with the immediately preceding sentence element data to form the predicate.

In this embodiment, if sentence element data ends in "〜である", "〜となる", "〜であった", or "〜となった", the part of speech at the end of that sentence element is determined to be an auxiliary verb. Whether the final part of speech is an auxiliary verb may instead be determined by morphological analysis; in that case, the morphological analysis should classify each verb as a main verb (independent) or an auxiliary verb (dependent).

FIG. 9 also shows an example of determining the predicate. In the text sentence data 903 shown in FIG. 9, the sentence element data "なった (verb phrase)" is a verb according to the result of the morphological analysis in FIG. 7. It can therefore be determined to be a predicate or a constituent of a predicate. Moreover, because "なった (verb phrase)" is classified as an auxiliary verb, it is concatenated with the immediately preceding sentence element data to form the predicate.

As a result, "なった (verb phrase)" and the immediately preceding sentence element data "３２７７６百万円と (dependent phrase)" are concatenated, and "３２７７６百万円となった" ("became 32776 million yen") is determined to be the predicate.
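The merging of an auxiliary verb into the preceding element can be sketched as below. This is only an illustration under stated assumptions: the name `determine_predicate` is hypothetical, and the ending test is applied to the concatenation of the last two elements because, in the example, the particle "と" sits at the end of the preceding element while "なった" is its own element.

```python
# Endings treated as auxiliary verbs in this embodiment.
AUXILIARY_ENDINGS = ("である", "となる", "であった", "となった")


def determine_predicate(elements):
    """Merge a trailing auxiliary-verb element into the element before it.

    elements: list of sentence element strings in sentence order.
    Returns a new list whose last item is the predicate.
    """
    if len(elements) < 2:
        return list(elements)
    previous, last = elements[-2], elements[-1]
    # Assumption: the auxiliary ending may span the last two elements.
    if (previous + last).rstrip("。").endswith(AUXILIARY_ENDINGS):
        return list(elements[:-2]) + [previous + last]
    return list(elements)


level2 = ["営業利益は", "前年度と", "同水準の", "３２７７６百万円と", "なった"]
print(determine_predicate(level2)[-1])
```

A verb such as "買った" does not match any of the listed endings, so it is left as a predicate on its own.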

That is, as shown in the text sentence data 905 of FIG. 9, the text sentence data is divided into the sentence element data "営業利益は (nominative phrase) / 前年度と (dependent phrase) / 同水準の (dependent phrase) / ３２７７６百万円となった (predicate)". In practice, the CPU 203 records analysis result data 910 such as that shown in FIG. 9a in a predetermined area of the memory 205 or the hard disk 209.

1-1-3-3. Dependency abstract model extraction process
Once the text sentence data has been divided into sentence element data of predetermined types in the text sentence analysis process (FIG. 3, step S309), the CPU 203 performs the dependency abstract model extraction process (step S313).

In the flowchart of the dependency abstract model extraction process shown in FIG. 6, the CPU 203 acquires the sentence element data to which the numerical information data belongs (step S601). For example, since the numerical information data extracted by the numerical information data extraction process above is "３２７７６百万円", the CPU 203 acquires "３２７７６百万円となった (predicate)" from the text sentence data 905 shown in FIG. 9 as the sentence element data that contains it.

If the CPU 203 determines that the sentence element data containing the numerical information data is a nominative phrase (step S603, YES), it proceeds to step S609 described below. If it determines that the sentence element data is not a nominative phrase (step S603, NO), it further determines whether the sentence element data is a predicate (step S605).

If the CPU 203 determines that the sentence element data to which the numerical information data belongs is a predicate (step S605, YES), it extracts the noun in the nominative phrase of the original text sentence data as the dependency information data for the numerical information data (step S613). If it determines that the sentence element data is not a predicate (step S605, NO), it judges the sentence element data to be a dependent phrase and determines whether it is an adnominal modifier (step S607).

If the CPU 203 determines that the sentence element data containing the numerical information data is an adnominal modifier (step S607, YES), it extracts the noun in the sentence element data immediately following it as the dependency information data for the numerical information data (step S615). If it determines that the sentence element data is not an adnominal modifier (step S607, NO), it extracts the sentence element data other than the one to which the numerical information data belongs as the dependency information data (step S609).
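The branching of steps S601 to S615 can be sketched as follows. This is an illustrative Python sketch, not the embodiment's code: the labels "nominative", "dependent", and "predicate", the function names, and in particular `head_noun` (which strips one trailing particle instead of consulting the morphological analysis) are assumptions made for the example.

```python
# Particles stripped to approximate "the noun in the phrase". A real system
# would take the noun from the morphological analysis instead (assumption).
TRAILING_PARTICLES = ("は", "が", "も", "を", "に", "の", "と")


def head_noun(element_text):
    """Crude stand-in for noun extraction: drop punctuation and one particle."""
    text = element_text.rstrip("、。")
    for particle in TRAILING_PARTICLES:
        if text.endswith(particle):
            return text[: -len(particle)]
    return text


def extract_dependency(elements, numeric):
    """Follow the branches of FIG. 6 for one numerical information item.

    elements: list of (text, kind) pairs in sentence order.
    numeric:  the numerical information data, e.g. "１００円".
    """
    # S601: find the sentence element that contains the numerical information.
    index = next(i for i, (text, _) in enumerate(elements) if numeric in text)
    text, kind = elements[index]
    others = "".join(t for i, (t, _) in enumerate(elements) if i != index)

    if kind == "nominative":                 # S603 YES -> S609
        return others
    if kind == "predicate":                  # S605 YES -> S613
        nominative = next(t for t, k in elements if k == "nominative")
        return head_noun(nominative)
    if text.endswith("の"):                  # S607 YES -> S615
        return head_noun(elements[index + 1][0])
    return others                            # S607 NO -> S609


notebook = [("私は", "nominative"), ("昨日", "dependent"),
            ("１００円の", "dependent"), ("ノートを", "dependent"),
            ("買った", "predicate")]
print(extract_dependency(notebook, "１００円"))
```

The three cases described in the following subsections correspond to the three return paths: the predicate case, the nominative phrase case, and the two dependent phrase cases.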

The cases in which the sentence element data to which the numerical information data belongs is a predicate, a nominative phrase, or a dependent phrase are described in turn below.

1-1-3-3-1. The "predicate" case
For example, the sentence element data "３２７７６百万円となった (predicate)", to which the numerical information data belongs, is determined to be a predicate, so the dependency information data is extracted based on "営業利益は (nominative phrase)" in the text sentence data 905 shown in FIG. 9. Specifically, with reference to the text sentence data 703 shown in FIG. 7, "営業利益 (noun)" ("operating profit") is extracted as the dependency information data.

As a result, the numerical information data "３２７７６百万円" and the dependency information data "営業利益" become the abstract model data.

1-1-3-3-2. The "nominative phrase" case
For example, consider the case where "１６０３年には徳川家康が征夷大将軍に任じられた。" ("In 1603, Tokugawa Ieyasu was appointed Seii Taishogun.") is input as text sentence data.

Following steps S301 to S309 shown in FIG. 3, this text sentence data is divided into the sentence element data "１６０３年には (nominative phrase) / 徳川家康が (nominative phrase) / 征夷大将軍に (dependent phrase) / 任じられた (predicate)". Here, "年" (year) is used as the unit character, and "１６０３年" is assumed to have been extracted as the numerical information data.

In this case, the sentence element data to which the numerical information data belongs is "１６０３年には (nominative phrase)", so the CPU 203 extracts from the text sentence data the sentence element data other than "１６０３年には (nominative phrase)" as the dependency information data. That is, "徳川家康が (nominative phrase) / 征夷大将軍に (dependent phrase) / 任じられた (predicate)" is extracted as the dependency information data.

As a result, the numerical information data "１６０３年" and the dependency information data "徳川家康が征夷大将軍に任じられた" ("Tokugawa Ieyasu was appointed Seii Taishogun") become the abstract model data.

1-1-3-3-3. The "dependent phrase" case
When the sentence element data containing the numerical information data is a dependent phrase, the CPU 203 determines whether its form of modification is adnominal (step S607). That is, it determines whether the following sentence element data contains a nominal (体言).

For example, if the sentence element data containing the numerical information data ends in "〜の" (step S607, YES), the CPU 203 determines that the modified word is a nominal and that the sentence element data is an adnominal modifier.

For example, consider the case where "私は昨日１００円のノートを買った。" ("Yesterday I bought a 100-yen notebook.") is input as text sentence data. Following steps S301 to S309 shown in FIG. 3, this text sentence data is divided into the sentence element data "私は (nominative phrase) / 昨日 (dependent phrase) / １００円の (dependent phrase) / ノートを (dependent phrase) / 買った (predicate)". Here, "円" (yen) is used as the unit character, and "１００円" is assumed to have been extracted as the numerical information data.

In this case, the sentence element data to which the numerical information data belongs is "１００円の (dependent phrase)", and it ends in "〜の", so the CPU 203 determines that this dependent phrase is an adnominal modifier. That is, it determines that the word modified by the dependent phrase is a nominal, and extracts the noun of the sentence element data immediately following it in the text sentence data as the dependency information data.

For example, "ノート (noun)" ("notebook") is extracted as the dependency information data from the immediately following sentence element data "ノートを (dependent phrase)".

As a result, the numerical information data "１００円" and the dependency information data "ノート" become the abstract model data.

On the other hand, if the CPU 203 determines that the sentence element data is not an adnominal modifier (step S607, NO), that is, that the modified word is a declinable word (用言), it extracts the sentence element data other than the one to which the numerical information data belongs as the dependency information data (step S609).

For example, consider the case where "源頼朝は１１９２年に鎌倉幕府を開いた。" ("Minamoto no Yoritomo established the Kamakura shogunate in 1192.") is input as text sentence data.

Following steps S301 to S309 shown in FIG. 3, this text sentence data is divided into the sentence element data "源頼朝は (nominative phrase) / １１９２年に (dependent phrase) / 鎌倉幕府を (dependent phrase) / 開いた (predicate)". Here, "年" (year) is used as the unit character, and "１１９２年" is assumed to have been extracted as the numerical information data.

In this case, the sentence element data to which the numerical information data belongs is "１１９２年に (dependent phrase)", and it ends in something other than "〜の", so the CPU 203 determines that this dependent phrase is not an adnominal modifier. That is, it determines that the word modified by the dependent phrase is a declinable word, and extracts the sentence element data in the text sentence data other than the one to which the numerical information data belongs as the dependency information data.

For example, the sentence element data other than "１１９２年に (dependent phrase)", namely "源頼朝は (nominative phrase) / 鎌倉幕府を (dependent phrase) / 開いた (predicate)", is extracted as the dependency information data.

As a result, the numerical information data "１１９２年" and the dependency information data "源頼朝は鎌倉幕府を開いた" ("Minamoto no Yoritomo established the Kamakura shogunate") become the abstract model data.

The CPU 203 repeats steps S601 to S609 once for each numerical information data item extracted by the numerical information data extraction process (FIG. 4) (step S611, YES), and ends the process when no other numerical information data remains (step S611, NO).

1-1-3-4. Summary
When the dependency abstract model extraction process is finished, the CPU 203 records the abstract model data in the abstract model DB 2097 (FIG. 3, step S315). For example, when the text sentence data shown in FIG. 7, "営業利益は、前年度と同水準の３２７７６百万円となった。" ("Operating profit was 32776 million yen, the same level as the previous fiscal year."), is input, the numerical information data "３２７７６百万円" and the dependency information data "営業利益" are recorded in the abstract model DB 2097.

In this embodiment, the abstract model data extracted by the abstract model generation device is recorded in the abstract model DB 2097, but the invention is not limited to this; the abstract model data may instead be handed over to another application.

The device may also be incorporated into another application in order to extract numerical information data and dependency information data from text sentence data. Furthermore, the abstract model data may be shown on the display 201 of the abstract model generation device for presentation to the user.

The CPU 203 repeats the above processing (FIG. 3, steps S303 to S315) once for each text sentence data item cut out of the text (step S317, NO), and ends the abstract model extraction process when no text sentence data remains to be processed (step S317, YES).

As described above, according to the present invention, a noun in a dependency relationship with numerical information data can be extracted accurately even when that noun is not adjacent to the numerical information data. In addition, the information extracted as corresponding to the numerical information data is not limited to a specific noun; a noun or a sentence in a dependency relationship can be extracted accurately.

1-2. Embodiment 1-1
The embodiment above was described on the premise that the input text sentence data is a simple sentence (単文). This embodiment describes, in particular, the case where the text sentence data is a complex sentence (複文) or a compound sentence (重文) containing two clauses.

A complex sentence (複文) is a sentence in which a subject-predicate relationship holds and in which a further subject-predicate relationship is found within one of its constituents. A compound sentence (重文) is a sentence in which two or more independent sentences are joined on an equal footing.

1-2-1. Functional block diagram
FIG. 10 shows a functional block diagram of the abstract model generation device according to this embodiment. In this figure, the abstract model generation device according to the present invention comprises cutout means 101, numerical information data extraction means 103, text sentence analysis means 105, extraction target sentence determination means 106, dependency information extraction means 107, and abstract model output means 109.

The extraction target sentence determination means 106 determines, from the text sentence data divided into sentence element data by the text sentence analysis means, the sentences that are targets for extraction with respect to numerical information data, and passes them to the dependency information extraction means 107. For a complex sentence, for example, it splits the sentence into a main clause and a subordinate clause and then determines whether each is an extraction target.

The cutout means 101, numerical information data extraction means 103, text sentence analysis means 105, dependency information extraction means 107, and abstract model output means 109 are the same as in the first embodiment.

1-2-2. Hardware configuration
The hardware configuration is the same as in the first embodiment.

1-2-3. Flowchart
Processing based on the information extraction program 2091 is described using the flowchart of FIG. 11. The following takes as an example the case where a text containing the text sentence data "今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。" ("This month's food expenses came to 45400 yen because my father bought a 7000-yen wine.") is input to the cutout means 101.

In the flowchart of the abstract model extraction process shown in FIG. 11, steps S301 to S309 performed by the CPU 203 are basically the same as in the first embodiment.

For example, when the above text is read, its first sentence, "今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。", is cut out as text sentence data.

1-2-3-1. Numerical information data extraction process
In this embodiment, for example, "円" (yen) is recorded as a unit character in the numerical unit master, and the CPU 203 extracts "７０００円" and "４５４００円" as numerical information data through the numerical information data extraction process shown in FIG. 4.
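The details of the extraction process of FIG. 4 lie outside this passage, so the following is only a rough Python sketch of the idea that a run of digits followed by a unit character from the numerical unit master is treated as numerical information data. The contents of `UNIT_MASTER`, the function name `extract_numeric`, and the use of a regular expression are all assumptions made for illustration.

```python
import re

# Illustrative unit characters; the actual numerical unit master may differ.
UNIT_MASTER = ["百万円", "円", "年"]


def extract_numeric(sentence, units=UNIT_MASTER):
    """Return every digits-plus-unit substring found in the sentence."""
    # Longer units come first so that 百万円 is preferred over 円.
    ordered = sorted(units, key=len, reverse=True)
    alternation = "|".join(re.escape(unit) for unit in ordered)
    # Accept both ASCII and full-width digits.
    pattern = re.compile(rf"[0-9０-９]+(?:{alternation})")
    return pattern.findall(sentence)


sentence = "今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。"
print(extract_numeric(sentence))
```

Because the unit must directly follow a digit, a word such as "前年度" is not picked up even though it contains "年".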

1-2-3-2. Text sentence analysis process
If numerical information data is extracted in the numerical information data extraction process (FIG. 11, step S305) (step S307, YES), the CPU 203 performs the text sentence analysis process (step S309). As in the first embodiment, the CPU 203 divides the text sentence data into predetermined sentence element data according to the flowchart of the text sentence analysis process shown in FIG. 5.

FIG. 13 shows the text sentence data after the text sentence analysis process. As shown in the text sentence data 1301, the text sentence data is divided into the sentence element data "今月の食費は、 (nominative phrase) / 父が (nominative phrase) / ７０００円の (dependent phrase) / ワインを (dependent phrase) / 買ったため、 (predicate) / ４５４００円になった (predicate)".

1-2-3-3. Extraction target sentence determination process
When the text sentence analysis process is finished, the CPU 203 performs the extraction target sentence determination process (step S310). FIG. 12 shows a flowchart of the extraction target sentence determination process.

The CPU 203 determines the type of sentence based on the parsed text sentence data. That is, based on the number and order of the combinations of nominative phrase and predicate, the CPU 203 determines whether the input text sentence data is a complex sentence (複文), a compound sentence (重文), or a simple sentence (単文) (FIG. 12, step S1201).

If there is one combination of nominative phrase and predicate, the CPU 203 determines that the text sentence data is a simple sentence (step S1201, simple sentence). If there are two combinations and the order is "nominative phrase, nominative phrase, predicate, predicate", it determines that the text sentence data is a complex sentence (step S1201, complex sentence). If there are two combinations and the order is "nominative phrase, predicate, nominative phrase, predicate", it determines that the text sentence data is a compound sentence (step S1201, compound sentence).
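The determination in step S1201 amounts to matching the order of nominative phrases and predicates against a few fixed patterns. A minimal Python sketch is shown below; the labels, the `PATTERNS` table, and the "unknown" fallback for orders not covered by the description are assumptions made for the example.

```python
# Order of nominative phrases and predicates -> sentence type (step S1201).
PATTERNS = {
    ("nominative", "predicate"): "simple",
    ("nominative", "nominative", "predicate", "predicate"): "complex",
    ("nominative", "predicate", "nominative", "predicate"): "compound",
}


def classify_sentence(kinds):
    """Classify a sentence from the kinds of its sentence elements.

    kinds: element kinds in sentence order, e.g. "nominative", "dependent",
           "predicate". Dependent phrases are ignored for the pattern match.
    """
    core = tuple(k for k in kinds if k in ("nominative", "predicate"))
    # Orders outside the table are reported as "unknown" (assumption).
    return PATTERNS.get(core, "unknown")


wine = ["nominative", "nominative", "dependent", "dependent",
        "predicate", "predicate"]
print(classify_sentence(wine))
```

Keeping the patterns in a table makes it straightforward to register further orders in advance if sentences with more combinations are to be handled.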

If the CPU 203 determines that the input text sentence data is a complex sentence, it splits the text sentence data into a main clause and a subordinate clause (step S1203). For example, the text sentence data 1301 shown in FIG. 13 has the structure "nominative phrase, nominative phrase, predicate, predicate" and is therefore determined to be a complex sentence. It is split into the text sentence data 1303 "今月の食費は、 (nominative phrase) / ４５４００円になった (predicate)" as the main clause and the text sentence data 1305 "父が (nominative phrase) / ７０００円の (dependent phrase) / ワインを (dependent phrase) / 買ったため (predicate)" as the subordinate clause.

On the other hand, if the CPU 203 determines that the input text sentence data is a compound sentence, it splits the text sentence data into a first-half sentence and a second-half sentence (step S1205). For example, the text sentence data 1401 shown in FIG. 14 has the structure "nominative phrase, predicate, nominative phrase, predicate" and is therefore determined to be a compound sentence. It is split into the text sentence data 1403 "今月の食費は、 (nominative phrase) / ４５４００円になった (predicate)" as the first-half sentence and the text sentence data 1305 "父が (nominative phrase) / ７０００円の (dependent phrase) / ワインを (dependent phrase) / 買ったため (predicate)" as the subordinate clause.
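The description gives the result of the split but not the rule behind it, so the following Python sketch is one possible reading. It assumes that the subordinate clause of a complex sentence runs from the second nominative phrase through the first predicate, and that a compound sentence is cut right after its first predicate; the function names and these rules are assumptions, not the embodiment's method.

```python
def split_complex(elements):
    """Step S1203 (assumed rule): split into (main clause, subordinate clause).

    elements: list of (text, kind) pairs in sentence order.
    """
    kinds = [kind for _, kind in elements]
    second_nominative = [i for i, k in enumerate(kinds) if k == "nominative"][1]
    first_predicate = kinds.index("predicate")
    # Inner clause: second nominative phrase through the first predicate.
    subordinate = elements[second_nominative:first_predicate + 1]
    # Outer clause: whatever remains before and after the inner clause.
    main = elements[:second_nominative] + elements[first_predicate + 1:]
    return main, subordinate


def split_compound(elements):
    """Step S1205 (assumed rule): cut right after the first predicate."""
    first_predicate = [kind for _, kind in elements].index("predicate")
    return elements[:first_predicate + 1], elements[first_predicate + 1:]


wine = [("今月の食費は、", "nominative"), ("父が", "nominative"),
        ("７０００円の", "dependent"), ("ワインを", "dependent"),
        ("買ったため、", "predicate"), ("４５４００円になった", "predicate")]
main, subordinate = split_complex(wine)
print([text for text, _ in main])
print([text for text, _ in subordinate])
```

For the complex sentence of FIG. 13 this reproduces the main clause and subordinate clause given in the description.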

The CPU 203 acquires a simple sentence, either one produced by the split or the input simple sentence itself (step S1207), and determines whether it contains numerical information data (step S1209).

If the acquired simple sentence contains numerical information data (step S1209, YES), the CPU 203 determines it to be an extraction target sentence and stores it in the memory 205 (step S1211). If it contains no numerical information data, step S1211 is skipped.

If there is another simple sentence (step S1213, YES), the CPU 203 returns to step S1207, acquires the next simple sentence, and performs the same processing as steps S1209 to S1213 above. When no simple sentence remains to be processed, the process ends (step S1213, NO).

For example, when the text sentence data 1303, the main clause in FIG. 13, is acquired as a simple sentence, it contains the numerical information data "４５４００円", so the text sentence data 1303 is determined to be an extraction target sentence. Likewise, when the text sentence data 1305, the subordinate clause, is acquired as a simple sentence, it contains the numerical information data "７０００円", so the text sentence data 1305 is determined to be an extraction target sentence. The compound-sentence text sentence data 1401 shown in FIG. 14 can be processed in the same way.
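The loop of steps S1207 to S1213 is essentially a filter over the simple sentences. A short Python sketch follows; representing each simple sentence as a list of element strings and the name `select_targets` are assumptions made for illustration.

```python
def select_targets(simple_sentences, numerics):
    """Keep the simple sentences that contain numerical information data.

    simple_sentences: list of simple sentences, each a list of element texts.
    numerics:         numerical information data extracted beforehand.
    """
    targets = []
    for sentence in simple_sentences:                     # S1207 / S1213 loop
        text = "".join(sentence)
        if any(numeric in text for numeric in numerics):  # S1209
            targets.append(sentence)                      # S1211
    return targets


main_clause = ["今月の食費は、", "４５４００円になった"]
subordinate_clause = ["父が", "７０００円の", "ワインを", "買ったため、"]
selected = select_targets([main_clause, subordinate_clause],
                          ["７０００円", "４５４００円"])
print(len(selected))
```

A clause without any numerical information data would simply be left out of the returned list, which corresponds to skipping step S1211.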

1-2-3-4. Dependency abstract model extraction process
Once the input text sentence data, already divided into sentence element data of predetermined types, has been split into simple sentences in the extraction target sentence determination process (FIG. 11, step S310), the CPU 203 performs the dependency abstract model extraction process (step S313). The dependency abstract model extraction process follows the flowchart of FIG. 6 shown in the first embodiment.

The CPU 203 repeats steps S313 to S315 once for each simple sentence determined by the extraction target sentence determination process (FIG. 12) (step S316, YES), and ends the process when no other simple sentence remains (step S316, NO).

For example, the numerical information data "４５４００円" and the dependency information data "食費" ("food expenses") are extracted from the text sentence data 1303 shown in FIG. 13 and recorded in the abstract model DB 2097. Similarly, the numerical information data "７０００円" and the dependency information data "ワイン" ("wine") are extracted from the text sentence data 1305 shown in FIG. 13 and recorded in the abstract model DB 2097.

1-2-3-5. Summary
As described above, according to the present invention, a noun or sentence in a dependency relationship with numerical information data can be extracted accurately whether the text sentence data is a simple sentence, a complex sentence, or a compound sentence.

The embodiment above was described on the premise of a complex or compound sentence in which the text sentence data has two combinations of nominative phrase and predicate, but there may be two or more combinations. In that case, the possible orders of nominative phrases and predicates are defined as patterns in advance, the sentence is judged to be complex or compound according to the pattern it matches, and it is then split into simple sentences.

1-3. Embodiment 1-2
In the embodiments above, the input text sentence data is divided into sentence element data, and only the numerical information data and the dependency information data in a dependency relationship with it are extracted, based on the type of the sentence element data to which the numerical information data belongs.

This embodiment describes the case where numerical information data and dependency information data are extracted as main information based on predetermined sentence element data, and additional information is further extracted based on the other sentence element data.

1-3-1. Functional block diagram
FIG. 15 shows a functional block diagram of the abstract model generation device according to this embodiment. In this figure, the abstract model generation device according to the present invention includes cutout means 101, numerical information data extraction means 103, text sentence analysis means 105, extraction target sentence determination means 106, dependency information extraction means 107, additional information extraction means 108, and abstract model output means 109.

Based on the text sentence data divided into sentence element data, the additional information extraction means 108 extracts the sentence element data other than those containing the numerical information data and the dependency information data, and gives them to the abstract model output means 109 as additional information about that numerical information data and dependency information data.

1-3-2. Hardware configuration
The hardware configuration is the same as in the first embodiment.

1-3-3. Flowchart
Processing based on the information extraction program 2091 is described with reference to the flowchart of FIG. 16. As in the first embodiment, the following takes as an example the case where a text sentence containing the text sentence data “Operating profit was 32,776 million yen, the same level as the previous fiscal year.” is input to the cutout means 101.

In the flowchart of the abstract model extraction process shown in FIG. 11, steps S301 to S315 performed by the CPU 203 are basically the same as in the first embodiment.

For example, when the above text is read, its first sentence, “Operating profit was 32,776 million yen, the same level as the previous fiscal year.”, is cut out as text sentence data.

In the numerical information data extraction process (FIG. 16, step S305), if numerical information data is extracted (step S307, YES), the CPU 203 performs the text sentence analysis process (step S309). As in the first embodiment, the CPU 203 divides the text sentence data into predetermined sentence element data according to the flowchart of the text sentence analysis process shown in FIG. 5.

FIG. 18 shows the text sentence data after the text sentence analysis process. As shown in the text sentence data 1801, the text sentence data is divided into the sentence element data “営業利益は (nominative phrase) / 前年度と (subordinate phrase) / 同水準の (subordinate phrase) / ３２７７６百万円となった (predicate)”, that is, “operating profit / as the previous fiscal year / at the same level / was 32,776 million yen”.

Once the input text sentence data has been divided into sentence element data of the predetermined types in the text sentence analysis process (FIG. 16, step S309), the CPU 203 performs the dependency abstract model extraction process (step S313). This process is the same as that of the flowchart of FIG. 6 described in the first embodiment.

Therefore, as in the first embodiment, the numerical information data “32,776 million yen” and the dependency information data “operating profit” become the abstract model data.

1-3-3-1. Additional abstract model extraction process
When the dependency abstract model extraction process is completed, the CPU 203 performs the additional abstract model extraction process (step S331). FIG. 17 shows a flowchart of the additional abstract model extraction process.

The CPU 203 acquires the sentence element data to which the numerical information data extracted in the numerical information data extraction process belongs, and the sentence element data to which the dependency information data extracted in the dependency abstract model extraction process belongs (FIG. 17, step S1701).

Next, the CPU 203 reads the text sentence data 1801 that was divided into sentence element data in the text sentence analysis process, and extracts as additional information the sentence element data other than those acquired above, that is, other than those to which the numerical information data and the dependency information data belong (step S1703).

For example, in the text sentence data 1801 of FIG. 18, the sentence element data to which the numerical information data belongs is “32,776 million yen (predicate)”, and the sentence element data to which the dependency information data belongs is “operating profit (nominative phrase)”. Therefore, “as the previous fiscal year (subordinate phrase)” and “at the same level (subordinate phrase)” are extracted as the sentence element data other than those to which the numerical information data and the dependency information data belong.

The CPU 203 repeats steps S1701 to S1703 for each item of numerical information data extracted by the numerical information data extraction process (FIG. 4) (step S1705, YES), and ends the process when no other numerical information data remains (step S1705, NO).

When the additional abstract model extraction process is completed, the CPU 203 records the abstract model data in the abstract model DB 2097 (FIG. 16, step S333).

For example, as shown at 1803 in FIG. 18, the abstract model DB 2097 records the additional information “as the previous fiscal year / at the same level” in addition to the numerical information data “32,776 million yen” and the dependency information data “operating profit”.
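The additional abstract model extraction (steps S1701 to S1703) can be sketched roughly as follows. This is only an illustration under stated assumptions: the sentence is taken as already segmented into sentence elements, and the names `SentenceElement`, `extract_additional_info`, and the record layout are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceElement:
    text: str  # surface string of the sentence element
    kind: str  # "nominative", "subordinate" or "predicate" (assumed labels)


def extract_additional_info(elements, numeric, dependency):
    """Return the elements containing neither the numerical nor the dependency data."""
    return [
        e.text
        for e in elements
        if numeric not in e.text and dependency not in e.text
    ]


# Pre-segmented example sentence; the segmentation itself is out of scope here.
elements = [
    SentenceElement("営業利益は", "nominative"),
    SentenceElement("前年度と", "subordinate"),
    SentenceElement("同水準の", "subordinate"),
    SentenceElement("32776百万円となった", "predicate"),
]

# Hypothetical record shape for one entry of the abstract model DB.
record = {
    "numeric": "32776百万円",
    "dependency": "営業利益",
    "additional": extract_additional_info(elements, "32776百万円", "営業利益"),
}
```

Here the two subordinate phrases end up in `record["additional"]`, mirroring the entry shown at 1803 in FIG. 18.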

The CPU 203 repeats the above processing (FIG. 11, steps S303 to S333) for each item of text sentence data cut out from the text sentence (step S317, NO), and ends the abstract model extraction process when no text sentence data remains to be processed (step S317, YES).

1-3-4. Summary
As described above, according to the present invention, additional information about the main information can be extracted in addition to the numerical information data and dependency information data that constitute the main information of the text sentence data.

As a result, it is possible to extract not only the dependency information data that is in a direct dependency relationship with the numerical information data, but also additional information that is indirectly related to the numerical information data or the dependency information data.

For example, the additional information “as the previous fiscal year / at the same level” relating to the numerical information data “32,776 million yen” and the dependency information data “operating profit” can be used as background or context for the fact that “operating profit” is “32,776 million yen”.

In the above embodiment, the case where the text sentence data is a simple sentence was described, but the text sentence data may include a complex or compound sentence. In this case, as shown in Embodiment 1-1, the extraction target sentence determination process may be inserted between the text sentence analysis process (step S309) and the dependency abstract model extraction process (step S313).

1-4. Other embodiments
In the above embodiments, the division process is performed after the numerical information data extraction process, but the numerical information data extraction process may instead be performed during the text sentence analysis process. Specifically, after the text sentence data is divided into sentence element data in the text sentence analysis process, each item of sentence element data is checked for numerical information data, and if any is found, the syntax analysis of the text sentence data is performed.

In the above embodiments, numerical information data is extracted as the run of digits immediately preceding a predetermined unit character in the text sentence data. For some unit characters, however, the run of digits immediately following the unit character may be extracted instead. This applies, for example, when a currency symbol such as “￥” or “＄” is used as the unit character.
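A minimal sketch of this unit-character rule, assuming a regular-expression approach that the patent does not itself prescribe. The unit lists, the NFKC normalization of full-width characters, and the name `extract_numeric` are illustrative assumptions.

```python
import re
import unicodedata

# Assumed unit characters: suffix units follow the digits, prefix units precede them.
SUFFIX_UNITS = ["百万円", "円", "%"]
PREFIX_UNITS = ["¥", "$"]

_NUM = r"[0-9][0-9,]*(?:\.[0-9]+)?"
# Try longer suffix units first so that "百万円" wins over "円".
_SUFFIX = "|".join(re.escape(u) for u in sorted(SUFFIX_UNITS, key=len, reverse=True))
_PREFIX = "|".join(re.escape(u) for u in PREFIX_UNITS)
PATTERN = re.compile(rf"(?:{_PREFIX}){_NUM}|{_NUM}(?:{_SUFFIX})")


def extract_numeric(text):
    """Return numeric expressions such as '45400円' or '$70' found in text."""
    # NFKC folds full-width digits and symbols into their ASCII forms.
    normalized = unicodedata.normalize("NFKC", text)
    return PATTERN.findall(normalized)
```

Calling `extract_numeric("食費は４５４００円")` would yield the normalized string `45400円`. A production system would likely need a fuller unit dictionary than the three suffixes assumed here.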

In the above embodiments, YACC, an LR grammar parser for bottom-up parsing, was given as an example for the syntax analysis process, but other types of parser may be used, for example an LL grammar parser for top-down parsing.

In the above embodiments, text sentences written in Japanese were described. However, text sentences written in another language may be used, provided that numerical information data can be extracted and the text can be divided into sentence element data by morphological analysis and syntax analysis. For English, for example, numerical information data may be extracted based on English numeric units, and an English morphological analysis program and syntax analysis program may be used.

2. Second embodiment [Link setting device]
A device that uses the abstract model generation device of the first embodiment to set links on abstraction processing elements, such as text sentences containing numerical information data, is described below with reference to FIGS. 19 to 25.

2-1. Functional block diagram
FIG. 19 shows a functional block diagram of the link setting device according to this embodiment. In this figure, the link setting device according to the present invention includes document data input means 1201, element division means 1203, abstract model data extraction means 1204, abstract model recording means 1211, abstract model selection means 1213, and link setting means 1215. The abstract model data extraction means 1204 includes expression format determination means 1205, abstract model generation means (text sentence) 1207, and abstract model generation means (table) 1209.

Document data is input from the document data input means 1201 and divided into abstraction processing elements by the element division means 1203. The divided abstraction processing elements are given to the abstract model data extraction means 1204.

The expression format determination means 1205 of the abstract model data extraction means 1204 determines the expression format of each abstraction processing element it receives, for example whether the element is a text sentence or a table. An element whose expression format is a text sentence is given to the abstract model generation means (text sentence) 1207, and an element whose expression format is a table is given to the abstract model generation means (table) 1209. When the document data contains only text sentences, the expression format determination means 1205 and the abstract model generation means (table) 1209 are unnecessary.

The abstract model generation means (text sentence) 1207 has the same function as the abstract model generation device of the first embodiment (shown in FIG. 1). That is, it extracts from the input text sentence the other sentence elements that are in a dependency relationship with the numerical information data, and gives them to the abstract model recording means 1211 together with position information data. Specifically, it performs the processing shown in steps S301 to S317 of the flowchart of FIG. 3 and related figures. The abstract model generation means (table) 1209 generates abstract model data based on predetermined rules defined for table data (described later), and gives the data to the abstract model recording means 1211 together with position information data.

The abstract model recording means 1211 records the abstract model data given by the abstract model generation means (text sentence) 1207 and the abstract model generation means (table) 1209. The abstract model selection means 1213 searches the abstract model recording means 1211, selects all entries whose abstract model data is identical, and provides their position information data to the link setting means 1215. Based on that position information data, the link setting means 1215 sets links (for example, hyperlinks) between the abstraction processing elements whose abstract model data is identical.

2-2. Hardware configuration
FIG. 20 shows an example of a hardware configuration in which the link setting device shown in FIG. 19 is realized using a CPU. The link setting device shown in FIG. 20 includes a display 201, a CPU 203, a memory 205, a keyboard/mouse 207, a hard disk 209, a CD-ROM drive 211, and a communication circuit 215.

A link setting program 221 is recorded on the hard disk 209. The link setting program 221 includes an abstract model data generation script 223 for performing the abstract model data generation process (shown in FIG. 3) on text sentences. The link setting program 221 also includes an expression format correspondence table 225 for determining the expression format (text sentence, table, and so on) of each abstraction processing element, and an abstract model DB 227 for recording the generated abstract model data.

These programs and data are installed by reading the data recorded on the CD-ROM 212 through the CD-ROM drive 211. The installation may instead use data downloaded from the Internet 216 or the like through the communication circuit 215.

2-3. Specific example of the process of setting links based on abstract model data
The process of generating abstract model data from text sentences and other elements containing numerical information data and then setting links is described with reference to the flowchart shown in FIG. 21. The following is an example in which links are set for text sentences, tables, and so on in an XML-format file (XML file).

First, the document data for which links are to be set is input (step S2001). For example, the data of a securities report file such as that shown in FIG. 22A is read from the CD-ROM 212 through the CD-ROM drive 211 and recorded in the memory 205. The data of the securities report file shown in FIG. 22A is written in XML, as shown in FIG. 22B. For example, the text sentence a1 of the file shown in FIG. 22A corresponds to the α portion of FIG. 22B.

The input document data is divided into abstraction processing elements such as text sentences and tables (step S2003). Specifically, the portions enclosed by predetermined start and end tags are extracted from the XML data shown in FIG. 22B. For example, in the rows for BusinessResults and Item of the expression format correspondence table 225 shown in FIG. 23, the symbol “○” is entered in an expression format column (text sentence or table) to indicate that the element is subject to abstraction processing, so the portions α to γ (shown in FIG. 22B) enclosed by these start and end tags are each extracted. In the expression format correspondence table 225 of FIG. 23, “−” indicates that the element is not extracted as an abstraction processing element.

Next, based on the expression format correspondence table 225 shown in FIG. 23, it is determined whether the content of each extracted portion is a text sentence or a table (step S2005). For example, from the expression format correspondence table 225 of FIG. 23, the content A1 of the α portion, which has the BusinessResults tag in FIG. 22, is determined to be a text sentence, while the contents B1 to B3 of the β portion and the contents C1 to C3 of the γ portion, which have the Item tag, are determined to be tables.
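Steps S2003 and S2005 can be sketched with Python's standard `xml.etree.ElementTree` module. The tag names `BusinessResults` and `Item` come from the description; the sample XML layout, the `id` attribute, the `Title`, `Name`, and `Amount` tags, and the dictionary form of the correspondence table are assumptions made for illustration, since FIGS. 22B and 23 are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the expression format correspondence table 225:
# tags absent from the mapping are not abstraction processing elements.
FORMAT_TABLE = {"BusinessResults": "text", "Item": "table"}

# Hypothetical XML; the real structure of FIG. 22B may differ.
SAMPLE = """<Report>
  <Title>Annual report</Title>
  <BusinessResults id="br2003">Sales were 8,191,752 million yen.</BusinessResults>
  <Item id="sales_2003"><Name>Sales</Name><Amount>8,191,752 million yen</Amount></Item>
</Report>"""


def split_elements(xml_text):
    """Return (position id, expression format, element) for each listed tag."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter():  # document order, including the root itself
        fmt = FORMAT_TABLE.get(node.tag)
        if fmt is not None:
            found.append((node.get("id"), fmt, node))
    return found
```

Each returned tuple can then be routed to the text-sentence or table generator according to its format label.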

When the content is determined to be a text sentence or a table, abstract model data is generated by a method that depends on the expression format, as described in (i) and (ii) below (steps S2007 and S2009).

(i) When the expression format is determined to be a text sentence, the abstract model data generation script 223 performs the abstract model data generation process for text sentences described in the first embodiment, and extracts data of the form “numerical information data = dependency information” as abstract model data (step S2007). Specifically, the abstract model data generation process is the processing shown in steps S301 to S317 of the flowchart of FIG. 3, which extracts the sentence elements in a dependency relationship with the numerical information data. For example, “sales = 8,191,752 million yen” is extracted as abstract model data from the content A of FIG. 22B. This abstract model data is given to the abstract model DB 227 together with its position ID (for example, “br2003”, shown as A0 in FIG. 22).

(ii) When the expression format is determined to be a table, abstract model data is generated based on rules defined in advance for tables (step S2009). In this embodiment, the rule is that the contents belonging to each portion determined to be a table are combined to generate data of the form “numerical information data = dependency information” as abstract model data. For example, “8,191,752 million yen = sales” and “100 = sales” are extracted as abstract model data from the contents B1 to B3 of the β portion shown in FIG. 22B, and “152,967 million yen = operating profit” and “1.9 = operating profit” are extracted from the contents C1 to C3 of the γ portion. These abstract model data are given to the abstract model DB 227 together with their position IDs (for example, “sales_2003”, shown as B0, for the β portion and “ifo_2003”, shown as C0, for the γ portion).
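One possible reading of the table rule is sketched below. It assumes that the first content of a table portion is the item name and that every remaining content is a numeric value for that item. The description does not spell out the rule in this detail, so the function `table_models` and the record layout are hypothetical.

```python
def table_models(contents, position_id):
    """Pair each numeric content with the item name (assumed to be the first content)."""
    label, *values = contents
    return [
        {"numeric": value, "dependency": label, "position": position_id}
        for value in values
    ]


# Contents B1 to B3 of the β portion, as quoted in the description.
beta = table_models(["売上高", "8,191,752百万円", "100"], "sales_2003")
# Contents C1 to C3 of the γ portion.
gamma = table_models(["営業利益", "152,967百万円", "1.9"], "ifo_2003")
```

Under this assumption each table portion yields one record per numeric cell, all sharing the portion's position ID.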

The abstract model DB 227 records the abstract model data and its position information data (step S2011). FIG. 24 shows a specific example of the data recorded in the abstract model DB 227.

Next, the link setting program 221 searches the abstract model DB 227, selects all entries whose abstract model data is identical, and outputs the ID of each item of position information data to the link setting program 221 (step S2013). For example, the IDs “br2003” and “ifo_2003” shown in FIG. 24, whose abstract model data is identical, are given to the link setting program 221.
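The selection in step S2013 amounts to grouping records by identical abstract model data. A minimal sketch follows, assuming records are dictionaries with `numeric`, `dependency`, and `position` keys and that matching is by exact string equality. The position IDs used here are placeholders, not the IDs of FIG. 24.

```python
from collections import defaultdict


def group_positions(records):
    """Map each abstract model shared by two or more records to their position IDs."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["dependency"], r["numeric"])].append(r["position"])
    # Only models that occur at more than one position are link candidates.
    return {model: ids for model, ids in groups.items() if len(ids) > 1}


records = [
    {"numeric": "8,191,752百万円", "dependency": "売上高", "position": "text_1"},
    {"numeric": "8,191,752百万円", "dependency": "売上高", "position": "table_1"},
    {"numeric": "152,967百万円", "dependency": "営業利益", "position": "table_2"},
]
link_candidates = group_positions(records)
```

Exact matching is the simplest choice. A real system might first normalize number formats so that, for example, full-width and half-width digits compare equal.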

The link setting program 221 sets links based on the position IDs of these abstract model data (step S2015). For example, a link is set between the α portion and the β portion shown in FIG. 22B using XLink, based on the IDs “br2003” and “ifo_2003”. XLink is a linking language that expresses links by means of URIs.
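As a rough illustration of step S2015, a simple XLink can be attached to an element with `xml.etree.ElementTree`. The attributes `xlink:type` and `xlink:href` and the namespace URI are standard XLink. The use of a same-document fragment (`#id`) as the link target, the element content, the placeholder IDs, and the helper name `add_simple_link` are assumptions.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)  # serialize with the usual "xlink" prefix


def add_simple_link(element, target_id):
    """Mark element as a simple XLink pointing at target_id in the same document."""
    element.set(f"{{{XLINK_NS}}}type", "simple")
    element.set(f"{{{XLINK_NS}}}href", f"#{target_id}")
    return element


# Hypothetical source element and target ID.
source = ET.fromstring('<BusinessResults id="text_1">Sales rose.</BusinessResults>')
linked = ET.tostring(add_simple_link(source, "table_1"), encoding="unicode")
```

The serialized string in `linked` carries the `xlink:type` and `xlink:href` attributes, which an XLink-aware viewer could use to jump from the text sentence to the table.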

As a result, as shown in FIG. 25, when part of a table (the hatched portion X) is clicked while the document is being viewed, the related text sentence can be displayed in a pop-up window Y.

In FIG. 25, when a text sentence is displayed in the pop-up window Y (that is, when the link destination is a text sentence), it is also possible to extract only the additional information and display only the parts “the current consolidated fiscal year” and “an increase of 2% compared with the previous consolidated fiscal year”. This extraction of additional information can be performed by the processing already described with reference to FIG. 17 and related figures.

As described above, links can easily be set for each part of the document data, such as text sentences, by using abstract model data generated from a document containing numerical information data.

In the above embodiment, links are set between text sentences and tables, but links may also be set between text sentences and images. An image, however, must have a data structure from which abstract model data can at least be generated based on attribute data such as a caption. Links may also be set between text sentences.

The above embodiment described the case of two matching abstract models. When three or more are found, one-to-many links are set for each item of target data. For example, XLink may be used so that all of the link destinations are displayed on the screen, or a function that lists the link destinations and lets the user select the desired one may be written in a general-purpose language such as Java.
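The one-to-many case can be sketched as a mapping from each position to all other positions that share the same abstract model data. The function name `link_targets` and the position IDs are hypothetical, and how the destinations are presented (all at once or as a selectable list) is left to the viewer.

```python
def link_targets(position_ids):
    """For each position, list every other position sharing the same abstract model."""
    return {
        src: [dst for dst in position_ids if dst != src]
        for src in position_ids
    }


# Three elements found with identical abstract model data (placeholder IDs).
targets = link_targets(["text_1", "table_1", "text_2"])
```

With two positions this reduces to the one-to-one link of the main example.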

In the above embodiment, the tags in the XML document (file contents) carry no expression format attribute, but they may be written in advance with an attribute giving the expression format (text sentence, table, and so on) or an attribute indicating that the element is an abstraction processing element. This makes it possible to determine the expression format and to extract the abstraction processing elements without the expression format correspondence table 225 shown in FIG. 23.

3. Third embodiment [Document search device]
In the above embodiment, the abstract model data generated from numerical information data was used to set links, but it may also be used in a document search device. A device that uses the abstract model generation device of the first embodiment to search document files for elements such as text sentences containing numerical information data is described below with reference to FIGS. 26 and 28. This embodiment describes the case where the document data contains only text sentences.

3-1. Functional block diagram
FIG. 26 shows a functional block diagram of the document search device according to this embodiment. In this figure, the document search device according to the present invention includes search element input means 2501, abstract model data extraction means (table) 2503, abstract model holding means 2505, abstract model recording means 2507, abstract model comparison means 2509, and search result output means 2511. The abstract model recording means 2507 holds abstract model data generated in advance for the abstraction processing elements in the entire document data.

A search element (text sentence) is input to the search element input means 2501 for each abstraction processing element and given to the abstract model generation means (text sentence) 2503. On receiving an abstraction processing element, the abstract model generation means (text sentence) 2503 extracts the sentence elements in a dependency relationship with the numerical information data as abstract model data and gives them to the abstract model holding means 2505 together with position information data. The abstract model data is generated by the processing shown in the abstract model data generation flowchart of FIG. 3 and related figures.

The abstract model comparison means 2509 searches the abstract model recording means 2507 and compares the abstract model data generated in advance from the entire document data with the abstract model data of the search target received from the abstract model generation means (text sentence) 2503. It thereby obtains the position information data of the identical abstract model data and provides it to the search result output means 2511.

Based on that position information data, the search result output means 2511 outputs the abstraction processing elements whose abstract model data is identical as the search result.
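The comparison performed for a search can be sketched as a lookup in an index built in advance from the whole document. The dictionary-based index, the record keys, the helper names `build_index` and `search`, and the position IDs are assumptions, and matching is again by exact equality of the abstract model data.

```python
def build_index(records):
    """Index pre-generated abstract model data by (dependency, numeric)."""
    index = {}
    for r in records:
        index.setdefault((r["dependency"], r["numeric"]), []).append(r["position"])
    return index


def search(index, query_models):
    """Return positions whose abstract model data equals any query model."""
    hits = []
    for q in query_models:
        for position in index.get((q["dependency"], q["numeric"]), []):
            if position not in hits:  # keep first-seen order, no duplicates
                hits.append(position)
    return hits


# Abstract model data recorded in advance (placeholder position IDs).
index = build_index([
    {"numeric": "8,191,752百万円", "dependency": "売上高", "position": "doc1#p3"},
    {"numeric": "152,967百万円", "dependency": "営業利益", "position": "doc1#p7"},
    {"numeric": "8,191,752百万円", "dependency": "売上高", "position": "doc2#p1"},
])
# Abstract model data extracted from the search element.
query = [{"numeric": "8,191,752百万円", "dependency": "売上高"}]
hits = search(index, query)
```

Building the index once and reusing it across queries matches the description's statement that the abstract model data for the whole document is recorded before searching.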

3-2. Hardware configuration
FIG. 27 shows an example of a hardware configuration in which the document search device shown in FIG. 26 is realized using a CPU. The document search device shown in FIG. 27 includes a display 201, a CPU 203, a memory 205, a keyboard/mouse 207, a hard disk 209, a CD-ROM drive 211, and a communication circuit 215.

ハードディスク２０９には、文書検索プログラム２３１が記録されている。文書検索プログラム２３１は、数値情報データを含むテキスト文について抽象化モデルデータ生成処理（図３に示す）を行うための抽象化モデルデータ生成スクリプト２２３、および生成した抽象化モデルデータを記録するための抽象化モデルDB２２７を備えている。なお、この実施形態では、文書データにテキスト文だけが含まれることとしたので、抽象化処理要素の表現形式（テキスト文、テーブルなど）を判別するための表現形式対応表２２５（図２０）は備えていないが、複数の表現形式がテキスト文データに存在する場合には備えるようにしてもよい。 A document search program 231 is recorded on the hard disk 209. The document search program 231 records an abstract model data generation script 223 for performing abstract model data generation processing (shown in FIG. 3) for a text sentence including numerical information data, and the generated abstract model data. An abstraction model DB 227 is provided. In this embodiment, since only the text sentence is included in the document data, the expression format correspondence table 225 (FIG. 20) for determining the expression format (text sentence, table, etc.) of the abstraction processing element is Although not provided, it may be provided when a plurality of expression formats exist in the text sentence data.

3-3. Specific Example of Document Search Processing
The process of searching files using abstract model data generated from document data is described with reference to the flowchart shown in FIG. 28. To simplify the explanation, the XML-format files (XML files) to be searched in the following example are assumed to contain only text sentences, but the process can also be carried out when tables and the like are included.

Before the document search process is performed, all abstract model data for the entire document data to be searched has been recorded in the abstract model DB 227. Recording in the abstract model DB 227 may take place in real time or at the time of document verification. Specifically, abstract model data for the document data to be searched (for example, the XML file shown in FIG. 22) is generated by the same processing as steps S301 to S317 shown in FIG. 3 and recorded in advance.

First, input of the text sentence to be searched for is received (step S2601). Then, on receipt of a search start input (such as pressing a search button), the XML data of the input text sentence is output to the abstract model data generation script 223 (step S2603).

The abstract model data generation script 223 extracts abstract model data from the text sentence it receives and outputs the abstract model data and its position ID to the memory 205 (step S2605). The memory 205 holds the abstract model data and its position ID (step S2607).

The document search program 231 searches the abstract model DB 227, in which abstract model data extracted from the document data is recorded, compares its contents with the abstract model data read from the memory 205, and extracts the IDs of the data whose abstract model data is the same (step S2609).

The document search program displays the text sentences corresponding to the IDs of these abstract model data as the search result (step S2611). For example, the portions identified as search results are highlighted.

As described above, by generating abstract model data from the large number of documents to be searched and comparing it with abstract model data generated from an input text sentence or the like, a document search apparatus that differs from simple matching can be realized.
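The flow of steps S2601 to S2611 might be sketched in Python as follows. The `extract_model` function is a deliberately crude, hypothetical stand-in for the abstract model data generation process of FIG. 3 (it uses a regular expression on English text rather than morphological and dependency analysis), and the file names, sentences, and ID format are invented for the example.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "of", "is", "was"}
NUMBER = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z%]+)")


def extract_model(sentence):
    """Toy stand-in for the FIG. 3 process: return (dependency, numeric) or None."""
    match = NUMBER.search(sentence)
    if match is None:
        return None  # no numerical information data in this sentence
    numeric = f"{match.group(1)} {match.group(2).lower()}"
    rest = sentence[:match.start()] + " " + sentence[match.end():]
    words = re.findall(r"[A-Za-z]+", rest.lower())
    dependency = frozenset(w for w in words if w not in STOPWORDS)
    return dependency, numeric


def build_index(documents):
    """Generate abstract model data for every sentence in advance (the DB 227 role)."""
    index = defaultdict(list)
    for doc_id, sentences in documents.items():
        for n, sentence in enumerate(sentences):
            model = extract_model(sentence)
            if model is not None:
                index[model].append(f"{doc_id}#s{n}")
    return index


def search(index, query_text):
    """Steps S2601 to S2609: model the query, then look up identical models."""
    model = extract_model(query_text)
    return list(index.get(model, [])) if model is not None else []


documents = {
    "spec.xml": ["The price of product A is 100 yen.", "Product B weighs 3 kg."],
    "memo.xml": ["Product A price: 100 yen", "The price of product A is 120 yen."],
}
index = build_index(documents)
results = search(index, "price of product A is 100 yen")
print(results)
```

In this sketch the second hit is worded differently from the query, so a plain string match would miss it, while the sentence with "120 yen" is excluded because its numerical information differs.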

In the above embodiment, abstract model data is generated in advance for the entire document and recorded in the abstract model DB 227, but it may instead be extracted each time a search is performed.

In the above embodiment, document data containing only text sentences is the search target, but document data containing tables, images, and the like may also be searched.

4. Fourth Embodiment [Document Editing Apparatus with Verification Function]
The abstract model data generated from the numerical information data described above may also be used as a verification device in a document editing program. An apparatus that uses the abstract model generation apparatus of the first embodiment to verify a document with respect to elements such as text sentences containing numerical information data is described below with reference to FIGS. 29 and 31. This embodiment describes the case where the document data contains only text sentences.

4-1. Functional Block Diagram
FIG. 29 shows a functional block diagram of the document input verification apparatus according to this embodiment. As shown in the figure, the document input verification apparatus according to the present invention comprises verification element input means 2701, element extraction means 2702, abstract model generation means (text sentence) 2703, abstract model holding means 2705, abstract model recording means 2707, abstract model discrimination means 2709, and input error output means 2711. The abstract model recording means 2707 always holds abstract model data that has been generated and accumulated for the document data input to the document editing apparatus.

The verification element input means 2701 receives document data (text sentences) from a document editing apparatus such as a word processor or editor and passes it to the element extraction means 2702. The element extraction means 2702 extracts abstraction processing elements from the input document data. The abstract model generation means (text sentence) 2703 extracts, from each abstraction processing element, the sentence elements that have a dependency relation with the numerical information data as abstract model data and passes them, together with the position information data, to the abstract model holding means 2705. The abstract model data is generated by the process shown in the abstract model data generation flowchart of FIG. 3 and related figures.

The abstract model discrimination means 2709 searches the abstract model recording means 2707 and determines whether there is an abstraction processing element whose abstract model data has the same dependency information as, but different numerical information data from, the abstract model data received from the abstract model holding means 2705. It passes the determination result to the input error output means 2711.

Based on the determination result obtained from the abstract model discrimination means 2709, the input error output means 2711 outputs input error information to the document editing apparatus.
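The test applied by the abstract model discrimination means 2709 can be illustrated with a short Python sketch. The tuple layout and the three-way `Verdict` are assumptions for illustration; the specification itself only asks whether a model with the same dependency information and different numerical information exists.

```python
from enum import Enum


class Verdict(Enum):
    NO_MATCH = "no recorded element shares the dependency information"
    CONSISTENT = "same dependency information, same numerical information"
    CONFLICT = "same dependency information, different numerical information"


def discriminate(recorded, candidate):
    """recorded: iterable of (dependency, numeric); candidate: (dependency, numeric)."""
    dependency, numeric = candidate
    same_dependency = [n for d, n in recorded if d == dependency]
    if not same_dependency:
        return Verdict.NO_MATCH
    if any(n != numeric for n in same_dependency):
        return Verdict.CONFLICT   # this is the case reported as an input error
    return Verdict.CONSISTENT


recorded = [
    (("price", "product A"), "100 yen"),
    (("weight", "product B"), "3 kg"),
]
typo = discriminate(recorded, (("price", "product A"), "1000 yen"))
same = discriminate(recorded, (("price", "product A"), "100 yen"))
new = discriminate(recorded, (("height", "product C"), "2 m"))
print(typo.name, same.name, new.name)
```

Only the `CONFLICT` outcome corresponds to the condition that leads the input error output means 2711 to report an error.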

4-2. Hardware Configuration
FIG. 30 shows an example of a hardware configuration in which the document input verification apparatus shown in FIG. 29 is realized using a CPU. The document input verification apparatus shown in FIG. 30 includes a display 201, a CPU 203, a memory 205, a keyboard/mouse 207, a hard disk 209, a CD-ROM drive 211, and a communication circuit 215.

A document editing application 241 and a document verification program 243 are recorded on the hard disk 209. The document verification program 243 receives text sentence data from the document editing application 241 and outputs the verification result. As shown in FIG. 30, the document verification program 243 includes an abstract model data generation script 223, which performs the abstract model data generation process (shown in FIG. 3) on text sentences containing numerical information data, and an abstract model DB 227, which records the generated abstract model data. In this embodiment the document data is assumed to contain only text sentences, so the expression format correspondence table 225 (FIG. 20), which is used to determine the expression format (text sentence, table, etc.) of an abstraction processing element, is not provided. It may be provided when a plurality of expression formats exist in the text sentence data.

4-3. Specific Example of Document Verification Processing
The process of outputting an input error when numerical information data is entered incorrectly during document editing is described with reference to the flowchart shown in FIG. 31. To simplify the explanation, the XML-format files (XML files) to be verified in the following example are assumed to contain only text sentences, but the process can also be carried out when tables and the like are included.

Before the document verification process is performed, all abstract model data for the document data that has been input to the document editing apparatus has been recorded in the abstract model DB 227. Recording in the abstract model DB 227 may take place in real time or at the time of document verification. Specifically, abstract model data for the document data to be searched (for example, the XML file shown in FIG. 22) is generated by the same processing as steps S301 to S317 shown in FIG. 3 and is constantly updated.

First, the information input from a document editing apparatus, such as a word processor for editing XML documents, is acquired (step S2801). Then, on receipt of a verification start input (such as pressing a search button), abstraction processing elements are extracted from the input group of text sentences and output to the abstract model data generation script 223 (step S2803).

The abstract model data generation script 223 extracts abstract model data from the abstraction processing elements it receives and outputs the abstract model data and its position ID to the memory 205 (step S2805). The memory 205 holds the abstract model data and its position ID (step S2807).

The abstract model data discrimination program 241 searches the abstract model DB 227 to determine whether there is abstract model data that has the same dependency information as, but different numerical information data from, the abstract model data read from the memory 205 (similar abstract model data), and outputs the result to the document verification program 241 (step S2809).

If similar abstract model data exists, the document verification program 241 outputs input error information to the document editing apparatus (step S2811). As a result, for example, a warning window alerting the user to a possible input error is displayed on the screen of the word processor or the like. If the position ID and numerical information data are obtained from the abstract model DB 227 when similar abstract model data is found, the relevant location and the value to be corrected can be displayed at the same time.
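A minimal Python sketch of steps S2809 to S2811 follows, including the option of reporting the location and the recorded value. The `VerificationDB` class, its method names, the position IDs, and the message wording are all hypothetical; keying entries by position ID is just one assumed way of keeping the recorded data current as the document is edited.

```python
class VerificationDB:
    """Hypothetical in-memory stand-in for the abstract model DB 227."""

    def __init__(self):
        self._by_position = {}   # position ID -> (dependency, numeric)

    def upsert(self, position_id, dependency, numeric):
        """Record or overwrite the model for one element, keeping the DB current."""
        self._by_position[position_id] = (dependency, numeric)

    def check(self, position_id, dependency, numeric):
        """Return one warning per recorded element that conflicts with the input."""
        warnings = []
        for other_id, (other_dep, other_num) in self._by_position.items():
            if other_id == position_id:
                continue  # do not compare an element with its own earlier version
            if other_dep == dependency and other_num != numeric:
                warnings.append(
                    f"Possible input error at {position_id}: '{numeric}' "
                    f"differs from '{other_num}' recorded at {other_id}"
                )
        return warnings


db = VerificationDB()
db.upsert("p1", ("price", "product A"), "100 yen")
db.upsert("p2", ("weight", "product B"), "3 kg")

warnings = db.check("p5", ("price", "product A"), "1000 yen")
for line in warnings:
    print(line)
```

An editor integration could show each returned string in a warning window; an empty list means no similar abstract model data was found.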

As described above, by generating abstract model data from the contents of the document already input to the document editing apparatus and using it for document verification, erroneous input of numerical information data can be recognized easily.

In the above embodiment, abstract model data is extracted in advance for the entire document and recorded in the abstract model DB 227, but it may instead be extracted from the already input document when the search is executed.

In the above embodiment, document data containing only text sentences is the verification target, but document data containing other elements such as tables and images may also be verified.

5. Other Embodiments
Even when the text sentences in a document file are written in more than one language, the language of the abstract model data can be unified by generating the abstract model data with a processing method suited to each language and translating the dependency information using a dictionary or the like.
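One way to picture this unification is a dictionary lookup applied to the dependency information before models are compared. The sketch below is an assumption-laden illustration: the `TERM_DICTIONARY` entries and the choice of lower-case English as the common language are invented for the example.

```python
# Hypothetical bilingual dictionary mapping source-language terms to a common language.
TERM_DICTIONARY = {
    "価格": "price",
    "製品A": "product a",
    "重量": "weight",
}


def unify(dependency, dictionary=TERM_DICTIONARY):
    """Translate each dependency term to the common language; terms without a
    dictionary entry are only lower-cased."""
    return frozenset(dictionary.get(term, term.lower()) for term in dependency)


japanese_model = (unify({"価格", "製品A"}), "100 yen")
english_model = (unify({"Price", "Product A"}), "100 yen")
print(japanese_model == english_model)
```

Once the dependency information is expressed in one language, models generated from sentences in different languages can be compared with the same equality test used elsewhere.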

In the above embodiment, XML-format files are used as the documents subjected to abstraction processing, but files in other formats, such as HTML, may be used.

In the above embodiment, the processing is realized by software using the CPU 203. However, some or all of it may be realized by hardware such as logic circuits. Part of the program's processing may also be delegated to the operating system (OS).

Example of a functional block diagram of the information extraction apparatus in an embodiment of this invention.
Example of a hardware configuration diagram of the abstract model generation apparatus of this invention.
Example of a flowchart of the "abstract model extraction process" of this invention.
Example of a flowchart of the "numerical information data extraction process" of this invention.
Example of a flowchart of the "text sentence division process" of this invention.
Example of a flowchart of the "dependency information extraction process" of this invention.
Example of "morphological analysis" or "syntactic analysis" in this invention.
Example of "morphological analysis data" or "syntactic analysis data" in this invention.
Example of "syntactic analysis data" in this invention.
Example of "grammar definition information" in this invention.
Example of determining the "nominative phrase" or "predicate" in this invention.
Example of "analysis result data" in this invention.
Example of a functional block diagram of the abstract model generation apparatus in an embodiment of this invention.
Example of a flowchart of the "abstract model extraction process" of this invention.
Example of a flowchart of the "extraction target sentence determination process" of this invention.
Example of "dividing a complex sentence into a main sentence and a sub-sentence" in this invention.
Example of "dividing a compound sentence into a first-half sentence and a second-half sentence" in this invention.
Example of a functional block diagram of the abstract model generation apparatus in an embodiment of this invention.
Example of a flowchart of the "abstract model extraction process" of this invention.
Example of a flowchart of the "additional information extraction process" of this invention.
Example of extracting additional information from a text sentence in this invention.
Example of a functional block diagram of the link setting apparatus in an embodiment of this invention.
Example of a hardware configuration diagram of the link setting apparatus of this invention.
Example of a flowchart of the "link setting process" of this invention.
Example of an XML file in this invention.
Example of the expression format correspondence table of this invention.
Example of data in the abstract model DB of this invention.
Example of using a link set by this invention.
Example of a functional block diagram of the document search apparatus in an embodiment of this invention.
Example of a hardware configuration diagram of the document search apparatus of this invention.
Example of a flowchart of the "document search process" of this invention.
Example of a functional block diagram of the document input verification apparatus in an embodiment of this invention.
Example of a hardware configuration diagram of the document input verification apparatus of this invention.
Example of a flowchart of the "document input verification process" of this invention.

Explanation of symbols

101 ... Extraction means
103 ... Numerical information data extraction means
105 ... Text sentence division means
106 ... Extraction target sentence determination means
107 ... Dependency information extraction means
108 ... Additional information extraction means
109 ... Abstract model output means

Claims

A link setting device for setting a link between target data including related sentence element data in document data,
Extraction/analysis means for dividing text sentence data in given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
Abstract model recording means for recording, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
Abstraction model selection means for selecting abstract model data having the same numerical information data and the same dependency information data among a plurality of abstract model data recorded in the recording unit;
Link setting means for setting a link between target data of the abstraction model data based on the position information data of the selected abstraction model data;
A link setting device comprising:

A program for realizing, using a computer, a link setting device for setting a link between target data including related sentence element data in document data, the program causing the computer to function as the following means:
A) Extraction/analysis means for dividing text sentence data in given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
B) Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
C) Abstract model recording means for recording, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
D) Abstract model selection means for selecting abstract model data having the same numerical information data and the same dependency information data from among a plurality of abstract model data recorded in the recording unit;
E) Link setting means for setting a link between target data of the abstraction model data based on the position information data of the selected abstraction model data.

In the link setting device or the link setting program according to claim 1 or 2,
The link setting means sets the link so that, when the target data at the link destination is a text sentence, only the additional information is extracted from the text sentence at the link destination.

A document search device that searches document data based on target data that is a search element,
Extraction/analysis means for dividing text sentence data in given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
Abstract model holding means for holding the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
Abstract model comparison means for searching abstract model recording means, in which abstract model data generated from the document data is recorded, comparing the recorded data with the abstract model data received from the abstract model holding means, and extracting only the position information data of abstract model data having the same numerical information data and the same dependency information data;
Search result display means for displaying, as search results, target data having the same abstract model data, based on the positional information data of the extracted abstract model data;
A document retrieval apparatus comprising:

A program for realizing, using a computer, a document search apparatus that searches document data based on target data that is a search element, the program causing the computer to function as the following means:
A) Extraction/analysis means for dividing text sentence data in given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
B) Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
C) Abstract model holding means for holding, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
D) Abstract model comparison means for searching abstract model recording means, in which abstract model data generated from the document data is recorded, comparing the recorded data with the abstract model data received from the abstract model holding means, and extracting only the position information data of abstract model data having the same numerical information data and the same dependency information data;
E) Search result display means for displaying target data having the same abstract model data as a search result based on the extracted position information data of the abstract model data.

In the document search device or document search program according to any one of claims 4 and 5,
In the abstract model recording means, abstract model data for the entire document data to be searched is generated and recorded before the search is performed.

A document input verification device for verifying document data input from a document editing device based on target data of a verification element,
Extraction/analysis means for dividing text sentence data in the document data given from the document editing device into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
Abstract model holding means for holding the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
Abstract model discrimination means for searching abstract model recording means, in which abstract model data generated from document data is recorded, and determining whether there is abstract model data whose dependency information data is the same as, and whose numerical information data is different from, that of the abstract model data received from the abstract model holding means;
An input error output means for outputting input error information to the document editing device based on the determination result obtained from the abstract model determination means;
A document input verification device comprising:

A program for realizing, using a computer, a document input verification device for verifying document data input from a document editing device based on target data of a verification element, the program causing the computer to function as the following means:
A) Extraction/analysis means for dividing text sentence data in the document data given from the document editing device into a plurality of sentence element data, determining the type of each sentence element and recording it in a recording unit, determining whether the given text sentence data includes numerical information data, and, if so, identifying the numerical information data and recording it in the recording unit;
B) Dependency information extraction means for referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the types of the sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
C) Abstract model holding means for holding, in the recording unit, the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means as abstract model data, together with position data indicating the position in the document data;
D) Abstract model discrimination means for searching abstract model recording means, in which abstract model data generated from document data is recorded, and determining whether there is abstract model data whose dependency information data is the same as, and whose numerical information data is different from, that of the abstract model data received from the abstract model holding means;
E) Input error output means for outputting input error information to the document editing apparatus based on the determination result acquired from the abstract model determination means.
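
By way of a non-limiting illustration, means A) to C) might be sketched as follows. The claims rely on a determination of sentence element types (morphological analysis), which is not reproduced here; as a loudly simplified assumption, this sketch splits English text on whitespace and treats the tokens adjacent to a number as its dependency information. The names `AbstractModel`, `extract_models`, and `window` are hypothetical and do not appear in the specification.

```python
import re
from dataclasses import dataclass

# Hypothetical record type; the field names are illustrative only.
@dataclass(frozen=True)
class AbstractModel:
    dependency: tuple  # sentence elements assumed to depend on the number
    number: str        # numerical information data, e.g. "50"
    position: int      # character offset of the number in the document

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def extract_models(text, window=2):
    """Return one AbstractModel per purely numeric token found in text.

    Assumption: up to `window` non-numeric tokens before the number, plus
    the token right after it (typically a unit), stand in for the
    dependency analysis that the claims base on sentence element types.
    """
    tokens = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
    models = []
    for i, (tok, pos) in enumerate(tokens):
        if not NUMBER.fullmatch(tok):
            continue
        before = [t for t, _ in tokens[max(0, i - window):i]
                  if not NUMBER.fullmatch(t)]
        after = [t for t, _ in tokens[i + 1:i + 2]
                 if not NUMBER.fullmatch(t)]
        models.append(AbstractModel(tuple(before + after), tok, pos))
    return models
```

A production implementation would replace the whitespace split with a real morphological and dependency analyser; only the shape of the resulting record (dependency information, numerical information, position) is meant to mirror the claim.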

The document input verification device according to claim 7 or 8, wherein the abstract model data of the abstract model recording means is generated as needed based on the input from the verification element input means and is thereby kept constantly up to date.

The apparatus or program according to any one of claims 1 to 9, wherein the document data includes an element to be processed other than text sentence data, the apparatus or program further comprising:
Expression format determination means for determining the expression format of the processing target element before the abstract model data is extracted;
Abstract model generation means for extracting, when the expression format of the target data is a text sentence, a sentence element having a dependency relationship with the numerical information data from the processing target element as abstract model data together with the position information data; and
Abstract model generation means for generating, when the expression format of the target data is another element, abstract model data based on a predetermined rule and supplying it to the abstract model recording means together with the position information data.

11. The apparatus or program according to claim 10, wherein the other target data is table data or image data.

12. The apparatus or program according to claim 1, wherein the document data is described in an XML format, and an attribute indicating the expression format is attached in advance to each tag included in the file content.
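
As an illustrative sketch only, the expression format determination for XML document data could read a per-tag attribute and apply a format-specific rule. The attribute name `format`, its values, the sample markup, and the table rule (each row yields a name and a value) are all assumptions made for this example; the claims do not specify them.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the attribute name "format" and its values are assumed.
SAMPLE = """<doc>
  <p format="text">The maximum speed is 50 km/h</p>
  <t format="table"><row><name>maximum speed</name><value>50</value></row></t>
</doc>"""

def classify_elements(xml_text):
    """Return (tag, expression format) pairs for the children of the root.

    Elements without the assumed attribute are treated as text sentences.
    """
    root = ET.fromstring(xml_text)
    return [(child.tag, child.get("format", "text")) for child in root]

def table_models(table):
    """Assumed 'predetermined rule' for tables: each row yields (name, value)."""
    return [(row.findtext("name"), row.findtext("value"))
            for row in table.iter("row")]
```

Elements classified as `text` would go through the sentence-based extraction, while `table` elements would go through a rule such as `table_models`.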

The apparatus or program according to any one of claims 1 to 12, further comprising abstract model translation means for unifying the abstract model data into a single language by referring to a translation dictionary when the text sentences included in the document data are written in a plurality of languages.
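
A minimal sketch of such unification, assuming the translation dictionary is a simple term-to-term mapping: the dictionary contents and the name `unify_language` are hypothetical and serve only to show dependency information being normalised before comparison.

```python
# Hypothetical translation dictionary mapping Japanese terms to English ones.
TRANSLATION = {"最高速度": "maximum speed", "重量": "weight"}

def unify_language(dependency, dictionary=TRANSLATION):
    """Map each dependency term to the common language.

    Terms missing from the dictionary are assumed to be in the target
    language already and are passed through unchanged.
    """
    return tuple(dictionary.get(term, term) for term in dependency)
```

After unification, abstract model data extracted from Japanese and English sentences can be compared with plain equality.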

A link setting method for setting a link between target data including related sentence element data in document data,
An extraction / analysis step of dividing the text sentence data in the given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in the recording unit, determining whether or not numerical information data is included in the given text sentence data, and, if included, identifying the numerical information data and recording it in the recording unit;
A dependency information extraction step of referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the type of each sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
An abstract model data recording step of recording, in the recording unit, the numerical information data extracted in the numerical information extraction step and the dependency information data extracted in the dependency information extraction step as abstract model data, together with position data indicating the position in the document data;
An abstract model data selection step for selecting abstract model data having the same numerical information data and the same dependency information data from among a plurality of abstract model data recorded in the recording unit;
A link setting step for setting a link between target data of the abstraction model data based on the positional information data of the selected abstraction model data;
A link setting method characterized by comprising:
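
The selection and link setting steps can be illustrated, under stated assumptions, by grouping recorded abstract model data on the pair (dependency information, numerical information) and linking the positions within each group. Representing each record as a `(dependency, number, position)` tuple and the function name `set_links` are choices made for this sketch, not details from the specification.

```python
from collections import defaultdict
from itertools import combinations

def set_links(models):
    """Return sorted position pairs whose dependency and number both match.

    models: iterable of (dependency, number, position) tuples (assumed shape).
    """
    groups = defaultdict(list)
    for dependency, number, position in models:
        groups[(dependency, number)].append(position)
    links = []
    for positions in groups.values():
        # Every pair of positions inside one group becomes a link.
        links.extend(combinations(sorted(positions), 2))
    return sorted(links)
```

For example, two occurrences of `("speed",)` with the number `"50"` at positions 10 and 120 yield the single link `(10, 120)`, while an occurrence with `"60"` is left unlinked.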

A document search method for searching document data based on target data that is a search element,
An extraction / analysis step of dividing the text sentence data in the given document data into a plurality of sentence element data, determining the type of each sentence element and recording it in the recording unit, determining whether or not numerical information data is included in the given text sentence data, and, if included, identifying the numerical information data and recording it in the recording unit;
A dependency information extraction step of referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the type of each sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
An abstract model data holding step of holding, in the recording unit, the numerical information data extracted in the numerical information extraction step and the dependency information data extracted in the dependency information extraction step as abstract model data, together with position data indicating the position in the document data;
An abstract model data comparison step of searching abstract model recording means, which has generated and recorded abstract model data from document data, comparing the recorded abstract model data with the abstract model data held in the abstract model data holding step, and extracting only the position information data of abstract model data having the same numerical information data and the same dependency information data;
A search result display step for displaying, as search results, target data having the same abstract model data, based on the positional information data of the extracted abstract model data;
A document search method characterized by comprising:
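
As a hedged sketch of the comparison step, the search can be reduced to an equality test on dependency information and numerical information, returning only positions. The tuple layout and the name `search` are assumptions made for this illustration.

```python
def search(query, recorded):
    """Return positions of recorded models that match the query exactly.

    query:    (dependency, number) taken from the search element.
    recorded: list of (dependency, number, position) tuples (assumed shape).
    """
    q_dep, q_num = query
    return [pos for dep, num, pos in recorded
            if dep == q_dep and num == q_num]
```

The returned positions are what the search result display step would use to locate and present the matching target data.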

A document input verification method for verifying document data input from a document editing device based on target data of a verification element,
An extraction / analysis step of dividing the text sentence data in the document data given from the document editing device into a plurality of sentence element data, determining the type of each sentence element and recording it in the recording unit, determining whether or not numerical information data is included in the given text sentence data, and, if included, identifying the numerical information data and recording it in the recording unit;
A dependency information extraction step of referring to the recording unit, identifying the sentence element data that includes the numerical information data, determining, based on the type of each sentence element data, at least one sentence element data having a dependency relation with the numerical information data, and extracting dependency information data from the determined sentence element data;
An abstract model data holding step of holding, in the recording unit, the numerical information data extracted in the numerical information extraction step and the dependency information data extracted in the dependency information extraction step as abstract model data, together with position data indicating the position in the document data;
An abstract model data determination step of searching abstract model recording means, which has generated and recorded abstract model data from document data, and determining whether there is abstract model data whose dependency information data is the same as, and whose numerical information data is different from, that of the abstract model data held in the abstract model data holding step;
An input error output step of outputting input error information to the document editing device based on the determination result obtained in the abstract model data determination step;
A document input verification method characterized by comprising:
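
The determination and input error output steps might be sketched as follows: an entry is flagged when a recorded abstract model has the same dependency information but a different number. The tuple layout, the dictionary returned for each error, and the name `find_input_errors` are assumptions for illustration only.

```python
def find_input_errors(entered, recorded):
    """Flag entered models that contradict a recorded model.

    Both arguments are lists of (dependency, number, position) tuples
    (assumed shape). A conflict is the same dependency with a different number.
    """
    errors = []
    for dep, num, pos in entered:
        for r_dep, r_num, r_pos in recorded:
            if dep == r_dep and num != r_num:
                errors.append({
                    "position": pos,            # where the suspect value was typed
                    "entered": num,
                    "recorded": r_num,
                    "recorded_position": r_pos,
                })
    return errors
```

If the recorded data holds a maximum speed of 50 and the editor receives a maximum speed of 60, the sketch reports the new entry together with the position of the conflicting record, which is the information an editing device would need to prompt the user.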