JP4458517B2

JP4458517B2 - Information extraction apparatus and method

Info

Publication number: JP4458517B2
Application number: JP2003389023A
Authority: JP
Inventors: 高広三浦
Original assignee: 株式会社日立システムアンドサービス
Priority date: 2003-11-19
Filing date: 2003-11-19
Publication date: 2010-04-28
Anticipated expiration: 2023-11-19
Also published as: JP2005149359A

Description

The present invention relates to an information extraction apparatus and method for extracting information, including numerical information, from given text sentence data.

A method is known for extracting numerical data from document data and then extracting the noun data corresponding to that numerical data (see, for example, Patent Document 1).

This method is briefly described below, taking as an example a document containing the sentence 「今日は２００円の豚肉と、２０００円のコップを買った。」 ("Today I bought 200-yen pork and a 2000-yen cup.").

First, the document data is scanned one character at a time, and each run of consecutive digits is extracted as numerical data. In the example, 「２００」 and 「２０００」 are extracted.

Next, the span of text up to the punctuation mark that contains each numerical value is extracted as numeric character string data. In the example, 「今日は２００円の豚肉と、」 ("Today, 200-yen pork and,") and 「２０００円のコップを買った。」 ("bought a 2000-yen cup.") are extracted.

The numeric character string data is then morphologically analyzed to extract nouns, and only those nouns that appear in a predetermined set of item classification character strings are kept as noun data. In the example, 「豚肉」 (pork) and 「コップ」 (cup) are extracted.

In this way, 「２００」 and 「２０００」 are obtained as numerical data, with 「豚肉」 and 「コップ」 as the respective corresponding noun data.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H8-329165 (JP-A-8-329165)

However, extracting numerical data and noun data by this method presupposes that the noun data lies near the numerical data. If the noun corresponding to a numerical value is separated from that value by a punctuation mark, the corresponding noun data cannot be extracted correctly.

For example, in the sentence 「今日買った豚肉は、２００円だった。」 ("The pork I bought today was 200 yen."), 「豚肉」 and 「２００」 are separated by a punctuation mark, so the numerical data and its corresponding noun data cannot be extracted correctly.
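The known method and its limitation can be illustrated with a minimal Python sketch. It is not the implementation of Patent Document 1: the function name `prior_art_extract` and the list `ITEM_NOUNS` are hypothetical, and a plain substring lookup stands in for the morphological analysis step.

```python
import re

# Hypothetical item classification list; the known method only accepts nouns found in it.
ITEM_NOUNS = ["豚肉", "コップ"]

def prior_art_extract(sentence):
    """Pair each digit run with an allowed noun from the same punctuation-delimited segment."""
    pairs = []
    # Each segment runs up to and including the next Japanese comma or full stop.
    for segment in re.findall(r"[^、。]+[、。]?", sentence):
        # In Python 3, \d also matches full-width digits such as ２.
        for number in re.findall(r"\d+", segment):
            # Simplification: substring lookup instead of morphological analysis.
            noun = next((n for n in ITEM_NOUNS if n in segment), None)
            pairs.append((number, noun))
    return pairs

print(prior_art_extract("今日は２００円の豚肉と、２０００円のコップを買った。"))
print(prior_art_extract("今日買った豚肉は、２００円だった。"))
```

For the first sentence each number is paired with the noun in its own segment. For the second, 「２００」 is paired with `None`, because 「豚肉」 lies on the other side of the comma.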

Furthermore, this method extracts as noun data only the nouns included in the predefined item classification character strings. An arbitrary noun that is not in those strings therefore cannot be extracted in association with numerical data.

Moreover, because only nouns are extracted in association with numerical data, extraction fails when the numerical data refers to a sentence element other than a noun.

The present invention has been made to solve these problems. Its object is to provide an information extraction apparatus that can accurately extract noun data even when the noun data corresponding to the numerical data is not nearby, can extract arbitrary noun data, and can also extract sentence elements other than nouns in association with numerical data.

(1)(2)(11) An information extraction apparatus or program according to the present invention extracts information data, including numerical information data, from given text sentence data, and comprises: extraction/analysis means that divides the given text sentence data into a plurality of sentence element data, determines the type of each sentence element and records it in a recording unit, determines whether the given text sentence data contains numerical information data and, if so, identifies the numerical information data and records it in the recording unit; dependency information extraction means that refers to the recording unit to identify the sentence element data containing the numerical information data, determines, based on the type of that sentence element data, at least one sentence element data in a dependency relationship with the numerical information data, and extracts dependency information data from the determined sentence element data; and output means that outputs the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means.

Therefore, even when the noun data corresponding to the numerical information data does not appear near that numerical information data, the sentence element data in a dependency relationship with the numerical information data is extracted, so the corresponding noun data can be extracted accurately.

(3) In the information extraction apparatus or program according to the present invention, the extraction/analysis means comprises: numerical information extraction means that determines whether the given text sentence data contains numerical information data and, if so, identifies, extracts, and records the numerical information data in the recording unit; and text sentence analysis means that divides the text sentence data into a plurality of sentence element data, determines the type of each sentence element, and records it in the recording unit. The dependency information extraction means identifies the sentence element data containing the numerical information data by determining which sentence element data contains the numerical information data recorded in the recording unit.

(4) In the information extraction apparatus or program according to the present invention, the extraction/analysis means comprises: text sentence analysis means that divides the text sentence data into a plurality of sentence element data, determines the type of each sentence element, and records it in the recording unit; and numerical information extraction means that determines whether each sentence element data contains numerical information data and, if so, extracts the sentence element data and the numerical information data in association with each other.

(5) In the information extraction apparatus or program according to the present invention, the text sentence analysis means divides the text sentence data into a plurality of sentence element data and determines the type of each sentence element data as a nominative phrase (主格句) if it contains a subject, as a predicate (述部) if it contains a predicate word, and as a subordinate phrase (従属句) if it contains neither.

Therefore, information can be extracted accurately based on the types of the sentence elements, regardless of the sentence structure of the text sentence data.

(6) In the information extraction apparatus or program according to the present invention, when the type of the sentence element data containing the numerical information data, or of the sentence element data extracted in association with the numerical information data, is a subordinate phrase, the dependency information extraction means determines whether the word modified by that sentence element data is a nominal (体言) or a declinable word (用言), and determines the sentence element data in a dependency relationship accordingly.

Therefore, the information that the numerical information refers to can be extracted more accurately.

(7) In the information extraction apparatus or program according to the present invention, when the word modified by the sentence element data containing the numerical information data is a nominal, the dependency information extraction means selects, from the sentence element data making up the text sentence data, the sentence element data contiguous with the one containing the numerical information data as the sentence element data in a dependency relationship. When the modified word is a declinable word, the dependency information extraction means selects the sentence element data other than the one containing the numerical information data as the sentence elements in a dependency relationship.

Therefore, the information that the numerical information data refers to can be extracted more accurately, which makes it possible to extract semantically richer information data.

(8) In the information extraction apparatus or program according to the present invention, predetermined numerical units are recorded in the recording unit in advance, and the extraction/analysis means determines whether numerical information data is present based on whether a character string identical to a numerical unit read from the recording unit appears in the text sentence data or the sentence element data.

Therefore, only the numerical information data of interest can be extracted from the text sentence data. For example, to obtain numerical information data about monetary amounts, numbers immediately preceded or followed by 「円」 (yen), 「￥」, or the like are extracted as numerical information data, and the corresponding dependency information data is extracted.

(9) The information extraction apparatus or program according to the present invention further comprises extraction target sentence determination means that determines the text sentence data to be extracted from, based on the sentence element data obtained by the text sentence analysis means and the numerical information data extracted by the numerical information extraction means.

Therefore, information can be extracted accurately not only when the text sentence data is a simple sentence but also when it is a compound sentence, a complex sentence, or a mixture of these containing two or more clauses.

(10) The information extraction apparatus or program according to the present invention comprises additional information extraction means that extracts, as additional information data, the sentence element data that remains after removing from the text sentence data the sentence element data containing the numerical information data and the sentence element data containing the dependency information data.

Therefore, additional information data related to the numerical information data and its dependency information data can be extracted together with them.

Embodiments of the present invention are described below with reference to the drawings.

1. First embodiment
1-1. Functional block diagram
FIG. 1 shows a functional block diagram of the information extraction apparatus according to this embodiment. The apparatus comprises cutout means 101, numerical information extraction means 103, text sentence analysis means 105, dependency information extraction means 107, and output means 109.

The cutout means 101 cuts text sentence data out of the input text document one sentence at a time and records it in the recording unit. For example, it splits the text using the Japanese full stop 「。」 appearing in the text document as the delimiter.

The numerical information extraction means 103 scans the text sentence data read from the recording unit from the beginning and determines whether it contains predetermined numerical information data. If it does, the means extracts the numerical information data and passes the text sentence to the text sentence analysis means.

The text sentence analysis means 105 performs morphological analysis on the text sentence data read from the recording unit, parses the resulting morphemes according to predetermined grammar definition information, and divides the text sentence data into predetermined sentence element data.

Based on the sentence element data that contains the extracted numerical information data, the dependency information extraction means 107 extracts, from the text sentence data divided into sentence element data, the other sentence element data that is in a dependency relationship with that numerical information data.

The output means 109 outputs the numerical information data extracted by the numerical information extraction means 103 and the dependency information data extracted by the dependency information extraction means 107, for example to a display or a database.

1-2. Hardware configuration
FIG. 2 shows an example hardware configuration in which the information extraction apparatus of FIG. 1 is implemented using a CPU. The apparatus comprises a display 201, a CPU 203, a memory 205, a keyboard/mouse 207, a hard disk 209, a CD-ROM drive 211, and a communication circuit 215.

The hard disk 209 stores an information extraction program 2091 for performing the information extraction processing according to the present invention, a morphological analysis dictionary 2093 for morphological analysis, grammar definition information 2094 for syntactic analysis, a numerical unit master 2095 for extracting predetermined numerical information data, an extraction result database 2097 for recording the results of the information extraction processing, and so on. These were installed by reading data recorded on a CD-ROM 212 via the CD-ROM drive 211.

Alternatively, the installation may use data downloaded from the Internet 216 or the like via the communication circuit 215.

1-3. Flowcharts
Processing based on the information extraction program 2091 is described with reference to the flowcharts of FIGS. 3 to 6. The description below takes as an example the case where a text document containing the sentence 「営業利益は前年度と同水準の３２７７６百万円となった。」 ("Operating income was 32,776 million yen, the same level as the previous fiscal year.") is input to the cutout means 101.

In the flowchart of the information extraction processing shown in FIG. 3, the CPU 203 reads the text document and loads it into the memory 205 (step S301).

One sentence of text sentence data is then cut from the beginning of the input text document (step S303). In this embodiment the full stop 「。」 is the sentence delimiter. For example, when the text above is read, its first sentence, 「営業利益は前年度と同水準の３２７７６百万円となった。」, is cut out.
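As an illustration of step S303, the following is a minimal Python sketch of cutting sentences at the full stop 「。」. The function name `cut_sentences` is hypothetical, and keeping trailing text that has no final 「。」 is an assumption, since the source does not say how such text is handled.

```python
def cut_sentences(document, delimiter="。"):
    """Cut a document into sentences, each ending with the delimiter."""
    sentences = []
    start = 0
    while start < len(document):
        end = document.find(delimiter, start)
        if end == -1:
            # Assumption: trailing text without a delimiter is kept as a final sentence.
            sentences.append(document[start:])
            break
        sentences.append(document[start:end + 1])
        start = end + 1
    return sentences

# The second sentence is a made-up example, not taken from the patent.
doc = "営業利益は前年度と同水準の３２７７６百万円となった。売上高は増加した。"
for sentence in cut_sentences(doc):
    print(sentence)
```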

The CPU 203 performs numerical information extraction processing on the cut-out text sentence data (step S305).

1-3-1. Numerical information extraction processing
In the flowchart of the numerical information extraction processing shown in FIG. 4, the CPU 203 obtains the unit characters for numerical information data from the numerical unit master 2095 (step S401). In this embodiment, the numerical unit master holds unit characters for monetary amounts such as 「百万円」 (million yen) and 「億円」 (hundred million yen).

The numerical unit master should be configured to suit the content of the text documents from which numerical information data is to be extracted. For securities reports and financial statements, monetary unit characters such as 「円」 (yen), 「億円」 (hundred million yen), 「百万円」 (million yen), and 「千円」 (thousand yen) are set; for historical chronologies and newspaper articles, date unit characters such as 「年」 (year), 「月」 (month), and 「日」 (day) are set. This way, only the information needed for a given kind of document is extracted.

If the numerical unit master is not used, the predetermined unit characters for numerical information data may instead be defined within the information extraction program 2091.

The CPU 203 searches the text sentence data from its beginning for the obtained unit characters (step S403). For example, it searches the sentence 「営業利益は、前年度と同水準の３２７７６百万円となった。」 from the beginning for the unit characters 「百万円」 or 「億円」.

If a predetermined unit character is found in the text sentence data (step S405, YES), the digits immediately preceding that unit character are extracted as a component of the numerical information data (step S407). If several digits occur in a row, the whole run of consecutive digits is concatenated into a single digit string and extracted as the component. The CPU 203 then concatenates the extracted component with the unit characters to form the numerical information data.

For the sentence above, 「３２７７６」 is extracted as the component of the numerical information data and concatenated with the unit characters 「百万円」, so that 「３２７７６百万円」 is extracted as the numerical information data and recorded in a predetermined area of the memory 205 or the hard disk 209.

If the text sentence data contains further unit characters (step S409, NO), the CPU 203 repeats step S407. If there are no further unit characters, the processing ends (step S409). The processing also ends if no unit character is found in step S405 (step S405, NO).
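Steps S401 to S409 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the program 2091 itself: the names `UNIT_MASTER` and `extract_numeric_info` are hypothetical, the list stands in for the numerical unit master 2095, and results are returned grouped by unit rather than in order of appearance.

```python
# Hypothetical stand-in for the numerical unit master 2095 (S401).
UNIT_MASTER = ["百万円", "億円"]

def extract_numeric_info(sentence, units=UNIT_MASTER):
    """Return digit runs that directly precede a unit string, joined with that unit."""
    results = []
    for unit in units:
        pos = sentence.find(unit)              # S403: search from the beginning
        while pos != -1:                       # S405: a unit string was found
            start = pos
            # S407: walk backwards over the digits in front of the unit.
            # str.isdecimal() is also True for full-width digits such as ３.
            while start > 0 and sentence[start - 1].isdecimal():
                start -= 1
            if start < pos:                    # keep only units preceded by digits
                results.append(sentence[start:pos] + unit)
            pos = sentence.find(unit, pos + len(unit))  # S409: look for another
    return results

print(extract_numeric_info("営業利益は、前年度と同水準の３２７７６百万円となった。"))
```

Passing a different unit list, for example `["年", "月", "日"]`, adapts the same routine to dates, in line with configuring the unit master per document type. In this sketch a shorter unit such as 「円」 does not match inside 「百万円」, because the character in front of it is not a digit.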

1-3-2. Text sentence analysis processing
If numerical information data was extracted in the numerical information extraction processing (FIG. 3, step S305) (step S307, YES), the CPU 203 performs text sentence analysis processing (step S309).

In the flowchart of the text sentence analysis processing shown in FIG. 5, the CPU 203 performs morphological analysis of the text sentence data using the morphological analysis dictionary 2093 (step S501). A morphological analyzer such as ChaSen (「茶筌」) from the Matsumoto Laboratory at the Nara Institute of Science and Technology may be used for this.

FIGS. 7, 7a, and 7b show an example of information extraction processing applied to the text sentence data 701, 「営業利益は、前年度と同水準の３２７７６百万円となった。」.

Through the morphological analysis of step S501, the CPU 203 divides the text sentence data 701 (FIG. 7) into morphemes, the smallest meaningful units of language.

For example, in FIG. 7 the text sentence data 701 is morphologically analyzed as shown at 703: 「営業利益（名詞）／は（助詞）／前年度（名詞）／と（助詞）／同水準（名詞）／の（助詞）／３２７７６百万円（名詞）／と（助詞）／なった（動詞）／。（文末）」, where 名詞 is noun, 助詞 is particle, 動詞 is verb, and 文末 is end of sentence. In the text above and in the text sentence data 703 of FIG. 7, a 「／」 is inserted between morphemes for the purposes of explanation; in practice, the result is recorded as morphological analysis data in a predetermined area of the memory 205 or the hard disk 209, as shown at A in FIG. 7a.
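One way to hold such morphological analysis data in memory is a list of (surface form, part of speech) pairs. The sketch below, in Python, builds that list from the slash-separated notation used in the text. The function name `parse_annotated` and the pair layout are assumptions for illustration; the actual record format of FIG. 7a is not reproduced here.

```python
import re

def parse_annotated(annotated):
    """Turn 'surface（pos）／surface（pos）...' into a list of (surface, pos) pairs."""
    # Each morpheme is a run of characters followed by its part of speech
    # in full-width parentheses; the full-width slash only separates morphemes.
    return re.findall(r"([^／（）]+)（([^（）]+)）", annotated)

annotated = (
    "営業利益（名詞）／は（助詞）／前年度（名詞）／と（助詞）／同水準（名詞）／の（助詞）／"
    "３２７７６百万円（名詞）／と（助詞）／なった（動詞）／。（文末）"
)
for surface, pos in parse_annotated(annotated):
    print(surface, pos)
```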

Based on the grammar definition information 2094, the CPU 203 then parses the morphologically analyzed text sentence data (step S503). A parser such as YACC, an LR grammar parser for bottom-up parsing, may be used. Here, a parser is a syntax analysis program: a program that analyzes an input string and builds a parse tree or similar structure.
The LR grammar parser performs syntactic analysis according to the preset grammar definition information 2094. FIG. 8 shows an example of the grammar definition information 2094 set in this embodiment, in which the grammar is defined in BNF notation.

In step S503, the CPU 203 feeds the morphologically analyzed text sentence data to YACC, the LR grammar parser, for syntactic analysis. Parsing the text sentence data 701 of FIG. 7 with YACC and the grammar definition information 2094 gives the result shown at 705.

At 705, parsing proceeds in stages, levels 1 to 8, according to the grammar definition information 2094, so that at each level the whole of the text sentence data is divided into predetermined sentence element data. At level 2, for example, the text sentence data 701 is divided into the sentence element data 「営業利益は（従属句）／前年度と（従属句）／同水準の（従属句）／３２７７６百万円と（従属句）／なった（動詞句）」, where 従属句 is a subordinate phrase and 動詞句 is a verb phrase.

As shown at B to D in FIG. 7a and E to I in FIG. 7b, the CPU 203 records the parsed data for each level in a predetermined area of the memory 205 or the hard disk 209.

From the parsed text sentence data, the CPU 203 obtains the text sentence data as divided into sentence element data at the phrase (「〜句」) level (step S505).

For example, it obtains the text sentence data 707 divided at level 2 of the text sentence data 705 in FIG. 7, specifically the syntactic analysis data (level 2) shown at C in FIG. 7a.

Based on the obtained syntactic analysis data (level 2), the CPU 203 determines the nominative phrase (主格句) (step S507). A nominative phrase is sentence element data that contains the subject; whether a sentence element is a nominative phrase is judged from its final particle. For example, if the sentence element data ends in 「〜は」, 「〜が」, or 「〜も」, it is judged to be a nominative phrase.

FIG. 9 shows an example of determining the nominative phrase. In the text sentence data 901 of FIG. 9, the final particle of the sentence element data 「営業利益は（従属句）」 is 「は」, according to the text sentence data 703 obtained immediately after morphological analysis in FIG. 7. The sentence element data 「営業利益は（従属句）」 therefore ends in 「〜は」 and is determined to be a nominative phrase.
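The check in step S507 can be sketched in Python as a suffix test. This is a simplification under stated assumptions: the embodiment looks at the final particle reported by morphological analysis, whereas this sketch tests the raw string ending, so a word that merely ends in one of these characters would also match. The names `NOMINATIVE_PARTICLES` and `is_nominative_phrase` are hypothetical.

```python
# Final particles that mark a nominative phrase (主格句) in the embodiment.
NOMINATIVE_PARTICLES = ("は", "が", "も")

def is_nominative_phrase(element_text):
    """Judge a sentence element from its last character (string-level simplification)."""
    return element_text.endswith(NOMINATIVE_PARTICLES)

# Level-2 sentence elements of the example sentence.
level2 = ["営業利益は", "前年度と", "同水準の", "３２７７６百万円と", "なった"]
for element in level2:
    print(element, is_nominative_phrase(element))
```

Only 「営業利益は」 is flagged, which matches the determination described for FIG. 9.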

The CPU 203 then determines the predicate (述部) in the obtained text sentence data 707 (step S509). Here, the predicate is the sentence element data that contains the predicate word. The CPU 203 judges whether a sentence element is a predicate from the part of speech at its end.

For example, if the part of speech at the end of the sentence element data is a noun, an adjective, or a verb, that sentence element data is judged to be a predicate or a component of a predicate. In particular, when it is an auxiliary verb, that is, a verb with little independent meaning (non-independent), or one of the nouns or adjectives just mentioned, it is concatenated with the immediately preceding sentence element data to form the predicate.

In this embodiment, if the sentence element data ends in 「〜である」, 「〜となる」, 「〜であった」, or 「〜となった」, the part of speech at the end of that sentence element is judged to be an auxiliary verb. Alternatively, whether the final part of speech is an auxiliary verb may be determined from the morphological analysis, in which case the morphological analysis should distinguish main verbs (independent) from auxiliary verbs (non-independent).

FIG. 9 shows an example of determining the predicate. In the text sentence data 903 of FIG. 9, the sentence element data 「なった（動詞句）」 is a verb according to the morphological analysis result in FIG. 7, so it is determined to be a predicate or a component of a predicate. Moreover, because 「なった（動詞句）」 is classified as an auxiliary verb, it is concatenated with the immediately preceding sentence element data to form the predicate.

これにより、当該文要素データ「なった（動詞句）」と直前の文要素データ「３２７７６百万円と（従属句）」を連結して「３２７７６百万円となった」が述部と決定される。 As a result, the sentence element data 「なった」 (verb phrase) and the immediately preceding sentence element data 「３２７７６百万円と」 (subordinate phrase) are concatenated, and 「３２７７６百万円となった」 ("became 32,776 million yen") is determined to be the predicate.

すなわち、図９のテキスト文データ９０５に示すように、テキスト文データは「営業利益は（主格句）／前年度と（従属句）／同水準の（従属句）／３２７７６百万円となった（述部）」の各文要素データに分割される。なお、実際には、ＣＰＵ２０３は、メモリ２０５またはハードディスク２０９の所定領域に、図９ａに示すような解析結果データ９１０を記録する。 That is, as shown in the text sentence data 905 of FIG. 9, the text sentence data is divided into the sentence element data 「営業利益は」 (operating profit; nominative phrase) / 「前年度と」 (with the previous year; subordinate phrase) / 「同水準の」 (at the same level; subordinate phrase) / 「３２７７６百万円となった」 (became 32,776 million yen; predicate). In practice, the CPU 203 records analysis result data 910 as shown in FIG. 9a in a predetermined area of the memory 205 or the hard disk 209.
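The predicate determination described above can be illustrated with a minimal Python sketch. It assumes that each sentence element is held as a hypothetical `(text, label)` pair and that the auxiliary-verb check is the ending comparison of this embodiment rather than morphological analysis; the function name `merge_predicate` and the label strings are illustrative only and do not appear in the specification.

```python
# Endings treated as auxiliary verbs in this embodiment.
AUX_ENDINGS = ("である", "となる", "であった", "となった")


def merge_predicate(elements):
    """Relabel a final verb phrase as the predicate.

    `elements` is a hypothetical list of (text, label) pairs. When the final
    verb phrase, joined to the element before it, ends in one of the
    auxiliary-verb endings, the two elements are concatenated.
    """
    if not elements or elements[-1][1] != "verb phrase":
        return list(elements)
    last_text = elements[-1][0]
    if len(elements) >= 2:
        joined = elements[-2][0] + last_text
        if joined.endswith(AUX_ENDINGS):
            # Auxiliary verb: merge with the immediately preceding element.
            return list(elements[:-2]) + [(joined, "predicate")]
    # Main verb: the final element alone is the predicate.
    return list(elements[:-1]) + [(last_text, "predicate")]
```

Applied to the elements of text sentence data 903, the last two elements would be joined into 「３２７７６百万円となった」, matching text sentence data 905.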

１−３−３．係り受け情報抽出処理 1-3-3. Dependency information extraction processing
テキスト文解析処理（図３、ステップＳ３０９）において、テキスト文データが所定の種類の文要素データに分割されると、さらにＣＰＵ２０３は係り受け情報抽出処理（ステップＳ３１３）を行う。 In the text sentence analysis process (FIG. 3, step S309), when the text sentence data is divided into predetermined types of sentence element data, the CPU 203 further performs a dependency information extraction process (step S313).

図６に示すテキスト文解析処理のフローチャートにおいて、ＣＰＵ２０３は前記数値情報データの属する文要素データを取得する（ステップＳ６０１）。例えば、上記数値情報抽出処理において抽出された数値情報データは「３２７７６」であるので、図９に示すテキスト文データ９０５により当該情報が含まれる文要素データとして、「３２７７６百万円となった（述部）」を取得する。 In the flowchart of the text sentence analysis process shown in FIG. 6, the CPU 203 acquires the sentence element data to which the numerical information data belongs (step S601). For example, since the numerical information data extracted in the numerical information extraction process is "32776", the CPU 203 refers to the text sentence data 905 shown in FIG. 9 and acquires 「３２７７６百万円となった」 (predicate) as the sentence element data containing it.

ＣＰＵ２０３は数値情報データが含まれる文要素データが、主格句であると判定すれば（ステップＳ６０３、ＹＥＳ）、後述するステップＳ６０９に進み、主格句でないと判定すれば（ステップＳ６０３、ＮＯ）、さらに述部であるか否かの判定を行う（ステップＳ６０５）。 If the CPU 203 determines that the sentence element data containing the numerical information data is a nominative phrase (step S603, YES), it proceeds to step S609 described later. If it determines that it is not a nominative phrase (step S603, NO), it then determines whether the sentence element data is a predicate (step S605).

ＣＰＵ２０３は数値情報データの属する文要素データが述部であると判定すれば（ステップＳ６０５、ＹＥＳ）、元のテキスト文データの主格句に含まれる名詞を前記数値情報データの係り受け情報データとして抽出し（ステップＳ６１３）、述部でないと判定すれば（ステップＳ６０５、ＮＯ）、さらに、数値情報データの属する文要素データを従属句であると判断して、当該文要素データが連体修飾であるか否かの判定を行う（ステップＳ６０７）。 If the CPU 203 determines that the sentence element data to which the numerical information data belongs is a predicate (step S605, YES), it extracts the noun contained in the nominative phrase of the original text sentence data as the dependency information data of the numerical information data (step S613). If it determines that it is not a predicate (step S605, NO), it treats the sentence element data as a subordinate phrase and determines whether it is an adnominal modifier (連体修飾) (step S607).

ＣＰＵ２０３は数値情報データが含まれる文要素データが、連体修飾であると判定すれば（ステップＳ６０７、ＹＥＳ）、当該文要素データの直後の文要素データに含まれる名詞を前記数値情報データの係り受け情報データとして抽出し（ステップＳ６１５）、連体修飾でないと判定すれば（ステップＳ６０７、ＮＯ）、数値情報データの属する文要素データ以外の文要素データを前記数値情報データの係り受け情報データとして抽出する（ステップＳ６０９）。 If the CPU 203 determines that the sentence element data containing the numerical information data is an adnominal modifier (連体修飾) (step S607, YES), it extracts the noun contained in the immediately following sentence element data as the dependency information data of the numerical information data (step S615). If it determines that it is not an adnominal modifier (step S607, NO), it extracts the sentence element data other than the one to which the numerical information data belongs as the dependency information data (step S609).
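The branching of steps S601 to S615 can be sketched in Python as follows. This is only an illustration under stated assumptions: sentence elements are hypothetical `(text, label)` pairs, the noun of an element is approximated by stripping one trailing particle instead of consulting the morphological analysis result, and the names `extract_dependency` and `head_noun` are invented for this sketch.

```python
# Assumption: a crude stand-in for morphological analysis.
PARTICLES = ("は", "が", "を", "に", "の", "と")


def head_noun(text):
    """Approximate the noun of an element by stripping one trailing particle."""
    for particle in PARTICLES:
        if text.endswith(particle):
            return text[: -len(particle)]
    return text


def extract_dependency(elements, numeric):
    """Return the dependency information for `numeric`.

    `elements` is a list of (text, label) pairs, where label is one of
    "nominative", "subordinate", or "predicate".
    """
    # S601: find the element to which the numerical information belongs.
    idx = next(i for i, (text, _) in enumerate(elements) if numeric in text)
    text, label = elements[idx]
    others = "".join(t for i, (t, _) in enumerate(elements) if i != idx)
    if label == "nominative":          # S603 YES -> S609
        return others
    if label == "predicate":           # S605 YES -> S613
        for t, lab in elements:
            if lab == "nominative":
                return head_noun(t)
        return None
    # Subordinate phrase: an element ending in "の" is treated as adnominal.
    if text.endswith("の") and idx + 1 < len(elements):   # S607 YES -> S615
        return head_noun(elements[idx + 1][0])
    return others                      # S607 NO -> S609
```

With the example sentences used in this section, the sketch would yield 「営業利益」 for the predicate case and 「ノート」 for the adnominal case, and the remaining sentence elements for the other two cases.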

以下、数値情報データの属する文要素データが「述部」、「主格句」、「従属句」の場合に分けて説明する。 The cases where the sentence element data to which the numerical information data belongs is a predicate, a nominative phrase, and a subordinate phrase are described separately below.

１−３−３−１．「述部」の場合 1-3-3-1. In the case of a "predicate"
例えば、数値情報データの属する文要素データ「３２７７６百万円となった（述部）」は述部であると判定されるので、図９に示すテキスト文データ９０５から「営業利益は（主格句）」に基づいて係り受け情報データが抽出される。すなわち、図７に示したテキスト文データ７０３を参照して「営業利益（名詞）」が係り受け情報データとして抽出される。 For example, the sentence element data 「３２７７６百万円となった」 (predicate), to which the numerical information data belongs, is determined to be a predicate, so dependency information data is extracted from the nominative phrase 「営業利益は」 in the text sentence data 905 shown in FIG. 9. That is, with reference to the text sentence data 703 shown in FIG. 7, 「営業利益」 (operating profit; noun) is extracted as the dependency information data.

これにより、数値情報データ「３２７７６百万円」、係り受け情報データ「営業利益」が抽出結果となる。 As a result, the numerical information data “32776 million yen” and the dependency information data “operating profit” are extracted.

１−３−３−２．「主格句」の場合 1-3-3-2. In the case of a "nominative phrase"
例えば、テキスト文データとして「１６０３年には徳川家康が征夷大将軍に任じられた。」が入力された場合を考える。 For example, consider the case where 「１６０３年には徳川家康が征夷大将軍に任じられた。」 ("In 1603, Tokugawa Ieyasu was appointed Seii Taishogun.") is input as the text sentence data.

図３に示したステップＳ３０１〜Ｓ３０９にしたがって、上記テキスト文データは「１６０３年には（主格句）／徳川家康が（主格句）／征夷大将軍に（従属句）／任じられた（述部）」の文要素データに分割される。なお、この場合の単位文字には「年」を使用しており、数値情報データとして「１６０３年」が抽出されているものとする。 In accordance with steps S301 to S309 shown in FIG. 3, the text sentence data is divided into the sentence element data 「１６０３年には」 (nominative phrase) / 「徳川家康が」 (nominative phrase) / 「征夷大将軍に」 (subordinate phrase) / 「任じられた」 (predicate). Here 「年」 (year) is used as the unit character, and 「１６０３年」 (the year 1603) is assumed to have been extracted as the numerical information data.

この場合、数値情報データの属する文要素データは「１６０３年には（主格句）」であるので、ＣＰＵ２０３は、テキスト文データから「１６０３年には（主格句）」以外の文要素データを係り受け情報データとして抽出する。すなわち、「徳川家康が（主格句）／征夷大将軍に（従属句）／任じられた（述部）」が係り受け情報データとして抽出される。 In this case, the sentence element data to which the numerical information data belongs is 「１６０３年には」 (nominative phrase), so the CPU 203 extracts the sentence element data other than 「１６０３年には」 from the text sentence data as the dependency information data. That is, 「徳川家康が」 (nominative phrase) / 「征夷大将軍に」 (subordinate phrase) / 「任じられた」 (predicate) is extracted as the dependency information data.

これにより、数値情報データ「１６０３年」、係り受け情報データ「徳川家康が征夷大将軍に任じられた」が抽出結果となる。 As a result, the numerical information data 「１６０３年」 and the dependency information data 「徳川家康が征夷大将軍に任じられた」 ("Tokugawa Ieyasu was appointed Seii Taishogun") are the extraction result.

１−３−３−３．「従属句」の場合 1-3-3-3. In the case of a "subordinate phrase"
数値情報データが含まれる文要素データが従属句である場合、ＣＰＵ２０３は当該文要素データが修飾形態が連体修飾であるか否かを判定する（ステップＳ６０７）。すなわち、後続する文要素データに体言が含まれているか否かを判定する。 When the sentence element data containing the numerical information data is a subordinate phrase, the CPU 203 determines whether the sentence element data is an adnominal modifier (連体修飾) (step S607), that is, whether the following sentence element data contains a nominal (体言).

例えば、数値情報データが含まれる文要素データの末尾が「〜の」である場合（ステップＳ６０７、ＹＥＳ）、修飾する単語が体言であると判定し、当該文要素データは連体修飾であると判定する。 For example, when the sentence element data containing the numerical information data ends in 「〜の」 (the particle "no") (step S607, YES), the word it modifies is determined to be a nominal (体言), and the sentence element data is determined to be an adnominal modifier.

例えば、テキスト文データとして「私は昨日１００円のノートを買った。」が入力された場合を考える。図３に示したステップＳ３０１〜Ｓ３０９にしたがって、上記テキスト文データは「私は（主格句）／昨日（従属句）／１００円の（従属句）／ノートを（従属句）／買った（述部）」の文要素データに分割される。なお、この場合の単位文字には「円」を使用しており、数値情報データとして「１００円」が抽出されているものとする。 For example, consider the case where 「私は昨日１００円のノートを買った。」 ("Yesterday I bought a 100 yen notebook.") is input as the text sentence data. In accordance with steps S301 to S309 shown in FIG. 3, the text sentence data is divided into the sentence element data 「私は」 (I; nominative phrase) / 「昨日」 (yesterday; subordinate phrase) / 「１００円の」 (100 yen; subordinate phrase) / 「ノートを」 (notebook; subordinate phrase) / 「買った」 (bought; predicate). Here 「円」 (yen) is used as the unit character, and 「１００円」 is assumed to have been extracted as the numerical information data.

この場合、数値情報データの属する文要素データは「１００円の（従属句）」であり、文要素データの末尾が「〜の」であるので、ＣＰＵ２０３は、当該従属句は連体修飾であると判定する。すなわち、当該従属句が修飾する単語は体言であると判定し、テキスト文データにおける当該文要素データの直後の文要素データの名詞を係り受け情報データとして抽出する。 In this case, the sentence element data to which the numerical information data belongs is 「１００円の」 (subordinate phrase), which ends in 「〜の」, so the CPU 203 determines that the subordinate phrase is an adnominal modifier. That is, it determines that the word modified by the subordinate phrase is a nominal (体言), and extracts the noun of the sentence element data immediately following it in the text sentence data as the dependency information data.

例えば、直後の文要素データである「ノートを（従属句）」から「ノート（名詞）」が係り受け情報データとして抽出される。 For example, 「ノート」 (notebook; noun) is extracted as the dependency information data from the immediately following sentence element data 「ノートを」 (subordinate phrase).

これにより、数値情報データ「１００円」、係り受け情報データ「ノート」が抽出結果となる。 As a result, the numerical information data 「１００円」 (100 yen) and the dependency information data 「ノート」 (notebook) are the extraction result.

一方、ＣＰＵ２０３が連体修飾でないと判定した場合（ステップＳ６０７、ＮＯ）、すなわち、修飾する単語が用言であると判定し、数値情報データの属する文要素データ以外の文要素データを係り受け情報データとして抽出する（ステップＳ６０９）。 On the other hand, when the CPU 203 determines that the modification is not adnominal (step S607, NO), that is, that the modified word is an inflected word (用言, a verb or adjective), it extracts the sentence element data other than the one to which the numerical information data belongs as the dependency information data (step S609).

例えば、テキスト文データとして「源頼朝は１１９２年に鎌倉幕府を開いた。」が入力された場合を考える。 For example, consider the case where 「源頼朝は１１９２年に鎌倉幕府を開いた。」 ("Minamoto no Yoritomo founded the Kamakura shogunate in 1192.") is input as the text sentence data.

図３に示したステップＳ３０１〜Ｓ３０９にしたがって、上記テキスト文データは「源頼朝は（主格句）／１１９２年に（従属句）／鎌倉幕府を（従属句）／開いた（述部）」の文要素データに分割される。なお、この場合の単位文字には「年」を使用しており、数値情報データとして「１１９２年」が抽出されているものとする。 In accordance with steps S301 to S309 shown in FIG. 3, the text sentence data is divided into the sentence element data 「源頼朝は」 (Minamoto no Yoritomo; nominative phrase) / 「１１９２年に」 (in 1192; subordinate phrase) / 「鎌倉幕府を」 (the Kamakura shogunate; subordinate phrase) / 「開いた」 (founded; predicate). Here 「年」 (year) is used as the unit character, and 「１１９２年」 (the year 1192) is assumed to have been extracted as the numerical information data.

この場合、数値情報データの属する文要素データは「１１９２年に（従属句）」であり、文要素データの末尾が「〜の」以外であるので、ＣＰＵ２０３は、当該従属句は連体修飾でないと判定する。すなわち、当該従属句が修飾する単語は用言であると判定し、テキスト文データにおける数値情報データの属する文要素データ以外の文要素データを係り受け情報データとして抽出する。 In this case, the sentence element data to which the numerical information data belongs is 「１１９２年に」 (subordinate phrase), which does not end in 「〜の」, so the CPU 203 determines that the subordinate phrase is not an adnominal modifier. That is, it determines that the word modified by the subordinate phrase is an inflected word (用言), and extracts the sentence element data other than the one to which the numerical information data belongs as the dependency information data.

例えば、文要素データ「１１９２年（従属句）」以外の文要素データである「源頼朝は（主格句）／鎌倉幕府を（従属句）／開いた（述部）」が係り受け情報データとして抽出される。 For example, the sentence element data other than 「１１９２年」 (subordinate phrase), namely 「源頼朝は」 (nominative phrase) / 「鎌倉幕府を」 (subordinate phrase) / 「開いた」 (predicate), is extracted as the dependency information data.

これにより、数値情報データ「１１９２年」、係り受け情報データ「源頼朝は鎌倉幕府を開いた」が抽出結果となる。 As a result, the numerical information data 「１１９２年」 and the dependency information data 「源頼朝は鎌倉幕府を開いた」 ("Minamoto no Yoritomo founded the Kamakura shogunate") are the extraction result.

ＣＰＵ２０３は、上記の数値情報抽出処理（図４）によって抽出した数値情報データの数だけ上記処理ステップＳ６０１〜Ｓ６０９を繰り返し（ステップＳ６１１、ＹＥＳ）、他の数値情報データがなければ当該処理を終了する（ステップＳ６１１、ＮＯ）。 The CPU 203 repeats the processing steps S601 to S609 by the number of numerical information data extracted by the numerical information extraction processing (FIG. 4) (step S611, YES), and ends the processing if there is no other numerical information data. (Step S611, NO).

１−３−４．まとめ 1-3-4. Summary
係り受け情報抽出処理を終えるとＣＰＵ２０３は、抽出結果を抽出結果データベース２０９７に記録する（図３、ステップＳ３１５）。例えば、図７に示したテキスト文データ「営業利益は、前年度と同水準の３２７７６百万円となった。」が入力された場合、抽出結果データベース２０９７に、数値情報データ「３２７７６百万円」、係り受け情報データ「営業利益」が記録される。 When the dependency information extraction process is finished, the CPU 203 records the extraction result in the extraction result database 2097 (FIG. 3, step S315). For example, when the text sentence data 「営業利益は、前年度と同水準の３２７７６百万円となった。」 ("Operating profit was 32,776 million yen, the same level as the previous year.") shown in FIG. 7 is input, the numerical information data 「３２７７６百万円」 and the dependency information data 「営業利益」 (operating profit) are recorded in the extraction result database 2097.

なお、本実施形態においては、情報抽出装置において抽出した抽出結果を抽出結果データベース２０９７に記録するようにしているが、これに限定されることはなく、抽出結果を他のアプリケーションに引き渡すようにしてもよい。 In this embodiment, the extraction result extracted by the information extraction apparatus is recorded in the extraction result database 2097. However, the present invention is not limited to this, and the extraction result is transferred to another application. Also good.

また、テキスト文データから数値情報データと係り受け情報データを抽出する処理を行うために、他のアプリケーションに組み込んで使用するようにしてもよい。さらに、ユーザに提示するために、抽出結果を情報抽出装置のディスプレイ２０１に表示するようにしてもよい。 Further, in order to perform processing for extracting numerical information data and dependency information data from text sentence data, it may be used by being incorporated in another application. Further, the extraction result may be displayed on the display 201 of the information extraction device for presentation to the user.

ＣＰＵ２０３は、テキスト文書から切り出されたテキスト文データの数だけ上記処理（図３、ステップＳ３０３〜３１５）を繰り返し（ステップＳ３１７、ＹＥＳ）、処理対象となるテキスト文データがなくなれば、当該情報抽出処理を終了する（ステップＳ３１７、ＮＯ）。 The CPU 203 repeats the above processing (FIG. 3, steps S303 to S315) for each piece of text sentence data cut out from the text document (step S317, YES), and ends the information extraction process when no text sentence data remains to be processed (step S317, NO).

以上説明したように、この発明によれば、数値情報データに対応する名詞が近傍にない場合であっても、係り受け関係にある名詞を正確に抽出することができる。また、数値情報データに対応する情報として抽出する情報を、特定の名詞に限定することなく、係り受け関係にある名詞または文を正確に抽出することができる。 As described above, according to the present invention, nouns having a dependency relationship can be accurately extracted even when nouns corresponding to numerical information data are not in the vicinity. In addition, information extracted as information corresponding to numerical information data is not limited to a specific noun, and a noun or sentence having a dependency relationship can be accurately extracted.

２．第２の実施形態 2. Second Embodiment
上記実施形態においては、入力されるテキスト文データが単文であることを前提として説明した。本実施形態においては、特に、テキスト文データが２つの文を含む複文または重文である場合について説明する。 The above embodiment was described on the assumption that the input text sentence data is a simple sentence. This embodiment describes, in particular, the case where the text sentence data is a complex sentence (複文) or a compound sentence (重文) containing two clauses.

なお、複文とは、主語・述語の関係が成り立っている文で、さらにその構成部分に主語・述語の関係がみられるものである。また、重文とは、独立した二つ以上の文が、対等の資格で結合した文である。 A complex sentence (複文) is a sentence that has a subject-predicate relationship and whose constituent parts also contain a subject-predicate relationship. A compound sentence (重文) is a sentence in which two or more independent clauses are joined on an equal footing.

２−１．機能ブロック図 2-1. Functional block diagram
図１０に、本実施形態にかかる情報抽出装置の機能ブロック図を示す。この図において、本発明にかかる情報抽出装置は、切出手段１０１、数値情報抽出手段１０３、テキスト文解析手段１０５、抽出対象文決定手段１０６、係り受け情報抽出手段１０７、出力手段１０９を備えている。 FIG. 10 shows a functional block diagram of the information extraction apparatus according to this embodiment. In this figure, the information extraction apparatus according to the present invention comprises cutting means 101, numerical information extraction means 103, text sentence analysis means 105, extraction target sentence determination means 106, dependency information extraction means 107, and output means 109.

抽出対象文決定手段１０６は、テキスト文解析手段によって文要素データに分割されたテキスト文データから数値情報データの抽出対象となる文を決定し、係り受け情報抽出手段１０７に与える。例えば、複文の場合は、主文と副文に分割した後、各文が抽出対象であるか否かを決定する。 The extraction target sentence determination means 106 determines, from the text sentence data divided into sentence element data by the text sentence analysis means, the sentences from which numerical information data is to be extracted, and passes them to the dependency information extraction means 107. For example, a complex sentence (複文) is divided into a main clause and a subordinate clause, and it is then determined whether each clause is an extraction target.

なお、切出手段１０１、数値情報抽出手段１０３、テキスト文解析手段１０５、係り受け情報抽出手段１０７、出力手段１０９は、第１の実施形態と同様である。 Note that the extraction unit 101, the numerical information extraction unit 103, the text sentence analysis unit 105, the dependency information extraction unit 107, and the output unit 109 are the same as those in the first embodiment.

２−２．ハードウェア構成 2-2. Hardware configuration
ハードウェア構成については、第１の実施形態と同様である。 The hardware configuration is the same as in the first embodiment.

２−３．フローチャート 2-3. Flowchart
情報抽出プログラム２０９１に基づく処理について、図１１のフローチャートを用いて説明する。以下では、テキスト文データ「今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。」を含むテキスト文書を切出手段１０１に入力した場合を例として説明する。 Processing based on the information extraction program 2091 is described using the flowchart of FIG. 11. The following takes as an example the case where a text document containing the text sentence data 「今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。」 ("This month's food expenses came to 45,400 yen because my father bought a 7,000 yen wine.") is input to the cutting means 101.

図１１に示す情報抽出処理のフローチャートにおいて、ＣＰＵ２０３が行うステップＳ３０１〜Ｓ３０９までの処理は基本的に第１の実施形態と同様である。 In the flowchart of the information extraction process shown in FIG. 11, the processes from step S301 to S309 performed by the CPU 203 are basically the same as those in the first embodiment.

例えば、前記テキスト文データが読み込まれた場合、先頭の文である「今月の食費は、父が７０００円のワインを買ったため、４５４００円になった。」がテキスト文データとして切り出される。 For example, when the text sentence data is read, the first sentence, “This month's food expense is 45400 yen because my father bought 7000 yen of wine” is cut out as text sentence data.

２−３−１．数値情報抽出処理 2-3-1. Numerical information extraction processing
例えば、本実施形態において、数値単位マスタには「円」が単位文字として記録されており、ＣＰＵ２０３は、図４に示す数値情報抽出処理により、「７０００円」、「４５４００円」をそれぞれ数値情報データとして抽出する。 For example, in this embodiment 「円」 (yen) is recorded as a unit character in the numerical unit master, and the CPU 203 extracts 「７０００円」 and 「４５４００円」 as numerical information data by the numerical information extraction process shown in FIG. 4.
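As a rough illustration of extracting a run of digits followed by a registered unit character, the following Python sketch uses a regular expression. The steps of FIG. 4 are not reproduced in this section, so the approach, the `UNIT_MASTER` list, and the function name `extract_numeric` are assumptions made for illustration rather than the procedure of the embodiment.

```python
import re

# Hypothetical contents of the numerical unit master.
UNIT_MASTER = ["円", "年"]


def extract_numeric(text, units=UNIT_MASTER):
    """Return every run of digits that is immediately followed by a unit."""
    # Longer units are tried first so that a longer unit is never cut short.
    alternatives = "|".join(
        re.escape(u) for u in sorted(units, key=len, reverse=True)
    )
    # In Python 3 str patterns, \d also matches full-width digits.
    return re.findall(rf"\d+(?:{alternatives})", text)
```

For the example sentence of this embodiment, the sketch would return 「７０００円」 and 「４５４００円」 in order of appearance.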

２−３−２．テキスト文解析処理 2-3-2. Text sentence analysis processing
数値情報抽出処理（図１１、ステップＳ３０５）において、数値情報データが抽出されれば（ステップＳ３０７、ＹＥＳ）、ＣＰＵ２０３はテキスト文解析処理（ステップＳ３０９）を行う。第１の実施形態に示したように、図５に示すテキスト文解析処理のフローチャートにしたがって、ＣＰＵ２０３は上記テキスト文データを所定の文要素データに分割する。 If numerical information data is extracted in the numerical information extraction process (FIG. 11, step S305) (step S307, YES), the CPU 203 performs a text sentence analysis process (step S309). As shown in the first embodiment, the CPU 203 divides the text sentence data into predetermined sentence element data according to the flowchart of the text sentence analysis process shown in FIG. 5.

図１３に、テキスト文解析処理を行った後のテキスト文データを示す。テキスト文データ１３０１に示すように、前記テキスト文データは「今月の食費は、（主格句）／父が（主格句）／７０００円の（従属句）／ワインを（従属句）／買ったため、（述部）／４５４００円になった（述部）」の各文要素データに分割される。 FIG. 13 shows the text sentence data after the text sentence analysis process. As shown in the text sentence data 1301, the text sentence data is divided into the sentence element data 「今月の食費は、」 (this month's food expenses; nominative phrase) / 「父が」 (my father; nominative phrase) / 「７０００円の」 (7,000 yen; subordinate phrase) / 「ワインを」 (wine; subordinate phrase) / 「買ったため、」 (because he bought; predicate) / 「４５４００円になった」 (came to 45,400 yen; predicate).

２−３−３．抽出対象文決定処理 2-3-3. Extraction target sentence determination processing
テキスト文解析処理を終えるとＣＰＵ２０３は、抽出対象文決定処理を行う（ステップＳ３１０）。図１２に、抽出対象文決定処理におけるフローチャートを示す。 When the text sentence analysis process is completed, the CPU 203 performs an extraction target sentence determination process (step S310). FIG. 12 shows a flowchart of the extraction target sentence determination process.

ＣＰＵ２０３は、構文解析処理されたテキスト文データに基づいて、文の種類を判定する。すなわち、ＣＰＵ２０３は、主格句と述部の組合せの個数および並び方に基づいて、入力されたテキスト文データが、複文、重文、単文のいずれであるかを判定する（図１２、ステップＳ１２０１）。 The CPU 203 determines the type of sentence based on the parsed text sentence data. That is, based on the number and arrangement of combinations of a nominative phrase and a predicate, the CPU 203 determines whether the input text sentence data is a complex sentence (複文), a compound sentence (重文), or a simple sentence (単文) (FIG. 12, step S1201).

ＣＰＵ２０３は、主格句と述部の組合せが１つであればテキスト文データは単文であると判定する（ステップＳ１２０１、単文）。また、主格句と述部の組合せが２つであり、並び方が「主格句−主格句−述部−述部」であればテキスト文データは複文であると判定する（ステップＳ１２０１、複文）。さらに、主格句と述部の組合せが２つであり、並び方が「主格句−述部−主格句−述部」であればテキスト文データは重文であると判定する（ステップＳ１２０１、重文）。 If there is one combination of a nominative phrase and a predicate, the CPU 203 determines that the text sentence data is a simple sentence (step S1201, simple sentence). If there are two such combinations arranged as "nominative phrase - nominative phrase - predicate - predicate", it determines that the text sentence data is a complex sentence (複文) (step S1201, complex sentence). If there are two such combinations arranged as "nominative phrase - predicate - nominative phrase - predicate", it determines that the text sentence data is a compound sentence (重文) (step S1201, compound sentence).
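The determination in step S1201 amounts to matching the order of nominative phrases and predicates against known arrangements, which can be sketched in Python as below. The label strings, the `PATTERNS` table, and the function name `classify_sentence` are hypothetical, and treating any sentence with exactly one predicate as a simple sentence is an assumption of this sketch.

```python
# Arrangements of nominative phrases (N) and predicates (P) for two clauses.
PATTERNS = {
    ("N", "N", "P", "P"): "complex",   # 複文
    ("N", "P", "N", "P"): "compound",  # 重文
}


def classify_sentence(labels):
    """Classify a sentence from the labels of its sentence elements."""
    # Subordinate phrases are ignored; only the N/P order matters.
    core = tuple(
        "N" if lab == "nominative" else "P"
        for lab in labels
        if lab in ("nominative", "predicate")
    )
    if core.count("P") == 1:
        # Assumption: one predicate means one nominative-predicate combination.
        return "simple"
    return PATTERNS.get(core, "unknown")
```

Keeping the arrangements in a table means that further arrangements for sentences with more clauses could be registered without changing the function.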

ＣＰＵ２０３は、入力されたテキスト文データが複文であると判定すると、当該テキスト文データを主文と副文に分割する（ステップＳ１２０３）。例えば、図１３に示すテキスト文データ１３０１の場合、文の構造が「主格句−主格句−述部−述部」であることにより、複文であると判定され、テキスト文データ１３０１は、主文としてのテキスト文データ１３０３「今月の食費は、（主格句）／４５４００円になった（述部）」および副文としてのテキスト文データ１３０５「父が（主格句）／７０００円の（従属句）／ワインを（従属句）／買ったため（述部）」に分割される。 When the CPU 203 determines that the input text sentence data is a complex sentence (複文), it divides the text sentence data into a main clause and a subordinate clause (step S1203). For example, the text sentence data 1301 shown in FIG. 13 has the structure "nominative phrase - nominative phrase - predicate - predicate" and is therefore determined to be a complex sentence. It is divided into the text sentence data 1303 as the main clause, 「今月の食費は、」 (nominative phrase) / 「４５４００円になった」 (predicate), and the text sentence data 1305 as the subordinate clause, 「父が」 (nominative phrase) / 「７０００円の」 (subordinate phrase) / 「ワインを」 (subordinate phrase) / 「買ったため」 (predicate).
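One way to realize the division of step S1203 is sketched below in Python. It assumes that the subordinate clause runs from the second nominative phrase through the first predicate and that the remaining elements form the main clause. That rule reproduces the division into text sentence data 1303 and 1305, but the rule and the function name `split_complex` are assumptions of this sketch, not something the specification states.

```python
def split_complex(elements):
    """Split a complex sentence into (main_clause, subordinate_clause).

    `elements` is a list of (text, label) pairs whose nominative phrases and
    predicates are arranged as N - N - ... - P - P.
    """
    nominative = [i for i, (_, lab) in enumerate(elements) if lab == "nominative"]
    predicate = [i for i, (_, lab) in enumerate(elements) if lab == "predicate"]
    start, end = nominative[1], predicate[0]
    # Inner span: second nominative phrase through first predicate.
    subordinate = elements[start:end + 1]
    # Outer span: everything before and after the inner span.
    main = elements[:start] + elements[end + 1:]
    return main, subordinate
```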

一方、ＣＰＵ２０３は、入力されたテキスト文データが重文であると判定すると、当該テキスト文データを前半文と後半文に分割する（ステップＳ１２０５）。例えば、図１４に示すテキスト文データ１４０１の場合、文の構造が「主格句−述部−主格句−述部」であることにより、重文であると判定され、テキスト文データ１４０１は、前半文としてのテキスト文データ１４０３「今月の食費は、（主格句）／４５４００円になった（述部）」および副文としてのテキスト文データ１３０５「父が（主格句）／７０００円の（従属句）／ワインを（従属句）／買ったため（述部）」に分割される。 On the other hand, when the CPU 203 determines that the input text sentence data is a compound sentence (重文), it divides the text sentence data into a first-half clause and a second-half clause (step S1205). For example, the text sentence data 1401 shown in FIG. 14 has the structure "nominative phrase - predicate - nominative phrase - predicate" and is therefore determined to be a compound sentence. It is divided into the text sentence data 1403 as the first-half clause, 「今月の食費は、」 (nominative phrase) / 「４５４００円になった」 (predicate), and the text sentence data 1305 as the subordinate clause, 「父が」 (nominative phrase) / 「７０００円の」 (subordinate phrase) / 「ワインを」 (subordinate phrase) / 「買ったため」 (predicate).

ＣＰＵ２０３は、分割され単文または入力された単文を取得し（ステップＳ１２０７）、当該単文に数値情報データが含まれるか否かを判定する（ステップＳ１２０９）。 The CPU 203 acquires the divided single sentence or the input single sentence (step S1207), and determines whether the single sentence includes numerical information data (step S1209).

ＣＰＵ２０３は、取得した単文に数値情報データが含まれていれば（ステップＳ１２０９、ＹＥＳ）、当該単文を抽出対象文として決定し、メモリ２０５に記憶する（ステップＳ１２１１）。なお、数値情報データが含まれていなければ、ステップＳ１２１１をスキップする。 If the acquired single sentence includes numerical information data (step S1209, YES), the CPU 203 determines the single sentence as an extraction target sentence and stores it in the memory 205 (step S1211). If numerical information data is not included, step S1211 is skipped.

ＣＰＵ２０３は、他の単文があれば（ステップＳ１２１３、ＹＥＳ）、ステップＳ１２０７に戻って、次の単文を取得して上記ステップＳ１２０９〜Ｓ１２１３と同様の処理を行う。なお、処理対象となる単文がなくなれば、当該処理を終了する（ステップＳ１２１３、ＮＯ）。 If there is another simple sentence (YES in step S1213), the CPU 203 returns to step S1207, acquires the next simple sentence, and performs the same processing as in steps S1209 to S1213. If there is no single sentence to be processed, the process ends (NO in step S1213).

例えば、図１３の主文としてのテキスト文データ１３０３が単文として取得されると、数値情報データ「４５４００円」が含まれるので、当該テキスト文データ１３０３を抽出対象文として決定する。さらに、副文としてのテキスト文データ１３０５が単文として取得されて、数値情報データ「７０００円」が含まれることにより、当該テキスト文データ１３０５を抽出対象文として決定する。なお、図１４に示した重文のテキスト文データ１４０１の場合も上記と同様に処理できる。 For example, when the text sentence data 1303, the main clause in FIG. 13, is acquired as a simple sentence, it contains the numerical information data 「４５４００円」, so the text sentence data 1303 is determined to be an extraction target sentence. Likewise, the text sentence data 1305, the subordinate clause, is acquired as a simple sentence and contains the numerical information data 「７０００円」, so it too is determined to be an extraction target sentence. The compound-sentence (重文) text sentence data 1401 shown in FIG. 14 can be processed in the same way.
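The loop of steps S1207 to S1213 can be pictured as a simple filter over the divided clauses. In the Python sketch below, the clause representation (a list of `(text, label)` pairs) and the function name `select_targets` are hypothetical, and storage in the memory 205 is represented by appending to a list.

```python
def select_targets(clauses, numeric_items):
    """Keep only the clauses that contain numerical information data."""
    targets = []
    for clause in clauses:                          # S1207 / S1213
        text = "".join(t for t, _ in clause)
        if any(n in text for n in numeric_items):   # S1209
            targets.append(clause)                  # S1211
    return targets
```

For the example of FIG. 13, both the main clause and the subordinate clause would be kept, since each contains one of the extracted amounts.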

２−３−４．係り受け情報抽出処理 2-3-4. Dependency information extraction processing
抽出対象文決定処理（図１１、ステップＳ３１０）において、入力されたテキスト文データが所定の種類の文要素データに分割され単文に分割されると、さらにＣＰＵ２０３は係り受け情報抽出処理（ステップＳ３１３）を行う。なお、係り受け情報抽出処理については、第１の実施形態において示した図６のフローチャートと同様である。 In the extraction target sentence determination process (FIG. 11, step S310), when the input text sentence data is divided into predetermined types of sentence element data and divided into simple sentences, the CPU 203 further performs dependency information extraction processing (step S313). The dependency information extraction process is the same as the flowchart of FIG. 6 shown in the first embodiment.

ＣＰＵ２０３は、上記の抽出対象文決定処理（図１２）によって決定した単文の数だけ上記処理ステップＳ３１３〜Ｓ３１５を繰り返し（ステップＳ３１６、ＹＥＳ）、他の単文がなければ当該処理を終了する（ステップＳ３１６、ＮＯ）。 The CPU 203 repeats the processing steps S313 to S315 as many as the number of simple sentences determined by the extraction target sentence determination process (FIG. 12) (step S316, YES), and if there is no other single sentence, the process ends (step S316). , NO).

例えば、図１３に示すテキスト文データ１３０３から、数値情報データ「４５４００円」、係り受け情報データ「食費」が抽出され、抽出結果データベース２０９７に記録される。同様に、図１３に示すテキスト文データ１３０５から、数値情報データ「７０００円」、係り受け情報データ「ワイン」が抽出され、抽出結果データベース２０９７に記録される。 For example, numerical information data “45400 yen” and dependency information data “food expenses” are extracted from the text sentence data 1303 shown in FIG. 13 and recorded in the extraction result database 2097. Similarly, numerical information data “7000 yen” and dependency information data “wine” are extracted from the text sentence data 1305 shown in FIG. 13 and recorded in the extraction result database 2097.

２−３−５．まとめ 2-3-5. Summary
以上説明したように、この発明によれば、テキスト文データが単文、複文、重文のいずれの場合であっても、数値情報データと係り受け関係にある名詞または文を正確に抽出することができる。 As described above, according to the present invention, a noun or sentence that is in a dependency relationship with the numerical information data can be extracted accurately whether the text sentence data is a simple sentence (単文), a complex sentence (複文), or a compound sentence (重文).

なお、上記実施形態においては、テキスト文データが主格句と述部の組合せが２つである複文または重文を前提として説明したが、２つ以上の組合せであってもよい。この場合、主格句と述部の並び方を予めパターン化しておき、いずれのパターンに属するかに基づいて、複文であるか重文であるかの判定を行い、単文に分割するように構成すればよい。 The above embodiment was described on the assumption that the text sentence data is a complex sentence (複文) or compound sentence (重文) with two combinations of a nominative phrase and a predicate, but there may be two or more combinations. In that case, the arrangements of nominative phrases and predicates are registered as patterns in advance, the sentence is judged to be complex or compound according to the pattern it matches, and it is then divided into simple sentences.

３．第３の実施形態 3. Third Embodiment
上記の実施形態においては、入力されたテキスト文データを文要素データに分割し、数値情報データの属する文要素データの種類に基づいて、数値情報データおよび当該数値情報データと係り受け関係にある係り受け情報データのみを抽出するように構成した。 In the above embodiments, the input text sentence data is divided into sentence element data, and based on the type of sentence element data to which the numerical information data belongs, only the numerical information data and the dependency information data in a dependency relationship with that numerical information data are extracted.

本実施形態においては、所定の文要素データに基づいて数値情報データおよび係り受け情報データを主情報として抽出し、さらにその他の文要素データに基づいて付加情報を抽出する場合について説明する。 In the present embodiment, a case will be described in which numerical information data and dependency information data are extracted as main information based on predetermined sentence element data, and additional information is extracted based on other sentence element data.

３−１．機能ブロック図 3-1. Functional block diagram
図１５に、本実施形態にかかる情報抽出装置の機能ブロック図を示す。この図において、本発明にかかる情報抽出装置は、切出手段１０１、数値情報抽出手段１０３、テキスト文解析手段１０５、抽出対象文決定手段１０６、係り受け情報抽出手段１０７、付加情報抽出手段１０８、出力手段１０９を備えている。 FIG. 15 shows a functional block diagram of the information extraction apparatus according to this embodiment. In this figure, the information extraction apparatus according to the present invention includes cutting means 101, numerical information extraction means 103, text sentence analysis means 105, extraction target sentence determination means 106, dependency information extraction means 107, additional information extraction means 108, and output means 109.

付加情報抽出手段１０８は、文要素データに分割されたテキスト文データに基づいて、数値情報データおよび係り受け情報データを含む文要素データ以外の文要素データを抽出し、当該数値情報データおよび係り受け情報データに関する付加情報として出力手段１０９に与える。 Based on the text sentence data divided into sentence element data, the additional information extraction means 108 extracts the sentence element data other than those containing the numerical information data and the dependency information data, and passes it to the output means 109 as additional information about that numerical information data and dependency information data.

３−２．ハードウェア構成 3-2. Hardware configuration
ハードウェア構成については、第１の実施形態と同様である。 The hardware configuration is the same as in the first embodiment.

３−３．フローチャート 3-3. Flowchart
情報抽出プログラム２０９１に基づく処理について、図１６のフローチャートを用いて説明する。以下では、第１の実施形態と同様に、テキスト文データ「営業利益は、前年度と同水準の３２７７６百万円となった。」を含むテキスト文書を切出手段１０１に入力した場合を例として説明する。 Processing based on the information extraction program 2091 is described using the flowchart of FIG. 16. As in the first embodiment, the following takes as an example the case where a text document containing the text sentence data 「営業利益は、前年度と同水準の３２７７６百万円となった。」 ("Operating profit was 32,776 million yen, the same level as the previous year.") is input to the cutting means 101.

図１１に示す情報抽出処理のフローチャートにおいて、ＣＰＵ２０３が行うステップＳ３０１〜Ｓ３１５までの処理は基本的に第１の実施形態と同様である。 In the flowchart of the information extraction process shown in FIG. 11, the processes from step S301 to S315 performed by the CPU 203 are basically the same as those in the first embodiment.

例えば、前記テキスト文データが読み込まれた場合、先頭の文である「営業利益は、前年度と同水準の３２７７６百万円となった。」がテキスト文データとして切り出される。 For example, when the text sentence data is read, the first sentence, 「営業利益は、前年度と同水準の３２７７６百万円となった。」 ("Operating profit was 32,776 million yen, the same level as the previous year."), is cut out as the text sentence data.

数値情報抽出処理（図１６、ステップＳ３０５）において、数値情報データが抽出されれば（ステップＳ３０７、ＹＥＳ）、ＣＰＵ２０３はテキスト文解析処理（ステップＳ３０９）を行う。第１の実施形態に示したように、図５に示すテキスト文解析処理のフローチャートにしたがって、ＣＰＵ２０３は上記テキスト文データを所定の文要素データに分割する。 If numerical information data is extracted in the numerical information extraction process (FIG. 16, step S305) (step S307, YES), the CPU 203 performs a text sentence analysis process (step S309). As shown in the first embodiment, the CPU 203 divides the text sentence data into predetermined sentence element data according to the flowchart of the text sentence analysis process shown in FIG.

図１８に、テキスト文解析処理を行った後のテキスト文データを示す。テキスト文データ１８０１に示すように、前記テキスト文データは「営業利益は（主格句）／前年度と（従属句）／同水準の（従属句）／３２７７６百万円となった（述部）」の各文要素データに分割される。 FIG. 18 shows the text sentence data after the text sentence analysis process. As shown in the text sentence data 1801, the text sentence data is divided into the sentence element data 「営業利益は」 (nominative phrase) / 「前年度と」 (subordinate phrase) / 「同水準の」 (subordinate phrase) / 「３２７７６百万円となった」 (predicate).

テキスト文解析処理（図１６、ステップＳ３０９）において、入力されたテキスト文データが所定の種類の文要素データに分割されると、さらにＣＰＵ２０３は係り受け情報抽出処理（ステップＳ３１３）を行う。なお、係り受け情報抽出処理については、第１の実施形態において示した図６のフローチャートと同様である。 In the text sentence analysis process (FIG. 16, step S309), when the input text sentence data is divided into predetermined types of sentence element data, the CPU 203 further performs a dependency information extraction process (step S313). The dependency information extraction process is the same as the flowchart of FIG. 6 shown in the first embodiment.

したがって、第１の実施形態と同様に、数値情報データ「３２７７６百万円」、係り受け情報データ「営業利益」が抽出結果となる。 Therefore, as in the first embodiment, the numerical information data “32776 million yen” and the dependency information data “operating profit” are obtained as the extraction result.
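As a rough illustration of how this result could be reached, the sketch below covers only the case relevant to the example: when the numerical information data sits in the predicate, the noun of the main phrase is taken as the dependency information data. The dictionary layout (`text`, `type`, `nouns`) and the function name `extract_dependency` are hypothetical, and the other cases described for the first embodiment (numerical information in a subordinate phrase or in the main phrase) are deliberately left out.

```python
# Illustrative sketch only: handles the "numeric value in the predicate" case.
# The element layout below is an assumption made for this example.
def extract_dependency(elements, numeric_text):
    """Return the main-phrase noun when numeric_text lies in the predicate."""
    holder = next((e for e in elements if numeric_text in e["text"]), None)
    if holder is None or holder["type"] != "predicate":
        return None  # other cases are not covered by this sketch
    for element in elements:
        if element["type"] == "main phrase" and element["nouns"]:
            return element["nouns"][0]
    return None


elements = [
    {"text": "営業利益は", "type": "main phrase", "nouns": ["営業利益"]},
    {"text": "前年度と", "type": "subordinate phrase", "nouns": ["前年度"]},
    {"text": "同水準の", "type": "subordinate phrase", "nouns": ["同水準"]},
    {"text": "３２７７６百万円となった", "type": "predicate", "nouns": []},
]

dependency = extract_dependency(elements, "３２７７６百万円")
print(dependency)
```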

３−３−１．付加情報抽出処理 3-3-1. Additional information extraction process
係り受け情報抽出処理を終えると、ＣＰＵ２０３は、付加情報抽出処理を行う（ステップＳ３３１）。図１７に、付加情報抽出処理におけるフローチャートを示す。 When the dependency information extraction process is completed, the CPU 203 performs the additional information extraction process (step S331). FIG. 17 shows a flowchart of the additional information extraction process.

ＣＰＵ２０３は、数値情報抽出処理において抽出された数値情報データおよび係り受け情報抽出処理において抽出された係り受け情報データの属する文要素データを取得する（図１７、ステップＳ１７０１）。 The CPU 203 acquires the sentence element data to which the numerical information data extracted in the numerical information extraction process and the dependency information data extracted in the dependency information extraction process belong (FIG. 17, step S1701).

次に、ＣＰＵ２０３は、テキスト文解析処理において文要素データに分割されたテキスト文データ１８０１を読み込み、上記において取得した数値情報データおよび係り受け情報データの属する文要素データ以外の文要素データを追加情報として抽出する（ステップＳ１７０３）。 Next, the CPU 203 reads the text sentence data 1801 that was divided into sentence element data in the text sentence analysis process, and extracts, as additional information, the sentence element data other than the sentence element data to which the numerical information data and the dependency information data acquired above belong (step S1703).

例えば、図１８のテキスト文データ１８０１においては、数値情報データの属する文要素データは「３２７７６百万円（述部）」であり、係り受け情報データの属する文要素データは「営業利益は（主格句）」である。したがって、数値情報データおよび係り受け情報データの属する文要素データ以外の文要素データとして、「前年度と（従属句）」および「同水準の（従属句）」が抽出される。 For example, in the text sentence data 1801 in FIG. 18, the sentence element data to which the numerical information data belongs is “32776 million yen (predicate)”, and the sentence element data to which the dependency information data belongs is “operating profit (main phrase)”. Therefore, “as the previous year (subordinate phrase)” and “the same level (subordinate phrase)” are extracted as the sentence element data other than those to which the numerical information data and the dependency information data belong.
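Steps S1701 to S1703 amount to filtering out the two sentence elements that hold the numerical information data and the dependency information data and keeping the rest. A minimal sketch, assuming the elements are held as `(text, type)` tuples and that the positions of the two excluded elements are already known (both assumptions, as is the function name `extract_additional_info`):

```python
# Illustrative sketch only: the tuple layout and index arguments are assumptions.
def extract_additional_info(elements, numeric_index, dependency_index):
    """Keep every sentence element except the two that hold the main information."""
    excluded = {numeric_index, dependency_index}
    return [e for i, e in enumerate(elements) if i not in excluded]


elements = [
    ("営業利益は", "main phrase"),                 # holds the dependency information
    ("前年度と", "subordinate phrase"),
    ("同水準の", "subordinate phrase"),
    ("３２７７６百万円となった", "predicate"),     # holds the numerical information
]

additional = extract_additional_info(elements, numeric_index=3, dependency_index=0)
additional_text = "／".join(text for text, _ in additional)
print(additional_text)
```

Joining the remaining elements with 「／」 mirrors the way the additional information is written in the example record.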

ＣＰＵ２０３は、上記の数値情報抽出処理（図４）によって抽出した数値情報データの数だけ上記処理ステップＳ１７０１〜Ｓ１７０３を繰り返し（ステップＳ１７０５、ＹＥＳ）、他の数値情報データがなければ当該処理を終了する（ステップＳ１７０５、ＮＯ）。 The CPU 203 repeats the processing steps S1701 to S1703 by the number of numerical information data extracted by the numerical information extraction processing (FIG. 4) (step S1705, YES), and ends the processing if there is no other numerical information data. (Step S1705, NO).

付加情報抽出処理を終えるとＣＰＵ２０３は、抽出結果を抽出結果データベース２０９７に記録する（図１６、ステップＳ３３３）。 When the additional information extraction process is finished, the CPU 203 records the extraction result in the extraction result database 2097 (FIG. 16, step S333).

例えば、図１８の１８０３に示すように、抽出結果データベース２０９７には、数値情報データ「３２７７６百万円」、係り受け情報データ「営業利益」に加えて、付加情報「前年度と／同水準の」が記録される。 For example, as shown at 1803 in FIG. 18, the extraction result database 2097 records the additional information “as the previous year / the same level” in addition to the numerical information data “32776 million yen” and the dependency information data “operating profit”.

ＣＰＵ２０３は、テキスト文書から切り出されたテキスト文データの数だけ上記処理（図１１、ステップＳ３０３〜３３３）を繰り返し（ステップＳ３１７、ＹＥＳ）、処理対象となるテキスト文データがなくなれば、当該情報抽出処理を終了する（ステップＳ３１７、ＮＯ）。 The CPU 203 repeats the above process (FIG. 11, steps S303 to S333) once for each piece of text sentence data cut out from the text document (step S317, YES), and ends the information extraction process when no text sentence data remains to be processed (step S317, NO).

３−４．まとめ 3-4. Summary
以上説明したように、この発明によれば、テキスト文データの主情報である数値情報データ・係り受け情報データに加えて、当該主情報にかかる付加情報を抽出することができる。 As described above, according to the present invention, additional information relating to the main information can be extracted in addition to the numerical information data and dependency information data that constitute the main information of the text sentence data.

これにより、数値情報データと直接的な係り受け関係にある係り受け情報データだけでなく、数値情報データまたは係り受け情報データと間接的な関係にある付加情報を抽出することができる。 As a result, not only the dependency information directly related to the numerical information data but also the additional information indirectly related to the numerical information data or the dependency information data can be extracted.

例えば、数値情報データ「３２７７６百万円」および係り受け情報データ「営業利益」に関する付加情報「前年度と／同水準の」は、「営業利益」が「３２７７６百万円」である状態に関する背景・状況としての付加情報として利用することができる。 For example, the additional information “as the previous year / the same level” relating to the numerical information data “32776 million yen” and the dependency information data “operating profit” can be used as additional information describing the background or circumstances of the state in which “operating profit” is “32776 million yen”.

なお、上記実施形態においては、テキスト文データが単文である場合について説明したが、複文または重文を含む場合であってもよい。この場合、第２の実施形態において示したように、テキスト文解析処理（ステップＳ３０９）と係り受け情報抽出処理（ステップＳ３１３）の間に抽出対象文決定処理を挿入すればよい。 In the above embodiment, the case where the text sentence data is a simple sentence has been described, but the text sentence data may include a complex sentence or a compound sentence. In that case, as shown in the second embodiment, the extraction target sentence determination process may be inserted between the text sentence analysis process (step S309) and the dependency information extraction process (step S313).

４．その他の実施形態 4. Other Embodiments
上記実施形態においては、数値情報抽出処理を行った後に分割処理を行うように構成したが、テキスト文解析処理中に数値情報抽出処理を行うように構成してもよい。具体的には、テキスト文解析処理においてテキスト文データを文要素データに分割した後に、各文要素データについて数値情報データが含まれるか否かを判断させ、含まれていればテキスト文データの構文解析を行うように構成すればよい。 In the above embodiment, the division process is performed after the numerical information extraction process, but the numerical information extraction process may instead be performed during the text sentence analysis process. Specifically, after the text sentence data is divided into sentence element data in the text sentence analysis process, it may be determined for each sentence element data whether numerical information data is included, and the syntax analysis of the text sentence data may be performed only if it is.

上記実施形態においては、数値情報データを抽出する場合に、テキスト文データの中に存在する所定の単位文字の前方に連続する数字を数値情報データとして抽出するように構成したが、単位文字によっては、後方に連続する数字を数値情報データとして抽出するようにしてもよい。例えば、「￥」や「＄」などの通貨記号を単位文字とする場合がこれに該当する。 In the above embodiment, numerical information data is extracted by taking the consecutive digits that precede a predetermined unit character in the text sentence data. Depending on the unit character, however, the consecutive digits that follow it may be extracted instead. This applies, for example, when a currency symbol such as “￥” or “＄” is used as the unit character.
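One way to picture the two directions is a pair of regular-expression alternatives: digits followed by a unit character, or a currency symbol followed by digits. The sketch below is a simplification under stated assumptions; the unit lists `SUFFIX_UNITS` and `PREFIX_UNITS` and the function name `extract_numeric_info` are invented for illustration and are not the unit table of the apparatus, and only plain runs of half-width or full-width digits are recognised.

```python
import re

# Illustrative sketch only: both unit lists are assumptions for this example.
DIGITS = r"[0-9０-９]+"
SUFFIX_UNITS = ["百万円", "円", "％"]      # units written after the digits
PREFIX_UNITS = ["￥", "＄", "¥", "$"]      # currency symbols written before the digits

# Try longer units first so that "百万円" is not cut short to "円".
suffix_alt = "|".join(re.escape(u) for u in sorted(SUFFIX_UNITS, key=len, reverse=True))
prefix_alt = "|".join(re.escape(u) for u in PREFIX_UNITS)
PATTERN = re.compile(rf"(?:{prefix_alt}){DIGITS}|{DIGITS}(?:{suffix_alt})")


def extract_numeric_info(sentence):
    """Return every digit run together with its adjoining unit character."""
    return PATTERN.findall(sentence)


print(extract_numeric_info("営業利益は、前年度と同水準の３２７７６百万円となった。"))
print(extract_numeric_info("価格は＄１２０です"))
```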

上記実施形態においては、構文解析処理において、上昇型構文解析のＬＲ文法パーザであるＹＡＣＣを例示したが、その他の種類のパーザであってもよい。例えば、下降型構文解析のＬＬ文法パーザなどがこれに該当する。 In the above embodiment, YACC, an LR grammar parser that performs bottom-up parsing, was given as an example for the syntax analysis process, but other types of parser may be used. An LL grammar parser that performs top-down parsing is one such example.

上記実施形態においては、日本語で記述されたテキスト文書について説明したが、数値情報データが抽出可能であり、形態素解析処理および構文解析処理において文要素データに分割することができれば、他の言語で記述されたテキスト文書であってもよい。例えば、英語の場合は、英文の数値単位に基づいて数値情報データを抽出し、英文用の形態素解析プログラムおよび構文解析プログラムを用いて構成すればよい。 In the above embodiment, a text document described in Japanese has been described. However, numerical information data can be extracted and can be divided into sentence element data in morpheme analysis processing and syntax analysis processing. It may be a written text document. For example, in the case of English, numerical information data may be extracted based on English numeric units and configured using an English morphological analysis program and a syntax analysis program.

上記実施形態においては、図１に示す機能を実現する為に、ＣＰＵ２０３を用い、ソフトウェアによってこれを実現している。しかし、その一部もしくは全てを、ロジック回路等のハードウェアによって実現してもよい。なお、プログラムの一部の処理をさらに、オペレーティングシステム（ＯＳ）にさせるようにしてもよい。 In the above embodiment, the functions shown in FIG. 1 are realized in software using the CPU 203. However, some or all of them may be realized by hardware such as logic circuits. Part of the processing of the program may also be delegated to the operating system (OS).

この発明の実施形態における情報抽出装置の機能ブロック図の例を示す図である。 A diagram showing an example of a functional block diagram of the information extraction apparatus in an embodiment of this invention.
この発明の情報抽出装置のハードウェア構成図の例を示す例である。 A diagram showing an example of a hardware configuration diagram of the information extraction apparatus of this invention.
この発明の「情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "information extraction process" of this invention.
この発明の「数値情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "numerical information extraction process" of this invention.
この発明の「テキスト文解析処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "text sentence analysis process" of this invention.
この発明の「係り受け情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "dependency information extraction process" of this invention.
この発明の「形態素解析」または「構文解析」の例を示す図である。 A diagram showing an example of "morphological analysis" or "syntax analysis" of this invention.
この発明の「形態素解析データ」または「構文解析データ」の例を示す図である。 A diagram showing an example of "morphological analysis data" or "syntax analysis data" of this invention.
この発明の「構文解析データ」の例を示す図である。 A diagram showing an example of "syntax analysis data" of this invention.
この発明の「文法定義情報」の例を示す図である。 A diagram showing an example of "grammar definition information" of this invention.
この発明の「主格句」または「述部」を決定する例を示す図である。 A diagram showing an example of determining the "main phrase" or "predicate" of this invention.
この発明の「解析結果データ」の例を示す図である。 A diagram showing an example of "analysis result data" of this invention.
この発明の実施形態における情報抽出装置の機能ブロック図の例を示す図である。 A diagram showing an example of a functional block diagram of the information extraction apparatus in an embodiment of this invention.
この発明の「情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "information extraction process" of this invention.
この発明の「抽出対象文決定処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "extraction target sentence determination process" of this invention.
この発明の「複文を主文と副文に分割する」場合の例を示す図である。 A diagram showing an example of "dividing a complex sentence into a main clause and a subordinate clause" in this invention.
この発明の「重文を前半文と後半文に分割する」場合の例を示す図である。 A diagram showing an example of "dividing a compound sentence into a first-half sentence and a second-half sentence" in this invention.
この発明の実施形態における情報抽出装置の機能ブロック図の例を示す図である。 A diagram showing an example of a functional block diagram of the information extraction apparatus in an embodiment of this invention.
この発明の「情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "information extraction process" of this invention.
この発明の「付加情報抽出処理」におけるフローチャートの例を示す図である。 A diagram showing an example of a flowchart of the "additional information extraction process" of this invention.
この発明のテキスト文データから付加情報を抽出する場合の例を示す図である。 A diagram showing an example of extracting additional information from text sentence data in this invention.

Explanation of symbols

１０１ 切出手段 101 Cut-out means
１０３ 数値情報抽出手段 103 Numerical information extraction means
１０５ テキスト文解析手段 105 Text sentence analysis means
１０６ 抽出対象文決定手段 106 Extraction target sentence determination means
１０７ 係り受け情報抽出手段 107 Dependency information extraction means
１０８ 付加情報抽出手段 108 Additional information extraction means
１０９ 出力手段 109 Output means

Claims

An information extraction device for extracting information data including numerical information data from given text sentence data,
Extraction / analysis means for dividing the given text sentence data into sentence element data, which are phrases; recording each sentence element data in a recording unit as a “main phrase” when it ends in one of the particles “ha”, “ga”, and “mo”, as a “predicate” when the part of speech at its end is one of “noun”, “adjective”, and “verb”, and as a “subordinate phrase” otherwise; determining whether a character indicating a numerical unit is included in the given text sentence data; and, if it is included, identifying the numerical unit together with the consecutive numbers adjoining it as numerical information data and recording it in the recording unit,
Referring to the recording unit, the sentence element data including the numerical information data is identified, and at least one sentence element data having a dependency relation with the numerical information data is determined based on the type of the sentence element data. Dependency information extracting means for extracting dependency information data from the sentence element data,
An information extraction apparatus comprising: output means for outputting numerical information data extracted by the numerical information extraction means and dependency information data extracted by the dependency information extraction means;
The extraction / analysis means determines that the sentence element data is a “main phrase” when the sentence element data ends in any one of “ha”, “ga”, and “mo”, that the sentence element data is a “predicate” when the part of speech at its end is one of “noun”, “adjective”, and “verb”, and that the sentence element data is a “subordinate phrase” when it corresponds to neither a “main phrase” nor a “predicate”,
The dependency information extraction means, when the numerical information data is included in the sentence element data of the “predicate”, uses the noun of the “main phrase” in the text sentence data as the dependency information data, and, when the numerical information data is included in the sentence element data of a “subordinate phrase” and the word modified by that sentence element data is a nominal, uses the noun of the sentence element data following that sentence element data as the dependency information data. An information extraction apparatus characterized by the above.

A program for realizing, using a computer, an information extraction device that extracts information data including numerical information data from given text sentence data,
Extraction / analysis means for dividing the given text sentence data into sentence element data, which are phrases; recording each sentence element data in a recording unit as a “main phrase” when it ends in one of the particles “ha”, “ga”, and “mo”, as a “predicate” when the part of speech at its end is one of “noun”, “adjective”, and “verb”, and as a “subordinate phrase” otherwise; determining whether a character indicating a numerical unit is included in the given text sentence data; and, if it is included, identifying the numerical unit together with the consecutive numbers adjoining it as numerical information data and recording it in the recording unit,
Referring to the recording unit, the sentence element data including the numerical information data is identified, and at least one sentence element data having a dependency relation with the numerical information data is determined based on the type of the sentence element data. Dependency information extracting means for extracting dependency information data from the sentence element data,
Numerical information data extracted by the numerical information extraction means, output means for outputting the dependency information data extracted by the dependency information extraction means,
A program to make it function,
The extraction / analysis means divides the text sentence data into a plurality of sentence element data, and determines the type of each sentence element data as a main phrase when it includes a subject, as a predicate when it includes a predicate, and as a subordinate phrase when it includes neither,
The dependency information extraction means, when the numerical information data is included in the sentence element data of the “predicate”, uses the noun of the “main phrase” in the text sentence data as the dependency information data, and, when the numerical information data is included in the sentence element data of a “subordinate phrase” and the word modified by that sentence element data is a nominal, uses the noun of the sentence element data following that sentence element data as the dependency information data. A program characterized by the above.

In the program of Claim 2,
The text sentence analysis means divides the text sentence data into a plurality of sentence element data, and determines the type of each sentence element data as a main phrase when it includes a subject, as a predicate when it includes a predicate, and as a subordinate phrase when it includes neither a subject nor a predicate. A program characterized by the above.

In the program of Claim 2 or 3,
The dependency information extraction unit includes the numerical value in the text sentence data when the numerical information data is included in the sentence element data of the “subordinate phrase” and the word to be modified by the sentence element data is a predicate. A program characterized in that a combination of all sentence element data other than a “subordinate phrase” including information data is defined as dependency information data.

In the program in any one of Claims 2-4,
The dependency information extracting means, when the numerical information data is included in the sentence element data of “main phrase”, the sentence element other than “main phrase” including the numerical information data in the text sentence data A program characterized in that a combination of all data becomes dependency information data.

In the program in any one of Claims 2-5,
A program for causing the computer to further function as extraction target sentence determination means that determines whether the text sentence data has two or more pairs of a main phrase and a predicate and, if there are two or more, decomposes the text sentence data into simple sentences each containing only one pair of a main phrase and a predicate, thereby generating text sentence data.

In the program in any one of Claims 2-6,
Additional information extraction for extracting sentence element data as additional information data from the text sentence data, excluding sentence element data including the numerical information data and sentence element data including the dependency information data A program for further functioning as a means.

An information extraction method for extracting information data including numerical information data from given text sentence data by a computer,
An extraction / analysis step in which the computer divides the given text sentence data into sentence element data, which are phrases, classifies each sentence element data into at least the types “main phrase”, “subordinate phrase”, and “predicate” and records them in a recording unit, determines whether a character indicating a numerical unit is included in the given text sentence data, and, if it is included, specifies the numerical unit and the consecutive numbers before or after the numerical unit as numerical information data and records it in the recording unit;
The computer refers to the recording unit to identify sentence element data including numerical information data, and based on the type of the sentence element data, at least one sentence element data having a dependency relationship with the numerical information data. A dependency information extraction step for determining and extracting dependency information data from the determined sentence element data;
An output step in which the computer outputs the numerical information data extracted by the numerical information extraction means and the dependency information data extracted by the dependency information extraction means;
An information extraction method in which the computer executes the above steps, characterized in that:
In the extraction / analysis step, the computer treats the sentence element data as a “main phrase” when its end is any one of “ha”, “ga”, and “mo”, as a “predicate” when the part of speech at its end is one of “noun”, “adjective”, and “verb”, and as a “subordinate phrase” when it falls under neither “main phrase” nor “predicate”,
In the dependency information extraction step, the computer, when the numerical information data is included in the sentence element data of the “predicate”, uses the noun of the “main phrase” in the text sentence data as the dependency information data, and, when the numerical information data is included in the sentence element data of a “subordinate phrase” and the word modified by that sentence element data is a nominal, uses the noun of the sentence element data following that sentence element data as the dependency information data. An information extraction method characterized by the above.