JP5776539B2

JP5776539B2 - Extraction apparatus, extraction program, and extraction method

Info

Publication number: JP5776539B2
Application number: JP2011284536A
Authority: JP
Inventors: 岩倉 友哉 (Tomoya Iwakura); 井形 伸之 (Nobuyuki Igata)
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-12-26
Filing date: 2011-12-26
Publication date: 2015-09-09
Anticipated expiration: 2031-12-26
Also published as: JP2013134625A

Description

The present invention relates to an extraction device, an extraction program, and an extraction method.

One elemental technology of natural language processing is named entity extraction, which extracts words of particular types, such as person names, place names, organization names, time expressions, and numerical expressions, from electronic text (see, for example, Non-Patent Document 1 below).

Named entity extraction is used in information retrieval, information extraction, syntactic parsing, text mining, and the like. In general, proper nouns such as person names, place names, and organization names, together with time expressions and numerical expressions, are collectively called named entities.

One technique for improving the accuracy of named entity extraction extracts multiple words that share the same notation in the target text as named entities of the same type (see, for example, Non-Patent Document 2 below).

Related techniques include one that corrects extraction results based on how frequently each named entity type extracted from the text appears (see, for example, Patent Document 1 below) and one that uses two kinds of named entity extractors in combination (see, for example, Patent Document 2 below).

Machine learning techniques for automatically generating the rules used in named entity extraction include Boosting (see, for example, Patent Documents 3 and 4 and Non-Patent Document 3 below) and learning with an SVM (Support Vector Machine) (see, for example, Non-Patent Documents 4 and 5 below).

Patent Document 1: JP 2007-148785 A
Patent Document 2: JP 2006-330935 A
Patent Document 3: JP 2010-33213 A
Patent Document 4: JP 2010-33214 A

Non-Patent Document 1: Kiyotaka Uchimoto, Qing Ma, Masaki Murata, Hiromi Ozaku, and Hitoshi Isahara, "Named Entity Extraction Based on A Maximum Entropy Model and Transformation Rules," ACL '00, pp. 326-335.
Non-Patent Document 2: Jenny Rose Finkel, Trond Grenager, and Christopher Manning, "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling," ACL '05, pp. 363-370.
Non-Patent Document 3: Robert E. Schapire and Yoram Singer, "BoosTexter: A Boosting-based System for Text Categorization," Machine Learning, Volume 39, Numbers 2-3, pp. 135-168.
Non-Patent Document 4: Vladimir N. Vapnik, "Statistical Learning Theory," Wiley-Interscience, 1998.
Non-Patent Document 5: T. Joachims, "Training Linear SVMs in Linear Time," Proceedings of KDD '06, the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

However, the conventional techniques described above assume that, when the target text contains a group of words with the same notation, those words tend to have the same named entity type within a single document, and they give a high score when words with the same notation receive the same type. When this assumption does not hold, the whole group is therefore mistakenly extracted as words of the same type.

To solve the problems of the conventional techniques described above, an object of the present invention is to provide an extraction device, an extraction program, and an extraction method that can improve the accuracy of named entity extraction.

To solve the above problems and achieve the object, one aspect of the present invention proposes an extraction device, an extraction program, and an extraction method with the following features. A first storage unit stores discrimination rules, each associating a combination of co-occurring words with information indicating whether same-notation words that appear with each of those co-occurring words are of the same type. A second storage unit stores extraction rules, each associating a combination of a co-occurring word and the distance to that word with information indicating the word type defined for that distance. A first word and a second word having the same notation as the first word are detected in a sequence of words. It is determined whether the first storage unit holds a discrimination rule in which the detected first word appears with one of the co-occurring words and the detected second word appears with the other. If such a rule exists, it is used to determine whether the first word and the second word are of the same type. If they are determined to be of the same type, combinations of a word lying within a predetermined distance of each of the first and second words and the distance to that word are identified in the sequence of words. The information indicating the word type associated with the identified combinations is extracted from the extraction rules stored in the second storage unit and assigned to the first word and the second word, and the sequence of words with the assigned types is output.

According to one aspect of the present invention, the accuracy of named entity extraction can be improved.

FIG. 1 is an explanatory diagram showing an overview of named entity extraction by the extraction device.
FIG. 2 is a block diagram showing an example hardware configuration of the extraction device 100.
FIG. 3 is an explanatory diagram showing an example of the stored contents of the same-notation/type word discrimination rules 300.
FIG. 4 is an explanatory diagram showing an example of the stored contents of the named entity extraction rules 400.
FIG. 5 is a block diagram showing the functional configuration of the extraction device 100.
FIG. 6 is an explanatory diagram showing Example 1 of the learning data used in the rule learning process by the extraction device 100.
FIG. 7 is an explanatory diagram (part 1) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 600 shown in FIG. 6.
FIG. 8 is an explanatory diagram (part 2) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 600 shown in FIG. 6.
FIG. 9 is an explanatory diagram (part 3) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 600 shown in FIG. 6.
FIG. 10 is a flowchart (part 1) showing the detailed procedure of the rule learning process according to the first embodiment.
FIG. 11 is a flowchart (part 2) showing the detailed procedure of the rule learning process according to the first embodiment.
FIG. 12 is an explanatory diagram (part 1) showing Specific Example 1 of the named entity extraction process by the extraction device 100 according to the first embodiment.
FIG. 13 is an explanatory diagram (part 2) showing Specific Example 1 of the named entity extraction process by the extraction device 100 according to the first embodiment.
FIG. 14 is an explanatory diagram showing Output Example 1 of the named entity extraction result by the extraction device 100 according to the first embodiment.
FIG. 15 is a flowchart showing the detailed procedure of the named entity extraction process according to the first embodiment.
FIG. 16 is an explanatory diagram showing Example 2 of the learning data used in the rule learning process by the extraction device 100.
FIG. 17 is an explanatory diagram (part 1) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 1600 shown in FIG. 16.
FIG. 18 is an explanatory diagram (part 2) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 1600 shown in FIG. 16.
FIG. 19 is an explanatory diagram (part 3) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 1600 shown in FIG. 16.
FIG. 20 is an explanatory diagram (part 1) showing Specific Example 2 of the named entity extraction process by the extraction device 100 according to the first embodiment.
FIG. 21 is an explanatory diagram (part 2) showing Specific Example 2 of the named entity extraction process by the extraction device 100 according to the first embodiment.
FIG. 22 is an explanatory diagram showing Output Example 2 of the named entity extraction result by the extraction device 100 according to the first embodiment.
FIG. 23 is an explanatory diagram showing an example of the chunk lattice used in the rule learning process by the extraction device 100 according to the second embodiment.
FIG. 24 is an explanatory diagram (part 1) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the chunk lattice generated in FIG. 23.
FIG. 25 is an explanatory diagram (part 2) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the chunk lattice generated in FIG. 23.
FIG. 26 is an explanatory diagram (part 3) showing how the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 are generated using the chunk lattice generated in FIG. 23.
FIG. 27 is a flowchart (part 1) showing the detailed procedure of the rule learning process according to the second embodiment.
FIG. 28 is a flowchart (part 2) showing the detailed procedure of the rule learning process according to the second embodiment.
FIG. 29 is an explanatory diagram showing an example of a chunk lattice that is the target of named entity extraction by the extraction device 100.
FIG. 30 is an explanatory diagram (part 1) showing a specific example of the named entity extraction process by the extraction device 100 according to the second embodiment.
FIG. 31 is an explanatory diagram (part 2) showing a specific example of the named entity extraction process by the extraction device 100 according to the second embodiment.
FIG. 32 is an explanatory diagram showing an output example of the named entity extraction result by the extraction device 100 according to the second embodiment.
FIG. 33 is a flowchart showing the detailed procedure of the named entity extraction process according to the second embodiment.

Embodiments of the extraction device, extraction program, and extraction method according to the present invention are described in detail below with reference to the accompanying drawings.

(Overview of named entity extraction by the extraction device)
First, named entity extraction by the extraction device is outlined with reference to FIG. 1.

FIG. 1 is an explanatory diagram showing an overview of named entity extraction by the extraction device. In FIG. 1, the extraction device 100 extracts named entities from text data 110, the target of extraction. The extraction device 100 holds the same-notation/type word discrimination rules 300 and the named entity extraction rules 400, which it consults when extracting named entities.

The same-notation/type word discrimination rules 300 are a table that stores discrimination rules. A discrimination rule is consulted to determine whether a pair of words with the same notation are named entities of the same type. For example, a discrimination rule associates a combination of co-occurring words with information indicating whether a pair of same-notation words that appear with the respective co-occurring words are of the same type.

Here, a co-occurring word is a word that appears together with one of two same-notation words in a text. For example, in "宮崎出身の宮崎さん" ("Mr. Miyazaki, who is from Miyazaki"), the two occurrences of "宮崎" (Miyazaki) are the same-notation words, and "出身" ("native of") and "さん" (an honorific such as "Mr."), which appear with the respective occurrences, are the co-occurring words. Co-occurring words usually differ from each other in notation, but in this embodiment two co-occurring words with the same notation are also treated as a combination of co-occurring words. In FIG. 1, "○" indicates the same type and "×" indicates different types.

In the example of FIG. 1, the "宮崎" before "出身" is a place and the "宮崎" before "さん" is a person name, so the discrimination result for these two words is "×", meaning they are not of the same type. In contrast, if the text data were "宮崎出身の友達と宮崎に行く" ("I go to Miyazaki with a friend from Miyazaki"), both occurrences of "宮崎" would be places, so the discrimination result for the two words would be "○", meaning they are of the same type.

The named entity extraction rules 400 are a table that stores extraction rules. An extraction rule is consulted to extract the type of a given word. For example, an extraction rule states that, when a certain word lies at a given distance from the target word, the target word is of a given type. In FIG. 1, "出身:+1" indicates that the co-occurring word "出身" lies one word after the target word. "LOC" (LOCATION) indicates that the word type is a place, and "PER" (PERSON) indicates that the word type is a person.

The example of FIG. 1 describes named entity extraction by the extraction device 100 for the case where the target text data 110 is "宮崎出身の宮崎さんと宮崎へ行く。" ("I go to Miyazaki with Mr. Miyazaki, who is from Miyazaki.").

(1) First, the extraction device 100 detects words with the same notation in the text data 110 and determines whether the detected words are named entities of the same type. The extraction device 100 thereby finds words that share both notation and type, and can handle them as a single group in subsequent processing.

In the example of FIG. 1, the extraction device 100 detects the pair "宮崎" 111 and "宮崎" 112 in the text data 110 as words with the same notation. Next, it identifies combinations of words that appear with "宮崎" 111 and "宮崎" 112, respectively, as clues for determining whether the two are of the same type. Here, a clue is a key that serves as an indicator when classifying data. Specifically, for example, the extraction device 100 identifies combinations of words lying within two words before or after "宮崎" 111 and "宮崎" 112, such as "出身＆さん" and "の＆さん", as clues for determining whether the two are of the same type.
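
The detection and clue-generation step described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the word segmentation in `TOKENS`, the function names, and the choice to pair every context word of one mention with every context word of the other are all assumptions made for the example.

```python
from itertools import combinations, product

# Assumed pre-tokenized form of the example sentence; the text does not
# specify a tokenizer, so this segmentation is an assumption.
TOKENS = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]


def same_notation_pairs(tokens, surface):
    """Return all index pairs (i, j), i < j, whose tokens both equal `surface`."""
    positions = [i for i, tok in enumerate(tokens) if tok == surface]
    return list(combinations(positions, 2))


def context_words(tokens, index, window=2):
    """Words within `window` tokens before or after `index`, excluding the word itself."""
    start = max(0, index - window)
    end = min(len(tokens), index + window + 1)
    return [tokens[k] for k in range(start, end) if k != index]


def discrimination_clues(tokens, i, j, window=2):
    """Pair each context word of mention i with each context word of mention j."""
    return list(product(context_words(tokens, i, window),
                        context_words(tokens, j, window)))


pairs = same_notation_pairs(TOKENS, "宮崎")      # positions of 111, 112, 113 are 0, 3, 6
clues = discrimination_clues(TOKENS, 0, 3)       # clues for the pair 111 and 112
```

Under these assumptions, `clues` contains `("出身", "さん")` and `("の", "さん")`, which correspond to the clues "出身＆さん" and "の＆さん" named in the text.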

The same-notation/type word discrimination rules 300 store the co-occurring words "出身" and "さん", which correspond to the identified clue "出身＆さん". They store these in a discrimination rule whose result is "×", meaning that a pair of same-notation words appearing with the respective co-occurring words are of different types. From the rules 300 and the clue "出身＆さん", the extraction device 100 therefore determines that "宮崎" 111 and "宮崎" 112 are of different types.

Similarly, the extraction device 100 detects "宮崎" 111 and "宮崎" 113 in the text data 110 as words with the same notation. It then identifies combinations of words that appear with "宮崎" 111 and "宮崎" 113, respectively, such as "出身＆へ" and "出身＆行く" ("へ" is the particle "to" and "行く" is "go"), as clues.

The same-notation/type word discrimination rules 300 store the co-occurring words "出身" and "行く", which correspond to the identified clue "出身＆行く". They store these in a discrimination rule whose result is "○", meaning that a pair of same-notation words appearing with the respective co-occurring words are of the same type. From the rules 300 and the clue "出身＆行く", the extraction device 100 therefore determines that "宮崎" 111 and "宮崎" 113 are of the same type.

Similarly, the extraction device 100 detects "宮崎" 112 and "宮崎" 113 in the text data 110 as words with the same notation. It then identifies combinations of words that appear with "宮崎" 112 and "宮崎" 113, respectively, such as "さん＆へ" and "さん＆行く", as clues.

The same-notation/type word discrimination rules 300 store the co-occurring words "さん" and "行く", which correspond to the identified clue "さん＆行く". They store these in a discrimination rule whose result is "×", meaning that a pair of same-notation words appearing with the respective co-occurring words are of different types. From the rules 300 and the clue "さん＆行く", the extraction device 100 therefore determines that "宮崎" 112 and "宮崎" 113 are of different types.

As a result, the extraction device 100 determines that "宮崎" 111 and "宮崎" 113 are words of the same notation and the same type, and can handle "宮崎" 111 and "宮崎" 113 as a single group in subsequent processing.
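
The three pairwise judgments above can be sketched as a table lookup followed by grouping. This is only an illustration under stated assumptions: the dictionary layout of `DISCRIMINATION_RULES`, the ordered-tuple keys, and the use of union-find to merge mentions judged "○" are choices made for this sketch, and `PAIR_CLUES` simply hard-codes the clue that the walkthrough uses for each pair. This passage does not say how the device chooses among several clues that match different rules, so that step is not modeled here.

```python
# Hypothetical in-memory form of the discrimination rule table (300 in FIG. 1).
# Key: (word co-occurring with the first mention, word co-occurring with the second).
# Value: True for "○" (same type), False for "×" (different types).
DISCRIMINATION_RULES = {
    ("出身", "さん"): False,
    ("出身", "行く"): True,
    ("さん", "行く"): False,
}

# Clue used for each pair of "宮崎" mentions in the walkthrough, keyed by the
# token positions of the two mentions (0 = 111, 3 = 112, 6 = 113).
PAIR_CLUES = {
    (0, 3): ("出身", "さん"),
    (0, 6): ("出身", "行く"),
    (3, 6): ("さん", "行く"),
}


def group_mentions(positions, pair_clues, rules):
    """Merge mentions whose pair is judged "same type"; all others stay separate."""
    parent = {p: p for p in positions}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (i, j), clue in pair_clues.items():
        if rules.get(clue) is True:  # a missing rule or "×" leaves the pair unmerged
            parent[find(j)] = find(i)

    groups = {}
    for p in positions:
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())


groups = group_mentions([0, 3, 6], PAIR_CLUES, DISCRIMINATION_RULES)
```

Here `groups` is `[[0, 6], [3]]`: mentions 111 and 113 form one group and mention 112 stays on its own, matching the outcome described in the text.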

(2) The extraction device 100 groups together the words in the text data 110 that were determined in (1) to have the same notation and the same type. Then, for each word or each group of words in the text data 110, it identifies clues for named entity extraction and extracts the word type from those clues. In this way, the extraction device 100 extracts one common type for all words in a group of same-notation, same-type words. A word that shares its notation with other words but differs from them in type has its type extracted separately.

In the example of FIG. 1, the extraction device 100 groups "宮崎" 111 and "宮崎" 113, which were determined to have the same notation and type, and identifies as extraction clues the combinations of each word lying within a predetermined distance of "宮崎" 111 or "宮崎" 113 and the distance to that word. Here, distance is the distance between words and is measured in words or characters. In the following, distance is expressed as a number of words, with "+" denoting the direction toward the end of the sentence and "-" the direction toward its beginning. For example, a word located one word after a given word is at distance "+1" from it, and a word located one word before it is at distance "-1". The distance of a word from itself is "0".

In FIG. 1, the predetermined distance is two words before or after. Specifically, for example, because "出身" lies one word after "宮崎" 111, the extraction device 100 identifies "出身:+1" as a clue. Similarly, it identifies "の:+2", "さん:-2", "と:-1", "へ:+1", and "行く:+2" as clues.
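
The signed word distances and the resulting clues can be reproduced with a short sketch. The tokenization in `TOKENS`, the function names, and the `"word:offset"` string format are assumptions for illustration only.

```python
# Assumed pre-tokenized form of the example sentence (segmentation is an assumption).
TOKENS = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]


def positional_clues(tokens, index, window=2):
    """Return "word:offset" clues for the words within `window` tokens of `index`.

    Offsets are signed word counts: negative toward the start of the sentence,
    positive toward its end. The target word itself (offset 0) is skipped.
    """
    clues = []
    for offset in range(-window, window + 1):
        k = index + offset
        if offset != 0 and 0 <= k < len(tokens):
            clues.append(f"{tokens[k]}:{offset:+d}")
    return clues


def group_clues(tokens, indices, window=2):
    """Concatenate the clues of every mention in a group, in order of position."""
    return [c for i in indices for c in positional_clues(tokens, i, window)]


clues_111_113 = group_clues(TOKENS, [0, 6])   # grouped mentions 111 and 113
clues_112 = positional_clues(TOKENS, 3)       # mention 112 on its own
```

With this segmentation, `clues_111_113` is `["出身:+1", "の:+2", "さん:-2", "と:-1", "へ:+1", "行く:+2"]`, the same six clues listed in the text.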

The named entity extraction rules 400 store an extraction rule stating that the word type indicated by the clue "出身:+1" is "LOC", and another stating that the word type indicated by the clue "行く:+2" is "LOC". From the rules 400 and the clues "出身:+1" and "行く:+2", the extraction device 100 therefore extracts "LOC" as the type of "宮崎" 111 and "宮崎" 113.

For "宮崎" 112, which has no other word of the same type, the extraction device 100 identifies as extraction clues the combinations of each word lying within two words before or after "宮崎" 112 and the distance to that word. Specifically, for example, it identifies "出身:-2", "の:-1", "さん:+1", and "と:+2" as clues. The named entity extraction rules 400 store an extraction rule stating that the word type indicated by "さん:+1" is "PER". From the rules 400 and "さん:+1", the extraction device 100 therefore extracts "PER" as the type of "宮崎" 112.

The extraction device 100 can thus extract the same type "LOC" for "宮崎" 111 and "宮崎" 113 as a group, since they share both notation and type. For "宮崎" 112, which has the same notation but a different type, it can extract the word type "PER" separately from "宮崎" 111 and "宮崎" 113.
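
The type extraction for the group and for the standalone mention can be sketched as a lookup against the extraction rule table. The dictionary layout of `EXTRACTION_RULES` and, in particular, the majority vote used when several rules match are assumptions of this sketch; the passage only states which rules match in the example, not how matches are combined.

```python
from collections import Counter

# Hypothetical in-memory form of the extraction rule table (400 in FIG. 1).
# Key: "co-occurring word:signed distance"; value: the word type it indicates.
EXTRACTION_RULES = {
    "出身:+1": "LOC",
    "行く:+2": "LOC",
    "さん:+1": "PER",
}


def extract_type(clues, rules):
    """Return the type supported by the most matching rules, or None if none match.

    Majority voting is an assumption made for this sketch.
    """
    votes = Counter(rules[c] for c in clues if c in rules)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Clues from the walkthrough: the group of mentions 111 and 113, and mention 112.
group_111_113 = ["出身:+1", "の:+2", "さん:-2", "と:-1", "へ:+1", "行く:+2"]
mention_112 = ["出身:-2", "の:-1", "さん:+1", "と:+2"]

type_111_113 = extract_type(group_111_113, EXTRACTION_RULES)
type_112 = extract_type(mention_112, EXTRACTION_RULES)
```

In this sketch `type_111_113` is `"LOC"` and `type_112` is `"PER"`. Note that "出身:-2" does not match the rule keyed "出身:+1", because the distance is part of the key.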

As a result, by extracting words of the same notation and type together as one type, the extraction device 100 avoids assigning different types to words that are actually of the same type, which improves extraction accuracy. By handling same-notation words of different types separately when extracting their types, it also avoids mistakenly extracting them as the same type, which likewise improves extraction accuracy.

(3) The extraction device 100 then outputs the extraction result. Specifically, for example, it outputs to a display the text data 110 with the extracted word types attached as tags. The extraction device 100 may also send the tagged text data 110 to another computer over a network or write it to a recording medium. The extraction device 100 can thereby inform its user of the types of the words in the text data 110, and can supply the tagged text data 110 to other software (for example, translation software or information retrieval software).
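
One possible form of the tagged output is sketched below. The inline `<TYPE>...</TYPE>` syntax, the function name `tag_text`, and the tokenization are hypothetical; the passage says only that the extracted types are attached to the text as tags.

```python
# Assumed pre-tokenized form of the example sentence (segmentation is an assumption).
TOKENS = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]

# Types extracted in the walkthrough, keyed by token position.
LABELS = {0: "LOC", 3: "PER", 6: "LOC"}


def tag_text(tokens, labels):
    """Wrap each labeled token in <TYPE>...</TYPE> and join the tokens into one string."""
    out = []
    for i, tok in enumerate(tokens):
        label = labels.get(i)
        out.append(f"<{label}>{tok}</{label}>" if label else tok)
    return "".join(out)


tagged = tag_text(TOKENS, LABELS)
print(tagged)
```

Tokens are joined without spaces because Japanese text is written without word delimiters.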

For words with notation variants (for example, "宮崎", "みやざき", and "ミヤザキ"), the extraction device 100 may treat the variants as the same notation. Likewise, when a co-occurring word can be inflected (for example, "行く", "行き", and "行け"), the extraction device 100 may treat its inflected forms as the same co-occurring word.
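
A simple way to picture this optional normalization is a lookup applied before mentions and co-occurring words are compared. Both tables and both function names below are hypothetical. The passage does not say how variants or inflected forms are recognized, and a practical system would more likely rely on a dictionary or a morphological analyzer.

```python
# Hypothetical variant tables covering only the examples given in the text.
NOTATION_VARIANTS = {"みやざき": "宮崎", "ミヤザキ": "宮崎"}
INFLECTED_FORMS = {"行き": "行く", "行け": "行く"}


def normalize_mention(word):
    """Map a notation variant to its canonical notation; other words are unchanged."""
    return NOTATION_VARIANTS.get(word, word)


def normalize_cooccurrence(word):
    """Map an inflected co-occurring word to its base form; other words are unchanged."""
    return INFLECTED_FORMS.get(word, word)


same_notation = normalize_mention("ミヤザキ") == normalize_mention("宮崎")
same_cooccurrence = normalize_cooccurrence("行け") == normalize_cooccurrence("行く")
```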

（抽出装置１００のハードウェア構成例） (Hardware configuration example of extraction device 100)
次に、図２を用いて、抽出装置１００のハードウェア構成例について説明する。 Next, a hardware configuration example of the extraction device 100 will be described with reference to FIG. 2.

図２は、抽出装置１００のハードウェア構成例を示すブロック図である。図２において、抽出装置１００は、ＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）２０１と、ＲＯＭ（Ｒｅａｄ‐ＯｎｌｙＭｅｍｏｒｙ）２０２と、ＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）２０３と、磁気ディスクドライブ２０４と、磁気ディスク２０５と、光ディスクドライブ２０６と、光ディスク２０７と、ディスプレイ２０８と、Ｉ／Ｆ（Ｉｎｔｅｒｆａｃｅ）２０９と、キーボード２１０と、マウス２１１と、スキャナ２１２と、プリンタ２１３と、を備えている。また、各構成部はバス２２０によってそれぞれ接続されている。 FIG. 2 is a block diagram illustrating a hardware configuration example of the extraction device 100. In FIG. 2, an extraction apparatus 100 includes a CPU (Central Processing Unit) 201, a ROM (Read-Only Memory) 202, a RAM (Random Access Memory) 203, a magnetic disk drive 204, a magnetic disk 205, and an optical disk drive. 206, an optical disk 207, a display 208, an I / F (Interface) 209, a keyboard 210, a mouse 211, a scanner 212, and a printer 213. Each component is connected by a bus 220.

ここで、ＣＰＵ２０１は、抽出装置１００の全体の制御を司る。ＲＯＭ２０２は、ブートプログラムなどのプログラムを記憶している。ＲＡＭ２０３は、ＣＰＵ２０１のワークエリアとして使用される。磁気ディスクドライブ２０４は、ＣＰＵ２０１の制御にしたがって磁気ディスク２０５に対するデータのリード／ライトを制御する。磁気ディスク２０５は、磁気ディスクドライブ２０４の制御で書き込まれたデータを記憶する。 Here, the CPU 201 governs overall control of the extraction apparatus 100. The ROM 202 stores a program such as a boot program. The RAM 203 is used as a work area for the CPU 201. The magnetic disk drive 204 controls reading / writing of data with respect to the magnetic disk 205 according to the control of the CPU 201. The magnetic disk 205 stores data written under the control of the magnetic disk drive 204.

光ディスクドライブ２０６は、ＣＰＵ２０１の制御にしたがって光ディスク２０７に対するデータのリード／ライトを制御する。光ディスク２０７は、光ディスクドライブ２０６の制御で書き込まれたデータを記憶したり、光ディスク２０７に記憶されたデータをコンピュータに読み取らせたりする。 The optical disk drive 206 controls reading / writing of data with respect to the optical disk 207 according to the control of the CPU 201. The optical disk 207 stores data written under the control of the optical disk drive 206, or causes the computer to read data stored on the optical disk 207.

ディスプレイ２０８は、カーソル、アイコンあるいはツールボックスをはじめ、文書、画像、機能情報などのデータを表示する。このディスプレイ２０８は、例えば、ＣＲＴ、ＴＦＴ液晶ディスプレイ、プラズマディスプレイなどを採用することができる。 The display 208 displays data such as a document, an image, and function information as well as a cursor, an icon, or a tool box. As the display 208, for example, a CRT, a TFT liquid crystal display, a plasma display, or the like can be adopted.

インターフェース（以下、「Ｉ／Ｆ」と略する。）２０９は、通信回線を通じてＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＷＡＮ（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）、インターネットなどのネットワーク２１４に接続され、このネットワーク２１４を介して他の装置に接続される。そして、Ｉ／Ｆ２０９は、ネットワーク２１４と内部のインターフェースを司り、外部装置からのデータの入出力を制御する。Ｉ／Ｆ２０９には、例えばモデムやＬＡＮアダプタなどを採用することができる。 An interface (hereinafter abbreviated as “I / F”) 209 is connected to a network 214 such as a LAN (Local Area Network), a WAN (Wide Area Network), and the Internet through a communication line. Connected to other devices. The I / F 209 controls an internal interface with the network 214 and controls data input / output from an external device. For example, a modem or a LAN adapter may be employed as the I / F 209.

キーボード２１０は、文字、数字、各種指示などの入力のためのキーを備え、データの入力を行う。また、タッチパネル式の入力パッドやテンキーなどであってもよい。マウス２１１は、カーソルの移動や範囲選択、あるいはウィンドウの移動やサイズの変更などを行う。ポインティングデバイスとして同様に機能を備えるものであれば、トラックボールやジョイスティックなどであってもよい。 The keyboard 210 includes keys for inputting characters, numbers, various instructions, and the like, and inputs data. Moreover, a touch panel type input pad or a numeric keypad may be used. The mouse 211 performs cursor movement, range selection, window movement, size change, and the like. A trackball or a joystick may be used as long as they have the same function as a pointing device.

スキャナ２１２は、画像を光学的に読み取り、抽出装置１００内に画像データを取り込む。なお、スキャナ２１２は、ＯＣＲ（ＯｐｔｉｃａｌＣｈａｒａｃｔｅｒＲｅａｄｅｒ）機能を持たせてもよい。また、プリンタ２１３は、画像データや文書データを印刷する。プリンタ２１３には、例えば、レーザプリンタやインクジェットプリンタを採用することができる。 The scanner 212 optically reads an image and captures image data into the extraction apparatus 100. The scanner 212 may have an OCR (Optical Character Reader) function. The printer 213 prints image data and document data. As the printer 213, for example, a laser printer or an ink jet printer can be adopted.

（同一表記・種類単語判別規則３００の記憶内容） (Stored contents of the same notation / type word discrimination rule 300)
次に、図３を用いて、図１に示した同一表記・種類単語判別規則３００の記憶内容について説明する。上述したように、同一表記・種類単語判別規則３００は、抽出装置１００が有するテーブルであり、判別規則を記憶する。判別規則は、同一表記の１組の単語が同一種類の単語か否かを判別するために参照される規則である。判別規則は、例えば、共起単語の組み合わせと、当該共起単語の組み合わせの各々とともに出現する同一表記の１組の単語が同一種類であるか否かを示す情報と、を関連付けた規則である。なお、同一表記・種類単語判別規則３００は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などにより実現される。 Next, the stored contents of the same notation / type word discrimination rule 300 shown in FIG. 1 will be described with reference to FIG. 3. As described above, the same notation / type word discrimination rule 300 is a table held by the extraction device 100 and stores discrimination rules. A discrimination rule is referred to in order to determine whether a set of words having the same notation are words of the same type. For example, a discrimination rule associates a combination of co-occurrence words with information indicating whether the set of same-notation words that appear with the respective co-occurrence words of the combination are of the same type. The same notation / type word discrimination rule 300 is realized by the RAM 203, the magnetic disk 205, the optical disk 207, or the like.

図３は、同一表記・種類単語判別規則３００の記憶内容の一例を示す説明図である。図３に示すように、同一表記・種類単語判別規則３００は、規則項目のそれぞれに対応付けて、判別結果項目を有し、規則ごとにレコードを構成する。 FIG. 3 is an explanatory diagram showing an example of the stored contents of the same notation / type word discrimination rule 300. As shown in FIG. 3, the same notation / type word discrimination rule 300 has a discrimination result item associated with each rule item, and constitutes a record for each rule.

規則項目には、同一表記の１組の単語の各々とともに出現する共起単語の組み合わせが記憶される。規則項目には、具体的には、例えば、共起単語「出身」と「さん」の組み合わせや共起単語「出身」と「行く」の組み合わせが記憶される。判別結果項目には、規則項目の組み合わせの共起単語の各々とともに出現する同一表記の１組の単語が同一種類であるか否かを示す情報が記憶される。なお、図３では、同一種類である場合には「○」を表記し、異なる種類である場合には「×」を表記している。 The rule item stores a combination of co-occurrence words that appear with the respective words of a set of same-notation words. Specifically, the rule item stores, for example, the combination of the co-occurrence words 「出身」 ("origin") and 「さん」 (the honorific "san"), or the combination of 「出身」 and 「行く」 ("go"). The discrimination result item stores information indicating whether the set of same-notation words that appear with the respective co-occurrence words of the combination in the rule item are of the same type. In FIG. 3, "○" denotes the same type and "×" denotes different types.

（固有表現抽出用規則４００の記憶内容） (Stored contents of the specific expression extraction rule 400)
次に、図４を用いて、図１に示した固有表現抽出用規則４００の記憶内容について説明する。上述したように、固有表現抽出用規則４００は、抽出装置１００が有するテーブルであり、抽出用規則を記憶する。抽出用規則は、或る単語が何の種類の単語かを抽出するために参照される規則である。抽出用規則は、例えば、種類を抽出する対象になる単語から所定距離に或る単語が存在する場合には、種類を抽出する対象になる単語が何の種類かを示す規則である。なお、固有表現抽出用規則４００は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などにより実現される。 Next, the stored contents of the specific expression extraction rule 400 shown in FIG. 1 will be described with reference to FIG. 4. As described above, the specific expression extraction rule 400 is a table held by the extraction device 100 and stores extraction rules. An extraction rule is referred to in order to extract the type of a given word. For example, an extraction rule indicates the type of the target word when a certain word exists at a predetermined distance from the target word. The specific expression extraction rule 400 is realized by the RAM 203, the magnetic disk 205, the optical disk 207, or the like.

図４は、固有表現抽出用規則４００の記憶内容の一例を示す説明図である。図４に示すように、抽出用規則は、規則項目のそれぞれに対応付けて、種類項目を有し、規則ごとにレコードを構成する。 FIG. 4 is an explanatory diagram showing an example of the contents stored in the specific expression extraction rule 400. As shown in FIG. 4, the extraction rule has a type item in association with each rule item, and constitutes a record for each rule.

規則項目には、抽出対象の単語の共起単語と、抽出対象の単語から共起単語までの距離と、の組み合わせが記憶される。上述したように、距離とは、単語間の距離であり、単語数や文字数で決定される。距離としては、例えば、文章の後ろ方向を「＋」として表記した単語数が採用される。例えば、或る単語の「１単語後ろ」に存在する単語の、或る単語からの距離は、「＋１」である。規則項目には、具体的には、例えば、或る単語の「１単語後ろ」に「出身」が存在することを示す「出身：＋１」が記憶される。 The rule item stores a combination of a co-occurrence word of the extraction target word and the distance from the extraction target word to that co-occurrence word. As described above, the distance is a distance between words and is determined by the number of words or the number of characters. As the distance, for example, a word count is used in which the direction toward the end of the sentence is written as "+". For example, a word located one word after a certain word is at distance "+1" from that word. Specifically, the rule item stores, for example, 「出身：＋１」 ("origin: +1"), which indicates that 「出身」 ("origin") exists one word after a certain word.

種類項目には、規則項目の組み合わせに該当する場合における抽出対象の単語の種類が記憶される。なお、抽出対象の単語の種類には、例えば、「組織名（ＯＲＧＡＮＩＺＡＴＩＯＮ）」、「人名（ＰＥＲＳＯＮ）」、「地名（ＬＯＣＡＴＩＯＮ）、「日付表現（ＤＡＴＥ）」、「時間表現（ＴＩＭＥ）」、「金額表現（ＭＯＮＥＹ）」、「割合表現（ＰＥＲＣＥＮＴ）」、「固有物名（ＡＲＴＩＦＡＣＴ）」などがある。 The type item stores the type of the extraction target word for the case where the combination in the rule item is matched. The types of extraction target words include, for example, "organization name (ORGANIZATION)", "person name (PERSON)", "place name (LOCATION)", "date expression (DATE)", "time expression (TIME)", "monetary expression (MONEY)", "percentage expression (PERCENT)", and "artifact name (ARTIFACT)".

（抽出装置１００の機能的構成例） (Functional configuration example of the extraction device 100)
次に、図５を用いて、抽出装置１００の機能的構成例について説明する。 Next, a functional configuration example of the extraction device 100 will be described with reference to FIG. 5.

図５は、抽出装置１００の機能的構成を示すブロック図である。抽出装置１００は、第１の記憶部５０１と、第２の記憶部５０２と、入力部５０３と、検出部５０４と、判別部５０５と、特定部５０６と、抽出部５０７と、出力部５０８と、第１の取得部５０９と、第２の取得部５１０と、判断部５１１と、生成部５１２と、格納部５１３と、変換部５１４と、を含む構成である。入力部５０３〜生成部５１２、および変換部５１４は、具体的には、例えば、図２に示したＲＯＭ２０２、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶装置に記憶されたプログラムをＣＰＵ２０１に実行させることにより、または、Ｉ／Ｆ２０９により、その機能を実現する。 FIG. 5 is a block diagram illustrating a functional configuration of the extraction device 100. The extraction device 100 includes a first storage unit 501, a second storage unit 502, an input unit 503, a detection unit 504, a determination unit 505, a specifying unit 506, an extraction unit 507, an output unit 508, a first acquisition unit 509, a second acquisition unit 510, a determination unit 511, a generation unit 512, a storage unit 513, and a conversion unit 514. Specifically, the functions of the input unit 503 through the generation unit 512 and of the conversion unit 514 are realized, for example, by causing the CPU 201 to execute a program stored in a storage device such as the ROM 202, the RAM 203, the magnetic disk 205, or the optical disk 207 illustrated in FIG. 2, or by the I/F 209.

抽出装置１００は、例えば、単語単位で分割された一連の単語の中から固有表現の単語を抽出することができる。また、抽出装置１００は、チャンクをノードとしてその間をリンクでつないだチャンクのラティスの中から固有表現のチャンクを抽出してもよい。ここで、チャンクとは、１または複数の単語の塊である。以下では、まず、単語単位で分割された一連の単語の中から固有表現の単語を抽出する場合の抽出装置１００の機能について説明する。なお、単語単位で分割された一連の単語の中から固有表現の単語を抽出する具体例は、図１を用いて説明した例、図１２〜図１４を用いて後述する例、または図２０〜図２２を用いて後述する例である。 For example, the extraction device 100 can extract specific-expression words from a series of words divided into word units. The extraction device 100 may also extract specific-expression chunks from a lattice of chunks in which the chunks are nodes connected by links. Here, a chunk is a group of one or more words. The following first describes the functions of the extraction device 100 when it extracts specific-expression words from a series of words divided into word units. Specific examples of such extraction are the example described with reference to FIG. 1, the example described later with reference to FIGS. 12 to 14, and the example described later with reference to FIGS. 20 to 22.

入力部５０３は、固有表現抽出の対象になる一連の単語の入力を受け付ける。入力部５０３は、具体的には、例えば、キーボード２１０を介して抽出装置１００の利用者からのテキストデータ１１０の入力を受け付ける。また、入力部５０３は、具体的には、例えば、Ｉ／Ｆ２０９を介して受信されたテキストデータ１１０を受け付けてもよい。なお、受け付けたデータは、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、入力部５０３は、固有表現抽出の対象になる一連の単語の入力を受け付けて、固有表現抽出処理を開始するトリガを発生させることができる。 The input unit 503 receives an input of a series of words that are the target of the specific expression extraction. Specifically, the input unit 503 receives an input of the text data 110 from the user of the extraction device 100 via the keyboard 210, for example. Specifically, the input unit 503 may accept the text data 110 received via the I / F 209, for example. The received data is stored in a storage area such as the RAM 203, the magnetic disk 205, and the optical disk 207. As a result, the input unit 503 can receive an input of a series of words to be subjected to specific expression extraction and generate a trigger for starting the specific expression extraction process.

第１の記憶部５０１は、共起単語の組み合わせと、共起単語の各々とともに出現する同一表記の単語が同一種類の単語であるか否かを示す情報と、を関連付けた判別規則を記憶する。ここで、共起単語とは、文章の中で、或る単語と同時に出現する単語である。第１の記憶部５０１は、具体的には、例えば、上述した同一表記・種類単語判別規則３００である。 The first storage unit 501 stores discrimination rules, each of which associates a combination of co-occurrence words with information indicating whether the same-notation words that appear with the respective co-occurrence words are words of the same type. Here, a co-occurrence word is a word that appears together with a certain word in a sentence. Specifically, the first storage unit 501 is, for example, the same notation / type word discrimination rule 300 described above.

第２の記憶部５０２は、共起単語と当該共起単語までの距離の組み合わせと、当該距離に応じて規定された単語の種類を示す情報と、を関連付けた抽出用規則を記憶する。ここで、距離とは、単語間の距離であり、単語数や文字数で決定される。距離としては、例えば、始点になる単語から何単語目であるかを示す情報を採用できる。 The second storage unit 502 stores an extraction rule in which a combination of a co-occurrence word and a distance to the co-occurrence word is associated with information indicating the type of word defined according to the distance. Here, the distance is a distance between words, and is determined by the number of words and the number of characters. As the distance, for example, information indicating how many words are from the starting word can be employed.

より具体的には、例えば、文章上で始点になる単語から１単語分後ろに存在する単語までの距離は、「＋１」になる。また、文章上で始点になる単語から１単語分前に存在する単語までの距離は、「−１」になる。第２の記憶部５０２は、具体的には、例えば、上述した固有表現抽出用規則４００である。 More specifically, for example, the distance from the starting word on the sentence to the word existing one word behind is “+1”. Further, the distance from the starting word on the sentence to the word existing one word before is “−1”. Specifically, the second storage unit 502 is, for example, the above-described specific expression extraction rule 400.

検出部５０４は、一連の単語の中から第１の単語および当該第１の単語と同一表記の第２の単語を検出する。検出部５０４は、具体的には、例えば、入力部５０３によって入力されたテキストデータ１１０を形態素解析して、テキストデータ１１０を単語ごとに分割し、一連の単語を生成する。そして、検出部５０４は、生成した一連の単語の中から、同一表記の単語を検出する。 The detection unit 504 detects a first word and a second word having the same notation as the first word from the series of words. Specifically, for example, the detection unit 504 morphologically analyzes the text data 110 input by the input unit 503, divides the text data 110 into words, and generates a series of words. Then, the detection unit 504 detects words having the same notation from the generated series of words.

検出部５０４は、図１の例では、テキストデータ１１０「宮崎（２０１）出身の宮崎（２０２）さんと宮崎（２０３）へ行く。」を形態素解析し、単語ごとに分割して「宮崎出身の宮崎さんと宮崎へ行く。」を生成する。なお、（）内の数字は、図１に示した符号２０１〜２０３である。そして、検出部５０４は、同一表記の「宮崎」１１１と「宮崎」１１２の組と「宮崎」１１１と「宮崎」１１３の組と「宮崎」１１２と「宮崎」１１３の組を検出する。以下では、「宮崎」１１１と「宮崎」１１３の組を例に挙げて説明を行う。 In the example of FIG. 1, the detection unit 504 performs morphological analysis on the text data 110 "Go to Miyazaki (203) with Mr. Miyazaki (202), who is from Miyazaki (201).", divides it into words, and generates the word series "Go to Miyazaki with Mr. Miyazaki, who is from Miyazaki." The numbers in parentheses are the reference numerals 201 to 203 shown in FIG. 1. Then, the detection unit 504 detects the same-notation pairs "Miyazaki" 111 and "Miyazaki" 112, "Miyazaki" 111 and "Miyazaki" 113, and "Miyazaki" 112 and "Miyazaki" 113. The following description takes the pair of "Miyazaki" 111 and "Miyazaki" 113 as an example.
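The pair detection performed by the detection unit 504 can be sketched as follows. This is a minimal illustration that assumes the sentence has already been split into words (the morphological analyzer itself is not shown) and that every repeated surface form is a candidate. The function name `same_notation_pairs` and the token segmentation are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def same_notation_pairs(tokens):
    """Return index pairs (i, j), i < j, of tokens with identical surface form."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    pairs = []
    for indexes in positions.values():
        # Every 2-combination of the positions of one surface form is a pair.
        pairs.extend(combinations(indexes, 2))
    return sorted(pairs)


# Assumed segmentation of the example sentence from FIG. 1.
tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
pairs = same_notation_pairs(tokens)
```

For the example sentence this yields the three pairs of 「宮崎」 occurrences at word positions 0, 3, and 6, which correspond to 「宮崎」 111, 112, and 113 in the description.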

なお、検出された単語は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、検出部５０４は、判別部５０５による判別対象になる単語の組を検出し、判別部５０５に単語の組が同一の種類か否かを判別させることができる。 The detected word is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. Thereby, the detection unit 504 can detect a set of words to be determined by the determination unit 505, and can determine whether the set of words is the same type.

判別部５０５は、検出部５０４によって検出された第１の単語が共起単語の一方とともに出現し、かつ、検出部５０４によって検出された第２の単語が共起単語の他方とともに出現する判別規則が第１の記憶部５０１にあるか否かを判別する。そして、判別部５０５は、判別規則がある場合、当該判別規則から第１の単語と第２の単語が同一種類か否かを判別する。 The discriminating unit 505 is a discriminating rule in which the first word detected by the detecting unit 504 appears together with one of the co-occurrence words, and the second word detected by the detecting unit 504 appears together with the other of the co-occurrence words. Is stored in the first storage unit 501. Then, when there is a determination rule, the determination unit 505 determines whether the first word and the second word are of the same type based on the determination rule.

判別部５０５は、具体的には、例えば、検出部５０４によって検出された同一表記の単語の各々の周辺に存在する単語の組み合わせを、判別の手がかりとして特定する。ここで、手がかりとは、単語の判別のための指標になるキーである。なお、周辺に存在する単語とは、所定距離以内にある単語であり、例えば、同一表記の単語の各々の前後２単語分の距離以内にある単語である。 Specifically, the determination unit 505 specifies, for example, a combination of words existing around each of the same notation words detected by the detection unit 504 as a clue for determination. Here, the clue is a key that serves as an index for distinguishing words. In addition, the word which exists in the periphery is a word which exists within a predetermined distance, for example, is a word which is within the distance of two words before and after each of the words of the same notation.

次に、判別部５０５は、同一表記・種類単語判別規則３００の規則項目の共起単語の組み合わせの中に、特定した手がかりのいずれかに該当する組み合わせがあるか否かを判別する。そして、判別部５０５は、該当する組み合わせがある場合、当該組み合わせに対応する判別結果項目を参照し、同一表記の単語が同一種類か否かを判別する。 Next, the determination unit 505 determines whether there is a combination corresponding to any one of the specified clues among the co-occurrence word combinations of the rule items of the same notation / type word determination rule 300. Then, when there is a corresponding combination, the determination unit 505 refers to a determination result item corresponding to the combination and determines whether or not the same notation words are the same type.

判別部５０５は、より具体的には、例えば、検出部５０４によって検出された「宮崎」１１１と「宮崎」１１３の各々の周辺に存在する単語の組み合わせ「出身＆へ」、「出身＆行く」などを判別の手がかりとして特定する。次に、判別部５０５は、同一表記・種類単語判別規則３００に、特定した手がかり「出身＆行く」に該当する共起単語「出身」と「行く」があると判別する。そして、判別部５０５は、当該共起単語の組み合わせに対応する判別結果項目に、同一種類であることを示す情報があるため、「宮崎」１１１と「宮崎」１１３とが同一種類であると判別する。 More specifically, for example, the determination unit 505 specifies, as clues for discrimination, combinations of words existing around each of "Miyazaki" 111 and "Miyazaki" 113 detected by the detection unit 504, such as 「出身＆へ」 ("origin & to") and 「出身＆行く」 ("origin & go"). Next, the determination unit 505 determines that the same notation / type word discrimination rule 300 contains the co-occurrence words 「出身」 ("origin") and 「行く」 ("go") corresponding to the specified clue 「出身＆行く」. Then, because the discrimination result item corresponding to that combination of co-occurrence words holds information indicating the same type, the determination unit 505 determines that "Miyazaki" 111 and "Miyazaki" 113 are of the same type.

なお、判別部５０５は、該当する組み合わせがない場合、同一表記の単語を異なる種類の単語と判別してもよい。なお、判別結果は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、判別部５０５は、単語の種類の抽出において、一纏めにして扱うべき単語の組を検出することができる。 Note that the determination unit 505 may determine words of the same notation as different types of words when there is no corresponding combination. The determination result is stored in a storage area such as the RAM 203, the magnetic disk 205, and the optical disk 207. As a result, the determination unit 505 can detect a set of words to be handled together in extraction of word types.
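The discrimination step described above can be sketched as a table lookup. The sketch assumes an unordered co-occurrence combination (modelled as a `frozenset`), a two-word context window, and a toy rule table that holds only the 「出身」 and 「行く」 rule mentioned in the text. The names `context`, `is_same_type`, and `DISCRIMINATION_RULES` are illustrative, and returning the first matching rule is an assumption, since this passage does not say how several matching rules are reconciled.

```python
from itertools import product

# Toy stand-in for the same notation / type word discrimination rule 300.
# Key: combination of co-occurrence words; value: True means "same type".
DISCRIMINATION_RULES = {frozenset(("出身", "行く")): True}


def context(tokens, index, window=2):
    """Words within `window` words before or after position `index`."""
    start = max(0, index - window)
    stop = min(len(tokens), index + window + 1)
    return [tokens[k] for k in range(start, stop) if k != index]


def is_same_type(tokens, i, j, rules=DISCRIMINATION_RULES):
    """Look up each clue (one context word per occurrence) in the rule table."""
    for a, b in product(context(tokens, i), context(tokens, j)):
        verdict = rules.get(frozenset((a, b)))
        if verdict is not None:
            return verdict  # assumption: the first matching rule decides
    return False  # no matching rule: treat the pair as different types


tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
first_and_third = is_same_type(tokens, 0, 6)   # True
first_and_second = is_same_type(tokens, 0, 3)  # False, no rule matches
```

With this single toy rule the sketch would also match the pair (3, 6), because 「出身」 lies within two words of the second 「宮崎」. A realistic rule table and a policy for conflicting rules, neither of which is specified in this passage, would be needed to separate such cases.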

特定部５０６は、判別部５０５によって同一種類であると判別された場合、一連の単語の中から、第１の単語および第２の単語の各々から所定距離以内に存在する単語と当該単語までの距離との組み合わせを特定する。特定部５０６は、具体的には、例えば、同一表記かつ同一種類の単語が存在する場合、各々の単語から前後２単語分の距離にある単語と当該単語までの距離とを、抽出の手がかりとして特定する。 When the determining unit 505 determines that they are the same type, the specifying unit 506 selects a word existing within a predetermined distance from each of the first word and the second word and the word from the series of words. Identify combinations with distance. Specifically, for example, when the same notation and the same type of word exist, the specifying unit 506 uses, as a clue for extraction, a word that is a distance of two words before and after each word and the distance to the word. Identify.

特定部５０６は、より具体的には、例えば、判別部５０５によって同一種類と判別された「宮崎」１１１と「宮崎」１１３とを一纏めにして扱う。そして、特定部５０６は、「宮崎」１１１と「宮崎」１１３の各々から前後２単語分の距離以内に存在する単語と当該単語までの距離との組み合わせを、抽出の手がかりとして特定する。ここで、特定部５０６は、抽出の手がかりとして、「出身：＋１」、「の：＋２」、「さん：−２」、「と：−１」、「へ：＋１」、および「行く：＋２」を特定する。 More specifically, for example, the specifying unit 506 handles "Miyazaki" 111 and "Miyazaki" 113, which the determination unit 505 has determined to be of the same type, as a single group. The specifying unit 506 then specifies, as clues for extraction, combinations of each word existing within two words before or after "Miyazaki" 111 or "Miyazaki" 113 and the distance to that word. Here, the specifying unit 506 specifies the following clues for extraction: 「出身：＋１」 ("origin: +1"), 「の：＋２」 (particle "no": +2), 「さん：−２」 (honorific "san": −2), 「と：−１」 (particle "to", meaning "with": −1), 「へ：＋１」 (particle "e", meaning "to": +1), and 「行く：＋２」 ("go: +2").

なお、特定された手がかりは、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、特定部５０６は、抽出部５０７により単語の種類を抽出するために用いられる手がかりを特定することができる。 The identified clue is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. Thereby, the specifying unit 506 can specify a clue used by the extracting unit 507 to extract the type of word.
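The clue specification performed by the specifying unit 506 can be sketched as collecting (word, signed distance) pairs around every occurrence in a group. This is a minimal illustration assuming distances are counted in words and the window is two words. The function name `clues` and the token segmentation are assumptions.

```python
def clues(tokens, index, window=2):
    """Return (word, signed distance) pairs within `window` words of `index`.

    Positive distances point toward the end of the sentence.
    """
    found = []
    for distance in range(-window, window + 1):
        position = index + distance
        if distance != 0 and 0 <= position < len(tokens):
            found.append((tokens[position], distance))
    return found


tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]

# The two occurrences judged to be the same type are handled as one group,
# so their clues are pooled.
group = [0, 6]
pooled = [clue for index in group for clue in clues(tokens, index)]
```

For the example sentence, `pooled` holds the six clues listed in the description: 「出身」 at +1, 「の」 at +2, 「さん」 at −2, 「と」 at −1, 「へ」 at +1, and 「行く」 at +2.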

抽出部５０７は、特定部５０６によって特定された組み合わせに関連付けられた単語の種類を示す情報を第２の記憶部５０２に記憶されている抽出用規則から抽出し、第１の単語および第２の単語に付与する。 The extraction unit 507 extracts information indicating the type of the word associated with the combination specified by the specifying unit 506 from the extraction rules stored in the second storage unit 502, and extracts the first word and the second Append to words.

抽出部５０７は、具体的には、例えば、固有表現抽出用規則４００の規則項目の単語と当該単語までの距離の組み合わせの中から、特定部５０６によって特定された手がかりのいずれかに該当する組み合わせがあるか否かを判別する。次に、抽出部５０７は、該当する組み合わせがある場合、当該組み合わせに対応する種類項目から、単語の種類を抽出する。そして、抽出部５０７は、抽出した単語の種類を示すタグを、文章上の同一表記かつ同一種類の単語の各々に付与する。 Specifically, the extraction unit 507, for example, a combination corresponding to any one of the clues specified by the specification unit 506 from combinations of the word of the rule item of the specific expression extraction rule 400 and the distance to the word. It is determined whether or not there is. Next, when there is a corresponding combination, the extraction unit 507 extracts the word type from the type item corresponding to the combination. Then, the extraction unit 507 assigns a tag indicating the type of the extracted word to each of the same notation and the same type of word on the sentence.

抽出部５０７は、より具体的には、例えば、固有表現抽出用規則４００に、特定部５０６によって特定された手がかり「出身：＋１」に該当する規則項目があると判別し、当該規則項目に対応する種類項目から単語の種類が「ＬＯＣ」であることを示す情報を抽出する。また、抽出部５０７は、固有表現抽出用規則４００に、特定部５０６によって特定された手がかり「行く：＋２」に該当する規則項目があると判別し、当該規則項目に対応する種類項目から単語の種類が「ＬＯＣ」であることを示す情報を抽出する。そして、抽出部５０７は、「宮崎」１１１と「宮崎」１１３の種類として、「ＬＯＣ」を抽出する。 More specifically, for example, the extraction unit 507 determines that there is a rule item corresponding to the clue “origin: +1” specified by the specification unit 506 in the specific expression extraction rule 400, and corresponds to the rule item. Information indicating that the type of the word is “LOC” is extracted from the type item. In addition, the extraction unit 507 determines that there is a rule item corresponding to the clue “go: +2” specified by the specification unit 506 in the specific expression extraction rule 400, and extracts a word from the type item corresponding to the rule item. Information indicating that the type is “LOC” is extracted. Then, the extraction unit 507 extracts “LOC” as the type of “Miyazaki” 111 and “Miyazaki” 113.

そして、抽出部５０７は、タグを付与したテキストデータ１１０「＜ＬＯＣ＞宮崎＜／ＬＯＣ＞出身の＜ＰＥＲ＞宮崎＜／ＰＥＲ＞さんと＜ＬＯＣ＞宮崎＜／ＬＯＣ＞へ行く。」を生成する。なお、抽出された単語の種類を示す情報は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、抽出部５０７は、同一表記かつ同一種類の単語については、同一の単語の種類を抽出することができる。結果として、抽出部５０７は、同一表記であっても異なる種類の単語については、別個に単語の種類を抽出することができる。 Then, the extraction unit 507 generates the tagged text data 110 "Go to <LOC>Miyazaki</LOC> with Mr. <PER>Miyazaki</PER>, who is from <LOC>Miyazaki</LOC>." Information indicating the extracted word types is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the extraction unit 507 can extract the same word type for words that have the same notation and the same type. As a result, for words that have the same notation but different types, the extraction unit 507 can extract the word types separately.
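The extraction and tagging steps of the extraction unit 507 can be sketched as a second table lookup followed by tag insertion. The rule entries for 「出身」 at +1 and 「行く」 at +2 come from the description. The entry for 「さん」 at +1, the majority vote over matching rules, and all function names are assumptions added so that the sketch runs end to end.

```python
from collections import Counter

# Toy stand-in for the specific expression extraction rule 400.
# Key: (co-occurrence word, signed distance); value: word type.
EXTRACTION_RULES = {
    ("出身", 1): "LOC",
    ("行く", 2): "LOC",
    ("さん", 1): "PER",  # assumed entry, not stated in this passage
}


def clues(tokens, index, window=2):
    """(word, signed distance) pairs within `window` words of `index`."""
    return [
        (tokens[index + d], d)
        for d in range(-window, window + 1)
        if d != 0 and 0 <= index + d < len(tokens)
    ]


def extract_type(tokens, group, rules=EXTRACTION_RULES):
    """Pool the clues of every occurrence in `group` and vote on the type."""
    votes = Counter(
        rules[clue] for index in group for clue in clues(tokens, index) if clue in rules
    )
    return votes.most_common(1)[0][0] if votes else None


def tag(tokens, groups):
    """Wrap every occurrence of each group in a tag naming its extracted type."""
    tagged = list(tokens)
    for group in groups:
        label = extract_type(tokens, group)
        if label is not None:
            for index in group:
                tagged[index] = f"<{label}>{tokens[index]}</{label}>"
    return "".join(tagged)


tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
# Occurrences 0 and 6 were judged to be the same type; occurrence 3 stands alone.
result = tag(tokens, [[0, 6], [3]])
```

Because occurrences 0 and 6 share their clues, both receive the type supported by 「出身」 at +1 and 「行く」 at +2, while occurrence 3 is typed on its own clues.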

出力部５０８は、抽出部５０７により付与された一連の単語を出力する。出力部５０８は、具体的には、例えば、抽出した単語の種類をタグとして付与したテキストデータ１１０を、ディスプレイ２０８に出力する。また、出力部５０８は、付与後のテキストデータ１１０をＩ／Ｆ２０９を介して他のコンピュータに送信してもよいし、記録媒体に出力してもよい。これにより、出力部５０８は、テキストデータ１１０の中の単語の種類を抽出装置１００の利用者に通知することができる。また、出力部５０８は、他のソフトウェア（例えば翻訳ソフトウェア、または情報検索ソフトウェア）に、タグを付与したテキストデータ１１０を提供することができる。 The output unit 508 outputs the series of words to which the extraction unit 507 has assigned word types. Specifically, the output unit 508 outputs to the display 208, for example, the text data 110 in which the extracted word types are attached as tags. The output unit 508 may also transmit the tagged text data 110 to another computer via the I/F 209 or output it to a recording medium. In this way, the output unit 508 can notify the user of the extraction device 100 of the types of the words in the text data 110. The output unit 508 can also provide the tagged text data 110 to other software (for example, translation software or information retrieval software).

また、抽出装置１００は、上述した固有表現の抽出に用いられる同一表記・種類単語判別規則３００および固有表現抽出用規則４００を、学習データ群から自動生成することができる。以下では、まず、同一表記・種類単語判別規則３００を生成する場合の抽出装置１００の機能について説明する。この機能は、第１の取得部５０９と、第２の取得部５１０と、判断部５１１と、生成部５１２と、格納部５１３と、により実現される。なお、同一表記・種類単語判別規則３００を生成する具体例は、図６，図７を用いて後述する。 The extraction device 100 can also automatically generate, from a learning data group, the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 used for the specific expression extraction described above. The following first describes the functions of the extraction device 100 when it generates the same notation / type word discrimination rule 300. These functions are realized by the first acquisition unit 509, the second acquisition unit 510, the determination unit 511, the generation unit 512, and the storage unit 513. A specific example of generating the same notation / type word discrimination rule 300 will be described later with reference to FIGS. 6 and 7.

第１の取得部５０９は、単語の種類を示す情報が付与された単語を含む単語列の中から同一表記の単語の組み合わせを取得する。ここで、単語列とは、同一表記・種類単語判別規則３００および固有表現抽出用規則４００の生成に用いられる学習データである。単語の種類を示す情報とは、学習データに付与された単語の種類を示すタグである。なお、タグは、学習データの作成者によって付与される。ここで、図６を用いて、同一表記・種類単語判別規則３００および固有表現抽出用規則４００の生成に用いられる学習データについて説明する。 The first acquisition unit 509 acquires a combination of words having the same notation from a word string including a word to which information indicating a word type is given. Here, the word string is learning data used for generating the same notation / type word discrimination rule 300 and the unique expression extraction rule 400. The information indicating the type of word is a tag indicating the type of word assigned to the learning data. The tag is given by the creator of learning data. Here, the learning data used to generate the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 will be described with reference to FIG.

図６は、学習データの例１を示す説明図である。図６に示すように、同一表記・種類単語判別規則３００および固有表現抽出用規則４００の生成のために、タグが付与された複数の学習データを含む学習データ群６００が用意される。 FIG. 6 is an explanatory diagram illustrating Example 1 of learning data. As shown in FIG. 6, in order to generate the same notation / type word discrimination rule 300 and the unique expression extraction rule 400, a learning data group 600 including a plurality of learning data with tags attached thereto is prepared.

例えば、学習データ群６００には、学習データ６１０「＜ＬＯＣ＞宮崎＜／ＬＯＣ＞出身の＜ＰＥＲ＞宮崎＜／ＰＥＲ＞さんと＜ＬＯＣ＞宮崎＜／ＬＯＣ＞へ行く。」が含まれる。学習データ６１０の中の「宮崎」６１１には「ＬＯＣ」のタグが付与され、「宮崎」６１２には「ＰＥＲ」のタグが付与され、「宮崎」６１３には「ＬＯＣ」のタグが付与されている。 For example, the learning data group 600 includes learning data 610 "Go to <LOC>Miyazaki</LOC> with Mr. <PER>Miyazaki</PER>, who is from <LOC>Miyazaki</LOC>." In the learning data 610, "Miyazaki" 611 is tagged "LOC", "Miyazaki" 612 is tagged "PER", and "Miyazaki" 613 is tagged "LOC".

また、学習データ群６００には、学習データ６２０「＜ＰＥＲ＞福岡＜／ＰＥＲ＞さんと＜ＰＥＲ＞宮崎＜／ＰＥＲ＞さんが＜ＬＯＣ＞福岡＜／ＬＯＣ＞へ行く。」が含まれる。学習データ６２０の中の福岡６２１には「ＰＥＲ」のタグが付与され、宮崎６２２には「ＰＥＲ」のタグが付与され、福岡６２３には「ＬＯＣ」のタグが付与されている。 The learning data group 600 also includes learning data 620 "Mr. <PER>Fukuoka</PER> and Mr. <PER>Miyazaki</PER> go to <LOC>Fukuoka</LOC>." In the learning data 620, Fukuoka 621 is tagged "PER", Miyazaki 622 is tagged "PER", and Fukuoka 623 is tagged "LOC".

また、学習データ群６００には、学習データ６３０「＜ＰＥＲ＞宮崎＜／ＰＥＲ＞さんは新幹線で＜ＬＯＣ＞宮崎＜／ＬＯＣ＞へ行く。」が含まれる。学習データ６３０の中の宮崎６３１には「ＰＥＲ」のタグが付与され、宮崎６３２には「ＬＯＣ」のタグが付与されている。 The learning data group 600 also includes learning data 630 "Mr. <PER>Miyazaki</PER> goes to <LOC>Miyazaki</LOC> by Shinkansen." In the learning data 630, Miyazaki 631 is tagged "PER" and Miyazaki 632 is tagged "LOC".

抽出装置１００は、このような学習データ群６００を用いることで、同一表記であって「ＬＯＣ」と「ＰＥＲ」との２種類になりうる単語の組があった場合に、当該単語の組が同一種類か否かを判別するための判別規則を生成することができる。また、抽出装置１００は、単語の種類が「ＬＯＣ」であるか「ＰＥＲ」であるかを抽出するための抽出用規則を生成することができる。 The extraction apparatus 100 uses such a learning data group 600, so that when there is a pair of words that can be two types of “LOC” and “PER” with the same notation, the pair of words is A discrimination rule for discriminating whether or not they are the same type can be generated. Further, the extraction apparatus 100 can generate an extraction rule for extracting whether the word type is “LOC” or “PER”.

図５に戻り、第１の取得部５０９は、具体的には、例えば、学習データを形態素解析して、学習データを単語ごとに分割し、単語列を生成する。そして、検出部５０４は、生成した単語列の中から、同一表記の単語の組み合わせを取得する。 Returning to FIG. 5, specifically, the first acquisition unit 509, for example, morphologically analyzes learning data, divides the learning data into words, and generates a word string. Then, the detecting unit 504 acquires a combination of words having the same notation from the generated word string.

第１の取得部５０９は、例えば、学習データ６１０「宮崎（６１１）出身の宮崎（６１２）さんと宮崎（６１３）へ行く。」を形態素解析し、単語ごとに分割して「宮崎出身の宮崎さんと宮崎へ行く。」を単語列として生成する。なお、（）内の数字は、図６に示した符号６１１〜６１３である。そして、第１の取得部５０９は、同一表記の「宮崎」６１１と「宮崎」６１２の組と「宮崎」６１１と「宮崎」６１３の組と「宮崎」６１２と「宮崎」６１３の組を検出する。以下では、「宮崎」６１１と「宮崎」６１３の組を例に挙げて説明を行う。 For example, the first acquisition unit 509 performs morphological analysis on the learning data 610 "Go to Miyazaki (613) with Mr. Miyazaki (612), who is from Miyazaki (611).", divides it into words, and generates "Go to Miyazaki with Mr. Miyazaki, who is from Miyazaki." as a word string. The numbers in parentheses are the reference numerals 611 to 613 shown in FIG. 6. Then, the first acquisition unit 509 detects the same-notation pairs "Miyazaki" 611 and "Miyazaki" 612, "Miyazaki" 611 and "Miyazaki" 613, and "Miyazaki" 612 and "Miyazaki" 613. The following description takes the pair of "Miyazaki" 611 and "Miyazaki" 613 as an example.

なお、取得された単語は、ＲＡＭ２０３、磁気ディスク２０５、光ディスク２０７などの記憶領域に記憶される。これにより、第１の取得部５０９は、判断部５１１による判断対象になる単語の組を検出し、判断部５１１に単語の組が同一の種類か否かを判断させることができる。 The acquired words are stored in a storage area such as the RAM 203, the magnetic disk 205, and the optical disk 207. Thereby, the first acquisition unit 509 can detect a set of words to be determined by the determination unit 511 and cause the determination unit 511 to determine whether or not the word sets are of the same type.

The second acquisition unit 510 acquires, from the word string containing words to which information indicating the word type has been assigned, combinations of co-occurring words that appear together with each word of the combination acquired by the first acquisition unit 509. Specifically, for example, the second acquisition unit 510 identifies, as clues for discrimination, combinations of the words located around each word of the acquired combination. A word located around a given word is a word within a predetermined distance of it, for example within two words before or after each of the same-notation words.

More specifically, for example, the second acquisition unit 510 acquires, as clues for discrimination, combinations such as "出身&へ" ("from & to") and "出身&行く" ("from & go"), formed from the words around each of "宮崎" 111 and "宮崎" 113 acquired by the first acquisition unit 509. The acquired clues are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the second acquisition unit 510 obtains the clues used to generate the discrimination rules included in the same-notation/same-type word discrimination rules 300.
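As a rough sketch of this clue acquisition, the snippet below pairs every word in the window around one occurrence with every word in the window around the other. The helper names `context` and `cooccurrence_clues`, the ASCII "&" separator, and the hand-segmented token list are assumptions made for illustration.

```python
from itertools import product

def context(tokens, i, window=2):
    """Words within `window` positions before or after index i (excluding i)."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[k] for k in range(lo, hi) if k != i]

def cooccurrence_clues(tokens, i, j, window=2):
    """Combine each word near occurrence i with each word near occurrence j."""
    return [f"{a}&{b}"
            for a, b in product(context(tokens, i, window),
                                context(tokens, j, window))]

tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
clues = cooccurrence_clues(tokens, 0, 6)  # first and third "宮崎"
print(clues)
```

With a window of two words, the first "宮崎" contributes 出身 and の, the third contributes さん, と, へ, and 行く, and the resulting combinations include the two clues named in the text.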

The determination unit 511 determines whether words with the same notation are of the same type, based on the information indicating the word type assigned to each word of the same-notation combination. Specifically, for example, the determination unit 511 determines whether the tags assigned to the words of the combination match. If they match, the determination unit 511 determines that the words of the combination are of the same type.

More specifically, for example, in the text data 110 shown in FIG. 2, the tag assigned to "宮崎" 111 is "LOC" and the tag assigned to "宮崎" 113 is also "LOC". Because the tags match, the determination unit 511 determines that the two words are of the same type. If the tags do not match, the determination unit 511 may determine that the words of the combination are of different types.

The determination result is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the determination unit 511 can decide whether a discrimination rule generated by the generation unit 512 is to be a rule indicating the same type.

The generation unit 512 generates a discrimination rule that associates a combination of co-occurring words acquired by the second acquisition unit 510 with the determination result produced by the determination unit 511. Specifically, for example, the generation unit 512 uses a learning engine to identify plausible learning cases from a group of learning cases, each of which associates an acquired clue with a determination result, and generates the identified learning cases as discrimination rules. Examples of the learning engine include conventional Boosting learners and SVMs.

As a plausible learning case, for example, a learning case whose frequency of appearance in the group of learning cases is equal to or greater than a threshold may be adopted. Specifically, for example, a learning case that associates the combination of the co-occurring words "出身" ("from") and "行く" ("go") with the same type is adopted as a plausible learning case. Conversely, learning cases that contain a word appearing with many types of words, such as the case particle "の", may be excluded. The generated discrimination rules are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the generation unit 512 generates the discrimination rules used for named entity extraction.
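The frequency-threshold heuristic can be sketched as below. This is only a simple stand-in for the learning engine: the text names Boosting learners and SVMs, which this snippet does not implement. The function name `select_rules`, the threshold value, the stop-word set, and the handling of clues that are frequent under both labels are all assumptions.

```python
from collections import Counter

def select_rules(cases, min_count=2, stop_words=frozenset({"の"})):
    """Keep clues that occur at least `min_count` times with a single label.

    `cases` is a list of (clue, same_type) pairs, where clue is "wordA&wordB".
    """
    counts = Counter(cases)
    rules = {}
    for (clue, same_type), n in counts.items():
        if n < min_count:
            continue  # too rare to be considered plausible
        if stop_words & set(clue.split("&")):
            continue  # contains a word such as "の" that appears everywhere
        if counts[(clue, not same_type)] >= min_count:
            continue  # assumption: drop clues frequent under both labels
        rules[clue] = same_type
    return rules

# Hypothetical learning cases gathered from several pieces of learning data.
cases = ([("出身&行く", True)] * 3
         + [("の&さん", False)] * 3
         + [("さん&へ", False)])
rules = select_rules(cases)
print(rules)
```

Here only "出身&行く" survives: "の&さん" is discarded because it contains "の", and "さん&へ" appears only once.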

The storage unit 513 stores the discrimination rules generated by the generation unit 512 in the first storage unit 501. Specifically, for example, the storage unit 513 stores the generated discrimination rules in the same-notation/same-type word discrimination rules 300. In this way, the storage unit 513 preserves the same-notation/same-type word discrimination rules 300 used in the named entity extraction process.

Next, the functions of the extraction apparatus 100 used when generating the named entity extraction rules 400 will be described. These functions are realized by the detection unit 504, the discrimination unit 505, the specifying unit 506, the generation unit 512, and the storage unit 513. A specific example of generating the named entity extraction rules 400 will be described later with reference to FIGS. 8 and 9.

The detection unit 504 detects a third word, and a fourth word having the same notation as the third word, from a word string containing words to which information indicating the word type has been assigned. Specifically, for example, the detection unit 504 performs morphological analysis on the learning data, splits it into words, and generates a word string. The detection unit 504 then detects words with the same notation in the generated word string.

For example, the detection unit 504 performs morphological analysis on the learning data 610, 「宮崎(611)出身の宮崎(612)さんと宮崎(613)へ行く。」, and splits it into words to generate 「宮崎 出身 の 宮崎 さん と 宮崎 へ 行く 。」. The numbers in parentheses are the reference numerals 611 to 613 shown in FIG. 6. The detection unit 504 then detects three pairs of words with the same notation: "宮崎" 611 and "宮崎" 612, "宮崎" 611 and "宮崎" 613, and "宮崎" 612 and "宮崎" 613. The following description uses the pair of "宮崎" 611 and "宮崎" 613 as an example.

The discrimination unit 505 determines whether the first storage unit 501 holds a discrimination rule in which the third word detected by the detection unit 504 appears with one of the co-occurring words and the fourth word detected by the detection unit 504 appears with the other co-occurring word. If such a discrimination rule exists, the discrimination unit 505 uses it to determine whether the third word and the fourth word are of the same type.

Specifically, for example, the discrimination unit 505 identifies, as clues for discrimination, combinations of the words located around each of the same-notation words detected by the detection unit 504. Next, the discrimination unit 505 determines whether any of the co-occurring word combinations in the rule items of the same-notation/same-type word discrimination rules 300 matches one of the identified clues.

If there is a matching combination, the discrimination unit 505 refers to the discrimination result item corresponding to that combination and determines whether the same-notation words are of the same type. If there is no matching combination, the discrimination unit 505 may treat the same-notation words as words of different types.

More specifically, for example, the discrimination unit 505 identifies, as clues for discrimination, combinations such as "出身&へ" and "出身&行く", formed from the words around each of "宮崎" 611 and "宮崎" 613 detected by the detection unit 504. Next, the discrimination unit 505 finds that the same-notation/same-type word discrimination rules 300 contain the co-occurring words "出身" and "行く", which correspond to the identified clue "出身&行く". Because the discrimination result item corresponding to this combination holds information indicating the same type, the discrimination unit 505 determines that "宮崎" 611 and "宮崎" 613 are of the same type.

The discrimination result is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the discrimination unit 505 detects the pairs of words that should be handled together when generating the rules for extracting word types.
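The lookup against the discrimination rules can be illustrated with a small dictionary. The layout of the rule table (clue string mapped to a same-type flag) and the function name `discriminate` are assumptions for this sketch; the default of "different type" when no rule matches follows the optional behavior mentioned above.

```python
def discriminate(clues, rules, default=False):
    """Return the same-type flag of the first clue found in `rules`.

    If no clue matches any rule, fall back to `default` (different type).
    """
    for clue in clues:
        if clue in rules:
            return rules[clue]
    return default

# Hypothetical contents of the same-notation/same-type word discrimination rules.
rules = {"出身&行く": True, "出身&さん": False}

same_611_613 = discriminate(["出身&へ", "出身&行く"], rules)
no_match = discriminate(["と&へ"], rules)
print(same_611_613, no_match)
```

A real rule set could contain several matching clues with conflicting results; taking the first match is a simplification of this sketch.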

When the discrimination unit 505 determines that the words are of the same type, the specifying unit 506 identifies, in the word string containing words to which information indicating the word type has been assigned, combinations of a word located within a predetermined distance of each of the third word and the fourth word and the distance to that word.

Specifically, for example, when words with the same notation and the same type exist, the specifying unit 506 identifies, as clues for extraction, the words within two words before or after each of those words together with the distance to each such word.

More specifically, for example, the specifying unit 506 handles "宮崎" 611 and "宮崎" 613, which the discrimination unit 505 has determined to be of the same type, as a single group. The specifying unit 506 then identifies, as clues for extraction, combinations of a word within two words before or after each of "宮崎" 111 and "宮崎" 113 and the distance to that word. Here, the specifying unit 506 identifies "出身:+1", "の:+2", "さん:-2", "と:-1", "へ:+1", and "行く:+2" as the clues for extraction.

The identified clues are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the specifying unit 506 identifies the clues that the generation unit 512 uses to generate the rules for extracting word types.
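The word-and-distance clues listed above can be reproduced with a short sketch. The function name `positional_clues` and the "word:offset" string format are illustrative choices, and the token list is again segmented by hand.

```python
def positional_clues(tokens, i, window=2):
    """Return "word:offset" clues for words within `window` of index i."""
    clues = []
    for off in range(-window, window + 1):
        k = i + off
        if off != 0 and 0 <= k < len(tokens):
            clues.append(f"{tokens[k]}:{off:+d}")
    return clues

tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
# Treat the first and third "宮崎" as one group and pool their clues.
group_clues = positional_clues(tokens, 0) + positional_clues(tokens, 6)
print(group_clues)
```

Pooling the clues of both occurrences is what lets evidence near one occurrence (for example "へ:+1") inform the type of the other.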

The generation unit 512 generates an extraction rule that associates a combination of a word and the distance to that word, as identified by the specifying unit 506, with the information indicating the word type assigned to either the third word or the fourth word. Specifically, for example, the generation unit 512 uses a learning engine to identify plausible learning cases from a group of learning cases, each of which associates an acquired clue with the word type indicated by the tag assigned to the same-notation, same-type words, and generates the identified learning cases as extraction rules. For example, a learning case that associates "出身:+1" with the word type "LOC" is adopted as a plausible learning case. The generated extraction rules are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the generation unit 512 generates the extraction rules used for named entity extraction.

The storage unit 513 stores the extraction rules generated by the generation unit 512 in the second storage unit 502. Specifically, for example, the storage unit 513 stores the generated extraction rules in the named entity extraction rules 400. In this way, the storage unit 513 preserves the named entity extraction rules used in the named entity extraction process.

Next, the functions of the extraction apparatus 100 used when extracting named-entity chunks from a chunk lattice whose nodes are chunks will be described. A specific example of extracting named-entity chunks from a lattice of chunks will be described later with reference to FIGS. 29 to 32.

The conversion unit 514 converts a series of words into a plurality of alternative word strings that contain chunks. Specifically, for example, the conversion unit 514 splits a sentence into words by morphological analysis. The conversion unit 514 then generates a chunk lattice that contains the alternative chunks obtained by concatenating those words.

Here, the sentence 「B商事の社員はB商事へ行きB商事から帰る。」 ("An employee of B Trading goes to B Trading and returns from B Trading.") is used as an example. Specifically, for example, the conversion unit 514 splits the sentence into words by morphological analysis and generates 「B 商事 の 社員 は B 商事 へ 行き B 商事 から 帰る 。」, where spaces indicate word boundaries.

Next, the conversion unit 514 generates chunks by concatenating up to a predesignated number of words. Here, the number of words is assumed to be 3. For example, the chunks generated from the first word "B" are "B", "B商事", and "B商事の", that is, chunks of up to the designated number of words. The chunks generated from the next word "商事" are "商事", "商事の", and "商事の社員".

In this way, chunks are generated by concatenating up to the designated number of words starting from each word, and each chunk is connected to the chunks that start at the word following its last word. This produces a chunk lattice such as those shown in FIGS. 23 and 29.

The chunk lattice obtained by the conversion is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207.
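The chunk generation described above can be sketched as follows. The representation of a chunk as a `(start, end, text)` triple with an exclusive end index, and the function name `build_chunks`, are assumptions for illustration. Under this representation, a chunk ending at position `end` connects to every chunk starting at `end`, which gives the lattice edges.

```python
def build_chunks(tokens, max_len=3):
    """Enumerate chunks of 1..max_len consecutive words as (start, end, text)."""
    chunks = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            chunks.append((start, end, "".join(tokens[start:end])))
    return chunks

# Hand-segmented stand-in for the morphological analysis of the example sentence.
tokens = ["B", "商事", "の", "社員", "は", "B", "商事", "へ", "行き",
          "B", "商事", "から", "帰る", "。"]
chunks = build_chunks(tokens)

from_first_word = [text for start, _, text in chunks if start == 0]
from_second_word = [text for start, _, text in chunks if start == 1]
print(from_first_word, from_second_word)
```

The chunk "B商事" is produced three times (starting at words 0, 5, and 9), which is why same-notation chunk pairs can later be detected in the lattice.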

The detection unit 504 detects, in the chunk lattice, a first chunk and a second chunk having the same notation as the first chunk. Specifically, for example, the detection unit 504 detects chunks with the same notation in the chunk lattice. More specifically, for example, from the chunk lattice generated from 「B商事の社員はB商事へ行きB商事から帰る。」, the detection unit 504 detects three pairs of the same-notation chunk "B", three pairs of "商事", and three pairs of "B商事". The following description uses the pair formed by the second and third occurrences of "B商事" from the beginning of the sentence as an example.

The detected chunks are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the detection unit 504 detects the pairs of chunks to be examined by the discrimination unit 505, so that the discrimination unit 505 can determine whether the chunks of each pair are of the same type.

The discrimination unit 505 determines whether the first storage unit 501 holds a discrimination rule in which the first chunk detected by the detection unit 504 appears with one of the co-occurring words and the second chunk detected by the detection unit 504 appears with the other co-occurring word. If such a discrimination rule exists, the discrimination unit 505 uses it to determine whether the first chunk and the second chunk are of the same type.

Specifically, for example, the discrimination unit 505 identifies, as clues for discrimination, combinations of the chunks located around each of the same-notation chunks detected by the detection unit 504. Next, the discrimination unit 505 determines whether any of the co-occurring chunk combinations in the rule items of the same-notation/same-type chunk discrimination rules matches one of the identified clues. If there is a matching combination, the discrimination unit 505 refers to the discrimination result item corresponding to that combination and determines whether the same-notation chunks are of the same type. If there is no matching combination, the discrimination unit 505 may treat the same-notation chunks as chunks of different types.

More specifically, for example, the discrimination unit 505 identifies, as clues for discrimination, combinations such as "へ&から" ("to & from") and "行き&帰る" ("go & return"), formed from the words around each chunk of the "B商事" pair detected by the detection unit 504. Next, if the same-notation/same-type word discrimination rules 300 contain the co-occurring words "行き" and "帰る", which correspond to the identified clue "行き&帰る", the discrimination unit 505 extracts the information in the corresponding discrimination result item that indicates whether the types are the same. From the extracted information, the discrimination unit 505 determines whether the "B商事" pair detected by the detection unit 504 is of the same type.

The discrimination result is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the discrimination unit 505 detects the pairs of chunks that should be handled together when extracting chunk types.

When the discrimination unit 505 determines that a discrimination rule exists, the specifying unit 506 identifies, in the converted word string, combinations of a word located within a predetermined distance of each of the first chunk and the second chunk and the distance to that word. Specifically, for example, when chunks with the same notation and the same type exist, the specifying unit 506 identifies, as clues for extraction, the words within two words before or after each of those chunks together with the distance to each such word.

More specifically, for example, the specifying unit 506 handles the "B商事" pair, which the discrimination unit 505 has determined to be of the same type, as a single group. The specifying unit 506 then identifies, as clues for extraction, combinations of a word within two words before or after each chunk of the "B商事" pair and the distance to that word. Here, the specifying unit 506 identifies "社員:-2", "は:-1", "へ:+1", "行き:+2", "へ:-2", "行き:-1", "から:+1", and "帰る:+2" as the clues for extraction.

The identified clues are stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the specifying unit 506 identifies the clues that the extraction unit 507 uses to extract chunk types.

The extraction unit 507 extracts, from the extraction rules stored in the second storage unit 502, the information indicating the word type associated with a combination identified by the specifying unit 506, and assigns it to the first chunk and the second chunk.

Specifically, for example, the extraction unit 507 determines whether any of the combinations of a word and a distance in the rule items of the named entity extraction rules 400 matches one of the clues identified by the specifying unit 506. If there is a matching combination, the extraction unit 507 extracts the word type from the type item corresponding to that combination. The extraction unit 507 then assigns a tag indicating the extracted word type to each of the same-notation, same-type chunks in the sentence.

More specifically, for example, if the named entity extraction rules 400 contain a rule item matching the clue "出身:+1" identified by the specifying unit 506, the extraction unit 507 extracts the information indicating the word type from the type item corresponding to that rule item. Based on the extracted information, the extraction unit 507 then adds 1 to the score of, for example, "LOC" as a candidate type of "B商事". In this way, each chunk is given a score for each named entity type it could take.
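The score assignment can be sketched as a simple vote count: each clue that matches an extraction rule adds 1 to the score of the associated type. The rule table below is hypothetical, as are the function name `score_types` and the dictionary layout.

```python
from collections import defaultdict

def score_types(clues, extraction_rules):
    """Add 1 to a type's score for every clue that matches an extraction rule."""
    scores = defaultdict(int)
    for clue in clues:
        entity_type = extraction_rules.get(clue)
        if entity_type is not None:
            scores[entity_type] += 1
    return dict(scores)

# Hypothetical extraction rules; the real rules 400 would be learned from data.
extraction_rules = {"へ:+1": "LOC", "から:+1": "LOC", "社員:+2": "ORG"}

# Pooled clues for the second and third "B商事" in the example sentence.
clues = ["社員:-2", "は:-1", "へ:+1", "行き:+2",
         "へ:-2", "行き:-1", "から:+1", "帰る:+2"]
scores = score_types(clues, extraction_rules)
print(scores)
```

If the rules carry their own weights, the increment of 1 would be replaced by the weight of each applied rule.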

After the rules have been applied to all chunks, the extraction unit 507 considers the chunk paths that can be taken from the beginning to the end of the sentence, together with the combinations of named entity types that can be taken along each path, and, based on the named entity scores assigned to each chunk, selects the sequence of chunks and the named entity type of each chunk that maximize the sum of the scores. As a result, for example, 「<ORG>B商事</ORG>の社員は<LOC>B商事</LOC>へ行き<LOC>B商事</LOC>から帰る。」 is generated. The score indicating whether a chunk is a given named entity type is determined from the total number of times the rules identified the chunk as that type or, if the rules themselves have scores, from the sum of the scores of the applied rules.
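One way to select the maximum-score path is dynamic programming over word positions, sketched below. The text does not name a search algorithm, so this Viterbi-style search is an assumption, as are the score values, the function names, and the treatment of unscored chunks as untagged with score 0. Scores are assumed to be non-negative, so choosing the best type per chunk independently is equivalent to searching over type combinations.

```python
def best_path(tokens, scores, max_len=3):
    """Pick the chunk sequence covering all tokens with the maximum score sum.

    `scores` maps (start, end) spans to {type: score}; spans without an entry
    are treated as untagged chunks with score 0.
    """
    n = len(tokens)
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            type_scores = scores.get((start, end), {})
            label = max(type_scores, key=type_scores.get) if type_scores else None
            gain = type_scores[label] if label else 0.0
            if best[start] + gain > best[end]:
                best[end] = best[start] + gain
                back[end] = (start, label)
    path, end = [], n
    while end > 0:  # follow back-pointers from the end of the sentence
        start, label = back[end]
        path.append(("".join(tokens[start:end]), label))
        end = start
    return path[::-1]

def render(path):
    """Wrap tagged chunks in <TYPE>...</TYPE> and join everything."""
    return "".join(f"<{t}>{s}</{t}>" if t else s for s, t in path)

tokens = ["B", "商事", "の", "社員", "は", "B", "商事", "へ", "行き",
          "B", "商事", "から", "帰る", "。"]
# Hypothetical per-chunk scores produced by applying the extraction rules.
scores = {(0, 2): {"ORG": 2, "LOC": 1}, (5, 7): {"LOC": 2}, (9, 11): {"LOC": 2}}
tagged = render(best_path(tokens, scores))
print(tagged)
```

With these hypothetical scores, the selected path tags the first "B商事" as ORG and the other two as LOC, which corresponds to the example output in the text.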

The information indicating the extracted chunk type is stored in a storage area such as the RAM 203, the magnetic disk 205, or the optical disk 207. In this way, the extraction unit 507 extracts the same chunk type for chunks that have the same notation and the same type. As a result, for chunks that have the same notation but different types, the extraction unit 507 extracts the chunk types separately.

The output unit 508 outputs the word string to which the extraction unit 507 has assigned the type information. Specifically, for example, the output unit 508 outputs to the display 208 the chunk lattice in which the extracted word types have been attached as tags. The output unit 508 may also transmit the tagged chunk lattice to another computer via the I/F 209 or write it to a recording medium. In this way, the output unit 508 notifies the user of the extraction apparatus 100 of the chunk types. The output unit 508 can also provide the tagged chunk lattice to other software, such as translation software or information retrieval software.

(Example 1)
Next, Example 1 will be described. Example 1 is an example of extracting named-entity words from a series of words divided into word units, as described with reference to FIG. 5.

(Specific example 1 of the rule learning process by the extraction apparatus 100 according to Example 1)
Next, specific example 1 of the rule learning process performed by the extraction apparatus 100 according to Example 1 will be described with reference to FIGS. 6 to 9. The rule learning process uses the learning data group 600 shown in FIG. 6 to generate the same-notation/same-type word discrimination rules 300 and the named entity extraction rules 400, and is the process described with reference to FIG. 5.

FIGS. 7 to 9 are explanatory diagrams showing how the same-notation/same-type word discrimination rules 300 and the named entity extraction rules 400 are generated using the learning data group 600 shown in FIG. 6. In FIG. 7, the extraction apparatus 100 first selects unprocessed learning data from the learning data group 600. Here, it is assumed that the extraction apparatus 100 has selected the learning data 610.

(1) Next, the extraction apparatus 100 extracts, from the selected learning data 610, the following pairs of words with the same notation: "宮崎" 611 and "宮崎" 612, "宮崎" 611 and "宮崎" 613, and "宮崎" 612 and "宮崎" 613.

Because "宮崎" 611 and "宮崎" 612 are of different types, the extraction apparatus 100 acquires the combinations of words around each of "宮崎" 611 and "宮崎" 612 as clues indicating that a pair of words is of different types. The extraction apparatus 100 then generates learning cases for discriminating different types, using word pairs such as "出身&さん" and "の&さん" as clues.

Because "宮崎" 611 and "宮崎" 613 are of the same type, the extraction apparatus 100 acquires the combinations of words around each of "宮崎" 611 and "宮崎" 613 as clues indicating that a pair of words is of the same type. The extraction apparatus 100 then generates learning cases for discriminating the same type, using word pairs such as "出身&へ" and "出身&行く" as clues.

Because "宮崎" 612 and "宮崎" 613 are of different types, the extraction apparatus 100 acquires the combinations of words around each of "宮崎" 612 and "宮崎" 613 as clues indicating that a pair of words is of different types. The extraction apparatus 100 then generates learning cases for discriminating different types, using word pairs such as "さん&へ" and "さん&行く" as clues.

After generating learning cases from the learning data 610, the extraction apparatus 100 also generates learning cases from any unprocessed learning data remaining in the learning data group 600.
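Step (1) as a whole can be sketched as follows: every pair of tagged same-notation words yields labeled learning cases, one per combination of surrounding words. The tag name "PER" for the person name, the restriction to tagged words, and the function names are assumptions; the text only states that 611 and 613 are of the same type (LOC in FIG. 2) and that 612 differs from both.

```python
from itertools import combinations, product

def context(tokens, i, window=2):
    """Words within `window` positions before or after index i (excluding i)."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[k] for k in range(lo, hi) if k != i]

def learning_cases(tokens, tags, window=2):
    """Yield (clue, same_type) for every pair of tagged same-notation words."""
    cases = []
    for i, j in combinations(range(len(tokens)), 2):
        if tokens[i] != tokens[j] or tags[i] is None or tags[j] is None:
            continue  # assumption: only tagged words are paired
        same_type = tags[i] == tags[j]
        for a, b in product(context(tokens, i, window), context(tokens, j, window)):
            cases.append((f"{a}&{b}", same_type))
    return cases

tokens = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く", "。"]
tags = ["LOC", None, None, "PER", None, None, "LOC", None, None, None]
cases = learning_cases(tokens, tags)
print(len(cases))
```

The resulting list contains same-type cases such as "出身&行く" and different-type cases such as "出身&さん" and "さん&へ", which are then passed to the learning engine in step (2).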

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群のうち、尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を、尤もらしい学習事例として特定する。 (2) Then, using the learning engine, the extraction device 100 identifies a likely learning case among the generated learning case group. For example, the learning engine identifies a learning case whose appearance frequency is greater than or equal to a threshold in the learning case group as a likely learning case.

例えば、学習エンジンは、学習データ群６００から生成された学習事例群に、単語の組が同一種類であることを示す手がかり「出身＆行く」の学習事例が所定個数以上含まれている場合に、「出身＆行く」の学習事例を尤もらしい学習事例として特定する。また、学習エンジンは、多くの種類の単語とともに出現する格助詞「の」などを含む手がかり「の＆さん」の学習事例は、単語の組が同一種類である場合にも異なる場合にも出現する可能性があるため、尤もらしい学習事例として採用しないようにする。 For example, when the learning case group generated from the learning data group 600 contains at least a predetermined number of learning cases with the clue “from & go”, which indicates that a pair of words is of the same type, the learning engine identifies the “from & go” learning case as a plausible learning case. In contrast, a learning case with a clue such as “no & san”, which contains the case particle “no” that appears with many types of words, can occur both when a pair of words is of the same type and when it is of different types, so the learning engine does not adopt it as a plausible learning case.

なお、学習エンジンとしては、例えば、Ｂｏｏｓｔｉｎｇ学習器やＳＶＭがある。Ｂｏｏｓｔｉｎｇ学習器やＳＶＭについては、従来技術であるため説明を省略する。抽出装置１００は、学習エンジンによって特定された学習事例から判別規則を生成する。抽出装置１００は、例えば、単語の組が同一種類であることを示す情報と、共起単語「出身」と「行く」の組み合わせを関連付けた判別規則を生成する。そして、抽出装置１００は、生成した判別規則を含む同一表記・種類単語判別規則３００を生成する。 Note that examples of the learning engine include a Boosting learning device and an SVM. Since the Boosting learner and the SVM are conventional techniques, description thereof is omitted. The extraction device 100 generates a discrimination rule from the learning case specified by the learning engine. For example, the extraction apparatus 100 generates a discrimination rule that associates information indicating that the word pairs are of the same type with a combination of the co-occurrence words “from” and “go”. Then, the extraction apparatus 100 generates the same notation / type word determination rule 300 including the generated determination rule.
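The frequency-based selection of plausible learning cases can be illustrated with a small sketch. It is a deliberately simple stand-in for the Boosting learner or SVM named in the text: the function name `select_rules`, the threshold value, and the rule of discarding any clue observed with both labels are assumptions for illustration only.

```python
from collections import Counter


def select_rules(cases, threshold=2):
    """Keep clues that occur at least `threshold` times with a single label."""
    counts = Counter(cases)

    # Record every label each clue was observed with.
    labels_per_clue = {}
    for clue, label in counts:
        labels_per_clue.setdefault(clue, set()).add(label)

    rules = {}
    for (clue, label), frequency in counts.items():
        ambiguous = len(labels_per_clue[clue]) > 1  # e.g. a clue like 「の＆さん」
        if frequency >= threshold and not ambiguous:
            rules[clue] = label
    return rules


# Hypothetical learning cases gathered from several pieces of learning data.
example_cases = (
    [("出身＆行く", "same")] * 3
    + [("の＆さん", "same")] * 2
    + [("の＆さん", "different")] * 2
    + [("さん＆へ", "different")]
)
rules = select_rules(example_cases)
```

Here only 「出身＆行く」 survives: 「の＆さん」 is dropped because it occurs with both labels, and 「さん＆へ」 is dropped because it falls below the threshold.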

図８において、抽出装置１００は、学習データの中から未処理の学習データを選択する。次に、抽出装置１００は、選択した学習データからタグを除外した対象データを生成する。ここでは、抽出装置１００は、学習データ６１０を選択し、学習データ６１０から対象データ８００を生成したとする。 In FIG. 8, the extraction apparatus 100 selects unprocessed learning data from the learning data. Next, the extraction apparatus 100 generates target data from which the tags are excluded from the selected learning data. Here, it is assumed that the extraction apparatus 100 selects the learning data 610 and generates the target data 800 from the learning data 610.

（１）次に、抽出装置１００は、対象データ８００の中から、同一表記の単語の組として、「宮崎」８０１と「宮崎」８０２の組と、「宮崎」８０１と「宮崎」８０３の組と、「宮崎」８０２と「宮崎」８０３の組と、を抽出する。 (1) Next, the extraction apparatus 100 extracts, from the target data 800, the pair “Miyazaki” 801 and “Miyazaki” 802, the pair “Miyazaki” 801 and “Miyazaki” 803, and the pair “Miyazaki” 802 and “Miyazaki” 803 as pairs of words having the same notation.

そして、抽出装置１００は、「宮崎」８０１と「宮崎」８０２の組について、「宮崎」８０１と「宮崎」８０２の各々の周辺にある単語の組み合わせ「出身＆さん」、「の＆さん」などを手がかりとして特定する。また、抽出装置１００は、「宮崎」８０１と「宮崎」８０３の組について、「宮崎」８０１と「宮崎」８０３の各々の周辺にある単語の組み合わせ「出身＆へ」、「出身＆行く」などを手がかりとして特定する。また、抽出装置１００は、「宮崎」８０２と「宮崎」８０３の組について、「宮崎」８０２と「宮崎」８０３の各々の周辺にある単語の組み合わせ「さん＆へ」、「さん＆行く」などを手がかりとして特定する。 Then, for the pair “Miyazaki” 801 and “Miyazaki” 802, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “from & san” and “no & san”, as clues. For the pair “Miyazaki” 801 and “Miyazaki” 803, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “from & to” and “from & go”, as clues. For the pair “Miyazaki” 802 and “Miyazaki” 803, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “san & to” and “san & go”, as clues.

（２）そして、抽出装置１００は、判別エンジンを用いて、生成された手がかりのうち、同一表記・種類単語判別規則３００の判別規則に該当する手がかりを特定する。判別エンジンは、該当する判別規則が複数ある場合、尤もらしい判別規則を特定する。抽出装置１００は、判別エンジンによって特定された判別規則により、「宮崎」８０１と「宮崎」８０２の組と「宮崎」８０１と「宮崎」８０３の組と「宮崎」８０２と「宮崎」８０３の組との各々が同一種類か否かを判別する。ここでは、抽出装置１００は、同一表記・種類単語判別規則３００と、共起単語「出身」と「行く」の組み合わせとから、「宮崎」８０１と「宮崎」８０３の組が同一種類であると判別する。 (2) Then, using the discrimination engine, the extraction apparatus 100 identifies a clue that corresponds to the discrimination rule of the same notation / type word discrimination rule 300 among the generated clues. When there are a plurality of applicable discrimination rules, the discrimination engine specifies a plausible discrimination rule. The extraction apparatus 100 determines a set of “Miyazaki” 801 and “Miyazaki” 802, a set of “Miyazaki” 801 and “Miyazaki” 803, a set of “Miyazaki” 802 and “Miyazaki” 803 according to the discrimination rule specified by the discrimination engine. And whether each is the same type. Here, the extraction apparatus 100 determines that the combination of “Miyazaki” 801 and “Miyazaki” 803 is of the same type based on the same notation / type word discrimination rule 300 and the combination of the co-occurrence words “from” and “go”. Determine.

図９において、抽出装置１００は、選択した学習データ６１０の中から、図８で同一種類と判別された「宮崎」８０１と「宮崎」８０３の組に対応する「宮崎」６１１と「宮崎」６１３を特定する。抽出装置１００は、以降の処理で、特定された「宮崎」６１１と「宮崎」６１３を一纏めにして扱う。 In FIG. 9, the extraction apparatus 100 selects “Miyazaki” 611 and “Miyazaki” 613 corresponding to the set of “Miyazaki” 801 and “Miyazaki” 803 determined as the same type in the selected learning data 610 in FIG. 8. Is identified. The extraction apparatus 100 handles the identified “Miyazaki” 611 and “Miyazaki” 613 together in the subsequent processing.

（１）抽出装置１００は、学習データ６１０の中の単語ごとに、または一纏めにされた単語ごとに、タグを参照して単語の種類を特定し、特定した単語の種類を抽出するための手がかりを特定する。抽出装置１００は、具体的には、例えば、一纏めにされた「宮崎」６１１と「宮崎」６１３の各々の周辺にある単語から、「ＬＯＣ」を抽出するための手がかりになる「出身：＋１」、「の：＋２」、「さん：−２」、「と：−１」、「へ：＋１」、および「行く：＋２」を特定する。そして、抽出装置１００は、特定した単語が手がかりである、単語の種類が「ＬＯＣ」であることを判別するための学習事例を生成する。 (1) For each word in the learning data 610, or for each group of words handled together, the extraction apparatus 100 refers to the tag to identify the type of the word and identifies clues for extracting that type. Specifically, for example, from the words surrounding each of the grouped “Miyazaki” 611 and “Miyazaki” 613, the extraction apparatus 100 identifies “from: +1”, “no: +2”, “san: -2”, “with: -1”, “to: +1”, and “go: +2” as clues for extracting “LOC”. The extraction apparatus 100 then generates a learning case for determining that the word type is “LOC”, using the identified words as clues.

また、抽出装置１００は、同一種類の他の単語がない「宮崎」６１２の周辺にある単語から、「ＰＥＲ」を抽出するための手がかりになる「出身：−２」、「の：−１」、「さん：＋１」、および「と：＋２」を特定する。そして、抽出装置１００は、特定した単語が手がかりである、単語の種類が「ＰＥＲ」であることを判別するための学習事例を生成する。 Further, from the words surrounding “Miyazaki” 612, which has no other word of the same type, the extraction apparatus 100 identifies “from: -2”, “no: -1”, “san: +1”, and “with: +2” as clues for extracting “PER”. The extraction apparatus 100 then generates a learning case for determining that the word type is “PER”, using the identified words as clues.

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群のうち、同一種類の固有表現かどうかの判別において尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を、尤もらしい学習事例として特定する。抽出装置１００は、学習エンジンによって特定された学習事例から抽出用規則を生成する。抽出装置１００は、例えば、単語の種類「ＬＯＣ」を示す情報と、「出身：＋１」とを関連付けた抽出用規則を生成する。 (2) Then, using the learning engine, the extraction apparatus 100 identifies a likely learning case in the determination of whether or not it is the same kind of specific expression from among the generated learning case group. For example, the learning engine identifies a learning case whose appearance frequency is greater than or equal to a threshold in the learning case group as a likely learning case. The extraction device 100 generates an extraction rule from the learning case specified by the learning engine. For example, the extraction apparatus 100 generates an extraction rule in which information indicating the word type “LOC” is associated with “origin: +1”.

そして、抽出装置１００は、生成した抽出用規則を含む固有表現抽出用規則４００を生成する。また、抽出装置１００は、固有表現以外の種類「Ｏ（Ｏｔｈｅｒ）」の単語「出身」についても手がかりを特定し、学習事例を生成してもよい。そして、抽出装置１００は、生成した学習事例から単語の種類「Ｏ」を示す抽出用規則を生成してもよい。 Then, the extraction apparatus 100 generates a specific expression extraction rule 400 including the generated extraction rule. Further, the extraction apparatus 100 may identify a clue for the word “from” of the type “O (Other)” other than the specific expression, and generate a learning example. Then, the extraction apparatus 100 may generate an extraction rule indicating the word type “O” from the generated learning case.

これにより、抽出装置１００は、学習データ群６００を用いて、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を機械学習により生成することができる。そのため、抽出装置１００の利用者は、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を生成する手間を削減することができる。 Thereby, the extraction apparatus 100 can generate the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 by machine learning using the learning data group 600. Therefore, the user of the extraction apparatus 100 can reduce the trouble of generating the same notation / type word discrimination rule 300 and the specific expression extraction rule 400.
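The positional clues used for the extraction rules (for example 「出身：＋１」) can be derived as in the sketch below. The token list, the window of two words on each side, and the function name `positional_clues` are assumptions for illustration; the clue strings use ASCII signs and colons rather than the full-width characters in the text.

```python
# Assumed tokenisation of learning data 610 (surface forms only).
TOKENS = ["宮崎", "出身", "の", "宮崎", "さん", "と", "宮崎", "へ", "行く"]


def positional_clues(tokens, indices, window=2):
    """Collect "word:offset" clues around every position in `indices`.

    `indices` holds one position for a single word, or several positions for
    identically written words that are handled together as one group.
    """
    clues = []
    for i in indices:
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            clues.append(f"{tokens[j]}:{offset:+d}")
    return clues


# "Miyazaki" 611 and 613 are grouped; "Miyazaki" 612 is handled alone.
loc_case = (positional_clues(TOKENS, [0, 6]), "LOC")
per_case = (positional_clues(TOKENS, [3]), "PER")
```

Grouping the two LOC occurrences means the learning case for “LOC” draws on the context of both positions, which is the point of handling same-type words together.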

（実施例１にかかる規則学習処理の詳細な処理手順） (Detailed processing procedure of the rule learning process according to the first embodiment)
次に、図１０および図１１を用いて、実施例１にかかる規則学習処理の詳細な処理手順について説明する。実施例１にかかる規則学習処理は、図７〜図９に示した抽出装置１００によって実行される処理である。 Next, a detailed processing procedure of the rule learning process according to the first embodiment will be described with reference to FIGS. 10 and 11. The rule learning process according to the first embodiment is a process executed by the extraction apparatus 100 illustrated in FIGS. 7 to 9.

図１０および図１１は、実施例１にかかる規則学習処理の詳細な処理手順を示すフローチャートである。図１０に示すように、まず、抽出装置１００は、学習データ群６００の中から、未処理の学習データを選択する（ステップＳ１００１）。 10 and 11 are flowcharts illustrating a detailed processing procedure of the rule learning process according to the first embodiment. As shown in FIG. 10, first, the extraction apparatus 100 selects unprocessed learning data from the learning data group 600 (step S1001).

次に、抽出装置１００は、選択した学習データの中に、同一表記の単語があるか否かを判定する（ステップＳ１００２）。ここで、同一表記の単語がない場合（ステップＳ１００２：Ｎｏ）、抽出装置１００は、ステップＳ１００８に移行する。 Next, the extraction apparatus 100 determines whether there is a word with the same notation in the selected learning data (step S1002). Here, when there is no word of the same notation (step S1002: No), the extraction apparatus 100 moves to step S1008.

一方、同一表記の単語がある場合（ステップＳ１００２：Ｙｅｓ）、抽出装置１００は、選択した学習データの中にある同一表記の単語の組のうち、未処理の単語の組を選択する（ステップＳ１００３）。 On the other hand, when there is a word with the same notation (step S1002: Yes), the extraction apparatus 100 selects a set of unprocessed words from the set of words with the same notation in the selected learning data (step S1003). ).

次に、抽出装置１００は、選択した単語の組が同一種類の単語であるか否かを判定する（ステップＳ１００４）。ここで、同一種類の単語である場合（ステップＳ１００４：Ｙｅｓ）、抽出装置１００は、選択した単語の組の各々とともに出現する共起単語の組み合わせと、単語の組が同一種類であることを示す情報と、を含む学習事例を生成する（ステップＳ１００５）。そして、抽出装置１００は、ステップＳ１００７に移行する。 Next, the extraction apparatus 100 determines whether or not the selected set of words is the same type of word (step S1004). Here, if the words are of the same type (step S1004: Yes), the extraction apparatus 100 indicates that the combination of co-occurrence words appearing with each of the selected word sets and the word set are of the same type. A learning case including information is generated (step S1005). Then, the extraction device 100 proceeds to step S1007.

一方、異なる種類の単語である場合（ステップＳ１００４：Ｎｏ）、抽出装置１００は、選択した単語の組の各々とともに出現する共起単語の組み合わせと、単語の組が異なる種類であることを示す情報と、を含む学習事例を生成する（ステップＳ１００６）。そして、抽出装置１００は、ステップＳ１００７に移行する。 On the other hand, when the words are of different types (step S1004: No), the extraction apparatus 100 generates a learning case including the combination of co-occurrence words appearing with each word of the selected pair and information indicating that the pair of words is of different types (step S1006). Then, the extraction apparatus 100 proceeds to step S1007.

次に、抽出装置１００は、未処理の単語の組があるか否かを判定する（ステップＳ１００７）。未処理の単語の組がある場合（ステップＳ１００７：Ｙｅｓ）、抽出装置１００は、ステップＳ１００３に戻る。 Next, the extraction apparatus 100 determines whether there is a set of unprocessed words (step S1007). If there is a set of unprocessed words (step S1007: Yes), the extraction apparatus 100 returns to step S1003.

一方、未処理の単語の組がない場合（ステップＳ１００７：Ｎｏ）、抽出装置１００は、学習データ群６００の中に、未処理の学習データがあるか否かを判定する（ステップＳ１００８）。ここで、未処理の学習データがある場合（ステップＳ１００８：Ｙｅｓ）、抽出装置１００は、ステップＳ１００１に戻る。 On the other hand, when there is no unprocessed word set (step S1007: No), the extraction apparatus 100 determines whether there is unprocessed learning data in the learning data group 600 (step S1008). Here, when there is unprocessed learning data (step S1008: Yes), the extraction device 100 returns to step S1001.

一方、未処理の学習データがない場合（ステップＳ１００８：Ｎｏ）、抽出装置１００は、生成した学習事例群から、同一表記・種類単語判別規則３００を生成する（ステップＳ１００９）。そして、抽出装置１００は、図１１のステップＳ１１０１に移行する。 On the other hand, when there is no unprocessed learning data (step S1008: No), the extraction apparatus 100 generates the same notation / type word discrimination rule 300 from the generated learning case group (step S1009). Then, the extraction apparatus 100 proceeds to step S1101 in FIG. 11.

図１１において、抽出装置１００は、学習データ群６００の中から未処理の学習データを選択し、選択した学習データのタグを除去した対象データを生成する（ステップＳ１１０１）。次に、抽出装置１００は、同一表記・種類単語判別規則３００を参照して、生成した対象データの中で同一表記かつ同一種類の単語の組を特定する（ステップＳ１１０２）。 In FIG. 11, the extraction device 100 selects unprocessed learning data from the learning data group 600, and generates target data from which the tags of the selected learning data are removed (step S1101). Next, the extraction apparatus 100 refers to the same notation / type word discrimination rule 300 and identifies a set of words having the same notation and the same type in the generated target data (step S1102).

次に、抽出装置１００は、選択した学習データの中から、未処理の単語を選択する（ステップＳ１１０３）。そして、抽出装置１００は、ステップＳ１１０２の特定結果から、選択した単語と同一表記かつ同一種類の単語があるか否かを判定する（ステップＳ１１０４）。 Next, the extraction apparatus 100 selects an unprocessed word from the selected learning data (step S1103). Then, the extraction apparatus 100 determines whether there is a word of the same notation and the same type as the selected word from the identification result of step S1102 (step S1104).

ここで、同一表記かつ同一種類の単語がある場合（ステップＳ１１０４：Ｙｅｓ）、抽出装置１００は、同一表記かつ同一種類の単語の組の各々から特定した手がかりと、タグから特定した当該単語の組の種類と、を含む学習事例を生成する（ステップＳ１１０５）。そして、抽出装置１００は、ステップＳ１１０７に移行する。 Here, when there is a word of the same notation and the same type (step S1104: Yes), the extraction apparatus 100 generates a learning case including the clues identified from each word in the set of words of the same notation and the same type and the type of that set of words identified from the tags (step S1105). Then, the extraction apparatus 100 proceeds to step S1107.

一方、同一表記かつ同一種類の単語がない場合（ステップＳ１１０４：Ｎｏ）、抽出装置１００は、選択した単語から特定した手がかりと、タグから特定した当該単語の種類と、を含む学習事例を生成する（ステップＳ１１０６）。そして、抽出装置１００は、ステップＳ１１０７に移行する。 On the other hand, when there is no word of the same notation and the same type (step S1104: No), the extraction apparatus 100 generates a learning case including the clues identified from the selected word and the type of that word identified from the tag (step S1106). Then, the extraction apparatus 100 proceeds to step S1107.

次に、抽出装置１００は、選択した学習データの中に、未処理の単語があるか否かを判定する（ステップＳ１１０７）。ここで、未処理の単語がある場合（ステップＳ１１０７：Ｙｅｓ）、抽出装置１００は、ステップＳ１１０３に戻る。 Next, the extraction apparatus 100 determines whether or not there is an unprocessed word in the selected learning data (step S1107). Here, when there is an unprocessed word (step S1107: Yes), the extraction apparatus 100 returns to step S1103.

一方、未処理の単語がない場合（ステップＳ１１０７：Ｎｏ）、抽出装置１００は、未処理の学習データがあるか否かを判定する（ステップＳ１１０８）。ここで、未処理の学習データがある場合（ステップＳ１１０８：Ｙｅｓ）、抽出装置１００は、ステップＳ１１０１に戻る。 On the other hand, when there is no unprocessed word (step S1107: No), the extraction apparatus 100 determines whether there is unprocessed learning data (step S1108). Here, when there is unprocessed learning data (step S1108: Yes), the extraction device 100 returns to step S1101.

一方、未処理の学習データがない場合（ステップＳ１１０８：Ｎｏ）、抽出装置１００は、生成した学習事例群から、固有表現抽出用規則４００を生成する（ステップＳ１１０９）。そして、抽出装置１００は、規則学習処理を終了する。これにより、抽出装置１００は、学習データ群６００を用いて、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を機械学習により生成することができる。そのため、抽出装置１００の利用者は、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を生成する手間を削減することができる。 On the other hand, when there is no unprocessed learning data (step S1108: No), the extraction apparatus 100 generates a specific expression extraction rule 400 from the generated learning case group (step S1109). Then, the extraction device 100 ends the rule learning process. Thereby, the extraction apparatus 100 can generate the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 by machine learning using the learning data group 600. Therefore, the user of the extraction apparatus 100 can reduce the trouble of generating the same notation / type word discrimination rule 300 and the specific expression extraction rule 400.

（実施例１にかかる抽出装置１００による固有表現抽出処理の具体例１） (Specific example 1 of the specific expression extraction process by the extraction apparatus 100 according to the first embodiment)
次に、図１２および図１３を用いて、実施例１にかかる抽出装置１００による固有表現抽出処理の具体例１について説明する。実施例１にかかる固有表現抽出処理は、固有表現の抽出対象のデータの中から固有表現の単語を抽出する処理であり、図５を用いて説明した処理である。 Next, specific example 1 of the specific expression extraction process by the extraction apparatus 100 according to the first embodiment will be described with reference to FIGS. 12 and 13. The specific expression extraction process according to the first embodiment is a process of extracting words that are specific expressions from the data targeted for extraction, and is the process described with reference to FIG. 5.

図１２および図１３は、実施例１にかかる抽出装置１００による固有表現抽出処理の具体例１を示す説明図である。図１２において、抽出装置１００は、固有表現の抽出対象になる入力データ１２００を受け付ける。 12 and 13 are explanatory diagrams of a specific example 1 of the specific expression extraction process performed by the extraction apparatus 100 according to the first embodiment. In FIG. 12, the extraction apparatus 100 receives input data 1200 that is a target of extraction of a specific expression.

（１）まず、抽出装置１００は、入力データ１２００の中から、同一表記の単語の組として、「福岡」１２０１と「福岡」１２０２の組と、「福岡」１２０１と「福岡」１２０３の組と、「福岡」１２０２と「福岡」１２０３の組と、を抽出する。 (1) First, the extraction apparatus 100 includes a set of “Fukuoka” 1201 and “Fukuoka” 1202, a set of “Fukuoka” 1201 and “Fukuoka” 1203 as a set of words having the same notation from the input data 1200. , “Fukuoka” 1202 and “Fukuoka” 1203 are extracted.

そして、抽出装置１００は、「福岡」１２０１と「福岡」１２０２の組について、「福岡」１２０１と「福岡」１２０２の各々の周辺にある単語の組み合わせ「出身＆さん」、「の＆さん」などを手がかりとして特定する。また、抽出装置１００は、「福岡」１２０１と「福岡」１２０３の組について、「福岡」１２０１と「福岡」１２０３の各々の周辺にある単語の組み合わせ「出身＆へ」、「出身＆行く」などを手がかりとして特定する。また、抽出装置１００は、「福岡」１２０２と「福岡」１２０３の組について、「福岡」１２０２と「福岡」１２０３の各々の周辺にある単語の組み合わせ「さん＆へ」、「さん＆行く」などを手がかりとして特定する。 Then, for the pair “Fukuoka” 1201 and “Fukuoka” 1202, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “from & san” and “no & san”, as clues. For the pair “Fukuoka” 1201 and “Fukuoka” 1203, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “from & to” and “from & go”, as clues. For the pair “Fukuoka” 1202 and “Fukuoka” 1203, the extraction apparatus 100 identifies combinations of the words surrounding each of them, such as “san & to” and “san & go”, as clues.

（２）そして、抽出装置１００は、判別エンジンを用いて、生成された手がかりのうち、同一表記・種類単語判別規則３００の判別規則に該当する手がかりを特定する。判別エンジンは、該当する判別規則が複数ある場合、尤もらしい判別規則を特定する。抽出装置１００は、判別エンジンによって特定された判別規則により、「福岡」１２０１と「福岡」１２０２の組と「福岡」１２０１と「福岡」１２０３の組と「福岡」１２０２と「福岡」１２０３の組との各々が同一種類か否かを判別する。 (2) Then, using the discrimination engine, the extraction apparatus 100 identifies a clue that corresponds to the discrimination rule of the same notation / type word discrimination rule 300 among the generated clues. When there are a plurality of applicable discrimination rules, the discrimination engine specifies a plausible discrimination rule. The extraction apparatus 100 determines a set of “Fukuoka” 1201 and “Fukuoka” 1202, a set of “Fukuoka” 1201 and “Fukuoka” 1203, a set of “Fukuoka” 1202 and “Fukuoka” 1203 according to the discrimination rule specified by the discrimination engine. And whether each is the same type.

ここでは、抽出装置１００は、同一表記・種類単語判別規則３００と、共起単語「出身」と「行く」の組み合わせとから、「福岡」１２０１と「福岡」１２０３の組が同一種類であると判別する。 Here, based on the same notation / type word discrimination rule 300 and the combination of the co-occurrence words “from” and “go”, the extraction apparatus 100 determines that the pair “Fukuoka” 1201 and “Fukuoka” 1203 is of the same type.

図１３において、抽出装置１００は、以降の処理で、図１２で同一種類と判別された「福岡」１２０１と「福岡」１２０３を一纏めにして扱う。 In FIG. 13, the extraction apparatus 100 handles “Fukuoka” 1201 and “Fukuoka” 1203, which were determined to be of the same type in FIG. 12, together as one group in the subsequent processing.

（１）抽出装置１００は、入力データ１２００の中の単語ごとに、または一纏めにされた単語ごとに、単語の種類を抽出するための手がかりを特定する。抽出装置１００は、具体的には、例えば、一纏めにされた「福岡」１２０１と「福岡」１２０３の周辺にある単語から、手がかりとして「出身：＋１」、「の：＋２」、「さん：−２」、「と：−１」、「へ：＋１」、および「行く：＋２」を特定する。また、抽出装置１００は、同一種類の他の単語がない「福岡」１２０２の周辺にある単語から、手がかりとして「出身：−２」、「の：−１」、「さん：＋１」、および「と：＋２」を特定する。 (1) For each word in the input data 1200, or for each group of words handled together, the extraction apparatus 100 identifies clues for extracting the type of the word. Specifically, for example, from the words surrounding the grouped “Fukuoka” 1201 and “Fukuoka” 1203, the extraction apparatus 100 identifies “from: +1”, “no: +2”, “san: -2”, “with: -1”, “to: +1”, and “go: +2” as clues. Further, from the words surrounding “Fukuoka” 1202, which has no other word of the same type, the extraction apparatus 100 identifies “from: -2”, “no: -1”, “san: +1”, and “with: +2” as clues.

（２）次に、抽出装置１００は、判別エンジンを用いて、生成された手がかりに該当する固有表現抽出用規則４００の抽出用規則を特定する。判別エンジンは、該当する抽出用規則が複数ある場合、尤もらしい抽出用規則を特定する。抽出装置１００は、判別エンジンによって特定された抽出用規則により、入力データ１２００の中の単語ごとに、または一纏めにされた単語ごとに、単語の種類を抽出する。ここでは、抽出装置１００は、「福岡」１２０１と「福岡」１２０３の種類として「ＬＯＣ」を抽出し、「福岡」１２０２の種類として「ＰＥＲ」を抽出する。 (2) Next, the extraction apparatus 100 specifies the extraction rule of the specific expression extraction rule 400 corresponding to the generated clue using the discrimination engine. When there are a plurality of corresponding extraction rules, the discrimination engine specifies a plausible extraction rule. The extraction device 100 extracts word types for each word in the input data 1200 or for each word grouped according to the extraction rules specified by the discrimination engine. Here, the extraction apparatus 100 extracts “LOC” as the type of “Fukuoka” 1201 and “Fukuoka” 1203 and extracts “PER” as the type of “Fukuoka” 1202.

また、抽出装置１００は、単語「出身」についても手がかりを特定し、「出身」の単語の種類を抽出してもよい。ここでは、抽出装置１００は、「出身」について、固有表現ではない単語の種類「Ｏ」を抽出する。 The extraction apparatus 100 may also identify a clue for the word “from” and extract the type of the word “from”. Here, the extraction apparatus 100 extracts a word type “O” that is not a unique expression for “born”.

これにより、抽出装置１００は、同一表記かつ同一種類の単語については、一纏めにして同じ単語の種類を抽出することができる。また、抽出装置１００は、同一表記であっても異なる種類の単語については、他の同一表記の単語とは別個に単語の種類を抽出することができるようになる。結果として、抽出装置１００は、同一表記かつ同一種類の単語を一纏めにして同じ種類の単語として抽出することで、同一種類の単語を異なる種類の単語として抽出することを防止して、抽出精度の向上を図ることができる。また、抽出装置１００は、同一表記であっても異なる種類の単語同士を、別個に扱って単語の種類を抽出することで、誤って同じ種類の単語として抽出することを防止し、抽出精度の向上を図ることができる。 Thereby, the extraction apparatus 100 can extract the same word type for the same notation and the same type of words. Further, the extraction apparatus 100 can extract the type of a word separately from other words having the same notation, even if they have the same notation. As a result, the extraction device 100 prevents the extraction of the same type of word as a different type of word by extracting the same notation and the same type of word together and extracting the same type of word, thereby improving the extraction accuracy. Improvements can be made. Further, the extraction apparatus 100 treats different types of words separately even if they are the same notation, and extracts the type of the word, thereby preventing erroneous extraction as the same type of word and improving the extraction accuracy. Improvements can be made.
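The two-stage extraction described above (first group identically written words judged to be of the same type, then assign one type per group) can be sketched as follows. The rule contents, the simple majority vote that stands in for the discrimination engine, and all function names are hypothetical and chosen only to reproduce the Fukuoka example.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical contents of the learned rules 300 and 400.
SAME_TYPE_CLUES = {"出身＆行く"}
EXTRACTION_RULES = {"出身:+1": "LOC", "さん:+1": "PER"}


def following(tokens, index, window=2):
    """Return up to `window` words that follow position `index`."""
    return tokens[index + 1:index + 1 + window]


def group_same_type(tokens, window=2):
    """Map each position to the set of positions it is handled together with."""
    positions = {}
    for i, word in enumerate(tokens):
        positions.setdefault(word, []).append(i)

    group_of = {i: {i} for i in range(len(tokens))}
    for indices in positions.values():
        for i, j in combinations(indices, 2):
            clues = {f"{a}＆{b}" for a, b in product(following(tokens, i, window),
                                                    following(tokens, j, window))}
            if clues & SAME_TYPE_CLUES:  # pair judged to be of the same type
                merged = group_of[i] | group_of[j]
                for k in merged:
                    group_of[k] = merged
    return group_of


def extract_types(tokens, window=2):
    """Assign a type to every word; grouped words share their clues and type."""
    group_of = group_same_type(tokens, window)
    types = []
    for i in range(len(tokens)):
        votes = Counter()
        for member in group_of[i]:
            for offset in range(-window, window + 1):
                j = member + offset
                if offset != 0 and 0 <= j < len(tokens):
                    label = EXTRACTION_RULES.get(f"{tokens[j]}:{offset:+d}")
                    if label:
                        votes[label] += 1
        types.append(votes.most_common(1)[0][0] if votes else "O")
    return types


input_1200 = ["福岡", "出身", "の", "福岡", "さん", "と", "福岡", "へ", "行く"]
types_1200 = extract_types(input_1200)
```

In this sketch, “Fukuoka” 1203 has no rule-matching clue of its own and receives “LOC” only because it is grouped with “Fukuoka” 1201, while “Fukuoka” 1202 is typed separately as “PER”.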

（実施例１にかかる抽出装置１００による固有表現抽出結果の出力例１） (Output example 1 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment)
次に、図１４を用いて、実施例１にかかる抽出装置１００による固有表現抽出結果の出力例１について説明する。抽出装置１００は、図１２および図１３での固有表現抽出結果を出力する。 Next, output example 1 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment will be described with reference to FIG. 14. The extraction apparatus 100 outputs the specific expression extraction results obtained in FIGS. 12 and 13.

図１４は、実施例１にかかる抽出装置１００による固有表現抽出結果の出力例１を示す説明図である。図１４の（Ａ）に示すように、抽出装置１００は、例えば、抽出した単語の種類をタグとして付与した入力データ１２００「＜ＬＯＣ＞福岡＜／ＬＯＣ＞出身の＜ＰＥＲ＞福岡＜／ＰＥＲ＞さんと＜ＬＯＣ＞福岡＜／ＬＯＣ＞へ行く。」を出力する。また、図１４の（Ｂ）に示すように、抽出装置１００は、例えば、入力データ１２００「福岡出身の福岡さんと福岡へ行く。」を表示する際に、抽出した単語の種類を示す色を付与して表示する。 FIG. 14 is an explanatory diagram of output example 1 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment. As shown in (A) of FIG. 14, the extraction apparatus 100 outputs, for example, the input data 1200 with the extracted word types attached as tags: “Going to <LOC>Fukuoka</LOC> with Mr. <PER>Fukuoka</PER> from <LOC>Fukuoka</LOC>.” Further, as shown in (B) of FIG. 14, when displaying the input data 1200 “Going to Fukuoka with Mr. Fukuoka from Fukuoka.”, the extraction apparatus 100 displays each word in a color indicating its extracted type.
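The tagged output of (A) in FIG. 14 can be produced from a list of words and their extracted types as in the following sketch. The function name `render_tagged`, the token list, and the use of ASCII angle brackets are assumptions for illustration.

```python
def render_tagged(tokens, types):
    """Wrap every word whose type is not "O" in <TYPE>...</TYPE> tags."""
    parts = []
    for word, word_type in zip(tokens, types):
        if word_type == "O":
            parts.append(word)
        else:
            parts.append(f"<{word_type}>{word}</{word_type}>")
    return "".join(parts)


tokens = ["福岡", "出身", "の", "福岡", "さん", "と", "福岡", "へ", "行く", "。"]
types = ["LOC", "O", "O", "PER", "O", "O", "LOC", "O", "O", "O"]
tagged = render_tagged(tokens, types)
```

The color display of (B) could reuse the same loop, replacing the tag pair with whatever markup the display layer uses for each type.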

（実施例１にかかる固有表現抽出処理の詳細な処理手順） (Detailed processing procedure of the specific expression extraction process according to the first embodiment)
次に、図１５を用いて、実施例１にかかる固有表現抽出処理の詳細な処理手順について説明する。実施例１にかかる固有表現抽出処理は、図１２〜図１４に示した抽出装置１００によって行われる処理である。 Next, a detailed processing procedure of the specific expression extraction process according to the first embodiment will be described with reference to FIG. 15. The specific expression extraction process according to the first embodiment is the process performed by the extraction apparatus 100 illustrated in FIGS. 12 to 14.

図１５は、実施例１にかかる固有表現抽出処理の詳細な処理手順を示すフローチャートである。図１５に示すように、まず、抽出装置１００は、入力データを受け付ける（ステップＳ１５０１）。次に、抽出装置１００は、同一表記・種類単語判別規則３００を参照して、入力データの中で同一表記かつ同一種類の単語の組を特定する（ステップＳ１５０２）。 FIG. 15 is a flowchart of a detailed process procedure of the named entity extraction process according to the first embodiment. As shown in FIG. 15, first, the extraction device 100 accepts input data (step S1501). Next, the extraction apparatus 100 refers to the same notation / type word discrimination rule 300 and identifies a set of words having the same notation and the same type in the input data (step S1502).

次に、抽出装置１００は、入力データの中から、未処理の単語を選択する（ステップＳ１５０３）。そして、抽出装置１００は、ステップＳ１５０２の特定結果から、選択した単語と同一表記かつ同一種類の単語があるか否かを判定する（ステップＳ１５０４）。 Next, the extraction apparatus 100 selects an unprocessed word from the input data (step S1503). Then, the extraction apparatus 100 determines whether or not there is a word having the same notation and the same type as the selected word from the identification result of step S1502 (step S1504).

ここで、同一表記かつ同一種類の単語がある場合（ステップＳ１５０４：Ｙｅｓ）、抽出装置１００は、同一表記かつ同一種類の単語の組の各々から特定した手がかりと、固有表現抽出用規則４００と、から同一表記かつ同一種類の単語の組の種類を抽出する（ステップＳ１５０５）。そして、抽出装置１００は、ステップＳ１５０７に移行する。 Here, when there is a word of the same notation and the same type (step S1504: Yes), the extraction apparatus 100 extracts the type of the set of words of the same notation and the same type from the clues identified from each word in that set and the specific expression extraction rule 400 (step S1505). Then, the extraction apparatus 100 proceeds to step S1507.

一方、同一表記かつ同一種類の単語がない場合（ステップＳ１５０４：Ｎｏ）、抽出装置１００は、選択した単語から特定した手がかりと、固有表現抽出用規則４００と、から選択した単語の種類を抽出する（ステップＳ１５０６）。そして、抽出装置１００は、ステップＳ１５０７に移行する。 On the other hand, when there is no word of the same notation and the same type (step S1504: No), the extraction apparatus 100 extracts the type of the selected word from the clues identified from the selected word and the specific expression extraction rule 400 (step S1506). Then, the extraction apparatus 100 proceeds to step S1507.

次に、抽出装置１００は、入力データの中に、未処理の単語があるか否かを判定する（ステップＳ１５０７）。ここで、未処理の単語がある場合（ステップＳ１５０７：Ｙｅｓ）、抽出装置１００は、ステップＳ１５０３に戻る。 Next, the extraction apparatus 100 determines whether there is an unprocessed word in the input data (step S1507). If there is an unprocessed word (step S1507: YES), the extraction apparatus 100 returns to step S1503.

一方、未処理の単語がない場合（ステップＳ１５０７：Ｎｏ）、抽出装置１００は、抽出結果を出力する（ステップＳ１５０８）。そして、抽出装置１００は、固有表現抽出処理を終了する。 On the other hand, when there is no unprocessed word (step S1507: No), the extraction apparatus 100 outputs an extraction result (step S1508). Then, the extraction apparatus 100 ends the specific expression extraction process.

（実施例１にかかる抽出装置１００による規則学習処理の具体例２） (Specific example 2 of the rule learning process by the extraction apparatus 100 according to the first embodiment)
次に、図１６〜図１９を用いて、実施例１にかかる抽出装置１００による規則学習処理の具体例２について説明する。具体例２は、具体例１よりも単語の種類を細分化した場合の例である。例えば、具体例２では、単語の種類として、組織名「ＯＲＧ」を細分化した、組織名の先頭「Ｂ−ＯＲＧ」と、組織名の中「Ｉ−ＯＲＧ」と、組織名の後尾「Ｅ−ＯＲＧ」と、を採用する。 Next, specific example 2 of the rule learning process by the extraction apparatus 100 according to the first embodiment will be described with reference to FIGS. 16 to 19. Specific example 2 is an example in which the word types are subdivided more finely than in specific example 1. For example, specific example 2 subdivides the organization name type “ORG” into “B-ORG” for the beginning of an organization name, “I-ORG” for the middle of an organization name, and “E-ORG” for the end of an organization name.

図１６は、抽出装置１００による規則学習処理に用いられる学習データの例２を示す説明図である。図１６に示すように、図６の学習データ群６００より細分化された種類を示すタグが付与された複数の学習データを含む学習データ群１６００が用意される。 FIG. 16 is an explanatory diagram illustrating example 2 of the learning data used for the rule learning process by the extraction apparatus 100. As shown in FIG. 16, a learning data group 1600 is prepared that includes a plurality of pieces of learning data to which tags indicating more finely subdivided types than those of the learning data group 600 in FIG. 6 are attached.

例えば、学習データ群１６００には、学習データ１６１０「＜Ｂ−ＯＲＧ＞Ａ＜／Ｂ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞株式会社＜／Ｅ−ＯＲＧ＞は＜Ｂ−ＯＲＧ＞株式会社＜／Ｂ−ＯＲＧ＞＜Ｉ−ＯＲＧ＞Ａ＜／Ｉ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞研究所＜／Ｅ−ＯＲＧ＞の子会社であり、＜Ｂ−ＯＲＧ＞株式会社＜／Ｂ−ＯＲＧ＞＜Ｉ−ＯＲＧ＞Ａ＜／Ｉ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞研究所＜／Ｅ−ＯＲＧ＞は＜ＬＯＣ＞川崎市＜／ＬＯＣ＞にある。」が含まれる。 For example, the learning data group 1600 includes learning data 1610: “<B-ORG>A</B-ORG><E-ORG>Corporation</E-ORG> is a subsidiary of <B-ORG>Corporation</B-ORG><I-ORG>A</I-ORG><E-ORG>Laboratory</E-ORG>, and <B-ORG>Corporation</B-ORG><I-ORG>A</I-ORG><E-ORG>Laboratory</E-ORG> is located in <LOC>Kawasaki City</LOC>.”

学習データ１６１０の中の「Ａ」１６１１には「Ｂ−ＯＲＧ」のタグが付与され、「株式会社」１６１２には「Ｅ−ＯＲＧ」のタグが付与されている。また、「株式会社」１６１３には「Ｂ−ＯＲＧ」のタグが付与され、「Ａ」１６１４には「Ｉ−ＯＲＧ」のタグが付与され、「研究所」１６１５には「Ｅ−ＯＲＧ」のタグが付与されている。また、「株式会社」１６１６には「Ｂ−ＯＲＧ」のタグが付与され、「Ａ」１６１７には「Ｉ−ＯＲＧ」のタグが付与され、「研究所」１６１８には「Ｅ−ＯＲＧ」のタグが付与されている。また、「川崎市」１６１９には「ＬＯＣ」のタグが付与されている。 A tag “B-ORG” is assigned to “A” 1611 in the learning data 1610, and a tag “E-ORG” is assigned to “corporation” 1612. Also, “B-ORG” tag is assigned to “corporation” 1613, “I-ORG” tag is assigned to “A” 1614, and “E-ORG” is assigned to “Laboratory” 1615. A tag is attached. Also, “B-ORG” tag is assigned to “corporation” 1616, “I-ORG” tag is assigned to “A” 1617, and “E-ORG” is assigned to “laboratory” 1618. A tag is attached. The tag “LOC” is assigned to “Kawasaki city” 1619.

また、学習データ群１６００には、学習データ１６２０「＜ＬＯＣ＞宮崎＜／ＬＯＣ＞にある＜Ｂ−ＯＲＧ＞宮崎＜／Ｂ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞商事＜／Ｅ−ＯＲＧ＞の＜ＰＥＲ＞宮崎＜／ＰＥＲ＞社長は＜ＬＯＣ＞宮崎＜／ＬＯＣ＞出身である。」が含まれる。学習データ１６２０の中の「宮崎」１６２１には「ＬＯＣ」のタグが付与され、「宮崎」１６２２には「Ｂ−ＯＲＧ」のタグが付与されている。また、「商事」１６２３には「Ｅ−ＯＲＧ」のタグが付与され、「宮崎」１６２４には「ＰＥＲ」のタグが付与され、「宮崎」１６２５には「ＬＯＣ」のタグが付与されている。 The learning data group 1600 also includes learning data 1620: “President <PER>Miyazaki</PER> of <B-ORG>Miyazaki</B-ORG><E-ORG>Shoji</E-ORG>, which is located in <LOC>Miyazaki</LOC>, is from <LOC>Miyazaki</LOC>.” In the learning data 1620, the tag “LOC” is attached to “Miyazaki” 1621, and the tag “B-ORG” is attached to “Miyazaki” 1622. The tag “E-ORG” is attached to “Shoji” 1623, the tag “PER” is attached to “Miyazaki” 1624, and the tag “LOC” is attached to “Miyazaki” 1625.

抽出装置１００は、このような学習データ群１６００を用いることで、図６に示した学習データ群６００を用いた場合よりも細分化された種類に対応した同一表記・種類単語判別規則および固有表現抽出用規則を生成することができる。 The extraction apparatus 100 uses the learning data group 1600 as described above, so that the same notation / type word discrimination rule and specific expression corresponding to the types subdivided compared to the case where the learning data group 600 shown in FIG. 6 is used. Extraction rules can be generated.

図１７〜図１９は、図１６に示した学習データ群１６００を用いて同一表記・種類単語判別規則３００と固有表現抽出用規則４００とを生成する内容を示す説明図である。図１７において、まず、抽出装置１００は、学習データ群１６００の中から、未処理の学習データを選択する。ここでは、抽出装置１００は、学習データ１６１０を選択したとする。 FIGS. 17 to 19 are explanatory diagrams showing the contents for generating the same notation / type word discrimination rule 300 and the unique expression extraction rule 400 using the learning data group 1600 shown in FIG. In FIG. 17, the extraction apparatus 100 first selects unprocessed learning data from the learning data group 1600. Here, it is assumed that the extraction apparatus 100 has selected the learning data 1610.

（１）次に、抽出装置１００は、選択した学習データ１６１０の中から、同一表記の単語の組として、「Ａ」１６１１と「Ａ」１６１４の組と、「Ａ」１６１１と「Ａ」１６１７の組と、「Ａ」１６１４と「Ａ」１６１７の組と、を抽出する。 (1) Next, from the selected learning data 1610, the extraction apparatus 100 extracts, as pairs of words having the same notation, the pair “A” 1611 and “A” 1614, the pair “A” 1611 and “A” 1617, and the pair “A” 1614 and “A” 1617.

そして、抽出装置１００は、「Ａ」１６１１と「Ａ」１６１４の組については、異なる種類であるため、「Ａ」１６１１と「Ａ」１６１４の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などの単語の組が手がかりである、同一種類でないことを判別するための学習事例を生成する。 Because “A” 1611 and “A” 1614 are of different types, the extraction apparatus 100 generates a learning case for determining that the pair is not of the same type, using as clues combinations of the words surrounding each of “A” 1611 and “A” 1614, such as “corporation & laboratory”.

また、抽出装置１００は、「Ａ」１６１１と「Ａ」１６１７の組については、異なる種類であるため、「Ａ」１６１１と「Ａ」１６１７の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などの単語の組が手がかりである、同一種類でないことを判別するための学習事例を生成する。 Likewise, because “A” 1611 and “A” 1617 are of different types, the extraction apparatus 100 generates a learning case for determining that the pair is not of the same type, using as clues combinations of the words surrounding each of “A” 1611 and “A” 1617, such as “corporation & laboratory”.

また、抽出装置１００は、「Ａ」１６１４と「Ａ」１６１７の組については、同一種類であるため、「Ａ」１６１４と「Ａ」１６１７の各々の周辺にある単語の組み合わせ「研究所＆研究所」、「株式会社＆株式会社」などの単語の組が手がかりである、同一種類であることを判別するための学習事例を生成する。 Because “A” 1614 and “A” 1617 are of the same type, the extraction apparatus 100 generates a learning case for determining that the pair is of the same type, using as clues combinations of the words surrounding each of “A” 1614 and “A” 1617, such as “laboratory & laboratory” and “corporation & corporation”.

抽出装置１００は、単語「Ａ」について学習事例を生成した後、他の同一表記の単語「株式会社」や「研究所」についても学習事例を生成する。また、抽出装置１００は、学習データ１６１０から学習事例を生成した後、学習データ群１６００の中に未処理の学習データが残っていれば、当該未処理の学習データからも学習事例を生成する。 After generating the learning case for the word “A”, the extraction apparatus 100 also generates the learning case for the other words “corporation” and “laboratory” having the same notation. In addition, after generating a learning case from the learning data 1610, the extraction device 100 generates a learning case from the unprocessed learning data if unprocessed learning data remains in the learning data group 1600.
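
The pairwise learning cases described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: the word segmentation in `TOKENS`, the use of "O" for untagged words, the one-word context window, and the function name `pair_cases` are all hypothetical. The surface forms are the Japanese words of the learning data 1610 (株式会社 "corporation", 研究所 "laboratory").

```python
from itertools import combinations

# Assumed word segmentation of learning data 1610; "O" marks untagged words.
TOKENS = [
    ("A", "B-ORG"), ("株式会社", "E-ORG"), ("は", "O"),
    ("株式会社", "B-ORG"), ("A", "I-ORG"), ("研究所", "E-ORG"),
    ("の", "O"), ("子会社", "O"), ("で", "O"), ("あり", "O"), ("、", "O"),
    ("株式会社", "B-ORG"), ("A", "I-ORG"), ("研究所", "E-ORG"),
    ("は", "O"), ("川崎市", "LOC"), ("に", "O"), ("ある", "O"), ("。", "O"),
]

def pair_cases(tokens, window=1):
    """Build one learning case per pair of same-notation words.

    Each case is (surface, i, j, clues, same_type), where clues pairs up the
    neighbours of both occurrences at the same relative offset.
    """
    positions = {}
    for i, (surface, _tag) in enumerate(tokens):
        positions.setdefault(surface, []).append(i)
    cases = []
    for surface, idxs in positions.items():
        for i, j in combinations(idxs, 2):
            if tokens[i][1] == "O" and tokens[j][1] == "O":
                continue  # skip pairs in which neither word is tagged
            clues = []
            for off in range(-window, window + 1):
                a, b = i + off, j + off
                if off != 0 and 0 <= a < len(tokens) and 0 <= b < len(tokens):
                    clues.append(f"{tokens[a][0]}&{tokens[b][0]}")
            same = tokens[i][1] == tokens[j][1]
            cases.append((surface, i, j, tuple(clues), same))
    return cases

for case in pair_cases(TOKENS):
    print(case)
```

For the word "A", this yields the clue 株式会社&研究所 with a "not the same type" label for the pairs involving the first occurrence, and the clues 株式会社&株式会社 and 研究所&研究所 with a "same type" label for the second and third occurrences, matching the combinations named in the text.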

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群から、同一種類の固有表現かどうかの判別において尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を特定する。また、学習エンジンは、類似する学習事例が存在しないために、閾値以上の精度で単語の組が同一種類か否かを判別可能な学習事例を特定してもよい。抽出装置は、例えば、単語の組が同一種類であることを示す情報と、共起単語「株式会社」と「株式会社」の組み合わせを関連付けた判別規則を生成する。そして、抽出装置１００は、生成した判別規則を含む同一表記・種類単語判別規則３００を生成する。 (2) Then, using the learning engine, the extraction apparatus 100 identifies, from the generated learning case group, learning cases that are plausible for determining whether words are the same type of specific expression. For example, the learning engine identifies learning cases whose appearance frequency in the learning case group is equal to or greater than a threshold. Alternatively, the learning engine may identify a learning case that, because no similar learning case exists, can determine with accuracy equal to or greater than a threshold whether a word pair is of the same type. The extraction apparatus 100 generates, for example, a discrimination rule that associates information indicating that a word pair is of the same type with the combination of the co-occurring words “corporation” and “corporation”. Then, the extraction apparatus 100 generates the same notation / type word discrimination rule 300 including the generated discrimination rule.
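
The frequency-threshold selection mentioned above might look like the following Python sketch. The function name `learn_rules`, the `min_count` parameter, the sample cases, and the tie-handling choice (keep the more frequent label when a clue appears with both labels) are assumptions made for illustration; the text does not specify them.

```python
from collections import Counter

def learn_rules(cases, min_count=2):
    """Keep clue -> label associations seen at least min_count times.

    cases is an iterable of (clue, same_type) learning cases.
    """
    counts = Counter(cases)
    rules = {}
    for (clue, label), n in counts.items():
        if n < min_count:
            continue
        # If a clue passes the threshold with both labels, keep the more frequent one.
        if clue not in rules or n > counts[(clue, rules[clue])]:
            rules[clue] = label
    return rules

# Hypothetical learning cases: (clue, pair is of the same type?)
CASES = [
    ("株式会社&株式会社", True), ("株式会社&株式会社", True),
    ("研究所&研究所", True),
    ("株式会社&研究所", False), ("株式会社&研究所", False),
]

rules = learn_rules(CASES, min_count=2)
print(rules)
```

With a threshold of 2, the clue that occurs only once (研究所&研究所) is dropped, while the other two become discrimination rules.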

図１８において、抽出装置１００は、学習データ群１６００の中から未処理の学習データを選択する。次に、抽出装置１００は、選択した学習データからタグを除外した対象データを生成する。ここでは、抽出装置１００は、学習データ１６１０を選択し、学習データ１６１０から対象データ１８００を生成したとする。 In FIG. 18, the extraction device 100 selects unprocessed learning data from the learning data group 1600. Next, the extraction apparatus 100 generates target data from which the tags are excluded from the selected learning data. Here, it is assumed that the extraction apparatus 100 selects the learning data 1610 and generates the target data 1800 from the learning data 1610.

（１）次に、抽出装置１００は、対象データ１８００の中から、同一表記の単語の組として、「Ａ」１８０１と「Ａ」１８０２の組と、「Ａ」１８０１と「Ａ」１８０３の組と、「Ａ」１８０２と「Ａ」１８０３の組と、を抽出する。 (1) Next, from the target data 1800, the extraction apparatus 100 extracts, as pairs of words having the same notation, the pair “A” 1801 and “A” 1802, the pair “A” 1801 and “A” 1803, and the pair “A” 1802 and “A” 1803.

そして、抽出装置１００は、「Ａ」１８０１と「Ａ」１８０２の組について、「Ａ」１８０１と「Ａ」１８０２の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などを手がかりとして特定する。また、抽出装置１００は、「Ａ」１８０１と「Ａ」１８０３の組について、「Ａ」１８０１と「Ａ」１８０３の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などを手がかりとして特定する。また、抽出装置１００は、「Ａ」１８０２と「Ａ」１８０３の組について、「Ａ」１８０２と「Ａ」１８０３の各々の周辺にある単語の組み合わせ「研究所＆研究所」、「株式会社＆株式会社」などを手がかりとして特定する。 For the pair “A” 1801 and “A” 1802, the extraction apparatus 100 identifies as clues combinations of the words surrounding each of “A” 1801 and “A” 1802, such as “corporation & laboratory”. For the pair “A” 1801 and “A” 1803, it likewise identifies combinations of the surrounding words, such as “corporation & laboratory”. For the pair “A” 1802 and “A” 1803, it identifies combinations of the surrounding words, such as “laboratory & laboratory” and “corporation & corporation”.

（２）そして、抽出装置１００は、判別エンジンを用いて、生成された手がかりに該当する同一表記・種類単語判別規則３００の判別規則を特定する。判別エンジンは、該当する判別規則が複数ある場合、例えば、学習事例上での判別精度などを基に、尤もらしい判別規則を特定する。 (2) Then, using the discrimination engine, the extraction device 100 identifies the discrimination rule of the same notation / type word discrimination rule 300 corresponding to the generated clue. When there are a plurality of applicable discrimination rules, the discrimination engine specifies a plausible discrimination rule based on, for example, the discrimination accuracy on the learning example.

抽出装置１００は、判別エンジンによって特定された判別規則により、「Ａ」１８０１と「Ａ」１８０２の組と「Ａ」１８０１と「Ａ」１８０３の組と「Ａ」１８０２と「Ａ」１８０３の組との各々が同一種類か否かを判別する。ここでは、抽出装置１００は、同一表記・種類単語判別規則３００と、共起単語「研究所」と「研究所」の組み合わせと、「株式会社」と「株式会社」の組み合わせとから、「Ａ」１８０２と「Ａ」１８０３の組が同一種類であると判別する。 The extraction apparatus 100 determines a set of “A” 1801 and “A” 1802, a set of “A” 1801 and “A” 1803, a set of “A” 1802 and “A” 1803 according to the discrimination rule specified by the discrimination engine. And whether each is the same type. In this case, the extraction apparatus 100 uses the same notation / type word discrimination rule 300, the combination of the co-occurrence words “laboratory” and “laboratory”, and the combination of “corporation” and “corporation”. "1802" and "A" 1803 are determined to be of the same type.
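
A hedged Python sketch of this discrimination step is shown below. It assumes each discrimination rule stores a label together with its accuracy on the learning cases, and that the "plausible" rule is simply the matching rule with the highest accuracy; the function name `same_type`, the rule table, and the accuracy values are invented for illustration only.

```python
def same_type(clues, rules):
    """Decide whether a same-notation pair is of the same type.

    rules maps a clue to (label, accuracy on the learning cases). When several
    rules match, the most accurate one is used; None means no rule applies.
    """
    matched = [rules[c] for c in clues if c in rules]
    if not matched:
        return None
    label, _accuracy = max(matched, key=lambda rule: rule[1])
    return label

# Hypothetical rule table; the accuracy figures are placeholders.
RULES = {
    "研究所&研究所": (True, 0.90),
    "株式会社&株式会社": (True, 0.80),
    "株式会社&研究所": (False, 0.95),
}

print(same_type(["株式会社&株式会社", "研究所&研究所"], RULES))  # pair 1802 / 1803
print(same_type(["株式会社&研究所"], RULES))                      # pair 1801 / 1802
```

Returning `None` when no rule matches leaves the decision to the caller, which is one possible design; the text does not say how unmatched pairs are treated.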

図１９において、抽出装置１００は、選択した学習データ１６１０の中から、図１８で同一種類と判別された「Ａ」１８０２と「Ａ」１８０３の組に対応する「Ａ」１６１４と「Ａ」１６１７を特定する。抽出装置１００は、以降の処理で、特定された「Ａ」１６１４と「Ａ」１６１７を一纏めにして扱う。 In FIG. 19, the extraction apparatus 100 identifies, in the selected learning data 1610, “A” 1614 and “A” 1617, which correspond to the pair “A” 1802 and “A” 1803 determined to be of the same type in FIG. 18. In the subsequent processing, the extraction apparatus 100 handles the identified “A” 1614 and “A” 1617 as a single group.

（１）抽出装置１００は、学習データ１６１０の中の単語ごとに、または一纏めにされた単語ごとに、タグを参照して単語の種類を特定し、特定した単語の種類を抽出するための手がかりを特定する。抽出装置１００は、具体的には、例えば、一纏めにされた「Ａ」１６１４と「Ａ」１６１７の周辺にある単語から、「Ｉ−ＯＲＧ」を抽出するための手がかりになる「株式会社：−１」および「研究所：＋１」を特定する。そして、抽出装置１００は、特定した単語が手がかりである、単語の種類が「Ｉ−ＯＲＧ」であることを判別するための学習事例を生成する。 (1) For each word in the learning data 1610, or for each group of words, the extraction apparatus 100 refers to the tag to identify the type of the word and identifies clues for extracting that type. Specifically, for example, from the words surrounding the grouped “A” 1614 and “A” 1617, the extraction apparatus 100 identifies “corporation: −1” and “laboratory: +1” as clues for extracting “I-ORG”. The extraction apparatus 100 then generates a learning case for determining that the word type is “I-ORG”, with the identified words as clues.

また、抽出装置１００は、同一種類の他の単語がない「Ａ」１６１１の周辺にある単語から、「Ｂ−ＯＲＧ」を抽出するための手がかりになる「株式会社：＋１」を特定する。そして、抽出装置１００は、特定した単語が手がかりである、単語の種類が「Ｂ−ＯＲＧ」であることを判別するための学習事例を生成する。 Further, the extraction apparatus 100 identifies “corporation: +1” that is a clue for extracting “B-ORG” from words around “A” 1611 that does not have other words of the same type. Then, the extraction apparatus 100 generates a learning case for determining that the identified word is a clue and the type of the word is “B-ORG”.
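
The position-annotated clues above ("corporation: −1", "laboratory: +1", and so on) can be illustrated with a short Python sketch. The word segmentation in `WORDS`, the one-word window, and the function name `offset_clues` are assumptions; the sketch only shows how a group of same-type words can share one clue set.

```python
def offset_clues(words, group, window=1):
    """Collect 'word:offset' clues around every position in a group of words.

    Clues found around several members of the group are listed once.
    """
    clues = []
    for i in group:
        for off in range(-window, window + 1):
            j = i + off
            if off != 0 and 0 <= j < len(words):
                clue = f"{words[j]}:{off:+d}"
                if clue not in clues:
                    clues.append(clue)
    return clues

# Assumed word segmentation of learning data 1610 (tags omitted).
WORDS = ["A", "株式会社", "は", "株式会社", "A", "研究所", "の", "子会社",
         "で", "あり", "、", "株式会社", "A", "研究所", "は", "川崎市",
         "に", "ある", "。"]

grouped = offset_clues(WORDS, [4, 12])  # "A" 1614 and "A" 1617 handled as one group
single = offset_clues(WORDS, [0])       # "A" 1611 has no same-type partner
print(grouped, single)
```

Because both members of the group sit between 株式会社 and 研究所, the group yields exactly the two clues named in the text, while the first occurrence yields only 株式会社:+1.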

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群のうち、尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を、尤もらしい学習事例として特定する。抽出装置１００は、学習エンジンによって特定された学習事例から抽出用規則を生成する。抽出装置１００は、例えば、単語の種類「Ｉ−ＯＲＧ」と、「株式会社：−１」とを関連付けた抽出用規則を生成する。そして、抽出装置１００は、生成した抽出用規則を含む固有表現抽出用規則４００を生成する。 (2) Then, using the learning engine, the extraction device 100 identifies a likely learning case among the generated learning case group. For example, the learning engine identifies a learning case whose appearance frequency is greater than or equal to a threshold in the learning case group as a likely learning case. The extraction device 100 generates an extraction rule from the learning case specified by the learning engine. The extraction apparatus 100 generates, for example, an extraction rule that associates the word type “I-ORG” with “corporation: −1”. Then, the extraction apparatus 100 generates a specific expression extraction rule 400 including the generated extraction rule.

これにより、抽出装置１００は、学習データ群１６００を用いて、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を機械学習により生成することができる。そのため、抽出装置１００の利用者は、同一表記・種類単語判別規則３００および固有表現抽出用規則４００を生成する手間を削減することができる。ここでは、閾値を基にした規則学習手法を用いたが、他の学習手法としては、非特許文献３，４，５などを用いることもできる。学習手法として、非特許文献３を用いた場合は、各規則のスコアの和、非特許文献４，５はモデル（規則に相当）を用いて計算されるスコアの和を基に、同一表記で同一の固有表現になる単語、各単語の固有表現の種類を決める。 Accordingly, the extraction apparatus 100 can generate the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 by machine learning using the learning data group 1600. The user of the extraction apparatus 100 is therefore spared the effort of creating the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 by hand. Although a threshold-based rule learning method is used here, the methods of Non-Patent Documents 3, 4, and 5 can also be used as other learning methods. When Non-Patent Document 3 is used, the words that share a notation and are the same specific expression, and the specific expression type of each word, are decided based on the sum of the scores of the rules; when Non-Patent Document 4 or 5 is used, they are decided based on the sum of scores calculated using a model (corresponding to the rules).

（実施例１にかかる抽出装置１００による固有表現抽出処理の具体例２） (Specific example 2 of specific expression extraction processing by the extraction apparatus 100 according to the first embodiment)
次に、図２０および図２１を用いて、実施例１にかかる抽出装置１００による固有表現抽出処理の具体例２について説明する。 Next, a specific example 2 of the specific expression extraction process performed by the extraction apparatus 100 according to the first embodiment will be described with reference to FIGS. 20 and 21.

図２０および図２１は、実施例１にかかる抽出装置１００による固有表現抽出処理の具体例２を示す説明図である。図２０において、抽出装置１００は、固有表現の抽出対象になる入力データ２０００を受け付ける。 20 and 21 are explanatory diagrams of a specific example 2 of the specific expression extraction process performed by the extraction apparatus 100 according to the first embodiment. In FIG. 20, the extraction apparatus 100 receives input data 2000 that is a target for extraction of a specific expression.

（１）まず、抽出装置１００は、入力データ２０００の中から、同一表記の単語の組として、「Ｂ」２００１と「Ｂ」２００２の組と、「Ｂ」２００１と「Ｂ」２００３の組と、「Ｂ」２００２と「Ｂ」２００３の組と、を抽出する。 (1) First, from the input data 2000, the extraction apparatus 100 extracts, as pairs of words having the same notation, the pair “B” 2001 and “B” 2002, the pair “B” 2001 and “B” 2003, and the pair “B” 2002 and “B” 2003.

そして、抽出装置１００は、「Ｂ」２００１と「Ｂ」２００２の組について、「Ｂ」２００１と「Ｂ」２００２の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などを手がかりとして特定する。また、抽出装置１００は、「Ｂ」２００１と「Ｂ」２００３の組について、「Ｂ」２００１と「Ｂ」２００３の各々の周辺にある単語の組み合わせ「株式会社＆研究所」などを手がかりとして特定する。また、抽出装置１００は、「Ｂ」２００２と「Ｂ」２００３の組について、「Ｂ」２００２と「Ｂ」２００３の各々の周辺にある単語の組み合わせ「研究所＆研究所」、「株式会社＆株式会社」などを手がかりとして特定する。 For the pair “B” 2001 and “B” 2002, the extraction apparatus 100 identifies as clues combinations of the words surrounding each of “B” 2001 and “B” 2002, such as “corporation & laboratory”. For the pair “B” 2001 and “B” 2003, it likewise identifies combinations of the surrounding words, such as “corporation & laboratory”. For the pair “B” 2002 and “B” 2003, it identifies combinations of the surrounding words, such as “laboratory & laboratory” and “corporation & corporation”.

（２）そして、抽出装置１００は、判別エンジンを用いて、特定された手がかりに該当する同一表記・種類単語判別規則３００の判別規則を特定する。判別エンジンは、該当する判別規則が複数ある場合、尤もらしい判別規則を特定する。抽出装置１００は、判別エンジンによって特定された判別規則により、「Ｂ」２００１と「Ｂ」２００２の組と「Ｂ」２００１と「Ｂ」２００３の組と「Ｂ」２００２と「Ｂ」２００３の組との各々が同一種類か否かを判別する。 (2) Then, the extraction apparatus 100 specifies the discrimination rule of the same notation / type word discrimination rule 300 corresponding to the identified clue using the discrimination engine. When there are a plurality of applicable discrimination rules, the discrimination engine specifies a plausible discrimination rule. The extraction apparatus 100 uses a discrimination rule specified by the discrimination engine to set “B” 2001 and “B” 2002, “B” 2001 and “B” 2003, and “B” 2002 and “B” 2003. And whether each is the same type.

ここでは、抽出装置１００は、同一表記・種類単語判別規則３００と、共起単語「研究所」と「研究所」の組み合わせと、「株式会社」と「株式会社」の組み合わせとから、「Ｂ」２００２と「Ｂ」２００３の組が同一種類であると判別する。 In this case, the extraction apparatus 100 uses the same notation / type word discrimination rule 300, the combination of the co-occurrence words “laboratory” and “laboratory”, and the combination of “corporation” and “corporation”. "2002" and "B" 2003 are determined to be of the same type.

図２１において、抽出装置１００は、以降の処理で、図２０で同一種類と判別された「Ｂ」２００２と「Ｂ」２００３を一纏めにして扱う。 In FIG. 21, the extraction apparatus 100 collectively handles “B” 2002 and “B” 2003 determined to be the same type in FIG. 20 in the subsequent processing.
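
One way to turn pairwise "same type" decisions into the groups that are handled together is a small union-find, sketched below in Python. This is an assumption made for illustration: the text does not state how groups are formed from pairwise decisions, and in particular treating "same type" as transitive is a choice of this sketch. The reference numerals 2001 to 2003 are used as position identifiers, and `group_positions` is a hypothetical name.

```python
def group_positions(positions, same_pairs):
    """Merge positions judged to be of the same type into groups (union-find)."""
    parent = {p: p for p in positions}

    def find(p):
        # Follow parent links to the representative, halving the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in same_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for p in positions:
        groups.setdefault(find(p), []).append(p)
    return sorted(groups.values())

# "B" 2002 and "B" 2003 were judged to be of the same type; "B" 2001 was not.
groups = group_positions([2001, 2002, 2003], [(2002, 2003)])
print(groups)
```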

（１）抽出装置１００は、入力データ２０００の中の単語ごとに、または一纏めにされた単語ごとに、単語の種類を抽出するための手がかりを特定する。抽出装置１００は、具体的には、例えば、一纏めにされた「Ｂ」２００２と「Ｂ」２００３の周辺にある単語から、手がかりとして「株式会社：−１」および「研究所：＋１」を特定する。また、抽出装置１００は、同一種類の他の単語がない「Ｂ」２００１の周辺にある単語から、手がかりとして「株式会社：＋１」を特定する。 (1) The extraction apparatus 100 specifies a clue for extracting the type of word for each word in the input data 2000 or for each grouped word. Specifically, the extraction apparatus 100 specifies, for example, “corporation: −1” and “laboratory: +1” as clues from the words around “B” 2002 and “B” 2003 that are grouped together. To do. In addition, the extraction apparatus 100 identifies “corporation: +1” as a clue from words in the vicinity of “B” 2001 that has no other words of the same type.

（２）次に、抽出装置１００は、判別エンジンを用いて、特定された手がかりに該当する固有表現抽出用規則４００の抽出用規則を特定する。判別エンジンは、該当する抽出用規則が複数ある場合、尤もらしい抽出用規則を特定する。抽出装置１００は、判別エンジンによって特定された抽出用規則により、入力データ２０００の中の単語ごとに、または一纏めにされた単語ごとに、単語の種類を抽出する。 (2) Next, the extraction apparatus 100 specifies the extraction rule of the specific expression extraction rule 400 corresponding to the specified clue using the discrimination engine. When there are a plurality of corresponding extraction rules, the discrimination engine specifies a plausible extraction rule. The extraction device 100 extracts word types for each word in the input data 2000 or for each word grouped according to the extraction rules specified by the discrimination engine.

ここでは、抽出装置１００は、「Ｂ」２００２と「Ｂ」２００３の種類として「Ｉ−ＯＲＧ」を抽出し、「Ｂ」２００１の種類として「Ｂ−ＯＲＧ」を抽出する。なお、抽出装置１００は、他の同一表記の単語「株式会社」や「研究所」についても、同様に、単語の種類を抽出する。 Here, the extraction apparatus 100 extracts “I-ORG” as the type of “B” 2002 and “B” 2003 and extracts “B-ORG” as the type of “B” 2001. Note that the extraction apparatus 100 similarly extracts word types for the other words “corporation” and “laboratory” having the same notation.

（実施例１にかかる抽出装置１００による固有表現抽出結果の出力例２） (Output example 2 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment)
次に、図２２を用いて、実施例１にかかる抽出装置１００による固有表現抽出結果の出力例２について説明する。抽出装置１００は、図２０および図２１での固有表現抽出結果を出力する。 Next, an output example 2 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment will be described with reference to FIG. 22. The extraction apparatus 100 outputs the specific expression extraction results of FIGS. 20 and 21.

図２２は、実施例１にかかる抽出装置１００による固有表現抽出結果の出力例２を示す説明図である。図２２に示すように、抽出装置１００は、例えば、抽出した単語の種類をタグとして付与した入力データ２０００「＜Ｂ−ＯＲＧ＞Ｂ＜／Ｂ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞株式会社＜／Ｅ−ＯＲＧ＞は＜Ｂ−ＯＲＧ＞株式会社＜／Ｂ−ＯＲＧ＞＜Ｉ−ＯＲＧ＞Ｂ＜／Ｉ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞研究所＜／Ｅ−ＯＲＧ＞の子会社であり、＜Ｂ−ＯＲＧ＞株式会社＜／Ｂ−ＯＲＧ＞＜Ｉ−ＯＲＧ＞Ｂ＜／Ｉ−ＯＲＧ＞＜Ｅ−ＯＲＧ＞研究所＜／Ｅ−ＯＲＧ＞は川崎市にある。」を出力する。 FIG. 22 is an explanatory diagram of an output example 2 of the specific expression extraction result by the extraction apparatus 100 according to the first embodiment. As illustrated in FIG. 22, the extraction apparatus 100 outputs, for example, the input data 2000 with the extracted word types attached as tags: “<B-ORG>B</B-ORG><E-ORG>Corporation</E-ORG> is a subsidiary of <B-ORG>Corporation</B-ORG><I-ORG>B</I-ORG><E-ORG>Laboratory</E-ORG>, and <B-ORG>Corporation</B-ORG><I-ORG>B</I-ORG><E-ORG>Laboratory</E-ORG> is located in Kawasaki City.”

他にも、Ｂ−ＯＲＧ、Ｉ−ＯＲＧ，Ｅ−ＯＲＧのように複数の単語で一つの固有表現となる場合を表現するタグが付与されている場合は、「＜ＯＲＧ＞Ｂ株式会社＜／ＯＲＧ＞は＜ＯＲＧ＞株式会社Ｂ研究所＜／ＯＲＧ＞の子会社であり、＜ＯＲＧ＞株式会社Ｂ研究所＜／ＯＲＧ＞は川崎市にある。」のように一つのタグとして出力することも可能である。 Alternatively, when tags such as B-ORG, I-ORG, and E-ORG, which express a single specific expression made up of multiple words, have been attached, the words can be output under a single tag, as in “<ORG>B Corporation</ORG> is a subsidiary of <ORG>Corporation B Laboratory</ORG>, and <ORG>Corporation B Laboratory</ORG> is located in Kawasaki City.”
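
Collapsing the B-/I-/E- tags into a single tag per specific expression could be done along the lines of the following Python sketch. The function name `merge_bie`, the ASCII tag syntax, and the way the input is segmented are assumptions for illustration; the sketch also assumes well-formed runs that start with B- and end with E-.

```python
def merge_bie(tokens):
    """Render (surface, tag) tokens as text, merging B-/I-/E- runs into one tag."""
    out, buf = [], []
    for surface, tag in tokens:
        if tag is None:                      # untagged text is copied as-is
            out.append(surface)
            continue
        prefix, _, name = tag.partition("-")
        if name == "":                       # plain tag such as LOC or PER
            out.append(f"<{tag}>{surface}</{tag}>")
            continue
        if prefix == "B":                    # a B- tag starts a new run
            buf = []
        buf.append(surface)
        if prefix == "E":                    # an E- tag closes the run
            text = "".join(buf)
            out.append(f"<{name}>{text}</{name}>")
            buf = []
    out.append("".join(buf))                 # flush an unterminated run untagged
    return "".join(out)

# Tagged output of FIG. 22 (untagged stretches kept as single pieces).
TAGGED = [
    ("B", "B-ORG"), ("株式会社", "E-ORG"), ("は", None),
    ("株式会社", "B-ORG"), ("B", "I-ORG"), ("研究所", "E-ORG"),
    ("の子会社であり、", None),
    ("株式会社", "B-ORG"), ("B", "I-ORG"), ("研究所", "E-ORG"),
    ("は川崎市にある。", None),
]

merged = merge_bie(TAGGED)
print(merged)
```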

（実施例２） (Embodiment 2)
実施例１は、単語単位で分割された一連の単語の中から固有表現の単語を抽出する実施例であった。対して、実施例２は、図５を用いて説明したチャンク単位で分割されたチャンクのラティスの中から固有表現のチャンクを抽出する場合の実施例である。 Embodiment 1 is an embodiment in which words that are specific expressions are extracted from a series of words divided in units of words. In contrast, Embodiment 2 is an embodiment in which chunks that are specific expressions are extracted from a lattice of chunks divided in units of chunks, as described with reference to FIG. 5.

（実施例２にかかる抽出装置１００による規則学習処理の内容） (Contents of the rule learning process by the extraction apparatus 100 according to the second embodiment)
次に、図２３〜図２６を用いて、実施例２にかかる抽出装置１００による規則学習処理の内容について説明する。規則学習処理は、学習データ群の各々を変換して得たチャンクのラティスを用いて、同一表記・種類単語判別規則３００と固有表現抽出用規則４００とを生成する処理である。 Next, the contents of the rule learning process performed by the extraction apparatus 100 according to the second embodiment will be described with reference to FIGS. 23 to 26. The rule learning process is a process of generating the same notation / type word discrimination rule 300 and the specific expression extraction rule 400 using chunk lattices obtained by converting each item of learning data in the learning data group.

図２３は、実施例２にかかる抽出装置１００による規則学習処理に用いられるチャンクのラティスの一例を示す説明図である。図２３に示すように、抽出装置１００は、学習データ「＜ＯＲＧ＞Ｃ社＜／ＯＲＧ＞の社員は＜ＬＯＣ＞Ｃ社＜／ＬＯＣ＞へ行き＜ＬＯＣ＞Ｃ社＜／ＬＯＣ＞から帰る。」からチャンクのラティス２３００を生成する。 FIG. 23 is an explanatory diagram of an example of a chunk lattice used in the rule learning process by the extraction apparatus 100 according to the second embodiment. As shown in FIG. 23, the extraction apparatus 100 generates a chunk lattice 2300 from the learning data “An employee of <ORG>Company C</ORG> goes to <LOC>Company C</LOC> and returns from <LOC>Company C</LOC>.”

抽出装置１００は、具体的には、例えば、学習データ「＜ＯＲＧ＞Ｃ社＜／ＯＲＧ＞の社員は＜ＬＯＣ＞Ｃ社＜／ＬＯＣ＞へ行き＜ＬＯＣ＞Ｃ社＜／ＬＯＣ＞から帰る。」を形態素解析し、単語ごとに区切る。単語ごとに区切られた学習データは、例えば、「Ｃ社の社員はＣ社へ行きＣ社から帰る。」である。 Specifically, for example, the extraction apparatus 100 performs morphological analysis on the learning data “An employee of <ORG>Company C</ORG> goes to <LOC>Company C</LOC> and returns from <LOC>Company C</LOC>.” and divides it into words. The learning data divided into words is, for example, “C社の社員はC社へ行きC社から帰る。” (“An employee of Company C goes to Company C and returns from Company C.”).

抽出装置１００は、単語ごとに区切られた学習データの中から、２単語ずつ組み合わせたチャンクを生成する。例えば、抽出装置１００は、「Ｃ社」「社の」「の社員」「社員は」などのチャンクを生成する。そして、抽出装置１００は、生成したチャンクから、チャンクのラティスを生成する。そして、抽出装置１００は、複数通りのチャンクのラティスの中から、タグが付与されたチャンク「Ｃ社」を含むチャンクのラティス中のパスを、正しく判別するように規則学習する。 From the learning data divided into words, the extraction apparatus 100 generates chunks that each combine two words. For example, the extraction apparatus 100 generates chunks such as “C社” (“Company C”), “社の”, “の社員”, and “社員は”. The extraction apparatus 100 then generates a chunk lattice from the generated chunks, and learns rules so as to correctly identify, from among the multiple possible chunk lattices, the path in the chunk lattice that contains the tagged chunk “C社”.
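
To make the chunk lattice more concrete, the following Python sketch enumerates the chunks over a short word sequence and the paths (segmentations) through the resulting lattice. It assumes that the lattice contains single-word chunks as well as two-word chunks, and that a path is any sequence of adjacent chunks covering the whole input; the function names `make_chunks` and `lattice_paths` and the word segmentation are illustrative assumptions.

```python
def make_chunks(words, max_len=2):
    """List every chunk as (start, end, surface), up to max_len words long."""
    return [(s, s + n, "".join(words[s:s + n]))
            for s in range(len(words))
            for n in range(1, max_len + 1)
            if s + n <= len(words)]

def lattice_paths(words, max_len=2, start=0):
    """Enumerate all paths through the lattice, i.e. all ways to cover the words."""
    if start == len(words):
        return [[]]
    paths = []
    for n in range(1, max_len + 1):
        if start + n <= len(words):
            head = "".join(words[start:start + n])
            for rest in lattice_paths(words, max_len, start + n):
                paths.append([head] + rest)
    return paths

# First four words of the example sentence (assumed segmentation).
WORDS = ["C", "社", "の", "社員"]

chunks = make_chunks(WORDS)
paths = lattice_paths(WORDS)
for path in paths:
    print(path)
```

Among the enumerated paths, the one beginning with the chunk C社 is the path that rule learning is supposed to prefer, since C社 is the tagged chunk. The number of paths grows quickly with sentence length, so a practical implementation would likely search the lattice with dynamic programming rather than enumerate every path.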

図２４〜図２６は、図２３で生成されたチャンクのラティスを用いて同一表記・種類単語判別規則３００と固有表現抽出用規則４００とを生成する内容を示す説明図である。図２４において、まず、抽出装置１００は、学習データ群の中から、未処理の学習データを選択する。そして、抽出装置１００は、図２３に示したように、選択した学習データからチャンクのラティスを生成する。ここでは、抽出装置１００は、チャンクのラティス２３００を生成したとする。 24 to 26 are explanatory diagrams showing the contents of generating the same notation / type word discrimination rule 300 and the unique expression extraction rule 400 using the chunk lattice generated in FIG. In FIG. 24, the extraction apparatus 100 first selects unprocessed learning data from the learning data group. Then, as illustrated in FIG. 23, the extraction apparatus 100 generates a chunk lattice from the selected learning data. Here, it is assumed that the extraction apparatus 100 has generated a chunk lattice 2300.

（１）抽出装置１００は、選択したチャンクのラティス２３００の中から、同一表記のチャンクの組として、「Ｃ社」２３０１と「Ｃ社」２３０２の組と、「Ｃ社」２３０１と「Ｃ社」２３０３の組と、「Ｃ社」２３０２と「Ｃ社」２３０３の組と、を抽出する。また、抽出装置１００は、例えば、同一表記のチャンクの組として、「Ｃ」の組や「社」の組を抽出してもよい。 (1) From the selected chunk lattice 2300, the extraction apparatus 100 extracts, as pairs of chunks having the same notation, the pair “Company C” 2301 and “Company C” 2302, the pair “Company C” 2301 and “Company C” 2303, and the pair “Company C” 2302 and “Company C” 2303. The extraction apparatus 100 may also extract, for example, pairs of “C” or pairs of “社” (“company”) as pairs of chunks having the same notation.

そして、抽出装置１００は、「Ｃ社」２３０１と「Ｃ社」２３０２の組については、異なる種類であるため、「Ｃ社」２３０１と「Ｃ社」２３０２の各々の周辺にある単語の組み合わせ「の＆へ」、「社員＆行く」などを、単語の組が異なる種類であることを示す手がかりとして生成し、同一ではないという学習事例を生成する。 Because “Company C” 2301 and “Company C” 2302 are of different types, the extraction apparatus 100 generates combinations of the words surrounding each of them, such as “の & へ” and “社員 & 行く” (“employee & go”), as clues indicating that the pair is of different types, and generates a learning case indicating that the pair is not of the same type.

また、抽出装置１００は、「Ｃ社」２３０１と「Ｃ社」２３０３の組については、異なる種類であるため、「Ｃ社」２３０１と「Ｃ社」２３０３の各々の周辺にある単語の組み合わせ「の＆から」、「社員＆帰る」などを、単語の組が異なる種類であることを示す手がかりとして生成し、同一でないという学習事例を生成する。 Likewise, because “Company C” 2301 and “Company C” 2303 are of different types, the extraction apparatus 100 generates combinations of the words surrounding each of them, such as “の & から” and “社員 & 帰る” (“employee & return”), as clues indicating that the pair is of different types, and generates a learning case indicating that the pair is not of the same type.

また、抽出装置１００は、「Ｃ社」２３０２と「Ｃ社」２３０３の組については、同一種類であるため、「Ｃ社」２３０２と「Ｃ社」２３０３の各々の周辺にある単語の組み合わせ「へ＆から」、「行き＆帰る」などを、単語の組が同一種類であることを示す手がかりとして生成し、同一であるという学習事例を生成する。 Because “Company C” 2302 and “Company C” 2303 are of the same type, the extraction apparatus 100 generates combinations of the words surrounding each of them, such as “へ & から” (“to & from”) and “行き & 帰る” (“go & return”), as clues indicating that the pair is of the same type, and generates a learning case indicating that the pair is of the same type.

抽出装置１００は、チャンクのラティス２３００から学習事例を生成した後、学習データ群の中に未処理の学習データが残っていれば、当該未処理の学習データをチャンクのラティスに変換し、変換後のチャンクのラティスからも学習事例を生成する。 After generating a learning example from the chunk lattice 2300, the extraction apparatus 100 converts the unprocessed learning data into a chunk lattice if there is unprocessed learning data in the learning data group. A learning example is also generated from the lattice of the chunk.

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群のうち、尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を、尤もらしい手がかりとして特定する。抽出装置１００は、学習エンジンによって特定された学習事例から判別規則を生成する。抽出装置１００は、例えば、チャンクの組が同一種類であることを示す情報と、共起単語「行き」と「帰る」の組み合わせを関連付けた判別規則を生成する。そして、抽出装置１００は、生成した判別規則を含む同一表記・種類単語判別規則３００を生成する。また、抽出装置１００は、固有表現のチャンクではない「Ｃ」、「社」または「社の」などについては、判別規則を生成しなくてもよい。 (2) Then, using the learning engine, the extraction apparatus 100 identifies plausible learning cases in the generated learning case group. For example, the learning engine identifies, as plausible clues, learning cases whose appearance frequency in the learning case group is equal to or greater than a threshold. The extraction apparatus 100 generates discrimination rules from the learning cases identified by the learning engine. For example, the extraction apparatus 100 generates a discrimination rule that associates information indicating that a chunk pair is of the same type with the combination of the co-occurring words “行き” (“go”) and “帰る” (“return”). Then, the extraction apparatus 100 generates the same notation / type word discrimination rule 300 including the generated discrimination rule. The extraction apparatus 100 need not generate discrimination rules for “C”, “社”, “社の”, and the like, which are not specific expression chunks.

図２５において、抽出装置１００は、学習データ群の中から、未処理の学習データを選択する。そして、抽出装置１００は、図２３に示したように、選択した学習データからチャンクのラティスを生成する。ここでは、抽出装置１００は、チャンクのラティス２３００を生成したとする。 In FIG. 25, the extraction apparatus 100 selects unprocessed learning data from the learning data group. Then, as illustrated in FIG. 23, the extraction apparatus 100 generates a chunk lattice from the selected learning data. Here, it is assumed that the extraction apparatus 100 has generated a chunk lattice 2300.

また、抽出装置１００は、チャンクのラティス２３００からタグを除外した対象ラティスを生成する。ここでは、抽出装置１００は、チャンクのラティス２３００から対象ラティス２５００を生成したとする。 Further, the extraction apparatus 100 generates a target lattice from which tags are excluded from the chunk lattice 2300. Here, it is assumed that the extraction apparatus 100 generates the target lattice 2500 from the chunk lattice 2300.

（１）次に、抽出装置１００は、対象ラティス２５００の中から、同一表記のチャンクの組として、「Ｃ社」２５０１と「Ｃ社」２５０２の組と、「Ｃ社」２５０１と「Ｃ社」２５０３の組と、「Ｃ社」２５０２と「Ｃ社」２５０３の組と、を抽出する。また、抽出装置１００は、例えば、同一表記のチャンクの組として、「Ｃ」の組や「社」の組を抽出してもよい。 (1) Next, from the target lattice 2500, the extraction apparatus 100 extracts, as pairs of chunks having the same notation, the pair “Company C” 2501 and “Company C” 2502, the pair “Company C” 2501 and “Company C” 2503, and the pair “Company C” 2502 and “Company C” 2503. The extraction apparatus 100 may also extract, for example, pairs of “C” or pairs of “社” (“company”) as pairs of chunks having the same notation.

そして、抽出装置１００は、「Ｃ社」２５０１と「Ｃ社」２５０２の組について、「Ｃ社」２５０１と「Ｃ社」２５０２の各々の周辺にある単語の組み合わせ「の＆へ」、「社員＆行く」などを手がかりとして特定する。また、抽出装置１００は、「Ｃ社」２５０１と「Ｃ社」２５０３の組について、「Ｃ社」２５０１と「Ｃ社」２５０３の各々の周辺にある単語の組み合わせ「の＆から」、「社員＆帰る」などを手がかりとして特定する。また、抽出装置１００は、「Ｃ社」２５０２と「Ｃ社」２５０３の組について、「Ｃ社」２５０２と「Ｃ社」２５０３の各々の周辺にある単語の組み合わせ「へ＆から」、「行き＆帰る」などを手がかりとして特定する。 For the pair “Company C” 2501 and “Company C” 2502, the extraction apparatus 100 identifies as clues combinations of the words surrounding each of them, such as “の & へ” and “社員 & 行く” (“employee & go”). For the pair “Company C” 2501 and “Company C” 2503, it identifies combinations such as “の & から” and “社員 & 帰る” (“employee & return”). For the pair “Company C” 2502 and “Company C” 2503, it identifies combinations such as “へ & から” (“to & from”) and “行き & 帰る” (“go & return”).

（２）そして、抽出装置１００は、判別エンジンを用いて、特定された手がかりに該当する同一表記・種類単語判別規則３００の判別規則を特定する。判別エンジンは、該当する判別規則が複数ある場合、尤もらしい判別規則を特定する。抽出装置１００は、判別エンジンによって特定された判別規則により、「Ｃ社」２５０１と「Ｃ社」２５０２の組と「Ｃ社」２５０１と「Ｃ社」２５０３の組と「Ｃ社」２５０２と「Ｃ社」２５０３の組との各々が同一種類か否かを判別する。ここでは、抽出装置１００は、同一表記・種類単語判別規則３００と、共起単語「行き」と「帰る」の組み合わせとから、「Ｃ社」２５０２と「Ｃ社」２５０３の組が同一種類であると判別する。 (2) Then, the extraction apparatus 100 specifies the discrimination rule of the same notation / type word discrimination rule 300 corresponding to the identified clue using the discrimination engine. When there are a plurality of applicable discrimination rules, the discrimination engine specifies a plausible discrimination rule. The extraction apparatus 100 determines the combination of “Company C” 2501 and “Company C” 2502, the combination of “Company C” 2501 and “Company C” 2503, “Company C” 2502, and “ It is determined whether or not each of the group of “Company C” 2503 is the same type. Here, the extraction apparatus 100 determines that the combination of “Company C” 2502 and “Company C” 2503 is of the same type based on the same notation / type word discrimination rule 300 and the combination of the co-occurrence words “go” and “return”. Determine that there is.

図２６において、抽出装置１００は、チャンクのラティス２３００の中から、図２５で同一種類と判別された「Ｃ社」２５０２と「Ｃ社」２５０３の組に対応する「Ｃ社」２３０２と「Ｃ社」２３０３を特定する。抽出装置１００は、以降の処理で、特定された「Ｃ社」２３０２と「Ｃ社」２３０３を一纏めにして扱う。 In FIG. 26, the extraction apparatus 100 identifies, in the chunk lattice 2300, “Company C” 2302 and “Company C” 2303, which correspond to the pair “Company C” 2502 and “Company C” 2503 determined to be of the same type in FIG. 25. In the subsequent processing, the extraction apparatus 100 handles the identified “Company C” 2302 and “Company C” 2303 as a single group.

（１）抽出装置１００は、チャンクのラティス２３００の中のチャンクごとに、または一纏めにされたチャンクごとに、タグを参照してチャンクの種類を特定し、特定したチャンクの種類を抽出するための手がかりを特定する。抽出装置１００は、具体的には、例えば、一纏めにされた「Ｃ社」２３０２と「Ｃ社」２３０３の周辺にあるチャンクから、「ＬＯＣ」を抽出するための手がかりになる「社員：−２」、「は：−１」、「へ：＋１」、「行き：＋２」、「へ：−２」、「行き：−１」、「から：＋１」、および「帰る：＋２」を特定する。そして、抽出装置１００は、特定した「社員：−２」、「は：−１」、「へ：＋１」、「行き：＋２」、「へ：−２」、「行き：−１」、「から：＋１」、および「帰る：＋２」が手がかりである、チャンクの種類が「ＬＯＣ」であることを判別するための学習事例を生成する。 (1) For each chunk in the chunk lattice 2300, or for each group of chunks, the extraction apparatus 100 refers to the tag to identify the type of the chunk and identifies clues for extracting that type. Specifically, for example, from the chunks surrounding the grouped “Company C” 2302 and “Company C” 2303, the extraction apparatus 100 identifies “社員: −2”, “は: −1”, “へ: +1”, “行き: +2”, “へ: −2”, “行き: −1”, “から: +1”, and “帰る: +2” as clues for extracting “LOC”. The extraction apparatus 100 then generates a learning case for determining that the chunk type is “LOC”, with these eight identified clues.

また、抽出装置１００は、同一種類の他のチャンクがない「Ｃ社」２３０１の周辺にあるチャンクから、「ＯＲＧ」を抽出するための手がかりになる「の：＋１」、および「社員：＋２」を特定する。そして、抽出装置１００は、特定した「の：＋１」、および「社員：＋２」が手がかりである、チャンクの種類が「ＯＲＧ」であることを判別するための学習事例を生成する。 From the chunks surrounding “Company C” 2301, which has no other chunk of the same type, the extraction apparatus 100 identifies “の: +1” and “社員: +2” as clues for extracting “ORG”. The extraction apparatus 100 then generates a learning case for determining that the chunk type is “ORG”, with the identified “の: +1” and “社員: +2” as clues.
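
The chunk-level learning cases can be reproduced with a short Python sketch. It assumes that offsets are counted in chunk positions along one path of the lattice, that the window is two chunks wide, and that a group's type is read from the tag of its first member; the path in `PATH`, the grouping, and the function name `type_cases` are illustrative assumptions. The tags follow the learning data of FIG. 23, in which the first C社 is tagged ORG and the other two are tagged LOC.

```python
def type_cases(path, groups, window=2):
    """Build (type, clues) learning cases for each chunk group on a lattice path.

    path is a list of (surface, tag); groups is a list of lists of positions.
    """
    cases = []
    for group in groups:
        clues = []
        for i in group:
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(path):
                    clues.append(f"{path[j][0]}:{off:+d}")
        chunk_type = path[group[0]][1]   # type taken from the group's first member
        cases.append((chunk_type, clues))
    return cases

# One path through chunk lattice 2300 (assumed segmentation).
PATH = [("C社", "ORG"), ("の", None), ("社員", None), ("は", None),
        ("C社", "LOC"), ("へ", None), ("行き", None),
        ("C社", "LOC"), ("から", None), ("帰る", None), ("。", None)]

cases = type_cases(PATH, [[0], [4, 7]])
for chunk_type, clues in cases:
    print(chunk_type, clues)
```

The grouped second and third C社 chunks produce the eight clues from 社員:-2 through 帰る:+2, and the ungrouped first chunk produces の:+1 and 社員:+2, in the same order as listed in the text.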

（２）そして、抽出装置１００は、学習エンジンを用いて、生成された学習事例群のうち、尤もらしい学習事例を特定する。学習エンジンは、例えば、学習事例群の中で出現頻度が閾値以上である学習事例を、尤もらしい学習事例として特定する。抽出装置１００は、学習エンジンによって特定された学習事例から抽出用規則を生成する。抽出装置１００は、例えば、単語の種類「ＬＯＣ」を示す情報と、「行き：＋２」とを関連付けた抽出用規則を生成する。そして、抽出装置１００は、生成した抽出用規則を含む固有表現抽出用規則４００を生成する。 (2) Then, using the learning engine, the extraction apparatus 100 identifies plausible learning cases in the generated learning case group. For example, the learning engine identifies, as plausible learning cases, those whose appearance frequency in the learning case group is equal to or greater than a threshold. The extraction apparatus 100 generates extraction rules from the learning cases identified by the learning engine. For example, the extraction apparatus 100 generates an extraction rule that associates information indicating the type “LOC” with “行き: +2” (“go: +2”). Then, the extraction apparatus 100 generates the specific expression extraction rule 400 including the generated extraction rule.

The extraction apparatus 100 may generate extraction rules indicating that chunks that are not named entities, such as "C", "社", or "社の", are not named entities. Alternatively, it need not generate extraction rules for such chunks.

In this way, the extraction apparatus 100 can use the learning data group to generate the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 by machine learning. This reduces the effort the user of the extraction apparatus 100 must spend creating these rules.

(Detailed processing procedure of the rule learning process according to the second embodiment)
Next, the detailed processing procedure of the rule learning process according to the second embodiment is described with reference to FIGS. 27 and 28. This rule learning process is the process executed by the extraction apparatus 100 as illustrated in FIGS. 24 to 26.

FIGS. 27 and 28 are flowcharts illustrating the detailed processing procedure of the rule learning process according to the second embodiment. As shown in FIG. 27, the extraction apparatus 100 first selects unprocessed learning data from the learning data group (step S2701).

Next, the extraction apparatus 100 converts the selected learning data into a chunk lattice, that is, a lattice divided into chunk units (step S2702). The extraction apparatus 100 then determines whether the resulting chunk lattice contains chunks with the same notation (step S2703). If it does not (step S2703: No), the extraction apparatus 100 proceeds to step S2709.

If chunks with the same notation exist (step S2703: Yes), the extraction apparatus 100 selects an unprocessed pair from among the pairs of same-notation chunks in the chunk lattice (step S2704).

Next, the extraction apparatus 100 determines whether the chunks in the selected pair are of the same type (step S2705). If they are (step S2705: Yes), the extraction apparatus 100 generates a discrimination rule that contains the combination of co-occurring chunks appearing with each chunk of the pair and information indicating that the pair is of the same type (step S2706). It then proceeds to step S2708.

If the chunks are of different types (step S2705: No), the extraction apparatus 100 generates a discrimination rule that contains the combination of co-occurring chunks appearing with each chunk of the pair and information indicating that the pair is of different types (step S2707). It then proceeds to step S2708.

Next, the extraction apparatus 100 determines whether any unprocessed pair of chunks remains (step S2708). If so (step S2708: Yes), it returns to step S2704.

If no unprocessed pair remains (step S2708: No), the extraction apparatus 100 determines whether any unprocessed chunk lattice remains in the group of chunk lattices (step S2709). If so (step S2709: Yes), it returns to step S2701.

If no unprocessed chunk lattice remains (step S2709: No), the extraction apparatus 100 generates the same-notation/type word discrimination rules 300 from the generated discrimination rules (step S2710). It then proceeds to step S2801 in FIG. 28.
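The pair loop of FIG. 27 (steps S2704 to S2708) might look like the sketch below for a single tagged sentence. Several simplifications are assumptions rather than statements from the specification: the sentence is already split into one token per chunk, a tag of `None` marks a non-entity, and co-occurring words are paired only when they sit at the same relative offset from both chunks. The names `context` and `discrimination_rules` are hypothetical.

```python
from itertools import combinations

def context(tokens, i, window=2):
    """Map each offset within +/-window of position i to the surface found there."""
    return {d: tokens[i + d][0]
            for d in range(-window, window + 1)
            if d != 0 and 0 <= i + d < len(tokens)}

def discrimination_rules(tokens, window=2):
    """tokens: list of (surface, tag). Returns [((word_a, word_b), is_same_type), ...]."""
    rules = []
    entities = [i for i, (_, tag) in enumerate(tokens) if tag is not None]
    for i, j in combinations(entities, 2):            # S2704: pick a pair
        if tokens[i][0] != tokens[j][0]:
            continue                                  # only same-notation pairs
        same = tokens[i][1] == tokens[j][1]           # S2705: compare tags
        ci, cj = context(tokens, i, window), context(tokens, j, window)
        for d in sorted(ci.keys() & cj.keys()):       # S2706 / S2707: emit rules
            rules.append(((ci[d], cj[d]), same))
    return rules

tokens = [("C社", "ORG"), ("の", None), ("社員", None), ("は", None),
          ("C社", "LOC"), ("へ", None), ("行き", None),
          ("C社", "LOC"), ("から", None), ("帰る", None), ("。", None)]
rules = discrimination_rules(tokens)
```

For this sentence the pair of "LOC" chunks yields, among others, the rule that "行き" and "帰る" co-occurring with same-notation chunks indicates the same type, while the "ORG"/"LOC" pairs yield different-type rules such as the one for "の" and "へ".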

In FIG. 28, the extraction apparatus 100 selects unprocessed learning data from the learning data group and generates target data by removing the tags from the selected learning data (step S2801). Next, it converts the generated target data into a chunk lattice divided into chunk units (step S2802).

The extraction apparatus 100 then refers to the same-notation/type word discrimination rules 300 to identify the sets of chunks in the chunk lattice that have the same notation and the same type (step S2803). Next, it selects an unprocessed chunk from the chunk lattice (step S2804). Based on the result of step S2803, it determines whether there is a chunk with the same notation and the same type as the selected chunk (step S2805).

If a chunk with the same notation and the same type exists (step S2805: Yes), the extraction apparatus 100 generates an extraction rule that contains the clues identified from each chunk in the set of same-notation, same-type chunks and the type of that set as identified from the tags (step S2806). It then proceeds to step S2808.

If no chunk with the same notation and the same type exists (step S2805: No), the extraction apparatus 100 generates an extraction rule that contains the clues identified from the selected chunk and the type of that chunk as identified from the tags (step S2807). It then proceeds to step S2808.

Next, the extraction apparatus 100 determines whether any unprocessed chunk remains in the selected chunk lattice (step S2808). If so (step S2808: Yes), it returns to step S2804.

If no unprocessed chunk remains (step S2808: No), the extraction apparatus 100 determines whether any unprocessed chunk lattice remains (step S2809). If so (step S2809: Yes), it returns to step S2801.

If no unprocessed chunk lattice remains (step S2809: No), the extraction apparatus 100 generates the named entity extraction rules 400 from the generated extraction rules (step S2810) and ends the rule learning process.

(Specific example of the named entity extraction process performed by the extraction apparatus 100 according to the second embodiment)
Next, a specific example of the named entity extraction process performed by the extraction apparatus 100 according to the second embodiment is described with reference to FIGS. 29 to 32. This process extracts named entity chunks from the chunk lattice obtained by converting the data from which named entities are to be extracted, and is the process described with reference to FIG. 5.

FIG. 29 is an explanatory diagram illustrating an example of a chunk lattice from which the extraction apparatus 100 extracts named entities. As shown in FIG. 29, the extraction apparatus 100 receives the input data "B商事の社員はB商事へ行きB商事から帰る。" ("An employee of B Trading goes to B Trading and returns from B Trading."), from which named entities are to be extracted. It then generates from the input data the chunk lattice 2900 that is the target of named entity extraction.

Specifically, for example, the extraction apparatus 100 performs morphological analysis on the input data "B商事の社員はB商事へ行きB商事から帰る。" and splits it into words. The word-segmented input data is the same sentence with word boundaries marked, for example between "B", "商事", "の", "社員", and "は".

From the word-segmented input data, the extraction apparatus 100 generates chunks that each combine two adjacent words, such as "B商事", "商事の", "の社員", and "社員は". It then builds a chunk lattice from the generated chunks and identifies that lattice as the target of named entity extraction.
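The chunk generation step can be illustrated with a short sketch. It assumes the morphological analyzer has already produced a word list (the segmentation shown is inferred from the chunk examples in the text) and that the lattice holds both single words and two-word chunks, each recorded as a (start, end, surface) span. The function name `build_chunks` is hypothetical.

```python
def build_chunks(words, max_len=2):
    """Return (start, end, surface) for every run of 1..max_len adjacent words."""
    chunks = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            end = start + length
            if end <= len(words):
                chunks.append((start, end, "".join(words[start:end])))
    return chunks

words = ["B", "商事", "の", "社員", "は"]
chunks = build_chunks(words)
two_word = [surface for start, end, surface in chunks if end - start == 2]
```

Here `two_word` contains the same two-word chunks listed in the text: "B商事", "商事の", "の社員", and "社員は". The span indices make it possible to follow paths through the lattice later, because a chunk ending at position k can only be followed by a chunk starting at k.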

FIGS. 30 and 31 are explanatory diagrams illustrating a specific example of the named entity extraction process performed by the extraction apparatus 100 according to the second embodiment. In FIG. 30, (1) the extraction apparatus 100 extracts from the chunk lattice 2900, as pairs of same-notation chunks, the pair "B商事" 2901 and "B商事" 2902, the pair "B商事" 2901 and "B商事" 2903, and the pair "B商事" 2902 and "B商事" 2903. The extraction apparatus 100 may also extract, for example, pairs of "B" or pairs of "商事" as pairs of same-notation chunks.

For the pair "B商事" 2901 and "B商事" 2902, the extraction apparatus 100 identifies as clues combinations of the words surrounding each chunk, such as "の&へ" and "社員&行く". For the pair "B商事" 2901 and "B商事" 2903, it identifies combinations such as "の&から" and "社員&帰る". For the pair "B商事" 2902 and "B商事" 2903, it identifies combinations such as "へ&から" and "行き&帰る".

(2) Next, the extraction apparatus 100 uses a discrimination engine to identify the discrimination rules in the same-notation/type word discrimination rules 300 that match the generated clues. When several discrimination rules match, the discrimination engine identifies the most plausible one. Using the discrimination rule identified by the discrimination engine, the extraction apparatus 100 determines, for each of the pairs "B商事" 2901 and "B商事" 2902, "B商事" 2901 and "B商事" 2903, and "B商事" 2902 and "B商事" 2903, whether the two chunks are of the same type.

Here, from the same-notation/type word discrimination rules 300 and the combination of the co-occurring words "行き" and "帰る", the extraction apparatus 100 determines that "B商事" 2902 and "B商事" 2903 are of the same type.
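As a rough illustration of this lookup, the sketch below pairs the words that follow two same-notation chunks and consults a rule table. The rule layout, the function name `judge_same_type`, and the "first matching rule wins" policy are assumptions; the specification only says that the discrimination engine picks the most plausible rule when several match.

```python
def judge_same_type(ctx_a, ctx_b, rules):
    """Return True/False from the first matching rule, or None if no rule matches.

    ctx_a, ctx_b: words following each chunk, aligned by offset (+1, +2, ...).
    rules: {(word_near_a, word_near_b): is_same_type}  (hypothetical layout)
    """
    for pair in zip(ctx_a, ctx_b):
        if pair in rules:
            return rules[pair]
    return None

rules = {("行き", "帰る"): True, ("の", "へ"): False}
after_2901 = ["の", "社員"]     # words at +1, +2 after "B商事" 2901
after_2902 = ["へ", "行き"]     # words at +1, +2 after "B商事" 2902
after_2903 = ["から", "帰る"]   # words at +1, +2 after "B商事" 2903

same_2902_2903 = judge_same_type(after_2902, after_2903, rules)
same_2901_2902 = judge_same_type(after_2901, after_2902, rules)
same_2901_2903 = judge_same_type(after_2901, after_2903, rules)
```

In this toy table, 2902 and 2903 are judged to be the same type, 2901 and 2902 are judged to differ, and no rule covers 2901 and 2903, so that pair is left undecided.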

In FIG. 31, in the subsequent processing, the extraction apparatus 100 handles "B商事" 2902 and "B商事" 2903, which were determined in FIG. 30 to be of the same type, together as one group.
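One way to turn pairwise same-type judgments into groups is a small union-find structure, sketched below. Treating the judgment as transitive (if A matches B and B matches C, all three form one group) is an assumption of this sketch; the text only describes a single pair being grouped. The function name `group_chunks` is hypothetical, and the chunk identifiers reuse the reference numerals from FIG. 30.

```python
def group_chunks(ids, same_pairs):
    """Merge chunk ids that were judged to be the same type into groups."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

groups = group_chunks([2901, 2902, 2903], same_pairs=[(2902, 2903)])
```

The result keeps 2901 on its own and places 2902 and 2903 in one group, matching the grouping described for FIG. 31.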

(1) For each chunk in the chunk lattice 2900, or for each group of chunks handled together, the extraction apparatus 100 identifies clues for extracting the chunk type. Specifically, for example, from the words surrounding the grouped "B商事" 2902 and "B商事" 2903, the extraction apparatus 100 identifies as clues "社員:-2", "は:-1", "へ:+1", "行き:+2", "へ:-2", "行き:-1", "から:+1", and "帰る:+2". In addition, from the words surrounding "B商事" 2901, which has no other chunk of the same type, it identifies as clues "の:+1" and "社員:+2".
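The positional clues listed above can be reproduced with a short sketch. It assumes one token per chunk and a window of two positions on each side, which matches the offsets in the example; the function name `clues` is hypothetical.

```python
def clues(tokens, positions, window=2):
    """Collect 'word:offset' clues around every position in a group of chunks."""
    found = []
    for i in positions:
        for d in range(-window, window + 1):
            if d != 0 and 0 <= i + d < len(tokens):
                found.append(f"{tokens[i + d]}:{d:+d}")
    return found

tokens = ["B商事", "の", "社員", "は", "B商事", "へ", "行き",
          "B商事", "から", "帰る", "。"]
group_clues = clues(tokens, [4, 7])   # grouped "B商事" 2902 and 2903
single_clues = clues(tokens, [0])     # "B商事" 2901 on its own
```

Because the grouped chunks pool their clues, the group is described by eight clues while the isolated chunk at the start of the sentence has only the two that follow it.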

(2) Next, the extraction apparatus 100 uses the discrimination engine to identify the extraction rules in the named entity extraction rules 400 that match the generated clues. When several extraction rules match, the discrimination engine identifies the most plausible one. Using the extraction rule identified by the discrimination engine, the extraction apparatus 100 extracts the chunk type for each chunk in the chunk lattice 2900, or for each group of chunks handled together. Here, the extraction apparatus 100 extracts "LOC" as the type of "B商事" 2902 and "B商事" 2903, and "ORG" as the type of "B商事" 2901.
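A simple way to picture rule matching is to let every matching extraction rule vote for its type and take the type with the most votes. This voting scheme, the integer weights, and the names `score_types` and `extraction_rules` are assumptions made for illustration; the specification does not fix how plausibility is computed.

```python
from collections import defaultdict

def score_types(clue_list, rules):
    """Sum rule weights per type for the clues that match a rule.

    rules: {clue: (type, weight)}  (hypothetical layout)
    """
    scores = defaultdict(int)
    for clue in clue_list:
        if clue in rules:
            label, weight = rules[clue]
            scores[label] += weight
    return dict(scores)

extraction_rules = {
    "行き:+2": ("LOC", 1),
    "から:+1": ("LOC", 1),
    "の:+1": ("ORG", 1),
}
group_scores = score_types(
    ["社員:-2", "は:-1", "へ:+1", "行き:+2", "へ:-2", "行き:-1", "から:+1", "帰る:+2"],
    extraction_rules,
)
single_scores = score_types(["の:+1", "社員:+2"], extraction_rules)
group_type = max(group_scores, key=group_scores.get)
single_type = max(single_scores, key=single_scores.get)
```

With these toy rules the grouped chunks come out as "LOC" and the isolated chunk as "ORG", in line with the example in the text.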

As a result, the extraction apparatus 100 can group chunks that have the same notation and the same type and extract one common type for them. It can also extract the type of a chunk that has the same notation but a different type separately from the other same-notation chunks. Grouping same-notation, same-type chunks prevents chunks of the same type from being extracted as different types, and handling same-notation chunks of different types separately prevents them from being mistakenly extracted as the same type. Both effects improve extraction accuracy.

(Output example of the named entity extraction result by the extraction apparatus 100 according to the second embodiment)
Next, an output example of the named entity extraction result by the extraction apparatus 100 according to the second embodiment is described with reference to FIG. 32. The extraction apparatus 100 outputs the named entity extraction result obtained in FIGS. 29 to 31.

FIG. 32 is an explanatory diagram illustrating an output example of the named entity extraction result by the extraction apparatus 100 according to the second embodiment. As shown in FIG. 32, the extraction apparatus 100 outputs, for example, the chunk lattice 2900 with the extracted word types attached as tags: "<ORG>B商事</ORG>の社員は<LOC>B商事</LOC>へ行き<LOC>B商事</LOC>から帰る。".
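Producing the tagged output is a matter of wrapping each extracted chunk in its type tag, as in the sketch below. It assumes one token per chunk and a mapping from token position to extracted type; the function name `tag_output` is hypothetical.

```python
def tag_output(tokens, labels):
    """Wrap each labelled token in <TYPE>...</TYPE> and join the sentence."""
    out = []
    for i, tok in enumerate(tokens):
        label = labels.get(i)
        out.append(f"<{label}>{tok}</{label}>" if label else tok)
    return "".join(out)

tokens = ["B商事", "の", "社員", "は", "B商事", "へ", "行き",
          "B商事", "から", "帰る", "。"]
tagged = tag_output(tokens, {0: "ORG", 4: "LOC", 7: "LOC"})
```

The resulting string is the tagged sentence shown for FIG. 32.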

(Detailed processing procedure of the named entity extraction process according to the second embodiment)
Next, the detailed processing procedure of the named entity extraction process according to the second embodiment is described with reference to FIG. 33. This process is performed by the extraction apparatus 100 as illustrated in FIGS. 29 to 32.

FIG. 33 is a flowchart illustrating the detailed processing procedure of the named entity extraction process according to the second embodiment. As shown in FIG. 33, the extraction apparatus 100 first receives input data (step S3301). Next, it converts the input data into a chunk lattice divided into chunk units (step S3302).

The extraction apparatus 100 then refers to the same-notation/type word discrimination rules 300 to identify the sets of chunks in the input data that have the same notation and the same type (step S3303). Next, it selects an unprocessed chunk from the input data (step S3304). Based on the result of step S3303, it determines whether there is a chunk with the same notation and the same type as the selected chunk (step S3305).

If a chunk with the same notation and the same type exists (step S3305: Yes), the extraction apparatus 100 uses the clues identified from each chunk in the set of same-notation, same-type chunks together with the named entity extraction rules 400 to assign, by applying the rules, scores for identifying the named entity type of that set (step S3306). It then proceeds to step S3308.

If no chunk with the same notation and the same type exists (step S3305: No), the extraction apparatus 100 uses the clues identified from the selected chunk together with the named entity extraction rules 400 to assign, by applying the rules, scores for identifying the named entity type of the selected chunk (step S3307). It then proceeds to step S3308.

Next, the extraction apparatus 100 determines whether any unprocessed chunk remains in the input data (step S3308). If so (step S3308: Yes), it returns to step S3304.

If no unprocessed chunk remains (step S3308: No), the extraction apparatus 100 selects the final result. Based on the scores assigned to each chunk for each named entity type, it considers the chunk paths that can be taken from the beginning to the end of the sentence, together with the combinations of named entity types possible on each path, and selects the chunk sequence and named entity types whose total score is highest. The extraction apparatus 100 then outputs the extraction result (step S3309) and ends the named entity extraction process.
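The final selection can be sketched as a dynamic program over chunk spans. The sketch makes several assumptions that the text does not state: each chunk is a (start, end) span over word positions, every chunk is scored independently (so the best type per chunk can be chosen locally), the type "O" stands for "not a named entity", and scores are integers. The function name `best_path` is hypothetical.

```python
def best_path(n, scored):
    """Find the highest-scoring chunk path covering word positions 0..n.

    scored: {(start, end): {type: score}}
    Returns (total_score, [(start, end, type), ...]) or None if no path exists.
    """
    best = {0: (0, [])}  # best[k] = best (score, path) ending at position k
    for end in range(1, n + 1):
        for (s, e), type_scores in scored.items():
            if e != end or s not in best:
                continue
            # Pick the best-scoring type for this chunk.
            label, score = max(type_scores.items(), key=lambda kv: kv[1])
            total = best[s][0] + score
            if end not in best or total > best[end][0]:
                best[end] = (total, best[s][1] + [(s, e, label)])
    return best.get(n)

# Words: "B", "商事", "の"; the two-word chunk "B商事" spans (0, 2).
scored = {
    (0, 1): {"O": 1},
    (1, 2): {"O": 1},
    (0, 2): {"ORG": 9, "LOC": 3},
    (2, 3): {"O": 2},
}
total, path = best_path(3, scored)
```

In this toy lattice the path through the two-word chunk "B商事" labelled "ORG" beats the path through the separate words "B" and "商事". When grouped chunks must share one type, the per-chunk choice would need to be coordinated across the group, which this sketch does not attempt.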

As described above, when the text data 110 from which named entities are to be extracted contains words with the same notation, the extraction apparatus 100 determines whether those words are of the same type. It then groups words that have the same notation and the same type and extracts one common word type for them, and it extracts the word type separately for words that have the same notation but different types.

The extraction apparatus 100 can also generate the same-notation/type word discrimination rules 300 and the named entity extraction rules 400 from learning data by machine learning. This reduces the effort the user of the extraction apparatus 100 must spend creating these rules.

Furthermore, the extraction apparatus 100 divides the text data 110 from which named entities are to be extracted into chunks to generate a chunk lattice and, when the chunk lattice contains chunks with the same notation, determines whether those chunks are of the same type. It then groups chunks that have the same notation and the same type and extracts their chunk type together, and it extracts the chunk type separately for chunks that have the same notation but different types.

As a result, by grouping same-notation, same-type chunks and extracting them as the same type, the extraction apparatus 100 prevents chunks of the same type from being extracted as different types and thereby improves extraction accuracy. Likewise, by handling same-notation chunks of different types separately, it prevents them from being mistakenly extracted as the same type and thereby improves extraction accuracy.

The extraction method described in this embodiment can be implemented by executing a program prepared in advance on a computer such as a personal computer or a workstation. The extraction program is recorded on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD, and is executed when the computer reads it from the recording medium. The extraction program may also be distributed via a network such as the Internet.

The following supplementary notes are further disclosed with respect to the embodiment described above.

(Supplementary Note 1) An extraction apparatus comprising:
a first storage unit that stores discrimination rules, each associating a combination of co-occurring words with information indicating whether same-notation words appearing with each of the co-occurring words are words of the same type;
a second storage unit that stores extraction rules, each associating a combination of a co-occurring word and the distance to that co-occurring word with information indicating the word type defined according to the distance;
a detection unit that detects, from a series of words, a first word and a second word having the same notation as the first word;
a determination unit that determines whether the first storage unit holds a discrimination rule in which the first word detected by the detection unit appears with one of the co-occurring words and the second word detected by the detection unit appears with the other of the co-occurring words and, if such a discrimination rule exists, determines from that rule whether the first word and the second word are of the same type;
an identification unit that, when the determination unit determines that they are of the same type, identifies from the series of words combinations of a word located within a predetermined distance of each of the first word and the second word and the distance to that word;
an extraction unit that extracts, from the extraction rules stored in the second storage unit, the information indicating the word type associated with the combinations identified by the identification unit, and assigns it to the first word and the second word; and
an output unit that outputs the series of words to which the information has been assigned by the extraction unit.

(Supplementary Note 2) The extraction apparatus according to Supplementary Note 1, further comprising a conversion unit that converts the series of words into a plurality of word sequences containing chunks, wherein
the detection unit detects, from the plurality of word sequences produced by the conversion unit, a first chunk and a second chunk having the same notation as the first chunk,
the determination unit determines whether the first storage unit holds a discrimination rule in which the first chunk detected by the detection unit appears with one of the co-occurring words and the second chunk detected by the detection unit appears with the other of the co-occurring words and, if such a discrimination rule exists, determines from that rule whether the first word and the second word are of the same type,
the identification unit, when the determination unit determines that they are of the same type, identifies from the converted word sequences combinations of a word located within a predetermined distance of each of the first chunk and the second chunk and the distance to that word, and
the extraction unit extracts, from the extraction rules stored in the second storage unit, the information indicating the word type associated with the combinations identified by the identification unit, and assigns it to the first chunk and the second chunk.

(Supplementary Note 3) The extraction apparatus according to Supplementary Note 1 or 2, further comprising:
a first acquisition unit that acquires a combination of same-notation words from a word sequence containing words to which information indicating a word type has been assigned;
a second acquisition unit that acquires, from the word sequence containing words to which the information indicating a word type has been assigned, a combination of co-occurring words appearing with each word of the combination acquired by the first acquisition unit;
a judgment unit that judges, based on the information indicating the word type assigned to each word of the combination of same-notation words, whether the same-notation words are words of the same type;
a generation unit that generates a discrimination rule associating the combination of co-occurring words acquired by the second acquisition unit with the judgment result of the judgment unit; and
a storing unit that stores the discrimination rule generated by the generation unit in the first storage unit.

（付記４）前記検出部は、
単語の種類を示す情報が付与された単語を含む単語列の中から第３の単語および当該第３の単語と同一表記の第４の単語を検出し、
前記判別部は、
前記検出部によって検出された第３の単語が前記共起単語の一方とともに出現し、かつ、前記検出部によって検出された第４の単語が前記共起単語の他方とともに出現する判別規則が前記第１の記憶部にあるか否かを判別し、判別規則がある場合、当該判別規則から前記第１の単語と前記第２の単語とが同一種類か否かを判別し、
前記特定部は、
前記判別部によって同一種類であると判別された場合、前記単語の種類を示す情報が付与された単語を含む単語列の中から、前記第３の単語および前記第４の単語の各々から所定距離以内に存在する単語と当該単語までの距離との組み合わせを特定し、
前記生成部は、
前記特定部によって特定された単語と当該単語までの距離との組み合わせと、前記第３の単語と前記第４の単語のいずれかの単語に付与されている単語の種類を示す情報と、を関連付けた抽出用規則を生成し、
前記格納部は、
前記生成部によって生成された抽出用規則を前記第２の記憶部に格納することを特徴とする付記３に記載の抽出装置。
(Appendix 4) The extraction apparatus according to appendix 3, wherein:
the detection unit detects, from a word string including words to which information indicating the word type is given, a third word and a fourth word having the same notation as the third word;
the discrimination unit determines whether the first storage unit holds a discrimination rule in which the third word detected by the detection unit appears together with one of the co-occurrence words and the fourth word detected by the detection unit appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determines from that rule whether the first word and the second word are of the same type;
the specifying unit, when the discrimination unit determines that the words are of the same type, identifies, from the word string including words to which information indicating the word type is given, combinations of a word existing within a predetermined distance from each of the third word and the fourth word and the distance to that word;
the generation unit generates an extraction rule that associates the combination of the word identified by the specifying unit and the distance to that word with the information indicating the word type given to either the third word or the fourth word; and
the storage unit stores the extraction rule generated by the generation unit in the second storage unit.
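
The extraction-rule generation of appendix 4 can be sketched as follows. This is an illustrative assumption, not the disclosed implementation: the function names are hypothetical, the "distance" is modeled as a signed token offset, and the pairs of same-notation words already judged to be of the same type are passed in directly as index pairs.

```python
def context_features(tokens, i, max_dist=2):
    """Return (word, signed_offset) pairs for words within max_dist of index i.

    Assumption: the distance is a signed token offset (negative = to the left).
    """
    feats = []
    for d in range(-max_dist, max_dist + 1):
        j = i + d
        if d != 0 and 0 <= j < len(tokens):
            feats.append((tokens[j][0], d))
    return feats


def generate_extraction_rules(tokens, same_type_pairs, max_dist=2):
    """Build {(word, offset): type_label} from a tagged word string.

    tokens: list of (surface, type_label) pairs (hypothetical format).
    same_type_pairs: index pairs (third word, fourth word) judged same-type.
    """
    rules = {}
    for i, j in same_type_pairs:
        label = tokens[i][1]  # type given to one of the two words
        for k in (i, j):
            for feat in context_features(tokens, k, max_dist):
                rules[feat] = label
    return rules


tagged = [("President", "O"), ("Sato", "PERSON"), ("said", "O"),
          ("Mr.", "O"), ("Sato", "PERSON")]
extraction_rules = generate_extraction_rules(tagged, [(1, 4)], max_dist=1)
```

With a predetermined distance of one token, the contexts of both occurrences of "Sato" contribute rules such as ("Mr.", -1) mapped to PERSON.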

（付記５）コンピュータに、
一連の単語の中から第１の単語および当該第１の単語と同一表記の第２の単語を検出し、
共起単語の組み合わせと、前記共起単語の各々とともに出現する同一表記の単語が同一種類の単語であるか否かを示す情報と、を関連付けた判別規則を記憶する第１の記憶部に、検出された第１の単語が前記共起単語の一方とともに出現し、かつ、検出された第２の単語が前記共起単語の他方とともに出現する判別規則があるか否かを判別し、判別規則がある場合、当該判別規則から前記第１の単語と前記第２の単語とが同一種類か否かを判別し、
同一種類であると判別された場合、前記一連の単語の中から、前記第１の単語および前記第２の単語の各々から所定距離以内に存在する単語と当該単語までの距離との組み合わせを特定し、
前記共起単語と当該共起単語までの距離の組み合わせと、当該距離に応じて規定された単語の種類を示す情報と、を関連付けた抽出用規則を記憶する第２の記憶部に記憶されている抽出用規則から、特定された組み合わせに関連付けられた単語の種類を示す情報を抽出し、前記第１の単語および前記第２の単語に付与し、
付与された前記一連の単語を出力する、
処理を実行させることを特徴とする抽出プログラム。
(Appendix 5) An extraction program that causes a computer to execute a process comprising:
detecting, from a series of words, a first word and a second word having the same notation as the first word;
determining whether a first storage unit, which stores discrimination rules each associating a combination of co-occurrence words with information indicating whether same-notation words appearing with each of the co-occurrence words are of the same type, holds a discrimination rule in which the detected first word appears together with one of the co-occurrence words and the detected second word appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determining from that rule whether the first word and the second word are of the same type;
when the words are determined to be of the same type, identifying, from the series of words, combinations of a word existing within a predetermined distance from each of the first word and the second word and the distance to that word;
extracting, from the extraction rules stored in a second storage unit, which stores extraction rules each associating a combination of a co-occurrence word and the distance to that co-occurrence word with information indicating a word type defined according to the distance, the information indicating the word type associated with the identified combination, and assigning the extracted information to the first word and the second word; and
outputting the series of words to which the information has been assigned.

（付記６）コンピュータが、
一連の単語の中から第１の単語および当該第１の単語と同一表記の第２の単語を検出し、
共起単語の組み合わせと、前記共起単語の各々とともに出現する同一表記の単語が同一種類の単語であるか否かを示す情報と、を関連付けた判別規則を記憶する第１の記憶部に、検出された第１の単語が前記共起単語の一方とともに出現し、かつ、検出された第２の単語が前記共起単語の他方とともに出現する判別規則があるか否かを判別し、判別規則がある場合、当該判別規則から前記第１の単語と前記第２の単語とが同一種類か否かを判別し、
同一種類であると判別された場合、前記一連の単語の中から、前記第１の単語および前記第２の単語の各々から所定距離以内に存在する単語と当該単語までの距離との組み合わせを特定し、
前記共起単語と当該共起単語までの距離の組み合わせと、当該距離に応じて規定された単語の種類を示す情報と、を関連付けた抽出用規則を記憶する第２の記憶部に記憶されている抽出用規則から、特定された組み合わせに関連付けられた単語の種類を示す情報を抽出し、前記第１の単語および前記第２の単語に付与し、
付与された前記一連の単語を出力する、
処理を実行することを特徴とする抽出方法。
(Appendix 6) An extraction method in which a computer executes a process comprising:
detecting, from a series of words, a first word and a second word having the same notation as the first word;
determining whether a first storage unit, which stores discrimination rules each associating a combination of co-occurrence words with information indicating whether same-notation words appearing with each of the co-occurrence words are of the same type, holds a discrimination rule in which the detected first word appears together with one of the co-occurrence words and the detected second word appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determining from that rule whether the first word and the second word are of the same type;
when the words are determined to be of the same type, identifying, from the series of words, combinations of a word existing within a predetermined distance from each of the first word and the second word and the distance to that word;
extracting, from the extraction rules stored in a second storage unit, which stores extraction rules each associating a combination of a co-occurrence word and the distance to that co-occurrence word with information indicating a word type defined according to the distance, the information indicating the word type associated with the identified combination, and assigning the extracted information to the first word and the second word; and
outputting the series of words to which the information has been assigned.
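
The processing flow of appendices 5 and 6 (detect, discriminate, specify, extract, output) can be sketched in Python as below. This is a simplified illustration under stated assumptions rather than the disclosed method: both rule stores are modeled as plain dictionaries, the distance is a signed token offset, the first matching extraction rule wins, and all names (`neighbors`, `extract`, `max_dist`) are hypothetical.

```python
from itertools import combinations


def neighbors(words, i, max_dist):
    """Return (word, signed_offset) pairs within max_dist of index i."""
    return [(words[i + d], d) for d in range(-max_dist, max_dist + 1)
            if d != 0 and 0 <= i + d < len(words)]


def extract(words, discrimination_rules, extraction_rules, max_dist=1):
    """Assign type labels to same-notation word pairs.

    discrimination_rules: {(co_word_a, co_word_b): same_type}  (first storage)
    extraction_rules:     {(word, offset): type_label}         (second storage)
    """
    labels = [None] * len(words)
    # Detection: every pair of positions holding the same surface.
    for i, j in combinations(range(len(words)), 2):
        if words[i] != words[j]:
            continue
        ctx_i = neighbors(words, i, max_dist)
        ctx_j = neighbors(words, j, max_dist)
        # Discrimination: is there a rule saying the pair shares a type?
        same = any(discrimination_rules.get((a, b))
                   for a, _ in ctx_i for b, _ in ctx_j)
        if not same:
            continue
        # Specification and extraction: look up (word, distance) combinations
        # and give the associated type to both words.
        for feat in ctx_i + ctx_j:
            if feat in extraction_rules:
                labels[i] = labels[j] = extraction_rules[feat]
                break
    # Output: the series of words together with the assigned types.
    return list(zip(words, labels))


result = extract(
    ["President", "Sato", "said", "Mr.", "Sato"],
    {("President", "Mr."): True},
    {("Mr.", -1): "PERSON"},
)
```

Here both occurrences of "Sato" receive the label PERSON, because the discrimination rule for ("President", "Mr.") marks them as the same type and the extraction rule for ("Mr.", -1) supplies the label. When no discrimination rule applies, or the rule says the words differ in type, the pair is left unlabeled.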

１００抽出装置
３００同一表記・種類単語判別規則
４００固有表現抽出用規則
５０１第１の記憶部
５０２第２の記憶部
５０３入力部
５０４検出部
５０５判別部
５０６特定部
５０７抽出部
５０８出力部
５０９第１の取得部
５１０第２の取得部
５１１判断部
５１２生成部
５１３格納部
５１４変換部
100 Extraction device
300 Same notation / type word discrimination rule
400 Specific expression extraction rule
501 First storage unit
502 Second storage unit
503 Input unit
504 Detection unit
505 Discrimination unit
506 Specifying unit
507 Extraction unit
508 Output unit
509 First acquisition unit
510 Second acquisition unit
511 Determination unit
512 Generation unit
513 Storage unit
514 Conversion unit

Claims

1. An extraction apparatus comprising:
a first storage unit that stores discrimination rules each associating a combination of co-occurrence words with information indicating whether same-notation words appearing with each of the co-occurrence words are of the same type;
a second storage unit that stores extraction rules each associating a combination of a co-occurrence word and the distance to that co-occurrence word with information indicating a word type defined according to the distance;
a detection unit that detects, from a series of words, a first word and a second word having the same notation as the first word;
a discrimination unit that determines whether the first storage unit holds a discrimination rule in which the first word detected by the detection unit appears together with one of the co-occurrence words and the second word detected by the detection unit appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determines from that rule whether the first word and the second word are of the same type;
a specifying unit that, when the discrimination unit determines that the words are of the same type, identifies, from the series of words, combinations of a word existing within a predetermined distance from each of the first word and the second word and the distance to that word;
an extraction unit that extracts, from the extraction rules stored in the second storage unit, the information indicating the word type associated with the combination identified by the specifying unit, and assigns the extracted information to the first word and the second word; and
an output unit that outputs the series of words to which the extraction unit has assigned the information.

2. The extraction apparatus according to claim 1, further comprising:
a first acquisition unit that acquires a combination of words having the same notation from a word string including words to which information indicating the word type is given;
a second acquisition unit that acquires, from the word string including words to which information indicating the word type is given, a combination of co-occurrence words appearing with each word of the combination acquired by the first acquisition unit;
a determination unit that determines, based on the information indicating the word type given to each word of the combination of same-notation words, whether the same-notation words are of the same type;
a generation unit that generates a discrimination rule associating the combination of co-occurrence words acquired by the second acquisition unit with the determination result of the determination unit; and
a storage unit that stores the discrimination rule generated by the generation unit in the first storage unit.

3. The extraction apparatus according to claim 2, wherein:
the detection unit detects, from a word string including words to which information indicating the word type is given, a third word and a fourth word having the same notation as the third word;
the discrimination unit determines whether the first storage unit holds a discrimination rule in which the third word detected by the detection unit appears together with one of the co-occurrence words and the fourth word detected by the detection unit appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determines from that rule whether the first word and the second word are of the same type;
the specifying unit, when the discrimination unit determines that the words are of the same type, identifies, from the word string including words to which information indicating the word type is given, combinations of a word existing within a predetermined distance from each of the third word and the fourth word and the distance to that word;
the generation unit generates an extraction rule that associates the combination of the word identified by the specifying unit and the distance to that word with the information indicating the word type given to either the third word or the fourth word; and
the storage unit stores the extraction rule generated by the generation unit in the second storage unit.

4. An extraction program that causes a computer to execute a process comprising:
detecting, from a series of words, a first word and a second word having the same notation as the first word;
determining whether a first storage unit, which stores discrimination rules each associating a combination of co-occurrence words with information indicating whether same-notation words appearing with each of the co-occurrence words are of the same type, holds a discrimination rule in which the detected first word appears together with one of the co-occurrence words and the detected second word appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determining from that rule whether the first word and the second word are of the same type;
when the words are determined to be of the same type, identifying, from the series of words, combinations of a word existing within a predetermined distance from each of the first word and the second word and the distance to that word;
extracting, from the extraction rules stored in a second storage unit, which stores extraction rules each associating a combination of a co-occurrence word and the distance to that co-occurrence word with information indicating a word type defined according to the distance, the information indicating the word type associated with the identified combination, and assigning the extracted information to the first word and the second word; and
outputting the series of words to which the information has been assigned.

5. An extraction method in which a computer executes a process comprising:
detecting, from a series of words, a first word and a second word having the same notation as the first word;
determining whether a first storage unit, which stores discrimination rules each associating a combination of co-occurrence words with information indicating whether same-notation words appearing with each of the co-occurrence words are of the same type, holds a discrimination rule in which the detected first word appears together with one of the co-occurrence words and the detected second word appears together with the other of the co-occurrence words, and, if such a discrimination rule exists, determining from that rule whether the first word and the second word are of the same type;
when the words are determined to be of the same type, identifying, from the series of words, combinations of a word existing within a predetermined distance from each of the first word and the second word and the distance to that word;
extracting, from the extraction rules stored in a second storage unit, which stores extraction rules each associating a combination of a co-occurrence word and the distance to that co-occurrence word with information indicating a word type defined according to the distance, the information indicating the word type associated with the identified combination, and assigning the extracted information to the first word and the second word; and
outputting the series of words to which the information has been assigned.