JP6502807B2

JP6502807B2 - Information extraction apparatus, information extraction method and information extraction program

Info

Publication number: JP6502807B2
Application number: JP2015182102A
Authority: JP
Inventors: 昌之岡本; 祐一宮村; 彩奈山本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2015-09-15
Filing date: 2015-09-15
Publication date: 2019-04-17
Anticipated expiration: 2035-09-15
Also published as: JP2017058866A

Description

Embodiments relate to information extraction.

There is demand for a technique that extracts relation information between the values of multiple attributes described in a document such as a web page (for example, relation information between a material name and its characteristic value, or between a product name and its price). Such a technique makes it easy to organize desired information out of the vast amount of information contained in text. For example, relation information between product names and their specifications can be extracted from text and compiled into a table in a short time.

However, extracted relation information can be ambiguous. For example, when relation information between a material name and its mobility is extracted from a paper that reports mobility measured under different experimental conditions, multiple relation information items with the same material name but different mobilities may be extracted. Similarly, for products such as PCs (personal computers) and automobiles, multiple models with different options (for example, the storage capacity of an HDD (hard disk drive) or SSD (solid state drive), color, or engine displacement) may be offered under the same product name (brand name). Consequently, when relation information between a product name and its price is extracted from a document describing PC prices, multiple relation information items with the same product name (brand name) but different prices may be extracted. Conversely, even if a value corresponding to a model name is extracted as the product name, it is difficult to identify which brand the extracted model-price relation belongs to.

JP 2010-117797 A; JP 2001-060199 A

An object of the embodiments is to reduce the ambiguity of relation information between attribute values extracted from a document.

According to an embodiment, an information extraction apparatus includes a relation information extraction unit and a complementary information extraction unit. The relation information extraction unit obtains a relation information group by extracting, from a document, at least one relation information item representing a relation between a value of a first attribute and a value of a second attribute. When first relation information belonging to the relation information group is determined, according to a determination criterion, to be ambiguous, the complementary information extraction unit extracts complementary information related to at least one of the first-attribute value and the second-attribute value that form the first relation information.

FIG. 1 is a block diagram illustrating an information extraction apparatus according to a first embodiment.
FIG. 2 is a diagram illustrating a document received by the input unit of FIG. 1.
FIG. 3 is a flowchart illustrating a series of relation information extraction processing performed by the relation information extraction unit, the learning unit, and the calculation unit of FIG. 1.
FIG. 4 is a diagram illustrating attribute value candidates extracted by the relation information extraction unit, the learning unit, and the calculation unit of FIG. 1.
FIG. 5 is a diagram illustrating attribute value candidates extracted by the relation information extraction unit, the learning unit, and the calculation unit of FIG. 1.
FIG. 6 is a diagram illustrating a relation information group extracted by the relation information extraction unit, the learning unit, and the calculation unit of FIG. 1.
FIG. 7 is a diagram illustrating complementary information extracted by the complementary information extraction unit of FIG. 1.
FIG. 8 is a block diagram illustrating an information extraction apparatus according to a second embodiment.
FIG. 9 is a diagram illustrating a relation information evaluation UI implemented by the output unit and the evaluation information input unit of FIG. 8.
FIG. 10 is a flowchart illustrating the operation of the complementary information extraction unit of FIG. 1.

Embodiments are described below with reference to the drawings. Elements that are the same as or similar to elements already described are given the same or similar reference numerals, and redundant descriptions are basically omitted.

(First Embodiment)
FIG. 1 illustrates an information extraction apparatus 100 according to the first embodiment. The information extraction apparatus 100 may be, for example, a terminal to which a document can be input, or may provide an application or a service for such a terminal.

The information extraction apparatus 100 extracts, from text, relation information between the values of multiple items (hereinafter referred to as attributes). The attributes to be extracted may be specified to the information extraction apparatus 100 directly, or an analysis purpose may be specified and the apparatus may determine the attributes to be extracted based on that purpose.

A document is natural-language text data such as a web page or a news article. However, any type of data from which elements (for example, words) that are candidates for attribute values can be extracted based on their features can be used as a document.

As illustrated in FIG. 1, the information extraction apparatus 100 includes an input unit 101, a relation information extraction unit 102, a learning unit 103, a calculation unit 104, a determination unit 105, a complementary information extraction unit 106, and an output unit 107.

The input unit 101 receives the document from which relation information is to be extracted. The input unit 101 may acquire the document from a recording medium or over a network, or may accept the document as direct input in response to a user operation. As an example of a document, FIG. 2 shows an article about the prices of new PC models. The input unit 101 outputs the document to the relation information extraction unit 102.

Broadly speaking, the relation information extraction unit 102, the learning unit 103, and the calculation unit 104 perform the series of relation information extraction processing illustrated in FIG. 3. The relation information extraction unit 102 performs steps S201, S202, S203, and S204, the learning unit 103 performs step S205, and the calculation unit 104 performs step S206. This processing can be implemented using, for example, machine learning.

The relation information extraction unit 102 receives the document from the input unit 101. Broadly speaking, the relation information extraction unit 102 extracts candidates for relation information between the values of the multiple attributes described in the document, and outputs the extracted relation information candidates to the learning unit 103. In addition, as described later, the relation information extraction unit 102 outputs the features of the relation information candidates and the features of the attribute value candidates to the learning unit 103.

Following FIG. 3, the relation information extraction unit 102 first performs text analysis on the document as preprocessing (step S201). The text analysis may be, for example, morphological analysis, named entity recognition, or syntactic parsing.

Next, the relation information extraction unit 102 extracts from the document value candidates (typically words such as nouns and numerical values) for each of the attributes to be extracted (for example, "product name" and "price") (steps S202 and S203).

Specifically, the relation information extraction unit 102 may extract value candidates for the attribute "product name" by searching the document for names or proper nouns registered in a dictionary in advance, or by applying to the document a pattern matching rule that targets expressions such as "XX version". Likewise, the relation information extraction unit 102 may extract value candidates for the attribute "price" by applying to the document a pattern matching rule that targets expressions such as "(number) + (currency unit)".
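The dictionary lookup and pattern matching rules described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary contents, both regular expressions, and the function names are assumptions made for the example, and the rules are written for English text rather than the Japanese of the original document.

```python
import re

# Hypothetical dictionary of pre-registered product names (assumption).
PRODUCT_DICTIONARY = ["Dbook", "LNote"]

# Hypothetical rule for "XX version" expressions, e.g. "500 GB version".
VERSION_RULE = re.compile(r"\b\d+\s?(?:GB|TB)\s+version\b")

# Hypothetical rule for "(number) + (currency unit)", e.g. "50,000 yen".
PRICE_RULE = re.compile(r"\b\d{1,3}(?:,\d{3})*\s?yen\b")


def extract_product_name_candidates(text):
    """Collect "product name" candidates: dictionary hits first, then rule hits."""
    candidates = [name for name in PRODUCT_DICTIONARY if name in text]
    candidates += VERSION_RULE.findall(text)
    return candidates


def extract_price_candidates(text):
    """Collect "price" candidates matched by the number-plus-currency rule."""
    return PRICE_RULE.findall(text)
```

Note that the version rule deliberately over-generates: expressions such as "500 GB version" are returned as product-name candidates even though they do not name a product on their own, which is the kind of candidate the later ambiguity determination has to deal with.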

For example, from the document of FIG. 2 the relation information extraction unit 102 can extract "Dbook", "500 GB version", "1 TB version", "LNote", and so on as value candidates for the attribute "product name". Likewise, it can extract "50,000 yen", "100,000 yen", "30,000 yen", and so on as value candidates for the attribute "price". The relation information extraction unit 102 also derives the features of the extracted attribute value candidates and outputs them to the learning unit 103.

Next, the relation information extraction unit 102 obtains relation information candidates by combining the extracted value candidates of the attributes (step S204). For example, it can obtain relation information candidates such as ("Dbook", "50,000 yen"), ("Dbook", "100,000 yen"), and ("500 GB version", "50,000 yen"). The relation information extraction unit 102 also derives the features of the relation information candidates (for example, the words that appear in the document between the "product name" value candidate and the "price" value candidate) and outputs them to the learning unit 103.
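Step S204 can be sketched as a Cartesian product over the candidate positions, with the words between the two candidates kept as a feature. The token list, the position-based interface, and the dictionary keys below are assumptions made for illustration; the source does not prescribe a data structure.

```python
from itertools import product


def build_relation_candidates(tokens, name_positions, price_positions):
    """Pair every product-name candidate with every price candidate.

    tokens: the tokenized document (list of strings).
    name_positions / price_positions: indices into tokens of the
    attribute value candidates found in the earlier steps.
    """
    candidates = []
    for n, p in product(name_positions, price_positions):
        lo, hi = sorted((n, p))
        candidates.append({
            "name": tokens[n],
            "price": tokens[p],
            # Feature: the words appearing between the two candidates.
            "between": tokens[lo + 1:hi],
            # Feature: how many words separate the two candidates.
            "distance": hi - lo - 1,
        })
    return candidates
```

Pairing everything with everything produces many incorrect candidates by design; ranking them is left to the learned feature weights and the score calculation.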

Here, a feature is a clue that characterizes an attribute value candidate or a relation information candidate. Specifically, usable features include the part of speech or meaning of the attribute value candidates (words) that form a relation information candidate, the position at which each attribute value candidate appears in the document, and the number of words that appear in the document between a value candidate of one attribute and a value candidate of another attribute.

In the example of FIG. 3, the relation information extraction unit 102 extracts value candidates for the attributes and obtains relation information candidates by combining them. However, the relation information extraction unit 102 may instead extract relation information candidates representing relations between attribute values directly, without a separate step of extracting attribute value candidates.

The learning unit 103 receives, from the relation information extraction unit 102, the relation information candidates and their features, together with the features of the attribute value candidates. Based on these features, the learning unit 103 learns feature weights (step S205). Specifically, the learning unit 103 learns the importance (weight) of each feature from correct examples (positive examples) and incorrect examples (negative examples) given in advance. The learning unit 103 outputs the relation information candidates and the learned feature weights to the calculation unit 104.
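As a rough illustration of learning per-feature weights from positive and negative examples, the sketch below uses a simple perceptron update. This is a stand-in chosen for brevity, not the method of the source, and the feature names, function name, and parameters are hypothetical.

```python
def learn_feature_weights(examples, epochs=10, learning_rate=1.0):
    """Learn one weight per feature with perceptron updates.

    examples: list of (features, is_correct) pairs, where features is a
    set of feature names and is_correct is True for a positive example
    and False for a negative example.
    """
    weights = {}
    for _ in range(epochs):
        for features, is_correct in examples:
            activation = sum(weights.get(f, 0.0) for f in features)
            predicted = activation > 0
            if predicted != is_correct:
                # Move the weights of the active features toward the label.
                delta = learning_rate if is_correct else -learning_rate
                for f in features:
                    weights[f] = weights.get(f, 0.0) + delta
    return weights
```

Features that never take part in a misclassification keep an implicit weight of zero, so callers should read weights with a default value.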

The learning unit 103 can use any machine learning method, or other similar technique, that makes use of relation information candidates, attribute value candidates, correct examples, and incorrect examples. Specifically, techniques such as Markov Logic Networks, Support Vector Machines, and Conditional Random Fields can be used.

In the example of FIG. 3, the learning unit 103 performs learning on each run. This learning can be omitted by using, for example, weights learned in advance, or weights set in advance manually or by another technique.

The calculation unit 104 receives the relation information candidates and the feature weights from the learning unit 103. Based on the feature weights, the calculation unit 104 calculates (estimates), for each relation information candidate, a score representing its likelihood and associates the score with the candidate, as illustrated in FIG. 6 (step S206). This yields a relation information group containing at least one relation information item. Each relation information item may be given an ID for identification. The score is typically a probability value, but may be any index that represents the likelihood of the result. The calculation unit 104 (or the determination unit 105 described later) may perform filtering that excludes candidates whose score is below a threshold from the relation information group. Such filtering removes candidates that are likely to represent an incorrect relation. The calculation unit 104 outputs the relation information group to the determination unit 105.
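The score calculation and threshold filtering can be sketched as follows. Mapping the weighted feature sum to a probability-like value with a logistic function is an assumption made for the example (the source only requires some index of likelihood), and the function names, dictionary keys, and default threshold are hypothetical.

```python
import math


def score_candidate(features, weights, bias=0.0):
    """Map the weighted sum of active features to a value in (0, 1)."""
    z = bias + sum(weights.get(f, 0.0) for f in features)
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def build_relation_group(candidates, weights, threshold=0.5):
    """Score each candidate, drop those below the threshold, assign IDs."""
    group = []
    for candidate in candidates:
        score = score_candidate(candidate["features"], weights)
        if score >= threshold:
            group.append({"id": len(group) + 1, **candidate, "score": score})
    return group
```

IDs are assigned after filtering here, so they are contiguous within the resulting group; assigning them before filtering would be equally consistent with the description.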

For use in calculating the score of a relation information candidate, the calculation unit 104 may also calculate scores representing the likelihood of the attribute value candidates that form the relation information candidate (for example, how likely a candidate is to be a product name or a price), as illustrated in FIG. 4 and FIG. 5.

The determination unit 105 receives the relation information group from the calculation unit 104. The determination unit 105 determines whether each relation information item belonging to the group is ambiguous (in other words, whether the relation it represents is unclear) according to various determination criteria, including, for example, the first and second determination criteria described below. The determination unit 105 passes relation information determined to be ambiguous to the complementary information extraction unit 106 and passes the remaining relation information to the output unit 107.

For example, the relation information group of FIG. 6 contains multiple relation information items (relation information IDs 1 and 2) that have the same product name ("Dbook") but different prices ("50,000 yen" and "100,000 yen"). Both items have high scores, so neither is likely to be an error, yet at first glance they contradict each other. Even if these items are presented, the user (who may also be called an analyst) cannot tell whether the price of "Dbook" is "50,000 yen" or "100,000 yen".

Accordingly, when the value candidate of the first attribute (for example, "product name") forming any first relation information is identical to the value candidate of the first attribute forming a different, second relation information, and the value candidate of the second attribute (for example, "price") forming the first relation information differs from the value candidate of the second attribute forming the second relation information, the determination unit 105 may determine that the first relation information and the second relation information are ambiguous (first determination criterion).

The relation information group of FIG. 6 also contains a relation information item (relation information ID 4) whose product name is a common noun ("1 TB version"). This item also has a high score and is unlikely to be an error, but its product name does not refer to a specific product ("Dbook" or "LNote"). Even if this item is presented, the user cannot tell which product's "1 TB version" costs "100,000 yen".

Accordingly, when the value candidate of the first attribute contained in any relation information corresponds to a specific type of word (for example, a common noun), the determination unit 105 may determine that the relation information is ambiguous (second determination criterion).
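The first and second determination criteria can be sketched together as follows. The dictionary keys and the `generic_names` set are assumptions: the set stands in for a part-of-speech check that would identify common nouns, which the source does not spell out.

```python
from collections import defaultdict


def find_ambiguous_ids(relations, generic_names):
    """Return the IDs of relation information items judged ambiguous.

    relations: list of dicts with "id", "name" (first attribute value)
    and "price" (second attribute value).
    generic_names: hypothetical set of first-attribute values known to
    be common nouns rather than specific product names.
    """
    prices_by_name = defaultdict(set)
    for r in relations:
        prices_by_name[r["name"]].add(r["price"])

    ambiguous = set()
    for r in relations:
        # First criterion: same first-attribute value, different
        # second-attribute values.
        same_name_different_price = len(prices_by_name[r["name"]]) > 1
        # Second criterion: the first-attribute value is a specific
        # type of word, for example a common noun.
        is_generic = r["name"] in generic_names
        if same_name_different_price or is_generic:
            ambiguous.add(r["id"])
    return ambiguous
```

With made-up sample data (not a reproduction of FIG. 6) in which items 1 and 2 are "Dbook" at two prices, item 3 is "1 TB version", and item 4 is "LNote", the function flags items 1, 2, and 3 and leaves item 4 for direct output.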

For the relation information passed from the determination unit 105, the complementary information extraction unit 106 extracts complementary information for reducing the ambiguity, complements the relation information with it, and passes the result to the output unit 107. The complementary information relates to at least one of the attribute value candidates that form the relation information.

Specifically, the complementary information extraction unit 106 may extract complementary information by analyzing or rewriting the text, for example through dependency parsing, anaphora resolution, or paraphrasing, for at least one of the attribute value candidates that form the relation information. Alternatively, for at least one of those attribute value candidates, the complementary information extraction unit 106 may extract as complementary information a word that corresponds to a broader or narrower concept of the candidate, or a word that describes the candidate in more detail.

The complementary information extraction unit 106 may extract complementary information as shown, for example, in FIG. 10. The complementary information extraction processing of FIG. 10 is performed when there are multiple relation information items whose first-attribute values are identical and whose second-attribute values differ (step S401). For example, the processing of FIG. 10 is performed on the relation information with IDs 1 and 2 in FIG. 6, and on the relation information with IDs 3 and 5.

For each of the differing second-attribute values, the complementary information extraction unit 106 searches for first-attribute values related to that second-attribute value as complementary information candidates (step S402). For example, for the relation information with IDs 1 and 2 in FIG. 6, the complementary information extraction unit 106 extracts "500 GB version" as a "product name" value candidate related to "50,000 yen", and "1 TB version" as a "product name" value candidate related to "100,000 yen". The extracted "500 GB version" and "1 TB version" may complement the relation information with IDs 1 and 2 and allow the two to be distinguished more clearly.

The complementary information extraction unit 106 may narrow down the complementary information candidates to search for based on some or all of the following: the total number of relation information items containing the second-attribute value of interest; the word type of the complementary information candidate; the score given to the relation information containing the complementary information candidate; the total number of attributes that share a common value across multiple relation information items (when the relation information contains three or more attributes); and the semantic hypernym-hyponym relations between values of the first attribute. The complementary information extraction unit 106 may also extract complementary information candidates by performing dependency parsing, anaphora resolution, or paraphrasing on the first-attribute value contained in each relation information item.

Next, the complementary information extraction unit 106 classifies the complementary information candidates found in step S402 according to their linguistic features (step S403). Specifically, it may classify the candidates by part of speech, or by the pattern matching rule that the relation information extraction unit 102 used to extract them. For example, "500 GB version" and "1 TB version" both have the part-of-speech pattern "[number] + [unit] + [noun]", and both may have been extracted using the pattern matching rule that targets the expression "XX version". The complementary information extraction unit 106 can therefore classify them into the same candidate group.

Next, the complementary information extraction unit 106 selects, from the candidate groups obtained in step S403, the one with the most elements, and treats each candidate belonging to that group as complementary information (step S404). In other words, in the example of FIG. 10, the complementary information can be regarded as the value of a third attribute split off from the first attribute. For example, as shown in FIG. 7, the complementary information extraction unit 106 adds the complementary information "500 GB version", which relates to "50,000 yen", to the relation information containing "Dbook" and "50,000 yen", and adds the complementary information "1 TB version", which relates to "100,000 yen", to the relation information containing "Dbook" and "100,000 yen".
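Steps S402 to S404 can be sketched as follows. This is a simplified reading under stated assumptions: `linguistic_class` is a hypothetical regular-expression stand-in for part-of-speech classification, "related to" is approximated as "appears in another relation information item with the same price", and the dictionary keys are invented for the example.

```python
import re
from collections import defaultdict


def linguistic_class(word):
    """Hypothetical stand-in for classification by part-of-speech pattern."""
    if re.fullmatch(r"\d+\s?[A-Za-z]+ version", word):
        return "number+unit+noun"
    return "other"


def extract_complements(relations, target_name):
    """Attach complementary information to the items named target_name."""
    targets = [r for r in relations if r["name"] == target_name]

    # S402: for each differing price, look for other first-attribute
    # values that are related to that price.
    groups = defaultdict(list)
    for t in targets:
        for r in relations:
            if r["price"] == t["price"] and r["name"] != target_name:
                # S403: classify each candidate by linguistic feature.
                groups[linguistic_class(r["name"])].append((t["price"], r["name"]))

    if not groups:
        return [dict(t, complement=None) for t in targets]

    # S404: adopt the candidate group with the most elements. If a price
    # has several candidates in that group, the last one found is kept.
    largest = max(groups.values(), key=len)
    complement_by_price = dict(largest)
    return [dict(t, complement=complement_by_price.get(t["price"])) for t in targets]
```

Choosing the largest group favors candidates that follow one consistent pattern across all the conflicting items, which is what lets the complement behave like the value of a separate third attribute.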

After the complementary information extraction processing of FIG. 10 completes, the determination unit 105 may perform its determination again, and the complementary information extraction processing may be repeated as necessary. If ambiguous relation information still remains, the learning unit 103 may perform learning using the unambiguous relation information as correct examples.

The complementary information need not remove the ambiguity of the relation information completely; it is sufficient that it contributes, for example, to distinguishing between multiple relation information items. For example, the complementary information extraction unit 106 may calculate what proportion (percentage) of the relation information items that contain the same attribute value become distinguishable when a given complementary information candidate is added, and adopt the candidate if that proportion exceeds a threshold.
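The adoption test described above can be sketched as follows. Treating an item as "distinguishable" when its complement value is unique within the group is an interpretation made for this example, and the function names and default threshold are hypothetical.

```python
from collections import Counter


def distinguishable_ratio(complements):
    """Fraction of relation information items made distinguishable.

    complements: one entry per relation information item sharing the
    same attribute value; each entry is the complement found for that
    item, or None if no complement was found.
    """
    if not complements:
        return 0.0
    counts = Counter(c for c in complements if c is not None)
    distinguishable = sum(
        1 for c in complements if c is not None and counts[c] == 1
    )
    return distinguishable / len(complements)


def should_adopt(complements, threshold=0.5):
    """Adopt the complementary information if the ratio exceeds the threshold."""
    return distinguishable_ratio(complements) > threshold
```

For three conflicting items where two receive distinct complements and one receives none, the ratio is two thirds, so the candidate is adopted at a threshold of 0.5 even though some ambiguity remains.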

The output unit 107 receives relation information (which may include complementary information) from the determination unit 105 or the complementary information extraction unit 106, and presents it to the user. For example, the output unit 107 may show the relation information on a display in tabular form, or may read its content aloud through a speaker using speech synthesis. In addition to the relation information, the output unit 107 may also present information about where in the document the value of each attribute forming the relation information appears (for example, a quotation of the text around that location).

As described above, the information extraction apparatus according to the first embodiment extracts from a document relation information representing relations between the values of multiple attributes. When the relation information is ambiguous, the apparatus further extracts complementary information related to the value of one of the attributes forming the relation information, and complements the relation information with it. This information extraction apparatus can therefore reduce or remove the ambiguity of relation information and present information that contributes to analysis of the document (for example, unambiguous relation information and material for judging its validity).

(Second Embodiment)
As illustrated in FIG. 8, the information extraction apparatus 300 according to the second embodiment includes an input unit 101, a relationship information extraction unit 102, a learning unit 103, a calculation unit 104, a determination unit 105, a complementary information extraction unit 106, an output unit 107, an evaluation information input unit 308, and an evaluation result storage unit 309. That is, the information extraction apparatus 300 corresponds to the information extraction apparatus 100 of FIG. 1 with the evaluation information input unit 308 and the evaluation result storage unit 309 added.

The evaluation information input unit 308 accepts, from a user (who may also be called an evaluator), an evaluation of the validity of the relationship information presented by the output unit 107. The evaluation information input unit 308 stores the received evaluation result in the evaluation result storage unit 309 in association with the relationship information.

The evaluation information input unit 308 works together with the output unit 107 to function as a relationship information evaluation UI (User Interface). As shown for example in FIG. 9, this UI may display the relationship information to be evaluated (for example, in tabular form), information that serves as material for judging the validity of the relationship information (for example, quotations of the text surrounding the occurrence of each attribute value forming the relationship information), and GUI components for entering an evaluation (for example, a ○ button and a × button).

The evaluation result storage unit 309 stores evaluated relationship information together with its evaluation results. Relationship information with a good evaluation result (for example, one at or above a threshold) may be used, for example, as a correct example in the learning performed by the learning unit 103. Alternatively, the complementary information extraction unit 106 may use the information stored in the evaluation result storage unit 309 to judge whether a complementary information candidate is appropriate. For example, the complementary information extraction unit 106 may judge a given candidate to be appropriate and adopt it on condition that the number or proportion of well-evaluated relationship information items containing that candidate is at or above a threshold.
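The threshold condition described above might be sketched as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the numeric rating scale (1.0 for ○, 0.0 for ×), the choice of denominator for the proportion (all stored relations containing the candidate), and the names `EvaluatedRelation` and `is_candidate_appropriate` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvaluatedRelation:
    first: str                  # value of the first attribute
    second: str                 # value of the second attribute
    complement: Optional[str]   # complementary information attached to the relation
    score: float                # evaluator's rating (assumed scale: 1.0 = ○, 0.0 = ×)


def is_candidate_appropriate(
    candidate: str,
    store: List[EvaluatedRelation],
    good_threshold: float = 0.5,
    min_count: Optional[int] = None,
    min_ratio: Optional[float] = None,
) -> bool:
    """Accept a candidate if enough well-evaluated stored relations contain it.

    Either the count test or the ratio test may be enabled; the candidate is
    accepted when any enabled test passes.
    """
    containing = [r for r in store if r.complement == candidate]
    good = [r for r in containing if r.score >= good_threshold]
    if min_count is not None and len(good) >= min_count:
        return True
    if min_ratio is not None and containing and len(good) / len(containing) >= min_ratio:
        return True
    return False


if __name__ == "__main__":
    store = [
        EvaluatedRelation("PC-A", "$500", "128 GB SSD", 1.0),
        EvaluatedRelation("PC-A", "$700", "128 GB SSD", 0.0),
        EvaluatedRelation("PC-B", "$400", "128 GB SSD", 1.0),
    ]
    print(is_candidate_appropriate("128 GB SSD", store, min_count=2))
```

A candidate that has never been evaluated is rejected here; a real system might instead fall back to another criterion in that case.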

As described above, the information extraction apparatus according to the second embodiment receives feedback from an evaluator on the validity of the extracted relationship information (which may include complementary information). This information extraction apparatus can therefore improve the accuracy with which relationship information or complementary information is extracted, based on the evaluation results.

The information extraction apparatus according to the present embodiment may be implemented as a single hardware device, or some of its functions may be executed on an external server connected to a network. The information extraction apparatus can also be implemented by a general computer equipped with a control device such as a CPU, storage devices such as memory, ROM, and RAM, an external storage device such as an HDD, a display device such as a display, and input devices such as a keyboard and a mouse.

The various functional units described in the above embodiments may be realized using circuits. A circuit may be a dedicated circuit that implements a specific function, or a general-purpose circuit such as a processor.

At least part of the processing in each of the above embodiments can also be realized by using a general-purpose computer as the basic hardware. A program that realizes this processing may be provided stored on a computer-readable recording medium. The program is stored on the recording medium as a file in an installable format or an executable format. Examples of the recording medium include magnetic disks, optical discs (CD-ROM, CD-R, DVD, etc.), magneto-optical disks (MO, etc.), and semiconductor memory. Any recording medium may be used as long as it can store the program and can be read by a computer. The program that realizes the above processing may also be stored on a computer (server) connected to a network such as the Internet and downloaded to a computer (client) over the network.

While several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and their modifications fall within the scope and gist of the invention, and within the invention described in the claims and its equivalents.

100, 300 ... information extraction apparatus
101 ... input unit
102 ... relationship information extraction unit
103 ... learning unit
104 ... calculation unit
105 ... determination unit
106 ... complementary information extraction unit
107 ... output unit
308 ... evaluation information input unit
309 ... evaluation result storage unit

Claims

An information extraction device comprising:
a relationship information extraction unit that obtains a relation information group by extracting, from a document, at least one piece of relation information representing a relationship between a value of a first attribute and a value of a second attribute; and
a complementary information extraction unit that, when first relation information belonging to the relation information group is determined to have ambiguity according to a determination criterion, extracts complementary information related to at least one of the value of the first attribute and the value of the second attribute forming the first relation information.

The information extraction device according to claim 1, further comprising a determination unit that determines that the first relation information and second relation information belonging to the relation information group have ambiguity when the value of the first attribute forming the first relation information is the same as the value of the first attribute forming the second relation information and the value of the second attribute forming the first relation information differs from the value of the second attribute forming the second relation information.

The information extraction device according to claim 1, further comprising a determination unit that determines that the first relation information has ambiguity when the value of the first attribute included in the first relation information corresponds to a specific type of word.

The information extraction device according to claim 2, wherein the complementary information extraction unit, when it is determined that the first relation information has ambiguity, performs dependency analysis, anaphora analysis, or paraphrasing on at least one of the value of the first attribute and the value of the second attribute forming the first relation information to extract the complementary information.

The information extraction device according to claim 2, wherein the complementary information extraction unit, when it is determined that the first relation information has ambiguity, extracts, as the complementary information, a word corresponding to a superordinate or subordinate concept of the value of the first attribute forming the first relation information, or a word representing details of the value of the first attribute.

The information extraction device according to claim 2, wherein the complementary information extraction unit, when it is determined that the first relation information has ambiguity, extracts, as the complementary information, any one of the values of the first attribute forming at least one piece of third relation information that includes the same value of the second attribute as the first relation information.

The information extraction device according to claim 1, wherein the complementary information extraction unit, when it is determined that the first relation information has ambiguity, extracts the complementary information and complements the first relation information using the complementary information.

The information extraction device according to claim 1, further comprising an output unit that outputs the first relation information.

The information extraction device according to claim 8, further comprising: an evaluation input unit that receives, from an evaluator, an evaluation result of the validity of the first relation information; and a storage unit that stores the evaluation result in association with the first relation information.

The information extraction device according to claim 8, wherein the output unit outputs the first relation information in a tabular form.

The information extraction device according to claim 1, further comprising:
a learning unit that learns a weight of a feature of the first relation information; and
a calculation unit that calculates a score representing the likelihood of the first relation information based on the weight of the feature.

An information extraction method implemented by a computer, comprising:
obtaining a relation information group by extracting, from a document, at least one piece of relation information representing a relationship between a value of a first attribute and a value of a second attribute; and
extracting, when first relation information belonging to the relation information group is determined to have ambiguity according to a determination criterion, complementary information related to at least one of the value of the first attribute and the value of the second attribute forming the first relation information.

An information extraction program for causing a computer to function as:
means for obtaining a relation information group by extracting, from a document, at least one piece of relation information representing a relationship between a value of a first attribute and a value of a second attribute; and
means for extracting, when first relation information belonging to the relation information group is determined to have ambiguity according to a determination criterion, complementary information related to at least one of the value of the first attribute and the value of the second attribute forming the first relation information.