JP6499537B2

JP6499537B2 - Connection expression structure analysis apparatus, method, and program

Info

Publication number: JP6499537B2
Application number: JP2015141649A
Authority: JP
Inventors: 平尾　努; 努平尾; 康久吉田; 林　克彦; 克彦林; 永田　昌明; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2015-07-15
Filing date: 2015-07-15
Publication date: 2019-04-10
Anticipated expiration: 2035-07-15
Also published as: JP2017027111A

Description

The present invention relates to a connection expression term structure analysis apparatus, method, and program, and more particularly to a connection expression term structure analysis apparatus, method, and program for extracting the terms connected by a connection expression from a given document.

Conventionally, there is a technique for extracting, from a given document, a connection expression, its semantic class, and the text spans (portions of the text in the document) that form the two terms connected by the connection expression. This technique is called connection expression-term structure analysis. For example, in the following sentence, term 1, "He caught a cold", and term 2, "he got soaked in the rain", are linked by the connection expression "because" with the meaning of a causal relationship.

He caught a cold because he got soaked in the rain.

By collecting a large number of such pairs of semantically linked text spans for each semantic class and using them as a knowledge source, the quality of various natural language processing tasks (entailment recognition, document summarization, machine translation, and the like) can be improved.

The relationship between a connection expression and its terms is broadly either explicit or implicit. In the explicit case, the connection expression itself appears in the text, for example when two spans are connected by "because", which expresses a causal relationship, or "after", which expresses a progression in time. In the implicit case, no connection expression appears, yet a pair of sentences still semantically expresses a causal relationship or a progression in time. For example, a causal relationship holds between the following two sentences.

It had been raining since morning.
The baseball game was also canceled.

The conventional connection expression-term structure analysis technique (Non-Patent Document 1) extracts connection expressions, semantic classes, and terms from a document using a connection expression term structure analysis apparatus configured as shown in FIG. 8.

In the conventional connection expression term structure analysis apparatus, explicit connection expression-term structures are extracted by the following steps (1) to (4).

(1) All connection expressions stored in the connection expression candidate dictionary are extracted from the input document, and each is judged as to whether it is a connection expression that has terms (connection expression extraction unit).
(2) When an expression is judged to be a connection expression that has terms, its semantic class is determined using a classifier.
(3) It is then determined whether the two terms of the connection expression (term 1 and term 2) appear in the same sentence (SS) or whether term 1 appears in the preceding sentence (PS).
(4) Depending on the case, the two terms corresponding to the connection expression are extracted using the intra-sentence term extraction unit or the inter-sentence term extraction unit.

Implicit connection expression-term structures are extracted by the following steps (1) and (2).

(1) Pairs of adjacent sentences in the document are extracted (term 1 is taken from the earlier sentence and term 2 from the later sentence), and a semantic class is assigned to each pair using the semantic class classification unit. A plurality of semantic classes are assumed to be defined in advance; Non-Patent Document 1 uses four semantic classes: Expansion, Contingency, Temporal, and Comparison. The semantic class classification unit outputs one of the semantic classes when a connection relationship exists between the two sentences, and outputs that there is no connection relationship otherwise. In other words, the semantic class classification unit classifies the relationship between two sentences into (number of semantic classes + 1) classes.
(2) For each pair of sentences to which a semantic class has been assigned, the terms are extracted using the inter-sentence term extraction unit.

In this way, the conventional connection expression term structure analysis apparatus extracts explicit connection expression-term structures and implicit connection expression-term structures from a document.

[Non-Patent Document 1]: Lin Ziheng, Ng Hwee Tou, and Kan Min-Yen. 2014. A PDTB-styled End-to-End Discourse Parser. Natural Language Engineering, 20:151-184.

Conventional connection expression-term structure extraction methods extract terms without considering the discourse structure of the document or of its sentences. In implicit connection expression-term extraction, only pairs of adjacent sentences are subject to term extraction. Even in explicit connection expression-term extraction, when term 1 and term 2 do not both appear in the sentence containing the connection expression, term 1 is extracted from the sentence immediately preceding that sentence. However, connection relationships that span more than one sentence are not limited to adjacent sentence pairs, so connection expression-term structures that should be extracted are missed.

The present invention has been made to solve the above problem, and an object of the present invention is to provide a connection expression term structure analysis apparatus, method, and program capable of extracting terms connected by a connection expression even from sentences that are not adjacent to each other.

In order to achieve the above object, a connection expression term structure analysis apparatus according to a first invention includes: a discourse structure analysis unit that, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of the sentences included in the document and represents each of the sentences as a node; a syntax analysis unit that performs syntax analysis on each of the sentences included in the document to generate a syntax tree; a connection expression extraction unit that extracts a connection expression having terms based on the syntax tree generated for each of the sentences by the syntax analysis unit; a term positional relationship determination unit that determines, for the connection expression extracted by the connection expression extraction unit, whether the two terms connected by the connection expression appear in the sentence containing the connection expression; an intra-sentence term extraction unit that, when the term positional relationship determination unit determines that the two terms appear in the sentence containing the connection expression, extracts the two terms connected by the connection expression from that sentence; an inter-sentence term extraction unit that, when the term positional relationship determination unit determines that the two terms do not appear in the sentence containing the connection expression, extracts one of the two terms from the sentence containing the connection expression and extracts the other of the two terms from a sentence corresponding to a parent node or a sibling node of the sentence containing the connection expression in the discourse structure tree generated by the discourse structure analysis unit; and a semantic class classification unit that classifies the semantic class of the connection expression based on the connection expression extracted by the connection expression extraction unit.

A connection expression term structure analysis apparatus according to a second invention includes: a discourse structure analysis unit that, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of the sentences included in the document and represents each of the sentences as a node; a related sentence pair extraction unit that, based on the discourse structure tree generated by the discourse structure analysis unit, takes pairs of sentences corresponding to parent and child nodes and pairs of sentences corresponding to sibling nodes as candidate sentence pairs having a connection relationship, and determines for each candidate pair whether a connection relationship exists; an inter-sentence term extraction unit that, for each candidate pair determined by the related sentence pair extraction unit to have a connection relationship, extracts from the pair the two terms connected by an implicit connection expression; and a semantic class classification unit that, for each candidate pair determined by the related sentence pair extraction unit to have a connection relationship, classifies the semantic class of the implicit connection expression based on the pair.

A connection expression term structure analysis method according to a third invention includes: a step in which a discourse structure analysis unit, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of the sentences included in the document and represents each of the sentences as a node; a step in which a syntax analysis unit performs syntax analysis on each of the sentences included in the document to generate a syntax tree; a step in which a connection expression extraction unit extracts a connection expression having terms based on the syntax tree generated for each of the sentences by the syntax analysis unit; a step in which a term positional relationship determination unit determines, for the connection expression extracted by the connection expression extraction unit, whether the two terms connected by the connection expression appear in the sentence containing the connection expression; a step in which an intra-sentence term extraction unit, when the term positional relationship determination unit determines that the two terms appear in the sentence containing the connection expression, extracts the two terms connected by the connection expression from that sentence; a step in which an inter-sentence term extraction unit, when the term positional relationship determination unit determines that the two terms do not appear in the sentence containing the connection expression, extracts one of the two terms from the sentence containing the connection expression and extracts the other of the two terms from a sentence corresponding to a parent node or a sibling node of the sentence containing the connection expression in the discourse structure tree generated by the discourse structure analysis unit; and a step in which a semantic class classification unit classifies the semantic class of the connection expression based on the connection expression extracted by the connection expression extraction unit.

A connection expression term structure analysis method according to a fourth invention includes: a step in which a discourse structure analysis unit, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of the sentences included in the document and represents each of the sentences as a node; a step in which a related sentence pair extraction unit, based on the discourse structure tree generated by the discourse structure analysis unit, takes pairs of sentences corresponding to parent and child nodes and pairs of sentences corresponding to sibling nodes as candidate sentence pairs having a connection relationship, and determines for each candidate pair whether a connection relationship exists; a step in which an inter-sentence term extraction unit, for each candidate pair determined by the related sentence pair extraction unit to have a connection relationship, extracts from the pair the two terms connected by an implicit connection expression; and a step in which a semantic class classification unit, for each candidate pair determined by the related sentence pair extraction unit to have a connection relationship, classifies the semantic class of the implicit connection expression based on the pair.

A program according to a fifth invention is a program for causing a computer to function as each of the units constituting the connection expression term structure analysis apparatus according to the first or second invention.

According to the connection expression term structure analysis apparatus, method, and program of the present invention, terms connected by a connection expression can be extracted even from sentences that are not adjacent to each other.

FIG. 1 is a block diagram showing the configuration of the connection expression term structure analysis apparatus according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of a structure tree subject to extraction in intra-sentence term extraction for a subordinating connection.
FIG. 3 is a diagram showing an example of a structure tree subject to extraction in intra-sentence term extraction for a coordinating connection.
FIG. 4 is a diagram showing another example of a structure tree subject to extraction in intra-sentence term extraction for a coordinating connection.
FIG. 5 is a flowchart showing the connection expression term structure analysis processing routine in the connection expression term structure analysis apparatus according to the embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of the connection expression term structure analysis apparatus according to the second embodiment of the present invention.
FIG. 7 is a flowchart showing the connection expression term structure analysis processing routine in the connection expression term structure analysis apparatus according to the second embodiment of the present invention.
FIG. 8 is an example of a block diagram showing the configuration of a conventional connection expression term structure analysis apparatus.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Configuration of the connection expression term structure analysis apparatus according to the first embodiment of the present invention>

First, the configuration of the connection expression term structure analysis apparatus according to the first embodiment of the present invention will be described. The apparatus according to the first embodiment extracts, from a document, the connection expressions, terms, and semantic labels of explicit connection expressions.

As shown in FIG. 1, the connection expression term structure analysis apparatus 100 according to the first embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the connection expression term structure analysis processing routine described later. Functionally, as shown in FIG. 1, the connection expression term structure analysis apparatus 100 includes an input unit 10, a computation unit 20, and an output unit 50.

The input unit 10 receives a document to be analyzed.

The computation unit 20 includes a sentence division unit 30, a discourse structure analysis unit 32, a syntax analysis unit 34, a connection expression extraction unit 36, a term positional relationship determination unit 38, an intra-sentence term extraction unit 40, an inter-sentence term extraction unit 42, and a semantic class classification unit 44.

The sentence division unit 30 acquires the document received by the input unit 10 and marks the sentence boundaries in the document. Sentence boundaries are identified using an existing sentence splitter; alternatively, sentence-final periods alone may be used as cues. A document that has already been divided into sentences may also be received by the input unit 10, in which case the processing of the sentence division unit 30 may be omitted.

The discourse structure analysis unit 32 generates, from the document to which sentence boundaries have been given by the sentence division unit 30, a discourse structure tree that is based on the rhetorical structure of the sentences included in the document and represents each of the sentences as a node. The discourse structure tree expresses parent-child relationships between the sentence nodes. These parent-child relationships can be determined by generating an RST tree with a rhetorical structure analyzer such as that of Non-Patent Document 2 and then applying the rules described in Non-Patent Document 3. Generating an RST tree is not strictly necessary; the parent-child relationships between sentence nodes can also be obtained with an analyzer trained on rhetorical structure tree data that represents such relationships.

[Non-Patent Document 2]: duVerle, D. and Prendinger, H. 'A Novel Discourse Parser Based on Support Vector Machine Classification'. Proc. of the 47th ACL, pp. 665-675 (2009).

[Non-Patent Document 3]: Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda and Masaaki Nagata. 'Single-Document Summarization as a Tree Knapsack Problem'. Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1515-1520 (2013).

The syntax analysis unit 34 performs syntax analysis on each of the sentences included in the document to which sentence boundaries have been given by the sentence division unit 30, and generates a syntax tree for each sentence. Since various software has been developed for syntax analysis, the syntax tree of each sentence may be generated using existing software.

The connection expression extraction unit 36 extracts connection expressions having terms based on the syntax tree generated for each sentence by the syntax analysis unit 34. Specifically, the connection expression extraction unit 36 first looks up the words appearing in the document in the dictionary entry expressions of a connection expression candidate dictionary (not shown) prepared manually in advance, and extracts the words that match a dictionary entry expression. It then judges whether each matching word is a connection expression that takes terms, using a binary classifier such as an SVM or logistic regression trained on training data annotated with whether each dictionary entry expression takes terms, and extracts the connection expressions that take terms. Features such as the following (1) to (5) may be used to judge whether a word appearing in the document is a connection expression that takes terms.

(1) The dictionary entry expression and its part of speech
(2) The five words before and after the dictionary entry expression and their parts of speech
(3) The depth of the dictionary entry expression in the syntax tree
(4) The parent, left sibling, and right sibling of the dictionary entry expression in the syntax tree
(5) The path from the dictionary entry expression to the root in the syntax tree
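The structural features (3) to (5) can be read directly off a syntax tree. The following is a minimal sketch, not the implementation described in this document: the `Node` class, the function name `structural_features`, and the feature keys are hypothetical, and the tree for "I like him because he is honest" is reconstructed from the description of FIG. 2 (the internal structure of each clause is an assumption). Which node stands for the dictionary entry expression (the word itself or the part-of-speech node above it) is also an assumption; the sketch accepts either.

```python
class Node:
    """Syntax-tree node; a leaf node's label is the word itself."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def structural_features(node):
    """Features (3)-(5): depth, parent, left/right sibling, path to the root."""
    siblings = node.parent.children if node.parent else [node]
    pos = next(i for i, s in enumerate(siblings) if s is node)
    path = []  # labels of the ancestors, nearest first
    cur = node.parent
    while cur is not None:
        path.append(cur.label)
        cur = cur.parent
    return {
        "depth": len(path),
        "parent": path[0] if path else None,
        "left_sibling": siblings[pos - 1].label if pos > 0 else None,
        "right_sibling": siblings[pos + 1].label if pos + 1 < len(siblings) else None,
        "path_to_root": "/".join(path),
    }


# Hypothetical tree for "I like him because he is honest".
because = Node("because")
in_node = Node("IN", [because])
tree = Node("S", [
    Node("NP", [Node("PRP", [Node("I")])]),
    Node("VP", [
        Node("VBP", [Node("like")]),
        Node("NP", [Node("PRP", [Node("him")])]),
        Node("SBAR", [
            in_node,
            Node("S", [
                Node("NP", [Node("PRP", [Node("he")])]),
                Node("VP", [Node("VBZ", [Node("is")]),
                            Node("ADJP", [Node("JJ", [Node("honest")])])]),
            ]),
        ]),
    ]),
])

word_features = structural_features(because)
pos_features = structural_features(in_node)
```

In a classifier these values would be encoded as categorical features alongside the lexical features (1) and (2).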

The term positional relationship determination unit 38 determines, for each connection expression extracted by the connection expression extraction unit 36, whether the two terms connected by the connection expression appear in the sentence containing the connection expression. Specifically, as with the connection expression extraction unit 36, the term positional relationship determination unit 38 makes this determination using a binary classifier such as an SVM or logistic regression trained in advance on training data. In addition to features (1) to (5) used by the connection expression extraction unit 36, the position at which the connection expression appears (first part, middle, or latter part of the sentence, and so on) is also used as a feature.

When the term positional relationship determination unit 38 determines that the two terms connected by the connection expression do not both appear in the sentence containing the connection expression, it designates the sentence containing the connection expression as the sentence from which term 2 is to be extracted, and designates the sentence corresponding to the parent node or a sibling node of that sentence in the discourse structure tree generated by the discourse structure analysis unit 32 as the sentence from which term 1 is to be extracted.
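The lookup of the parent and sibling sentences in the discourse structure tree can be sketched as follows. This is an illustration under stated assumptions: the tree is encoded as a hypothetical `parent_of` mapping from sentence index to parent sentence index (`None` for the root), the function name is invented, and how a single sentence is chosen when several candidates exist is not specified in this passage, so the sketch only lists the candidates.

```python
def term1_candidate_sentences(parent_of, sent_id):
    """Return the parent and the siblings of a sentence in the discourse tree.

    parent_of maps each sentence index to the index of its parent sentence,
    with None for the root (a hypothetical encoding chosen for this sketch).
    """
    parent = parent_of[sent_id]
    if parent is None:
        return []  # the root sentence has neither a parent nor siblings
    siblings = sorted(s for s, p in parent_of.items()
                      if p == parent and s != sent_id)
    return [parent] + siblings


# Hypothetical five-sentence document: sentence 0 is the root,
# sentences 1, 2 and 4 depend on it, and sentence 3 depends on sentence 1.
parent_of = {0: None, 1: 0, 2: 0, 3: 1, 4: 0}

# Suppose the connection expression occurs in sentence 4.
candidates = term1_candidate_sentences(parent_of, 4)
```

In this example the sentence immediately preceding sentence 4 (sentence 3) is not a candidate, while the non-adjacent sentences 0, 1, and 2 are, which is the difference from the conventional preceding-sentence rule.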

When the term positional relationship determination unit 38 determines that the two terms connected by the connection expression appear in the sentence containing the connection expression, the intra-sentence term extraction unit 40 extracts term 1 and term 2 connected by the connection expression from that sentence and outputs them to the output unit 50.

Specifically, the intra-sentence term extraction unit 40 receives, from among the syntax trees generated by the syntax analysis unit 34, the syntax tree of the sentence containing the connection expression, and extracts term 1 and term 2 by applying the rules below according to whether the connection expression is a subordinating connection or a coordinating connection. The correspondence between each connection expression and subordinating or coordinating connection is given manually in advance.

First, the method of extracting term 1 and term 2 when the connection expression is a subordinating connection will be described.

When the connection expression is a subordinating connection, term 2 is extracted by the following steps (1) and (2).

(1) The node representing the last word of the target connection expression is assigned to a node variable x, which holds a node of the syntax tree.
(2) The parent node of x is assigned to x. This operation is repeated until the node assigned to x has the label SBAR or S, and the text span dominated by x at that point is taken as term 2.

FIG. 2 shows an example of the extraction. In the example of FIG. 2, "because" is first assigned to x. Since "because" is neither S nor SBAR, IN, the parent node of "because", is assigned to x. Since IN is neither S nor SBAR, SBAR, the parent node of IN, is assigned to x. Since x is now SBAR, the process ends, and the span dominated by the SBAR assigned to x, "because he is honest", is taken as term 2.

Next, when the connection expression is a subordinating connection, term 1 is extracted by the following steps (1) and (2). Here, x carries over the value it had when the procedure for term 2 ended.

(1) The parent node of x is assigned to x.
(2) Step (1) is repeated until the node assigned to x has the label SBAR or S. The text span dominated by x at that point is taken, and the span of term 2 is removed from it; the remainder is term 1.

In the example of FIG. 2, when term 2 has been determined, x holds the SBAR that dominates "because he is honest", so its parent node VP is assigned to x. Since VP is neither S nor SBAR, its parent node S is then assigned to x. Since x is now S, the process ends. The span dominated by S, "I like him because he is honest", is taken, and removing the span of term 2, "because he is honest", from it leaves the span "I like him", which is term 1.
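The two climbing procedures for a subordinating connection can be sketched together as follows. This is a minimal illustration, not the implementation of this document: the `Node` class and the function names are hypothetical, and the tree is reconstructed from the description of FIG. 2, with the internal structure of each clause assumed.

```python
class Node:
    """Syntax-tree node; a leaf node's label is the word itself."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaves(node):
    """Return the word nodes dominated by a node, left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def climb_to_clause(node):
    """Assign the parent to x repeatedly until x is labelled S or SBAR."""
    x = node.parent
    while x is not None and x.label not in ("S", "SBAR"):
        x = x.parent
    if x is None:
        raise ValueError("no enclosing S or SBAR node")
    return x


def extract_subordinate_terms(last_connective_word):
    """Return (term 1, term 2) for a subordinating connection expression."""
    x = climb_to_clause(last_connective_word)  # node dominating term 2
    term2 = leaves(x)
    x = climb_to_clause(x)                     # x carries over, climb again
    term2_ids = {id(n) for n in term2}
    term1 = [n for n in leaves(x) if id(n) not in term2_ids]
    return (" ".join(n.label for n in term1),
            " ".join(n.label for n in term2))


# Hypothetical tree for "I like him because he is honest".
because = Node("because")
tree = Node("S", [
    Node("NP", [Node("PRP", [Node("I")])]),
    Node("VP", [
        Node("VBP", [Node("like")]),
        Node("NP", [Node("PRP", [Node("him")])]),
        Node("SBAR", [
            Node("IN", [because]),
            Node("S", [
                Node("NP", [Node("PRP", [Node("he")])]),
                Node("VP", [Node("VBZ", [Node("is")]),
                            Node("ADJP", [Node("JJ", [Node("honest")])])]),
            ]),
        ]),
    ]),
])

term1, term2 = extract_subordinate_terms(because)
```

Term 2 is removed from the larger span by node identity rather than by string matching, so repeated words in the sentence do not affect the result; that choice is part of the sketch, not of the source.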

Next, the method of extracting term 1 and term 2 when the connection expression is a coordinating conjunction is described.

When the connection expression is a coordinating conjunction, term 2 is extracted by the following steps (1) to (3).

(1) Assign the node representing the last word of the target connection expression to the node variable x, and assign the parent node of x to the node variable y.
(2) Assign to x and y their respective parent nodes.
(3) Repeat (2) until the leftmost words of span(x) and span(y), the spans dominated by x and y, no longer match. At that point, the part of the span dominated by y from the word immediately after the connection expression to the last word of the span is term 2.

FIG. 3 shows a first example of a syntax tree from which terms are extracted. In the example of FIG. 3, "and" is first assigned to x and CC to y. Since the leftmost words of span(x) and span(y) are both "and" and therefore match, CC is assigned to x and S to y. The leftmost word of span(x) is now "and" and that of span(y) is "He"; they do not match, so the procedure ends. The part of span(y), "He became a student and he received a grant", that immediately follows "and", namely "he received a grant", is term 2.

FIG. 4 shows a second example of a syntax tree from which terms are extracted. In the example of FIG. 4, "but" is first assigned to x and CC to y. Since the leftmost words of span(x) and span(y) are both "but" and therefore match, CC is assigned to x and VP to y. The leftmost words of span(x) and span(y) are now "but" and "were" respectively; they do not match, so the procedure ends. The part of the span dominated by y that immediately follows "but", namely "were not adjusted for inflation", is term 2.

Next, when the connection expression is a coordinating conjunction, term 1 is extracted by the following steps (1) and (2). Note that x and y carry over the values they had when the procedure for term 2 finished.

(1) If an S or SBAR exists among the child nodes of y to the left of x (if there are several, the rightmost is selected), the span dominated by that node is term 1.
(2) If (1) does not apply, assign to y its parent and climb the syntax tree until y has the label SBAR or S. The span dominated by y at that point, with the connection expression and term 2 removed, is term 1.

In the example of extraction from the syntax tree of FIG. 3, x is CC and y is S when term 2 has been determined. Since there is an S among the child nodes of y to the left of x, the span "He became a student" dominated by that S is term 1.

Likewise, in the example of extraction from the syntax tree of FIG. 4, x is CC and y is VP when term 2 has been determined. Since neither S nor SBAR exists among the child nodes of y to the left of x, the parent of y is assigned to y. y is then S, so the procedure ends. Removing "but were not adjusted for inflation" from the span dominated by y, "The figures were adjusted for deflation, but were not adjusted for inflation", gives "The figures were adjusted for deflation", which is term 1.
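The coordinating-conjunction procedure, covering both term 2 and term 1, can be sketched as follows. This is an illustrative reading of the steps, not the patented implementation. The `Node` class and function names are hypothetical, the POS labels are assumed, the leftmost words are compared as strings, and the comma of the FIG. 4 sentence is left out of the tree for simplicity.

```python
class Node:
    """Minimal constituency-tree node (hypothetical); a leaf's label is the word itself."""

    def __init__(self, label, *children):
        self.label = label
        self.parent = None
        self.children = [c if isinstance(c, Node) else Node(c) for c in children]
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def is_clause(node):
    return bool(node.children) and node.label in ("S", "SBAR")


def text(leaves):
    return " ".join(leaf.label for leaf in leaves)


def extract_coordinate_terms(connective_leaves):
    """connective_leaves: leaf nodes of the connection expression, in order."""
    x = connective_leaves[-1]
    y = x.parent
    # Term 2: climb while span(x) and span(y) begin with the same word.
    while x.leaves()[0].label == y.leaves()[0].label:
        if y.parent is None:
            return None  # the connective opens the sentence; nothing to its left
        x, y = x.parent, y.parent
    span_y = y.leaves()
    last = connective_leaves[-1]
    start = next(i for i, leaf in enumerate(span_y) if leaf is last) + 1
    term2_leaves = span_y[start:]
    # Term 1, step (1): rightmost S/SBAR child of y to the left of x.
    x_pos = next(i for i, child in enumerate(y.children) if child is x)
    clauses = [c for c in y.children[:x_pos] if is_clause(c)]
    if clauses:
        return text(clauses[-1].leaves()), text(term2_leaves)
    # Term 1, step (2): climb from y to an S/SBAR, drop connective and term 2.
    while y.parent is not None:
        y = y.parent
        if is_clause(y):
            break
    removed = {id(leaf) for leaf in list(connective_leaves) + term2_leaves}
    return text([l for l in y.leaves() if id(l) not in removed]), text(term2_leaves)


def clause(subj, *vp):
    return Node("S", Node("NP", subj), Node("VP", *vp))


fig3 = Node("S",
            clause("He", Node("VBD", "became"), Node("NP", "a", "student")),
            Node("CC", "and"),
            clause("he", Node("VBD", "received"), Node("NP", "a", "grant")))
fig4 = Node("S",
            Node("NP", "The", "figures"),
            Node("VP",
                 Node("VP", "were", "adjusted", Node("PP", "for", "deflation")),
                 Node("CC", "but"),
                 Node("VP", "were", "not", "adjusted", Node("PP", "for", "inflation"))))


def find(tree, word):
    return next(leaf for leaf in tree.leaves() if leaf.label == word)


result3 = extract_coordinate_terms([find(fig3, "and")])
result4 = extract_coordinate_terms([find(fig4, "but")])
print(result3)  # ('He became a student', 'he received a grant')
print(result4)  # ('The figures were adjusted for deflation', 'were not adjusted for inflation')
```

Returning `None` when the climb reaches the root without a mismatch is a choice made for this sketch; the text does not say how that case is handled.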

When the term positional relationship determination unit 38 determines that the two terms connected by a connection expression do not both appear in the sentence containing the connection expression, the inter-sentence term extraction unit 42 extracts term 2 from the sentence containing the connection expression, which the term positional relationship determination unit 38 has determined as the sentence for extracting term 2. It extracts term 1 from the sentence corresponding to the parent node or a sibling node of that sentence, which the term positional relationship determination unit 38 has determined as the sentence for extracting term 1, and outputs the two extracted terms to the output unit 50.

The semantic class classification unit 44 classifies the semantic class of each connection expression extracted by the connection expression extraction unit 36, and outputs the connection expression and its semantic class to the output unit 50. Specifically, the semantic class classification unit 44 takes as input the connection expression extracted by the connection expression extraction unit 36 and the words surrounding it, and classifies the semantic class of the connection expression by solving a multi-class classification problem with a model trained in advance on training data. Because this is a multi-class classification problem, the training data is resampled so that its class distribution is as uniform as possible.
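One way to make the class distribution uniform is to downsample every class to the size of the smallest one. The text does not state which resampling scheme is used, so the sketch below is only one plausible reading; the function name, the `(features, label)` data layout, and the example labels are assumptions.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(examples, seed=0):
    """Downsample (features, label) pairs so that every label occurs equally often."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        # Draw the same number of examples from each class without replacement.
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical training set with a skewed label distribution.
data = ([(i, "cause") for i in range(5)]
        + [(i, "contrast") for i in range(2)]
        + [(i, "temporal") for i in range(3)])
balanced = resample_balanced(data)
print(Counter(label for _, label in balanced))
```

Oversampling the rarer classes with replacement would be an equally valid reading when discarding training data is undesirable.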

<Operation of the connection expression term structure analysis apparatus according to the first embodiment of the present invention>

Next, the operation of the connection expression term structure analysis apparatus 100 according to the first embodiment of the present invention is described. When the input unit 10 receives a document, the connection expression term structure analysis apparatus 100 executes the connection expression term structure analysis processing routine shown in FIG. 5.

First, in step S100, the document received by the input unit 10 is acquired and sentence boundaries are assigned to it.

Next, in step S102, based on the document to which sentence boundaries were assigned in step S100, a discourse structure tree is generated that is based on the rhetorical structure of each sentence in the document and represents each sentence as a node.

In step S104, syntax analysis is performed on each sentence in the document to which sentence boundaries were assigned in step S100, and a syntax tree is generated.

In step S106, connection expressions that have terms are extracted based on the syntax tree generated for each sentence in step S104.

In step S108, for each connection expression extracted in step S106, it is determined whether or not the two terms connected by the connection expression appear in the sentence containing it. If it is determined that the two terms do not appear in that sentence, the sentence containing the connection expression is determined as the sentence for extracting term 2, and the sentence corresponding to the parent node or a sibling node of that sentence in the discourse structure tree generated in step S102 is determined as the sentence for extracting term 1.

In step S110, if it was determined in step S108 that the two terms connected by the connection expression appear in the sentence containing it, term 1 and term 2 are extracted from that sentence and output to the output unit 50.

In step S112, if it was determined in step S108 that the two terms connected by the connection expression do not appear in the sentence containing it, term 2 is extracted from the sentence containing the connection expression, which was determined in step S108 as the sentence for extracting term 2. Term 1 is extracted from the sentence corresponding to the parent node or a sibling node of that sentence, which was determined in step S108 as the sentence for extracting term 1, and the two extracted terms are output to the output unit 50.

In step S114, the semantic class of each connection expression extracted in step S106 is classified, the connection expression and its semantic class are output to the output unit 50, and the connection expression term structure analysis processing routine ends.

As described above, the connection expression term structure analysis apparatus according to the first embodiment generates, from a document, a discourse structure tree based on the rhetorical structure of each sentence in the document, performs syntax analysis to generate syntax trees, extracts connection expressions that have terms, and determines whether or not the two terms connected by each connection expression appear in the sentence containing it. If the two terms are determined to appear, they are extracted from the sentence containing the connection expression. If they are determined not to appear, term 2 is extracted from the sentence containing the connection expression, and term 1 is extracted from the sentence corresponding to the parent node or a sibling node of that sentence in the discourse structure tree. The semantic class of the connection expression is also classified. In this way, terms connected by a connection expression can be extracted even from sentences that are not adjacent.

<Configuration of the connection expression term structure analysis apparatus according to the second embodiment of the present invention>

Next, the configuration of the connection expression term structure analysis apparatus according to the second embodiment of the present invention is described. Parts configured in the same way as in the first embodiment are given the same reference signs, and their description is omitted. The apparatus according to the second embodiment extracts from a document the terms and semantic labels relating to implicit connection expressions.

As shown in FIG. 6, the connection expression term structure analysis apparatus 200 according to the second embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing the connection expression term structure analysis processing routine described later. Functionally, as shown in FIG. 6, the apparatus 200 includes an input unit 10, a computing unit 220, and an output unit 50.

The computing unit 220 includes a sentence division unit 30, a discourse structure analysis unit 32, a related sentence pair extraction unit 238, an inter-sentence term extraction unit 242, and a semantic class classification unit 244.

By the same processing as in the first embodiment, the discourse structure analysis unit 32 takes the document to which the sentence division unit 30 has assigned sentence boundaries and generates a discourse structure tree that is based on the rhetorical structure of each sentence in the document and represents each sentence as a node.

Based on the discourse structure tree generated by the discourse structure analysis unit 32, the related sentence pair extraction unit 238 takes the sentence pairs corresponding to parent-child nodes and the sentence pairs corresponding to sibling nodes as candidates for sentence pairs having a connection relation, and determines for each candidate whether or not a connection relation exists.

Specifically, the related sentence pair extraction unit 238 receives the discourse structure tree as input, takes the sentence pairs that are parent-child nodes or sibling nodes in the tree as candidates for sentence pairs having a connection relation, and applies a binary classifier trained in advance to each candidate to decide whether or not the sentence pair has a connection relation. The binary classifier is trained by preparing sentences S_i and S_j as sentence pairs of training data and using the following features (1) to (5).

(1) The first word of sentence S_i and of sentence S_j
(2) The last word of sentence S_i and of sentence S_j
(3) The first three words of sentence S_i and of sentence S_j
(4) All pairs of a word in sentence S_i and a word in sentence S_j
(5) All pairs of the semantic class of a word in sentence S_i and the semantic class of a word in sentence S_j
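Features (1) to (5) can be sketched as a function that maps a tokenized sentence pair to a set of string-valued features. The feature naming scheme, the function name, and the `sem_class` dictionary (standing in for a thesaurus or word clustering lookup) are assumptions made for this example.

```python
def pair_features(s_i, s_j, sem_class=None):
    """Return features (1)-(5) for token lists s_i and s_j as a set of strings."""
    sem_class = sem_class or {}
    feats = set()
    for name, sent in (("i", s_i), ("j", s_j)):
        if sent:
            feats.add(f"first_{name}={sent[0]}")                 # feature (1)
            feats.add(f"last_{name}={sent[-1]}")                 # feature (2)
            feats.add(f"first3_{name}={' '.join(sent[:3])}")     # feature (3)
    for w_i in s_i:
        for w_j in s_j:
            feats.add(f"pair={w_i}|{w_j}")                       # feature (4)
            c_i, c_j = sem_class.get(w_i), sem_class.get(w_j)
            if c_i is not None and c_j is not None:
                feats.add(f"cpair={c_i}|{c_j}")                  # feature (5)
    return feats


# Hypothetical sentence pair and semantic-class lookup.
s_i = ["It", "rained", "hard"]
s_j = ["The", "game", "was", "cancelled"]
classes = {"rained": "WEATHER", "cancelled": "EVENT"}
feats = pair_features(s_i, s_j, classes)
print(sorted(feats))
```

Words without a known semantic class simply contribute no class pair here; how unknown words are treated is not specified in the text.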

Note that the semantic class of a word, used in feature (5) above, can be obtained from an existing thesaurus or from the results of word clustering. Furthermore, for each sentence pair candidate determined to have a connection relation, the related sentence pair extraction unit 238 uses the modifier-modified relationship expressed by the discourse structure tree to determine the sentence for extracting term 1 and the sentence for extracting term 2. For example, if sentence S_i is a child node of sentence S_j, sentence S_i is the sentence for extracting term 2 and sentence S_j is the sentence for extracting term 1. If sentences S_i and S_j are sibling nodes, the one with the smaller sentence number is the sentence for extracting term 1 and the one with the larger sentence number is the sentence for extracting term 2.
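Candidate generation and the assignment of term roles can be sketched together. The example assumes the discourse structure tree is given as a mapping from each sentence number to its parent's sentence number, which is a representation chosen for illustration. The binary classifier that filters the candidates is omitted.

```python
from itertools import combinations


def candidate_pairs(parent):
    """parent maps sentence number -> parent sentence number (None for the root).

    Returns (term1_sentence, term2_sentence) tuples for every parent-child
    pair and every sibling pair in the discourse structure tree.
    """
    pairs = []
    children = {}
    for sentence, par in parent.items():
        if par is not None:
            # The child sentence supplies term 2, its parent supplies term 1.
            pairs.append((par, sentence))
            children.setdefault(par, []).append(sentence)
    for siblings in children.values():
        for a, b in combinations(sorted(siblings), 2):
            # Among siblings, the smaller sentence number supplies term 1.
            pairs.append((a, b))
    return pairs


# Hypothetical tree: sentence 1 is the root, 2 and 3 modify 1, 4 modifies 3.
tree = {1: None, 2: 1, 3: 1, 4: 3}
pairs = candidate_pairs(tree)
print(sorted(pairs))  # [(1, 2), (1, 3), (2, 3), (3, 4)]
```

Sentences 1 and 4 are not adjacent in the tree and sentences 2 and 4 share no parent, so neither pair becomes a candidate.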

For each sentence pair candidate determined by the related sentence pair extraction unit 238 to have a connection relation, the inter-sentence term extraction unit 242 extracts from the candidate the two terms connected by an implicit connection expression. Since the related sentence pair extraction unit 238 has already decided which sentence term 1 and term 2 are each extracted from, only the terms themselves are extracted here, by the following operations (1) and (2).

(1) Among the symbols in the sentence, delete the sentence-final marks "。", "！", and "？".
(2) Delete bracketing marks such as "“" and "”" at the beginning and end of the sentence.

The inter-sentence term extraction unit 242 repeats operations (1) and (2) above until no further change occurs, and outputs the two terms having an implicit connection relation to the output unit 50.
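The repeat-until-stable cleanup can be sketched as below. The exact symbol inventories are assumptions: the text names only "。", "！", "？" and quotation-style brackets, and the ASCII counterparts are added here for English input. Operation (1) is read as removing marks at the end of the sentence.

```python
END_MARKS = "。！？.!?"          # sentence-final marks (ASCII variants assumed)
BRACKETS = "“”\"「」『』（）()"    # bracketing marks (inventory assumed)


def clean_term(sentence):
    """Apply operations (1) and (2) repeatedly until the text stops changing."""
    text = sentence.strip()
    previous = None
    while text != previous:
        previous = text
        text = text.rstrip(END_MARKS).strip()   # (1) drop sentence-final marks
        text = text.strip(BRACKETS).strip()     # (2) drop brackets at both ends
    return text


print(clean_term("“Did it rain?”"))  # Did it rain
print(clean_term("「雨が降った。」"))   # 雨が降った
```

The loop is needed because each operation can expose material for the other: removing a closing quotation mark reveals a sentence-final mark that the first pass could not reach.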

For each sentence pair candidate determined by the related sentence pair extraction unit 238 to have a connection relation, the semantic class classification unit 244 classifies the semantic class of the implicit connection expression based on the candidate and outputs it to the output unit 50. The semantic class classification unit 244 takes each sentence pair candidate as input and determines the semantic class of the connection relation linking the two sentences by solving a multi-class classification problem with a model trained in advance on training data. The features used for training and classification are features (1) to (5) used by the related sentence pair extraction unit 238. In addition, because this is a multi-class classification problem, the training data is resampled so that its class distribution is as uniform as possible.

<Operation of the connection expression term structure analysis apparatus according to the second embodiment of the present invention>

Next, the operation of the connection expression term structure analysis apparatus 200 according to the second embodiment of the present invention is described. When the input unit 10 receives a document, the connection expression term structure analysis apparatus 200 executes the connection expression term structure analysis processing routine shown in FIG. 7. Steps that operate in the same way as in the first embodiment are given the same reference signs, and their description is omitted.

In step S200, based on the discourse structure tree generated in step S102, the sentence pairs corresponding to parent-child nodes and the sentence pairs corresponding to sibling nodes are taken as candidates for sentence pairs having a connection relation, and it is determined for each candidate whether or not a connection relation exists. Also in step S200, for each candidate determined to have a connection relation, the sentence for extracting term 2 and the sentence for extracting term 1 are determined using the modifier-modified relationship expressed by the discourse structure tree.

Next, in step S202, for each sentence pair candidate determined in step S200 to have a connection relation, the two terms connected by an implicit connection expression are extracted from the candidate and output to the output unit 50.

Then, in step S204, for each sentence pair candidate determined in step S200 to have a connection relation, the semantic class of the implicit connection expression is classified based on the candidate and output to the output unit 50, and the connection expression term structure analysis processing routine ends.

The other configurations and operations of the connection expression term structure analysis apparatus 200 according to the second embodiment are the same as those of the first embodiment, and their description is omitted.

As described above, the connection expression term structure analysis apparatus according to the second embodiment generates, from a document, a discourse structure tree that is based on rhetorical structure and represents each sentence as a node. Based on that tree, it takes the sentence pairs corresponding to parent-child nodes and the sentence pairs corresponding to sibling nodes as candidates for sentence pairs having a connection relation, and determines for each candidate whether or not a connection relation exists. It then extracts from those candidates the two terms connected by an implicit connection expression and classifies the semantic class of the implicit connection expression. In this way, semantically linked terms having a connection relation can be extracted even from sentences that are not adjacent.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the embodiments above describe a case in which the connection expression term structure analysis apparatus according to the first embodiment extracts from a document the connection expressions, terms, and semantic labels relating to explicit connection expressions, and the apparatus according to the second embodiment extracts from a document the connection expressions, terms, and semantic labels relating to implicit connection expressions. The invention is not limited to this arrangement. A single connection expression term structure analysis apparatus may extract from a document the connection expressions, terms, and semantic labels relating to explicit connection expressions as well as the terms and semantic labels relating to implicit connection expressions.

DESCRIPTION OF SYMBOLS
10 Input unit
20, 220 Computing unit
30 Sentence division unit
32 Discourse structure analysis unit
34 Syntax analysis unit
36 Connection expression extraction unit
38 Term positional relationship determination unit
40 Intra-sentence term extraction unit
42, 242 Inter-sentence term extraction unit
44, 244 Semantic class classification unit
46 Inter-sentence term extraction unit
50 Output unit
100, 200 Connection expression term structure analysis apparatus
238 Related sentence pair extraction unit

Claims

A connection expression term structure analysis apparatus comprising:
a discourse structure analysis unit that, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of each of the sentences included in the document and represents each of the sentences as a node;
a syntax analysis unit that performs syntax analysis on each of the sentences included in the document and generates a syntax tree;
a connection expression extraction unit that, based on a predetermined connection expression candidate dictionary and the syntax tree, extracts from each of the sentences included in the document a connection expression candidate and a feature amount corresponding to the connection expression candidate, the feature amount including the appearance position of the connection expression candidate, and extracts a connection expression having terms from the connection expression candidates using a first classifier trained in advance to take the extracted feature amount as input and output a first determination result indicating whether or not the connection expression corresponding to the feature amount has terms;
a term positional relationship determination unit that obtains, for each connection expression extracted by the connection expression extraction unit, a second determination result corresponding to the connection expression, using a second classifier trained in advance to take the extracted feature amount as input and output the second determination result indicating whether or not the two terms connected by the connection expression appear in the sentence containing the connection expression corresponding to the feature amount,
and that, for a connection expression for which the second determination result has been obtained, when the second determination result indicates that the two terms do not appear, determines the sentence containing the connection expression as the sentence for extracting one of the two terms, and determines a sentence corresponding to the parent node or a sibling node, in the discourse structure tree, of the sentence containing the connection expression as the sentence for extracting the other of the two terms;
an inter-sentence term extraction unit that extracts the two terms connected by the connection expression based on the determination by the term positional relationship determination unit;
an intra-sentence term extraction unit that, for a connection expression for which the second determination result has been obtained and indicates that the two terms appear, extracts the two terms connected by the connection expression from the sentence containing the connection expression; and
a semantic class classification unit that classifies the semantic class of the connection expression based on the connection expression extracted by the connection expression extraction unit.

A connection expression term structure analysis apparatus comprising:
a discourse structure analysis unit that, based on an input document, generates a discourse structure tree that is based on the rhetorical structure of each of the sentences included in the document and represents each of the sentences as a node;
a related sentence pair extraction unit that, based on the discourse structure tree generated by the discourse structure analysis unit, takes sentence pairs corresponding to parent-child nodes and sentence pairs corresponding to sibling nodes as candidates for sentence pairs having a connection relation, extracts a feature amount corresponding to each sentence pair candidate, and determines for each of the candidates whether or not a connection relation exists, using a classifier trained in advance to take the extracted feature amount as input and output a determination result indicating whether or not the sentence pair corresponding to the feature amount has a connection relation;
an inter-sentence term extraction unit that, for each of the sentence pair candidates determined by the related sentence pair extraction unit to have a connection relation, extracts from the candidate the two terms connected by an implicit connection expression; and
a semantic class classification unit that, for each of the sentence pair candidates determined by the related sentence pair extraction unit to have a connection relation, classifies the semantic class of the implicit connection expression based on the candidate.

A step in which a discourse structure analysis unit generates, from an input document, a discourse structure tree in which each sentence included in the document is represented by a node, based on the rhetorical structure of the sentences included in the document;
A step in which a syntax analysis unit parses each sentence included in the document to generate a syntax tree;
A step in which a connection expression extraction unit extracts connection expression candidates from each sentence included in the document based on a predetermined connection expression candidate dictionary and the syntax tree, extracts, for each connection expression candidate, a feature amount that includes the appearance position of the candidate, and extracts the connection expressions having terms from among the candidates by using a first classifier trained in advance to take the extracted feature amount as input and output a first determination result indicating whether the connection expression corresponding to that feature amount has terms;
A step in which a term positional relationship determination unit obtains, for each connection expression extracted by the connection expression extraction unit, a second determination result corresponding to that connection expression by using a second classifier trained in advance to take the extracted feature amount as input and output a second determination result indicating whether the two terms connected by the connection expression both appear in the sentence that includes the connection expression corresponding to that feature amount;
A step of determining, for each connection expression for which the second determination result has been obtained and whose second determination result indicates that the two terms do not both appear, the sentence that includes the connection expression as the sentence from which one of the two terms is extracted, and the sentence corresponding to the parent node or a sibling node, in the discourse structure tree, of the sentence that includes the connection expression as the sentence from which the other of the two terms is extracted;
A step in which an inter-sentence term extraction unit extracts, for each connection expression for which the second determination result has been obtained and whose second determination result indicates that the two terms both appear, the two terms connected by the connection expression from the sentence that includes the connection expression;
A step in which a semantic class classification unit classifies the semantic class of each connection expression extracted by the connection expression extraction unit, based on that connection expression;
A connection expression term structure analysis method including the steps recited above.
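
The branching on the second determination result in the steps above can be sketched as follows. This is only an illustrative reading of the claim, not the claimed implementation: the function name `term_source_sentences` and the `parent` and `children` dictionaries are hypothetical, and because the claim does not state how a single parent or sibling sentence is selected, the sketch simply returns every admissible sentence as a candidate.

```python
def term_source_sentences(sent_id, both_terms_in_sentence, parent, children):
    """Return (sentence for one term, sentence for the other term) candidates.

    `sent_id` is the sentence that contains the connection expression.
    `both_terms_in_sentence` stands in for the second determination result.
    `parent` maps a sentence id to its parent id, and `children` maps a
    sentence id to its child ids (both hypothetical representations of the
    discourse structure tree).
    """
    if both_terms_in_sentence:
        # Both terms are extracted from the sentence itself.
        return [(sent_id, sent_id)]

    # Otherwise one term comes from this sentence and the other from its
    # parent or a sibling in the discourse structure tree.
    others = []
    p = parent.get(sent_id)
    if p is not None:
        others.append(p)
        others.extend(s for s in children.get(p, []) if s != sent_id)
    return [(sent_id, other) for other in others]


parent = {1: 0, 2: 0, 3: 2}
children = {0: [1, 2], 2: [3]}
print(term_source_sentences(1, True, parent, children))
print(term_source_sentences(1, False, parent, children))
```

In this sketch a root sentence whose terms do not both appear within it yields no candidates, since it has neither a parent nor siblings.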

A step in which a discourse structure analysis unit generates, from an input document, a discourse structure tree in which each sentence included in the document is represented by a node, based on the rhetorical structure of the sentences included in the document;
A step in which a related sentence pair extraction unit, based on the discourse structure tree generated by the discourse structure analysis unit, takes the sentence pairs corresponding to parent-child nodes and the sentence pairs corresponding to sibling nodes as candidate sentence pairs having a connection relation, extracts a feature amount for each candidate sentence pair, and determines whether each candidate sentence pair has a connection relation by using a classifier trained in advance to take the extracted feature amount as input and output a determination result indicating whether the sentence pair corresponding to that feature amount has a connection relation;
A step in which an inter-sentence term extraction unit, for each candidate sentence pair determined by the related sentence pair extraction unit to have a connection relation, extracts from that sentence pair the two terms connected by an implicit connection expression;
A step in which a semantic class classification unit, for each candidate sentence pair determined by the related sentence pair extraction unit to have a connection relation, classifies the semantic class of the implicit connection expression based on that sentence pair;
A connection expression term structure analysis method including the steps recited above.

A program for causing a computer to function as each unit constituting the connection expression term structure analysis apparatus according to claim 1 or 2.