JP5743952B2

JP5743952B2 - Quotation part extraction apparatus, method, and program

Info

Publication number: JP5743952B2
Application number: JP2012108428A
Authority: JP
Inventors: 九月貞光; 東中　竜一郎; 齋藤　邦子
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2012-05-10
Filing date: 2012-05-10
Publication date: 2015-07-01
Anticipated expiration: 2032-05-10
Also published as: JP2013235487A

Description

The present invention relates to a quotation part extraction apparatus, method, and program, and in particular to a quotation part extraction apparatus, method, and program that extract a quoted portion from a document.

Conventionally, as in named entity extraction, a sequence of several words is extracted from a document as a single unit (see, for example, Non-Patent Document 1). In named entity extraction and similar tasks, a classifier for extracting the target portion from the document assigns a B tag to the first word of the extraction range, an I tag to every other word inside the extraction range, and an O tag to words outside the extraction range; the group of words tagged B and I is then extracted. CRF (Conditional Random Fields) is known as a classifier for solving sequence labeling problems such as named entity extraction and morphological analysis (see, for example, Non-Patent Document 2).
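The B/I/O tagging scheme described above can be illustrated with a minimal Python sketch. The function name `extract_spans` and the sample tokens and tags are hypothetical and chosen only for illustration; in practice the tags would come from a trained classifier such as a CRF.

```python
def extract_spans(tokens, tags):
    """Collect each group of tokens marked by a B tag followed by zero or more I tags."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close any span that is still open.
            if current:
                spans.append(current)
            current = [token]
        elif tag == "I" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # An O tag (or a stray I tag with no open span) closes the open span.
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


# Hypothetical tagged sentence for illustration only.
tokens = ["Nippon", "Telegraph", "opened", "an", "office", "in", "Tokyo"]
tags = ["B", "I", "O", "O", "O", "O", "B"]
spans = extract_spans(tokens, tags)
```

Here `spans` groups "Nippon Telegraph" and "Tokyo" as two separate extracted units.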

Non-Patent Document 1: Yuta Tsuboi, Hisashi Kashima, Taku Kudo, "Development of Discriminative Models in Language Processing: From HMM to CRF", tutorial material, 2006 Annual Meeting of the Association for Natural Language Processing.
Non-Patent Document 2: John Lafferty, Andrew McCallum, Fernando Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", 2001.

However, Non-Patent Documents 1 and 2 do not disclose applying conventional named entity extraction techniques to quotation part extraction. For example, when the document 「Ａさんに『少し遅刻しそう』ってメールして」 ("Email Mr. A saying 'I'll probably be a little late'") is entered by voice input or the like, the quotation marks are absent, so the location of the quotation part (the range enclosed in 『』) is not explicit. Even if the conventional named entity extraction techniques described above are applied to such a document, the quotation part cannot be extracted properly.

The present invention has been made in view of the above circumstances, and its object is to provide a quotation part extraction apparatus, method, and program capable of extracting a quotation part from a document in which the quotation part is not explicitly marked.

A quotation part extraction apparatus according to a first invention includes: dividing means for dividing document data into a plurality of parts; constrained lattice generation means for assigning, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part to be extracted from the document data, and for generating a constrained lattice in which, for each part, each of a plurality of types of tags indicating whether or not the part is a quotation part is a candidate and information indicating whether or not a constraint based on the quotation part to be extracted is satisfied is attached to each part; and extraction means for selecting, based on an extraction model for extracting a quotation part from a lattice, the optimal candidate for each part from the constrained lattice generated by the constrained lattice generation means, and for extracting, as the quotation part, the parts for which a tag indicating a quotation part was selected as the optimal candidate.

According to the quotation part extraction apparatus of the first invention, the dividing means divides the document data into a plurality of parts. The constrained lattice generation means assigns, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part to be extracted from the document data, and generates a constrained lattice in which, for each part, each of a plurality of types of tags indicating whether or not the part is a quotation part is a candidate and information indicating whether or not a constraint based on the quotation part to be extracted is satisfied is attached to each part. The extraction means then selects, based on an extraction model for extracting a quotation part from a lattice, the optimal candidate for each part from the constrained lattice, and extracts, as the quotation part, the parts for which a tag indicating a quotation part was selected as the optimal candidate.

In this way, a constrained lattice carrying constraints that reflect the characteristics of the quotation part to be extracted is generated, and the quotation part is extracted from the document using the quotation part extraction model. A quotation part can therefore be extracted from a document in which the quotation part is not explicitly marked.

The quotation part extraction apparatus according to the first invention may further include concealing means for concealing, among the words in the document data that match a keyword predetermined according to the characteristics of the quotation part to be extracted, every word other than the one that appears last in the document data. In that case, the constrained lattice generation means generates the constrained lattice for the document data after concealment by the concealing means, and, when the selected parts include a concealed word, the extraction means restores the concealed word before extracting the parts as the quotation part. This prevents erroneous extraction of the quotation part without having to look at a wide context.

In the first invention, the dividing means may divide the document data phrase by phrase (by bunsetsu). This suppresses the sparsity of the features assigned to each phrase and makes it possible to assign features that also take into account information from parts located farther away.

The quotation part extraction apparatus according to the first invention may further include: lattice generation means for assigning, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part, and for generating a lattice in which, for each part, each of the plurality of types of tags indicating whether or not the part is a quotation part is a candidate; and learning means for learning the extraction model based on the lattice generated by the lattice generation means and the correct position of the quotation part in the document data. The apparatus thereby also has the configuration needed to generate the extraction model used when extracting a quotation part.

A quotation part extraction apparatus according to a second invention may include: dividing means for dividing document data into a plurality of parts; lattice generation means for assigning, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part to be extracted from the document data, and for generating a lattice in which, for each part, each of a plurality of types of tags indicating whether or not the part is a quotation part is a candidate; and learning means for learning an extraction model for extracting a quotation part from a lattice, based on the lattice generated by the lattice generation means and the correct position of the quotation part in the document data. Using the extraction model generated in this way, a quotation part can be extracted from a document in which the quotation part is not explicitly marked.

A quotation part extraction method according to a third invention is a method in which: dividing means divides document data into a plurality of parts; constrained lattice generation means assigns, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part to be extracted from the document data, and generates a constrained lattice in which, for each part, each of a plurality of types of tags indicating whether or not the part is a quotation part is a candidate and information indicating whether or not a constraint based on the quotation part to be extracted is satisfied is attached to each part; and extraction means selects, based on an extraction model for extracting a quotation part from a lattice, the optimal candidate for each part from the constrained lattice generated by the constrained lattice generation means, and extracts, as the quotation part, the parts for which a tag indicating a quotation part was selected as the optimal candidate.

A quotation part extraction method according to a fourth invention is a method in which: dividing means divides document data into a plurality of parts; lattice generation means assigns, to each part divided by the dividing means, features corresponding to the characteristics of the quotation part to be extracted from the document data, and generates a lattice in which, for each part, each of a plurality of types of tags indicating whether or not the part is a quotation part is a candidate; and learning means learns an extraction model for extracting a quotation part from a lattice, based on the lattice generated by the lattice generation means and the correct position of the quotation part in the document data.

A quotation part extraction program according to a fifth invention is a program for causing a computer to function as each of the means constituting the quotation part extraction apparatus described above.

As described above, according to the quotation part extraction apparatus, method, and program of the present invention, a constrained lattice carrying constraints that reflect the characteristics of the quotation part to be extracted is generated, and the quotation part is extracted from the document using the quotation part extraction model. This yields the effect that a quotation part can be extracted from a document in which the quotation part is not explicitly marked.

FIG. 1 is a functional block diagram showing the configuration of the quotation part extraction apparatus according to the present embodiment.
FIG. 2 is a schematic diagram showing an example of a lattice generated by the lattice generation unit.
FIG. 3 is a schematic diagram showing an example of a constrained lattice generated by the constrained lattice generation unit.
FIG. 4 is a schematic diagram showing an example of the optimal candidates selected from the constrained lattice.
FIG. 5 is a flowchart showing the learning processing routine in the present embodiment.
FIG. 6 is a flowchart showing the application processing routine in the present embodiment.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

The quotation part extraction apparatus 10 according to the present embodiment can be configured as a computer that includes a CPU, a RAM, and a ROM storing a program for executing a quotation part extraction processing routine, which comprises the learning process and the application process described later.

Functionally, as shown in FIG. 1, this computer can be represented as a configuration including a learning unit 12 and an application unit 14. The learning unit 12 can be represented as including a phrase division unit 20, a keyword concealment unit 22, a lattice generation unit 24, and a quotation part extraction learning unit 26. The application unit 14 can be represented as including the phrase division unit 20, the keyword concealment unit 22, a constrained lattice generation unit 28, and a quotation part extraction unit 30. The phrase division unit 20 and the keyword concealment unit 22 are functions shared by the learning unit 12 and the application unit 14. The phrase division unit 20 is an example of the dividing means of the present invention, the keyword concealment unit 22 is an example of the concealing means, the quotation part extraction learning unit 26 is an example of the learning means, and the quotation part extraction unit 30 is an example of the extraction means.

Here, the quotation part to be extracted in the present embodiment is defined as follows.

In a document of the form "to someone or something (a), do something (d) with some content (b) through some means (c)", the part b is taken to be the quotation part.

For example, for the document 「Ａさんに少し遅刻しそうってメールして」 ("Email Mr. A that I'll probably be a little late"), a = Mr. A, b = "I'll probably be a little late", c = email, and d = send.

Likewise, for 「明日３時に待ち合わせとメモする」 ("Make a note: meeting at 3 o'clock tomorrow"), a = oneself or a system capable of taking notes automatically, b = "meeting at 3 o'clock tomorrow", c = note, and d = record.

It is not required that all of a through d be present in the document. The quotation part may also be defined not only as b but as b + c or another combination.

The phrase division unit 20 uses a conventionally known classifier such as a CRF to split the input document (text data) at phrase (bunsetsu) boundaries into a plurality of parts, and outputs the resulting phrase-delimited sentence. Dividing the document in units of phrases suppresses the sparsity of the features when the lattice generation unit 24 and the constrained lattice generation unit 28 described later assign features to each part (each phrase). Compared with dividing the document word by word, it also allows features to take into account information from parts located farther away.

The keyword concealment unit 22 refers to a keyword dictionary 40, in which keywords corresponding to the characteristics of the quotation part to be extracted are registered in advance. Among the words in the phrase-delimited sentence that match a keyword, it conceals every word other than the one that appears last in the document, and outputs the concealed phrase-delimited sentence. For example, in the document 「Ａさんに『Ｂさんにメールして』ってメールして」 ("Email Mr. A saying 'email Mr. B'"), the first occurrence of 「メール」 ("email") is a word inside the quotation part. However, the word 「メール」 receives a very strongly weighted feature when the quotation part defined above is extracted. As a result, a wrong portion anchored on the first 「メール」 may be extracted as the quotation part, as in 「Ａさんに『Ｂさんに』メールしてってメールして」. Preventing such erroneous extraction would normally require looking at a relatively wide context. By concealing the words that match a keyword, the erroneous extraction can be prevented without enlarging the window size (the range referenced) used when assigning features to each phrase.
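The concealment step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `conceal_keywords`, the mask symbol `MASK`, and the representation of concealment information as (phrase index, original word) pairs are all hypothetical, and overlapping keywords are not handled.

```python
MASK = "×"  # placeholder symbol for a concealed word (an assumption of this sketch)


def conceal_keywords(phrases, keywords):
    """Mask every keyword occurrence except the last one in the document.

    Returns the concealed phrases and a list of (phrase_index, original_word)
    pairs, in document order, that can later be used to undo the masking.
    """
    # Locate every keyword occurrence as (phrase index, character offset, keyword).
    occurrences = []
    for i, phrase in enumerate(phrases):
        for kw in keywords:
            start = phrase.find(kw)
            while start != -1:
                occurrences.append((i, start, kw))
                start = phrase.find(kw, start + len(kw))
    occurrences.sort()

    to_hide = occurrences[:-1]  # the last occurrence in the document stays visible
    concealed = list(phrases)
    info = []
    # Replace from right to left so earlier character offsets remain valid.
    for i, start, kw in sorted(to_hide, reverse=True):
        text = concealed[i]
        concealed[i] = text[:start] + MASK + text[start + len(kw):]
        info.append((i, kw))
    info.reverse()  # report concealment info in document order
    return concealed, info


phrases = ["Ａさんに", "Ｂさんに", "メールしてって", "メールして"]
concealed, info = conceal_keywords(phrases, ["メール"])
```

With this input, only the third phrase is masked, and `info` records that the mask in phrase index 2 stands for 「メール」.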

At application time, when a quotation part is extracted from a document, the keyword concealment unit 22 also outputs concealment information, such as each concealed word and its position, to the quotation part extraction unit 30.

The lattice generation unit 24 assigns features corresponding to the characteristics of the quotation part to each phrase of the concealed phrase-delimited sentence output from the keyword concealment unit 22. For example, it can assign features expressing that "the next phrase contains the word 「メール」 (email)" or that "the next phrase contains a word whose part of speech is noun". For a document entered by voice, pause information may also be used as a feature. Pause information is an important clue for segmenting phrases, and it is equally important as an indication of the start or end position of the quotation part. A feature such as "there is a pause before the phrase" can therefore be assigned. An advantage of using a learner is that dependencies among these features need not be taken into account.
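A minimal Python sketch of this kind of per-phrase feature assignment is given below. The function name `phrase_features`, the feature string formats, and the optional `pause_before` flags are assumptions made for illustration. Part-of-speech features are omitted because they would require a morphological analyzer.

```python
def phrase_features(phrases, keywords, pause_before=None):
    """Return one set of feature strings per phrase.

    pause_before is an optional list of booleans (hypothetical input) saying
    whether a pause was detected before each phrase in spoken input.
    """
    if pause_before is None:
        pause_before = [False] * len(phrases)
    features = []
    for i, phrase in enumerate(phrases):
        feats = {"bias"}
        for kw in keywords:
            if kw in phrase:
                feats.add(f"contains={kw}")
            # Look one phrase ahead, as in "the next phrase contains the word ...".
            if i + 1 < len(phrases) and kw in phrases[i + 1]:
                feats.add(f"next_contains={kw}")
        if pause_before[i]:
            feats.add("pause_before")
        features.append(feats)
    return features


concealed = ["Ａさんに", "Ｂさんに", "×してって", "メールして"]
feats = phrase_features(concealed, ["メール"], pause_before=[False, True, False, False])
```

Because the earlier keyword occurrence is concealed, only the phrase directly before the final 「メール」 receives the `next_contains` feature.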

The lattice generation unit 24 also generates a lattice whose candidates for each phrase are a B tag indicating the first part of the quotation part, an I tag indicating a part of the quotation part other than the first, and an O tag indicating a part outside the quotation part. It outputs this lattice together with the features assigned to each phrase.

The quotation part extraction learning unit 26 learns a quotation part extraction model 42 for extracting a quotation part, using the lattice output from the lattice generation unit 24 and correct-answer data indicating the correct position of the quotation part in the input document. FIG. 2 shows an example in which the correct tag of each phrase is shaded in the lattice output from the lattice generation unit 24. The lattice also contains the features of each phrase, but the features are omitted from FIG. 2; the same applies to FIGS. 3 and 4 described later. Learning can use the same discriminative learning techniques as general sequence labeling problems. Discriminative learning makes it possible to use a wide variety of features.

Like the lattice generation unit 24, the constrained lattice generation unit 28 assigns features to each phrase of the concealed phrase-delimited sentence output from the keyword concealment unit 22, and generates a lattice whose candidates for each phrase are the B tag, the I tag, and the O tag. The constrained lattice generation unit 28 additionally attaches to each candidate constraint information indicating whether or not the candidate satisfies constraints based on the quotation part to be extracted. For example, a constraint can be imposed that an I-tagged phrase appears only after a B-tagged phrase (an I-tagged phrase never appears after an O-tagged phrase or at the beginning of the sentence). A condition that the quotation part appears only once in the document may also be set, and on that basis a constraint may be imposed that a B-tagged phrase appears only once.
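The two example constraints can be expressed as a check on a tag sequence, as in the following Python sketch. The function name `satisfies_constraints` and the `single_quote` flag are hypothetical. The sketch validates whole tag sequences rather than annotating individual lattice candidates, which is a simplification of the constrained lattice described in the text.

```python
from itertools import product


def satisfies_constraints(tags, single_quote=True):
    """Check a B/I/O tag sequence against the two example constraints.

    1. An I tag may only follow a B or I tag (never an O tag or the sentence start).
    2. If single_quote is set, at most one B tag may appear.
    """
    previous = "O"  # treating the sentence start like O forbids a leading I tag
    b_count = 0
    for tag in tags:
        if tag == "I" and previous == "O":
            return False
        if tag == "B":
            b_count += 1
        previous = tag
    return not (single_quote and b_count > 1)


# Enumerate every tag sequence for a four-phrase sentence and keep the valid ones.
valid_sequences = [
    seq for seq in product("BIO", repeat=4) if satisfies_constraints(seq)
]
```

For a four-phrase sentence, the valid sequences are the all-O sequence plus one sequence for each contiguous span that could serve as the single quotation part.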

FIG. 3 shows an example of a constrained lattice in which constraint information is attached to each candidate. Candidates that satisfy the constraints carry "valid" in parentheses after the tag symbol, and candidates that do not satisfy them carry "invalid". FIG. 3 also shows a case where several (here, two) constrained lattices are generated by combining multiple constraints as appropriate.

The lattice used by the quotation part extraction learning unit 26, shown in FIG. 2, also carries the "valid" and "invalid" information. In FIG. 2, however, the correct tags are selected using the correct position of the quotation part, which also means that the correct tags satisfy the constraints described above. The correct-tag candidates therefore coincide with the candidates marked "valid".

The quotation part extraction unit 30 uses the quotation part extraction model 42 generated by the quotation part extraction learning unit 26 to select the most likely candidate (the optimal candidate) for each phrase from the constrained lattice generated by the constrained lattice generation unit 28. Here, the most likely constrained lattice is first selected from the multiple constrained lattices, and the optimal candidates are then selected by searching for the optimal path in the selected constrained lattice. FIG. 4 shows the lattice with the selected candidates shaded. In the example of FIG. 4, the upper constrained lattice shown in FIG. 3 is selected, and the optimal candidates are selected from the candidates of that lattice marked "valid".
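One common way to search for the best path while honoring such constraints is a Viterbi search over an expanded state set. The Python sketch below swaps in that technique for illustration and is not the lattice selection procedure of the embodiment. The state `O2` (an O tag after the quotation part), the function name `best_valid_path`, and the emission and transition scores are all hypothetical; in practice the scores would come from the learned model.

```python
NEG_INF = float("-inf")
STATES = ["O", "B", "I", "O2"]  # O2 = an O tag that comes after the quotation part
# Allowed moves encode both constraints: I only after B or I, and at most one B.
ALLOWED = {("O", "O"), ("O", "B"), ("B", "I"), ("B", "O2"),
           ("I", "I"), ("I", "O2"), ("O2", "O2")}
START = {"O", "B"}


def surface(state):
    """Map an internal state back to its B/I/O tag."""
    return "O" if state == "O2" else state


def best_valid_path(emission, transition):
    """Viterbi search restricted to constraint-satisfying tag sequences.

    emission: one dict per phrase mapping "B"/"I"/"O" to a score.
    transition: dict mapping (previous_tag, tag) to a score (missing pairs score 0).
    """
    n = len(emission)
    score = [{s: NEG_INF for s in STATES} for _ in range(n)]
    back = [{s: None for s in STATES} for _ in range(n)]
    for s in START:
        score[0][s] = emission[0][surface(s)]
    for i in range(1, n):
        for s in STATES:
            for p in STATES:
                if (p, s) not in ALLOWED or score[i - 1][p] == NEG_INF:
                    continue
                candidate = (score[i - 1][p]
                             + transition.get((surface(p), surface(s)), 0.0)
                             + emission[i][surface(s)])
                if candidate > score[i][s]:
                    score[i][s] = candidate
                    back[i][s] = p
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[n - 1][s])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return [surface(s) for s in reversed(path)]


# Hypothetical scores: without constraints, the second phrase would prefer I over B.
emission = [
    {"B": 0.1, "I": 0.0, "O": 2.0},
    {"B": 0.5, "I": 1.5, "O": 0.2},
    {"B": 0.1, "I": 1.8, "O": 0.3},
    {"B": 0.0, "I": 0.2, "O": 2.0},
]
best = best_valid_path(emission, transition={})
```

In this example the highest-scoring unconstrained labeling would begin the quotation with an I tag, and the restricted search instead returns the best sequence in which the quotation begins with a B tag.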

The quotation part extraction unit 30 then extracts the phrases for which the B tag or the I tag was selected as the optimal candidate, restores any word concealed by the keyword concealment unit 22 based on the concealment information output from that unit, and extracts the quotation part. In addition, when the last phrase ends with a quotative particle such as 「〜って」, only the text before the quotative particle is extracted as the quotation part. The unit outputs an output sentence with quotation part information, in which information indicating the extracted quotation part is attached.
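The restoration and particle-trimming steps can be sketched in Python as follows. The function name `extract_quotation`, the mask symbol, the format of the concealment information, and the particle list are assumptions of this sketch; only 「って」 is listed because it is the particle named in the text.

```python
MASK = "×"  # placeholder symbol for a concealed word (an assumption of this sketch)
QUOTATIVE_PARTICLES = ("って",)  # hypothetical list; extend as needed


def extract_quotation(phrases, tags, concealment, particles=QUOTATIVE_PARTICLES):
    """Restore concealed words, join the B/I-tagged phrases, and trim a final particle.

    concealment: list of (phrase_index, original_word) pairs in document order.
    """
    restored = list(phrases)
    for index, word in concealment:
        # Put back one concealed word per entry, left to right within the phrase.
        restored[index] = restored[index].replace(MASK, word, 1)
    quoted = "".join(p for p, t in zip(restored, tags) if t in ("B", "I"))
    for particle in particles:
        if quoted.endswith(particle) and len(quoted) > len(particle):
            quoted = quoted[: -len(particle)]
            break
    return quoted


phrases = ["Ａさんに", "Ｃさんに", MASK + "してちょって", "メールして"]
tags = ["O", "B", "I", "O"]
quotation = extract_quotation(phrases, tags, concealment=[(2, "メール")])
```

With these inputs the sketch restores 「メール」 in the third phrase and drops the trailing 「って」, leaving 「Ｃさんにメールしてちょ」 as the quotation part.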

Next, the operation of the quotation part extraction apparatus 10 according to the present embodiment will be described. When a learning input sentence (an input sentence that includes information on the correct position of the quotation part) is input to the quotation part extraction apparatus 10, the learning unit 12 executes the learning processing routine shown in FIG. 5 to generate the quotation part extraction model 42. When, with the quotation part extraction model 42 already generated, an input sentence to which quotation part extraction is to be applied (an input sentence in which the position of the quotation part is not explicit) is input to the quotation part extraction apparatus 10, the application unit 14 executes the application processing routine shown in FIG. 6 to extract the quotation part from the input sentence. Each process is described in detail below.

First, in the learning processing routine, in step 100 the phrase division unit 20 acquires a learning input sentence (text data). Assume that the learning input sentence shown below is acquired. This learning input sentence is accompanied by correct-answer data, also shown below, that indicates the correct position of the quotation part.
Input sentence: ＡさんにＢさんにメールしてってメールして ("Email Mr. A telling him to email Mr. B")
Correct-answer data: 『Ｂさんにメールして』 ("email Mr. B")
The phrase division unit 20 then splits the acquired learning input sentence at phrase boundaries into a plurality of parts and outputs the phrase-delimited sentence shown below.
Phrase-delimited sentence: Ａさんに／Ｂさんに／メールしてって／メールして

Next, in step 102, the keyword concealment unit 22 refers to the keyword dictionary 40, in which keywords corresponding to the characteristics of the quotation part to be extracted are registered in advance. Among the words in the phrase-delimited sentence output in step 100 that match a keyword, it conceals every word other than the one that appears last in the document, and outputs the concealed phrase-delimited sentence. When the word 「メール」 ("email") is registered in the keyword dictionary 40, the concealed phrase-delimited sentence shown below is output for the above input sentence. The concealed portion is represented by 「×」.
Concealed phrase-delimited sentence: Ａさんに／Ｂさんに／×してって／メールして

Next, in step 104, the lattice generation unit 24 assigns features corresponding to the characteristics of the quotation part to each phrase of the concealed phrase-delimited sentence output in step 102. The lattice generation unit 24 also generates a lattice (for example, FIG. 2) whose candidates for each phrase are the B tag indicating the first part of the quotation part, the I tag indicating a part of the quotation part other than the first, and the O tag indicating a part outside the quotation part, and outputs this lattice together with the features of each phrase.

Next, in step 106, the quotation part extraction learning unit 26 learns the quotation part extraction model 42 for extracting a quotation part, using the lattice output in step 104 and the correct-answer data attached to the learning input sentence acquired in step 100. The quotation part extraction model 42 generated by learning is stored in a predetermined storage area, and the learning processing routine ends.

Next, in the application processing routine, in step 110 the phrase division unit 20 acquires an application input sentence (text data) to which quotation part extraction is to be applied. Assume that the application input sentence shown below is acquired.
Application input sentence: ＡさんにＣさんにメールしてちょってメールして ("Email Mr. A telling him to please email Mr. C")
The phrase division unit 20 then splits the acquired application input sentence at phrase boundaries into a plurality of parts and outputs the phrase-delimited sentence shown below.
Phrase-delimited sentence: Ａさんに／Ｃさんに／メールしてちょって／メールして

Next, in step 112, the keyword concealment unit 22 refers to the keyword dictionary 40, as in step 102 of the learning process, converts the phrase-delimited sentence output in step 110 into a concealed phrase-delimited sentence, and outputs it together with concealment information such as each concealed word and its position. Here, the following concealed phrase-delimited sentence and concealment information are output for the application input sentence above.
Concealed phrase-delimited sentence: Ａさんに／Ｃさんに／×してちょって／メールして
Concealment information: third phrase, × → メール
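The concealment in step 112 can be illustrated with a minimal Python sketch. It assumes the keyword dictionary 40 is a plain set of strings, that keywords do not overlap, and that "×" does not otherwise occur in the text; the function name `conceal_keywords` and the `(phrase index, original word)` layout of the concealment information are hypothetical, not taken from the specification.

```python
from typing import List, Set, Tuple

MASK = "×"  # placeholder symbol used in the example above


def conceal_keywords(
    phrases: List[str], keywords: Set[str]
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Mask every keyword occurrence except the last one in the sentence.

    Returns the masked phrases and concealment info as
    (0-based phrase index, original word) pairs, in left-to-right order.
    """
    # Collect every (phrase index, character offset, keyword) occurrence.
    hits = []
    for i, phrase in enumerate(phrases):
        for kw in keywords:
            start = phrase.find(kw)
            while start != -1:
                hits.append((i, start, kw))
                start = phrase.find(kw, start + len(kw))
    if not hits:
        return list(phrases), []

    hits.sort()  # document order; the final hit stays visible
    masked = list(phrases)
    info = []
    # Replace from right to left so earlier offsets stay valid.
    for i, start, kw in sorted(hits[:-1], reverse=True):
        masked[i] = masked[i][:start] + MASK + masked[i][start + len(kw):]
        info.append((i, kw))
    info.reverse()
    return masked, info


phrases = ["Ａさんに", "Ｃさんに", "メールしてちょって", "メールして"]
masked, info = conceal_keywords(phrases, {"メール"})
print("／".join(masked))
print(info)
```

The index in the returned information is 0-based, so the "third phrase" of the example appears as index 2.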

Next, in step 114, the constrained lattice generation unit 28 assigns features to each phrase of the concealed phrase-delimited sentence output in step 112 and generates, for each phrase, a lattice whose candidates are the B tag, the I tag, and the O tag. The constrained lattice generation unit 28 further attaches to each candidate constraint information indicating whether the candidate satisfies the constraints based on the citation part to be extracted, thereby generating a constrained lattice (for example, FIG. 3), and outputs the constrained lattice together with the features of each phrase.
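As a rough illustration of the data structure, the following Python sketch builds a lattice with one B/I/O column per phrase and a constraint flag on each candidate. The two constraints used here are assumptions made only for this sketch (a phrase that still contains a visible keyword must be tagged O, and the first phrase cannot be tagged I); the actual constraints of the embodiment are defined elsewhere in the specification, and feature assignment is omitted. The names `Candidate` and `build_constrained_lattice` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

TAGS = ("B", "I", "O")


@dataclass
class Candidate:
    tag: str
    allowed: bool  # constraint information: True if the constraints are satisfied


def build_constrained_lattice(
    phrases: List[str], keywords: Set[str]
) -> List[List[Candidate]]:
    """Build one column of B/I/O candidates per phrase, flagging violations."""
    lattice = []
    for i, phrase in enumerate(phrases):
        has_keyword = any(kw in phrase for kw in keywords)
        column = []
        for tag in TAGS:
            allowed = True
            # Assumed constraint 1: a phrase with a visible keyword is outside the citation.
            if has_keyword and tag != "O":
                allowed = False
            # Assumed constraint 2: a citation cannot begin with an I tag.
            if i == 0 and tag == "I":
                allowed = False
            column.append(Candidate(tag, allowed))
        lattice.append(column)
    return lattice


lattice = build_constrained_lattice(
    ["Ａさんに", "Ｃさんに", "×してちょって", "メールして"], {"メール"}
)
for column in lattice:
    print([(c.tag, c.allowed) for c in column])
```

A decoder would then restrict its search to candidates whose flag is true, which is how the constraint information narrows the choice of optimum candidates.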

Next, in step 116, the citation extraction unit 30 reads the citation extraction model 42 stored in the predetermined area and selects the optimum candidate for each phrase from the constrained lattice output in step 114. It then extracts the phrases for which the B tag or the I tag was selected as the optimum candidate (for example, FIG. 4) and, based on the concealment information output in step 112, restores the words concealed in that step to extract the citation part. Furthermore, when a phrase ends with a quotative particle such as "〜って", only the text before the quotative particle is extracted as the citation part. The unit outputs an output sentence with citation part information, in which information indicating the extracted citation part (the 『』 marks) is added to the input sentence as shown below, and the application processing routine ends.
Output sentence with citation part information: Ａさんに『Ｃさんにメールしてちょ』ってメールして
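The post-processing in step 116 (after the optimum tags have been selected) can be sketched in Python as follows. The sketch assumes the tags O, B, I, O were selected for the four phrases of the example, handles only the first B...I span, and uses a hypothetical particle list containing just "って"; the function name `extract_quotation` and the `(phrase index, original word)` form of the concealment information are likewise assumptions.

```python
from typing import List, Tuple

MASK = "×"
QUOTE_PARTICLES = ("って",)  # assumed list; the text names "〜って" as one example


def extract_quotation(
    phrases: List[str], tags: List[str], concealment: List[Tuple[int, str]]
) -> str:
    """Restore concealed words and mark the first B...I span with 『』."""
    # Undo the concealment, one masked word at a time.
    restored = list(phrases)
    for idx, word in concealment:
        restored[idx] = restored[idx].replace(MASK, word, 1)

    # Locate the first span that starts with B and continues with I tags.
    start = next((i for i, t in enumerate(tags) if t == "B"), None)
    if start is None:
        return "".join(restored)  # no citation part found
    end = start + 1
    while end < len(tags) and tags[end] == "I":
        end += 1

    # Keep a trailing quotative particle outside the citation part.
    quote, tail = "".join(restored[start:end]), ""
    for particle in QUOTE_PARTICLES:
        if quote.endswith(particle):
            quote, tail = quote[: -len(particle)], particle
            break

    before = "".join(restored[:start])
    after = "".join(restored[end:])
    return before + "『" + quote + "』" + tail + after


result = extract_quotation(
    ["Ａさんに", "Ｃさんに", "×してちょって", "メールして"],
    ["O", "B", "I", "O"],
    [(2, "メール")],
)
print(result)  # Ａさんに『Ｃさんにメールしてちょ』ってメールして
```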

As described above, the citation extraction device according to the present embodiment generates a constrained lattice to which constraints according to the characteristics of the citation part to be extracted are attached, and extracts the citation part from a document using the citation extraction model. The citation part can therefore be extracted from a document in which it is not explicitly marked. In addition, among the words that match a keyword according to the characteristics of the citation part to be extracted, the words other than the one appearing last in the document are concealed, which prevents erroneous extraction of the citation part without requiring a wide context to be examined. Furthermore, dividing the document into phrases suppresses the sparsity of the features assigned to each phrase and allows features to be assigned that also take into account information from distant parts of the document.

The present invention is not limited to the above embodiment, and various modifications and applications are possible without departing from the gist of the invention.

For example, the above embodiment describes the case where the citation extraction model is generated by learning with a discriminative learner such as a CRF, but the citation extraction model may instead be generated on a rule basis. For example, for an input sentence containing pause information, such as "メール（ポーズ）遅刻しそう" ("mail (pause) likely to be late"), a citation extraction model can be generated by writing a rule stating that when pause information follows "メール" ("mail"), the citation part begins at the word following the pause information.
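The pause rule described above can be sketched in Python as follows. The sketch assumes the input has already been tokenized into words, that pause information is represented by the hypothetical token `<pause>`, and that the citation part runs to the end of the input; none of these details is specified in the text.

```python
from typing import List, Set

PAUSE = "<pause>"  # assumed representation of pause information


def tag_by_pause_rule(tokens: List[str], keywords: Set[str]) -> List[str]:
    """Tag everything after 'keyword + pause' as the citation part (B, I, ...)."""
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i] in keywords and tokens[i + 1] == PAUSE:
            begin = i + 2  # first word after the pause
            if begin < len(tokens):
                tags[begin] = "B"
                for j in range(begin + 1, len(tokens)):
                    tags[j] = "I"
            break  # apply the rule only to the first match
    return tags


print(tag_by_pause_rule(["メール", PAUSE, "遅刻", "しそう"], {"メール"}))
# ['O', 'O', 'B', 'I']
```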

The above embodiment also describes the case where three types of tags are used to indicate whether a part belongs to the citation part: the B tag for the first part of the citation part, the I tag for the parts of the citation part other than the first, and the O tag for the parts outside the citation part. The invention is not limited to this, however, and any tags that can distinguish whether each part belongs to the citation part may be used. For example, four types may be used by adding an E tag indicating the last part of the citation part, or three types may be used consisting of the E tag, the I tag, and the O tag, without the B tag.
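To make the alternative tag sets concrete, the following Python sketch converts a B/I/O tag sequence into the four-tag (B, I, E, O) and three-tag (E, I, O) variants mentioned above. How a citation part consisting of a single part is tagged in the four-tag variant is not stated in the text; this sketch assumes it keeps the B tag. The function names are hypothetical.

```python
from typing import List


def bio_to_bioe(tags: List[str]) -> List[str]:
    """Add an E tag on the last part of each multi-part citation span."""
    out = list(tags)
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "I" and nxt != "I":
            out[i] = "E"  # last part of the span
    return out  # assumption: a single-part span stays tagged B


def bio_to_ioe(tags: List[str]) -> List[str]:
    """Drop the B tag: the span end is E, every other span part is I."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        out.append("I" if nxt == "I" else "E")
    return out


bio = ["O", "B", "I", "I", "O"]
print(bio_to_bioe(bio))  # ['O', 'B', 'I', 'E', 'O']
print(bio_to_ioe(bio))   # ['O', 'I', 'I', 'E', 'O']
```

In both variants the citation part can still be recovered from the tags alone, which is the only requirement the text places on the tag set.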

The above embodiment also describes the case where the learning unit and the application unit are implemented on the same computer, but they may be implemented on separate computers.

In addition, although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium. The citation extraction device of the present invention may also be implemented in hardware, such as a semiconductor integrated circuit, that realizes the above processing.

DESCRIPTION OF SYMBOLS
10 Citation extraction device
12 Learning unit
14 Application unit
20 Phrase dividing unit
22 Keyword concealment unit
24 Lattice generation unit
26 Citation extraction learning unit
28 Constrained lattice generation unit
30 Citation extraction unit
40 Keyword dictionary
42 Citation extraction model

Claims

A citation extraction device comprising:
a dividing means for dividing document data into a plurality of parts;
a constrained lattice generation means for assigning, to each part divided by the dividing means, a feature according to the characteristics of the citation part to be extracted from the document data, and for generating a constrained lattice in which each of a plurality of types of tags indicating whether or not a part is a citation part is a candidate for each part, and in which information indicating whether or not a constraint based on the citation part to be extracted is satisfied is given to each part; and
an extraction means for selecting, based on an extraction model for extracting a citation part from a lattice, the optimum candidate for each part from the constrained lattice generated by the constrained lattice generation means, and for extracting, as the citation part, each part for which a tag indicating a citation part is selected as the optimum candidate.

The citation extraction device according to claim 1, further comprising a concealment means for concealing, among the words in the document data that match a predetermined keyword according to the characteristics of the citation part to be extracted, the words other than the word appearing last in the document data, wherein
the constrained lattice generation means generates the constrained lattice for the document data after concealment by the concealment means, and
when a selected part includes a concealed word, the extraction means restores the concealed word and extracts the part as the citation part.

The citation extraction device according to claim 1, wherein the dividing means divides the document data into phrases.

The citation extraction device according to claim 1, further comprising:
a lattice generation means for assigning a feature according to the characteristics of the citation part to each part divided by the dividing means, and for generating a lattice in which each of the plurality of types of tags indicating whether or not a part is a citation part is a candidate for each part; and
a learning means for learning the extraction model based on the lattice generated by the lattice generation means and the correct position of the citation part in the document data.

A citation extraction method in which:
a dividing means divides document data into a plurality of parts;
a constrained lattice generation means assigns, to each part divided by the dividing means, a feature according to the characteristics of the citation part to be extracted from the document data, and generates a constrained lattice in which each of a plurality of types of tags indicating whether or not a part is a citation part is a candidate for each part, and to which information indicating whether or not a constraint based on the citation part to be extracted is satisfied is given; and
an extraction means selects, based on an extraction model for extracting a citation part from a lattice, the optimum candidate for each part from the constrained lattice generated by the constrained lattice generation means, and extracts, as the citation part, each part for which a tag indicating a citation part is selected as the optimum candidate.

A citation extraction program for causing a computer to function as each means constituting the citation extraction device according to any one of claims 1 to 4.