JP5819239B2

JP5819239B2 - Important word / phrase extraction apparatus, method, and program

Info

Publication number: JP5819239B2
Application number: JP2012084842A
Authority: JP
Inventors: 田中陽子; 廣嶋伸章; 小池義昌; 片岡良治
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2012-04-03
Filing date: 2012-04-03
Publication date: 2015-11-18
Anticipated expiration: 2032-04-03
Also published as: JP2013214239A

Description

The present invention relates to an important phrase extraction apparatus, method, and program for extracting important phrases from a document.

Suppose a user searches or browses documents with a certain request in mind. In a document being searched or browsed, a portion containing information closely related to the user's request is an important portion that should be extracted. The unit of an important portion can be a word indicating information closely related to the user's request (an important word), or a set of sentences, such as a sentence or paragraph, that contains an important word. Hereinafter, such an important portion is called an important phrase. Here, a user's request means information that the user considers important when searching or browsing documents, such as "a cheap and tasty restaurant" or "a place with a nice view", and it does not point to any specific thing. Conventionally, important phrases related to such user requests have been extracted from documents and presented to the user.

For example, a summary generation device has been proposed whose purpose is to let users easily grasp their own browsing history. The device collects a user's browsing history, classifies the browsing histories to be processed into clusters based on the similarity between them, extracts keywords for each cluster, generates a summary from those keywords and information on the browsing history of each cluster, and presents the summary to the user (see, for example, Patent Document 1).

In addition, as a keyword extraction method that extracts words of high importance to a user from the history of documents the user has viewed, a method has been proposed that takes frequently occurring words from the viewed documents and extracts, as keywords of high importance to the user, words whose co-occurrence with those frequent words is strongly biased (see, for example, Non-Patent Document 1).

Patent Document 1: JP 2011-100350 A

Non-Patent Document 1: Yutaka Matsuo, Hayato Fukuda, Mitsuru Ishizuka, "Browsing Support by Extracting Keywords from a User's Personal Browsing History", Transactions of the Japanese Society for Artificial Intelligence, Vol. 18, 2003

However, in the conventional techniques, when a word identified as an important word appears several times in a document, every portion containing that word is extracted as an important phrase. For example, consider extracting important phrases in units of sentences when the word "cheap" is an important word. If the document being searched or browsed contains sentences such as "It was cheap.", "They have a children's menu, so it worked out cheap", and "An even cheaper special price with a coupon!", each sentence uses the word "cheap", so all of them are extracted as important phrases. There are usually several kinds of important words, and the more important words there are, the more important phrases are extracted. In the worst case, almost the entire document may be extracted as important phrases.

In practice, however, sentences such as "An even cheaper special price with a coupon!" or "Lunch for just 800 yen." are considered to contain more information about the user's request "cheap" than a sentence that merely says "It was cheap". Whether a sentence contains much information about the user's request does not depend on its number of characters or words. For example, the sentence "They have a children's menu, so it worked out cheap" has no fewer characters or words than "An even cheaper special price with a coupon!" or "Lunch for just 800 yen." However, although it contains information about both the "cheap" request and the "with children" request, it is not particularly informative for a user who is searching or browsing with only the "cheap" request.

Furthermore, with the method that extracts important words using word co-occurrence, when the frequent words in the browsing history are synonyms of one another, a word co-occurs strongly with all of these similar words. As a result, even an important word may not show a large co-occurrence bias and may not be extracted. For example, when the frequent words in the browsing history include "price", "cost", and "charge", and the document to be processed contains an important word such as "cheap", the word "cheap" co-occurs strongly with every one of the frequent words and is therefore unlikely to be extracted as an important word.

The present invention has been made in view of the above circumstances. Its object is to provide an important phrase extraction apparatus, method, and program that can handle cases where information related to a user's request is expressed in a variety of words and that can extract important phrases closely related to the user's request.

To achieve the above object, the important phrase extraction apparatus of the present invention comprises: receiving means for receiving a document to be processed that contains a plurality of words, and request word sets each representing one of a plurality of user requests; document dividing means for dividing the document received by the receiving means into a plurality of divided word sets according to the length of the important phrases to be extracted; score calculation means for calculating, as the score of each divided word set, the maximum value of the inter-set similarities between the divided word set and each of the request word sets, or a weighted score, based on the inter-word similarities between each word in the divided word set produced by the document dividing means and each word in the request word sets received by the receiving means; and extraction means for extracting, as important phrases contained in the document, the divided word sets whose score calculated by the score calculation means is higher than a predetermined threshold.

According to the important phrase extraction apparatus of the present invention, the receiving means receives a document to be processed that contains a plurality of words, and request word sets each representing one of a plurality of user requests. The document dividing means then divides the received document into a plurality of divided word sets according to the length of the important phrases to be extracted. The score calculation means calculates, as the score of each divided word set, the maximum value of the inter-set similarities between the divided word set and each request word set, or a weighted score, based on the inter-word similarities between each word in the divided word set and each word in the request word sets. Finally, the extraction means extracts, as important phrases contained in the document, the divided word sets whose score is higher than a predetermined threshold.

In this way, by using a score based on the inter-word similarities between the request word set representing the user's request and the divided word sets obtained by dividing the document, the apparatus can handle cases where information related to the user's request is expressed in a variety of words and can extract important phrases closely related to the user's request.

Moreover, even when there are multiple user requests, important phrases can be extracted from the document in order, starting from the portion that best represents the user's requests.

In the important phrase extraction method of the present invention, receiving means receives a document to be processed that contains a plurality of words, and request word sets each representing one of a plurality of user requests; document dividing means divides the document received by the receiving means into a plurality of divided word sets according to the length of the important phrases to be extracted; score calculation means calculates, as the score of each divided word set, the maximum value of the inter-set similarities between the divided word set and each of the request word sets, or a weighted score; and extraction means extracts, as important phrases contained in the document, the divided word sets whose score calculated by the score calculation means is higher than a predetermined threshold.

The important phrase extraction program of the present invention is a program for causing a computer to function as each of the means constituting the above important phrase extraction apparatus.

As described above, the important phrase extraction apparatus, method, and program of the present invention use a score based on the inter-word similarities between the request word set representing the user's request and the divided word sets obtained by dividing the document. This provides the effect of handling cases where information related to the user's request is expressed in a variety of words while extracting important phrases closely related to the user's request.

FIG. 1 is a schematic diagram showing the configuration of the important phrase extraction apparatus according to the present embodiment.

FIG. 2 is an explanatory diagram illustrating an example of a method for generating request word sets.

FIG. 3 is an explanatory diagram illustrating the method of calculating the score of each divided word set.

FIG. 4 is a flowchart showing the important phrase extraction processing routine in the important phrase extraction apparatus according to the present embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Configuration of the Important Phrase Extraction Apparatus>

The important phrase extraction apparatus according to the present embodiment is a computer comprising a CPU (Central Processing Unit), a RAM (Random Access Memory), and a ROM (Read Only Memory) that stores a program for executing the important phrase extraction processing routine described later. Functionally, as shown in FIG. 1, this computer can be represented as comprising a document reception unit 12, a user request reception unit 14, a setting reception unit 16, an important phrase extraction unit 18 that includes a morphological analysis unit 20, a document division unit 22, a score calculation unit 24, and an extraction unit 26, an inter-word similarity determination database (DB) 28, and an output unit 30.

The document reception unit 12 receives a document (text data) to be processed, which is input through a known input device such as a keyboard, mouse, or storage device. The document to be processed can be, for example, the web page the user is currently viewing on a personal computer or the like, or a document at the destination of a link in the document being viewed.

The user request reception unit 14 receives a user request input through the input device. A user request is a word set containing one or more words that represent a request the user has when searching or browsing documents. The words in the set may be of any part of speech. For example, when a user searches or browses documents wanting to "get a good deal and keep it cheap", the single word "price" may be the user request, or a word set containing two or more words, such as "good deal, cheap", may be the user request. A user request may also be expressed by a word set that includes synonyms, such as "price, cost", or related words, such as "coupon, sale".

The way user requests are input to the user request reception unit 14 is not particularly limited. A user request entered by the user through the input device may be received in advance and applied to every document, or a user request may be received for each document.

The user request reception unit 14 may also receive text data entered by the user in natural-language form and automatically extract the user request from it. For example, on receiving the natural-language text "I want to get a good deal and keep it cheap" entered through the input device, the unit can analyze the text and convert it into a word set representing the user request, such as "good deal, cheap".

To save the user the effort of input, the user request reception unit 14 may also present multiple user requests to the user in a form such as a selection list and receive the user request that the user selects. For example, multiple word sets of words with similar meanings are prepared as word sets representing user requests, such as word set 1: "price, cost, coupon, sale", word set 2: "child, family, parent and child, baby, baby food", word set 3: ..., and presented to the user as a selection list. A user who wants to "get a good deal and keep it cheap" selects word set 1 from the list, and the user request reception unit 14 receives the selected word set 1: "price, cost, coupon, sale". The number of words may differ from one word set to another.

The number of user requests is not limited to one. When there are several, the unit receives multiple user requests, each expressed as a word set as described above. For example, if the user wants to "get a good deal and keep it cheap" and also wants "something suitable for bringing children", the two word sets "price, cost, coupon, sale" and "child, family, parent and child, baby, baby food" are received as two user requests. When natural-language text is analyzed and converted into user requests as described above, text such as "I want to get a good deal and keep it cheap, and I want something suitable for bringing children" can be analyzed and converted into these two word sets. When a selection list is presented as described above, multiple selections from the list are permitted. In the example above, the user selects word set 1 and word set 2 from the list, and the user request reception unit 14 receives these two word sets.

An example of a method for generating word sets that represent user requests is now described. As shown in FIG. 2, the method uses a log of queries, where a query is a word or combination of words that a user entered when searching for documents. Only queries containing multiple words are extracted from the query log, for example "Minato Mirai lunch" and "Minato Mirai lunch coupon". The words contained in these queries are then clustered with the K-means method, using the similarity between words as the distance. This yields multiple word sets, each gathering words with similar meanings or strong relationships.

Word sets generated in this way can be used when converting natural-language text into word sets representing user requests, or when presenting a selection list, as described above. The method for generating word sets representing user requests is not limited to this example.
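The clustering step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each query word already has a vector (the 2-D `toy_vectors` are made up), uses Euclidean distance as a stand-in for "similarity as distance", and seeds the centroids from the first k words so the result is deterministic. The function name `kmeans_words` and the choice of k are hypothetical.

```python
import math

def kmeans_words(vectors, k, iterations=10):
    """Cluster words by their vectors with plain K-means.

    vectors: dict mapping word -> list of floats (hypothetical toy embeddings).
    Centroids start from the first k words so the result is deterministic.
    """
    words = list(vectors)
    centroids = [list(vectors[w]) for w in words[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: put each word in the cluster of its nearest centroid.
        for w in words:
            nearest = min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
            clusters[nearest].append(w)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                dims = zip(*(vectors[w] for w in members))
                centroids[c] = [sum(d) / len(members) for d in dims]
    return clusters

# Words taken from multi-word queries; the 2-D vectors are made up for illustration.
toy_vectors = {
    "coupon": [0.9, 0.1],
    "child": [0.1, 0.9],
    "price": [1.0, 0.0],
    "sale": [0.8, 0.2],
    "family": [0.0, 1.0],
    "baby": [0.2, 0.8],
}
word_sets = kmeans_words(toy_vectors, k=2)
```

With these toy vectors the money-related words and the family-related words end up in separate clusters, each of which could serve as one candidate word set for a selection list.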

The setting reception unit 16 receives information indicating various settings input through the input device. The settings include the unit of the important phrases to be extracted (a word set of arbitrary length, such as a word, sentence, or paragraph) and a setting for the threshold th used by the score calculation unit 24 described later.

The important phrase extraction unit 18 divides the document to be processed according to the unit of the important phrases to be extracted and finds, among the resulting word sets, those that represent the user's request. Each part of the important phrase extraction unit 18 is described in detail below.

The morphological analysis unit 20 performs morphological analysis on the document (text data) received by the document reception unit 12, using a conventionally known technique.

As shown in FIG. 3, the document division unit 22 divides the document to be processed into a plurality of word sets l_1, ..., l_h, based on the analysis result from the morphological analysis unit 20 and the unit of the important phrases to be extracted, which was received by the setting reception unit 16. A given divided word set l_y (hereinafter, "divided word set") is assumed to contain t words w_y1, ..., w_yt. When the unit of extraction is a word, each word is regarded as a word set containing one word.
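The division step can be sketched as follows, under simplifying assumptions: the input is English text, and a regular-expression tokenizer stands in for morphological analysis (a Japanese document would need a real morphological analyzer). The function names and the `unit` values "word", "sentence", and "paragraph" are hypothetical.

```python
import re

def tokenize(text):
    """Stand-in for morphological analysis: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def divide_document(text, unit="sentence"):
    """Divide text into divided word sets l_1..l_h for the chosen unit.

    unit: "word", "sentence", or "paragraph" (hypothetical setting names).
    Returns a list of word lists; empty segments are dropped.
    """
    if unit == "word":
        # Each word is treated as a word set containing one word.
        return [[w] for w in tokenize(text)]
    if unit == "sentence":
        segments = re.split(r"(?<=[.!?])\s+", text)
    elif unit == "paragraph":
        segments = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unknown unit: {unit}")
    return [words for words in map(tokenize, segments) if words]

doc = "It was cheap. Lunch for just 800 yen!\n\nGreat view from the window."
sentence_sets = divide_document(doc, unit="sentence")
```

Here `sentence_sets` holds three word lists, one per sentence, while `divide_document(doc, unit="paragraph")` yields two.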

The score calculation unit 24 acquires the word sets a_1, ..., a_i representing the user requests received by the user request reception unit 14 (hereinafter, "request word sets"). A given request word set a_x is assumed to contain k words n_x1, ..., n_xk. The score calculation unit 24 first obtains the similarity s(l_y, n_x1) between the divided word set l_y and the word n_x1 contained in the request word set a_x. As shown in Equation (1), the similarity s(l_y, n_x1) can be, for example, the average of the top j similarities, taken in descending order of similarity to the word n_x1, among the words w_y1, ..., w_yt contained in the divided word set l_y. Here j is an arbitrary integer from 1 to t, the j words of w_y1, ..., w_yt with the highest similarity are denoted w'_y1, ..., w'_yj in descending order, and sim(w'_y1, n_x1) denotes the similarity between the word w'_y1 and the word n_x1.
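Equation (1) itself does not appear in this text. Based on the definitions above, it can be reconstructed as the following average; the summation index m is notation introduced here for illustration.

```latex
s(l_y, n_{x1}) = \frac{1}{j} \sum_{m=1}^{j} \mathrm{sim}(w'_{ym}, n_{x1}) \tag{1}
```

With j = 1 this reduces to the single highest word-to-word similarity; with j = t it is the average over every word in l_y.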

Then, according to Equation (2), the similarities s between the divided word set l_y and each of the words n_x1, ..., n_xk contained in the request word set a_x are averaged to calculate the similarity S(l_y, a_x) between the divided word set l_y and the request word set a_x.
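Equation (2) is likewise missing from this text. Read directly from the description, it is the mean of the per-word similarities s over the k words of the request word set; the index m is introduced here for illustration.

```latex
S(l_y, a_x) = \frac{1}{k} \sum_{m=1}^{k} s(l_y, n_{xm}) \tag{2}
```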

Here, the similarity between words is obtained using a concept base as the inter-word similarity determination DB 28. A concept base is a database of concept vectors: singular value decomposition is applied to a co-occurrence matrix that records the co-occurrence frequencies of words in a corpus, and each word is represented by a vector of reduced dimensionality (reference: 別所克人, 古瀬蔵, 片岡良治, "Concept vector generation method based on co-occurrence of words and semantic attributes", Annual Conference of the Japanese Society for Artificial Intelligence, 2006). The method of obtaining the similarity between words is not limited to a concept base; a method using a thesaurus or the like may also be used.
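A word-to-word similarity lookup over concept vectors might look like the sketch below. The text does not state which similarity measure is applied to the concept vectors; cosine similarity is assumed here as a common choice. The `CONCEPT_BASE` contents are made-up toy values, not the output of an actual singular value decomposition.

```python
import math

# Hypothetical concept base: word -> reduced-dimension concept vector.
# A real concept base would come from SVD of a corpus co-occurrence matrix.
CONCEPT_BASE = {
    "cheap": [0.9, 0.1, 0.0],
    "price": [0.8, 0.2, 0.1],
    "coupon": [0.7, 0.1, 0.3],
    "view": [0.0, 0.1, 0.9],
}

def sim(word_a, word_b, concept_base=CONCEPT_BASE):
    """Cosine similarity of two words' concept vectors (0.0 if either is unknown)."""
    va, vb = concept_base.get(word_a), concept_base.get(word_b)
    if va is None or vb is None:
        return 0.0
    dot = sum(a * b for a, b in zip(va, vb))
    norm = math.sqrt(sum(a * a for a in va)) * math.sqrt(sum(b * b for b in vb))
    return dot / norm if norm else 0.0
```

In this toy data, `sim("cheap", "price")` is much larger than `sim("cheap", "view")`, which is the property the scoring relies on: different words about the same request land close together.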

When there is one user request, the score calculation unit 24 uses the calculated similarity S as the score score(l_y) of the divided word set l_y. When there are two or more user requests, it calculates the similarity between the divided word set l_y and each of the request word sets a_1, ..., a_i representing the respective user requests and, as shown in Equation (3), uses the largest value as the score score(l_y) of the divided word set l_y. Alternatively, as shown in Equation (4), the similarities between the divided word set l_y and the request word sets a_1, ..., a_i, each multiplied by a weight ω, may be used as the score. The weights ω can be graded in descending order of the similarity between the divided word set l_y and each request word set, or set according to factors such as the strength of each user request.
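Equations (3) and (4) are not reproduced in this text. Equation (3) follows directly from the description as a maximum over the request word sets. For Equation (4), the description says only that each similarity is weighted by ω, so the weighted sum below is one plausible form and should be read as an assumption; the per-request subscript on ω is also introduced here.

```latex
\mathrm{score}(l_y) = \max_{1 \le x \le i} S(l_y, a_x) \tag{3}

\mathrm{score}(l_y) = \sum_{x=1}^{i} \omega_x \, S(l_y, a_x) \tag{4}
```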

For example, the user request "I want to know a restaurant that is cheap, good for families, and has a beautiful view" includes a money-related request ("cheap"), a request about the restaurant's facilities and atmosphere ("good for families"), and a request about the view ("beautiful view"). In the conventional techniques, words closely related to each request are given high importance, so the more requests a user has, the more important phrases are extracted, and the extraction can become pointless. In the present embodiment, the score is calculated from the similarities between the divided word set l_y and the request word sets a_1, ..., a_i representing the respective user requests, as described above. Important phrases can therefore be extracted in order, starting from the portions that contain the most information or best represent the user's requests. For example, given a phrase that contains information about "cheap" and a phrase that contains information about both "cheap" and "good for families", the latter is more likely to be extracted, so the user can quickly find the portions that contain more information about their requests and reach the information they want.

When a divided word set l_y contains multiple sentences, clauses, words, or the like, the score calculation unit 24 may also subdivide it into those units, obtain a score for each, and use the product of those scores, a weighted product, or the like as the score of the divided word set l_y. The score calculation unit 24 calculates this score for every divided word set.

The extraction unit 26 extracts, as important phrases, the divided word sets whose score calculated by the score calculation unit 24 exceeds a threshold th. The threshold th can be chosen freely according to how many important phrases need to be extracted. For example, it may be chosen so that the divided word sets with the top k scores are extracted, or the mean score may be used as th so that divided word sets scoring above the overall mean are extracted. The higher the threshold th, the more the extraction is restricted to the divided word sets that best represent information about the user's request; the lower the threshold, the more divided word sets carrying various information related to the user's request are extracted. The same threshold th may be used for every document, or it may be adjusted for each document according to its content, length, and so on. In addition to the threshold th, a limit on the number of words or characters to be extracted may be imposed, so that important phrases are extracted only within that limit.
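The threshold strategies described above can be sketched as follows. The function names are hypothetical, and `scores` is assumed to be a list holding one score per divided word set, in document order.

```python
def extract_above(scores, th):
    """Return indices of divided word sets whose score exceeds th."""
    return [y for y, score in enumerate(scores) if score > th]

def mean_threshold(scores):
    """Use the mean score as th."""
    return sum(scores) / len(scores)

def top_k(scores, k):
    """Return indices of the k highest-scoring divided word sets, best first."""
    ranked = sorted(range(len(scores)), key=lambda y: scores[y], reverse=True)
    return ranked[:k]

scores = [0.20, 0.85, 0.40, 0.95]
above_mean = extract_above(scores, mean_threshold(scores))
best_two = top_k(scores, 2)
```

Lowering the threshold, for example `extract_above(scores, 0.3)`, admits more divided word sets, matching the trade-off described in the text.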

The output unit 30 outputs the important phrases so that they are presented to the user, for example by highlighting the important phrases extracted by the extraction unit 26 within the document the user is viewing.

<Operation of the Important Phrase Extraction Apparatus>

Next, the operation of the important phrase extraction apparatus 10 according to the present embodiment is described.

While the user is searching or browsing documents on a personal computer or the like, when the user inputs a user request and information indicating various settings, the important phrase extraction apparatus 10 executes the important phrase extraction processing routine shown in FIG. 4.

In step 100, the document reception unit 12 receives the document (text data) to be processed, the user request reception unit 14 receives the request word sets a_x (x = 1, ..., i) representing the user requests, and the setting reception unit 16 receives information indicating various settings, including the unit of the important phrases to be extracted.

Next, in step 102, the morphological analysis unit 20 performs morphological analysis on the document received in step 100. Then, in step 104, the document division unit 22 divides the document into a plurality of divided word sets l_y (y = 1, ..., h) based on the analysis result from step 102 and the unit of the important phrases to be extracted received in step 100.

Next, in step 106, the score calculation unit 24 calculates the score score(l_y) of each divided word set l_y based on the similarity S(l_y, a_x) between the divided word set l_y and the request word set a_x.

Next, in step 108, the extraction unit 26 extracts and outputs, as important phrases, the divided word sets whose score calculated in step 106 exceeds the threshold th, and the important phrase extraction processing routine ends.
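
The flow of steps 104 to 108 might be sketched as follows. This is only an illustration under stated assumptions: the units are assumed to be already split and morphologically analyzed (steps 102 and 104), the score is taken as the maximum similarity over the request word sets, and `overlap_similarity` is a toy stand-in that is not the semantic similarity S of the embodiment. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Unit = Tuple[str, List[str]]  # (original text, words kept for scoring)


def overlap_similarity(words: List[str], request_words: Set[str]) -> float:
    """Toy stand-in for S(l_y, a_x): share of words found in the request set."""
    if not words:
        return 0.0
    return sum(w in request_words for w in words) / len(words)


def extract_important_phrases(
    units: Sequence[Unit],
    request_sets: Sequence[Set[str]],
    similarity: Callable[[List[str], Set[str]], float],
    th: float,
) -> List[Tuple[str, float]]:
    """Score each divided word set and keep those whose score exceeds th."""
    extracted = []
    for text, words in units:
        # Step 106: score based on similarity to the request word sets
        # (assumed here: maximum over all request word sets).
        score = max(similarity(words, a) for a in request_sets)
        # Step 108: threshold comparison.
        if score > th:
            extracted.append((text, score))
    return extracted


units = [
    ("cheap lunch set", ["cheap", "lunch", "set"]),
    ("orange juice", ["orange", "juice"]),
]
request_sets = [{"cheap", "set", "price"}, {"stylish", "design"}]
result = extract_important_phrases(units, request_sets, overlap_similarity, 0.5)
print(result)
```

Passing the similarity as a callable keeps the routine independent of how the word-level similarity is actually obtained.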

As described above, the important phrase extraction device according to the present embodiment uses a score based on the distance (similarity), in a semantic space between words, between the space representing the user's request (request word set) and each part obtained by dividing the document into predetermined units (divided word set). This avoids the problem of the conventional technique based on co-occurrence bias, in which words with similar meanings fail to be extracted, and makes it possible to handle documents in which information on the user's request is expressed in a variety of words. In addition, extracting important phrases by comparing the score with a threshold prevents most of the document from being extracted as important phrases, so that important phrases closely related to the user's request can be extracted.

When there are multiple user requests, using as the score the maximum of the similarities between the divided word set and the request word sets representing the respective user requests, or a weighted similarity, makes it possible to extract important phrases in order, starting from the parts that contain the most information on the user requests or that best represent them, and prevents most of the document from being extracted as important phrases.

Furthermore, whereas the conventional technique requires the user's browsing history, the present embodiment can extract information related to each user's request even without a browsing history.

The important phrases extracted in this way can be put to use when browsing or searching documents, for example documents on the web. When a document is presented to the user, highlighting the extracted important phrases may let the user find the desired information more quickly, so that documents can be browsed and information searched efficiently. The extracted important phrases can also be presented as snippets in document search results, and important phrases extracted from multiple documents can be combined to aggregate and present information on the user's request from blogs, review articles, and the like written by various people. In particular, when searching for information on a mobile terminal such as a smartphone while away from home, browsing all the information in many documents is tedious; extracting only the important phrases that match the user's request and presenting them saves the user effort. Moreover, information related to the user's request can be obtained efficiently not only from documents on the web but also when browsing large accumulated document collections, electronic books, and the like.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

Although the important phrase extraction device described above contains a computer system, the term “computer system” also includes a homepage providing environment (or display environment) when a WWW system is used.

Although this specification describes an embodiment in which the program is installed in advance, the program can also be provided stored on a computer-readable recording medium. The present invention can also be realized by installing the program on a well-known computer via a medium or a communication line.

<Example>

An example is shown below. Consider a user with two requests: “cheap and good value” and “atmosphere of the shop”. The following request word sets were used to represent the user requests.

a_1 = {“セット” (set), “クーポン” (coupon), “安い” (cheap), “激安” (dirt cheap), “値段” (price), “手配” (arrangement), “料金” (fee), “ツアー” (tour), “格安” (bargain-priced), “価格” (price), “安価” (inexpensive), “無料” (free), “割安” (good value), “パッケージ” (package), “得” (good deal), “プラン” (plan), “コスト” (cost), “高い” (expensive), “キャンペーン” (campaign), “フリープラン” (free plan)}, k_1 = 20
a_2 = {“かわいい” (cute), “センス” (taste), “流行” (trend), “お洒落” (stylish), “花車” (dainty), “ファッション” (fashion), “デザイン” (design), “エレガント” (elegant), “上品” (refined), “カジュアル” (casual), “シック” (chic), “服” (clothes), “コーディネート” (outfit coordination), “セクシー” (sexy), “可愛らしい” (lovely), “アクセサリー” (accessory), “オシャレ” (stylish), “ブランド” (brand), “可愛い” (cute), “洒落” (stylish)}, k_2 = 20

The unit of the important phrases to be extracted was the sentence, and the words of each divided word set were the words in the corresponding sentence whose part of speech is noun or adjective. Here, j in equation (1) was set to j = 2, and because the score values are small, their logarithm was used. The score calculation is explained using the following sentence (divided word set l_1).

l_1 = “ランチでパスタという日には・・” (“On a day when you want pasta for lunch...”)
The words in this divided word set l_1 that satisfy the above conditions are
w_1 = {“ランチ” (lunch), “パスタ” (pasta), “日” (day)}
so that t = 3. The similarity between this divided word set l_1 and the request word set a_1 given above is obtained as follows.

First, the similarity between the word n_11 = “セット” (set) in the request word set a_1 and the divided word set l_1 is obtained. Here,
sim(“ランチ”, “セット”) = 0.0667
sim(“パスタ”, “セット”) = 0.0315
sim(“日”, “セット”) = 0.0570
so that
w'_1 = {“ランチ” (lunch), “日” (day)}
Using this w'_1, the similarity between the word n_11 = “セット” in the request word set a_1 and the divided word set l_1 is obtained by the following equation (5), which is equation (1) with j = 2.

This is calculated for every word in the request word set a_1 and averaged. That is, the similarity S(l_1, a_1) between the divided word set l_1 and the request word set a_1 is calculated by the following equation (6), which is equation (2) with k = k_1 = 20.
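
The selection of w'_1 and the averaging over the request words can be sketched in Python as below. Equations (5) and (6) are not reproduced in this text, so two points are assumptions: that w' consists of the j words with the highest word similarity (inferred from the three sim values and w'_1 above), and the `combine` function, which stands in for the unspecified way equation (5) merges the selected similarities. The helper names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence


def top_j_words(sims: Dict[str, float], j: int) -> List[str]:
    """Return the j words with the highest similarity, in sentence order."""
    best = set(sorted(sims, key=sims.get, reverse=True)[:j])
    return [w for w in sims if w in best]


def set_similarity(
    words: Sequence[str],
    request_words: Sequence[str],
    word_sim: Callable[[str, str], float],
    combine: Callable[[List[float]], float],
    j: int,
) -> float:
    """Average, over the k request words, of a per-request-word similarity."""
    per_request_word = []
    for n in request_words:
        sims = {w: word_sim(w, n) for w in words}
        selected = top_j_words(sims, j)
        # combine() is a placeholder for equation (5), which is not shown here.
        per_request_word.append(combine([sims[w] for w in selected]))
    return mean(per_request_word)  # averaging over the k request words


sims = {"ランチ": 0.0667, "パスタ": 0.0315, "日": 0.0570}
print(top_j_words(sims, 2))  # ['ランチ', '日']
```

With j = 2 the lowest-scoring word “パスタ” is dropped, which reproduces w'_1 from the worked example.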

The similarity S(l_1, a_2) was calculated in the same way for the request word set a_2 representing the other user request, and the logarithm of the largest similarity was taken as the score score(l_1) of the divided word set l_1. In the output example below, the threshold th was determined by the following equation (7).

Here, thlv is a parameter for adjusting how many important phrases are extracted. In the output example below, sentences marked ○ are those extracted with thlv = 0.85, which sets the threshold th relatively high. Sentences marked △ are those extracted in addition to the ○ sentences with thlv = 0.6, which sets th relatively low. The ○ sentences are those that most strongly express the user's requests, and the △ sentences are those related to the user's requests. The number in parentheses at the end of each sentence is its score. For sentences containing no noun or adjective, the score cannot be calculated under the conditions of this example, so the score is shown as “null” and the sentence is excluded from extraction.
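
The scoring rule of this example (logarithm of the largest similarity, with “null” for sentences that have no noun or adjective) can be sketched as follows. The natural logarithm, the handling of a non-positive similarity, and all function names are assumptions; equation (7) for th is not reproduced here, so the threshold is simply passed in.

```python
import math
from typing import Dict, List, Optional, Sequence


def sentence_score(set_similarities: Sequence[float],
                   n_content_words: int) -> Optional[float]:
    """Log of the largest similarity over the request word sets.

    Returns None (shown as "null") when the sentence has no noun or adjective.
    """
    if n_content_words == 0:
        return None
    best = max(set_similarities)
    if best <= 0.0:
        return float("-inf")  # assumed guard: log is undefined for best <= 0
    return math.log(best)  # assumption: natural logarithm


def extract(scores: Dict[str, Optional[float]], th: float) -> List[str]:
    """Keep sentences whose score exceeds th; null scores are never extracted."""
    return [s for s, v in scores.items() if v is not None and v > th]


scores = {
    "sentence A": sentence_score([0.5, 0.25], 3),
    "sentence B": sentence_score([0.01, 0.02], 2),
    "sentence C": sentence_score([], 0),
}
print(extract(scores, math.log(0.1)))
```

Because the logarithm is monotonic, taking it changes the scale of the scores but not the ranking of the sentences.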

(Output example)
・ On a day when you want pasta for lunch... (-85.4484637354695)
・ ABC Cafe (-88.0978786776667)
○ This is a shop where you can have a reasonably priced, casual Italian lunch (-34.9677383949862)
○ It is a stylish-looking cafe (-33.9715381578441)
○ At lunchtime there is always quite a line, with everyone waiting on chairs outside the shop, so it is best to come for an early or late lunch outside 12:00-13:00 (-43.8880971908371)
○ The interior was bright and lively (-49.225867499909)
△ Here, at lunchtime the pasta comes with bread, salad, and a drink (-60.503565658478)
△ What's more, a large portion is free (-63.4664379716312)
・ No wonder people of all ages line up.... (-98.540852414816)
・ Bread and salad (-184.423315416741)
・ The salad dressing is tasty too (-76.8044815622487)
・ Everything has arrived (null)
・ For the drink, orange juice (-94.2192436418884)
・ The pasta is oil-based with tuna and fresh tomato (-66.83119207037)
・ <Large portion> (-184.613623137593)
・ I was hungry, so it was truly satisfying (-70.390793748566)
・ The taste is decently good (-72.6345729746803)
○ What stands out is the price: just ¥950 for this volume and content (-38.5748480842723)
・ And there seem to be about five kinds, changing with the season (-141.449011600035)
△ There is also a lunch stamp card (-60.3557579207)
△ Collect 10 stamps and one lunch is free! I want to become a regular!! (-58.9474098383104)
△ In the evening it is a bit calmer and more relaxed (-59.2116103942359)
△ Might be good for a date too (-61.5268202682332)

The extracted sentences (divided word sets) may differ if the number of words used in the request word sets representing the user's requests is changed or new words are added. Also, concept vectors were used here to obtain the similarity values, and changing this may likewise change the extracted sentences.

DESCRIPTION OF SYMBOLS
10 Important phrase extraction device
12 Document reception unit
14 User request reception unit
16 Setting reception unit
18 Important phrase extraction unit
20 Morphological analysis unit
22 Document division unit
24 Score calculation unit
26 Extraction unit
30 Output unit

Claims

Receiving means for receiving a processing target document including a plurality of words and request word sets each representing one of a plurality of user requests;
Document dividing means for dividing the document received by the receiving means into a plurality of divided word sets according to the length of the important phrases to be extracted;
Score calculating means for calculating, as the score of each divided word set, the maximum value of the inter-word-set similarities between that divided word set and each of the request word sets, or a weighted score, based on the inter-word similarity between each word included in the divided word set obtained by the document dividing means and each word included in the request word sets received by the receiving means;
Extraction means for extracting, as an important phrase included in the document, each divided word set whose score calculated by the score calculating means is higher than a predetermined threshold;
An important phrase extraction device comprising the above means.

The receiving means receives a processing target document including a plurality of words and request word sets each representing one of a plurality of user requests,
The document dividing means divides the document received by the receiving means into a plurality of divided word sets according to the length of the important phrases to be extracted,
The score calculating means calculates, as the score of each divided word set, the maximum value of the inter-word-set similarities between that divided word set and each of the request word sets, or a weighted score, based on the inter-word similarity between each word included in the divided word set obtained by the document dividing means and each word included in the request word sets received by the receiving means,
An important phrase extraction method, wherein the extraction means extracts, as an important phrase included in the document, each divided word set whose score calculated by the score calculating means is higher than a predetermined threshold.

An important phrase extraction program for causing a computer to function as each means constituting the important phrase extraction device according to claim 1.