JP4592566B2

JP4592566B2 - Topic extraction method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP4592566B2
Application number: JP2005329268A
Authority: JP
Inventors: 裕一郎関口; 吉秀佐藤; 晴美川島; 雅博奥
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2005-11-14
Filing date: 2005-11-14
Publication date: 2010-12-01
Anticipated expiration: 2025-11-14
Also published as: JP2007140602A

Description

The present invention relates to a topic extraction method and apparatus, a program, and a computer-readable recording medium. In particular, it relates to a topic extraction method and apparatus, a program, and a computer-readable recording medium for extracting the characteristic phrases treated as topics in each document, in a situation where documents containing new information can be obtained one after another from one or more information sources.

With the development of information media such as the Internet, anyone can now easily publish information, and documents are published daily by a wide variety of senders. Under these circumstances, analyzing the document information created to date is expected to make it possible to extract the matters that are topics in each document.

Several techniques have been proposed that, for large and diverse document collections such as those uploaded to the Internet, extract the characteristic phrases treated as topics in a document by considering how frequently phrases appear across fields and over time.

As a conventional technique, documents uploaded to a network system are acquired together with their creation time information and automatically classified, according to their content, into a plurality of preset fields. For a phrase whose frequency of appearance within a field increases characteristically over time and which does not appear in other fields, a high topic degree (a value indicating how much the phrase is a topic) is calculated, treating it as a characteristic phrase representing a topic (see, for example, Patent Document 1).
Patent Document 1: JP 2005-135311 A

However, in the above conventional technique, the field categories used for classification are set manually in advance. As the tendency of documents uploaded to the network changes over time, the field categories have to be set again every time a new field emerges.

Furthermore, a document whose content is not covered by the set field categories is not classified accurately and is processed together with documents of other fields. As a result, a phrase representing the topic of that document does not show a characteristic frequency of appearance within the field into which the document was classified, and a high topic degree is not calculated for it.

The present invention has been made in view of the above points. Its object is to provide a topic extraction method and apparatus, a program, and a computer-readable recording medium that can calculate a high topic degree for phrases representing matters that are topics in the field a document deals with, without manually setting candidate fields for the content of the documents to be processed.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (Claim 1) is a topic degree calculation method that analyzes a set of documents created by a large number of information sources and calculates the strength of topicality of phrases contained in a processing target document, the method performing:
a phrase extraction step (step 1) in which, when a set of documents each carrying information on the source that created it is input, phrase extraction means analyzes the set of documents, cuts out from the documents the phrases to be evaluated for topicality, attaches the source information, and records them in a phrase database;
a source relevance calculation step (step 2) in which source relevance calculation means accumulates in a source feature buffer, for each source, a source feature list whose features are the numbers of appearances of the phrases cut out from that source's documents, calculates from the source feature lists a relevance R_ij indicating the degree of similarity between a pair of sources i and j, and records it in a relevance database; and
a topic degree calculation step in which topic degree calculation means accepts a processing target document as input and calculates, from a reference relevance distribution based on the source i of the processing target document and a phrase relevance distribution obtained for each phrase k contained in the processing target document, the degree to which the phrase k is a topic.
In the topic degree calculation step:
the relevance R_ij between the source i of the processing target document and each other source j is acquired from the relevance database, and the reference relevance distribution is obtained as the count, for each of a predetermined number N of relevance steps, of the sources j whose relevance R_ij falls within that step;
the sources l that have the phrase k are acquired by referring to the phrase database, the relevance R_il between the source i and each source l is acquired from the relevance database, and the phrase relevance distribution is obtained as the count, for each relevance step, of the sources l whose relevance R_il falls within that step; and
the reference relevance distribution and the phrase relevance distribution are compared, and when the phrase relevance distribution of the phrase k is biased toward a range of higher relevance than the reference relevance distribution, the phrase is regarded as a topic word handled frequently in documents highly related to the source i of the processing target document. For each relevance step, the value obtained by subtracting n0, the relevance at the centroid of the reference relevance distribution, from the relevance n of that step is multiplied by the value obtained by subtracting the reference relevance distribution value from the phrase relevance distribution value at that step, and the sum of these products is taken as the degree to which the phrase k is a topic.
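The scoring rule of Claim 1 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names are hypothetical, each relevance step is represented by its bin index n, and the centroid n0 is assumed to be the count-weighted mean of the bin indices of the reference distribution (this section does not spell out how the centroid is computed).

```python
from typing import Sequence


def centroid(dist: Sequence[float]) -> float:
    """Count-weighted mean bin index (assumed definition of n0)."""
    total = sum(dist)
    if total == 0:
        raise ValueError("distribution is empty")
    return sum(n * v for n, v in enumerate(dist)) / total


def topic_degree(reference: Sequence[float], phrase: Sequence[float]) -> float:
    """Sum over steps n of (n - n0) * (phrase[n] - reference[n])."""
    if len(reference) != len(phrase):
        raise ValueError("distributions must have the same number of steps")
    n0 = centroid(reference)
    return sum(
        (n - n0) * (p - r)
        for n, (r, p) in enumerate(zip(reference, phrase))
    )
```

With a flat reference distribution, a phrase whose sources cluster in the high-relevance steps scores higher than one whose sources cluster in the low-relevance steps, which is the bias the claim is designed to detect.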

In the present invention (Claim 2), the value of the phrase relevance distribution for each relevance step is the total number of times the phrase k is used by the sources l falling within that step.

In the present invention (Claim 3), the reference relevance distribution and the phrase relevance distribution are each normalized, and the degree to which the phrase k is a topic is obtained using the normalized reference relevance distribution and phrase relevance distribution.
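A sketch of the normalization in Claim 3 follows. The claim does not state how the distributions are normalized; dividing each distribution by its own total, so that the steps sum to 1, is an assumption made here for illustration, and the function name is hypothetical.

```python
from typing import List, Sequence


def normalize(dist: Sequence[float]) -> List[float]:
    """Scale a distribution so that its values sum to 1 (assumed scheme)."""
    total = sum(dist)
    if total == 0:
        # No sources at all: return an all-zero distribution of the same size.
        return [0.0] * len(dist)
    return [v / total for v in dist]
```

Normalizing matters because the two distributions can differ greatly in scale: the reference distribution counts every other source, while the phrase distribution is built only from the sources that used the phrase.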

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (Claim 4) is a topic degree calculation apparatus that analyzes a set of documents created by a large number of information sources and calculates the strength of topicality of phrases contained in a processing target document, the apparatus comprising:
phrase extraction means 210 that, when a set of documents each carrying information on the source that created it is input, analyzes the set of documents, cuts out from the documents the phrases to be evaluated for topicality, attaches the source information, and records them in a phrase database 220;
source relevance calculation means that accumulates in a source feature buffer, for each source, a source feature list whose features are the numbers of appearances of the phrases cut out from that source's documents, calculates from the source feature lists a relevance R_ij indicating the degree of similarity between a pair of sources i and j, and records it in a relevance database 240; and
topic degree calculation means 250 that accepts a processing target document as input and calculates, from a reference relevance distribution based on the source i of the processing target document and a phrase relevance distribution obtained for each phrase k contained in the processing target document, the degree to which the phrase k is a topic.
The topic degree calculation means 250 includes means that:
acquires from the relevance database 240 the relevance R_ij between the source i of the processing target document and each other source j, and obtains the reference relevance distribution as the count, for each of a predetermined number N of relevance steps, of the sources j whose relevance R_ij falls within that step;
acquires the sources l that have the phrase k by referring to the phrase database 220, acquires from the relevance database 240 the relevance R_il between the source i and each source l, and obtains the phrase relevance distribution as the count, for each relevance step, of the sources l whose relevance R_il falls within that step; and
compares the reference relevance distribution and the phrase relevance distribution and, when the phrase relevance distribution of the phrase k is biased toward a range of higher relevance than the reference relevance distribution, regards the phrase as a topic word handled frequently in documents highly related to the source i of the processing target document. For each relevance step, the value obtained by subtracting n0, the relevance at the centroid of the reference relevance distribution, from the relevance n of that step is multiplied by the value obtained by subtracting the reference relevance distribution value from the phrase relevance distribution value at that step, and the sum of these products is taken as the degree to which the phrase k is a topic.

In the present invention (Claim 5), the topic degree calculation means 250 includes means for setting the value of the phrase relevance distribution for each relevance step to the total number of times the phrase k is used by the sources l falling within that step.

In the present invention (Claim 6), the topic degree calculation means 250 includes means for normalizing the reference relevance distribution and the phrase relevance distribution individually and obtaining the degree to which the phrase k is a topic using the normalized reference relevance distribution and phrase relevance distribution.

The present invention (Claim 7) is a topic degree calculation program for causing a computer to function as each of the means constituting the topic degree calculation apparatus according to any one of Claims 4 to 6.

The present invention (Claim 8) is a computer-readable recording medium storing the topic degree calculation program according to Claim 7.

As described above, according to the present invention, for a group of documents covering a wide variety of fields created by various sources, such as diaries on the Web, the relevance between each source and the source of the processing target document is obtained. The distribution of that relevance is compared with the distribution of relevance restricted to the sources that have used each phrase contained in the processing target document, and a high weight is given to phrases used more by strongly related sources. This makes it possible to obtain a high topic degree for phrases characteristic of the field to which the processing target document belongs, without setting field categories in advance.

Further, according to the present invention, a time weight is given to phrases that are handled characteristically often around the creation time of the processing target document, and a new topic degree is calculated that takes both the topic degree and the time weight into account. This makes it possible to set a high topic degree only for phrases that were used characteristically in a certain field at the time the processing target document was created.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 3 shows the configuration of the topic degree calculation apparatus according to the first embodiment of the present invention.

The topic degree calculation apparatus shown in FIG. 3 is connected to a document database 200, which stores the document data input to the apparatus, and a phrase topic degree recording device 290, which records the topic degrees of phrases output by the apparatus.

The topic degree calculation apparatus of this embodiment comprises an in-document phrase extraction unit 210, a seed topic information database 220, a source relevance calculation unit 230, a relevance database 240, a topic degree calculation unit 250, and a topic degree output unit 260.

FIG. 4 is a flowchart of the overall operation in the first embodiment of the present invention.

Step 101) The in-document phrase extraction unit 210 reads the group of documents collected in advance and stored in the document database 200 and analyzes them using an existing technique such as morphological analysis, thereby extracting all phrases contained in the documents. Each extracted phrase is combined with the source and creation time of the document that contained it and stored in the seed topic information database 220 (corresponding to the "phrase database" in the claims).

Step 102) Next, the source relevance calculation unit 230 refers to the seed topic information database 220 created in step 101 and, for each source, acquires the group of phrases contained in the documents that source created in the past and the number of times each phrase was used, and creates a used-phrase vector for the source. Using an existing vector comparison technique such as cosine similarity, it calculates the similarity between the used-phrase vectors of two sources as the relevance between those sources. This is done for every possible pair of sources, and the resulting relevances are output to the relevance database 240.

Step 103) For each phrase contained in a document supplied from outside as the object of processing (hereinafter, the processing target document), the topic degree calculation unit 250 compares two distributions: the distribution of relevance between the source of the processing target document and the sources that have published that same phrase, and the distribution of relevance between the source of the processing target document and all sources (or all sources other than the one that created the processing target document). When the distribution for the sources that have published the phrase is biased toward a range of higher relevance, a high topic degree is set for the phrase as a characteristic phrase appearing frequently in the field related to the processing target document.

Step 104) The topic degree obtained in step 103 is output from the topic degree output unit 260 to the phrase topic degree recording device 290.

Next, the detailed operation of the topic degree calculation apparatus is described with reference to the configuration diagram of FIG. 3.

The document database 200 stores a plurality of documents, each with its creation time and source information attached. For example, the document database 200 can be built by attaching to each document published on the Web a creation time such as "2004 4/25 13:55" and source information such as the domain name of the publishing website or the name of the author, and recording the documents one after another as input. It is particularly desirable to input documents from information sources, such as diary sites on the Internet, where new documents are created continually and the sender can be identified. When a document on a site is updated, it may also be collected as if a new document had been created.

The in-document phrase extraction unit 210 acquires the documents stored in the document database 200 one at a time, performs morphological analysis, and breaks each document down by part of speech. For example, the text "おいしいチョコドーナツ" ("delicious chocolate donut") is broken down into "おいしい" (delicious), "チョコ" (chocolate), and "ドーナツ" (donut). Only the nouns are selected and extracted from the resulting parts of speech. If necessary, consecutive nouns such as "チョコ" and "ドーナツ" may be joined into the compound noun "チョコドーナツ", and the compound noun treated as a single noun.
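The noun selection and optional compound-noun joining can be sketched as follows. The sketch assumes a morphological analyzer has already produced (surface form, part of speech) pairs; the function names and the "noun" and "adjective" tag strings are hypothetical and do not correspond to any particular analyzer's output format.

```python
from typing import Iterable, List, Tuple


def _flush(run: List[str], join_compounds: bool) -> List[str]:
    """Emit a run of consecutive nouns as one compound or as separate nouns."""
    if not run:
        return []
    return ["".join(run)] if join_compounds else list(run)


def extract_nouns(
    tokens: Iterable[Tuple[str, str]], join_compounds: bool = True
) -> List[str]:
    """Keep only nouns; optionally join consecutive nouns into one phrase."""
    phrases: List[str] = []
    run: List[str] = []
    for surface, pos in tokens:
        if pos == "noun":
            run.append(surface)
        else:
            phrases.extend(_flush(run, join_compounds))
            run = []
    phrases.extend(_flush(run, join_compounds))
    return phrases


tokens = [("おいしい", "adjective"), ("チョコ", "noun"), ("ドーナツ", "noun")]
print(extract_nouns(tokens))
print(extract_nouns(tokens, join_compounds=False))
```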

Phrases treated like nouns, such as "秋の新番組" ("new autumn programs"), may also be extracted as nouns. In the following description, the extracted nouns, compound nouns, and noun-like phrases are collectively called phrases. To each phrase obtained in this way, the creation time and source information of the document that contained it are attached, and the result is stored in the seed topic information database 220 as information in a form such as "チョコドーナツ 2005/01/06 11:36 blog.temporary.ex.xx". In the following description, a set of a phrase, creation time, and source information is called seed topic information.
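One way to hold seed topic information in memory is a small record type. The sketch below assumes the whitespace-separated layout of the example string (phrase, date, time, source); the class and function names are hypothetical, and the actual storage format of the seed topic information database 220 is not specified beyond that example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SeedTopic:
    phrase: str
    created: datetime
    source: str


def parse_seed_topic(line: str) -> SeedTopic:
    """Parse 'phrase YYYY/MM/DD HH:MM source' into a SeedTopic record."""
    # Split from the right so a phrase containing spaces stays intact.
    phrase, date, time, source = line.rsplit(maxsplit=3)
    created = datetime.strptime(f"{date} {time}", "%Y/%m/%d %H:%M")
    return SeedTopic(phrase, created, source)


record = parse_seed_topic("チョコドーナツ 2005/01/06 11:36 blog.temporary.ex.xx")
print(record)
```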

FIG. 5 shows an example of the seed topic information stored in the seed topic information database 220. When the same phrase is used several times in a processed document, several identical items of seed topic information would be stored; to make processing more efficient, only one of them may be stored.

At fixed processing intervals, the source relevance calculation unit 230 refers to the seed topic information stored in the seed topic information database 220, extracts the phrases used by each source, and counts how many times each phrase was used. The extracted phrases and their usage counts form the source feature. By comparing the source features of each pair of different sources, it calculates the relevance between the sources and outputs the relevances to the relevance database 240 in the form of a table.

FIG. 6 shows an example of the relevance information stored in the relevance database 240.

The operation of the source relevance calculation unit 230 is now described in detail.

FIG. 7 is a flowchart of the processing of the source relevance calculation unit in the first embodiment of the present invention.

Step 510) When processing starts, the source relevance calculation unit 230 acquires all the source information stored in the seed topic information database 220 and removes duplicates, thereby creating in a document data buffer (not shown) a list of the sources that supplied the input.

Step 520) Next, one item of source information is taken from the list of sources obtained in step 510, and the phrases in the seed topic information containing that source information are acquired from the seed topic information database 220 and accumulated in a phrase information buffer (not shown). To reduce processing, the phrases acquired may be limited to seed topic information whose creation time falls within a certain range, for example from two months before the processing time up to the processing time.
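The optional time restriction in step 520 might look like the sketch below. Seed topic information is modeled as (phrase, creation time, source) tuples, the function name is hypothetical, and "two months" is approximated as 60 days, which is an assumption rather than something the text specifies.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

SeedTopic = Tuple[str, datetime, str]  # (phrase, creation time, source)


def recent_phrases(
    seed_topics: Iterable[SeedTopic],
    source: str,
    now: datetime,
    window: timedelta = timedelta(days=60),  # assumed stand-in for "two months"
) -> List[str]:
    """Phrases of one source created within the window ending at `now`."""
    cutoff = now - window
    return [
        phrase
        for phrase, created, src in seed_topics
        if src == source and cutoff <= created <= now
    ]
```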

Step 530) For the phrases in the phrase information buffer (not shown), the number of appearances of each phrase is obtained, and the phrase feature C_i of the source S_i being processed is created, consisting of phrases w_k and their numbers of appearances V_i(w_k). For example, if the phrases accumulated in the phrase information buffer in step 520 are "野球" (baseball), "W杯" (World Cup), "野球", "野球", "決勝" (final), "W杯", "野球", the phrase feature C_i of source i is the set of pairs of a phrase w_k and its number of appearances V_i(w_k): "野球 4, W杯 2, 決勝 1". The resulting phrase feature C_i is accumulated in a source feature buffer (not shown) as the source feature list.
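The counting in step 530 maps directly onto a multiset. The sketch below uses Python's `collections.Counter` with the example phrases from the text; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def phrase_feature(phrases: Iterable[str]) -> Counter:
    """Map each phrase w_k to its number of appearances V_i(w_k)."""
    return Counter(phrases)


buffer = ["野球", "W杯", "野球", "野球", "決勝", "W杯", "野球"]
feature = phrase_feature(buffer)
print(feature.most_common())  # [('野球', 4), ('W杯', 2), ('決勝', 1)]
```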

Step 540) It is determined whether steps 520 and 530 have been performed for every source in the list of sources created in step 510. If unprocessed source information remains, processing returns to step 520; otherwise it proceeds to step 550.

Step 550) The phrase features C_i and C_j of a pair of sources S_i and S_j are taken from the source feature list in the source feature buffer (not shown), and the relevance between C_i and C_j is calculated as the relevance R_ij between source i and source j.

For example, the relevance R_ij between source i and source j is calculated using the following equation (1).

In equation (1), W denotes all the phrases contained in the phrase features C_i and C_j. Since V_i(w_k) and V_j(w_k) are all positive, R_ij is between 0 and 1 inclusive, and a larger R_ij indicates a stronger relation between source S_i and source S_j.
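Equation (1) itself does not appear in this text. One measure consistent with the surrounding description (step 102 names cosine similarity as an example, and the result lies between 0 and 1 for positive counts) is the cosine of the two phrase-count vectors. The sketch below is written under that assumption and should not be read as the patent's exact formula; the function name is hypothetical.

```python
import math
from typing import Mapping


def relevance(c_i: Mapping[str, int], c_j: Mapping[str, int]) -> float:
    """Cosine similarity of two phrase-count features (assumed form of R_ij)."""
    # Phrases missing from one feature contribute a count of zero.
    dot = sum(v * c_j.get(w, 0) for w, v in c_i.items())
    norm_i = math.sqrt(sum(v * v for v in c_i.values()))
    norm_j = math.sqrt(sum(v * v for v in c_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # a source with no phrases is treated as unrelated
    return dot / (norm_i * norm_j)
```

Sources that share no phrases get a relevance of 0, and sources whose phrase counts are proportional get a relevance of 1.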

The obtained relevance R_ij (j ≠ i) is stored in two places in the relevance database 240: row i, column j and row j, column i.

Step 560) It is determined whether step 550 has been performed for every pair of sources. If an unprocessed pair remains, processing returns to step 550. If step 550 has been performed for every pair, the processing of the source relevance calculation unit 230 ends.
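The loop over steps 550 and 560 amounts to visiting every unordered pair of sources once and writing the result symmetrically. In the sketch below the relevance table is a plain dictionary keyed by (i, j), standing in for the relevance database 240, and the similarity measure is passed in as a function; both choices, and the function name, are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Dict, Mapping, Tuple

Feature = Mapping[str, int]


def build_relevance_table(
    features: Mapping[str, Feature],
    similarity: Callable[[Feature, Feature], float],
) -> Dict[Tuple[str, str], float]:
    """Compute R_ij once per pair and store it at both (i, j) and (j, i)."""
    table: Dict[Tuple[str, str], float] = {}
    for i, j in combinations(sorted(features), 2):
        r = similarity(features[i], features[j])
        table[(i, j)] = r
        table[(j, i)] = r
    return table
```

Computing each pair once and mirroring it halves the number of similarity evaluations compared with filling every cell independently.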

The topic degree calculation unit 250 accepts a processing target document as input. For each phrase in the document, it compares the distribution of relevance between the source of the processing target document and the sources that have used that phrase in the past with the distribution of relevance between the source of the processing target document and all other sources. Phrases used frequently by sources highly related to the source of the processing target document are regarded as highly topical, and a large topic degree is calculated for them.

The operation of the topic degree calculation unit 250 is now described in detail.

FIG. 8 is a flowchart of the operation of the topic degree calculation unit in the first embodiment of the present invention.

Step 610) When processing starts, the topic degree calculation unit 250 accepts a processing target document as input from outside. The processing target document is assumed to have a source attached. It is also desirable that the processing target document be one of the documents contained in the document database 200.

Step 620) From the relevance database 240, the set of relevance values R_ij between the transmission source S_i of the processing target document and every other transmission source S_j (j ≠ i) is obtained, and the distribution of those values is tabulated. Using FIG. 6 as an example, if the transmission source of the processing target document is transmission source "8", the relevance values between transmission source "8" and all other transmission sources (transmission sources 1 to 7 and 9 to N) are obtained: 0.282, 0.166, 0.217, 0.327, 0.313, 0.275, ... For the resulting set of relevance values R_8j (j ≠ 8), the distribution of values is obtained as the reference relevance distribution R_s(n). For example, when tabulating in increments of 0.01, the number of transmission sources satisfying
0.01*n ≦ R_8j < 0.01*(n+1)
is counted for n in the range 0 to 100, which gives the reference relevance distribution
R_s(n) (n = 0 ... 99)
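The tabulation in step 620 can be sketched as a simple histogram. This is a minimal illustration, not the patented implementation: the names `bin_index` and `reference_distribution` are hypothetical, and the choice of 101 bins (n = 0 to 100, so that a relevance of exactly 1.0 has a bin) is an assumption, since the text gives both "0 to 100" and "0 ... 99".

```python
from collections import Counter

STEP = 0.01    # bin width from the example in step 620
N_BINS = 101   # assumption: bins n = 0..100 so that relevance 1.0 has a bin

def bin_index(relevance, step=STEP):
    """Return n such that step*n <= relevance < step*(n+1)."""
    # The small epsilon guards against floating-point error (e.g. 0.29 / 0.01).
    return int(relevance / step + 1e-9)

def reference_distribution(relevances, n_bins=N_BINS):
    """Count how many transmission sources fall into each relevance bin: R_s(n)."""
    counts = Counter(bin_index(r) for r in relevances)
    return [counts.get(n, 0) for n in range(n_bins)]

# Relevance of transmission source 8 to other sources (values quoted from FIG. 6).
r_8j = [0.282, 0.166, 0.217, 0.327, 0.313, 0.275]
r_s = reference_distribution(r_8j)
```

With these six values, each of the bins 16, 21, 27, 28, 31 and 32 receives a count of one and every other bin stays at zero.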

Step 630) The phrases contained in the processing target document D_i are obtained by the same processing as in the in-document phrase extraction unit 210, and the obtained phrases are stored in a processing target phrase buffer (not shown) in the topic level calculation unit 250.

Step 640) One phrase w_k is taken from the processing target phrase buffer (not shown). The seed topic information database 220 is referenced to obtain the transmission source information in the seed topic information containing that phrase, and this information is tallied to create phrase usage source information, which consists of pairs of a transmission source S_j that has used the phrase w_k and the number of times X(S_j, w_k) that it used the phrase.

例えば、「野球」という語句について種話題情報データベース２２０から取得された発信源情報が、「発信源ｊ」「発信源ｊ」「発信源ｊ」「発信源ｋ」「発信源ｎ」「発信源ｎ」といった場合には、「発信源ｊ３回，発信源ｋ１回，発信源ｎ２回」という語句使用発信源情報を作成する。但し、ステップ６５０で語句使用回数を用いない場合は、語句使用発信源情報に語句使用回数を含める必要はない。 For example, the source information acquired from the seed topic information database 220 for the phrase “baseball” is “source j” “source j” “source j” “source k” “source n” “source”. In the case of “n”, the phrase use transmission source information of “transmission source j 3 times, transmission source k 1 time, transmission source n 2 times” is created. However, if the phrase usage count is not used in step 650, the phrase usage count need not be included in the phrase usage source information.
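The tally described in step 640 amounts to counting occurrences per transmission source. The sketch below assumes, purely for illustration, that the seed topic information is available as (phrase, source) pairs; the function name `phrase_usage_by_source` and the record layout are hypothetical and not taken from the patent.

```python
from collections import Counter

def phrase_usage_by_source(seed_topic_records, phrase):
    """Build the phrase usage source information: source -> usage count X(S_j, w_k).

    seed_topic_records is assumed to be an iterable of (phrase, source) pairs.
    """
    return Counter(source for p, source in seed_topic_records if p == phrase)

# The "baseball" example from the text, plus one unrelated record.
seed_topic_records = [
    ("baseball", "j"), ("baseball", "j"), ("soccer", "k"),
    ("baseball", "j"), ("baseball", "k"), ("baseball", "n"), ("baseball", "n"),
]
usage = phrase_usage_by_source(seed_topic_records, "baseball")
```

If step 650 only counts transmission sources and does not use the usage counts, the keys of `usage` alone are sufficient.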

Step 650) From the relevance database 240, the set of relevance values R_ij between the transmission source S_i of the specified document and each transmission source S_j included in the phrase usage source information is obtained, and the values are tabulated in the same way as in step 620 to obtain the phrase relevance distribution R_wk(n). For example, when tabulating in increments of 0.01, the number of transmission sources satisfying
0.01*n ≦ R_ij < 0.01*(n+1)
is counted for n = 0 to 100, giving the phrase relevance distribution R_wk(n) (n = 0 ... 100). Instead of counting the number of transmission sources whose relevance falls within a given range, the total number of phrase uses by those transmission sources may be tallied. For example, suppose the transmission source of the processing target document is S_8, the phrase relevance distribution R_wk(n) is being obtained for the phrase w_k, and three transmission sources S_6, S_10, and S_21 satisfy
0.01*5 ≦ R_8j < 0.01*(5+1)
In that case, rather than setting R_wk(5) to "3", the number of transmission sources in the range, R_wk(5) may be set to the sum of the phrase usage counts of S_6, S_10, and S_21:
X(S_6, w_k) + X(S_10, w_k) + X(S_21, w_k)
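The two variants of step 650 (counting transmission sources, or summing their phrase usage counts) differ only in what is added to each bin. The following sketch makes that explicit. The function name, the dictionary layout, and the concrete relevance and usage numbers are hypothetical; only the sources S_6, S_10, S_21 and the bin n = 5 come from the text.

```python
def phrase_relevance_distribution(relevance, usage, weighted=False,
                                  step=0.01, n_bins=101):
    """Tabulate R_wk(n) over the sources that used the phrase.

    relevance: source -> relevance R_ij to the document's transmission source
    usage:     source -> usage count X(S_j, w_k)
    weighted:  False counts sources; True sums their usage counts
    """
    dist = [0] * n_bins
    for source, count in usage.items():
        # Epsilon guards against floating-point error at bin boundaries.
        n = int(relevance[source] / step + 1e-9)
        dist[n] += count if weighted else 1
    return dist

# Hypothetical numbers: three sources whose relevance to S_8 lies in [0.05, 0.06).
relevance_to_s8 = {"S6": 0.052, "S10": 0.057, "S21": 0.050}
usage_wk = {"S6": 4, "S10": 1, "S21": 2}

by_sources = phrase_relevance_distribution(relevance_to_s8, usage_wk)
by_usage = phrase_relevance_distribution(relevance_to_s8, usage_wk, weighted=True)
```

Here `by_sources[5]` is 3 (the number of sources), while `by_usage[5]` is 4 + 1 + 2 = 7 (the sum of their usage counts).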

Step 660) A phrase w_k should receive a high value as a characteristic word of the field when it is used heavily by transmission sources closely related to the transmission source S_i of the processing target document D_i. To this end, the phrase relevance distribution and the reference relevance distribution are each normalized by the sum of their values over the entire range. When the normalized phrase relevance distribution exceeds the normalized reference relevance distribution in the high-relevance range, the topic level Score_r(w_k) is calculated to be high.

例えば、上記のステップ６２０とステップ６５０とのように、0.01刻みの関連度の範囲で集計した場合には、式（４）によってScore_r(w_k)が求められる。 For example, when the calculation is performed within the range of the degree of relevance in increments of 0.01 as in step 620 and step 650 described above, Score _r (w _k ) is obtained by equation (4).

n_0 in equation (4) is a value indicating the center of gravity of the reference relevance distribution R_s(n), and is obtained by equation (5).
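Equations (4) and (5) are not reproduced in this text. Based only on the wording of step 660 and of claim 1 (for each relevance step, the step value minus n_0 is multiplied by the phrase distribution value minus the reference distribution value, and the products are summed), one form consistent with the description would be the following. This is a reconstruction under those assumptions, not the equations as published; in particular, the normalization by the totals and the weighted-mean form of the center of gravity are inferred.

```latex
\mathrm{Score}_r(w_k) \;=\; \sum_{n} \,(n - n_0)
  \left( \frac{R_{w_k}(n)}{\sum_{m} R_{w_k}(m)} \;-\; \frac{R_s(n)}{\sum_{m} R_s(m)} \right)

n_0 \;=\; \frac{\sum_{n} n \, R_s(n)}{\sum_{n} R_s(n)}
```

Under this form the score is positive when the phrase relevance distribution is shifted toward higher relevance than the reference distribution, and negative when it is shifted toward lower relevance.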

Step 670) The pair of the phrase w_k and its topic level Score_r(w_k) is output to the topic level output unit 260.

ステップ６８０）処理対象語句バッファ（図示せず）中の全ての語句について、ステップ６４０からステップ６７０までの処理を行ったかを判定し、未処理の語句が存在する場合はステップ６４０に移行する。全ての語句について処理済みの場合は話題度算出部２５０の処理を終了する。 Step 680) It is determined whether or not the processing from Step 640 to Step 670 has been performed for all the words / phrases in the processing target word / phrase buffer (not shown). If there are unprocessed words / phrases, the process proceeds to Step 640. If all words have been processed, the topic level calculation unit 250 ends.
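The scoring in step 660 can be sketched as follows. The formula is an assumption reconstructed from the wording of step 660 and claim 1: each distribution is normalized by its total, and the score is the sum over relevance steps of (n − n_0) times the difference between the normalized distributions, where n_0 is the center of gravity of the reference distribution. The function names and the four-bin toy distributions are hypothetical.

```python
def normalize(dist):
    """Divide each bin by the total so that the bins sum to 1."""
    total = sum(dist)
    return [v / total for v in dist] if total else [0.0] * len(dist)

def centroid(dist):
    """Center of gravity n_0 of a distribution over bin indices."""
    total = sum(dist)
    return sum(n * v for n, v in enumerate(dist)) / total if total else 0.0

def topic_score(phrase_dist, reference_dist):
    """Assumed form of Score_r: sum of (n - n_0) * (normalized difference)."""
    n0 = centroid(reference_dist)
    p = normalize(phrase_dist)
    s = normalize(reference_dist)
    return sum((n - n0) * (pn - sn) for n, (pn, sn) in enumerate(zip(p, s)))

# Toy distributions with four bins instead of 101, for readability.
reference = [2, 2, 2, 2]      # all sources, spread evenly over relevance
phrase_high = [0, 0, 1, 3]    # phrase used mostly by closely related sources
phrase_low = [3, 1, 0, 0]     # phrase used mostly by weakly related sources

score_high = topic_score(phrase_high, reference)
score_low = topic_score(phrase_low, reference)
```

In this toy case `score_high` is positive and `score_low` is negative, matching the intent that phrases concentrated among closely related transmission sources score higher. In the loop of steps 640 to 680, a function like `topic_score` would be called once per phrase in the buffer.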

話題度出力部２６０は、話題度算出部２５０から受け取った語句と当該語句の話題度の組み合わせを語句話題度記録装置２９０に出力する。この際、出力量の軽減のため、予め設定された値以上の話題度を持つ語句のみに限って出力してもよい。語句話題度記録装置２９０に出力される語句とその話題度の例を図９に示す。 The topic level output unit 260 outputs a combination of the phrase received from the topic level calculation unit 250 and the topic level of the phrase to the phrase topic level recording device 290. At this time, in order to reduce the output amount, only words having a topic degree equal to or higher than a preset value may be output. FIG. 9 shows an example of a phrase and its topic level that are output to the phrase topic level recording device 290.

[Second Embodiment]
FIG. 10 shows the configuration of the topic level calculation device according to the second embodiment of the present invention.

The topic level calculation device according to this embodiment comprises an in-document phrase extraction unit 210, a seed topic information database 220, a transmission source relevance calculation unit 230, a relevance database 240, a topic level calculation unit 250, a time weight calculation unit 860, and a topic level output unit 870.

このうち、文書内語句抽出部２１０、種話題情報データベース２２０、発信源関連度算出部２３０、関連度データベース２４０、話題度算出部２５０は、前述の第１の実施の形態と同様の動作をする。 Among them, the in-document phrase extraction unit 210, the seed topic information database 220, the transmission source relevance calculation unit 230, the relevance database 240, and the topic calculation unit 250 operate in the same manner as in the first embodiment. .

本実施の形態では、作成時刻と発信源の情報が含まれた文書を処理対象とする。 In the present embodiment, a document including creation time and transmission source information is a processing target.

The method of the first embodiment assigns a high topic level to words frequently handled in the field to which the processing target document belongs. As a result, it also assigns a high topic level to phrases, such as technical terms, whose use is confined to that field but which are in general use within it.

To address this, a time weight calculation unit 860 is provided that weights phrases handled characteristically often around the creation time of the processing target document. By newly calculating a topic level that takes into account both the topic level obtained by the topic level calculation unit 250 and the time weight obtained by the time weight calculation unit 860, a high topic level can be assigned only to phrases that were used characteristically in a given field at the time the processing target document was created.

図１１は、本発明の第２の実施の形態における概要動作のフローチャートである。 FIG. 11 is a flowchart of an outline operation in the second exemplary embodiment of the present invention.

前述の第１の実施の形態における図４に示すステップ１０１〜ステップ１０３については同様の動作であるので、その説明は省略する。 Since steps 101 to 103 shown in FIG. 4 in the first embodiment are the same operations, the description thereof is omitted.

Step 201) For each phrase included in the processing target document, the time weight calculation unit 860 takes the degree to which the phrase is used heavily around the creation time of the processing target document as the time weight value of that phrase.

ステップ２０２）話題度出力部８７０において、処理対象文書に含まれる各語句に対して、話題度算出部２５０で得られた当該語句の話題度と、時間重み算出部８６０で得られた当該語句の時間重みとを掛け合わせて得られた値を、時間による注目度の変化を考慮した当該語句の話題度とする。 Step 202) In the topic level output unit 870, for each word / phrase included in the processing target document, the topic level of the word / phrase obtained by the topic level calculation unit 250 and the word / phrase obtained by the time weight calculation unit 860 are calculated. The value obtained by multiplying the time weight is set as the topic level of the word / phrase considering the change in the attention level with time.

ステップ２０３）話題度出力部８７０は、上記のステップ２０２で得られた話題度を語句話題度記録装置２９０に出力する。 Step 203) The topic level output unit 870 outputs the topic level obtained in Step 202 to the phrase topic level recording device 290.

次に、本実施の形態における、話題度算出装置の詳細な動作を図１０の構成図に基づいて説明する。 Next, the detailed operation of the topic level calculation device in the present embodiment will be described based on the configuration diagram of FIG.

The following describes the operation of the time weight calculation unit 860, which is not present in the first embodiment, and of the topic level output unit 870, which differs from the first embodiment.

Like the topic level calculation unit 250, the time weight calculation unit 860 accepts as input the designation of the processing target document for topic phrase extraction. For every phrase in the processing target document, obtained by the same processing as in the in-document phrase extraction unit 210, it calculates a time weight Score_t(w_k). It does so by applying, with reference to the seed topic information database 220, a method that gives a high weight to phrases whose frequency of occurrence is increasing at the creation time of the processing target document, for example the method disclosed in Japanese Patent Application Laid-Open No. 2005-276115.

For each phrase w_k included in the processing target document, the topic level output unit 870 calculates and outputs a high topic level, as a word representing the topic of the document, for a phrase that appears characteristically in the field to which the processing target document belongs and whose frequency of use is increasing at the creation time of the processing target document.

For a phrase w_k, the topic level output unit 870 receives the topic level Score_r(w_k) from the topic level calculation unit 250 and the time weight Score_t(w_k) from the time weight calculation unit 860. It multiplies Score_r(w_k) by Score_t(w_k) to calculate the topic level Score(w_k), which evaluates both field-specific and temporal distinctiveness, and outputs it to the phrase topic level recording device 290. To reduce the amount of output, only phrases with a topic level at or above a preset value may be output.
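The combination performed by the topic level output unit 870 is a per-phrase product with an optional threshold. A minimal sketch follows; the function name, the dictionary layout, and the sample scores are hypothetical, and it assumes both score tables contain the same phrases.

```python
def combined_topic_scores(score_r, score_t, threshold=None):
    """Multiply Score_r(w_k) by Score_t(w_k) for each phrase.

    If threshold is given, keep only phrases whose combined score is at or
    above it, mirroring the optional output reduction described in the text.
    """
    combined = {}
    for phrase, r in score_r.items():
        score = r * score_t[phrase]
        if threshold is None or score >= threshold:
            combined[phrase] = score
    return combined

# Hypothetical scores for two phrases.
score_r = {"baseball": 1.25, "inning": 0.8}
score_t = {"baseball": 2.0, "inning": 0.5}

all_scores = combined_topic_scores(score_r, score_t)
filtered = combined_topic_scores(score_r, score_t, threshold=1.0)
```

With these numbers, "inning" has a combined score of 0.4 and is dropped by the threshold of 1.0, while "baseball" (2.5) is kept.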

The operation of the topic level calculation device shown in the first and second embodiments may be implemented as a program. The program can be installed on a computer that operates as the topic level calculation device and can access the seed topic information database 220 and the relevance database 240, or it can be distributed via a network.

The program can also be stored on a hard disk device or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer and executed, or distributed.

なお、本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において種々変更・応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made within the scope of the claims.

本発明は、文書中の語句の話題度を算出する技術に適用可能である。 The present invention is applicable to a technique for calculating the topic level of words in a document.

A diagram for explaining the principle of the present invention.
A diagram of the principle configuration of the present invention.
A configuration diagram of the topic level calculation device in the first embodiment of the present invention.
A flowchart of the outline operation in the first embodiment of the present invention.
An example of the seed topic information stored in the seed topic information database in the first embodiment of the present invention.
An example of the relevance information stored in the relevance database in the first embodiment of the present invention.
A flowchart of the operation of the transmission source relevance calculation unit in the first embodiment of the present invention.
A diagram explaining the processing of the topic level calculation unit in the first embodiment of the present invention.
An example of the output to the phrase topic level recording device in the first embodiment of the present invention.
A configuration diagram of the topic level calculation device in the second embodiment of the present invention.
A flowchart of the outline operation in the second embodiment of the present invention.

Explanation of symbols

200 Document database
210 Phrase extraction means, in-document phrase extraction unit
220 Phrase database, seed topic information database
230 Transmission source relevance calculation means, transmission source relevance calculation unit
240 Relevance database
250 Topic level calculation means, topic level calculation unit
260 Topic level output means, topic level output unit
290 Phrase topic level recording device
860 Time weight calculation unit
870 Topic level output unit

Claims

A topic degree calculation method for analyzing a set of documents created by a large number of information transmission sources and calculating the strength of topicality of a phrase included in a processing target document, the method comprising:
a phrase extraction step in which phrase extraction means, on receiving a set of documents each having information on the transmission source that created it, analyzes the set of documents, cuts out from each document the phrases to be evaluated for topicality, adds the transmission source information, and records the phrases in a phrase database;
a transmission source relevance calculation step in which transmission source relevance calculation means accumulates in a transmission source feature quantity buffer, for each transmission source, a transmission source feature quantity list whose feature quantities are the numbers of appearances of the phrases extracted from the documents of that transmission source, calculates, for each pair of a transmission source i and a transmission source j, a relevance Rij indicating their degree of similarity from the transmission source feature quantity lists of the two transmission sources, and records it in a relevance database; and
a topic degree calculation step in which topic degree calculation means accepts a processing target document as input and calculates the degree to which a phrase k is a topic from a reference relevance distribution based on the transmission source i of the processing target document and a phrase relevance distribution obtained for each phrase k included in the processing target document,
wherein, in the topic degree calculation step,
the relevance Rij between the transmission source i of the processing target document and each other transmission source j is obtained from the relevance database, and the reference relevance distribution is obtained, which is information tabulating, for each relevance step obtained by dividing the relevance into a predetermined number N of steps, the number of transmission sources j having a relevance Rij corresponding to that step;
a transmission source l having the phrase k is obtained by referring to the phrase database, the relevance Ril between the transmission source i and the transmission source l is obtained from the relevance database, and the phrase relevance distribution is obtained, which is information tabulating, for each relevance step, the number of transmission sources l having a relevance Ril corresponding to that step; and
the reference relevance distribution and the phrase relevance distribution are compared, and when the phrase relevance distribution of the phrase k is biased toward a range of higher relevance than the reference relevance distribution, the phrase k is regarded as a topic word frequently handled by transmission sources closely related to the transmission source i of the processing target document; for each relevance step, a value is obtained by multiplying the value obtained by subtracting the relevance value n0 at the center of gravity of the reference relevance distribution from the relevance value of that step by the value obtained by subtracting the value of the reference relevance distribution from the value of the phrase relevance distribution at that step, and the sum of these values is taken as the degree to which the phrase k is a topic.

The topic degree calculation method according to claim 1, wherein the value of the phrase relevance distribution at each relevance step is the total number of times the phrase k is used by the transmission sources l corresponding to that step.

The topic degree calculation method according to claim 1 or 2, wherein the reference relevance distribution and the phrase relevance distribution are each normalized, and the degree to which the phrase k is a topic is obtained using the normalized reference relevance distribution and the normalized phrase relevance distribution.

A topic degree calculation device that analyzes a set of documents created by a large number of information transmission sources and calculates the strength of topicality of a phrase included in a processing target document, the device comprising:
phrase extraction means that, on receiving a set of documents each having information on the transmission source that created it, analyzes the set of documents, cuts out from each document the phrases to be evaluated for topicality, adds the transmission source information, and records the phrases in a phrase database;
transmission source relevance calculation means that accumulates in a transmission source feature quantity buffer, for each transmission source, a transmission source feature quantity list whose feature quantities are the numbers of appearances of the phrases extracted from the documents of that transmission source, calculates, for each pair of a transmission source i and a transmission source j, a relevance Rij indicating their degree of similarity from the transmission source feature quantity lists of the two transmission sources, and records it in a relevance database; and
topic degree calculation means that accepts a processing target document as input and calculates the degree to which a phrase k is a topic from a reference relevance distribution based on the transmission source i of the processing target document and a phrase relevance distribution obtained for each phrase k included in the processing target document,
wherein the topic degree calculation means includes means for:
obtaining from the relevance database the relevance Rij between the transmission source i of the processing target document and each other transmission source j, and obtaining the reference relevance distribution, which is information tabulating, for each relevance step obtained by dividing the relevance into a predetermined number N of steps, the number of transmission sources j having a relevance Rij corresponding to that step;
obtaining a transmission source l having the phrase k by referring to the phrase database, obtaining the relevance Ril between the transmission source i and the transmission source l from the relevance database, and obtaining the phrase relevance distribution, which is information tabulating, for each relevance step, the number of transmission sources l having a relevance Ril corresponding to that step; and
comparing the reference relevance distribution and the phrase relevance distribution and, when the phrase relevance distribution of the phrase k is biased toward a range of higher relevance than the reference relevance distribution, regarding the phrase k as a topic word frequently handled by transmission sources closely related to the transmission source i of the processing target document; obtaining, for each relevance step, a value by multiplying the value obtained by subtracting the relevance value n0 at the center of gravity of the reference relevance distribution from the relevance value of that step by the value obtained by subtracting the value of the reference relevance distribution from the value of the phrase relevance distribution at that step; and taking the sum of these values as the degree to which the phrase k is a topic.

The topic degree calculation device according to claim 4, wherein the topic degree calculation means includes means for setting the value of the phrase relevance distribution at each relevance step to the total number of times the phrase k is used by the transmission sources l corresponding to that step.

The topic degree calculation device according to claim 4 or 5, wherein the topic degree calculation means includes means for normalizing the reference relevance distribution and the phrase relevance distribution respectively, and for obtaining the degree to which the phrase k is a topic using the normalized reference relevance distribution and the normalized phrase relevance distribution.

A topic degree calculation program for causing a computer to function as each means constituting the topic degree calculation apparatus according to claim 4.

A computer-readable recording medium storing a topic degree calculation program, wherein the program according to claim 7 is stored.