JP4800846B2 - Topic degree calculation method and apparatus, program, and computer-readable recording medium

Info

Publication number: JP4800846B2
Application number: JP2006153846A
Authority: JP
Inventors: 裕一郎関口; 吉秀佐藤; 晴美川島; 英範奥田
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2006-06-01
Filing date: 2006-06-01
Publication date: 2011-10-26
Anticipated expiration: 2026-06-01
Also published as: JP2007323434A

Description

The present invention relates to a topic degree calculation method and apparatus, a program, and a computer-readable recording medium, and in particular to a method, apparatus, program, and recording medium for automatically extracting topical phrases from a document group in a situation where documents containing new information are obtained one after another.

With the development of the Internet and other information media, anyone can now publish information easily, and documents are created by a wide variety of authors and published on networks. Under these circumstances, analyzing the document information created to date should make it possible to extract the matters that were topical at any given point in time.

Several techniques have been proposed for extracting the feature phrases that are topical in a document group uploaded to a network system such as the Internet, taking into account the temporal variation in the number of occurrences of the phrases contained in the document group.

One conventional technique acquires documents uploaded to a network system together with their creation time information and automatically classifies them into a plurality of preset fields according to their content. For each field, it calculates a topic degree value indicating high topicality for phrases whose appearance frequency increases characteristically over time and that do not appear in other fields, treating them as feature phrases representing topics (see, for example, Patent Document 1).

However, because the above technique extracts a phrase as a topic feature phrase when its usage count increases within a given period, accuracy falls when the overall document volume rises in the short term or fluctuates periodically. For this reason, a method exists that sets a correction function to cancel the fluctuation caused by changes in the document volume (see, for example, Non-Patent Document 1).
Patent Document 1: JP 2005-276115 A
Non-Patent Document 1: Toshiaki Fujiki, Tomoyuki Nanno, Yasuhiro Suzuki, Manabu Okumura, "Discovery of burst in document stream" (document streamにおけるburstの発見), IPSJ SIG Technical Report 2003-NL-160

However, the above conventional method is laborious, because the user must understand the pattern of change in the document volume and design the correction function accordingly.

The present invention has been made in view of the above points. Its object is to provide a topic degree calculation method, apparatus, program, and computer-readable recording medium that automatically correct for any variation in the total number of documents (the population) and can extract topical phrases with high accuracy.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (claim 1) is a topic degree calculation method in an apparatus that analyzes a large number of documents and determines, for phrases contained in the documents, the strength of their topicality at a desired time, the method comprising:
a document analysis step (step 110) in which, when an input document group having creation time information is input, document analysis means analyzes the input document group to extract the phrases to be evaluated for topicality (hereinafter, in-document phrases), tallies the number of input documents and the phrase frequency, which is the number of times each in-document phrase is used, and stores them in a phrase database together with the tally time and the in-document phrase;
an input receiving step (step 120) in which phrase frequency calculation means acquires a phrase input from outside that is the target of topic degree calculation (hereinafter, processing target phrase w) and time information indicating the point in time for which the topic degree is to be calculated (hereinafter, calculation designated time);
a phrase frequency calculation step (step 130) in which the phrase frequency calculation means searches the phrase database based on the processing target phrase and, based on the phrase frequencies and tally times of the in-document phrase corresponding to the processing target phrase w, calculates a phrase frequency function G_w(T) representing the temporal variation of the phrase frequency; and
a topic degree calculation step (step 140) in which topic degree calculation means obtains the topic degree, which represents how topical the processing target phrase is at the time input in the input receiving step (step 120), by multiplying the phrase frequency function G_w(T) calculated in the phrase frequency calculation step (step 130) by a topic weight function that yields a high topic degree when the phrase frequency is increasing near the calculation designated time,
wherein, in the phrase frequency calculation step (step 130), the phrase frequency calculation means obtains the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input documents and the temporal variation D_all(T) of the number of input documents, evaluates the similarity between D_w(T) and D_all(T) to obtain a degree of correlation R(w), and sets as the phrase frequency function G_w(T) the value obtained by subtracting from D_w(T) a value proportional to the product of D_all(T) and R(w).

In the present invention (claim 2), in the phrase frequency calculation step (step 130), the degree of correlation R(w) between the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the temporal variation D_all(T) of the number of input documents is obtained by the following expression.

[Expression not reproduced in the source text.]

In the present invention (claim 3), in the phrase frequency calculation step (step 130), the phrase frequency function G_w(T), which accounts for increases and decreases in the phrase frequency caused by variation in the number of input documents, is obtained by the following expression using the degree of correlation R(w) between the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input documents and the temporal variation D_all(T) of the number of input documents.

[Expression not reproduced in the source text.]

In the present invention (claim 4), instead of accepting, in the input receiving step (step 120), the processing target phrase w that is the target of topic degree calculation and the calculation designated time indicating the point in time for which the topic degree is to be calculated,
the phrase frequency function G_w(T) is obtained in the phrase frequency calculation step (step 130) for all processing target phrases contained in the input document group, and
the topic degree at the processing time is obtained in the topic degree calculation step (step 140) for all processing target phrases contained in the input document group.

FIG. 2 is a diagram showing the principle configuration of the present invention.

The present invention (claim 5) is a topic degree calculation apparatus that analyzes a large number of documents and determines, for phrases contained in the documents, the strength of their topicality at a desired time, the apparatus comprising:
document analysis means 210 which, when an input document group having creation time information is input, analyzes the input document group to extract the phrases to be evaluated for topicality (hereinafter, in-document phrases), tallies the number of input documents and the phrase frequency, which is the number of times each in-document phrase is used, and stores them in a phrase database 230 together with the tally time and the in-document phrase;
phrase frequency calculation means 240 which acquires a phrase input from outside that is the target of topic degree calculation (hereinafter, processing target phrase w), searches the phrase database 230 based on the processing target phrase, and, based on the phrase frequencies and tally times of the in-document phrase corresponding to the acquired processing target phrase w, calculates a phrase frequency function G_w(T) representing the temporal variation of the phrase frequency; and
topic degree calculation means 250 which acquires from outside time information indicating the point in time for which the topic degree is to be calculated (hereinafter, calculation designated time) and obtains the topic degree, which represents how topical the processing target phrase is at the calculation designated time, by multiplying the phrase frequency function G_w(T) calculated by the phrase frequency calculation means 240 by a topic weight function that yields a high topic degree when the phrase frequency is increasing near the input time,
wherein the phrase frequency calculation means 240 includes means for obtaining the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input documents and the temporal variation D_all(T) of the number of input documents, evaluating the similarity between D_w(T) and D_all(T) to obtain a degree of correlation R(w), and setting as the phrase frequency function G_w(T) the value obtained by subtracting from D_w(T) a value proportional to the product of D_all(T) and R(w).

In the present invention (claim 6), the phrase frequency calculation means 240 obtains the degree of correlation R(w) between the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the temporal variation D_all(T) of the number of input documents by the following expression.

[Expression not reproduced in the source text.]

In the present invention (claim 7), the phrase frequency calculation means 240 obtains the phrase frequency function G_w(T), which accounts for increases and decreases in the phrase frequency caused by variation in the number of input documents, by the following expression using the degree of correlation R(w) between the temporal variation D_w(T) of the phrase frequency of the processing target phrase w in the input documents and the temporal variation D_all(T) of the number of input documents.

[Expression not reproduced in the source text.]

In the present invention (claim 8), the phrase frequency calculation means 240 includes means for obtaining the phrase frequency function for all processing target phrases contained in the input document group, instead of accepting the processing target phrase that is the target of topic degree calculation and the calculation designated time indicating the point in time for which the topic degree is to be calculated, and
the topic degree calculation means 250 includes means for obtaining the topic degree at the processing time for all processing target phrases contained in the input document group.

The present invention (claim 9) is a topic degree calculation program that causes a computer to execute each means of the topic degree calculation apparatus according to claims 5 to 8.

The present invention (claim 10) is a computer-readable recording medium storing a topic degree calculation program that causes a computer to execute each means of the topic degree calculation apparatus according to claims 5 to 8.

As described above, according to the present invention, when document information that is published continuously, such as news articles and diary entries on the web, is acquired and the topicality of the phrases in the documents is extracted automatically, the influence of the temporal variation in the total number of documents can be removed. As a result, the erroneous topic words that conventional techniques extract when the total number of documents fluctuates are no longer extracted, and recent trends and topics can be extracted with high accuracy.

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 3 shows the configuration of the topic degree calculation apparatus according to the first embodiment of the present invention.

The topic degree calculation apparatus shown in FIG. 3 is connected to a document database 200, which stores the document data that serves as input to the apparatus, and to a topic display device 260, which displays the topic word information output by the apparatus.

The topic degree calculation apparatus comprises a document analysis unit 210, a phrase totaling unit 220, a phrase database 230, a phrase frequency calculation unit 240, and a topic degree calculation unit 250.

The document database 200 stores a group of documents, each with a creation time attached. For example, the document database 200 can be built by attaching to each document published on the Web a creation time such as "2006 4/25 13:55" and a document ID that uniquely identifies the document, and recording the documents as they arrive. For an information source whose documents are updated continually, such as a diary site on the Internet, an updated document within the site may also be treated as a newly created document and collected.

The document analysis unit 210 acquires the documents stored in the document database 200 one at a time, performs morphological analysis, and splits each document into parts of speech. For example, the text "delicious chocolate donut" is split into "delicious", "chocolate", and "donut". Only the nouns are then extracted from the resulting tokens. If necessary, consecutive nouns such as "chocolate" and "donut" may be joined into the compound noun "chocolate donut", and the compound noun treated as a single noun. In the following description, nouns and compound nouns are collectively called phrases. To each phrase obtained in this way, the creation time and document ID of the document that contained it are attached, and the result is accumulated in a phrase buffer (not shown) of the phrase totaling unit 220 as information in a form such as:
"chocolate donut 2006/01/06 11:36 ID1035, banana 2006/01/06 11:36 ID1035, new product 2006/01/06 12:06 ID1036, …"

The phrase totaling unit 220 starts at preset fixed intervals and reads the information accumulated in the phrase buffer. It extracts the document IDs contained in that information without duplication and counts them as the number of documents D_all, and, for every phrase w_k contained in the phrase buffer, counts the number of times the phrase is used as the phrase frequency D_wk. The resulting number of documents D_all and the pairs of each phrase w_k and its phrase frequency D_wk are recorded in the phrase database 230 together with the tally time. FIG. 4 shows an example of the information stored in the phrase database 230.
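The tally for one period can be sketched as below. The record layout `(phrase, creation time, document ID)` follows the buffer example in the text; the function name `tally` and the returned tuple are assumptions made for illustration, since the actual database layout is shown only in FIG. 4.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (phrase, creation time, document ID)


def tally(buffer: Iterable[Record]) -> Tuple[int, Dict[str, int]]:
    """Return (number of distinct documents D_all, phrase frequencies D_wk)."""
    doc_ids = set()
    freq: Counter = Counter()
    for phrase, _created, doc_id in buffer:
        doc_ids.add(doc_id)  # each document ID is counted once
        freq[phrase] += 1    # every use of the phrase is counted
    return len(doc_ids), dict(freq)


buffer = [
    ("chocolate donut", "2006/01/06 11:36", "ID1035"),
    ("banana", "2006/01/06 11:36", "ID1035"),
    ("new product", "2006/01/06 12:06", "ID1036"),
    ("banana", "2006/01/06 12:06", "ID1036"),
]
d_all, d_w = tally(buffer)
print(d_all, d_w)
```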

To reduce the amount of data, the phrase totaling unit 220 may, instead of counting every use of the phrase w_k, count multiple uses of w_k under the same document ID as a single use when tallying the phrase frequency D_wk.
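The per-document variant differs from the plain tally only in deduplicating on the pair (phrase, document ID). A minimal sketch, with the hypothetical name `tally_once_per_document`:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (phrase, creation time, document ID)


def tally_once_per_document(buffer: Iterable[Record]) -> Dict[str, int]:
    """Count each phrase at most once per document ID."""
    pairs = {(phrase, doc_id) for phrase, _created, doc_id in buffer}
    return dict(Counter(phrase for phrase, _doc_id in pairs))


buffer = [
    ("banana", "2006/01/06 11:36", "ID1035"),
    ("banana", "2006/01/06 11:36", "ID1035"),  # repeated in the same document
    ("banana", "2006/01/06 12:06", "ID1036"),
]
print(tally_once_per_document(buffer))
```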

When phrase information to be evaluated is input from outside, the phrase frequency calculation unit 240 searches the phrase database 230 based on the input phrase and acquires the phrase frequencies and document counts corresponding to the processing target phrase. By comparing how the two change over time, it obtains the usage frequency of the target phrase for each tally period with the influence of the temporal variation in the total number of documents removed, and outputs it to a buffer (not shown) of the topic degree calculation unit 250.

FIG. 5 is a flowchart of the processing of the phrase frequency calculation unit in the first embodiment of the present invention.

Step 500) When processing starts, the phrase frequency calculation unit 240 receives from outside the phrase information w to be processed.

Step 510) The phrase database 230 is searched based on the received processing target phrase w, the usage frequency information for each tally period corresponding to the phrase w is read, and a function D_w(T) representing the variation in the usage frequency of the phrase w per tally period is obtained. Here, T takes discrete values. As an example, FIG. 6 shows D_w1(T), D_w2(T), and D_w3(T), which represent the usage frequency variation of three phrases w_1, w_2, and w_3. The curves shown in FIG. 6 are in fact sets of discrete points. To reduce processing, D_w(T) may be calculated from the usage counts of only the most recent N periods.

Step 520) Next, the number of documents for each tally period is acquired from the phrase database 230, and the variation D_all(T) in the number of documents per tally period is calculated. FIG. 7 shows an example of D_all(T). The curve shown in FIG. 7 is in fact a set of discrete points. To reduce processing, D_all(T) may be calculated from the document counts of only the most recent N periods.
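Steps 510 and 520 amount to reading two aligned time series out of the phrase database. The sketch below assumes a hypothetical row layout (one dictionary per tally period with `time`, `documents`, and `phrases` keys), since the real layout appears only in FIG. 4; `last_n` models the optional restriction to the most recent N periods and is assumed to be a positive integer when given.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]  # assumed layout: {"time": str, "documents": int, "phrases": {phrase: count}}


def _recent(rows: List[Row], last_n: Optional[int]) -> List[Row]:
    """Sort rows by tally time and keep the most recent last_n periods, if given."""
    ordered = sorted(rows, key=lambda r: r["time"])
    return ordered if last_n is None else ordered[-last_n:]


def frequency_series(rows: List[Row], phrase: str, last_n: Optional[int] = None) -> List[int]:
    """D_w(T): usage count of the phrase in each tally period (0 when absent)."""
    return [r["phrases"].get(phrase, 0) for r in _recent(rows, last_n)]


def document_series(rows: List[Row], last_n: Optional[int] = None) -> List[int]:
    """D_all(T): number of documents in each tally period."""
    return [r["documents"] for r in _recent(rows, last_n)]


rows = [
    {"time": "2006-01-02", "documents": 20, "phrases": {"banana": 4}},
    {"time": "2006-01-01", "documents": 10, "phrases": {"banana": 2, "donut": 1}},
    {"time": "2006-01-03", "documents": 30, "phrases": {"donut": 5}},
]
print(frequency_series(rows, "banana"), document_series(rows))
```

Both series use the same ordering and the same `last_n`, so index T in one always refers to the same tally period in the other.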

Step 530) The degree of correlation between the usage frequency of the processing target phrase w in the input document group and the number of input documents is calculated as the degree of correlation R(w). Specifically, the similarity between the time variation function D_w(T) of the phrase w and the time variation function D_all(T) of the number of documents is evaluated to obtain R(w). The similarity is calculated using the following expression, a general correlation function for waveforms.

[Expression not reproduced in the source text.]
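The correlation expression itself does not survive in this text, so the exact formula used by the invention cannot be confirmed here. As one plausible reading of "a general correlation function for waveforms", the sketch below assumes the Pearson correlation coefficient between the two series; the function name `correlation` and the choice to return 0 for a constant series are assumptions.

```python
import math
from typing import Sequence


def correlation(d_w: Sequence[float], d_all: Sequence[float]) -> float:
    """Pearson correlation between D_w(T) and D_all(T) (assumed form of R(w))."""
    n = len(d_w)
    if n == 0 or n != len(d_all):
        raise ValueError("series must be non-empty and of equal length")
    mean_w = sum(d_w) / n
    mean_all = sum(d_all) / n
    cov = sum((a - mean_w) * (b - mean_all) for a, b in zip(d_w, d_all))
    var_w = sum((a - mean_w) ** 2 for a in d_w)
    var_all = sum((b - mean_all) ** 2 for b in d_all)
    if var_w == 0 or var_all == 0:
        return 0.0  # a constant series carries no trend to compare (assumption)
    return cov / math.sqrt(var_w * var_all)


# A phrase whose usage rises in step with the document volume correlates fully.
print(correlation([1, 2, 3], [2, 4, 6]))
```

Under this reading, R(w) near 1 means the phrase merely tracks the overall document volume, while R(w) near 0 means its usage moves independently of it.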

Step 540) Next, a value proportional to the product of the per-period change in the number of input documents and the degree of correlation R(w) is removed from the usage frequency of the processing target phrase for the same period. This yields a phrase frequency function G_w(T) that accounts for increases and decreases in phrase usage caused by variation in the number of input documents. Specifically, with min D_all(T) denoting the minimum value of D_all(T) and max D_all(T) the maximum value, G_w(T) is obtained by the following expression.

[Expression not reproduced in the source text.]

To simplify the calculation, R(w) may be treated as 0 when it is less than 0. Likewise, G_w(T) may be set to 0 when it would be less than 0.
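The expression for G_w(T) is also missing from this text. The sketch below is therefore only one form consistent with the surrounding description: it subtracts from D_w(T) a term proportional to R(w) times D_all(T), with D_all(T) normalized by its minimum and maximum. The proportionality factor `scale` (defaulting to the range of D_w) and the function name `corrected_frequency` are assumptions, not the patent's formula. Both optional clamps mentioned in the text are applied.

```python
from typing import List, Optional, Sequence


def corrected_frequency(d_w: Sequence[float], d_all: Sequence[float], r_w: float,
                        scale: Optional[float] = None) -> List[float]:
    """Assumed form: G_w(T) = D_w(T) - R(w) * scale * normalized D_all(T)."""
    r_w = max(r_w, 0.0)                    # optional: treat negative R(w) as 0
    lo, hi = min(d_all), max(d_all)
    span = hi - lo
    if scale is None:
        scale = max(d_w) - min(d_w)        # assumed proportionality constant
    g_w = []
    for w, a in zip(d_w, d_all):
        share = (a - lo) / span if span else 0.0  # D_all(T) mapped to [0, 1]
        g_w.append(max(w - r_w * scale * share, 0.0))  # optional: clamp G_w(T) at 0
    return g_w


# A phrase that only rises because the document volume rises is flattened.
print(corrected_frequency([2, 4, 6], [10, 20, 30], r_w=1.0))
```

With R(w) = 0 the series is returned unchanged, which matches the intent that phrases uncorrelated with the document volume need no correction.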

By calculating in this way the variation in phrase usage frequency with the influence of the temporal variation in the total number of documents removed, the phrase frequency functions G_w1(T), G_w2(T), and G_w3(T), in which the influence of the total number of documents on each phrase shown in FIG. 6 has been corrected, become as shown in FIG. 8. The curves and straight lines shown in FIG. 8 are in fact sets of discrete points.

Step 550) The phrase frequency function G_w(T) of the phrase w obtained above is recorded in a phrase frequency buffer (not shown) of the topic degree calculation unit 250.

Next, the processing of the topic degree calculation unit 250 is described.

FIG. 9 is a flowchart of the processing of the topic degree calculation unit in the first embodiment of the present invention.

Step 910) When the phrase frequency function G_w(T) is written to the phrase frequency buffer (not shown), the topic degree calculation unit 250 starts processing and receives the processing target time t_p from outside.

Step 920) Next, based on the processing target time t_p and a positive value t_q given in advance, a topic weighting function I_tp(t) is created that covers the processing range from t_p - t_q to t_p and places a large weight on words used frequently in the recent past. A weighting function such as the impact curve shown in Patent Document 1 is suitable here.
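The shape of the impact curve in Patent Document 1 is not given in this text. As a stand-in, the sketch below assumes a simple linear ramp that is 0 at t_p - t_q, rises to 1 at t_p, and is 0 outside that range, so that recent periods weigh more. The function name `topic_weight` and the ramp shape are assumptions.

```python
def topic_weight(t: float, t_p: float, t_q: float) -> float:
    """Assumed weighting: linear ramp over [t_p - t_q, t_p], zero elsewhere."""
    if t_q <= 0:
        raise ValueError("t_q must be positive")
    start = t_p - t_q
    if t < start or t > t_p:
        return 0.0              # outside the processing range
    return (t - start) / t_q    # 0 at the start of the range, 1 at t_p


# Weights for tally periods 5..10 with t_p = 10 and t_q = 5.
print([topic_weight(t, 10, 5) for t in range(5, 11)])
```

Any function that grows toward t_p, for example an exponential decay into the past, would serve the same role in the steps that follow.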

Step 930) Next, the topic degree TS(w) of the phrase w is obtained by substituting the topic weighting function I_tp(t) and the temporal variation G_w(T) of the usage frequency of the phrase w into the following expression.

[Expression not reproduced in the source text.]
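The expression for TS(w) is missing from this text. Claim 1 states that the topic degree is obtained by multiplying the phrase frequency function by the topic weight function, so the sketch below assumes the natural discrete form, a sum over tally periods of weight times corrected frequency. The function name `topic_degree` and the summation are assumptions.

```python
from typing import Callable, Sequence


def topic_degree(g_w: Sequence[float], times: Sequence[float],
                 weight: Callable[[float], float]) -> float:
    """Assumed form: TS(w) = sum over T of I_tp(T) * G_w(T)."""
    if len(g_w) != len(times):
        raise ValueError("g_w and times must have equal length")
    return sum(weight(t) * g for t, g in zip(times, g_w))


# Hypothetical weight that grows linearly toward the latest period (t = 2).
ramp = lambda t: t / 2
print(topic_degree([1, 2, 3], [0, 1, 2], ramp))
```

A phrase whose corrected frequency is concentrated in recent periods receives a higher score than one with the same total usage spread evenly or concentrated in the past.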

Step 940) The obtained topic degree TS(w) of the phrase w is output to the display device 260 together with the phrase w. For example, a result such as "final match 32.8" is displayed on the screen of the display device 260.

In the above, the phrase information to be processed and the processing target time are input from outside in step 500 of the phrase frequency calculation unit 240 and step 910 of the topic degree calculation unit 250, respectively. The invention is not limited to this example, however, and both may be input at the phrase frequency calculation unit 240.

[Second Embodiment]
FIG. 10 shows the configuration of the topic degree calculation apparatus according to the second embodiment of the present invention.

The topic degree calculation apparatus shown in FIG. 10 is connected to a document database 200, which stores the document data that serves as input to the apparatus, and to a topic phrase recording device 1060, which records the topic phrase information output by the apparatus.

The topic degree calculation apparatus comprises a document analysis unit 210, a phrase totaling unit 220, a phrase database 230, a phrase frequency calculation unit 1040, and a topic degree calculation unit 1050.

Of the above components, the document database 200, the document analysis unit 210, the phrase totaling unit 220, and the phrase database 230 are the same as in the first embodiment, and their description is omitted.

In the topic degree calculation method of the first embodiment, the topic degree is calculated only after the user inputs the phrase to be evaluated. This makes it possible to learn whether a given word is topical, but not to discover that an unknown word has become topical.

In contrast, the present embodiment calculates the topic degree for every phrase contained in the phrase database 230 at predetermined fixed intervals, which makes it possible to extract whichever phrases are topical at any given time.

The following describes the operation of the two units not present in the first embodiment: the phrase frequency calculation unit 1040, which calculates the phrase frequency of all phrases, and the topic degree calculation unit 1050, which calculates the topic degree of all phrases.

FIG. 11 is a flowchart of the processing of the phrase frequency calculation unit in the second embodiment of the present invention.

Step 1100) The phrase frequency calculation unit 1040 starts at predetermined fixed intervals, accesses the phrase database 230, extracts the stored phrases without duplication to create a phrase list, and stores the list in a buffer (not shown) within the phrase frequency calculation unit 1040.

Step 1110) Next, the number of documents for each tally period is acquired from the phrase database 230, and the temporal variation D_all(T) of the number of documents is calculated. To reduce processing, D_all(T) may be calculated from the document counts of only the most recent N periods.

Step 1120) An unprocessed phrase w is selected from the phrase list created in step 1100 and stored in the buffer, the usage count of that phrase for each tally period is read from the phrase database 230, and a function D_w(T) representing the variation in the usage frequency of the phrase w per tally period is obtained. To reduce processing, D_w(T) may be calculated from the usage counts of only the most recent N periods.

Step 1130) The similarity between the function D_w(T), which represents the variation in the usage frequency of the phrase w, and the function D_all(T), which represents the time variation of the number of documents, is evaluated to obtain the correlation degree R(w). The similarity is calculated with the following equation, a general correlation function for waveforms.
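The text describes the similarity measure only as a general correlation function for waveforms. The sketch below assumes the Pearson correlation coefficient, which is one common choice and may differ from the equation the authors use; the function name `correlation` is hypothetical.

```python
import math
from typing import Sequence


def correlation(d_w: Sequence[float], d_all: Sequence[float]) -> float:
    """Pearson correlation between two equally long series (assumed form of R(w)).

    Returns 0.0 when either series is constant (an assumption, to avoid
    dividing by zero).
    """
    if len(d_w) != len(d_all) or not d_w:
        raise ValueError("series must be non-empty and of equal length")
    mean_w = sum(d_w) / len(d_w)
    mean_all = sum(d_all) / len(d_all)
    cov = sum((a - mean_w) * (b - mean_all) for a, b in zip(d_w, d_all))
    var_w = sum((a - mean_w) ** 2 for a in d_w)
    var_all = sum((b - mean_all) ** 2 for b in d_all)
    if var_w == 0 or var_all == 0:
        return 0.0
    return cov / math.sqrt(var_w * var_all)
```

Under this assumption, a phrase whose usage simply tracks the total document volume gives R(w) close to 1, while a phrase whose usage moves independently of the volume gives R(w) close to 0.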

Step 1140) Based on the waveform correlation degree R(w), a function G_w(T) representing the time variation of the corrected usage frequency of the phrase w is calculated. With minD_w and maxD_w denoting the minimum and maximum values of D_w(T), and minD_all and maxD_all denoting the minimum and maximum values of D_all(T), G_w(T) is obtained by the following equation.
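Claim 1 describes G_w(T) as D_w(T) minus a value proportional to the product of R(w) and D_all(T). The sketch below is a minimal illustration of that relationship, assuming the proportionality constant is the ratio of value ranges, (maxD_w - minD_w) / (maxD_all - minD_all). That constant is built from the quantities named in Step 1140 but is an assumption, and the function name `corrected_frequency` is hypothetical.

```python
from typing import List, Sequence


def corrected_frequency(
    d_w: Sequence[float], d_all: Sequence[float], r_w: float
) -> List[float]:
    """G_w(T) = D_w(T) - R(w) * k * D_all(T), with k an assumed scale factor."""
    if len(d_w) != len(d_all) or not d_w:
        raise ValueError("series must be non-empty and of equal length")
    range_w = max(d_w) - min(d_w)
    range_all = max(d_all) - min(d_all)
    # Assumed proportionality constant: ratio of the two value ranges.
    # If the document count never changes, there is nothing to remove.
    k = range_w / range_all if range_all != 0 else 0.0
    return [w - r_w * k * a for w, a in zip(d_w, d_all)]
```

With R(w) = 0 the phrase series is returned unchanged, and with R(w) near 1 the part of the series that mirrors the overall document volume is removed.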

Step 1150) The obtained time variation G_w(T) of the corrected usage frequency of the phrase w is recorded in the phrase frequency buffer (not shown) of the topic level calculation unit 1050.

Step 1160) Whether an unprocessed phrase remains in the phrase list is checked. If one remains, the process returns to Step 1120 and continues; if none remains, the process ends.

Next, the processing flow of the topic level calculation unit 1050 is described.

FIG. 12 is a flowchart of the processing of the topic level calculation unit in the second embodiment of the present invention.

Step 1210) Like the phrase frequency calculation unit 1040, the topic level calculation unit 1050 starts processing at predetermined intervals and acquires the time t_p at which processing starts.

Step 1220) Next, based on the processing time t_p and a positive value t_q given in advance, a topic weighting function I_{t_p}(t) is created for the time range from t_p - t_q to t_p. A weighting function such as the impact curve described in Patent Document 1 above is suitable here.

Step 1230) One function G_w(T), representing the time variation of the usage frequency of a phrase w, is taken from the phrase frequency buffer (not shown) and multiplied by the topic weighting function I_{t_p}(t) to obtain the topic level TS(w) of the phrase w.
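Steps 1220 and 1230 can be sketched as follows. The impact curve of Patent Document 1 is not reproduced in this text, so the linear ramp used here is a stand-in chosen only because it rewards a frequency that rises toward t_p. Summing the products over the window to get a single score is also an assumption, and the names `ramp_weights` and `topic_score` are hypothetical.

```python
from typing import List, Sequence


def ramp_weights(n: int) -> List[float]:
    """Hypothetical weighting function I_{t_p}(t) sampled at n counting periods.

    Rises linearly from -1 (oldest period, t_p - t_q) to +1 (newest, t_p),
    so that a frequency that grows toward t_p yields a positive score.
    """
    if n < 2:
        raise ValueError("need at least two periods")
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]


def topic_score(g_w: Sequence[float], weights: Sequence[float]) -> float:
    """TS(w): sum over the window of G_w(T) * I_{t_p}(t) (summation assumed)."""
    if len(g_w) != len(weights):
        raise ValueError("series and weights must have equal length")
    return sum(g * i for g, i in zip(g_w, weights))


weights = ramp_weights(5)                        # oldest period first
rising = topic_score([0, 1, 2, 4, 8], weights)   # usage grows toward t_p
falling = topic_score([8, 4, 2, 1, 0], weights)  # usage fades toward t_p
```

With this stand-in weighting, the rising series scores higher than the falling one, which matches the stated intent that a phrase whose frequency increases near the processing time receives a high topic level.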

Step 1240) The obtained topic level TS(w) of the phrase w, the phrase w, and the processing time t_p are output to the topic level recording device 1060. For example, a result such as
"Final 32.8 2006/016 13:30"
is output. Because all topic level information output in one series of processing carries the same time information, the records may be stored together to reduce the amount of storage. FIG. 13 shows an example of the topic level information output to the topic level recording device 1060.
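The record output of Step 1240, and the option of storing records that share a timestamp together, can be sketched as follows. The field order follows the example above; the function names, the in-memory dictionary, and the sample timestamp format are assumptions for illustration and are not the storage format of the topic level recording device 1060.

```python
from typing import Dict, List, Tuple


def format_record(phrase: str, score: float, timestamp: str) -> str:
    """One record per phrase: phrase, topic level, processing time."""
    return f"{phrase} {score:.1f} {timestamp}"


def group_by_time(
    records: List[Tuple[str, float, str]]
) -> Dict[str, List[Tuple[str, float]]]:
    """Store the shared timestamp once and list (phrase, score) pairs under it."""
    grouped: Dict[str, List[Tuple[str, float]]] = {}
    for phrase, score, timestamp in records:
        grouped.setdefault(timestamp, []).append((phrase, score))
    return grouped
```

Grouping in this way writes each processing time once per run instead of once per phrase, which is the storage saving the step refers to.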

Step 1250) Whether unprocessed phrase frequency information remains in the phrase frequency buffer (not shown) is checked. If it does, the process returns to Step 1230 and continues; if not, the process ends.

The operation of each component of the topic level calculation device described above can also be implemented as a program, which can be installed and executed on a computer used as the topic level calculation device or distributed over a network.

The program can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and then installed on a computer used as the topic level calculation device or distributed.

The present invention is not limited to the embodiments described above, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for extracting phrases that have become topics from a document group.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram of the principle configuration of the present invention.
FIG. 3 is a configuration diagram of the topic level calculation device in the first embodiment of the present invention.
FIG. 4 is an example of the information stored in the phrase database in the first embodiment of the present invention.
FIG. 5 is a flowchart of the processing of the phrase frequency calculation unit in the first embodiment of the present invention.
FIG. 6 is an example of the time variation of the phrase frequency in the first embodiment of the present invention.
FIG. 7 is an example of the time variation of the number of documents in the first embodiment of the present invention.
FIG. 8 is an example of the time variation of the phrase frequency with the influence of the variation in the number of documents removed, in the first embodiment of the present invention.
FIG. 9 is a flowchart of the processing of the topic level calculation unit in the first embodiment of the present invention.
FIG. 10 is a configuration diagram of the topic level calculation device in the second embodiment of the present invention.
FIG. 11 is a flowchart of the processing of the phrase frequency calculation unit in the second embodiment of the present invention.
FIG. 12 is a flowchart of the processing of the topic level calculation unit in the second embodiment of the present invention.
FIG. 13 is an example of the information stored in the topic level recording device in the second embodiment of the present invention.

Explanation of symbols

200 Document database
210 Document analysis means, document analysis unit
220 Phrase counting unit
230 Phrase database
240 Phrase frequency calculation means, phrase frequency calculation unit
250 Topic level calculation means, topic level calculation unit
260 Topic display device
1040 Phrase frequency calculation unit
1050 Topic level calculation unit
1060 Topic level recording device

Claims

A topic level calculation method in an apparatus that analyzes a large number of documents and determines the strength of topicality, at a desired time, of a phrase included in the documents, the method comprising:
a document analysis step in which, when an input document group having creation time information is input, the document analysis unit analyzes the input document group, extracts the phrases to be evaluated for topicality (hereinafter referred to as in-document phrases), aggregates the number of input documents and the phrase frequency, which is the number of times each in-document phrase is used, and stores these in a phrase database together with the aggregation time and the in-document phrase;
an input reception step in which the phrase frequency calculation means acquires, from an external input, a phrase subject to the topic level calculation process (hereinafter referred to as a processing target phrase w) and time information indicating the time for which the topic level is to be calculated (hereinafter referred to as a calculation designated time);
a phrase frequency calculation step in which the phrase frequency calculation means searches the phrase database based on the processing target phrase w and calculates a phrase frequency function G_w(T) representing the time variation of the phrase frequency, based on the phrase frequency of the in-document phrase corresponding to the processing target phrase w and the aggregation time; and
a topic level calculation step in which the topic level calculation means obtains a topic level, indicating how topical the processing target phrase is at the calculation designated time acquired in the input reception step, by multiplying the phrase frequency function G_w(T) calculated in the phrase frequency calculation step by a topic weighting function that yields a high topic level when the phrase frequency is increasing near the calculation designated time,
wherein, in the phrase frequency calculation step,
the phrase frequency calculation means obtains the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents, obtains a correlation degree R(w) by evaluating the similarity between D_w(T) and D_all(T), and takes as the phrase frequency function G_w(T) the value obtained by subtracting, from D_w(T), a value proportional to the product of D_all(T) and R(w).

The topic level calculation method according to claim 1, wherein, in the phrase frequency calculation step,
the correlation degree R(w) between the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents is obtained by the following equation.

The topic level calculation method according to claim 1, wherein, in the phrase frequency calculation step,
the phrase frequency function G_w(T), which takes into account the increase or decrease in the phrase frequency caused by variation in the number of input documents, is obtained by the following equation, using the correlation degree R(w) between the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents.

The topic level calculation method according to any one of claims 1 to 3, wherein,
instead of accepting, in the input reception step, the processing target phrase w subject to the topic level calculation process and the calculation designated time indicating the time for which the topic level is to be calculated,
the phrase frequency function G_w(T) is obtained, in the phrase frequency calculation step, for all processing target phrases included in the input document group, and
the topic level at the processing time is obtained, in the topic level calculation step, for all processing target phrases included in the input document group.

A topic level calculation device that analyzes a large number of documents and determines the strength of topicality, at a desired time, of a phrase included in the documents, the device comprising:
document analysis means that, when an input document group having creation time information is input, analyzes the input document group, extracts the phrases to be evaluated for topicality (hereinafter referred to as in-document phrases), aggregates the number of input documents and the phrase frequency, which is the number of times each in-document phrase is used, and stores these in a phrase database together with the aggregation time and the in-document phrase;
phrase frequency calculation means that acquires an externally input phrase subject to the topic level calculation process (hereinafter referred to as a processing target phrase w), searches the phrase database based on the processing target phrase w, and calculates a phrase frequency function G_w(T) representing the time variation of the phrase frequency, based on the phrase frequency of the in-document phrase corresponding to the acquired processing target phrase w and the aggregation time; and
topic level calculation means that acquires, from outside, time information indicating the time for which the topic level is to be calculated (hereinafter referred to as a calculation designated time) and obtains a topic level, representing how topical the processing target phrase is at the calculation designated time, by multiplying the phrase frequency function G_w(T) calculated by the phrase frequency calculation means by a topic weighting function that yields a high topic level when the phrase frequency is increasing near the input time,
wherein the phrase frequency calculation means includes
means that obtains the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents, obtains a correlation degree R(w) by evaluating the similarity between D_w(T) and D_all(T), and takes as the phrase frequency function G_w(T) the value obtained by subtracting, from D_w(T), a value proportional to the product of D_all(T) and R(w).

The topic level calculation device according to claim 5, wherein the phrase frequency calculation means
obtains the correlation degree R(w) between the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents by the following equation.

The topic level calculation device according to claim 5, wherein the phrase frequency calculation means
obtains the phrase frequency function G_w(T), which takes into account the increase or decrease in the phrase frequency caused by variation in the number of input documents, by the following equation, using the correlation degree R(w) between the time variation D_w(T) of the phrase frequency of the processing target phrase w in the input document group and the time variation D_all(T) of the number of input documents.

The topic level calculation device according to claim 5, wherein
the phrase frequency calculation means includes means that, instead of accepting the processing target phrase w subject to the topic level calculation process and the calculation designated time indicating the time for which the topic level is to be calculated, obtains the phrase frequency function G_w(T) for all processing target phrases included in the input document group, and
the topic level calculation means includes means that obtains the topic level at the processing time for all processing target phrases included in the input document group.

A topic level calculation program that causes a computer to execute each means of the topic level calculation device according to claim 5.

A computer-readable recording medium storing a topic level calculation program that causes a computer to execute each means of the topic level calculation device according to claim 5.