JP3606159B2

JP3606159B2 - Text processing device

Info

Publication number: JP3606159B2
Application number: JP2000102274A
Authority: JP
Inventors: 航李; 健司山西
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2000-04-04
Filing date: 2000-04-04
Publication date: 2005-01-05
Anticipated expiration: 2020-04-04
Also published as: JP2001290833A

Description

【０００１】
【発明の属する技術分野】
本発明は、文章処理装置に関し、特に英語，日本語等の自然言語の文章から、その文章がトピックとしている事柄を正確に反映して抽出することが可能な文章処理装置に関する。
【０００２】
【従来の技術】
インターネットや電子図書館において、英語，日本語等の自然言語によって書かれた文章（テキスト，ドキュメント）から情報検索，情報抽出，情報要約を、如何に有効に行うかが益々重要な課題になってきている。
【０００３】
従来、文章のトピック（話題）をキーワードの形で表現し、キーワードを文章から自動的に抽出する技術が開発されている。ユーザにキーワードを提示することによって、文章のトピックを大まかに把握してもらうことができる。例えば、文章内に現われる単語のｔｆ−ｉｄｆと呼ばれる重みを計算し、ｔｆ−ｉｄｆの値の大きい単語をキーワードとする方法が広く使われている（Ｇ．Ｓａｌｔｏｎ，Ａ．ＷｏｎｇａｎｄＣ．Ｓ．Ｙａｎｇ，ＡＶｅｃｔｏｒＳｐａｃｅＭｏｄｅｌｆｏｒＡｕｔｏｍａｔｉｃＩｎｄｅｘｉｎｇ，ＣｏｍｍｕｎｉｃａｔｉｏｎｓｏｆｔｈｅＡＣＭ，Ｖｏｌｕｍｅ１８，Ｎｕｍｂｅｒ１１，ｐｐ６１３−６２０，１９７５）。ｔｆ−ｉｄｆの値は、単語の注目文章内における頻度と比例する。つまり、単語が注目文章内に頻繁に現われれば、そのｔｆ−ｉｄｆの値も大きい。
【０００４】
別の従来技術は、文章自動分割（セグメンテーション）技術の開発である。例えば、ハーストが提案した方式（ＭａｒｔｉＨｅａｒｓｔ，ＴｅｘｔＴｉｌｉｎｇ：ＳｅｇｍｅｎｔｉｎｇＴｅｘｔｉｎｔｏＭｕｌｔｉ−ｐａｒａｇｒａｐｈＳｕｂｔｏｐｉｃＰａｓｓａｇｅｓ，ＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，Ｖｏｌｕｍｅ２３，Ｎｕｍｂｅｒ１，ｐｐ３３−６４，１９９７）では、単語の出現頻度ベクトルの間の距離を基に、文章を幾つかの部分に分割することができる。また、「特開平１１−２７２６９９号公報：文章要約装置およびその方法」のようなハーストの方式を拡張したものも提案されている。
【０００５】
【発明が解決しようとする課題】
従来のキーワード方式では、キーワード、つまり、単語を基に文章のトピックを抽出する場合、それをうまく抽出できないことがあった。例えば、トピックを表す単語の出現頻度の低い時、それらの単語の表すトピックも抽出され難かった。例えば、或る新聞記事に、「貿易」，「関税」，「輸入」，「輸出」という単語が一回ずつ現われたと仮定する。「貿易」，「関税」，「輸入」，「輸出」は全て貿易と関係するものなので、「貿易」というトピックが抽出されるべきである。しかし、個々の単語の出現頻度が低いので、それらの単語はキーワードとして抽出され難かった。よって、それらの単語の表す「貿易」というトピックも抽出され難かった。
【０００６】
また、従来の文章自動分割方式では、単語の出現頻度ベクトルに基づいて文章の分割を行っているので、文章をそのトピック構造の変化に応じてうまく分割することができなかった。更に、分割された各部分のトピックはキーワード抽出法を用いて抽出する以外に方法がなかった。ここにも前記キーワードによるトピック抽出の問題点がある。
【０００７】
そこで本発明の課題は、英語，日本語等の自然言語の文章から、その文章がトピックとしている事柄を正確に反映して抽出することが可能な文章処理装置を提供することである。
【０００８】
【課題を解決するための手段】
前記課題を解決するために本発明は、自然言語からなる文章における単語間の共起情報を記憶する共起情報記憶手段と、自然言語からなる文章を入力する文章入力手段と、該文章入力手段から入力された前記文章を、単語を単位にして分割する形態素解析手段と、該形態素解析手段が分割した単語に基づいて前記共起情報記憶手段に記憶された共起情報を参照し、前記文章のトピックを抽出するトピック抽出手段とを備え、前記トピック抽出手段は、前記形態素解析手段が分割した単語を入力し、該単語に対して共起し易い単語を、前記共起情報記憶手段に記憶された共起情報から参照し、この参照が、前記共起し易い単語の中で前記入力された文章に現われる単語だけを対象とする共起情報参照手段と、前記分割した単語に対して共起し易い単語の中で前記入力された文章に現われる単語を、単語クラスタ（グループ）に纏める単語クラスタ纏め手段と、該単語クラスタ纏め手段が纏めた単語クラスタ内における個々の単語の出現割合を計算する第１出現割合計算手段と、前記単語クラスタ纏め手段が纏めた複数の単語クラスタからなるトピックの前記入力した文章全体における出現割合を計算する第２出現割合計算手段とを備えたことを特徴とする。
【０００９】
また、前記文章入力手段から入力した文章を、段落で分割する文章分割手段を備えたことを特徴とする。
このようにすれば、トピック抽出手段が文章を構成する単語の共起情報に基づいて文章のトピックを抽出するので、文章のトピックを正確に抽出することができる。
【００１０】
さらに、このようにすれば、例えば図１２（Ｂ）に示すように、共起し易い単語のクラスタ（例えば貿易関係トピックのクラスタ）における個々の単語（例えば trade ）の出現割合（この場合は 0 ． 4 （ =40 ％））が一目で判別できる。
しかも、このようにすれば、例えば図１２（Ｂ）に示すように、共起し易い単語のクラスタ（貿易関係トピック，時間関係トピック，OTHERSの各クラスタ）からなるトピックが、文章全体でどの程度出現するかの割合が一目で判別できる。例えば、OTHERSのクラスタは、文章全体で0.4（=40％）の割合で出現している。
【００１１】
また、前記トピック抽出手段が抽出した文章のトピックを表示する表示手段を備えたことを特徴とする。
このようにすれば、例えば図１２（Ｂ）のように表示できる。
【００１２】
また、前記トピック抽出手段が抽出した文章のトピックを表示する表示手段を備え、前記表示手段は、前記文章分割手段で分割した文章と、前記単語クラスタ纏め手段が纏めた単語クラスタと、前記第２出現割合計算手段が計算したトピックの出現割合とを、３次元で表示することを特徴とする。
このようにすれば、例えば図１６に示すように、単語クラスタをＸ軸方向,段落をＹ軸方向,確率（出現割合）をＺ軸方向に表現するので、単語クラスタのトピックの出現割合を一目で判別できる。
【００１３】
【発明の実施の形態】
以下、本発明を図示の第１〜第４実施例に基づいて説明する。第１〜第４実施例の文章処理装置Ｂの基本構成を図１に示し、夫々の実施例の相違点は処理部４０の構成のみである。
【００１４】
（１）第１実施例
図１に示すように、本実施例の文章処理装置Ｂは、「共起情報記憶手段」である記憶部１０と、「文章入力手段」である入力部２０と、「形態素解析手段」である形態素解析部３０と、「トピック抽出手段」である処理部４０と、表示部５０を備える。
【００１５】
記憶部１０は、単語の共起情報を記憶し、どの単語とどの単語が共起し易いという単語の共起情報を記憶する。ここに、共起とは、言語学で２つ以上の言語現象が同じ発話や文の中で同時に起り得ることをいい、例えば「よちよち」は「歩く」とは共起するが、「走る」とは共起しない（三省堂、新明解国語辞典・第五版）。図２に単語（英語）の共起情報の具体例を示す。例えば、単語「ｔｒａｄｅ」は、単語「ｉｍｐｏｒｔ］，「ｅｘｐｏｒｔ］等と共起し易い、つまり、単語間における共起し易いことに関する情報が共起情報である。
【００１６】
記憶部１０における単語の共起情報は人間が作成したものでもよく、コンピュータが自動的に収集したものでもよい。コンピュータが自動的に収集する場合、最も簡単な方法は、大量の文章を集め、同じ文章に現われる任意の二つの単語の共起頻度をカウントし、共起頻度の高い単語ペアが共起し易いと判断することである。但し、本発明において、共起情報の集め方はここで説明した方法に限定するものではない。
【００１７】
入力部２０は、英語，日本語等の自然言語の文章を入力する。文章が長いものでもよく、短いものでもよい。
【００１８】
形態素解析部３０は入力部２０から文章を入力し、単語の列に分割する。図３（Ａ）は日本語の文章の一例を示す。図３（Ｂ）は、図３（Ａ）の文章に対して形態素解析を行った結果の例を示す。即ち、日本語では形態素解析によって文章が単語に分割される。図４（Ａ）は英語の文章の一例を示す。図４（Ｂ）は、図４（Ａ）の文章に対して形態素解析を行った結果の例を示す。即ち、英語では形態素解析によって単語が原型に変換される。
【００１９】
図５は第１実施例の処理部４０の構成図である。
処理部４０は、形態素解析部３０から単語の列を入力し、記憶部１０に記憶された単語共起情報を参照する「共起情報参照手段」である共起情報参照部４１と、単語の列における共起し易い単語を単語のクラスタに纏める「単語クラスタ纏め手段」である単語クラスタ纏め部４２を備える。
【００２０】
また、処理部４０は、単語のクラスタからなる確率モデルを作成する線形結合モデル作成部４３を備える。前記確率モデルは、単語クラスタ内の確率分布を単語クラスタ間の確率分布によって線形結合する（線形結合モデルと称する）（図１２（Ｂ）参照・後述する）。また、線形結合モデルにおける単語クラスタ内の確率分布と単語クラスタ間の確率分布を計算する確率モデル計算部４４を備える。
【００２１】
例えば、或る単語クラスタ内の確率分布で或る単語の出現頻度が低い場合でも、前記或る単語は単語クラスタ間では出現頻度が高い場合がある。従って、線形結合モデルを作成することにより、或る単語クラスタ内の出現頻度のみならず、単語クラスタ間の出現頻度を含めた、文章全体のトピックを正確に反映することができることになる。前記線形結合モデル作成部４３と確率モデル計算部４４とが「第１出現割合計算手段」と「第２出現割合計算手段」を構成する。
【００２２】
表示部５０は、処理部４０から単語クラスタ間の確率分布を入力し、単語クラスタ間の確率分布、或は単語クラスタ間の確率分布における確率の高い単語クラスタを、文章処理装置のユーザに表示する。
【００２３】
図６は、本実施例の文章処理装置の処理のフローチャートを示す。
本実施例の文章処理装置を用いれば、図７（Ａ）に示す入力文章から、図７（Ｂ）に示す単語クラスタによって表現されたトピックを、表示部５０を介してユーザに提示する。図７（Ｂ）の単語クラスタ「ｍｏｎｔｈ−Ｊａｎｕａｒｙ−Ｆｅｂｒｕａｒｙ」が一つのトピックを表し、図７（Ｃ）の単語クラスタ「ｔｒａｄｅ−ｅｘｐｏｒｔ−ｉｍｐｏｒｔ」が別のトピックを表す。
【００２４】
図７（Ｃ）は、図７（Ａ）の文章からｔｆ−ｉｄｆを使って抽出したキーワード（トピック）である。共に７単語からなるものであるが、図７（Ｂ）の抽出されたトピックが、図７（Ｃ）の抽出されたトピックより適切であることがわかる。また、ユーザにとって図７（Ｂ）のほうが図７（Ｃ）より分り易いことも明らかである。
【００２５】
以下、どのようにして、単語クラスタ纏め部４２が単語を単語クラスタに纏め、線形結合モデル作成部４３が線形結合モデルを作成し、確率モデル計算部４４が線形結合モデルにおける確率分布を計算するかについて、具体的に説明する。
【００２６】
単語クラスタ纏め部４２は、例えば、以下のように入力文章にある単語を単語クラスタに纏める。但し、本発明において、単語クラスタの纏め方はここで説明した方法に限定されるものではない。
先ず、入力文章中の各単語のｔｆ−ｉｄｆの値を図８（Ａ）のように計算し、ｔｆ−ｉｄｆの値の高いものを集める。以下、ｔｆ−ｉｄｆの大きい単語をシード単語と呼ぶことにする。図８（Ｂ）は、図８（Ａ）の文章から抽出したシード単語を示す。
【００２７】
共起情報参照部４１は、シード単語に対して共起し易い単語を記憶部１０から参照する。なお、共起し易い単語の中で、入力文章に現われるものだけを対象にする。図９（Ａ）に示すのは、シード単語に共起しやすく、入力文章に現われた単語である。
【００２８】
単語クラスタ纏め部４２は、シード単語と、それに共起しやすく入力文章に現われた単語を一つの単語クラスタに纏める。つまり、シード単語を中心に同義語や関連語の一つの単語集合を作成する。単語クラスタには必ずシード単語がある。また、入力文章に現われたその他の単語を一つの特別なクラスタ「ＯＴＨＥＲＳ」に纏める。図９（Ｂ）に示すのは纏められた単語クラスタの例である。
【００２９】
次に単語クラスタ纏め部４２は、作成された単語クラスタを更に幾つかの大きい単語クラスタにマージ（合併）する。但し、このマージの処理を省略してよい。また、マージの方法はここで説明した方法に限定するものではない。
単語クラスタ纏め部４２は、シード単語が互いに共起し易い二つのクラスタをマージする。例えば、シード単語「ｔｒａｄｅ」の共起単語に「ｓｕｒｐｌｕｓ」がある。同時に、シード単語「ｓｕｒｐｌｕｓ」の共起単語に「ｔｒａｄｅ」がある。単語クラスタ纏め部４２は、この「ｔｒａｄｅ」と「ｓｕｒｐｌｕｓ」をシード単語として持つ二つの単語クラスタを一つに纏める。処理部４０の単語クラスタマージアルゴリズムを図１０に示す。更に、図１１にそのフローチャートを示す。
【００３０】
処理部４０の線形結合モデル作成部４３は、単語クラスタからなる線形結合モデルを作成する。前述の如く線形結合モデルは、文章内におけるトピックの変化を正確に反映するものである。具体的には、単語クラスタはトピックに対応し、単語クラスタに属する単語はそのトピックに関連している単語であると解釈することができる。一般に、文章は或る一つのトピックに集中するが、そのトピックから離れた別のトピックに一時的に移行する場合がある。また、文章が一つのトピックに集中しているときは、そのトピックに関連の深い単語を多く用いる傾向がある。例えば、或る文章に「ｔｒａｄｅ」というトピックについて論じることが多く、そのとき「ｔｒａｄｅ」に関連の深い単語を多く用いるが、たまには「ｐｏｌｉｔｉｃｓ」のトピックに移行することもある。即ち、「ｐｏｌｉｔｉｃｓ」に関連の深い言葉を用いるようになる。
【００３１】
図１２（Ａ）は線形結合モデルの数式を示す。Ｗが単語を表し、ｋが単語クラスを表す。単語ｗの単語クラスタｋからの生起確率はＰ（ｗ／ｋ）と表現される。Ｐｋ（ｗ／ｋ）はクラスタ内分布における確率である。単語クラスタｋの生起確率はＰ（ｋ）と表現される。Ｐ（ｋ）はクラスタ間の分布における確率である。
【００３２】
図１２（Ｂ）は線形結合モデルの例を示す。単語クラスタはトピックを表す。図１２（Ｂ）では、例えば、左の単語クラスタが貿易関係のトピックを表し、真中の単語クラスタが時間関係のトピックを表す。また、右の単語クラスタがＯＴＨＥＲＳに当たる。同じ単語が複数の単語クラスタに属してもよい。
【００３３】
単語クラスタ間の確率分布は、トピックが夫々どのぐらいの確率（相対頻度）で入力文章で言及されているかを表す。単語クラスタ内の確率分布は、単語が夫々どのぐらいの確率で関係するトピック内で言及されているかを表す。例えば、図１２（Ｂ）では、値「０．３（３０％），０．３（３０％），０．４（４０％）」が単語クラスタ間の分布を示す。これは、或る文章内で、貿易関係のトピック、時間関係のトピック、およびＯＴＨＥＲＳが夫々０．３（３０％），０．３（３０％），０．４（４０％）の確率で言及されている（登場している）ことを意味する。また、例えば、貿易関係の単語クラスタでは、値「０．４，０．１，０．２５，０．２５」が一つの単語クラスタ内の確率分布を形成する。単語「ｔｒａｄｅ」が０．４の確率で貿易関係のトピックで言及されるていることを意味する。
【００３４】
処理部４０の確率モデル計算部４４は、入力文章における単語の出現頻度を基に、線形結合モデルの確率を計算する。計算のアルゴリズムを図１３に示す。この計算では、線形結合モデルにおける確率値の初期値を先ず適宜に設定する。次に、文章における単語の出現頻度を基に繰り返し計算によって線形結合モデルの各確率値を計算する。この計算によって、例えば文章内で貿易関係の単語が多ければ、貿易関係のトピック（単語クラスタ）の確率が大きくなる。また、文章内で「ｉｍｐｏｒｔ」の出現頻度が多ければ、貿易関係のトピック（単語クラスタ）内で「ｉｍｐｏｒｔ」の確率が大きくなる。但し、本発明において、確率の計算方法はここで説明した方法に限定されるものではない。
【００３５】
表示部５０が最終的に確率の高いキーワードではなく、確率の高い単語クラスタをユーザにトピックとして表示する。或は、表示部５０は、単語クラスタ間の確率分布そのものをユーザに表示する。但し、本発明においてユーザへのトピックの表示方法はここで説明された方法に限定されるものではない。
【００３６】
（２）第２実施例
本実施例の構成は前述の如く処理部４０Ａのみが第１実施例と相違し、その他の記憶部１０，入力部２０，形態素解析部３０，表示部５０は第１実施例に同一である。
図１４は本実施例の処理部４０Ａを示し、共起情報参照部４１と単語クラスタ纏め部４２と線形結合モデル作成部４３は第１実施例に同一なので、重複説明を省略する。
【００３７】
処理部４０Ａの文章分割部４５は、入力された単語の列を幾つかの部分に分割し、線形結合モデル作成部４３は分割された各部分において、纏められた単語クラスタからなる線形結合モデルを作成する。更に、確率分布計算部４６は、各部分における線形結合モデル中の単語クラスタ内の確率分布と単語クラスタ間の確率分布を計算する。
表示部５０は、確率分布計算部４６から各部分における単語クラスタ間の確率分布、或は単語クラスタ間の確率分布における確率の高い単語クラスタを入力し、ユーザに表示する。
【００３８】
図１５は、本実施例の文章処理装置の処理のフローチャートを示す。
本実施例の文章処理装置は文章を幾つかの部分に分割し（第一段落，第二段落，第三段落等）、それぞれの部分の重要トピックとして単語クラスタを抽出する点で前記第１実施例と異る。
文章の分割は従来法、例えば、ハースト法を用いることが考えられる。但し、ハースト法に限定するものではない。
【００３９】
例えば、図１６に示すのは、本実施例の文章処理装置が表示した文章内の各部分のトピックの例である。文章は、その内容に応じて分割され、また、各分割された部分のトピックは単語クラスタとその確率によって表現される。即ち、Ｘ軸方向に単語クラスタを示し、Ｙ軸方向に分割されたブロックを示し、Ｚ軸方向に単語クラスタの確率を示す。この表示によってユーザが素早く文章のトピック構造を的確に把握することができる。
【００４０】
また、例えば、図１７に示すのは、本実施例の文章処理装置が表示した文章の分割された各部分と各部分のトピックの例である。この表示によってユーザが文章のトピック構造を的確に把握することができる。ここでは、各部分のトピックを太字表示や赤色表示することにより、ユーザに一目でトピックが何であるかを分かり易くしてもよい。
本発明においてユーザへのトピックの表示方法はここで説明された方法に限定されるものではない。
【００４１】
（３）第３実施例
本実施例の構成は前述の如く処理部４０Ｂのみが第１実施例と相違し、その他の記憶部１０，入力部２０，形態素解析部３０，表示部５０は第１実施例に同一である。
図１８は本実施例の処理部４０Ｂを示し、共起情報参照部４１と単語クラスタ纏め部４２と線形結合モデル作成部４３は第１実施例に同一なので、重複説明を省略する。
【００４２】
処理部４０Ｂの文章分割部４７は、線形結合モデルを用いて入力された単語の列を幾つかの部分に分割し、確率分布計算部４８は、各部分における線形結合モデル中の単語クラスタ内の確率分布と単語クラスタ間の確率分布を計算する。
表示部５０は、確率分布計算部４８から、各部分における単語クラスタ間の確率分布、或は単語クラスタ間の確率分布における確率値の高い単語クラスタを入力し、ユーザに表示する。
【００４３】
図１９は、本実施例の文章処理装置の処理のフローチャートを示す。
本実施例の文章処理装置は、先ず線形結合モデルを作成し、それから作成された線形結合モデルを用いて文章を分割する。線形結合モデルを用いて分割を行う点で第２実施例と異る。
処理部４０Ｂの線形結合モデル作成部４３が作成した線形結合モデルは、図１２（Ａ）に示すものと同じである。
文章分割部４７は、線形結合モデルを用いて文章の分割を行う。以下はその方法を示す。但し、線形結合モデルを用いた文章の分割法は以下の説明に限定されたわけではない。
【００４４】
先ず、入力文章における分割点となり得る点を特定する。通常文章内の各文の終わりがこのような点に当たる。例えば、図７（Ａ）の文章では、「ｓａｉｄ．」、「ｒｅｓｐｅｃｔｉｖｅｌｙ．」、「ｙｅａｒ．」の直後が分割点となり得る点である。
次に、分割点の左と右のｋ単語内の単語を集め、夫々を一つの疑似文章とする。或は、左右ｋ文内の単語を集め、夫々を一つの疑似文章としてもよい。ｋは予め決めた値である。左右夫々の疑似文章を基に作成された線形結合モデルにおける確率分布を計算する。確率の計算法は図１３に示すものであってもよいが、例えば、図２０に示すものでもよい。
【００４５】
次に、計算された左右の二つの線形結合モデルの間の距離を計算する。図２１は確率モデルの間の距離の計算式を示す。左右の線形結合モデルの間の距離が予め定められた閾値以上であれば、この点が分割点であると判断する。
以上の処理を全ての分割点の候補に対して行い、分割点となる点を決め、文章の分割を行う。
表示部５０は、例えば、分割された各部分の確率の大きい単語クラスタをトピックとし、ユーザに表示する。
【００４６】
（４）第４実施例
本実施例は入力部２０に大量の文章を入力した場合であり、文章処理装置の構成は前記第３実施例に同じである。
処理部４０の文章分割部４７は、線形結合モデルを用いて入力された単語の列を幾つかの部分に分割し、各部分における線形結合モデル中の単語クラスタ内の確率分布と単語クラスタ間の確率分布を計算する。
【００４７】
表示部５０は、処理部４０から、各部分における単語クラスタ間の確率分布、或は単語クラスタ間の確率分布における確率の高い単語クラスタを入力し、ユーザに表示する。
本実施例の文章処理装置の特徴は大量の文章の時系列におけるトピックの変化を表示できることである。例えば、一年間の新聞記事を入力し、その年の新聞記事におけるトピックの変化状況をみることができる。
【００４８】
【発明の効果】
以上説明したように本発明によれば、文章から意味のまとまった単語クラスタを抽出し、文章のトピックとしているので、キーワードにより正確にかつユーザにわかり易い形でトピックを抽出することができる。図７（Ｂ），（Ｃ）の例からこのことがよくわかる。
また、文章を分割し、分割された各部分のトピックを示すことによって、文章の内容を更にユーザに分り易い形で表示することができる。ユーザは本発明を使って重要なトピックだけでなく、文章内におけるトピックの変化もみることができる。
【図面の簡単な説明】
【図１】本発明の各実施例の構成図である。
【図２】同各実施例に適用する単語の共起情報の例である。
【図３】入力文章の例である。
【図４】別の入力文章の例である。
【図５】第１実施例の処理部の構成図である。
【図６】第１実施例のフローチャートである。
【図７】第１実施例における処理例であり、（Ａ）は文章の例、（Ｂ）は（Ａ）の文章から抽出した単語クラスタ、（Ｃ）は（Ａ）の文章から抽出したキーワードである。
【図８】（Ａ）は単語のｔｆ−ｉｄｆの値の計算式、（Ｂ）は図７（Ａ）のｔｆ−ｉｄｆの値の大きい単語である。
【図９】（Ａ）は図７（Ａ）の文章のシード単語と共起情報、（Ｂ）は図７（Ａ）の文章のシード単語とそのクラスタである。
【図１０】単語クラスタのマージのアルゴリズムである。
【図１１】単語クラスタのマージのフローチャートである。
【図１２】（Ａ）は線形結合モデルの計算式、（Ｂ）は線形結合モデルの例である。
【図１３】線形結合モデルにおける確率分布の計算方法である。
【図１４】第２実施例の処理部の構成図である。
【図１５】第２実施例における処理のフローチャートである。
【図１６】第２実施例の出力例である。
【図１７】第２実施例の出力例である。
【図１８】第３実施例の処理部の構成図である。
【図１９】第３実施例のフローチャートである。
【図２０】線形結合モデルにおける確率分布計算方法である
【図２１】線形結合モデルの間の距離の計算式である。
【符号の説明】
Ｂ文章処理装置
１０記憶部
２０入力部
３０形態素解析部
４０，４０Ａ，４０Ｂ処理部
４１共起情報参照部
４２単語クラスタまとめ部
４３線形結合モデル作成部
４４確率モデル計算部
４５単語後列分割部
４６確率分布計算部
４７文章分割部
４８確率分布計算部
５０表示部
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a text processing apparatus, and more particularly to a text processing apparatus capable of extracting, from text written in a natural language such as English or Japanese, the topics of that text in a way that accurately reflects what the text is about.
[0002]
[Prior art]
On the Internet and in electronic libraries, how to effectively perform information retrieval, information extraction, and information summarization on texts (documents) written in natural languages such as English and Japanese is becoming an increasingly important issue.
[0003]
Conventionally, techniques have been developed that express the topic of a text in the form of keywords and automatically extract those keywords from the text. Presenting the keywords to the user gives the user a rough grasp of the topic of the text. For example, a widely used method calculates a weight called tf-idf for each word appearing in a text and takes the words with large tf-idf values as keywords (G. Salton, A. Wong and C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, Volume 18, Number 11, pp. 613-620, 1975). The tf-idf value of a word is proportional to the frequency of the word in the text of interest. That is, if a word appears frequently in the text of interest, its tf-idf value is large.
[0004]
Another prior art is automatic text segmentation. For example, the method proposed by Hearst (Marti Hearst, TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages, Computational Linguistics, Volume 23, Number 1, pp. 33-64, 1997) can divide a text into several parts based on the distance between word frequency vectors. Extensions of Hearst's method, such as “JP-A-11-272699: Text summarizing apparatus and method”, have also been proposed.
[0005]
[Problems to be solved by the invention]
With the conventional keyword method, in which the topic of a text is extracted based on keywords, that is, on individual words, the topic sometimes cannot be extracted well. For example, when the words representing a topic appear with low frequency, the topic they represent is difficult to extract. Suppose that the words “trade”, “tariff”, “import”, and “export” each appear once in a newspaper article. Since “trade”, “tariff”, “import”, and “export” are all related to trade, the topic “trade” should be extracted. However, because the frequency of each individual word is low, these words are unlikely to be extracted as keywords. As a result, the topic “trade” represented by these words is also difficult to extract.
[0006]
Further, because the conventional automatic text segmentation method divides the text based on word frequency vectors, it cannot divide the text well in accordance with changes in its topic structure. Moreover, the only way to extract the topic of each divided part is to use the keyword extraction method, so the problem of keyword-based topic extraction described above arises here as well.
[0007]
Accordingly, an object of the present invention is to provide a text processing apparatus capable of extracting, from text written in a natural language such as English or Japanese, the topics of that text in a way that accurately reflects what the text is about.
[0008]
[Means for Solving the Problems]
In order to solve the above problem, the present invention comprises: co-occurrence information storage means for storing co-occurrence information between words in natural-language text; text input means for inputting natural-language text; morphological analysis means for dividing the text input from the text input means into words; and topic extraction means for extracting the topic of the text by referring, based on the words divided by the morphological analysis means, to the co-occurrence information stored in the co-occurrence information storage means. The topic extraction means comprises: co-occurrence information reference means that receives the words divided by the morphological analysis means and looks up, in the co-occurrence information stored in the co-occurrence information storage means, words that are likely to co-occur with each of those words, the lookup covering only those likely co-occurring words that appear in the input text; word cluster grouping means for grouping, into word clusters (groups), the words that are likely to co-occur with the divided words and that appear in the input text; first appearance ratio calculation means for calculating the appearance ratio of each individual word within a word cluster grouped by the word cluster grouping means; and second appearance ratio calculation means for calculating the appearance ratio, in the input text as a whole, of the topics made up of the plurality of word clusters grouped by the word cluster grouping means.
[0009]
Further, the present invention is characterized by comprising text dividing means for dividing the text input from the text input means into paragraphs.
In this way, the topic extraction means extracts the topic of the text based on the co-occurrence information of the words that make up the text, so the topic of the text can be extracted accurately.
[0010]
Furthermore, in this way, as shown in FIG. 12B for example, the appearance ratio (0.4 (= 40%) in this case) of an individual word (for example, trade) within a cluster of words that are likely to co-occur (for example, the cluster for the trade-related topic) can be seen at a glance.
Moreover, in this way, as shown in FIG. 12B for example, the proportion at which each topic made up of a cluster of likely co-occurring words (the trade-related topic, time-related topic, and OTHERS clusters) appears in the text as a whole can be seen at a glance. For example, the OTHERS cluster appears at a rate of 0.4 (= 40%) in the text as a whole.
[0011]
Further, the present invention is characterized by comprising display means for displaying the topic of the text extracted by the topic extraction means.
In this way, the result can be displayed as in FIG. 12B, for example.
[0012]
Further, the present invention is characterized by comprising display means for displaying the topic of the text extracted by the topic extraction means, wherein the display means displays, in three dimensions, the text divided by the text dividing means, the word clusters grouped by the word cluster grouping means, and the topic appearance ratios calculated by the second appearance ratio calculation means.
In this way, as shown in FIG. 16 for example, the word clusters are shown along the X axis, the paragraphs along the Y axis, and the probabilities (appearance ratios) along the Z axis, so the appearance ratio of the topic of each word cluster can be seen at a glance.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the present invention will be described based on the first to fourth embodiments shown in the drawings. FIG. 1 shows the basic configuration of the text processing apparatus B of the first to fourth embodiments; the embodiments differ only in the configuration of the processing unit 40.
[0014]
(1) First Embodiment
As shown in FIG. 1, the text processing apparatus B of this embodiment comprises a storage unit 10 serving as the “co-occurrence information storage means”, an input unit 20 serving as the “text input means”, a morphological analysis unit 30 serving as the “morphological analysis means”, a processing unit 40 serving as the “topic extraction means”, and a display unit 50.
[0015]
The storage unit 10 stores word co-occurrence information, that is, information on which words are likely to co-occur with which other words. Here, co-occurrence is a linguistics term meaning that two or more linguistic phenomena can occur together in the same utterance or sentence; for example, the Japanese word “yochiyochi” (toddlingly) co-occurs with “aruku” (to walk) but not with “hashiru” (to run) (Sanseido, Shin Meikai Kokugo Jiten, 5th edition). FIG. 2 shows a specific example of co-occurrence information for English words. For example, the word “trade” is likely to co-occur with words such as “import” and “export”; co-occurrence information is this kind of information on how readily words co-occur.
[0016]
The word co-occurrence information in the storage unit 10 may be created by a person or collected automatically by a computer. When a computer collects it automatically, the simplest method is to gather a large number of texts, count the co-occurrence frequency of every pair of words appearing in the same text, and judge word pairs with a high co-occurrence frequency to be likely to co-occur. However, in the present invention, the method of collecting the co-occurrence information is not limited to the method described here.
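The simple collection method described above can be sketched in Python as follows. This is a minimal illustration, not the procedure of the patent: the function names, the toy corpus, and the `min_count` threshold are all assumptions introduced here, since the text only says that pairs with a "high" co-occurrence frequency are judged likely to co-occur.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count, for each unordered word pair, how many documents contain both words."""
    pair_counts = Counter()
    for words in documents:
        unique = sorted(set(words))  # each pair is counted once per document
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def likely_cooccurring(pair_counts, min_count=2):
    """Build a table: word -> set of words judged likely to co-occur with it.

    min_count is a hypothetical threshold; the source does not specify one.
    """
    table = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            table.setdefault(a, set()).add(b)
            table.setdefault(b, set()).add(a)
    return table


# Toy corpus (illustrative only); each document is already a list of base-form words.
docs = [
    ["trade", "import", "export", "tariff"],
    ["trade", "export", "surplus"],
    ["month", "january", "february"],
    ["trade", "import", "month"],
]
counts = count_cooccurrences(docs)
table = likely_cooccurring(counts, min_count=2)
```

With this toy corpus only the pairs ("export", "trade") and ("import", "trade") reach the threshold, so `table["trade"]` contains "export" and "import".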
[0017]
The input unit 20 inputs text in a natural language such as English or Japanese. The text may be long or short.
[0018]
The morphological analysis unit 30 receives the text from the input unit 20 and divides it into a sequence of words. FIG. 3A shows an example of Japanese text. FIG. 3B shows an example of the result of performing morphological analysis on the text of FIG. 3A. That is, for Japanese, morphological analysis divides the text into words. FIG. 4A shows an example of English text. FIG. 4B shows an example of the result of performing morphological analysis on the text of FIG. 4A. That is, for English, morphological analysis converts each word to its base form.
[0019]
FIG. 5 is a configuration diagram of the processing unit 40 of the first embodiment.
The processing unit 40 comprises a co-occurrence information reference unit 41, which is the “co-occurrence information reference means” that receives the word sequence from the morphological analysis unit 30 and refers to the word co-occurrence information stored in the storage unit 10, and a word cluster grouping unit 42, which is the “word cluster grouping means” that groups the words in the word sequence that are likely to co-occur into word clusters.
[0020]
Further, the processing unit 40 comprises a linear combination model creation unit 43 that creates a probability model made up of the word clusters. This probability model linearly combines the probability distributions within the word clusters using the probability distribution between the word clusters, and is therefore called a linear combination model (see FIG. 12B, described later). The processing unit 40 also comprises a probability model calculation unit 44 that calculates the within-cluster probability distributions and the between-cluster probability distribution of the linear combination model.
[0021]
For example, even if a certain word has a low appearance frequency in the probability distribution within a certain word cluster, that word may have a high appearance frequency across word clusters. Therefore, by creating a linear combination model, the topics of the text as a whole can be reflected accurately, taking into account not only the appearance frequencies within each word cluster but also the appearance frequencies between word clusters. The linear combination model creation unit 43 and the probability model calculation unit 44 together constitute the “first appearance ratio calculation means” and the “second appearance ratio calculation means”.
[0022]
The display unit 50 receives the probability distribution between word clusters from the processing unit 40 and displays to the user of the text processing apparatus either that distribution or the word clusters that have a high probability in it.
[0023]
FIG. 6 shows a flowchart of processing of the text processing apparatus of this embodiment.
Using the text processing apparatus of this embodiment, the topics expressed as the word clusters shown in FIG. 7B are extracted from the input text shown in FIG. 7A and presented to the user via the display unit 50. The word cluster “month-January-February” in FIG. 7B represents one topic, and the word cluster “trade-export-import” in FIG. 7C represents another topic.
[0024]
FIG. 7C shows the keywords (topics) extracted from the text of FIG. 7A using tf-idf. Both results consist of seven words, but the topics extracted in FIG. 7B can be seen to be more appropriate than those extracted in FIG. 7C. It is also clear that FIG. 7B is easier for the user to understand than FIG. 7C.
[0025]
The following describes specifically how the word cluster grouping unit 42 groups words into word clusters, how the linear combination model creation unit 43 creates the linear combination model, and how the probability model calculation unit 44 calculates the probability distributions in the linear combination model.
[0026]
The word cluster grouping unit 42 groups the words in the input text into word clusters, for example, as follows. However, in the present invention, the method of grouping words into word clusters is not limited to the method described here.
First, the tf-idf value of each word in the input text is calculated as shown in FIG. 8A, and the words with high tf-idf values are collected. Hereinafter, a word with a large tf-idf value is called a seed word. FIG. 8B shows the seed words extracted from the text of FIG. 7A.
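Seed word selection can be sketched as below. The exact formula of FIG. 8A is not reproduced in this text, so the sketch assumes the common form tf × log(N / df); the function names, the document-frequency table, and the `top_n` cutoff are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(doc_words, doc_freq, num_docs):
    """Score each word of one text with tf * log(N / df) (an assumed tf-idf variant)."""
    tf = Counter(doc_words)
    return {w: tf[w] * math.log(num_docs / doc_freq.get(w, 1)) for w in tf}


def pick_seed_words(scores, top_n=2):
    """Return the top_n words by tf-idf; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:top_n]]


# Illustrative numbers only: df counts over a hypothetical 1000-document collection.
doc = ["trade", "trade", "trade", "month", "month",
       "the", "the", "the", "the", "export"]
doc_freq = {"trade": 10, "month": 50, "the": 1000, "export": 20}
scores = tfidf_scores(doc, doc_freq, num_docs=1000)
seeds = pick_seed_words(scores, top_n=2)
```

In this example "the" is the most frequent word in the text but appears in every document of the collection, so its score is zero and "trade" and "month" are chosen as seed words.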
[0027]
The co-occurrence information reference unit 41 looks up, in the storage unit 10, the words that are likely to co-occur with each seed word. Of these likely co-occurring words, only those that appear in the input text are considered. FIG. 9A shows the words that are likely to co-occur with the seed words and that appear in the input text.
[0028]
The word cluster grouping unit 42 groups each seed word, together with the words that are likely to co-occur with it and that appear in the input text, into one word cluster. In other words, it creates one set of synonyms and related words centered on the seed word. Every word cluster contains a seed word. The remaining words that appear in the input text are grouped into one special cluster, “OTHERS”. FIG. 9B shows an example of the resulting word clusters.
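The grouping step just described can be sketched as follows, under the assumption that the co-occurrence information is available as a dictionary from a word to the set of words likely to co-occur with it. The function name and the toy data are hypothetical.

```python
def build_clusters(doc_words, seeds, cooccur_table):
    """Group each seed with its likely co-occurring words that appear in the text.

    Every other word of the text goes into the special cluster "OTHERS".
    """
    vocab = set(doc_words)
    clusters = {}
    covered = set()
    for seed in seeds:
        # Keep only co-occurring words that actually appear in the input text.
        members = {seed} | (cooccur_table.get(seed, set()) & vocab)
        clusters[seed] = members
        covered |= members
    clusters["OTHERS"] = vocab - covered
    return clusters


# Toy data (illustrative only).
doc = ["trade", "export", "import", "month", "january", "said", "year"]
cooccur = {
    "trade": {"export", "import", "tariff", "surplus"},
    "month": {"january", "february"},
}
clusters = build_clusters(doc, ["trade", "month"], cooccur)
```

Here "tariff", "surplus", and "february" are dropped because they do not appear in the input text, and "said" and "year" fall into OTHERS.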
[0029]
Next, the word cluster grouping unit 42 merges the created word clusters into a smaller number of larger word clusters. This merging step may be omitted, and the merging method is not limited to the method described here.
The word cluster grouping unit 42 merges two clusters whose seed words are likely to co-occur with each other. For example, the co-occurring words of the seed word “trade” include “surplus”, and at the same time the co-occurring words of the seed word “surplus” include “trade”. The word cluster grouping unit 42 therefore combines the two word clusters that have “trade” and “surplus” as their seed words into one. FIG. 10 shows the word cluster merging algorithm of the processing unit 40, and FIG. 11 shows its flowchart.
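The merging rule stated above (merge two clusters when their seed words each list the other as a co-occurring word) can be sketched as follows. The algorithm of FIG. 10 is not reproduced in this text, so this is only one plausible reading: it merges repeatedly until no mutually co-occurring pair of seeds remains. The function name and data layout are assumptions.

```python
def merge_clusters(clusters, cooccur_table):
    """Merge clusters whose seed words mutually co-occur.

    clusters: dict seed -> set of member words (OTHERS excluded by the caller).
    Returns a list of (seed_set, member_set) pairs.
    """
    groups = [({seed}, set(members)) for seed, members in clusters.items()]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                seeds_i, seeds_j = groups[i][0], groups[j][0]
                mutual = any(
                    b in cooccur_table.get(a, set())
                    and a in cooccur_table.get(b, set())
                    for a in seeds_i
                    for b in seeds_j
                )
                if mutual:
                    groups[i] = (seeds_i | seeds_j, groups[i][1] | groups[j][1])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return groups


# Toy data (illustrative only).
seed_clusters = {
    "trade": {"trade", "surplus", "export"},
    "surplus": {"surplus", "trade"},
    "month": {"month", "january"},
}
cooccur = {
    "trade": {"surplus", "export"},
    "surplus": {"trade"},
    "month": {"january"},
}
groups = merge_clusters(seed_clusters, cooccur)
```

The "trade" and "surplus" clusters are combined into one, while the "month" cluster is left as it is.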
[0030]
The linear combination model creation unit 43 of the processing unit 40 creates a linear combination model made up of the word clusters. As described above, the linear combination model accurately reflects the changes of topic within a text. Specifically, a word cluster corresponds to a topic, and the words belonging to the word cluster can be interpreted as words related to that topic. In general, a text concentrates on one topic but may temporarily shift to a different topic. Also, while a text is concentrating on one topic, it tends to use many words closely related to that topic. For example, a text may mostly discuss the topic “trade”, using many words closely related to “trade”, but occasionally shift to the topic “politics”, at which point it uses words closely related to “politics”.
[0031]
FIG. 12A shows the mathematical expression of the linear combination model. Here w represents a word and k represents a word cluster. The probability that the word w is generated from the word cluster k is written P(w|k); P(w|k) is a probability in the within-cluster distribution. The occurrence probability of the word cluster k is written P(k); P(k) is a probability in the between-cluster distribution.
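The expression of FIG. 12A is not reproduced in this text. Assuming the model is the usual finite mixture, which is what "linearly combining the within-cluster distributions by the between-cluster distribution" suggests, it could be written as:

```latex
P(w) \;=\; \sum_{k} P(k)\, P(w \mid k),
\qquad \sum_{k} P(k) = 1,
\qquad \sum_{w \in k} P(w \mid k) = 1 \ \text{for each cluster } k .
```

Under this reading, P(w | k) is zero for any word w that does not belong to cluster k, and a word that belongs to several clusters receives a contribution from each of them.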
[0032]
FIG. 12B shows an example of a linear combination model. A word cluster represents a topic. In FIG. 12B, for example, the left word cluster represents a trade-related topic, and the middle word cluster represents a time-related topic. The right word cluster corresponds to OTHERS. The same word may belong to multiple word clusters.
[0033]
The probability distribution between word clusters represents the probability (relative frequency) with which each topic is mentioned in the input text. The probability distribution within a word cluster represents the probability with which each word is mentioned within its related topic. For example, in FIG. 12B, the values “0.3 (30%), 0.3 (30%), 0.4 (40%)” are the distribution between word clusters. This means that, within the text, the trade-related topic, the time-related topic, and OTHERS are mentioned (appear) with probabilities of 0.3 (30%), 0.3 (30%), and 0.4 (40%), respectively. Likewise, in the trade-related word cluster, the values “0.4, 0.1, 0.25, 0.25” form the probability distribution within that one word cluster. This means that the word “trade” is mentioned within the trade-related topic with a probability of 0.4.
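As a small worked example of how the two kinds of distribution combine, the sketch below computes the overall probability of the word "trade" from the FIG. 12B values quoted above. It assumes a mixture of the form P(w) = Σ_k P(k) P(w|k) and assumes that "trade" belongs only to the trade-related cluster. The text does not name the other three words of that cluster, so they appear under placeholder names, and the contents of the other two clusters are left empty.

```python
def word_probability(word, cluster_prior, within_cluster):
    """Overall probability of a word: sum over clusters of P(k) * P(w|k)."""
    return sum(
        cluster_prior[k] * within_cluster[k].get(word, 0.0)
        for k in cluster_prior
    )


# Between-cluster distribution quoted from FIG. 12B.
cluster_prior = {"trade_topic": 0.3, "time_topic": 0.3, "OTHERS": 0.4}

# Within-cluster distribution of the trade-related cluster. Only "trade" is
# named in the text; word_b, word_c, word_d are placeholders.
within_cluster = {
    "trade_topic": {"trade": 0.4, "word_b": 0.1, "word_c": 0.25, "word_d": 0.25},
    "time_topic": {},  # values not given in the text
    "OTHERS": {},      # values not given in the text
}

p_trade = word_probability("trade", cluster_prior, within_cluster)
```

Under these assumptions the result is 0.3 × 0.4 = 0.12, that is, "trade" accounts for about 12% of the word occurrences the model expects in the text.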
[0034]
The probability model calculation unit 44 of the processing unit 40 calculates the probabilities of the linear combination model based on the appearance frequencies of the words in the input text. FIG. 13 shows the calculation algorithm. In this calculation, the probability values of the linear combination model are first set to suitable initial values. Next, each probability value of the linear combination model is calculated by iterative computation based on the appearance frequencies of the words in the text. As a result of this calculation, for example, if the text contains many trade-related words, the probability of the trade-related topic (word cluster) becomes large. Likewise, if “import” appears frequently in the text, the probability of “import” within the trade-related topic (word cluster) becomes large. However, in the present invention, the method of calculating the probabilities is not limited to the method described here.
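The algorithm of FIG. 13 is not reproduced in this text. One standard iterative procedure that matches the description (set initial values, then update repeatedly from word frequencies) is the EM algorithm for a mixture model, sketched below. Treating the calculation as EM is an assumption made here, as are the function name, the uniform initialization, and the fixed iteration count.

```python
from collections import Counter


def fit_mixture(doc_words, clusters, iterations=50):
    """Estimate P(k) and P(w|k) by EM, with each P(w|k) restricted to cluster k's words.

    clusters: dict cluster_name -> set of member words. Every word of the text
    is assumed to belong to at least one cluster (OTHERS ensures this).
    """
    counts = Counter(doc_words)
    names = list(clusters)
    # Initial values: uniform between clusters and uniform within each cluster.
    prior = {k: 1.0 / len(names) for k in names}
    within = {k: {w: 1.0 / len(clusters[k]) for w in clusters[k]} for k in names}

    for _ in range(iterations):
        # E-step: share each word's count among clusters in proportion to P(k)P(w|k).
        mass = {k: Counter() for k in names}
        for w, n in counts.items():
            joint = {k: prior[k] * within[k].get(w, 0.0) for k in names}
            total = sum(joint.values())
            if total == 0.0:
                continue  # word not covered by any cluster
            for k in names:
                mass[k][w] += n * joint[k] / total
        # M-step: renormalize the shared counts into probabilities.
        grand = sum(sum(m.values()) for m in mass.values())
        for k in names:
            mass_k = sum(mass[k].values())
            prior[k] = mass_k / grand
            if mass_k > 0.0:
                within[k] = {w: mass[k][w] / mass_k for w in clusters[k]}
    return prior, within


# Toy text (illustrative only): 8 trade-related words and 2 time-related words.
doc = ["trade"] * 4 + ["import"] * 3 + ["export"] + ["month"] * 2
clusters = {"trade": {"trade", "import", "export"}, "month": {"month"}}
prior, within = fit_mixture(doc, clusters)
```

In this toy text the clusters do not overlap, so the estimates settle at the relative frequencies: the trade cluster receives probability 0.8, and within it "import" receives 3/8 = 0.375. This mirrors the behavior described above, in which more trade-related words raise the probability of the trade topic and more occurrences of "import" raise its probability within that topic.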
[0035]
The display unit 50 finally displays a high-probability word cluster as a topic to the user instead of a high-probability keyword. Alternatively, the display unit 50 displays the probability distribution itself between the word clusters to the user. However, in the present invention, the method of displaying the topic to the user is not limited to the method described here.
[0036]
(2) Second Embodiment
As described above, the configuration of this embodiment differs from the first embodiment only in the processing unit 40A; the storage unit 10, input unit 20, morphological analysis unit 30, and display unit 50 are the same as in the first embodiment.
FIG. 14 shows the processing unit 40A of this embodiment. The co-occurrence information reference unit 41, the word cluster grouping unit 42, and the linear combination model creation unit 43 are the same as in the first embodiment, so their description is not repeated.
[0037]
The text dividing unit 45 of the processing unit 40A divides the input word sequence into several parts, and the linear combination model creation unit 43 creates, for each divided part, a linear combination model made up of the grouped word clusters. Further, the probability distribution calculation unit 46 calculates, for each part, the within-cluster probability distributions and the between-cluster probability distribution of the linear combination model.
The display unit 50 receives from the probability distribution calculation unit 46 the between-cluster probability distribution for each part, or the word clusters with a high probability in that distribution, and displays them to the user.
[0038]
FIG. 15 shows a flowchart of processing of the text processing apparatus of this embodiment.
The text processing apparatus of this embodiment differs from the first embodiment in that it divides the text into several parts (first paragraph, second paragraph, third paragraph, and so on) and extracts word clusters as the important topics of each part.
A conventional method, for example Hearst's method, may be used to divide the text. However, the division is not limited to Hearst's method.
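For illustration, a division based on the distance between word frequency vectors can be sketched as below. This is a deliberately simplified stand-in and not Hearst's TextTiling itself, which scores the depth of similarity valleys rather than applying a fixed threshold. The window size, the threshold, and the function names are assumptions.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two word frequency vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def split_points(sentences, window=2, threshold=0.8):
    """Return the sentence indices at which a new part starts.

    At each gap between sentences, the words of the `window` sentences on the
    left and on the right are compared; a large distance marks a boundary.
    """
    points = []
    for i in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, i - window):i] for w in s)
        right = Counter(w for s in sentences[i:i + window] for w in s)
        if cosine_distance(left, right) >= threshold:
            points.append(i)
    return points


# Toy text (illustrative only): two trade sentences followed by two time sentences.
sentences = [
    ["trade", "export"],
    ["trade", "import"],
    ["month", "january"],
    ["month", "february"],
]
boundaries = split_points(sentences)
```

For this toy text the only boundary is found before the third sentence, where the vocabulary changes from trade-related to time-related words.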
[0039]
For example, FIG. 16 shows an example of the topics of each part of a text as displayed by the text processing apparatus of this embodiment. The text is divided according to its content, and the topic of each divided part is expressed by word clusters and their probabilities. That is, the word clusters are shown along the X axis, the divided blocks along the Y axis, and the probabilities of the word clusters along the Z axis. This display allows the user to grasp the topic structure of the text quickly and accurately.
[0040]
As another example, FIG. 17 shows each divided part of a text, together with the topic of each part, as displayed by the text processing apparatus of this embodiment. This display allows the user to grasp the topic structure of the text accurately. Here, the topic of each part may be displayed in bold or in red so that the user can see at a glance what the topic is.
In the present invention, the method of displaying the topic to the user is not limited to the method described here.
[0041]
(3) Third Embodiment
As described above, the configuration of the third embodiment differs from that of the first embodiment only in the processing unit 40B; the storage unit 10, the input unit 20, the morphological analysis unit 30, and the display unit 50 are the same as in the first embodiment.
FIG. 18 shows the processing unit 40B of the present embodiment. The co-occurrence information reference unit 41, the word cluster summarizing unit 42, and the linear combination model creation unit 43 are the same as in the first embodiment, so their description is not repeated.
[0042]
The text division unit 47 of the processing unit 40B divides the input word sequence into several parts using the linear combination model, and the probability distribution calculation unit 48 calculates, for the linear combination model of each part, the probability distribution within the word clusters and the probability distribution between the word clusters.
The display unit 50 receives from the probability distribution calculation unit 48 either the probability distribution between word clusters in each part or the word clusters that have a high probability in that distribution, and displays them to the user.
[0043]
FIG. 19 shows a flowchart of processing of the text processing apparatus of this embodiment.
The text processing apparatus of this embodiment first creates a linear combination model and then divides the text using that model. This embodiment differs from the second embodiment in that the division is performed using the linear combination model.
The linear combination model created by the linear combination model creation unit 43 of the processing unit 40B is the same as that shown in FIG.
The text division unit 47 divides the text using the linear combination model. The method is described below. However, the method of dividing text with the linear combination model is not limited to the following description.
[0044]
First, the points in the input text that can serve as division points are identified. Usually, the end of each sentence in the text is such a point. For example, in the text of FIG. 7A, the points immediately after “said.”, “respectively.”, and “year.” can be division points.
Next, the k words to the left and the k words to the right of a candidate point are collected, and each group is treated as one pseudo-text. Alternatively, the words in the k sentences to the left and in the k sentences to the right may be collected, each group forming one pseudo-text. Here k is a predetermined value. The probability distributions of the linear combination model are then calculated for the left and the right pseudo-text. The probabilities may be calculated as shown in FIG. 13 or, for example, as shown in FIG. 20.
[0045]
Next, the distance between the left and right linear combination models is calculated. FIG. 21 shows a formula for calculating the distance between the probability models. If the distance between the left and right linear combination models is greater than or equal to a predetermined threshold, the candidate point is determined to be a division point.
The above processing is performed for all candidate division points, the division points are determined, and the text is divided.
The display unit 50 then displays to the user, for example, the word clusters having a high probability in each divided part as the topics of that part.
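The division procedure above can be sketched in Python as follows. Several points are assumptions made only for illustration: the formula of FIG. 21 is not reproduced in this text, so an L1 distance between cluster distributions stands in for it; the cluster distribution of each pseudo-text is estimated by relative frequency rather than by FIG. 13 or FIG. 20; a uniform distribution is used when a window contains no clustered word; and the candidate points are passed in as word indices. The names `cluster_probs`, `l1_distance`, and `split_points` are hypothetical.

```python
from collections import Counter

def cluster_probs(words, word_to_cluster, names):
    """Relative-frequency estimate of the cluster distribution of a pseudo-text."""
    counts = Counter(word_to_cluster[w] for w in words if w in word_to_cluster)
    total = sum(counts.values())
    if total == 0:
        return {c: 1.0 / len(names) for c in names}  # uniform fallback (assumption)
    return {c: counts[c] / total for c in names}

def l1_distance(p, q):
    """Stand-in distance between two cluster distributions (not FIG. 21)."""
    return sum(abs(p[c] - q[c]) for c in p)

def split_points(tokens, candidates, word_to_cluster, k, threshold):
    """Keep each candidate whose left and right k-word windows differ enough."""
    names = sorted(set(word_to_cluster.values()))
    chosen = []
    for i in candidates:  # boundary lies between tokens[i - 1] and tokens[i]
        left = cluster_probs(tokens[max(0, i - k):i], word_to_cluster, names)
        right = cluster_probs(tokens[i:i + k], word_to_cluster, names)
        if l1_distance(left, right) >= threshold:
            chosen.append(i)
    return chosen

# Hypothetical word sequence with sentence ends after the 3rd and 5th words.
word_to_cluster = {"trade": "trade", "tariff": "trade", "import": "trade",
                   "export": "trade", "game": "sports", "team": "sports",
                   "score": "sports"}
tokens = ["tariff", "import", "export", "trade", "tariff", "game", "team", "score"]
chosen = split_points(tokens, [3, 5], word_to_cluster, k=3, threshold=1.0)
```

With these values only the candidate at index 5 is kept: the windows around index 3 both consist mostly of trade words, whereas at index 5 the left window is entirely trade words and the right window entirely sports words.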
[0046]
(4) Fourth Embodiment
In this embodiment, a large amount of text is input to the input unit 20. The configuration of the text processing apparatus is the same as that of the third embodiment.
The text division unit 47 of the processing unit 40 divides the input word sequence into several parts using the linear combination model, and calculates, for the linear combination model of each part, the probability distribution within the word clusters and the probability distribution between the word clusters.
[0047]
The display unit 50 receives from the processing unit 40 either the probability distribution between word clusters in each part or the word clusters that have a high probability in that distribution, and displays them to the user.
The feature of the text processing apparatus of this embodiment is that it can display how topics change over time across a large amount of text. For example, one year of newspaper articles can be input to see how the topics in those articles changed over the year.
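One simple way to read topic changes off the per-part distributions is sketched below. It assumes the parts are already in chronological order and takes the cluster with the highest probability as the topic of each part; both the rule and the names `dominant_topics` and `topic_changes` are hypothetical illustrations, not part of the described apparatus.

```python
def dominant_topics(distributions):
    """Pick the highest-probability word cluster of each part."""
    return [max(dist, key=dist.get) for dist in distributions]

def topic_changes(distributions):
    """List (part index, previous topic, new topic) wherever the topic changes."""
    tops = dominant_topics(distributions)
    return [(i, tops[i - 1], tops[i])
            for i in range(1, len(tops)) if tops[i] != tops[i - 1]]

# Hypothetical cluster distributions for three consecutive parts.
timeline = [{"trade": 0.7, "sports": 0.3},
            {"trade": 0.6, "sports": 0.4},
            {"trade": 0.2, "sports": 0.8}]
changes = topic_changes(timeline)
```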
[0048]
[Effect of the invention]
As described above, according to the present invention, clusters of semantically related words are extracted from a text and used as the topics of the text. Topics can therefore be extracted more accurately, and in a form easier for the user to understand, than with keywords. This can be seen clearly from the examples of FIGS. 7B and 7C.
Also, by dividing the sentence and showing the topic of each divided part, the contents of the sentence can be displayed in a form that is more easily understood by the user. Using the present invention, the user can see not only important topics but also changes in topics within the text.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of each embodiment of the present invention.
FIG. 2 is an example of word co-occurrence information applied to each embodiment.
FIG. 3 is an example of an input sentence.
FIG. 4 is an example of another input sentence.
FIG. 5 is a configuration diagram of a processing unit according to the first embodiment.
FIG. 6 is a flowchart of the first embodiment.
FIG. 7 is a processing example in the first embodiment, where (A) is an example of a text, (B) shows the word clusters extracted from the text of (A), and (C) shows the keywords extracted from the text of (A).
FIG. 8A is a formula for calculating the tf-idf value of a word, and FIG. 8B shows the words having large tf-idf values in the text of FIG. 7A.
FIG. 9A shows the seed words and co-occurrence information for the text of FIG. 7A, and FIG. 9B shows the seed words and clusters for the text of FIG. 7A.
FIG. 10 is an algorithm for merging word clusters.
FIG. 11 is a flowchart of merging word clusters.
FIG. 12A is a calculation formula of a linear combination model, and FIG. 12B is an example of a linear combination model.
FIG. 13 is a calculation method of a probability distribution in a linear combination model.
FIG. 14 is a configuration diagram of a processing unit according to the second embodiment.
FIG. 15 is a flowchart of processing in the second embodiment.
FIG. 16 is an output example of the second embodiment.
FIG. 17 is an output example of the second embodiment.
FIG. 18 is a configuration diagram of a processing unit according to a third embodiment.
FIG. 19 is a flowchart of the third embodiment.
FIG. 20 is a probability distribution calculation method in a linear combination model.
FIG. 21 is a formula for calculating a distance between linear combination models.
[Explanation of symbols]
B Text processing device
10 Storage unit
20 Input unit
30 Morphological analysis unit
40, 40A, 40B Processing unit
41 Co-occurrence information reference unit
42 Word cluster summary unit
43 Linear combination model creation unit
44 Probability model calculation unit
45 Word rearrangement division unit
46 Probability distribution calculation unit
47 Text division unit
48 Probability distribution calculation unit
50 Display unit

Claims

1. A text processing apparatus comprising:
co-occurrence information storage means for storing co-occurrence information between words in natural-language text;
text input means for inputting natural-language text;
morphological analysis means for dividing the text input from the text input means into words; and
topic extraction means for referring, based on the words divided by the morphological analysis means, to the co-occurrence information stored in the co-occurrence information storage means and extracting the topic of the text,
wherein the topic extraction means includes:
co-occurrence information reference means for receiving the words divided by the morphological analysis means and looking up, in the co-occurrence information stored in the co-occurrence information storage means, words that are likely to co-occur with each of those words, the lookup covering only those likely co-occurring words that appear in the input text;
word cluster collecting means for collecting into word clusters (groups) the words that are likely to co-occur with the divided words and that appear in the input text;
first appearance ratio calculating means for calculating the appearance ratio of individual words within each word cluster collected by the word cluster collecting means; and
second appearance ratio calculating means for calculating the appearance ratio, in the input text as a whole, of the topic composed of a plurality of word clusters collected by the word cluster collecting means.

2. The text processing apparatus according to claim 1, further comprising text dividing means for dividing the text input from the text input means into paragraphs.

3. The text processing apparatus according to claim 1 or 2, further comprising display means for displaying the topic of the text extracted by the topic extraction means.

4. The text processing apparatus according to claim 2, further comprising display means for displaying the topic of the text extracted by the topic extraction means,
wherein the display means displays, in three dimensions, the parts of the text divided by the text dividing means, the word clusters collected by the word cluster collecting means, and the topic appearance ratio calculated by the second appearance ratio calculating means.