JP3775239B2

JP3775239B2 - Text segmentation method and apparatus, text segmentation program, and storage medium storing text segmentation program

Info

Publication number: JP3775239B2
Application number: JP2001146872A
Authority: JP
Inventors: 克人別所
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2001-05-16
Filing date: 2001-05-16
Publication date: 2006-05-17
Anticipated expiration: 2021-05-16
Also published as: JP2002342324A

Description

【０００１】
【発明の属する技術分野】
本発明は、テキスト分割方法及び装置及びテキスト分割プログラム及びテキスト分割プログラムを格納した記憶媒体に係り、特に、テキストを入力とし、当該テキストを意味的なまとまりの単位である意味段落に自動分割するためのテキスト分割方法及び装置及びテキスト分割プログラム及びテキスト分割プログラムを格納した記憶媒体に関する。
【０００２】
【従来の技術】
従来のテキスト分割方法としては、M.A.Hearstによって考案された単位の頻度に基づく単語列の結束度による方法( 参考文献:Hearst, M.A.,: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp.9-16(1994)) がある。
【０００３】
この方法では、まず、テキストを形態素解析して単語に分割する。
【０００４】
次に、図６に示すように、任意の単語境界の前後に、ある個数の単語の集合である単語列（以下の説明では、「窓」と記す）をとり、各窓を構成する単語の頻度ベクトルをとり、前後の窓に対応する頻度ベクトル間の余弦測度を単語列結束度として計算する。各単語境界に対し、この計算を行うことにより、各単語境界に一つの単語列結束度が対応することになる。
【０００５】
単語境界が意味段落境界に近づくにつれ、前後の窓に共通して含まれる単語は一般に少なくなるため、単語列結束度は減少していく。そこで、単語列結束度が極小である単語境界を当該テキストの意味段落の境界と認定する。
【０００６】
ここで、ある単語境界位置をｉ、前の窓をｂl 、後ろの窓をｂr とし、単語ｔのｂl 、ｂr における出現頻度をそれぞれ
【０００７】
【数１】

としたとき、ｉにおける単語列結束度Ｃi は、
【０００８】
【数２】

と表される。
【０００９】
【発明が解決しようとする課題】
テキストの意味段落の中途の単語境界位置で、前後の窓に共通して含まれる単語が少ないことは多い。しかしながら、上記従来のHearstの方法では、単語の頻度ベクトル間の余弦測度を取っているため、そのような単語境界位置における結束度は小さくなり、意味段落の境界と認定されることが多い。このように、上記従来のHearstの方法では、認定した意味段落の境界にノイズとなるものが多く含まれるという問題がある。
本発明は、上記の点に鑑みなされたもので、テキストから正解である意味段落の境界（テキストの隣接単語間境界）のみを過不足なく認定できるようなテキスト分割方法及び装置及びテキスト分割プログラム及びテキスト分割プログラムを格納した記憶媒体を提供することを目的とする。
【００１０】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００１１】
本発明（請求項１）は、テキストを意味的なまとまりの単位である意味段落に分割するテキスト分割方法であって、
形態素解析手段が、テキストを形態素解析して、単語に分割する形態素解析過程（ステップ１）と、
単語ベクトル取得手段が、単語の意味を表現するベクトルが格納されている概念ベースを検索することによって形態素解析過程で得られた各単語に対応する単語ベクトルを取得する単語ベクトル取得過程（ステップ２）と、
単語列結束度算出手段が、
単語の境界の前後に、ある個数の単語の集合である単語列をとり、
各単語列に対し、該単語列を構成する単語の単語ベクトルの和ベクトルまたは重心ベクトルを算出し、
前後の単語列に対応する和ベクトルまたは重心ベクトル間の余弦測度を始めとする類似尺度または距離尺度を単語列結束度として算出する単語列結束度算出過程（ステップ３）と、
意味段落境界認定手段が、単語列結束度が類似尺度である場合は極小である単語境界を、単語列結束度が距離尺度である場合は極大である単語境界を、テキストの意味段落の境界と認定する意味段落境界認定過程（ステップ４）と、からなる。
【００１２】
本発明（請求項２）は、テキストを意味的なまとまりの単位である意味段落に分割するテキスト分割方法であって、
形態素解析手段が、テキストを形態素解析して、単語に分割する形態素解析過程（ステップ１）と、
単語ベクトル取得手段が、単語の意味を表現するベクトルが格納されている概念ベースを検索することによって形態素解析過程で得られた各単語に対応する単語ベクトルを取得する単語ベクトル取得過程（ステップ２）と、
単語列結束度算出手段が、
単語の境界の前後に、ある個数の単語の集合である単語列をとり、
各単語列に対し、該単語列を構成する単語のベクトルの分布から母集団ベクトル分布を推定し、
前後の単語列に対応する母集団ベクトル分布間のカルパック・リーブラー距離を始めとする類似尺度または距離尺度を単語列結束度として算出する単語列結束度算出過程（ステップ３）と、
意味段落境界認定手段が、
単語列結束度が類似尺度である場合は極小である単語境界を、単語列結束度が距離尺度である場合は極大である単語境界を、テキストの意味段落の境界と認定する意味段落境界認定過程（ステップ４）と、からなる。
【００１４】
図２は、本発明の原理構成図である。
【００１５】
本発明（請求項３）は、テキストを意味的なまとまりの単位である意味段落に分割するテキスト分割装置であって、
テキストを形態素解析して、単語に分割する形態素解析手段２０と、
単語の意味を表現するベクトルが格納されている概念ベース６０を検索することによって形態素解析手段２０で得られた各単語に対応する単語ベクトルを取得する単語ベクトル取得手段３０と、
単語の境界の前後に、ある個数の単語の集合である単語列をとり、該各単語列に対し、該単語列を構成する単語の単語ベクトルの和ベクトルまたは重心ベクトルを算出し、前後の単語列に対応する該和ベクトルまたは該重心ベクトル間の余弦測度を始めとする類似尺度または距離尺度を単語列結束度として算出する単語列結束度算出手段４０と、
単語列結束度が類似尺度である場合は極小である単語境界を、単語列結束度が距離尺度である場合は極大である単語境界を、テキストの意味段落の境界と認定する意味段落境界認定手段５０と、を有する。
【００１６】
本発明（請求項４）は、テキストを意味的なまとまりの単位である意味段落に分割するテキスト分割装置であって、
テキストを形態素解析して、単語に分割する形態素解析手段２０と、
単語の意味を表現するベクトルが格納されている概念ベース６０を検索することによって形態素解析手段２０で得られた各単語に対応する単語ベクトルを取得する単語ベクトル取得手段３０と、
単語の境界の前後に、ある個数の単語の集合である単語列をとり、該各単語列に対し、該単語列を構成する単語のベクトルの分布から母集団ベクトル分布を推定し、前後の単語列に対応する該母集団ベクトル分布間のカルパック・リーブラー距離を始めとする類似尺度または距離尺度を単語列結束度として算出する単語列結束度算出手段４０と、
単語列結束度が類似尺度である場合は極小である単語境界を、単語列結束度が距離尺度である場合は極大である単語境界を、テキストの意味段落の境界と認定する意味段落境界認定手段５０と、を有する。
【００１８】
本発明（請求項５）は、コンピュータを、請求項３または４記載のテキスト分割装置として機能させるプログラムである。
【００２１】
本発明（請求項６）は、コンピュータを、請求項３または４記載のテキスト分割装置として機能させるプログラムを格納した記憶媒体である。
【００２３】
上記のように、本発明では、単語の意味を表現するベクトルが格納されている概念ベースを用いる。この概念ベースにおける単語ベクトルは、意味的に類似している単語間ほど距離が近く、意味的に類似していない単語間ほど距離が遠くなるように値が設定されている。正解の意味段落境界の前の窓（直前の単語列）に含まれる単語と後ろの窓（直後の単語列）に含まれる単語とは意味的類似性が低いことにより、そのベクトル間の距離も遠くなるため、単語列の結束度は、類似尺度のとき低くなり、距離尺度のとき高くなる。
【００２４】
また、意味段落の中途の単語境界位置においては、前の窓（直前の単語列）に含まれる単語と後ろの窓（直後の単語列）に含まれる単語とは意味的類似性が高い。前後の窓（直前・直後の単語列）に共通して含まれる単語がない場合でも、同様のことが言える。従って、そのベクトル間の距離も近くなるため、単語列の結束度は、類似尺度のとき高くなり、距離尺度のとき低くなる。
【００２５】
そこで、単語列結束度が類似尺度である場合は極小である単語境界を、距離尺度である場合は極大である単語境界を当該テキストの意味段落の境界と認定することにより、正解である意味段落の境界のみを過不足なく認定できるようになる。
【００２６】
【発明の実施の形態】
図３は、本発明の一実施の形態におけるテキスト分割装置の構成を示す。同図に示すテキスト分割装置は、テキスト入力部１０、形態素解析部２０、単語ベクトル取得部３０、単語列結束度算出部４０、意味段落境界認定部５０、概念ベース６０から構成される。
【００２７】
概念ベース６０は、単語の意味を表現する単語ベクトルが格納されており、当該単語ベクトルは、意味的に類似している単語間程距離が近く、意味的に類似していない単語間ほど距離が遠くなるように値が設定されており、データベースに格納される。
【００２８】
テキスト入力部１０は、処理対象となるテキストを入力する。
【００２９】
形態素解析部２０は、入力されたテキストを形態素解析して単語に分割し、その形態素解析結果を単語ベクトル取得部３０に転送する。
【００３０】
単語ベクトル取得部３０は、概念ベース６０を検索することにより、形態素解析の結果得られた各単語に対応するベクトルを取得する。
【００３１】
単語列結束度算出部４０は、図６に示すように、任意の単語境界の前後に、ある個数の単語の集合である窓（単語列）をとり、各窓を構成する単語のベクトルの情報から、前後の窓の類似尺度または距離尺度である単語列結束度を算出する。各単語境界に対し、この計算を行うことにより、各単語境界に一つの単語列結束度が対応することになる。また、単語列結束度を求める際に、単語列結束度を求める際に、単語列結束度算出部４０は、各窓に対し、当該窓を構成する単語のベクトルの和または重心をとり、単語列結束度として、前後の窓に対応する和または重心ベクトル間の余弦測度を始めとする類似尺度または距離尺度をとる。あるいは、各窓に対し、当該窓を構成する単語のベクトルの分布から母集団分布を推定し、単語列結束度として、前後の窓に対応する母集団分布間のカルバック・リーブラー距離を始めとする類似尺度または距離尺度をとる。
【００３２】
意味段落境界認定部５０は、単語列結束度が類似尺度である場合は極小である単語境界を、距離尺度である場合は極大である単語境界を、当該テキストの意味段落の境界と認定する。
【００３３】
【実施例】
以下、図面と共に本発明の実施例を説明する。
【００３４】
図４は、本発明の一実施例のテキスト分割装置の動作のフローチャートである。
【００３５】
ステップ１０１）形態素解析部２０は、入力テキストを形態素解析して単語に分割する。
【００３６】
ステップ１０２）単語ベクトル取得部３０は、単語の意味を表現するベクトルが格納されている概念ベース６０を検索することによって、ステップ１０１の形態素解析処理により得られた各単語に対応するベクトルを取得する。
【００３７】
ステップ１０３）単語列結束度算出部４０は、前述の図６に示すように、任意の単語境界の前後に、ある個数の単語集合である窓を取り、各窓を構成する単語のベクトルの情報から、前後の窓の類似尺度または距離尺度である単語列結束度を算出する。単語列結束度を算出する単語境界は、１単語の刻み幅でとっていく。各単語境界に対する窓の幅は単語の一定個数分とる。窓の幅をａ個としたとき、テキストの最小のａ単語以内の単語境界の前の窓の幅と最後のａ単語以内の単語境界の後ろの窓の幅はａ個足りないが、ａ個に足りない窓はとれる最大幅をとって単語列結束度を算出する。あるいは、前後の窓の幅が、ａ個とれる単語境界のみ単語列結束度を算出する。
【００３８】
ステップ１０４）意味段落境界設定部５０は、単語列結束度が類似尺度である場合は極小である単語境界を、距離尺度である場合は極大である単語境界を、当該テキストの意味段落の境界と認定する。ここでいう極値とは、テキスト全体における極値である。
【００３９】
次に、概念ベース６０について説明する。
【００４０】
図５は、本発明の一実施例の概念ベースのデータの例を示す。
【００４１】
概念ベース６０は、各単語毎に、ｐ次元のベクトル値が付与されている。概念ベース６０中の単語は、名詞や動詞、形容詞等の自立語である。概念ベース６０における単語ベクトルは、意味的に類似している単語間ほど距離が近く、意味的に類似していない単語間ほど距離が遠くなるように値が設定されている。
【００４２】
概念ベースの例としては、特願平４−２５１５１３の「類似性判別装置」や、特願平６−０９６０１１の「類似性判別利用データ精錬方法及びこの方法を実施する装置」で紹介されているデータベースがある。
【００４３】
また、Deerwesterの論文(Deerwester,S.,Dumais,S.T.,Furnas, G. W.,Landauer,T.K.,and Harshman, R.:Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science,pp.391-407(1990)) では、単語の文書における頻度を記録した単語・文書間の共起行列を特異値分解により次元数を縮退させた行列に変換しているが、この変換後の行列も概念ベースの一例である。Schutze の論文(Schutze,H.:Dimensions of Meaning, Proc. of Supercomputing '92,pp.787-796(1992))では、コーパス中の単語間の共起頻度を記録した単語・単語間の共起行列を特異値分解により次元数を縮退させた行列に変換しているが、この変換後の行列も概念ベースの一例である。
【００４４】
前述のステップ１０２における単語ベクトル取得部３０において、概念ベース６０を検索することによって、ステップ１０１の形態素解析処理で得られた各単語に対応するベクトルを取得する。
【００４５】
次に、上記のステップ１０３における単語列結束度算出部４０の処理について説明する。
【００４６】
単語列結束度算出部４０は、各窓に対し、当該窓を構成する単語のベクトルの和または重心をとり、単語列結束度として、前後の窓に対応する和または重心ベクトル間の余弦測度を始めとする類似尺度または距離尺度をとる。
【００４７】
余弦測度は、類似尺度である。ここで、ある単語境界位置をｉ、前の窓に含まれる単語集合をＬ、後ろの窓に含まれる単語の集合をＲとし、単語ｔに対応する概念ベース６０中のベクトルをν_t としたとき、前後の窓に対応する和ベクトル間の余弦測度Ｃ_i は、以下のように表される。なお、以下の式における“・”は、ベクトル間の内積である。
【００４８】
【数３】

余弦測度は、２つのベクトル間の角度で決まるので、前後の窓に対応する重心ベクトル間の余弦測度は、和ベクトル間の余弦測度と一致する。
【００４９】
また、前後の窓に対応するベクトル間の距離尺度として、ベクトルを分布と見做して、分布間の距離尺度であるカルバック・リーブラー距離をとる方法もある。カルバック・リーブラー距離は、以下のように表される。前の窓に対応する和ベクトルω_L を以下のように成分表示したとする。
【００５０】
【数４】

ここで、ａ _L1＞0(1≦ｉ≦ｐ）と仮定する。
【００５１】
また、あるベクトル値が表現する意味と、そのベクトル値のスカラ倍の値が表現する意味を同一視できるように概念ベースが構成されているとする。このとき、以下のようなω_Lの各成分の和が１となるように正規化したベクトルω_L ’とω_L を同一視できる。
【００５２】
【数５】

後ろの窓に対応する和ベクトルのω_R についても同様に正規化したベクトル
【００５３】
【数６】

を作る。
【００５４】
【数７】

ベクトルω_L’、ω_R’間のカルバック・リーブラ距離として、ＫＬ（ω_R’，ω_L ’）をとってもよい。
【００５５】
上記のカルバック・リーブラ距離は、２つの分布に対して対称ではないので、双方の分布からみたカルバック・リーブラ距離の和であるJeffery 距離を距離尺度としてとる方法もある。Jeffery 距離Ｊ（ω_L’，ω_R’）は、以下のように表される。
【００５６】
【数８】

次に、上記のステップ１０３における単語列結束度算出部４０の処理について説明する。
【００５７】
単語列結束度算出部４０では、各窓に対し、当該窓を構成する単語のベクトルの分布から母集団分布を推定し、単語列結束度として前後の窓に対応する母集団分布間のカルバック・リーブラー距離を始めとする類似尺度または距離尺度をとる。単語ベクトルの次元をｐ次元としたとき、単語ベクトルの集合を、ｐ次元空間上の連続的なある確率分布に従う標本の集合と見て、標本集合から元の確率分布を推定する訳である。
【００５８】
前述したように、カルバック・リーブラ距離は距離尺度である。カルバック・リーブラ距離の算出は、具体的には以下のようにする。
【００５９】
前の窓を構成する単語ベクトルの集合
【００６０】
【数９】

から母集団分布ｆ（ｘ）（ｘ∈Ｒ^p）を推定する。母集団分布の推定には、母集団分布としてパラメトリックな分布をとる方法と、ノンパラメトリックな分布をとる方法がある。パラメトリックな分布の一例としては、正規分布があり、これを決定付けるパラメータは、母平均と母分散共分散行列である。Ｖ _Lから最尤推定等の手法により、これらのパラメータを推定することにより、母集団分布ｆ（ｘ）（ｘ∈Ｒ^p）を推定する。ここで、母平均μは、次のように推定される。
【００６１】
なお、｜Ｌ｜は、Ｌの要素数である。
【００６２】
【数１０】

また、母分散共分散行列Ωは、次のように推定される。（ν _t −μ）は縦ベクトルであり、（ν _t −μ）’は、それを転置した横ベクトルである。
【００６３】
【数１１】

推定したμ、Ωにより、正規分布である母集団分布ｆ（ｘ）（ｘ∈Ｒ^p）は次のように表される。
【００６４】
【数１２】

後ろの窓を構成する単語ベクトルの集合からも同様に母集団分布ｇ（ｘ）（ｘ∈Ｒ^p）を推定する。
【００６５】
確率分布ｆ（ｘ），ｇ（ｘ）間のカルバック・リーブラ距離ＫＬ（ｆ（ｘ），ｇ（ｘ））は、
【００６６】
【数１３】

となる。
【００６７】
確率分布ｆ（ｘ），ｇ（ｘ）間のカルバック・リーブラ距離として、ＫＬ（ｆ（ｘ），ｇ（ｘ））をとってもよい。
【００６８】
上記のカルバック・リーブラ距離は、２つの確率分布に対して対称ではないので、双方の確率分布からみたカルバック・リーブラ距離の和であるJeffery 距離を距離尺度としてとる方法もある。Jeffery 距離Ｊ（ｆ（ｘ），ｇ（ｘ））は、以下のように表される。
【００６９】
【数１４】

実際のカルバック・リーブラ距離や、Jeffery 距離の算出では、積分領域を分割し、各分割領域のある一点に対応する積分関数の数値に基づいて積分値の近似値を求めるといった離散的な数値計算手法をとることができる。
【００７０】
次に、ステップ１０４における意味段落境界認定部５０の処理について説明する。
【００７１】
ステップ１０３において、上記の方法により各単語境界に対応する単語列結束度を計算した後、意味段落境界認定部５０において、単語列結束度が類似尺度である場合は極小である単語境界を、距離尺度である場合は極大である単語境界を、当該テキストの意味段落の境界と認定する。ここでいう極値とは、テキスト全体における極値である。
【００７２】
また、上記の実施例では、図４のフローチャートに基づいて説明したが、図４に示す一連の動作をプログラムとして構築し、概念ベースをテキスト分割装置として利用されるコンピュータのバッファ等に格納し、構築されたプログラムをＣＰＵにインストールして実行したり、ネットワークを介して流通させることも可能である。
【００７３】
また、構築されたプログラムをテキスト分割装置として利用されるコンピュータに接続されるハードディスク装置や、フロッピーディスク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施する際にインストールすることにより、容易に本発明を実現できる。
【００７４】
なお、本発明は、上記の実施例に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【００７５】
【発明の効果】
上述のように、本発明によれば、単語の意味を表現するベクトルの情報から単語列結束度を算出することにより、正解である意味段落の境界のみを過不足なく認定できるようになる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の原理構成図である。
【図３】本発明の一実施の形態におけるテキスト分割装置の構成図である。
【図４】本発明の一実施例のテキスト分割装置の動作のフローチャートである。
【図５】本発明の一実施例の概念ベースのデータの例である。
【図６】単語列結束度算出を説明するための図である。
【符号の説明】
１０テキスト入力部
２０形態素解析手段、形態素解析部
３０単語ベクトル取得手段、単語ベクトル取得部
４０単語列結束度算出手段、単語列結束度算出部
５０隣接単語列認定手段、意味段落境界認定部
６０概念ベース

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a text segmentation method and apparatus, a text segmentation program, and a storage medium storing the text segmentation program. In particular, it relates to a text segmentation method and apparatus, a text segmentation program, and a storage medium storing the program that take a text as input and automatically divide it into semantic paragraphs, which are units of semantically coherent content.
[0002]
[Prior art]
As a conventional text segmentation method, there is a method devised by M. A. Hearst that uses the cohesion of word strings based on word frequencies (Reference: Hearst, M. A.: Multi-Paragraph Segmentation of Expository Text, 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16 (1994)).
[0003]
In this method, the text is first divided into words by morphological analysis.
[0004]
Next, as shown in FIG. 6, a word string consisting of a certain number of words (hereinafter called a "window") is taken on each side of an arbitrary word boundary, the frequency vector of the words constituting each window is formed, and the cosine measure between the frequency vectors of the preceding and following windows is calculated as the word string cohesion. Performing this calculation for every word boundary associates one word string cohesion value with each word boundary.
[0005]
As a word boundary approaches a semantic paragraph boundary, the preceding and following windows generally share fewer words, so the word string cohesion decreases. Therefore, a word boundary at which the word string cohesion is a local minimum is recognized as a semantic paragraph boundary of the text.
[0006]
Here, let i be a word boundary position, let b_l be the preceding window and b_r the following window, and let the frequencies of occurrence of a word t in b_l and b_r be, respectively,
[0007]
[Expression 1]
$$w_{t,b_l}, \quad w_{t,b_r}$$
Then the word string cohesion C_i at i is expressed as
[0008]
[Expression 2]
$$C_i = \frac{\sum_t w_{t,b_l}\, w_{t,b_r}}{\sqrt{\sum_t w_{t,b_l}^2 \, \sum_t w_{t,b_r}^2}}$$
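The conventional frequency-based cohesion described above can be sketched as follows. This is a minimal illustration, not Hearst's implementation: the function name `frequency_cohesion`, the convention that boundary `i` lies between `words[i-1]` and `words[i]`, and the return value of 0.0 for an empty window are all assumptions made here.

```python
from collections import Counter
from math import sqrt

def frequency_cohesion(words, i, a):
    """Cosine between the word-frequency vectors of the two windows at boundary i.

    Boundary i lies between words[i-1] and words[i]; `a` is the window width.
    """
    left = Counter(words[max(0, i - a):i])   # preceding window b_l
    right = Counter(words[i:i + a])          # following window b_r
    # Counter returns 0 for words that do not occur in the other window.
    dot = sum(left[t] * right[t] for t in left)
    norm = sqrt(sum(c * c for c in left.values()) *
                sum(c * c for c in right.values()))
    return dot / norm if norm else 0.0

# Two windows that share no word get cohesion 0, whatever their topics are.
words = "cat sat mat cat dog ran park dog".split()
print(frequency_cohesion(words, 4, 4))
```

Because only identical words contribute to the numerator, the value depends on lexical overlap alone, which is the property the following section identifies as a problem.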
[0009]
[Problems to be solved by the invention]
At a word boundary position in the middle of a semantic paragraph, the preceding and following windows often share few words. However, because the conventional Hearst method takes the cosine measure between word frequency vectors, the cohesion at such a position becomes small, and the position is often recognized as a semantic paragraph boundary. The conventional Hearst method therefore has the problem that the recognized semantic paragraph boundaries include many spurious (noise) boundaries.
The present invention has been made in view of the above points, and its object is to provide a text segmentation method and apparatus, a text segmentation program, and a storage medium storing the text segmentation program that can recognize exactly the correct semantic paragraph boundaries of a text (boundaries between adjacent words of the text), no more and no fewer.
[0010]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0011]
The present invention (Claim 1) is a text segmentation method for dividing a text into semantic paragraphs, which are units of semantically coherent content, comprising:
a morphological analysis process (step 1) in which morphological analysis means morphologically analyzes the text and divides it into words;
a word vector acquisition process (step 2) in which word vector acquisition means acquires the word vector corresponding to each word obtained in the morphological analysis process by searching a concept base in which vectors expressing the meanings of words are stored;
a word string cohesion calculation process (step 3) in which word string cohesion calculation means
takes, before and after a word boundary, a word string that is a set of a certain number of words,
calculates, for each word string, the sum vector or centroid vector of the word vectors of the words constituting the word string, and
calculates, as the word string cohesion, a similarity measure or distance measure, such as the cosine measure, between the sum vectors or centroid vectors corresponding to the preceding and following word strings; and
a semantic paragraph boundary recognition process (step 4) in which semantic paragraph boundary recognition means recognizes, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure.
[0012]
The present invention (Claim 2) is a text segmentation method for dividing a text into semantic paragraphs, which are units of semantically coherent content, comprising:
a morphological analysis process (step 1) in which morphological analysis means morphologically analyzes the text and divides it into words;
a word vector acquisition process (step 2) in which word vector acquisition means acquires the word vector corresponding to each word obtained in the morphological analysis process by searching a concept base in which vectors expressing the meanings of words are stored;
a word string cohesion calculation process (step 3) in which word string cohesion calculation means
takes, before and after a word boundary, a word string that is a set of a certain number of words,
estimates, for each word string, a population vector distribution from the distribution of the vectors of the words constituting the word string, and
calculates, as the word string cohesion, a similarity measure or distance measure, such as the Kullback-Leibler distance, between the population vector distributions corresponding to the preceding and following word strings; and
a semantic paragraph boundary recognition process (step 4) in which semantic paragraph boundary recognition means
recognizes, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure.
[0014]
FIG. 2 is a principle configuration diagram of the present invention.
[0015]
The present invention (Claim 3) is a text segmentation apparatus for dividing a text into semantic paragraphs, which are units of semantically coherent content, comprising:
morphological analysis means 20 for morphologically analyzing the text and dividing it into words;
word vector acquisition means 30 for acquiring the word vector corresponding to each word obtained by the morphological analysis means 20 by searching a concept base 60 in which vectors expressing the meanings of words are stored;
word string cohesion calculation means 40 for taking, before and after a word boundary, a word string that is a set of a certain number of words, calculating for each word string the sum vector or centroid vector of the word vectors of the words constituting the word string, and calculating, as the word string cohesion, a similarity measure or distance measure, such as the cosine measure, between the sum vectors or centroid vectors corresponding to the preceding and following word strings; and
semantic paragraph boundary recognition means 50 for recognizing, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure.
[0016]
The present invention (Claim 4) is a text segmentation apparatus for dividing a text into semantic paragraphs, which are units of semantically coherent content, comprising:
morphological analysis means 20 for morphologically analyzing the text and dividing it into words;
word vector acquisition means 30 for acquiring the word vector corresponding to each word obtained by the morphological analysis means 20 by searching a concept base 60 in which vectors expressing the meanings of words are stored;
word string cohesion calculation means 40 for taking, before and after a word boundary, a word string that is a set of a certain number of words, estimating for each word string a population vector distribution from the distribution of the vectors of the words constituting the word string, and calculating, as the word string cohesion, a similarity measure or distance measure, such as the Kullback-Leibler distance, between the population vector distributions corresponding to the preceding and following word strings; and
semantic paragraph boundary recognition means 50 for recognizing, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure.
[0018]
The present invention (Claim 5) is a program that causes a computer to function as the text segmentation apparatus according to Claim 3 or 4.
[0021]
The present invention (Claim 6) is a storage medium storing a program that causes a computer to function as the text segmentation apparatus according to Claim 3 or 4.
[0023]
As described above, the present invention uses a concept base in which vectors expressing the meanings of words are stored. The word vectors in this concept base are set so that the more semantically similar two words are, the closer their vectors are, and the less similar they are, the farther apart their vectors are. The words in the window before a correct semantic paragraph boundary (the immediately preceding word string) and the words in the window after it (the immediately following word string) have low semantic similarity, so the distance between their vectors is large. The word string cohesion is therefore low when it is a similarity measure and high when it is a distance measure.
[0024]
At a word boundary position in the middle of a semantic paragraph, on the other hand, the words in the preceding window (the immediately preceding word string) and the words in the following window (the immediately following word string) have high semantic similarity. This holds even when the preceding and following windows share no words. The distance between their vectors is therefore small, so the word string cohesion is high when it is a similarity measure and low when it is a distance measure.
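The point that two windows can be semantically close without sharing any word can be illustrated with a toy example. The two-dimensional `CONCEPT_BASE` below is hypothetical, with values invented purely for illustration; a real concept base would have many more dimensions and entries.

```python
from math import sqrt

# Hypothetical concept base: words on the same topic get nearby vectors.
CONCEPT_BASE = {
    "doctor": (0.9, 0.1), "nurse": (0.8, 0.2),
    "hospital": (0.85, 0.15), "clinic": (0.9, 0.2),
    "stock": (0.1, 0.9), "market": (0.2, 0.8),
}

def window_vector(words):
    """Sum of the concept-base vectors of the words in one window."""
    return [sum(d) for d in zip(*(CONCEPT_BASE[w] for w in words))]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

left = window_vector(["doctor", "nurse"])
# No word is shared with `left` in either case, so a frequency-based cosine is 0 for both.
same_topic = cosine(left, window_vector(["hospital", "clinic"]))
topic_shift = cosine(left, window_vector(["stock", "market"]))
print(same_topic > topic_shift)
```

With these invented values the same-topic pair scores close to 1 while the topic-shift pair scores far lower, so only the latter would stand out as a candidate boundary.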
[0025]
Therefore, by recognizing as a semantic paragraph boundary of the text a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure, exactly the correct semantic paragraph boundaries can be recognized, no more and no fewer.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
FIG. 3 shows the configuration of a text segmentation apparatus according to an embodiment of the present invention. The text segmentation apparatus shown in the figure comprises a text input unit 10, a morphological analysis unit 20, a word vector acquisition unit 30, a word string cohesion calculation unit 40, a semantic paragraph boundary recognition unit 50, and a concept base 60.
[0027]
The concept base 60 stores word vectors expressing the meanings of words. The values of the word vectors are set so that the more semantically similar two words are, the closer their vectors are, and the less similar they are, the farther apart their vectors are. The word vectors are stored in a database.
[0028]
The text input unit 10 inputs text to be processed.
[0029]
The morpheme analysis unit 20 morphologically analyzes the input text and divides it into words, and transfers the morpheme analysis result to the word vector acquisition unit 30.
[0030]
The word vector acquisition unit 30 acquires a vector corresponding to each word obtained as a result of morphological analysis by searching the concept base 60.
[0031]
As shown in FIG. 6, the word string cohesion calculation unit 40 takes, before and after an arbitrary word boundary, a window (word string) that is a set of a certain number of words and, from the vectors of the words constituting each window, calculates the word string cohesion, which is a similarity measure or distance measure between the preceding and following windows. Performing this calculation for every word boundary associates one word string cohesion value with each word boundary. To obtain the word string cohesion, the word string cohesion calculation unit 40 takes, for each window, the sum or centroid of the vectors of the words constituting the window and uses as the cohesion a similarity measure or distance measure, such as the cosine measure, between the sum or centroid vectors of the preceding and following windows. Alternatively, it estimates, for each window, a population distribution from the distribution of the vectors of the words constituting the window and uses as the cohesion a similarity measure or distance measure, such as the Kullback-Leibler distance, between the population distributions of the preceding and following windows.
[0032]
The semantic paragraph boundary recognition unit 50 recognizes, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure.
[0033]
[Example]
Embodiments of the present invention will be described below with reference to the drawings.
[0034]
FIG. 4 is a flowchart of the operation of the text segmentation apparatus according to the embodiment of the present invention.
[0035]
Step 101) The morpheme analysis unit 20 performs morphological analysis on the input text and divides it into words.
[0036]
Step 102) The word vector acquisition unit 30 acquires the vector corresponding to each word obtained by the morphological analysis in step 101 by searching the concept base 60, in which vectors expressing the meanings of words are stored.
[0037]
Step 103) As shown in FIG. 6 described above, the word string cohesion calculation unit 40 takes, before and after an arbitrary word boundary, a window that is a set of a certain number of words and, from the vectors of the words constituting each window, calculates the word string cohesion, which is a similarity measure or distance measure between the preceding and following windows. The word boundaries at which the cohesion is calculated are taken in steps of one word. The window width for each word boundary is a fixed number of words. When the window width is a words, the preceding window of a word boundary within the first a words of the text and the following window of a word boundary within the last a words of the text are shorter than a words; for such windows, the maximum available width is used to calculate the cohesion. Alternatively, the cohesion is calculated only at word boundaries for which both the preceding and following windows can be a words wide.
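The two ways of handling the ends of the text in step 103 can be sketched as follows. The generator name `windows`, the `truncate` flag, and the convention that boundary `i` lies between `words[i-1]` and `words[i]` are assumptions made for this illustration.

```python
def windows(words, a, truncate=True):
    """Yield (i, left, right) for each word boundary i, in steps of one word.

    Boundary i lies between words[i-1] and words[i]. With truncate=True, a
    window near either end of the text is shortened to the widest span
    available. With truncate=False, only boundaries that have two full
    windows of width `a` are produced.
    """
    for i in range(1, len(words)):
        left = words[max(0, i - a):i]
        right = words[i:i + a]
        if truncate or (len(left) == a and len(right) == a):
            yield i, left, right

words = list("abcdef")
print([i for i, _, _ in windows(words, 2)])                  # every boundary
print([i for i, _, _ in windows(words, 2, truncate=False)])  # full windows only
```

The first policy scores every boundary but compares windows of unequal size near the ends; the second scores fewer boundaries but always compares windows of equal width.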
[0038]
Step 104) The semantic paragraph boundary recognition unit 50 recognizes, as a semantic paragraph boundary of the text, a word boundary at which the word string cohesion is a local minimum when the cohesion is a similarity measure, or a local maximum when it is a distance measure. The extrema referred to here are extrema taken over the entire text.
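Step 104 can be sketched as a search for extrema in the sequence of cohesion values. The source does not say how plateaus or the first and last values are treated; the sketch below assumes strict comparison against both neighbors and ignores the two end positions. The function name `extremum_positions` is hypothetical.

```python
def extremum_positions(cohesion, similarity=True):
    """Positions in `cohesion` that are strict local minima (similarity
    measure) or strict local maxima (distance measure)."""
    sign = 1 if similarity else -1          # flip the sign so maxima become minima
    vals = [sign * c for c in cohesion]
    return [k for k in range(1, len(vals) - 1)
            if vals[k] < vals[k - 1] and vals[k] < vals[k + 1]]

print(extremum_positions([0.9, 0.4, 0.8, 0.7, 0.2, 0.6]))
print(extremum_positions([0.1, 0.6, 0.2, 0.3, 0.8, 0.4], similarity=False))
```

The returned values are positions in the cohesion list and still have to be mapped back to the word boundaries at which each value was computed.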
[0039]
Next, the concept base 60 will be described.
[0040]
FIG. 5 shows an example of concept-based data according to one embodiment of the present invention.
[0041]
In the concept base 60, a p-dimensional vector value is assigned to each word. The words in the concept base 60 are independent words such as nouns, verbs, and adjectives. The word vectors in the concept base 60 are set such that the distance between words that are semantically similar is closer, and the distance between words that are not semantically similar is longer.
[0042]
Examples of concept bases include the databases introduced in Japanese Patent Application No. 4-251513, "Similarity Discriminating Device", and in Japanese Patent Application No. 6-096011, "Similarity Discriminating Utilization Data Refinement Method and Device for Implementing this Method".
[0043]
In addition, in the paper by Deerwester et al. (Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R.: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, pp. 391-407 (1990)), a word-document co-occurrence matrix recording the frequencies of words in documents is converted by singular value decomposition into a matrix of reduced dimensionality; this converted matrix is also an example of a concept base. In the paper by Schutze (Schutze, H.: Dimensions of Meaning, Proc. of Supercomputing '92, pp. 787-796 (1992)), a word-word co-occurrence matrix recording the co-occurrence frequencies of words in a corpus is converted by singular value decomposition into a matrix of reduced dimensionality; this converted matrix is also an example of a concept base.
[0044]
The word vector acquisition unit 30 in step 102 described above searches the concept base 60 to acquire a vector corresponding to each word obtained in the morphological analysis process in step 101.
[0045]
Next, the processing of the word string cohesion degree calculation unit 40 in step 103 will be described.
[0046]
For each window, the word string cohesion calculation unit 40 takes the sum or centroid of the vectors of the words constituting the window and, as the word string cohesion, takes a similarity measure or distance measure, such as the cosine measure, between the sum or centroid vectors corresponding to the preceding and following windows.
[0047]
The cosine measure is a similarity measure. Here, let i be a word boundary position, let L be the set of words in the preceding window and R the set of words in the following window, and let ν_t be the vector in the concept base 60 corresponding to a word t. Then the cosine measure C_i between the sum vectors corresponding to the preceding and following windows is expressed as follows, where "·" denotes the inner product of vectors.
[0048]
[Equation 3]
$$C_i = \frac{\left(\sum_{t \in L} \nu_t\right) \cdot \left(\sum_{t \in R} \nu_t\right)}{\left\|\sum_{t \in L} \nu_t\right\| \, \left\|\sum_{t \in R} \nu_t\right\|}$$
Since the cosine measure is determined by the angle between two vectors, the cosine measure between the centroid vectors of the preceding and following windows coincides with the cosine measure between the sum vectors.
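The equivalence of sum vectors and centroid vectors under the cosine measure can be checked numerically. The vectors below are arbitrary illustrative values, and the helper names are hypothetical.

```python
from math import isclose, sqrt

def vec_sum(vectors):
    return [sum(c) for c in zip(*vectors)]

def centroid(vectors):
    # The centroid is the sum vector scaled by 1 / (number of vectors).
    return [s / len(vectors) for s in vec_sum(vectors)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

L_vecs = [(1.0, 2.0, 0.5), (0.0, 1.0, 1.5), (2.0, 0.0, 1.0)]  # preceding window
R_vecs = [(0.5, 0.5, 2.0), (1.0, 3.0, 0.0)]                   # following window

c_sum = cosine(vec_sum(L_vecs), vec_sum(R_vecs))
c_cen = cosine(centroid(L_vecs), centroid(R_vecs))
print(isclose(c_sum, c_cen))
```

Scaling either vector by a positive constant leaves the angle unchanged, so the two windows may even contain different numbers of words, as they do here. For measures that depend on vector length, such as Euclidean distance, the sum and the centroid generally give different values.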
[0049]
As a distance measure between the vectors corresponding to the preceding and following windows, there is also a method that regards each vector as a distribution and takes the Kullback-Leibler distance, which is a distance measure between distributions. The Kullback-Leibler distance is expressed as follows. Suppose the sum vector ω_L corresponding to the preceding window is written in components as follows.
[0050]
[Expression 4]
$$\omega_L = (a_{L1}, a_{L2}, \ldots, a_{Lp})$$
Here, it is assumed that a_{Li} > 0 (1 ≤ i ≤ p).
[0051]
It is further assumed that the concept base is constructed so that the meaning expressed by a vector value can be identified with the meaning expressed by any scalar multiple of that value. Then ω_L can be identified with the following vector ω_L′, obtained by normalizing ω_L so that its components sum to 1.
[0052]
[Equation 5]
$$\omega_L' = (a'_{L1}, a'_{L2}, \ldots, a'_{Lp}), \qquad a'_{Li} = \frac{a_{Li}}{\sum_{j=1}^{p} a_{Lj}}$$
For the sum vector ω_R corresponding to the following window, a normalized vector
[0053]
[Formula 6]
$$\omega_R' = (a'_{R1}, a'_{R2}, \ldots, a'_{Rp}), \qquad a'_{Ri} = \frac{a_{Ri}}{\sum_{j=1}^{p} a_{Rj}}$$
is formed in the same way.
[0054]
[Expression 7]
$$KL(\omega_L', \omega_R') = \sum_{i=1}^{p} a'_{Li} \log \frac{a'_{Li}}{a'_{Ri}}$$
KL(ω_R′, ω_L′) may also be taken as the Kullback-Leibler distance between the vectors ω_L′ and ω_R′.
[0055]
Since the Kullback-Leibler distance above is not symmetric in the two distributions, there is also a method that takes as the distance measure the Jeffreys distance, which is the sum of the Kullback-Leibler distances seen from both distributions. The Jeffreys distance J(ω_L′, ω_R′) is expressed as follows.
[0056]
[Equation 8]
$$J(\omega_L', \omega_R') = KL(\omega_L', \omega_R') + KL(\omega_R', \omega_L')$$
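Treating normalized sum vectors as discrete distributions, the Kullback-Leibler distance and its symmetric sum can be sketched as follows. The sketch assumes all components are strictly positive, as the text requires; the function names and example vectors are hypothetical.

```python
from math import isclose, log

def normalize(v):
    """Scale a vector with positive components so that they sum to 1."""
    total = sum(v)
    return [x / total for x in v]

def kl(p, q):
    """Kullback-Leibler distance between two discrete distributions."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

def jeffreys(p, q):
    """Sum of the Kullback-Leibler distances taken in both directions."""
    return kl(p, q) + kl(q, p)

w_L = normalize([8.0, 1.0, 1.0])   # normalized sum vector of the preceding window
w_R = normalize([1.0, 1.0, 1.0])   # normalized sum vector of the following window

print(kl(w_L, w_R), kl(w_R, w_L))  # the two directions differ
print(isclose(jeffreys(w_L, w_R), jeffreys(w_R, w_L)))
```

The symmetric sum removes the dependence on which window is taken as the reference, so the cohesion does not change if the roles of the preceding and following windows are swapped.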

Next, the processing of the word string cohesion calculation unit 40 in step 103 that uses population distributions is described.
[0057]
For each window, the word string cohesion calculation unit 40 estimates a population distribution from the distribution of the vectors of the words constituting the window and, as the word string cohesion, takes a similarity measure or distance measure, such as the Kullback-Leibler distance, between the population distributions corresponding to the preceding and following windows. That is, when the word vectors are p-dimensional, the set of word vectors is regarded as a set of samples drawn from some continuous probability distribution on p-dimensional space, and the original probability distribution is estimated from this sample set.
[0058]
As described above, the Kullback-Leibler distance is a distance measure. Specifically, it is calculated as follows.
[0059]
Let the set of word vectors constituting the preceding window be
[0060]
[Equation 9]
$$V_L = \{\nu_t \mid t \in L\}$$

The population distribution f(x) (x ∈ R^p) is estimated from this set. A population distribution can be estimated either by assuming a parametric distribution or by assuming a nonparametric distribution. One example of a parametric distribution is the normal distribution, whose determining parameters are the population mean and the population variance-covariance matrix. By estimating these parameters from V_L with a method such as maximum likelihood estimation, the population distribution f(x) (x ∈ R^p) is estimated. Here, the population mean μ is estimated as follows.
[0061]
Note that |L| is the number of elements of L.
[0062]
[Expression 10]
$$\mu = \frac{1}{|L|} \sum_{t \in L} \nu_t$$

The population variance-covariance matrix Ω is estimated as follows, where (ν_t − μ) is a column vector and (ν_t − μ)′ is the row vector obtained by transposing it.
[0063]
[Expression 11]
$$\Omega = \frac{1}{|L|} \sum_{t \in L} (\nu_t - \mu)(\nu_t - \mu)'$$

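With the estimated μ and Ω, the population distribution f(x) (x ∈ R^p), which is a normal distribution, is expressed as follows.
[0064]
[Expression 12]
$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Omega|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)'\,\Omega^{-1}(x - \mu)\right)$$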

Similarly, a population distribution g(x) (x ∈ R^p) is estimated from the set of word vectors constituting the following window.
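The maximum likelihood estimates of the mean and the variance-covariance matrix for one window can be sketched in plain Python. The function names and the example vectors are hypothetical. Note that the maximum likelihood covariance divides by |L| rather than |L| − 1, and that the estimate is singular when a window holds no more vectors than there are dimensions, a case the source does not discuss.

```python
from math import isclose

def mean_vector(vectors):
    """Component-wise mean of the word vectors in one window."""
    n = len(vectors)
    return [sum(c) / n for c in zip(*vectors)]

def covariance_matrix(vectors):
    """Maximum likelihood variance-covariance matrix (divides by |L|)."""
    n = len(vectors)
    mu = mean_vector(vectors)
    p = len(mu)
    return [[sum((v[r] - mu[r]) * (v[c] - mu[c]) for v in vectors) / n
             for c in range(p)]
            for r in range(p)]

V_L = [(1.0, 2.0), (3.0, 6.0), (5.0, 4.0)]  # word vectors of the preceding window
mu = mean_vector(V_L)
omega = covariance_matrix(V_L)
print(mu)
print(omega)
```

The same two functions applied to the following window give the parameters of g(x).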
[0065]
The Kullback-Leibler distance KL(f(x), g(x)) between the probability distributions f(x) and g(x) is
[0066]
[Formula 13]
$$KL(f(x), g(x)) = \int_{R^p} f(x) \log \frac{f(x)}{g(x)} \, dx$$
[0067]
KL(g(x), f(x)) may also be taken as the Kullback-Leibler distance between the probability distributions f(x) and g(x).
[0068]
Since the Kullback-Leibler distance described above is not symmetric in the two probability distributions, the Jeffreys distance, which is the sum of the Kullback-Leibler distances taken in both directions, may instead be used as the distance measure. The Jeffreys distance J(f(x), g(x)) is expressed as follows.
[0069]
[Equation 14]
$$J(f(x), g(x)) = KL(f(x), g(x)) + KL(g(x), f(x))$$
In actually calculating the Kullback-Leibler distance and the Jeffreys distance, a discrete numerical method can be used: the integration region is divided into subregions, and the integral is approximated from the value of the integrand at a representative point in each subregion.
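The discrete numerical calculation described above can be sketched for the one-dimensional case (p = 1) as follows. This is an illustration under assumptions, not the method of the invention: the midpoint rule, the integration range, the number of cells, and all function names are choices made here for the example.

```python
import math

def normal_pdf_1d(x, mu, var):
    """Density of the one-dimensional normal distribution N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def kl_numeric(f, g, lo, hi, n=20000):
    """Midpoint-rule approximation of the integral of f*log(f/g) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h           # representative point of cell k
        fx, gx = f(x), g(x)
        if fx > 0.0 and gx > 0.0:        # skip cells where a density underflows to 0
            total += fx * math.log(fx / gx) * h
    return total

def jeffreys_numeric(f, g, lo, hi, n=20000):
    """Sum of the two directed Kullback-Leibler approximations."""
    return kl_numeric(f, g, lo, hi, n) + kl_numeric(g, f, lo, hi, n)

# Hypothetical densities: unit-variance normals with means 0 and 1.
f = lambda x: normal_pdf_1d(x, 0.0, 1.0)
g = lambda x: normal_pdf_1d(x, 1.0, 1.0)
kl_fg = kl_numeric(f, g, -10.0, 11.0)
j_fg = jeffreys_numeric(f, g, -10.0, 11.0)
```

The same cell-by-cell idea extends to p dimensions, but the number of cells grows rapidly with p, so the grid resolution has to be chosen with care.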
[0070]
Next, the process of the semantic paragraph boundary recognition unit 50 in step 104 will be described.
[0071]
After the word string cohesion corresponding to each word boundary has been calculated in step 103 by the above method, the semantic paragraph boundary recognition unit 50 recognizes, as the semantic paragraph boundaries of the text, the word boundaries at which the word string cohesion is a local minimum if the cohesion is a similarity measure, or a local maximum if it is a distance measure. The extrema referred to here are extrema over the entire text.
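The boundary recognition rule above can be sketched as a scan for local extrema over the sequence of cohesion values. This is a minimal sketch under assumptions: the function name `find_boundaries`, the use of strict comparisons with immediate neighbours, the exclusion of the first and last positions, and the sample scores are all hypothetical choices not specified in the text.

```python
def find_boundaries(cohesion, is_similarity=True):
    """Return indices of word boundaries whose cohesion is a strict local extremum.

    For a similarity measure, local minima are returned; for a distance
    measure, local maxima. Endpoints are never returned because they have
    only one neighbour.
    """
    boundaries = []
    for i in range(1, len(cohesion) - 1):
        prev, cur, nxt = cohesion[i - 1], cohesion[i], cohesion[i + 1]
        if is_similarity:
            if cur < prev and cur < nxt:      # local minimum of similarity
                boundaries.append(i)
        elif cur > prev and cur > nxt:        # local maximum of distance
            boundaries.append(i)
    return boundaries

# Hypothetical cohesion values, one per word boundary.
scores = [0.9, 0.7, 0.4, 0.6, 0.8, 0.5, 0.3, 0.35, 0.9]
as_similarity = find_boundaries(scores, is_similarity=True)    # [2, 6]
as_distance = find_boundaries(scores, is_similarity=False)     # [4]
```

Plateaus of equal values are ignored by the strict comparisons; a real system would need an explicit policy for ties.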
[0072]
Although the above embodiment has been described with reference to the flowchart of FIG. 4, the series of operations shown in FIG. 4 may also be constructed as a program. With the concept base stored in a buffer or the like of the computer used as the text segmentation apparatus, the constructed program can be installed and executed by the CPU, or distributed via a network.
[0073]
The constructed program may also be stored on a hard disk device connected to the computer used as the text segmentation apparatus, or on a portable storage medium such as a floppy disk or CD-ROM, and installed when the present invention is carried out, so that the present invention can be readily realized.
[0074]
The present invention is not limited to the above-described embodiments, and various modifications and applications are possible within the scope of the claims.
[0075]
[Effects of the Invention]
As described above, according to the present invention, calculating the word string cohesion from vector information expressing the meanings of words makes it possible to recognize exactly the correct semantic paragraph boundaries, with neither excess nor omission.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of a text segmentation apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the text segmentation apparatus according to the embodiment of the present invention.
FIG. 5 is an example of concept base data according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining calculation of word string cohesion.
[Explanation of symbols]
10 Text input unit
20 Morphological analysis means, morphological analysis unit
30 Word vector acquisition means, word vector acquisition unit
40 Word string cohesion calculating means, word string cohesion calculating unit
50 Semantic paragraph boundary recognition means, semantic paragraph boundary recognition unit
60 Concept base

Claims

A text segmentation method for dividing text into semantic paragraphs, which are units of semantic coherence, the method comprising:
a morphological analysis step in which morphological analysis means morphologically analyzes the text and divides it into words;
a word vector acquisition step in which word vector acquisition means acquires a word vector corresponding to each word obtained in the morphological analysis step by searching a concept base in which vectors expressing the meanings of words are stored;
a word string cohesion calculation step in which word string cohesion calculating means
takes, before and after a word boundary, word strings each consisting of a certain number of words,
calculates, for each word string, the sum vector or centroid vector of the word vectors of the words constituting the word string, and
calculates, as the word string cohesion, a similarity measure or a distance measure, such as the cosine measure, between the sum vectors or centroid vectors corresponding to the preceding and following word strings; and
a semantic paragraph boundary recognition step in which semantic paragraph boundary recognition means recognizes, as a semantic paragraph boundary, a word boundary at which the word string cohesion is a local minimum when the word string cohesion is a similarity measure, or a local maximum when the word string cohesion is a distance measure.

A text segmentation method for dividing text into semantic paragraphs, which are units of semantic coherence, the method comprising:
a morphological analysis step in which morphological analysis means morphologically analyzes the text and divides it into words;
a word vector acquisition step in which word vector acquisition means acquires a word vector corresponding to each word obtained in the morphological analysis step by searching a concept base in which vectors expressing the meanings of words are stored;
a word string cohesion calculation step in which word string cohesion calculating means
takes, before and after a word boundary, word strings each consisting of a certain number of words,
estimates, for each word string, a population vector distribution from the distribution of the word vectors of the words constituting the word string, and
calculates, as the word string cohesion, a similarity measure or a distance measure, such as the Kullback-Leibler distance, between the population vector distributions corresponding to the preceding and following word strings; and
a semantic paragraph boundary recognition step in which semantic paragraph boundary recognition means recognizes, as a semantic paragraph boundary, a word boundary at which the word string cohesion is a local minimum when the word string cohesion is a similarity measure, or a local maximum when the word string cohesion is a distance measure.

A text segmentation device for dividing text into semantic paragraphs, which are units of semantic coherence, the device comprising:
morphological analysis means for morphologically analyzing the text and dividing it into words;
a concept base that stores vectors expressing the meanings of words;
word vector acquisition means for acquiring a word vector corresponding to each word obtained by the morphological analysis means by searching the concept base;
word string cohesion calculating means for taking, before and after a word boundary, word strings each consisting of a certain number of words, calculating for each word string the sum vector or centroid vector of the word vectors of the words constituting the word string, and calculating, as the word string cohesion, a similarity measure or a distance measure, such as the cosine measure, between the sum vectors or centroid vectors corresponding to the preceding and following word strings; and
semantic paragraph boundary recognition means for recognizing, as a semantic paragraph boundary, a word boundary at which the word string cohesion is a local minimum when the word string cohesion is a similarity measure, or a local maximum when the word string cohesion is a distance measure.

A text segmentation device for dividing text into semantic paragraphs, which are units of semantic coherence, the device comprising:
morphological analysis means for morphologically analyzing the text and dividing it into words;
a concept base that stores vectors expressing the meanings of words;
word vector acquisition means for acquiring a word vector corresponding to each word obtained by the morphological analysis means by searching the concept base;
word string cohesion calculating means for taking, before and after a word boundary, word strings each consisting of a certain number of words, estimating for each word string a population vector distribution from the distribution of the word vectors of the words constituting the word string, and calculating, as the word string cohesion, a similarity measure or a distance measure, such as the Kullback-Leibler distance, between the population vector distributions corresponding to the preceding and following word strings; and
semantic paragraph boundary recognition means for recognizing, as a semantic paragraph boundary, a word boundary at which the word string cohesion is a local minimum when the word string cohesion is a similarity measure, or a local maximum when the word string cohesion is a distance measure.

A text segmentation program for causing a computer to function as the text segmentation device according to claim 3.

A storage medium storing a text segmentation program, wherein the stored program causes a computer to function as the text segmentation device according to claim 3.