JP3577972B2

JP3577972B2 - Similarity determination method, document search device, document classification device, storage medium storing document search program, and storage medium storing document classification program

Info

Publication number: JP3577972B2
Application number: JP29732198A
Authority: JP
Inventors: 準二富田; 浩竹野
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1998-10-19
Filing date: 1998-10-19
Publication date: 2004-10-20
Anticipated expiration: 2018-10-19
Also published as: JP2000123041A

Description

【０００１】
【発明の属する技術分野】
本発明は、類似度判定方法及び文書検索装置及び文書分類装置及び文書検索プログラムを格納した記憶媒体及び文書分類プログラムを格納した記憶媒体に係り、特に、単語列、または、単語列のブール演算子結合、または、文書、または、文書集合からなる文書要素間の類似度を適切に判定するための類似度判定方法及び文書検索装置及び文書分類装置及び文書検索プログラムを格納した記憶媒体及び文書分類プログラムを格納した記憶媒体に関する。
【０００２】
【従来の技術】
従来の技術による文書検索装置は、以下の方法で検索キーと検索対象文書との類似度を判定することによって、検索結果を決定するものである。ここで、検索キーは、単語列、または、単語列のブール演算子結合だけでなく、文書、または、文書集合などの場合もある。
【０００３】
まず、形態素解析と呼ばれる技術を用いて検索対象のそれぞれの文書に使用されている単語を抽出し、出現頻度情報などに基づき、それぞれの単語にその単語の主題（内容）との関連の強さを表す重要度を付与する。同様に、ユーザの入力した検索キー内の単語にも重要度を付与し、それぞれの単語がどの程度の重要度で、それぞれの文書に含まれているのか調べ、検索キー内のブール演算子を適切に処理して類似度を計算する。最後に検索対象の文書を、このようにして求められた類似度の降順にソートして、上位ｎ件を検索結果とするものである。なお、ここで、ｎは、正の定数である。
【０００４】
例えば、「情報または、文書（特に文書）の検索」に関する文書を検索したいとする。ユーザは、検索キーとして、
（（情報⁻０．５ｏｒ文書⁻０．８）ａｎｄ検索⁻１．０）
と指定する。文書検索装置は、まず、検索キー内での単語の重要度（それぞれ（情報，０．５）（文書，０．８）（検索，１．０）を求める（この例の場合、検索キー内に重要度が明示的に記述されているが、単語の出現頻度情報などからこれらの重要度を自動的に決定する場合もある）。
【０００５】
次に、検索キーに使用されている単語「情報」「文書」「検索」の検索対象のそれぞれの文書内での重要度を出現頻度などを基にして求める。これらが次の値であったとする。

検索キー内での単語の重要度と、検索対象文書内でのその単語の重要度の積を計算し、検索キー内で、ｏｒが使われた場合は、その両側の値を足すこととし、ａｎｄが使われた場合は、その両側の値の小さい方を取ることとする。
【０００６】
この方法で、検索キーとそれぞれの文書の類似度は以下のように求めることができる。
文書ａの類似度＝ｍｉｎ（（０．５＊０．４＋０．８＊０．６），１．０＊０．９）＝ｍｉｎ（０．６８，０．９）＝０．６８
文書ｂの類似度＝ｍｉｎ（（０．５＊０．４＋０．８＊０．１），１．０＊０．０）＝ｍｉｎ（０．２６，０．０）＝０．０
文書ｃの類似度＝ｍｉｎ（（０．５＊０．３＋０．８＊０．８），１．０＊１．０）＝ｍｉｎ（０．７９，１．０）＝０．７９
ここで、ｍｉｎ（ｘ，ｙ）は、ｘ，ｙの小さい方の値を返す。
【０００７】
このようにして、検索キーと検索対象文書との類似度を計算し、この値の降順に検索対象文書をソートして、上位ｎ（＝２）件を検索結果とする。従って、この場合の検索結果は、
文書ｃ（類似度０．７９）
文書ａ（類似度０．６８）
となる。
【０００８】
また、従来の文書分類装置は、分類対象となる文書集合内のすべての２つの文書の組み合わせについて、それら文書間の類似度を判定することによって、文書を分類するものである。
まず、分類対象となる文書集合のそれぞれの文書から単語を抽出し、それらの単語に適切な重要度を付与する。次に、この重要度を基に、文書検索装置で述べた方法と同様の方法で分類対象となる文書集合内の全ての２つの文書の組み合わせについてそれら文書間の類似度を判定する。次にこの文書間の類似度に基づき、類似度の大きい文書同士を順次結合していくことによって、文書を分類する。この手法は、クラスタリングと呼ばれている。
【０００９】
【発明が解決しようとする課題】
より精度の高い文書検索装置及び文書分類装置を構成するためには、文書要素間の類似度を適切に判定する必要がある。ここで、文書要素とは、単語列、または、単語のブール演算子結合、または、文書、または、文書集合である。しかしながら、従来の類似度判定方法には、以下のような問題がある。
【００１０】
１．複数の主題や副題を持つ文書要素間の類似度を精度良く判定できない：
文書要素が単語列や要約などの場合、その文書要素は１つの主題を持っていると考えられるが、一般に文書全文を対象とするとその文書は複数の主題や副題を持つものとなる。そのため、このような文書全文を対象とすると類似度が適切に計算されない。
【００１１】
例えば、文書検索作業において、ユーザが「情報検索を行うロボット」に関する文書を検索したい場合に、「情報検索ａｎｄロボット」と検索キーを指定したとする。しかし、この検索キーでは、「情報検索システム」と「産業用ロボット」という２つの主題を持つ文書にまで高い類似度を与えてしまう。このように、文書要素が複数の主題や副題を持つ場合に、類似度を精度良く判定できないという問題がある。
【００１２】
２．文書要素の持つ特徴を利用した類似度の判定ができない：
文内に使用されている単語間には、係り受け関係などの特徴がある。また、文書には、パラグラフなどの特徴がある。しかしながら、従来の類似度判定方法では、単語を抽出し、それらの単語に重要度を付与し、それらを基に類似度を判定するだけなので、これらの特徴を利用することができず、類似度を精度良く判定できないという問題がある。
【００１３】
３．形態素解析の不完全性：
単語を文書から抽出する際に用いられる形態素解析では、どの部分文字列が単語となるかを認識する必要があり、そのために、単語を予め辞書に登録しておく必要がある。しかしながら、情報の更新速度が速い場合には、全ての単語を予め辞書に登録しておくことは不可能であり、このような情報を対象とした場合、単語の抽出を行う際の解析の失敗は避けられない。例えば、辞書に、「インター」と「ネット」という単語だけしか登録されていない場合、「インターネット」という単語は抽出されず、この単語は、「インター」と「ネット」という２つの単語として抽出されてしまう。このように、単語の抽出の失敗が起こるために、類似度を精度良く判定できないという問題がある。
【００１４】
本発明は、上記の点に鑑みなされたもので、複数の主題や副題を持つ文書要素間の類似度を精度良く判定し、文書要素の持つ特徴を利用した類似度の判定を可能とし、形態素解析の不完全性を解決して文書要素間の類似度を精度良く判定することが可能な類似度判定方法及び文書検索装置及び文書分類装置及び文書検索プログラムを格納した記憶媒体及び文書分類プログラムを格納した記憶媒体を提供することを目的とする。
【００１５】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
本発明（請求項１）は、文書要素間の類似度を適切に判定するための類似度判定方法において、
主題グラフ作成手段において、単語列または、単語列のブール演算子結合または、文または、文書または、文書集合で構成される文書要素が複数入力されると、各文書要素内で使用されている単語を抽出し（ステップ１）、
抽出されたそれぞれの単語に重要度を付与し（ステップ２）、
単語の重要度をノードの重みとし、該単語間の関連度をリンクの重みとし（ステップ３）、それぞれの文書要素の主題グラフを生成し（ステップ４）、
類似度判定手段において、主題グラフ間の一致の度合に基づき、文書要素間の類似度を求める（ステップ５）。
【００１６】
本発明（請求項２）は、単語を抽出する際に、文書要素を形態素解析する。
本発明（請求項３）は、類似度判定手段において、主題グラフ間の一致の度合を計算する際に、
両方の主題グラフの同様のノード（同じ単語を含んでいるノード）の個数が多ければ多い程、該主題グラフ間の一致の度合を大きな値とし、
片方の主題グラフ内にあるノードに大きな重みが付いていた場合は、もう片方の主題グラフ内の同様のノードに大きな重みが付いていればいる程、両方の主題グラフ間の一致の度合を大きな値とし、
両方の主題グラフの同様のリンク（リンクの両端のノードに含まれる単語が同じであるリンク）の本数が多ければ多い程、該両方の主題グラフ間の一致の度合を大きな値とし、
片方の主題グラフ内にあるリンクに大きな重みが付いていた場合は、もう片方の主題グラフ内の同様のリンクに大きな重みが付いていればいる程、両方の主題グラフ間の一致の度合を大きな値にするように、主題グラフ間の一致の度合を計算する。
【００１７】
本発明（請求項４）は、類似度判定手段において、主題グラフ間の一致の度合を計算する際に、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連しあっているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
部分グラフを再結合し、
部分グラフに生成したリンクをそのまま追加して、分割前の主題グラフに戻して主題グラフ間の一致の度合を計算する。
【００１８】
本発明（請求項５）は、類似度判定手段において、主題グラフ間の一致の度合を計算する際に、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ間の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算する。
【００１９】
本発明（請求項６）は、類似度判定手段において、主題グラフ間の一致の度合を計算する際に、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、主題グラフ間の一致の度合を計算する。
【００２０】
図２は、本発明の文書検索装置の原理構成図である。
本発明（請求項７）は、ユーザからの検索要求に基づいて文書を検索するための文書検索装置であって、
ユーザからの検索要求を解析し、検索キーを取り出す検索インタフェース手段４１０と、
検索キーの単語の重要度及び単語間の関連度を用いて検索キーの主題グラフを生成する検索キー主題グラフ作成手段４２０と、
指定された単語が出現する文書の文書ＩＤの集合を取得する単語情報管理手段４３０と、
文書ＩＤが指定されると、該文書ＩＤに対応した文書を検索対象文書記憶手段から取得し、該文書内の単語の重要度と単語間の関連度を用いて文書の主題グラフを作成する検索対象文書主題グラフ作成手段４４０と、
検索キーの主題グラフと文書の主題グラフを入力し、それらがどの程度似ているのかを判断する類似度判定手段４５０と、
検索インタフェース手段４１０、検索キー主題グラフ作成手段４２０、単語情報管理手段４３０、検索対象文書主題グラフ作成手段４４０、及び類似度判定手段４５０の制御を行う検索制御手段４６０と、を有する。
【００２１】
本発明（請求項８）は、検索キー主題グラフ作成手段４２０において、
単語列または、単語列のブール演算子結合または、文または、文書または、文書集合で構成される文書要素から、該文書要素内で使用されている単語を抽出する単語抽出手段と、
抽出されたそれぞれの単語に重要度を付与する重要度付与手段と、
抽出されたそれぞれ任意の２単語間に関連度を付与する関連度付与手段とを含み、
検索対象文書主題グラフ作成手段４４０は、
単語の重要度をノードの重みとし、該単語間の関連度をリンクの重みとした主題グラフを作成する主題表現手段を含み、
類似度判定手段４５０は、
検索キーの主題グラフと文書の主題グラフ間の一致の度合に基づき、文書要素間の類似度を求める手段を含む。
【００２２】
本発明（請求項９）は、類似度判定手段４５０において、
検索キーの主題グラフと文書の主題グラフの同様のノード（同じ単語を含んでいるノード）の個数が多ければ多い程、該グラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるノードに大きな重みが付いていた場合は、もう片方のグラフ内の同様のノードに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値とし、
両方のグラフの同様のリンク（リンクの両端のノードに含まれる単語が同じであるリンク）の本数が多ければ多い程、該両方のグラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるリンクに大きな重みが付いていた場合は、もう片方のグラフ内の同様のリンクに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値にするように、検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第１の計算手段を含む。
【００２３】
本発明（請求項１０）は、類似度判定手段４５０において、
それぞれの検索キーの主題グラフと文書の主題グラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
部分グラフを再結合し、
部分グラフに生成したリンクをそのまま追加して、分割前のグラフに戻して検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第２の計算手段を含む。
【００２４】
本発明（請求項１１）は、類似度判定手段４５０において、
それぞれの検索キーの主題グラフと文書の主題グラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ間の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算する第３の計算手段を含む。
【００２５】
本発明（請求項１２）は、類似度判定手段４５０において、
それぞれの検索キーの主題グラフと文書の主題グラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第４の計算手段を含む。
【００２６】
図３は、本発明の文書分類装置の原理構成図である。
本発明（請求項１３）は、文書が格納されている文書記憶手段から、文書ＩＤに対応した文書を取得し、該文書の単語の重要度と単語間の関連度を用いて主題グラフを作成する主題グラフ作成手段６１０と、
２つの文書の主題グラフが入力されると、これらの一致の度合を計算するグラフ類似度判定手段６２０と、
文書間の類似度を表す行列に基づいて、該文書を分類する分類手段６３０と、
分類作業全体の制御を行う分類制御手段６４０とを有する。
【００２７】
本発明（請求項１４）は、主題グラフ作成手段６１０において、
主題グラフ間の一致の度合を測定するために、
単語列または、単語列のブール演算子結合または、文または、文書または、文書集合で構成される文書要素から、該文書要素内で使用されている単語を抽出する単語抽出手段と、
抽出されたそれぞれの単語に重要度を付与する重要度付与手段と、
抽出されたそれぞれの任意の２単語間に関連度を付与する関連度付与手段と、
単語の重要度をノードの重みとし、該単語間の関連度をリンクの重みとして主題グラフを生成する主題表現手段とを含み、
グラフ類似度判定手段６２０は、
主題グラフ間の一致の度合に基づき、文書要素間の類似度を計算する手段を含む。
【００２８】
本発明（請求項１５）は、グラフ類似度判定手段６２０において、
両方のグラフの同様のノード（同じ単語を含んでいるノード）の個数が多ければ多いほど、該グラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるノードに大きな重みが付いていた場合は、もう片方のグラフ内の同様のノードに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値とし、
両方のグラフの同様のリンク（リンクの両端のノードに含まれる単語が同じであるリンク）の本数が多ければ多い程、該両方のグラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるリンクに大きな重みが付いていた場合は、もう片方のグラフ内の同様のリンクに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値にするように、主題グラフ間の一致の度合を計算する第１の計算手段を含む。
【００２９】
本発明（請求項１６）は、グラフ類似度判定手段６２０において、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
部分グラフを再結合し、
部分グラフに生成したリンクをそのまま追加して、分割前のグラフに戻して主題グラフ間の一致の度合を計算する第２の計算手段を含む。
【００３０】
本発明（請求項１７）は、グラフ類似度判定手段６２０において、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ間の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算する第３の計算手段を含む。
【００３１】
本発明（請求項１８）は、グラフ分類度判定手段６２０において、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、主題グラフ間の一致の度合を計算する第４の計算手段を含む。
【００３２】
本発明（請求項１９）は、ユーザからの検索要求に基づいて文書を検索するための文書検索プログラムを格納した記憶媒体であって、
コンピュータに、
ユーザからの検索要求を解析し、検索キーを取り出す検索インタフェースステップと、
検索キーの重要度及び単語間の関連度を用いて検索キーの主題グラフを生成する検索キー主題グラフ作成ステップと、
指定された単語が出現する文書の文書ＩＤの集合を取得する単語情報管理ステップと、
文書ＩＤが指定されると、該文書ＩＤに対応した文書を検索対象文書が格納されている検索対象文書記憶手段から取得し、該文書内の単語の重要度と単語間の関連度を用いて文書の主題グラフを作成する検索対象文書主題グラフ作成ステップと、
検索キーの主題グラフと検索対象文書の文書の主題グラフを入力とし、それらがどの程度似ているかを判断する類似度判定ステップと、
を実行させるプログラムを格納した文書検索プログラムを格納した記憶媒体である。
【００３３】
本発明（請求項２０）は、検索キー主題グラフ作成ステップは、
単語列または、単語列のブール演算子結合または、文または、文書または、文書集合で構成される文書要素から、該文書要素内で使用されている単語を抽出する単語抽出ステップと、
抽出されたそれぞれの単語に重要度を付与する重要度付与ステップと、
抽出されたそれぞれ任意の２単語間に関連度を付与する関連度付与ステップと、を実行させ、
検索対象文書主題グラフ作成ステップは、
単語の重要度をノードの重みとし、該単語間の関連度をリンクの重みとした主題グラフを生成する主題表現ステップを実行させ、
類似度判定ステップは、
検索キーの主題グラフと文書の主題グラフ間の一致の度合に基づき、文書要素間の類似度を求めるステップを実行させる。
【００３４】
本発明（請求項２１）は、類似度判定ステップにおいて、
検索キーの主題グラフと文書の主題グラフの両方のグラフの同様のノード（同じ単語を含んでいるノード）の個数が多ければ多い程、該グラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるノードに大きな重みが付いていた場合は、もう片方のグラフ内の同様のノードに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値とし、
両方のグラフの同様のリンク（リンクの両端のノードに含まれる単語が同じであるリンク）の本数が多ければ多い程、該両方のグラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるリンクに大きな重みが付いていた場合は、もう片方のグラフ内の同様のリンクに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値にするように、検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第１の計算ステップを実行させる。
【００３５】
本発明（請求項２２）は、類似度判定ステップにおいて、
検索キーの主題グラフと文書の主題グラフのそれぞれのグラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第２の計算ステップを実行させる。
【００３６】
本発明（請求項２３）は、類似度判定ステップにおいて、
検索キーの主題グラフと文書の主題グラフのそれぞれのグラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ間の任意のノード間にリンクがない場合には、該部分リンクに小さい重みのリンクを生成し、
検索キーの主題グラフと文書の主題グラフそれぞれの部分グラフ毎に一致の度合を計算する第３の計算ステップを実行させる。
【００３７】
本発明（請求項２４）は、類似度判定ステップにおいて、
検索キーの主題グラフと文書の主題グラフのそれぞれのグラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているのかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、検索キーの主題グラフと文書の主題グラフ間の一致の度合を計算する第４の計算ステップを実行させる。
【００３８】
本発明（請求項２５）は、文書を分類するための文書分類プログラムを格納した記憶媒体であって、
コンピュータに、
文書が格納されている文書記憶手段から、文書ＩＤに対応した文書を取得し、該文書の単語の重要度と単語間の関連度を用いて主題グラフを作成する主題グラフ作成ステップと、
２つの文書の主題グラフが入力されると、これらの一致の度合を判定するグラフ分類判定ステップと、
文書間の類似度を表す行列に基づいて、該文書を分類する分類ステップと、を実行させるプログラムを格納した文書分類プログラムを格納した記憶媒体である。
【００３９】
本発明（請求項２６）は、主題グラフ作成ステップにおいて、
主題グラフ間の一致の度合を測定するために、
単語列または、単語列のブール演算子結合または、文または、文書または、文書集合で構成される文書要素から、該文書要素内で使用されている単語を抽出する単語抽出ステップと、
抽出されたそれぞれの単語に重要度を付与する重要度付与ステップと、
抽出されたそれぞれの任意の２単語間に関連度を付与する関連度付与ステップと、
単語の重要度をノードの重みとし、該単語間の関連度をリンクの重みとした主題グラフを生成する主題表現ステップと、
主題グラフ間の一致の度合に基づき、文書要素間の類似度を求める文書要素間類似度判定ステップを実行させる。
【００４０】
本発明（請求項２７）は、グラフ類似度判定ステップにおいて、
両方の主題グラフの同様のノード（同じ単語を含んでいるノード）の個数が多ければ多い程、該グラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるノードに大きな重みが付いていた場合は、もう片方のグラフ内の同様のノードに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値とし、
両方のグラフの同様のリンク（リンクの両端のノードに含まれる単語が同じであるリンク）の本数が多ければ多い程、該両方のグラフ間の一致の度合を大きな値とし、
片方のグラフ内にあるリンクに大きな重みが付いていた場合は、もう片方のグラフ内の同様のリンクに大きな重みが付いていればいる程、両方のグラフ間の一致の度合を大きな値にするように、主題グラフ間の一致の度合を計算する第１のステップを実行させる。
【００４１】
本発明（請求項２８）は、グラフ類似度判定ステップにおいて、
それぞれの主題グラフを、該グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
部分グラフを再結合し、
部分グラフに生成したリンクをそのまま追加して、分割前のグラフに戻して主題グラフ間の一致の度合を計算する第２の計算ステップを実行させる。
【００４２】
本発明（請求項２９）は、グラフ類似度判定ステップにおいて、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算する第３の計算ステップを実行させる。
【００４３】
本発明（請求項３０）は、グラフ類似度判定ステップにおいて、
それぞれの主題グラフを、該主題グラフ内で使用されている単語集合がどの程度の強さで関連し合っているかに基づいて、部分グラフに分割し、
それぞれの部分グラフに、該部分グラフ内の任意のノード間にリンクがない場合には、該部分グラフに小さい重みのリンクを生成し、
それぞれの部分グラフ毎に一致の度合を計算し、
部分グラフ毎に計算された一致の度合の総和を計算することにより、主題グラフ間の一致の度合を計算する第４の計算ステップを実行させる。
【００４４】
上記により、本発明によれば、複数の主題や副題を持つ文書要素間の類似度を精度良く判定できない問題を、単語間の関連度を用いることによって、複数の主題や副題を持つ文書要素間の類似度を精度良く判定することが可能となる。例えば、前述の発明が解決しようとする課題における例において、ユーザが「情報検索を行うロボット」に関する文書を検索したい場合に、本発明では、「情報検索」と「ロボット」が強く関連している文書の方がそうでない文書を比べて、高い類似度となる。前述した、「情報検索システム」と「産業用ロボット」という２つの主題を持つ文書内では、「情報検索」と「ロボット」が強く関連していないので、このような文書は、高い類似度とならない。このように、本発明では、類似度を精度よく判定することが可能となる。
【００４５】
また、本発明によれば、文書要素の持つ特徴を利用した類似度の判定ができないという問題を、文内で強い係り受けの関係にある単語間や、同一のパラグラフに含まれる単語間に高い関連度を与えることができるため、これらの特徴を利用した類似度の判定により解決できる。このように、本発明を利用すれば、類似度を精度良く判定できる。
【００４６】
更に、形態素解析を利用しているため単語の抽出の失敗により類似度の判定の精度を低下させるという問題に対して、本発明では、前述の「インターネット」という単語の抽出の失敗の例を用いて説明すると、形態素解析を利用しているため、前述した例と同様に、「インターネット」という単語は抽出されず、この単語は、「インター」と「ネット」という２つの単語として抽出されてしまう。しかしながら、ある文書要素に「インターネット」という文字列がある場合、その文書要素内では、抽出された単語「インター」と「ネット」の間には強い関連がある。従って、「インター」と「ネット」が別々に出現する文書要素に比べて、「インターネット」という文字列が出現する文書要素の方が高い類似度となる。従って、たとえ、形態素解析に失敗したとしても、本発明により類似度判定の精度の低下を阻止することが可能となる。
【００４７】
上記により、本発明では、文書要素間の類似度を精度良く判定できるので、精度の良い類似度判定方法及び文書検索装置及び文書分類装置及び文書検索プログラムを格納した記憶媒体及び文書分類プログラムを格納した記憶媒体を提供することができる。
【００４８】
【発明の実施の形態】
図４は、本発明の類似度判定方法を説明するためのフローチャートである。
ステップ１０）文書要素内で使用されている単語を抽出する。
ステップ２０）文書要素内で使用されているそれぞれの単語の重要度を計算する。
【００４９】
ステップ３０）文書要素内で使用されている任意の２単語間の関連度を計算する。
ステップ４０）単語の重要度をノードの重みとし、単語間の関連度をリンクの重みとした、グラフによってそれぞれの文書要素の主題を表現する。以下、このグラフを主題グラフと呼ぶ。
【００５０】
ステップ５０）このようにして生成した主題グラフ同士の一致度に基づき、文書要素間の類似度を判定する。
以下、上記の各ステップの動作を詳細に説明する。
ステップ１０）単語の抽出：
単語の抽出は、文書要素を形態素解析することによっておこなう。形態素解析手法には、既存技術を用いるものとする。
【００５１】
ステップ２０）重要度の計算：
それぞれの文書要素内で使用されている単語の重要度を、次のようにして計算する。本発明では、文書要素として、単語列、単語のブール演算子結合、文、文書、文書集合を想定しているので、それぞれについての単語の重要度の計算法を以下に示す。
【００５２】
・単語列：全ての単語の重要度を同じ値にするか、または、ユーザにそれぞれの単語の重要度を明示的に指定させることによって、重要度を決定する。
・単語のブール演算子結合：単語列の場合と同様の方法で重要度を決定する。
・文：全ての単語を同じ重要度とするか、または、単語の品詞（固有名詞には、副詞よりも高い重要度を付与するなど）に応じて重要度を決定する。
【００５３】
・文書：文の場合と同様の方法で重要度を決定するか、または、単語の出現位置情報（タイトル内で出現する単語には高い重要度を付与）、出現頻度情報（高い出現頻度の単語には高い重要度を付与）、文書要素集合全体の中での出現文書要素数（特定の文書要素にしか出現しない単語には高い重要度を付与）などに基づき計算する。
【００５４】
・文書集合：文書集合全体を一つの大きな文書（全ての文書を結合した文書）と考えて、文書の場合と同様の方法で重要度を計算する。
ステップ３０）関連度の計算：
それぞれの文書要素内で使用されている単語間の関連度を、次のようにして計算する。本発明では、文書要素として、単語列、単語のブール演算子結合、文、文書、文書集合を想定しているので、それぞれについての単語間の関連度の計算法を次に示す。
【００５５】
・単語列：文書要素に含まれる全ての２単語間の関連度を等しい値とするか、または、ユーザに明示的に関連度を指定させることによって、関連度を決定する。
・単語のブール演算子結合：単語列での方法に加えて、ブール演算子の種類に応じて関連度を決定する。例えば、ａｎｄで結合されている単語同士は、ｏｒで結合されているものに比べて関連度を大きな値とする。
【００５６】
・文：単語列での方法を用いるか、または、次に示す、係り受け関係の情報を用いて計算する。まず、文の係り受け関係の解析を行う。係り受け関係の解析の手法は、既存技術を用いるものとする。直接の係り受けの関係にあるもの同士は強い関連があるとし、間接的な係り受け関係にあるものは、弱い関連があるものとする。例えば、「情報の検索に単語の重要度を利用する」という文があった場合、「情報」と「検索」の関連度は、「情報」と「単語」の関連度に比べて大きな値とする。なぜなら、「情報」と「検索」は、直接の係り受け関係にあるのに対して、「情報」と「単語」は直接の係り受け関係にないからである。
【００５７】
・文書：文での方法を用いるか、または、以下の２つのどちらかの方法によって、単語間の関連度を計算する。
−共出現情報の利用：
ある２単語が同一の文内（または、指定文字数の範囲内）で共出現した場合、これらの共出現回数を数える。共出現の回数が多ければ多い程、それら２単語間の関連度を大きな値とする。
【００５８】
−構造情報を利用：
文書の構造（章、パラグラフなど）を解析する。あるパラグラフ内に現れる単語はそのパラグラフの見出し語と関連があり、また、パラグラフ内の単語同士は関連があると考えられるので、パラグラフ内での頻度情報に基づき、単語間の関連度を決定する。例えば、あるパラグラフだけに高い頻度で出現している単語はその節の見出し語と強い関連があり、また、それらの単語同士は強い関連があるとする。
【００５９】
・文書集合：文書集合を一つの大きな文書（全ての文書を結合した文書）と考えて、文書の場合と同様の方法で関連度を計算する。
ステップ４０）主題グラフの作成：
ステップ２０で求めた単語の重要度をノード重みとし、ステップ３０で求めた単語間の関連度をリンクの重みとしたグラフを作成する。このグラフ（主題グラフ）によって文書要素の主題を表現する。
【００６０】
ステップ５０）ステップ１４０で作成した文書要素の主題グラフ間の一致度を測定することによって、文書要素間の類似度を判定する。類似度判定処理の構成を図５に示す。
図５は、本発明の類似度判定処理の動作を説明するための図である。
１．グラフ間一致度測定処理（ステップ１１１、１２３、１３３、１４３）：文書要素の主題グラフｑとｕの一致度を、以下の式によって計算する。グラフｑとグラフｕに使用されている単語の重要度をそれぞれ以下のベクトルで表す。
【００６１】
ｖ_ｑ＝（ｖ_ｑ１，ｖ_ｑ２，…，ｖ_ｑｎ）（１）
ｖ_ｕ＝（ｖ_ｕ１，ｖ_ｕ２，…，ｖ_ｕｎ）（２）
ここで、ｖ_ｑｉとｖ_ｕｉはそれぞれ、文書要素ｑ内での単語ｉの重要度、文書ｕ内での単語ｉの重要度を表す。
これらのベクトルの内積ｆ_ｖ
【００６２】
【数１】

【００６３】
を計算する。
グラフｑとグラフｕに使用されている単語間の関連度をそれぞれ以下のように表す。
ｒ_ｑ＝（ｖ_ｑ１１，ｖ_ｑ１２，…，ｖ_ｑ２１，ｖ_ｑ２２，…，ｖ_ｑｎｎ）（３）
ｒ_ｕ＝（ｖ_ｕ１１，ｖ_ｕ１２，…，ｖ_ｕ２１，ｖ_ｕ２２，…，ｖ_ｕｎｎ）（４）
ここで、ｖ_ｑｉｊとｖ_ｕｉｊは、それぞれ、文書要素ｑ内での単語ｉと単語ｊの関連度、文書要素ｕ内での単語ｉと単語ｊの関連度を表す。
【００６４】
これらのベクトルの内積
【００６５】
【数２】

【００６６】
を計算する。
ｆ_ｖとｆ_ｒからグラフ間の一致度を以下のように求める。
一致度＝ｆ_ｖ ^ｐ＊ｆ_ｒ ^ｑ
ここで、ｐ及びｑは正の定数である。
２．グラフ分割処理（ステップ１２１、１３１、１４１）：
以下の処理によって、グラフを分割し、それぞれの部分グラフ内に小さい重みのリンクを生成する。
【００６７】
（ａ）グラフＧ_Ａをノード間の結合力の強さに応じて、ｐ個の部分グラフにＧ_Ａｉ（ｉ＝０，１，…，ｐ）に分割する。ここで、結合力の強さとは、例えば、「それぞれの部分グラフ内の任意のノード間には、必ず、距離１のリンクが存在するか、または、距離ｎ（ｎ≧２）以下のリンクがｍ（ｍ≧１）本以上存在する。」などである。ここで、ノードａとノードｂ間の距離とは、ａからｂへ到達するのに通過するリンクの本数である。
【００６８】
（ｂ）分割された部分グラフ内の任意のノード間にリンクが存在しない場合は、これらのノード間に弱い重みのリンクを生成する。
ｎ＝２，ｍ＝２の場合のグラフ分割処理を図６に示す。この例では、分割前のグラフＧ_Ａ（２１０）は、３個の部分グラフ（Ｇ_Ａ１（２２１），Ｇ_Ａ２（２２２），Ｇ_Ａ３（２３３））に分割されている。このように分割されるのは、
・Ｇ_Ａ１（２１１）について、ノードＡＢ，ＡＣ，ＢＤ，ＣＤ間に距離１のリンクが存在し、ＡＤ，ＢＣ間のそれぞれには、距離２のリンクが２本存在する。
・Ｇ_Ａ２（２２２）について、ノードＢＤ，ＢＥ，ＤＥ間に、距離１のリンクが存在する。
・Ｇ_Ａ３（２２３）について、ノードＤＦ間の距離１のリンクが存在する。
【００６９】
このため、「それぞれの部分グラフ内の任意の２ノード間には、必ず距離１のリンクが存在するか、または、距離２以下のリンクが２本以上存在する。」という条件を満たすからである。また、Ｇ_Ａ１（２２１）における破線は、グラフ分割処理の（ｂ）で追加された弱い重みのリンクである。
この処理から明らかなように、分割された部分グラフ内の単語同士は、強い結合力で結ばれている。従って、これらの部分グラフは、意味的に関連の強い単語の集合で構成されていることになるので、これらの部分グラフからなるサブ文書はそれぞれもとの文書の副題を表すことになる。
【００７０】
また、このように部分グラフ内にリンクを生成することによって、それぞれの副題に含まれる単語同士には、ある程度の関連があるということをグラフ上で表現している。
３．グラフ再結合処理（ステップ１２２）：
グラフ分割処理が作成した部分グラフＧ_Ａｉ（ｉ＝０，１，…，ｐ）を分割前のグラフＧ’_Ａに再結合する。このとき、Ｇ’_Ａは、グラフ分割処理（ステプ１２１）で生成されたリンクを追加したものである。
【００７１】
図７に、グラフ再結合処理の例を示す。この例では、図６のＧ_Ａ（２１０）から作成された、３個の部分グラフ（Ｇ_Ａ１（３１１），Ｇ_Ａ２（３１２），Ｇ_Ａ３（３１３））を、元のグラフへ再結合することによって、Ｇ’_Ａ（３２０）を作成している。グラフ分割処理、グラフ再結合処理が作成したＧ’_Ａには、ＡＤ間やＢＣ間にＧ_Ａには存在しなかった弱い重みのリンクが追加されている。
【００７２】
４．部分グラフ一致度測定処理（ステップ１３２、１４２）：
グラフ分割処理が作成したそれぞれの部分グラフ毎に、グラフ間一致度測定処理を用いて、一致度を測定する。
５．一致度合計処理（ステップ１４４）：
部分グラフごとに求めた一致度を合計した値を、分割前のグラフ全体の一致度とする。
【００７３】
次に、図５に示した類似度の計算方法（４種類）のそれぞれについて説明する。
１．グラフ分割を用いない方法（ステップ１１０）：
（ａ）文書要素の主題グラフＧ_ｑとＧ_ｕを、グラフ間一致度測定処理（ステップ１１１）に渡す。
【００７４】
（ｂ）グラフ間一致度測定処理（ステップ１１１）では、これら２のグラフＧ_ｑ，Ｇ_ｕ間の一致度を測定し、出力する。
この方法で求めた主題グラフ間の一致度を文書要素間の類似度とする。この方法は、グラフ分割を用いないので処理が高速である。
２．グラフ分割、再結合を用いる方法（ステップ１２０）：
（ａ）文書要素の主題グラフＧ_ｑとＧ_ｕを、グラフ分割処理（ステップ１２１）に渡す。
【００７５】
（ｂ）グラフ分割処理（ステップ１２１）は、Ｇ_ｑ、Ｇ_ｕをそれぞれ複数の部分グラフＧ_ｑｉ（ｉ＝０，１，…ｐ）、Ｇ_ｕｊ（ｊ＝０，１，…，ｒ）に分割し、Ｇ_ｑｉ，Ｇ_ｕｊ内の任意のノード間にリンクが存在しない場合は、これらのノード間に小さい重みのリンクを生成する。
（ｃ）グラフ再結合処理（ステップ１２２）は、グラフ分割処理（ステップ１２１）で作成した部分グラフＧ_ｑｉ，Ｇ_ｕｊを、もう一度グラフ分割する前の状態に再結合することによって、Ｇ’_ｑ，Ｇ’_ｕを作成する。前述したように、再結合の際には、それぞれの部分グラフに生成したリンクを、Ｇ’_ｑ，Ｇ’_ｕに追加する。
【００７６】
（ｄ）グラフ間一致度測定処理（ステップ１２３）は、Ｇ’_ｑとＧ’_ｕ間の一致度を測定し出力する。
このようにして求めた主題グラフ間の一致度を文書要素間の類似度とする。この方法では、間接的な単語間の関連（同一の副題に含まれる単語には、直接のある程度の関連がある）を用いて類似度の判定を行うことができるので、より正確な類似度を計算できる。
【００７７】
３．部分グラフ毎の一致度の測定法（ステップ１３０）：
（ａ）文書要素の主題グラフＧ_ｑとＧ_ｕのそれぞれを、グラフ分割処理（ステップ１３１）に渡す。
（ｂ）グラフ分割処理（ステップ１３１）は、Ｇ’_ｑ，Ｇ’_ｕを複数の部分グラフＧ_ｑｉ（ｉ＝０，１，…，ｐ）、Ｇ_ｕｊ（ｊ＝０，１，…，ｒ）に分割し、Ｇ_ｑｉ，Ｇ_ｕｊ内の任意のノード間にリンクが存在しない場合は、これらのノード間に小さい重みのリンクを追加する。
【００７８】
（ｃ）部分グラフ一致度測定処理（ステップ１３２）は、それぞれの部分グラフＧ_ｑｉとＧ_ｕｊの全ての組み合わせについて一致度を測定する。この際、グラフ間一致度測定処理（ステップ１３３）を利用する。部分グラフ毎に求めた一致度を出力する。
このようにして求めた、部分グラフごとの一致度を、それぞれ分割されたサブ文書（副題）毎の類似度とする。この方法では、文書全体を対象とするのではなく、文書内のサブ文書（副題）毎の類似度の計算を行うことができる。
【００７９】
４．部分グラフ毎の一致度の合計を用いる方法（ステップ１４０）：
（ａ）文書要素の主題グラフＧ_ｑ、Ｇ_ｕのそれぞれを、グラフ分割処理（ステップ１４１）に渡す。
（ｂ）グラフ分割処理（ステップ１４１）は、Ｇ_ｑ，Ｇ_ｕを複数の部分グラフＧ_ｑｉ（ｉ＝０，１，…，ｐ），Ｇ_ｕｊ（ｊ＝０，１，…，ｒ）に分割し、Ｇ_ｑｉ，Ｇ_ｕｊ内の任意のノード間にリンクが存在しない場合は、これらのノード間に小さい重みのリンクを追加する。
【００８０】
（ｃ）部分グラフ一致度測定処理（ステップ１４２）は、それぞれの部分グラフＧ_ｑｉとＧ_ｕｊのすべての組み合わせについて一致度を測定する。この際、グラフ間一致度測定処理（ステップ１４３）を利用する。
（ｄ）一致度合計処理（ステップ１４４）は、部分グラフ一致度測定処理（ステップ１４２）で得られた全ての部分グラフ毎の一致度の合計を計算し、出力する。
【００８１】
このようにして求めた、合計された一致度を、文書要素間の類似度とする。この方法では、文書内の副題毎の類似度の総和によって、文書要素間の類似度を判定することができるので、より正確な類似度の計算を行うことができる。
以上をまとめると、処理を高速に行うことができるのが、「グラフ分割を用いない方法（ステップ１１０）」である。
【００８２】
「グラフ分割、再結合を用いる方法（ステップ１２０）」では、グラフ分割処理という複雑な処理を行う代わりに、間接的な単語間の関連も利用した、類似度の判定を行うことができる。
また、「部分グラフ毎の一致度の測定法（ステップ１３０）」では、文書要素をそれぞれ１つの文書として取り扱うのではなく、副題ごとに分割された独立したそれぞれのサブ文書毎の類似度の判定を行うことができる。
【００８３】
「部分グラフ毎の一致度の合計を用いる方法（ステップ１４０）」では、非常に処理が複雑であるが、文書内の副題毎の類似度の総和として、文書全体の類似度を求めることができるので、より正確な類似度を判定できる。
【００８４】
【実施例】
以下、図面と共に本発明の実施例を説明する。
最初に、前述の方法を用いた文書検索装置について説明する。
図８は、本発明の一実施例の文書検索装置の構成を示す。
同図に示す文書検索装置は、検索インタフェース部４１０、検索キー主題グラフ作成部４２０、単語情報管理部４３０、検索対象主題グラフ作成部４４０、類似度判定部４５０、検索制御部４６０、検索対象文書データベース４４１、及びインデックスファイル４３１から構成される。
【００８５】
検索インタフェース部４１０は、ユーザからの検索要求を解析し、検索キーを取り出し、検索制御部４６０に渡す。また、検索結果を検索制御部４６０から受け取り、ユーザに返す。
検索キー主題グラフ作成部４２０は、検索キーから主題グラフを作成する。
単語情報管理部４３０は、インデックスファイル４３１を参照することによって、指定された単語が出現する文書ＩＤの集合を取得する。ここで、インデックスファイル４３１は、単語をキー、その単語が出現する文書ＩＤの集合を値とするハッシュテーブルである。
【００８６】
検索対象文書主題グラフ作成部４４０は、文書ＩＤが指定されるとその文書ＩＤに対応した文書を検索対象文書データベース４４１から取得し、その文書の主題グラフを作成する。
類似度判定部４５０は、検索キーの主題グラフと検索対象文書の主題グラフを入力とし、それらの類似度を判定する。ここで、類似度の判定法は、前述の図５のフローチャートの方法を用いる。
【００８７】
検索制御部４６０は、以下の処理を行う。
（ａ）検索インタフェース部４１０から検索キーを取得する。
（ｂ）検索キー主題グラフ作成部４２０から、この検索キーから作成された主題グラフを取得する。
（ｃ）単語情報管理部４３０から、この主題グラフ内の単語のどれか一つでも出現する文書ＩＤの集合を取得する。
【００８８】
（ｄ）これらの文書ＩＤの集合のそれぞれの要素に対して、以下の処理を実行する。
▲１▼ 文書ＩＤに対応した検索対象文書の主題グラフを検索対象文書主題グラフ作成部４４０から取得する。
▲２▼ この検索対象文書の主題グラフと検索キーの主題グラフの類似度を類似度判定４５０から取得する。
【００８９】
（ｅ）文書ＩＤの集合を類似度の降順にソートし、上位ｎ件の文書ＩＤに対応した文書を検索結果とする。
以下、次の例を用いて処理の流れを説明する。
図９は、本発明の一実施例の主題グラフの作成の例を示す。
検索キーＱ：（情報ｏｒ文書）ａｎｄ検索
検索対象文書Ｕ：以下の５文からなる文書
「情報の主題について。
【００９０】
文書の主題について。
検索の効率を上げる。
情報を検索する。
文書を検索する。」
ステップ５０１）ユーザが検索要求を入力する。
【００９１】
ステップ５０２）検索インタフェース部４１０は、ユーザが入力した検索要求から検索キーを抽出し、検索制御部４６０に渡す。
ステップ５０３）検索制御部４６０は、検索キーを検索キー主題グラフ作成部４２０に渡す。
ステップ５０４）検索キー主題グラフ作成部４２０は、検索キーから主題グラフを作成し、検索制御部４６０に渡す。今回の例では、検索キーＱから図９の検索キーの主題グラフ５１０を作成した。但し、すべての単語の重要度を１．０とし、単語間の関連度は、“ｏｒ”の場合０．５、“ａｎｄ”の場合１．０とした。
【００９２】
ステップ５０５）検索制御部４６０は、検索キーの主題グラフに使用されているそれぞれの単語を単語情報管理部４３０に渡す。
ステップ５０６）単語情報管理部４３０は、その単語が一度でも出現する文書ＩＤの集合をインデックスファイル４３１から取得し、検索制御部４６０に渡す。
【００９３】
ステップ５０７）検索制御部４６０は、単語情報管理部４３０から取得した文書ＩＤの集合のそれぞれの要素を検索対象文書主題グラフ作成部４４０に渡す。
ステップ５０８）検索対象文書主題グラフ作成部４４０は、文書ＩＤに対応した文書を検索対象文書データベース４４１から取得し、その文書の主題グラフを作成し、これを検索制御部４６０に渡す。今回の例では、文書Ｕから図９の検索対象文書の主題グラフ５２０を作成した。但し、単語の重要度は出現回数に比例した値とし、単語間の関連度は、文内の共出現回数に比例した値とし、不要語は取り除いた。
【００９４】
ステップ５０９）検索制御部４６０は、検索対象文書主題グラフ作成部４４０から取得したそれぞれの主題グラフと検索キーの主題グラフを類似度判定部４５０に渡す。
ステップ５１０）類似度判定部４５０は、検索キーの主題グラフとそれぞれの検索対象文書の主題グラフとの類似度の判定を行う。
【００９５】
今回は、「グラフ分割を用いない方法」を使用した場合の例として、類似度の判定法を説明する。まず、図９の検索キーの主題グラフ５１０及び検索対象文書の主題グラフ５２０から、以下の単語の重要度を表すベクトルを生成する。但し、ベクトルｖ_Ｑ，ｖ_Ｕの要素は、それぞれ、検索キーＱ、検索対象文書Ｕ内での（情報，文書，検索，主題，効率）の重要度を示し、グラフに存在しない単語の重要度は、０．０とした。
【００９６】
ｖ_Ｑ＝（１．０，１．０，１．０，０．０，０．０）（５）
ｖ_ｕ＝（０．６，０．６，１．０，０．６，０．３）（６）
これらのベクトルの内積ｆ_ｖは、
ｆ_ｖ＝２．２
となる。
【００９７】
次に、同様に関連度を表すベクトルを生成する。但し、ベクトルｒ_Ｑ，ｒ_Ｕの要素は、それぞれ、検索キーＱ、検索対象文書Ｕ内での（情報と主題、情報と文書、情報と検索、情報と効率、主題と文書、主題と検索、主題と効率、文書と検索、文書と効率、検索と効率）の関連度を表し、グラフに存在しないリンクの重みは０とした。
【００９８】
ｒ_Ｑ＝（０．０，０．５，１．０，０．０，０．０，０．０，０．０，１．０，０．０，０．０）（７）
ｒ_Ｕ＝（１．０，０．０，１．０，０．０，１．０，０．０，０．０，１．０，０．０，１．０）（８）
これらのベクトルの内積ｆ_ｖは、
ｆ_ｖ＝２．０
となる。そこで、ｐ＝１，ｑ＝１とした場合、検索キーの主題グラフ５１０と検索対象文書の主題グラフ５２０の一致度は、
一致度＝ｆ_Ｕ＊ｆ_ｒ＝２．２＊２．０＝４．４
となり、今回の例での検索キーＱと検索対象文書Ｕの類似度は、４．４と計算される。
【００９９】
ステップ５１１）検索制御部４６０は、類似度判定部４５０から取得したそれぞれの文書の類似度に基づき、文書集合を降順に並べ替え、上位ｎ件を検索結果とし、これを検索インタフェース部４１０に渡す。ここで、類似度判定部４５０で、部分グラフごとの一致度の測定法を用いた場合は、上位ｎ件のサブ文書が検索結果となる。
【０１００】
ステップ５１２）検索インタフェース部４１０は、検索結果をユーザに返す。今回の例では、（文書Ｕ，４．４）が検索結果である。
次に、本発明を用いた文書分類装置について説明する。
図１０は、本発明の一実施例の文書分類装置の構成を示す。
同図に示す文書分類装置は、主題グラフ作成部６１０、類似度判定部６２０、分類部６３０、分類制御部６４０、文書データベース６１１から構成される。
【０１０１】
同図において、主題グラフ作成部６１０は、前述の文書検索装置の検索対象文書主題グラフ作成部４４０と全く同じものである。
主題グラフ作成部６１０は、文書データベース６１１から文書ＩＤに対応した文書を取得し、その文書の主題グラフを作成する。
類似度判定部６２０は、２つの文書の主題グラフが入力されると、これらの類似度を判定する。ここで、類似度の判定には、前述の文書検索装置の類似度判定部４５０と同様の類似度判定方法を用いるものとする。
【０１０２】
分類部６３０は、文書間類似度行列を基に文書を分類する。ここで、文書間類似度行列とは、以下の形式である。

但し、ｓ_ｉｊは、文書ｉと文書ｊの類似度を表し、ｓ_ｉｊ＝ｓ_ｊｉであり、ｓ_ｉｉは無限大である。
【０１０３】
文書間類似度行列が与えられた時の分類の方法は、例えば、類似度最大の文書同士を順次結合していくクラスタリングなどである。具体的な分類の方法は、既存記述による。
分類制御部６４０は、分類作業全体の制御を行う。
上記の構成の一連の動作を以下に説明する。
【０１０４】
ステップ６０１）ユーザは、文書データベース６１１内の文書を何個の文書集合に分類するのか（分類数）を指定する。
ステップ６０２）分類制御部６４０は、文書データベース６１１に含まれるすべての文書の文書ＩＤを主題グラフ作成部６１０に渡す。
ステップ６０３）主題グラフ作成部６１０は、それぞれの文書ＩＤに対応した文書を文書データベース６１１から取得し、主題グラフを作成し、これを分類制御部６４０に渡す。
【０１０５】
ステップ６０４）分類制御部６４０は、主題グラフ作成部６１０から取得した主題グラフのすべての２つの組み合わせを類似度判定部６２０に渡す。
ステップ６０５）類似度判定部６２０は、それぞれの主題グラフ間の類似度を判定し、分類制御部６４０に渡す。
ステップ６０６）分類制御部６４０は、類似度判定部６２０から取得したすべての２つの組み合わせの文書間の類似度から、文書間類似度行列を作成し、ユーザが入力した分類数と共に分類部６３０に渡す。
【０１０６】
ステップ６０７）分類部６３０は、文書間類似度行列と分類数を基に、文書集合の分類を行い、分類結果を分類制御部６４０に渡す。
ステップ６０８）分類制御部６４０は、分類結果をユーザに返す。
また、上記の実施例における文書検索装置と文書分類装置の構成要素をプログラムとして構築し、文書検索装置及び文書分類装置として利用されるコンピュータに接続されるディスク装置や、フロッピーディスク、ＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施する際にインストールすることにより容易に本発明を実現できる。
【０１０７】
なお、本発明は、上記の実施例に限定されることなく、特許請求の範囲内で種々変更・応用が可能である。
【０１０８】
【発明の効果】
上述のように、本発明によれば、文書要素間の類似度を単語の重要度だけであく、単語間の関連度を基に判定することができるので、より精度の高い類似度の判定を行うことができる。
また、検索キーと検索対象文書の類似度を、検索キー及び検索対象文書内での単語の重要度だけでなく、検索キー及び検索対象文書内での単語間の関連度も用いて計算することができるので、検索キーが文や文書になっても、また、検索対象が文書全文となった場合でも、より精度の高い情報検索を表現できる。
【０１０９】
また、同様に文書間の類似度を文書内の単語の重要度だけでなく、文書内の単語間の関連度も用いて計算することができるので、より精度の高い文書分類を実現できる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の文書検索装置の原理構成図である。
【図３】本発明の文書分類装置の原理構成図である。
【図４】本発明の類似度判定方法を説明するためのフローチャートである。
【図５】本発明の類似度判定処理の動作を説明するための図である。
【図６】本発明のグラフ分類処理を説明するための図である。
【図７】本発明のグラフ再結合処理を説明するための図である。
【図８】本発明の一実施例の文書検索装置の構成図である。
【図９】本発明の一実施例の主題グラフの作成の例を示す図である。
【図１０】本発明の一実施例の文書分類装置の構成図である。
【符号の説明】
２１０分割前のグラフ
２１１，２２２，２２３部分グラフ
３１１，３１２，３１３部分グラフ
３２０再結合したグラフ
４１０検索インタフェース手段、検索インタフェース部
４２０検索キー主題グラフ作成手段、検索キー主題グラフ作成部
４３０単語情報管理手段、単語情報管理部
４３１インデックスファイル
４４０検索対象文書主題グラフ作成手段、検索対象文書主題グラフ作成部
４４１検索対象文書記憶手段、検索対象文書データベース
４５０類似度判定手段、類似度判定部
４６０検索制御手段、検索制御部
５１０検索キーＱの主題グラフ
５２０検索対象文書Ｕの主題グラフ
６１０主題グラフ作成手段、主題グラフ作成部
６１１文書記憶手段、文書データベース
６２０グラフ類似度判定手段、類似度判定部
６３０分類手段、分類部
６４０分類制御手段、分類部制御部[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a similarity determination method, a document search device, a document classification device, a storage medium storing a document search program, and a storage medium storing a document classification program, and more particularly to a word string or a Boolean operator of the word string. A similarity determination method, a document search device, a document classification device, a storage medium storing a document search program, and a document classification program for appropriately determining the similarity between document elements composed of a combination or a document or a document set And a storage medium storing the information.
[0002]
[Prior art]
A document search device according to the related art determines a search result by determining the similarity between a search key and a search target document by the following method. Here, the search key may be not only a word string or a Boolean operator combination of the word strings but also a document or a document set.
[0003]
First, using a technique called morphological analysis, words used in each document to be searched are extracted, and the strength of association between each word and the subject (content) of the word is determined based on the appearance frequency information. Is given. Similarly, we assign importance to words in the search key entered by the user, check how important each word is in each document, and change the Boolean operator in the search key. Appropriate processing is performed to calculate the similarity. Finally, the search target documents are sorted in descending order of the similarity obtained in this manner, and the top n items are set as the search results. Here, n is a positive constant.
[0004]
For example, suppose that a user wants to search for a document related to “search for information or document (especially document)”. The user, as a search key,
((information⁻0.5 or document⁻0.8) and search⁻1.0)
Is specified. First, the document search device obtains the importance ((information, 0.5) (document, 0.8) (search, 1.0)) of the word in the search key (in this example, the search key , The importance is explicitly described, but the importance may be automatically determined from the appearance frequency information of the word, etc.).
[0005]
Next, the importance of the words "information", "document", and "search" used in the search key in each document to be searched is determined based on the frequency of appearance. Suppose these were the following values:

Calculate the product of the importance of the word in the search key and the importance of the word in the search target document, and if or is used in the search key, add the values on both sides, When and is used, the smaller of the values on both sides is taken.
[0006]
With this method, the similarity between the search key and each document can be obtained as follows.
Similarity of document a = min ((0.5 * 0.4 + 0.8 * 0.6), 1.0 * 0.9) = min (0.68, 0.9) = 0.68
Similarity of document b = min ((0.5 * 0.4 + 0.8 * 0.1), 1.0 * 0.0) = min (0.26,0.0) = 0.0
Similarity of document c = min ((0.5 * 0.3 + 0.8 * 0.8), 1.0 * 1.0) = min (0.79, 1.0) = 0.79
Here, min (x, y) returns the smaller value of x and y.
[0007]
In this way, the similarity between the search key and the search target document is calculated, the search target documents are sorted in descending order of this value, and the top n (= 2) items are set as search results. Therefore, the search result in this case is
Document c (similarity 0.79)
Document a (similarity 0.68)
It becomes.
[0008]
Further, the conventional document classifying apparatus classifies documents by determining the similarity between all the combinations of two documents in a set of documents to be classified.
First, words are extracted from each document of a set of documents to be classified, and the words are given appropriate importance. Next, based on this importance, the similarity between all two documents in the set of documents to be classified is determined by the same method as that described for the document search device. Next, based on the similarity between the documents, the documents are classified by sequentially combining documents having a large similarity. This technique is called clustering.
[0009]
[Problems to be solved by the invention]
In order to configure a document search device and a document classification device with higher accuracy, it is necessary to appropriately determine the similarity between document elements. Here, the document element is a word string, a Boolean operator combination of words, a document, or a document set. However, the conventional similarity determination method has the following problems.
[0010]
1. Cannot accurately determine similarity between document elements with multiple themes and subtitles:
When the document element is a word string, an abstract, or the like, the document element is considered to have one subject, but generally, when the entire document is targeted, the document has a plurality of subjects and subtitles. Therefore, the similarity is not properly calculated when the entire text of such a document is targeted.
[0011]
For example, in the document search operation, it is assumed that the user specifies a search key of “information search and robot” when searching for a document related to “robot performing information search”. However, this search key gives a high degree of similarity to documents having two themes, "information retrieval system" and "industrial robot". As described above, when the document element has a plurality of themes and subtitles, there is a problem that the similarity cannot be accurately determined.
[0012]
2. Unable to determine similarity using features of document elements:
Words used in a sentence have characteristics such as dependency relationships. Documents also have features such as paragraphs. However, the conventional similarity determination method only extracts words, assigns importance to those words, and determines similarity based on them. Therefore, these features cannot be used, and similarity determination cannot be performed. Cannot be determined with high accuracy.
[0013]
3. Incomplete morphological analysis:
In morphological analysis used when extracting a word from a document, it is necessary to recognize which partial character string is a word, and for that purpose, the word must be registered in a dictionary in advance. However, when the update speed of information is high, it is impossible to register all words in a dictionary in advance, and when such information is targeted, analysis at the time of extracting words fails. Is inevitable. For example, if only the words “inter” and “net” are registered in the dictionary, the word “internet” is not extracted, and this word is extracted as two words “inter” and “net”. Would. Thus, there is a problem that the similarity cannot be determined with high accuracy because the word extraction fails.
[0014]
SUMMARY OF THE INVENTION The present invention has been made in view of the above points, and enables highly accurate determination of similarity between document elements having a plurality of subjects and subtitles, and enables determination of similarity using characteristics of document elements. A similarity determination method, a document search device, a document classification device, a storage medium storing the document search program, and a document classification program capable of resolving incomplete analysis and accurately determining the similarity between document elements. It is an object of the present invention to provide a storage medium having stored therein.
[0015]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
The present invention (claim 1) provides a similarity determination method for appropriately determining the similarity between document elements,
In the subject graph creation means,Word string or Boolean operator combination of word strings, or sentence or document or document element composed of document setIs entered multiple times,Extract words used in each document element (step 1)
The importance is given to each extracted word (step 2),
The importance of a word is taken as the weight of a node, and the relevance between the words is taken as the weight of a link.age(Step 3), for each document elementSubject graphIs generated (step 4),
In the similarity determination means,Subject graphBased on the degree of matching between documents, the similarity between document elementsAsk(Step 5).
[0016]
The present invention (claim 2) performs a morphological analysis on document elements when extracting words.
The present invention (claim3)In the similarity determination means, the subject graphWhen calculating the degree of match between
BothSubject graphThe larger the number of similar nodes (nodes containing the same word) ofSubject graphThe degree of agreement between
one ofSubject graphIf a node inside is heavily weighted, the otherSubject graphThe greater the weight of similar nodes in, the more both nodesSubject graphThe degree of agreement between
BothSubject graphThe greater the number of similar links (links having the same word in the nodes at both ends of the link), the greater the number of both linksSubject graphThe degree of agreement between
one ofSubject graphIf the link inside is heavily weighted, the otherSubject graphThe greater the weight of similar links inSubject graphSo that the degree of match betweenSubject graphCalculate the degree of match between
[0017]
The present invention (claim4)In the similarity determination means, the subject graphWhen calculating the degree of match between
eachSubject graphToThe subject graphIs divided into subgraphs based on how strongly the word sets used in the
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Rejoin the subgraphs,
Add the generated link to the subgraph as it is, andSubject graphBack toSubject graphCalculate the degree of match between
[0018]
The present invention (claim5)In the similarity determination means, the subject graphWhen calculating the degree of match between
eachSubject graphToThe subject graphIs divided into subgraphs based on how strongly the word sets used within are related,
If each subgraph has no link between any nodes between the subgraphs, a link having a small weight is generated in the subgraph,
The degree of matching is calculated for each subgraph.
[0019]
The present invention (claim6)In the similarity determination means, the subject graphWhen calculating the degree of match between
eachSubject graphToThe subject graphIs divided into subgraphs based on how strongly the word sets used within are related,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Subject graphCalculate the degree of match between
[0020]
FIG. 2 is a principle configuration diagram of the document search device of the present invention.
The present invention (claim7) Is a document search device for searching for a document based on a search request from a user,
A search interface unit 410 for analyzing a search request from a user and extracting a search key;
search keyGraph of search key using the importance of words and the relevance between wordsA search key subject graph creating means 420 for generating
A word information management unit 430 that acquires a set of document IDs of documents in which the specified word appears;
When a document ID is specified, a document corresponding to the document ID is obtained from the search target document storage unit, and theThematic Graph of Documents Using Word Importance and Relevance between WordsSearch subject document graph creation means 440 for creating
Search Key Theme Graph and Document Theme GraphAnd similarity determining means 450 for determining how similar they are,
A search control unit 460 for controlling the search interface unit 410, the search key subject graph creation unit 420, the word information management unit 430, the search subject document subject graph creation unit 440, and the similarity determination unit 450;,Having.
[0021]
The present invention (claim8) In the search key subject graph creation means 420
Word extraction means for extracting a word used in the document element from a word element or a Boolean operator combination of the word string or a sentence or a document or a document element constituted by a document set;
Importance assigning means for assigning importance to each of the extracted words;
Relevance assigning means for assigning a relevance between any two extracted words,
The search subject document subject graph creation means 440
The importance of a word is used as the weight of a node, and the relevance between the words is used as the weight of a link.Create a customized thematic graphIncluding the subject expression means,
The similarity determination means 450
Search Key Theme Graph and Document Theme GraphBased on the degree of matching between documents, the similarity between document elementsAskIncluding means.
[0022]
The present invention (claim9) In the similarity determination means 450
Search Key Theme Graph and Document Theme GraphThe greater the number of similar nodes (nodes containing the same word), the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links in both graphs (links having the same word in the nodes at both ends of the link), the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs like,Search Key Theme Graph and Document Theme GraphA first calculating means for calculating a degree of coincidence between them.
[0023]
The present invention (claim10) In the similarity determination means 450
eachSearch Key Theme Graph and Document Theme GraphToTheBased on how strong the word sets used in the graph are related to each other, split them into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Rejoin the subgraphs,
Add the link generated to the subgraph as it is and return to the graph before splittingSearch Key Theme Graph and Document Theme GraphSecond calculating means for calculating the degree of coincidence between them.
[0024]
The present invention (claim11) In the similarity determination means 450
eachSearch Key Theme Graph and Document Theme GraphToTheBased on how strong the word sets used in the graph are related to each other, split them into subgraphs,
If each subgraph has no link between any nodes between the subgraphs, a link having a small weight is generated in the subgraph,
A third calculating means for calculating the degree of coincidence for each subgraph is included.
[0025]
The present invention (claim12) In the similarity determination means 450
eachSearch Key Theme Graph and Document Theme GraphToTheBased on how strong the word sets used in the graph are related to each other, split them into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Search Key Theme Graph and Document Theme GraphAnd a fourth calculating means for calculating the degree of coincidence between them.
[0026]
FIG. 3 is a diagram showing the principle configuration of the document classification device according to the present invention.
The present invention (claimThirteen) Obtains the document corresponding to the document ID from the document storage means in which the document is stored, andThematic graph using word importance and relevance between wordsThematic graph creating means 610 for creating
Two documentsSubject graphIs entered, the degree of these matches iscalculateGraph similarity determination means 620;
A classifying unit 630 for classifying the documents based on a matrix representing the similarity between the documents;
A classification control unit 640 for controlling the entire classification work.
[0027]
The present invention (claim14) In the subject graph creation means 610
Subject graphTo measure the degree of agreement between
Word extraction means for extracting a word used in the document element from a word element or a Boolean operator combination of the word string or a sentence or a document or a document element constituted by a document set;
Importance assigning means for assigning importance to each of the extracted words;
A relevance assigning means for assigning a relevance between any two extracted words;
The importance of a word is taken as the weight of a node, and the relevance between the words is taken as the weight of a link.Generate a subject graph asSubject means,
The graph similarity determination means 620
Subject graphBased on the degree of matching between documents, the similarity between document elementscalculateIncluding means.
[0028]
The present invention (claimFifteen) In the graph similarity determination means 620
The greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links in both graphs (links having the same word in the nodes at both ends of the link), the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs like,Subject graphA first calculating means for calculating a degree of coincidence between them.
[0029]
The present invention (claim16) In the graph similarity determination means 620
eachsubjectGraphThe subjectBased on how strong the word sets used in the graph are related to each other, split them into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Rejoin the subgraphs,
Add the generated link to the subgraph as it is and return to the graph before splittingSubject graphSecond calculating means for calculating the degree of coincidence between them.
[0030]
The present invention (claim17) In the graph similarity determination means 620
eachSubject graphToThe subject graphIs divided into subgraphs based on how strongly the word sets used within are related,
If each subgraph has no link between any nodes between the subgraphs, a link having a small weight is generated in the subgraph,
A third calculating means for calculating the degree of coincidence for each subgraph is included.
[0031]
The present invention (claim18) In the graph classification degree determining means 620
eachSubject graphToThe subject graphIs divided into subgraphs based on how strongly the word sets used within are related,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Subject graphAnd a fourth calculating means for calculating the degree of coincidence between them.
[0032]
The present invention (claim19) Is a storage medium storing a document search program for searching for a document based on a search request from a user,
On the computer,
A search interface that analyzes search requests from users and retrieves search keysStepsWhen,
search keyGraph of search key using the importance of words and the relevance between wordsCreate a search key theme graph that generatesStepsWhen,
Word information management for acquiring a set of document IDs of documents in which a specified word appearsStepsWhen,
When the document ID is specified, a document corresponding to the document ID is obtained from the search target document storage unit in which the search target document is stored.Thematic Graph of Documents Using Word Importance and Relevance between WordsCreate a subject graph for searchStepsWhen,
Search Key Theme GraphAnd the search target documentDocument subject graphSimilarity judgment to determine how similar they areSteps and
Stored a program to executeThis is a storage medium storing a document search program.
[0033]
The present invention (claim20) Create a search key subject graphStepsIs
Word extraction for extracting words used in a document element from a word string, a Boolean operator combination of word strings, or a sentence, a document, or a document element composed of a document setStepsWhen,
Assigning importance to each extracted wordStepsWhen,
Relevance assignment to assign relevance between any two extracted wordsAnd execute the steps and
Creating a subject graph for search target documentsStepsIs
The importance of a word is used as the weight of a node, and the relevance between the words is used as the weight of a link.Generate a thematic graphThematic expressionLet the steps run,
Similarity judgmentStepsIs
Search Key Theme Graph and Document Theme GraphBased on the degree of matching between documents, the similarity between document elementsPerform the required steps.
[0034]
The present invention (claim21) Is similarity judgmentIn the step,
Search key theme graph and document theme graphThe greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links in both graphs (links having the same word in the nodes at both ends of the link), the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs like,Search Key Theme Graph and Document Theme GraphFirst calculation for calculating the degree of agreement betweenExecute the step.
[0035]
The present invention (claim22) Is similarity judgmentStepsAt
Search key theme graph and document theme graphEach graph,TheBased on how strongly the word sets used in the graph are related, you can subdivide it into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Search Key Theme Graph and Document Theme GraphSecond calculation for calculating the degree of agreement betweenExecute the step.
[0036]
The present invention (claim23) Is similarity judgmentStepsAt
Search key theme graph and document theme graphEach graph,TheBased on how strongly the word sets used in the graph are related, you can subdivide it into subgraphs,
If there is no link between any nodes between the subgraphs in each subgraph, a link having a small weight is generated for the sublink,
Search Key Theme Graph and Document Theme GraphThird calculation for calculating the degree of coincidence for each subgraphExecute the step.
[0037]
The present invention (claim24) Is similarity judgmentStepsAt
Search key theme graph and document theme graphEach graph,TheBased on how strong the word sets used in the graph are related to each other, split them into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Search Key Theme Graph and Document Theme GraphFourth calculation for calculating the degree of agreement betweenExecute the step.
[0038]
The present invention (claim25) Is the document segment for classifying documents.KindA storage medium storing a program,
On the computer,
A document corresponding to the document ID is obtained from the document storage unit in which the document is stored, and theThematic graph using word importance and relevance between wordsCreate thematic graph to createStepsWhen,
When the subject graphs of two documents are input, a graph classification determination that determines the degree of coincidence between them is made.StepsWhen,
Classification for classifying documents based on a matrix representing similarity between documentsAnd a storage medium storing a document classification program storing a program for executing the steps.
[0039]
The present invention (claim26) Creates thematic graphStepsAt
Between subject graphsTo measure the degree of match of
Word extraction for extracting words used in a document element from a word string, a Boolean operator combination of word strings, or a sentence, a document, or a document element composed of a document setStepsWhen,
Assigning importance to each extracted wordStepsWhen,
Relevance assignment to assign relevance between any two extracted wordsStepsWhen,
The importance of a word is used as the weight of a node, and the relevance between the words is used as the weight of a link.Generate a thematic graphThematic expressionStepsWhen,
Subject graphDocument element similarity determination to determine the similarity between document elements based on the degree of matching between themRun a step.
[0040]
The present invention (claim27) Is graph similarity judgmentStepsAt
BothSubject graphThe greater the number of similar nodes (nodes containing the same word), the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links in both graphs (links having the same word in the nodes at both ends of the link), the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs like,Subject graphCalculate the degree of agreement betweenExecute the step.
[0041]
The present invention (claim28) Is graph similarity judgmentStepsAt
eachSubject graphToTheBased on how strongly the word sets used in the graph are related, you can subdivide it into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Rejoin the subgraphs,
Add the generated link to the subgraph as it is and return to the graph before splittingSubject graphSecond calculation for calculating the degree of agreement betweenExecute the step.
[0042]
The present invention (claim29) Is graph similarity judgmentStepsAt
eachSubject graphToThe subjectBased on how strongly the word sets used in the graph are related, you can subdivide it into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Third calculation for calculating the degree of coincidence for each subgraphExecute the step.
[0043]
The present invention (claim30) Is graph similarity judgmentStepsAt
Each subject graph,The subjectBased on how strongly the word sets used in the graph are related, you can subdivide it into subgraphs,
If each subgraph has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph,
Calculate the degree of match for each subgraph,
By calculating the sum of the degrees of matching calculated for each subgraph,Subject graphFourth calculation for calculating the degree of agreement betweenExecute the step.
[0044]
As described above, according to the present invention, the problem that the similarity between document elements having a plurality of subjects and subtitles cannot be determined with high accuracy is improved by using the degree of relevance between words. Can be accurately determined. For example, in the example of the problem to be solved by the above-described invention, if the user wants to search for a document related to "robot performing information search", in the present invention, "information search" and "robot" are strongly related. A document has a higher similarity than a document that does not. In a document having the two subjects of "information retrieval system" and "industrial robot", "information retrieval" and "robot" are not strongly related to each other. No. As described above, according to the present invention, it is possible to accurately determine the similarity.
[0045]
Further, according to the present invention, the problem that the similarity cannot be determined using the features of the document element is high between words having a strong dependency relationship in a sentence or between words included in the same paragraph. Since the degree of association can be given, the problem can be solved by determining the similarity using these features. As described above, the similarity can be accurately determined by using the present invention.
[0046]
Further, in order to reduce the accuracy of the determination of the similarity due to the failure of the word extraction due to the use of the morphological analysis, the present invention uses an example of the failure of the extraction of the word “Internet” described above. In other words, since the morphological analysis is used, the word “Internet” is not extracted as in the above-described example, and this word is extracted as two words “inter” and “net”. . However, when a certain document element has a character string “Internet”, there is a strong relationship between the extracted words “inter” and “net” in the document element. Therefore, the document element in which the character string “Internet” appears has a higher similarity than the document element in which “inter” and “net” appear separately. Therefore, even if the morphological analysis fails, the present invention can prevent a decrease in the accuracy of the similarity determination.
[0047]
As described above, according to the present invention, the similarity between document elements can be determined with high accuracy. Therefore, a highly accurate similarity determination method, a document search device, a document classification device, a storage medium storing a document search program, and a document classification program are stored. Storage medium can be provided.
[0048]
BEST MODE FOR CARRYING OUT THE INVENTION
FIG. 4 is a flowchart for explaining the similarity determination method of the present invention.
Step 10) Extract words used in the document element.
Step 20) Calculate the importance of each word used in the document element.
[0049]
Step 30) Calculate the degree of association between any two words used in the document element.
Step 40) The subject of each document element is expressed by a graph in which the importance of a word is set as the weight of a node and the degree of association between words is set as the weight of a link. Hereinafter, this graph is called a theme graph.
[0050]
Step 50) The similarity between the document elements is determined based on the degree of coincidence between the subject graphs generated in this way.
Hereinafter, the operation of each of the above steps will be described in detail.
Step 10) Word extraction:
Word extraction is performed by morphological analysis of document elements. An existing technology is used for the morphological analysis method.
[0051]
Step 20) Calculation of importance:
The importance of words used in each document element is calculated as follows. In the present invention, as a document element, a word string, a Boolean operator combination of words, a sentence, a document, and a document set are assumed, and a method of calculating the importance of a word for each of them will be described below.
[0052]
Word sequence: The importance is determined by setting the importance of all words to the same value, or by letting the user explicitly specify the importance of each word.
Boolean operator combination of words: The importance is determined in the same manner as in the case of word strings.
Sentence: All words have the same importance, or importance is determined according to the part of speech of a word (e.g., proper nouns are given higher importance than adverbs).
[0053]
-Document: The importance is determined in the same manner as in the case of a sentence, or word appearance position information (words appearing in the title are given high importance), appearance frequency information (words with high appearance frequency) Is given high importance), the number of document elements appearing in the entire document element set (words appearing only in specific document elements are given high importance), and the like.
[0054]
Document set: The entire document set is regarded as one large document (a document in which all documents are combined), and the importance is calculated in the same manner as in the case of a document.
Step 30) Relevance calculation:
The relevance between words used in each document element is calculated as follows. In the present invention, as a document element, a word string, a Boolean operator combination of words, a sentence, a document, and a document set are assumed, and a method of calculating the degree of association between words for each of them will be described below.
[0055]
Word string: The relevance is determined by setting the relevance between all two words included in the document element to an equal value or by having the user explicitly specify the relevance.
-Word Boolean operator combination: In addition to the word string method, the degree of relevancy is determined according to the type of Boolean operator. For example, words connected by and have a higher degree of association than words connected by or.
[0056]
-Sentence: Either a method using a word string is used, or calculation is performed using the information on the dependency relationship shown below. First, the dependency relation of the sentence is analyzed. An existing technique is used as a method of analyzing the dependency relationship. It is assumed that those that have a direct dependency relationship have a strong relationship, and those that have an indirect dependency relationship have a weak relationship. For example, if there is a sentence “Use word importance for information search”, the relevance of “information” and “search” is larger than the relevance of “information” and “word”. I do. This is because “information” and “search” have a direct dependency relationship, whereas “information” and “word” have no direct dependency relationship.
[0057]
Document: The degree of relevance between words is calculated by using a sentence method or one of the following two methods.
-Use of co-occurrence information:
When a certain two words co-occur in the same sentence (or within the range of the designated number of characters), the number of times of co-occurrence is counted. The greater the number of times of co-occurrence, the greater the degree of association between those two words.
[0058]
-Use structural information:
Analyze the structure of a document (chapter, paragraph, etc.). Since words appearing in a paragraph are related to the headword of the paragraph, and words in the paragraph are considered to be related, the degree of association between words is determined based on the frequency information in the paragraph. . For example, it is assumed that a word that appears at a high frequency only in a certain paragraph has a strong association with the headword of the section, and that the words have a strong association.
[0059]
Document set: The document set is considered as one large document (a document in which all documents are combined), and the degree of relevance is calculated in the same manner as in the case of a document.
Step 40) Creating a subject graph:
A graph is created in which the importance of the word obtained in step 20 is used as the node weight, and the degree of association between the words obtained in step 30 is used as the weight of the link. This graph (subject graph) expresses the subject of the document element.
[0060]
Step 50) The similarity between the document elements is determined by measuring the degree of coincidence between the subject graphs of the document elements created in step 140. FIG. 5 shows the configuration of the similarity determination processing.
FIG. 5 is a diagram for explaining the operation of the similarity determination processing of the present invention.
1. Graph coincidence measurement processing (steps 111, 123, 133, 143): The coincidence between the subject graphs q and u of the document element is calculated by the following equation. The importance of the words used in the graphs q and u is represented by the following vectors, respectively.
[0061]
v_q = (V_q1, V_q2, ..., v_qn(1)
v_u = (V_u1, V_u2, ..., v_un(2)
Where v_qiAnd v_uiRepresents the importance of the word i in the document element q and the importance of the word i in the document u, respectively.
Dot product f of these vectors_v
[0062]
(Equation 1)

[0063]
Is calculated.
The relevance between words used in the graph q and the graph u is expressed as follows.
r_q= (V_q11, V_q12, ..., v_q21, V_q22, ..., v_qnn(3)
r_u= (V_u11, V_u12, ..., v_u21, V_u22, ..., v_un) (4)
Where v_qijAnd v_uijRepresents the relevance between the word i and the word j in the document element q and the relevance between the word i and the word j in the document element u, respectively.
[0064]
Dot product of these vectors
[0065]
(Equation 2)

[0066]
Is calculated.
f_vAnd f_rThen, the degree of coincidence between the graphs is obtained as follows.
Degree of coincidence = f_v ^p* F_r ^q
Here, p and q are positive constants.
2. Graph division processing (

steps

121, 131, 141):
The graph is divided by the following processing, and a link having a small weight is generated in each subgraph.
[0067]
(A) Graph G_AIs divided into p subgraphs according to the strength of the coupling force between the nodes._Ai(I = 0, 1,..., P). Here, the strength of the coupling force means, for example, that “a link with a distance of 1 always exists between arbitrary nodes in each subgraph, or a link with a distance of n (n ≧ 2) or less”. m (m ≧ 1) or more ”. Here, the distance between the node a and the node b is the number of links that pass from the node a to the node b.
[0068]
(B) If no link exists between arbitrary nodes in the divided subgraph, a link having a weak weight is generated between these nodes.
FIG. 6 shows a graph division process when n = 2 and m = 2. In this example, the graph G before the division_A(210) has three subgraphs (G_A1(221), G_A2(222), G_A3(233)). The division in this way is
・ G_A1Regarding (211), a link with a distance of 1 exists between nodes AB, AC, BD, and CD, and two links with a distance of 2 exist between each of AD and BC.
・ G_A2Regarding (222), a link with a distance of 1 exists between the nodes BD, BE, and DE.
・ G_A3For (223), there is a link with a distance of 1 between the nodes DF.
[0069]
This satisfies the condition that "a link of distance 1 always exists between any two nodes in each subgraph, or two or more links of distance 2 or less exist." . G_A1The broken line in (221) is a link with a weak weight added in (b) of the graph division processing.
As is apparent from this processing, the words in the divided subgraphs are connected with a strong connection force. Therefore, these subgraphs are composed of a set of words that are semantically related, and the sub-documents composed of these subgraphs each represent a subtitle of the original document.
[0070]
Also, by generating links in the subgraph in this way, it is expressed on the graph that words included in each subtitle have some degree of association.
3. Graph reconnection processing (step 122):
Subgraph G created by graph division processing_Ai(I = 0, 1,..., P) is divided into a graph G ′ before division._ATo rejoin. At this time, G '_AIs obtained by adding the link generated in the graph division processing (step 121).
[0071]
FIG. 7 shows an example of the graph recombining process. In this example, G in FIG._AThe three subgraphs (G_A1(311), G_A2(312), G_A3(313)) into the original graph to obtain G ′_A(320). G 'created by graph division processing and graph reconnection processing_AHas G between AD and BC_AA link with a weak weight that did not exist has been added.
[0072]
4. Subgraph coincidence measurement processing (steps 132 and 142):
For each of the partial graphs created by the graph division processing, the degree of coincidence is measured using the inter-graph coincidence measurement processing.
5. Matching degree total processing (step 144):
The value obtained by summing the coincidences obtained for each subgraph is set as the coincidence of the entire graph before division.
[0073]
Next, each of the similarity calculation methods (four types) shown in FIG. 5 will be described.
1. Method without using graph partitioning (step 110):
(A) Subject graph G of document element_qAnd G_uIs passed to the inter-graph coincidence measurement processing (step 111).
[0074]
(B) In the inter-graph coincidence measurement process (step 111), these two graphs G_q, G_uThe degree of coincidence between them is measured and output.
The degree of coincidence between subject graphs obtained by this method is defined as the degree of similarity between document elements. This method is fast because no graph division is used.
2. Method using graph division and recombination (step 120):
(A) Subject graph G of document element_qAnd G_uIs passed to the graph division processing (step 121).
[0075]
(B) The graph division processing (step 121)_q, G_uTo a plurality of subgraphs G_qi(I = 0, 1,... P), G_uj(J = 0, 1,..., R) and G_qi, G_ujIf there are no links between any of the nodes, a low weight link is generated between these nodes.
(C) The graph recombining process (step 122) is a subgraph G created in the graph dividing process (step 121)._qi, G_ujTo the state before the graph division again,_q, G '_uCreate As described above, at the time of recombination, the link generated for each subgraph is represented by G '_q, G '_uAdd to
[0076]
(D) The graph-to-graph coincidence measurement processing (step 123) is performed by G ′_qAnd G '_uMeasure and output the degree of coincidence between them.
The degree of coincidence between the subject graphs obtained in this way is defined as the degree of similarity between document elements. In this method, similarity can be determined using indirect word-to-word associations (words included in the same subtitle have some direct association). Can be calculated.
[0077]
3. Method for measuring the degree of coincidence for each subgraph (step 130):
(A) Subject graph G of document element_qAnd G_uAre passed to the graph division processing (step 131).
(B) The graph division processing (step 131) is performed by G ′_q, G '_uTo a plurality of subgraphs G_qi(I = 0, 1,..., P), G_uj(J = 0, 1,..., R) and G_qi, G_ujIf there are no links between any of the nodes in, add a low weight link between these nodes.
[0078]
(C) The subgraph coincidence measurement processing (step 132) is performed for each subgraph G_qiAnd G_ujThe degree of coincidence is measured for all combinations. At this time, the inter-graph coincidence measurement processing (step 133) is used. Outputs the degree of coincidence obtained for each subgraph.
The degree of coincidence for each subgraph obtained in this manner is set as the degree of similarity for each divided sub-document (subtitle). In this method, the similarity can be calculated for each sub-document (subtitle) in the document, not for the entire document.
[0079]
4. Method using the sum of the degrees of coincidence for each subgraph (step 140):
(A) Subject graph G of document element_q, G_uAre passed to the graph division processing (step 141).
(B) The graph division processing (step 141)_q, G_uTo a plurality of subgraphs G_qi(I = 0, 1,..., P), G_uj(J = 0, 1,..., R) and G_qi, G_ujIf there are no links between any of the nodes in, add a low weight link between these nodes.
[0080]
(C) The subgraph coincidence measurement processing (step 142) is performed for each subgraph G_qiAnd G_ujIs measured for all combinations of. At this time, the inter-graph coincidence measurement processing (step 143) is used.
(D) The matching degree total processing (step 144) calculates and outputs the sum of the matching degrees for all the subgraphs obtained in the subgraph matching degree measuring processing (step 142).
[0081]
The sum of the coincidences determined in this way is defined as the similarity between the document elements. According to this method, the similarity between document elements can be determined based on the sum of the similarities for each subtitle in the document, so that a more accurate calculation of the similarity can be performed.
In summary, the method that can perform the processing at high speed is the “method without using graph division (step 110)”.
[0082]
In the “method using graph division and recombination (step 120)”, similarity can be determined using indirect word-to-word association instead of performing complicated processing called graph division processing.
In the "method of measuring the degree of coincidence for each subgraph (step 130)", the similarity of each independent sub-document divided for each subtitle is determined instead of treating each document element as one document. It can be performed.
[0083]
The “method using the sum of the degrees of matching for each subgraph (step 140)” is very complicated, but the similarity of the entire document can be obtained as the sum of the degrees of similarity for each subtitle in the document. Therefore, a more accurate similarity can be determined.
[0084]
【Example】
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, a description will be given of a document search apparatus using the above method.
FIG. 8 shows the configuration of a document search device according to one embodiment of the present invention.
The document search apparatus shown in the figure includes a search interface unit 410, a search key subject graph creation unit 420, a word information management unit 430, a search subject subject graph creation unit 440, a similarity determination unit 450, a search control unit 460, a search target document. It comprises a database 441 and an index file 431.
[0085]
The search interface unit 410 analyzes a search request from the user, extracts a search key, and passes the key to the search control unit 460. The search result is received from the search control unit 460 and returned to the user.
The search key subject graph creation unit 420 creates a subject graph from the search key.
The word information management unit 430 acquires a set of document IDs in which the specified word appears by referring to the index file 431. Here, the index file 431 is a hash table that uses a word as a key and a set of document IDs in which the word appears as a value.
[0086]
When a document ID is specified, the search target document subject graph creation unit 440 acquires a document corresponding to the document ID from the search target document database 441, and creates a subject graph of the document.
The similarity determination unit 450 receives the subject graph of the search key and the subject graph of the search target document, and determines the similarity between them. Here, the method of determining the similarity uses the method of the flowchart of FIG. 5 described above.
[0087]
The search control unit 460 performs the following processing.
(A) Obtain a search key from the search interface unit 410.
(B) From the search key subject graph creation unit 420, a subject graph created from the search key is acquired.
(C) A set of document IDs in which any one of the words in the subject graph appears is acquired from the word information management unit 430.
[0088]
(D) The following processing is executed for each element of the set of these document IDs.
(1) The subject graph of the search target document corresponding to the document ID is acquired from the search target document subject graph creation unit 440.
{Circle around (2)} The similarity between the subject graph of the search target document and the subject graph of the search key is acquired from the similarity determination 450.
[0089]
(E) The set of document IDs is sorted in descending order of similarity, and documents corresponding to the top n document IDs are set as search results.
Hereinafter, the flow of processing will be described using the following example.
FIG. 9 shows an example of creating a subject graph according to one embodiment of the present invention.
Search key Q: (information or document) and search
Document to be searched U: Document consisting of the following 5 sentences
"On the subject of the information.
[0090]
About the subject of the document.
Improve search efficiency.
Search for information.
Search for documents. "
Step 501) The user inputs a search request.
[0091]
Step 502) The search interface unit 410 extracts a search key from the search request input by the user and passes it to the search control unit 460.
Step 503) The search control unit 460 passes the search key to the search key subject graph creation unit 420.
Step 504) The search key subject graph creation unit 420 creates a subject graph from the search key and passes it to the search control unit 460. In this example, the subject key graph 510 of the search key shown in FIG. However, the importance of all words was set to 1.0, and the relevance between words was set to 0.5 for "or" and 1.0 for "and".
[0092]
Step 505) The search control unit 460 passes each word used in the subject graph of the search key to the word information management unit 430.
Step 506) The word information management unit 430 acquires from the index file 431 a set of document IDs in which the word appears at least once, and passes it to the search control unit 460.
[0093]
Step 507) The search control unit 460 passes each element of the set of document IDs acquired from the word information management unit 430 to the search subject document subject graph creation unit 440.
Step 508) The search target document subject graph creation unit 440 acquires a document corresponding to the document ID from the search target document database 441, creates a subject graph of the document, and passes it to the search control unit 460. In this example, the subject graph 520 of the search target document in FIG. However, the importance of the word was a value proportional to the number of appearances, the relevance between words was a value proportional to the number of co-occurrence in the sentence, and unnecessary words were removed.
[0094]
Step 509) The search control unit 460 passes the respective subject graphs obtained from the search subject document subject graph creation unit 440 and the subject graph of the search key to the similarity determination unit 450.
Step 510) The similarity determination unit 450 determines the similarity between the subject graph of the search key and the subject graph of each search target document.
[0095]
This time, a method of determining the similarity will be described as an example of the case where the “method not using graph division” is used. First, a vector representing the importance of the following words is generated from the subject graph 510 of the search key and the subject graph 520 of the search target document in FIG. Where the vector v_Q, V_UIndicate the importance of (information, document, search, subject, efficiency) in the search key Q and the search target document U, respectively, and the importance of a word that does not exist in the graph is set to 0.0.
[0096]
v_Q= (1.0, 1.0, 1.0, 0.0, 0.0) (5)
v_u= (0.6, 0.6, 1.0, 0.6, 0.3) (6)
Dot product f of these vectors_vIs
f_v= 2.2
It becomes.
[0097]
Next, similarly, a vector representing the degree of association is generated. Where the vector r_Q, R_UAre the search key Q, (information and subject, information and document, information and search, information and efficiency, subject and document, subject and search, subject and efficiency, document and search, The degree of relevance between document and efficiency, and search and efficiency) is shown, and the weight of a link that does not exist in the graph is set to 0.
[0098]
r_Q= (0.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0) (7)
r_U= (1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0) (8)
Dot product f of these vectors_vIs
f_v= 2.0
It becomes. Therefore, when p = 1 and q = 1, the degree of coincidence between the subject graph 510 of the search key and the subject graph 520 of the search target document is:
Degree of coincidence = f_U* F_r= 2.2 * 2.0 = 4.4
The similarity between the search key Q and the search target document U in this example is calculated as 4.4.
[0099]
Step 511) The search control unit 460 sorts the document set in descending order based on the similarity of each document obtained from the similarity determination unit 450, and sets the top n items as search results, which are passed to the search interface unit 410. . Here, when the similarity determination unit 450 uses the method of measuring the degree of coincidence for each subgraph, the top n sub-documents are the search results.
[0100]
Step 512) The search interface unit 410 returns a search result to the user. In this example, (document U, 4.4) is the search result.
Next, a document classification apparatus using the present invention will be described.
FIG. 10 shows the configuration of a document classification device according to one embodiment of the present invention.
The document classification device shown in FIG. 7 includes a subject graph creation unit 610, a similarity determination unit 620, a classification unit 630, a classification control unit 640, and a document database 611.
[0101]
In the figure, a subject graph creation unit 610 is exactly the same as the search subject document subject graph creation unit 440 of the above-described document search device.
The subject graph creation unit 610 acquires a document corresponding to the document ID from the document database 611, and creates a subject graph of the document.
When the subject graphs of two documents are input, the similarity determination unit 620 determines the similarity between the two. Here, the similarity determination is performed using the similarity determination method similar to the similarity determination unit 450 of the above-described document search device.
[0102]
The classification unit 630 classifies documents based on the inter-document similarity matrix. Here, the inter-document similarity matrix has the following format.

Where s_ijRepresents the similarity between document i and document j, and s_ij= S_jiAnd s_iiIs infinite.
[0103]
A classification method when the inter-document similarity matrix is given is, for example, clustering in which documents having the highest similarity are sequentially combined. The specific classification method is based on the existing description.
The classification control unit 640 controls the entire classification work.
A series of operations of the above configuration will be described below.
[0104]
Step 601) The user specifies the number of document sets into which the documents in the document database 611 are classified (the number of classifications).
Step 602) The classification control unit 640 passes the document IDs of all the documents included in the document database 611 to the subject graph creation unit 610.
Step 603) The subject graph creation unit 610 acquires a document corresponding to each document ID from the document database 611, creates a subject graph, and passes this to the classification control unit 640.
[0105]
Step 604) The classification control unit 640 passes all the two combinations of the theme graphs acquired from the theme graph creation unit 610 to the similarity determination unit 620.
Step 605) The similarity determination unit 620 determines the similarity between the respective subject graphs, and passes it to the classification control unit 640.
Step 606) The classification control unit 640 creates an inter-document similarity matrix from the similarities between the documents of all the two combinations acquired from the similarity determination unit 620, and sends it to the classification unit 630 together with the number of classifications input by the user. hand over.
[0106]
Step 607) The classification unit 630 classifies the document set based on the inter-document similarity matrix and the number of classifications, and passes the classification result to the classification control unit 640.
Step 608) The classification control unit 640 returns a classification result to the user.
Further, the components of the document search device and the document classification device in the above embodiment are constructed as a program, and a disk device, a floppy disk, a CD-ROM, etc. connected to a computer used as the document search device and the document classification device The present invention can be easily realized by storing the program in a portable storage medium described above and installing it when implementing the present invention.
[0107]
It should be noted that the present invention is not limited to the above-described embodiment, but can be variously modified and applied within the scope of the claims.
[0108]
【The invention's effect】
As described above, according to the present invention, the similarity between document elements can be determined based not only on the importance of words but also on the relevance between words. It can be carried out.
In addition, the similarity between the search key and the search target document should be calculated using not only the search key and the importance of the word in the search target document but also the relevance between the search key and the word in the search target document. Therefore, even if the search key is a sentence or a document, or if the search target is the entire document, a more accurate information search can be expressed.
[0109]
Similarly, similarity between documents can be calculated using not only the importance of the words in the document but also the relevance between words in the document, so that more accurate document classification can be realized.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a diagram illustrating the principle configuration of a document search device according to the present invention.
FIG. 3 is a diagram illustrating the principle configuration of a document classification device according to the present invention.
FIG. 4 is a flowchart illustrating a similarity determination method according to the present invention.
FIG. 5 is a diagram illustrating an operation of a similarity determination process according to the present invention.
FIG. 6 is a diagram for explaining a graph classification process of the present invention.
FIG. 7 is a diagram for explaining a graph recombining process according to the present invention.
FIG. 8 is a configuration diagram of a document search device according to an embodiment of the present invention.
FIG. 9 is a diagram showing an example of creating a subject graph according to one embodiment of the present invention.
FIG. 10 is a configuration diagram of a document classification device according to an embodiment of the present invention.
[Explanation of symbols]
210 Graph before division
211, 222, 223 Subgraph
311, 312, 313 subgraph
320 Recombined graph
410 search interface means, search interface section
420 search key subject graph creation means, search key subject graph creation unit
430 Word information management means, word information management unit
431 Index file
440 Search subject document subject graph creation means, search subject document subject graph creation unit
441 Search target document storage means, search target document database
450 Similarity determining means, similarity determining section
460 Search control means, search control unit
510 Theme graph of search key Q
520 Subject graph of search target document U
610 Theme graph creation means, theme graph creation unit
611 Document storage means, document database
620 Graph similarity determination means, similarity determination section
630 Classification means, classification unit
640 Classification control means, classification unit control unit

Claims

In a similarity determination method for appropriately determining the similarity between document elements,
In the subject graph creating means, when a word string or a combination of Boolean operators of word strings or a plurality of document elements composed of a sentence, a document, or a document set is input, the word used in each document element extracting, importance is given to each of said word extracted, the importance of the single word as node weight were the weight of the link relevance between said word, the subject graph of each document element Generate
A similarity determination method , wherein similarity determination means obtains a similarity between the document elements based on a degree of coincidence between the subject graphs .

When extracting the word,
2. The method according to claim 1, wherein the document element is morphologically analyzed.

When calculating the degree of coincidence between the subject graphs in the similarity determination means ,
The more number of similar nodes of both subject graph (nodes that contain the same word) and a large value of the degree of match between the subject graph,
If have a greater weight to a node within one of the subject graph, as it is long with a large weight in the same node in the other in one subject graph, large the degree of coincidence between both the subject graph Value,
The more the number of similar link both said subject graph (Links word contained in nodes at both ends is the same), the degree of match between the subject graph the both sides a large value,
If have a large weight in the link in one of the subject graph, as are long with a large weight in the same link other in one subject graph, the degree of coincidence between the both subjects graph The similarity determination method according to claim 1, wherein the degree of coincidence between the subject graphs is calculated so as to have a large value.

When calculating the degree of coincidence between the subject graphs in the similarity determination means ,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Rejoin the subgraph,
The similarity determination method according to claim 1, wherein the link generated in the subgraph is added as it is, and the degree of matching between the theme graphs is calculated by returning to the theme graph before division.

When calculating the degree of coincidence between the subject graphs in the similarity determination means ,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes between the subgraphs, generate a link with a small weight in the subgraph,
The similarity determination method according to claim 1, wherein the degree of coincidence is calculated for each of the subgraphs.

When calculating the degree of coincidence between the subject graphs in the similarity determination means ,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each subgraph,
The similarity determination method according to claim 1, wherein a degree of coincidence between the subject graphs is calculated by calculating a sum of degrees of coincidence calculated for each of the subgraphs.

A document search device for searching for a document based on a search request from a user,
Search interface means for analyzing a search request from the user and extracting a search key;
Search key subject graph creating means for generating a subject graph of a search key using the importance of words of the search key and the relevance between words ;
Word information management means for acquiring a set of document IDs of documents in which the specified word appears;
When the document ID is specified, a document corresponding to the document ID is obtained from the search target document storage unit, and a subject graph of the document is created using the importance of words in the document and the relevance between words. Means for creating a subject graph for a search target document;
A similarity determination unit that inputs a subject graph of the search key and a subject graph of the document, and determines how similar they are;
The search interface means, said search key subject graph generator, the word information management means, characterized by having a a search control means for controlling the target document subject graph generator, and the similarity determination unit Document search device.

The search key subject graph creating means,
Word extraction means for extracting a word used in the document element from a word element or a Boolean operator combination of the word string or a sentence or a document or a document element constituted by a document set;
Importance assigning means for assigning importance to each of the extracted words;
Relevance assigning means for assigning a relevancy between any two extracted words,
The search subject document subject graph creating means,
Subject expression means for creating a subject graph in which the importance of the word is the weight of the node and the relevance between the words is the weight of the link,
The similarity determination means,
8. The document search apparatus according to claim 7 , further comprising means for obtaining a degree of similarity between the document elements based on a degree of matching between the subject graph of the search key and the subject graph of the document.

The similarity determination means,
The greater the number of similar nodes (nodes containing the same word) in the subject graph of the search key and the subject graph of the document , the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links (links containing the same word in the nodes at both ends of the link) in both graphs, the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs. 8. The document search apparatus according to claim 7 , further comprising first calculating means for calculating a degree of matching between the subject graph of the search key and the subject graph of the document.

The similarity determination means,
Each subject graph of the the subject chart document of the search key, on the basis of what has each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Rejoin the subgraph,
8. A method according to claim 7 , further comprising the step of: adding the link generated to the partial graph as it is, returning the graph before the division, and calculating the degree of matching between the subject graph of the search key and the subject graph of the document. Document search device as described.

The similarity determination means,
Each subject graph of the the subject chart document of the search key, on the basis of what has each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes between the subgraphs, generate a link with a small weight in the subgraph,
8. The document search apparatus according to claim 7 , further comprising third calculating means for calculating the degree of matching for each of said subgraphs.

The similarity determination means,
Each subject graph of the the subject chart document of the search key, on the basis of what has each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each of the subgraphs,
8. A method according to claim 7 , further comprising: fourth calculating means for calculating a degree of matching between the subject graph of the retrieval key and the subject graph of the document by calculating a sum of the degrees of matching calculated for each of the subgraphs. Document retrieval device.

Subject graph creating means for acquiring a document corresponding to a document ID from a document storage means in which the document is stored, and creating a subject graph using the importance of words of the document and the relevance between words ;
When the subject graphs of two documents are input, a graph similarity determination unit that calculates a degree of coincidence between the two graphs;
Classifying means for classifying the documents based on a matrix representing the similarity between the documents;
A document classification device, comprising: classification control means for controlling the entire classification work.

The subject graph creating means,
To measure the degree of agreement between the subject graphs ,
Word extraction means for extracting a word used in the document element from a word element or a Boolean operator combination of the word string or a sentence or a document or a document element constituted by a document set;
Importance assigning means for assigning importance to each of the extracted words;
A relevance assigning means for assigning a relevance between any two extracted words;
Subject expression means for generating the theme graph as the importance of the word as the weight of the node, and the relevance between the words as the weight of the link,
The graph similarity determination means,
15. The document classification device according to claim 14 , further comprising means for calculating a similarity between the document elements based on a degree of matching between the subject graphs .

The graph similarity determination means,
The greater the number of similar nodes (nodes containing the same word) in both graphs, the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links (links containing the same word in the nodes at both ends of the link) in both graphs, the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs. 14. The document classification apparatus according to claim 13 , further comprising a first calculation unit that calculates a degree of coincidence between the subject graphs .

The graph similarity determination means,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Rejoin the subgraph,
14. The document classification device according to claim 13 , further comprising a second calculation unit that adds the link generated to the subgraph as it is, returns the graph before the division, and calculates the degree of coincidence between the subject graphs .

The graph similarity determination means,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes between the subgraphs, generate a link with a small weight in the subgraph,
14. The document classification device according to claim 13 , further comprising third calculating means for calculating a degree of coincidence for each of said subgraphs.

The graph classification degree determination means,
Dividing each of the subject graphs into subgraphs based on how strongly the word sets used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each of the subgraphs,
14. The document classification device according to claim 13 , further comprising: fourth calculating means for calculating a degree of coincidence between the subject graphs by calculating a sum of degrees of coincidence calculated for each of the subgraphs.

A storage medium storing a document search program for searching for a document based on a search request from a user,
On the computer,
A search interface step of analyzing a search request from the user and extracting a search key;
A search key theme graph creating step of generating a theme graph of the search key using the importance of the search key and the degree of association between words ;
A word information management step of acquiring a set of document IDs of documents in which the specified word appears;
When the document ID is specified, a document corresponding to the document ID is obtained from the search target document storage unit in which the search target document is stored, and the importance of words in the document and the degree of association between words are used. a search target document subject graph generating step of generating the subject chart document Te,
A similarity determination step of inputting a subject graph of the search key and a subject graph of the document of the search target document and determining how similar they are ;
A storage medium storing a document search program, wherein the storage medium stores a program for executing the program .

The search key subject graph creating step includes:
A word string or a Boolean operator combination of the word strings, or a sentence or a document or a document element formed of a document set, a word extraction step of extracting a word used in the document element;
An importance assigning step of assigning an importance to each of the extracted words;
And a relevancy assigning step of assigning a relevance between any two extracted words .
The search subject document graph creation step includes:
Executing a subject expression step of generating a subject graph in which the importance of the word is set as the weight of the node and the relevance between the words is set as the weight of the link;
The similarity determination step includes:
20. A storage medium storing the document search program according to claim 19 , further comprising: executing a step of calculating a similarity between the document elements based on a degree of matching between the subject graph of the search key and the subject graph of the document.

The similarity determination step includes:
The greater the number of similar nodes (nodes containing the same word) in both the subject graph of the search key and the subject graph of the document , the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links (links containing the same word in the nodes at both ends of the link) in both graphs, the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs. 20. A storage medium storing a document search program according to claim 19, wherein a first calculation step of calculating a degree of coincidence between the subject graph of the search key and the subject graph of the document is performed .

The similarity determination step includes:
Each graph of the subject chart of the document the subject graph of the search key, based on whether the each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each subgraph,
Said portion by calculating the sum of the degree of match calculated for each graph, the search second claim 19 to perform the calculation step of calculating a degree of match between the subject graph subject chart and the document key A storage medium storing the described document search program.

The similarity determination step includes:
Each graph of the subject chart of the document the subject graph of the search key, based on whether the each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes between the subgraphs, a link having a small weight is generated for the sublink;
20. A storage medium storing a document search program according to claim 19 , wherein a third calculation step of calculating a degree of coincidence for each of the subgraphs of the subject graph of the search key and the subject graph of the document is executed .

The similarity determination step includes:
Each graph of the subject chart of the document the subject graph of the search key, on the basis of the on whether the each other associated with the intensity of how word set is used in the graph is divided into subgraphs ,
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each subgraph,
By calculating the sum of the degree of the calculated matching for each of the subgraphs, claim 19 of executing the fourth calculation step of calculating a degree of match between the subject chart of the document the subject graph of the search key A storage medium storing the described document search program.

A storage medium storing a document content Ruipu program for classifying documents,
On the computer,
A subject graph creating step of acquiring a document corresponding to the document ID from the document storage unit in which the document is stored, and creating a subject graph using the importance of words of the document and the degree of relevance between words ;
When the subject graphs of the two documents are input, a graph classification determining step of determining the degree of coincidence between them,
A storage medium storing a document classification program, which stores a program for executing a classification step of classifying the documents based on a matrix representing the similarity between the documents.

The subject graph creation step includes:
To measure the degree of agreement between the subject graphs ,
A word string or a Boolean operator combination of the word strings, or a sentence or a document or a document element formed of a document set, a word extraction step of extracting a word used in the document element;
An importance assigning step of assigning an importance to each of the extracted words;
A relevance assigning step of assigning a relevancy between any two extracted words;
A theme expression step of generating the theme graph with the importance of the word as the weight of the node and the relevance between the words as the weight of the link;
26. A storage medium storing the document classification program according to claim 25 , wherein a document element similarity determination step of determining a similarity between the document elements based on a degree of matching between the subject graphs is performed .

The graph similarity determination step includes:
The greater the number of similar nodes (nodes containing the same word) in both subject graphs, the greater the degree of matching between the graphs,
If a node in one graph is heavily weighted, the greater the weight of a similar node in the other graph, the greater the degree of matching between both graphs,
The greater the number of similar links (links containing the same word in the nodes at both ends of the link) in both graphs, the greater the degree of matching between both graphs,
If a link in one graph is heavily weighted, the greater the weight of a similar link in the other graph, the greater the degree of matching between the two graphs. 26. The storage medium storing the document classification program according to claim 25 , wherein a first step of calculating a degree of coincidence between the subject graphs is performed so as to perform the first step .

The graph similarity determination step includes:
Each of the subject graph based on whether the each other associated with the intensity of how word set is used in the graph is divided into subgraphs,
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Rejoin the subgraph,
26. The document classification program according to claim 25, wherein the generated link is added to the subgraph as it is, and a second calculation step of calculating the degree of coincidence between the subject graphs by returning to the graph before division is executed . Storage medium.

The graph similarity determination step includes:
Dividing each of the subject graphs into subgraphs based on how strongly the set of words used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
26. A storage medium storing the document classification program according to claim 25, wherein a third calculation step of calculating a degree of coincidence is performed for each subgraph.

The graph similarity determination step includes:
Dividing each of the subject graphs into subgraphs based on how strongly the set of words used in the subject graphs are related;
If each of the subgraphs has no link between any nodes in the subgraph, generate a link with a small weight in the subgraph;
Calculate the degree of match for each subgraph,
26. A storage storing the document classification program according to claim 25 , wherein a fourth calculation step of calculating a degree of coincidence between the subject graphs is performed by calculating a sum of degrees of coincidence calculated for each of the subgraphs. Medium.