JP4769151B2

JP4769151B2 - Document set analysis apparatus, document set analysis method, program implementing the method, and recording medium storing the program

Info

Publication number: JP4769151B2
Application number: JP2006237663A
Authority: JP
Inventors: 浩之戸田; 考藤村; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2006-09-01
Filing date: 2006-09-01
Publication date: 2011-09-07
Anticipated expiration: 2026-09-01
Also published as: JP2008059442A

Description

The present invention relates to data mining technology.

Techniques for searching and mining document sets (also called document data sets) that contain text, such as web pages, blog articles, and news articles, are now widely known.

With such techniques, a user who handles a large number of documents often wants to obtain information about them, for example "I want to know the main topics present in the document set" or "I want to access the group of information related to a specific topic in the document set."

The following methods are known for meeting these needs.

One is a method that uses a clustering algorithm (see, for example, Non-Patent Document 1). In this method, each document is represented as a word vector, and similar vectors are merged using the similarity between vectors (cosine similarity or the like), so that documents on similar topics (or topic words) are identified as a cluster. The needs described above are met by regarding each cluster as a set of information related to a specific topic.

Another known method uses topic word extraction (see, for example, Patent Document 1). This method uses a technique that extracts, from a document set, keywords related to specific topics in that set on the basis of the frequency and distribution of keyword occurrences. The needs described above are met by regarding the documents that contain a particular extracted keyword as a set of documents related to a specific topic.

As related techniques, methods of calculating the centrality of each node (document) by regarding documents as nodes, such as PageRank (see, for example, Non-Patent Document 2), are known. Search engines on the web that can be used to specify a set of documents (see, for example, Non-Patent Document 3) are also widely known, as are techniques for representing a document as a word vector (see, for example, Non-Patent Document 4).

Patent Document 1: JP 2005-208838 A (paragraphs [0066] to [0144], etc.)
Non-Patent Document 1: D. Cutting, D. Karger, J. Pedersen, and J. Tukey, "Scatter/Gather: a cluster-based approach to browsing large document collections", Proc. of SIGIR 1992, ACM, June 1992, pp. 318-329.
Non-Patent Document 2: S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine", Proc. of WWW7, Elsevier Science, April 1998, pp. 107-117.
Non-Patent Document 3: NTT Resonant Inc., "Portal Site goo", [online], 2006, NTT Resonant Inc., [searched July 31, 2006], Internet <URL: http://www.goo.ne.jp/>
Non-Patent Document 4: Kenji Kita, Kazuhiko Tsuda, and Masami Shishibori, "Information Retrieval Algorithms", Kyoritsu Shuppan, January 2002.

The techniques for meeting the above needs for document-related information are known to have the following problems.

The method using a clustering algorithm assumes that every document belongs to some cluster. Real data, however, contains documents that are unrelated to the others and belong to a so-called "other" category, so appropriate clustering is not always possible, and as a result the information obtained in response to the above needs contains a great deal of noise.

In the method using topic word extraction, the rule for grouping documents into a set is the very simple one of whether or not a document contains the keyword, so the resulting set of documents is not necessarily useful. For example, a single keyword output by this method is often related to several topics, and conversely a single topic is often related to several keywords.

The present invention has been made in view of these problems, and its object is to provide a document set analysis apparatus, a document set analysis method, a program implementing the method, and a recording medium storing the program, which identify only document sets with strong connections, classify those document sets into clusters, and clearly analyze the role of each document in a cluster.

To solve the above problems, the invention according to claim 1 is a document set analysis apparatus that specifies the roles of documents on the basis of the relationships between documents in a document set managed by document data management means, the apparatus comprising:
document set specifying means for specifying the document set on the basis of a document set specifying condition input from input means;
similarity evaluation means for evaluating the similarity with respect to topic words between the documents included in the specified document set;
relationship extraction means for extracting relationships between documents on the basis of the similarity evaluated by the similarity evaluation means;
centrality determination means for calculating, on the basis of the relationships between documents extracted by the relationship extraction means, the centrality of each document as an index of how strongly that document is related to the other documents;
graph structure construction means for representing the document set as a three-dimensional graph structure, on the basis of the relationships between documents obtained by the relationship extraction means and the centrality of each document, in which the relationships between documents are expressed in two-dimensional coordinates and the centrality is expressed as a third coordinate;
vertex node extraction means for extracting, as a vertex node, a document node of the document set whose centrality score is a local maximum in the obtained graph structure;
mountain node group specifying means for specifying, as a mountain node group, a document node group consisting of the extracted vertex node and one or more document nodes that are connected to the vertex node and whose centrality scores are lower than that of the vertex node and are not local minima;
labeling means for assigning to the extracted vertex node a label indicating that the node serves as a topic word of the document set, and assigning to the specified mountain node group a label indicating that the node group serves as a set of documents related to the topic word; and
information output means for visualizing and outputting the document node representing the vertex node labeled as serving as the topic word and the document node group representing the mountain node group labeled as serving as the set of documents related to the topic word.

The invention according to claim 2 is a computer-executed document set analysis method that specifies the roles of documents on the basis of the relationships between documents in a document set managed by document data management means, the method comprising:
a document set specifying step of specifying the document set on the basis of a document set specifying condition input from input means;
a similarity evaluation step of evaluating the similarity with respect to topic words between the documents included in the specified document set;
a relationship extraction step of extracting relationships between documents on the basis of the similarity evaluated by the similarity evaluation means;
a centrality determination step of calculating, on the basis of the relationships between documents extracted by the relationship extraction means, the centrality of each document as an index of how strongly that document is related to the other documents;
a graph structure construction step of representing the document set as a three-dimensional graph structure, on the basis of the relationships between documents obtained by the relationship extraction means and the centrality of each document, in which the relationships between documents are expressed in two-dimensional coordinates and the centrality is expressed as a third coordinate;
a vertex node extraction step of extracting, as a vertex node, a document node of the document set whose centrality score is a local maximum in the obtained graph structure;
a mountain node group specifying step of specifying, as a mountain node group, a document node group consisting of the extracted vertex node and one or more document nodes that are connected to the vertex node and whose centrality scores are lower than that of the vertex node and are not local minima;
a labeling step of assigning to the extracted vertex node a label indicating that the node serves as a topic word of the document set, and assigning to the specified mountain node group a label indicating that the node group serves as a set of documents related to the topic word; and
an information output step of visualizing and outputting the document node representing the vertex node labeled as serving as the topic word and the document node group representing the mountain node group labeled as serving as the set of documents related to the topic word.

The invention according to claim 3 is a document set analysis program characterized in that the document set analysis method according to claim 2 is written as a computer program executable by a computer.

The invention according to claim 4 is a recording medium characterized in that the document set analysis method according to claim 2 is written as a computer program executable by a computer and that computer program is recorded on the medium.

According to the inventions of claims 1 and 2, relationships between documents based on centrality can be obtained.

Also, according to the inventions of claims 1 and 2, a graph structure based on the relationships between documents can be obtained.

According to the invention of claim 3, the document set analysis method according to claim 2 can be written as a computer program.

According to the invention of claim 4, a computer program implementing the document set analysis method according to claim 2 can be recorded on a recording medium.

As described above, according to the inventions of claims 1 and 2, the role of each document can be clearly analyzed on the basis of the relationships between documents.

Also, according to the inventions of claims 1 and 2, only document sets with strong connections can be identified.

According to the invention of claim 3, a computer program implementing the document set analysis method according to claim 2 can be provided.

According to the invention of claim 4, a recording medium on which a computer program implementing the document set analysis method according to claim 2 is recorded can be provided.

These effects contribute to the field of data mining technology.

Embodiments of the present invention are described in detail below with reference to the drawings. The document set analysis apparatus of this embodiment identifies the topics (that is, topic words) present in retrieved news articles, clusters the documents related to each identified topic, and further performs document analysis that clarifies the position of each document within its cluster.

The configuration of the document set analysis apparatus of this embodiment is described with reference to FIG. 1.

The document set analysis apparatus consists of a document set specifying unit 10, a similarity evaluation unit 20, a relationship extraction unit 30, a centrality determination unit 40, an information analysis unit 50, an information output unit 60, and document data management means 70 (for example, a document database (DB)). The information analysis unit 50 in turn consists of a graph structure construction unit 51, a vertex node extraction unit 52, a mountain node group specifying unit 53, and a labeling unit 54.

The document set specifying unit 10 accesses the document data management means 70 and specifies a document set consisting of multiple documents, on the basis of either a designation or request that includes a document set specifying condition (for example, a designation or request from a user) or a predetermined document set specifying condition. The document set specifying condition may be input through input means provided in advance (for example, a keyboard).

The similarity evaluation unit 20 evaluates the similarity between the documents in the document set with respect to topics (or topic words). Possible ways of evaluating the similarity between documents include representing each document as a word vector and using cosine similarity (see, for example, Non-Patent Document 1), and a language-model-based evaluation in which a language model is built from one document and the probability that the other document is generated from that model is assessed.
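As an illustration of the word-vector option, the following is a minimal sketch of cosine similarity between two documents. It assumes that each document has already been split into a list of words and uses raw word counts as vector elements; the function name `cosine_similarity` is hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents given as lists of words.

    Assumption: raw word counts are used as the vector elements.
    """
    vec_a, vec_b = Counter(doc_a), Counter(doc_b)
    # Dot product over the words the two documents share.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)
```

For instance, the word lists `["quake", "tokyo"]` and `["quake", "osaka"]` share one of their two words and score about 0.5, while documents with no words in common score 0.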

The relationship extraction unit 30 determines whether or not a relationship exists between documents on the basis of the inter-document similarity evaluated by the similarity evaluation unit 20. For example, when the relationships between documents are expressed as a matrix A, they may be defined by the following equation.

Here, TopSim_p(i) denotes the set of the p documents most similar to document i. If all similarities were used, low similarities would generally tend to act as noise, so links are set only between documents with high similarity. sim(i, j) denotes the cosine similarity between document i and document j when the documents are represented as word vectors with log tf-idf weights (see, for example, Non-Patent Document 4). The log tf-idf weight is the weight of each element when an individual document is represented as a vector.
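The equation itself is not reproduced in this text. One reading consistent with the definitions above is that entry (i, j) of A equals sim(i, j) when document j belongs to TopSim_p(i) and 0 otherwise. The sketch below implements that reading as an assumption; the function name `build_link_matrix`, the exclusion of self-links, and the tie-breaking by index are illustrative choices, not details from the patent.

```python
def build_link_matrix(sim, p):
    """Keep, for each document i, links only to its p most similar documents.

    sim is a square list of lists of pairwise similarities.
    Assumed form: A[i][j] = sim[i][j] if j is in TopSim_p(i), else 0.
    """
    n = len(sim)
    links = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]  # Assumption: no self-links.
        # TopSim_p(i): the p documents most similar to i (ties broken by index).
        top = sorted(others, key=lambda j: (-sim[i][j], j))[:p]
        for j in top:
            links[i][j] = sim[i][j]
    return links
```

Because each row keeps its own top p entries, the resulting matrix is generally not symmetric: document i may link to document j without j linking back, which is what makes the later distinction between one-way and bidirectional links meaningful.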

Furthermore, when all the similarities are used as described above, there are links whose weights are clearly smaller than those of other links. It is therefore conceivable to remove those outlinks that are followed with only a very small probability. This operation is expressed by the following equation.

Here, l_{i,q} denotes the total of the transition probabilities obtained by arranging the outlinks from node i in descending order of transition probability and adding them up until the threshold q is exceeded. TopLink_q(i) denotes the set of destination nodes of the links that were added.
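The pruning equation is likewise not reproduced in this text. The sketch below shows one way to realize the description above: each row of link weights is normalized into transition probabilities, the outlinks are accumulated in descending order until the running total exceeds q, and the remaining links are dropped. Dividing the kept probabilities by the accumulated total (so that each row sums to 1 again) is an assumption, as is the function name `prune_outlinks`.

```python
def prune_outlinks(weights, q):
    """Keep each node's strongest outlinks until their cumulative
    transition probability exceeds the threshold q; drop the rest."""
    pruned = []
    for row in weights:
        total = sum(row)
        new_row = [0.0] * len(row)
        if total > 0:
            probs = [w / total for w in row]
            order = sorted((j for j, pr in enumerate(probs) if pr > 0),
                           key=lambda j: (-probs[j], j))
            kept, cumulative = [], 0.0
            for j in order:
                kept.append(j)
                cumulative += probs[j]
                if cumulative > q:
                    break
            # cumulative plays the role of l_{i,q}; kept that of TopLink_q(i).
            for j in kept:
                # Assumption: renormalize so the kept probabilities sum to 1.
                new_row[j] = probs[j] / cumulative
        pruned.append(new_row)
    return pruned
```

With outlink weights 6, 3, and 1 and q = 0.8, the two strongest links are kept (0.6 + 0.3 exceeds 0.8) and the weakest is removed.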

The centrality determination unit 40 treats the relationships between documents obtained by the relationship extraction unit 30 as a graph structure in which documents are nodes and the relationships between documents are edges weighted by the inter-document similarity, and calculates the centrality of each node (document). The centrality may be obtained, for example, by simply counting the number of links or by using PageRank (see Non-Patent Document 2).
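As a sketch of the PageRank option, the following computes centrality scores on a weighted, directed link matrix by power iteration. The damping factor of 0.85, the fixed iteration count, and the even redistribution of score from nodes without outlinks are conventional choices assumed here; the patent does not state them. The function name `pagerank` is hypothetical.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    weights[i][j] is the weight of the link from node i to node j.
    """
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            total = sum(row)
            if total == 0:
                # Dangling node: spread its score evenly over all nodes.
                for j in range(n):
                    new_scores[j] += damping * scores[i] / n
            else:
                for j, w in enumerate(row):
                    if w > 0:
                        new_scores[j] += damping * scores[i] * w / total
        scores = new_scores
    return scores
```

The scores sum to 1, and a document that many similar documents link to ends up with a higher score than a document that nothing links to.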

The graph structure construction unit 51 builds a three-dimensional graph structure from the relationships between documents and the centrality score of each document obtained by the centrality determination unit 40: the graph structure representing the relationships between documents is laid out on a two-dimensional plane (for example, the xy plane), and the centrality score of each document is assigned to the third dimension (for example, the z axis). FIG. 2 is a conceptual diagram of this three-dimensional graph structure and is described later.

The vertex node extraction unit 52 extracts as vertices, from the graph structure built by the graph structure construction unit 51, those nodes (which correspond one-to-one with documents) whose centrality is higher than that of every other node connected to them by an edge.
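The extraction of vertex nodes amounts to finding local maxima of centrality over the graph. A minimal sketch follows, assuming that two nodes count as connected when a link exists in either direction and that a node with no connections is not treated as a vertex; both points, and the function name `extract_peak_nodes`, are assumptions.

```python
def extract_peak_nodes(links, centrality):
    """Return the nodes whose centrality exceeds that of every neighbour."""
    n = len(links)
    peaks = []
    for i in range(n):
        # Assumption: a link in either direction makes two nodes neighbours.
        neighbours = [j for j in range(n)
                      if j != i and (links[i][j] > 0 or links[j][i] > 0)]
        # Assumption: an isolated node is not counted as a vertex node.
        if neighbours and all(centrality[i] > centrality[j] for j in neighbours):
            peaks.append(i)
    return peaks
```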

The mountain node group specifying unit 53 follows the graph structure from each vertex node extracted by the vertex node extraction unit 52 in the direction of decreasing centrality and identifies the mountain formed by the nodes. In other words, the documents have been clustered once the processing up to the mountain node group specifying unit 53 is complete.
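The descent from a vertex node can be sketched as a breadth-first traversal that only steps to neighbours with strictly lower centrality. This is an illustrative simplification: the exclusion of local-minimum nodes stated in the claims is not applied here, and the function name `find_mountain` is hypothetical.

```python
from collections import deque


def find_mountain(links, centrality, peak):
    """Collect the nodes reachable from peak along edges on which
    centrality strictly decreases."""
    n = len(links)
    members = {peak}
    queue = deque([peak])
    while queue:
        current = queue.popleft()
        for j in range(n):
            connected = links[current][j] > 0 or links[j][current] > 0
            if connected and j not in members and centrality[j] < centrality[current]:
                members.add(j)
                queue.append(j)
    return members
```

In this sketch a node lying in the valley between two vertex nodes can be reached from both and therefore appears in both mountains.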

The labeling unit 54 labels the vertex nodes extracted by the vertex node extraction unit 52, the mountain-shaped node groups centered on those vertices that were identified by the mountain node group specifying unit 53, and the relationships among them.

The graph structure and centrality are now explained. By the definition of the centrality score, nodes in areas with many edges have high scores. Consider a model in which a person walks from node to node along the graph structure (that is, a user browses nodes by following the graph). In such high-centrality areas, many transitions stay within the area and the nodes are highly related to one another. In other words, the area consists of nodes related to the same topic. Each mountain in FIG. 2 can therefore be considered to correspond to a different topic.

Documents can also be considered to have different characteristics depending on the position of their nodes within a mountain in FIG. 2. The characteristics of the documents corresponding to each kind of node are described below, together with how the role in the document set is determined for nodes with each characteristic.

The first kind of node in FIG. 2 is the node at the top of a mountain (for example, the nodes labeled a1 and b1); each mountain has exactly one such node. These nodes receive the most state transitions from the surrounding nodes and are the most closely related to them, so they can be said to be the documents that best express the topic. In other words, the document represented by a vertex node identifies the topic of its area. Hereinafter, a document (node) that identifies the topic of an area is labeled a core document (or core node).

The second kind of node is a node close to the vertex (for example, the nodes labeled a2, a3, a4 and b2, b3 in FIG. 2). These are nodes that can be reached from the core node, directly or indirectly, by following only bidirectional links. A bidirectional link means that two nodes link to each other, which indicates high relevance. These nodes have many state transitions to and from the core node, and the content of their documents is also highly related to the core document. Hereinafter, a document (node) that is highly related to the core document is labeled a supplemental document (or supplemental node).

The third kind of node is a node that links to a core node or a supplemental node, such as the nodes labeled a5, a6, a7, and b4 in FIG. 2. Such a node has a higher probability of transitioning to the core node or supplemental nodes of a specific topic than of transitioning to outside nodes or to itself. These nodes are not necessarily central to the topic but contain information related to it, and often contain highly novel information such as information on the periphery of the topic. Hereinafter, such a document (node) is labeled a subtopic document (or subtopic node).

The last kind of node is a node that has no strong relationship with the nodes of any topic, for example the node labeled c1 in FIG. 2. Such a node has few similar nodes and a high self-transition probability. Hereinafter, such a document (node) is labeled an outlier document (or outlier node). Allowing outlier documents to exist prevents "other" documents from being forced into some cluster and becoming a source of noise.
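The four roles described above can be sketched as follows, given a directed link matrix and the list of core nodes. The sketch simplifies the subtopic rule to "has a link to a core or supplemental node" and does not compare transition probabilities toward the topic with those toward outside nodes or the node itself; that simplification and the function name `label_roles` are assumptions.

```python
from collections import deque


def label_roles(links, cores):
    """Assign 'core', 'supplemental', 'subtopic', or 'outlier' to each node."""
    n = len(links)
    roles = {c: "core" for c in cores}
    # Supplemental: reachable from a core node via bidirectional links only.
    queue = deque(cores)
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j not in roles and links[i][j] > 0 and links[j][i] > 0:
                roles[j] = "supplemental"
                queue.append(j)
    central = set(roles)  # Core and supplemental nodes found so far.
    for i in range(n):
        if i not in roles:
            # Subtopic: links into the core/supplemental part of a topic.
            if any(links[i][j] > 0 for j in central):
                roles[i] = "subtopic"
            else:
                roles[i] = "outlier"
    return roles
```

For a four-node graph in which node 0 is the core, nodes 0 and 1 link to each other, node 2 links one-way to node 1, and node 3 has no links, the result labels node 1 as supplemental, node 2 as subtopic, and node 3 as outlier.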

Based on the above, labels are assigned as follows.

First, each node is given, as a label (that is, core node), information on which topic the document is related to and what role the document plays in expressing that topic.

Next, each mountain-shaped node group is given a label (that is, supplemental node) as the cluster of documents related to the topic expressed by its vertex node.

Then, each combination of mountain node groups is labeled (that is, subtopic node or outlier node), on the basis of how the groups are connected, with the degree of relevance between the topics expressed by the two mountains.

The information output unit 60 displays (visualizes) the contents of the document set to the user, using the relationships between nodes, the centrality of each node, and the role of each node in the document set obtained by the information analysis unit 50. The visualization is performed on, for example, a display device.

FIG. 3 shows an example of visualization using a three-dimensional image (for example, a 3D map). FIG. 3 visualizes the document set of search results obtained for newspaper articles using the search keyword (that is, topic word) "earthquake". The nodes labeled CN in FIG. 3 are core nodes.

Mountain-shaped portions can be seen in FIG. 3. Each mountain represents one topic, and the documents corresponding to the nodes belonging to a mountain are the group of documents on that topic.

FIG. 4 shows an enlarged view of part of FIG. 3. In this enlarged view, two mountains can be seen to be connected by supplemental nodes (for example, the node labeled SN), which indicates that the two mountains are highly related. In fact, both mountains concern earthquakes that occurred in Japan, which also shows that the proposed method makes it possible to discover relationships between topics.

For example, each mountain-shaped structure in FIG. 4 represents a single topic, such as "the Great Hanshin Earthquake" or "the Great Kanto Earthquake". The central part of a mountain consists of nodes for the main content of the event that actually occurred (for example, that an earthquake occurred), and the rest consists of nodes for information accompanying the main topic (for example, news about post-earthquake fires or reconstruction support).

The document data management means 70 is a document data storage device (for example, a device including a hard disk drive or memory) with a search function that can specify a document set according to conditions such as a search keyword designated by the user or the last update date of documents. It may be built by collecting information in advance from the web or the like. Alternatively, a search engine on the web (see Non-Patent Document 3) may be used as is as the document data management means 70.

本実施形態における文書集合分析方法を図５に基づいて説明する。 A document set analysis method according to this embodiment will be described with reference to FIG.

まず、ユーザから指定、もしくは、予め決められた文書集合特定条件を入力手段から読み込む（Ｓ１０１）。なお、入力手段は、例えば、キーボード装置などが想定できる。 First, a document set specifying condition designated by the user or predetermined is read from the input means (S101). The input means can be assumed to be a keyboard device, for example.

Next, the document set specifying unit 10 identifies the set of documents that match the document set specifying condition (S102).

Next, the similarity evaluation unit 20 calculates the similarity between each pair of documents in the document group identified by the document set specifying unit 10 (S103).

Next, the relationship extraction unit 30 extracts strongly related pairs based on the calculated similarities and identifies the relationships together with their weights (S104).

Next, the centrality determination unit 40 computes an index (for example, PageRank) based on the information identified by the similarity evaluation unit 20 and the relationship extraction unit 30, and determines the centrality of each node (S105).

Next, the graph structure construction unit 51 constructs a graph structure in which nodes (in one-to-one correspondence with documents) are arranged in a three-dimensional space, based on the information obtained from the similarity evaluation unit 20, the relationship extraction unit 30, and the centrality determination unit 40 (S106).

Next, the vertex node extraction unit 52 extracts vertex nodes based on the graph structure obtained by the graph structure construction unit 51 (S107).

Next, the mountain-shaped node group specifying unit 53 extracts mountain-shaped node groups from the graph structure obtained by the graph structure construction unit 51 and the vertex nodes obtained by the vertex node extraction unit 52 (S108).

Next, the labeling unit 54 labels the nodes, the mountain-shaped node groups, and the relationships between node groups, based on the information obtained by the graph structure construction unit 51, the vertex node extraction unit 52, and the mountain-shaped node group specifying unit 53 (S109).

Finally, the labeled nodes, mountain-shaped node groups, and relationships between node groups are visualized as a list or a 3D map (S110).
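Steps S103 to S105 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the bag-of-words cosine similarity, the fixed `threshold` for selecting strongly related pairs, and the treatment of the relationships as an undirected weighted graph are all assumptions made here. PageRank is the index the text names as an example, and it is computed by plain power iteration. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarity(docs):
    """S103: similarity for every document pair (i, j) with i < j."""
    vecs = [Counter(d.lower().split()) for d in docs]
    return {(i, j): cosine(vecs[i], vecs[j])
            for i, j in combinations(range(len(docs)), 2)}


def extract_relations(sims, threshold=0.3):
    """S104: keep strongly related pairs as weighted edges.

    The threshold value is an assumption made for this sketch.
    """
    return {pair: s for pair, s in sims.items() if s >= threshold}


def weighted_pagerank(n, edges, damping=0.85, iterations=50):
    """S105: PageRank by power iteration on an undirected weighted graph."""
    nbrs = {i: {} for i in range(n)}
    for (i, j), w in edges.items():
        nbrs[i][j] = w
        nbrs[j][i] = w
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without edges is spread uniformly over all nodes.
        dangling = sum(rank[i] for i in range(n) if not nbrs[i])
        new = [(1 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            total = sum(nbrs[i].values())
            for j, w in nbrs[i].items():
                # Each node passes its rank on in proportion to edge weight.
                new[j] += damping * rank[i] * w / total
        rank = new
    return rank
```

Calling `weighted_pagerank(len(docs), extract_relations(pairwise_similarity(docs)))` yields one centrality score per document; documents that are strongly similar to many others receive higher scores.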

As described above, according to this embodiment, the similarity between documents is determined for the document set given by the document set specifying condition; strongly connected document pairs are identified, with weights, based on that similarity; the connections between documents are treated as a graph structure and the centrality of each document is calculated from them; and, using the relationships between documents and the centrality value of each document, the document group is treated as a graph structure arranged in three dimensions, with the position of each document determined from the positional relationships. This makes it possible to obtain "the main topics" contained in the document set, "the documents related to each topic", "the role of each document among the documents related to a topic", "the relationships between topics", and so on.

In more detail, in this embodiment the mutual similarity between documents is evaluated for the document set identified in response to a user request, and the relationships between documents are identified based on that similarity. The centrality of each document is then evaluated based on these relationships. Using both the relationships between documents and the centrality of the individual documents makes it possible to detect specific topics in the document set, cluster the documents belonging to a specific topic, and clarify the position of each document within its cluster.

In addition, the document set is regarded as a three-dimensional graph structure based on the relationships between documents and the centrality value of each document, and identifying the vertices and mountain-shaped node groups in that structure makes it possible to detect specific topics in the document set, cluster the documents belonging to a specific topic, and clarify the position of each document within its cluster.
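The identification of vertices and mountain-shaped node groups can be sketched as below. This is one possible reading, given as an assumption rather than the method of the embodiment: a vertex node is taken to be a node whose centrality is higher than that of all its neighbors, and a mountain-shaped node group is taken to be the vertex node plus every node reachable from it by walking strictly downhill in centrality, leaving out nodes that are local minima. Nodes without any edge are not treated as vertex nodes here. The function names are hypothetical.

```python
def neighbors_of(n, edges):
    """Build an adjacency map from an iterable of undirected (i, j) pairs."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def vertex_nodes(centrality, adj):
    """Nodes whose centrality exceeds that of every neighbor (local maxima)."""
    return [i for i in adj
            if adj[i] and all(centrality[i] > centrality[j] for j in adj[i])]


def is_local_minimum(i, centrality, adj):
    """True if node i has neighbors and is lower than all of them."""
    return bool(adj[i]) and all(centrality[i] < centrality[j] for j in adj[i])


def mountain_group(peak, centrality, adj):
    """Peak plus the nodes reached by walking downhill, skipping local minima."""
    group, stack = {peak}, [peak]
    while stack:
        cur = stack.pop()
        for nxt in adj[cur]:
            # Only move to unvisited nodes with strictly lower centrality.
            if nxt in group or centrality[nxt] >= centrality[cur]:
                continue
            if is_local_minimum(nxt, centrality, adj):
                continue
            group.add(nxt)
            stack.append(nxt)
    return group
```

With this reading, a low-centrality node lying between two peaks belongs to neither mountain, so the two groups stay separate even when the underlying graph is connected.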

Some or all of the functions of the means in the document set analysis apparatus of this embodiment may be implemented as a computer program and executed on a computer to realize the present invention, and the procedures of the document set analysis method of this embodiment may likewise be implemented as a computer program and executed by a computer. Needless to say, a program for realizing these functions on a computer can be recorded on a computer-readable recording medium, for example an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk), DVD (Digital Versatile Disk), or removable disk, and then stored or distributed. The program can also be provided over a network, such as the Internet or by e-mail.

Furthermore, a computer program describing the method of the document set analysis apparatus described above may be implemented so as to access a memory, an external storage device, or the like that stores the input/output data required by that method.

Although an embodiment of the present invention has been described above, the present invention is not limited to the described embodiment, and various modifications can be made within the scope set out in each claim.

For example, the information analysis unit in this embodiment consists of the means from the graph structure construction unit through the labeling unit, but it is not limited to these means; other processing means that regard the document group and the relationships between documents as a graph structure are also conceivable. More specifically, the labeling unit may identify roles using finer-grained labels (for example, five or more levels).

FIG. 1 is a block diagram of the document set analysis apparatus of this embodiment.
FIG. 2 is a conceptual diagram of the three-dimensional structure in this embodiment.
FIG. 3 shows an example of a visualization result in this embodiment.
FIG. 4 is an enlarged view of part of the example visualization result in this embodiment.
FIG. 5 is a flowchart of the document set analysis method of this embodiment.

Explanation of symbols

10 ... Document set specifying unit
20 ... Similarity evaluation unit
30 ... Relationship extraction unit
40 ... Centrality determination unit
50 ... Information analysis unit
51 ... Graph structure construction unit
52 ... Vertex node extraction unit
53 ... Mountain-shaped node group specifying unit
54 ... Labeling unit
60 ... Information output unit
70 ... Document data management means
a1, a2, a3, a4, a5, a6, a7, a8, b1, b2, b3, b4, c1 ... Node
CN, CN1, CN2 ... Core node
SN ... Supplemental node

Claims

A document set analysis apparatus that identifies the role of a document based on relationships between documents in a document set managed by a document data management means, the apparatus comprising:
document set specifying means for specifying the document set based on a document set specifying condition input from an input means;
similarity evaluation means for evaluating the similarity, with respect to topic words, between the documents included in the specified document set;
relationship extraction means for extracting relationships between documents based on the similarity evaluated by the similarity evaluation means;
centrality determination means for calculating, based on the relationships between documents extracted by the relationship extraction means, the centrality of each document as an index indicating how strongly that document is related to the documents other than itself;
graph structure construction means for representing the document set as a three-dimensional graph structure, based on the relationships between documents obtained by the relationship extraction means and the centrality of each document, by expressing the relationships between documents in two-dimensional coordinates and the centrality in a third coordinate relative to those two-dimensional coordinates;
vertex node extraction means for extracting, as a vertex node, a document node whose centrality score is maximal in the document set in the obtained graph structure;
mountain-shaped node group specifying means for specifying, as a mountain-shaped node group, a document node group consisting of the extracted vertex node and one or more document nodes that are connected to the vertex node and whose centrality score is lower than that of the vertex node and not minimal;
labeling means for assigning to the extracted vertex node a label indicating that the node serves as a topic word of the document set, and assigning to the specified mountain-shaped node group a label indicating that the node group is a set of documents related to the topic word; and
information output means for visualizing and outputting the document node representing the vertex node labeled as the topic word and the document node group representing the mountain-shaped node group labeled as a set of documents related to the topic word.

A computer-executed document set analysis method for identifying the role of a document based on relationships between documents in a document set managed by a document data management means, the method comprising:
a document set specifying step of specifying the document set based on a document set specifying condition input from an input means;
a similarity evaluation step of evaluating the similarity, with respect to topic words, between the documents included in the specified document set;
a relationship extraction step of extracting relationships between documents based on the similarity evaluated by the similarity evaluation means;
a centrality determination step of calculating, based on the relationships between documents extracted by the relationship extraction means, the centrality of each document as an index indicating how strongly that document is related to the documents other than itself;
a graph structure construction step of representing the document set as a three-dimensional graph structure, based on the relationships between documents obtained by the relationship extraction means and the centrality of each document, by expressing the relationships between documents in two-dimensional coordinates and the centrality in a third coordinate relative to those two-dimensional coordinates;
a vertex node extraction step of extracting, as a vertex node, a document node whose centrality score is maximal in the document set in the obtained graph structure;
a mountain-shaped node group specifying step of specifying, as a mountain-shaped node group, a document node group consisting of the extracted vertex node and one or more document nodes that are connected to the vertex node and whose centrality score is lower than that of the vertex node and not minimal;
a labeling step of assigning to the extracted vertex node a label indicating that the node serves as a topic word of the document set, and assigning to the specified mountain-shaped node group a label indicating that the node group is a set of documents related to the topic word; and
an information output step of visualizing and outputting the document node representing the vertex node labeled as the topic word and the document node group representing the mountain-shaped node group labeled as a set of documents related to the topic word.

A document set analysis program characterized in that the document set analysis method according to claim 2 is described as a computer program executable by a computer.

A recording medium characterized in that the document set analysis method according to claim 2 is described as a computer program executable by a computer, and that computer program is recorded on the medium.