JP3675682B2

JP3675682B2 - Cluster analysis processing method, apparatus, and recording medium recording cluster analysis program

Info

Publication number: JP3675682B2
Application number: JP27047599A
Authority: JP
Inventors: 正之杉崎; 大二郎森; 雅且大久保; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 1999-09-24
Filing date: 1999-09-24
Publication date: 2005-07-27
Anticipated expiration: 2019-09-24
Also published as: JP2001092841A

Description

【０００１】
【発明の属する技術分野】
本発明はクラスター分析処理方法および装置に関する。
【０００２】
【従来の技術】
近年、インターネットなどのコンピュータネットワークの普及により、不特定多数を対象に電子化された大量の情報がやりとりされている。これらの大量の情報の中から重要な情報を獲得するために、様々な分析手法が存在する。中でも、類似した情報同士を結び付け２分木を生成するクラスター分析方法は、情報の分類手法として数多く利用されている。
【０００３】
クラスター分析とは「異質なものの混ざりあっている対象（それは個体＝ものの場合もあるし、変数の場合もある）を、それらの間に何らかの意味で定義された類似度(similarity)を手がかりにして似たものを集め、いくつかの均質なものの集落（クラスター）を分類する方法を総称したもの」である（“多変量統計解析法”田中、脇本、現代数学社、１９８３初版）。
【０００４】
簡単にアルゴリズムを説明すると、個々のデータの数をＮとして、
１．データ集合Ｄ＝｛１，．．．，Ｎ｝の各データ間の類似度を計算
２．データ集合Ｄから最も類似したデータの組（ｉ，ｊ）を探索
３．ｉ，ｊから新たなクラスターｋを生成し、データ集合Ｄに加える
４．データ集合Ｄからｉ，ｊを削除する
５．終了条件を満たすまで、上記１，２，３，４を繰り返す
となる。終了条件とは、例えば「クラスターの数がｍ個まで」や、上記アルゴリズムの２で「類似度の値によって類似していると判断されなくなった場合」などがある。図９が、クラスター分析処理のイメージ図である。
【０００５】
実際にクラスター分析を行う対象としては、文書の自動分類などがある。新聞記事などの大量の文書を集めておき、類似した内容の文書同士を収集させ分類する方法として利用される。この場合、各文書の特徴を、要素が文書内に存在する単語と１対１に対応したベクトル（特徴ベクトル）として表現する。各単語に対応する要素は、文書に対するその単語の重要度（重み）を実数値化して表現する。単語の重みの計算方法は、その単語の出現頻度や文書集合全体における分布の割合や文字数の長さなどを利用して決定する手法（“Automatic Text Processing ”Gerard Salton, ADDISONWESLEY pub. 1989）が一般的である。こうして得られたベクトルを用いて各文書間の類似度（距離）を定義する。例えばユークリッド空間におけるベクトル同士のなす角(cosθ) などを利用する。これにより、類似した文書同士を同じクラスターに分けていき、類似度から閾値を利用して類似していないと判断したときに、処理を止める。この結果、いくつかの文書のまとまりを得ることができる。
【０００６】
【発明が解決しようとする課題】
この分析手法は、処理対象が多くなると、次のように非常に計算回数が多くなるという欠点がある。
【０００７】
（１）各データ間の類似度の計算にＮ×Ｎ（Ｎはデータ数）回かかる。
【０００８】
（２）最も類似しているデータの組の探索がＮ×Ｎ回かかる。
【０００９】
（３）２つのクラスターをひとつににまとめる新たなクラスターを生成する場合、残りの個々のデータ（最初は、Ｎ−２個）との類似度の計算を行う。
【００１０】
データｉとデータｊに対し（ｉ，ｊ）の類似度と（ｊ，ｉ）の類似度は等しい（対称性が成り立つ）とすると、（１）や（２）の計算回数は１／２になるが、オーダとしてはＮの２乗に変化はない。
【００１１】
本発明の目的は、計算回数のオーダを最小でＮ（データの個数）にするラスター分析処理方法、装置、およびクラスター分析処理プログラムを記録した記録媒体を提供することにある。
【００１２】
【課題を解決するための手段】
（２）の計算において、処理の途中結果を記録しておき、次の類似しているデータの組の探索時に利用することで、処理回数のオーダを最小でＮ（データの個数）にする方法を提案する。
【００１３】
データ間の類似度の値を要素とするＮ×Ｎの行列を生成する。この行列をＡ＝｛Ａｉｊ｝（ｉ，ｊ＝１，．．．，Ｎ）とする。行列Ａから最大値をとるデータの組（ｉ_max ，ｊ_max ）を見つけることを考える。
【００１４】
あらかじめ（実数値、列ＩＤ）を組とした、Ｎ個の一時的な領域を用意する（これを次候補記憶テーブルＴ＝｛Ｔｉ：Ｔｉ＝（値、列ＩＤ），ｉ＝１，．．．．Ｎ｝とする）。各Ｔｉには、行列Ａの行ｉにおける最大値とその列ＩＤを入れる。式で表すと、

となる。次に、次候補記憶テーブルＴの要素Ｔｉ＝（値、列ＩＤ）の値が最大となるｉを見つける。こうして得られたｉとＴｉの列ＩＤが最大値をとるデータの組となる。
【００１５】
クラスター分析の場合、こうして得られたデータｉ_max とｊ_max から新しいクラスターｋを生成し、クラスターｋとそれ以外のデータ（個数Ｎ−２）との類似度を計算し直し、最大値を探索する。このとき次候補記憶テーブルＴを利用すると高速に探索できる。新たなクラスターｋ（行ＩＤはＮ＋１とする）を生成することにより行列Ａの中で変更されるのは行ｉ_max ，ｊ_max ，ｋと列ｉ_max ，ｊ_max ，ｋである。この変更に伴い次候補記憶テーブルＴの値の更新が確実に必要なのはｉ_max ，ｊ_max ，ｋであり、そのうち行内の最大値の探索が必要なのはｋのみとなる。また、次候補記憶テーブルＴの要素で、列ＩＤの値がｉ_max かｊ_max と同じであれば最大値の再計算をするが、それ以外は次候補記憶テーブルＴ内の要素の値がそのまま利用できる。すなわち、次候補記憶テーブルＴを用いることにより、行列Ａの最大値とそのデータの組の探索の計算回数のオーダーは最小時Ｎまで期待できる。すなわち、次候補記憶テーブルＴを探索するだけで最大値が見つかるとき、そのときの計算回数のオーダはＮになる。
【００１６】
【発明の実施の形態】
次に、本発明の実施の形態について図面を参照して説明する。
【００１７】
図１を参照すると、本発明の一実施形態のクラスター分析処理装置は、データ入力部１０１とデータ記憶部１０２とクラスター分析処理部１０３と表示部１０４で構成されている。
【００１８】
データ入力部１０１は処理を施したいデータを入力するためのものである。処理を施したいデータは、類似度（距離）が定義できるデータであればよく、〔従来の技術〕で説明した文書情報なども処理対象にできる。また、このとき各データ間の類似度を計算する関数も入力する。これは、装置側で扱える関数を提示し、装置利用者がその中から選択する方法もとれる。
【００１９】
データ記憶部１０２は、データ入力部１０１で入力された情報を記憶する。
【００２０】
クラスター分析処理部１０３は、データ入力部１０１で入力されたデータおよびデータ間の類似度計算用関数を用いてクラスター分析を行う。このとき、最も類似したデータの組の探索に、次候補記憶テーブルＴを用いて高速化を計っている。
【００２１】
表示部１０４は、クラスター分析の結果を表示する。
【００２２】
図２はクラスター分析処理部１０３の処理を示すフローチャートである。データ集合Ｄ（Ｎ個）の各データ間の類似度Ａ＝｛Ａｉｊ｝（ｉ，ｊ＝１，．．．，Ｎ）を計算する（ステップ２０１）。類似度の行列｛Ａｉｊ｝の各行ＩＤを探索し、類似度の最大値とそのときの列ＩＤを求め、次候補記憶テーブルＴ＝（Ｔ₁，．．．，Ｔ_N）：Ｔ_k（ｖａｌ，ｉｎｄｅｘ）に入れ、次候補記憶テーブルを探索し、類似度｛Ａｉｊ｝の最大値を探索する（ステップ２０２）。ｉ_max，ｊ_maxから新たなクラスターｋを生成し、データ集合Ｄに加える（ステップ２０４）。例えば集合Ｄ「りんご、きゅうり、なし」があったとすると、果物という観点から「りんご」と「なし」が近くて、これから新たなクラスター「果物」ができあがる。データ集合Ｄからｉ_max，ｊ_maxを削除する（ステップ２０５）。以上を、所定の終了条件が満たされるまで繰り返す（ステップ２０３）。
【００２３】
図３は図２中のステップ２０２の詳細を示すフローチャートである。なお、この処理に入る前に列ＩＤ番号Ｔｉ．ｉｎｄｅｘ＝−１に設定されている。まず類似度の行列Ａの行ＩＤ番号であるｉを初期化する（ステップ３０１）。行列Ａの最大値ｍａｘＶとその行ＩＤ番号ｍａｘＮｉ、列ＩＤ番号ｍａｘＮｉを初期化する（ステップ３０２）。行ＩＤ番号Ｔｉ．ｉｎｄｅｘの値を調べる（ステップ３０３，３０４），列ＩＤ番号Ｔｉ．ｉｎｄｅｘ＝−２であれば最大値を計算する必要がないのでステップ３０８に進み、行ＩＤ番号ｉを＋１する。列ＩＤ番号Ｔｉ．ｉｎｄｅｘが−１であれば、行ＩＤ番号がｉの類似度を求める必要があるので、類似度｛Ａｉｊ｝（ｌ＝１，．．．，Ｎ）の最大値を求め、その値をＴｉ．ｖａｌに保存し、そのときの列ＩＤ番号ｌをＴｉ．ｉｎｄｅｘに保存する（ステップ３０５）。ｍａｘＶとＴｉ．ｖａｌを比較し（ステップ３０６）、Ｔｉ．ｖａｌの方が大きければｍａｘＶ，ｍａｘＮｉ，ｍａｘＮｊにそれぞれＴｉ．ｖａｌ，ｉ，Ｔｉ．ｉｎｄｅｘを代入する（ステップ３０７）。行ＩＤ番号ｉを＋１する（ステップ３０８）。以上、ステップ３０３から３０８までの処理を行ＩＤ番号ｉがＮより大きくなるまで繰り返す（ステップ３０９）。
【００２４】
以上の処理により、次候補記憶テーブルＴ＝Ｔ_k（ｖａｌ，ｉｎｄｅｘ）、類似度の行列｛Ａｉｊ｝（ｉ，ｊ＝１〜Ｎ）内の最大値ｍａｘＶと、そのときの行ＩＤおよび列ＩＤ番号ｍａｘＮｉ，ｍａｘＮｊが求まる。
【００２５】
図４は図１中のステップ２０５の詳細を示すフローチャートである。
【００２６】
まず、ｉ行（ｉ＝１〜Ｎ）、ｍａｘＮｉ列の類似度ＡｉｍａｘＮｉ，ｉ行（ｉ＝１〜Ｎ）、ｍａｘＮｊ列の類似度ＡｉｍａｘＮｊ，ｍａｘＮｉ行、ｍａｘＮｊ列の類似度ＡｍａｘＮｉｍａｘＮｊをいずれも−１とする（ステップ４０１）。行ＩＤ番号ｉを初期化する（ステップ４０２）。行ＩＤ番号ｉがｍａｘＮｉまたはｍａｘＮｊに等しければ、行ＩＤ番号ｉの類似度の最大値は求めなくてもよいので、列ＩＤ番号Ｔｉ．ｉｎｄｅｘ＝−２とする（ステップ４０３〜４０５）。ｉがｍａｘＮｉ、ｍａｘＮｊでなく、Ｔｉ．ｉｎｄｅｘがｍａｘＮｉまたはｍａｘＮｊに等しい場合、Ｔｉ、ｉｎｄｅｘがｍａｘＮｉ、ｍａｘＮｊのデータが削除されるため、最大値を求める必要があるので、Ｔｉ．ｉｎｄｅｘ＝−１とする（ステップ４０６〜４０８）。行ＩＤ番号ｉを＋１する（ステップ４０９）。以上、ステップ４０３から４０９までの処理をｉがＮより大きくなるまで繰り返す（ステップ４１０）。
【００２７】
次に、類似度の行列｛Ａｉｊ｝に、Ａ（Ｎ＋１）ｊ，Ａｉ（Ｎ＋１）（ｉ，ｊ＝１〜Ｎ＋１）を追加し、Ｔ_(N+1) ．ｉｎｄｅｘ＝−１とする（ステップ４１１）。最後に、データ数Ｎを＋１インクリメントする（ステップ４１２）。
【００２８】
次に、本実施形態を具体例により説明する。
【００２９】
クラスター分析を行うデータを、例えば１９９９年６月のＡ社が発行している新聞記事一ケ月分とする。記事間の類似度は、〔従来の技術〕で説明した文書の特徴ベクトル間の成す角(cosθ) を用いる。
【００３０】
図５は、類似度の行列Ａから次候補記憶テーブルＴを用いて最大値を探索する例である。「文書ｉと文書ｊの類似度と文書ｊと文書ｉの類似度は同じ」であるから、実際の類似度の行列Ａの半分を利用する。行列Ａからの最大値の探索が初めてであれば、次候補記憶テーブルＴの各要素には値が入っていないため、全ての各行において最大の類似度とその列ＩＤを記録する。このようにして次候補記憶テーブルＴに入れられた値を利用して上から順に最大値を探索する。図５の例では、最大値を取る行ＩＤ番号と列ＩＤ番号の組が（５，２）でそのときの値が０．４６であった。それらから、新たなクラスターを生成し、古いクラスター情報を削除した例が図５である。
【００３１】
図６では、まず、類似度の行列Ａの２と５の行と列それぞれの情報を空にし、さらに新たなクラスター情報をＮ＋１として追加する。また、それに合わせて、次候補記憶テーブルＴの内容も変更する。Ｔ₂ とＴ₅ の情報は削除し、Ｔ_N+1 には行列Ａの行ＩＤ番号Ｎ＋１内の最大値とその列ＩＤ番号を記録する。それ以外に列ＩＤ番号の値が２あるいは５となる要素が存在すれば、その値も更新する。そのような要素がなければ、次の最大値の探索時の計算回数は［（次候補記憶テーブルＴの探索）＋（行Ｎ＋１の最大値探索）］となり、オーダーはＮとなる。
【００３２】
図７は、文書の自動分類書類における文書数に対する処理時間を従来方法と本発明の方法とで比較して示す図である。処理対象はアンケートの自由回答文で、文書間の類似度の計算には（従来の技術）で説明した特徴ベクトル間の成す角を利用した、従来方法では文書数に２乗で比例して計算時間が増えていくが、本方法ではほぼ線形に増えていくことが確認できる。
【００３３】
図８は本発明の他の実施形態のクラスター分析処理装置のブロック図である。
【００３４】
本実施形態のクラスター分析処理装置は入力装置５０１と記憶装置５０２、５０３と出力装置５０４と記録媒体５０５とデータ処理装置５０６で構成されている。入力装置５０１、記憶装置５０２、出力装置５０４は図１中のデータ入力部１０１、データ記憶部１０２、表示部１０４にそれぞれ対応する。記憶装置５０３はハードディスクである。記録媒体５０５は図２〜図４の処理からなるクラスター分析処理プログラムを記録したフロッピィ・ディスク、ＣＤ−ＲＯＭ、光磁気ディスクなどの記録媒体である。データ処理装置５０６は記録媒体５０５からクラスター分析処理プログラムを読み込んでこれを実行するＣＰＵである。
【００３５】
【発明の効果】
以上説明したように、本発明によれば、計算回数のオーダを最小でＮ（データの個数）にすることができる。
【図面の簡単な説明】
【図１】本発明の一実施形態のクラスター分析処理装置の構成図である。
【図２】クラスター分析処理部１０３の処理を示すフローチャートである。
【図３】図１中のステップ２０２の詳細を示すフローチャートである。
【図４】図１中のステップ２０５の詳細なフローチャートである。
【図５】類似度の行列Ａからの最大値の探索例を示す図である。
【図６】類似度の行列Ａからの最大値の探索例を示す図である。
【図７】文書の自動分類処理における文書数の処理時間の関係を従来方法と本発明で比較して示す図である。
【図８】本発明の他の実施形態のクラスター分析処理装置の構成図である。
【図９】クラスター分析のイメージ図である。
【符号の説明】
１０１データ入力部
１０２データ記憶部
１０３クラスター分析部
１０４表示部
２０１〜２０５、３０１〜３０９、４０１〜４１２ステップ
５０１入力装置
５０２、５０３記憶装置
５０４出力装置
５０５記録媒体
５０６データ処理装置

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a cluster analysis processing method and apparatus.
[0002]
[Prior art]
In recent years, with the spread of computer networks such as the Internet, large amounts of digitized information have come to be exchanged among an unspecified number of people. Various analysis methods exist for extracting important information from such large amounts of information. Among them, cluster analysis, which links similar pieces of information to one another and generates a binary tree, is widely used as an information classification method.
[0003]
Cluster analysis is "a general term for methods that take a mixture of heterogeneous objects (which may be individuals, that is, things, or may be variables), gather similar ones using a similarity defined in some sense between them as a clue, and classify them into several homogeneous groups (clusters)" ("Multivariate Statistical Analysis Methods", Tanaka and Wakimoto, Gendai Sugakusha, 1983, first edition).
[0004]
To explain the algorithm briefly, let N be the number of individual data items:
1. Calculate the similarity between each pair of data in the data set D = {1, ..., N}.
2. Search the data set D for the most similar pair of data (i, j).
3. Generate a new cluster k from i and j and add it to the data set D.
4. Delete i and j from the data set D.
5. Repeat steps 1, 2, 3, and 4 above until the end condition is satisfied.
The end condition is, for example, "until the number of clusters reaches m" or "when, in step 2 of the above algorithm, the pair is no longer judged to be similar based on the similarity value". FIG. 9 is a conceptual diagram of cluster analysis processing.
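The conventional procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the text does not say how a merged cluster is represented or how its similarity to other data is computed, so the helper functions `mean_similarity` and `concat` (clusters as tuples of numbers, compared by the distance between their means) are assumptions made for the example.

```python
from itertools import combinations


def mean_similarity(a, b):
    """Hypothetical similarity: negative distance between cluster means."""
    return -abs(sum(a) / len(a) - sum(b) / len(b))


def concat(a, b):
    """Hypothetical merge rule: a cluster is the tuple of its members."""
    return a + b


def naive_cluster(data, similarity, merge, threshold):
    """Conventional clustering: every round rescans all pairs."""
    clusters = list(data)
    while len(clusters) > 1:
        # Steps 1-2: evaluate every pair and keep the most similar one.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: similarity(clusters[p[0]], clusters[p[1]]))
        # End condition: the best pair is no longer judged similar.
        if similarity(clusters[i], clusters[j]) < threshold:
            break
        merged = merge(clusters[i], clusters[j])
        # Steps 3-4: drop i and j, add the new cluster k.
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


points = [(1.0,), (1.2,), (5.0,), (5.1,), (9.0,)]
print(naive_cluster(points, mean_similarity, concat, threshold=-1.0))
```

Because every round re-evaluates all remaining pairs, the cost of each search grows with the square of the number of data, which is the drawback discussed in the following paragraphs.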
[0005]
Practical targets of cluster analysis include automatic document classification. It is used to gather a large number of documents, such as newspaper articles, and to group and classify documents with similar content. In this case, the features of each document are expressed as a vector (feature vector) whose elements correspond one-to-one to the words occurring in the document. The element corresponding to each word expresses the importance (weight) of that word to the document as a real value. The word weight is generally determined from the word's frequency of occurrence, its distribution across the whole document set, its length in characters, and so on ("Automatic Text Processing", Gerard Salton, Addison-Wesley, 1989). The similarity (distance) between documents is defined using the vectors thus obtained. For example, the angle between the vectors in Euclidean space (cos θ) is used. Similar documents are thereby grouped into the same cluster, and processing stops when a threshold on the similarity indicates that the remaining candidates are no longer similar. As a result, several groups of documents are obtained.
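As an illustration of the cos θ similarity mentioned above, the following Python sketch computes it for two feature vectors stored as word-to-weight dictionaries. The dictionary representation and the function name `cosine_similarity` are assumptions for the example; the text does not prescribe a data structure or a particular weighting scheme.

```python
import math


def cosine_similarity(u, v):
    """cos(theta) between two sparse feature vectors (word -> weight)."""
    # Only words present in both documents contribute to the dot product.
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


doc_a = {"cluster": 2.0, "analysis": 1.0}
doc_b = {"cluster": 1.0, "matrix": 3.0}
print(cosine_similarity(doc_a, doc_b))
```

The function is symmetric in its two arguments, which is the property the later paragraphs rely on when they use only half of the similarity matrix.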
[0006]
[Problems to be solved by the invention]
This analysis method has the drawback that, as the number of items to be processed grows, the number of calculations becomes very large, as follows.
[0007]
(1) Calculating the similarity between every pair of data takes N × N operations (N is the number of data).
[0008]
(2) Searching for the most similar pair of data takes N × N operations.
[0009]
(3) When a new cluster is generated by combining two clusters into one, its similarity to each of the remaining data (initially N − 2 items) must be calculated.
[0010]
If the similarity of (i, j) equals the similarity of (j, i) for data i and data j (that is, symmetry holds), the number of calculations in (1) and (2) is halved, but the order remains N squared.
[0011]
An object of the present invention is to provide a cluster analysis processing method, an apparatus, and a recording medium recording a cluster analysis processing program that reduce the order of the number of calculations to N (the number of data) in the best case.
[0012]
[Means for Solving the Problems]
For the calculation in (2), we propose a method that records intermediate results of the processing and uses them in the next search for the most similar pair of data, thereby reducing the order of the number of operations to N (the number of data) in the best case.
[0013]
An N × N matrix whose elements are the similarity values between data is generated. Let this matrix be A = {Aij} (i, j = 1, ..., N). Consider finding the pair of data (i_max, j_max) at which matrix A takes its maximum value.
[0014]
N temporary areas, each holding a pair (real value, column ID), are prepared in advance (call this the next candidate storage table T = {Ti : Ti = (value, column ID), i = 1, ..., N}). Each Ti holds the maximum value in row i of matrix A and its column ID. Expressed as a formula,

$$T_i = \left( \max_{j} A_{ij},\ \operatorname*{arg\,max}_{j} A_{ij} \right), \quad i = 1, \ldots, N.$$

Next, the i for which the value of element Ti = (value, column ID) of the next candidate storage table T is largest is found. This i and the column ID of Ti give the pair of data taking the maximum value.
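The construction of the next candidate storage table T and the lookup of the maximum pair can be sketched in Python as follows. This is a minimal sketch under stated assumptions: indices are 0-based rather than 1-based, the function names are hypothetical, and the diagonal entry Aii is skipped so that a datum is never paired with itself (the text does not say how the diagonal is handled).

```python
def build_candidate_table(A):
    """Return T, where T[i] = (largest value in row i of A, its column)."""
    n = len(A)
    T = []
    for i in range(n):
        # Scan row i once, ignoring the diagonal (assumption, see above).
        j = max((c for c in range(n) if c != i), key=lambda c: A[i][c])
        T.append((A[i][j], j))
    return T


def find_max_pair(T):
    """Scan the N entries of T to locate the maximum of the whole matrix."""
    i = max(range(len(T)), key=lambda r: T[r][0])
    value, j = T[i]
    return i, j, value


A = [
    [0.00, 0.10, 0.30, 0.20],
    [0.10, 0.00, 0.46, 0.05],
    [0.30, 0.46, 0.00, 0.12],
    [0.20, 0.05, 0.12, 0.00],
]
T = build_candidate_table(A)
print(find_max_pair(T))
```

Building T costs one pass over the matrix, but once T exists the maximum pair is found by scanning only its N entries.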
[0015]
In cluster analysis, a new cluster k is generated from the data i_max and j_max thus obtained, the similarity between cluster k and the other data (N − 2 items) is recalculated, and the maximum value is searched for again. Using the next candidate storage table T makes this search fast. Creating the new cluster k (whose row ID is N + 1) changes only rows i_max, j_max, and k and columns i_max, j_max, and k of matrix A. As a result of this change, the entries of the next candidate storage table T that definitely need updating are those for i_max, j_max, and k, and of these only k requires a search for the maximum value within its row. In addition, for any element of the next candidate storage table T whose column ID equals i_max or j_max, the maximum value is recalculated; all other element values in T can be used as they are. In other words, by using the next candidate storage table T, the order of the number of calculations needed to find the maximum value of matrix A and the corresponding pair of data can be expected to be as low as N. That is, when the maximum value is found merely by searching the next candidate storage table T, the order of the number of calculations is N.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Next, embodiments of the present invention will be described with reference to the drawings.
[0017]
Referring to FIG. 1, the cluster analysis processing apparatus according to an embodiment of the present invention includes a data input unit 101, a data storage unit 102, a cluster analysis processing unit 103, and a display unit 104.
[0018]
The data input unit 101 is for inputting the data to be processed. The data to be processed may be any data for which a similarity (distance) can be defined, and the document information described in [Prior art] can also be processed. At this time, a function for calculating the similarity between the data is also input. Alternatively, the apparatus may present the functions it can handle and let the user select one of them.
[0019]
The data storage unit 102 stores information input by the data input unit 101.
[0020]
The cluster analysis processing unit 103 performs cluster analysis using the data input by the data input unit 101 and the similarity calculation function between the data. At this time, the next candidate storage table T is used to speed up the search for the most similar data set.
[0021]
The display unit 104 displays the result of cluster analysis.
[0022]
FIG. 2 is a flowchart showing the processing of the cluster analysis processing unit 103. The similarities A = {Aij} (i, j = 1, ..., N) between the data of the data set D (N items) are calculated (step 201). Each row of the similarity matrix {Aij} is searched to obtain the maximum similarity and the column ID at which it occurs; these are stored in the next candidate storage table T = (T_1, ..., T_N), where T_k = (val, index), and the next candidate storage table is then searched to find the maximum value of the similarities {Aij} (step 202). A new cluster k is generated from i_max and j_max and added to the data set D (step 204). For example, given a set D of "apple, cucumber, pear", "apple" and "pear" are close from the viewpoint of fruit, so a new cluster "fruit" is formed from them. i_max and j_max are deleted from the data set D (step 205). The above is repeated until a predetermined end condition is satisfied (step 203).
[0023]
FIG. 3 is a flowchart showing details of step 202 in FIG. 2. Before this processing is entered, the column ID numbers Ti.index have been set to −1. First, i, the row ID number of the similarity matrix A, is initialized (step 301). The maximum value maxV of matrix A, its row ID number maxNi, and its column ID number maxNj are initialized (step 302). The value of the column ID number Ti.index is checked (steps 303 and 304). If Ti.index = −2, there is no need to calculate the maximum value, so the process proceeds to step 308 and the row ID number i is incremented by 1. If Ti.index is −1, the maximum similarity of row i must be obtained, so the maximum of the similarities {Ail} (l = 1, ..., N) is found, its value is stored in Ti.val, and the column ID number l at which it occurs is stored in Ti.index (step 305). maxV is compared with Ti.val (step 306), and if Ti.val is larger, Ti.val, i, and Ti.index are assigned to maxV, maxNi, and maxNj, respectively (step 307). The row ID number i is incremented by 1 (step 308). Steps 303 to 308 are repeated until the row ID number i becomes larger than N (step 309).
[0024]
With the above processing, the next candidate storage table T = T_k (val, index), the maximum value maxV in the similarity matrix {Aij} (i, j = 1 to N), and the row and column ID numbers maxNi and maxNj at which it occurs are obtained.
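Steps 301 to 309 can be sketched in Python as follows. This is an illustrative reading of the flowchart description, not a definitive implementation: indices are 0-based, each table entry is a two-element list `[val, index]`, the constant names `NEEDS_SCAN` and `DELETED` are hypothetical labels for the values −1 and −2, and skipping the diagonal during a row scan is an assumption. The sketch also assumes the matrix has at least two rows.

```python
NEEDS_SCAN = -1  # Ti.index = -1: the row maximum must be (re)computed
DELETED = -2     # Ti.index = -2: the row belongs to a merged datum


def search_max(A, T):
    """Return (maxV, maxNi, maxNj), refreshing stale entries of T in place."""
    max_v, max_ni, max_nj = float("-inf"), None, None    # step 302
    for i in range(len(A)):                              # steps 301, 308, 309
        if T[i][1] == DELETED:                           # steps 303-304
            continue
        if T[i][1] == NEEDS_SCAN:                        # step 305
            j = max((c for c in range(len(A)) if c != i),
                    key=lambda c: A[i][c])
            T[i][0], T[i][1] = A[i][j], j
        if T[i][0] > max_v:                              # steps 306-307
            max_v, max_ni, max_nj = T[i][0], i, T[i][1]
    return max_v, max_ni, max_nj


A = [
    [0.00, 0.10, 0.30, 0.20],
    [0.10, 0.00, 0.46, 0.05],
    [0.30, 0.46, 0.00, 0.12],
    [0.20, 0.05, 0.12, 0.00],
]
T = [[0.0, NEEDS_SCAN] for _ in A]  # every row is scanned on the first call
print(search_max(A, T))
```

On the first call every row is scanned; on later calls only the rows marked −1 are rescanned, and the remaining rows cost a single comparison each.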
[0025]
FIG. 4 is a flowchart showing details of step 205 in FIG. 2.
[0026]
First, the similarities A(i, maxNi) in column maxNi of each row i (i = 1 to N), the similarities A(i, maxNj) in column maxNj of each row i (i = 1 to N), and the similarity A(maxNi, maxNj) in row maxNi and column maxNj are all set to −1 (step 401). The row ID number i is initialized (step 402). If the row ID number i is equal to maxNi or maxNj, the maximum similarity of row i need not be obtained, so the column ID number Ti.index is set to −2 (steps 403 to 405). If i is neither maxNi nor maxNj but Ti.index is equal to maxNi or maxNj, the data to which Ti.index refers (maxNi or maxNj) are being deleted and the maximum value must be obtained again, so Ti.index is set to −1 (steps 406 to 408). The row ID number i is incremented by 1 (step 409). Steps 403 to 409 are repeated until i becomes larger than N (step 410).
[0027]
Next, A(N+1)j and Ai(N+1) (i, j = 1 to N + 1) are added to the similarity matrix {Aij}, and T_(N+1).index is set to −1 (step 411). Finally, the number of data N is incremented by 1 (step 412).
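Steps 401 to 412 can be sketched in Python as follows. Again this is only a sketch under assumptions: indices are 0-based, table entries are `[val, index]` lists, and `NEEDS_SCAN` and `DELETED` are hypothetical names for −1 and −2. The text does not state how the similarities of the new cluster are computed, so they are passed in through the hypothetical parameter `new_sims`.

```python
NEEDS_SCAN = -1  # Ti.index = -1: the row maximum must be recomputed
DELETED = -2     # Ti.index = -2: the row belongs to a merged datum


def merge_update(A, T, max_ni, max_nj, new_sims):
    """Update A and T in place after merging rows max_ni and max_nj.

    new_sims[i] is the similarity between the new cluster and datum i.
    Returns the row index assigned to the new cluster.
    """
    n = len(A)
    for i in range(n):                        # step 401: blank both columns
        A[i][max_ni] = -1.0
        A[i][max_nj] = -1.0
    for i in range(n):                        # steps 402-410
        if i in (max_ni, max_nj):
            T[i][1] = DELETED                 # merged rows are never scanned
        elif T[i][1] in (max_ni, max_nj):
            T[i][1] = NEEDS_SCAN              # cached maximum was deleted
    # Step 411: append the row and column of the new cluster.
    new_row = [-1.0 if T[i][1] == DELETED else new_sims[i] for i in range(n)]
    for i in range(n):
        A[i].append(new_row[i])
    A.append(new_row + [-1.0])                # no similarity with itself
    T.append([0.0, NEEDS_SCAN])
    return n                                  # step 412: len(A) is now n + 1


A = [
    [0.00, 0.10, 0.30, 0.20],
    [0.10, 0.00, 0.46, 0.05],
    [0.30, 0.46, 0.00, 0.12],
    [0.20, 0.05, 0.12, 0.00],
]
T = [[0.30, 2], [0.46, 2], [0.46, 1], [0.20, 0]]
k = merge_update(A, T, 1, 2, new_sims=[0.25, 0.0, 0.0, 0.10])
print(k, T)
```

In this example only row 0 (whose cached column was merged away) and the new row need a rescan. A surviving row whose cached column was not merged keeps its entry even if the new cluster is more similar to it; with a symmetric matrix that pair is still found when the new row itself is scanned.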
[0028]
Next, the present embodiment will be described using a specific example.
[0029]
The data to be subjected to cluster analysis is, for example, one month of newspaper articles published by Company A in June 1999. The similarity between articles uses the angle (cosθ) formed between the feature vectors of the document described in [Prior Art].
[0030]
FIG. 5 shows an example in which the maximum value is searched for in the similarity matrix A using the next candidate storage table T. Because the similarity between document i and document j is the same as that between document j and document i, only half of the actual similarity matrix A is used. When the maximum value is searched for in matrix A for the first time, the elements of the next candidate storage table T hold no values, so the maximum similarity and its column ID are recorded for every row. The values thus stored in the next candidate storage table T are then scanned in order from the top to find the maximum value. In the example of FIG. 5, the pair of row ID number and column ID number taking the maximum value is (5, 2), with a value of 0.46. FIG. 6 shows an example in which a new cluster is generated from them and the old cluster information is deleted.
[0031]
In FIG. 6, first, the information in rows 2 and 5 and columns 2 and 5 of the similarity matrix A is emptied, and the new cluster information is added as N + 1. The contents of the next candidate storage table T are changed accordingly. The information in T_2 and T_5 is deleted, and the maximum value in row N + 1 of matrix A and its column ID number are recorded in T_(N+1). If any other element has a column ID number of 2 or 5, its value is also updated. If there is no such element, the number of calculations in the next maximum value search is [(search of next candidate storage table T) + (maximum value search of row N + 1)], and the order is N.
[0032]
FIG. 7 compares the processing time against the number of documents for the conventional method and the method of the present invention in automatic document classification. The processing targets are free-text questionnaire responses, and the similarity between documents is calculated using the angle between the feature vectors described in [Prior art]. With the conventional method the calculation time grows in proportion to the square of the number of documents, whereas with this method it can be confirmed to grow almost linearly.
[0033]
FIG. 8 is a block diagram of a cluster analysis processing apparatus according to another embodiment of the present invention.
[0034]
The cluster analysis processing apparatus of this embodiment includes an input device 501, storage devices 502 and 503, an output device 504, a recording medium 505, and a data processing device 506. The input device 501, the storage device 502, and the output device 504 correspond to the data input unit 101, the data storage unit 102, and the display unit 104 in FIG. 1, respectively. The storage device 503 is a hard disk. The recording medium 505 is a recording medium such as a floppy disk, a CD-ROM, or a magneto-optical disk on which a cluster analysis processing program comprising the processes of FIGS. 2 to 4 is recorded. The data processing device 506 is a CPU that reads the cluster analysis processing program from the recording medium 505 and executes it.
[0035]
[Effects of the invention]
As described above, according to the present invention, the order of the number of calculations can be reduced to N (the number of data) in the best case.
[Brief description of the drawings]
FIG. 1 is a configuration diagram of a cluster analysis processing apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing of the cluster analysis processing unit 103.
FIG. 3 is a flowchart showing details of step 202 in FIG. 2.
FIG. 4 is a detailed flowchart of step 205 in FIG. 2.
FIG. 5 is a diagram illustrating an example of searching for the maximum value in the similarity matrix A.
FIG. 6 is a diagram illustrating an example of searching for the maximum value in the similarity matrix A.
FIG. 7 is a diagram showing the relationship between the number of documents and the processing time in the automatic document classification processing in comparison with the conventional method and the present invention.
FIG. 8 is a configuration diagram of a cluster analysis processing apparatus according to another embodiment of the present invention.
FIG. 9 is an image diagram of cluster analysis.
[Explanation of symbols]
101 Data input unit
102 Data storage unit
103 Cluster analysis unit
104 Display unit
201-205, 301-309, 401-412 Steps
501 Input device
502, 503 Storage devices
504 Output device
505 Recording medium
506 Data processing device

Claims

A cluster analysis processing method performed by a cluster analysis processing apparatus, in which
a cluster analysis processing unit of the cluster analysis processing apparatus performs:
a process of searching a matrix of similarities between the data for the maximum similarity of each row, creating a next candidate storage table whose elements are the maximum similarity and the number of the column taking that maximum, and searching the next candidate storage table to obtain the maximum value in the similarity matrix and the row ID and column ID taking that maximum value; and
a process of generating a new cluster and adding it to the similarity matrix, deleting from the similarity matrix all data whose row ID is the row ID taking the maximum value, all data whose row ID is the value of the column ID taking the maximum value, all data whose column ID is the value of the row ID taking the maximum value, and all data whose column ID is the value of the column ID taking the maximum value, and deleting from the next candidate storage table the data whose row ID is the row ID taking the maximum value or the value of the column ID taking the maximum value,
the cluster analysis processing method being characterized by repeating these processes until a predetermined end condition is satisfied.

A recording medium that records a cluster analysis processing program that causes a computer to operate as a cluster analysis processing apparatus having a cluster analysis processing unit,
wherein the cluster analysis processing unit
searches a matrix of similarities between N pieces of data for the maximum similarity of each row, creates a next candidate storage table whose elements are the maximum similarity and the number of the column taking that maximum, and searches the next candidate storage table to obtain the maximum value in the similarity matrix and the row ID and column ID taking that maximum value,
generates a new cluster and adds it to the similarity matrix, deletes from the similarity matrix all data whose row ID is the row ID taking the maximum value, all data whose row ID is the value of the column ID taking the maximum value, all data whose column ID is the value of the row ID taking the maximum value, and all data whose column ID is the value of the column ID taking the maximum value, and deletes from the next candidate storage table the data whose row ID is the row ID taking the maximum value or the value of the column ID taking the maximum value,
and repeats this processing until a predetermined end condition is satisfied,
the recording medium recording the cluster analysis processing program for causing a computer to operate as a cluster analysis processing apparatus being characterized by the above.

A cluster analysis processing apparatus characterized in that a cluster analysis processing unit
searches a matrix of similarities between N pieces of data for the maximum similarity of each row, creates a next candidate storage table whose elements are the maximum similarity and the number of the column taking that maximum, and searches the next candidate storage table to obtain the maximum value in the similarity matrix and the row ID and column ID taking that maximum value,
generates a new cluster and adds it to the similarity matrix, deletes from the similarity matrix all data whose row ID is the row ID taking the maximum value, all data whose row ID is the value of the column ID taking the maximum value, all data whose column ID is the value of the row ID taking the maximum value, and all data whose column ID is the value of the column ID taking the maximum value, and deletes from the next candidate storage table the data whose row ID is the row ID taking the maximum value or the value of the column ID taking the maximum value,
and repeats this processing until the process termination condition is satisfied.