JP5255484B2

JP5255484B2 - Clustering distance learning device and program thereof, and clustering device

Info

Publication number: JP5255484B2
Application number: JP2009045015A
Authority: JP
Inventors: 健小早川
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2009-02-27
Filing date: 2009-02-27
Publication date: 2013-08-07
Anticipated expiration: 2029-02-27
Also published as: JP2010198518A

Description

本発明は、クラスタリングを行う際に類似度として用いるデータ間の距離のパラメータを学習するクラスタリング距離学習装置およびそのプログラム、ならびに、それを用いたクラスタリング装置に関する。 The present invention relates to a clustering distance learning device that learns a parameter of a distance between data used as a similarity when performing clustering, a program thereof, and a clustering device using the same.

従来、画像処理、言語処理等において、分類範疇が未知の複数のデータを、類似した同一のクラスタに分類するクラスタリングの手法が知られている（例えば、特許文献１参照）。このようなクラスタリングの手法においては、データが類似しているか否かの判定に、類似度としてデータ間の距離を用いている。この距離としては、ユークリッド距離やコサイン距離を用い、距離尺度を固定にするのが一般的である。 2. Description of the Related Art Conventionally, a clustering technique for classifying a plurality of pieces of data with unknown classification categories into similar identical clusters in image processing, language processing, and the like (see, for example, Patent Document 1). In such a clustering technique, the distance between data is used as the similarity in determining whether or not the data are similar. As this distance, Euclidean distance or cosine distance is generally used, and the distance scale is generally fixed.

また、従来のクラスタリングは、教師データを用いずに分類を行う教師なし手法が確立されていたが、近年、教師あり手法が用いられ、半教師ありクラスタリングとして研究がなされている（非特許文献１参照）。この半教師ありクラスタリングとは、クラスタリング対象のデータのいくつかに、予めクラスタに関する情報を与えることで、事前にクラスタリングに関する学習を行い、その学習結果に応じて他のクラスタリング対象のデータをクラスタリングするものである。この学習においては、よりよいクラスタリング性能を得るために、距離尺度を学習データに応じて設定することが望ましい。
そのため、従来、この学習において、距離としてマハラノビス距離を用い、距離尺度、すなわち、マハラノビス距離のパラメータを反復解法により推定する手法が提案されている（非特許文献２参照）。 In addition, in the conventional clustering, an unsupervised method for performing classification without using teacher data has been established, but recently, a supervised method is used, and research is being conducted as semi-supervised clustering (Non-patent Document 1). reference). This semi-supervised clustering is a method in which information related to the cluster is given to some of the data to be clustered in advance to perform learning related to clustering in advance, and other data to be clustered is clustered according to the learning result. It is. In this learning, in order to obtain better clustering performance, it is desirable to set the distance measure according to the learning data.
Therefore, conventionally, in this learning, a method has been proposed in which the Mahalanobis distance is used as a distance and a distance scale, that is, a parameter of the Mahalanobis distance is estimated by an iterative solution (see Non-Patent Document 2).

特許第２９６７３１２号公報Japanese Patent No. 2967312

S. Basu, M.Bilenko and R. J.Mooney:“A probabilistic framework for semi-supervised clustering”, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59-68(2004).S. Basu, M. Bilenko and R. J. Mooney: “A probabilistic framework for semi-supervised clustering”, Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 59-68 (2004). E. P.Xing, A. Y.Ng, M. I.Jordan and S. Russell:“Distance metric learning, with application to clustering with side-information”, Neural Information Processing Systems(NIPS)(2003).E. P. Xing, A. Y. Ng, M. I. Jordan and S. Russell: “Distance metric learning, with application to clustering with side-information”, Neural Information Processing Systems (NIPS) (2003).

従来のユークリッド距離やコサイン距離を用い、距離尺度を固定にしてクラスタリングを行う手法では、学習データに応じて調整するパラメータが存在しない。そのため、従来の手法では、クラスタリング対象のデータに応じた（例えば、データ分散の特性等に応じた）最適なクラスタリングを行うことができないという問題がある。
また、従来のマハラノビス距離を用い、学習データにより、予めマハラノビス距離のパラメータを推定する手法では、その推定を行う際に反復演算を行っている。そのため、従来の手法では、マハラノビス距離のパラメータ推定に時間がかかってしまうという問題がある。 In the conventional method of performing clustering with the distance scale fixed using the Euclidean distance or the cosine distance, there is no parameter to be adjusted according to the learning data. Therefore, the conventional method has a problem that optimal clustering according to data to be clustered (for example, according to data distribution characteristics) cannot be performed.
Moreover, in the method of estimating the parameter of Mahalanobis distance beforehand from learning data using the conventional Mahalanobis distance, iterative calculation is performed when performing the estimation. Therefore, the conventional method has a problem that it takes time to estimate the parameter of the Mahalanobis distance.

本発明は、以上のような問題点に鑑みてなされたものであり、半教師ありクラスタリングに用いるマハラノビス距離のパラメータを、学習データから反復演算を行わずに推定することが可能なクラスタリング距離学習装置およびそのプログラム、ならびに、クラスタリング装置を提供することを目的とする。 The present invention has been made in view of the above problems, and is a clustering distance learning device capable of estimating Mahalanobis distance parameters used for semi-supervised clustering from learning data without performing iterative operations. It is an object to provide a program and a clustering apparatus thereof.

本発明は、前記目的を達成するために創案されたものであり、まず、請求項１に記載のクラスタリング距離学習装置は、データの特徴量の類似度である距離により複数のデータをクラスタリングする際に用いるマハラノビス距離のパラメータを学習するクラスタリング距離学習装置であって、クラスタ中心記憶手段と、学習データ入力手段と、パラメータ演算手段と、を備える構成とした。 The present invention was devised to achieve the above object. First, the clustering distance learning device according to claim 1 performs clustering of a plurality of data based on a distance that is a similarity of data feature amounts. A clustering distance learning device that learns the Mahalanobis distance parameters used in the above configuration, and includes a cluster center storage unit, a learning data input unit, and a parameter calculation unit.

かかる構成において、クラスタリング距離学習装置は、クラスタ中心記憶手段に、クラスタごとに、データの特徴量空間におけるクラスタ中心の値を記憶する。このクラスタ中心は、データをクラスタリングする際のクラスタごとの特徴量空間における座標値である。また、特徴量空間とは、データの１つ以上の特徴量をベクトルとした座標空間である。ここで、データの特徴量とは、各データを分類するための指標となるものであって、例えば、言語処理の場合、音声データを分析することで抽出したケプストラム係数、文書データを形態素解析することで抽出した形態素の出現頻度等である。 In this configuration, the clustering distance learning device stores the cluster center value in the data feature amount space for each cluster in the cluster center storage unit. The cluster center is a coordinate value in the feature amount space for each cluster when data is clustered. The feature amount space is a coordinate space using one or more feature amounts of data as vectors. Here, the feature amount of data is an index for classifying each data. For example, in the case of language processing, cepstrum coefficients extracted by analyzing speech data and morphological analysis of document data are performed. The appearance frequency of the extracted morpheme.

そして、クラスタリング距離学習装置は、学習データ入力手段によって、どのクラスタに属するかが既知のデータである学習データを入力する。ここで学習データは、クラスタリング対象のデータと同種のデータであって、マハラノビス距離のパラメータを推定するために使用される。なお、学習データのクラスタを既知とするためには、学習データ入力手段が、クラスタを指定された後に、学習データをクラスタごとにまとめて入力することとしてもよいし、学習データそのものにクラスタの情報を付加して入力することとしてもよい。 Then, the clustering distance learning device inputs learning data that is known data as to which cluster belongs to the learning data input means. Here, the learning data is the same kind of data as the data to be clustered, and is used to estimate the Mahalanobis distance parameter. In order to make the learning data cluster known, the learning data input means may input the learning data collectively for each cluster after the cluster is designated. Alternatively, the learning data itself may include the cluster information. It is good also as inputting by adding.

そして、クラスタリング距離学習装置は、パラメータ演算手段によって、マハラノビス距離を計算するための行列と当該行列の転置行列との積の対角成分の和に予め定めた拘束条件を設け、学習データ入力手段で入力された学習データと、当該学習データのクラスタに対応するクラスタ中心記憶手段に記憶されているクラスタ中心とに基づいて、行列をマハラノビス距離のパラメータとして演算する。すなわち、クラスタリング距離学習装置は、パラメータ演算手段において、予め定めた拘束条件を設けることで、マハラノビス距離を計算するためのパラメータである行列を、反復演算することなく特定する。 Then, the clustering distance learning device provides a predetermined constraint condition on the sum of diagonal components of the product of the matrix for calculating the Mahalanobis distance and the transposed matrix of the matrix by the parameter calculation means, and the learning data input means A matrix is calculated as a Mahalanobis distance parameter based on the input learning data and the cluster center stored in the cluster center storage unit corresponding to the cluster of the learning data. That is, the clustering distance learning apparatus specifies a matrix that is a parameter for calculating the Mahalanobis distance without iterative calculation by providing a predetermined constraint condition in the parameter calculation means.

また、請求項２に記載のクラスタリング距離学習装置は、請求項１に記載のクラスタリング距離学習装置において、前記行列をＡとしたとき、当該行列Ａに対する前記拘束条件をｔｒ（Ａ^ＴＡ）＝１とし、パラメータ演算手段が、前記学習データをｘ、当該学習データを識別するｘの添字をｎ，ｎ′、前記クラスタ中心をμ、当該クラスタ中心のクラスタを識別するμの添字をｋ，ｋ′、添字ｎ，ｎ′の学習データｘが添字ｋ，ｋ′のクラスタに属する場合に値“１”、それ以外の場合に値“０”をとる属性値をｒ_ｎｋ，ｒ_ｎ′ｋ′で表したとき、前記行例Ａを、 Further, the clustering distance learning device according to claim 2 is the clustering distance learning device according to claim 1, wherein the constraint condition for the matrix A is tr (A ^T A) = 1, where A is the matrix. And the parameter calculation means sets the learning data to x, the subscript of x identifying the learning data to n, n ′, the cluster center to μ, and the subscript of μ to identify the cluster at the cluster center to k, k ′. , R _nk and r _{n′k ′} are attribute values having a value “1” when the learning data x of the subscripts n and n ′ belong to the cluster of the subscripts k and k ′, and a value “0” _otherwise . When the above example A is expressed,

により演算する構成とした。 It was set as the structure which calculates by.

かかる構成において、拘束条件を設けることで、マハラノビス距離のパラメータの推定を、学習データにおけるマハラノビス距離が最小となる拘束条件付最小化問題として解くことができる。そこで、クラスタリング距離学習装置は、パラメータ演算手段によって、この拘束条件付最小化問題の解である行例Ａを演算することで、マハラノビス距離のパラメータ（行列）を推定する。
このように、クラスタリング距離学習装置は、固有の式によりマハラノビス距離のパラメータ（行列）を推定するため、反復演算を行う必要がない。 In such a configuration, by providing the constraint condition, the estimation of the parameter of the Mahalanobis distance can be solved as a minimization problem with a constraint condition that minimizes the Mahalanobis distance in the learning data. Therefore, the clustering distance learning device estimates the parameter (matrix) of the Mahalanobis distance by calculating the example A which is a solution to the constraint condition minimization problem by the parameter calculation means.
In this way, the clustering distance learning device estimates the Mahalanobis distance parameter (matrix) using a unique expression, and thus does not need to perform iterative calculations.

さらに、請求項３に記載のクラスタリング距離学習プログラムは、データの特徴量の類似度である距離により複数のデータをクラスタリングする際に用いるマハラノビス距離のパラメータを学習するために、コンピュータを、学習データ入力手段、パラメータ演算手段、として機能させる構成とした。 Further, the clustering distance learning program according to claim 3 is configured to input a learning data into a computer in order to learn a Mahalanobis distance parameter used when clustering a plurality of data based on a distance that is a similarity of data feature amounts. And function as a parameter calculation means.

かかる構成において、クラスタリング距離学習プログラムは、学習データ入力手段によって、どのクラスタに属するかが既知のデータである学習データを入力する。そして、クラスタリング距離学習プログラムは、パラメータ演算手段によって、マハラノビス距離を計算するための行列をＡとしたとき、当該行列Ａに対する拘束条件をｔｒ（Ａ^ＴＡ）＝１とし、学習データをｘ、当該学習データを識別するｘの添字をｎ，ｎ′、クラスタ中心記憶手段に記憶されているデータの特徴量空間におけるクラスタごとのクラスタ中心をμ、当該クラスタ中心のクラスタを識別するμの添字をｋ，ｋ′、添字ｎ，ｎ′の学習データｘが添字ｋ，ｋ′のクラスタに属する場合に値“１”、それ以外の場合に値“０”をとる属性値をｒ_ｎｋ，ｒ_ｎ′ｋ′で表したとき、行例Ａを、 In such a configuration, the clustering distance learning program inputs learning data that is known data to which cluster it belongs by the learning data input means. Then, the clustering distance learning program sets the constraint condition for the matrix A to tr (A ^T A) = 1, the learning data x, and A, where A is a matrix for calculating the Mahalanobis distance by the parameter calculation means. The subscripts of x for identifying the learning data are n, n ', the cluster center for each cluster in the feature amount space of the data stored in the cluster center storage means is μ, and the subscript of μ for identifying the cluster at the cluster center is k. , k ', the subscript n, n' learning data x is the subscript k of, k 'value "1" if it belongs to a cluster of, and the other takes the value "0" when the attribute value _{_r} nk, _r _n' When represented by _{k ′} , example A is

により演算する。 Calculate by

また、請求項４に記載のクラスタリング装置は、データの特徴量の類似度にマハラノビス距離を用いて、複数のデータをクラスタリングするクラスタリング装置において、前記マハラノビス距離のパラメータを学習する距離学習手段を備え、前記距離学習手段が、クラスタ中心記憶手段と、学習データ入力手段と、パラメータ演算手段と、を備える構成とした。 Further, the clustering device according to claim 4 includes a distance learning unit that learns a parameter of the Mahalanobis distance in a clustering device that clusters a plurality of data using the Mahalanobis distance for the similarity of the feature amount of the data, The distance learning unit includes a cluster center storage unit, a learning data input unit, and a parameter calculation unit.

かかる構成において、クラスタリング装置は、クラスタ中心記憶手段に、クラスタごとに、データの特徴量空間におけるクラスタ中心の値を記憶する。また、クラスタリング装置は、学習データ入力手段によって、どのクラスタに属するかが既知のデータである学習データを入力する。そして、クラスタリング装置は、パラメータ演算手段によって、マハラノビス距離を計算するための行列と当該行列の転置行列との積の対角成分の和に予め定めた拘束条件を設け、学習データ入力手段で入力された学習データと、当該学習データのクラスタに対応するクラスタ中心記憶手段に記憶されているクラスタ中心とに基づいて、行列をマハラノビス距離のパラメータとして演算する。 In such a configuration, the clustering apparatus stores the cluster center value in the data feature space for each cluster in the cluster center storage means. Further, the clustering apparatus inputs learning data, which is known data to which cluster belongs, by the learning data input means. The clustering device provides a predetermined constraint condition for the sum of diagonal components of the product of the matrix for calculating the Mahalanobis distance and the transposed matrix of the matrix by the parameter calculation means, and is input by the learning data input means. The matrix is calculated as a Mahalanobis distance parameter based on the learned data and the cluster center stored in the cluster center storage unit corresponding to the cluster of the learned data.

さらに、請求項５に記載のクラスタリング装置は、請求項４に記載のクラスタリング装置において、前記行列をＡとしたとき、当該行列Ａに対する前記拘束条件をｔｒ（Ａ^ＴＡ）＝１とし、パラメータ演算手段が、前記学習データをｘ、当該学習データを識別するｘの添字をｎ，ｎ′、前記クラスタ中心をμ、当該クラスタ中心のクラスタを識別するμの添字をｋ，ｋ′、添字ｎ，ｎ′の学習データｘが添字ｋ，ｋ′のクラスタに属する場合に値“１”、それ以外の場合に値“０”をとる属性値をｒ_ｎｋ，ｒ_ｎ′ｋ′で表したとき、前記行例Ａを、 Furthermore, the clustering device according to claim 5 is the clustering device according to claim 4, wherein when the matrix is A, the constraint condition for the matrix A is tr (A ^T A) = 1, and parameter calculation is performed. The means is x for the learning data, n and n ′ for the x for identifying the learning data, μ for the cluster center, k and k ′ for the cluster for identifying the cluster at the cluster center, and n, When the learning data x of n ′ is represented by r _nk , r _{n′k ′} , the attribute value taking the value “1” when it belongs to the cluster of subscripts k and k ′, and the value “0” _otherwise . Example line A

かかる構成において、拘束条件を設けることで、マハラノビス距離のパラメータの推定を、学習データにおけるマハラノビス距離が最小となる拘束条件付最小化問題として解くことができる。そこで、クラスタリング装置は、パラメータ演算手段によって、この拘束条件付最小化問題の解である行例Ａを演算することで、マハラノビス距離のパラメータ（行列）を推定する。 In such a configuration, by providing the constraint condition, the estimation of the parameter of the Mahalanobis distance can be solved as a minimization problem with a constraint condition that minimizes the Mahalanobis distance in the learning data. Therefore, the clustering apparatus estimates the Mahalanobis distance parameter (matrix) by calculating the example A which is the solution of the constraint-conditional minimization problem by the parameter calculation means.

本発明は、以下に示す優れた効果を奏するものである。
請求項１〜３に記載の発明によれば、クラスタリングに用いるマハラノビス距離のパラメータ（行列）を、学習データに応じて調整することができる。これによって、クラスタリング装置において、クラスタリング対象のデータを精度よくクラスタリングすることが可能なパラメータを推定することができる。また、請求項１〜３に記載の発明によれば、マハラノビス距離のパラメータ（行列）の推定を、反復演算することなく求めることができる。これによって、ＣＰＵ負荷を軽減することができるとともに、パラメータの推定時間を短縮することができる。 The present invention has the following excellent effects.
According to the first to third aspects of the present invention, the Mahalanobis distance parameter (matrix) used for clustering can be adjusted according to the learning data. Thereby, in the clustering apparatus, it is possible to estimate a parameter capable of accurately clustering data to be clustered. According to the first to third aspects of the present invention, the estimation of the Mahalanobis distance parameter (matrix) can be obtained without iterative calculation. As a result, the CPU load can be reduced and the parameter estimation time can be shortened.

請求項４，５に記載の発明によれば、クラスタリングに用いるマハラノビス距離のパラメータ（行列）を、学習データに応じて調整することができ、クラスタリング対象のデータを精度よくクラスタリングすることができる。また、請求項４，５に記載の発明によれば、マハラノビス距離のパラメータ（行列）の推定を、反復演算することなく求めることができる。これによって、ＣＰＵ負荷を軽減することができるとともに、パラメータの推定時間を短縮することができる。 According to the fourth and fifth aspects of the present invention, the Mahalanobis distance parameter (matrix) used for clustering can be adjusted according to the learning data, and the clustering target data can be clustered with high accuracy. Further, according to the fourth and fifth aspects of the present invention, the estimation of the Mahalanobis distance parameter (matrix) can be obtained without iterative calculation. As a result, the CPU load can be reduced and the parameter estimation time can be shortened.

本発明の実施形態に係るクラスタリング装置の全体構成を示す概略ブロック図である。It is a schematic block diagram which shows the whole structure of the clustering apparatus which concerns on embodiment of this invention. 本発明の実施形態に係るクラスタリング装置の全体構成を示す詳細ブロック図である。It is a detailed block diagram which shows the whole structure of the clustering apparatus which concerns on embodiment of this invention. 本発明の実施形態に係るクラスタリング装置の距離学習手段における学習動作（パラメータ推定動作）を示すフローチャートである。It is a flowchart which shows the learning operation | movement (parameter estimation operation | movement) in the distance learning means of the clustering apparatus which concerns on embodiment of this invention. 本発明の実施形態に係るクラスタリング装置のクラスタリング実行手段におけるクラスタリング実行動作を示すフローチャートである。It is a flowchart which shows the clustering execution operation | movement in the clustering execution means of the clustering apparatus concerning embodiment of this invention.

以下、本発明の実施の形態について図面を参照して説明する。
［クラスタリング装置の概要］
最初に、図１を参照して、本発明の実施形態に係る距離学習手段（クラスタリング距離学習装置）を含んだクラスタリング装置の概要について説明する。
図１に示したクラスタリング装置１は、距離学習手段２によって、学習データから、データの特徴量の類似度である距離に用いるマハラノビス距離のパラメータ（対称正定値行列）を推定する。
なお、ベクトルａ，ｂ間のマハラノビス距離Ｍ（ａ，ｂ）は、行列（対称正定値行列）Ａを用いて、以下の（１）式により定義される距離である。また、Ｔは転置行列を示す。 Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[Overview of clustering equipment]
First, an overview of a clustering apparatus including distance learning means (clustering distance learning apparatus) according to an embodiment of the present invention will be described with reference to FIG.
The clustering apparatus 1 shown in FIG. 1 uses a distance learning unit 2 to estimate a Mahalanobis distance parameter (a symmetric positive definite matrix) to be used for the distance that is the similarity of the data feature amount from the learning data.
The Mahalanobis distance M (a, b) between the vectors a and b is a distance defined by the following equation (1) using a matrix (symmetric positive definite matrix) A. T represents a transposed matrix.

すなわち、マハラノビス距離は、行列Ａにより特定される距離である。そこで、クラスタリング装置１は、距離学習手段２によって、学習データからマハラノビス距離を算出するために用いる行列Ａをパラメータとして推定する。
このとき、距離学習手段２は、パラメータ（対称正定値行列）に対して予め拘束条件を設けることで、反復演算を行わずにパラメータを推定する。 That is, the Mahalanobis distance is a distance specified by the matrix A. Therefore, the clustering apparatus 1 uses the distance learning unit 2 to estimate the matrix A used for calculating the Mahalanobis distance from the learning data as a parameter.
At this time, the distance learning means 2 estimates the parameter without performing the iterative calculation by providing a constraint condition in advance for the parameter (symmetric positive definite matrix).

そして、クラスタリング装置１は、クラスタリング実行手段３によって、特徴量空間におけるクラスタ中心とデータとの距離に、距離学習手段２で推定したパラメータ（行列Ａ）を有するマハラノビス距離を用い、ｋ−平均（ｋ−ｍｅａｎｓ）法により、データをクラスタごとにクラスタリングする。 Then, the clustering device 1 uses the Mahalanobis distance having the parameter (matrix A) estimated by the distance learning unit 2 as the distance between the cluster center and the data in the feature amount space by the clustering execution unit 3, and k-average (k The data is clustered for each cluster by the -means) method.

このように、クラスタリング装置１は、予め与えられた学習データから、マハラノビス距離のパラメータ（対称正定値行列）を推定し、当該マハラノビス距離を用いてクラスタリングを行う半教師ありクラスタリング手法を実現するものである。
以下、本発明の実施形態に係るクラスタリング装置の構成および動作について順次説明する。 As described above, the clustering device 1 realizes a semi-supervised clustering method that estimates the Mahalanobis distance parameter (symmetric positive definite matrix) from the given learning data and performs clustering using the Mahalanobis distance. is there.
Hereinafter, the configuration and operation of the clustering apparatus according to the embodiment of the present invention will be sequentially described.

［クラスタリング装置の構成］
まず、図２を参照して、本発明の実施形態に係るクラスタリング装置の構成について説明する。このクラスタリング装置１は、データの特徴量の類似度にマハラノビス距離を用いて、複数のデータをクラスタリングするものである。ここでは、クラスタリング装置１は、距離学習手段２と、クラスタリング実行手段３と、を備えている。 [Configuration of clustering device]
First, the configuration of the clustering apparatus according to the embodiment of the present invention will be described with reference to FIG. The clustering device 1 is a device that clusters a plurality of data by using the Mahalanobis distance as the similarity of the feature amount of the data. Here, the clustering apparatus 1 includes distance learning means 2 and clustering execution means 3.

距離学習手段（クラスタリング距離学習装置）２は、後記するクラスタリング実行手段３において使用するマハラノビス距離のパラメータを学習（推定）するものである。ここでは、距離学習手段２は、記憶手段２１と、学習データ入力手段２２と、特徴量抽出手段２３と、パラメータ演算手段２４と、を備えている。 The distance learning means (clustering distance learning apparatus) 2 learns (estimates) the Mahalanobis distance parameter used in the clustering execution means 3 described later. Here, the distance learning unit 2 includes a storage unit 21, a learning data input unit 22, a feature amount extraction unit 23, and a parameter calculation unit 24.

記憶手段（クラスタ中心記憶手段）２１は、予め定めたデータの分類数（クラスタ数）ごとに、クラスタ中心の値を記憶するものであって、ハードディスク等の一般的な記憶装置である。
この記憶手段２１に記憶するクラスタ中心の値は、予め定めたクラスタの数（Ｋ個とする）だけ準備すれば、その値は特に限定するものではない。例えば、すべてのクラスタ中心μ_ｋ（ｋ＝１，２,…，Ｋ）を以下の（２）式に示すように、特徴量空間上の原点とする。 The storage means (cluster center storage means) 21 stores a cluster center value for each predetermined number of data classifications (number of clusters), and is a general storage device such as a hard disk.
The cluster center value stored in the storage means 21 is not particularly limited as long as a predetermined number (K) of clusters is prepared. For example, all cluster centers μ _k (k = 1, 2,..., K) are set as origins in the feature amount space as shown in the following equation (2).

なお、この記憶手段２１には、後記するパラメータ演算手段２４により演算されたマハラノビス距離の行例（パラメータ）Ａも記憶されるものとする。 The storage means 21 also stores a line example (parameter) A of Mahalanobis distance calculated by the parameter calculation means 24 described later.

学習データ入力手段２２は、外部からマハラノビス距離のパラメータを学習するための学習データを入力するものである。この学習データは、どのクラスタに属するデータであるかが既知のデータである。例えば、学習データ入力手段２２は、クラスタを指定された後に、学習データをクラスタごとにまとめて入力することとしてもよいし、学習データそのものにクラスタの情報を付加して入力することとしてもよい。 The learning data input means 22 inputs learning data for learning a parameter of Mahalanobis distance from the outside. This learning data is data in which the cluster belongs is known. For example, the learning data input unit 22 may input the learning data collectively for each cluster after the cluster is specified, or may input the learning data itself by adding the cluster information.

また、この学習データは、クラスタリング実行手段３においてクラスタリングするデータと同種のデータである。なお、この学習データは、画像データ、音声データ、文書データ等、その特徴量において分類可能なデータであれば、どのようなデータであっても構わない。この学習データ入力手段２２は、入力した学習データを特徴量抽出手段２３に出力する。 The learning data is the same type of data that is clustered by the clustering execution means 3. Note that the learning data may be any data as long as the data can be classified by the feature amount, such as image data, audio data, document data, and the like. The learning data input unit 22 outputs the input learning data to the feature amount extraction unit 23.

特徴量抽出手段２３は、学習データ入力手段２２で入力された学習データの特徴量を抽出するものである。なお、この特徴量抽出手段２３は、学習データから抽出する特徴量の種類を、予め学習データの種類に応じて定めておく。あるいは、学習データの種類が固定の場合は、予め定めた固有の特徴量を抽出するものであってもよい。
例えば、特徴量抽出手段２３は、学習データが、画像データである場合、色分布、動きベクトル等を特徴量として抽出することができる。また、学習データが、音声データの場合、ケプストラム係数等、文書データの場合、形態素の出現頻度等を特徴量とすることができる。 The feature amount extraction unit 23 extracts the feature amount of the learning data input by the learning data input unit 22. Note that the feature quantity extraction means 23 determines the type of feature quantity to be extracted from the learning data in advance according to the type of learning data. Alternatively, when the type of learning data is fixed, a predetermined unique feature amount may be extracted.
For example, when the learning data is image data, the feature amount extraction unit 23 can extract a color distribution, a motion vector, and the like as feature amounts. Further, when the learning data is speech data, a cepstrum coefficient or the like, and when the learning data is document data, the appearance frequency or the like of the morpheme can be used as the feature amount.

この特徴量抽出手段２３で抽出された１以上の特徴量は、特徴量空間上のベクトルで表される。すなわち、１つの学習データは、特徴量空間上の１点で表される。そして、特徴量抽出手段２３は、学習データごとに抽出した特徴量をパラメータ演算手段２４に出力する。なお、以降の説明においては、学習データごとに抽出した特徴量をベクトルとし、特徴量のベクトルで表される学習データをｘ_ｎ（ｎ＝１，２,…，Ｎ）と表記する。 The one or more feature amounts extracted by the feature amount extraction unit 23 are represented by vectors on the feature amount space. That is, one learning data is represented by one point on the feature amount space. Then, the feature quantity extraction unit 23 outputs the feature quantity extracted for each learning data to the parameter calculation unit 24. In the following description, the feature amount extracted for each learning data is referred to as a vector, and the learning data represented by the feature amount vector is expressed as x _n (n = 1, 2,..., N).

パラメータ演算手段２４は、学習データ入力手段２２で入力され、特徴量抽出手段２３で特徴量が抽出された学習データｘ_ｎ（ｎ＝１，２,…，Ｎ）と、当該学習データのクラスタに対応して記憶手段２１に記憶されているクラスタ中心μ_ｋ（ｋ＝１，２,…，Ｋ）とに基づいて、マハラノビス距離の行列をパラメータとして演算するものである。 The parameter calculation means 24 is input to the learning data x _n (n = 1, 2,..., N), which is input by the learning data input means 22 and extracted by the feature quantity extraction means 23, and the learning data cluster. Correspondingly, based on the cluster centers μ _k (k = 1, 2,..., K) stored in the storage means 21, a Mahalanobis distance matrix is calculated as a parameter.

このパラメータ演算手段２４は、演算により求めた行列を記憶手段２１に書き込み、記憶する。なお、この記憶手段２１に記憶された行列Ａは、クラスタリング実行手段３において、データをクラスタリングする際に、クラスタ中心μ_ｋからのマハラノビス距離を算出するために使用される。
具体的には、パラメータ演算手段２４は、以下の（３）式によりマハラノビス距離の行列Ａを演算する。 The parameter calculation means 24 writes and stores the matrix obtained by the calculation in the storage means 21. The matrix A stored in the storage means 21 is used to calculate the Mahalanobis distance from the cluster center μ _k when the clustering execution means 3 clusters data.
Specifically, the parameter calculation means 24 calculates the Mahalanobis distance matrix A by the following equation (3).

この（３）式では、学習データをベクトル表記であるｘ_ｎで表す。なお、添字ｎは、当該学習データを識別する情報であって、｛１，２，…，Ｎ｝とする。また、クラスタ中心をベクトル表記であるμ_ｋで表す。なお、添字ｋは、クラスタを識別する情報であって、｛１，２，…，Ｋ｝とする。さらに、ここでは、属性値をｒ_ｎｋで表す。この属性値は、添字ｎのデータが添字ｋのクラスタに属する場合に値“１”、それ以外の場合に値“０”をとる１−ｏｆ−ｋ表記法で表記することとする。なお、この（３）式において、添字ｎ′およびｋ′は、それぞれ添字ｎおよびｋと同じ意味を示し、二重の総和記号Σの変数を区別するために用いている。また、Ｔは転置行列を示し、ｔｒ（・）は括弧内の行列における対角成分の和（トレース）を表している。 In this equation (3), the learning data is represented by _xn which is a vector notation. The subscript n is information for identifying the learning data and is {1, 2,..., N}. The cluster center is represented by μ _k which is a vector notation. The subscript k is information for identifying a cluster and is {1, 2,..., K}. Further, here, the attribute value is represented by r _nk . This attribute value is expressed in a 1-of-k notation that takes the value “1” when the data of the subscript n belongs to the cluster of the subscript k, and the value “0” otherwise. In the equation (3), the subscripts n ′ and k ′ have the same meaning as the subscripts n and k, respectively, and are used to distinguish the variable of the double summation symbol Σ. T represents a transposed matrix, and tr (•) represents the sum (trace) of diagonal components in the matrix in parentheses.

このように、パラメータ演算手段２４は、学習データｘ_ｎと、クラスタ中心μ_ｋとを用いて、反復演算を行わずに演算を行い行列Ａを算出する。なお、この演算により、マハラノビス距離の行列を推定することができる理由については、後で説明を行うこととする。 As described above, the parameter calculation unit 24 calculates the matrix A by using the learning data _xn and the cluster center μ _k without performing iterative calculation. The reason why the matrix of the Mahalanobis distance can be estimated by this calculation will be described later.

クラスタリング実行手段（クラスタリング実行装置）３は、距離学習手段２で推定されたマハラノビス距離のパラメータ（行列）を用いて、クラスタリング対象のデータを、ｋ−平均（ｋ−ｍｅａｎｓ）法によりクラスタリングするものである。ここでは、クラスタリング実行手段３は、データ入力手段３１と、特徴量抽出手段３２と、クラスタ生成手段３３と、クラスタ中心更新手段３４と、クラスタ判定手段３５と、を備えている。 The clustering execution means (clustering execution apparatus) 3 uses the Mahalanobis distance parameter (matrix) estimated by the distance learning means 2 to cluster the data to be clustered by the k-means method. is there. Here, the clustering execution unit 3 includes a data input unit 31, a feature amount extraction unit 32, a cluster generation unit 33, a cluster center update unit 34, and a cluster determination unit 35.

データ入力手段３１は、外部からクラスタリング対象のデータを入力するものである。このデータは、距離学習手段２においてマハラノビス距離のパラメータを学習するために用いた学習データと同種のデータである。すなわち、このデータは、画像データ、音声データ、文書データ等、その特徴量において分類可能なデータであれば、どのようなデータであっても構わない。このデータ入力手段３１は、入力したデータを特徴量抽出手段３２に出力する。 The data input means 31 is for inputting data to be clustered from the outside. This data is the same type of learning data used for learning the Mahalanobis distance parameter in the distance learning means 2. In other words, this data may be any data as long as it can be classified by the feature amount, such as image data, audio data, document data, and the like. The data input unit 31 outputs the input data to the feature amount extraction unit 32.

特徴量抽出手段３２は、データ入力手段３１で入力されたデータの特徴量を抽出するものである。この特徴量抽出手段３２は、データから抽出する特徴量の種類を、予めデータの種類に応じて定めておく。あるいは、データの種類が固定の場合は、予め定めた固有の特徴量を抽出するものであってもよい。なお、特徴量抽出手段３２が抽出する特徴量は、距離学習手段２の特徴量抽出手段３２が抽出する特徴量と同種のものとする。 The feature quantity extraction unit 32 extracts the feature quantity of the data input by the data input unit 31. The feature quantity extraction means 32 determines the type of feature quantity to be extracted from the data in advance according to the type of data. Alternatively, when the data type is fixed, a predetermined characteristic amount may be extracted. The feature amount extracted by the feature amount extraction unit 32 is the same type as the feature amount extracted by the feature amount extraction unit 32 of the distance learning unit 2.

この特徴量抽出手段３２で抽出された１以上の特徴量は、特徴量空間上のベクトルで表される。すなわち、１つのデータは、特徴量空間上の１点で表される。そして、特徴量抽出手段３２は、データごとに抽出した特徴量をクラスタ生成手段３３に出力する。なお、以降の説明においては、データごとに抽出した特徴量をベクトルとし、特徴量のベクトルで表されるデータをｚ_ｍ（ｍ＝１，２,…，Ｍ）と表記する。 The one or more feature quantities extracted by the feature quantity extraction unit 32 are represented by vectors in the feature quantity space. That is, one piece of data is represented by one point on the feature amount space. Then, the feature quantity extraction unit 32 outputs the feature quantity extracted for each data to the cluster generation unit 33. In the following description, a feature amount extracted for each data is referred to as a vector, and data represented by the feature amount vector is expressed as z _m (m = 1, 2,..., M).

クラスタ生成手段３３は、データ入力手段３１で入力され、特徴量抽出手段３２で特徴量が抽出されたデータｚ_ｍ（ｍ＝１，２,…，Ｍ）について、距離学習手段２で推定されたマハラノビス距離のパラメータ（行列Ａ）を用いて、クラスタを生成するものである。 The cluster generation means 33 is estimated by the distance learning means 2 for the data z _m (m = 1, 2,..., M) input by the data input means 31 and extracted by the feature quantity extraction means 32. A cluster is generated using the Mahalanobis distance parameter (matrix A).

ここでは、クラスタ生成手段３３は、クラスタ中心μ_ｋとデータｚ_ｍとの距離を、距離学習手段２で推定された行列Ａを用いて前記（１）式により計算し、データｚ_ｍを、最も近いクラスタ中心μ_ｋに割り当てる（分類する）ことで、Ｋ個のクラスタを生成する。このクラスタ生成手段３３は、生成したＫ個のクラスタを、クラスタ中心μ_ｋとデータｚ_ｍとを対応付けて図示を省略したメモリに展開することとする。なお、クラスタ生成手段３３におけるクラスタ中心の初期値は、データｚ_ｍをクラスタ数Ｋで分割して割り当てたクラスタの中心とする。例えば、それぞれのクラスタに属するデータの特徴量空間上の平均とする。 Here, the cluster generation means 33 calculates the distance between the cluster center μ _k and the data z _m by the above equation (1) using the matrix A estimated by the distance learning means 2, and the data z _m By assigning (classifying) to the near cluster center μ _k , K clusters are generated. The cluster generation unit 33 expands the generated K clusters in a memory (not shown) in association with the cluster center μ _k and the data z _m . Note that the initial value of the cluster center in the cluster generation means 33 is the center of the cluster assigned by dividing the data z _m by the number K of clusters. For example, the average of the data belonging to each cluster on the feature amount space is used.

また、クラスタ生成手段３３は、後記するクラスタ判定手段３５において、クラスタ生成の終了が指示された場合に、メモリ（不図示）に展開されているクラスタをクラスタ結果として出力する。なお、クラスタ生成手段３３は、後記するクラスタ判定手段３５から、クラスタ生成の継続が指示された場合、後記するクラスタ中心更新手段３４において、更新されたクラスタ中心を用いてデータｚ_ｍのクラスタリングを行う。 Further, the cluster generation unit 33 outputs a cluster developed in a memory (not shown) as a cluster result when an end of cluster generation is instructed by the cluster determination unit 35 described later. The cluster generation unit 33 performs clustering of the data z _m using the updated cluster center in the cluster center update unit 34 described later when the cluster determination unit 35 described later instructs the continuation of cluster generation. .

クラスタ中心更新手段３４は、クラスタ生成手段３３で生成されたクラスタに基づいて、クラスタ中心を更新するものである。すなわち、クラスタ中心更新手段３４は、クラスタ生成手段３３で生成されたクラスタごとに、クラスタに属するデータｚ_ｍの特徴量空間における重心を求めることで、当該クラスタの新たなクラスタ中心とする。なお、クラスタ中心更新手段３４は、新たなクラスタ中心を、旧クラスタ中心と対応付けてメモリ（不図示）に書き込む。 The cluster center update unit 34 updates the cluster center based on the cluster generated by the cluster generation unit 33. That is, the cluster center updating unit 34 obtains the center of gravity of the data z _m belonging to the cluster in the feature amount space for each cluster generated by the cluster generating unit 33, thereby setting the new cluster center of the cluster. The cluster center update unit 34 writes the new cluster center in a memory (not shown) in association with the old cluster center.

クラスタ判定手段３５は、クラスタ中心更新手段３４で更新されたクラスタ中心と、旧クラスタ中心との距離（ユークリッド距離）に基づいて、クラスタリングの終了を判定するものである。このクラスタ判定手段３５は、例えば、更新されたクラスタ中心と、旧クラスタ中心との距離（更新量）が予め定めた距離（閾値）以下となった場合に、クラスタリングが終了したものと判定する。また、このクラスタ判定手段３５は、クラスタリングが終了したと判定した場合、クラスタ生成手段３３にクラスタ生成の終了を指示する。
なお、クラスタ判定手段３５は、クラスタ生成手段３３において、更新されたクラスタ中心でデータをクラスタリングし、割り当てるクラスタに変化がなくなった場合に、クラスタリングが終了したものと判定することとしてもよい。 The cluster determination unit 35 determines the end of clustering based on the distance (Euclidean distance) between the cluster center updated by the cluster center update unit 34 and the old cluster center. For example, when the distance (update amount) between the updated cluster center and the old cluster center is equal to or less than a predetermined distance (threshold value), the cluster determination unit 35 determines that the clustering has been completed. Further, when it is determined that the clustering has been completed, the cluster determination unit 35 instructs the cluster generation unit 33 to end the cluster generation.
Note that the cluster determination unit 35 may perform clustering of data at the updated cluster center in the cluster generation unit 33, and may determine that clustering has been completed when there is no change in the assigned cluster.

以上説明したように、クラスタリング装置１は、半教師ありクラスタリングとして、学習データに基づいて、マハラノビス距離のパラメータを推定し、そのパラメータを用いて、クラスタリング対象のデータをクラスタリングすることができる。このとき、クラスタリング装置１は、距離学習手段２において、マハラノビス距離のパラメータを、反復演算することなく推定することができ、学習データから高速にパラメータの推定を行うことができる。 As described above, the clustering device 1 can estimate the Mahalanobis distance parameter based on the learning data as semi-supervised clustering, and cluster the clustering target data using the parameter. At this time, the clustering apparatus 1 can estimate the Mahalanobis distance parameter without iterative calculation in the distance learning means 2, and can estimate the parameter at high speed from the learning data.

また、クラスタリング装置１は、図示を省略したＣＰＵやメモリを搭載した一般的なコンピュータで実現することができる。このとき、クラスタリング装置１は、コンピュータを、前記した各手段として機能させるクラスタリングプログラムにより動作させることができる。 Further, the clustering apparatus 1 can be realized by a general computer having a CPU and a memory (not shown). At this time, the clustering apparatus 1 can be operated by a clustering program that causes the computer to function as each of the means described above.

以上、本発明の実施形態について説明したが、本発明は、この実施形態に限定されるものではない。例えば、ここでは、クラスタリング装置１が、距離学習手段２と、クラスタリング実行手段３とでそれぞれ特徴量抽出手段２３，３２を備える構成としたが、１つの手段として構成し、入力データ（学習データ、クラスタ対象のデータ）に応じて、出力先（パラメータ演算手段２４またはクラスタ生成手段３３）を切り替える構成としてもよい。
また、この特徴量抽出手段２３，３２は、予め特徴量が抽出された学習データやクラスタ対象のデータが外部から入力される場合は、構成から省略してもよい。 Although the embodiment of the present invention has been described above, the present invention is not limited to this embodiment. For example, here, the clustering device 1 is configured to include the feature amount extraction units 23 and 32 in the distance learning unit 2 and the clustering execution unit 3, respectively. However, the clustering device 1 is configured as one unit, and input data (learning data, The output destination (parameter calculation means 24 or cluster generation means 33) may be switched according to the cluster target data.
Further, the feature amount extraction means 23 and 32 may be omitted from the configuration when learning data from which feature amounts have been extracted in advance or cluster target data is input from the outside.

また、ここでは、記憶手段２１に記憶するクラスタ中心μ_ｋの値を前記（２）式に示したように、特徴量空間上の原点としたが、以下の（４）式に示すように、学習データｘ_ｎ全体の特徴量空間上の平均（大域的平均）としてもよい。 Here, the value of the cluster center μ _k stored in the storage means 21 is the origin on the feature amount space as shown in the equation (2), but as shown in the following equation (4), It is good also as the average (global average) on the feature-value space of the learning data _xn whole.

あるいは、クラスタ中心μ_ｋの値を、以下の（５）式に示すように、クラスタｋごとの学習データｘ_ｎの特徴量空間上の平均（クラスタ平均）としてもよい。 Alternatively, the value of the cluster center μ _k may be an average (cluster average) in the feature amount space of the learning data x _n for each cluster k as shown in the following equation (5).

また、ここでは、クラスタリング装置１を、距離学習手段２とクラスタリング実行手段３とで構成したが、これらの各手段を、別個独立した装置（クラスタリング距離学習装置、クラスタリング実行装置）として構成してもよい。 Here, the clustering device 1 is composed of the distance learning means 2 and the clustering execution means 3, but each of these means may be configured as an independent device (clustering distance learning device, clustering execution device). Good.

この場合、クラスタリング距離学習装置（距離学習手段２）、クラスタリング実行装置（クラスタリング実行手段３）は、それぞれ図示を省略したＣＰＵやメモリを搭載した一般的なコンピュータで実現することができる。また、クラスタリング距離学習装置（距離学習手段２）は、コンピュータを、距離学習手段２で説明した各手段として機能させるクラスタリング距離学習プログラムにより動作させることができる。また、クラスタリング実行装置（クラスタリング実行手段３）は、コンピュータを、クラスタリング実行手段３で説明した各手段として機能させるクラスタリング実行プログラムにより動作させることができる。 In this case, the clustering distance learning device (distance learning means 2) and the clustering execution device (clustering execution means 3) can each be realized by a general computer equipped with a CPU and memory not shown. Further, the clustering distance learning device (distance learning means 2) can be operated by a clustering distance learning program that causes the computer to function as each means described in the distance learning means 2. Further, the clustering execution device (clustering execution means 3) can be operated by a clustering execution program that causes the computer to function as each means described in the clustering execution means 3.

［クラスタリング装置の動作］
次に、図３，４を参照して、本発明の実施形態に係るクラスタリング装置の動作について説明する。なお、ここでは、クラスタリング装置１の動作として、距離学習手段２において行われる学習動作（パラメータ推定動作）と、クラスタリング実行手段３において行われるクラスタリング実行動作とに分けて説明する。 [Operation of clustering device]
Next, with reference to FIGS. 3 and 4, the operation of the clustering apparatus according to the embodiment of the present invention will be described. Here, the operation of the clustering apparatus 1 will be described by dividing it into a learning operation (parameter estimation operation) performed in the distance learning means 2 and a clustering execution operation performed in the clustering execution means 3.

（学習動作）
最初に、図３を参照（構成については適宜図２参照）して、クラスタリング装置１（距離学習手段２）の学習動作（パラメータ推定動作）について説明する。なお、ここでは、記憶手段２１に、Ｋ個のクラスタ中心μ_ｋとして、予め特徴量空間上の原点が記憶されているものとする。 (Learning action)
First, a learning operation (parameter estimation operation) of the clustering apparatus 1 (distance learning means 2) will be described with reference to FIG. Here, it is assumed that the origin on the feature amount space is stored in advance in the storage unit 21 as K cluster centers μ _k .

まず、クラスタリング装置１は、距離学習手段２の学習データ入力手段２２によって、外部からマハラノビス距離のパラメータを学習するためのクラスタが既知の学習データを入力する（ステップＳ１０）。
そして、クラスタリング装置１は、距離学習手段２の特徴量抽出手段２３によって、ステップＳ１０で入力した学習データの特徴量を抽出する（ステップＳ１１）。 First, the clustering device 1 inputs learning data with a known cluster for learning the Mahalanobis distance parameter from the outside by the learning data input unit 22 of the distance learning unit 2 (step S10).
Then, the clustering apparatus 1 extracts the feature amount of the learning data input in step S10 by the feature amount extraction unit 23 of the distance learning unit 2 (step S11).

その後、クラスタリング装置１は、距離学習手段２のパラメータ演算手段２４によって、ステップＳ１１で特徴量が抽出された学習データｘ_ｎ（ｎ＝１，２,…，Ｎ）と、当該学習データのクラスタに対応して記憶手段２１に記憶されているクラスタ中心μ_ｋ（ｋ＝１，２,…，Ｋ）とに基づいて、前記（３）式により、マハラノビス距離の行列Ａを演算する（ステップＳ１２）。
そして、クラスタリング装置１は、距離学習手段２のパラメータ演算手段２４によって、ステップＳ１２で演算により求めたマハラノビス距離の行列Ａを、記憶手段２１に書き込み、記憶する（ステップＳ１３）。 Thereafter, the clustering device 1 applies the learning data x _n (n = 1, 2,..., N) from which the feature amount is extracted in step S11 by the parameter calculation unit 24 of the distance learning unit 2 and the cluster of the learning data. Correspondingly, based on the cluster centers μ _k (k = 1, 2,..., K) stored in the storage means 21, a matrix A of Mahalanobis distances is calculated by the equation (3) (step S12). .
Then, the clustering apparatus 1 writes and stores the Mahalanobis distance matrix A obtained by the calculation in step S12 in the storage unit 21 by the parameter calculation unit 24 of the distance learning unit 2 (step S13).

以上の動作によって、クラスタリング装置１は、マハラノビス距離を算出するために使用される行列Ａを、学習によって推定することができる。
なお、ここでは、記憶手段２１に、Ｋ個のクラスタ中心μ_ｋが予め記憶されているものとして説明したが、前記した（４）式や（５）式に示すように、学習データに応じてクラスタ中心の値を算出し記憶手段２１に記憶することとしてもよい。この場合、クラスタリング装置１は、ステップＳ１１以降に、距離学習手段２のパラメータ演算手段２４によって、ステップＳ１１で特徴量が抽出された学習データｘ_ｎに基づいて、前記した（４）式や（５）式に示す演算を行うことでクラスタ中心μ_ｋを算出し、記憶手段２１に書き込み、記憶する。 With the above operation, the clustering apparatus 1 can estimate the matrix A used for calculating the Mahalanobis distance by learning.
Here, the description has been made assuming that K cluster centers μ _k are stored in the storage means 21 in advance. However, as shown in the above equations (4) and (5), according to the learning data, The cluster center value may be calculated and stored in the storage unit 21. In this case, after step S11, the clustering apparatus 1 performs the above-described equations (4) and (5) based on the learning data _xn from which the feature amount is extracted in step S11 by the parameter calculation unit 24 of the distance learning unit 2. The cluster center μ _k is calculated by performing the calculation shown in FIG.

（クラスタリング実行動作）
次に、図４を参照（構成については適宜図２参照）して、クラスタリング装置１（クラスタリング実行手段３）のクラスタリング実行動作について説明する。
まず、クラスタリング装置１は、クラスタリング実行手段３のデータ入力手段３１によって、外部からクラスタリング対象のデータを入力する（ステップＳ２０）。
そして、クラスタリング装置１は、クラスタリング実行手段３の特徴量抽出手段３２によって、ステップＳ２０で入力したデータの特徴量を抽出する（ステップＳ２１）。 (Clustering execution operation)
Next, the clustering execution operation of the clustering apparatus 1 (clustering execution means 3) will be described with reference to FIG. 4 (refer to FIG. 2 as appropriate for the configuration).
First, the clustering device 1 inputs data to be clustered from the outside by the data input means 31 of the clustering execution means 3 (step S20).
Then, the clustering device 1 extracts the feature amount of the data input in step S20 by the feature amount extraction unit 32 of the clustering execution unit 3 (step S21).

その後、クラスタリング装置１は、クラスタリング実行手段３のクラスタ生成手段３３によって、ステップＳ２１で特徴量が抽出されたデータｚ_ｍ（ｍ＝１，２,…，Ｍ）をクラスタ数Ｋで分割し、それぞれのクラスタに属するデータの特徴量空間上の平均を求めることで、クラスタ中心の初期値を算出する（ステップＳ２２）。 Thereafter, the clustering device 1 divides the data z _m (m = 1, 2,..., M) from which the feature amount is extracted in step S21 by the cluster generation unit 33 of the clustering execution unit 3 by the number of clusters K, respectively. The initial value of the cluster center is calculated by obtaining the average of the data belonging to the cluster in the feature amount space (step S22).

そして、クラスタリング装置１は、クラスタリング実行手段３のクラスタ生成手段３３によって、クラスタ中心μ_ｋと、ステップＳ２１で特徴量が抽出されたデータｚ_ｍ（ｍ＝１，２,…，Ｍ）との距離を、距離学習手段２で推定され、記憶手段２１に記憶されている行列Ａを用いて前記（１）式により計算し、データｚ_ｍを、最も近いクラスタ中心μ_ｋに割り当てる（分類する）ことで、Ｋ個のクラスタを生成する（ステップＳ２３）。 The clustering apparatus 1 then determines the distance between the cluster center μ _k and the data z _m (m = 1, 2,..., M) from which the feature amount is extracted in step S21 by the cluster generation unit 33 of the clustering execution unit 3. Is calculated by the equation (1) using the matrix A estimated by the distance learning means 2 and stored in the storage means 21, and the data z _m is assigned (classified) to the nearest cluster center μ _k. Thus, K clusters are generated (step S23).

そして、クラスタリング装置１は、クラスタリング実行手段３のクラスタ中心更新手段３４によって、ステップＳ２３で生成されたクラスタごとに、クラスタに属するデータｚ_ｍの特徴量空間上の重心を求めることで、クラスタ中心を新たなクラスタ中心に更新する（ステップＳ２４）。
ここで、クラスタリング装置１は、クラスタリング実行手段３のクラスタ判定手段３５によって、ステップＳ２４で更新されたクラスタ中心と、旧クラスタ中心との距離差（更新量）が予め定めた距離（閾値）以下となったか否かを判定する（ステップＳ２５）。 Then, the clustering apparatus 1, the cluster centers updating means 34 clustering execution unit 3, for each cluster generated in step S23, by obtaining the centroid in the feature quantity space of the data z _m belonging to the cluster, the cluster center Update to a new cluster center (step S24).
Here, the clustering apparatus 1 determines that the distance difference (update amount) between the cluster center updated in step S24 and the old cluster center by the cluster determination unit 35 of the clustering execution unit 3 is equal to or less than a predetermined distance (threshold). It is determined whether or not (step S25).

そして、距離差（更新量）が閾値以下となった場合（ステップＳ２５でＹｅｓ）、クラスタリング装置１は、クラスタリング実行手段３のクラスタ生成手段３３によって、クラスタ結果を出力する（ステップＳ２６）。一方、距離差（更新量）が閾値より大きい場合（ステップＳ２５でＮｏ）、クラスタリング装置１は、ステップＳ２３に戻って、クラスタ生成手段３３によって、更新されたクラスタ中心によりクラスタを生成する動作を実行する。
以上の動作によって、クラスタリング装置１は、マハラノビス距離を用いて、ｋ−平均（ｋ−ｍｅａｎｓ）法により、データをクラスタごとにクラスタリングすることができる。 When the distance difference (update amount) is equal to or smaller than the threshold (Yes in Step S25), the clustering apparatus 1 outputs the cluster result by the cluster generation unit 33 of the clustering execution unit 3 (Step S26). On the other hand, when the distance difference (update amount) is larger than the threshold value (No in step S25), the clustering apparatus 1 returns to step S23 and performs an operation of generating a cluster with the updated cluster center by the cluster generation unit 33. To do.
With the above operation, the clustering apparatus 1 can cluster data for each cluster using the Mahalanobis distance by the k-means method.

なお、ここでは、更新されたクラスタ中心と、旧クラスタ中心との距離差（更新量）が予め定めた距離（閾値）以下となったか否かにより、ステップＳ２５においてクラスタ生成の終了判定を行ったが、この手法に限定されるものではない。例えば、ステップＳ２３において、更新されたクラスタ中心でクラスタを生成する際に、割り当てるクラスタに変化がなくなった場合に終了と判定することとしてもよい。 Here, the end of cluster generation is determined in step S25 depending on whether the distance difference (update amount) between the updated cluster center and the old cluster center is equal to or smaller than a predetermined distance (threshold). However, it is not limited to this method. For example, when a cluster is generated at the updated cluster center in step S23, it may be determined to end when there is no change in the allocated cluster.

［マハラノビス距離のパラメータ推定手法の説明］
次に、クラスタリング装置１が、距離学習手段２のパラメータ演算手段２４において、マハラノビス距離の算出に用いるパラメータ（行列）を、反復演算を行うことなく、前記（３）式の演算により求めることができる理由について説明する。 [Explanation of Mahalanobis distance parameter estimation method]
Next, the clustering device 1 can obtain the parameter (matrix) used for the calculation of the Mahalanobis distance by the calculation of the equation (3) without performing the iterative calculation in the parameter calculation unit 24 of the distance learning unit 2. The reason will be explained.

学習データｘ_ｎを最も精度よくクラスタリングするマハラノビス距離Ｍ（前記（１）式参照）の学習は、以下の（６）式に示す行列Ａとクラスタ中心μ_ｋの関数Ｊを最小にする問題と捉えることができる。 Learning the Mahalanobis distance M (see the above equation (1)) that clusters the learning data x _n with the highest accuracy is regarded as a problem of minimizing the function J of the matrix A and the cluster center μ _k shown in the following equation (6). be able to.

この関数Ｊは、クラスタｋのクラスタ中心μ_ｋと、当該クラスタｋに属する学習データｘ_ｎとのマハラノビス距離Ｍの和を、すべてのクラスタｋ（ｋ＝１，２,…，Ｋ）について総和する関数である。なお、ｒ_ｎｋは、添字ｎの学習データが添字ｋのクラスタに属する場合に値“１”、それ以外の場合に値“０”をとる。
ここで、関数Ｊ（Ａ，μ_ｋ）を、行列Ａで偏微分すると、以下の（７）式を得る。 This function J sums the sum of the Mahalanobis distance M between the cluster center μ _k of the cluster k and the learning data x _n belonging to the cluster k for all clusters k (k = 1, 2,..., K). It is a function. Note that r _nk takes a value “1” when the learning data of the subscript n belongs to the cluster of the subscript k, and takes a value “0” otherwise.
Here, when the function J (A, μ _k ) is partially differentiated by the matrix A, the following expression (7) is obtained.

この（７）式の右辺は、共分散行列である。つまり、その共分散行列の対角成分は、ｘ_ｎ＝μ_ｋでない限り正であり、前記（７）式の値は増大し続けるため、最小解が存在しないことになる。そこで、行列Ａの大きさを制限する拘束条件として、以下の（８）式を導入する。 The right side of the equation (7) is a covariance matrix. That is, the diagonal component of the covariance matrix is positive unless x _n = μ _k , and the value of equation (7) continues to increase, so there is no minimum solution. Therefore, the following equation (8) is introduced as a constraint condition for limiting the size of the matrix A.

なお、Ｔは転置行列を示し、ｔｒ（・）は括弧内の行列の対角成分の和（トレース）を表している。
この（８）式の拘束条件の下で、前記（６）式を最小にする行列Ａを決定する問題は、拘束条件付最小化問題であるため、ラグランジュの未定乗数αを導入して、以下の（９）式を最小化すればよいことになる。 T represents a transposed matrix, and tr (•) represents the sum (trace) of diagonal components of the matrix in parentheses.
Since the problem of determining the matrix A that minimizes the expression (6) under the constraint condition of the equation (8) is a constraint condition minimization problem, the Lagrange undetermined multiplier α is introduced and (9) Equation (9) should be minimized.

ここで、この（９）式を行列Ａで偏微分すると、以下の（１０）式を得る。 Here, when this equation (9) is partially differentiated by the matrix A, the following equation (10) is obtained.

この（１０）式が値“０”となる行列Ａは、以下の（１１）式で求められる。 The matrix A in which the expression (10) has a value “0” is obtained by the following expression (11).

ここで、前記（８）式の拘束条件に、この（１１）式で求められた行列Ａを代入すると、以下の（１２）式を得る。 Here, when the matrix A obtained by the equation (11) is substituted into the constraint condition of the equation (8), the following equation (12) is obtained.

この（１２）式を変形することで、以下の（１３）式を得る。 The following equation (13) is obtained by modifying the equation (12).

この（１３）式の結果を、前記（１１）式に代入することで、前記（３）式の行列Ａを得ることができる。
以上説明したように、行列Ａに拘束条件を設けることで、前記（３）式に示した行列Ａは、学習データｘ_ｎを最も精度よくクラスタリングするマハラノビス距離Ｍを算出するための行列となる。これによって、図２で説明した距離学習手段２のパラメータ演算手段２４は、行列Ａを反復演算することなく、前記（３）式の演算により、学習データを精度よくクラスタリングする行列を学習することができる。 By substituting the result of the equation (13) into the equation (11), the matrix A of the equation (3) can be obtained.
As described above, by providing a constraint condition on the matrix A, the matrix A shown in the equation (3) becomes a matrix for calculating the Mahalanobis distance M for clustering the learning data _xn with the highest accuracy. Accordingly, the parameter calculation unit 24 of the distance learning unit 2 described with reference to FIG. 2 can learn a matrix for accurately clustering the learning data by the calculation of the equation (3) without iterating the matrix A. it can.

なお、ここでは、拘束条件として、ｔｒ（Ａ^ＴＡ）の値を“１”としたが、この値は“１”である必要はなく、任意の定数であればよい。
すなわち、定数をｃとしたとき、拘束条件は、以下の（１４）式となる。 Here, as a constraint condition, the value of tr (A ^T A) is set to “1”, but this value does not have to be “1” and may be an arbitrary constant.
That is, when the constant is c, the constraint condition is the following equation (14).

この場合、前記（１２）式、（１３）式と同様の式の変形を行うことで、（１５）式に示す行列Ａを得ることができる。 In this case, the matrix A shown in the equation (15) can be obtained by modifying the equations similar to the equations (12) and (13).

よって、図２で説明した距離学習手段２のパラメータ演算手段２４は、前記（３）式の代わりに、この（１５）式を用いて、行列Ａを演算することとしてもよい。ただし、（１５）式において、１／ｃにより小数点演算が行われることから、ＣＰＵの負荷を軽減するため、定数ｃは“１”であることが望ましいといえる。 Therefore, the parameter calculation unit 24 of the distance learning unit 2 described in FIG. 2 may calculate the matrix A using the equation (15) instead of the equation (3). However, in equation (15), since the decimal point operation is performed by 1 / c, it can be said that the constant c is preferably “1” in order to reduce the load on the CPU.

１クラスタリング装置
２距離学習手段（クラスタリング距離学習装置）
２１記憶手段（クラスタ中心記憶手段）
２２学習データ入力手段
２３特徴量抽出手段
２４パラメータ演算手段
３クラスタリング実行手段（クラスタリング実行装置）
３１データ入力手段
３２特徴量抽出手段
３３クラスタ生成手段
３４クラスタ中心更新手段
３５クラスタ判定手段 DESCRIPTION OF SYMBOLS 1 Clustering apparatus 2 Distance learning means (clustering distance learning apparatus)
21 Storage means (cluster center storage means)
22 learning data input means 23 feature quantity extraction means 24 parameter calculation means 3 clustering execution means (clustering execution apparatus)
31 Data Input Unit 32 Feature Extraction Unit 33 Cluster Generation Unit 34 Cluster Center Update Unit 35 Cluster Determination Unit

Claims

A clustering distance learning device for learning a Mahalanobis distance parameter used when clustering a plurality of data according to a distance that is a similarity of a feature amount of data,
Cluster center storage means for storing a cluster center value in the feature amount space of the data for each cluster;
Learning data input means for inputting learning data which is known data to which cluster belongs,
A predetermined constraint condition is provided for the sum of diagonal components of the product of the matrix for calculating the Mahalanobis distance and the transposed matrix of the matrix, the learning data input by the learning data input means, and the learning data Parameter computing means for computing the matrix as the parameter based on the cluster center stored in the cluster center storage means corresponding to the cluster;
A clustering distance learning device comprising:

When the matrix is A, the constraint condition for the matrix A is tr (A ^T A) = 1,
The parameter calculation means is
The learning data is x, the subscript of x identifying the learning data is n, n ′, the cluster center is μ, the subscript of μ identifying the cluster at the cluster center is k, k ′, and the subscripts n, n ′. When the learning data x is represented by r _nk and r _{n′k ′ as} attribute values having a value “1” when the learning data x belongs to a cluster of subscripts k and k ′, and the value “0” _otherwise. The

The clustering distance learning device according to claim 1, wherein calculation is performed by:

In order to learn the Mahalanobis distance parameter used when clustering a plurality of data according to the distance that is the similarity of the feature amount of the data,
Learning data input means for inputting learning data which is known data to which cluster belongs,
When the matrix for calculating the Mahalanobis distance is A, the constraint condition for the matrix A is tr (A ^T A) = 1,
The learning data is x, the subscripts of x identifying the learning data are n, n ′, the cluster center for each cluster in the feature amount space of the data stored in the cluster center storage means is μ, and the cluster at the cluster center Is an attribute that takes the value “1” when the learning data x of the subscripts k and k ′, the subscripts n and n ′ belong to the cluster of the subscripts k and k ′, and the value “0” otherwise. When the value is represented by r _nk , r _{n′k ′} , the line example A is

Parameter calculation means for calculating by
Clustering distance learning program characterized by functioning as

In a clustering apparatus that clusters a plurality of data using the Mahalanobis distance for the similarity of the data feature amount,
A distance learning means for learning the Mahalanobis distance parameter;
The distance learning means includes
Cluster center storage means for storing a cluster center value in the feature amount space of the data for each cluster;
Learning data input means for inputting learning data which is known data to which cluster belongs,
A predetermined constraint condition is provided for the sum of diagonal components of the product of the matrix for calculating the Mahalanobis distance and the transposed matrix of the matrix, the learning data input by the learning data input means, and the learning data Parameter computing means for computing the matrix as the parameter based on the cluster center stored in the cluster center storage means corresponding to the cluster;
A clustering apparatus comprising:

The clustering apparatus according to claim 4, wherein the calculation is performed by: