JP5262582B2

JP5262582B2 - Surface defect distribution form analysis apparatus, method, and program

Info

Publication number: JP5262582B2
Application number: JP2008278634A
Authority: JP
Inventors: 淳治伊勢; 潔和嶋; 鉄生西山
Original assignee: Nippon Steel and Sumitomo Metal Corp
Current assignee: Nippon Steel Corp
Priority date: 2007-11-02
Filing date: 2008-10-29
Publication date: 2013-08-14
Anticipated expiration: 2028-10-29
Also published as: JP2009133843A

Description

The present invention relates to an apparatus, method, and program for analyzing the distribution form of surface defects on thin sheets. It belongs to a technique well suited to analyzing the causes of surface defects, such as flaws and stains, that occur on thin sheet products requiring a good appearance (for example, outer panels of automobiles and home appliances), based on the number of defects, their positions, and their distribution form.

To analyze the causes of surface defects such as flaws and stains on steel products such as thin sheet coils, surface defects on coils passing through a process line are detected using, for example, an optical automatic flaw inspection device, and information on the detected surface defects is stored.

Patent Document 1 discloses a method for analyzing the distribution form of surface defects on thin sheets. In this method, information on the distribution form of flaws is automatically extracted and presented as quantitative indices from flaw data that includes at least the occurrence positions of flaws on a thin sheet coil and that has been collected and accumulated by an automatic flaw inspection device or the like.

In the method disclosed in Patent Document 1, flaws are automatically grouped based on flaw data collected by an automatic flaw detection device. Because it is useful to handle a mixture of linear and elliptical distributions with a unified index, the distribution form of each flaw group is expressed by a two-dimensional Gaussian function, which can approximate both a line and an ellipse.

The clustering method disclosed in Non-Patent Document 1 generates clusters from which the portions regarded as noise have been separated.

Patent Document 1: JP 2005-257660 A
Non-Patent Document 1: C. Bohm, C. Faloutsos, J. Pan, and C. Plant, "Robust Information-theoretic Clustering," in KDD Conference, pp. 65-75, 2006.
Non-Patent Document 2: D. Chakrabarti, S. Papadimitriou, D. S. Modha, and C. Faloutsos, "Fully automatic cross-associations," in KDD Conference, pp. 79-88, 2004.
Non-Patent Document 3: D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the I.R.E., Sept. 1952, pp. 1098-1102.

In thin sheet coils, dense distribution forms and sporadic distribution forms must be distinguished. If a clustering method that cannot treat sporadic distributions separately is used as-is to group flaws, the cause analysis yields erroneous results. Applying the method of Non-Patent Document 1 makes it possible to represent the dense and sporadic distribution forms that must be distinguished in thin sheet coils. However, if the initial number of clusters is small, appropriate clusters are not generated, and if it is large, the computation time grows on the order of the cube of the number of clusters.

An object of the present invention is to make it possible to generate appropriate clusters, regardless of the initial number of clusters, when grouping surface defects of thin sheets in a natural manner while distinguishing sporadic distribution forms.

The surface defect distribution form analysis apparatus of the present invention analyzes the distribution form of surface defects occurring on a thin sheet and comprises: input means for inputting at least coordinate data on the occurrence positions of surface defects on the thin sheet to be analyzed; initial cluster generation means for generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects on the thin sheet, based on the coordinate data input by the input means, in which concentrated and sporadic distributions are mixed; compression cost calculation means for calculating a compression cost, which is an index for evaluating the goodness of a generated cluster, as the amount of information obtained when the coordinate data is compressed according to the distribution form of the cluster; separation means for generating, from the initial clusters generated by the initial cluster generation means and based on the compression cost, clusters separated into a combination of coordinate data forming a concentrated distribution and coordinate data forming a sporadic distribution; and combining means for combining clusters that contain coordinate data forming a concentrated distribution generated by the separation means, using the compression cost as the combining criterion. Coordinate data forming a sporadic distribution that the separation means has separated from a cluster is incorporated as coordinate data forming a concentrated distribution of a new or another cluster. The separation means takes as the center the coordinates obtained by the trimmed mean of the coordinate data in the cluster, defines the distance from the center using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of the variance-covariance matrix generated by the trimmed mean of the coordinate data of all points in the cluster, and selects the coordinate data forming the concentrated distribution in ascending order of distance.
The surface defect distribution form analysis method of the present invention analyzes the distribution form of surface defects occurring on a thin sheet and comprises: an input step of inputting at least coordinate data on the occurrence positions of surface defects on the thin sheet to be analyzed; an initial cluster generation step of generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects on the thin sheet, based on the coordinate data input in the input step, in which concentrated and sporadic distributions are mixed; a compression cost calculation step of calculating a compression cost, which is an index for evaluating the goodness of a generated cluster, as the amount of information obtained when the coordinate data is compressed according to the distribution form of the cluster; a separation step of generating, from the initial clusters generated in the initial cluster generation step and based on the compression cost, clusters separated into a combination of coordinate data forming a concentrated distribution and coordinate data forming a sporadic distribution; and a combining step of combining clusters that contain coordinate data forming a concentrated distribution generated in the separation step, using the compression cost as the combining criterion. Coordinate data forming a sporadic distribution that has been separated from a cluster in the separation step is incorporated as coordinate data forming a concentrated distribution of a new or another cluster. The separation step takes as the center the coordinates obtained by the trimmed mean of the coordinate data in the cluster, defines the distance from the center using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of the variance-covariance matrix generated by the trimmed mean of the coordinate data of all points in the cluster, and selects the coordinate data forming the concentrated distribution in ascending order of distance.
The program of the present invention causes a computer to execute processing for analyzing the distribution form of surface defects occurring on a thin sheet, the processing comprising: an input process of inputting at least coordinate data on the occurrence positions of surface defects on the thin sheet to be analyzed; an initial cluster generation process of generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects on the thin sheet, based on the coordinate data input in the input process, in which concentrated and sporadic distributions are mixed; a compression cost calculation process of calculating a compression cost, which is an index for evaluating the goodness of a generated cluster, as the amount of information obtained when the coordinate data is compressed according to the distribution form of the cluster; a separation process of generating, from the initial clusters generated in the initial cluster generation process and based on the compression cost, clusters separated into a combination of coordinate data forming a concentrated distribution and coordinate data forming a sporadic distribution; and a combining process of combining clusters that contain coordinate data forming a concentrated distribution generated in the separation process, using the compression cost as the combining criterion. Coordinate data forming a sporadic distribution that has been separated from a cluster in the separation process is incorporated as coordinate data forming a concentrated distribution of a new or another cluster. The separation process takes as the center the coordinates obtained by the trimmed mean of the coordinate data in the cluster, defines the distance from the center using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of the variance-covariance matrix generated by the trimmed mean of the coordinate data of all points in the cluster, and selects the coordinate data forming the concentrated distribution in ascending order of distance.
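The distance-based selection described for the separation step can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the 10% trimming fraction, and the choice to estimate each covariance entry as a trimmed mean of centred products are all assumptions made for illustration.

```python
import math

def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    s = sorted(values)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if len(s) - 2 * k > 0 else s
    return sum(core) / len(core)

def order_by_distance(points, trim=0.1):
    """Sort 2-D points by a covariance-weighted distance from a trimmed-mean centre.

    Assumption: each covariance entry is the trimmed mean of the
    corresponding centred products; the source does not spell this out.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx, cy = trimmed_mean(xs, trim), trimmed_mean(ys, trim)
    sxx = trimmed_mean([(x - cx) ** 2 for x in xs], trim)
    syy = trimmed_mean([(y - cy) ** 2 for y in ys], trim)
    sxy = trimmed_mean([(x - cx) * (y - cy) for x, y in points], trim)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of the 2x2 variance-covariance matrix, used as the weight matrix.
    wxx, wyy, wxy = syy / det, sxx / det, -sxy / det

    def dist(p):
        dx, dy = p[0] - cx, p[1] - cy
        return math.sqrt(wxx * dx * dx + 2 * wxy * dx * dy + wyy * dy * dy)

    return sorted(points, key=dist)

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),
       (0.2, 0.8), (0.8, 0.2), (0.4, 0.6), (0.6, 0.4), (50, 50)]
ordered = order_by_distance(pts)  # the sporadic point (50, 50) comes last
```

Points near the front of the returned list are candidates for the concentrated distribution; where to cut the list would be decided by the compression cost, which this sketch does not model.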

According to the present invention, surface defects of thin sheets can be grouped in a natural form according to the sparseness and density of their distribution, and appropriate clusters can be generated regardless of the initial number of clusters. This improves the accuracy of analyzing the distribution form of surface defects on thin sheets and enables fast, large-scale, and accurate extraction of feature quantities related to the distribution of surface defects, such as the centroid position, spatial size, and number density of each surface defect group.

Preferred embodiments of the present invention are described below with reference to the accompanying drawings.
FIG. 1 shows an example of the configuration of the apparatus for analyzing the distribution form of surface defects on thin sheets according to this embodiment. Hereinafter, defects are referred to as flaws. Reference numeral 101 denotes a flaw data input unit, which functions as input means for inputting flaw data to the apparatus via a computer network or a computer-readable recording medium. The flaw data includes at least occurrence position information (coordinate data) measured by an automatic flaw inspection device installed on a process line or inspection line for thin sheet coils, which are steel products. The coordinate data gives the occurrence positions of flaws with a reference position in the two-dimensional plane of the thin sheet coil as the origin. Let x be the coordinate in the rolling direction of the coil and y the coordinate in the width direction; when N defects occur in the coil, the coordinate data can be expressed as an N-by-2 matrix. The flaw data also includes information identifying whether each flaw is on the front or back surface of the coil, and dimensional information on the coil on which the flaws occurred, such as its length in the rolling direction and its width. It may further include information such as the dimensions of each flaw in the rolling and width directions, the flaw type, and the degree of harmfulness of the flaw.

Reference numeral 102 denotes a flaw data storage unit, which saves and accumulates the flaw data input by the flaw data input unit 101. At analysis time, it retrieves flaw data coil by coil in response to an externally input flaw distribution output instruction. When the flaw data for the front and back surfaces of a coil are to be analyzed separately, the flaw data is split using the occurrence surface information of each flaw (whether the flaw is on the front or back surface), and only the information for the surface to be analyzed is retrieved.

Reference numeral 103 denotes a computation unit, which performs clustering based on the flaw data input by the flaw data input unit 101 and accumulated in the flaw data storage unit 102, and executes processing that converts the flaw distribution form into feature quantities from the resulting clusters. The computation unit 103 includes an initial cluster generation unit 103a, a compression cost calculation unit 103b, a separation unit 103c, and a combining unit 103d. These units 103a to 103d group (cluster) the flaws, and the computation unit also functions as feature quantity calculation means for obtaining feature quantities related to the flaw distribution, such as the centroid position, spatial size, and number density of each flaw group (cluster). Existing techniques may be used to obtain these. For example, as disclosed in Patent Document 1, the centroid position is the mean of the coordinate data; the spatial size is given by the two standard deviations that are the parameters of a two-dimensional Gaussian function centered on the centroid and covering the distributed data; and the number density is the number of flaws in the group divided by each of the standard deviations indicating the spatial size.
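As a rough illustration of these feature quantities, the following Python sketch computes the centroid, two standard deviations, and number densities for one flaw group. The function name and the returned dictionary keys are hypothetical, the standard deviations are assumed to be taken along the x and y axes, and "divided by each of the standard deviations" is read here as two separate ratios; Patent Document 1 may define these differently.

```python
import math

def cluster_features(points):
    """Centroid, spatial size, and number density of one flaw group (sketch)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Axis-aligned standard deviations stand in for the 2-D Gaussian parameters.
    sx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n)
    sy = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n)
    return {
        "centroid": (cx, cy),
        "size": (sx, sy),
        # Flaw count divided by each standard deviation (one reading of the text).
        "density": (n / sx if sx > 0 else math.inf,
                    n / sy if sy > 0 else math.inf),
    }

features = cluster_features([(0, 0), (2, 0), (0, 2), (2, 2)])
```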

The initial cluster generation unit 103a functions as initial cluster generation means. Based on the flaw data input by the flaw data input unit 101 and accumulated in the flaw data storage unit 102, it uses a clustering method such as the K-means method to generate initial clusters representing the flaw distribution form, with the expected number of clusters as the lower limit. A detailed description of the K-means method is omitted; in brief, it defines a center point for each cluster, computes the distance between each flaw data point in the cluster and the center point, and repeatedly reassigns flaw data to clusters and updates the cluster center points so as to minimize the sum of the squared distances.
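The reassignment and center-update loop just described can be sketched in a few lines of Python. This is the textbook Lloyd iteration, offered only as an illustration of the initial clustering step; the function name, the seeded random initialization, and the iteration cap are assumptions and are not taken from the source.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on 2-D points; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization: k distinct data points
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to the nearest center (squared distance).
        new_labels = [
            min(range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2)
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centers

labels, centers = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```

Because K-means assigns every point to some cluster, sporadic points end up inside clusters, which is the behaviour the later separation step is meant to correct.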

The compression cost calculation unit 103b, described in detail later, functions as compression cost calculation means that calculates the compression cost, an index for evaluating the goodness of a generated cluster, as the amount of information obtained when the coordinate data is compressed according to the distribution form of the cluster.

The separation unit 103c, described in detail later, functions as separation means that generates, from the initial clusters generated by the initial cluster generation unit 103a and based on the compression cost, clusters separated into a pair of coordinate data sets: one forming a concentrated distribution and one forming a sporadic distribution.

The combining unit 103d, described in detail later, functions as combining means that combines clusters containing coordinate data forming a concentrated distribution generated by the separation unit 103c, using the compression cost as the combining criterion.

Reference numeral 104 denotes an analysis result display unit, which functions as analysis result display means that displays the feature quantities related to the distribution of the clustered flaws, based on the results computed by the computation unit 103 for the specified thin sheet coil.

An overview of the clustering method introduced in the present invention is given here.

For a given data set, FIG. 2(a) shows an example of clustering that most people would intuitively regard as good: it contains one Gaussian-shaped cluster, one linear cluster, and sporadically occurring points (hereinafter called outliers). In contrast, FIG. 2(b) shows an example of clustering by a general clustering method (for example, the K-means method), in which five Gaussian-shaped clusters that include the sporadic outliers are generated.

The clustering method introduced in the present invention is an unsupervised cluster partitioning method that introduces a new index, the compression cost, to achieve clustering like that shown in FIG. 2(a). The compression cost is used as an index of which of two clusters a given point should belong to. After grouping by, for example, the K-means method (with the number of clusters k specified), the outlier separation and cluster combining algorithms are executed based on the compression cost.

In other words, using the compression cost as an index of the goodness of a cluster, the following algorithm is executed:
(M1) Robust evaluation of the cluster shape in the presence of outliers
(M2) Identification and separation of outliers based on the compression cost
(M3) Construction of clusters by combining according to the compression cost

<Compression cost>
The compression cost applies a property of Huffman coding (see Non-Patent Document 3), one method of information compression: when data is compressed using the occurrence probabilities of the symbols before compression, the average code length (in bits) is minimized.

When encoding is performed using several probability distributions prepared in advance (for example, any distributions such as Gaussian, Laplace, or uniform) in place of the occurrence probabilities of the positions within a cluster, the probability distribution that yields the minimum code length is the one that best represents the shape (characteristics) of the cluster.
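The idea of picking the distribution with the shortest code can be sketched as follows. This Python example fits three candidate distributions to a one-dimensional sample and compares the total code lengths, taking the cost of a value v as log2(1 / (pdf(v) · gamma)). The names, the parameter estimates, and the default resolution `gamma=0.1` used to turn a density into a probability are assumptions for illustration, not the source's definitions.

```python
import math

def code_length_bits(values, pdf, gamma):
    """Total bits to encode `values` under `pdf` at resolution `gamma`."""
    return sum(-math.log2(max(pdf(v) * gamma, 1e-300)) for v in values)

def best_distribution(values, gamma=0.1):
    """Return (name of cheapest distribution, dict of costs in bits)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n) or gamma
    med = sorted(values)[n // 2]
    b = sum(abs(v - med) for v in values) / n or gamma  # Laplace scale
    lb, ub = min(values), max(values)
    width = (ub - lb) or gamma
    pdfs = {
        "gaussian": lambda v: math.exp(-(v - mu) ** 2 / (2 * sigma ** 2))
                              / (sigma * math.sqrt(2 * math.pi)),
        "laplace": lambda v: math.exp(-abs(v - med) / b) / (2 * b),
        "uniform": lambda v: 1.0 / width if lb <= v <= ub else 0.0,
    }
    costs = {name: code_length_bits(values, f, gamma) for name, f in pdfs.items()}
    return min(costs, key=costs.get), costs

# Evenly spread values are cheapest to encode under the uniform model.
name, costs = best_distribution([i * 0.1 for i in range(100)])
```

The cost of storing each model's parameters is deliberately left out of this sketch.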

In addition to the compression cost from each cluster's probability distribution, the information needed to identify the clusters is also encoded and compressed, and added to the compression cost. This information includes the cluster center position, information indicating the form of the probability distribution, parameters of the distribution such as mean and variance, and the number of data points in the cluster. Including it prevents the data from being split into an excessive number of clusters.

The present invention compresses information using a probability distribution that represents the cluster shape. It defines the compression cost as the sum of the code length for the cluster numbers and data point coordinates and the code length corresponding to the other information needed for encoding, and presents a method that generates the desired clusters by minimizing this compression cost.

The compression cost corresponding to the other information needed for encoding includes the code length for the number of clusters k and the code length for the parameters describing each cluster's shape (for example, the mean and variance for a Gaussian-shaped cluster). Details are given below.

All coordinate data handled here has only finite precision, so encoding assumes the values are integers; that is, data points are assumed to lie on a d-dimensional grid. The grid resolution can be chosen freely. For example, if the coordinate data is significant to the first decimal place, the grid resolution, expressed by the grid constant, may be set to 0.1, the smallest increment of the first decimal place.

The method of the present application described below consists of (a) a method for encoding integers and (b) a method for encoding a point once it has been assigned to a cluster.

Suppose we have the data set shown in FIG. 2, and that two distributions are available: a Gaussian distribution and a uniform distribution (over the minimum bounding rectangle). Once a point has been assigned to a cluster, its deviation from the cluster center can be recorded efficiently with a small amount of data by applying Huffman coding (see Non-Patent Document 3).

(Integer encoding)
To encode small integers with fewer bits and large integers with more bits, integers are encoded in unary notation. The basic unary notation writes as many zeros as the magnitude of the natural number. The present application uses a unary-based encoding with 0 and 1 known as the Elias code or self-delimiting method (see Non-Patent Document 2). These encodings represent a positive integer i using O(log(i)) bits, so for a fixed change in the integer value, the change in the number of encoded bits is larger the smaller the original value. Encoding is performed as shown in Table 1 below. The code consists of a part indicating the encoded length of the integer i (the code length) and a part indicating the encoded value (the code value). Note that the first bit of the code value is always "1". Because the code length is indicated by a run of consecutive zeros, the code value part is the range that starts at the first "1" bit and has the same number of bits as the number of zeros in the code length part. Zero, which is not a positive integer, can be handled by adding 1 to the value before encoding. Negative numbers can be handled by adding a bit indicating the sign of the value.
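As a concrete illustration, here is a minimal Python sketch of the standard Elias gamma code, one well-known self-delimiting code of this kind. It is an assumption that this matches the scheme intended in the text: Table 1 is not reproduced in this section, and its exact bit layout (for example, the number of leading zeros) may differ. The signed variant, with a sign bit and a shift by one, follows the extension described above, and its function name is hypothetical.

```python
def elias_gamma_encode(i):
    """Standard Elias gamma: (len-1) zeros, then the binary digits of i >= 1."""
    if i < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    b = bin(i)[2:]               # binary digits; always starts with '1'
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Decode one Elias gamma codeword from the start of `bits`."""
    zeros = 0
    while bits[zeros] == "0":    # the run of zeros gives the length
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

def encode_signed(v):
    """Sign bit, then gamma(|v| + 1), so zero and negatives are representable."""
    sign = "1" if v < 0 else "0"
    return sign + elias_gamma_encode(abs(v) + 1)

codes = [elias_gamma_encode(i) for i in (1, 2, 5)]  # ['1', '010', '00101']
```

The codeword for i uses 2·floor(log2 i) + 1 bits, which is the O(log(i)) growth mentioned in the text.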

(Point encoding)
Each cluster C is associated with a flag R indicating whether the cluster needs decorrelation, a matrix Σ_* that performs the decorrelation, a flag T indicating the shape of the cluster's data distribution (chosen from probability distributions prepared in advance, for example Gaussian, Laplace, or uniform), its center coordinates, and the parameters of the data distribution. Once a point P has been determined to belong to cluster C, its coordinates can be encoded by exploiting the fact that P follows the distribution of C. When the probability of the i-th coordinate P_i of point P is p, encoding that coordinate requires O(log(1/p)) bits.

For a uniform distribution, where the probability is constant, the number of bits is determined from the minimum bounding rectangle, whose extent in the i-th dimension is given by the lower limit lb_i and the upper limit ub_i. The number of bits is set to O(log(1/(ub_i − lb_i))), in proportion to the extent of the minimum bounding rectangle.

The purpose of point encoding is to provide an encoding method for a point x→ of cluster C whose compression cost is minimized when the point follows the cluster subspace and the cluster-specific probability distribution. (In this specification, the notation a→ denotes a with an arrow above it, that is, a vector.) As described later, the correct choice yields the probability density function that minimizes the compression cost.

Here, assume that the coordinates of the cluster points have been decorrelated by a coordinate transformation based on principal component analysis (PCA), and that a probability distribution, together with its parameters, has already been selected for each coordinate. In the example of FIG. 3, the abscissa follows a Laplace distribution and the ordinate a Gaussian distribution; both are assumed to have mean μ = 3.5 and standard deviation σ = 1. Each coordinate value must be encoded and assigned the necessary number of bits: a short code is assigned to coordinate values with high probability (for example, 3 < x < 4), and a long code to coordinate values with low probability (y = 12 as an extreme example).

If the probability distribution indicated by flag T, which describes the shape of the cluster's data distribution, matches the actual distribution of the data, the number of bits produced by Huffman coding (code length = compression cost) is minimized. When P(x_i) is the value of the probability density function at each coordinate value, Huffman coding assigns a code of length log₂(1/P(x_i)) bits to each coordinate x_i.

(Calculation method of compression cost)
Here, the compression cost of cluster C is defined through the following four definitions.
・Definition 1 (compression cost E of a point x→)
Let x→ ∈ R^d belong to cluster C, and let pdf→(x→) be the d-dimensional vector of probability density functions of cluster C. The probability density function pdf_i(x_i) of each coordinate is selected, together with its parameters, from a predetermined set of probability density functions PDF = {pdf_Gauss(u_i, σ_i), pdf_uniform(lb_i, ub_i), pdf_Laplace(a_i, b_i), ...}, where u_i is the mean, lb_i the lower bound, ub_i the upper bound, a_i ∈ R the one-sided mean, σ_i the standard deviation, and b_i ∈ R⁺ the standard deviation scaled by √2. With γ denoting the grid constant (the grid resolution, that is, the distance between grid points), the compression cost E_i of point x→ along coordinate axis i is given by equation (1) below, and the compression cost of point x→ by equation (2) below.
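The equations referred to in Definition 1 are not reproduced in this text. As a minimal sketch, assuming that the per-coordinate cost has the form log₂(1/(pdf_i(x_i)·γ)), which is consistent with the log₂(1/p) rule stated earlier but is not confirmed by the source, the cost along one axis could be computed as follows. The function names and the value γ = 0.1 are illustrative assumptions.

```python
import math


def gauss_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, loc, b):
    """Density of a Laplace distribution with location loc and scale b."""
    return math.exp(-abs(x - loc) / b) / (2.0 * b)


def uniform_pdf(x, lb, ub):
    """Density of a uniform distribution on [lb, ub]; zero outside."""
    return 1.0 / (ub - lb) if lb <= x <= ub else 0.0


def coordinate_cost(x, pdf, gamma):
    """Assumed cost in bits of one coordinate: log2(1 / (pdf(x) * gamma))."""
    p = pdf(x) * gamma  # approximate probability of the grid cell containing x
    if p <= 0.0:
        return math.inf  # the value cannot be encoded under this distribution
    return math.log2(1.0 / p)


GAMMA = 0.1  # hypothetical grid constant
near_mean = coordinate_cost(3.5, lambda v: gauss_pdf(v, 3.5, 1.0), GAMMA)
far_away = coordinate_cost(12.0, lambda v: gauss_pdf(v, 3.5, 1.0), GAMMA)
out_of_range = coordinate_cost(7.0, lambda v: uniform_pdf(v, 1.4, 6.2), GAMMA)
```

With these assumptions, a value at the mean costs only a few bits, while the extreme value y = 12 costs far more, which mirrors the short-code/long-code behavior described for FIG. 3.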

At the hatched example position in FIG. 3, the probability of the x coordinate (between 2 and 3) is 19%, so Huffman coding requires log₂(1/0.19) = 2.3 bits. The y coordinate has a lower probability (5%) and therefore requires more bits (4.3 bits). The compression cost of point x→ is the total for the x and y coordinates, 6.6 bits, plus log₂(n/|C|) bits for the Huffman-coded cluster number that identifies the cluster to which the point belongs.
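The arithmetic of the FIG. 3 example can be retraced with a short sketch. The probabilities 19% and 5% come from the text; the total point count (1000) and the cluster size (250) are hypothetical values chosen only to make the cluster-number term concrete. The bit counts quoted in the text appear to be cut to one decimal place, so the computed values differ slightly in the second digit.

```python
import math


def coordinate_bits(probability):
    """Ideal code length in bits for a grid cell with the given probability."""
    return math.log2(1.0 / probability)


def cluster_id_bits(n_total, cluster_size):
    """Bits needed for the cluster number: log2(n / |C|)."""
    return math.log2(n_total / cluster_size)


def point_cost(cell_probabilities, n_total, cluster_size):
    """Cost of one point: cluster number plus every coordinate."""
    coords = sum(coordinate_bits(p) for p in cell_probabilities)
    return cluster_id_bits(n_total, cluster_size) + coords


x_bits = coordinate_bits(0.19)  # x coordinate of the hatched cell
y_bits = coordinate_bits(0.05)  # y coordinate of the hatched cell
# Hypothetical sizes: 1000 points overall, 250 of them in this cluster.
total_bits = point_cost([0.19, 0.05], n_total=1000, cluster_size=250)
```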

(Determining the probability density function that fits the cluster)
Next, we describe which probability density function is assigned to cluster C. The ultimate goal is to minimize the compression cost (number of bits). It is therefore necessary to select, for each coordinate axis, the probability density function pdf (and its parameters) that minimizes the compression cost of the cluster. For probability density functions such as the Gaussian distribution, the optimal parameters are known to correspond to statistics of the data set (for example, the mean, the variance σ_i², and the upper and lower bounds).

Accordingly, when a Gaussian distribution is selected as the probability density function pdf for coordinate axis i, the mean u_i and variance σ_i² of the points along axis i are used as its parameters. Similarly, for the Laplace distribution the mean u_i and the variance 2σ_i² are applied, and for the uniform distribution the upper bound ub_i and lower bound lb_i of the range of coordinate values are applied. From the probability density functions PDF prepared in advance (Gaussian, Laplace, uniform, and so on), the one that minimizes the compression cost of the cluster is selected.

・Definition 2 (pdf→(x→) of cluster C)
Let C be a cluster consisting of points x→ ∈ C, and let Stat = (u_i, σ_i, lb_i, ub_i, ...) be the statistics required by the probability density functions PDF prepared in advance. Then, as shown in equation (3) below, each component of pdf→ is the probability density function pdf_i ∈ PDF that minimizes the compression cost.

For the x coordinate in FIG. 3, the mean (3.5), the variance (1.0), and the lower and upper bounds (1.4, 6.2) are calculated as statistics. Then, among pdf_uniform(1.4, 6.2), pdf_Gauss(3.5, 1.0), and pdf_Laplace(3.5, 0.7), which are obtained by applying the appropriate statistics to the prepared probability density functions, the one giving the smallest compression cost E_x for the x coordinate is selected. The probability density function for the y coordinate is selected by the same procedure using the compression cost E_y. Although three distributions (Gaussian, Laplace, and uniform) are considered here, other probability density functions can also be used.
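The selection described in Definition 2 can be sketched as follows: fit each candidate distribution from the data statistics, total the bits over all values on one axis, and keep the cheapest. This is a sketch under assumptions, not the patented implementation: the cost form log₂(1/(pdf·γ)), the grid constant γ = 0.1, the Laplace scale b = σ/√2 (inferred from the pair (3.5, 0.7) in the example), and all function names are illustrative.

```python
import math
import statistics


def gauss_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, loc, b):
    return math.exp(-abs(x - loc) / b) / (2.0 * b)


def uniform_pdf(x, lb, ub):
    return 1.0 / (ub - lb) if lb <= x <= ub else 0.0


def total_bits(values, pdf, gamma):
    """Sum of the assumed per-value costs log2(1 / (pdf(v) * gamma))."""
    bits = 0.0
    for v in values:
        p = pdf(v) * gamma
        if p <= 0.0:
            return math.inf
        bits += math.log2(1.0 / p)
    return bits


def select_pdf(values, gamma=0.1):
    """Return (name of the cheapest distribution, dict of all costs)."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    lb, ub = min(values), max(values)
    if sigma == 0.0:
        raise ValueError("all values are identical; no spread to model")
    candidates = {
        "gauss": lambda v: gauss_pdf(v, mean, sigma),
        "laplace": lambda v: laplace_pdf(v, mean, sigma / math.sqrt(2.0)),
        "uniform": lambda v: uniform_pdf(v, lb, ub),
    }
    costs = {name: total_bits(values, pdf, gamma)
             for name, pdf in candidates.items()}
    return min(costs, key=costs.get), costs


# Evenly spread values: the uniform distribution should be the cheapest.
evenly_spread = [i / 10 for i in range(101)]
best_name, all_costs = select_pdf(evenly_spread)
```

Running the same function on each axis of a cluster gives one distribution per coordinate, which corresponds to the vector pdf→ of the definition.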

(Encoding using a decorrelation matrix)
When a cluster is correlated across different coordinate axes, that is, when some coordinate value of its points depends on at least one other coordinate value, the compression cost can sometimes be reduced by transforming to new coordinate axes with a decorrelation matrix. The decorrelation matrix can be computed, for example, as the matrix of principal components obtained by principal component analysis (PCA) of the cluster's variance-covariance matrix Σ, or as the eigenvector matrix V of Σ. A robust method of estimating the variance-covariance matrix that is insensitive to outlying points (outliers, when a single coordinate axis is considered) is described later.

Decorrelating the data is necessary to reduce the compression cost of a cluster. Instead of two coordinates with large variance and high correlation, two new, mutually uncorrelated coordinates are generated. If, for example, one of the new coordinates can be made to have a variance close to 0, the probability of a single grid cell becomes large, so the average code length for that coordinate is expected to be close to 0 bits. The decorrelation matrix is therefore used whenever it improves the compression cost.

・Definition 3 (decorrelation of a cluster)
Let C be a cluster of points x→ (in the original coordinates, before decorrelation) and let Σ be the covariance matrix of C. The decorrelation matrix is the matrix VΛ^(-1/2) formed from the principal components (Σ = VΛV^T) obtained by principal component analysis (PCA) of cluster C. Alternatively, the matrix V obtained by diagonalizing the variance-covariance matrix Σ of cluster C into eigenvalues and eigenvectors (Σ = VΛV^T) is taken as the decorrelation matrix. Let Y be the set of decorrelated points y→, that is, y→ ∈ Y: y→ = V^T·x→. Let pdf→(x→) be the probability density functions (probability distributions) characterizing the points in the original coordinate system, and pdf→(y→) those characterizing the decorrelated set Y. The decorrelation matrix finally used is determined on the basis of compression cost: the decorrelation matrix dec(C) of cluster C is given by equation (4) below, and the corresponding compression cost by equation (5) below.
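As an illustration of the transformation y→ = V^T·x→, the following sketch decorrelates two-dimensional points using the closed-form eigenvectors of a symmetric 2×2 covariance matrix. It is a simplified stand-in for a general eigendecomposition or PCA routine; the helper names and the sample data are assumptions made for this example.

```python
import math


def covariance_2d(points):
    """Return (sxx, sxy, syy) of 2-D points using the arithmetic mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx, sxy, syy


def eigenvectors_2d(sxx, sxy, syy):
    """Orthonormal eigenvectors of [[sxx, sxy], [sxy, syy]] (columns of V)."""
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the major axis
    c, s = math.cos(theta), math.sin(theta)
    return (c, s), (-s, c)


def decorrelate(points):
    """Apply y = V^T x to every point."""
    v1, v2 = eigenvectors_2d(*covariance_2d(points))
    return [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1]) for x, y in points]


# Strongly correlated sample: y is roughly 2x with a small alternating offset.
correlated = [(0.1 * i, 0.2 * i + 0.05 * (-1) ** i) for i in range(50)]
original_cov = covariance_2d(correlated)
rotated = decorrelate(correlated)
var_major, cov_off, var_minor = covariance_2d(rotated)
```

After the rotation the off-diagonal covariance vanishes and almost all of the variance lies on one axis, which is the situation in which the second coordinate becomes cheap to encode.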

When the decorrelation matrix V of d rows × d columns is encoded, and each real number is represented with a code length of f bits, the elements are encoded with d²f bits in total. Decorrelation is therefore performed only when the reduction in compression cost it achieves exceeds this amount. When decorrelation is unnecessary, the identity matrix is used in place of the decorrelation matrix, and only the 1 bit corresponding to the flag indicating whether decorrelation is applied is added to the compression cost.
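The decision rule in this paragraph reduces to a comparison between the bits saved and the d²f bits needed to store V. A minimal sketch follows; it assumes that the 1-bit flag is paid in both cases, which the text states explicitly only for the case without decorrelation, and the function names and example figures are hypothetical.

```python
def should_decorrelate(cost_without, cost_with, d, f):
    """True when the saving exceeds the d*d*f bits needed to store V."""
    matrix_bits = d * d * f
    return (cost_without - cost_with) > matrix_bits


def cost_with_decision(cost_without, cost_with, d, f):
    """Return (total bits, decorrelate?) including the 1-bit flag."""
    flag_bits = 1  # assumed to be paid whether or not V is stored
    if should_decorrelate(cost_without, cost_with, d, f):
        return flag_bits + d * d * f + cost_with, True
    return flag_bits + cost_without, False


# Hypothetical figures: 2-D cluster, 32-bit matrix elements.
worth_it = cost_with_decision(1000, 800, d=2, f=32)      # saves 200 > 128 bits
not_worth_it = cost_with_decision(1000, 900, d=2, f=32)  # saves 100 < 128 bits
```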

・Definition 4 (compression cost of a cluster)
Definition 4 summarizes the items described above. Cluster C is encoded using the decorrelation matrix dec(C) and the probability density functions pdf→(y→) representing each decorrelated coordinate y→. The compression cost of cluster C is given by equation (6) below.

As described above, the compression cost calculation unit 103b calculates the compression cost given by equation (6) and the related equations, based on the flaw data input by the flaw data input unit 101 and stored in the flaw data storage unit 102.

<Separation of outliers>
In the present application, a cluster obtained by a conventional method is regarded as consisting of two subsets: a set of data forming the core of a concentrated distribution (core points), and a set of data forming a sporadic distribution (outliers).

With general clustering methods such as K-means and K-medoids, clusters in which core points and outliers are mixed are generated, and the resulting clusters may be difficult to characterize.

Outliers are therefore separated from the clusters once they have been generated (the initial clusters).

The most reliable method is exhaustive: for N coordinate data points there are 2^N ways of classifying them into core points and outliers, so the compression cost is calculated for every combination and the combination with the minimum compression cost is adopted. However, because the amount of computation grows exponentially with the number of data points, the computation can no longer be completed in a realistic time once the number of points exceeds a certain size.

The following provides a method in which the coordinate data are ranked in the order in which they are to be taken into the core points, so that the computation required to separate core points from outliers is proportional to N and finishes in a realistic time even for large numbers of points.

The ranking is computed from the distance to the center of the core points. In what follows, core points and outliers are selected by a boundary defined by distance rather than by rank. Several definitions of distance are used in order to determine the outline of the core points.

In the method that separates core points from outliers using a distance based on a weight matrix describing the spread of the coordinate data within a cluster, an orthonormal matrix V defining a subspace (referred to as the "decorrelation matrix V") is computed for each cluster C_i of the cluster group C = {C₁, ..., C_k} obtained by an arbitrary clustering method (for example, K-means). A weight matrix is computed using V, and this defines the distance that determines the boundary separating core points from outliers. The boundary (an ellipse) equidistant from the cluster center for which the total compression cost of the core points and outliers in C_i is minimal is selected, and as a result a new cluster is generated from which the outliers have been separated. The cluster center used here is the mean, or the trimmed mean, of the coordinate data.

The weight matrix used to define the distance is now described in more detail. One way of estimating the weight matrix is described first. As the weight matrix of cluster C_i, the inverse of the variance-covariance matrix (Σ^-1 = VΛ^-1V^T) is used, since it defines the space containing the points belonging to C_i. When the inverse cannot be computed directly, a substitute matrix is computed, for example from the principal components obtained by principal component analysis (PCA). Alternatively, expressing the variance-covariance matrix Σ of the points belonging to C_i in terms of a diagonal matrix of eigenvalues yields an orthonormal matrix V composed of eigenvectors. This matrix V is the decorrelation matrix described above. V defines the space (new coordinate axes) containing the points of C_i, and the eigenvalues determine the shape of the equidistant boundary ellipse (ellipsoid) described later. Because all eigenvalues in the diagonal matrix Λ are positive, Σ and its inverse Σ^-1 are positive semidefinite matrices. That is, when x is the vector connecting two points, the quadratic form x^T Σ^-1 x is always positive, so a distance-like value (pseudo-distance) between two points can be defined even in a multidimensional space. For example, taking the cluster structure (covariance) into account, the distance between two points x→ and y→ can be defined as the Mahalanobis distance, as shown in equation (7) below.
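Equation (7) is not reproduced in this text. The sketch below uses the standard form of the Mahalanobis distance, sqrt((x − y)^T Σ^-1 (x − y)), for the two-dimensional case; whether the patent's equation includes the square root is an assumption, and the function name is illustrative.

```python
import math


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for covariance [[a, b], [c, d]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form with the inverse (1/det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


identity = ((1.0, 0.0), (0.0, 1.0))
stretched = ((4.0, 0.0), (0.0, 1.0))  # variance 4 along x, 1 along y

euclidean_like = mahalanobis_2d((0.0, 0.0), (3.0, 4.0), identity)
along_wide_axis = mahalanobis_2d((0.0, 0.0), (2.0, 0.0), stretched)
along_narrow_axis = mahalanobis_2d((0.0, 0.0), (0.0, 2.0), stretched)
```

The same displacement of 2 counts as a shorter distance along the axis with the larger variance, which is why points at equal distance from the center form an ellipse.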

Given the center μ→ of cluster C, the variance-covariance matrix Σ can be estimated by computing the matrix Σ_C according to equation (8) below, where |C| is the number of data points in cluster C. (x→ − μ→)·(x→ − μ→)^T is an outer product of vectors and is a d × d matrix. Since it is computed for all x→ ∈ C and averaged over the |C| data points of the cluster, the element (Σ_C)_ij in row i and column j of Σ_C is the covariance of the i-th and j-th coordinates, that is, the average of (x_i − u_i)·(x_j − u_j).

With the above calculation, however, the presence of outliers causes the problem that core points and outliers cannot be separated as intended. To obtain the intended split, the center of the core points and the variance-covariance matrix must be estimated accurately, as described in detail later. The center of cluster C computed as a simple mean, however, may be displaced from the center of the core points by the outliers. Moreover, if a center different from that of the core points is used, outliers can also affect the space generated by the decorrelation matrix V derived from the variance-covariance matrix, that is, the orientation of the ellipse that evaluates the "spread of the core points" and serves as a candidate boundary between core points and outliers. Even a single outlier can strongly affect the estimated center of the core points and the decorrelation matrix. FIG. 4 shows a case in which the center is wrongly estimated by the conventional evaluation and the ellipse representing the covariance of the data is displaced from the core points.

(Method for estimating the robust center μ_R→ of the core points)
In determining the center, a center that is robust, in the sense of being insensitive to outliers, is obtained by computing the α-trimmed mean of each coordinate independently. The α-trimmed mean is the mean taken over the data that remain after removing the top α/2 % and the bottom α/2 % of the data. The median therefore corresponds to the case where α is brought toward 100 until only a single point remains. The value of α is preferably set according to the assumed proportion of core points in the initial cluster. As a result, the data origin obtained in this median-based manner (μ_R→) lies closer to the center of the core points of the cluster than the center of all points (μ→) does.
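The α-trimmed mean and the per-coordinate robust center can be sketched as follows. Here α is given in percent and α/2 % of the values are dropped at each end, as in the description; the rounding of the number of dropped values and the function names are assumptions made for this example.

```python
def trimmed_mean(values, alpha):
    """Mean after dropping alpha/2 percent of the values at each end."""
    if not 0.0 <= alpha < 100.0:
        raise ValueError("alpha must be in [0, 100)")
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * alpha / 200.0)   # values dropped at each end (rounded down)
    k = min(k, (n - 1) // 2)     # always keep at least one value
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


def robust_center(points, alpha):
    """Trimmed mean of every coordinate, computed independently."""
    dims = len(points[0])
    return tuple(trimmed_mean([p[i] for p in points], alpha)
                 for i in range(dims))


sample = [1.0, 2.0, 3.0, 4.0, 100.0]          # one obvious outlier
plain = trimmed_mean(sample, 0.0)             # ordinary mean
trimmed = trimmed_mean(sample, 40.0)          # drops one value at each end
points = [(1, 10), (2, 11), (3, 12), (4, 13), (100, -50)]
center = robust_center(points, 40.0)
```

As α approaches 100 the kept range shrinks to the middle value (or the two middle values for an even count), which is the median case mentioned in the text.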

(Method for estimating the robust weight matrix (variance-covariance matrix) Σ_R of the core points)
Similarly, the robust variance-covariance matrix (Σ_R)_ij is generated as the α-trimmed mean of (x_i − μ_Ri)·(x_j − μ_Rj) over all points x→ of the cluster. The trimmed mean is the mean obtained by sorting the values to be averaged by magnitude, removing a fixed proportion α of the data from the head or tail, and using the data near the median. The value of α is preferably set to the square of the assumed proportion of core points in the initial cluster. Compared with the covariance matrix obtained by the arithmetic mean, the matrix Σ_R can be expected to reflect the covariance of the core points.
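A robust covariance estimate of this kind can be sketched by replacing the arithmetic mean of the products (x_i − μ_Ri)·(x_j − μ_Rj) with a trimmed mean. The sketch trims both ends symmetrically, which is one reading of "from the head or tail"; that choice, the function names, and the sample data are assumptions.

```python
def trimmed_mean(values, alpha):
    """Mean after dropping alpha/2 percent of the values at each end."""
    ordered = sorted(values)
    n = len(ordered)
    k = min(int(n * alpha / 200.0), (n - 1) // 2)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


def robust_covariance(points, center, alpha):
    """Matrix whose (i, j) element is the trimmed mean of the products."""
    d = len(center)
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            products = [(p[i] - center[i]) * (p[j] - center[j])
                        for p in points]
            cov[i][j] = cov[j][i] = trimmed_mean(products, alpha)
    return cov


# Five core points around the origin plus one distant outlier.
points = [(-1, 0), (0, 0), (1, 0), (0, 1), (0, -1), (50, 50)]
robust = robust_covariance(points, center=(0.0, 0.0), alpha=40.0)
naive = robust_covariance(points, center=(0.0, 0.0), alpha=0.0)
```

In this toy data the single outlier dominates the ordinary estimate, while the trimmed estimate stays close to the spread of the five core points.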

The covariance matrix Σ_C obtained by the arithmetic mean is diagonally dominant (= positive definite). That is, each diagonal element Σ_i,i is larger than the sum of the other elements Σ_*,i. In addition, all eigenvalues in the corresponding diagonal matrix Λ are positive.

On the other hand, when the robust variance-covariance matrix Σ_R is not diagonally dominant, the value of the quadratic form may become negative, and a distance can no longer be defined. In that case, adding the identity matrix multiplied by φ, φ·I, makes the computation possible without affecting the decorrelation matrix V. φ is given by equation (9) below; its value is the maximum difference between a column sum and the corresponding diagonal element (a margin of about 10% may be added).
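Equation (9) is not reproduced in this text, so the following is only one plausible reading of "the maximum difference between a column sum and the diagonal element": for each column, the sum of the absolute off-diagonal elements minus the diagonal element, maximized over columns and enlarged by an optional 10% margin. The use of absolute values and the function names are assumptions.

```python
def diagonal_shift(matrix, margin=0.10):
    """Shift phi needed to make the matrix diagonally dominant (0 if none)."""
    d = len(matrix)
    worst = 0.0
    for i in range(d):
        off_diagonal = sum(abs(matrix[j][i]) for j in range(d) if j != i)
        worst = max(worst, off_diagonal - matrix[i][i])
    return worst * (1.0 + margin) if worst > 0.0 else 0.0


def add_shift(matrix, phi):
    """Return matrix + phi * I without modifying the input."""
    return [[value + (phi if i == j else 0.0)
             for j, value in enumerate(row)]
            for i, row in enumerate(matrix)]


not_dominant = [[1.0, 2.0], [2.0, 1.0]]
phi = diagonal_shift(not_dominant)
shifted = add_shift(not_dominant, phi)
already_dominant = diagonal_shift([[3.0, 1.0], [1.0, 3.0]])
```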

Adding the matrix φ·I changes the eigenvalues but not the eigenvectors. If Σ = VΛV^T, then Σ + φ·I = VΛV^T + φ·I. Since V is orthonormal, φ·I can also be written as V(φ·I)V^T, and by the distributive law Σ + φ·I = V(Λ + φ·I)V^T. That is, each eigenvalue increases, but the decorrelation matrix V is unaffected.

(Transformation that reflects the characteristics of the core points in the weight matrix)
In the above, the distance is defined from the variance-covariance matrix and core points are selected in order of increasing distance, so the boundary forming the outline of the cluster (core points) is an ellipse (ellipsoid). Consequently, when separating a cluster (core points) whose shape is hard to represent with an ellipse, such as a straight line, some outliers may not be separated. The present application provides a method of transforming the weight matrix according to the probability distribution used to calculate the compression cost of the cluster (core points).

A method of extracting clusters with non-elliptical shapes by modifying the weight matrix is described here. The eigenvalues of the weight matrix are proportional to the lengths of the major and minor axes of the ellipse representing the outline of the cluster. Changing the eigenvalues therefore changes the ratio between the major and minor axes and hence the shape of the boundary, which reduces the number of intended outliers that are mixed into the core points and of intended core points that are mixed into the outliers.

The weight matrix is decomposed by diagonalization, its diagonal elements are transformed by a function that depends on the distribution form of the cluster, and the weight matrix is reconstructed from the transformed elements; the reconstructed matrix is used to define the distance and to select candidate core points. Specifically, the weight matrix is decomposed into an eigenvector matrix and a matrix with the eigenvalues as diagonal elements, and the eigenvalues are transformed by a function that depends on the cluster distribution form used when calculating the compression cost. The transformed values are used as the elements of a new diagonal matrix, and the reconstructed weight matrix is used for the distance calculation.

When the weight matrix Σ is diagonalized, it is expressed by equation (10) below using the matrix Λ, whose diagonal elements are the eigenvalues, and the eigenvector matrix V.
Σ = VΛV^T ... (10)

If the weight matrix Σ is not diagonally dominant, the identity matrix multiplied by φ, φ·I, is added to the matrix Λ whose diagonal elements are the eigenvalues. The value of φ is chosen so that this operation does not substantially change the ratios between the eigenvalues.

When the compression cost is calculated, a probability distribution is determined for each eigenvector corresponding to an eigenvalue. When the ratios of the eigenvalues are changed, each eigenvalue is changed according to the probability distribution used in the compression cost calculation. Because the inverse of the weight matrix is used when calculating the distance that determines the order in which core points are selected, the larger an eigenvalue of the weight matrix, the smaller its reciprocal, and the smaller the change in distance along the corresponding direction.

The ratios are changed taking the normal distribution as the reference. For a uniform distribution, the probability does not change even when the deviation from the mean is large, so differences in distance should be eliminated. The variance used in the distance calculation is therefore made apparently larger by a monotonically increasing function (for example, a linear function that multiplies it by a constant), or the axis is excluded from the distance calculation altogether so that it has no influence; in this way the distance is estimated as short even at both ends of the uniform distribution. For a Laplace distribution, the probability changes greatly for a slight departure from the mean, so the distance should change greatly; the apparent variance is therefore transformed by a monotonically decreasing function (for example, reduced by dividing it by a constant), so that the distance changes greatly even for a slight departure from the mean. The constant here is a positive real number greater than 1 and is set according to the shape of the core points to be extracted. A separate constant may be defined for each probability distribution where appropriate.

Another way of transforming the eigenvalues is to prepare and apply a common function that does not depend on the cluster distribution form used in the compression cost calculation. When the compression cost is minimal, the probability distribution that is applied tends to be determined roughly by the magnitude of the variance (or eigenvalue). It is therefore sufficient to prepare a function that changes the ratios so that large eigenvalues become larger and small eigenvalues become smaller. For example, the power function f(x) = x^r may be used, where the exponent r is a positive real number greater than 1.
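The power-function variant can be sketched as follows: transform each eigenvalue with f(x) = x^r and rebuild the weight matrix as VΛ'V^T. The eigenvalues, the 45-degree eigenvectors, and the function names are hypothetical inputs chosen for illustration; in practice they would come from the diagonalization of the weight matrix.

```python
import math


def transform_eigenvalues(eigenvalues, r):
    """Apply f(x) = x**r with r > 1 to every eigenvalue."""
    if r <= 1.0:
        raise ValueError("r must be greater than 1")
    return [lam ** r for lam in eigenvalues]


def rebuild_matrix(eigenvalues, eigenvectors):
    """Return V * diag(eigenvalues) * V^T; eigenvectors are the columns of V."""
    d = len(eigenvalues)
    return [[sum(lam * v[i] * v[j]
                 for lam, v in zip(eigenvalues, eigenvectors))
             for j in range(d)]
            for i in range(d)]


eigenvalues = [4.0, 0.25]                      # hypothetical spread 16 : 1
c = math.sqrt(0.5)
eigenvectors = [(c, c), (-c, c)]               # axes rotated by 45 degrees

stretched = transform_eigenvalues(eigenvalues, r=2.0)
ratio_before = eigenvalues[0] / eigenvalues[1]
ratio_after = stretched[0] / stretched[1]
weight = rebuild_matrix(stretched, eigenvectors)
```

Because only the eigenvalues are changed, the axes of the boundary ellipse keep their directions while its elongation increases.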

In this way, a transformation that reflects the characteristics of the core points in the weight matrix can be performed.

FIG. 7 shows, in two dimensions, three-dimensional coordinate data in which a linear cluster (400 points) and other outliers (100 points) are distributed. First, core points and outliers are separated by the above method, and the probability distribution to be applied to the core points when calculating the compression cost is determined.

With the robust evaluation method described above, the center can be determined accurately, as shown in FIG. 4, and the ellipse represented by the variance-covariance matrix can be made to follow the distribution of the core points. As shown in FIG. 5, a proper decorrelation matrix V is composed of the eigenvectors that indicate the directions of spread of the cluster's core points. Multiplying by V^T transforms the coordinates and removes the correlation between them.

(Separation of core points and outliers)
Next, the separation of core points and outliers is described. A distance is defined using the various variance-covariance matrices described above as weight matrices. Points within a given distance (boundary) of the center are then provisionally treated as core points and the rest as outliers, and the compression cost is calculated repeatedly while the boundary is varied. The boundary that gives the smallest compression cost is used to separate the core points from the outliers.

The details are as follows. First, a proper decorrelation matrix for determining the boundary is obtained; that is, the one with the best compression cost is selected from the candidate variance-covariance matrices. The candidates include, in addition to ordinary variance-covariance matrices, the matrix Σ_R obtained by the robust method described above, the matrix Σ_C obtained by the conventional method, and the weight matrices obtained by transforming each of them according to the probability distribution.

Furthermore, using the Mahalanobis distance defined by the robustly estimated center vector μ_R and the robust variance-covariance matrix Σ_R, points close to the center of the cluster are selected, and a variance-covariance matrix computed from them is added to the candidates as a basis for the decorrelation matrix. From the target cluster, a number of points corresponding to a certain fraction of the cluster (for example, 50%) is selected in order of closeness to the center μ_R. From the selected points, the matrix Σ_{C,50}, a candidate based on the conventional covariance, and the matrix Σ_{R,50}, a candidate based on the robust α-trimmed mean, are calculated. In addition, the identity matrix I is a candidate for the case where no decorrelation is performed. Among these matrices {Σ_C, Σ_R, Σ_{C,50}, Σ_{R,50}, I}, the matrix Σ* that gives the optimal (minimum) compression cost is selected. In the example of FIG. 4, the minimum compression cost is 1600 with the conventional estimation, whereas it is 1480 with the robust estimation.
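The robust center, the robust covariance, and the selection of the nearest fraction of points can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the specification: the trimming fraction `alpha`, the symmetric trimming rule, and all function names are hypothetical, and the covariance elements are trimmed independently, which does not by itself guarantee a positive-definite matrix.

```python
def trimmed_mean(values, alpha=0.1):
    """Alpha-trimmed mean: drop the lowest and highest alpha fraction, average the rest."""
    v = sorted(values)
    k = int(alpha * len(v))
    kept = v[k:len(v) - k] if len(v) - 2 * k > 0 else v
    return sum(kept) / len(kept)

def robust_center(points, alpha=0.1):
    """Center estimated coordinate by coordinate with the trimmed mean."""
    return (trimmed_mean([p[0] for p in points], alpha),
            trimmed_mean([p[1] for p in points], alpha))

def robust_covariance(points, center, alpha=0.1):
    """Each matrix element is the trimmed mean of products of deviations (assumed form)."""
    dx = [p[0] - center[0] for p in points]
    dy = [p[1] - center[1] for p in points]
    sxx = trimmed_mean([a * a for a in dx], alpha)
    sxy = trimmed_mean([a * b for a, b in zip(dx, dy)], alpha)
    syy = trimmed_mean([b * b for b in dy], alpha)
    return sxx, sxy, syy

def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance using the inverse of a 2x2 covariance matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def nearest_fraction(points, center, cov, fraction=0.5):
    """Select the given fraction of points closest to the center."""
    ranked = sorted(points, key=lambda p: mahalanobis_sq(p, center, cov))
    return ranked[:max(1, int(fraction * len(points)))]

# Illustrative data: ten points near the origin and one distant point (assumed values).
pts = [(-2, -1), (-1, 0), (0, 0), (1, 0), (2, 1), (-1, -1),
       (1, 1), (0, 1), (0, -1), (1, -1), (100, 100)]
mu_r = robust_center(pts)
sigma_r = robust_covariance(pts, mu_r)
inner = nearest_fraction(pts, mu_r, sigma_r, fraction=0.5)
```

The subset `inner` would then serve as the input for computing the candidates corresponding to Σ_{C,50} and Σ_{R,50}; choosing among the candidates requires the compression cost defined elsewhere in the specification and is not shown here.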

Next, the outliers within the cluster are found. Up to this point, the center μ_R of the core points has been obtained and the variance-covariance matrix Σ* (with the corresponding decorrelation matrix V*) has been selected. The final aim is to divide the cluster into two new sets: the set of core points C_core and the set of outliers C_outlier. The boundary between the two is defined on the basis of the Mahalanobis distance determined by Σ*, and the cluster is split into the two sets. Initially all points are treated as outliers (C_outlier = C, C_core = {}). Then, based on the Mahalanobis distance, points x are removed from C_outlier and added to C_core one at a time, in order of closeness to the center μ_R. At each step, the compression cost before and after moving the point x is calculated.

In each iteration, the point x moved from C_outlier to C_core is mapped into the space defined by the selected decorrelation matrix V*. The compression cost of the new partition (C_core ∪ {x}, C_outlier − {x}) is determined using the probability distribution that yields the minimum compression cost. The compression cost of the outliers is obtained by encoding them with a uniform distribution.

With this way of choosing the sets, the boundary separating the core points from the outliers is an ellipse on which the Mahalanobis distance given by the decorrelation matrix is constant, and the two sets are defined by the inside and outside of that ellipse. In FIG. 4, the minimum cost (1480) is attained with a partition of 24 core points and 6 outliers.

FIG. 4 shows how the two compression costs corresponding to the conventional decorrelation matrix V_C and the robust decorrelation matrix V_R change. At the initial stage all points are treated as outliers, and both compression costs are about 1800. As points move from C_outlier to C_core, the compression cost decreases. For the robust decorrelation matrix V_R, the cost reaches its minimum of 1480 when there are 24 core points. After that, the compression cost rises again to about 1800.
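The boundary sweep can be sketched as follows. The cost used here is a hypothetical stand-in for the compression cost of the specification: core points are charged the code length of a fixed two-dimensional Gaussian at grid resolution `resolution`, and outliers are charged a uniform code over a region of area `area`. The parameter names and the Gaussian-only choice are assumptions made for illustration.

```python
import math

def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance for a 2x2 covariance (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def gaussian_bits(points, center, cov, resolution):
    """Bits needed to encode points with a 2D Gaussian (assumed cost model)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    log_norm = math.log2(2.0 * math.pi * math.sqrt(det))
    bits = 0.0
    for p in points:
        d2 = mahalanobis_sq(p, center, cov)
        bits += log_norm + 0.5 * d2 * math.log2(math.e) - 2.0 * math.log2(resolution)
    return bits

def uniform_bits(n_points, area, resolution):
    """Bits needed to encode n points uniformly over the given area."""
    return n_points * math.log2(area / resolution ** 2)

def split_core_outliers(points, center, cov, area, resolution=1.0):
    """Move points from the outlier set to the core set in order of distance,
    recording the total cost at each step, and keep the cheapest split."""
    ranked = sorted(points, key=lambda p: mahalanobis_sq(p, center, cov))
    costs = [uniform_bits(len(ranked), area, resolution)]  # all points are outliers
    best_k, best_cost = 0, costs[0]
    for k in range(1, len(ranked) + 1):
        cost = (gaussian_bits(ranked[:k], center, cov, resolution)
                + uniform_bits(len(ranked) - k, area, resolution))
        costs.append(cost)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return ranked[:best_k], ranked[best_k:], costs

# Illustrative data (assumed values): six points near the center, two far away.
pts = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (40, 40), (-30, 20)]
core, outliers, costs = split_core_outliers(pts, (0.0, 0.0), (1.0, 0.0, 1.0), area=10000.0)
```

With this cost model the recorded `costs` fall while nearby points are absorbed and rise once distant points are forced into the Gaussian, which mirrors the qualitative shape described for FIG. 4.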

As described above, the separation unit 103c separates each initial cluster generated by the initial cluster generation unit 103a on the basis of the compression cost, producing the set C_core, which contains the data with a concentrated distribution (core points), and the set C_outlier, which contains the data with a sporadic distribution (outliers).

<Cluster merging>
Cluster merging changes the split positions and the number of splits produced by an arbitrary clustering algorithm (for example, the K-means method). Owing to shortcomings of the clustering algorithm, a cluster may be generated that contains points which really belong to another cluster. A method of correcting, by cluster merging, the incorrect splits caused by a conventional clustering method is therefore described.

For example, the K-means method may erroneously split what should be a single cluster into several clusters. The algorithm of the proposed method corrects such erroneous splits by merging clusters.

Here too, the compression cost is used when merging two clusters: it is determined whether merging a pair of clusters reduces the compression cost. That is, for a pair of clusters (C_i, C_j), the reduction in compression cost due to merging, reducecost(C_i, C_j), is defined as shown in equation (11) below.

If the reduction in compression cost due to merging satisfies reducecost(C_i, C_j) > 0, the pair of clusters (C_i, C_j) is regarded as a candidate for merging.

This procedure follows a greedy method and is repeated as long as the compression cost can be reduced. In each iteration, the two clusters with the largest reducecost(C_i, C_j) are merged, in search of the clustering with the minimum compression cost. Furthermore, to avoid getting stuck in a local minimum, the calculation is not stopped immediately when no further reduction is available. That is, the algorithm does not stop even when reducecost(C_i, C_j) ≤ 0; instead, it continues for a further t iterations. Even when reducecost(C_i, C_j) is negative, that is, when merging C_i and C_j increases the compression cost, the pair (C_i, C_j) with the largest reducecost(C_i, C_j) continues to be merged.
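The greedy procedure with continued iterations can be sketched as follows. Because equation (11) is not reproduced in this text, the sketch assumes reducecost(C_i, C_j) = cost(C_i) + cost(C_j) − cost(C_i ∪ C_j); the `cost` callback, the `patience` parameter (playing the role of t), and the toy cost function are all hypothetical.

```python
def greedy_merge(clusters, cost, patience=2):
    """Greedily merge the pair with the largest cost reduction.

    Merging continues for up to `patience` non-improving steps so that a
    local minimum does not stop the search; the best clustering seen is returned.
    """
    clusters = [list(c) for c in clusters]
    total = sum(cost(c) for c in clusters)
    best, best_total = [list(c) for c in clusters], total
    bad_steps = 0
    while len(clusters) > 1 and bad_steps < patience:
        # Assumed form of reducecost: cost(Ci) + cost(Cj) - cost(Ci u Cj).
        gain, i, j = max(
            (cost(clusters[i]) + cost(clusters[j]) - cost(clusters[i] + clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        total -= gain  # a negative gain increases the total cost
        if total < best_total:
            best, best_total, bad_steps = [list(c) for c in clusters], total, 0
        else:
            bad_steps += 1
    return best, best_total

def toy_cost(cluster):
    """Hypothetical 1D cost: fixed overhead per cluster plus the sum of squared deviations."""
    mean = sum(cluster) / len(cluster)
    return 10.0 + sum((x - mean) ** 2 for x in cluster)

# Illustrative 1D data (assumed values): two nearby groups and one distant group.
result, result_cost = greedy_merge([[0.0, 0.1], [0.2, 0.3], [10.0, 10.1]], toy_cost)
```

In this example the two nearby groups are merged, the subsequent merge with the distant group raises the cost, and the clustering recorded before that step is returned.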

As described above, the merging unit 103d merges, on the basis of the compression cost, the clusters C_core that contain the data with a concentrated distribution generated by the separation unit 103c.

In addition, as an algorithm unique to the present application for reducing the calculation time when the number of initial clusters is larger than the appropriate number, the candidate pairs of clusters to be merged are determined, before the merging calculation described above, as pairs of adjacent clusters based on a Delaunay diagram. Because the surface defects of the thin plate handled in the present invention are distributed in two dimensions, a Delaunay diagram can be used. Specifically, the cluster centers are used as generating points. First, four generating points are selected and two triangles are formed. Each triangle is confirmed to belong to a Delaunay triangulation by checking that its circumscribed circle does not contain a vertex of the other triangle; if it does, the way the triangles are divided is changed. One more point is then added, the triangle containing the added point is subdivided into triangles, and the result is checked to be a Delaunay triangulation; if it is not, the division is changed. Generating points are added one at a time in this way until none remain and the result is a Delaunay triangulation, which yields the Delaunay diagram. Clusters whose generating points are connected by a Delaunay edge in the diagram are taken as merge candidates. Determining the candidates from the Delaunay diagram in this way greatly reduces the amount of calculation.
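The restriction of merge candidates to Delaunay neighbors can be sketched as follows. Instead of the incremental construction described above, this sketch uses a brute-force empty-circumcircle test over all triples of centers, which is far slower but short enough to read; a library routine such as `scipy.spatial.Delaunay` would normally be preferable. The function names and the tolerance values are assumptions.

```python
from itertools import combinations

def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def circumcircle_contains(a, b, c, p):
    """True if p lies strictly inside the circumscribed circle of triangle abc."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The sign of det depends on the vertex order, so normalise by the orientation.
    return det * orientation(a, b, c) > 1e-12

def delaunay_edges(centers):
    """Index pairs (i, j), i < j, joined by an edge of the Delaunay triangulation."""
    edges = set()
    n = len(centers)
    for i, j, k in combinations(range(n), 3):
        a, b, c = centers[i], centers[j], centers[k]
        if abs(orientation(a, b, c)) < 1e-12:
            continue  # skip degenerate (collinear) triples
        if any(circumcircle_contains(a, b, c, centers[m])
               for m in range(n) if m not in (i, j, k)):
            continue  # another center lies inside the circle: not a Delaunay triangle
        edges.update({(i, j), (i, k), (j, k)})
    return edges

# Illustrative cluster centers (assumed values).
centers = [(0.0, 0.0), (4.0, 1.0), (8.0, 0.0), (12.0, 1.0)]
candidates = delaunay_edges(centers)
```

Only the pairs in `candidates` would then be evaluated for merging; here the two end clusters (indices 0 and 3) are not adjacent and are therefore never compared.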

In addition, as an algorithm unique to the present invention for the case where the number of initial clusters is smaller than the appropriate number, outliers separated from a cluster at the outlier separation or cluster merging stage are incorporated into another cluster. To this end, the points separated as outliers are collected and treated as a virtual new cluster. Whether any of them can be incorporated as core points is judged from reducecost, the reduction in compression cost for the virtual cluster and a cluster of core points; if this value is large, some of the outliers are taken back in as core points of another cluster. Moreover, if a cluster of core points can be generated from the virtual cluster, it can be made a new cluster.

As described above, flaws can be grouped automatically on the basis of the coordinate data on the flaw positions collected by the automatic flaw inspection apparatus, and feature quantities relating to the flaw distribution, such as the centroid position, spatial size, and number density of each flaw group, can be extracted at high speed and in large volume. Because the calculation is carried out so as to be optimal with respect to an index, objective and quantitative feature quantities can be extracted with higher reproducibility than analysis by a human. Consequently, analyses that use large amounts of flaw distribution data, such as correlation analysis between flaw positions and operating conditions, can be performed quickly.
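The feature quantities named above can be computed per flaw group along the following lines. The definitions are assumptions made for illustration: spatial size is taken as the extent of the axis-aligned bounding box and number density as the point count divided by the bounding-box area; the specification may define them differently.

```python
def cluster_features(points):
    """Centroid, bounding-box size and number density of one flaw group (assumed definitions)."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    centroid = (sum(xs) / n, sum(ys) / n)
    size = (max(xs) - min(xs), max(ys) - min(ys))
    area = size[0] * size[1]
    # A degenerate group (all points on one line) has zero area; report infinity.
    density = n / area if area > 0 else float("inf")
    return {"centroid": centroid, "size": size, "density": density}

# Illustrative group (assumed values): longitudinal position, width position.
features = cluster_features([(0.0, 0.0), (2.0, 0.0), (2.0, 4.0), (0.0, 4.0)])
```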

Moreover, as described above, by separating the data with a sporadic distribution (outliers) from the initial clusters, generating clusters that contain the data with a concentrated distribution, and merging those clusters, flaws can be grouped in a natural way with the outliers excluded. This improves the accuracy of the analysis of the flaw distribution form and makes it possible to extract feature quantities relating to the flaw distribution quickly, in large volume, and accurately.

FIG. 6 shows an example of the distribution of flaws occurring in a thin-plate coil; the horizontal axis is the position in the longitudinal direction of the coil and the vertical axis is the position in the width direction. FIG. 6(a) is the result of the technique disclosed in Patent Document 1, in which three elliptical distributions centered on the □ marks are extracted. In contrast, FIG. 6(b) is the result of the technique to which the present invention is applied: the data are separated into core points (•) with a concentrated distribution and outliers (×) with a sporadic distribution, and the flaw group occurring along a straight line is extracted accurately. On this basis, countermeasures were taken in the process responsible for the flaws, and the production of a large number of defective products was prevented.

(Other embodiments of the present invention)
The scope of the present invention also includes implementations in which, in order to operate various devices so as to realize the functions of the embodiment described above, program code of software for realizing those functions is supplied to a computer in an apparatus or system connected to the devices, and the devices are operated in accordance with the program stored in the computer (CPU or MPU) of that system or apparatus.

In this case, the program code of the software itself realizes the functions of the embodiment described above, and the program code itself and the means for supplying it to the computer, for example a recording medium storing the program code, constitute the present invention. As the recording medium for storing the program code, for example, a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, nonvolatile memory card, or ROM can be used.

It goes without saying that the program code is included in the embodiments of the present invention not only when the functions of the embodiment described above are realized by the computer executing the supplied program code, but also when they are realized by the program code in cooperation with the OS (operating system) running on the computer, other application software, or the like.

This technique is used mainly to analyze the causes of defective products attributable to surface defects in the manufacture of thin-plate steel products.

FIG. 1 is a diagram showing an example of the configuration of the apparatus for analyzing the distribution form of surface defects of a thin plate according to the present embodiment.
FIG. 2 is a diagram showing an example of clustering.
FIG. 3 is a diagram for explaining an example of the compression cost.
FIG. 4 shows the result of comparing the conventional method with the robust method; (a) shows the relationship between the data and the ellipses representing their covariance, and (b) is a characteristic diagram showing the relationship between the number of core points and the compression cost.
FIG. 5 is a diagram for explaining the decorrelation matrix.
FIG. 6 is a diagram showing an example of the distribution of flaws occurring in a thin-plate coil.
FIG. 7 is a diagram showing three-dimensional coordinate data in which a linear cluster (400 points) and other outliers (100 points) are distributed.

Explanation of symbols

101 Flaw data input unit
102 Flaw data storage unit
103 Calculation unit
103a Initial cluster generation unit
103b Compression cost calculation unit
103c Separation unit
103d Merging unit
104 Analysis result display unit

Claims

A surface defect distribution form analysis apparatus for analyzing the distribution form of surface defects occurring in a thin plate, comprising:
input means for inputting at least coordinate data on the positions at which surface defects occur in the thin plate to be analyzed;
initial cluster generation means for generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects of the thin plate, based on the coordinate data input by the input means, in which surface defects with a concentrated distribution and surface defects with a sporadic distribution are mixed;
compression cost calculation means for calculating a compression cost, which is an index for evaluating the goodness of the generated clusters, as the amount of information when the coordinate data are compressed according to the distribution form of the clusters;
separation means for generating, based on the compression cost, clusters in which each initial cluster generated by the initial cluster generation means is separated into coordinate data with a concentrated distribution and coordinate data with a sporadic distribution; and
merging means for merging the clusters containing the coordinate data with a concentrated distribution generated by the separation means, using the compression cost as the merging criterion,
wherein the coordinate data with a sporadic distribution separated from a cluster by the separation means are incorporated as coordinate data forming the concentrated distribution of a new cluster or of another cluster, and
wherein the separation means takes as the center the coordinates obtained as the trimmed mean of the coordinate data in the cluster, determines the distance from the center using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of a variance-covariance matrix generated by trimmed means over the coordinate data of all points in the cluster, and selects the coordinate data with a concentrated distribution in ascending order of distance.

The surface defect distribution form analysis apparatus according to claim 1, further comprising:
feature quantity calculation means for calculating, as a feature quantity, at least one of the centroid position, the spatial size, and the number density of the clusters generated by each of the means; and
analysis result display means for displaying the calculation result obtained by the feature quantity calculation means.

The surface defect distribution form analysis apparatus according to claim 1 or 2, wherein the separation means, as the procedure for selecting the coordinate data with a concentrated distribution, selects from among the combinations obtained by designating, for each individual item of coordinate data, whether it belongs to the concentrated distribution or to the sporadic distribution.

The surface defect distribution form analysis apparatus according to any one of the preceding claims, wherein the separation means uses, as the weight matrix, the inverse of a matrix whose elements are each the trimmed mean of the products of the deviations from the mean of two variables.

The surface defect distribution form analysis apparatus according to claim 1, wherein the separation means decomposes the weight matrix into an eigenvector matrix and a matrix having the eigenvalues as its diagonal elements, and uses a matrix reconstructed with eigenvalues that have been converted by a function according to the cluster distribution form used when calculating the compression cost.

The surface defect distribution form analysis apparatus according to any one of claims 1 to 5, wherein the merging means determines the candidate clusters to be merged on the basis of a Delaunay diagram representing the adjacency relationships of the clusters.

A surface defect distribution form analysis method for analyzing the distribution form of surface defects occurring in a thin plate, comprising:
an input step of inputting at least coordinate data on the positions at which surface defects occur in the thin plate to be analyzed;
an initial cluster generation step of generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects of the thin plate, based on the coordinate data input in the input step, in which surface defects with a concentrated distribution and surface defects with a sporadic distribution are mixed;
a compression cost calculation step of calculating a compression cost, which is an index for evaluating the goodness of the generated clusters, as the amount of information when the coordinate data are compressed according to the distribution form of the clusters;
a separation step of generating, based on the compression cost, clusters in which each initial cluster generated in the initial cluster generation step is separated into coordinate data with a concentrated distribution and coordinate data with a sporadic distribution; and
a merging step of merging the clusters containing the coordinate data with a concentrated distribution generated in the separation step, using the compression cost as the merging criterion,
wherein the coordinate data with a sporadic distribution separated from a cluster in the separation step are incorporated as coordinate data forming the concentrated distribution of a new cluster or of another cluster, and
wherein, in the separation step, the coordinates obtained as the trimmed mean of the coordinate data in the cluster are taken as the center, the distance from the center is determined using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of a variance-covariance matrix generated by trimmed means over the coordinate data of all points in the cluster, and the coordinate data with a concentrated distribution are selected in ascending order of distance.

A program for causing a computer to execute processing for analyzing the distribution form of surface defects occurring in a thin plate, the program causing the computer to execute:
input processing for inputting at least coordinate data on the positions at which surface defects occur in the thin plate to be analyzed;
initial cluster generation processing for generating, using a predetermined clustering method, initial clusters representing the distribution form of the surface defects of the thin plate, based on the coordinate data input by the input processing, in which surface defects with a concentrated distribution and surface defects with a sporadic distribution are mixed;
compression cost calculation processing for calculating a compression cost, which is an index for evaluating the goodness of the generated clusters, as the amount of information when the coordinate data are compressed according to the distribution form of the clusters;
separation processing for generating, based on the compression cost, clusters in which each initial cluster generated by the initial cluster generation processing is separated into coordinate data with a concentrated distribution and coordinate data with a sporadic distribution; and
merging processing for merging the clusters containing the coordinate data with a concentrated distribution generated by the separation processing, using the compression cost as the merging criterion,
wherein the coordinate data with a sporadic distribution separated from a cluster in the separation processing are incorporated as coordinate data forming the concentrated distribution of a new cluster or of another cluster, and
wherein, in the separation processing, the coordinates obtained as the trimmed mean of the coordinate data in the cluster are taken as the center, the distance from the center is determined using, as a weight matrix indicating the variation of the coordinate data in the cluster, the inverse of a variance-covariance matrix generated by trimmed means over the coordinate data of all points in the cluster, and the coordinate data with a concentrated distribution are selected in ascending order of distance.