JP6081146B2

JP6081146B2 - Data monitoring device

Info

Publication number: JP6081146B2
Application number: JP2012240674A
Authority: JP
Inventors: 珠規子夏目; 森　俊樹; 俊樹森
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2012-10-31
Filing date: 2012-10-31
Publication date: 2017-02-15
Anticipated expiration: 2032-10-31
Also published as: JP2014089671A

Description

Embodiments of the present invention relate to a data monitoring apparatus that monitors multivariate data.

In product development projects that are growing in scale and complexity, it is important to collect and use various types of data, such as man-hours, source code size, and defect counts, in order to grasp the project status accurately and manage the project appropriately.

In recent years, development support tools and development management systems, such as source code static analysis tools and defect management systems, have become widespread, making it possible to collect and accumulate large amounts of project-related multivariate data continuously.

Conventionally, control charts, cumulative graphs, and similar techniques have been used to monitor project data and grasp the project status. These techniques, however, were designed to capture time-series changes in one variable or a small number of variables, and they are not necessarily suited to monitoring multivariate data in which many variables are interrelated.

What is desired is a project data monitoring technique that, while large amounts of multivariate data are being updated continuously, automatically narrows down candidates for abnormal values and outliers and makes it easy to identify the factors that produced them.

JP 2010-211338 A
JP 2004-349852 A
JP 2008-146591 A
JP 2008-129011 A
JP 2001-75642 A

The problem to be solved by the present invention is to provide a data monitoring apparatus capable of identifying the main factor behind an anomalous sample.

A data monitoring apparatus according to an embodiment includes an outlier candidate detection unit and a result display unit. The outlier candidate detection unit performs principal component analysis on multivariate data and detects candidates for outliers, that is, values that deviate greatly from the data population. The result display unit generates and displays (a) an outlier candidate list in which samples, each being a set of the variables, and the principal components obtained by the principal component analysis are ranked in correspondence with the outlier candidates, and (b) a variable influence graph that shows, for the principal component corresponding to an outlier candidate, the per-variable values of its principal component score together with the factor loadings.

FIG. 1 is a block diagram showing a schematic configuration of a data monitoring apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the outlier degree.
FIG. 3 is a diagram showing an example of the outlier degree calculation.
FIG. 4 is a diagram showing an example of an outlier candidate list.
FIG. 5 is a diagram showing an example of a variable influence graph.
FIG. 6 is a flowchart showing the flow of data monitoring processing in the data monitoring apparatus according to the first embodiment.
FIG. 7 is a diagram for explaining sample weights.
FIG. 8 is a diagram for explaining variable weights.
FIG. 9 is a block diagram showing a schematic configuration of a data monitoring apparatus according to the second embodiment.
FIG. 10 is a diagram showing an example listing of the weighted multivariate data stored in the storage unit.
FIG. 11 is a flowchart showing the flow of data monitoring processing in the data monitoring apparatus according to the second embodiment.
FIG. 12 is a diagram showing an example of weight setting.
FIG. 13 is a diagram showing an example of the outlier degree calculation after the weight setting.
FIG. 14 is a diagram showing an example of a weight change.
FIG. 15 is a diagram showing an example of the outlier degree calculation after the weight change.

An embodiment of the present invention is described below with reference to the drawings. In the drawings, identical parts are denoted by the same reference numerals, and redundant description is omitted.

First, the main terms used in the present embodiment are explained.
A “variable” is an item that can take different values.
A “sample” is a set of variable values (x11, x12, ..., x1m).
“Multivariate data” is data composed of a plurality of variables.
An “outlier” is a value that deviates greatly from the data population.

“Principal component analysis” is a method that applies a coordinate transformation to multivariate data, taking the correlations between variables into account, and simplifies the data while losing as little information as possible. A new composite variable Z1 is defined from the input data (x1, x2, ..., xm) as a principal component such that its variance is maximized.

The composite variables are called principal components. Z1, which has the largest variance, is called the first principal component (PC1); Z2, which has the next largest variance, is called the second principal component (PC2); and so on down to the m-th principal component (PCm).

The principal components are determined to be mutually orthogonal so that they are uncorrelated with one another. Because the composite variables are uncorrelated, each composite variable may be evaluated independently.

The “principal component score” is the value obtained by projecting the original data onto a principal component vector.
The “principal component contribution ratio” indicates how much of each variable is explained by the m principal components. It satisfies 0 ≤ contribution ratio ≤ 1 and is given by (principal component contribution ratio) = (variance of the principal component) / (sum of the variances of the variables).
The “factor loading (principal component loading)” is the correlation coefficient between a principal component and an original variable.
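The quantities defined above can be sketched numerically. The following is a minimal illustration, assuming NumPy and the variance-covariance-matrix approach; the function name `pca_summary` and the use of the unbiased variance (`ddof=1`) are assumptions made for this sketch, not details stated in this description.

```python
import numpy as np

def pca_summary(X):
    """Illustrative PCA on an (n_samples, m_variables) array with m >= 2.

    Returns principal component scores, contribution ratios, and factor loadings.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)

    # Variance-covariance matrix of the variables (unbiased; an assumption).
    cov = np.cov(centered, rowvar=False)

    # eigh returns eigenvalues in ascending order; reorder so PC1 has the largest variance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
    eigvecs = eigvecs[:, order]

    # Principal component score: projection of the centered data onto each component vector.
    scores = centered @ eigvecs

    # Contribution ratio: variance of each component / sum of the variances of the variables.
    contribution = eigvals / eigvals.sum()

    # Factor loading: correlation coefficient between original variable j and component k.
    std = X.std(axis=0, ddof=1)
    loadings = eigvecs * np.sqrt(eigvals) / std[:, None]
    return scores, contribution, loadings
```

Rows of `loadings` correspond to the original variables and columns to the principal components. The sketch assumes that no variable is constant, since a zero standard deviation would make the correlation undefined.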

(First Embodiment)
In the present embodiment, the maximum absolute value among the normalized principal component scores of a sample is defined as its “outlier degree”, and candidates for “outliers”, values that deviate greatly from the data population, are detected on the basis of the outlier degree.

FIG. 1 is a block diagram showing a schematic configuration of a data monitoring apparatus according to the first embodiment of the present invention. The apparatus is realized using a general-purpose computer (for example, a personal computer (PC)) and software that runs on that computer. The computer may also be an engineering workstation (EWS) suited to CAD (Computer Aided Design) or CAE (Computer Aided Engineering). The present embodiment can also be implemented as a program that causes such a computer to execute a series of procedures for monitoring multivariate data.

As shown in FIG. 1, the data monitoring apparatus according to the first embodiment mainly comprises an input unit 11, an outlier candidate detection unit 12, and a result display unit 13.

The input unit 11 inputs multivariate data. The input multivariate data is sent to the outlier candidate detection unit 12.

The outlier candidate detection unit 12 performs principal component analysis on the multivariate data received from the input unit 11 and detects outlier candidates. To detect outlier candidates, the present embodiment calculates an outlier degree. The detection of outlier candidates is described later. The calculated outlier degree is sent to the result display unit 13.

The result display unit 13 generates and displays an outlier candidate list and a variable influence graph for each outlier candidate, based on the outlier degree received from the outlier candidate detection unit 12. The outlier candidate list and the variable influence graph are described later.

Next, the multivariate data monitoring processing performed by the data monitoring apparatus 100 configured as described above is explained.

<Detection of outlier candidates>
Principal component analysis is performed on the multivariate data. A technique using the variance-covariance matrix, for example, is suitable. The principal component scores obtained by the principal component analysis are normalized (normalization of the feature axes) so that multidimensional data that lie far from the centroid along a given characteristic (principal component), and that cannot be compared directly, become easier to compare. For example, with a normalization that linearly transforms each score to mean 0 and variance 1, the axes that the principal component analysis transformed independently of one another are rescaled so that their variances have the same length, which makes it easier to identify data that lie far from the centroid.

An outlier degree is calculated for each sample (set of variables). In the present embodiment, the maximum absolute value among the normalized principal component scores of a sample is defined as the outlier degree Dj. The outlier degree Dj is expressed by the following equation.
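The equation itself is not reproduced in this text. Based only on the verbal definition above, it can plausibly be written as follows, where Sn_{j,k} denotes the normalized score of sample j on the k-th principal component and m is the number of principal components. The subscript notation is an assumption chosen here to match the column labels Sn_1 to Sn_5 used for FIG. 3.

```latex
D_j = \max_{1 \le k \le m} \left| Sn_{j,k} \right|
```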

FIG. 2 is a diagram for explaining the outlier degree. In the example shown in FIG. 2, the outlier degree is the value projected onto principal component 2 (the principal component score for principal component 2).

FIG. 3 is a diagram showing an example of the outlier degree calculation. As shown in FIG. 3, samples A to L are arranged vertically and the normalized principal component scores Sn_1 to Sn_5 are arranged horizontally. In this example, the normalized principal component score of sample A with the largest absolute value is the first principal component score Sn_1, which is -1.53798. Its absolute value, 1.53798, is therefore taken as the outlier degree of sample A. Likewise, the shaded principal component score in each row gives the outlier degree of that sample. The rightmost column D of FIG. 3 shows the outlier degree of each sample. The bottom row of FIG. 3 shows the cumulative contribution ratio, which is the running total of the contribution ratios, each of which indicates how much of each variable is explained by the m principal components. In the example shown in FIG. 3, the first to third principal components (PC1 to PC3) can explain about 98% of the data as a whole.
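The per-sample calculation illustrated by FIG. 3 can be sketched as follows. This is a minimal illustration assuming NumPy; the function names and the choice of the unbiased standard deviation (`ddof=1`) for normalization are assumptions, since the description only requires mean 0 and variance 1.

```python
import numpy as np

def normalize_scores(scores):
    """Rescale each principal component (column) to mean 0 and variance 1."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

def outlier_degree(normalized):
    """For each sample (row), return the largest absolute normalized score
    and the 1-based index of the principal component where it occurs."""
    magnitude = np.abs(np.asarray(normalized, dtype=float))
    degree = magnitude.max(axis=1)
    component = magnitude.argmax(axis=1) + 1
    return degree, component
```

For a row whose largest-magnitude entry is -1.53798 in the first column, as for sample A in FIG. 3, `outlier_degree` returns a degree of 1.53798 and component 1.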

<Outlier candidate list>
The outlier candidate list ranks the samples and principal components corresponding to outlier candidates according to the outlier degree. FIG. 4 is a diagram showing an example of an outlier candidate list. In the example shown in FIG. 4, entries with an outlier degree of 2.0 or more are treated as outlier candidates. Sample B, with an outlier degree of 2.682162 on the fourth principal component PC4, ranks highest among the outlier candidates.
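A list of this kind could be built as in the sketch below. The function name `outlier_candidates` and the tuple layout are hypothetical; the threshold of 2.0 follows the FIG. 4 example and is exposed as a parameter because the description treats it as an example value.

```python
def outlier_candidates(names, degrees, components, threshold=2.0):
    """Return (sample, principal component, degree) tuples for samples whose
    degree is at least `threshold`, ranked from largest to smallest degree."""
    rows = [
        (name, f"PC{component}", float(degree))
        for name, degree, component in zip(names, degrees, components)
        if degree >= threshold
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Samples A and B use values quoted in the description; sample X is made up for illustration.
ranking = outlier_candidates(
    ["A", "B", "X"], [1.53798, 2.682162, 2.1], [1, 4, 2]
)
```

Here sample A falls below the threshold and is dropped, and sample B on PC4 is ranked first.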

<Variable influence graph>
The variable influence graph visualizes, for the principal component corresponding to an outlier candidate, the per-variable values of its principal component score and the factor loadings. The graph carries information on the variables that are the main factors behind an outlier and on the correlations between variables. FIG. 5 is a diagram showing an example of a variable influence graph. The vertical axis represents the factor loading, and the horizontal axis represents the variables. The solid line connects the factor loadings of the variables. The broken line connects the per-variable values. In the example shown in FIG. 5, sample B contributes strongly to the fourth principal component PC4, and the influence of variable W is particularly large.
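The description does not give a formula for the per-variable values of a principal component score. One plausible reading, shown here purely as an assumption, is that a sample's score on a component is split into the terms contributed by each variable, since the score is a weighted sum of the centered variable values. The function name `variable_influence` is hypothetical.

```python
import numpy as np

def variable_influence(x, mean, component_vector):
    """Split one sample's (unnormalized) score on one principal component
    into per-variable terms; the terms sum to the principal component score."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    component_vector = np.asarray(component_vector, dtype=float)
    return (x - mean) * component_vector

# Hypothetical two-variable example: only the first variable deviates from the mean.
terms = variable_influence([3.0, 1.0], [1.0, 1.0], [0.6, 0.8])
```

Plotting these terms against the variables, alongside the factor loadings of the same component, would give a graph of the kind described for FIG. 5 under this reading.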

FIG. 6 is a flowchart showing the flow of data monitoring processing in the data monitoring apparatus according to the first embodiment.

First, principal component analysis is performed on the input multivariate data using, for example, the variance-covariance matrix (step S61).

Next, the principal component scores obtained by the principal component analysis are normalized (step S62).

Next, the outlier degree Dj is calculated for each sample in the manner shown in FIG. 3 (step S63). The calculation of the outlier degree Dj is repeated for every sample.

Next, based on the calculated outlier degrees, an outlier candidate list such as that shown in FIG. 4 and a variable influence graph such as that shown in FIG. 5 are generated and displayed (step S64).

According to the first embodiment, abnormal values and outliers can be identified in multivariate data containing multiple interrelated variables, while taking the relationships between the variables into account. In addition, for the variables that are the main factors behind an identified abnormal value or outlier, the apparatus provides information that is useful for analyzing the cause and for taking corrective action.

(Second Embodiment)
Next, a second embodiment is described. The second embodiment makes it possible, in data analysis and monitoring, to set and change a “weight” that indicates the relative importance of a variable or sample of interest.

FIG. 7 is a diagram for explaining sample weights. As shown in FIG. 7, the larger the weight value of a sample, the greater the influence of that sample within the population. A weight of 0 means that the sample is ignored. Weighting a sample of interest shifts the overall centroid, so that outliers relative to the sample of interest can be detected. For example, when each sample is the data of one project and an importance level (rank: A, B, C, ...) is defined for each project, it is suitable to set the weights according to the importance level.

FIG. 8 is a diagram for explaining variable weights. As shown in FIG. 8, the larger the weight value of a variable, the greater the influence of differences in that variable. A weight of 0 means that the variable is ignored. Weighting a variable of interest stretches or shrinks the spread of the samples in the direction of that variable, so that outliers can be detected with the influence of that variable taken into account. For example, it is suitable to monitor the data with a focus on important variables (source code size, defect density, and so on) while ignoring unnecessary variables (such as those largely covered by other variables).

FIG. 9 is a block diagram showing a schematic configuration of a data monitoring apparatus according to the second embodiment. As shown in FIG. 9, the data monitoring apparatus 200 according to the second embodiment has a weight setting unit 14 and a storage unit 15 in addition to the configuration of the first embodiment.

The weight setting unit 14 sets weights for the samples and the variables. In the second embodiment, the data is corrected by adjusting the weights of the samples and variables. The weights are applied by normalizing the multivariate data according to the set weights, using the following procedure.

(1) Exclude samples and variables whose weight is 0.
(2) For each variable j, obtain the weighted mean Ave_j and the weighted standard deviation Dev_j.
Let Ws_i be the weight of sample i and V_i_j be the value of sample i for variable j. Then:
Ave_j = Σ(Ws_i * V_i_j) / Σ(Ws_i)
Dev_j = √( Σ(Ws_i * (V_i_j − Ave_j)^2) / (Σ(Ws_i) − 1) )
(3) Normalize each variable to mean 0 and standard deviation Wv_j (the weight of variable j).
V_i_j ← (V_i_j − Ave_j) / Dev_j * Wv_j

The storage unit 15 is a storage device that stores the weighted multivariate data. For example, a hard disk device or a solid state drive (SSD) device using flash memory, which is a semiconductor memory, is suitable. FIG. 10 is a diagram showing an example listing of the weighted multivariate data stored in the storage unit 15. As shown in FIG. 10, multivariate data with n samples and m variables is stored in the storage unit 15 as an n × m matrix. Initially, a weight of 1 is set as the initial value for every sample and variable.
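Steps (1) to (3) of the weight adjustment described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function name `apply_weights` and the returned boolean masks are hypothetical conveniences, and the square root is taken over the whole quotient in Dev_j, which is the reading assumed here.

```python
import numpy as np

def apply_weights(V, sample_weights, variable_weights):
    """Weighted normalization of an (n_samples, m_variables) array.

    Returns the normalized array plus masks of the samples and variables kept.
    """
    V = np.asarray(V, dtype=float)
    Ws = np.asarray(sample_weights, dtype=float)
    Wv = np.asarray(variable_weights, dtype=float)

    # (1) Exclude samples and variables whose weight is 0.
    keep_samples = Ws != 0
    keep_variables = Wv != 0
    V = V[keep_samples][:, keep_variables]
    Ws = Ws[keep_samples]
    Wv = Wv[keep_variables]

    # (2) Weighted mean and weighted standard deviation of each variable.
    ave = (Ws[:, None] * V).sum(axis=0) / Ws.sum()
    dev = np.sqrt((Ws[:, None] * (V - ave) ** 2).sum(axis=0) / (Ws.sum() - 1.0))

    # (3) Normalize each variable to mean 0 and standard deviation Wv_j.
    return (V - ave) / dev * Wv, keep_samples, keep_variables
```

The sketch assumes that the remaining sample weights sum to more than 1 and that no remaining variable is constant; otherwise Dev_j would be zero or undefined.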

FIG. 11 is a flowchart showing the flow of data monitoring processing in the data monitoring apparatus according to the second embodiment.

First, the weights are set (step S111). FIG. 12 is a diagram showing an example of weight setting. In the example shown in FIG. 12, the weights of sample B and sample H, which had large outlier degrees on the lower-order principal components (PC4, PC5) in the calculation example of FIG. 3, are each changed from the initial value of 1 to 0.

Next, principal component analysis is performed on the input multivariate data using, for example, the variance-covariance matrix (step S112).

Next, the principal component scores obtained by the principal component analysis are normalized (step S113).

Next, the outlier degree Dj is calculated for each sample (step S114). The calculation of the outlier degree Dj is repeated for every sample.

Next, the outlier candidates are ranked for each sample based on the calculated outlier degree Dj (step S115). FIG. 13 is a diagram showing an example of the outlier degree calculation after this weight setting. Reflecting the weights set for sample B and sample H in FIG. 12, all principal component scores of sample B and sample H are 0. After the weight change, the cumulative contribution ratio of the first to third principal components (PC1 to PC3) has risen. Furthermore, samples C, E, and K, which have large outlier degrees on the first to third principal components (PC1 to PC3), keep their high outlier candidate ranks, so the overall tendency is maintained.

FIG. 14 is a diagram showing an example of a weight change. In the example shown in FIG. 14, the weights of sample E, which ranked first among the outlier candidates, and sample C, which ranked second, are changed from 1 to 5. FIG. 15 is a diagram showing an example of the outlier degree calculation after the weight change. As shown in FIG. 15, the principal components corresponding to the outlier candidates are almost the same as before the sample weights were changed, which shows that the structure is preserved. Samples C and E, whose weights were changed, now have smaller outlier degrees and lie at the center of the structure. Because their outlier degrees have decreased, their outlier candidate ranks have also dropped. As a result of these changes, sample F now has the largest outlier degree and ranks first among the outlier candidates, making it an outlier relative to the samples of interest (C, E).

Next, based on the calculated outlier degrees, an outlier candidate list such as that shown in FIG. 4 and a variable influence graph such as that shown in FIG. 5 are generated and displayed (step S116).

According to the second embodiment, changing the weights of the samples and variables of interest allows the user's experience and intentions to be reflected in data analysis and monitoring, for example “this sample is clearly an abnormal value and should be excluded” or “this variable is important and needs to be monitored closely”.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are included in the invention described in the claims and its equivalents.

100, 200 ... data monitoring apparatus
11 ... input unit
12 ... outlier candidate detection unit
13 ... result display unit
14 ... weight setting unit
15 ... storage unit

Claims

A data monitoring apparatus comprising:
an outlier candidate detection unit that performs principal component analysis on multivariate data and detects candidates for outliers that deviate greatly from the data population; and
a result display unit that generates and displays an outlier candidate list, in which samples, each being a set of the variables, and principal components obtained by the principal component analysis are ranked in correspondence with the outlier candidates, and a variable influence graph that shows, for the principal component corresponding to an outlier candidate, the per-variable values of the principal component score and the factor loadings.

The data monitoring apparatus according to claim 1, wherein, in order to detect the outlier candidates, the outlier candidate detection unit calculates an outlier degree expressed as the maximum absolute value among the principal component scores obtained by normalizing the principal component scores produced by the principal component analysis.

The data monitoring apparatus according to claim 1, further comprising an input unit that inputs the multivariate data, wherein the input multivariate data is sent to the outlier candidate detection unit.

The data monitoring apparatus according to claim 2, wherein the outlier degree is calculated for each sample.

The data monitoring apparatus according to claim 4, wherein a sample or principal component whose outlier degree is equal to or greater than a predetermined value is treated as an outlier candidate.

The data monitoring apparatus according to claim 1, wherein the factor loading is a correlation coefficient between a principal component and an original variable.

The data monitoring apparatus according to claim 1, further comprising:
a weight setting unit that sets weights for the samples and the variables; and
a storage unit that stores the weighted multivariate data,
wherein the weights of the samples or the variables are adjusted.

The data monitoring apparatus according to claim 7, wherein initial values of the weights are set for the samples and the variables, and, for a sample having a large outlier degree expressed as the maximum absolute value among the principal component scores obtained by normalizing the principal component scores produced by the principal component analysis, the weight is changed to a value smaller than the initial value.

The data monitoring apparatus according to claim 7 or 8, wherein an initial value is set for the sample and the variable, and the weight is changed to a value larger than the initial value for a sample having a higher outlier candidate rank.

The data monitoring apparatus according to any one of claims 1 to 9, wherein the outlier candidate list also reflects a cumulative contribution ratio obtained by accumulating contribution ratios, each indicating how much each of the variables is explained by each principal component.