JP7633563B2 - Feature extraction device, feature extraction method, and feature extraction program

Info

Publication number: JP7633563B2
Application number: JP2023539450A
Authority: JP
Inventors: 太三山本; 愛角田; 高明森谷; 学西尾; 優三好
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc USA
Current assignee: NTT Inc; NTT Inc USA
Priority date: 2021-08-04
Filing date: 2021-08-04
Publication date: 2025-02-20
Anticipated expiration: 2041-08-04
Also published as: WO2023012933A1; JPWO2023012933A1

Description

The present invention relates to a feature extraction device, a feature extraction method, and a feature extraction program.

The device disclosed in Patent Document 1 is known as a data analysis device for analyzing time series data. Patent Document 1 describes calculating an index value indicating the amount of change over time in each piece of data to be analyzed, and displaying graphs of multiple pieces of time series data side by side in an order based on the index value. This makes it possible to focus, for example, on data that changes significantly in a specific period, thereby supporting data analysis.

Patent Document 1: Japanese Patent No. 6592411

However, Patent Document 1 does not address how to determine, when multiple analysis methods are available for analyzing time series data, which of those methods is suitable for the data.

As a result, it has been difficult to select an appropriate analysis method when analyzing the target time series data.

The present invention has been made in view of the above circumstances, and its object is to provide a feature extraction device, a feature extraction method, and a feature extraction program capable of extracting features of analysis methods for analyzing time series data.

A feature extraction device according to one aspect of the present invention includes: a combination unit that generates a plurality of data pairs, each combining two time series data; an analysis unit that analyzes, using a plurality of analysis methods, the similarity between the two time series data included in each data pair; an occurrence probability calculation unit that calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analysis results of the analysis unit; a deviation calculation unit that calculates, for each data pair, the deviation of the occurrence probability calculated for each analysis method; a visualization unit that visualizes each time series data included in a data pair together with the deviation and presents them to a user; an input unit that accepts a judgment input of similar or dissimilar from the user; and a feature extraction unit that extracts features of the analysis methods based on the judgment input and the deviation.

A feature extraction method according to one aspect of the present invention includes the steps of: generating a plurality of data pairs, each combining two time series data; analyzing, using a plurality of analysis methods, the similarity between the two time series data included in each data pair; calculating, for each analysis method, the occurrence probability of the similarity of each data pair based on the analyzed similarity; calculating, for each data pair, the deviation of the occurrence probability calculated for each analysis method; visualizing each time series data included in a data pair together with the deviation and presenting them to a user; accepting a judgment input of similar or dissimilar from the user; and extracting features of the analysis methods based on the judgment input and the deviation.

One aspect of the present invention is a feature extraction program for causing a computer to function as the above feature extraction device.

According to the present invention, it becomes possible to extract features of analysis methods for analyzing time series data.

FIG. 1 is a block diagram showing the configuration of a feature extraction device according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of time series data and of data pairs each combining two time series data.
FIG. 3 is an explanatory diagram showing the first to fourth analysis methods.
FIG. 4A is an explanatory diagram showing analysis values calculated for a plurality of data pairs by the first to fourth analysis methods.
FIG. 4B is an explanatory diagram showing distribution curves of the analysis values shown in FIG. 4A, where (a) shows the distribution curve for the first analysis method, (b) for the second analysis method, (c) for the third analysis method, and (d) for the fourth analysis method.
FIG. 5 is an explanatory diagram showing a normalized curve obtained by normalizing the distribution curve s1 shown in FIG. 4B.
FIG. 6 is a diagram showing a plurality of data pairs, the occurrence probability of the similarity obtained by analyzing each data pair with the four analysis methods, and the deviation of the occurrence probability.
FIG. 7A is a diagram showing two time series data constituting a data pair and the occurrence probabilities calculated by the four analysis methods.
FIG. 7B is a diagram showing two time series data constituting another data pair and the occurrence probabilities calculated by the four analysis methods.
FIG. 8 is a flowchart showing the processing procedure of the feature extraction device according to the first embodiment.
FIG. 9 is an explanatory diagram showing characteristic patterns of each piece of time series data acquired by the recording unit 18.
FIG. 10 is a block diagram showing the configuration of a feature extraction device according to the second embodiment.
FIG. 11A is an explanatory diagram showing the analysis results of a plurality of data pairs and normalized curves of the analysis results.
FIG. 11B is an explanatory diagram showing the items of the time series data constituting the data pairs and the deviation of the occurrence probability of each data pair.
FIG. 12 is a block diagram showing the hardware configuration of the present embodiment.

Embodiments of the present invention are described below with reference to the drawings.

[First embodiment]
FIG. 1 is a block diagram showing the configuration of a feature extraction device according to the first embodiment.

As shown in FIG. 1, the feature extraction device 1 according to the first embodiment is connected to a database 2 (denoted as "DB" in the figure). The feature extraction device 1 includes a combination unit 11, a data analysis unit 12 (analysis unit), an occurrence probability calculation unit 13, a deviation calculation unit 14, a visualization unit 15, an input unit 16, a feature extraction unit 17, and a recording unit 18.

The database 2 stores multiple (m) pieces of time series data qi (i = 1 to m). The time series data qi is, for example, the consumer price index provided by the Statistics Bureau of the Ministry of Internal Affairs and Communications.

The combination unit 11 generates data pairs, each combining two time series data. Specifically, the combination unit 11 selects two of the time series data qi stored in the database 2 and sets them as a data pair aj (j = 1 to n). FIG. 2 is an explanatory diagram showing an example of setting data pairs aj by combining two time series data. As shown in FIG. 2, the time series data q1 and q2 are combined to generate a data pair a1. The time series data q2 and q3 are combined to generate a data pair a2. The time series data q1 and q4 are combined to generate a data pair a3. The time series data q3 and q4 are combined to generate a data pair a4.

When there are m pieces of time series data, m*(m-1)/2 data pairs are set; that is, n = m*(m-1)/2. For example, 380 pieces of time series data yield (380*379)/2 = 72,010 data pairs.
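The pair count above can be checked with a minimal Python sketch. The function name `make_pairs` and the use of integer identifiers for q1 to q380 are illustrative assumptions, not part of the disclosure.

```python
from itertools import combinations

def make_pairs(series_ids):
    """Return every unordered pair of distinct series identifiers."""
    return list(combinations(series_ids, 2))

# Hypothetical identifiers 1..380 standing in for q1..q380.
pairs = make_pairs(range(1, 381))
print(len(pairs))  # 72010, i.e. 380 * 379 / 2
```

Enumerating unordered pairs in this way avoids generating both (q1, q2) and (q2, q1), which matches the m*(m-1)/2 count.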

The data analysis unit 12 analyzes the similarity of the two time series data included in each data pair using a plurality of analysis methods. Specifically, the data analysis unit 12 includes calculation programs for a plurality of analysis methods for analyzing the data pairs aj set by the combination unit 11. The data analysis unit 12 includes a first analysis unit 21 that analyzes the data pairs using a first analysis method, a second analysis unit 22 that analyzes the data pairs using a second analysis method, a third analysis unit 23 that analyzes the data pairs using a third analysis method, and a fourth analysis unit 24 that analyzes the data pairs using a fourth analysis method.

The data analysis unit 12 analyzes the similarity between the two time series data constituting each data pair aj using the first to fourth analysis methods, and outputs the result as an analysis value. Although this embodiment shows an example using four analysis methods, a number of analysis methods other than four may be used. The specific processing of the first to fourth analysis methods is described below with reference to FIG. 3.

As shown in FIG. 3(a), the first analysis method accumulates the absolute values of the differences between the two time series data. Specifically, it calculates the absolute value of the difference between one time series data and the other at each predetermined time interval, and accumulates these absolute values over a certain period. The accumulated value is output as the analysis value. The higher the similarity between the two time series data, the smaller the analysis value.
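As a rough illustration of the first analysis method, the following Python sketch sums the absolute pointwise differences. The function name `method1` and the assumption that both series are sampled at the same time points are illustrative choices only.

```python
def method1(x, y):
    """Accumulate |x[t] - y[t]| over the period (smaller means more similar)."""
    if len(x) != len(y):
        raise ValueError("both series must cover the same time points")
    return sum(abs(a - b) for a, b in zip(x, y))

print(method1([1, 2, 3], [1, 4, 2]))  # 3  (0 + 2 + 1)
```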

As shown in FIG. 3(b), the second analysis method calculates the amount of change over time for each time series data and accumulates the differences between the calculated amounts of change. Specifically, it calculates, at each predetermined time interval, the difference between the amount of change of one time series data and that of the other, and accumulates these differences over a certain period. For example, if the amount of change of one is "+1" and that of the other is "-1", the difference is "2". If the amount of change of one is "+1" and that of the other is also "+1", the difference is "0". If the amount of change of one is "-2" and that of the other is "+1", the difference is "3". As these examples show, each difference is taken as a non-negative magnitude. The second analysis method accumulates these differences and outputs the accumulated value as the analysis value. The higher the similarity between the two time series data, the smaller the analysis value.
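The worked differences above ("2", "0", "3") can be reproduced with a short Python sketch. The helper names `changes` and `method2`, and the sample series, are assumptions chosen so that the step-to-step changes match the examples in the text.

```python
def changes(x):
    """Amount of change between consecutive samples."""
    return [b - a for a, b in zip(x, x[1:])]

def method2(x, y):
    """Accumulate the absolute differences between the two series' changes."""
    return sum(abs(dx - dy) for dx, dy in zip(changes(x), changes(y)))

# Changes of x: +1, +1, -2; changes of y: -1, +1, +1.
x = [0, 1, 2, 0]
y = [0, -1, 0, 1]
print(method2(x, y))  # 5  (2 + 0 + 3)
```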

As shown in FIG. 3(c), the third analysis method calculates the rate of change over time for each time series data and accumulates the differences between the calculated rates of change. Specifically, it calculates, at each predetermined time interval, the difference between the rate of change of one time series data and that of the other, and accumulates these differences over a certain period. For example, if the rate of change of one time series data is "+3%" and that of the other is "-1%", the difference is "4". If the rate of change of one is "+1%" and that of the other is also "+1%", the difference is "0". The third analysis method accumulates these differences over a certain period and outputs the accumulated value as the analysis value. The higher the similarity between the two time series data, the smaller the analysis value.
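A minimal Python sketch of the third analysis method follows. It assumes the rate of change is the percentage change relative to the previous sample, which the text does not spell out; the names `rates` and `method3` are likewise illustrative.

```python
def rates(x):
    """Percentage change between consecutive samples (previous value must be nonzero)."""
    return [100.0 * (b - a) / a for a, b in zip(x, x[1:])]

def method3(x, y):
    """Accumulate the absolute differences between the two series' rates of change."""
    return sum(abs(rx - ry) for rx, ry in zip(rates(x), rates(y)))

print(method3([100, 103], [100, 99]))   # 4.0  (+3% versus -1%)
print(method3([100, 101], [200, 202]))  # 0.0  (+1% versus +1%)
```

Unlike the second method, this comparison is insensitive to the absolute level of each series, as the second call illustrates.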

As shown in FIG. 3(d), the fourth analysis method calculates, for each time series data, the average value over each predetermined time interval. Then, as in the third analysis method described above, it accumulates the differences between the average values over a certain period and outputs the accumulated value as the analysis value. The higher the similarity between the two time series data, the smaller the analysis value.
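The fourth analysis method might be sketched in Python as below. The text does not state how the intervals are formed or whether the differences are absolute, so this sketch assumes non-overlapping windows of a fixed width and absolute differences; `interval_means`, `method4`, and `width` are hypothetical names.

```python
def interval_means(x, width):
    """Mean of each non-overlapping window of `width` samples (last window may be shorter)."""
    return [sum(x[i:i + width]) / len(x[i:i + width])
            for i in range(0, len(x), width)]

def method4(x, y, width):
    """Accumulate the absolute differences between the two series' interval means."""
    return sum(abs(mx - my)
               for mx, my in zip(interval_means(x, width), interval_means(y, width)))

# Interval means: x -> 2.0, 6.0; y -> 2.0, 4.0.
print(method4([1, 3, 5, 7], [2, 2, 4, 4], width=2))  # 2.0
```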

The first analysis unit 21 to the fourth analysis unit 24 each create a distribution curve by plotting the analysis values of the data pairs aj calculated by the first to fourth analysis methods. The processing of the data analysis unit 12 is described in detail below with reference to FIGS. 4A and 4B.

FIG. 4A is an explanatory diagram showing analysis values calculated for a plurality of data pairs by the first to fourth analysis methods. The data analysis unit 12 calculates analysis values "bk-j" by the first to fourth analysis methods for each data pair aj (j = 1 to n) formed from the time series data stored in the database 2 shown in FIG. 1. Here, "k" indicates the number of the analysis method and "j" indicates the number of the data pair. That is, "k" is an integer from 1 to 4, and "j" is an integer from 1 to n.

For example, for the data pair a1 consisting of the two time series data "vegetables/seaweed" and "sushi (eating out)", the analysis value calculated using the first analysis method is denoted "b1-1". In FIG. 4A, the analysis value "b1-1" is "0.05". The analysis value calculated for the data pair a1 using the second analysis method is denoted "b2-1". In FIG. 4A, the analysis value "b2-1" is "0.21".

Similarly, the analysis value calculated for the data pair a2 using the third analysis method is denoted "b3-2". In FIG. 4A, the analysis value "b3-2" is "0.33". The analysis value calculated for the data pair a3 using the fourth analysis method is denoted "b4-3". In FIG. 4A, the analysis value "b4-3" is "0.64". Each analysis value "bk-j" is calculated in the same way.

The data analysis unit 12 generates distribution curves of the analysis values of the data pairs aj calculated by the first to fourth analysis methods. Specifically, for the data pairs aj (j = 1 to n), it generates distribution curves s1 to s4 of the analysis values "bk-1 to bk-n" (k = 1 to 4) calculated using the first to fourth analysis methods. For example, as shown in FIGS. 4B(a) to (d), it generates the distribution curves s1 to s4 of "b1-j", "b2-j", "b3-j", and "b4-j" (where j = 1 to n). FIGS. 4B(a) to (d) are graphs plotting the analysis values obtained by the first to fourth analysis methods, with the horizontal axis representing the analysis value and the vertical axis representing the frequency.

FIG. 4B(a) plots the analysis values b1-j obtained by analyzing the n data pairs a1 to an using the first analysis method, and the curve fitted along these values is the distribution curve s1.

FIG. 4B(b) plots the analysis values b2-j obtained by analyzing the n data pairs a1 to an using the second analysis method, and the curve fitted along these values is the distribution curve s2.

FIG. 4B(c) plots the analysis values b3-j obtained by analyzing the n data pairs a1 to an using the third analysis method, and the curve fitted along these values is the distribution curve s3.

FIG. 4B(d) plots the analysis values b4-j obtained by analyzing the n data pairs a1 to an using the fourth analysis method, and the curve fitted along these values is the distribution curve s4.

Returning to FIG. 1, the occurrence probability calculation unit 13 normalizes the distribution curves s1 to s4 created by the plurality of analysis methods. The distribution curves s1 to s4 shown in FIGS. 4B(a) to (d) cannot be compared with each other directly, so each of them is normalized. For example, normalizing the distribution curve s1 yields the normalized curve s11 shown in FIG. 5. In other words, the occurrence probability calculation unit 13 normalizes the analysis results of the data analysis unit 12 to calculate the occurrence probability.

The occurrence probability calculation unit 13 calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analysis results of the data analysis unit 12. Specifically, for each data pair aj (j = 1 to n), it calculates the occurrence probability of the analysis value (that is, the similarity) obtained by each of the first to fourth analysis methods. The occurrence probability is an index that expresses the rank of the analysis value in the range "0 to 1". The closer the occurrence probability is to "0", the higher the similarity of the two time series data. The closer it is to "1", the lower the similarity. The occurrence probability of the analysis value "bk-j" of the data pair aj by the kth analysis method is denoted "pk-j". For instance, the occurrence probability of the analysis value "b1-1" of the data pair a1 by the first analysis method is "p1-1". If the analysis value "b1-1" belongs to the top 30%, the occurrence probability "p1-1" is "0.3".
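One way to picture the occurrence probability is as an empirical rank fraction, sketched below in Python. This is an assumption: the embodiment derives the probability from a normalized distribution curve, whereas this sketch simply ranks the analysis values of one method in ascending order (smaller value, higher similarity). The function name and sample values are illustrative.

```python
from bisect import bisect_right

def occurrence_probabilities(values):
    """Map each analysis value to the fraction of values that are <= it.

    A result near 0 means the pair ranks among the most similar pairs
    for this analysis method; a result of 1.0 marks the least similar pair.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]

# Hypothetical analysis values b1-1 .. b1-4 from a single analysis method.
print(occurrence_probabilities([0.3, 0.1, 0.2, 0.4]))  # [0.75, 0.25, 0.5, 1.0]
```

Because the result is a rank on a common 0 to 1 scale, values from analysis methods with very different raw ranges become comparable.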

In the normalized curve s11 shown in FIG. 5, the higher the analysis value of a data pair aj ranks (that is, the closer its occurrence probability is to "0"), the higher the similarity between the two time series data included in the data pair aj.

The deviation calculation unit 14 calculates, for each data pair aj, the deviation of the occurrence probability calculated for each analysis method. Specifically, for the target data pair aj, the deviation calculation unit 14 calculates the deviation of the occurrence probability for each of the four analysis methods. The deviation for the data pair aj by the kth analysis method is denoted "dk-j". For example, the deviation for the data pair a1 by the first analysis method is "d1-1".

The deviation dk-j is defined by the following equations (1) to (4) using the occurrence probabilities "p1-j" to "p4-j".

d1-j = {(p2-j) + (p3-j) + (p4-j)}/3 - (p1-j) ... (1)
d2-j = {(p1-j) + (p3-j) + (p4-j)}/3 - (p2-j) ... (2)
d3-j = {(p1-j) + (p2-j) + (p4-j)}/3 - (p3-j) ... (3)
d4-j = {(p1-j) + (p2-j) + (p3-j)}/3 - (p4-j) ... (4)
As can be understood from equations (1) to (4), the deviation of the occurrence probability pk-j by the kth analysis method is a numerical value indicating the difference between the average of the occurrence probabilities by the three analysis methods other than the kth and the occurrence probability by the kth analysis method. Therefore, the greater the difference between the occurrence probability calculated by one analysis method and the average of the occurrence probabilities calculated by the other three, the greater the magnitude of the deviation for that analysis method.
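Equations (1) to (4) can be evaluated with a few lines of Python. The function name `deviations` and the sample probabilities are illustrative assumptions; the arithmetic follows the equations as written (mean of the other methods minus the method's own probability).

```python
def deviations(p):
    """Given p = [p1-j, ..., pK-j] for one data pair, return [d1-j, ..., dK-j].

    Each deviation is the mean of the other methods' occurrence
    probabilities minus the method's own occurrence probability.
    """
    total = sum(p)
    others = len(p) - 1
    return [(total - pk) / others - pk for pk in p]

# Hypothetical case: method 1 ranks the pair as similar, the others do not.
d = deviations([0.1, 0.7, 0.7, 0.7])
print([round(v, 3) for v in d])  # [0.6, -0.2, -0.2, -0.2]
```

A large positive deviation therefore marks a method that judges the pair to be much more similar than the other methods do, and a large negative deviation marks the opposite.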

For each analysis method, the deviation calculation unit 14 sorts the data pairs aj in descending order of the absolute value of the deviation dk-j calculated by equations (1) to (4). FIG. 6 is an explanatory diagram showing data in which the deviations d1-j of the occurrence probabilities p1-j calculated using the first analysis method are sorted in descending order. For example, the data pairs (combinations of item 1 and item 2) are listed in the order of the pair of time series data for "vegetables/seaweed" and "sushi (eating out)", the pair of time series data for "instant noodles" and "fresh foods", and so on.

The visualization unit 15 visualizes each time series data included in a data pair aj together with the deviation and presents them to the user. Specifically, the visualization unit 15 has a display unit (not shown) such as a display, and displays on it graphs of the data pairs whose deviation dk-j calculated by the deviation calculation unit 14 is greater than a certain value (for example, greater than 0.6). For example, the graphs and occurrence probability data shown in FIGS. 7A and 7B are displayed on the display unit. That is, the visualization unit 15 visualizes only a predetermined number of analysis results with large deviations calculated by the deviation calculation unit 14.
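The sorting and display selection described above could be sketched as follows. The deviation values are made up for illustration, the function name is hypothetical, and applying the 0.6 threshold to the absolute value (rather than the signed value) is an assumption.

```python
def rank_for_display(deviations_by_pair, threshold=0.6):
    """Sort pairs by |deviation|, largest first, and keep those above the threshold."""
    ranked = sorted(deviations_by_pair.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return [(pair, d) for pair, d in ranked if abs(d) > threshold]

# Hypothetical deviations d1-j for three data pairs.
d1 = {
    ("instant noodles", "fresh foods"): -0.71,
    ("rice", "bread"): 0.10,
    ("vegetables/seaweed", "sushi (eating out)"): 0.82,
}
for pair, d in rank_for_display(d1):
    print(pair, d)
```

Only the pairs on which one method disagrees strongly with the others reach the user, which keeps the number of manual judgments small.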

FIG. 7A(a) shows a graph of the data pair a11 consisting of time series data q11 (for example, household durable goods) and time series data q12 (for example, furniture and household goods), and FIG. 7A(b) shows the occurrence probabilities of the data pair a11 calculated using the first to fourth analysis methods. FIG. 7B(a) shows a graph of the data pair a12 consisting of time series data q13 (for example, vegetables/seaweed) and time series data q14 (for example, sushi (eating out)), and FIG. 7B(b) shows the occurrence probabilities of the data pair a12 calculated using the first to fourth analysis methods.

The visualization unit 15 displays the data shown in FIGS. 7A and 7B on the display unit. The user can recognize the displayed information by looking at the display unit.

The input unit 16 accepts a judgment input of similar or dissimilar from the user. Specifically, the input unit 16 includes an operating device such as a keyboard, and accepts a judgment of similar or dissimilar for the information displayed by the visualization unit 15. For example, since the time series data q11 and q12 diverge from each other as shown in FIG. 7A, the user inputs a judgment of dissimilar. On the other hand, since the time series data q13 and q14 are close to each other as shown in FIG. 7B, the user inputs a judgment of similar.

The feature extraction unit 17 extracts the features of the analysis methods based on the judgment input and the deviation. Specifically, the feature extraction unit 17 extracts the features of each analysis method based on the judgment input received by the input unit 16. For example, the graphs of the time series data q11 and q12 of the data pair a11 shown in FIG. 7A(a) diverge, so their similarity is low. The occurrence probability of the data pair a11 should therefore be a large value. As shown in FIG. 7A(b), however, the occurrence probability calculated by the third analysis method is a small value. The feature extraction unit 17 thus extracts the feature that the third analysis method is not suitable for analyzing the time series data q11 and q12. In other words, the feature extraction unit 17 extracts, as a feature of an analysis method, the time series data that are not suitable for analysis by that method.

Likewise, the graphs of the time series data q13 and q14 of the data pair a12 shown in FIG. 7B(a) are close to each other, so their similarity is high. The occurrence probability of the data pair a12 should therefore be a small value. As shown in FIG. 7B(b), however, the occurrence probabilities calculated by the second, third, and fourth analysis methods are large values. The feature extraction unit 17 thus extracts the feature that the second, third, and fourth analysis methods are not suitable for analyzing the time series data q13 and q14. The feature extraction unit 17 includes a storage device (not shown) and stores the extracted features in it.
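The reasoning in the two examples above can be summarized in a small Python sketch. The text does not specify how "small" and "large" occurrence probabilities are separated, so the 0.5 cutoff, the function name, and the probability values are all assumptions made for illustration.

```python
def unsuitable_methods(probabilities, judged_similar, cutoff=0.5):
    """Return the methods whose occurrence probability contradicts the user's judgment.

    probabilities: {method number: occurrence probability for one data pair}
    judged_similar: True if the user judged the pair similar, False if dissimilar
    cutoff: assumed boundary below which a method is taken to say "similar"
    """
    flagged = []
    for method, p in sorted(probabilities.items()):
        method_says_similar = p < cutoff
        if method_says_similar != judged_similar:
            flagged.append(method)
    return flagged

# Hypothetical values in the spirit of FIG. 7A (judged dissimilar) and FIG. 7B (judged similar).
print(unsuitable_methods({1: 0.9, 2: 0.8, 3: 0.1, 4: 0.85}, judged_similar=False))  # [3]
print(unsuitable_methods({1: 0.1, 2: 0.8, 3: 0.9, 4: 0.7}, judged_similar=True))    # [2, 3, 4]
```

Recording the flagged methods together with the pair's time series gives, for each analysis method, a list of data it handles poorly.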

The recording unit 18 records characteristic data of the time series data. For example, "vegetables/seaweed" is known in advance to be affected by seasonal changes, so this characteristic is recorded. Similarly, "automobile license fees" are known in advance to change in a stepwise manner, so this characteristic is recorded. The visualization unit 15 described above may visualize the characteristics of the time series data in addition to the time series data constituting the data pair and the deviation.

次に、図８に示すフローチャートを参照して第１実施形態に係る特徴抽出装置１の動作について説明する。初めに、図８のステップＳ１１において、組み合わせ部１１は、データベース２に記憶されている複数の時系列データｑｉ（ｉ＝１～ｍ）を組み合わせることにより、データ対ａｊを生成する。時系列データｑｉがｍ個の場合には、「ｍ＊（ｍ－１）／２」個のデータ対が生成される。Next, the operation of the feature extraction device 1 according to the first embodiment will be described with reference to the flowchart shown in Fig. 8. First, in step S11 of Fig. 8, the combination unit 11 generates data pairs aj by combining multiple time series data qi (i = 1 to m) stored in the database 2. When the number of time series data qi is m, "m * (m - 1) / 2" data pairs are generated.

In step S12, the data analysis unit 12 analyzes each data pair aj using multiple analysis methods and calculates analysis values. Specifically, the first analysis unit 21 calculates an analysis value for each data pair aj using the first analysis method, the second analysis unit 22 does so using the second analysis method, the third analysis unit 23 does so using the third analysis method, and the fourth analysis unit 24 does so using the fourth analysis method.

Furthermore, the data analysis unit 12 generates a distribution curve of the analysis values calculated by each analysis method. Specifically, as shown in FIG. 4B(a) to (d), it generates a distribution curve s1 of the analysis values calculated by the first analysis method, a distribution curve s2 for the second analysis method, a distribution curve s3 for the third analysis method, and a distribution curve s4 for the fourth analysis method.

In step S13, the occurrence probability calculation unit 13 generates normalized curves by normalizing each of the distribution curves s1 to s4. For example, it generates the normalized curve s11 shown in FIG. 5.

In step S14, the occurrence probability calculation unit 13 calculates the occurrence probability of each data pair aj based on the normalized curve s11. For example, as shown in FIG. 5, if the target data pair belongs to the top 30% of all data pairs, its occurrence probability is set to "0.3". If it belongs to the top 70%, its occurrence probability is set to "0.7".
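As a non-limiting illustration, the mapping from "belongs to the top 30%" to an occurrence probability of "0.3" can be sketched with a simple rank fraction. This is a sketch under stated assumptions, not the normalization of FIG. 5 itself: the function name `occurrence_probabilities` is hypothetical, the sample values are invented, and treating larger analysis values as ranking higher is an assumption.

```python
def occurrence_probabilities(analysis_values):
    """Map each analysis value to the fraction of values that are >= it.

    Assumption: "top" means larger analysis values; a pair in the
    top 30% of all pairs therefore receives 0.3.
    """
    n = len(analysis_values)
    return [
        sum(1 for other in analysis_values if other >= value) / n
        for value in analysis_values
    ]


# Invented analysis values for ten data pairs under one analysis method.
values = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 1.0]
probs = occurrence_probabilities(values)
```

In this sketch the pair with value 0.8 is one of the three largest of ten values and so is assigned 0.3, while the pair with value 0.4 lies within the top seven and is assigned 0.7.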

In step S15, the deviation calculation unit 14 calculates the deviation of each occurrence probability. Specifically, the deviation of the occurrence probability of each data pair aj is calculated for each analysis method using formulas (1) to (4) described above. The deviation calculation unit 14 then sorts the data pairs aj in descending order of deviation. As a result, as shown in FIG. 6 for example, data is obtained in which the deviations d1-j of the occurrence probabilities p1-j calculated using the first analysis method are arranged in descending order.

For example, for the data pair "vegetables/seaweed" and "sushi (eating out)", the occurrence probability calculated using the first analysis method is "0.0473", whereas the occurrence probabilities calculated using the second to fourth analysis methods are approximately "1.0000". The occurrence probability calculated by the first analysis method therefore differs greatly from those calculated by the other three analysis methods, and the deviation takes a high value of "0.926428".
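As a non-limiting illustration, the deviation calculation and descending sort of step S15 can be sketched as follows. Formulas (1) to (4) are not reproduced in this passage, so the definition used here (the absolute difference between one method's occurrence probability and the mean of the other methods' occurrence probabilities) is an assumption, as are the function name `deviation` and the sample probabilities.

```python
def deviation(probs, k):
    """Assumed deviation: |p_k - mean of the other methods' probabilities|."""
    others = [p for i, p in enumerate(probs) if i != k]
    return abs(probs[k] - sum(others) / len(others))


# Invented occurrence probabilities per data pair, one entry per
# analysis method (index 0 is the first analysis method).
pair_probs = {
    "a1": [0.05, 0.98, 0.97, 0.99],
    "a2": [0.50, 0.52, 0.49, 0.51],
}

# Sort data pairs in descending order of the first method's deviation.
ranked = sorted(
    pair_probs,
    key=lambda name: deviation(pair_probs[name], 0),
    reverse=True,
)
```

Under this assumed definition, a pair on which one method disagrees strongly with the rest rises to the top of the list, which is the behavior the sorted table of FIG. 6 is described as showing.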

In step S16, the visualization unit 15 displays, on a display unit (not shown), the graphs of the data pairs whose deviation d1-j is judged to be large (for example, 0.6 or more), together with their occurrence probability data. That is, it visualizes the graphs of the data pairs and the occurrence probability data. For example, the information shown in FIG. 7A and FIG. 7B is displayed on the screen.

By viewing this screen, the user judges the validity of the analysis result of each analysis method. For example, in the graph shown in FIG. 7A(a), the two pieces of time series data q11 and q12 are not similar. The occurrence probability is therefore expected to be large (close to "1"). In the data shown in FIG. 7A(b), the occurrence probabilities calculated by the first, second, and fourth analysis methods are close to "1", while the occurrence probability calculated by the third analysis method is "0.16", which deviates from those of the other three analysis methods. In this case, the analysis value obtained by the third analysis method is assumed to be inappropriate, and the analysis values obtained by the first, second, and fourth analysis methods are assumed to be appropriate.

On the other hand, in the graph shown in FIG. 7B(a), the two pieces of time series data q13 and q14 are similar. The occurrence probability is therefore expected to be small (close to "0"). In the data shown in FIG. 7B(b), the occurrence probabilities calculated by the second, third, and fourth analysis methods are close to "1", while the occurrence probability calculated by the first analysis method is "0.05", which deviates from those of the other three analysis methods. In this case, the analysis values obtained by the second, third, and fourth analysis methods are assumed to be inappropriate, and the analysis value obtained by the first analysis method is assumed to be appropriate.

Furthermore, the visualization unit 15 reads the characteristic data of each piece of time series data recorded in the recording unit 18 and displays it on the display unit. For example, if the data pair to be analyzed includes the "vegetables/seaweed" time series data, the characteristic data "affected by seasonal changes" is displayed. If the data pair to be analyzed includes the "automobile license fees" time series data, the characteristic data "amount changes in a stepwise manner" is displayed. The user can refer to this characteristic data when judging the analysis results.

In step S17, the input unit 16 accepts the user's similarity/dissimilarity judgment input. Referring to the visualized information, the user inputs a judgment of whether the analysis value obtained by each analysis method is appropriate. In the example shown in FIG. 7A, the user inputs via the input unit 16 the judgment that the analysis value of the third analysis method is inappropriate and that the analysis values of the first, second, and fourth analysis methods are appropriate. In the example shown in FIG. 7B, the user inputs the judgment that the analysis values of the second, third, and fourth analysis methods are inappropriate and that the analysis value of the first analysis method is appropriate.

In other words, a high deviation between the occurrence probability calculated by analyzing time series data with one analysis method and the occurrence probability calculated by analyzing the same time series data with another analysis method indicates that one of the two analysis methods is likely to be inappropriate for analyzing that time series data. Acquiring the user's judgment input makes it possible to recognize the features of each analysis method with high accuracy (for example, that the first analysis method is not suitable for analyzing the time series data a1).

In step S18, the feature extraction unit 17 calculates scores according to the appropriate/inappropriate judgments entered via the input unit 16. Specifically, an analysis method judged to be appropriate receives a score of "+1", and an analysis method judged to be inappropriate receives a score of "-1". In the example shown in FIG. 7A, the third analysis method receives "-1" and the first, second, and fourth analysis methods receive "+1". In the example shown in FIG. 7B, the second, third, and fourth analysis methods receive "-1" and the first analysis method receives "+1". The feature extraction unit 17 accumulates the scores for each of the first to fourth analysis methods. The score values are not limited to "+1" and "-1"; values such as "+2", "+1", "-1", and "-2" may be used according to the degree of appropriateness or inappropriateness.

The feature extraction unit 17 extracts the features of each analysis method based on the accumulated scores described above. For example, it extracts the feature that, among the four analysis methods, the one with the highest accumulated score is suitable for analyzing the target time series data. The feature extraction unit 17 records the extracted features in the storage device (not shown). Alternatively, it modifies features already recorded in the storage device based on the extracted features.
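As a non-limiting illustration, the score accumulation of step S18 can be sketched as follows, using the judgments described for FIG. 7A and FIG. 7B. The function name `accumulate_scores` and the dictionary layout (analysis method number mapped to an "appropriate" flag) are hypothetical choices made for this example.

```python
from collections import defaultdict


def accumulate_scores(judgments):
    """Add +1 for each 'appropriate' judgment and -1 for each 'inappropriate' one."""
    totals = defaultdict(int)
    for per_pair in judgments:
        for method, appropriate in per_pair.items():
            totals[method] += 1 if appropriate else -1
    return dict(totals)


# Keys are analysis method numbers; True means "judged appropriate".
judgments = [
    {1: True, 2: True, 3: False, 4: True},    # example of FIG. 7A
    {1: True, 2: False, 3: False, 4: False},  # example of FIG. 7B
]

totals = accumulate_scores(judgments)
# The method with the highest accumulated score is treated as most suitable.
best = max(totals, key=totals.get)
```

With only these two judgments, the first analysis method accumulates the highest score. Graded scores such as "+2" or "-2" could be supported by storing a numeric score in place of the boolean flag.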

In step S19, the data analysis unit 12 determines whether any of the first to fourth analysis methods requires modification. For example, as shown in FIG. 7A, the third analysis method has been judged unsuitable for analyzing the data pair a11, and in this case the data analysis unit 12 determines whether the third analysis method requires modification. If modification is determined to be necessary (S19; YES), the process proceeds to step S20; otherwise (S19; NO), the process ends.

In step S20, the data analysis unit 12 modifies the target analysis method or marks it as unsuitable. The process then ends. In this way, the features of the analysis methods that analyze the similarity of time series data can be extracted.

As described above, the feature extraction device 1 according to the first embodiment includes: a combination unit 11 that generates multiple data pairs, each combining two pieces of time series data; an analysis unit (data analysis unit 12) that analyzes, using multiple analysis methods, the similarity of the two pieces of time series data included in each data pair; an occurrence probability calculation unit 13 that calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analysis results of the analysis unit; a deviation calculation unit 14 that calculates, for each data pair, the deviation of the occurrence probabilities calculated for the respective analysis methods; a visualization unit 15 that visualizes each piece of time series data included in the data pair and the deviation and presents them to the user; an input unit 16 that accepts the user's similarity/dissimilarity judgment input; and a feature extraction unit 17 that extracts the features of the analysis methods based on the judgment input and the deviation.

The feature extraction device 1 configured as described above can extract features indicating which types of time series data an analysis method is suitable or unsuitable for. Consequently, when a user such as a data scientist analyzes time series data using a data analysis device, the user can be assisted in selecting an appropriate analysis method from the multiple analysis methods the user has in stock.

In addition, the visualization unit 15 visualizes only a predetermined number of analysis results whose deviation calculated by the deviation calculation unit 14 is large. For example, only analysis results with a deviation of 0.6 or more are visualized. Visualization of analysis results with a small deviation can thus be omitted. When the deviations of all four analysis methods are small, the analysis values of the four analysis methods are almost identical, and there is considered to be little need for user intervention. Limiting visualization to a predetermined number of analysis results with large deviations reduces the effort required of the user.
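As a non-limiting illustration, restricting visualization to results with large deviations can be sketched as follows. The function name `select_for_review`, the optional `limit` parameter, and the sample deviations other than "0.926428" are assumptions introduced for this example; the threshold of 0.6 follows the example given above.

```python
def select_for_review(deviations, threshold=0.6, limit=None):
    """Keep pairs whose deviation is at least `threshold`, largest first.

    `limit` optionally caps the result at a predetermined number of pairs.
    """
    picked = sorted(
        ((name, d) for name, d in deviations.items() if d >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    return picked if limit is None else picked[:limit]


# Invented deviations, except 0.926428, which is the value cited in the text.
deviations = {"a1": 0.926428, "a2": 0.61, "a3": 0.12}
to_show = select_for_review(deviations)
```

Here only "a1" and "a2" would be presented to the user; "a3", on which the analysis methods largely agree, is skipped.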

In addition, characteristic data of each piece of time series data that has been recognized in advance is recorded in the recording unit 18. By displaying this characteristic data on the display unit of the visualization unit 15, the user can refer to it when judging the suitability of each analysis method.

That is, as shown in FIG. 9, for the data pair a1, which includes the "vegetables/seaweed" time series data, the characteristic data "affected by seasonal fluctuations" is recorded in the recording unit 18. For the data pair a10, which includes the "automobile license fees" time series data, the characteristic data "prices change in a stepwise manner" is recorded in the recording unit 18. When analyzing the data pairs a1 and a10, the user can refer to this characteristic data to judge the features of each analysis method.

[Second embodiment]
Next, a second embodiment will be described. FIG. 10 is a block diagram showing the configuration of a feature extraction device 1a according to the second embodiment and its peripheral devices. The second embodiment differs from the first embodiment described above in that a selection unit 19 is provided. Components other than the selection unit 19 are therefore given the same reference numerals, and their description is omitted.

When, among the multiple data pairs, the occurrence probability of the similarity of one data pair approximates the occurrence probability of the similarity of another data pair, and the time series data included in the one data pair is identical or similar to the time series data included in the other data pair, the selection unit 19 selects the other data pair.

That is, the selection unit 19 selects, from among the data pairs generated by the combination unit 11, data pairs whose time series data are similar. The visualization unit 15 performs visualization excluding the occurrence probabilities of the data pairs selected by the selection unit 19.

FIG. 11A is a diagram showing a normalized distribution curve of the analysis results obtained by analyzing multiple data pairs using the first analysis method. FIG. 11B is a diagram showing the two pieces of time series data constituting each data pair and the deviation d1-j.

The data pairs x1, x2, and x3 shown in FIG. 11B all include the "university tuition fees" time series data. In addition, the positions at which the data pairs x1, x2, and x3 are plotted in FIG. 11A are close to one another. Two of these three data pairs are therefore considered redundant and unnecessary. The selection unit 19 excludes the data pairs x2 and x3 from the analysis targets.

The data pair x4 shown in FIG. 11B includes the "Chinese noodles" time series data, and the data pair x5 includes the "soba" time series data. In addition, the positions at which the data pairs x4 and x5 are plotted in FIG. 11A are close to each other. One of these two data pairs is therefore considered redundant and unnecessary. The selection unit 19 excludes the data pair x5 from the analysis targets.

In this way, the feature extraction device 1a according to the second embodiment performs data analysis after excluding, from the multiple data pairs, other data pairs that are similar to a given data pair, which reduces the load of the data pair analysis processing.

That is, when, among the multiple data pairs, the occurrence probability of the similarity of one data pair approximates the occurrence probability of the similarity of another data pair, and the time series data constituting the one data pair is identical or similar to the time series data constituting the other data pair, the selection unit 19 selects the other data pair. The visualization unit 15 then excludes the occurrence probabilities of the data pairs selected by the selection unit 19 when displaying on the display unit. Unnecessary data is therefore not displayed, and the computational load can be reduced.
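As a non-limiting illustration, the exclusion of redundant data pairs by the selection unit 19 can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `exclude_redundant`, the tolerance `tol`, the placeholder item names, and the sample probabilities are hypothetical, and "identical or similar time series data" is reduced here to the two pairs sharing a series label.

```python
def exclude_redundant(pairs, tol=0.05):
    """Split pairs into (kept, excluded).

    A pair is excluded when it shares a series with an already kept pair
    and its occurrence probability is within `tol` of that pair's.
    """
    kept, excluded = [], []
    for name, series, prob in pairs:
        redundant = any(
            bool(series & kept_series) and abs(prob - kept_prob) <= tol
            for _, kept_series, kept_prob in kept
        )
        (excluded if redundant else kept).append((name, series, prob))
    return kept, excluded


# Each entry: (pair name, set of series labels, occurrence probability).
pairs = [
    ("x1", {"university tuition fees", "item A"}, 0.91),
    ("x2", {"university tuition fees", "item B"}, 0.92),
    ("x3", {"university tuition fees", "item C"}, 0.90),
    ("x6", {"item D", "item E"}, 0.91),
]
kept, excluded = exclude_redundant(pairs)
```

In this sketch x2 and x3 are excluded because they share the "university tuition fees" series with x1 and have nearly the same occurrence probability, while x6 is kept because it shares no series with x1. Handling merely similar series, such as "Chinese noodles" and "soba", would require an additional similarity test that is not modeled here.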

As shown in FIG. 12, the feature extraction device 1 of the embodiments described above can be implemented using, for example, a general-purpose computer system including a CPU (Central Processing Unit, processor) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906. The memory 902 and the storage 903 are storage devices. In this computer system, each function of the feature extraction device 1 is realized by the CPU 901 executing a predetermined program loaded into the memory 902.

The feature extraction device 1 may be implemented on a single computer or on multiple computers. The feature extraction device 1 may also be a virtual machine implemented on a computer.

The program for the feature extraction device 1 can be stored on a computer-readable recording medium such as an HDD, an SSD, a USB (Universal Serial Bus) memory, a CD (Compact Disc), or a DVD (Digital Versatile Disc), or can be distributed via a network.

The present invention is not limited to the above embodiments, and various modifications are possible within the scope of its gist.

1, 1a Feature extraction device
2 Database
11 Combination unit
12 Data analysis unit (analysis unit)
13 Occurrence probability calculation unit
14 Deviation calculation unit
15 Visualization unit
16 Input unit
17 Feature extraction unit
18 Recording unit
19 Selection unit
21 First analysis unit
22 Second analysis unit
23 Third analysis unit
24 Fourth analysis unit

Claims

1. A feature extraction device comprising:
a combination unit that generates a plurality of data pairs, each combining two pieces of time series data;
an analysis unit that analyzes, using a plurality of analysis methods, the similarity of the two pieces of time series data included in each data pair;
an occurrence probability calculation unit that calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analysis result by the analysis unit;
a deviation calculation unit that calculates, for each data pair, a deviation of the occurrence probabilities calculated for the respective analysis methods;
a visualization unit that visualizes each piece of time series data included in the data pair and the deviation, and presents them to a user;
an input unit that accepts a similarity/dissimilarity judgment input by the user; and
a feature extraction unit that extracts features of the analysis methods based on the judgment input and the deviation.

2. The feature extraction device according to claim 1, wherein the occurrence probability calculation unit calculates the occurrence probability by normalizing the analysis result of the analysis unit.

3. The feature extraction device according to claim 1, wherein the visualization unit visualizes only a predetermined number of analysis results having a large deviation calculated by the deviation calculation unit.

4. The feature extraction device according to claim 1, further comprising a selection unit that, when the occurrence probability of the similarity of one data pair among the plurality of data pairs approximates the occurrence probability of the similarity of another data pair, and the time series data included in the one data pair and the time series data included in the other data pair are identical or similar, selects the other data pair,
wherein the visualization unit performs visualization excluding the occurrence probability of the other data pair selected by the selection unit.

5. The feature extraction device according to claim 1, further comprising a recording unit that records characteristic data of the time series data,
wherein the visualization unit visualizes the characteristic data of the time series data in addition to each piece of time series data included in the data pair and the deviation.

6. The feature extraction device according to any one of claims 1 to 5, wherein the feature extraction unit extracts, as a feature of the analysis method, time series data that is not suitable for analysis by the analysis method.

7. A feature extraction method executed by a feature extraction device including a combination unit, an analysis unit, an occurrence probability calculation unit, a deviation calculation unit, a visualization unit, an input unit, and a feature extraction unit, the method comprising:
a step in which the combination unit generates a plurality of data pairs, each combining two pieces of time series data;
a step in which the analysis unit analyzes, using a plurality of analysis methods, the similarity of the two pieces of time series data included in each data pair;
a step in which the occurrence probability calculation unit calculates, for each analysis method, the occurrence probability of the similarity of each data pair based on the analyzed similarity;
a step in which the deviation calculation unit calculates, for each data pair, a deviation of the occurrence probabilities calculated for the respective analysis methods;
a step in which the visualization unit visualizes each piece of time series data included in the data pair and the deviation, and presents them to a user;
a step in which the input unit accepts a similarity/dissimilarity judgment input by the user; and
a step in which the feature extraction unit extracts features of the analysis methods based on the judgment input and the deviation.

8. A feature extraction program that causes a computer to function as the feature extraction device according to any one of claims 1 to 6.