JP5652332B2

JP5652332B2 - Data structure comparison program and data structure comparison device

Info

Publication number: JP5652332B2
Application number: JP2011121565A
Authority: JP
Inventors: シャオジュン馬
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2011-05-31
Filing date: 2011-05-31
Publication date: 2015-01-14
Anticipated expiration: 2031-05-31
Also published as: JP2012248144A

Description

The present invention relates to a data structure comparison program and a data structure comparison device.

As a conventional method of comparing data structures, there is, for example, a method that uses a pattern extraction technique to extract frequently occurring tree structures from a plurality of tree-structured data as sample data structures, and then calculates the similarity between the extracted sample data structures (see, for example, Non-Patent Documents 1 and 2).

Non-Patent Document 1 discloses a method for comparing tree structures, each composed of internal nodes branching from a root node at the top and leaf nodes at the bottom. The method compares the sequences of leaf nodes of the respective tree structures and calculates the minimum number of operations, such as insertion, deletion, and substitution of data, required to transform one leaf node sequence into the other (hereinafter referred to as the "edit distance"). Tree structures whose edit distance is smaller than a predetermined value are regarded as similar to each other.

Non-Patent Document 2 discloses a method for calculating the minimum number of operations, such as insertion, deletion, and substitution of data, required to transform one tree structure into another, taking into account not only the leaf nodes but also the root nodes and internal nodes of the tree structures (hereinafter referred to as the "TreeEdit distance"). Although this method requires more computation than the method of Non-Patent Document 1, it yields a more reliable value for the similarity between tree structures because it considers nodes other than leaf nodes.

Gusfield, Dan (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge, UK: Cambridge University Press. ISBN 0-521-58519-8.
Bille, Philip. A survey on tree edit distance and related problems. Theoretical Computer Science, Volume 337, Issues 1-3, 9 June 2005. Elsevier Science Publishers Ltd., Essex, UK.

An object of the present invention is to provide a data structure comparison program and a data structure comparison device that reduce the amount of calculation compared with a case where the present configuration is not adopted.

To achieve the above object, one aspect of the present invention provides the following data structure comparison program and data structure comparison device.

[1] A data structure comparison program causing a computer to function as:
acquisition means for acquiring, as sample data structures, data structures that appear at a predetermined frequency in a plurality of comparison target data, each including, as structural elements, a target, a first type of data associated as belonging to the target, and a second type of data associated as belonging to the first type of data;
first similarity calculation means for calculating, when comparing the second type of data of the plurality of sample data structures acquired by the acquisition means, a first similarity between the second type of data whose first type of data is common between the sample data structures;
second similarity calculation means for calculating a second similarity between the sample data structures based on the first similarity calculated by the first similarity calculation means for each second type of data; and
extraction means for extracting, as one classification, the sample data structures whose second similarity indicates a similarity equal to or greater than a predetermined value.

[2] The data structure comparison program according to [1], further causing the computer to function as date and time data acquisition means for acquiring, when the acquisition means acquires the sample data structures, date and time data in units of time from the plurality of comparison target data from which the sample data structures were acquired,
wherein the first similarity calculation means, when comparing the second type of data of the plurality of sample data structures, adds the date and time data acquired by the date and time data acquisition means, by a predetermined method, to the second type of data whose first type of data is common between the sample data structures, and calculates the first similarity between the sample data structures to which the date and time data has been added.

[3] The data structure comparison program according to [1] or [2], wherein the first similarity calculation means calculates the similarity with weighting according to the content of the second type of data.

[4] A data structure comparison device comprising:
acquisition means for acquiring sample data structures, each having a data structure that appears at a predetermined frequency in a plurality of comparison target data, each comparison target data including, as structural elements, a target, a first type of data associated as belonging to the target, and a second type of data associated as belonging to the first type of data;
first similarity calculation means for calculating, when comparing the second type of data of the plurality of sample data structures acquired by the acquisition means, a first similarity between the second type of data whose first type of data is common between the sample data structures;
second similarity calculation means for calculating a second similarity between the sample data structures based on the first similarity calculated by the first similarity calculation means for each second type of data; and
extraction means for extracting, as one classification, the sample data structures whose second similarity indicates a similarity equal to or greater than a predetermined value.

According to the invention of claim 1 or 4, the amount of calculation can be reduced compared with a case where the present configuration is not adopted.

According to the invention of claim 2, a first similarity that takes date and time data into account can be calculated.

According to the invention of claim 3, weighting can be applied according to the content of the second type of data.

FIG. 1 is a diagram showing an example of the configuration of a data structure comparison device according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing an example of the configuration of DPC data patterns.
FIG. 3 is a schematic diagram showing an example of the configuration of the calculation coefficient information.
FIG. 4 is a schematic diagram showing an example of the configuration of the weighting coefficient information.
FIGS. 5(a) and 5(b) show configuration examples of the supports in the DPC data from which the pattern acquisition means 100 acquires a pattern.
FIGS. 6(a) and 6(b) are schematic diagrams showing an example of the configuration of a DPC data pattern to which additional date and time data has been added.
FIG. 7 is a diagram for explaining the partial patterns used to calculate the first similarity.
FIG. 8 is a diagram for explaining the generation of the lists used to calculate the first similarity.
FIG. 9 is a diagram for explaining the method of calculating the first similarity.
FIG. 10 is a diagram for explaining the method of calculating the second similarity.
FIGS. 11(a) to 11(c) are schematic diagrams for explaining an example of the data structure extraction operation.

(Configuration of the data structure comparison device)
FIG. 1 is a diagram showing an example of the configuration of a data structure comparison device according to an embodiment of the present invention.

The data structure comparison device 1 includes a control unit 10, which is composed of a CPU and the like, controls each unit, and executes various programs, and a storage unit 11, which is composed of a storage medium such as an HDD (Hard Disk Drive) or flash memory and stores information. The device is used, for example, to analyze electronic data on patient clinical information and medical procedures.

By executing a data structure comparison program 110, described later, the control unit 10 functions as pattern acquisition means 100, date and time data addition means 101, first similarity calculation means 102, second similarity calculation means 103, data structure extraction means 104, and the like.

The pattern acquisition means 100 acquires a plurality of sample tree structures, each having a patient name as the target at its root node, from a plurality of comparison target data (hereinafter referred to as "supports") included in DPC data 111, described later, which is stored in the storage unit 11 as the electronic data to be analyzed. Here, a sample tree structure (hereinafter referred to as a "pattern") is a common tree structure that is extracted, using a pattern extraction technique, as appearing at a predetermined frequency in a plurality of tree-structured data.

Using a well-known pattern extraction technique, the pattern acquisition means 100 acquires as a pattern the E files and F files, described later, that are common to a plurality of supports. Well-known pattern extraction techniques that can be used include, for example, sequential pattern mining algorithms such as PrefixSpan, BIDE, and CloSpan, as well as subtree mining.

The date and time data addition means 101 extracts date and time data in units of time from the plurality of supports from which the pattern acquired by the pattern acquisition means 100 was derived, and adds it to the pattern. The unit of time may be any of year/month/day, hours, minutes, seconds, and so on.

For the patterns being compared, the first similarity calculation means 102 builds, from the F files that are leaf nodes, a list of the F files that share a common E file (described later), further adds the date and time data in the pattern to the list, and calculates an edit distance for each E file as the similarity between the lists (hereinafter referred to as the "first similarity").

Based on the first similarity for each E file calculated by the first similarity calculation means 102, the second similarity calculation means 103 calculates an edit distance as the similarity between the patterns being compared (hereinafter referred to as the "second similarity").

The data structure extraction means 104 extracts, as one classification, a set of patterns for which the second similarity calculated by the second similarity calculation means 103 is equal to or less than a predetermined value.

The storage unit 11 stores the data structure comparison program 110, which causes the control unit 10 to operate as each of the means described above, as well as the DPC data 111, calculation coefficient information 112, weighting coefficient information 113, and the like.

The DPC data 111 is an analyzable electronic data set of patient clinical information and medical procedures in a nationally standardized format. Patient clinical information includes, for example, basic patient information, disease names, surgical procedures, and various score and stage classifications. Medical procedure information includes medical procedures, drugs, medical materials, date performed, count and quantity, clinical department, ward, insurance type, and the like.

The DPC data 111 has, as its basic data, data called Form 1, E files, and F files. Form 1 contains the patient's clinical information, names of injuries and diseases, surgical procedures, adjuvant therapy, and the like. An E file contains information such as the date performed, the count, the clinical department, the ward, and the ordering physician. An F file contains the detailed content of an E file, for example information on procedures, drugs, materials, and quantities.

In the present embodiment, a support is a tree structure with a patient as the root node, the date and time data and E files belonging to that patient as internal nodes, and the F files belonging to each E file as leaf nodes. Support data structures that appear in a plurality of supports at or above a predetermined frequency are acquired as patterns, similarities are calculated between the acquired patterns, and sets of patterns are extracted based on the calculated similarities.

FIG. 2 is a schematic diagram showing an example of the configuration of DPC data patterns.

Patterns 2a and 2b acquired from the DPC data 111 each have date and time data 22 belonging to a patient, E files 21 belonging to the date and time data 22, and F files belonging to the E files 21, and each forms a tree structure.
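The tree structure described above can be sketched as nested records. This is only an illustrative model, not the patent's implementation: the class names `Pattern`, `DateNode`, and `EFile`, the helper `leaves`, and the way F files are grouped under categories in the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EFile:
    category: str                                      # data category of the E file, e.g. "60"
    f_files: List[str] = field(default_factory=list)   # F files (leaf nodes)


@dataclass
class DateNode:
    label: str                                         # date and time data, e.g. "next day"
    e_files: List[EFile] = field(default_factory=list)


@dataclass
class Pattern:
    patient: str                                       # root node
    dates: List[DateNode] = field(default_factory=list)


def leaves(pattern: Pattern) -> List[Tuple[str, str, str]]:
    """Walk the tree and return (date label, E-file category, F file) triples."""
    return [
        (d.label, e.category, f)
        for d in pattern.dates
        for e in d.e_files
        for f in e.f_files
    ]


# Hypothetical pattern; which F files sit under which category is made up.
pattern_2a = Pattern(
    patient="patient",
    dates=[
        DateNode("next day", [EFile("60", ["TP", "GPT"]), EFile("50", ["ALP"])]),
    ],
)
```

Flattening the tree into triples keeps both the date and the E-file category next to each F file, which is the information the later similarity steps rely on.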

FIG. 3 is a schematic diagram showing an example of the configuration of the calculation coefficient information.

The calculation coefficient information 112 is a matrix whose row and column items are F file, period, and day unit, and it defines the coefficients by which an F file or date and time data is multiplied when the first similarity calculation means 102 calculates the edit distance. Because F files and periods, and F files and day units, are not to be compared with each other, their coefficients are set to "∞".

FIG. 4 is a schematic diagram showing an example of the configuration of the weighting coefficient information.

The weighting coefficient information 113 has an F file content column and a weighting coefficient column, and it defines the coefficients by which an F file or date and time data is multiplied, according to the content of the F file, when the first similarity calculation means 102 calculates the edit distance.

(Operation of the data structure comparison device)
An example of the operation of the data structure comparison device is described below with reference to the drawings, divided into (1) basic operation, (2) similarity calculation operation, and (3) data structure extraction operation.

(1) Basic operation
First, the pattern acquisition means 100 acquires, from the DPC data 111 in the storage unit 11, a plurality of patterns from which data structures are to be extracted. To simplify the description, the case where the two patterns 2a and 2b shown in FIG. 2 are acquired is described below.

First, pattern 2b is used as an example to illustrate how a pattern is acquired.

FIGS. 5(a) and 5(b) show configuration examples of the supports in the DPC data 111 from which the pattern acquisition means 100 acquires a pattern.

Using a pattern extraction technique, the pattern acquisition means 100 acquires pattern 2b from the plurality of supports 200b₁, 200b₂, ... shown in FIG. 5(a).

In the conventional technique using the edit distance of F files, no information other than the F files is used, so it does not matter if the date and time data 22 differs among the supports 200b₁, 200b₂, .... Accordingly, as shown in FIG. 2, pattern B was acquired with the portions where the content of the date and time data 22 differs represented as "day unit".

In the present embodiment, however, the date and time data 22 is also taken into account. The date and time data addition means 101 extracts date and time data 22 in units of time from the plurality of supports 200b₁, 200b₂, ... (FIG. 5(a)) from which the pattern 2b acquired by the pattern acquisition means 100 was derived, and adds additional date and time data 210b₁, 210b₂, ..., as shown in FIG. 5(b), as date and time data whose content encompasses the differing date and time data.

FIGS. 6(a) and 6(b) are schematic diagrams showing an example of the configuration of a pattern to which additional date and time data has been added.

Next, based on the additional date and time data 210b₁, 210b₂, ... added to the plurality of supports 200b₁, 200b₂, ..., the date and time data addition means 101 adds additional date and time data 22a to pattern 2b, as shown in FIG. 6(a).

Next, the date and time data addition means 101 searches the date and time data of the plurality of supports 200b₁, 200b₂, ... within the date and time range of the additional date and time data 22b, and replaces the date and time data 22a shown in FIG. 6(a) with, for example, the additional date and time data 22b, which is the smallest value that encompasses the date and time data of interest, as shown in FIG. 6(b). In this way, the present embodiment calculates the similarity with date and time data, which was not used in conventional similarity calculation, added as an element. The additional date and time data 22a is not limited to the smallest value and may instead be set to a value, in a predetermined unit, that encompasses all of the date and time data.

(2) Similarity calculation operation
FIG. 7 is a diagram for explaining the partial patterns used to calculate the first similarity.

Next, as shown in FIG. 7, the first similarity calculation means 102 divides patterns 2a and 2b into partial patterns 2a₁ to 2a₄ and 2b₁ to 2b₄ based on the data category predetermined for the E files 21.

Next, the first similarity calculation means 102 calculates the first similarity between the partial patterns whose categories match: 2a₁ and 2b₁ (category "60"), 2a₂ and 2b₂ (category "50"), 2a₃ and 2b₃ (category "54"), and 2a₄ and 2b₄ (category "70"). The calculation method is described below.

FIG. 8 is a diagram for explaining the generation of the lists used to calculate the first similarity.

For example, from the partial patterns 2a₁ and 2b₁, whose categories match at "60", the first similarity calculation means 102 generates lists 3a₁ and 3b₁, respectively, each containing the date and time data 22 and the F files 20 as character strings. In doing so, the first similarity calculation means 102 adds the date and time data 22 to which an F file 20 belongs to the list immediately before that F file 20. The position at which the date and time data 22 is added is not limited to this; it may instead be added after the F file 20.
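The list generation step can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_list`, the `(kind, value)` item representation, and the choice to emit one date entry per group of F files sharing that date are not taken from the patent, and the sample F-file grouping is hypothetical.

```python
from typing import List, Sequence, Tuple

Item = Tuple[str, str]  # (kind, value); kind is "date" or "F" (assumed encoding)


def build_list(partial_pattern: Sequence[Tuple[str, Sequence[str]]],
               date_first: bool = True) -> List[Item]:
    """Flatten one partial pattern into a list of date and F-file items.

    partial_pattern holds (date label, F files under that date) pairs for a
    single E-file category. With date_first=True the date precedes its F
    files; with date_first=False it follows them.
    """
    items: List[Item] = []
    for date_label, f_files in partial_pattern:
        date_item = ("date", date_label)
        f_items = [("F", f) for f in f_files]
        if date_first:
            items.extend([date_item] + f_items)
        else:
            items.extend(f_items + [date_item])
    return items


# Hypothetical partial pattern for category "60".
list_3a1 = build_list([("next day", ["TP", "GPT"])])
```

Tagging each item with its kind makes it possible to forbid comparisons between date entries and F files when costs are assigned later.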

FIG. 9 is a diagram for explaining the method of calculating the first similarity.

Next, the first similarity calculation means 102 calculates the edit distance between lists 3a₁ and 3b₁. Specifically, as shown in FIG. 9, the similarity of each pair of items in lists 3a₁ and 3b₁ is quantified and arranged in a matrix, and a path is traced from the upper-left element to the lower-right element so as to take the minimum value.

Here, a move to the lower right at identical elements in the matrix corresponds to making no edit to the list, a downward move at differing elements corresponds to deleting a list element, and a rightward move at differing elements corresponds to inserting a list element.

The minimum value obtained as a result of the above procedure is the edit distance. In the example shown by the polyline in FIG. 9, the operations performed are: substituting between the elements "next day" and "2 days later", inserting the element "TP", deleting the element "GPT", inserting the element "ALP", and inserting the element "ALb".
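The matrix walk described above matches the standard dynamic-programming formulation of edit distance. The sketch below uses unit costs for every operation, which is an assumption for illustration; the function name `edit_distance` and the sample lists are hypothetical and are not the contents of FIG. 9.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance between two lists of items."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i                      # delete the first i items of a
    for j in range(1, cols):
        d[0][j] = j                      # insert the first j items of b
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + 1,                       # move down: delete a[i-1]
                d[i][j - 1] + 1,                       # move right: insert b[j-1]
                d[i - 1][j - 1] + (0 if same else 1),  # diagonal: keep or substitute
            )
    return d[-1][-1]                     # lower-right element is the minimum cost


# Hypothetical lists: one substitution (GPT -> ALP) and one insertion (ALb).
list_a = ["next day", "TP", "GPT"]
list_b = ["next day", "TP", "ALP", "ALb"]
distance = edit_distance(list_a, list_b)
```

Each cell holds the cheapest way to turn a prefix of the first list into a prefix of the second, so the lower-right cell gives the distance for the whole lists.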

In addition, by using the calculation coefficient information 112 as coefficients for the above operations, edit operations other than those between date and time data and those between F files are excluded, and a larger penalty is placed on the substitution of F files. The coefficient for F file substitution may instead be set to 1.

The weighting coefficient information 113 may also be applied to the above operations. That is, the operation of substituting between the elements "next day" and "2 days later" corresponds to the entry "one-day difference" in the F file content column of FIG. 4, so it is multiplied by a coefficient of 0.7. Similarly, the edit distance may be calculated by multiplying the operation of inserting the element "TP" by a coefficient of 1, the operation of deleting the element "GPT" by 1, the operation of inserting the element "ALP" by 0.5, and the operation of inserting the element "ALb" by 2.
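A weighted variant of the edit distance can be sketched as follows. The weights 0.7, 1, 1, 0.5, and 2 come from the example above; everything else is an assumption for illustration, including the table names `KIND_COEFF`, `INDEL_WEIGHT`, and `SUB_WEIGHT`, the value 3.0 for F-file substitution, the default weight of 1, and the input lists, which are not the actual contents of FIG. 9.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Tuple

Item = Tuple[str, str]  # (kind, value); kind is "date" or "F" (assumed encoding)

# Stand-in for the calculation coefficient information: mixed kinds are never compared.
KIND_COEFF: Dict[Tuple[str, str], float] = {
    ("date", "date"): 1.0,
    ("F", "F"): 3.0,            # hypothetical "large" penalty for F-file substitution
    ("date", "F"): math.inf,
    ("F", "date"): math.inf,
}
# Stand-in for the weighting coefficient information.
INDEL_WEIGHT: Dict[str, float] = {"TP": 1.0, "GPT": 1.0, "ALP": 0.5, "ALb": 2.0}
SUB_WEIGHT: Dict[FrozenSet[str], float] = {frozenset(["next day", "2 days later"]): 0.7}


def indel_cost(item: Item) -> float:
    return INDEL_WEIGHT.get(item[1], 1.0)   # default weight 1 is an assumption


def sub_cost(x: Item, y: Item) -> float:
    if x == y:
        return 0.0
    return KIND_COEFF[(x[0], y[0])] * SUB_WEIGHT.get(frozenset([x[1], y[1]]), 1.0)


def weighted_edit_distance(a: Sequence[Item], b: Sequence[Item]) -> float:
    d: List[List[float]] = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, len(b) + 1):
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a[i - 1]),              # delete
                d[i][j - 1] + indel_cost(b[j - 1]),              # insert
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # keep or substitute
            )
    return d[-1][-1]


list_a = [("date", "next day"), ("F", "GPT")]
list_b = [("date", "2 days later"), ("F", "TP"), ("F", "ALP"), ("F", "ALb")]
total = weighted_edit_distance(list_a, list_b)
```

With these inputs the cheapest path uses the five operations named in the text, so the total is 0.7 + 1 + 1 + 0.5 + 2 = 5.2. The infinite coefficient keeps any path that would substitute a date for an F file from ever being chosen.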

The above edit distance calculation is performed for all pairs of partial patterns, 2a₁ and 2b₁, 2a₂ and 2b₂, 2a₃ and 2b₃, and 2a₄ and 2b₄, to obtain the respective first similarities between patterns 2a and 2b.

FIG. 10 is a diagram for explaining the method of calculating the second similarity.

In a matrix whose items are all the F files of interest in patterns 2a and 2b, grouped by E file, the first similarities obtained for the partial patterns are placed on the diagonal of the matrix, as shown in FIG. 10.

Next, the second similarity calculation means 103 calculates the second similarity in the matrix shown in FIG. 10. Specifically, in the same manner as described for FIG. 9, the first similarities are simply added from the upper left to the lower right. Because the edit distance between F files belonging to different E files, for example the first similarity between E files 21 "50" and "54" or "60" and "70", is not calculated, the amount of calculation is reduced compared with the conventional method, which calculates the edit distance between all F files.
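The second similarity can be sketched as a sum of per-category edit distances. This is a simplified illustration: the names `edit_distance` and `second_similarity`, the dictionary layout keyed by E-file category, the unit costs, and the sample data are assumptions, and categories present in only one pattern are simply skipped because the text does not say how they are handled.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Unit-cost edit distance, keeping only two rows of the matrix."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # keep or substitute
        prev = cur
    return prev[-1]


def second_similarity(a: Dict[str, List[str]], b: Dict[str, List[str]]) -> int:
    """Sum the first similarities over E-file categories shared by both patterns."""
    shared = [category for category in a if category in b]
    return sum(edit_distance(a[c], b[c]) for c in shared)


# Hypothetical patterns: lists of date and F-file items per E-file category.
pattern_a = {"60": ["next day", "TP", "GPT"], "50": ["next day", "ALP"]}
pattern_b = {"60": ["next day", "TP", "ALb"], "50": ["next day", "ALP"]}
score = second_similarity(pattern_a, pattern_b)
```

Only lists under the same category are ever compared, so each comparison works on a small matrix instead of one matrix spanning every F file in both patterns.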

(3) Data structure extraction operation
FIGS. 11(a) to 11(c) are schematic diagrams for explaining an example of the data structure extraction operation.

When the second similarity calculation means 103 has calculated the second similarity between each pair of patterns, a matrix showing the second similarities between the patterns is generated, as shown in FIG. 11(a).

In the matrix shown in FIG. 11(a), the data structure extraction means 104 judges the data structures of patterns to be more similar the smaller their second similarity value, and extracts a plurality of patterns with highly similar data structures as one classification, as shown in FIG. 11(b).

Here, if the classification threshold is set, for example, to a second similarity value of less than 3, the patterns are divided into a group of patterns 2a to 2h and a group of patterns 2i and 2j, as shown in FIG. 11(b), and the data structure extraction means 104 extracts the group of patterns 2a to 2h as cluster 1 and the group of patterns 2i and 2j as cluster 2, as shown in FIG. 11(c).
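One possible reading of this grouping step is to link every pair of patterns whose second similarity is below the threshold and take the connected groups as clusters. The sketch below follows that reading with a small union-find; the function name `cluster`, the pairwise distance values, and the linking rule itself are assumptions, since the text does not specify the exact clustering procedure.

```python
from typing import Dict, List, Sequence, Tuple


def cluster(names: Sequence[str],
            distance: Dict[Tuple[str, str], float],
            threshold: float) -> List[List[str]]:
    """Group patterns connected by pairs whose distance is below threshold."""
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    for (p, q), dist in distance.items():
        if dist < threshold:
            parent[find(p)] = find(q)       # merge the two groups

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


# Hypothetical second similarities between a few patterns.
names = ["2a", "2b", "2c", "2i", "2j"]
distance = {
    ("2a", "2b"): 1, ("2b", "2c"): 2, ("2a", "2c"): 4,
    ("2c", "2i"): 5, ("2i", "2j"): 2,
}
clusters = cluster(names, distance, threshold=3)
```

Under this rule pattern 2c joins the first cluster through 2b even though its distance to 2a is 4, which is the usual behavior of single-linkage grouping.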

[Other embodiments]
The present invention is not limited to the above embodiment, and various modifications are possible without departing from the spirit of the present invention.

The data structure comparison program 110 used in the above embodiment may be read into the storage unit of the device from a storage medium such as a CD-ROM, or may be downloaded into the storage unit of the device from a server device or the like connected to a network such as the Internet. In addition, some or all of the DPC pattern acquisition means 100, the date and time data addition means 101, the first similarity calculation means 102, the second similarity calculation means 103, and the data structure extraction means 104 used in the above embodiment may be implemented by hardware such as an ASIC.

1 Data structure comparison device
2a-2h Patterns
2a₁-2a₄ Partial patterns
3a₁, 3b₁ Lists
10 Control unit
11 Storage unit
20 F file
21 E file
22 Date and time data
22a, 22b Additional date and time data
100 Pattern acquisition means
101 Date and time data addition means
102 First similarity calculation means
103 Second similarity calculation means
104 Data structure extraction means
110 Data structure comparison program
111 DPC data
112 Calculation coefficient information
113 Weighting coefficient information
200b₁, 200b₂ Supports
210b₁, 210b₂ Additional date and time data

Claims

A data structure comparison program causing a computer to function as:
acquisition means for acquiring, as sample data structures, data structures that appear at a predetermined frequency in a plurality of comparison target data, each including, as structural elements, a target, a first type of data associated as belonging to the target, and a second type of data associated as belonging to the first type of data;
first similarity calculation means for calculating, when comparing the second type of data of the plurality of sample data structures acquired by the acquisition means, a first similarity between the second type of data whose first type of data is common between the sample data structures;
second similarity calculation means for calculating a second similarity between the sample data structures based on the first similarity calculated by the first similarity calculation means for each second type of data; and
extraction means for extracting, as one classification, the sample data structures whose second similarity indicates a similarity equal to or greater than a predetermined value.

The data structure comparison program according to claim 1, further causing the computer to function as date and time data acquisition means for acquiring, when the acquisition means acquires the sample data structures, date and time data in units of time from the plurality of comparison target data from which the sample data structures were acquired,
wherein the first similarity calculation means, when comparing the second type of data of the plurality of sample data structures, adds the date and time data acquired by the date and time data acquisition means, by a predetermined method, to the second type of data whose first type of data is common between the sample data structures, and calculates the first similarity between the sample data structures to which the date and time data has been added.

The data structure comparison program according to claim 1, wherein the first similarity calculation means calculates the similarity with weighting according to the content of the second type of data.

A data structure comparison device comprising:
acquisition means for acquiring sample data structures, each having a data structure that appears at a predetermined frequency in a plurality of comparison target data, each comparison target data including, as structural elements, a target, a first type of data associated as belonging to the target, and a second type of data associated as belonging to the first type of data;
first similarity calculation means for calculating, when comparing the second type of data of the plurality of sample data structures acquired by the acquisition means, a first similarity between the second type of data whose first type of data is common between the sample data structures;
second similarity calculation means for calculating a second similarity between the sample data structures based on the first similarity calculated by the first similarity calculation means for each second type of data; and
extraction means for extracting, as one classification, the sample data structures whose second similarity indicates a similarity equal to or greater than a predetermined value.