JP6066089B2

JP6066089B2 - Data relationship determination system, data relationship determination method, and program

Info

Publication number: JP6066089B2
Application number: JP2013527982A
Authority: JP
Inventors: 由希子黒岩
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-08-08
Filing date: 2012-07-25
Publication date: 2017-01-25
Anticipated expiration: 2032-07-25
Also published as: WO2013021875A1; JPWO2013021875A1

Description

The present invention relates to a system for determining relevance between data in information processing, and more particularly to a system suited to determining relevance between specifications in system and software development.

In determining relevance between data, each item of data to be compared is converted into a character string and the similarity between the character strings is calculated, so that items with high similarity can be estimated to be related. For example, as described in Non-Patent Document 1, similarity can be calculated between items of data such as text, images, and time-series data, and that similarity can be taken as an estimate of the relevance between them.
Similarity between data is sometimes used to estimate relevance in system and software development as well. For example, Non-Patent Document 2 uses similarity between data to calculate the relevance between business flows.
However, these techniques sometimes calculate a high relevance when the data belong to different concepts but their character strings are similar.
Here, a concept is not an incidental property belonging only to an individual item of data, but an essential characteristic shared by a plurality of data. For example, when the data are specifications (text) in system development, a concept corresponds to a system component, a business category, or the like.
Specifically, suppose the two specifications compared in the similarity calculation are "The order-receiving management system supports telephone, FAX (facsimile), and EDI (Electronic Data Interchange)." and "The order-placing management system supports telephone, FAX, and EDI." The two specifications belong to different components (concepts), namely the "order-receiving management system" and the "order-placing management system". They are therefore not directly related, and changing one of them to "Telephone is not supported." causes no problem in system or software development.
However, because the two original specifications are identical except for "order-receiving" and "order-placing", conventional techniques tend to calculate them as highly related. Consequently, when one of them is changed to "Telephone is not supported.", a system that determines relevance between data erroneously detects an inconsistency, such as a contradiction, in the development specifications.
To deal with this problem, a method of automatically constructing concepts and their weights is described in, for example, Non-Patent Document 3. In this method, a large number of documents such as newspaper articles are collected, and the concepts of the documents are represented by multidimensional vectors. Multidimensional vectors are used because concepts generally do not form a simple hierarchy and instead overlap one another.
Patent Document 1 describes an example of a method that obtains concepts by referring to a concept database and uses them to calculate the similarity between a plurality of data. According to Patent Document 1, the concept of the data entered for a search is used to extract related, similar data, and information retrieval is performed using those data.
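The false positive described above is easy to reproduce. The following minimal sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for a generic string similarity measure; that choice is an assumption made for illustration and is not the measure used in the cited documents. It compares the two Japanese example specifications, which differ in a single character.

```python
from difflib import SequenceMatcher

# The two example specifications from the text; they belong to different components.
spec_receiving = "受注管理システムは、電話、ＦＡＸ、ＥＤＩに対応する。"
spec_placing = "発注管理システムは、電話、ＦＡＸ、ＥＤＩに対応する。"


def string_similarity(a: str, b: str) -> float:
    """Return a character-level similarity in [0, 1] (illustrative measure only)."""
    return SequenceMatcher(None, a, b).ratio()


similarity = string_similarity(spec_receiving, spec_placing)
print(f"similarity = {similarity:.3f}")
```

A purely string-based measure scores the pair as nearly identical, even though the specifications concern unrelated components.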

Patent Document 1: JP 2006-106970 A

Non-Patent Document 1: M. Li et al., "The Similarity Metric," IEEE Transactions on Information Theory, Vol. 50, No. 12, pp. 3250-3264, 2004.
Non-Patent Document 2: E. Keogh et al., "Compression-based data mining of sequential data," Data Mining and Knowledge Discovery, Vol. 14, No. 1, pp. 99-129, 2007.
Non-Patent Document 3: H. Schutze, "Dimensions of Meaning," Proceedings of Supercomputing, pp. 787-796, 1992.

However, in determining relevance between specifications in system and software development, for example, it is often impossible to analyze a large number of documents.
This is because many projects use project-specific terms for concepts such as components and business flows. Only documents within the project (including, in some cases, similar or parallel projects) can therefore be compared, which makes it difficult to build an accurate and large concept database.
As described above, since system and software development often cannot draw on many documents for analysis, it is difficult to construct concepts automatically with the existing methods described above. On the other hand, documents used in system development typically use a more limited vocabulary than general documents and are organized hierarchically from several viewpoints.
Moreover, for documents used in system development, incomplete concept information, in which elements of two concepts overlap or elements of a concept are missing, can be constructed more easily than for general documents.
The present invention provides a system for determining relevance between data that accurately determines the relevance between data on the basis of incomplete concept information in which not all information is registered.

A system for determining relevance between data according to the present invention includes a candidate selection unit and a similarity calculation unit. As preprocessing for similarity calculation, the candidate selection unit refers to a concept set whose elements are a plurality of concepts, each concept having one or more generated character strings as elements, and selects the data to be compared as a candidate for similarity calculation in any of the following cases: both items of data contain character strings belonging to the same concept; one item contains a character string belonging to a concept and the other contains no character string belonging to a concept; or neither item contains a character string belonging to a concept. The similarity calculation unit calculates the similarity for candidates selected by the candidate selection unit, sets the similarity to a predetermined small value for candidates not selected by the candidate selection unit, and calculates and outputs the similarity of the data to be compared.
A method for determining relevance between data, performed by an information processing apparatus according to the present invention, includes a candidate selection step and a similarity calculation step. The candidate selection step is preprocessing for similarity calculation: on the basis of the concept set described above, it selects the data to be compared as a candidate for similarity calculation in any of the three cases listed above. The similarity calculation step calculates the similarity for candidates selected in the candidate selection step, sets the similarity to a predetermined small value for candidates not selected, and computes the similarity of the data to be compared so that it can be output.
A program for determining relevance between data according to the present invention causes the control unit of an information processing apparatus to execute a candidate selection process and a similarity calculation process. The candidate selection process is preprocessing for similarity calculation: on the basis of the concept set described above, it selects the data to be compared as a candidate for similarity calculation in any of the three cases listed above. The similarity calculation process calculates the similarity for candidates selected in the candidate selection process, sets the similarity to a predetermined small value for candidates not selected, and calculates the similarity of the data to be compared.

According to the present invention, it is possible to provide a system for determining relevance between data that accurately determines the relevance between data on the basis of incomplete concept information in which not all information is registered.

FIG. 1 is a block diagram showing a configuration example of a first embodiment of the relevance determination system.
FIG. 2 is an explanatory diagram showing an example of concept information stored in the concept storage unit 100.
FIG. 3 is an explanatory diagram showing an example of data to be determined stored in the data storage unit 101.
FIG. 4 is an explanatory diagram showing an example of candidates stored in the candidate storage unit 102.
FIG. 5 is a flowchart showing an example of the processing flow of the first embodiment of the relevance determination system.
FIG. 6 is a flowchart showing an example of the processing flow of the candidate selection unit 103 of the first embodiment, which checks whether data p and q are a candidate for similarity calculation in concept set i.
FIG. 7 is a block diagram showing a configuration example of a second embodiment of the relevance determination system.
FIG. 8 is an explanatory diagram showing an example of a glossary stored in the glossary storage unit 200.
FIG. 9 is an explanatory diagram showing another example of a glossary stored in the glossary storage unit 200.
FIG. 10 is a flowchart showing an example of the processing flow of the concept construction unit 201 of the second embodiment.
FIG. 11 is a block diagram showing a configuration example of a third embodiment of the relevance determination system.
FIG. 12 is an explanatory diagram showing an example of structure data stored in the structure data storage unit 300.
FIG. 13 is a flowchart showing an example of the processing flow of the concept construction unit 301 of the third embodiment.
FIG. 14 is an explanatory diagram showing an example of concepts constructed by the concept construction unit 301.
FIG. 15 is a flowchart showing an example of the processing flow of the data generation unit 302 of the third embodiment.
FIG. 16 is an explanatory diagram showing an example of data generated by the data generation unit 302.
FIG. 17 is a configuration diagram showing an example of an implementation of the present invention.
FIG. 18 is a configuration diagram showing an example of another implementation of the present invention.

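(First embodiment)
Next, a first embodiment of the relevance determination system according to the present invention will be described in detail with reference to the drawings. The system according to the first embodiment calculates a similarity indicating the relevance between data on the basis of concepts and data stored in advance. The description here uses natural language for the character strings that are elements of concepts and for the character strings of the data, but the character strings may be any character strings representing images, time-series data, or the like.
Referring to FIG. 1, the relevance determination system according to this embodiment comprises a storage unit 11 that stores information and a calculation unit 12 that operates under program control.
The storage unit 11 includes a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The concept storage unit 100 stores, for each of a plurality of concepts, one or more character strings representing the concept. The concepts may be specified by a user of the system through input via a keyboard or the like, stored in the system as defaults, or specified in some other manner.
FIG. 2 is an explanatory diagram showing an example of the words and phrases representing concepts stored in the concept storage unit 100. In the figure, each row represents one concept set, each "," delimits one concept, and within a concept, character strings separated by "/" are variations of character strings belonging to the same concept.
Any two concepts in one concept set are desirably disjoint (that is, no element of one concept coincides with an element of the other), but this is not a requirement. Two concepts that belong to different concept sets need not be disjoint.
For example, the figure shows that the character strings "order-receiving management system" and "order-receiving system" are elements of the same concept. The character strings "order-placing management system" and "order-placing system" are elements of a concept different from the one represented by "order-receiving management system" and "order-receiving system". Likewise, "telephone" and "FAX" are elements of different concepts. Note that the figure says nothing about whether "order-receiving management system" and "telephone" belong to the same concept or to different concepts.
Using these concept sets, each having a plurality of concepts as elements, the candidate selection unit 103 decides whether the two items of data being compared should be selected as a candidate for similarity calculation.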
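The FIG. 2 notation (one concept set per row, "," between concepts, "/" between variant strings) maps directly onto nested lists. The parser below is a minimal sketch under that reading. The function name `parse_concept_sets` and the sample text `FIG2_TEXT` are assumptions made for illustration, since the figure itself is not reproduced in the text.

```python
def parse_concept_sets(text: str) -> list[list[list[str]]]:
    """Parse FIG. 2-style text into concept sets -> concepts -> strings."""
    # Normalize the full-width delimiters used in the original Japanese figure.
    text = text.replace("，", ",").replace("／", "/")
    concept_sets = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank rows
        concepts = [
            [s.strip() for s in concept.split("/") if s.strip()]
            for concept in line.split(",")
        ]
        # Drop empty concepts produced by stray delimiters.
        concept_sets.append([c for c in concepts if c])
    return concept_sets


# Hypothetical rendering of FIG. 2, based only on the strings named in the text.
FIG2_TEXT = """\
order-receiving management system/order-receiving system,order-placing management system/order-placing system
telephone,FAX
"""

concept_sets = parse_concept_sets(FIG2_TEXT)
print(concept_sets)
```

Keeping the three levels (set, concept, string) explicit makes the later candidate check a matter of simple nested iteration.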
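The data storage unit 101 stores a plurality of items of data whose relevance is to be determined. Each item of data consists of a character string and may be specified by a user of the system through input via a keyboard or the like, stored in the system as a default, or specified in some other manner. FIG. 3 is an explanatory diagram showing an example of the data stored in the data storage unit 101. In the figure, the first row describes the contents of each column: the first column holds the ID number of the data and the second column holds its contents. That is, the first item of data is "The order-receiving management system supports telephone, FAX, and EDI", the second is "The order-placing management system supports telephone, FAX, and EDI", the third is "The order management system supports telephone, FAX, and EDI", and the fourth is "The order-receiving system supports telephone, FAX, and EDI". Each item of data here is natural-language (Japanese) text, but it may be any character string representing an image, time-series data, or the like. Hereinafter, the first item of data (ID = 1) is called data 1, the second (ID = 2) data 2, and so on.
The candidate storage unit 102 stores, as candidates, the ID number of an item of data in association with the ID numbers of the data with which its similarity is to be calculated. FIG. 4 is an explanatory diagram showing an example of the candidates stored in the candidate storage unit 102. Here, the similarity between data 1 and data 2 is regarded as equal to the similarity between data 2 and data 1, so for each item of data only items with a larger number are listed as candidates. In each row, the first number, delimited by ",", is followed by the numbers of the data that are candidates for similarity calculation with it. That is, the figure shows that similarity is calculated between data 1 and each of data 3 and data 4, but not between data 1 and data 2; that similarity is calculated between data 2 and data 3, but not between data 2 and data 4; and that similarity is not calculated between data 3 and data 4.
The calculation unit 12 includes a candidate selection unit 103 and a similarity calculation unit 104.
The candidate selection unit 103 selects candidates for similarity calculation on the basis of the concept information stored in the concept storage unit 100 and the data to be determined stored in the data storage unit 101, and stores the selected candidates in the candidate storage unit 102.
The similarity calculation unit 104 calculates the similarity between data on the basis of the candidates stored in the candidate storage unit 102 and the data to be determined stored in the data storage unit 101. In doing so, the similarity calculation unit 104 sets the similarity to a predetermined small value for pairs not selected by the candidate selection unit 103 (that is, pairs not stored in the candidate storage unit 102).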
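Next, the overall operation of the first embodiment will be described in detail with reference to the flowchart of FIG. 5.
First, the candidate selection unit 103 initializes to 1 the variable p, which indicates a data number (the ID number illustrated in FIG. 3) (step A1).
Next, the candidate selection unit 103 compares p with the number of data M, the total number of items of data (step A2). If p is less than or equal to M, the process proceeds to the next step. If p is greater than M, the process proceeds to step A12, which is performed by the similarity calculation unit 104.
Next, the candidate selection unit 103 initializes to p + 1 the variable q, which indicates the number of the data to be judged as a candidate together with p (step A3).
Next, the candidate selection unit 103 compares q with the number of data M (step A4). If q is less than or equal to M, the process proceeds to the next step. If q is greater than M, the process proceeds to step A11.
Next, the candidate selection unit 103 initializes to 1 the variable i, which indicates a concept set (step A5). Hereinafter, the i-th concept set is called concept set i.
Next, the candidate selection unit 103 compares i with the number of concept sets I, the total number of concept sets (step A6). If i is less than or equal to I, the process proceeds to the next step. If i is greater than I, the candidate selection unit 103 determines that data p and data q, identified by the variables p and q, are a candidate in all concept sets, and the process proceeds to step A9.
Next, the candidate selection unit 103 checks whether data p and data q can be a candidate for similarity calculation when concept set i is used as the criterion (step A7). The details of this processing are described later. If they are not a candidate, the process proceeds to step A10 (No in step A7). If they are a candidate, the process proceeds to the next step (Yes in step A7).
Next, the candidate selection unit 103 increments i so that the next concept set is used as the criterion (step A8). The process then returns to step A6.
If i is greater than I in step A6, the candidate selection unit 103 stores in the candidate storage unit 102 the fact that p and q are a candidate for similarity calculation (step A9).
Next, the candidate selection unit 103 increments q (step A10). The process then returns to step A4.
If q is greater than the number of data M in step A4, the candidate selection unit 103 increments p (step A11). The process then returns to step A2.
If p is greater than M in step A2, the similarity calculation unit 104 calculates the similarity between data for the candidates stored in the candidate storage unit 102 (step A12). The similarity between data that are not candidates is set to 0. The operation then ends. The similarity between data can be calculated, for example, using an approximation of Kolmogorov complexity. The calculated similarity may be output immediately via a display device, a printing device, or the like, stored and output in response to a request from a user of the system, or output in some other manner.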
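The loop of FIG. 5 can be sketched compactly as follows. The text only says that similarity may use "an approximation of Kolmogorov complexity"; the sketch assumes the normalized compression distance (NCD) with `zlib` as the compressor, which is one common approximation and not necessarily the one intended. The function names and the `is_candidate` callback (standing in for steps A5 to A9) are hypothetical.

```python
import zlib
from itertools import combinations
from typing import Callable


def compressed_size(text: str) -> int:
    """Length of the zlib-compressed UTF-8 encoding, a crude proxy for K(text)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd_similarity(x: str, y: str) -> float:
    """Return 1 - NCD(x, y); larger values suggest more closely related strings."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    ncd = (cxy - min(cx, cy)) / max(cx, cy)
    return 1.0 - ncd


def pairwise_similarities(
    data: dict[int, str],
    is_candidate: Callable[[str, str], bool],
    non_candidate_value: float = 0.0,
) -> dict[tuple[int, int], float]:
    """Steps A1-A12: score candidate pairs, assign a fixed small value otherwise."""
    result = {}
    for p, q in combinations(sorted(data), 2):  # every pair with p < q (A1-A4)
        if is_candidate(data[p], data[q]):  # passes every concept set (A5-A9)
            result[(p, q)] = ncd_similarity(data[p], data[q])
        else:
            result[(p, q)] = non_candidate_value  # "set to 0" in step A12
    return result
```

Because non-candidate pairs are never passed to the compressor, the preprocessing both suppresses false positives and avoids unnecessary similarity computations.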
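Next, the operation of the candidate selection unit 103 of the first embodiment in checking whether data p and data q are a candidate for similarity calculation in concept set i will be described in detail with reference to the flowchart of FIG. 6.
First, the candidate selection unit 103 sets to true a candidate flag indicating whether data p and data q are a candidate for similarity calculation (step A13).
Next, the candidate selection unit 103 sets to false the variable n1, which indicates whether data p contains at least one character string representing a concept, and sets to false the variable n2, which indicates whether data q contains at least one such character string (step A14).
Next, the candidate selection unit 103 initializes to 1 the variable j, which indicates the number of a concept in concept set i stored in the concept storage unit 100 (step A15). Hereinafter, the j-th concept is called concept j.
Next, the candidate selection unit 103 compares j with the number of concepts J, the total number of concepts in concept set i (step A16). If j is less than or equal to J, the process proceeds to the next step. If j is greater than J, the process proceeds to step A26.
Next, the candidate selection unit 103 sets to false m1[j], which indicates whether data p contains concept j, and sets to false m2[j], which indicates whether data q contains concept j (step A17).
Next, the candidate selection unit 103 initializes to 1 the variable k, which indicates the number of a character string representing concept j of concept set i (step A18). Hereinafter, the k-th character string is called character string k.
Next, the candidate selection unit 103 compares k with the number of character strings K, the total number of character strings in concept j of concept set i (step A19). If k is less than or equal to K, the process proceeds to the next step. If k is greater than K, the process proceeds to step A25.
Next, the candidate selection unit 103 checks whether data p contains character string k of concept j of concept set i (step A20). If it does, the process proceeds to the next step. If it does not, the process proceeds to step A22.
Next, the candidate selection unit 103 sets n1 to true and m1[j] to true (step A21). For example, suppose that the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 are those of FIG. 3, and p is 1. Under these conditions, for i = 1, j = 1, k = 1, character string 1 of concept 1 of concept set 1, "order-receiving management system", is contained as a character string in data 1, "The order-receiving management system supports telephone, FAX, and EDI". The process therefore reaches this step, and n1 and m1[1] are set to true. For i = 1, j = 1, k = 2, on the other hand, data 1 does not contain character string 2 of concept 1 of concept set 1, "order-receiving system", so the process does not reach this step. Likewise, neither "order-placing management system" (i = 1, j = 2, k = 1) nor "order-placing system" (i = 1, j = 2, k = 2) is contained in data 1, so the process does not reach this step. Accordingly, with the contents illustrated in FIGS. 2 and 3, n1 = true, m1[1] = true, and m1[2] = false are set for i = 1.
Next, the candidate selection unit 103 checks whether data q contains character string k of concept j of concept set i (step A22). If it does, the process proceeds to the next step. If it does not, the process proceeds to step A24.
Next, the candidate selection unit 103 sets n2 to true and m2[j] to true (step A23). For example, suppose that the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 are those of FIG. 3, and q is 2. Under these conditions, for i = 1, j = 1, k = 1, character string 1 of concept 1 of concept set 1, "order-receiving management system", is not contained in data 2, "The order-placing management system supports telephone, FAX, and EDI.", so the process does not reach this step. Similarly, for i = 1, j = 1, k = 2, "order-receiving system" is not contained in data 2, so the process does not reach this step. For i = 1, j = 2, k = 1, on the other hand, "order-placing management system" is contained in data 2, so the process reaches this step and n2 = true and m2[2] = true are set. Accordingly, with the contents illustrated in FIGS. 2 and 3, n2 = true, m2[1] = false, and m2[2] = true are set for q = 2 and i = 1. When q is 3 and i = 1, data 3 contains none of the character strings for any j and k, so the process never reaches this step, and n2 = false, m2[1] = false, and m2[2] = false are set. When q is 4, the process reaches this step only for i = 1, j = 1, k = 2, and n2 = true, m2[1] = true, and m2[2] = false are set.
Next, the candidate selection unit 103 increments k (step A24). The process then returns to step A19.
When k becomes greater than K in step A19, the candidate selection unit 103 increments j (step A25). The process then returns to step A16.
When j becomes greater than J in step A16, the candidate selection unit 103 checks whether both n1 and n2 are true (step A26). If both are true, the process proceeds to the next step. If either or both are false, the candidate flag remains true, that is, data p and data q are regarded as a candidate in concept set i, and the operation ends. For example, suppose that the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 are those of FIG. 3, and p is 1, so that n1 = true. For q = 2, n2 = true, so the process proceeds to the next step. For q = 3, n2 = false, so data 1 and data 3 are regarded as a candidate and the operation ends. For q = 4, n2 = true, so the process proceeds to the next step. In this way, data 1 and data 3 are regarded as a candidate in concept set 1 because only one of them contains a character string belonging to a concept, and the operation ends.
Next, the candidate selection unit 103 tentatively sets the candidate flag to false (step A27).
Next, the candidate selection unit 103 initializes to 1 the variable j, which indicates a concept number in concept set i (step A28).
Next, the candidate selection unit 103 compares j with the number of concepts J in concept set i (step A29). If j is less than or equal to J, the process proceeds to the next step. If j is greater than J, the candidate flag remains false, that is, data p and data q are regarded as not being a candidate in concept set i, and the operation ends. For example, suppose that the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 are those of FIG. 3, and p is 1. For q = 2, m1[1] = true and m2[1] = false, and m1[2] = false and m2[2] = true, so the next step never finds both to be true; j eventually exceeds J at this step, and the operation ends with data 1 and data 2 regarded as not being a candidate in concept set 1. In this way, data 1 and data 2 are regarded as not being a candidate because they contain no character string of the same concept and both of them contain a character string belonging to a concept.
Next, the candidate selection unit 103 checks whether both m1[j] and m2[j] are true (step A30). If both are true, the process proceeds to step A32. If either is false, the process proceeds to the next step.
Next, the candidate selection unit 103 increments j (step A31). The process then returns to step A29.
If both m1[j] and m2[j] are true in step A30, the candidate selection unit 103 sets the candidate flag to true (step A32). The operation then ends with the candidate flag true, that is, with data p and data q regarded as a candidate in concept set i. For example, suppose that the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 are those of FIG. 3, and p is 1. For q = 4, m1[1] = true and m2[1] = true, so the process reaches this step and the operation ends with data 1 and data 4 regarded as a candidate in concept set 1. In this way, data 1 and data 4 are regarded as a candidate in concept set 1 because they contain character strings of the same concept.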
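The per-concept-set check of FIG. 6 reduces to a few lines once the flags m1[j], m2[j], n1, and n2 are expressed as lists and booleans. The sketch below is an illustrative reading of steps A13 to A32, not the patented implementation. The function names are hypothetical, and the English strings are assumed renderings of the FIG. 2 and FIG. 3 examples.

```python
def is_candidate_in_set(data_p: str, data_q: str, concept_set: list[list[str]]) -> bool:
    """FIG. 6 check for one concept set (each concept is a list of strings)."""
    # m1[j] / m2[j]: does data p / q contain any string of concept j? (A17-A25)
    m1 = [any(s in data_p for s in concept) for concept in concept_set]
    m2 = [any(s in data_q for s in concept) for concept in concept_set]
    n1, n2 = any(m1), any(m2)  # does the data mention any concept at all?
    if not (n1 and n2):
        return True  # step A26: at most one side mentions a concept
    # Steps A27-A32: both mention concepts, so they must share at least one.
    return any(a and b for a, b in zip(m1, m2))


def is_candidate(data_p: str, data_q: str, concept_sets: list[list[list[str]]]) -> bool:
    """FIG. 5, steps A5-A9: the pair must pass the check in every concept set."""
    return all(is_candidate_in_set(data_p, data_q, cs) for cs in concept_sets)


# Assumed English renderings of concept set 1 (FIG. 2) and data 1-4 (FIG. 3).
CONCEPT_SET_1 = [
    ["order-receiving management system", "order-receiving system"],
    ["order-placing management system", "order-placing system"],
]
DATA = {
    1: "The order-receiving management system supports telephone, FAX, and EDI.",
    2: "The order-placing management system supports telephone, FAX, and EDI.",
    3: "The order management system supports telephone, FAX, and EDI.",
    4: "The order-receiving system supports telephone, FAX, and EDI.",
}

for q in (2, 3, 4):
    print(1, q, is_candidate_in_set(DATA[1], DATA[q], CONCEPT_SET_1))
```

For p = 1 this reproduces the walkthrough above: the pair with data 2 is rejected, while the pairs with data 3 and data 4 are accepted.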
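As described above, according to this embodiment, the relevance between data can be determined accurately even on the basis of an incomplete concept set in which elements of two concepts overlap or elements of a concept are missing.
(Second embodiment)
Next, a second embodiment of the relevance determination system according to the present invention will be described in detail with reference to the drawings. The system according to the second embodiment constructs concepts from a glossary stored in advance and calculates a similarity indicating the relevance between data on the basis of the constructed concepts and the data to be compared. The description here takes as an example the case where the character strings representing concepts and the data are in natural language.
In system and software development, the terms used within a project are often organized into a glossary in order to eliminate ambiguity. In this embodiment, concepts are constructed from such a glossary, and the similarity indicating the relevance between data is then calculated in the same way as in the first embodiment. Components identical to those of the first embodiment are given the same reference numerals, and detailed description is omitted.
Referring to FIG. 7, the relevance determination system according to this embodiment comprises a storage unit 21 that stores information and a calculation unit 22 that operates under program control.
The storage unit 21 includes a glossary storage unit 200, a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The glossary storage unit 200 stores a glossary used in system and software development. A glossary is a collection of terms, which are character strings, and desirably also includes character strings that are words and phrases related to the terms. Here, related words and phrases are synonyms, near-synonyms, related terms, and the like. FIG. 8 is an explanatory diagram showing an example of a glossary stored in the glossary storage unit 200. In the figure, the first row describes the contents of each column: the first column holds the term, the second column the meaning of the term, and the third column the terms related to it. For example, the figure shows that the term "order receiving" means "to receive an order" and that there is a related term, "special receiving", used for special cases of order receiving. The figure also shows that the term "order placing" means "to place an order" and that there is a term, "special placing", used in place of "order placing" in special cases. FIG. 9 is an explanatory diagram showing another example of a glossary stored in the glossary storage unit 200. In FIG. 9, the first row describes the contents of each column: the first column holds the component name and the second column the abbreviation of the component. For example, FIG. 9 shows that there are two components, the "order-receiving management system" and the "order-placing management system", that the abbreviation of "order-receiving management system" is "order-receiving system", and that the abbreviation of "order-placing management system" is "order-placing system".
The other components of the storage unit 21, namely the concept storage unit 100, the data storage unit 101, and the candidate storage unit 102, are the same as in the first embodiment.
The calculation unit 22 includes a concept construction unit 201, a candidate selection unit 103, and a similarity calculation unit 104.
The concept construction unit 201 constructs character strings representing concepts on the basis of the glossary stored in the glossary storage unit 200 and stores them in the concept storage unit 100.
The candidate selection unit 103 and the similarity calculation unit 104 are the same as in the first embodiment.
Next, the operation of the concept construction unit 201 of the second embodiment will be described in detail with reference to the flowchart of FIG. 10.
First, the concept construction unit 201 extracts from the glossary the terms at a specified location (step A33). The location may be specified by a user of the system through input via a keyboard or the like, stored in the system as a default, or specified in some other manner. For example, in FIG. 8, the first column excluding the first row may be specified as the location of the terms. In FIG. 9 as well, the first column excluding the first row may be specified as the location of the terms.
Next, the concept construction unit 201 extracts from the glossary the related words and phrases at a specified location (step A34). The location may be specified by a user of the system through input via a keyboard or the like, stored in the system as a default, or specified in some other manner. For example, in FIG. 8, the third column excluding the first row may be specified as the location of the related words and phrases. In FIG. 9, the second column excluding the first row may be specified.
Next, the concept construction unit 201 combines each extracted term with its related words and phrases to construct the character strings representing a concept (step A35). For example, to produce the format of FIG. 2, the character strings of the extracted term and its related words and phrases may be arranged with "/" between them to form a concept.
Next, the concept construction unit 201 gathers the individual constructed concepts into a concept set (step A36). For example, to produce the format of FIG. 2, the constructed concepts may be arranged with "," between them to form a concept set. The constructed concept set is stored in the concept storage unit 100, and the operation of the concept construction unit 201 ends. For example, the concept set constructed from FIG. 9 is the first row of the concept information of FIG. 2.
The processing that calculates the similarity indicating the relevance between data after the concept construction unit 201 has registered the concept set is the same as in the first embodiment, so its description is omitted.
As described above, according to this embodiment, concepts can be constructed automatically from a glossary, and the similarity indicating the relevance between data can be calculated using those concepts. Although the case of extracting terms and related words and phrases from a glossary has been described here as an example, when the data themselves contain explanations of terms, for example, the data may be treated as a glossary and terms extracted from them.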
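Steps A33 to A36 amount to turning each glossary row into one concept and joining the concepts into a FIG. 2-style row. The following sketch assumes the glossary has already been reduced to (term, related term) pairs, as in the FIG. 9 example. The function names are hypothetical, and the English strings are assumed renderings of the Japanese terms.

```python
def build_concept_set(rows: list[tuple[str, str]]) -> list[list[str]]:
    """Steps A33-A36: one concept per glossary row (term plus related term)."""
    concept_set = []
    for term, related in rows:
        concept = [term]
        if related:  # a row may have no related term
            concept.append(related)
        concept_set.append(concept)
    return concept_set


def format_concept_set(concept_set: list[list[str]]) -> str:
    """Serialize in the FIG. 2 style: '/' within a concept, ',' between concepts."""
    return ",".join("/".join(concept) for concept in concept_set)


# FIG. 9 as described in the text: component name and its abbreviation.
glossary_rows = [
    ("order-receiving management system", "order-receiving system"),
    ("order-placing management system", "order-placing system"),
]

line = format_concept_set(build_concept_set(glossary_rows))
print(line)
```

The resulting string corresponds to what the text describes as the first row of the concept information of FIG. 2.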
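(Third embodiment)
Next, a third embodiment of the relevance determination system according to the present invention will be described in detail with reference to the drawings. Components identical to those of the first and second embodiments are given the same reference numerals, and detailed description is omitted.
Referring to FIG. 11, the relevance determination system according to this embodiment comprises a storage unit 31 that stores information and a calculation unit 32 that operates under program control.
The storage unit 31 includes a structure data storage unit 300, a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The structure data storage unit 300 stores structure data that have a hierarchical structure and are given item names and contents for that hierarchy. The structure data may be specified by a user of the system through input via a keyboard or the like, stored in the system as defaults, or specified in some other manner. FIG. 12 is an explanatory diagram showing an example of the structure data stored in the structure data storage unit 300. In the figure, the first row describes the contents of each column: the first column holds the item name at the major-category level, the second column the item name at the minor-category level, and the third column the contents. Structure data may also be created from an ordinary document with chapters and sections by automatically extracting the chapter and section information as item names, for example taking chapter titles as major-category item names and section titles as minor-category item names, and then stored in the structure data storage unit 300.
The other components of the storage unit 31, namely the concept storage unit 100, the data storage unit 101, and the candidate storage unit 102, are as described above.
The calculation unit 32 includes a concept construction unit 301, a data generation unit 302, a candidate selection unit 103, and a similarity calculation unit 104.
The concept construction unit 301 constructs concepts on the basis of the structure data stored in the structure data storage unit 300 and stores them in the concept storage unit 100.
The data generation unit 302 generates data on the basis of the structure data stored in the structure data storage unit 300 and stores them in the data storage unit 101.
The candidate selection unit 103 and the similarity calculation unit 104 are as described above.
Next, the operation of the concept construction unit 301 of the third embodiment will be described in detail with reference to the flowchart of FIG. 13.
First, the concept construction unit 301 extracts the character strings that serve as item names from the structure data stored in the structure data storage unit 300 (step A37). For example, in FIG. 12, it extracts the major-category character strings "functional specification" and "screen specification" and the minor-category character strings "order-receiving management system", "order-placing management system", "settings screen", and "display screen".
Next, the concept construction unit 301 constructs concepts from the extracted item names (step A38). For example, in FIG. 12, one concept set is constructed from the major-category character strings and another from the minor-category character strings. FIG. 14 is an explanatory diagram showing an example of the concepts constructed by the concept construction unit 301.
Next, the concept construction unit 301 stores the constructed concepts in the concept storage unit 100 (step A39). The processing of the concept construction unit 301 then ends.
Next, the operation of the data generation unit 302 of the third embodiment will be described in detail with reference to the flowchart of FIG. 15.
First, the data generation unit 302 extracts the character strings that serve as item names from the structure data stored in the structure data storage unit 300 (step A40). This step is the same as the item-name extraction performed by the concept construction unit 301.
Next, the data generation unit 302 extracts the character strings representing the contents from the structure data stored in the structure data storage unit 300 (step A41).
Next, the data generation unit 302 creates data by arranging the item names and the contents in sequence (step A42). FIG. 16 is an explanatory diagram showing an example of the data generated by the data generation unit 302 when the structure data storage unit 300 holds the contents of FIG. 12. Here, the data are generated by arranging the item names and the contents with a full stop ("。") between them.
Next, the data generation unit 302 stores the generated data in the data storage unit 101 (step A43).
Here, the structure data stored in the structure data storage unit correspond one-to-one with the data stored in the data storage unit. Therefore, when structure data are input, the relevance between structure data can be determined by the same processing as in the first and second embodiments, using the concepts constructed by the concept construction unit 301 and stored in the concept storage unit 100 and the data generated by the data generation unit 302 and stored in the data storage unit 101.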
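The two derivations of the third embodiment, concept sets from item names (steps A37 to A39) and data strings from item names plus contents (steps A40 to A43), can be sketched together. The rows below are hypothetical: the item names follow the text, but the contents of FIG. 12 and the pairing of minor items with major items are not given and are invented here for illustration. The function names and the ". " separator are likewise assumptions.

```python
# Hypothetical structure data: (major item name, minor item name, content).
ROWS = [
    ("functional specification", "order-receiving management system", "supports telephone, FAX, and EDI"),
    ("functional specification", "order-placing management system", "supports telephone, FAX, and EDI"),
    ("screen specification", "settings screen", "lets the user change settings"),
    ("screen specification", "display screen", "shows the order list"),
]


def unique_in_order(items):
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(items))


def build_concept_sets(rows):
    """Steps A37-A39: one concept set per hierarchy level, one concept per item name."""
    majors = unique_in_order(r[0] for r in rows)
    minors = unique_in_order(r[1] for r in rows)
    return [[[name] for name in majors], [[name] for name in minors]]


def generate_data(rows, sep=". "):
    """Steps A40-A43: join item names and content into one string per row."""
    return {i: sep.join(row) for i, row in enumerate(rows, start=1)}


concept_sets = build_concept_sets(ROWS)
data = generate_data(ROWS)
print(concept_sets[0])
print(data[1])
```

Because each generated string embeds its own item names, two rows filed under different items mention different concepts in the same concept set and are excluded from similarity calculation, even when their contents are identical.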
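As described above, according to this embodiment, concepts can be constructed automatically from the structural information of structured data, and the similarity indicating the relevance between data can be calculated.
Each unit of the relevance determination system may be realized by a combination of hardware and software. In such a form, the relevance determination program is loaded into RAM, and hardware such as a control unit (CPU) is operated on the basis of the program, thereby realizing each unit as the respective means. The relevance determination program may also construct the above units by causing an operating system or other general software to execute each process.
The program may be recorded on a storage medium in a fixed manner and distributed. The program recorded on the recording medium is read into memory via a wired connection, a wireless connection, or the recording medium itself, and operates the control unit and the like. Examples of the recording medium include optical disks, magnetic disks, semiconductor memory devices, and hard disks.
As illustrated in FIGS. 17 and 18, the relevance determination system may be built as a single computer or as a server-client system.
Expressed another way, the above embodiments can be realized by operating the control unit of an information processing apparatus, which is made to operate as the relevance determination system, as the candidate selection unit and the similarity calculation unit on the basis of the relevance determination program loaded into RAM. In addition, the control unit can be operated as the concept construction unit and the data generation unit.
As described above, the relevance determination system according to the present invention can accurately determine the relevance between data on the basis of incomplete concept information in which not all information is registered.
The specific configuration of the present invention is not limited to the embodiments described above, and modifications that do not depart from the gist of the invention are included in the invention. The desired effects can also be obtained by appropriately combining a plurality of the constituent elements. For example, some of the constituent elements shown in the embodiments may be integrated or deleted.
Some or all of the above embodiments may also be described as follows. The following supplementary notes do not limit the present invention in any way.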
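[Supplementary Note 1]
A system for determining relevance between data, comprising:
a candidate selection unit that, on the basis of one or more concept sets each having as elements a plurality of concepts, each concept having as elements one or more character strings indicating a characteristic of the data to be compared, the data consisting of character strings subject to determination, selects the data to be compared as a candidate for similarity calculation when the data contain character strings of the same concept or when only one of the data contains a character string belonging to a concept; and
a similarity calculation unit that calculates the similarity for candidates selected by the candidate selection unit, sets the similarity to a predetermined small value for candidates not selected by the candidate selection unit, and outputs the similarity of the data to be compared.
[Supplementary Note 2]
The system for determining relevance between data according to the above supplementary note, wherein the candidate selection unit selects candidates for similarity calculation using, as the concept set, an incomplete concept set in which elements of two concepts overlap or elements of a concept are missing.
[Supplementary Note 3]
The system for determining relevance between data according to the above supplementary notes, wherein the candidate selection unit has a plurality of concept sets and selects the two items of data being compared as a candidate for similarity calculation when, in every one of the concept sets, the two items contain character strings of the same concept or only one of them contains a character string belonging to a concept.
[Supplementary Note 4]
The system for determining relevance between data according to the above supplementary notes, wherein the similarity calculation unit calculates the similarity between the data to be compared using an approximation of Kolmogorov complexity.
[Supplementary Note 5]
The system for determining relevance between data according to the above supplementary notes, further comprising a concept construction unit that constructs the concept set.
[Supplementary Note 6]
The system for determining relevance between data according to the above supplementary notes, wherein the concept construction unit,
on the basis of a glossary describing terms, which are character strings, and their related words and phrases, constructs one concept from each term and its related words and phrases as elements, and
constructs one concept set having the individual constructed concepts as elements.
[Supplementary Note 7]
The system for determining relevance between data according to the above supplementary notes, wherein the concept construction unit,
on the basis of structure data given a plurality of item names and contents, constructs a concept from each item name as an element, and
constructs a concept set having the individual constructed concepts as elements.
[Supplementary Note 8]
The system for determining relevance between data according to the above supplementary notes, further comprising a data generation unit that, on the basis of structure data given a plurality of item names and contents, generates as data a character string in which the item names and the contents are concatenated.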
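[Supplementary Note 9]
A method for determining relevance between data, comprising:
a candidate selection step of, on the basis of one or more concept sets each having as elements a plurality of concepts, each concept having as elements one or more character strings indicating a characteristic of the data to be compared, the data consisting of character strings subject to determination, selecting the data to be compared as a candidate for similarity calculation when the data contain character strings of the same concept or when only one of the data contains a character string belonging to a concept; and
a similarity calculation step of calculating the similarity for candidates selected in the candidate selection step, setting the similarity to a predetermined small value for candidates not selected in the candidate selection step, and making the similarity of the data to be compared available for output.
[Supplementary Note 10]
The method for determining relevance between data according to the above supplementary note, wherein the candidate selection step selects candidates for similarity calculation using, as the concept set, an incomplete concept set in which elements of two concepts overlap or elements of a concept are missing.
[Supplementary Note 11]
The method for determining relevance between data according to the above supplementary notes, wherein the candidate selection step uses a plurality of concept sets and selects the two items of data being compared as a candidate for similarity calculation when, in every one of the concept sets, the two items contain character strings of the same concept or only one of them contains a character string belonging to a concept.
[Supplementary Note 12]
The method for determining relevance between data according to the above supplementary notes, wherein the similarity calculation step calculates the similarity between the data to be compared using an approximation of Kolmogorov complexity.
[Supplementary Note 13]
The method for determining relevance between data according to the above supplementary notes, further comprising a concept construction step of constructing the concept set.
[Supplementary Note 14]
The method for determining relevance between data according to the above supplementary notes, wherein the concept construction step,
on the basis of a glossary describing terms, which are character strings, and their related words and phrases, constructs one concept from each term and its related words and phrases as elements, and
constructs one concept set having the individual constructed concepts as elements.
[Supplementary Note 15]
The method for determining relevance between data according to the above supplementary notes, wherein the concept construction step,
on the basis of structure data given a plurality of item names and contents, constructs a concept from each item name as an element, and
constructs a concept set having the individual constructed concepts as elements.
[Supplementary Note 16]
The method for determining relevance between data according to the above supplementary notes, further comprising a data generation step of, on the basis of structure data given a plurality of item names and contents, generating as data a character string in which the item names and the contents are concatenated.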
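[Supplementary Note 17]
A program for determining relevance between data, the program causing a control unit of an information processing apparatus to execute:
a candidate selection process of, on the basis of one or more concept sets each having as elements a plurality of concepts, each concept having as elements one or more character strings indicating a characteristic of the data to be compared, the data consisting of character strings subject to determination, selecting the data to be compared as a candidate for similarity calculation when the data contain character strings of the same concept or when only one of the data contains a character string belonging to a concept; and
a similarity calculation process of calculating the similarity for candidates selected in the candidate selection process, setting the similarity to a predetermined small value for candidates not selected in the candidate selection process, and calculating the similarity of the data to be compared.
[Supplementary Note 18]
The program for determining relevance between data according to the above supplementary note, wherein the candidate selection process selects candidates for similarity calculation using, as the concept set, an incomplete concept set in which elements of two concepts overlap or elements of a concept are missing.
[Supplementary Note 19]
The program for determining relevance between data according to the above supplementary notes, wherein the candidate selection process uses a plurality of concept sets and selects the two items of data being compared as a candidate for similarity calculation when, in every one of the concept sets, the two items contain character strings of the same concept or only one of them contains a character string belonging to a concept.
[Supplementary Note 20]
The program for determining relevance between data according to the above supplementary notes, wherein the similarity calculation process calculates the similarity between the data to be compared using an approximation of Kolmogorov complexity.
[Supplementary Note 21]
The program for determining relevance between data according to the above supplementary notes, the program further causing a concept construction process of constructing the concept set to be performed.
[Supplementary Note 22]
The program for determining relevance between data according to the above supplementary notes, wherein the concept construction process,
on the basis of a glossary describing terms, which are character strings, and their related words and phrases, constructs one concept from each term and its related words and phrases as elements, and
constructs one concept set having the individual constructed concepts as elements.
[Supplementary Note 23]
The program for determining relevance between data according to the above supplementary notes, wherein the concept construction process,
on the basis of structure data given a plurality of item names and contents, constructs a concept from each item name as an element, and
constructs a concept set having the individual constructed concepts as elements.
[Supplementary Note 24]
The program for determining relevance between data according to the above supplementary notes, the program further causing a data generation process to be performed that, on the basis of structure data given a plurality of item names and contents, generates as data a character string in which the item names and the contents are concatenated.
[Supplementary Note 25]
A recording medium on which the program for determining relevance between data according to the above supplementary notes is recorded.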
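The present invention can be used in many systems that quantify and use the similarity between data. For example, in a system for checking specifications, a system for checking procedure manuals, or a system that expands the keywords used when retrieving information from a database, accuracy can be improved even by setting concept information that is only incomplete.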
この出願は、２０１１年８月８日に出願された日本出願特願２０１１−１７２９２４号を基礎とする優先権を主張し、その開示の全てをここに取り込む。(First embodiment)
Next, a first embodiment of the data relationship determination system according to the present invention will be described in detail with reference to the drawings. The relationship determination system between data which concerns on 1st Embodiment calculates the similarity which shows the relationship between data based on the concept and data which were stored beforehand. Here, a case where a natural language is used as a character string or data character string that is a concept element will be described as an example. However, the character string to be used is any character string indicating an image, time-series data, or the like. It does not matter.
Referring to FIG. 1, the system for determining relevance between data according to the present embodiment includes a storage unit 11 that stores information and a calculation unit 12 that operates under program control.
The storage unit 11 includes a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The concept storage unit 100 stores, for each of a plurality of concepts, one or more character strings indicating that concept. The concepts may be specified by the user of the system via a keyboard or the like, may be stored in the system as defaults, or may be specified in other ways.
FIG. 2 is an explanatory diagram illustrating an example of phrases indicating concepts stored in the concept storage unit 100. In the figure, one row represents one concept set, each item delimited by “,” represents one concept, and within one concept, character strings separated by “/” represent variations of character strings belonging to the same concept.
It is desirable that any two concepts included in one concept set be disjoint (that is, no element of one concept matches an element of the other), but this is not a requirement. Moreover, two concepts that belong to different concept sets need not be disjoint.
For example, in the figure, the character string “order receiving management system” and the character string “order receiving system” are elements of the same concept. In addition, the character string “ordering management system” and the character string “ordering system” are elements of a concept different from the one to which “order receiving management system” and “order receiving system” belong. Further, “telephone” and “FAX” are elements of different concepts. Note that the figure does not indicate whether “order receiving management system” and “telephone” belong to the same concept or to different concepts.
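The row format described for FIG. 2 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the function name `parse_concept_sets` and the two sample rows are hypothetical, reconstructed from the prose description rather than copied from the figure.

```python
def parse_concept_sets(rows):
    """Parse rows written in the format described for FIG. 2.

    Each row is one concept set, "," separates concepts, and "/" separates
    variant character strings that belong to the same concept.
    """
    concept_sets = []
    for row in rows:
        row = row.strip()
        if not row:
            continue  # ignore blank rows
        concepts = []
        for concept_text in row.split(","):
            variants = [s.strip() for s in concept_text.split("/") if s.strip()]
            if variants:
                concepts.append(variants)
        concept_sets.append(concepts)
    return concept_sets


# Hypothetical rows modelled on the description of FIG. 2.
rows = [
    "order receiving management system/order receiving system,"
    "ordering management system/ordering system",
    "telephone,FAX",
]
concept_sets = parse_concept_sets(rows)
print(concept_sets[0][0])
```

In this representation a concept set is a list of concepts and a concept is a list of character strings, so strings in the same inner list belong to the same concept, while nothing is implied about strings that appear in different concept sets.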
By using the concept set having the plurality of concepts as elements, the candidate selection unit 103 selects whether or not the two pieces of data to be compared should be selected as candidates for similarity calculation.
The data storage unit 101 stores a plurality of data whose relevance is to be determined. Each data is composed of character strings, which may be specified by the user of the system via a keyboard or the like, may be stored in the system as defaults, or may be specified in other ways. FIG. 3 is an explanatory diagram illustrating an example of data stored in the data storage unit 101. In the figure, the first row describes the contents of each column, the first column shows the data ID number, and the second column shows the data content. That is, the figure shows that the first data is “the order receiving management system supports telephone, FAX, EDI”, the second data is “the ordering management system supports telephone, FAX, EDI”, the third data is “the order management system supports telephone, FAX, EDI”, and the fourth data is “the order receiving system supports telephone, FAX, EDI”. Here, each data is Japanese text, but it may be any character string, including one that represents an image or time-series data. Hereinafter, the first data (ID = 1) is referred to as data 1, the second data (ID = 2) as data 2, and so on.
The candidate storage unit 102 stores the ID number of a given data in association with the ID numbers of the data that are candidates for similarity calculation with it. FIG. 4 is an explanatory diagram illustrating an example of candidates stored in the candidate storage unit 102. Here, the similarity between data 1 and data 2 is regarded as the same as the similarity between data 2 and data 1, so only data with a larger ID number than a given data are recorded as its candidates. Each row is delimited by “,”, and the numbers after the first number are the candidates for similarity calculation with the first number. That is, the figure shows that for data 1 the similarity is calculated with data 3 and data 4 but not with data 2. For data 2, the similarity is calculated with data 3 but not with data 4. Furthermore, the figure shows that for data 3 the similarity with data 4 is not calculated.
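As a rough illustration of the candidate rows described for FIG. 4, the following Python sketch reads such rows into a lookup table. The names `parse_candidates` and `is_candidate` are hypothetical, and the two sample rows only encode the relationships stated above for data 1 and data 2.

```python
def parse_candidates(rows):
    """Read rows such as "1,3,4": the first number is a data ID and the
    remaining numbers are the IDs it should be compared with."""
    candidates = {}
    for row in rows:
        ids = [int(part) for part in row.split(",") if part.strip()]
        if ids:
            candidates[ids[0]] = ids[1:]
    return candidates


def is_candidate(candidates, a, b):
    """Look up a pair regardless of order, since similarity is symmetric
    and only the larger ID is stored under the smaller one."""
    low, high = sorted((a, b))
    return high in candidates.get(low, [])


# Sample rows following the text: data 1 pairs with 3 and 4, data 2 with 3.
stored = parse_candidates(["1,3,4", "2,3"])
print(is_candidate(stored, 4, 1), is_candidate(stored, 1, 2))
```

Storing each pair only under the smaller ID halves the table and avoids computing the same similarity twice.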
The calculation unit 12 includes a candidate selection unit 103 and a similarity calculation unit 104.
The candidate selection unit 103 selects candidates for similarity calculation based on the concept information stored in the concept storage unit 100 and the data to be determined stored in the data storage unit 101, and stores the selected candidates in the candidate storage unit 102.
The similarity calculation unit 104 calculates the similarity between the data based on the candidates stored in the candidate storage unit 102 and the data to be determined stored in the data storage unit 101. At this time, the similarity calculation unit 104 sets the similarity to a predetermined small value for candidates not selected by the candidate selection unit 103 (that is, elements not stored in the candidate storage unit 102).
Next, the overall operation of the first embodiment will be described in detail with reference to the flowchart of FIG.
First, the candidate selection unit 103 initializes the value of a variable p indicating a data number (ID number illustrated in FIG. 3) to 1 (step A1).
Next, the candidate selection unit 103 compares p with the number of data M indicating the total number of data (step A2). If p is M or less, the process proceeds to the next step. If p is greater than M, the process proceeds to step A12 performed by the similarity calculation unit 104.
Next, the candidate selection unit 103 initializes the value of the variable q indicating the number of data to be determined as a candidate together with p to p + 1 (step A3).
Next, the candidate selection unit 103 compares q with the number of data M (step A4). If q is less than or equal to M, the process proceeds to the next step. If it is larger than M, the process proceeds to step A11.
Next, the candidate selection unit 103 initializes the value of the variable i indicating the concept set to 1 (step A5). Hereinafter, the i-th concept set is referred to as concept set i.
Next, the candidate selection unit 103 compares i with the number of concept sets I indicating the total number of concept sets (step A6). If i is equal to or less than I, the process proceeds to the next step. If i is larger than I, it is determined that data p and data q, specified by the variables p and q, are candidates in all concept sets, and the process proceeds to step A9.
Next, the candidate selection unit 103 performs processing to check whether both data p and data q can be candidates for calculating similarity when the concept set i is used as a reference (step A7). Details of the processing will be described later. If it is not a candidate, the process proceeds to step A10 (No in step A7). If it is a candidate, the process proceeds to the next step (Yes in step A7).
Next, the candidate selection unit 103 increments i to use the next concept set as a reference (step A8). And it transfers to step A6.
If i is greater than I in step A6, the candidate selection unit 103 stores in the candidate storage unit 102 that p and q are similarity calculation candidates (step A9).
Next, the candidate selection part 103 increments q (step A10). Then, the process proceeds to step A4.
If q is larger than the number of data M in step A4, the candidate selection unit 103 increments p (step A11). Then, the process proceeds to step A2.
If p is larger than M in step A2, the similarity calculation unit 104 calculates the similarity between the data for the candidates stored in the candidate storage unit 102 (step A12). Here, the similarity between non-candidate data is set to zero. Then, the operation ends. The similarity between data can be calculated, for example, using an estimate of Kolmogorov complexity. The calculated similarity may be output immediately via a display device or a printing device, or may be stored and output in response to a request from a user of the system.
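Step A12 can be sketched in Python as follows. The text only says that an estimate of Kolmogorov complexity may be used. As an assumption made for illustration, this sketch substitutes the normalized compression distance computed with zlib, which is one common compression-based approximation and not necessarily the estimate the inventors used. The function names and the sample texts are hypothetical.

```python
import zlib


def compressed_size(text):
    """Approximate the information content of a string by its zlib size."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd_similarity(x, y):
    """Return 1 - NCD(x, y), where NCD is the normalized compression
    distance. For short strings the value is only a rough estimate."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return 1.0 - (cxy - min(cx, cy)) / max(cx, cy)


def similarity_table(data, candidates, non_candidate_value=0.0):
    """Compute similarity for candidate pairs only (step A12).

    data maps ID -> text; candidates maps an ID to the larger IDs it is
    paired with. Non-candidate pairs get a fixed small value (zero here).
    """
    ids = sorted(data)
    table = {}
    for index, p in enumerate(ids):
        for q in ids[index + 1:]:
            if q in candidates.get(p, []):
                table[(p, q)] = ncd_similarity(data[p], data[q])
            else:
                table[(p, q)] = non_candidate_value
    return table


data = {
    1: "The order receiving management system supports telephone, FAX, EDI.",
    2: "The ordering management system supports telephone, FAX, EDI.",
    4: "The order receiving system supports telephone, FAX, EDI.",
}
table = similarity_table(data, {1: [4]})
print(table)
```

Because data 1 and data 2 are not candidates, their similarity is fixed at zero even though the two strings are almost identical, which is the behavior the embodiment aims for.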
Next, the operation of the candidate selection unit 103 that checks whether the data p and q are candidates for similarity calculation in the concept set i of the first embodiment will be described in detail with reference to the flowchart of FIG.
First, the candidate selecting unit 103 sets a candidate flag indicating whether data p and data q are candidates for similarity calculation to true (step A13).
Next, the candidate selection unit 103 sets a variable n1, which indicates whether data p includes one or more character strings indicating a concept, to false, and sets a variable n2, which indicates whether data q includes one or more character strings indicating a concept, to false (step A14).
Next, the candidate selection unit 103 initializes a variable j indicating the concept number in the concept set i stored in the concept storage unit 100 to 1 (step A15). Hereinafter, the jth concept is referred to as concept j.
Next, the candidate selection unit 103 compares j with the number of concepts J indicating the total number of concepts included in the concept set i (step A16). If j is less than or equal to J, the process proceeds to the next step. If j is larger than J, the process proceeds to step A26.
Next, the candidate selection unit 103 sets m1[j], which indicates whether data p includes concept j, to false, and sets m2[j], which indicates whether data q includes concept j, to false (step A17).
Next, the candidate selection unit 103 initializes a variable k indicating the number of the character string indicating the concept j of the concept set i to 1 (step A18). Hereinafter, the k-th character string is referred to as a character string k.
Next, the candidate selection unit 103 compares k with the number K of character strings indicating the total number of character strings included in the concept j of the concept set i (step A19). If k is K or less, the process proceeds to the next step. If k is larger than K, the process proceeds to step A25.
Next, the candidate selection unit 103 checks whether the data p includes the character string k of the concept j of the concept set i (step A20). If so, move on to the next step. If not included, the process proceeds to step A22.
Next, the candidate selection unit 103 sets n1 to true and sets m1[j] to true (step A21). For example, consider the case where the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 is that of FIG. 3, and p is 1. Under this condition, when i = 1, j = 1, and k = 1, “order receiving management system”, which is character string 1 of concept 1 of concept set 1, is included as a character string in data 1, “the order receiving management system supports telephone, FAX, EDI”. The process therefore reaches this step, where n1 is set to true and m1[1] is set to true. On the other hand, when i = 1, j = 1, and k = 2, “order receiving system”, which is character string 2 of concept 1 of concept set 1, is not included in data 1, so the process does not reach this step. Similarly, neither “ordering management system” (i = 1, j = 2, k = 1) nor “ordering system” (i = 1, j = 2, k = 2) is included in data 1, so the process does not reach this step. Therefore, with the contents illustrated in FIGS. 2 and 3, for i = 1, n1 = true, m1[1] = true, and m1[2] = false.
Next, the candidate selection unit 103 checks whether the data q includes the character string k of the concept j of the concept set i (step A22). If so, move on to the next step. If not included, the process proceeds to step A24.
Next, the candidate selection unit 103 sets n2 to true and sets m2[j] to true (step A23). For example, consider the case where the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 is that of FIG. 3, and q is 2. Under this condition, when i = 1, j = 1, and k = 1, “order receiving management system”, which is character string 1 of concept 1 of concept set 1, is not included as a character string in data 2, “the ordering management system supports telephone, FAX, EDI”. The process therefore does not reach this step. Similarly, when i = 1, j = 1, and k = 2, “order receiving system” is not included in data 2, so the process does not reach this step. On the other hand, when i = 1, j = 2, and k = 1, “ordering management system” is included in data 2, so the process reaches this step and n2 = true and m2[2] = true are set. Therefore, with the contents illustrated in FIGS. 2 and 3, for q = 2 and i = 1, n2 = true, m2[1] = false, and m2[2] = true. When q is 3 and i = 1, data 3 does not include the corresponding character string for any j and k, so the process never reaches this step, and n2 = false, m2[1] = false, and m2[2] = false. When q is 4, the process reaches this step only when i = 1, j = 1, and k = 2, so n2 = true, m2[1] = true, and m2[2] = false.
Next, the candidate selection unit 103 increments k (step A24). Then, the process proceeds to step A19.
When k becomes larger than K in step A19, the candidate selection unit 103 increments j (step A25). Then, the process proceeds to step A16.
When j becomes larger than J in step A16, the candidate selection unit 103 checks whether both n1 and n2 are true (step A26). If both are true, the process proceeds to the next step. If one or both are false, the candidate flag remains true, that is, data p and data q are candidates in the concept set i, and the operation ends. For example, when the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 is that of FIG. 3, and p is 1, n1 = true. When q = 2, n2 = true, so the process proceeds to the next step. When q = 3, n2 = false, so data 1 and data 3 are candidates and the operation ends. When q = 4, n2 = true, so the process proceeds to the next step. Thus, data 1 and data 3 are treated as candidates because only one of them includes a character string belonging to a concept in concept set 1.
Next, the candidate selection unit 103 temporarily sets the candidate flag to false (step A27).
Next, the candidate selection unit 103 initializes a variable j indicating the concept number of the concept set i to 1 (step A28).
Next, the candidate selection unit 103 compares j with the number of concepts J of the concept set i (step A29). If j is less than or equal to J, the process proceeds to the next step. If j is greater than J, the operation ends with the candidate flag remaining false, that is, with data p and data q not being candidates in concept set i. For example, when the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 is that of FIG. 3, p is 1, and q = 2, m1[1] = true while m2[1] = false, and m1[2] = false while m2[2] = true. The condition of the next step is therefore never satisfied for any j, j eventually exceeds J at this step, and the operation ends with data 1 and data 2 not being candidates in concept set 1. Thus, data 1 and data 2 are not candidates because both include character strings belonging to concepts but they do not include character strings of the same concept.
Next, the candidate selection unit 103 checks whether both m1[j] and m2[j] are true (step A30). If both are true, the process proceeds to step A32. If either one is false, the process proceeds to the next step.
Next, the candidate selection unit 103 increments j (step A31). Then, the process proceeds to step A29.
If both m1[j] and m2[j] are true in step A30, the candidate selection unit 103 sets the candidate flag to true (step A32). The operation then ends with the candidate flag true, that is, with data p and data q being candidates in the concept set i. For example, when the concept information recorded in the concept storage unit 100 is that of FIG. 2, the data to be determined recorded in the data storage unit 101 is that of FIG. 3, p is 1, and q = 4, m1[1] = true and m2[1] = true, so the process reaches this step, and the operation ends with data 1 and data 4 being candidates in concept set 1. Thus, data 1 and data 4 are candidates because they include character strings of the same concept in concept set 1.
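The candidate check in steps A13 to A32, together with the pair enumeration in steps A1 to A11, can be condensed into the following Python sketch. It is a minimal illustration, not the patented implementation: the function names are hypothetical, the flags m1, m2, n1, and n2 mirror the variables in the flowchart, and the English sample texts stand in for the Japanese data of FIG. 3.

```python
from itertools import combinations


def is_candidate_in_concept_set(data_p, data_q, concept_set):
    """Steps A13-A32 for one concept set.

    concept_set is a list of concepts; each concept is a list of strings.
    """
    # m1[j] / m2[j]: does data p / data q contain any string of concept j?
    m1 = [any(s in data_p for s in concept) for concept in concept_set]
    m2 = [any(s in data_q for s in concept) for concept in concept_set]
    n1, n2 = any(m1), any(m2)
    if not (n1 and n2):
        # Step A26: at most one of the data mentions a concept of this set.
        return True
    # Steps A27-A32: both mention concepts, so they must share one.
    return any(a and b for a, b in zip(m1, m2))


def select_candidates(data, concept_sets):
    """Steps A1-A11: a pair is a candidate only if it passes every concept set."""
    candidates = {}
    for p, q in combinations(sorted(data), 2):
        if all(is_candidate_in_concept_set(data[p], data[q], cs)
               for cs in concept_sets):
            candidates.setdefault(p, []).append(q)
    return candidates


concept_set_1 = [
    ["order receiving management system", "order receiving system"],
    ["ordering management system", "ordering system"],
]
data = {
    1: "The order receiving management system supports telephone, FAX, EDI.",
    2: "The ordering management system supports telephone, FAX, EDI.",
    3: "The order management system supports telephone, FAX, EDI.",
    4: "The order receiving system supports telephone, FAX, EDI.",
}
print(select_candidates(data, [concept_set_1])[1])
```

With these inputs, data 1 and data 2 are rejected because each names a different concept, data 1 and data 3 are accepted because only data 1 names a concept, and data 1 and data 4 are accepted because both name the same concept.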
As described above, according to the present embodiment, the relevance between data can be determined accurately even on the basis of an incomplete concept set in which elements of two concepts overlap or elements of a concept are missing.
(Second Embodiment)
Next, a second embodiment of the data relevance determination system according to the present invention will be described in detail with reference to the drawings. The system for determining relevance between data according to the second embodiment configures concepts from a glossary stored in advance and, based on the configured concepts and the data to be compared, calculates a similarity indicating the relevance between the data. Here, a case where the character strings indicating concepts and the data are natural language will be described as an example.
In system and software development, a glossary that organizes the terms used in a project is often created in order to eliminate ambiguity. In the present embodiment, concepts are configured using such a glossary, and then the similarity indicating the relevance between the data is calculated as in the first embodiment. Components similar to those of the first embodiment are given the same reference numerals, and their detailed description is omitted.
Referring to FIG. 7, the system for determining relevance between data according to the present embodiment includes a storage unit 21 that stores information and a calculation unit 22 that operates under program control.
The storage unit 21 includes a glossary storage unit 200, a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The glossary storage unit 200 stores a glossary used in system and software development. The glossary is a collection of terms, which are character strings, and preferably also includes character strings that are related phrases of the terms. Here, the related phrases are synonyms, near-synonyms, related words, and the like. FIG. 8 is an explanatory diagram illustrating an example of a glossary stored in the glossary storage unit 200. In the figure, the first row describes the contents of each column, the first column shows the term, the second column shows the meaning of the term, and the third column shows the related word of the term. For example, the figure shows that the meaning of the term “order received” is “receiving an order” and that, related to “order received”, there is a term “special order” used in the case of a special order. The figure also shows that the term “ordering” means “placing an order” and that there is a term “special issue” used in special cases instead of “ordering”. FIG. 9 is an explanatory diagram illustrating another example of a glossary stored in the glossary storage unit 200. In FIG. 9, the first row describes the contents of each column, the first column shows the component name, and the second column shows the abbreviation of the component. For example, FIG. 9 shows that there are two components, “order receiving management system” and “ordering management system”, that the abbreviation of “order receiving management system” is “order receiving system”, and that the abbreviation of “ordering management system” is “ordering system”.
The other components of the storage unit 21, which are the concept storage unit 100, the data storage unit 101, and the candidate storage unit 102, are the same as those in the first embodiment.
The calculation unit 22 includes a concept configuration unit 201, a candidate selection unit 103, and a similarity calculation unit 104.
Based on the glossary stored in the glossary storage unit 200, the concept configuration unit 201 configures character strings indicating concepts and stores them in the concept storage unit 100.
The candidate selection unit 103 and the similarity calculation unit 104 are the same as those in the first embodiment.
Next, the operation of the conceptual configuration unit 201 according to the second embodiment will be described in detail with reference to the flowchart in FIG.
First, the conceptual configuration unit 201 extracts a term at a designated place from the glossary (step A33). The designated location may be designated by a user of the system through a keyboard or the like, may be stored in the system as a default, or may be designated in other manners. For example, in FIG. 8, what is necessary is just to designate the 1st column except the 1st line as a place with a term. Also in FIG. 9, the first column except the first row may be designated as the place where the term exists.
Next, the concept constructing unit 201 extracts a related phrase at a designated location from the glossary (step A34). The designated location may be designated by a user of the system through a keyboard or the like, may be stored in the system as a default, or may be designated in other manners. For example, in FIG. 8, the third column excluding the first row may be designated as the location where the related phrase is located. In FIG. 9, the second column excluding the first row may be designated as a place where there is a related phrase.
Next, the concept configuration unit 201 collects the extracted terms and related phrases and configures character strings indicating concepts (step A35). For example, to produce the format shown in FIG. 2, each concept may be configured by joining the character strings of an extracted term and its related phrases with “/” as the separator.
Next, the concept configuration unit 201 configures a plurality of individual configured concepts as a concept set (step A36). For example, in order to configure the format shown in FIG. 2, the concept set may be configured by separating the configured concepts with “,”. The configured concept set is stored in the concept storage unit 100. Then, the operation of the conceptual configuration unit 201 ends. For example, in FIG. 9, the configured concept set is the first line of the concept information in FIG.
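Steps A33 to A36 can be sketched as below. This is a minimal sketch under stated assumptions: the glossary is assumed to be available as a table of rows with a header row, the function name `build_concept_set` and the column parameters are hypothetical, and the sample table follows the description of FIG. 9.

```python
def build_concept_set(table, term_col, related_col):
    """Build one concept-set row in the FIG. 2 format from a glossary table.

    table[0] is the header row. term_col and related_col are the designated
    column positions of the terms and of their related phrases.
    """
    concepts = []
    for row in table[1:]:
        strings = [row[term_col].strip()]           # step A33: the term
        if related_col < len(row) and row[related_col].strip():
            strings.append(row[related_col].strip())  # step A34: related phrase
        concepts.append("/".join(strings))          # step A35: one concept
    return ",".join(concepts)                       # step A36: one concept set


# Sample glossary modelled on the description of FIG. 9.
glossary = [
    ["component name", "abbreviation"],
    ["order receiving management system", "order receiving system"],
    ["ordering management system", "ordering system"],
]
print(build_concept_set(glossary, term_col=0, related_col=1))
```

For a glossary laid out like FIG. 8, the same function could be called with the third column designated as the related-phrase column.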
Since the calculation processing of the similarity indicating the relevance between the data after the concept constructing unit 201 registers the concept set is the same as in the first embodiment, the description thereof is omitted.
As described above, according to the present embodiment, it is possible to automatically configure concepts from a glossary and to calculate the similarity indicating the relevance between data using those concepts. Here, the case where terms and related phrases are extracted from a glossary has been described as an example. However, when the data itself contains explanations of terms, for example, the data may be regarded as a glossary and the terms extracted from it.
(Third embodiment)
Next, a third embodiment of the system for determining relevance between data according to the present invention will be described in detail with reference to the drawings. Components similar to those of the first and second embodiments are given the same reference numerals, and their detailed description is omitted.
Referring to FIG. 11, the data relevance determination system according to the present embodiment includes a storage unit 31 that stores information and a calculation unit 32 that operates under program control.
The storage unit 31 includes a structural data storage unit 300, a concept storage unit 100, a data storage unit 101, and a candidate storage unit 102.
The structure data storage unit 300 stores structure data that has a hierarchical structure and is given hierarchical item names and contents. The structure data may be designated by the user of the system through a keyboard or the like, may be stored in the system as a default, or may be designated in other ways. FIG. 12 is an explanatory diagram illustrating an example of the structure data stored in the structure data storage unit 300. In the figure, the first row describes the contents of each column, the first column shows the item names of the major classification, the second column shows the item names of the minor classification, and the third column shows the contents. Note that chapter and section information may be extracted automatically as item names from a general document organized into chapters and sections, and structure data that uses the chapter titles as major classification item names and the section titles as minor classification item names may be stored in the structure data storage unit 300.
The other components of the storage unit 31, which are the concept storage unit 100, the data storage unit 101, and the candidate storage unit 102, are the same as described above.
The calculation unit 32 includes a conceptual configuration unit 301, a data generation unit 302, a candidate selection unit 103, and a similarity calculation unit 104.
The concept configuration unit 301 configures a concept based on the structure data stored in the structure data storage unit 300 and stores the concept in the concept storage unit 100.
The data generation unit 302 generates data based on the structure data stored in the structure data storage unit 300 and stores the data in the data storage unit 101.
The candidate selection unit 103 and the similarity calculation unit 104 are the same as described above.
Next, the operation of the conceptual configuration unit 301 according to the third embodiment will be described in detail with reference to the flowchart of FIG.
First, the concept configuration unit 301 extracts the character strings serving as item names from the structure data stored in the structure data storage unit 300 (step A37). For example, in FIG. 12, the major classification character strings “functional specifications” and “screen specifications” and the minor classification character strings “order receiving management system”, “ordering management system”, “setting screen”, and “display screen” are extracted.
Next, the concept configuration unit 301 configures concepts from the extracted item names (step A38). For example, in FIG. 12, one concept set is configured from the major classification character strings, and another concept set is configured from the minor classification character strings. FIG. 14 is an explanatory diagram illustrating an example of concepts configured by the concept configuration unit 301.
Next, the concept configuration unit 301 stores the configured concepts in the concept storage unit 100 (step A39). The processing of the concept configuration unit 301 then ends.
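Steps A37 to A39 can be sketched as follows, assuming the structure data is available as rows of (major item name, minor item name, content). The function name `build_concept_sets` is hypothetical, and the content strings in the sample rows are placeholders, since the contents of FIG. 12 are not reproduced in the text.

```python
def build_concept_sets(structure_rows):
    """Configure one concept set per hierarchy level from item names.

    structure_rows is a list of (major item name, minor item name, content).
    Each distinct item name becomes a concept with a single character string.
    """
    majors, minors = [], []
    for major, minor, _content in structure_rows:
        if major not in majors:   # keep first-seen order, drop repeats
            majors.append(major)
        if minor not in minors:
            minors.append(minor)
    return [[[name] for name in majors], [[name] for name in minors]]


# Item names follow the description of FIG. 12; contents are placeholders.
structure_rows = [
    ("functional specifications", "order receiving management system", "content A"),
    ("functional specifications", "ordering management system", "content B"),
    ("screen specifications", "setting screen", "content C"),
    ("screen specifications", "display screen", "content D"),
]
major_set, minor_set = build_concept_sets(structure_rows)
print(major_set)
print(minor_set)
```

Keeping the two levels as separate concept sets means a pair of data must pass the candidate check at both the major and the minor classification level.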
Next, the operation of the data generation unit 302 of the third embodiment will be described in detail with reference to the flowchart of FIG.
First, the data generation unit 302 extracts the character strings serving as item names from the structure data stored in the structure data storage unit 300 (step A40). This step is the same as the item name extraction performed by the concept configuration unit 301.
Next, the data generation unit 302 extracts a character string indicating the content from the structure data stored in the structure data storage unit 300 (step A41).
Next, the data generation unit 302 creates data by arranging item names and contents (step A42). FIG. 16 is an explanatory diagram illustrating an example of data generated by the data generation unit 302. The figure shows an example of data generated by the data generation unit 302 when the structural data storage unit 300 is shown in FIG. Here, the data is generated by arranging the item name and contents by separating them with “.”.
Next, the data generation unit 302 stores the generated data in the data storage unit 101 (step A43).
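Steps A40 to A43 amount to concatenating the item names and the content of each row. The Python sketch below illustrates this under assumptions: `generate_data` is a hypothetical name, the rows are (major item name, minor item name, content) tuples, and the content strings are invented placeholders rather than the contents of FIG. 12.

```python
def generate_data(structure_rows, separator="."):
    """Generate one data string per structure-data row by joining the
    item names and the content with the separator (steps A40-A42).

    Returns a dict mapping a 1-based data ID to the generated string.
    """
    return {
        data_id: separator.join([major, minor, content])
        for data_id, (major, minor, content) in enumerate(structure_rows, start=1)
    }


structure_rows = [
    ("functional specifications", "order receiving management system",
     "Supports telephone, FAX, EDI"),
    ("screen specifications", "setting screen",
     "Shows the supported channels"),
]
generated = generate_data(structure_rows)
print(generated[1])
```

Because the item names are embedded in each generated string, the concept sets built from the same item names will match the data they came from during candidate selection.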
Here, the structure data stored in the structure data storage unit and the data stored in the data storage unit have a one-to-one correspondence. Therefore, when structure data is input, the relevance between the structure data can be determined by the same processing as in the first and second embodiments, using the concepts configured by the concept configuration unit 301 and stored in the concept storage unit 100 and the data generated by the data generation unit 302 and stored in the data storage unit 101.
As described above, according to the present embodiment, it is possible to automatically construct a concept using the structure information of structured data and calculate the similarity indicating the relationship between the data.
Each unit of the system for determining relevance between data may be implemented using a combination of hardware and software. In that form, a program for determining relevance between data is loaded into RAM, and each unit is realized as various means by operating hardware such as a control unit (CPU) based on the program. The program for determining relevance between data may also be configured so that each unit is constructed by causing an operating system, other general-purpose software, or the like to execute each process.
Further, this program may be fixedly recorded on a recording medium and distributed. The program recorded on the recording medium is read into memory via a wired or wireless connection or from the recording medium itself, and operates the control unit and the like. Examples of the recording medium include an optical disk, a magnetic disk, a semiconductor memory device, and a hard disk.
Further, as shown in FIG. 17 and FIG. 18, the system for determining the relationship between data may be constructed as a single computer or a server-client system.
To put the above embodiments another way, an information processing device that operates as the data relevance determination system can be realized by causing the control unit to operate as the candidate selection unit and the similarity calculation unit based on the data relevance determination program loaded into RAM. It can likewise be realized by causing the control unit to operate as the concept configuration unit and the data generation unit.
As described above, according to the system for determining relevance between data according to the present invention, it is possible to accurately determine the relevance between data based on incomplete concept information in which not all information is registered.
In addition, the specific configuration of the present invention is not limited to the above-described embodiment, and changes within a range not departing from the gist of the present invention are included in the present invention. Further, a desired effect can be obtained by appropriately combining a plurality of components. For example, some components of all the components shown in the embodiment may be integrated or deleted.
In addition, a part or all of the above-described embodiments can be described as follows. Note that the following supplementary notes do not limit the present invention.
[Appendix 1]
A system for determining relevance between data, comprising:
a candidate selection unit that, based on one or more concept sets each having a plurality of concepts as elements, each concept including one or more character strings indicating a characteristic of the data to be compared, selects the data to be compared as a candidate for similarity calculation when both data include a character string of the same concept, or when only one of the data includes a character string belonging to a concept; and
a similarity calculation unit that calculates the similarity for candidates selected by the candidate selection unit, sets the similarity to a predetermined small value for candidates not selected by the candidate selection unit, and outputs the similarity of the data to be compared.
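The selection rule and the fallback value described in this note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `concepts_in`, `is_candidate`, and `relevance`, the value `SMALL_SIMILARITY = 0.0`, and the use of plain substring matching are all hypothetical. A pair in which neither text contains any concept string is treated as a candidate here, following the case enumeration in the claims.

```python
from typing import Callable, Dict, Set

# Hypothetical representation: concept name -> character strings belonging to it.
ConceptSet = Dict[str, Set[str]]

SMALL_SIMILARITY = 0.0  # assumed "predetermined small value"


def concepts_in(text: str, concept_set: ConceptSet) -> Set[str]:
    """Return the names of the concepts whose strings occur in the text."""
    return {
        name
        for name, strings in concept_set.items()
        if any(s in text for s in strings)
    }


def is_candidate(a: str, b: str, concept_set: ConceptSet) -> bool:
    """Reject a pair only when both texts have concepts but share none."""
    ca = concepts_in(a, concept_set)
    cb = concepts_in(b, concept_set)
    if ca & cb:
        return True  # both contain a string of the same concept
    return not (ca and cb)  # at most one text contains a concept string


def relevance(
    a: str,
    b: str,
    concept_set: ConceptSet,
    similarity: Callable[[str, str], float],
) -> float:
    """Compute similarity for candidates; otherwise return the small value."""
    if is_candidate(a, b, concept_set):
        return similarity(a, b)
    return SMALL_SIMILARITY
```

With a concept set that separates an order receiving system from an order placement system, two otherwise identical specification sentences are not selected, so their relevance stays at the small value regardless of how similar the strings are.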
[Appendix 2]
The system for determining relevance between data according to the above supplementary note, wherein the candidate selection unit selects candidates for similarity calculation using, as the concept set, a concept set that is incomplete because elements of two concepts overlap or elements of a concept are missing.
[Appendix 3]
The system for determining relevance between data according to the above supplementary notes, wherein the candidate selection unit holds a plurality of concept sets and selects the two data to be compared as a candidate for similarity calculation when, in all of the concept sets, both data include a character string of the same concept or only one of the data includes a character string belonging to a concept.
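The requirement that the condition hold in all concept sets can be sketched as below. The function names `passes` and `is_candidate_in_all`, the substring matching, and the treatment of pairs with no concept string at all as passing are assumptions made for illustration only.

```python
from typing import Dict, List, Set

# Hypothetical representation: concept name -> character strings belonging to it.
ConceptSet = Dict[str, Set[str]]


def concepts_in(text: str, concept_set: ConceptSet) -> Set[str]:
    """Return the names of the concepts whose strings occur in the text."""
    return {
        name
        for name, strings in concept_set.items()
        if any(s in text for s in strings)
    }


def passes(a: str, b: str, concept_set: ConceptSet) -> bool:
    """Check the selection condition against a single concept set."""
    ca = concepts_in(a, concept_set)
    cb = concepts_in(b, concept_set)
    return bool(ca & cb) or not (ca and cb)


def is_candidate_in_all(a: str, b: str, concept_sets: List[ConceptSet]) -> bool:
    """Select the pair only if the condition holds in every concept set."""
    return all(passes(a, b, cs) for cs in concept_sets)
```

A pair that shares a component concept but falls under different business categories is therefore rejected, because one failing concept set is enough to exclude it.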
[Appendix 4]
The system for determining relevance between data according to the above supplementary notes, wherein the similarity calculation unit calculates the similarity between the data to be compared using an approximation of Kolmogorov complexity.
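One common way to approximate Kolmogorov complexity is to use the length of a compressed string. The sketch below assumes that approach: it uses Python's standard `zlib` compressor and the normalized compression distance, mapped to a similarity as 1 minus the distance. The choice of compressor, the formula, and the function names are assumptions for illustration; the note itself does not specify them.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate Kolmogorov complexity by the zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def similarity(x: str, y: str) -> float:
    """Map the distance to a similarity and clamp it to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - ncd(x, y)))
```

Because the compressor can encode a repeated string cheaply, nearly identical texts yield a small distance and hence a high similarity, which is exactly why a preceding candidate selection is useful for texts that differ only in a concept-bearing word.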
[Appendix 5]
The system for determining relevance between data according to the above supplementary notes, further comprising a concept configuration unit that configures the concept set.
[Appendix 6]
The system for determining relevance between data according to the above supplementary notes, wherein the concept configuration unit:
configures each concept, based on a glossary describing terms, which are character strings, and their related phrases, with a term and its related phrases as elements; and
configures one concept set with the configured concepts as elements.
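As a rough sketch of this glossary-based construction, assume the glossary is available as a mapping from each term to a list of its related phrases. That data shape and the function name `build_concept_set` are hypothetical.

```python
from typing import Dict, List, Set


def build_concept_set(glossary: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build one concept per glossary term.

    Each concept's elements are the term itself plus its related phrases;
    the returned mapping as a whole plays the role of one concept set.
    """
    return {term: {term, *related} for term, related in glossary.items()}
```

The concept is keyed by the term here purely for convenience; any identifier would serve, since only the membership of strings in a concept matters for candidate selection.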
[Appendix 7]
The system for determining relevance between data according to the above supplementary notes, wherein the concept configuration unit:
configures each concept, based on structure data to which a plurality of item names and contents are given, with each item name as an element; and
configures a concept set with the configured concepts as elements.
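The item-name-based construction might look like the following, assuming the structure data is a list of records that map item names to contents. The record shape and the function name `build_concept_set_from_items` are hypothetical.

```python
from typing import Dict, List, Set


def build_concept_set_from_items(
    records: List[Dict[str, str]],
) -> Dict[str, Set[str]]:
    """Build one concept per distinct item name, with that name as its element."""
    concept_set: Dict[str, Set[str]] = {}
    for record in records:
        for item_name in record:
            # Each item name becomes a concept containing only itself.
            concept_set.setdefault(item_name, {item_name})
    return concept_set
```

Only the item names are used; the contents are ignored at this stage.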
[Appendix 8]
The system for determining relevance between data according to the above supplementary notes, further comprising a data generation unit that generates, as data, a character string in which item names and contents are concatenated, based on structure data to which a plurality of item names and contents are given.
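A data generation step of this kind could be sketched as follows. The record shape (a mapping from item name to content), the one-string-per-item granularity, the single-space separator, and the function name `generate_data` are assumptions for illustration.

```python
from typing import Dict, List


def generate_data(records: List[Dict[str, str]], separator: str = " ") -> List[str]:
    """Concatenate each item name with its content into one character string."""
    return [
        f"{item_name}{separator}{content}"
        for record in records
        for item_name, content in record.items()
    ]
```

Because every generated string begins with its item name, a concept set built from the item names can be matched against the generated data by simple substring lookup.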
[Appendix 9]
A method for determining relevance between data, comprising:
a candidate selection step of selecting, based on one or more concept sets each having a plurality of concepts as elements, each concept including one or more character strings indicating a characteristic of the data to be compared, the data to be compared as a candidate for similarity calculation when both data include a character string of the same concept, or when only one of the data includes a character string belonging to a concept; and
a similarity calculation step of calculating the similarity for candidates selected in the candidate selection step, setting the similarity to a predetermined small value for candidates not selected by the candidate selection unit, and enabling output of the similarity of the data to be compared.
[Appendix 10]
The method for determining relevance between data according to the above supplementary note, wherein, in the candidate selection step, candidates for similarity calculation are selected using, as the concept set, a concept set that is incomplete because elements of two concepts overlap or elements of a concept are missing.
[Appendix 11]
The method for determining relevance between data according to the above supplementary notes, wherein the candidate selection step uses a plurality of concept sets and selects the two data to be compared as a candidate for similarity calculation when, in all of the concept sets, both data include a character string of the same concept or only one of the data includes a character string belonging to a concept.
[Appendix 12]
The method for determining relevance between data according to the above supplementary notes, wherein, in the similarity calculation step, the similarity between the data to be compared is calculated using an estimate of Kolmogorov complexity.
[Appendix 13]
The method for determining relevance between data according to the above supplementary notes, further comprising a concept configuration step of configuring the concept set.
[Appendix 14]
The method for determining relevance between data according to the above supplementary notes, wherein the concept configuration step:
configures each concept, based on a glossary describing terms, which are character strings, and their related phrases, with a term and its related phrases as elements; and
configures one concept set with the configured concepts as elements.
[Appendix 15]
The method for determining relevance between data according to the above supplementary notes, wherein the concept configuration step:
configures each concept, based on structure data to which a plurality of item names and contents are given, with each item name as an element; and
configures a concept set with the configured concepts as elements.
[Appendix 16]
The method for determining relevance between data according to the above supplementary notes, further comprising a data generation step of generating, as data, a character string in which item names and contents are concatenated, based on structure data to which a plurality of item names and contents are given.
[Appendix 17]
A program for determining relevance between data, the program causing a control unit of an information processing device to execute:
a candidate selection process of selecting, based on one or more concept sets each having a plurality of concepts as elements, each concept including one or more character strings indicating a characteristic of the data to be compared, the data to be compared as a candidate for similarity calculation when both data include a character string of the same concept, or when only one of the data includes a character string belonging to a concept; and
a similarity calculation process of calculating the similarity for candidates selected in the candidate selection process, setting the similarity to a predetermined small value for candidates not selected by the candidate selection unit, and calculating the similarity of the data to be compared.
[Appendix 18]
The program for determining relevance between data according to the above supplementary note, wherein, in the candidate selection process, candidates for similarity calculation are selected using, as the concept set, a concept set that is incomplete because elements of two concepts overlap or elements of a concept are missing.
[Appendix 19]
The program for determining relevance between data according to the above supplementary notes, wherein the candidate selection process uses a plurality of concept sets and selects the two data to be compared as a candidate for similarity calculation when, in all of the concept sets, both data include a character string of the same concept or only one of the data includes a character string belonging to a concept.
[Appendix 20]
The program for determining relevance between data according to the above supplementary notes, wherein, in the similarity calculation process, the similarity between the data to be compared is calculated using an estimate of Kolmogorov complexity.
[Appendix 21]
The program for determining relevance between data according to the above supplementary notes, wherein the program further causes execution of a concept configuration process that configures the concept set.
[Appendix 22]
The program for determining relevance between data according to the above supplementary notes, wherein the concept configuration process:
configures each concept, based on a glossary describing terms, which are character strings, and their related phrases, with a term and its related phrases as elements; and
configures a concept set with the configured concepts as elements.
[Appendix 23]
The program for determining relevance between data according to the above supplementary notes, wherein the concept configuration process:
configures each concept, based on structure data to which a plurality of item names and contents are given, with each item name as an element; and
configures a concept set with the configured concepts as elements.
[Appendix 24]
The program for determining relevance between data according to the above supplementary notes, wherein the program further causes execution of a data generation process that generates, as data, a character string in which item names and contents are concatenated, based on structure data to which a plurality of item names and contents are given.
[Appendix 25]
A recording medium on which the program for determining relevance between data according to the above supplementary notes is recorded.
The present invention can be used in many systems that quantify the similarity between data, for example, a system for checking specifications, a system for checking procedure manuals, and a system that extracts information from a database using an increased number of keywords. In such systems, accuracy can also be improved by setting incomplete concept information.
This application claims priority based on Japanese Patent Application No. 2011-172924, filed on August 8, 2011, the entire disclosure of which is incorporated herein.

11, 21, 31 Storage unit
12, 22, 32 Calculation unit
100 Concept storage unit (concept storage means)
101 Data storage unit (data storage means)
102 Candidate storage unit (candidate storage means)
103 Candidate selection unit (candidate selection means)
104 Similarity calculation unit (similarity calculation means)
200 Glossary storage unit (glossary storage means)
201 Concept configuration unit (concept configuration means)
300 Structure data storage unit (structure data storage means)
301 Concept configuration unit (concept configuration means)
302 Data generation unit (data generation means)

Claims

A system for determining relevance between data, comprising:
a candidate selection unit that, as pre-processing for similarity calculation, based on a concept set having a plurality of concepts as elements, each concept being formed of one or more character strings as elements, selects the data to be compared as a candidate for similarity calculation in any of the following cases: both of the data to be compared include a character string belonging to the same concept; one of the data to be compared includes a character string belonging to a concept and the other does not include a character string belonging to a concept; or neither of the data to be compared includes a character string belonging to a concept; and
a similarity calculation unit that calculates the similarity for the candidates selected by the candidate selection unit, sets the similarity to a predetermined small value for the candidates not selected by the candidate selection unit, and calculates and outputs the similarity of the data to be compared.

The system for determining relevance between data according to claim 1, wherein the candidate selection unit selects candidates for similarity calculation using, as the concept set, a concept set that is incomplete because elements of two concepts overlap or elements of a concept are missing.

The system for determining relevance between data according to claim 1 or claim 2, wherein the candidate selection unit holds a plurality of concept sets and selects the two data to be compared as a candidate for similarity calculation when, in all of the concept sets, both data include a character string of the same concept or only one of the data includes a character string belonging to a concept.

The system for determining relevance between data according to any one of claims 1 to 3, wherein the similarity calculation unit calculates the similarity between the data to be compared using an estimate of Kolmogorov complexity.

The system according to claim 1, further comprising a concept configuration unit that configures the concept set.

The system for determining relevance between data according to claim 5, wherein the concept configuration unit configures each concept, based on a glossary describing terms, which are character strings, and their related phrases, with a term and its related phrases as elements, and configures one concept set with the configured concepts as elements.

The system for determining relevance between data according to claim 5 or 6, wherein the concept configuration unit configures each concept, based on structure data to which a plurality of item names and contents are given, with each item name as an element, and configures a concept set with the configured concepts as elements.

The system for determining relevance between data according to any one of claims 1 …, further comprising a data generation unit that generates, as data, a character string obtained by concatenating item names and contents, based on structure data to which a plurality of item names and contents are given.

A method for determining relevance between data by an information processing apparatus, comprising:
a candidate selection step of, as pre-processing for similarity calculation, based on a concept set having a plurality of concepts as elements, each concept being formed of one or more character strings as elements, selecting the data to be compared as a candidate for similarity calculation in any of the following cases: both of the data to be compared include a character string belonging to the same concept; one of the data to be compared includes a character string belonging to a concept and the other does not include a character string belonging to a concept; or neither of the data to be compared includes a character string belonging to a concept; and
a similarity calculation step of calculating the similarity for the candidates selected in the candidate selection step, setting the similarity to a predetermined small value for the candidates not selected by the candidate selection unit, and calculating and outputting the similarity of the data to be compared.

A program for determining relevance between data, the program causing a control unit of an information processing device to execute:
a candidate selection process of, as pre-processing for similarity calculation, based on a concept set having a plurality of concepts as elements, each concept being formed of one or more character strings as elements, selecting the data to be compared as a candidate for similarity calculation in any of the following cases: both of the data to be compared include a character string belonging to the same concept; one of the data to be compared includes a character string belonging to a concept and the other does not include a character string belonging to a concept; or neither of the data to be compared includes a character string belonging to a concept; and
a similarity calculation process of calculating the similarity for the candidates selected in the candidate selection process, setting the similarity to a predetermined small value for the candidates not selected by the candidate selection unit, and calculating the similarity of the data to be compared.