JP7675431B2

JP7675431B2 - Database generation device, database generation method, and database generation program

Info

Publication number: JP7675431B2
Application number: JP2021139603A
Authority: JP
Inventors: 知大山形; 永和富野; 隆史河合; 芳啓伴地
Original assignee: 株式会社Find
Priority date: 2021-08-30
Filing date: 2021-08-30
Publication date: 2025-05-13
Anticipated expiration: 2041-08-30
Also published as: JP2024009227A; JP2023033737A

Description

The present invention belongs to the technical field of database generation devices, database generation methods, and database generation programs. More specifically, it belongs to the technical field of a database generation device and a database generation method that integrate multiple different databases to generate an integrated database, and of a program for the database generation device.

In general, companies each manage one or more customer databases and consumer databases that contain information about their customers and about consumers at large. Depending on their purpose, these databases vary widely in the number of samples they hold and in their items (indicators).

In addition, a company may need to integrate an external database, whose attributes or items differ, into a customer database or consumer database that it manages in-house, in order to generate a new customer database with a larger number of samples and the like. Patent Document 1 below is an example of a prior art document disclosing conventional technology for such database integration.

The prior art disclosed in Patent Document 1 aims to "provide a database integration device or the like that can execute join processing at higher speed," and is configured as follows: "The reception unit of the database integration device receives from a client a request to join multiple pieces of join target data. The determination unit of the database integration device determines the database system that will execute the join processing, based on join feasibility information that indicates, for each combination of database systems having databases that respectively store the join target data specified by the request, whether the database system can read the join target data from its partner database system and perform the join processing, and whether it can have the partner database system read the join target data. The generation unit of the database integration device generates an execution plan for executing the join processing, and the execution unit of the database integration device transmits the request to the database system based on the execution plan."

Patent Document 1: Japanese Patent No. 6181250

In general, however, when multiple customer databases or consumer databases are integrated, the customer IDs or consumer IDs whose data is held in one database may not match those whose data is held in another. Conventionally, the only way to integrate such databases was to integrate the data for the customer IDs or consumer IDs common to all of them. The resulting integrated database, which contains data only for those common IDs, therefore has few samples and a limited set of items and indicators, so that only databases unfit for use as an integrated database could be generated. The problem compounds: the more customer and consumer databases one tries to integrate, the fewer customers and consumers they all have in common, and the less useful the integrated database becomes.
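The shrinking overlap described above can be illustrated with a minimal Python sketch. The ID sets and the helper name `common_ids` are hypothetical and serve only to show that integrating on common IDs alone leaves fewer samples as more databases are added.

```python
# Hypothetical customer-ID sets held by three independent databases.
db_a = {"c001", "c002", "c003", "c004", "c005", "c006"}
db_b = {"c002", "c003", "c004", "c007", "c008"}
db_c = {"c003", "c004", "c009"}


def common_ids(*databases):
    """Return the IDs present in every database (the only rows a
    conventional common-ID integration can keep)."""
    return set.intersection(*databases)


two_way = common_ids(db_a, db_b)          # IDs shared by A and B
three_way = common_ids(db_a, db_b, db_c)  # IDs shared by A, B and C

print(len(db_a), len(two_way), len(three_way))
```

Each additional database can only keep or reduce the set of common IDs, which is the limitation the invention sets out to avoid.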

The present invention has been made in view of the above problems. One example of its objective is to provide a database generation device and a database generation method, as well as a program for the database generation device, that can automatically generate an integrated database with a large number of samples and a wide range of items (indicators), even when multiple databases are integrated.

To solve the above problem, the invention described in claim 1 is a database generation device that further integrates, into an integrated database that relates to product purchases and was obtained by integrating and expanding a to-be-integrated database relating to product purchases using an integrating database, another database that differs from the integrated database in at least one of the number of samples or the items as a database, the device comprising: an integration means for integrating into the integrated database a general-purpose connection database, which is a database whose integration may be expected to improve accuracy over that of the original database and which includes various items or indicators and a predetermined number of samples, to generate a second integrated database; an extraction means for, when a second accuracy, which is an accuracy based on the accuracy rate of the generated second integrated database, is equal to or higher than a first accuracy, which is an accuracy based on the accuracy rate of the integrated database, extracting from the second integrated database the data of valid items actually used in the integration with the integrated database, to generate an extracted second integrated database; a data generation means for, when a third accuracy, which is an accuracy based on the accuracy rate of the generated extracted second integrated database, is equal to or higher than the second accuracy, generating data so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database; and a generation means for approximating the data of the extracted second integrated database, including the generated data, to the data of a market statistics database containing statistical information on the real market, to generate an extracted second integrated database with an increased number of samples.
To solve the above problem, the invention described in claim 7 is a database generation method executed in a database generation device that comprises an integration means, an extraction means, a data generation means, and a generation means, and that further integrates, into an integrated database that relates to product purchases and was obtained by integrating and expanding a to-be-integrated database relating to product purchases using an integrating database, another database that differs from the integrated database in at least one of the number of samples or the items as a database, the method comprising: an integration step of integrating into the integrated database, by the integration means, a general-purpose connection database, which is a database whose integration may be expected to improve accuracy over that of the original database and which includes various items or indicators and a predetermined number of samples, to generate a second integrated database; an extraction step of, when a second accuracy, which is an accuracy based on the accuracy rate of the generated second integrated database, is equal to or higher than a first accuracy, which is an accuracy based on the accuracy rate of the integrated database, extracting from the second integrated database, by the extraction means, the data of valid items actually used in the integration with the integrated database, to generate an extracted second integrated database; a data generation step of, when a third accuracy, which is an accuracy based on the accuracy rate of the generated extracted second integrated database, is equal to or higher than the second accuracy, generating, by the data generation means, data so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database; and a generation step of approximating, by the generation means, the data of the extracted second integrated database, including the generated data, to the data of a market statistics database containing statistical information on the real market, to generate an extracted second integrated database with an increased number of samples.
To solve the above problem, the invention described in claim 8 causes a computer included in a database generation device, which further integrates, into an integrated database that relates to product purchases and was obtained by integrating and expanding a to-be-integrated database relating to product purchases using an integrating database, another database that differs from the integrated database in at least one of the number of samples or the items as a database, to function as: an integration means for integrating into the integrated database a general-purpose connection database, which is a database whose integration may be expected to improve accuracy over that of the original database and which includes various items or indicators and a predetermined number of samples, to generate a second integrated database; an extraction means for, when a second accuracy, which is an accuracy based on the accuracy rate of the generated second integrated database, is equal to or higher than a first accuracy, which is an accuracy based on the accuracy rate of the integrated database, extracting from the second integrated database the data of valid items actually used in the integration with the integrated database, to generate an extracted second integrated database; a data generation means for, when a third accuracy, which is an accuracy based on the accuracy rate of the generated extracted second integrated database, is equal to or higher than the second accuracy, generating data so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database; and a generation means for approximating the data of the extracted second integrated database, including the generated data, to the data of a market statistics database containing statistical information on the real market, to generate an extracted second integrated database with an increased number of samples.

According to the invention of any one of claims 1, 7, and 8, an extracted second integrated database is generated when the second accuracy of the second integrated database is equal to or higher than the first accuracy of the integrated database. When the third accuracy of that extracted second integrated database is equal to or higher than the second accuracy, the number of samples in the extracted second integrated database is increased to match the number of samples in the integrated database, and its data is then approximated to the data of the market statistics database to generate an extracted second integrated database with an increased number of samples. Thus, an integrated database that has a number of samples and items corresponding to the integrated database, and that also corresponds to the real market, can be generated automatically as the extracted second integrated database with an increased number of samples.
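The two accuracy gates and the order of the four means can be sketched as plain control flow. This is a minimal sketch under stated assumptions, not the claimed implementation: the function `generate_augmented_db`, its callable parameters, and every `toy_*` helper are hypothetical stand-ins, and a "database" is modeled as a list of row dictionaries.

```python
def generate_augmented_db(integrated_db, connection_db, market_stats, *,
                          integrate, extract, augment, approximate, accuracy):
    """Run the gated flow; return None when an accuracy gate fails."""
    first_acc = accuracy(integrated_db)

    second_db = integrate(integrated_db, connection_db)
    second_acc = accuracy(second_db)
    if second_acc < first_acc:      # gate 1: second accuracy >= first accuracy
        return None

    extracted_db = extract(second_db)
    third_acc = accuracy(extracted_db)
    if third_acc < second_acc:      # gate 2: third accuracy >= second accuracy
        return None

    # Grow the extracted DB to the sample count of the integrated DB,
    # then pull it toward the market statistics.
    grown_db = augment(extracted_db, target_size=len(integrated_db))
    return approximate(grown_db, market_stats)


# Toy stand-ins (all hypothetical). "hit" marks whether the purchase
# prediction stored for that row turned out to be correct.
def toy_accuracy(db):
    return sum(row["hit"] for row in db) / len(db)

def toy_integrate(base, conn):
    # Placeholder pairing by position; not the integration of the patent.
    return [{**b, **c} for b, c in zip(base, conn)]

def toy_extract(db):
    # Keep only the columns treated as "valid items" in this toy.
    return [{"hit": r["hit"], "lifestyle": r["lifestyle"]} for r in db]

def toy_augment(db, target_size):
    # Naive growth by cycling through existing rows.
    return [dict(db[i % len(db)]) for i in range(target_size)]

def toy_approximate(db, stats):
    return db  # placeholder: no adjustment toward market statistics

integrated = [{"id": i, "hit": h} for i, h in enumerate([1, 1, 1, 0, 1, 0])]
connection = [{"lifestyle": s} for s in ("outdoor", "indoor", "outdoor")]

result = generate_augmented_db(
    integrated, connection, market_stats={},
    integrate=toy_integrate, extract=toy_extract, augment=toy_augment,
    approximate=toy_approximate, accuracy=toy_accuracy,
)
print(len(result))
```

Passing the five operations in as callables keeps the gating logic separate from how integration, extraction, or approximation are actually carried out, which this passage does not specify.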

To solve the above problem, the invention described in claim 2 is the database generation device described in claim 1, further comprising a second extraction means for, when the second accuracy is less than the first accuracy or when the third accuracy is less than the second accuracy, extracting the data of the valid items from the connection database and providing it for the integration.

According to the invention described in claim 2, in addition to the effect of the invention described in claim 1, when the second accuracy is less than the first accuracy or when the third accuracy is less than the second accuracy, the data of the valid items actually used in the integration with the integrated database is extracted from the connection database and provided for the integration, so that a more accurate extracted second integrated database with an increased number of samples can be generated automatically.

To solve the above problem, the invention described in claim 3 is the database generation device described in claim 1 or claim 2, wherein, when a fourth accuracy, which is an accuracy based on the accuracy rate of the generated extracted second integrated database with an increased number of samples, is less than the third accuracy, the data generation means regenerates the data for increasing the number of samples in the extracted second integrated database.

According to the invention described in claim 3, in addition to the effect of the invention described in claim 1 or claim 2, when the fourth accuracy of the extracted second integrated database with an increased number of samples is less than the third accuracy, the data for increasing the number of samples in the extracted second integrated database is regenerated, so that an even more accurate extracted second integrated database with an increased number of samples can be generated automatically.
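The regeneration condition of claim 3 amounts to a retry loop. The sketch below is illustrative only: the function names, the resampling generator, and the `max_attempts` limit are assumptions not found in the source, and the approximation to market statistics that precedes the fourth accuracy check is omitted for brevity.

```python
import random


def augment_until_accurate(extracted_db, target_size, third_acc,
                           accuracy, generate, max_attempts=10):
    """Regenerate the added samples while the fourth accuracy is below
    the third accuracy. Returns (database, attempts) or (None, attempts)."""
    missing = target_size - len(extracted_db)
    for attempt in range(1, max_attempts + 1):
        candidate = extracted_db + generate(extracted_db, missing, attempt)
        fourth_acc = accuracy(candidate)
        if fourth_acc >= third_acc:
            return candidate, attempt
    return None, max_attempts


# Hypothetical helpers: rows carry a "hit" flag, and new rows are drawn
# by resampling existing rows with a seed that changes on every attempt.
def toy_accuracy(db):
    return sum(row["hit"] for row in db) / len(db)

def toy_generate(db, count, seed):
    rng = random.Random(seed)
    return [dict(rng.choice(db)) for _ in range(count)]

extracted = [{"hit": 1}, {"hit": 1}, {"hit": 0}, {"hit": 1}]
third = toy_accuracy(extracted)

candidate, attempts = augment_until_accurate(
    extracted, target_size=8, third_acc=third,
    accuracy=toy_accuracy, generate=toy_generate,
)
print(attempts, None if candidate is None else toy_accuracy(candidate))
```

The attempt limit is a practical safeguard added for the sketch; the claim itself only states that the data is regenerated when the fourth accuracy falls below the third.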

To solve the above problem, the invention described in claim 4 is the database generation device described in any one of claims 1 to 3, wherein the extraction of the data of the valid items is performed using at least one of a principal component analysis method, a variable importance method, or a method using the SHAP (SHapley Additive exPlanations) library.

According to the invention described in claim 4, in addition to the effect of the invention described in any one of claims 1 to 3, the data of the valid items is extracted using at least one of the principal component analysis method, the variable importance method, or the method using the SHAP library, so that more practical valid-item data can be extracted.
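As a rough illustration of what selecting valid items means, the sketch below ranks columns by the absolute Pearson correlation with a purchase flag and keeps those above a threshold. This is a deliberately simple stand-in and is none of the three methods named in claim 4; the column names, the data, and the threshold of 0.5 are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_valid_items(columns, target, threshold=0.5):
    """Return column names whose |correlation| with the target meets the
    threshold, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical donor-side items and a purchase flag per sample.
columns = {
    "outdoor_interest": [1, 2, 3, 4, 5, 6],
    "shoe_size": [3, 1, 4, 1, 5, 2],
    "price_sensitivity": [6, 5, 4, 3, 2, 2],
}
purchased = [0, 0, 0, 1, 1, 1]

selected = select_valid_items(columns, purchased)
print(selected)
```

A production pipeline would more plausibly rely on a fitted model's feature importances or SHAP values, as the claim suggests; the correlation score is used here only because it needs no external library.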

To solve the above problem, the invention described in claim 5 is the database generation device described in any one of claims 1 to 3, further comprising: a match evaluation means for evaluating the degree of match between the data contained in the generated extracted second integrated database with an increased number of samples and the data contained in the connection database; and a notification means for reporting match information indicating the evaluated degree of match.

According to the invention described in claim 5, in addition to the effect of the invention described in any one of claims 1 to 3, the degree of match between the data contained in the generated extracted second integrated database with an increased number of samples and the data contained in the connection database is evaluated, and match information indicating the evaluated degree of match is reported, so that the degree to which the finally generated extracted second integrated database with an increased number of samples matches the original connection database can be recognized easily.

To solve the above problem, the invention described in claim 6 is the database generation device described in claim 5, wherein the match evaluation means evaluates the degree of match using at least one of a mean/variance method, a histogram method, a method utilizing the statistical distribution of aggregated data, an S (Signal)/N (Noise) method, or an evaluation method using Cronbach's alpha coefficient.

According to the invention described in claim 6, in addition to the effect of the invention described in claim 5, the degree of match is evaluated using at least one of the mean/variance method, the histogram method, the method utilizing the statistical distribution of aggregated data, the S/N method, or the evaluation method using Cronbach's alpha coefficient, so that the degree of match can be recognized more accurately.
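Two of the listed evaluation methods, the mean/variance method and the histogram method, can be sketched for a single numeric item as follows. The source does not define how either method is scored, so the gap and overlap measures below, the function names, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pvariance


def mean_variance_gap(a, b):
    """Absolute differences between the means and the population
    variances of two columns (smaller means a closer match)."""
    return abs(mean(a) - mean(b)), abs(pvariance(a) - pvariance(b))


def histogram_overlap(a, b, bins, lo, hi):
    """Share of mass two normalized histograms have in common (0 to 1).
    Assumes every value lies in the closed range [lo, hi]."""
    width = (hi - lo) / bins

    def normalized_hist(values):
        counts = [0] * bins
        for v in values:
            # Values equal to hi fall into the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return sum(min(x, y) for x, y in zip(normalized_hist(a), normalized_hist(b)))


# Hypothetical values of one item in the connection database and in the
# generated database with an increased number of samples.
connection_col = [1, 2, 2, 3, 3, 3, 4, 4, 5]
augmented_col = [1, 2, 2, 3, 3, 3, 4, 4, 5, 3, 2, 4]

mean_gap, var_gap = mean_variance_gap(connection_col, augmented_col)
overlap = histogram_overlap(connection_col, augmented_col, bins=5, lo=1, hi=5)
print(mean_gap, var_gap, overlap)
```

An overlap close to 1 together with small gaps would indicate that the generated samples preserve the distribution of the original connection database for that item.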

As described above, according to the present invention, an extracted second integrated database is generated when the second accuracy of the second integrated database is equal to or higher than the first accuracy of the integrated database. When the third accuracy of that extracted second integrated database is equal to or higher than the second accuracy, the number of samples in the extracted second integrated database is increased to match the number of samples in the integrated database, and its data is then approximated to the data of the market statistics database to generate an extracted second integrated database with an increased number of samples.

Therefore, an integrated database that has a number of samples and items corresponding to the integrated database, and that also corresponds to the real market, can be generated automatically as the extracted second integrated database with an increased number of samples.

FIG. 1 is a block diagram showing the schematic configuration of the database generation device of the first embodiment.
FIG. 2 is a block diagram showing the schematic configuration of the extraction unit constituting the database generation device of the first embodiment.
FIG. 3 is a flowchart showing the database generation process of the first embodiment.
FIG. 4 is a diagram illustrating the contents of the databases before the database generation process of the first embodiment is executed.
FIG. 5 is a diagram illustrating the contents of the databases after the database generation process of the first embodiment is executed.
FIG. 6 is a flowchart showing the database generation process of the second embodiment.

Next, embodiments for carrying out the present invention will be described with reference to the drawings. Each embodiment described below applies the present invention to a database generation device that integrates the data of multiple different databases to generate a new integrated database.

(I) First Embodiment
First, a first embodiment of the present invention will be described with reference to FIGS. 1 to 5. FIG. 1 is a block diagram showing the schematic configuration of the database generation device of the first embodiment, FIG. 2 is a block diagram showing the schematic configuration of the extraction unit constituting the database generation device, and FIG. 3 is a flowchart showing the database generation process of the first embodiment. FIG. 4 illustrates the contents of the databases before the database generation process is executed, and FIG. 5 illustrates the contents of the databases after it is executed. In FIGS. 1 and 3, "database" is abbreviated as "DB" where appropriate.

As shown in FIG. 1, the database generation device S of the first embodiment is realized by, for example, a personal computer, and comprises a processing unit 1 consisting of a CPU and the like, a recording unit 2 consisting of an HDD (Hard Disk Drive), an SSD (Solid State Drive), or the like, an operation unit 3 consisting of a keyboard, a mouse, and the like, and a display 4 consisting of a liquid crystal display or the like.

The processing unit 1 comprises an evaluation unit 10, an extraction unit 11, a generation unit 12, and an integration unit 13. As shown in FIG. 2, the extraction unit 11 in turn comprises a principal component analysis extraction unit 110, a variable importance extraction unit 111, and a SHAP extraction unit 112.

The evaluation unit 10, the extraction unit 11, the generation unit 12, and the integration unit 13 may be realized by hardware logic circuits including the CPU and the like constituting the processing unit 1, or may be realized in software by the CPU and the like reading and executing a program corresponding to the database generation process of the first embodiment described later. Likewise, the principal component analysis extraction unit 110, the variable importance extraction unit 111, and the SHAP extraction unit 112 may be realized by hardware logic circuits including the CPU and the like constituting the extraction unit 11, or may be realized in software by the CPU and the like reading and executing a program corresponding to the database generation process. Each of these programs may be recorded in advance in the recording unit 2 and read by the CPU and the like, or may be recorded in an external server device (not shown) and obtained by the CPU and the like via a network such as the Internet.

Here, the evaluation unit 10 corresponds to an example of each of the "first evaluation means", "second evaluation means", "third evaluation means", "fourth evaluation means", "fifth evaluation means", "sixth evaluation means", "seventh evaluation means", and "eighth evaluation means" of the present invention, and the extraction unit 11 corresponds to an example of each of the "extraction means" and the "second extraction means" of the present invention. The generation unit 12 corresponds to an example of each of the "generation means", the "data generation means", and the "second generation means" of the present invention, and the integration unit 13 corresponds to an example of each of the "integration means" and the "second integration means" of the present invention. Furthermore, the processing unit 1 corresponds to an example of the "match evaluation means" of the present invention, and the display 4 corresponds to an example of the "notification means" of the present invention.

In the above configuration, the database generation device S generates an integrated database 102 by integrating the data of the main database 100 and the data of the donor database 101 shown in FIG. 1. The data of the main database 100 and of the donor database 101 to be integrated may be recorded in advance in the recording unit 2, or may be obtained from an external server device or the like (not shown) via a network such as the Internet each time the database generation process of the first embodiment is executed.

The main database 100 of the first embodiment is, for example, a database of customers or of general consumers that a company uses daily in sales, development, and similar work relating to the product brand to which a product belongs, and it belongs to that company or to the department in charge. Such a main database 100 basically has a large number of samples (for example, tens of thousands of samples or more) and many items (indicators) related to the product brand, and it also contains data on actual customers of the product. On the other hand, the main database 100 contains little data (few samples) for items (indicators) that are not directly related to the product brand.

In contrast to the main database 100, the donor database 101 of the first embodiment is a database that does not belong to the above company, created for example by an external research company or by a department of the company other than the department in charge. Such a donor database 101 has few items (indicators) relating to a specific product or product brand of the kind found in the main database 100, and often does not have many samples either (for example, about 0 to 1,000 samples). However, the donor database 101 contains many items (indicators) that are not directly related to the product brand, such as items (indicators) on the lifestyle of purchasers in general (including purchasers of products other than the above product) and items (indicators) on general values.

そして、データベース生成装置Ｓでは、上記のような属性を有する本体データベース１００のデータに対して上記ドナーデータベース１０１のデータを統合し、項目（指標）を多岐に渡らせることで、上記企業に対して有効となる統合データベース１０２を生成する。 Then, the database generation device S integrates the data of the main database 100 having the attributes described above with the data of the donor database 101, and diversifies the items (indicators) to generate an integrated database 102 that is useful for the company.

より具体的に、先ずデータベース生成装置Ｓの記録部２は、第１実施形態のデータベース生成処理において生成される、後述する厳選ドナーデータベース１０３及びサンプル生成厳選ドナーデータベース１０４それぞれのデータを一時的に記録すると共に、当該データベース生成処理に必要なその他のデータを記録し、必要に応じて処理部１に出力する。 More specifically, the recording unit 2 of the database generation device S first temporarily records the data of the carefully selected donor database 103 and the sample generation carefully selected donor database 104 (described later) generated in the database generation process of the first embodiment, and also records other data necessary for the database generation process, and outputs it to the processing unit 1 as necessary.

一方、処理部１の評価部１０は、上記本体データベース１００等の各データベースの精度を、その正解率の観点から、例えばいわゆる混合行列（Confusion Matrix）を用いた従来の交差検証法（Cross Validation Method）を用いた評価方法により評価する。ここで、当該正解率について、例えば購入予測商品のデータが一のサンプルとしてそのデータベースに蓄積されている購買者が、その購入予測商品を実際に購入した場合、そのサンプルを含むそのデータベースとしては、正解率が向上することになる。 Meanwhile, the evaluation unit 10 of the processing unit 1 evaluates the accuracy of each database such as the main database 100 from the viewpoint of its accuracy rate, for example, by an evaluation method using a conventional cross validation method using a so-called confusion matrix. Here, regarding the accuracy rate, for example, if a purchaser whose data on a predicted purchase item is stored in the database as a sample actually purchases the predicted purchase item, the accuracy rate of the database including the sample will improve.
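The evaluation described above can be illustrated with a minimal sketch, assuming that the accuracy rate is the share of samples on the diagonal of a confusion matrix pooled over k held-out folds. The function names, the interleaved fold assignment, and the majority-label baseline model are hypothetical choices made for illustration; the description does not specify the model or the fold layout used by the evaluation unit 10.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))


def accuracy(matrix):
    """Share of samples whose predicted label equals the actual label."""
    total = sum(matrix.values())
    correct = sum(n for (actual, predicted), n in matrix.items() if actual == predicted)
    return correct / total if total else 0.0


def majority_classifier(train_rows, train_labels):
    """Hypothetical baseline model: always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda row: most_common


def cross_validated_accuracy(rows, labels, fit=majority_classifier, k=5):
    """Pool the confusion matrices of k held-out folds and return the accuracy rate."""
    pooled = Counter()
    for fold in range(k):
        # Assumption: sample i belongs to fold i % k.
        test_idx = [i for i in range(len(rows)) if i % k == fold]
        train_idx = [i for i in range(len(rows)) if i % k != fold]
        if not test_idx or not train_idx:
            continue
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        predictions = [model(rows[i]) for i in test_idx]
        pooled += confusion_matrix([labels[i] for i in test_idx], predictions)
    return accuracy(pooled)
```

Any callable that takes training rows and labels and returns a predictor can be passed as `fit`, so the same routine could score the main, donor, selected, and integrated databases alike.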

次に、抽出部１１は、ドナーデータベース１０１の項目（指標）の中から、統合データベース１０２の生成に当たって有効となる有効指標を抽出する。 Next, the extraction unit 11 extracts effective indicators from the items (indicators) of the donor database 101 that will be effective in generating the integrated database 102.

ここで、第１実施形態の抽出部１１における上記有効指標の抽出方法について、特に図２を用いて説明する。 Here, the method for extracting the above-mentioned effective indicators in the extraction unit 11 of the first embodiment will be explained, particularly with reference to FIG. 2.

当該抽出部１１による有効指標の抽出は、図２に示す主成分分析抽出部１１０、変数重要度抽出部１１１又はＳＨＡＰ抽出部１１２の少なくともいずれか一つにより行われる。このとき主成分分析抽出部１１０は、従来と同様の主成分分析法により有効指標を抽出する。より具体的に主成分分析抽出部１１０は、累積寄与率が予め変更可能に設定された累積寄与率閾値（例えば７０％）以上となる主成分の項目（指標）であって、且つ主成分負荷量の絶対値が予め変更可能に設定された主成分負荷量閾値（例えば０．０１）以上の項目（指標）を有効指標として抽出する。 The extraction of effective indicators by the extraction unit 11 is performed by at least one of the principal component analysis extraction unit 110, the variable importance extraction unit 111, and the SHAP extraction unit 112 shown in FIG. 2. At this time, the principal component analysis extraction unit 110 extracts effective indicators by a principal component analysis method similar to that used in the past. More specifically, the principal component analysis extraction unit 110 extracts as effective indicators items (indicators) of principal components whose cumulative contribution rate is equal to or greater than a cumulative contribution rate threshold value (e.g., 70%) that is previously set in a changeable manner, and whose absolute value of the principal component loading amount is equal to or greater than a principal component loading amount threshold value (e.g., 0.01) that is previously set in a changeable manner.
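The two-threshold selection by the principal component analysis extraction unit 110 can be sketched as follows. This is a sketch under stated assumptions: the contribution ratios and loadings are taken as already computed by an ordinary principal component analysis, the retained components are assumed to be the leading ones needed to reach the cumulative contribution rate threshold, and an item is assumed to qualify if its loading passes the threshold on any retained component. The function and parameter names are hypothetical.

```python
def select_by_pca(items, contribution_ratios, loadings,
                  cumulative_threshold=0.70, loading_threshold=0.01):
    """Select items (indicators) from precomputed PCA results.

    items: item names, in column order.
    contribution_ratios[c]: contribution ratio of component c, in descending order.
    loadings[c][j]: loading of item j on component c.
    """
    # Keep leading components until the cumulative contribution rate
    # reaches the threshold (70% in the example of the description).
    kept_components = []
    cumulative = 0.0
    for c, ratio in enumerate(contribution_ratios):
        kept_components.append(c)
        cumulative += ratio
        if cumulative >= cumulative_threshold:
            break

    # Keep every item whose absolute loading on a retained component
    # is at or above the loading threshold (0.01 in the example).
    selected = set()
    for c in kept_components:
        for j, name in enumerate(items):
            if abs(loadings[c][j]) >= loading_threshold:
                selected.add(name)
    return selected
```

Both thresholds are ordinary keyword arguments, reflecting the statement that they are set in advance in a changeable manner.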

一方変数重要度抽出部１１１は、従来と同様の変数重要度法により有効指標を抽出する。より具体的に変数重要度抽出部１１１は、変数重要度が予め変更可能に設定された変数重要度閾値（例えば０．００２）以上となる項目（指標）を有効指標として抽出する。またＳＨＡＰ抽出部１１２は、従来と同様のＳＨＡＰ法により有効指標を抽出する。より具体的にＳＨＡＰ抽出部１１２は、目的変数に対して予め設定されたＳＨＡＰ閾値（例えば上位２０位）に入る項目（指標）を有効指標として抽出する。このときＳＨＡＰ抽出部１１２は、目的変数となる項目（例えば商品ブランド等）が複数存在する場合は、それらを和統合（ＯＲ統合）により有効指標に追加する。なお、主成分分析抽出部１１０、変数重要度抽出部１１１又はＳＨＡＰ抽出部１１２のいずれかの抽出結果を抽出部１１の抽出結果として用いるかについては、例えば、本体データベース１００の属性や生成すべき統合データベース１０２の属性等に応じて予め設定されているのが好適である。 On the other hand, the variable importance extraction unit 111 extracts effective indices using the same variable importance method as in the past. More specifically, the variable importance extraction unit 111 extracts items (indices) whose variable importance is equal to or greater than a variable importance threshold (e.g., 0.002) that is previously set so as to be changeable, as effective indices. The SHAP extraction unit 112 also extracts effective indices using the same SHAP method as in the past. More specifically, the SHAP extraction unit 112 extracts items (indices) that fall within a SHAP threshold (e.g., the top 20) previously set for the objective variable, as effective indices. At this time, if there are multiple items (e.g., product brands, etc.) that are objective variables, the SHAP extraction unit 112 adds them to the effective indices by sum integration (OR integration). In addition, whether the extraction result of the principal component analysis extraction unit 110, the variable importance extraction unit 111, or the SHAP extraction unit 112 is used as the extraction result of the extraction unit 11 is preferably set in advance according to, for example, the attributes of the main database 100 and the attributes of the integrated database 102 to be generated.
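The threshold rule of the variable importance extraction unit 111 and the top-rank rule of the SHAP extraction unit 112 can be sketched as below. The sketch assumes that importance scores and SHAP scores have already been computed elsewhere and are supplied as plain dictionaries; ranking items by a single per-item SHAP score (for example a mean absolute SHAP value) is an assumption, as are all function and parameter names.

```python
def select_by_importance(importances, threshold=0.002):
    """Keep items whose variable importance is at or above the threshold."""
    return {name for name, value in importances.items() if value >= threshold}


def select_by_shap_rank(shap_scores_by_target, top_k=20):
    """Keep the top_k items per objective variable, then OR-combine the results.

    shap_scores_by_target: {objective variable: {item name: SHAP-based score}}.
    """
    selected = set()
    for target, scores in shap_scores_by_target.items():
        ranked = sorted(scores, key=scores.get, reverse=True)
        selected |= set(ranked[:top_k])  # OR integration across objective variables
    return selected
```

With two objective variables, an item ranked inside the top `top_k` for either one is kept, which corresponds to the sum integration (OR integration) described for multiple objective variables.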

そして、主成分分析抽出部１１０、変数重要度抽出部１１１又はＳＨＡＰ抽出部１１２の少なくともいずれか一つから出力された有効指標は、和統合（ＯＲ統合）により、抽出部１１による抽出結果として出力される。そして、当該抽出結果としての有効指標のデータは、上記厳選ドナーデータベース１０３として記録部２に一時的に記録される。 Then, the effective indexes output from at least one of the principal component analysis extraction unit 110, the variable importance extraction unit 111, or the SHAP extraction unit 112 are output as the extraction result by the extraction unit 11 through sum integration (OR integration). Then, the data of the effective indexes as the extraction result is temporarily recorded in the recording unit 2 as the carefully selected donor database 103.
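As an illustration of the OR integration above, the following sketch unions the indicator sets returned by whichever extractors were used and projects the donor data onto the resulting effective indicators. Representing each sample as a dictionary and always retaining an `id` column are assumptions made for the example; the description does not state how samples are keyed.

```python
def build_selected_donor(donor_rows, indicator_sets, key_columns=("id",)):
    """Build the carefully selected donor data from the donor rows.

    donor_rows: list of dicts, one per sample.
    indicator_sets: iterable of sets of item names, one per extractor used.
    key_columns: hypothetical identifier columns that are always retained.
    """
    # OR integration: an item is effective if any extractor selected it.
    effective = set().union(*indicator_sets) | set(key_columns)
    return [
        {name: value for name, value in row.items() if name in effective}
        for row in donor_rows
    ]
```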

次に、図１に戻って、処理部１の生成部１２は、厳選ドナーデータベース１０３のサンプル数を本体データベース１００のサンプル数に整合させる（例えば、厳選ドナーデータベース１０３のサンプル数と本体データベース１００のサンプル数とを同数とする）べく、本発明の発明者らにより特許出願中（特願２０２０－０８５５４６号）の技術の他、従来の例えばウエイトバック法やＧＡＮ（Generative Adversarial Networks（敵対的生成ネットワーク））技術等のＡＩ技術を用いたサンプルの新規生成方法を用いて、厳選ドナーデータベース１０３としてのデータ（サンプル）を新たに生成し、これを厳選ドナーデータベース１０３に追加してサンプル生成厳選ドナーデータベース１０４を生成し、記録部２に一時的に記録する。 Next, returning to FIG. 1, the generation unit 12 of the processing unit 1 matches the number of samples in the carefully selected donor database 103 to the number of samples in the main database 100 (for example, makes the two numbers equal). To do so, it newly generates data (samples) for the carefully selected donor database 103 using, besides the technology for which the inventors of the present invention have a patent application pending (Japanese Patent Application No. 2020-085546), a conventional method for generating new samples, such as the weight-back method or AI technology such as GAN (Generative Adversarial Networks). It then adds the generated samples to the carefully selected donor database 103 to generate the sample generation carefully selected donor database 104, which is temporarily recorded in the recording unit 2.
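The sample-count matching above can be illustrated with a deliberately simple stand-in. The sketch below uses bootstrap resampling (drawing existing samples with replacement), which is neither the weight-back method, a GAN, nor the technology of the pending application; it only shows the bookkeeping of adding generated samples until the target count is reached. The function name and the `seed` parameter are hypothetical.

```python
import random


def match_sample_count(selected_rows, target_count, seed=0):
    """Return the original rows plus generated rows, up to target_count in total.

    Stand-in generator: each new sample is a copy of a randomly drawn
    existing sample (bootstrap resampling). Existing rows are never dropped.
    """
    if not selected_rows:
        raise ValueError("cannot generate samples from an empty database")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    extra_needed = max(0, target_count - len(selected_rows))
    generated = [dict(rng.choice(selected_rows)) for _ in range(extra_needed)]
    return list(selected_rows) + generated
```

Calling `match_sample_count(selected_rows, len(main_rows))` would yield data with the same number of samples as the main database, which is the example given in the description.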

これらにより、統合部１３は、上記記録されているサンプル生成厳選ドナーデータベース１０４のデータと元の本体データベース１００のデータを従来と同様の方法で統合し、第１実施形態の統合データベース１０２を生成する。このような統合データベース１０２においては、ドナーデータベース１０１の特徴点（長所）が本体データベース１００に適用されることで、本体データベース１００としての短所が補われることとなる。この結果、本体データベース１００が属する上記企業の企業活動等にとって極めて有効な統合データベース（すなわち、サンプル数が多く且つデータベースとしての項目（指標）が多岐に渡る統合データベース）１０２が自動的に得られることになる。 As a result, the integration unit 13 integrates the data of the recorded sample generation carefully selected donor database 104 and the data of the original main database 100 in the same manner as in the past, generating the integrated database 102 of the first embodiment. In such an integrated database 102, the characteristics (advantages) of the donor database 101 are applied to the main database 100, thereby compensating for the shortcomings of the main database 100. As a result, an integrated database 102 that is extremely useful for the business activities of the company to which the main database 100 belongs (i.e., an integrated database with a large number of samples and a wide range of database items (indicators)) is automatically obtained.

なお、上述してきた各機能を実行するに当たって必要な操作は操作部３において実行され、当該操作に対応する操作信号が処理部１に出力される。これにより処理部１は、当該操作信号に基づき、上述してきた一連の機能を実行する。また、当該機能の実行に当たって必要な情報は、例えばディスプレイ４に表示され、データベース生成装置Ｓの操作者等に提示される。 The operations required to execute each of the functions described above are executed by the operation unit 3, and an operation signal corresponding to the operation is output to the processing unit 1. The processing unit 1 then executes the series of functions described above based on the operation signal. Information required to execute the function is displayed, for example, on the display 4 and presented to the operator of the database generation device S.

次に、第１実施形態のデータベース生成装置Ｓにおいて実行されるデータベース生成処理について、具体的に図２乃至図５を用いて説明する。 Next, the database generation process executed by the database generation device S of the first embodiment will be specifically described with reference to Figures 2 to 5.

上述した機能を有するデータベース生成装置Ｓにより実行される第１実施形態のデータベース生成処理は、例えばデータベース生成装置Ｓの図示しない電源スイッチがオンとされたタイミングから開始される。 The database generation process of the first embodiment, which is executed by the database generation device S having the above-mentioned functions, starts, for example, when a power switch (not shown) of the database generation device S is turned on.

当該データベース生成処理が開始されると、先ず、本体データベース１００のデータ及びドナーデータベース１０１のデータがそれぞれデータベース生成装置Ｓにおいて取得される。次に、処理部１の評価部１０は、取得した本体データベース１００のデータに基づき、上述した評価方法により本体データベース１００の精度を評価し、その評価結果を「評価Ａ」として記録部２に一時的に記録する（ステップＳ１）。 When the database generation process is started, first, the data of the main database 100 and the data of the donor database 101 are each acquired by the database generation device S. Next, the evaluation unit 10 of the processing unit 1 evaluates the accuracy of the main database 100 using the evaluation method described above based on the acquired data of the main database 100, and temporarily records the evaluation result in the recording unit 2 as "Evaluation A" (step S1).

また評価部１０は、上記ステップＳ１と並行して、取得したドナーデータベース１０１のデータに基づき、上述した評価方法によりドナーデータベース１０１の精度を評価し、その評価結果を「評価Ｂ」として記録部２に一時的に記録する（ステップＳ２）。次に、処理部１の抽出部１１は、上述した抽出方法により、ドナーデータベース１０１のデータから有効指標に相当するデータを抽出し、その抽出したデータを用いて上記厳選ドナーデータベース１０３を生成して記録部２に一時的に記録する（ステップＳ３）。その後評価部１０は、生成された厳選ドナーデータベース１０３のデータに基づき、上述した評価方法により厳選ドナーデータベース１０３の精度を評価し、その評価結果を「評価Ｃ」として記録部２に一時的に記録する（ステップＳ４）。 In parallel with step S1, the evaluation unit 10 also evaluates the accuracy of the donor database 101 using the evaluation method described above based on the acquired data of the donor database 101, and temporarily records the evaluation result in the recording unit 2 as "Evaluation B" (step S2). Next, the extraction unit 11 of the processing unit 1 extracts data corresponding to the effective indicators from the data of the donor database 101 using the extraction method described above, generates the carefully selected donor database 103 from the extracted data, and temporarily records it in the recording unit 2 (step S3). After that, the evaluation unit 10 evaluates the accuracy of the carefully selected donor database 103 using the evaluation method described above based on the data of the generated carefully selected donor database 103, and temporarily records the evaluation result in the recording unit 2 as "Evaluation C" (step S4).

次に処理部１は、記録部２に記録されている上記評価Ｃが上記評価Ｂ以上であるか否かを判定する（ステップＳ５）。ステップＳ５の判定において、評価Ｃが評価Ｂ未満である場合（ステップＳ５：ＮＯ）、ステップＳ３における有効指標の抽出が不十分であったとして再度ステップＳ３に戻り、上記抽出部１１は有効指標の再抽出を行う。一方、ステップＳ５の判定において、評価Ｃが評価Ｂ以上である場合（ステップＳ５：ＹＥＳ）、次に処理部１の生成部１２は、上述した生成方法を用いて厳選ドナーデータベース１０３についてのサンプル生成（データ生成）を行い、サンプル生成厳選ドナーデータベース１０４を生成して記録部２に一時的に記録する（ステップＳ６）。 Next, the processing unit 1 judges whether the evaluation C recorded in the recording unit 2 is equal to or greater than the evaluation B (step S5). If the judgment in step S5 is that the evaluation C is less than the evaluation B (step S5: NO), the extraction of the effective index in step S3 is deemed insufficient, and the process returns to step S3 again, and the extraction unit 11 re-extracts the effective index. On the other hand, if the judgment in step S5 is that the evaluation C is equal to or greater than the evaluation B (step S5: YES), the generation unit 12 of the processing unit 1 then performs sample generation (data generation) for the carefully selected donor database 103 using the above-mentioned generation method, generates the sample generated carefully selected donor database 104, and temporarily records it in the recording unit 2 (step S6).

そして、処理部１の統合部１３は、記録されているサンプル生成厳選ドナーデータベース１０４のデータと元の本体データベース１００のデータとを従来と同様の方法で統合し、統合データベース１０２を生成して記録部２に一時的に記録する（ステップＳ７）。このとき、統合データベース１０２は、図示しない外部のサーバ装置等に蓄積されてもよい。次に評価部１０は、記録されている統合データベース１０２のデータに基づき、上述した評価方法により統合データベース１０２の精度を評価し、その評価結果を「評価Ｄ」として記録部２に一時的に記録する（ステップＳ８）。 Then, the integration unit 13 of the processing unit 1 integrates the recorded data of the sample generation carefully selected donor database 104 and the data of the original main body database 100 in a conventional manner to generate an integrated database 102, which is temporarily recorded in the recording unit 2 (step S7). At this time, the integrated database 102 may be stored in an external server device (not shown). Next, the evaluation unit 10 evaluates the accuracy of the integrated database 102 using the evaluation method described above based on the recorded data of the integrated database 102, and temporarily records the evaluation result as "Evaluation D" in the recording unit 2 (step S8).

次に処理部１は、記録部２に記録されている上記評価Ｄが上記評価Ａ（上記ステップＳ１参照）以上であるか否かを判定する（ステップＳ９）。ステップＳ９の判定において、評価Ｄが評価Ａ未満である場合（ステップＳ９：ＮＯ）、現在の統合データベース１０２の生成過程に含まれていた上記ステップＳ３における有効指標の抽出が不十分であったとして、再度ステップＳ３に戻り、抽出部１１は有効指標の更なる抽出を行う。一方、ステップＳ９の判定において、評価Ｄが評価Ａ以上である場合（ステップＳ９：ＹＥＳ）、次に処理部１は、その時点での統合データベース１０２のデータとドナーデータベース１０１のデータとの合致度を評価する（ステップＳ１０）。 Next, the processing unit 1 judges whether the evaluation D recorded in the recording unit 2 is equal to or greater than the evaluation A (see step S1 above) (step S9). If the judgment in step S9 is that the evaluation D is less than evaluation A (step S9: NO), the extraction of the effective indicators in step S3 included in the generation process of the current integrated database 102 was insufficient, and the process returns to step S3 again, where the extraction unit 11 further extracts effective indicators. On the other hand, if the judgment in step S9 is that the evaluation D is equal to or greater than evaluation A (step S9: YES), the processing unit 1 then evaluates the degree of match between the data in the integrated database 102 and the data in the donor database 101 at that time (step S10).
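The control flow of steps S1 to S9 can be summarized in a short sketch. All four callables (`evaluate`, `extract`, `generate`, `integrate`) are hypothetical placeholders for the evaluation unit 10, extraction unit 11, generation unit 12, and integration unit 13, and the `max_attempts` limit is an assumption added so that the sketch terminates; the description itself does not state a retry limit.

```python
def generate_integrated_database(main_db, donor_db, evaluate, extract,
                                 generate, integrate, max_attempts=10):
    """Run steps S1 to S9, retrying the extraction (S3) when S5 or S9 fails."""
    eval_a = evaluate(main_db)    # step S1: Evaluation A
    eval_b = evaluate(donor_db)   # step S2: Evaluation B
    for attempt in range(max_attempts):
        selected = extract(donor_db, attempt)       # step S3
        eval_c = evaluate(selected)                 # step S4: Evaluation C
        if eval_c < eval_b:                         # step S5: NO, back to S3
            continue
        augmented = generate(selected, main_db)     # step S6
        integrated = integrate(main_db, augmented)  # step S7
        eval_d = evaluate(integrated)               # step S8: Evaluation D
        if eval_d >= eval_a:                        # step S9: YES
            evaluations = {"A": eval_a, "B": eval_b, "C": eval_c, "D": eval_d}
            return integrated, evaluations
        # step S9: NO, back to S3 with the next extraction attempt
    raise RuntimeError("steps S5 and S9 were not both satisfied within max_attempts")
```

Passing the attempt number to `extract` is one possible way to let a retry change its thresholds or extractor; how a re-extraction differs from the previous one is left open by the description.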

ここで、ステップＳ１０として行われる合致度の評価は、企業独自の本体データベース１００に対して、一般化されたドナーデータベース１０１を統合した結果としての統合データベース１０２のデータが、ドナーデータベース１０１のデータにどの程度一致しているか、つまり、より汎用性の高いデータベースとなっているか、を評価するものである。このステップＳ１０における評価方法として具体的には、従来と同様の、例えば、平均／分散法、ヒストグラム法、集計データの統計的分布活用法、Ｓ／Ｎ法又はCronbachのα係数を用いた評価方法の少なくともいずれか一つが用いられる。このとき、例えば平均／分散法を用いて合致度を判定した場合は、平均値と分散範囲が一致するほど、合致度としては高くなることになる。そして、当該合致の評価結果は、例えばディスプレイ４を用いて表示（出力）されるか、又は、本体データベース１００が属する企業の担当者等に対して、記録部２に記録されている統合データベース１０２のデータと共に提供される（ステップＳ１１）。なお、ステップＳ１０における合致度の評価結果に基づき、例えば合致度を上げて統合データベース１０２の属性をドナーデータベース１０１の属性により近付けて汎用性を高めたい場合には、例えば、ドナーデータベース１０１のデータとの関係における真偽判定の基準をより厳格にしてデータベース生成処理を行うのが好適である。これに対し、上記合致度よりも評価部１０における評価値としての精度をより高めたい場合は、例えば、各データベースにおける目的変数判定の基準をより厳格にするのが好適である。 Here, the evaluation of the degree of match performed in step S10 is to evaluate the degree to which the data of the integrated database 102, which is the result of integrating the generalized donor database 101 with the company's own main database 100, matches the data of the donor database 101, that is, whether the database is more versatile. Specifically, the evaluation method in step S10 is the same as in the past, for example, at least one of the following evaluation methods is used: the mean/variance method, the histogram method, the statistical distribution utilization method of aggregated data, the S/N method, or the Cronbach's alpha coefficient. In this case, for example, when the degree of match is determined using the mean/variance method, the degree of match is higher as the average value and the variance range match. The evaluation result of the match is then displayed (output) using, for example, the display 4, or is provided to the person in charge of the company to which the main database 100 belongs, together with the data of the integrated database 102 recorded in the recording unit 2 (step S11). 
If it is desired to increase versatility by, for example, increasing the degree of matching based on the evaluation result of the degree of matching in step S10 to bring the attributes of the integrated database 102 closer to the attributes of the donor database 101, it is preferable to perform the database generation process with stricter standards for determining whether the data is true or false in relation to the data in the donor database 101. On the other hand, if it is desired to increase the accuracy of the evaluation value in the evaluation unit 10 rather than the degree of matching, it is preferable to, for example, make the standards for determining the objective variables in each database stricter.
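For the mean/variance method mentioned above, a minimal sketch is given below. It compares one numeric item (indicator) shared by the integrated database 102 and the donor database 101. The scoring formula, which maps the gaps in mean and variance to a value between 0 and 1, is a hypothetical choice for illustration; the description only states that the degree of match is higher the more closely the mean and the variance range agree.

```python
from statistics import mean, pvariance


def mean_variance_match(integrated_values, donor_values):
    """Return a score in (0, 1]; 1.0 means identical mean and variance.

    Hypothetical scoring rule: 1 / (1 + |mean gap| + |variance gap|).
    """
    mean_gap = abs(mean(integrated_values) - mean(donor_values))
    variance_gap = abs(pvariance(integrated_values) - pvariance(donor_values))
    return 1.0 / (1.0 + mean_gap + variance_gap)
```

In practice the items would likely need to be put on a common scale before such a score is compared across items, since the gaps here are in the units of the item itself.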

その後処理部１は、例えば操作部３による終了操作等により第１実施形態のデータベース生成処理を終了するか否かを判定する（ステップＳ１２）。ステップＳ１２の判定において、当該データベース生成処理を終了する場合（ステップＳ１２：ＹＥＳ）、処理部１は、そのまま当該データベース生成処理を終了する。一方、ステップＳ１２の判定において、例えば他の本体データベース１００又は他のドナーデータベース１０１を対象として当該データベース生成処理を継続する場合（ステップＳ１２：ＮＯ）、処理部１は、上記ステップＳ１及びステップＳ２に戻り、上記他の本体データベース１００又は上記他のドナーデータベース１０１を対象として上述してきた処理を継続する。 Then, the processing unit 1 judges whether or not to end the database generation process of the first embodiment, for example, by an end operation by the operation unit 3 (step S12). If the judgment in step S12 is that the database generation process is to be ended (step S12: YES), the processing unit 1 ends the database generation process as is. On the other hand, if the judgment in step S12 is that the database generation process is to be continued, for example, for another main database 100 or another donor database 101 (step S12: NO), the processing unit 1 returns to the above steps S1 and S2 and continues the above-described processing for the other main database 100 or the other donor database 101.

次に、第１実施形態のデータベース生成処理が実行された結果としての本体データベース１００と統合データベース１０２の比較について、具体的に図４及び図５を用いて説明する。なお、図４及び図５は、一の企業だけでなく複数の企業の本体データベース１００について、第１実施形態のデータベース生成処理を実行した結果を纏めて示すものである。 Next, a comparison between the main database 100 and the integrated database 102 as a result of executing the database generation process of the first embodiment will be specifically described with reference to Figures 4 and 5. Note that Figures 4 and 5 show the results of executing the database generation process of the first embodiment for the main databases 100 of not only one company but multiple companies.

先ず図４に例示するように、ある企業Ａ社に属する本体データベース１００では、各顧客等をしめすＩＤに関連付けて、その属性やＡ社が実施したキャンペーンへの参加の有無のデータ等が記録（蓄積）されているとする。この場合、特にキャンペーンへの参加の有無は、Ａ社独自のデータではあるが、顧客の一般的な移動履歴等のデータは含まれていない（図４ハッチング部参照）。 First, as shown in the example of Figure 4, the main database 100 belonging to a certain company A records (accumulates) data such as attributes and whether or not the customer participated in a campaign implemented by company A in association with an ID indicating each customer. In this case, the participation in a campaign is data unique to company A, but data such as the customer's general movement history is not included (see the hatched area in Figure 4).

一方、上述してきた第１実施形態のデータベース生成処理では、このようなＡ社の本体データベース１００に対して、第１実施形態のドナーデータベース１０１が適用される。このときのドナーデータベース１０１としては、上記移動履歴やサービス利用履歴等の一般的なライフスタイル又は価値観を示すデータがサンプルとして含まれているものが用いられる。そして、このようなドナーデータベース１０１を用いた第１実施形態のデータベース生成処理が本体データベース１００に対して実行されると、その結果として得られる統合データベース１０２は、図５に例示するように、Ａ社の企業活動には関連性が低いとしてデータ（サンプル）が得られていなかった上記移動履歴等のデータがサンプルとして含まれ得ることになる。この結果、Ａ社の企業活動等にとって極めて有効な統合データベース１０２が自動的に得られたことになる。 On the other hand, in the database generation process of the first embodiment described above, the donor database 101 of the first embodiment is applied to the main database 100 of Company A. The donor database 101 used in this case contains data indicating general lifestyles or values, such as the movement history and service usage history, as samples. When the database generation process of the first embodiment using such a donor database 101 is executed on the main database 100, the resulting integrated database 102 may contain, as exemplified in FIG. 5, samples of data such as the movement history, for which data (samples) were not obtained because they were considered to be of low relevance to the business activities of Company A. As a result, an integrated database 102 that is extremely useful for the business activities of Company A is automatically obtained.

以上説明したように、第１実施形態のデータベース生成装置Ｓによるデータベース生成処理によれば、ドナーデータベース１０１の精度を評価Ｂとし、厳選ドナーデータベース１０３の精度を評価Ｃとし、評価Ｃ≧評価Ｂであるとき、サンプル生成厳選ドナーデータベース１０４を生成し、本体データベース１００に統合して統合データベース１０２を生成する（図３ステップＳ１乃至ステップＳ７参照）。よって、ドナーデータベース１０１の精度及び厳選ドナーデータベース１０３の精度の評価結果に基づいて生成したサンプル生成厳選ドナーデータベース１０４を本体データベース１００に統合して統合データベース１０２を生成するので、サンプル数が多く且つデータベースとしての項目（指標）が多岐に渡る統合データベース１０２を自動的に生成することができる。 As described above, according to the database generation process by the database generation device S of the first embodiment, the accuracy of the donor database 101 is taken as evaluation B and the accuracy of the carefully selected donor database 103 as evaluation C, and when evaluation C ≧ evaluation B, the sample generation carefully selected donor database 104 is generated and integrated with the main database 100 to generate the integrated database 102 (see steps S1 to S7 in FIG. 3). Because the sample generation carefully selected donor database 104, generated based on the evaluation results for the accuracy of the donor database 101 and of the carefully selected donor database 103, is integrated with the main database 100, an integrated database 102 with a large number of samples and a wide range of items (indicators) can be automatically generated.

なお、本発明の発明者等によるシミュレーションによれば、万単位の数のサンプルを含み且つ商品ブランドに関連する本体データベース１００（評価Ａとしての正解率が５０％未満）に対して、千単位の数のサンプルを含み、変数の数が本体データベース１００より多く且つ一般価値観に関するドナーデータベース１０１（評価Ｂとしての正解率が８０％後半の値以上）を第１実施形態のデータベース生成処理を用いて統合して得られた統合データベース１０２（本体データベースのサンプル数と同数のサンプルを含み、変数の数が本体データベース１００の変数の数とドナーデータベース１０１の変数の数を合計した数となる）の評価Ｄとしての正解率は、元の本体データベース１００の正解率より高く、ドナーデータベース１０１の正解率に迫る正解率であることが確認できている。これらにより、第１実施形態のデータベース生成処理によれば、サンプル数が多く且つデータベースとしての項目（指標）が多岐に渡るだけでなく、元の本体データベース１００に対して精度（正解率）が飛躍的に向上した統合データベース１０２を自動的に生成することが可能となることが判る。 According to a simulation by the inventors of the present invention, a main database 100 that contains tens of thousands of samples and relates to a product brand (accuracy rate as evaluation A: less than 50%) was integrated, using the database generation process of the first embodiment, with a donor database 101 that contains thousands of samples, has more variables than the main database 100, and relates to general values (accuracy rate as evaluation B: in the upper 80% range or higher). The resulting integrated database 102 contains the same number of samples as the main database 100, and its number of variables is the sum of the number of variables in the main database 100 and the number of variables in the donor database 101. Its accuracy rate as evaluation D was confirmed to be higher than that of the original main database 100 and close to that of the donor database 101. This shows that the database generation process of the first embodiment makes it possible to automatically generate an integrated database 102 that not only has a large number of samples and a wide range of items (indicators) but also has dramatically improved accuracy (accuracy rate) compared with the original main database 100.

また、評価Ｃ＜評価Ｂであるとき、有効指標のデータの再抽出及び厳選ドナーデータベース１０３の再生成を行い、再生成された厳選ドナーデータベース１０３の精度を再評価するので、より高精度の統合データベース１０２を自動的に生成することができる。 In addition, when evaluation C < evaluation B, the data of the effective indicators is re-extracted, the carefully selected donor database 103 is regenerated, and the accuracy of the regenerated carefully selected donor database 103 is re-evaluated, so that a more accurate integrated database 102 can be automatically generated.

更に、生成された統合データベース１０２の精度（評価Ｄ）が元の本体データベース１００の精度（評価Ａ）未満であるとき、有効指標のデータの再抽出及び厳選ドナーデータベース１０３の再生成を行い、再生成された厳選ドナーデータベース１０３の精度を再評価するので（図３ステップＳ８、ステップＳ９参照）、更に高精度の統合データベース１０２を自動的に生成することができる。 Furthermore, when the accuracy (evaluation D) of the generated integrated database 102 is less than the accuracy (evaluation A) of the original main database 100, the data of the effective indicators is re-extracted, the carefully selected donor database 103 is regenerated, and the accuracy of the regenerated carefully selected donor database 103 is re-evaluated (see steps S8 and S9 in FIG. 3), so that an even more accurate integrated database 102 can be automatically generated.

また、上記評価Ｄが上記評価Ａより高い場合に、第１実施形態のデータベース生成処理を終了して統合データベース１０２の内容を確定するので（図３ステップＳ９：ＹＥＳ参照）、本体データベース１００よりもより高精度の統合データベース１０２を自動的に生成することができる。 In addition, if the evaluation D is higher than the evaluation A, the database generation process of the first embodiment is terminated and the contents of the integrated database 102 are confirmed (see step S9 in FIG. 3: YES), so that an integrated database 102 with higher accuracy than the main database 100 can be automatically generated.

更に、評価部１０による各評価が、混合行列を用いた交差検証法を用いてそれぞれ行われるので、各データベースの精度の評価を正確に行うことができる。 Furthermore, each evaluation by the evaluation unit 10 is performed using a cross-validation method that uses a confusion matrix, so that the accuracy of each database can be accurately evaluated.

更にまた、抽出部１１による有効指標の抽出が、主成分分析法、変数重要度法又はＳＨＡＰライブラリを用いた方法の少なくともいずれか一つを用いて行われるので、より実用性の高い有効指標のデータを抽出することができる。 Furthermore, since the extraction unit 11 extracts effective indicators using at least one of the principal component analysis method, the variable importance method, or a method using the SHAP library, it is possible to extract data on effective indicators that are more practical.

また、生成された統合データベース１０２に含まれるデータとドナーデータベース１０１に含まれるデータとの合致度を評価し、その評価された合致度を出力するので（図３ステップＳ１０及びステップＳ１１参照）、最終的に生成された統合データベース１０２の、元のドナーデータベース１０１に対する合致度を容易に認識することができる。 In addition, the degree of match between the data contained in the generated integrated database 102 and the data contained in the donor database 101 is evaluated and the evaluated degree of match is output (see steps S10 and S11 in Figure 3), so that the degree of match between the finally generated integrated database 102 and the original donor database 101 can be easily recognized.

更に、ステップＳ１０における合致度の評価が、平均／分散法、ヒストグラム法、集計データの統計的分布活用法、Ｓ／Ｎ法又はCronbachのα係数を用いた評価方法の少なくともいずれか一つを用いて行われるので、より正確に当該合致度を認識することができる。 Furthermore, since the evaluation of the degree of match in step S10 is performed using at least one of the mean/variance method, the histogram method, the statistical distribution method of aggregated data, the S/N method, or an evaluation method using Cronbach's alpha coefficient, the degree of match can be recognized more accurately.

（II）第２実施形態 (II) Second embodiment

次に、本発明の他の実施形態である第２実施形態について、図６を用いて説明する。なお、図６は第２実施形態のデータベース生成処理を示すフローチャートである。 Next, a second embodiment, which is another embodiment of the present invention, will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the database generation process of the second embodiment.

上述した第１実施形態のデータベース生成処理では、本体データベース１００とドナーデータベース１０１とを統合し、統合データベース１０２を生成した。これに対し、以下に説明する第２実施形態のデータベース生成処理では、上記生成された統合データベース１０２を第１実施形態のデータベース生成処理と同様の方法にて更に拡充し、様々な市場（インターネット上の、いわゆる仮想市場を含む）に適用可能なデータベースを生成する。 In the database generation process of the first embodiment described above, the main database 100 and the donor database 101 are integrated to generate the integrated database 102. In contrast, in the database generation process of the second embodiment described below, the integrated database 102 generated above is further expanded in a manner similar to the database generation process of the first embodiment, to generate a database that can be applied to various markets (including so-called virtual markets on the Internet).

なお、第２実施形態のデータベース生成装置のハードウェア的な構成は、基本的には第１実施形態のデータベース生成装置Ｓのハードウェア的な構成と同一であるので、以下の説明では、当該データベース生成装置Ｓと同様の部材については同一の部材番号を付して細部の説明は省略する。また、第２実施形態のデータベース生成処理のうち、上述した第１実施形態のデータベース生成処理と同一の処理については、同一のステップ番号を付して細部の説明は省略する。 The hardware configuration of the database generation device of the second embodiment is basically the same as that of the database generation device S of the first embodiment, so in the following explanation, components identical to those of the database generation device S are given the same component numbers and detailed explanations are omitted. Likewise, processes in the database generation process of the second embodiment that are identical to those in the database generation process of the first embodiment described above are given the same step numbers and detailed explanations are omitted.

図６に示すように、第２実施形態のデータベース生成装置において実行される第２実施形態のデータベース生成処理は、第１実施形態のデータベース生成処理と同様に、例えば第２実施形態のデータベース生成装置の電源スイッチがオンとされたタイミングから開始される。 As shown in FIG. 6, the database generation process of the second embodiment executed by the database generation device of the second embodiment is started, for example, when the power switch of the database generation device of the second embodiment is turned on, similar to the database generation process of the first embodiment.

当該データベース生成処理が開始されると、先ず、第１実施形態のデータベース生成処理により生成された統合データベース１０２のデータが取得される。このとき、第２実施形態のデータベース生成処理に供される統合データベース１０２は、第１実施形態のデータベース生成処理により一の本体データベース１００と一のドナーデータベース１０１とを統合したものであってもよいし、第１実施形態のデータベース生成処理を連続して複数回繰り返すことにより、一又は複数の本体データベース１００と、一又は複数のドナーデータベース１０１とを統合して生成された統合データベースであってもよい。 When the database generation process is started, first, data of the integrated database 102 generated by the database generation process of the first embodiment is obtained. At this time, the integrated database 102 provided to the database generation process of the second embodiment may be one in which one main database 100 and one donor database 101 are integrated by the database generation process of the first embodiment, or it may be an integrated database generated by integrating one or more main databases 100 and one or more donor databases 101 by repeating the database generation process of the first embodiment multiple times in succession.

次に、第２実施形態の処理部１の評価部１０は、取得した統合データベース１０２のデータに基づき、第１実施形態のデータベース生成処理と同様の評価方法により統合データベース１０２の精度を評価し、その評価結果を「評価ａ」として第２実施形態の記録部２に一時的に記録する（ステップＳ２０）。 Next, the evaluation unit 10 of the processing unit 1 of the second embodiment evaluates the accuracy of the integrated database 102 based on the acquired data of the integrated database 102 using an evaluation method similar to that of the database generation process of the first embodiment, and temporarily records the evaluation result as "evaluation a" in the recording unit 2 of the second embodiment (step S20).

次に第２実施形態の処理部１の統合部１３は、統合データベース１０２のデータと第２実施形態の接続用データベース１２４のデータを従来と同様の方法で統合し、高精度統合データベース１２０を生成して記録部２に一時的に記録する（ステップＳ２１）。 Next, the integration unit 13 of the processing unit 1 of the second embodiment integrates the data of the integrated database 102 and the data of the connection database 124 of the second embodiment in a manner similar to that of the conventional method, generates a high-precision integrated database 120, and temporarily records it in the recording unit 2 (step S21).

ここで、上記接続用データベース１２４とは、二つのデータベースを接続して統合するためのいわば「糊代」として機能する場合や、その接続データベースを用いて統合することで元のデータベースの精度からの精度の向上が期待される場合があるデータベースであり、種々の項目（指標）を含み且つ所定数のサンプルを含む、汎用性の高いデータベースである。 The connection database 124 is a highly versatile database that may function as a sort of "glue" for connecting and integrating two databases, or that may be expected to improve the accuracy of the original database by integrating using the connection database, and that includes various items (indicators) and a predetermined number of samples.

次に評価部１０は、生成されて記録されている高精度統合データベース１２０のデータに基づき、上述した評価方法により高精度統合データベース１２０の精度を評価し、その評価結果を「評価ｂ」として記録部２に一時的に記録する（ステップＳ２２）。 Next, the evaluation unit 10 evaluates the accuracy of the high-precision integrated database 120 using the evaluation method described above based on the generated and recorded data of the high-precision integrated database 120, and temporarily records the evaluation result in the recording unit 2 as "evaluation b" (step S22).

Next, the processing unit 1 determines whether evaluation b recorded in the recording unit 2 is equal to or greater than evaluation a (step S23). If evaluation b is less than evaluation a (step S23: NO), then, in order to improve the accuracy of the connection database 124, the extraction unit 11 of the processing unit 1 uses the extraction method described above to extract the data corresponding to effective indicators from the connection database 124 as it stands at that time, generates from the extracted data a new connection database 124 whose items (indicators) have been carefully selected, and temporarily records it in the recording unit 2 (step S27). This new connection database 124 is then supplied to the processing of step S21.

On the other hand, if evaluation b is equal to or greater than evaluation a (step S23: YES), then, in order to improve the accuracy of the high-precision integrated database 120, the extraction unit 11 uses the extraction method described above to extract the data corresponding to effective indicators from the recorded data of the high-precision integrated database 120, generates the carefully selected high-precision integrated database 121 from the extracted data, and temporarily records it in the recording unit 2 (step S24). The generation of the carefully selected high-precision integrated database 121 (step S24) includes generating items (indicators) corresponding to a predetermined virtual market whose attributes or characteristics are similar to those of the integrated database 102, and generating a model corresponding to those items. The evaluation unit 10 then evaluates the accuracy of the carefully selected high-precision integrated database 121, based on its generated data, using the evaluation method described above, and temporarily records the result in the recording unit 2 as "evaluation c" (step S25).

Next, the processing unit 1 determines whether evaluation c recorded in the recording unit 2 is equal to or greater than evaluation b (see step S22) (step S26). If evaluation c is less than evaluation b (step S26: NO), the extraction of effective indicators from the connection database 124 in step S27 is regarded as insufficient, so the process returns to step S27, where the extraction unit 11 extracts effective indicators further and supplies the result to step S21. On the other hand, if evaluation c is equal to or greater than evaluation b (step S26: YES), the generation unit 12 of the processing unit 1 performs sample generation (data generation) for the carefully selected high-precision integrated database 121 using the generation method described above (step S28).

Next, the processing unit 1 uses a predetermined market statistics database 122, which contains statistical information on a real (non-virtual) market and whose attributes or characteristics are, for example, similar to those of the integrated database 102, to approximate the data of the carefully selected high-precision integrated database 121 after sample generation to the data of the market statistics database 122 (step S29). It then generates the sample-generated carefully selected high-precision integrated database 123 from the approximated data and temporarily records it in the recording unit 2 (step S30).
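This passage does not tie the approximation of step S29 to a specific technique. As a minimal sketch under that caveat, one common way to bring a sample closer to known market statistics is post-stratification weighting, in which each category receives a weight equal to its market share divided by its share in the sample. The function name, the restriction to a single categorical item, and the input shapes are all assumptions made for illustration, not the method of the embodiment.

```python
from collections import Counter


def post_stratification_weights(sample_categories, market_shares):
    """Hypothetical helper: weight per category so that the weighted sample
    reproduces the category shares given by the market statistics.

    sample_categories: list of category labels, one per sample (assumed shape)
    market_shares:     dict mapping category label -> share in the real market
    """
    counts = Counter(sample_categories)
    total = len(sample_categories)
    weights = {}
    for category, count in counts.items():
        sample_share = count / total
        # weight = target share / observed share
        weights[category] = market_shares[category] / sample_share
    return weights


# Example: the sample over-represents category "a" relative to the market.
weights = post_stratification_weights(
    ["a", "a", "a", "b"], {"a": 0.5, "b": 0.5}
)
```

With these weights, the three "a" samples and the single "b" sample each carry half of the total weight, matching the assumed market shares.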

The evaluation unit 10 then evaluates the accuracy of the sample-generated carefully selected high-precision integrated database 123, based on its generated data, using the evaluation method described above, and temporarily records the result in the recording unit 2 as "evaluation d" (step S31).

Next, the processing unit 1 determines whether evaluation d recorded in the recording unit 2 is equal to or greater than evaluation c (see step S25) (step S32). If evaluation d is less than evaluation c (step S32: NO), the accuracy of the processing in steps S28 to S30, such as the sample generation and the approximation to the data of the market statistics database 122, is regarded as insufficient, and the processing unit 1 returns to step S28 and repeats the subsequent processing.

On the other hand, if evaluation d is equal to or greater than evaluation c (step S32: YES), the processing unit 1 evaluates the degree of match between the data of the sample-generated carefully selected high-precision integrated database 123 at that time and the data of the connection database 124, and outputs the result, in the same manner as in steps S10 and S11 of the database generation process of the first embodiment.

The processing unit 1 then determines whether to end the database generation process of the second embodiment, for example in response to an end operation on the operation unit 3 (step S33). If the process is to be ended (step S33: YES), the processing unit 1 ends the database generation process as it is. If the process is to be continued, for example with another integrated database 102 as the target (step S33: NO), the processing unit 1 returns to step S20 and continues the processing described above for that other integrated database 102.
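The control flow of steps S20 to S32 for a single integrated database 102 can be sketched as follows. This is only an illustration: the evaluation, integration, extraction, sample generation, and approximation methods are passed in as callables because they are defined elsewhere in the description, and every function and parameter name is hypothetical. The branch at step S23 compares evaluation b with evaluation a, and the `max_rounds` limit is an added assumption so that the sketch always terminates.

```python
from typing import Any, Callable


def second_embodiment_flow(
    integrated_db: Any,
    connection_db: Any,
    market_stats_db: Any,
    evaluate: Callable[[Any], float],
    integrate: Callable[[Any, Any], Any],
    extract_effective: Callable[[Any], Any],
    generate_samples: Callable[[Any], Any],
    approximate: Callable[[Any, Any], Any],
    max_rounds: int = 10,  # assumption: the text sets no iteration limit
) -> Any:
    eval_a = evaluate(integrated_db)                       # S20
    for _ in range(max_rounds):
        high_db = integrate(integrated_db, connection_db)  # S21
        eval_b = evaluate(high_db)                         # S22
        if eval_b >= eval_a:                               # S23: YES
            selected_db = extract_effective(high_db)       # S24
            eval_c = evaluate(selected_db)                 # S25
            if eval_c >= eval_b:                           # S26: YES
                break
        # S23: NO or S26: NO -> narrow the connection database and retry
        connection_db = extract_effective(connection_db)   # S27
    else:
        raise RuntimeError("S23/S26 conditions not met within max_rounds")

    for _ in range(max_rounds):
        grown_db = generate_samples(selected_db)           # S28
        result_db = approximate(grown_db, market_stats_db) # S29, S30
        eval_d = evaluate(result_db)                       # S31
        if eval_d >= eval_c:                               # S32: YES
            return result_db
    raise RuntimeError("S32 condition not met within max_rounds")
```

Steps S10, S11, and S33 (match evaluation, notification, and the end decision) are left out of the sketch because they follow only after the loop has produced its result.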

The database generation process of the second embodiment described above also achieves the same effects as the database generation process of the first embodiment.

That is, the accuracy of the integrated database 102 is taken as evaluation a and the accuracy of the high-precision integrated database 120 as evaluation b, and when evaluation b ≥ evaluation a, the carefully selected high-precision integrated database 121 is generated. The accuracy of the carefully selected high-precision integrated database 121 is taken as evaluation c, and when evaluation c ≥ evaluation b, the sample-generated carefully selected high-precision integrated database 123 is generated by approximation to the data of the market statistics database 122 (see steps S20 to S30 in FIG. 6). A sample-generated carefully selected high-precision integrated database 123 that has a number of samples and items corresponding to the integrated database 102, and that also corresponds to the real market, can therefore be generated automatically.

In addition, when evaluation b < evaluation a (see step S23: NO in FIG. 6) or when evaluation c < evaluation b (see step S26: NO in FIG. 6), the data of effective items is extracted from the connection database 124 and supplied to the integration with the integrated database 102 (see step S27 in FIG. 6), so that a more accurate sample-generated carefully selected high-precision integrated database 123 can be generated automatically.

Furthermore, when evaluation d, the accuracy of the sample-generated carefully selected high-precision integrated database 123, is less than evaluation c (see step S32: NO in FIG. 6), the sample generation (data generation) of step S28 is executed again, so that an even more accurate sample-generated carefully selected high-precision integrated database 123 can be generated automatically.

Moreover, the degree of match between the data included in the generated sample-generated carefully selected high-precision integrated database 123 and the data included in the connection database 124 is evaluated (see step S10 in FIG. 6), and match degree information indicating the evaluated degree of match is notified (see step S11 in FIG. 6). The degree to which the finally generated sample-generated carefully selected high-precision integrated database 123 matches the original connection database 124 can therefore be recognized easily.
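As an illustration of such a match evaluation, the sketch below compares two databases item by item using their means and variances, which is one possible reading of the mean/variance method named in claim 6. The data layout (a dict mapping each item name to a list of numeric values), the function name, and the use of absolute differences as the match measure are assumptions made for this sketch only.

```python
from statistics import mean, pvariance


def mean_variance_match(generated, reference):
    """Hypothetical helper: for every item present in both databases, report
    the absolute difference of the means and of the population variances.
    Smaller differences are read here as a higher degree of match."""
    report = {}
    for item in sorted(set(generated) & set(reference)):
        g_values, r_values = generated[item], reference[item]
        report[item] = {
            "mean_diff": abs(mean(g_values) - mean(r_values)),
            "variance_diff": abs(pvariance(g_values) - pvariance(r_values)),
        }
    return report


# Example with one shared item ("age"); items found on one side only are skipped.
report = mean_variance_match(
    {"age": [10, 20], "spend": [1, 2, 3]},
    {"age": [20, 30], "visits": [4]},
)
```

In this example the two "age" columns differ in mean by 10 but have the same variance, so the report flags a shift in location rather than in spread.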

As described above, the present invention can be used in the field of database integration, and produces particularly remarkable effects when applied to the integration of databases that differ in the number of samples and/or the number of items (indicators).

REFERENCE SIGNS LIST
1 Processing unit
2 Recording unit
3 Operation unit
4 Display
10 Evaluation unit
11 Extraction unit
110 Principal component analysis extraction unit
111 Variable importance extraction unit
112 SHAP extraction unit
12 Generation unit
13 Integration unit
100 Main database
101 Donor database
102 Integrated database
103 Carefully selected donor database
104 Sample generation carefully selected donor database
120 High-precision integrated database
121 Carefully selected high-precision integrated database
122 Market statistics database
123 Sample generation carefully selected high-precision integrated database
124 Connection database
S Database generation device

Claims

1. A database generating device that further integrates another database into an integrated database, the integrated database having been obtained by integrating and expanding a database related to product purchases using an integrating database, and differing from the database related to product purchases in at least one of the number of samples and the items as a database, the database generating device comprising:
an integration means for integrating a general-purpose connection database, which is a database that may be expected to improve accuracy over the accuracy of the original database through integration and which includes various items or indicators and a predetermined number of samples, into the integrated database to generate a second integrated database;
an extraction means for extracting, from the second integrated database, data of effective items actually used for integration with the integrated database to generate an extracted second integrated database when a second accuracy, which is based on the accuracy of the generated second integrated database, is equal to or greater than a first accuracy, which is based on the accuracy of the integrated database;
a data generating means for generating data so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database when a third accuracy, which is an accuracy based on a correct answer rate of the generated extracted second integrated database, is equal to or greater than the second accuracy; and
a generating means for approximating the data of the extracted second integrated database including the generated data to data of a market statistics database including statistical information of a real market, thereby generating a sample-number-increased extracted second integrated database.

2. The database generating device according to claim 1,
further comprising a second extraction means for extracting the data of the effective items from the connection database and supplying it to the integration when the second accuracy is less than the first accuracy or when the third accuracy is less than the second accuracy.

3. The database generating device according to claim 1,
wherein, when a fourth accuracy, which is an accuracy based on a correct answer rate of the generated sample-number-increased extracted second integrated database, is less than the third accuracy, the data generating means regenerates the data so as to increase the number of samples in the extracted second integrated database.

4. The database generating device according to claim 1,
wherein the extraction of the data of the effective items is performed using at least one of a principal component analysis method, a variable importance method, and a method using a SHAP (SHapley Additive exPlanations) library.

5. The database generating device according to claim 1, further comprising:
a match degree evaluation means for evaluating a degree of match between the data included in the generated sample-number-increased extracted second integrated database and the data included in the connection database; and
a notification means for notifying match degree information indicating the evaluated degree of match.

6. The database generating device according to claim 5,
wherein the match degree evaluation means evaluates the degree of match using at least one of a mean/variance method, a histogram method, a method using the statistical distribution of aggregated data, an S (Signal)/N (Noise) method, and an evaluation method using Cronbach's alpha coefficient.

7. A database generating method executed in a database generating device that further integrates another database into an integrated database, the integrated database having been obtained by integrating and expanding a database related to product purchases using an integrating database, and differing from the database related to product purchases in at least one of the number of samples and the items as a database, the database generating device comprising an integration means, an extraction means, a data generating means, and a generating means, the database generating method comprising:
an integration step of integrating, by the integration means, a general-purpose connection database, which is a database that may be expected to improve accuracy over the accuracy of the original database through integration and which includes various items or indicators and a predetermined number of samples, into the integrated database to generate a second integrated database;
an extraction step of extracting, by the extraction means, data of effective items actually used for integration with the integrated database from the second integrated database to generate an extracted second integrated database when a second accuracy, which is based on the accuracy of the generated second integrated database, is equal to or greater than a first accuracy, which is based on the accuracy of the integrated database;
a data generating step of generating data, by the data generating means, so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database when a third accuracy, which is an accuracy based on a correct answer rate of the generated extracted second integrated database, is equal to or greater than the second accuracy; and
a generating step of approximating, by the generating means, the data of the extracted second integrated database including the generated data to data of a market statistics database including statistical information of a real market, thereby generating a sample-number-increased extracted second integrated database.

8. A database generating program that causes a computer included in a database generating device, which further integrates another database into an integrated database, the integrated database having been obtained by integrating and expanding a database related to product purchases using an integrating database, and differing from the database related to product purchases in at least one of the number of samples and the items as a database, to function as:
an integration means for integrating a general-purpose connection database, which is a database that may be expected to improve accuracy over the accuracy of the original database through integration and which includes various items or indicators and a predetermined number of samples, into the integrated database to generate a second integrated database;
an extraction means for extracting data of effective items actually used for integration with the integrated database from the second integrated database to generate an extracted second integrated database when a second accuracy, which is an accuracy based on a correct answer rate of the generated second integrated database, is equal to or greater than a first accuracy, which is an accuracy based on a correct answer rate of the integrated database;
a data generating means for generating data so as to increase the number of samples in the extracted second integrated database to match the number of samples in the integrated database when a third accuracy, which is an accuracy based on a correct answer rate of the generated extracted second integrated database, is equal to or greater than the second accuracy; and
a generating means for approximating the data of the extracted second integrated database including the generated data to data of a market statistics database including statistical information of a real market, thereby generating a sample-number-increased extracted second integrated database.