JP7633958B2 - Information processing device, information processing method, and information processing program

Info

Publication number: JP7633958B2
Application number: JP2022028573A
Authority: JP
Inventors: 瞬希中川; 渉竹内
Original assignee: Hitachi High Tech Corp
Current assignee: Hitachi High Tech Corp
Priority date: 2022-02-25
Filing date: 2022-02-25
Publication date: 2025-02-20
Anticipated expiration: 2042-02-25
Also published as: JP2023124669A

Description

The present invention relates to an information processing device, an information processing method, and an information processing program.

In recent years, it has become common in various fields for computers to train mathematical models by machine learning using training data. In machine learning, the coefficients (parameters) of the mathematical model are optimized. When a user inputs explanatory variables into a trained mathematical model, the model outputs the objective variable that the user expects. Here, "the objective variable that the user expects" means one that the user can judge to be correct on the basis of experience.

If the training data, as a sample drawn from a population, is biased because of human, social, temporal, or environmental constraints, the mathematical model will fail to output the objective variable that the user expects, even if the machine learning itself was carried out correctly. The coefficients of the trained mathematical model then need to be corrected. Patent Document 1 describes an information processing device that corrects the values of the coefficients by which the explanatory variables of a trained mathematical model are multiplied, so that the model outputs an objective variable that seems less unnatural to the user.

Patent Document 1: JP 2020-9502 A

However, correcting the coefficient values of a trained mathematical model requires trial and error: the user must select which of the many coefficients to correct and then decide their values. Instead of correcting the coefficients after the fact, another approach is to replace the training data itself with data that is unbiased with respect to the population and to perform machine learning again. However, collecting new training data with the same format (variables) from the population also takes a great deal of time and effort, and the newly collected training data may have other kinds of bias.
Therefore, an object of the present invention is to easily correct biased training data collected from a population and to use the corrected training data to train a highly accurate mathematical model by machine learning.

The information processing device of the present invention includes: a joint distribution creation unit that creates, from training data, joint distribution information having the same variables as statistical information; a weight calculation unit that calculates, as a weight for each combination of a plurality of the variables, the degree to which the sample of the created joint distribution information is biased relative to the statistical information as a population; a prediction model creation unit that corrects the training data based on the calculated weights and trains a prediction model using the corrected training data; and an output processing unit that uses the prediction model to output a prediction result for prediction data. The statistical information is obtained from a population that includes the facility where the sample of the training data was collected. The prediction model creation unit increases the number of samples in the training data by copying samples of the training data based on the weights. The prediction model creation unit also trains a first weak learner using the training data before correction and a second weak learner using the training data after correction, and creates an ensemble model having the first weak learner and the second weak learner.
Other means will be described in the description of the embodiment.

According to the present invention, biased training data collected from a population can be easily corrected, and the corrected training data can be used to train a highly accurate mathematical model by machine learning.

FIG. 1 is a diagram illustrating the configuration of the information processing device.
FIG. 2 is an example of training data.
FIG. 3 is an example of statistical information.
FIG. 4 is an example of joint distribution information.
FIG. 5 is an example of weight information.
FIG. 6 is a flowchart of the joint distribution creation procedure.
FIG. 7 is a flowchart of the weight calculation procedure.
FIG. 8 is a flowchart of the prediction model creation procedure.
FIG. 9 is a flowchart of the output procedure.
FIG. 10 is an example of a display screen.

Hereinafter, a mode for carrying out the present invention (referred to as "the present embodiment") will be described in detail with reference to the drawings. The present embodiment is an example in which training data collected from a single medical facility is corrected using statistical information published by a national government, a local government, or the like. However, the present invention is also generally applicable to fields other than medicine.

(Population and Sample)
The population in this embodiment consists of the citizens, local residents, etc. residing in a country, a local government area, etc.
The sample in this embodiment consists of the subjects examined at a specific medical facility belonging to that country, local government area, etc. The sample does not necessarily reflect the characteristics of the population accurately. Each medical facility may unavoidably have characteristics of its own, such as many elderly patients or few male patients. Because of such characteristics, the characteristics of the sample diverge from those of the population. This divergence is also called "bias."

(Configuration of the Information Processing Device, etc.)
FIG. 1 is a diagram illustrating the configuration of the information processing device 1. The information processing device 1 is a general-purpose computer and includes a central control device 11, an input device 12 such as a mouse or a keyboard, an output device 13 such as a display, a main memory device 14, an auxiliary memory device 15, and a communication device 16. These are connected to one another via a bus. The auxiliary memory device 15 stores joint distribution information 31, weight information 32, and a prediction model 33. The prediction model 33 includes a first weak learner 34 and a second weak learner 35.

As is generally known, a "weak learner" is a mathematical model whose prediction accuracy is relatively low when used on its own. A mathematical model that is composed of multiple weak learners and outputs the result of a majority vote over their individual outputs is called an ensemble model. The prediction accuracy of an ensemble model is higher than that of the individual weak learners. The prediction model 33 of this embodiment is such an ensemble model.

The joint distribution creation unit 21, the weight calculation unit 22, the prediction model creation unit 23, and the output processing unit 24 in the main memory device 14 are programs. The central control device 11 reads these programs from the auxiliary memory device 15 into the main memory device 14 and thereby realizes the function of each program (described in detail later).

The information processing device 1 is connected to a database 2 via a wired or wireless network 3. The database 2 stores training data 41, prediction data 42, and statistical information 43 (described in detail later). The database 2 may be divided into, for example, a database of a national or local government that stores the statistical information 43, a database of a specific medical facility that stores the training data 41, and a database of another medical facility that stores the prediction data 42 (described in detail later).

(Training Data)
FIG. 2 is an example of the training data 41. The training data 41 is collected from a specific single medical facility. In the training data 41, an event occurrence flag is stored in an event occurrence column 102, an age in an age column 103, and a gender in a gender column 104, each in association with the subject ID stored in a subject ID column 101.

The subject ID in the subject ID column 101 is an identifier that uniquely identifies a subject. A subject is a person who underwent a particular test at the single medical facility.
The event occurrence flag in the event occurrence column 102 is either "0" or "1". "1" indicates that the test revealed that the subject has a particular disease. "0" indicates that the test revealed that the subject does not have the disease. In other words, an "event" means contracting the disease.
The age in the age column 103 is the age of the subject.
The gender in the gender column 104 is the gender of the subject.

(Statistical Information)
FIG. 3 is an example of the statistical information 43. The statistical information 43 is created from administrative documents (not shown) prepared by a national government, a local government, or the like. The administrative documents assumed in this embodiment are public documents that associate attributes of citizens or local residents with one another (variables such as event occurrence, age, gender, occupation, address, number of vaccinations received, and so on). Whereas the sample in the training data 41 is a "part," the population of the administrative documents is the "whole," which includes that part and is far broader. In many cases, the administrative documents also contain attributes (variables) other than the "event occurrence," "age," and "gender" found in the training data 41.

In the statistical information 43, an age is stored in an age column 112, a gender in a gender column 113, and the number of examples in the statistical information in a number-of-examples column 114, each in association with the event occurrence flag stored in an event occurrence column 111.

The event occurrence flag in the event occurrence column 111 is the same as the event occurrence flag in FIG. 2.
The age in the age column 112 is the age of the citizens or other persons covered by the administrative documents.
The gender in the gender column 113 is the gender of the citizens or other persons covered by the administrative documents.
The number of examples in the number-of-examples column 114 is the number of citizens or other persons (the population) who fall under that combination of event occurrence, age, and gender. The combination of event occurrence, age, and gender corresponds to columns 102 to 104 of the training data 41 (FIG. 2).

As is clear from the above, the statistical information 43 is an aggregation of the original administrative documents after discarding every attribute that does not appear in the training data.

(Joint Distribution Information)
FIG. 4 is an example of the joint distribution information 31. In the joint distribution information 31, an age is stored in an age column 122, a gender in a gender column 123, and the number of examples in the training data in a number-of-examples column 124, each in association with the event occurrence flag stored in an event occurrence column 121.

The event occurrence flag in the event occurrence column 121 is the same as the event occurrence flag in FIG. 2.
The age in the age column 122 is the same as the age in FIG. 2.
The gender in the gender column 123 is the same as the gender in FIG. 2.
The number of examples in the number-of-examples column 124 is the number of subjects (samples) in the training data 41 (FIG. 2) who fall under that combination of event occurrence, age, and gender.

The attributes of the joint distribution information 31 (the combination of event occurrence, age, and gender that occur together) match the attributes of the statistical information 43 in FIG. 3 exactly. The values of each attribute of the joint distribution information 31 (the binary value "0" or "1", the age ranges, and the binary value "male" or "female") also match the values of each attribute of the statistical information 43 exactly.

Comparing the number of examples in the statistical information 43 (FIG. 3) with the number of examples in the training data in the joint distribution information 31 (FIG. 4) reveals, for example, the following. These are the biases that the training data 41 (sample) collected from the specific medical facility has relative to the citizens or other persons (population).

- In the training data 41, the proportion of males is high.
- In the training data 41, the proportion of event occurrence flags equal to "1" is high for both males and females.
- In the training data 41, the proportion of females whose event occurrence flag is "1" is significantly greater than the corresponding proportion of males. No such tendency is observed in the population.
- In the training data 41, the proportion of young females whose event occurrence flag is "1" is high.
- In the training data 41, the proportion of elderly males whose event occurrence flag is "1" is high.

(Weight Information)
FIG. 5 is an example of the weight information 32. In the weight information 32, an age is stored in an age column 132, a gender in a gender column 133, and a weight in a weight column 134, each in association with the event occurrence flag stored in an event occurrence column 131.

The event occurrence flag in the event occurrence column 131 is the same as the event occurrence flag in FIG. 2.
The age in the age column 132 is the same as the age in FIG. 2.
The gender in the gender column 133 is the same as the gender in FIG. 2.
The weight in the weight column 134 is the result of dividing the number of examples in the statistical information (column 114 in FIG. 3) by the number of examples in the training data (column 124 in FIG. 4). In other words, the weight is the degree to which the sample of the joint distribution information is biased relative to the statistical information as a population.
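
Written as a formula, with notation introduced here purely for illustration (the source states the rule only in words), the weight for event occurrence flag e, age stratum a, and gender g is:

```latex
w(e, a, g) = \frac{N_{\mathrm{stat}}(e, a, g)}{N_{\mathrm{train}}(e, a, g)}
```

Here N_stat is the number of examples in the statistical information (column 114 in FIG. 3) and N_train is the number of examples in the training data (column 124 in FIG. 4) for the same combination.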

(Meaning of the Weights)
Looking down the weights of the records (rows) in FIG. 5, if the weights of all records had the same value, the training data 41 would have no bias relative to the statistical information 43, and the specific medical facility could be said to represent the population of citizens or other persons ideally. In most cases, however, these values vary. The variation shows the character of the bias in the training data 41. For example, the weight of record 135 is "5.0", which is significantly larger than those of the other records. The weight of record 136 is "1.0", which is significantly smaller than those of the other records. These indicate the following.

- At the specific medical facility where the training data 41 was collected, among the samples whose age is "20-29" and whose gender is "female", those with an event occurrence flag of "0" are under-represented and those with "1" are over-represented.
- Therefore, if a mathematical model is trained on the training data 41 collected from this medical facility, the trained model will output the erroneous conclusion that "women aged 20 to 29 are prone to the particular disease."
- To prevent such an error, the records in the training data 41 that have the attributes of record 135 are copied to increase the number of samples, while those that have the attributes of record 136 are left as they are without copying, so that new training data with different content is created.
- The mathematical model is then trained again using the new training data.

The processing procedures of this embodiment are described below. There are four procedures: the joint distribution creation procedure, the weight calculation procedure, the prediction model creation procedure, and the output procedure. They are executed in this order.

(Joint Distribution Creation Procedure)
FIG. 6 is a flowchart of the joint distribution creation procedure.
In step S201, the joint distribution creation unit 21 of the information processing device 1 acquires the training data 41 and the statistical information 43. Specifically, the joint distribution creation unit 21 acquires from the database 2 the training data 41 and the statistical information 43 in the form of the "administrative documents" described above. The statistical information 43 acquired here does not yet have the tidy format shown in FIG. 3.

In step S202, the joint distribution creation unit 21 specifies the variables. Specifically, first, the joint distribution creation unit 21 recognizes the attributes of the training data 41 (in the example of FIG. 2, event occurrence, age, and gender) as "variables."
Second, the joint distribution creation unit 21 recognizes the strata of each variable. For example, it recognizes that ages in the statistical information 43 (administrative documents) are stratified as "0-9", "10-19", "20-29", and so on.
Third, the joint distribution creation unit 21 creates the statistical information 43 in the format of FIG. 3, which has as attributes only the variables recognized in the "first" part of step S202 and has one record for each stratum of each variable recognized in the "second" part of step S202. In doing so, the joint distribution creation unit 21 refers to the administrative documents and calculates the number of examples in the statistical information (column 114 in FIG. 3) for each combination of event occurrence, gender, and age stratum.
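
The three parts of step S202 can be sketched as follows. This is a minimal illustration only: the record layout, the sample values, the ten-year strata, and the function names `age_stratum` and `build_statistical_info` are assumptions made here and are not taken from the source.

```python
from collections import Counter

# Hypothetical administrative records: (event flag, age, gender, occupation).
# The layout and values are illustrative assumptions, not data from the source.
ADMIN_RECORDS = [
    (0, 24, "female", "clerk"),
    (0, 27, "female", "nurse"),
    (1, 23, "female", "student"),
    (0, 35, "male", "driver"),
    (1, 38, "male", "farmer"),
]


def age_stratum(age: int) -> str:
    """Map an age to a ten-year stratum label such as "20-29"."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def build_statistical_info(records):
    """Count people per (event, age stratum, gender).

    Attributes that do not appear in the training data (here, occupation)
    are discarded before counting.
    """
    counts = Counter()
    for event, age, gender, *_other in records:
        counts[(event, age_stratum(age), gender)] += 1
    return counts


stat_info = build_statistical_info(ADMIN_RECORDS)
```

Each key of `stat_info` plays the role of one record in FIG. 3, and the count plays the role of column 114.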

In step S203, the joint distribution creation unit 21 creates the joint distribution information 31. Specifically, first, the joint distribution creation unit 21 makes a copy of the statistical information 43 (FIG. 3) created in the "third" part of step S202, renames "number of examples in the statistical information" (column 114 in FIG. 3) to "number of examples in the training data", and deletes the value of the number of examples in each record (resets it to blank).
Second, the joint distribution creation unit 21 refers to the training data 41 and calculates the number of examples in the training data (column 124 in FIG. 4), thereby completing the joint distribution information 31 of FIG. 4. The joint distribution information 31 completed here has the same variables as the statistical information 43 (FIG. 3). The joint distribution creation procedure then ends.
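
A minimal sketch of step S203, under stated assumptions: the dictionaries, the sample values, and the function names are hypothetical, and skipping training records whose combination is absent from the statistical information is a choice made here that the source does not address.

```python
# Hypothetical statistical information keyed by (event flag, age stratum, gender).
STAT_INFO = {
    (0, "20-29", "female"): 500,
    (1, "20-29", "female"): 100,
    (0, "30-39", "male"): 400,
    (1, "30-39", "male"): 200,
}

# Hypothetical training records: (subject ID, event flag, age, gender).
TRAINING_DATA = [
    ("P001", 0, 25, "female"),
    ("P002", 1, 22, "female"),
    ("P003", 1, 29, "female"),
    ("P004", 0, 31, "male"),
    ("P005", 1, 36, "male"),
    ("P006", 0, 38, "male"),
]


def age_stratum(age):
    """Map an age to a ten-year stratum label such as "20-29"."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def build_joint_distribution(stat_info, training_data):
    # First: copy the record layout of the statistical information
    # and reset every count to blank (zero here).
    joint = {key: 0 for key in stat_info}
    # Second: fill in the counts by scanning the training data.
    for _subject_id, event, age, gender in training_data:
        key = (event, age_stratum(age), gender)
        if key in joint:  # assumption: unmatched combinations are skipped
            joint[key] += 1
    return joint


joint_info = build_joint_distribution(STAT_INFO, TRAINING_DATA)
```

Because the keys are copied from the statistical information, the result has the same variables and strata as that table by construction.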

(Weight Calculation Procedure)
FIG. 7 is a flowchart of the weight calculation procedure.
In step S211, the weight calculation unit 22 of the information processing device 1 acquires the statistical information 43 and the joint distribution information 31. Specifically, the weight calculation unit 22 acquires the statistical information 43 created in the "third" part of step S202 and the joint distribution information 31 completed in the "second" part of step S203.

In step S212, the weight calculation unit 22 creates the weight information 32. Specifically, first, the weight calculation unit 22 makes a copy of the statistical information 43 (FIG. 3) acquired in step S211, renames "number of examples in the statistical information" (column 114 in FIG. 3) to "weight", and deletes the value of the number of examples in each record (resets it to blank).
Second, the weight calculation unit 22 divides the number of examples in the statistical information 43 acquired in step S211 by the number of examples in the training data in the joint distribution information 31 acquired in step S211, and stores the result as the "weight". By calculating a weight for each combination of event occurrence, age stratum, and gender, the weight calculation unit 22 completes the weight information 32 of FIG. 5. The weight calculation procedure then ends.
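
The division in step S212 can be sketched as below. The counts are hypothetical, `compute_weights` is a name chosen here, and returning `None` for a combination with no training examples is an assumption; the source does not say how such a cell is handled.

```python
# Hypothetical counts keyed by (event flag, age stratum, gender).
STAT_COUNTS = {
    (0, "20-29", "female"): 500,
    (1, "20-29", "female"): 100,
    (0, "30-39", "male"): 300,
}
TRAIN_COUNTS = {
    (0, "20-29", "female"): 100,
    (1, "20-29", "female"): 100,
    (0, "30-39", "male"): 0,
}


def compute_weights(stat_counts, train_counts):
    """Weight = statistical count / training count, per combination."""
    weights = {}
    for key, stat_n in stat_counts.items():
        train_n = train_counts.get(key, 0)
        # Assumption: an empty training cell has no defined weight.
        weights[key] = stat_n / train_n if train_n > 0 else None
    return weights


weights = compute_weights(STAT_COUNTS, TRAIN_COUNTS)
```

With these hypothetical counts, the first combination gets a weight of 5.0 and the second a weight of 1.0, mirroring the contrast between records 135 and 136 described in the text.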

(Prediction Model Creation Procedure)
FIG. 8 is a flowchart of the prediction model creation procedure.
In step S221, the prediction model creation unit 23 of the information processing device 1 acquires the training data 41 and the weight information 32. Specifically, the prediction model creation unit 23 acquires the training data 41 from the database 2 and also acquires the weight information 32 completed in the "second" part of step S212. The weight information 32 acquired here expresses the bias of the training data 41.

As described above, the weight of record 135 in FIG. 5, for example, is "5.0". This indicates that the number of people in the population of the statistical information 43 whose "event occurrence" is "0", whose "age" is "20-29", and whose "gender" is "female" is five times the number of such samples in the training data 41. That is, at the medical facility where the training data 41 was collected, there are only one-fifth as many such samples as in the country or other population; in other words, such samples are five times as rare. Note that the figure "five times" has no absolute meaning. What indicates the bias is that "five times" stands out relative to the weights of the other records.

In step S222, the prediction model creation unit 23 creates the first weak learner 34 and the second weak learner 35. Specifically, first, the prediction model creation unit 23 trains the first weak learner 34 as follows.

- The prediction model creation unit 23 trains a mathematical model by machine learning using the training data 41 as it is, and takes the optimized mathematical model as the first weak learner 34.
- The mathematical model here has age and gender as explanatory variables and event occurrence as the objective variable, and may have a coefficient by which each explanatory variable is multiplied. In that case, each coefficient is optimized.
- The mathematical model here may also be a neural network having an input layer, multiple intermediate layers, and an output layer. In that case, what is optimized is the weight vector (a different concept from the weights in column 134 of FIG. 5) that determines how much of the information at a given node is propagated to which node in the next layer.

Second, the prediction model creation unit 23 trains the second weak learner 35 as follows.
- The prediction model creation unit 23 copies (duplicates) records (samples) of the training data 41 according to the weights in FIG. 5. For example, corresponding to record 135 in FIG. 5, it increases fivefold the records in the training data 41 whose event occurrence flag is "0", whose age is "20-29", and whose gender is "female".
- In the same way, for every other record in FIG. 5, the prediction model creation unit 23 increases the corresponding records (samples) of the training data 41 by the multiple indicated by the weight.
- The prediction model creation unit 23 trains the mathematical model described above by machine learning using the training data 41 whose number of records has been corrected in this way, and takes the optimized mathematical model as the second weak learner 35.
The second weak learner 35 has higher prediction accuracy than the first weak learner 34.
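
The record copying described above can be sketched as follows. The dictionary-based record layout and the name `replicate_by_weight` are hypothetical, and rounding a weight to the nearest whole number of copies (with at least one copy kept) is an assumption; the source does not state how non-integer weights are treated.

```python
def replicate_by_weight(training_data, weights):
    """Return corrected training data in which each record appears
    as many times as the weight of its combination indicates."""
    corrected = []
    for record in training_data:
        key = (record["event"], record["age_stratum"], record["gender"])
        # Assumption: round to a whole number of copies, never below one.
        copies = max(1, round(weights.get(key, 1.0)))
        corrected.extend(dict(record) for _ in range(copies))
    return corrected


# Hypothetical example: one under-represented record and one ordinary record.
example_data = [
    {"event": 0, "age_stratum": "20-29", "gender": "female"},
    {"event": 1, "age_stratum": "20-29", "gender": "female"},
]
example_weights = {
    (0, "20-29", "female"): 5.0,
    (1, "20-29", "female"): 1.0,
}
corrected_data = replicate_by_weight(example_data, example_weights)
```

The original list is left untouched, so the same data can still be used as it is to train the first weak learner.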

If the number of records in the corrected training data 41 increases drastically, training the mathematical model on it takes a long time. Suppose, for example, that the weight of record 135 in FIG. 5 is "50.0" and the weight of record 136 is "20.0". In this case, the prediction model creation unit 23 does not need to increase the records of the training data 41 corresponding to record 135 fiftyfold and those corresponding to record 136 twentyfold. It may instead, for example, increase the records corresponding to record 135 fivefold and those corresponding to record 136 twofold (limited copying).
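
One way to realize limited copying is to divide all weights by a common factor so that the largest copy count stays within a cap. The source gives only the 50-to-5 and 20-to-2 example; the scaling rule, the cap of 5, and the name `limit_copy_counts` below are assumptions chosen to reproduce that example.

```python
def limit_copy_counts(weights, max_copies=5):
    """Scale weights down by a common factor so that no record is
    copied more than max_copies times, preserving their ratios."""
    largest = max(weights.values())
    # Only scale down; weights already within the cap are left alone.
    scale = max(1.0, largest / max_copies)
    return {key: max(1, round(w / scale)) for key, w in weights.items()}


# The hypothetical weights from the example in the text.
limited = limit_copy_counts({"record_135": 50.0, "record_136": 20.0})
```

Because both weights are divided by the same factor, the relative emphasis between the two combinations is preserved while the corrected data set stays small.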

In step S223, the prediction model creation unit 23 creates the ensemble model 33. Specifically, the prediction model creation unit 23 creates the ensemble model 33 having the first weak learner 34 and the second weak learner 35. As described above, an ensemble model is a model that adopts the outputs of its constituent weak learners by "majority vote". Thereafter, the prediction model creation processing procedure ends.

(Output Processing Procedure)
FIG. 9 is a flowchart of the output processing procedure.
In step S231, the output processing unit 24 of the information processing device 1 acquires the prediction data 42. Specifically, the output processing unit 24 acquires the prediction data 42 from the database 2. The prediction data 42 has the same structure as the learning data 41 (FIG. 2). However, the event occurrence column of the prediction data 42 is blank. In other words, whereas the learning data 41 is "supervised learning data" (past samples) for which the event occurrence is known, the prediction data 42 consists of current samples for which the event occurrence is unknown. In many cases, the prediction data 42 is collected from medical facilities other than the specific medical facility.

In step S232, the output processing unit 24 displays the prediction result for the event occurrence. Specifically, first, the output processing unit 24 inputs the combination of age and gender specified by the user to the prediction model 33 as prediction data, and obtains the event occurrence as the output of the prediction model 33. Here, the output processing unit 24 may input the combination of age and gender specified by the user only to the first weak learner 34, or only to the second weak learner 35.
Second, the output processing unit 24 displays the display screen 51 (described in detail later) on the output device 13. Thereafter, the output processing procedure ends.

(Display Screen)
FIG. 10 is an example of the display screen 51. Suppose the user wishes to do the following.
- View, by age group, the number of men who did not contract a specific disease in the past.
- Compare that number between the specific medical facility and a population such as the nation as a whole.

The user therefore enters "age" in the horizontal axis column 58 of the display information column 53 in the graph display column 52, enters "event occurrence: 0, gender: male" in the legend column 59, and presses the graph display execution button 55. The output processing unit 24 then displays line 56 and line 57 so that they can be compared on the graph 54, whose horizontal axis is the age group and whose vertical axis is the number of samples. Line 56 indicates the number of cases in the statistical information, and line 57 indicates the number of cases in the learning data. In FIG. 10, a single vertical-axis scale is used for simplicity, but the output processing unit 24 may display a scale for line 56 (larger order of magnitude) separately from the scale for line 57 (smaller order of magnitude).

In addition, the user wishes to do the following.
- Predict the event occurrence for women in the age group "30 to 39" in the current prediction data collected from other medical facilities.
- Predict the event occurrence using the ensemble model 33.

The user therefore enters, as prediction data, "30 to 39" in the age column 63 of the prediction data column 62 in the analysis column 61, enters "female" in the gender column 64, enters "ensemble" in the prediction model column 65, and presses the analysis execution button 66. The output processing unit 24 then displays the prediction result in the event occurrence prediction result column 67, and displays an explanation of the prediction result in the prediction result explanation column 68.

From these display examples, it can be seen that the output processing unit 24 performed the following processing.
- The output processing unit 24 input the age "30 to 39" and the gender "female" to the ensemble model 33 having the first weak learner 34 and the second weak learner 35.
- The first weak learner 34 output event occurrence "1".
- The second weak learner 35 also output event occurrence "1".
- The ensemble model 33 took a majority vote of these outputs and output event occurrence "1". The "C" in the event occurrence prediction result column 67 corresponds to the output "1" after the majority vote.

Incidentally, if both the first weak learner 34 and the second weak learner 35 output event occurrence "0", the ensemble model 33 outputs event occurrence "0" as the result of the majority vote. The output processing unit 24 then displays "A" in the event occurrence prediction result column 67. If one of the first weak learner 34 and the second weak learner 35 outputs event occurrence "0" and the other outputs event occurrence "1", the ensemble model 33 outputs "tie" as the result of the majority vote. The output processing unit 24 then displays "B" in the event occurrence prediction result column 67.
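The mapping from the weak learners' outputs to the displayed letter can be sketched as below. The letters "A", "B", and "C" and the three cases come from the description; the function names and the use of `None` to represent a tie are assumptions made for illustration.

```python
from typing import Optional, Sequence


def majority_vote(outputs: Sequence[int]) -> Optional[int]:
    """Return 0 or 1 by majority, or None when the votes are tied."""
    ones = sum(1 for o in outputs if o == 1)
    zeros = len(outputs) - ones
    if ones == zeros:
        return None  # tie: one learner said 0, the other said 1
    return 1 if ones > zeros else 0


def result_label(vote: Optional[int]) -> str:
    """Map the vote to the letter shown in prediction result column 67."""
    return {0: "A", None: "B", 1: "C"}[vote]


# Both learners output "1", as in the FIG. 10 example.
label = result_label(majority_vote([1, 1]))
```

With only two weak learners, a "majority" exists only when they agree, which is why the tie case needs its own display value.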

In the example of FIG. 10, the user chose to use the ensemble model. However, the user may instead choose to use only the first weak learner, or only the second weak learner. Accordingly, based on the user's selection, the output processing unit 24 can output the prediction result using only the first weak learner, using only the second weak learner, or using the ensemble model.
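One way to picture this selection is a small dispatch function. This is a sketch only: the selection strings, the callable type `Learner`, and the stand-in learners are hypothetical, and a tie in the ensemble is represented by `None`.

```python
from typing import Callable, Optional

# Assumption: a weak learner maps (age group, gender) to event occurrence 0 or 1.
Learner = Callable[[str, str], int]


def predict(selection: str, first: Learner, second: Learner,
            age: str, gender: str) -> Optional[int]:
    """Predict with the model the user selected in prediction model column 65."""
    if selection == "first":
        return first(age, gender)
    if selection == "second":
        return second(age, gender)
    if selection == "ensemble":
        a, b = first(age, gender), second(age, gender)
        return a if a == b else None  # None stands for a tie
    raise ValueError(f"unknown selection: {selection!r}")


# Stand-in learners for illustration; real ones would be trained models.
always_one: Learner = lambda age, gender: 1
always_zero: Learner = lambda age, gender: 0

result = predict("ensemble", always_one, always_one, "30 to 39", "female")
```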

(Significance of Modifying the Learning Data Using Weights)
It is of course possible to machine-learn the mathematical model using the statistical information 43 (FIG. 3) as the learning data from the outset. However, it often becomes apparent only after the fact that the learning data collected from a specific single medical facility is biased. In addition, a mathematical model machine-learned using learning data collected from a specific single medical facility may already exist, and the user may wish to continue using that machine-learned mathematical model solely for predictions on samples collected from that medical facility. Furthermore, as described above, simply sorting the statistical information 43, dividing the case counts for each stratum to calculate the weights, and copying the learning data in a limited manner is simpler than performing machine learning using the statistical information 43, which contains a huge amount of information.
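As a rough illustration of "dividing the case counts for each stratum", the sketch below computes one weight per stratum. The exact formula is not given in this section, so the ratio used here (the stratum's share of the population divided by its share of the sample) is an assumption, as are the function name and the example counts.

```python
from typing import Dict, Tuple

Stratum = Tuple[str, str]  # (age group, gender)


def stratum_weights(sample_counts: Dict[Stratum, int],
                    population_counts: Dict[Stratum, int]) -> Dict[Stratum, float]:
    """Weight = (population share of the stratum) / (sample share of the stratum).

    Assumed formula: a weight above 1 means the stratum is under-represented
    in the learning data relative to the statistical information.
    """
    n_sample = sum(sample_counts.values())
    n_population = sum(population_counts.values())
    weights: Dict[Stratum, float] = {}
    for key, pop in population_counts.items():
        smp = sample_counts.get(key, 0)
        if smp == 0:
            continue  # no learning-data records to copy for this stratum
        weights[key] = (pop / n_population) / (smp / n_sample)
    return weights


# Hypothetical counts: young women are 10% of the sample but 50% of the population.
sample = {("20 to 29", "female"): 10, ("20 to 29", "male"): 90}
population = {("20 to 29", "female"): 500, ("20 to 29", "male"): 500}
w = stratum_weights(sample, population)
```

In this example the under-represented stratum receives a weight of about 5, which matches the kind of copy factor discussed for record 135.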

(Effects of This Embodiment)
The effects of the information processing device of this embodiment are as follows.
(1) The information processing device can modify the learning data based on the bias of the learning data relative to the population.
(2) The information processing device can correct bias in samples from a medical facility.
(3) The information processing device can modify the learning data by the simple method of copying.
(4) The information processing device can use an ensemble model with high prediction accuracy.
(5) The information processing device allows the user to select use of the ensemble model.
(6) The information processing device can display the bias of the learning data relative to the population in a comparable manner.

The present invention is not limited to the embodiments described above and includes various modifications. For example, the embodiments above have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations having all of the elements described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of one embodiment. In addition, for part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be realized in hardware, for example by designing them as integrated circuits. The configurations, functions, and the like described above may also be realized in software, with a processor interpreting and executing programs that realize the respective functions. Information such as programs, tables, and files that realize each function can be stored in a memory, in a recording device such as a hard disk or SSD (Solid State Drive), or on a recording medium such as an IC card, SD card, or DVD.
The control lines and information lines shown are those considered necessary for the explanation, and not all control lines and information lines in a product are necessarily shown. In practice, almost all components may be considered to be connected to one another.

REFERENCE SIGNS LIST
1 Information processing device
2 Database
3 Network
11 Central control device
12 Input device
13 Output device
14 Main storage device
15 Auxiliary storage device
16 Communication device
21 Joint distribution creation unit
22 Weight calculation unit
23 Prediction model creation unit
24 Output processing unit
31 Joint distribution information
32 Weight information
33 Prediction model (ensemble model)
34 First weak learner
35 Second weak learner
41 Learning data
42 Prediction data
43 Statistical information
51 Display screen

Claims

1. An information processing device comprising:
a joint distribution creation unit that creates, from learning data, joint distribution information having the same variables as statistical information;
a weight calculation unit that calculates, as a weight for each combination of the multiple variables, a degree of bias of the created joint distribution information as a sample compared to the statistical information as a population;
a prediction model creation unit that modifies the learning data based on the calculated weights and trains a prediction model using the modified learning data; and
an output processing unit that outputs a prediction result for prediction data using the prediction model,
wherein the statistical information is obtained from a population including a facility where the samples of the learning data were collected,
wherein the prediction model creation unit increases the number of samples of the learning data by copying samples of the learning data based on the weights, and
wherein the prediction model creation unit trains a first weak learner using the learning data before the modification, trains a second weak learner using the learning data after the modification, and creates an ensemble model having the first weak learner and the second weak learner.

2. The information processing device according to claim 1, wherein the output processing unit outputs the prediction result using only the first weak learner, using only the second weak learner, or using the ensemble model, based on a user selection.

3. The information processing device according to claim 2, wherein the output processing unit graphically displays the number of samples of the statistical information and the number of samples of the learning data in a comparable manner.

4. An information processing method for an information processing device, wherein:
a joint distribution creation unit of the information processing device creates, from learning data, joint distribution information having the same variables as statistical information;
a weight calculation unit of the information processing device calculates, as a weight for each combination of the multiple variables, a degree of bias of the created joint distribution information as a sample compared to the statistical information as a population;
a prediction model creation unit of the information processing device modifies the learning data based on the calculated weights and trains a prediction model using the modified learning data;
an output processing unit of the information processing device outputs a prediction result for prediction data using the prediction model;
the statistical information is obtained from a population including a facility where the samples of the learning data were collected;
the prediction model creation unit increases the number of samples of the learning data by copying samples of the learning data based on the weights; and
the prediction model creation unit trains a first weak learner using the learning data before the modification, trains a second weak learner using the learning data after the modification, and creates an ensemble model having the first weak learner and the second weak learner.

5. An information processing program for causing a computer to function as:
a joint distribution creation unit that creates, from learning data, joint distribution information having the same variables as statistical information;
a weight calculation unit that calculates, as a weight for each combination of the multiple variables, a degree of bias of the created joint distribution information as a sample compared to the statistical information as a population;
a prediction model creation unit that modifies the learning data based on the calculated weights and trains a prediction model using the modified learning data; and
an output processing unit that outputs a prediction result for prediction data using the prediction model,
wherein the statistical information is obtained from a population including a facility where the samples of the learning data were collected,
wherein the program causes the prediction model creation unit to execute a process of increasing the number of samples of the learning data by copying samples of the learning data based on the weights, and
wherein the program causes the prediction model creation unit to execute a process of training a first weak learner using the learning data before the modification, training a second weak learner using the learning data after the modification, and creating an ensemble model having the first weak learner and the second weak learner.