JP7645461B2

JP7645461B2 - Model generation device, model generation method and program

Info

Publication number: JP7645461B2
Application number: JP2021091906A
Authority: JP
Inventors: 雄一郎定永; 伸夫原
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2025-03-14
Anticipated expiration: 2041-05-31
Also published as: JP2022184197A

Description

The present disclosure relates to a model generation device and the like that generate a model indicating relationships between data.

Model generation devices that generate a model for estimating data have been proposed. For example, such a device generates a model by selecting an objective variable and an explanatory variable from a dataset containing data for each of a plurality of variables, and then deriving either a correlation coefficient between those variables or a regression model that uses them. The dataset includes, for example, multiple items of manufacturing data. The objective variable represents, for example, a quality characteristic of a manufactured product, and the explanatory variable represents a parameter used in the manufacturing process. The generated model can therefore be used to estimate the quality characteristics of a product from the manufacturing process.
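As a minimal sketch of the two model types mentioned above (a correlation coefficient and a regression model), the following computes a Pearson correlation and a least-squares line for one explanatory variable and one objective variable. The variable names `temperature` and `strength` and all values are hypothetical illustrations, not data from the disclosure.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical data: a process parameter (explanatory variable)
# and a quality characteristic (objective variable).
temperature = [180.0, 190.0, 200.0, 210.0, 220.0]
strength = [41.0, 43.0, 45.0, 47.0, 49.0]

r = pearson_r(temperature, strength)
slope, intercept = fit_line(temperature, strength)
```

Once fitted, `slope * t + intercept` gives an estimate of the quality characteristic for a new process setting `t`.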

A relation analysis device that creates local quality models has also been proposed (see, for example, Patent Document 1). Because it creates local quality models as the model described above, this relation analysis device can be regarded as a model generation device. It divides the space of operational factors, which correspond to explanatory variables, into multiple local regions and creates a local quality model for each local region.

Japanese Patent No. 4653547

However, the model generation device of Patent Document 1, that is, the relation analysis device, has the problem that it is difficult to improve the accuracy of the model.

Therefore, the present disclosure provides a model generation device and the like that can easily improve the accuracy of a model.

A model generation device according to one aspect of the present disclosure generates a model indicating a relationship between one or more objective variables and one or more explanatory variables, and includes: a receiving means for receiving a dataset including three or more variables; a first variable identification means for identifying one or more objective variables and one or more explanatory variables from the dataset; an improvement degree identification means for identifying, for each of one or more stratification variable candidates that are variables other than the identified objective variables and explanatory variables among the three or more variables included in the dataset, an improvement degree, which is the degree to which the certainty of the model increases when that stratification variable candidate is used; a second variable identification means for identifying one or more of the stratification variable candidates as stratification variables, based on the improvement degree of each of the one or more stratification variable candidates; a stratification means for classifying the dataset into multiple strata based on the tendency of the relationship between the stratification variables and the objective variables; and a generation means for generating the model for each of the multiple strata.

These general or specific aspects may be realized by a system, a method, an integrated circuit, a computer program, or a recording medium such as a computer-readable CD-ROM, or by any combination of a system, a method, an integrated circuit, a computer program, and a recording medium. The recording medium may be a non-transitory recording medium.

The model generation device of the present disclosure can easily improve the accuracy of a model.

Further advantages and effects of one aspect of the present disclosure will become apparent from the specification and drawings. Such advantages and/or effects are each provided by some of the embodiments and by the features described in the specification and drawings, but not all of them need to be provided in order to obtain one or more of the same features.

FIG. 1 is a diagram showing an example of a model generation system according to the embodiment.
FIG. 2 is a diagram showing the configuration of a model generation device according to the embodiment.
FIG. 3A is a diagram showing an example of a dataset according to the embodiment.
FIG. 3B is a diagram showing the first and second rows of the dataset of FIG. 3A.
FIG. 3C is a diagram showing an objective variable and an explanatory variable selected from the dataset according to the embodiment.
FIG. 4 is a diagram showing another example of a dataset according to the embodiment.
FIG. 5 is a diagram showing the variable names and other items of a dataset according to the embodiment in simplified form.
FIG. 6 is a block diagram showing the functional configuration of the model generation device according to the embodiment.
FIG. 7 is a diagram showing an example of stratified datasets according to the embodiment.
FIG. 8 is a diagram showing, for each stratified dataset according to the embodiment, the distribution of the coordinate points indicated by the records included in that stratified dataset.
FIG. 9 is a flowchart showing an example of the overall processing operation of the model generation device according to the embodiment.
FIG. 10 is a flowchart showing a specific example of the qualitative variable candidate extraction process in step S7 of FIG. 9.
FIG. 11 is a flowchart showing a specific example of the quantitative variable candidate extraction process in step S8 of FIG. 9.
FIG. 12 is a flowchart showing a specific example of the improvement degree calculation process in step S9 of FIG. 9.

(Findings that formed the basis of the present disclosure)
The present inventors have found that the model generation device of Patent Document 1, described in the "Background Art" section, has the following problems.

In Patent Document 1, the space of operational factors corresponding to the explanatory variables is divided into multiple local regions, and a local quality model is constructed for each local region. Multiple models are therefore generated from a dataset. However, only the objective variables and explanatory variables included in the dataset are used to construct these models; variables other than the objective variables and explanatory variables are not used. Specifically, the dataset is divided into multiple local regions based only on the distribution of the data of the explanatory variables included in the dataset, and a model is generated for each of those local regions. In other words, in Patent Document 1, the influence that variables other than the explanatory and objective variables have on the correlation between the explanatory variables and the objective variables is unknown, so those other variables are not used to construct the model. Consequently, when the data of variables other than the explanatory and objective variables affect the correlation between the explanatory variables and the objective variables, it is difficult to generate a highly accurate model.

Therefore, a model generation device according to one aspect of the present disclosure generates a model indicating a relationship between one or more objective variables and one or more explanatory variables, and includes: a receiving means for receiving a dataset including three or more variables; a first variable identification means for identifying one or more objective variables and one or more explanatory variables from the dataset; an improvement degree identification means for identifying, for each of one or more stratification variable candidates that are variables other than the identified objective variables and explanatory variables among the three or more variables included in the dataset, an improvement degree, which is the degree to which the certainty of the model increases when that stratification variable candidate is used; a second variable identification means for identifying one or more of the stratification variable candidates as stratification variables, based on the improvement degree of each of the one or more stratification variable candidates; a stratification means for classifying the dataset into multiple strata based on the tendency of the relationship between the stratification variables and the objective variables; and a generation means for generating the model for each of the multiple strata. For example, the stratification means may, for each stratification variable, classify the data of that stratification variable into multiple groups based on the identity or similarity of the data, and classify the dataset for each combination of those groups.

As a result, stratification is performed according to the stratification variables, which are variables other than the objective variables and explanatory variables, and a model is generated for each of the multiple strata. That is, the data of each stratification variable are classified into multiple groups according to the commonality or similarity among the data, and the dataset is then stratified into strata corresponding to those groups. Through this stratification, each stratum contains one or more records whose stratification variable data belong to the same group. A group is a set of data that share commonality or similarity; groups include categories, which are sets of identical data, and clusters, which are sets of similar numerical data. A stratification variable is a non-utilized variable: it is not adopted as a variable in the model, but it is used in generating the model. In this way, in the model generation device according to one aspect of the present disclosure, the dataset is stratified by non-utilized variables, that is, variables in the dataset other than the explanatory and objective variables. Therefore, even when a non-utilized variable affects the correlation between the explanatory variables and the objective variables, a highly accurate model that takes that non-utilized variable into account can be generated. In other words, the accuracy of the model can be improved easily.

For example, a first stratification variable and a second stratification variable, each of which is a non-utilized variable, are identified. Two or more data items of the first stratification variable are classified into, for example, a first group and a second group, and two or more data items of the second stratification variable are classified into, for example, a third group and a fourth group. All data in each of these groups share commonality or high similarity. In this case, a first stratum, a second stratum, a third stratum, and a fourth stratum are determined as the multiple strata. The first stratum corresponds to the combination of the first group of the first stratification variable and the third group of the second stratification variable. The second stratum corresponds to the combination of the first group of the first stratification variable and the fourth group of the second stratification variable. The third stratum corresponds to the combination of the second group of the first stratification variable and the third group of the second stratification variable. The fourth stratum corresponds to the combination of the second group of the first stratification variable and the fourth group of the second stratification variable. In this way, the multiple strata are determined according to the combinations of the groups of the N stratification variables. Accordingly, in stratification, a record containing data belonging to the first group of the first stratification variable and data belonging to the third group of the second stratification variable is classified into the first stratum. A record containing data belonging to the first group of the first stratification variable and data belonging to the fourth group of the second stratification variable is classified into the second stratum. A record containing data belonging to the second group of the first stratification variable and data belonging to the third group of the second stratification variable is classified into the third stratum. A record containing data belonging to the second group of the first stratification variable and data belonging to the fourth group of the second stratification variable is classified into the fourth stratum.

In this way, even when there are two or more stratification variables, the dataset can be stratified optimally, and a highly accurate model can be generated for each of the multiple strata according to the data of those stratification variables, that is, of the N non-utilized variables.
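The combination-based stratification described above can be sketched as follows. This is an illustrative sketch only: the field names (`machine`, `humidity`), the threshold used as a stand-in for a cluster boundary, and the records are hypothetical and do not come from the disclosure.

```python
from collections import defaultdict

def stratify(records, stratification_vars, group_of):
    """Assign each record to the stratum given by the combination of
    groups of its stratification-variable data."""
    strata = defaultdict(list)
    for rec in records:
        key = tuple(group_of[var](rec[var]) for var in stratification_vars)
        strata[key].append(rec)
    return dict(strata)

# Hypothetical grouping rules: a category for the qualitative variable,
# and a fixed threshold standing in for two clusters of the quantitative one.
group_of = {
    "machine": lambda value: value,
    "humidity": lambda value: "low" if value < 50.0 else "high",
}

records = [
    {"id": 1, "machine": "A", "humidity": 32.0},
    {"id": 2, "machine": "A", "humidity": 71.0},
    {"id": 3, "machine": "B", "humidity": 30.0},
    {"id": 4, "machine": "B", "humidity": 69.0},
    {"id": 5, "machine": "A", "humidity": 35.0},
]

strata = stratify(records, ["machine", "humidity"], group_of)
```

With two stratification variables of two groups each, up to four strata result, one per group combination; a model would then be generated separately for the records in each stratum.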

In addition, in the model generation device according to one aspect of the present disclosure, the one or more stratification variables are identified from the one or more stratification variable candidates based on the improvement degrees of those candidates. For example, a stratification variable candidate with a large improvement degree is identified as a stratification variable. Because the stratification means then stratifies the dataset using a stratification variable with a large improvement degree, a more accurate model can be generated.

The dataset may include qualitative variables, whose data contain characters, and quantitative variables, whose data consist of numbers.

This makes it possible to identify multiple stratification variables that include both qualitative and quantitative variables rather than only one of the two, which increases the freedom in the variable types of the identified stratification variables.
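One simple way to separate the two variable types, sketched here under the assumption that each column arrives as a list of strings, is to test whether every value parses as a number. The function name `variable_type` is hypothetical.

```python
def variable_type(values):
    """Return 'quantitative' if every value parses as a number,
    otherwise 'qualitative'."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return "qualitative"
    return "quantitative"

machine_type = variable_type(["A", "B", "A"])
humidity_type = variable_type(["32.0", "71", "30.5"])
```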

In addition, the model generation device may further include a qualitative candidate extraction means that, when a variable other than the identified objective variables and explanatory variables among the three or more variables included in the dataset is a qualitative variable, extracts that qualitative variable from the three or more variables as a stratification variable candidate based on the degree of influence of the qualitative variable on the objective variable.

As a result, instead of treating every qualitative variable among the non-utilized variables in the dataset as a stratification variable candidate, only those qualitative variables that have a large influence on the objective variable, for example, can be treated as candidates. The improvement degree then needs to be identified only for some of the qualitative variables, namely those with a large influence, rather than for every qualitative variable among the non-utilized variables. In other words, the number of qualitative variables for which the improvement degree must be identified can be reduced. Furthermore, because a qualitative variable with a large influence is one for which a large improvement degree can be expected, the processing load of identifying improvement degrees can be reduced effectively.

In addition, the model generation device may further include a quantitative candidate extraction means that, when a variable other than the identified objective variables and explanatory variables among the three or more variables included in the dataset is a quantitative variable, extracts that quantitative variable from the three or more variables as a stratification variable candidate based on the state of the clusters obtained by clustering the quantitative variable using machine learning.

As a result, instead of treating every quantitative variable among the non-utilized variables in the dataset as a stratification variable candidate, only those quantitative variables that have many highly reliable clusters, for example, can be treated as candidates. The improvement degree then needs to be identified only for some of the quantitative variables, namely those with many highly reliable clusters, rather than for every quantitative variable among the non-utilized variables. In other words, the number of quantitative variables for which the improvement degree must be identified can be reduced. Furthermore, because a quantitative variable with many highly reliable clusters is one for which a large improvement degree can be expected, the processing load of identifying improvement degrees can be reduced effectively.

The qualitative candidate extraction means may calculate the degree of influence of the qualitative variable using a random forest or gradient-boosted decision trees.

This makes it possible to calculate an appropriate degree of influence for a qualitative variable, for example by calculating a value based on the Gini coefficient of a random forest as the degree of influence. As a result, qualitative variables with a large influence can be appropriately extracted as stratification variable candidates for which a large improvement degree can be expected.

The quantitative candidate extraction means may cluster two or more data items of the quantitative variable included in the dataset using a Gaussian mixture model or the k-means method.

This makes it possible to obtain the cluster state appropriately for each of one or more quantitative variables. As a result, quantitative variables with many highly reliable clusters can be appropriately extracted as stratification variable candidates for which a large improvement degree can be expected.
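To illustrate the clustering step, the following is a minimal one-dimensional k-means sketch in plain Python. It is not the device's implementation: the deterministic initialisation, the choice of `k`, and the `humidity` values are assumptions made for the example, and the criterion used to judge the "state" of the resulting clusters is not defined in this passage, so it is not modelled here.

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters

def kmeans_1d(values, k, max_iterations=100):
    """Cluster the data of one quantitative variable into k clusters."""
    ordered = sorted(values)
    # Deterministic start (an assumption): spread centroids over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)

# Hypothetical data of one non-utilized quantitative variable.
humidity = [30.0, 31.0, 29.0, 70.0, 71.0, 69.0]
centroids, clusters = kmeans_1d(humidity, k=2)
```

Each resulting cluster plays the role of a group of similar numerical data for that variable.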

The improvement degree identification means may identify the improvement degree by calculating the difference between an index indicating the certainty of the model obtained from the dataset classified using the stratification variable candidate and an index indicating the certainty of the model obtained from the unclassified dataset.

This makes it possible to identify the improvement degree of a stratification variable candidate appropriately.
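The difference calculation can be sketched as below. Two points are assumptions of this sketch rather than statements of the disclosure: the coefficient of determination of a simple linear regression is used as the certainty index, and the index for the stratified dataset is taken as the record-weighted mean over strata. The data and the names `x`, `y`, and `machine` are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination of a least-squares line through (xs, ys).
    Requires at least two distinct x values and non-constant ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / syy

def improvement_degree(rows, candidate):
    """Index with stratification by `candidate` minus index without it."""
    baseline = r_squared([r["x"] for r in rows], [r["y"] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[candidate], []).append(r)
    # Assumption: aggregate per-stratum indices by a record-weighted mean.
    stratified = sum(
        len(g) * r_squared([r["x"] for r in g], [r["y"] for r in g])
        for g in groups.values()
    ) / len(rows)
    return stratified - baseline

# Hypothetical data: y rises with x on machine A and falls with x on machine B.
rows = (
    [{"machine": "A", "x": float(x), "y": 2.0 * x} for x in range(1, 5)]
    + [{"machine": "B", "x": float(x), "y": -2.0 * x + 10.0} for x in range(1, 5)]
)

improvement = improvement_degree(rows, "machine")
```

In this constructed example the pooled data show no linear relationship, while each stratum is perfectly linear, so stratifying by `machine` yields the largest possible improvement under these assumptions.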

The generation means may further calculate, for each of the generated models, an index indicating the certainty of that model, and the model generation device may further include a result output means for outputting the index calculated for each of the models.

As a result, for example, the coefficient of determination adjusted for degrees of freedom is calculated for each of the models and output as the index indicating the certainty of that model. The user can therefore easily decide, based on the index, whether to use a generated model.
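For reference, the coefficient of determination adjusted for degrees of freedom is commonly defined as follows, where $R^2$ is the ordinary coefficient of determination, $n$ is the number of records in the stratum, and $p$ is the number of explanatory variables. The symbols are introduced here for illustration; the passage above does not give the formula.

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Because the correction factor grows as $p$ approaches $n$, a stratum with few records is penalized for using many explanatory variables.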

The generation means may generate, as a multiple regression equation, the model indicating the relationship between the data of each of two or more explanatory variables and the data of the objective variable.

This makes it possible to generate an appropriate model regardless of the number of explanatory variables.
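In its usual form, a multiple regression equation with $p$ explanatory variables $x_1, \dots, x_p$ and objective variable $y$ can be written as below. The notation is an assumption for illustration, and estimating the coefficients by least squares is the conventional choice rather than something stated in this passage.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p
```

Under stratification, a separate set of coefficients $\beta_0, \dots, \beta_p$ would be estimated for each stratum.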

Embodiments are described in detail below with reference to the drawings.

Each of the embodiments described below shows a general or specific example. The numerical values, shapes, materials, components, arrangement and connection of components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components.

Each figure is a schematic diagram and is not necessarily drawn precisely. In the figures, the same components are given the same reference numerals.

(Embodiment 1)
[Hardware configuration]
FIG. 1 is a diagram showing an example of a model generation system according to the present embodiment.

The model generation system 1 according to the present embodiment includes a model generation device 100 and a manufacturing management device 500.

The manufacturing management device 500 is installed, for example, in a manufacturing plant and manages a manufacturing system that manufactures products. The manufacturing management device 500 transmits a dataset Ds obtained by the manufacturing system to the model generation device 100 via a network such as the Internet. Details of the dataset Ds are described later with reference to FIGS. 3A to 5.

The model generation device 100 is implemented by a personal computer or the like and receives the dataset Ds from the manufacturing management device 500. Based on the dataset Ds, the model generation device 100 according to the present embodiment generates multiple models indicating the relationship between the data of the explanatory variables and the data of the objective variables.

FIG. 2 is a diagram showing the configuration of the model generation device 100 according to the present embodiment.

The model generation device 100 includes an input unit 101, an arithmetic circuit 102, a memory 103, an output unit 104, a storage unit 105, a database 106, and a communication unit 107.

The communication unit 107 communicates with devices external to the model generation device 100. The communication may be wireless or wired. The wireless communication method may be Wi-Fi (registered trademark), Bluetooth (registered trademark), ZigBee (registered trademark), or another method. For example, the communication unit 107 communicates with the manufacturing management device 500 and receives the dataset Ds from it.

The input unit 101 functions as an HMI (Human Machine Interface) that accepts input operations by a user, and includes, for example, a keyboard, a mouse, a touch sensor, and a touch pad.

The output unit 104 has a display for showing images, characters, and the like; the display is, for example, a liquid crystal display, a plasma display, or an organic EL (Electro-Luminescence) display. The output unit 104 may also have a printer for printing images, characters, and the like, and may have a function for storing data output from the arithmetic circuit 102 in the storage unit 105 as a file.

The storage unit 105 stores a program (that is, a computer program) 105a in which the instructions for the arithmetic circuit 102 are written. The storage unit 105 may also store temporary data 105b generated temporarily by the processing of the arithmetic circuit 102. The storage unit 105 is a non-volatile recording medium such as a magnetic storage device (for example, a hard disk), an optical disc, or a semiconductor memory. The program 105a is provided to the model generation device 100 via, for example, removable media or a network, and is stored in the storage unit 105. The removable media are, for example, a CD-ROM (Compact Disc Read Only Memory) or a flash memory. For this reason, the communication unit 107 may include an interface for reading the program 105a from the removable media.

The memory 103 temporarily holds the program 105a that has been read and loaded by the arithmetic circuit 102. The memory 103 is, for example, a volatile RAM (Random Access Memory).

The arithmetic circuit 102 is a circuit that executes the program 105a expanded in the memory 103, and is, for example, a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). When executing the program 105a, the arithmetic circuit 102 may use the temporary data 105b stored in the storage unit 105.

データベース１０６は、記憶部１０５と同様に、不揮発性の記録媒体であって、例えば、ハードディスクなどの磁気記憶装置、光ディスク、半導体メモリなどである。例えば、演算回路１０２は、製造管理装置５００からネットワークおよび通信部１０７を介してデータセットＤｓを取得して、そのデータセットＤｓをデータベース１０６に格納する。 Like the storage unit 105, the database 106 is a non-volatile recording medium, for example, a magnetic storage device such as a hard disk, an optical disk, or a semiconductor memory. For example, the arithmetic circuit 102 acquires a data set Ds from the manufacturing management device 500 via the network and the communication unit 107, and stores the data set Ds in the database 106.

なお、本実施の形態では、記憶部１０５とデータベース１０６とは互に異なる記録媒体であるが、記憶部１０５およびデータベース１０６は、それらを含む１つの記録媒体として構成されていてもよい。 In this embodiment, the storage unit 105 and the database 106 are different recording media, but the storage unit 105 and the database 106 may be configured as a single recording medium that includes them.

[Dataset]
FIG. 3A is a diagram showing an example of the data set Ds in this embodiment, and FIG. 3B is a diagram showing the first row and the second row of the data set Ds.

データセットＤｓは、製造管理装置５００から送信される生のデータセットであって、例えば、上述の製造システムにおける製造プロセス、および、その製造プロセスによって製造された製品の品質を示す、複数の製造データからなる構造化されたデータセットである。このようなデータセットＤｓは、図３Ａに示すように、複数の変数のそれぞれの変数名と、それらの変数のデータとを示す。なお、データは、文字および数字のうちの少なくとも一方を示すものであれば、どのようなものであってもよい。データセットＤｓの先頭の行には、複数の変数のそれぞれの変数名が配置され、データセットＤｓの２行目以降の各行には、複数の変数のそれぞれのデータが配置されている。このような２行目以降の各行は、複数の変数のそれぞれのデータを含むレコードとして扱われる。また、データセットＤｓの左端の列は、紐付け情報列であって、それらのレコードを識別するための識別情報であるＩＤが示されている。ＩＤは、レコードに含まれる各変数のデータを紐付けている。 The data set Ds is a raw data set transmitted from the manufacturing management device 500, and is a structured data set consisting of multiple manufacturing data indicating, for example, the manufacturing process in the above-mentioned manufacturing system and the quality of the product manufactured by the manufacturing process. As shown in FIG. 3A, such a data set Ds indicates the variable names of multiple variables and the data of those variables. Note that the data may be any type as long as it indicates at least one of letters and numbers. The first row of the data set Ds contains the variable names of the multiple variables, and the second and subsequent rows of the data set Ds contain the data of the multiple variables. Each of these rows from the second row onwards is treated as a record containing the data of the multiple variables. The leftmost column of the data set Ds is a linking information column, and indicates an ID, which is identification information for identifying those records. The ID links the data of each variable contained in the record.

Specifically, as shown in FIG. 3B, the first row of the data set Ds contains the variable names "Voltage", "Speed", "Resistance", "Worker", "Facility No.", "Material Composition", "Material Temperature Difference", "Auxiliary Voltage", and "Jig Temperature". The record in the second row contains data d1 to d9 of the variables identified by those variable names. Data d1 is the data of the variable identified by the variable name "Voltage" and is, for example, "5.488135". Data d2 is the data of the variable identified by the variable name "Speed" and is, for example, "7.151894". Data d3 is the data of the variable identified by the variable name "Resistance" and is, for example, "44.69831". Data d4 is the data of the variable identified by the variable name "Worker" and is, for example, "Suzuki". Data d5 is the data of the variable identified by the variable name "Facility No." and is, for example, "Unit C". Data d6 is the data of the variable identified by the variable name "Material Composition" and is, for example, "0". Data d7 is the data of the variable identified by the variable name "Material Temperature Difference" and is, for example, "8.815673". Data d8 is the data of the variable identified by the variable name "Auxiliary Voltage" and is, for example, "3". Data d9 is the data of the variable identified by the variable name "Jig Temperature" and is, for example, "9.298481". The record containing the data d1 to d9 of these variables is identified by the ID "ID200901". In other words, the ID "ID200901" links the data d1 to d9 contained in the record identified by that ID.

データセットＤｓは、図３Ａに示すように、このようなレコードを複数含む。例えば、データセットＤｓは、上述のＩＤ「ＩＤ２００９０１」によって識別されるレコードと、ＩＤ「ＩＤ２００９０２」によって識別されるレコードと、ＩＤ「ＩＤ２００９０３」によって識別されるレコードとを含む。このように、本実施の形態におけるデータセットＤｓは、複数の変数のそれぞれのデータを有するレコードを２つ以上含む。 As shown in FIG. 3A, the dataset Ds includes a plurality of such records. For example, the dataset Ds includes a record identified by the above-mentioned ID "ID200901", a record identified by the ID "ID200902", and a record identified by the ID "ID200903". In this way, the dataset Ds in this embodiment includes two or more records having data for each of a plurality of variables.
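The layout described above (a header row of variable names, one record per subsequent row, and a leftmost ID column that ties each record's data together) can be sketched in a few lines. This is only an illustration: the CSV rendering, the header label "ID" for the leftmost column, and the function name `load_records` are assumptions, and only the single record quoted in the text is used.

```python
import csv
import io

# Header row plus the one record quoted in the text (ID200901).
# The CSV form and the header label "ID" are assumptions for illustration.
RAW = (
    "ID,Voltage,Speed,Resistance,Worker,Facility No.,Material Composition,"
    "Material Temperature Difference,Auxiliary Voltage,Jig Temperature\n"
    "ID200901,5.488135,7.151894,44.69831,Suzuki,Unit C,0,8.815673,3,9.298481\n"
)


def load_records(text):
    """Return (variable_names, records); records maps ID -> {variable name: datum}."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)          # first row: variable names
    names = header[1:]           # leftmost column is the linking (ID) column
    records = {row[0]: dict(zip(names, row[1:])) for row in rows}
    return names, records


names, records = load_records(RAW)
print(len(names))                          # number of variables
print(records["ID200901"]["Resistance"])   # data d3 of record ID200901
```

Keeping every datum as a string at this stage is deliberate: whether a variable is quantitative or qualitative is decided later from the data themselves.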

Also, as shown in FIG. 3A, the second column from the left of the data set Ds shows the data, for each record, of the variable identified by the variable name "Voltage". These data are, for example, "5.488135", "6.027634", and "4.236548". Similarly, the third to tenth columns from the left of the data set Ds show the data, for each record, of the variables identified by the variable names "Speed", "Resistance", "Worker", "Facility No.", "Material Composition", "Material Temperature Difference", "Auxiliary Voltage", and "Jig Temperature", respectively.

Here, variables are of two types: quantitative variables and qualitative variables. Each datum of a quantitative variable is expressed only by numbers, and each datum of a qualitative variable is expressed including characters. In the example of FIG. 3A and FIG. 3B, the variables identified by the variable names "Worker" and "Facility No." are qualitative variables. For example, as shown in FIG. 3B, data d4 of the variable identified by the variable name "Worker" is "Suzuki", which includes characters. Therefore, the variable identified by the variable name "Worker" is a qualitative variable. Also, in the example of FIG. 3A and FIG. 3B, the variables identified by the variable names "Material Composition", "Material Temperature Difference", "Auxiliary Voltage", and "Jig Temperature" are quantitative variables. For example, as shown in FIG. 3B, data d6 of the variable identified by the variable name "Material Composition" is "0", which is expressed only by numbers. Therefore, the variable identified by the variable name "Material Composition" is a quantitative variable.

なお、変数は、図３Ａおよび図３Ｂに示す例に限定されるものではなく、どのような変数であってもよい。変数は、例えば、人に関わる変数、材料に関わる変数、設備に係る変数などである。人に関わる変数は、「作業者」または「作業班」などの変数であってもよい。材料に関わる変数は、「源泉材料Ｌｏｔ」または「途中工程材料Ｌｏｔ」などの変数であってもよい。設備に係る変数は、「生産設備種類、世代」、「生産設備号機」、「設備内レーン別、スピンドル別」、「金型」、「治具」、「金型温度」、「乾燥温度」、「設備メンテナンス前後」などの変数であってもよい。また、変数は、「雰囲気温度」、「雰囲気湿度」、「生産時期、時間」などの変数であってもよい。また、変数は、製品の「品種、品番」または「製品サイズ」などの変数であってもよい。 The variables are not limited to the examples shown in FIG. 3A and FIG. 3B, and may be any variables. The variables may be, for example, variables related to people, variables related to materials, variables related to equipment, etc. The variables related to people may be variables such as "worker" or "work crew". The variables related to materials may be variables such as "source material lot" or "intermediate process material lot". The variables related to equipment may be variables such as "production equipment type, generation", "production equipment number", "by lane in equipment, by spindle", "mold", "jig", "mold temperature", "drying temperature", "before and after equipment maintenance", etc. The variables may also be variables such as "ambient temperature", "ambient humidity", "production period, time", etc. The variables may also be variables such as "production type, part number" or "product size" of the product.

図３Ｃは、データセットＤｓから選択される目的変数および説明変数を示す図である。 Figure 3C shows the objective variables and explanatory variables selected from the dataset Ds.

Each variable shown in the data set Ds is classified as either a utilized variable or a non-utilized variable according to an input operation by the user, and each utilized variable is classified as either an explanatory variable or an objective variable. A utilized variable is a variable adopted in the model, and a non-utilized variable is a variable not adopted in the model. Non-utilized variables are variables that have conventionally been left out of the model, because conventionally only the variables that contribute greatly to the objective variable are adopted as explanatory variables, rather than all variables in the data set other than the objective variable. In the example shown in FIG. 3C, the user selects the variable with the variable name "Resistance" as the objective variable, and selects the variable with the variable name "Voltage" and the variable with the variable name "Speed" as explanatory variables. As a result, the variables with the variable names "Worker", "Facility No.", "Material Composition", "Material Temperature Difference", "Auxiliary Voltage", and "Jig Temperature" are determined to be non-utilized variables. These non-utilized variables include the qualitative and quantitative variables described above. Therefore, in this embodiment, the M non-utilized variables shown in the data set Ds include one or more qualitative variables, each indicating data that contain characters, and one or more quantitative variables, each indicating data that consist of numbers. Here, M is the number of non-utilized variables included in the data set Ds, which is six in the above example.

図４は、本実施の形態におけるデータセットＤｓの他の例を示す図である。 Figure 4 shows another example of the data set Ds in this embodiment.

The arithmetic circuit 102 replaces the qualitative variables included in the data set Ds with dummy variables. That is, by performing one-hot encoding, the arithmetic circuit 102 converts the data of the qualitative variable with the variable name "Worker" shown in FIGS. 3A to 3C into a flag string consisting of the data of three variables with the variable names "Worker Suzuki", "Worker Sato", and "Worker Takahashi". For example, the data "Suzuki" of the variable "Worker" shown in FIG. 3A is converted into a flag string consisting of the data "1" of the variable "Worker Suzuki", the data "0" of the variable "Worker Sato", and the data "0" of the variable "Worker Takahashi". Likewise, the data "Sato" of the variable "Worker" shown in FIG. 3A is converted into a flag string consisting of the data "0" of the variable "Worker Suzuki", the data "1" of the variable "Worker Sato", and the data "0" of the variable "Worker Takahashi". Similarly, the arithmetic circuit 102 converts the data of the qualitative variable with the variable name "Facility No." shown in FIGS. 3A to 3C into a flag string consisting of the data of three variables with the variable names "Facility No. C", "Facility No. D", and "Facility No. E". For example, the data "Unit C" of the variable "Facility No." shown in FIG. 3A is converted into a flag string consisting of the data "1" of the variable "Facility No. C", the data "0" of the variable "Facility No. D", and the data "0" of the variable "Facility No. E". Likewise, the data "Unit D" of the variable "Facility No." shown in FIG. 3A is converted into a flag string consisting of the data "0" of the variable "Facility No. C", the data "1" of the variable "Facility No. D", and the data "0" of the variable "Facility No. E". When handling qualitative variables in machine learning such as the random forest described later, the arithmetic circuit 102 replaces those qualitative variables with dummy variables.
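The conversion of a qualitative variable into a flag string can be illustrated with a minimal one-hot encoding sketch. The function name `one_hot` and the way dummy-variable names are built (variable name, a space, then the category) are assumptions chosen to match the names quoted in the text. The category list is passed in explicitly rather than inferred from the data.

```python
def one_hot(values, variable_name, categories):
    """Convert each datum of a qualitative variable into a flag string.

    Each flag string is a dict mapping a dummy-variable name to 1 or 0,
    with exactly one 1 when the datum is one of the given categories.
    """
    return [
        {f"{variable_name} {category}": int(value == category) for category in categories}
        for value in values
    ]


# Data "Suzuki" and "Sato" of the variable "Worker", as in the text.
flags = one_hot(["Suzuki", "Sato"], "Worker", ["Suzuki", "Sato", "Takahashi"])
for flag_string in flags:
    print(flag_string)
```

In this sketch a datum outside the category list would yield an all-zero flag string. Whether that is acceptable depends on the application and is not addressed by the text.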

図５は、本実施の形態におけるデータセットＤｓの変数名などを簡略化して示す図である。 Figure 5 shows a simplified view of the variable names of the data set Ds in this embodiment.

In the following, for ease of understanding, the variable names "Voltage", "Speed", "Resistance", "Worker", "Facility No.", "Material Composition", "Material Temperature Difference", "Auxiliary Voltage", and "Jig Temperature" of the data set Ds shown in FIGS. 3A to 3C are replaced with the variable names "X0", "X1", "Y", "Z0", "Z1", "D1", "D2", "D3", and "D4", respectively, as shown in FIG. 5. In addition, the data "Suzuki" and "Sato" of the variable identified by the variable name "Worker" are replaced with "A" and "B", and the data "Unit C" and "Unit D" of the variable identified by the variable name "Facility No." are replaced with "C" and "D".

In the following, when an individual variable needs to be identified, it is identified by appending its variable name to the word "variable", as in "variable X1". In the example shown in FIG. 5, variables X0, X1, and Y are quantitative variables. Variables Z0 and Z1 are qualitative variables, and variables D1, D2, D3, and D4 are quantitative variables.

［機能構成］
[Functional configuration]
FIG. 6 is a block diagram showing the functional configuration of the arithmetic circuit 102.

演算回路１０２は、プログラム１０５ａを実行することによって、モデルを生成するための複数の機能を実現する。具体的には、演算回路１０２は、受信部（受信手段）１３０、第１変数特定部（第１変数特定手段）１２１、層別条件設定部１２２、非活用変数抽出部１２３、変数型判定部１２４、候補抽出部１２５、改善度特定部（改善度特定手段）１４０、第２変数特定部（第２変数特定手段）１２６、層別部（層別手段）１２７、生成部（生成手段）１２８、および結果出力部（結果出力手段）１２９を備える。また、候補抽出部１２５は、質的候補抽出部（質的候補抽出手段）１２５ａおよび量的候補抽出部（量的候補抽出手段）１２５ｂを備える。これらの構成要素は、演算回路１０２によるプログラム１０５ａの実行によって実現される。 The arithmetic circuit 102 executes the program 105a to realize a plurality of functions for generating a model. Specifically, the arithmetic circuit 102 includes a receiving unit (receiving means) 130, a first variable identification unit (first variable identification means) 121, a stratification condition setting unit 122, a non-utilized variable extraction unit 123, a variable type determination unit 124, a candidate extraction unit 125, an improvement degree identification unit (improvement degree identification means) 140, a second variable identification unit (second variable identification means) 126, a stratification unit (stratification means) 127, a generation unit (generation means) 128, and a result output unit (result output means) 129. In addition, the candidate extraction unit 125 includes a qualitative candidate extraction unit (qualitative candidate extraction means) 125a and a quantitative candidate extraction unit (quantitative candidate extraction means) 125b. These components are realized by the execution of the program 105a by the arithmetic circuit 102.

The receiving unit 130 receives a data set Ds including three or more variables. For example, the receiving unit 130 acquires the data set Ds by reading it from the database 106. The user then performs an input operation on the input unit 101 to select explanatory variables and an objective variable from among the multiple variables of the data set Ds shown in FIG. 5. In response to that input operation accepted by the input unit 101, the first variable identification unit 121 identifies, for example, the variable X0 and the variable X1 as explanatory variables from among the multiple variables of the data set Ds shown in FIG. 5. Furthermore, the first variable identification unit 121 identifies, for example, the variable Y as the objective variable from among those variables. As a result, two variables are set as explanatory variables and one variable is set as the objective variable.

このように、本実施の形態における第１変数特定部１２１は、データセットＤｓから、１以上の目的変数と、１以上の説明変数とを特定する。なお、本実施の形態では、２つの説明変数が設定され、１つの目的変数が設定されるが、その説明変数および目的変数のそれぞれの数は、これらの例に限らず、任意の数であってもよい。例えば、第１変数特定部１２１は、データセットＤｓの複数の変数のうちの１つの変数を説明変数に設定してもよく、３つ以上の変数のそれぞれを説明変数に設定してもよい。また、第１変数特定部１２１は、データセットＤｓの複数の変数のうちの２つ以上の変数のそれぞれを目的変数として特定してもよい。 In this manner, the first variable identification unit 121 in this embodiment identifies one or more objective variables and one or more explanatory variables from the dataset Ds. Note that, in this embodiment, two explanatory variables and one objective variable are set, but the respective numbers of the explanatory variables and objective variables are not limited to these examples and may be any number. For example, the first variable identification unit 121 may set one variable of the multiple variables of the dataset Ds as the explanatory variable, or may set each of three or more variables as the explanatory variable. In addition, the first variable identification unit 121 may identify each of two or more variables of the multiple variables of the dataset Ds as the objective variable.

また、第１変数特定部１２１は、説明変数として設定された変数が変数Ｘ０および変数Ｘ１であり、目的変数として設定された変数が変数Ｙであることを示す第１設定情報を、メモリ１０３または記憶部１０５に格納する。第１設定情報が記憶部１０５に格納される場合には、その第１設定情報は、テンポラリーデータ１０５ｂとして格納されてもよい。また、第１変数特定部１２１は、その第１設定情報を非活用変数抽出部１２３に出力してもよい。 The first variable identification unit 121 also stores in the memory 103 or the storage unit 105 first setting information indicating that the variables set as explanatory variables are the variable X0 and the variable X1, and that the variable set as the objective variable is the variable Y. When the first setting information is stored in the storage unit 105, the first setting information may be stored as temporary data 105b. The first variable identification unit 121 may also output the first setting information to the non-utilized variable extraction unit 123.

The stratification condition setting unit 122 sets the total number of non-utilized variables used for stratifying the data set Ds to N (N is an integer of 1 or more) in response to the user's input operation received by the input unit 101. In this embodiment, the stratification condition setting unit 122 sets the total number in response to the user's input operation, but it may instead set a predetermined N as a fixed value without requiring that input operation. This embodiment is described using an example in which the total number N is 1 or more and, as a specific example, N is set to N = 2. The non-utilized variables used for the stratification are hereinafter also referred to as stratification variables. That is, the stratification condition setting unit 122 in this embodiment sets the total number of stratification variables in response to the user's input operation. For the stratification to be possible, the data of each stratification variable included in the data set Ds must fall into two or more groups, rather than a single group, according to the commonality or similarity among those data.

層別条件設定部１２２は、層別変数の総数であるＮ個を示す第２設定情報を、メモリ１０３または記憶部１０５に格納する。第２設定情報が記憶部１０５に格納される場合には、その第２設定情報は、テンポラリーデータ１０５ｂとして格納されてもよい。また、層別条件設定部１２２は、第２設定情報を第２変数特定部１２６に出力してもよい。 The stratification condition setting unit 122 stores second setting information indicating N, which is the total number of stratification variables, in the memory 103 or the storage unit 105. When the second setting information is stored in the storage unit 105, the second setting information may be stored as temporary data 105b. In addition, the stratification condition setting unit 122 may output the second setting information to the second variable identification unit 126.

非活用変数抽出部１２３は、第１変数特定部１２１によって読み出された図５に示すデータセットＤｓの複数の変数の中から、Ｍ個の非活用変数を抽出する。具体的には、非活用変数抽出部１２３は、第１変数特定部１２１、メモリ１０３または記憶部１０５から第１設定情報を取得する。そして、非活用変数抽出部１２３は、その複数の変数から、第１設定情報によって示される説明変数および目的変数以外の全ての変数を非活用変数として抽出する。例えば、非活用変数抽出部１２３は、データセットＤｓの複数の変数の中から、変数Ｚ０、変数Ｚ１、変数Ｄ１、変数Ｄ２、変数Ｄ３、および変数Ｄ４をそれぞれ非活用変数として抽出する。その結果、本実施の形態では、データセットＤｓによって示される複数の変数から、Ｍ個の非活用変数が抽出される。そして、非活用変数抽出部１２３は、抽出されたＭ個の非活用変数を示す抽出情報を、メモリ１０３または記憶部１０５に格納する。抽出情報が記憶部１０５に格納される場合には、その抽出情報は、テンポラリーデータ１０５ｂとして格納されてもよい。また、非活用変数抽出部１２３は、抽出情報を変数型判定部１２４に出力してもよい。 The non-utilized variable extraction unit 123 extracts M non-utilized variables from the multiple variables of the data set Ds shown in FIG. 5 read by the first variable identification unit 121. Specifically, the non-utilized variable extraction unit 123 acquires the first setting information from the first variable identification unit 121, the memory 103, or the storage unit 105. Then, the non-utilized variable extraction unit 123 extracts all variables other than the explanatory variables and the objective variable indicated by the first setting information as non-utilized variables from the multiple variables. For example, the non-utilized variable extraction unit 123 extracts the variables Z0, Z1, D1, D2, D3, and D4 as non-utilized variables from the multiple variables of the data set Ds. As a result, in this embodiment, M non-utilized variables are extracted from the multiple variables indicated by the data set Ds. Then, the non-utilized variable extraction unit 123 stores extraction information indicating the extracted M non-utilized variables in the memory 103 or the storage unit 105. When the extracted information is stored in the storage unit 105, the extracted information may be stored as temporary data 105b. In addition, the non-utilized variable extraction unit 123 may output the extracted information to the variable type determination unit 124.
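The extraction of non-utilized variables amounts to removing the explanatory and objective variables from the list of all variables. A minimal sketch follows, using the simplified variable names of FIG. 5. The function name `extract_non_utilized` and the dictionary standing in for the first setting information are illustrative assumptions.

```python
def extract_non_utilized(variables, explanatory, objective):
    """Return every variable that is neither explanatory nor objective, in original order."""
    utilized = set(explanatory) | set(objective)
    return [name for name in variables if name not in utilized]


# Variable names of the data set Ds as simplified in FIG. 5.
variables = ["X0", "X1", "Y", "Z0", "Z1", "D1", "D2", "D3", "D4"]

# Stand-in for the first setting information (hypothetical structure).
first_setting = {"explanatory": ["X0", "X1"], "objective": ["Y"]}

non_utilized = extract_non_utilized(
    variables, first_setting["explanatory"], first_setting["objective"]
)
print(non_utilized)       # the extracted non-utilized variables
print(len(non_utilized))  # M
```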

The variable type determination unit 124 acquires the extraction information from the non-utilized variable extraction unit 123, the memory 103, or the storage unit 105, and determines, in order, the variable type of each of the M non-utilized variables indicated by the extraction information. The variable types are the qualitative variable type and the quantitative variable type described above. That is, the variable type determination unit 124 determines, based on the data of a non-utilized variable, whether that non-utilized variable is a qualitative variable or a quantitative variable. Specifically, if the data of the non-utilized variable contain characters, the variable type determination unit 124 determines that the non-utilized variable is a qualitative variable. On the other hand, if the data of the non-utilized variable contain no characters and only numbers, the variable type determination unit 124 determines that the non-utilized variable is a quantitative variable. As a result, each of the M non-utilized variables is classified as either a qualitative variable or a quantitative variable. For example, in this embodiment, the variable type determination unit 124 determines that the non-utilized variables Z0 and Z1 are each qualitative variables, and that the non-utilized variables D1, D2, D3, and D4 are each quantitative variables. Then, for each of the M non-utilized variables, the variable type determination unit 124 stores variable type information indicating the variable type of that non-utilized variable in the memory 103 or the storage unit 105. When the variable type information is stored in the storage unit 105, it may be stored as temporary data 105b. The variable type determination unit 124 may also output the variable type information to the candidate extraction unit 125.
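The type determination rule (characters present means qualitative, numbers only means quantitative) can be sketched as follows. The text does not say how a sign or a decimal point is treated; this sketch assumes that an optional sign and a decimal point count as part of a number, since values such as "8.815673" are given as examples of quantitative data. The function name `variable_type` is hypothetical.

```python
import re

# Assumption: an optional sign and one decimal point are part of a number.
_NUMERIC = re.compile(r"[+-]?\d+(?:\.\d+)?")


def variable_type(data):
    """Return "quantitative" if every datum is purely numeric, else "qualitative"."""
    if all(_NUMERIC.fullmatch(datum) for datum in data):
        return "quantitative"
    return "qualitative"


# Columns keyed by the variable names quoted in the text.
columns = {
    "Worker": ["Suzuki", "Sato"],
    "Voltage": ["5.488135", "6.027634", "4.236548"],
    "Material Composition": ["0"],
}
for name, data in columns.items():
    print(name, variable_type(data))
```

A regular expression is used instead of `float()` because `float()` also accepts strings such as "nan" and "inf", which consist of characters and would otherwise be misread as numbers.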

候補抽出部１２５は、データセットＤｓに含まれる３以上の変数から、特定された目的変数および説明変数以外の変数を層別変数候補として抽出する。言い換えれば、候補抽出部１２５は、Ｍ個の非活用変数の中から層別変数候補を抽出する。層別変数候補は、データセットＤｓの層別分類に用いられる層別変数の候補である。 The candidate extraction unit 125 extracts variables other than the identified objective variable and explanatory variables as stratification variable candidates from the three or more variables included in the dataset Ds. In other words, the candidate extraction unit 125 extracts stratification variable candidates from the M non-utilized variables. The stratification variable candidates are candidates for stratification variables used for stratification classification of the dataset Ds.

具体的には、候補抽出部１２５は、上述のように、質的候補抽出部１２５ａと、量的候補抽出部１２５ｂとを備えている。質的候補抽出部１２５ａは、変数型判定部１２４、メモリ１０３または記憶部１０５から変数型情報を取得し、その変数型情報に示されている非活用変数の変数型を特定する。そして、質的候補抽出部１２５ａは、その非活用変数の変数型が質的変数であれば、その質的変数の影響度を算出し、その影響度に基づいて質的変数を層別変数候補として採用するか否かを判断する。なお、影響度は、質的変数である非活用変数のデータが目的変数のデータに与える影響の大きさを示す数値であって、その影響が大きいほど、大きい値を示す。質的候補抽出部１２５ａは、採用すると判断された質的変数である非活用変数を層別変数候補として抽出する。このような質的候補抽出部１２５ａによる処理は、図９のステップＳ７における質的変数の候補抽出処理によって行われる。その詳細については、図１０を用いて後述する。 Specifically, the candidate extraction unit 125 includes a qualitative candidate extraction unit 125a and a quantitative candidate extraction unit 125b, as described above. The qualitative candidate extraction unit 125a acquires variable type information from the variable type determination unit 124, the memory 103, or the storage unit 105, and identifies the variable type of the non-utilized variable indicated in the variable type information. If the variable type of the non-utilized variable is a qualitative variable, the qualitative candidate extraction unit 125a calculates the influence of the qualitative variable and determines whether or not to adopt the qualitative variable as a stratification variable candidate based on the influence. The influence is a numerical value indicating the magnitude of the influence that the data of the non-utilized variable, which is a qualitative variable, has on the data of the objective variable, and the greater the influence, the greater the value indicated. The qualitative candidate extraction unit 125a extracts the non-utilized variable, which is a qualitative variable determined to be adopted, as a stratification variable candidate. Such processing by the qualitative candidate extraction unit 125a is performed by the candidate extraction processing of the qualitative variable in step S7 of FIG. 9. The details will be described later with reference to FIG. 10.

一方、量的候補抽出部１２５ｂは、変数型判定部１２４、メモリ１０３または記憶部１０５から変数型情報を取得し、その変数型情報に示されている非活用変数の変数型を特定する。そして、量的候補抽出部１２５ｂは、その非活用変数の変数型が量的変数であれば、その量的変数のクラスタリングの状態を特定し、そのクラスタリングの状態に基づいて非活用変数を層別変数候補として採用するか否かを判断する。クラスタリングの詳細については、後述する。量的候補抽出部１２５ｂは、採用すると判断された量的変数である非活用変数を層別変数候補として抽出する。このような量的候補抽出部１２５ｂによる処理は、図９のステップＳ８における量的変数の候補抽出処理によって行われる。その詳細については、図１１を用いて後述する。 On the other hand, the quantitative candidate extraction unit 125b acquires variable type information from the variable type determination unit 124, the memory 103, or the storage unit 105, and identifies the variable type of the non-utilized variable indicated in the variable type information. If the variable type of the non-utilized variable is a quantitative variable, the quantitative candidate extraction unit 125b identifies the clustering state of the quantitative variable, and determines whether or not to adopt the non-utilized variable as a stratification variable candidate based on the clustering state. Details of clustering will be described later. The quantitative candidate extraction unit 125b extracts the non-utilized variable, which is the quantitative variable determined to be adopted, as a stratification variable candidate. Such processing by the quantitative candidate extraction unit 125b is performed by the quantitative variable candidate extraction process in step S8 of FIG. 9. Details will be described later with reference to FIG. 11.

次に、改善度特定部１４０は、データセットＤｓに含まれる３以上の変数のうちの、特定された目的変数および説明変数以外の変数である１以上の層別変数候補のそれぞれについて、その層別変数候補を用いることによってモデルの確からしさが増す度合いである改善度を特定する。このような改善度特定部１４０による処理は、図９のステップＳ９における改善度算出処理によって行われる。その詳細については、図１２を用いて後述する。 Next, for each of the one or more stratification variable candidates, which are variables other than the identified objective variable and explanatory variables among the three or more variables included in the dataset Ds, the improvement degree identification unit 140 identifies an improvement degree, which is the degree to which the reliability of the model increases when that stratification variable candidate is used. This processing by the improvement degree identification unit 140 is performed in the improvement degree calculation process in step S9 of FIG. 9. The details will be described later with reference to FIG. 12.

そして、改善度特定部１４０は、１以上の層別変数候補のそれぞれについて、その層別変数候補の改善度を示す改善度情報を、メモリ１０３または記憶部１０５に格納する。改善度情報が記憶部１０５に格納される場合には、その改善度情報は、テンポラリーデータ１０５ｂとして格納されてもよい。また、改善度特定部１４０は、改善度情報を第２変数特定部１２６に出力してもよい。 Then, the improvement degree identification unit 140 stores improvement degree information indicating the degree of improvement of each of the one or more stratification variable candidates in the memory 103 or the storage unit 105. When the improvement degree information is stored in the storage unit 105, the improvement degree information may be stored as temporary data 105b. In addition, the improvement degree identification unit 140 may output the improvement degree information to the second variable identification unit 126.

第２変数特定部１２６は、候補抽出部１２５、メモリ１０３または記憶部１０５から、１以上の層別変数候補のそれぞれの改善度情報を取得する。さらに、第２変数特定部１２６は、層別条件設定部１２２、メモリ１０３または記憶部１０５から第２設定情報を取得する。そして、第２変数特定部１２６は、それらの改善度情報および第２設定情報を用いて、１以上の層別変数候補の中から、Ｎ個の層別変数候補をそれぞれ層別変数として特定する。層別変数は、データセットＤｓのレコードの層別に用いられる変数である。 The second variable identification unit 126 acquires improvement degree information for each of the one or more stratification variable candidates from the candidate extraction unit 125, the memory 103, or the storage unit 105. Furthermore, the second variable identification unit 126 acquires second setting information from the stratification condition setting unit 122, the memory 103, or the storage unit 105. Then, the second variable identification unit 126 uses the improvement degree information and the second setting information to identify N stratification variable candidates as stratification variables from among the one or more stratification variable candidates. The stratification variables are variables used to stratify the records of the dataset Ds.

このように、本実施の形態における第２変数特定部１２６は、１以上の層別変数候補から、その１以上の層別変数候補のそれぞれの改善度に基づいて、Ｎ個の層別変数候補を層別変数として特定する。つまり、第２変数特定部１２６は、層別条件設定部１２２によって設定された総数であるＮ個だけ層別変数を特定する。このとき、第２変数特定部１２６は、改善度に基づいて、層別変数を特定する。 In this manner, the second variable identification unit 126 in this embodiment identifies N stratification variable candidates as stratification variables from one or more stratification variable candidates based on the degree of improvement of each of the one or more stratification variable candidates. In other words, the second variable identification unit 126 identifies N stratification variables, which is the total number set by the stratification condition setting unit 122. At this time, the second variable identification unit 126 identifies the stratification variables based on the degree of improvement.

そして、第２変数特定部１２６は、その特定されたＮ個の層別変数を示す層別変数情報を、メモリ１０３または記憶部１０５に格納する。層別変数情報が記憶部１０５に格納される場合には、その層別変数情報は、テンポラリーデータ１０５ｂとして格納されてもよい。また、第２変数特定部１２６は、層別変数情報を層別部１２７に出力してもよい。 Then, the second variable identification unit 126 stores stratification variable information indicating the identified N stratification variables in the memory 103 or the storage unit 105. When the stratification variable information is stored in the storage unit 105, the stratification variable information may be stored as temporary data 105b. In addition, the second variable identification unit 126 may output the stratification variable information to the stratification unit 127.

層別部１２７は、データセットＤｓに含まれる層別変数の２つ以上のデータ間の共通性または類似性に基づいて層別分類を行う。この層別分類では、層別部１２７は、データセットＤｓに含まれる２つ以上のレコードを複数の層に分類することによって、複数の層のそれぞれに１つ以上のレコードを含める処理である層別分類を実行する。つまり、層別部１２７は、層別変数と目的変数との関係の傾向に基づいて、データセットＤｓを複数の層に分類する。具体的には、層別部１２７は、層別変数ごとに、その層別変数のデータの同一性または類似性に基づいて、その層別変数のデータを複数のグループに分類し、複数のグループの組み合わせ毎に、データセットＤｓを分類する。ここで、その層別変数は、上述の層別変数情報によって示されている。したがって、層別部１２７は、第２変数特定部１２６、メモリ１０３または記憶部１０５から、層別変数情報を取得する。そして、層別部１２７は、その層別変数情報に基づいて、データセットＤｓに対する層別分類を行う。 The stratification unit 127 performs stratification based on the commonality or similarity between two or more data of the stratification variables included in the data set Ds. In this stratification, the stratification unit 127 performs stratification, which is a process of including one or more records in each of the multiple layers by classifying two or more records included in the data set Ds into multiple layers. That is, the stratification unit 127 classifies the data set Ds into multiple layers based on the tendency of the relationship between the stratification variables and the objective variable. Specifically, the stratification unit 127 classifies the data of the stratification variables into multiple groups for each stratification variable based on the identity or similarity of the data of the stratification variable, and classifies the data set Ds for each combination of the multiple groups. Here, the stratification variables are indicated by the stratification variable information described above. Therefore, the stratification unit 127 acquires the stratification variable information from the second variable identification unit 126, the memory 103, or the storage unit 105. Then, the stratification unit 127 performs stratification for the data set Ds based on the stratification variable information.

具体的には、層別部１２７は、層別変数情報によって示されるＮ個の層別変数のそれぞれについて、その層別変数の２つ以上のデータ間の共通性または類似性に基づいて、データセットＤｓに含まれるその層別変数の２つ以上のデータを複数のグループに分類する。そして、層別部１２７は、そのＮ個の層別変数のそれぞれのグループの組み合わせに応じて複数の層を決定し、データセットＤｓに含まれる２つ以上のレコードを、決定された複数の層に分類する。これによって、複数の層のそれぞれに、１つ以上のレコードからなる層別データセットが生成される。複数の層のそれぞれの層別データセットは、データセットＤｓから分類された１つ以上のレコードを含む。その１つ以上のレコードのそれぞれは、Ｎ個の層別変数のそれぞれの同一グループに属するデータを含む。さらに、その１つ以上のレコードのそれぞれは、目的変数および説明変数のそれぞれのデータを含む。この層別データセットの詳細については、図７を用いて後述する。 Specifically, for each of the N stratification variables indicated by the stratification variable information, the stratification unit 127 classifies two or more data of the stratification variables included in the dataset Ds into a plurality of groups based on the commonality or similarity between the two or more data of the stratification variables. Then, the stratification unit 127 determines a plurality of strata according to the combination of the groups of the N stratification variables, and classifies two or more records included in the dataset Ds into the determined plurality of strata. As a result, a stratification dataset consisting of one or more records is generated for each of the plurality of strata. The stratification dataset for each of the plurality of strata includes one or more records classified from the dataset Ds. Each of the one or more records includes data belonging to the same group for each of the N stratification variables. Furthermore, each of the one or more records includes data for each of the objective variables and explanatory variables. Details of this stratification dataset will be described later with reference to FIG. 7.

なお、本実施の形態では、層別変数が質的変数である場合には、同一のデータがグループ化され、層別変数が量的変数である場合は、同一または類似のデータがグループ化される。また、本実施の形態における層別部１２７は、層別変数ごとに、その層別変数のデータの同一性または類似性に基づいて、その層別変数のデータを複数のグループに分類し、複数のグループの組み合わせ毎に、データセットＤｓを分類する。ここで、層別変数が量的変数である場合において、その量的変数のデータが類似しているとは、量的変数の目的変数に対する影響の傾向が類似していることを意味する。したがって、層別部１２７は、層別変数と目的変数との関係の傾向に基づいて、データセットＤｓを複数の層に分類していると言える。 In this embodiment, when the stratification variables are qualitative variables, identical data are grouped, and when the stratification variables are quantitative variables, identical or similar data are grouped. Furthermore, the stratification unit 127 in this embodiment classifies the data of each stratification variable into a plurality of groups based on the identity or similarity of the data of the stratification variable, and classifies the data set Ds for each combination of the plurality of groups. Here, when the stratification variables are quantitative variables, the data of the quantitative variables being similar means that the tendency of the influence of the quantitative variables on the objective variable is similar. Therefore, it can be said that the stratification unit 127 classifies the data set Ds into a plurality of strata based on the tendency of the relationship between the stratification variables and the objective variable.

生成部１２８は、複数の層毎に、１または複数の目的変数と１または複数の説明変数との関係を示すモデルを生成する。つまり、生成部１２８は、複数の層のそれぞれについて、その層に含まれる１つ以上のレコード、すなわち層別データセットを用いて、説明変数のデータと目的変数のデータとの関係を示すモデルを生成する。ここで、上述の例では、変数Ｘ０および変数Ｘ１がそれぞれ説明変数であるが、説明変数は１つでもよく、２つ以上であってもよい。したがって、この場合には、生成部１２８は、２つ以上の説明変数のそれぞれのデータと目的変数のデータとの関係を示すモデルを、重回帰式として生成する。例えば、生成部１２８は、説明変数Ｘ０および説明変数Ｘ１と目的変数Ｙとに対する重回帰分析を行うことによって、説明変数Ｘ０および説明変数Ｘ１のそれぞれのデータと目的変数Ｙのデータとの関係を示すモデルを生成する。 The generation unit 128 generates, for each of the multiple layers, a model showing the relationship between one or more objective variables and one or more explanatory variables. That is, for each layer, the generation unit 128 uses the one or more records included in that layer, i.e., the stratified data set, to generate a model showing the relationship between the explanatory variable data and the objective variable data. In the above example, the variables X0 and X1 are the explanatory variables, but there may be a single explanatory variable or two or more. Accordingly, in the above example with two explanatory variables, the generation unit 128 generates the model showing the relationship between the data of each explanatory variable and the objective variable data as a multiple regression equation. For example, the generation unit 128 performs a multiple regression analysis on the explanatory variables X0 and X1 and the objective variable Y, thereby generating a model showing the relationship between the data of X0 and X1 and the data of Y.

結果出力部１２９は、生成された複数のモデルを出力する。つまり、結果出力部１２９は、生成部１２８によって層ごとに生成されたモデルを、その生成部１２８から取得して出力部１０４に出力する。 The result output unit 129 outputs the multiple models that have been generated. In other words, the result output unit 129 obtains the models generated for each layer by the generation unit 128 from the generation unit 128 and outputs them to the output unit 104.

［層別データセット］ [Stratified Dataset]
図７は、本実施の形態における層別データセットの一例を示す図である。 FIG. 7 is a diagram showing an example of a stratified data set in this embodiment.

例えば、第２変数特定部１２６は、それぞれ質的変数である非活用変数Ｚ０および非活用変数Ｚ１を層別変数として特定する。図５に示すデータセットＤｓの各レコードに含まれる層別変数Ｚ０のデータは、「Ａ」または「Ｂ」を示す。また、そのデータセットＤｓの各レコードに含まれる層別変数Ｚ１のデータは、「Ｃ」または「Ｄ」を示す。そこで、層別部１２７は、図７の（ａ）に示すように、層別変数Ｚ０のデータ「Ａ」と、層別変数Ｚ１のデータ「Ｃ」とを含む各レコードを、第１層に分類する。これにより、層別データセットＤｓ１が生成される。層別データセットＤｓ１は、ＩＤ「ＩＤ２００９０１」によって識別されるレコードと、ＩＤ「ＩＤ２００９０２」によって識別されるレコードと、ＩＤ「ＩＤ２００９０３」によって識別されるレコードとからなる。 For example, the second variable identification unit 126 identifies the non-utilized variables Z0 and Z1, each of which is a qualitative variable, as stratification variables. The data of the stratification variable Z0 in each record of the dataset Ds shown in FIG. 5 indicates "A" or "B", and the data of the stratification variable Z1 in each record indicates "C" or "D". Therefore, as shown in (a) of FIG. 7, the stratification unit 127 classifies each record that includes data "A" of the stratification variable Z0 and data "C" of the stratification variable Z1 into the first layer. This generates the stratified dataset Ds1, which consists of the records identified by the IDs "ID200901", "ID200902", and "ID200903".

同様に、層別部１２７は、図７の（ｂ）に示すように、層別変数Ｚ０のデータ「Ｂ」と、層別変数Ｚ１のデータ「Ｃ」とを含む各レコードを、第２層に分類する。これにより、層別データセットＤｓ２が生成される。層別データセットＤｓ２は、ＩＤ「ＩＤ２００９０４」によって識別されるレコードと、ＩＤ「ＩＤ２００９０５」によって識別されるレコードと、ＩＤ「ＩＤ２００９０６」によって識別されるレコードとからなる。 Similarly, as shown in (b) of FIG. 7, the stratification unit 127 classifies each record that includes data "B" of the stratification variable Z0 and data "C" of the stratification variable Z1 into the second layer. This generates the stratified dataset Ds2, which consists of the records identified by the IDs "ID200904", "ID200905", and "ID200906".

同様に、層別部１２７は、図７の（ｃ）に示すように、層別変数Ｚ０のデータ「Ａ」と、層別変数Ｚ１のデータ「Ｄ」とを含む各レコードを、第３層に分類する。これにより、層別データセットＤｓ３が生成される。層別データセットＤｓ３は、ＩＤ「ＩＤ２００９０７」によって識別されるレコードと、ＩＤ「ＩＤ２００９０８」によって識別されるレコードと、ＩＤ「ＩＤ２００９０９」によって識別されるレコードとからなる。 Similarly, as shown in (c) of FIG. 7, the stratification unit 127 classifies each record that includes data "A" of the stratification variable Z0 and data "D" of the stratification variable Z1 into the third layer. This generates the stratified dataset Ds3, which consists of the records identified by the IDs "ID200907", "ID200908", and "ID200909".

同様に、層別部１２７は、図７の（ｄ）に示すように、層別変数Ｚ０のデータ「Ｂ」と、層別変数Ｚ１のデータ「Ｄ」とを含む各レコードを、第４層に分類する。これにより、層別データセットＤｓ４が生成される。層別データセットＤｓ４は、ＩＤ「ＩＤ２００９１０」によって識別されるレコードと、ＩＤ「ＩＤ２００９１１」によって識別されるレコードと、ＩＤ「ＩＤ２００９１２」によって識別されるレコードとからなる。 Similarly, as shown in (d) of FIG. 7, the stratification unit 127 classifies each record that includes data "B" of the stratification variable Z0 and data "D" of the stratification variable Z1 into the fourth layer. This generates the stratified dataset Ds4, which consists of the records identified by the IDs "ID200910", "ID200911", and "ID200912".

言い換えれば、層別変数Ｚ０の２つ以上のデータがグループ「Ａ」およびグループ「Ｂ」に分類され、層別変数Ｚ１の２つ以上のデータがグループ「Ｃ」およびグループ「Ｄ」に分類される。第１層は、層別変数Ｚ０のグループ「Ａ」と、層別変数Ｚ１のグループ「Ｃ」との組み合わせに対応する。第２層は、層別変数Ｚ０のグループ「Ｂ」と、層別変数Ｚ１のグループ「Ｃ」との組み合わせに対応する。第３層は、層別変数Ｚ０のグループ「Ａ」と、層別変数Ｚ１のグループ「Ｄ」との組み合わせに対応する。第４層は、層別変数Ｚ０のグループ「Ｂ」と、層別変数Ｚ１のグループ「Ｄ」との組み合わせに対応する。このように、層別変数Ｚ０および層別変数Ｚ１のそれぞれのグループの組み合わせに応じて複数の層が決定される。したがって、層別分類では、層別変数Ｚ０のグループ「Ａ」に属するデータと、層別変数Ｚ１のグループ「Ｃ」に属するデータとを含むレコードは、第１層に分類される。層別変数Ｚ０のグループ「Ｂ」に属するデータと、層別変数Ｚ１のグループ「Ｃ」に属するデータとを含むレコードは、第２層に分類される。層別変数Ｚ０のグループ「Ａ」に属するデータと、層別変数Ｚ１のグループ「Ｄ」に属するデータとを含むレコードは、第３層に分類される。層別変数Ｚ０のグループ「Ｂ」に属するデータと、層別変数Ｚ１のグループ「Ｄ」に属するデータとを含むレコードは、第４層に分類される。 In other words, the two or more data of the stratification variable Z0 are classified into groups "A" and "B", and the two or more data of the stratification variable Z1 are classified into groups "C" and "D". The first layer corresponds to the combination of group "A" of the stratification variable Z0 and group "C" of the stratification variable Z1. The second layer corresponds to the combination of group "B" of Z0 and group "C" of Z1. The third layer corresponds to the combination of group "A" of Z0 and group "D" of Z1. The fourth layer corresponds to the combination of group "B" of Z0 and group "D" of Z1. In this way, the multiple layers are determined according to the combinations of the groups of the stratification variables Z0 and Z1. Therefore, in the stratification, a record that includes data belonging to group "A" of Z0 and data belonging to group "C" of Z1 is classified into the first layer. A record that includes data belonging to group "B" of Z0 and data belonging to group "C" of Z1 is classified into the second layer. A record that includes data belonging to group "A" of Z0 and data belonging to group "D" of Z1 is classified into the third layer. A record that includes data belonging to group "B" of Z0 and data belonging to group "D" of Z1 is classified into the fourth layer.
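
The mapping from group combinations to layers described above can be sketched as follows. This is a minimal illustration, not the implementation of the stratification unit 127: the record IDs and the Z0/Z1 values follow the FIG. 7 example, while the function name `stratify` and the dictionary-based record layout are assumptions introduced here.

```python
from collections import defaultdict

# Record IDs and Z0/Z1 values follow the FIG. 7 example in the text.
# The dict layout itself is an assumption made for illustration.
groups = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D")]
records = [
    {"id": f"ID2009{3 * g + k + 1:02d}", "Z0": z0, "Z1": z1}
    for g, (z0, z1) in enumerate(groups)
    for k in range(3)
]


def stratify(records, strat_vars):
    """Group records by the combination of their stratification-variable values."""
    strata = defaultdict(list)
    for rec in records:
        key = tuple(rec[name] for name in strat_vars)
        strata[key].append(rec)
    return dict(strata)


strata = stratify(records, ["Z0", "Z1"])
for key, recs in strata.items():
    print(key, [r["id"] for r in recs])
```

Each dictionary key corresponds to one layer, and each value plays the role of one stratified dataset (Ds1 to Ds4 in the example).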

なお、各層に分類されるレコードには、層別変数以外の他の非活用変数のデータが含まれていてもよく、図７に示す例のように、活用変数および層別変数のそれぞれのデータのみが含まれていてもよい。 Note that records classified into each stratum may contain data on non-utilized variables other than the stratification variables, or may contain only data on the utilization variables and the stratification variables, as in the example shown in Figure 7.

図８は、層別データセットＤｓ１～Ｄｓ４のそれぞれについて、その層別データセットに含まれる各レコードによって示される座標点の分布を示す図である。 Figure 8 shows the distribution of coordinate points indicated by each record contained in each stratified dataset Ds1 to Ds4.

層別データセットＤｓ１～Ｄｓ４のそれぞれは、複数のレコードを含む。そして、その複数のレコードのそれぞれは、説明変数Ｘ０のデータと、説明変数Ｘ１のデータと、目的変数Ｙのデータとを含み、座標点（Ｘ０，Ｘ１，Ｙ）として示される。つまり、レコードは、説明変数Ｘ０、説明変数Ｘ１および目的変数Ｙからなる三次元座標系における座標点として示される。 Each of the stratified data sets Ds1 to Ds4 includes multiple records. Each of the multiple records includes data for explanatory variable X0, explanatory variable X1, and objective variable Y, and is shown as a coordinate point (X0, X1, Y). In other words, a record is shown as a coordinate point in a three-dimensional coordinate system consisting of explanatory variable X0, explanatory variable X1, and objective variable Y.

データセットＤｓに含まれる全てのレコードの座標点からは、それらのレコード間の相関性を見出すことが難しい。しかし、図８に示すように、層別データセットＤｓ１～Ｄｓ４のそれぞれでは、その層別データセットに含まれる複数のレコードの座標点は、互に相関性を有する。したがって、層別データセットＤｓ１～Ｄｓ４のそれぞれでは、その層別データセットに含まれる全てのレコードの座標点から、それらのレコード間の相関性を見出すことができる。 It is difficult to find correlations between records from the coordinate points of all the records contained in dataset Ds. However, as shown in FIG. 8, in each of stratified datasets Ds1 to Ds4, the coordinate points of multiple records contained in that stratified dataset are mutually correlated. Therefore, in each of stratified datasets Ds1 to Ds4, it is possible to find correlations between those records from the coordinate points of all the records contained in that stratified dataset.

生成部１２８は、これらの層別データセットＤｓ１～Ｄｓ４のそれぞれで、その層別データセットに含まれる１つ以上のレコードを用いて、説明変数Ｘ０および説明変数Ｘ１のそれぞれのデータと目的変数Ｙのデータとの関係を示すモデルを生成する。 For each of these stratified data sets Ds1 to Ds4, the generation unit 128 uses one or more records contained in the stratified data set to generate a model showing the relationship between the data for each of the explanatory variables X0 and X1 and the data for the objective variable Y.

このように、本実施の形態では、層別変数が非活用変数であって、その非活用変数に応じた層別分類が行われ、複数の層のそれぞれに対してモデルが生成される。したがって、説明変数以外の変数である非活用変数によって、データセットＤｓに対する層別分類を最適に行うことができる。その結果、非活用変数に応じて説明変数と目的変数との間の相関関係が変化するような場合であっても、その非活用変数のグループに応じた高い精度のモデルを生成することができる。つまり、モデルの精度向上を容易に図ることができる。また、本実施の形態では、層別変数が２つ以上であっても、データセットＤｓに対して最適な層別分類を行うことができ、複数の層のそれぞれに対して、それらの層別変数、すなわちＮ個の非活用変数のそれぞれのデータに応じた高い精度のモデルを生成することができる。 In this way, in this embodiment, the stratification variables are non-utilized variables, and stratification is performed according to the non-utilized variables, and a model is generated for each of the multiple layers. Therefore, the non-utilized variables, which are variables other than the explanatory variables, can be used to optimally perform stratification for the data set Ds. As a result, even if the correlation between the explanatory variables and the objective variable changes according to the non-utilized variables, a highly accurate model can be generated according to the group of the non-utilized variables. In other words, the accuracy of the model can be easily improved. Furthermore, in this embodiment, even if there are two or more stratification variables, optimal stratification can be performed for the data set Ds, and a highly accurate model can be generated for each of the multiple layers according to the data of the stratification variables, i.e., the N non-utilized variables.

［処理動作］ [Processing Operation]
図９は、本実施の形態におけるモデル生成装置１００の全体的な処理動作の一例を示すフローチャートである。 FIG. 9 is a flowchart showing an example of the overall processing operation of the model generating device 100 in this embodiment.

まず、モデル生成装置１００の受信部１３０は、データ受信処理を行う（ステップＳ１）。このデータ受信処理では、第１変数特定部１２１は、データベース１０６からデータセットＤｓを読み出すことによって、そのデータセットＤｓを受信する。そして、第１変数特定部１２１は、そのデータセットＤｓによって示される複数の変数から、説明変数および目的変数を特定する（ステップＳ２）。これにより、説明変数および目的変数が設定される。例えば、上述のように、変数Ｘ０および変数Ｘ１がそれぞれ説明変数に設定され、変数Ｙが目的変数に設定される。 First, the receiving unit 130 of the model generating device 100 performs a data receiving process (step S1). In this data receiving process, the first variable identification unit 121 receives the dataset Ds by reading it from the database 106. Then, the first variable identification unit 121 identifies the explanatory variables and the objective variable from the multiple variables indicated by the dataset Ds (step S2). The explanatory variables and the objective variable are thereby set. For example, as described above, the variables X0 and X1 are each set as explanatory variables, and the variable Y is set as the objective variable.

次に、層別条件設定部１２２は、ユーザの入力操作に応じて、層別変数の総数Ｎを設定する（ステップＳ３）。例えば、総数ＮはＮ＝２に設定される。なお、この総数Ｎは、予め定められた固定値であってもよい。 Next, the stratification condition setting unit 122 sets the total number N of stratification variables in response to the user's input operation (step S3). For example, the total number N is set to N = 2. Note that this total number N may be a predetermined fixed value.

次に、非活用変数抽出部１２３は、データセットＤｓの複数の変数から、説明変数および目的変数以外の変数を、非活用変数として抽出する（ステップＳ４）。 Next, the non-utilized variable extraction unit 123 extracts variables other than the explanatory variables and the objective variable as non-utilized variables from the multiple variables in the dataset Ds (step S4).

その後、モデル生成装置１００は、ステップＳ５～Ｓ８を含む第１ループ処理を、ステップＳ４で抽出された全ての非活用変数のそれぞれに対して順に実行する。すなわち、データセットＤｓに示されるＭ個の非活用変数のそれぞれに対して第１ループ処理が順に実行される。 Then, the model generating device 100 executes the first loop process including steps S5 to S8 for each of all the non-utilized variables extracted in step S4 in turn. That is, the first loop process is executed for each of the M non-utilized variables shown in the data set Ds in turn.

具体的には、まず、変数型判定部１２４は、処理対象の非活用変数の変数型を判定する（ステップＳ５）。そして、変数型判定部１２４は、その変数型が質的変数の型であるか否かを判定する（ステップＳ６）。つまり、変数型判定部１２４は、処理対象の非活用変数が質的変数であるか否かを判定する。そして、その処理対象の非活用変数が質的変数であると変数型判定部１２４によって判定されると（ステップＳ６のＹｅｓ）、質的候補抽出部１２５ａは、質的変数の候補抽出処理を実行する（ステップＳ７）。つまり、質的候補抽出部１２５ａは、その質的変数である非活用変数の目的変数に対する影響度を算出し、その影響度に基づいてその質的変数を層別変数候補として採用するか否かを判断する。そして、質的候補抽出部１２５ａは、採用すると判断された質的変数を層別変数候補として抽出する。一方、その処理対象の非活用変数が質的変数ではないと変数型判定部１２４によって判定されると（ステップＳ６のＮｏ）、量的候補抽出部１２５ｂは、その処理対象の非活用変数である量的変数の候補抽出処理を実行する（ステップＳ８）。つまり、量的候補抽出部１２５ｂは、データセットＤｓに含まれるその量的変数の２つ以上のデータに対してクラスタリングを行い、そのクラスタリングの状態に基づいて、その量的変数を層別変数候補として採用するか否かを判断する。そして、量的候補抽出部１２５ｂは、採用すると判断された量的変数を層別変数候補として抽出する。 Specifically, first, the variable type determination unit 124 determines the variable type of the non-utilized variable to be processed (step S5). Then, the variable type determination unit 124 determines whether the variable type is a qualitative variable type (step S6). That is, the variable type determination unit 124 determines whether the non-utilized variable to be processed is a qualitative variable. Then, when the variable type determination unit 124 determines that the non-utilized variable to be processed is a qualitative variable (Yes in step S6), the qualitative candidate extraction unit 125a executes a candidate extraction process for the qualitative variable (step S7). That is, the qualitative candidate extraction unit 125a calculates the influence of the non-utilized variable, which is a qualitative variable, on the objective variable, and determines whether or not to adopt the qualitative variable as a stratification variable candidate based on the influence. Then, the qualitative candidate extraction unit 125a extracts the qualitative variable determined to be adopted as a stratification variable candidate. On the other hand, if the variable type determination unit 124 determines that the non-utilized variable to be processed is not a qualitative variable (No in step S6), the quantitative candidate extraction unit 125b executes candidate extraction processing for the quantitative variable that is the non-utilized variable to be processed (step S8). That is, the quantitative candidate extraction unit 125b performs clustering on two or more pieces of data for the quantitative variable included in the data set Ds, and determines whether or not to adopt the quantitative variable as a stratification variable candidate based on the state of the clustering. Then, the quantitative candidate extraction unit 125b extracts the quantitative variable that is determined to be adopted as a stratification variable candidate.

このようなステップＳ５～Ｓ８を含む第１ループ処理が、全ての非活用変数のそれぞれに対して順に実行されることによって、その全ての非活用変数から１以上の層別変数候補が抽出される。 The first loop process including steps S5 to S8 is executed for each of the non-utilized variables in turn, and one or more stratification variable candidates are extracted from all of the non-utilized variables.
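
The control flow of the first loop (steps S5 to S8) might be sketched as below. This is only an outline under stated assumptions: the rule in `is_qualitative` (a variable is qualitative unless every value is numeric) is a placeholder, since this passage does not specify how the variable type is determined, and the two adoption checks are passed in as hypothetical callables standing in for the processes of FIG. 10 and FIG. 11.

```python
def is_qualitative(values):
    """Placeholder for steps S5-S6 (assumption): qualitative unless all values are numeric."""
    return not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def extract_candidates(dataset, unused_vars, qualitative_check, quantitative_check):
    """Run the first loop over every non-utilized variable and collect candidates."""
    candidates = []
    for name in unused_vars:
        values = [record[name] for record in dataset]
        if is_qualitative(values):
            adopt = qualitative_check(values)   # stands in for step S7
        else:
            adopt = quantitative_check(values)  # stands in for step S8
        if adopt:
            candidates.append(name)
    return candidates


# Hypothetical data: Z0 is qualitative, T is quantitative.
dataset = [
    {"Z0": "A", "T": 20.5},
    {"Z0": "B", "T": 21.0},
    {"Z0": "A", "T": 19.8},
]
candidates = extract_candidates(
    dataset,
    ["Z0", "T"],
    qualitative_check=lambda values: len(set(values)) <= 20,  # illustrative rule only
    quantitative_check=lambda values: False,                  # illustrative rule only
)
print(candidates)
```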

次に、改善度特定部１４０は、１以上の層別変数候補のそれぞれの改善度を算出する改善度算出処理を行う（ステップＳ９）。 Next, the improvement degree identification unit 140 performs an improvement degree calculation process to calculate the improvement degree of each of one or more stratification variable candidates (step S9).

そして、第２変数特定部１２６は、全ての非活用変数から抽出された１以上の層別変数候補を、それらの層別変数候補の改善度順にソートする（ステップＳ１０）。具体的には、第２変数特定部１２６は、改善度が大きいほどその層別変数候補が前に配置されるように、それらの層別変数候補を並べ替える。 Then, the second variable identification unit 126 sorts the one or more stratification variable candidates extracted from all the non-utilized variables in order of the degree of improvement of the stratification variable candidates (step S10). Specifically, the second variable identification unit 126 rearranges the stratification variable candidates such that the greater the degree of improvement, the earlier the stratification variable candidate is placed.

次に、第２変数特定部１２６は、ソートされた１以上の層別変数候補のうち、改善度が上位の層別変数候補を、ステップＳ３で設定された総数Ｎだけ、層別変数として特定する（ステップＳ１１）。つまり、第２変数特定部１２６は、大きい改善度から順にＮ個の層別変数候補をそれぞれ層別変数として特定する。これにより、合計Ｎ個の層別変数が特定される。上述の例では、Ｎ＝２であって、質的変数Ｚ０および質的変数Ｚ１がそれぞれ層別変数として特定される。 Next, the second variable identification unit 126 identifies the stratification variable candidates with the highest degree of improvement from among the sorted one or more stratification variable candidates as stratification variables, up to the total number N set in step S3 (step S11). That is, the second variable identification unit 126 identifies the N stratification variable candidates as stratification variables in descending order of degree of improvement. In this way, a total of N stratification variables are identified. In the above example, N=2, and the qualitative variable Z0 and the qualitative variable Z1 are each identified as a stratification variable.
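
Steps S10 and S11 amount to a descending sort followed by taking the first N entries. A minimal sketch follows; the function name and the improvement values are hypothetical and serve only to illustrate the selection.

```python
def select_stratification_variables(improvement, n):
    """Sort candidates by improvement degree (step S10) and keep the top n (step S11)."""
    ranked = sorted(improvement, key=improvement.get, reverse=True)
    return ranked[:n]


# Hypothetical improvement degrees for three stratification variable candidates.
improvement = {"Z2": 0.10, "Z0": 0.42, "Z1": 0.35}
selected = select_stratification_variables(improvement, n=2)
print(selected)  # ['Z0', 'Z1']
```

If fewer than N candidates exist, the slice simply returns all of them; how the device handles that case is not stated in this passage.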

次に、層別部１２７は、その特定されたＮ個の層別変数を用いてデータセットＤｓに対する層別分類を行うことによって、複数の層別データセットを生成する。例えば、図７に示すように、層別データセットＤｓ１～Ｄｓ４が生成される。そして、生成部１２８は、層別データセットごとに、説明変数および目的変数に対する重回帰分析を行うことによって重回帰式を算出する（ステップＳ１２）。これにより、層別データセットごとに、重回帰式からなるモデルが生成される。 Next, the stratification unit 127 generates a plurality of stratified datasets by performing stratification classification on the dataset Ds using the identified N stratification variables. For example, as shown in FIG. 7, stratification datasets Ds1 to Ds4 are generated. Then, the generation unit 128 calculates a multiple regression equation by performing a multiple regression analysis on the explanatory variables and the objective variable for each stratification dataset (step S12). As a result, a model consisting of a multiple regression equation is generated for each stratification dataset.
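
As an illustration of step S12, the sketch below fits one multiple regression equation Y = b0 + b1·X0 + b2·X1 per stratified dataset by ordinary least squares, solving the normal equations in plain Python. The data values, the function names, and the choice of solver are assumptions; the text states only that a multiple regression analysis is performed for each stratified dataset.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (explanatory data not collinear).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows):
    """rows: list of (x0, x1, y). Returns [b0, b1, b2] for y = b0 + b1*x0 + b2*x1."""
    design = [[1.0, x0, x1] for x0, x1, _ in rows]
    ys = [y for _, _, y in rows]
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(3)] for i in range(3)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical stratified datasets keyed by (Z0, Z1) group combination.
strata = {
    ("A", "C"): [(0, 0, 1), (1, 0, 3), (0, 1, 4), (1, 1, 6)],  # y = 1 + 2*x0 + 3*x1
    ("B", "C"): [(0, 0, 5), (1, 0, 4), (0, 1, 7), (1, 1, 6)],  # y = 5 - 1*x0 + 2*x1
}
models = {key: fit_multiple_regression(rows) for key, rows in strata.items()}
for key, coefficients in models.items():
    print(key, [round(c, 6) for c in coefficients])
```

The two hypothetical layers yield different coefficients, which is the situation the stratification is meant to expose: a single regression over the pooled records would blur the two relationships.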

生成部１２８は、さらに、複数の層別データセットのそれぞれで算出された重回帰式に対して、説明変数の自由度調整済み決定係数を算出する（ステップＳ１３）。 The generation unit 128 further calculates, for the multiple regression equation calculated for each of the stratified data sets, the coefficient of determination adjusted for degrees of freedom (the adjusted coefficient of determination) (step S13).
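
The text does not spell out the formula used in step S13. Assuming the standard definition of the degrees-of-freedom-adjusted coefficient of determination, 1 − (1 − R²)(n − 1)/(n − p − 1) for n records and p explanatory variables, the calculation can be sketched as follows. The function name and sample values are hypothetical.

```python
def adjusted_r2(ys, predictions, num_explanatory):
    """Adjusted coefficient of determination (standard definition, assumed here).

    Requires len(ys) > num_explanatory + 1 and non-constant ys.
    """
    n = len(ys)
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - num_explanatory - 1)


# Hypothetical observed values and model predictions for one stratified dataset.
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
predictions = [1.1, 1.9, 3.2, 3.8, 5.0]
score = adjusted_r2(ys, predictions, num_explanatory=2)
print(round(score, 4))  # 0.98
```

Unlike the plain coefficient of determination, this value penalizes additional explanatory variables, which matters when a layer contains only a few records.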

結果出力部１２９は、ステップＳ１２で算出された各重回帰式と、ステップＳ１３で算出された各決定係数とを出力部１０４に出力する。これにより、出力部１０４は、各重回帰式と各決定係数とをディスプレイに表示したり、紙に印刷したり、それらを示すファイルを記憶部１０５に格納する（ステップＳ１４）。 The result output unit 129 outputs each multiple regression equation calculated in step S12 and each coefficient of determination calculated in step S13 to the output unit 104. As a result, the output unit 104 displays each multiple regression equation and each coefficient of determination on a display, prints them on paper, or stores a file containing them in the storage unit 105 (step S14).

図１０は、図９のステップＳ７における質的変数の候補抽出処理の具体的な一例を示すフローチャートである。なお、この候補抽出処理で扱われる処理対象の非活用変数は、質的変数である。 Figure 10 is a flowchart showing a specific example of the process of extracting candidates for qualitative variables in step S7 of Figure 9. Note that the non-utilized variables to be processed in this candidate extraction process are qualitative variables.

質的候補抽出部１２５ａは、処理対象の非活用変数のカテゴリ数が第１閾値以下であるか否かを判定する（ステップＳ７１）。そのカテゴリ数の第１閾値は、例えば２０である。カテゴリ数は、データセットＤｓに含まれる、その処理対象の非活用変数によって示される複数の同一データからなるグループ数である。例えば、図５に示すデータセットＤｓにおいて、質的変数である非活用変数Ｚ０によって示される複数のデータには、「Ａ」を示すデータと、「Ｂ」を示すデータとが含まれている。したがって、その非活用変数Ｚ０のカテゴリ数は２である。同様に、図５に示すデータセットＤｓにおいて、質的変数である非活用変数Ｚ１によって示される複数のデータには、「Ｃ」を示すデータと、「Ｄ」を示すデータとが含まれている。したがって、その非活用変数Ｚ１のカテゴリ数は２である。 The qualitative candidate extraction unit 125a determines whether the number of categories of the non-utilized variable to be processed is equal to or less than a first threshold (step S71). The first threshold for the number of categories is, for example, 20. The number of categories is the number of groups consisting of multiple identical data represented by the non-utilized variable to be processed, which are included in the data set Ds. For example, in the data set Ds shown in FIG. 5, the multiple data represented by the non-utilized variable Z0, which is a qualitative variable, includes data representing "A" and data representing "B". Therefore, the number of categories of the non-utilized variable Z0 is 2. Similarly, in the data set Ds shown in FIG. 5, the multiple data represented by the non-utilized variable Z1, which is a qualitative variable, includes data representing "C" and data representing "D". Therefore, the number of categories of the non-utilized variable Z1 is 2.

次に、質的候補抽出部１２５ａは、処理対象の非活用変数のカテゴリ数が第１閾値以下ではないと判定すると（ステップＳ７１のＮｏ）、その非活用変数を影響度の算出対象から除外する（ステップＳ７２）。例えば、カテゴリ数が比較的多い非活用変数を層別変数候補として抽出し、さらに、その層別変数候補を層別変数として用いれば、多くの層別データセットが生成される。その結果、多くのモデルが生成されることによって、各モデルの精度の向上と、それらのモデルの使い易さの向上とを、期待することが難しいと想定される。したがって、ステップＳ７２では、そのようなカテゴリ数が多い非活用変数を影響度の算出対象から除外することによって、その非活用変数が層別変数候補として抽出されることを抑制し、その非活用変数が層別変数に用いられることを抑制することができる。 Next, when the qualitative candidate extraction unit 125a determines that the number of categories of the non-utilized variable to be processed is not equal to or less than the first threshold (No in step S71), the qualitative candidate extraction unit 125a excludes the non-utilized variable from the calculation of the influence (step S72). For example, if a non-utilized variable with a relatively large number of categories is extracted as a stratification variable candidate, and the stratification variable candidate is further used as a stratification variable, many stratification data sets are generated. As a result, it is assumed that it is difficult to expect improvement in the accuracy of each model and improvement in the ease of use of the models due to the generation of many models. Therefore, in step S72, by excluding such a non-utilized variable with a large number of categories from the calculation of the influence, it is possible to suppress the extraction of the non-utilized variable as a stratification variable candidate and to suppress the use of the non-utilized variable as a stratification variable.
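
The category-count check of steps S71 and S72 can be sketched as counting distinct values and comparing against the first threshold. The threshold of 20 is the example value given in the text; the function name is an assumption.

```python
FIRST_THRESHOLD = 20  # example value of the first threshold given in the text


def passes_category_check(values, threshold=FIRST_THRESHOLD):
    """Return True when the number of distinct categories is at most the threshold (step S71)."""
    return len(set(values)) <= threshold


z0_values = ["A", "A", "A", "B", "B", "B"]                # 2 categories, as for Z0 in FIG. 5
serial_numbers = [f"SN{i:03d}" for i in range(21)]        # hypothetical variable, 21 categories

print(passes_category_check(z0_values))       # True
print(passes_category_check(serial_numbers))  # False
```

A variable such as the hypothetical serial number would split the dataset into so many layers that each model would be fitted on very few records, which is the concern step S72 addresses.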

一方、質的候補抽出部１２５ａは、処理対象の非活用変数のカテゴリ数が第１閾値以下であると判定すると（ステップＳ７１のＹｅｓ）、その処理対象の非活用変数の影響度を教師あり機械学習によって算出する（ステップＳ７３）。その教師あり機械学習は、例えばランダムフォレストを用いた学習である。ランダムフォレストは、複数の決定木を用いる手法である。例えば、質的候補抽出部１２５ａは、データセットＤｓに含まれる目的変数の各データと、データセットＤｓに含まれる処理対象の非活用変数の各データとを、それぞれ教師データとして用いたランダムフォレストの機械学習を実行する。このランダムフォレストは、例えば目的変数のデータから処理対象の非活用変数のデータを推定するための学習モデルである。より具体的には、処理対象の非活用変数は、非活用変数Ｚ０である。この場合、質的候補抽出部１２５ａは、目的変数のデータをランダムフォレストに入力することによって、その目的変数のデータに対応する非活用変数Ｚ０のデータがそのランダムフォレストから出力されるように、機械学習を実行する。このときランダムフォレストから出力される非活用変数Ｚ０のデータは、「Ａ」または「Ｂ」である。 On the other hand, when the qualitative candidate extraction unit 125a determines that the number of categories of the non-utilized variable to be processed is equal to or less than the first threshold (Yes in step S71), it calculates the influence of the non-utilized variable to be processed by supervised machine learning (step S73). The supervised machine learning is, for example, learning using a random forest. A random forest is a method using multiple decision trees. For example, the qualitative candidate extraction unit 125a executes machine learning of a random forest using each data of the objective variable included in the data set Ds and each data of the non-utilized variable to be processed included in the data set Ds as teacher data. This random forest is, for example, a learning model for estimating the data of the non-utilized variable to be processed from the data of the objective variable. More specifically, the non-utilized variable to be processed is the non-utilized variable Z0. In this case, the qualitative candidate extraction unit 125a executes machine learning so that the data of the objective variable is input to the random forest, and the data of the non-utilized variable Z0 corresponding to the data of the objective variable is output from the random forest. At this time, the data of the non-utilized variable Z0 output from the random forest is "A" or "B".

質的候補抽出部１２５ａは、ランダムフォレストに含まれる複数の決定木の不純度を表す指標であるジニ係数Ｇに基づいて、その処理対象の非活用変数の影響度を算出する。ジニ係数Ｇは、決定木のノードごとに、式（１）で定義される。 The qualitative candidate extraction unit 125a calculates the influence of the non-utilized variables to be processed based on the Gini coefficient G, which is an index representing the impurity of multiple decision trees included in the random forest. The Gini coefficient G is defined for each node of the decision tree by formula (1).
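
Formula (1) itself is not reproduced in this text. Judging from the definitions of C and Pi and the worked example given for it, it is presumably the standard Gini impurity of a node. The reconstruction below is an assumption based on those definitions, not a copy of the original formula:

```latex
G = 1 - \sum_{i=1}^{C} P_i^{2} \tag{1}
```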

ここで、式（１）において、Ｃはカテゴリ数である。また、Ｐｉは、カテゴリｉに属するデータ数を、全データ数で割ったものである。つまり、Ｐｉは、そのジニ係数Ｇに対応するノードにおいて分類されたカテゴリｉのデータの数を、そのノードにおいて分類されたデータの総数で除算することによって得られる商である。例えば、「Ａ」を示す２つのデータと、「Ｂ」を示す１つのデータとがそのノードにおいて分類された場合、Ｇ＝１－（２／３）^２－（１／３）^２である。 Here, in formula (1), C is the number of categories. Furthermore, Pi is the number of data belonging to category i divided by the total number of data. In other words, Pi is the quotient obtained by dividing the number of data of category i classified in the node corresponding to the Gini coefficient G by the total number of data classified in the node. For example, when two data indicating "A" and one data indicating "B" are classified in the node, G=1-(2/3) ² -(1/3) ² .
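
The per-node calculation described above can be sketched as follows. This is a minimal illustration, assuming the standard Gini impurity G = 1 - ΣPi²; the function name `gini_impurity` is hypothetical and does not appear in the original.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini coefficient G = 1 - sum(P_i^2) for the data classified at one node."""
    total = len(labels)
    if total == 0:
        raise ValueError("a node must contain at least one data item")
    counts = Counter(labels)  # number of data items per category i
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Worked example from the text: two "A" items and one "B" item,
# i.e. G = 1 - (2/3)^2 - (1/3)^2.
example = gini_impurity(["A", "A", "B"])
```

A node that contains a single category has G = 0, which is the pure case that the learning tries to approach.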

質的候補抽出部１２５ａは、決定木におけるジニ係数ができるだけ小さくなるように学習を行う。そして、質的候補抽出部１２５ａは、ランダムフォレストに用いられた複数の決定木の全てのジニ係数の平均値が小さいほど大きい値を示す影響度を算出する。例えば、質的候補抽出部１２５ａは、その平均値の逆数を影響度として算出する。 The qualitative candidate extraction unit 125a performs learning so that the Gini coefficients in the decision trees become as small as possible. It then calculates an influence degree that takes a larger value as the average of all the Gini coefficients of the decision trees used in the random forest becomes smaller. For example, the qualitative candidate extraction unit 125a calculates the reciprocal of that average value as the influence degree.
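
The influence degree described above can be sketched as follows. The function name `influence_from_gini` is hypothetical, and the handling of an average of exactly zero is an assumption that the text does not address.

```python
import math


def influence_from_gini(gini_values):
    """Return 1 / mean(G) over the Gini coefficients collected from the decision trees."""
    if not gini_values:
        raise ValueError("at least one Gini coefficient is required")
    mean_gini = sum(gini_values) / len(gini_values)
    # Assumption: perfectly pure trees (mean of 0) map to an infinite influence degree.
    return math.inf if mean_gini == 0 else 1.0 / mean_gini
```

A smaller average impurity therefore yields a larger influence degree, which is the only property the text requires of the measure.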

次に、質的候補抽出部１２５ａは、例えばランダムフォレストを用いた学習で算出された影響度が第２閾値以上であるか否かを判定する（ステップＳ７４）。ここで、質的候補抽出部１２５ａは、処理対象の非活用変数の影響度が第２閾値以上ではないと判定すると（ステップＳ７４のＮｏ）、その非活用変数を層別変数候補から除外する（ステップＳ７５）。つまり、その非活用変数は、層別変数候補として採用されない。 Next, the qualitative candidate extraction unit 125a determines whether the degree of influence calculated by learning using, for example, a random forest is equal to or greater than a second threshold (step S74). If the qualitative candidate extraction unit 125a determines that the degree of influence of the non-utilized variable being processed is not equal to or greater than the second threshold (No in step S74), it excludes the non-utilized variable from the stratification variable candidates (step S75). In other words, the non-utilized variable is not adopted as a stratification variable candidate.

一方、質的候補抽出部１２５ａは、処理対象の非活用変数の影響度が第２閾値以上であると判定すると（ステップＳ７４のＹｅｓ）、その非活用変数を層別変数候補として採用する（ステップＳ７６）。つまり、質的候補抽出部１２５ａは、処理対象の非活用変数を層別変数候補として採用するか否かを判断し、採用すると判断された非活用変数を層別変数候補として抽出する。 On the other hand, if the qualitative candidate extraction unit 125a determines that the influence of the non-utilized variable being processed is equal to or greater than the second threshold (Yes in step S74), it adopts the non-utilized variable as a stratification variable candidate (step S76). In other words, the qualitative candidate extraction unit 125a determines whether or not to adopt the non-utilized variable being processed as a stratification variable candidate, and extracts the non-utilized variable that is determined to be adopted as a stratification variable candidate.
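
Steps S71 to S76 amount to a two-threshold filter, which can be sketched as follows. The function name, the `influence_of` callable, the variable names other than Z0, and all numeric values are hypothetical. How the influence degree is computed (for example, by a random forest) is deliberately left outside this sketch.

```python
def extract_qualitative_candidates(category_counts, influence_of,
                                   first_threshold, second_threshold):
    """Return the qualitative non-utilized variables adopted as stratification candidates."""
    candidates = []
    for name, n_categories in category_counts.items():
        if n_categories > first_threshold:
            continue  # No in S71 -> S72: influence is not even calculated
        influence = influence_of(name)  # S73
        if influence >= second_threshold:
            candidates.append(name)  # Yes in S74 -> S76
        # No in S74 -> S75: the variable is not adopted
    return candidates


# Hypothetical inputs: Z5 has too many categories, Z7 has too little influence.
counts = {"Z0": 2, "Z5": 40, "Z7": 3}
influences = {"Z0": 3.1, "Z7": 1.2}
selected = extract_qualitative_candidates(counts, influences.__getitem__, 10, 2.0)
```

Because the category-count check comes first, the comparatively expensive influence calculation is skipped for variables such as Z5.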

このように、本実施の形態における質的候補抽出部１２５ａは、処理対象の非活用変数が質的変数の場合、ランダムフォレストを用いてその質的変数の影響度を算出し、その影響度に基づいてその質的変数を層別変数候補として抽出するか否かを判断する。 In this way, in the present embodiment, when the non-utilized variable to be processed is a qualitative variable, the qualitative candidate extraction unit 125a calculates the influence of the qualitative variable using a random forest, and determines whether or not to extract the qualitative variable as a stratification variable candidate based on the influence.

これにより、データセットＤｓに含まれる全ての非活用変数のうちの全ての質的変数のそれぞれを層別変数候補として扱うことなく、例えば、目的変数に対する影響度が大きい質的変数のみを層別変数候補として扱うことができる。その結果、全ての非活用変数のうちの全ての質的変数のそれぞれの改善度を特定することなく、一部の質的変数、すなわち影響度が大きい質的変数のみに対して改善度を特定することができる。つまり、改善度の特定対象とされる質的変数の数を減らすことができる。さらに、影響度が大きい質的変数は、大きい改善度が見込まれる質的変数であるため、改善度の特定の処理負担を効果的に減らすことができる。 This makes it possible to treat, for example, only the qualitative variables that have a large influence on the objective variable as stratification variable candidates, without treating each and every qualitative variable among all the non-utilized variables included in the dataset Ds as a stratification variable candidate. As a result, it is possible to identify the degree of improvement for only some of the qualitative variables, i.e., the qualitative variables that have a large influence, without identifying the degree of improvement for each of all the qualitative variables among all the non-utilized variables. In other words, it is possible to reduce the number of qualitative variables that are the targets for identifying the degree of improvement. Furthermore, because qualitative variables with a large influence are qualitative variables that are expected to have a large degree of improvement, the processing burden for identifying the degree of improvement can be effectively reduced.

図１１は、図９のステップＳ８における量的変数の候補抽出処理の具体的な一例を示すフローチャートである。なお、この候補抽出処理で扱われる処理対象の非活用変数は、量的変数である。 Figure 11 is a flowchart showing a specific example of the process of extracting candidates for quantitative variables in step S8 of Figure 9. Note that the non-utilized variables to be processed in this candidate extraction process are quantitative variables.

量的候補抽出部１２５ｂは、データセットＤｓに含まれる処理対象の非活用変数の各データに対するクラスタリングを、教師なし機械学習によって行う（ステップＳ８１）。その教師なし機械学習は、例えば混合ガウスモデル（ＧＭＭ：Gaussian Mixture Model）である。 The quantitative candidate extraction unit 125b performs clustering on each data of the non-utilized variables to be processed that are included in the data set Ds by unsupervised machine learning (step S81). The unsupervised machine learning is, for example, a Gaussian Mixture Model (GMM).

混合ガウスモデルは、ある確率分布が与えられたとき、その確率分布を複数のガウス関数（すなわち正規分布）の線形結合で近似する手法である。線形結合では、複数のガウス関数のそれぞれは、重みπｋを用いて結合される。重みπｋは、ｋ番目のガウス関数の重みであって、混合係数とも呼ばれる。（ａ，ｂ）の２次元で考えた場合、ｋ番目のガウス関数は、ａの平均値μａ＿ｋと、ｂの平均値μｂ＿ｋと、ａの分散Σａ＿ｋと、ｂの分散Σｂ＿ｋと、ａとｂの共分散Σａｂ＿ｋとを有する。各正規分布の大きさは、簡易的にΣｂ＿ｋ＋Σａ＿ｋで扱うことができる。なお、本実施の形態では、（ａ，ｂ）は、（目的変数，量的変数である非活用変数）である。 The Gaussian mixture model is a method for approximating a given probability distribution with a linear combination of multiple Gaussian functions (i.e., normal distributions). In linear combination, multiple Gaussian functions are combined using a weight πk. The weight πk is the weight of the kth Gaussian function and is also called a mixing coefficient. When considered in two dimensions of (a, b), the kth Gaussian function has the mean value μa_k of a, the mean value μb_k of b, the variance Σa_k of a, the variance Σb_k of b, and the covariance Σab_k of a and b. The magnitude of each normal distribution can be simply handled as Σb_k + Σa_k. In this embodiment, (a, b) are (objective variables, non-utilized variables that are quantitative variables).
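
In standard notation, the two-dimensional mixture described above can be written as follows. The symbol K for the number of Gaussian functions (clusters) and the symbol s_k for the simplified size are introduced here for illustration and do not appear in the original.

```latex
p(a, b) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left( \begin{pmatrix} a \\ b \end{pmatrix} \,\middle|\, \mu_k, \Sigma_k \right),
\qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,

\mu_k = \begin{pmatrix} \mu_{a,k} \\ \mu_{b,k} \end{pmatrix},
\qquad
\Sigma_k = \begin{pmatrix} \Sigma_{a,k} & \Sigma_{ab,k} \\ \Sigma_{ab,k} & \Sigma_{b,k} \end{pmatrix},
\qquad
s_k = \Sigma_{a,k} + \Sigma_{b,k} = \operatorname{tr} \Sigma_k .
```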

量的候補抽出部１２５ｂは、混合ガウスモデルでのハイパーパラメータであるクラスタ数を変更しながらその混合ガウスモデルを解析する。そして、量的候補抽出部１２５ｂは、例えば赤池情報量基準（ＡＩＣ：Akaike’s Information Criterion）またはベイズ情報量基準（ＢＩＣ：Bayesian Information Criterion）が最小となるクラスタ数を採用する。これにより、そのクラスタ数だけクラスタが生成される。なお、クラスタ数は、１つ以上である。また、クラスタは、上述のカテゴリまたはグループに相当する。 The quantitative candidate extraction unit 125b analyzes the Gaussian mixture model while changing the number of clusters, which is a hyperparameter in the Gaussian mixture model. The quantitative candidate extraction unit 125b then adopts the number of clusters that minimizes, for example, the Akaike's Information Criterion (AIC) or the Bayesian Information Criterion (BIC). This generates as many clusters as the number of clusters. The number of clusters is one or more. A cluster corresponds to the above-mentioned category or group.
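
The choice of the number of clusters can be sketched as follows, assuming that a Gaussian mixture model has already been fitted for each candidate number of clusters and that its maximized log-likelihood is available. The function names and the numbers in the usage example are hypothetical. The parameter count assumes a full covariance matrix per cluster in the two-dimensional (a, b) space.

```python
import math


def gmm_param_count(n_components, n_dims=2):
    """Free parameters of a full-covariance GMM: weights + means + covariances."""
    weights = n_components - 1  # mixing coefficients sum to 1
    means = n_components * n_dims
    covariances = n_components * n_dims * (n_dims + 1) // 2
    return weights + means + covariances


def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood


def select_cluster_count(log_likelihoods, n_samples, criterion="bic"):
    """log_likelihoods maps a cluster count to the maximized log-likelihood of its GMM."""
    def score(k):
        n_params = gmm_param_count(k)
        if criterion == "bic":
            return bic(log_likelihoods[k], n_params, n_samples)
        return aic(log_likelihoods[k], n_params)
    return min(log_likelihoods, key=score)


# Hypothetical log-likelihoods for 1, 2 and 3 clusters fitted to 100 records.
best = select_cluster_count({1: -400.0, 2: -350.0, 3: -348.0}, n_samples=100)
```

Both criteria penalize the extra parameters of a larger model, so a third cluster that barely raises the likelihood is not selected.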

その後、量的候補抽出部１２５ｂは、ステップＳ８２～Ｓ８４を含む第２ループ処理を、ステップＳ８１で生成された全てのクラスタのそれぞれに対して順に実行する。 Then, the quantitative candidate extraction unit 125b executes a second loop process including steps S82 to S84 for each of all the clusters generated in step S81 in order.

具体的には、まず、量的候補抽出部１２５ｂは、処理対象のクラスタ内のデータ数が第３閾値以上であるか否かを判定する（ステップＳ８２）。ここで、量的候補抽出部１２５ｂは、データ数が第３閾値以上であると判定すると（ステップＳ８２のＹｅｓ）、そのクラスタを高信頼クラスタとして採用する（ステップＳ８３）。一方、量的候補抽出部１２５ｂは、データ数が第３閾値未満であると判定すると（ステップＳ８２のＮｏ）、そのクラスタを低信頼クラスタに分類する（ステップＳ８４）。これにより、層別変数候補が、信頼性の低いクラスタを用いて抽出されることを抑制することができる。 Specifically, first, the quantitative candidate extraction unit 125b determines whether the number of data in the cluster to be processed is equal to or greater than a third threshold (step S82). Here, if the quantitative candidate extraction unit 125b determines that the number of data is equal to or greater than the third threshold (Yes in step S82), it adopts the cluster as a high-reliability cluster (step S83). On the other hand, if the quantitative candidate extraction unit 125b determines that the number of data is less than the third threshold (No in step S82), it classifies the cluster as a low-reliability cluster (step S84). This makes it possible to prevent stratification variable candidates from being extracted using low-reliability clusters.

このようなステップＳ８２～Ｓ８４を含む第２ループ処理が、ステップＳ８１で生成された全てのクラスタのそれぞれに対して順に実行される。これにより、量的変数の候補抽出処理における第一段階の処理として、その全てのクラスタから信頼性の低いクラスタを除外する処理が行われる。 This second loop process, which includes steps S82 to S84, is executed in turn for each of the clusters generated in step S81. As a result, the first stage of the candidate extraction process for quantitative variables removes the low-reliability clusters from among all of the clusters.

そして、量的候補抽出部１２５ｂは、量的変数の候補抽出処理における第二段階の処理として、ステップＳ８５～Ｓ８７の処理を行う。つまり、量的候補抽出部１２５ｂは、ステップＳ８３で採用された高信頼クラスタが２つ以上あるか否かを判定する（ステップＳ８５）。ここで、量的候補抽出部１２５ｂは、その高信頼クラスタが２つ以上あると判定すると（ステップＳ８５のＹｅｓ）、それらの高信頼クラスタに対応する処理対象の非活用変数を層別変数候補として採用する（ステップＳ８７）。一方、量的候補抽出部１２５ｂは、高信頼クラスタが２つ以上ないと判定すると（ステップＳ８５のＮｏ）、その非活用変数を層別変数候補から除外する（ステップＳ８６）。 Then, the quantitative candidate extraction unit 125b performs steps S85 to S87 as the second stage of processing in the candidate extraction process for quantitative variables. That is, the quantitative candidate extraction unit 125b determines whether there are two or more high-reliability clusters adopted in step S83 (step S85). Here, if the quantitative candidate extraction unit 125b determines that there are two or more high-reliability clusters (Yes in step S85), it adopts the non-utilized variables to be processed that correspond to those high-reliability clusters as stratification variable candidates (step S87). On the other hand, if the quantitative candidate extraction unit 125b determines that there are not two or more high-reliability clusters (No in step S85), it excludes the non-utilized variables from the stratification variable candidates (step S86).
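
The two stages of the quantitative candidate extraction (steps S82 to S87) can be sketched as follows. The function name and the cluster sizes in the usage example are hypothetical.

```python
def is_stratification_candidate(cluster_sizes, third_threshold):
    """Decide whether a quantitative non-utilized variable is adopted as a candidate."""
    # First stage (S82-S84): keep only clusters with enough data items.
    high_reliability = [size for size in cluster_sizes if size >= third_threshold]
    # Second stage (S85-S87): adopt the variable only if two or more such clusters remain.
    return len(high_reliability) >= 2


adopted = is_stratification_candidate([40, 35, 3], third_threshold=10)
rejected = is_stratification_candidate([70, 5, 3], third_threshold=10)
```

A variable whose data effectively fall into a single well-populated cluster cannot split the data set into several usable strata, which is why it is excluded.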

このように、本実施の形態における量的候補抽出部１２５ｂは、処理対象の非活用変数が量的変数の場合、その量的変数の機械学習によるクラスタリングによってクラスタの状態を特定し、そのクラスタの状態に基づいてその量的変数を層別変数候補として抽出するか否かを判断する。 In this way, in the present embodiment, when the non-utilized variable to be processed is a quantitative variable, the quantitative candidate extraction unit 125b identifies the cluster state by clustering the quantitative variable through machine learning, and determines whether or not to extract the quantitative variable as a stratification variable candidate based on the cluster state.

これにより、データセットＤｓに含まれる全ての非活用変数のうちの全ての量的変数のそれぞれを層別変数候補として扱うことなく、例えば、信頼性の高い多くのクラスタを有する量的変数のみを層別変数候補として扱うことができる。その結果、改善度特定部１４０が、全ての非活用変数のうちの全ての量的変数のそれぞれの改善度を特定することなく、一部の量的変数、すなわち信頼性の高い多くのクラスタを有する量的変数のみに対して改善度を特定することができる。つまり、改善度の特定対象とされる量的変数の数を減らすことができる。さらに、信頼性の高い多くのクラスタを有する量的変数は、大きい改善度が見込まれる量的変数であるため、改善度の特定の処理負担を効果的に減らすことができる。 This makes it possible to treat, for example, only quantitative variables with many highly reliable clusters as stratification variable candidates, without treating each and every quantitative variable among all the non-utilized variables included in the data set Ds as a stratification variable candidate. As a result, the improvement degree identification unit 140 can identify the improvement degree for only some of the quantitative variables, i.e., those with many highly reliable clusters, without identifying the improvement degree for each and every quantitative variable among all the non-utilized variables. In other words, the number of quantitative variables for which the improvement degree is identified can be reduced. Furthermore, since quantitative variables with many highly reliable clusters are the ones for which a large improvement degree is expected, the processing burden of identifying the improvement degree can be reduced effectively.

図１２は、図９のステップＳ９における改善度算出処理の具体的な一例を示すフローチャートである。なお、この改善度算出処理で扱われる層別変数候補は、質的変数であっても、量的変数であってもよい。 Figure 12 is a flowchart showing a specific example of the improvement calculation process in step S9 of Figure 9. Note that the stratification variable candidates used in this improvement calculation process may be qualitative variables or quantitative variables.

改善度特定部１４０は、層別分類が行われていないデータセットＤｓにおける目的変数と説明変数との関係を示す重回帰式を算出し、その重回帰式に対する自由度調整済み決定係数を第１決定係数として算出する（ステップＳ９１）。層別分類が行われていないデータセットＤｓは、例えば図５に示すデータセットＤｓである。 The improvement degree identification unit 140 calculates a multiple regression equation that indicates the relationship between the objective variable and the explanatory variables in a data set Ds that has not been subjected to stratification, and calculates the coefficient of determination adjusted for the degrees of freedom for the multiple regression equation as the first coefficient of determination (step S91). The data set Ds that has not been subjected to stratification is, for example, the data set Ds shown in FIG. 5.

その後、改善度特定部１４０は、ステップＳ９２～Ｓ９６を含む第３ループ処理を、ステップＳ７およびステップＳ８で抽出された全ての層別変数候補のそれぞれに対して順に実行する。この第３ループ処理では、改善度特定部１４０は、候補抽出部１２５で抽出された１以上の層別変数候補のそれぞれに対して順に自由度調整済み決定係数を第２決定係数として算出する。 Then, the improvement degree identification unit 140 executes a third loop process including steps S92 to S96 for each of all stratification variable candidates extracted in steps S7 and S8 in order. In this third loop process, the improvement degree identification unit 140 calculates the degree of freedom-adjusted coefficient of determination as the second coefficient of determination for each of the one or more stratification variable candidates extracted by the candidate extraction unit 125 in order.

具体的には、まず、改善度特定部１４０は、抽出された層別変数候補を用いてデータセットＤｓを層別分類することによって、複数の層別データセットを生成する（ステップＳ９２）。この層別分類は、層別部１２７による層別分類と同様であるが、図７の例のように、複数の非活用変数を用いることなく、１つの非活用変数である層別変数候補を用いて行われる。なお、複数の層別データセットのそれぞれは、図７の例のように、データセットＤｓに含まれる１つ以上のレコードを含む。その１つ以上のレコードのそれぞれは、層別変数候補の同一のデータまたは類似しているデータを含む。つまり、その１つ以上のレコードのそれぞれは、層別変数候補の同一のグループ（すなわちカテゴリまたはクラスタ）に属するデータを含む。さらに、その１つ以上のレコードのそれぞれは、目的変数および説明変数のそれぞれのデータを含む。 Specifically, first, the improvement degree identification unit 140 generates a plurality of stratified datasets by stratifying the dataset Ds using the extracted stratification variable candidates (step S92). This stratification is similar to the stratification by the stratification unit 127, but is performed using a stratification variable candidate that is a single non-utilized variable, without using a plurality of non-utilized variables, as in the example of FIG. 7. Each of the plurality of stratified datasets includes one or more records included in the dataset Ds, as in the example of FIG. 7. Each of the one or more records includes the same data or similar data of the stratification variable candidate. In other words, each of the one or more records includes data that belongs to the same group (i.e., category or cluster) of the stratification variable candidate. Furthermore, each of the one or more records includes data of each of the objective variable and the explanatory variable.
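
Step S92 can be sketched as a grouping of records by a single stratification variable candidate. The function name `stratify` and the record values are hypothetical. For a quantitative candidate, the sketch assumes that each record already carries its cluster label in place of the raw value.

```python
from collections import defaultdict


def stratify(records, candidate):
    """Split records into stratified data sets keyed by the candidate's group."""
    strata = defaultdict(list)
    for record in records:
        # Records that share a group (category or cluster) land in the same data set,
        # each keeping its objective-variable and explanatory-variable data.
        strata[record[candidate]].append(record)
    return dict(strata)


# Hypothetical data set Ds with objective variable Y and explanatory variables X0, X1.
ds = [
    {"Y": 1.2, "X0": 0.5, "X1": 3.0, "Z0": "A"},
    {"Y": 0.9, "X0": 0.4, "X1": 2.5, "Z0": "B"},
    {"Y": 1.4, "X0": 0.7, "X1": 3.1, "Z0": "A"},
]
stratified = stratify(ds, "Z0")
```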

次に、改善度特定部１４０は、その生成された複数の層別データセットのそれぞれについて、その層別データセットにおける目的変数と説明変数との関係を示す重回帰式を算出する（ステップＳ９３）。そして、改善度特定部１４０は、その算出された各重回帰式に対する自由度調整済み決定係数を第２決定係数として算出し（ステップＳ９４）、それらの第２決定係数の代表値を代表決定係数として決定する（ステップＳ９５）。例えば、改善度特定部１４０は、ステップＳ９３で算出された重回帰式ａ、ｂ、ｃ・・・に対して第２決定係数２ａ、２ｂ、２ｃ、・・・を算出し、それらの第２決定係数２ａ、２ｂ、２ｃ、・・・の代表値を代表決定係数として決定する。具体的には、改善度特定部１４０は、第２決定係数２ａ、２ｂ、２ｃ、・・・の平均値または最大値を代表決定係数に決定する。 Next, the improvement degree identification unit 140 calculates a multiple regression equation indicating the relationship between the objective variable and the explanatory variable in the stratified data set for each of the generated stratified data sets (step S93). Then, the improvement degree identification unit 140 calculates the degree of freedom-adjusted coefficient of determination for each of the calculated multiple regression equations as the second coefficient of determination (step S94), and determines the representative value of the second coefficient of determination as the representative coefficient of determination (step S95). For example, the improvement degree identification unit 140 calculates second coefficients of determination 2a, 2b, 2c, ... for the multiple regression equations a, b, c ... calculated in step S93, and determines the representative value of the second coefficients of determination 2a, 2b, 2c, ... as the representative coefficient of determination. Specifically, the improvement degree identification unit 140 determines the average value or maximum value of the second coefficients of determination 2a, 2b, 2c, ... as the representative coefficient of determination.

そして、改善度特定部１４０は、ステップＳ９５で決定された代表決定係数と、ステップＳ９１で算出された第１決定係数との差分を、ステップＳ９２で用いられた層別変数候補の改善度として算出する（ステップＳ９６）。つまり、改善度特定部１４０は、「代表決定係数－第１決定係数＝改善度」によって、層別変数候補の改善度を算出する。言い換えれば、代表決定係数と第１決定係数との差分が、層別変数候補の改善度として定義される。 Then, the improvement degree identifying unit 140 calculates the difference between the representative coefficient of determination determined in step S95 and the first coefficient of determination calculated in step S91 as the degree of improvement of the stratification variable candidate used in step S92 (step S96). That is, the improvement degree identifying unit 140 calculates the degree of improvement of the stratification variable candidate by "representative coefficient of determination - first coefficient of determination = improvement degree". In other words, the difference between the representative coefficient of determination and the first coefficient of determination is defined as the degree of improvement of the stratification variable candidate.
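
Steps S94 to S96 can be sketched as follows. The function names and the numeric values in the usage example are hypothetical. The adjusted coefficient of determination uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), which the text names but does not spell out.

```python
def adjusted_r2(r2, n_samples, n_explanatory):
    """Degrees-of-freedom-adjusted coefficient of determination (standard formula)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_explanatory - 1)


def improvement_degree(first_coefficient, second_coefficients, representative="mean"):
    """Improvement degree = representative coefficient - first coefficient (S95, S96)."""
    if representative == "mean":
        rep = sum(second_coefficients) / len(second_coefficients)
    elif representative == "max":
        rep = max(second_coefficients)
    else:
        raise ValueError("representative must be 'mean' or 'max'")
    return rep - first_coefficient


# Hypothetical values: one unstratified model and two stratified models.
gain_mean = improvement_degree(0.3, [0.5, 0.7])
gain_max = improvement_degree(0.3, [0.5, 0.7], representative="max")
```

A positive result means that stratifying by the candidate is expected to make the per-stratum models fit better than the single unstratified model.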

このように、本実施の形態における改善度特定部１４０は、分類されていないデータセットＤｓから得られる重回帰式の確からしさを示す第１決定係数と、層別変数候補を用いて分類されたデータセットＤｓから得られる重回帰式の確からしさを示す代表決定係数との差分を算出することによって、改善度を特定する。なお、その重回帰式は、目的変数と説明変数との関係を示すモデルである。また、第１決定係数、第２決定係数、および代表決定係数のそれぞれは、自由度調整済み決定係数であって、モデルの確からしさを示す指標である。 In this manner, the improvement degree identification unit 140 in this embodiment identifies the improvement degree by calculating the difference between the first coefficient of determination, which indicates the likelihood of the multiple regression equation obtained from the unclassified data set Ds, and the representative coefficient of determination, which indicates the likelihood of the multiple regression equations obtained from the data set Ds classified using the stratification variable candidate. Note that each multiple regression equation is a model that indicates the relationship between the objective variable and the explanatory variables. Also, each of the first coefficient of determination, the second coefficient of determination, and the representative coefficient of determination is a degrees-of-freedom-adjusted coefficient of determination, and is an index that indicates the likelihood of the model.

これにより、１以上の層別変数候補のそれぞれについて、層別分類が行われない場合に得られる自由度調整済み決定係数からの改善度を適切に特定することができる。その結果、層別部１２７は、その改善度を用いて１以上の層別変数候補から改善度が大きい層別変数候補を層別変数として用いることによって、データセットＤｓに対する最適な層別分類を行うことができる。その結果、層別部１２７による層別分類によって得られる複数の層別データセットのそれぞれに対して、より精度の高いモデルを生成することができる。 This makes it possible to appropriately identify, for each of the one or more stratification variable candidates, the degree of improvement over the degrees-of-freedom-adjusted coefficient of determination obtained when stratification is not performed. As a result, the stratification unit 127 can perform optimal stratification of the data set Ds by using, as a stratification variable, a candidate with a large improvement degree from among the one or more stratification variable candidates. Consequently, a more accurate model can be generated for each of the multiple stratified data sets obtained through the stratification by the stratification unit 127.

［モデルの例］ [Model example]
以上のように、本実施の形態では、データセットＤｓに対して層別分類が行われる。例えば、非活用変数Ｚ０および非活用変数Ｚ１がそれぞれ層別変数として特定された場合には、図７に示すように、４つの層別データセットＤｓ１～Ｄｓ４が生成される。そして、４つの層別データセットＤｓ１～Ｄｓ４のそれぞれからモデルが生成される。これにより、モデルの精度向上を図ることができる。 As described above, in this embodiment, stratified classification is performed on the data set Ds. For example, when the non-utilized variable Z0 and the non-utilized variable Z1 are each identified as a stratification variable, four stratified data sets Ds1 to Ds4 are generated as shown in FIG. 7. Then, a model is generated from each of the four stratified data sets Ds1 to Ds4. This makes it possible to improve the accuracy of the model.

具体的には、層別分類が行われない場合、データセットＤｓから生成されるモデルは、以下の式（２）のように示される。なお、式（２）では、ｘ_０およびｘ_１が、上述の説明変数Ｘ０および説明変数Ｘ１にそれぞれ相当し、ｆが上述の目的変数Ｙに相当する。 Specifically, when stratified classification is not performed, the model generated from the data set Ds is expressed by the following formula (2). In formula (2), x₀ and x₁ correspond to the explanatory variables X0 and X1 described above, respectively, and f corresponds to the objective variable Y described above.

一方、本実施の形態では、上述のように層別分類が行われるため、以下の式（３）～式（６）に示される４つのモデルがそれぞれ重回帰式として生成される。なお、式（３）～式（６）では、ｘ_０およびｘ_１が、上述の説明変数Ｘ０および説明変数Ｘ１にそれぞれ相当し、ｆ_００、ｆ_０１、ｆ_１０、およびｆ_１１のそれぞれが上述の目的変数Ｙに相当する。具体的には、式（３）は、図７の（ｄ）および図８に示す層別データセットＤｓ４から生成されたモデルであって、その層別データセットＤｓ４は、層別変数Ｚ０のデータ「Ｂ」と、層別変数Ｚ１のデータ「Ｄ」とを含む各レコードを含む。式（３）のｆ_００は、この層別データセットＤｓ４の目的変数Ｙに相当する。式（４）は、図７の（ｃ）および図８に示す層別データセットＤｓ３から生成されたモデルであって、その層別データセットＤｓ３は、層別変数Ｚ０のデータ「Ａ」と、層別変数Ｚ１のデータ「Ｄ」とを含む各レコードを含む。式（４）のｆ_０１は、この層別データセットＤｓ３の目的変数Ｙに相当する。式（５）は、図７の（ｂ）および図８に示す層別データセットＤｓ２から生成されたモデルであって、その層別データセットＤｓ２は、層別変数Ｚ０のデータ「Ｂ」と、層別変数Ｚ１のデータ「Ｃ」とを含む各レコードを含む。式（５）のｆ_１０は、この層別データセットＤｓ２の目的変数Ｙに相当する。式（６）は、図７の（ａ）および図８に示す層別データセットＤｓ１から生成されたモデルであって、その層別データセットＤｓ１は、層別変数Ｚ０のデータ「Ａ」と、層別変数Ｚ１のデータ「Ｃ」とを含む各レコードを含む。式（６）のｆ_１１は、この層別データセットＤｓ１の目的変数Ｙに相当する。 On the other hand, in this embodiment, since stratified classification is performed as described above, the four models shown in the following formulas (3) to (6) are each generated as a multiple regression equation. In formulas (3) to (6), x₀ and x₁ correspond to the explanatory variables X0 and X1, respectively, and each of f₀₀, f₀₁, f₁₀, and f₁₁ corresponds to the objective variable Y. Specifically, formula (3) is a model generated from the stratified data set Ds4 shown in FIG. 7(d) and FIG. 8, and the stratified data set Ds4 consists of the records that include the data "B" of the stratification variable Z0 and the data "D" of the stratification variable Z1. f₀₀ in formula (3) corresponds to the objective variable Y of this stratified data set Ds4. Formula (4) is a model generated from the stratified data set Ds3 shown in FIG. 7(c) and FIG. 8, and the stratified data set Ds3 consists of the records that include the data "A" of the stratification variable Z0 and the data "D" of the stratification variable Z1. f₀₁ in formula (4) corresponds to the objective variable Y of this stratified data set Ds3. Formula (5) is a model generated from the stratified data set Ds2 shown in FIG. 7(b) and FIG. 8, and the stratified data set Ds2 consists of the records that include the data "B" of the stratification variable Z0 and the data "C" of the stratification variable Z1. f₁₀ in formula (5) corresponds to the objective variable Y of this stratified data set Ds2. Formula (6) is a model generated from the stratified data set Ds1 shown in FIG. 7(a) and FIG. 8, and the stratified data set Ds1 consists of the records that include the data "A" of the stratification variable Z0 and the data "C" of the stratification variable Z1. f₁₁ in formula (6) corresponds to the objective variable Y of this stratified data set Ds1.

なお、本実施の形態では、２つの層別変数Ｚ０および層別変数Ｚ１が特定され、層別変数Ｚ０の各データが２つのグループに分類され、層別変数Ｚ１の各データが２つのグループに分類される。したがって、グループの組み合わせ数が４であって、４つのモデルが生成される。ここで、３つの層別変数が特定され、それらの層別変数の各データが２つのグループに分類される場合には、グループの組み合わせ数は８であって、８つのモデルが生成される。また、２つの層別変数が特定され、それらの層別変数の各データが３つのグループに分類される場合には、グループの組み合わせ数は９であって、９つのモデルが生成される。 In this embodiment, two stratification variables Z0 and Z1 are specified, and each data of the stratification variable Z0 is classified into two groups, and each data of the stratification variable Z1 is classified into two groups. Therefore, the number of group combinations is four, and four models are generated. Here, if three stratification variables are specified and each data of these stratification variables is classified into two groups, the number of group combinations is eight, and eight models are generated. Also, if two stratification variables are specified and each data of these stratification variables is classified into three groups, the number of group combinations is nine, and nine models are generated.
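
The number of models follows from the Cartesian product of the groups of each stratification variable, as the following sketch shows. The function name `strata_keys` and the group labels other than "A" to "D" are hypothetical.

```python
from itertools import product


def strata_keys(groups_per_variable):
    """Enumerate every combination of groups, one group per stratification variable."""
    return list(product(*groups_per_variable))


# Z0 in {A, B} and Z1 in {C, D}: 2 x 2 = 4 strata, hence 4 models.
two_by_two = strata_keys([["A", "B"], ["C", "D"]])
# Three variables with two groups each: 2 x 2 x 2 = 8 strata.
three_binary = strata_keys([["A", "B"], ["C", "D"], ["E", "F"]])
# Two variables with three groups each: 3 x 3 = 9 strata.
two_ternary = strata_keys([["A", "B", "G"], ["C", "D", "H"]])
```

Because the count grows multiplicatively, each additional stratification variable or group leaves fewer records per stratum.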

ここで、Ｒ^２＊は自由度調整済み決定係数である。この自由度調整済み決定係数は、モデルの確からしさを示す指数である。本実施の形態では、上述の式（２）～式（６）に示すとおり、自由度調整済み決定係数を、０．２７３から、０．５０３～０．９６９の範囲まで増加させることができ、モデルの精度向上を図ることができる。そして、このような各モデルと各自由度調整済み決定係数とが、結果出力部１２９によって出力される。 Here, R^2* is the degrees-of-freedom-adjusted coefficient of determination. This adjusted coefficient of determination is an index indicating the likelihood of the model. In this embodiment, as shown in formulas (2) to (6) above, the adjusted coefficient of determination can be increased from 0.273 to a range of 0.503 to 0.969, thereby improving the accuracy of the model. Each of these models and each adjusted coefficient of determination are then output by the result output unit 129.

このように、本実施の形態における生成部１２８は、生成された複数のモデルのそれぞれについて、そのモデルの確からしさを示す指数を算出する。そして、結果出力部１２９は、複数のモデルのそれぞれに対して算出されたその指数を出力する。したがって、ユーザは、生成されたモデルを使用するか否かを、その指数にしたがって容易に判断することができる。 In this manner, the generation unit 128 in this embodiment calculates an index indicating the likelihood of each of the multiple models generated. Then, the result output unit 129 outputs the calculated index for each of the multiple models. Therefore, the user can easily determine whether or not to use the generated model based on the index.

［効果など］ [Effects, etc.]
以上のように、本実施の形態では、目的変数および説明変数以外の変数である層別変数に応じた層別分類が行われ、複数の層のそれぞれに対してモデルが生成される。また、層別変数は、モデルに含まれる変数として採用されていないが、そのモデルの生成には用いられる非活用変数である。したがって、その非活用変数によって、データセットＤｓに対する層別分類を最適に行うことができる。その結果、非活用変数に応じて説明変数と目的変数との間の相関関係が変化するような場合であっても、その非活用変数のデータに応じた高い精度のモデルを生成することができる。つまり、モデルの精度向上を容易に図ることができる。 As described above, in this embodiment, stratification is performed according to the stratification variables, which are variables other than the objective variable and the explanatory variables, and a model is generated for each of the multiple strata. In addition, the stratification variables are non-utilized variables that are not adopted as variables included in the model but are used to generate the model. Therefore, the non-utilized variables can optimally perform stratification for the data set Ds. As a result, even if the correlation between the explanatory variable and the objective variable changes depending on the non-utilized variables, a highly accurate model can be generated according to the data of the non-utilized variables. In other words, the accuracy of the model can be easily improved.

また、本実施の形態では、データセットＤｓのＭ個の非活用変数の中から、有効な変数が層別変数として自動的に特定される。したがって、例えば工場の有識者などのユーザが活用変数（すなわち目的変数および説明変数）を選択した意図を活かすことができ、ユーザの理解し易いモデルの生成と、そのモデルの精度向上とを両立することができる。 In addition, in this embodiment, from among the M non-utilized variables in the data set Ds, effective variables are automatically identified as stratification variables. Therefore, it is possible to utilize the intention of a user, such as a factory expert, when selecting the utilized variables (i.e., the objective variable and explanatory variables), and it is possible to generate a model that is easy for the user to understand and to improve the accuracy of the model.

また、本実施の形態では、改善度特定部１４０が、データセットＤｓに含まれる３以上の変数のうちの、特定された目的変数および説明変数以外の変数である１以上の層別変数候補のそれぞれについて、当該層別変数候補を用いることによってモデルの確からしさが増す度合いである改善度を特定する。さらに、第２変数特定部１２６が、１以上の層別変数候補から、それらの層別変数候補の改善度に基づいてＮ個の層別変数候補のそれぞれを層別変数として特定する。そして、層別部１２７が、そのＮ個の層別変数のそれぞれについて、その層別変数の２つ以上のデータ間の共通性または類似性に基づいて、データセットＤｓに含まれるその層別変数の２つ以上のデータを複数のグループに分類する。さらに、層別部１２７が、Ｎ個の層別変数のそれぞれのグループの組み合わせに応じて複数の層を決定し、データセットＤｓに含まれる２つ以上のレコードを、決定された複数の層に分類する。グループは、層別変数が質的変数の場合には、上述のカテゴリに相当し、層別変数が量的変数の場合には、上述のクラスタに相当する。 In addition, in this embodiment, the improvement degree specification unit 140 specifies an improvement degree, which is the degree to which the reliability of the model increases by using one or more stratification variable candidates, which are variables other than the specified objective variable and explanatory variables, among the three or more variables included in the data set Ds. Furthermore, the second variable specification unit 126 specifies each of N stratification variable candidates as a stratification variable from the one or more stratification variable candidates based on the improvement degree of the stratification variable candidates. Then, for each of the N stratification variables, the stratification unit 127 classifies two or more data of the stratification variable included in the data set Ds into multiple groups based on the commonality or similarity between two or more data of the stratification variable. Furthermore, the stratification unit 127 determines multiple strata according to the combination of the groups of the N stratification variables, and classifies two or more records included in the data set Ds into the determined multiple strata. When the stratification variables are qualitative variables, the groups correspond to the above-mentioned categories, and when the stratification variables are quantitative variables, they correspond to the above-mentioned clusters.

これにより、層別変数が２つ以上であっても、データセットＤｓに対して最適な層別分類を行うことができ、複数の層のそれぞれに対して、それらの層別変数、すなわちＮ個の非活用変数のそれぞれのデータに応じた高い精度のモデルを生成することができる。また、例えば、大きい改善度を有する層別変数候補が層別変数として特定され、その大きい改善度の層別変数を用いた層別分類が層別部１２７によって行われるため、より高い精度のモデルを生成することができる。 As a result, even if there are two or more stratification variables, optimal stratification classification can be performed for the data set Ds, and a highly accurate model can be generated for each of the multiple strata according to the data of the stratification variables, i.e., the N non-utilized variables. In addition, for example, a stratification variable candidate with a large degree of improvement is identified as a stratification variable, and stratification classification using the stratification variable with the large degree of improvement is performed by the stratification unit 127, so that a model with higher accuracy can be generated.

また、本実施の形態では、質的変数の層別変数候補に対しても、量的変数の層別変数候補に対しても、同一の算出手法によって改善度が算出される。したがって、質的変数の層別変数候補および量的変数の層別変数候補の何れにも、その算出される改善度を、モデルの確からしさが増す共通の度合いとして用いることができる。つまり、改善度は、変数型に依存することのない共通の指標とも言える。したがって、質的変数の層別変数候補および量的変数の層別変数候補を分け隔てすることなく平等に、それらの層別変数候補から、大きい改善度を有する層別変数候補を層別変数として特定することができる。 In addition, in this embodiment, the improvement degree is calculated by the same method for stratification variable candidates that are qualitative variables and for those that are quantitative variables. Therefore, for both kinds of candidates, the calculated improvement degree can be used as a common measure of how much the likelihood of the model increases. In other words, the improvement degree can be regarded as a common index that does not depend on the variable type. Consequently, a stratification variable candidate with a large improvement degree can be identified as a stratification variable from among the qualitative and quantitative candidates on an equal footing, without distinguishing between them.

また、本実施の形態では、層別条件設定部１２２が、ユーザによる入力操作に応じて、層別変数の総数を設定し、第２変数特定部１２６が、その設定された総数であるＮ個だけ層別変数を特定する。 In addition, in this embodiment, the stratification condition setting unit 122 sets the total number of stratification variables in response to an input operation by the user, and the second variable identification unit 126 identifies only N stratification variables, which is the set total number.

This allows the total number of identified stratification variables to be set as the user intends, so that the number or accuracy of the generated models can be adjusted.
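Selecting the N stratification variables from the candidates can be sketched as a simple ranking by degree of improvement. The function name and the numeric values below are hypothetical and serve only to illustrate the selection step.

```python
def select_stratification_variables(improvement, n):
    """Return the n candidate names with the largest degree of improvement."""
    ranked = sorted(improvement.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]

# Hypothetical degrees of improvement for four candidates.
improvement = {"D1": 0.31, "D2": 0.05, "Z0": 0.22, "Z1": 0.12}
selected = select_stratification_variables(improvement, 2)  # N = 2 set by the user
print(selected)
```

A larger N produces more strata and therefore more models, each fitted to fewer rows, which is the trade-off the user adjusts through this setting.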

In addition, in this embodiment, the M non-utilized variables include one or more qualitative variables, each representing data that includes characters, and one or more quantitative variables, each representing data that consists of numbers.

This makes it possible to identify N stratification variables that include both qualitative and quantitative variables rather than only one of the two, which increases the freedom in the variable types of the identified stratification variables.

(Variations, etc.)
The model generation device according to one aspect of the present disclosure has been described above based on the embodiment, but the present disclosure is not limited to that embodiment. Various modifications to the embodiment that a person skilled in the art may conceive may also be included in the present disclosure, as long as they do not depart from its spirit.

For example, in this embodiment, a random forest is used as an example of supervised machine learning in the candidate extraction process for qualitative variables, but the supervised machine learning is not limited to a random forest, and other supervised machine learning may be used. For example, a gradient boosting decision tree (GBDT) may be used instead of a random forest. When a gradient boosting decision tree is used, machine learning is performed so that the error or loss coefficient becomes small. The qualitative candidate extraction unit 125a then calculates an influence degree that takes a larger value as the error or loss coefficient becomes smaller. A random forest and a gradient boosting decision tree may also be combined. For example, the influence degree of a first non-utilized variable, which is a qualitative variable, may be calculated using a random forest, and the influence degree of a second non-utilized variable, which is a qualitative variable, may be calculated using a gradient boosting decision tree. In that case, the influence degrees calculated by the two different machine learning methods may be normalized so that they can be compared.
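One conceivable way to make influence degrees from two different learners comparable is sketched below. It assumes that each learner yields non-negative raw scores for all of its input variables and that the scores of each learner are rescaled to sum to 1. The disclosure does not specify the normalization, so this rescaling, the function name, and the raw scores are assumptions for illustration only.

```python
def normalize(scores):
    """Scale non-negative raw scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: value / total for name, value in scores.items()}

# Hypothetical raw scores on different scales from the two learners.
random_forest_raw = {"D1": 0.30, "D2": 0.10, "X0": 0.60}
gbdt_raw = {"D1": 120.0, "D2": 360.0, "X0": 720.0}

rf = normalize(random_forest_raw)
gbdt = normalize(gbdt_raw)

# First variable scored by the random forest, second by the GBDT.
influence = {"D1": rf["D1"], "D2": gbdt["D2"]}
print(influence)
```

After rescaling, both values express a share of the total influence within their own learner, so a single threshold or ranking can be applied to them.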

In addition, in this embodiment, a Gaussian mixture model is used as an example of unsupervised machine learning in the candidate extraction process for quantitative variables, but the unsupervised machine learning is not limited to a Gaussian mixture model, and other unsupervised machine learning may be used. For example, the k-means method may be used instead of a Gaussian mixture model. In this case, the data represented by the non-utilized variable being processed is clustered by the k-means method. The Gaussian mixture model and the k-means method may also be combined. For example, a first non-utilized variable, which is a quantitative variable, may be clustered using a Gaussian mixture model, and a second non-utilized variable, which is a quantitative variable, may be clustered using the k-means method.

In a Gaussian mixture model, each piece of data has a probability of belonging to each group and is assigned to the group with the highest probability. When the k-means method is used instead of a Gaussian mixture model, each piece of data has a distance to the centroid of each group and is assigned to the group whose centroid is closest. Specifically, therefore, data of a quantitative variable is regarded as similar when the probability corresponding to that data is equal to or greater than a certain value, or when the distance from the centroid of the group to the position corresponding to that data is equal to or less than a certain value.
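The distance-based notion of similarity for the k-means case can be sketched as follows for a single quantitative variable. This is a minimal one-dimensional k-means with a deterministic initialization chosen for reproducibility. The function names, the sample values, and the threshold of 0.5 are assumptions for this example.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster scalar values into k groups; returns (centroids, labels)."""
    ordered = sorted(values)
    # Deterministic start: pick k values spread evenly over the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its members (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    labels = [min(range(k), key=lambda i: abs(v - centroids[i])) for v in values]
    return centroids, labels

def is_similar(value, centroid, max_distance):
    """Data is treated as similar when it lies within max_distance of the centroid."""
    return abs(value - centroid) <= max_distance

values = [1.0, 1.2, 0.8, 9.0, 8.7, 9.3]  # hypothetical data of one quantitative variable
centroids, labels = kmeans_1d(values, k=2)
print(centroids, labels)
print(is_similar(1.2, centroids[0], 0.5), is_similar(3.0, centroids[0], 0.5))
```

For a Gaussian mixture model, the analogous check would compare the membership probability of the data against a lower bound instead of comparing a distance against an upper bound.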

In addition, in this embodiment, a multiple regression equation is generated as the model, but a simple regression equation may be generated as the model, or a model other than a regression equation may be generated. For example, a neural network may be generated as the model.

In addition, the data set Ds in this embodiment contains manufacturing-related variables and their data, but it is not limited to manufacturing and may contain variables and data from other fields.

The data included in the data set Ds in this embodiment may also be divided into operation data and quality data. For example, the operation data is data related to the manufacturing process and may be the data of each of the variables X0, X1, Z0, Z1, D1, D2, D3, and D4 shown in FIG. 5. The quality data is, for example, data related to the quality of the product and may be the data of the variable Y shown in FIG. 5.

In addition, in this embodiment, the total number N of stratification variables is set, but, for example, the number of qualitative variables and the number of quantitative variables may be set separately. In this case, the second variable identification unit 126 identifies, from the one or more stratification variable candidates that are qualitative variables, as many candidates as the set number of qualitative variables as stratification variables, in descending order of the degree of improvement. Likewise, the second variable identification unit 126 identifies, from the one or more stratification variable candidates that are quantitative variables, as many candidates as the set number of quantitative variables as stratification variables, in descending order of the degree of improvement.
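This variation, in which the counts are set per variable type, can be sketched as below. The function name, the type labels, and the numeric values are hypothetical.

```python
def select_per_type(improvement, var_type, count_by_type):
    """Pick the top candidates separately within each variable type."""
    selected = []
    for vtype, count in count_by_type.items():
        candidates = [name for name in improvement if var_type[name] == vtype]
        candidates.sort(key=lambda name: improvement[name], reverse=True)
        selected.extend(candidates[:count])
    return selected

# Hypothetical candidates: D* are qualitative, Z* are quantitative.
improvement = {"D1": 0.31, "D2": 0.28, "Z0": 0.22, "Z1": 0.12}
var_type = {"D1": "qualitative", "D2": "qualitative",
            "Z0": "quantitative", "Z1": "quantitative"}
selected = select_per_type(improvement, var_type,
                           {"qualitative": 1, "quantitative": 1})
print(selected)
```

In this example a single overall ranking with N = 2 would choose D1 and D2, whereas the per-type counts guarantee that one qualitative and one quantitative variable are chosen.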

In addition, in this embodiment, stratification variable candidates are extracted from the one or more qualitative variables included in the M non-utilized variables based on their influence degrees, and stratification variable candidates are extracted from the one or more quantitative variables included in the M non-utilized variables based on the state of their clusters. However, regardless of the influence degree and the cluster state, every qualitative variable included in the M non-utilized variables may be extracted as a stratification variable candidate, and every quantitative variable included in the M non-utilized variables may be extracted as a stratification variable candidate. In this case, each of the M non-utilized variables is extracted as a stratification variable candidate, and a degree of improvement is calculated for each of the M non-utilized variables.

In addition, in this embodiment, the data set Ds includes data of variables belonging to each of two variable types, but the number of variable types is not limited to two and may be one, or three or more.

In addition, in this embodiment, the data set Ds is transmitted from the manufacturing management device 500 via a network and stored in the database 106, but it may instead be output to the database 106 from another device or from a recording medium and stored there. The data set Ds may also be stored in the database 106 without passing through a network.

The following cases are also included in the present disclosure.

(1) At least one of the above devices is, specifically, a computer system composed of a microprocessor, a ROM (Read Only Memory), a RAM (Random Access Memory), a hard disk unit, a display unit, a keyboard, a mouse, and the like. A computer program is stored in the RAM or the hard disk unit. The at least one device achieves its functions by the microprocessor operating in accordance with the computer program. Here, the computer program is composed of a combination of multiple instruction codes, each indicating a command to the computer, in order to achieve a predetermined function.

(2) Some or all of the components constituting at least one of the above devices may be composed of a single system LSI (Large Scale Integration). A system LSI is an ultra-multifunctional LSI manufactured by integrating multiple components on a single chip and is, specifically, a computer system that includes a microprocessor, a ROM, a RAM, and the like. A computer program is stored in the RAM. The system LSI achieves its functions by the microprocessor operating in accordance with the computer program.

(3) Some or all of the components constituting at least one of the above devices may be composed of an IC card or a standalone module that is attachable to and detachable from the device. The IC card or module is a computer system composed of a microprocessor, a ROM, a RAM, and the like. The IC card or module may include the ultra-multifunctional LSI described above. The IC card or module achieves its functions by the microprocessor operating in accordance with a computer program. The IC card or module may be tamper-resistant.

(4) The present disclosure may be the methods described above. It may also be a computer program that implements these methods on a computer, or a digital signal comprising the computer program.

The present disclosure may also be the computer program or the digital signal recorded on a computer-readable recording medium such as a flexible disk, a hard disk, a CD-ROM, a DVD, a DVD-ROM, a DVD-RAM, a BD (Blu-ray (registered trademark) Disc), or a semiconductor memory. It may also be the digital signal recorded on such a recording medium.

The present disclosure may also transmit the computer program or the digital signal via a telecommunication line, a wireless or wired communication line, a network such as the Internet, data broadcasting, or the like.

The program or the digital signal may also be executed by another independent computer system, either by recording it on a recording medium and transferring the medium, or by transferring it via a network or the like.

The present disclosure has the effect of making it easy to improve the accuracy of a model and can be applied, for example, to a device or system that generates a model for estimating the quality of a product manufactured in a manufacturing process from the data of variables used in that manufacturing process.

1 Model generation system
100 Model generation device
101 Input unit
102 Arithmetic circuit
103 Memory
104 Output unit
105 Storage unit
105a Program
105b Temporary data
106 Database
121 First variable identification unit
122 Stratification condition setting unit
123 Non-utilized variable extraction unit
124 Variable type determination unit
125 Candidate extraction unit
125a Qualitative candidate extraction unit
125b Quantitative candidate extraction unit
126 Second variable identification unit
127 Stratification unit
128 Generation unit
129 Result output unit
130 Reception unit
140 Improvement degree identification unit
500 Manufacturing management device
Ds Data set
Ds1 to Ds4 Stratified data sets

Claims

A model generation device that generates a model indicating a relationship between one or more objective variables and one or more explanatory variables, the model generation device comprising:
a receiving means for receiving a data set including three or more variables;
a first variable identification means for identifying one or more objective variables and one or more explanatory variables from the data set;
an improvement degree specifying means for specifying an improvement degree, which is the degree to which the certainty of the model increases when one or more stratification variable candidates are used, the stratification variable candidates being variables, among the three or more variables included in the data set, other than the identified objective variables and the identified explanatory variables;
a second variable specifying means for specifying, based on the improvement degree of each of the one or more stratification variable candidates, one or more of the stratification variable candidates as stratification variables;
a stratification means for classifying the data set into a plurality of strata based on a tendency of a relationship between the stratification variables and the objective variables; and
a generation means for generating the model for each of the plurality of strata.

The model generation device according to claim 1, wherein the stratification means classifies the data of the stratification variables into a plurality of groups based on identity or similarity of the data, and classifies the data set for each combination of the plurality of groups.

The model generation device according to claim 1 or 2, wherein the data set includes qualitative variables representing data that includes characters and quantitative variables representing data that consists of numbers.

The model generation device according to claim 3, further comprising a qualitative candidate extraction means for extracting, when a variable other than the identified objective variable and the identified explanatory variable among the three or more variables included in the data set is a qualitative variable, the qualitative variable as a stratification variable candidate based on an influence degree of the qualitative variable on the objective variable.

The model generation device according to claim 4, further comprising a quantitative candidate extraction means for extracting, when a variable other than the identified objective variable and the identified explanatory variable among the three or more variables included in the data set is a quantitative variable, the quantitative variable as a stratification variable candidate based on a state of clusters obtained by clustering the quantitative variable by machine learning.

The model generation device according to claim 4, wherein the qualitative candidate extraction means calculates the influence degree of the qualitative variable using a random forest or a gradient boosting decision tree.

The model generation device according to claim 5, wherein the quantitative candidate extraction means performs the clustering on two or more pieces of data of the quantitative variable included in the data set using a Gaussian mixture model or a k-means method.

The model generation device according to any one of claims 1 to 7, wherein the improvement degree specifying means specifies the improvement degree by calculating a difference between an index indicating the certainty of a model obtained from the data set classified using the stratification variable candidate and an index indicating the certainty of a model obtained from the data set that is not classified.

The model generation device according to any one of claims 1 to 8, wherein the generation means further calculates an index indicating the certainty of each of the generated models, and
the model generation device further comprises a result output means for outputting the index calculated for each of the plurality of models.

The model generation device according to any one of claims 1 to 9, wherein the generation means generates, as a multiple regression equation, the model indicating a relationship between the data of each of two or more explanatory variables and the data of the objective variable.

A model generation method in which a computer generates a model indicating a relationship between one or more objective variables and one or more explanatory variables, the method comprising:
receiving a data set including three or more variables;
identifying one or more objective variables and one or more explanatory variables from the data set;
specifying an improvement degree, which is the degree to which the certainty of the model increases when one or more stratification variable candidates are used, the stratification variable candidates being variables, among the three or more variables included in the data set, other than the identified objective variables and the identified explanatory variables;
specifying, based on the improvement degree of each of the one or more stratification variable candidates, one or more of the stratification variable candidates as stratification variables;
classifying the data set into a plurality of strata based on a tendency of a relationship between the stratification variables and the objective variables; and
generating the model for each of the plurality of strata.

A program for generating a model indicating a relationship between one or more objective variables and one or more explanatory variables, the program causing a computer to execute:
receiving a data set including three or more variables;
identifying one or more objective variables and one or more explanatory variables from the data set;
specifying an improvement degree, which is the degree to which the certainty of the model increases when one or more stratification variable candidates are used, the stratification variable candidates being variables, among the three or more variables included in the data set, other than the identified objective variables and the identified explanatory variables;
specifying, based on the improvement degree of each of the one or more stratification variable candidates, one or more of the stratification variable candidates as stratification variables;
classifying the data set into a plurality of strata based on a tendency of a relationship between the stratification variables and the objective variables; and
generating the model for each of the plurality of strata.