JP7849027B2 - Method and program for determining the reliability of an estimation model, and measurement system.

Info

Publication number: JP7849027B2
Application number: JP2022180849A
Authority: JP
Inventors: 瑞樹蔦
Original assignee: National Agriculture and Food Research Organization
Current assignee: National Agriculture and Food Research Organization
Priority date: 2022-11-11
Filing date: 2022-11-11
Publication date: 2026-04-21
Anticipated expiration: 2042-11-11
Also published as: JP2024070391A

Description

The present invention relates to a method and a program for determining the reliability of an estimation model that estimates a target variable from explanatory variables, and to a measurement system.

Multivariate analysis, machine learning, and deep learning, fields in which research and development has advanced rapidly in recent years, generate estimation models that estimate a target variable from explanatory variables. An estimation model is generated from a given existing data set (training data) and can also be used on new data sets.

What matters in multivariate analysis, machine learning, and related fields is constructing a highly reliable estimation model that can be applied broadly to new data sets as well. The degree to which an estimation model can be applied to new data sets is also called its "robustness". The wider the range of new data sets to which the model can be applied, the more robust it is.

Many algorithms have been proposed for providing highly reliable estimation models, and many methods have been proposed for determining the reliability of an estimation model. Examples of such determination methods include cross-validation and test set validation.

To determine the reliability of an estimation model, existing data are split into data for constructing the model (training data) and data for testing its reliability (test data). The model constructed from the former is applied to the latter, and the error between the measured and estimated values of the target variable is computed. The reliability of the model is judged from this error: the smaller the error, the more reliable the model is judged to be and the more likely it is to be applicable to new data sets. Techniques have also been devised that combine these determination methods with the selection of the training data used for model construction in order to raise the reliability of the estimation model.
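The conventional hold-out check described above can be illustrated with a minimal Python sketch. It assumes a single explanatory variable, an ordinary least-squares line as the estimation model, and root-mean-square error as the error measure; the function names and the toy data are hypothetical and are not taken from the specification.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b for one explanatory variable."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def rmse(measured, estimated):
    """Root-mean-square error between measured and estimated target values."""
    total = sum((m - e) ** 2 for m, e in zip(measured, estimated))
    return math.sqrt(total / len(measured))

def holdout_rmse(xs, ys, n_train):
    """Fit on the first n_train samples and return the RMSE on the rest."""
    a, b = fit_line(xs[:n_train], ys[:n_train])
    estimated = [a * x + b for x in xs[n_train:]]
    return rmse(ys[n_train:], estimated)

# Toy data (hypothetical): y = 2x + 1 exactly, so the test error vanishes.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
error = holdout_rmse(xs, ys, n_train=4)
```

A small hold-out error of this kind only shows that the model fits data drawn from the same existing data set; it does not by itself show that the model will hold on data collected later.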

However, it is not easy to determine whether an estimation model generated from an existing data set is applicable to a new data set. Verification of a model's applicability to new data sets, and the selection of optimal variables, are usually carried out by simulation on existing data, using cross-validation, test set validation, and the like. Whether the generated model is actually applicable to a new data set cannot be known without applying the model to that data set and checking the error between the measured and estimated values of the target variable.

Patent Document 1: Japanese Patent Application Laid-Open No. 2017-51162

Non-Patent Document 1: Campomanes et al. (2014) Origin of the spectral shifts among the early intermediates of the rhodopsin photocycle. Journal of the American Chemical Society. 136:3842-3851.
Non-Patent Document 2: Trivittayasil et al. (2018) Classification of 1-methylcyclopropene treated apples by fluorescence fingerprint using partial least squares discriminant analysis with stepwise selectivity ratio variable selection method. Chemometrics and Intelligent Laboratory Systems. 175:30-36.
Non-Patent Document 3: Kurihara et al. (2021) Examination of acceleration of computational kernels for causal search processing of large numbers of variables using LiNGAM. Research Report High-Performance Computing. 26:1-8.
Non-Patent Document 4: Oyama et al. (2022) Causal exploration of specific health checkup data in Osaka Prefecture. Information Processing. 63(2).
Non-Patent Document 5: Esbensen and Geladi (2010) Principles of Proper Validation: use and abuse of re-sampling for validation. Chemometrics and Intelligent Laboratory Systems. 168-187.
Non-Patent Document 6: Andersen and Bro (2010) Variable selection in regression - a tutorial. Chemometrics and Intelligent Laboratory Systems. 728-737.
Non-Patent Document 7: A. Hyvarinen and S. M. Smith (2013) Pairwise Likelihood Ratios for Estimation of Non-Gaussian Structural Equation Models. J. of Machine Learning Research. 14:111-152.

The present invention provides a method and a program for determining the reliability of an estimation model, and a measurement system, that make it possible to determine the reliability of an estimation model without inputting a new data set into the model and to improve the reliability of the model.

To solve the above problem, the method according to the present invention for determining the reliability of an estimation model that estimates a target variable from explanatory variables comprises the steps of: applying statistical causal inference to the explanatory variables and the target variable to estimate the direction of causality between them; determining whether the target variable is the direct and sole cause of the explanatory variables; and determining, according to the result of that determination, whether the estimation model is also applicable to new data other than the data used to construct it.

A computer program according to the present invention for determining the reliability of an estimation model may be configured to cause a computer to execute the above method. A measurement system according to the present invention comprises a measuring device that measures a measurement target and acquires measurement data as explanatory variables, and a computer that applies the explanatory variables to an estimation model to output a target variable and computes various characteristics of the measurement target. The computer may be provided with the above computer program, and the computer program may be configured to execute the above method.

In the step of estimating the direction of causality in the determination method according to the present invention, when the target variable is judged unlikely to be a cause of an explanatory variable, a step of deleting that explanatory variable from the estimation model can additionally be executed. The deleting step may delete all explanatory variables so judged at once. Alternatively, the deleting step may delete those explanatory variables one at a time, compute the estimation error of the estimation model after each deletion, and stop when the estimation error satisfies a predetermined condition.
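The sequential variant of the deleting step might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the specification: the "predetermined condition" is assumed to be an upper bound `max_error` on the estimation error, and `estimate_error` is a hypothetical helper that rebuilds the model on the remaining variables and returns its error.

```python
def prune_sequentially(variables, flagged, estimate_error, max_error):
    """Delete flagged explanatory variables one at a time.

    variables      : names of the explanatory variables in the model
    flagged        : names judged unlikely to be caused by the target
                     variable, in the order in which they are deleted
    estimate_error : callable(list_of_names) -> estimation error of the
                     model rebuilt on those variables (hypothetical helper)
    max_error      : assumed form of the predetermined stopping condition
    """
    kept = list(variables)
    for name in flagged:
        if name not in kept or len(kept) == 1:
            continue  # nothing to delete, or the model would become empty
        kept.remove(name)
        # Compute the error after this deletion and stop once it is acceptable.
        if estimate_error(kept) <= max_error:
            break
    return kept

# Toy error function (hypothetical): each flagged variable left in the
# model adds 0.4 to a base error of 0.1.
NOISY = {"x3", "x4"}

def toy_error(kept):
    return 0.1 + 0.4 * sum(1 for name in kept if name in NOISY)

kept = prune_sequentially(["x1", "x2", "x3", "x4"], ["x3", "x4"], toy_error, max_error=0.6)
```

In this toy run the condition is met after the first deletion, so the second flagged variable stays in the model. Deleting all flagged variables at once corresponds to skipping the error check entirely.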

The determination method may further comprise a step of selecting and combining a plurality of the explanatory variables as appropriate to generate a plurality of composite variables, and a step of selecting one of the composite variables for use as the explanatory variable. The step of generating the composite variables may include principal component analysis, factor analysis, t-distributed stochastic neighbor embedding (t-SNE), cluster analysis, non-negative matrix factorization (NMF), multivariate curve resolution (MCR), parallel factor analysis (PARAFAC), partial least squares regression, or ensemble learning. The step of estimating the direction of causality may use LiNGAM (Linear Non-Gaussian Acyclic Model).

According to the present invention, it is possible to provide a method for determining the reliability of an estimation model that can increase the model's applicability to new data sets and improve its reliability.

Figure 1 is a schematic diagram illustrating an example of the configuration of the measurement system 1 according to the first embodiment.
Figure 2 is a block diagram illustrating an example of the configuration of the computer 10 shown in Figure 1.
Figure 3 is a flowchart illustrating the procedure for determining the reliability of the estimation model in the measurement system 1 of the first embodiment.
Figure 4 is a graph illustrating an example of the determination in step S16 of the flowchart of Figure 3.
Figure 5 is a graph showing the relationship between measured values and values estimated by the estimation model for the target variable y1, for which a direct causal relationship was found, and the target variable y2, for which no direct causal relationship was found.
Figure 6 is a graph illustrating another example of the determination in step S16 of the flowchart of Figure 3.
Figure 7 is a graph showing the relationship between measured values and values estimated by the estimation model for the target variable y1, for which a direct causal relationship was found, and the target variable y3, for which no direct causal relationship was found.
Figure 8 is a diagram illustrating the configuration of the computer 10 of the second embodiment.
Figure 9 is a flowchart illustrating the operation of the second embodiment.
Figure 10 is a graph showing the relationship between measured values and values estimated by the estimation model, before and after deletion of explanatory variables, for the target variable y3, for which no direct causal relationship was found.
Figure 11 is a diagram illustrating the configuration of the computer 10 of the third embodiment.
Figure 12 is a flowchart illustrating the operation of the third embodiment.
Figure 13 is a graph showing an example of measurement results according to the working example.
Figure 14 is a graph showing an example of measurement results according to the working example.
Figure 15 is a schematic diagram illustrating the causal relationships among explanatory variables, target variables, and unobserved variables.
Figure 16 is an example of a heatmap obtained by computing the likelihood between a composite variable of the explanatory variables and the target variable with the Pairwise LiNGAM algorithm (Non-Patent Document 7).
Figure 17 is a schematic diagram illustrating a method for judging the direction of causality between the target variable and a composite variable.
Figure 18 is a schematic diagram illustrating a method for judging the direction of causality between the target variable and a composite variable.

The embodiments are described below with reference to the accompanying drawings. In the drawings, functionally identical elements may be denoted by the same reference numerals. The drawings show embodiments and implementation examples consistent with the principles of the present disclosure, but they are provided for understanding the disclosure and are not to be used to interpret it restrictively. The description in this specification is merely a typical example and does not limit the claims or applications of the present disclosure in any sense.

The embodiments are described in sufficient detail for those skilled in the art to carry out the present disclosure. Other implementations and forms are nevertheless possible, and the configuration and structure may be changed and various elements replaced without departing from the scope and spirit of the technical idea of the present disclosure. The following description must therefore not be interpreted as limited to these embodiments.

[First Embodiment]
An example of the configuration of the measurement system 1 according to the first embodiment is described with reference to the schematic diagram of Figure 1. The measurement system 1 measures various characteristics (for example, sugar content, moisture content, chromaticity, and saturation) of agricultural products such as fruit. It includes a computer 10 and a spectroscopic measurement device 20, and is connected to a server 40 and a database 50 via a network NW.

To measure the various characteristics of agricultural products, the computer 10 generates and uses an estimation model based on multivariate analysis, machine learning, or deep learning. The following description covers a configuration in which the estimation model is generated from spectral data obtained by the spectroscopic measurement device 20. This is only an example: instead of, or in addition to, the spectroscopic measurement device 20, data from other measuring devices, cameras, sensors, and the like may be input to the computer and used to measure the characteristics.

The computer 10 is configured to receive from the spectroscopic measurement device 20 spectral data obtained by measuring, for example, an agricultural product, as input data (explanatory variables), to apply the spectral data to the estimation model, and to output various characteristics of the agricultural product, for example characteristics y1, y2, y3, ... (target variables). The characteristics y1, y2, y3 are, for example, the sugar content, moisture content, chromaticity, or saturation of the agricultural product (for example, fruit).

To generate the estimation model, the computer 10 is supplied with spectral data (xt1, xt2, ...) measured by the spectroscopic measurement device 20 as training data. After the model has been generated, when spectral data (x1, x2, ...) are obtained from the spectroscopic measurement device 20 as a new data set (a data set not used to generate the model), they are input to the model and the characteristics y1, y2, y3, ... are output.

The computer 10 uses statistical causal inference to determine the reliability of the generated estimation model. The applicability of the model to new data sets x1, x2, ... that were not used to generate it is judged according to the degree of reliability obtained. The estimation model is also updated as appropriate according to the result of the reliability evaluation, which can increase its applicability to new data sets.

The computer 10 is connected to the server 40 and the database 50 via the network NW, and receives various data from them via the network NW when generating or updating the estimation model. The computer 10 transmits the generated estimation model, and data on the training data used to generate it, to the server 40 via the network NW. When determining the reliability of the estimation model, the computer 10 receives the data needed for the determination from the server 40 and the database 50. The computer 10 can also supply data on the result of the reliability determination to the server 40 via the network NW.

An example of the configuration of the computer 10 is described with reference to the block diagram of Figure 2. As shown in Figure 2, the computer 10 includes, as an example, a CPU (Central Processing Unit) 101, a ROM 102, a RAM 103, a flash memory 104, an input/output control unit 105, and a communication control unit 106.

The CPU 101 is a central control unit responsible for various computations, including those for generating and updating the estimation model and those for computing output data with the estimation model. In addition to the CPU 101, a GPU (Graphics Processing Unit) may be provided as a control unit for image processing and image recognition.

The ROM 102 is a storage device that stores the programs for the various computations and the data needed to execute them. The RAM 103 is a storage device that temporarily stores the computation results of those programs. The flash memory 104 is a storage medium that stores programs read from the ROM 102 as well as update programs provided by the server 40.

The input/output control unit 105 controls data input from, and data output to, the server 40 and the like. The communication control unit 106 controls the transmission and reception of data exchanged with the server 40.

By means of a computer program stored in the ROM 102 or the like, the computer 10 virtually implements an estimation model generation/update unit 112, a composite variable generation unit 113, a causal inference unit 114, and a new data set applicability determination unit 115. This computer program generates and updates the estimation model in the computer and executes the method for determining its reliability.

The estimation model generation/update unit 112 generates the estimation model from the training data and updates it according to the results of the computations in the causal inference unit 114. The model can be generated and updated using, for example, linear regression analysis. The composite variable generation unit 113 combines the explanatory variables xt1, xt2, ... of the training data supplied as input data to generate composite variables Sy1, Sy2, .... As described later, when there are a plurality of explanatory variables xt1, xt2, ..., the composite variable generation unit 113 selects and combines them as appropriate to generate a plurality of composite variables.

The causal inference unit 114 applies statistical causal inference to the explanatory variables (including composite variables) and the target variables y1, y2, y3, ... to estimate the direction of causality between them, that is, which is the cause and which changes as a result. When there are a plurality of explanatory variables, the composite variable generation unit 113 combines them into composite variables, and a composite variable selected as the explanatory variable can be used for the statistical causal inference in the causal inference unit 114.

The causal inference unit 114 performs the statistical causal inference computation on the target variables y1, y2, y3, ... and the explanatory variables used in the estimation model, and estimates whether the target variables y1, y2, y3, ... are the direct and sole cause of the explanatory variables.

When the explanatory variables are acquired by passive sensors, the explanatory variables x1, x2, x3, ... cannot be a cause of the target variable or of other, unobserved variables (z1, z2, z3, ...). Nor can individual explanatory variables be causes of one another. The causal relationships among explanatory variables, target variables, and unobserved variables are therefore limited to one of the eight patterns shown in Figure 15.

Here, "being the direct and sole cause" means that a target variable changes the explanatory variable without passing through any other variable, and that no other variable changes the explanatory variable. In Figure 15, patterns 1 to 3 correspond to this case. When the target variables y1, y2, y3, ... of an estimation model are judged to be the direct and sole cause of the explanatory variables, the model can be judged likely to be applicable to new data sets other than the training data, as long as the relationship between the target variables and the explanatory variables does not change.

Conversely, when the target variables y1, y2, y3, ... of an estimation model are judged not to be the direct and sole cause of the explanatory variables, or when an unobserved variable is judged to exist that is a cause of the target variable or of the explanatory variables, the model can be judged unlikely to be applicable to new data sets other than the training data.

Furthermore, when an explanatory variable of an estimation model is judged to be a cause of another variable, the model can be judged unlikely to be applicable to new data sets other than the training data. In such a case the model cannot be regarded as accurately reflecting the real situation, so applying it to a new data set may be judged not to yield target variables that reflect the actual values.

The results of the computation and estimation in the causal inference unit 114 are provided to the estimation model generation/update unit 112 and used to update the estimation model already generated. The causal inference unit 114 can be built on the known LiNGAM (Linear Non-Gaussian Acyclic Model) or the like.

The new data set applicability determination unit 115 determines, according to the results of the computation in the causal inference unit 114, whether the estimation model generated from the training data is also applicable to a new data set different from the training data. A new data set judged to be applicable is input to the estimation model as explanatory variables x1, x2, ..., and the target variables y1, y2, y3, ... are computed and output as the measured values of the measurement system 1.

Next, the procedure for determining the reliability of the estimation model in the measurement system 1 of the first embodiment is described with reference to the flowchart of Figure 3. Once the estimation model has been generated from the training data xt1, xt2, ..., its reliability is determined. First, it is determined whether there are a plurality of explanatory variables xt1, xt2, ... in the training data (step S11).

When there are not a plurality of explanatory variables (that is, when there is only one), the causal inference unit 114 applies statistical causal inference to that explanatory variable and the target variables y1, y2, y3, ... and estimates the direction of causality between them (step S15).

When there are a plurality of explanatory variables xt1, xt2, ..., a plurality of composite variables Sy1, Sy2, ... are generated by combining them as appropriate (step S12). The composite variables can be generated using, for example, principal component analysis, factor analysis, t-distributed stochastic neighbor embedding (t-SNE), cluster analysis, non-negative matrix factorization (NMF), multivariate curve resolution (MCR), parallel factor analysis (PARAFAC), partial least squares regression, or ensemble learning.

From the composite variables Sy1, Sy2, ..., the composite variable Syx that can contribute most to the estimation of the target variables y1, y2 is then selected (step S13). The selection can be performed using, for example, a PLS regression model. The selected composite variable Syx is taken as the explanatory variable, and statistical causal inference is applied to it and the target variables y1, y2 to estimate the direction of causality (step S14).
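The selection in step S13 can be pictured with the following minimal Python sketch. The specification names a PLS regression model for this step; the sketch substitutes a much simpler criterion, the absolute Pearson correlation with the target variable, purely for illustration. The function names, the dictionary layout, and the toy values are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def select_composite(composites, target):
    """Return the key of the composite variable with the largest |correlation|.

    Stand-in criterion only: the specification uses a PLS regression model here.
    """
    return max(composites, key=lambda k: abs(pearson(composites[k], target)))

# Toy composite variables and target values (hypothetical).
composites = {
    "S1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "S2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "S3": [1.0, -1.0, 1.0, -1.0, 1.0],
}
y1 = [2.1, 3.9, 6.2, 7.8, 10.1]
best = select_composite(composites, y1)
```

The chosen composite variable then plays the role of the explanatory variable in the causal inference of step S14.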

If, as a result of the causal inference in step S14 or S15, it is determined that the dependent variables y1, y2… are the direct and sole cause of the explanatory variable (Y in step S16), the estimation model in question is judged highly likely to be applicable to new data sets as well (step S17). On the other hand, if it is determined that the dependent variables y1, y2… are not the direct and sole cause of the explanatory variable (N in step S16), the estimation model in question is judged unlikely to be applicable to new data sets (step S18).

An example of the determination in step S16 will be explained with reference to Figure 4. For example, as shown in Figure 4(a), suppose that data are generated in which the dependent variable y1 is the direct and sole cause of a certain explanatory variable (e.g., absorbance around 700 nm), and that partial least squares (PLS) regression analysis is then performed to generate composite variables of the explanatory variables. Calculating the likelihood between the composite variables and the dependent variable with the Pairwise LiNGAM algorithm (Non-Patent Literature 7) yields a heatmap such as the one shown in Figure 16. If the value in row i and column j of the heatmap is positive, the direction of causality is determined to be from i to j; if it is negative, from j to i. In the case of Figure 16, therefore, the direction of causality is from the dependent variable y1 to composite variable 1, so y1 can be judged to be the direct and sole cause of composite variable 1, and the estimation model for the dependent variable y1 can be judged to have a certain level of reliability.
On the other hand, as shown in Figure 4(b), suppose that an unobserved variable z is the direct cause of a certain explanatory variable (e.g., absorbance around 520 nm), and that the dependent variable y2 changes because of the unobserved variable z but is not itself a cause of the explanatory variable. In that case, Pairwise LiNGAM yields the likelihood heatmap shown in Figure 17, and the dependent variable y2 is determined not to be the direct and sole cause of the composite variable. The estimation model can therefore be judged to have low reliability with respect to the dependent variable y2.
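The sign convention of the heatmap can be illustrated with a minimal sketch. It assumes a widely used entropy-based approximation of the pairwise likelihood ratio for linear non-Gaussian models; whether this matches the exact computation of Non-Patent Literature 7 is an assumption, and the function names and synthetic data are hypothetical. The sketch only estimates a pairwise direction and does not by itself establish that a cause is "direct and sole".

```python
import numpy as np

def _entropy(u):
    """Maximum-entropy approximation of differential entropy (u standardized)."""
    k1, k2, gamma = 79.047, 7.4129, 0.37457
    return ((1 + np.log(2 * np.pi)) / 2
            - k1 * (np.mean(np.log(np.cosh(u))) - gamma) ** 2
            - k2 * np.mean(u * np.exp(-u ** 2 / 2)) ** 2)

def _standardize(v):
    return (v - v.mean()) / v.std()

def pairwise_lr(a, b):
    """Approximate log-likelihood ratio; positive values favor a -> b."""
    a, b = _standardize(a), _standardize(b)
    rho = np.mean(a * b)
    res_b = b - rho * a          # residual of b when regressed on a
    res_a = a - rho * b          # residual of a when regressed on b
    return (-_entropy(a) - _entropy(res_b / res_b.std())
            + _entropy(b) + _entropy(res_a / res_a.std()))

def lr_matrix(columns):
    """Matrix M with M[i, j] > 0 read as 'i causes j', as in the heatmap."""
    n = len(columns)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i, j] = pairwise_lr(columns[i], columns[j])
    return m

# Synthetic case resembling Figure 4(a): y1 drives the composite variable.
rng = np.random.default_rng(1)
y1 = rng.uniform(-1, 1, 5000)
composite1 = y1 + rng.uniform(-1, 1, 5000)   # non-Gaussian noise (assumption)

M = lr_matrix([y1, composite1])              # index 0: y1, index 1: composite 1
y1_is_cause = M[0, 1] > 0                    # Y branch of step S16 in this sketch
```

The matrix is antisymmetric by construction, so only one triangle carries information. The approach relies on the variables being non-Gaussian, which is why the synthetic data use uniform distributions.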

Figure 5 is a graph showing the relationship between measured values and values estimated by the estimation model for the dependent variable y1, for which a direct causal relationship was found, and for the dependent variable y2, for which no direct causal relationship was found. For the dependent variable y1, a consistent relationship is observed between the measured and estimated values, and the root mean square error RMSEnew of y1 obtained by inputting a new data set into the estimation model does not differ significantly from the root mean square error RMSEtrain of y1 obtained by inputting the training data.

On the other hand, for the dependent variable y2, for which no direct causal relationship is found and which is considered unlikely to be the cause of the explanatory variable, the root mean square error RMSEnew of y2 obtained by inputting a new data set into the estimation model is far larger than the root mean square error RMSEtrain of y2 obtained by inputting the training data. This means that the estimation model for the dependent variable y2 cannot be applied to new data sets.
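The comparison of RMSEnew with RMSEtrain can be sketched as follows. The helper names and the numbers are hypothetical illustrations, not data from Figure 5, and the ratio is only one possible way to express the gap, since the text does not specify how significance is assessed.

```python
import numpy as np

def rmse(measured, estimated):
    """Root mean square error between measured and estimated values."""
    measured = np.asarray(measured, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((measured - estimated) ** 2)))

def rmse_ratio(train_pair, new_pair):
    """RMSEnew / RMSEtrain; values near 1 suggest the model transfers."""
    return rmse(*new_pair) / rmse(*train_pair)

# Hypothetical values: small error on training data, large error on new data.
train = ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9])
new = ([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])

rmse_train = rmse(*train)
rmse_new = rmse(*new)
ratio = rmse_ratio(train, new)
```

A ratio far above 1, as in this made-up case, corresponds to the situation described for the dependent variable y2.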

Another example of the determination in step S16 will be explained with reference to Figure 6. Figure 6(a) shows the case in which the dependent variable y1 is determined to be the direct and sole cause of a certain explanatory variable (in Figure 6(a), absorbance around 700 nm), which is the same as Figure 4(a).

On the other hand, in Figure 6(b), the relationship is determined to be one in which the unobserved variable z is a cause of a certain explanatory variable (absorbance around 520 nm in Figure 4(b)), while the dependent variable y3 changes because of the unobserved variable z and, in addition, y3 itself causes a certain explanatory variable to change. In such a case, Pairwise LiNGAM yields the likelihood heatmap shown in Figure 18, and the dependent variable y3 is determined not to be the direct and sole cause of the explanatory variable. The estimation model for the dependent variable y3 is therefore also judged to have low reliability and to be unlikely to be applicable to new data sets. Thus, according to this system 1, the reliability of the estimation model can be determined in accordance with the result of inferring the direct causal relationship between the dependent variables y1, y2, y3… and the explanatory variables.

Figure 7 is a graph showing the relationship between measured values and values estimated by the estimation model for the dependent variable y1, for which a direct causal relationship was found, and for the dependent variable y3, for which no direct causal relationship was found. For the dependent variable y1, a consistent relationship is observed between the measured and estimated values, and the root mean square error RMSEnew of y1 obtained by inputting a new data set into the estimation model does not differ significantly from the root mean square error RMSEtrain of y1 obtained by inputting the training data.

On the other hand, for the dependent variable y3, for which no direct causal relationship is found and for which the explanatory variable is considered to be the cause of the dependent variable, the root mean square error RMSEnew of y3 obtained by inputting a new data set into the estimation model is far larger than the root mean square error RMSEtrain of y3 obtained by inputting the training data. This means that the estimation model for the dependent variable y3 cannot be applied to new data sets.

As explained above, according to the measurement system 1, the method for determining the reliability of an estimation model, and the program of the first embodiment, the reliability of an estimation model generated from training data (that is, the likelihood that the estimation model can also be applied to new data sets different from the training data) is judged from the direction of causality obtained by statistical causal inference. The reliability can therefore be determined without inputting a new data set into the estimation model for verification.

[Second Embodiment]
Next, the measurement system 1 of the second embodiment will be described with reference to Figures 8 and 9. The overall configuration of the measurement system 1 of the second embodiment may be the same as that of the first embodiment, so redundant explanations are omitted. Figure 8 is a configuration diagram illustrating the computer 10 of the second embodiment, and Figure 9 is a flowchart illustrating the operation of the second embodiment. In addition to the configuration of the first embodiment, the computer 10 of the second embodiment includes a variable batch deletion processing unit 116 that deletes, in a single batch, the explanatory variables designated for deletion. The other components are the same as in the first embodiment (Figure 2).

The variable batch deletion processing unit 116 has the function of deleting predetermined explanatory variables from the estimation model in a single batch in accordance with the inference results of the causal inference unit 114. Specifically, as shown in Figure 9, statistical causal inference is performed successively on each individual explanatory variable and the multiple dependent variables y1, y2, y3… to estimate the direction of causality (step S21). After the reliability data have been obtained, explanatory variables for which the dependent variable is judged to be the cause are retained, while explanatory variables for which the dependent variable is judged not to be the cause are deleted from the estimation model (step S22). The explanatory variables remaining after deletion are adopted as the final explanatory variables for constructing the estimation model (step S23).
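Steps S22 and S23 amount to filtering the columns of the explanatory-variable matrix by the outcome of the causal inference. A minimal sketch follows, assuming that step S21 has already produced one score per explanatory variable and that a positive score means the dependent variable was judged to be its cause. The function name `batch_delete`, the zero threshold, and the score values are hypothetical.

```python
import numpy as np

def batch_delete(X, scores, threshold=0.0):
    """Keep the columns of X whose score favors 'dependent variable -> x_j'.

    Returns the reduced matrix and the indices of the retained variables.
    """
    scores = np.asarray(scores, dtype=float)
    keep = np.flatnonzero(scores > threshold)   # step S22: drop all others at once
    return X[:, keep], keep

# Hypothetical input: 4 samples, 5 explanatory variables, made-up scores.
X = np.arange(20, dtype=float).reshape(4, 5)
scores = [0.3, -0.1, 0.05, -0.4, 0.2]

X_final, kept = batch_delete(X, scores)         # step S23: final variable set
```

The estimation model would then be rebuilt from `X_final` only. The threshold is exposed as a parameter because the text does not state how borderline scores are handled.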

Figures 10(a) and 10(b) are graphs showing the relationship between measured values and values estimated by the estimation model for the dependent variable y3, for which no direct causal relationship was found, before and after deletion of explanatory variables.

As shown in Figure 10(a), before the explanatory variables are deleted, the root mean square error RMSEnew of the dependent variable y3 obtained by inputting a new data set into the estimation model is far larger than the root mean square error RMSEtrain of y3 obtained by inputting the training data. This means that the estimation model for the dependent variable y3 before deletion of the explanatory variables cannot be applied to new data sets.

On the other hand, as shown in Figure 10(b), after the explanatory variables are deleted, the root mean square error RMSEnew of the dependent variable y3 obtained by inputting a new data set into the estimation model no longer differs significantly from the root mean square error RMSEtrain of y3 obtained by inputting the training data. This means that deleting the explanatory variables that are not caused by the dependent variable improved the reliability of the estimation model, and that the model became more likely to be applicable even to new data sets for which applicability had been judged unlikely before the deletion.

As explained above, the second embodiment provides the same effects as the first embodiment, and in addition improves the reliability of the estimation model by deleting explanatory variables in accordance with the information on the direction of causality.

[Third Embodiment]
Next, the measurement system 1 of the third embodiment will be described with reference to Figures 11 and 12. The overall configuration of the measurement system 1 of the third embodiment may be the same as that of the first embodiment, so redundant explanations are omitted. Figure 11 is a configuration diagram illustrating the computer 10 of the third embodiment, and Figure 12 is a flowchart illustrating the operation of the third embodiment. In addition to the configuration of the first embodiment, the computer 10 of the third embodiment includes a variable sequential deletion processing unit 117 that deletes, one at a time, the explanatory variables designated for deletion. The other components are the same as in the first embodiment (Figure 2).

The variable sequential deletion processing unit 117 has the function of sequentially deleting predetermined explanatory variables from the estimation model in accordance with the inference results of the causal inference unit 114. Specifically, as shown in Figure 12, the variable sequential deletion processing unit 117 first determines the number (k) of explanatory variables that will ultimately be removed (step S31). Then, as in the preceding embodiments, it generates an estimation model from the training data and records the model's cross-validation error together with the explanatory variables remaining at that point (step S32).

Next, as in the preceding embodiments, statistical causal inference is applied to each individual explanatory variable and the dependent variable to estimate the direction of causality between them (step S33), and the explanatory variable for which the dependent variable is least likely to be the cause is deleted from the estimation model (step S34). Steps S32 to S34 are repeated until the number of deleted explanatory variables reaches k (step S35). Once the number of deleted explanatory variables reaches the target k, the explanatory variables that remained at the point where the cross-validation error was smallest are adopted as the final explanatory variables for constructing the estimation model. Needless to say, reaching k deleted explanatory variables is only one example of a termination condition, and other conditions may be adopted.
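The loop of steps S31 to S35 can be sketched as follows. Several choices are assumptions made only for illustration: the estimation model is an ordinary least-squares fit, the cross-validation is a simple 5-fold split, the causal score is an entropy-based approximation of a pairwise likelihood ratio, and the state after the last deletion is also evaluated. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def _entropy(u):
    """Maximum-entropy approximation of differential entropy (u standardized)."""
    k1, k2, gamma = 79.047, 7.4129, 0.37457
    return ((1 + np.log(2 * np.pi)) / 2
            - k1 * (np.mean(np.log(np.cosh(u))) - gamma) ** 2
            - k2 * np.mean(u * np.exp(-u ** 2 / 2)) ** 2)

def causal_score(y, x):
    """Positive values favor 'y causes x' (approximate likelihood ratio)."""
    y = (y - y.mean()) / y.std()
    x = (x - x.mean()) / x.std()
    rho = np.mean(x * y)
    rx, ry = x - rho * y, y - rho * x
    return (-_entropy(y) - _entropy(rx / rx.std())
            + _entropy(x) + _entropy(ry / ry.std()))

def cv_rmse(X, y, folds=5):
    """K-fold cross-validation RMSE of a least-squares model (assumed model)."""
    idx = np.arange(len(y))
    sq_err = []
    for f in range(folds):
        test = idx % folds == f
        A = np.column_stack([X[~test], np.ones((~test).sum())])
        coef, *_ = np.linalg.lstsq(A, y[~test], rcond=None)
        B = np.column_stack([X[test], np.ones(test.sum())])
        sq_err.append((B @ coef - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

def sequential_delete(X, y, k):
    """Delete k variables one at a time; return (best CV error, variables)."""
    remaining = list(range(X.shape[1]))
    history = []
    for _ in range(k):                                                 # S35
        history.append((cv_rmse(X[:, remaining], y), list(remaining)))  # S32
        scores = [causal_score(y, X[:, j]) for j in remaining]         # S33
        remaining.pop(int(np.argmin(scores)))                          # S34
    # Assumption: the variable set left after the last deletion is scored too.
    history.append((cv_rmse(X[:, remaining], y), list(remaining)))
    return min(history, key=lambda h: h[0])

# Synthetic data: columns 0 and 1 are driven by y, columns 2 and 3 are not.
rng = np.random.default_rng(2)
n = 5000
y = rng.uniform(-1, 1, n)
X = np.column_stack([y + rng.uniform(-1, 1, n), y + rng.uniform(-1, 1, n),
                     rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)])

best_error, best_vars = sequential_delete(X, y, k=2)
```

Recording the variable set together with each cross-validation error is what allows the final choice to fall back to an earlier round if a later deletion made the model worse.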

As explained above, the third embodiment provides the same effects as the first embodiment, and in addition improves the reliability of the estimation model by sequentially deleting explanatory variables in accordance with the information on the direction of causality.

(Example 1)
Figure 13 shows an example of measurement results obtained when the measurement system 1 of Figure 1 measures tablets, acquires their near-infrared spectral data, and applies an estimation model to determine the active ingredient content of the tablets. Figure 13(a) shows the distribution of measured values of the active ingredient content and values estimated by the estimation model when explanatory variables are deleted sequentially as in the third embodiment. Figure 13(b), by contrast, shows the distribution of measured and estimated values obtained with the conventional technique. With the present embodiment, the scatter between measured and estimated values is small even for the new data set, which shows that the reliability of the estimation model is improved.

(Example 2)
Figure 14 shows an example of measurement results obtained when the present embodiment is applied to measurement data from a gas sensor in a gas chamber. Figure 14(a) shows the distribution of measured and estimated values obtained with the conventional technique. Figure 14(b), by contrast, shows the distribution of measured values of the ethylene concentration in the gas and values estimated by the estimation model when explanatory variables are deleted sequentially as in the third embodiment. With the present embodiment, the scatter between measured and estimated values is small even for the new data set, which shows that the reliability of the estimation model is improved.

The present invention is not limited to the examples described above, and various modifications are possible. For example, the above examples have been described in detail to make the present invention easy to understand, and the present invention is not necessarily limited to forms that include all of the described components. Part of the configuration of one example may be replaced with the configuration of another example, and the configuration of another example may be added to the configuration of one example. In addition, part of the configuration of each example may be deleted, or other components may be added to or substituted for it.

10…computer, 20…spectroscopic measurement device, NW…network, 40…server, 50…database, 101…CPU, 102…ROM, 103…RAM, 104…flash memory, 105…input/output control unit, 106…communication control unit, 112…estimation model generation/update unit, 113…composite variable generation unit, 114…causal inference unit, 115…new data set applicability determination unit, 116…variable batch deletion processing unit, 117…variable sequential deletion processing unit.

Claims

A determination method for determining, using a computer, the reliability of an estimation model that estimates a dependent variable from explanatory variables, wherein the computer executes:
a step of applying statistical causal inference to the explanatory variable and the dependent variable to estimate the direction of causality between the explanatory variable and the dependent variable;
a step of determining whether the dependent variable is the direct and sole cause of the explanatory variable; and
a step of determining, in accordance with the result of that determination, whether the estimation model is applicable to new data other than the data used to construct the estimation model,
wherein, in the step of determining whether the dependent variable is the direct and sole cause, the determination is made using an algorithm that calculates the likelihood of the causal relationship between the dependent variable and the explanatory variable.

A method for determining the reliability of an estimation model according to claim 1, further comprising the step of removing an explanatory variable from the estimation model if, in the step of estimating the direction of causality, it is determined that the dependent variable is unlikely to be the cause of the explanatory variable.

The method for determining the reliability of an estimation model according to claim 2, wherein the step of deleting the explanatory variables deletes all explanatory variables for which the dependent variable is determined to be unlikely to be the cause.

The method for determining the reliability of an estimation model according to claim 2, wherein the step of deleting the explanatory variables successively deletes explanatory variables for which the dependent variable is determined to be unlikely to be the cause, and stops the deletion when a predetermined condition is met.

The method for determining the reliability of an estimation model according to claim 1, further comprising:
a step of appropriately selecting and combining multiple explanatory variables to generate a plurality of composite variables; and
a step of selecting one composite variable from among the plurality of composite variables and using it as the explanatory variable.

The method for determining the reliability of an estimation model according to claim 5, wherein the step of generating the plurality of composite variables includes principal component analysis, factor analysis, t-distributed stochastic neighbor embedding (t-SNE), cluster analysis, non-negative matrix factorization (NMF), multivariate curve resolution (MCR), parallel factor analysis (PARAFAC), partial least squares regression analysis, or ensemble learning.

The method for determining the reliability of an estimation model according to claim 1, wherein the step of estimating the direction of causality uses LiNGAM (Linear Non-Gaussian Acyclic Model).

A computer program for determining the reliability of an estimation model that estimates a dependent variable from explanatory variables, the computer program causing a computer to execute:
a step of applying statistical causal inference to the explanatory variable and the dependent variable to estimate the direction of causality between the explanatory variable and the dependent variable;
a step of determining whether the dependent variable is the direct and sole cause of the explanatory variable; and
a step of determining, in accordance with the result of that determination, whether the estimation model is applicable to new data other than the data used to construct the estimation model,
wherein, in the step of determining whether the dependent variable is the direct and sole cause, the determination is made using an algorithm that calculates the likelihood of the causal relationship between the dependent variable and the explanatory variable.

A measurement system comprising:
a measuring device that measures an object to be measured and acquires measurement data as explanatory variables; and
a computer that applies the explanatory variables to an estimation model to output a dependent variable and calculates various characteristics of the object to be measured,
wherein the computer includes a computer program for determining the reliability of the estimation model, and the computer program causes the computer to execute:
a step of applying statistical causal inference to the explanatory variable and the dependent variable to estimate the direction of causality between the explanatory variable and the dependent variable;
a step of determining whether the dependent variable is the direct and sole cause of the explanatory variable; and
a step of determining, in accordance with the result of that determination, whether the estimation model is applicable to new data other than the data used to construct the estimation model,
wherein, in the step of determining whether the dependent variable is the direct and sole cause, the determination is made using an algorithm that calculates the likelihood of the causal relationship between the dependent variable and the explanatory variable.