JP5852902B2

JP5852902B2 - Gene interaction analysis system, method and program thereof

Info

Publication number: JP5852902B2
Application number: JP2012040646A
Authority: JP
Inventors: 優夫植木; 田宮　元; 元田宮
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2012-02-27
Filing date: 2012-02-27
Publication date: 2016-02-03
Anticipated expiration: 2032-02-27
Also published as: JP2013175135A

Description

The present invention relates to a gene-gene interaction analysis system, method, and program for analyzing the effects caused by interactions between genes.

It has long been well known that, in a wide range of species, genes on the genome are involved in the expression of phenotypes, which represent the biological characteristics of an individual.
When multiple genes are considered, an individual gene may have no effect on the phenotype by itself, while a combination of several genes does affect it. In such cases it is difficult to search for the effect of each gene individually and then, based on those results, search for the interaction effects of gene combinations.

Technology for exhaustively searching for variations in the human genome has been put to practical use. To identify human phenotypes, in particular disease onset and individual differences in adverse drug reactions, genetic variation can now be examined genome-wide on the basis of SNPs (single nucleotide polymorphisms) in genes (see, for example, Patent Document 1).
In particular, elucidating the relationship between disease onset (one example of phenotypic expression) and SNPs in genes makes it possible to prevent onset and is important for maintaining human health.

As described above, multiple genes form complex networks and exert effects such as causing or advancing disease, and these effects are expressed as phenotypes (qualitative or quantitative traits) of the genotype that a human individual carries.
Elucidating, genome-wide, how interactions between SNPs (epistasis) affect trait expression is therefore of medical importance: it makes it possible to prevent disease onset in advance and to provide useful gene therapy even after onset.

JP 2005-527904 A (published Japanese translation of a PCT application)

When searching genome-wide for SNPs and SNP combinations that affect trait expression, the analysis of interactions between different SNPs must consider combinations of SNPs at multiple different loci, so the volume of data to be analyzed becomes enormous.
For example, for combinations of two SNPs, there are about 500 billion ways of choosing two different SNPs from 900,000 SNPs. With a conventional mathematical model, if about 1,000 specimens are used to confirm the effect, this exceeds either the limits of the mathematical model itself or the limits of computational complexity when the model is handled by a computer.
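The scale of the pairwise search can be checked with a few lines of Python. This is only an illustrative sketch; the function name `count_snp_pairs` is hypothetical and does not appear in the patent. The exact count for 900,000 SNPs is about 4.05 × 10^11, the same order of magnitude as the figure quoted above.

```python
import math


def count_snp_pairs(m: int) -> int:
    """Number of unordered pairs of distinct SNPs among m SNPs, i.e. m(m-1)/2."""
    return math.comb(m, 2)


pairs = count_snp_pairs(900_000)
print(f"{pairs:,}")  # 404,999,550,000
```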

For this reason, conventional methods of searching genome-wide for SNP combinations that affect trait expression first apply filtering that selects only some gene regions from the whole genome, and then fit the selected regions to a mathematical model for analysis, thereby carrying out an analysis that considers multiple loci.
With such filtering, however, a gene in a region that was not selected may be one carrying a SNP that has a major effect on phenotype expression.
Consequently, if a gene that actually affects disease onset happens to be filtered out of the search, the combination of SNPs involved in the onset of the disease can no longer be identified.

The present invention has been made in view of these circumstances. Its object is to provide a gene-gene interaction analysis system, method, and program that can easily search genome-wide, without selecting genes by filtering, for combinations of SNPs that affect trait expression, and can analyze the expression of traits.

The present invention has been made to solve the problems described above. The gene-gene interaction analysis system of the present invention exhaustively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, that is, combinations of SNPs having gene-gene interactions that affect trait expression, and generates a regression model that analyzes trait expression from those SNP pairs. The system comprises: a table creation unit that sequentially selects SNP pairs, each consisting of two different SNPs, from the m types of SNPs detected in n specimens, classifies the genotypes of each SNP in the pair (combinations of dominant and recessive alleles) into a table of dummy codes consisting of two categories, classifies each of the n specimens for each dummy code, and creates a table showing the two categories and the specimens belonging to each; an evaluation value calculation unit that performs simple (single-predictor) regression analysis on the specimens classified into each category, using a simple regression model whose dummy variable is the dummy code in the table, and calculates for each dummy variable an evaluation value indicating the strength of the association between that dummy variable and the expression of the trait; a ranking creation unit that generates a ranking by arranging the dummy variables in order of the strength of association indicated by the evaluation value, strongest first; a dummy variable creation unit that extracts the dummy variables up to a preset rank and uses them as the dummy variables of the regression model; and a variable selection unit that calculates, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.

In the gene-gene interaction analysis system of the present invention, the table creation unit comprises: a CDC table creation unit that, using CDC (cell-wise dummy coding) as the coding method, produces a CDC table of dummy codes in which the genotypes of a single SNP are classified into two categories, and a CDC interaction table of dummy codes in which the genotypes of two different SNPs are classified into two categories, namely one genotype combination and all other genotype combinations; and an ADC table creation unit that, using ADC (adaptive dummy coding) as the coding method, produces an ADC interaction table of dummy codes in which multiple genotype combinations of two different SNPs are classified into two categories.
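The cell-wise coding of a SNP pair can be sketched as follows. This is a minimal illustration, assuming that each of the 3 × 3 genotype combinations yields one dummy code that is 1 for specimens in that cell and 0 for all others; the function name and data layout are hypothetical. ADC is not sketched, because its grouping rule is not spelled out in this passage.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # e.g. AA, AB, BB


def cdc_pair_codes(g1, g2):
    """Return one 0/1 dummy vector per genotype cell (a, b) of a SNP pair.

    g1, g2: per-specimen genotype codes (0, 1 or 2) for the two SNPs.
    A specimen gets 1 in cell (a, b) only if its genotypes are exactly (a, b).
    """
    codes = {}
    for a, b in product(GENOTYPES, repeat=2):
        codes[(a, b)] = [1 if (x, y) == (a, b) else 0 for x, y in zip(g1, g2)]
    return codes


codes = cdc_pair_codes([0, 1, 2, 0], [0, 1, 1, 0])
print(codes[(0, 0)])  # [1, 0, 0, 1]
```

Each specimen falls in exactly one cell, so the nine dummy vectors of a pair sum to 1 for every specimen.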

In the gene-gene interaction analysis system of the present invention, when the regression model is a logistic regression model, each of the CDC table, the CDC interaction table, and the ADC interaction table is a 2 × 2 table consisting of the two categories and, for each category, the number of specimens belonging to each of the two classes of the qualitative trait. The evaluation value calculation unit comprises: a CDC likelihood calculation unit that calculates a likelihood as the evaluation value in the simple regression analysis of each dummy code of the CDC table and the CDC interaction table; a CDC_p value calculation unit that calculates a p-value as the evaluation value in the simple regression analysis of each dummy code of the CDC table and the CDC interaction table; and an ADC_p value calculation unit that calculates a p-value as the evaluation value in the simple regression analysis of each dummy code of the ADC interaction table.
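One way to obtain a p-value from such a 2 × 2 table is the likelihood-ratio test of a logistic model with a single 0/1 dummy variable against the intercept-only model. The sketch below assumes that test; this passage does not state which test the patent uses, and the function name is hypothetical. For a single binary predictor the likelihood-ratio statistic equals the G² statistic of the table, and the chi-square tail probability with one degree of freedom is erfc(sqrt(G²/2)).

```python
import math


def lr_pvalue_2x2(table):
    """Likelihood-ratio statistic and p-value for a 2 x 2 table.

    table[c][k] = number of specimens of class c (0/1) in category k (dummy 0/1).
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [table[0][k] + table[1][k] for k in range(2)]
    g2 = 0.0
    for c in range(2):
        for k in range(2):
            obs = table[c][k]
            if obs > 0:  # empty cells contribute nothing
                expected = row_tot[c] * col_tot[k] / n
                g2 += 2.0 * obs * math.log(obs / expected)
    g2 = max(g2, 0.0)  # guard against tiny negative rounding error
    p_value = math.erfc(math.sqrt(g2 / 2.0))  # chi-square tail, 1 d.o.f.
    return g2, p_value


g2, p = lr_pvalue_2x2([[30, 10], [10, 30]])
```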

In the gene-gene interaction analysis system of the present invention, the ranking creation unit generates the ranking by arranging the likelihoods of the dummy codes of the CDC table and the CDC interaction table in order of strength of association, likewise generates the ranking by arranging the p-values of the dummy codes of the CDC table and the CDC interaction table in order of strength of association, and likewise generates the ranking by arranging the p-values of the dummy codes of the ADC interaction table in order of strength of association.
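The ranking and cut-off can be illustrated with a short sketch. It assumes that, for p-values, a smaller value means a stronger association, and it uses hypothetical labels for the dummy codes; neither the labels nor the function name come from the patent.

```python
def rank_dummy_codes(pvalues, top_k):
    """Return the labels of the top_k dummy codes, strongest association first.

    pvalues: dict mapping a dummy-code label to its p-value.
    Smaller p-values are treated as stronger associations.
    """
    ranked = sorted(pvalues, key=lambda label: pvalues[label])
    return ranked[:top_k]


pvalues = {"snp1:(0,0)": 0.20, "snp1xsnp2:(2,1)": 1e-6, "snp3xsnp7:(0,2)": 0.003}
print(rank_dummy_codes(pvalues, top_k=2))
# ['snp1xsnp2:(2,1)', 'snp3xsnp7:(0,2)']
```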

In the gene-gene interaction analysis system of the present invention, when the regression model is a linear regression model, the CDC interaction table and the ADC interaction table are tables consisting of the two categories and the sum of the quantitative trait values of the specimens in each category. The evaluation value calculation unit comprises: a CDC_t value calculation unit that calculates a t-value as the evaluation value in the simple regression analysis of each dummy code of the CDC interaction table; and an ADC_t value calculation unit that calculates a t-value as the evaluation value in the simple regression analysis of each dummy code of the ADC interaction table.
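For a quantitative trait, the t-value of the slope in a simple linear regression on a 0/1 dummy variable coincides with the pooled-variance two-sample t statistic. The sketch below computes it from raw trait values rather than from the per-category sums mentioned above; that choice, and the function name, are assumptions made for illustration. It requires at least one specimen in each category and more than two specimens in total.

```python
import math
from statistics import mean


def dummy_t_value(y, x):
    """t statistic of the slope in y = b0 + b1 * x for a 0/1 dummy variable x."""
    y0 = [v for v, d in zip(y, x) if d == 0]
    y1 = [v for v, d in zip(y, x) if d == 1]
    n0, n1 = len(y0), len(y1)
    m0, m1 = mean(y0), mean(y1)
    # Residual sum of squares around the two category means.
    rss = sum((v - m0) ** 2 for v in y0) + sum((v - m1) ** 2 for v in y1)
    s2 = rss / (n0 + n1 - 2)  # residual variance estimate
    return (m1 - m0) / math.sqrt(s2 * (1.0 / n0 + 1.0 / n1))


t = dummy_t_value([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```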

In the gene-gene interaction analysis system of the present invention, the ranking creation unit generates the ranking by arranging the t-values of the dummy codes of the CDC interaction table in order of strength of association, and likewise generates the ranking by arranging the t-values of the dummy codes of the ADC interaction table in order of strength of association.

The gene-gene interaction analysis method of the present invention exhaustively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, that is, combinations of SNPs having gene-gene interactions that affect trait expression, and generates a regression model that analyzes trait expression from those SNP pairs. The method includes: a table creation step in which a table creation unit sequentially selects SNP pairs, each consisting of two different SNPs, from the m types of SNPs detected in n specimens, classifies the genotypes of each SNP in the pair (combinations of dominant and recessive alleles) into a table of dummy codes consisting of two categories, classifies each of the n specimens for each dummy code, and creates a table showing the two categories and the specimens belonging to each; an evaluation value calculation step in which an evaluation value calculation unit performs simple regression analysis on the specimens classified into each category, using a simple regression model whose dummy variable is the dummy code in the table, and calculates for each dummy variable an evaluation value indicating the strength of the association between that dummy variable and the expression of the trait; a ranking creation step in which a ranking creation unit generates a ranking by arranging the dummy variables in order of the strength of association indicated by the evaluation value, strongest first; a dummy variable creation step in which a dummy variable creation unit extracts the dummy variables up to a preset rank and uses them as the dummy variables of the regression model; and a variable selection step in which a variable selection unit calculates, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.

The program of the present invention is a gene-gene interaction analysis program that exhaustively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, that is, combinations of SNPs having gene-gene interactions that affect trait expression, and generates a regression model that analyzes trait expression from those SNP pairs. The program causes a computer to function as: table creation means that sequentially selects SNP pairs, each consisting of two different SNPs, from the m types of SNPs detected in n specimens, classifies the genotypes of each SNP in the pair (combinations of dominant and recessive alleles) into a table of dummy codes consisting of two categories, classifies each of the n specimens for each dummy code, and creates a table showing the two categories and the specimens belonging to each; evaluation value calculation means that performs simple regression analysis on the specimens classified into each category, using a simple regression model whose dummy variable is the dummy code in the table, and calculates for each dummy variable an evaluation value indicating the strength of the association between that dummy variable and the expression of the trait; ranking creation means that generates a ranking by arranging the dummy variables in order of the strength of association indicated by the evaluation value, strongest first; dummy variable creation means that extracts the dummy variables up to a preset rank and uses them as the dummy variables of the regression model; and variable selection means that calculates, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.

According to the present invention, in genome-wide genetic analysis using conventional statistical methods, the analysis is not restricted to a predetermined range of the genome. Instead, genotype data are classified into two categories, so that the strength of the effect of interactions between genotypes can be evaluated by simple regression analysis. By selecting, from the dummy codes corresponding to each table, those that simple regression analysis shows to have a strong effect, the influence of SNP genotype interactions across the entire human genome on qualitative or quantitative traits can be identified. A highly accurate regression analysis can then be performed using dummy codes corresponding to gene-gene interactions as explanatory variables.

A block diagram showing a configuration example of a gene-gene interaction analysis system according to one embodiment of the present invention.
An example of the SNP basic table, which tabulates, for each SNP stored in the SNP database 11, the number of specimens of each phenotype for each SNP genotype.
An example of the SNP interaction basic table, which shows, for all SNPs, the classification of specimens into qualitative trait classes for each combination of genotypes of two different SNPs, corresponding to second-order interactions.
A diagram showing an example of a CDC table, which, for the SNP basic table, shows the classification of specimens into qualitative trait classes within each of two categories: a combination of any two of the three genotypes, and the remaining genotype.
A diagram showing an example of a CDC table, which, for the SNP basic table, shows the classification of specimens into qualitative trait classes within each of two categories: a combination of any two of the three genotypes, and the remaining genotype.
A diagram showing an example of a CDC interaction table, in which the genotype combinations of two different SNPs in the SNP interaction table shown in FIG. 3 are divided into two categories.
A diagram showing an example of an ADC interaction table, in which the genotype combinations of two different SNPs in the SNP basic table shown in FIG. 2 are divided into two categories.
A conceptual diagram explaining how simple regression analysis is performed for each CDC table and how the likelihood or p-value is calculated using the response variable indicated by the CDC table.
A diagram showing an example of a CDC table, which, for the SNP basic table, shows the classification of specimens into qualitative trait classes within each of two categories: a combination of any two of the three genotypes, and the remaining genotype.
A flowchart showing an operation example of selecting dummy variables and calculating regression coefficients for the regression model using the gene-gene interaction analysis system.
A schematic block diagram showing a configuration example of a gene-gene interaction analysis system according to a second embodiment of the present invention.
A configuration example of the SNP interaction basic table in the second embodiment.

The present invention applies the SIS (sure independence screening) method to an algorithm for exhaustively analyzing, genome-wide, how gene-gene interactions affect trait expression, for example how they affect the onset of disease. Because the SIS method has been theoretically justified for generalized linear models, it is known as an algorithm well suited to case-control studies.
Using the SIS method, the present invention codes dummy variables (creates dummy codes) between the genotypes of a SNP (single nucleotide polymorphism) or between the genotypes of two different SNPs. The algorithm combines this coding with ranking of the resulting dummy variables, selection of the dummy variables up to a preset rank, and calculation of the regression coefficients by which the selected dummy variables are multiplied. The present invention thus relates to a technique that uses this algorithm to build a regression model of trait expression from genome-wide SNPs, starting from data with few samples and high-dimensional variables.

<First Embodiment>
In this embodiment, from the genotype data of SNPs at different loci obtained from a large number of specimens (large-scale data), the SNP genotype data to be used in a logistic regression model, which serves as the mathematical model, are extracted, and regression analysis is performed using the extracted data.
That is, in the m-dimensional linear function of the genotype data of m SNPs shown in equation (1) below, the explanatory variables actually used are selected from among the explanatory variables x_j (1 ≤ j ≤ m).

Then, using the selected q (q ≪ m) explanatory variables x_i (1 ≤ i ≤ q), regression analysis is performed with the q-dimensional linear function of equation (2) below.

In the process of extracting, from the large set of SNP genotype data described above, the SNP genotype data to be used for regression analysis, the genotype data of each SNP obtained from the specimens are divided into two categories to form a dummy code. A 2 × 2 evaluation table is generated from each dummy code, a simple regression model is created for each table, and simple regression analysis is used to obtain an evaluation value such as a likelihood or p-value for each model. The evaluation values are ranked, and the dummy codes whose evaluation values fall within a preset rank (for example, the top 256) are selected as dummy variables (explanatory variables). Next, among the regression coefficients β_j by which the selected explanatory variables are multiplied, those that are 0 are identified. The explanatory variables x_i corresponding to the non-zero regression coefficients are then extracted, and a logistic regression model for regression analysis of the qualitative trait from SNP genotype data is generated.
The configuration that performs this processing is described in detail below.
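The final step, in which some coefficients become exactly zero and only the non-zero ones are kept, can be illustrated with a small penalized logistic regression. The sketch below assumes an L1 (lasso-type) penalty fitted by proximal gradient descent; the passage above says only "penalized maximum likelihood", so the penalty form, the solver, the function names, and the toy data are all assumptions rather than the patent's method.

```python
import math


def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def l1_logistic(X, y, lam, step=1.0, iters=2000):
    """Proximal-gradient fit of an L1-penalized logistic regression.

    X: rows of 0/1 dummy variables, y: 0/1 classes.
    Returns (intercept, coefficients); the intercept is not penalized.
    """
    n, q = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * q
    for _ in range(iters):
        # Predicted probabilities under the current coefficients.
        p = [1.0 / (1.0 + math.exp(-(b0 + sum(b * x for b, x in zip(beta, row)))))
             for row in X]
        resid = [pi - yi for pi, yi in zip(p, y)]
        g0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(q)]
        b0 -= step * g0
        # Gradient step followed by soft-thresholding (the L1 proximal map).
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta


# Toy data: column 0 is associated with the class, column 1 is not.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 1]
b0, beta = l1_logistic(X, y, lam=0.05)
selected = [j for j, b in enumerate(beta) if b != 0.0]  # non-zero coefficients
```

In this toy run only the first dummy variable keeps a non-zero coefficient, which mirrors the idea of retaining only the explanatory variables with non-zero regression coefficients.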

Next, the first embodiment of the present invention is described with reference to the drawings. FIG. 1 is a schematic block diagram showing a configuration example of a gene-gene interaction analysis system for genome-wide genes according to the first embodiment of the present invention.
In FIG. 1, the gene-gene interaction analysis system of this embodiment comprises a basic table creation unit 1, a CDC (cell-wise dummy coding) table creation unit 2, an ADC (adaptive dummy coding) table creation unit 3, a CDC likelihood calculation unit 4, a CDC_p value calculation unit 5, an ADC_p value calculation unit 6, a ranking creation unit 7, a dummy variable creation unit 8, a variable selection unit 9, a regression analysis processing unit 10, an SNP database 11, a CDC likelihood storage unit 12, a CDC_p value storage unit 13, an ADC_p value storage unit 14, and a ranking storage unit 15.

The SNP database 11 stores, in association with the specimen identification information assigned to each of the n specimens, genotype data for m SNPs extracted from the n specimens (for example, SNPs at more than 500,000 loci across the genome). In the SNP database 11, each specimen is also given attribute data representing a binary phenotype as a qualitative trait: whether or not the specimen has a given disease, whether or not it shows an adverse reaction to a given drug, or whether or not it is obese. The SNP database 11 stores only specimens for which, in the genotype extraction process, SNP genotype data were obtained with no missing data at any locus where a SNP was detected in other specimens. The SNP genotype data in this embodiment therefore exclude specimens that would ordinarily require imputation.
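The exclusion of specimens with missing genotype data amounts to a complete-case filter. The following sketch assumes a hypothetical in-memory layout (a dictionary from specimen identifier to a list of genotype codes, with `None` marking a missing call); the actual storage format of the SNP database 11 is not described here.

```python
def complete_cases(genotypes):
    """Keep only specimens with no missing genotype at any locus.

    genotypes: dict mapping a specimen ID to a list of genotype codes,
    where None marks a missing call (an assumption of this sketch).
    """
    return {sid: calls for sid, calls in genotypes.items()
            if all(c is not None for c in calls)}


data = {"s1": [0, 1, 2], "s2": [0, None, 1], "s3": [2, 2, 2]}
print(list(complete_cases(data)))  # ['s1', 's3']
```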

FIG. 2 shows an example of the SNP basic table, which tabulates, for each SNP stored in the SNP database 11, the number of specimens of each phenotype for each genotype. Because a SNP basic table is created for each SNP at a different locus, m tables are stored in the SNP database 11 for one qualitative trait. Each SNP basic table is written to and stored in the SNP database 11 in association with SNP basic table identification information that identifies it. In FIG. 2, the SNP genotype data take three values, SNP=0, SNP=1, and SNP=2, which correspond, for example, to genotypes AA, AB, and BB, respectively. Here, for example, A is the dominant allele and B is the recessive allele, and each genotype is a combination of these. The SNP basic table in FIG. 2 classifies the specimens of each genotype into one of two binary classes according to whether the person from whom the specimen was collected is affected (=1) or unaffected (=0) by a given disease as the phenotype.

For genotype SNP=0, there are n00 unaffected specimens and n10 affected specimens. For genotype SNP=1, there are n01 unaffected specimens and n11 affected specimens. For genotype SNP=2, there are n02 unaffected specimens and n12 affected specimens. In this embodiment, the "xx" in "nxx" is a subscript.
Here, n00 + n10 + n01 + n11 + n02 + n12 = n (the total number of specimens).
In other words, for n specimens and one SNP, the SNP basic table is a 2 × 3 contingency table formed by the two classes of the qualitative trait and the three genotype values treated as categorical data.
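The 2 × 3 tabulation can be sketched in a few lines. The function name and the list-of-lists layout are illustrative assumptions; `table[c][g]` plays the role of the count n with subscript "cg" in the text.

```python
def snp_basic_table(genotype, phenotype):
    """Build the 2 x 3 contingency table for one SNP.

    genotype: per-specimen genotype codes (0, 1 or 2).
    phenotype: per-specimen class (0 = unaffected, 1 = affected).
    table[c][g] counts the specimens of class c with genotype g.
    """
    table = [[0, 0, 0], [0, 0, 0]]
    for g, c in zip(genotype, phenotype):
        table[c][g] += 1
    return table


table = snp_basic_table([0, 1, 2, 2, 0, 1], [0, 1, 1, 0, 0, 1])
print(table)  # [[2, 0, 1], [0, 2, 1]]
```

The six cell counts always add up to the number of specimens n.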

Returning to FIG. 1, the basic table creation unit 1 sequentially reads the SNP genotype data of each specimen from the SNP database 11, one SNP at a time, and classifies each specimen by its genotype. The basic table creation unit 1 also classifies the specimens of each genotype into the two classes of the qualitative trait and records the result as the SNP basic table shown in FIG. 2. As described above, the basic table creation unit 1 generates the SNP basic table of FIG. 2 for every one of the m SNPs.

FIG. 3 shows an example of the SNP interaction basic table, which shows, across all SNPs, the classification of specimens into qualitative trait classes for each combination of genotypes of two different SNPs, corresponding to second-order interactions. In FIG. 3, the two SNPs are SNP1 and SNP2; the genotype data of SNP1 are SNP1=0, SNP1=1, and SNP1=2, and the genotype data of SNP2 are SNP2=0, SNP2=1, and SNP2=2. That is, for each combination of a SNP1 genotype (0, 1, 2) and a SNP2 genotype (0, 1, 2), the SNP interaction basic table classifies the specimens into the corresponding class, affected (=1) or unaffected (=0). Here, the genotypes (0, 1, 2) denote the genotypes (AA, AB, BB), which are combinations of the dominant and recessive alleles.

For example, in FIG. 3, for the combination of genotype data SNP1 = 0 and SNP2 = 0, the number of specimens in the non-affected class is n000 and the number of specimens in the affected class is n100. For the combination of genotype data SNP1 = 2 and SNP2 = 1, the number of specimens in the non-affected class is n012 and the number of specimens in the affected class is n112.
As with the SNP basic table, n000 + n001 + n002 + n010 + n011 + n012 + n020 + n021 + n022 + n100 + n101 + n102 + n110 + n111 + n112 + n120 + n121 + n122 = n (the total number of specimens). In this embodiment, "yyy" in "nyyy" is a subscript.

Returning to FIG. 1, the basic table creation unit 1 sequentially selects combinations of two different SNPs from the m SNPs stored in the SNP database 11.
For each selected pair of SNPs (hereinafter, SNP pair), the basic table creation unit 1 sorts the specimens by combination of genotype data within the SNP pair (for example, the combination of genotype data SNP1 = 0 and SNP2 = 0). It then classifies the specimens of each genotype combination into the two classes of the qualitative trait, and the result is an SNP interaction basic table with the structure shown in FIG. 3. As described above, the basic table creation unit 1 generates an SNP interaction basic table of the form shown in FIG. 3 for every pair of two different SNPs drawn from the m SNPs.
Because the m SNPs yield m(m-1)/2 combinations of two different SNPs, the basic table creation unit 1 generates m(m-1)/2 SNP interaction basic tables.
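The pairwise construction above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary keyed by SNP identifier, and the sample data are assumptions, and the index order `table[cls][g1][g2]` is a choice made for the example rather than the subscript order used in FIG. 3.

```python
from itertools import combinations


def interaction_basic_table(g1, g2, phenotypes):
    """table[cls][x][y] = number of specimens with trait class cls,
    genotype x at the first SNP and genotype y at the second SNP."""
    table = [[[0] * 3 for _ in range(3)] for _ in range(2)]
    for x, y, cls in zip(g1, g2, phenotypes):
        table[cls][x][y] += 1
    return table


def all_interaction_tables(snps, phenotypes):
    """Build one table per unordered SNP pair, m(m-1)/2 tables in total.

    snps maps a (hypothetical) SNP identifier to its genotype list."""
    return {
        (i, j): interaction_basic_table(snps[i], snps[j], phenotypes)
        for i, j in combinations(sorted(snps), 2)
    }


# Hypothetical data: m = 3 SNPs, n = 4 specimens.
snps = {
    "snp1": [0, 1, 2, 1],
    "snp2": [0, 0, 1, 2],
    "snp3": [2, 2, 0, 1],
}
phenotypes = [0, 1, 1, 0]
tables = all_interaction_tables(snps, phenotypes)
print(len(tables))  # 3, i.e. m(m-1)/2 for m = 3
```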

Next, FIGS. 4 and 5 show examples of CDC tables. A CDC table is derived from an SNP basic table by combining any two of the three genotype data (the data for each genotype) into one category and keeping the remaining genotype data as the other category, and it shows how the specimens in each category are classified into the classes of the qualitative trait. The CDC tables shown in FIGS. 4 and 5 are 2 × 2 contingency tables formed by the two categories and the two classes into which the specimens of each category are classified, and they are stored in the SNP database 11.

FIG. 4 is an example of a table in which, among the genotype data SNP = 0, SNP = 1 and SNP = 2 (the three attributes of the SNP), genotype data SNP = 0 forms the single category C1 and the combination of genotype data SNP = 1 and SNP = 2 forms category C2. The CDC table of FIG. 4 therefore focuses on genotype data SNP = 0, the single category C1. In FIG. 4, the numbers of specimens in the non-affected and affected classes for the category of genotype data SNP = 0 are the same as in the SNP basic table of FIG. 2. The numbers of specimens in the non-affected and affected classes for the combined category of SNP = 1 and SNP = 2 are obtained by adding, class by class, the numbers of specimens for genotype data SNP = 1 and for genotype data SNP = 2 in the SNP basic table of FIG. 2.

FIG. 5 is an example of a table in which, among the genotype data SNP = 0, SNP = 1 and SNP = 2, genotype data SNP = 1 forms the single category C1 and the combination of genotype data SNP = 0 and SNP = 2 forms category C2. The CDC table of FIG. 5 therefore focuses on genotype data SNP = 1, the single category C1. In FIG. 5, the numbers of specimens in the non-affected and affected classes for the category of genotype data SNP = 1 are the same as in the SNP basic table of FIG. 2. The numbers of specimens in the non-affected and affected classes for the combined category of SNP = 0 and SNP = 2 are obtained by adding, class by class, the numbers of specimens for genotype data SNP = 0 and for genotype data SNP = 2 in the SNP basic table of FIG. 2.

Returning to FIG. 1, the CDC table creation unit 2 reads the SNP basic tables shown in FIG. 2 from the SNP database 11, one SNP at a time, using the SNP basic table identification information. For each SNP basic table read, the CDC table creation unit 2 regroups the three genotype categories into a group of two and a group of one, and generates three (= 3(3-1)/2) CDC tables, each consisting of two categories as shown in FIG. 4 or FIG. 5. Of the three CDC tables generated for one SNP, the one not shown in FIGS. 4 and 5 has genotype data SNP = 2 as the single category C1 and the combination of genotype data SNP = 0 and SNP = 1 as category C2.

Here, the CDC table creation unit 2 obtains the numbers of non-affected and affected specimens in the category formed by combining two genotype data as follows. It adds the numbers of non-affected specimens of the two combined genotype data in the SNP basic table of FIG. 2 to obtain the number of non-affected specimens in the combined category, and likewise adds the numbers of affected specimens of the two combined genotype data to obtain the number of affected specimens in the combined category. The CDC table creation unit 2 also assigns CDC table identification information to each CDC table it creates, associates the identification information with the CDC table it indicates, and writes both to the SNP database 11 for storage.
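The regrouping of one 2 × 3 SNP basic table into three 2 × 2 CDC tables can be sketched as below. The function name `cdc_tables`, the dictionary keyed by the focused genotype, and the example counts are assumptions for illustration; rows are the trait class (non-affected, affected) and columns are the categories (C1, C2).

```python
def cdc_tables(basic):
    """Split a 2 x 3 SNP basic table into three 2 x 2 CDC tables.

    basic[cls][g] is the count for trait class cls and genotype g.
    The result maps the focused genotype (category C1) to a table whose
    rows are the classes and whose columns are (C1, C2), where C2 pools
    the two remaining genotypes.
    """
    result = {}
    for focus in (0, 1, 2):
        rows = []
        for cls in (0, 1):
            c1 = basic[cls][focus]
            c2 = sum(basic[cls]) - c1  # the two other genotypes, added up
            rows.append([c1, c2])
        result[focus] = rows
    return result


# Hypothetical SNP basic table: row 0 = non-affected, row 1 = affected.
basic = [[2, 2, 0], [1, 1, 2]]
cdc = cdc_tables(basic)
print(cdc[2])  # [[0, 4], [2, 2]]
```

Each CDC table keeps the row totals of the basic table, so the class sizes are unchanged by the regrouping.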

Next, FIG. 6 shows an example of a CDC interaction table, in which the combinations of genotype data of two different SNPs in the SNP interaction basic table of FIG. 3 are divided into two categories. Like a CDC table, the CDC interaction table shown in FIG. 6 is a 2 × 2 table formed by two categories and the two classes into which the specimens of each category are classified, and it is stored in the SNP database 11. Because the SNP interaction basic table of FIG. 3 combines the three genotype data of each of two different SNPs, there are nine such combinations, as can be seen from FIG. 3.

As with the CDC table, one combination of interest among these nine is taken as the single category C1, and the group of all remaining combinations is taken as category C2. In FIG. 6, the single combination is SNP1 = 1 and SNP2 = 1, and all other combinations in FIG. 3 form the group of other combinations.
Because there are nine combinations, there are also nine choices for the single combination, so the SNP interaction basic table of FIG. 3 is divided into nine CDC interaction tables.

Returning to FIG. 1, the CDC table creation unit 2 reads the SNP interaction basic tables shown in FIG. 3 from the SNP database 11, one pair of different SNPs at a time. For each SNP interaction basic table read, the CDC table creation unit 2 regroups the nine combinations of genotype data into two categories, one combination on its own and the remaining eight combinations together, and generates nine CDC interaction tables, each consisting of two categories as shown in FIG. 6.

Here, the CDC table creation unit 2 obtains the numbers of non-affected and affected specimens in the category formed by the eight grouped combinations as follows. It adds the numbers of non-affected specimens of the eight grouped combinations in the SNP interaction basic table of FIG. 3 to obtain the number of non-affected specimens in the grouped category, and likewise adds the numbers of affected specimens of the eight grouped combinations to obtain the number of affected specimens in the grouped category. The CDC table creation unit 2 also assigns CDC interaction table identification information to each CDC interaction table it creates, associates the identification information with the CDC interaction table it indicates, and writes both to the SNP database 11 for storage.
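The one-versus-rest split of a 2 × 3 × 3 SNP interaction basic table into nine 2 × 2 CDC interaction tables can be sketched as follows. The function name, the index order `inter[cls][g1][g2]`, and the example counts are assumptions for illustration.

```python
def cdc_interaction_tables(inter):
    """Split an SNP interaction basic table into nine 2 x 2 tables.

    inter[cls][x][y] is the count for trait class cls, genotype x at the
    first SNP and genotype y at the second SNP. The result maps each
    genotype combination (x, y), taken alone as category C1, to a table
    whose rows are the classes and whose columns are (C1, C2), where C2
    pools the eight other combinations.
    """
    totals = [sum(sum(row) for row in inter[cls]) for cls in (0, 1)]
    return {
        (x, y): [[inter[cls][x][y], totals[cls] - inter[cls][x][y]]
                 for cls in (0, 1)]
        for x in range(3)
        for y in range(3)
    }


# Hypothetical interaction basic table for n = 4 specimens.
inter = [
    [[1, 0, 0], [0, 0, 1], [0, 0, 0]],  # non-affected
    [[0, 0, 0], [1, 0, 0], [0, 1, 0]],  # affected
]
cdc_x = cdc_interaction_tables(inter)
print(len(cdc_x))     # 9
print(cdc_x[(1, 0)])  # [[0, 2], [1, 1]]
```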

Next, FIG. 7 shows an example of an ADC interaction table, in which combinations of genotype data of two different SNPs in the SNP basic table of FIG. 2 are divided into two categories. Like a CDC interaction table, the ADC interaction table shown in FIG. 7 is a 2 × 2 table formed by two categories and the two classes into which the specimens of each category are classified, and it is stored in the SNP database 11. The ADC interaction table shown in FIG. 7 uses, as its dummy code, a plurality of combinations of the three genotype data of each of the two different SNPs.

For example, among the combinations of genotype data divided into the two categories (C1, C2) in FIG. 7, category C1 is the group formed from the combination of SNP1 = 0, SNP1 = 1 and SNP1 = 2, the combination of SNP2 = 0 and SNP2 = 1, the combination of SNP2 = 1 and SNP2 = 2, and the combination of SNP2 = 0 and SNP2 = 2, while category C2 is the group of genotype data in the SNP basic table other than those of category C1 described above. That is, the combination of any one genotype of category C1 with any one genotype of category C2 forms one group, the other eight combinations form the other group, and the numbers of specimens are added up within each class.

Returning to FIG. 1, the ADC table creation unit 3 selects combinations of different SNPs as SNP pairs from the SNP basic tables stored in the SNP database 11 and, for each SNP pair, generates every way of classifying all the genotype data of the pair into two categories. The ADC table creation unit 3 then reads the numbers of non-affected and affected specimens for each of the genotype data of the two different SNPs that form category C1, and adds these numbers class by class over the genotype data read out as category C1.

The ADC table creation unit 3 also reads, from the SNP basic tables stored in the SNP database 11, the numbers of specimens in the non-affected and affected classes for the genotype data of category C2, that is, the genotype data not included in category C1. It adds these numbers class by class over the genotype data read out as category C2, and thereby generates the ADC interaction table described above. As a result, for each SNP pair, ADC interaction tables are generated for every way of classifying the genotype data of the two SNPs into two categories.
The ADC table creation unit 3 also assigns ADC interaction table identification information to each ADC interaction table it creates, associates the identification information with the ADC interaction table it indicates, and writes both to the SNP database 11 for storage.

The CDC likelihood calculation unit 4 sequentially reads the CDC tables indicated by the CDC table identification information from the SNP database 11, performs a simple regression analysis with the combination of categories of each CDC table as the explanatory variable, and calculates its likelihood. That is, the two classes and two categories of the CDC table shown in FIG. 4 or FIG. 5 are extracted to form the 2 × 2 contingency table shown in FIG. 8. FIG. 8 is a conceptual diagram explaining how the likelihood or the p-value is calculated when a simple regression analysis is performed for each CDC table using the explanatory variable that the CDC table represents.
That is, the CDC likelihood calculation unit 4 sequentially reads the CDC tables from the SNP database 11 using the CDC table identification information and, from the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of the CDC table read, calculates the maximum log-likelihood of the logistic regression model (hereinafter, the likelihood) using the following equation (3).

In equation (3), as shown in FIG. 8, the number of specimens in category C1 belonging to the non-affected class is a and the number belonging to the affected class is c. The number of specimens in category C2 belonging to the non-affected class is b and the number belonging to the affected class is d.
The CDC likelihood calculation unit 4 writes the likelihood of each CDC table to the CDC likelihood storage unit 12, stored in association with the CDC table identification information that indicates the table.
Because there are 3m CDC tables for the genotype data of single SNPs, 3m likelihoods are calculated.
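Equation (3) is not reproduced in this text, so the following is only a sketch under a stated assumption: that the likelihood is the maximized log-likelihood of a logistic regression with an intercept and one binary explanatory variable (category C1 versus C2). For such a model the fitted probabilities equal the observed proportions in each category, which gives a closed form in a, b, c and d. The function names are hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention that 0 * log(0) = 0."""
    return x * math.log(y) if x > 0 else 0.0


def max_log_likelihood(a, b, c, d):
    """Maximized log-likelihood for a 2 x 2 table (an assumed form).

    a, c: non-affected and affected counts in category C1.
    b, d: non-affected and affected counts in category C2.
    """
    ll = 0.0
    for non_affected, affected in ((a, c), (b, d)):
        n = non_affected + affected
        if n == 0:
            continue  # an empty category contributes nothing
        ll += _xlogy(affected, affected / n)
        ll += _xlogy(non_affected, non_affected / n)
    return ll


# A perfectly separating category split reaches the upper bound of 0.
print(max_log_likelihood(10, 0, 0, 10))  # 0.0
```

The value is always at most 0, and a larger (less negative) value indicates a better fit, which is consistent with ranking tables in descending order of likelihood.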

Similarly, the CDC likelihood calculation unit 4 sequentially reads the CDC interaction tables indicated by the CDC interaction table identification information from the SNP database 11 and, from the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of the CDC interaction table read, calculates the likelihood of the logistic regression model using equation (3) above.
The CDC likelihood calculation unit 4 then writes the likelihood of each CDC interaction table to the CDC likelihood storage unit 12, stored in association with the CDC interaction table identification information that indicates the table.
Because there are m(m-1)/2 combinations of two different SNPs and each is divided into nine tables, there are 9m(m-1)/2 CDC interaction tables, and 9m(m-1)/2 likelihoods are calculated.
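The table counts stated above grow quadratically with the number of SNPs m. The short sketch below simply evaluates the formulas from the text; the function name and the example value m = 1000 are assumptions for illustration.

```python
def table_counts(m):
    """Number of CDC tables and CDC interaction tables for m SNPs."""
    pairs = m * (m - 1) // 2      # unordered pairs of different SNPs
    cdc = 3 * m                   # three CDC tables per SNP
    cdc_interaction = 9 * pairs   # nine CDC interaction tables per pair
    return {
        "pairs": pairs,
        "cdc": cdc,
        "cdc_interaction": cdc_interaction,
        "total": cdc + cdc_interaction,
    }


counts = table_counts(1000)
print(counts["total"])  # 4498500
```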

The CDC_p value calculation unit 5 sequentially reads the CDC tables indicated by the CDC table identification information from the SNP database 11, performs a simple regression analysis with the combination of categories of each CDC table as a dummy variable, that is, as the explanatory variable, and calculates its p-value. That is, from the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of the CDC table, the CDC_p value calculation unit 5 calculates the p-value corresponding to the CDC table (hereinafter, the CDC_p value) using the following equation (4).

In equation (4), as shown in FIG. 8, the number of specimens in category C1 belonging to the non-affected class is a and the number belonging to the affected class is c. The number of specimens in category C2 belonging to the non-affected class is b and the number belonging to the affected class is d.
The CDC_p value calculation unit 5 writes the p-value of each CDC table to the CDC_p value storage unit 13, stored in association with the CDC table identification information that indicates the table.
Because there are 3m CDC tables for the genotype data of single SNPs, 3m p-values are calculated.
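Equation (4) is not reproduced in this text. As one plausible reading, and only as an assumption, the p-value of the slope of a logistic regression on a 2 × 2 table can be obtained with a Wald test on the log odds ratio; the patent may use a different test (for example a likelihood-ratio test). The function name and the handling of zero cells are choices made for this sketch.

```python
import math


def wald_p_value(a, b, c, d):
    """Two-sided Wald p-value for the slope of a logistic regression on a
    2 x 2 table (an assumed stand-in for equation (4)).

    a, c: non-affected and affected counts in category C1.
    b, d: non-affected and affected counts in category C2.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("the Wald test needs all four cells to be positive")
    # The slope estimate is the log odds ratio of being affected, C1 vs C2.
    beta = math.log((c * b) / (a * d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = beta / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


print(wald_p_value(10, 10, 10, 10))  # 1.0, no association at all
```

A table with a strong association, such as `wald_p_value(30, 10, 10, 30)`, gives a much smaller p-value, which is the behaviour the ranking by ascending p-value relies on.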

Similarly, the CDC_p value calculation unit 5 sequentially reads the CDC interaction tables indicated by the CDC interaction table identification information from the SNP database 11 and, from the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of the CDC interaction table read, calculates the p-value of the logistic regression model using equation (4) above.
The CDC_p value calculation unit 5 then writes the p-value of each CDC interaction table to the CDC_p value storage unit 13, stored in association with the CDC interaction table identification information that indicates the table.
Because there are m(m-1)/2 combinations of two different SNPs and each is divided into nine tables, there are 9m(m-1)/2 CDC interaction tables, and 9m(m-1)/2 p-values are calculated.

Like the CDC_p value calculation unit 5, the ADC_p value calculation unit 6 treats the two classes and two categories of the ADC interaction table shown in FIG. 7 as the 2 × 2 contingency table shown in FIG. 8 and calculates the p-value of each ADC interaction table using equation (4) above.
Specifically, the ADC_p value calculation unit 6 sequentially reads, using the ADC interaction table identification information, the ADC interaction tables for the same pair of two different SNPs from the SNP database 11 and, from the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of each ADC interaction table read, calculates the BA (balanced accuracy) using the following equation (5).

The ADC_p value calculation unit 6 then selects the ADC interaction table corresponding to the combination of genotype data of the two different SNPs for which the calculated BA is largest. For example, the ADC_p value calculation unit 6 calculates the BA with equation (5) above for each of the ADC interaction tables covering all combinations of genotype data of SNP1 and SNP2, and selects the ADC interaction table corresponding to the combination of genotype data with the largest BA.
The ADC_p value calculation unit 6 performs this selection, that is, choosing among the combinations of genotype data of the two SNPs the one with the largest BA, for every pair of two SNPs among the m SNPs, reading the pairs sequentially from the SNP database 11.
Because the ADC_p value calculation unit 6 selects one ADC interaction table for each pair of two different SNPs, m(m-1)/2 ADC interaction tables are selected from all the ADC interaction tables stored in the SNP database 11.
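Equation (5) is not reproduced in this text, so the sketch below uses the standard definition of balanced accuracy, the mean of sensitivity and specificity, as an assumption. It treats category C1 as the "predicted affected" group, which is also an assumption, and picks the candidate table with the largest BA for one SNP pair. Function names, table identifiers, and counts are hypothetical.

```python
def balanced_accuracy(a, b, c, d):
    """Balanced accuracy of a 2 x 2 table (standard definition, assumed).

    a, c: non-affected and affected counts in category C1.
    b, d: non-affected and affected counts in category C2.
    C1 is read as 'predicted affected', C2 as 'predicted non-affected'.
    """
    sensitivity = c / (c + d)  # affected specimens that fall in C1
    specificity = b / (a + b)  # non-affected specimens that fall in C2
    return (sensitivity + specificity) / 2


def select_best_table(candidates):
    """Return the identifier of the candidate table with the largest BA.

    candidates maps a (hypothetical) table identifier to (a, b, c, d)."""
    return max(candidates, key=lambda key: balanced_accuracy(*candidates[key]))


# Two hypothetical ADC interaction tables for the same SNP pair.
candidates = {
    "adc_1": (10, 40, 30, 20),
    "adc_2": (25, 25, 25, 25),
}
best = select_best_table(candidates)
print(best)  # adc_1
```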

From the numbers of specimens in the non-affected class and the affected class in each of categories C1 and C2 of each selected ADC interaction table, the ADC_p value calculation unit 6 then calculates the p-value for the slope (regression coefficient) of the logistic regression model using equation (4) above.
The ADC_p value calculation unit 6 writes the p-value of each ADC interaction table to the ADC_p value storage unit 14, stored in association with the ADC interaction table identification information that indicates the table.
Because there are m(m-1)/2 SNP pairs, that is, m(m-1)/2 selected ADC interaction tables, m(m-1)/2 p-values are calculated.

As described above, the CDC tables and the CDC interaction tables together yield {3m + 9m(m-1)/2} likelihoods and the same number of p-values.
The m(m-1)/2 selected ADC interaction tables yield m(m-1)/2 p-values, as described above.
Which of these is used in the ranking described below is determined by user settings. That is, the user chooses whether the explanatory variables use the CDC or the ADC dummy code and, when CDC is used as the dummy code, whether the likelihood or the p-value serves as the evaluation value for ranking, and can change these settings freely according to what the user wants to evaluate.

When the ranking condition that the user enters through the input means (a keyboard or the like) specifies CDC with the likelihood as the evaluation value, the ranking creation unit 7 reads from the CDC likelihood storage unit 12 all the likelihoods stored in association with the CDC table identification information and the CDC interaction table identification information, sorts them in descending order, and obtains a ranking. Because a larger likelihood means that the explanatory variable is more likely to fit well, the ranking creation unit 7 ranks the likelihoods read out from 1st to {3m + 9m(m-1)/2}th in descending order.

The ranking creation unit 7 then writes the ranking of the likelihoods obtained from the CDC tables and the CDC interaction tables (the CDC likelihoods), from 1st to {3m + 9m(m-1)/2}th, to the ranking storage unit 15, storing the corresponding CDC table identification information and CDC interaction table identification information in descending order of likelihood.

When the ranking condition that the user enters through the input means specifies CDC with the p-value as the evaluation value, the ranking creation unit 7 reads from the CDC_p value storage unit 13 all the p-values stored in association with the CDC table identification information and the CDC interaction table identification information, sorts them in ascending order, and obtains a ranking. A smaller p-value means that, under the null hypothesis of no association with the trait, the observed values are less likely to occur, that is, the hypothesis of no association is less plausible. The ranking creation unit 7 therefore ranks the p-values read out from 1st to {3m + 9m(m-1)/2}th in ascending order.

The ranking creation unit 7 then writes the ranking of the p-values obtained from the CDC tables and the CDC interaction tables (the CDC_p values), from 1st to {3m + 9m(m-1)/2}th, to the ranking storage unit 15, storing the corresponding CDC table identification information and CDC interaction table identification information in order from the top of the ranking.
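The two CDC ranking modes differ only in sort direction: likelihoods are ranked in descending order and p-values in ascending order. A minimal sketch follows; the function name, the `by` parameter, the table identifiers, and the score values are all hypothetical.

```python
def rank_tables(scores, by):
    """Return table identifiers ordered from rank 1 downward.

    scores maps a table identifier to its evaluation value.
    by is "likelihood" (larger is better) or "p_value" (smaller is better).
    """
    if by not in ("likelihood", "p_value"):
        raise ValueError("by must be 'likelihood' or 'p_value'")
    return sorted(scores, key=scores.get, reverse=(by == "likelihood"))


# Hypothetical evaluation values for two CDC tables and one CDC interaction table.
likelihoods = {"cdc_1": -12.5, "cdc_2": -8.1, "cdcx_1": -10.0}
p_values = {"cdc_1": 0.2, "cdc_2": 0.001, "cdcx_1": 0.03}

by_likelihood = rank_tables(likelihoods, by="likelihood")
by_p_value = rank_tables(p_values, by="p_value")
print(by_likelihood)  # ['cdc_2', 'cdcx_1', 'cdc_1']
print(by_p_value)     # ['cdc_2', 'cdcx_1', 'cdc_1']
```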

When the ranking condition that the user enters through the input means specifies ADC, the ranking creation unit 7 reads from the ADC_p value storage unit 14 all the p-values stored in association with the ADC interaction table identification information, sorts them in ascending order, and obtains a ranking. As with the CDC_p values, a smaller p-value means that, under the null hypothesis of no association with the trait, the observed values are less likely to occur, that is, the hypothesis of no association is less plausible. The ranking creation unit 7 therefore ranks the p-values read out from 1st to {m(m-1)/2}th in ascending order.

The ranking creation unit 7 then writes the ranking of the p-values obtained from the ADC interaction tables (the ADC_p values), from 1st to {m(m-1)/2}th, to the ranking storage unit 15, storing the corresponding ADC interaction table identification information in order from the top of the ranking.

The dummy variable creation unit 8 selects, from the rankings created by the ranking creation unit 7, the ranking that corresponds to the ranking setting condition input by the user from the input means, selects the top r entries (for example, 256) from that ranking, and uses the dummy codes corresponding to the selected tables as explanatory variables to construct the multiple logistic regression model shown in equation (6) below. In this embodiment, selecting the top r entries identifies the strength of the influence that interactions between SNP genotypes across all human genes have on the qualitative trait (the quantitative trait in the second embodiment); that is, it identifies the SNP genotype combinations to be used as dummy variables in the regression model.
For example, when the user sets the ranking condition to ADC, the dummy variable creation unit 8 searches the ranking storage unit 15 for ADC interaction table identification information and reads out the ranking of ADC p-values stored in association with it.

The dummy variable creation unit 8 then selects the top r pieces of ADC interaction table identification information from the ADC p-value ranking it has read, and uses the dummy codes that form the categories of the ADC interaction tables indicated by that identification information as explanatory variables.
The value r is set to 256 here as an example because of computational limits, but it is desirable to use the number obtained from equation (7) below.

For example, suppose the 256 dummy variables selected by the dummy variable creation unit 8 include the combination of SNP genotype data in the CDC table shown in FIG. 9; of the 15 subjects in total, 6 are unaffected and 9 are affected. FIG. 9 shows an example of a CDC table in which, for an SNP basic table, the three genotypes are divided into two categories (a combination of any two genotypes, and the remaining genotype) and the samples in each category are classified by qualitative trait class.

In the case of the CDC table of FIG. 8, 6 subjects are unaffected and 9 are affected, so the response variable vector is y = (0,0,0,0,0,0,1,1,1,1,1,1,1,1,1). For the explanatory variable vector, the dummy code of a CDC table, CDC interaction table, or ADC interaction table is used as an explanatory variable (that is, a dummy variable), and the dummy variable is determined by assigning 0 or 1 according to the column of each 2×2 contingency table. As an example, a dummy variable vector is x = (0,1,0,0,0,0,1,1,0,1,0,0,1,0,1).
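As an illustration of how the response vector y and a dummy variable vector x relate to a 2×2 contingency table, the following sketch rebuilds the table from the two example vectors given above. The function names are hypothetical and the row/column layout (rows indexed by x, columns by y) is an assumption made for the example.

```python
from collections import Counter


def response_vector(n_unaffected: int, n_affected: int) -> list[int]:
    """Response vector with 0 for unaffected and 1 for affected subjects."""
    return [0] * n_unaffected + [1] * n_affected


def contingency_2x2(x: list[int], y: list[int]) -> list[list[int]]:
    """Count subjects per (dummy value, trait class) cell.

    Rows are x = 0, 1; columns are y = 0, 1 (layout chosen for this sketch).
    """
    counts = Counter(zip(x, y))
    return [[counts[(xi, yi)] for yi in (0, 1)] for xi in (0, 1)]


y = response_vector(6, 9)
x = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
table = contingency_2x2(x, y)
print(table)  # [[5, 4], [1, 5]]
```

Each selected table contributes one such 0/1 vector, so r selected tables give r columns of explanatory variables for the same 15 subjects.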

Returning to FIG. 1, the variable selection unit 9 uses the dummy variables selected by the dummy variable creation unit 8 as the explanatory variables of the logistic regression model described below.
That is, the variable selection unit 9 applies the explanatory variable vector of the logistic regression model to the dummy variables of the 256 selected 2×2 contingency tables and obtains the explanatory variable matrix, a matrix of 256-dimensional explanatory variables made up of the 256 dummy variables. The variable selection unit 9 then substitutes this explanatory variable matrix into the general logistic regression equation to generate the logistic regression model of equation (6).

Next, in selecting regression coefficients, the variable selection unit 9 uses, for example, an algorithm based on the SEE (Smooth-Threshold Estimating Equation) method to determine which of the 256 dummy variables actually has an effect as an explanatory variable in determining the objective variable of the regression analysis.
In the SEE processing, which is a penalized regression method (penalized maximum likelihood method), the variable selection unit 9 generates, as the explanatory variable matrix X, a design matrix of n (total number of samples) rows × 256 columns in which the 256 explanatory variables of the logistic regression model are arranged.
The variable selection unit 9 also takes β̂_j^(0) in equation (6) as the initial value of the regression coefficient by which the j-th dummy variable is multiplied and, using the tuning parameter λ (a smoothing parameter that controls the degree of local variation of the estimated curve), generates the set A(λ) given by equation (8) below.

In equation (9) below, the variable selection unit 9 finds, for the regression coefficients β_j with j ∈ A(λ), the values of β_j that maximize equation (9). The variable selection unit 9 sets the regression coefficients β_j that do not belong to A(λ) to 0 as a sparse solution (penalized regression method).
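Equations (8) and (9) are not reproduced in this text, so the following is only a sketch of how a set A(λ) could be formed. It assumes the smooth-threshold weight commonly used in the SEE literature, δ_j(λ) = min(1, λ / |β̂_j^(0)|^(1+τ)), with A(λ) the indices whose weight is below 1. The exponent parameter `tau`, the function names, and this specific form of δ_j are assumptions and may differ from the patent's equations.

```python
def smooth_threshold_weights(beta0: list[float], lam: float, tau: float = 1.0) -> list[float]:
    """Assumed SEE weights: delta_j = min(1, lam / |beta0_j|**(1 + tau))."""
    weights = []
    for b in beta0:
        if b == 0.0:
            weights.append(1.0)  # a zero initial estimate is always thresholded
        else:
            weights.append(min(1.0, lam / abs(b) ** (1.0 + tau)))
    return weights


def active_set(beta0: list[float], lam: float, tau: float = 1.0) -> list[int]:
    """Indices j kept in A(lam); all other coefficients are set to 0."""
    weights = smooth_threshold_weights(beta0, lam, tau)
    return [j for j, d in enumerate(weights) if d < 1.0]


beta0 = [2.0, 0.1, -1.0, 0.0]
print(active_set(beta0, lam=0.5))  # [0, 2]
```

Under this assumed form, a larger λ shrinks the active set, which is what makes λ the quantity tuned by the information criteria described later.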

In equation (9) above, l(β) is the log-likelihood function of the logistic regression model, given by equation (10) below.

Here, as the tuning parameter for obtaining the initial estimate β̂_j^(0) used in equation (8), the variable selection unit 9 uses the tuning parameter λ_2 of an L2-penalized logistic regression model. To obtain the initial estimate β̂_j^(0), the variable selection unit 9 uses this L2-penalized logistic regression model, sets δ_j(λ) in equation (9) to 0, and extracts the initial estimate β̂_j^(0) that maximizes equation (9). The L2-penalized logistic regression model adds a regularization term, the sum of squares of the initial estimates β̂_j^(0), and the tuning parameter λ_2 in equation (9) is set to the smallest value for which the matrix to be inverted in the update step of the Newton-Raphson method used to solve the maximization problem does not become singular. In this embodiment, the tuning parameter λ_2 is therefore set to 1/n (n being the total number of samples).
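The initial estimate can be illustrated with a small pure-Python Newton-Raphson sketch for L2-penalized logistic regression. It assumes the objective l(β) - λ_2·Σβ_j² (the exact scaling of the penalty in the patent's equations is not shown here), penalizes every column including the constant one for simplicity, and uses hypothetical function names.

```python
import math


def solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a·z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))) / m[i][i]
    return z


def ridge_logistic(X: list[list[float]], y: list[int], lam2: float, n_iter: int = 25) -> list[float]:
    """Maximize l(beta) - lam2 * sum(beta_j**2) by Newton-Raphson (assumed objective)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        prob = [1.0 / (1.0 + math.exp(-sum(X[i][j] * beta[j] for j in range(p)))) for i in range(n)]
        grad = [sum(X[i][j] * (y[i] - prob[i]) for i in range(n)) - 2.0 * lam2 * beta[j]
                for j in range(p)]
        # Negative Hessian; the 2*lam2 on the diagonal keeps it non-singular.
        hess = [[sum(X[i][j] * X[i][k] * prob[i] * (1.0 - prob[i]) for i in range(n))
                 + (2.0 * lam2 if j == k else 0.0) for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [bj + sj for bj, sj in zip(beta, step)]
    return beta


y = [0] * 6 + [1] * 9
x = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]
X = [[1.0, float(xi)] for xi in x]          # constant column plus one dummy variable
beta0 = ridge_logistic(X, y, lam2=1.0 / len(y))
print(beta0)
```

The penalty contributes 2·λ_2 to every diagonal entry of the matrix that is inverted, so even a value as small as 1/n makes the Newton step well defined when dummy variables are collinear.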

For a given tuning parameter λ, the variable selection unit 9 also takes the β_j that maximizes equation (9) above as β̂_j, substitutes it into equation (11) below, and calculates the EBIC (Extended Bayesian Information Criterion).

Here, the variable selection unit 9 finds the tuning parameter λ that minimizes equation (11), that is, the EBIC, and with this λ finds the β_j that maximizes the SEE equation (9). In equation (11), γ is a constant defined on the interval [0, 1], and the value p is {9m(m-1)/2 + 3m} for CDC and {m(m-1)/2} for ADC.
Similarly, the AIC shown in equation (12) below or the BIC shown in equation (13) may be used as the criterion for selecting the tuning parameter λ. In those equations, β̂ and p take the same values as in equation (11).
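Equations (11) to (13) are not reproduced in this text. The sketch below therefore uses the textbook forms of the three criteria as an assumption: AIC = -2l + 2·df, BIC = -2l + df·log n, and EBIC = BIC + 2γ·log C(p, df), where df is the number of nonzero coefficients. The function names and the `fits` structure are hypothetical.

```python
import math


def aic(loglik: float, df: int) -> float:
    return -2.0 * loglik + 2.0 * df


def bic(loglik: float, df: int, n: int) -> float:
    return -2.0 * loglik + df * math.log(n)


def ebic(loglik: float, df: int, n: int, p: int, gamma: float) -> float:
    """Assumed EBIC form: BIC plus 2*gamma*log(number of models of size df)."""
    return bic(loglik, df, n) + 2.0 * gamma * math.log(math.comb(p, df))


def select_lambda(fits: dict[float, tuple[float, int]], n: int, p: int, gamma: float) -> float:
    """Pick the lambda whose fit (log-likelihood, df) minimizes the EBIC."""
    return min(fits, key=lambda lam: ebic(fits[lam][0], fits[lam][1], n, p, gamma))


# Made-up fits for illustration only: lambda -> (log-likelihood, nonzero coefficients).
fits = {0.1: (-8.0, 5), 0.5: (-9.0, 2), 1.0: (-15.0, 1)}
best = select_lambda(fits, n=15, p=36, gamma=0.5)
print(best)
```

With γ = 0 this EBIC reduces to the BIC, which matches the description of γ as the extra tuning parameter that EBIC carries and BIC does not.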

Through the processing described above, the variable selection unit 9 obtains the regression coefficients β_j of the top r dummy variables x in the ranking supplied from the dummy variable creation unit 8, and outputs the obtained regression coefficients β_j to the regression analysis processing unit 10.

The regression analysis processing unit 10 extracts, from the regression coefficients β_j corresponding to each of the r dummy variables supplied from the variable selection unit 9, the regression coefficients β_j that are 0.
The regression analysis processing unit 10 then removes from equation (6), which consists of the r dummy variables, the dummy variables whose regression coefficient β_j is 0, and generates a logistic regression model made up of the dummy variables whose regression coefficient β_j is nonzero.
Using the logistic regression model obtained in this way, the regression analysis processing unit 10 substitutes dummy variable data generated from the SNP genotypes of a new sample and obtains the logit of the probability that the patient who provided the new sample is affected. If the logit is positive, the probability of being affected is 50% or more, and the larger the value, the higher the probability. If the logit is negative, the probability of being affected is less than 50%, and the larger the absolute value, the lower the probability.
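The relation between the logit and the probability of being affected can be sketched as follows. The intercept argument and the function names are assumptions for the example, since equation (6) is not reproduced in this text.

```python
import math


def logit_of_sample(beta: list[float], x: list[int], intercept: float = 0.0) -> float:
    """Linear predictor using only the dummy variables with nonzero coefficients."""
    return intercept + sum(b * xi for b, xi in zip(beta, x) if b != 0.0)


def probability(logit_value: float) -> float:
    """Inverse logit: probability of being affected."""
    return 1.0 / (1.0 + math.exp(-logit_value))


beta = [1.5, 0.0, -0.5]          # made-up coefficients; the second variable was dropped
new_sample = [1, 1, 1]
z = logit_of_sample(beta, new_sample)
print(z, probability(z))
```

A logit of exactly 0 corresponds to a probability of 0.5, so the sign of the logit alone tells which side of 50% a new sample falls on.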

The CDC and ADC coding methods for dummy variables described above and the coding methods conventionally known in genetics each have advantages and disadvantages.
The advantage of the conventional coding method is that, when a model in which risk increases as 0, 1, 2 with the number of alleles at an SNP is appropriate, the distribution of the statistic can be approximated accurately, so the type I error can be controlled with multiple testing procedures.
Its disadvantage is that, in reality, interactions follow a mixture of genotype patterns such as recessive, dominant, and additive, so analysis with a single coding method is one reason significant interactions are overlooked. This disadvantage is the problem addressed by this embodiment.

Therefore, in this embodiment, the user needs to choose according to the analysis target: select either CDC or ADC, or run both and select the regression model that gives the more accurate result.
The advantage of CDC is that cutting out 2×2 contingency tables centered on each cell of the contingency table eliminates the influence of cells with few observations and accommodates patterns of interaction between genotypes or among multiple genotypes.
The disadvantage of CDC is that, because a single interaction is treated as independent factors in the form of multiple 2×2 contingency tables, the interaction may be evaluated redundantly.

The advantage of ADC is that it eliminates the influence of cells with few observations by the same principle as CDC. That is, by exhaustively searching the set of 2×2 contingency tables created by regrouping the cells of the interaction between different SNPs (the groups serving as categories into which genotypes are classified), it greatly reduces the chance of overlooking an interaction.
The disadvantage of ADC is that, when regrouping markedly increases the variance in the optimal 2×2 contingency table finally obtained, the odds ratio and other quantities calculated from it cannot be trusted.

Next, for the ranking performed by the ranking creation unit 7, the advantages and disadvantages of ranking CDC-generated dummy variables by likelihood, ranking CDC-generated dummy variables by p-value, and ranking ADC-generated dummy variables by p-value are described below.
(a) Combination of CDC and likelihood
The advantage is that it can be interpreted within the framework of dummy variable selection and is a consistent, natural approach that leads into the subsequent dummy variable selection step. It is the statistic used in the original SIS.
The disadvantage is that it has less of a track record in epidemiological research than the p-value of the 2×2 contingency table odds ratio. In addition, because the genotype interaction of one SNP is decomposed into factors in the form of multiple 2×2 contingency tables that are treated as independent, the interaction may be evaluated redundantly.

(b) Combination of CDC and p-value
The advantage is that the p-value of the odds ratio of a 2×2 contingency table is widely used in epidemiological research and has many well-known good properties, which makes it a safe choice of statistic.
The disadvantage is that, because the genotype interaction of one SNP is decomposed into factors in the form of multiple 2×2 contingency tables that are treated as independent, the interaction may be evaluated redundantly.
(c) Combination of ADC and p-value
The advantage is that adaptively regrouping the cells of the SNP interaction and constructing a single statistic that represents the interaction removes the redundancy present in CDC.
The disadvantage is that the increase in variance caused by regrouping the cells of the SNP interaction is not taken into account, which can produce spurious ranking results.

In the selection of regression coefficients performed by the variable selection unit 9, SCAD (smoothly clipped absolute deviation) and the SEE used in this embodiment have the following advantages and disadvantages.
The advantage of SCAD is that it does not require an initial estimate of the regression coefficients.
The disadvantage of SCAD is that no accurate and simple algorithm exists for solving for the regression coefficients.
SEE is used in this embodiment to overcome this disadvantage of SCAD. It requires an initial estimate of the regression coefficients, but an accurate and simple algorithm exists for solving it.

Next, for the selection of regression coefficients performed by the variable selection unit 9, the advantages and disadvantages of determining the tuning parameter for the SEE regression coefficients by AIC, by BIC, and by EBIC are described below.
(a) Method using AIC
AIC is known not to guarantee consistency of variable selection, a theoretically desirable property, so its use is not recommended when the goal is variable selection.
(b) Method using BIC
The advantage is that consistency of variable selection is known to be guaranteed when the dimension of the variables is relatively small. Unlike EBIC, it involves no additional tuning parameter. It also has a long empirical track record.
The disadvantage is that consistency of variable selection can break down when the dimension of the variables is large.

(c) Method using EBIC
The advantage is that consistency of variable selection holds even when the dimension of the variables is large.
The disadvantage is that false positives and false negatives must be controlled through a tuning parameter, but no generally reliable method exists for adjusting it appropriately.
Consistency of variable selection, as used above, means that when the true regression coefficients are divided into nonzero and zero ones, this division can be identified correctly; more precisely, the probability of identifying it correctly converges to 1 as the number of samples grows.

As described above, in this embodiment the genotypes of each SNP are classified into two categories, the genotype combinations of each pair of different SNPs are likewise classified into two categories, and each classification is used as a dummy variable. From these dummy variables, those that contribute strongly to the determination that a subject is affected are selected, and a regression coefficient is obtained for each. Conventionally, analyzing interactions between different SNPs that affect disease required an enormous amount of data to determine the dummy variables and their regression coefficients when generating a regression model. In this embodiment, dummy variables are selected according to their contribution, so the number of samples needed to generate the regression model can be reduced compared with the conventional approach.

Next, the flow of processing for selecting dummy variables and calculating regression coefficients in a regression model using the gene interaction analysis system of the first embodiment is described with reference to FIG. 10. FIG. 10 is a flowchart showing an example of this operation.
Step S1:
The basic table creation unit 1 reads the genotype data of each SNP of the samples from the SNP database 11 and generates an SNP basic table for each SNP.
The basic table creation unit 1 then assigns to each created SNP basic table SNP basic table identification information that identifies it, and writes the SNP basic table to the SNP database 11 together with the associated identification information.

Step S2:
The basic table creation unit 1 reads the genotype data of each SNP of the samples from the SNP database 11 and generates SNP interaction basic tables, which are used to determine the dummy variables for the interaction between the genotypes of two different SNPs.
The basic table creation unit 1 then assigns to each created SNP interaction basic table SNP interaction basic table identification information that identifies it, and writes the SNP interaction basic table to the SNP database 11 together with the associated identification information.

Step S3:
When an external device supplies the gene interaction analysis system with a signal selecting CDC or ADC as the dummy variable coding method, a control unit (not shown) of the gene interaction analysis system advances the processing to step S4 for CDC and to step S10 for ADC.

Step S4:
The CDC table creation unit 2 sequentially reads from the SNP database 11, SNP by SNP, the SNP basic tables to which SNP basic table identification information has been assigned, groups the genotype data, and generates CDC tables, which are 2×2 contingency tables.
The CDC table creation unit 2 then assigns to each created CDC table CDC table identification information that identifies it, and writes the CDC table to the SNP database 11 together with the associated identification information.

Step S5:
The CDC table creation unit 2 sequentially reads from the SNP database 11, for each pair of two different SNPs, the SNP interaction basic tables to which SNP interaction basic table identification information has been assigned, divides the nine genotype combinations into two categories, and generates CDC interaction tables, which are 2×2 contingency tables.
The CDC table creation unit 2 then assigns to each created CDC interaction table CDC interaction table identification information that identifies it, and writes the CDC interaction table to the SNP database 11 together with the associated identification information.

Step S6:
When an external device supplies the gene interaction analysis system with a signal selecting likelihood or p-value (the p-value of the odds ratio) as the evaluation value for ranking the dummy variables, the control unit (not shown) of the gene interaction analysis system advances the processing to step S7 for likelihood and to step S8 for p-value.

Step S7:
The CDC likelihood calculation unit 4 sequentially reads the CDC tables to which CDC table identification information has been assigned and calculates, by equation (3), the likelihood that indicates the strength of the association between the SNP and the disease.
The CDC likelihood calculation unit 4 then associates each calculated likelihood with the CDC table identification information and writes them to the CDC likelihood storage unit 12.

Step S8:
The CDC_p value calculation unit 5 sequentially reads the CDC tables to which CDC table identification information has been assigned and calculates, by equation (4), the p-value that indicates the strength of the association between the SNP and the disease.
The CDC_p value calculation unit 5 then associates each calculated p-value with the CDC table identification information and writes them to the CDC_p value storage unit 13.
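Equation (4) is not reproduced in this part of the text. As a stand-in, the following sketch computes a two-sided p-value for the odds ratio of a 2×2 table with a Wald test on the log odds ratio. The choice of test, the 0.5 correction for empty cells, and the function name are assumptions and not necessarily what equation (4) prescribes.

```python
import math


def odds_ratio_p_value(table: list[list[int]]) -> float:
    """Two-sided Wald p-value for log(odds ratio) of a 2x2 table (assumed test)."""
    a, b = table[0]
    c, d = table[1]
    if 0 in (a, b, c, d):
        # Assumed handling of empty cells: add 0.5 to every cell.
        a, b, c, d = (v + 0.5 for v in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    z = abs(log_or) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))


table = [[5, 4], [1, 5]]  # rows: dummy value 0/1, columns: unaffected/affected
p = odds_ratio_p_value(table)
print(p)
```

Whatever statistic is used, only the ordering of the p-values matters for the ranking step, so tables with stronger association sort toward the top.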

Step S9:
The ADC table creation unit 3 sequentially reads from the SNP database 11, for each pair of two different SNPs, the SNP basic tables to which SNP basic table identification information has been assigned, performs as many groupings as correspond to all combinations of the genotype data of the two SNPs, and generates ADC interaction tables, each a 2×2 contingency table for one grouping.
The ADC table creation unit 3 then assigns to each created ADC interaction table ADC interaction table identification information that identifies it, and writes the ADC interaction table to the SNP database 11 together with the associated identification information.
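The exhaustive regrouping behind ADC can be sketched by enumerating every way of splitting the nine genotype combinations of an SNP pair into two non-empty groups. The genotype labels and function name are hypothetical, and the resulting count of 255 follows from this enumeration rather than from a figure stated in the text.

```python
from itertools import combinations

GENOTYPES = ("AA", "Aa", "aa")  # hypothetical labels for the three genotypes
CELLS = [(g1, g2) for g1 in GENOTYPES for g2 in GENOTYPES]  # 9 combinations


def regroupings(cells):
    """Yield each split of the cells into two non-empty groups exactly once.

    The first cell is pinned to group A so that mirrored splits are not
    produced twice.
    """
    first, rest = cells[0], cells[1:]
    for k in range(len(rest)):  # k < len(rest) keeps group B non-empty
        for extra in combinations(rest, k):
            group_a = [first, *extra]
            group_b = [c for c in rest if c not in extra]
            yield group_a, group_b


all_splits = list(regroupings(CELLS))
print(len(all_splits))  # 255
```

Each split defines one candidate 2×2 table for the SNP pair; in an ADC-style search the table with the smallest p-value would then represent that pair in the ranking.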

Step S10:
The ADC_p value calculation unit 6 sequentially reads the ADC interaction tables to which ADC interaction table identification information has been assigned and calculates, by equation (4), the p-value that indicates the strength of the association between the SNPs and the disease.
The ADC_p value calculation unit 6 then associates each calculated p-value with the ADC interaction table identification information and writes them to the ADC_p value storage unit 14.

Step S11:
The ranking creation unit 7 performs ranking according to the coding method the user has set for the dummy variables (CDC or ADC) and, when the coding method is CDC, the evaluation value used for ranking (likelihood or p-value).
That is, when the coding method is CDC and the evaluation value is likelihood, the ranking creation unit 7 sequentially reads the likelihood of each CDC table and CDC interaction table from the CDC likelihood storage unit 12 and ranks them in descending order of likelihood.
The ranking creation unit 7 then arranges the CDC table identification information and CDC interaction table identification information in ranking order and writes them to the ranking storage unit 15.

一方、ランキング作成部７は、コーディング方法がＣＤＣであり、かつ評価値がｐ値である場合、ＣＤＣ＿ｐ値記憶部１３からＣＤＣテーブル及びＣＤＣ相互作用テーブル毎のｐ値を順次読み出し、ｐ値の小さい順番に配列させるランキング処理を行う。
そして、ランキング作成部７は、尤度の場合と同様に、ランキングの配列の順番に、ＣＤＣテーブル識別情報及びＣＤＣ相互作用テーブル識別情報を配列させて、ランキング記憶部１５に書き込んで記憶させる。
また、ランキング作成部７は、コーディング方法がＡＤＣである場合、ＡＤＣ＿ｐ値記憶部１４からＡＤＣ相互作用テーブル毎のｐ値を順次読み出し、ｐ値を小さい順番に配列させるランキング処理を行う。
そして、ランキング作成部７は、ＣＤＣの場合と同様に、ランキングの配列の順番に、ＡＤＣ相互作用テーブル識別情報を配列させて、ランキング記憶部１５に書き込んで記憶させる。 On the other hand, when the coding method is CDC and the evaluation value is the p value, the ranking creation unit 7 sequentially reads the p value for each CDC table and each CDC interaction table from the CDC_p value storage unit 13 and performs ranking processing that arranges them in ascending order of p value.
Then, as in the case of the likelihood, the ranking creation unit 7 arranges the CDC table identification information and the CDC interaction table identification information in ranking order, and writes and stores them in the ranking storage unit 15.
Further, when the coding method is ADC, the ranking creation unit 7 sequentially reads the p value for each ADC interaction table from the ADC_p value storage unit 14 and performs ranking processing that arranges the p values in ascending order.
Then, as in the case of CDC, the ranking creation unit 7 arranges the ADC interaction table identification information in ranking order, and writes and stores it in the ranking storage unit 15.
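The ranking rule described above (likelihood in descending order, p value in ascending order) can be illustrated with a minimal Python sketch. The function name `rank_tables`, the table identifiers, and the numeric values are hypothetical and are not part of the described system.

```python
def rank_tables(scores, evaluation="p"):
    """Return table IDs ordered from strongest to weakest association.

    scores: dict mapping a table ID to its evaluation value.
    evaluation: "likelihood" sorts in descending order, "p" in ascending order.
    """
    if evaluation not in ("likelihood", "p"):
        raise ValueError("evaluation must be 'likelihood' or 'p'")
    descending = evaluation == "likelihood"
    return sorted(scores, key=scores.get, reverse=descending)


# Hypothetical p values for three CDC / CDC interaction tables.
cdc_p_values = {"cdc_001": 0.03, "cdc_int_017": 0.0004, "cdc_int_042": 0.2}
ranking = rank_tables(cdc_p_values, evaluation="p")
```

Only the ordered identifiers are returned, which mirrors the text: the ranking storage unit 15 holds table identification information arranged in ranking order.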

ステップＳ１２：
ダミー変数作成部８は、ユーザの設定したダミー変数を設定する際のコーディング方法がＣＤＣ場合とＡＤＣの場合に対応し、ダミー変数の選択の処理を行う。
すなわち、ダミー変数作成部８は、コーディング方法がＣＤＣである場合、ランキング記憶部１５に記憶されているＣＤＣテーブル識別情報及びＣＤＣ相互作用テーブル識別情報の配列において、配列の順番にｒ個を順次読み出し、読み出したｒ個のＣＤＣテーブルのジェノタイプデータの分類されたカテゴリをダミー変数として設定する。
同様に、ダミー変数作成部８は、コーディング方法がＡＤＣである場合、ランキング記憶部１５に記憶されているＡＤＣ相互作用テーブル識別情報の配列において、配列の順番にｒ個を順次読み出し、読み出したｒ個のＡＤＣ相互作用テーブルのジェノタイプデータの分類されたカテゴリをダミー変数として設定する。 Step S12:
The dummy variable creation unit 8 selects dummy variables according to whether the coding method that the user has chosen for setting dummy variables is CDC or ADC.
That is, when the coding method is CDC, the dummy variable creation unit 8 sequentially reads r entries, in ranking order, from the array of CDC table identification information and CDC interaction table identification information stored in the ranking storage unit 15, and sets the categories into which the genotype data of the r CDC tables thus read are classified as dummy variables.
Similarly, when the coding method is ADC, the dummy variable creation unit 8 sequentially reads r entries, in ranking order, from the array of ADC interaction table identification information stored in the ranking storage unit 15, and sets the categories into which the genotype data of the r ADC interaction tables thus read are classified as dummy variables.
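As an illustration of step S12, the following Python sketch takes the top r tables of a ranking and turns each into a 0/1 dummy variable per specimen. It covers only tables defined on SNP pairs, and the convention that category C1 is coded as 1 is an assumption made here. `build_dummy_variables` and the data layout are hypothetical.

```python
def build_dummy_variables(ranking, tables, genotypes, r):
    """Build one 0/1 column per top-ranked table.

    ranking: table IDs, strongest association first.
    tables: dict of table ID -> (snp_a, snp_b, set of (g_a, g_b) pairs in C1).
    genotypes: list of dicts, one per specimen, mapping SNP name -> 0, 1 or 2.
    r: number of tables to keep.
    """
    columns = {}
    for table_id in ranking[:r]:
        snp_a, snp_b, category_c1 = tables[table_id]
        # Assumed coding: 1 if the specimen falls in category C1, else 0.
        columns[table_id] = [
            1 if (g[snp_a], g[snp_b]) in category_c1 else 0 for g in genotypes
        ]
    return columns


tables = {
    "t1": ("SNP1", "SNP2", {(0, 1)}),
    "t2": ("SNP1", "SNP2", {(0, 1), (2, 1)}),
    "t3": ("SNP1", "SNP2", {(1, 0)}),
}
genotypes = [
    {"SNP1": 0, "SNP2": 1},
    {"SNP1": 2, "SNP2": 1},
    {"SNP1": 1, "SNP2": 0},
]
x = build_dummy_variables(["t2", "t1", "t3"], tables, genotypes, r=2)
```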

ステップＳ１３：
回帰分析処理部１０は、ダミー変数作成部８の設定したダミー変数を説明変数ｘ_ｊとする回帰モデルとして（６）式を生成する。
そして、変数選択部９は、回帰分析処理部１０の生成した（６）式における説明変数ｘ_ｊの各々の回帰係数β_ｊの算出を行う。このとき、回帰係数βを算出する際のチューニングパラメータλを設定するための評価値として、ＡＩＣ、ＢＩＣ及びＥＢＩＣのいずれを使用するかは、ユーザが遺伝子間相互作用解析システムに対して設定する。 Step S13:
The regression analysis processing unit 10 generates equation (6) as a regression model whose explanatory variables x_j are the dummy variables set by the dummy variable creation unit 8.
Then, the variable selection unit 9 calculates the regression coefficient β_j of each explanatory variable x_j in equation (6) generated by the regression analysis processing unit 10. At this time, the user specifies to the gene interaction analysis system which of AIC, BIC, and EBIC is used as the evaluation value for setting the tuning parameter λ when the regression coefficients β are calculated.
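The choice of λ by an information criterion can be sketched as follows. Equation (6) and the penalized fit itself are not reproduced in this section, so the sketch assumes that a list of candidate fits (λ, log-likelihood, number of non-zero coefficients) is already available. The AIC, BIC, and EBIC formulas are the commonly used definitions and may differ from those the specification defines elsewhere. The parameter `gamma` and all names are hypothetical.

```python
import math


def information_criterion(log_likelihood, k, n, p, kind="BIC", gamma=1.0):
    """Score a fitted model with k non-zero coefficients.

    n: number of specimens, p: number of candidate dummy variables.
    Uses the commonly cited forms (assumed, not taken from the source):
      AIC  = -2 log L + 2k
      BIC  = -2 log L + k log n
      EBIC = BIC + 2 * gamma * log C(p, k)
    """
    deviance = -2.0 * log_likelihood
    if kind == "AIC":
        return deviance + 2.0 * k
    if kind == "BIC":
        return deviance + k * math.log(n)
    if kind == "EBIC":
        return deviance + k * math.log(n) + 2.0 * gamma * math.log(math.comb(p, k))
    raise ValueError("kind must be 'AIC', 'BIC' or 'EBIC'")


def select_lambda(fits, n, p, kind="BIC"):
    """fits: list of (lambda, log_likelihood, k). Return the best lambda."""
    best = min(fits, key=lambda f: information_criterion(f[1], f[2], n, p, kind))
    return best[0]


# Hypothetical candidate fits along a lambda path.
fits = [(0.5, -120.0, 2), (0.1, -110.0, 6), (0.01, -108.0, 40)]
lam_aic = select_lambda(fits, n=200, p=256, kind="AIC")
lam_bic = select_lambda(fits, n=200, p=256, kind="BIC")
```

With these made-up numbers AIC prefers the denser model and BIC the sparser one, which is why the choice of criterion is left to the user.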

ステップＳ１４：
回帰分析処理部１０は、（６）式の説明変数ｘ_ｊの各々に対し、変数選択部９の算出した回帰係数β_ｊを代入し、対応する目的変数を求める回帰モデルを生成する。 Step S14:
The regression analysis processing unit 10 substitutes, for each explanatory variable x_j in equation (6), the regression coefficient β_j calculated by the variable selection unit 9, and generates a regression model for obtaining the corresponding objective variable.

＜第２の実施形態＞
上述した第１の実施形態はロジスティック回帰法に対する処理であった。本第２の実施形態は、線形回帰モデルを作成する際のダミー変数選択及び回帰係数の決定の処理を対象としたものである。
次に、図面を参照して、本発明の実施の形態について説明する。図１１は、この発明の第２の実施形態による遺伝子間相互作用解析システムの構成例を示す概略ブロック図である。本実施形態においては、遺伝子間相互作用解析システムは、基礎テーブル作成部１、ＣＤＣテーブル作成部２、ＡＤＣテーブル作成部３、ＣＤＣ＿ｔ値算出部１６、ＡＤＣ＿ｔ値算出部１７、ランキング作成部７、ダミー変数作成部８、変数選択部９、回帰分析処理部１０、ＳＮＰデータベース１１、ＣＤＣ＿ｔ値記憶部１８、ＡＤＣ＿ｔ値記憶部１９、ランキング記憶部１５を備えている。 <Second Embodiment>
The first embodiment described above is a process for the logistic regression method. The second embodiment is directed to dummy variable selection and regression coefficient determination processing when creating a linear regression model.
Next, embodiments of the present invention will be described with reference to the drawings. FIG. 11 is a schematic block diagram showing a configuration example of a gene interaction analysis system according to the second embodiment of the present invention. In this embodiment, the gene interaction analysis system includes a basic table creation unit 1, a CDC table creation unit 2, an ADC table creation unit 3, a CDC_t value calculation unit 16, an ADC_t value calculation unit 17, a ranking creation unit 7, a dummy variable creation unit 8, a variable selection unit 9, a regression analysis processing unit 10, an SNP database 11, a CDC_t value storage unit 18, an ADC_t value storage unit 19, and a ranking storage unit 15.

以下、第２の実施形態が第１の実施形態と異なる構成及び動作の説明を行う。まず、第１の実施形態と異なる点は、ＣＤＣのコーディング法において、単一のＳＮＰのジェノタイプの組合せがないことである。第２の実施形態においては、第１の実施形態におけるＣＤＣ相互作用テーブルとＡＤＣ相互作用テーブルとが生成される。
また、ＳＮＰデータベース１１には、検体毎に対応して、量的形質のデータ（例えば、体重、血糖値など）とＳＮＰ各々のジェノタイプデータとが対応して記憶されている。 Hereinafter, the configuration and operation of the second embodiment different from those of the first embodiment will be described. First, the difference from the first embodiment is that there is no combination of single SNP genotypes in the CDC coding method. In the second embodiment, the CDC interaction table and the ADC interaction table in the first embodiment are generated.
The SNP database 11 stores quantitative trait data (for example, body weight, blood glucose level, etc.) and genotype data of each SNP corresponding to each specimen.

図１２は、第２の実施形態におけるＳＮＰ相互作用基礎テーブルの構成例を示すものである。図１２においては、全てのＳＮＰにおいて、２つの異なる種類のＳＮＰのジェノタイプデータ毎における２次相互作用に対応した検体の質的形質のクラスの分類を示すＳＮＰ相互作用基礎テーブルの一例を示している。この図１２において、ＳＮＰの種類としてはＳＮＰ１及びＳＮＰ２を用い、ＳＮＰ１のジェノタイプデータをＳＮＰ１＝０、ＳＮＰ２及びＳＮＰ＝２とし、ＳＮＰ２のジェノタイプデータをＳＮＰ２＝０、ＳＮＰ２＝１及びＳＮＰ２＝２とする。すなわち、ＳＮＰ相互作用基礎テーブルは、ＳＮＰ１のジェノタイプデータとＳＮＰジェノタイプデータとの組合せ毎に、検体の量的形質の積算値が示されている。 FIG. 12 shows a configuration example of the SNP interaction basic table in the second embodiment. FIG. 12 shows an example of an SNP interaction basic table that, over all SNPs, classifies the specimens according to the second-order interaction between the genotype data of two different types of SNPs. In FIG. 12, SNP1 and SNP2 are used as the SNP types, the genotype data of SNP1 are SNP1 = 0, SNP1 = 1 and SNP1 = 2, and the genotype data of SNP2 are SNP2 = 0, SNP2 = 1 and SNP2 = 2. That is, the SNP interaction basic table indicates the integrated value of the quantitative trait of the specimens for each combination of the SNP1 genotype data and the SNP2 genotype data.

例えば、図１２において、ジェノタイプデータＳＮＰ１＝０及びＳＮＰ２＝０の組合せに対し、検体数がｎ０１個あり、このｎ０１個の検体における量的形質の数値の合計値がｓ０１である。また、ジェノタイプデータＳＮＰ１＝２及びＳＮＰ２＝１の組合せに対し、検体数がｎ１１個であり、このｎ１１個の検体における量的形質の数値の合計値がｓ１１である。
ここで、ｎ００＋ｎ０１＋ｎ０２＋ｎ１０＋ｎ１１＋ｎ１２＋ｎ２０＋ｎ２１＋ｎ２２＝ｎ（総検体数）である。 For example, in FIG. 12, the number of specimens is n01 for the combination of genotype data SNP1 = 0 and SNP2 = 0, and the total value of the numerical values of quantitative traits in these n01 specimens is s01. Further, for the combination of genotype data SNP1 = 2 and SNP2 = 1, the number of specimens is n11, and the total value of the numerical values of quantitative traits in these n11 specimens is s11.
Here, n00 + n01 + n02 + n10 + n11 + n12 + n20 + n21 + n22 = n (total number of samples).
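The structure of the SNP interaction basic table (a specimen count and a quantitative-trait sum for each of the nine genotype combinations) can be sketched in Python as follows. The function `build_basic_table` and the sample data are hypothetical. The sketch keys cells by the pair (SNP1 genotype, SNP2 genotype) and makes no claim about how the subscripts of n and s in FIG. 12 are ordered.

```python
def build_basic_table(geno_a, geno_b, trait):
    """Return {(g_a, g_b): [count, trait_sum]} for the nine genotype combinations.

    geno_a, geno_b: genotype codes (0, 1 or 2) of two SNPs, one entry per specimen.
    trait: quantitative trait value of each specimen (e.g. body weight).
    """
    table = {(a, b): [0, 0.0] for a in range(3) for b in range(3)}
    for a, b, y in zip(geno_a, geno_b, trait):
        cell = table[(a, b)]
        cell[0] += 1  # number of specimens in this cell
        cell[1] += y  # running sum of the quantitative trait
    return table


# Hypothetical data for five specimens.
snp1 = [0, 0, 1, 2, 2]
snp2 = [0, 0, 1, 1, 1]
weight = [60.0, 62.5, 70.0, 55.0, 58.0]
table = build_basic_table(snp1, snp2, weight)
total_specimens = sum(count for count, _ in table.values())
```

The cell counts add up to the total number of specimens n, as stated in the text.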

図１に戻り、基礎テーブル作成部１は、ＳＮＰデータベース１１に記憶されているｍ個のＳＮＰから２つの異なる種類のＳＮＰの組合せを順次選択する。
そして、基礎テーブル作成部１は、選択されたＳＮＰの組（以下、ＳＮＰペア）において、ＳＮＰペアにおけるジェノタイプデータの組合わせ毎（例えば、ジェノタイプデータＳＮＰ１＝０及びＳＮＰ２＝０の組合せ）に検体を分類する。また、基礎テーブル作成部１は、各ＳＮＰペアにおけるジェノタイプデータの組合わせの検体を質的形質の２クラスに分類し、分類した結果を図１２に示すＳＮＰ相互作用基礎テーブルとする。基礎テーブル作成部１は、上述したように、全ての種類（ｍ個）のＳＮＰから、異なる種類の２つのＳＮＰに対して、図１２のＳＮＰ相互作用基礎テーブルを生成する。 Returning to FIG. 1, the basic table creation unit 1 sequentially selects combinations of two different types of SNPs from m SNPs stored in the SNP database 11.
Then, in each selected pair of SNPs (hereinafter, SNP pair), the basic table creation unit 1 classifies the specimens by combination of genotype data in the SNP pair (for example, the combination of genotype data SNP1 = 0 and SNP2 = 0). Further, the basic table creation unit 1 classifies the specimens of each genotype data combination in each SNP pair into the two classes of the qualitative trait, and uses the classified result as the SNP interaction basic table shown in FIG. 12. As described above, the basic table creation unit 1 generates the SNP interaction basic table of FIG. 12 for every pair of two different SNPs among all m types of SNPs.

図１１に戻り、ＣＤＣテーブル作成部２は、ＳＮＰデータベース１１から、図１２に示すＳＮＰ相互作用基礎テーブルを順次読み出す。そして、ＣＤＣテーブル作成部２は、読み出した各ＳＮＰ相互作用基礎テーブルにおいて、９種類のジェノタイプデータの組合せを、１種類の組合せと、残りの他の８種類の組合せとの２つのカテゴリに再分類し、２つのカテゴリからなる８つのＣＤＣ相互作用テーブル（第１の実施形態におけるＣＤＣ相互作用テーブルと同様のカテゴリ分類）を生成する。ここで、ＣＤＣテーブル作成部２は、８種類のジェノタイプデータの組合せにグループ化された組合せの各々における量的形質の数値の合計値を加算し、組合せのカテゴリにおける量的形質の総合計値とする。また、ＣＤＣテーブル作成部２は、作成したＣＤＣ相互作用テーブルに対してＣＤＣ相互作用テーブル識別情報を付与し、このＣＤＣ相互作用テーブル識別情報と、このＣＤＣ相互作用テーブル識別情報の示すＣＤＣ相互作用テーブルとを対応付けて、ＳＮＰデータベース１１へ書き込んで記憶させる。 Returning to FIG. 11, the CDC table creation unit 2 sequentially reads the SNP interaction basic tables shown in FIG. 12 from the SNP database 11. Then, in each SNP interaction basic table that it has read, the CDC table creation unit 2 reclassifies the nine genotype data combinations into two categories, one consisting of a single combination and the other of the remaining eight combinations, and generates eight CDC interaction tables each consisting of two categories (the same category classification as the CDC interaction table in the first embodiment). Here, the CDC table creation unit 2 adds up the total values of the quantitative trait for the eight genotype data combinations that are grouped together, and uses the result as the grand total of the quantitative trait for that category. Further, the CDC table creation unit 2 assigns CDC interaction table identification information to each created CDC interaction table, and writes and stores this CDC interaction table identification information in the SNP database 11 in association with the CDC interaction table that it indicates.
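The one-versus-rest regrouping performed for a CDC interaction table can be sketched as below. The sketch shows a single split for a chosen cell. Which cells are enumerated to arrive at the eight tables mentioned in the text is not specified here, so no enumeration is assumed. `cdc_split` and the toy table are hypothetical.

```python
def cdc_split(basic_table, cell):
    """Collapse a 3x3 basic table into C1 = {cell} and C2 = all other cells.

    basic_table: {(g_a, g_b): (count, trait_sum)}
    Returns ((n1, s1), (n2, s2)): specimen count and trait total per category.
    """
    n1, s1 = basic_table[cell]
    n2 = sum(n for key, (n, _) in basic_table.items() if key != cell)
    s2 = sum(s for key, (_, s) in basic_table.items() if key != cell)
    return (n1, s1), (n2, s2)


# Toy basic table: one specimen per cell with trait value g_a + g_b ...
basic_table = {(a, b): (1, float(a + b)) for a in range(3) for b in range(3)}
# ... except the cell SNP1 = 0, SNP2 = 1, which holds four specimens.
basic_table[(0, 1)] = (4, 10.0)

(c1, c2) = cdc_split(basic_table, (0, 1))
```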

ＡＤＣテーブル作成部３は、ＳＮＰデータベース１１に記憶されているｍ個のＳＮＰから２つの異なる種類のＳＮＰの組合せをＳＮＰペアとして順次選択する。
そして、ＡＤＣテーブル作成部３は、ＳＮＰペアにおいて、カテゴリＣ１とする２つの種類の異なる複数のＳＮＰのジェノタイプデータの各々の検体の量的形質の値を合計する。
また、ＡＤＣテーブル作成部３は、ＳＮＰデータベース１１に記憶されているＳＮＰ基礎テーブルから、上記カテゴリＣ１に含まれている以外のカテゴリＣ２のジェノタイプデータの各々の検体の量的形質の値を合計する。
次に、ＡＤＣテーブル作成部３は、カテゴリＣ１及びこのカテゴリＣ１の量的形質の合計値と、カテゴリＣ２及びこのカテゴリＣ１の量的形質の合計値とからなるＡＤＣ相互作用テーブル（第１の実施形態におけるＡＤＣ相互作用テーブルと同様のカテゴリ分類）を生成する。
そして、ＡＤＣテーブル作成部３は、生成したＡＤＣ相互作用テーブルに対してＡＤＣ相互作用テーブル識別情報を付与し、このＡＤＣ相互作用テーブル識別情報と、このＡＤＣ相互作用テーブル識別情報の示すＡＤＣ相互作用テーブルとを対応付けて、ＳＮＰデータベース１１へ書き込んで記憶させる。 The ADC table creation unit 3 sequentially selects combinations of two different types of SNPs from the m SNPs stored in the SNP database 11 as SNP pairs.
Then, the ADC table creation unit 3 sums, for each SNP pair, the quantitative trait values of the specimens whose genotype data fall in the plural genotype combinations of the two different SNPs that form category C1.
Further, using the SNP basic table stored in the SNP database 11, the ADC table creation unit 3 sums the quantitative trait values of the specimens whose genotype data fall in category C2, that is, in the combinations not included in category C1.
Next, the ADC table creation unit 3 generates an ADC interaction table consisting of category C1 with the total value of its quantitative trait and category C2 with the total value of its quantitative trait (the same category classification as the ADC interaction table in the first embodiment).
Then, the ADC table creation unit 3 assigns ADC interaction table identification information to the generated ADC interaction table, and writes and stores this ADC interaction table identification information in the SNP database 11 in association with the ADC interaction table that it indicates.

ＣＤＣ＿ｔ値算出部１６は、ＳＮＰデータベース１１から順次、ＣＤＣ相互作用テーブル識別情報の示すＣＤＣ相互作用テーブルを読み出し、ＣＤＣ相互作用テーブルのカテゴリの組合せを説明変数とした単回帰分析を行い、そのｔ値を算出する。すなわち、ＣＤＣ＿ｔ値算出部１６は、ＣＤＣ相互作用テーブルにおける各カテゴリＣ１及びＣ２各々における量的形質の合計値から、以下に示す（１４）式のｔ値の算出方法を用いてＣＤＣ相互作用テーブルに対応するｔ値（以下、ＣＤＣｔ値とする）を算出する。 The CDC_t value calculation unit 16 sequentially reads from the SNP database 11 the CDC interaction table indicated by each piece of CDC interaction table identification information, performs a single regression analysis using the combination of categories of the CDC interaction table as the explanatory variable, and calculates its t value. That is, the CDC_t value calculation unit 16 calculates the t value corresponding to the CDC interaction table (hereinafter referred to as the CDC t value) from the total values of the quantitative trait in categories C1 and C2 of the CDC interaction table, using the t value calculation method of the following equation (14).

量的形質の回帰分析の場合、線形の回帰分析モデルを用いるが、最適な２個のカテゴリの分類を検出することは、すでに説明したロジスティック回帰モデルと同様である。この２個のカテゴリの分類において、２個のカテゴリの各々における量的形質の合計値の平均値の差、すなわち二群の平均値の差の検定で用いられる統計料としてのｔ値を用いている。上記（１４）式には、カテゴリＣ１の各々における量的形質の合計値の平均値ｙ１と、カテゴリＣ２における量的形質の合計値の平均値ｙ２と、カテゴリＣ１に含まれる検体数ｎ１と、カテゴリＣ２に含まれる検体数ｎ２とが用いられている。 In the regression analysis of a quantitative trait, a linear regression model is used, but the optimal classification into two categories is detected in the same way as for the logistic regression model already described. For this two-category classification, the difference between the mean values of the quantitative trait in the two categories is used, that is, the t value, which is the statistic used in testing the difference between the means of two groups. Equation (14) uses the mean value y1 obtained from the total of the quantitative trait in category C1, the mean value y2 obtained from the total of the quantitative trait in category C2, the number of specimens n1 included in category C1, and the number of specimens n2 included in category C2.
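Equation (14) is not reproduced in this text. As an illustration only, the sketch below computes the standard pooled-variance two-sample t statistic, which uses the same quantities named above (y1, y2, n1, n2); whether equation (14) has exactly this form is an assumption. A variance estimate needs the individual trait values (or their sums of squares), so the sketch takes per-specimen values rather than only counts and totals. The function name and data are hypothetical.

```python
import math


def pooled_t_value(values_c1, values_c2):
    """Two-sample t statistic with pooled variance (assumed form of the t value).

    values_c1, values_c2: quantitative trait values of the specimens in
    categories C1 and C2. Requires n1 + n2 > 2 and non-zero pooled variance.
    """
    n1, n2 = len(values_c1), len(values_c2)
    y1 = sum(values_c1) / n1  # mean of category C1
    y2 = sum(values_c2) / n2  # mean of category C2
    ss1 = sum((y - y1) ** 2 for y in values_c1)
    ss2 = sum((y - y2) ** 2 for y in values_c2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    return (y1 - y2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


# Hypothetical trait values for the two categories of one interaction table.
t_value = pooled_t_value([5.0, 7.0], [1.0, 2.0, 3.0])
```

For a single 0/1 explanatory variable, this statistic coincides with the t value of the slope in an ordinary simple linear regression, which is consistent with the single regression analysis described in the text.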

すなわち、ＣＤＣ＿ｔ値算出部１６は、以下に示す（１４）式に対し、ＣＤＣ相互作用テーブル毎に、カテゴリＣ１及びＣ２の各々における検体数及び量的形質の合計値とを代入して、ｔ値を算出する。これにより、（１４）式は以下の（１５）式となり、この（１５）式においては、ＳＮＰ１＝０及びＳＮＰ２＝１のジェノタイプデータの組合せのカテゴリＣ１と、それ以外のジェノタイプデータの８個の組合せからなるカテゴリＣ２とのｔ値を算出する。 That is, for each CDC interaction table, the CDC_t value calculation unit 16 substitutes the number of specimens and the total value of the quantitative trait in each of categories C1 and C2 into the following equation (14) to calculate the t value. Equation (14) thereby becomes the following equation (15), which calculates the t value between category C1, consisting of the genotype data combination SNP1 = 0 and SNP2 = 1, and category C2, consisting of the other eight genotype data combinations.

また、ＣＤＣ＿ｔ値算出部１６は、ＣＤＣ相互作用テーブルを示すＣＤＣ相互作用テーブル識別情報の各々に対応させ、ＣＤＣ相互作用テーブル識別情報の示すＣＤＣ相互作用テーブルのｔ値を、ＣＤＣ＿ｔ値記憶部１８に対して書き込んで記憶させる。 Further, the CDC_t value calculation unit 16 writes and stores, in the CDC_t value storage unit 18, the t value of each CDC interaction table in association with the CDC interaction table identification information that indicates that table.

ＡＤＣ＿ｔ値算出部１７は、ＳＮＰデータベース１１から順次、ＡＤＣ相互作用テーブル識別情報の示すＡＤＣ相互作用テーブルを読み出し、ＡＤＣ相互作用テーブルのカテゴリの組合せを説明変数とした単回帰分析を行い、そのｔ値を算出する。すなわち、ＡＤＣ＿ｔ値算出部１７は、ＡＤＣ相互作用テーブルにおける各カテゴリＣ１及びＣ２各々における量的形質の合計値から、上記（１４）式のｔ値の算出方法を用いてＡＤＣ相互作用テーブルに対応するｔ値（以下、ＡＤＣｔ値とする）を算出する。 The ADC_t value calculation unit 17 sequentially reads from the SNP database 11 the ADC interaction table indicated by each piece of ADC interaction table identification information, performs a single regression analysis using the combination of categories of the ADC interaction table as the explanatory variable, and calculates its t value. That is, the ADC_t value calculation unit 17 calculates the t value corresponding to the ADC interaction table (hereinafter referred to as the ADC t value) from the total values of the quantitative trait in categories C1 and C2 of the ADC interaction table, using the t value calculation method of equation (14) above.

すなわち、ＡＤＣ＿ｔ値算出部１７は、上記（１４）式に対し、ＣＤＣ相互作用テーブル毎に、カテゴリＣ１及びＣ２の各々における検体数及び量的形質の合計値とを代入して、ｔ値を算出する。これにより（１４）式は以下の（１６）式となり、この（１６）式においては、ＳＮＰ１＝０及びＳＮＰ２＝１かつＳＮＰ１＝２及びＳＮＰ２＝１のジェノタイプデータの組合せのカテゴリＣ１と、それ以外のジェノタイプデータの組合せからなるカテゴリＣ２とのｔ値を算出する。 That is, for each ADC interaction table, the ADC_t value calculation unit 17 substitutes the number of specimens and the total value of the quantitative trait in each of categories C1 and C2 into equation (14) above to calculate the t value. Equation (14) thereby becomes the following equation (16), which calculates the t value between category C1, consisting of the genotype data combinations (SNP1 = 0, SNP2 = 1) and (SNP1 = 2, SNP2 = 1), and category C2, consisting of the other genotype data combinations.

ここで、ＡＤＣ＿ｔ値算出部１７は、（１６）式で算出したＳＮＰペア毎に、最大値を有するＡＤＣ相互作用テーブルを選択し、この選択したＡＤＣ相互作用テーブルをＳＮＰペアのダミー変数とするカテゴリの分類とする。
また、ＡＤＣ＿ｔ値算出部１７は、各ＳＮＰペアにおいて選択されたＡＤＣ相互作用テーブルを示すＡＤＣ相互作用テーブル識別情報の各々に対応させ、ＡＤＣ相互作用テーブル識別情報の示すＡＤＣ相互作用テーブルのｔ値を、ＡＤＣ＿ｔ値記憶部１９に対して書き込んで記憶させる。 Here, the ADC_t value calculation unit 17 selects, for each SNP pair, the ADC interaction table having the maximum t value calculated by equation (16), and uses the category classification of the selected ADC interaction table as the dummy variable of that SNP pair.
Further, the ADC_t value calculation unit 17 writes and stores, in the ADC_t value storage unit 19, the t value of the ADC interaction table selected for each SNP pair in association with the ADC interaction table identification information that indicates that table.
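The per-pair selection described above (keep, for each SNP pair, the ADC interaction table with the maximum t value) can be sketched as follows. The sketch compares the t values as given. Whether the signed value or its magnitude is meant is not stated in this passage, so using the raw value is an assumption. All names and numbers are hypothetical.

```python
def select_best_adc_tables(t_values):
    """Pick the ADC interaction table with the largest t value for each SNP pair.

    t_values: dict mapping (snp_pair, table_id) -> t value.
    Returns a dict mapping snp_pair -> (table_id, t value).
    """
    best = {}
    for (pair, table_id), t in t_values.items():
        if pair not in best or t > best[pair][1]:
            best[pair] = (table_id, t)
    return best


# Hypothetical t values for candidate ADC tables of two SNP pairs.
t_values = {
    (("SNP1", "SNP2"), "adc_a"): 1.2,
    (("SNP1", "SNP2"), "adc_b"): 3.4,
    (("SNP1", "SNP3"), "adc_c"): 0.7,
}
best = select_best_adc_tables(t_values)
```

Only the selected table per pair is then stored, so the later ranking compares one ADC dummy variable per SNP pair.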

ランキング作成部７は、ＣＤＣ＿ｔ値記憶部１８から、ＣＤＣ相互作用テーブル識別情報により、順次ＣＤＣ相互作用テーブルのｔ値を読み出す。
そして、ランキング作成部７は、読み出したｔ値を大きい方から順番にソートし、この順番にＣＤＣ相互作用テーブル識別番号を配列させ、ランキング記憶部１５に対して書き込んで記憶させる。 The ranking creation unit 7 sequentially reads t values of the CDC interaction table from the CDC_t value storage unit 18 based on the CDC interaction table identification information.
Then, the ranking creating unit 7 sorts the read t values in descending order, arranges the CDC interaction table identification numbers in this order, and writes and stores them in the ranking storage unit 15.

また、ランキング作成部７は、ＡＤＣ＿ｔ値記憶部１９から、ＡＤＣ相互作用テーブル識別情報により、順次ＡＤＣ相互作用テーブルのｔ値を読み出す。
そして、ランキング作成部７は、読み出したｔ値を大きい方から順番にソートし、この順番にＡＤＣ相互作用テーブル識別番号を配列させ、ランキング記憶部１５に対して書き込んで記憶させる。 In addition, the ranking creating unit 7 sequentially reads t values of the ADC interaction table from the ADC_t value storage unit 19 based on the ADC interaction table identification information.
Then, the ranking creating unit 7 sorts the read t values in descending order, arranges the ADC interaction table identification numbers in this order, and writes and stores them in the ranking storage unit 15.

ダミー変数作成部８は、ランキング作成部７が作成したランキングから、ユーザが入力手段から入力するランキングの設定条件に対応するランキングを選択し、このランキングから上位ｒ個、例えば２５６個を選択し、選択されたテーブルに対応するダミーコードを説明変数とし、線形回帰モデルを構成する。
例えば、ユーザがランキングの設定条件をＣＤＣを用いるとした場合、ダミー変数作成部８は、ランキング記憶部１５からＣＤＣ相互作用テーブル識別情報を検索し、このＣＤＣ相互作用テーブル識別情報に対応して記憶されているＣＤＣｐ値のランキングを読み出す。そして、ダミー変数作成部８は、読み出したＣＤＣｐ値のランキングから、上位ｒ個のＣＤＣ相互作用テーブル識別情報を選択し、このＣＤＣ相互作用テーブル識別情報の示すＣＤＣ相互作用テーブルのカテゴリとなるダミー変数を説明変数とする。 From the rankings created by the ranking creation unit 7, the dummy variable creation unit 8 selects the ranking that corresponds to the ranking setting condition entered by the user through the input means, selects the top r entries (for example, 256) from this ranking, and constructs a linear regression model using the dummy codes corresponding to the selected tables as explanatory variables.
For example, when the user specifies CDC as the ranking setting condition, the dummy variable creation unit 8 searches the ranking storage unit 15 for the CDC interaction table identification information and reads the ranking of CDCp values stored in association with it. Then, the dummy variable creation unit 8 selects the top r pieces of CDC interaction table identification information from the ranking of CDCp values thus read, and uses as explanatory variables the dummy variables given by the categories of the CDC interaction tables that this identification information indicates.

一方、ユーザがランキングの設定条件をＡＤＣを用いるとした場合、ダミー変数作成部８は、ランキング記憶部１５からＡＤＣ相互作用テーブル識別情報を検索し、このＡＤＣ相互作用テーブル識別情報に対応して記憶されているＡＤＣｐ値のランキングを読み出す。そして、ダミー変数作成部８は、読み出したＡＤＣｐ値のランキングから、上位ｒ個のＡＤＣ相互作用テーブル識別情報を選択し、このＡＤＣ相互作用テーブル識別情報の示すＡＤＣ相互作用テーブルのカテゴリとなるダミー変数を説明変数とする。
また、本実施形態において、変数選択部９及び回帰分析処理部１０の処理については、第１の実施形態と同様のため説明を省略する。 On the other hand, when the user specifies ADC as the ranking setting condition, the dummy variable creation unit 8 searches the ranking storage unit 15 for the ADC interaction table identification information and reads the ranking of ADCp values stored in association with it. Then, the dummy variable creation unit 8 selects the top r pieces of ADC interaction table identification information from the ranking of ADCp values thus read, and uses as explanatory variables the dummy variables given by the categories of the ADC interaction tables that this identification information indicates.
In the present embodiment, the processes of the variable selection unit 9 and the regression analysis processing unit 10 are the same as those in the first embodiment, and thus the description thereof is omitted.

また、図１における遺伝子間相互作用解析システムの機能を実現するためのプログラムをコンピュータ読み取り可能な記録媒体に記録して、この記録媒体に記録されたプログラムをコンピュータシステムに読み込ませ、実行することにより遺伝子間相互作用の解析処理を行ってもよい。なお、ここでいう「コンピュータシステム」とは、ＯＳや周辺機器等のハードウェアを含むものとする。 Further, a program for realizing the function of the gene interaction analysis system in FIG. 1 is recorded on a computer-readable recording medium, and the program recorded on the recording medium is read into the computer system and executed. An analysis process of gene interaction may be performed. Here, the “computer system” includes an OS and hardware such as peripheral devices.

また、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。
また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含むものとする。また上記プログラムは、前述した機能の一部を実現するためのものであっても良く、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであっても良い。 Further, the “computer system” includes a homepage providing environment (or display environment) if a WWW system is used.
The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, and to a storage device such as a hard disk built into a computer system. Furthermore, the “computer-readable recording medium” also includes a medium that holds a program dynamically for a short time, such as a communication line used when a program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds a program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. The program may realize only a part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.

以上、この発明の実施形態を図面を参照して詳述してきたが、具体的な構成はこの実施形態に限られるものではなく、この発明の要旨を逸脱しない範囲の設計等も含まれる。 The embodiment of the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to this embodiment, and includes design and the like within a scope not departing from the gist of the present invention.

１…基礎テーブル作成部
２…ＣＤＣテーブル作成部
３…ＡＤＣテーブル作成部
４…ＣＤＣ尤度算出部
５…ＣＤＣ＿ｐ値算出部
６…ＡＤＣ＿ｐ値算出部
７…ランキング作成部
８…ダミー変数作成部
９…変数選択部
１０…回帰分析処理部
１１…ＳＮＰデータベース
１２…ＣＤＣ尤度記憶部
１３…ＣＤＣ＿ｐ値記憶部
１４…ＡＤＣ＿ｐ値記憶部
１５…ランキング記憶部
１６…ＣＤＣ＿ｔ値算出部
１７…ＡＤＣ＿ｔ値算出部
１８…ＣＤＣ＿ｔ値記憶部
１９…ＡＤＣ＿ｔ値記憶部
DESCRIPTION OF SYMBOLS
1 ... Basic table creation unit
2 ... CDC table creation unit
3 ... ADC table creation unit
4 ... CDC likelihood calculation unit
5 ... CDC_p value calculation unit
6 ... ADC_p value calculation unit
7 ... Ranking creation unit
8 ... Dummy variable creation unit
9 ... Variable selection unit
10 ... Regression analysis processing unit
11 ... SNP database
12 ... CDC likelihood storage unit
13 ... CDC_p value storage unit
14 ... ADC_p value storage unit
15 ... Ranking storage unit
16 ... CDC_t value calculation unit
17 ... ADC_t value calculation unit
18 ... CDC_t value storage unit
19 ... ADC_t value storage unit

Claims

A gene interaction analysis system that comprehensively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, which are combinations of SNPs having a gene-gene interaction that affects phenotypic expression, and generates a regression model for analyzing trait expression from the SNP pairs, the system comprising:
a table creation unit that sequentially selects SNP pairs each consisting of two different types of SNPs from m types of SNPs detected from n specimens, classifies the genotypes (combinations of dominant and recessive types) of the SNPs in each SNP pair into a table of dummy codes consisting of two categories, classifies each of the n specimens according to each dummy code, and creates a table indicating the two categories and the specimens belonging to each category;
an evaluation value calculation unit that performs, with a single regression model using the dummy code of the table as a dummy variable, a single regression analysis on the specimens classified into each category, and calculates, for each dummy variable, an evaluation value indicating the strength of the association between the dummy variable and the expression of the trait;
a ranking creation unit that generates a ranking by arranging the dummy variables in descending order of the strength of association indicated by the evaluation value;
a dummy variable creation unit that extracts the dummy variables up to a preset rank and uses the extracted dummy variables as the dummy variables of the regression model; and
a variable selection unit that calculates, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.

The gene interaction analysis system according to claim 1, wherein the table creation unit includes:
a CDC table creation unit that, using CDC (cell-wise dummy coding) as the coding method, creates a CDC table of dummy codes in which the genotypes of one type of SNP are classified into two categories, and a CDC interaction table of dummy codes in which the genotype combinations of two different types of SNPs are classified into two categories, namely one genotype combination and the other genotype combinations; and
an ADC table creation unit that, using ADC (adaptive dummy coding) as the coding method, creates an ADC interaction table of dummy codes in which a plurality of genotype combinations of two different types of SNPs are classified into two categories.

The gene interaction analysis system according to claim 2, wherein, when the regression model is a logistic regression model,
each of the CDC table, the CDC interaction table, and the ADC interaction table is a 2 × 2 table consisting of the two categories and the numbers of specimens in each category that belong to each of the two classes of a qualitative trait, and
the evaluation value calculation unit includes:
a CDC likelihood calculation unit that calculates a likelihood as the evaluation value in the single regression analysis of each dummy code of the CDC table and the CDC interaction table;
a CDC_p value calculation unit that calculates a p value as the evaluation value in the single regression analysis of each dummy code of the CDC table and the CDC interaction table; and
an ADC_p value calculation unit that calculates a p value as the evaluation value in the single regression analysis of each dummy code of the ADC interaction table.

The gene interaction analysis system according to claim 3, wherein the ranking creation unit generates the ranking by arranging the likelihoods of the dummy codes of the CDC table and the CDC interaction table in descending order of strength of association, generates the ranking by arranging the p values of the dummy codes of the CDC table and the CDC interaction table in descending order of strength of association, and generates the ranking by arranging the p values of the dummy codes of the ADC interaction table in descending order of strength of association.

The gene interaction analysis system according to claim 2, wherein, when the regression model is a linear regression model,
each of the CDC interaction table and the ADC interaction table is a table consisting of the two categories and the sum of the quantitative trait values of the specimens included in each category, and
the evaluation value calculation unit includes:
a CDC_t value calculation unit that calculates a t value as the evaluation value in the single regression analysis of each dummy code of the CDC interaction table; and
an ADC_t value calculation unit that calculates a t value as the evaluation value in the single regression analysis of each dummy code of the ADC interaction table.

The gene interaction analysis system according to claim 5, wherein the ranking creation unit generates the ranking by arranging the t values of the dummy codes of the CDC interaction table in descending order of strength of association, and generates the ranking by arranging the t values of the dummy codes of the ADC interaction table in descending order of strength of association.

A gene interaction analysis method that comprehensively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, which are combinations of SNPs having a gene-gene interaction that affects phenotypic expression, and generates a regression model for analyzing trait expression from the SNP pairs, the method comprising:
a table creation step in which a table creation unit sequentially selects SNP pairs each consisting of two different types of SNPs from m types of SNPs detected from n specimens, classifies the genotypes (combinations of dominant and recessive types) of the SNPs in each SNP pair into a table of dummy codes consisting of two categories, classifies each of the n specimens according to each dummy code, and creates a table indicating the two categories and the specimens belonging to each category;
an evaluation value calculation step in which an evaluation value calculation unit performs, with a single regression model using the dummy code of the table as a dummy variable, a single regression analysis on the specimens classified into each category, and calculates, for each dummy variable, an evaluation value indicating the strength of the association between the dummy variable and the expression of the trait;
a ranking creation step in which a ranking creation unit generates a ranking by arranging the dummy variables in descending order of the strength of association indicated by the evaluation value;
a dummy variable creation step in which a dummy variable creation unit extracts the dummy variables up to a preset rank and uses the extracted dummy variables as the dummy variables of the regression model; and
a variable selection step in which a variable selection unit calculates, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.

A gene interaction analysis program that comprehensively identifies, from genome-wide SNP (single nucleotide polymorphism) genotype data, SNP pairs, which are combinations of SNPs having a gene-gene interaction that affects phenotypic expression, and generates a regression model for analyzing trait expression from the SNP pairs,
the program causing a computer to function as:
table creation means for sequentially selecting SNP pairs each consisting of two different types of SNPs from m types of SNPs detected from n specimens, classifying the genotypes (combinations of dominant and recessive types) of the SNPs in each SNP pair into a table of dummy codes consisting of two categories, classifying each of the n specimens according to each dummy code, and creating a table indicating the two categories and the specimens belonging to each category;
evaluation value calculation means for performing, with a single regression model using the dummy code of the table as a dummy variable, a single regression analysis on the specimens classified into each category, and calculating, for each dummy variable, an evaluation value indicating the strength of the association between the dummy variable and the expression of the trait;
ranking creation means for generating a ranking by arranging the dummy variables in descending order of the strength of association indicated by the evaluation value;
dummy variable creation means for extracting the dummy variables up to a preset rank and using the extracted dummy variables as the dummy variables of the regression model; and
variable selection means for calculating, by a penalized maximum likelihood method, the regression coefficients by which the extracted dummy variables are multiplied.