JP6208622B2

JP6208622B2 - Analysis device, database creation method, and system

Info

Publication number: JP6208622B2
Application number: JP2014104958A
Authority: JP
Inventors: 安田 知弘
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2014-05-21
Filing date: 2014-05-21
Publication date: 2017-10-04
Anticipated expiration: 2034-05-21
Also published as: JP2015219856A

Description

The present invention relates to an analysis apparatus, a database creation method, and a system.

DNA sequencing with next-generation sequencers (next-generation sequencing, NGS) has made it possible to determine genome sequences at a dramatically lower cost than the conventional Sanger method. The cost of sequencing a genome, about US$100 million in 2001, had fallen to US$5,826 by 2013 with next-generation sequencers (DNA Sequencing Costs, http://www.genome.gov/sequencingcosts/). Besides the low cost, an enormous amount of sequence data can be obtained in a short time; some next-generation sequencers can generate more than one trillion bases of sequence data in a single run. These technologies have made large-scale projects that sequence the genomes of many subjects feasible, and a number of projects that sequence the genomes of large cohorts (groups of subjects) are under way in Japan and abroad.

On the other hand, as next-generation sequencers have determined the genome sequences of many people, it has become clear that individual differences between genomes include not only single nucleotide polymorphisms (SNPs), traditionally the main target of analysis, but also frequent events in which a relatively long sequence of several hundred bases or more changes at once. Such mutations are called structural variations (SVs), or structural polymorphisms, of the genome. The main types are insertions, in which a sequence absent from the standard genome sequence appears, and deletions, in which a sequence that should be present is missing (FIG. 5 shows examples of an insertion and a deletion). Because a structural polymorphism changes many bases at once, it is thought to play an important role in differences between individual phenotypes and in genome-related diseases. Analyzing how structural polymorphisms arise and how they affect biological function requires analyzing them comprehensively.

Regarding such analysis, Patent Document 1, for example, discloses a technique that maps DNA sequences to a genome and compares the mapped sequences with one another. Non-Patent Document 1 discloses a technique that analyzes the sequence data of many subjects simultaneously and detects mutations that appear in common across multiple subjects, thereby enabling structural polymorphism analysis even when the amount of sequence data per subject is small.

Patent Document 1: JP 2005-135053 A

Non-Patent Document 1: Handsaker et al., Discovery and genotyping of genome structural polymorphism by sequencing on a population scale, Nature Genetics, 2011 Mar; 43(3): 269-76.

Because a next-generation sequencer can read only about 100 to 400 bases at a time, it is difficult to capture large genomic changes such as structural polymorphisms accurately. Moreover, although the cost of genome sequencing has fallen, sequencing whole genomes for a large cohort is still expensive. It is therefore common either to sequence only a limited range, such as the regions of protein-coding genes, or to sequence the whole genome while reducing the amount of data per subject. When a genome sequence is determined with a next-generation sequencer, it is desirable to obtain at least 30 times as much data as the sequence being analyzed, but sometimes only 10 times as much or less is available, so a technique that efficiently compensates for the shortage of data is needed.

The technique disclosed in Patent Document 1, however, processes sequences in public databases (DBs) and does not handle the sequences of a specific group of subjects; nor does it efficiently compensate for a shortage of data. The technique disclosed in Non-Patent Document 1 enables structural polymorphism analysis, even when the amount of sequence data per subject is small, by analyzing the sequence data of many subjects simultaneously and detecting mutations that appear in common across multiple subjects. Because it requires analyzing many subjects, however, it handles a large amount of data from subjects who do not share the same mutation. It therefore cannot compensate for the shortage of data efficiently, and false positives become more likely when structural polymorphisms are detected.

A main object of the present invention is therefore to compensate efficiently for a shortage of data and to increase detection sensitivity even when the amount of data per subject is small.

A representative analysis apparatus according to the present invention is a DNA sequence analysis apparatus that selects a first subject from among a plurality of subjects and outputs the mutations of the first subject. The apparatus comprises: storage means for storing mapping data, obtained by mapping sequence data obtained from the genomes of the plurality of subjects to a reference genome sequence in a public database, and pedigree information representing the blood relationships among the plurality of subjects; family information analysis means for sequentially selecting, based on the pedigree information, second subjects that satisfy a predetermined condition; pair analysis means for executing a pair analysis that detects mutations by simultaneously analyzing the mapping data of the first subject and a second subject and records each detected mutation as a mutation common to the first subject and the second subject; and mutation determination means for removing duplicates from the detected mutations and outputting the result as the mutations of the first subject.

For example, the predetermined condition is being a parent of the first subject, and the family information analysis means selects the parents of the first subject as the second subjects.

The present invention can also be understood as a database creation method and as a system.

According to the present invention, even when the amount of data per subject is small, the shortage of data can be compensated for efficiently and detection sensitivity can be increased.

FIG. 1 is a diagram showing a configuration example of the DNA sequence analysis apparatus.
FIG. 2 is a diagram showing an outline of the DNA sequence analysis process.
FIG. 3 is a diagram showing an example of a processing sequence.
FIG. 4 is a diagram showing an example of family information.
FIG. 5 is a diagram showing examples of an insertion and a deletion.
FIG. 6 is a diagram showing an example of detecting insertions and deletions by the split read method.
FIG. 7 is a diagram showing an example of amplifying the amount of data using multiple subjects.
FIG. 8 is a diagram showing an example of the inclusion relationships among the mutations a subject shares with parents and children.
FIG. 9 is a diagram showing an example of a processing flowchart of the pair analysis means.
FIG. 10 is a diagram showing an example of a processing flowchart of the family information analysis means.
FIG. 11 is a diagram showing an example of a processing flowchart for determining the second subject from a set.
FIG. 12 is a diagram showing an example of a processing flowchart of the mutation determination means.
FIG. 13 is a diagram showing an example of paired ends.
FIG. 14 is a diagram showing an example of a mutation analysis tree.
FIG. 15 is a diagram showing an example of a processing flowchart that updates the mutation analysis tree and calculates the probability that an undiscovered mutation is discovered.
FIG. 16 is a diagram showing an example of pair analysis between founders.
FIG. 17 is a diagram showing an example of a processing flowchart of the family information analysis means using the mutation analysis tree.

The following describes preferred embodiments for improving the sensitivity of detecting the genomic mutations of each subject based on next-generation sequencer sequence data and family information. Before the embodiments, the terms used in the description are explained.

The human genome was sequenced by an international collaborative project, and the resulting sequence has been published as the standard human genome sequence. The genomes of many other model organisms have also been sequenced. Such a standard genome sequence is referred to below as a reference genome sequence. Compared with the reference genome sequence, a subject's genome can contain various mutations, such as single nucleotide polymorphisms, in which a single base at a specific coordinate changes; microsatellites, in which the repeat count of a short sequence, mainly of 2 to 4 bases, changes; and structural polymorphisms, in which a long region changes at once. The description below centers on detecting structural polymorphisms, but other kinds of mutations may also be detected.

A next-generation sequencer can produce a large number of pairs of sequences read from two locations, a few hundred bases apart, on the DNA being sequenced. Such a pair is called a paired end. FIG. 13 shows an example of paired ends. The two arrows 1301 and 1302, which face each other, represent the two sequences of a paired end, and the curve 1303 connecting them indicates which arrows belong to the same paired end. For each sequence obtained from the next-generation sequencer, the position in the genome sequence from which it originates is computed. This processing, called mapping, is a technique in common use in this technical field. For each subject, the data recording the result of mapping all of the given next-generation sequencer sequences is referred to below as mapping data.
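As an illustration of the terms above, paired ends and mapping data could be represented as follows. This is a minimal sketch, not the format used by the apparatus: the class names `MappedRead` and `PairedEnd`, the 0-based coordinates, and the dictionary layout of `mapping_data` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MappedRead:
    """One sequencer read placed on the reference genome (hypothetical layout)."""
    chrom: str
    start: int     # 0-based leftmost reference coordinate (assumed convention)
    length: int
    forward: bool  # True if the read aligns to the forward strand


@dataclass(frozen=True)
class PairedEnd:
    """The two reads of one paired end; the mate may be unmapped."""
    first: MappedRead
    second: Optional[MappedRead]

    def insert_span(self) -> Optional[int]:
        """Distance from the leftmost to the rightmost mapped base, or None."""
        if self.second is None or self.first.chrom != self.second.chrom:
            return None
        left = min(self.first.start, self.second.start)
        right = max(self.first.start + self.first.length,
                    self.second.start + self.second.length)
        return right - left


# Two reads facing each other a few hundred bases apart, as in FIG. 13.
pair = PairedEnd(MappedRead("chr1", 1000, 100, True),
                 MappedRead("chr1", 1300, 100, False))

# Mapping data: for each subject, the mapping results of all of its reads.
mapping_data = {"subject_1": [pair]}
```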

Family information (pedigree information) is a set of parent-child relationships between subjects and, as in the example of FIG. 4, can be expressed as a set of family trees. There can be more than one tree because not all subjects are necessarily related by blood. A subject for whom at least one of the two parents is absent from the data in the family information is called a founder. The parent-child relationships here are biological ones.
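The definition of a founder can be made concrete with a small sketch. The encoding of family information as a dictionary from subject to a (father, mother) tuple, and the function names `founders` and `children_of`, are assumptions for illustration only; the text does not prescribe a data format.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical encoding: subject id -> (father id, mother id); None if unknown.
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def founders(pedigree: Pedigree) -> Set[str]:
    """Subjects for whom at least one of the two parents is absent from the data."""
    result = set()
    for subject, parents in pedigree.items():
        if any(p is None or p not in pedigree for p in parents):
            result.add(subject)
    return result


def children_of(pedigree: Pedigree, subject: str) -> Set[str]:
    """Subjects that list the given subject as father or mother."""
    return {child for child, parents in pedigree.items() if subject in parents}


# Two family trees: one three-generation family and one unrelated subject.
pedigree: Pedigree = {
    "grandpa": (None, None),
    "grandma": (None, None),
    "father": ("grandpa", "grandma"),
    "mother": (None, None),
    "child": ("father", "mother"),
    "unrelated": (None, None),
}
founder_set = founders(pedigree)
```

In this toy pedigree, `father` and `child` are not founders because both of their parents appear in the data.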

Human cells contain two sets of chromosomes inherited from the two parents. Except for the sex chromosomes, which differ by sex, chromosomes come in pairs in which the same genes are arranged in the same order. The two chromosomes of such a pair are called homologous chromosomes, and each copy of a gene on the homologous chromosomes is called an allele. A mutation present in only one of the alleles is said to be heterozygous, and a mutation present in both alleles is said to be homozygous. A mutation that arose by spontaneous mutation and is present in neither parent is said to be de novo; because de novo mutations occur with low probability, they are not considered here.

Representative embodiments are described below as Examples 1 to 3 with reference to the drawings.

Example 1 is described below. FIG. 1 shows a configuration example of a DNA sequence analysis apparatus 100. The DNA sequence analysis apparatus 100 includes a CPU (Central Processing Unit) 101, a main storage device (memory) 102, an auxiliary storage device 103, and a user interface unit 106, and is connected to an external network via a network 105 such as a LAN (Local Area Network). The main storage device 102 holds the programs executed by the CPU 101 and the data the CPU 101 needs to execute them. It is a memory such as a RAM (Random Access Memory) that stores at least mapping data (1) 107-1, family information (1) 108-1, and the programs that, when executed by the CPU 101, function as the family information analysis means 109, the pair analysis means 110, and the mutation determination means 111. Together these programs constitute the DNA sequence analysis program.

The auxiliary storage device 103 is a storage device such as an HDD (Hard Disk Drive) that can record mapping data (2) 107-2, family information (2) 108-2, and the like. It also stores a mutation information DB (1) 112-1, which records the detected mutations. The removable medium 104 is a recording medium such as a CD or DVD that can be attached to and detached from the DNA sequence analysis apparatus 100 and can record mapping data (3) 107-3, family information (3) 108-3, a mutation information DB (2) 112-2, and the like. A network-attached storage device storing mapping data (4) 107-4, family information (4) 108-4, and a mutation information DB (3) 112-3 may also be accessible via the network 105.

The data stored in the main storage device 102, the auxiliary storage device 103, the removable medium 104, and the network-attached storage device may have the same content; for example, mapping data (1) 107-1, (2) 107-2, (3) 107-3, and (4) 107-4 may be identical. When the CPU 101 reads or writes such data, the data may be stored in the main storage device 102 as mapping data (1) 107-1 as needed, and mapping data (1) 107-1 may be copied from the main storage device 102 to another location when the DNA sequence analysis apparatus 100 is powered off or when the main storage device 102 runs out of free space. Before processing starts, another apparatus (not shown) may store data as mapping data (3) 107-3 on the removable medium 104 or as external mapping data (4) 107-4, and the data may be copied to the main storage device 102 or the auxiliary storage device 103 in the DNA sequence analysis apparatus 100 when the apparatus starts up or when processing starts. Where the storage location does not matter, the data are referred to collectively as mapping data 107, family information 108, and mutation information DB 112.

The content of the mapping data 107 is as already described with reference to FIG. 13, and the content of the family information 108 is as already described with reference to FIG. 4. The mutation information DB 112 may contain, as shown in FIG. 1 for example, information such as the chromosome number, the start position of the mutation, the end position of the mutation, and the type of mutation; it may contain some or all of the information in the VCF (Variant Call Format) of The 1000 Genomes Project (http://www.1000genomes.org); and it may contain statistical information from the analysis, such as the number of detections. FIGS. 13, 4, 1, and others use representations chosen for explanation, but the mapping data 107, the family information 108, and the mutation information DB 112 are not limited to these representations and may be in any data format that the CPU 101 can process. The user interface unit 106 is an input/output device that provides a user interface, such as a keyboard, a mouse, or a display.
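One possible shape for a record in the mutation information DB, using the fields named above (chromosome, start position, end position, type, and a detection count), is sketched here with Python's built-in `sqlite3` module. The table name, column names, and the choice of SQLite are assumptions for illustration; the text allows any format the CPU can process.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class VariantRecord:
    """One detected mutation (hypothetical field names)."""
    chromosome: str
    start_pos: int
    end_pos: int
    kind: str            # e.g. "deletion" or "insertion"
    detections: int = 1  # optional statistic: number of detections


def create_variant_db(conn: sqlite3.Connection) -> None:
    """Create the table; the primary key keeps one row per distinct mutation."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "chromosome TEXT, start_pos INTEGER, end_pos INTEGER, "
        "kind TEXT, detections INTEGER, "
        "PRIMARY KEY (chromosome, start_pos, end_pos, kind))"
    )


def store_variant(conn: sqlite3.Connection, rec: VariantRecord) -> None:
    """Insert the record, replacing an existing row for the same mutation."""
    conn.execute(
        "INSERT OR REPLACE INTO variants VALUES (?, ?, ?, ?, ?)", astuple(rec)
    )


conn = sqlite3.connect(":memory:")
create_variant_db(conn)
store_variant(conn, VariantRecord("chr1", 10500, 11200, "deletion", 2))
store_variant(conn, VariantRecord("chr1", 10500, 11200, "deletion", 3))
rows = conn.execute("SELECT * FROM variants").fetchall()
```

Storing the same mutation twice leaves a single row, which mirrors the requirement that duplicates not appear in the output.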

FIG. 2 shows an outline of mutation analysis by DNA sequence analysis, and FIG. 3 shows an example of a processing sequence in which the processing means operate in cooperation. The individual processes are described with reference to FIGS. 2 and 3.

(1) Loop for processing subjects sequentially
The processing of the DNA sequence analysis apparatus 100 is applied to each subject in turn and detects the mutations in the genome sequence of that subject (hereinafter, the first subject). As shown in FIG. 2, the whole process is therefore a loop over subjects. Each iteration selects an unprocessed first subject, executes the processing described below, and outputs the result.

(2) Family information analysis means 109
As shown in FIG. 3, the family information 108 is stored in advance in the main storage device 102, the auxiliary storage device 103, or the like. In the example of FIG. 3, it is entered from the user interface unit 106 and saved by the CPU 101 to the main storage device 102 or the auxiliary storage device 103, but it may instead be input via the removable medium 104 or the network 105. Based on the family information 108, another subject (hereinafter, the second subject) is selected to undergo pair analysis in the pair analysis means 110 together with the first subject being processed. If family information (1) 108-1 is absent from the main storage device 102, for example because it has been deleted, family information (2) 108-2 may be read from the auxiliary storage device 103 into the main storage device 102.

(3) Pair analysis means 110
The pair analysis means 110 is executed on the selected first and second subjects and performs the pair analysis described later. That is, it reads the mapping data 107, detects mutations, and outputs the mutations common to the selected first and second subjects. If mapping data (1) 107-1 is absent from the main storage device 102, for example because it has been deleted, mapping data (2) 107-2 may be read from the auxiliary storage device 103 into the main storage device 102.

(4) Mutation determination means 111
Based on the results of the pair analysis means 110, the mutations in the genome sequence of the first subject are determined. That is, the pair analysis results are read from the auxiliary storage device 103 and duplicates among them are removed.

(5) Writing to the mutation information DB
The mutations in the genome sequence obtained by the above processing are stored in the mutation information DB 112. The content of the mutation information DB 112 is as already described with reference to FIG. 1, and a relational database, an XML database, or the like may be used as its database format.
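Steps (1) through (5) can be summarized as a loop. The sketch below is only an outline under stated assumptions: `analyze_all`, its two callback parameters, and the tuple representation of a mutation are hypothetical names, and the callbacks stand in for the family information analysis means and the pair analysis means, whose internals are not reproduced here.

```python
def analyze_all(subjects, select_second_subjects, pair_analysis):
    """Return {first subject: sorted list of de-duplicated mutations}."""
    variant_db = {}
    for first in subjects:                              # (1) loop over subjects
        found = []
        for second in select_second_subjects(first):    # (2) selection from family information
            found.extend(pair_analysis(first, second))  # (3) pair analysis
        variant_db[first] = sorted(set(found))          # (4) remove duplicates
    return variant_db                                   # (5) content to write to the DB


# Toy stand-ins for the two means (assumed data, for illustration only).
relatives = {"child": ["father", "mother"], "father": ["child"], "mother": ["child"]}
shared = {
    frozenset(("child", "father")): [("chr1", 100, 400, "deletion"),
                                     ("chr2", 50, 50, "insertion")],
    frozenset(("child", "mother")): [("chr2", 50, 50, "insertion")],
}

result = analyze_all(
    ["child", "father", "mother"],
    lambda s: relatives.get(s, []),
    lambda a, b: shared.get(frozenset((a, b)), []),
)
```

The insertion on `chr2` is reported by both pair analyses of `child` but appears once in `result["child"]`, which is the role of the duplicate removal in step (4).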

FIG. 5 shows examples of a deletion and an insertion, two kinds of structural variation of a genome sequence. The deletion is described first. The subject has genome sequences 501a and 502a, and the reference genome has genome sequences 501b, 503, and 502b. Sequences 501a and 501b are identical, as are sequences 502a and 502b. Sequence 503 of the reference genome, drawn as a black rectangle, is absent from the subject's genome sequence. Such a sequence 503 is called a deletion. In the other example, the subject has genome sequences 504a, 506, and 505a, and the reference genome has genome sequences 504b and 505b; the subject's genome sequence contains a sequence 506 that does not exist in the reference genome sequence. Such a sequence 506 is called an insertion.
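The two cases of FIG. 5 can be reproduced on toy strings. The function names and the 0-based, end-exclusive coordinates are assumptions for this sketch; the comments map each piece of the string to the reference numerals used above.

```python
def apply_deletion(reference: str, start: int, end: int) -> str:
    """Subject sequence when reference[start:end] (sequence 503) is missing."""
    return reference[:start] + reference[end:]


def apply_insertion(reference: str, position: int, inserted: str) -> str:
    """Subject sequence when `inserted` (sequence 506) appears at `position`."""
    return reference[:position] + inserted + reference[position:]


reference = "AAAACCCCGGGG"

# Deletion: 501b = "AAAA", 503 = "CCCC", 502b = "GGGG"; the subject lacks 503.
subject_with_deletion = apply_deletion(reference, 4, 8)

# Insertion: 504b = "AAAA", 505b = "CCCCGGGG"; the subject carries 506 = "TTT".
subject_with_insertion = apply_insertion(reference, 4, "TTT")
```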

FIG. 6 shows an example of detecting deletions and insertions from next-generation sequencer sequences by a method called split read (K. Ye, M. H. Schulz, Q. Long, R. Apweiler, and Z. Ning, Pindel: A pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads, Bioinformatics, 25(21): 2865-2871, 2009).

Detection of a deletion is described first. When a deletion exists, a next-generation sequencer sequence 602 that originates from the position of the deletion boundary in the subject's genome corresponds to two separate locations, 603 and 604, in the reference genome. These are mapped near 601b, which corresponds to the next-generation sequencer sequence 601a, the other sequence of the paired end that contains sequence 602. If multiple such sequences are found at the same position, it is determined that a deletion exists at that position.

Detection of an insertion is described next. When an insertion exists, next-generation sequencer sequences 606 and 607, which originate from the positions of the insertion boundaries in the subject's genome, have no counterpart in the reference genome for the part that corresponds to the inserted sequence. Sequences 605b and 608b correspond to next-generation sequencer sequences 605a and 608a, and the remaining parts 609 and 610 are mapped near 605b and 608b, the other sequences of the respective paired ends. If multiple such sequences are found on both sides of a specific position on the genome, it is determined that an insertion exists at that position.
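The decision rule stated for deletions, that a deletion is called when multiple split reads agree on the same position, can be sketched as a simple support count. This does not reproduce Pindel's pattern growth algorithm: it assumes the split alignment has already been done, so each read is summarized as a hypothetical `(chromosome, left breakpoint, right breakpoint)` tuple, and `min_support=2` is an assumed reading of "multiple".

```python
from collections import Counter


def call_deletions(split_reads, min_support=2):
    """Return breakpoint pairs supported by at least `min_support` split reads.

    split_reads: iterable of (chrom, left_end, right_start) tuples, one per
    read whose two parts map to separate reference locations (603 and 604).
    """
    support = Counter(split_reads)
    return sorted(bp for bp, n in support.items() if n >= min_support)


reads = [("chr1", 100, 400)] * 3 + [("chr1", 900, 1500)]
deletions = call_deletions(reads)
```

Here the breakpoint pair seen in three reads is reported, while the pair seen only once is discarded as insufficiently supported. This is why the amount of data overlapping a breakpoint directly limits sensitivity.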

The split read method is implemented in Pindel (http://gmt.genome.wustl.edu/pindel/current), a widely known piece of software. Compared with other structural polymorphism detection methods, it outputs relatively few erroneous structural polymorphisms and computes their positions accurately. On the other hand, because it requires multiple sequences that span the boundary of a structural polymorphism, its detection sensitivity becomes insufficient when data is scarce. A computational experiment with simulated data reported that, for homozygous structural polymorphisms carried by both homologous chromosomes, sensitivity was insufficient when the amount of sequence was 5 times the genome sequence but sufficient at 10 times or more (Yasuda T, Suzuki S, Nagasaki M, Miyano S.: ChopSticks: High-resolution analysis of homozygous deletions by exploiting concordant read pairs, BMC Bioinformatics 2012 Oct 30;13:279).

Even when the amount of sequence per subject is 10 times the genome, if a structural polymorphism is heterozygous (only one of the two homologous chromosomes carries it), the amount of sequence data reflecting the structural polymorphism is only the equivalent of 5 times, as shown for example in FIG. 7. In genome sequencing of a large group of subjects, the amount of sequence per subject sometimes falls short of 10 times so that the number of subjects can be secured while costs are reduced. In such cases, the split read method cannot achieve sufficient sensitivity if subjects are processed individually.

To keep the strengths of the split read method, namely few errors in the output structural polymorphisms and accurate positions, while compensating for the loss of detection sensitivity when data is scarce, the sequence data of blood relatives is used to amplify the sequences derived from the same mutation. A parent and a child share half of their genome sequence and are therefore likely to carry the same mutation. FIG. 7 shows an example of amplifying the amount of data by simultaneously analyzing subjects who carry the same mutation. In this example, subject 1 and subject 2 both carry the same insertion heterozygously. With 10 times the sequence amount per subject, the amount of sequence derived from this insertion is only the equivalent of 5 times, so the split read method does not achieve sufficient detection sensitivity.

However, if the data of these two subjects are mixed and analyzed together, the sequence amount derived from this insertion becomes equivalent to 10 times, and the split-read method can achieve sufficient detection sensitivity. Analyzing the combined mapping data of two subjects in this way is referred to here as pair analysis. It is further assumed that pair analysis has very high sensitivity and produces few errors. In the following, the subject under analysis is called the first subject, and a subject who is pair-analyzed with the first subject in order to detect the first subject's mutations is called the second subject. Analyzing three, four, or more subjects together would increase the number of error-containing sequences and hence the number of false-positive mutations, so the analysis here is limited to pairs.
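The coverage arithmetic behind pair analysis can be illustrated with a minimal sketch. The function name and argument layout are hypothetical; the only assumption is that each chromosome copy contributes half of a subject's sequence amount.

```python
def supporting_depth(coverages, copies_with_variant):
    """Sequence amount expected to derive from the variant-carrying chromosomes.

    coverages: per-subject sequence amount, as a multiple of genome size.
    copies_with_variant: per-subject number (0, 1, or 2) of homologous
        chromosomes that carry the variant.
    Assumes each of the two chromosome copies yields half of the sequences.
    """
    return sum(c * k / 2 for c, k in zip(coverages, copies_with_variant))


# One heterozygous subject sequenced at 10 times the genome size.
single = supporting_depth([10], [1])
# Two heterozygous subjects at 10 times each, analyzed together.
pooled = supporting_depth([10, 10], [1, 1])
print(single, pooled)
```

Under these assumptions the pooled pair reaches the same variant-supporting depth as a single homozygous subject at 10 times.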

FIG. 8 schematically shows an example of the inclusion relationships among the mutations that the first subject shares with parents and children. Apart from rare de novo mutations, which arise in the subject by spontaneous mutation, the first subject's mutations are inherited from either the father or the mother. Pair analysis with the father and with the mother can therefore cover all of the first subject's inherited mutations. Even when mapping data are not available for both parents, mutations may still be detected by pair analysis with a child. If there are several children, pair analysis with each of them can cover most of the first subject's mutations.

However, a certain proportion of mutations 801 is always not shared with any child. Such mutations may still be shared with unrelated subjects; this is why the father's and mother's mutations overlap in FIG. 8. In a genome sequencing project on a large group of subjects, many subjects are unrelated to one another, so the number of pairs of unrelated subjects is enormous. It is therefore necessary to select and analyze only the pairs that are necessary and sufficient for mutation detection. For this reason, as illustrated in FIG. 16, pairs of unrelated subjects are limited to pair analyses 1601 between founders. If the first subject is not a founder, all of the first subject's mutations are shared with the parents, so pair analysis with an unrelated second subject is unnecessary. If the first subject is a founder, the mutations of a non-founder second subject are inherited from one of the founders of that second subject's family. Pair analysis 1602 with a non-founder second subject is therefore unnecessary.

In the example of FIG. 16, the number of all pairs of subjects is ${}_{13}C_{2} = 78$, but when the pairs are limited to parent-child pairs and founder pairs, pair analysis needs to be performed on only ${}_{7}C_{2} + 2 \times 6 = 33$ pairs. In general, if the number of founders is $f$ and the number of subjects with both parents available is $m$, the number of pairs to be analyzed is ${}_{f}C_{2} + 2m$.
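The pair count can be checked with a short sketch. The function and parameter names are hypothetical; the figures 13, 7, and 6 are those of the FIG. 16 example.

```python
from math import comb


def pairs_to_analyze(num_founders, num_with_both_parents):
    """Founder-founder pairs plus two parent-child pairs per subject
    whose parents are both available (fC2 + 2m)."""
    return comb(num_founders, 2) + 2 * num_with_both_parents


all_pairs = comb(13, 2)             # every pair among 13 subjects: 78
limited = pairs_to_analyze(7, 6)    # 21 founder pairs + 12 parent-child pairs: 33
print(all_pairs, limited)
```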

FIG. 9 shows an example of a processing flowchart of the pair analysis means 110. The pair analysis processing is described in detail with reference to FIG. 9.
S901: Determine whether pair analysis has already been executed for the pair of first subject and second subject given as input (hereinafter, the input pair).
S902: If S901 determines that pair analysis of the input pair has already been executed, copy the previous analysis result and end.
S903: Acquire the mapping data 107 of the two subjects in the input pair.
S904: Merge the mapping data 107 acquired in S903 by concatenating all of it.
S905: Use the merged mapping result as input to Pindel as if it were the data of a single person, and detect mutations.
S906: The detected mutations are expected to include those common to the first and second subjects. However, a mutation carried only by the second subject may also be detected, for example when a large number of sequences is obtained in a particular genomic region or when the second subject's structural variation is homozygous. To remove such structural variations that the first subject does not carry, acquire the DNA sequences that support each detected structural variation as in FIG. 6. Output the mutation as a mutation of the first subject only when the number of those sequences derived from the first subject exceeds a preset threshold k; otherwise, discard it.
S907: Save the mutations output in S906 so that they are available to the mutation determination means 111 and can be copied as the previous analysis result in S902 the next time the same pair analysis is requested.
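Steps S901 to S907 can be sketched as follows. This is a minimal illustration, not the patented implementation: the read representation, the cache, and the `detect` callable are all hypothetical, and `detect` merely stands in for the split-read caller (Pindel in the text) without reproducing its interface.

```python
def pair_analysis(first, second, mapping_data, detect, k, cache):
    """Sketch of S901-S907.

    mapping_data: dict subject -> list of reads (hypothetical representation).
    detect: hypothetical stand-in for the split-read caller; takes a list of
        (subject, read) records and returns {variant: [(subject, read), ...]}.
    k: threshold on the number of supporting reads from the first subject.
    cache: dict keyed by (first, second) holding earlier results.
    """
    key = (first, second)
    if key in cache:                                   # S901, S902
        return list(cache[key])
    merged = [(s, read) for s in (first, second)       # S903, S904
              for read in mapping_data[s]]
    candidates = detect(merged)                        # S905
    kept = []
    for variant, support in candidates.items():        # S906
        from_first = sum(1 for subject, _ in support if subject == first)
        if from_first > k:
            kept.append(variant)
    cache[key] = kept                                  # S907
    return list(kept)


def toy_detect(records):
    """Toy caller: each read is just a label naming the variant it supports."""
    found = {}
    for subject, read in records:
        found.setdefault(read, []).append((subject, read))
    return found


mapping = {
    "child": ["ins@100", "ins@100", "del@500"],
    "father": ["ins@100", "del@900", "del@900", "del@900"],
}
cache = {}
result = pair_analysis("child", "father", mapping, toy_detect, 1, cache)
print(result)
```

The cache key is ordered because the S906 filter depends on which subject is the first subject, so the result for (child, father) is not reusable for (father, child).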

FIG. 10 shows an example of a processing flowchart of the family information analysis means 109. This processing determines the second subjects to be pair-analyzed with the first subject.
S1001: Determine whether a parent of the first subject is included in the mapping data 107 and the family information 108. If so, go to S1002; otherwise, go to S1005.
S1002: Decide to perform pair analysis between the first subject and the first parent found in S1001. That is, this first parent is determined to be a second subject.
S1003: Determine whether the other parent of the first subject is included in the mapping data 107 and the family information 108. If so, go to S1004; otherwise, go to S1005.
S1004: Decide to perform pair analysis between the first subject and the second parent found in S1003. That is, this second parent is determined to be a second subject.
S1005: With x set to the first subject and S set to the set of the first subject's children, perform the processing shown in FIG. 11 and decide to perform pair analysis with every child. That is, all children are determined to be second subjects. The processing of FIG. 11 is described later.
S1006: With x set to the first subject and S set to the set of all founders, perform the processing shown in FIG. 11 and decide to perform pair analysis with every founder. That is, all founders are determined to be second subjects.

FIG. 11 shows an example of a processing flowchart for determining second subjects from a set S. A set of subjects is assigned to S beforehand, and the first subject is assigned to the variable x beforehand. Because this processing is used in S1005 and S1006, it determines all children and all founders to be second subjects and forms part of the processing of the family information analysis means 109.
S1101: If the set S contains the first subject x, remove x from S.
S1102: If S is empty, end the processing.
S1103: Select one new second subject from S, assign it to the variable y, and remove it from S.
S1104: Decide to perform pair analysis on the first subject and second subject held in x and y.
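The selection flow of FIG. 10, together with the FIG. 11 subroutine, can be sketched as below. Several points are assumptions rather than statements of the text: the data layout (`parents`, `available`), the reading that processing ends after S1004 when both parents are available, the reading that S1104 loops back to S1102, and the definition of a founder as any subject lacking at least one available parent (taken from the wording of the claims).

```python
def remaining_subjects(x, candidates):
    """FIG. 11: yield every member of the candidate set except x, one by one."""
    pool = [s for s in candidates if s != x]       # S1101
    selected = []
    while pool:                                    # S1102
        y = pool.pop(0)                            # S1103
        selected.append(y)                         # S1104
    return selected


def select_second_subjects(x, parents, available):
    """FIG. 10 (sketch).

    parents: dict subject -> tuple of parent ids in the family information.
    available: set of subjects that have mapping data.
    """
    present = [p for p in parents.get(x, ()) if p in available]
    selected = present[:2]                         # S1001-S1004
    if len(present) >= 2:
        return selected                            # assumed: both parents suffice
    children = sorted(c for c, ps in parents.items()
                      if x in ps and c in available)
    founders = sorted(s for s in available
                      if sum(p in available for p in parents.get(s, ())) < 2)
    for group in (children, founders):             # S1005, S1006
        for y in remaining_subjects(x, group):
            if y not in selected:
                selected.append(y)
    return selected


# Hypothetical three-generation family.
parents = {"pa": ("gf", "gm"), "kid": ("pa", "ma")}
available = {"gf", "gm", "pa", "ma", "kid"}
print(select_second_subjects("kid", parents, available))
print(select_second_subjects("gf", parents, available))
```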

When the processing described with FIGS. 10 and 11 determines more than one second subject, the family information analysis means 109 may output them to the pair analysis means 110 as a set of second subjects, and the pair analysis means 110 may take them from the set one at a time and pair each with the first subject. Alternatively, although this differs from the description given with FIGS. 2 and 3, the pair analysis described with FIG. 9 may be executed immediately at each of S1002 and S1004 of FIG. 10 and S1104 of FIG. 11.

FIG. 12 shows an example of a processing flowchart of the mutation determination means 111, which determines the genomic mutations carried by the first subject from the mutation data obtained by pair analysis between the first subject and each second subject determined by the family information analysis means 109.
S1201: Concatenate all structural variation data obtained by the pair analyses. At this stage, many identical structural variations appear more than once because they are contained in the results of pair analyses with different second subjects.
S1202: Sort the concatenated structural variation data by genomic position.
S1203: Scan the sorted data and merge, as the same mutation, entries whose genomic positions differ by less than an integer w given in advance as a parameter (that is, lie within w bases of each other) and whose mutation types, such as insertion or deletion, match.
S1204: Output the resulting duplicate-free mutations to the mutation information DB 112.
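Steps S1201 to S1204 can be sketched as below. The tuple layout `(chrom, pos, kind)` is hypothetical, and one design choice is an assumption: each entry is compared with the first kept entry of its cluster, whereas the text does not say whether comparison is against that representative or against the most recently merged entry.

```python
def merge_variants(results, w):
    """Deduplicate variants from several pair analyses (sketch of S1201-S1204).

    results: iterable of lists of (chrom, pos, kind) tuples.
    w: entries of the same kind closer than w bases are treated as identical.
    """
    combined = [v for result in results for v in result]     # S1201
    combined.sort(key=lambda v: (v[0], v[1]))                 # S1202
    merged = []
    last_pos = {}  # (chrom, kind) -> position of the last kept representative
    for chrom, pos, kind in combined:                         # S1203
        key = (chrom, kind)
        if key in last_pos and pos - last_pos[key] < w:
            continue  # same mutation as the representative already kept
        last_pos[key] = pos
        merged.append((chrom, pos, kind))
    return merged                                             # S1204


with_father = [("chr1", 100, "INS"), ("chr1", 5000, "DEL")]
with_mother = [("chr1", 103, "INS"), ("chr1", 101, "DEL"), ("chr2", 100, "INS")]
unique = merge_variants([with_father, with_mother], w=10)
print(unique)
```

Tracking the last representative per chromosome and type keeps an intervening variant of a different type from hiding two nearby entries of the same type.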

As described above, even when the amount of data per subject is small, analyzing each subject together with subjects likely to share the same mutations, such as parents and children, compensates for the shortage of data and improves detection accuracy.

Next, Embodiment 2, which can also handle cases where the detection sensitivity of pair analysis is low or errors occur with a certain probability, is described with reference to the drawings. In Embodiment 1, pair analysis was performed first with the parents, then with the children as needed, and finally between founders. This rests on the premise that the sensitivity of pair analysis (hereinafter r) is sufficiently high, in which case these combinations can be expected to detect every mutation that some other subject in the data shares. When the detection sensitivity r is low, however, some mutations may be missed if only the pair analyses of Embodiment 1 are performed. To detect such mutations, the targets of pair analysis must be extended beyond parents and children to relatives such as grandparents, grandchildren, and siblings, so that each mutation is pair-analyzed more times and its probability of detection rises. When choosing among relatives, it is important to select as the second subject the one who shares the largest number of undiscovered mutations.

Let p be the probability that a detected mutation actually exists in the subject's genome. In Embodiment 1, p was taken to be sufficiently high and close to 1 because parents and other close relatives are used as second subjects. When p is low, a mutation detected only once is likely to be an error, so it is desirable to repeat pair analysis until the mutation has been detected several times. Specifically, in the processing below it is preferable to let N be an integer given as a parameter and to treat a mutation as a mutation of the first subject only once it has been detected N times.

To select the subject who shares the largest number of undiscovered mutations, the probability of carrying undiscovered mutations is computed for each subject. This probability depends on how closely the subject is related to the first subject, the degree of overlap with mutations already analyzed, and the number of pair analyses executed so far. Specifically, when a subject y is used as the second subject and the first subject is x, the probability P(y) that a mutation inherited by both from a common ancestor is newly discovered is given by
$P(y) = \sum_{v \in V} 2F(x, y)\, s(v)\, r\, (1 - r)^{n(v)}$
Here, F(x, y) is the kinship coefficient of the first subject x and the second subject y (the probability that, when one allele is chosen from each of x and y, the two were inherited from the same ancestor; an allele is the sequence of the same gene on each of the two homologous chromosomes). The kinship coefficient may be included in the family information 108. V is a set of subsets of x's mutations, s(v) is the proportion of x's mutations that belong to v, and n(v) is the number of pair analyses executed so far on the mutations in v. Each subset v in V is a set of mutations to which the pair analyses with other second subjects, performed before the pair analysis of x and y, have been applied in exactly the same way. Computing P(y) requires obtaining V and knowing how many times pair analysis has been applied to each v in V. As pair analyses accumulate, V becomes complicated and hard to compute correctly. The processing used to compute V is therefore described below with reference to the drawings.
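The formula for P(y) can be evaluated directly once V is known. In the sketch below, each subset v is represented only by its pair (s(v), n(v)); the function name and this representation are hypothetical. The kinship coefficient 0.25 used in the example is the standard value for a parent and child, and r = 0.5 is an arbitrary illustrative sensitivity.

```python
def new_detection_probability(kinship, subsets, r):
    """P(y) = sum over v in V of 2 F(x, y) s(v) r (1 - r)^n(v).

    kinship: F(x, y) for the first subject x and the candidate y.
    subsets: iterable of (s, n) pairs, one per subset v in V.
    r: detection sensitivity of a single pair analysis.
    """
    return sum(2 * kinship * s * r * (1 - r) ** n for s, n in subsets)


# No pair analysis yet: V holds one subset covering all mutations.
p_first = new_detection_probability(0.25, [(1.0, 0)], 0.5)
# Half of the mutations were already pair-analyzed once, half never.
p_later = new_detection_probability(0.25, [(0.5, 1), (0.5, 0)], 0.5)
print(p_first, p_later)
```

The factor (1 - r)^n(v) lowers the contribution of subsets that earlier pair analyses have already had a chance to detect.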

FIG. 14 shows an example of the mutation analysis tree used to compute V. The mutation analysis tree is a tree rooted at a node 1401 that represents all mutations of the first subject. Each time a pair analysis with a second subject is performed, each node is split into a node 1402 of mutations carried by that second subject and a node 1403 of mutations not carried. A node without children is called a leaf. Each node L records a triple consisting of: a label h indicating whether the mutations are carried by the second subject with whom pair analysis was performed (carried if the label has no negation sign "¬", not carried if it has one); the number n of pair analyses executed on the mutations of the node; and the proportion s of the first subject's mutations that belong to the node. When L has the triple (h, n, s), this is written L = (h, n, s) below. The sets of the first subject's mutations corresponding to the leaves of the mutation analysis tree are the elements of V.

FIG. 15 shows an example of a processing flowchart for updating the mutation analysis tree, executed each time a pair analysis is performed.
S1501: Initialize the variable X to 0. X holds the intermediate result of computing P(y).
S1502: Set the variable F to the kinship coefficient of the first subject and the second subject.
S1503: Repeat S1504 to S1507 for every leaf that existed in the mutation analysis tree before the update. This loop processes every subset of the first subject's mutations.
S1504: Let L be one of the unprocessed leaves, and let (h, n, s) be the triple recorded in it.
S1505: Add two new leaves L1 and L2 as children of L. The triples recorded in them are as shown in FIG. 15.
S1507: Add to X the probability that the mutations corresponding to L are detected.
S1508: Output the value of X as P(y).
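The tree update can be sketched as follows, with one important caveat: the triples written to the new leaves L1 and L2 are given only in FIG. 15, which is not reproduced in this text. The split used below is an assumption chosen to be consistent with the formula for P(y): a fraction 2F of each leaf is taken to be shared with the second subject and has its count n incremented, and the remainder keeps its count. The class and function names, and the `modify` flag, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str   # h: second subject's name, prefixed with "¬" if not carried
    n: int       # pair analyses already applied to these mutations
    s: float     # share of the first subject's mutations in this node
    children: list = field(default_factory=list)


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def update_tree(root, second, kinship, r, modify=True):
    """Return P(y) for `second`; split every current leaf when modify is True."""
    x_total = 0.0                                    # S1501
    shared = 2 * kinship                             # S1502 (assumed share)
    for leaf in leaves(root):                        # S1503, S1504
        if modify:                                   # S1505 (assumed triples)
            leaf.children = [
                Node(second, leaf.n + 1, leaf.s * shared),
                Node("¬" + second, leaf.n, leaf.s * (1 - shared)),
            ]
        x_total += shared * leaf.s * r * (1 - r) ** leaf.n   # S1507
    return x_total                                   # S1508


root = Node("all", 0, 1.0)
p_father = update_tree(root, "father", 0.25, 0.5)
p_mother = update_tree(root, "mother", 0.25, 0.5, modify=False)
print(p_father, p_mother, len(leaves(root)))
```

Calling with `modify=False` evaluates a candidate without changing the tree, which is useful when several candidates are compared before one is chosen.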

In the family information analysis means 109, instead of the example processing of FIG. 10 described in Embodiment 1, the second subject is selected using P(y) computed by the example processing of FIG. 15, and pair analysis is performed with the selected second subject. FIG. 17 shows an example flowchart of this processing, which is described below with reference to the drawing. The inputs are the first subject, the set S of past second subjects already pair-analyzed with the first subject, and the probability B that unrelated subjects carry the same mutation.
S1701: Initialize the variable P_max to 0.
S1702: Set S' to the set of subjects who are parents or children of subjects in S and do not themselves belong to S.
S1703: Determine whether S' is empty. If it is, go to S1708. Otherwise, repeat S1704 to S1707 for each element of S'.
S1704: Take one element y out of S' and delete it from S'.
S1705: Compute P(y) by the processing described with FIG. 15, but without executing the modification of the mutation analysis tree in S1505.
S1706: If P_max is not less than P(y), return to S1703 and process the next subject.
S1707: If P_max is less than P(y), set P_max to P(y) and set the variable y_max to y.

After S1707, return to S1703 and repeat S1704 to S1707.
S1708: If P_max is smaller than the probability B, that is, the probability that unrelated subjects carry a common mutation, pair analysis with a non-relative is more likely to find new mutations than pair analysis with a relative. Therefore, go to S1709 if P_max is smaller than B, and go to S1711 otherwise.
S1709: Determine whether any unprocessed founder remains. If so, go to S1710; otherwise, go to S1711.
S1710: Arbitrarily choose one unprocessed founder as the second subject.
S1711: Take y_max as the second subject.
S1712: Perform pair analysis on the first subject and the second subject. In preparation for selecting the next second subject, update the mutation analysis tree by the processing shown in FIG. 15.
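The selection part of FIG. 17 (S1701 to S1711) can be sketched as below. The function signature is hypothetical: `prob` stands for any side-effect-free way of computing P(y), and returning `None` when no candidate and no founder is available is an assumption, since the text does not cover that case.

```python
def choose_second_subject(candidates, prob, b, unprocessed_founders):
    """Pick the next second subject (sketch of S1701-S1711).

    candidates: the set S' built in S1702, as a list.
    prob: callable y -> P(y) that does not modify the mutation analysis tree.
    b: probability B that unrelated subjects share a mutation.
    unprocessed_founders: founders not yet pair-analyzed with the first subject.
    """
    p_max, y_max = 0.0, None                  # S1701
    for y in candidates:                      # S1703, S1704
        p = prob(y)                           # S1705
        if p_max < p:                         # S1706, S1707
            p_max, y_max = p, y
    if p_max < b and unprocessed_founders:    # S1708, S1709
        return unprocessed_founders[0]        # S1710: any unprocessed founder
    return y_max                              # S1711


# Hypothetical P(y) values for two relatives.
p_values = {"grandfather": 0.1, "sibling": 0.2}
relatives = ["grandfather", "sibling"]
print(choose_second_subject(relatives, p_values.get, 0.05, ["founder9"]))
print(choose_second_subject(relatives, p_values.get, 0.3, ["founder9"]))
```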

As described above, even when the amount of data per subject is small and analyzing the parents and children does not yield sufficient detection sensitivity, the probability of compensating for the shortage of data with the smallest number of subjects can be raised, and detection accuracy can be improved.

The processing described in Embodiments 1 and 2 uses parameters such as the threshold k on the number of next-generation sequencer sequences in pair analysis, the position threshold w used by the mutation determination means 111 to treat mutations as identical, and the probability B that unrelated subjects share a mutation. Examples of how to tune and estimate these parameters when their values are not available in advance are described below.

(1) Threshold k on the number of next-generation sequencer sequences in pair analysis
Using data for which both the next-generation sequencer sequences and the mutation positions are publicly available, that is, data whose correct detection results are known, select the value of k that gives the highest detection accuracy. Instead of public information, k may be tuned so that the results agree with mutation information determined by a method other than next-generation sequencing, such as the Sanger method.

(2) Position threshold w for treating mutations as identical
As in (1), analyze publicly available data, determine from the results how much the positions of detected mutations vary, and tune w so that the range covers the positions of the mutations correctly detected in the public data.

(3) Probability B that unrelated subjects share a mutation
Perform pair analysis among a small number of founders. For the mutations detected by these pair analyses, count the first number: the number of times the next-generation sequencer sequences supporting a mutation exceed the threshold k in the mapping data of each of the two subjects. Separately, perform pair analysis on a small number of parent-child pairs and count the second number: the number of mutations obtained by these pair analyses. Pair analysis of one parent-child pair detects about half of each subject's mutations. B can therefore be estimated by dividing the first number by twice the second number.
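The estimate of B amounts to a single division, sketched below. The function name and the example counts (30 and 500) are hypothetical, and the text does not state how the two counts should be normalized when the numbers of founder pairs and parent-child pairs differ; the sketch applies the division literally.

```python
def estimate_b(shared_between_founders, detected_in_parent_child):
    """B = first number / (2 * second number).

    shared_between_founders: count of mutations from founder pair analyses
        supported above threshold k in both subjects (the first number).
    detected_in_parent_child: count of mutations obtained from parent-child
        pair analyses (the second number), about half of a subject's mutations.
    """
    if detected_in_parent_child <= 0:
        raise ValueError("the parent-child count must be positive")
    return shared_between_founders / (2 * detected_in_parent_child)


b = estimate_b(30, 500)   # illustrative counts only
print(b)
```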

As described above, even when the parameter values are not available in advance, each of them can be tuned or estimated. Even when the amount of data per subject is small, the shortage of data can then be compensated for and detection accuracy improved.

The embodiments have been described above. They show preferred examples, and the invention is not intended to be limited to their specific configurations. Various modifications are possible without departing from the gist of the above description.

100 DNA sequence analysis device
101 CPU
102 Main storage device
103 Auxiliary storage device
104 Removable media
105 Network
106 Interface unit
107 Mapping data
108 Family information
109 Family information analysis means
110 Pair analysis means
111 Mutation determination means
112 Mutation information DB

Claims

A DNA sequence analysis device that selects a first subject from among a plurality of subjects and outputs mutations of the first subject, the analysis device comprising:
storage means for storing mapping data obtained by mapping sequence data obtained from the genomes of the plurality of subjects to a reference genome sequence in a public database, and family information representing the blood relationships among the plurality of subjects;
family information analysis means for sequentially selecting, based on the family information, a second subject that satisfies a predetermined condition;
pair analysis means for performing pair analysis, in which the mapping data of the first subject and the second subject are analyzed together to detect mutations, and for recording the detected mutations as mutations common to the first subject and the second subject; and
mutation determination means for removing duplicates from the detected mutations and outputting the result as the mutations of the first subject.

The analysis device according to claim 1, wherein the predetermined condition is being a parent of the first subject, and
the family information analysis means selects both parents of the first subject as the second subjects.

The analysis device according to claim 1, wherein the predetermined condition is being a child of the first subject or a founder, a founder being any subject who lacks at least one parent in the family information, and
the family information analysis means selects the children of the first subject and the founders as the second subjects.

The analysis device according to claim 1, wherein the predetermined condition is being one parent of the first subject, a child of the first subject, or a founder, a founder being any subject who lacks at least one parent in the family information, and
the family information analysis means selects the one parent and the children of the first subject and the founders as the second subjects.

The predetermined condition is being the subject with the largest probability P(y),
the family information analysis means, for the first subject x and each second-subject candidate y,
with F(x, y) being the kinship coefficient between the first subject x and the second-subject candidate y,
V being a set of subsets obtained by partitioning the set of mutations of the first subject, each subset consisting only of mutations inherited by the same subject,
n(v) being the number of times the subset v ∈ V of the first subject's mutations has been subjected to pair analysis for the first subject,
s(v) being the proportion of the first subject x's mutations that belong to v, and
r being the mutation detection probability when mutation analysis is performed using the mapping data of two subjects,
evaluates the probability that a new mutation will be detected as $P(y) = \sum_{v \in V} 2F(x, y)\, s(v)\, r\, (1 - r)^{n(v)}$, and
the analysis device according to claim 1, wherein the subject having the largest probability P(y) is selected as the second subject.

A database creation method comprising:
a step of storing mapping data obtained by mapping sequence data obtained from the genomes of a plurality of subjects to a reference genome sequence in a public database;
a step of storing family information representing the blood relationships among the plurality of subjects;
a step of selecting a first subject from among the plurality of subjects;
a step of sequentially selecting, based on the family information, a second subject that satisfies a predetermined condition;
a step of performing pair analysis, in which the mapping data of the first subject and the second subject are analyzed together to detect mutations, and recording the detected mutations as mutations common to the first subject and the second subject; and
a step of removing duplicates from the detected mutations and creating a database of the result as the mutations of the first subject.

The predetermined condition is the parents of the first subject,
The database creation method according to claim 6, wherein in the step of sequentially selecting the second subject, the parents of the first subject are selected as the second subject.

The predetermined condition is being a child of the first subject or a founder, a founder being any subject lacking at least one parent in the pedigree information,
The database creation method according to claim 6, wherein in the step of sequentially selecting the second subject, the children of the first subject and the founders are selected as the second subject.

The predetermined condition is being one parent of the first subject, a child of the first subject, or a founder, a founder being any subject lacking at least one parent in the pedigree information,
The database creation method according to claim 6, wherein in the step of sequentially selecting the second subject, one parent of the first subject, the children of the first subject, and the founders are selected as the second subject.
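
The three pedigree-based conditions in the preceding claims (parents; children and founders; one parent, children, and founders) can be sketched in Python as follows. The pedigree representation (a mapping from each subject to a father/mother pair, with `None` for a parent not present in the pedigree information), the function names, and the condition labels are all assumptions for illustration. The sketch also assumes that "one parent" may be any single parent and simply takes the first one listed.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pedigree: subject -> (father, mother); None = not in pedigree
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def founders(pedigree: Pedigree) -> List[str]:
    """Subjects lacking at least one parent in the pedigree information."""
    return [s for s, (f, m) in pedigree.items() if f is None or m is None]


def parents(pedigree: Pedigree, x: str) -> List[str]:
    """Parents of x that are present in the pedigree information."""
    return [p for p in pedigree[x] if p is not None]


def children(pedigree: Pedigree, x: str) -> List[str]:
    """Subjects that list x as father or mother."""
    return [s for s, (f, m) in pedigree.items() if x in (f, m)]


def second_subject_candidates(pedigree: Pedigree, x: str, condition: str) -> List[str]:
    """List second-subject candidates for first subject x under a condition."""
    if condition == "parents":
        selected = parents(pedigree, x)
    elif condition == "children_and_founders":
        selected = children(pedigree, x) + founders(pedigree)
    elif condition == "one_parent_children_and_founders":
        selected = parents(pedigree, x)[:1] + children(pedigree, x) + founders(pedigree)
    else:
        raise ValueError(f"unknown condition: {condition}")
    # Drop x itself and duplicates, keeping the selection order.
    return [s for s in dict.fromkeys(selected) if s != x]
```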

The predetermined condition is being the subject having the largest probability P(y),
The step of sequentially selecting the second subject includes,
for the first subject x and each second-subject candidate y,
where F(x, y) is the affinity coefficient between the first subject x and the second-subject candidate y,
V is a set of subsets obtained by dividing the set of mutations of the first subject, each subset consisting only of mutations inherited by the same subject,
n(v) is the number of times that the subset v ∈ V of mutations of the first subject has been covered by pair analysis for the first subject,
s(v) is the proportion of the mutations in v among all mutations of the first subject x, and
r is the mutation detection probability when mutation analysis is performed using the mapping data of two subjects,
evaluating the probability that a new mutation will be detected as $P(y) = \sum_{v \in V} 2F(x, y)\, s(v)\, r\, (1 - r)^{n(v)}$,
The database creation method according to claim 6, wherein the subject having the largest probability P(y) is selected as the second subject.
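
As a worked illustration of the formula, consider purely hypothetical values that are not taken from the patent: F(x, y) = 1/4, r = 0.8, and V = {v1, v2} with s(v1) = s(v2) = 0.5, where v1 has not yet been covered by pair analysis (n(v1) = 0) and v2 has been covered once (n(v2) = 1).

```latex
\begin{align*}
P(y) &= \sum_{v \in V} 2F(x, y)\, s(v)\, r\, (1 - r)^{n(v)} \\
     &= 2 \cdot \tfrac{1}{4} \cdot \left[ 0.5 \cdot 0.8 \cdot (0.2)^{0} + 0.5 \cdot 0.8 \cdot (0.2)^{1} \right] \\
     &= 0.5 \cdot (0.4 + 0.08) \\
     &= 0.24
\end{align*}
```

The subset that has already been analyzed once contributes only 0.08 to the bracketed sum, against 0.4 for the subset not yet analyzed, which shows how the factor (1 - r)^{n(v)} discounts mutations that earlier pair analyses were likely to have found.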

Further comprising a step of determining a threshold k for treating a mutation as common, based on the positions and DNA sequences of published known mutations,
The database creation method according to claim 6, wherein the step of performing the pair analysis records a common mutation when the number of detected mutation sequences exceeds the threshold k.

Further comprising a step of determining a threshold k for treating a mutation as common, based on the Sanger method,
The database creation method according to claim 6, wherein the step of performing the pair analysis records a common mutation when the number of detected mutation sequences exceeds the threshold k.

A system including a computer that selects a first subject from a plurality of subjects and outputs the mutations of the first subject,
The system further including a storage device that stores mapping data obtained by mapping sequence data obtained from the genomes of the plurality of subjects to a reference genome sequence in a public database, pedigree information representing the blood relationships of the plurality of subjects, and a database output by the computer,
Wherein the computer comprises:
Storage means for inputting the mapping data and the pedigree information from the storage device and storing them;
Pedigree information analysis means for sequentially selecting a second subject that satisfies a predetermined condition, based on the pedigree information;
Pair analysis means for performing pair analysis that simultaneously analyzes the mapping data of the first subject and the second subject to detect mutations, and recording each detected mutation as a mutation common to the first subject and the second subject; and
Mutation determination means for removing duplicates from the detected mutations and outputting the result as the database of the mutations of the first subject.

The system according to claim 13, wherein the storage device and the computer are connected via a network.

The system according to claim 13, wherein the storage device is a removable medium, and the storage device is attached to the computer.