JP3437521B2

JP3437521B2 - Apparatus and method for correcting frame shift error of cDNA sequence, and recording medium recording program for executing the method

Info

Publication number: JP3437521B2
Application number: JP2000058402A
Authority: JP
Inventors: 崎良英林
Original assignee: RIKEN
Current assignee: RIKEN
Priority date: 2000-03-03
Filing date: 2000-03-03
Publication date: 2003-08-18
Anticipated expiration: 2020-03-03
Also published as: JP2001249110A; WO2001065249A1

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates to an apparatus and a method for correcting frameshift errors in a cDNA sequence, and to a recording medium on which a program for executing the method is recorded.

[0002] Related Art
Many hypothetical proteins have been predicted from rough-draft genome sequences, and large-scale efforts have begun to sequence mouse full-length cDNAs that contain complete CDS information. Base-calling errors in the CDS, and frameshift errors in the CDS in particular, are a serious problem when amino acid sequences are deduced from rough-draft cDNA or genomic sequences. Finishing these sequences is time-consuming, so information on the positions of putative frameshift errors is very important for designing primers in the finishing work. In addition, rough drafts of amino acid sequences are useful for gene classification and data exploration. Several methods have been developed to detect and prevent frameshift errors caused by sequencing errors.

[0003] However, most computer software that corrects frameshifts in DNA sequences was developed for genomic DNA analysis, not for full-length cDNA sequences. In particular, full-length cDNA efforts such as the RIKEN project require high-quality correction of rough-draft cDNA sequences in order to minimize errors in the amino acid sequence information derived from the DNA sequences. Conventional software is designed to correct frameshifts using only text-based analysis of the bases in the DNA sequence. However, in addition to the textual base information, information on sequence quality obtained from the sequencing experiment is extremely important for unambiguous correction.

[0004] Recently, it has become possible to express sequence quality as a numerical score, namely the phred score. Such methods, as applied to cDNA, were developed for genome analysis (White, O. et al., (1993) Nuc. Acids Res. 21, 3829-3838; Richterich, P. (1998) Genome Res. 8, 251-259; Medigue, C. et al., (1999) Genome Res. 9, 1116-1127). The information obtained by these methods can be used to suggest the positions of frameshift errors (Ewing, B. et al., (1998) Genome Res. 8, 175-185; Ewing, B., and Green, P. (1998) Genome Res. 8, 186-194), but it has remained insufficient in terms of accuracy. Correction of frameshift errors has therefore had to rely on human visual inspection and experience, and predicting and verifying accurate amino acid sequences has required a great deal of labor.

[0005]

SUMMARY OF THE INVENTION
The present invention has been made to solve these problems.

[0006] That is, an object of the present invention is to provide means for identifying the positions of frameshifts in a cDNA sequence and for predicting and verifying the correct amino acid sequence.

[0007] A first aspect of the present invention is an apparatus for correcting frameshift errors in a cDNA sequence, comprising: an input unit for inputting a cDNA sequence and the electrophoresis image data used for sequencing that cDNA sequence; a modified sequence creation unit that creates modified cDNA sequences by introducing into the input cDNA sequence 1 to 4 artificial mutations selected from the group consisting of insertions and deletions; a calculation unit that calculates the Va score of the input cDNA sequence and the Va scores of the modified cDNA sequences based on formula (I),
Va = log P1 + log P2   (I)
(in the above formula, P1 and P2 are given by expressions not reproduced in this text; Pcodon(i) represents the probability that a more preferred codon choice occurs at codon position i, i represents the codon position number, W1 represents the position of the start codon, W2 represents the position of the stop codon, Perr(j) represents the probability of a base-call error at position j of an artificial mutation, and D represents a proportionality constant of 0 to 120); and an output unit that selects and outputs the sequence having the smallest Va score.

[0008] A second aspect of the present invention is a computer-readable recording medium on which is recorded a program for correcting frameshift errors in a cDNA sequence, the program causing a computer to execute: a procedure of inputting a cDNA sequence and the electrophoresis image data used for sequencing that cDNA sequence; a procedure of creating modified cDNA sequences by introducing into the input cDNA sequence 1 to 4 artificial mutations selected from the group consisting of insertions and deletions; a procedure of calculating the Va score of the input cDNA sequence and the Va scores of the modified cDNA sequences based on formula (I),
Va = log P1 + log P2   (I)
(in the above formula, P1 and P2 are given by expressions not reproduced in this text; Pcodon(i) represents the probability that a more preferred codon choice occurs at codon position i, i represents the codon position number, W1 represents the position of the start codon, W2 represents the position of the stop codon, Perr(j) represents the probability of a base-call error at position j of an artificial mutation, and D represents a proportionality constant of 0 to 120); and a procedure of selecting and outputting the sequence having the smallest Va score.

[0009] A third aspect of the present invention is a method for correcting frameshift errors in a cDNA sequence, comprising the steps of: preparing a cDNA sequence and the electrophoresis image data used for sequencing that cDNA sequence; creating modified cDNA sequences by introducing into the prepared cDNA sequence 1 to 4 artificial mutations selected from the group consisting of insertions and deletions; calculating the Va score of the prepared cDNA sequence and the Va scores of the modified cDNA sequences based on formula (I),
Va = log P1 + log P2   (I)
(in the above formula, P1 and P2 are given by expressions not reproduced in this text; Pcodon(i) represents the probability that a more preferred codon choice occurs at codon position i, i represents the codon position number, W1 represents the position of the start codon, W2 represents the position of the stop codon, Perr(j) represents the probability of a base-call error at position j of an artificial mutation, and D represents a proportionality constant of 0 to 120); and selecting the sequence having the smallest Va score.

[0010] According to the first, second, and third aspects of the present invention, the search for frameshifts in a cDNA sequence is automated, and the accuracy of the search results was far superior to that of conventional methods. The labor of searching for frameshifts, which has conventionally been done by hand, is therefore greatly reduced.

[0011] According to the first, second, and third aspects of the present invention, the frameshift search results are also quantified. Thus not only is the correct amino acid sequence predicted, but the reliability of the prediction is also known. This is advantageous when the base sequence is re-read experimentally to confirm its accuracy: the results indicate which regions should be re-read with priority, which simplifies the experimental work.

[0012]

DETAILED DESCRIPTION OF THE INVENTION
First Aspect
The origin of the cDNA sequence input at the input unit is not particularly limited; cDNA sequences derived from prokaryotic cells and from eukaryotic cells may both be input. The input cDNA sequence is preferably a full-length cDNA sequence. When the cDNA sequence is derived from a eukaryotic cell, it preferably includes the region from the 5' non-coding region to the 3' non-coding region.

[0013] The electrophoresis image data used for sequencing the input cDNA sequence is also input at the input unit. This data can be an image file of the electrophoresis result, such as an SCF file or a trace file, and is used to calculate the probability of a base-call error at position j of an artificial mutation, as described later. In the modified sequence creation unit of the apparatus according to the present invention, 1 to 4 (preferably 1 to 2) artificial mutations selected from the group consisting of insertions and deletions are introduced into the input cDNA sequence.

[0014] More preferably, a modified cDNA sequence containing one insertion, a modified cDNA sequence containing one deletion, and a modified cDNA sequence containing one insertion and one deletion are created. In this case, 2 × N sequences containing one artificial insertion or deletion are created, and 2 × N × (N - 1) sequences containing a combination of an artificial insertion and an artificial deletion are created, where N is the number of base pairs in the input cDNA sequence. The total number of sequences created is therefore 2N + 2N(N - 1).
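The candidate count in this paragraph can be checked with a short Python sketch. The function names are hypothetical, and the choice to insert an undetermined base "N" is an assumption; the text does not say which base an artificial insertion adds.

```python
def count_candidates(n: int) -> int:
    """Number of candidates stated in the text: 2N + 2N(N - 1)."""
    return 2 * n + 2 * n * (n - 1)


def single_edit_candidates(seq: str, filler: str = "N") -> list[str]:
    """Return the N single-deletion and N single-insertion variants of seq.

    Assumption: an insertion adds one undetermined base (filler) in front of
    position i. Identical variants (for example, deletions inside a run of
    the same base) are not merged, so the list always has 2N entries.
    """
    deletions = [seq[:i] + seq[i + 1:] for i in range(len(seq))]
    insertions = [seq[:i] + filler + seq[i:] for i in range(len(seq))]
    return deletions + insertions


if __name__ == "__main__":
    print(len(single_edit_candidates("ATGGAGTGA")))  # 2 * 9 single-edit variants
    print(count_candidates(2370))  # total for a 2370 bp cDNA
```

Since 2N + 2N(N - 1) simplifies to 2N², the number of candidates grows quadratically with sequence length, which is why storing or streaming the candidates matters for long cDNAs.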

[0015] The modified sequence creation unit may include a storage unit that temporarily stores the cDNA sequences carrying artificial mutations as a group of cDNA candidates. All candidate sequences may be created in the modified sequence creation unit and stored in the storage unit, after which the Va of each candidate is calculated in the calculation unit.

[0016] In the calculation unit, the Va score of the input cDNA sequence and the Va scores of the modified cDNA sequences are calculated. Va can be calculated according to formula (I) above. The Va score can preferably be calculated based on formula (II):
Va = log P1 + log P2 + log Patg   (II)
(in the above formula, P1 and P2 are as defined above, and Patg represents the probability that a more preferred start codon position occurs).
The Va score can preferably be calculated based on formula (III):
Va = log P1 + log P2 + log Pconsensus   (III)
(in the above formula, P1 and P2 are as defined above, and Pconsensus represents the probability that a more preferred consensus sequence occurs).
More preferably, the Va score can be calculated based on formula (IV):
Va = log P1 + log P2 + log Patg + log Pconsensus   (IV)
(in the above formula, P1 and P2 are as defined above, Patg represents the probability that a more preferred start codon position occurs, and Pconsensus represents the probability that a more preferred consensus sequence occurs).
In formulas (I), (II), (III), and (IV), Pcodon(i) can be determined according to the codon preference probabilities available from Gribskov, M. et al., (1984) Nucl. Acids Res. 12, 539-549 and from the TransTerm website (http://biochem.otago.ac.nz:800/Transterm/homepage.html) (Brown, C.M. et al., (1993) Nuc. Acids Res. 21, 3119-3123; Dalphin, M.E. et al., (1998) Nuc. Acids Res. 26, 335-337). Preferably, the codon preference probability data are selected according to the origin of the input cDNA sequence.

[0017] When the input cDNA sequence is of mammalian origin (for example, human, mouse, rat, cow, or pig), Pcodon can preferably be determined according to the occurrence probabilities listed in the table below.

[0018]

[Table 7]
If a codon contains an undetermined base ("N"), Pcodon can be set to 1/2.
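As a minimal sketch of this lookup rule in Python: the helper name `p_codon` and the dictionary layout are assumptions. The table itself is not reproduced in this text, so the two entries below are only the values quoted for CTA and GAG in Example 1 of this document.

```python
def p_codon(codon: str, table: dict[str, float]) -> float:
    """Look up Pcodon for one codon; undetermined codons get 1/2."""
    codon = codon.upper()
    if "N" in codon:
        return 0.5  # rule from the text for codons containing "N"
    return table[codon]


# Partial table: only the two values quoted in Example 1 (not the full Table 1).
example_table = {"CTA": 57 / 64, "GAG": 2 / 64}

print(p_codon("GAG", example_table))
print(p_codon("ctn", example_table))
```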

[0019] Because Pkozak is not independent of Pcodon, part of the probability in formulas (I), (II), (III), and (IV) is counted twice. This is because the Kozak consensus includes the first base of the following codon, that is, the final G of AnnATGG, which belongs to the next codon. The present inventors apply the method according to the present invention to proteins of more than 30 amino acids, for which this double counting has little effect on Va and can be ignored.

[0020] In formulas (I), (II), (III), and (IV), Perr(j) represents the probability that the j-th base call is erroneous, and can be calculated from the electrophoresis image data used for sequencing the cDNA sequence.

[0021] Specifically, in formulas (I), (II), (III), and (IV), Perr(j) can be calculated according to the following formula:
Perr(j) = 10^{-0.1 Q(j)}
where Q(j) is the phred score representing the reliability of the base call at position j of the artificial mutation. Q(j) can be calculated from the electrophoresis image data used for sequencing the cDNA sequence, based on the spacing and height of the signal peaks of each base and the dispersion of the distance from the nearest unassigned base (N) (B. Ewing et al., Genome Research 8:175-185 (1998); B. Ewing and P. Green, Genome Research 8:186-194 (1998)). Software for calculating phred scores is available from http://bozeman.genome.washington.edu/ (University of Washington, USA).
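The conversion from a phred score to an error probability is a single expression. The following Python sketch uses a hypothetical function name and assumes the scores have already been read from the trace data.

```python
def phred_to_error_probability(q: float) -> float:
    """Perr = 10^(-0.1 * Q), the formula given in the text."""
    return 10 ** (-0.1 * q)


# Q = 10, 20, 30 correspond to error probabilities of about 0.1, 0.01, 0.001.
for q in (10, 20, 30):
    print(q, phred_to_error_probability(q))
```

A base with Q = 20 is thus miscalled about once in 100 calls, so an artificial insertion or deletion placed at a low-Q position is far more plausible than one placed at a high-Q position.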

[0022] Q(j) can also be calculated using commercially available software equivalent to phred (for example, BaseImagIR (Licor, www.licor.com) or Trace Tuner (Paracel, www.paracel.com)).

[0023] The calculation unit of the apparatus according to the present invention may include the phred score calculation software described above.

[0024] In formulas (I), (II), (III), and (IV), D represents a proportionality constant of 0 to 120, preferably 60 to 100. In formulas (I), (II), (III), and (IV), Patg can be determined according to the occurrence probability of start codons.

[0025] When the input cDNA sequence is of mammalian origin (for example, human, mouse, rat, cow, or pig), Patg can preferably be determined according to the occurrence probabilities listed in Table 2.

[0026] When the input cDNA sequence is derived from a non-mammalian species, occurrence probability data can be constructed according to the origin of the input cDNA sequence and used. The occurrence probability data of Table 2 can of course also be used even when the input cDNA sequence is of non-mammalian origin.

[0027]

[Table 8]
Table 2 shows which ATG position is selected as the start codon, and the probability (Patg) that a random sequence shows a more preferred ATG position. "n" means that the start codon is the n-th ATG from the 5' end, counted over all frames. 60.3% of start codons are the first ATG from the 5' end, and 18.1% are the second ATG from the 5' end. An ATG further 5' upstream is preferred over one downstream. The average length of the 5' UTR in the test data set provided in the examples below is 151.6 bp. For convenience, the number of ATGs was fitted and the denominator of Patg was set to an arbitrary number, here 7.

[0028] In formulas (I), (II), (III), and (IV), Pconsensus can be determined according to the occurrence probability of more preferred consensus sequences. Preferably, the occurrence probability data for the consensus sequence are selected according to the origin of the input cDNA sequence.

[0029] Examples of consensus sequences include the Kozak sequence and the Shine-Dalgarno sequence.

[0030] When the input cDNA sequence is of mammalian origin (for example, human, mouse, rat, cow, or pig), Pconsensus can be Pkozak, the probability that a more preferred Kozak consensus sequence occurs. Pkozak can be determined according to the occurrence probabilities of the preferred Kozak consensus sequences disclosed in Kozak, M., Nucleic Acids Research, 15:8125-8148 (1987).

[0031] Pkozak can preferably be determined according to the table below.

[0032]

[Table 9]
Table 3 shows the proportion of each consensus around the start codon and the probability (Pkozak) that a random sequence shows a more preferred consensus. The consensus is written XnnatgY (X, Y = A, G, C, T), where X and Y are the bases in question. AnnatgG is the most preferred of the 16 alternatives, so a random sequence gives a consensus at least as preferred as AnnatgG with a probability of 1/16.

[0033] In Table 3, the proportions of A, G, C, and T in the CDS region were assumed to be equal to one another. The proportions of the four bases in the CDS (coding sequence) are A:G:C:T = 25.9% : 26.4% : 25.9% : 21.8%. The proportions in the 5' UTR are A:G:C:T = 20.8% : 29.2% : 29.8% : 20.3%. The 5' untranslated region (UTR) is in fact GC-rich, whereas the proportions of A, G, C, and T in the CDS are nearly equal.

[0034] When the input cDNA sequence is of prokaryotic origin (for example, Escherichia coli or a thermophilic bacterium), Pconsensus can be Ps-d, the probability that a more preferred Shine-Dalgarno consensus sequence occurs.

[0035] In the output unit, the sequence most likely to be the cDNA sequence is selected from among these sequences based on the Va score. Specifically, the cDNA sequence candidate with the smallest Va score is selected as the "more cDNA-like sequence".
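The selection step amounts to taking the minimum over the candidates. A minimal Python sketch follows, assuming the Va scores have already been computed; the function name and the candidate labels are hypothetical, and the two scores are the values reported in Example 1 of this document.

```python
from typing import Callable, Iterable


def select_best(candidates: Iterable[str], va_score: Callable[[str], float]) -> str:
    """Return the candidate with the smallest Va score.

    On ties, the first candidate encountered is returned (behaviour of min).
    """
    return min(candidates, key=va_score)


# Precomputed scores keyed by a candidate label (labels are placeholders).
scores = {"sequence_1": -0.89, "sequence_2": -2.95}
print(select_best(scores, scores.__getitem__))
```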

[0036] Second and Third Aspects
The input unit, modified sequence creation unit, calculation unit, output unit, and, where present, storage unit of the first aspect described above can each be implemented as a program module running on a computer system 20 as shown in FIG. 1. A frameshift error correction program for cDNA sequences that contains such program modules is recorded on a recording medium such as a floppy (registered trademark) disk or a CD-ROM (Compact Disk-Read Only Memory), and is read by the computer system 20 to correct frameshift errors as described above.

[0037] As shown in FIG. 1, the computer system 20 includes a computer main body 21 housed in a casing such as a mini-tower, a display device 22 such as a CRT (cathode ray tube), a printer 23 as a recording output device, a keyboard 24a and a mouse 24b as input devices, a floppy disk drive 26 for reading information from a floppy disk 31 as a recording medium, and a CD-ROM drive 27 for reading information from a CD-ROM 32 as a recording medium.

[0038] Shown as a block diagram in FIG. 2, the casing that houses the computer main body 21 further contains an internal memory 25 such as RAM (Random Access Memory) and an external storage device such as a hard disk unit 28. As shown in FIG. 1, the floppy disk (recording medium) 31 on which the frameshift error correction program for cDNA sequences is recorded is inserted into the slot of the floppy disk drive 26 and installed on the computer main body 21 by a predetermined procedure. The recording medium for the program according to the present invention is not limited to the floppy disk 31; it may be the CD-ROM 32, the internal memory 25, the hard disk unit 28, or a medium not shown, such as an MO (Magneto-Optical) disk, an optical disk, or a DVD (Digital Versatile Disk).

[0039] Example 1
An example of calculating the Va score of a hypothetical cDNA sequence is shown here. Assume a cDNA sequence with the following two open reading frames, 5'-TnnATGCTATGAnnGnnATGGAGTGA-3' (containing Sequence 1 and Sequence 2).
Sequence 1 = TnnATGCTA
Sequence 2 = GnnATGGAG
There are 16 alternatives for the context around the start codon, XnnATGY (where X and Y are each one of the four bases A, G, C, and T). TnnATGC is the rarest of the 16 contexts around a start codon; every other context is more likely to be a start codon than TnnATGC. Pkozak is therefore equal to 1. GnnATGG, on the other hand, is the second most probable of the 16 contexts around the start codon, so its Pkozak is equal to 2/16. CTA in Sequence 1 is a rare codon, ranking 57th in frequency among all 64 codons. GAG is the second most probable of the 64 codons. Pcodon is 57/64 for CTA and 2/64 for GAG. Sequence 1 lies upstream of Sequence 2, so Patg is 1/7 for Sequence 1 and 2/7 for Sequence 2. Pkozak, Patg, and Pcodon for each frame are:
Sequence 1: Pkozak = 16/16, Patg = 1/7, Pcodon = 57/64
Sequence 2: Pkozak = 2/16, Patg = 2/7, Pcodon = 2/64

[0040] The Va values of Sequence 1 and Sequence 2 are:
Va(Sequence 1) = log(16/16) + log(1/7) + log(57/64) = -0.89
Va(Sequence 2) = log(2/16) + log(2/7) + log(2/64) = -2.95
Sequence 2 can therefore be said to be closer to a CDS than Sequence 1.
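The arithmetic of Example 1 can be reproduced with a few lines of Python. This sketch covers only the three terms used in the example (Pkozak, Patg, and a single Pcodon), not the full formula (IV) with its P2 term. The reported values are only obtained with base-10 logarithms, so `math.log10` is used; the function name is hypothetical.

```python
import math


def va_example(p_kozak: float, p_atg: float, p_codon: float) -> float:
    """Sum of base-10 logs of the three probabilities used in Example 1."""
    return math.log10(p_kozak) + math.log10(p_atg) + math.log10(p_codon)


va1 = va_example(16 / 16, 1 / 7, 57 / 64)  # Sequence 1
va2 = va_example(2 / 16, 2 / 7, 2 / 64)    # Sequence 2
print(f"{va1:.3f} {va2:.3f}")  # prints "-0.895 -2.952"
print("Sequence 2" if va2 < va1 else "Sequence 1", "has the smaller Va")
```

The computed values, about -0.895 and -2.952, agree with the reported -0.89 and -2.95 to within rounding of the last digit, and the smaller score for Sequence 2 matches the conclusion in the text.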

[0041] Example 2
An example in which cDNA sequences containing frameshifts were repaired according to the present invention is shown below. Frameshifted cDNA sequences (the test data set) were prepared by introducing insertions/deletions into complete cDNA sequences in advance.

[0042] The test data set consists of 612 known complete mouse CDS sequences downloaded from the NCBI nt database, with artificial insertions/deletions introduced into each sequence as base-call errors. A random number generator produced a phred score for each base, and the probabilities of insertion and deletion follow that phred score. One-sixth of the base-call errors are deletions and one-sixth are insertions. The base-call accuracy of the test data was set to approximately 99%. The average cDNA length in the test data set is 2370 bp. Table 4 shows the distribution of the number of insertions and deletions in the CDS (the rate of frameshift errors). 14.4% of the sequences contain no frameshift error in the CDS, and 85.6% contain one or more frameshift errors in the CDS. These statistics do not include insertions/deletions in the UTRs.
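A rough Python sketch of how such a test sequence might be generated is given below. It is not the procedure actually used for the test data set: the text does not say what the remaining 4/6 of base-call errors are, so treating them as substitutions is an assumption, as is drawing inserted and substituted bases uniformly at random. A constant Q = 20 stands in for the randomly generated per-base phred scores, and the function name is hypothetical.

```python
import random


def corrupt(seq: str, quals: list[int], rng: random.Random) -> str:
    """Introduce simulated base-call errors into seq.

    Base j is miscalled with probability 10^(-0.1 * quals[j]). An error is
    a deletion with probability 1/6 and an insertion with probability 1/6,
    as in the text. Assumption: the other 4/6 of errors are substitutions.
    """
    out = []
    for base, q in zip(seq, quals):
        if rng.random() >= 10 ** (-0.1 * q):
            out.append(base)  # correct call
            continue
        kind = rng.random()
        if kind < 1 / 6:
            continue  # deletion: drop the base
        if kind < 2 / 6:
            out.append(base)  # insertion: keep the base, add an extra one
            out.append(rng.choice("ACGT"))
        else:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
    return "".join(out)


rng = random.Random(0)             # fixed seed for repeatability
seq = "ATGGAGCTATGA" * 10
quals = [20] * len(seq)            # Q = 20 corresponds to 99% accuracy
noisy = corrupt(seq, quals, rng)
print(len(seq), len(noisy))
```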

[0043]

[Table 10]
Va was calculated according to formula (IV) above. Pcodon(i) was determined according to Table 1.

[0044] Q(j) was calculated using phred version 0.961028.m (obtained from http://bozeman.genome.washington.edu/).

[0045] Patg was determined according to Table 2. Pconsensus was calculated as Pkozak. Pkozak was determined according to Table 3.

[0046] For each prepared cDNA sequence, a modified cDNA sequence containing one artificial insertion, a modified cDNA sequence containing one artificial deletion, and a modified cDNA sequence containing one artificial insertion and one artificial deletion were created, and the Va score of the input cDNA sequence and the Va scores of the candidate cDNA sequences were calculated.

[0047] Estimation accuracy was evaluated as the proportion of deduced amino acid sequences showing homology above a threshold to the correct amino acid sequence. The homology thresholds were set to 100%, 98%, 95%, 90% and 85%. D was set to 80 as the optimum value after various values of D had been tried.
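The accuracy measure above amounts to counting, for each threshold, the share of deduced proteins whose identity to the correct protein reaches that threshold. A small sketch follows; the function name and the sample identity values are hypothetical, and the comparison is taken as "greater than or equal" so that the 100% threshold is attainable, which is an assumption about how the thresholds were applied.

```python
def accuracy_at_thresholds(identities, thresholds=(100, 98, 95, 90, 85)):
    """Map each threshold (percent) to the fraction of sequences whose
    percent identity to the correct protein is at or above it."""
    n = len(identities)
    return {t: sum(1 for x in identities if x >= t) / n for t in thresholds}

# Hypothetical percent identities for five deduced proteins.
print(accuracy_at_thresholds([100, 99, 96, 90, 50]))
```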

[0048] The results are shown in Table 5.

[0049]

[Table 11] Table 5 shows the estimation accuracy of several methods.

[0050] Genscan ver. 1.0 (Burge, C. and Karlin, S. (1997) J. Mol. Biol. 268, 78-94; Burge, C. and Karlin, S. (1998) Curr. Opin. Struct. Biol. 8, 346-354) was used in the analysis of estimation accuracy. In DECORDER^a, the number of artificial frameshifts was limited to two. In DECORDER^b, the number of artificial frameshifts was limited to four.

[0051] In Table 5, "longest frame" means that the longest frame is selected as the CDS of each test sequence without any data correction. Genscan was developed for genome analysis, not for cDNA analysis. Even when applied to cDNA, however, this program estimates the CDS region better than simple selection of the longest frame. The inventor could not find any other suitable program for deducing the protein from a cDNA containing frameshift errors. Genscan works well when it is accompanied by promoter prediction, but a cDNA contains no promoter. In such an application, the performance of Genscan is very limited.
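The "longest frame" baseline can be illustrated with the sketch below. The text does not define the frame precisely, so several choices here are assumptions: only the three forward reading frames are scanned, a frame runs from an ATG to the next in-frame stop codon, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_frame(seq):
    """Return (start, end) of the longest ATG-to-stop open reading frame
    over the three forward frames, or None if there is none.
    The end index is exclusive and includes the stop codon.
    """
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a new frame
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ATG after this stop
    return best

seq = "ATGAAATAG" + "ATGCCCGGGTTTTGA"
print(longest_frame(seq))  # (9, 24)
```

A single insertion or deletion inside the true CDS shifts the downstream codons into another frame, which is why this baseline degrades on sequences that contain frameshift errors.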

[0052] The program for carrying out the method according to the present invention is called DECORDER. Of the proteins deduced by DECORDER, 68 to 70% (417 of the 612 test samples) show greater than 85% identity to the correct protein. When the required estimation accuracy is higher than 85%, DECORDER shows the best estimation accuracy of the three methods. Genscan and DECORDER give almost equivalent results at an estimation accuracy of 85%, but they differ in the proportion of well-estimated proteins. With DECORDER, 35 to 36% of the deduced proteins show greater than 98% identity, and 32 to 35% (= 68% − 36% or 70% − 35%) show 85 to 98% identity. With Genscan, on the other hand, 15% of the deduced proteins show greater than 98% identity, and 54% (= 69% − 15%) show 85 to 98% identity. Genscan and DECORDER can be combined. When Genscan deduces proteins from cDNA corrected by DECORDER, 73% of the deduced proteins show greater than 85% identity, which is the highest value at the 85% estimation level (data not shown).

[Brief Description of the Drawings]

FIG. 1 is a perspective view showing a computer system that uses a computer-readable recording medium on which a program for correcting frame shift errors in a cDNA sequence is recorded.

FIG. 2 is a block diagram showing the hardware configuration of the computer system of FIG. 1.

[Explanation of symbols]

20 computer system
21 computer body
22 display device
23 printer
24a input device
24b mouse
25 recording medium (internal memory)
26 floppy disk drive
27 CD-ROM drive
28 recording medium (hard disk unit)
31 recording medium (floppy disk)
32 recording medium (CD-ROM)

Claims

(57) [Claims]

1. An apparatus for correcting a frame shift error in a cDNA sequence, comprising: an input unit for inputting the cDNA sequence and the electrophoresis image data used for sequencing the cDNA sequence; a modified sequence creation unit for creating a modified cDNA sequence by introducing 1 to 4 artificial mutations selected from the group consisting of insertions and deletions into the input cDNA sequence; an arithmetic unit for calculating the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (I): Va = logP1 + logP2 (I) (wherein, in the definitions of P1 and P2, Pcodon(i) represents the probability that a more preferred codon choice will occur at codon position i, i represents the codon position number, W1 represents the start codon position, W2 represents the stop codon position, Perr(j) represents the probability of a base call error at position j of the artificial mutation, and D represents a proportionality constant of 0 to 120); and an output unit for selecting and outputting the sequence having the smallest Va score.

2. The apparatus according to claim 1, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (II): Va = logP1 + logP2 + logPatg (II) (wherein P1 and P2 are as defined in claim 1, and Patg represents the probability that a more preferable start codon position occurs).

3. The apparatus according to claim 1, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (III): Va = logP1 + logP2 + logPconsensus (III) (wherein P1 and P2 are as defined in claim 1, and Pconsensus represents the probability that a more preferable consensus sequence occurs).

4. The apparatus according to claim 1, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (IV): Va = logP1 + logP2 + logPatg + logPconsensus (IV) (wherein P1 and P2 are as defined in claim 1, Patg represents the probability that a more preferable start codon position occurs, and Pconsensus represents the probability that a more preferable consensus sequence occurs).

5. The apparatus according to claim 1, wherein Pcodon (i) is determined in the arithmetic unit according to the following table. [Table 1]

6. The apparatus according to any one of claims 1 to 5, wherein, in the arithmetic unit, Perr(j) is calculated according to Perr(j) = 10^{−0.1Q(j)} (wherein Q(j) represents the phred score at position j of the artificial mutation).

7. The apparatus according to claim 2, wherein Patg is determined according to the following table in the arithmetic unit. [Table 2]

8. The apparatus according to any one of claims 3 to 7, wherein, in the arithmetic unit, Pconsensus is the probability Pkozak that a more preferable Kozak consensus sequence occurs.

9. The apparatus according to claim 8, wherein Pkozak is determined according to the following table. [Table 3]

10. The apparatus according to any one of claims 1 to 9, wherein the modified sequence creation unit introduces 1 to 2 artificial mutations into the input cDNA sequence.

11. The apparatus according to claim 10, wherein the modified sequence creation unit creates a modified cDNA sequence containing one insertion, a modified cDNA sequence containing one deletion, and a modified cDNA sequence containing one insertion and one deletion.

12. The apparatus according to claim 1, wherein the cDNA sequence is derived from mammals.

13. A computer-readable recording medium on which is recorded a program for correcting a frame shift error in a cDNA sequence, the program causing a computer to execute: a procedure of inputting the cDNA sequence and the electrophoresis image data used for sequencing the cDNA sequence; a procedure of creating a modified cDNA sequence by introducing 1 to 4 artificial mutations selected from the group consisting of insertions and deletions into the input cDNA sequence; a procedure of calculating the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (I): Va = logP1 + logP2 (I) (wherein, in the definitions of P1 and P2, Pcodon(i) represents the probability that a more preferred codon choice will occur at codon position i, i represents the codon position number, W1 represents the start codon position, W2 represents the stop codon position, Perr(j) represents the probability of a base call error at position j of the artificial mutation, and D represents a proportionality constant of 0 to 120); and a procedure of selecting and outputting the sequence having the smallest Va score.

14. The recording medium according to claim 13, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (II): Va = logP1 + logP2 + logPatg (II) (wherein P1 and P2 are as defined in claim 13, and Patg represents the probability that a more preferable start codon position occurs).

15. The recording medium according to claim 13, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (III): Va = logP1 + logP2 + logPconsensus (III) (wherein P1 and P2 are as defined in claim 13, and Pconsensus represents the probability that a more preferable consensus sequence occurs).

16. The recording medium according to claim 13, wherein the arithmetic unit calculates the Va score of the input cDNA sequence and the Va score of the modified cDNA sequence based on formula (IV): Va = logP1 + logP2 + logPatg + logPconsensus (IV) (wherein P1 and P2 are as defined in claim 13, Patg represents the probability that a more preferable start codon position occurs, and Pconsensus represents the probability that a more preferable consensus sequence occurs).

17. The recording medium according to any one of claims … to 16, wherein the arithmetic unit determines Pcodon(i) according to the following table. [Table 4]

18. The recording medium according to claim …, wherein, in the arithmetic unit, Perr(j) is calculated according to Perr(j) = 10^{−0.1Q(j)} (wherein Q(j) represents the phred score at position j of the artificial mutation).

19. The recording medium according to any one of claims … to 18, wherein the arithmetic unit determines Patg according to the following table. [Table 5]

20. The recording medium according to any one of claims 15 to 19, wherein, in the arithmetic unit, Pconsensus is the probability Pkozak that a more preferable Kozak consensus sequence occurs.

21. The recording medium according to claim 20, wherein Pkozak is determined according to the following table. [Table 6]

22. The recording medium according to any one of claims 13 to 21, wherein the modified sequence creation unit introduces 1 to 2 artificial mutations into the input cDNA sequence.

23. The recording medium according to claim 22, wherein the modified sequence creation unit creates a modified cDNA sequence containing one insertion, a modified cDNA sequence containing one deletion, and a modified cDNA sequence containing one insertion and one deletion.

24. The recording medium according to claim 13, wherein the cDNA sequence is derived from a mammal.

25. A method for correcting a frame shift error in a cDNA sequence, comprising the steps of: preparing a cDNA sequence and the electrophoresis image data used for sequencing the cDNA sequence; creating a modified cDNA sequence by introducing 1 to 4 artificial mutations selected from the group consisting of insertions and deletions into the prepared cDNA sequence; calculating the Va score of the prepared cDNA sequence and the Va score of the modified cDNA sequence based on formula (I): Va = logP1 + logP2 (I) (wherein, in the definitions of P1 and P2, Pcodon(i) represents the probability that a more preferred codon choice will occur at codon position i, i represents the codon position number, W1 represents the start codon position, W2 represents the stop codon position, Perr(j) represents the probability of a base call error at position j of the artificial mutation, and D represents a proportionality constant of 0 to 120); and selecting the sequence having the smallest Va score.

26. The method according to claim 25, characterized in that the Va score of the prepared cDNA sequence and the Va score of the modified cDNA sequence are calculated based on formula (II). Va = logP1 + logP2 + logPatg (II) (In the above formula, P1 and P2 have the same meanings as defined in claim 25, and Patg represents the probability of occurrence of a more preferable start codon position.)

27. The method according to claim 25, wherein the Va score of the prepared cDNA sequence and the Va score of the modified cDNA sequence are calculated based on formula (III). Va = logP1 + logP2 + logPconsensus (III) (In the above formula, P1 and P2 have the same meanings as defined in claim 25, and Pconsensus represents the probability that a more preferable consensus sequence will occur.)

28. The method according to claim 25, wherein the Va score of the prepared cDNA sequence and the Va score of the modified cDNA sequence are calculated based on formula (IV): Va = logP1 + logP2 + logPatg + logPconsensus (IV) (wherein P1 and P2 are as defined in claim 25, Patg represents the probability that a more preferable start codon position occurs, and Pconsensus represents the probability that a more preferable consensus sequence occurs).