JP3353263B2

JP3353263B2 - Gene motif extraction processing apparatus and processing method

Info

Publication number: JP3353263B2
Application number: JP27533694A
Authority: JP
Inventors: 孝五條堀; 義男舘野; 一穂池尾; 祐一川西; 正人河合
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1993-11-12
Filing date: 1994-11-09
Publication date: 2002-12-03
Anticipated expiration: 2017-12-03
Also published as: JPH07274965A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a gene motif extraction processing apparatus and processing method, and more particularly to an apparatus and method that extract motifs, which are conserved sites among sequences, by comparing a plurality of given pieces of gene sequence information. With recent advances in genetic engineering, databases of gene sequence information expressed as DNA sequences or amino acid sequences are growing rapidly. In addition, attempts to determine the entire gene sequence of specific organisms, such as the Human Genome Project, are under way worldwide, and gene sequence information is expected to keep increasing rapidly.

[0002] Among these gene sequences, there are many whose sequence information is known but whose function and structure remain unknown. An effective way to predict the function and structure of such a gene from its sequence information is to search for motifs, which are regularities characteristic of a sequence. This requires a technique for extracting many motifs from sequences that are already known.

[0003]

[Prior Art] Conventionally, motifs, which exhibit a regularity characteristic of a sequence and identify gene function within a gene sequence, have been determined on the basis of experiments and reports in the literature. PROSITE is known as a database in which such motifs are registered. It is generally known that functionally important sites in a gene sequence tend to resist change. Using this fact, motifs can be extracted as conserved regions by comparing a plurality of gene sequences. However, no method for extracting motifs from a comparison of gene sequences has been established so far.

[0004]

[Problems to Be Solved by the Invention] Determining motifs manually through experiments and the like is laborious. If motifs could be extracted mechanically from a comparison of gene sequences, a great deal of information useful for elucidating gene function and the like could be obtained. However, simply comparing each site of a plurality of gene sequences and examining the similarity at each site raises the following problem.

[0005] Namely, if the gene sequence information to be processed is biased toward particular kinds of organisms, the regularity to be extracted is also biased. For example, consider a group of sequences that contains many gene sequences from higher organisms (human, monkey, horse, and so on) and few from lower organisms. When motifs are extracted from the similarity at each site of such a group, a highly similar portion cannot necessarily be regarded as a conserved region that has changed little during evolution, and the conserved regions extracted as motifs may be identified incorrectly. The same applies in the reverse case.

[0006] An object of the present invention is to solve the above problem and to extract motifs mechanically (automatically) from a plurality of pieces of gene sequence information.

[0007]

[Means for Solving the Problems] FIG. 1 shows a configuration example of the present invention. In FIG. 1, reference numeral 10 denotes a processing device comprising a CPU, a memory, and the like. The sequence information input means 11 inputs alignment data for the plurality of gene sequences from which motifs are to be extracted. The phylogenetic tree creation means 12 creates an evolutionary phylogenetic tree 13, based on the dissimilarity between gene sequences, from the alignment data input by the sequence information input means 11. The phylogenetic tree 13 may instead be prepared in advance using, for example, paleontological information.

[0008] The sequence weight calculation means 14 calculates the weight of each sequence from the branch lengths of the phylogenetic tree 13. The score calculation means 15 calculates, for each site of the sequences, a score indicating the degree of similarity of the sequence elements appearing at that site. The score is based on the sequence weights and on a score table 16 that is determined in advance from the similarity between types of sequence elements.

[0009] The feature information extraction means 17 extracts, as motifs, the portions of the gene sequences that have a characteristic regularity, based on the calculated scores, and outputs them to an output device 18 such as a display or a printer. In particular, when the score calculated by the score calculation means 15 exceeds a predetermined or user-set threshold, the feature information extraction means 17 extracts that site as a motif site. The feature information extraction means 17 also calculates the appearance rate of motif sites within a predetermined or user-set continuous region width. When that rate exceeds a predetermined or user-set random level, the continuous region is taken as a motif region, and adjacent motif regions are treated as a single motif region. These calculation results are output to the output device 18. The output device 18 plots values such as the motif appearance rate and the random level and outputs them as a graph.

[0010]

[Operation] The gene motif extraction processing apparatus of the present invention takes multiple alignment data as input and outputs, as motifs, the amino acids that are highly conserved at each site among the sequences making up the alignment data. To correct the bias in amino acid frequencies caused by the presence of evolutionarily closely related sequences, a phylogenetic tree 13 is created from the alignment data, and each gene sequence is weighted according to the branch lengths and shape of the phylogenetic tree 13. Furthermore, to allow for the appearance of amino acids with similar properties, the score at each site is calculated using a score table 16 computed from amino acid similarity. The higher the score obtained here, the more highly conserved the amino acids are at that site.

[0011] To extract motif sites, a score threshold is set, and sites whose score exceeds this threshold are extracted as motif sites. Then, to delimit motif regions, a region width and a random level are set. When the appearance rate of motif sites within the set region width exceeds the random level, that region is regarded as a motif region. Adjacent motif regions are regarded as a single motif region.

[0012]

[Embodiments] Embodiments of the present invention are described below with reference to the drawings, taking gene sequence information expressed as amino acid sequences as an example. FIG. 2 is a processing flowchart of an embodiment of the present invention. Processes (a) to (k) in the following description correspond to processes (a) to (k) shown in FIG. 2.

[0013] Consider alignment data consisting of the five gene sequences A to E shown in FIG. 3. Each letter corresponds to one amino acid, and an asterisk (*) in a sequence denotes a gap.

(a) Alignment data input

The sequence information input means 11 inputs the alignment data of sequences A to E shown in FIG. 3. When many sequences with closely similar sequence information are present, determining the representative amino acid at each site from its frequency of appearance introduces a bias. Therefore, in the processing below, a phylogenetic tree 13 is created from the input alignment data, and a weight for each gene sequence is calculated from the branch lengths and shape of the phylogenetic tree 13. Weighting each sequence with the result corrects the bias.

[0014] (b) Phylogenetic tree creation

The phylogenetic tree creation means 12 creates the phylogenetic tree 13 using, for example, the UPG (Unweighted Pair-Group Clustering) method. Other construction methods may also be used. In this embodiment, the phylogenetic tree is created as follows. First, the dissimilarity between gene sequences is determined from the alignment data. The dissimilarity is calculated for each pair of sequences as the number of amino acid substitutions between them, using the following equation (1), which is commonly used to obtain the number of amino acid substitutions.

[0015]

$$K = -\log(1 - p) \qquad \text{Equation (1)}$$

Here, $K$ is the number of amino acid substitutions and $p$ is the proportion of sites at which the two sequences have different amino acids. Sites containing a gap are excluded from the calculation. Using equation (1), the dissimilarity is calculated for every pair of sequences, that is, (A, B), (A, C), ..., (A, E), (B, C), ..., (C, D), (C, E), (D, E). Denoting these dissimilarities by $V_{AB}, V_{AC}, \ldots, V_{DE}$, the pair with the smallest value is selected and joined. In this example, sequence D and sequence E are joined. This dissimilarity is taken as the branch length.
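As an illustration of equation (1), the following Python sketch computes the dissimilarity for one pair of aligned sequences. It assumes the natural logarithm (the text says only "log"), and the function name `substitution_distance` is hypothetical rather than taken from the patent.

```python
import math

GAP = "*"  # gap symbol used in the alignment data of FIG. 3


def substitution_distance(seq_a: str, seq_b: str) -> float:
    """Return K = -ln(1 - p) for two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # Sites where either sequence has a gap are excluded from the calculation.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    # p is the proportion of compared sites whose amino acids differ.
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 1.0:
        raise ValueError("K is undefined when every compared site differs")
    return -math.log(1.0 - p)
```

For two made-up fragments such as `"QVAL*"` and `"QVSLK"`, four gap-free sites are compared and one differs, so p = 0.25 and K = -ln(0.75).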

[0016] Next, sequences D and E are treated as one group, and the dissimilarity between this group and each of the other sequences is calculated in the same way using equation (1). For example, the dissimilarity $V_{(DE)A}$ between sequences D, E and sequence A is obtained as $V_{(DE)A} = (V_{AD} + V_{AE}) / 2$. $V_{(DE)B}$ and $V_{(DE)C}$ are calculated likewise, and the smallest value is selected from among these and the previously obtained $V_{AB}, V_{AC}, \ldots$ other than $V_{DE}$. In this example, the dissimilarity $V_{AB}$ between sequences A and B is the smallest, so these are grouped second. Grouping and dissimilarity calculation are repeated in the same manner, and the phylogenetic tree 13 is created from the results.
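The grouping procedure can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: it returns only the order in which groups are joined, it averages over all leaf pairs between two groups (which reduces to the two-term average given for group (D, E) and sequence A), and the function name and input layout are assumptions.

```python
from itertools import combinations


def upgma_merge_order(dist):
    """dist maps frozenset({x, y}) -> dissimilarity for every pair of sequences.

    Returns the merges as (group_1, group_2, dissimilarity) tuples, in order.
    """
    leaves = sorted({name for pair in dist for name in pair})
    clusters = [(name,) for name in leaves]

    def between(c1, c2):
        # Unweighted average of the pairwise dissimilarities between groups.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Join the two groups with the smallest dissimilarity.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges
```

Given pairwise values in which D and E are closest and A and B next closest, the first two merges are (D, E) and then (A, B), matching the order described in the text.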

[0017] FIG. 4(A) shows an example of the phylogenetic tree 13 constructed from the alignment data shown in FIG. 3. The numbers in parentheses in the figure are the lengths of the branches of the phylogenetic tree 13.

(c) Calculation of the weight for each gene sequence

Next, the sequence weight calculation means 14 assigns a weight to each branch based on the length of each branch of the created phylogenetic tree 13. The weight given to a branch is obtained by dividing the length of the branch by the number of sequences that descend from that branch.

[0018] The phylogenetic tree 13 of FIG. 4(A) is used as an example. Branch 1 has a length of 0.158, and three sequences (A, B, and C) descend from it. The weight of branch 1 is therefore 0.158 / 3 = 0.053. Obtaining the weights of all branches in the same way gives the following.

Weight of branch 1 = 0.158 / 3 = 0.053
Weight of branch 2 = 0.903 / 2 = 0.452
Weight of branch 3 = 0.367 / 2 = 0.184
Weight of branch 4 = 0.745
Weight of branch 5 = weight of branch 6 = 0.378 / 1 = 0.378
Weight of branch 7 = weight of branch 8 = 0.000

The weight of each sequence is calculated from the branch weights obtained in this way.

[0019] The weight assigned to each sequence is obtained as the sum of the weights of the branches traversed when going from the root of the phylogenetic tree 13 to that sequence. In the example of the phylogenetic tree 13 of FIG. 4(A), this works out as follows. Sequence A passes through branch 1, branch 3, and branch 5 on the phylogenetic tree 13. The weights given to these branches are 0.053, 0.184, and 0.378, respectively. The weight of sequence A is therefore their sum, 0.615. The weights of all sequences are calculated in the same way. The weights of all sequences are then summed, and each sequence weight is divided by that sum so that the weights are normalized to total 1. FIG. 4(B) shows the weights of sequences A to E obtained from the phylogenetic tree 13 of FIG. 4(A).
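The weighting of process (c) can be reproduced with a short Python sketch. FIG. 4(A) is not reproduced here, so the branch-to-sequence assignment below is inferred from the text. Branches 5 and 6 are taken as single-sequence branches with weight 0.378, the value used in the sum for sequence A. The names `BRANCHES` and `sequence_weights` are hypothetical.

```python
# Each branch: (length, sequences that descend from it).
# The topology is inferred from the description, not read off FIG. 4(A).
BRANCHES = {
    1: (0.158, "ABC"),
    2: (0.903, "DE"),
    3: (0.367, "AB"),
    4: (0.745, "C"),
    5: (0.378, "A"),
    6: (0.378, "B"),
    7: (0.000, "D"),
    8: (0.000, "E"),
}


def sequence_weights(branches):
    """Sum per-branch weights along each root-to-sequence path, then normalize."""
    raw = {}
    for length, members in branches.values():
        share = length / len(members)  # branch weight = length / descendants
        for name in members:
            raw[name] = raw.get(name, 0.0) + share
    total = sum(raw.values())
    return {name: value / total for name, value in sorted(raw.items())}


weights = sequence_weights(BRANCHES)
```

Rounded to three decimals, this yields 0.210 for A and B, 0.272 for C, and 0.154 for D and E, which agrees with the sequence weights used in the worked example for the first and second sites.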

[0020] (d) Weight calculation for each amino acid

Next, the sequence weight calculation means 14 determines the weight of each amino acid at each site from the normalized weights of sequences A to E. The weight of an amino acid appearing at a site is obtained as the sum of the weights of all sequences in which that amino acid appears at that site.

[0021] This is explained using the alignment data shown in FIG. 3 and the sequence weights of FIG. 4(B). At the first site, sequence A has Q (glutamine), sequence B has L (leucine), sequence C has E (glutamic acid), and sequences D and E have S (serine). Accordingly, at the first site, amino acid Q is given the weight of sequence A, 0.210. Likewise, amino acid L is given 0.210, amino acid E is given 0.272, and amino acid S is given 0.308, the sum of the weights of sequences D and E. All other amino acids have a weight of 0 at the first site.

[0022] Similarly, at the second site, amino acid V (valine) appears in sequences A, B, and C, and amino acid A (alanine) appears in sequences D and E. The weight of amino acid V at this site is the sum of the weights of sequences A, B, and C: 0.210 + 0.210 + 0.272 = 0.692. The weight of amino acid A at the second site is 0.154 + 0.154 = 0.308.

[0023] The weight of each amino acid is calculated in the same way at every site. FIG. 5 shows the amino acid weights calculated in this manner for the first through tenth sites. FIG. 5 gives the results to four decimal places. FIG. 6 is a flowchart showing one embodiment of process (d) of FIG. 2, performed by the sequence weight calculation means 14.

[0024] In FIG. 6, step 31 initializes the site position counter of the sequence weight calculation means 14, that is, in the CPU of the processing device 10. Step 32 initializes the gene sequence number counter of the sequence weight calculation means 14, that is, in the CPU of the processing device 10. Step 33 stores the amino acid of the current gene sequence at the current site position in the current amino acid type storage area of the sequence weight calculation means 14, that is, in the memory of the processing device 10. In the following description, the current site position, current gene sequence, current amino acid type, and so on refer to the site position, gene sequence, and amino acid type currently under consideration. Step 34 adds the weight of the current gene sequence to the score stored in the score storage area for the current amino acid type at the current site position, in the sequence weight calculation means 14, that is, in the memory of the processing device 10, and stores the result. Step 35 adds 1 to the gene sequence number counter.

[0025] Step 36 determines whether the value of the gene sequence number counter is greater than the number of genes. If the result is NO, processing returns to step 34. If the result of step 36 is YES, step 37 performs normalization so that the scores of the amino acids at the current site position sum to 1. Next, step 38 calculates the score at each site, and step 39 adds 1 to the site position. Step 40 determines whether the value of the site position counter is greater than the alignment data length. If the result is YES, processing ends. If the result of step 40 is NO, processing returns to step 32.
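Process (d), including the normalization of step 37, can be sketched for a single site as follows. The function name `residue_weights` is hypothetical, and skipping gap symbols is an assumption, since the text does not say how gaps are treated in this step. The two example columns are the first and second sites described above.

```python
GAP = "*"


def residue_weights(column, seq_weights):
    """column maps sequence name -> residue at one alignment site."""
    totals = {}
    for name, residue in column.items():
        if residue == GAP:
            continue  # assumption: a gap contributes no weight
        # Step 34: add the sequence weight to the residue's running total.
        totals[residue] = totals.get(residue, 0.0) + seq_weights[name]
    norm = sum(totals.values())
    # Step 37: rescale so that the weights at this site sum to 1.
    return {res: w / norm for res, w in totals.items()} if norm else {}


SEQ_WEIGHTS = {"A": 0.210, "B": 0.210, "C": 0.272, "D": 0.154, "E": 0.154}
site1 = residue_weights(
    {"A": "Q", "B": "L", "C": "E", "D": "S", "E": "S"}, SEQ_WEIGHTS
)
site2 = residue_weights(
    {"A": "V", "B": "V", "C": "V", "D": "A", "E": "A"}, SEQ_WEIGHTS
)
```

With these inputs, S receives 0.308 at the first site, and V and A receive 0.692 and 0.308 at the second site.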

[0026] (e) Score calculation at each site

In some sequences, an amino acid has been replaced by one with similar properties, and even then the site is often functionally conserved. To extract such sites as motifs, the score calculation means 15 calculates the score of each site using a score table based on the physicochemical similarity between amino acids.

[0027] The score table 16a based on amino acid similarity is determined in advance from the physical and chemical properties of the amino acids. It is a table holding a value assigned to each pair of amino acids, based on the substitution frequency of the pair and on a distance representing how much their properties differ. For example, the scores for pairs of glycine (G) with other amino acids are given the following values. In this case, each score has been multiplied by 100 for convenience.

[0028] Various score tables of this kind are known, so they are not described further here.

[0029] The score of each site, calculated taking the values of the score table 16a into account, indicates that the larger the value, the more "conservative" the amino acids are at that site. For example, the score of the first site in FIG. 5 is obtained by the following equation (2).

$$
\begin{aligned}
S_1 ={} & D(\mathrm{S},\mathrm{S})\,S(\mathrm{S})\,S(\mathrm{S}) + D(\mathrm{S},\mathrm{L})\,S(\mathrm{S})\,S(\mathrm{L}) + D(\mathrm{S},\mathrm{E})\,S(\mathrm{S})\,S(\mathrm{E}) + D(\mathrm{S},\mathrm{Q})\,S(\mathrm{S})\,S(\mathrm{Q}) \\
& + D(\mathrm{L},\mathrm{S})\,S(\mathrm{L})\,S(\mathrm{S}) + D(\mathrm{L},\mathrm{L})\,S(\mathrm{L})\,S(\mathrm{L}) + D(\mathrm{L},\mathrm{E})\,S(\mathrm{L})\,S(\mathrm{E}) + D(\mathrm{L},\mathrm{Q})\,S(\mathrm{L})\,S(\mathrm{Q}) \\
& + D(\mathrm{E},\mathrm{S})\,S(\mathrm{E})\,S(\mathrm{S}) + D(\mathrm{E},\mathrm{L})\,S(\mathrm{E})\,S(\mathrm{L}) + D(\mathrm{E},\mathrm{E})\,S(\mathrm{E})\,S(\mathrm{E}) + D(\mathrm{E},\mathrm{Q})\,S(\mathrm{E})\,S(\mathrm{Q}) \\
& + D(\mathrm{Q},\mathrm{S})\,S(\mathrm{Q})\,S(\mathrm{S}) + D(\mathrm{Q},\mathrm{L})\,S(\mathrm{Q})\,S(\mathrm{L}) + D(\mathrm{Q},\mathrm{E})\,S(\mathrm{Q})\,S(\mathrm{E}) + D(\mathrm{Q},\mathrm{Q})\,S(\mathrm{Q})\,S(\mathrm{Q})
\end{aligned}
\qquad \text{Equation (2)}
$$

Here, $S_1$ is the score of the first site, $D(\text{amino acid 1}, \text{amino acid 2})$ is the similarity between amino acid 1 and amino acid 2 obtained from the score table 16a, and $S(\text{amino acid})$ is the weight of that amino acid at the site (FIG. 5).
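Equation (2) is a double sum over the amino acids present at a site, which the following Python sketch makes explicit. The contents of score table 16a are not given in the text, so `toy_similarity` is a purely hypothetical stand-in (identical residues score 1, a few made-up similar pairs score 0.5). It therefore does not reproduce the scores reported for FIG. 3.

```python
def site_score(weights, similarity):
    """weights: residue -> weight at one site (summing to 1).

    similarity(a, b) plays the role of D(a, b), scaled to the range 0 to 1.
    """
    return sum(
        similarity(a, b) * wa * wb
        for a, wa in weights.items()
        for b, wb in weights.items()
    )


def toy_similarity(a, b):
    # Hypothetical stand-in for score table 16a, for illustration only.
    if a == b:
        return 1.0
    similar = {frozenset("LI"), frozenset("IM"), frozenset("KR")}
    return 0.5 if frozenset((a, b)) in similar else 0.0
```

With any table in which identical residues score 1, a site occupied by a single amino acid in every sequence scores exactly 1, and a site split between similar residues scores below 1 but above a site split between dissimilar ones.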

[0030] FIG. 7 is a flowchart showing one embodiment of process (e) of FIG. 2, performed by the score calculation means 15. In FIG. 7, step 51 initializes the score storage area for the current site position of the score calculation means 15, that is, in the memory of the processing device 10. Step 52 initializes the amino acid type counter of the score calculation means 15, that is, in the memory of the processing device 10. Step 53 initializes the comparison amino acid type counter of the score calculation means 15, that is, in the memory of the processing device 10. Step 54 obtains the similarity between the current amino acid type and the current comparison amino acid type by referring to the amino acid similarity score table 16a. Step 55 performs the following calculation.

$$S_i = S_i + D(A_1, A_2) \times S(A_1) \times S(A_2)$$

Here, $S_i$ is the score of the current (i-th) site position, $A_1$ is the current amino acid type, $A_2$ is the current comparison amino acid type, $D(A_1, A_2)$ is the similarity score between the current amino acid type and the current comparison amino acid type, $S(A_1)$ is the score of the current amino acid type at the current site position, and $S(A_2)$ is the score of the current comparison amino acid type at the current site position.

[0031] Step 56 changes the comparison amino acid type to the next one, and step 57 determines whether the comparison has been made against all amino acid types. If the result of step 57 is NO, processing returns to step 53. If the result of step 57 is YES, step 58 changes the amino acid type to the next one. Step 59 determines whether the score has been calculated for all amino acid types. If the result is YES, processing ends. If the result of step 59 is NO, processing returns to step 52.

[0032] (f) Output of calculation results

For the alignment data shown in FIG. 3, the scores calculated by the score calculation means 15 were as follows.

(Sites 01 to 05) 0.5183 0.7744 0.5677 0.8198 0.4881
(Sites 06 to 10) 0.9328 0.4940 0.8683 0.3165 0.3580
(Sites 11 to 15) 0.9311 0.3834 0.4072 0.3611 0.6114
(Sites 16 to 20) 0.6937 0.5976 0.5699 0.5574 0.5010
(Sites 21 to 25) 0.3880 0.6168 0.5530 0.5739 0.6296
(Sites 26 to 30) 0.7718 0.3473 0.3772 0.6956 1.0000
(Sites 31 to 35) 0.9841 0.9646 1.0000 0.9149 0.8891
(Sites 36 to 40) 1.0000 0.6916 0.7864 0.7804 0.7903
(Sites 41 to 45) 0.5830 0.6021 0.7753 0.5654 0.6976
(Sites 46 to 50) 0.9037 0.6428 0.8303 0.9542 0.7105

(g) Threshold setting

The feature information extraction means 17 determines a threshold for the score and extracts, as motifs, the sites given a score exceeding that threshold. To this end, the threshold is set either to a value specified by the user or to a predetermined default value.

[0033] (h) Output of sites exceeding the threshold as motifs

When the score threshold is th, the feature information extraction means 17 extracts sites satisfying the following condition as motif candidates.

$$S > th$$

For the alignment data shown in FIG. 3, the motif extracted with the score threshold th set to 0.90 was as follows.

[0034] "30 D [LI] [IM] L [LIF] [KRH] L"

Here, "30" means that the first amino acid of the motif is at position 30 in the alignment data of FIG. 3. Square brackets [] indicate that several amino acids, those listed inside the brackets, appear at that site. The extracted motif sites are the 6th (F or Y), 11th (F or Y), 30th (D), 31st (L or I), 32nd (I or M), 33rd (L), 34th (L, I, or F), 36th (L), 46th (I or V), and 49th (L, I, or M) sites of the alignment data.
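The thresholding of processes (g) and (h) can be checked directly against the 50 scores reported for FIG. 3. In this short Python sketch, the function name `motif_sites` is hypothetical and positions are 1-based, as in the text.

```python
# Scores for sites 1 to 50 as reported for the alignment data of FIG. 3.
SCORES = [
    0.5183, 0.7744, 0.5677, 0.8198, 0.4881, 0.9328, 0.4940, 0.8683, 0.3165, 0.3580,
    0.9311, 0.3834, 0.4072, 0.3611, 0.6114, 0.6937, 0.5976, 0.5699, 0.5574, 0.5010,
    0.3880, 0.6168, 0.5530, 0.5739, 0.6296, 0.7718, 0.3473, 0.3772, 0.6956, 1.0000,
    0.9841, 0.9646, 1.0000, 0.9149, 0.8891, 1.0000, 0.6916, 0.7864, 0.7804, 0.7903,
    0.5830, 0.6021, 0.7753, 0.5654, 0.6976, 0.9037, 0.6428, 0.8303, 0.9542, 0.7105,
]


def motif_sites(scores, th):
    """Return the 1-based positions whose score strictly exceeds th (S > th)."""
    return [i for i, s in enumerate(scores, start=1) if s > th]


sites = motif_sites(SCORES, 0.90)
```

With th = 0.90 the selected positions are 6, 11, 30, 31, 32, 33, 34, 36, 46, and 49, the same motif sites listed above. Site 35 scores 0.8891 and is not selected.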

[0035] FIG. 8 shows the motif sites extracted by the embodiment of the present invention. Conventionally, motifs, which are sequence patterns characteristic of gene sequences, could be extracted as sites to some extent by manual operation, but identifying motifs as regions was difficult. Identifying motifs as regions is nevertheless very important when identifying functional regions or inferring ancestral genes. The method of the present invention for identifying, as regions, the motif sequences extracted as sites is therefore described in more detail.

[0036] (i), (j), (k) Setting of region width and random level, calculation of the motif site appearance rate, and output of motif regions

FIG. 9 is a flowchart showing one embodiment of processes (i), (j), and (k) of FIG. 2, performed by the feature information extraction means 17. This embodiment consists broadly of three processes. In the first process, an arbitrary region width is set and the appearance rate of motif sites within that region width is determined. In the second process, a random level is determined for judging whether the appearance rate of motif sites within the set region width is sufficiently high. When motif sites are present at an appearance rate exceeding the random level, the motif sites within that region width are identified as one motif region. In the third process, when identified motif regions are contiguous, they are combined into a single motif region.

【００３７】つまり、より具体的には以下の処理Ｓ１〜
Ｓ６が繰り返される。Ｓ１：モチーフ部位の抽出を行う。Ｓ２：初期領域幅、拡張幅、最大拡張幅を設定する。ま
た、モチーフ部位の出現率のランダムレベルを求めるた
めの領域幅を最大拡張幅に設定する。ただし、最大拡張
幅がアライメントデータ長の半分を越える場合には、ラ
ンダムレベルの領域幅をアライメントデータ長の半分の
長さを越えない値に設定する。Ｓ３：初期領域幅及びランダムレベルの領域幅の夫々で
のモチーフ部位の出現率を計算してプロットする。Ｓ４：初期領域幅でのモチーフ部位の出現率がランダム
レベルの領域幅でのモチーフ部位の出現率を越えている
場合には、初期領域幅を「モチーフ領域」とみなす。Ｓ５：隣合う初期領域幅のモチーフ部位の出現率がとも
に「モチーフ領域」である場合には、これらを結合して
１つの「モチーフ領域」とみなす。Ｓ６：処理Ｓ４及びＳ５をアライメントデータの全長に
渡って繰り返す。More specifically, the following processes S1 to S6 are repeated.
S1: Extract the motif sites.
S2: Set the initial region width, the extension width and the maximum extension width. Also set the region width used to obtain the random level of the motif site appearance rate to the maximum extension width. However, if the maximum extension width exceeds half the alignment data length, set the random-level region width to a value not exceeding half the alignment data length.
S3: Calculate and plot the appearance rate of motif sites for the initial region width and for the random-level region width.
S4: If the appearance rate of motif sites at the initial region width exceeds the appearance rate of motif sites at the random-level region width, regard the initial region width as a "motif region".
S5: If two adjacent regions of the initial region width are both "motif regions", combine them and regard them as one "motif region".
S6: Repeat processes S4 and S5 over the entire length of the alignment data.
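The constraint that step S2 places on the random-level region width can be sketched as follows. This is a minimal Python illustration under one assumption: the text only requires a value "not exceeding" half the alignment data length, and the sketch simply picks the largest such integer. The function name `random_level_width` is hypothetical.

```python
def random_level_width(max_extension_width, alignment_length):
    """Choose the region width used for the random level (step S2)."""
    half = alignment_length // 2
    # Use the maximum extension width unless it exceeds half the alignment length.
    # Assumption: when it does, take the largest integer not exceeding that half.
    return min(max_extension_width, half)


width = random_level_width(101, 97)
```

Any smaller value would also satisfy the constraint as stated, so the choice made here is one possibility among many.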

【００３８】図９に基づいてモチーフ領域の同定処理を
説明するに、ステップ６１はモチーフ部位の出現率を求
める領域幅を設定する。ステップ６２は、特徴情報抽出
手段１７の、即ち、処理装置１０のＣＰＵ内の、部位位
置カウンタを初期化する。ステップ６３は、現部位位置
を中心として、設定した領域幅内でのモチーフ部位の出
現率を、次の式から求める。The motif region identification processing will now be described with reference to FIG. 9. Step 61 sets the region width over which the appearance rate of motif sites is calculated. Step 62 initializes a site position counter of the feature information extraction means 17, that is, a counter in the CPU of the processing device 10. Step 63 calculates the appearance rate of motif sites within the set region width, centered on the current site position, from the following equation.

【００３９】（モチーフ部位の出現率）＝（領域幅内モ
チーフ部位数）／（領域幅）ステップ６４は、モチーフ部位の出現率をグラフにプロ
ットし、ステップ６５は、現部位位置でのランダムレベ
ルを計算し、グラフにプロットする。ステップ６６は、
部位位置に１を加え、ステップ６７は、部位位置カウン
タの値がアライメントデータ長より大きいか否かを判定
する。ステップ６７の判定結果がＮＯであれば、処理は
ステップ６３へ戻る。(Appearance rate of motif sites) = (number of motif sites within the region width) / (region width)
Step 64 plots the appearance rate of motif sites on a graph, and step 65 calculates the random level at the current site position and plots it on the graph. Step 66 adds 1 to the site position, and step 67 determines whether the value of the site position counter is larger than the alignment data length. If the determination in step 67 is NO, the process returns to step 63.
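The appearance rate equation of step 63 can be sketched in Python as below. The sketch assumes 0-based site positions, an odd region width so that the window is symmetric about the current site, and a window that is clipped at the ends of the alignment while the divisor remains the full region width. None of these edge details is stated in the text, and the name `appearance_rate` is hypothetical. Following step S3, the random level at a position is computed here as the same rate over the wider random-level region width.

```python
def appearance_rate(is_motif_site, center, region_width):
    """(number of motif sites within the region width) / (region width)."""
    half = region_width // 2
    lo = max(0, center - half)                       # clip at the left end (assumption)
    hi = min(len(is_motif_site), center + half + 1)  # clip at the right end (assumption)
    return sum(is_motif_site[lo:hi]) / region_width


# 1 marks a motif site, 0 any other alignment column (toy data, not from the patent).
is_motif_site = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
positions = range(len(is_motif_site))
rates = [appearance_rate(is_motif_site, p, 3) for p in positions]
random_levels = [appearance_rate(is_motif_site, p, 5) for p in positions]
```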

【００４０】他方、ステップ６７の判定結果がＹＥＳで
あると、ステップ６８で特徴情報抽出手段１７の、即
ち、処理装置１０のＣＰＵ内の、モチーフ領域フラグを
初期化する。ステップ６９は、部位位置カウンタを初期
化する。ステップ７０は、部位位置のモチーフ出現率が
ランダムレベルより高いか否かを判定する。ステップ７
０の判定結果がＹＥＳであれば処理はステップ７１へ進
み、ＮＯであれば処理はステップ７５へ進む。On the other hand, if the determination in step 67 is YES, step 68 initializes a motif region flag of the feature information extraction means 17, that is, a flag in the CPU of the processing device 10. Step 69 initializes the site position counter. Step 70 determines whether the motif site appearance rate at the current site position is higher than the random level. If the determination in step 70 is YES, the process proceeds to step 71; if NO, the process proceeds to step 75.

【００４１】ステップ７１は、モチーフ領域フラグが立
っているか（セットされているか）否かを判定し、判定
結果がＹＥＳであると、ステップ７２が現部位位置を中
心とした領域を現モチーフ領域に加えて伸長する。他
方、ステップＳ７１の判定結果がＮＯであると、ステッ
プ７３はモチーフ領域フラグを立て、ステップ７４は、
現部位位置を中心とした領域幅の中で最初にモチーフ部
位の出現する部位位置を現モチーフ領域の開始部位とす
る。ステップ７２又は７４を行った後は、処理がステッ
プ７８へ進む。Step 71 determines whether the motif region flag is set. If the determination is YES, step 72 extends the current motif region by adding to it the region centered on the current site position. If the determination in step 71 is NO, step 73 sets the motif region flag, and step 74 takes, as the start site of the current motif region, the site position at which a motif site first appears within the region width centered on the current site position. After step 72 or 74, the process proceeds to step 78.

【００４２】ステップ７５は、モチーフ領域フラグが立
っているか否かを判定し、判定結果がＮＯであれば、処
理はステップ７８へ進む。他方、ステップ７５の判定結
果がＹＥＳであれば、ステップ７６がモチーフ領域フラ
グを初期化し、スッテプ７７が現モチーフ領域を出力す
る。このステップ７６を行った後は、処理がステップ７
８へ進む。Step 75 determines whether the motif region flag is set. If the determination is NO, the process proceeds to step 78. If the determination in step 75 is YES, step 76 initializes the motif region flag and step 77 outputs the current motif region. After these steps, the process proceeds to step 78.

【００４３】ステップ７８は、部位位置に１を加える。
又、ステップ７９は、部位位置カウンタの値がアライメ
ントデータ長より大きいか否かを判定し、判定結果がＹ
ＥＳであれば、処理は終了する。他方、ステップ７９の
判定結果がＮＯであれば、処理はステップ７０へ戻る。
上記の如きモチーフ領域の同定処理を行った場合の実験
結果を以下に説明する。Step 78 adds 1 to the site position. Step 79 determines whether the value of the site position counter is larger than the alignment data length; if the determination is YES, the process ends. If the determination in step 79 is NO, the process returns to step 70. Experimental results obtained with the motif region identification processing described above are explained below.
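Steps 68 to 79 amount to a single scan with one flag. The Python sketch below follows that scan under stated assumptions: positions are 0-based, the end of a region is taken as the right edge of the last window that exceeded the random level, and a region still open when the scan finishes is output as well. The flowchart description does not spell out the last two points, and the name `identify_regions` is hypothetical.

```python
def identify_regions(rates, random_levels, is_motif_site, region_width):
    """Scan all positions and return (start, end) motif regions, 0-based inclusive."""
    half = region_width // 2
    last = len(rates) - 1
    regions = []
    flag = False                                   # motif region flag (step 68)
    start = end = None
    for pos in range(len(rates)):                  # steps 69, 78 and 79
        window_lo = max(0, pos - half)
        window_hi = min(last, pos + half)
        if rates[pos] > random_levels[pos]:        # step 70
            if flag:                               # step 71: YES
                end = window_hi                    # step 72: extend the region
            else:
                flag = True                        # step 73
                # Step 74: the first motif site inside the window starts the region.
                start = next((i for i in range(window_lo, window_hi + 1)
                              if is_motif_site[i]), window_lo)
                end = window_hi
        elif flag:                                 # step 75: YES
            flag = False                           # step 76
            regions.append((start, end))           # step 77
    if flag:                                       # assumption: flush an open region
        regions.append((start, end))
    return regions


# Toy data, not taken from the patent.
is_motif_site = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
rates = [0, 1 / 3, 2 / 3, 1, 2 / 3, 1 / 3, 0, 0, 0, 0]
random_levels = [0.5] * 10
regions = identify_regions(rates, random_levels, is_motif_site, 3)
```

With this toy input, the three consecutive positions whose rate exceeds the random level are merged into the single region `(2, 5)`, which corresponds to the merging of adjacent motif regions in step S5.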

【００４４】実験では、ＦＬＡＡ７Ａ−１をプローブと
して、アライメントデータを対象としてモチーフ領域の
同定を行った。図１０〜図１２は、プローブ名がＦＬＡ
Ａ７Ａ−１、ｈｏｍｏｌｏｇｕｅ本数が５３、初期領域
幅が２１、最大拡張幅が１０１、アライメントデータ長
が９７、ランダムレベルを求めるための領域幅が４１、
モチーフ部位抽出時の設定値が０．９０の場合の実験結
果を示す。図１０は、設定した領域幅におけるモチーフ
部位の占める割合を示しており、「ｏ」はモチーフ領域
幅の初期値でのプロットを示し、「．．．」はランダム
レベルのプロットを示す。尚、プロットが重なった場合
は、割合の高い方を優先してプロットしてある。図１１
は、抽出されたモチーフ部位を示す。更に、図１２は、
モチーフ領域の同定処理により得られたモチーフ領域を
示している。図１２中、「：」はモチーフの開始位置と
終了位置とを示し、「［］」はそのモチーフ部位に複
数のアミノ酸が出現することを示し、「−」はその部位
に任意のアミノ酸又はギャップ（即ち、モチーフ部位で
はない部位）が出現することを示す。In the experiment, motif regions were identified in alignment data using FLAA7A-1 as a probe. FIGS. 10 to 12 show the experimental results for the probe name FLAA7A-1, 53 homologues, an initial region width of 21, a maximum extension width of 101, an alignment data length of 97, a region width of 41 for obtaining the random level, and a setting value of 0.90 for motif site extraction. FIG. 10 shows the proportion of motif sites within the set region width, where "o" is the plot for the initial value of the motif region width and "..." is the plot of the random level. Where plots overlap, the one with the higher proportion is plotted. FIG. 11 shows the extracted motif sites. FIG. 12 shows the motif regions obtained by the motif region identification processing. In FIG. 12, ":" indicates the start and end positions of a motif, "[ ]" indicates that a plurality of amino acids appear at that motif site, and "-" indicates that any amino acid or a gap appears at that site (that is, the site is not a motif site).

【００４５】図１３〜図１５は、同様にしてＥＣＯＤＨ
ＦＯＬＧ−１をプローブとして用いて得られた他の実験
結果を示す。図１３〜図１５は、プローブ名がＥＣＯＤ
ＨＦＯＬＧ−１、ｈｏｍｏｌｏｇｕｅ、本数が１１、初
期領域幅が１１、アライメントデータ長が１７９、ラン
ダムレベルを求めるための領域幅が８１、モチーフ部位
抽出時の設定値が０．９０の場合の実験結果を示す。図
１３及び図１４は、同じアライメントデータに対するプ
ロットを分割して示しており、図１５は、モチーフ領域
の同定処理により得られたモチーフ領域を示す。図１
３及び図１４は、設定した領域幅におけるモチーフ部位
の占める割合をアライメントデータと対応させて示して
おり、「ｏ」はモチーフ領域幅の初期値でのプロット、
即ち、設定した領域幅でのモチーフ部位の出現率を示
し、「．．．」はランダムレベルのプロットを示す。つ
まり、モチーフ部位の出現率がこのランダムレベルより
低い場合は、このモチーフ部位をモチーフ領域とはみな
さない。尚、プロットが重なった場合は、割合の高い方
を優先してプロットしてある。更に、図１３及び図１４
中、アライメントデータの左側に示されている名前は、
遺伝子配列データベースＤＤＢＪに登録されている遺伝
子配列のエントリー名を示す。又、Ｄｉｈｙｄｒｏｆｏ
ｌａｔｅｒｅｄｕｃｔａｓｅｓｉｇｎａｔｕｒｅ
［ＬＩＦ］−Ｇ−Ｘ（４）−［ＬＩＶＭＦ］−Ｐ−Ｗ
は、モチーフデータベースＰＲＯＳＩＴＥに登録されて
いるデータである。FIGS. 13 to 15 show further experimental results obtained in the same manner using ECODHFOLG-1 as a probe. FIGS. 13 to 15 show the results for the probe name ECODHFOLG-1, 11 homologues, an initial region width of 11, an alignment data length of 179, a region width of 81 for obtaining the random level, and a setting value of 0.90 for motif site extraction. FIGS. 13 and 14 show the plot for the same alignment data split into two parts, and FIG. 15 shows the motif regions obtained by the motif region identification processing. FIGS. 13 and 14 show the proportion of motif sites within the set region width alongside the alignment data, where "o" is the plot for the initial value of the motif region width, that is, the appearance rate of motif sites at the set region width, and "..." is the plot of the random level. In other words, where the appearance rate of motif sites is lower than this random level, the motif sites are not regarded as a motif region. Where plots overlap, the one with the higher proportion is plotted. In FIGS. 13 and 14, the names shown to the left of the alignment data are the entry names of the gene sequences registered in the gene sequence database DDBJ. "Dihydrofolate reductase signature [LIF]-G-X(4)-[LIVMF]-P-W" is data registered in the motif database PROSITE.
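For comparison with extracted motif regions, a PROSITE entry of the kind quoted above can be turned into a regular expression and searched for in a sequence. The Python sketch below handles only the syntax elements that appear in this document (single residues, bracketed alternatives, `X` for any residue, and a fixed repeat count such as `X(4)`), which is a deliberate simplification of the full PROSITE syntax. The function name `prosite_to_regex` is hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate the subset of PROSITE syntax quoted in the text into a regex."""
    parts = []
    for element in pattern.split("-"):
        match = re.fullmatch(r"(\[[A-Z]+\]|[A-Z])(?:\((\d+)\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, count = match.groups()
        if residue == "X":              # X stands for any amino acid
            residue = "."
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return "".join(parts)


dhfr = prosite_to_regex("[LIF]-G-X(4)-[LIVMF]-P-W")
```

Here `dhfr` is the string `[LIF]G.{4}[LIVMF]PW`, which can be passed to `re.search` together with an amino acid sequence.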

【００４６】図１５中、左側に示されている「１２２」
等の数字は、各モチーフ領域のアライメントデータ上で
の開始位置を示す。又、右側に示されている「１３７」
等の数字は、各モチーフ領域のアライメントデータ上で
の終了位置を示す。図１６〜図１８は、同様にしてＨＵ
ＭＴＲＸ１−１をプローブとして用いて得られた他の実
験結果を示す。図１６〜図１８は、プローブ名がＨＵＭ
ＴＲＸ１−１、ｈｏｍｏｌｏｇｕｅ、本数が１５、初期
領域幅が１１、アライメントデータ長が１１０、ランダ
ムレベルを求めるための領域幅が５１、モチーフ部位抽
出時の設定値が０．９０の場合の実験結果を示す。図１
６及び図１７は、同じアライメントデータに対するプロ
ットを分割して示しており、図１８は、モチーフ領域の
同定処理により得られたモチーフ領域を示す。In FIG. 15, the numbers shown on the left, such as "122", indicate the start position of each motif region in the alignment data, and the numbers shown on the right, such as "137", indicate the end position of each motif region in the alignment data. FIGS. 16 to 18 show further experimental results obtained in the same manner using HUMTRX1-1 as a probe. FIGS. 16 to 18 show the results for the probe name HUMTRX1-1, 15 homologues, an initial region width of 11, an alignment data length of 110, a region width of 51 for obtaining the random level, and a setting value of 0.90 for motif site extraction. FIGS. 16 and 17 show the plot for the same alignment data split into two parts, and FIG. 18 shows the motif regions obtained by the motif region identification processing.

【００４７】図１６及び図１７は、設定した領域幅にお
けるモチーフ部位の占める割合をアライメントデータと
対応させて示しており、「ｏ」はモチーフ領域幅の初期
値でのプロット、即ち、設定した領域幅でのモチーフ部
位の出現率を示し、「．．．」はランダムレベルのプロ
ットを示す。つまり、モチーフ部位の出現率がこのラン
ダムレベルより低い場合は、このモチーフ部位をモチー
フ領域とはみなさない。尚、プロットが重なった場合
は、割合の高い方を優先してプロットしてある。更に、
図１６及び図１７中、アライメントデータの左側に示さ
れている名前は、遺伝子配列データベースＤＤＢＪに登
録されている遺伝子配列のエントリー名を示す。又、Ｔ
ｈｉｏｒｅｄｏｘｉｎｆａｍｉｌｙａｃｔｉｖｅ
ｓｉｔｅ［ＳＴＡ］−Ｘ−［ＷＧ］−Ｃ−［ＡＧＶ］−
［ＰＨ］−Ｃは、モチーフデータベースＰＲＯＳＩＴＥ
に登録されているデータである。FIGS. 16 and 17 show the proportion of motif sites within the set region width alongside the alignment data, where "o" is the plot for the initial value of the motif region width, that is, the appearance rate of motif sites at the set region width, and "..." is the plot of the random level. In other words, where the appearance rate of motif sites is lower than this random level, the motif sites are not regarded as a motif region. Where plots overlap, the one with the higher proportion is plotted. In FIGS. 16 and 17, the names shown to the left of the alignment data are the entry names of the gene sequences registered in the gene sequence database DDBJ. "Thioredoxin family active site [STA]-X-[WG]-C-[AGV]-[PH]-C" is data registered in the motif database PROSITE.

【００４８】図１８中、左側に示されている「６９」等
の数字は、各モチーフ領域のアライメントデータ上での
開始位置を示す。又、右側に示されている「１０５」等
の数字は、各モチーフ領域のアライメントデータ上での
終了位置を示す。これにより、本発明によれば、遺伝子
配列情報から、機械的に（自動的に）モチーフ領域を抽
出・同定することができるので、高速にモチーフ領域の
抽出・同定が可能となる。従って、大量の遺伝子配列デ
ータから新規なモチーフを発見したり、モチーフデータ
ベースを作成することが、容易にできる。この様にして
得られたモチーフ情報をもとに、未知機能の遺伝子配列
の機能及び構造の予測を効率良く行うことができ、本発
明を遺伝子機能の発見や機能領域の同定に利用すると非
常に便利である。In FIG. 18, the numbers shown on the left, such as "69", indicate the start position of each motif region in the alignment data, and the numbers shown on the right, such as "105", indicate the end position of each motif region in the alignment data. Thus, according to the present invention, motif regions can be extracted and identified mechanically (automatically) from gene sequence information, so that they can be extracted and identified at high speed. It is therefore easy to discover new motifs in a large amount of gene sequence data and to create a motif database. Based on the motif information obtained in this way, the function and structure of gene sequences of unknown function can be predicted efficiently, which makes the present invention very useful for discovering gene functions and identifying functional regions.

【００４９】以上、本発明を実施例により説明したが、
本発明は上記実施例に限定されるものではなく、本発明
の範囲内で種々の変形及び改良が可能であることは、言
うまでもない。The present invention has been described above with reference to embodiments. However, the present invention is not limited to these embodiments, and it goes without saying that various modifications and improvements are possible within the scope of the present invention.

【００５０】[0050]

【発明の効果】以上説明したように、本発明によれば、
遺伝子配列情報から、機械的に（自動的に）モチーフ領
域を抽出・同定することができるので、高速にモチーフ
領域の抽出・同定が可能となるので、大量の遺伝子配列
データから新規なモチーフを発見したり、モチーフデー
タベースを作成することが、容易にでき、この様にして
得られたモチーフ情報をもとに、未知機能の遺伝子配列
の機能及び構造の予測を効率良く行うこともできるの
で、本発明を遺伝子機能の発見や機能領域の同定に利用
すると非常に便利であり、遺伝子工学の発展に寄与する
ところが大きい。[Effect of the Invention] As described above, according to the present invention, motif regions can be extracted and identified mechanically (automatically) from gene sequence information, and therefore at high speed. This makes it easy to discover new motifs in a large amount of gene sequence data and to create a motif database, and the motif information obtained in this way allows the function and structure of gene sequences of unknown function to be predicted efficiently. The present invention is therefore very useful for discovering gene functions and identifying functional regions, and contributes greatly to the development of genetic engineering.

[Brief description of the drawings]

【図１】本発明の構成例を示す図である。FIG. 1 is a diagram showing a configuration example of the present invention.

【図２】本発明の実施例の処理フローチャートである。FIG. 2 is a processing flowchart of an embodiment of the present invention.

【図３】入力したアライメントデータの例を示す図であ
る。FIG. 3 is a diagram showing an example of input alignment data.

【図４】図３に示すアライメントデータから求めた系統
樹と配列の重みの結果を示す図である。FIG. 4 is a diagram showing the phylogenetic tree and the sequence weights obtained from the alignment data shown in FIG. 3.

【図５】各部位におけるアミノ酸の重みの計算結果を示
す図である。FIG. 5 is a diagram showing calculation results of amino acid weights at each site.

【図６】配列の重み計算手段が行う処理の一実施例を示
すフローチャートである。FIG. 6 is a flowchart showing one embodiment of the processing performed by the sequence weight calculation means.

【図７】スコア計算手段が行う処理の一実施例を示すフ
ローチャートである。FIG. 7 is a flowchart showing one embodiment of the processing performed by the score calculation means.

【図８】本発明の実施例によって抽出されたモチーフ部
位を示す図である。FIG. 8 is a diagram showing motif sites extracted according to an example of the present invention.

【図９】特徴情報抽出手段が行う処理の一実施例を示す
フローチャートである。FIG. 9 is a flowchart showing one embodiment of the processing performed by the feature information extraction means.

【図１０】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その１）。FIG. 10 is a view for explaining an experimental result when a motif region identification process is performed (part 1).

【図１１】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その２）。FIG. 11 is a diagram for explaining an experimental result when a motif region identification process is performed (part 2).

【図１２】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その３）。FIG. 12 is a diagram illustrating an experimental result when a motif region identification process is performed (part 3).

【図１３】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その１）。FIG. 13 is a view for explaining an experimental result when a motif region identification process is performed (part 1).

【図１４】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その２）。FIG. 14 is a diagram illustrating an experimental result when a motif region identification process is performed (part 2).

【図１５】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その３）。FIG. 15 is a diagram illustrating an experimental result when a motif region identification process is performed (part 3).

【図１６】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その１）。FIG. 16 is a diagram illustrating an experimental result when a motif region identification process is performed (part 1).

【図１７】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その２）。FIG. 17 is a diagram for explaining an experimental result when a motif region identification process is performed (part 2).

【図１８】モチーフ領域の同定処理を行った場合の実験
結果を説明する図である（その３）。FIG. 18 is a diagram illustrating an experimental result when a motif region identification process is performed (part 3).

[Explanation of symbols]

１０処理装置１１配列情報入力手段１２系統樹作成手段１３系統樹１４配列の重み計算手段１５スコア計算手段１６要素の類似性に基づくスコア表１７特徴情報抽出手段１８出力装置
10: processing device
11: sequence information input means
12: phylogenetic tree creation means
13: phylogenetic tree
14: sequence weight calculation means
15: score calculation means
16: score table based on similarity of elements
17: feature information extraction means
18: output device

フロントページの続き (72)発明者池尾一穂静岡県三島市谷田1111番地国立遺伝学研究所内 (72)発明者川西祐一神奈川県川崎市中原区上小田中1015番地富士通株式会社内 (72)発明者河合正人神奈川県川崎市中原区上小田中1015番地富士通株式会社内 (56)参考文献Ｊ．Ｍｏｌ．Ｅｖｏｌ．，1994年２月，38，ｐ．188−203 Ｊ．Ｍｏｌ．Ｂｉｏｌ，1994年６月24 日，239，ｐ．698−712 ｂｉｔ，1993年９月，25［９］，ｐ. 87−94 (58)調査した分野(Int.Cl.⁷，ＤＢ名) C12N 15/00 - 15/90 G06F 17/30 ＰｕｂＭｅｄＪＩＣＳＴファイル（ＪＯＩＳ) ＩＮＳＰＥＣ（ＤＩＡＬＯＧ)
Continuation of front page
(72) Inventor: Kazuho Ikeo, c/o National Institute of Genetics, 1111 Yata, Mishima-shi, Shizuoka
(72) Inventor: Yuichi Kawanishi, c/o Fujitsu Limited, 1015 Kamikodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Masato Kawai, c/o Fujitsu Limited, 1015 Kamikodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(56) References: J. Mol. Evol., February 1994, 38, pp. 188-203; J. Mol. Biol., June 24, 1994, 239, pp. 698-712; bit, September 1993, 25[9], pp. 87-94
(58) Fields searched (Int. Cl.7, DB name): C12N 15/00 - 15/90; G06F 17/30; PubMed; JICST file (JOIS); INSPEC (DIALOG)

Claims

(57) [Claims]

1. A gene motif extraction processing apparatus for extracting, from gene sequence information, a characteristic regularity that specifies a gene function, comprising: sequence weight calculation means for calculating a weight of each sequence from the branch lengths of an evolutionary phylogenetic tree for a plurality of gene sequences; score calculation means for calculating, for each site of the sequences and using the weights, a score indicating the degree of similarity of the sequence elements appearing at that site; and feature information extraction means for extracting, as a motif and on the basis of the calculated scores, a portion of the gene sequences having a characteristic regularity.

2. The gene motif extraction processing apparatus according to claim 1, further comprising phylogenetic tree creation means for creating, on the basis of alignment data of the plurality of gene sequences, a phylogenetic tree according to the degree of difference between the gene sequences.

3. The gene motif extraction processing apparatus according to claim 1 or 2, wherein the score calculation means comprises means for calculating the score on the basis of the weight of each sequence calculated by the sequence weight calculation means and similarity information between sequence elements determined in advance according to the types of the sequence elements.

4. The gene motif extraction processing apparatus according to any one of claims 1 to 3, wherein the feature information extraction means comprises means for extracting and outputting a site as a motif site when the value of the score calculated by the score calculation means exceeds a given threshold or a set threshold.

5. The gene motif extraction processing apparatus according to claim 4, wherein the feature information extraction means further comprises means for calculating the appearance rate of the motif sites over a predetermined continuous region width or a set continuous region width, extracting the continuous region as a motif region when that value exceeds a predetermined random level or a set random level, and, when adjacent regions are both motif regions, outputting those regions as one motif region.

6. The gene motif extraction processing apparatus according to claim 5, further comprising an output device that outputs the feature information obtained by the feature information extraction means by displaying a graph in which at least the values of the appearance rate of the motif sites and the random level are plotted.

7. A gene motif extraction processing method for extracting, by a computer, a characteristic regularity that specifies a gene function from gene sequence information, comprising: a step in which sequence information input means inputs alignment data of a plurality of gene sequences subject to extraction; a step in which phylogenetic tree creation means creates an evolutionary phylogenetic tree on the basis of the alignment data; a step in which sequence weight calculation means calculates the weight of each sequence from the branch lengths of the phylogenetic tree; a step in which score calculation means calculates a score for each site on the basis of the calculated weight of each sequence and similarity information between sequence elements determined in advance according to the types of the sequence elements; and a step in which feature information extraction means extracts a site as a motif site when the value of the calculated score exceeds a predetermined threshold or a set threshold.

8. The gene motif extraction processing method according to claim 7, wherein the extracting step further comprises a step in which the feature information extraction means calculates the appearance rate of the motif sites over a predetermined continuous region width or a set continuous region width, extracts the continuous region as a motif region when that value exceeds a predetermined random level or a set random level, and, when adjacent regions are both motif regions, outputs those regions as one motif region.

9. The gene motif extraction processing method according to claim 8, further comprising a step in which an output device outputs the feature information obtained by the extraction by displaying a graph in which at least the values of the appearance rate of the motif sites and the random level are plotted.