JP2907169B2

JP2907169B2 - Stochastic motif discovery method for protein and gene sequences

Info

Publication number: JP2907169B2
Application number: JP8349487A
Authority: JP
Inventors: 由希子黒岩
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1996-12-27
Filing date: 1996-12-27
Publication date: 1999-06-21
Anticipated expiration: 2016-12-27
Also published as: JPH10187666A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、タンパク質及び遺
伝子の配列におけるモチーフを探索する方法に関し、特
に、配列データから確率的モチーフを発見する方法に関
する。The present invention relates to a method for searching for a motif in a protein or gene sequence, and more particularly to a method for finding a stochastic motif from sequence data.

【０００２】[0002]

【従来の技術】モチーフとは、タンパク質のアミノ酸配
列や遺伝子の塩基配列において、構造上・機能上の重要
な部位であり、生物の進化の過程において保存されてい
る部分配列のことである。モチーフはコンセンサス配列
とも呼ばれている。2. Description of the Related Art A motif is a structurally and functionally important site in the amino acid sequence of a protein or the base sequence of a gene, and is a partial sequence that is conserved during the evolution of an organism. Motifs are also called consensus sequences.

【０００３】従来、タンパク質のアミノ酸配列のモチー
フ及び遺伝子の塩基配列のモチーフは、パターンと呼ば
れる表現で表わされていた。例えば、タンパク質モチー
フのデータベースであるＰＲＯＳＩＴＥでは、“Ｎ−
{Ｐ}−[ＳＴ]−Ｐ”や“[ＳＴ]−ｘ−[ＳＴ]”などのよ
うに、２０種のアミノ酸のＯＲ（論理和）の並びとして
モチーフを表現していた。ここで、Ｎ、Ｐ、Ｓ、Ｔは、
アミノ酸の種類を表わす略号であって、それぞれ、アミ
ノ酸の一種であるアスパラギン酸、プロリン、セリン、
トレオニンを示している。また、ｘは任意のアミノ酸を
示し、{Ｐ}は、Ｐ（プロリン）以外のアミノ酸全てを示
し、[ＳＴ]は、Ｓ（セリン）またはＴ（トレオニン）を
示す。Heretofore, the motif of the amino acid sequence of a protein and the motif of the base sequence of a gene have been represented by an expression called a pattern. For example, in the protein motif database PROSITE, "N-
The motif was expressed as an OR (logical sum) of 20 amino acids, such as {P}-[ST] -P "or" [ST] -x- [ST] ". N, P, S, T are
Abbreviations representing the types of amino acids, each of which is a kind of amino acid, aspartic acid, proline, serine,
Shows threonine. Further, x represents an arbitrary amino acid, {P} represents all amino acids except P (proline), and [ST] represents S (serine) or T (threonine).

【０００４】また、一部では、プロファイルと呼ばれる
表現方法も用いられている。これは、モチーフの各位置
でそれぞれのアミノ酸または塩基が来た時のスコアを表
現したものである。さらに一方では、タンパク質や遣伝
子の配列を隠れマルコフモデル（ＨＭＭ：Hidden Marko
v Model）のような確率的モデルによって表現する手法
が使用されつつある。[0004] In some cases, an expression method called a profile is also used. This expresses the score when each amino acid or base comes at each position of the motif. On the other hand, Hidden Marko models (HMMs) hide protein and gene sequences.
v Model) is being used.

【０００５】このような表現を用いて、生化学的に関係
のある異なる配列から自動的に共通部位を発見できれ
ば、この共通部位は生化学的な機能や立体構造のために
重要な部分配列であって進化的に保存されてきたもので
あると考えることができる。このため、タンパク質や遺
伝子の機能や構造の解析のために、共通部位であるモチ
ーフの発見は重要な問題である。モチーフを発見する手
法としては、これまでにも、ボトムアップによる手法、
トップダウンによる手法など、さまざまな手法が研究さ
れている。[0005] If a common site can be automatically found from different biochemically related sequences using such an expression, this common site is a partial sequence important for biochemical functions and three-dimensional structure. It can be thought of as something that has been evolutionarily preserved. Therefore, the discovery of a motif, which is a common site, is an important issue for the analysis of functions and structures of proteins and genes. As a method of discovering motifs, a bottom-up method,
Various methods are being studied, including top-down methods.

【０００６】ボトムアップによる手法は、パターンの空
間を考えて、それぞれのパターンに対して配列での出現
頻度または情報量（Information Content）を計算し、
出現頻度または情報量の多いパターンをモチーフとする
手法である。例えば、１９９０年発行の米国の雑誌“Pr
oc. Natl. Acad. Sci. USA”の第８７号８２６−８３０
頁記載のＳｍｉｔｈらの論文“Finding sequence motif
s in groups of functionally related proteins”や、
１９８９年発行の米国の同雑誌の第８６号１１８３−１
１８７頁記載のＳｔｏｒｍｏらの論文“Identifying pr
otein-bindingsites from unaligned DNA fragments”
などがある。In the bottom-up method, the frequency of appearance or the amount of information (Information Content) in an array is calculated for each pattern in consideration of the pattern space.
This method uses a pattern with a high frequency of appearance or a large amount of information as a motif. For example, the 1990 magazine "Pr
oc. Natl. Acad. Sci. USA "No. 87-826-830.
Smith et al., "Finding sequence motif"
s in groups of functionally related proteins ”
No. 86, 1183-1 of the US magazine published in 1989
Stormo et al., “Identifying pr.
otein-bindingsites from unaligned DNA fragments ”
and so on.

【０００７】トップダウンによる手法は、配列を比較し
てローカルな類似度を見つける手法であり、配列のアラ
イメントを行なう。アライメントとは、与えられた複数
の配列に、ギャップなどを入れて、同じあるいは性質の
似たアミノ酸または塩基が縦のカラム位置に並ぶように
する手法である。トップダウンによる手法には、配列を
２本ずつペアワイズにアライメントを行ない、その結果
を組み合わせる手法や、グループ間でアライメントする
手法がある。このようなアライメントを用いたモチーフ
抽出では、配列をアライメントしてから同じアミノ酸ま
たは塩基が揃っている部位を探すか、類似の局所的な部
位をアライメントする。このようなアライメントとして
は、例えば、Ｂｌａｓｔ，ＦａｓｔＡ，ＣｌｕｓｔＶ
などがある。The top-down method is a method of comparing sequences to find local similarities, and performs sequence alignment. Alignment is a technique in which gaps and the like are inserted in a given plurality of sequences so that amino acids or bases having the same or similar properties are arranged in vertical column positions. The top-down method includes a method of performing pairwise alignment of two sequences and combining the results, and a method of performing alignment between groups. In the motif extraction using such an alignment, after a sequence is aligned, a site having the same amino acid or base is searched for, or a similar local site is aligned. Such alignment includes, for example, Blast, FastA, Clust V
and so on.

【０００８】[0008]

【発明が解決しようとする課題】しかしながら、従来の
ボトムアップによる手法では、計算時間がかかることが
問題点であった。例えば、文字数｜Σ｜での長さｌのパ
ターンでは、探索空間のサイズは、Ο（｜Σ｜^l）とな
り、パターンの長さｌで指数関数的に大きくなる。さら
に、実際にタンパク質モチーフのデータベースであるＰ
ＲＯＳＩＴＥを見てみると、進化的に保存されている部
位であるとは言え、配列におけるギャップやミスマッチ
は少なくない。ボトムアップによる手法においてギャッ
プやミスマッチを考慮しようとすると、ギャップやミス
マッチを許すような近隣語も考えなければならなくな
り、探索空問が大きくなり過ぎて、計算時間の観点から
実現が難しい。しかも、得られるパターンの表現能力
は、隠れマルコフモデルなどの確率的モデルよりも低
い。However, the conventional bottom-up method has a problem that it takes a long time to calculate. For example, for a pattern of length l with the number of characters | Σ |, the size of the search space is Ο (| Σ | ^l ), which increases exponentially with the length l of the pattern. Furthermore, P which is actually a database of protein motifs
Looking at ROSITE, although it is an evolutionarily conserved site, there are many gaps and mismatches in the sequence. When trying to consider gaps and mismatches in the bottom-up method, it is necessary to consider neighboring words that allow gaps and mismatches, and the search query becomes too large, which is difficult to achieve in terms of computation time. Moreover, the obtained pattern has lower expression ability than a stochastic model such as a hidden Markov model.

【０００９】また、従来のアライメントによる手法で
は、初期の段階の誤りが波及して精度があまり良くなら
なかったり、配列の重み付けを考えなければならなかっ
たり、位置に依存したスコアが考えづらかったりという
問題点があった。Further, in the conventional alignment method, errors in the initial stage are spread, so that the accuracy is not so good, the weight of the array must be considered, and the score depending on the position is difficult to consider. There was a problem.

【００１０】本発明は、上記問題点の解決を図り、共通
な部位だけに注目してギャップやミスマッチを確率的モ
デルを用いて表現することで、配列を詳細に表現できる
確率的モチーフの発見手法を提供することを目的とす
る。さらに本発明は、モチーフを隠れマルコフモデルで
表現することで、反復するモチーフを陽に扱うことので
きる確率的モチーフの発見手法を提供することを目的と
する。[0010] The present invention solves the above-mentioned problems, and focuses on only common sites to express gaps and mismatches using a stochastic model, thereby finding a stochastic motif capable of expressing sequences in detail. The purpose is to provide. It is another object of the present invention to provide a stochastic motif discovery method that can explicitly handle a repeated motif by expressing the motif with a hidden Markov model.

【００１１】[0011]

【課題を解決するための手段】本発明の確率的モチーフ
発見方法は、請求項１に記載のように、タンパク質及び
遺伝子のいずれか一方の配列データから確率的モチーフ
を発見する方法において、コンピュータを用い、与えら
れた配列データから記号処理によって抽出したモチーフ
を初期のモチーフとして、配列データの中から前記初期
のモチーフに最も適合する部分配列を第１の部分配列と
して抽出する第１のステップと、第１の部分配列の前後
を伸長して新たに第２の部分配列を生成し、第２の部分
配列を確率的に表現したものを新たにモチーフとする第
２のステップと、配列データの中から第２のステップで
得たモチーフに最も適合する部分配列を抽出して改めて
第１の部分配列とする第３のステップとを有し、第１の
ステップの実行後、第２のステップ及び第３のステップ
を繰り返し実行することにより確率的モチーフを発見す
る。Probabilistic motif discovery method of the present invention solve problems which means for a], as described in claim 1, a method of finding the probabilistic motif from either array data of proteins and gene, the computer Using a motif extracted by symbol processing from given sequence data as an initial motif, and extracting a subsequence that best matches the initial motif from the sequence data as a first subsequence; A second step in which a second subsequence is newly generated by extending before and after the first subsequence and a stochastic representation of the second subsequence is newly set as a motif; And extracting a sub-sequence that best fits the motif obtained in the second step from the first step to make the first sub-sequence again, and after the execution of the first step Discovering probabilistic motif By repeating the second step and the third step.

【００１２】請求項２に記載の発明は、請求項１に記載
の確率的モチーフ発見方法において、第１のステップに
おいて、アミノ酸及び塩基のいずれか一方の並びである
パターンに関し配列データでの出現数を計算し、その出
現数が多い方から選択したパターンを初期のモチーフと
する。また、請求項３に記載の発明は、請求項１に記載
の確率的モチーフ発見方法おいて、第２のステップにお
いてモチーフを隠れマルコフモデルを用いて表現し、第
３のステップにおい配列データからモチーフに最も適合
する部分配列を取り出す際には、配列長で尤度を正規化
し、正規化後の尤度が高い部分配列を取り出す。According to a second aspect of the present invention, in the method for discovering a stochastic motif according to the first aspect, in the first step, the number of occurrences in sequence data of a pattern which is a sequence of any one of an amino acid and a base is included. Is calculated, and the pattern selected from the one with the largest number of appearances is set as the initial motif. According to a third aspect of the present invention, in the stochastic motif finding method according to the first aspect, in the second step, the motif is expressed using a hidden Markov model, and the motif is extracted from the sequence data in the third step. When extracting a subsequence that best matches the likelihood, the likelihood is normalized by the array length, and a subsequence having a high likelihood after normalization is extracted.

【００１３】さらに請求項３に記載の確率的モチーフ発
見方法では、第２のステップにおいて確率的にモチーフ
を生成する際に、所与の状態数からなる隠れマルコフ
モデルを初期モデルとし、遷移確率が所定値以下の遷移
を取り除き新たな状態を結合して新たな隠れマルコフモ
デルのネットワーク構造を生成し、その後、生成したネ
ットワーク構造の隠れマルコフモデルの初期パラメタを
ランダムに複数の組設定することと、設定に対してパラ
メタ算出を行うことと、パラメタ算出で最良の結果を与
える隠れマルコフモデルを新たな隠れマルコフモデルと
することとを繰り返したり、隠れマルコフモデルの状
態数及びネットワーク構造を最適化するために、配列デ
ータを学習用データと評価用データとに分け、配列デー
タのうち学習用データのみを隠れマルコフモデルの学習
の際に使用し、学習後の隠れマルコフモデルの中から評
価用データの尤度が最も高い状態数及びネットワーク構
造を探索し、探索された状態数及びネットワーク構造に
基づいてパラメタを再推定したり、尤度の低い部分配
列を予め定めた割合で除去してからモチーフを生成した
りすることができる。さらに請求項３に記載の確率的モ
チーフ発見方法では、隠れマルコフモデルの状態数及び
ネットワーク構造を情報量基準で最適化することができ
る。Further, in the stochastic motif finding method according to the third aspect, when the motif is stochastically generated in the second step, a hidden Markov model having a given number of states is used as an initial model, and the transition probability is determined. Generating a network structure of a new hidden Markov model by removing the transitions below a predetermined value and combining the new states, and then randomly setting a plurality of sets of initial parameters of the hidden Markov model of the generated network structure; To repeat the parameter calculation for the setting and to use the hidden Markov model that gives the best result in the parameter calculation as a new hidden Markov model, or to optimize the number of states and network structure of the hidden Markov model Then, the array data is divided into learning data and evaluation data. Is used in the learning of the hidden Markov model, and the number of states and the network structure with the highest likelihood of the evaluation data are searched from the hidden Markov model after the learning, and based on the searched number of states and the network structure. Parameters can be re-estimated, or a motif can be generated after removing a partial sequence with low likelihood at a predetermined ratio. Further, in the stochastic motif discovery method according to the third aspect, the number of states and the network structure of the hidden Markov model can be optimized on the basis of the information amount.

【００１４】（作用）本発明では、パターンよりも表現
能力の高い確率的モデルを用いることによって、モチー
フ内のギャップやミスマッチの扱いを容易にし、モチー
フを精密に表現することができる。特に、隠れマルコフ
モデルを使って表現することによって、反復するモチー
フも陽に扱えるようにした。確率モデルとして隠れマル
コフモデルを用いてモチーフを生成する際には、例え
ば、特開平７−３６９２０号公報に記載のタンパク質配
列データに対する隠れマルコフモデル作成方法を用い
て、隠れマルコフモデルのトポロジー及びパラメタを学
習する。(Action) In the present invention, by using a probabilistic model having a higher expression ability than a pattern, it is possible to easily handle gaps and mismatches in the motif and express the motif precisely. In particular, by using a hidden Markov model, it was possible to explicitly handle repetitive motifs. When generating a motif using a hidden Markov model as a probabilistic model, for example, using the hidden Markov model creation method for protein sequence data described in JP-A-7-36920, the topology and parameters of the hidden Markov model are used. learn.

【００１５】モチーフ発見の最初から確率的モデルを適
用するのでは、対象とするタンパク質やアミノ酸の配列
データのデータ量が一般に大きいため、計算時間がかか
り過ぎる。アミノ酸配列の配列長、すなわち対象とする
アミノ酸の残基数は、通常、３００〜４００程度だが、
最大で５０００を越える配列も存在する。また、遺伝子
の配列は、使用される文字数は少ないものの、一般には
アミノ酸配列よりさらに長い。そこで、本発明では、計
算時間を短縮するため、最初のステップでは、記号処理
によって、例えばパターンで大まかにモチーフの部分に
絞り、その後、それをもとに確率的モデルによる計算を
実行する。そして、確率モデルによるモチーフ生成を何
度か繰り返すことで、次第に配列の伸長を行ないなが
ら、モチーフを詳細に表現する。If a probabilistic model is applied from the beginning of motif discovery, the amount of sequence data of the target protein or amino acid is generally large, so that it takes too much calculation time. The sequence length of the amino acid sequence, that is, the number of residues of the target amino acid is usually about 300 to 400,
There are also sequences up to 5000. In addition, although the sequence of a gene uses a small number of characters, it is generally longer than the amino acid sequence. Therefore, in the present invention, in order to reduce the calculation time, in the first step, the pattern is roughly narrowed down to a motif part by, for example, a pattern by symbol processing, and thereafter the calculation is performed based on the probabilistic model. By repeating the generation of the motif by the probability model several times, the motif is expressed in detail while the sequence is gradually extended.

【００１６】[0016]

【発明の実施の形態】次に、本発明の実施の形態につい
て、図面を参照して説明する。図１は、本発明によるタ
ンパク質及び遺伝子の配列の確率的モチーフ発見方法の
実施の一形態を説明するための図である。なお、各図に
おいて、通常の矩形（角が直角となっている矩形）はデ
ータを表わし、角が丸くなっている矩形は処理を表わ
し、菱形は処理における判断と分岐を表わしている。Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram for explaining an embodiment of the method for discovering stochastic motifs of protein and gene sequences according to the present invention. In each of the drawings, a normal rectangle (a rectangle having a right-angled corner) represents data, a rectangle having a rounded corner represents a process, and a diamond represents a judgment and branch in the process.

【００１７】本実施の形態では、モチーフ発見の対象と
なる配列データとして、タンパク質におけるアミノ酸配
列あるいは遺伝子における塩基配列の配列データ１０が
与えられる。配列データ１０には、通常、複数個の配列
に関するデータが含まれる。まず、この配列データ１０
に対して、記号処理の１種であるパターン処理によるモ
チーフ生成を実行し、モチーフ１１を抽出する（ステッ
プ１０１）。次に、合致部位抽出処理を実行して（ステ
ップ１０２）、配列データ１０中でのモチーフ１１に合
致する部位を抽出し、抽出された部位を部分配列データ
１２とする。伸長処理を実行し（ステップ１０３）、こ
の部分配列データ１２の前後を伸長して部分配列データ
１３とする。そして、伸長後の部分配列データ１３に対
し、確率モデルによるモチーフ生成処理を実行し（ステ
ップ１０４）、確率モデルで表現されたモチーフ１１を
生成してステップ１０２に戻る。本実施の形態では、ス
テップ１０２〜１０４の過程を繰り返すことによって、
確率的モチーフを発見する。In the present embodiment, sequence data 10 of an amino acid sequence in a protein or a base sequence in a gene is given as sequence data for which a motif is to be discovered. The sequence data 10 usually includes data on a plurality of sequences. First, the array data 10
Then, motif generation is performed by pattern processing, which is one type of symbol processing, to extract the motif 11 (step 101). Next, a matching part extraction process is executed (step 102), a part matching the motif 11 in the sequence data 10 is extracted, and the extracted part is set as the partial sequence data 12. A decompression process is executed (step 103), and the portion before and after the partial sequence data 12 is decompressed to obtain partial sequence data 13. Then, a motif generation process using a probability model is performed on the expanded partial sequence data 13 (step 104), a motif 11 represented by the probability model is generated, and the process returns to step 102. In the present embodiment, by repeating the steps 102 to 104,
Discover stochastic motifs.

【００１８】以下、各処理について、図面を参照しなが
らさらに詳しく説明する。Hereinafter, each processing will be described in more detail with reference to the drawings.

【００１９】図２は、ステップ１０１におけるパターン
によるモチーフ生成処理に使用する表の一例を示してい
る。この表５０には、パターンと、そのパターンが出現
した配列数（出現数）を書き込む。同一の配列に同一の
パターンが２度以上出現した場合にも、出現数は１と数
えることとする。これは、一部の配列に特異的に出現す
る共通部位を除くためである。したがって、表中の出現
数の値は、常に、与えられた配列データ１０に含まれる
配列数以下となる。ここに例示した表は、ロイシンジッ
パーモチーフを含む１５９個の配列からモチーフ発見を
行なった時に得られたパターンとその出現数の一部を表
わしている。FIG. 2 shows an example of a table used in the motif generation process based on the pattern in step 101. In this table 50, the pattern and the number of arrays in which the pattern appears (the number of appearances) are written. Even when the same pattern appears twice or more in the same sequence, the number of occurrences is counted as one. This is to remove common sites that specifically appear in some sequences. Therefore, the value of the number of appearances in the table is always equal to or less than the number of arrays included in the given array data 10. The table exemplified here shows a pattern obtained when motif discovery was performed from 159 sequences including the leucine zipper motif and a part of the number of appearances thereof.

【００２０】本実施の形態において、パターンは、定ま
ったアミノ酸もしくは塩基と、任意のアミノ酸もしくは
塩基との並びの形式とする。すなわち、［ＳＴ］や
｛Ｐ｝などのアミノ酸もしくは塩基の論理和（ＯＲ）は
認めないものとする。これは、計算時間の都合上、論理
和や情報量は考慮せず、最もシンプルなパターンだけを
考えるということである。このようにシンプルなパター
ンを用いた抽出でも、本実施の形態では後で確率的モデ
ルで詳細に表現するので、構わない。In the present embodiment, the pattern is in the form of a sequence of fixed amino acids or bases and arbitrary amino acids or bases. That is, a logical OR (OR) of amino acids or bases such as [ST] and {P} is not allowed. This means that only the simplest pattern is considered without considering the logical sum or the amount of information for the sake of calculation time. Even in the case of extraction using such a simple pattern, the present embodiment does not care since it is expressed in detail later by a probabilistic model.

【００２１】パターンにおいて定まっているアミノ酸も
しくは塩基の数と、任意のアミノ酸もしくは塩基ｘの長
さの最大は、予め適当に決めておく。例えば、アミノ酸
もしくは塩基の数を３、ｘの長さの最大を３に決めた場
合のパターンの例は、“ＬｘｘＥｘｘｘＬ”や“Ｔｘｘ
ｘＴＡ”などである。ここで、Ｌ,Ｅは、それぞれ、ア
ミノ酸の一種であるロイシン、グルタミン酸を示し、
Ｔ,Ａは、それぞれ、塩基の一種であるチミン、アデニ
ンを示す。The number of amino acids or bases determined in the pattern and the maximum length of any amino acid or base x are appropriately determined in advance. For example, when the number of amino acids or bases is determined to be 3 and the maximum length of x is determined to be 3, examples of patterns are “LxxExxxxL” and “Txx”.
xTA "and the like. Here, L and E represent leucine and glutamic acid, respectively, which are a kind of amino acids,
T and A represent thymine and adenine, which are a kind of base, respectively.

【００２２】図３は、ステップ１０１でのパターンによ
るモチーフ生成処理において、図２に示すような表を用
いなるべく多くの配列に含まれるパターンを選ぶための
処理を示すフローチャートである。配列データ２０（＝
配列データ１０）を上から順に見て、パターン生成処理
（ステップ１１１）によりパターンを一つずつ生成し、
そのパターンが同一配列中で新規かどうかを判断する
（ステップ１１２）。ステップ１１２において、生成さ
れたパターンが新規でない場合には、ステップ１１６に
移行し、生成されたパターンが新規の場合には、そのパ
ターンが既に表（図２参照）に記載されたパターンであ
るかどうか、すなわち、そのパターンが表内で新規かど
うかを判断する（ステップ１１３）。表内で新規でない
場合には、表でのそのパターンの出現数に１を加算（イ
ンクリメント）し（ステップ１１４）、ステップ１１６
に移行する。表内で新規の場合には、その表にそのパタ
ーンを書き込むとともにその出現数を１として（ステッ
プ１１５）、ステップ１１６に移行する。アミノ酸配列
や塩基配列では、配列中に全てのパターンが現れるわけ
ではないので、本実施の形態では、新規のパターンが現
れる度に、そのパターンを表に書き込み、表を拡張する
こととし、これによって、表を保持するためのメモリを
節約する。FIG. 3 is a flowchart showing a process for selecting patterns included in as many arrays as possible using the table shown in FIG. 2 in the motif generation process based on patterns in step 101. Sequence data 20 (=
The array data 10) is viewed from the top, and patterns are generated one by one by a pattern generation process (step 111).
It is determined whether the pattern is new in the same sequence (step 112). In step 112, if the generated pattern is not new, the process proceeds to step 116, and if the generated pattern is new, whether the pattern is a pattern already described in the table (see FIG. 2). It is determined whether the pattern is new in the table (step 113). If the pattern is not new in the table, 1 is added (incremented) to the number of occurrences of the pattern in the table (step 114), and step 116
Move to If the pattern is new in the table, the pattern is written in the table, the number of occurrences is set to 1 (step 115), and the process proceeds to step 116. In an amino acid sequence or a base sequence, not all patterns appear in the sequence.In this embodiment, each time a new pattern appears, the pattern is written in a table, and the table is extended. Saves memory for holding tables.

【００２３】ステップ１１６では、配列データ２０中の
配列を全部取り出したか、すなわちデータ終了かどうか
を判断し、データ終了でなければ上述の処理を繰り返す
ためにステップ１１１に戻って次のパターン生成に移
り、データの終了となったら、表での出現数が上位のパ
ターン２１を選出する（ステップ１１７）。このパター
ン２１は、モチーフ１１として、合致部位の抽出など次
の処理で使用される。In step 116, it is determined whether all the arrays in the array data 20 have been extracted, that is, whether or not the data has been completed. If not, the process returns to step 111 to repeat the above-described processing, and proceeds to the next pattern generation. When the data ends, a pattern 21 having a higher number of appearances in the table is selected (step 117). This pattern 21 is used as a motif 11 in the next processing such as extraction of a matching part.

【００２４】図４は、モチーフ１１がパターンで表現さ
れていた場合の合致部位探索のための処理（ステップ１
０２）を示すフローチャートである。ここでは、モチー
フ１１がパターン３１として与えられるものとする。な
お、モチーフ１１が確率モデルで表わされている場合の
合致部位探索処理については、図７及び図８を用いて後
述する。パターンが含まれない配列からは部分配列を抽
出しないままとし、多くのパターンが含まれている配列
からは多くの部分配列を抽出することとする。また、同
一部位が同時に複数のパターンに合致している場合に
は、その部位を複数回抽出するのではなく、一つの部分
配列として抽出することとする。FIG. 4 shows a process for searching for a matching part when the motif 11 is represented by a pattern (step 1).
02) is a flowchart showing the process of FIG. Here, it is assumed that the motif 11 is given as the pattern 31. Note that the matching part search processing when the motif 11 is represented by a probability model will be described later with reference to FIGS. It is assumed that a partial sequence is not extracted from an array that does not include a pattern, and a large number of partial sequences are extracted from an array that includes many patterns. In addition, when the same part matches a plurality of patterns at the same time, the part is not extracted a plurality of times, but is extracted as one partial sequence.

【００２５】与えられた配列データ３０（＝配列データ
１０）に対して、部分配列生成処理を実行して、順次、
部分配列３２を生成する（ステップ１２１）。そして、
この部分配列３２が、与えられたパターン３１のいずれ
かに合致するかを比較する（ステップ１２２）。いずれ
かに合致する場合には、その部分配列３２をファイルに
書き込んで（ステップ１２３）、ステップ１２４に移行
し、合致しない場合にはそのままステップ１２４に移行
する。ステップ１２４では、データ終了かどうかを判断
し、データ終了でなければ、ここで述べた過程を繰り返
すためにステップ１２１に戻る。ステップ１２４でデー
タ終了であれば、合致部位探索の処理を終了するが、こ
の時点で、パターン３１に合致する部分配列３３がファ
イルに得られている。For the given array data 30 (= array data 10), a partial array generation process is executed, and
The partial array 32 is generated (step 121). And
It is compared whether the partial array 32 matches any of the given patterns 31 (step 122). If any of them matches, the partial array 32 is written to the file (step 123), and the process proceeds to step 124. If not, the process directly proceeds to step 124. In step 124, it is determined whether or not the data is completed. If the data is not completed, the process returns to step 121 to repeat the process described here. If the data is completed in step 124, the process of searching for a matching part is ended. At this point, the partial array 33 matching the pattern 31 has been obtained in the file.

【００２６】次に、伸長処理（ステップ１０３）につい
て説明する。伸長処理は、配列データ１０から探索・抽
出された部分配列データ１２に対し、その部分配列の前
後にある長さで配列要素（塩基やアミノ酸）を付加し
て、部分配列を伸長する処理である。実際には、合致部
位抽出の際に、併せて伸長処理を行なえばよい。この伸
長処理は、現在のモチーフの付近にある進化的な保存部
位を探すためだけでなく、パターンの開始位置・終了位
置を揃えるために行なっている。例として、前述したロ
イシンジッパーモチーフを発見する場合を考える。この
モチーフは、周期７でＬ（ロイシン）が現われる“(Ｌ
ｘｘｘｘｘｘ)ⁿＬ”という形式である。パターンによる
モチーフ生成で、アミノ酸の数を３、ｘの長さの最大を
３に決めた場合、前述した通り図２の表に示す結果が得
られていた。ここで伸長処理を行なわないものとする
と、“ＫｘｘＥｘＬ”や“ＬｘＫｘｘＥ”で抽出した部
分配列は、“ＬｘｘｘｘｘｘＬ”のパターンで抽出した
部分配列と異なるデータとして学習されるが、伸長処理
を行うことによって、それらのパターンで抽出した部分
配列を揃えて学習することができる。Next, the decompression process (step 103) will be described. The extension process is a process of adding a sequence element (base or amino acid) with a length before and after the partial sequence to the partial sequence data 12 searched and extracted from the sequence data 10 to extend the partial sequence. . Actually, it is only necessary to perform the extension processing at the time of extracting the matched part. This extension is performed not only to search for an evolutionary conserved site near the current motif, but also to align the start position and end position of the pattern. As an example, consider the case of finding the leucine zipper motif described above. In this motif, L (leucine) appears in cycle 7 "(L
xxxxxx) ⁿ L ". When the number of amino acids was determined to be 3 and the maximum length of x was determined to be 3 in the motif generation by the pattern, the results shown in the table of FIG. 2 were obtained as described above. If the decompression process is not performed, the partial sequence extracted by “KxxExL” or “LxKxxE” is learned as data different from the partial sequence extracted by the pattern of “LxxxxL”. Thus, learning can be performed by aligning the partial sequences extracted with those patterns.

【００２７】次に、確率モデルによるモチーフ生成処理
（ステップ１０４）について説明する。ここでは、確率
モデルとして隠れマルコフモデル（ＨＭＭ：Hidden Mar
kovModel）を使用する場合を説明する。以下、隠れマル
コフモデルのことを、ＨＭＭと略記する。Next, the motif generation processing using the probability model (step 104) will be described. Here, a hidden Markov model (HMM: Hidden Mar
kovModel) will be described. Hereinafter, the Hidden Markov Model is abbreviated as HMM.

【００２８】ところで、特開平７−３６９２０号公報に
は、タンパク質配列データに対するＨＭＭ作成方法が開
示されている。同公報での実施例に示されているよう
に、タンパク質配列データに対するＨＭＭ作成方法は、
一般の信号にも応用が可能である。そこで本実施の形態
では、特開平７−３６９２０号公報に記載の方法をタン
パク質及び遺伝子の配列に適用できるように改良して、
モチーフを表現したＨＭＭを部分配列から作成するよう
にした。図５は、本実施の形態でのＨＭＭの作成過程を
表わすフローチャートである。なお、作成過程の細部に
ついては、特開平７−３６９２０号公報を参照のこと。Incidentally, Japanese Patent Application Laid-Open No. 7-36920 discloses a method of creating an HMM for protein sequence data. As shown in the examples in the publication, an HMM creation method for protein sequence data is as follows:
It can be applied to general signals. Therefore, in the present embodiment, the method described in JP-A-7-36920 is improved so that it can be applied to protein and gene sequences.
An HMM expressing the motif was created from the partial sequence. FIG. 5 is a flowchart showing a process of creating an HMM in the present embodiment. For details of the preparation process, see Japanese Patent Application Laid-Open No. 7-36920.

【００２９】まず、ＨＭＭの状態数の最大Ｎｍａｘを予
め与え、与えられた部分配列のデータ（伸長処理後の部
分配列データ１３）を学習データと評価データとに分け
る（ステップ１３１）。また、初期ＨＭＭのネットワー
ク構造を、少ない状態数の全結合プラス終了状態のＨＭ
Ｍとする（ステップ１３２）。次に、予め試行回数ＣＮ
ＴＭＡＸが指定されているとして、初期パラメタを乱数
とし、学習データに対してＢａｕｍ−Ｗｅｌｃｈアルゴ
リズムによるパラメタ推定を試行回数ＣＮＴＭＡＸだけ
繰り返し、ｉに現在のＨＭＭの状態数、候補Ｍに尤度が
最大のＨＭＭを代入する（ステップ１３３）。ここで、
学習データＹ₁,Ｙ₂,…,Ｙ_mに対する尤度とは、Ｐｒｏｂ
(Ｙ_i)をＨＭＭがＹ_iを出力する確率としたとき、First, the maximum Nmax of the number of states of the HMM is given in advance, and the data of the given partial sequence (the partial sequence data 13 after decompression processing) is divided into learning data and evaluation data (step 131). In addition, the network structure of the initial HMM is defined as the total number of states with a small number of states plus the HM of the end state.
M (step 132). Next, the number of trials CN
Assuming that TMAX is specified, the initial parameter is set to a random number, the parameter estimation by the Baum-Welch algorithm is repeated for the training data by the number of trials CNTMAX, i represents the current number of HMM states, and candidate M has the maximum likelihood. The HMM is substituted (step 133). here,
Training data Y _1, Y _2, ..., and the likelihood for the Y _m, Prob
When (Y _i ) is the probability that the HMM outputs Y _i ,

【００３０】[0030]

【数１】Ｐｒｏｂ(Ｙ₁) × Ｐｒｏｂ(Ｙ₂) × … × Ｐ
ｒｏｂ(Ｙ_m) のことである。そして、現在のＨＭＭの状態数ｉが、予
め指定した最大状態数Ｎｍａｘに達したかどうかを判断
する（ステップ１３４）。ｉ≧Ｎｍａｘの場合は、ステ
ップ１３７に移行して後処理を行う。一方、ｉがＮｍａ
ｘになっていない場合には、現在の状態数ｉでのＨＭＭ
であるｂｓｔＭ_iに上述の候補Ｍを保存し（ステップ１
３５）、候補Ｍから確率がε未満の遷移を除き状態を付
加したＨＭＭを初期ＨＭＭのネットワーク構造とし（ス
テップ１３６）、ステップ１３３に戻り、上記の過程を
繰り返す。ここでεは、予め定めたしきい値である。## EQU1 ## Prob (Y ₁ ) × Prob (Y ₂ ) ×... × P
lob (Y _m ). Then, it is determined whether or not the current state number i of the HMM has reached the maximum state number Nmax specified in advance (step 134). If i ≧ Nmax, the flow shifts to step 137 to perform post-processing. On the other hand, if i is Nma
If it is not x, the HMM at the current state number i
The above-mentioned candidate M is stored in bstMi _i (step 1).
35), the HMM to which a state is added from the candidate M except for transitions with a probability less than ε is set as the network structure of the initial HMM (step 136), and the process returns to step 133 to repeat the above process. Here, ε is a predetermined threshold value.

【００３１】ステップ１３７の後処理では、それぞれの
状態数ｉのＨＭＭであるｂｓｔＭ_iに対して評価データ
での尤度を計算し、評価データでの尤度が最大となる状
態数のＨＭＭを選び、計算に使われない状態を消去する
（ステップ１３８）。すなわち、初期確率がゼロより大
の状態からいくつかの遷移を通っても到達できない状
態、及び、いくつかの遷移を通っても最終状態に到達で
きない状態をそれぞれ除く。こうして得られたＨＭＭの
状態数とネットワーク構造のデータ３５により、学習デ
ータと評価データの両方を使ってパラメタを再推定し
（ステップ１３９）、確率モデルによるモチーフを生成
する。In the post-processing of step 137, the likelihood in the evaluation data is calculated for bstM _i , which is the HMM of each state number i, and the HMM having the maximum likelihood in the evaluation data is selected. , Erase the state not used for calculation (step 138). That is, a state in which the initial probability cannot be reached through some transitions from a state in which the initial probability is greater than zero and a state in which the final state cannot be reached through some transitions are excluded. The parameters are reestimated using both the learning data and the evaluation data based on the number of states of the HMM thus obtained and the data 35 of the network structure (step 139), and a motif based on a probability model is generated.

【００３２】以上、確率モデルによるモチーフ生成処理
の一例を説明したが、確率モデルによるモチーフ生成処
理は上述した処理に限定されるわけではない。図６は、
確率モデルによるモチーフ生成の別の例を示すフローチ
ャートである。以下、図６に示すモチーフ生成処理につ
いて説明する。Although an example of the motif generation processing using the probability model has been described above, the motif generation processing using the probability model is not limited to the above-described processing. FIG.
9 is a flowchart illustrating another example of motif generation using a probability model. Hereinafter, the motif generation processing shown in FIG. 6 will be described.

【００３３】まず、ＨＭＭの状態数の最大Ｎｍａｘを予
め与え（ステップ１４１）、初期ＨＭＭのネットワーク
構造を、少ない状態数の全結合プラス終了状態のＨＭＭ
とする（ステップ１４２）。次に、予め試行回数ＣＮＴ
ＭＡＸが指定されているとして、初期パラメタを乱数と
し、部分配列データに対してＢａｕｍ−Ｗｅｌｃｈアル
ゴリズムによるパラメタ推定を試行回数ＣＮＴＭＡＸだ
け繰り返し、ｉに現在のＨＭＭの状態数、候補Ｍに尤度
が最大のＨＭＭを代入する（ステップ１４３）。そし
て、現在のＨＭＭの状態数ｉが、予め指定した最大状態
数Ｎｍａｘに達したかどうかを判断する（ステップ１４
４）。ｉ≧Ｎｍａｘの場合は、ステップ１４７に移行し
て後処理を行う。一方、ｉがＮｍａｘになっていない場
合には、現在の状態数ｉでのＨＭＭであるｂｓｔＭ_iに
上述の候補Ｍを保存し、ＭＤＬ_iにＭの記述長を保存し
（ステップ１４５）、候補Ｍから確率がε未満の遷移を
除き状態を付加したＨＭＭを初期ＨＭＭのネットワーク
構造とし（ステップ１４６）、ステップ１４３に戻り、
上記の過程を繰り返す。First, the maximum Nmax of the number of states of the HMM is given in advance (step 141).
(Step 142). Next, the number of trials CNT
Assuming that MAX is specified, the initial parameter is set to a random number, parameter estimation by the Baum-Welch algorithm is repeated for the partial sequence data by the number of trials CNTMAX, i represents the current number of HMM states, and candidate M has the maximum likelihood. (Step 143). Then, it is determined whether or not the current state number i of the HMM has reached the maximum state number Nmax specified in advance (step 14).
4). If i ≧ Nmax, the flow shifts to step 147 to perform post-processing. On the other hand, if i is not Nmax, the above-mentioned candidate M is stored in bstM _i , which is the HMM with the current number of states i, and the description length of M is stored in MDL _i (step 145). An HMM to which a state is added from M except transitions with a probability less than ε is set as the network structure of the initial HMM (step 146), and the process returns to step 143.
Repeat the above process.

【００３４】ステップ１４７の後処理では、ＭＤＬ_iが
最大となるＭ_iを選び（ステップ１４８）、上述した例
と同様の計算によって、Ｍ_iから使われない状態を消去
し、使われない状態が消去されたＭ_iを確率モデルによ
るモチーフとする。ＭＤＬ_iは候補Ｍ_iの記述長を表わし
ているから、結局、情報量基準によって、ＨＭＭの状態
数及びネットワーク構造が最適化されたことになる。In the post-processing of step 147, M _i with the maximum MDL _i is selected (step 148), and the unused state is deleted from M _i by the same calculation as in the above example, and the unused state becomes the erased M _i to the motif by the probability model. Since MDL _i represents the description length of a candidate M _i, eventually, by the information criterion, the number of states and the network structure of the HMM will have been optimized.

【００３５】以上、ＨＭＭによる確率モデルによりモチ
ーフを作成する処理を説明した。ＨＭＭによれば、ギャ
ップやミスマッチを容易に表現することができる。ギャ
ップはＨＭＭの自己ループなどで表現でき、ミスマッチ
は各状態の出力確率で表現できる。さらに、本実施の形
態あるいは特開平７−３６９２０号公報に記載のＨＭＭ
作成方法では、ｌｅｆｔ−ｔｏ−ｒｉｇｈｔ型に限らず
一般のネットワーク構造を学習できるため、反復するモ
チーフなども陽に扱うことができる。The processing for creating a motif using a probability model based on the HMM has been described above. According to the HMM, gaps and mismatches can be easily expressed. The gap can be represented by a self-loop of the HMM, and the mismatch can be represented by the output probability of each state. Further, the HMM described in the present embodiment or JP-A-7-36920 is disclosed.
In the creation method, not only the left-to-right type but also a general network structure can be learned, so that a repeated motif or the like can be explicitly handled.

【００３６】ところで、実際にモチーフ発見を行なう場
合には、初めに与えられる配列データに、モチーフを全
く含まない配列が含まれている場合もある。これには、
誤ってノイズとしてモチーフが与えられる場合も考えら
れるし、与えられたグループのサブグループのモチーフ
を抽出しようとしている場合も考えられる。これらの場
合に対応するために、尤度Ｐｒｏｂ(Ｙ_i)が低くなる部
分配列Ｙ_iを予め定めておいた割合で除くとよい。例え
ば、図５で示した例の場合、ステップ１３７で評価デー
タの尤度を計算する際に、尤度の低い部分配列を予め定
めておいた割合で除いて尤度を計算する。また、このス
テップ１３７において、選んだ状態のネットワーク構造
のＨＭＭのパラメタを学習データと評価データの両方を
用いて再推定する際にも、尤度の低い部分配列は除いて
パラメタ推定する。図６で示した例の場合も同様に、後
処理（ステップ１４７）において、尤度の低い部分配列
は除いてから再びパラメタ推定をするようにしてもよ
い。When actually finding a motif, there is a case in which the sequence data initially given contains a sequence containing no motif at all. This includes
A motif may be erroneously given as noise, or a motif of a subgroup of a given group may be extracted. In order to cope with these cases, the partial array Y _i having the low likelihood Prob (Y _i ) may be removed at a predetermined ratio. For example, in the case of the example shown in FIG. 5, when calculating the likelihood of the evaluation data in step 137, the likelihood is calculated by removing a partial array having a low likelihood at a predetermined ratio. Also, in this step 137, when re-estimating the parameters of the HMM of the network structure in the selected state using both the learning data and the evaluation data, the parameter estimation is performed excluding the partial sequence having a low likelihood. Similarly, in the case of the example shown in FIG. 6, in the post-processing (step 147), the parameter estimation may be performed again after removing the partial array with low likelihood.

【００３７】次に、合致部位探索処理（ステップ１０
２）のうち、モチーフ１１が確率モデルで表わされてい
る場合の処理を説明する。図７は、モチーフが確率モデ
ルで表現されていた場合に合致部位抽出のために使う表
５１の一例を示している。Next, a matching part search process (step 10)
The processing when the motif 11 is represented by the probability model in 2) will be described. FIG. 7 shows an example of a table 51 used for extracting a matching part when the motif is represented by a probability model.

【００３８】表５１では、それぞれの配列に対して、そ
の配列に含まれる部分配列のなかで配列長ごとに尤度が
最大となる配列が、その最大尤度とともに書き込まれ
る。実際には、この合致部位探索処理に引き続く伸長処
理のために、部分配列を伸長し伸長後の部分配列を表５
１に書き込むようにしてもよい。配列長の最大と最小は
予め与えておく。例えば、タンパク質のモチーフの場
合、配列長の最小は３、最大は５０程度で十分である。
図示した例では、最大尤度だけを書き込んでいるが、配
列にモチーフが複数出現する状態に対応するため、配列
長ごとに、尤度が上位である部分配列とその尤度を書き
込む表を用いてもよい。In Table 51, for each array, the array having the maximum likelihood for each array length among the partial arrays included in the array is written together with the maximum likelihood. Actually, for the extension process following the matching site search process, the partial sequence is extended, and the extended
1 may be written. The maximum and minimum of the array length are given in advance. For example, in the case of a protein motif, a minimum sequence length of 3 and a maximum sequence length of 50 are sufficient.
In the illustrated example, only the maximum likelihood is written, but in order to correspond to a state in which a plurality of motifs appear in the sequence, a table in which, for each sequence length, the partial sequence having the highest likelihood and the likelihood are used. You may.

【００３９】図８は、モチーフが確率モデルで表現され
ていた場合における、合致部位抽出のための処理を説明
するフローチャートである。ここでは、処理の前半で候
補を絞ることと正規化のための前処理を行ない、後半で
尤度を正規化して候補から部分配列を選んでいる。FIG. 8 is a flowchart for explaining a process for extracting a matching part when a motif is represented by a probability model. Here, in the first half of the processing, candidates are narrowed down and preprocessing for normalization is performed, and in the second half, the likelihood is normalized and a partial sequence is selected from the candidates.

【００４０】配列長（モチーフ長）の最大と最小を示す
データ４１が予め与えられているとして、まず、配列デ
ータ４０（＝配列データ１０）を順に見て、配列長の最
大と最小（データ４１）にしたがって、順次、部分配列
４３を生成する（ステップ１５１）。次に、この部分配
列４３に対する、与えられた確率モデル４２の尤度（ま
たは確率）４４を計算する（ステップ１５２）。例え
ば、確率モデル４２としてＨＭＭを用いている場合に
は、フォワード−バックワード（Forward-Backward）ア
ルゴリズムにより、部分配列の尤度を計算すればよい。Assuming that data 41 indicating the maximum and the minimum of the sequence length (motif length) are given in advance, first, the sequence data 40 (= sequence data 10) is sequentially examined, and the maximum and the minimum of the sequence length (data 41) are determined. ), The partial arrays 43 are sequentially generated (step 151). Next, the likelihood (or probability) 44 of the given probability model 42 with respect to the partial array 43 is calculated (step 152). For example, when an HMM is used as the probability model 42, the likelihood of the partial sequence may be calculated by a forward-backward algorithm.

【００４１】次に、配列長で尤度４４を正規化するため
の前処理を実行する（ステップ１５３）。この前処理と
平行して、尤度４４と、図７に示した表において対応す
る配列番号、配列長での尤度とを比較し（ステップ１５
４）、得られた尤度４４の方が表に記載された最大尤度
よりも大きくない場合にはステップ１５６に移行し、尤
度４４が最大尤度より大きい場合には、表において対応
する部分配列と尤度とを書き換えてから（ステップ１５
５）、ステップ１５６に移行する。ステップ１５６で
は、配列データ４０中の配列を全部取り出したか、すな
わちデータ終了かどうかを判断し、データ終了でなけれ
ば上述の処理を繰り返すためにステップ１５１に戻って
次の部分配列生成に移る。一方、ステップ１５６でデー
タの終了となったら、その時点で、各配列と各配列長に
対して最大尤度になる部分配列とその尤度とが表に得ら
れているから、ステップ１５３での前処理の結果を用
い、正規化のための計算を行う（ステップ１５７）。正
規化処理では、前処理によって得られた定数を使用して
表の尤度を正規化する。そして、各配列に対して正規化
した尤度が最大となる部分配列を取り出す（ステップ１
５８）。このようにして得られた部分配列データ４５
が、合致部位探索の結果として得られた部分配列データ
である。Next, pre-processing for normalizing the likelihood 44 with the array length is executed (step 153). In parallel with this preprocessing, the likelihood 44 is compared with the likelihood at the corresponding sequence number and sequence length in the table shown in FIG. 7 (step 15).
4) If the obtained likelihood 44 is not larger than the maximum likelihood described in the table, the process proceeds to step 156. If the likelihood 44 is larger than the maximum likelihood, the corresponding likelihood is determined in the table. After rewriting the partial array and likelihood (step 15
5) The process proceeds to step 156. In step 156, it is determined whether all the arrays in the array data 40 have been extracted, that is, whether or not the data has ended. If the data has not ended, the process returns to step 151 to repeat the above-described processing, and proceeds to the generation of the next partial array. On the other hand, when the data ends in step 156, at that time, a partial array having the maximum likelihood for each array and each array length and its likelihood are obtained in a table. A calculation for normalization is performed using the result of the preprocessing (step 157). In the normalization processing, the likelihood of the table is normalized using the constant obtained by the preprocessing. Then, a partial array having the maximum likelihood normalized for each array is extracted (step 1).
58). The partial sequence data 45 thus obtained
Are partial sequence data obtained as a result of the matching site search.

【００４２】さて、正規化のための定数Ａ,Ｂは、尤度
の全データからも求めることができるが、データ量が膨
大なために、計算時間や計算のための記憶領域を過度に
必要とする。そこで、ステップ１５３での前処理では、
得られた尤度のデータ４４をまとめておく。例えば、ラ
ンダムに選択されたデータのみを保存して、ステップ１
５７ではこの保存されたデータから定数Ａ,Ｂが算出さ
れるようにする。あるいは、各部分配列の尤度を配列長
で分けて、さらに適当な数ごとにグループにして、それ
らのグループの尤度の平均、尤度の標準偏差を保存す
る。これらのまとめたデータを用いて、ステップ１５７
では、配列長により尤度を正規化する。The constants A and B for normalization can be obtained from all likelihood data. However, since the amount of data is enormous, the calculation time and the storage area for the calculation are excessively required. And Therefore, in the preprocessing in step 153,
The obtained likelihood data 44 is put together. For example, only the data selected at random is stored, and step 1
At 57, constants A and B are calculated from the stored data. Alternatively, the likelihood of each subsequence is divided by the sequence length, further grouped by an appropriate number, and the average of the likelihoods of those groups and the standard deviation of the likelihood are stored. Using these compiled data, step 157
Then, the likelihood is normalized by the array length.

【００４３】配列長により尤度を正規化した値は、Ｚ
Ｓｃｏｒｅと呼ばれる。このＺＳｃｏｒｅは、以下の
ようにして求める。今、部分配列Ｙの尤度Ｌを正規化す
ることを考える。この部分配列Ｙの対数尤度ｌｏｇＬ
と配列長ＳｅｑＬｅｎは、定数Ａ,Ｂを用いて、The value obtained by normalizing the likelihood by the array length is Z
It is called Score. This Z Score is obtained as follows. Now, consider normalizing the likelihood L of the partial array Y. The log likelihood log L of this subsequence Y
And the sequence length SeqLen, using constants A and B,

【００４４】[0044]

【数２】ＬｏｇＬ＝Ａ × ＳｅｑＬｅｎ＋Ｂと線形近似することができる。ステップ１５３（前処
理）で処理しながら集めた尤度のデータから、ステップ
１５７での正規化のための定数Ａ,Ｂを求め、次式によ
り尤度を正規化した値ｌｏｇＬ(Ｎ)を求める。## EQU2 ## It can be linearly approximated as Log L = A × SeqLen + B. From the likelihood data collected while processing in step 153 (preprocessing), constants A and B for normalization in step 157 are obtained, and a value log L (N) obtained by normalizing the likelihood by the following equation is obtained. Ask.

【００４５】[0045]

【数３】このｌｏｇＬ(Ｎ)の平均ＡＶ(Ｎ)と標準偏差ＳＤ(Ｎ)
を求め、部分配列ＹのＺＳｃｏｒｅを(Equation 3) The average AV (N) and standard deviation SD (N) of this log L (N)
And calculate the ZScore of the subarray Y

【００４６】[0046]

【数４】とする。ＺＳｃｏｒｅの平均は０.０、標準偏差は１.
０である。(Equation 4) And The average of Z Score is 0.0 and the standard deviation is 1.
0.

【００４７】本実施の形態では、図１においてステップ
１０２〜１０４で表わした過程を繰り返すことによっ
て、確率的モチーフを発見する。合致部位抽出（ステッ
プ１０２）において前の繰り返しの時とほとんど同じ部
分配列を抽出した場合、抽出した部分配列の長さがあら
かじめ指定した配列長を越えた場合、あるいは、あらか
じめ指定した回数以上の繰り返しがあった場合に、この
繰り返しを終了する。In the present embodiment, a stochastic motif is found by repeating the process represented by steps 102 to 104 in FIG. In the matching part extraction (step 102), when the same partial sequence as that of the previous repetition is extracted, when the length of the extracted partial sequence exceeds the predetermined sequence length, or when the predetermined number of repetitions is performed If there is, this repetition is terminated.

【００４８】[0048]

【発明の効果】以上説明したように本発明は、確率的モ
チーフを発見するため、パターンによる表現よりも精密
にモチーフを表現することができる。また、最初に、パ
ターンなどの記号処理によって配列データを大まかにモ
チーフの部分に絞り、その後、隠れマルコフモデルなど
による学習を実行するので、確率的モチーフの発見方法
でありながら、モチーフ発見に時間がかかりすぎるのを
防ぐことができる。As described above, according to the present invention, since a stochastic motif is discovered, a motif can be expressed more precisely than a pattern. Also, first, the array data is roughly narrowed down to the motif part by symbol processing such as patterns, and then learning using a hidden Markov model etc. is performed. It can be prevented from being applied too much.

[Brief description of the drawings]

【図１】本発明によるタンパク質及び遺伝子の配列の確
率的モチーフ発見方法の実施の一形態を説明するフロー
チャートである。FIG. 1 is a flowchart illustrating an embodiment of a method for discovering stochastic motifs of protein and gene sequences according to the present invention.

【図２】パターンによるモチーフ生成処理に使用する表
の一例を示す図である。FIG. 2 is a diagram illustrating an example of a table used for a pattern-based motif generation process.

【図３】パターンによるモチーフ生成処理を説明するフ
ローチャートである。FIG. 3 is a flowchart illustrating a motif generation process using a pattern.

【図４】モチーフがパターンで表現されていた場合の合
致部位探索の処理を説明するフローチャートである、FIG. 4 is a flowchart illustrating a process of searching for a matching part when a motif is represented by a pattern.

【図５】確率モデルによるモチーフ生成の過程の一例を
示すフローチャートである。FIG. 5 is a flowchart illustrating an example of a motif generation process using a probability model.

【図６】確率モデルによるモチーフ生成の過程の別の例
を示すフローチャートである。FIG. 6 is a flowchart illustrating another example of a process of generating a motif using a probability model.

【図７】モチーフが確率モデルで表現されていた場合に
合致部位抽出のために使用する表の一例を示す図であ
る。FIG. 7 is a diagram showing an example of a table used for extracting a matching part when a motif is represented by a probability model.

【図８】モチーフが確率モデルで表現されていた場合の
合致部位抽出の処理を説明するフローチャートである。FIG. 8 is a flowchart illustrating a process of extracting a matching part when a motif is represented by a probability model.

[Explanation of symbols]

１０,２０,３０,４０配列データ１１モチーフ１２,１３,４５部分配列データ２１,３１パターン３２,３３,４３部分配列３５,４１データ４２確率モデル４４尤度５０,５１表１０１〜１０４,１１１〜１１７,１２１〜１２４ス
テップ１３１〜１３９,１４１〜１４８,１５１〜１５８ス
テップ10, 20, 30, 40 Sequence data 11 Motif 12, 13, 45 Partial sequence data 21, 31 Pattern 32, 33, 43 Partial sequence 35, 41 Data 42 Probability model 44 Likelihood 50, 51 Tables 101 to 104, 111 to 117, 121-124 steps 131-139, 141-148, 151-158 steps

───────────────────────────────────────────────────── フロントページの続き (56)参考文献特開平７−36920（ＪＰ，Ａ) 矢田，石川，田中，浅井「隠れマルコフモデルと遺伝的アルゴリズムによるＤＮＡ配列のシグナルパターン抽出」情報処理学会論文誌，Ｖｏｌ．37，Ｎｏ. ６，1966，ｐ．1117−1129 矢田，飯沼，内野，市吉「ゲノム情報解析」三菱総合研究所所報，Ｎｏ．29, 1996，ｐ．158−179 小長谷，新田，山西「遺伝子情報知識ベースシステムの構想について」ＩＣＯＴ研究速報，ＴＭ−1033，1991 (58)調査した分野(Int.Cl.⁶，ＤＢ名) G06F 17/30 ＪＩＣＳＴファイル（ＪＯＩＳ)──────────────────────────────────────────────────続き Continuation of the front page (56) References JP-A-7-36920 (JP, A) Yada, Ishikawa, Tanaka, Asai "Hidden Markov Model and Signal Pattern Extraction of DNA Sequences by Genetic Algorithm" Information Processing Society of Japan Transactions, Vol. 37, No. 6, 1966, p. 1117-1129 Yada, Iinuma, Uchino, Ichiyoshi "Genome Information Analysis", Mitsubishi Research Institute, No. 29, 1996, p. 158-179 Konagaya, Nitta, Yamanishi "On the Concept of Genetic Information Base System" ICOT Research Bulletin, TM-1033, 1991 (58) Fields investigated (Int. Cl. ⁶ , DB name) G06F 17/30 JICST File (JOIS)

Claims

(57) [Claims]

1. A method for finding a stochastic motif from sequence data of any one of a protein and a gene, wherein a motif extracted by symbol processing from given sequence data is used as an initial motif from the sequence data. A first step of extracting a subsequence that best fits the initial motif as a first subsequence; and elongating before and after the first subsequence to newly generate a second subsequence, A second step in which a probabilistic representation of the partial sequence is newly used as a motif; and a partial sequence that best matches the motif obtained in the second step is extracted from the sequence data and the second step is performed again. 1
And a third step in which the stochastic motif is found by repeatedly executing the second step and the third step after the execution of the first step. Compu
Probabilistic motif discovery method using a chromatography data.

2. In the first step, the number of appearances in the sequence data is calculated for a pattern that is a sequence of any one of amino acids and bases, and the pattern selected from the one with the largest number of occurrences is the initial pattern. The probabilistic motif finding method according to claim 1, wherein the method is a motif.

3. In the second step, the motif is expressed by using a hidden Markov model, and in the third step, a partial sequence that best fits the motif is extracted from the sequence data. The probabilistic motif finding method according to claim 1, wherein the likelihood is normalized, and a partial sequence having a high likelihood after the normalization is extracted.

4. In the second step, when the motif is stochastically generated using the sequence data, a hidden Markov model including a given number of states is used as an initial model, and a transition probability is equal to or less than a predetermined value. And combining the new states to generate a new hidden Markov model network structure, and then randomly setting a plurality of sets of initial parameters of the generated hidden Markov model of the network structure; and 4. The stochastic method according to claim 3, wherein the parameter calculation is repeatedly performed, and the hidden Markov model that gives the best result in the parameter calculation is set as a new hidden Markov model. How to discover motifs.

5. In the second step, in order to optimize the number of states and the network structure of the Hidden Markov Model, the array data is divided into learning data and evaluation data. Is used for learning the hidden Markov model, and the number of states and the network structure with the highest likelihood of the evaluation data are searched from the hidden Markov model after the learning. The method according to claim 4, wherein the parameter is re-estimated based on a structure, and the motif is generated stochastically.

6. The probabilistic motif finding method according to claim 3, wherein, in the second step, the motif is generated after removing a partial sequence having a low likelihood at a predetermined ratio.

7. The probabilistic motif discovery method according to claim 3, wherein the number of states and the network structure of the Hidden Markov Model are optimized based on the information amount.