JP5007803B2

JP5007803B2 - Gene clustering apparatus, gene clustering method and program

Info

Publication number: JP5007803B2
Application number: JP2007060745A
Authority: JP
Inventors: 毅井澤; 仁藤宮
Original assignee: National Institute of Agrobiological Sciences
Current assignee: National Institute of Agrobiological Sciences
Priority date: 2007-03-09
Filing date: 2007-03-09
Publication date: 2012-08-22
Anticipated expiration: 2027-03-09
Also published as: JP2008225689A

Description

The present invention relates to a gene clustering apparatus, a gene clustering method, and a program for clustering a plurality of genes based on sequence similarity.

To estimate the function of a gene whose function is unknown, it is known to be effective to evaluate its similarity to already known genes and to cluster the genes based on sequence similarity.
Conventionally, the maximum parsimony method, the maximum likelihood method, the neighbor-joining method, and the like have been used for gene clustering. These methods have in common that a phylogenetic tree is built while directly comparing the sequences of the genes to be clustered. One example that uses such clustering is the clustering and alignment program disclosed in Non-Patent Document 1.

Conventional gene clustering methods build a phylogenetic tree by focusing on the base sequence of each individual gene and estimating the timing and order of the mutations in each sequence. However, these methods cannot compare genes whose overall sequences differ greatly, such as genes that have diverged considerably or genes that acquired new functions after differentiation. Conventional clustering is suited to comparing genes with relatively few differences, that is, sequence changes of the degree that arise in the course of evolution.

Non-Patent Document 1: J. D. Thompson et al., "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, 1994, Vol. 22, No. 22, 4673-4680.

As described above, with conventional clustering methods that use entire gene sequences as they are, it has been difficult to cluster evolutionarily distant genes.

An object of the present invention is to provide a gene clustering apparatus, a gene clustering method, and a program that can find genes with similar functions even among the genes of evolutionarily distant organisms.

The gene clustering apparatus according to the present invention clusters a plurality of genes based on sequence similarity and includes: a motif search unit that searches for motif sequences included in gene sequences; a motif score calculation unit that calculates a similarity score between any two genes by comparing the motif sequences included in each gene sequence; an intergene distance calculation unit that calculates an intergenic distance between any two genes using the similarity score; and a clustering processing unit that clusters the plurality of genes based on the intergenic distance.
In the present invention, gene similarity is analyzed using the motifs included in the gene sequences as an index. Genes with similar functions often share similar motifs even when they are evolutionarily distant, so the invention is highly effective for finding functionally similar genes across a wide range of species and for estimating the functions of unknown genes.

Preferably, the motif score calculation unit computes a similarity for every pairing of a motif sequence in the first gene's sequence with a motif sequence in the second gene's sequence, and uses the sum of the resulting motif-to-motif similarities as the similarity score between the first gene and the second gene.
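Written as a formula, this preferred score can be read as the double sum below. The notation is introduced here only for illustration and does not appear in the original text: M_1 and M_2 are the sets of motif sequences found in the first and second genes, and sim is the motif-to-motif similarity.

```latex
S(g_1, g_2) = \sum_{m \in M_1} \sum_{m' \in M_2} \operatorname{sim}(m, m')
```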

Preferably, the intergene distance calculation unit calculates the intergenic distance between the first gene and the second gene by computing the correlation between the elements of a first vector, whose elements are the similarity scores between the first gene and the second to Nth genes, and a second vector, whose elements are the similarity scores between the second gene and the first and third to Nth genes.
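One way to write this down is sketched below, assuming the correlation is Pearson's coefficient, which is what the embodiment uses. The symbol s_{ij} for the similarity score between genes i and j is notation introduced here for illustration.

```latex
\mathbf{v}_1 = (s_{12}, s_{13}, \dots, s_{1N}), \qquad
\mathbf{v}_2 = (s_{21}, s_{23}, \dots, s_{2N})

r_{12} = \frac{\sum_{k} (v_{1,k} - \bar{v}_1)(v_{2,k} - \bar{v}_2)}
              {\sqrt{\sum_{k} (v_{1,k} - \bar{v}_1)^2}\,\sqrt{\sum_{k} (v_{2,k} - \bar{v}_2)^2}}
```

The text does not state how the correlation is converted into a distance. A mapping such as 1 - r_{12} is a common choice, but it is an assumption rather than something the description specifies.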

The gene clustering method according to the present invention clusters a plurality of genes based on sequence similarity and includes: a motif search step of searching for motif sequences included in gene sequences; a motif score calculation step of calculating a similarity score between any two genes by comparing the motif sequences included in each gene sequence; an intergene distance calculation step of calculating an intergenic distance between any two genes using the similarity score; and a clustering processing step of clustering the plurality of genes based on the intergenic distance.
In the present invention, gene similarity is analyzed using the motifs included in the gene sequences as an index. Genes with similar functions often share similar motifs even when they are evolutionarily distant, so the invention is highly effective for finding functionally similar genes across a wide range of species and for estimating the functions of unknown genes.

The computer program according to the present invention causes a computer to function as a gene clustering apparatus that clusters a plurality of genes based on sequence similarity. Specifically, it causes the computer to function as: a motif search unit that searches for motif sequences included in gene sequences; a motif score calculation unit that calculates a similarity score between any two genes by comparing the motif sequences included in each gene sequence; an intergene distance calculation unit that calculates an intergenic distance between any two genes using the similarity score; and a clustering processing unit that clusters the plurality of genes based on the intergenic distance.
In the present invention, gene similarity is analyzed using the motifs included in the gene sequences as an index. Genes with similar functions often share similar motifs even when they are evolutionarily distant, so the invention is highly effective for finding functionally similar genes across a wide range of species and for estimating the functions of unknown genes.

Embodiments of the present invention are described below with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the functional configuration of a gene clustering apparatus 10 according to Embodiment 1 of the present invention. As shown in the figure, the gene clustering apparatus 10 includes an input device 11, a user interface unit 12, a data access unit 13, a gene sequence storage unit 14, a score storage unit 15, a motif storage unit 16, a motif search unit 17, a motif score calculation unit 18, an intergene distance calculation unit 19, a clustering processing unit 20, and an output device 21.

The gene clustering apparatus 10 is realized, for example, by having a general-purpose personal computer execute a predetermined program. The user interface unit 12, the data access unit 13, the motif search unit 17, the motif score calculation unit 18, the intergene distance calculation unit 19, and the clustering processing unit 20 represent modules of the operations that the computer's processor performs according to the program; in practice they together constitute the processor of the gene clustering apparatus 10.

The gene sequence storage unit 14, the score storage unit 15, and the motif storage unit 16 are storage devices of the gene clustering apparatus 10, such as a hard disk.
The input device 11 is an input means such as a keyboard, a mouse, or a touch panel, and is used by the user to give processing instructions to the gene clustering apparatus 10 and to enter data and parameters. Data can also be read from a memory medium or the like through a USB (Universal Serial Bus) interface. User operations through the input device 11 are controlled by the user interface unit 12.
The output device 21 is a display device, a printer, or the like.

Next, the gene clustering process according to this embodiment is described.
First, the sequence information of the gene group to be clustered is supplied from the gene sequence storage unit 14 to the motif search unit 17 through the data access unit 13. The gene sequence storage unit 14 stores the gene sequence information entered through the input device 11.

FIG. 2 shows an example of a gene group to be clustered, listing the gene number and species of each target gene. In this example, the maize (Zea mays) ID1 (indeterminate1) gene was used as the query for a BLAST search (threshold 1e-30) against the amino acid sequences of rice (Oryza sativa), Arabidopsis (Arabidopsis thaliana), and a red alga, and the figure shows the genes that were hit.

Each gene sequence can be looked up at, for example, the following sites.
Rice: http://rapdb.lab.nig.ac.jp/ (RAP1)
Arabidopsis: http://mips.gsf.de/proj/thal/db/ (MIPS)
Red alga: http://merolae.biol.s.u-tokyo.ac.jp/

The ID1 gene was isolated as a gene that controls flowering in maize, and it encodes a transcription factor with zinc fingers.
The way the gene group is selected is not limited to the method above; other sequence analysis techniques may be used.

Next, the motif search unit 17 runs a motif search on the supplied gene group. A motif is a sequence pattern that corresponds to an active site or functional region in a protein structure. The motif search can be performed with a technique such as MEME (Bailey and Elkan, 1994). FIG. 3 shows an example of the motif data obtained by running a motif search on the gene group partially shown in FIG. 2. In the figure, each numbered box corresponds to an individual motif. For example, the ID1 gene can be seen to contain the motif sequences numbered 5, 2, 3, 1, 7, 6, and 18. In general, functionally similar genes often share the same motifs even when they are genetically quite distant.

The motif search yields, from each gene's sequence, information on partial sequences of various lengths that are thought to contribute to determining the gene's main structure and function. The resulting motif data is stored in the motif storage unit 16.

Next, the motif score calculation unit 18 compares all of the genes to be clustered with one another and calculates a score representing their similarity in terms of the motif sequences they contain. The similarity score can be calculated using substitution matrices based on the probabilities of amino acids replacing one another, such as PAM (Point Accepted Mutation; M. O. Dayhoff, ed., Atlas of Protein Sequence and Structure, vol. 5, pp. 345-352, National Biomedical Research Foundation, Washington DC, 1978) or BLOSUM (Blocks Substitution Matrix; Henikoff and Henikoff, 1992, PNAS 89:10915-10919). The score storage unit 15 stores the score data used by these methods.
In this embodiment, no score is calculated for regions outside the motifs, which amounts to treating everything other than the motifs as having a score of 0. Restricting the score calculation to motifs, that is, to the regions where the sequence is conserved, allows the clustering to run quickly. If needed, a function such as secondary structure prediction can be added so that, beyond conserved sequence motifs, the structural regions that determine α-helices, β-sheets, and the like are extracted and scored as motifs; this makes it possible to cluster by structural similarity as well as by function.

The similarity score calculation method is described next.
For example, suppose that motif 1, contained in gene 1, and motif 2, contained in gene 2, have the following sequences.
Motif 1: WKCEKCAK
Motif 2: WKCDKCN

The first amino acid residue of both motif 1 and motif 2 is W, so looking up the W row and W column of the PAM40 matrix shown in FIG. 4 gives a score of 13. The second residue is K in both sequences, which gives a score of 6. Scoring the remaining positions in the same way and adding the results gives the following score for motif 1 and motif 2.
Score = 13 + 6 + 9 + 3 + 6 + 9 + (-3) = 43
In this way, a score is obtained for every pairing of the motifs contained in gene 1 and gene 2. The sum of all of these motif-to-motif scores is then taken as the similarity score between gene 1 and gene 2. When an optimal score that accounts for deletions and insertions of amino acid residues is to be computed in comparing motifs, the Smith-Waterman method (Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197), a dynamic-programming algorithm that finds the optimal local alignment, is used.
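The worked example and the all-pairs sum can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, `PAM40_SUBSET` holds only the five substitution scores quoted in the text rather than the full PAM40 matrix of FIG. 4, and the comparison is position by position without gaps (the Smith-Waterman variant is not shown).

```python
# Only the PAM40 entries quoted in the worked example; the full matrix
# of FIG. 4 would be needed to score arbitrary motifs.
PAM40_SUBSET = {
    ("W", "W"): 13,
    ("K", "K"): 6,
    ("C", "C"): 9,
    ("E", "D"): 3,
    ("A", "N"): -3,
}


def pair_score(a, b, matrix):
    """Look up a residue pair, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for pairs outside the subset


def motif_score(motif_a, motif_b, matrix):
    """Ungapped score: sum over aligned positions up to the shorter motif."""
    return sum(pair_score(a, b, matrix) for a, b in zip(motif_a, motif_b))


def gene_similarity(motifs_a, motifs_b, matrix):
    """Sum of motif-to-motif scores over every pairing of the two genes' motifs."""
    return sum(
        motif_score(ma, mb, matrix) for ma in motifs_a for mb in motifs_b
    )


if __name__ == "__main__":
    # Reproduces 13 + 6 + 9 + 3 + 6 + 9 + (-3) = 43 from the text.
    print(motif_score("WKCEKCAK", "WKCDKCN", PAM40_SUBSET))
```

Truncating to the shorter motif mirrors the worked example, where the trailing K of motif 1 contributes nothing. A full implementation would have to decide how unmatched residues and gaps are penalized.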

FIG. 5 shows part of the gene-to-gene score matrix obtained as described above, giving the mutual similarity scores for four genes.

Next, the intergene distance calculation unit 19 calculates the distance between each pair of genes. The distance between genes can be defined in various ways; the present invention uses Pearson's correlation coefficient. This method takes any two rows of the matrix shown in FIG. 5 and computes the correlation between their corresponding elements. With the correlation coefficient, genes whose motif-similarity profiles are relatively alike correlate highly and are not pushed apart by differences in the absolute magnitude of their scores. Even when genes differ somewhat in how many motifs they share, the distance is obtained on a common, normalized scale.
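A minimal Python sketch of this step follows, under several assumptions. The profile vectors follow the definition given in the claims (each gene's scores against all other genes, with self-scores excluded). The conversion from correlation to distance as 1 - r is an assumption, since the text does not specify it. The matrix `SCORES` is made-up illustrative data, not the values of FIG. 5, and all function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant profile as uncorrelated
    return cov / sqrt(var_x * var_y)


def profile(scores, i, j):
    """Scores of gene i against gene j, then against every other gene."""
    others = [k for k in range(len(scores)) if k not in (i, j)]
    return [scores[i][j]] + [scores[i][k] for k in others]


def intergene_distance(scores, i, j):
    """Assumed distance: 1 - Pearson r of the two genes' score profiles."""
    return 1.0 - pearson(profile(scores, i, j), profile(scores, j, i))


# Hypothetical symmetric similarity-score matrix for four genes.
# Diagonal entries are never read because self-scores are excluded.
SCORES = [
    [0, 80, 60, 5],
    [80, 0, 60, 5],
    [60, 60, 0, 10],
    [5, 5, 10, 0],
]

if __name__ == "__main__":
    for i in range(len(SCORES)):
        for j in range(i + 1, len(SCORES)):
            print(i, j, round(intergene_distance(SCORES, i, j), 3))
```

Because the correlation depends only on the shape of each profile, two genes whose scores against the rest of the set rise and fall together end up close even if one has uniformly larger scores, which is the normalizing effect described above.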

Next, the clustering processing unit 20 performs clustering on the distance values calculated by the intergene distance calculation unit 19, using a method such as Ward's method or the group average method. FIG. 6 shows a dendrogram of the clustering result. FIG. 6 suggests that the maize ID1 gene has a function similar to that of the Os10g0419200 gene. The Os10g0419200 gene encodes a zinc finger protein and is annotated as "Zinc finger, C2H2 type family protein", so it can reasonably be inferred to have a function similar to that of ID1.
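As an illustration of the group average method named above, the following pure-Python sketch performs agglomerative clustering on a distance matrix and records the merge order that a dendrogram would display. The function name and the distance matrix `DIST` are hypothetical, and Ward's method is not shown.

```python
def group_average_clustering(dist):
    """Agglomerative clustering with group-average (UPGMA) linkage.

    dist is a symmetric N x N distance matrix. Returns a list of merges,
    each as (members_a, members_b, height), in the order they occur.
    """
    clusters = {i: (i,) for i in range(len(dist))}
    next_id = len(dist)
    merges = []

    def average_distance(a, b):
        # Mean of all pairwise distances between members of a and b.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Find the closest pair of current clusters.
        height, p, q = min(
            (average_distance(clusters[p], clusters[q]), p, q)
            for idx, p in enumerate(ids)
            for q in ids[idx + 1:]
        )
        a, b = clusters.pop(p), clusters.pop(q)
        merges.append((a, b, height))
        clusters[next_id] = a + b
        next_id += 1
    return merges


# Hypothetical intergenic distances for four genes.
DIST = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.9],
    [0.8, 0.7, 0.0, 0.3],
    [0.9, 0.9, 0.3, 0.0],
]

if __name__ == "__main__":
    for members_a, members_b, height in group_average_clustering(DIST):
        print(members_a, members_b, round(height, 3))
```

With this data, genes 0 and 1 merge first, then genes 2 and 3, and the two pairs join last. Reading which known gene shares the lowest merge with a gene of unknown function is how a dendrogram such as FIG. 6 is interpreted.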

Thus, the present invention enables a series of analyses consisting of motif extraction followed by clustering that uses the presence and similarity of motifs as an index. Motifs include conserved sequence patterns characteristic of functional domains, so analyzing genes with motifs as the index allows functionally similar genes to be compared even when they are genetically distant. Analyses based on amino acid substitution rates already exist, but no comparative analysis technique that uses the presence and similarity of motifs as an index has been established; the invention can therefore be used for analyzing functional genes conserved across organisms, estimating the functions of genes of unknown function, and similar tasks. Advances in DNA sequencing technology have led to the genomes of a very large number of species being read. If clustering can find functionally similar genes even when they do not share a common genetic ancestry, it is very useful for analyzing the function of unknown gene sequences.

The clustering method according to the present invention is not limited to gene motif information. It can also be applied to numerical sequence patterns in which structural features, such as α-helices, β-sheets, and strongly hydrophobic or hydrophilic regions, are replaced with various index values. Furthermore, the gene sequences described in the present invention are simply character strings, so the clustering of gene sequences carries over directly to the clustering of character sequences, and the method is applicable to any sequence of character or numerical information. For character strings, the number of matching characters can be used as the score, or a fixed score can simply be assigned to each word found in a dictionary. For numeric sequences, the difference between the values, or its square, can be used as the distance.
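The two generalized measures mentioned here can be sketched in a few lines of Python. The function names are hypothetical, and comparing only up to the length of the shorter input is an assumption, since the text does not say how inputs of unequal length are handled.

```python
def matching_characters(a, b):
    """Score for character strings: count of positions with equal characters."""
    return sum(1 for x, y in zip(a, b) if x == y)


def squared_difference(xs, ys):
    """Distance for numeric sequences: sum of squared element-wise differences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys))


if __name__ == "__main__":
    print(matching_characters("WKCEKCAK", "WKCDKCN"))
    print(squared_difference([1, 2, 3], [1, 4, 6]))
```

Either measure can feed the same correlation and clustering steps used for the motif-based scores, which is what makes the method independent of the kind of sequence being compared.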

FIG. 1 is a block diagram showing the functional configuration of the gene clustering apparatus according to Embodiment 1 of the present invention.
FIG. 2 shows an example of a gene group to be clustered.
FIG. 3 shows an example of the motifs obtained by the search.
FIG. 4 is the PAM40 matrix table.
FIG. 5 shows an example of similarity scores between genes.
FIG. 6 is a dendrogram of the gene clustering result.

Explanation of symbols

10: gene clustering apparatus; 11: input device; 12: user interface unit; 13: data access unit; 14: gene sequence storage unit; 15: score storage unit; 16: motif storage unit; 17: motif search unit; 18: motif score calculation unit; 19: intergene distance calculation unit; 20: clustering processing unit; 21: output device

Claims

A gene clustering apparatus for clustering a plurality of genes based on sequence similarity, comprising:
a motif search unit that searches for motif sequences included in gene sequences;
a motif score calculation unit that calculates a similarity score between any two genes by comparing the motif sequences included in each gene sequence;
an intergene distance calculation unit that calculates an intergenic distance between any two genes using the similarity score; and
a clustering processing unit that clusters the plurality of genes based on the intergenic distance,
wherein the motif score calculation unit
obtains a similarity for every pairing of the motif sequences included in the sequence of a first gene with the motif sequences included in the sequence of a second gene, and uses the sum of the obtained motif-to-motif similarities as the similarity score between the first gene and the second gene.

The gene clustering apparatus according to claim 1, wherein the intergene distance calculation unit
calculates the intergenic distance between the first gene and the second gene by obtaining the correlation between the elements of a first vector, whose elements are the similarity scores between the first gene and the second to Nth genes, and a second vector, whose elements are the similarity scores between the second gene and the first and third to Nth genes.

A gene clustering method for clustering a plurality of genes based on sequence similarity, comprising:
a motif search step of searching for motif sequences included in gene sequences;
a motif score calculation step of calculating a similarity score between any two genes by comparing the motif sequences included in each gene sequence;
an intergene distance calculation step of calculating an intergenic distance between any two genes using the similarity score; and
a clustering processing step of clustering the plurality of genes based on the intergenic distance,
wherein, in the motif score calculation step,
a similarity is obtained for every pairing of the motif sequences included in the sequence of a first gene with the motif sequences included in the sequence of a second gene, and the sum of the obtained motif-to-motif similarities is used as the similarity score between the first gene and the second gene.

A program that causes a computer to function as a gene clustering apparatus that clusters a plurality of genes based on sequence similarity, the program causing the computer to function as:
a motif search unit that searches for motif sequences included in gene sequences;
a motif score calculation unit that calculates a similarity score between any two genes by comparing the motif sequences included in each gene sequence;
an intergene distance calculation unit that calculates an intergenic distance between any two genes using the similarity score; and
a clustering processing unit that clusters the plurality of genes based on the intergenic distance,
wherein the motif score calculation unit
obtains a similarity for every pairing of the motif sequences included in the sequence of a first gene with the motif sequences included in the sequence of a second gene, and uses the sum of the obtained motif-to-motif similarities as the similarity score between the first gene and the second gene.