JP3928050B2

JP3928050B2 - Base sequence classification system and oligonucleotide frequency analysis system

Info

Publication number: JP3928050B2
Application number: JP2003328845A
Authority: JP
Inventors: 淑道池村; 貴志阿部; 智中川; 登喜男上月; 重彦金谷; 誠木ノ内
Original assignee: Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2003-09-19
Filing date: 2003-09-19
Publication date: 2007-06-13
Anticipated expiration: 2023-09-19
Also published as: JP2005092786A; WO2005029386A1

Description

The present invention relates to a base sequence classification system that creates a self-organizing map for classifying base sequences into biological classifications based on the appearance frequencies of multiple types of oligonucleotides in the base sequences. It also relates to an oligonucleotide appearance frequency analysis system that uses the self-organizing map to analyze biases in oligonucleotide appearance frequency (biases arising from biological classification, such as species, and biases arising from position in a DNA sequence).

Multivariate analyses such as factor correspondence analysis and principal component analysis (PCA) have been used successfully to investigate differences among gene sequences. However, the clustering power of conventional multivariate analysis is insufficient when large amounts of sequence data obtained from a wide variety of genomes are analyzed collectively.

The self-organizing map (hereinafter abbreviated as "SOM"), developed by Kohonen and based on a competitive neural network, has been used for recognizing images, speech, fingerprints and the like, and for controlling production processes for industrial products (Non-Patent Documents 1 and 2). An SOM is a nonlinear mapping of multidimensional data onto a two-dimensional array of connection weight vectors, and it effectively preserves the topology of the high-dimensional data space. The SOM is a powerful tool for clustering and visualizing complex high-dimensional data on a two-dimensional plane.

In recent years, as the genome information of various organisms has been elucidated, a vast amount of biological information has been accumulating. Using computers to unravel the mysteries of life from this information has become important for drug development and other fields, and SOMs are being applied increasingly widely. The present inventors proposed an improved SOM creation method, adapted from the conventional method for use in genome informatics (see Patent Document 1 and Non-Patent Documents 3 and 4). The improvement is based on a batch-learning SOM creation method, which processes data input and learning in batches so that neither the learning process nor the resulting map (SOM) depends on the order of data input. The improved method also uses principal component analysis (PCA) to define the initial connection weight vectors. The improved SOM therefore depends neither on the order of data input nor on the initial conditions.
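The order independence of batch learning can be illustrated with a minimal sketch. This is not the inventors' algorithm: the bounding-box initialization is a deterministic stand-in for the PCA-based initialization described above, the square neighborhood and its linear shrinking schedule are assumptions, and the names `sq_dist` and `batch_som` are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return math.fsum((x - y) ** 2 for x, y in zip(a, b))


def batch_som(vectors, rows, cols, epochs=10):
    """Train a rows x cols map with batch updates; returns one weight vector per node."""
    dim = len(vectors[0])
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Deterministic start (assumed stand-in for PCA initialization):
    # spread the nodes along the diagonal of the data's bounding box.
    lo = [min(v[d] for v in vectors) for d in range(dim)]
    hi = [max(v[d] for v in vectors) for d in range(dim)]
    steps = max(len(nodes) - 1, 1)
    weights = [[lo[d] + (k / steps) * (hi[d] - lo[d]) for d in range(dim)]
               for k in range(len(nodes))]
    for epoch in range(epochs):
        # Neighborhood radius shrinks linearly to 0 over the epochs (assumed schedule).
        radius = int((max(rows, cols) / 2) * (1 - epoch / epochs))
        # Step 1: assign every input to its best-matching node, weights held fixed.
        bmu = [min(range(len(nodes)), key=lambda k: sq_dist(v, weights[k]))
               for v in vectors]
        # Step 2: move each node to the mean of all inputs assigned to it
        # or to a node within `radius` grid steps of it.
        updated = []
        for k, (r, c) in enumerate(nodes):
            members = [v for v, b in zip(vectors, bmu)
                       if abs(nodes[b][0] - r) <= radius
                       and abs(nodes[b][1] - c) <= radius]
            if members:
                updated.append([math.fsum(m[d] for m in members) / len(members)
                                for d in range(dim)])
            else:
                updated.append(weights[k])  # no inputs nearby: keep the old weight
        weights = updated
    return weights
```

Because every node is updated only after all inputs have been assigned, and `math.fsum` returns a correctly rounded sum regardless of summation order, shuffling `vectors` yields the same map in this sketch.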

For example, Example 1 of Patent Document 1 discloses a method of creating an SOM in which microbial genes are classified using the improved SOM, based on the codon (trinucleotide) usage frequencies of 16 types of microorganisms as high-dimensional input data.

Patent Document 1: International Publication No. WO 02/50767 A1 (published June 27, 2002)
Non-Patent Document 1: "Application of Self-Organizing Maps: Two-Dimensional Visualization of Multidimensional Information" (Heizo Tokutaka, Satoru Kishida and Kikuo Fujimura; Kaibundo Publishing Co., Ltd.; first edition published July 20, 1999; ISBN 4-303-73230-3)
Non-Patent Document 2: "Self-Organizing Maps" (T. Kohonen; translated by Heizo Tokutaka, Satoru Kishida and Kikuo Fujimura; Springer-Verlag Tokyo; published June 15, 1996; ISBN 4-431-70700-X C3055)
Non-Patent Document 3: Kanaya, S., Kinouchi, M., Abe, T., Kudo, Y., Yamada, Y., Nishi, T., Mori, H. and Ikemura, T. (2001) Analysis of codon usage diversity of bacterial genes with a self-organizing map (SOM): characterization of horizontally transferred genes with emphasis on the E. coli O157 genome. Gene 276, 89-99.
Non-Patent Document 4: Abe, T., Kanaya, S., Kinouchi, M., Ichiba, Y., Kozuki, T. and Ikemura, T. (2003) Informatics for unveiling hidden genome signatures. Genome Res. 13, 693-702.

Public DNA databases register the sequence data of only one strand of each pair of complementary DNA sequences. In Patent Document 1 and Non-Patent Documents 3 and 4, the SOM is created using the appearance frequencies of oligonucleotides in this sequence data.

However, considering the overall characteristics of oligonucleotide appearance frequencies in a genome, the difference in appearance frequency between complementary oligonucleotides (for example, between AAAC and GTTT) in one strand of a pair of complementary DNA sequences is not important for classification. Rather, because complementary oligonucleotides should appear with identical frequency in the double-stranded DNA as a whole, such differences may cause the resulting SOM to reflect the characteristics of the double-stranded DNA inaccurately. In addition, because creating an SOM requires a long computation time, it is desirable to shorten that time as much as possible.

A first object of the present invention is to provide a base sequence classification system that can create a self-organizing map in a short time without a noticeable decrease in classification ability.

In the SOMs created from oligonucleotide appearance frequencies in Patent Document 1 and Non-Patent Documents 3 and 4, the microorganisms from which the base sequences are derived can be classified into multiple biological classifications. However, these SOMs do not reveal the appearance frequency of individual oligonucleotides in the base sequences derived from organisms belonging to each biological classification. If those frequencies could be determined, it would be possible, for example, to find the key oligonucleotides that separate biological classifications, which would be useful.

A second object of the present invention is to provide an oligonucleotide appearance frequency analysis system that makes it possible to determine the appearance frequency of individual oligonucleotides in base sequences derived from organisms belonging to each biological classification.

Although a full-length draft sequence of the human genome has been determined, it still contains a large number of regions whose functions are unknown. Finding signal sequences (base sequences with functions such as recognition by transcription factors) in a very long DNA base sequence such as human DNA and analyzing their functions is extremely difficult. Being able to predict the positions in a DNA base sequence where many signal sequences are present would therefore be useful for functional analysis.

A third object of the present invention is to provide an oligonucleotide appearance frequency analysis system that makes it possible to predict the positions in a DNA base sequence where many signal sequences are present.

To solve the above problems, the base sequence classification system of the present invention creates a self-organizing map by self-organization, in which the appearance frequencies of multiple types of oligonucleotides in base sequences are placed in a multidimensional space as a group of input vectors, and the input vectors are nonlinearly mapped onto a map on which multiple lattice points are arranged, so that the base sequences are classified into the lattice points. The system is characterized by comprising an adding unit that calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming a complementary pair, and a self-organizing map creation unit that creates the self-organizing map based on the appearance frequency of each pair.
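The role of the adding unit can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function names are hypothetical, and choosing the alphabetically smaller oligonucleotide as the representative of each pair is an assumption made only to give each pair a single key.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(oligo):
    """Return the oligonucleotide as read on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(oligo))


def canonical(oligo):
    """Pick one representative of a complementary pair (the alphabetically smaller)."""
    return min(oligo, reverse_complement(oligo))


def merge_complementary(freqs):
    """Add the frequencies of each complementary pair into a single entry."""
    merged = {}
    for oligo, value in freqs.items():
        key = canonical(oligo)
        merged[key] = merged.get(key, 0) + value
    return merged


def pair_count(k):
    """Number of dimensions left after merging all 4**k oligonucleotides of length k."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})
```

For example, `merge_complementary({"AAAC": 3, "GTTT": 5})` gives a single entry for the AAAC/GTTT pair. For tetranucleotides, `pair_count(4)` is 136 rather than exactly 128, because 16 of the 256 tetranucleotides (such as ACGT) are their own reverse complement; the input dimension is thus reduced to roughly half.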

To solve the above problems, the oligonucleotide appearance frequency analysis system of the present invention is characterized by comprising a self-organizing map creation unit and an appearance frequency map creation unit. The self-organizing map creation unit creates a self-organizing map by self-organization, in which the appearance frequencies of multiple types of oligonucleotides in base sequences are placed in a multidimensional space as a group of input vectors, and the input vectors are nonlinearly mapped onto a map on which multiple lattice points are arranged, so that the base sequences are classified into the lattice points. The appearance frequency map creation unit creates, for each individual oligonucleotide, an appearance frequency map that shows information on the appearance frequency of that oligonucleotide at each lattice point.
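One way to picture an appearance frequency map is as a grid holding, for one chosen oligonucleotide, a summary of its frequency in the sequences classified into each lattice point. The sketch below assumes that the summary is a plain mean and that the classification result is available as `(lattice_point, frequencies)` pairs; both the data layout and the name `frequency_map` are hypothetical.

```python
def frequency_map(assignments, oligo, rows, cols):
    """Mean frequency of `oligo` at each lattice point; None where no sequence landed."""
    totals = [[0.0] * cols for _ in range(rows)]
    counts = [[0] * cols for _ in range(rows)]
    for (r, c), freqs in assignments:
        # An oligonucleotide missing from a sequence counts as frequency 0.
        totals[r][c] += freqs.get(oligo, 0.0)
        counts[r][c] += 1
    return [[totals[r][c] / counts[r][c] if counts[r][c] else None
             for c in range(cols)]
            for r in range(rows)]
```

Read side by side with the self-organizing map, such a grid shows in which regions of the map, and hence in which biological classifications, the chosen oligonucleotide is frequent or rare.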

The analysis system preferably further comprises an expected value calculation unit that calculates the expected value of the appearance frequency of an oligonucleotide in the base sequences classified into each lattice point, based on the mononucleotide composition of those base sequences, and a normalization unit that normalizes the appearance frequency of the oligonucleotide in the base sequences classified into each lattice point by dividing it by the expected value. The appearance frequency map creation unit then creates the appearance frequency map based on the normalized appearance frequency.
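The expected value calculation and normalization can be sketched under a simple assumption: that the expected frequency of an oligonucleotide is the product of the frequencies of its constituent mononucleotides. The text above states only that the expected value is based on mononucleotide composition, so this formula and the function names below are illustrative assumptions.

```python
def mononucleotide_composition(seq):
    """Fraction of A, C, G and T in the sequence."""
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def expected_frequency(oligo, composition):
    """Expected frequency if bases occurred independently (assumed model)."""
    expected = 1.0
    for base in oligo:
        expected *= composition[base]
    return expected


def normalized_frequency(seq, oligo):
    """Observed frequency divided by expected frequency; None if expected is 0."""
    k = len(oligo)
    windows = len(seq) - k + 1
    observed = sum(1 for i in range(windows) if seq[i:i + k] == oligo) / windows
    expected = expected_frequency(oligo, mononucleotide_composition(seq))
    return observed / expected if expected else None
```

A ratio above 1 indicates that the oligonucleotide appears more often than its base composition alone would predict, and a ratio below 1 that it appears less often.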

The analysis system preferably further comprises an adding unit that calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming a complementary pair, with the self-organizing map creation unit and the appearance frequency map creation unit creating the self-organizing map and the appearance frequency map based on the appearance frequency of each pair.

To solve the above problems, the oligonucleotide appearance frequency analysis system of the present invention is characterized by comprising a self-organizing map creation unit and an appearance frequency distribution diagram creation unit. The self-organizing map creation unit places the appearance frequencies of multiple types of oligonucleotides in multiple fragment base sequences taken from the same DNA sequence in a multidimensional space as a group of input vectors, and nonlinearly maps the input vectors by self-organization from the multidimensional space onto a map on which multiple lattice points are arranged, thereby creating a self-organizing map in which the fragment base sequences are classified into the lattice points. The appearance frequency distribution diagram creation unit creates an appearance frequency distribution diagram showing the distribution of the appearance frequency of individual oligonucleotides along the DNA sequence, based on the appearance frequency of individual oligonucleotides in the fragment base sequences classified into each lattice point.

The analysis system preferably further comprises an expected value calculation unit that calculates the expected value of the appearance frequency of an oligonucleotide in each fragment base sequence based on the mononucleotide composition of that fragment base sequence, and a normalization unit that normalizes the appearance frequency of the oligonucleotide in each fragment base sequence by dividing it by the expected value. The appearance frequency distribution diagram creation unit then creates the appearance frequency distribution diagram based on the normalized appearance frequency.

The analysis system preferably further comprises an adding unit that calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming each complementary pair, with the appearance frequency distribution diagram creation unit creating the appearance frequency distribution diagram based on the appearance frequency of each pair.

According to the classification system of the present invention, a self-organizing map can be created based on the appearance frequency of each pair of oligonucleotides.

The self-organizing map is useful as a bioinformatics tool for classifying many kinds of base sequences into biological classifications.

Using a self-organizing map also makes it possible to predict efficiently which kinds of microorganisms the DNA in a mixed sample of bacterial DNA of unknown composition comes from, and in what quantities. The self-organizing map is therefore particularly useful for analyzing the composition of mixed samples containing multiple species of microorganisms, such as mixtures of microorganisms from natural environments that are difficult to culture.

Using a self-organizing map also makes it possible to classify base sequences for which no information on biological classification is available. The SOM is therefore useful for identifying the phylogenetic group of an organism (such as a bacterium) whose species is entirely unknown and for which only part of the genome sequence is known. The self-organizing map is accordingly useful in searching for novel, industrially useful bacteria and the like.

Using a self-organizing map also makes it possible to find segments in a genome that are thought to have been introduced from another species through horizontal transfer.

According to the self-organizing map, as will be described in detail later,

According to the classification system of the present invention, the adding unit calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming a complementary pair, and the self-organizing map creation unit creates the self-organizing map based on the appearance frequency of each pair. The amount of computation in the self-organizing map creation unit can therefore be halved, and the computation time in that unit, which generally is long, can be reduced to about half. Moreover, because the difference in appearance frequency between the two oligonucleotides of a complementary pair is not important for classifying base sequences, treating the two as identical allows the self-organizing map to be created without a noticeable decrease in classification ability.

Therefore, the present invention can provide a base sequence classification system that can create a self-organizing map in a short time without a decrease in classification ability.

The oligonucleotide appearance frequency analysis system of the present invention that comprises the self-organizing map creation unit and the appearance frequency map creation unit can create a self-organizing map and, in addition, an appearance frequency map that shows information on oligonucleotide appearance frequency at positions corresponding to the lattice points of the self-organizing map. Because the self-organizing map separates into regions for each biological classification, viewing the appearance frequency map with reference to the self-organizing map reveals the characteristic appearance frequencies of individual oligonucleotides in base sequences derived from organisms belonging to each biological classification. For example, one can identify the oligonucleotides that are over-represented, and those that are under-represented, in base sequences derived from organisms belonging to a particular biological classification. This makes it possible, for example, to find the key oligonucleotides that separate biological classifications.

If the analysis system further comprises the expected value calculation unit and the normalization unit, and the appearance frequency map creation unit creates the appearance frequency map based on the normalized appearance frequency, then differences in oligonucleotide appearance frequency between base sequences classified into different lattice points can be detected accurately, separated from the bias in the mononucleotide composition of each base sequence. An appearance frequency map can therefore be created that more accurately reflects the differences in oligonucleotide appearance frequency between organisms belonging to different biological classifications.

If the analysis system further comprises an adding unit that calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming a complementary pair, and the self-organizing map creation unit and the appearance frequency map creation unit create the self-organizing map and the appearance frequency map based on the appearance frequency of each pair, the following further effects are obtained. Because the self-organizing map creation unit creates the self-organizing map based on the appearance frequency of each pair calculated by the adding unit, the amount of computation in the self-organizing map creation unit can be halved, and the computation time in that unit, which generally is long, can be reduced to about half. Because the difference in appearance frequency between the two oligonucleotides of a complementary pair is not important for classifying base sequences, treating the two as identical allows the self-organizing map to be created without a noticeable decrease in classification ability. Furthermore, because the appearance frequency map creation unit also creates the appearance frequency map based on the appearance frequency of each pair, the amount of computation in the appearance frequency map creation unit can likewise be halved, and the computation time in that unit, which generally is long, can be reduced to about half. Because the difference in appearance frequency between the two oligonucleotides of a complementary pair is not important for the appearance frequency map, treating the two as identical allows the appearance frequency map to be created without loss of accuracy.

The oligonucleotide appearance frequency analysis system of the present invention that comprises the self-organizing map creation unit and the appearance frequency distribution diagram creation unit can create an appearance frequency distribution diagram showing the distribution of the appearance frequency of individual oligonucleotides along a DNA sequence. The diagram reveals the regions of the DNA sequence in which a particular oligonucleotide is over-represented and those in which it is under-represented. Some of these regions are thought to correspond to regions containing many signal sequences or to gene-rich regions. The appearance frequency distribution diagram therefore makes it possible to predict the positions of regions containing many signal sequences and of gene-rich regions.
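As a rough illustration of how over- and under-represented regions might be read off such a distribution, the sketch below takes a profile of `(start, ratio)` values for consecutive, non-overlapping fragments and merges adjacent fragments that fall on the same side of a threshold. The thresholds 1.5 and 0.5, the input layout and the name `flag_regions` are all hypothetical choices, not values taken from the source.

```python
def flag_regions(profile, fragment_length, high=1.5, low=0.5):
    """Merge adjacent fragments whose ratio is >= high ("over") or <= low ("under").

    profile: list of (start, ratio) tuples sorted by start position.
    Returns a list of (start, end, label) tuples.
    """
    regions = []
    for start, ratio in profile:
        if ratio >= high:
            label = "over"
        elif ratio <= low:
            label = "under"
        else:
            continue  # fragment is unremarkable; it also breaks any running region
        end = start + fragment_length
        if regions and regions[-1][2] == label and regions[-1][1] == start:
            # Extend the previous region when it touches this fragment.
            regions[-1] = (regions[-1][0], end, label)
        else:
            regions.append((start, end, label))
    return regions
```

With 10 kb fragments, a profile such as `[(0, 1.0), (10000, 1.8), (20000, 2.1), (30000, 0.9), (40000, 0.3)]` yields one over-represented region spanning 10000 to 30000 and one under-represented region spanning 40000 to 50000.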

If the analysis system further comprises the expected value calculation unit and the normalization unit, and the appearance frequency distribution diagram creation unit creates the appearance frequency distribution diagram based on the oligonucleotide appearance frequency normalized by the mononucleotide appearance frequency, then differences in oligonucleotide appearance frequency between fragment base sequences at different positions can be detected accurately, separated from the bias in the mononucleotide composition of each base sequence. An appearance frequency distribution diagram can therefore be created that more accurately reflects the differences in oligonucleotide appearance frequency between fragment base sequences at different positions.

If the analysis system further comprises an adding unit that calculates the appearance frequency of each pair of oligonucleotides by adding the appearance frequencies of the oligonucleotides forming each complementary pair, and the appearance frequency distribution diagram creation unit creates the appearance frequency distribution diagram based on the appearance frequency of each pair, the following further effects are obtained. Because the appearance frequency distribution diagram creation unit creates the diagram based on the appearance frequency of each pair calculated by the adding unit, the amount of computation in that unit can be halved, and the computation time in that unit, which generally is long, can be reduced to about half. Because the difference in appearance frequency between the two oligonucleotides of a complementary pair is not important for the appearance frequency distribution diagram, treating the two as identical allows the diagram to be created without loss of accuracy.

As shown in FIG. 1, the oligonucleotide appearance frequency analysis system (base sequence classification system) of the present embodiment comprises a complementary data adding unit (adding unit) 1, an oligonucleotide appearance frequency data storage unit 2, an expected value calculation unit 3, a mononucleotide composition data storage unit 4, a normalization unit 5, an appearance frequency map creation unit 6, an appearance frequency distribution diagram creation unit 7, a storage unit 8 for storing various calculation data, an annotation data creation unit 9, an SOM creation unit 10, an input device 11 such as a keyboard or mouse for entering data and instructing the execution of calculations, and an output device 12 such as a display or printer for output such as display and printing.

Of these, the complementary data adding unit 1, the expected value calculation unit 3, the normalization unit 5, the appearance frequency map creation unit 6, the appearance frequency distribution diagram creation unit 7, the annotation data creation unit 9 and the SOM creation unit 10 are functional blocks realized by a computer and a computer program that causes the computer to function as each of these units. For example, the SOM creation unit 10 can be realized by a computer and a batch-learning SOM program that causes the computer to function as the SOM creation unit 10, such as "XanaMine" from Xanagen Inc. The storage unit 8 can be realized by rewritable memory such as RAM, and the oligonucleotide appearance frequency data storage unit 2 and the mononucleotide composition data storage unit 4 can be realized by a mass storage device such as a hard disk.

The oligonucleotide appearance frequency data storage unit 2 stores data, supplied from a public base sequence database or the like, on the appearance frequency of each individual oligonucleotide in a plurality of base sequences. The plurality of base sequences may be derived from a plurality of different species, or may be a plurality of fragment base sequences taken from the same DNA sequence. When fragment base sequence data are used, the fragments are preferably of uniform length (for example, 10 kb or 100 kb). When a plurality of fragments taken from the same DNA sequence are used, the fragments (regions) may partially overlap or may not overlap at all. Each fragment is preferably at least 10 kb long, and more preferably at least 100 kb long.

The oligonucleotide appearance frequency data stored in the oligonucleotide appearance frequency data storage unit 2 may be prepared, for example, as follows. First, data for a plurality of full-length DNA base sequences, or for a plurality of fragment base sequences derived from the same DNA, are obtained from an existing genome information database or from full-length or fragment sequences of genes newly determined by any of various sequencing methods. When fragment data derived from the same DNA are used, the fragments are preferably obtained by cutting a plurality of regions of a specific length out of the full-length base sequence. Next, based on these DNA base sequence data, the number of occurrences of each oligonucleotide in the full-length or fragment base sequence is counted to obtain the oligonucleotide appearance frequencies, which are then stored in the oligonucleotide appearance frequency data storage unit 2. When the sequence to be analyzed is already registered in an existing database of oligonucleotide appearance frequencies, it suffices to import the data from that database into the oligonucleotide appearance frequency data storage unit 2 via the Internet or the like in response to an instruction from the input device 11. The appearance frequency data can also be stored by entering the numerical values directly with the input device 11, for example by manual entry on a keyboard, voice input, or paper-based input.
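As a rough illustration of the preparation described above, the fragment extraction and counting might be sketched in Python as follows. The function names `split_into_fragments` and `count_oligonucleotides`, the use of overlapping counting windows, and the choice to skip windows containing symbols other than A, C, G, and T are assumptions made for this sketch and are not specified in the embodiment.

```python
from collections import Counter
from itertools import product


def split_into_fragments(sequence, length, step=None):
    """Cut fixed-length fragments out of a sequence.

    step == length gives non-overlapping fragments; step < length gives
    partially overlapping ones. A trailing piece shorter than `length`
    is dropped so that all fragments have the same length.
    """
    step = step or length
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, step)]


def count_oligonucleotides(sequence, k=3):
    """Count every k-base oligonucleotide using a sliding window of width k."""
    seq = sequence.upper()
    # Start every possible oligonucleotide at zero so the result always
    # has 4**k entries (64 for trinucleotides).
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=k)})
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    return counts
```

Calling `count_oligonucleotides` on each fragment returned by `split_into_fragments` gives one frequency table per fragment, which could then be stored as described.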

To improve classification ability, the oligonucleotide appearance frequency is preferably that of oligonucleotides of three or more bases. On the other hand, using oligonucleotides with too many bases makes the amount of data to be processed enormous. The appearance frequency is therefore preferably that of oligonucleotides of 3 to 5 bases (trinucleotides, tetranucleotides, or pentanucleotides). The appearance frequency data need only include values for a plurality of oligonucleotides, but preferably include values for all oligonucleotides of a given length (for example, all 64 trinucleotides).

The oligonucleotide appearance frequencies may also include the appearance frequencies of palindromic oligonucleotides that contain n internal bases N unrelated to the palindrome (n is an integer from 1 to n_max; n_max is a natural number, for example 3). For example, when hexanucleotide appearance frequencies are analyzed, the appearance frequencies of palindromic oligonucleotides with n such internal bases N (n is an integer from 1 to 3), such as GGGNCCC, GGGNNCCC, and GGGNNNCCC, may be included in the analysis. This increases the amount of computation, but makes it possible to create an SOM in which closely related species are separated into different regions.
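One way such gapped palindromic patterns could be counted is sketched below in Python. The function name `count_gapped_pattern` and the use of a regular-expression lookahead (so that overlapping occurrences are all counted) are assumptions of this sketch; the embodiment does not prescribe a counting procedure.

```python
import re


def count_gapped_pattern(sequence, left, right, n_max=3):
    """Count left + N*n + right for n = 1..n_max, where N is any of A, C, G, T."""
    seq = sequence.upper()
    result = {}
    for n in range(1, n_max + 1):
        # A zero-width lookahead matches at every start position, so
        # overlapping occurrences are not missed.
        pattern = re.compile(
            "(?=%s[ACGT]{%d}%s)" % (re.escape(left), n, re.escape(right)))
        result[left + "N" * n + right] = len(pattern.findall(seq))
    return result
```

For instance, `count_gapped_pattern(seq, "GGG", "CCC")` returns counts keyed by `GGGNCCC`, `GGGNNCCC`, and `GGGNNNCCC`, which could be appended to the ordinary hexanucleotide counts as extra input dimensions.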

The complementary data addition unit 1 uses the per-oligonucleotide appearance frequency data stored in the oligonucleotide appearance frequency data storage unit 2 to add together the appearance frequencies of the two oligonucleotides forming each complementary pair, thereby calculating an appearance frequency for each pair, and outputs the resulting data.
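The addition performed by the complementary data addition unit 1 might look like the following Python sketch. It assumes that the complementary partner of an oligonucleotide is its reverse complement (for example, AAA and TTT), and that each pair is labeled by the alphabetically smaller member; both are assumptions of this sketch, as are the function names.

```python
def reverse_complement(oligo):
    """Return the reverse complement, e.g. ACG -> CGT."""
    return oligo.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def merge_complementary_pairs(counts):
    """Sum the counts of each oligonucleotide and its complementary partner.

    A self-complementary oligonucleotide (e.g. ACGT) is its own partner
    and therefore keeps its original count.
    """
    merged = {}
    for oligo, n in counts.items():
        key = min(oligo, reverse_complement(oligo))  # one label per pair
        merged[key] = merged.get(key, 0) + n
    return merged
```

Under these assumptions the 64 trinucleotides collapse to 32 pairs, which halves the dimension of the input vectors.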

The annotation data creation unit 9 annotates the biological meaning of fragment sequences by referring to a base sequence database or the like. The SOM creation unit 10 places the appearance frequencies of the plurality of types of oligonucleotides in the base sequences in a multidimensional space as a group of input vectors, and creates a self-organizing map by self-organization, in which the input vectors are mapped nonlinearly onto a map on which a plurality of lattice points are arranged so that each base sequence is classified to a lattice point. Preferably, the SOM creation unit 10 performs self-organization using, as input vector data, the per-pair appearance frequency data supplied from the complementary data addition unit 1. Using the per-pair data shortens computation time and removes the effect of complementarity. Alternatively, the SOM creation unit 10 may perform self-organization using, as input vector data, the per-oligonucleotide appearance frequency data stored in the oligonucleotide appearance frequency data storage unit 2.

The SOM creation unit 10 outputs the created SOM data via the output device 12 so that the SOM can be output (displayed, printed, etc.) as a two-dimensional or three-dimensional image. Examples of output forms include a two-dimensional image in which the biological classification of the base sequences classified to each lattice point is represented by color and the number of such sequences by color density, and a three-dimensional image in which the number of base sequences classified to each lattice point is represented by bar height. In addition to the SOM data, the SOM creation unit 10 can output the identification information (ID) of the base sequences classified to each lattice point, the oligonucleotide appearance frequency data for those sequences, and annotation information. When the SOM has been created from the per-pair data supplied by the complementary data addition unit 1, the appearance frequency data output by the SOM creation unit 10 for each lattice point are likewise per-pair appearance frequencies.

The resulting SOM is a novel, powerful, and highly sensitive tool for comparative genome analysis. SOMs of bacterial genomes are particularly useful as a bioinformatics tool for assigning DNA sequences obtained from mixtures of hard-to-culture environmental microorganisms to biological classifications. Because its classification ability is very high, the SOM is an efficient and powerful tool for extracting diverse genomic information from enormous quantities of bacterial sequence.

Most microorganisms in the environment remain unculturable, so our understanding of the dominant microbial populations in the environment is limited. To study microbial diversity in the environment and to search for entirely novel, industrially useful genes, many groups sequence DNA fragments derived from mixtures of microorganisms in environments such as extreme environments. In such studies, DNA is extracted from the mixture of microorganisms without culturing or cloning individual species. The DNA sample is then fragmented and cloned into a vector such as a sequencing vector, after which the sequences carried by the vectors are determined. It is therefore important to know what types of genomic DNA are present in the DNA mixture, in what numbers, and how novel their sequences are. The SOM created by the system of the present invention using an unsupervised neural network algorithm is a powerful bioinformatics tool for obtaining information about the genomic DNA in such mixtures. Specifically, if an SOM is created in advance from known bacterial genomes and the regions of the SOM are separated by biological classification, then the sequences of fragments obtained from a mixed sample of bacterial DNA of unknown composition can be placed on the SOM, and the positions at which they fall can be used to predict efficiently what kinds of microorganisms are present in the sample and in what numbers. The SOM is therefore particularly useful for analyzing the composition of mixed samples containing multiple kinds of microorganisms, such as mixtures of hard-to-culture environmental microorganisms, and hence for searching for novel, industrially useful bacteria and the like.

Moreover, because the SOM can recognize the species-specific features (key combinations of oligonucleotide appearance frequencies) that serve as the signature of each genome, creating an SOM in advance from known DNA sequences and separating its regions by species makes it possible to classify, by species, genome sequences for which no species information is available. The SOM is therefore useful for identifying the species of an organism (such as a bacterium) whose species is entirely unknown and for which only part of the genome sequence is known, and hence for searching for novel, industrially useful bacteria and the like.

As described above, once genome sequences from a wide variety of taxonomic groups have accumulated, the SOM is expected to become a broadly applicable and powerful tool for classifying bacterial sequences, and an extremely useful one for classifying base sequences obtained from mixed DNA samples of unculturable microorganisms (for example, samples of seabed sediment).

In addition, if an SOM is created in advance from known DNA sequences and the territory on the SOM that contains almost all of the base sequences of a given species is identified, then segments thought to have been introduced from other species by horizontal transfer can be found by mapping the oligonucleotide appearance frequencies of fragments cut from the DNA sequence of one species onto the SOM and looking for fragments that fall outside the identified territory.

The mononucleotide composition data storage unit 4 stores mononucleotide composition data for each base sequence to be analyzed. The expected value calculation unit 3 refers to these data to obtain the mononucleotide composition (the content of each mononucleotide) of the base sequences classified to each lattice point on the SOM and, from that composition, calculates the expected value of the appearance frequency of each oligonucleotide in the base sequences classified to that lattice point. The normalization unit 5 normalizes the appearance frequency of each oligonucleotide in the base sequences classified to each lattice point on the SOM by dividing it by the expected value calculated by the expected value calculation unit 3.
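The embodiment does not state the formula used for the expected value. A common choice, assumed in the Python sketch below, is to treat the bases as independent, so that the expected count of an oligonucleotide is the total number of counted windows multiplied by the product of the mononucleotide fractions of its bases. The function names and the handling of a zero expected value (returning NaN) are likewise assumptions of this sketch.

```python
from math import prod


def expected_count(oligo, composition, total_windows):
    """Expected count of `oligo` if bases occurred independently.

    composition maps each base to its fraction, e.g. {"A": 0.3, ...}.
    """
    return total_windows * prod(composition[base] for base in oligo)


def normalize(observed, composition):
    """Divide each observed count by its expected count.

    A ratio above 1 means the oligonucleotide is over-represented
    relative to the mononucleotide composition; below 1, under-represented.
    """
    total = sum(observed.values())
    ratios = {}
    for oligo, n in observed.items():
        e = expected_count(oligo, composition, total)
        ratios[oligo] = n / e if e > 0 else float("nan")
    return ratios
```

This sketch handles single oligonucleotides only; for per-pair data, the expected values of both members of a complementary pair would presumably need to be summed before dividing.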

The appearance frequency map creation unit 6 creates, for each individual oligonucleotide, an appearance frequency map that shows information on the appearance frequency of that oligonucleotide at each lattice point. By outputting the map data via the output device 12, the appearance frequency map creation unit 6 can output the map (display, print, etc.) as a two-dimensional or three-dimensional image. Examples of output forms include a two-dimensional image in which the oligonucleotide appearance frequency of the base sequences classified to each lattice point is represented by color and color density, and a three-dimensional image in which the number of base sequences classified to each lattice point is represented by bar height. The appearance frequency map creation unit 6 creates the map from the appearance frequencies normalized by the normalization unit 5. It may create the map from the per-oligonucleotide appearance frequency data stored in the oligonucleotide appearance frequency data storage unit 2, but more preferably does so from the per-pair appearance frequency data supplied from the SOM creation unit 10, which shortens computation time and removes the effect of complementarity.

When the base sequences to be analyzed are a plurality of fragment base sequences taken from the same genome sequence, the appearance frequency distribution diagram creation unit 7 creates, from the appearance frequency of each individual oligonucleotide in the fragments classified to each lattice point, an appearance frequency distribution diagram showing how the appearance frequency of each oligonucleotide is distributed along the genome sequence. By outputting the diagram data via the output device 12, the appearance frequency distribution diagram creation unit 7 can output the diagram (display, print, etc.) as a one-dimensional or two-dimensional image. Examples of output forms include a bar corresponding to the genome sequence on which the biological classification of the fragment classified to each lattice point is represented by color and color density, and a two-dimensional image in which the position on the genome sequence is the x coordinate and the oligonucleotide appearance frequency of the fragment at that position is the y coordinate. The appearance frequency distribution diagram creation unit 7 creates the diagram from the appearance frequencies normalized by the normalization unit 5. It preferably does so from the per-pair appearance frequency data supplied from the SOM creation unit 10, which shortens computation time and removes the effect of complementarity.

Next, the SOM creation unit 10 and the map creation step it executes are described in detail.

As the SOM creation method in the map creation step, the SOM creation method based on Kohonen's self-organization method (hereinafter, the "Kohonen method"), the improved SOM creation method based on the self-organization method described in Patent Document 1 (hereinafter simply the "improved SOM creation method"), or the like can be used; the improved SOM creation method is preferred.

The basic principle of the self-organization method used to create the SOM is as follows. The method uses a neural network to map multidimensional input data nonlinearly from a high-dimensional space to a low-dimensional space, so that the mapping preserves the similarity relationships among the input data in the high-dimensional space (the features of the input data). It uses a two-layer neural network consisting of an input layer, which is a high-dimensional space in which the multidimensional input data are plotted, and an output layer made up of a plurality of output neurons (lattice points) arranged in a lattice in a low-dimensional space. Using input vectors corresponding to the input values, and connection weight vectors (neuron vectors) representing the weights of the connections between the points corresponding to the input values and the output neurons, the connection weight vectors are first set to initial values (initial connection weight vectors), and learning then proceeds by correcting them.

The self-organization method learns while taking the positional relationships of the output neurons into account. In the output layer of the neural network, relative positional (distance) relationships exist between output neurons. The connection weight vectors of the output neuron whose connection weight vector is closest to the input data vector, and of the output neurons in its neighborhood, are corrected; connection weight vectors outside that neighborhood are not corrected. The neural network learns in this way. The distance between an input vector and a connection weight vector is calculated as the Euclidean distance. The correction moves the connection weight vector toward the input vector. For example, the difference between the input vector and the connection weight vector of the winning neuron is multiplied by a learning coefficient (between 0 and 1) and then added to the original connection weight vector.

In this way, by taking the positional relationships of the output neurons into account, the self-organization method can map the input data onto the low-dimensional space (output layer) while preserving the distance relationships among the input data in the input data space (input layer).

The Kohonen method consists of the following three steps. Step 1: the vector on each neuron (connection weight vector) is initialized with random values. Step 2: the neuron whose connection weight vector is closest to the input vector is selected. Step 3: the connection weight vectors of the selected neuron and its neighbors are updated. Steps 2 and 3 are repeated once for each input vector; this constitutes one round of learning, and a predetermined number of rounds are performed. After learning, each input vector is classified to the neuron with the closest connection weight vector. Kohonen's SOM thus performs a feature-preserving nonlinear mapping from a group of input vectors in a high-dimensional space to a group of neurons arranged on a low-dimensional map.
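The three steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the fixed learning coefficient `alpha`, the square neighborhood of fixed `radius`, and the seeded random initialization are all assumptions of the sketch (practical implementations usually shrink the coefficient and the neighborhood as learning proceeds).

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_matching_unit(weights, x):
    """Return the lattice position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def train_kohonen(inputs, rows, cols, epochs=20, alpha=0.5, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(inputs[0])
    # Step 1: initialize every connection weight vector with random values.
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(rows) for j in range(cols)}
    for _ in range(epochs):          # one pass over all inputs = one round
        for x in inputs:
            # Step 2: select the neuron closest to this input vector.
            bi, bj = best_matching_unit(weights, x)
            # Step 3: move the winner and its lattice neighbors toward x.
            for (i, j), w in weights.items():
                if abs(i - bi) <= radius and abs(j - bj) <= radius:
                    for m in range(dim):
                        w[m] += alpha * (x[m] - w[m])
    return weights
```

Because each input updates the weights immediately, the result depends on both the random seed and the order of `inputs`.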

In the Kohonen method, steps 2 and 3 update the connection weight vectors based on the classification of a single input at a time, so vectors input later are separated more finely and a different SOM results depending on the order in which the input vectors are learned. A reproducible SOM may therefore not be obtained. In addition, because the initial connection weight vectors in step 1 are random values, the structure of those random values affects the self-organizing map obtained after learning, so that factors other than the input vectors are reflected in the map and the structure of the input vectors may not be reflected accurately in the SOM. Furthermore, because of the random initialization in step 1, learning takes a very long time when the initial values differ greatly from the structure of the input vectors. Finally, because the updates in steps 2 and 3 are made one input at a time, learning time grows in proportion to the number of input vectors.

By contrast, the improved SOM creation method is a method of classifying input vector data to connection weight vectors by nonlinear mapping using a computer, and includes the steps of: (a) inputting oligonucleotide appearance frequency data as multidimensional input vector data; (b) setting initial connection weight vectors; (c) classifying the input vectors to the connection weight vectors; (d) updating each connection weight vector so that it has a structure similar to the input vectors classified to it and the input vectors classified to its neighborhood; (e) repeating steps (c) and (d) until the number of learning iterations reaches a set number; and (f) classifying the input vectors to the connection weight vectors and outputting the result as an SOM.

In the improved SOM creation method, the sequential algorithm of the Kohonen method, which classifies one input vector to the (initial) connection weight vectors at a time, is replaced by a batch learning algorithm that updates the individual connection weight vectors only after all input vectors have been classified. This yields a reproducible SOM and keeps computation time short even when the number of input vectors is large.
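The difference from the sequential procedure can be seen in the following Python sketch of a batch update. The exact update rule of Patent Document 1 is not reproduced in this section, so the rule used here (move each weight vector a fraction `alpha` of the way toward the mean of the inputs classified to it and to its lattice neighbors) is an assumption, as are the function name and the fixed `alpha` and `radius`.

```python
def train_batch(inputs, weights, epochs=10, alpha=0.5, radius=1):
    """Batch-learning sketch. `weights` maps (row, col) -> weight vector.

    The initial weights are supplied by the caller and are not modified;
    a new dictionary is returned.
    """
    dim = len(inputs[0])
    for _ in range(epochs):
        # Classify ALL input vectors first, using the current weights.
        assigned = {pos: [] for pos in weights}
        for x in inputs:
            bmu = min(weights, key=lambda p: sum(
                (a - b) ** 2 for a, b in zip(weights[p], x)))
            assigned[bmu].append(x)
        # Only then update every weight vector in one sweep.
        new_weights = {}
        for (i, j), w in weights.items():
            pool = [x for (p, q), xs in assigned.items()
                    if abs(p - i) <= radius and abs(q - j) <= radius
                    for x in xs]
            if pool:
                mean = [sum(x[m] for x in pool) / len(pool) for m in range(dim)]
                new_weights[(i, j)] = [w[m] + alpha * (mean[m] - w[m])
                                       for m in range(dim)]
            else:
                new_weights[(i, j)] = list(w)  # nothing nearby: keep as is
        weights = new_weights
    return weights
```

Since every update is computed from the complete classification of the previous sweep, the outcome does not depend on the order in which the input vectors are presented, and the per-neuron updates within a sweep are independent of one another.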

In step (a), the input vector data may be data for K input vectors (K is a positive integer of 3 or more), each of M dimensions (M is a positive integer).

In step (b), the initial connection weight vectors are preferably set so that their arrangement and elements reflect the features of the distribution of the multidimensional input vectors in the multidimensional space, as obtained by an unsupervised multivariate analysis method. This allows the structure of the input vectors to be reflected accurately in the SOM and shortens learning time. Principal component analysis, multidimensional scaling, or the like can be used as the unsupervised multivariate analysis method.

In step (c), the input vectors can be classified to the connection weight vectors by, for example, a classification method based on a similarity measure selected from distance, inner product, and direction cosine. An example of the distance is the Euclidean distance.
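For reference, the three similarity measures named above can be written in Python as follows. The helper `classify`, which assigns an input vector to the connection weight vector at the smallest Euclidean distance, is a hypothetical function added for illustration; with the inner product or direction cosine the largest value, rather than the smallest, would be selected.

```python
from math import sqrt


def euclidean_distance(u, v):
    """Smaller means more similar."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def inner_product(u, v):
    """Larger means more similar (sensitive to vector length)."""
    return sum(a * b for a, b in zip(u, v))


def direction_cosine(u, v):
    """Cosine of the angle between u and v; ignores vector length."""
    return inner_product(u, v) / (sqrt(inner_product(u, u)) *
                                  sqrt(inner_product(v, v)))


def classify(x, weights):
    """Return the key of the weight vector closest to x (Euclidean)."""
    return min(weights, key=lambda k: euclidean_distance(weights[k], x))
```

The direction cosine is undefined for a zero vector, so this sketch assumes that every vector has at least one nonzero element.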

In step (d), a batch learning algorithm can also be used for the process of updating each connection weight vector so that it has a structure similar to the input vectors classified to it and the input vectors classified to its neighborhood.

The processing of each of the above steps, in particular the processing that uses the batch learning algorithm, is preferably carried out on a parallel computer. This shortens the computation time.

Each step of the improved SOM creation method using the SOM creation unit 10 is described in detail below.

[Step (a)]
First, the oligonucleotide appearance frequency data for a plurality of base sequences stored in the oligonucleotide appearance frequency data storage unit 2 are input to the SOM creation unit 10 as multidimensional input vector data.

The input vector data usually consist of K input vectors $\{x_1, x_2, \ldots, x_k, \ldots, x_K\}$ ($k = 1, 2, \ldots, K$), where K is the number of base sequences. Each input vector $x_k$ is an M-dimensional vector, where M is the number of oligonucleotide categories, and can be expressed by equation (1) below.

$$x_k = \{x_{k1}, x_{k2}, \ldots, x_{kM}\} \tag{1}$$

(Here, $x_{k1}, x_{k2}, \ldots, x_{kM}$ are the oligonucleotide appearance frequencies.)

For example, when the oligonucleotide appearance frequencies are the appearance frequencies of the 64 trinucleotides and a plurality of microorganisms are to be classified based on the trinucleotide usage of the full-length base sequences of K kinds of DNA, the trinucleotide usage of each of the K full-length base sequences derived from these microorganisms is quantified in 64 dimensions, and the resulting 64-dimensional data are set as the input vectors.
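The 64-dimensional trinucleotide input vector described above can be illustrated with a minimal sketch. The function name, the alphabetical ordering of the 64 trinucleotides, the one-base sliding step, and the skipping of windows that contain symbols other than A, C, G, and T are all assumptions made for illustration; the source does not specify them.

```python
from itertools import product

# All 64 trinucleotides in a fixed (alphabetical) order; the ordering is an
# illustrative choice, not one specified in the text.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_frequencies(sequence):
    """Return a 64-dimensional frequency vector x_k for one DNA sequence."""
    seq = sequence.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    # Slide a window of width 3 along the sequence, one base at a time.
    for start in range(len(seq) - 2):
        word = seq[start:start + 3]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    return [counts[word] / total for word in TRINUCLEOTIDES]


# Example: one short sequence gives one input vector with M = 64 elements.
x_k = trinucleotide_frequencies("ATGCGATGCA")
```

Applying the function to each of the K sequences gives the K input vectors of equation (1).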

When base sequences are classified into a plurality of biological classifications, it is preferable, in order to analyze the characteristics of each biological classification more accurately, to input as input vector data the oligonucleotide appearance frequency data for a sufficient number of DNA base sequences for each biological classification (for example, each species). Usually, several hundred to several tens of thousands of DNA base sequences (full-length sequences or fragment base sequences) may be prepared for each biological classification. The input vectors can usually be set according to the conventional methods described in Non-Patent Document 1, Non-Patent Document 2, and the like.

[Step (b)]
Next, a neural network is constructed on a computer, and the initial connection weight vectors are set. On the output layer of the neural network, output neurons (lattice points) are arranged in a D-dimensional lattice according to the dimension D of the SOM to be created (D is a positive integer; D < M). The initial connection weight vectors of the output neurons (lattice points) are likewise arranged in a D-dimensional lattice.

As in the SOM creation methods described in Non-Patent Document 1, Non-Patent Document 2, and the like, the initial connection weight vectors can be set based on random values. However, when the structure of the input vectors is to be reflected accurately in the SOM, or when the learning time is to be shortened, it is preferable to set the initial connection weight vectors from the data of the K M-dimensional input vectors $\{x_1, x_2, \ldots, x_K\}$ set in step (a), using a multivariate analysis method such as principal component analysis or multidimensional scaling, rather than from random values. When the initial connection weight vectors set in this way form a set of P connection weight vectors $\{W^0_1, W^0_2, \ldots, W^0_P\}$ arranged in a D-dimensional lattice, each connection weight vector can be expressed by equation (2) below.

$$W^0_i = F\{x_1, x_2, \ldots, x_K\} \tag{2}$$

In equation (2), $i = 1, 2, \ldots, P$, and $F\{x_1, x_2, \ldots, x_K\}$ denotes the transformation function from the input vectors $\{x_1, x_2, \ldots, x_K\}$ to the initial connection weight vectors. As specific examples, methods of setting the initial connection weight vectors in a two-dimensional (D = 2) lattice and a three-dimensional (D = 3) lattice are described below. Initial connection weight vectors can be set in a D-dimensional lattice in a corresponding manner.

(1) Method of setting the initial connection weight vectors in a two-dimensional (D = 2) lattice (when creating a two-dimensional SOM)

Principal component analysis is performed on the K M-dimensional input vectors $\{x_1, x_2, \ldots, x_K\}$ to obtain the first and second principal component vectors, which are denoted $b_1$ and $b_2$. Using these two principal component vectors, the principal components $Z_{1k} = b_1 x_k$ and $Z_{2k} = b_2 x_k$ of the K input vectors are obtained ($k = 1, 2, \ldots, K$). The standard deviations of $\{Z_{11}, Z_{12}, \ldots, Z_{1k}, \ldots, Z_{1K}\}$ and $\{Z_{21}, Z_{22}, \ldots, Z_{2k}, \ldots, Z_{2K}\}$ are denoted $\sigma_1$ and $\sigma_2$, respectively.

The mean of the input vectors is calculated and denoted $x_{\mathrm{ave}}$.

A two-dimensional lattice point on the output layer is represented by its coordinates on the two-dimensional plane (output layer), that is, by $ij$ ($i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$), and the connection weight vector $W^0_{ij}$ is placed on the lattice point ($ij$). I and J may be any integers of 3 or more. J is preferably the largest integer smaller than $I \times \sigma_2 / \sigma_1$. The value of I may be set as appropriate according to the number of input vectors; it is usually 50 to 1000, for example 100.
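The rule for choosing J from I and the two standard deviations can be written out as a small worked example. The function name is hypothetical, and clamping the result to a minimum of 3 is an assumption that combines the two statements in the text (J is an integer of 3 or more, and preferably the largest integer smaller than I × σ₂ / σ₁).

```python
import math


def lattice_height(I, sigma1, sigma2):
    """Largest integer strictly smaller than I * sigma2 / sigma1, but at least 3."""
    target = I * sigma2 / sigma1
    # ceil(target) - 1 is strictly below target even when target is an integer.
    J = math.ceil(target) - 1
    return max(J, 3)  # assumed clamp: the text requires J to be 3 or more


# Example: I = 100 and sigma2 / sigma1 = 0.375 give a 100 x 37 lattice.
J = lattice_height(100, 2.0, 0.75)
```

With this choice the lattice is wider along the first principal component than along the second, in proportion to the spread of the data in each direction.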

$W^0_{ij}$ can be defined by equation (3).

(2) Method of setting the initial connection weight vectors in a three-dimensional (D = 3) lattice (when creating a three-dimensional SOM)

In the principal component analysis of (1) above, the third principal component vector is obtained in addition to the first and second principal component vectors, and the first, second, and third principal component vectors are denoted $b_1$, $b_2$, and $b_3$, respectively. Using these three principal component vectors, the principal components $Z_{1k} = b_1 x_k$, $Z_{2k} = b_2 x_k$, and $Z_{3k} = b_3 x_k$ are obtained. The standard deviations of $\{Z_{11}, Z_{12}, \ldots, Z_{1k}, \ldots, Z_{1K}\}$, $\{Z_{21}, Z_{22}, \ldots, Z_{2k}, \ldots, Z_{2K}\}$, and $\{Z_{31}, Z_{32}, \ldots, Z_{3k}, \ldots, Z_{3K}\}$ are denoted $\sigma_1$, $\sigma_2$, and $\sigma_3$, respectively. A three-dimensional lattice point is represented by $ijl$ ($i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $l = 1, 2, \ldots, L$), and the connection weight vector $W^0_{ijl}$ is placed on the lattice point ($ijl$). I, J, and L may be any integers of 3 or more. J is preferably the largest integer smaller than $I \times \sigma_2 / \sigma_1$, and L is preferably the largest integer smaller than $I \times \sigma_3 / \sigma_1$. The value of I may be set as appropriate according to the number of input vectors; it is usually 50 to 1000, for example 100. $W^0_{ijl}$ can be defined by equation (4) below.

[Step (c)]
All input vectors $\{x_1, x_2, \ldots, x_K\}$ are classified into the connection weight vectors.

Specifically, a computer is used to classify every input vector $\{x_1, x_2, \ldots, x_K\}$, by means of a similarity measure (distance, inner product, direction cosine, or the like), into one of the P connection weight vectors $W^t_1, W^t_2, \ldots, W^t_P$ arranged in a D-dimensional lattice after t rounds of learning (correction). Here, t is the number of learning rounds (epochs), that is, the number of times step (d) has been executed before step (c). When learning is performed T times in total (T is the set number of learning rounds), $t = 0, 1, 2, \ldots, T$. The i-th connection weight vector at the t-th epoch (the t-th round of learning, that is, after the t-th execution of step (d)) is denoted $W^t_i$, where $i = 1, 2, \ldots, P$. When t = 0 (the first time step (c) is executed), the connection weight vectors $W^t_1, W^t_2, \ldots, W^t_P$ are the initial connection weight vectors set in step (b). Each input vector can be classified by calculating its Euclidean distance to every connection weight vector $W^t_i$ and assigning it to the connection weight vector with the smallest Euclidean distance. For connection weight vectors arranged on two-dimensional lattice points ($ij$), $W^t_i$ can be written as $W^t_{ij}$.

The input vectors $\{x_1, x_2, \ldots, x_K\}$ can be classified into the connection weight vectors $W^t_i$ in parallel, with each input vector $x_k$ processed independently.
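The classification of step (c) with the Euclidean distance can be sketched as follows. The function names and the list-based data layout are illustrative assumptions; the weights are held in a flat list indexed by i, and squared distances are compared because they give the same nearest vector as the distances themselves.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between input vector x and weight vector w."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def classify(inputs, weights):
    """Return, for each input vector, the index i of its nearest weight vector."""
    assignments = []
    for x in inputs:
        # Each input vector is handled independently of all the others,
        # which is why this loop can be split across processors.
        nearest = min(range(len(weights)),
                      key=lambda i: squared_distance(x, weights[i]))
        assignments.append(nearest)
    return assignments


# Example: three 2-dimensional inputs and two weight vectors.
example = classify([[0.0, 0.1], [0.9, 1.0], [0.1, 0.0]],
                   [[0.0, 0.0], [1.0, 1.0]])
```

Because no weight vector changes during this step, the result does not depend on the order in which the input vectors are processed.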

[Step (d)]
Each neuron vector (connection weight vector) $W^t_i$ is updated so that its structure becomes similar to that of the input vectors ($x_k$) classified into that connection weight vector and the input vectors classified into its neighborhood.

That is, let $S_{i'}$ be the set of input vectors belonging to the lattice point at which a particular connection weight vector $W^t_{i'}$ is located. From the N vectors $x^t_1(S_{i'}), x^t_2(S_{i'}), \ldots, x^t_N(S_{i'})$ belonging to $S_{i'}$ and from $W^t_{i'}$, a new connection weight vector $W^{t+1}_{i'}$ reflecting the structure of the input vectors belonging to $S_{i'}$ is obtained by the function G of equation (5) below, thereby updating the neuron vector $W^t_{i'}$ ($i' = 1, 2, \ldots, P$).

$$W^{t+1}_{i'} = G\left(W^t_{i'}, x^t_1(S_{i'}), x^t_2(S_{i'}), \ldots, x^t_N(S_{i'})\right) \tag{5}$$

As a specific example, the updating of the connection weight vectors $W^t_{ij}$ set in a two-dimensional lattice is described below. Connection weight vectors set in other D-dimensional lattices can be updated in the same way.

Let $S_{ij}$ be the set of input vectors $x_k$ that belong to the connection weight vector $W^t_{ij}$ arranged in the two-dimensional lattice or to lattice points in the neighborhood of the lattice point at which $W^t_{ij}$ is located. From the $N_{ij}$ input vectors $x^t_1(S_{ij}), x^t_2(S_{ij}), \ldots, x^t_{N_{ij}}(S_{ij})$ belonging to $S_{ij}$ and from $W^t_{ij}$, a new connection weight vector $W^{t+1}_{ij}$ reflecting the structure of the input vectors belonging to $S_{ij}$ is obtained by equation (6) below, whereby the connection weight vector $W^t_{ij}$ is updated.

Here, $N_{ij}$ is the total number of input vectors classified into $S_{ij}$.

$\alpha(t)$ is the learning coefficient ($0 < \alpha(t) < 1$) for the t-th epoch when the set number of learning rounds is T, and a monotonically decreasing function is used for it. More preferably, it can be obtained by equation (7) below.

The set number of learning rounds T may be set as appropriate according to the number of input vectors. T is usually 10 to 1000, for example 100.

More preferably, the neighborhood set $S_{ij}$ is the set of input vectors classified into lattice points $i'j'$ that satisfy $i - \beta(t) \le i' \le i + \beta(t)$ and $j - \beta(t) \le j' \le j + \beta(t)$. $\beta(t)$ is a number that determines the neighborhood and is obtained by, for example, equation (8).

$$\beta(t) = \max\{0, 25 - t\} \tag{8}$$

The connection weight vectors $\{W^t_1, W^t_2, \ldots, W^t_P\}$ can be updated in parallel, with each connection weight vector $W^t_i$ processed independently.
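The batch update of step (d) on a two-dimensional lattice can be sketched as below. Only the neighborhood radius β(t) follows equation (8). Equations (6) and (7) are not reproduced in this text, so the update rule (moving each weight vector toward the mean of its neighborhood set S_ij by a factor α(t)) and the linearly decaying α(t) with its constants are assumptions made for illustration. Lattice indices are 0-based here, and all function names are hypothetical.

```python
def beta(t):
    """Neighbourhood radius from equation (8)."""
    return max(0, 25 - t)


def alpha(t, T, alpha0=0.6, alpha_min=0.01):
    """ASSUMED learning coefficient: linear decay, kept inside (0, 1)."""
    return max(alpha_min, alpha0 * (1 - t / T))


def batch_update(weights, inputs, assignments, t, T):
    """Return updated weights; weights[i][j] is a list, assignments[k] = (i, j)."""
    I, J = len(weights), len(weights[0])
    M = len(weights[0][0])
    radius, rate = beta(t), alpha(t, T)
    new_weights = [[list(w) for w in row] for row in weights]
    for i in range(I):
        for j in range(J):
            # S_ij: inputs classified into lattice points within the square
            # neighbourhood |i' - i| <= beta(t) and |j' - j| <= beta(t).
            members = [x for x, (a, b) in zip(inputs, assignments)
                       if abs(a - i) <= radius and abs(b - j) <= radius]
            if not members:
                continue  # nothing classified nearby: leave the vector unchanged
            mean = [sum(x[m] for x in members) / len(members) for m in range(M)]
            # ASSUMED form of the update: move toward the neighbourhood mean.
            new_weights[i][j] = [w + rate * (mu - w)
                                 for w, mu in zip(weights[i][j], mean)]
    return new_weights
```

Each lattice point reads only the old weights and the fixed assignments from step (c), so the two nested loops can be distributed over processors without changing the result.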

[Step (e)]
It is determined whether the number of learning rounds t (the number of times step (d) has been repeated) has reached the set number of learning rounds T. If it has not, the process returns to step (c), and steps (c) and (d) are performed again. In other words, learning proceeds by repeating steps (c) and (d) until t reaches T. When t reaches T, the process moves on to step (f).

[Step (f)]
After learning is complete, the input vectors $x_k$ are classified by computer into the connection weight vectors $W^t_i$ according to the method of step (c), and the result is output. The input vectors $x_k$ are classified on the basis of a classification criterion, represented by the connection weight vectors, that reflects the structure of the input vectors. That is, when a plurality of input vectors are classified into the same connection weight vector, the vector structures of those input vectors are very similar. The input vectors $\{x_1, x_2, \ldots, x_K\}$ can be classified in parallel, with each input vector $x_k$ processed independently.

An SOM is created according to the classification result output in the above step. The created SOM can be visualized by outputting it (displaying, printing, or the like) from the output device 12. The SOM can be created and displayed according to the methods described in Non-Patent Document 1, Non-Patent Document 2, and the like. For example, the classification result for input vectors obtained with connection weight vectors set on two-dimensional lattice points can be displayed as a two-dimensional SOM. Specifically, an appropriate label is assigned to each lattice point based on the attributes of the input vectors belonging to that lattice point, and the labels are then shown on the two-dimensional lattice by screen display, printing, or the like, to give the SOM. The total number of input vectors belonging to each lattice point can also be shown on the two-dimensional lattice in the same way and displayed as an SOM.

A computer with a high calculation speed is preferable for each of the above steps. Steps (a) to (f) need not be performed on the same computer; the result obtained in one step may be output to another computer, which then performs the next step. The parallelizable steps (steps (c) to (f)) can also be processed in parallel on a computer having multiple CPUs or on a plurality of computers. The conventional SOM creation method uses a sequential learning algorithm and therefore cannot be processed in parallel, whereas the improved SOM creation method uses a batch learning algorithm and can. Parallel processing makes it possible to greatly shorten the computation time required to classify the input vectors. Specifically, let $T_1, T_2, T_3, T_4, T_5$, and $T_6$ be the times required to process the six steps on a single processor. With parallel processing on C processors, the times required for the respective steps are ideally $T_1, T_2, T_3/C, T_4/C, T_5/C$, and $T_6$, so that overall the computation time is shortened by

$$T_1 + T_2 + T_3 + T_4 + T_5 + T_6 - \{T_1 + T_2 + (T_3 + T_4 + T_5)/C + T_6\} = (1 - 1/C)(T_3 + T_4 + T_5).$$
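The ideal saving from parallel processing can be checked numerically with a short sketch. The function name and the step times in the example are made up for illustration; only the formula itself comes from the text.

```python
def parallel_time_saved(step_times, processors):
    """step_times = (T1, ..., T6); returns (serial, parallel, saved)."""
    T1, T2, T3, T4, T5, T6 = step_times
    serial = T1 + T2 + T3 + T4 + T5 + T6
    # Only the third to fifth step times are divided by the processor count,
    # following the formula in the text.
    parallel = T1 + T2 + (T3 + T4 + T5) / processors + T6
    return serial, parallel, serial - parallel


# Example with hypothetical step times (arbitrary units) and C = 8 processors.
serial, parallel, saved = parallel_time_saved((1, 2, 40, 40, 8, 1), 8)
```

In this example the saving equals (1 - 1/8) × 88 = 77 units, so the benefit grows with the share of time spent in the parallelized steps and approaches that share as C increases.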

Examples of the present invention are described below.

[Example 1]
(SOM creation method)
In Examples 1 to 3, two-dimensional and three-dimensional SOMs were created with the system of the above embodiment by the following method, in accordance with the improved SOM creation method described in Patent Document 1. The initial connection weight vectors were defined by principal component analysis (PCA) rather than by random values. This is based on the fact that PCA can classify gene sequences into known biological classifications when a relatively small number of sequences is analyzed. The connection weight vectors ($W_{ij}$) were arranged in a two-dimensional lattice indexed by $i$ (= 0, 1, ..., I − 1) and $j$ (= 0, 1, ..., J − 1). I was set to 250. J was defined as the nearest integer greater than $(\sigma_2 / \sigma_1) \times 250$ (where $\sigma_1$ and $\sigma_2$ are the standard deviations of the first and second principal components, respectively). The connection weight vectors were set and updated by the method described in Non-Patent Document 4. The batch-learning SOM program "XanaMine" used in this example was obtained from Xanagen Corporation.

The sequence data to be analyzed were obtained from GenBank (http://www.ncbi.nlm.nih.gov/Genbank/). When the number of undetermined nucleotides (N) in a fragment base sequence cut out from a full-length sequence exceeded 10% of the fragment length (the window size; 10 kb or 100 kb), the fragment base sequence was excluded from the analysis. When the number of undetermined nucleotides (N) in a fragment base sequence was 10% or less of the window size, the oligonucleotide appearance frequencies were normalized to the length excluding the undetermined nucleotides (N), and the fragment was included in the analysis.
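The window filtering rule above can be sketched as a small generator. The function name, the `step` parameter, and the returned tuple layout are illustrative assumptions; the 10% threshold and the use of the N-free length for normalization come from the text.

```python
def usable_windows(sequence, window=10_000, step=None, max_n_fraction=0.10):
    """Yield (start, fragment, effective_length) for windows with at most 10% N."""
    seq = sequence.upper()
    step = window if step is None else step  # default: non-overlapping windows
    for start in range(0, len(seq) - window + 1, step):
        fragment = seq[start:start + window]
        n_count = fragment.count("N")
        if n_count / window > max_n_fraction:
            continue  # more than 10% undetermined nucleotides: exclude
        # effective_length is the length excluding N, to be used when
        # normalizing the oligonucleotide counts of this fragment.
        yield start, fragment, window - n_count


# Toy example with a window of 10 bases instead of 10 kb.
toy = "ACGTACGTAC" + "NNACGTACGT" + "NACGTACGTA"
kept = list(usable_windows(toy, window=10))
```

For the 100 kb fragments shifted by 10 kb described later in this example, the same function could be called with `window=100_000, step=10_000`.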

(SOM for 13 eukaryotic genomes)
To investigate the clustering ability of SOM for eukaryotic sequences, the present inventors first analyzed the appearance frequencies of trinucleotides, tetranucleotides, and pentanucleotides in 300,000 non-overlapping 10 kb fragment base sequences and in about 300,000 100 kb fragment base sequences shifted from one another by 10 kb, all cut out from the genome sequences of 13 eukaryotes (3 Gb in total). These genome sequences comprise those of human (Homo sapiens), pufferfish (Fugu rubripes), zebrafish (Danio rerio), rice (Oryza sativa), Arabidopsis (Arabidopsis thaliana), barrel medic (Medicago truncatula), fruit fly (Drosophila melanogaster), nematode (Caenorhabditis elegans), cellular slime mold (Dictyostelium discoideum), malaria parasite (Plasmodium falciparum), Entamoeba histolytica, fission yeast (Schizosaccharomyces pombe), and baker's yeast (Saccharomyces cerevisiae). For human, sequences from chromosomes 2, 6, 7, 13, 14, 20, 21, 22, X, and Y, for which nearly complete sequence data were available, were analyzed.

For these fragment base sequences, improved SOMs adapted to genome informatics were then created by the method described in Non-Patent Document 4. First, the oligonucleotide appearance frequencies (trinucleotide, tetranucleotide, and pentanucleotide appearance frequencies) in the 300,000 10 kb fragment base sequences were analyzed by principal component analysis, and the first and second principal components were used to set the initial connection weight vectors arranged as a two-dimensional lattice. The set number of learning rounds was 80. After 80 learning cycles, the oligonucleotide appearance frequencies of the 10 kb fragment base sequences could be represented by the final connection weight vectors arranged in the two-dimensional lattice, and the SOM was thereby obtained. The resulting SOM showed clear species-specific separation: the sequences were clustered mainly into species-specific regions.

The resulting SOM was output in color (color printing, color display, or the like), with lattice points containing sequences from a single species shown in chromatic colors and lattice points containing sequences from more than one species shown in black (not shown). Comparing the classification given by the initial vectors set from the first and second principal components (principal component analysis) with the classification achieved in the SOM of trinucleotide appearance frequencies in the 10 kb fragment base sequences (hereinafter abbreviated as "10 kb Tri-SOM") showed clearly that sequences from a single species were clustered far more tightly in the 10 kb Tri-SOM. Species clustering was even stronger in the SOM of tetranucleotide appearance frequencies in the 10 kb fragment base sequences (hereinafter abbreviated as "10 kb Tetra-SOM") and in the pentanucleotide SOM (hereinafter abbreviated as "10 kb Penta-SOM"). For example, 94%, 97%, and 98% of the human sequences were classified into the human region in the 10 kb Tri-SOM, 10 kb Tetra-SOM, and 10 kb Penta-SOM, respectively.

In DNA databases, only one strand of each pair of complementary sequences is registered. In view of the overall characteristics of oligonucleotide appearance frequencies in a genome, differences in appearance frequency between complementary oligonucleotides (for example, between AAAC and GTTT) are not important. It is also worth noting that creating SOMs for tetranucleotide and pentanucleotide appearance frequencies requires long computation times.

Therefore, in an attempt to reduce the computation time, SOMs were created using the appearance frequencies of degenerate sets, in which the appearance frequencies of each pair of complementary oligonucleotides are added together: an SOM of tetranucleotide appearance frequencies in the 10 kb fragment base sequences and an SOM of pentanucleotide appearance frequencies in the 100 kb fragment base sequences (hereinafter abbreviated as "100 kb DegePenta-SOM"). This roughly halved the computation time with no noticeable loss of clustering ability.
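The construction of a degenerate set can be illustrated with a short sketch. It assumes that "complementary" means the reverse complement, which is consistent with the AAAC and GTTT pair mentioned in the text; the function names and the choice of the alphabetically smaller word as the representative of each pair are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of an oligonucleotide, e.g. AAAC -> GTTT."""
    return word.translate(COMPLEMENT)[::-1]


def degenerate_counts(counts):
    """Add together the counts of each oligonucleotide and its reverse complement."""
    merged = {}
    for word, n in counts.items():
        # Use the alphabetically smaller member of the pair as the key.
        key = min(word, reverse_complement(word))
        merged[key] = merged.get(key, 0) + n
    return merged


# Example: merge a table holding every tetranucleotide once.
tetra_counts = {"".join(p): 1 for p in product("ACGT", repeat=4)}
merged = degenerate_counts(tetra_counts)
```

Merging reduces the 256 tetranucleotides to 136 categories (16 of them are their own reverse complement) and the 1024 pentanucleotides to 512, so the input vectors become roughly half as long.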

The GC content (G+C%) obtained from the connection weight vector at each lattice point in the 10 kb Tri-SOM was reflected along the horizontal axis of the map, increasing from left to right; sequences with high GC content were located on the right side of the 10 kb Tri-SOM. Similar results were obtained for the 10 kb Tetra-SOM and the 10 kb Penta-SOM. Sequences with the same GC content were separated by complex combinations of oligonucleotide appearance frequencies, resulting in species-specific separation. Intraspecies separation was also evident in the 10 kb SOMs. For example, human sequences were separated into two major regions in the 10 kb Tri-SOM and the 10 kb Tetra-SOM. In the 10 kb Penta-SOM, however, human sequences were classified into a single contiguous region. This indicates that, despite the wide variation among human 10 kb sequences, the SOM recognizes common features of pentanucleotide appearance frequency in human sequences.

Next, SOMs were created for the appearance frequencies of trinucleotides, tetranucleotides, and pentanucleotides in the approximately 300,000 100 kb fragment base sequences (abbreviated as "100 kb Tri-SOM", "100 kb Tetra-SOM", and "100 kb Penta-SOM", respectively, and referred to collectively as "100 kb SOMs").

The resulting SOMs were output in color (color printing, color display, or the like), with lattice points containing sequences from a single species shown in a different chromatic color for each species and lattice points containing sequences from more than one species shown in black. FIG. 2 shows this color output converted to black-and-white images. FIG. 2(a), FIG. 2(b), FIG. 2(c), and FIG. 2(d) show the 100 kb Tri-SOM, 100 kb Tetra-SOM, 100 kb Penta-SOM, and 100 kb DegePenta-SOM, respectively. In FIG. 2, C denotes the nematode region, A the Arabidopsis region, R the rice region, D the Drosophila region, F the pufferfish region, Z the zebrafish region, and H the human region.

In the 100 kb SOMs, interspecies separation (as opposed to intraspecies separation) was more pronounced than in the SOMs for the 10 kb fragment base sequences. In the 100 kb Tetra-SOM and the 100 kb Penta-SOM, every species had a single major region (FIG. 2(b) and FIG. 2(c)). Furthermore, the species regions were surrounded by contiguous white lattice points that contained no genomic sequences. The vectors of species-specific lattice points differ between regions even close to the region borders, so the species borders can be drawn largely automatically on the basis of the contiguous white lattice points. Closer inspection of the 100 kb SOMs revealed several small regions consisting of a small number of sequences with particular characteristics. For example, the small Arabidopsis region (indicated by A in FIG. 2) located between the rice region (indicated by R in FIG. 2) and the pufferfish region (indicated by F in FIG. 2) consists mainly of sequences derived from centromeric and subcentromeric regions. Analysis of intraspecies separation can thus provide in-depth information on the detailed structure of individual genomes.

[Example 2]
The SOM recognized species-specific combinations of oligonucleotide frequencies, which are a representative characteristic of each species' genome, and made it possible to identify characteristic frequency patterns. The frequency (observed value) of each oligonucleotide at each lattice point of the 100 kb SOMs was calculated and normalized by the frequency expected for that oligonucleotide from the mononucleotide composition at the same lattice point. The normalized frequency at each lattice point (the observed/expected ratio) was then drawn, separately for each oligonucleotide, on a two-dimensional lattice map of the same layout as the SOM (an occurrence frequency map). When the map is output in color, for example, the normalized frequency (observed/expected ratio R) at each lattice point can be expressed by the color of that lattice point.
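The normalization described above can be illustrated with a short Python sketch. It assumes that the vector at a lattice point is available as a table of oligonucleotide frequencies, that the mononucleotide composition is derived from that same table, and that the expected frequency is the product of the frequencies of the constituent bases. The function names are hypothetical and are not part of the described system.

```python
def mononucleotide_freqs(oligo_freqs):
    """Derive A/C/G/T frequencies from an oligonucleotide frequency table.

    Assumption: the base composition is taken from the table itself.
    """
    totals = {base: 0.0 for base in "ACGT"}
    for oligo, freq in oligo_freqs.items():
        for base in oligo:
            totals[base] += freq
    grand_total = sum(totals.values())
    return {base: value / grand_total for base, value in totals.items()}


def observed_expected_ratio(oligo_freqs, oligo):
    """Return R = observed / expected for one oligonucleotide.

    Assumption: expected = product of the frequencies of its bases.
    """
    total = sum(oligo_freqs.values())
    observed = oligo_freqs.get(oligo, 0.0) / total
    mono = mononucleotide_freqs(oligo_freqs)
    expected = 1.0
    for base in oligo:
        expected *= mono[base]
    return observed / expected
```

With a table in which every tetranucleotide is equally frequent, the ratio is 1 for every tetranucleotide; raising the frequency of one tetranucleotide pushes its ratio above 1.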

As one example, lattice points where the oligonucleotide occurs in excess of expectation (over-represented), that is, where the observed/expected ratio is sufficiently greater than 1, are shown in red. Lattice points where the oligonucleotide occurs less often than expected (under-represented), that is, where the ratio is sufficiently smaller than 1, are shown in blue. Lattice points where the oligonucleotide occurs at about the expected level, that is, where the ratio is close to 1, are shown in white. The color of a lattice point is made deeper the further the observed/expected ratio departs from 1.
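One possible realization of this red/white/blue coloring is sketched below in Python. The text does not specify the scale, so the logarithmic scale and the saturation at a two-fold deviation (`max_log2=1.0`) are assumptions made purely for illustration, as is the function name.

```python
import math


def ratio_to_rgb(ratio, max_log2=1.0):
    """Map an observed/expected ratio to an (r, g, b) tuple in 0..255.

    Assumptions: log2 scale, full saturation at 2**max_log2-fold deviation.
    """
    if ratio <= 0:
        return (0, 0, 255)  # treat a zero ratio as fully under-represented
    # Signed strength in [-1, 1]: positive means over-represented.
    strength = max(-1.0, min(1.0, math.log2(ratio) / max_log2))
    fade = int(round(255 * (1.0 - abs(strength))))
    if strength >= 0:
        return (255, fade, fade)  # white fading to red
    return (fade, fade, 255)      # white fading to blue
```

A ratio of 1 gives white, and the color deepens toward red or blue as the ratio moves away from 1 in either direction.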

FIGS. 3 and 4 show representative examples of occurrence frequency maps output in color in this way and then converted to black-and-white images. FIG. 3(b) is the occurrence frequency map for CAGT and FIG. 3(c) is that for AATT. For reference, FIG. 3(a) shows the 100 kb Tetra-SOM. FIG. 3 also includes a scale relating the observed/expected ratio R to the density of black. FIG. 4(b) is the occurrence frequency map for the combined frequency of ACAGG and CCTGT, FIG. 4(c) is that for CGACG and CGTCG combined, and FIG. 4(d) is that for CGAAA and TTTCG combined. For reference, FIG. 4(a) shows the 100 kb DegePenta-SOM. FIG. 4 likewise includes a scale relating R to the density of black.

In an occurrence frequency map, the normalized frequency (observed/expected ratio) at each lattice point may also be expressed in other ways, for example as the height of a bar in a three-dimensional bar chart built on the two-dimensional lattice map.

Normalizing the oligonucleotide frequencies in this way made it possible to examine the frequency at each lattice point separately from differences in mononucleotide composition. For example, differences between sequences in the frequency of oligonucleotides containing CG and GC can be detected sensitively, independently of differences in GC content between those sequences. In the occurrence frequency maps for various tetranucleotides and pentanucleotides, the borders between over-represented and under-represented regions coincided almost exactly with species borders. FIG. 3 shows characteristic examples related to species separation. AATT was over-represented in rice, Drosophila melanogaster, and the nematode, under-represented in pufferfish and zebrafish, and present at a moderate level in human and Arabidopsis. CAGT was over-represented in all three vertebrates (human, pufferfish, and zebrafish) but under-represented in both plants (rice and Arabidopsis).

Occurrence frequency maps were also created from the summed frequencies of each complementary pair of tetranucleotides. These maps were almost identical to the maps based on the frequency of one member of the pair alone. For this reason, only maps based on one tetranucleotide of each complementary pair are shown.

For pentanucleotides, examples of occurrence frequency maps for the DegePenta-SOM are shown (FIG. 4). In all three vertebrates (human, pufferfish, and zebrafish), (ACAGG + CCTGT) was over-represented and (CGACG + CGTCG) was under-represented. (CGAAA + TTTCG) was over-represented in Drosophila melanogaster and the nematode and present at a moderate level in fission yeast. The SOM achieves classification by species by using complex combinations of many oligonucleotides to separate sequences.

[Example 3]
(Intraspecies differences in the human genome sequence)
To clarify the biological meaning of the characteristic tetranucleotides in the Tetra-SOM for bacterial genomes, the present inventors examined how the frequencies of palindromic tetranucleotides correlate with the restriction enzyme systems of bacteria that produce four-base restriction enzymes. Tetranucleotides that are restriction enzyme recognition (cleavage) sites are characteristically under-represented in the genomes of bacteria that produce four-base restriction enzymes. The SOM correctly recognized this biological property of the bacterial genomes. The SOM was able to classify genomic sequences into known biological categories (species, in the case of FIG. 2) without using any information other than oligonucleotide frequencies. Because its classification power is very high, the SOM should become a powerful informatics tool for extracting a wide variety of genomic information. To examine the ability of the SOM to resolve intraspecies differences within a single genome, the present inventors analyzed 2.8 Gb of high-quality draft sequence from the human genome. From this 2.8 Gb human sequence, Tetra-SOMs and Penta-SOMs were created for non-overlapping 10 kb sequences and for 100 kb sequences offset from one another in 10 kb steps, and were output as three-dimensional images in which the number of sequences classified into each lattice point is shown as the height of a bar.
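The two windowing schemes mentioned above (non-overlapping 10 kb fragments, and 100 kb fragments advanced in 10 kb steps) and the counting of oligonucleotides within each fragment can be sketched in Python as follows. Dropping a trailing partial window and skipping oligonucleotides that contain characters other than A, C, G, and T are assumptions of this sketch; the function names are hypothetical.

```python
_BASES = set("ACGT")


def windows(sequence, size, step):
    """Yield (start, fragment) for every full-length window.

    Assumption: a trailing fragment shorter than `size` is discarded.
    """
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def oligo_counts(fragment, k):
    """Count overlapping k-base oligonucleotides in one fragment.

    Assumption: oligonucleotides containing non-ACGT characters are skipped.
    """
    counts = {}
    for i in range(len(fragment) - k + 1):
        oligo = fragment[i:i + k]
        if set(oligo) <= _BASES:
            counts[oligo] = counts.get(oligo, 0) + 1
    return counts
```

Under these assumptions, `windows(seq, 10_000, 10_000)` would give the non-overlapping 10 kb fragments and `windows(seq, 100_000, 10_000)` the 100 kb fragments offset by 10 kb.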

The present inventors normalized the frequency of each individual oligonucleotide (tetranucleotides and so on) at each lattice point of the 10 kb SOM by the mononucleotide composition of that lattice point. The normalized frequency is the ratio of the oligonucleotide frequency on the SOM (the observed value) to the frequency expected from the mononucleotide composition. Occurrence frequency maps were created from the normalized frequencies in the same way as described above. The maps for individual tetranucleotides included one in which the tetranucleotide is under-represented over the whole map (AACG) and one in which it is over-represented over the whole map (TTCC). The inventors then focused on tetranucleotides that are prominent in a restricted part of the 10 kb SOM (a small over-represented zone surrounded by a broad under-represented zone in the occurrence frequency map) but under-represented over the whole of the 100 kb SOM. One type of such example corresponded to several tetranucleotides containing one CG and two C or G bases; similar patterns of local over-representation were observed for this type (type A). These tetranucleotides correspond to components of the GC box, a well-characterized transcription signal. TTAA, ATAA, and ATTA, which are components of the TATA box, showed patterns similar to those of the type A tetranucleotides although their sequences are entirely different (type B). Other characteristic patterns were also observed, some of which resembled one another.
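The selection criterion described above (prominent in a small part of the 10 kb SOM, under-represented across the 100 kb SOM) could be screened for automatically along the following lines. This Python sketch is only an illustration: the thresholds `low`, `high`, `broad`, and `max_local` are arbitrary assumptions, the function names are hypothetical, and the sketch ignores the spatial condition that the over-represented zone be surrounded by an under-represented zone.

```python
def fraction(values, predicate):
    """Fraction of values for which predicate(value) is true."""
    values = list(values)
    return sum(1 for v in values if predicate(v)) / len(values)


def locally_enriched(ratios_10kb, ratios_100kb,
                     low=0.8, high=1.2, broad=0.9, max_local=0.2):
    """Flag an oligonucleotide from its observed/expected ratios.

    ratios_10kb / ratios_100kb: ratios over all lattice points of each SOM.
    Returns True when the oligonucleotide is under-represented on most of
    the 100 kb SOM but over-represented on a small, non-empty part of the
    10 kb SOM. All thresholds are illustrative assumptions.
    """
    under_100 = fraction(ratios_100kb, lambda r: r < low)
    over_10 = fraction(ratios_10kb, lambda r: r > high)
    return under_100 >= broad and 0 < over_10 <= max_local
```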

To investigate the biological meaning of the local over-representation patterns of tetranucleotides, the normalized frequency of each tetranucleotide at the lattice points (representing the normalized frequency of that tetranucleotide in the 10 kb sequences classified into each lattice point) was plotted along the sequence of chromosome 21q. The results are shown in FIG. 5. In FIG. 5, the frequency of each tetranucleotide is displayed with the same colors as in FIGS. 3(b) and 3(c). FIG. 5 also shows the positions of genes.

The distribution patterns of the type A and type B tetranucleotides and of several other tetranucleotides were plotted along the chromosome 21q sequence. In gene-rich regions, type A and type B showed similar patterns with marked occurrence (red and white). This observation is consistent with the view that these nucleotides are core sequences of the TATA box and the GC box, which are typical transcriptional regulatory signals. In various parts of the chromosome, the distribution patterns of GATC and AGTA also resembled those of type A and type B. All of these tetranucleotides also occurred at high levels in the gene-rich regions of chromosomes 20 and 22 (not shown). This result suggests that these tetranucleotides are likely to be signal sequences, or components of signal sequences, related to the regulation of gene expression or to gene function. The frequencies of dinucleotides, trinucleotides, and tetranucleotides within a single genome have been found to be highly correlated. Because of this basic genomic property, the inventors were able to identify characteristic tetranucleotides easily: tetranucleotides that were under-represented in many regions of the Tetra-SOM although their constituent trinucleotides were, if anything, over-represented in the Tri-SOM. Since the under-representation of such a tetranucleotide is not attributable to under-representation of its constituent trinucleotides, it is thought to reflect a biological meaning of the tetranucleotide itself, as observed for restriction enzyme recognition sequences in bacterial genomes. ATTG belongs to this type, and its distribution was concentrated in gene-poor regions (FIG. 5).

For characteristic oligonucleotides in a genome, both over-represented and under-represented, a variety of factors are thought to be responsible, including the three-dimensional structure of DNA, context-dependent mutation, and DNA modification. For over-represented sequences, preferences for sequences recognized by abundant DNA-binding proteins must also be taken into account. Because the interspecies and intraspecies separation obtained with the SOM is very clear, the SOM should provide basic guidance for understanding the detailed molecular mechanisms that have shaped the sequence characteristics of individual genomes during evolution.

(Characterization of gene signal sequences)
A wide variety of oligonucleotide sequences function as gene signal sequences (for example, regulatory signal sequences for gene expression). The finding in FIG. 2 and elsewhere that various tetranucleotides (for example, types A and B) have characteristics matching transcription signal sequences indicates that the SOM may serve as a new tool for characterizing gene signal sequences and predicting them by computer (in silico prediction). Gene signal sequences such as transcription signal sequences are typically longer than tetranucleotides, so testing this possibility requires the analysis of longer oligonucleotides. Pentanucleotides under-represented in the human genome were therefore examined in the same way as the tetranucleotides. This revealed examples whose distribution patterns differed between gene-rich and gene-poor regions. These local, specific patterns were in many cases clearer than the tetranucleotide patterns, which suggests that many of the tetranucleotides above are components of longer signal sequences (for example, the GC box).

The mechanism of signal sequence recognition and the level at which each oligonucleotide sequence occurs across the genome are thought to be related. If an oligonucleotide sequence has a marked activity, such as high affinity for a particular target protein, its occurrence will deviate from random and vary markedly across the genome. For example, an oligonucleotide sequence with strong binding activity for a transcription factor would be under-represented, relative to random occurrence, over many regions of the genome but more frequent within gene regulatory regions. Such a signal sequence is under-represented over the whole of a SOM with a wide window (for example, a 100 kb SOM) but occurs at higher frequency in restricted parts of a SOM with a narrower window (for example, a 10 kb SOM). In the opposite case, where an oligonucleotide sequence with binding activity for a known transcription factor occurs across the genome at a random frequency or higher, combination with other neighboring signal components should be an absolute requirement for the sequence to function as a regulatory signal. The genome-wide frequency of oligonucleotide sequences with factor-binding activity will provide basic information for understanding the respective role of each oligonucleotide within a combinatorial unit for transcriptional regulation (its contribution to discrimination in determining specificity). SOM data on the levels of oligonucleotides with factor-binding activity will make it possible to classify the different behaviors of different classes of signal sequences, which can be visualized on SOMs with different window sizes. By referring to the behavior of classified signal sequences collected from well-studied organisms, it should be possible to develop computational (in silico) methods useful for predicting signal sequences in genomes that have been sequenced but for which little additional experimental data is available.

In preparation for this approach, pentanucleotides known to be capable of binding transcription factors were characterized as follows. First, the TRANSFAC database (http://transfac.gbf.de/TRANSFAC) was searched for pentanucleotides known as binding sequences for human transcription factors. The database contained a total of 22 pentanucleotides reported in the literature as factor-binding sequences. However, a detailed check of these sequences against the MATRIX table in the database showed that many of them were parts of longer signal sequences. Only four pentanucleotides could be selected as primary determinant sequences for transcription factor binding: the NF-Y binding site CCAAT, the GATA-1 factor binding site GATAA, the KLF binding site CACCC, and the NF-1 binding site TGGCA.

Frequency distribution patterns (occurrence frequency maps) of these four pentanucleotides were then created on the 10 kb Penta-SOM and the 100 kb Penta-SOM of the 2.8 Gb human sequence and compared with the distribution patterns of the core sequences of the GC box. GATAA was under-represented in many regions of both the 10 kb SOM and the 100 kb SOM. CCAAT was under-represented in many regions of the 100 kb SOM but occurred at appreciable levels in restricted parts of the 10 kb SOM. A detailed comparison of the distribution of CCAAT with that of the core components of the GC box showed that the CCAAT-rich zones of the 10 kb SOM are clearly distinct from the zones rich in GC box sequences. This is consistent with the finding (FIG. 5) that CCAAT, unlike the GC box sequences, was more prevalent in the gene-poor regions of chromosome 21q than in its gene-rich regions. CACCC and TGGCA were over-represented across many regions of both the 10 kb SOM and the 100 kb SOM. Such frequently occurring sequences either require additional specific sequences for normal regulation of their target genes or may correspond to binding sequences for abundant DNA-binding factors. Systematic classification of known signal sequences according to their genome-wide occurrence levels has become possible since genomes were sequenced, and it offers a new way to analyze the molecular mechanisms underlying accurate signal sequence recognition.

The SOM can visualize oligonucleotide frequencies in a wide variety of genomes on a single map. Frequency patterns (occurrence frequency maps) of each of the four pentanucleotides above were created on the 10 kb Penta-SOM and the 100 kb Penta-SOM (FIG. 2(c)) for 13 eukaryotes. GATAA was under-represented in all of these eukaryotes except Plasmodium falciparum and Dictyostelium discoideum. CCAAT was under-represented in many regions of all three vertebrates (pufferfish, zebrafish, and human) but over-represented in the two plants (rice and Arabidopsis) and the two invertebrates (the nematode and Drosophila melanogaster). These results can provide basic information for understanding the mechanism of signal sequence recognition in individual species and the evolutionary process that established the signal sequence recognition systems. The SOM reveals, on a two-dimensional map, oligonucleotides whose occurrence deviates characteristically from a random distribution. Furthermore, because genomic sequences with specific characteristics were self-organized on the map, the genomic positions of all such sequences could be plotted along the chromosome (FIG. 5). If known signal sequences from various species with ample experimental data are characterized with the SOM and classified systematically, it should become possible to develop the most useful computational methods for predicting signal sequences in genomes that have been sequenced but for which little other experimental data exists. As the number of such genomes is growing rapidly, the importance of developing such computational prediction methods is growing as well.

[Example 4]
In the following example, the system of the embodiment described above was used to construct two-dimensional and three-dimensional SOMs, in accordance with the improved SOM creation method described in Patent Document 1, from trinucleotide and tetranucleotide frequencies in just over 17,000 non-overlapping 1 kb fragment sequences (segments) and 5 kb fragment sequences obtained from 81 bacterial genomes whose complete sequences are known. As the first step, to obtain the initial weight vectors, the frequencies in the just over 17,000 non-overlapping segments were analyzed by principal component analysis (PCA) as described in Non-Patent Document 4. The number of learning cycles was set to 100. After 100 learning cycles, the oligonucleotide frequencies of the fragment sequences were substantially projected onto the weight vectors of the SOM. Each SOM was output as a color image in which lattice points containing fragment sequences from a single species are shown in a chromatic color and lattice points containing fragment sequences from more than one species are shown in black (not shown). In the SOM created from tetranucleotide frequencies in the 5 kb fragment sequences, the fragment sequences of most species were separated into multiple species-specific, non-overlapping regions.

Analysis of the weight vectors of the individual lattice points showed that, in the SOMs created from dinucleotide, trinucleotide, and tetranucleotide frequencies (abbreviated "Di-SOM", "Tri-SOM", and "Tetra-SOM", respectively), the GC content (G+C%) of each weight vector was reflected mainly along the horizontal axis, increasing from the left of the SOM to the right. Sequences from AT-rich bacteria were distributed on the left of the SOM and sequences from GC-rich bacteria on the right. Importantly, sequences with the same GC content (G+C%) were separated by complex combinations of oligonucleotide frequencies, resulting in species-specific separation. In other words, most of the 10 kb fragment sequences in each genome carry a combination of oligonucleotides that reflects that genome, much like a signature. The SOM can make this signature explicit as a representative weight vector.
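The GC content associated with a weight vector can be computed directly from the oligonucleotide frequencies it holds. The following Python sketch assumes the weight vector is represented as a mapping from oligonucleotide to frequency; this representation and the function name are illustrative assumptions.

```python
def gc_percent(oligo_freqs):
    """Return the G+C% implied by an oligonucleotide frequency table.

    Each oligonucleotide contributes its frequency once per base, so the
    result is the frequency-weighted share of G and C among all bases.
    """
    gc = 0.0
    total = 0.0
    for oligo, freq in oligo_freqs.items():
        gc += freq * sum(1 for base in oligo if base in "GC")
        total += freq * len(oligo)
    return 100.0 * gc / total
```

Evaluating this value for every lattice point of a map would allow the left-to-right increase in G+C% described above to be checked numerically.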

170,000 non-overlapping 1 kb fragment sequences were analyzed in the same way and a SOM was created. Although the degree of separation was somewhat lower, species-specific separation was again observed. This shows that species-specific characteristics (signatures) can be detected even in the major portion of 1 kb fragment sequences.

When classifying the diverse microorganisms that cannot be cultured, assignment to individual species is not what matters at first; phylogenetic classification is what matters. Classification into 12 major phylogenetic groups using two-dimensional and three-dimensional SOMs was therefore tested, and good classification was indeed obtained. Specifically, when SOMs of oligonucleotide frequencies were applied to the classification of about 100 bacterial species, about 90% of the 5 kb bacterial sequences could be roughly classified into the 12 phylogenetic groups, namely Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, Deltaproteobacteria, Archaea, Chlamydia, Firmicutes, Actinobacteria, Fusobacteria, Thermotogae (hyperthermophilic Gram-negative anaerobic rods), and others. The SOMs created above were then used to estimate the phylogeny of hard-to-culture microorganisms obtained from the environment. GenBank contains sequences of environmental microorganisms whose species has not been identified. Of these, 660 non-rDNA sequences, that is, sequences of 1 kb or longer other than ribosomal RNA gene (rDNA) sequences, were classified into phylogenetic groups. Compared with rDNA sequences, non-rDNA sequences are difficult to assign to phylogenetic groups by conventional homology analysis. Of the 660 non-rDNA sequences, 343 were derived from microorganisms collected from the rumen. On the SOM, most of these 343 sequences were classified into the regions of anaerobic microorganisms such as methanogens and sulfur-degrading bacteria, organisms that are highly consistent with life in the rumen environment. This showed that the phylogeny and diversity of microbial communities mixed in complex ways in the environment can be estimated.
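Assigning an unidentified sequence to a phylogenetic group with an already trained SOM amounts to finding the lattice point whose weight vector is closest to the sequence's frequency vector and reading off the group that occupies that lattice point. The Python sketch below assumes Euclidean distance and a simple dictionary representation of the map and its labels; both are assumptions, as are the function names and the "unassigned" fallback.

```python
def nearest_lattice_point(vector, som):
    """Return the key of the lattice point closest to `vector`.

    som: dict mapping (i, j) -> weight vector (sequence of floats).
    Assumption: closeness is measured by squared Euclidean distance.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(som, key=lambda point: squared_distance(vector, som[point]))


def classify(vector, som, labels):
    """Return the group label of the nearest lattice point.

    labels: dict mapping (i, j) -> group name for lattice points that
    belong to a single group; other points yield "unassigned".
    """
    return labels.get(nearest_lattice_point(vector, som), "unassigned")
```

In this sketch a sequence that falls on a lattice point without a single-group label is reported as unassigned rather than forced into a group.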

In many cases, the SOM separated each species into two major regions consisting of approximately equal numbers of lattice points. This within-species separation was related to transcription direction, and its extent depended on the length of the sequences used in the SOM. For genomes in which neighboring genes share the same transcription direction over fairly long ranges, within-species separation was pronounced even in the SOM that analyzed frequencies in 10 kb sequences. In genomes where the transcription direction changes frequently even over ranges as short as a few kb, however, within-species separation is much less pronounced in the 10 kb SOM.

Example 5
For genomic sequences from short fragments such as 1 kb, the transcription direction is often unknown. Distinguishing the transcription direction is not important and may only complicate identification among species. Taking this into account, the frequencies of the two tetranucleotides in each complementary pair were summed.

For 1 kb sequences from 81 bacterial species, the same SOM creation process (frequency analysis) as in Example 4 was performed for dinucleotides (two consecutive bases), trinucleotides (three consecutive bases), and tetranucleotides (four consecutive bases), using frequencies obtained by summing the frequencies of the two oligonucleotides in each complementary pair (degenerate frequencies). That is, to remove the effect of the complementarity of genome sequences, the two oligonucleotides forming a complementary pair (for example, AA and TT) were treated as identical, and the sum of their frequencies was used for SOM creation. For example, the SOM was created as in Example 4 except that frequency data for 32 pairs of complementary trinucleotides was used in place of frequency data for 64 trinucleotides. The results were similar to those of Example 4, and within-species separation was also reduced.
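The summation of complementary pairs described above can be sketched as follows. This is a minimal illustration under the assumption that "complementary pair" means a word and its reverse complement (consistent with the AA/TT example); the function names and the choice of the lexicographically smaller word as the representative of each pair are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement, e.g. AAC -> GTT."""
    return word.translate(COMPLEMENT)[::-1]

def canonical(word):
    """Pick one representative for a word and its reverse complement."""
    return min(word, reverse_complement(word))

def degenerate_counts(seq, k):
    """Count k-mers with each complementary pair merged into a single variable."""
    classes = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    counts = dict.fromkeys(classes, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # ignore windows with ambiguous bases
            counts[canonical(word)] += 1
    return counts

print(len(degenerate_counts("", 3)))     # number of trinucleotide variables
print(degenerate_counts("AATT", 2)["AA"])  # AA and TT are counted together
```

For trinucleotides this gives 32 variables, matching the 32 pairs in the text. For even word lengths, self-complementary words such as AT have no distinct partner and remain single variables, so dinucleotides give 10 variables and tetranucleotides 136.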

Example 6
Genomic segments introduced by horizontal transfer from distantly related organisms are known to retain the sequence characteristics of the donor genome and can be distinguished from those of the recipient genome. The present inventors have previously shown that SOMs are useful for identifying horizontally transferred genes and, importantly, for predicting the donor genome of the transferred genes (Non-Patent Document 3). In the SOMs, there were cases in which distinctive lattice points were located far from the main region of an individual species. Sequences whose oligonucleotide frequencies clearly differ from those of the main region should correspond, at least in part, to genome segments horizontally transferred from other genomes.

To visualize these sequences, whose oligonucleotide composition differs from that of the region of their own species, the present inventors examined 10 kb sequences from E. coli that were located outside both the E. coli region and the region of the closely related bacterium S. typhimurium. When sequences in the S. typhimurium region were excluded, the second largest number of E. coli points was found in the Y. pestis region. Attention was then focused on five sequences that were found in the Y. pestis region in common for dinucleotides, trinucleotides, and tetranucleotides. These sequences contained 37 known genes, 23 of which showed significant homology to Y. pestis genes. For example, at the amino acid level, 6 of the 23 proteins encoded by these genes had identity levels above 60%, and the highest identity level was 80%. These identities were markedly higher than the average identity of 40% calculated for vertically inherited pairs of E. coli and Y. pestis proteins. In addition, three genes were homologous to phage-encoded genes and one gene was homologous to a transposon gene. These findings support the prediction that these genes may have been horizontally transferred into the E. coli genome from other organisms.

Example 7
Species-specific features were observed in the frequencies of palindromic tetranucleotides. For example, the target tetranucleotides of restriction enzymes encoded by an organism's own genome were markedly underrepresented. Palindromic oligonucleotides are also known to be target sites for various proteins, such as transcriptional regulatory proteins. The frequency and distribution of palindromic oligonucleotides may therefore differ clearly from what would be predicted if these oligonucleotides occurred at random, which makes them an interesting subject of analysis. Note that there are only 64 types of palindromic hexanucleotides and 256 types of palindromic octanucleotides. In other words, even long sequences with important biological meaning could be examined.
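The counts of 64 and 256 follow because a palindromic oligonucleotide is fully determined by its first half; the second half must be the reverse complement of the first. The following minimal sketch enumerates such words; the function name `palindromic_oligos` is hypothetical and not taken from this document.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def palindromic_oligos(length):
    """List every oligonucleotide of `length` equal to its own reverse complement."""
    if length % 2:
        return []  # an odd-length word cannot be its own reverse complement
    words = []
    for p in product("ACGT", repeat=length // 2):
        left = "".join(p)
        # The right half is forced: it is the reverse complement of the left half.
        words.append(left + left.translate(COMPLEMENT)[::-1])
    return words

for n in (4, 6, 8):
    print(n, len(palindromic_oligos(n)))
# 4 16
# 6 64
# 8 256
```

Restricting the input variables to these words keeps the dimensionality small even for octanucleotides, compared with 65,536 unrestricted octanucleotides.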

The analysis clearly sharpened the separation of species, and the frequencies of palindromic oligonucleotides expressed species-specific features more distinctly. In some recognition sequences of restriction enzymes, and of DNA-binding proteins as well, internal bases are not involved in recognition. For example, the recognition sequence of Tth111I is GACNNNGTC (where any base may occupy an "N" position). Taking this into account, the present inventors included the following type of partially palindromic oligonucleotide in the analysis: palindromic oligonucleotides with n internal bases that are not part of the palindrome (n is an integer from 1 to 3), for example GGGNCCC, GGGNNCCC, and GGGNNNCCC as palindromic oligonucleotides used for recognition. When true palindromic hexanucleotides (n = 0) were added to these recognition sites, 256 (64 × 4) variables could be analyzed.
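The 256 variables arise from 64 possible three-base half-sites combined with four gap lengths (n = 0 to 3). The following minimal sketch counts such gapped palindromes in a sequence; it assumes that any character may fill the "N" positions, and the function name and the `(half_site, gap)` key layout are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def gapped_palindrome_counts(seq, half=3, max_gap=3):
    """Count occurrences of X + N*gap + revcomp(X) for every half-site X."""
    counts = {
        ("".join(p), gap): 0
        for p in product("ACGT", repeat=half)
        for gap in range(max_gap + 1)
    }
    for i in range(len(seq) - half + 1):
        left = seq[i:i + half]
        if not set(left) <= set("ACGT"):
            continue  # skip half-sites containing ambiguous bases
        right = left.translate(COMPLEMENT)[::-1]
        for gap in range(max_gap + 1):
            j = i + half + gap  # start of the second half-site
            if seq[j:j + half] == right:
                counts[(left, gap)] += 1
    return counts

counts = gapped_palindrome_counts("GACAAAGTCGGGTCCC")
print(len(counts))          # 64 half-sites x 4 gap lengths
print(counts[("GAC", 3)])   # GACNNNGTC
print(counts[("GGG", 1)])   # GGGNCCC
```

Each key corresponds to one input variable, so the resulting dictionary can be flattened in a fixed key order to form the input vector for one sequence.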

This analysis makes it possible to focus on oligonucleotides that are likely to have high biological specificity. The SOM can therefore efficiently identify species that have the same type of restriction enzyme or DNA-binding protein with similar recognition sequences. Accordingly, the SOM is considered capable of strongly separating not only phylogenetic groups but also closely related species (different bacterial species belonging to the Gammaproteobacteria). By first classifying base sequences into phylogenetic groups with an ordinary oligonucleotide SOM and then subdividing them with palindromic oligonucleotides, a wide variety of base sequences can be finely classified.

As described above, the classification system of the present invention can be used for classifying base sequences into biological classifications, analyzing the components of mixed samples containing multiple species of organisms, searching for novel and industrially useful bacteria and the like, and predicting segments in a genome sequence that were introduced from other species by horizontal transfer.

As described above, the analysis system of the present invention can be used for searching for important oligonucleotides that are key to distinguishing biological classifications, and for searching for regions in a genome sequence that are rich in signal sequences.

A block diagram showing the configuration of an oligonucleotide appearance frequency analysis system according to one embodiment of the present invention.
A diagram showing examples of SOMs created in one example of the present invention: (a) an SOM created using trinucleotide appearance frequencies, (b) an SOM created using tetranucleotide appearance frequencies, (c) an SOM created using pentanucleotide appearance frequencies, and (d) an SOM created using pentanucleotide appearance frequencies in which the frequencies of the two pentanucleotides of each complementary pair are summed.
A diagram showing examples of an SOM and appearance frequency maps created in one example of the present invention: (a) an SOM created using tetranucleotide appearance frequencies, (b) the appearance frequency map of CAGT, and (c) the appearance frequency map of AATT.
A diagram showing examples of an SOM and appearance frequency maps created in one example of the present invention: (a) an SOM created using pentanucleotide appearance frequencies in which the frequencies of the two pentanucleotides of each complementary pair are summed, (b) the appearance frequency map of the sum of ACAGG and CCTGT, (c) the appearance frequency map of the sum of CGACG and CGTCG, and (d) the appearance frequency map of the sum of CGAAA and TTTCG.
A tetranucleotide appearance frequency distribution diagram showing the distribution of normalized appearance frequencies of several tetranucleotides on chromosome 21q.

Explanation of symbols

1 Complementary data adder (adder)
2 Oligonucleotide appearance frequency data storage unit
3 Expected value calculation unit
4 Mononucleotide composition data storage unit
5 Normalization unit
6 Appearance frequency map creation unit
7 Appearance frequency distribution diagram creation unit
10 SOM creation unit

Claims

1. A base sequence classification system that arranges, as an input vector group in a multidimensional space, the appearance frequencies at which multiple types of oligonucleotides each appear in base sequences, and non-linearly maps the input vector group by self-organization onto a map on which a plurality of lattice points are arranged, thereby creating a self-organizing map in which the base sequences are classified into the lattice points, the system comprising:
an oligonucleotide appearance frequency data storage unit that stores data on the appearance frequencies at which multiple types of oligonucleotides each appear in a plurality of base sequences;
an adder that retrieves the appearance frequency data of each oligonucleotide stored in the oligonucleotide appearance frequency data storage unit, calculates an appearance frequency for each pair of complementary oligonucleotides by adding the appearance frequencies of the oligonucleotides forming the pair, and outputs the calculated appearance frequency data for each pair; and
a self-organizing map creation unit that, based on the appearance frequency data for each pair output from the adder, creates the self-organizing map by performing the self-organization with the appearance frequencies for each pair as the input vector group, and outputs data of the created self-organizing map.

2. An oligonucleotide appearance frequency analysis system comprising:
an oligonucleotide appearance frequency data storage unit that stores data on the appearance frequencies at which multiple types of oligonucleotides each appear in a plurality of base sequences;
an adder that retrieves the appearance frequency data of each oligonucleotide stored in the oligonucleotide appearance frequency data storage unit, calculates an appearance frequency for each pair of complementary oligonucleotides by adding the appearance frequencies of the oligonucleotides forming the pair, and outputs the calculated appearance frequency data for each pair;
a self-organizing map creation unit that, based on the appearance frequency data for each pair output from the adder, arranges the appearance frequencies for each pair as an input vector group in a multidimensional space, non-linearly maps the input vector group by self-organization onto a map on which a plurality of lattice points are arranged, thereby creating a self-organizing map in which the base sequences are classified into the lattice points, and outputs data of the created self-organizing map; and
an appearance frequency map creation unit that, based on the appearance frequency data for each pair output from the adder, creates for each oligonucleotide an appearance frequency map showing, for each lattice point, information on the appearance frequency of each pair of oligonucleotides, and outputs data of the created appearance frequency map.

3. The oligonucleotide appearance frequency analysis system according to claim 2, further comprising:
a mononucleotide composition data storage unit that stores data on the mononucleotide composition of each base sequence to be analyzed;
an expected value calculation unit that retrieves the mononucleotide composition data of each base sequence to be analyzed stored in the mononucleotide composition data storage unit, calculates, based on the mononucleotide composition of the base sequences classified into each lattice point, an expected value of the appearance frequency of each oligonucleotide in the base sequences classified into that lattice point, and outputs the calculated expected value; and
a normalization unit that receives the appearance frequency data for each pair output from the adder and the expected value calculated by the expected value calculation unit, normalizes the appearance frequency of each oligonucleotide in the base sequences classified into each lattice point by dividing it by the expected value, and outputs the normalized appearance frequency data,
wherein the appearance frequency map creation unit creates the appearance frequency map based on the appearance frequency data normalized by the normalization unit.

4. An oligonucleotide appearance frequency analysis system comprising:
an oligonucleotide appearance frequency data storage unit that stores data on the appearance frequencies at which multiple types of oligonucleotides each appear in a plurality of fragment base sequences extracted from the same DNA sequence;
an adder that retrieves the appearance frequency data of the oligonucleotides stored in the oligonucleotide appearance frequency data storage unit, calculates an appearance frequency for each pair of complementary oligonucleotides by adding the appearance frequencies of the oligonucleotides forming the pair, and outputs the calculated appearance frequency data for each pair;
a self-organizing map creation unit that, based on the appearance frequency data for each pair output from the adder, arranges the appearance frequencies for each pair as an input vector group in a multidimensional space, non-linearly maps the input vector group by self-organization from the multidimensional space onto a map on which a plurality of lattice points are arranged, thereby creating a self-organizing map in which the fragment base sequences are classified into the lattice points, and outputs data of the created self-organizing map; and
an appearance frequency distribution diagram creation unit that, based on the appearance frequency of each oligonucleotide in the fragment base sequences classified into each lattice point, creates an appearance frequency distribution diagram showing the distribution of the appearance frequency of each oligonucleotide along the DNA sequence, and outputs data of the created appearance frequency distribution diagram,
wherein the appearance frequency distribution diagram creation unit creates the appearance frequency distribution diagram based on the appearance frequency data for each pair output from the adder.