（マイクロアレイ）データ解析Tips

Deprecated: The each() function is deprecated. This message will be suppressed on further calls in /home/zhenxiangba/zhenxiangba.com/public_html/phproxy-improved-master/index.php on line 456
（マイクロアレイ）データ解析Tips

（マイクロアレイ）データ解析Tips

cosinor法のかけ方 Gene Expression Omnibus (GEO)から得られる*.soft.txtファイル整形のためのPerlプログラム GEOから得られる「GPL*.txt」というアノテーションファイルから、特定の遺伝子IDに相当する行のみ抽出するPerlプログラム MAD（Median Absolute Deviation）の求め方 CygwinでRを使いたい場合（2006/2/28追加） Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
Gene Expression Omnibus (GEO)から得られる*.soft.txtファイル整形のためのPerlプログラム GEOから得られる「GPL*.txt」というアノテーションファイルから、特定の遺伝子IDに相当する行のみ抽出するPerlプログラム MAD（Median Absolute Deviation）の求め方 CygwinでRを使いたい場合（2006/2/28追加） Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
GEOから得られる「GPL*.txt」というアノテーションファイルから、特定の遺伝子IDに相当する行のみ抽出するPerlプログラム MAD（Median Absolute Deviation）の求め方 CygwinでRを使いたい場合（2006/2/28追加） Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
MAD（Median Absolute Deviation）の求め方 CygwinでRを使いたい場合（2006/2/28追加） Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
CygwinでRを使いたい場合（2006/2/28追加） Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
Links（データベース） Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
Links（解析ツールなど） Links（CygwinやPerl, gnuplot）
Links（CygwinやPerl, gnuplot）

cosinor法のかけ方

例(data.txt):
--- ここから ---
#Time data
0 84
4 102
8 91
12 73
16 61
20 72
--- ここまで ---

gnuplot
f(x)=M+A*cos(2*pi/P*(x-T))
M=80 #水準MESOR (max+min)/2
A=20 #振幅Amplitude (max-min)/2
T=4 #頂点位相Acrophase
P=24 #周期Period (P)
fit f(x) "data.txt" via M,A,T,P
q

GEOから得られる*.soft.txtファイル整形のためのPerlプログラム

Gene Expression Omnibus (GEO)からマイクロアレイデータセット(GDS***.soft.txt)をダウンロードして解析するとき、「GSM***」と「サンプル名」を対応づけて最終的には「サンプル名」で解析したい。このような目的のために使う。
目的：GDS1096.soft.txtのファイルをGDS1096.txtのようにしたい。

実行例：perl geo.pl GDS1096.soft.txt > GDS1096.txt

解説： このPerlプログラムは、講義科目「バイオインフォマティクスリテラシーII」で習ったハッシュを使っています。大まかには、「GSM**」と「サンプル名」の対応を保持しておき、240行の「GSM**」一つ一つについてサンプル名に置換し、その行以降を出力するPerlプログラムです。具体的には以下のとおりです：

GDS1096.soft.txtファイルを一行ずつ読み込んで、「#GSM」で始まる行に対して、"#"を削除し、": "でsplit
1. $_[0]をさらに" "でsplitし、@tmpに格納。$tmp[0]には「GSM44702」などのサンプルIDのみの情報が収められており、それを$accessionに代入
2. $_[2]には「Normal Liver , Commercial mRNA for normal human tissue」などの情報が収められているので、その中に" , "がある場合には" , "でsplit、", "がある場合には", "でsplitし、@tmp2に格納。それ以外の場合には$_[2]を$tmp2[0]に代入
3. $tmp2[0]には「Normal Liver」などの情報が収められている。$_tmp2[0]中の" "を"_"に置換し、「Normal_Liver」のようにする
4. （「GSM44702」などのサンプルID情報が収められている）$accession --> （「Normal_Liver」などの情報が収められている）$tmp2[0]への対応づけを行い、$tissue_name{$accession}で指定した$accessionに対応するデータの参照を行う
「ID_REF」で始まる行があれば、その行に対して3カラム目以降についてその要素に対応するデータの参照を行い、出力
(「ID_REF」の行番号+1)行目以降を出力。この際、"null"という文字が要素中にあればそれを0で置換する

GEOから得られる「GPL*.txt」というアノテーションファイルから、特定の遺伝子IDに相当する行のみ抽出するPerlプログラム

目的：ある手段によって得られた遺伝子IDのみからなるリストファイル（例：GDS1096_best10_heart_ID.txt）をもとに、アノテーション情報を付加した別のファイルを作りたい。

前提1：公共遺伝子発現データベース（GEO）からダウンロードして得られたアノテーションファイルを使う。
前提2：アノテーション情報を得たい遺伝子（クローン）IDリストファイルが手元にある（例：「解析 --- 似た発現パターンを持つ遺伝子の同定」で得られたGDS1096_best10_heart_ID.txt）。
前提3：解析したマイクロアレイデータのPlatform（例えば、「Affymetrix GeneChip Human Genome U133 Array Set HG-U133A」チップというPlatformを使ったなら、そのPlatform IDはGPL96）に対応する最新のアノテーションファイルが手元にある。（例：ここから得たGPL96.annot.txt）

実行例：perl geo_gpl_annot.pl GPL96.annot.txt GDS1096_best10_heart_ID.txt > GPL96_best_10_heart_annot.txt

解説： このPerlプログラムも、講義科目「バイオインフォマティクスリテラシーII」で習ったハッシュを使っています。ここでは、GPL96.annot.txtのアノテーションファイルをタブで区切った「最初の要素（遺伝子IDが含まれる列）」を添字として「その行まるごと」に対応させ、次にGDS1096_best10_heart_ID.txt中の遺伝子IDを一行一行読み込みながら遺伝子IDに対応するアノテーションファイル中の行をまるごと出力させています。 10個程度であればエクセルでやればいいと思いますが、100個や1000個などになったときにありがたみが分かることと思います。

MAD（Median Absolute Deviation）の求め方

1, 3, 7, 9, 12, 30のMADを求めたい：
MADとは、the median of the absolute deviations from the medianのことです。
--- ここから ---
x <- c(1, 3, 7, 9, 12, 30)
median(abs(x - median(x))) #4.5になります。
--- ここまで ---
標準正規分布のMADを推定したい：
--- ここから ---
x <- rnorm(5000000, mean=0, sd=1)
median(abs(x - median(x))) #1/1.4826（0.6745）付近の値になります。
--- ここまで ---
パッケージstats中のmad関数を使う：
--- ここから ---
mad(rnorm(5000000, mean=0, sd=1))/1.4826
--- ここまで ---
または、
--- ここから ---
mad(rnorm(5000000, mean=0, sd=1), constant=1)
--- ここまで ---

つまり、(1 standard deviation) = 1.4826 * (1 MAD)です。

CygwinでRを使いたい場合

前提：Cygwin（C:\cygwin）とR（C:\Program Files\R\R-2.2.0）が既にインストールされているとします。

「C:\Program Files\R」上で、「R-2.2.0」のディレクトリを「C:\cygwin」上に移動（またはコピー）
Cygwinを起動（Cygwinのプロンプトは$）
環境設定ファイル（.bashrc）にRのパスを指定します。もともと「PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin:$PATH」
だったのを
「PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin:$PATH:/R-2.2.0/bin」
などとしてバイナリの場所を指定します。
環境設定ファイルを読み込みます
$ source .bashrc

ここまでが最初の一回でやるべきことで、二回目以降は以下からはじめます
$ R
Rが起動します。
>
Rのプロンプト（>）が出るので、あとはRのコマンドを打ち込みます。例えば、
> getwd() #カレントディレクトリを表示
尚、終了させたいときにはRのコマンドライン上で終了させるのと同じく、
> q()
とやります。

Links（データベース）

統合版

ヒト

マウス

Links（解析ツールなど）

Argus
Comander et al., Genome Res., 2001
A New Database System for Web-Based Analysis of Multiple Microarray Data Sets
GoMiner
Zeeberg et al., Genome Biol., 2003
a resource for biological interpretation of genomic and proteomic data
High-Throughput GoMiner
Zeeberg et al., BMC Bioinformatics, 2005
an 'industrial-strength' integrative Gene Ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID)
KnowledgeEditor
Toyoda and Konagaya, Bioinformatics, 2003
a graphical workbench for biological experts to model biomolecular network graphs
MatchMiner
Bussey et al., Genome Biol., 2003
a tool for batch navigation among gene and gene product identifiers
MedMiner
Tanabe et al., Biotechniques, 1999
an Internet Text-Mining Tool for Biomedical Information, with Application to Gene Expression Profiling
RARGE
Sakurai et al., NAR, 2005
The RIKEN Arabidopsis Genome Encyclopedia
SNOMAD(Standardization and NOrmalization of MicroArray Data)
Colantuoni et al., Bioinformatics, 2002
a collection of algorithms for the normalization and standardization of gene expression datasets
MatArray(a Matlab toolbox for microarray data)
Venet D., Bioinformatics, 2003
normalization, hierarchical clustering, K-means clustering, etc.
Gene Set Enrichment Analysis (GSEA)
Mootha et al., Nat Genet., 2003
a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states.
a step-down quadratic regression method
Liu et al., BMC Bioinformatics, 2005
時系列データ解析 -- for non-cyclic genes

Links（CygwinやPerl, gnuplot）

Links（Perl）

Last updated on 2008.7.28