Chau-Wen Tseng's Bioinformatic Research

High-Performance Computing for Bioinformatic Applications

Focus of Research

Recent advances in molecular biology techniques such as automated DNA sequencing, DNA microarrays, and mass spectrometry allow scientists to quickly gather huge amounts of gene sequence, gene expression, and protein sequence data. Popular algorithms for analyzing bioinformatic data may be oversimplified due to concerns about limited processing power. My research investigates methods to exploit the rapidly increasing power of high-performance computing architectures to improve the speed and quality of bioinformatic algorithms.

Research Projects

Protein Identification via Tandem Mass Spectrometry

Mass spectrometry (MS) is a technique used by scientists to identify molecules based on their mass/charge ratios. Biologists use mass spectrometers as a high-throughput method of analyzing protein fragments (peptides). Many techniques and software tools exist for identifying the correct peptide associated with each mass spectrum. Unfortunately, current tools are computationally expensive and have high error rates.

We are investigating different approaches to improving the accuracy of protein identification algorithms, ranging from constructing more comprehensive peptide databases to modifying and/or combining peptide matching functions from multiple MS search engines.
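
As a rough illustration of how mass/charge ratios narrow down peptide candidates, the following Python sketch computes a peptide's theoretical m/z from standard monoisotopic residue masses and keeps database peptides that fall within a tolerance of an observed precursor m/z. The function names, the 20 ppm default tolerance, and the list-of-strings database are assumptions made for this example; they are not taken from any of the search engines discussed here, which also score fragment-ion spectra rather than relying on precursor mass alone.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water per peptide (N- and C-terminal groups)
PROTON = 1.00728   # mass added per positive charge


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mz(seq, charge):
    """Theoretical mass/charge ratio of the peptide at the given charge."""
    return (peptide_mass(seq) + charge * PROTON) / charge


def candidates(observed_mz, charge, peptides, tol_ppm=20.0):
    """Return (peptide, theoretical m/z) pairs within tol_ppm of observed_mz.

    tol_ppm is a hypothetical default chosen for illustration only.
    """
    hits = []
    for pep in peptides:
        theo = mz(pep, charge)
        if abs(theo - observed_mz) / theo * 1e6 <= tol_ppm:
            hits.append((pep, theo))
    return hits
```

Note that leucine and isoleucine have identical masses, so a filter like this cannot distinguish peptides that differ only in those residues; this is one small reason why mass-based identification is error-prone.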

Some Related Papers

N. Edwards, X. Wu and C.-W. Tseng, Novel Peptide Identification Using ESTs and Genomic Sequence, Poster at 2nd Annual Conference of the US Human Proteome Organisation (US HUPO 2006), Boston, MA, March 2006.
S. Tanner et al., InsPecT: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra, Analytical Chemistry, 2005, 77:4626-4639.
R. Craig and R. Beavis, A Method for Reducing the Time Required to Match Protein Sequences with Tandem Mass Spectra, Rapid Commun. Mass Spectrom., 2003, 17:2310-2316.

Genome Mapping & EST Clustering

High-throughput methods can quickly collect and sequence large quantities of short strands of mRNA/cDNA known as Expressed Sequence Tags (ESTs), to the extent that EST sequences make up over 60% of the sequences in the NCBI GenBank sequence database. Since ESTs are short single-read sequences with high error rates, they usually need to be collected into clusters (e.g., NCBI UniGene) to provide useful biological information. A number of EST clustering tools (NCBI megaBlast, TIGR TGICL, CAP, d2_cluster, GeneNest, PaCE, etc.) exist, but they vary in precision, scalability, and efficiency. Metrics for qualitative and quantitative comparisons between EST clustering algorithms are still under development.

Our goal is to compare existing EST clustering algorithms and discover enhancements to precision and scalability where possible. We also plan to incorporate algorithms for processing EST clusters to discover alternative splicing models and polymorphisms as a more sophisticated metric for comparing the precision and utility of EST clustering algorithms.
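
To make the idea of a quantitative comparison concrete, here is a minimal Python sketch of one generic way to score a clustering against a reference: pairwise precision and recall over co-clustered ESTs. This is offered only as an assumption-laden illustration, not as the metric used in our work, and the function names are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Set of unordered EST pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, reference):
    """Compare two clusterings of the same ESTs by co-clustered pairs.

    precision: fraction of predicted pairs that the reference also groups.
    recall:    fraction of reference pairs that the prediction recovers.
    An empty pair set is scored as 1.0 by convention (an assumption).
    """
    pred = co_clustered_pairs(predicted)
    ref = co_clustered_pairs(reference)
    true_pos = len(pred & ref)
    precision = true_pos / len(pred) if pred else 1.0
    recall = true_pos / len(ref) if ref else 1.0
    return precision, recall
```

A pair-based score treats every EST pair equally, so it says nothing about biological utility; that gap is what the splicing- and polymorphism-based comparison described above is meant to address.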

One of our first results produced ESTmapper, a new tool for clustering EST sequences based on efficiently mapping ESTs to the genome. Our mapping algorithm is based on first building a suffix tree for the genome, then searching for long common substrings between each EST and the genome. We use these common substrings to build "matching regions", gapped local alignments between the EST and genome to account for sequencing errors and splicing. ESTs are then mapped to the genome by their longest matching regions and placed into clusters based on their locations in the genome. Preliminary experiments show that ESTmapper is precise and very efficient for large numbers of EST sequences, though more work is needed to enable ESTmapper to effectively handle large genomes.
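
The overall flow described above (index the genome, find long common substrings, place each EST by its best match, cluster by genomic location) can be sketched in Python as follows. This is not the ESTmapper implementation: it swaps the suffix tree for a simple k-mer hash index, uses only the single longest exact match instead of building gapped "matching regions", and ignores reverse-complement strands. All function names and the default seed length k are assumptions for this example.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to its start positions
    (a simple stand-in for the suffix tree)."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def longest_exact_match(est, genome, index, k):
    """Return (length, est_start, genome_start) of the longest common
    substring of length >= k, or None if no k-mer seed matches."""
    best = None
    for e in range(len(est) - k + 1):
        for g in index.get(est[e:e + k], ()):
            # A seed whose preceding characters also match lies inside a
            # longer match that was already extended from its left end.
            if e > 0 and g > 0 and est[e - 1] == genome[g - 1]:
                continue
            n = k
            while (e + n < len(est) and g + n < len(genome)
                   and est[e + n] == genome[g + n]):
                n += 1
            if best is None or n > best[0]:
                best = (n, e, g)
    return best


def cluster_by_location(ests, genome, k=12):
    """Place each EST at its longest match and group ESTs whose matched
    genome intervals overlap. ESTs with no seed hit are left out."""
    index = build_kmer_index(genome, k)
    placed = []
    for name, seq in ests.items():
        hit = longest_exact_match(seq, genome, index, k)
        if hit is not None:
            length, _, g_start = hit
            placed.append((g_start, g_start + length, name))
    placed.sort()

    clusters = []
    current_end = -1
    for start, end, name in placed:
        if clusters and start < current_end:
            clusters[-1].append(name)        # overlaps the running cluster
            current_end = max(current_end, end)
        else:
            clusters.append([name])          # starts a new cluster
            current_end = end
    return clusters
```

Because the k-mer index stores every genome position, its memory use grows with genome size, which echoes the large-genome limitation noted above; a real implementation would also need gapped extension to tolerate sequencing errors and introns.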

Some Related Papers

X. Wu, W.-J. Lee, and C.-W. Tseng, ESTmapper: Efficiently Aligning DNA Sequences to Genomes, 4th IEEE International Workshop on High Performance Computational Biology (HiCOMB'05), Denver, CO, April 2005.
G. Pertea et al., TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets, Bioinformatics 19(5):651-652, 2003.
A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel, Efficient clustering of large EST data sets on parallel computers, Nucleic Acids Research, 31(11):2963-2974, 2003.

Sequence Alignment

Pairwise comparison of DNA and protein sequences is one of the fundamental tools in bioinformatic research. Many sequence comparison algorithms have been developed, but most researchers use some form of BLAST. We plan to explore modifications to the NCBI BLAST software so that alternative sequence comparison algorithms may be used behind the same BLAST interface. Algorithm choices will be made based on the desired precision and efficiency.

Previously we worked on UM-BLAST, a wrapper for high-performance BLASTs. Our experiments showed that no single version of BLAST achieves the best performance across variations in sequence database size, query batch size, and query sequence length. We found that mpiBLAST is best at exploiting database partitioning over multiple nodes to keep large databases in memory, BLAST++ is best at amortizing search costs for multiple batched queries, and threaded BLAST is best at reducing search costs for very long single queries. Based on this evaluation, we designed UM-BLAST, a wrapper capable of selecting the proper combination of threaded BLAST, BLAST++, and mpiBLAST to achieve good performance over a range of search parameters.
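
The selection logic of such a wrapper might look roughly like the Python sketch below. It is a hypothetical illustration of the three findings above, not the UM-BLAST code: the SearchJob fields, the threshold values, and the fallback to plain NCBI BLAST are all assumptions, and no actual BLAST program is invoked.

```python
from dataclasses import dataclass


@dataclass
class SearchJob:
    db_bytes: int        # size of the sequence database
    num_queries: int     # number of queries in the batch
    max_query_len: int   # length of the longest query sequence
    node_mem_bytes: int  # memory available on a single node


def choose_backends(job, batch_threshold=16, long_query_threshold=100_000):
    """Pick the BLAST variants suited to a job.

    Both thresholds are made-up defaults for illustration; real values
    would have to come from measurements on the target cluster.
    """
    backends = []
    if job.db_bytes > job.node_mem_bytes:
        # Database does not fit in one node's memory: partition it.
        backends.append("mpiBLAST")
    if job.num_queries >= batch_threshold:
        # Many queries: amortize the database scan across the batch.
        backends.append("BLAST++")
    if job.max_query_len >= long_query_threshold:
        # Very long query: split the work across threads.
        backends.append("threaded BLAST")
    # Assumed fallback when none of the special cases applies.
    return backends or ["NCBI BLAST"]
```

For example, a 10 GiB database searched with 200 short queries on nodes with 4 GiB of memory would select both mpiBLAST and BLAST++ under these assumed thresholds.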

Some Related Papers

X. Wu and C.-W. Tseng, Searching Sequence Databases Using High Performance BLASTs, in Parallel Computing for Bioinformatics and Computational Biology (A. Zomaya, ed.), John Wiley & Sons, 2005.
Darling, Carey, and Feng, The Design, Implementation, and Evaluation of mpiBLAST, ClusterWorld 2003 conference
Wang, Ong, Ooi, and Tan, BLAST++ : A Tool for BLASTing Queries in Batches, APBC2003: 71-79
Altschul, Gish, Miller, Myers, and Lipman, A basic local alignment search tool, Journal of Molecular Biology (1990) 215:403-410

Faculty

Chau-Wen Tseng

Students

Xue Wu

Affiliated Research Groups

Center for Bioinformatics and Computational Biology

Some Bioinformatics Researchers at University of Maryland