Chau-Wen Tseng's Bioinformatic Research

High-Performance Computing for Bioinformatic Applications

Focus of Research

Recent advances in molecular biology techniques such as automated DNA sequencing, DNA microarrays, and mass spectrometry allow scientists to quickly gather huge amounts of gene sequence, gene expression, and protein sequence data. Popular algorithms for analyzing bioinformatic data may be oversimplified because of concerns about limited processing power. My research investigates methods to exploit the rapidly increasing power of high-performance computing architectures to improve the speed and quality of bioinformatic algorithms.

Research Projects

  • Protein Identification via Tandem Mass Spectrometry

    Mass spectrometry (MS) is a technique used by scientists to identify molecules based on their mass/charge ratios. Biologists use mass spectrometers as a high-throughput method of analyzing protein fragments (peptides). Many techniques and software tools exist for identifying the correct peptide associated with each mass spectrum. Unfortunately, current tools are computationally expensive and have high error rates.
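    As a rough illustration of identification by mass/charge ratio, the sketch below computes the theoretical m/z of a peptide from standard monoisotopic residue masses and filters a candidate list by a precursor tolerance. The function names, the 10 ppm default tolerance, and the example peptides are assumptions made for illustration; this is not the scoring used by any particular search engine.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of one proton (one per unit of charge)


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def mz(seq, charge):
    """Theoretical mass/charge ratio of the peptide at the given charge."""
    return (peptide_mass(seq) + charge * PROTON) / charge


def candidates(observed_mz, charge, peptides, tol_ppm=10.0):
    """Peptides whose theoretical m/z lies within tol_ppm of the observed m/z.

    tol_ppm is a hypothetical default, not a recommended setting.
    """
    hits = []
    for pep in peptides:
        theoretical = mz(pep, charge)
        if abs(theoretical - observed_mz) / theoretical * 1e6 <= tol_ppm:
            hits.append(pep)
    return hits
```

    Note that peptides with the same amino acid composition but a different order have identical masses, so a precursor mass filter alone cannot tell them apart; the fragment spectrum from the second MS stage is needed for that.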

    We are investigating different approaches to improving the accuracy of protein identification algorithms, ranging from constructing more comprehensive peptide databases to modifying and/or combining peptide matching functions from multiple MS search engines.
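    One simple way to combine results from several search engines is to sum each candidate's rank across engines, since raw scores from different engines are generally not on comparable scales. The sketch below assumes each engine reports a score per candidate peptide with higher meaning better; the engine names and scores in the usage note are hypothetical, and this is only one possible combination scheme, not necessarily the one used in this project.

```python
from collections import defaultdict


def consensus_rank(engine_scores):
    """Order peptides by summed rank across engines (lower sum is better).

    engine_scores maps an engine name to {peptide: score}, where a higher
    score is assumed to be better. A peptide that an engine did not report
    receives that engine's worst rank plus one.
    """
    all_peptides = set()
    for scores in engine_scores.values():
        all_peptides.update(scores)

    totals = defaultdict(float)
    for scores in engine_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank = {pep: i + 1 for i, pep in enumerate(ranked)}
        worst = len(ranked) + 1
        for pep in all_peptides:
            totals[pep] += rank.get(pep, worst)

    # Break ties alphabetically so the output is deterministic.
    return sorted(all_peptides, key=lambda pep: (totals[pep], pep))
```

    For example, calling `consensus_rank({"engine_a": {"PEPTIDE": 50, "EDITPEP": 30}, "engine_b": {"EDITPEP": 0.9, "PEPTIDE": 0.8}, "engine_c": {"PEPTIDE": 3.2}})` places the peptide ranked first by two of the three engines at the top of the list.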

    Some Related Papers

  • Genome Mapping & EST Clustering

    High-throughput methods can quickly collect and sequence large quantities of short strands of mRNA/cDNA known as Expressed Sequence Tags (ESTs), to the extent that EST sequences make up over 60% of the sequences in the NCBI GenBank sequence database. Since ESTs are short single-read sequences with high error rates, they usually need to be collected into clusters (e.g., NCBI UniGene) to provide useful biological information. A number of EST clustering tools (NCBI megaBlast, TIGR TGICL, CAP, d2_cluster, GeneNest, PaCE, etc.) exist, but they have varying degrees of precision, scalability, and efficiency. Metrics for qualitative and quantitative comparisons between EST clustering algorithms are still under development.

    Our goal is to compare existing EST clustering algorithms and discover enhancements to precision and scalability where possible. We also plan to incorporate algorithms for processing EST clusters to discover alternative splicing models and polymorphisms as a more sophisticated metric for comparing the precision and utility of EST clustering algorithms.
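    As a baseline for comparing the output of two clustering tools, one common generic measure is the Jaccard index over pairs of ESTs placed in the same cluster. The sketch below is an assumption about what a simple quantitative comparison could look like; it is not the more sophisticated splicing- and polymorphism-based metric described above, and the function names are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Set of unordered EST pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pair_jaccard(clusters_a, clusters_b):
    """Jaccard index of the co-clustered pairs of two clusterings.

    Returns 1.0 when both clusterings group exactly the same pairs and
    0.0 when they share none. Two clusterings made only of singletons
    are treated as identical.
    """
    pairs_a = co_clustered_pairs(clusters_a)
    pairs_b = co_clustered_pairs(clusters_b)
    union = pairs_a | pairs_b
    if not union:
        return 1.0
    return len(pairs_a & pairs_b) / len(union)
```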

    One of our first results produced ESTmapper, a new tool for clustering EST sequences based on efficiently mapping ESTs to the genome. Our mapping algorithm is based on first building a suffix tree for the genome, then searching for long common substrings between each EST and the genome. We use these common substrings to build "matching regions", gapped local alignments between the EST and genome to account for sequencing errors and splicing. ESTs are then mapped to the genome by their longest matching regions and placed into clusters based on their locations in the genome. Preliminary experiments show that ESTmapper is precise and very efficient for large numbers of EST sequences, though more work is needed to enable ESTmapper to effectively handle large genomes.
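    The map-then-cluster idea can be sketched in a few lines. This is a deliberately simplified illustration, not ESTmapper itself: it substitutes a k-mer index for the suffix tree, finds only exact common substrings (no gapped "matching regions", so it does not handle sequencing errors or splicing), and all function names and the choice of k are assumptions.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def longest_exact_match(est, genome, index, k):
    """Longest exact EST/genome common substring reachable from a k-mer seed.

    Returns (genome_start, length), or None if no k-mer of the EST occurs
    in the genome.
    """
    best = None
    for j in range(len(est) - k + 1):
        for g in index.get(est[j:j + k], ()):
            start_e, start_g, length = j, g, k
            # Extend the seed to the right while the bases agree.
            while (start_e + length < len(est)
                   and start_g + length < len(genome)
                   and est[start_e + length] == genome[start_g + length]):
                length += 1
            # Extend the seed to the left while the bases agree.
            while (start_e > 0 and start_g > 0
                   and est[start_e - 1] == genome[start_g - 1]):
                start_e -= 1
                start_g -= 1
                length += 1
            if best is None or length > best[1]:
                best = (start_g, length)
    return best


def cluster_by_location(placements):
    """Group ESTs whose mapped genome intervals overlap.

    placements maps an EST id to (genome_start, length).
    """
    clusters = []
    current, current_end = [], None
    ordered = sorted(placements.items(), key=lambda kv: (kv[1][0], kv[0]))
    for est_id, (start, length) in ordered:
        # Start a new cluster when this EST begins past the current one.
        if current and start >= current_end:
            clusters.append(current)
            current, current_end = [], None
        current.append(est_id)
        end = start + length
        current_end = end if current_end is None else max(current_end, end)
    if current:
        clusters.append(current)
    return clusters
```

    A k-mer index is easier to write than a suffix tree but only finds matches of at least k bases and costs more per query on repetitive genomes, which is one reason a suffix tree is the more natural structure for searching for long common substrings.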

    Some Related Papers

  • Sequence Alignment

    Pairwise comparison of DNA and protein sequences is one of the fundamental tools in bioinformatic research. Many sequence comparison algorithms have been developed, but most researchers use some form of BLAST. We plan to explore modifications to the NCBI BLAST software so that alternative sequence comparison algorithms can be used through the same BLAST interface. Algorithm choices will be made based on the desired precision and efficiency.

    Previously we worked on UM-BLAST, a wrapper for high-performance BLASTs. Our experiments show that no single version of BLAST achieves the best performance across variations in sequence database size, query batch size, and query sequence length. We find that mpiBLAST is best at exploiting database partitioning over multiple nodes to keep large databases in memory, BLAST++ is best at amortizing search costs for multiple batched queries, and threaded BLAST is best at reducing search costs for very long single queries. Based on this evaluation, we designed UM-BLAST, a wrapper capable of selecting the proper combination of threaded BLAST, BLAST++, and mpiBLAST to achieve good performance over a range of search parameters.
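    The selection logic of such a wrapper might look roughly like the sketch below. It encodes only the three observations above; the `SearchJob` fields, the threshold values, and the fallback to plain BLAST are all hypothetical and are not taken from the actual UM-BLAST implementation.

```python
from dataclasses import dataclass


@dataclass
class SearchJob:
    database_bytes: int      # size of the sequence database
    node_memory_bytes: int   # memory available on a single node
    num_queries: int         # number of queries in the batch
    max_query_length: int    # length of the longest query sequence


def choose_blast_variants(job, batch_threshold=16, long_query_threshold=100_000):
    """Pick BLAST variants for a job; both thresholds are made-up defaults."""
    variants = []
    # Database too large for one node: partition it across nodes.
    if job.database_bytes > job.node_memory_bytes:
        variants.append("mpiBLAST")
    # Many queries: amortize the database scan over the whole batch.
    if job.num_queries >= batch_threshold:
        variants.append("BLAST++")
    # Very long query: split the work of a single search across threads.
    if job.max_query_length >= long_query_threshold:
        variants.append("threaded BLAST")
    # Assumed fallback when none of the conditions applies.
    return variants or ["BLAST"]
```

    In practice the thresholds would have to be tuned from measurements like those described above, since the crossover points depend on the hardware and on the database.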

    Some Related Papers

  • Faculty
    Students
    Affiliated Research Groups
  • Center for Bioinformatics and Computational Biology
  • Some Bioinformatics Researchers at University of Maryland

  • Mike Cummings
  • Nathan Edwards
  • Chuck Delwiche
  • Steven Mount
  • Steven Salzberg