contrail-bio

Contrail-bio is a free project written mainly in Java and Perl.

Mirror of: A Hadoop-based genome assembler for assembling large genomes in the clouds

Contrail http://contrail-bio.sf.net

The first step towards analyzing a previously unsequenced organism is to assemble the reads by merging similar reads into progressively longer sequences. New assemblers such as Velvet and Euler attempt to solve the assembly problem by constructing, simplifying, and traversing the de Bruijn graph of the read sequences. Nodes in the graph represent substrings of the reads, and directed edges connect consecutive substrings. Genome assembly is then modeled as finding an Eulerian tour through the graph, although repeats may lead to multiple possible tours. As such, assemblers primarily focus on correcting errors, reconstructing unambiguous regions, and resolving short repeats. These assemblers have successfully assembled small genomes from short reads, but have had limited success scaling to larger, mammalian-sized genomes, in part because they require constructing and manipulating graphs far larger than can fit into memory.
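As a rough illustration of the graph model described above, the following minimal Java sketch builds a de Bruijn graph from a few toy reads and spells out one unambiguous path. It is not taken from Contrail, Velvet, or Euler; the class name DeBruijnSketch, the methods build and spell, the reads, and the choice of k = 3 are all assumptions made for this example, and error correction and repeat resolution are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not code from any real assembler.
class DeBruijnSketch {

    // Nodes are (k-1)-mers. Every k-mer found in a read contributes one
    // directed edge from its (k-1)-length prefix to its (k-1)-length suffix.
    static Map<String, List<String>> build(List<String> reads, int k) {
        Map<String, List<String>> graph = new TreeMap<>();
        for (String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                String prefix = read.substring(i, i + k - 1);
                String suffix = read.substring(i + 1, i + k);
                List<String> successors =
                        graph.computeIfAbsent(prefix, key -> new ArrayList<>());
                if (!successors.contains(suffix)) {
                    successors.add(suffix); // keep each edge only once
                }
                // Make sure the target node exists even if it has no successors.
                graph.computeIfAbsent(suffix, key -> new ArrayList<>());
            }
        }
        return graph;
    }

    // Follows the path from `start` while the current node has exactly one
    // successor, appending one base per step. Simplification: a real assembler
    // would also check that each node on the path has a single predecessor.
    static String spell(Map<String, List<String>> graph, String start) {
        StringBuilder contig = new StringBuilder(start);
        Set<String> seen = new HashSet<>(); // guards against looping forever on cycles
        String node = start;
        while (seen.add(node) && graph.get(node).size() == 1) {
            node = graph.get(node).get(0);
            contig.append(node.charAt(node.length() - 1));
        }
        return contig.toString();
    }

    public static void main(String[] args) {
        // Two overlapping toy reads (made up for this example).
        Map<String, List<String>> graph = build(List.of("ATGGC", "GGCGT"), 3);
        graph.forEach((node, next) -> System.out.println(node + " -> " + next));
        System.out.println(spell(graph, "AT"));
    }
}
```

In this toy input the two reads overlap on GGC, so the graph forms a single chain and the walk from AT recovers the merged sequence. With a repeat in the input, some node would have more than one successor and the walk would stop there, which is the ambiguity the paragraph above refers to.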

Addressing this limitation, we have developed a new assembly program, Contrail, that uses Hadoop for de novo assembly of large genomes from short sequencing reads. Similar to other leading short read assemblers, Contrail relies on the graph-theoretic framework of de Bruijn graphs. However, unlike these programs, which require large RAM resources, Contrail relies on Hadoop to iteratively transform an on-disk representation of the assembly graph, allowing an in-depth analysis even for large genomes. Preliminary results show that Contrail’s contigs are of similar size and quality to those generated by Velvet when applied to small (bacterial) genomes, but that Contrail provides vastly superior scaling capabilities when applied to large genomes. We are also developing extensions to Contrail to efficiently compute a traditional overlap-graph based assembly of large genomes within Hadoop, a strategy that will be especially valuable as read lengths increase beyond 100bp.
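To give a feel for why graph construction fits the MapReduce model, here is a minimal sketch that simulates one map/shuffle/reduce round with plain Java collections. It does not use the Hadoop API and is not Contrail's implementation; the class name MapReduceGraphSketch, the methods map and reduce, the toy reads, and k = 3 are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; an in-memory stand-in for one MapReduce round.
class MapReduceGraphSketch {

    // Map phase: each read is handled independently, so reads could be split
    // across many workers. For every k-mer, emit (prefix node, suffix node).
    static List<Map.Entry<String, String>> map(String read, int k) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i++) {
            String prefix = read.substring(i, i + k - 1);
            String suffix = read.substring(i + 1, i + k);
            emitted.add(Map.entry(prefix, suffix));
        }
        return emitted;
    }

    // Shuffle and reduce phase: group the emitted pairs by node and merge them
    // into one adjacency record per node. In a real cluster these records would
    // be written to disk and serve as the input of the next round.
    static Map<String, TreeSet<String>> reduce(List<Map.Entry<String, String>> emitted) {
        Map<String, TreeSet<String>> records = new TreeMap<>();
        for (Map.Entry<String, String> pair : emitted) {
            records.computeIfAbsent(pair.getKey(), key -> new TreeSet<>())
                   .add(pair.getValue());
        }
        return records;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> emitted = new ArrayList<>();
        for (String read : List.of("ATGGC", "GGCGT")) { // toy reads, made up
            emitted.addAll(map(read, 3));
        }
        // One line per node: the node, a tab, then its set of successors.
        reduce(emitted).forEach((node, next) -> System.out.println(node + "\t" + next));
    }
}
```

The point of the sketch is that no step needs the whole graph in memory at once: the map step sees one read at a time and the reduce step sees one node's pairs at a time, which is the property that lets a framework like Hadoop keep the graph on disk between rounds.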

Contrail enables de novo assembly of large genomes from short reads by bridging research in computational biology with research in high-performance computing. This combination is essential in light of the large data sets involved, and has the potential to unlock discoveries of critical magnitude. Whereas the published analysis of the African and Asian human individuals used read mapping to discover conserved regions and regions with small polymorphisms, de novo assembly has the unique potential to also discover large-scale polymorphisms between these individuals and the reference human genome. Mapping the large-scale differences is an important step towards a better understanding of our own biology, and may reveal previously unknown characteristics of the human genome related to health or disease. Furthermore, a short read assembler for large genomes is also essential for sequencing the vast numbers of complex organisms that have never been sequenced before, and will directly contribute to new biological knowledge.

Release History

Version 0.8.2 Oct 13, 2010

Initial public release