De Novo Genome Assembly with PacBio Reads

CEES Extra seminar by Jason Miller from the J. Craig Venter Institute

Abstract

Second-generation DNA sequencing technology has enabled rapid reconstruction of genomes large and small. Compared to first-generation Sanger technology, machines such as the Illumina HiSeq 2000 generate higher coverage at lower cost. Regrettably, their shorter read length reduces the quality and extent of reconstruction of repetitive genomes. Third-generation platforms offer read lengths that surpass those of Sanger sequencing but fall short of second-generation platforms in throughput and quality. Third-generation reads have been used to improve and validate existing assemblies. They have also been used at very high coverage to generate high-quality assemblies of moderate-size genomes using pre-assembly correction methods. Working with the Celera Assembler, we are pushing Sanger-era assembly software to exploit intermediate levels of third-generation sequence data with and without a correction step. We applied the third-generation PacBio RS II sequencing technology to the challenging genomes of an alga (Pelagomonas), a plant (Medicago truncatula), and a fish (Salmo salar). We compared state-of-the-art Illumina assemblies to assemblies that incorporated 20x PacBio data with and without correction. We demonstrate contig size gains and suggest combinations of other data types that enlarge scaffolds. These assembly methods should enable more large-genome projects to exploit third-generation sequencing.

Jason Miller
J. Craig Venter Institute, Rockville MD USA

Published June 13, 2014 1:06 PM - Last modified June 20, 2014 10:39 AM