Compressing short reads from high-throughput DNA sequencing
The goal of this task is to explore whether known properties of short-read data sets can be used to develop tailored algorithms that compress better than general-purpose data compression methods.
Newer DNA sequencing technologies parallelize the sequencing process,
generating enormous amounts of data in the form of short reads, up to
several terabytes per sequencing run. These short reads are then run
through alignment software, which maps the reads to a reference genome.
It is often of interest to save this data at several levels: as full raw data, as unaligned reads, and as reads aligned against a reference genome.
The task is to explore different approaches tailored to compressing
these kinds of DNA data sets. Because of how the short reads are
sequenced, they contain huge amounts of overlapping data. The first
aspect of the task is to use this overlap and similarity between the
reads to preprocess them for compression.
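As one possible illustration of such preprocessing, the sketch below sorts the reads so that similar ones become neighbours and then stores each read as the length of the prefix it shares with the previous read plus the remaining suffix (front coding). The function names are hypothetical. The sketch assumes that the original order of the reads does not need to be preserved, and it only exploits shared prefixes, which is a simplification of the general overlaps found in real read sets.

```python
def front_encode(reads):
    """Sort reads and store each as (shared-prefix length, remaining suffix)."""
    encoded = []
    previous = ""
    for read in sorted(reads):
        # Count how many leading bases this read shares with the previous one.
        shared = 0
        limit = min(len(read), len(previous))
        while shared < limit and read[shared] == previous[shared]:
            shared += 1
        encoded.append((shared, read[shared:]))
        previous = read
    return encoded


def front_decode(encoded):
    """Rebuild the sorted list of reads from front-coded records."""
    reads = []
    previous = ""
    for shared, suffix in encoded:
        read = previous[:shared] + suffix
        reads.append(read)
        previous = read
    return reads


reads = ["ACGTAC", "ACGTTT", "ACGGAA", "TTGACA"]
encoded = front_encode(reads)
# encoded == [(0, "ACGGAA"), (3, "TAC"), (4, "TT"), (0, "TTGACA")]
restored = front_decode(encoded)
```

The encoded records could then be passed on to a general-purpose compressor; whether this actually improves on compressing the raw reads is part of what the task would investigate.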
A second aspect would be to compress already aligned reads, using the
reference genome that they have been mapped against.
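A minimal sketch of this idea, under simplifying assumptions: each read is stored as its position in the reference, its length, and the list of bases where it differs from the reference. The function names are hypothetical, and the sketch assumes ungapped alignments (substitutions only, no insertions or deletions) on the forward strand, which real aligner output does not guarantee.

```python
def encode_aligned(read, reference, position):
    """Represent a read as (position, length, mismatches against the reference)."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the offsets where the read disagrees with the reference.
    mismatches = [
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    ]
    return position, len(read), mismatches


def decode_aligned(record, reference):
    """Rebuild the read from the reference and the stored mismatches."""
    position, length, mismatches = record
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
record = encode_aligned("TACCTT", reference, 3)
# record == (3, 6, [(3, "C")])
restored = decode_aligned(record, reference)
```

A read that matches the reference exactly reduces to a position and a length, so the gain depends on how closely the reads agree with the reference.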
A third aspect is to look at the quality score part of short reads,
and explore possible measures to trim the data.
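One possible measure, shown here only as a sketch, is to reduce the resolution of the quality scores by mapping each score to the lower bound of a coarse bin. This is lossy. The function name and the bin levels are hypothetical, and the sketch assumes quality strings in the common Phred+33 ASCII encoding.

```python
from bisect import bisect_right


def quantize_quality(quality, levels=(0, 10, 20, 30), offset=33):
    """Map each Phred score to the largest bin level that does not exceed it."""
    out = []
    for char in quality:
        score = ord(char) - offset
        # Index of the last level <= score; clamp so negative scores use levels[0].
        index = max(bisect_right(levels, score) - 1, 0)
        out.append(chr(levels[index] + offset))
    return "".join(out)


binned = quantize_quality("II5#+")
# Scores 40, 40, 20, 2, 10 become 30, 30, 20, 0, 10, so binned == "??5!+"
```

With fewer distinct symbols the quality string becomes more compressible, at the cost of precision that downstream analysis may or may not need.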
Students should be skilled in programming. An interest in mathematics may be an advantage. No prior knowledge of biology is needed.