DNA sequencing technology is developing very rapidly. Next Generation Sequencing (NGS), also known as High-Throughput Sequencing (HTS) or Deep Sequencing, has revolutionized the speed and cost of DNA sequencing. The latest machines determine the sequence of nucleotides, the building blocks of DNA, far faster than was possible only 6 years ago. In the course of two weeks, one machine can sequence up to 600 billion base pairs, divided into 6 billion short sequences of 100 base pairs each. For comparison, the entire human genome consists of about 3 billion base pairs, so a single run yields roughly 200 times as many bases as one genome contains. The cost of sequencing has also been reduced dramatically. However, for sequencing to be more widely used clinically, even more cost-effective solutions are required.
Sequencing may be performed on a DNA sample from a human individual in order to identify the variants present in the individual's genetic profile as compared to a human reference sequence. When such a sample has been sequenced, all the short sequences (reads) have to be mapped back to the correct location on the reference genome. Due to sequencing errors and variation in the genome between human individuals, the short sequences may not match perfectly, making the task of finding the correct location difficult. A large number of programs to map such short sequences against a reference genome have been developed. Examples of such applications are Bowtie, BWA, Novoalign, and SOAP. The quality of the results and the speed of such programs vary, and they also have many parameters that can be adjusted.
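As a minimal sketch of the mapping problem described above, the following Python example places a short read at every position of a reference and keeps the placements that stay within a mismatch budget. The function names (hamming, map_read), the max_mismatches parameter, and the toy sequences are hypothetical and chosen only for illustration. The exhaustive scan is not how the programs listed above work: practical mappers rely on an index of the reference to avoid it, and many also handle insertions and deletions, which this sketch ignores.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference: str, read: str, max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """Return (position, mismatches) for each placement within the budget.

    Illustrative brute-force scan: substitutions only, no indels, no index.
    Hits are sorted by mismatch count, then by position.
    """
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


# Toy data (assumed for illustration): the read equals reference[9:17]
# except for one substituted base, mimicking a sequencing error or variant.
reference = "ACGTTGCATGCCGATAGCTAGGCTA"
read = "GCCGTTAG"
print(map_read(reference, read, max_mismatches=1))  # prints [(9, 1)]
```

Each read is mapped independently of all others, which is what makes the problem a natural candidate for parallel hardware; the cost of this naive scan grows with the product of reference length and read length for every read.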
In order to gain higher performance on current hardware architectures, some form of parallelism must be utilized. Graphics Processing Units (GPUs) have emerged as the most cost-effective solution, with regard to both performance per cost and performance per watt, compared to other parallelization strategies. Furthermore, programming models such as OpenCL, CUDA and C++ AMP make it possible to harness the power of GPUs outside of computer graphics. Yet, to fully utilize the hardware, algorithms must take into account communication and synchronization policies, as well as the size and speed of the memory and cache system.
Recently, a few GPU-based programs for short read mapping, using different strategies, have been published (e.g. SARUMAN, GPU-RMAP, CUSHAW, BarraCUDA, SOAP3, MAROSE). The performance of these tools varies, both in mapping accuracy and in the time and memory required, but some of them show promising results, at least under certain circumstances.
In this project, the task will be to study the existing methods and tools, and then implement some form of short read mapping method using a GPU. The performance relative to existing methods will be evaluated.
The project is suitable for anyone who is interested in bioinformatics and has programming experience, preferably with parallelization in one form or another.
Supervisors: Torbjørn Rognes (BMI/IFI), Leonardo Lamorgese and Johan Simon Seland (SINTEF Applied Mathematics)