Benchmarking system for genome assembly and/or variant calling software
The goal of this project is to develop a system that could be used for the assessment of how well various tools for genome assembly, short sequence mapping or variant calling perform.
Figure: Sequencing reads from next-generation sequencing data, mapped to a reference genome. In this case the reads come in pairs, so-called paired-end sequencing.
DNA sequencing technology is developing very rapidly. Next Generation Sequencing (NGS), also known as High-Throughput Sequencing (HTS) or deep sequencing, has revolutionized the speed and cost of DNA sequencing. With the latest machines, one can determine the sequence of nucleotides, the building blocks of DNA, far faster than was possible only a few years ago. Typically, in the course of two weeks, one machine can sequence up to 600 billion base pairs, divided into 6 billion short sequences of 100 base pairs each. For comparison, the entire human genome consists of about 3 billion base pairs.
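The figures above can be checked with a little arithmetic. The following Python sketch uses only the numbers quoted in the text; the constant and function names are illustrative, and the fold-coverage value is a back-of-the-envelope figure derived from those numbers, not a specification of any particular machine.

```python
# Back-of-the-envelope check of the throughput figures quoted above.
TOTAL_BASES = 600_000_000_000   # up to 600 billion base pairs per two-week run
READ_LENGTH = 100               # base pairs per short sequence (read)
HUMAN_GENOME = 3_000_000_000    # approximate size of the human genome


def read_count(total_bases: int, read_length: int) -> int:
    """Number of reads obtained when the output is split into equal-length reads."""
    return total_bases // read_length


def fold_coverage(total_bases: int, genome_size: int) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


print(read_count(TOTAL_BASES, READ_LENGTH))      # 6000000000
print(fold_coverage(TOTAL_BASES, HUMAN_GENOME))  # 200.0
```

In other words, a single run of this size would in principle cover a human-sized genome about 200 times over, assuming all reads came from one genome and mapped uniformly.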
When sequencing an entirely new genome, e.g. that of a bacterium or a fish, it is essential to piece these short sequences together into contiguous stretches that are as long as possible. This is the task performed by a genome assembler, such as the Celera Assembler, Velvet, Newbler, and numerous other software tools. Errors in the data and problematic regions of the genome make this a difficult task. The quality of the results and the speed of these programs vary, and the programs also have many parameters that can be adjusted.
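One way to make "quality of the results" concrete is a contiguity statistic. The sketch below computes N50, a statistic commonly reported for assemblies; it is shown only as an example of a measure a benchmarking system might include, not as something this project prescribes. The function name and the contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that the contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths from two assemblers run on the same data set.
assembler_a = [100, 80, 60, 40, 20]
assembler_b = [200, 50, 30, 10, 10]

print(n50(assembler_a))  # 80
print(n50(assembler_b))  # 200
```

Both toy assemblies contain 300 bases in total, but the second concentrates them in one long contig and therefore scores higher. N50 says nothing about correctness, so a real evaluation would need to combine it with accuracy measures.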
In other cases the data come from a relatively well-known genome, e.g. the human genome. Sequencing is then usually performed in order to identify the variants present in one individual’s genetic profile, or the variants present in particular cells (e.g. in a cancer cell vs. a normal cell), as compared to a reference sequence. In these cases, it is not necessary to assemble the genome. Instead, the goal is to map all the short sequences back to the reference genome, where possible, and to identify where they differ from it. This is called mapping and variant calling. Because of sequencing errors and genomic variation between individuals, large or small deviations are to be expected. A number of programs that map such short sequences against a reference genome have been developed, for example Bowtie, BWA, Novoalign, and SOAP. There is also specialized software for discovering variants.
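To illustrate how the output of a variant caller could be assessed, the sketch below compares a set of called variants with a known truth set and reports precision and recall. It is a minimal example under strong assumptions: the `Variant` type, the `score_calls` function and all data are hypothetical, and variants are matched by exact equality, which ignores the fact that the same variant can sometimes be represented in more than one way.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """A simple variant record: chromosome, 1-based position, reference and alternative allele."""
    chrom: str
    pos: int
    ref: str
    alt: str


def score_calls(called, truth):
    """Compare called variants with a truth set using exact matching."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}


# Hypothetical truth set and the calls of one hypothetical tool.
truth = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
    Variant("chr2", 300, "T", "C"),
]
called = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "A"),
    Variant("chr2", 500, "A", "T"),  # not in the truth set
]

print(score_calls(called, truth))
```

With simulated data, where the true variants are known by construction, the same function could be applied unchanged to the output of each tool under comparison.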
All of these tasks are challenging. Numerous methods have been developed, and new approaches are constantly being tried out. However, the assessment of how well such algorithms perform lags behind the development of the methods themselves. As a result, although a plethora of methods is available, it is not easy to know which one will perform best under a given set of conditions.
The goal of this project is to develop a system that could be used for the assessment of how well various tools for genome assembly, short sequence mapping or variant calling perform. Data sets and evaluation procedures should be established that can be automatically applied to different methods.
The task is suitable for anyone with an interest in bioinformatics who has some programming experience and basic knowledge of statistics.
The project will be carried out in collaboration with the Centre for Ecological and Evolutionary Synthesis (CEES) at the Department of Biology, UiO.
Research group: BMI (biomedical informatics)