Benchmarking of variant calling and mutation detection tools
DNA sequencing technology is developing very rapidly. Next Generation Sequencing (NGS), also known as High-Throughput Sequencing (HTS) or deep sequencing, has revolutionized the speed and cost of DNA sequencing. The latest machines can determine the sequence of nucleotides, the building blocks of DNA, far faster than was possible only 6 years ago. In the course of two weeks, one machine can sequence up to 600 billion base pairs, divided into 6 billion short sequences of 100 base pairs each. For comparison, the entire human genome consists of about 3 billion base pairs. The cost of sequencing has also been reduced dramatically. However, for sequencing to be used more widely in the clinic, even more cost-effective solutions are required.
Sequencing may be performed on a DNA sample from a human individual in order to identify the variants present in that individual’s genetic profile, or the mutations that have occurred in particular cells (e.g. in a cancer cell vs. a normal cell), as compared to a reference sequence. When such a sample has been sequenced, all the short sequences first have to be mapped back to the correct location on the reference genome. Due to sequencing errors and variation in the genome between human individuals, the short sequences may not match perfectly, making the task of finding the correct location difficult. A large number of programs to map such short sequences against a reference genome have been developed. The quality of the results and the speed of such programs vary, and they also have many parameters that can be adjusted. Subsequently, any differences from the reference need to be determined. This is called variant calling. The variants may be relatively simple, like single nucleotide variants (SNVs) or polymorphisms (SNPs) and small deletions or insertions, or they may be complex (large insertions and deletions, inversions, repetitions, translocations, etc.). Several tools for discovering variants and detecting mutations are available.
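To make the two steps concrete, the following is a toy sketch in Python, not a description of any real mapper or variant caller. It assumes ungapped placement by minimum mismatch count and a simple majority vote per position; the names best_position and call_snvs, the thresholds max_mismatches and min_depth, and the example sequences are all hypothetical. Positions are 0-based.

```python
from collections import Counter, defaultdict


def best_position(read, reference):
    """Return (offset, mismatches) for the ungapped placement of `read`
    on `reference` with the fewest mismatches (first one wins on ties)."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_snvs(reads, reference, max_mismatches=2, min_depth=2):
    """Map each read, pile up the observed bases per reference position,
    and report positions where a non-reference base has majority support."""
    pileup = defaultdict(Counter)
    for read in reads:
        placement = best_position(read, reference)
        # Skip reads that do not fit anywhere or fit too poorly (assumed threshold).
        if placement is None or placement[1] > max_mismatches:
            continue
        offset = placement[0]
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, support = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos] and support > depth / 2:
            calls.append((pos, reference[pos], base))
    return calls


# Hypothetical reference and reads; three reads carry a C-to-T change at position 10.
reference = "ACGTTGCAAGCTTAGGCTAACGTA"
reads = ["GCAAGTTT", "AGTTTAGG", "TTTAGGCT", "ACGTTGCA"]
calls = call_snvs(reads, reference)
print(calls)
```

Real tools handle insertions and deletions, base qualities, paired reads and diploid genotypes, none of which this sketch attempts.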
Both variant calling and mutation detection are challenging. Numerous methods have been developed, and new approaches are constantly being tried out. However, assessing how well such algorithms perform is not trivial. This means that although a plethora of methods are available, it is not easy to know which method will perform best under given conditions.
The goal of this project is to develop a system for assessing how well various tools for variant calling and/or mutation detection perform. Data sets and evaluation procedures should be established that can be applied automatically to different methods.
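One possible shape for such an evaluation procedure is sketched below in Python. It assumes that a data set comes with a known set of true variants and that each variant can be represented as a (chromosome, position, reference allele, alternative allele) tuple compared by exact match. The functions evaluate_calls and benchmark, the tool names and the example variants are hypothetical and only illustrate the idea.

```python
def evaluate_calls(called, truth):
    """Compare a tool's variant calls with a truth set by exact match
    and return counts plus precision, recall and F1."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and true
    fp = len(called - truth)   # called but not true
    fn = len(truth - called)   # true but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


def benchmark(calls_by_tool, truth):
    """Apply the same evaluation to the output of every tool."""
    return {tool: evaluate_calls(calls, truth)
            for tool, calls in calls_by_tool.items()}


# Hypothetical truth set and tool outputs, for illustration only.
truth = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 300, "G", "GA"),
    ("chr2", 9000, "TC", "T"),
}
calls_by_tool = {
    "tool_a": {
        ("chr1", 1000, "A", "G"),
        ("chr1", 2500, "C", "T"),
        ("chr2", 300, "G", "GA"),
        ("chr3", 42, "A", "C"),    # false positive
    },
    "tool_b": {
        ("chr1", 1000, "A", "G"),  # finds only one true variant
    },
}
results = benchmark(calls_by_tool, truth)
for tool, metrics in results.items():
    print(tool, metrics)
```

Exact matching is a simplifying assumption: the same insertion or deletion can be written in several equivalent ways, so a real evaluation would need to normalize variant representations before comparing them.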
The project is suitable for anyone with an interest in bioinformatics who has some programming experience and possesses basic knowledge of statistics.