Computational detection of chimeric sequences
In metagenomics, DNA from a mixture of different species is sequenced. By analysing the resulting sequences, one can identify the organisms present and estimate the biological diversity in the sample. Certain genes are present in all living organisms, and these are often used as a “barcode” to identify the different species found in the sample. Samples may be taken from sites as diverse as human skin or gut, deep oceans, lakes, or soil. Recently, such studies of the human gut microbiome have provided novel insights into how important the bacteria in our gut are for our metabolism and our immune system.
When the DNA is processed before sequencing, a method known as PCR (polymerase chain reaction) is used to amplify the amount of DNA. It works as a kind of copying machine, ideally doubling the amount of DNA in each cycle. Various errors may occur during this process. One problem is the generation of chimeric sequences, or chimeras. A chimera is a sequence that consists of segments from two or more different sequences joined together, instead of a copy of a single sequence. Undetected chimeras may confound metagenomic studies by giving the impression that the biological diversity is larger than it really is.
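To make the structure of a chimera concrete, the following minimal sketch joins the start of one sequence to the end of another at a chosen breakpoint. The function name `make_chimera`, the toy sequences, and the assumption that both parents are already aligned position by position are illustrative only and are not taken from any existing tool.

```python
def make_chimera(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Join the first `breakpoint` bases of parent_a to the rest of parent_b.

    Assumption: the two parents are aligned position by position, so the
    same index refers to the same site in both sequences.
    """
    if not 0 < breakpoint < min(len(parent_a), len(parent_b)):
        raise ValueError("breakpoint must fall strictly inside both parents")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


# Toy, made-up sequences (not real barcode genes).
parent_a = "ACGTACGTACGTACGT"
parent_b = "TTGCATTGCATTGCAT"

chimera = make_chimera(parent_a, parent_b, 8)
print(chimera)  # first half copied from parent_a, second half from parent_b
```

If such a sequence is counted as a genuine organism, the sample appears to contain three distinct sequences rather than two, which is how chimeras inflate diversity estimates.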
A few computational tools exist for detecting chimeric sequences. They analyse each sequence to determine whether it is likely to have arisen from a combination of two or more other sequences, either in the same sample (de novo chimera detection) or in a reference database (reference-based chimera detection). The most popular tool is UCHIME (Edgar et al. 2011) in the USEARCH package (Edgar 2010). We have recently developed a tool called VSEARCH (Rognes et al. 2015) that implements a similar chimera detection algorithm, and its code is freely available. During our work with this tool, we have seen that it is possible to improve the accuracy of chimera detection.
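The general idea of testing whether a sequence is better explained by two parents than by one can be sketched as follows. This is a deliberately naive illustration and not the algorithm used by UCHIME or VSEARCH: it assumes that all sequences are pre-aligned and of equal length, it counts plain mismatches, and the function names and the `min_gain` threshold are hypothetical.

```python
from itertools import permutations


def mismatches(x: str, y: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))


def best_two_parent_model(query: str, candidates: list[str]):
    """Try every ordered parent pair and breakpoint; return the best fit.

    Returns (mismatch_count, left_parent, right_parent, breakpoint),
    or None if fewer than two candidates are given.
    """
    best = None
    for left, right in permutations(candidates, 2):
        for bp in range(1, len(query)):
            d = (mismatches(query[:bp], left[:bp])
                 + mismatches(query[bp:], right[bp:]))
            if best is None or d < best[0]:
                best = (d, left, right, bp)
    return best


def looks_chimeric(query: str, candidates: list[str], min_gain: int = 2) -> bool:
    """Flag the query if a two-parent model beats the closest single
    candidate by at least `min_gain` mismatches (hypothetical threshold)."""
    if len(candidates) < 2:
        return False
    single = min(mismatches(query, c) for c in candidates)
    two = best_two_parent_model(query, candidates)
    return single - two[0] >= min_gain


# Toy, made-up sequences: the query is the first half of parent_a
# followed by the second half of parent_b.
parent_a = "ACGTACGTACGTACGT"
parent_b = "TTGCATTGCATTGCAT"
query = "ACGTACGTCATTGCAT"

print(looks_chimeric(query, [parent_a, parent_b]))
print(best_two_parent_model(query, [parent_a, parent_b]))
```

In a de novo setting the candidates would be other sequences from the same sample, while in a reference-based setting they would come from a reference database. The exhaustive search over pairs and breakpoints scales poorly with the number of candidates, which illustrates why speed is a concern for practical tools.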
The aim of this project is to improve the detection of chimeric sequences in terms of both accuracy and speed. The method should be implemented in a tool and compared with existing tools.
- Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26 (19): 2460-2461. doi:10.1093/bioinformatics/btq461
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R (2011) UCHIME improves sensitivity and speed of chimera detection. Bioinformatics, 27 (16): 2194-2200. doi:10.1093/bioinformatics/btr381
- Rognes T, Mahé F, Flouri T et al. (2015) VSEARCH - Versatile open-source tool for metagenomics. GitHub. https://github.com/torognes/vsearch