Taxonomic classification of microbes based on DNA barcodes using machine learning
Recently, the importance of the microbes in and around our bodies has received increased attention. Numerous projects have sought to identify the vast range of bacteria, fungi, smaller eukaryotes and other organisms present in our gut, on our skin and at other body sites. Their diversity and the presence or absence of certain species have been shown to be important for our health and for avoiding disease. It is therefore important to be able to identify microbes. This task is called taxonomic classification, and is often done by analyzing DNA barcodes in various ways.
DNA barcodes are certain substrings in the DNA of an organism that can be used to identify the species of the organism. The substring, or marker, is present in a wide range of organisms, but with some sequence variation, which enables different organisms to be separated. There are several marker sequences in use, and these are available in marker-specific databases. The marker genes usually contain both fixed and variable regions. Unfortunately, a single marker may not be able to separate all species, because the sequence may be identical even in different species.
Taxonomic classification based on DNA markers can be considered a supervised learning problem. Different approaches can be used for this, including Nearest-Neighbour, Hidden Markov Models and Naïve Bayes; some of these are described by Vinje et al. (2015). They are often based on a profile of the k-mers (substrings of length k) present in the sequences, but a direct alignment using BLAST may also be used. The naïve Bayes classifier from the RDP project is perhaps the most commonly used, but some alternatives, like the SINTAX classifier described in a preprint by Edgar (2016), have recently been proposed. A common problem is overclassification, where a sequence is assigned to a more specific taxon than the evidence supports, for example when its true taxon is missing from the reference database.
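To make the k-mer based approach concrete, the following is a minimal sketch of a multinomial naïve Bayes classifier over k-mer counts. It is an illustration only and not the RDP or SINTAX implementation; the function names, the toy sequences, the taxon labels, the uniform prior and the pseudocount are all assumptions made for the example.

```python
from collections import Counter
import math

DNA = set("ACGT")


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping any k-mer with an ambiguous base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= DNA:
            counts[kmer] += 1
    return counts


def train(labelled_seqs, k):
    """Pool k-mer counts per taxon from (taxon, sequence) pairs."""
    model = {}
    for taxon, seq in labelled_seqs:
        model.setdefault(taxon, Counter()).update(kmer_counts(seq, k))
    return model


def classify(model, seq, k, pseudocount=1.0):
    """Return (taxon, log-likelihood) for the best-scoring taxon.

    Assumes a uniform prior over taxa and additive smoothing so that
    k-mers unseen in training do not give a zero probability.
    """
    n_kmers = 4 ** k
    query = kmer_counts(seq, k)
    best_taxon, best_score = None, -math.inf
    for taxon, counts in model.items():
        total = sum(counts.values()) + pseudocount * n_kmers
        score = sum(
            n * math.log((counts[kmer] + pseudocount) / total)
            for kmer, n in query.items()
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon, best_score


# Hypothetical toy reference data; real marker sequences are far longer.
reference = [
    ("Taxon_A", "ACGTACGTACGTACGT"),
    ("Taxon_B", "TTGGCCAATTGGCCAA"),
]
model = train(reference, k=3)
taxon, log_lik = classify(model, "ACGTACGA", k=3)
print(taxon, round(log_lik, 2))
```

Note that this sketch always returns one of the training taxa, even for a query from a taxon it has never seen. That is one way overclassification can arise, and it is why practical classifiers usually attach some confidence measure to each assignment.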
The aim of this project is to explore whether other machine learning and classification methods can be used successfully for this task, and to compare them with existing methods.
- Vinje H, Liland KH, Almøy T, Snipen L (2015) Comparing K-mer based methods for improved classification of 16S sequences. BMC Bioinformatics, 16, 205. doi: 10.1186/s12859-015-0647-4
- Edgar R (2016) SINTAX: a simple non-Bayesian taxonomy classifier for 16S and ITS sequences. bioRxiv, 074161. doi: 10.1101/074161