Developing a tool to remove selected organisms from DNA sequencing data
Recently, the importance of the microbes in and around our bodies has received increasing attention. Numerous projects have sought to identify the vast range of bacteria, viruses, fungi, smaller eukaryotes and other microorganisms present in our gut, on our skin and at other body sites. Their diversity and the presence or absence of certain species have been shown to be important for health and disease. The microorganisms in soil, in the oceans and in selected other environments have also been studied in great detail.
To identify the microorganisms present (the microbiota), samples containing a mixture of organisms are taken and their genomic DNA (the microbiome) is extracted. Then either all of the DNA present (metagenomics) or selected DNA marker genes (barcodes) are sequenced, resulting in a huge number of DNA sequencing reads. This data is then analysed and compared to sequences in reference databases to identify the species present.
When sequencing the microorganisms in a sample from the human gut or skin, the sample will inevitably contain some human cells as well. The same is true of gut samples from other animals, e.g. fish, and of samples from other host organisms. Gut samples may also contain DNA from food (e.g. plants and animals), and there may be various other contaminants. Furthermore, DNA from the phi X 174 bacteriophage (a virus) is often added before sequencing as a positive control for quality control purposes. All of these sequences are unwanted and should be removed before further analysis. Removing them at an early stage saves storage space and reduces the computational resources and time needed for analysis. For human samples it is also desirable to remove human sequences early because human genetic data needs to be protected due to its sensitive nature; the remaining data can then be handled more easily.
To filter the dataset and remove selected organisms, we can use a combination of tools that search and align the reads against reference databases containing the sequences to be removed. Tools such as Bowtie, BWA, BLAST and BBDuk differ in accuracy and speed. A read may not match the reference perfectly, for example because of sequencing errors, but should often be removed anyway. It is important not to eliminate sequences that should be retained, while at the same time removing as many of the unwanted sequences as possible. Because both the sequencing reads and the reference databases are large, filtering can be very time consuming, especially if high accuracy is required. There is thus a trade-off between accuracy and speed.
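The filtering idea described above can be illustrated with a minimal Python sketch. It is not how Bowtie, BWA, BLAST or BBDuk work internally, and it is not the tool proposed here; it is a toy k-mer matching filter written only to show the principle. All names (build_kmer_index, is_unwanted, filter_reads) and the parameters k and min_fraction are hypothetical choices made for this illustration. A read is flagged as unwanted when at least min_fraction of its k-mers occur in the reference on either strand, so a read with a sequencing error can still be removed if the threshold is permissive enough.

```python
from typing import Iterable, Iterator, List, Set, Tuple

# Translation table for complementing DNA bases (upper case only;
# sequences are upper-cased before use).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping substring of length k."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_index(references: Iterable[str], k: int) -> Set[str]:
    """Collect all k-mers of the unwanted references, on both strands."""
    index: Set[str] = set()
    for ref in references:
        index.update(kmers(ref, k))
        index.update(kmers(reverse_complement(ref), k))
    return index


def is_unwanted(read: str, index: Set[str], k: int, min_fraction: float) -> bool:
    """Flag a read if enough of its k-mers are found in the index."""
    total = len(read) - k + 1
    if total <= 0:
        # Read shorter than k: nothing to compare, so keep it.
        return False
    hits = sum(1 for kmer in kmers(read, k) if kmer in index)
    return hits / total >= min_fraction


def filter_reads(
    reads: Iterable[Tuple[str, str]],
    index: Set[str],
    k: int = 8,
    min_fraction: float = 0.5,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split (name, sequence) pairs into (kept, removed) lists."""
    kept: List[Tuple[str, str]] = []
    removed: List[Tuple[str, str]] = []
    for name, seq in reads:
        if is_unwanted(seq, index, k, min_fraction):
            removed.append((name, seq))
        else:
            kept.append((name, seq))
    return kept, removed
```

In this sketch, lowering min_fraction removes more reads that contain errors but increases the risk of discarding reads that should be retained, while a larger k makes chance matches less likely but makes each mismatch invalidate more k-mers. A realistic tool would read FASTQ files and use an aligner or a memory-efficient index rather than a Python set.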
At the moment, researchers appear to remove these unwanted sequences in a more or less ad hoc way, and there does not seem to be a simple tool that removes them efficiently and accurately. The aim of this project is to build such a tool and to test how well it works on synthetic and real datasets. The tool could be designed by combining existing tools in a suitable manner.
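On a synthetic dataset the origin of every read is known, so the quality of a filter can be expressed as numbers. The following Python sketch shows one possible way to do this; the names FilterScore and score_filter and the choice of metrics are assumptions made for illustration and are not prescribed by the project. It counts how many unwanted reads were removed (sensitivity), how many of the removed reads were truly unwanted (precision), and how many wanted reads were lost (false removal rate).

```python
from dataclasses import dataclass
from typing import Set


def _ratio(numerator: int, denominator: int) -> float:
    """Divide safely; an empty denominator gives 0.0 (a convention of this sketch)."""
    return numerator / denominator if denominator else 0.0


@dataclass
class FilterScore:
    true_positives: int   # unwanted reads that were removed
    false_positives: int  # wanted reads that were removed by mistake
    false_negatives: int  # unwanted reads that were left in the data
    true_negatives: int   # wanted reads that were kept

    @property
    def sensitivity(self) -> float:
        """Fraction of the unwanted reads that the filter removed."""
        return _ratio(self.true_positives,
                      self.true_positives + self.false_negatives)

    @property
    def precision(self) -> float:
        """Fraction of the removed reads that really were unwanted."""
        return _ratio(self.true_positives,
                      self.true_positives + self.false_positives)

    @property
    def false_removal_rate(self) -> float:
        """Fraction of the wanted reads that were wrongly removed."""
        return _ratio(self.false_positives,
                      self.false_positives + self.true_negatives)


def score_filter(all_ids: Set[str],
                 unwanted_ids: Set[str],
                 removed_ids: Set[str]) -> FilterScore:
    """Compare the reads a filter removed with the known ground truth."""
    wanted_ids = all_ids - unwanted_ids
    return FilterScore(
        true_positives=len(removed_ids & unwanted_ids),
        false_positives=len(removed_ids & wanted_ids),
        false_negatives=len(unwanted_ids - removed_ids),
        true_negatives=len(wanted_ids - removed_ids),
    )
```

Reporting these values together with the running time for each candidate tool or tool combination would make the trade-off between accuracy and speed explicit. For real datasets the ground truth is not known, so this kind of scoring applies only to synthetic or spiked-in data.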
The project requires programming experience. Some biological insight would be an advantage, but is not required.