Optimal analysis of microbiome sequencing data in clinical medicine
In recent years, the importance of the microbes living in and on our bodies has been increasingly recognized. Numerous projects have sought to identify the vast range of bacteria, fungi, small eukaryotes and other organisms present in our gut (Kummen et al. 2017), on our skin and at other body sites. Their diversity, and the presence or absence of certain species, have been shown to be important for health and disease. To identify the microorganisms present, samples containing a mixture of organisms are collected and their DNA is extracted. Then either all of the DNA (metagenomics) or a selected DNA barcode is sequenced, resulting in a large number of DNA sequencing reads.
DNA barcodes are genetic markers in the DNA of an organism that can be used to identify its species. Such a marker is present in a wide range of organisms, but with some sequence variation, which allows different organisms to be distinguished. A typical marker is the 16S ribosomal RNA (rRNA) gene. There exist large databases of marker gene sequences annotated with the species they belong to, such as the Greengenes, RDP and SILVA databases. The marker genes usually contain both conserved and variable regions. Unfortunately, a single marker may not be able to separate all species, because the marker sequence can be identical in different species.
Researchers at the Norwegian PSC Research Center and the Research Institute of Internal Medicine at Oslo University Hospital have collected gut microbiome samples from a large number of patients. From this material, a region of the 16S rRNA gene has been sequenced, and whole-genome metagenomic sequencing has also been performed. Pipelines for analysis of the datasets have been established. They involve various quality checks, trimming, dereplication, and clustering or assembly of the sequences. The microbial species present and their abundances are then identified, and the species diversity is often estimated. However, many different steps and methods are involved, there are many tools to choose from, and several parameters can be tuned. For instance, to select representative sequences in DNA barcode-based analyses, one could use either clustering methods like UCLUST or Swarm, or the more recent denoising methods Deblur (Amir et al. 2017), DADA or UNOISE.
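To make two of the pipeline steps above concrete, the following is a minimal sketch of dereplication and diversity estimation in plain Python. It is not the established pipeline: the function names, the toy reads and the choice of the Shannon index as the diversity measure are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def dereplicate(reads):
    """Collapse identical reads and count how often each unique sequence occurs."""
    return Counter(read.upper() for read in reads)


def relative_abundance(counts):
    """Convert raw counts per unique sequence into fractions that sum to 1."""
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}


def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over all observed sequences."""
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values() if n > 0)


# Toy reads (hypothetical, far shorter than real 16S amplicon reads).
toy_reads = ["ACGTACGT", "ACGTACGT", "acgtacgt", "TTGACCGA", "TTGACCGA", "GGCATCAA"]

unique = dereplicate(toy_reads)
abund = relative_abundance(unique)
H = shannon_diversity(unique)

for seq, frac in sorted(abund.items(), key=lambda item: -item[1]):
    print(f"{seq}\t{unique[seq]}\t{frac:.3f}")
print(f"Shannon diversity: {H:.3f}")
```

In a real analysis the unique sequences would additionally be clustered or denoised and assigned to taxa before diversity is computed; the sketch skips those steps.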
The aim of this project is to compare the performance of the different approaches and ultimately to improve the established pipelines. In particular, the performance of the 16S rRNA amplicon-based methods should be compared to that of the full metagenome sequencing approaches.
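One simple way such a comparison could be quantified is to compute a dissimilarity between the taxonomic profiles that two approaches produce for the same sample. The sketch below uses the Bray-Curtis dissimilarity as one possible measure. The choice of measure, the function name and the abundance values are assumptions for illustration and are not results from the project.

```python
def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile maps a taxon name to its abundance. Taxa missing from
    one profile are treated as having abundance 0 in that profile.
    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(profile_a) | set(profile_b)
    numerator = sum(abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa)
    denominator = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Hypothetical genus-level relative abundances for one sample (made-up values).
amplicon_profile = {"Bacteroides": 0.5, "Faecalibacterium": 0.3, "Escherichia": 0.2}
metagenome_profile = {
    "Bacteroides": 0.4,
    "Faecalibacterium": 0.3,
    "Escherichia": 0.1,
    "Akkermansia": 0.2,
}

d = bray_curtis(amplicon_profile, metagenome_profile)
print(f"Bray-Curtis dissimilarity: {d:.2f}")
```

A value near 0 would indicate that the two approaches largely agree on the composition of the sample, while a value near 1 would indicate that they share almost no taxa.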
Kummen M, Holm K, Anmarkrud JA, Nygård S, Vesterhus M, Høivik ML, Trøseid M, Marschall HU, Schrumpf E, Moum B, Røsjø H, Aukrust P, Karlsen TH, Hov JR (2017) The gut microbial profile in patients with primary sclerosing cholangitis is distinct from patients with ulcerative colitis without biliary disease and healthy controls. Gut, 66, 611-619. doi: 10.1136/gutjnl-2015-310500
Amir A, McDonald D, Navas-Molina JA, Kopylova E, Morton JT, Xu ZZ, Kightley EP, Thompson LR, Hyde ER, Gonzalez A, Knight R (2017) Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns. mSystems, 2, e00191-16. doi: 10.1128/mSystems.00191-16