Developing a pipeline for the analysis of genetic variants of gut microbes
Modern DNA sequencing technologies generate large amounts of data, and interpreting these data biologically requires advanced computational techniques. One application of modern sequencing is the mapping of bacterial species in the human gut, which can involve the assembly, or building, of bacterial genomes from a mixture of short DNA sequences.
The presence of certain gut microbes and certain gut microbial genes is strongly associated with colorectal cancer (CRC). Bacteria such as Enterococcus faecalis may produce superoxide, which can damage DNA in proliferating cells, leading to chromosomal instability and promoting the development of cancer. While mutations in human cells are a known cause of cancer, it is currently unknown whether traces of genotoxic stress (i.e., mutations) can be detected in commensal bacteria.
In an ongoing study conducted at the Cancer Registry of Norway (https://www.kreftregisteret.no/CRC-biome), we are producing a large amount of sequencing data (metagenomes) from serially collected human gut samples from about 1500 study participants. The samples represent healthy individuals (controls) and individuals with pre-cancers or cancers (cases), each with on average three gigabases of sequence data. In total, the dataset to be analysed comprises over thirteen terabases. Analyses will be performed using UiO’s services for sensitive data (TSD) and the Colossus high-performance computing facility.
The specific aims of the master project are the establishment of bacterial genomes, the identification of genetic variation, and the development of an algorithm for quantifying and classifying genetic variants (i.e., describing mutational processes) in the bacteria inhabiting the human gut. Establishing bacterial genomes requires efficient implementation of bioinformatics tools with robust parameter optimization. Analysing genetic variants requires adapting and applying existing classification strategies commonly used to analyse mutational processes in cancer, and evaluating their classification performance.
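As an illustration of what classifying genetic variants can look like, the sketch below uses a convention that is common in cancer mutational-process analysis: each single-nucleotide substitution is reported together with its two flanking bases and normalised so that the reference base is a pyrimidine (C or T), giving 96 possible classes. This is only a minimal sketch; the function names and input format are hypothetical, and the project may well adopt a different scheme for bacterial genomes.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_snv(context: str, alt: str) -> str:
    """Classify a substitution by trinucleotide context (hypothetical helper).

    context: three bases with the reference base in the middle.
    alt:     the alternate (mutated) base.
    Returns a label such as 'A[C>T]G', with a pyrimidine reference base.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or set(context + alt) - set("ACGT"):
        raise ValueError("expected a 3-base context and a 1-base alt over ACGT")
    if context[1] == alt:
        raise ValueError("alt base equals the reference base")
    # Purine reference (A or G): describe the change on the opposite strand.
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def mutation_spectrum(variants):
    """Count substitution classes over an iterable of (context, alt) pairs."""
    return Counter(classify_snv(context, alt) for context, alt in variants)
```

For example, `mutation_spectrum([("ACG", "T"), ("CGT", "A")])` counts both variants under the same class, `A[C>T]G`, because the second is the same change seen from the opposite strand. A spectrum of such counts per genome or sample is the kind of input that classification strategies for mutational processes typically work from.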
While the project has great potential for biological interpretation, the analyses themselves do not depend on prior biological knowledge, but will be based on algorithms using graph theory and machine learning.
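The project description does not name a specific graph algorithm. As one illustration of how graph theory enters genome assembly, the sketch below builds a de Bruijn graph, a structure widely used by short-read assemblers, in which nodes are (k-1)-mers and each k-mer observed in a read contributes an edge. The function name and the toy reads are hypothetical, and real assemblers add error correction, graph simplification and much more.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph from reads (illustrative helper).

    Each k-mer in a read adds an edge from its (k-1)-base prefix
    to its (k-1)-base suffix. Repeated k-mers yield repeated edges.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; walking the edges from "AC"
# spells out the underlying sequence ACGTA.
toy_graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
```

Assembly then amounts to finding paths through such a graph, which is why the work can be approached algorithmically without prior biological knowledge.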
Prior knowledge of scripting in Python and R is required. No prior knowledge of microbiology or cancer is needed.
Supervisors: Einar Birkeland, Trine B Rounge, Torbjørn Rognes