Developing a pipeline for deep analyses of cancer-causing bacterial genes from metagenome data
Colorectal cancer (CRC) is the third most common cancer in men and the second in women worldwide. Screening may detect cancer at an early stage, and removal of precursor lesions reduces mortality. Available screening tools are cumbersome, often invasive and have limited accuracy.
The presence of certain gut microbes and certain gut microbial genes is strongly associated with CRC. Bacteria such as Enterococcus faecalis may produce superoxide, which may damage DNA in proliferating cells, leading to chromosomal instability. The bile-tolerant microbes Bilophila and Desulfovibrio, proinflammatory bacteria in the genus Mogibacterium, and multiple Bacteroidetes species are more abundant in individuals with pre-cancerous lesions than in healthy controls. Such microbes and genes therefore represent potential biomarkers for CRC, and stool sampling followed by detection of potentially cancer-causing bacteria or genes may be an important, non-invasive tool in CRC screening programs.
In an ongoing study (CRC-biome), we produce metagenomes, i.e. sequencing data from all the organisms in a sample, for 2000 gut samples. The samples represent healthy individuals (controls) and individuals with pre-cancerous lesions or cancer (cases), each with on average three gigabases of sequence data. In total, the size of the dataset to be analysed is six terabases.
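The total follows directly from the figures above. As a quick check, using only the numbers stated in this description (the variable names are chosen for illustration):

```python
# Figures taken from the project description.
SAMPLES = 2000
MEAN_GIGABASES_PER_SAMPLE = 3

# 1 terabase = 1000 gigabases.
total_gigabases = SAMPLES * MEAN_GIGABASES_PER_SAMPLE
total_terabases = total_gigabases / 1000

print(f"{total_gigabases} Gb = {total_terabases:g} Tb")
```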
The challenge is to identify potential cancer-causing genes in large metagenomic datasets and to determine their variations, features and bacteria of origin. Highly efficient algorithms are required to analyse data of this size. The final challenge is to visualize the findings for a multidisciplinary audience.
The purpose of this master’s project is to identify the optimal computational approach for this challenge and to combine tools in a robust pipeline that reproducibly analyses large metagenomic datasets. This requires dynamic reference databases containing known and newly discovered genes, as well as clustering, alignment and assembly algorithms that can handle large datasets. Evolutionary analyses may be incorporated. Tools such as Snakemake may ensure reproducible analyses, and results can be visualized in R using the ggplot2 and/or Shiny packages.
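To give a flavour of what screening reads against a reference database of known genes involves, the following is a minimal sketch and not the project's method. It assumes a toy reference set and uses exact shared k-mers as a crude stand-in for a real alignment tool. The gene names, sequences, function names and the min_shared threshold are all hypothetical, and reverse-complement matches are ignored for brevity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(reference_genes, k):
    """Map each k-mer to the names of the reference genes that contain it."""
    index = defaultdict(set)
    for name, seq in reference_genes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def screen_read(read, index, k, min_shared=2):
    """Return reference genes sharing at least min_shared k-mers with the read."""
    hits = defaultdict(int)
    for kmer in kmers(read, k):
        # .get avoids adding empty entries to the defaultdict index.
        for name in index.get(kmer, ()):
            hits[name] += 1
    return {name for name, count in hits.items() if count >= min_shared}


# Hypothetical toy reference database; real entries would be curated gene sequences.
reference_genes = {
    "gene_A": "ATGGCGTACGTTAGC",
    "gene_B": "ATGCCCTTTGGGAAA",
}

index = build_index(reference_genes, k=5)
candidates = screen_read("GCGTACGTTA", index, k=5)
print(candidates)
```

In a real pipeline this step would be handled by dedicated alignment or clustering software, with each step wrapped in a workflow rule so that the analysis can be rerun reproducibly. The sketch only illustrates the idea of indexing a reference once and then screening many reads against it.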
Prior knowledge of scripting in Python or R is required. No prior knowledge of microbiology or cancer is needed.
Supervisors: Trine B Rounge, Torbjørn Rognes and Even Sannes Riiser