Integrating different types of sequencing data with applications in cancer medicine

Cancer is a disease caused by changes in our genetic code: our DNA, which is composed of chained sequences of four basic "letters", or base pairs. High-throughput sequencing is a technology that reads the sequences of base pairs from millions of DNA or RNA chains in parallel from biological samples. It has revolutionized our understanding of the human genome and of the mechanisms of cancer. Currently, most of the world's sequencing data is produced by sequencing instruments from the company Illumina. Their technology uses a chemical approach that generates fairly short sequences (reads), on the order of 50-500 base pairs per read. Although the newest Illumina instruments generate high-quality sequence information from billions of individual base pairs in a single run (and can easily sequence all 3 billion base pairs in the human genome), the short length of these reads limits certain applications. Specifically, applications that aim to understand how long chains of DNA or RNA are connected are challenging with short reads. There are, however, other sequencing technologies that can read longer sequences, developed by Pacific Biosciences (PacBio) and Oxford Nanopore. These technologies can be better suited to applications that need information on longer chains of sequence.

Our group works on genome biology, with a specific focus on prostate cancer. Prostate cancer is often characterized by complex rearrangements of chains of DNA. It is also the solid cancer type with the highest frequency of a specific type of DNA change that leads to so-called fusion genes. A fusion gene can have new functions in the cell and be cancer promoting. In addition, fusion genes are often highly cancer specific, in that their sequences do not exist in normal cells. They are therefore ideal targets for the development of biological markers that can help improve the clinical management of these patients. Fusion genes encoded at the DNA level are expressed at the RNA level as fusion gene transcripts, which can be thousands of base pairs long. It is challenging to fully characterize these chains of RNA with short reads, and technologies offering longer reads may be better suited for this type of application.
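The difference between short and long reads for fusion transcripts can be illustrated with a small back-of-the-envelope calculation. The sketch below is only a toy model and not part of the project's methods: it assumes error-free reads of fixed length whose start positions are uniformly distributed along a single transcript, and the function names, the 3000-base transcript, the junction position and the 10-base anchor are all hypothetical values chosen for illustration.

```python
def covers_full_transcript(transcript_len: int, read_len: int) -> bool:
    """True if one read is long enough to span the whole transcript."""
    return read_len >= transcript_len


def junction_spanning_fraction(transcript_len: int, junction_pos: int,
                               read_len: int, min_anchor: int = 10) -> float:
    """Fraction of possible read start positions that cross the fusion junction.

    The junction lies between bases junction_pos - 1 and junction_pos
    (0-based). A read counts as spanning only if it has at least
    min_anchor bases on each side of the junction.
    """
    if not 0 < junction_pos < transcript_len:
        raise ValueError("junction_pos must lie inside the transcript")
    if read_len <= 0:
        raise ValueError("read_len must be positive")

    # A read cannot be longer than the molecule it was read from.
    effective_len = min(read_len, transcript_len)
    n_starts = transcript_len - effective_len + 1

    # A read starting at s covers bases s .. s + effective_len - 1.
    first_start = max(0, junction_pos + min_anchor - effective_len)
    last_start = min(n_starts - 1, junction_pos - min_anchor)
    n_spanning = max(0, last_start - first_start + 1)
    return n_spanning / n_starts


# Hypothetical fusion transcript: 3000 bases, junction after base 1200.
for read_len in (150, 500, 3000):
    frac = junction_spanning_fraction(3000, 1200, read_len)
    full = covers_full_transcript(3000, read_len)
    print(f"read length {read_len}: spanning fraction {frac:.3f}, "
          f"covers full transcript: {full}")
```

Under these assumptions, only a small share of short reads cross the junction at all, and none of them shows which sequences lie at both ends of the same transcript, whereas a read as long as the transcript captures the junction and its full context in one piece.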

We have so far produced data from different types of sequencing protocols and platforms from four prostate cancer patients, with the aim of providing a more detailed analysis of fusion gene transcripts in prostate cancer. Specifically, we have generated Illumina whole-transcriptome sequencing data (RNA-sequencing), Illumina whole-exome sequencing data (DNA-sequencing), Illumina targeted transcript sequencing data (using a protocol we have developed called RACE-sequencing) and corresponding PacBio "long-read" targeted transcript sequencing data from these patients. The aim of this study is to explore existing computational methods for integrating multi-level sequencing data, both short reads and long reads, and potentially to improve them or develop new ones. A further aim is to evaluate the utility of this type of data for detailed characterization of full-length fusion gene transcripts in prostate cancer.

The candidate is expected to spend time at Oslo University Hospital-Radiumhospitalet during the master project. The candidate is also expected to take a special curriculum introductory course on cancer biology early in the project (10 study points), in order to better view the bioinformatic analyses in a biological context. The main MSc supervisor will be Andreas Hoff, with co-supervision from Rolf Skotheim and Bjarne Johannessen. This master project is offered as a long master project (60 study points).

Published Nov. 9, 2018 11:27 - Last modified Nov. 9, 2018 11:51
