Detecting transcription regulatory networks in cancer
The discovery of cancer subtypes is key to adequately stratifying patients for optimal diagnosis and treatment. These subtypes have repeatedly been identified by clustering genes based on their expression data. As genes lying within the same cluster are likely to reflect the regulatory action of specific transcription factors (TFs), such clusters provide a valuable means to uncover the molecular basis of the disease and to reveal new biomarkers or therapeutic targets. In this project, we plan to use an unsupervised machine learning approach to automatically predict cancer subtypes from gene modules and to infer their regulating TFs.
As cancer is a disease of gene expression deregulation, we aim to better understand how genes are dysregulated in cancer at the transcriptional level, where regulatory proteins known as TFs control the expression of groups of genes. The underlying TF-target gene associations are known as transcriptional regulatory networks (TRNs). These networks will be derived from clusters of co-expressed (and likely co-regulated) genes in cancer patients. Finding these groups of co-expressed genes is challenging given the large number of genes and patients. Such data can be represented as a matrix, where the columns correspond to the patients (varying from tens to thousands), the rows to the genes (~18,000), and the values to the expression level of a gene in a given patient. Dimensionality reduction techniques may help address this problem by finding groups of related genes or groups of patients sharing particular features. Non-negative matrix factorization (NMF) is a machine learning approach for dimensionality reduction and unsupervised clustering. Specifically, the algorithm approximately factorizes a non-negative matrix M into the product of two lower-rank matrices containing only non-negative elements. Through this process, the columns of M are intrinsically clustered in an unsupervised manner.
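To make the factorization concrete, the following is a minimal sketch, not the tool to be developed in the project. It assumes the classical multiplicative update rules for the squared-error objective, a hypothetical helper named `nmf`, an invented toy matrix of 6 genes by 4 patients, and a rank of k = 2. The matrix M (genes x patients) is approximated as W x H, where W is genes x k and H is k x patients. Assigning each patient to the component with the largest weight in its column of H is one common way to read clusters off the factorization; reading gene modules off the rows of W in the same way is a further assumption made here for illustration.

```python
import numpy as np


def nmf(M, k, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix M (genes x patients) as W @ H.

    Uses multiplicative updates that minimize the squared reconstruction
    error while keeping every entry of W and H non-negative.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_patients = M.shape
    # Strictly positive random initialization (hypothetical choice).
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_patients)) + eps
    for _ in range(n_iter):
        # eps guards against division by zero.
        H *= (W.T @ M) / (W.T @ W @ H + eps)
        W *= (M @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Invented toy expression matrix: rows are genes, columns are patients.
# Genes 0-2 are expressed in patients 0-1, genes 3-5 in patients 2-3.
M = np.array([
    [5.0, 6.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 0.0],
    [6.0, 7.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 4.0],
    [0.0, 0.0, 5.0, 6.0],
    [0.0, 0.0, 4.0, 5.0],
])

W, H = nmf(M, k=2)

# Each patient (column of M) is assigned to its dominant component in H.
patient_clusters = H.argmax(axis=0)
# Each gene (row of M) is assigned to its dominant component in W.
gene_modules = W.argmax(axis=1)

print("patient clusters:", patient_clusters)
print("gene modules:", gene_modules)
print("relative error:", np.linalg.norm(M - W @ H) / np.linalg.norm(M))
```

On real expression data, the rank k is not known in advance and the result depends on the random initialization, so in practice several ranks and several restarts would typically be compared.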
The goal of this Master thesis is to detect TRNs automatically from cancer data in an unsupervised way. The NMF approach will be applied to find groups of co-expressed genes, and the resulting groups will be refined using bioinformatic methods to highlight the underlying regulatory TFs. As the most significant TRNs can be used as features to stratify cancer samples in large public cancer datasets, we will assess whether the predicted TRNs correlate with cancer molecular subtypes.
The selected candidate will develop a computational tool dedicated to this task and will apply it to publicly available data from The Cancer Genome Atlas (TCGA) and to in-house data at the Oslo University Hospital. A specific focus will be given to breast cancer, for which in-house data are available.
We have expertise in the in-depth analysis of high-throughput 'omics data from cancer patients, in transcriptional regulation, and in the use of the NMF approach. This will provide an optimal learning environment for the selected student. During the course of the project, the student will be exposed to machine learning (NMF) and to computational approaches for the management, analysis, and interpretation of large-scale, high-throughput sequencing data. We seek a highly motivated individual, preferably with programming skills and knowledge of computational tool development. Knowledge of statistical methods and/or a biological background is a plus. We are looking for applicants excited about combining life sciences and computation. The candidate will be co-supervised by Dr. Anthony Mathelier, Dr. Jaime Castro-Mondragon and Dr. Ole-Christian Lingjærde. The supervisors have strong expertise in computer science and biology and are affiliated with the Centre for Molecular Medicine Norway, the Oslo University Hospital and the University of Oslo. The student will collaborate with researchers at the Oslo University Hospital.