Develop improved segmentation algorithms
A normal human cell carries two copies of each chromosome. In a tumor cell, some genomic regions are typically present in more than two copies and others in fewer than two. If we plot the number of DNA copies as a function of genomic position for a tumor cell, we therefore obtain a piecewise constant curve with jumps rather than a flat line. Determining the break points of such a curve (i.e., the genomic positions where the copy number jumps from one value to another) is called copy number segmentation. It can be posed as a statistical estimation problem, since the data obtained from molecular analyses of tumor cells are noisy: the observed copy number values at two genomic loci will generally differ even if the true copy numbers are identical. The level of observation noise varies substantially with the technology used to obtain the measurements. In addition, most copy number analyses are based on pools of many cancer cells that may differ genetically, so the amplitude of an observed jump reflects the proportion of cells carrying that particular change.
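To make the estimation problem concrete, the following is a minimal sketch of one standard formulation: fit a piecewise constant curve by least squares, with a fixed penalty for every segment, and solve it exactly by dynamic programming. It is an illustration only and not the implementation in the copynumber package. The function name segment, the penalty value, and the data are assumptions made for this example.

```python
from itertools import accumulate


def segment(values, penalty):
    """Penalized least-squares fit of a piecewise constant curve.

    Returns the break points, i.e. the start index of every segment
    except the first. Illustrative sketch, quadratic in len(values).
    """
    n = len(values)
    # Prefix sums let us compute the squared error of any interval in O(1).
    s1 = [0.0] + list(accumulate(values))
    s2 = [0.0] + list(accumulate(v * v for v in values))

    def sse(i, j):
        # Sum of squared deviations from the mean over values[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j]: minimal cost of segmenting values[:j].
    # last[j]: start index of the final segment in that optimal solution.
    best = [0.0] + [float("inf")] * n
    last = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j], last[j] = cost, i

    # Walk back through the stored segment starts to recover the break points.
    breaks, j = [], last[n]
    while j > 0:
        breaks.append(j)
        j = last[j]
    return breaks[::-1]


# Hypothetical data: true copy numbers 2, 3, 1 plus a small deterministic
# perturbation standing in for measurement noise.
truth = [2] * 10 + [3] * 10 + [1] * 10
observed = [c + (0.1 if k % 2 == 0 else -0.1) for k, c in enumerate(truth)]
breaks = segment(observed, penalty=1.0)
print(breaks)  # [10, 20]
```

The penalty controls the trade-off described above: a larger value suppresses break points caused by noise, but also hides jumps of small amplitude, such as changes present in only a fraction of the pooled cells.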
Our group has previously developed efficient algorithms for copy number segmentation, in collaboration with researchers in Cambridge, UK, and at Oslo University Hospital. These algorithms, which are implemented in R and available as the Bioconductor package copynumber, are in common use.
The purpose of this master's thesis project is to (1) extend the segmentation algorithms to handle multiple tracks of copy number data simultaneously, i.e., to develop a "multivariate version" of the algorithms, and (2) develop better graphics tools for integrating the output of the segmentation algorithms with other available information (e.g., from public databases).
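As a rough illustration of what handling several tracks at once could mean, the sketch below assumes that all tracks are measured at the same genomic positions and share their break points, while each track keeps its own level within a segment. This is only one possible reading of a "multivariate version", not the method the project will arrive at. The name segment_joint, the penalty, and the data are hypothetical.

```python
from itertools import accumulate


def segment_joint(tracks, penalty):
    """Penalized least-squares segmentation with break points shared by all tracks.

    tracks is a list of equal-length lists of values. Illustrative sketch only.
    """
    n = len(tracks[0])
    if any(len(t) != n for t in tracks):
        raise ValueError("all tracks must have the same length")
    # Per-track prefix sums of the values and of their squares.
    sums = [[0.0] + list(accumulate(t)) for t in tracks]
    squares = [[0.0] + list(accumulate(v * v for v in t)) for t in tracks]

    def sse(i, j):
        # Squared error over positions i..j-1, summed over tracks;
        # each track is fitted with its own mean.
        total = 0.0
        for s1, s2 in zip(sums, squares):
            d = s1[j] - s1[i]
            total += (s2[j] - s2[i]) - d * d / (j - i)
        return total

    # Dynamic programming over the end position of the final segment.
    best = [0.0] + [float("inf")] * n
    last = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + sse(i, j) + penalty
            if cost < best[j]:
                best[j], last[j] = cost, i

    breaks, j = [], last[n]
    while j > 0:
        breaks.append(j)
        j = last[j]
    return breaks[::-1]


# Hypothetical data: two tracks with a weak jump at position 8,
# upward in one track and downward in the other.
noise = [0.05 if k % 2 == 0 else -0.05 for k in range(16)]
track_a = [(2.0 if k < 8 else 2.6) + noise[k] for k in range(16)]
track_b = [(3.0 if k < 8 else 2.4) + noise[k] for k in range(16)]

alone_a = segment_joint([track_a], penalty=2.0)
alone_b = segment_joint([track_b], penalty=2.0)
joint = segment_joint([track_a, track_b], penalty=2.0)
print(alone_a, alone_b, joint)  # [] [] [8]
```

In this constructed example the jump is too small to justify a break point in either track alone at the chosen penalty, but the combined evidence from both tracks is sufficient, which is the kind of gain a joint analysis is meant to provide.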
This project requires no prior knowledge of biology, but some knowledge of (or interest in) statistics or mathematics is beneficial.